Eric Coissac 0b3fcf3cf0 feat: add PersistentBitVec and upgrade PersistentCompactIntVec format

Introduces PersistentBitVec, a dense, memory-mapped bit vector optimized for bulk u64-word operations and SIMD acceleration, complete with bitwise operators and Jaccard/Hamming distance metrics. Upgrades PersistentCompactIntVec to a unified .pciv format using 64-bit indices and offsets, consolidating the binary layout and updating builder/reader lifecycles accordingly. Adds corresponding documentation, updates MkDocs navigation, and implements a comprehensive test suite for persistence round-trips, edge cases, and metric accuracy.

2026-05-14 09:01:36 +08:00

PersistentCompactIntVec

Purpose

PersistentCompactIntVec stores a dense array of non-negative integers indexed by MPHF (minimal perfect hash function) slot, where the vast majority of values are small (0–254) and large values are rare. It is designed for mmap-compatible random and sequential access with a minimal memory footprint and cache-friendly behaviour.

The design is motivated by count distributions observed in genomics data: about 99.9% of k-mer counts fit in a single byte, while overflow (count ≥ 255) affects ~0.07% of distinct k-mers, whose counts can nevertheless exceed 10⁶ (chloroplast, ribosomal repeats).

Design

Two-tier structure:

Primary array — [u8; n], stored at offset 24 in the PCIV file and mmap'd. Values 0–254 are stored directly. Value 255 is a sentinel meaning "look in overflow".
Overflow section — sorted list of (slot: u32, value: u32) pairs for all slots where the true value ≥ 255, with a sparse L1-fitting index for fast lookup.

primary[slot] < 255  →  return primary[slot]
primary[slot] == 255 →  binary search in overflow
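
The two-tier rule above can be sketched in Rust over plain slices. This is a minimal illustration, not the library's code: the function name `lookup` and the use of in-memory slices in place of the mmap'd sections are assumptions made for the example.

```rust
/// Sentinel byte in the primary array meaning "value lives in overflow".
const SENTINEL: u8 = 255;

/// Two-tier lookup over plain slices (hypothetical stand-ins for the
/// mmap'd primary and overflow sections). `overflow` must be sorted by slot.
fn lookup(primary: &[u8], overflow: &[(u32, u32)], slot: usize) -> u32 {
    let byte = primary[slot];
    if byte < SENTINEL {
        // Small value stored inline in the primary array.
        return byte as u32;
    }
    // Sentinel: the true value (>= 255) is in the sorted overflow list.
    let pos = overflow
        .binary_search_by_key(&(slot as u32), |&(s, _)| s)
        .expect("sentinel byte without a matching overflow entry");
    overflow[pos].1
}
```

Note that a value of exactly 255 cannot be stored inline, because the byte 255 is reserved as the sentinel; it goes to the overflow list like any larger value.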

Single-file format

Everything is stored in a single .pciv file. Write order matches computation order: the header placeholder is written first, then primary (known at new()), then overflow data and index (known at close()), then the header is overwritten at offset 0.

offset 0:
  magic:      [u8; 4]  = b"PCIV"
  n:          u64       number of slots
  n_overflow: u32       number of overflow entries
  step:       u32       sparse index step (0 = no index)
  n_index:    u32       number of index entries

offset 24:
  primary:    [u8; n]   one byte per slot, 255 = overflow sentinel

offset 24 + n:
  data:       [(slot: u32, value: u32); n_overflow]   sorted by slot

offset 24 + n + n_overflow × 8:
  index:      [(slot: u32, pos: u32); n_index]         sparse index

The index entries point into data: index[i] = (slot of data[i×step], i×step).
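
As an illustration of the layout, the 24-byte header and the section offsets could be encoded as follows. The byte order is not stated above; little-endian is an assumption here, and the `Header` struct and its method names are hypothetical.

```rust
const HEADER_SIZE: usize = 24;

/// Hypothetical in-memory view of the PCIV header fields.
#[derive(Debug, PartialEq)]
struct Header {
    n: u64,
    n_overflow: u32,
    step: u32,
    n_index: u32,
}

// Read fixed-width integers at a byte offset (little-endian is assumed).
fn u32_at(buf: &[u8], at: usize) -> u32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&buf[at..at + 4]);
    u32::from_le_bytes(b)
}

fn u64_at(buf: &[u8], at: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[at..at + 8]);
    u64::from_le_bytes(b)
}

impl Header {
    /// magic (4) + n (8) + n_overflow (4) + step (4) + n_index (4) = 24 bytes.
    fn to_bytes(&self) -> [u8; HEADER_SIZE] {
        let mut buf = [0u8; HEADER_SIZE];
        buf[0..4].copy_from_slice(b"PCIV");
        buf[4..12].copy_from_slice(&self.n.to_le_bytes());
        buf[12..16].copy_from_slice(&self.n_overflow.to_le_bytes());
        buf[16..20].copy_from_slice(&self.step.to_le_bytes());
        buf[20..24].copy_from_slice(&self.n_index.to_le_bytes());
        buf
    }

    /// Returns None when the magic bytes do not match.
    fn from_bytes(buf: &[u8; HEADER_SIZE]) -> Option<Header> {
        if &buf[0..4] != b"PCIV" {
            return None;
        }
        Some(Header {
            n: u64_at(buf, 4),
            n_overflow: u32_at(buf, 12),
            step: u32_at(buf, 16),
            n_index: u32_at(buf, 20),
        })
    }

    /// Overflow data starts right after the primary array.
    fn data_offset(&self) -> u64 {
        HEADER_SIZE as u64 + self.n
    }

    /// The sparse index follows the overflow data (8 bytes per entry).
    fn index_offset(&self) -> u64 {
        self.data_offset() + self.n_overflow as u64 * 8
    }
}
```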

Lifecycle

Builder (`PersistentCompactIntVecBuilder`)

Used during construction. The primary section is mmap'd immediately at construction time (both for new and build_from), so the file exists and is addressable from the start. The overflow is held in a HashMap<u64, u32> in RAM.

struct PersistentCompactIntVecBuilder {
    path:     PathBuf,
    mmap:     MmapMut,           // primary section live in the file from the start
    n:        usize,
    overflow: HashMap<u64, u32>, // values ≥ 255
}

`new(n: usize, path: &Path) -> io::Result<Self>`

Creates the file, pre-allocates HEADER_SIZE + n zero bytes, mmaps it. The primary is zero-initialised (all slots = 0). Returns immediately ready for set / get.

`build_from(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>`

Copies the source PCIV file to path (OS-level copy — no per-slot iteration), mmaps the copy, then loads the overflow section into a HashMap. Initialisation cost: O(file copy) + O(n_overflow), not O(n).

At close(), the primary section is not rewritten: it is already in the file via mmap. Only the overflow data, the sparse index, and the header are updated.

`set(slot: u64, value: u32)` / `get(slot: u64) -> u32`

Direct mmap byte access for the primary; HashMap for the overflow. Both O(1). Mutations can move a slot between tiers freely (downward mutation removes the HashMap entry; upward mutation adds it).
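
The tier transitions can be sketched with an in-memory stand-in for the builder. This is an illustrative model only: `BuilderSketch` is a hypothetical name, and a `Vec<u8>` replaces the mmap'd primary section.

```rust
use std::collections::HashMap;

const SENTINEL: u8 = 255;

/// Hypothetical in-memory model of the builder's two tiers.
struct BuilderSketch {
    primary: Vec<u8>,            // stand-in for the mmap'd primary section
    overflow: HashMap<u64, u32>, // values >= 255
}

impl BuilderSketch {
    fn new(n: usize) -> Self {
        BuilderSketch { primary: vec![0; n], overflow: HashMap::new() }
    }

    fn set(&mut self, slot: u64, value: u32) {
        if value < SENTINEL as u32 {
            // Downward (or small) mutation: store inline, drop any overflow entry.
            self.primary[slot as usize] = value as u8;
            self.overflow.remove(&slot);
        } else {
            // Upward mutation: mark the slot and keep the true value in the map.
            self.primary[slot as usize] = SENTINEL;
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: u64) -> u32 {
        match self.primary[slot as usize] {
            SENTINEL => self.overflow[&slot],
            byte => byte as u32,
        }
    }
}
```

Removing the map entry on a downward mutation keeps the overflow map limited to slots whose primary byte is the sentinel, so no stale entries are written later.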

Element-wise operations — `min`, `max`, `add`, `diff`

Each takes a &PersistentCompactIntVec of equal length and updates self in place via set:

builder.min(&other);   // self[i] = min(self[i], other[i])
builder.max(&other);   // self[i] = max(self[i], other[i])
builder.add(&other);   // self[i] = self[i].checked_add(other[i])  (panics on u32 overflow)
builder.diff(&other);  // self[i] = self[i].saturating_sub(other[i])

All iterate other with other.iter() (merge-scan, O(n_other)).
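
The overflow behaviour of `add` and `diff` can be shown on plain slices. The free functions below are hypothetical stand-ins for the builder methods; they mirror only the arithmetic, not the mmap or tier handling.

```rust
/// Element-wise add that panics on u32 overflow, like the `add` semantics above.
fn add_in_place(dst: &mut [u32], other: &[u32]) {
    assert_eq!(dst.len(), other.len(), "vectors must have equal length");
    for (d, &o) in dst.iter_mut().zip(other) {
        *d = d.checked_add(o).expect("u32 overflow in add");
    }
}

/// Element-wise difference clamped at zero, like the `diff` semantics above.
fn diff_in_place(dst: &mut [u32], other: &[u32]) {
    assert_eq!(dst.len(), other.len(), "vectors must have equal length");
    for (d, &o) in dst.iter_mut().zip(other) {
        *d = d.saturating_sub(o);
    }
}
```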

`close(self) -> io::Result<()>`

1. Flush and drop the mmap (primary changes are now on disk).
2. Sort the overflow HashMap into Vec<(u32, u32)>.
3. Truncate the file to HEADER_SIZE + n (removes old data+index if build_from was used).
4. Append sorted overflow data, then sparse index.
5. Seek to offset 0, overwrite the header with final values.
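
The sorting and index-building steps can be sketched as below. The function names are hypothetical, and two details are assumptions: slots are checked to fit in `u32` before narrowing, and the index samples every `step`-th entry including a possibly partial final block (the real `n_index` may be computed differently).

```rust
use std::collections::HashMap;

/// Turn the in-RAM overflow map into the on-disk order: pairs sorted by slot.
fn sorted_overflow(overflow: &HashMap<u64, u32>) -> Vec<(u32, u32)> {
    let mut data: Vec<(u32, u32)> = overflow
        .iter()
        .map(|(&slot, &value)| {
            // Assumption: slots must fit the 32-bit on-disk field.
            assert!(slot <= u32::MAX as u64, "slot does not fit in u32");
            (slot as u32, value)
        })
        .collect();
    data.sort_unstable_by_key(|&(slot, _)| slot);
    data
}

/// Build the sparse index: entry i is (slot of data[i * step], i * step).
/// A step of 0 means "no index".
fn sparse_index(data: &[(u32, u32)], step: usize) -> Vec<(u32, u32)> {
    if step == 0 {
        return Vec::new();
    }
    data.iter()
        .enumerate()
        .step_by(step)
        .map(|(pos, &(slot, _))| (slot, pos as u32))
        .collect()
}
```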

Reader (`PersistentCompactIntVec`)

Used at query time. The whole file is mmap'd; only the sparse index is copied into a Vec at open time (≤ 32 KB, L1-resident).

struct PersistentCompactIntVec {
    mmap:           Mmap,
    n:              usize,
    n_overflow:     usize,
    step:           u32,
    index:          Vec<(u32, u32)>,  // L1-resident
    primary_offset: usize,            // = 24 (HEADER_SIZE)
    data_offset:    usize,            // = 24 + n
    path:           PathBuf,
}

`open(path: &Path) -> io::Result<Self>`

Mmaps the file, parses the 24-byte header, copies the sparse index entries into a Vec. The primary and data sections stay mmap'd.

`get(slot: u64) -> u32` — random access

primary[slot] < 255  →  return it directly

step == 0:
    binary_search(data[0..n_overflow], slot)

step > 0:
    i   = upper_bound(index[..].slot, slot) − 1     // in L1-resident Vec
    end = index[i+1].pos if i+1 < n_index, else n_overflow
    binary_search(data[index[i].pos .. end], slot)
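
A minimal Rust sketch of the overflow lookup, covering both the indexed and the non-indexed case, might look like this. The function name `find_overflow` is hypothetical, plain slices stand in for the mmap'd data and the in-RAM index, and an empty index slice plays the role of `step == 0`.

```rust
/// Look up `slot` in the sorted overflow data, narrowing the search range
/// with the sparse index when one is present. Returns None if absent.
fn find_overflow(data: &[(u32, u32)], index: &[(u32, u32)], slot: u32) -> Option<u32> {
    let (lo, hi) = if index.is_empty() {
        // No index: search the whole overflow section.
        (0, data.len())
    } else {
        // upper_bound: number of index entries whose slot is <= the query.
        let ub = index.partition_point(|&(s, _)| s <= slot);
        if ub == 0 {
            return None; // slot precedes every overflow entry
        }
        let lo = index[ub - 1].1 as usize;
        // The last block has no successor entry, so it ends at data.len().
        let hi = if ub < index.len() { index[ub].1 as usize } else { data.len() };
        (lo, hi)
    };
    data[lo..hi]
        .binary_search_by_key(&slot, |&(s, _)| s)
        .ok()
        .map(|p| data[lo + p].1)
}
```

The index search touches only the small in-RAM vector; the second binary search is confined to one block of the data section.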

`iter() -> Iter<'_>` — sequential scan, O(n)

Merge-scan: reads primary bytes in order; on sentinel 255, advances a sequential pointer into the sorted data section rather than doing a binary search. This gives O(n + n_overflow) with no random access into the data section.

Iter implements ExactSizeIterator. &PersistentCompactIntVec implements IntoIterator.
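
The merge-scan can be sketched as an iterator over two slices. `MergeScan` and `merge_scan` are hypothetical names for this example, with slices standing in for the mmap'd sections.

```rust
/// Sequential scan: walks the primary bytes and, on each sentinel, takes the
/// next entry from the sorted overflow data instead of searching for it.
struct MergeScan<'a> {
    primary: std::slice::Iter<'a, u8>,
    overflow: std::slice::Iter<'a, (u32, u32)>,
}

fn merge_scan<'a>(primary: &'a [u8], overflow: &'a [(u32, u32)]) -> MergeScan<'a> {
    MergeScan { primary: primary.iter(), overflow: overflow.iter() }
}

impl<'a> Iterator for MergeScan<'a> {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        let byte = *self.primary.next()?;
        if byte < 255 {
            return Some(byte as u32);
        }
        // Sentinels appear in slot order, so the next overflow entry is ours.
        let &(_, value) = self
            .overflow
            .next()
            .expect("sentinel byte without a matching overflow entry");
        Some(value)
    }

    // One item per primary byte, so the remaining length is known exactly.
    fn size_hint(&self) -> (usize, Option<usize>) {
        self.primary.size_hint()
    }
}

impl<'a> ExactSizeIterator for MergeScan<'a> {}
```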

Aggregate

fn sum(&self) -> u64   // Σ self[i] as u64, via iter()

Distance methods

All take &other of equal length, iterate both with zip(self.iter(), other.iter()), and return f64.

| Method | Formula |
|---|---|
| `bray_dist` | `1 − 2·Σmin(aᵢ,bᵢ) / (Σaᵢ + Σbᵢ)` |
| `relfreq_bray_dist` | Bray-Curtis on relative frequencies: `1 − Σmin(pᵢ,qᵢ)` where `pᵢ = aᵢ/Σa` and `qᵢ = bᵢ/Σb` |
| `euclidean_dist` | `√Σ(aᵢ − bᵢ)²` |
| `relfreq_euclidean_dist` | Euclidean on relative frequencies: `√Σ(pᵢ − qᵢ)²` |
| `hellinger_euclidean_dist` | `√Σ(√pᵢ − √qᵢ)²`, i.e. Euclidean on sqrt(relfreq) |
| `hellinger_dist` | `hellinger_euclidean_dist / √2`, the standard Hellinger distance ∈ [0, 1] |
| `threshold_jaccard_dist(&other, threshold: u32)` | `1 − #(A ∩ B) / #(A ∪ B)` where `A = {i : aᵢ ≥ threshold}` and `B = {i : bᵢ ≥ threshold}` |
| `jaccard_dist` | `threshold_jaccard_dist(&other, 1)` |

Edge cases (both vectors all-zero, or union empty for Jaccard): distance = 0.0.
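
Two of these distances are sketched below on plain `u32` slices, including the edge cases. These free functions are illustrative stand-ins for the methods. For the thresholded Jaccard distance, treating a slot as present when its value is `>= threshold` is an assumption, chosen so that a threshold of 1 reduces to ordinary presence/absence.

```rust
/// Bray-Curtis distance: 1 - 2 * sum(min(a, b)) / (sum(a) + sum(b)).
fn bray_dist(a: &[u32], b: &[u32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let (mut sum_min, mut sum_a, mut sum_b) = (0u64, 0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        sum_min += x.min(y) as u64;
        sum_a += x as u64;
        sum_b += y as u64;
    }
    let total = sum_a + sum_b;
    if total == 0 {
        return 0.0; // both vectors all-zero
    }
    1.0 - 2.0 * sum_min as f64 / total as f64
}

/// Jaccard distance on presence sets; presence means value >= threshold
/// (assumption, see the note above).
fn threshold_jaccard_dist(a: &[u32], b: &[u32], threshold: u32) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let (mut n_inter, mut n_union) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        let (in_a, in_b) = (x >= threshold, y >= threshold);
        if in_a && in_b {
            n_inter += 1;
        }
        if in_a || in_b {
            n_union += 1;
        }
    }
    if n_union == 0 {
        return 0.0; // empty union
    }
    1.0 - n_inter as f64 / n_union as f64
}
```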

Step computation

Chosen at close() once n_overflow is known:

L1_entries = 32 768 / 8 = 4096

step = 0                               if n_overflow ≤ 4096
step = ⌈n_overflow / 4096⌉            otherwise

For the Betula nana reference (359 044 overflows): step = 88, n_index = 4 080 entries = 31.9 KB.
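
The step rule translates directly into integer arithmetic. The constant and function names below are hypothetical; the 32 768-byte budget and 8-byte entry size come from the formula above.

```rust
const L1_BYTES: usize = 32_768; // index budget
const ENTRY_BYTES: usize = 8;   // one (u32, u32) index entry
const L1_ENTRIES: usize = L1_BYTES / ENTRY_BYTES; // 4096

/// Sparse-index step for a given number of overflow entries (0 = no index).
fn index_step(n_overflow: usize) -> usize {
    if n_overflow <= L1_ENTRIES {
        0
    } else {
        // Ceiling division without floating point.
        (n_overflow + L1_ENTRIES - 1) / L1_ENTRIES
    }
}
```

For example, `index_step(359_044)` returns 88, which matches the Betula nana figure quoted above.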

Complexity

| Operation | Time | Notes |
|---|---|---|
| `set` / `get` (builder) | O(1) | mmap byte + HashMap |
| `get` (reader, no overflow) | O(1) | single mmap byte |
| `get` (reader, with index) | O(log step) | ≤ 2 memory regions |
| `get` (reader, no index) | O(log n_overflow) | n_overflow ≤ 4096, so data is ≤ 32 KB |
| `iter()` full scan | O(n + n_overflow) | merge-scan, no binary search |
| `sum`, distances | O(n) | via `iter()` / `zip(iter(), iter())` |
| `min` / `max` / `add` / `diff` | O(n) | via `other.iter()` + builder `set` |
| `close` | O(n_overflow log n_overflow) | sort + sequential write |
| `open` | O(n_index) | index copy into Vec |
| `build_from` | O(file_size) + O(n_overflow) | OS copy + HashMap load |
