# obicompactvec — Complete Reference

## Module structure

```
src/obicompactvec/src/
  lib.rs          public re-exports
  views.rs        BitSliceView<'a>, IntSliceView<'a> — zero-copy read views
  traits.rs       ColumnWeights, CountPartials, BitPartials (matrix aggregation)
  bitvec.rs       PersistentBitVec, PersistentBitVecBuilder, BitIter
  reader.rs       PersistentCompactIntVec (read-only)
  builder.rs      PersistentCompactIntVecBuilder (read-write)
  tempintvec.rs   TempCompactIntVec, TempCompactIntVecBuilder (temp-file-backed)
  tempbitvec.rs   TempBitVec, TempBitVecBuilder (temp-file-backed)
  bitmatrix.rs    PersistentBitMatrix, PersistentBitMatrixBuilder
  intmatrix.rs    PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder
  colgroup.rs     ColGroup, MatrixGroupOps trait
  format.rs       file format constants, encode/decode helpers
  layer_meta.rs   LayerMeta (column metadata)
  meta.rs         matrix metadata
```

```mermaid
graph TD
    views --> bitvec
    views --> builder
    views --> tempbitvec
    views --> tempintvec
    views --> bitmatrix
    views --> intmatrix
    format --> reader
    format --> builder
    reader --> intmatrix
    reader --> tempintvec
    builder --> intmatrix
    builder --> tempintvec
    bitvec --> tempbitvec
    bitvec --> bitmatrix
    tempintvec --> intmatrix
    tempintvec --> bitmatrix
    tempbitvec --> intmatrix
    tempbitvec --> bitmatrix
    colgroup --> intmatrix
    colgroup --> bitmatrix
    layer_meta --> bitmatrix
    layer_meta --> intmatrix
    meta --> bitmatrix
    meta --> intmatrix
```

---

## Compact int encoding

All integer vectors use the same two-tier encoding regardless of storage backend.

**Primary array** — one `u8` per slot:

- Values **0–254** are stored directly. No overhead.
- Value **255 is a sentinel**: the slot's actual value is ≥ 255 and lives in the overflow store.

**Overflow store** — maps slot index to a `u32` value ≥ 255:

- In `PersistentCompactIntVecBuilder`: a `HashMap` in RAM.
- In `PersistentCompactIntVec` (reader): a sorted `[(slot: u64, value: u32)]` array in the mmap, with a sparse L1-resident index for binary search.
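As a rough illustration of the two-tier scheme, the sketch below models the builder side with a `Vec<u8>` primary array and a `HashMap` overflow store. The `CompactInts` type and its methods are hypothetical stand-ins written for this example, not the crate's API.

```rust
use std::collections::HashMap;

/// Byte value that marks "look in the overflow store".
const SENTINEL: u8 = 255;

/// Hypothetical in-memory model of the two-tier encoding.
struct CompactInts {
    primary: Vec<u8>,
    overflow: HashMap<usize, u32>,
}

impl CompactInts {
    fn new(n: usize) -> Self {
        CompactInts { primary: vec![0; n], overflow: HashMap::new() }
    }

    /// Values 0..=254 go straight into the byte; anything larger
    /// writes the sentinel and records the real value in the overflow map.
    fn set(&mut self, slot: usize, value: u32) {
        if value < u32::from(SENTINEL) {
            self.primary[slot] = value as u8;
            self.overflow.remove(&slot);
        } else {
            self.primary[slot] = SENTINEL;
            self.overflow.insert(slot, value);
        }
    }

    /// Reads the byte, falling back to the overflow map on the sentinel.
    fn get(&self, slot: usize) -> u32 {
        match self.primary[slot] {
            SENTINEL => self.overflow[&slot],
            byte => u32::from(byte),
        }
    }
}
```

Note that the value 255 itself is stored through the overflow path, because the byte 255 is reserved as the sentinel.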
```mermaid
flowchart LR
    slot --> P["primary[slot]: u8"]
    P -->|"< 255"| V["value = byte (0–254)"]
    P -->|"= 255 sentinel"| OV["overflow store"]
    OV -->|"Builder"| HM["HashMap<usize, u32>\nin RAM"]
    OV -->|"PersistentCompactIntVec"| SA["sorted [(slot,value)] in mmap\n+ sparse L1 index"]
```

**Key property — sentinel 255 = +∞ on `u8`:**

- `min(a, 255) = a` for all `a ≤ 254` → correct when only one side is overflow
- `max(a, 255) = 255` → correct sentinel when either side is overflow
- Only the **both-overflow** case requires reading actual values from the overflow store.

In practice, k (overflow count) ≪ n (total slots). Observed genomic data: ~0.07% of kmer slots are in overflow.

---

## View types

The previous trait hierarchy (`BitSlice`, `BitSliceMut`, `IntSlice`, `IntSliceMut`) has been replaced by two concrete zero-copy view structs with inherent methods. Views are **`Copy`** — passing them is free. All read operations live on these two types.

### `BitSliceView<'a>`

```rust
#[derive(Clone, Copy)]
pub struct BitSliceView<'a> {
    pub(crate) words: &'a [u64],
    pub(crate) n: usize,
}
```

Bit `i` is at `words[i >> 6]` bit `i & 63` (LSB-first). Padding bits in the last word are zero.

| Method | Cost |
|---|---|
| `len()`, `is_empty()` | O(1) |
| `get(slot)` | O(1) |
| `count_ones()` | POPCNT per word, O(n/64) |
| `count_zeros()` | `n − count_ones()`, O(n/64) |
| `iter() -> BitSliceIter<'a>` | O(1) setup, O(n) iteration |
| `partial_jaccard_dist(other: BitSliceView)` | `(a&b).popcount`, `(a\|b).popcount` per word, O(n/64) |
| `jaccard_dist(other: BitSliceView)` | from partial, O(n/64) |
| `hamming_dist(other: BitSliceView)` | `(a^b).popcount` per word, O(n/64) |

`BitSliceIter<'a>`: word-level scan; one word per 64 iterations.
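The bit addressing and the word-level distances listed above can be sketched on plain `&[u64]` slices. The free functions below are hypothetical illustrations, not the crate's methods, and returning 0.0 for an empty union is an assumption made for this example.

```rust
/// Read bit `i` from an LSB-first packed word array.
fn get_bit(words: &[u64], i: usize) -> bool {
    (words[i >> 6] >> (i & 63)) & 1 == 1
}

/// Returns (intersection, union) popcounts, one AND and one OR per word.
fn partial_jaccard(a: &[u64], b: &[u64]) -> (u64, u64) {
    a.iter().zip(b).fold((0, 0), |(inter, union), (&x, &y)| {
        (
            inter + u64::from((x & y).count_ones()),
            union + u64::from((x | y).count_ones()),
        )
    })
}

/// Jaccard distance derived from the partial counts.
/// Assumption for this sketch: two empty sets are at distance 0.
fn jaccard_dist(a: &[u64], b: &[u64]) -> f64 {
    let (inter, union) = partial_jaccard(a, b);
    if union == 0 {
        0.0
    } else {
        1.0 - inter as f64 / union as f64
    }
}

/// Hamming distance: popcount of the XOR, word by word.
fn hamming_dist(a: &[u64], b: &[u64]) -> u64 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| u64::from((x ^ y).count_ones()))
        .sum()
}
```

Because padding bits in the last word are zero on both sides, they contribute nothing to the AND, OR, or XOR popcounts, so no tail masking is needed for these read-only operations.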
### `IntSliceView<'a>`

```rust
#[derive(Clone, Copy)]
pub struct IntSliceView<'a> {
    pub(crate) primary: &'a [u8],
    pub(crate) overflow_raw: &'a [u8],   // sorted [(slot:u64, value:u32)] entries
    pub(crate) n_overflow: usize,
    pub(crate) n: usize,
}
```

`overflow_raw` contains `n_overflow` entries of `OVERFLOW_ENTRY_SIZE` bytes each, sorted by slot. The sort invariant is established at `close()`/`freeze()` time.

| Method | Cost |
|---|---|
| `len()`, `is_empty()` | O(1) |
| `primary_bytes()` | O(1) |
| `overflow_entries() -> impl Iterator<(usize,u32)>` | O(n_overflow) iteration |
| `get(slot)` | O(1) primary; binary search O(log k) for overflow slots |
| `iter() -> IntSliceViewIter<'a>` | merge scan, O(n + k) |
| `sum()` | byte scan + overflow, O(n + k) |
| `count_nonzero()` | byte scan, O(n) |
| Distance methods (`bray_dist`, `euclidean_dist`, `jaccard_dist`, …) | O(n + k) |

`IntSliceViewIter<'a>`: merge scan using `overflow_pos` index. Requires sorted overflow — guaranteed by the construction lifecycle.

**Builder `view()` vs reader `view()`:** `PersistentCompactIntVecBuilder` stores overflow as an unsorted `HashMap`, not raw bytes. Its `view()` returns an `IntSliceView` with `overflow_raw = &[]` and `n_overflow = 0`. This is intentional — the view is primarily useful after `freeze()`. During building, callers that need overflow use `overflow_entries()` directly.
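To make the two access patterns concrete, here is a minimal sketch of a point lookup and a merge scan. It uses a typed `&[(u64, u32)]` slice as a stand-in for the raw `overflow_raw` bytes, since the on-disk entry layout is not described here; the function names are hypothetical.

```rust
/// Point lookup: direct byte, or binary search in the slot-sorted overflow.
fn get(primary: &[u8], overflow: &[(u64, u32)], slot: usize) -> u32 {
    match primary[slot] {
        255 => {
            let i = overflow
                .binary_search_by_key(&(slot as u64), |&(s, _)| s)
                .expect("sentinel byte without an overflow entry");
            overflow[i].1
        }
        byte => u32::from(byte),
    }
}

/// Merge scan: walk the primary bytes once and consume overflow entries
/// in order. This relies on the overflow being sorted by slot.
fn merge_scan(primary: &[u8], overflow: &[(u64, u32)]) -> Vec<u32> {
    let mut overflow_pos = 0;
    primary
        .iter()
        .map(|&byte| {
            if byte == 255 {
                let value = overflow[overflow_pos].1;
                overflow_pos += 1;
                value
            } else {
                u32::from(byte)
            }
        })
        .collect()
}
```

The merge scan never searches: each sentinel byte simply takes the next overflow entry, which is why the sort invariant matters for iteration but the `HashMap`-backed builder cannot offer the same view.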
---

## Concrete types

```mermaid
classDiagram
    class BitSliceView {
        +words: &[u64]
        +n: usize
        +get(slot) bool
        +count_ones() u64
        +iter() BitSliceIter
        +jaccard_dist/hamming_dist(other: BitSliceView)
    }
    class IntSliceView {
        +primary: &[u8]
        +overflow_raw: &[u8]
        +n_overflow: usize
        +n: usize
        +get(slot) u32
        +iter() IntSliceViewIter
        +overflow_entries() Iterator
        +bray_dist/euclidean_dist/…(other: IntSliceView)
    }
    class PersistentBitVec {
        -mmap: Mmap
        -n: usize
        +view() BitSliceView
        +get(slot) bool
        +count_ones/zeros() u64
        +iter() BitIter
        +partial_jaccard_dist(&Self) (u64,u64)
        +jaccard_dist/hamming_dist(&Self) …
    }
    class PersistentBitVecBuilder {
        -mmap: MmapMut
        -n: usize
        +view() BitSliceView
        +set(slot, bool)
        +or/and/xor/not(BitSliceView)
        +copy_from(BitSliceView)
        +close() / finish() → PersistentBitVec
    }
    class PersistentCompactIntVec {
        -mmap: Mmap
        -n: usize
        -n_overflow: usize
        -step: usize
        -index: Vec~(usize,usize)~
        +view() IntSliceView
        +get(slot) u32
        +iter() Iter
        +sum/count_nonzero() u64
        +bray_dist/euclidean_dist/… (&Self)
    }
    class PersistentCompactIntVecBuilder {
        -mmap: MmapMut
        -n: usize
        -overflow: HashMap~usize,u32~
        +view() IntSliceView
        +set(slot, u32) / get(slot) u32
        +inc / inc_present / inc_present_fast
        +inc_predicate / inc_predicate_fast
        +add/min/max/diff/mask_with(…View)
        +primary_bytes/primary_bytes_mut()
        +close() / finish() → PersistentCompactIntVec
    }
    PersistentBitVec --> BitSliceView : view()
    PersistentBitVecBuilder --> BitSliceView : view()
    PersistentCompactIntVec --> IntSliceView : view()
    PersistentCompactIntVecBuilder --> IntSliceView : view() (primary only)
    PersistentBitVecBuilder --> PersistentBitVec : close() then open()
    PersistentCompactIntVecBuilder --> PersistentCompactIntVec : close() then open()
```

### `PersistentBitVec` / `PersistentBitVecBuilder`

`PersistentBitVec` is the read-only type. `view()` returns a `BitSliceView<'_>` over the mmap word array.
Direct inherent methods delegate to the view: `count_ones()`, `count_zeros()`, `partial_jaccard_dist(&Self)`, `jaccard_dist(&Self)`, `hamming_dist(&Self)`.

`BitIter<'a>` — exported iterator for `PersistentBitVec::iter()`:

```rust
pub struct BitIter<'a> {
    pub(crate) words: &'a [u64],
    pub(crate) slot: usize,
    pub(crate) n: usize,
}
```

`PersistentBitVecBuilder` is the read-write type. Mutation operations accept `BitSliceView<'_>`:

| Method | Cost |
|---|---|
| `set(slot, bool)` | O(1) |
| `view() -> BitSliceView<'_>` | O(1) |
| `or/and/xor(BitSliceView)` | word-level, O(n/64), SIMD-friendly |
| `not()` | `w ^= u64::MAX` per word, re-masks last word, O(n/64) |
| `copy_from(BitSliceView)` | `copy_from_slice`, O(n/64) |

### `PersistentCompactIntVec` / `PersistentCompactIntVecBuilder`

`PersistentCompactIntVec` is the read-only type. `view()` returns an `IntSliceView<'_>` over the mmap primary and overflow arrays. Inherent `iter()` is a merge scan (`Iter` struct). Inherent `sum()` and `count_nonzero()` use fast byte-scan helpers.

`PersistentCompactIntVecBuilder` is the read-write type.
Mutation methods on the builder fall into two categories:

**Point mutations:**

| Method | Note |
|---|---|
| `set(slot, u32)` | writes primary[slot] or 255+overflow |
| `get(slot) -> u32` | reads primary byte or HashMap |
| `inc(slot)` | `get` + `set`, O(1) |

**Bulk computation methods** — accept view arguments:

| Method | Semantics | Overflow |
|---|---|---|
| `inc_present(BitSliceView)` | `+= 1` at each 1-bit | via `inc`, safe for any group size |
| `inc_present_fast(BitSliceView)` | same, raw u8 `+= 1` | `debug_assert` no 255 reached |
| `inc_predicate(IntSliceView, pred)` | `+= 1` where `pred(col[s])` | two-pass, safe |
| `inc_predicate_fast(IntSliceView, pred)` | same, raw u8 | `debug_assert` no 255 reached |
| `add(IntSliceView)` | `self[s] += other[s]` | primary fast path + overflow fallback |
| `min(IntSliceView)` | byte min + both-overflow fixup | see algorithm below |
| `max(IntSliceView)` | pre-pass + byte max | see algorithm below |
| `diff(IntSliceView)` | saturating sub | self<255 hot path |
| `mask_with(BitSliceView)` | zeros slots where mask bit = 0 | O(n_zeros) |

**`inc_present_fast` / `inc_predicate_fast` invariant:** caller guarantees no counter reaches 255 during the operation (group size < 255 for `inc_present_fast`, or chunk size < 255 for `inc_predicate_fast`). Violation is caught by `debug_assert` in dev builds.

**`min` algorithm:**

Exploits 255 = +∞: byte-level min is correct unless both sides are overflow.

```
snapshot self_ov: Vec<(slot,val)>
snapshot other_ov: HashMap
clear_overflow()
Pass 1 — byte min, SIMD-vectorizable, O(n)
Pass 2 — both-overflow fixup, O(k_self):
  for (slot, self_val) in self_ov:
    if slot ∈ other_ov:
      set(slot, min(self_val, other_ov[slot]))
```

**`max` algorithm:**

Cannot do byte max first — `max(255, b<255)=255` overwrites self's original overflow value. Pre-pass reads self's value at other's overflow slots before the byte pass.
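Both strategies described above can be tried out on a toy in-memory model. The `Toy` type below is a hypothetical stand-in for the builder (a `Vec<u8>` plus a `HashMap` overflow), so this is a sketch of the pass ordering under that assumption rather than the crate's implementation.

```rust
use std::collections::HashMap;

/// Hypothetical toy model: u8 primary array + HashMap overflow (sentinel = 255).
struct Toy {
    primary: Vec<u8>,
    overflow: HashMap<usize, u32>,
}

impl Toy {
    fn from_values(vals: &[u32]) -> Self {
        let mut t = Toy { primary: vec![0; vals.len()], overflow: HashMap::new() };
        for (slot, &v) in vals.iter().enumerate() {
            t.set(slot, v);
        }
        t
    }

    fn set(&mut self, slot: usize, value: u32) {
        if value < 255 {
            self.primary[slot] = value as u8;
            self.overflow.remove(&slot);
        } else {
            self.primary[slot] = 255;
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: usize) -> u32 {
        match self.primary[slot] {
            255 => self.overflow[&slot],
            byte => u32::from(byte),
        }
    }

    fn values(&self) -> Vec<u32> {
        (0..self.primary.len()).map(|s| self.get(s)).collect()
    }

    /// min: snapshot and clear self's overflow, byte min, then fix up
    /// the slots where both sides were in overflow.
    fn min_with(&mut self, other: &Toy) {
        let self_ov: Vec<(usize, u32)> = self.overflow.drain().collect();
        for (a, &b) in self.primary.iter_mut().zip(&other.primary) {
            *a = (*a).min(b); // 255 acts as +infinity
        }
        for (slot, self_val) in self_ov {
            if let Some(&other_val) = other.overflow.get(&slot) {
                self.set(slot, self_val.min(other_val));
            }
        }
    }

    /// max: resolve other's overflow slots first, then byte max.
    fn max_with(&mut self, other: &Toy) {
        for (&slot, &other_val) in &other.overflow {
            let v = self.get(slot).max(other_val);
            self.set(slot, v);
        }
        for (a, &b) in self.primary.iter_mut().zip(&other.primary) {
            *a = (*a).max(b); // sentinel bytes stay 255, entries untouched
        }
    }
}
```

In `min_with`, a slot that was overflow only on `self` ends up with `other`'s byte after the byte pass, and its stale overflow entry is already gone because the map was drained up front.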
```
Pre-pass O(k_other):
  for (slot, other_val) in other.overflow_entries():
    set(slot, max(self.get(slot), other_val))
Pass 1 — byte max, SIMD-vectorizable, O(n)
```

---

## Matrix types

Four matrix layouts: two encodings × two formats.

| | Columnar format | Packed format |
|---|---|---|
| **Bit** | `PersistentBitMatrix` (Columnar variant) | `PersistentBitMatrix` (Packed variant) |
| **Int** | `PersistentCompactIntMatrix` (Columnar variant) | `PersistentCompactIntMatrix` (Packed variant) |

Both matrix types are enums (`Columnar` / `Packed`, plus `Implicit` for bit) behind a transparent API. `col_view(c)` returns the appropriate view directly:

```rust
// PersistentBitMatrix
pub fn col_view(&self, c: usize) -> BitSliceView<'_>

// PersistentCompactIntMatrix
pub fn col_view(&self, c: usize) -> IntSliceView<'_>
```

No wrapper enums (`BitColView`, `IntColView`): the caller receives a `Copy` view struct immediately usable with any view method or bulk builder method.

`pack_compact_int_matrix` and `pack_bit_matrix` convert columnar → packed format.

---

## Aggregation traits (matrix level)

### ColumnWeights

```rust
trait ColumnWeights: Send + Sync {
    fn col_weights(&self) -> Array1;          // sum per column
    fn partial_kmer_counts(&self) -> Array1;  // default = col_weights()
}
```

`partial_kmer_counts` is overridden for count matrices to return `count_nonzero` per column (distinct kmers) rather than total count.

### CountPartials

Abstract required methods: `partial_bray`, `partial_euclidean`, `partial_threshold_jaccard`, `partial_relfreq_bray`, `partial_relfreq_euclidean`, `partial_hellinger`.

**Additivity rule:** self-contained partials (`partial_bray`, `partial_euclidean`, `partial_threshold_jaccard`) can be element-wise summed across all `(partition, layer)` pairs. Normalised partials (`partial_relfreq_*`, `partial_hellinger`) require the **global** `col_weights` (accumulated across all layers and all partitions) as parameter.
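The additivity rule can be checked on a single pair of columns. The sketch below assumes that `partial_bray` is the per-slot sum of minima (the usual Bray-Curtis numerator, which this document does not spell out) and applies the finalisation `1 − 2·partial / (w[i] + w[j])`. All function names are hypothetical and operate on plain slices instead of matrices.

```rust
/// Assumed definition: sum over slots of min(a, b).
fn partial_bray(a: &[u32], b: &[u32]) -> u64 {
    a.iter().zip(b).map(|(&x, &y)| u64::from(x.min(y))).sum()
}

/// Column weight: total count in the column.
fn weight(a: &[u32]) -> u64 {
    a.iter().map(|&x| u64::from(x)).sum()
}

/// Finalisation applied once, after all partials have been summed.
/// Assumption for this sketch: two empty columns are at distance 0.
fn bray_dist(partial: u64, w_i: u64, w_j: u64) -> f64 {
    if w_i + w_j == 0 {
        0.0
    } else {
        1.0 - 2.0 * partial as f64 / (w_i + w_j) as f64
    }
}

/// Sum partials and weights over partitions, then finalise.
/// Each partition is a (column i rows, column j rows) pair.
fn bray_over_partitions(parts: &[(&[u32], &[u32])]) -> f64 {
    let partial: u64 = parts.iter().map(|(a, b)| partial_bray(a, b)).sum();
    let w_i: u64 = parts.iter().map(|(a, _)| weight(a)).sum();
    let w_j: u64 = parts.iter().map(|(_, b)| weight(b)).sum();
    bray_dist(partial, w_i, w_j)
}
```

Splitting the rows into partitions and summing gives the same distance as processing the whole column at once, because both the numerator and the weights are plain sums over slots.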
**`partial_threshold_jaccard` returns `(inter, union)`** because `union[i,j]` depends on both columns simultaneously.

Provided finalisations:

| Finalisation | Formula |
|---|---|
| `bray_dist_matrix()` | `1 − 2·partial_bray[i,j] / (w[i] + w[j])` |
| `euclidean_dist_matrix()` | `√partial_euclidean[i,j]` |
| `threshold_jaccard_dist_matrix(t)` | `1 − inter[i,j] / union[i,j]` |
| `relfreq_bray_dist_matrix()` | `1 − partial_relfreq_bray[i,j]` |
| `relfreq_euclidean_dist_matrix()` | `√partial_relfreq_euclidean[i,j]` |
| `hellinger_dist_matrix()` | `√partial_hellinger[i,j] / √2` |
| `hellinger_euclidean_dist_matrix()` | `√partial_hellinger[i,j]` |

### BitPartials

Required: `partial_jaccard() -> (Array2, Array2)`, `partial_hamming() -> Array2`. Both additive across layers and partitions.

---

## Temp-file-backed types

**All inter-function results use temp-file-backed types** so the OS can page them out under memory pressure. This matters in practice: processing dozens of layers × hundreds of partitions in parallel would otherwise accumulate gigabytes of live anonymous memory.

### Lifecycle

```
TempCompactIntVecBuilder::new(n)    → writable mmap in TempDir
  ↓  (inc_present_fast / inc_predicate_fast / add / mask_with / …)
.freeze()                           → TempCompactIntVec (read-only mmap + TempDir)
  ↓  (optional)
.make_persistent(path)              → PersistentCompactIntVec (permanent file)
```

Same pattern for `TempBitVecBuilder` → `TempBitVec` → `PersistentBitVec`.

**Drop order**: `TempCompactIntVec { vec: PersistentCompactIntVec, _temp: TempDir }` — Rust drops fields in declaration order. `vec` (mmap) released before `_temp` (directory deleted). No explicit `drop()` needed.
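The drop-order guarantee is a language rule and can be observed directly. The sketch below uses hypothetical `Noisy` and `TempLike` types that record the order in which their fields are dropped; they only mimic the field layout described above and contain no mmap or directory.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Records its name in a shared log when dropped.
struct Noisy {
    name: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Same field order as the temp-backed wrapper: data first, directory second.
struct TempLike {
    vec: Noisy,
    _temp: Noisy,
}

/// Builds a `TempLike`, drops it, and returns the observed drop order.
fn drop_order() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let wrapper = TempLike {
        vec: Noisy { name: "vec", log: Rc::clone(&log) },
        _temp: Noisy { name: "_temp", log: Rc::clone(&log) },
    };
    drop(wrapper); // fields are dropped in declaration order
    let order = log.borrow().clone();
    order
}
```

Reordering the two fields in the struct definition would reverse the result, which is why the field order in the wrapper is part of its correctness.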
### TempCompactIntVec / TempCompactIntVecBuilder

```rust
pub struct TempCompactIntVec {
    vec:   PersistentCompactIntVec,
    _temp: TempDir,   // dropped after vec
}

pub(crate) struct TempCompactIntVecBuilder {
    builder: PersistentCompactIntVecBuilder,
    temp:    TempDir,
}
```

`TempCompactIntVec`: read access via `get(slot)`, `sum()`, `iter()`, `view() -> IntSliceView<'_>`.

`TempCompactIntVecBuilder`: full delegation to inner `PersistentCompactIntVecBuilder` — all bulk computation methods (`inc_present_fast`, `inc_predicate_fast`, `add`, `min`, `max`, `diff`, `mask_with`) are exposed as `pub(crate)`.

### TempBitVec / TempBitVecBuilder

```rust
pub struct TempBitVec {
    vec:   PersistentBitVec,
    _temp: TempDir,
}

pub(crate) struct TempBitVecBuilder {
    builder: PersistentBitVecBuilder,
    temp:    TempDir,
}
```

`TempBitVec`: read access via `get(slot)`, `count_ones()`, `view() -> BitSliceView<'_>`, `iter()`.

`TempBitVecBuilder`: exposes `set(slot, bool)`, `or(BitSliceView)`, and:

```rust
pub(crate) fn or_where(&mut self, col: IntSliceView<'_>, pred: impl Fn(u32) -> bool)
```

`or_where` — two passes, no intermediate allocation:

```
Pass 1 — primary bytes, O(n):
  for slot in 0..n:
    b = col.primary_bytes()[slot]
    if b < 255 AND pred(b as u32): self.set(slot, true)

Pass 2 — overflow, O(k):
  for (slot, val) in col.overflow_entries():
    if pred(val): self.set(slot, true)
```

---

## Filter / Select API

### ColGroup

```rust
pub struct ColGroup { pub name: String, pub indices: Vec }
```

Defined **once at the index level** from column metadata. Valid in all matrices of all layers and partitions — column structure is identical across the entire hierarchy; only rows (kmer slots) are partitioned.

### Composition axis

- **Across partitions**: kmer space is partitioned → partial results **concatenated** (disjoint kmer ranges).
- **Across layers**: same kmer space, different counts → partial results **aggregated** (add, OR, etc.).
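The two composition axes can be pictured with plain vectors standing in for per-partition, per-layer partial counts. The helper names below are hypothetical, and element-wise addition is just one example of a layer aggregation.

```rust
/// Across layers: same kmer slots, so partial counts are added element-wise.
fn aggregate_layers(layers: &[Vec<u32>]) -> Vec<u32> {
    let n = layers.first().map_or(0, Vec::len);
    let mut acc = vec![0u32; n];
    for layer in layers {
        for (a, &v) in acc.iter_mut().zip(layer) {
            *a += v;
        }
    }
    acc
}

/// Across partitions: disjoint kmer ranges, so results are concatenated.
fn concat_partitions(parts: &[Vec<u32>]) -> Vec<u32> {
    parts.iter().flatten().copied().collect()
}
```

A full result is then obtained by aggregating the layers inside each partition and concatenating the per-partition vectors in partition order.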
### MatrixGroupOps

Group operations expose only **additive intermediates** backed by temp files. Final predicates are applied at the index level after accumulation.

```rust
pub trait MatrixGroupOps {
    fn partial_group_presence_count(&self, g: &ColGroup, threshold: u32) -> io::Result;
    fn partial_group_sum(&self, g: &ColGroup) -> io::Result;
    fn partial_group_any(&self, g: &ColGroup, threshold: u32) -> io::Result;
}
```

Implemented for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`. For bit matrices, `partial_group_sum` delegates to `partial_group_presence_count(g, 1)`.

**`partial_group_presence_count` — chunking for large groups:**

When `g.indices.len() < 255`: per-slot counts stay within `u8` range. Use `inc_present_fast` (bit matrix) or `inc_predicate_fast(col_view(c), |v| v >= threshold)` (int matrix) — raw u8 increment, no overflow map written.

When `g.indices.len() ≥ 255`: process in chunks of 254 columns (each chunk stays within u8 range), accumulate into a running builder via `.add(chunk_frozen.view())`.
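The reason for the chunk size of 254 can be seen in a toy model where each column is a plain `Vec<u32>` and the per-chunk counters are raw `u8` values. The function below is a hypothetical illustration, not the trait implementation.

```rust
/// Counts, per slot, how many columns have a value >= threshold.
/// Per-chunk counters are raw u8; a chunk of at most 254 columns
/// can never push a counter to the sentinel value 255.
fn presence_count(columns: &[Vec<u32>], n: usize, threshold: u32) -> Vec<u32> {
    let mut total = vec![0u32; n];
    for chunk in columns.chunks(254) {
        let mut counts = vec![0u8; n];
        for col in chunk {
            for (c, &v) in counts.iter_mut().zip(col) {
                if v >= threshold {
                    *c += 1; // at most 254 increments per chunk
                }
            }
        }
        // Wide accumulation, playing the role of `.add(chunk_frozen.view())`.
        for (t, &c) in total.iter_mut().zip(&counts) {
            *t += u32::from(c);
        }
    }
    total
}
```

With 600 columns this runs three chunks (254, 254, 92), and only the wide accumulator ever holds a value above 254.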
```
fast path (< 255 cols):
  builder = TempCompactIntVecBuilder::new(n)
  for c in group:
    builder.inc_predicate_fast(matrix.col_view(c), |v| v >= threshold)
  builder.freeze()

slow path (≥ 255 cols):
  result = TempCompactIntVecBuilder::new(n)
  for chunk in group.chunks(254):
    chunk_b = TempCompactIntVecBuilder::new(n)
    for c in chunk:
      chunk_b.inc_predicate_fast(matrix.col_view(c), |v| v >= threshold)
    frozen = chunk_b.freeze()
    result.add(frozen.view())
  result.freeze()
```

**`partial_group_any`** uses `or_where` on `TempBitVecBuilder`:

```
result = TempBitVecBuilder::new(n)
for c in group:
  result.or_where(matrix.col_view(c), |v| v >= threshold)
result.freeze()
```

**Non-additive predicates** (`group_all`, `group_at_least(k)`) are composed at the index level:

```rust
// "present in >= 2 ingroup columns with count >= 3, absent from all outgroup"
let presence = layers.map(|l| l.partial_group_presence_count(&ingroup, 3)?).add_all()?;
let in_mask  = presence.view().geq(2);     // IntSliceView method
let out_sum  = layers.map(|l| l.partial_group_sum(&outgroup)?).add_all()?;
let out_mask = out_sum.view().leq(0);
let mut mask_b = TempBitVecBuilder::new(n)?;
mask_b.copy_from(in_mask);
mask_b.and(out_mask);
```

### mask_with

Direct method on `PersistentCompactIntVecBuilder` (and delegation via `TempCompactIntVecBuilder`). Zeros every slot where the corresponding mask bit is 0. Visits only zero bits: O(n_zeros) bit visits plus one comparison per mask word, so an all-ones mask costs only the O(n/64) word scan.

```
for (w_idx, word) in mask.words():
  if word == u64::MAX: continue     // skip all-ones words
  zeros = !word
  while zeros != 0:
    bit = trailing_zeros(zeros)
    s   = w_idx * 64 + bit
    if primary[s] != 0: set(s, 0)   // clears overflow entry too
    zeros &= zeros − 1
```

Terminal operation for Filter (retain only selected kmer slots in a count vector) and Select (positional selection without MPHF).
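A minimal Rust sketch of the zero-bit scan, operating on a plain `&mut [u32]` instead of the compact encoding. The function is hypothetical, and the `s >= n` guard is an assumption added for this example: since padding bits in the last mask word are zero, inverting that word yields set bits past the end of the vector.

```rust
/// Zeros every slot whose mask bit is 0, visiting only the zero bits.
fn mask_with(values: &mut [u32], mask_words: &[u64]) {
    let n = values.len();
    for (w_idx, &word) in mask_words.iter().enumerate() {
        if word == u64::MAX {
            continue; // all 64 slots kept, nothing to clear
        }
        let mut zeros = !word;
        while zeros != 0 {
            let s = w_idx * 64 + zeros.trailing_zeros() as usize;
            if s >= n {
                break; // remaining bits are padding in the last word
            }
            values[s] = 0;
            zeros &= zeros - 1; // clear the lowest set bit
        }
    }
}
```

Bits are visited in ascending order within a word, so the first out-of-range slot means every remaining bit in that word is padding as well.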