Files
obikmer/docmd/implementation/obicompactvec.md
T
Eric Coissac 7eea71fdcd docs(obicompactvec): update API docs and algorithm descriptions
Replace trait-based API documentation with concrete, zero-copy view structs and update all associated diagrams. Refine algorithmic descriptions for sentinel handling, overflow stores, and bulk operations. Clarify temporary file lifecycles and group-chunking strategies to support memory-efficient parallel aggregation.
2026-06-22 09:46:19 +02:00

528 lines
19 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# obicompactvec — Complete Reference
## Module structure
```
src/obicompactvec/src/
lib.rs public re-exports
views.rs BitSliceView<'a>, IntSliceView<'a> — zero-copy read views
traits.rs ColumnWeights, CountPartials, BitPartials (matrix aggregation)
bitvec.rs PersistentBitVec, PersistentBitVecBuilder, BitIter
reader.rs PersistentCompactIntVec (read-only)
builder.rs PersistentCompactIntVecBuilder (read-write)
tempintvec.rs TempCompactIntVec, TempCompactIntVecBuilder (temp-file-backed)
tempbitvec.rs TempBitVec, TempBitVecBuilder (temp-file-backed)
bitmatrix.rs PersistentBitMatrix, PersistentBitMatrixBuilder
intmatrix.rs PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder
colgroup.rs ColGroup, MatrixGroupOps trait
format.rs file format constants, encode/decode helpers
layer_meta.rs LayerMeta (column metadata)
meta.rs matrix metadata
```
```mermaid
graph TD
views --> bitvec
views --> builder
views --> tempbitvec
views --> tempintvec
views --> bitmatrix
views --> intmatrix
format --> reader
format --> builder
reader --> intmatrix
reader --> tempintvec
builder --> intmatrix
builder --> tempintvec
bitvec --> tempbitvec
bitvec --> bitmatrix
tempintvec --> intmatrix
tempintvec --> bitmatrix
tempbitvec --> intmatrix
tempbitvec --> bitmatrix
colgroup --> intmatrix
colgroup --> bitmatrix
layer_meta --> bitmatrix
layer_meta --> intmatrix
meta --> bitmatrix
meta --> intmatrix
```
---
## Compact int encoding
All integer vectors use the same two-tier encoding regardless of storage backend.
**Primary array** — one `u8` per slot:
- Values **0254** are stored directly. No overhead.
- Value **255 is a sentinel**: the slot's actual value is ≥ 255 and lives in the overflow store.
**Overflow store** — maps slot index to a `u32` value ≥ 255:
- In `PersistentCompactIntVecBuilder`: a `HashMap<usize, u32>` in RAM.
- In `PersistentCompactIntVec` (reader): a sorted `[(slot: u64, value: u32)]` array in the mmap, with a sparse L1-resident index for binary search.
```mermaid
flowchart LR
slot --> P["primary[slot]: u8"]
P -->|"< 255"| V["value = byte (0254)"]
P -->|"= 255 sentinel"| OV["overflow store"]
OV -->|"Builder"| HM["HashMap&lt;usize, u32&gt;\nin RAM"]
OV -->|"PersistentCompactIntVec"| SA["sorted [(slot,value)] in mmap\n+ sparse L1 index"]
```
**Key property — sentinel 255 = +∞ on `u8`:**
- `min(a, 255) = a` for all `a ≤ 254` → correct when only one side is overflow
- `max(a, 255) = 255` → correct sentinel when either side is overflow
- Only the **both-overflow** case requires reading actual values from the overflow store.
In practice, k (overflow count) ≪ n (total slots). Observed genomic data: ~0.07% of kmer slots are in overflow.
---
## View types
The previous trait hierarchy (`BitSlice`, `BitSliceMut`, `IntSlice`, `IntSliceMut`) has been replaced by two concrete zero-copy view structs with inherent methods. Views are **`Copy`** — passing them is free. All read operations live on these two types.
### `BitSliceView<'a>`
```rust
#[derive(Clone, Copy)]
pub struct BitSliceView<'a> { pub(crate) words: &'a [u64], pub(crate) n: usize }
```
Bit `i` is at `words[i >> 6]` bit `i & 63` (LSB-first). Padding bits in the last word are zero.
| Method | Cost |
|---|---|
| `len()`, `is_empty()` | O(1) |
| `get(slot)` | O(1) |
| `count_ones()` | POPCNT per word, O(n/64) |
| `count_zeros()` | `n count_ones()`, O(n/64) |
| `iter() -> BitSliceIter<'a>` | O(1) setup, O(n) iteration |
| `partial_jaccard_dist(other: BitSliceView)` | `(a&b).popcount`, `(a\|b).popcount` per word, O(n/64) |
| `jaccard_dist(other: BitSliceView)` | from partial, O(n/64) |
| `hamming_dist(other: BitSliceView)` | `(a^b).popcount` per word, O(n/64) |
`BitSliceIter<'a>`: word-level scan; one word per 64 iterations.
### `IntSliceView<'a>`
```rust
#[derive(Clone, Copy)]
pub struct IntSliceView<'a> {
pub(crate) primary: &'a [u8],
pub(crate) overflow_raw: &'a [u8], // sorted [(slot:u64, value:u32)] entries
pub(crate) n_overflow: usize,
pub(crate) n: usize,
}
```
`overflow_raw` contains `n_overflow` entries of `OVERFLOW_ENTRY_SIZE` bytes each, sorted by slot. The sort invariant is established at `close()`/`freeze()` time.
| Method | Cost |
|---|---|
| `len()`, `is_empty()` | O(1) |
| `primary_bytes()` | O(1) |
| `overflow_entries() -> impl Iterator<(usize,u32)>` | O(n_overflow) iteration |
| `get(slot)` | O(1) primary; binary search O(log k) for overflow slots |
| `iter() -> IntSliceViewIter<'a>` | merge scan, O(n + k) |
| `sum()` | byte scan + overflow, O(n + k) |
| `count_nonzero()` | byte scan, O(n) |
| Distance methods (`bray_dist`, `euclidean_dist`, `jaccard_dist`, …) | O(n + k) |
`IntSliceViewIter<'a>`: merge scan using `overflow_pos` index. Requires sorted overflow — guaranteed by the construction lifecycle.
**Builder `view()` vs reader `view()`:** `PersistentCompactIntVecBuilder` stores overflow as an unsorted `HashMap`, not raw bytes. Its `view()` returns an `IntSliceView` with `overflow_raw = &[]` and `n_overflow = 0`. This is intentional — the view is primarily useful after `freeze()`. During building, callers that need overflow use `overflow_entries()` directly.
---
## Concrete types
```mermaid
classDiagram
class BitSliceView {
+words: &[u64]
+n: usize
+get(slot) bool
+count_ones() u64
+iter() BitSliceIter
+jaccard_dist/hamming_dist(other: BitSliceView)
}
class IntSliceView {
+primary: &[u8]
+overflow_raw: &[u8]
+n_overflow: usize
+n: usize
+get(slot) u32
+iter() IntSliceViewIter
+overflow_entries() Iterator
+bray_dist/euclidean_dist/…(other: IntSliceView)
}
class PersistentBitVec {
-mmap: Mmap
-n: usize
+view() BitSliceView
+get(slot) bool
+count_ones/zeros() u64
+iter() BitIter
+partial_jaccard_dist(&Self) (u64,u64)
+jaccard_dist/hamming_dist(&Self) …
}
class PersistentBitVecBuilder {
-mmap: MmapMut
-n: usize
+view() BitSliceView
+set(slot, bool)
+or/and/xor/not(BitSliceView)
+copy_from(BitSliceView)
+close() / finish() → PersistentBitVec
}
class PersistentCompactIntVec {
-mmap: Mmap
-n: usize
-n_overflow: usize
-step: usize
-index: Vec~(usize,usize)~
+view() IntSliceView
+get(slot) u32
+iter() Iter
+sum/count_nonzero() u64
+bray_dist/euclidean_dist/… (&Self)
}
class PersistentCompactIntVecBuilder {
-mmap: MmapMut
-n: usize
-overflow: HashMap~usize,u32~
+view() IntSliceView
+set(slot, u32) / get(slot) u32
+inc / inc_present / inc_present_fast
+inc_predicate / inc_predicate_fast
+add/min/max/diff/mask_with(…View)
+primary_bytes/primary_bytes_mut()
+close() / finish() → PersistentCompactIntVec
}
PersistentBitVec --> BitSliceView : view()
PersistentBitVecBuilder --> BitSliceView : view()
PersistentCompactIntVec --> IntSliceView : view()
PersistentCompactIntVecBuilder --> IntSliceView : view() (primary only)
PersistentBitVecBuilder --> PersistentBitVec : close() then open()
PersistentCompactIntVecBuilder --> PersistentCompactIntVec : close() then open()
```
### `PersistentBitVec` / `PersistentBitVecBuilder`
`PersistentBitVec` is the read-only type. `view()` returns a `BitSliceView<'_>` over the mmap word array. Direct inherent methods delegate to the view: `count_ones()`, `count_zeros()`, `partial_jaccard_dist(&Self)`, `jaccard_dist(&Self)`, `hamming_dist(&Self)`.
`BitIter<'a>` — exported iterator for `PersistentBitVec::iter()`:
```rust
pub struct BitIter<'a> { pub(crate) words: &'a [u64], pub(crate) slot: usize, pub(crate) n: usize }
```
`PersistentBitVecBuilder` is the read-write type. Mutation operations accept `BitSliceView<'_>`:
| Method | Cost |
|---|---|
| `set(slot, bool)` | O(1) |
| `view() -> BitSliceView<'_>` | O(1) |
| `or/and/xor(BitSliceView)` | word-level, O(n/64), SIMD-friendly |
| `not()` | `w ^= u64::MAX` per word, re-masks last word | O(n/64) |
| `copy_from(BitSliceView)` | `copy_from_slice` | O(n/64) |
### `PersistentCompactIntVec` / `PersistentCompactIntVecBuilder`
`PersistentCompactIntVec` is the read-only type. `view()` returns an `IntSliceView<'_>` over the mmap primary and overflow arrays. Inherent `iter()` is a merge scan (`Iter` struct). Inherent `sum()` and `count_nonzero()` use fast byte-scan helpers.
`PersistentCompactIntVecBuilder` is the read-write type. Mutation methods on the builder fall into two categories:
**Point mutations:**
| Method | Note |
|---|---|
| `set(slot, u32)` | writes primary[slot] or 255+overflow |
| `get(slot) -> u32` | reads primary byte or HashMap |
| `inc(slot)` | `get` + `set`, O(1) |
**Bulk computation methods** — accept view arguments:
| Method | Semantics | Overflow |
|---|---|---|
| `inc_present(BitSliceView)` | `+= 1` at each 1-bit | via `inc`, safe for any group size |
| `inc_present_fast(BitSliceView)` | same, raw u8 `+= 1` | `debug_assert` no 255 reached |
| `inc_predicate(IntSliceView, pred)` | `+= 1` where `pred(col[s])` | two-pass, safe |
| `inc_predicate_fast(IntSliceView, pred)` | same, raw u8 | `debug_assert` no 255 reached |
| `add(IntSliceView)` | `self[s] += other[s]` | primary fast path + overflow fallback |
| `min(IntSliceView)` | byte min + both-overflow fixup | see algorithm below |
| `max(IntSliceView)` | pre-pass + byte max | see algorithm below |
| `diff(IntSliceView)` | saturating sub | self<255 hot path |
| `mask_with(BitSliceView)` | zeros slots where mask bit = 0 | O(n_zeros) |
**`inc_present_fast` / `inc_predicate_fast` invariant:** caller guarantees no counter reaches 255 during the operation (group size < 255 for `inc_present_fast`, or chunk size < 255 for `inc_predicate_fast`). Violation is caught by `debug_assert` in dev builds.
**`min` algorithm:**
Exploits 255 = +∞: byte-level min is correct unless both sides are overflow.
```
snapshot self_ov: Vec<(slot,val)>
snapshot other_ov: HashMap<slot,val>
clear_overflow()
Pass 1 — byte min, SIMD-vectorizable, O(n)
Pass 2 — both-overflow fixup, O(k_self):
for (slot, self_val) in self_ov:
if slot ∈ other_ov: set(slot, min(self_val, other_ov[slot]))
```
**`max` algorithm:**
Cannot do byte max first — `max(255, b<255)=255` overwrites self's original overflow value. Pre-pass reads self's value at other's overflow slots before the byte pass.
```
Pre-pass O(k_other): for (slot, other_val) in other.overflow_entries():
set(slot, max(self.get(slot), other_val))
Pass 1 — byte max, SIMD-vectorizable, O(n)
```
---
## Matrix types
Four matrix types, two encodings × two formats:
| | Columnar format | Packed format |
|---|---|---|
| **Bit** | `PersistentBitMatrix` (Columnar variant) | `PersistentBitMatrix` (Packed variant) |
| **Int** | `PersistentCompactIntMatrix` (Columnar variant) | `PersistentCompactIntMatrix` (Packed variant) |
Both matrix types are enums (`Columnar` / `Packed` / `Implicit` for bit) behind a transparent API. `col_view(c)` returns the appropriate view directly:
```rust
// PersistentBitMatrix
pub fn col_view(&self, c: usize) -> BitSliceView<'_>
// PersistentCompactIntMatrix
pub fn col_view(&self, c: usize) -> IntSliceView<'_>
```
No wrapper enums (`BitColView`, `IntColView`): the caller receives a `Copy` view struct immediately usable with any view method or bulk builder method.
`pack_compact_int_matrix` and `pack_bit_matrix` convert columnar → packed format.
---
## Aggregation traits (matrix level)
### ColumnWeights
```rust
trait ColumnWeights: Send + Sync {
fn col_weights(&self) -> Array1<u64>; // sum per column
fn partial_kmer_counts(&self) -> Array1<u64>; // default = col_weights()
}
```
`partial_kmer_counts` is overridden for count matrices to return `count_nonzero` per column (distinct kmers) rather than total count.
### CountPartials
Abstract required methods: `partial_bray`, `partial_euclidean`, `partial_threshold_jaccard`, `partial_relfreq_bray`, `partial_relfreq_euclidean`, `partial_hellinger`.
**Additivity rule:** self-contained partials (`partial_bray`, `partial_euclidean`, `partial_threshold_jaccard`) can be element-wise summed across all `(partition, layer)` pairs. Normalised partials (`partial_relfreq_*`, `partial_hellinger`) require the **global** `col_weights` (accumulated across all layers and all partitions) as parameter.
**`partial_threshold_jaccard` returns `(inter, union)`** because `union[i,j]` depends on both columns simultaneously.
Provided finalisations:
| Finalisation | Formula |
|---|---|
| `bray_dist_matrix()` | `1 2·partial_bray[i,j] / (w[i] + w[j])` |
| `euclidean_dist_matrix()` | `√partial_euclidean[i,j]` |
| `threshold_jaccard_dist_matrix(t)` | `1 inter[i,j] / union[i,j]` |
| `relfreq_bray_dist_matrix()` | `1 partial_relfreq_bray[i,j]` |
| `relfreq_euclidean_dist_matrix()` | `√partial_relfreq_euclidean[i,j]` |
| `hellinger_dist_matrix()` | `√partial_hellinger[i,j] / √2` |
| `hellinger_euclidean_dist_matrix()` | `√partial_hellinger[i,j]` |
### BitPartials
Required: `partial_jaccard() -> (Array2<u64>, Array2<u64>)`, `partial_hamming() -> Array2<u64>`. Both additive across layers and partitions.
---
## Temp-file-backed types
**All inter-function results use temp-file-backed types** so the OS can page them out under memory pressure. This matters in practice: processing dozens of layers × hundreds of partitions in parallel would otherwise accumulate gigabytes of live anonymous memory.
### Lifecycle
```
TempCompactIntVecBuilder::new(n) → writable mmap in TempDir
↓ (inc_present_fast / inc_predicate_fast / add / mask_with / …)
.freeze() → TempCompactIntVec (read-only mmap + TempDir)
↓ (optional)
.make_persistent(path) → PersistentCompactIntVec (permanent file)
```
Same pattern for `TempBitVecBuilder``TempBitVec``PersistentBitVec`.
**Drop order**: `TempCompactIntVec { vec: PersistentCompactIntVec, _temp: TempDir }` — Rust drops fields in declaration order. `vec` (mmap) released before `_temp` (directory deleted). No explicit `drop()` needed.
### TempCompactIntVec / TempCompactIntVecBuilder
```rust
pub struct TempCompactIntVec {
vec: PersistentCompactIntVec,
_temp: TempDir, // dropped after vec
}
pub(crate) struct TempCompactIntVecBuilder {
builder: PersistentCompactIntVecBuilder,
temp: TempDir,
}
```
`TempCompactIntVec`: read access via `get(slot)`, `sum()`, `iter()`, `view() -> IntSliceView<'_>`.
`TempCompactIntVecBuilder`: full delegation to inner `PersistentCompactIntVecBuilder` — all bulk computation methods (`inc_present_fast`, `inc_predicate_fast`, `add`, `min`, `max`, `diff`, `mask_with`) are exposed as `pub(crate)`.
### TempBitVec / TempBitVecBuilder
```rust
pub struct TempBitVec {
vec: PersistentBitVec,
_temp: TempDir,
}
pub(crate) struct TempBitVecBuilder {
builder: PersistentBitVecBuilder,
temp: TempDir,
}
```
`TempBitVec`: read access via `get(slot)`, `count_ones()`, `view() -> BitSliceView<'_>`, `iter()`.
`TempBitVecBuilder`: exposes `set(slot, bool)`, `or(BitSliceView)`, and:
```rust
pub(crate) fn or_where(&mut self, col: IntSliceView<'_>, pred: impl Fn(u32) -> bool)
```
`or_where` — two passes, no intermediate allocation:
```
Pass 1 — primary bytes, O(n):
for slot in 0..n:
b = col.primary_bytes()[slot]
if b < 255 AND pred(b as u32): self.set(slot, true)
Pass 2 — overflow, O(k):
for (slot, val) in col.overflow_entries():
if pred(val): self.set(slot, true)
```
---
## Filter / Select API
### ColGroup
```rust
pub struct ColGroup { pub name: String, pub indices: Vec<usize> }
```
Defined **once at the index level** from column metadata. Valid in all matrices of all layers and partitions — column structure is identical across the entire hierarchy; only rows (kmer slots) are partitioned.
### Composition axis
- **Across partitions**: kmer space is partitioned → partial results **concatenated** (disjoint kmer ranges).
- **Across layers**: same kmer space, different counts → partial results **aggregated** (add, OR, etc.).
### MatrixGroupOps
Group operations expose only **additive intermediates** backed by temp files. Final predicates are applied at the index level after accumulation.
```rust
pub trait MatrixGroupOps {
fn partial_group_presence_count(&self, g: &ColGroup, threshold: u32)
-> io::Result<TempCompactIntVec>;
fn partial_group_sum(&self, g: &ColGroup)
-> io::Result<TempCompactIntVec>;
fn partial_group_any(&self, g: &ColGroup, threshold: u32)
-> io::Result<TempBitVec>;
}
```
Implemented for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`. For bit matrices, `partial_group_sum` delegates to `partial_group_presence_count(g, 1)`.
**`partial_group_presence_count` — chunking for large groups:**
When `g.indices.len() < 255`: per-slot counts stay within `u8` range. Use `inc_present_fast` (bit matrix) or `inc_predicate_fast(col_view(c), |v| v >= threshold)` (int matrix) — raw u8 increment, no overflow map written.
When `g.indices.len() ≥ 255`: process in chunks of 254 columns (each chunk stays within u8 range), accumulate into a running builder via `.add(chunk_frozen.view())`.
```
fast path (< 255 cols):
builder = TempCompactIntVecBuilder::new(n)
for c in group:
builder.inc_predicate_fast(matrix.col_view(c), |v| v >= threshold)
builder.freeze()
slow path (≥ 255 cols):
result = TempCompactIntVecBuilder::new(n)
for chunk in group.chunks(254):
chunk_b = TempCompactIntVecBuilder::new(n)
for c in chunk:
chunk_b.inc_predicate_fast(matrix.col_view(c), |v| v >= threshold)
frozen = chunk_b.freeze()
result.add(frozen.view())
result.freeze()
```
**`partial_group_any`** uses `or_where` on `TempBitVecBuilder`:
```
result = TempBitVecBuilder::new(n)
for c in group:
result.or_where(matrix.col_view(c), |v| v >= threshold)
result.freeze()
```
**Non-additive predicates** (`group_all`, `group_at_least(k)`) are composed at the index level:
```rust
// "present in >= 2 ingroup columns with count >= 3, absent from all outgroup"
let presence = layers.map(|l| l.partial_group_presence_count(&ingroup, 3)?).add_all()?;
let in_mask = presence.view().geq(2); // IntSliceView method
let out_sum = layers.map(|l| l.partial_group_sum(&outgroup)?).add_all()?;
let out_mask = out_sum.view().leq(0);
let mut mask_b = TempBitVecBuilder::new(n)?;
mask_b.copy_from(in_mask);
mask_b.and(out_mask);
```
### mask_with
Direct method on `PersistentCompactIntVecBuilder` (and delegation via `TempCompactIntVecBuilder`). Zeros every slot where the corresponding mask bit is 0. Iterates only zero bits — O(n_zeros), O(1) when mask is all-ones.
```
for (w_idx, word) in mask.words():
if word == u64::MAX: continue // skip all-ones words
zeros = !word
while zeros != 0:
bit = trailing_zeros(zeros)
s = w_idx * 64 + bit
if primary[s] != 0: set(s, 0) // clears overflow entry too
zeros &= zeros 1
```
Terminal operation for Filter (retain only selected kmer slots in a count vector) and Select (positional selection without MPHF).