Expands MatrixGroupOps with partial_group_min/max helpers for bitwise reductions and introduces add_col_from methods to persist external vectors as matrix columns. Refactors column aggregation in the partitioner to leverage these group operations directly, replacing iterative row processing with simplified builder lifecycle management and explicit metadata serialization.
20 KiB
obicompactvec — Complete Reference
Module structure
src/obicompactvec/src/
lib.rs public re-exports
views.rs BitSliceView<'a>, IntSliceView<'a> — zero-copy read views
traits.rs ColumnWeights, CountPartials, BitPartials (matrix aggregation)
bitvec.rs PersistentBitVec, PersistentBitVecBuilder, BitIter
reader.rs PersistentCompactIntVec (read-only)
builder.rs PersistentCompactIntVecBuilder (read-write)
tempintvec.rs TempCompactIntVec, TempCompactIntVecBuilder (temp-file-backed)
tempbitvec.rs TempBitVec, TempBitVecBuilder (temp-file-backed)
bitmatrix.rs PersistentBitMatrix, PersistentBitMatrixBuilder
intmatrix.rs PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder
colgroup.rs ColGroup, MatrixGroupOps trait
format.rs file format constants, encode/decode helpers
layer_meta.rs LayerMeta (column metadata)
meta.rs matrix metadata
graph TD
views --> bitvec
views --> builder
views --> tempbitvec
views --> tempintvec
views --> bitmatrix
views --> intmatrix
format --> reader
format --> builder
reader --> intmatrix
reader --> tempintvec
builder --> intmatrix
builder --> tempintvec
bitvec --> tempbitvec
bitvec --> bitmatrix
tempintvec --> intmatrix
tempintvec --> bitmatrix
tempbitvec --> intmatrix
tempbitvec --> bitmatrix
colgroup --> intmatrix
colgroup --> bitmatrix
layer_meta --> bitmatrix
layer_meta --> intmatrix
meta --> bitmatrix
meta --> intmatrix
Compact int encoding
All integer vectors use the same two-tier encoding regardless of storage backend.
Primary array — one u8 per slot:
- Values 0–254 are stored directly. No overhead.
- Value 255 is a sentinel: the slot's actual value is ≥ 255 and lives in the overflow store.
Overflow store — maps slot index to a u32 value ≥ 255:
- In
PersistentCompactIntVecBuilder: aHashMap<usize, u32>in RAM. - In
PersistentCompactIntVec(reader): a sorted[(slot: u64, value: u32)]array in the mmap, with a sparse L1-resident index for binary search.
flowchart LR
slot --> P["primary[slot]: u8"]
P -->|"< 255"| V["value = byte (0–254)"]
P -->|"= 255 sentinel"| OV["overflow store"]
OV -->|"Builder"| HM["HashMap<usize, u32>\nin RAM"]
OV -->|"PersistentCompactIntVec"| SA["sorted [(slot,value)] in mmap\n+ sparse L1 index"]
Key property — sentinel 255 = +∞ on u8:
min(a, 255) = afor alla ≤ 254→ correct when only one side is overflowmax(a, 255) = 255→ correct sentinel when either side is overflow- Only the both-overflow case requires reading actual values from the overflow store.
In practice, k (overflow count) ≪ n (total slots). Observed genomic data: ~0.07% of kmer slots are in overflow.
View types
The previous trait hierarchy (BitSlice, BitSliceMut, IntSlice, IntSliceMut) has been replaced by two concrete zero-copy view structs with inherent methods. Views are Copy — passing them is free. All read operations live on these two types.
BitSliceView<'a>
#[derive(Clone, Copy)]
pub struct BitSliceView<'a> { pub(crate) words: &'a [u64], pub(crate) n: usize }
Bit i is at words[i >> 6] bit i & 63 (LSB-first). Padding bits in the last word are zero.
| Method | Cost |
|---|---|
len(), is_empty() |
O(1) |
get(slot) |
O(1) |
count_ones() |
POPCNT per word, O(n/64) |
count_zeros() |
n − count_ones(), O(n/64) |
iter() -> BitSliceIter<'a> |
O(1) setup, O(n) iteration |
partial_jaccard_dist(other: BitSliceView) |
(a&b).popcount, (a|b).popcount per word, O(n/64) |
jaccard_dist(other: BitSliceView) |
from partial, O(n/64) |
hamming_dist(other: BitSliceView) |
(a^b).popcount per word, O(n/64) |
BitSliceIter<'a>: word-level scan; one word per 64 iterations.
IntSliceView<'a>
#[derive(Clone, Copy)]
pub struct IntSliceView<'a> {
pub(crate) primary: &'a [u8],
pub(crate) overflow_raw: &'a [u8], // sorted [(slot:u64, value:u32)] entries
pub(crate) n_overflow: usize,
pub(crate) n: usize,
}
overflow_raw contains n_overflow entries of OVERFLOW_ENTRY_SIZE bytes each, sorted by slot. The sort invariant is established at close()/freeze() time.
| Method | Cost |
|---|---|
len(), is_empty() |
O(1) |
primary_bytes() |
O(1) |
overflow_entries() -> impl Iterator<(usize,u32)> |
O(n_overflow) iteration |
get(slot) |
O(1) primary; binary search O(log k) for overflow slots |
iter() -> IntSliceViewIter<'a> |
merge scan, O(n + k) |
sum() |
byte scan + overflow, O(n + k) |
count_nonzero() |
byte scan, O(n) |
Distance methods (bray_dist, euclidean_dist, jaccard_dist, …) |
O(n + k) |
IntSliceViewIter<'a>: merge scan using overflow_pos index. Requires sorted overflow — guaranteed by the construction lifecycle.
Builder view() vs reader view(): PersistentCompactIntVecBuilder stores overflow as an unsorted HashMap, not raw bytes. Its view() returns an IntSliceView with overflow_raw = &[] and n_overflow = 0. This is intentional — the view is primarily useful after freeze(). During building, callers that need overflow use overflow_entries() directly.
Concrete types
classDiagram
class BitSliceView {
+words: &[u64]
+n: usize
+get(slot) bool
+count_ones() u64
+iter() BitSliceIter
+jaccard_dist/hamming_dist(other: BitSliceView)
}
class IntSliceView {
+primary: &[u8]
+overflow_raw: &[u8]
+n_overflow: usize
+n: usize
+get(slot) u32
+iter() IntSliceViewIter
+overflow_entries() Iterator
+bray_dist/euclidean_dist/…(other: IntSliceView)
}
class PersistentBitVec {
-mmap: Mmap
-n: usize
+view() BitSliceView
+get(slot) bool
+count_ones/zeros() u64
+iter() BitIter
+partial_jaccard_dist(&Self) (u64,u64)
+jaccard_dist/hamming_dist(&Self) …
}
class PersistentBitVecBuilder {
-mmap: MmapMut
-n: usize
+view() BitSliceView
+set(slot, bool)
+or/and/xor/not(BitSliceView)
+copy_from(BitSliceView)
+close() / finish() → PersistentBitVec
}
class PersistentCompactIntVec {
-mmap: Mmap
-n: usize
-n_overflow: usize
-step: usize
-index: Vec~(usize,usize)~
+view() IntSliceView
+get(slot) u32
+iter() Iter
+sum/count_nonzero() u64
+bray_dist/euclidean_dist/… (&Self)
}
class PersistentCompactIntVecBuilder {
-mmap: MmapMut
-n: usize
-overflow: HashMap~usize,u32~
+view() IntSliceView
+set(slot, u32) / get(slot) u32
+inc / inc_present / inc_present_fast
+inc_predicate / inc_predicate_fast
+add/min/max/diff/mask_with(…View)
+primary_bytes/primary_bytes_mut()
+close() / finish() → PersistentCompactIntVec
}
PersistentBitVec --> BitSliceView : view()
PersistentBitVecBuilder --> BitSliceView : view()
PersistentCompactIntVec --> IntSliceView : view()
PersistentCompactIntVecBuilder --> IntSliceView : view() (primary only)
PersistentBitVecBuilder --> PersistentBitVec : close() then open()
PersistentCompactIntVecBuilder --> PersistentCompactIntVec : close() then open()
PersistentBitVec / PersistentBitVecBuilder
PersistentBitVec is the read-only type. view() returns a BitSliceView<'_> over the mmap word array. Direct inherent methods delegate to the view: count_ones(), count_zeros(), partial_jaccard_dist(&Self), jaccard_dist(&Self), hamming_dist(&Self).
BitIter<'a> — exported iterator for PersistentBitVec::iter():
pub struct BitIter<'a> { pub(crate) words: &'a [u64], pub(crate) slot: usize, pub(crate) n: usize }
PersistentBitVecBuilder is the read-write type. Mutation operations accept BitSliceView<'_>:
| Method | Cost |
|---|---|
set(slot, bool) |
O(1) |
view() -> BitSliceView<'_> |
O(1) |
or/and/xor(BitSliceView) |
word-level, O(n/64), SIMD-friendly |
not() |
w ^= u64::MAX per word, re-masks last word |
copy_from(BitSliceView) |
copy_from_slice |
PersistentCompactIntVec / PersistentCompactIntVecBuilder
PersistentCompactIntVec is the read-only type. view() returns an IntSliceView<'_> over the mmap primary and overflow arrays. Inherent iter() is a merge scan (Iter struct). Inherent sum() and count_nonzero() use fast byte-scan helpers.
PersistentCompactIntVecBuilder is the read-write type. Mutation methods on the builder fall into two categories:
Point mutations:
| Method | Note |
|---|---|
set(slot, u32) |
writes primary[slot] or 255+overflow |
get(slot) -> u32 |
reads primary byte or HashMap |
inc(slot) |
get + set, O(1) |
Bulk computation methods — accept view arguments:
| Method | Semantics | Overflow |
|---|---|---|
inc_present(BitSliceView) |
+= 1 at each 1-bit |
via inc, safe for any group size |
inc_present_fast(BitSliceView) |
same, raw u8 += 1 |
debug_assert no 255 reached |
inc_predicate(IntSliceView, pred) |
+= 1 where pred(col[s]) |
two-pass, safe |
inc_predicate_fast(IntSliceView, pred) |
same, raw u8 | debug_assert no 255 reached |
add(IntSliceView) |
self[s] += other[s] |
primary fast path + overflow fallback |
min(IntSliceView) |
byte min + both-overflow fixup | see algorithm below |
max(IntSliceView) |
pre-pass + byte max | see algorithm below |
diff(IntSliceView) |
saturating sub | self<255 hot path |
mask_with(BitSliceView) |
zeros slots where mask bit = 0 | O(n_zeros) |
inc_present_fast / inc_predicate_fast invariant: caller guarantees no counter reaches 255 during the operation (group size < 255 for inc_present_fast, or chunk size < 255 for inc_predicate_fast). Violation is caught by debug_assert in dev builds.
min algorithm:
Exploits 255 = +∞: byte-level min is correct unless both sides are overflow.
snapshot self_ov: Vec<(slot,val)>
snapshot other_ov: HashMap<slot,val>
clear_overflow()
Pass 1 — byte min, SIMD-vectorizable, O(n)
Pass 2 — both-overflow fixup, O(k_self):
for (slot, self_val) in self_ov:
if slot ∈ other_ov: set(slot, min(self_val, other_ov[slot]))
max algorithm:
Cannot do byte max first — max(255, b<255)=255 overwrites self's original overflow value. Pre-pass reads self's value at other's overflow slots before the byte pass.
Pre-pass O(k_other): for (slot, other_val) in other.overflow_entries():
set(slot, max(self.get(slot), other_val))
Pass 1 — byte max, SIMD-vectorizable, O(n)
Matrix types
Four matrix types, two encodings × two formats:
| Columnar format | Packed format | |
|---|---|---|
| Bit | PersistentBitMatrix (Columnar variant) |
PersistentBitMatrix (Packed variant) |
| Int | PersistentCompactIntMatrix (Columnar variant) |
PersistentCompactIntMatrix (Packed variant) |
Both matrix types are enums (Columnar / Packed / Implicit for bit) behind a transparent API. col_view(c) returns the appropriate view directly:
// PersistentBitMatrix
pub fn col_view(&self, c: usize) -> BitSliceView<'_>
// PersistentCompactIntMatrix
pub fn col_view(&self, c: usize) -> IntSliceView<'_>
No wrapper enums (BitColView, IntColView): the caller receives a Copy view struct immediately usable with any view method or bulk builder method.
pack_compact_int_matrix and pack_bit_matrix convert columnar → packed format.
Aggregation traits (matrix level)
ColumnWeights
trait ColumnWeights: Send + Sync {
fn col_weights(&self) -> Array1<u64>; // sum per column
fn partial_kmer_counts(&self) -> Array1<u64>; // default = col_weights()
}
partial_kmer_counts is overridden for count matrices to return count_nonzero per column (distinct kmers) rather than total count.
CountPartials
Abstract required methods: partial_bray, partial_euclidean, partial_threshold_jaccard, partial_relfreq_bray, partial_relfreq_euclidean, partial_hellinger.
Additivity rule: self-contained partials (partial_bray, partial_euclidean, partial_threshold_jaccard) can be element-wise summed across all (partition, layer) pairs. Normalised partials (partial_relfreq_*, partial_hellinger) require the global col_weights (accumulated across all layers and all partitions) as parameter.
partial_threshold_jaccard returns (inter, union) because union[i,j] depends on both columns simultaneously.
Provided finalisations:
| Finalisation | Formula |
|---|---|
bray_dist_matrix() |
1 − 2·partial_bray[i,j] / (w[i] + w[j]) |
euclidean_dist_matrix() |
√partial_euclidean[i,j] |
threshold_jaccard_dist_matrix(t) |
1 − inter[i,j] / union[i,j] |
relfreq_bray_dist_matrix() |
1 − partial_relfreq_bray[i,j] |
relfreq_euclidean_dist_matrix() |
√partial_relfreq_euclidean[i,j] |
hellinger_dist_matrix() |
√partial_hellinger[i,j] / √2 |
hellinger_euclidean_dist_matrix() |
√partial_hellinger[i,j] |
BitPartials
Required: partial_jaccard() -> (Array2<u64>, Array2<u64>), partial_hamming() -> Array2<u64>. Both additive across layers and partitions.
Temp-file-backed types
All inter-function results use temp-file-backed types so the OS can page them out under memory pressure. This matters in practice: processing dozens of layers × hundreds of partitions in parallel would otherwise accumulate gigabytes of live anonymous memory.
Lifecycle
TempCompactIntVecBuilder::new(n) → writable mmap in TempDir
↓ (inc_present_fast / inc_predicate_fast / add / mask_with / …)
.freeze() → TempCompactIntVec (read-only mmap + TempDir)
↓ (optional)
.make_persistent(path) → PersistentCompactIntVec (permanent file)
Same pattern for TempBitVecBuilder → TempBitVec → PersistentBitVec.
Drop order: TempCompactIntVec { vec: PersistentCompactIntVec, _temp: TempDir } — Rust drops fields in declaration order. vec (mmap) released before _temp (directory deleted). No explicit drop() needed.
TempCompactIntVec / TempCompactIntVecBuilder
pub struct TempCompactIntVec {
vec: PersistentCompactIntVec,
_temp: TempDir, // dropped after vec
}
pub(crate) struct TempCompactIntVecBuilder {
builder: PersistentCompactIntVecBuilder,
temp: TempDir,
}
TempCompactIntVec: read access via get(slot), sum(), iter(), view() -> IntSliceView<'_>.
TempCompactIntVecBuilder: full delegation to inner PersistentCompactIntVecBuilder — all bulk computation methods (inc_present_fast, inc_predicate_fast, add, min, max, diff, mask_with) are exposed as pub(crate).
TempBitVec / TempBitVecBuilder
pub struct TempBitVec {
vec: PersistentBitVec,
_temp: TempDir,
}
pub(crate) struct TempBitVecBuilder {
builder: PersistentBitVecBuilder,
temp: TempDir,
}
TempBitVec: read access via get(slot), count_ones(), view() -> BitSliceView<'_>, iter().
TempBitVecBuilder: exposes set(slot, bool), or(BitSliceView), and:
pub(crate) fn or_where(&mut self, col: IntSliceView<'_>, pred: impl Fn(u32) -> bool)
or_where — two passes, no intermediate allocation:
Pass 1 — primary bytes, O(n):
for slot in 0..n:
b = col.primary_bytes()[slot]
if b < 255 AND pred(b as u32): self.set(slot, true)
Pass 2 — overflow, O(k):
for (slot, val) in col.overflow_entries():
if pred(val): self.set(slot, true)
Filter / Select API
ColGroup
pub struct ColGroup { pub name: String, pub indices: Vec<usize> }
Defined once at the index level from column metadata. Valid in all matrices of all layers and partitions — column structure is identical across the entire hierarchy; only rows (kmer slots) are partitioned.
Composition axis
- Across partitions: kmer space is partitioned → partial results concatenated (disjoint kmer ranges).
- Across layers: same kmer space, different counts → partial results aggregated (add, OR, etc.).
MatrixGroupOps
Five required primitives + two default methods derived from them. All return temp-file-backed types.
pub trait MatrixGroupOps {
// required
fn partial_group_presence_count(&self, g: &ColGroup, threshold: u32)
-> io::Result<TempCompactIntVec>;
fn partial_group_sum(&self, g: &ColGroup)
-> io::Result<TempCompactIntVec>;
fn partial_group_any(&self, g: &ColGroup, threshold: u32)
-> io::Result<TempBitVec>;
fn partial_group_min(&self, g: &ColGroup)
-> io::Result<TempCompactIntVec>;
fn partial_group_max(&self, g: &ColGroup)
-> io::Result<TempCompactIntVec>;
// defaults derived from partial_group_presence_count
fn partial_group_all(&self, g: &ColGroup, threshold: u32)
-> io::Result<TempBitVec>; // slot=1 iff count == g.indices.len()
fn partial_group_none(&self, g: &ColGroup, threshold: u32)
-> io::Result<TempBitVec>; // slot=1 iff count == 0
}
Implemented for both PersistentCompactIntMatrix and PersistentBitMatrix.
For bit matrices: values are 0/1, so partial_group_sum = partial_group_presence_count(g, 1); partial_group_min is AND (set first column then mask-with remaining); partial_group_max is OR via partial_group_any + inc_present.
partial_group_presence_count — chunking for large groups:
When g.indices.len() < 255: per-slot counts stay within u8 range. Use inc_present_fast (bit) or inc_predicate_fast(col_view(c), |v| v >= threshold) (int) — raw u8 increment, no overflow entry written.
When g.indices.len() ≥ 255: process in chunks of 254 columns, accumulate via .add(chunk_frozen.view()).
partial_group_min (int matrix): copy first column via .add(col_view(first)) (start from 0 ⇒ copy), then .min(col_view(c)) for remaining.
partial_group_max (int matrix): .max(col_view(c)) for all columns (start from 0 ⇒ first column acts as copy).
partial_group_any uses or_where on TempBitVecBuilder (two-pass: primary bytes then overflow entries).
partial_group_all / partial_group_none (default): call partial_group_presence_count, then iterate slots to produce the bit result. O(n) extra pass, not chunked.
add_col_from — matrix builder integration
Both matrix builders accept temp-file results directly:
// PersistentBitMatrixBuilder
fn add_col_from(&mut self, src: &TempBitVec) -> io::Result<()>
fn add_col_from_int(&mut self, src: &TempCompactIntVec) -> io::Result<()> // nonzero → 1
// PersistentCompactIntMatrixBuilder
fn add_col_from(&mut self, src: &TempCompactIntVec) -> io::Result<()>
fn add_col_from_bit(&mut self, src: &TempBitVec) -> io::Result<()> // bit → 0/1 u32
add_col_from copies the temp file to the matrix directory and increments n_cols; close() writes meta.json with the final column count. No separate write_meta step needed.
mask_with
Direct method on PersistentCompactIntVecBuilder (and delegation via TempCompactIntVecBuilder). Zeros every slot where the corresponding mask bit is 0. Iterates only zero bits — O(n_zeros), O(1) when mask is all-ones.
for (w_idx, word) in mask.words():
if word == u64::MAX: continue // skip all-ones words
zeros = !word
while zeros != 0:
bit = trailing_zeros(zeros)
s = w_idx * 64 + bit
if primary[s] != 0: set(s, 0) // clears overflow entry too
zeros &= zeros − 1
Terminal operation for Filter (retain only selected kmer slots in a count vector) and Select (positional selection without MPHF).