Files
obikmer/docmd/implementation/obicompactvec.md
T
Eric Coissac 4c4524766c feat(matrix): add partial group reductions and column persistence
Expands MatrixGroupOps with partial_group_min/max helpers for bitwise reductions and introduces add_col_from methods to persist external vectors as matrix columns. Refactors column aggregation in the partitioner to leverage these group operations directly, replacing iterative row processing with simplified builder lifecycle management and explicit metadata serialization.
2026-06-22 09:49:04 +02:00

20 KiB
Raw Blame History

obicompactvec — Complete Reference

Module structure

src/obicompactvec/src/
  lib.rs            public re-exports
  views.rs          BitSliceView<'a>, IntSliceView<'a> — zero-copy read views
  traits.rs         ColumnWeights, CountPartials, BitPartials (matrix aggregation)
  bitvec.rs         PersistentBitVec, PersistentBitVecBuilder, BitIter
  reader.rs         PersistentCompactIntVec (read-only)
  builder.rs        PersistentCompactIntVecBuilder (read-write)
  tempintvec.rs     TempCompactIntVec, TempCompactIntVecBuilder (temp-file-backed)
  tempbitvec.rs     TempBitVec, TempBitVecBuilder (temp-file-backed)
  bitmatrix.rs      PersistentBitMatrix, PersistentBitMatrixBuilder
  intmatrix.rs      PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder
  colgroup.rs       ColGroup, MatrixGroupOps trait
  format.rs         file format constants, encode/decode helpers
  layer_meta.rs     LayerMeta (column metadata)
  meta.rs           matrix metadata
graph TD
    views --> bitvec
    views --> builder
    views --> tempbitvec
    views --> tempintvec
    views --> bitmatrix
    views --> intmatrix
    format --> reader
    format --> builder
    reader --> intmatrix
    reader --> tempintvec
    builder --> intmatrix
    builder --> tempintvec
    bitvec --> tempbitvec
    bitvec --> bitmatrix
    tempintvec --> intmatrix
    tempintvec --> bitmatrix
    tempbitvec --> intmatrix
    tempbitvec --> bitmatrix
    colgroup --> intmatrix
    colgroup --> bitmatrix
    layer_meta --> bitmatrix
    layer_meta --> intmatrix
    meta --> bitmatrix
    meta --> intmatrix

Compact int encoding

All integer vectors use the same two-tier encoding regardless of storage backend.

Primary array — one u8 per slot:

  • Values 0254 are stored directly. No overhead.
  • Value 255 is a sentinel: the slot's actual value is ≥ 255 and lives in the overflow store.

Overflow store — maps slot index to a u32 value ≥ 255:

  • In PersistentCompactIntVecBuilder: a HashMap<usize, u32> in RAM.
  • In PersistentCompactIntVec (reader): a sorted [(slot: u64, value: u32)] array in the mmap, with a sparse L1-resident index for binary search.
flowchart LR
    slot --> P["primary[slot]: u8"]
    P -->|"< 255"| V["value = byte (0254)"]
    P -->|"= 255 sentinel"| OV["overflow store"]
    OV -->|"Builder"| HM["HashMap&lt;usize, u32&gt;\nin RAM"]
    OV -->|"PersistentCompactIntVec"| SA["sorted [(slot,value)] in mmap\n+ sparse L1 index"]

Key property — sentinel 255 = +∞ on u8:

  • min(a, 255) = a for all a ≤ 254 → correct when only one side is overflow
  • max(a, 255) = 255 → correct sentinel when either side is overflow
  • Only the both-overflow case requires reading actual values from the overflow store.

In practice, k (overflow count) ≪ n (total slots). Observed genomic data: ~0.07% of kmer slots are in overflow.


View types

The previous trait hierarchy (BitSlice, BitSliceMut, IntSlice, IntSliceMut) has been replaced by two concrete zero-copy view structs with inherent methods. Views are Copy — passing them is free. All read operations live on these two types.

BitSliceView<'a>

#[derive(Clone, Copy)]
pub struct BitSliceView<'a> { pub(crate) words: &'a [u64], pub(crate) n: usize }

Bit i is at words[i >> 6] bit i & 63 (LSB-first). Padding bits in the last word are zero.

Method Cost
len(), is_empty() O(1)
get(slot) O(1)
count_ones() POPCNT per word, O(n/64)
count_zeros() n count_ones(), O(n/64)
iter() -> BitSliceIter<'a> O(1) setup, O(n) iteration
partial_jaccard_dist(other: BitSliceView) (a&b).popcount, (a|b).popcount per word, O(n/64)
jaccard_dist(other: BitSliceView) from partial, O(n/64)
hamming_dist(other: BitSliceView) (a^b).popcount per word, O(n/64)

BitSliceIter<'a>: word-level scan; one word per 64 iterations.

IntSliceView<'a>

#[derive(Clone, Copy)]
pub struct IntSliceView<'a> {
    pub(crate) primary:      &'a [u8],
    pub(crate) overflow_raw: &'a [u8],   // sorted [(slot:u64, value:u32)] entries
    pub(crate) n_overflow:   usize,
    pub(crate) n:            usize,
}

overflow_raw contains n_overflow entries of OVERFLOW_ENTRY_SIZE bytes each, sorted by slot. The sort invariant is established at close()/freeze() time.

Method Cost
len(), is_empty() O(1)
primary_bytes() O(1)
overflow_entries() -> impl Iterator<(usize,u32)> O(n_overflow) iteration
get(slot) O(1) primary; binary search O(log k) for overflow slots
iter() -> IntSliceViewIter<'a> merge scan, O(n + k)
sum() byte scan + overflow, O(n + k)
count_nonzero() byte scan, O(n)
Distance methods (bray_dist, euclidean_dist, jaccard_dist, …) O(n + k)

IntSliceViewIter<'a>: merge scan using overflow_pos index. Requires sorted overflow — guaranteed by the construction lifecycle.

Builder view() vs reader view(): PersistentCompactIntVecBuilder stores overflow as an unsorted HashMap, not raw bytes. Its view() returns an IntSliceView with overflow_raw = &[] and n_overflow = 0. This is intentional — the view is primarily useful after freeze(). During building, callers that need overflow use overflow_entries() directly.


Concrete types

classDiagram
    class BitSliceView {
        +words: &[u64]
        +n: usize
        +get(slot) bool
        +count_ones() u64
        +iter() BitSliceIter
        +jaccard_dist/hamming_dist(other: BitSliceView)
    }
    class IntSliceView {
        +primary: &[u8]
        +overflow_raw: &[u8]
        +n_overflow: usize
        +n: usize
        +get(slot) u32
        +iter() IntSliceViewIter
        +overflow_entries() Iterator
        +bray_dist/euclidean_dist/…(other: IntSliceView)
    }
    class PersistentBitVec {
        -mmap: Mmap
        -n: usize
        +view() BitSliceView
        +get(slot) bool
        +count_ones/zeros() u64
        +iter() BitIter
        +partial_jaccard_dist(&Self) (u64,u64)
        +jaccard_dist/hamming_dist(&Self) …
    }
    class PersistentBitVecBuilder {
        -mmap: MmapMut
        -n: usize
        +view() BitSliceView
        +set(slot, bool)
        +or/and/xor/not(BitSliceView)
        +copy_from(BitSliceView)
        +close() / finish() → PersistentBitVec
    }
    class PersistentCompactIntVec {
        -mmap: Mmap
        -n: usize
        -n_overflow: usize
        -step: usize
        -index: Vec~(usize,usize)~
        +view() IntSliceView
        +get(slot) u32
        +iter() Iter
        +sum/count_nonzero() u64
        +bray_dist/euclidean_dist/… (&Self)
    }
    class PersistentCompactIntVecBuilder {
        -mmap: MmapMut
        -n: usize
        -overflow: HashMap~usize,u32~
        +view() IntSliceView
        +set(slot, u32) / get(slot) u32
        +inc / inc_present / inc_present_fast
        +inc_predicate / inc_predicate_fast
        +add/min/max/diff/mask_with(…View)
        +primary_bytes/primary_bytes_mut()
        +close() / finish() → PersistentCompactIntVec
    }

    PersistentBitVec --> BitSliceView : view()
    PersistentBitVecBuilder --> BitSliceView : view()
    PersistentCompactIntVec --> IntSliceView : view()
    PersistentCompactIntVecBuilder --> IntSliceView : view() (primary only)
    PersistentBitVecBuilder --> PersistentBitVec : close() then open()
    PersistentCompactIntVecBuilder --> PersistentCompactIntVec : close() then open()

PersistentBitVec / PersistentBitVecBuilder

PersistentBitVec is the read-only type. view() returns a BitSliceView<'_> over the mmap word array. Direct inherent methods delegate to the view: count_ones(), count_zeros(), partial_jaccard_dist(&Self), jaccard_dist(&Self), hamming_dist(&Self).

BitIter<'a> — exported iterator for PersistentBitVec::iter():

pub struct BitIter<'a> { pub(crate) words: &'a [u64], pub(crate) slot: usize, pub(crate) n: usize }

PersistentBitVecBuilder is the read-write type. Mutation operations accept BitSliceView<'_>:

Method Cost
set(slot, bool) O(1)
view() -> BitSliceView<'_> O(1)
or/and/xor(BitSliceView) word-level, O(n/64), SIMD-friendly
not() w ^= u64::MAX per word, re-masks last word
copy_from(BitSliceView) copy_from_slice

PersistentCompactIntVec / PersistentCompactIntVecBuilder

PersistentCompactIntVec is the read-only type. view() returns an IntSliceView<'_> over the mmap primary and overflow arrays. Inherent iter() is a merge scan (Iter struct). Inherent sum() and count_nonzero() use fast byte-scan helpers.

PersistentCompactIntVecBuilder is the read-write type. Mutation methods on the builder fall into two categories:

Point mutations:

Method Note
set(slot, u32) writes primary[slot] or 255+overflow
get(slot) -> u32 reads primary byte or HashMap
inc(slot) get + set, O(1)

Bulk computation methods — accept view arguments:

Method Semantics Overflow
inc_present(BitSliceView) += 1 at each 1-bit via inc, safe for any group size
inc_present_fast(BitSliceView) same, raw u8 += 1 debug_assert no 255 reached
inc_predicate(IntSliceView, pred) += 1 where pred(col[s]) two-pass, safe
inc_predicate_fast(IntSliceView, pred) same, raw u8 debug_assert no 255 reached
add(IntSliceView) self[s] += other[s] primary fast path + overflow fallback
min(IntSliceView) byte min + both-overflow fixup see algorithm below
max(IntSliceView) pre-pass + byte max see algorithm below
diff(IntSliceView) saturating sub self<255 hot path
mask_with(BitSliceView) zeros slots where mask bit = 0 O(n_zeros)

inc_present_fast / inc_predicate_fast invariant: caller guarantees no counter reaches 255 during the operation (group size < 255 for inc_present_fast, or chunk size < 255 for inc_predicate_fast). Violation is caught by debug_assert in dev builds.

min algorithm:

Exploits 255 = +∞: byte-level min is correct unless both sides are overflow.

snapshot self_ov: Vec<(slot,val)>
snapshot other_ov: HashMap<slot,val>
clear_overflow()
Pass 1 — byte min, SIMD-vectorizable, O(n)
Pass 2 — both-overflow fixup, O(k_self):
  for (slot, self_val) in self_ov:
    if slot ∈ other_ov: set(slot, min(self_val, other_ov[slot]))

max algorithm:

Cannot do byte max first — max(255, b<255)=255 overwrites self's original overflow value. Pre-pass reads self's value at other's overflow slots before the byte pass.

Pre-pass O(k_other): for (slot, other_val) in other.overflow_entries():
  set(slot, max(self.get(slot), other_val))
Pass 1 — byte max, SIMD-vectorizable, O(n)

Matrix types

Four matrix types, two encodings × two formats:

Columnar format Packed format
Bit PersistentBitMatrix (Columnar variant) PersistentBitMatrix (Packed variant)
Int PersistentCompactIntMatrix (Columnar variant) PersistentCompactIntMatrix (Packed variant)

Both matrix types are enums (Columnar / Packed / Implicit for bit) behind a transparent API. col_view(c) returns the appropriate view directly:

// PersistentBitMatrix
pub fn col_view(&self, c: usize) -> BitSliceView<'_>

// PersistentCompactIntMatrix
pub fn col_view(&self, c: usize) -> IntSliceView<'_>

No wrapper enums (BitColView, IntColView): the caller receives a Copy view struct immediately usable with any view method or bulk builder method.

pack_compact_int_matrix and pack_bit_matrix convert columnar → packed format.


Aggregation traits (matrix level)

ColumnWeights

trait ColumnWeights: Send + Sync {
    fn col_weights(&self) -> Array1<u64>;         // sum per column
    fn partial_kmer_counts(&self) -> Array1<u64>; // default = col_weights()
}

partial_kmer_counts is overridden for count matrices to return count_nonzero per column (distinct kmers) rather than total count.

CountPartials

Abstract required methods: partial_bray, partial_euclidean, partial_threshold_jaccard, partial_relfreq_bray, partial_relfreq_euclidean, partial_hellinger.

Additivity rule: self-contained partials (partial_bray, partial_euclidean, partial_threshold_jaccard) can be element-wise summed across all (partition, layer) pairs. Normalised partials (partial_relfreq_*, partial_hellinger) require the global col_weights (accumulated across all layers and all partitions) as parameter.

partial_threshold_jaccard returns (inter, union) because union[i,j] depends on both columns simultaneously.

Provided finalisations:

Finalisation Formula
bray_dist_matrix() 1 2·partial_bray[i,j] / (w[i] + w[j])
euclidean_dist_matrix() √partial_euclidean[i,j]
threshold_jaccard_dist_matrix(t) 1 inter[i,j] / union[i,j]
relfreq_bray_dist_matrix() 1 partial_relfreq_bray[i,j]
relfreq_euclidean_dist_matrix() √partial_relfreq_euclidean[i,j]
hellinger_dist_matrix() √partial_hellinger[i,j] / √2
hellinger_euclidean_dist_matrix() √partial_hellinger[i,j]

BitPartials

Required: partial_jaccard() -> (Array2<u64>, Array2<u64>), partial_hamming() -> Array2<u64>. Both additive across layers and partitions.


Temp-file-backed types

All inter-function results use temp-file-backed types so the OS can page them out under memory pressure. This matters in practice: processing dozens of layers × hundreds of partitions in parallel would otherwise accumulate gigabytes of live anonymous memory.

Lifecycle

TempCompactIntVecBuilder::new(n)   →  writable mmap in TempDir
     ↓  (inc_present_fast / inc_predicate_fast / add / mask_with / …)
 .freeze()                          →  TempCompactIntVec  (read-only mmap + TempDir)
     ↓  (optional)
 .make_persistent(path)             →  PersistentCompactIntVec  (permanent file)

Same pattern for TempBitVecBuilderTempBitVecPersistentBitVec.

Drop order: TempCompactIntVec { vec: PersistentCompactIntVec, _temp: TempDir } — Rust drops fields in declaration order. vec (mmap) released before _temp (directory deleted). No explicit drop() needed.

TempCompactIntVec / TempCompactIntVecBuilder

pub struct TempCompactIntVec {
    vec:   PersistentCompactIntVec,
    _temp: TempDir,        // dropped after vec
}

pub(crate) struct TempCompactIntVecBuilder {
    builder: PersistentCompactIntVecBuilder,
    temp:    TempDir,
}

TempCompactIntVec: read access via get(slot), sum(), iter(), view() -> IntSliceView<'_>.

TempCompactIntVecBuilder: full delegation to inner PersistentCompactIntVecBuilder — all bulk computation methods (inc_present_fast, inc_predicate_fast, add, min, max, diff, mask_with) are exposed as pub(crate).

TempBitVec / TempBitVecBuilder

pub struct TempBitVec {
    vec:   PersistentBitVec,
    _temp: TempDir,
}

pub(crate) struct TempBitVecBuilder {
    builder: PersistentBitVecBuilder,
    temp:    TempDir,
}

TempBitVec: read access via get(slot), count_ones(), view() -> BitSliceView<'_>, iter().

TempBitVecBuilder: exposes set(slot, bool), or(BitSliceView), and:

pub(crate) fn or_where(&mut self, col: IntSliceView<'_>, pred: impl Fn(u32) -> bool)

or_where — two passes, no intermediate allocation:

Pass 1 — primary bytes, O(n):
  for slot in 0..n:
    b = col.primary_bytes()[slot]
    if b < 255 AND pred(b as u32): self.set(slot, true)

Pass 2 — overflow, O(k):
  for (slot, val) in col.overflow_entries():
    if pred(val): self.set(slot, true)

Filter / Select API

ColGroup

pub struct ColGroup { pub name: String, pub indices: Vec<usize> }

Defined once at the index level from column metadata. Valid in all matrices of all layers and partitions — column structure is identical across the entire hierarchy; only rows (kmer slots) are partitioned.

Composition axis

  • Across partitions: kmer space is partitioned → partial results concatenated (disjoint kmer ranges).
  • Across layers: same kmer space, different counts → partial results aggregated (add, OR, etc.).

MatrixGroupOps

Five required primitives + two default methods derived from them. All return temp-file-backed types.

pub trait MatrixGroupOps {
    // required
    fn partial_group_presence_count(&self, g: &ColGroup, threshold: u32)
        -> io::Result<TempCompactIntVec>;
    fn partial_group_sum(&self, g: &ColGroup)
        -> io::Result<TempCompactIntVec>;
    fn partial_group_any(&self, g: &ColGroup, threshold: u32)
        -> io::Result<TempBitVec>;
    fn partial_group_min(&self, g: &ColGroup)
        -> io::Result<TempCompactIntVec>;
    fn partial_group_max(&self, g: &ColGroup)
        -> io::Result<TempCompactIntVec>;

    // defaults derived from partial_group_presence_count
    fn partial_group_all(&self, g: &ColGroup, threshold: u32)
        -> io::Result<TempBitVec>;   // slot=1 iff count == g.indices.len()
    fn partial_group_none(&self, g: &ColGroup, threshold: u32)
        -> io::Result<TempBitVec>;   // slot=1 iff count == 0
}

Implemented for both PersistentCompactIntMatrix and PersistentBitMatrix.

For bit matrices: values are 0/1, so partial_group_sum = partial_group_presence_count(g, 1); partial_group_min is AND (set first column then mask-with remaining); partial_group_max is OR via partial_group_any + inc_present.

partial_group_presence_count — chunking for large groups:

When g.indices.len() < 255: per-slot counts stay within u8 range. Use inc_present_fast (bit) or inc_predicate_fast(col_view(c), |v| v >= threshold) (int) — raw u8 increment, no overflow entry written.

When g.indices.len() ≥ 255: process in chunks of 254 columns, accumulate via .add(chunk_frozen.view()).

partial_group_min (int matrix): copy first column via .add(col_view(first)) (start from 0 ⇒ copy), then .min(col_view(c)) for remaining.

partial_group_max (int matrix): .max(col_view(c)) for all columns (start from 0 ⇒ first column acts as copy).

partial_group_any uses or_where on TempBitVecBuilder (two-pass: primary bytes then overflow entries).

partial_group_all / partial_group_none (default): call partial_group_presence_count, then iterate slots to produce the bit result. O(n) extra pass, not chunked.

add_col_from — matrix builder integration

Both matrix builders accept temp-file results directly:

// PersistentBitMatrixBuilder
fn add_col_from(&mut self, src: &TempBitVec)         -> io::Result<()>
fn add_col_from_int(&mut self, src: &TempCompactIntVec) -> io::Result<()>  // nonzero → 1

// PersistentCompactIntMatrixBuilder
fn add_col_from(&mut self, src: &TempCompactIntVec)  -> io::Result<()>
fn add_col_from_bit(&mut self, src: &TempBitVec)     -> io::Result<()>  // bit → 0/1 u32

add_col_from copies the temp file to the matrix directory and increments n_cols; close() writes meta.json with the final column count. No separate write_meta step needed.

mask_with

Direct method on PersistentCompactIntVecBuilder (and delegation via TempCompactIntVecBuilder). Zeros every slot where the corresponding mask bit is 0. Iterates only zero bits — O(n_zeros), O(1) when mask is all-ones.

for (w_idx, word) in mask.words():
  if word == u64::MAX: continue   // skip all-ones words
  zeros = !word
  while zeros != 0:
    bit = trailing_zeros(zeros)
    s = w_idx * 64 + bit
    if primary[s] != 0: set(s, 0)   // clears overflow entry too
    zeros &= zeros  1

Terminal operation for Filter (retain only selected kmer slots in a count vector) and Select (positional selection without MPHF).