Files
obikmer/docmd/implementation/obicompactvec.md
T
Eric Coissac 7eea71fdcd docs(obicompactvec): update API docs and algorithm descriptions
Replace trait-based API documentation with concrete, zero-copy view structs and update all associated diagrams. Refine algorithmic descriptions for sentinel handling, overflow stores, and bulk operations. Clarify temporary file lifecycles and group-chunking strategies to support memory-efficient parallel aggregation.
2026-06-22 09:46:19 +02:00

19 KiB
Raw Blame History

obicompactvec — Complete Reference

Module structure

src/obicompactvec/src/
  lib.rs            public re-exports
  views.rs          BitSliceView<'a>, IntSliceView<'a> — zero-copy read views
  traits.rs         ColumnWeights, CountPartials, BitPartials (matrix aggregation)
  bitvec.rs         PersistentBitVec, PersistentBitVecBuilder, BitIter
  reader.rs         PersistentCompactIntVec (read-only)
  builder.rs        PersistentCompactIntVecBuilder (read-write)
  tempintvec.rs     TempCompactIntVec, TempCompactIntVecBuilder (temp-file-backed)
  tempbitvec.rs     TempBitVec, TempBitVecBuilder (temp-file-backed)
  bitmatrix.rs      PersistentBitMatrix, PersistentBitMatrixBuilder
  intmatrix.rs      PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder
  colgroup.rs       ColGroup, MatrixGroupOps trait
  format.rs         file format constants, encode/decode helpers
  layer_meta.rs     LayerMeta (column metadata)
  meta.rs           matrix metadata
graph TD
    views --> bitvec
    views --> builder
    views --> tempbitvec
    views --> tempintvec
    views --> bitmatrix
    views --> intmatrix
    format --> reader
    format --> builder
    reader --> intmatrix
    reader --> tempintvec
    builder --> intmatrix
    builder --> tempintvec
    bitvec --> tempbitvec
    bitvec --> bitmatrix
    tempintvec --> intmatrix
    tempintvec --> bitmatrix
    tempbitvec --> intmatrix
    tempbitvec --> bitmatrix
    colgroup --> intmatrix
    colgroup --> bitmatrix
    layer_meta --> bitmatrix
    layer_meta --> intmatrix
    meta --> bitmatrix
    meta --> intmatrix

Compact int encoding

All integer vectors use the same two-tier encoding regardless of storage backend.

Primary array — one u8 per slot:

  • Values 0254 are stored directly. No overhead.
  • Value 255 is a sentinel: the slot's actual value is ≥ 255 and lives in the overflow store.

Overflow store — maps slot index to a u32 value ≥ 255:

  • In PersistentCompactIntVecBuilder: a HashMap<usize, u32> in RAM.
  • In PersistentCompactIntVec (reader): a sorted [(slot: u64, value: u32)] array in the mmap, with a sparse L1-resident index for binary search.
flowchart LR
    slot --> P["primary[slot]: u8"]
    P -->|"< 255"| V["value = byte (0254)"]
    P -->|"= 255 sentinel"| OV["overflow store"]
    OV -->|"Builder"| HM["HashMap&lt;usize, u32&gt;\nin RAM"]
    OV -->|"PersistentCompactIntVec"| SA["sorted [(slot,value)] in mmap\n+ sparse L1 index"]

Key property — sentinel 255 = +∞ on u8:

  • min(a, 255) = a for all a ≤ 254 → correct when only one side is overflow
  • max(a, 255) = 255 → correct sentinel when either side is overflow
  • Only the both-overflow case requires reading actual values from the overflow store.

In practice, k (overflow count) ≪ n (total slots). Observed genomic data: ~0.07% of kmer slots are in overflow.


View types

The previous trait hierarchy (BitSlice, BitSliceMut, IntSlice, IntSliceMut) has been replaced by two concrete zero-copy view structs with inherent methods. Views are Copy — passing them is free. All read operations live on these two types.

BitSliceView<'a>

#[derive(Clone, Copy)]
pub struct BitSliceView<'a> { pub(crate) words: &'a [u64], pub(crate) n: usize }

Bit i is at words[i >> 6] bit i & 63 (LSB-first). Padding bits in the last word are zero.

Method Cost
len(), is_empty() O(1)
get(slot) O(1)
count_ones() POPCNT per word, O(n/64)
count_zeros() n count_ones(), O(n/64)
iter() -> BitSliceIter<'a> O(1) setup, O(n) iteration
partial_jaccard_dist(other: BitSliceView) (a&b).popcount, (a|b).popcount per word, O(n/64)
jaccard_dist(other: BitSliceView) from partial, O(n/64)
hamming_dist(other: BitSliceView) (a^b).popcount per word, O(n/64)

BitSliceIter<'a>: word-level scan; one word per 64 iterations.

IntSliceView<'a>

#[derive(Clone, Copy)]
pub struct IntSliceView<'a> {
    pub(crate) primary:      &'a [u8],
    pub(crate) overflow_raw: &'a [u8],   // sorted [(slot:u64, value:u32)] entries
    pub(crate) n_overflow:   usize,
    pub(crate) n:            usize,
}

overflow_raw contains n_overflow entries of OVERFLOW_ENTRY_SIZE bytes each, sorted by slot. The sort invariant is established at close()/freeze() time.

Method Cost
len(), is_empty() O(1)
primary_bytes() O(1)
overflow_entries() -> impl Iterator<(usize,u32)> O(n_overflow) iteration
get(slot) O(1) primary; binary search O(log k) for overflow slots
iter() -> IntSliceViewIter<'a> merge scan, O(n + k)
sum() byte scan + overflow, O(n + k)
count_nonzero() byte scan, O(n)
Distance methods (bray_dist, euclidean_dist, jaccard_dist, …) O(n + k)

IntSliceViewIter<'a>: merge scan using overflow_pos index. Requires sorted overflow — guaranteed by the construction lifecycle.

Builder view() vs reader view(): PersistentCompactIntVecBuilder stores overflow as an unsorted HashMap, not raw bytes. Its view() returns an IntSliceView with overflow_raw = &[] and n_overflow = 0. This is intentional — the view is primarily useful after freeze(). During building, callers that need overflow use overflow_entries() directly.


Concrete types

classDiagram
    class BitSliceView {
        +words: &[u64]
        +n: usize
        +get(slot) bool
        +count_ones() u64
        +iter() BitSliceIter
        +jaccard_dist/hamming_dist(other: BitSliceView)
    }
    class IntSliceView {
        +primary: &[u8]
        +overflow_raw: &[u8]
        +n_overflow: usize
        +n: usize
        +get(slot) u32
        +iter() IntSliceViewIter
        +overflow_entries() Iterator
        +bray_dist/euclidean_dist/…(other: IntSliceView)
    }
    class PersistentBitVec {
        -mmap: Mmap
        -n: usize
        +view() BitSliceView
        +get(slot) bool
        +count_ones/zeros() u64
        +iter() BitIter
        +partial_jaccard_dist(&Self) (u64,u64)
        +jaccard_dist/hamming_dist(&Self) …
    }
    class PersistentBitVecBuilder {
        -mmap: MmapMut
        -n: usize
        +view() BitSliceView
        +set(slot, bool)
        +or/and/xor/not(BitSliceView)
        +copy_from(BitSliceView)
        +close() / finish() → PersistentBitVec
    }
    class PersistentCompactIntVec {
        -mmap: Mmap
        -n: usize
        -n_overflow: usize
        -step: usize
        -index: Vec~(usize,usize)~
        +view() IntSliceView
        +get(slot) u32
        +iter() Iter
        +sum/count_nonzero() u64
        +bray_dist/euclidean_dist/… (&Self)
    }
    class PersistentCompactIntVecBuilder {
        -mmap: MmapMut
        -n: usize
        -overflow: HashMap~usize,u32~
        +view() IntSliceView
        +set(slot, u32) / get(slot) u32
        +inc / inc_present / inc_present_fast
        +inc_predicate / inc_predicate_fast
        +add/min/max/diff/mask_with(…View)
        +primary_bytes/primary_bytes_mut()
        +close() / finish() → PersistentCompactIntVec
    }

    PersistentBitVec --> BitSliceView : view()
    PersistentBitVecBuilder --> BitSliceView : view()
    PersistentCompactIntVec --> IntSliceView : view()
    PersistentCompactIntVecBuilder --> IntSliceView : view() (primary only)
    PersistentBitVecBuilder --> PersistentBitVec : close() then open()
    PersistentCompactIntVecBuilder --> PersistentCompactIntVec : close() then open()

PersistentBitVec / PersistentBitVecBuilder

PersistentBitVec is the read-only type. view() returns a BitSliceView<'_> over the mmap word array. Direct inherent methods delegate to the view: count_ones(), count_zeros(), partial_jaccard_dist(&Self), jaccard_dist(&Self), hamming_dist(&Self).

BitIter<'a> — exported iterator for PersistentBitVec::iter():

pub struct BitIter<'a> { pub(crate) words: &'a [u64], pub(crate) slot: usize, pub(crate) n: usize }

PersistentBitVecBuilder is the read-write type. Mutation operations accept BitSliceView<'_>:

Method Cost
set(slot, bool) O(1)
view() -> BitSliceView<'_> O(1)
or/and/xor(BitSliceView) word-level, O(n/64), SIMD-friendly
not() w ^= u64::MAX per word, re-masks last word
copy_from(BitSliceView) copy_from_slice

PersistentCompactIntVec / PersistentCompactIntVecBuilder

PersistentCompactIntVec is the read-only type. view() returns an IntSliceView<'_> over the mmap primary and overflow arrays. Inherent iter() is a merge scan (Iter struct). Inherent sum() and count_nonzero() use fast byte-scan helpers.

PersistentCompactIntVecBuilder is the read-write type. Mutation methods on the builder fall into two categories:

Point mutations:

Method Note
set(slot, u32) writes primary[slot] or 255+overflow
get(slot) -> u32 reads primary byte or HashMap
inc(slot) get + set, O(1)

Bulk computation methods — accept view arguments:

Method Semantics Overflow
inc_present(BitSliceView) += 1 at each 1-bit via inc, safe for any group size
inc_present_fast(BitSliceView) same, raw u8 += 1 debug_assert no 255 reached
inc_predicate(IntSliceView, pred) += 1 where pred(col[s]) two-pass, safe
inc_predicate_fast(IntSliceView, pred) same, raw u8 debug_assert no 255 reached
add(IntSliceView) self[s] += other[s] primary fast path + overflow fallback
min(IntSliceView) byte min + both-overflow fixup see algorithm below
max(IntSliceView) pre-pass + byte max see algorithm below
diff(IntSliceView) saturating sub self<255 hot path
mask_with(BitSliceView) zeros slots where mask bit = 0 O(n_zeros)

inc_present_fast / inc_predicate_fast invariant: caller guarantees no counter reaches 255 during the operation (group size < 255 for inc_present_fast, or chunk size < 255 for inc_predicate_fast). Violation is caught by debug_assert in dev builds.

min algorithm:

Exploits 255 = +∞: byte-level min is correct unless both sides are overflow.

snapshot self_ov: Vec<(slot,val)>
snapshot other_ov: HashMap<slot,val>
clear_overflow()
Pass 1 — byte min, SIMD-vectorizable, O(n)
Pass 2 — both-overflow fixup, O(k_self):
  for (slot, self_val) in self_ov:
    if slot ∈ other_ov: set(slot, min(self_val, other_ov[slot]))

max algorithm:

Cannot do byte max first — max(255, b<255)=255 overwrites self's original overflow value. Pre-pass reads self's value at other's overflow slots before the byte pass.

Pre-pass O(k_other): for (slot, other_val) in other.overflow_entries():
  set(slot, max(self.get(slot), other_val))
Pass 1 — byte max, SIMD-vectorizable, O(n)

Matrix types

Four matrix types, two encodings × two formats:

Columnar format Packed format
Bit PersistentBitMatrix (Columnar variant) PersistentBitMatrix (Packed variant)
Int PersistentCompactIntMatrix (Columnar variant) PersistentCompactIntMatrix (Packed variant)

Both matrix types are enums (Columnar / Packed / Implicit for bit) behind a transparent API. col_view(c) returns the appropriate view directly:

// PersistentBitMatrix
pub fn col_view(&self, c: usize) -> BitSliceView<'_>

// PersistentCompactIntMatrix
pub fn col_view(&self, c: usize) -> IntSliceView<'_>

No wrapper enums (BitColView, IntColView): the caller receives a Copy view struct immediately usable with any view method or bulk builder method.

pack_compact_int_matrix and pack_bit_matrix convert columnar → packed format.


Aggregation traits (matrix level)

ColumnWeights

trait ColumnWeights: Send + Sync {
    fn col_weights(&self) -> Array1<u64>;         // sum per column
    fn partial_kmer_counts(&self) -> Array1<u64>; // default = col_weights()
}

partial_kmer_counts is overridden for count matrices to return count_nonzero per column (distinct kmers) rather than total count.

CountPartials

Abstract required methods: partial_bray, partial_euclidean, partial_threshold_jaccard, partial_relfreq_bray, partial_relfreq_euclidean, partial_hellinger.

Additivity rule: self-contained partials (partial_bray, partial_euclidean, partial_threshold_jaccard) can be element-wise summed across all (partition, layer) pairs. Normalised partials (partial_relfreq_*, partial_hellinger) require the global col_weights (accumulated across all layers and all partitions) as parameter.

partial_threshold_jaccard returns (inter, union) because union[i,j] depends on both columns simultaneously.

Provided finalisations:

Finalisation Formula
bray_dist_matrix() 1 2·partial_bray[i,j] / (w[i] + w[j])
euclidean_dist_matrix() √partial_euclidean[i,j]
threshold_jaccard_dist_matrix(t) 1 inter[i,j] / union[i,j]
relfreq_bray_dist_matrix() 1 partial_relfreq_bray[i,j]
relfreq_euclidean_dist_matrix() √partial_relfreq_euclidean[i,j]
hellinger_dist_matrix() √partial_hellinger[i,j] / √2
hellinger_euclidean_dist_matrix() √partial_hellinger[i,j]

BitPartials

Required: partial_jaccard() -> (Array2<u64>, Array2<u64>), partial_hamming() -> Array2<u64>. Both additive across layers and partitions.


Temp-file-backed types

All inter-function results use temp-file-backed types so the OS can page them out under memory pressure. This matters in practice: processing dozens of layers × hundreds of partitions in parallel would otherwise accumulate gigabytes of live anonymous memory.

Lifecycle

TempCompactIntVecBuilder::new(n)   →  writable mmap in TempDir
     ↓  (inc_present_fast / inc_predicate_fast / add / mask_with / …)
 .freeze()                          →  TempCompactIntVec  (read-only mmap + TempDir)
     ↓  (optional)
 .make_persistent(path)             →  PersistentCompactIntVec  (permanent file)

Same pattern for TempBitVecBuilderTempBitVecPersistentBitVec.

Drop order: TempCompactIntVec { vec: PersistentCompactIntVec, _temp: TempDir } — Rust drops fields in declaration order. vec (mmap) released before _temp (directory deleted). No explicit drop() needed.

TempCompactIntVec / TempCompactIntVecBuilder

pub struct TempCompactIntVec {
    vec:   PersistentCompactIntVec,
    _temp: TempDir,        // dropped after vec
}

pub(crate) struct TempCompactIntVecBuilder {
    builder: PersistentCompactIntVecBuilder,
    temp:    TempDir,
}

TempCompactIntVec: read access via get(slot), sum(), iter(), view() -> IntSliceView<'_>.

TempCompactIntVecBuilder: full delegation to inner PersistentCompactIntVecBuilder — all bulk computation methods (inc_present_fast, inc_predicate_fast, add, min, max, diff, mask_with) are exposed as pub(crate).

TempBitVec / TempBitVecBuilder

pub struct TempBitVec {
    vec:   PersistentBitVec,
    _temp: TempDir,
}

pub(crate) struct TempBitVecBuilder {
    builder: PersistentBitVecBuilder,
    temp:    TempDir,
}

TempBitVec: read access via get(slot), count_ones(), view() -> BitSliceView<'_>, iter().

TempBitVecBuilder: exposes set(slot, bool), or(BitSliceView), and:

pub(crate) fn or_where(&mut self, col: IntSliceView<'_>, pred: impl Fn(u32) -> bool)

or_where — two passes, no intermediate allocation:

Pass 1 — primary bytes, O(n):
  for slot in 0..n:
    b = col.primary_bytes()[slot]
    if b < 255 AND pred(b as u32): self.set(slot, true)

Pass 2 — overflow, O(k):
  for (slot, val) in col.overflow_entries():
    if pred(val): self.set(slot, true)

Filter / Select API

ColGroup

pub struct ColGroup { pub name: String, pub indices: Vec<usize> }

Defined once at the index level from column metadata. Valid in all matrices of all layers and partitions — column structure is identical across the entire hierarchy; only rows (kmer slots) are partitioned.

Composition axis

  • Across partitions: kmer space is partitioned → partial results concatenated (disjoint kmer ranges).
  • Across layers: same kmer space, different counts → partial results aggregated (add, OR, etc.).

MatrixGroupOps

Group operations expose only additive intermediates backed by temp files. Final predicates are applied at the index level after accumulation.

pub trait MatrixGroupOps {
    fn partial_group_presence_count(&self, g: &ColGroup, threshold: u32)
        -> io::Result<TempCompactIntVec>;

    fn partial_group_sum(&self, g: &ColGroup)
        -> io::Result<TempCompactIntVec>;

    fn partial_group_any(&self, g: &ColGroup, threshold: u32)
        -> io::Result<TempBitVec>;
}

Implemented for both PersistentCompactIntMatrix and PersistentBitMatrix. For bit matrices, partial_group_sum delegates to partial_group_presence_count(g, 1).

partial_group_presence_count — chunking for large groups:

When g.indices.len() < 255: per-slot counts stay within u8 range. Use inc_present_fast (bit matrix) or inc_predicate_fast(col_view(c), |v| v >= threshold) (int matrix) — raw u8 increment, no overflow map written.

When g.indices.len() ≥ 255: process in chunks of 254 columns (each chunk stays within u8 range), accumulate into a running builder via .add(chunk_frozen.view()).

fast path (< 255 cols):
  builder = TempCompactIntVecBuilder::new(n)
  for c in group:
    builder.inc_predicate_fast(matrix.col_view(c), |v| v >= threshold)
  builder.freeze()

slow path (≥ 255 cols):
  result = TempCompactIntVecBuilder::new(n)
  for chunk in group.chunks(254):
    chunk_b = TempCompactIntVecBuilder::new(n)
    for c in chunk:
      chunk_b.inc_predicate_fast(matrix.col_view(c), |v| v >= threshold)
    frozen = chunk_b.freeze()
    result.add(frozen.view())
  result.freeze()

partial_group_any uses or_where on TempBitVecBuilder:

result = TempBitVecBuilder::new(n)
for c in group:
  result.or_where(matrix.col_view(c), |v| v >= threshold)
result.freeze()

Non-additive predicates (group_all, group_at_least(k)) are composed at the index level:

// "present in >= 2 ingroup columns with count >= 3, absent from all outgroup"
let presence = layers.map(|l| l.partial_group_presence_count(&ingroup, 3)?).add_all()?;
let in_mask  = presence.view().geq(2);   // IntSliceView method

let out_sum  = layers.map(|l| l.partial_group_sum(&outgroup)?).add_all()?;
let out_mask = out_sum.view().leq(0);

let mut mask_b = TempBitVecBuilder::new(n)?;
mask_b.copy_from(in_mask);
mask_b.and(out_mask);

mask_with

Direct method on PersistentCompactIntVecBuilder (and delegation via TempCompactIntVecBuilder). Zeros every slot where the corresponding mask bit is 0. Iterates only zero bits — O(n_zeros), O(1) when mask is all-ones.

for (w_idx, word) in mask.words():
  if word == u64::MAX: continue   // skip all-ones words
  zeros = !word
  while zeros != 0:
    bit = trailing_zeros(zeros)
    s = w_idx * 64 + bit
    if primary[s] != 0: set(s, 0)   // clears overflow entry too
    zeros &= zeros  1

Terminal operation for Filter (retain only selected kmer slots in a count vector) and Select (positional selection without MPHF).