obikmer/docmd/implementation/persistent_compact_int_vec.md
Eric Coissac f36b095ce2 docs: clarify MPHF indexing, storage layout, and distance traits
Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
2026-05-17 15:59:10 +08:00


PersistentCompactIntVec and PersistentCompactIntMatrix

Purpose

PersistentCompactIntVec stores a dense array of non-negative integers indexed by MPHF slot where the vast majority of values are small (0–254) and large values are rare. It is designed for mmap-compatible random and sequential access with minimal memory footprint and optimal cache behaviour.

Motivation from observed count distributions in genomics data: 99.9% of k-mer counts fit in a u8; overflow (count ≥ 255) affects ~0.07% of distinct k-mers but can reach values above 10⁶ (chloroplast, ribosomal repeats).

PersistentCompactIntMatrix wraps multiple PersistentCompactIntVec columns in a directory, exposing a column-major matrix with row-access API. A vector is a matrix with 1 column.


PersistentCompactIntVec — single-column file

Design

Two-tier structure:

  1. Primary array: [u8; n], stored at offset 40 in the PCIV file and mmap'd. Values 0–254 are stored directly. Value 255 is a sentinel meaning "look in overflow".
  2. Overflow section — sorted list of (slot: u64, value: u32) pairs for all slots where the true value ≥ 255, with a sparse L1-fitting index for fast lookup.
primary[slot] < 255  →  return primary[slot]
primary[slot] == 255 →  binary search in overflow
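
The two-tier lookup above can be sketched in a few lines. This is a minimal illustration, not the crate's code: it assumes plain in-memory slices stand in for the mmap'd sections, and the names `two_tier_get` and `SENTINEL` are hypothetical.

```rust
/// Sentinel byte in the primary array meaning "the value lives in the overflow list".
const SENTINEL: u8 = 255;

/// Two-tier lookup over in-memory slices (a stand-in for the mmap'd sections).
/// `overflow` must be sorted by slot, as in the file format.
fn two_tier_get(primary: &[u8], overflow: &[(u64, u32)], slot: usize) -> u32 {
    let byte = primary[slot];
    if byte < SENTINEL {
        // Common case: the value is stored directly in the primary byte.
        return u32::from(byte);
    }
    // Sentinel hit: the true value (>= 255) is in the sorted overflow list.
    let pos = overflow
        .binary_search_by_key(&(slot as u64), |&(s, _)| s)
        .expect("sentinel byte without a matching overflow entry");
    overflow[pos].1
}
```

Note that a value of exactly 255 must go to the overflow tier, since the byte 255 is reserved as the sentinel.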

File format

Single .pciv file. Write order: header placeholder → primary → overflow + index → header overwrite at offset 0.

offset 0:
  magic:      [u8; 4]   = b"PCIV"
  _pad:       [u8; 4]   = 0
  n:          u64        number of slots
  n_overflow: u64        number of overflow entries
  n_index:    u64        number of sparse index entries
  step:       u64        sparse index step (0 = no index)

offset 40:
  primary:    [u8; n]    one byte per slot, 255 = overflow sentinel

offset 40 + n:
  data:       [(slot: u64, value: u32); n_overflow]   12 bytes each, sorted by slot

offset 40 + n + n_overflow × 12:
  index:      [(slot: u64, pos: u64); n_index]         16 bytes each, sparse index

The index entries point into data: index[i] = (slot of data[i×step], i×step).

All integer fields are little-endian. Slot indices are stored as u64 in the file; they are usize in Rust code.
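
As a rough sketch of the layout above, the 40-byte header can be encoded and parsed with the standard library alone. The struct and function names here (`PcivHeader`, `encode_header`, `parse_header`, `section_offsets`) are assumptions made for illustration; only the field order, sizes, and little-endian encoding come from the format description.

```rust
const HEADER_SIZE: usize = 40;

#[derive(Debug, PartialEq)]
struct PcivHeader {
    n: u64,          // number of slots
    n_overflow: u64, // number of overflow entries
    n_index: u64,    // number of sparse index entries
    step: u64,       // sparse index step (0 = no index)
}

/// Read a little-endian u64 at `offset`.
fn read_u64_le(buf: &[u8], offset: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[offset..offset + 8]);
    u64::from_le_bytes(b)
}

fn encode_header(h: &PcivHeader) -> [u8; HEADER_SIZE] {
    let mut out = [0u8; HEADER_SIZE];
    out[0..4].copy_from_slice(b"PCIV");
    // Bytes 4..8 are padding and stay zero.
    out[8..16].copy_from_slice(&h.n.to_le_bytes());
    out[16..24].copy_from_slice(&h.n_overflow.to_le_bytes());
    out[24..32].copy_from_slice(&h.n_index.to_le_bytes());
    out[32..40].copy_from_slice(&h.step.to_le_bytes());
    out
}

/// Returns None if the buffer is too short or the magic does not match.
fn parse_header(buf: &[u8]) -> Option<PcivHeader> {
    if buf.len() < HEADER_SIZE || &buf[0..4] != b"PCIV" {
        return None;
    }
    Some(PcivHeader {
        n: read_u64_le(buf, 8),
        n_overflow: read_u64_le(buf, 16),
        n_index: read_u64_le(buf, 24),
        step: read_u64_le(buf, 32),
    })
}

/// Byte offsets of the (data, index) sections, derived from the header.
fn section_offsets(h: &PcivHeader) -> (u64, u64) {
    let data = HEADER_SIZE as u64 + h.n;
    let index = data + h.n_overflow * 12; // 12 bytes per overflow entry
    (data, index)
}
```

For example, with n = 1000 and n_overflow = 3, the data section starts at byte 1040 and the index section at byte 1076.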

Lifecycle

Builder (PersistentCompactIntVecBuilder)

Used during construction. The primary section is mmap'd immediately at construction time (both for new and build_from), so the file exists and is addressable from the start. The overflow is held in a HashMap<usize, u32> in RAM.

struct PersistentCompactIntVecBuilder {
    path:     PathBuf,
    mmap:     MmapMut,            // primary section live in the file from the start
    n:        usize,
    overflow: HashMap<usize, u32>, // values ≥ 255
}

new(n: usize, path: &Path) -> io::Result<Self>

Creates the file, pre-allocates HEADER_SIZE + n zero bytes, mmaps it. The primary is zero-initialised (all slots = 0). Returns immediately ready for set / get.

build_from(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>

Copies the source PCIV file to path (OS-level copy — no per-slot iteration), mmaps the copy, then loads the overflow section into a HashMap. Initialisation cost: O(file copy) + O(n_overflow), not O(n).

At close(), the primary section is not rewritten: it is already in the file via mmap. Only the overflow data, the sparse index, and the header are updated.

set(slot: usize, value: u32) / get(slot: usize) -> u32

Direct mmap byte access for the primary; HashMap for the overflow. Both O(1). Mutations can move a slot between tiers freely (downward mutation removes the HashMap entry; upward mutation adds it).
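
The tier movement described above can be illustrated with a small in-memory model. This is a sketch under stated assumptions: a `Vec<u8>` stands in for the `MmapMut` primary section, and the type name `TierSketch` is hypothetical.

```rust
use std::collections::HashMap;

struct TierSketch {
    primary: Vec<u8>,              // stands in for the mmap'd primary section
    overflow: HashMap<usize, u32>, // values >= 255
}

impl TierSketch {
    fn new(n: usize) -> Self {
        // All slots start at 0, as in the zero-initialised file.
        Self { primary: vec![0; n], overflow: HashMap::new() }
    }

    fn set(&mut self, slot: usize, value: u32) {
        if value < 255 {
            self.primary[slot] = value as u8;
            // Downward mutation: drop any stale overflow entry.
            self.overflow.remove(&slot);
        } else {
            // Upward mutation: mark the sentinel and record the true value.
            self.primary[slot] = 255;
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: usize) -> u32 {
        match self.primary[slot] {
            255 => self.overflow[&slot],
            b => u32::from(b),
        }
    }
}
```

Setting a slot to 300 and then to 7 leaves the overflow map empty again, which is why `close()` only has to serialise whatever remains in the map.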

Element-wise operations — min, max, add, diff

Each takes a &PersistentCompactIntVec of equal length and updates self in place via set:

builder.min(&other);   // self[i] = min(self[i], other[i])
builder.max(&other);   // self[i] = max(self[i], other[i])
builder.add(&other);   // self[i] = self[i].checked_add(other[i])  (panics on u32 overflow)
builder.diff(&other);  // self[i] = self[i].saturating_sub(other[i])

All iterate other with other.iter() (merge-scan, O(n_other)).
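
To make the arithmetic semantics concrete, here is a minimal sketch of three of the operations on plain `u32` slices. It assumes nothing about the real builder beyond the per-element formulas listed above; the function names are hypothetical.

```rust
/// dst[i] = dst[i] + other[i]; panics on u32 overflow, like checked_add + unwrap.
fn add_in_place(dst: &mut [u32], other: &[u32]) {
    assert_eq!(dst.len(), other.len(), "vectors must have equal length");
    for (d, &o) in dst.iter_mut().zip(other) {
        *d = (*d).checked_add(o).expect("u32 overflow in add");
    }
}

/// dst[i] = dst[i] - other[i], clamped at 0.
fn diff_in_place(dst: &mut [u32], other: &[u32]) {
    assert_eq!(dst.len(), other.len(), "vectors must have equal length");
    for (d, &o) in dst.iter_mut().zip(other) {
        *d = (*d).saturating_sub(o);
    }
}

/// dst[i] = min(dst[i], other[i]).
fn min_in_place(dst: &mut [u32], other: &[u32]) {
    assert_eq!(dst.len(), other.len(), "vectors must have equal length");
    for (d, &o) in dst.iter_mut().zip(other) {
        *d = (*d).min(o);
    }
}
```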

close(self) -> io::Result<()>

  1. Flush and drop the mmap (primary changes are now on disk).
  2. Sort the overflow HashMap into Vec<(usize, u32)>.
  3. Truncate the file to HEADER_SIZE + n (removes old data+index if build_from was used).
  4. Append sorted overflow data, then sparse index.
  5. Seek to offset 0, overwrite the header with final values.

Reader (PersistentCompactIntVec)

Used at query time. The whole file is mmap'd; only the sparse index is copied into a Vec at open time (≤ 32 KB, L1-resident).

struct PersistentCompactIntVec {
    mmap:           Mmap,
    n:              usize,
    n_overflow:     usize,
    step:           usize,
    index:          Vec<(usize, usize)>,  // (slot, pos) — L1-resident
    primary_offset: usize,               // = 40 (HEADER_SIZE)
    data_offset:    usize,               // = 40 + n
    path:           PathBuf,
}

open(path: &Path) -> io::Result<Self>

Mmaps the file, parses the 40-byte header, copies the sparse index entries into a Vec. The primary and data sections stay mmap'd.

get(slot: usize) -> u32 — random access

primary[slot] < 255  →  return it directly

step == 0:
    binary_search(data[0..n_overflow], slot)

step > 0:
    i = upper_bound(index[..].slot, slot) − 1     // in L1-resident Vec
    binary_search(data[index[i].pos .. index[i+1].pos], slot)   // end bound is n_overflow when i is the last index entry
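
The indexed lookup can be sketched as follows. This is an illustration over in-memory slices, not the reader's actual code; the function name `overflow_lookup` is hypothetical, and `partition_point` is used here as the upper-bound primitive.

```rust
/// `data`: overflow entries sorted by slot.
/// `index`: one (slot, pos) entry every `step` data entries; empty when step == 0.
fn overflow_lookup(data: &[(u64, u32)], index: &[(u64, usize)], slot: u64) -> Option<u32> {
    let (lo, hi) = if index.is_empty() {
        // No sparse index: search the whole data section.
        (0, data.len())
    } else {
        // upper_bound: number of index entries whose slot is <= the query.
        let ub = index.partition_point(|&(s, _)| s <= slot);
        if ub == 0 {
            return None; // query is below the smallest overflow slot
        }
        let i = ub - 1;
        let lo = index[i].1;
        // The last block has no successor entry, so it ends at data.len().
        let hi = index.get(i + 1).map_or(data.len(), |&(_, p)| p);
        (lo, hi)
    };
    // Binary search restricted to one block of at most `step` entries.
    data[lo..hi]
        .binary_search_by_key(&slot, |&(s, _)| s)
        .ok()
        .map(|k| data[lo + k].1)
}
```

In the real reader a miss cannot happen for a slot whose primary byte is 255; returning `Option` here just keeps the sketch total.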

iter() -> Iter<'_> — sequential scan, O(n)

Merge-scan: reads primary bytes in order; on sentinel 255, advances a sequential pointer into the sorted data section rather than doing a binary search. This gives O(n + n_overflow) with no random access into the data section.

Iter implements ExactSizeIterator. &PersistentCompactIntVec implements IntoIterator.
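
A merge-scan iterator of this kind might look like the sketch below. It assumes in-memory slices in place of the mmap'd sections; `MergeScan` and `merge_scan` are hypothetical names, not the crate's `Iter`.

```rust
struct MergeScan<'a> {
    primary: std::slice::Iter<'a, u8>,
    overflow: std::slice::Iter<'a, (u64, u32)>, // sorted by slot
}

fn merge_scan<'a>(primary: &'a [u8], overflow: &'a [(u64, u32)]) -> MergeScan<'a> {
    MergeScan { primary: primary.iter(), overflow: overflow.iter() }
}

impl<'a> Iterator for MergeScan<'a> {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        let &byte = self.primary.next()?;
        if byte < 255 {
            Some(u32::from(byte))
        } else {
            // Overflow entries are sorted by slot and slots are visited in order,
            // so the next unread entry is the one for this sentinel.
            let &(_, value) = self
                .overflow
                .next()
                .expect("sentinel byte without a matching overflow entry");
            Some(value)
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // Exactly one item per primary byte.
        self.primary.size_hint()
    }
}

impl ExactSizeIterator for MergeScan<'_> {}
```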

Aggregate

fn sum(&self) -> u64   // Σ self[i] as u64, via iter()

Distance methods

All take &other of equal length, iterate both with zip(self.iter(), other.iter()), and return f64.

Method                                          Formula
bray_dist                                       1 − 2·Σmin(aᵢ,bᵢ) / (Σaᵢ + Σbᵢ)
relfreq_bray_dist                               Bray-Curtis on relative frequencies: 1 − Σmin(pᵢ,qᵢ) where pᵢ = aᵢ/Σa and qᵢ = bᵢ/Σb
euclidean_dist                                  √Σ(aᵢ − bᵢ)²
relfreq_euclidean_dist                          Euclidean on relative frequencies: √Σ(pᵢ − qᵢ)²
hellinger_euclidean_dist                        √Σ(√pᵢ − √qᵢ)², i.e. Euclidean on sqrt(relfreq)
hellinger_dist                                  hellinger_euclidean_dist / √2, the standard Hellinger distance ∈ [0, 1]
threshold_jaccard_dist(&other, threshold: u32)  1 − |A∩B| / |A∪B| where presence iff count ≥ threshold
jaccard_dist                                    threshold_jaccard_dist(&other, 1)

Edge cases (both vectors all-zero, or union empty for Jaccard): distance = 0.0.
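
Two of the formulas, including the edge-case rule, can be sketched on plain slices. This is an illustration only: the real methods iterate PersistentCompactIntVec readers, while these free functions take `&[u32]` for simplicity.

```rust
/// Bray-Curtis distance: 1 - 2*sum(min(a_i, b_i)) / (sum(a) + sum(b)).
fn bray_dist(a: &[u32], b: &[u32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let (mut sum_min, mut total) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        sum_min += u64::from(x.min(y));
        total += u64::from(x) + u64::from(y);
    }
    if total == 0 {
        return 0.0; // both vectors all-zero
    }
    1.0 - 2.0 * sum_min as f64 / total as f64
}

/// Jaccard distance on presence/absence, where presence means count >= threshold.
fn threshold_jaccard_dist(a: &[u32], b: &[u32], threshold: u32) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let (mut n_inter, mut n_union) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        let (in_a, in_b) = (x >= threshold, y >= threshold);
        n_inter += u64::from(in_a && in_b);
        n_union += u64::from(in_a || in_b);
    }
    if n_union == 0 {
        return 0.0; // empty union
    }
    1.0 - n_inter as f64 / n_union as f64
}
```

For a = [1, 2, 0, 4] and b = [1, 0, 3, 4], Σmin = 5 and Σa + Σb = 15, so the Bray-Curtis distance is 1 − 10/15 = 1/3; with threshold 1 the intersection has 2 slots and the union 4, giving a Jaccard distance of 0.5.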

Step computation

Chosen at close() once n_overflow is known:

L1_INDEX_ENTRIES = 2048

step = 0                                if n_overflow ≤ 2048
step = ⌈n_overflow / 2048⌉             otherwise
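
The step rule and the resulting index can be sketched as below. The helper names `compute_step` and `build_index` are assumptions; the constant and the formula are taken from the rule above, and the index entry shape follows index[i] = (slot of data[i×step], i×step).

```rust
const L1_INDEX_ENTRIES: usize = 2048;

/// 0 means "no index"; otherwise ceil(n_overflow / 2048).
fn compute_step(n_overflow: usize) -> usize {
    if n_overflow <= L1_INDEX_ENTRIES {
        0
    } else {
        (n_overflow + L1_INDEX_ENTRIES - 1) / L1_INDEX_ENTRIES
    }
}

/// Sample every `step`-th overflow entry as (slot, position in data).
fn build_index(data: &[(u64, u32)], step: usize) -> Vec<(u64, usize)> {
    if step == 0 {
        return Vec::new();
    }
    data.iter()
        .enumerate()
        .step_by(step)
        .map(|(pos, &(slot, _))| (slot, pos))
        .collect()
}
```

With this rule the index never exceeds 2048 entries, so at 16 bytes per entry it stays within 32 KB.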

Complexity

Operation                    Time                           Notes
set / get (builder)          O(1)                           mmap byte + HashMap
get (reader, no overflow)    O(1)                           single mmap byte
get (reader, with index)     O(log step)                    ≤ 2 memory regions
get (reader, no index)       O(log n_overflow)              data section ≤ 24 KB (at most 2048 entries × 12 bytes)
iter() full scan             O(n + n_overflow)              merge-scan, no binary search
sum, distances               O(n)                           via iter() / zip(iter(), iter())
min / max / add / diff       O(n)                           via other.iter() + builder set
close                        O(n_overflow log n_overflow)   sort + sequential write
open                         O(n_index)                     index copy into Vec
build_from                   O(file_size) + O(n_overflow)   OS copy + HashMap load

PersistentCompactIntMatrix — column-major directory

Design

A directory containing meta.json and N column files col_000000.pciv, col_000001.pciv, …, each a PersistentCompactIntVec. This is the type used by LayerData — a single-column matrix is functionally equivalent to a vector but shares the same interface as multi-column matrices.

counts/
  meta.json          {"n": <n_slots>, "n_cols": <N>}
  col_000000.pciv
  col_000001.pciv
  ...
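
As a small illustration of this layout, the column file names and the metadata string could be produced as follows. Both helpers are hypothetical; the six-digit zero padding matches the file names shown above, while the exact whitespace inside meta.json is an assumption.

```rust
use std::path::{Path, PathBuf};

/// Path of column `c` inside the matrix directory, e.g. counts/col_000001.pciv.
fn col_path(dir: &Path, c: usize) -> PathBuf {
    dir.join(format!("col_{:06}.pciv", c))
}

/// Contents of meta.json for `n` slots and `n_cols` columns.
fn meta_json(n: usize, n_cols: usize) -> String {
    format!("{{\"n\": {}, \"n_cols\": {}}}", n, n_cols)
}
```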

Builder (PersistentCompactIntMatrixBuilder)

struct PersistentCompactIntMatrixBuilder {
    dir:    PathBuf,
    n:      usize,
    n_cols: usize,
}

new(n: usize, dir: &Path) -> io::Result<Self>

Creates the directory (including parents). Does not write meta.json yet.

add_col(&mut self) -> io::Result<PersistentCompactIntVecBuilder>

Creates col_NNNNNN.pciv for the next column and returns its builder. The caller fills the column and calls builder.close() before calling add_col again.

close(self) -> io::Result<()>

Writes meta.json with the final n and n_cols. Must be called after all column builders are closed.

Reader (PersistentCompactIntMatrix)

struct PersistentCompactIntMatrix {
    cols: Vec<PersistentCompactIntVec>,
    n:    usize,
}

open(dir: &Path) -> io::Result<Self>

Reads meta.json, opens all col_NNNNNN.pciv files.

row(slot: usize) -> Box<[u32]>

Returns the full row: [col_0[slot], col_1[slot], …, col_{N-1}[slot]]. One mmap access per column. O(N).

col(c: usize) -> &PersistentCompactIntVec

Direct access to a single column for column-oriented operations (distance computations, iteration).

LayerData implementation

impl LayerData for PersistentCompactIntMatrix {
    type Item = Box<[u32]>;
    fn open(layer_dir: &Path) -> OLMResult<Self> { /* opens layer_dir/counts/ */ }
    fn read(&self, slot: usize) -> Box<[u32]>    { self.row(slot) }
}

Aggregation traits — obicompactvec::traits

PersistentCompactIntMatrix implements two aggregation traits used by LayeredStore<S> for cross-layer and cross-partition distance computations.

ColumnWeights

impl ColumnWeights for PersistentCompactIntMatrix {
    fn col_weights(&self) -> Array1<u64>   // [c] = self.col(c).sum()
}

col_weights()[c] = sum of all values in column c across all slots.

CountPartials

impl CountPartials for PersistentCompactIntMatrix {
    // Self-contained partials (additive across layers, no external parameter)
    fn partial_bray(&self)                                      -> Array2<u64>
    fn partial_euclidean(&self)                                 -> Array2<f64>
    fn partial_threshold_jaccard(&self, threshold: u32)         -> (Array2<u64>, Array2<u64>)

    // Normalised partials (require global col_weights across all layers/partitions)
    fn partial_relfreq_bray(&self, global: &Array1<u64>)        -> Array2<f64>
    fn partial_relfreq_euclidean(&self, global: &Array1<u64>)   -> Array2<f64>
    fn partial_hellinger(&self, global: &Array1<u64>)           -> Array2<f64>

    // Provided finalisations (default implementations on the trait)
    fn bray_dist_matrix(&self)                                  -> Array2<f64>
    fn euclidean_dist_matrix(&self)                             -> Array2<f64>
    fn threshold_jaccard_dist_matrix(&self, threshold: u32)     -> Array2<f64>
    fn relfreq_bray_dist_matrix(&self)                          -> Array2<f64>
    fn relfreq_euclidean_dist_matrix(&self)                     -> Array2<f64>
    fn hellinger_dist_matrix(&self)                             -> Array2<f64>
}

Self-contained partials are additively decomposable: summing partial_bray() across all (partition, layer) pairs and finalising gives the same result as computing on the combined data.

Normalised partials require the global column weights (sum across all layers and all partitions). The global parameter must reflect the complete index, not a per-layer sum. The provided relfreq_bray_dist_matrix() etc. call col_weights() first (pass 1) then the normalised partial (pass 2); when called on a LayeredStore<LayeredStore<…>> these two-pass calls cascade automatically through the blanket impls.

partial_bray returns Array2<u64> (sum_min only, not a tuple). The denominator is always reconstructible as col_weights()[i] + col_weights()[j].

partial_threshold_jaccard returns (inter, union) as a pair because union[i,j] is not reconstructible from per-column statistics — it depends on both columns simultaneously.
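
The additivity of the Bray-Curtis partial can be checked with a small sketch. It uses nested `Vec`s in place of `Array1` / `Array2` so that it runs with the standard library alone, and all function names are hypothetical stand-ins for the trait methods; each block is indexed as `cols[c][slot]`.

```rust
/// Per-column sums of one block.
fn col_weights(cols: &[Vec<u32>]) -> Vec<u64> {
    cols.iter()
        .map(|col| col.iter().map(|&v| u64::from(v)).sum::<u64>())
        .collect()
}

/// sum_min[i][j] = sum over slots of min(cols[i][slot], cols[j][slot]).
fn partial_bray(cols: &[Vec<u32>]) -> Vec<Vec<u64>> {
    let k = cols.len();
    let mut out = vec![vec![0u64; k]; k];
    for i in 0..k {
        for j in 0..k {
            out[i][j] = cols[i]
                .iter()
                .zip(&cols[j])
                .map(|(&a, &b)| u64::from(a.min(b)))
                .sum::<u64>();
        }
    }
    out
}

/// Element-wise sum of two partial matrices (the cross-partition reduction).
fn add_partials(a: &[Vec<u64>], b: &[Vec<u64>]) -> Vec<Vec<u64>> {
    a.iter()
        .zip(b)
        .map(|(ra, rb)| ra.iter().zip(rb).map(|(x, y)| x + y).collect())
        .collect()
}

/// d[i][j] = 1 - 2*sum_min[i][j] / (w[i] + w[j]), or 0.0 when both columns are empty.
fn finalise_bray(sum_min: &[Vec<u64>], w: &[u64]) -> Vec<Vec<f64>> {
    let k = w.len();
    let mut d = vec![vec![0.0; k]; k];
    for i in 0..k {
        for j in 0..k {
            let denom = w[i] + w[j];
            if denom > 0 {
                d[i][j] = 1.0 - 2.0 * sum_min[i][j] as f64 / denom as f64;
            }
        }
    }
    d
}
```

Summing `partial_bray` and `col_weights` over two partitions and then calling `finalise_bray` gives the same matrix as running the three functions on the concatenated slots, which is the property the self-contained partials rely on.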