obikmer/docmd/implementation/obilayeredmap.md
Eric Coissac da56c3e290 docs: update architecture and storage specs for approximate index
Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
2026-05-23 13:54:31 +02:00


obilayeredmap — layered kmer index crate

Purpose

obilayeredmap implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a ptr_hash MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.


Three usage modes

The MPHF + evidence infrastructure is the same for all modes. The payload varies.

| Mode | Description | Payload type | Storage |
| --- | --- | --- | --- |
| 1. Set | membership test only | () | |
| 2. Count | occurrences per kmer per sample | PersistentCompactIntMatrix | counts/ directory |
| 3. Presence/absence | which genomes contain each kmer | PersistentBitMatrix | presence/ directory |

Both PersistentCompactIntMatrix and PersistentBitMatrix come from the obicompactvec crate.


Evidence kinds

Each layer carries one of two evidence bundles, recorded in layer_meta.json at build time:

pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

EvidenceKind is stored in LayerMeta (one per layer directory). open() reads it to decide which evidence files to load.

  • Exact: writes evidence.bin + unitigs.bin.idx. Zero false positives. Requires random-access .idx at query time.
  • Approx: writes fingerprint.bin only. False-positive rate per kmer query = 1/2^b. z is the Findere consecutive-kmer parameter: z consecutive kmers must all match, reducing the effective FP rate per read to approximately W / 2^(b·z), where W = L - k - z + 2 is the number of windows of z consecutive kmers in a read of length L. No .idx is written or required.
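The two rates quoted for approximate evidence can be worked through numerically. The following is a minimal sketch; the function names are illustrative and not part of the crate, and the per-read figure is the approximation W / 2^(b·z) capped at 1, not an exact probability.

```rust
/// Per-kmer false-positive rate of a b-bit fingerprint: 1 / 2^b.
pub fn kmer_fp_rate(b: u8) -> f64 {
    0.5f64.powi(i32::from(b))
}

/// Number of windows of z consecutive kmers in a read of length `read_len`:
/// W = L - k - z + 2, or 0 when the read is too short to hold one window.
pub fn n_windows(read_len: usize, k: usize, z: usize) -> usize {
    (read_len + 2).saturating_sub(k + z)
}

/// Approximate per-read false-positive rate: W / 2^(b*z), capped at 1.
pub fn read_fp_rate(read_len: usize, k: usize, b: u8, z: u8) -> f64 {
    let w = n_windows(read_len, k, usize::from(z)) as f64;
    let per_window = 0.5f64.powi(i32::from(b) * i32::from(z));
    (w * per_window).min(1.0)
}
```

For a 150-base read with k = 31, b = 8 and z = 3, this gives W = 118 windows and a per-read rate of 118 / 2^24, against 1/256 for a single kmer query.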

MphfLayer — autonomous kmer → slot mapping

MphfLayer encapsulates the MPHF and evidence store for one layer. It is independent of any payload.

pub struct MphfLayer {
    mphf: Mphf,
    ev:   LayerEvidence,   // loaded at open() time
    n:    usize,
}

LayerEvidence is an internal enum, not public:

enum LayerEvidence {
    Exact  { evidence: Evidence, unitigs: UnitigFileReader },
    Approx { fingerprint: FingerprintVec },
}

Query API

Three public query methods, all returning Option<usize> (slot index):

pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_exact(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_approx(&self, kmer: CanonicalKmer) -> Option<usize>

  • find dispatches transparently to find_exact or find_approx based on the evidence variant loaded at open().
  • find_exact panics if the layer holds approximate evidence; zero false positives.
  • find_approx panics if the layer holds exact evidence; FP rate 1/2^b per kmer.

For exact evidence, open() requires unitigs.bin.idx (random access into unitigs); approximate layers need no .idx. open_sequential() on UnitigFileReader does not require the .idx and is used during build passes.

Build surface

// Full MPHF + exact evidence build (two-pass, parallel)
pub(crate) fn build(dir, block_bits, fill_slot) -> OLMResult<usize>

// Evidence-only builds (MPHF already present in dir)
pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
pub fn build_approx_evidence(dir, b, z)      -> OLMResult<usize>
pub fn build_evidence(dir, kind, block_bits) -> OLMResult<usize>  // dispatch

MphfLayer::build makes two passes over unitigs.bin, one after the other:

  1. Pass 1 (parallel via rayon): iterate all canonical kmers, construct and store mphf.bin. new_from_par_iter avoids materialising a full key Vec.
  2. Pass 2 (sequential): iterate again, fill evidence.bin, call fill_slot(slot, kmer) once per kmer for payload population. A compact n/8-byte seen-bitset verifies MPHF injectivity inline.
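The inline injectivity check of pass 2 can be illustrated with a small sketch. It is not the crate's code: `verify_injective`, `SlotError` and the `slot_of` closure are hypothetical stand-ins for the MPHF lookup, and keys are plain u64 values.

```rust
/// Ways a slot assignment can fail to be injective into 0..n.
#[derive(Debug, PartialEq)]
pub enum SlotError {
    OutOfRange { slot: usize },
    Collision { slot: usize },
}

/// Check that `slot_of` maps every key to a distinct slot in 0..n,
/// using one bit per slot (ceil(n/8) bytes) rather than a set of keys.
pub fn verify_injective<I>(
    keys: I,
    n: usize,
    slot_of: impl Fn(u64) -> usize,
) -> Result<(), SlotError>
where
    I: IntoIterator<Item = u64>,
{
    let mut seen = vec![0u8; (n + 7) / 8];
    for key in keys {
        let slot = slot_of(key);
        if slot >= n {
            return Err(SlotError::OutOfRange { slot });
        }
        let (byte, mask) = (slot / 8, 1u8 << (slot % 8));
        if seen[byte] & mask != 0 {
            // Two different keys landed on the same slot.
            return Err(SlotError::Collision { slot });
        }
        seen[byte] |= mask;
    }
    Ok(())
}
```

Because the check only needs the slot number, it can run in the same loop that fills evidence and payload, with no second data structure holding the keys.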

build always produces exact evidence. For approximate evidence, use build_approx_evidence after MPHF construction.

For empty layers (n = 0), all build variants return Ok(0) immediately after creating empty output files.


Layer<D: LayerData> — MPHF + payload

Layer<D> pairs an MphfLayer with one payload store.

pub trait LayerData: Sized {
    type Item;
    fn open(layer_dir: &Path) -> OLMResult<Self>;
    fn read(&self, slot: usize) -> Self::Item;
}

pub struct Layer<D: LayerData = ()> {
    mphf: MphfLayer,
    data: D,
}

pub struct Hit<T = ()> {
    pub slot: usize,
    pub data: T,
}

LayerData covers the read path only (open + read). Build signatures differ between modes and are not part of the trait.
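To make the read path concrete, here is a sketch of how the mode 1 payload `()` could satisfy a trait of this shape, and how a slot lookup and a payload read combine into a Hit. The trait and Hit are re-declared locally so the example stands alone; `OLMResult` is assumed here to be an alias over `std::io::Error`, and `hit_for` is a hypothetical helper, not a crate function.

```rust
use std::io;
use std::path::Path;

// Assumption: the real OLMResult uses the crate's own error type.
type OLMResult<T> = Result<T, io::Error>;

pub trait LayerData: Sized {
    type Item;
    fn open(layer_dir: &Path) -> OLMResult<Self>;
    fn read(&self, slot: usize) -> Self::Item;
}

/// Mode 1: no payload, so nothing to open and nothing to read.
impl LayerData for () {
    type Item = ();
    fn open(_layer_dir: &Path) -> OLMResult<Self> {
        Ok(())
    }
    fn read(&self, _slot: usize) -> Self::Item {}
}

pub struct Hit<T = ()> {
    pub slot: usize,
    pub data: T,
}

/// Turn the result of a slot lookup into a Hit by reading the payload at that slot.
pub fn hit_for<D: LayerData>(data: &D, slot: Option<usize>) -> Option<Hit<D::Item>> {
    slot.map(|slot| Hit { slot, data: data.read(slot) })
}
```

Keeping build out of the trait means a payload type only has to answer "open this directory" and "read this slot", which is all a query needs.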

| Type | Item | Description |
| --- | --- | --- |
| () | () | mode 1: membership only |
| PersistentCompactIntMatrix | Box<[u32]> | mode 2: count matrix (one u32 per column per slot) |
| PersistentBitMatrix | Box<[bool]> | mode 3: presence matrix (one bit per genome per slot) |

Build signatures

// mode 1
impl Layer<()> {
    pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult<usize>
}

// mode 2
impl Layer<PersistentCompactIntMatrix> {
    pub fn build(out_dir: &Path, block_bits: u8,
                 count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn build_from_map(out_dir: &Path, block_bits: u8,
                          counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}

// mode 3
impl Layer<PersistentBitMatrix> {
    pub fn build_presence(out_dir: &Path, block_bits: u8,
                          n_genomes: usize,
                          present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
}

All build impls delegate MPHF + evidence construction to MphfLayer::build via a mode-specific fill_slot callback. Modes 2 and 3 pre-read n_kmers from unitigs.bin via UnitigFileReader::open_sequential to size the matrix builder before calling MphfLayer::build.

Evidence build helpers on Layer

impl<D: LayerData> Layer<D> {
    pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
    pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8)  -> OLMResult<usize>
    pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize>
}

These delegate directly to the corresponding MphfLayer methods and are provided so call sites can remain typed at the Layer<D> level.


FingerprintVec and FingerprintVecWriter

Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot.

fingerprint.bin format:
  magic:   b"FPVF"  (4 bytes)
  b:       u8       (bits per fingerprint, 1..=64)
  padding: [0u8; 3]
  n:       u64 LE   (number of slots)
  data:    packed bits, ceil(n*b/8) bytes, Lsb0 order

impl FingerprintVec {
    pub fn open(path: &Path) -> OLMResult<Self>
    pub fn get(&self, slot: usize) -> u64
    pub fn matches(&self, slot: usize, fingerprint: u64) -> bool
    pub fn n(&self) -> usize
    pub fn b(&self) -> u8
}

matches(slot, hash) extracts the b-bit fingerprint stored at slot and compares it to the low b bits of hash. It is the core operation of find_approx.
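The packed layout and the comparison performed by matches can be sketched over a plain byte slice. This is an illustration under one assumption about the bit order: fingerprint i occupies stream bits [i*b, (i+1)*b), least significant bit first, with stream bit j stored in bit j % 8 of byte j / 8. The helper names are hypothetical, and the bit-by-bit loops favour clarity over speed.

```rust
/// Bytes needed for n fingerprints of b bits each: ceil(n*b/8).
pub fn packed_len(n: usize, b: u8) -> usize {
    (n * usize::from(b) + 7) / 8
}

/// Read the b-bit value stored at `slot`.
pub fn get_packed(data: &[u8], b: u8, slot: usize) -> u64 {
    assert!((1..=64).contains(&b));
    let start = slot * usize::from(b);
    let mut value = 0u64;
    for i in 0..usize::from(b) {
        let bit = start + i;
        let set = (data[bit / 8] >> (bit % 8)) & 1;
        value |= u64::from(set) << i;
    }
    value
}

/// Store the low b bits of `value` at `slot`.
pub fn set_packed(data: &mut [u8], b: u8, slot: usize, value: u64) {
    assert!((1..=64).contains(&b));
    let start = slot * usize::from(b);
    for i in 0..usize::from(b) {
        let bit = start + i;
        let mask = 1u8 << (bit % 8);
        if (value >> i) & 1 == 1 {
            data[bit / 8] |= mask;
        } else {
            data[bit / 8] &= !mask;
        }
    }
}

/// Low b bits of a 64-bit hash (all 64 bits when b == 64).
pub fn low_bits(hash: u64, b: u8) -> u64 {
    if b >= 64 { hash } else { hash & ((1u64 << b) - 1) }
}

/// Compare the stored fingerprint with the low b bits of `hash`.
pub fn fingerprint_matches(data: &[u8], b: u8, slot: usize, hash: u64) -> bool {
    get_packed(data, b, slot) == low_bits(hash, b)
}
```

With b = 5, three fingerprints fit in two bytes, and a hash ending in 0b00011 matches a slot that stores 3 whatever its higher bits are, which is where the 1/2^b false-positive rate comes from.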


LayeredMap<D> — collection of layers

LayeredMap<D> wraps Vec<Layer<D>> for a single partition directory.

pub struct LayeredMap<D: LayerData = ()> {
    root:   PathBuf,
    meta:   PartitionMeta,
    layers: Vec<Layer<D>>,
}

PartitionMeta (meta.json at the partition root) stores n_layers.

Common methods

pub fn open(root: &Path)   -> OLMResult<Self>
pub fn create(root: &Path) -> OLMResult<Self>
pub fn n_layers(&self)     -> usize
pub fn layer(&self, i: usize) -> &Layer<D>
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>

query probes layers in order and returns (layer_index, Hit) on the first match. Expected probe depth: 1 for kmers in layer 0.
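The first-match probing order can be shown with a toy model in which each layer is a HashMap from kmer to slot. `ToyLayer` and this free-standing `query` are stand-ins for illustration only; a real layer resolves the slot through its MPHF and evidence.

```rust
use std::collections::HashMap;

/// Stand-in for one layer: kmer -> slot.
pub type ToyLayer = HashMap<u64, usize>;

/// Probe layers in order and return (layer_index, slot) for the first
/// layer that holds the kmer, or None if no layer does.
pub fn query(layers: &[ToyLayer], kmer: u64) -> Option<(usize, usize)> {
    layers
        .iter()
        .enumerate()
        .find_map(|(i, layer)| layer.get(&kmer).map(|&slot| (i, slot)))
}
```

Since layers cover disjoint kmer sets, at most one layer can answer, and a kmer stored in layer i costs i + 1 probes.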

push_layer

push_layer builds the next layer from a unitigs.bin already written via next_layer_writer, using DEFAULT_BLOCK_BITS:

// mode 1
impl LayeredMap<()> {
    pub fn push_layer(&mut self) -> OLMResult<usize>
}

// mode 2
impl LayeredMap<PersistentCompactIntMatrix> {
    pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn push_layer_from_map(&mut self, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}

Mode 3 (PersistentBitMatrix) has no push_layer on LayeredMap; callers build directly via Layer<PersistentBitMatrix>::build_presence.


LayeredStore<S> and aggregation traits

LayeredStore<S> is a generic aggregation wrapper over Vec<S>. It propagates three traits from obicompactvec::traits up the hierarchy via blanket impls:

pub struct LayeredStore<S>(pub Vec<S>);

impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {  }  // Σ col_weights across inner stores
impl<S: CountPartials> CountPartials  for LayeredStore<S> {  }  // element-wise Σ partials
impl<S: BitPartials>   BitPartials    for LayeredStore<S> {  }  // element-wise Σ partials

Because blanket impls compose, LayeredStore<LayeredStore<S>> automatically inherits all three traits when S does — providing the partitioned level without a separate type.
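The composition argument can be checked on a reduced example. The trait below is a hypothetical, simplified stand-in for ColumnWeights (the real signature in obicompactvec may differ), and `Leaf` is a toy store holding its column weights directly.

```rust
/// Simplified stand-in for obicompactvec's ColumnWeights trait.
pub trait ColumnWeights {
    fn col_weights(&self) -> Vec<u64>;
}

pub struct LayeredStore<S>(pub Vec<S>);

/// Sum column weights element-wise across the inner stores.
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {
    fn col_weights(&self) -> Vec<u64> {
        let mut total: Vec<u64> = Vec::new();
        for store in &self.0 {
            let weights = store.col_weights();
            if total.len() < weights.len() {
                total.resize(weights.len(), 0);
            }
            for (t, w) in total.iter_mut().zip(weights) {
                *t += w;
            }
        }
        total
    }
}

/// Toy leaf store: one weight per column.
pub struct Leaf(pub Vec<u64>);

impl ColumnWeights for Leaf {
    fn col_weights(&self) -> Vec<u64> {
        self.0.clone()
    }
}
```

A `LayeredStore<Leaf>` models the layers of one partition, and a `LayeredStore<LayeredStore<Leaf>>` models the whole partitioned index; the same impl serves both levels.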

Leaf implementors (in obicompactvec):

| Type | Traits |
| --- | --- |
| PersistentCompactIntMatrix | ColumnWeights (via sum()) + CountPartials |
| PersistentBitMatrix | ColumnWeights (via count_ones()) + BitPartials |

See Kmer index architecture for the full trait API and the two-pass normalised-metric pattern.


On-disk structure

partition_root/                    ← LayeredMap (one partition)
  meta.json                        — {"n_layers": N}
  layer_0/                         ← Layer
    layer_meta.json                — {"type": "exact"} or {"type": "approx", "b": B, "z": Z}
    mphf.bin                       — ptr_hash MPHF (epserde format)
    unitigs.bin                    — packed 2-bit nucleotide sequences
    unitigs.bin.idx                — UIDX index (exact evidence only)
    evidence.bin                   — [u32; n], LE  (exact evidence only)
    fingerprint.bin                — packed b-bit array  (approx evidence only)
    counts/                        [mode 2] PersistentCompactIntMatrix
      meta.json
      col_000000.pciv
    presence/                      [mode 3] PersistentBitMatrix
      meta.json
      col_000000.pbiv  …
  layer_1/
    …

unitigs.bin.idx is required by open() for exact-evidence layers (random access). open_sequential() on UnitigFileReader does not need it and is used during build passes and approx-evidence construction.


Evidence encoding (exact)

evidence.bin is a flat [u32; n] array with no header. Each u32 encodes one slot:

bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0]  = rank     (7 bits)  — kmer index within the chunk (0-based)

chunk_id = raw >> 7, rank = raw & 0x7F. Reconstructing the kmer: read k nucleotides at position rank within unitig chunk_id (requires unitigs.bin.idx for random access).

For k=31, m=11, the observed maximum is ~46 kmers per chunk, well within the 128 ranks (0..=127) that a 7-bit field can address.
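The bit layout above translates directly into a pair of pack and unpack helpers. This is a sketch: the names `encode` and `decode` and the constants are illustrative, not the crate's API.

```rust
pub const RANK_BITS: u32 = 7;
pub const RANK_MASK: u32 = 0x7F;
pub const MAX_CHUNK_ID: u32 = (1 << 25) - 1;

/// Pack (chunk_id, rank) into one u32; None if either field overflows.
pub fn encode(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Unpack a raw evidence word into (chunk_id, rank).
pub fn decode(raw: u32) -> (u32, u32) {
    (raw >> RANK_BITS, raw & RANK_MASK)
}
```

For example, chunk 3 at rank 5 encodes to 3 * 128 + 5 = 389, and the largest representable pair fills all 32 bits.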


ptr_hash configuration

type Mphf = PtrHash<
    u64,                              // key type: canonical kmer raw encoding
    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
    CachelineEfVec<Vec<CachelineEf>>, // remap: Elias-Fano
    Xx64,                             // hasher: XXH3-64 with seed
    Vec<u8>,                          // pilots
>;

Xx64 is chosen over FxHash because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
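The zero counts quoted above follow from 2 bits per nucleotide in a 64-bit word. A small sketch of the arithmetic, assuming a right-aligned 2-bit encoding that is shifted to the top of the word; both function names are hypothetical:

```rust
/// Number of always-zero low bits in a left-aligned, 2-bit-per-base
/// u64 encoding of a kmer of length k (1 <= k <= 32).
pub fn structural_zero_bits(k: u32) -> u32 {
    assert!((1..=32).contains(&k));
    64 - 2 * k
}

/// Shift a right-aligned 2-bit encoding to the top of the word.
pub fn left_align(right_aligned: u64, k: u32) -> u64 {
    right_aligned << structural_zero_bits(k)
}
```

Whatever the sequence, the low 64 - 2k bits never vary, so a hasher has to mix the high bits thoroughly for short kmers.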

CubicEps with PtrHashParams::<CubicEps>::default() (λ=3.5) builds about 2× slower than Linear with λ=3.0 but takes ~20% less space.


Column append and merge support

These methods extend existing layers with new genome columns without touching the MPHF.

Layer-level genome column append

impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}

impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}

Both delegate to the corresponding PersistentBitMatrix::append_column / PersistentCompactIntMatrix::append_column. They write a new column file (col_NNNNNN.pbiv / col_NNNNNN.pciv) and update meta.json to increment n_cols. value_of is called once per slot (0..n).

Presence matrix initialisation

impl Layer<()> {
    pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
}

Called on the first merge of a Presence-mode index. Creates presence/ with meta.json {"n": n_kmers, "n_cols": 1} and col_000000.pbiv set entirely to true. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.

Why the MPHF is never rebuilt

The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new .pciv/.pbiv file and a single meta.json update.
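The interplay between presence-matrix initialisation and column append can be modelled in memory. `ToyPresence` is a hypothetical stand-in: it keeps columns as Vec<bool> instead of .pbiv files, but follows the same rules (an all-true column 0, then one value_of call per slot for each appended column).

```rust
/// In-memory stand-in for a presence matrix: one column per genome, n slots each.
pub struct ToyPresence {
    n: usize,
    cols: Vec<Vec<bool>>,
}

impl ToyPresence {
    /// Like init_presence_matrix: a single all-true column for genome 0.
    pub fn init(n_kmers: usize) -> Self {
        Self { n: n_kmers, cols: vec![vec![true; n_kmers]] }
    }

    /// Like append_genome_column: value_of is called once per slot in 0..n.
    pub fn append_column(&mut self, value_of: impl Fn(usize) -> bool) {
        let col: Vec<bool> = (0..self.n).map(value_of).collect();
        self.cols.push(col);
    }

    pub fn n_cols(&self) -> usize {
        self.cols.len()
    }

    /// One bool per genome for the given slot.
    pub fn read(&self, slot: usize) -> Box<[bool]> {
        self.cols.iter().map(|col| col[slot]).collect()
    }
}
```

Slot numbers never change between the two calls, which is why the MPHF and evidence files stay untouched.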


Add-layer algorithm

When adding dataset B to an existing index:

  1. For each partition, probe existing layers for kmers of B routed to that partition.
  2. Collect kmers absent from all layers → B \ index.
  3. Write B \ index to a new unitigs.bin via next_layer_writer().
  4. Call Layer<D>::build (or build_presence) on the new layer directory.
  5. Call push_layer (or append_layer) to register the new layer in meta.json.

Each partition's new layer is built independently; the operation is fully parallel across partitions.
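Steps 1 to 3 amount to a set difference against the existing layers. The sketch below models one partition with HashSet layers; `new_layer_kmers` and `add_layer` are hypothetical names, and the real pipeline writes unitigs and builds an MPHF where this code pushes a set.

```rust
use std::collections::HashSet;

/// Kmers of `dataset` absent from every existing layer (B \ index),
/// deduplicated, in first-seen order.
pub fn new_layer_kmers(layers: &[HashSet<u64>], dataset: &[u64]) -> Vec<u64> {
    let mut emitted = HashSet::new();
    dataset
        .iter()
        .copied()
        .filter(|k| !layers.iter().any(|layer| layer.contains(k)))
        .filter(|k| emitted.insert(*k))
        .collect()
}

/// Register the new kmers as an additional layer; returns its size.
/// Existing layers are only read, never modified.
pub fn add_layer(layers: &mut Vec<HashSet<u64>>, dataset: &[u64]) -> usize {
    let fresh = new_layer_kmers(layers.as_slice(), dataset);
    let n = fresh.len();
    layers.push(fresh.into_iter().collect());
    n
}
```

The new layer is disjoint from all earlier ones by construction, which preserves the invariant that each kmer lives in exactly one layer.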


Dependencies

| crate | role |
| --- | --- |
| ptr_hash 1.1 | MPHF per layer |
| cacheline-ef 1.1 | compact remap inside ptr_hash |
| epserde 0.8 | zero-copy MPHF serialisation |
| memmap2 0.9 | mmap of evidence and fingerprint files |
| bitvec | packed b-bit fingerprint storage |
| obiskio | unitig file writer/reader + .idx build |
| obicompactvec | payload types + aggregation traits |
| rayon 1 | parallel MPHF construction pass |
| serde / serde_json | LayerMeta + PartitionMeta serialisation |