Files
Eric Coissac 98c14aade9 feat: centralize index configuration and add hybrid mode
Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.
2026-05-26 15:08:29 +02:00

14 KiB
Raw Permalink Blame History

obilayeredmap — layered kmer index crate

Purpose

obilayeredmap implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a ptr_hash MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.


Three usage modes

The MPHF + evidence infrastructure is the same for all modes. The payload varies.

Mode Description Payload type Storage
1. Set membership test only ()
2. Count occurrences per kmer per sample PersistentCompactIntMatrix counts/ directory
3. Presence/absence which genomes contain each kmer PersistentBitMatrix presence/ directory

Both PersistentCompactIntMatrix and PersistentBitMatrix come from the obicompactvec crate.


Index mode (homogeneity invariant)

A partitioned index is homogeneous: every layer within a partition shares the same mode. The mode is determined once at LayeredMap::open() from PartitionMeta.mode and passed to each Layer::open() — no per-layer file is read.

#[derive(Serialize, Deserialize, Default)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum IndexMode {
    #[default]
    Exact,
    Approx { b: u8, z: u8 },
    Hybrid { b: u8, z: u8 },
}

IndexMode is stored once in PartitionMeta (meta.json at partition root). There is no layer_meta.json.

  • Exact: writes evidence.bin + unitigs.bin.idx. Zero false positives.
  • Approx: writes fingerprint.bin only. FP rate per kmer = 1/2^b; with Findere z-parameter, z consecutive kmers must all match → effective window FP ≈ 1/2^(b·z). No .idx written or required.
  • Hybrid: writes both fingerprint.bin and evidence.bin + .idx. find() uses the fingerprint (fast, O(1)); find_strict() uses exact evidence.

MphfLayer — autonomous kmer → slot mapping

MphfLayer encapsulates the MPHF and evidence store for one layer. It is independent of any payload.

pub struct MphfLayer {
    mphf: Mphf,
    ev:   LayerEvidence,   // loaded at open() time
    n:    usize,
}

LayerEvidence is an internal enum, not public:

enum LayerEvidence {
    Exact  { evidence: Evidence, unitigs: UnitigFileReader },
    Approx { fingerprint: FingerprintVec, unitigs_path: PathBuf },
    Hybrid { evidence: Evidence, unitigs: UnitigFileReader, fingerprint: FingerprintVec },
}

MphfLayer::open(dir, mode: &IndexMode) receives the mode from PartitionMeta — no per-layer file is read.

Query API

Two public query methods, both returning Option<usize> (slot index):

pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_strict(&self, kmer: CanonicalKmer) -> Option<usize>
  • find: O(1) auto-dispatch. Exact/Hybrid → exact evidence check. Approx/Hybrid → fingerprint comparison.
  • find_strict: always exact. Exact/Hybrid → O(1) evidence check. Approx → O(n) sequential scan (no .idx).

There are no find_exact/find_approx methods; panicking dispatch is eliminated.

Build surface

// Full MPHF + evidence build (two-pass)
pub(crate) fn build(dir, block_bits, mode: &IndexMode, fill_slot) -> OLMResult<usize>

// Evidence-only post-hoc builds (MPHF already present)
pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
pub fn build_approx_evidence(dir, b, z)      -> OLMResult<usize>

MphfLayer::build runs two passes over unitigs.bin:

  1. Pass 1 (parallel via rayon): a CanonicalKmerIter (clonable, Arc<Mmap>, no file reopening) is passed to new_from_par_iter via par_bridge(). Produces mphf.bin. No .idx is read or created at this stage.
  2. Pass 2 (sequential): fill evidence files; call fill_slot(slot, kmer) per kmer. .idx is written last for Exact/Hybrid modes (query-time only).

There is no build_evidence dispatch wrapper — callers invoke build_exact_evidence or build_approx_evidence directly.

For empty layers (n = 0), all build variants return Ok(0) immediately after creating empty output files.


Layer<D: LayerData> — MPHF + payload

Layer<D> pairs an MphfLayer with one payload store.

pub trait LayerData: Sized {
    type Item;
    fn open(layer_dir: &Path) -> OLMResult<Self>;
    fn read(&self, slot: usize) -> Self::Item;
}

pub struct Layer<D: LayerData = ()> {
    mphf: MphfLayer,
    data: D,
}

pub struct Hit<T = ()> {
    pub slot: usize,
    pub data: T,
}

LayerData covers the read path only (open + read). Build signatures differ between modes and are not part of the trait.

Type Item Description
() () mode 1 — membership only
PersistentCompactIntMatrix Box<[u32]> mode 2 — count matrix (one u32 per column per slot)
PersistentBitMatrix Box<[bool]> mode 3 — presence matrix (one bit per genome per slot)

Build signatures

// mode 1
impl Layer<()> {
    pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode) -> OLMResult<usize>
}

// mode 2
impl Layer<PersistentCompactIntMatrix> {
    pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode,
                 count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn build_from_map(out_dir: &Path, block_bits: u8, mode: &IndexMode,
                          counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}

// mode 3
impl Layer<PersistentBitMatrix> {
    pub fn build_presence(out_dir: &Path, block_bits: u8, mode: &IndexMode,
                          n_genomes: usize,
                          present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
}

All build impls delegate to MphfLayer::build via a mode-specific fill_slot callback. The mode parameter is forwarded directly — no LayerMeta is written.

Evidence-only post-hoc builds are accessible directly on Layer<D>:

impl<D: LayerData> Layer<D> {
    pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
    pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8)  -> OLMResult<usize>
}

There is no build_evidence dispatch wrapper.


FingerprintVec and FingerprintVecWriter

Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot.

fingerprint.bin format:
  magic:   b"FPVF"  (4 bytes)
  b:       u8       (bits per fingerprint, 1..=64)
  padding: [0u8; 3]
  n:       u64 LE   (number of slots)
  data:    packed bits, ceil(n*b/8) bytes, Lsb0 order
impl FingerprintVec {
    pub fn open(path: &Path) -> OLMResult<Self>
    pub fn get(&self, slot: usize) -> u64
    pub fn matches(&self, slot: usize, fingerprint: u64) -> bool
    pub fn n(&self) -> usize
    pub fn b(&self) -> u8
}

matches(slot, hash) extracts the b-bit fingerprint stored at slot and compares it to the low b bits of hash. It is the core operation of find_approx.


LayeredMap<D> — collection of layers

LayeredMap<D> wraps Vec<Layer<D>> for a single partition directory.

pub struct LayeredMap<D: LayerData = ()> {
    root:   PathBuf,
    meta:   PartitionMeta,
    layers: Vec<Layer<D>>,
}

PartitionMeta (meta.json at the partition root) stores n_layers.

Common methods

pub fn open(root: &Path)              -> OLMResult<Self>
pub fn create(root: &Path, mode: IndexMode) -> OLMResult<Self>
pub fn n_layers(&self)                -> usize
pub fn layer(&self, i: usize)         -> &Layer<D>
pub fn mode(&self)                    -> &IndexMode
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
pub fn next_layer_writer(&self)       -> OLMResult<UnitigFileWriter>

open reads PartitionMeta once, extracts mode, and passes it to every Layer::open — no per-layer file is read. create stores the given mode in PartitionMeta.

query probes layers in order and returns (layer_index, Hit) on the first match. Expected probe depth: 1 for kmers in layer 0.

push_layer

push_layer builds the next layer from a unitigs.bin already written via next_layer_writer, using DEFAULT_BLOCK_BITS:

// mode 1
impl LayeredMap<()> {
    pub fn push_layer(&mut self) -> OLMResult<usize>
}

// mode 2
impl LayeredMap<PersistentCompactIntMatrix> {
    pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn push_layer_from_map(&mut self, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}

Mode 3 (PersistentBitMatrix) has no push_layer on LayeredMap; callers build directly via Layer<PersistentBitMatrix>::build_presence.


LayeredStore<S> and aggregation traits

LayeredStore<S> is a generic aggregation wrapper over Vec<S>. It propagates three traits from obicompactvec::traits up the hierarchy via blanket impls:

pub struct LayeredStore<S>(pub Vec<S>);

impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {  }  // Σ col_weights across inner stores
impl<S: CountPartials> CountPartials  for LayeredStore<S> {  }  // element-wise Σ partials
impl<S: BitPartials>   BitPartials    for LayeredStore<S> {  }  // element-wise Σ partials

Because blanket impls compose, LayeredStore<LayeredStore<S>> automatically inherits all three traits when S does — providing the partitioned level without a separate type.

Leaf implementors (in obicompactvec):

Type Traits
PersistentCompactIntMatrix ColumnWeights (via sum()) + CountPartials
PersistentBitMatrix ColumnWeights (via count_ones()) + BitPartials

See Kmer index architecture for the full trait API and the two-pass normalised-metric pattern.


On-disk structure

partition_root/                    ← LayeredMap (one partition)
  meta.json                        — {"n_layers": N, "mode": {"type": "exact"|"approx"|"hybrid", ...}}
  layer_0/                         ← Layer
    mphf.bin                       — ptr_hash MPHF (epserde format)
    unitigs.bin                    — packed 2-bit nucleotide sequences
    unitigs.bin.idx                — UIDX index (Exact/Hybrid only; query-time, never built during MPHF construction)
    evidence.bin                   — [u32; n], LE  (Exact/Hybrid only)
    fingerprint.bin                — packed b-bit array  (Approx/Hybrid only)
    counts/                        [mode 2] PersistentCompactIntMatrix
      meta.json
      col_000000.pciv
    presence/                      [mode 3] PersistentBitMatrix
      meta.json
      col_000000.pbiv  …
  layer_1/
    …

There is no layer_meta.json. The mode is stored once in PartitionMeta and is valid for all layers. unitigs.bin.idx is built at the end of build_exact_evidence — never during MPHF construction — and is consumed at query time only.


Evidence encoding (exact)

evidence.bin is a flat [u32; n] array with no header. Each u32 encodes one slot:

bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0]  = rank     (7 bits)  — kmer index within the chunk (0-based)

chunk_id = raw >> 7, rank = raw & 0x7F. Reconstructing the kmer: read k nucleotides at position rank within unitig chunk_id (requires unitigs.bin.idx for random access).

For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity.


ptr_hash configuration

type Mphf = PtrHash<
    u64,                              // key type: canonical kmer raw encoding
    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
    CachelineEfVec<Vec<CachelineEf>>, // remap: Elias-Fano
    Xx64,                             // hasher: XXH3-64 with seed
    Vec<u8>,                          // pilots
>;

Xx64 is chosen over FxHash because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.

CubicEps with PtrHashParams::<CubicEps>::default() (λ=3.5): 2× slower construction than Linear/λ=3.0, ~20% less space.


Column append and merge support

These methods extend existing layers with new genome columns without touching the MPHF.

Layer-level genome column append

impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}

impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}

Both delegate to the corresponding PersistentBitMatrix::append_column / PersistentCompactIntMatrix::append_column. They write a new column file (col_NNNNNN.pbiv / col_NNNNNN.pciv) and update meta.json to increment n_cols. value_of is called once per slot (0..n).

Presence matrix initialisation

impl Layer<()> {
    pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
}

Called on the first merge of a Presence-mode index. Creates presence/ with meta.json {"n": n_kmers, "n_cols": 1} and col_000000.pbiv set entirely to true. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.

Why the MPHF is never rebuilt

The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new .pciv/.pbiv file and a single meta.json update.


Add-layer algorithm

When adding dataset B to an existing index:

  1. For each partition, probe existing layers for kmers of B routed to that partition.
  2. Collect kmers absent from all layers → B \ index.
  3. Write B \ index to a new unitigs.bin via next_layer_writer().
  4. Call Layer<D>::build (or build_presence) on the new layer directory.
  5. Call push_layer (or append_layer) to register the new layer in meta.json.

Each partition's new layer is built independently; the operation is fully parallel across partitions.


Dependencies

crate role
ptr_hash 1.1 MPHF per layer
cacheline-ef 1.1 compact remap inside ptr_hash
epserde 0.8 zero-copy MPHF serialisation
memmap2 0.9 mmap of evidence and fingerprint files
bitvec packed b-bit fingerprint storage
obiskio unitig file writer/reader + .idx build
obicompactvec payload types + aggregation traits
rayon 1 parallel MPHF construction pass
serde / serde_json PartitionMeta serialisation