Files

T

Eric Coissac 98c14aade9 feat: centralize index configuration and add hybrid mode

Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.

2026-05-26 15:08:29 +02:00

9.8 KiB

Raw Permalink Blame History

MPHF selection — two-phase indexing architecture

Why two phases are needed

Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of surviving unique kmers is not known until after counting and filtering low-abundance kmers.

Phase 1 — provisional MPHF + kmer spectrum

Implemented in obikpartitionner::KmerPartition::count_kmer() → count_partition().

External sort: read the dereplicated superkmer file; extract the raw u64 canonical kmer value for every kmer of every superkmer. Sort in RAM-bounded chunks (adaptive budget: 40% of available RAM ÷ n_threads, minimum 1 M kmers per chunk), then k-way merge with inline dedup. Result: sorted_unique.bin — a flat array of f0 distinct sorted u64 values. Exact kmer count f0 is known at this point.
Build provisional MPHF (ptr_hash, same configuration as phase 2) over sorted_unique.bin using new_from_par_iter. Delete sorted_unique.bin immediately after. Persist to mphf1.bin.
Create counts1.bin: PersistentCompactIntVec with f0 slots, zero-initialised.
Accumulation pass: re-read the dereplicated superkmer file; for each kmer in each superkmer, compute slot = mphf.index(kmer.raw()) and increment counts1[slot] by the superkmer's COUNT.
Build kmer frequency spectrum from counts1: histogram {count → n_kmers}, totals f0 (distinct kmers) and f1 (total abundance). Written to kmer_spectrum_raw.json per partition, then merged globally.

Files produced per partition:

part_XXXXX/
  mphf1.bin               — ptr_hash provisional MPHF (discarded after phase 2)
  counts1.bin             — PersistentCompactIntVec, f0 × u32 kmer counts
  kmer_spectrum_raw.json  — local frequency spectrum

Phase 2 — definitive MPHF

After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see Construction pipeline), the exact filtered kmer set is available via unitigs.bin.

MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot) is called on the unitig directory:

Pass 1 (parallel): a CanonicalKmerIter — clonable via Arc<Mmap>, no file reopening — is passed directly to new_from_par_iter via par_bridge(). No .idx is read or created at this stage; parallelism is at partition/layer level, not within a single MPHF. Produces mphf.bin.
Pass 2 (sequential): iterate with iter_indexed_canonical_kmers; fill evidence files; call fill_slot(slot, kmer) callback per kmer. For Exact/Hybrid, .idx is written at the end of this pass — never earlier.

mphf1.bin and counts1.bin are no longer needed after phase 2 and can be deleted.

MPHF candidates

boomphf (BBHash algorithm, maintained by 10X Genomics):

~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)
Supports streaming construction (no exact count needed)
Drawback: largest space footprint; streaming advantage is irrelevant at phase 2 since the exact count is available

ptr_hash (PtrHash algorithm, Groot Koerkamp, SEA 2025):

~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64) and fastest construction (≥3.1×)
Requires exact key count at construction — available at both phases after pass 1
Published February 2025; accepted given performance profile and the fact that each MPHF is independently rebuildable from its unitig file

FMPH/FMPHGO (ph crate, Beling, ACM JEA 2023):

~2.1 bits/key — most compact; good query speed; deterministic construction
GOFunction (group-oriented variant) was the original phase-1 choice; eliminated when the external sort made the exact count available at phase 1 as well

MPHF choice per phase

Both phases: ptr_hash, same type alias and construction parameters. The external sort (phase 1) and the unitig index (phase 2) both provide the exact key count before MPHF construction, so ptr_hash's requirement is satisfied in both cases. Using a single MPHF implementation removes the ph crate dependency.

boomphf: eliminated — largest space overhead, streaming advantage no longer needed. FMPH/GOFunction: eliminated — exact count available, ptr_hash is faster at equivalent compactness.

Space at scale

For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):

MPHF	bits/key	Total MPHF size
boomphf	3.7	~47 GB
ptr_hash	2.4	~31 GB
FMPH	2.1	~27 GB

For a human genome at 30× coverage with 1 024 partitions, realistic partition sizes are 3–30 M unique kmers → 1–8 MB per phase-2 MPHF, well within RAM.

ptr_hash configuration (phase 2)

type Mphf = PtrHash<
    u64,                              // key: canonical kmer raw encoding
    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
    Xx64,                             // hasher: XXH3-64 with seed
    Vec<u8>,                          // pilots
>;

Hasher — Xx64: canonical kmer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). FxHash (single multiply) distributes these poorly; Xx64 (XXH3-64, seeded) handles structured input correctly.

Bucket function — CubicEps: λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than Linear/λ=3.0, 20% less space. default_compact (λ=4.0) saves a further 12.5% at 2× more construction time — not chosen.

Remap — CachelineEfVec: Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for Vec<u32>). One cacheline per query; space win dominates at billion-scale key counts.

Multilayer index architecture

Layer structure

Each layer is a self-contained unit. See obilayeredmap for the full on-disk layout. The MPHF-relevant files are:

layer_i/
  unitigs.bin      — packed 2-bit nucleotide sequences (kmer evidence source)
  unitigs.bin.idx  — random-access block index (block_bits controls granularity)
  mphf.bin         — ptr_hash phase-2 MPHF
  evidence.bin     — n × (chunk_id: 25 bits | rank: 7 bits) per slot  [exact mode]
  fingerprint.bin  — n × b-bit fingerprints per slot                  [approx mode]
  [no layer_meta.json — mode stored once in partition-level meta.json]

Layers are disjoint: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:

For each kmer in B: probe existing layers. If found, the kmer is already indexed.
Collect kmers of B not present in any layer → set B \ A.
Build layer 1 from B \ A (dereplicate → count → De Bruijn → unitigs → MphfLayer::build).

Evidence modes

Three evidence modes are supported via IndexMode, stored once in PartitionMeta at partition root. There is no layer_meta.json.

Exact (IndexMode::Exact): evidence.bin stores one (chunk_id, rank) pair per MPHF slot. Verification reconstructs the kmer and compares to the query. Zero false positives. .idx required at query time.

Approx (IndexMode::Approx { b, z }): fingerprint.bin stores a b-bit hash per slot. False-positive rate 1/2^b per query; Findere z-parameter reduces window FP to ≈ 1/2^(b·z). No .idx written or needed.

Hybrid (IndexMode::Hybrid { b, z }): both fingerprint.bin and evidence.bin + .idx. find() uses the fingerprint (O(1)); find_strict() uses exact evidence (O(1)).

Build functions

MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)
    Pass 1: CanonicalKmerIter + par_bridge() → build mphf.bin  (no .idx used)
    Pass 2: sequential iter → fill evidence files + call fill_slot
            .idx written last for Exact/Hybrid (query-time only)

MphfLayer::build_exact_evidence(dir, block_bits)
    Post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
    Uses open_sequential(); no .idx required on entry

MphfLayer::build_approx_evidence(dir, b, z)
    Post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
    Uses open_sequential(); never writes .idx

There is no build_evidence dispatch wrapper. Callers choose the appropriate post-hoc build directly.

In obikpartitionner, build_index_layer receives block_bits: u8 from IndexConfig::block_bits and forwards it directly to Layer::build and Layer::build_approx_evidence.

Membership verification

ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry:

Exact: decode (chunk_id, rank) from evidence.bin; reconstruct the kmer via unitigs.verify_canonical_kmer; compare to query.
Approx: compare kmer.seq_hash() to the b-bit fingerprint stored at the slot.

A mismatch in either mode means the kmer is absent from this layer; probe the next layer.

Query algorithm

fn query(kmer) → Option<(layer_index, slot)>:
    for (i, layer) in layers.iter().enumerate():
        slot = layer.mphf.index(kmer)
        if layer.evidence.matches(slot, kmer):   // exact or approx dispatch
            return Some((i, slot))
    return None

MphfLayer::find dispatches on LayerEvidence at O(1) — no panicking find_exact/find_approx methods. find_strict always performs an exact check: O(1) for Exact/Hybrid, O(n) sequential scan for Approx. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check.

Merging layers

Two layer chains can be merged by re-indexing their union through the full pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.

9.8 KiB Raw Permalink Blame History Unescape Escape