Files
obikmer/docmd/implementation/obilayeredmap.md
T
Eric Coissac 1881e98bad feat(bitvec): add partial Jaccard, fix padding, optimize constructor
Introduces `partial_jaccard_dist` to return raw intersection and union counts, improving Jaccard distance flexibility. Corrects `not()` to explicitly zero padding bits in the final word, ensuring accurate bit-counting for partially-filled words. Adds an optimized `build_from_counts` constructor.
2026-05-15 15:04:10 +08:00

12 KiB
Raw Blame History

obilayeredmap — layered kmer index crate

Purpose

obilayeredmap implements a persistent, incrementally extensible kmer index. The index is organised in three levels: collection → partition → layer. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a ptr_hash MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.


Four usage modes

The MPHF + evidence infrastructure is fixed for all modes. The payload — data associated with each slot — is orthogonal and varies by mode.

Mode Description Payload type Storage
1. Set membership test only ()
2. Count occurrences per kmer per sample PersistentCompactIntMatrix counts/ directory
3. Presence/absence matrix which genomes contain each kmer PersistentBitMatrix presence/ directory
4. Count matrix occurrences per kmer per genome PersistentCompactIntMatrix counts/ directory

Both PersistentCompactIntMatrix and PersistentBitMatrix come from the obicompactvec crate. Mode 3 has a build path (Layer::<PersistentBitMatrix>::build_presence); mode 4 is not yet implemented.

Payload for modes 2/4: PersistentCompactIntMatrix

PersistentCompactIntMatrix is a column-major matrix stored in a directory: one col_NNNNNN.pciv file per column, plus a meta.json. Each column is a PersistentCompactIntVec — a mmap'd PCIV file with a u8 primary array (255 = overflow sentinel), a sorted overflow section of (slot: u64, value: u32) entries, and a sparse L1-fitting index.

Mode 2 writes 1 column per layer (one sample). Mode 4 writes G columns (one per genome). read(slot) returns Box<[u32]> — the full row across all columns.

Payload for mode 3: PersistentBitMatrix

PersistentBitMatrix is a column-major bit matrix stored in a directory: one col_NNNNNN.pbiv per genome, plus meta.json. Each column is a PersistentBitVec — a mmap'd PBIV file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). read(slot) returns Box<[bool]> — the presence vector across all genomes.

Column-major layout makes per-genome set operations cache-friendly; the full row is assembled on demand at query time.


Payload architecture

The payload is orthogonal to the MPHF + evidence layer. Layer is parameterised by D: LayerData:

pub trait LayerData: Sized {
    type Item;
    fn open(layer_dir: &Path) -> OLMResult<Self>;
    fn read(&self, slot: usize) -> Self::Item;
}

pub struct Layer<D: LayerData = ()> {
    mphf:     Mphf,
    evidence: Evidence,
    unitigs:  UnitigFileReader,
    data:     D,
}

pub struct Hit<T = ()> {
    pub slot: usize,
    pub data: T,
}

LayerData covers the read path only (open + read). The write path (build) is intentionally not in the trait — build signatures differ between modes and forcing this into a trait would require an associated Context type with no benefit over specialized impl blocks.

Implemented concrete types:

Type Item Description
() () mode 1 — membership only
PersistentCompactIntMatrix Box<[u32]> modes 2/4 — one count per column
PersistentBitMatrix Box<[bool]> mode 3 — one presence bit per column

LayeredMap mirrors the same parameterisation: LayeredMap<D: LayerData = ()>.


Three-level hierarchy

index_root/                        ← LayeredMap (collection)
  meta.json
  part_00000/                      ← Partition
    layer_0/                       ← Layer
      mphf.bin
      unitigs.bin
      unitigs.bin.idx
      evidence.bin
      counts/              [modes 2/4]
        meta.json          {"n": N, "n_cols": 1}
        col_000000.pciv
      presence/            [mode 3]
        meta.json          {"n": N, "n_cols": G}
        col_000000.pbiv
        col_000001.pbiv
        ...
    layer_1/
      ...
  part_00001/
    layer_0/
    ...

Collection (index_root/): global metadata — kmer size k, number of partitions, layer count, sample registry.

Partition (part_XXXXX/): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in part_XXXXX. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in meta.json.

Layer (layer_N/): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer.


Layer file layout

layer_N/
  mphf.bin            — ptr_hash MPHF (epserde, ptr_hash native format)
  unitigs.bin         — packed 2-bit nucleotide sequences (obiskio binary format)
  unitigs.bin.idx     — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
  evidence.bin        — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
  counts/             — [modes 2/4] PersistentCompactIntMatrix
  presence/           — [mode 3] PersistentBitMatrix

unitigs.bin is the packed-2-bit sequence file produced by obiskio::UnitigFileWriter. The companion .idx file stores: magic UIDX, n_unitigs: u32, n_kmers: u64, seqls: [u8; n_unitigs] (kmer count 1 per chunk), and packed_offsets: [u32; n_unitigs + 1] (byte offsets into unitigs.bin, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.

Evidence encoding

Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction.

slot s  →  unitig_id: u25  |  rank: u7

Packed into a u32 (29 bits used, 3 spare). Decoding:

kmer = unitigs[unitig_id][rank .. rank + k]   // 2-bit packed slice

rank is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A u7 (0127) is sufficient.


ptr_hash configuration

The MPHF per layer is configured as:

type Mphf = PtrHash<
    u64,                              // key type: canonical kmer raw encoding
    CubicEps,                         // bucket fn: balanced (2.4 bits/key, λ=3.5)
    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry vs 32 for Vec<u32>
    Xx64,                             // hasher: XXH3-64 with seed, handles structured keys
    Vec<u8>,                          // pilots
>;

Hasher choice — Xx64: k-mer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). FxHash (single multiply) distributes these poorly. Xx64 (XXH3 64-bit, seeded) handles structured input correctly.

Bucket function — CubicEps with PtrHashParams::<CubicEps>::default(): λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than Linear/λ=3.0 (the default_fast preset), 20% less space. default_compact (λ=4.0) saves a further 12.5% at 2× more construction time and reduced reliability — not chosen.

Remap — CachelineEfVec: Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for Vec<u32>). Already a transitive dependency of ptr_hash. One cacheline per query vs one u32 read; space win dominates for billion-scale key sets.


Build path

The build path is not part of LayerData. Each mode exposes its own impl Layer<D>::build with the exact signature it needs. Two private module-level helpers avoid code duplication:

build_mphf(out_dir, n) -> OLMResult<Mphf>: first pass — opens unitigs.bin, iterates all canonical kmers in parallel via new_from_par_iter, stores mphf.bin. O(n).

build_second_pass(out_dir, n, mphf, fill_slot) -> OLMResult<()>: second pass — opens unitigs.bin again, fills evidence.bin and a compact n/8-byte seen-bitset (MPHF correctness check inline), calls fill_slot(slot, kmer) once per kmer for the mode-specific payload. O(n).

// mode 1
impl Layer<()> {
    pub fn build(out_dir: &Path) -> OLMResult<usize>
}

// modes 2/4
impl Layer<PersistentCompactIntMatrix> {
    pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}

// mode 3
impl Layer<PersistentBitMatrix> {
    pub fn build_presence(
        out_dir: &Path,
        n_genomes: usize,
        present_in: impl Fn(CanonicalKmer, usize) -> bool,
    ) -> OLMResult<usize>
}

Mode 2 creates a PersistentCompactIntMatrixBuilder with 1 column and fills it via build_second_pass. Mode 3 creates a PersistentBitMatrixBuilder with n_genomes columns and fills all columns in a single pass.

Any duplicate slot or out-of-bounds index detected during build_second_pass returns OLMError::Mphf. new_from_par_iter avoids materialising all keys as Vec<u64>.


Query path

A kmer query routes through all three levels:

  1. Partition routing: hash canonical minimiser of the query kmer → partition index → open part_XXXXX/.
  2. Layer probing: iterate layers in order; for each layer compute slot = mphf.index(kmer), decode evidence, compare to query. First match wins.
  3. Data access: layer.data.read(slot) returns D::Item.
// pseudo-code
fn query(kmer) -> Option<(usize, Hit<D::Item>)>:
    for (i, layer) in self.layers.iter().enumerate():
        slot = layer.mphf.index(&kmer.raw())
        if layer.evidence.decode(slot) == kmer:
            return Some((i, Hit { slot, data: layer.data.read(slot) }))
    return None

Expected probe depth: 1 for kmers in layer 0, increasing for later layers.

For mode 2, hit.data is Box<[u32]> with 1 element; hit.data[0] is the count. For mode 3, hit.data is Box<[bool]> with G elements, one per genome.


Add-layer algorithm

When adding dataset B to an existing index:

  1. For each partition, iterate kmers of B routed to that partition.
  2. Probe existing layers; collect kmers absent from all layers → B \ index.
  3. Build a new layer from B \ index.
  4. Append the new layer directory under each part_XXXXX/.
  5. Update meta.json (layer count, sample registry).

Each partition's new layer is built independently; the operation is fully parallel across partitions.


Dependencies

crate role
ptr_hash 1.1 MPHF per layer (epserde serialisation)
cacheline-ef 1.1 compact remap storage inside ptr_hash
epserde 0.8 zero-copy serialisation of MPHF
memmap2 mmap of layer files
obiskio unitig file writer/reader
obicompactvec payload types: PersistentCompactIntMatrix, PersistentBitMatrix

Relationship to target architecture

The target architecture (see Kmer index architecture) separates MphfLayer from data stores entirely and introduces a PartitionedIndex with parallel dispatch and an Aggregator pattern. The current implementation is a stepping stone: obicompactvec types are already fully decoupled from the MPHF; the remaining refactoring is within obilayeredmap itself.


Open questions

  • Mode 4: count matrix (n_kmers × n_genomes × bytes_per_count) is structurally identical to mode 3 but uses PersistentCompactIntMatrix with G columns. Build API not yet implemented. Scale concern: hundreds of GB for large collections — a sparse representation may be required at high genome counts.
  • Layer merge: merging two LayeredMap instances into a single-layer index requires full rebuild. Define API and cost model.
  • Canonical kmer orientation: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
  • try_new_from_par_iter: ptr_hash::new_from_par_iter silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A try_new_from_par_iter PR upstream would close this gap.