Files

T

Eric Coissac f2de79acde Add persistent compact integer vector and cache-line-optimized MPHF

Introduce the `obicompactvec` crate, featuring a two-tier, memory-mapped integer vector that uses a primary `u8` array with a sentinel for overflow dispatch and a sparse L1-resident index for fast random access. Implement builder and reader modules with zero-copy serialization and comprehensive test coverage. Update `obilayeredmap` to replace the default hash function with a cache-line-optimized `Mphf`, adding explicit bounds checking and duplicate-slot detection. Add documentation for both modules and update project configuration files accordingly.

2026-05-13 10:09:46 +08:00

10 KiB

Raw Blame History

obilayeredmap — layered kmer index crate

Purpose

obilayeredmap implements a persistent, incrementally extensible kmer index. The index is organised in three levels: collection → partition → layer. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a ptr_hash MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.

Four usage modes

The MPHF + evidence infrastructure is fixed for all modes. The payload — data associated with each slot — is orthogonal and varies by mode.

Mode	Description	Payload
1. Set	membership test only	none
2. Set with count	occurrences per kmer per sample	compact integer vector
3. Presence/absence matrix	which genomes contain each kmer	bit matrix n_kmers × n_genomes
4. Count matrix	occurrences per kmer per genome	integer matrix n_kmers × n_genomes

Mode 4 is architecturally identical to mode 3 but with counts instead of bits; the main open question is scale — a naive dense representation reaches hundreds of GB for large genome collections, which may require a sparse encoding.

Payload for mode 2: compact count vector

Counts follow a roughly geometric distribution: the large majority are below 128, almost all below 16 000, with rare large values in highly covered regions. Options for a mmap-compatible random-access count vector:

Vec<u16> — 2 bytes/slot, covers 0–65 535, O(1) access, trivially mmap-able. Practical upper bound sufficient for most metagenomic use cases.
Bit-packed fixed width — choose B = ⌈log₂(max_count)⌉ globally (e.g. B=14 for 99.9% coverage at 1.75 bytes/slot). O(1) access via bit-shift arithmetic.
Block-varint (PForDelta, StreamVByte) — good compression but random access requires a separate offset index; no mature Rust crate for mmap use.

Decision not yet made. Vec<u16> is the default baseline pending profiling.

Payload for mode 3: presence/absence matrix

Column-major bit matrix: column j (genome j) is a contiguous n_slots-bit array. This layout makes per-genome operations (union, intersection, diff) cache-friendly. For 10⁹ slots × 100 genomes ≈ 12.5 GB — large but mmap-able with no resident-memory cost at open time.

Payload architecture

The payload is orthogonal to the MPHF + evidence layer. Layer will be parameterised by a payload type D:

struct Layer<D = ()> {
    mphf:     Mphf,
    evidence: Evidence,
    unitigs:  UnitigFileReader,
    data:     D,
}

Where D implements a LayerData trait covering open/close/get. Concrete instances:

() — mode 1
CountVec — mode 2 (u16 or bit-packed)
BitMatrix — mode 3
CountMatrix — mode 4

This parameterisation is not yet implemented; the current code uses a fixed Counts field.

Three-level hierarchy

index_root/                        ← LayeredMap (collection)
  meta.json
  part_00000/                      ← Partition
    layer_0/                       ← Layer
      mphf.bin
      unitigs.bin
      unitigs.bin.idx
      evidence.bin
      counts.bin          [mode 2 only]
      presence.bin        [mode 3/4 only]
    layer_1/
      ...
  part_00001/
    layer_0/
    ...

Collection (index_root/): global metadata — kmer size k, number of partitions, layer count, sample registry.

Partition (part_XXXXX/): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in part_XXXXX. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in meta.json.

Layer (layer_N/): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer.

Layer file layout

layer_N/
  mphf.bin            — ptr_hash MPHF (epserde, ptr_hash native format)
  unitigs.bin         — packed 2-bit nucleotide sequences (obiskio binary format)
  unitigs.bin.idx     — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
  evidence.bin        — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
  counts.bin          — [optional] u16 per MPHF slot (mode 2)
  presence.bin        — [optional] bit matrix n_slots × n_samples (mode 3/4)

unitigs.bin is the packed-2-bit sequence file produced by obiskio::UnitigFileWriter. The companion .idx file stores: magic UIDX, n_unitigs: u32, n_kmers: u64, seqls: [u8; n_unitigs] (kmer count − 1 per chunk), and packed_offsets: [u32; n_unitigs + 1] (byte offsets into unitigs.bin, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.

Evidence encoding

Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction.

slot s  →  unitig_id: u25  |  rank: u7

Packed into a u32 (29 bits used, 3 spare). Decoding:

kmer = unitigs[unitig_id][rank .. rank + k]   // 2-bit packed slice

rank is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k − m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A u7 (0–127) is sufficient.

ptr_hash configuration

The MPHF per layer is configured as:

type Mphf = PtrHash<
    u64,                          // key type: canonical kmer raw encoding
    CubicEps,                     // bucket fn: balanced (2.4 bits/key, λ=3.5)
    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry vs 32 for Vec<u32>
    Xx64,                         // hasher: XXH3-64 with seed, handles structured keys
    Vec<u8>,                      // pilots
>;

Hasher choice — Xx64: k-mer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). FxHash (single multiply) distributes these poorly. Xx64 (XXH3 64-bit, seeded) handles structured input correctly.

Bucket function — CubicEps with PtrHashParams::<CubicEps>::default(): λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than Linear/λ=3.0 (the default_fast preset), 20% less space. default_compact (λ=4.0) saves a further 12.5% at 2× more construction time and reduced reliability — not chosen.

Remap — CachelineEfVec: Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for Vec<u32>). Already a transitive dependency of ptr_hash. One cacheline per query vs one u32 read; space win dominates for billion-scale key sets.

Construction: new_from_par_iter avoids materialising all keys as Vec<u64>. MPHF correctness is verified inline during the second pass (evidence/counts fill) using an n/8-byte bitset; any duplicate slot or out-of-bounds index returns OLMError::Mphf.

Query path

A kmer query routes through all three levels:

Partition routing: hash canonical minimiser of the query kmer → partition index → open part_XXXXX/.
Layer probing: iterate layers in order within the partition; for each layer compute slot = mphf.index(kmer), then verify evidence.decode(slot) == kmer. First match wins.
Data access: read payload at slot from the matching layer (counts, presence row, etc.).

fn query(kmer) → Option<Hit>:
    part = partition_of(kmer)
    for (i, layer) in part.layers.iter().enumerate():
        slot = layer.mphf.index(&kmer.raw())
        if layer.evidence.decode(slot) == kmer:
            return Some(Hit { layer: i, slot })
    return None

Expected probe depth: 1 for kmers in layer 0, increasing for later layers. In practice the dominant dataset should be layer 0.

Add-layer algorithm

When adding dataset B to an existing index:

For each partition, iterate kmers of B routed to that partition.
Probe existing layers; collect kmers absent from all layers → B \ index.
Build a new layer from B \ index.
Append the new layer directory under each part_XXXXX/.
Update meta.json (layer count, sample registry).

Each partition's new layer is built independently; the operation is fully parallel across partitions.

Dependencies

crate	role
`ptr_hash 1.1`	MPHF per layer (epserde serialisation)
`cacheline-ef 1.1`	compact remap storage inside ptr_hash
`epserde 0.8`	zero-copy serialisation of MPHF
`memmap2`	mmap of layer files
`obiskio`	unitig file writer/reader

Flat arrays (evidence, counts, presence) use a simple custom binary format (raw bytes, fixed element size) opened via mmap — no additional serialisation crate required.

Open questions

Mode 4 scale: count matrix (n_kmers × n_genomes × bytes_per_count) reaches hundreds of GB for large collections. A sparse representation (store only non-zero entries) may be required; the access pattern and density threshold are not yet defined.
Count vector encoding (mode 2): Vec<u16> vs bit-packed (B=14). Decision pending; depends on whether counts > 65 535 occur in practice for the target datasets.
Presence matrix layout: column-major favours per-genome operations; row-major favours per-kmer queries. Decide based on dominant access pattern.
Layer merge: merging two LayeredMap instances into a single-layer index requires full rebuild. Define API and cost model; maintenance operation, not query-path.
Canonical kmer orientation: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
try_new_from_par_iter: ptr_hash::new_from_par_iter silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A try_new_from_par_iter PR upstream would close this gap.

10 KiB Raw Blame History Unescape Escape