f2de79acde
Introduce the `obicompactvec` crate, featuring a two-tier, memory-mapped integer vector that uses a primary `u8` array with a sentinel for overflow dispatch and a sparse L1-resident index for fast random access. Implement builder and reader modules with zero-copy serialization and comprehensive test coverage. Update `obilayeredmap` to replace the default hash function with a cache-line-optimized `Mphf`, adding explicit bounds checking and duplicate-slot detection. Add documentation for both modules and update project configuration files accordingly.
204 lines
10 KiB
Markdown
204 lines
10 KiB
Markdown
# obilayeredmap — layered kmer index crate
|
||
|
||
## Purpose
|
||
|
||
`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **collection → partition → layer**. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
|
||
|
||
---
|
||
|
||
## Four usage modes
|
||
|
||
The MPHF + evidence infrastructure is fixed for all modes. The **payload** — data associated with each slot — is orthogonal and varies by mode.
|
||
|
||
| Mode | Description | Payload |
|
||
|---|---|---|
|
||
| 1. Set | membership test only | none |
|
||
| 2. Set with count | occurrences per kmer per sample | compact integer vector |
|
||
| 3. Presence/absence matrix | which genomes contain each kmer | bit matrix n_kmers × n_genomes |
|
||
| 4. Count matrix | occurrences per kmer per genome | integer matrix n_kmers × n_genomes |
|
||
|
||
Mode 4 is architecturally identical to mode 3 but with counts instead of bits; the main open question is scale — a naive dense representation reaches hundreds of GB for large genome collections, which may require a sparse encoding.
|
||
|
||
### Payload for mode 2: compact count vector
|
||
|
||
Counts follow a roughly geometric distribution: the large majority are below 128, almost all below 16 000, with rare large values in highly covered regions. Options for a mmap-compatible random-access count vector:
|
||
|
||
- **`Vec<u16>`** — 2 bytes/slot, covers 0–65 535, O(1) access, trivially mmap-able. Practical upper bound sufficient for most metagenomic use cases.
|
||
- **Bit-packed fixed width** — choose B = ⌈log₂(max\_count)⌉ globally (e.g. B=14 for 99.9% coverage at 1.75 bytes/slot). O(1) access via bit-shift arithmetic.
|
||
- **Block-varint (PForDelta, StreamVByte)** — good compression but random access requires a separate offset index; no mature Rust crate for mmap use.
|
||
|
||
Decision not yet made. `Vec<u16>` is the default baseline pending profiling.
|
||
|
||
### Payload for mode 3: presence/absence matrix
|
||
|
||
Column-major bit matrix: column j (genome j) is a contiguous `n_slots`-bit array. This layout makes per-genome operations (union, intersection, diff) cache-friendly. For 10⁹ slots × 100 genomes ≈ 12.5 GB — large but mmap-able with no resident-memory cost at open time.
|
||
|
||
---
|
||
|
||
## Payload architecture
|
||
|
||
The payload is orthogonal to the MPHF + evidence layer. `Layer` will be parameterised by a payload type `D`:
|
||
|
||
```rust
|
||
struct Layer<D = ()> {
|
||
mphf: Mphf,
|
||
evidence: Evidence,
|
||
unitigs: UnitigFileReader,
|
||
data: D,
|
||
}
|
||
```
|
||
|
||
Where `D` implements a `LayerData` trait covering open/close/get. Concrete instances:
|
||
|
||
- `()` — mode 1
|
||
- `CountVec` — mode 2 (u16 or bit-packed)
|
||
- `BitMatrix` — mode 3
|
||
- `CountMatrix` — mode 4
|
||
|
||
This parameterisation is not yet implemented; the current code uses a fixed `Counts` field.
|
||
|
||
---
|
||
|
||
## Three-level hierarchy
|
||
|
||
```
|
||
index_root/ ← LayeredMap (collection)
|
||
meta.json
|
||
part_00000/ ← Partition
|
||
layer_0/ ← Layer
|
||
mphf.bin
|
||
unitigs.bin
|
||
unitigs.bin.idx
|
||
evidence.bin
|
||
counts.bin [mode 2 only]
|
||
presence.bin [mode 3/4 only]
|
||
layer_1/
|
||
...
|
||
part_00001/
|
||
layer_0/
|
||
...
|
||
```
|
||
|
||
**Collection** (`index_root/`): global metadata — kmer size k, number of partitions, layer count, sample registry.
|
||
|
||
**Partition** (`part_XXXXX/`): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in `part_XXXXX`. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in `meta.json`.
|
||
|
||
**Layer** (`layer_N/`): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer.
|
||
|
||
---
|
||
|
||
## Layer file layout
|
||
|
||
```
|
||
layer_N/
|
||
mphf.bin — ptr_hash MPHF (epserde, ptr_hash native format)
|
||
unitigs.bin — packed 2-bit nucleotide sequences (obiskio binary format)
|
||
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
|
||
evidence.bin — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
|
||
counts.bin — [optional] u16 per MPHF slot (mode 2)
|
||
presence.bin — [optional] bit matrix n_slots × n_samples (mode 3/4)
|
||
```
|
||
|
||
`unitigs.bin` is the packed-2-bit sequence file produced by `obiskio::UnitigFileWriter`. The companion `.idx` file stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
|
||
|
||
### Evidence encoding
|
||
|
||
Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction.
|
||
|
||
```
|
||
slot s → unitig_id: u25 | rank: u7
|
||
```
|
||
|
||
Packed into a `u32` (29 bits used, 3 spare). Decoding:
|
||
|
||
```
|
||
kmer = unitigs[unitig_id][rank .. rank + k] // 2-bit packed slice
|
||
```
|
||
|
||
`rank` is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k − m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A `u7` (0–127) is sufficient.
|
||
|
||
---
|
||
|
||
## ptr_hash configuration
|
||
|
||
The MPHF per layer is configured as:
|
||
|
||
```rust
|
||
type Mphf = PtrHash<
|
||
u64, // key type: canonical kmer raw encoding
|
||
CubicEps, // bucket fn: balanced (2.4 bits/key, λ=3.5)
|
||
CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry vs 32 for Vec<u32>
|
||
Xx64, // hasher: XXH3-64 with seed, handles structured keys
|
||
Vec<u8>, // pilots
|
||
>;
|
||
```
|
||
|
||
**Hasher choice — `Xx64`:** k-mer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). `FxHash` (single multiply) distributes these poorly. `Xx64` (XXH3 64-bit, seeded) handles structured input correctly.
|
||
|
||
**Bucket function — `CubicEps` with `PtrHashParams::<CubicEps>::default()`:** λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than `Linear/λ=3.0` (the `default_fast` preset), 20% less space. `default_compact` (λ=4.0) saves a further 12.5% at 2× more construction time and reduced reliability — not chosen.
|
||
|
||
**Remap — `CachelineEfVec`:** Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for `Vec<u32>`). Already a transitive dependency of `ptr_hash`. One cacheline per query vs one u32 read; space win dominates for billion-scale key sets.
|
||
|
||
**Construction:** `new_from_par_iter` avoids materialising all keys as `Vec<u64>`. MPHF correctness is verified inline during the second pass (evidence/counts fill) using an n/8-byte bitset; any duplicate slot or out-of-bounds index returns `OLMError::Mphf`.
|
||
|
||
---
|
||
|
||
## Query path
|
||
|
||
A kmer query routes through all three levels:
|
||
|
||
1. **Partition routing**: hash canonical minimiser of the query kmer → partition index → open `part_XXXXX/`.
|
||
2. **Layer probing**: iterate layers in order within the partition; for each layer compute `slot = mphf.index(kmer)`, then verify `evidence.decode(slot) == kmer`. First match wins.
|
||
3. **Data access**: read payload at `slot` from the matching layer (counts, presence row, etc.).
|
||
|
||
```
|
||
fn query(kmer) → Option<Hit>:
|
||
part = partition_of(kmer)
|
||
for (i, layer) in part.layers.iter().enumerate():
|
||
slot = layer.mphf.index(&kmer.raw())
|
||
if layer.evidence.decode(slot) == kmer:
|
||
return Some(Hit { layer: i, slot })
|
||
return None
|
||
```
|
||
|
||
Expected probe depth: 1 for kmers in layer 0, increasing for later layers. In practice the dominant dataset should be layer 0.
|
||
|
||
---
|
||
|
||
## Add-layer algorithm
|
||
|
||
When adding dataset B to an existing index:
|
||
|
||
1. For each partition, iterate kmers of B routed to that partition.
|
||
2. Probe existing layers; collect kmers absent from all layers → `B \ index`.
|
||
3. Build a new layer from `B \ index`.
|
||
4. Append the new layer directory under each `part_XXXXX/`.
|
||
5. Update `meta.json` (layer count, sample registry).
|
||
|
||
Each partition's new layer is built independently; the operation is fully parallel across partitions.
|
||
|
||
---
|
||
|
||
## Dependencies
|
||
|
||
| crate | role |
|
||
|---|---|
|
||
| `ptr_hash 1.1` | MPHF per layer (epserde serialisation) |
|
||
| `cacheline-ef 1.1` | compact remap storage inside ptr_hash |
|
||
| `epserde 0.8` | zero-copy serialisation of MPHF |
|
||
| `memmap2` | mmap of layer files |
|
||
| `obiskio` | unitig file writer/reader |
|
||
|
||
Flat arrays (evidence, counts, presence) use a simple custom binary format (raw bytes, fixed element size) opened via mmap — no additional serialisation crate required.
|
||
|
||
---
|
||
|
||
## Open questions
|
||
|
||
- **Mode 4 scale**: count matrix (n_kmers × n_genomes × bytes_per_count) reaches hundreds of GB for large collections. A sparse representation (store only non-zero entries) may be required; the access pattern and density threshold are not yet defined.
|
||
- **Count vector encoding (mode 2)**: `Vec<u16>` vs bit-packed (B=14). Decision pending; depends on whether counts > 65 535 occur in practice for the target datasets.
|
||
- **Presence matrix layout**: column-major favours per-genome operations; row-major favours per-kmer queries. Decide based on dominant access pattern.
|
||
- **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model; maintenance operation, not query-path.
|
||
- **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
|
||
- **`try_new_from_par_iter`**: `ptr_hash::new_from_par_iter` silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A `try_new_from_par_iter` PR upstream would close this gap.
|