refactor(obilayeredmap): support generic payload types
Replace the hardcoded `Counts` module with a generic `LayerData` trait, parameterizing `Layer` and `LayeredMap` over arbitrary payload types. This decouples read-path access from build-path logic, enabling both set membership and count-based indexing via `PersistentCompactIntVec`. Adds the `obicompactvec` dependency, implements streaming layer construction, and expands test coverage for persistence and multi-layer resolution.
@@ -10,52 +10,63 @@

The MPHF + evidence infrastructure is fixed for all modes. The **payload** — data associated with each slot — is orthogonal and varies by mode.

| Mode | Description | Payload |
|---|---|---|
| 1. Set | membership test only | none |
| 2. Set with count | occurrences per kmer per sample | compact integer vector |
| 3. Presence/absence matrix | which genomes contain each kmer | bit matrix n_kmers × n_genomes |
| 4. Count matrix | occurrences per kmer per genome | integer matrix n_kmers × n_genomes |
| Mode | Description | Payload type | File |
|---|---|---|---|
| 1. Set | membership test only | `()` | — |
| 2. Set with count | occurrences per kmer per sample | `PersistentCompactIntVec` | `counts.pciv` |
| 3. Presence/absence matrix | which genomes contain each kmer | `PersistentBitVec` per genome | `presence_N.pbiv` |
| 4. Count matrix | occurrences per kmer per genome | `PersistentCompactIntVec` per genome | `counts_N.pciv` |

Mode 4 is architecturally identical to mode 3 but with counts instead of bits; the main open question is scale — a naive dense representation reaches hundreds of GB for large genome collections, which may require a sparse encoding.
Both `PersistentCompactIntVec` and `PersistentBitVec` come from the `obicompactvec` crate. Modes 3 and 4 are not yet implemented; the per-genome multi-file layout and query API remain to be designed.

### Payload for mode 2: compact count vector
### Payload for mode 2: PersistentCompactIntVec

Counts follow a roughly geometric distribution: the large majority are below 128, almost all below 16 000, with rare large values in highly covered regions. Options for a mmap-compatible random-access count vector:
`PersistentCompactIntVec` (PCIV) stores one `u32` count per MPHF slot in a single mmap'd `.pciv` file. Its encoding: a primary `u8` array (value 255 = overflow sentinel) backed by a sorted overflow section of `(slot: u64, value: u32)` entries and a sparse L1-fitting index for fast binary search. This handles the geometric count distribution efficiently — most values fit in 1 byte, overflow entries are rare.

- **`Vec<u16>`** — 2 bytes/slot, covers 0–65 535, O(1) access, trivially mmap-able. Practical upper bound sufficient for most metagenomic use cases.
- **Bit-packed fixed width** — choose B = ⌈log₂(max\_count)⌉ globally (e.g. B=14 for 99.9% coverage at 1.75 bytes/slot). O(1) access via bit-shift arithmetic.
- **Block-varint (PForDelta, StreamVByte)** — good compression but random access requires a separate offset index; no mature Rust crate for mmap use.
Capacity: 0 to u32::MAX per slot. No separate decision needed on bit-width: PCIV adapts to the data.
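
The primary-array-plus-overflow encoding described above can be illustrated with a small in-memory model. This is only a sketch, not the `obicompactvec` implementation: `ToyCompactIntVec` and its methods are hypothetical names, the mmap persistence and the sparse index are left out, and it assumes that every value of 255 or more goes to the overflow section.

```rust
/// Toy in-memory model of the PCIV encoding (hypothetical; not the obicompactvec API).
pub struct ToyCompactIntVec {
    /// One byte per slot; 255 means "look the value up in `overflow`".
    primary: Vec<u8>,
    /// (slot, value) pairs sorted by slot, holding every value >= 255.
    overflow: Vec<(u64, u32)>,
}

const SENTINEL: u8 = 255;

impl ToyCompactIntVec {
    /// Build from a dense slice of counts indexed by slot.
    pub fn from_counts(counts: &[u32]) -> Self {
        let mut primary = Vec::with_capacity(counts.len());
        let mut overflow = Vec::new();
        for (slot, &c) in counts.iter().enumerate() {
            if c < u32::from(SENTINEL) {
                primary.push(c as u8);
            } else {
                primary.push(SENTINEL);
                // Slots are visited in increasing order, so `overflow` stays sorted.
                overflow.push((slot as u64, c));
            }
        }
        Self { primary, overflow }
    }

    /// O(1) for small values, one binary search for overflowed slots.
    pub fn get(&self, slot: usize) -> u32 {
        match self.primary[slot] {
            SENTINEL => {
                let i = self
                    .overflow
                    .binary_search_by_key(&(slot as u64), |&(s, _)| s)
                    .expect("a sentinel slot must have an overflow entry");
                self.overflow[i].1
            }
            b => u32::from(b),
        }
    }

    /// Number of slots that did not fit in one byte.
    pub fn overflow_len(&self) -> usize {
        self.overflow.len()
    }
}
```

With a roughly geometric count distribution, `overflow_len()` stays small relative to the number of slots, so the footprint is close to one byte per slot.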

Decision not yet made. `Vec<u16>` is the default baseline pending profiling.
### Payload for mode 3/4: PersistentBitVec / PersistentCompactIntVec

### Payload for mode 3: presence/absence matrix
`PersistentBitVec` (PBIV) stores one bit per MPHF slot in a mmap'd `.pbiv` file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). One PBIV per genome gives a column-major presence/absence matrix, making per-genome set operations cache-friendly.

Column-major bit matrix: column j (genome j) is a contiguous `n_slots`-bit array. This layout makes per-genome operations (union, intersection, diff) cache-friendly. For 10⁹ slots × 100 genomes ≈ 12.5 GB — large but mmap-able with no resident-memory cost at open time.
Mode 4 replaces PBIV with PCIV per genome. Multi-file layout and query API are not yet designed.
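
To make the word-level operations and the size estimate concrete, here is a minimal sketch over plain `&[u64]` columns. The function names are hypothetical and are not the `PersistentBitVec` API. The bit order within a word (slot `s` at bit `s % 64` of word `s / 64`) and the convention that two empty columns have Jaccard index 1.0 are both assumptions.

```rust
/// Set the bit for `slot` in a column stored as u64 words.
pub fn set_bit(words: &mut [u64], slot: usize) {
    words[slot / 64] |= 1u64 << (slot % 64);
}

/// Jaccard index |A ∩ B| / |A ∪ B| of two columns, one popcount pair per word.
pub fn jaccard(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "columns must cover the same slots");
    let mut inter = 0u64;
    let mut uni = 0u64;
    for (&x, &y) in a.iter().zip(b) {
        inter += u64::from((x & y).count_ones());
        uni += u64::from((x | y).count_ones());
    }
    if uni == 0 {
        1.0 // assumption: two empty sets are treated as identical
    } else {
        inter as f64 / uni as f64
    }
}

/// Hamming distance: number of slots where the two columns differ.
pub fn hamming(a: &[u64], b: &[u64]) -> u64 {
    assert_eq!(a.len(), b.len(), "columns must cover the same slots");
    a.iter()
        .zip(b)
        .map(|(&x, &y)| u64::from((x ^ y).count_ones()))
        .sum()
}

/// Bytes needed for a dense column-major bit matrix, each column padded to whole u64 words.
pub fn matrix_bytes(n_slots: u64, n_genomes: u64) -> u64 {
    let words_per_column = (n_slots + 63) / 64;
    words_per_column * 8 * n_genomes
}
```

`matrix_bytes(1_000_000_000, 100)` evaluates to 12 500 000 000, which matches the 12.5 GB figure quoted above: 10⁹ bits is 125 MB per genome.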

---

## Payload architecture

The payload is orthogonal to the MPHF + evidence layer. `Layer` will be parameterised by a payload type `D`:
The payload is orthogonal to the MPHF + evidence layer. `Layer` is parameterised by `D: LayerData`:

```rust
struct Layer<D = ()> {
pub trait LayerData: Sized {
    type Item;
    fn open(layer_dir: &Path) -> OLMResult<Self>;
    fn read(&self, slot: usize) -> Self::Item;
}

pub struct Layer<D: LayerData = ()> {
    mphf: Mphf,
    evidence: Evidence,
    unitigs: UnitigFileReader,
    data: D,
}

pub struct Hit<T = ()> {
    pub slot: usize,
    pub data: T,
}
```

Where `D` implements a `LayerData` trait covering open/close/get. Concrete instances:
`LayerData` covers the **read path only** (`open` + `read`). The write path (build) is intentionally not in the trait — build signatures differ between modes (mode 1 takes no extra argument, mode 2 takes a `count_of` closure) and forcing this into a trait would require an associated `Context` type with no benefit over specialized `impl` blocks.

- `()` — mode 1
- `CountVec` — mode 2 (u16 or bit-packed)
- `BitMatrix` — mode 3
- `CountMatrix` — mode 4
Implemented concrete types:

This parameterisation is not yet implemented; the current code uses a fixed `Counts` field.
| Type | `Item` | Description |
|---|---|---|
| `()` | `()` | mode 1 — membership only |
| `PersistentCompactIntVec` | `u32` | mode 2 — per-slot count |

`LayeredMap` mirrors the same parameterisation: `LayeredMap<D: LayerData = ()>`.
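
As an illustration of how the read-path trait can be implemented and consumed, the sketch below restates `LayerData` and `Hit` from the notes above and adds two implementations. Several parts are assumptions made for the example: `OLMResult` is aliased to `io::Result` as a stand-in for the crate's own error type, and `ToyCounts`, its `toy_counts.bin` file, and the `hit_at` helper are hypothetical, not part of `obilayeredmap` or `obicompactvec`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Stand-in for the crate's result type (assumption: the real one wraps `OLMError`).
pub type OLMResult<T> = Result<T, io::Error>;

/// Read-path trait, as declared in the design notes.
pub trait LayerData: Sized {
    type Item;
    fn open(layer_dir: &Path) -> OLMResult<Self>;
    fn read(&self, slot: usize) -> Self::Item;
}

/// Mode 1: no payload, so there is nothing to open or read.
impl LayerData for () {
    type Item = ();
    fn open(_layer_dir: &Path) -> OLMResult<Self> {
        Ok(())
    }
    fn read(&self, _slot: usize) -> Self::Item {}
}

/// Hypothetical toy payload: one little-endian u32 per slot, loaded eagerly.
pub struct ToyCounts(Vec<u32>);

impl LayerData for ToyCounts {
    type Item = u32;
    fn open(layer_dir: &Path) -> OLMResult<Self> {
        let bytes = fs::read(layer_dir.join("toy_counts.bin"))?;
        let counts = bytes
            .chunks_exact(4)
            .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        Ok(ToyCounts(counts))
    }
    fn read(&self, slot: usize) -> u32 {
        self.0[slot]
    }
}

pub struct Hit<T = ()> {
    pub slot: usize,
    pub data: T,
}

/// Code written against the trait works unchanged for every payload type.
pub fn hit_at<D: LayerData>(data: &D, slot: usize) -> Hit<D::Item> {
    Hit { slot, data: data.read(slot) }
}
```

Because `build` is absent from the trait, generic code such as `hit_at` only needs the two read-path methods, which is the decoupling the notes argue for.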

---

@@ -70,8 +81,8 @@ index_root/ ← LayeredMap (collection)
unitigs.bin
unitigs.bin.idx
evidence.bin
counts.bin [mode 2 only]
presence.bin [mode 3/4 only]
counts.pciv [mode 2 only]
presence_N.pbiv [mode 3/4, one per genome — not yet implemented]
layer_1/
...
part_00001/
@@ -95,8 +106,7 @@ layer_N/
unitigs.bin — packed 2-bit nucleotide sequences (obiskio binary format)
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
evidence.bin — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
counts.bin — [optional] u16 per MPHF slot (mode 2)
presence.bin — [optional] bit matrix n_slots × n_samples (mode 3/4)
counts.pciv — [mode 2] PersistentCompactIntVec: one u32 per slot
```

`unitigs.bin` is the packed-2-bit sequence file produced by `obiskio::UnitigFileWriter`. The companion `.idx` file stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
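
The following sketch shows how an evidence word and the `.idx` arrays could combine to locate a kmer's unitig in O(1). It rests on assumptions the notes do not state: the bit order of the 25 | 7 split (here `unitig_id` in the high 25 bits, `rank` in the low 7), and the in-memory form of the index. `ToyUnitigIndex`, `pack_evidence`, `unpack_evidence` and `locate` are hypothetical names; header parsing and endianness are not covered.

```rust
use std::ops::Range;

const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F
const MAX_UNITIG_ID: u32 = (1 << 25) - 1;

/// Pack (unitig_id, rank) into one u32.
/// Assumed bit order: unitig_id in the high 25 bits, rank in the low 7.
pub fn pack_evidence(unitig_id: u32, rank: u32) -> Option<u32> {
    if unitig_id > MAX_UNITIG_ID || rank > RANK_MASK {
        return None; // does not fit the 25 | 7 split
    }
    Some((unitig_id << RANK_BITS) | rank)
}

/// Inverse of `pack_evidence`: returns (unitig_id, rank).
pub fn unpack_evidence(word: u32) -> (u32, u32) {
    (word >> RANK_BITS, word & RANK_MASK)
}

/// In-memory view of the two `.idx` arrays (hypothetical struct).
pub struct ToyUnitigIndex {
    /// kmer count - 1 for each unitig chunk.
    pub seqls: Vec<u8>,
    /// n_unitigs + 1 byte offsets into `unitigs.bin`; the last entry is the sentinel.
    pub packed_offsets: Vec<u32>,
}

impl ToyUnitigIndex {
    /// Byte range of unitig `i` inside `unitigs.bin`: two array reads, no scan.
    pub fn byte_range(&self, i: usize) -> Range<usize> {
        self.packed_offsets[i] as usize..self.packed_offsets[i + 1] as usize
    }

    /// Number of kmers stored in unitig `i`.
    pub fn n_kmers_in(&self, i: usize) -> usize {
        usize::from(self.seqls[i]) + 1
    }
}

/// Resolve an evidence word to (byte range of its unitig, rank within that unitig).
pub fn locate(index: &ToyUnitigIndex, word: u32) -> (Range<usize>, u32) {
    let (unitig_id, rank) = unpack_evidence(word);
    (index.byte_range(unitig_id as usize), rank)
}
```

The sentinel entry is what lets `byte_range` use `packed_offsets[i + 1]` for the last unitig without a special case.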
@@ -125,11 +135,11 @@ The MPHF per layer is configured as:

```rust
type Mphf = PtrHash<
    u64,                              // key type: canonical kmer raw encoding
    CubicEps,                         // bucket fn: balanced (2.4 bits/key, λ=3.5)
    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry vs 32 for Vec<u32>
    Xx64,                             // hasher: XXH3-64 with seed, handles structured keys
    Vec<u8>,                          // pilots
>;
```

@@ -139,7 +149,30 @@ type Mphf = PtrHash<

**Remap — `CachelineEfVec`:** Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for `Vec<u32>`). Already a transitive dependency of `ptr_hash`. One cacheline per query vs one u32 read; space win dominates for billion-scale key sets.

**Construction:** `new_from_par_iter` avoids materialising all keys as `Vec<u64>`. MPHF correctness is verified inline during the second pass (evidence/counts fill) using an n/8-byte bitset; any duplicate slot or out-of-bounds index returns `OLMError::Mphf`.
---

## Build path

The build path is not part of `LayerData`. Each mode exposes its own `impl Layer<D>::build` with the exact signature it needs. Two private module-level helpers avoid code duplication:

**`build_mphf(out_dir, n) -> OLMResult<Mphf>`**: first pass — opens `unitigs.bin`, iterates all canonical kmers in parallel via `new_from_par_iter`, stores `mphf.bin`. O(n).

**`build_second_pass(out_dir, n, mphf, fill_slot) -> OLMResult<()>`**: second pass — opens `unitigs.bin` again, fills `evidence.bin` and a compact n/8-byte seen-bitset (MPHF correctness check inline), calls `fill_slot(slot, kmer)` once per kmer for the mode-specific payload. O(n).

```rust
// mode 1
impl Layer<()> {
    pub fn build(out_dir: &Path) -> OLMResult<usize>
}

// mode 2
impl Layer<PersistentCompactIntVec> {
    pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}
```

Any duplicate slot or out-of-bounds index detected during `build_second_pass` returns `OLMError::Mphf`. `new_from_par_iter` avoids materialising all keys as `Vec<u64>`.
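
The inline correctness check can be sketched in isolation. The code below is a standalone illustration, not the crate's `build_second_pass`: `SeenBitset`, `SlotError` and `verify_mphf` are hypothetical names, and the `index` closure stands in for the real `mphf.index(..)` call. It uses one bit per slot, i.e. the n/8 bytes mentioned above, rounded up.

```rust
/// Failure modes of the check (hypothetical; the real code returns `OLMError::Mphf`).
#[derive(Debug, PartialEq, Eq)]
pub enum SlotError {
    OutOfBounds { slot: usize, n: usize },
    Duplicate { slot: usize },
}

/// One bit per slot: ceil(n / 8) bytes in total.
pub struct SeenBitset {
    bits: Vec<u8>,
    n: usize,
}

impl SeenBitset {
    pub fn new(n: usize) -> Self {
        Self { bits: vec![0u8; (n + 7) / 8], n }
    }

    /// Mark `slot` as used; fail if it is out of range or was already marked.
    pub fn mark(&mut self, slot: usize) -> Result<(), SlotError> {
        if slot >= self.n {
            return Err(SlotError::OutOfBounds { slot, n: self.n });
        }
        let (byte, mask) = (slot / 8, 1u8 << (slot % 8));
        if self.bits[byte] & mask != 0 {
            return Err(SlotError::Duplicate { slot });
        }
        self.bits[byte] |= mask;
        Ok(())
    }
}

/// Check that `index` maps the n keys onto 0..n without collision.
/// With n keys, n slots, no duplicate and no out-of-range slot, the mapping is a bijection.
pub fn verify_mphf(keys: &[u64], index: impl Fn(u64) -> usize) -> Result<(), SlotError> {
    let mut seen = SeenBitset::new(keys.len());
    for &k in keys {
        seen.mark(index(k))?;
    }
    Ok(())
}
```

Because the check needs only the slot of each key, it can run in the same pass that fills the evidence and payload, at the cost of one extra bit per key.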

---

@@ -148,20 +181,20 @@ type Mphf = PtrHash<
A kmer query routes through all three levels:

1. **Partition routing**: hash canonical minimiser of the query kmer → partition index → open `part_XXXXX/`.
2. **Layer probing**: iterate layers in order within the partition; for each layer compute `slot = mphf.index(kmer)`, then verify `evidence.decode(slot) == kmer`. First match wins.
3. **Data access**: read payload at `slot` from the matching layer (counts, presence row, etc.).
2. **Layer probing**: iterate layers in order; for each layer compute `slot = mphf.index(kmer)`, decode evidence, compare to query. First match wins.
3. **Data access**: `layer.data.read(slot)` returns `D::Item`.

```
fn query(kmer) → Option<Hit>:
    part = partition_of(kmer)
    for (i, layer) in part.layers.iter().enumerate():
```rust
// pseudo-code
fn query(kmer) -> Option<(usize, Hit<D::Item>)>:
    for (i, layer) in self.layers.iter().enumerate():
        slot = layer.mphf.index(&kmer.raw())
        if layer.evidence.decode(slot) == kmer:
            return Some(Hit { layer: i, slot })
            return Some((i, Hit { slot, data: layer.data.read(slot) }))
    return None
```

Expected probe depth: 1 for kmers in layer 0, increasing for later layers. In practice the dominant dataset should be layer 0.
Expected probe depth: 1 for kmers in layer 0, increasing for later layers.
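
A runnable version of the layer-probing loop, using toy stand-ins, may help show why the evidence comparison is required. Everything here is an assumption made for illustration: `ToyLayer` replaces the real MPHF and evidence decoding with a sorted key array, kmers are plain `u64` values, and partition routing is left out. Like a real MPHF, `ToyLayer::index` returns some in-range slot for every input, including keys that are not in the layer.

```rust
pub struct Hit<T = ()> {
    pub slot: usize,
    pub data: T,
}

/// Toy layer: a sorted key array stands in for MPHF + evidence (hypothetical).
pub struct ToyLayer {
    evidence: Vec<u64>, // the kmer stored at each slot
    counts: Vec<u32>,   // the payload stored at each slot
}

impl ToyLayer {
    /// Build from (kmer, count) pairs; kmers are assumed to be distinct.
    pub fn new(mut entries: Vec<(u64, u32)>) -> Self {
        entries.sort_unstable_by_key(|&(k, _)| k);
        let (evidence, counts): (Vec<u64>, Vec<u32>) = entries.into_iter().unzip();
        Self { evidence, counts }
    }

    /// Like an MPHF, returns *some* valid slot for any key of a non-empty layer.
    fn index(&self, kmer: u64) -> usize {
        match self.evidence.binary_search(&kmer) {
            Ok(i) => i,
            Err(i) => i % self.evidence.len(), // arbitrary slot for an absent key
        }
    }
}

/// Probe layers in order; the first layer whose evidence matches wins.
pub fn query(layers: &[ToyLayer], kmer: u64) -> Option<(usize, Hit<u32>)> {
    for (i, layer) in layers.iter().enumerate() {
        if layer.evidence.is_empty() {
            continue;
        }
        let slot = layer.index(kmer);
        // Without this comparison an absent kmer would return another kmer's payload.
        if layer.evidence[slot] == kmer {
            return Some((i, Hit { slot, data: layer.counts[slot] }));
        }
    }
    None
}
```

A kmer found in layer 0 costs one probe; a kmer absent from every layer costs one probe per layer before `None` is returned.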

---

@@ -188,16 +221,15 @@ Each partition's new layer is built independently; the operation is fully parall
| `epserde 0.8` | zero-copy serialisation of MPHF |
| `memmap2` | mmap of layer files |
| `obiskio` | unitig file writer/reader |

Flat arrays (evidence, counts, presence) use a simple custom binary format (raw bytes, fixed element size) opened via mmap — no additional serialisation crate required.
| `obicompactvec` | payload types: `PersistentCompactIntVec`, `PersistentBitVec` |

---

## Open questions

- **Mode 4 scale**: count matrix (n_kmers × n_genomes × bytes_per_count) reaches hundreds of GB for large collections. A sparse representation (store only non-zero entries) may be required; the access pattern and density threshold are not yet defined.
- **Count vector encoding (mode 2)**: `Vec<u16>` vs bit-packed (B=14). Decision pending; depends on whether counts > 65 535 occur in practice for the target datasets.
- **Presence matrix layout**: column-major favours per-genome operations; row-major favours per-kmer queries. Decide based on dominant access pattern.
- **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model; maintenance operation, not query-path.
- **Mode 3/4 multi-file layout**: one PBIV/PCIV per genome per layer means O(n_layers × n_genomes) files. Directory layout, open strategy, and query API are not yet designed.
- **Mode 4 scale**: count matrix (n_kmers × n_genomes × bytes_per_count) reaches hundreds of GB for large collections. A sparse representation may be required; access pattern and density threshold are not yet defined.
- **Presence matrix layout**: column-major (one PBIV per genome) favours per-genome operations; row-major favours per-kmer queries. Dominant access pattern not yet characterised.
- **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model.
- **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
- **`try_new_from_par_iter`**: `ptr_hash::new_from_par_iter` silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A `try_new_from_par_iter` PR upstream would close this gap.
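
For the canonical-orientation item, the cost of strand recovery can be illustrated with a word-level reverse complement. The encoding used here is an assumption, since the notes do not specify it: 2 bits per base with A=0, C=1, G=2, T=3, the kmer right-aligned in the low 2k bits with the first base most significant, and the canonical form taken as the numerically smaller of the two strands. `revcomp` and `canonical` are hypothetical helper names.

```rust
/// Reverse complement of a 2-bit packed kmer (assumed encoding: A=0, C=1, G=2, T=3,
/// right-aligned in the low 2k bits, first base in the most significant pair).
pub fn revcomp(kmer: u64, k: u32) -> u64 {
    assert!((1..=32).contains(&k), "k must be in 1..=32 for a u64 kmer");
    // Complement: with this encoding, 3 - base is the bitwise NOT of each 2-bit pair.
    let mut x = !kmer;
    // Reverse the order of the 32 two-bit pairs in the word.
    x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    x = x.swap_bytes();
    // The kmer now sits in the high 2k bits; shift it back down.
    x >> (64 - 2 * k)
}

/// Canonical form plus strand flag: `true` if the input was already canonical.
pub fn canonical(kmer: u64, k: u32) -> (u64, bool) {
    let rc = revcomp(kmer, k);
    if kmer <= rc {
        (kmer, true)
    } else {
        (rc, false)
    }
}
```

Under these assumptions, the strand of a query is recovered by comparing the query word with the canonical word, which is the single 64-bit comparison the item refers to.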