docs: clarify MPHF indexing, storage layout, and distance traits
Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
@@ -2,40 +2,66 @@
|
||||
|
||||
## Purpose
|
||||
|
||||
`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **collection → partition → layer**. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
|
||||
`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **index root → partition → layer**. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
|
||||
|
||||
---
|
||||
|
||||
## Four usage modes
|
||||
## Three usage modes
|
||||
|
||||
The MPHF + evidence infrastructure is fixed for all modes. The **payload** — data associated with each slot — is orthogonal and varies by mode.
|
||||
The MPHF + evidence infrastructure is the same for all modes. The **payload** varies.
|
||||
|
||||
| Mode | Description | Payload type | Storage |
|
||||
|---|---|---|---|
|
||||
| 1. Set | membership test only | `()` | — |
|
||||
| 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory |
|
||||
| 3. Presence/absence matrix | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |
|
||||
| 4. Count matrix | occurrences per kmer per genome | `PersistentCompactIntMatrix` | `counts/` directory |
|
||||
| 3. Presence/absence | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |
|
||||
|
||||
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate. Mode 3 has a build path (`Layer::<PersistentBitMatrix>::build_presence`); mode 4 is not yet implemented.
|
||||
|
||||
### Payload for modes 2/4: PersistentCompactIntMatrix
|
||||
|
||||
`PersistentCompactIntMatrix` is a column-major matrix stored in a directory: one `col_NNNNNN.pciv` file per column, plus a `meta.json`. Each column is a `PersistentCompactIntVec` — a mmap'd PCIV file with a `u8` primary array (255 = overflow sentinel), a sorted overflow section of `(slot: u64, value: u32)` entries, and a sparse L1-fitting index.
|
||||
|
||||
Mode 2 writes 1 column per layer (one sample). Mode 4 writes G columns (one per genome). `read(slot)` returns `Box<[u32]>` — the full row across all columns.
|
||||
|
||||
### Payload for mode 3: PersistentBitMatrix
|
||||
|
||||
`PersistentBitMatrix` is a column-major bit matrix stored in a directory: one `col_NNNNNN.pbiv` per genome, plus `meta.json`. Each column is a `PersistentBitVec` — a mmap'd PBIV file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). `read(slot)` returns `Box<[bool]>` — the presence vector across all genomes.
|
||||
|
||||
Column-major layout makes per-genome set operations cache-friendly; the full row is assembled on demand at query time.
|
||||
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate.
|
||||
|
||||
---
|
||||
|
||||
## Payload architecture
|
||||
## MphfLayer — autonomous kmer → slot mapping
|
||||
|
||||
The payload is orthogonal to the MPHF + evidence layer. `Layer` is parameterised by `D: LayerData`:
|
||||
`MphfLayer` encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.
|
||||
|
||||
```rust
pub struct MphfLayer {
    mphf: Mphf,
    evidence: Evidence,
    unitigs: UnitigFileReader,
    n: usize, // number of indexed kmers = number of MPHF slots
}
```
|
||||
|
||||
Public API:
|
||||
|
||||
```rust
impl MphfLayer {
    pub fn open(dir: &Path) -> OLMResult<Self>
    pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> // Some(slot) or None
    pub fn n(&self) -> usize
    pub fn unitig_writer(dir: &Path) -> OLMResult<UnitigFileWriter>
    pub(crate) fn build(
        dir: &Path,
        fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
    ) -> OLMResult<usize>
}
```
|
||||
|
||||
`find` returns `Some(slot)` only after verifying via evidence that the kmer is actually indexed, and `None` for absent keys. Because ptr_hash maps any input, indexed or not, to a valid slot, evidence verification is the only reliable membership test.
|
||||
|
||||
`build` runs two sequential passes over `unitigs.bin`:
|
||||
|
||||
1. **Pass 1**: iterate all canonical kmers in parallel via rayon, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
2. **Pass 2**: iterate again sequentially, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.
|
||||
|
||||
For empty layers (n = 0), `build` returns `Ok(0)` immediately after creating empty `mphf.bin` and `evidence.bin`.
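
The inline injectivity check in pass 2 can be pictured with a minimal sketch. `SeenBits` and `verify_injective` are hypothetical names, the slot sequence stands in for the result of calling the MPHF on each kmer, and the `String` error is a placeholder for whatever error type the crate actually returns.

```rust
/// Hypothetical compact bitset: one bit per MPHF slot, ceil(n / 8) bytes.
struct SeenBits {
    bits: Vec<u8>,
    n: usize,
}

impl SeenBits {
    fn new(n: usize) -> Self {
        SeenBits { bits: vec![0u8; (n + 7) / 8], n }
    }

    /// Mark `slot` as used; fail if it is out of range or already marked.
    fn mark(&mut self, slot: usize) -> Result<(), String> {
        if slot >= self.n {
            return Err(format!("slot {} out of bounds (n = {})", slot, self.n));
        }
        let (byte, bit) = (slot / 8, 1u8 << (slot % 8));
        if self.bits[byte] & bit != 0 {
            return Err(format!("slot {} assigned twice", slot));
        }
        self.bits[byte] |= bit;
        Ok(())
    }
}

/// Check that a stream of slots (one per kmer) hits each of the `n` slots exactly once.
fn verify_injective<I: IntoIterator<Item = usize>>(n: usize, slots: I) -> Result<(), String> {
    let mut seen = SeenBits::new(n);
    let mut count = 0usize;
    for slot in slots {
        seen.mark(slot)?;
        count += 1;
    }
    if count != n {
        return Err(format!("expected {} kmers, saw {}", n, count));
    }
    Ok(())
}
```

With `n` kmers and `n` slots, the absence of duplicates and out-of-range slots implies a bijection, so no separate pass over the bitset is needed.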
|
||||
|
||||
---
|
||||
|
||||
## Layer\<D: LayerData\> — MPHF + payload
|
||||
|
||||
`Layer<D>` pairs an `MphfLayer` with one payload store.
|
||||
|
||||
```rust
|
||||
pub trait LayerData: Sized {
|
||||
@@ -45,10 +71,8 @@ pub trait LayerData: Sized {
|
||||
}
|
||||
|
||||
pub struct Layer<D: LayerData = ()> {
|
||||
mphf: Mphf,
|
||||
evidence: Evidence,
|
||||
unitigs: UnitigFileReader,
|
||||
data: D,
|
||||
mphf: MphfLayer,
|
||||
data: D,
|
||||
}
|
||||
|
||||
pub struct Hit<T = ()> {
|
||||
@@ -57,115 +81,15 @@ pub struct Hit<T = ()> {
|
||||
}
|
||||
```
|
||||
|
||||
`LayerData` covers the **read path only** (`open` + `read`). The write path (build) is intentionally not in the trait — build signatures differ between modes and forcing this into a trait would require an associated `Context` type with no benefit over specialized `impl` blocks.
|
||||
|
||||
Implemented concrete types:
|
||||
`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not in the trait.
|
||||
|
||||
| Type | `Item` | Description |
|
||||
|---|---|---|
|
||||
| `()` | `()` | mode 1 — membership only |
|
||||
| `PersistentCompactIntMatrix` | `Box<[u32]>` | modes 2/4 — one count per column |
|
||||
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — one presence bit per column |
|
||||
| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) |
|
||||
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) |
|
||||
|
||||
`LayeredMap` mirrors the same parameterisation: `LayeredMap<D: LayerData = ()>`.
|
||||
|
||||
---
|
||||
|
||||
## Three-level hierarchy
|
||||
|
||||
```
index_root/                  ← LayeredMap (collection)
  meta.json
  part_00000/                ← Partition
    layer_0/                 ← Layer
      mphf.bin
      unitigs.bin
      unitigs.bin.idx
      evidence.bin
      counts/                [modes 2/4]
        meta.json            {"n": N, "n_cols": 1}
        col_000000.pciv
      presence/              [mode 3]
        meta.json            {"n": N, "n_cols": G}
        col_000000.pbiv
        col_000001.pbiv
        ...
    layer_1/
      ...
  part_00001/
    layer_0/
      ...
```
|
||||
|
||||
**Collection** (`index_root/`): global metadata — kmer size k, number of partitions, layer count, sample registry.
|
||||
|
||||
**Partition** (`part_XXXXX/`): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in `part_XXXXX`. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in `meta.json`.
|
||||
|
||||
**Layer** (`layer_N/`): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer.
|
||||
|
||||
---
|
||||
|
||||
## Layer file layout
|
||||
|
||||
```
layer_N/
  mphf.bin          — ptr_hash MPHF (epserde, ptr_hash native format)
  unitigs.bin       — packed 2-bit nucleotide sequences (obiskio binary format)
  unitigs.bin.idx   — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
  evidence.bin      — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
  counts/           — [modes 2/4] PersistentCompactIntMatrix
  presence/         — [mode 3] PersistentBitMatrix
```
|
||||
|
||||
`unitigs.bin` is the packed-2-bit sequence file produced by `obiskio::UnitigFileWriter`. The companion `.idx` file stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
|
||||
|
||||
### Evidence encoding
|
||||
|
||||
Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction.
|
||||
|
||||
```
slot s → unitig_id: u25 | rank: u7
```

Packed into a `u32` (25 + 7 = 32 bits, no spare bits). Decoding:

```
kmer = unitigs[unitig_id][rank .. rank + k]   // 2-bit packed slice
```
|
||||
|
||||
`rank` is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k − m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A `u7` (0–127) is sufficient.
|
||||
|
||||
---
|
||||
|
||||
## ptr_hash configuration
|
||||
|
||||
The MPHF per layer is configured as:
|
||||
|
||||
```rust
type Mphf = PtrHash<
    u64,                              // key type: canonical kmer raw encoding
    CubicEps,                         // bucket fn: balanced (2.4 bits/key, λ=3.5)
    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry vs 32 for Vec<u32>
    Xx64,                             // hasher: XXH3-64 with seed, handles structured keys
    Vec<u8>,                          // pilots
>;
```
|
||||
|
||||
**Hasher choice — `Xx64`:** k-mer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). `FxHash` (single multiply) distributes these poorly. `Xx64` (XXH3 64-bit, seeded) handles structured input correctly.
|
||||
|
||||
**Bucket function — `CubicEps` with `PtrHashParams::<CubicEps>::default()`:** λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than `Linear/λ=3.0` (the `default_fast` preset), 20% less space. `default_compact` (λ=4.0) saves a further 12.5% at 2× more construction time and reduced reliability — not chosen.
|
||||
|
||||
**Remap — `CachelineEfVec`:** Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for `Vec<u32>`). Already a transitive dependency of `ptr_hash`. One cacheline per query vs one u32 read; space win dominates for billion-scale key sets.
|
||||
|
||||
---
|
||||
|
||||
## Build path
|
||||
|
||||
The build path is not part of `LayerData`. Each mode exposes its own `impl Layer<D>::build` with the exact signature it needs. Two private module-level helpers avoid code duplication:
|
||||
|
||||
**`build_mphf(out_dir, n) -> OLMResult<Mphf>`**: first pass — opens `unitigs.bin`, iterates all canonical kmers in parallel via `new_from_par_iter`, stores `mphf.bin`. O(n).
|
||||
|
||||
**`build_second_pass(out_dir, n, mphf, fill_slot) -> OLMResult<()>`**: second pass — opens `unitigs.bin` again, fills `evidence.bin` and a compact n/8-byte seen-bitset (MPHF correctness check inline), calls `fill_slot(slot, kmer)` once per kmer for the mode-specific payload. O(n).
|
||||
**Build signatures:**
|
||||
|
||||
```rust
|
||||
// mode 1
|
||||
@@ -173,7 +97,7 @@ impl Layer<()> {
|
||||
pub fn build(out_dir: &Path) -> OLMResult<usize>
|
||||
}
|
||||
|
||||
// modes 2/4
|
||||
// mode 2
|
||||
impl Layer<PersistentCompactIntMatrix> {
|
||||
pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
|
||||
pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
|
||||
@@ -189,33 +113,119 @@ impl Layer<PersistentBitMatrix> {
|
||||
}
|
||||
```
|
||||
|
||||
Mode 2 creates a `PersistentCompactIntMatrixBuilder` with 1 column and fills it via `build_second_pass`. Mode 3 creates a `PersistentBitMatrixBuilder` with `n_genomes` columns and fills all columns in a single pass.
|
||||
All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Mode 2 pre-reads `n_kmers` from `unitigs.bin` to size the `PersistentCompactIntMatrixBuilder` before calling `MphfLayer::build`. Mode 3 does the same for `PersistentBitMatrixBuilder`.
|
||||
|
||||
Any duplicate slot or out-of-bounds index detected during `build_second_pass` returns `OLMError::Mphf`. `new_from_par_iter` avoids materialising all keys as `Vec<u64>`.
|
||||
---
|
||||
|
||||
## LayeredStore\<S\> and aggregation traits
|
||||
|
||||
`LayeredStore<S>` is a generic aggregation wrapper over `Vec<S>`. It propagates three traits from `obicompactvec::traits` up the hierarchy via blanket impls:
|
||||
|
||||
```rust
pub struct LayeredStore<S>(pub Vec<S>);

impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … }  // Σ col_weights across inner stores
impl<S: CountPartials> CountPartials for LayeredStore<S> { … }  // element-wise Σ partials
impl<S: BitPartials>   BitPartials   for LayeredStore<S> { … }  // element-wise Σ partials
```
|
||||
|
||||
Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type.
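
A minimal sketch of how the blanket impls compose. `ColumnWeightsSketch` and `Leaf` are hypothetical stand-ins: the real `ColumnWeights` trait in `obicompactvec` may expose different methods and return types, and the toy leaf replaces a real matrix.

```rust
/// Hypothetical stand-in for `ColumnWeights`; the real trait may differ.
trait ColumnWeightsSketch {
    fn col_weights(&self) -> Vec<u64>;
}

/// Same shape as the wrapper described in the text.
struct LayeredStore<S>(pub Vec<S>);

/// Blanket impl: the weights of a store are the element-wise sum over its inner stores.
impl<S: ColumnWeightsSketch> ColumnWeightsSketch for LayeredStore<S> {
    fn col_weights(&self) -> Vec<u64> {
        let mut total: Vec<u64> = Vec::new();
        for store in &self.0 {
            let w = store.col_weights();
            if total.len() < w.len() {
                total.resize(w.len(), 0);
            }
            for (t, x) in total.iter_mut().zip(w) {
                *t += x;
            }
        }
        total
    }
}

/// Toy leaf standing in for a payload matrix (assumption for illustration).
struct Leaf(Vec<u64>);

impl ColumnWeightsSketch for Leaf {
    fn col_weights(&self) -> Vec<u64> {
        self.0.clone()
    }
}
```

Under these assumptions, `LayeredStore<LayeredStore<Leaf>>` gets `col_weights` with no extra code, because the inner `LayeredStore<Leaf>` already satisfies the bound required by the outer one.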
|
||||
|
||||
**Aggregation hierarchy:**
|
||||
|
||||
```
PersistentCompactIntMatrix                   implements CountPartials
LayeredStore<PersistentCompactIntMatrix>     via blanket impl (one partition)
LayeredStore<LayeredStore<…>>                via blanket impl (partitioned index)
```
|
||||
|
||||
**Leaf implementors** (in `obicompactvec`):
|
||||
|
||||
| Type | Traits |
|---|---|
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |
|
||||
|
||||
`PersistentCompactIntVec` and `PersistentBitVec` do not implement these traits — they are single-column primitives, not matrix-level aggregators.
|
||||
|
||||
See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.
|
||||
|
||||
---
|
||||
|
||||
## On-disk structure
|
||||
|
||||
```
index_root/                  ← LayeredMap (collection)
  meta.json
  part_00000/                ← Partition
    layer_0/                 ← Layer
      mphf.bin               — ptr_hash MPHF (epserde format)
      unitigs.bin            — packed 2-bit nucleotide sequences
      unitigs.bin.idx        — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
      evidence.bin           — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
      counts/                [mode 2] PersistentCompactIntMatrix
        meta.json            {"n": N, "n_cols": 1}
        col_000000.pciv
      presence/              [mode 3] PersistentBitMatrix
        meta.json            {"n": N, "n_cols": G}
        col_000000.pbiv
        …
    layer_1/
      …
  part_00001/
    …
```
|
||||
|
||||
**Partition** (`part_XXXXX/`): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.
|
||||
|
||||
**Layer** (`layer_N/`): one `MphfLayer` plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.
|
||||
|
||||
---
|
||||
|
||||
## Evidence encoding
|
||||
|
||||
`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:
|
||||
|
||||
```
bits [31:7] = chunk_id  (25 bits) — index of the unitig chunk
bits  [6:0] = rank      (7 bits)  — kmer index within the chunk (0-based)
```
|
||||
|
||||
Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id`.
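
The bit layout above can be exercised with a small sketch. The function names are hypothetical and not part of the crate's API; only the 25/7 split, the shift and mask, and the little-endian flat layout come from the description.

```rust
/// Low bits reserved for the kmer rank within a chunk.
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F
const MAX_CHUNK_ID: u32 = (1 << 25) - 1;

/// Pack (chunk_id, rank) into one evidence word; `None` if either field overflows.
fn encode_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Split an evidence word back into (chunk_id, rank).
fn decode_evidence(raw: u32) -> (u32, u32) {
    (raw >> RANK_BITS, raw & RANK_MASK)
}

/// Read the word for `slot` from a headerless little-endian `[u32; n]` byte buffer.
fn read_slot(bytes: &[u8], slot: usize) -> Option<u32> {
    let start = slot.checked_mul(4)?;
    let end = start.checked_add(4)?;
    let w = bytes.get(start..end)?;
    Some(u32::from_le_bytes([w[0], w[1], w[2], w[3]]))
}
```

Because the file has no header, slot `s` lives at byte offset `4 * s`, which is what makes the lookup O(1) over a memory-mapped buffer.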
|
||||
|
||||
For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.
|
||||
|
||||
---
|
||||
|
||||
## ptr_hash configuration
|
||||
|
||||
```rust
type Mphf = PtrHash<
    u64,                              // key type: canonical kmer raw encoding
    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
    Xx64,                             // hasher: XXH3-64 with seed
    Vec<u8>,                          // pilots
>;
```
|
||||
|
||||
`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
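
The zero counts quoted above follow from 64 − 2k. The sketch below checks that arithmetic with a hypothetical left-aligned encoder; the base codes (A=0, C=1, G=2, T=3) are an assumption for illustration and may not match the crate's actual encoding.

```rust
/// Number of always-zero low bits when a 2-bit-per-base k-mer is left-aligned in a u64.
fn structural_zero_bits(k: u32) -> Option<u32> {
    if k == 0 || k > 32 {
        return None;
    }
    Some(64 - 2 * k)
}

/// Hypothetical encoder: pack bases from the most significant bits downward.
fn encode_left_aligned(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let shift = structural_zero_bits(seq.len() as u32)?;
    let mut raw: u64 = 0;
    for &base in seq {
        let code = match base {
            b'A' => 0u64,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None, // reject anything outside ACGT
        };
        raw = (raw << 2) | code;
    }
    // Move the packed bases to the top of the word; the low `shift` bits stay zero.
    Some(raw << shift)
}
```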
|
||||
|
||||
`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5) is a balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, 20% less space.
|
||||
|
||||
---
|
||||
|
||||
## Query path
|
||||
|
||||
A kmer query routes through all three levels:
|
||||
|
||||
1. **Partition routing**: hash canonical minimiser of the query kmer → partition index → open `part_XXXXX/`.
2. **Layer probing**: iterate layers in order; for each layer compute `slot = mphf.index(kmer)`, decode evidence, compare to query. First match wins.
3. **Data access**: `layer.data.read(slot)` returns `D::Item`.
|
||||
|
||||
```rust
|
||||
// pseudo-code
|
||||
fn query(kmer) -> Option<(usize, Hit<D::Item>)>:
|
||||
for (i, layer) in self.layers.iter().enumerate():
|
||||
slot = layer.mphf.index(&kmer.raw())
|
||||
if layer.evidence.decode(slot) == kmer:
|
||||
return Some((i, Hit { slot, data: layer.data.read(slot) }))
|
||||
return None
|
||||
pub fn query(&self, kmer: CanonicalKmer) -> Option<Hit<D::Item>> {
|
||||
self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) })
|
||||
}
|
||||
```
|
||||
|
||||
Expected probe depth: 1 for kmers in layer 0, increasing for later layers.
|
||||
`MphfLayer::find` probes the MPHF, decodes evidence, and verifies the kmer — returning `Some(slot)` on match, `None` otherwise. `data.read(slot)` is called only on a confirmed hit.
|
||||
|
||||
For mode 2, `hit.data` is `Box<[u32]>` with 1 element; `hit.data[0]` is the count. For mode 3, `hit.data` is `Box<[bool]>` with G elements, one per genome.
|
||||
In `LayeredMap`, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.
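
First-match probing across layers can be sketched as follows. `FindSlot` and `ToyLayer` are hypothetical stand-ins for `MphfLayer::find`, and the `u64` key replaces `CanonicalKmer` to keep the example self-contained.

```rust
/// Hypothetical minimal view of a layer: a verified kmer → slot lookup.
trait FindSlot {
    fn find(&self, kmer: u64) -> Option<usize>;
}

/// Probe layers in order and return (layer index, slot) for the first verified hit.
fn probe_layers<L: FindSlot>(layers: &[L], kmer: u64) -> Option<(usize, usize)> {
    layers
        .iter()
        .enumerate()
        .find_map(|(i, layer)| layer.find(kmer).map(|slot| (i, slot)))
}

/// Toy layer backed by a plain vector; the slot is the position in the vector.
struct ToyLayer(Vec<u64>);

impl FindSlot for ToyLayer {
    fn find(&self, kmer: u64) -> Option<usize> {
        self.0.iter().position(|&k| k == kmer)
    }
}
```

Since layers are disjoint, at most one layer can report a hit, so stopping at the first match loses nothing; an absent kmer costs one probe per layer.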
|
||||
|
||||
---
|
||||
|
||||
@@ -223,11 +233,11 @@ For mode 2, `hit.data` is `Box<[u32]>` with 1 element; `hit.data[0]` is the coun
|
||||
|
||||
When adding dataset B to an existing index:
|
||||
|
||||
1. For each partition, iterate kmers of B routed to that partition.
|
||||
2. Probe existing layers; collect kmers absent from all layers → `B \ index`.
|
||||
3. Build a new layer from `B \ index`.
|
||||
4. Append the new layer directory under each `part_XXXXX/`.
|
||||
5. Update `meta.json` (layer count, sample registry).
|
||||
1. For each partition, probe existing layers for kmers of B routed to that partition.
|
||||
2. Collect kmers absent from all layers → `B \ index`.
|
||||
3. Write `B \ index` to a new `unitigs.bin` via `MphfLayer::unitig_writer`.
|
||||
4. Call `Layer<D>::build` on the new directory.
|
||||
5. Update `meta.json`.
|
||||
|
||||
Each partition's new layer is built independently; the operation is fully parallel across partitions.
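
The `B \ index` step for one partition can be sketched as a filter. Here each existing layer is modelled as a `HashSet<u64>`, which is an assumption for illustration; the real code would probe `MphfLayer::find` instead, and `novel_kmers` is a hypothetical name.

```rust
use std::collections::HashSet;

/// Return the kmers of `dataset` absent from every existing layer,
/// deduplicated and in first-seen order.
fn novel_kmers(existing_layers: &[HashSet<u64>], dataset: &[u64]) -> Vec<u64> {
    let mut emitted = HashSet::new();
    dataset
        .iter()
        .copied()
        // Keep only kmers that no existing layer contains.
        .filter(|k| !existing_layers.iter().any(|layer| layer.contains(k)))
        // Drop repeats within the dataset itself.
        .filter(|k| emitted.insert(*k))
        .collect()
}
```

The result is what would be written to the new layer's `unitigs.bin` before building; since each partition only sees its own kmers, the filter can run independently per partition.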
|
||||
|
||||
@@ -237,24 +247,11 @@ Each partition's new layer is built independently; the operation is fully parall
|
||||
|
||||
| crate | role |
|
||||
|---|---|
|
||||
| `ptr_hash 1.1` | MPHF per layer (epserde serialisation) |
|
||||
| `cacheline-ef 1.1` | compact remap storage inside ptr_hash |
|
||||
| `epserde 0.8` | zero-copy serialisation of MPHF |
|
||||
| `memmap2` | mmap of layer files |
|
||||
| `ptr_hash 1.1` | MPHF per layer |
|
||||
| `cacheline-ef 1.1` | compact remap inside ptr_hash |
|
||||
| `epserde 0.8` | zero-copy MPHF serialisation |
|
||||
| `memmap2 0.9` | mmap of evidence and payload files |
|
||||
| `obiskio` | unitig file writer/reader |
|
||||
| `obicompactvec` | payload types: `PersistentCompactIntMatrix`, `PersistentBitMatrix` |
|
||||
|
||||
---
|
||||
|
||||
## Relationship to target architecture
|
||||
|
||||
The target architecture (see [Kmer index architecture](../architecture/index_architecture.md)) separates `MphfLayer` from data stores entirely and introduces a `PartitionedIndex` with parallel dispatch and an `Aggregator` pattern. The current implementation is a stepping stone: `obicompactvec` types are already fully decoupled from the MPHF; the remaining refactoring is within `obilayeredmap` itself.
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
- **Mode 4**: count matrix (n_kmers × n_genomes × bytes_per_count) is structurally identical to mode 3 but uses `PersistentCompactIntMatrix` with G columns. Build API not yet implemented. Scale concern: hundreds of GB for large collections — a sparse representation may be required at high genome counts.
|
||||
- **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model.
|
||||
- **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
|
||||
- **`try_new_from_par_iter`**: `ptr_hash::new_from_par_iter` silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A `try_new_from_par_iter` PR upstream would close this gap.
|
||||
| `obicompactvec` | payload types + aggregation traits |
|
||||
| `rayon 1` | parallel MPHF construction pass |
|
||||
| `ndarray 0.16` | aggregation output arrays |
|
||||
|
||||