docs: update architecture and storage specs for approximate index
Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
@@ -2,7 +2,7 @@

## Purpose

`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **index root → partition → layer**. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
`obilayeredmap` implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.

---

@@ -20,42 +20,81 @@ Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obico

---

## Evidence kinds

Each layer carries one of two evidence bundles, recorded in `layer_meta.json` at build time:

```rust
pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}
```

`EvidenceKind` is stored in `LayerMeta` (one per layer directory). `open()` reads it to decide which evidence files to load.

- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives. Requires random-access `.idx` at query time.
- **Approx**: writes `fingerprint.bin` only. False-positive rate per kmer query = 1/2^b. `z` is the Findere consecutive-kmer parameter: `z` consecutive kmers must all match, reducing the effective FP rate per read to approximately W / 2^(b·z) where W = L − k − z + 2 is the number of windows in a read of length L. No `.idx` written or required.
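
The false-positive arithmetic for approximate evidence can be checked with a few lines of Rust. This is a minimal sketch, not part of the crate: the function names `n_windows`, `fp_per_kmer` and `fp_per_read` are hypothetical, and the per-read figure is the approximation W / 2^(b·z) stated above.

```rust
/// Number of windows of `z` consecutive kmers in a read of length `read_len`.
/// A read of length L holds L - k + 1 kmers, hence (L - k + 1) - z + 1 windows.
/// Returns 0 when the read is too short to hold a single window.
fn n_windows(read_len: usize, k: usize, z: usize) -> usize {
    (read_len + 2).saturating_sub(k + z)
}

/// Per-kmer false-positive rate of a b-bit fingerprint: 1 / 2^b.
fn fp_per_kmer(b: u8) -> f64 {
    0.5f64.powi(i32::from(b))
}

/// Approximate per-read false-positive rate: W / 2^(b*z).
fn fp_per_read(read_len: usize, k: usize, b: u8, z: u8) -> f64 {
    let w = n_windows(read_len, k, usize::from(z)) as f64;
    w * fp_per_kmer(b).powi(i32::from(z))
}
```

For example, with L = 150, k = 31, b = 8 and z = 3 there are 118 windows, so the approximate per-read rate is 118 / 2^24, roughly 7 × 10⁻⁶.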

---

## MphfLayer — autonomous kmer → slot mapping

`MphfLayer` encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.
`MphfLayer` encapsulates the MPHF and evidence store for one layer. It is independent of any payload.

```rust
pub struct MphfLayer {
    mphf: Mphf,
    evidence: Evidence,
    unitigs: UnitigFileReader,
    n: usize, // number of indexed kmers = number of MPHF slots
    mphf: Mphf,
    ev: LayerEvidence, // loaded at open() time
    n: usize,
}
```

Public API:
`LayerEvidence` is an internal enum, not public:

```rust
impl MphfLayer {
    pub fn open(dir: &Path) -> OLMResult<Self>
    pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> // Some(slot) or None
    pub fn n(&self) -> usize
    pub fn unitig_writer(dir: &Path) -> OLMResult<UnitigFileWriter>
    pub(crate) fn build(
        dir: &Path,
        fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
    ) -> OLMResult<usize>
enum LayerEvidence {
    Exact { evidence: Evidence, unitigs: UnitigFileReader },
    Approx { fingerprint: FingerprintVec },
}
```

`find` returns `Some(slot)` only after verifying via evidence that the kmer is actually indexed. It returns `None` for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test).
### Query API

`build` runs two sequential passes over `unitigs.bin`:
Three public query methods, all returning `Option<usize>` (slot index):

1. **Pass 1**: iterate all canonical kmers in parallel via rayon, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
2. **Pass 2**: iterate again sequentially, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.
```rust
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_exact(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_approx(&self, kmer: CanonicalKmer) -> Option<usize>
```

For empty layers (n = 0), `build` returns `Ok(0)` immediately after creating empty `mphf.bin` and `evidence.bin`.
- `find` dispatches transparently to `find_exact` or `find_approx` based on the evidence variant loaded at `open()`.
- `find_exact` panics if the layer holds approximate evidence; zero false positives.
- `find_approx` panics if the layer holds exact evidence; FP rate 1/2^b per kmer.
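
The dispatch rules above can be illustrated with a toy model. This is only a sketch under loud assumptions: `ToyMphfLayer`, `ToyEvidence`, `toy_hash` and `low_bits` are hypothetical names, plain vectors replace `Evidence`, `UnitigFileReader` and `FingerprintVec`, and `kmer % n` stands in for the MPHF (like `ptr_hash`, it maps any key to some valid slot, so verification is what decides membership).

```rust
/// Toy evidence store; the real payloads are replaced by vectors indexed by slot.
enum ToyEvidence {
    Exact { kmers: Vec<u64> },
    Approx { b: u8, fingerprints: Vec<u64> },
}

struct ToyMphfLayer {
    ev: ToyEvidence,
    n: usize,
}

/// Hypothetical hash used for fingerprints in this sketch only.
fn toy_hash(kmer: u64) -> u64 {
    let h = kmer.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    h ^ (h >> 32)
}

/// Keeps the low `b` bits of `hash`.
fn low_bits(hash: u64, b: u8) -> u64 {
    if b >= 64 { hash } else { hash & ((1u64 << b) - 1) }
}

impl ToyMphfLayer {
    /// Stand-in for the MPHF: any key lands on a valid slot (None only if empty).
    fn slot_of(&self, kmer: u64) -> Option<usize> {
        if self.n == 0 { None } else { Some((kmer % self.n as u64) as usize) }
    }

    /// Dispatches on the evidence variant held by the layer.
    fn find(&self, kmer: u64) -> Option<usize> {
        match &self.ev {
            ToyEvidence::Exact { .. } => self.find_exact(kmer),
            ToyEvidence::Approx { .. } => self.find_approx(kmer),
        }
    }

    /// Exact check: the kmer stored at the slot must equal the query.
    fn find_exact(&self, kmer: u64) -> Option<usize> {
        let ToyEvidence::Exact { kmers } = &self.ev else {
            panic!("find_exact called on a layer with approximate evidence");
        };
        let slot = self.slot_of(kmer)?;
        (kmers[slot] == kmer).then_some(slot)
    }

    /// Approximate check: only the b-bit fingerprint is compared.
    fn find_approx(&self, kmer: u64) -> Option<usize> {
        let ToyEvidence::Approx { b, fingerprints } = &self.ev else {
            panic!("find_approx called on a layer with exact evidence");
        };
        let slot = self.slot_of(kmer)?;
        (fingerprints[slot] == low_bits(toy_hash(kmer), *b)).then_some(slot)
    }
}
```

In this model an absent key can still be reported by `find_approx` when its fingerprint happens to collide with the one stored at its slot, which is the 1/2^b false-positive case; `find_exact` cannot report an absent key.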

`open()` requires `unitigs.bin.idx` (random access into unitigs). `open_sequential()` on `UnitigFileReader` does not require the `.idx` and is used during build passes.

### Build surface

```rust
// Full MPHF + exact evidence build (two-pass, parallel)
pub(crate) fn build(dir, block_bits, fill_slot) -> OLMResult<usize>

// Evidence-only builds (MPHF already present in dir)
pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
pub fn build_approx_evidence(dir, b, z) -> OLMResult<usize>
pub fn build_evidence(dir, kind, block_bits) -> OLMResult<usize> // dispatch
```

`MphfLayer::build` runs two sequential passes over `unitigs.bin`:

1. **Pass 1** (parallel via rayon): iterate all canonical kmers, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
2. **Pass 2** (sequential): iterate again, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.
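
The inline injectivity check of pass 2 can be sketched as follows. This is an illustration under assumptions, not the crate's code: `SeenBits` and `fill_and_verify` are hypothetical names, keys are plain `u64` values, and `slot_of` is any closure standing in for the MPHF lookup.

```rust
/// Compact seen-bitset: one bit per slot, ceil(n / 8) bytes.
struct SeenBits {
    bits: Vec<u8>,
}

impl SeenBits {
    fn new(n: usize) -> Self {
        Self { bits: vec![0u8; (n + 7) / 8] }
    }

    /// Marks `slot`; returns false if it was already marked.
    fn insert(&mut self, slot: usize) -> bool {
        let (byte, mask) = (slot / 8, 1u8 << (slot % 8));
        let fresh = self.bits[byte] & mask == 0;
        self.bits[byte] |= mask;
        fresh
    }
}

/// Pass-2 style loop: visits each key once, checks that `slot_of` is
/// injective and stays within 0..n, then hands the slot to `fill_slot`.
fn fill_and_verify(
    keys: &[u64],
    slot_of: impl Fn(u64) -> usize,
    mut fill_slot: impl FnMut(usize, u64),
) -> Result<usize, String> {
    let n = keys.len();
    let mut seen = SeenBits::new(n);
    for &key in keys {
        let slot = slot_of(key);
        if slot >= n {
            return Err(format!("slot {slot} out of range for n = {n}"));
        }
        if !seen.insert(slot) {
            return Err(format!("slot {slot} assigned twice"));
        }
        fill_slot(slot, key);
    }
    Ok(n)
}
```

Because every slot is below n and no slot is seen twice, a successful run over n keys proves the mapping is a bijection onto 0..n, at a memory cost of one bit per slot.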

`build` always produces exact evidence. For approximate evidence, use `build_approx_evidence` after MPHF construction.

For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files.

---

@@ -81,7 +120,7 @@ pub struct Hit<T = ()> {
}
```

`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not in the trait.
`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not part of the trait.

| Type | `Item` | Description |
|---|---|---|
@@ -89,31 +128,118 @@ pub struct Hit<T = ()> {
| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) |
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) |

**Build signatures:**
### Build signatures

```rust
// mode 1
impl Layer<()> {
    pub fn build(out_dir: &Path) -> OLMResult<usize>
    pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult<usize>
}

// mode 2
impl Layer<PersistentCompactIntMatrix> {
    pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
    pub fn build(out_dir: &Path, block_bits: u8,
                 count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn build_from_map(out_dir: &Path, block_bits: u8,
                          counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}

// mode 3
impl Layer<PersistentBitMatrix> {
    pub fn build_presence(
        out_dir: &Path,
        n_genomes: usize,
        present_in: impl Fn(CanonicalKmer, usize) -> bool,
    ) -> OLMResult<usize>
    pub fn build_presence(out_dir: &Path, block_bits: u8,
                          n_genomes: usize,
                          present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
}
```

All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Mode 2 pre-reads `n_kmers` from `unitigs.bin` to size the `PersistentCompactIntMatrixBuilder` before calling `MphfLayer::build`. Mode 3 does the same for `PersistentBitMatrixBuilder`.
All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Modes 2 and 3 pre-read `n_kmers` from `unitigs.bin` via `UnitigFileReader::open_sequential` to size the matrix builder before calling `MphfLayer::build`.

### Evidence build helpers on Layer

```rust
impl<D: LayerData> Layer<D> {
    pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
    pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult<usize>
    pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize>
}
```

These delegate directly to the corresponding `MphfLayer` methods and are provided so call sites can remain typed at the `Layer<D>` level.

---

## FingerprintVec and FingerprintVecWriter

Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot.

```
fingerprint.bin format:
  magic:   b"FPVF" (4 bytes)
  b:       u8 (bits per fingerprint, 1..=64)
  padding: [0u8; 3]
  n:       u64 LE (number of slots)
  data:    packed bits, ceil(n*b/8) bytes, Lsb0 order
```

```rust
impl FingerprintVec {
    pub fn open(path: &Path) -> OLMResult<Self>
    pub fn get(&self, slot: usize) -> u64
    pub fn matches(&self, slot: usize, fingerprint: u64) -> bool
    pub fn n(&self) -> usize
    pub fn b(&self) -> u8
}
```

`matches(slot, hash)` extracts the b-bit fingerprint stored at `slot` and compares it to the low b bits of `hash`. It is the core operation of `find_approx`.
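
The packed layout can be modelled in memory as below. This is a sketch, not the crate's `FingerprintVec`: the type `PackedFingerprints` and helper `low_mask` are hypothetical, the 16-byte header is left out, and the bit placement (bit j of slot s stored at stream bit s·b + j, least significant bit of each byte first) is an assumption about what "Lsb0 order" means here.

```rust
/// In-memory model of the `data` section of fingerprint.bin (header omitted).
struct PackedFingerprints {
    b: u8,
    n: usize,
    data: Vec<u8>,
}

/// Mask selecting the low `b` bits of a u64.
fn low_mask(b: u8) -> u64 {
    if b >= 64 { u64::MAX } else { (1u64 << b) - 1 }
}

impl PackedFingerprints {
    /// Packs one b-bit value per slot into ceil(n*b/8) bytes.
    fn from_values(b: u8, values: &[u64]) -> Self {
        assert!((1..=64).contains(&b), "b must be in 1..=64");
        let width = usize::from(b);
        let mut data = vec![0u8; (values.len() * width + 7) / 8];
        for (slot, &value) in values.iter().enumerate() {
            let value = value & low_mask(b);
            for j in 0..width {
                if (value >> j) & 1 == 1 {
                    let bit = slot * width + j;
                    data[bit / 8] |= 1u8 << (bit % 8);
                }
            }
        }
        Self { b, n: values.len(), data }
    }

    /// Extracts the b-bit fingerprint stored at `slot`.
    fn get(&self, slot: usize) -> u64 {
        assert!(slot < self.n, "slot out of range");
        let width = usize::from(self.b);
        let mut value = 0u64;
        for j in 0..width {
            let bit = slot * width + j;
            let set = (self.data[bit / 8] >> (bit % 8)) & 1;
            value |= u64::from(set) << j;
        }
        value
    }

    /// Compares the stored fingerprint with the low b bits of `hash`.
    fn matches(&self, slot: usize, hash: u64) -> bool {
        self.get(slot) == (hash & low_mask(self.b))
    }
}
```

The bit-by-bit loops keep the sketch easy to verify; a production reader would more likely extract each value with one or two word loads and a shift.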

---

## LayeredMap\<D\> — collection of layers

`LayeredMap<D>` wraps `Vec<Layer<D>>` for a single partition directory.

```rust
pub struct LayeredMap<D: LayerData = ()> {
    root: PathBuf,
    meta: PartitionMeta,
    layers: Vec<Layer<D>>,
}
```

`PartitionMeta` (`meta.json` at the partition root) stores `n_layers`.

### Common methods

```rust
pub fn open(root: &Path) -> OLMResult<Self>
pub fn create(root: &Path) -> OLMResult<Self>
pub fn n_layers(&self) -> usize
pub fn layer(&self, i: usize) -> &Layer<D>
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>
```

`query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0.
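
First-match probing over layers can be sketched with a toy layer type. The names `SketchLayer`, `sketch_layer` and `query_layers` are hypothetical, and a `HashMap` from kmer to slot stands in for the verified MPHF lookup of a real layer.

```rust
use std::collections::HashMap;

/// Toy layer: a kmer -> slot map stands in for MPHF + evidence verification.
struct SketchLayer {
    slots: HashMap<u64, usize>,
}

impl SketchLayer {
    /// Some(slot) for an indexed kmer, None otherwise.
    fn find(&self, kmer: u64) -> Option<usize> {
        self.slots.get(&kmer).copied()
    }
}

/// Builds a toy layer whose slots follow the order of `kmers`.
fn sketch_layer(kmers: &[u64]) -> SketchLayer {
    SketchLayer {
        slots: kmers.iter().enumerate().map(|(slot, &k)| (k, slot)).collect(),
    }
}

/// Probes layers in order and returns (layer_index, slot) for the first hit.
fn query_layers(layers: &[SketchLayer], kmer: u64) -> Option<(usize, usize)> {
    layers
        .iter()
        .enumerate()
        .find_map(|(i, layer)| layer.find(kmer).map(|slot| (i, slot)))
}
```

Since layers hold disjoint kmer sets, at most one layer can match; probing in order only affects how many layers are visited before the hit.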

### push_layer

`push_layer` builds the next layer from a `unitigs.bin` already written via `next_layer_writer`, using `DEFAULT_BLOCK_BITS`:

```rust
// mode 1
impl LayeredMap<()> {
    pub fn push_layer(&mut self) -> OLMResult<usize>
}

// mode 2
impl LayeredMap<PersistentCompactIntMatrix> {
    pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn push_layer_from_map(&mut self, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}
```

Mode 3 (`PersistentBitMatrix`) has no `push_layer` on `LayeredMap`; callers build directly via `Layer<PersistentBitMatrix>::build_presence`.

---

@@ -131,14 +257,6 @@ impl<S: BitPartials> BitPartials for LayeredStore<S> { … } // element-wi

Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type.

**Aggregation hierarchy:**

```
PersistentCompactIntMatrix               implements CountPartials
LayeredStore<PersistentCompactIntMatrix> via blanket impl (one partition)
LayeredStore<LayeredStore<…>>            via blanket impl (partitioned index)
```

**Leaf implementors** (in `obicompactvec`):

| Type | Traits |
@@ -146,8 +264,6 @@ LayeredStore<LayeredStore<…>> via blanket impl (partitione
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |

`PersistentCompactIntVec` and `PersistentBitVec` do not implement these traits — they are single-column primitives, not matrix-level aggregators.

See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.

---
@@ -155,34 +271,30 @@ See [Kmer index architecture](../architecture/index_architecture.md) for the ful
## On-disk structure

```
index_root/                         ← LayeredMap (collection)
  meta.json
  part_00000/                       ← Partition
    layer_0/                        ← Layer
      mphf.bin          — ptr_hash MPHF (epserde format)
      unitigs.bin       — packed 2-bit nucleotide sequences
      unitigs.bin.idx   — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
      evidence.bin      — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
      counts/           [mode 2] PersistentCompactIntMatrix
        meta.json       {"n": N, "n_cols": 1}
        col_000000.pciv
      presence/         [mode 3] PersistentBitMatrix
        meta.json       {"n": N, "n_cols": G}
        col_000000.pbiv
        …
    layer_1/
      …
  part_00001/
partition_root/                     ← LayeredMap (one partition)
  meta.json             — {"n_layers": N}
  layer_0/                          ← Layer
    layer_meta.json     — {"type": "exact"} or {"type": "approx", "b": B, "z": Z}
    mphf.bin            — ptr_hash MPHF (epserde format)
    unitigs.bin         — packed 2-bit nucleotide sequences
    unitigs.bin.idx     — UIDX index (exact evidence only)
    evidence.bin        — [u32; n], LE (exact evidence only)
    fingerprint.bin     — packed b-bit array (approx evidence only)
    counts/             [mode 2] PersistentCompactIntMatrix
      meta.json
      col_000000.pciv
    presence/           [mode 3] PersistentBitMatrix
      meta.json
      col_000000.pbiv …
  layer_1/
    …
```

**Partition** (`part_XXXXX/`): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.

**Layer** (`layer_N/`): one `MphfLayer` plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.
`unitigs.bin.idx` is required by `open()` (random access). `open_sequential()` on `UnitigFileReader` omits it and is used during build passes and approx-evidence construction.

---

## Evidence encoding
## Evidence encoding (exact)

`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:

@@ -191,9 +303,9 @@ bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
```

Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id`.
`chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id` (requires `unitigs.bin.idx` for random access).

For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.
For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity.
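
The bit layout of one `evidence.bin` entry translates directly into shifts and masks. The sketch below follows the field widths given above; the function names `encode_evidence`, `decode_evidence` and `evidence_bytes` are hypothetical and do not come from the crate.

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = 0x7F;
const MAX_CHUNK_ID: u32 = (1 << 25) - 1;

/// Packs (chunk_id, rank) into one u32; None if either field overflows.
fn encode_evidence(chunk_id: u32, rank: u8) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || u32::from(rank) > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | u32::from(rank))
}

/// Splits a raw entry: chunk_id = raw >> 7, rank = raw & 0x7F.
fn decode_evidence(raw: u32) -> (u32, u8) {
    (raw >> RANK_BITS, (raw & RANK_MASK) as u8)
}

/// Little-endian on-disk bytes of one entry.
fn evidence_bytes(raw: u32) -> [u8; 4] {
    raw.to_le_bytes()
}
```

For instance, chunk 3 with rank 46 encodes to 3 · 128 + 46 = 430, and a rank of 128 or a chunk id of 2^25 is rejected because it does not fit its field.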

---

@@ -203,7 +315,7 @@ For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the
type Mphf = PtrHash<
    u64,                              // key type: canonical kmer raw encoding
    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
    CachelineEfVec<Vec<CachelineEf>>, // remap: Elias-Fano
    Xx64,                             // hasher: XXH3-64 with seed
    Vec<u8>,                          // pilots
>;
@@ -211,21 +323,41 @@ type Mphf = PtrHash<

`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
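
The zero counts quoted above follow from 2 bits per nucleotide in a 64-bit word: a left-aligned k-mer leaves 64 − 2k low bits unused. The sketch below illustrates this; the function names and the A=0, C=1, G=2, T=3 code table are assumptions made for the example, not necessarily the encoding used by `CanonicalKmer`.

```rust
/// Low bits that are always zero in a left-aligned 2-bit encoding (1 <= k <= 32).
fn structural_zero_bits(k: u32) -> u32 {
    assert!((1..=32).contains(&k), "k must be in 1..=32");
    64 - 2 * k
}

/// Packs a nucleotide string into a left-aligned u64 (hypothetical code table).
/// Returns None for an empty sequence, k > 32, or a non-ACGT symbol.
fn pack_left_aligned(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 32 {
        return None;
    }
    let mut value = 0u64;
    for &base in seq {
        let code = match base {
            b'A' => 0u64,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        value = (value << 2) | code;
    }
    // Shift the packed bases up so the first base occupies the top two bits.
    Some(value << structural_zero_bits(seq.len() as u32))
}
```

Every 11-mer packed this way has at least 42 trailing zero bits, so a hash that mostly propagates low input bits sees very little variation between keys.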

`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5) is a balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, 20% less space.
`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5): 2× slower construction than `Linear/λ=3.0`, ~20% less space.

---

## Query path
## Column append and merge support

These methods extend existing layers with new genome columns without touching the MPHF.

### Layer-level genome column append

```rust
pub fn query(&self, kmer: CanonicalKmer) -> Option<Hit<D::Item>> {
    self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) })
impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}

impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}
```

`MphfLayer::find` probes the MPHF, decodes evidence, and verifies the kmer — returning `Some(slot)` on match, `None` otherwise. `data.read(slot)` is called only on a confirmed hit.
Both delegate to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They write a new column file (`col_NNNNNN.pbiv` / `col_NNNNNN.pciv`) and update `meta.json` to increment `n_cols`. `value_of` is called once per slot (0..n).

In `LayeredMap`, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.
### Presence matrix initialisation

```rust
impl Layer<()> {
    pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
}
```

Called on the first merge of a Presence-mode index. Creates `presence/` with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.

### Why the MPHF is never rebuilt

The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update.
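
The column-append invariants can be modelled in memory. This is a sketch only: `ToyPresence` is a hypothetical stand-in for the `presence/` directory, with a `Vec` of columns replacing the `.pbiv` files and `cols.len()` playing the role of `n_cols` in `meta.json`.

```rust
/// In-memory stand-in for a presence matrix: one Vec<bool> per genome column.
struct ToyPresence {
    n: usize,
    cols: Vec<Vec<bool>>,
}

impl ToyPresence {
    /// Mirrors presence initialisation: a single all-true column for genome 0.
    fn init(n_kmers: usize) -> Self {
        Self { n: n_kmers, cols: vec![vec![true; n_kmers]] }
    }

    /// Mirrors column append: `value_of` is called once per slot 0..n,
    /// the row count stays fixed and the column count grows by one.
    fn append_column(&mut self, value_of: impl Fn(usize) -> bool) {
        let col: Vec<bool> = (0..self.n).map(value_of).collect();
        self.cols.push(col);
    }

    fn n_cols(&self) -> usize {
        self.cols.len()
    }

    /// Presence of the kmer at `slot` across all genomes.
    fn row(&self, slot: usize) -> Vec<bool> {
        self.cols.iter().map(|col| col[slot]).collect()
    }
}
```

Slot numbers never change in this model, which is why nothing that maps kmers to slots needs to be touched when a genome is added.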

---

@@ -235,9 +367,9 @@ When adding dataset B to an existing index:

1. For each partition, probe existing layers for kmers of B routed to that partition.
2. Collect kmers absent from all layers → `B \ index`.
3. Write `B \ index` to a new `unitigs.bin` via `MphfLayer::unitig_writer`.
4. Call `Layer<D>::build` on the new directory.
5. Update `meta.json`.
3. Write `B \ index` to a new `unitigs.bin` via `next_layer_writer()`.
4. Call `Layer<D>::build` (or `build_presence`) on the new layer directory.
5. Call `push_layer` (or `append_layer`) to register the new layer in `meta.json`.

Each partition's new layer is built independently; the operation is fully parallel across partitions.
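
The set difference at the heart of these steps can be sketched for a single partition. The function `new_layer_kmers` is a hypothetical illustration: `HashSet<u64>` values stand in for existing layers, and the returned vector is what would be written to the new layer's `unitigs.bin`.

```rust
use std::collections::HashSet;

/// Returns the kmers of `batch` that are absent from every existing layer,
/// deduplicated and in first-seen order (the set `B \ index`).
fn new_layer_kmers(layers: &[HashSet<u64>], batch: &[u64]) -> Vec<u64> {
    let mut emitted = HashSet::new();
    batch
        .iter()
        .copied()
        // Keep only kmers that no existing layer already indexes.
        .filter(|kmer| !layers.iter().any(|layer| layer.contains(kmer)))
        // Drop repeats inside the batch itself.
        .filter(|kmer| emitted.insert(*kmer))
        .collect()
}
```

Appending the result as a new layer keeps all layers pairwise disjoint, since every kmer it contains was checked against each earlier layer.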

@@ -250,62 +382,9 @@ Each partition's new layer is built independently; the operation is fully parall
| `ptr_hash 1.1` | MPHF per layer |
| `cacheline-ef 1.1` | compact remap inside ptr_hash |
| `epserde 0.8` | zero-copy MPHF serialisation |
| `memmap2 0.9` | mmap of evidence and payload files |
| `obiskio` | unitig file writer/reader |
| `memmap2 0.9` | mmap of evidence and fingerprint files |
| `bitvec` | packed b-bit fingerprint storage |
| `obiskio` | unitig file writer/reader + `.idx` build |
| `obicompactvec` | payload types + aggregation traits |
| `rayon 1` | parallel MPHF construction pass |
| `ndarray 0.16` | aggregation output arrays |

---

## Column append and merge support

These methods extend existing layers with new genome columns without touching the MPHF. They are the building blocks of the `merge` command.

### Matrix column append

```rust
impl PersistentCompactIntMatrix {
    pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}

impl PersistentBitMatrix {
    pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}
```

Both methods write a new column file (`col_NNNNNN.pciv` / `col_NNNNNN.pbiv`) and update `meta.json` to increment `n_cols`. The `value_of` closure is called once per slot (indexed 0..n) to populate the column. The matrix `n` (row count) is read from the existing `meta.json` and must not change.

### Presence matrix initialisation

```rust
impl Layer<()> {
    pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
}
```

Called on the first merge of a Presence-mode index. Creates the `presence/` subdirectory with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records that genome 0 (the original source) is present in every slot of this layer, satisfying the column count invariant before any new-source column is appended.

### Layer-level genome column append

```rust
impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(
        layer_dir: &Path,
        value_of: impl Fn(usize) -> bool,
    ) -> OLMResult<()>
}

impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(
        layer_dir: &Path,
        value_of: impl Fn(usize) -> u32,
    ) -> OLMResult<()>
}
```

These delegate directly to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They are typed at the `Layer` level to make call sites mode-aware without exposing the inner matrix path construction.

### Why the MPHF is never rebuilt

The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once from the kmer set of a layer and is immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only adds a new data column indexed by the same slot numbers. Rebuilding the MPHF would require re-running the full construction pipeline (two sequential passes over unitigs, parallel ptr_hash construction) and would invalidate any open memory maps. Column append avoids all of this: the only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update. Kmers absent from a given layer are represented as zero (count) or false (presence) values in the new column — no structural change to the layer is required.
| `serde / serde_json` | `LayerMeta` + `PartitionMeta` serialisation |