docs: update architecture and storage specs for approximate index
Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
@@ -2,243 +2,139 @@
|
||||
|
||||
## Role of unitigs in the index
|
||||
|
||||
The MPHF maps each canonical kmer to an integer slot, but provides no way to reconstruct the kmer from its slot. A downstream operation (query, set operation) that receives a slot index and needs the kmer sequence must be able to retrieve it. The **evidence file** serves this purpose: it stores the kmer sequences in compact form and provides, for each MPHF slot, a pointer to where the corresponding kmer can be decoded.
|
||||
The MPHF maps each canonical kmer to an integer slot but provides no inverse: a slot index alone cannot reconstruct the kmer. The **evidence file** supplies this inverse: for each MPHF slot it stores a pointer into the unitig sequence file, from which k nucleotides can be extracted.
|
||||
|
||||
Unitigs are the natural compact representation: a run of L nucleotides encodes L − k + 1 consecutive canonical kmers. The entire kmer set of a partition can be reconstructed from its unitig FASTA file.
|
||||
Unitigs are the natural compact representation: a run of L nucleotides encodes L − k + 1 consecutive canonical kmers. The entire kmer set of a partition is reconstructible from its unitig binary file.
|
||||
|
||||
---
|
||||
|
||||
## Two encoding strategies
|
||||
## Binary file formats
|
||||
|
||||
### Strategy A — global nucleotide offset
|
||||
### `unitigs.bin` — sequence chunks
|
||||
|
||||
Each MPHF slot stores a single integer: the byte offset of the kmer's first nucleotide within a packed 2-bit nucleotide array that concatenates all unitigs.
|
||||
A sequence of binary records. Each record:
|
||||
|
||||
```
|
||||
evidence[slot] = global_offset (bits: ⌈log₂ N_nuc⌉)
|
||||
[u8: seql − k] [ceil(seql / 4) bytes: 2-bit packed nucleotides]
|
||||
```
|
||||
|
||||
where `N_nuc` is the total number of nucleotides across all unitigs in the partition.
|
||||
- `seql − k` (0–255): nucleotide length minus k, so `seql = byte[0] + k` and `n_kmers = byte[0] + 1`.
|
||||
- Packed nucleotides: A=00, C=01, G=10, T=11, MSB-first within each byte; last byte zero-padded.
|
||||
- Byte count for packed sequence: `ceil(seql / 4)`.
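The record layout above can be exercised with a small decoder. This is a minimal sketch, not the project's reader: the function name `decode_record` and the choice to return ASCII nucleotides are assumptions made for illustration; only the header byte, the `ceil(seql / 4)` byte count, and the MSB-first 2-bit coding come from the specification.

```rust
/// Hypothetical helper: decodes the record that starts at `buf[0]`.
/// Returns the nucleotides as ASCII plus the number of bytes consumed,
/// or `None` if the buffer is too short for the declared length.
fn decode_record(buf: &[u8], k: usize) -> Option<(Vec<u8>, usize)> {
    // The header byte stores `seql - k`.
    let seql = *buf.first()? as usize + k;
    // ceil(seql / 4) bytes of packed nucleotides follow the header.
    let n_bytes = (seql + 3) / 4;
    let packed = buf.get(1..1 + n_bytes)?;
    let seq = (0..seql)
        .map(|i| {
            // MSB-first: nucleotide 0 of each byte sits in bits 7..6.
            let shift = 6 - 2 * (i % 4);
            b"ACGT"[((packed[i / 4] >> shift) & 0b11) as usize]
        })
        .collect();
    Some((seq, 1 + n_bytes))
}
```

With k = 3, the bytes `[2, 0x1B, 0x00]` describe a 5-nucleotide chunk (`ACGTA`, three kmers) occupying 3 bytes on disk.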
|
||||
|
||||
Decoding: read k nucleotides starting at `global_offset`.
|
||||
|
||||
### Strategy B — (unitig_id, rank within unitig)
|
||||
|
||||
Each MPHF slot stores a pair:
|
||||
Unitigs with more than `MAX_KMERS_PER_CHUNK = 256` k-mers are transparently split into overlapping chunks. Each chunk has at most 256 k-mers (= `seql − k + 1 ≤ 256`); consecutive chunks overlap by k−1 nucleotides so no kmer is lost:
|
||||
|
||||
```
|
||||
evidence[slot] = (unitig_id, rank)
|
||||
chunk 1: nucleotides [0, MAX_KMERS_PER_CHUNK + k − 2] (256 kmers)
|
||||
chunk 2: nucleotides [256, end] (remaining kmers)
|
||||
overlap: k−1 nucleotides shared between the two chunks
|
||||
```
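The splitting rule can be sketched as a function that returns the nucleotide range of each chunk. The name `chunk_ranges` and the half-open range convention are assumptions for illustration; the 256-kmer cap and the k−1 overlap are taken from the text above.

```rust
const MAX_KMERS_PER_CHUNK: usize = 256;

/// Hypothetical helper: half-open nucleotide ranges `[start, end)` of the
/// chunks produced for a unitig of `seql` nucleotides.
fn chunk_ranges(seql: usize, k: usize) -> Vec<(usize, usize)> {
    assert!(k >= 1 && seql >= k, "a unitig holds at least one kmer");
    let n_kmers = seql - k + 1;
    let mut ranges = Vec::new();
    let mut first_kmer = 0;
    while first_kmer < n_kmers {
        let count = MAX_KMERS_PER_CHUNK.min(n_kmers - first_kmer);
        // `count` kmers starting at nucleotide `first_kmer` span
        // `count + k - 1` nucleotides, so consecutive chunks share k-1.
        ranges.push((first_kmer, first_kmer + count + k - 1));
        first_kmer += count;
    }
    ranges
}
```

For k = 31 and a 300-kmer unitig (330 nucleotides), this yields `[0, 286)` and `[256, 330)`: 256 kmers, then 44 kmers, with 30 shared nucleotides.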
|
||||
|
||||
- `unitig_id` : index of the unitig in the partition (0-based)
|
||||
- `rank` : kmer index within the unitig (0 ≤ rank < n_kmers); kmer i starts at nucleotide i, so the nucleotide offset is identical numerically but the kmer-unit interpretation is the natural one
|
||||
### `unitigs.bin.idx` — block-sampled offset index
|
||||
|
||||
Decoding: look up the unitig at `unitig_id`, then read k nucleotides starting at `rank`.
|
||||
```
|
||||
magic : 4 bytes = "UIX3"
|
||||
block_bits: u32 LE — granularity parameter (0–31)
|
||||
n_unitigs : u32 LE — total number of chunks in unitigs.bin
|
||||
n_kmers : u64 LE — total number of kmers across all chunks
|
||||
offsets : [u32 LE] — byte offsets into unitigs.bin, one per 2^block_bits chunks + sentinel
|
||||
```
|
||||
|
||||
One offset entry is stored every `2^block_bits` chunks; the array is sentinel-terminated (last entry = file size). `DEFAULT_BLOCK_BITS = 0` stores one offset per chunk (exact table, no scan).
|
||||
|
||||
### `evidence.bin` — per-slot MPHF evidence
|
||||
|
||||
A flat array of u32 values, one per MPHF slot, no header:
|
||||
|
||||
```
|
||||
bits [31:7] = chunk_id (25 bits)
|
||||
bits [6:0] = rank (7 bits, 0–127)
|
||||
```
|
||||
|
||||
File size = `n_slots × 4` bytes. `chunk_id` is the 0-based index of the record in `unitigs.bin`; `rank` is the position of the canonical kmer within that chunk (counting only canonical kmers). Encoding: `raw = (chunk_id << 7) | (rank & 0x7F)`. Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`.
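A minimal sketch of the packing follows. The function names are hypothetical, and one behaviour differs on purpose from the formula above: instead of silently masking `rank` with `0x7F`, the encoder rejects values that do not fit, which makes overflow visible while experimenting.

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F
const CHUNK_ID_LIMIT: u32 = 1 << 25; // chunk_id must fit in 25 bits

/// Hypothetical encoder: `None` when a field would overflow its bit width.
fn encode_evidence(chunk_id: u32, rank: u8) -> Option<u32> {
    if chunk_id >= CHUNK_ID_LIMIT || u32::from(rank) > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | u32::from(rank))
}

/// Hypothetical decoder: inverse of `encode_evidence`.
fn decode_evidence(raw: u32) -> (u32, u8) {
    (raw >> RANK_BITS, (raw & RANK_MASK) as u8)
}
```

On disk each value would be written little-endian, for example with `u32::to_le_bytes`.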
|
||||
|
||||
---
|
||||
|
||||
## Bit-cost analysis
|
||||
## Building and reading the index
|
||||
|
||||
Define for a partition of P kmers with average kmers-per-unitig m:
|
||||
### `build_unitig_idx(path, block_bits)`
|
||||
|
||||
- total nucleotides: $N_{nuc} = P \cdot \left(1 + \dfrac{k-1}{m}\right)$
|
||||
- number of unitigs: $U = P / m$
|
||||
Scans `unitigs.bin` sequentially: for each chunk at byte offset `offset`, if `chunk_count & mask == 0` (where `mask = (1 << block_bits) − 1`), appends `offset as u32` to `block_offsets`. After the scan, appends a sentinel (= total file size), then writes the `.idx` file. Called after the unitig file is fully written and closed.
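The scan can be sketched over an in-memory image of `unitigs.bin`. This is an illustration under stated assumptions, not the actual `build_unitig_idx`: the name `build_block_offsets`, the slice input, and the returned tuple are hypothetical, and writing the `.idx` header is left out.

```rust
/// Hypothetical helper: returns the sampled offsets (sentinel included)
/// and the number of chunks found in `data`.
fn build_block_offsets(data: &[u8], k: usize, block_bits: u32) -> (Vec<u32>, usize) {
    let mask = (1usize << block_bits) - 1;
    let mut offsets = Vec::new();
    let mut offset = 0usize;
    let mut chunk_count = 0usize;
    while offset < data.len() {
        // Sample one offset every 2^block_bits chunks.
        if (chunk_count & mask) == 0 {
            offsets.push(offset as u32);
        }
        // Skip the record: 1 header byte + ceil(seql / 4) packed bytes.
        let seql = data[offset] as usize + k;
        offset += 1 + (seql + 3) / 4;
        chunk_count += 1;
    }
    // Sentinel: total file size.
    offsets.push(data.len() as u32);
    (offsets, chunk_count)
}
```

With `block_bits = 0` the result has one entry per chunk plus the sentinel; with `block_bits = 1` every second chunk is sampled. The sketch does not validate a truncated final record.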
|
||||
|
||||
**Strategy A**
|
||||
### `open()` vs `open_sequential()`
|
||||
|
||||
$$
|
||||
b_A = \left\lceil \log_2 N_{nuc} \right\rceil = \left\lceil \log_2 P + \log_2\!\left(1 + \frac{k-1}{m}\right) \right\rceil
|
||||
$$
|
||||
`UnitigFileReader::open(path)` loads the `.idx` file into `block_offsets: Vec<u32>` and memory-maps `unitigs.bin`. Enables random access via `chunk_start(i)`, `unitig(i)`, `raw_kmer(i, j)`, and `verify_canonical_kmer(i, j, q)`.
|
||||
|
||||
**Strategy B**
|
||||
`UnitigFileReader::open_sequential(path)` does not read `.idx`. It scans `unitigs.bin` once to count chunks and kmers, then leaves `block_offsets` empty. Only sequential iterators work: `iter_unitigs`, `iter_kmers`, `iter_canonical_kmers`, `iter_indexed_canonical_kmers`. Any call to `chunk_start()` panics with a diagnostic message.
|
||||
|
||||
$$
|
||||
b_B = \left\lceil \log_2 U \right\rceil + \left\lceil \log_2 L_{max} \right\rceil
|
||||
$$
|
||||
### `chunk_start(i)` — random access
|
||||
|
||||
where $L_{max}$ is the maximum unitig length (in nucleotides). In practice $L_{max} \ll P$, so the rank field is much cheaper than the full global offset. If unitig lengths are bounded (e.g. by partition structure), the rank field width is a small constant independent of P.
|
||||
|
||||
### Empirical bound on unitig length
|
||||
|
||||
Lengths and ranks are expressed in **kmer units** (not nucleotides): the nucleotide length is `n_kmers + k − 1`, so storing `n_kmers` instead of `seq_length` saves k−1 = 30 units of headroom in the same field width.
|
||||
|
||||
Consequence for `u8` capacity:
|
||||
|
||||
| unit of the `u8` field | max representable | equivalent in the other unit |
|
||||
|---|---|---|
|
||||
| nucleotides | 255 nuc | 225 kmers |
|
||||
| **kmers** | **255 kmers** | **285 nuc** |
|
||||
|
||||
**Structural maximum from superkmer construction.** For k=31 and m=11, the maximum number of consecutive kmers sharing the same minimiser is k − m + 1 = **21 kmers** (the minimiser traverses from position k−m to 0 as the window slides). A unitig that is a single full superkmer therefore has exactly 21 kmers. This is confirmed by a bimodal distribution in empirical data: a sharp peak at 21 kmers appears in all partitions, including the anomalous partition 145. The observed maximum is ~46 kmers (unitigs spanning more than one superkmer), well within u8 range.
|
||||
|
||||
On *Betula nana* (k=31, 256 partitions), m_u ≈ 37.9 kmers/unitig on average. The `rank` field (kmer index within the unitig) fits in a `u8` as long as no unitig exceeds 255 kmers — guaranteed by the split strategy below and amply satisfied by empirical maximums (~46 kmers observed).
|
||||
|
||||
### Split strategy for long unitigs
|
||||
|
||||
For the rare cases where a unitig exceeds 255 kmers, the unitig is split into chunks of at most 255 kmers, with a **k−1 nucleotide overlap** at each junction — identical to the way super-kmers are delimited at partition boundaries. Each chunk is self-contained and independently decodable.
|
||||
|
||||
```
|
||||
original unitig: kmer_0 … kmer_254 | kmer_255 … kmer_N
|
||||
↑ cut here
|
||||
|
||||
chunk 1: nucleotides 0 … 284 (255 kmers)
|
||||
chunk 2: nucleotides 255 … N+k-1 (N-255+1 kmers)
|
||||
shared: nucleotides 255 … 284 (k-1 = 30 nucleotides, stored in both)
|
||||
```rust
|
||||
fn chunk_start(&self, i: usize) -> usize {
|
||||
// block_bits=0: single table lookup, O(1) — hot path
|
||||
if self.block_bits == 0 {
|
||||
return self.block_offsets[i] as usize;
|
||||
}
|
||||
// block_bits>0: lookup block, then scan at most 2^block_bits − 1 records
|
||||
let block = i >> self.block_bits;
|
||||
let rem = i & self.mask;
|
||||
let mut offset = self.block_offsets[block] as usize;
|
||||
for _ in 0..rem {
|
||||
let seql_minus_k = self.mmap[offset] as usize;
|
||||
offset += 1 + (seql_minus_k + self.k + 3) / 4;
|
||||
}
|
||||
offset
|
||||
}
|
||||
```
|
||||
|
||||
Cost of one split: k−1 = 30 redundant nucleotides = 60 bits. This event is rare in practice (m_u ≈ 38 for *B. nana*, well below the 255-kmer cap). No kmer is lost: kmer i is in chunk 1 if i < 255, in chunk 2 (at rank i−255) otherwise.
|
||||
With `block_bits = 0` (the default), every chunk has a direct entry in `block_offsets`: lookup is a single array index, O(1), with no sequential scan. The `if self.block_bits == 0` branch is explicit in the code and handles this hot path first.
|
||||
|
||||
### Savings from u8 length fields
|
||||
With `block_bits > 0`, one offset covers `2^block_bits` consecutive chunks; access cost is O(`2^block_bits`) sequential mmap reads.
|
||||
|
||||
Because all chunks are guaranteed ≤ 255 kmers, the per-chunk length array in the binary index is a flat `u8` array — 1 byte per chunk instead of 8 bytes (usize) or 4 bytes (u32). For a partition with 4 M unitigs:
|
||||
### Decoding a kmer from slot `s`
|
||||
|
||||
| length type | bytes/chunk | total (4 M chunks) |
|
||||
|---|---|---|
|
||||
| usize (u64) | 8 | 32 MB |
|
||||
| u32 | 4 | 16 MB |
|
||||
| **u8** | **1** | **4 MB** |
|
||||
```rust
|
||||
let (chunk_id, rank) = evidence.decode(s); // u32 → (chunk_id: u32, rank: u8)
|
||||
let kmer = unitigs.raw_kmer(chunk_id, rank); // 2-bit packed slice → left-aligned u64
|
||||
```
|
||||
|
||||
Random access to chunk i is recovered at load time by a single prefix-sum pass over the u8 array, computing a u32/u64 offset array in O(n_chunks) time and O(n_chunks × 4) bytes — paid once at open time, cached for the lifetime of the partition handle.
|
||||
|
||||
Bit costs for *Betula nana* (k=31, 256 partitions, P ≈ 10.4 M, U ≈ 275 k, m_u ≈ 37.9):
|
||||
|
||||
| field | strategy A | strategy B |
|
||||
|---|---|---|
|
||||
| offset / id | $\lceil\log_2(P \cdot (1 + 30/m_u))\rceil = 25$ bits | $\lceil\log_2(U)\rceil = 19$ bits |
|
||||
| rank | — | 8 bits (u8, fixed) |
|
||||
| **total** | **25 bits** | **27 bits** |
|
||||
|
||||
Strategy A is 2 bits cheaper. Strategy B's main advantage is **locality**: decoding a kmer touches one unitig's cache lines rather than an arbitrary offset in a large flat array, and the `rank` field doubles as a direct index into the packed nucleotide sequence without pointer arithmetic.
|
||||
Two memory accesses: one 4-byte read from `evidence.bin`, one packed-bit extraction from `unitigs.bin` via the mmap. The retrieved sequence is already canonical (only canonical kmers are inserted into the De Bruijn graph).
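The packed-bit extraction step can be sketched as follows. The name `extract_kmer` is hypothetical, and the sketch assumes that `start` is a nucleotide offset inside the chunk's packed bytes; how the real `raw_kmer` maps `rank` to that offset is not specified here.

```rust
/// Hypothetical helper: reads `k` nucleotides starting at nucleotide
/// `start` from MSB-first 2-bit packed bytes, left-aligned in a u64.
fn extract_kmer(packed: &[u8], start: usize, k: usize) -> u64 {
    assert!((1..=32).contains(&k), "a u64 holds at most 32 nucleotides");
    let mut kmer = 0u64;
    for i in start..start + k {
        // Nucleotide i lives in byte i / 4, at bit pair 3 - (i % 4).
        let code = (packed[i / 4] >> (6 - 2 * (i % 4))) & 0b11;
        kmer = (kmer << 2) | u64::from(code);
    }
    // Left-align: the first nucleotide ends up in bits 63..62.
    kmer << (64 - 2 * k)
}
```

Left alignment keeps numeric order equal to lexicographic order for kmers of the same length, which is convenient for canonical comparisons.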
|
||||
|
||||
---
|
||||
|
||||
## Partition-size tradeoff
|
||||
## Field widths and capacity
|
||||
|
||||
The total bits/kmer for the index (sequence + evidence + MPHF) as a function of partition size is:
|
||||
| field | bits | range | capacity check (*B. nana*, 256 partitions) |
|
||||
|------------|------|---------------|---------------------------------------------|
|
||||
| `seql − k` | 8 | 0–255 | max `n_kmers` per chunk = 256 = `MAX_KMERS_PER_CHUNK` |
|
||||
| `rank`     | 7    | 0–127         | observed max ~46 kmers/chunk; single-superkmer max k−m+1 = 21 |
|
||||
| `chunk_id` | 25 | 0–33 554 431 | avg U ≈ 275 k chunks/partition |
|
||||
|
||||
$$
|
||||
\text{total} = \underbrace{2\!\left(1 + \frac{k-1}{m}\right)}_{\text{sequence}} + \underbrace{\log_2 P + \log_2\!\left(1+\frac{k-1}{m}\right)}_{\text{evidence}} + \underbrace{c_{MPHF}}_{\approx 2\text{–}4}
|
||||
$$
|
||||
|
||||
### Empirical observation: m_u is set by De Bruijn graph topology, not partition count
|
||||
|
||||
Measured on *Betula nana* (k=31, m=11), summing n_kmers and sequence counts across all partition files:
|
||||
|
||||
| N partitions | m_sk | m_u | factor m_u/m_sk | nuc ratio (u/sk) |
|
||||
|---|---|---|---|---|
|
||||
| 1 | 12.13 | **41.89** | 3.45× | 0.273 |
|
||||
| 16 | 12.13 | **38.19** | 3.15× | 0.376 |
|
||||
| 256 | 12.13 | **37.90** | 3.12× | 0.388 |
|
||||
| 1 024 | 12.13 | **37.89** | 3.12× | 0.389 |
|
||||
|
||||
- `m_sk` = avg kmers/super-kmer (invariant — same dataset regardless of partition scheme)
|
||||
- `m_u` = avg kmers/unitig = total_n_kmers / total_unitigs, summed across all partitions
|
||||
- `nuc ratio` = (u_symbols + 30·u_reads) / (sk_symbols + 30·sk_reads)
|
||||
|
||||
X-axis in both charts: partition bits (0 = 1 partition, 10 = 1024 partitions) — each step doubles the partition count.
|
||||
|
||||
```mermaid
|
||||
xychart-beta
|
||||
title "m_u (avg kmers/unitig) vs partition bits — B. nana k=31"
|
||||
x-axis "partition bits" 0 --> 10
|
||||
y-axis "m_u" 37 --> 43
|
||||
line [41.89, 40.78, 39.22, 38.52, 38.19, 38.03, 37.96, 37.92, 37.90, 37.89, 37.89]
|
||||
```
|
||||
|
||||
```mermaid
|
||||
xychart-beta
|
||||
title "Nucleotide storage: unitigs / super-kmers (%) vs partition bits — B. nana k=31"
|
||||
x-axis "partition bits" 0 --> 10
|
||||
y-axis "%" 25 --> 42
|
||||
line [27.3, 29.7, 33.9, 36.3, 37.6, 38.3, 38.6, 38.7, 38.8, 38.9, 38.9]
|
||||
```
|
||||
|
||||
Key observations:
|
||||
|
||||
1. **Partition boundaries have a small but non-zero effect on m_u.** Going from 1 to 1024 partitions reduces m_u by just under 10% (41.89 → 37.89). Within the practical range 16–1024, the variation is under 1%, so m_u is effectively constant.
|
||||
2. **m_u is a property of the De Bruijn graph, not the partition scheme.** The dominant factor is graph branching (heterozygosity, repeats, sequencing errors).
|
||||
3. **Unitigs provide substantial compaction over super-kmers.** At 256 partitions, unitigs cover the same unique kmers using 39% of the raw nucleotide content of super-kmers (about 2.6× fewer nucleotides), while holding 3.1× more kmers per record (m_u/m_sk).
|
||||
|
||||
#### Per-partition compaction ratio (sk_symbols / u_symbols)
|
||||
|
||||
The ratio measures how much super-kmer kmer-slots are "shared" across different super-kmer records: a ratio of 1.35 means each unique kmer (counted once in unitigs) appears in 1.35 super-kmer kmer-slots on average.
|
||||
|
||||
| bits | N partitions | median ratio | min ratio | min partition | min u_reads |
|
||||
|---|---|---|---|---|---|
|
||||
| 6 | 64 | 1.355 | 1.073 | — | 4.5 M |
|
||||
| 7 | 128 | 1.352 | 1.037 | — | 4.1 M |
|
||||
| 8 | 256 | **1.350** | **1.012** | **145** | **3.8 M** |
|
||||
| 9 | 512 | 1.350 | 0.998 | 145 | 3.6 M |
|
||||
| 10 | 1024 | 1.351 | 0.992 | 145 | 3.6 M |
|
||||
|
||||
The median stabilises at **1.35** from 64 partitions onward (stdev = 0.027 at 256 partitions). There is one persistent outlier: **partition 145** (at 256-partition resolution) is consistently anomalous across all partition depths — it contains 10–14× more super-kmers and unitigs than the average partition, with a ratio near 1.0, meaning the unitig representation provides almost no kmer deduplication. This is consistent with a highly repetitive or organellar region where the dominant minimiser belongs to a sequence that appears in many reads without forming long overlapping paths in the De Bruijn graph.
|
||||
|
||||
Per-partition parameters at 256 partitions (*B. nana*):
|
||||
|
||||
| quantity | value |
|
||||
|---|---|
|
||||
| P (unique kmers/partition, avg) | ≈ 10.4 M |
|
||||
| U (unitigs/partition, avg) | ≈ 275 k |
|
||||
| m_u | ≈ 37.9 |
|
||||
| Strategy A bits/kmer | ⌈log₂(P·(1+30/m_u))⌉ = 25 |
|
||||
| Strategy B bits/kmer | ⌈log₂(U)⌉ + 8 = 27 |
|
||||
|
||||
Consequence: **the partition count should be as large as memory and parallelism allow.** Each doubling saves 1 bit/kmer in evidence (log₂ P decreases by 1). The sequence term 2·(1 + 30/m_u) ≈ 3.6 bits/kmer is approximately constant.
|
||||
|
||||
Strategy B partially decouples evidence cost from P: `log₂(U) = log₂(P/m_u)` grows more slowly than `log₂(P)` by a fixed log₂(m_u) ≈ 5 bits. Strategy B's main benefit remains locality and bounded rank width, not asymptotic compression.
|
||||
The rank field is 7 bits (max 127) even though chunks can contain up to 256 k-mers, because rank counts only canonical kmers within the chunk, and the canonical kmer count is at most half the total.
|
||||
|
||||
---
|
||||
|
||||
## Implementation notes
|
||||
## Evidence bit-cost
|
||||
|
||||
### Evidence file layout (strategy B — implemented)
|
||||
Strategy B (chunk_id + rank) is the implemented strategy. For *B. nana* (k=31, 256 partitions, P ≈ 10.4 M unique kmers/partition, U ≈ 275 k chunks/partition, m_u ≈ 37.9 kmers/chunk):
|
||||
|
||||
`evidence.bin` is a flat `[u32; n]` array with no header:
|
||||
| field | theoretical cost | value |
|
||||
|------------|-------------------------|---------|
|
||||
| chunk_id | ⌈log₂ U⌉ | 19 bits |
|
||||
| rank | ⌈log₂ m_u⌉ (≈ fixed) | 6 bits |
|
||||
| **stored** | aligned u32 | **32 bits/slot** |
|
||||
|
||||
```
|
||||
evidence.bin: n × 4 bytes, little-endian
|
||||
each u32: bits [31:7] = chunk_id (25 bits)
|
||||
bits [6:0] = rank (7 bits)
|
||||
```
|
||||
The u32 layout is chosen for alignment and simplicity; no bit-addressing arithmetic is needed.
|
||||
|
||||
File size = `n × 4` bytes exactly. Decoding a slot: `chunk_id = raw >> 7`, `rank = raw & 0x7F`.
|
||||
|
||||
The theoretical bit cost of strategy B (19 bits id + 8 bits rank = 27 bits) is not recovered: packing into aligned u32 costs 32 bits per slot. The u32 layout is chosen for simplicity and alignment — one word per slot, no bit-addressing arithmetic.
|
||||
|
||||
### Unitig file layout
|
||||
|
||||
Binary packed 2-bit nucleotide file (`unitigs.bin`) with a companion index (`unitigs.bin.idx`). The index stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
|
||||
|
||||
### Decoding a kmer from slot s
|
||||
|
||||
```
|
||||
(chunk_id, rank) = evidence.decode(s) // u32 → (u25, u7)
|
||||
kmer = unitigs.raw_kmer(chunk_id, rank) // 2-bit packed slice, k nucleotides
|
||||
```
|
||||
|
||||
Two memory accesses: one into `evidence.bin`, one into `unitigs.bin`. The canonical kmer is the stored sequence (by construction — only canonical kmers are inserted into the De Bruijn graph).
|
||||
|
||||
### Field widths in practice
|
||||
|
||||
Rank is stored in 7 bits (0–127). On *B. nana* (k=31, m=11), the observed maximum unitig length is ~46 kmers/chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers per unitig; longer paths arise across multiple superkmers. u7 is sufficient.
|
||||
|
||||
chunk_id is stored in 25 bits (0–33 554 431). For *B. nana* at 256 partitions, avg U ≈ 275 k — well within the 25-bit capacity.
|
||||
|
||||
### Forward vs reverse complement
|
||||
|
||||
The De Bruijn graph stores only canonical kmers. The evidence encodes the canonical orientation. Callers that need the strand of the original kmer must compare the retrieved kmer with its revcomp at query time; this is a single 64-bit comparison.
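The strand check can be sketched on left-aligned 2-bit kmers. The function names are hypothetical; the sketch assumes the A=00, C=01, G=10, T=11 coding, under which complementing a nucleotide is a bitwise NOT of its two bits.

```rust
/// Reverses the order of the 32 two-bit groups in a u64.
fn reverse_pairs(x: u64) -> u64 {
    let x = x.swap_bytes();
    let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2)
}

/// Reverse complement of a left-aligned kmer of length `k`.
fn revcomp(kmer: u64, k: usize) -> u64 {
    assert!((1..=32).contains(&k));
    // After complement + reversal the kmer sits in the low 2k bits;
    // the shift re-aligns it and drops the complemented padding.
    reverse_pairs(!kmer) << (64 - 2 * k)
}

/// Returns the canonical form and whether the input was already canonical.
fn canonical(kmer: u64, k: usize) -> (u64, bool) {
    let rc = revcomp(kmer, k);
    if kmer <= rc { (kmer, true) } else { (rc, false) }
}
```

The final comparison is the single 64-bit comparison mentioned above; everything before it is a handful of shifts and masks.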
|
||||
Comparison with strategy A (global nucleotide offset): `⌈log₂(P · (1 + (k−1)/m_u))⌉ = 25 bits`. Strategy A is theoretically 2 bits cheaper; strategy B's advantage is **locality** (decoding touches one chunk's cache lines) and a bounded, constant-width rank field independent of partition size.
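The two theoretical costs can be reproduced from the formulas quoted in this section. The helper names are hypothetical and floating-point logarithms are used for brevity, so the sketch is only meant for parameter values far from an exact power of two.

```rust
/// Smallest number of bits able to index `x` distinct positions.
fn ceil_log2(x: f64) -> u32 {
    x.log2().ceil() as u32
}

/// Strategy A: ceil(log2(P * (1 + (k - 1) / m_u))).
fn strategy_a_bits(p: f64, m_u: f64, k: f64) -> u32 {
    ceil_log2(p * (1.0 + (k - 1.0) / m_u))
}

/// Strategy B: ceil(log2(P / m_u)) plus a fixed-width rank field.
fn strategy_b_bits(p: f64, m_u: f64, rank_bits: u32) -> u32 {
    ceil_log2(p / m_u) + rank_bits
}
```

With the *B. nana* figures (P ≈ 10.4 M, m_u ≈ 37.9, k = 31), `strategy_a_bits` gives 25 and `strategy_b_bits` with an 8-bit rank gives 27, matching the 2-bit gap stated above.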
|
||||
|
||||
---
|
||||
|
||||
## Non-determinism of the unitig decomposition
|
||||
## Unitig decomposition non-determinism
|
||||
|
||||
The unitig extraction is **not deterministic**: two runs on identical input can produce a different number of unitigs with different sequences, while covering exactly the same canonical k-mer set.
|
||||
The unitig extraction from `GraphDeBruijn` is **not deterministic**: two runs on identical input can produce different unitig counts and sequences while covering exactly the same canonical kmer set.
|
||||
|
||||
### Source of non-determinism
|
||||
|
||||
The graph nodes are stored in a hash map whose iteration order depends on the hash seed (random per run with `ahash::RandomState::new()`). The `start_iter` first pass emits every node whose `can_extend_left` flag is false — which includes not only true dead-end nodes but also **branch points** (nodes with 2 or more left neighbours, for which `unique_neighbor` returns `None`).
|
||||
|
||||
When a branch point is encountered before its upstream neighbours, it claims the downstream chain and those neighbours later produce length-k degenerate unitigs. When upstream neighbours are encountered first, they extend through the branch point and consume it.
|
||||
The hash map (`hashbrown::HashMap` with `Xxh3Builder`) has run-dependent iteration order. The `start_iter` first pass emits every node where `can_extend_left` is false — this includes true dead-ends and branch points (nodes with ≥2 left neighbours). When a branch point is encountered before its upstream neighbours, it claims the downstream chain and those upstream neighbours later produce length-k degenerate unitigs. When upstream neighbours appear first, they extend through the branch point.
|
||||
|
||||
**Example** — fork topology (k = 31):
|
||||
|
||||
@@ -248,25 +144,37 @@ A → B ← C
|
||||
D
|
||||
```
|
||||
|
||||
All four nodes are in the graph. B has two left neighbours (A and C), so `can_extend_left = false`; B also has one right neighbour D, so `can_extend_right = true`.
|
||||
B has two left neighbours, so `can_extend_left = false`. Two valid tilings:
|
||||
|
||||
| iteration order | unitigs produced | count |
|
||||
| iteration order | unitigs | count |
|
||||
|---|---|---|
|
||||
| A first, then B, C | ABD · C | 2 |
|
||||
| B first, then A, C | BD · A · C | 3 |
|
||||
| A first | ABD, C | 2 |
|
||||
| B first | BD, A, C | 3 |
|
||||
|
||||
Both tilings cover the same 4 canonical k-mers.
|
||||
Both cover the same 4 canonical kmers. Pure cycles are unaffected: all cycle nodes have both extensions present, so none are emitted in the first pass; each cycle produces exactly one unitig regardless of entry point (only the cut point varies).
|
||||
|
||||
Pure cycles (all nodes have both extensions present) are unaffected by this: they are never emitted in the first pass and each cycle produces exactly one unitig regardless of which node the second pass starts from. Only the cycle cut point (and therefore the sequence content) varies.
|
||||
This non-determinism is benign for MPHF construction: the MPHF is built from the kmer set, which is identical across tilings.
|
||||
|
||||
### Consequence for MPHF construction
|
||||
---
|
||||
|
||||
The MPHF is built from the **k-mer set**, not from the unitig sequences themselves. Because both tilings contain the same canonical k-mers, the resulting MPHF is identical. The non-determinism is benign for this use case.
|
||||
## Partition-size tradeoff
|
||||
|
||||
Measured on *B. nana* (k=31, m=11), summing across all partitions:
|
||||
|
||||
| N partitions | m_u |
|
||||
|---|---|
|
||||
| 1 | 41.89 |
|
||||
| 16 | 38.19 |
|
||||
| 256 | 37.90 |
|
||||
| 1 024 | 37.89 |
|
||||
|
||||
`m_u` is set by De Bruijn graph topology (heterozygosity, repeats, sequencing errors), not partition count. The variation from 1 to 1024 partitions is under 10%; within 16–1024 it is under 1%. At 256 partitions, unitigs hold ~3.1× more kmers per record than super-kmers (m_u / m_sk) and need about 39% of the nucleotides.
|
||||
|
||||
The theoretical evidence cost decreases by 1 bit/kmer with each doubling of partition count (via `log₂ U = log₂(P/m_u)`); the stored cost stays at 32 bits/slot because of the aligned u32 layout. The sequence storage term `2 · (1 + (k−1)/m_u) ≈ 3.6 bits/kmer` is approximately constant.
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
- **Cross-partition evidence**: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m_u.
|
||||
|
||||
- **Eliminating evidence.bin**: at ~66 % of the per-layer lookup footprint (132 MB vs 200 MB total per partition on the bacterial BCT dataset), evidence.bin dominates index size. A dedicated design investigation is open — see [Evidence elimination design discussion](evidence_elimination.md).
|
||||
- **Cross-partition set operations**: strategy B allows unitig-level operations (mark entire chunks present/absent) rather than kmer-level, potentially reducing cost by a factor of up to m_u.
|
||||
- **Eliminating evidence.bin**: at ~66% of per-layer lookup footprint, `evidence.bin` dominates index size. See [Evidence elimination design discussion](evidence_elimination.md).
|
||||
|
||||