feat: enforce canonical k-mer representation throughout the codebase
Refactor core types to consistently use `CanonicalKMer` (lexicographically minimal of k-mer and its reverse complement) as the canonical representation, ensuring deterministic behavior in graph traversal (unitig decomposition), neighbor resolution (`unique_neighbor` with `[CanonicalKmer; 4]` input) and scatter output generation. Introduce `RoutableSuperKmer`, add `.seq_hash()` support, fix type syntax errors in unitig extraction methods and deduplication tests. Update all k-mer construction to use canonical-aware APIs, including unsafe unchecked constructors for performance-critical paths.
@@ -225,6 +225,41 @@ The De Bruijn graph stores only canonical kmers. The evidence encodes the canoni

---

## Non-determinism of the unitig decomposition
The unitig extraction is **not deterministic**: two runs on identical input can produce a different number of unitigs with different sequences, while covering exactly the same canonical k-mer set.
### Source of non-determinism
The graph nodes are stored in a hash map whose iteration order depends on the hash seed (random per run with `ahash::RandomState::new()`). The `start_iter` first pass yields every node whose `can_extend_left` flag is false as a unitig start. That includes not only true dead-end nodes but also **branch points** (nodes with 2 or more left neighbours, for which `unique_neighbor` returns `None`).

When a branch point is encountered before its upstream neighbours, it claims the downstream chain, and each of those neighbours later produces a degenerate unitig of length k (a single k-mer). When an upstream neighbour is encountered first, it extends through the branch point and consumes it together with the downstream chain.

**Example** (fork topology, k = 31):
```
A → B ← C
    ↓
    D
```
All four nodes are in the graph. B has two left neighbours (A and C), so `can_extend_left = false`; B also has one right neighbour D, so `can_extend_right = true`.

| iteration order | unitigs produced | count |
|---|---|---|
| A first, then B, C | ABD · C | 2 |
| B first, then A, C | BD · A · C | 3 |

Both tilings cover the same 4 canonical k-mers.

Pure cycles (all nodes have both extensions present) do not change the unitig count: they are never emitted in the first pass, and each cycle produces exactly one unitig regardless of which node the second pass starts from. Only the cycle cut point (and therefore the sequence content) varies.
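
The fork example can be reproduced with a toy model. The sketch below is not the project's implementation: nodes are single characters instead of canonical k-mers, the iteration order is passed in explicitly instead of coming from a seeded hash map, and the names `fork_edges`, `extract_unitigs` and `covered` are hypothetical. It assumes, as described above, that a walk stops only when the current node has no unique right neighbour or that neighbour is already visited. The second pass for pure cycles is omitted.

```rust
use std::collections::{BTreeSet, HashMap, HashSet};

/// Toy fork graph from the example: A → B, C → B, B → D.
fn fork_edges() -> Vec<(char, char)> {
    vec![('A', 'B'), ('C', 'B'), ('B', 'D')]
}

/// Simplified first pass (an assumption, not the real `start_iter`):
/// visit nodes in `order`, start a unitig at every unvisited node that
/// cannot extend left (0 or 2+ predecessors), then walk right while the
/// current node has exactly one successor that is still unvisited.
fn extract_unitigs(edges: &[(char, char)], order: &[char]) -> Vec<String> {
    let mut preds: HashMap<char, Vec<char>> = HashMap::new();
    let mut succs: HashMap<char, Vec<char>> = HashMap::new();
    for &(from, to) in edges {
        succs.entry(from).or_default().push(to);
        preds.entry(to).or_default().push(from);
    }
    let degree = |m: &HashMap<char, Vec<char>>, n: char| m.get(&n).map_or(0, |v| v.len());

    let mut visited: HashSet<char> = HashSet::new();
    let mut unitigs = Vec::new();
    for &start in order {
        // Exactly one predecessor means "can extend left": not a start node.
        if visited.contains(&start) || degree(&preds, start) == 1 {
            continue;
        }
        let mut path = String::new();
        let mut node = start;
        loop {
            visited.insert(node);
            path.push(node);
            match succs.get(&node).map(|v| v.as_slice()) {
                Some(&[next]) if !visited.contains(&next) => node = next,
                _ => break,
            }
        }
        unitigs.push(path);
    }
    unitigs
}

/// Set of nodes (stand-ins for canonical k-mers) covered by a tiling.
fn covered(unitigs: &[String]) -> BTreeSet<char> {
    unitigs.iter().flat_map(|u| u.chars()).collect()
}
```

With the order `['A', 'B', 'C', 'D']` this model returns `ABD` and `C`; with `['B', 'A', 'C', 'D']` it returns `BD`, `A` and `C`. `covered` gives the same four-node set for both, which is the property the MPHF construction relies on.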
### Consequence for MPHF construction
The MPHF is built from the **k-mer set**, not from the unitig sequences themselves. Because both tilings contain the same canonical k-mers, the resulting MPHF is identical. The non-determinism is benign for this use case.

---

## Open questions
- **Rank field width**: a u8 covers unitigs of up to 255 k-mers; storing lengths and ranks in k-mer units (not nucleotides) buys k−1 extra units of headroom at no cost. On *B. nana* (k=31), m_u ≈ 38, which is well within u8 range on average, but the maximum unitig length has not been measured yet. For genomes with very long unitigs, u16 may be needed; the header could record the actual width if portability is required.
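
The headroom argument is simple arithmetic: a unitig holding n overlapping k-mers spans n + k − 1 nucleotides. A minimal sketch follows, with hypothetical helper names (`nucleotides`, `rank_width_bits`) and assuming the stored value only has to fit the largest unitig length counted in k-mers; the actual field layout is not specified here.

```rust
/// Nucleotide length of a unitig holding `n_kmers` overlapping k-mers.
fn nucleotides(n_kmers: usize, k: usize) -> usize {
    n_kmers + k - 1
}

/// Smallest field width in bits (8, 16 or 32) able to store
/// `max_len_kmers`. Hypothetical helper: the width a header could record.
fn rank_width_bits(max_len_kmers: usize) -> u32 {
    if max_len_kmers <= u8::MAX as usize {
        8
    } else if max_len_kmers <= u16::MAX as usize {
        16
    } else {
        32
    }
}
```

For k = 31, a u8 counted in k-mer units reaches unitigs of `nucleotides(255, 31)` = 285 nucleotides, whereas a u8 counted in nucleotides would stop at 255. A maximum length of 256 k-mers or more would already push `rank_width_bits` to 16.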