feat: implement persistent layered index and chunked binary format

Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (256) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.
2026-05-09 17:20:08 +08:00
parent 8c17bf958b
commit 5169f65dc9
24 changed files with 1342 additions and 382 deletions
@@ -69,7 +69,9 @@ Consequence for `u8` capacity:
 | nucleotides | 255 nuc | 225 kmers |
 | **kmers** | **255 kmers** | **285 nuc** |

-On *Betula nana* (k=31, 256 partitions), m_u ≈ 37.9 kmers/unitig on average; no unitig length distribution data measured yet. The `rank` field (kmer index within the unitig) fits in a `u8` as long as no unitig exceeds 255 kmers — guaranteed by the split strategy below.
+**Structural maximum from superkmer construction.** For k=31 and m=11, the maximum number of consecutive kmers sharing the same minimiser is k − m + 1 = **21 kmers** (the minimiser traverses from position k−m to 0 as the window slides). A unitig that is a single full superkmer therefore has exactly 21 kmers. This is confirmed by a bimodal distribution in empirical data: a sharp peak at 21 kmers appears in all partitions, including the anomalous partition 145. The observed maximum is ~46 kmers (unitigs spanning more than one superkmer), well within u8 range.
+
+On *Betula nana* (k=31, 256 partitions), m_u ≈ 37.9 kmers/unitig on average. The `rank` field (kmer index within the unitig) fits in a `u8` as long as no unitig exceeds 255 kmers — guaranteed by the split strategy below and amply satisfied by empirical maximums (~46 kmers observed).

 ### Split strategy for long unitigs