feat: optimize unitig index and document evidence elimination

Replace the dense per-chunk offset index with a sparse block-sampled structure (64 chunks per block), reducing the index file size by approximately 300× while preserving O(1) k-mer extraction. Add a design document on eliminating the `evidence.bin` file, which accounts for ~66% of the per-layer lookup footprint, by moving to fingerprint-based approximate indexing and value-based MPHF lookups. Update the MkDocs navigation to include the new document, and add a file counter to the scatter-step progress bar for better observability.
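
To illustrate the block-sampling idea only, here is a minimal sketch. It is not the code from this commit: the length-prefixed chunk layout and the names `build_data`, `build_sparse_index`, and `get_chunk` are hypothetical. One absolute offset is stored per 64 chunks, and a lookup skips at most 63 chunks inside the block, so the work per lookup stays bounded by a constant. The sketch does not try to reproduce the ~300× size figure.

```python
from typing import List

BLOCK = 64  # chunks per block, the value stated in the commit message


def build_data(chunks: List[bytes]) -> bytes:
    """Concatenate chunks, each prefixed by a 1-byte length (assumed layout)."""
    out = bytearray()
    for chunk in chunks:
        if len(chunk) > 255:
            raise ValueError("chunk too long for a 1-byte length prefix")
        out.append(len(chunk))
        out += chunk
    return bytes(out)


def build_sparse_index(data: bytes) -> List[int]:
    """Record the byte offset of every BLOCK-th chunk instead of every chunk."""
    samples = []
    pos = 0
    i = 0
    while pos < len(data):
        if i % BLOCK == 0:
            samples.append(pos)
        pos += 1 + data[pos]  # skip the length byte and the chunk payload
        i += 1
    return samples


def get_chunk(data: bytes, samples: List[int], i: int) -> bytes:
    """Jump to the sampled offset of the block, then skip at most BLOCK - 1 chunks."""
    pos = samples[i // BLOCK]
    for _ in range(i % BLOCK):
        pos += 1 + data[pos]
    length = data[pos]
    return data[pos + 1 : pos + 1 + length]


# Usage: 200 small chunks need only 4 sampled offsets instead of 200.
chunks = [bytes([j % 251]) * (1 + j % 7) for j in range(200)]
data = build_data(chunks)
samples = build_sparse_index(data)
```

The trade-off in this sketch is a short in-block scan on every lookup in exchange for storing roughly one offset per 64 chunks.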
Eric Coissac
2026-05-23 07:51:59 +02:00
parent 9b700ff4a4
commit 4a5ab0b8c2
5 changed files with 488 additions and 151 deletions
@@ -268,3 +268,5 @@ The MPHF is built from the **k-mer set**, not from the unitig sequences themselv
## Open questions
- **Cross-partition evidence**: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m_u.
- **Eliminating evidence.bin**: at ~66 % of the per-layer lookup footprint (132 MB vs 200 MB total per partition on the bacterial BCT dataset), evidence.bin dominates index size. A dedicated design investigation is open — see [Evidence elimination design discussion](evidence_elimination.md).