feat: optimize unitig index and document evidence elimination
Replace the dense per-chunk offset index with a sparse block-sampled structure (64 chunks per block), reducing the index file size by approximately 300× while preserving O(1) k-mer extraction. Add a design document on eliminating the `evidence.bin` file, which accounts for ~66% of the per-layer lookup footprint, by moving to fingerprint-based approximate indexing and value-based MPHF lookups. Update the MkDocs navigation to include the new document, and add a file count tracker to the scatter step progress bar for better observability.
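
A minimal sketch of the block-sampling idea, for illustration only. It assumes each chunk is stored with a 2-byte length prefix, so that a lookup can jump to the sampled block offset and skip at most 63 chunks; the actual on-disk layout is not described in this commit message, and the names `build_sparse_index` and `get_chunk` are hypothetical.

```python
import struct

BLOCK = 64  # chunks per block, as stated in the commit message


def build_sparse_index(chunks):
    """Pack chunks back to back and sample one byte offset per block.

    Assumption: every chunk is written with a little-endian 2-byte
    length prefix, so chunks inside a block can be skipped without
    a per-chunk offset table.
    """
    data = bytearray()
    block_offsets = []
    for i, chunk in enumerate(chunks):
        if i % BLOCK == 0:
            block_offsets.append(len(data))  # offset of the block's first chunk
        data += struct.pack("<H", len(chunk)) + chunk
    return bytes(data), block_offsets


def get_chunk(data, block_offsets, i):
    """Return chunk i by jumping to its block and skipping < BLOCK chunks."""
    pos = block_offsets[i // BLOCK]
    for _ in range(i % BLOCK):  # bounded by BLOCK - 1, hence constant work
        (n,) = struct.unpack_from("<H", data, pos)
        pos += 2 + n
    (n,) = struct.unpack_from("<H", data, pos)
    return data[pos + 2 : pos + 2 + n]


if __name__ == "__main__":
    chunks = [bytes([i % 251]) * (1 + i % 7) for i in range(200)]
    data, offsets = build_sparse_index(chunks)
    assert get_chunk(data, offsets, 130) == chunks[130]
```

In this sketch the index holds one offset per 64 chunks instead of one per chunk, and the skip loop is bounded by the block size, which is how the lookup stays constant-time.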
@@ -268,3 +268,5 @@ The MPHF is built from the **k-mer set**, not from the unitig sequences themselv
## Open questions
- **Cross-partition evidence**: for set operations spanning multiple partitions, strategy B allows operations at the unitig level (e.g. marking entire unitigs as present or absent) rather than at the k-mer level, potentially reducing the operation cost by a factor of m_u.
- **Eliminating evidence.bin**: at ~66 % of the per-layer lookup footprint (132 MB of the 200 MB total per partition on the bacterial BCT dataset), evidence.bin dominates index size. A dedicated design investigation is open; see [Evidence elimination design discussion](evidence_elimination.md).