Commit Graph

11 Commits

Author SHA1 Message Date
Eric Coissac 7501b6e854 refactor: switch indexing to IndexMode and update metadata
Replace EvidenceKind with IndexMode (Exact, Approx, Hybrid) across layer construction and query dispatch. Update PartitionMeta and LayerMeta serialization to centralize index-wide configuration. Add flexible push_layer overloads to LayeredMap for dynamic index expansion without full rebuilds. Improve UnitigFileReader to gracefully fallback to sequential scanning when indexes are missing, eliminating panics.
2026-05-26 14:57:17 +02:00
Eric Coissac 9d46400898 feat: support exact and approximate evidence in layer construction
Refactored `MphfLayer::build` to accept an `EvidenceKind` parameter, routing to exact (index-based, parallel MPHF, writes `evidence.bin`) or approximate (sequential mmap iterator, writes `fingerprint.bin`) pipelines. Introduced `CanonicalKmerIter` for memory-mapped, chunked k-mer iteration with O(1) resets via `Arc<Mmap>`. Updated layer and map APIs to forward evidence kind, added `push_layer` for count matrices, and adjusted tests and public exports accordingly.
2026-05-26 10:23:43 +02:00
Eric Coissac b2a52bfb37 perf: optimize chunk_start for single-block indexing
Bypasses bitwise shift and mask operations when `block_bits == 0`, directly indexing `self.block_offsets[i]` instead. This eliminates unnecessary arithmetic overhead for single-block cases while preserving the original block-based offset calculation for larger block sizes.
2026-05-23 13:34:05 +02:00
Eric Coissac bc51cd9861 feat: add configurable block sizes and in-place reindex command
Propagate configurable block size (`block_bits`) through index and layer construction to control unitig chunking and optimize memory/performance trade-offs. Introduce an in-place `reindex` command and library method to convert indices between exact and approximate evidence formats. Add validation to reject merging indexes with mismatched evidence types, and update parallel kmer counting to use `AtomicUsize` for thread-safe aggregation. Includes CLI argument parsing, metadata persistence, and updated tests.
2026-05-23 13:28:24 +02:00
Eric Coissac 24afd74e2f refactor: decouple unitig index generation and add exact evidence
Decouple index generation by introducing `build_unitig_idx()` for retroactive `.idx` creation and optional immediate writing on close. Add `open_sequential()` for index-less iteration while enforcing index requirements for random access. Refactor the MPHF layer to pre-generate the unitig index for parallel random access, integrate `rayon` for k-mer processing, and enforce mapping integrity via duplicate slot validation. Additionally, implement `build_exact_evidence()` to reconstruct evidence from existing artifacts, and update tests to leverage the new index generation and simplified k-mer iteration helpers.
2026-05-23 13:02:25 +02:00
Eric Coissac 8478072b78 feat: make index granularity configurable via block_bits
Replaces the hardcoded BLOCK_SIZE constant with a configurable block_bits parameter, enabling variable index granularity to balance index size and sequential scan cost. Both the reader and writer now store block_bits and a precomputed mask for branchless offset arithmetic, while the index file format is upgraded to UIX3 to persist the configuration. Comprehensive unit tests verify serialization, chunk offset indexing, random access consistency, and kmer count accuracy across various block sizes.
2026-05-23 12:57:56 +02:00
Eric Coissac 4a5ab0b8c2 feat: optimize unitig index and document evidence elimination
Replace the dense per-chunk offset index with a sparse block-sampled structure (64 chunks per block), reducing the index file size by approximately 300× while preserving O(1) k-mer extraction. Introduce a design document for eliminating the `evidence.bin` file, which accounts for ~66% of the lookup layer, by transitioning to fingerprint-based approximate indexing and value-based MPHF lookups. Update MkDocs navigation to include the new documentation and add a file count tracker to the scatter step progress bar for improved observability.
2026-05-23 12:53:42 +02:00
Eric Coissac ff75c9198d feat: add kmer iterators and optimize layered map performance
Replace `ph` with `ptr_hash` and introduce `epserde` and `rayon` dependencies. Refactor MPHF construction to leverage parallel iteration, eliminating intermediate `Vec<u64>` allocations and reducing memory footprint. Add a `n_kmers` field to track and serialize total kmer counts, alongside three zero-allocation iterators for efficient chunk traversal. Include comprehensive unit tests for the new iterators and update CLAUDE.md to enforce explicit dependency validation policies.
2026-05-12 22:35:21 +08:00
Eric Coissac 92cda13ae4 refactor: streamline I/O handling and remove deprecated codec module
Removes the `codec` and `limits` modules, eliminating manual serialization logic and OS file descriptor limit queries. Introduces a shared `append_path_suffix` utility to standardize path manipulation across `meta.rs` and `unitig_index.rs`. Refactors the file pool to dynamically size based on available file descriptors and optimizes descriptor lifecycle to prevent leaks. Enhances `SKFileReader` with LRU eviction, consumption tracking, and seek-on-reopen support. Migrates related round-trip and EOF tests to the reader module and updates chunking logic in the unitig index.
2026-05-09 19:29:06 +08:00
Eric Coissac e008a8beb4 harden obiskio error handling with explicit variants and bounds checking
Extend `SKError` with `BadMagic`, `Truncated`, and `InvalidData` variants to replace `expect()` calls and unsafe slice indexing. Implement proper error chaining via `source()` and update `Display` formatting. Improve magic byte validation and serde error mapping for clearer debugging. Documents a broader roadmap for further crate hardening, including LRU eviction, concurrency, and benchmarking.
2026-05-09 18:30:19 +08:00
Eric Coissac 5169f65dc9 feat: implement persistent layered index and chunked binary format
Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (256) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.
2026-05-09 17:38:29 +08:00