Commit Graph

8 Commits

Author SHA1 Message Date
Eric Coissac 98c14aade9 feat: centralize index configuration and add hybrid mode
Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.
2026-05-26 15:08:29 +02:00
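The mode-dependent `.idx` requirement described in commit 98c14aade9 can be sketched as follows. This is a minimal illustration only: the variant names come from the commit message, while the `requires_idx` helper is hypothetical and is not taken from the repository.

```rust
/// The three index modes named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IndexMode {
    Exact,
    Approx,
    Hybrid,
}

/// Hypothetical helper: per the commit message, the `.idx` file is only
/// required for the Exact and Hybrid modes.
fn requires_idx(mode: IndexMode) -> bool {
    matches!(mode, IndexMode::Exact | IndexMode::Hybrid)
}
```

Keeping the decision in one function over a single stored mode mirrors the stated goal of centralizing configuration rather than reading per-layer metadata.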
Eric Coissac 036d044291 refactor: update core types and add approximate evidence support
Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
2026-05-26 10:04:25 +02:00
Eric Coissac da56c3e290 docs: update architecture and storage specs for approximate index
Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
2026-05-23 13:54:31 +02:00
Eric Coissac 4a5ab0b8c2 feat: optimize unitig index and document evidence elimination
Replace the dense per-chunk offset index with a sparse block-sampled structure (64 chunks per block), reducing the index file size by approximately 300× while preserving O(1) k-mer extraction. Introduce a design document for eliminating the `evidence.bin` file, which accounts for ~66% of the lookup layer, by transitioning to fingerprint-based approximate indexing and value-based MPHF lookups. Update MkDocs navigation to include the new documentation and add a file count tracker to the scatter step progress bar for improved observability.
2026-05-23 12:53:42 +02:00
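The block-sampling idea in commit 4a5ab0b8c2 can be illustrated with a small sketch. It assumes, hypothetically, that each chunk's byte length fits in a `u8` and can be read cheaply while walking a block; the function names and layout are illustrative and are not the repository's actual format. Only one absolute offset is kept per 64 chunks, and a lookup scans at most 63 lengths, so the cost stays bounded by a constant.

```rust
/// Block size taken from the commit message (64 chunks per block).
const CHUNKS_PER_BLOCK: usize = 64;

/// Build the sparse index: one absolute byte offset per block of chunks.
/// `chunk_lens` is an assumed in-memory list of per-chunk byte lengths.
fn build_sparse_index(chunk_lens: &[u8]) -> Vec<u64> {
    let mut samples = Vec::with_capacity(chunk_lens.len() / CHUNKS_PER_BLOCK + 1);
    let mut offset = 0u64;
    for (i, &len) in chunk_lens.iter().enumerate() {
        if i % CHUNKS_PER_BLOCK == 0 {
            samples.push(offset);
        }
        offset += u64::from(len);
    }
    samples
}

/// Byte offset of chunk `i`: start from the block sample, then add the
/// lengths of the chunks that precede `i` inside the same block.
fn chunk_offset(samples: &[u64], chunk_lens: &[u8], i: usize) -> u64 {
    let block = i / CHUNKS_PER_BLOCK;
    let start = block * CHUNKS_PER_BLOCK;
    let within: u64 = chunk_lens[start..i].iter().map(|&l| u64::from(l)).sum();
    samples[block] + within
}
```

In this sketch the index shrinks by the block size alone; the larger reduction reported in the commit depends on details of the on-disk encoding that are not shown here.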
Eric Coissac f36b095ce2 docs: clarify MPHF indexing, storage layout, and distance traits
Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
2026-05-17 15:59:10 +08:00
Eric Coissac 5169f65dc9 feat: implement persistent layered index and chunked binary format
Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (256) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.
2026-05-09 17:38:29 +08:00
Eric Coissac defeeb9460 feat: enforce canonical k-mer representation throughout the codebase
Refactor core types to consistently use `CanonicalKmer` (the lexicographically smaller of a k-mer and its reverse complement) as the canonical representation, ensuring deterministic behavior in graph traversal (unitig decomposition), neighbor resolution (`unique_neighbor` with `[CanonicalKmer; 4]` input), and scatter output generation. Introduce `RoutableSuperKmer`, add `.seq_hash()` support, and fix type syntax errors in unitig extraction methods and deduplication tests. Update all k-mer construction to use canonical-aware APIs, including unsafe unchecked constructors for performance-critical paths.
2026-05-02 16:31:08 +02:00
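The canonical form described in commit defeeb9460 can be sketched with a 2-bit packed k-mer. This is an illustrative sketch under stated assumptions: the encoding (A=0, C=1, G=2, T=3, first base in the most significant used bits) and the free functions below are hypothetical and do not reproduce the repository's `CanonicalKmer` type. With this encoding, numeric order on the packed value coincides with lexicographic order on the sequence, so the canonical form is simply the minimum of the two strands.

```rust
/// Pack an uppercase ACGT string (at most 32 bases) into 2 bits per base.
/// Returns None for longer input or any other character.
fn encode(seq: &str) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut v = 0u64;
    for b in seq.bytes() {
        let code = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        v = (v << 2) | code;
    }
    Some(v)
}

/// Reverse complement of a packed k-mer of length `k`.
/// With this encoding the complement of a base code `c` is `3 - c`.
fn reverse_complement(kmer: u64, k: usize) -> u64 {
    let mut x = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        rc = (rc << 2) | (3 - (x & 3));
        x >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the k-mer and its reverse complement.
fn canonical(kmer: u64, k: usize) -> u64 {
    kmer.min(reverse_complement(kmer, k))
}
```

For example, `ACG` and its reverse complement `CGT` both map to the packed value of `ACG`, which is what makes downstream traversal and routing independent of the strand that was read.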
Eric Coissac 27f5e88a7b refactor: implement RoutableSuperKmer and update k-mer indexing pipeline
Replace raw SuperKmer routing with a new RoutableSuperKmer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline to yield RoutableSuperKmers throughout (builder, scatterer), and refactor the FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst). Revise the SuperKmer header to store n_kmers instead of seql (avoiding 256-byte wrap). Update documentation to reflect minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.
2026-05-01 09:33:26 +02:00