Commit Graph

139 Commits

Author SHA1 Message Date
Eric Coissac 03c7bb0b99 Relax unitig assertion in debruijn test
Replace the strict `unitigs.len() == 1` assertion with a non-empty check to allow multiple unitigs. Update the test comment to describe the general non-repetitive sequence recovery principle instead of a specific example. The core k-mer roundtrip validation logic remains unchanged.
2026-06-06 06:41:45 +02:00
Eric Coissac b39eee688a refactor(debruijn): unify graph traversal with WalkState iterator
Replaces deeply nested branching with early returns and `then_some`. Introduces a cycle-detecting `find_chain_start` method and updates `UnitigNucIter` to use step-based iteration with atomic node claiming. This eliminates nested iterators and redundant state management, improving code readability and maintainability.
2026-06-06 06:38:28 +02:00
Eric Coissac 95b3461405 refactor: centralize graph traversal logic in walk
Refactor `leavable` and `reachable` to eliminate duplicated graph traversal logic by mutually delegating via `WalkState`. `leavable` now returns `self.walk(graph).is_some()`, while `reachable` delegates to the inverted `direct` state's `leavable` check. This centralizes kmer extension and visited-state validation in `walk`, simplifying control flow and reducing code duplication.
2026-06-06 06:36:48 +02:00
Eric Coissac 952a21eef8 refactor: remove naked_asm and extract collect_unitigs helper
Remove the `std::arch::naked_asm` import as it is no longer required. Introduce a `collect_unitigs` helper to abstract nucleotide sequence extraction from `GraphDeBruijn`, and refactor the test suite to use it, eliminating repetitive collection code and standardizing graph iteration logic.
2026-06-06 04:33:59 +02:00
Eric Coissac 5c2f48535f refactor: rename compute_degrees and mark start nodes
Renames `compute_degrees` to `compute_degrees_and_mark_starts` across the De Bruijn graph and partitioner layers to consolidate degree calculation and start-node flagging. Introduces safe neighbor iteration methods and a debug validation block to verify graph consistency. Refactors unitig extraction to use sequential execution with a `Mutex` for safe error propagation. Fixes malformed and duplicated method calls, adds auto-generation of missing `meta.json` files, and ensures persistent matrix builders are explicitly closed to finalize metadata.
2026-06-05 19:48:59 +02:00
Eric Coissac 27088ab810 refactor: optimize unitig iteration and graph traversal
Switches unitig processing to a lazy, fallible `try_for_each_unitig` API across partitioner layers, reducing intermediate allocations and enabling proper error propagation. Refactors de Bruijn graph traversal into a two-pass algorithm with explicit node flags, named constants, and diagnostic logging. Introduces parallel chain processing and staged performance profiling for the unitig command, and adds a memory-efficient `FromIterator` implementation for packed nucleotide sequences.
2026-06-05 19:48:59 +02:00
Eric Coissac d202ead385 feat: parallelize unitig extraction and FASTA output
Replace the non-atomic `set_visited` with atomic `fetch_or` bitmask operations to enable thread-safe node claiming. Introduce a two-phase extraction pipeline where `par_for_each_chain_unitig` builds chains in parallel and `for_each_remaining_unitig` sequentially handles residual cycles and junctions. Add `is_start` and `collect_from_start` to explicitly define unitig boundaries. Wrap `BufWriter` in a `Mutex` and use an `AtomicUsize` counter to ensure thread-safe concurrent FASTA output, refactoring the write logic into a shared closure for safe multi-threaded execution.
2026-06-05 10:33:52 +02:00
Eric Coissac 249998beed perf: add structured performance profiling for unitig stages
Wraps graph construction, degree computation, and unitig enumeration phases with `Stage` start/stop calls. Intervals are recorded in a `Reporter` instance and printed upon completion to provide granular timing metrics for each computational stage.
2026-06-05 10:28:45 +02:00
Eric Coissac 2f29ee2240 feat: Add parallel execution and thread-safe graph operations
Integrate rayon to enable parallel processing of k-mer partitions and degree computation. Replace Cell with AtomicU8 to ensure thread-safe node state management, and add a merge method for combining disjoint graphs. Additionally, introduce progress tracking utilities and a test-utils feature flag for development dependencies.
2026-06-04 23:22:55 +02:00
Eric Coissac edd5e3f8ee feat: add bits-per-kmer diagnostic and stats module
Introduce a `stats` module to compute normalized storage efficiency metrics. The new `KmerIndex::bits_per_kmer()` method parallelizes disk I/O across partitions to aggregate file sizes for MPHF, evidence, and matrix components. Publicly export `IndexBitsPerKmer` and add a `--bits-per-kmer` CLI flag to trigger the diagnostic routine and print detailed statistics.
2026-06-04 23:17:17 +02:00
Eric Coissac 9306ec1c56 perf: Replace manual window tracking with monotonic deque algorithm
Eliminates intermediate allocations by computing per-genome window minimums (`win_min`) directly. Unifies the `z ≤ 1` and `z > 1` branches into a single buffer-reused accumulation loop, efficiently validating k-mer presence.
2026-06-04 21:37:09 +02:00
Eric Coissac 712a03a3a6 refactor: replace unitig extraction with de Bruijn graph pipeline
This change replaces direct partition-based extraction with a pipeline that reconstructs a de Bruijn graph from filtered k-mers. It introduces `FilterArgs` for k-mer selection, collects filtered k-mers in parallel into a `GraphDeBruijn`, computes node degrees, and enumerates unitigs from the graph for output instead of reading pre-computed partition files.
2026-06-04 21:32:49 +02:00
Eric Coissac 3e62ffe010 feat: add selective k-mer filtering to dump and rebuild commands
Add the `obidebruinj` dependency and introduce `FilterArgs` CLI arguments for ingroup/outgroup predicates and count/fraction thresholds. Extend `GroupFilterParams` to support outgroup filtering, and integrate the filter collection into `KmerIndex::dump` and `rebuild` commands. This enables selective k-mer filtering during index operations and CSV exports.
2026-06-04 21:29:58 +02:00
Eric Coissac a1499e6153 feat: add kmer filtering and refactor layer iteration
Introduce a `passes_all` utility to validate kmer rows against multiple filters using short-circuit logic. Integrate a `filters` parameter into the iteration functions to conditionally emit kmers based on filter results. Extract repetitive layer traversal and filtering into an `iter_src_layers` helper, refactoring Pass 1 and Pass 2 to eliminate duplication. Additionally, add a debug conditional to the dump output to include partition and layer metadata alongside kmer sequences.
2026-06-04 21:08:15 +02:00
Eric Coissac 476c7a6394 feat: add metadata-driven k-mer filtering for rebuild command
Introduces a metadata-driven filtering system for the rebuild command, classifying genomes into ingroup and outgroup categories using exact, inequality, and hierarchical path predicates. Implements a GroupQuorumFilter to enforce configurable presence thresholds and fraction constraints per group. Refactors the command to replace global quorum filters with this unified approach, converts the presence flag to a threshold parameter, and adds corresponding documentation and MkDocs navigation.
2026-06-04 21:01:58 +02:00
Eric Coissac 02cb30c0ef feat: add obisys crate for standardized CLI progress reporting
This commit introduces the `obisys` crate, which wraps `indicatif` to provide reusable `spinner` and `progress_bar` utilities with consistent styling and tick intervals. It refactors progress reporting across `obikindex`, `obikpartitionner`, and `obikmer` to use these shared functions, eliminating inline UI configuration and ensuring uniform terminal feedback.
2026-06-03 19:03:59 +02:00
Eric Coissac 4677d6f177 refactor: improve resource cleanup and index packing
Explicitly close file handles and remove temporary artifacts after serialization to prevent disk space leaks. Additionally, compact internal matrix structures immediately upon loading the KmerIndex to improve memory efficiency and prepare for downstream operations.
2026-06-03 15:35:56 +02:00
Eric Coissac bba5147f0f fix: account for k-mer overlap in total_bases calculation
Introduces a `kmer_overlap` variable (`k-1`) and modifies the `total_bases` accumulation to subtract this overlap from each sequence's length. This ensures the base count accurately reflects only valid k-mer starting positions rather than raw sequence length.
2026-06-03 15:11:48 +02:00
Eric Coissac bfe0cb4b82 feat: integrate obikseq to configure global k-mer and minimizer sizes
This change adds the `obikseq` crate as a local dependency and inserts `set_k` and `set_m` calls across index creation and command modules. By synchronizing the runtime's global k-mer and minimizer dimensions with the loaded index parameters, downstream sequence processing and partitioning operations now consistently use the correct structural constraints.
2026-06-03 14:31:14 +02:00
Eric Coissac 173ac9fb42 feat: introduce packed matrix storage and layer metadata
Unifies bit and integer matrix storage into `PersistentBitMatrix` and `PersistentCompactIntMatrix` enums, supporting both columnar and memory-mapped single-file layouts. Introduces `LayerMeta` to persist layer dimensions as `layer_meta.json`, enabling correct initialization of implicit presence matrices. Adds CLI commands (`pack` and `--upgrade-index`) to convert existing columnar indices to the compact format and backfill missing metadata. Updates partitionner and layered map logic to use the new persistent builders, optimized memory allocation, and auto-detected storage backends.
2026-06-03 14:16:04 +02:00
Eric Coissac de1a41810a perf: enable zero-allocation queries and memory-mapped indexes
Introduce zero-allocation row extraction and query result buffers across `obicompactvec` and `obikpartitionner` to eliminate per-kmer heap allocations. Replace in-memory MPHF deserialization with memory-mapped, zero-copy views to reduce runtime memory footprint. Add configurable I/O chunking, a RAM-aware `--chunk-size` parameter, and system memory monitoring via the new `sysinfo` dependency. Re-export `PreloadedIndex` for external consumers.
2026-06-03 10:24:12 +02:00
Eric Coissac 1661dd6b1c feat: introduce preloaded index cache and thread-safe progress tracker
Introduce `PreloadedIndex` to cache partition indices and eliminate redundant I/O during repeated queries. Refactor the query pipeline to route through this pre-loaded index, and expose it publicly in `obikpartitionner`. Additionally, add a thread-safe, lazily-initialized `MultiProgress` singleton for improved progress tracking.
2026-06-03 09:42:18 +02:00
Eric Coissac 2ebc5f0d75 chore: add logging infrastructure to merge routine
Adds comprehensive logging for source metadata, merge modes, and forced approximation detection. Introduces `format_evidence` and `is_trivial` helpers to format `IndexMode` variants and identify single-genome presence indices. The core merge algorithm remains unmodified, with all changes focused on enhanced runtime observability.
2026-06-01 15:23:37 +02:00
Eric Coissac add6d7f873 enforce uniform index mode and optimize base index selection
Adds validation to ensure all input sources share the same `IndexMode`. Introduces base index selection logic that prioritizes approximate or hybrid evidence and maximizes base size to minimize newly indexed k-mers. Includes helper functions for triviality evaluation, cumulative size calculation, and mode consistency checks.
2026-06-01 14:43:51 +02:00
Eric Coissac 0350ca855b refactor: streamline merge pipeline and MPHF indexing
Replace mphf.find() with direct mphf.index() calls to eliminate absence checks and fallback vectors. Introduce a lightweight MphfOnly wrapper for faster index loading, and standardize k-mer iteration across merge and rebuild layers. Update IndexMeta configuration and n_new calculation to leverage MPHF cardinality, streamlining the overall merge pipeline.
2026-06-01 14:37:35 +02:00
Eric Coissac 657f964dda refactor: abstract k-mer types and fix bit alignment
Abstracts k-mer storage using a `RawKmer` alias and `KMER_BITS` constant to simplify bit manipulation and enable future extension to larger types. Updates bit-shifting and masking logic across `kmer.rs` and `packed_seq.rs` to prevent overflow and improve type safety. Adapts the MPHF layer to iterate over indexed canonical k-mers with explicit slot bounds validation and bit-level encoding. Fixes test suite compilation errors by correcting method names, adding tuple destructuring, and passing the required `IndexMode::Exact` parameter.
2026-05-31 20:46:49 +02:00
Eric Coissac 8b57c91ab7 feat: add iter_canonical_kmers iterator and tests
Implements `iter_canonical_kmers` in `unitig_index` to yield canonical kmers by traversing unitig chunks. Includes unit tests to verify that the iterator exclusively yields canonical kmers and matches `iter_kmers` in count and canonical equivalence.
2026-05-30 16:30:02 +02:00
Eric Coissac 708b0abf9b refactor: aggregate query results at sequence level
Refactor the query pipeline to buffer partition outputs into a per-sequence `seq_results` vector, deferring final accumulation until all partitions complete. This ensures global position ordering before computing k-mer presence, counts, and coverage statistics. Additionally, removes a resolved TODO and documents a known BLAST false-positive issue where chloroplast and bacterial contaminants yield unrealistic high-confidence matches.
2026-05-30 07:18:54 +02:00
Eric Coissac 86b88acb95 feat: implement approximate k-mer indexing and optimize query
Enable approximate k-mer indexing via the `--approx` flag, computing an effective k-mer size of `k - z + 1` and configuring the appropriate indexing mode with validated probabilistic parameters. Refactor the Findere z-window filter in the query command to improve performance and correctness by replacing the precomputed vector with a lazy closure, optimizing cache locality, and fixing a variable naming bug.
2026-05-29 09:20:44 +02:00
Eric Coissac be0e8f1041 refactor(query): parallelize query execution with obipipeline
Extracts chunk processing into a dedicated function and introduces a QueryData enum with unsafe Send/Sync implementations to safely distribute Rope chunks across worker threads. Replaces nested iteration with a flat iterator and parallel block processing. Adds CLI argument parsing for presence, threshold, and detail flags to configure the pipeline.
2026-05-29 09:18:54 +02:00
Eric Coissac eaa52eaab5 feat: introduce nucstream abstraction and comprehensive test suite
Introduces a unified NucStream abstraction with NucPageCursor for byte-offset tracking and MIME-type dispatch to instantiate format-specific parsers. Exposes nuc_stream and open_nuc_stream APIs that return boxed, Send-compatible iterators. Additionally, adds a comprehensive test suite covering chunk boundary alignment, FASTA/FASTQ record parsing, sequence normalization, and edge cases such as CRLF line endings, @ in quality strings, and multi-slice rope processing.
2026-05-29 09:17:24 +02:00
Eric Coissac cfadf63bbc refactor: migrate pipeline to NucPage-based stream processing
Replace the existing chunk and Rope-based processing pipeline with a fixed-size NucPage architecture. Introduce a new nucstream module featuring buffer-pooled, in-place parsing that auto-detects and decompresses FASTA/FASTQ/GenBank inputs into normalized ACGT streams with k-mer overlap preservation. Update obikmer scatter and superkmer stages to consume NucPage iterators and cursor-based navigation, eliminating std::io::Read dependencies and optimizing memory management. Add a configurable max_open_files CLI argument and update implementation documentation to reflect the new record vs. stream reading paths.
2026-05-29 09:10:25 +02:00
Eric Coissac a4b57a96de feat: add streaming sequence reader and superkmer iterator
Introduce the `obiread` crate with a streaming byte normalizer that processes FASTA, FASTQ, and GenBank files using a 64 KiB ring buffer for O(1) memory usage. Integrate this crate into `obiskbuilder` to provide `SuperKmerStreamIter`, enabling memory-efficient superkmer traversal with rolling entropy and minimizer-based cut conditions.
2026-05-29 09:10:25 +02:00
Eric Coissac 0d9be53d1f feat: enforce runtime validation for kmer and minimizer parameters
Introduces `CommonArgs::validate()` to enforce strict constraints on `--kmer-size` (odd, 11–31), `--minimizer-size` (odd, 3–k−1), and `z` (strictly less than k). This validation is applied at the entry point of the `superkmer` and `index` commands to prevent invalid configurations, avoid palindromes, prevent u64 overflow, and ensure positive effective indexing sizes. Documentation is updated to reflect these runtime checks and immediate termination on invalid input.
2026-05-29 09:10:25 +02:00
Eric Coissac 694da5208e feat: add Findere z-window filtering and detail mode coverage tracking
Introduces the `--findere-z` CLI flag to override the index's sliding window parameter and implements `apply_findere` to filter k-mer hits using a z-consecutive positions window. Adds structural support for `--detail` mode, including per-sequence k-mer offsets, conditional allocation of per-position coverage vectors, and JSON serialization. Updates architecture documentation, CLI references, and annotation schemas to align with the new implementation, resolving prior discrepancies with `--detail` and `--mismatch` flags.
2026-05-26 15:43:17 +02:00
Eric Coissac 26ab165807 refactor: add rolling buffer methods and document label constraints
Added `is_empty()`, `clear()`, and `iter()` methods to the rolling statistics buffer to enable standard traversal and state reset operations. Documented genome label constraints, specifying forbidden characters, empty label rejection, space quoting requirements, and auto-derived label bypass rules. Additionally, updated doc comments and added `#[allow(dead_code)]` attributes for `kmer_offset` and `n_kmers` fields to suppress compiler warnings while reserving them for future `--detail` coverage vector logic.
2026-05-26 15:40:23 +02:00
Eric Coissac dfa0b2bac2 feat: add utils subcommand for renaming genome labels
Introduces a `utils` CLI subcommand to enable in-place genome label renaming without full reindexing. Adds strict label validation to reject empty strings, filesystem separators, and control characters, ensuring safe CSV serialization. Updates index metadata, renames corresponding spectrum JSON files, and registers the command in the main dispatch logic. CLI reference documentation is also updated.
2026-05-26 15:35:22 +02:00
Eric Coissac 9e60a711bc Enforce minimum input paths and handle stdin sentinel
Update CLI validation to require at least 10 input paths, defaulting to stdin (`-`) when the argument list is empty. Refactor the path iterator to explicitly recognize the stdin sentinel, bypassing extension validation and directory expansion to ensure direct passthrough to the file buffer without triggering `stat()` or recursive traversal.
2026-05-26 15:22:38 +02:00
Eric Coissac 98c14aade9 feat: centralize index configuration and add hybrid mode
Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.
2026-05-26 15:08:29 +02:00
Eric Coissac 7501b6e854 refactor: switch indexing to IndexMode and update metadata
Replace EvidenceKind with IndexMode (Exact, Approx, Hybrid) across layer construction and query dispatch. Update PartitionMeta and LayerMeta serialization to centralize index-wide configuration. Add flexible push_layer overloads to LayeredMap for dynamic index expansion without full rebuilds. Improve UnitigFileReader to gracefully fallback to sequential scanning when indexes are missing, eliminating panics.
2026-05-26 14:57:17 +02:00
Eric Coissac 1d880fdc5f refactor: optimize MPHF construction and update legacy guidelines
Replaces parallel random-access unitig iteration with a sequential mmap-based iterator for MPHF construction, eliminating the build-time `.idx` dependency by deferring index generation until after persistence. Updates `CLAUDE.md` to treat existing code as a hypothesis, mandating proactive removal of obsolete legacy constructs rather than preserving them out of inertia.
2026-05-26 10:54:59 +02:00
Eric Coissac 009a328c58 refactor: handle kmer deduplication and layer initialization concurrently
Introduce a secondary layer iteration to open `SrcLayerData` alongside the unitig reader for concurrent metadata access. This refactors the merge routine to handle kmer deduplication and per-layer data initialization simultaneously. Also corrects a typo in `rebuild_layer.rs` where `openopen_sequential` is renamed to `open_sequential`.
2026-05-26 10:52:08 +02:00
Eric Coissac 9d46400898 feat: support exact and approximate evidence in layer construction
Refactored `MphfLayer::build` to accept an `EvidenceKind` parameter, routing to exact (index-based, parallel MPHF, writes `evidence.bin`) or approximate (sequential mmap iterator, writes `fingerprint.bin`) pipelines. Introduced `CanonicalKmerIter` for memory-mapped, chunked k-mer iteration with O(1) resets via `Arc<Mmap>`. Updated layer and map APIs to forward evidence kind, added `push_layer` for count matrices, and adjusted tests and public exports accordingly.
2026-05-26 10:23:43 +02:00
Eric Coissac b2a52bfb37 perf: optimize chunk_start for single-block indexing
Bypasses bitwise shift and mask operations when `block_bits == 0`, directly indexing `self.block_offsets[i]` instead. This eliminates unnecessary arithmetic overhead for single-block cases while preserving the original block-based offset calculation for larger block sizes.
2026-05-23 13:34:05 +02:00
Eric Coissac bc51cd9861 feat: add configurable block sizes and in-place reindex command
Propagate configurable block size (`block_bits`) through index and layer construction to control unitig chunking and optimize memory/performance trade-offs. Introduce an in-place `reindex` command and library method to convert indices between exact and approximate evidence formats. Add validation to reject merging indexes with mismatched evidence types, and update parallel kmer counting to use `AtomicUsize` for thread-safe aggregation. Includes CLI argument parsing, metadata persistence, and updated tests.
2026-05-23 13:28:24 +02:00
Eric Coissac 876bc0127f feat: add approximate evidence matching and index estimation CLI
Introduces a new `estimate` CLI subcommand to calculate bloom filter size, evidence bits, and false-positive rates for approximate indexing. Updates the index building and querying pipelines to support both exact and approximate evidence types via a unified `EvidenceKind` abstraction. Refactors `MphfLayer` and partition index builders to route operations based on the selected evidence mode, and adds the required `obilayeredmap` dependency.
2026-05-23 13:16:49 +02:00
Eric Coissac 16a6b0d033 feat: add evidence metadata and configurable k-mer parameters
Introduces `EvidenceKind` and `LayerMeta` structs to manage per-layer evidence configuration and false-positive parameters. Adds JSON serialization for layer metadata persistence and updates `build_approx_evidence` to accept a `z` parameter for consecutive k-mer thresholds. Exposes these types publicly and documents a future `aggregate` command for merging index matrix columns.
2026-05-23 13:10:18 +02:00
Eric Coissac e1dab86daf feat: add approximate kmer fingerprinting with memory-mapped storage
Introduce a new `fingerprint` module that stores packed B-bit vectors via memory-mapped files. Expose the module publicly and add `build_approx_evidence` to `Layer` and `MphfLayer` for generating compact `fingerprint.bin` files. Implement `find_approx` for fast, probabilistic kmer lookups using configurable bit-widths. Update dependencies to `bitvec` v1 and add `cacheline-ef`, `epserde`, and `memmap2` to support the new storage and serialization logic.
2026-05-23 13:07:02 +02:00
Eric Coissac 24afd74e2f refactor: decouple unitig index generation and add exact evidence
Decouple index generation by introducing `build_unitig_idx()` for retroactive `.idx` creation and optional immediate writing on close. Add `open_sequential()` for index-less iteration while enforcing index requirements for random access. Refactor the MPHF layer to pre-generate the unitig index for parallel random access, integrate `rayon` for k-mer processing, and enforce mapping integrity via duplicate slot validation. Additionally, implement `build_exact_evidence()` to reconstruct evidence from existing artifacts, and update tests to leverage the new index generation and simplified k-mer iteration helpers.
2026-05-23 13:02:25 +02:00
Eric Coissac 8478072b78 feat: make index granularity configurable via block_bits
Replaces the hardcoded BLOCK_SIZE constant with a configurable block_bits parameter, enabling variable index granularity to balance index size and sequential scan cost. Both the reader and writer now store block_bits and a precomputed mask for branchless offset arithmetic, while the index file format is upgraded to UIX3 to persist the configuration. Comprehensive unit tests verify serialization, chunk offset indexing, random access consistency, and kmer count accuracy across various block sizes.
2026-05-23 12:57:56 +02:00