Commit Graph

175 Commits

Author SHA1 Message Date
coissac b0dab452f6 Merge pull request 'refactor: optimize dump partition iteration and add progress tracking' (#20) from push-xqswlxlvmyrq into main
Reviewed-on: #20
2026-06-09 09:34:13 +00:00
Eric Coissac db730e9cf6 refactor: optimize dump partition iteration and add progress tracking
Refactor partition iteration to support a generic `on_partition` callback executed after each parallel partition completes. Split the logic into bounded and unbounded paths; the bounded path uses an `AtomicUsize` to enforce row limits, while the unbounded path eliminates atomic contention to improve throughput. Additionally, integrate a progress bar into the dump command by passing an increment callback to `idx.dump()`, ensuring proper initialization and cleanup.
2026-06-09 11:07:48 +02:00
coissac f65ecd19cc Merge pull request 'Push lrwmyplxxzkn' (#19) from push-lrwmyplxxzkn into main
Reviewed-on: #19
2026-06-09 08:28:20 +00:00
Eric Coissac 7dd8db1409 docs: document conservative rounding strategy for filtering thresholds
Specifies that minimum bounds use ceiling and maximum bounds use floor to enforce strictness. Clarifies that the implementation avoids explicit rounding by directly comparing integer counts against floating-point fractions, which is mathematically equivalent.
2026-06-09 10:26:21 +02:00
Eric Coissac ce45e2fbe1 refactor: centralize k-mer filtering logic and add validation
Refactor shared `FilterArgs` and `build_group_filter` to return a `Result` with explicit validation for fraction bounds, min/max ordering, and count constraints. Update conditional defaults for `--min-frac` and `--max-outgroup-count` to depend on explicit quorum flags, preventing silent configuration conflicts. Update documentation and MkDocs navigation to reflect the new centralized k-mer filtering system across `rebuild`, `dump`, and `unitig` commands.
2026-06-09 10:22:25 +02:00
Eric Coissac 2465cfbc4b Parallelize partition iteration using Rayon
Introduce thread-local `Vec<u8>` buffers to eliminate concurrent I/O contention. Replace the mutable row counter with an `AtomicUsize` and `fetch_update` to enable lock-free early termination when the limit is reached. Collected chunks are then written sequentially to preserve partition ordering.
2026-06-09 10:04:25 +02:00
Eric Coissac d626d42ec7 feat: add --head and --presence-threshold to dump and distance
Introduces `--head N` to the `dump` command for early iteration termination and `--presence-threshold N` to the `distance` command for Jaccard filtering on count indexes. Updates filter defaults to adapt based on explicit ingroup/outgroup declarations. Fixes a Rust type mismatch in the unitig closure and updates partition iteration callbacks to return `bool` for proper early termination support. Documentation is updated accordingly.
2026-06-09 10:04:25 +02:00
coissac 650eea43b6 Merge pull request 'Push quqlpklvxsqx' (#18) from push-quqlpklvxsqx into main
Reviewed-on: #18
2026-06-08 18:15:01 +00:00
Eric Coissac eb7805c747 feat: add configurable presence threshold to kmer distance
Introduce a `--presence-threshold` CLI argument (default: 1) and update `KmerIndex::distance` to accept a `presence_threshold` parameter. This replaces hardcoded zero thresholds, enabling configurable filtering of low-abundance kmers during Jaccard distance calculations.
2026-06-08 20:14:33 +02:00
Eric Coissac 1ec65922df feat: implement parallel pairwise distance matrices
Introduces parallelized pairwise distance matrix computation for Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger metrics across `Columnar`, `Packed`, and `Implicit` matrix variants. Adds trait methods and convenience wrappers, safely handles normalization and zero-denominator edge cases, and updates test suites to import required traits for validation.
2026-06-08 20:08:09 +02:00
Eric Coissac 09d9e21744 feat: integrate tracing and enhance bit matrix operations
Add the `tracing` crate to `obidebruinj`, `obisys`, and resolve it in `Cargo.lock`. Replace `eprintln!` statements with structured `debug!` and `info!` macros. Introduce a `TracedBar` wrapper for progress bars and enhance the `Stage` lifecycle to emit structured events for timing, memory metrics, and swap warnings. Add a progress spinner for unitig degree computation. Extend `PersistentBitMatrix` with columnar bit-vector operations and parallel distance methods, enabling uniform distance computations across all storage layouts while replacing previous panics with dimension-based fallbacks.
2026-06-08 19:55:06 +02:00
coissac 3f47e22083 Merge pull request 'Push pvqkqxlkkwry' (#17) from push-pvqkqxlkkwry into main
Reviewed-on: #17
2026-06-06 04:44:10 +00:00
Eric Coissac 03c7bb0b99 Relax unitig assertion in debruijn test
Replace the strict `unitigs.len() == 1` assertion with a non-empty check to allow multiple unitigs. Update the test comment to describe the general non-repetitive sequence recovery principle instead of a specific example. The core k-mer roundtrip validation logic remains unchanged.
2026-06-06 06:41:45 +02:00
Eric Coissac b39eee688a refactor(debruijn): unify graph traversal with WalkState iterator
Replaces deeply nested branching with early returns and `then_some`. Introduces a cycle-detecting `find_chain_start` method and updates `UnitigNucIter` to use step-based iteration with atomic node claiming. This eliminates nested iterators and redundant state management, improving code readability and maintainability.
2026-06-06 06:38:28 +02:00
Eric Coissac 95b3461405 refactor: centralize graph traversal logic in walk
Refactor `leavable` and `reachable` to eliminate duplicated graph traversal logic by mutually delegating via `WalkState`. `leavable` now returns `self.walk(graph).is_some()`, while `reachable` delegates to the inverted `direct` state's `leavable` check. This centralizes kmer extension and visited-state validation in `walk`, simplifying control flow and reducing code duplication.
2026-06-06 06:36:48 +02:00
Eric Coissac 952a21eef8 refactor: remove naked_asm and extract collect_unitigs helper
Remove the `std::arch::naked_asm` import as it is no longer required. Introduce a `collect_unitigs` helper to abstract nucleotide sequence extraction from `GraphDeBruijn`, and refactor the test suite to use it, eliminating repetitive collection code and standardizing graph iteration logic.
2026-06-06 04:33:59 +02:00
Eric Coissac 5c2f48535f refactor: rename compute_degrees and mark start nodes
Renames `compute_degrees` to `compute_degrees_and_mark_starts` across the De Bruijn graph and partitioner layers to consolidate degree calculation and start-node flagging. Introduces safe neighbor iteration methods and a debug validation block to verify graph consistency. Refactors unitig extraction to use sequential execution with a `Mutex` for safe error propagation. Fixes malformed and duplicated method calls, adds auto-generation of missing `meta.json` files, and ensures persistent matrix builders are explicitly closed to finalize metadata.
2026-06-05 19:48:59 +02:00
Eric Coissac 27088ab810 refactor: optimize unitig iteration and graph traversal
Switches unitig processing to a lazy, fallible `try_for_each_unitig` API across partitioner layers, reducing intermediate allocations and enabling proper error propagation. Refactors de Bruijn graph traversal into a two-pass algorithm with explicit node flags, named constants, and diagnostic logging. Introduces parallel chain processing and staged performance profiling for the unitig command, and adds a memory-efficient `FromIterator` implementation for packed nucleotide sequences.
2026-06-05 19:48:59 +02:00
coissac ea2c594c86 Merge pull request 'Push ruqusmkoyvwm' (#16) from push-ruqusmkoyvwm into main
Reviewed-on: #16
2026-06-05 08:41:08 +00:00
Eric Coissac d202ead385 feat: parallelize unitig extraction and FASTA output
Replace the non-atomic `set_visited` with atomic `fetch_or` bitmask operations to enable thread-safe node claiming. Introduce a two-phase extraction pipeline where `par_for_each_chain_unitig` builds chains in parallel and `for_each_remaining_unitig` sequentially handles residual cycles and junctions. Add `is_start` and `collect_from_start` to explicitly define unitig boundaries. Wrap `BufWriter` in a `Mutex` and use an `AtomicUsize` counter to ensure thread-safe concurrent FASTA output, refactoring the write logic into a shared closure for safe multi-threaded execution.
2026-06-05 10:33:52 +02:00
Eric Coissac 249998beed perf: add structured performance profiling for unitig stages
Wraps graph construction, degree computation, and unitig enumeration phases with `Stage` start/stop calls. Intervals are recorded in a `Reporter` instance and printed upon completion to provide granular timing metrics for each computational stage.
2026-06-05 10:28:45 +02:00
Eric Coissac 2f29ee2240 feat: Add parallel execution and thread-safe graph operations
Integrate rayon to enable parallel processing of k-mer partitions and degree computation. Replace Cell with AtomicU8 to ensure thread-safe node state management, and add a merge method for combining disjoint graphs. Additionally, introduce progress tracking utilities and a test-utils feature flag for development dependencies.
2026-06-04 23:22:55 +02:00
Eric Coissac edd5e3f8ee feat: add bits-per-kmer diagnostic and stats module
Introduce a `stats` module to compute normalized storage efficiency metrics. The new `KmerIndex::bits_per_kmer()` method parallelizes disk I/O across partitions to aggregate file sizes for MPHF, evidence, and matrix components. Publicly export `IndexBitsPerKmer` and add a `--bits-per-kmer` CLI flag to trigger the diagnostic routine and print detailed statistics.
2026-06-04 23:17:17 +02:00
Eric Coissac bb7adc1154 docs: expand kmer indexing, filtering, and merging documentation
Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
2026-06-04 22:59:41 +02:00
Eric Coissac 9306ec1c56 perf: Replace manual window tracking with monotonic deque algorithm
Eliminates intermediate allocations by computing per-genome window minimums (`win_min`) directly. Unifies the `z ≤ 1` and `z > 1` branches into a single buffer-reused accumulation loop, efficiently validating k-mer presence.
2026-06-04 21:37:09 +02:00
Eric Coissac 712a03a3a6 refactor: replace unitig extraction with de Bruijn graph pipeline
This change replaces direct partition-based extraction with a pipeline that reconstructs a de Bruijn graph from filtered k-mers. It introduces `FilterArgs` for k-mer selection, collects filtered k-mers in parallel into a `GraphDeBruijn`, computes node degrees, and enumerates unitigs from the graph for output instead of reading pre-computed partition files.
2026-06-04 21:32:49 +02:00
Eric Coissac 3e62ffe010 feat: add selective k-mer filtering to dump and rebuild commands
Add the `obidebruinj` dependency and introduce `FilterArgs` CLI arguments for ingroup/outgroup predicates and count/fraction thresholds. Extend `GroupFilterParams` to support outgroup filtering, and integrate the filter collection into `KmerIndex::dump` and `rebuild` commands. This enables selective k-mer filtering during index operations and CSV exports.
2026-06-04 21:29:58 +02:00
Eric Coissac a1499e6153 feat: add kmer filtering and refactor layer iteration
Introduce a `passes_all` utility to validate kmer rows against multiple filters using short-circuit logic. Integrate a `filters` parameter into the iteration functions to conditionally emit kmers based on filter results. Extract repetitive layer traversal and filtering into an `iter_src_layers` helper, refactoring Pass 1 and Pass 2 to eliminate duplication. Additionally, add a debug conditional to the dump output to include partition and layer metadata alongside kmer sequences.
2026-06-04 21:08:15 +02:00
Eric Coissac 476c7a6394 feat: add metadata-driven k-mer filtering for rebuild command
Introduces a metadata-driven filtering system for the rebuild command, classifying genomes into ingroup and outgroup categories using exact, inequality, and hierarchical path predicates. Implements a GroupQuorumFilter to enforce configurable presence thresholds and fraction constraints per group. Refactors the command to replace global quorum filters with this unified approach, converts the presence flag to a threshold parameter, and adds corresponding documentation and MkDocs navigation.
2026-06-04 21:01:58 +02:00
coissac edc18b4908 Merge pull request 'Push rrwpnquuzsvr' (#15) from push-rrwpnquuzsvr into main
Reviewed-on: #15
2026-06-03 17:04:25 +00:00
Eric Coissac 02cb30c0ef feat: add obisys crate for standardized CLI progress reporting
This commit introduces the `obisys` crate, which wraps `indicatif` to provide reusable `spinner` and `progress_bar` utilities with consistent styling and tick intervals. It refactors progress reporting across `obikindex`, `obikpartitionner`, and `obikmer` to use these shared functions, eliminating inline UI configuration and ensuring uniform terminal feedback.
2026-06-03 19:03:59 +02:00
Eric Coissac 4677d6f177 refactor: improve resource cleanup and index packing
Explicitly close file handles and remove temporary artifacts after serialization to prevent disk space leaks. Additionally, compact internal matrix structures immediately upon loading the KmerIndex to improve memory efficiency and prepare for downstream operations.
2026-06-03 15:35:56 +02:00
coissac 7a29ca6305 Merge pull request 'Push ywwwypqxrtmy' (#14) from push-ywwwypqxrtmy into main
Reviewed-on: #14
2026-06-03 13:18:41 +00:00
Eric Coissac bba5147f0f fix: account for k-mer overlap in total_bases calculation
Introduces a `kmer_overlap` variable (`k-1`) and modifies the `total_bases` accumulation to subtract this overlap from each sequence's length. This ensures the base count accurately reflects only valid k-mer starting positions rather than raw sequence length.
2026-06-03 15:11:48 +02:00
Eric Coissac bfe0cb4b82 feat: integrate obikseq to configure global k-mer and minimizer sizes
This change adds the `obikseq` crate as a local dependency and inserts `set_k` and `set_m` calls across index creation and command modules. By synchronizing the runtime's global k-mer and minimizer dimensions with the loaded index parameters, downstream sequence processing and partitioning operations now consistently use the correct structural constraints.
2026-06-03 14:31:14 +02:00
Eric Coissac 173ac9fb42 feat: introduce packed matrix storage and layer metadata
Unifies bit and integer matrix storage into `PersistentBitMatrix` and `PersistentCompactIntMatrix` enums, supporting both columnar and memory-mapped single-file layouts. Introduces `LayerMeta` to persist layer dimensions as `layer_meta.json`, enabling correct initialization of implicit presence matrices. Adds CLI commands (`pack` and `--upgrade-index`) to convert existing columnar indices to the compact format and backfill missing metadata. Updates partitionner and layered map logic to use the new persistent builders, optimized memory allocation, and auto-detected storage backends.
2026-06-03 14:16:04 +02:00
Eric Coissac de1a41810a perf: enable zero-allocation queries and memory-mapped indexes
Introduce zero-allocation row extraction and query result buffers across `obicompactvec` and `obikpartitionner` to eliminate per-kmer heap allocations. Replace in-memory MPHF deserialization with memory-mapped, zero-copy views to reduce runtime memory footprint. Add configurable I/O chunking, a RAM-aware `--chunk-size` parameter, and system memory monitoring via the new `sysinfo` dependency. Re-export `PreloadedIndex` for external consumers.
2026-06-03 10:24:12 +02:00
Eric Coissac 1661dd6b1c feat: introduce preloaded index cache and thread-safe progress tracker
Introduce `PreloadedIndex` to cache partition indices and eliminate redundant I/O during repeated queries. Refactor the query pipeline to route through this pre-loaded index, and expose it publicly in `obikpartitionner`. Additionally, add a thread-safe, lazily-initialized `MultiProgress` singleton for improved progress tracking.
2026-06-03 09:42:18 +02:00
Eric Coissac 2ebc5f0d75 chore: add logging infrastructure to merge routine
Adds comprehensive logging for source metadata, merge modes, and forced approximation detection. Introduces `format_evidence` and `is_trivial` helpers to format `IndexMode` variants and identify single-genome presence indices. The core merge algorithm remains unmodified, with all changes focused on enhanced runtime observability.
2026-06-01 15:23:37 +02:00
coissac 4fd0eb989f Merge pull request 'Push pnxswqpxlyso' (#13) from push-pnxswqpxlyso into main
Reviewed-on: #13
2026-06-01 12:45:46 +00:00
Eric Coissac add6d7f873 enforce uniform index mode and optimize base index selection
Adds validation to ensure all input sources share the same `IndexMode`. Introduces base index selection logic that prioritizes approximate or hybrid evidence and maximizes base size to minimize newly indexed k-mers. Includes helper functions for triviality evaluation, cumulative size calculation, and mode consistency checks.
2026-06-01 14:43:51 +02:00
Eric Coissac 0350ca855b refactor: streamline merge pipeline and MPHF indexing
Replace mphf.find() with direct mphf.index() calls to eliminate absence checks and fallback vectors. Introduce a lightweight MphfOnly wrapper for faster index loading, and standardize k-mer iteration across merge and rebuild layers. Update IndexMeta configuration and n_new calculation to leverage MPHF cardinality, streamlining the overall merge pipeline.
2026-06-01 14:37:35 +02:00
Eric Coissac 1e2115a1b0 docs: Update README to reflect new indexing workflow
Replace the documented direct combined indexing command with a two-step workflow. This involves building separate exact indexes per genome, merging them into a single multi-genome index via `obikmer merge`, and keeping the approximate index conversion unchanged.
2026-06-01 13:56:48 +02:00
Eric Coissac 657f964dda refactor: abstract k-mer types and fix bit alignment
Abstracts k-mer storage using a `RawKmer` alias and `KMER_BITS` constant to simplify bit manipulation and enable future extension to larger types. Updates bit-shifting and masking logic across `kmer.rs` and `packed_seq.rs` to prevent overflow and improve type safety. Adapts the MPHF layer to iterate over indexed canonical k-mers with explicit slot bounds validation and bit-level encoding. Fixes test suite compilation errors by correcting method names, adding tuple destructuring, and passing the required `IndexMode::Exact` parameter.
2026-05-31 20:46:49 +02:00
Eric Coissac 8b57c91ab7 feat: add iter_canonical_kmers iterator and tests
Implements `iter_canonical_kmers` in `unitig_index` to yield canonical kmers by traversing unitig chunks. Includes unit tests to verify that the iterator exclusively yields canonical kmers and matches `iter_kmers` in count and canonical equivalence.
2026-05-30 16:30:02 +02:00
coissac 728476a0a6 Merge pull request 'Push rxrxptuxltlp' (#12) from push-rxrxptuxltlp into main
Reviewed-on: #12
2026-05-30 14:03:13 +00:00
Eric Coissac 8a0b898b4b docs: clarify query pipeline, Findere trick, and input formats
Fix a stray prefix in the README heading and update documentation to reflect the query pipeline's operation on `s-mers` (`s = k - z + 1`) with post-partition z-window filtering. Clarify the Findere trick, including k-mer size reduction, consecutive match requirements, and false positive rate calculations. Additionally, expand input format documentation to cover supported file extensions, gzip compression, recursive directory handling, and `query` command specifications.
2026-05-30 15:59:12 +02:00
Eric Coissac 708b0abf9b refactor: aggregate query results at sequence level
Refactor the query pipeline to buffer partition outputs into a per-sequence `seq_results` vector, deferring final accumulation until all partitions complete. This ensures global position ordering before computing k-mer presence, counts, and coverage statistics. Additionally, removes a resolved TODO and documents a known BLAST false-positive issue where chloroplast and bacterial contaminants yield unrealistic high-confidence matches.
2026-05-30 07:18:54 +02:00
coissac 3138f6382c Merge pull request 'Push nvyqwlpspwvl' (#11) from push-nvyqwlpspwvl into main
Reviewed-on: #11
2026-05-29 07:21:58 +00:00
Eric Coissac 86b88acb95 feat: implement approximate k-mer indexing and optimize query
Enable approximate k-mer indexing via the `--approx` flag, computing an effective k-mer size of `k - z + 1` and configuring the appropriate indexing mode with validated probabilistic parameters. Refactor the Findere z-window filter in the query command to improve performance and correctness by replacing the precomputed vector with a lazy closure, optimizing cache locality, and fixing a variable naming bug.
2026-05-29 09:20:44 +02:00