Commit Graph

192 Commits

Author SHA1 Message Date
Eric Coissac 1f336fe496 refactor: replace mutex with channels for parallel debruijn processing
Add `rayon` and `crossbeam-channel` dependencies to support concurrent execution. Replace the synchronous, mutex-protected closure pattern with a channel-based producer-consumer approach using `std::thread::scope`. This decouples unitig iteration from processing, eliminating lock contention and `Mutex` overhead while enabling parallel workloads.
2026-06-13 11:49:27 +02:00
Eric Coissac 5f98d2ef96 refactor: replace explicit collect with Unitig::from_nucleotides
Introduce a thread-local buffer to materialize nucleotide iterators into contiguous slices. Update `try_for_each_unitig` across the debruijn, index, merge, and rebuild layers to directly instantiate `Unitig` via `from_nucleotides()` instead of explicitly collecting iterators. This eliminates intermediate allocations and aligns test code with the new approach.
2026-06-13 11:47:06 +02:00
Eric Coissac 8b563d0804 refactor: migrate pipeline stages and improve graph processing
Refactored neighbor resolution to explicitly track unvisited indices for degree-1 nodes, updated display formatting, and added timing and debug logging to the degree computation routine. Migrated pipeline stages from eager vector returns to explicit flat implementations, enabling backpressure-aware streaming, configurable batch processing, incremental yielding, and progress tracking via a delta channel.
2026-06-13 11:44:17 +02:00
coissac 7208dcbb4a Merge pull request 'refactor: defer SrcLayerData lookups in RawBatch' (#25) from push-nxrynoorswrw into main
Reviewed-on: #25
2026-06-12 20:19:21 +00:00
Eric Coissac 2e69b0b7fe refactor: defer SrcLayerData lookups in RawBatch
Replace eager resolution of `Vec<u32>` values with an `Arc<SrcLayerData>` handle passed alongside `Vec<CanonicalKmer>`. This shifts the lookup logic to the subsequent transform step, reducing memory overhead and enabling shared, thread-safe access to the source layer data.
2026-06-12 22:18:57 +02:00
coissac 9ea1dff5d6 Merge pull request 'Push rwqsmuvystym' (#24) from push-rwqsmuvystym into main
Reviewed-on: #24
2026-06-12 19:33:20 +00:00
Eric Coissac b2c8373586 refactor: parallelize merge layer with WorkerPool pipeline
Replaces the synchronous sequential loop with a multi-threaded pipeline using `WorkerPool` and custom stage macros. Shared mutable state is wrapped in `Arc<Mutex<>>` for thread-safe updates, while pipeline errors are centralized via `Arc<Mutex<Option<String>>>` to propagate failures before thread join.
2026-06-12 21:32:53 +02:00
Eric Coissac ba49af6f9e refactor: parallelize merge and partition logic with obipipeline
Introduce the `obipipeline` dependency and refactor merge and partition logic to leverage parallel execution. Update `merge_partitions` to use rayon with dynamic memory budgeting and concurrency control via a pilot run. Refactor Pass 1 to concurrently read unitigs, filter kmers through a shared `LayeredMap`, and populate the graph safely. Simplify diagnostics to report total kmer counts and replace manual flags with graph length validation.
2026-06-12 21:32:04 +02:00
Eric Coissac 2bc189e962 feat: dynamically compute seed expansion based on RSS
Introduce a `peak_rss_bytes()` utility for accurate per-phase RAM measurement. Replace the genome-length heuristic with a dynamic seed expansion ratio based on actual RSS delta. Explicitly drop the `GraphDeBruijn` instance before MPHF construction to prevent resource contention and ensure proper memory management.
2026-06-12 16:39:38 +02:00
coissac db9c604199 Merge pull request 'feat: enhance memory budgeting and add rebuild diagnostics' (#23) from push-nptzpkomspkv into main
Reviewed-on: #23
2026-06-12 13:21:12 +00:00
Eric Coissac 52fd2cf801 feat: enhance memory budgeting and add rebuild diagnostics
This commit improves memory management by respecting Linux cgroup v1/v2 limits and introduces a configurable memory budget for the new `rebuild` subcommand to prevent OOM during index reconstruction. The rebuild process now supports filtering, compaction, and parallelization. Diagnostic capabilities are expanded with debug-level tracing for partition merges, k-mer expansion tracking, and utility flags for label renaming, matrix size breakdowns, per-genome counts, and partition distribution reporting. Accessor methods for active and remaining memory have also been added to the stats struct.
2026-06-12 15:20:38 +02:00
coissac 97e3fb9761 Merge pull request 'Push ylnwstyzqwrt' (#22) from push-ylnwstyzqwrt into main
Reviewed-on: #22
2026-06-12 10:10:03 +00:00
Eric Coissac b5e027f23b feat: add memory-aware parallel merge scheduling and CLI flags
Introduces a memory-aware scheduling strategy for parallel partition merging that replaces unbounded concurrency with a First-Fit Decreasing approach gated by a thread-safe `MemoryBudget` semaphore. An adaptive expansion factor, seeded by a sequential pilot run, dynamically caps concurrent workers to prevent hashbrown OOMs. Adds a `--budget-fraction` CLI flag to configure RAM allocation, enhances the CLI to accept multiple indexes, and introduces comprehensive partition diagnostics including memory utilization tracking, concurrency metrics, and statistical summaries with ASCII histograms. Updates documentation and navigation accordingly.
2026-06-12 11:44:10 +02:00
Eric Coissac f44fe042bc feat: add parallel k-mer counting and stats CLI
Introduces allocation-free `sum()` and `count_nonzero()` methods for compact integer vectors, extending the `ColumnWeights` trait with `partial_kmer_counts`. Adds parallel partition scanning to the k-mer index for computing per-genome distinct k-mer counts, and exposes a new `--stats` CLI flag to output these statistics as CSV.
2026-06-12 11:29:32 +02:00
coissac 94e0a370b3 Merge pull request 'Push tmpsxsztwpxl' (#21) from push-tmpsxsztwpxl into main
Reviewed-on: #21
2026-06-09 13:31:25 +00:00
Eric Coissac 970460be42 refactor: rename rebuild subcommand to filter
Rename the `rebuild` CLI subcommand to `filter` to better reflect its primary purpose of row-level selection and k-mer filtering. Update all associated CLI arguments, logging, error messages, and module registrations accordingly. Introduce a dedicated `Rebuild` subcommand for index compaction, fully decoupling it from the filtering logic. Also refine related documentation to align with the new naming and semantics.
2026-06-09 15:26:37 +02:00
Eric Coissac e66adef23d feat: add select command for genome column projection and aggregation
Introduces the `select` CLI command to project and aggregate genome-level k-mer data by column. Adds `filter` as an alias for `rebuild`. The implementation uses parallel partition processing, supports metadata-driven grouping with configurable aggregation operators, and performs atomic in-place rewrites or filtered exports. Updates documentation and navigation accordingly.
2026-06-09 15:09:47 +02:00
coissac b0dab452f6 Merge pull request 'refactor: optimize dump partition iteration and add progress tracking' (#20) from push-xqswlxlvmyrq into main
Reviewed-on: #20
2026-06-09 09:34:13 +00:00
Eric Coissac db730e9cf6 refactor: optimize dump partition iteration and add progress tracking
Refactor partition iteration to support a generic `on_partition` callback executed after each parallel partition completes. Split the logic into bounded and unbounded paths; the bounded path uses an `AtomicUsize` to enforce row limits, while the unbounded path eliminates atomic contention to improve throughput. Additionally, integrate a progress bar into the dump command by passing an increment callback to `idx.dump()`, ensuring proper initialization and cleanup.
2026-06-09 11:07:48 +02:00
coissac f65ecd19cc Merge pull request 'Push lrwmyplxxzkn' (#19) from push-lrwmyplxxzkn into main
Reviewed-on: #19
2026-06-09 08:28:20 +00:00
Eric Coissac 7dd8db1409 docs: document conservative rounding strategy for filtering thresholds
Specifies that minimum bounds use ceiling and maximum bounds use floor to enforce strictness. Clarifies that the implementation avoids explicit rounding by directly comparing integer counts against floating-point fractions, which is mathematically equivalent.
2026-06-09 10:26:21 +02:00
Eric Coissac ce45e2fbe1 refactor: centralize k-mer filtering logic and add validation
Refactor shared `FilterArgs` and `build_group_filter` to return a `Result` with explicit validation for fraction bounds, min/max ordering, and count constraints. Update conditional defaults for `--min-frac` and `--max-outgroup-count` to depend on explicit quorum flags, preventing silent configuration conflicts. Update documentation and MkDocs navigation to reflect the new centralized k-mer filtering system across `rebuild`, `dump`, and `unitig` commands.
2026-06-09 10:22:25 +02:00
Eric Coissac 2465cfbc4b Parallelize partition iteration using Rayon
Introduce thread-local `Vec<u8>` buffers to eliminate concurrent I/O contention. Replace the mutable row counter with an `AtomicUsize` and `fetch_update` to enable lock-free early termination when the limit is reached. Collected chunks are then written sequentially to preserve partition ordering.
2026-06-09 10:04:25 +02:00
Eric Coissac d626d42ec7 feat: add --head and --presence-threshold to dump and distance
Introduces `--head N` to the `dump` command for early iteration termination and `--presence-threshold N` to the `distance` command for Jaccard filtering on count indexes. Updates filter defaults to adapt based on explicit ingroup/outgroup declarations. Fixes a Rust type mismatch in the unitig closure and updates partition iteration callbacks to return `bool` for proper early termination support. Documentation is updated accordingly.
2026-06-09 10:04:25 +02:00
coissac 650eea43b6 Merge pull request 'Push quqlpklvxsqx' (#18) from push-quqlpklvxsqx into main
Reviewed-on: #18
2026-06-08 18:15:01 +00:00
Eric Coissac eb7805c747 feat: add configurable presence threshold to kmer distance
Introduce a `--presence-threshold` CLI argument (default: 1) and update `KmerIndex::distance` to accept a `presence_threshold` parameter. This replaces hardcoded zero thresholds, enabling configurable filtering of low-abundance kmers during Jaccard distance calculations.
2026-06-08 20:14:33 +02:00
Eric Coissac 1ec65922df feat: implement parallel pairwise distance matrices
Introduces parallelized pairwise distance matrix computation for Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger metrics across `Columnar`, `Packed`, and `Implicit` matrix variants. Adds trait methods and convenience wrappers, safely handles normalization and zero-denominator edge cases, and updates test suites to import required traits for validation.
2026-06-08 20:08:09 +02:00
Eric Coissac 09d9e21744 feat: integrate tracing and enhance bit matrix operations
Add the `tracing` crate to `obidebruinj`, `obisys`, and resolve it in `Cargo.lock`. Replace `eprintln!` statements with structured `debug!` and `info!` macros. Introduce a `TracedBar` wrapper for progress bars and enhance the `Stage` lifecycle to emit structured events for timing, memory metrics, and swap warnings. Add a progress spinner for unitig degree computation. Extend `PersistentBitMatrix` with columnar bit-vector operations and parallel distance methods, enabling uniform distance computations across all storage layouts while replacing previous panics with dimension-based fallbacks.
2026-06-08 19:55:06 +02:00
coissac 3f47e22083 Merge pull request 'Push pvqkqxlkkwry' (#17) from push-pvqkqxlkkwry into main
Reviewed-on: #17
2026-06-06 04:44:10 +00:00
Eric Coissac 03c7bb0b99 Relax unitig assertion in debruijn test
Replace the strict `unitigs.len() == 1` assertion with a non-empty check to allow multiple unitigs. Update the test comment to describe the general non-repetitive sequence recovery principle instead of a specific example. The core k-mer roundtrip validation logic remains unchanged.
2026-06-06 06:41:45 +02:00
Eric Coissac b39eee688a refactor(debruijn): unify graph traversal with WalkState iterator
Replaces deeply nested branching with early returns and `then_some`. Introduces a cycle-detecting `find_chain_start` method and updates `UnitigNucIter` to use step-based iteration with atomic node claiming. This eliminates nested iterators and redundant state management, improving code readability and maintainability.
2026-06-06 06:38:28 +02:00
Eric Coissac 95b3461405 refactor: centralize graph traversal logic in walk
Refactor `leavable` and `reachable` to eliminate duplicated graph traversal logic by mutually delegating via `WalkState`. `leavable` now returns `self.walk(graph).is_some()`, while `reachable` delegates to the inverted `direct` state's `leavable` check. This centralizes kmer extension and visited-state validation in `walk`, simplifying control flow and reducing code duplication.
2026-06-06 06:36:48 +02:00
Eric Coissac 952a21eef8 refactor: remove naked_asm and extract collect_unitigs helper
Remove the `std::arch::naked_asm` import as it is no longer required. Introduce a `collect_unitigs` helper to abstract nucleotide sequence extraction from `GraphDeBruijn`, and refactor the test suite to use it, eliminating repetitive collection code and standardizing graph iteration logic.
2026-06-06 04:33:59 +02:00
Eric Coissac 5c2f48535f refactor: rename compute_degrees and mark start nodes
Renames `compute_degrees` to `compute_degrees_and_mark_starts` across the De Bruijn graph and partitioner layers to consolidate degree calculation and start-node flagging. Introduces safe neighbor iteration methods and a debug validation block to verify graph consistency. Refactors unitig extraction to use sequential execution with a `Mutex` for safe error propagation. Fixes malformed and duplicated method calls, adds auto-generation of missing `meta.json` files, and ensures persistent matrix builders are explicitly closed to finalize metadata.
2026-06-05 19:48:59 +02:00
Eric Coissac 27088ab810 refactor: optimize unitig iteration and graph traversal
Switches unitig processing to a lazy, fallible `try_for_each_unitig` API across partitioner layers, reducing intermediate allocations and enabling proper error propagation. Refactors de Bruijn graph traversal into a two-pass algorithm with explicit node flags, named constants, and diagnostic logging. Introduces parallel chain processing and staged performance profiling for the unitig command, and adds a memory-efficient `FromIterator` implementation for packed nucleotide sequences.
2026-06-05 19:48:59 +02:00
coissac ea2c594c86 Merge pull request 'Push ruqusmkoyvwm' (#16) from push-ruqusmkoyvwm into main
Reviewed-on: #16
2026-06-05 08:41:08 +00:00
Eric Coissac d202ead385 feat: parallelize unitig extraction and FASTA output
Replace the non-atomic `set_visited` with atomic `fetch_or` bitmask operations to enable thread-safe node claiming. Introduce a two-phase extraction pipeline where `par_for_each_chain_unitig` builds chains in parallel and `for_each_remaining_unitig` sequentially handles residual cycles and junctions. Add `is_start` and `collect_from_start` to explicitly define unitig boundaries. Wrap `BufWriter` in a `Mutex` and use an `AtomicUsize` counter to ensure thread-safe concurrent FASTA output, refactoring the write logic into a shared closure for safe multi-threaded execution.
2026-06-05 10:33:52 +02:00
Eric Coissac 249998beed perf: add structured performance profiling for unitig stages
Wraps graph construction, degree computation, and unitig enumeration phases with `Stage` start/stop calls. Intervals are recorded in a `Reporter` instance and printed upon completion to provide granular timing metrics for each computational stage.
2026-06-05 10:28:45 +02:00
Eric Coissac 2f29ee2240 feat: Add parallel execution and thread-safe graph operations
Integrate rayon to enable parallel processing of k-mer partitions and degree computation. Replace Cell with AtomicU8 to ensure thread-safe node state management, and add a merge method for combining disjoint graphs. Additionally, introduce progress tracking utilities and a test-utils feature flag for development dependencies.
2026-06-04 23:22:55 +02:00
Eric Coissac edd5e3f8ee feat: add bits-per-kmer diagnostic and stats module
Introduce a `stats` module to compute normalized storage efficiency metrics. The new `KmerIndex::bits_per_kmer()` method parallelizes disk I/O across partitions to aggregate file sizes for MPHF, evidence, and matrix components. Publicly export `IndexBitsPerKmer` and add a `--bits-per-kmer` CLI flag to trigger the diagnostic routine and print detailed statistics.
2026-06-04 23:17:17 +02:00
Eric Coissac bb7adc1154 docs: expand kmer indexing, filtering, and merging documentation
Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
2026-06-04 22:59:41 +02:00
Eric Coissac 9306ec1c56 perf: Replace manual window tracking with monotonic deque algorithm
Eliminates intermediate allocations by computing per-genome window minimums (`win_min`) directly. Unifies the `z ≤ 1` and `z > 1` branches into a single buffer-reused accumulation loop, efficiently validating k-mer presence.
2026-06-04 21:37:09 +02:00
Eric Coissac 712a03a3a6 refactor: replace unitig extraction with de Bruijn graph pipeline
This change replaces direct partition-based extraction with a pipeline that reconstructs a de Bruijn graph from filtered k-mers. It introduces `FilterArgs` for k-mer selection, collects filtered k-mers in parallel into a `GraphDeBruijn`, computes node degrees, and enumerates unitigs from the graph for output instead of reading pre-computed partition files.
2026-06-04 21:32:49 +02:00
Eric Coissac 3e62ffe010 feat: add selective k-mer filtering to dump and rebuild commands
Add the `obidebruinj` dependency and introduce `FilterArgs` CLI arguments for ingroup/outgroup predicates and count/fraction thresholds. Extend `GroupFilterParams` to support outgroup filtering, and integrate the filter collection into `KmerIndex::dump` and `rebuild` commands. This enables selective k-mer filtering during index operations and CSV exports.
2026-06-04 21:29:58 +02:00
Eric Coissac a1499e6153 feat: add kmer filtering and refactor layer iteration
Introduce a `passes_all` utility to validate kmer rows against multiple filters using short-circuit logic. Integrate a `filters` parameter into the iteration functions to conditionally emit kmers based on filter results. Extract repetitive layer traversal and filtering into an `iter_src_layers` helper, refactoring Pass 1 and Pass 2 to eliminate duplication. Additionally, add a debug conditional to the dump output to include partition and layer metadata alongside kmer sequences.
2026-06-04 21:08:15 +02:00
Eric Coissac 476c7a6394 feat: add metadata-driven k-mer filtering for rebuild command
Introduces a metadata-driven filtering system for the rebuild command, classifying genomes into ingroup and outgroup categories using exact, inequality, and hierarchical path predicates. Implements a GroupQuorumFilter to enforce configurable presence thresholds and fraction constraints per group. Refactors the command to replace global quorum filters with this unified approach, converts the presence flag to a threshold parameter, and adds corresponding documentation and MkDocs navigation.
2026-06-04 21:01:58 +02:00
coissac edc18b4908 Merge pull request 'Push rrwpnquuzsvr' (#15) from push-rrwpnquuzsvr into main
Reviewed-on: #15
2026-06-03 17:04:25 +00:00
Eric Coissac 02cb30c0ef feat: add obisys crate for standardized CLI progress reporting
This commit introduces the `obisys` crate, which wraps `indicatif` to provide reusable `spinner` and `progress_bar` utilities with consistent styling and tick intervals. It refactors progress reporting across `obikindex`, `obikpartitionner`, and `obikmer` to use these shared functions, eliminating inline UI configuration and ensuring uniform terminal feedback.
2026-06-03 19:03:59 +02:00
Eric Coissac 4677d6f177 refactor: improve resource cleanup and index packing
Explicitly close file handles and remove temporary artifacts after serialization to prevent disk space leaks. Additionally, compact internal matrix structures immediately upon loading the KmerIndex to improve memory efficiency and prepare for downstream operations.
2026-06-03 15:35:56 +02:00
coissac 7a29ca6305 Merge pull request 'Push ywwwypqxrtmy' (#14) from push-ywwwypqxrtmy into main
Reviewed-on: #14
2026-06-03 13:18:41 +00:00