Commit Graph

22 Commits

Author SHA1 Message Date
Eric Coissac c694e1f2b0 feat: add benchmark pipeline, expose APIs, and enforce strict paths
Introduces a Make-based orchestration for simulating, indexing, merging, filtering, and verifying k-mer counts and presence. Exposes internal builder and iterator APIs publicly, enforces mandatory leading slashes for predicate patterns, registers the `obitaxonomy` crate, and updates tooling configurations alongside documentation.
2026-06-22 10:18:33 +02:00
Eric Coissac 4a64718fd1 perf: replace partition processing with adaptive NUMA worker pool
Replaces the previous partition processing logic with an adaptive, NUMA-aware multi-threaded worker pool that dynamically scales active threads based on real-time CPU efficiency. Introduces pre-spawned, CPU-pinned threads managed via crossbeam channels and Rayon to optimize memory bandwidth and core utilization. Adds a `max_workers()` accessor to aggregate maximum worker capacity across NUMA nodes and updates diagnostics to report active versus maximum worker counts.
2026-06-15 11:40:14 +02:00
Eric Coissac ea767376bd feat: implement NUMA-aware worker pools for merge command
Replaces the global Rayon pool with per-NUMA-node thread pools that pin worker threads to their respective nodes, leveraging Linux first-touch allocation to reduce cross-NUMA memory contention and improve cache locality. Integrates the `hwlocality` crate with a vendored build, includes graceful fallbacks for single-socket or non-Linux systems, and updates dependency constraints. Also adds installation and architecture documentation, and corrects parallelism detection in the partitioner.
2026-06-14 23:56:52 +02:00
Eric Coissac c4071eb450 refactor(merge): extract adaptive worker spawn logic
Centralize inline spawn checks into a `should_spawn_worker` function with adaptive thresholds. The first worker spawns at <95% CPU efficiency, while subsequent workers only trigger if marginal efficiency gain exceeds 25% of the expected `1/n_workers` (minimum 3%). Also increases the spawn poll interval from 10s to 20s.
2026-06-13 14:56:01 +02:00
Eric Coissac 6d85387077 feat: add performance instrumentation and dynamic worker scaling
This change enhances observability and adaptability in the merge pipeline. Performance timing and debug logging are added to the De Bruijn graph and partition merge layers to track phase durations and pipeline metrics. The merge module replaces blocking receives with timed polls to sample CPU efficiency, dynamically spawning workers when utilization drops below a threshold. A new script is also introduced to parse merge debug logs and generate structured Markdown reports detailing throughput, phase breakdowns, and partition performance.
2026-06-13 13:21:53 +02:00
Eric Coissac fddf630772 style: apply consistent formatting and whitespace normalization
Applies consistent formatting, whitespace normalization, and indentation standardization to `debruijn.rs` and `merge.rs`. Reorganizes imports and downgrades a unitig traversal log from `info!` to `debug!`. No functional logic or runtime behavior is altered.
2026-06-13 11:58:20 +02:00
Eric Coissac bc14346f5f feat: add CPU-aware parallel worker pool for partition merging
Introduce CpuSample to measure process-level CPU efficiency and wall-clock time. Use crossbeam-channel to distribute partition merging tasks to a dynamic worker pool that scales based on CPU utilization, capped at half the available cores. Update diagnostics to track pool usage.
2026-06-13 11:58:20 +02:00
Eric Coissac ba49af6f9e refactor: parallelize merge and partition logic with obipipeline
Introduce the `obipipeline` dependency and refactor merge and partition logic to leverage parallel execution. Update `merge_partitions` to use rayon with dynamic memory budgeting and concurrency control via a pilot run. Refactor Pass 1 to concurrently read unitigs, filter kmers through a shared `LayeredMap`, and populate the graph safely. Simplify diagnostics to report total kmer counts and replace manual flags with graph length validation.
2026-06-12 21:32:04 +02:00
Eric Coissac 2bc189e962 feat: dynamically compute seed expansion based on RSS
Introduce a `peak_rss_bytes()` utility for accurate per-phase RAM measurement. Replace the genome-length heuristic with a dynamic seed expansion ratio based on actual RSS delta. Explicitly drop the `GraphDeBruijn` instance before MPHF construction to prevent resource contention and ensure proper memory management.
2026-06-12 16:39:38 +02:00
Eric Coissac 52fd2cf801 feat: enhance memory budgeting and add rebuild diagnostics
This commit improves memory management by respecting Linux cgroup v1/v2 limits and introduces a configurable memory budget for the new `rebuild` subcommand to prevent OOM during index reconstruction. The rebuild process now supports filtering, compaction, and parallelization. Diagnostic capabilities are expanded with debug-level tracing for partition merges, k-mer expansion tracking, and utility flags for label renaming, matrix size breakdowns, per-genome counts, and partition distribution reporting. Accessor methods for active and remaining memory have also been added to the stats struct.
2026-06-12 15:20:38 +02:00
Eric Coissac b5e027f23b feat: add memory-aware parallel merge scheduling and CLI flags
Introduces a memory-aware scheduling strategy for parallel partition merging that replaces unbounded concurrency with a First-Fit Decreasing approach gated by a thread-safe `MemoryBudget` semaphore. An adaptive expansion factor, seeded by a sequential pilot run, dynamically caps concurrent workers to prevent hashbrown OOMs. Adds a `--budget-fraction` CLI flag to configure RAM allocation, enhances the CLI to accept multiple indexes, and introduces comprehensive partition diagnostics including memory utilization tracking, concurrency metrics, and statistical summaries with ASCII histograms. Updates documentation and navigation accordingly.
2026-06-12 11:44:10 +02:00
Eric Coissac 02cb30c0ef feat: add obisys crate for standardized CLI progress reporting
This commit introduces the `obisys` crate, which wraps `indicatif` to provide reusable `spinner` and `progress_bar` utilities with consistent styling and tick intervals. It refactors progress reporting across `obikindex`, `obikpartitionner`, and `obikmer` to use these shared functions, eliminating inline UI configuration and ensuring uniform terminal feedback.
2026-06-03 19:03:59 +02:00
Eric Coissac 173ac9fb42 feat: introduce packed matrix storage and layer metadata
Unifies bit and integer matrix storage into `PersistentBitMatrix` and `PersistentCompactIntMatrix` enums, supporting both columnar and memory-mapped single-file layouts. Introduces `LayerMeta` to persist layer dimensions as `layer_meta.json`, enabling correct initialization of implicit presence matrices. Adds CLI commands (`pack` and `--upgrade-index`) to convert existing columnar indices to the compact format and backfill missing metadata. Updates partitionner and layered map logic to use the new persistent builders, optimized memory allocation, and auto-detected storage backends.
2026-06-03 14:16:04 +02:00
Eric Coissac 2ebc5f0d75 chore: add logging infrastructure to merge routine
Adds comprehensive logging for source metadata, merge modes, and forced approximation detection. Introduces `format_evidence` and `is_trivial` helpers to format `IndexMode` variants and identify single-genome presence indices. The core merge algorithm remains unmodified, with all changes focused on enhanced runtime observability.
2026-06-01 15:23:37 +02:00
Eric Coissac add6d7f873 enforce uniform index mode and optimize base index selection
Adds validation to ensure all input sources share the same `IndexMode`. Introduces base index selection logic that prioritizes approximate or hybrid evidence and maximizes base size to minimize newly indexed k-mers. Includes helper functions for triviality evaluation, cumulative size calculation, and mode consistency checks.
2026-06-01 14:43:51 +02:00
Eric Coissac 0350ca855b refactor: streamline merge pipeline and MPHF indexing
Replace mphf.find() with direct mphf.index() calls to eliminate absence checks and fallback vectors. Introduce a lightweight MphfOnly wrapper for faster index loading, and standardize k-mer iteration across merge and rebuild layers. Update IndexMeta configuration and n_new calculation to leverage MPHF cardinality, streamlining the overall merge pipeline.
2026-06-01 14:37:35 +02:00
Eric Coissac 98c14aade9 feat: centralize index configuration and add hybrid mode
Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.
2026-05-26 15:08:29 +02:00
Eric Coissac bc51cd9861 feat: add configurable block sizes and in-place reindex command
Propagate configurable block size (`block_bits`) through index and layer construction to control unitig chunking and optimize memory/performance trade-offs. Introduce an in-place `reindex` command and library method to convert indices between exact and approximate evidence formats. Add validation to reject merging indexes with mismatched evidence types, and update parallel kmer counting to use `AtomicUsize` for thread-safe aggregation. Includes CLI argument parsing, metadata persistence, and updated tests.
2026-05-23 13:28:24 +02:00
Eric Coissac 0f8f61d3dd feat: introduce genome metadata tracking and CSV export
This commit replaces raw string genome labels with a structured `GenomeInfo` type for better metadata tracking. It adds a `--meta` flag to the index command, and implements a new `annotate` CLI subcommand to import metadata from CSV files or export it via `--dump`. Distance and shared-count matrices are now serialized to CSV, with UPGMA clustering trees exported as Newick files. Query outputs now include per-genome k-mer match counts in JSON, while fixing syntax and variable naming issues in index merging and dump generation.
2026-05-22 09:35:20 +02:00
Eric Coissac 9e1d6f2f25 feat: implement partition-based merge command for k-mer indices
Implements a new `merge` command that aggregates k-mer counts and presence/absence matrices from multiple source indices using a parallelized, partition-based algorithm. Adds CLI progress bars and execution timing across the bootstrap, spectrum rebuild, and merge phases. Updates logging to report the aggregate genome count and introduces a bounds check in the perfect hash layer to safely return `None` for unknown k-mers, preventing out-of-bounds access in downstream operations.
2026-05-21 14:55:38 +02:00
Eric Coissac 11182005a2 feat: enhance merge label resolution, debug dump, and layer metadata
This commit enhances the CLI and index pipelines by introducing `--force-presence` to normalize output to binary values, `--debug` to expose partition and layer metadata, and `--rename-duplicates` to automatically disambiguate overlapping genome labels. It updates the partitioner and index layers to auto-discover layers, persist `meta.json` for single-genome builds, and fix per-source column offsets during merging. A `DuplicateGenomeLabel` error variant is also added, and stale directories are properly managed in presence/absence mode.
2026-05-21 14:52:59 +02:00
Eric Coissac e1d59fde54 feat: add merge command to consolidate k-mer indexes
Introduces a new `merge` CLI subcommand and underlying implementation to consolidate multiple pre-indexed k-mer indexes into a single output. Adds `append_column` methods to persistent bit and int matrices to enable incremental genome column expansion without rebuilding the MPHF. Includes new error variants for index readiness and configuration mismatches, adds a `--force` flag to the index command, and updates documentation and navigation structure accordingly.
2026-05-21 13:44:50 +02:00