obikmer

Author	SHA1	Message	Date
Eric Coissac	c1d6f277ce	feat(select): add metrics reporting to selection methods Integrates an obisys::Reporter across indexing and command modules to capture execution metrics. Replaces discarded timer stops with explicit rep.push() calls, adds timing instrumentation for the pack stage, and prints collected reports after each selection branch.	2026-06-22 10:25:24 +02:00
Eric Coissac	c694e1f2b0	feat: add benchmark pipeline, expose APIs, and enforce strict paths Introduces a Make-based orchestration for simulating, indexing, merging, filtering, and verifying k-mer counts and presence. Exposes internal builder and iterator APIs publicly, enforces mandatory leading slashes for predicate patterns, registers the `obitaxonomy` crate, and updates tooling configurations alongside documentation.	2026-06-22 10:18:33 +02:00
Eric Coissac	cde6457eea	feat: add memory vectors, slice traits, and column extraction methods Introduce `MemoryBitVec` and `MemoryIntVec` for efficient in-memory storage with hybrid compression and overflow handling. Implement `BitSlice`, `BitSliceMut`, `IntSlice`, and `IntSliceMut` traits across persistent and memory-backed types to enable generic slice operations and bitwise/arithmetic overloads. Add `col_persist` and `col_as_memory` methods to `BitMatrix` and `IntMatrix` for efficient column extraction. Align with the new single-pass rebuild architecture by supporting fast kmer filtering and matrix rebuilding. Includes comprehensive tests and profiling instrumentation for the packing phase.	2026-06-17 15:03:18 +02:00
Eric Coissac	b6fcbc545f	refactor: replace rayon with NUMA-aware PartitionRunner Replaces `rayon` parallel iteration across index, rebuild, reindex, and select modules with a custom `PartitionRunner`. This introduces NUMA-aware task distribution with CPU pinning and round-robin scheduling, eliminating `Arc`, `Mutex`, and atomic synchronization primitives in favor of a flat, pre-spawned worker architecture. Error handling is simplified via `.map_err()` and the `?` operator, while progress bar updates are decoupled into dedicated callbacks.	2026-06-15 18:53:31 +02:00
Eric Coissac	4a64718fd1	perf: replace partition processing with adaptive NUMA worker pool Replaces the previous partition processing logic with an adaptive, NUMA-aware multi-threaded worker pool that dynamically scales active threads based on real-time CPU efficiency. Introduces pre-spawned, CPU-pinned threads managed via crossbeam channels and Rayon to optimize memory bandwidth and core utilization. Adds a `max_workers()` accessor to aggregate maximum worker capacity across NUMA nodes and updates diagnostics to report active versus maximum worker counts.	2026-06-15 11:40:14 +02:00
Eric Coissac	7a87e911b6	feat: introduce NUMA-aware PartitionRunner for adaptive parallelism Replace NUMA-naive Rayon loops and ad-hoc adaptive pools with a unified `PartitionRunner` that manages a NUMA-aware worker pool. The implementation uses pinned Rayon thread pools per node and activates dormant threads based on real-time CPU efficiency metrics. This standardizes partition-level parallelism, optimizes memory locality, and eliminates cross-socket traffic. Includes architecture documentation and updates mkdocs navigation.	2026-06-15 11:34:41 +02:00
Eric Coissac	ea767376bd	feat: implement NUMA-aware worker pools for merge command Replaces the global Rayon pool with per-NUMA-node thread pools that pin worker threads to their respective nodes, leveraging Linux first-touch allocation to reduce cross-NUMA memory contention and improve cache locality. Integrates the `hwlocality` crate with a vendored build, includes graceful fallbacks for single-socket or non-Linux systems, and updates dependency constraints. Also adds installation and architecture documentation, and corrects parallelism detection in the partitioner.	2026-06-14 23:56:52 +02:00
Eric Coissac	c4071eb450	refactor(merge): extract adaptive worker spawn logic Centralize inline spawn checks into a `should_spawn_worker` function with adaptive thresholds. The first worker spawns at <95% CPU efficiency, while subsequent workers only trigger if marginal efficiency gain exceeds 25% of the expected `1/n_workers` (minimum 3%). Also increases the spawn poll interval from 10s to 20s.	2026-06-13 14:56:01 +02:00
Eric Coissac	6d85387077	feat: add performance instrumentation and dynamic worker scaling This change enhances observability and adaptability in the merge pipeline. Performance timing and debug logging are added to the De Bruijn graph and partition merge layers to track phase durations and pipeline metrics. The merge module replaces blocking receives with timed polls to sample CPU efficiency, dynamically spawning workers when utilization drops below a threshold. A new script is also introduced to parse merge debug logs and generate structured Markdown reports detailing throughput, phase breakdowns, and partition performance.	2026-06-13 13:21:53 +02:00
Eric Coissac	fddf630772	style: apply consistent formatting and whitespace normalization Applies consistent formatting, whitespace normalization, and indentation standardization to `debruijn.rs` and `merge.rs`. Reorganizes imports and downgrades a unitig traversal log from `info!` to `debug!`. No functional logic or runtime behavior is altered.	2026-06-13 11:58:20 +02:00
Eric Coissac	bc14346f5f	feat: add CPU-aware parallel worker pool for partition merging Introduce CpuSample to measure process-level CPU efficiency and wall-clock time. Use crossbeam-channel to distribute partition merging tasks to a dynamic worker pool that scales based on CPU utilization, capped at half the available cores. Update diagnostics to track pool usage.	2026-06-13 11:58:20 +02:00
Eric Coissac	ba49af6f9e	refactor: parallelize merge and partition logic with obipipeline Introduce the `obipipeline` dependency and refactor merge and partition logic to leverage parallel execution. Update `merge_partitions` to use rayon with dynamic memory budgeting and concurrency control via a pilot run. Refactor Pass 1 to concurrently read unitigs, filter kmers through a shared `LayeredMap`, and populate the graph safely. Simplify diagnostics to report total kmer counts and replace manual flags with graph length validation.	2026-06-12 21:32:04 +02:00
Eric Coissac	2bc189e962	feat: dynamically compute seed expansion based on RSS Introduce a `peak_rss_bytes()` utility for accurate per-phase RAM measurement. Replace the genome-length heuristic with a dynamic seed expansion ratio based on actual RSS delta. Explicitly drop the `GraphDeBruijn` instance before MPHF construction to prevent resource contention and ensure proper memory management.	2026-06-12 16:39:38 +02:00
Eric Coissac	52fd2cf801	feat: enhance memory budgeting and add rebuild diagnostics This commit improves memory management by respecting Linux cgroup v1/v2 limits and introduces a configurable memory budget for the new `rebuild` subcommand to prevent OOM during index reconstruction. The rebuild process now supports filtering, compaction, and parallelization. Diagnostic capabilities are expanded with debug-level tracing for partition merges, k-mer expansion tracking, and utility flags for label renaming, matrix size breakdowns, per-genome counts, and partition distribution reporting. Accessor methods for active and remaining memory have also been added to the stats struct.	2026-06-12 15:20:38 +02:00
Eric Coissac	b5e027f23b	feat: add memory-aware parallel merge scheduling and CLI flags Introduces a memory-aware scheduling strategy for parallel partition merging that replaces unbounded concurrency with a First-Fit Decreasing approach gated by a thread-safe `MemoryBudget` semaphore. An adaptive expansion factor, seeded by a sequential pilot run, dynamically caps concurrent workers to prevent hashbrown OOMs. Adds a `--budget-fraction` CLI flag to configure RAM allocation, enhances the CLI to accept multiple indexes, and introduces comprehensive partition diagnostics including memory utilization tracking, concurrency metrics, and statistical summaries with ASCII histograms. Updates documentation and navigation accordingly.	2026-06-12 11:44:10 +02:00
Eric Coissac	f44fe042bc	feat: add parallel k-mer counting and stats CLI Introduces allocation-free `sum()` and `count_nonzero()` methods for compact integer vectors, extending the `ColumnWeights` trait with `partial_kmer_counts`. Adds parallel partition scanning to the k-mer index for computing per-genome distinct k-mer counts, and exposes a new `--stats` CLI flag to output these statistics as CSV.	2026-06-12 11:29:32 +02:00
Eric Coissac	e66adef23d	feat: add select command for genome column projection and aggregation Introduces the `select` CLI command to project and aggregate genome-level k-mer data by column. Adds `filter` as an alias for `rebuild`. The implementation uses parallel partition processing, supports metadata-driven grouping with configurable aggregation operators, and performs atomic in-place rewrites or filtered exports. Updates documentation and navigation accordingly.	2026-06-09 15:09:47 +02:00
Eric Coissac	db730e9cf6	refactor: optimize dump partition iteration and add progress tracking Refactor partition iteration to support a generic `on_partition` callback executed after each parallel partition completes. Split the logic into bounded and unbounded paths; the bounded path uses an `AtomicUsize` to enforce row limits, while the unbounded path eliminates atomic contention to improve throughput. Additionally, integrate a progress bar into the dump command by passing an increment callback to `idx.dump()`, ensuring proper initialization and cleanup.	2026-06-09 11:07:48 +02:00
Eric Coissac	2465cfbc4b	Parallelize partition iteration using Rayon Introduce thread-local `Vec<u8>` buffers to eliminate concurrent I/O contention. Replace the mutable row counter with an `AtomicUsize` and `fetch_update` to enable lock-free early termination when the limit is reached. Collected chunks are then written sequentially to preserve partition ordering.	2026-06-09 10:04:25 +02:00
Eric Coissac	d626d42ec7	feat: add --head and --presence-threshold to dump and distance Introduces `--head N` to the `dump` command for early iteration termination and `--presence-threshold N` to the `distance` command for Jaccard filtering on count indexes. Updates filter defaults to adapt based on explicit ingroup/outgroup declarations. Fixes a Rust type mismatch in the unitig closure and updates partition iteration callbacks to return `bool` for proper early termination support. Documentation is updated accordingly.	2026-06-09 10:04:25 +02:00
Eric Coissac	eb7805c747	feat: add configurable presence threshold to kmer distance Introduce a `--presence-threshold` CLI argument (default: 1) and update `KmerIndex::distance` to accept a `presence_threshold` parameter. This replaces hardcoded zero thresholds, enabling configurable filtering of low-abundance kmers during Jaccard distance calculations.	2026-06-08 20:14:33 +02:00
Eric Coissac	edd5e3f8ee	feat: add bits-per-kmer diagnostic and stats module Introduce a `stats` module to compute normalized storage efficiency metrics. The new `KmerIndex::bits_per_kmer()` method parallelizes disk I/O across partitions to aggregate file sizes for MPHF, evidence, and matrix components. Publicly export `IndexBitsPerKmer` and add a `--bits-per-kmer` CLI flag to trigger the diagnostic routine and print detailed statistics.	2026-06-04 23:17:17 +02:00
Eric Coissac	3e62ffe010	feat: add selective k-mer filtering to dump and rebuild commands Add the `obidebruinj` dependency and introduce `FilterArgs` CLI arguments for ingroup/outgroup predicates and count/fraction thresholds. Extend `GroupFilterParams` to support outgroup filtering, and integrate the filter collection into `KmerIndex::dump` and `rebuild` commands. This enables selective k-mer filtering during index operations and CSV exports.	2026-06-04 21:29:58 +02:00
Eric Coissac	a1499e6153	feat: add kmer filtering and refactor layer iteration Introduce a `passes_all` utility to validate kmer rows against multiple filters using short-circuit logic. Integrate a `filters` parameter into the iteration functions to conditionally emit kmers based on filter results. Extract repetitive layer traversal and filtering into an `iter_src_layers` helper, refactoring Pass 1 and Pass 2 to eliminate duplication. Additionally, add a debug conditional to the dump output to include partition and layer metadata alongside kmer sequences.	2026-06-04 21:08:15 +02:00
Eric Coissac	02cb30c0ef	feat: add obisys crate for standardized CLI progress reporting This commit introduces the `obisys` crate, which wraps `indicatif` to provide reusable `spinner` and `progress_bar` utilities with consistent styling and tick intervals. It refactors progress reporting across `obikindex`, `obikpartitionner`, and `obikmer` to use these shared functions, eliminating inline UI configuration and ensuring uniform terminal feedback.	2026-06-03 19:03:59 +02:00
Eric Coissac	4677d6f177	refactor: improve resource cleanup and index packing Explicitly close file handles and remove temporary artifacts after serialization to prevent disk space leaks. Additionally, compact internal matrix structures immediately upon loading the KmerIndex to improve memory efficiency and prepare for downstream operations.	2026-06-03 15:35:56 +02:00
Eric Coissac	bfe0cb4b82	feat: integrate obikseq to configure global k-mer and minimizer sizes This change adds the `obikseq` crate as a local dependency and inserts `set_k` and `set_m` calls across index creation and command modules. By synchronizing the runtime's global k-mer and minimizer dimensions with the loaded index parameters, downstream sequence processing and partitioning operations now consistently use the correct structural constraints.	2026-06-03 14:31:14 +02:00
Eric Coissac	173ac9fb42	feat: introduce packed matrix storage and layer metadata Unifies bit and integer matrix storage into `PersistentBitMatrix` and `PersistentCompactIntMatrix` enums, supporting both columnar and memory-mapped single-file layouts. Introduces `LayerMeta` to persist layer dimensions as `layer_meta.json`, enabling correct initialization of implicit presence matrices. Adds CLI commands (`pack` and `--upgrade-index`) to convert existing columnar indices to the compact format and backfill missing metadata. Updates partitionner and layered map logic to use the new persistent builders, optimized memory allocation, and auto-detected storage backends.	2026-06-03 14:16:04 +02:00
Eric Coissac	1661dd6b1c	feat: introduce preloaded index cache and thread-safe progress tracker Introduce `PreloadedIndex` to cache partition indices and eliminate redundant I/O during repeated queries. Refactor the query pipeline to route through this pre-loaded index, and expose it publicly in `obikpartitionner`. Additionally, add a thread-safe, lazily-initialized `MultiProgress` singleton for improved progress tracking.	2026-06-03 09:42:18 +02:00
Eric Coissac	2ebc5f0d75	chore: add logging infrastructure to merge routine Adds comprehensive logging for source metadata, merge modes, and forced approximation detection. Introduces `format_evidence` and `is_trivial` helpers to format `IndexMode` variants and identify single-genome presence indices. The core merge algorithm remains unmodified, with all changes focused on enhanced runtime observability.	2026-06-01 15:23:37 +02:00
Eric Coissac	add6d7f873	enforce uniform index mode and optimize base index selection Adds validation to ensure all input sources share the same `IndexMode`. Introduces base index selection logic that prioritizes approximate or hybrid evidence and maximizes base size to minimize newly indexed k-mers. Includes helper functions for triviality evaluation, cumulative size calculation, and mode consistency checks.	2026-06-01 14:43:51 +02:00
Eric Coissac	0350ca855b	refactor: streamline merge pipeline and MPHF indexing Replace mphf.find() with direct mphf.index() calls to eliminate absence checks and fallback vectors. Introduce a lightweight MphfOnly wrapper for faster index loading, and standardize k-mer iteration across merge and rebuild layers. Update IndexMeta configuration and n_new calculation to leverage MPHF cardinality, streamlining the overall merge pipeline.	2026-06-01 14:37:35 +02:00
Eric Coissac	dfa0b2bac2	feat: add utils subcommand for renaming genome labels Introduces a `utils` CLI subcommand to enable in-place genome label renaming without full reindexing. Adds strict label validation to reject empty strings, filesystem separators, and control characters, ensuring safe CSV serialization. Updates index metadata, renames corresponding spectrum JSON files, and registers the command in the main dispatch logic. CLI reference documentation is also updated.	2026-05-26 15:35:22 +02:00
Eric Coissac	98c14aade9	feat: centralize index configuration and add hybrid mode Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.	2026-05-26 15:08:29 +02:00
Eric Coissac	bc51cd9861	feat: add configurable block sizes and in-place reindex command Propagate configurable block size (`block_bits`) through index and layer construction to control unitig chunking and optimize memory/performance trade-offs. Introduce an in-place `reindex` command and library method to convert indices between exact and approximate evidence formats. Add validation to reject merging indexes with mismatched evidence types, and update parallel kmer counting to use `AtomicUsize` for thread-safe aggregation. Includes CLI argument parsing, metadata persistence, and updated tests.	2026-05-23 13:28:24 +02:00
Eric Coissac	876bc0127f	feat: add approximate evidence matching and index estimation CLI Introduces a new `estimate` CLI subcommand to calculate bloom filter size, evidence bits, and false-positive rates for approximate indexing. Updates the index building and querying pipelines to support both exact and approximate evidence types via a unified `EvidenceKind` abstraction. Refactors `MphfLayer` and partition index builders to route operations based on the selected evidence mode, and adds the required `obilayeredmap` dependency.	2026-05-23 13:16:49 +02:00
Eric Coissac	16a6b0d033	feat: add evidence metadata and configurable k-mer parameters Introduces `EvidenceKind` and `LayerMeta` structs to manage per-layer evidence configuration and false-positive parameters. Adds JSON serialization for layer metadata persistence and updates `build_approx_evidence` to accept a `z` parameter for consecutive k-mer thresholds. Exposes these types publicly and documents a future `aggregate` command for merging index matrix columns.	2026-05-23 13:10:18 +02:00
Eric Coissac	0f8f61d3dd	feat: introduce genome metadata tracking and CSV export This commit replaces raw string genome labels with a structured `GenomeInfo` type for better metadata tracking. It adds a `--meta` flag to the index command, and implements a new `annotate` CLI subcommand to import metadata from CSV files or export it via `--dump`. Distance and shared-count matrices are now serialized to CSV, with UPGMA clustering trees exported as Newick files. Query outputs now include per-genome k-mer match counts in JSON, while fixing syntax and variable naming issues in index merging and dump generation.	2026-05-22 09:35:20 +02:00
Eric Coissac	13599dd444	feat: Implement query subcommand for sequence-to-genome mapping This change introduces the `query` CLI command and its supporting infrastructure for sequence-to-genome mapping and k-mer matching. It adds a `QueryLayer` abstraction backed by MPHF and persistent matrices, exposes the index partition for direct querying, and implements `Hash`/`Eq` for `RoutableSuperKmer`. The command ingests sequence batches, deduplicates superkmers, routes them to index partitions for parallel exact or 1-mismatch matching, and outputs results as FASTA records annotated with JSON metadata. Includes `serde_json` dependency addition, module exports, and documentation updates.	2026-05-21 18:56:41 +02:00
Eric Coissac	d9aa211b8f	feat: add k-mer index rebuild and compaction feature This commit introduces a new `rebuild` CLI subcommand that reconstructs an existing multi-layer k-mer index into a compact, single-layer index. It implements a configurable filtering pipeline supporting min/max genome fraction/count and total count thresholds, parallel partition processing via `rayon`, and CLI progress tracking. The change also restructures module declarations across `obikindex` and `obikpartitionner` to integrate the new rebuild and layer-handling logic.	2026-05-21 15:08:19 +02:00
Eric Coissac	3fa1dbf8cc	feat: add pairwise distance computation and phylogenetic trees This commit introduces a new `distance` CLI subcommand that computes pairwise genomic distance matrices using configurable metrics (Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger). It optionally generates phylogenetic trees (NJ or UPGMA) in Newick format and outputs results as CSV. The implementation adds a robust distance computation backend that dynamically routes to optimized backends based on index configuration, supports parallel iteration, and gracefully handles missing data. Additionally, it adds a `dump` task for exporting k-mer to genome mappings as CSV, introduces an `InvalidInput` error variant, updates dependencies to support numerical operations and tree construction, and performs minor module reorganizations.	2026-05-21 15:03:08 +02:00
Eric Coissac	9e1d6f2f25	feat: implement partition-based merge command for k-mer indices Implements a new `merge` command that aggregates k-mer counts and presence/absence matrices from multiple source indices using a parallelized, partition-based algorithm. Adds CLI progress bars and execution timing across the bootstrap, spectrum rebuild, and merge phases. Updates logging to report the aggregate genome count and introduces a bounds check in the perfect hash layer to safely return `None` for unknown k-mers, preventing out-of-bounds access in downstream operations.	2026-05-21 14:55:38 +02:00
Eric Coissac	11182005a2	feat: enhance merge label resolution, debug dump, and layer metadata This commit enhances the CLI and index pipelines by introducing `--force-presence` to normalize output to binary values, `--debug` to expose partition and layer metadata, and `--rename-duplicates` to automatically disambiguate overlapping genome labels. It updates the partitioner and index layers to auto-discover layers, persist `meta.json` for single-genome builds, and fix per-source column offsets during merging. A `DuplicateGenomeLabel` error variant is also added, and stale directories are properly managed in presence/absence mode.	2026-05-21 14:52:59 +02:00
Eric Coissac	1a1f95e59d	feat: add CLI command to export indexed k-mers to CSV This change introduces a new `dump` subcommand that exports all indexed k-mers to a CSV stream. The implementation spans multiple crates, adding core export logic to `obikindex` and partition iteration to `obikpartitionner`. The command supports a `--force-presence` flag to output binary presence/absence data instead of stored counts, and includes necessary module registrations and structural updates across the codebase.	2026-05-21 13:48:07 +02:00
Eric Coissac	e1d59fde54	feat: add merge command to consolidate k-mer indexes Introduces a new `merge` CLI subcommand and underlying implementation to consolidate multiple pre-indexed k-mer indexes into a single output. Adds `append_column` methods to persistent bit and int matrices to enable incremental genome column expansion without rebuilding the MPHF. Includes new error variants for index readiness and configuration mismatches, adds a `--force` flag to the index command, and updates documentation and navigation structure accordingly.	2026-05-21 13:44:50 +02:00
Eric Coissac	7d1b62ddf3	refactor: replace single spectrum file with per-partition outputs Replace the single `kmer_spectrum_raw.json` output with per-partition JSON files in a `spectrums/` directory. Add a `keep_intermediate` parameter to control intermediate file cleanup, and introduce a `write_spectrum` helper for serialization. Update the completion sentinel to `count.done` and align state documentation accordingly.	2026-05-21 13:35:06 +02:00
Eric Coissac	c5bcb7b8fa	feat: introduce layered MPHF indexing and partition metadata Refactors obikindex and obikpartitionner to delegate index construction to a new layered MPHF implementation. Adds resume-safe building with abundance filtering and count persistence, while introducing a PartitionMeta struct for JSON configuration persistence. Updates OKIError to wrap layer-specific errors, replaces single-path extraction with full path collection and logging, and registers new internal dependencies across the workspace.	2026-05-21 13:31:37 +02:00
Eric Coissac	17c9e076bd	refactor: extract obikindex crate and remove deprecated CLI commands Extracted core indexing logic, state tracking, and metadata management into a new `obikindex` crate. Refactored the `index` and `unitig` commands to leverage the `KmerIndex` abstraction and state-driven pipeline transitions. Removed obsolete CLI subcommands (`count`, `fasta`, `longtig`, `partition`) and their associated pipeline steps. Updated FASTA writing utilities for single-line output and deterministic identifiers, and refreshed workspace dependencies.	2026-05-20 18:54:12 +02:00

48 Commits