Commit Graph

110 Commits

Author SHA1 Message Date
Eric Coissac 9d46400898 feat: support exact and approximate evidence in layer construction
Refactored `MphfLayer::build` to accept an `EvidenceKind` parameter, routing to exact (index-based, parallel MPHF, writes `evidence.bin`) or approximate (sequential mmap iterator, writes `fingerprint.bin`) pipelines. Introduced `CanonicalKmerIter` for memory-mapped, chunked k-mer iteration with O(1) resets via `Arc<Mmap>`. Updated layer and map APIs to forward evidence kind, added `push_layer` for count matrices, and adjusted tests and public exports accordingly.
2026-05-26 10:23:43 +02:00
Eric Coissac 036d044291 refactor: update core types and add approximate evidence support
Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
2026-05-26 10:04:25 +02:00
coissac 88365e444c Merge pull request 'Push kztouvrzoqym' (#8) from push-kztouvrzoqym into main
Reviewed-on: #8
2026-05-23 12:04:50 +00:00
Eric Coissac da56c3e290 docs: update architecture and storage specs for approximate index
Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
2026-05-23 13:54:31 +02:00
Eric Coissac b7db3a33ed docs: add coverage reference files and flag architectural drift
These files catalog test coverage for Rust modules across architecture, implementation, and theory sections. They track recent structural changes, flag areas prone to documentation drift, and mandate verification of key parameters and routing logic to maintain alignment with the active codebase.
2026-05-23 13:44:23 +02:00
Eric Coissac b2a52bfb37 perf: optimize chunk_start for single-block indexing
Bypasses bitwise shift and mask operations when `block_bits == 0`, directly indexing `self.block_offsets[i]` instead. This eliminates unnecessary arithmetic overhead for single-block cases while preserving the original block-based offset calculation for larger block sizes.
2026-05-23 13:34:05 +02:00
Eric Coissac bc51cd9861 feat: add configurable block sizes and in-place reindex command
Propagate configurable block size (`block_bits`) through index and layer construction to control unitig chunking and optimize memory/performance trade-offs. Introduce an in-place `reindex` command and library method to convert indices between exact and approximate evidence formats. Add validation to reject merging indexes with mismatched evidence types, and update parallel kmer counting to use `AtomicUsize` for thread-safe aggregation. Includes CLI argument parsing, metadata persistence, and updated tests.
2026-05-23 13:28:24 +02:00
Eric Coissac 876bc0127f feat: add approximate evidence matching and index estimation CLI
Introduces a new `estimate` CLI subcommand to calculate bloom filter size, evidence bits, and false-positive rates for approximate indexing. Updates the index building and querying pipelines to support both exact and approximate evidence types via a unified `EvidenceKind` abstraction. Refactors `MphfLayer` and partition index builders to route operations based on the selected evidence mode, and adds the required `obilayeredmap` dependency.
2026-05-23 13:16:49 +02:00
Eric Coissac 16a6b0d033 feat: add evidence metadata and configurable k-mer parameters
Introduces `EvidenceKind` and `LayerMeta` structs to manage per-layer evidence configuration and false-positive parameters. Adds JSON serialization for layer metadata persistence and updates `build_approx_evidence` to accept a `z` parameter for consecutive k-mer thresholds. Exposes these types publicly and documents a future `aggregate` command for merging index matrix columns.
2026-05-23 13:10:18 +02:00
Eric Coissac e1dab86daf feat: add approximate kmer fingerprinting with memory-mapped storage
Introduce a new `fingerprint` module that stores packed B-bit vectors via memory-mapped files. Expose the module publicly and add `build_approx_evidence` to `Layer` and `MphfLayer` for generating compact `fingerprint.bin` files. Implement `find_approx` for fast, probabilistic kmer lookups using configurable bit-widths. Update dependencies to `bitvec` v1 and add `cacheline-ef`, `epserde`, and `memmap2` to support the new storage and serialization logic.
2026-05-23 13:07:02 +02:00
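The packed B-bit storage named in e1dab86daf can be illustrated with a minimal sketch. It assumes an in-memory `Vec<u64>` in place of the memory-mapped `fingerprint.bin`; the type `PackedFingerprints` and its methods are hypothetical and do not reflect the actual `fingerprint` module API. How a fingerprint is derived from a k-mer is not stated in the commit and is left out.

```rust
/// Hypothetical in-memory stand-in for a packed B-bit fingerprint vector.
/// The real module is described as memory-mapped; this sketch uses a Vec<u64>.
pub struct PackedFingerprints {
    bits: u32,       // width B of each fingerprint, 1..=32 in this sketch
    len: usize,      // number of slots
    words: Vec<u64>, // slots packed back to back, least significant bit first
}

impl PackedFingerprints {
    pub fn new(bits: u32, len: usize) -> Self {
        assert!((1..=32).contains(&bits), "sketch supports 1..=32 bits");
        let total_bits = bits as usize * len;
        Self { bits, len, words: vec![0; (total_bits + 63) / 64] }
    }

    fn mask(&self) -> u64 {
        (1u64 << self.bits) - 1
    }

    /// Store the low `bits` bits of `value` in `slot`.
    pub fn set(&mut self, slot: usize, value: u64) {
        assert!(slot < self.len);
        let v = value & self.mask();
        let bit = slot * self.bits as usize;
        let (w, off) = (bit / 64, (bit % 64) as u32);
        self.words[w] &= !(self.mask() << off);
        self.words[w] |= v << off;
        if off + self.bits > 64 {
            // The value straddles two words: write the high part to the next word.
            let low_part = 64 - off;
            self.words[w + 1] &= !(self.mask() >> low_part);
            self.words[w + 1] |= v >> low_part;
        }
    }

    /// Read back the fingerprint stored in `slot`.
    pub fn get(&self, slot: usize) -> u64 {
        assert!(slot < self.len);
        let bit = slot * self.bits as usize;
        let (w, off) = (bit / 64, (bit % 64) as u32);
        let mut v = self.words[w] >> off;
        if off + self.bits > 64 {
            v |= self.words[w + 1] << (64 - off);
        }
        v & self.mask()
    }

    /// Probabilistic membership test: true if the stored fingerprint equals `fp`.
    pub fn matches(&self, slot: usize, fp: u64) -> bool {
        self.get(slot) == (fp & self.mask())
    }
}
```

With B bits per slot, a k-mer absent from the index collides with a stored fingerprint with probability about 2^-B for a uniform fingerprint function.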
Eric Coissac 24afd74e2f refactor: decouple unitig index generation and add exact evidence
Decouple index generation by introducing `build_unitig_idx()` for retroactive `.idx` creation and optional immediate writing on close. Add `open_sequential()` for index-less iteration while enforcing index requirements for random access. Refactor the MPHF layer to pre-generate the unitig index for parallel random access, integrate `rayon` for k-mer processing, and enforce mapping integrity via duplicate slot validation. Additionally, implement `build_exact_evidence()` to reconstruct evidence from existing artifacts, and update tests to leverage the new index generation and simplified k-mer iteration helpers.
2026-05-23 13:02:25 +02:00
Eric Coissac 8478072b78 feat: make index granularity configurable via block_bits
Replaces the hardcoded BLOCK_SIZE constant with a configurable block_bits parameter, enabling variable index granularity to balance index size and sequential scan cost. Both the reader and writer now store block_bits and a precomputed mask for branchless offset arithmetic, while the index file format is upgraded to UIX3 to persist the configuration. Comprehensive unit tests verify serialization, chunk offset indexing, random access consistency, and kmer count accuracy across various block sizes.
2026-05-23 12:57:56 +02:00
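The shift-and-mask arithmetic behind `block_bits` in 8478072b78 can be sketched as follows. Only `block_bits`, the precomputed mask, and `block_offsets` come from the commit messages; the `BlockIndex` type and the `locate` method are hypothetical, and the sketch assumes the reader reaches a chunk by scanning forward from the start of its block.

```rust
/// Hypothetical sparse offset index: one stored offset per block of
/// 2^block_bits chunks.
pub struct BlockIndex {
    block_bits: u32,
    mask: usize,             // (1 << block_bits) - 1, precomputed once
    block_offsets: Vec<u64>, // byte offset of the first chunk of each block
}

impl BlockIndex {
    pub fn new(block_bits: u32, block_offsets: Vec<u64>) -> Self {
        assert!(block_bits < usize::BITS);
        Self {
            block_bits,
            mask: (1usize << block_bits) - 1,
            block_offsets,
        }
    }

    /// Returns the offset of the block containing `chunk` and how many
    /// chunks must be skipped sequentially from there to reach it.
    pub fn locate(&self, chunk: usize) -> (u64, usize) {
        let block = chunk >> self.block_bits;
        let within = chunk & self.mask;
        (self.block_offsets[block], within)
    }
}
```

With `block_bits == 0` the mask is zero, so every chunk has its own offset and no sequential skip is needed. Larger values shrink the index at the cost of a longer scan inside each block.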
Eric Coissac 4a5ab0b8c2 feat: optimize unitig index and document evidence elimination
Replace the dense per-chunk offset index with a sparse block-sampled structure (64 chunks per block), reducing the index file size by approximately 300× while preserving O(1) k-mer extraction. Introduce a design document for eliminating the `evidence.bin` file, which accounts for ~66% of the lookup layer, by transitioning to fingerprint-based approximate indexing and value-based MPHF lookups. Update MkDocs navigation to include the new documentation and add a file count tracker to the scatter step progress bar for improved observability.
2026-05-23 12:53:42 +02:00
coissac 9b700ff4a4 Merge pull request 'feat: Implement RAII-based file handle throttling' (#7) from push-qtnvlqlooklx into main
Reviewed-on: #7
2026-05-22 12:03:40 +00:00
Eric Coissac ca71e100ef feat: Implement RAII-based file handle throttling
Introduces a thread-safe, RAII-based throttling mechanism (`throttle_paths`, `FileSlots`, `SlotsGuard`) to enforce a new `max_open_files` configuration limit. This replaces direct file opening in the scatter and superkmer pipelines with a concurrency semaphore that automatically releases handles upon completion, preventing resource exhaustion and deadlocks during concurrent I/O.
2026-05-22 14:03:24 +02:00
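A minimal sketch of RAII-based slot throttling in the spirit of ca71e100ef, using only the standard library. The names `FileSlots` and `SlotsGuard` come from the commit message, but every signature below is an assumption and is not the crate's actual API.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Counting semaphore over the number of files that may be open at once.
pub struct FileSlots {
    free: Mutex<usize>,
    released: Condvar,
}

/// Holding a guard means holding one slot; dropping it gives the slot back.
pub struct SlotsGuard {
    slots: Arc<FileSlots>,
}

impl FileSlots {
    pub fn new(max_open_files: usize) -> Arc<Self> {
        assert!(max_open_files > 0);
        Arc::new(Self {
            free: Mutex::new(max_open_files),
            released: Condvar::new(),
        })
    }

    /// Blocks until a slot is free, then reserves it.
    pub fn acquire(slots: &Arc<FileSlots>) -> SlotsGuard {
        let mut free = slots.free.lock().unwrap();
        while *free == 0 {
            free = slots.released.wait(free).unwrap();
        }
        *free -= 1;
        SlotsGuard { slots: Arc::clone(slots) }
    }

    /// Number of slots currently free.
    pub fn available(&self) -> usize {
        *self.free.lock().unwrap()
    }
}

impl Drop for SlotsGuard {
    fn drop(&mut self) {
        // Runs on normal scope exit, early return, and unwinding alike.
        *self.slots.free.lock().unwrap() += 1;
        self.slots.released.notify_one();
    }
}
```

Tying the release to `Drop` means a worker that returns early or panics still frees its slot, so the limit cannot leak away over a long run.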
coissac 6c1a8da2d1 Merge pull request 'feat: limit concurrent open files during scatter' (#6) from push-rkytvkympxrn into main
Reviewed-on: #6
2026-05-22 09:33:39 +00:00
Eric Coissac 85e1901898 feat: limit concurrent open files during scatter
Introduces a `max_open_files` CLI argument (default: 20) to cap concurrently open input files during scatter operations. The scatter phase now parallelizes sequence file partitioning across worker threads while enforcing a configurable concurrency limit using a custom semaphore and `GuardedIter` wrapper. This ensures bounded resource usage and prevents file handle exhaustion during index construction.
2026-05-22 11:28:44 +02:00
coissac 1ba6690256 Merge pull request 'refactor(scatter): move logging call into pipeline closure' (#5) from push-txyqunyttswl into main
Reviewed-on: #5
2026-05-22 09:15:05 +00:00
Eric Coissac 9b37e848d4 refactor(scatter): move logging call into pipeline closure
Add an explicit `PathBuf` type annotation and consolidate the "indexing" info log with the chunk-opening logic. This reduces pipeline boilerplate by keeping the logging directly in the initial stage closure.
2026-05-22 11:14:52 +02:00
coissac df28cadc41 Merge pull request 'feat: add input file logging and optimize path traversal' (#4) from push-zoyvrpponqqo into main
Reviewed-on: #4
2026-05-22 09:04:42 +00:00
Eric Coissac fe2127c463 feat: add input file logging and optimize path traversal
Instrument index and scatter stages with `tracing::info` to log input file paths for better runtime observability. Additionally, optimize the path iterator by replacing redundant `is_dir()` checks with explicit `is_file()` validation and deferring metadata resolution, eliminating unnecessary `stat()` syscalls and improving traversal performance on high-latency network filesystems like Lustre and NFS.
2026-05-22 11:04:04 +02:00
coissac fe0832190b Merge pull request 'feat(obiread): add static bzip2 and lzma compression support' (#3) from push-qpvrrxpnqlkw into main
Reviewed-on: #3
2026-05-22 08:29:42 +00:00
Eric Coissac 72d054c06b feat(obiread): add static bzip2 and lzma compression support
Explicitly add `bzip2-sys` and `liblzma-sys` with the `static` feature to the `obiread` crate. This enforces static linking for BZ2 and LZMA/XZ backends, eliminating runtime dynamic library dependencies and ensuring consistent binary distribution.
2026-05-22 10:29:22 +02:00
coissac 3d58a32613 Merge pull request 'feat: introduce genome metadata tracking and CSV export' (#2) from push-zrlrptrsrlkk into main
Reviewed-on: #2
2026-05-22 07:36:11 +00:00
Eric Coissac 0f8f61d3dd feat: introduce genome metadata tracking and CSV export
This commit replaces raw string genome labels with a structured `GenomeInfo` type for better metadata tracking. It adds a `--meta` flag to the index command, and implements a new `annotate` CLI subcommand to import metadata from CSV files or export it via `--dump`. Distance and shared-count matrices are now serialized to CSV, with UPGMA clustering trees exported as Newick files. Query outputs now include per-genome k-mer match counts in JSON. The commit also fixes syntax and variable naming issues in index merging and dump generation.
2026-05-22 09:35:20 +02:00
coissac 77a0186fae Merge pull request 'Push qkpyqurltlpk' (#1) from push-qkpyqurltlpk into main
Reviewed-on: #1
2026-05-21 16:57:19 +00:00
Eric Coissac 13599dd444 feat: Implement query subcommand for sequence-to-genome mapping
This change introduces the `query` CLI command and its supporting infrastructure for sequence-to-genome mapping and k-mer matching. It adds a `QueryLayer` abstraction backed by MPHF and persistent matrices, exposes the index partition for direct querying, and implements `Hash`/`Eq` for `RoutableSuperKmer`. The command ingests sequence batches, deduplicates superkmers, routes them to index partitions for parallel exact or 1-mismatch matching, and outputs results as FASTA records annotated with JSON metadata. Includes `serde_json` dependency addition, module exports, and documentation updates.
2026-05-21 18:56:41 +02:00
Eric Coissac c8e591fc78 feat: add superkmer CLI setup and partition bit handling
This commit introduces CLI argument parsing for the `superkmer` command via a new `SuperkmerArgs` struct. It also adds a `partitions_to_bits` utility to compute the minimum bit width for partition encoding, enforcing a 1-bit floor. Finally, the index configuration automatically rounds the partition count up to the nearest power of two to ensure compatibility with bitmask-based indexing operations.
2026-05-21 15:10:48 +02:00
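The bit-width rule described in c8e591fc78 can be sketched as below. The name `partitions_to_bits` comes from the commit message; its signature here and the helper `round_partitions` are assumptions.

```rust
/// Minimum number of bits needed to encode `partitions` distinct partition
/// ids, with a floor of 1 bit (signature assumed).
pub fn partitions_to_bits(partitions: usize) -> u32 {
    // ceil(log2(n)) computed from the leading zeros of n - 1.
    let bits = usize::BITS - partitions.saturating_sub(1).leading_zeros();
    bits.max(1)
}

/// Hypothetical helper: round the partition count up to the next power of
/// two so that a partition id can be extracted with a single bitmask.
pub fn round_partitions(partitions: usize) -> usize {
    partitions.max(1).next_power_of_two()
}
```

For example, a request for 100 partitions rounds up to 128, which needs 7 bits, and a hash `h` is then routed with `h & 127`.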
Eric Coissac d9aa211b8f feat: add k-mer index rebuild and compaction feature
This commit introduces a new `rebuild` CLI subcommand that reconstructs an existing multi-layer k-mer index into a compact, single-layer index. It implements a configurable filtering pipeline supporting min/max genome fraction/count and total count thresholds, parallel partition processing via `rayon`, and CLI progress tracking. The change also restructures module declarations across `obikindex` and `obikpartitionner` to integrate the new rebuild and layer-handling logic.
2026-05-21 15:08:19 +02:00
Eric Coissac 3fa1dbf8cc feat: add pairwise distance computation and phylogenetic trees
This commit introduces a new `distance` CLI subcommand that computes pairwise genomic distance matrices using configurable metrics (Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger). It optionally generates phylogenetic trees (NJ or UPGMA) in Newick format and outputs results as CSV. The implementation adds a robust distance computation backend that dynamically routes to optimized backends based on index configuration, supports parallel iteration, and gracefully handles missing data. Additionally, it adds a `dump` task for exporting k-mer to genome mappings as CSV, introduces an `InvalidInput` error variant, updates dependencies to support numerical operations and tree construction, and performs minor module reorganizations.
2026-05-21 15:03:08 +02:00
Eric Coissac 9e1d6f2f25 feat: implement partition-based merge command for k-mer indices
Implements a new `merge` command that aggregates k-mer counts and presence/absence matrices from multiple source indices using a parallelized, partition-based algorithm. Adds CLI progress bars and execution timing across the bootstrap, spectrum rebuild, and merge phases. Updates logging to report the aggregate genome count and introduces a bounds check in the perfect hash layer to safely return `None` for unknown k-mers, preventing out-of-bounds access in downstream operations.
2026-05-21 14:55:38 +02:00
Eric Coissac 11182005a2 feat: enhance merge label resolution, debug dump, and layer metadata
This commit enhances the CLI and index pipelines by introducing `--force-presence` to normalize output to binary values, `--debug` to expose partition and layer metadata, and `--rename-duplicates` to automatically disambiguate overlapping genome labels. It updates the partitioner and index layers to auto-discover layers, persist `meta.json` for single-genome builds, and fix per-source column offsets during merging. A `DuplicateGenomeLabel` error variant is also added, and stale directories are properly managed in presence/absence mode.
2026-05-21 14:52:59 +02:00
Eric Coissac 1a1f95e59d feat: add CLI command to export indexed k-mers to CSV
This change introduces a new `dump` subcommand that exports all indexed k-mers to a CSV stream. The implementation spans multiple crates, adding core export logic to `obikindex` and partition iteration to `obikpartitionner`. The command supports a `--force-presence` flag to output binary presence/absence data instead of stored counts, and includes necessary module registrations and structural updates across the codebase.
2026-05-21 13:48:07 +02:00
Eric Coissac e1d59fde54 feat: add merge command to consolidate k-mer indexes
Introduces a new `merge` CLI subcommand and underlying implementation to consolidate multiple pre-indexed k-mer indexes into a single output. Adds `append_column` methods to persistent bit and int matrices to enable incremental genome column expansion without rebuilding the MPHF. Includes new error variants for index readiness and configuration mismatches, adds a `--force` flag to the index command, and updates documentation and navigation structure accordingly.
2026-05-21 13:44:50 +02:00
Eric Coissac bfa436ad15 feat: add merge operation specs and partition progress bar
Added implementation specifications for the `merge` operation, detailing parallel partition processing, I/O paths, and kmer matrix aggregation across multiple indexes. Integrated an `indicatif` progress bar into the `rayon` parallel loop to monitor processing position, throughput, ETA, and recent partition duration.
2026-05-21 13:36:50 +02:00
Eric Coissac 7d1b62ddf3 refactor: replace single spectrum file with per-partition outputs
Replace the single `kmer_spectrum_raw.json` output with per-partition JSON files in a `spectrums/` directory. Add a `keep_intermediate` parameter to control intermediate file cleanup, and introduce a `write_spectrum` helper for serialization. Update the completion sentinel to `count.done` and align state documentation accordingly.
2026-05-21 13:35:06 +02:00
Eric Coissac c5bcb7b8fa feat: introduce layered MPHF indexing and partition metadata
Refactors obikindex and obikpartitionner to delegate index construction to a new layered MPHF implementation. Adds resume-safe building with abundance filtering and count persistence, while introducing a PartitionMeta struct for JSON configuration persistence. Updates OKIError to wrap layer-specific errors, replaces single-path extraction with full path collection and logging, and registers new internal dependencies across the workspace.
2026-05-21 13:31:37 +02:00
Eric Coissac 17c9e076bd refactor: extract obikindex crate and remove deprecated CLI commands
Extracted core indexing logic, state tracking, and metadata management into a new `obikindex` crate. Refactored the `index` and `unitig` commands to leverage the `KmerIndex` abstraction and state-driven pipeline transitions. Removed obsolete CLI subcommands (`count`, `fasta`, `longtig`, `partition`) and their associated pipeline steps. Updated FASTA writing utilities for single-line output and deterministic identifiers, and refreshed workspace dependencies.
2026-05-20 18:54:12 +02:00
Eric Coissac f8cfb493b8 refactor: extract pipeline stages and centralize partition directory paths
Extracts the scatter, dereplicate/count, and index pipeline stages into a new `steps` module to improve modularity. Centralizes partition directory path construction by introducing a `part_dir()` helper, replacing manual string formatting across multiple command files. Adds `--with-counts` and `--keep-intermediate` CLI flags to the index command and fixes a typo in the `partition_dir` parameter name.
2026-05-20 18:42:09 +02:00
Eric Coissac cc2ed4bd31 feat: Add progress tracking and timing instrumentation to index
Introduces comprehensive progress tracking and timing instrumentation using indicatif and obisys::Reporter/Stage. Adds an EMA-based throughput calculator for the scatter phase and wraps parallel progress bars in Arc<Mutex> for thread-safe concurrent updates across all pipeline stages.
2026-05-20 18:34:28 +02:00
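An EMA-based throughput estimate of the kind mentioned in cc2ed4bd31 can be sketched as follows. The type `EmaThroughput`, its fields, and the smoothing factor `alpha` are assumptions; the commit does not state the weighting it uses.

```rust
/// Exponential moving average of a processing rate (units per second).
pub struct EmaThroughput {
    alpha: f64,        // weight of the newest sample, in (0, 1]
    rate: Option<f64>, // None until the first sample arrives
}

impl EmaThroughput {
    pub fn new(alpha: f64) -> Self {
        assert!(alpha > 0.0 && alpha <= 1.0);
        Self { alpha, rate: None }
    }

    /// Feed one measurement interval and return the smoothed rate.
    pub fn update(&mut self, units: u64, elapsed_secs: f64) -> f64 {
        assert!(elapsed_secs > 0.0);
        let sample = units as f64 / elapsed_secs;
        let rate = match self.rate {
            // The first sample seeds the average directly.
            None => sample,
            Some(prev) => self.alpha * sample + (1.0 - self.alpha) * prev,
        };
        self.rate = Some(rate);
        rate
    }
}
```

A smaller `alpha` gives a steadier figure on the progress bar but reacts more slowly when the real rate changes.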
Eric Coissac e66c4d81ef feat(obikmer): add index subcommand for kmer counting pipeline
Introduce the `index` CLI subcommand, implementing a resumable, multi-stage pipeline to partition, dereplicate, and count kmers from input sequences. The command builds a layered de Bruijn graph index per partition, applies optional abundance filtering, and persists unitigs alongside an MPHF-based count matrix. Update `Cargo.toml` and `Cargo.lock` to include new dependencies (`epserde`, `ptr_hash`, `cacheline-ef`, `obicompactvec`, `obilayeredmap`) required for the index builder, and refresh the profiling data files.
2026-05-20 18:25:12 +02:00
Eric Coissac c20a1ed465 perf: optimize k-mer pipeline with compile-time tables
This commit shifts entropy and lookup table generation to compile time via a new build script, eliminating runtime overhead. It replaces heap-allocated queues in rolling statistics with a stack-allocated, const-generic ring buffer for cache-friendly operations, and implements `size_hint` on `SuperKmerIter` for efficient iterator consumption. Additionally, it establishes the baseline profile configuration and sets global k-mer parameters.
2026-05-20 15:54:20 +02:00
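The stack-allocated, const-generic ring buffer mentioned in c20a1ed465 can be illustrated with a rolling sum. The type `RollingSum` and its element type are hypothetical; the commit does not say which statistics the real buffer tracks.

```rust
/// Fixed-capacity window over the last N values, stored inline (no heap).
pub struct RollingSum<const N: usize> {
    buf: [u32; N],
    head: usize, // position of the next write, which is also the oldest value once full
    len: usize,  // number of valid entries, at most N
    sum: u64,
}

impl<const N: usize> RollingSum<N> {
    pub fn new() -> Self {
        assert!(N > 0, "window must hold at least one value");
        Self { buf: [0; N], head: 0, len: 0, sum: 0 }
    }

    /// Push a value, evicting the oldest one when the window is full,
    /// and return the sum over the current window.
    pub fn push(&mut self, x: u32) -> u64 {
        if self.len == N {
            self.sum -= self.buf[self.head] as u64;
        } else {
            self.len += 1;
        }
        self.buf[self.head] = x;
        self.sum += x as u64;
        self.head = (self.head + 1) % N;
        self.sum
    }
}
```

Because `N` is a compile-time constant, the array lives inside the struct itself, which avoids the allocation and pointer chasing of a heap-backed queue.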
Eric Coissac 9a1c0c0ee0 Add CLI progress bars and throughput metrics to partitioning
Add `indicatif` v0.17 to `obikmer` and `obikpartitionner` to instrument CLI workflows with real-time progress tracking. The changes integrate progress spinners and bars into the batch processing and parallel kmer counting loops, displaying processed base pairs, throughput rates, and elapsed time. Updates occur every 0.1s to enhance observability without modifying core partitioning logic.
2026-05-20 15:46:52 +02:00
Eric Coissac b80ab77d66 perf: Switch to sequential PHF construction to avoid thread contention
The outer partition loop already saturates parallelism, making parallel PHF construction redundant and causing Rayon thread pool contention. This change switches to a sequential variant to improve performance. It also adds explicit error handling for construction failures and keeps the existing mmap-backed kmer slice.
2026-05-20 12:48:42 +02:00
Eric Coissac 6e2a4c977b fix: Replace unreliable memory pressure check with swap indicator
The previous `major_faults > 10` check is unreliable on macOS, as it counts file-backed mmap page-ins rather than true memory pressure. This change replaces it with `swaps > 0`, a more accurate cross-platform indicator of RAM exhaustion. The swap diagnosis message is also updated to clarify that the working set exceeds available RAM, and comments are added to document this rationale.
2026-05-19 11:41:41 +02:00
Eric Coissac 8c16b79983 feat(obikmer): add obisys profiling to partition pipeline
Added obisys as a local dependency and integrated its Reporter and Stage instrumentation into the partition command. Each major phase (scatter, dereplicate, and kmer-counting) is now wrapped in timing blocks, with aggregated execution times printed to stdout upon completion.
2026-05-19 11:40:20 +02:00
Eric Coissac d0c277d5b6 feat(obisys): Add stage-based performance profiler
Establishes the `obisys` crate using Rust 2024 and the `libc` dependency. Introduces a lightweight profiler that captures wall-clock time and `getrusage` system metrics per pipeline stage. Automatically computes parallelism and efficiency ratios, detects bottlenecks such as memory pressure and disk I/O, and prints a formatted diagnostic summary to stderr.
2026-05-19 11:40:20 +02:00
Eric Coissac 4736a7b6de refactor: restructure k-mer partitioning pipeline for memory efficiency
Replace in-memory hashing with a disk-backed external merge sort and `PersistentCompactIntVec` to drastically reduce peak RAM. Unify both phases using a custom `PtrHash` MPHF, eliminating `GOFunction` and `boomphf`. Introduce a concrete three-step `count_partition()` pipeline with adaptive chunk sizing based on available system memory. Update dependencies to `memmap2`, `ptr_hash`, and `obicompactvec`. Additionally, document strict genomics-only memory constraints and enforce an architectural feedback workflow requiring explicit user authorization before structural changes.
2026-05-17 16:08:47 +08:00
Eric Coissac f36b095ce2 docs: clarify MPHF indexing, storage layout, and distance traits
Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
2026-05-17 15:59:10 +08:00
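The additive decomposition described in f36b095ce2 can be illustrated with the Jaccard distance on presence bit columns. Each partition contributes a pair of counts, and the counts are summed before the final ratio is taken. The type `JaccardPartial` is hypothetical and is not the `BitPartials` trait named in the commit, whose definition is not shown here.

```rust
/// Per-partition counts that can be summed across partitions.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct JaccardPartial {
    pub shared: u64, // k-mers present in both genomes
    pub total: u64,  // k-mers present in at least one genome
}

impl JaccardPartial {
    /// Count over two bit-packed presence columns of one partition.
    pub fn from_columns(a: &[u64], b: &[u64]) -> Self {
        assert_eq!(a.len(), b.len());
        let mut p = Self::default();
        for (x, y) in a.iter().zip(b) {
            p.shared += (x & y).count_ones() as u64;
            p.total += (x | y).count_ones() as u64;
        }
        p
    }

    /// Combine the partials of two partitions.
    pub fn merge(self, other: Self) -> Self {
        Self {
            shared: self.shared + other.shared,
            total: self.total + other.total,
        }
    }

    /// Final distance; two empty columns are treated as distance 0 here
    /// (an assumption of this sketch).
    pub fn distance(self) -> f64 {
        if self.total == 0 {
            0.0
        } else {
            1.0 - self.shared as f64 / self.total as f64
        }
    }
}
```

The distance itself is not additive, but its numerator and denominator are, which lets each partition be processed independently and reduced at the end.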
Eric Coissac cf693f17f2 refactor: delegate MPHF construction and I/O to MphfLayer
Extracts MPHF construction, evidence encoding, and unitig I/O into a new `MphfLayer` module. This removes direct dependencies on `Evidence`, `PersistentBitMatrix`, and `PersistentCompactIntMatrix` from `Layer`. The `query` method is simplified to perform direct MPHF lookups, while build logic and serialization are consolidated within the new module.
2026-05-16 19:32:40 +08:00