obikmer

Author	SHA1	Message	Date
Eric Coissac	b7db3a33ed	docs: add coverage reference files and flag architectural drift These files catalog test coverage for Rust modules across architecture, implementation, and theory sections. They track recent structural changes, flag areas prone to documentation drift, and mandate verification of key parameters and routing logic to maintain alignment with the active codebase.	2026-05-23 13:44:23 +02:00
Eric Coissac	b2a52bfb37	perf: optimize chunk_start for single-block indexing Bypasses bitwise shift and mask operations when `block_bits == 0`, directly indexing `self.block_offsets[i]` instead. This eliminates unnecessary arithmetic overhead for single-block cases while preserving the original block-based offset calculation for larger block sizes.	2026-05-23 13:34:05 +02:00
Eric Coissac	bc51cd9861	feat: add configurable block sizes and in-place reindex command Propagate configurable block size (`block_bits`) through index and layer construction to control unitig chunking and optimize memory/performance trade-offs. Introduce an in-place `reindex` command and library method to convert indices between exact and approximate evidence formats. Add validation to reject merging indexes with mismatched evidence types, and update parallel kmer counting to use `AtomicUsize` for thread-safe aggregation. Includes CLI argument parsing, metadata persistence, and updated tests.	2026-05-23 13:28:24 +02:00
Eric Coissac	876bc0127f	feat: add approximate evidence matching and index estimation CLI Introduces a new `estimate` CLI subcommand to calculate bloom filter size, evidence bits, and false-positive rates for approximate indexing. Updates the index building and querying pipelines to support both exact and approximate evidence types via a unified `EvidenceKind` abstraction. Refactors `MphfLayer` and partition index builders to route operations based on the selected evidence mode, and adds the required `obilayeredmap` dependency.	2026-05-23 13:16:49 +02:00
Eric Coissac	16a6b0d033	feat: add evidence metadata and configurable k-mer parameters Introduces `EvidenceKind` and `LayerMeta` structs to manage per-layer evidence configuration and false-positive parameters. Adds JSON serialization for layer metadata persistence and updates `build_approx_evidence` to accept a `z` parameter for consecutive k-mer thresholds. Exposes these types publicly and documents a future `aggregate` command for merging index matrix columns.	2026-05-23 13:10:18 +02:00
Eric Coissac	e1dab86daf	feat: add approximate kmer fingerprinting with memory-mapped storage Introduce a new `fingerprint` module that stores packed B-bit vectors via memory-mapped files. Expose the module publicly and add `build_approx_evidence` to `Layer` and `MphfLayer` for generating compact `fingerprint.bin` files. Implement `find_approx` for fast, probabilistic kmer lookups using configurable bit-widths. Update dependencies to `bitvec` v1 and add `cacheline-ef`, `epserde`, and `memmap2` to support the new storage and serialization logic.	2026-05-23 13:07:02 +02:00
Eric Coissac	24afd74e2f	refactor: decouple unitig index generation and add exact evidence Decouple index generation by introducing `build_unitig_idx()` for retroactive `.idx` creation and optional immediate writing on close. Add `open_sequential()` for index-less iteration while enforcing index requirements for random access. Refactor the MPHF layer to pre-generate the unitig index for parallel random access, integrate `rayon` for k-mer processing, and enforce mapping integrity via duplicate slot validation. Additionally, implement `build_exact_evidence()` to reconstruct evidence from existing artifacts, and update tests to leverage the new index generation and simplified k-mer iteration helpers.	2026-05-23 13:02:25 +02:00
Eric Coissac	8478072b78	feat: make index granularity configurable via block_bits Replaces the hardcoded BLOCK_SIZE constant with a configurable block_bits parameter, enabling variable index granularity to balance index size and sequential scan cost. Both the reader and writer now store block_bits and a precomputed mask for branchless offset arithmetic, while the index file format is upgraded to UIX3 to persist the configuration. Comprehensive unit tests verify serialization, chunk offset indexing, random access consistency, and kmer count accuracy across various block sizes.	2026-05-23 12:57:56 +02:00
Eric Coissac	4a5ab0b8c2	feat: optimize unitig index and document evidence elimination Replace the dense per-chunk offset index with a sparse block-sampled structure (64 chunks per block), reducing the index file size by approximately 300× while preserving O(1) k-mer extraction. Introduce a design document for eliminating the `evidence.bin` file, which accounts for ~66% of the lookup layer, by transitioning to fingerprint-based approximate indexing and value-based MPHF lookups. Update MkDocs navigation to include the new documentation and add a file count tracker to the scatter step progress bar for improved observability.	2026-05-23 12:53:42 +02:00
coissac	9b700ff4a4	Merge pull request 'feat: Implement RAII-based file handle throttling' (#7) from push-qtnvlqlooklx into main. Reviewed-on: #7	2026-05-22 12:03:40 +00:00
Eric Coissac	ca71e100ef	feat: Implement RAII-based file handle throttling Introduces a thread-safe, RAII-based throttling mechanism (`throttle_paths`, `FileSlots`, `SlotsGuard`) to enforce a new `max_open_files` configuration limit. This replaces direct file opening in the scatter and superkmer pipelines with a concurrency semaphore that automatically releases handles upon completion, preventing resource exhaustion and deadlocks during concurrent I/O.	2026-05-22 14:03:24 +02:00
coissac	6c1a8da2d1	Merge pull request 'feat: limit concurrent open files during scatter' (#6) from push-rkytvkympxrn into main. Reviewed-on: #6	2026-05-22 09:33:39 +00:00
Eric Coissac	85e1901898	feat: limit concurrent open files during scatter Introduces a `max_open_files` CLI argument (default: 20) to cap concurrently open input files during scatter operations. The scatter phase now parallelizes sequence file partitioning across worker threads while enforcing a configurable concurrency limit using a custom semaphore and `GuardedIter` wrapper. This ensures bounded resource usage and prevents file handle exhaustion during index construction.	2026-05-22 11:28:44 +02:00
coissac	1ba6690256	Merge pull request 'refactor(scatter): move logging call into pipeline closure' (#5) from push-txyqunyttswl into main. Reviewed-on: #5	2026-05-22 09:15:05 +00:00
Eric Coissac	9b37e848d4	refactor(scatter): move logging call into pipeline closure Add an explicit `PathBuf` type annotation and consolidate the "indexing" info log with the chunk-opening logic. This reduces pipeline boilerplate by keeping the logging directly in the initial stage closure.	2026-05-22 11:14:52 +02:00
coissac	df28cadc41	Merge pull request 'feat: add input file logging and optimize path traversal' (#4) from push-zoyvrpponqqo into main. Reviewed-on: #4	2026-05-22 09:04:42 +00:00
Eric Coissac	fe2127c463	feat: add input file logging and optimize path traversal Instrument index and scatter stages with `tracing::info` to log input file paths for better runtime observability. Additionally, optimize the path iterator by replacing redundant `is_dir()` checks with explicit `is_file()` validation and deferring metadata resolution, eliminating unnecessary `stat()` syscalls and improving traversal performance on high-latency network filesystems like Lustre and NFS.	2026-05-22 11:04:04 +02:00
coissac	fe0832190b	Merge pull request 'feat(obiread): add static bzip2 and lzma compression support' (#3) from push-qpvrrxpnqlkw into main. Reviewed-on: #3	2026-05-22 08:29:42 +00:00
Eric Coissac	72d054c06b	feat(obiread): add static bzip2 and lzma compression support Explicitly add `bzip2-sys` and `liblzma-sys` with the `static` feature to the `obiread` crate. This enforces static linking for BZ2 and LZMA/XZ backends, eliminating runtime dynamic library dependencies and ensuring consistent binary distribution.	2026-05-22 10:29:22 +02:00
coissac	3d58a32613	Merge pull request 'feat: introduce genome metadata tracking and CSV export' (#2) from push-zrlrptrsrlkk into main. Reviewed-on: #2	2026-05-22 07:36:11 +00:00
Eric Coissac	0f8f61d3dd	feat: introduce genome metadata tracking and CSV export This commit replaces raw string genome labels with a structured `GenomeInfo` type for better metadata tracking. It adds a `--meta` flag to the index command, and implements a new `annotate` CLI subcommand to import metadata from CSV files or export it via `--dump`. Distance and shared-count matrices are now serialized to CSV, with UPGMA clustering trees exported as Newick files. Query outputs now include per-genome k-mer match counts in JSON. The commit also fixes syntax and variable naming issues in index merging and dump generation.	2026-05-22 09:35:20 +02:00
coissac	77a0186fae	Merge pull request 'Push qkpyqurltlpk' (#1) from push-qkpyqurltlpk into main. Reviewed-on: #1	2026-05-21 16:57:19 +00:00
Eric Coissac	13599dd444	feat: Implement query subcommand for sequence-to-genome mapping This change introduces the `query` CLI command and its supporting infrastructure for sequence-to-genome mapping and k-mer matching. It adds a `QueryLayer` abstraction backed by MPHF and persistent matrices, exposes the index partition for direct querying, and implements `Hash`/`Eq` for `RoutableSuperKmer`. The command ingests sequence batches, deduplicates superkmers, routes them to index partitions for parallel exact or 1-mismatch matching, and outputs results as FASTA records annotated with JSON metadata. Includes `serde_json` dependency addition, module exports, and documentation updates.	2026-05-21 18:56:41 +02:00
Eric Coissac	c8e591fc78	feat: add superkmer CLI setup and partition bit handling This commit introduces CLI argument parsing for the `superkmer` command via a new `SuperkmerArgs` struct. It also adds a `partitions_to_bits` utility to compute the minimum bit width for partition encoding, enforcing a 1-bit floor. Finally, the index configuration automatically rounds the partition count up to the nearest power of two to ensure compatibility with bitmask-based indexing operations.	2026-05-21 15:10:48 +02:00
Eric Coissac	d9aa211b8f	feat: add k-mer index rebuild and compaction feature This commit introduces a new `rebuild` CLI subcommand that reconstructs an existing multi-layer k-mer index into a compact, single-layer index. It implements a configurable filtering pipeline supporting min/max genome fraction/count and total count thresholds, parallel partition processing via `rayon`, and CLI progress tracking. The change also restructures module declarations across `obikindex` and `obikpartitionner` to integrate the new rebuild and layer-handling logic.	2026-05-21 15:08:19 +02:00
Eric Coissac	3fa1dbf8cc	feat: add pairwise distance computation and phylogenetic trees This commit introduces a new `distance` CLI subcommand that computes pairwise genomic distance matrices using configurable metrics (Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger). It optionally generates phylogenetic trees (NJ or UPGMA) in Newick format and outputs results as CSV. The implementation adds a robust distance computation backend that dynamically routes to optimized backends based on index configuration, supports parallel iteration, and gracefully handles missing data. Additionally, it adds a `dump` task for exporting k-mer to genome mappings as CSV, introduces an `InvalidInput` error variant, updates dependencies to support numerical operations and tree construction, and performs minor module reorganizations.	2026-05-21 15:03:08 +02:00
Eric Coissac	9e1d6f2f25	feat: implement partition-based merge command for k-mer indices Implements a new `merge` command that aggregates k-mer counts and presence/absence matrices from multiple source indices using a parallelized, partition-based algorithm. Adds CLI progress bars and execution timing across the bootstrap, spectrum rebuild, and merge phases. Updates logging to report the aggregate genome count and introduces a bounds check in the perfect hash layer to safely return `None` for unknown k-mers, preventing out-of-bounds access in downstream operations.	2026-05-21 14:55:38 +02:00
Eric Coissac	11182005a2	feat: enhance merge label resolution, debug dump, and layer metadata This commit enhances the CLI and index pipelines by introducing `--force-presence` to normalize output to binary values, `--debug` to expose partition and layer metadata, and `--rename-duplicates` to automatically disambiguate overlapping genome labels. It updates the partitioner and index layers to auto-discover layers, persist `meta.json` for single-genome builds, and fix per-source column offsets during merging. A `DuplicateGenomeLabel` error variant is also added, and stale directories are properly managed in presence/absence mode.	2026-05-21 14:52:59 +02:00
Eric Coissac	1a1f95e59d	feat: add CLI command to export indexed k-mers to CSV This change introduces a new `dump` subcommand that exports all indexed k-mers to a CSV stream. The implementation spans multiple crates, adding core export logic to `obikindex` and partition iteration to `obikpartitionner`. The command supports a `--force-presence` flag to output binary presence/absence data instead of stored counts, and includes necessary module registrations and structural updates across the codebase.	2026-05-21 13:48:07 +02:00
Eric Coissac	e1d59fde54	feat: add merge command to consolidate k-mer indexes Introduces a new `merge` CLI subcommand and underlying implementation to consolidate multiple pre-indexed k-mer indexes into a single output. Adds `append_column` methods to persistent bit and int matrices to enable incremental genome column expansion without rebuilding the MPHF. Includes new error variants for index readiness and configuration mismatches, adds a `--force` flag to the index command, and updates documentation and navigation structure accordingly.	2026-05-21 13:44:50 +02:00
Eric Coissac	bfa436ad15	feat: add merge operation specs and partition progress bar Added implementation specifications for the `merge` operation, detailing parallel partition processing, I/O paths, and kmer matrix aggregation across multiple indexes. Integrated an `indicatif` progress bar into the `rayon` parallel loop to monitor processing position, throughput, ETA, and recent partition duration.	2026-05-21 13:36:50 +02:00
Eric Coissac	7d1b62ddf3	refactor: replace single spectrum file with per-partition outputs Replace the single `kmer_spectrum_raw.json` output with per-partition JSON files in a `spectrums/` directory. Add a `keep_intermediate` parameter to control intermediate file cleanup, and introduce a `write_spectrum` helper for serialization. Update the completion sentinel to `count.done` and align state documentation accordingly.	2026-05-21 13:35:06 +02:00
Eric Coissac	c5bcb7b8fa	feat: introduce layered MPHF indexing and partition metadata Refactors obikindex and obikpartitionner to delegate index construction to a new layered MPHF implementation. Adds resume-safe building with abundance filtering and count persistence, while introducing a PartitionMeta struct for JSON configuration persistence. Updates OKIError to wrap layer-specific errors, replaces single-path extraction with full path collection and logging, and registers new internal dependencies across the workspace.	2026-05-21 13:31:37 +02:00
Eric Coissac	17c9e076bd	refactor: extract obikindex crate and remove deprecated CLI commands Extracted core indexing logic, state tracking, and metadata management into a new `obikindex` crate. Refactored the `index` and `unitig` commands to leverage the `KmerIndex` abstraction and state-driven pipeline transitions. Removed obsolete CLI subcommands (`count`, `fasta`, `longtig`, `partition`) and their associated pipeline steps. Updated FASTA writing utilities for single-line output and deterministic identifiers, and refreshed workspace dependencies.	2026-05-20 18:54:12 +02:00
Eric Coissac	f8cfb493b8	refactor: extract pipeline stages and centralize partition directory paths Extracts the scatter, dereplicate/count, and index pipeline stages into a new `steps` module to improve modularity. Centralizes partition directory path construction by introducing a `part_dir()` helper, replacing manual string formatting across multiple command files. Adds `--with-counts` and `--keep-intermediate` CLI flags to the index command and fixes a typo in the `partition_dir` parameter name.	2026-05-20 18:42:09 +02:00
Eric Coissac	cc2ed4bd31	feat: Add progress tracking and timing instrumentation to index Introduces comprehensive progress tracking and timing instrumentation using indicatif and obisys::Reporter/Stage. Adds an EMA-based throughput calculator for the scatter phase and wraps parallel progress bars in Arc<Mutex> for thread-safe concurrent updates across all pipeline stages.	2026-05-20 18:34:28 +02:00
Eric Coissac	e66c4d81ef	feat(obikmer): add index subcommand for kmer counting pipeline Introduce the `index` CLI subcommand, implementing a resumable, multi-stage pipeline to partition, dereplicate, and count kmers from input sequences. The command builds a layered de Bruijn graph index per partition, applies optional abundance filtering, and persists unitigs alongside an MPHF-based count matrix. Update `Cargo.toml` and `Cargo.lock` to include new dependencies (`epserde`, `ptr_hash`, `cacheline-ef`, `obicompactvec`, `obilayeredmap`) required for the index builder, and refresh the profiling data files.	2026-05-20 18:25:12 +02:00
Eric Coissac	c20a1ed465	perf: optimize k-mer pipeline with compile-time tables This commit shifts entropy and lookup table generation to compile time via a new build script, eliminating runtime overhead. It replaces heap-allocated queues in rolling statistics with a stack-allocated, const-generic ring buffer for cache-friendly operations, and implements `size_hint` on `SuperKmerIter` for efficient iterator consumption. Additionally, it establishes the baseline profile configuration and sets global k-mer parameters.	2026-05-20 15:54:20 +02:00
Eric Coissac	9a1c0c0ee0	Add CLI progress bars and throughput metrics to partitioning Add `indicatif` v0.17 to `obikmer` and `obikpartitionner` to instrument CLI workflows with real-time progress tracking. The changes integrate progress spinners and bars into the batch processing and parallel kmer counting loops, displaying processed base pairs, throughput rates, and elapsed time. Updates occur every 0.1s to enhance observability without modifying core partitioning logic.	2026-05-20 15:46:52 +02:00
Eric Coissac	b80ab77d66	perf: Switch to sequential PHF construction to avoid thread contention The outer partition loop already saturates parallelism, making parallel PHF construction redundant and causing Rayon thread pool contention. This change switches to a sequential variant to improve performance. Additionally, explicit error handling is now added for construction failures, while preserving the existing mmap-backed kmer slice.	2026-05-20 12:48:42 +02:00
Eric Coissac	6e2a4c977b	fix: Replace unreliable memory pressure check with swap indicator The previous `major_faults > 10` check is unreliable on macOS, as it counts file-backed mmap page-ins rather than true memory pressure. This change replaces it with `swaps > 0`, a more accurate cross-platform indicator of RAM exhaustion. The swap diagnosis message is also updated to clarify that the working set exceeds available RAM, and comments are added to document this rationale.	2026-05-19 11:41:41 +02:00
Eric Coissac	8c16b79983	feat(obikmer): add obisys profiling to partition pipeline Added obisys as a local dependency and integrated its Reporter and Stage instrumentation into the partition command. Each major phase (scatter, dereplicate, and kmer-counting) is now wrapped in timing blocks, with aggregated execution times printed to stdout upon completion.	2026-05-19 11:40:20 +02:00
Eric Coissac	d0c277d5b6	feat(obisys): Add stage-based performance profiler Establishes the `obisys` crate using Rust 2024 and the `libc` dependency. Introduces a lightweight profiler that captures wall-clock time and `getrusage` system metrics per pipeline stage. Automatically computes parallelism and efficiency ratios, detects bottlenecks such as memory pressure and disk I/O, and prints a formatted diagnostic summary to stderr.	2026-05-19 11:40:20 +02:00
Eric Coissac	4736a7b6de	refactor: restructure k-mer partitioning pipeline for memory efficiency Replace in-memory hashing with a disk-backed external merge sort and `PersistentCompactIntVec` to drastically reduce peak RAM. Unify both phases using a custom `PtrHash` MPHF, eliminating `GOFunction` and `boomphf`. Introduce a concrete three-step `count_partition()` pipeline with adaptive chunk sizing based on available system memory. Update dependencies to `memmap2`, `ptr_hash`, and `obicompactvec`. Additionally, document strict genomics-only memory constraints and enforce an architectural feedback workflow requiring explicit user authorization before structural changes.	2026-05-17 16:08:47 +08:00
Eric Coissac	f36b095ce2	docs: clarify MPHF indexing, storage layout, and distance traits Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.	2026-05-17 15:59:10 +08:00
Eric Coissac	cf693f17f2	refactor: delegate MPHF construction and I/O to MphfLayer Extracts MPHF construction, evidence encoding, and unitig I/O into a new `MphfLayer` module. This removes direct dependencies on `Evidence`, `PersistentBitMatrix`, and `PersistentCompactIntMatrix` from `Layer`. The `query` method is simplified to perform direct MPHF lookups, while build logic and serialization are consolidated within the new module.	2026-05-16 19:32:40 +08:00
Eric Coissac	13e69e23c9	feat: introduce trait-based distance aggregation and layered store Introduces ColumnWeights, CountPartials, and BitPartials traits to compute and finalize partial distance matrices. Implements these traits for PersistentBitMatrix, PersistentCompactIntMatrix, and a new LayeredStore<S> wrapper that aggregates metrics across layers via parallel reduction. Adds ndarray for numerical aggregation and updates architecture documentation to reflect the trait-driven design and pending refactoring roadmap.	2026-05-16 14:41:49 +08:00
Eric Coissac	45d49ed501	docs: document k-mer index architecture and refactor distance metrics Add comprehensive documentation for the `obilayeredmap` crate, `PersistentCompactIntVec`, `PersistentBitVec`, and the hierarchical k-mer index architecture, including sidebar navigation updates across all documentation pages. Refactor the Bray-Curtis distance computation in `obicompactvec` to decouple numerator and denominator calculations, replacing direct pairwise calls with explicit loops over precomputed sums. Update tests to verify column sum accuracy and align with the simplified API.	2026-05-15 21:24:30 +08:00
Eric Coissac	8409c852ef	feat: Add parallel column counts and partial distance metrics Introduces parallel `count_ones` for `BitMatrix` and parallel column-sum aggregation alongside three pairwise distance constructors (Bray-Curtis, Euclidean, Hellinger) for `IntMatrix`. These methods support partial, layer-wise data by accepting precomputed global column sums for normalization, enabling additive decomposition across partitions. Includes unit tests verifying mathematical equivalence and partition additivity.	2026-05-15 20:44:58 +08:00
Eric Coissac	8bee9f3017	feat: add parallel distance matrix computation for bit and int matrices Introduce parallel distance matrix generation using `ndarray` and `rayon` for both `BitMatrix` and `IntMatrix`. Adds full and additive-partial variants for Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger metrics. Includes comprehensive unit tests verifying matrix symmetry, zero diagonals, and numerical correctness against pairwise calculations.	2026-05-15 17:23:12 +08:00

106 Commits