obikmer

Author	SHA1	Message	Date
coissac	728476a0a6	Merge pull request 'Push rxrxptuxltlp' (#12 ) from push-rxrxptuxltlp into main Reviewed-on: #12	2026-05-30 14:03:13 +00:00
Eric Coissac	8a0b898b4b	docs: clarify query pipeline, Findere trick, and input formats Fix a stray prefix in the README heading and update documentation to reflect the query pipeline's operation on `s-mers` (`s = k - z + 1`) with post-partition z-window filtering. Clarify the Findere trick, including k-mer size reduction, consecutive match requirements, and false positive rate calculations. Additionally, expand input format documentation to cover supported file extensions, gzip compression, recursive directory handling, and `query` command specifications.	2026-05-30 15:59:12 +02:00
Eric Coissac	708b0abf9b	refactor: aggregate query results at sequence level Refactor the query pipeline to buffer partition outputs into a per-sequence `seq_results` vector, deferring final accumulation until all partitions complete. This ensures global position ordering before computing k-mer presence, counts, and coverage statistics. Additionally, removes a resolved TODO and documents a known BLAST false-positive issue where chloroplast and bacterial contaminants yield unrealistic high-confidence matches.	2026-05-30 07:18:54 +02:00
coissac	3138f6382c	Merge pull request 'Push nvyqwlpspwvl' (#11 ) from push-nvyqwlpspwvl into main Reviewed-on: #11	2026-05-29 07:21:58 +00:00
Eric Coissac	86b88acb95	feat: implement approximate k-mer indexing and optimize query Enable approximate k-mer indexing via the `--approx` flag, computing an effective k-mer size of `k - z + 1` and configuring the appropriate indexing mode with validated probabilistic parameters. Refactor the Findere z-window filter in the query command to improve performance and correctness by replacing the precomputed vector with a lazy closure, optimizing cache locality, and fixing a variable naming bug.	2026-05-29 09:20:44 +02:00
Eric Coissac	be0e8f1041	refactor(query): parallelize query execution with obipipeline Extracts chunk processing into a dedicated function and introduces a QueryData enum with unsafe Send/Sync implementations to safely distribute Rope chunks across worker threads. Replaces nested iteration with a flat iterator and parallel block processing. Adds CLI argument parsing for presence, threshold, and detail flags to configure the pipeline.	2026-05-29 09:18:54 +02:00
Eric Coissac	eaa52eaab5	feat: introduce nucstream abstraction and comprehensive test suite Introduces a unified NucStream abstraction with NucPageCursor for byte-offset tracking and MIME-type dispatch to instantiate format-specific parsers. Exposes nuc_stream and open_nuc_stream APIs that return boxed, Send-compatible iterators. Additionally, adds a comprehensive test suite covering chunk boundary alignment, FASTA/FASTQ record parsing, sequence normalization, and edge cases such as CRLF line endings, @ in quality strings, and multi-slice rope processing.	2026-05-29 09:17:24 +02:00
Eric Coissac	cfadf63bbc	refactor: migrate pipeline to NucPage-based stream processing Replace the existing chunk and Rope-based processing pipeline with a fixed-size NucPage architecture. Introduce a new nucstream module featuring buffer-pooled, in-place parsing that auto-detects and decompresses FASTA/FASTQ/GenBank inputs into normalized ACGT streams with k-mer overlap preservation. Update obikmer scatter and superkmer stages to consume NucPage iterators and cursor-based navigation, eliminating std::io::Read dependencies and optimizing memory management. Add a configurable max_open_files CLI argument and update implementation documentation to reflect the new record vs. stream reading paths.	2026-05-29 09:10:25 +02:00
Eric Coissac	a4b57a96de	feat: add streaming sequence reader and superkmer iterator Introduce the `obiread` crate with a streaming byte normalizer that processes FASTA, FASTQ, and GenBank files using a 64 KiB ring buffer for O(1) memory usage. Integrate this crate into `obiskbuilder` to provide `SuperKmerStreamIter`, enabling memory-efficient superkmer traversal with rolling entropy and minimizer-based cut conditions.	2026-05-29 09:10:25 +02:00
Eric Coissac	0d9be53d1f	feat: enforce runtime validation for kmer and minimizer parameters Introduces `CommonArgs::validate()` to enforce strict constraints on `--kmer-size` (odd, 11–31), `--minimizer-size` (odd, 3–k−1), and `z` (strictly less than k). This validation is applied at the entry point of the `superkmer` and `index` commands to prevent invalid configurations, avoid palindromes, prevent u64 overflow, and ensure positive effective indexing sizes. Documentation is updated to reflect these runtime checks and immediate termination on invalid input.	2026-05-29 09:10:25 +02:00
coissac	82ec6aa1cf	Merge pull request 'Push tklvqnrqtzpo' (#10 ) from push-tklvqnrqtzpo into main Reviewed-on: #10	2026-05-26 15:41:05 +00:00
Eric Coissac	694da5208e	feat: add Findere z-window filtering and detail mode coverage tracking Introduces the `--findere-z` CLI flag to override the index's sliding window parameter and implements `apply_findere` to filter k-mer hits using a z-consecutive positions window. Adds structural support for `--detail` mode, including per-sequence k-mer offsets, conditional allocation of per-position coverage vectors, and JSON serialization. Updates architecture documentation, CLI references, and annotation schemas to align with the new implementation, resolving prior discrepancies with `--detail` and `--mismatch` flags.	2026-05-26 15:43:17 +02:00
Eric Coissac	26ab165807	refactor: add rolling buffer methods and document label constraints Added `is_empty()`, `clear()`, and `iter()` methods to the rolling statistics buffer to enable standard traversal and state reset operations. Documented genome label constraints, specifying forbidden characters, empty label rejection, space quoting requirements, and auto-derived label bypass rules. Additionally, updated doc comments and added `#[allow(dead_code)]` attributes for `kmer_offset` and `n_kmers` fields to suppress compiler warnings while reserving them for future `--detail` coverage vector logic.	2026-05-26 15:40:23 +02:00
Eric Coissac	dfa0b2bac2	feat: add utils subcommand for renaming genome labels Introduces a `utils` CLI subcommand to enable in-place genome label renaming without full reindexing. Adds strict label validation to reject empty strings, filesystem separators, and control characters, ensuring safe CSV serialization. Updates index metadata, renames corresponding spectrum JSON files, and registers the command in the main dispatch logic. CLI reference documentation is also updated.	2026-05-26 15:35:22 +02:00
Eric Coissac	9e60a711bc	Enforce minimum input paths and handle stdin sentinel Update CLI validation to require at least one input path, defaulting to stdin (`-`) when the argument list is empty. Refactor the path iterator to explicitly recognize the stdin sentinel, bypassing extension validation and directory expansion to ensure direct passthrough to the file buffer without triggering `stat()` or recursive traversal.	2026-05-26 15:22:38 +02:00
Eric Coissac	98c14aade9	feat: centralize index configuration and add hybrid mode Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.	2026-05-26 15:08:29 +02:00
Eric Coissac	7501b6e854	refactor: switch indexing to IndexMode and update metadata Replace EvidenceKind with IndexMode (Exact, Approx, Hybrid) across layer construction and query dispatch. Update PartitionMeta and LayerMeta serialization to centralize index-wide configuration. Add flexible push_layer overloads to LayeredMap for dynamic index expansion without full rebuilds. Improve UnitigFileReader to fall back gracefully to sequential scanning when indexes are missing, eliminating panics.	2026-05-26 14:57:17 +02:00
coissac	6f7abddeaf	Merge pull request 'Push vqmnuyrkpxot' (#9 ) from push-vqmnuyrkpxot into main Reviewed-on: #9	2026-05-26 09:05:50 +00:00
Eric Coissac	1d880fdc5f	refactor: optimize MPHF construction and update legacy guidelines Replaces parallel random-access unitig iteration with a sequential mmap-based iterator for MPHF construction, eliminating the build-time `.idx` dependency by deferring index generation until after persistence. Updates `CLAUDE.md` to treat existing code as a hypothesis, mandating proactive removal of obsolete legacy constructs rather than preserving them out of inertia.	2026-05-26 10:54:59 +02:00
Eric Coissac	009a328c58	refactor: handle kmer deduplication and layer initialization concurrently Introduce a secondary layer iteration to open `SrcLayerData` alongside the unitig reader for concurrent metadata access. This refactors the merge routine to handle kmer deduplication and per-layer data initialization simultaneously. Also corrects a typo in `rebuild_layer.rs` where `openopen_sequential` is renamed to `open_sequential`.	2026-05-26 10:52:08 +02:00
Eric Coissac	9d46400898	feat: support exact and approximate evidence in layer construction Refactored `MphfLayer::build` to accept an `EvidenceKind` parameter, routing to exact (index-based, parallel MPHF, writes `evidence.bin`) or approximate (sequential mmap iterator, writes `fingerprint.bin`) pipelines. Introduced `CanonicalKmerIter` for memory-mapped, chunked k-mer iteration with O(1) resets via `Arc<Mmap>`. Updated layer and map APIs to forward evidence kind, added `push_layer` for count matrices, and adjusted tests and public exports accordingly.	2026-05-26 10:23:43 +02:00
Eric Coissac	036d044291	refactor: update core types and add approximate evidence support Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.	2026-05-26 10:04:25 +02:00
coissac	88365e444c	Merge pull request 'Push kztouvrzoqym' (#8 ) from push-kztouvrzoqym into main Reviewed-on: #8	2026-05-23 12:04:50 +00:00
Eric Coissac	da56c3e290	docs: update architecture and storage specs for approximate index Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.	2026-05-23 13:54:31 +02:00
Eric Coissac	b7db3a33ed	docs: add coverage reference files and flag architectural drift These files catalog test coverage for Rust modules across architecture, implementation, and theory sections. They track recent structural changes, flag areas prone to documentation drift, and mandate verification of key parameters and routing logic to maintain alignment with the active codebase.	2026-05-23 13:44:23 +02:00
Eric Coissac	b2a52bfb37	perf: optimize chunk_start for single-block indexing Bypasses bitwise shift and mask operations when `block_bits == 0`, directly indexing `self.block_offsets[i]` instead. This eliminates unnecessary arithmetic overhead for single-block cases while preserving the original block-based offset calculation for larger block sizes.	2026-05-23 13:34:05 +02:00
Eric Coissac	bc51cd9861	feat: add configurable block sizes and in-place reindex command Propagate configurable block size (`block_bits`) through index and layer construction to control unitig chunking and optimize memory/performance trade-offs. Introduce an in-place `reindex` command and library method to convert indices between exact and approximate evidence formats. Add validation to reject merging indexes with mismatched evidence types, and update parallel kmer counting to use `AtomicUsize` for thread-safe aggregation. Includes CLI argument parsing, metadata persistence, and updated tests.	2026-05-23 13:28:24 +02:00
Eric Coissac	876bc0127f	feat: add approximate evidence matching and index estimation CLI Introduces a new `estimate` CLI subcommand to calculate bloom filter size, evidence bits, and false-positive rates for approximate indexing. Updates the index building and querying pipelines to support both exact and approximate evidence types via a unified `EvidenceKind` abstraction. Refactors `MphfLayer` and partition index builders to route operations based on the selected evidence mode, and adds the required `obilayeredmap` dependency.	2026-05-23 13:16:49 +02:00
Eric Coissac	16a6b0d033	feat: add evidence metadata and configurable k-mer parameters Introduces `EvidenceKind` and `LayerMeta` structs to manage per-layer evidence configuration and false-positive parameters. Adds JSON serialization for layer metadata persistence and updates `build_approx_evidence` to accept a `z` parameter for consecutive k-mer thresholds. Exposes these types publicly and documents a future `aggregate` command for merging index matrix columns.	2026-05-23 13:10:18 +02:00
Eric Coissac	e1dab86daf	feat: add approximate kmer fingerprinting with memory-mapped storage Introduce a new `fingerprint` module that stores packed B-bit vectors via memory-mapped files. Expose the module publicly and add `build_approx_evidence` to `Layer` and `MphfLayer` for generating compact `fingerprint.bin` files. Implement `find_approx` for fast, probabilistic kmer lookups using configurable bit-widths. Update dependencies to `bitvec` v1 and add `cacheline-ef`, `epserde`, and `memmap2` to support the new storage and serialization logic.	2026-05-23 13:07:02 +02:00
Eric Coissac	24afd74e2f	refactor: decouple unitig index generation and add exact evidence Decouple index generation by introducing `build_unitig_idx()` for retroactive `.idx` creation and optional immediate writing on close. Add `open_sequential()` for index-less iteration while enforcing index requirements for random access. Refactor the MPHF layer to pre-generate the unitig index for parallel random access, integrate `rayon` for k-mer processing, and enforce mapping integrity via duplicate slot validation. Additionally, implement `build_exact_evidence()` to reconstruct evidence from existing artifacts, and update tests to leverage the new index generation and simplified k-mer iteration helpers.	2026-05-23 13:02:25 +02:00
Eric Coissac	8478072b78	feat: make index granularity configurable via block_bits Replaces the hardcoded BLOCK_SIZE constant with a configurable block_bits parameter, enabling variable index granularity to balance index size and sequential scan cost. Both the reader and writer now store block_bits and a precomputed mask for branchless offset arithmetic, while the index file format is upgraded to UIX3 to persist the configuration. Comprehensive unit tests verify serialization, chunk offset indexing, random access consistency, and kmer count accuracy across various block sizes.	2026-05-23 12:57:56 +02:00
Eric Coissac	4a5ab0b8c2	feat: optimize unitig index and document evidence elimination Replace the dense per-chunk offset index with a sparse block-sampled structure (64 chunks per block), reducing the index file size by approximately 64× while preserving O(1) k-mer extraction. Introduce a design document for eliminating the `evidence.bin` file, which accounts for ~66% of the lookup layer, by transitioning to fingerprint-based approximate indexing and value-based MPHF lookups. Update MkDocs navigation to include the new documentation and add a file count tracker to the scatter step progress bar for improved observability.	2026-05-23 12:53:42 +02:00
coissac	9b700ff4a4	Merge pull request 'feat: Implement RAII-based file handle throttling' (#7 ) from push-qtnvlqlooklx into main Reviewed-on: #7	2026-05-22 12:03:40 +00:00
Eric Coissac	ca71e100ef	feat: Implement RAII-based file handle throttling Introduces a thread-safe, RAII-based throttling mechanism (`throttle_paths`, `FileSlots`, `SlotsGuard`) to enforce a new `max_open_files` configuration limit. This replaces direct file opening in the scatter and superkmer pipelines with a concurrency semaphore that automatically releases handles upon completion, preventing resource exhaustion and deadlocks during concurrent I/O.	2026-05-22 14:03:24 +02:00
coissac	6c1a8da2d1	Merge pull request 'feat: limit concurrent open files during scatter' (#6 ) from push-rkytvkympxrn into main Reviewed-on: #6	2026-05-22 09:33:39 +00:00
Eric Coissac	85e1901898	feat: limit concurrent open files during scatter Introduces a `max_open_files` CLI argument (default: 20) to cap concurrently open input files during scatter operations. The scatter phase now parallelizes sequence file partitioning across worker threads while enforcing a configurable concurrency limit using a custom semaphore and `GuardedIter` wrapper. This ensures bounded resource usage and prevents file handle exhaustion during index construction.	2026-05-22 11:28:44 +02:00
coissac	1ba6690256	Merge pull request 'refactor(scatter): move logging call into pipeline closure' (#5 ) from push-txyqunyttswl into main Reviewed-on: #5	2026-05-22 09:15:05 +00:00
Eric Coissac	9b37e848d4	refactor(scatter): move logging call into pipeline closure Add an explicit `PathBuf` type annotation and consolidate the "indexing" info log with the chunk-opening logic. This reduces pipeline boilerplate by keeping the logging directly in the initial stage closure.	2026-05-22 11:14:52 +02:00
coissac	df28cadc41	Merge pull request 'feat: add input file logging and optimize path traversal' (#4 ) from push-zoyvrpponqqo into main Reviewed-on: #4	2026-05-22 09:04:42 +00:00
Eric Coissac	fe2127c463	feat: add input file logging and optimize path traversal Instrument index and scatter stages with `tracing::info` to log input file paths for better runtime observability. Additionally, optimize the path iterator by replacing redundant `is_dir()` checks with explicit `is_file()` validation and deferring metadata resolution, eliminating unnecessary `stat()` syscalls and improving traversal performance on high-latency network filesystems like Lustre and NFS.	2026-05-22 11:04:04 +02:00
coissac	fe0832190b	Merge pull request 'feat(obiread): add static bzip2 and lzma compression support' (#3 ) from push-qpvrrxpnqlkw into main Reviewed-on: #3	2026-05-22 08:29:42 +00:00
Eric Coissac	72d054c06b	feat(obiread): add static bzip2 and lzma compression support Explicitly add `bzip2-sys` and `liblzma-sys` with the `static` feature to the `obiread` crate. This enforces static linking for BZ2 and LZMA/XZ backends, eliminating runtime dynamic library dependencies and ensuring consistent binary distribution.	2026-05-22 10:29:22 +02:00
coissac	3d58a32613	Merge pull request 'feat: introduce genome metadata tracking and CSV export' (#2 ) from push-zrlrptrsrlkk into main Reviewed-on: #2	2026-05-22 07:36:11 +00:00
Eric Coissac	0f8f61d3dd	feat: introduce genome metadata tracking and CSV export This commit replaces raw string genome labels with a structured `GenomeInfo` type for better metadata tracking. It adds a `--meta` flag to the index command, and implements a new `annotate` CLI subcommand to import metadata from CSV files or export it via `--dump`. Distance and shared-count matrices are now serialized to CSV, with UPGMA clustering trees exported as Newick files. Query outputs now include per-genome k-mer match counts in JSON, while fixing syntax and variable naming issues in index merging and dump generation.	2026-05-22 09:35:20 +02:00
coissac	77a0186fae	Merge pull request 'Push qkpyqurltlpk' (#1 ) from push-qkpyqurltlpk into main Reviewed-on: #1	2026-05-21 16:57:19 +00:00
Eric Coissac	13599dd444	feat: Implement query subcommand for sequence-to-genome mapping This change introduces the `query` CLI command and its supporting infrastructure for sequence-to-genome mapping and k-mer matching. It adds a `QueryLayer` abstraction backed by MPHF and persistent matrices, exposes the index partition for direct querying, and implements `Hash`/`Eq` for `RoutableSuperKmer`. The command ingests sequence batches, deduplicates superkmers, routes them to index partitions for parallel exact or 1-mismatch matching, and outputs results as FASTA records annotated with JSON metadata. Includes `serde_json` dependency addition, module exports, and documentation updates.	2026-05-21 18:56:41 +02:00
Eric Coissac	c8e591fc78	feat: add superkmer CLI setup and partition bit handling This commit introduces CLI argument parsing for the `superkmer` command via a new `SuperkmerArgs` struct. It also adds a `partitions_to_bits` utility to compute the minimum bit width for partition encoding, enforcing a 1-bit floor. Finally, the index configuration automatically rounds the partition count up to the nearest power of two to ensure compatibility with bitmask-based indexing operations.	2026-05-21 15:10:48 +02:00
Eric Coissac	d9aa211b8f	feat: add k-mer index rebuild and compaction feature This commit introduces a new `rebuild` CLI subcommand that reconstructs an existing multi-layer k-mer index into a compact, single-layer index. It implements a configurable filtering pipeline supporting min/max genome fraction/count and total count thresholds, parallel partition processing via `rayon`, and CLI progress tracking. The change also restructures module declarations across `obikindex` and `obikpartitionner` to integrate the new rebuild and layer-handling logic.	2026-05-21 15:08:19 +02:00
Eric Coissac	3fa1dbf8cc	feat: add pairwise distance computation and phylogenetic trees This commit introduces a new `distance` CLI subcommand that computes pairwise genomic distance matrices using configurable metrics (Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger). It optionally generates phylogenetic trees (NJ or UPGMA) in Newick format and outputs results as CSV. The implementation adds a robust distance computation backend that dynamically routes to optimized backends based on index configuration, supports parallel iteration, and gracefully handles missing data. Additionally, it adds a `dump` task for exporting k-mer to genome mappings as CSV, introduces an `InvalidInput` error variant, updates dependencies to support numerical operations and tree construction, and performs minor module reorganizations.	2026-05-21 15:03:08 +02:00
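
The parameter rules from commit 0d9be53d1f and the `s = k - z + 1` s-mer size from commits 86b88acb95 and 8a0b898b4b can be sketched as below. This is a minimal illustration under stated assumptions, not the obikmer implementation: `ParamError`, `effective_smer_size`, and `findere_filter` are hypothetical names, the lower bound `z >= 1` is an assumption (the log only states that `z` is strictly less than `k`), and the filter assumes a k-mer is reported only when all `z` consecutive s-mers it covers were found.

```rust
/// Hypothetical error type for this sketch; not taken from the obikmer sources.
#[derive(Debug, PartialEq)]
enum ParamError {
    KmerSize(usize),
    MinimizerSize(usize),
    Z(usize),
}

/// Check the constraints described in the commit log and return the
/// effective s-mer size `s = k - z + 1`.
fn effective_smer_size(k: usize, m: usize, z: usize) -> Result<usize, ParamError> {
    // k must be odd and within 11..=31.
    if k % 2 == 0 || !(11..=31).contains(&k) {
        return Err(ParamError::KmerSize(k));
    }
    // m must be odd and within 3..=k-1.
    if m % 2 == 0 || m < 3 || m > k - 1 {
        return Err(ParamError::MinimizerSize(m));
    }
    // z must be strictly less than k; z >= 1 is an assumption of this sketch.
    if z == 0 || z >= k {
        return Err(ParamError::Z(z));
    }
    Ok(k - z + 1)
}

/// Findere-style z-window filter: entry `i` of the result is true only if
/// the `z` consecutive s-mer hits starting at position `i` are all true.
fn findere_filter(smer_hits: &[bool], z: usize) -> Vec<bool> {
    if z == 0 {
        return Vec::new(); // `windows(0)` would panic
    }
    smer_hits
        .windows(z)
        .map(|w| w.iter().all(|&hit| hit))
        .collect()
}

fn main() {
    let s = effective_smer_size(31, 11, 3).expect("valid parameters");
    println!("s = {s}");

    // Six s-mer lookups with one miss at position 3.
    let hits = [true, true, true, false, true, true];
    println!("{:?}", findere_filter(&hits, 3));
}
```

With `k = 31` and `z = 3` the sketch works on 29-mers, and a single missed s-mer rejects every k-mer window that overlaps it, which is how the z-window lowers the false-positive rate of an approximate index.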


130 Commits