Push nvyqwlpspwvl #11

Merged
coissac merged 6 commits from push-nvyqwlpspwvl into main 2026-05-29 07:21:58 +00:00
Owner
No description provided.
coissac added 6 commits 2026-05-29 07:21:51 +00:00
Introduces `CommonArgs::validate()` to enforce strict constraints on `--kmer-size` (odd, 11–31), `--minimizer-size` (odd, 3–k−1), and `z` (strictly less than k). This validation is applied at the entry point of the `superkmer` and `index` commands to prevent invalid configurations, avoid palindromes, prevent u64 overflow, and ensure positive effective indexing sizes. Documentation is updated to reflect these runtime checks and immediate termination on invalid input.
Introduce the `obiread` crate with a streaming byte normalizer that processes FASTA, FASTQ, and GenBank files using a 64 KiB ring buffer for O(1) memory usage. Integrate this crate into `obiskbuilder` to provide `SuperKmerStreamIter`, enabling memory-efficient superkmer traversal with rolling entropy and minimizer-based cut conditions.
Replace the existing chunk and Rope-based processing pipeline with a fixed-size NucPage architecture. Introduce a new nucstream module featuring buffer-pooled, in-place parsing that auto-detects and decompresses FASTA/FASTQ/GenBank inputs into normalized ACGT streams with k-mer overlap preservation. Update obikmer scatter and superkmer stages to consume NucPage iterators and cursor-based navigation, eliminating std::io::Read dependencies and optimizing memory management. Add a configurable max_open_files CLI argument and update implementation documentation to reflect the new record vs. stream reading paths.
Introduces a unified NucStream abstraction with NucPageCursor for byte-offset tracking and MIME-type dispatch to instantiate format-specific parsers. Exposes nuc_stream and open_nuc_stream APIs that return boxed, Send-compatible iterators. Additionally, adds a comprehensive test suite covering chunk boundary alignment, FASTA/FASTQ record parsing, sequence normalization, and edge cases such as CRLF line endings, @ in quality strings, and multi-slice rope processing.
Extracts chunk processing into a dedicated function and introduces a QueryData enum with unsafe Send/Sync implementations to safely distribute Rope chunks across worker threads. Replaces nested iteration with a flat iterator and parallel block processing. Adds CLI argument parsing for presence, threshold, and detail flags to configure the pipeline.
Enable approximate k-mer indexing via the `--approx` flag, computing an effective k-mer size of `k - z + 1` and configuring the appropriate indexing mode with validated probabilistic parameters. Refactor the Findere z-window filter in the query command to improve performance and correctness by replacing the precomputed vector with a lazy closure, optimizing cache locality, and fixing a variable naming bug.
coissac merged commit 3138f6382c into main 2026-05-29 07:21:58 +00:00
coissac deleted branch push-nvyqwlpspwvl 2026-05-29 07:21:59 +00:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: OBIKmers/obikmer#11