20 Commits

Author SHA1 Message Date
Eric Coissac cfadf63bbc refactor: migrate pipeline to NucPage-based stream processing
Replace the existing chunk and Rope-based processing pipeline with a fixed-size NucPage architecture. Introduce a new nucstream module featuring buffer-pooled, in-place parsing that auto-detects and decompresses FASTA/FASTQ/GenBank inputs into normalized ACGT streams with k-mer overlap preservation. Update obikmer scatter and superkmer stages to consume NucPage iterators and cursor-based navigation, eliminating std::io::Read dependencies and optimizing memory management. Add a configurable max_open_files CLI argument and update implementation documentation to reflect the new record vs. stream reading paths.
2026-05-29 09:10:25 +02:00
Eric Coissac a4b57a96de feat: add streaming sequence reader and superkmer iterator
Introduce the `obiread` crate with a streaming byte normalizer that processes FASTA, FASTQ, and GenBank files using a 64 KiB ring buffer for O(1) memory usage. Integrate this crate into `obiskbuilder` to provide `SuperKmerStreamIter`, enabling memory-efficient superkmer traversal with rolling entropy and minimizer-based cut conditions.
2026-05-29 09:10:25 +02:00
Eric Coissac 26ab165807 refactor: add rolling buffer methods and document label constraints
Added `is_empty()`, `clear()`, and `iter()` methods to the rolling statistics buffer to enable standard traversal and state reset operations. Documented genome label constraints, specifying forbidden characters, empty label rejection, space quoting requirements, and auto-derived label bypass rules. Additionally, updated doc comments and added `#[allow(dead_code)]` attributes for `kmer_offset` and `n_kmers` fields to suppress compiler warnings while reserving them for future `--detail` coverage vector logic.
2026-05-26 15:40:23 +02:00
Eric Coissac c20a1ed465 perf: optimize k-mer pipeline with compile-time tables
This commit shifts entropy and lookup table generation to compile time via a new build script, eliminating runtime overhead. It replaces heap-allocated queues in rolling statistics with a stack-allocated, const-generic ring buffer for cache-friendly operations, and implements `size_hint` on `SuperKmerIter` for efficient iterator consumption. Additionally, it establishes the baseline profile configuration and sets global k-mer parameters.
2026-05-20 15:54:20 +02:00
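A stack-allocated, const-generic ring buffer of the kind this commit describes can be sketched as follows. The type name `RingBuffer`, its fields, and its method signatures are illustrative assumptions, not the code from the repository.

```rust
/// Fixed-capacity ring buffer stored inline (no heap allocation).
/// Hypothetical sketch: names and layout are assumptions.
struct RingBuffer<T: Copy + Default, const N: usize> {
    data: [T; N],
    head: usize, // index of the oldest element
    len: usize,  // number of valid elements
}

impl<T: Copy + Default, const N: usize> RingBuffer<T, N> {
    fn new() -> Self {
        assert!(N > 0, "capacity must be non-zero");
        Self { data: [T::default(); N], head: 0, len: 0 }
    }

    fn len(&self) -> usize {
        self.len
    }

    fn is_empty(&self) -> bool {
        self.len == 0
    }

    fn clear(&mut self) {
        self.head = 0;
        self.len = 0;
    }

    /// Append a value; when the buffer is full, overwrite and return the oldest one.
    fn push(&mut self, value: T) -> Option<T> {
        if self.len < N {
            self.data[(self.head + self.len) % N] = value;
            self.len += 1;
            None
        } else {
            let evicted = self.data[self.head];
            self.data[self.head] = value;
            self.head = (self.head + 1) % N;
            Some(evicted)
        }
    }

    /// Iterate from oldest to newest.
    fn iter(&self) -> impl Iterator<Item = T> + '_ {
        (0..self.len).map(move |i| self.data[(self.head + i) % N])
    }
}
```

Because the capacity is a const generic, the storage is a plain array inside the struct, so a rolling window of known size needs no allocation when it is created or updated.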
Eric Coissac 5169f65dc9 feat: implement persistent layered index and chunked binary format
Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (256) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.
2026-05-09 17:38:29 +08:00
Eric Coissac 8c17bf958b refactor: centralize k-mer config and introduce packed sequences
Centralize k-mer and minimizer configuration using a thread-safe global module, and replace manual bit-packing with a memory-efficient `PackedSeq` type. Refactor core sequence and k-mer types to use compile-time length enforcement and centralized hashing. Introduce a new De Bruijn graph implementation with compact node encoding and traversal iterators. Update I/O, partitioning, and builder modules to align with the new architecture, and add the `xxhash-rust` dependency.
2026-05-08 06:34:24 +08:00
Eric Coissac defeeb9460 feat: enforce canonical k-mer representation throughout the codebase
Refactor core types to consistently use `CanonicalKmer` (lexicographically minimal of k-mer and its reverse complement) as the canonical representation, ensuring deterministic behavior in graph traversal (unitig decomposition), neighbor resolution (`unique_neighbor` with `[CanonicalKmer; 4]` input) and scatter output generation. Introduce `RoutableSuperKmer`, add `.seq_hash()` support, fix type syntax errors in unitig extraction methods and deduplication tests. Update all k-mer construction to use canonical-aware APIs, including unsafe unchecked constructors for performance-critical paths.
2026-05-02 16:31:08 +02:00
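The canonical form named in this commit (the lexicographically smaller of a k-mer and its reverse complement) can be illustrated with a minimal sketch. It assumes a 2-bit encoding A=0, C=1, G=2, T=3 packed into a `u64` with the first base in the most significant position, so that integer order matches lexicographic order. The function names and signatures are hypothetical and do not reproduce the repository's `CanonicalKmer` API.

```rust
/// Encode one nucleotide on 2 bits (assumed encoding: A=0, C=1, G=2, T=3,
/// so that the complement of x is 3 - x).
fn encode_nuc(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a k-mer (1 <= k <= 32) into the low 2*k bits of a u64,
/// first base in the most significant position.
fn pack(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 32 {
        return None;
    }
    seq.iter()
        .try_fold(0u64, |acc, &b| Some((acc << 2) | encode_nuc(b)?))
}

/// Reverse complement of a packed k-mer of length k.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut fwd = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // Take the last base, complement it, append it to the result.
        rc = (rc << 2) | (3 - (fwd & 3));
        fwd >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the k-mer and its reverse complement.
fn canonical(kmer: u64, k: usize) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

With this convention a k-mer and its reverse complement always map to the same value, which is what makes strand-independent lookups and deterministic traversal possible.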
Eric Coissac 27f5e88a7b refactor: implement RoutableSuperKmer and update k-mer indexing pipeline
Replace raw SuperKmer routing with a new RoutableSuperKmer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline to yield RoutableSuperKmers throughout (builder, scatterer), refactor FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst), revise SuperKmer header to store n_kmers instead of seql (avoiding 256-byte wrap), and update documentation to reflect minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.
2026-05-01 09:33:26 +02:00
Eric Coissac 4c19882f03 add PhantomData import for generic type safety
- Added `use std::marker::PhantomData;` to prepare for generic scheduler implementations
- Ensures type safety and avoids unused lifetime/type parameter warnings
2026-04-30 07:04:03 +02:00
Eric Coissac 58391886a3 🔧 Replace degenerate minimizer logic with hash-based random ordering
- Add `hash` field to MmerItem for stable, randomized minimizer ordering
- Introduce hash_mmer() using mix64 with XOR seed to avoid fixed points (e.g., poly-A/T)
- Remove is_degenerate() and minimizer_worse(), simplifying comparison to hash-only
- Update push logic: compare hashes instead of canonical values with degeneracy checks
2026-04-27 20:19:43 +02:00
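The idea behind this change can be sketched as follows. The commit does not show its `mix64`, so the SplitMix64 finalizer stands in for it here; the seed constant, the field layout of `MmerItem`, and the helper `better` are likewise assumptions for illustration. With a 2-bit encoding where A=0, poly-A packs to 0, and this finalizer maps 0 to 0, which is the kind of fixed point the XOR seed removes.

```rust
/// Arbitrary non-zero seed (assumption; the value used in the repository is not shown).
const SEED: u64 = 0x9E37_79B9_7F4A_7C15;

/// SplitMix64 finalizer, used here as a stand-in for the commit's `mix64`.
/// It is a bijection on u64 and maps 0 to 0.
fn mix64(mut z: u64) -> u64 {
    z ^= z >> 30;
    z = z.wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z ^= z >> 27;
    z = z.wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hash a canonical m-mer; XOR-ing the seed first moves 0 away from the fixed point.
fn hash_mmer(canonical: u64) -> u64 {
    mix64(canonical ^ SEED)
}

/// Hypothetical item kept in the minimizer window.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MmerItem {
    position: usize,
    canonical: u64,
    hash: u64,
}

impl MmerItem {
    fn new(position: usize, canonical: u64) -> Self {
        Self { position, canonical, hash: hash_mmer(canonical) }
    }
}

/// Hash-only comparison: `a` is a better minimizer than `b` if its hash is smaller.
fn better(a: &MmerItem, b: &MmerItem) -> bool {
    a.hash < b.hash
}
```

Ordering by hash rather than by raw value means low-complexity m-mers such as poly-A are no longer systematically the smallest, so no separate degeneracy test is needed.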
Eric Coissac 1f466bf113 Refactor: simplify user authentication flow
- Replaced manual token validation with built-in middleware
- Removed redundant session checks in controllers
2026-04-27 16:55:04 +02:00
Eric Coissac f1c8fc85c9 ⬆️ refactor superkmer to use obipipeline
- Replace manual threading with Pipeline abstraction from `obipipeline`
- Remove crossbeam-channel dependency and format detection logic
- Introduce typed `PipelineData` enum for pipeline stages (RawChunk, NormChunk, Batch)
- Implement shared normalization and extraction steps as `SharedFn`
- Add unsafe Send/Sync impls for PipelineData (Rope ownership is moved, not shared)
- Replace manual reader/worker/output threads with a single Pipeline execution
  - Uses `make_source_fallible!`, shared transform functions, and a sink for output
- Simplify argument handling (remove `--format` flag)
- Update Cargo.toml: remove crossbeam-channel, add obipipeline
2026-04-24 18:17:19 +02:00
Eric Coissac 380b5a6f94 📖 Update super-kmer theory and implementation to prefer non-degenerate m-mers
- Update super-kmer definition in `kmers.md` to specify that non-degenerate m-mers are preferred over degenerate ones (degeneracy = homopolymer).
- Refactor `superkmer.rs`: change `.canonical()` to mutate in-place and return bool.
- Add `m` field & canonical-aware minimizer position calculation to SuperKmerIter in obiskbuilder.
- Add helper functions `is_degenerate` and minimizer comparison logic to rolling_stat.rs for consistent tie-breaking.
- Minor formatting cleanup in superkmer command and chunk processing.
2026-04-20 17:50:09 +02:00
Eric Coissac b534c693ac 🔧 refactor(iter): simplify minimizer access via new canonical_minimizer_raw()
- Replace `canonical_minimizer().map(|k| k.raw())` with direct call to new helper method
- Add `canonical_minimizer_raw()` in RollingStat for cleaner access of raw minimizer value
2026-04-20 16:57:56 +02:00
Eric Coissac 5e77ea4eba 🗑️ Refactor entropy and minimizer logic into RollingStat
- Remove `entropy.rs`, `minimizer.rs` and `window.rs`; consolidate logic into new module
- Introduce unified state management in RollingStat with incremental entropy tracking and canonical minimizer computation via monotone deque
- Update SuperKmerIter to use RollingStat instead of separate components, simplifying iteration and state transitions
- Add `*.fasta` to .gitignore for generated FASTA outputs
2026-04-20 16:45:57 +02:00
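The monotone deque mentioned in this commit is the standard way to track a sliding-window minimum in amortized O(1) per element. The sketch below shows the technique on plain `u64` values; the type `WindowMin` and its interface are assumptions and are not taken from `RollingStat`.

```rust
use std::collections::VecDeque;

/// Sliding-window minimum over the last `window` pushed values.
/// Hypothetical sketch of the monotone-deque technique.
struct WindowMin {
    window: usize,
    next_pos: usize,
    /// (position, value) pairs; values strictly increase from front to back.
    deque: VecDeque<(usize, u64)>,
}

impl WindowMin {
    fn new(window: usize) -> Self {
        assert!(window > 0, "window must be non-zero");
        Self { window, next_pos: 0, deque: VecDeque::new() }
    }

    /// Push a value. Once `window` values have been seen, returns the
    /// (position, value) of the minimum in the current window.
    fn push(&mut self, value: u64) -> Option<(usize, u64)> {
        let pos = self.next_pos;
        self.next_pos += 1;

        // Entries at the back that are not smaller than the new value
        // can never be the minimum again (ties keep the rightmost).
        while self.deque.back().map_or(false, |&(_, v)| v >= value) {
            self.deque.pop_back();
        }
        self.deque.push_back((pos, value));

        // Drop entries that have slid out of the window.
        while self.deque.front().map_or(false, |&(p, _)| p + self.window <= pos) {
            self.deque.pop_front();
        }

        if pos + 1 >= self.window {
            self.deque.front().copied()
        } else {
            None
        }
    }
}
```

In a minimizer setting the pushed values would be the ordering keys of successive m-mers, and the window would hold k - m + 1 of them; those details are not specified by the commit message and are left out here.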
Eric Coissac b4accf1149 [obiskbuilder] Add canonical k-mer tables and refactor entropy computation
Introduce static precomputed lists of canonical k-mers (K1–K6) via build_canonical_list and expose them through a canonical_kmers() helper. Update RollingStat to accept entropy_max_k parameter, remove obsolete shift_left field and fix minimizer window condition. Refactor normalized_entropy() to use entropy_max_k instead of hardcoded 1..=6, and optimize count-based loop in compute_entropy() to iterate only over canonical indices.
2026-04-20 15:56:41 +02:00
Eric Coissac f09b70b209 🔧 Fix rolling k-mer and minimizer logic
Fix incorrect nucleotide encoding in `rolling_k` update, correct shift amount for reverse complement k-mer (`self.k - 1`, not `k`), and rename method to match semantics. Also add proper windowed minimizer cleanup when received length exceeds k.
2026-04-20 15:43:50 +02:00
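The reverse-complement shift fixed in this commit can be illustrated with a minimal rolling update. The sketch assumes a 2-bit encoding (A=0, C=1, G=2, T=3) and expresses the shift in bits, so the complemented base enters at bit offset 2 * (k - 1); the struct `RollingKmer` and its fields are hypothetical and do not reproduce the repository's code.

```rust
/// Rolling forward and reverse-complement k-mers, 2 bits per base.
/// Hypothetical sketch; names are assumptions.
struct RollingKmer {
    k: usize,
    mask: u64,
    fwd: u64,
    rev: u64,
    seen: usize,
}

impl RollingKmer {
    fn new(k: usize) -> Self {
        assert!((1..=32).contains(&k), "k must be in 1..=32");
        let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
        Self { k, mask, fwd: 0, rev: 0, seen: 0 }
    }

    /// Push one 2-bit nucleotide code (0..=3). Returns (forward, reverse
    /// complement) once at least k bases have been pushed.
    fn push(&mut self, code: u64) -> Option<(u64, u64)> {
        debug_assert!(code < 4);
        // Forward strand: the new base enters at the low end.
        self.fwd = ((self.fwd << 2) | code) & self.mask;
        // Reverse strand: the complemented base enters at the high end,
        // i.e. at base position k - 1 (not k), hence 2 * (k - 1) bits.
        self.rev = (self.rev >> 2) | ((3 - code) << (2 * (self.k - 1)));
        self.seen += 1;
        (self.seen >= self.k).then(|| (self.fwd, self.rev))
    }
}
```

Shifting by k positions instead of k - 1 would place the new base just outside the k-mer, which is the off-by-one this commit describes.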
Eric Coissac ae5e1152b9 (feat) Add entropy-based filtering and rolling statistics for k-mers
- Introduce lazy_static dependency
- Refactor encoding: rename encode_base → encode_nuc and make it pub(crate)
- Add from_raw_right/raw_right methods to Kmer for right-aligned handling
- Improve error message formatting and code readability in kmod.rs tests
- Replace inline entropy computation with precomputed tables (entropy_table module), using LazyLock for static lookup arrays
- Simplify EntropyFilter by removing redundant tables and delegating to new entropy_table API
- Add RollingStat module for real-time kmer statistics and minimizer tracking
- Reorganize modules: move iter, encoding to pub(crate), add entropy_table and rolling_stat
- Update imports across obiskbuilder crate accordingly
2026-04-20 15:36:02 +02:00
Eric Coissac 0dcb5dd6c2 ♻️ refactor rope implementation to use obikrope
- rename `obirope` → `obikrope`
- replace legacy rope with new in-place, Cell-based implementation
  - add ForwardCursor/BackwardCursor & SeekMode support (no more BytesMut)
- update all dependents:
  - obiread: switch to Rope + cursors, remove tape.rs
    • chunk iterator yields `Rope` instead of Vec<Bytes>
  - obiskbuilder: use ForwardCursor over Rope
- remove bytes dependency from affected crates
2026-04-19 21:23:10 +02:00
Eric Coissac de3f9b16cf first implementation, but far from optimal
2026-04-19 12:17:16 +02:00