obikmer

Author	SHA1	Message	Date
Eric Coissac	a4b57a96de	feat: add streaming sequence reader and superkmer iterator Introduce the `obiread` crate with a streaming byte normalizer that processes FASTA, FASTQ, and GenBank files using a 64 KiB ring buffer for O(1) memory usage. Integrate this crate into `obiskbuilder` to provide `SuperKmerStreamIter`, enabling memory-efficient superkmer traversal with rolling entropy and minimizer-based cut conditions.	2026-05-29 09:10:25 +02:00
Eric Coissac	26ab165807	refactor: add rolling buffer methods and document label constraints Added `is_empty()`, `clear()`, and `iter()` methods to the rolling statistics buffer to enable standard traversal and state reset operations. Documented genome label constraints, specifying forbidden characters, empty label rejection, space quoting requirements, and auto-derived label bypass rules. Additionally, updated doc comments and added `#[allow(dead_code)]` attributes for `kmer_offset` and `n_kmers` fields to suppress compiler warnings while reserving them for future `--detail` coverage vector logic.	2026-05-26 15:40:23 +02:00
Eric Coissac	c20a1ed465	perf: optimize k-mer pipeline with compile-time tables This commit shifts entropy and lookup table generation to compile time via a new build script, eliminating runtime overhead. It replaces heap-allocated queues in rolling statistics with a stack-allocated, const-generic ring buffer for cache-friendly operations, and implements `size_hint` on `SuperKmerIter` for efficient iterator consumption. Additionally, it establishes the baseline profile configuration and sets global k-mer parameters.	2026-05-20 15:54:20 +02:00
Eric Coissac	5169f65dc9	feat: implement persistent layered index and chunked binary format Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (256) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.	2026-05-09 17:38:29 +08:00
Eric Coissac	8c17bf958b	refactor: centralize k-mer config and introduce packed sequences Centralize k-mer and minimizer configuration using a thread-safe global module, and replace manual bit-packing with a memory-efficient `PackedSeq` type. Refactor core sequence and k-mer types to use compile-time length enforcement and centralized hashing. Introduce a new De Bruijn graph implementation with compact node encoding and traversal iterators. Update I/O, partitioning, and builder modules to align with the new architecture, and add the `xxhash-rust` dependency.	2026-05-08 06:34:24 +08:00
Eric Coissac	defeeb9460	feat: enforce canonical k-mer representation throughout the codebase Refactor core types to consistently use `CanonicalKmer` (the lexicographically smaller of a k-mer and its reverse complement) as the canonical representation, ensuring deterministic behavior in graph traversal (unitig decomposition), neighbor resolution (`unique_neighbor` with `[CanonicalKmer; 4]` input) and scatter output generation. Introduce `RoutableSuperKmer`, add `.seq_hash()` support, fix type syntax errors in unitig extraction methods and deduplication tests. Update all k-mer construction to use canonical-aware APIs, including unsafe unchecked constructors for performance-critical paths.	2026-05-02 16:31:08 +02:00
Eric Coissac	27f5e88a7b	refactor: implement RoutableSuperKmer and update k-mer indexing pipeline Replace raw SuperKmer routing with a new RoutableSuperKmer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline to yield RoutableSuperKmers throughout (builder, scatterer), refactor FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst), revise SuperKmer header to store n_kmers instead of seql (avoiding 256-byte wrap), and update documentation to reflect minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.	2026-05-01 09:33:26 +02:00
Eric Coissac	4c19882f03	✨ add PhantomData import for generic type safety - Added `use std::marker::PhantomData;` to prepare for generic scheduler implementations - Ensures type safety and avoids unused lifetime/type parameters warnings	2026-04-30 07:04:03 +02:00
Eric Coissac	58391886a3	🔧 Replace degenerate minimizer logic with hash-based random ordering - Add `hash` field to MmerItem for stable, randomized minimizer ordering - Introduce hash_mmer() using mix64 with XOR seed to avoid fixed points (e.g., poly-A/T) - Remove is_degenerate() and minimizer_worse(), simplifying comparison to hash-only - Update push logic: compare hashes instead of canonical values with degeneracy checks	2026-04-27 20:19:43 +02:00
Eric Coissac	1f466bf113	Refactor: simplify user authentication flow - Replaced manual token validation with built-in middleware - Removed redundant session checks in controllers	2026-04-27 16:55:04 +02:00
Eric Coissac	f1c8fc85c9	⬆️ refactor superkmer to use obipipeline - Replace manual threading with Pipeline abstraction from `obipipeline` - Remove crossbeam-channel dependency and format detection logic - Introduce typed `PipelineData` enum for pipeline stages (RawChunk, NormChunk, Batch) - Implement shared normalization and extraction steps as `SharedFn` - Add unsafe Send/Sync impls for PipelineData (Rope ownership is moved, not shared) - Replace manual reader/worker/output threads with a single Pipeline execution - Uses `make_source_fallible!`, shared transform functions, and a sink for output - Simplify argument handling (remove `--format` flag) - Update Cargo.toml: remove crossbeam-channel, add obipipeline	2026-04-24 18:17:19 +02:00
Eric Coissac	380b5a6f94	📖 Update super-kmer theory and implementation to prefer non-degenerate m-mers - Update super-kmer definition in `kmERS.md` to specify that non-degenerate m-mers are preferred over degenerate ones (degeneracy = homopolymer). - Refactor `superkmer.rs`: change `.canonical()` to mutate in-place and return bool. - Add `m` field & canonical-aware minimizer position calculation to SuperKmerIter in obiskbuilder. - Add helper functions `is_degenerate` and minimizer comparison logic to rolling_stat.rs for consistent tie-breaking. - Minor formatting cleanup in superkmer command and chunk processing.	2026-04-20 17:50:09 +02:00
Eric Coissac	b534c693ac	🔧 refactor(iter): simplify minimizer access via new canonical_minimizer_raw() - Replace `canonical_minimizer().map(|k| k.raw())` with direct call to new helper method - Add `canonical_minimizer_raw()` in RollingStat for cleaner access of raw minimizer value	2026-04-20 16:57:56 +02:00
Eric Coissac	5e77ea4eba	🗑️ Refactor entropy and minimizer logic into RollingStat - Remove `entropy.rs`, `minimizer.rs` and `window.rs`; consolidate logic into new module - Introduce unified state management in RollingStat with incremental entropy tracking and canonical minimizer computation via monotone deque - Update SuperKmerIter to use RollingStat instead of separate components, simplifying iteration and state transitions - Add `*.fasta` to .gitignore for generated FASTA outputs	2026-04-20 16:45:57 +02:00
Eric Coissac	b4accf1149	[obiskbuilder] Add canonical k-mer tables and refactor entropy computation Introduce static precomputed lists of canonical k-mers (K1–K6) via build_canonical_list and expose them through a canonical_kmers() helper. Update RollingStat to accept entropy_max_k parameter, remove obsolete shift_left field and fix minimizer window condition. Refactor normalized_entropy() to use entropy_max_k instead of hardcoded 1..=6, and optimize count-based loop in compute_entropy() to iterate only over canonical indices.	2026-04-20 15:56:41 +02:00
Eric Coissac	f09b70b209	🔧 Fix rolling k-mer and minimizer logic Fix incorrect nucleotide encoding in `rolling_k` update, correct shift amount for reverse complement k-mer (`self.k - 1`, not `k`), and rename method to match semantics. Also add proper windowed minimizer cleanup when received length exceeds k.	2026-04-20 15:43:50 +02:00
Eric Coissac	ae5e1152b9	(feat) Add entropy-based filtering and rolling statistics for k-mers - Introduce lazy_static dependency - Refactor encoding: rename encode_base → encode_nuc and make it pub(crate) - Add from_raw_right/raw_right methods to Kmer for right-aligned handling - Improve error message formatting and code readability in kmod.rs tests - Replace inline entropy computation with precomputed tables (entropy_table module), using LazyLock for static lookup arrays - Simplify EntropyFilter by removing redundant tables and delegating to new entropy_table API - Add RollingStat module for real-time k-mer statistics and minimizer tracking - Reorganize modules: move iter, encoding to pub(crate), add entropy_table and rolling_stat - Update imports across obiskbuilder crate accordingly	2026-04-20 15:36:02 +02:00
Eric Coissac	0dcb5dd6c2	♻️ refactor rope implementation to use obikrope - rename `obirope` → `obikroper` - replace legacy rope with new in-place, Cell-based implementation - add ForwardCursor/BackwardCursor & SeekMode support (no more BytesMut) - update all dependents: - obiread: switch to Rope + cursors, remove tape.rs • chunk iterator yields `Rope` instead of Vec<Bytes> - obiskbuilder: use ForwardCursor over Rope - remove bytes dependency from affected crates	2026-04-19 21:23:10 +02:00
Eric Coissac	de3f9b16cf	first implementation, but far from optimal	2026-04-19 12:17:16 +02:00

19 Commits
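
Several of the commits above rely on two ideas: representing a k-mer by the smaller of itself and its reverse complement (defeeb9460), and ordering candidate minimizers by a seeded hash instead of by their raw value (58391886a3). The following is a minimal sketch of both, not the repository's code. The 2-bit encoding (A=0, C=1, G=2, T=3), the `SEED` constant, the use of the SplitMix64 finalizer as `mix64`, and every function name here are assumptions made for illustration.

```rust
/// Hypothetical seed; any non-zero constant moves poly-A (packed value 0)
/// away from the fixed point mix64(0) == 0.
const SEED: u64 = 0x9E37_79B9_7F4A_7C15;

/// Encode one nucleotide on 2 bits (assumed encoding: A=0, C=1, G=2, T=3).
fn encode_nuc(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack up to 32 bases into a u64, first base in the most significant pair.
fn pack(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    seq.iter()
        .try_fold(0u64, |acc, &b| Some((acc << 2) | encode_nuc(b)?))
}

/// Reverse complement of a packed k-mer of length `k`.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut fwd = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // With this encoding, the complement of a base x is 3 - x.
        rc = (rc << 2) | (3 - (fwd & 3));
        fwd >>= 2;
    }
    rc
}

/// Canonical form: the numerically (hence lexicographically) smaller strand.
fn canonical(kmer: u64, k: usize) -> u64 {
    kmer.min(revcomp(kmer, k))
}

/// SplitMix64 finalizer, used here as a stand-in for `mix64`.
fn mix64(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^= x >> 31;
    x
}

/// Seeded hash that defines the ordering between candidate m-mers.
fn hash_mmer(mmer: u64) -> u64 {
    mix64(mmer ^ SEED)
}

/// Canonical m-mer with the smallest hash among all windows of `seq`.
fn minimizer(seq: &[u8], m: usize) -> Option<u64> {
    if m == 0 || m > 32 || seq.len() < m {
        return None;
    }
    let mut best: Option<(u64, u64)> = None; // (hash, canonical m-mer)
    for window in seq.windows(m) {
        let c = canonical(pack(window)?, m);
        let h = hash_mmer(c);
        if best.map_or(true, |(best_hash, _)| h < best_hash) {
            best = Some((h, c));
        }
    }
    best.map(|(_, c)| c)
}
```

In this sketch `mix64(0)` is `0`, so an unseeded hash would always rank a poly-A m-mer first; XOR-ing with a non-zero seed before mixing removes that bias, which is one plausible reading of the "avoid fixed points" remark in the commit message. This version rescans every window and is only meant to show the ordering rule, whereas the commits describe an incremental rolling computation.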