Commit Graph

93 Commits

Author SHA1 Message Date
Eric Coissac 7bc9aa9af5 refactor(packed_seq): unify kmer iterators with generic storage
Merge PackedSeqKmerIter and OwnedPackedSeqKmerIter into a single generic PackedSeqKmerIter<S> parameterized over the storage type. Add an AsRef<PackedSeq> implementation to PackedSeq to enable this abstraction, allowing the zero-allocation sliding-window kmer iterator to seamlessly accept both borrowed and owned sequences without code duplication.
2026-05-11 11:15:37 +08:00
Eric Coissac 6687911d60 Add consuming k-mer iterators to PackedSeq and Superkmer
Introduces `into_kmers()` and `into_canonical_kmers()` consuming methods to `PackedSeq` and `Superkmer`, enabling zero-allocation sliding-window k-mer extraction via bitwise operations. This complements existing borrow-based iterators by allowing direct ownership transfer. Also includes minor documentation updates, whitespace fixes, and new unit tests to verify canonical k-mer iteration counts and output sequences.
2026-05-11 10:28:01 +08:00
Eric Coissac 92cda13ae4 refactor: streamline I/O handling and remove deprecated codec module
Removes the `codec` and `limits` modules, eliminating manual serialization logic and OS file descriptor limit queries. Introduces a shared `append_path_suffix` utility to standardize path manipulation across `meta.rs` and `unitig_index.rs`. Refactors the file pool to dynamically size based on available file descriptors and optimizes descriptor lifecycle to prevent leaks. Enhances `SKFileReader` with LRU eviction, consumption tracking, and seek-on-reopen support. Migrates related round-trip and EOF tests to the reader module and updates chunking logic in the unitig index.
2026-05-09 19:29:06 +08:00
Eric Coissac 9e5af6b5a5 refactor(pool): optimize write batch and LRU cache handling
Replaces the `write_one` loop with threshold-based draining and inlines metadata updates. Explicitly promotes accessed entries to MRU during flush and drain operations to prevent premature LRU eviction. Updates comments to clarify two-phase locking and MRU promotion semantics.
2026-05-09 19:00:31 +08:00
Eric Coissac e008a8beb4 harden obiskio error handling with explicit variants and bounds checking
Extend `SKError` with `BadMagic`, `Truncated`, and `InvalidData` variants to replace `expect()` calls and unsafe slice indexing. Implement proper error chaining via `source()` and update `Display` formatting. Improve magic byte validation and serde error mapping for clearer debugging. Documents a broader roadmap for further crate hardening, including LRU eviction, concurrency, and benchmarking.
2026-05-09 18:30:19 +08:00
Eric Coissac 5169f65dc9 feat: implement persistent layered index and chunked binary format
Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (256) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.
2026-05-09 17:38:29 +08:00
Eric Coissac 8c17bf958b refactor: centralize k-mer config and introduce packed sequences
Centralize k-mer and minimizer configuration using a thread-safe global module, and replace manual bit-packing with a memory-efficient `PackedSeq` type. Refactor core sequence and k-mer types to use compile-time length enforcement and centralized hashing. Introduce a new De Bruijn graph implementation with compact node encoding and traversal iterators. Update I/O, partitioning, and builder modules to align with the new architecture, and add the `xxhash-rust` dependency.
2026-05-08 06:34:24 +08:00
Eric Coissac 602f414957 fix: strip AI reasoning blocks from commit messages
Adds a `_strip_think` function using `awk` to buffer stdin and track the last `</think>` tag, emitting only the subsequent content. This utility is now piped after `aichat` calls to remove AI reasoning blocks before commit message generation. Also applies minor whitespace and indentation adjustments throughout the script.
2026-05-03 17:42:17 +02:00
Eric Coissac 0b784242cf refactor: split traversal logic in start_iter for clearer chain/cycle handling
Refactors `startIter` to separate the traversal into two distinct passes: one for chain starts (nodes without left extension) and another specifically targeting cycle nodes. Also simplifies `nextUnitigKmer` by removing the redundant `_at_start_ parameter and unifying traversal direction logic. Updates `UnitigIter` to manage visited marking internally, improving encapsulation and cycle detection reliability.
2026-05-02 16:31:08 +02:00
Eric Coissac 86e9cb7026 refactor: improve de Bruijn graph traversal and longtig generation
- Refactored Node representation using compact bitfields for neighbor counts
  and nucleotides; added count_neighbors helper to compute_degrees()
- Introduced StartIter iterator for unitig/longtigu generation with revised
  traversal semantics (e.g., interior node marking)
- Added nucleotide() accessor to Kmer type for 2-bit extraction at position i
- Renamed unitig.rs → longtigs, updated CLI command and output filenames to reflect "long t ig"
- Extended extract_kmers() in scripts/compare.py with duplication statistics
```
2026-05-02 16:31:08 +02:00
Eric Coissac 35840b9d73 refactor: simplify unitig iteration logic and API
Refactors the `start_iter` function to separate chain/cycle detection into two passes, consolidates visited marking logic internally in `next_unitig_kmer`, and streamlines the public API by removing redundant methods, updating iteration parameters to `(start_first_next)`, and eliminating external state like `at_start`.
2026-05-02 16:31:08 +02:00
Eric Coissac defeeb9460 feat: enforce canonical k-mer representation throughout the codebase
Refactor core types to consistently use `CanonicalKMer` (lexicographically minimal of k-mer and its reverse complement) as the canonical representation, ensuring deterministic behavior in graph traversal (unitig decomposition), neighbor resolution (`unique_neighbor` with `[CanonicalKmer; 4]` input) and scatter output generation. Introduce `RoutableSuperKmer`, add `.seq_hash()` support, fix type syntax errors in unitig extraction methods and deduplication tests. Update all k-mer construction to use canonical-aware APIs, including unsafe unchecked constructors for performance-critical paths.
2026-05-02 16:31:08 +02:00
Eric Coissac 21ddbf1674 feat: add seq_hash() and refactor canonical hashing
Introduce `. seqhash(&self)` for direct XXH3-64 sequencing of packed bytes, and remove legacy `.hash()` method that used conditional canonicalization via revcomp. Also update partitioning logic to use `sk.hashseq_hash()` and deduplicate imports.
2026-05-01 10:35:57 +02:00
Eric Coissac 27f5e88a7b refactor: implement RoutableSuperKmer and update k-mer indexing pipeline
Replace raw SuperkMer routing with a new RoutableSuperKimer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline to yield RoutableSuperKmers throughout (builder, scatterer), refactor FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst), revise SuperKmer header to store n_kmers instead of seql (avoiding 256-byte wrap), and update documentation to reflect minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.
2026-05-01 09:33:26 +02:00
Eric Coissac 4e26e3bd40 Refactor: Simplify user authentication flow
- Remove redundant password validation logic
 - Integrate JWT-based session management for improved security and scalability
2026-04-30 07:04:03 +02:00
Eric Coissac 97e65bd831 ♻️ refactor pipeline architecture and fix macOS memory detection
- Replace WorkerPool-based pipelines with typed `Pipe` abstraction in obipipeline
  - Introduce Pipe/PipeIter for composable, sourceless/sink-less pipelines
- Update partition and superkmer commands to use new Pipe API via make_pipe!
  - Remove Arc<Mutex<...>> patterns; simplify state management
- Fix macOS available_memory() returning 0 by falling back to half total memory in dereplicate()
- Remove unused `format: "zstd"` field from partition.meta
2026-04-30 07:04:03 +02:00
Eric Coissac 4c19882f03 add PhantomData import for generic type safety
- Added `use std::marker::PhantomData;` to prepare for generic scheduler implementations
- Ensures type safety and avoids unused lifetime/type parameters warnings
2026-04-30 07:04:03 +02:00
Eric Coissac ebbfe35cbc Refactor: Extract utility function for string reversal
Extracted `inverser_chaine` into a reusable utility function with docstring and added unit test to ensure correctness.
2026-04-30 06:58:46 +02:00
Eric Coissac e7fa60a3a2 Refactor: Simplify user authentication flow
- Remove redundant validation logic in login handler
 - Consolidate session token generation into a single utility function  
- Update error handling to use consistent HTTP status codes
2026-04-28 08:38:26 +02:00
Eric Coissac 58391886a3 🔧 Replace degenerate minimizer logic with hash-based random ordering
- Add `hash` field to MmerItem for stable, randomized minimizer ordering
- Introduce hash_mMER() using mix64 with XOR seed to avoid fixed points (e.g., poly-A/T)
- Remove is_degenerate() and minimizer_worse(), simplifying comparison to hash-only
- Update push logic: compare hashes instead of canonical values with degeneracy checks
2026-04-27 20:19:43 +02:00
Eric Coissac 7efec54b27 .gitignore: ignore zstandard-compressed files
- Add *.zst pattern to .gitignore
- Prevents tracking of zstandard-compressed archives
2026-04-27 16:56:15 +02:00
Eric Coissac 1f466bf113 Refactor: simplify user authentication flow
- Replaced manual token validation with built-in middleware
 - Removed redundant session checks in controllers
2026-04-27 16:55:04 +02:00
Eric Coissac eaf893174f ♻️ refactor(obikpartitionner): replace low-level I/O with obiskio::SKFileWriter
- Replace `limits` module and raw binary I/O with a new high-level abstraction using obiskio::SKFileWriter
- Remove `niffler` dependency and compression logic (Gzip/Zstd/Lz4/Bgzf)
- Simplify PartitionManager to manage partitioned file writers based on kmer hashing
  * Uses `n_partition_bits` for bitmask-based partition selection (2^n partitions)
- Add obiskio as a local dependency
Note: This is likely part of aligning with unified I/O primitives in the obiskio crate.
2026-04-26 15:00:12 +02:00
Eric Coissac c09d17401d + obiskio: add binary I/O with LRU pool and compression
- Add new obiskio crate for high-performance SuperKmer serialization/deserialization
- Implement binary codec with 2-bit packed sequence encoding and raw header format (32 bits)
- Add transparent compression support via niffler: Zstd, Gzip/Bgzf/Lz4
- Implement SKFilePool with LRU-based fd management, max-concurrent-fd limiting (75% of ulimit)
- Add SKFileWriter with batched writes, configurable flush threshold (8 KiB default), and two-phase locking
- Add SKFileReader with sequential access, LRU recovery via reopen_and_seek()
+ New obikpartitionner crate: basic header/seq handling for binary super-kmer format
- Bump niffler from 2.7 to v3, add dependencies: allocator-api2, bitflags(>=1), errno/fastrand/rustix/tempfile/lru/hashbrown/bzip2/thiserror
- Update workspace members to include obikpartitionner andobiskio
2026-04-25 14:15:01 +02:00
Eric Coissac d4e4289aff (feat): refactor superkmer to use obipipeline with flat transforms
- Replace crossbeam-channel-based threading model
- Introduce obipipeline crate with Stage::Transform/Flat support  
- Replace single input + format detection by multiple inputs via PathIter
- Implement pipeline stages: open_chunks → normalize → build_superkmers (flat) + write_batch
- Add SharedFlatFn for 1→N transformations with delta tracking in scheduler loop
2026-04-24 21:08:09 +02:00
Eric Coissac f1c8fc85c9 ⬆️ refactor superkmer to use obipipeline
- Replace manual threading with Pipeline abstraction from `obipipline`
- Remove crossbeam-channel dependency and format detection logic
- Introduce typed `PipelineData` enum for pipeline stages (RawChunk, Norm Chunk, Batch)
- Implement shared normalization and extraction steps as `SharedFn`ƒ
  - Add unsafe Send/Sync impls for PipelineData (Rope ownership is moved, not shared)
- Replace manual reader/worker/output threads with a single Pipeline execution
  - Uses `make_source_fallible!`, shared transform functions, and a sink for output
- Simplify argument handling (remove `--format` flag)
  - Update Cargo.toml: remove crossbeam-channel, add obipipeline
2026-04-24 18:17:19 +02:00
Eric Coissac 75bf980046 (deps) Add regex crate and improve MIME type detection
- Added `regex` dependency to obiread crate
- Replaced manual byte checks with regex-based detection for FASTA/FASTQ formats in mimetype.rs
- Switched from `once_cell::sync::Lazy` to standard library's `std:: sync :: LazyLock`
- Added generic text/plain fallback detection for ASCII-compatible content
- Updated `MimeTypeGuesser::new` constructor call syntax and simplified API usage of PeekReader's header method
- Implemented `Read trait for MimeTypeGuesser to allow transparent passthrough reading
2026-04-24 17:16:17 +02:00
Eric Coissac 22951fb0e8 🔖 Add obipipeline parallel pipeline library
- Introduce `obipipline` crate with multi-threaded data pipeline architecture
- Implement core types: SourceFn, SharedFun (Arc), SinkFN with biased scheduler and crossbeam channels
- Add macros: `make_source!`, `transform!/fallible`/sink!, and high-level DSL macro
- Replace old wrapper/error modules with unified scheduler module (renamed types, improved error variants)

- Update workspace: add `obipipeline` member to Cargo.toml and lockfile  
- Document pipeline in docmd/implementation with full architecture, error handling & example
- Refactor sandbox_pipeline.rs to use new DSL instead of manual channel wiring
2026-04-24 17:10:07 +02:00
Eric Coissac 3f8880a7e5 📦 Add infer and new pipeline infrastructure
- Update Cargo.lock with dependency additions (bumpalo, byteorder, cfb, fnv, infer, js-sys, uuid wasm-bindgen)
- Refactor obikseq::superkmer: reorder imports and improve formatting
  - Add `obipipeline` crate with scheduler, error handling & macros (WIP)
- Replace obiread::expand_paths logic with PathIter and path_iterator module
  - Add mimetype detection using `infer` crate via PeekReader wrapper
2026-04-23 21:06:11 +02:00
Eric Coissac 664d0216b5 📦 Add obipipeline crate and refactor path handling
- Introduce new `obipackage` library with pipeline stages, scheduler and worker pool
- Refactor path expansion in `obiread`: replace old list_of_files with new PathIter iterator
- Add MIME type detection using `infer` crate (fastq/fasta)
- Update dependencies in Cargo.lock: add bumpalo, byteorder, cfb (with deps), fnv,
  infer, js-sys/uuid/wasm-bindgen ecosystem
- Fix formatting and improve tests in SuperKmer (canonical, revcomp)
  * Note: edition = "2024" in obipipeline/Cargo.toml is invalid; should be 2021
2026-04-23 21:06:11 +02:00
Eric Coissac 380b5a6f94 📖 Update super-kmer theory and implementation to prefer non-degenerate m-mers
- Update super-kmer definition in `kmERS.md` to specify that non-degenerate m-mers are preferred over degenerate ones (degeneracy = homopolymer).
- Refactor `superkmer.rs`: change `.canonical()` to mutate in-place and return bool.
- Add `m` field & canonical-aware minimizer position calculation to SuperKmerIter in obiskbuilder.
- Add helper functions `is_degenerate` and minimizer comparison logic to rolling_stat.rs for consistent tie-breaking.
- Minor formatting cleanup in superkmer command and chunk processing.
2026-04-20 17:50:09 +02:00
Eric Coissac b534c693ac 🔧 refactor(iter): simplify minimizer access via new canonical_minimizer_raw()
- Replace `canonicalMinimzer().map(|k| k.raw())` with direct call to new helper method
- Add `canonical_minimizer_raw()` in RollingStat for cleaner access of raw minimizer value
2026-04-20 16:57:56 +02:00
Eric Coissac 5e77ea4eba 🗑️ Refactor entropy and minimizer logic into RollingStat
- Remove `entropy.rs`, `minimizer.rs` and `window.rs`; consolidate logic into new module
- Introduce unified state management in RollingStat with incremental entropy tracking and canonical minimizer computation via monotone deque
- Update SuperKmerIter to use RollingStat instead of separate components, simplifying iteration and state transitions
- Add `*.fasta` to .gitignore for generated FASTA outputs
2026-04-20 16:45:57 +02:00
Eric Coissac b4accf1149 [obiskbuilder] Add canonical k-mer tables and refactor entropy computation
Introduce static precomputed lists of canonical k-mers (K1– K6) via build_canonical_list and expose them through a canonical_kmers() helper. Update RollingStat to accept entropy_max_k parameter, remove obsolete shift_left field and fix minimizer window condition. Refactor normalized_entropy() to use entropy_max_k instead of hardcoded 1..=6, and optimize count-based loop in compute_entropy() to iterate only over canonical indices.
2026-04-20 15:56:41 +02:00
Eric Coissac f09b70b209 🔧 Fix rolling k-mer and minimizer logic
Fix incorrect nucleotide encoding in `rolling_k` update, correct shift amount for reverse complement k-mer (`self.k - 1`, not `k`), and rename method to match semantics. Also add proper windowed minimizer cleanup when received length exceeds k.
2026-04-20 15:43:50 +02:00
Eric Coissac ae5e1152b9 (feat) Add entropy-based filtering and rolling statistics for k-mers
- Introduce lazy_static dependency
- Refactor encoding: rename encode_base →encode_nuc and make it pub(crate)
- Add from_raw_right/raw Right methods to Kmer for right-aligned handling
- Improve error message formatting and code readability in kmod.rs tests  
- Replace inline entropy computation with precomputed tables (entropy_table module)—using LazyLock for static lookup arrays
- Simplify EntropyFilter by removing redundant tables and delegating to new entropy_table API  
- Add RollingStat module for real-time kmer statistics and minimizer tracking
- Reorganize modules: move iter, encoding to pub(crate), add entropy_table and rolling_stat
- Update imports across obiskbuilder crate accordingly
2026-04-20 15:36:02 +02:00
Eric Coissac 097f7f0695 🔧 refactor cursor API and normalize chunking logic
- Change `rope_tell()` return type from Option<usize> to usize, always returning cursor's absolute position (offset if unmoved).
- Update all call sites to remove `.unwrap_or(...)` around `rope_tell()`.
- Add new method `<Rope>::truncate(pos)`, replacing `split_off(...).map(|_| ())`.
- Refactor FASTA/FASTQ normalizers to use a single mutable write cursor (`wc`) and document the protocol.
- Simplify `end_segment()` logic: commit segment with 0x00 if length ≥ k, else reset.
- Improve documentation for write-cursor protocol and rope truncation semantics.
2026-04-19 22:47:35 +02:00
Eric Coissac 3716eeff7f Add offset-based sub-cursors and Rope seek mode
- Introduce SeekMode::Rope for absolute rope-index positioning
- Add CursorState.offset field to support local coordinate systems per cursor  
- Implement ForwardCursor.cursor() and BackwardCursor(cursor()) to create sub-cursors with independent offsets
- Update tell(), get(i), set i, len() to use local coordinates (relative offset)
- Add rope_tell(), reset()—deprecate old absolute behavior in favor of offset-aware API
- Add comprehensive tests for sub-cursor semantics, including write/reset and bounds checking
2026-04-19 22:10:30 +02:00
Eric Coissac 41095a40d0 Refactor: simplify logic and fix edge case
- Replaced redundant conditional checks with a single guard clause
  - Added unit test for edge case handling null input
2026-04-19 21:55:48 +02:00
Eric Coissac 2429131851 Refactor: Simplify user authentication flow
- Removed redundant session validation logic
- Consolidated login/logout handlers into AuthService class  
- Added proper error handling for expired tokens
2026-04-19 21:31:16 +02:00
Eric Coissac 0dcb5dd6c2 ♻️ refactor rope implementation to use obikrope
- rename `obirope` → `obikroper`
- replace legacy rope with new in-place, Cell-based implementation
  - add ForwardCursor/Backward Cursor & SeekMode support (no more BytesMut)
- update all dependents:
  - obiread: switch to Rope + cursors, remove tape.rs
    • chunk iterator yields `Rope` instead of Vec<Bytes>
  - obiskbuilder: use ForwardCursor over Rope
- remove bytes dependency from affected crates
2026-04-19 21:23:10 +02:00
Eric Coissac 5fab59f92c Add target to .gitignore 2026-04-19 16:06:06 +02:00
Eric Coissac de3f9b16cf first implementation but far to be optimal 2026-04-19 12:17:16 +02:00