29 Commits

Author SHA1 Message Date
Eric Coissac 17c9e076bd refactor: extract obikindex crate and remove deprecated CLI commands
Extracted core indexing logic, state tracking, and metadata management into a new `obikindex` crate. Refactored the `index` and `unitig` commands to leverage the `KmerIndex` abstraction and state-driven pipeline transitions. Removed obsolete CLI subcommands (`count`, `fasta`, `longtig`, `partition`) and their associated pipeline steps. Updated FASTA writing utilities for single-line output and deterministic identifiers, and refreshed workspace dependencies.
2026-05-20 18:54:12 +02:00
Eric Coissac e66c4d81ef feat(obikmer): add index subcommand for kmer counting pipeline
Introduce the `index` CLI subcommand, implementing a resumable, multi-stage pipeline to partition, dereplicate, and count kmers from input sequences. The command builds a layered de Bruijn graph index per partition, applies optional abundance filtering, and persists unitigs alongside an MPHF-based count matrix. Update `Cargo.toml` and `Cargo.lock` to include new dependencies (`epserde`, `ptr_hash`, `cacheline-ef`, `obicompactvec`, `obilayeredmap`) required for the index builder, and refresh the profiling data files.
2026-05-20 18:25:12 +02:00
Eric Coissac 9a1c0c0ee0 Add CLI progress bars and throughput metrics to partitioning
Add `indicatif` v0.17 to `obikmer` and `obikpartitionner` to instrument CLI workflows with real-time progress tracking. The changes integrate progress spinners and bars into the batch processing and parallel kmer counting loops, displaying processed base pairs, throughput rates, and elapsed time. Updates occur every 0.1s to enhance observability without modifying core partitioning logic.
2026-05-20 15:46:52 +02:00
Eric Coissac 8c16b79983 feat(obikmer): add obisys profiling to partition pipeline
Added obisys as a local dependency and integrated its Reporter and Stage instrumentation into the partition command. Each major phase (scatter, dereplicate, and kmer-counting) is now wrapped in timing blocks, with aggregated execution times printed to stdout upon completion.
2026-05-19 11:40:20 +02:00
Eric Coissac 4736a7b6de refactor: restructure k-mer partitioning pipeline for memory efficiency
Replace in-memory hashing with a disk-backed external merge sort and `PersistentCompactIntVec` to drastically reduce peak RAM. Unify both phases using a custom `PtrHash` MPHF, eliminating `GOFunction` and `boomphf`. Introduce a concrete three-step `count_partition()` pipeline with adaptive chunk sizing based on available system memory. Update dependencies to `memmap2`, `ptr_hash`, and `obicompactvec`. Additionally, document strict genomics-only memory constraints and enforce an architectural feedback workflow requiring explicit user authorization before structural changes.
2026-05-17 16:08:47 +08:00
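The external merge sort mentioned above can be illustrated with a small sketch. It is not the repository's `count_partition()` code: the runs stay in memory instead of on disk, k-mers are plain `u64` values, and the names `sorted_runs` and `merge_and_count` are hypothetical.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Sort phase: split the input into sorted runs of at most `chunk` items.
/// A real external sort would spill each run to disk; here they stay in memory.
pub fn sorted_runs(input: &[u64], chunk: usize) -> Vec<Vec<u64>> {
    input
        .chunks(chunk.max(1))
        .map(|c| {
            let mut run = c.to_vec();
            run.sort_unstable();
            run
        })
        .collect()
}

/// Merge phase: k-way merge of the sorted runs, counting duplicates on the fly.
/// Returns (value, count) pairs in increasing order of value.
pub fn merge_and_count(runs: &[Vec<u64>]) -> Vec<(u64, u32)> {
    // Heap entries are (next value, run index, position in run), smallest first.
    let mut heap = BinaryHeap::new();
    for (r, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, r, 0usize)));
        }
    }
    let mut out: Vec<(u64, u32)> = Vec::new();
    while let Some(Reverse((value, r, pos))) = heap.pop() {
        match out.last_mut() {
            Some(entry) if entry.0 == value => entry.1 += 1,
            _ => out.push((value, 1)),
        }
        // Refill the heap from the run we just consumed.
        if let Some(&next) = runs[r].get(pos + 1) {
            heap.push(Reverse((next, r, pos + 1)));
        }
    }
    out
}
```

Only one item per run sits in the heap at any time, so peak memory during the merge depends on the number of runs and not on the total number of k-mers.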
Eric Coissac 13e69e23c9 feat: introduce trait-based distance aggregation and layered store
Introduces ColumnWeights, CountPartials, and BitPartials traits to compute and finalize partial distance matrices. Implements these traits for PersistentBitMatrix, PersistentCompactIntMatrix, and a new LayeredStore<S> wrapper that aggregates metrics across layers via parallel reduction. Adds ndarray for numerical aggregation and updates architecture documentation to reflect the trait-driven design and pending refactoring roadmap.
2026-05-16 14:41:49 +08:00
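The idea of additive partials that are summed across layers and finalized once can be sketched for the Jaccard distance. This is an illustration only: `JaccardPartial` and `layered_jaccard` are hypothetical names, the fold is sequential, and the signatures of the real `BitPartials` trait are not shown in this log.

```rust
/// Additive partial for Jaccard: the two counts can be summed across layers,
/// whereas the final ratio cannot.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct JaccardPartial {
    /// Rows present in both columns.
    pub both: u64,
    /// Rows present in at least one column.
    pub either: u64,
}

impl JaccardPartial {
    /// Partial computed from one layer's pair of presence/absence columns.
    pub fn from_layer(a: &[bool], b: &[bool]) -> Self {
        let mut p = Self::default();
        for (&x, &y) in a.iter().zip(b) {
            p.both += u64::from(x && y);
            p.either += u64::from(x || y);
        }
        p
    }

    /// Combine two partials. The operation is associative, so it could also
    /// serve as the combiner of a parallel reduction.
    pub fn merge(self, other: Self) -> Self {
        Self {
            both: self.both + other.both,
            either: self.either + other.either,
        }
    }

    /// Turn the summed partial into the final distance.
    pub fn finalize(self) -> f64 {
        if self.either == 0 {
            0.0
        } else {
            1.0 - self.both as f64 / self.either as f64
        }
    }
}

/// Aggregate over layers; each layer holds the two columns for its own k-mers.
pub fn layered_jaccard(layers: &[(Vec<bool>, Vec<bool>)]) -> f64 {
    layers
        .iter()
        .map(|(a, b)| JaccardPartial::from_layer(a, b))
        .fold(JaccardPartial::default(), JaccardPartial::merge)
        .finalize()
}
```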
Eric Coissac 8bee9f3017 feat: add parallel distance matrix computation for bit and int matrices
Introduce parallel distance matrix generation using `ndarray` and `rayon` for both `BitMatrix` and `IntMatrix`. Adds full and additive-partial variants for Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger metrics. Includes comprehensive unit tests verifying matrix symmetry, zero diagonals, and numerical correctness against pairwise calculations.
2026-05-15 17:23:12 +08:00
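Two of the metrics listed above, and the symmetric matrix fill that the tests check, might look as follows. This is a sequential, standard-library sketch: the commit uses `ndarray` and `rayon`, and the function names here are hypothetical.

```rust
/// Hamming distance: number of rows where two presence/absence columns differ.
pub fn hamming(a: &[bool], b: &[bool]) -> f64 {
    a.iter().zip(b).filter(|(x, y)| x != y).count() as f64
}

/// Bray-Curtis dissimilarity between two count columns.
pub fn bray_curtis(a: &[u32], b: &[u32]) -> f64 {
    let (mut diff, mut total) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        diff += u64::from(x.abs_diff(y));
        total += u64::from(x) + u64::from(y);
    }
    if total == 0 {
        0.0
    } else {
        diff as f64 / total as f64
    }
}

/// Fill a symmetric n x n matrix from a pairwise distance function.
/// The diagonal stays at zero and each pair is computed once.
pub fn distance_matrix<T>(
    cols: &[Vec<T>],
    dist: impl Fn(&[T], &[T]) -> f64,
) -> Vec<Vec<f64>> {
    let n = cols.len();
    let mut m = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in (i + 1)..n {
            let d = dist(cols[i].as_slice(), cols[j].as_slice());
            m[i][j] = d;
            m[j][i] = d;
        }
    }
    m
}
```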
Eric Coissac f48f7500cd refactor(obilayeredmap): support generic payload types
Replace the hardcoded `Counts` module with a generic `LayerData` trait, parameterizing `Layer` and `LayeredMap` over arbitrary payload types. This decouples read-path access from build-path logic, enabling both set membership and count-based indexing via `PersistentCompactIntVec`. Adds the `obicompactvec` dependency, implements streaming layer construction, and expands test coverage for persistence and multi-layer resolution.
2026-05-14 09:33:18 +08:00
Eric Coissac f2de79acde Add persistent compact integer vector and cache-line-optimized MPHF
Introduce the `obicompactvec` crate, featuring a two-tier, memory-mapped integer vector that uses a primary `u8` array with a sentinel for overflow dispatch and a sparse L1-resident index for fast random access. Implement builder and reader modules with zero-copy serialization and comprehensive test coverage. Update `obilayeredmap` to replace the default hash function with a cache-line-optimized `Mphf`, adding explicit bounds checking and duplicate-slot detection. Add documentation for both modules and update project configuration files accordingly.
2026-05-13 10:09:46 +08:00
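The two-tier layout described above (a `u8` primary array with a sentinel that redirects to an overflow store) can be sketched in memory. This is an assumption-laden illustration, not the `obicompactvec` implementation: `TwoTierVec` is a hypothetical name, nothing is memory-mapped, and a `BTreeMap` stands in for the sparse index.

```rust
use std::collections::BTreeMap;

/// Value stored in the `u8` tier when the real value lives in the overflow tier.
const SENTINEL: u8 = u8::MAX;

/// In-memory sketch of a two-tier integer vector.
#[derive(Debug, Default)]
pub struct TwoTierVec {
    /// One byte per element; small values are stored directly.
    primary: Vec<u8>,
    /// Sparse store for values that do not fit below the sentinel.
    overflow: BTreeMap<usize, u32>,
}

impl TwoTierVec {
    /// Append a value, spilling to the overflow tier when it is >= SENTINEL.
    pub fn push(&mut self, value: u32) {
        let idx = self.primary.len();
        if value < u32::from(SENTINEL) {
            self.primary.push(value as u8);
        } else {
            self.primary.push(SENTINEL);
            self.overflow.insert(idx, value);
        }
    }

    /// Random access: one byte read, plus one map lookup only on overflow.
    pub fn get(&self, idx: usize) -> Option<u32> {
        let small = *self.primary.get(idx)?;
        if small == SENTINEL {
            self.overflow.get(&idx).copied()
        } else {
            Some(u32::from(small))
        }
    }

    /// Number of elements that needed the overflow tier.
    pub fn overflow_len(&self) -> usize {
        self.overflow.len()
    }
}
```

The layout pays off when most counts are small, which is the usual case for k-mer abundances: the common path costs one byte per element.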
Eric Coissac ff75c9198d feat: add kmer iterators and optimize layered map performance
Replace `ph` with `ptr_hash` and introduce `epserde` and `rayon` dependencies. Refactor MPHF construction to leverage parallel iteration, eliminating intermediate `Vec<u64>` allocations and reducing memory footprint. Add a `n_kmers` field to track and serialize total kmer counts, alongside three zero-allocation iterators for efficient chunk traversal. Include comprehensive unit tests for the new iterators and update CLAUDE.md to enforce explicit dependency validation policies.
2026-05-12 22:35:21 +08:00
Eric Coissac 9c41891cc8 feat: add obilayeredmap crate for disk-backed k-mer indexing
Introduces the `obilayeredmap` crate (v0.1.0), implementing an append-only, disk-backed k-mer index using a minimal perfect hash function (MPHF). The module features memory-mapped reads, buffered writes, custom error handling, partition metadata persistence, and comprehensive unit tests. Also adds a reverse complement benchmark for `obikseq` and updates `Cargo.lock` with the new dependencies.
2026-05-12 15:26:39 +08:00
Eric Coissac 5169f65dc9 feat: implement persistent layered index and chunked binary format
Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (256) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.
2026-05-09 17:38:29 +08:00
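The role of a companion index file for O(1) record access can be sketched with an offset table over concatenated records. This is a simplified in-memory model: `IndexedBlob` is a hypothetical name, and the real on-disk layout of the `.idx` file is not described in this log.

```rust
/// Concatenated records plus an offset table (the role played by an index file).
pub struct IndexedBlob {
    /// All records back to back, with no separators.
    data: Vec<u8>,
    /// offsets[i]..offsets[i + 1] is the byte range of record i.
    offsets: Vec<u64>,
}

impl IndexedBlob {
    /// Build the blob and its offset table in one pass over the records.
    pub fn build<'a>(records: impl IntoIterator<Item = &'a [u8]>) -> Self {
        let mut data = Vec::new();
        let mut offsets = vec![0u64];
        for r in records {
            data.extend_from_slice(r);
            offsets.push(data.len() as u64);
        }
        Self { data, offsets }
    }

    /// Number of records.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// O(1) access: two offset reads and a borrowed slice, no allocation.
    pub fn get(&self, i: usize) -> Option<&[u8]> {
        let start = *self.offsets.get(i)? as usize;
        let end = *self.offsets.get(i + 1)? as usize;
        Some(&self.data[start..end])
    }
}
```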
Eric Coissac 8c17bf958b refactor: centralize k-mer config and introduce packed sequences
Centralize k-mer and minimizer configuration using a thread-safe global module, and replace manual bit-packing with a memory-efficient `PackedSeq` type. Refactor core sequence and k-mer types to use compile-time length enforcement and centralized hashing. Introduce a new De Bruijn graph implementation with compact node encoding and traversal iterators. Update I/O, partitioning, and builder modules to align with the new architecture, and add the `xxhash-rust` dependency.
2026-05-08 06:34:24 +08:00
Eric Coissac 27f5e88a7b refactor: implement RoutableSuperKmer and update k-mer indexing pipeline
Replace raw SuperKmer routing with a new RoutableSuperKmer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline to yield RoutableSuperKmers throughout (builder, scatterer), and refactor the FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst). Revise the SuperKmer header to store n_kmers instead of seql (avoiding 256-byte wrap). Update the documentation to reflect minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.
2026-05-01 09:33:26 +02:00
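Routing by a precomputed minimizer can be sketched as below. Everything here is an assumption made for illustration: `RoutableSketch` is a hypothetical stand-in for the real type, sequences are ASCII instead of packed, the minimizer order is lexicographic, and FNV-1a replaces whatever hash the repository uses.

```rust
/// Reverse complement of an ACGT sequence (any other byte becomes `N`).
pub fn revcomp(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            _ => b'N',
        })
        .collect()
}

/// Canonical form: the smaller of a sequence and its reverse complement.
pub fn canonical(seq: &[u8]) -> Vec<u8> {
    let rc = revcomp(seq);
    if rc.as_slice() < seq { rc } else { seq.to_vec() }
}

/// Minimizer: the smallest canonical m-mer of the sequence, if it has one.
pub fn minimizer(seq: &[u8], m: usize) -> Option<Vec<u8>> {
    if m == 0 {
        return None; // `windows(0)` would panic
    }
    seq.windows(m).map(canonical).min()
}

/// FNV-1a, used here only as a stand-in hash function.
pub fn fnv1a(bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(0xcbf29ce484222325u64, |h, &b| {
            (h ^ u64::from(b)).wrapping_mul(0x100000001b3)
        })
}

/// A super-kmer that carries what routing needs, computed once at build time.
pub struct RoutableSketch {
    pub canonical: Vec<u8>,
    pub minimizer: Vec<u8>,
}

impl RoutableSketch {
    pub fn new(seq: &[u8], m: usize) -> Option<Self> {
        let canon = canonical(seq);
        let mini = minimizer(&canon, m)?;
        Some(Self { canonical: canon, minimizer: mini })
    }

    /// Partition index derived from the stored minimizer only.
    pub fn partition(&self, n_partitions: u64) -> u64 {
        assert!(n_partitions > 0, "need at least one partition");
        fnv1a(&self.minimizer) % n_partitions
    }
}
```

Because both the sequence and the minimizer are canonical, a read and its reverse complement end up in the same partition.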
Eric Coissac 4e26e3bd40 Refactor: Simplify user authentication flow
- Remove redundant password validation logic
- Integrate JWT-based session management for improved security and scalability
2026-04-30 07:04:03 +02:00
Eric Coissac ebbfe35cbc Refactor: Extract utility function for string reversal
Extracted `inverser_chaine` into a reusable utility function with docstring and added unit test to ensure correctness.
2026-04-30 06:58:46 +02:00
Eric Coissac e7fa60a3a2 Refactor: Simplify user authentication flow
- Remove redundant validation logic in login handler
- Consolidate session token generation into a single utility function
- Update error handling to use consistent HTTP status codes
2026-04-28 08:38:26 +02:00
Eric Coissac 7efec54b27 .gitignore: ignore zstandard-compressed files
- Add *.zst pattern to .gitignore
- Prevents tracking of zstandard-compressed archives
2026-04-27 16:56:15 +02:00
Eric Coissac eaf893174f ♻️ refactor(obikpartitionner): replace low-level I/O with obiskio::SKFileWriter
- Replace `limits` module and raw binary I/O with a new high-level abstraction using obiskio::SKFileWriter
- Remove `niffler` dependency and compression logic (Gzip/Zstd/Lz4/Bgzf)
- Simplify PartitionManager to manage partitioned file writers based on kmer hashing
  * Uses `n_partition_bits` for bitmask-based partition selection (2^n partitions)
- Add obiskio as a local dependency
Note: This is likely part of aligning with unified I/O primitives in the obiskio crate.
2026-04-26 15:00:12 +02:00
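The bitmask-based partition selection mentioned above can be sketched in a few lines. The function names are hypothetical and the hash is taken as a given `u64`. Only the `n_partition_bits` parameter comes from the commit message.

```rust
/// Pick a partition from a k-mer hash by keeping its lowest
/// `n_partition_bits` bits, which addresses 2^n partitions.
pub fn partition_of(hash: u64, n_partition_bits: u32) -> usize {
    assert!(n_partition_bits < 64, "mask would overflow u64");
    let mask = (1u64 << n_partition_bits) - 1;
    (hash & mask) as usize
}

/// Number of partitions addressed by `n_partition_bits`.
pub fn partition_count(n_partition_bits: u32) -> usize {
    1usize << n_partition_bits
}
```

A mask avoids a modulo and guarantees that the result is always a valid index into a table of `partition_count(n_partition_bits)` writers.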
Eric Coissac c09d17401d + obiskio: add binary I/O with LRU pool and compression
- Add new obiskio crate for high-performance SuperKmer serialization/deserialization
- Implement binary codec with 2-bit packed sequence encoding and raw header format (32 bits)
- Add transparent compression support via niffler: Zstd, Gzip/Bgzf/Lz4
- Implement SKFilePool with LRU-based fd management, max-concurrent-fd limiting (75% of ulimit)
- Add SKFileWriter with batched writes, configurable flush threshold (8 KiB default), and two-phase locking
- Add SKFileReader with sequential access, LRU recovery via reopen_and_seek()
+ New obikpartitionner crate: basic header/seq handling for binary super-kmer format
- Bump niffler from 2.7 to v3, add dependencies: allocator-api2, bitflags(>=1), errno/fastrand/rustix/tempfile/lru/hashbrown/bzip2/thiserror
- Update workspace members to include obikpartitionner and obiskio
2026-04-25 14:15:01 +02:00
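A 2-bit packed sequence encoding of the kind listed above might look like this. The code assignment (A=0, C=1, G=2, T=3), the bit order within a byte, and the function names are assumptions. The actual obiskio codec and its 32-bit header are not reproduced here.

```rust
/// Encode one nucleotide as 2 bits (A=0, C=1, G=2, T=3; assumed mapping).
fn encode_nuc(base: u8) -> Option<u8> {
    match base.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a DNA sequence, four bases per byte, first base in the high bits.
/// Returns `None` if the sequence contains a byte other than A, C, G or T.
pub fn pack_2bit(seq: &[u8]) -> Option<Vec<u8>> {
    let mut out = vec![0u8; (seq.len() + 3) / 4];
    for (i, &base) in seq.iter().enumerate() {
        let code = encode_nuc(base)?;
        let shift = 6 - 2 * (i % 4);
        out[i / 4] |= code << shift;
    }
    Some(out)
}

/// Unpack `len` bases from a buffer produced by `pack_2bit`.
/// The length must be stored separately, since padding bits look like `A`.
pub fn unpack_2bit(packed: &[u8], len: usize) -> Vec<u8> {
    const BASES: [u8; 4] = *b"ACGT";
    (0..len)
        .map(|i| {
            let shift = 6 - 2 * (i % 4);
            BASES[((packed[i / 4] >> shift) & 0b11) as usize]
        })
        .collect()
}
```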
Eric Coissac d4e4289aff (feat): refactor superkmer to use obipipeline with flat transforms
- Replace crossbeam-channel-based threading model
- Introduce obipipeline crate with Stage::Transform/Flat support  
- Replace single input + format detection by multiple inputs via PathIter
- Implement pipeline stages: open_chunks → normalize → build_superkmers (flat) + write_batch
- Add SharedFlatFn for 1→N transformations with delta tracking in scheduler loop
2026-04-24 21:08:09 +02:00
Eric Coissac 75bf980046 (deps) Add regex crate and improve MIME type detection
- Added `regex` dependency to obiread crate
- Replaced manual byte checks with regex-based detection for FASTA/FASTQ formats in mimetype.rs
- Switched from `once_cell::sync::Lazy` to standard library's `std::sync::LazyLock`
- Added generic text/plain fallback detection for ASCII-compatible content
- Updated `MimeTypeGuesser::new` constructor call syntax and simplified API usage of PeekReader's header method
- Implemented `Read` trait for MimeTypeGuesser to allow transparent passthrough reading
2026-04-24 17:16:17 +02:00
Eric Coissac 22951fb0e8 🔖 Add obipipeline parallel pipeline library
- Introduce `obipipeline` crate with multi-threaded data pipeline architecture
- Implement core types: SourceFn, SharedFun (Arc), SinkFN with biased scheduler and crossbeam channels
- Add macros: `make_source!`, `transform!/fallible`/sink!, and high-level DSL macro
- Replace old wrapper/error modules with unified scheduler module (renamed types, improved error variants)

- Update workspace: add `obipipeline` member to Cargo.toml and lockfile  
- Document pipeline in docmd/implementation with full architecture, error handling & example
- Refactor sandbox_pipeline.rs to use new DSL instead of manual channel wiring
2026-04-24 17:10:07 +02:00
Eric Coissac 3f8880a7e5 📦 Add infer and new pipeline infrastructure
- Update Cargo.lock with dependency additions (bumpalo, byteorder, cfb, fnv, infer, js-sys, uuid, wasm-bindgen)
- Refactor obikseq::superkmer: reorder imports and improve formatting
- Add `obipipeline` crate with scheduler, error handling & macros (WIP)
- Replace obiread::expand_paths logic with PathIter and path_iterator module
- Add mimetype detection using `infer` crate via PeekReader wrapper
2026-04-23 21:06:11 +02:00
Eric Coissac 664d0216b5 📦 Add obipipeline crate and refactor path handling
- Introduce new `obipipeline` library with pipeline stages, scheduler and worker pool
- Refactor path expansion in `obiread`: replace old list_of_files with new PathIter iterator
- Add MIME type detection using `infer` crate (fastq/fasta)
- Update dependencies in Cargo.lock: add bumpalo, byteorder, cfb (with deps), fnv,
  infer, js-sys/uuid/wasm-bindgen ecosystem
- Fix formatting and improve tests in SuperKmer (canonical, revcomp)
  * Note: edition = "2024" in obipipeline/Cargo.toml requires Rust 1.85 or later; use 2021 for older toolchains
2026-04-23 21:06:11 +02:00
Eric Coissac ae5e1152b9 (feat) Add entropy-based filtering and rolling statistics for k-mers
- Introduce lazy_static dependency
- Refactor encoding: rename encode_base → encode_nuc and make it pub(crate)
- Add from_raw_right/raw_right methods to Kmer for right-aligned handling
- Improve error message formatting and code readability in kmod.rs tests
- Replace inline entropy computation with precomputed tables (entropy_table module), using LazyLock for static lookup arrays
- Simplify EntropyFilter by removing redundant tables and delegating to new entropy_table API  
- Add RollingStat module for real-time kmer statistics and minimizer tracking
- Reorganize modules: move iter, encoding to pub(crate), add entropy_table and rolling_stat
- Update imports across obiskbuilder crate accordingly
2026-04-20 15:36:02 +02:00
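Replacing an inline entropy computation with a precomputed table can be sketched as follows. This is a guess at the general technique, not the `entropy_table` module itself: it scores single-base composition only, `MAX_LEN` and the function name are hypothetical, and the real filter may work on longer sub-words.

```rust
use std::sync::LazyLock;

/// Longest sequence the table supports (arbitrary choice for this sketch).
const MAX_LEN: usize = 64;

/// Precomputed n * log2(n) for n in 0..=MAX_LEN, with 0 * log2(0) taken as 0.
static N_LOG2_N: LazyLock<[f64; MAX_LEN + 1]> = LazyLock::new(|| {
    let mut table = [0.0; MAX_LEN + 1];
    for (n, slot) in table.iter_mut().enumerate().skip(1) {
        *slot = n as f64 * (n as f64).log2();
    }
    table
});

/// Shannon entropy, in bits, of the base composition of `seq`.
/// Uses H = log2(L) - (1/L) * sum(n_i * log2(n_i)), so the per-base terms
/// come from the table instead of being recomputed for every sequence.
pub fn base_entropy(seq: &[u8]) -> Option<f64> {
    let len = seq.len();
    if len == 0 || len > MAX_LEN {
        return None;
    }
    let mut counts = [0usize; 4];
    for &b in seq {
        let i = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        counts[i] += 1;
    }
    let sum: f64 = counts.iter().map(|&c| N_LOG2_N[c]).sum();
    Some((len as f64).log2() - sum / len as f64)
}
```

A homopolymer scores 0 bits and a sequence with equal base frequencies scores 2 bits, so a low-complexity filter would reject sequences below some threshold in between.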
Eric Coissac 41095a40d0 Refactor: simplify logic and fix edge case
- Replaced redundant conditional checks with a single guard clause
- Added unit test for edge case handling null input
2026-04-19 21:55:48 +02:00
Eric Coissac 0dcb5dd6c2 ♻️ refactor rope implementation to use obikrope
- rename `obirope` → `obikroper`
- replace legacy rope with new in-place, Cell-based implementation
  - add ForwardCursor/BackwardCursor & SeekMode support (no more BytesMut)
- update all dependents:
  - obiread: switch to Rope + cursors, remove tape.rs
    • chunk iterator yields `Rope` instead of Vec<Bytes>
  - obiskbuilder: use ForwardCursor over Rope
- remove bytes dependency from affected crates
2026-04-19 21:23:10 +02:00
Eric Coissac de3f9b16cf first implementation but far from optimal
2026-04-19 12:17:16 +02:00