73 Commits

Author SHA1 Message Date
Eric Coissac 17c9e076bd refactor: extract obikindex crate and remove deprecated CLI commands
Extracted core indexing logic, state tracking, and metadata management into a new `obikindex` crate. Refactored the `index` and `unitig` commands to leverage the `KmerIndex` abstraction and state-driven pipeline transitions. Removed obsolete CLI subcommands (`count`, `fasta`, `longtig`, `partition`) and their associated pipeline steps. Updated FASTA writing utilities for single-line output and deterministic identifiers, and refreshed workspace dependencies.
2026-05-20 18:54:12 +02:00
Eric Coissac f8cfb493b8 refactor: extract pipeline stages and centralize partition directory paths
Extracts the scatter, dereplicate/count, and index pipeline stages into a new `steps` module to improve modularity. Centralizes partition directory path construction by introducing a `part_dir()` helper, replacing manual string formatting across multiple command files. Adds `--with-counts` and `--keep-intermediate` CLI flags to the index command and fixes a typo in the `partition_dir` parameter name.
2026-05-20 18:42:09 +02:00
Eric Coissac cc2ed4bd31 feat: Add progress tracking and timing instrumentation to index
Introduces comprehensive progress tracking and timing instrumentation using indicatif and obisys::Reporter/Stage. Adds an EMA-based throughput calculator for the scatter phase and wraps parallel progress bars in Arc<Mutex> for thread-safe concurrent updates across all pipeline stages.
2026-05-20 18:34:28 +02:00
Eric Coissac e66c4d81ef feat(obikmer): add index subcommand for kmer counting pipeline
Introduce the `index` CLI subcommand, implementing a resumable, multi-stage pipeline to partition, dereplicate, and count kmers from input sequences. The command builds a layered de Bruijn graph index per partition, applies optional abundance filtering, and persists unitigs alongside an MPHF-based count matrix. Update `Cargo.toml` and `Cargo.lock` to include new dependencies (`epserde`, `ptr_hash`, `cacheline-ef`, `obicompactvec`, `obilayeredmap`) required for the index builder, and refresh the profiling data files.
2026-05-20 18:25:12 +02:00
Eric Coissac c20a1ed465 perf: optimize k-mer pipeline with compile-time tables
This commit shifts entropy and lookup table generation to compile time via a new build script, eliminating runtime overhead. It replaces heap-allocated queues in rolling statistics with a stack-allocated, const-generic ring buffer for cache-friendly operations, and implements `size_hint` on `SuperKmerIter` for efficient iterator consumption. Additionally, it establishes the baseline profile configuration and sets global k-mer parameters.
2026-05-20 15:54:20 +02:00
Eric Coissac 9a1c0c0ee0 Add CLI progress bars and throughput metrics to partitioning
Add `indicatif` v0.17 to `obikmer` and `obikpartitionner` to instrument CLI workflows with real-time progress tracking. The changes integrate progress spinners and bars into the batch processing and parallel kmer counting loops, displaying processed base pairs, throughput rates, and elapsed time. Updates occur every 0.1s to enhance observability without modifying core partitioning logic.
2026-05-20 15:46:52 +02:00
Eric Coissac b80ab77d66 perf: Switch to sequential PHF construction to avoid thread contention
The outer partition loop already saturates parallelism, making parallel PHF construction redundant and causing Rayon thread pool contention. This change switches to a sequential variant to improve performance. Additionally, explicit error handling is now added for construction failures, while preserving the existing mmap-backed kmer slice.
2026-05-20 12:48:42 +02:00
Eric Coissac 6e2a4c977b fix: Replace unreliable memory pressure check with swap indicator
The previous `major_faults > 10` check is unreliable on macOS, as it counts file-backed mmap page-ins rather than true memory pressure. This change replaces it with `swaps > 0`, a more accurate cross-platform indicator of RAM exhaustion. The swap diagnosis message is also updated to clarify that the working set exceeds available RAM, and comments are added to document this rationale.
2026-05-19 11:41:41 +02:00
Eric Coissac 8c16b79983 feat(obikmer): add obisys profiling to partition pipeline
Added obisys as a local dependency and integrated its Reporter and Stage instrumentation into the partition command. Each major phase (scatter, dereplicate, and kmer-counting) is now wrapped in timing blocks, with aggregated execution times printed to stdout upon completion.
2026-05-19 11:40:20 +02:00
Eric Coissac d0c277d5b6 feat(obisys): Add stage-based performance profiler
Establishes the `obisys` crate using Rust 2024 and the `libc` dependency. Introduces a lightweight profiler that captures wall-clock time and `getrusage` system metrics per pipeline stage. Automatically computes parallelism and efficiency ratios, detects bottlenecks such as memory pressure and disk I/O, and prints a formatted diagnostic summary to stderr.
2026-05-19 11:40:20 +02:00
Eric Coissac 4736a7b6de refactor: restructure k-mer partitioning pipeline for memory efficiency
Replace in-memory hashing with a disk-backed external merge sort and `PersistentCompactIntVec` to drastically reduce peak RAM. Unify both phases using a custom `PtrHash` MPHF, eliminating `GOFunction` and `boomphf`. Introduce a concrete three-step `count_partition()` pipeline with adaptive chunk sizing based on available system memory. Update dependencies to `memmap2`, `ptr_hash`, and `obicompactvec`. Additionally, document strict genomics-only memory constraints and enforce an architectural feedback workflow requiring explicit user authorization before structural changes.
2026-05-17 16:08:47 +08:00
Eric Coissac f36b095ce2 docs: clarify MPHF indexing, storage layout, and distance traits
Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
2026-05-17 15:59:10 +08:00
Eric Coissac cf693f17f2 refactor: delegate MPHF construction and I/O to MphfLayer
Extracts MPHF construction, evidence encoding, and unitig I/O into a new `MphfLayer` module. This removes direct dependencies on `Evidence`, `PersistentBitMatrix`, and `PersistentCompactIntMatrix` from `Layer`. The `query` method is simplified to perform direct MPHF lookups, while build logic and serialization are consolidated within the new module.
2026-05-16 19:32:40 +08:00
Eric Coissac 13e69e23c9 feat: introduce trait-based distance aggregation and layered store
Introduces ColumnWeights, CountPartials, and BitPartials traits to compute and finalize partial distance matrices. Implements these traits for PersistentBitMatrix, PersistentCompactIntMatrix, and a new LayeredStore<S> wrapper that aggregates metrics across layers via parallel reduction. Adds ndarray for numerical aggregation and updates architecture documentation to reflect the trait-driven design and pending refactoring roadmap.
2026-05-16 14:41:49 +08:00
Eric Coissac 45d49ed501 docs: document k-mer index architecture and refactor distance metrics
Add comprehensive documentation for the `obilayeredmap` crate, `PersistentCompactIntVec`, `PersistentBitVec`, and the hierarchical k-mer index architecture, including sidebar navigation updates across all documentation pages. Refactor the Bray-Curtis distance computation in `obicompactvec` to decouple numerator and denominator calculations, replacing direct pairwise calls with explicit loops over precomputed sums. Update tests to verify column sum accuracy and align with the simplified API.
2026-05-15 21:24:30 +08:00
Eric Coissac 8409c852ef feat: Add parallel column counts and partial distance metrics
Introduces parallel `count_ones` for `BitMatrix` and parallel column-sum aggregation alongside three pairwise distance constructors (Bray-Curtis, Euclidean, Hellinger) for `IntMatrix`. These methods support partial, layer-wise data by accepting precomputed global column sums for normalization, enabling additive decomposition across partitions. Includes unit tests verifying mathematical equivalence and partition additivity.
2026-05-15 20:44:58 +08:00
Eric Coissac 8bee9f3017 feat: add parallel distance matrix computation for bit and int matrices
Introduce parallel distance matrix generation using `ndarray` and `rayon` for both `BitMatrix` and `IntMatrix`. Adds full and additive-partial variants for Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger metrics. Includes comprehensive unit tests verifying matrix symmetry, zero diagonals, and numerical correctness against pairwise calculations.
2026-05-15 17:23:12 +08:00
Eric Coissac 1881e98bad feat(bitvec): add partial Jaccard, fix padding, optimize constructor
Introduces `partial_jaccard_dist` to return raw intersection and union counts, improving Jaccard distance flexibility. Corrects `not()` to explicitly zero padding bits in the final word, ensuring accurate bit-counting for partially-filled words. Adds an optimized `build_from_counts` constructor.
2026-05-15 15:04:10 +08:00
Eric Coissac b218bf012b feat: introduce column-major matrix storage and migrate layered map
Introduces `PersistentBitMatrix` and `PersistentCompactIntMatrix` to replace single-file vector storage with a column-major, directory-based layout. Each column is persisted as an individual file alongside a lightweight `meta.json` for dimension tracking. Migrates `obilayeredmap` to use these multi-column structures, updating Rust APIs, query return types, and build signatures. Includes comprehensive documentation, unit and integration tests for persistence and accessors, and refactors distance calculation helpers.
2026-05-14 21:19:18 +08:00
Eric Coissac f48f7500cd refactor(obilayeredmap): support generic payload types
Replace the hardcoded `Counts` module with a generic `LayerData` trait, parameterizing `Layer` and `LayeredMap` over arbitrary payload types. This decouples read-path access from build-path logic, enabling both set membership and count-based indexing via `PersistentCompactIntVec`. Adds the `obicompactvec` dependency, implements streaming layer construction, and expands test coverage for persistence and multi-layer resolution.
2026-05-14 09:33:18 +08:00
Eric Coissac 0b3fcf3cf0 feat: add PersistentBitVec and upgrade PersistentCompactIntVec format
Introduces PersistentBitVec, a dense, memory-mapped bit vector optimized for bulk u64-word operations and SIMD acceleration, complete with bitwise operators and Jaccard/Hamming distance metrics. Upgrades PersistentCompactIntVec to a unified .pciv format using 64-bit indices and offsets, consolidating the binary layout and updating builder/reader lifecycles accordingly. Adds corresponding documentation, updates MkDocs navigation, and implements a comprehensive test suite for persistence round-trips, edge cases, and metric accuracy.
2026-05-14 09:01:36 +08:00
Eric Coissac c18c5d2600 feat: add sum and sumadd methods to PersistentCompactIntVec
Adds a `sum()` method to compute the aggregate of all elements using u64 arithmetic to safely handle potential overflows. Introduces a `sumadd()` method for element-wise addition into the builder, enforcing strict length equality and using `checked_add` for safe accumulation. Includes unit tests to verify correct aggregation and overflow safety.
2026-05-13 21:49:22 +08:00
Eric Coissac 0733287de5 feat(obicompactvec): migrate to memory-mapped file storage
Refactor `PersistentCompactIntVecBuilder` to replace in-memory `Vec<u8>` with `MmapMut` for primary storage. Decouples initialization from finalization by updating `finalize_pciv` to truncate, append overflow data, and overwrite the header placeholder. Adds file path tracking via a new `path()` accessor, implements `build_from` for efficient copying, and introduces arithmetic/set operations (`min`, `max`, `sum`, `diff`). Expands test coverage for persistence roundtrips, mutations, iteration, and the new vector operations.
2026-05-13 10:56:36 +08:00
Eric Coissac dfce956162 feat(obicompactvec): implement Iterator for PersistentCompactIntVec
Add an `Iter` struct that implements `Iterator` and `ExactSizeIterator` to enable idiomatic traversal of `&PersistentCompactIntVec`. The iterator maintains `slot` and `overflow_pos` state to correctly yield `u32` values from both primary and overflow memory regions. Includes three unit tests validating iteration correctness against direct indexing, accurate `len()` tracking, and proper reference-based iteration.
2026-05-13 10:28:17 +08:00
Eric Coissac 4d5fcd4340 refactor: split obicompactvec storage into primary and overflow files
Refactors the storage format to separate primary and overflow data into distinct files. Introduces a cache-friendly sparse index with dynamically computed step and entry counts. Consolidates dual memory-mapped regions into a single file with explicit header parsing and validation, replacing unsafe slice casting with direct byte-offset indexing. Updates the test suite to accommodate the new file structure.
2026-05-13 10:25:14 +08:00
Eric Coissac f2de79acde Add persistent compact integer vector and cache-line-optimized MPHF
Introduce the `obicompactvec` crate, featuring a two-tier, memory-mapped integer vector that uses a primary `u8` array with a sentinel for overflow dispatch and a sparse L1-resident index for fast random access. Implement builder and reader modules with zero-copy serialization and comprehensive test coverage. Update `obilayeredmap` to replace the default hash function with a cache-line-optimized `Mphf`, adding explicit bounds checking and duplicate-slot detection. Add documentation for both modules and update project configuration files accordingly.
2026-05-13 10:09:46 +08:00
Eric Coissac 84ed752b78 perf: optimize packed_seq sub() with direct bit-slice copying
Replaces per-nucleotide iteration with direct bit-slice copying via `bitvec`. This eliminates per-element decoding overhead and intermediate allocations by computing the target byte length, copying the packed bit range `[start*2, end*2)` directly into a pre-allocated buffer, and constructing the result in a single pass.
2026-05-13 06:26:29 +08:00
Eric Coissac ff75c9198d feat: add kmer iterators and optimize layered map performance
Replace `ph` with `ptr_hash` and introduce `epserde` and `rayon` dependencies. Refactor MPHF construction to leverage parallel iteration, eliminating intermediate `Vec<u64>` allocations and reducing memory footprint. Add a `n_kmers` field to track and serialize total kmer counts, alongside three zero-allocation iterators for efficient chunk traversal. Include comprehensive unit tests for the new iterators and update CLAUDE.md to enforce explicit dependency validation policies.
2026-05-12 22:35:21 +08:00
Eric Coissac 9c41891cc8 feat: add obilayeredmap crate for disk-backed k-mer indexing
Introduces the `obilayeredmap` crate (v0.1.0), implementing an append-only, disk-backed k-mer index using a minimal perfect hash function (MPHF). The module features memory-mapped reads, buffered writes, custom error handling, partition metadata persistence, and comprehensive unit tests. Also adds a reverse complement benchmark for `obikseq` and updates `Cargo.lock` with the new dependencies.
2026-05-12 15:26:39 +08:00
Eric Coissac 962e386f8b docs: clarify iterator borrowing vs consuming semantics
Updates documentation for `PackedSeqKmerIter` and `SuperKmer` iterator methods to explicitly clarify borrowing versus consuming semantics. Documents ownership transfer requirements for `into_*` variants and explains lifetime constraints that prevent borrowing forms from being used in `flat_map` closures. Highlights optimal usage contexts, particularly zero-allocation iteration for owned values.
2026-05-11 11:43:07 +08:00
Eric Coissac 7bc9aa9af5 refactor(packed_seq): unify kmer iterators with generic storage
Merge PackedSeqKmerIter and OwnedPackedSeqKmerIter into a single generic PackedSeqKmerIter<S> parameterized over the storage type. Add an AsRef<PackedSeq> implementation to PackedSeq to enable this abstraction, allowing the zero-allocation sliding-window kmer iterator to seamlessly accept both borrowed and owned sequences without code duplication.
2026-05-11 11:15:37 +08:00
Eric Coissac 6687911d60 Add consuming k-mer iterators to PackedSeq and Superkmer
Introduces `into_kmers()` and `into_canonical_kmers()` consuming methods to `PackedSeq` and `Superkmer`, enabling zero-allocation sliding-window k-mer extraction via bitwise operations. This complements existing borrow-based iterators by allowing direct ownership transfer. Also includes minor documentation updates, whitespace fixes, and new unit tests to verify canonical k-mer iteration counts and output sequences.
2026-05-11 10:28:01 +08:00
Eric Coissac 92cda13ae4 refactor: streamline I/O handling and remove deprecated codec module
Removes the `codec` and `limits` modules, eliminating manual serialization logic and OS file descriptor limit queries. Introduces a shared `append_path_suffix` utility to standardize path manipulation across `meta.rs` and `unitig_index.rs`. Refactors the file pool to dynamically size based on available file descriptors and optimizes descriptor lifecycle to prevent leaks. Enhances `SKFileReader` with LRU eviction, consumption tracking, and seek-on-reopen support. Migrates related round-trip and EOF tests to the reader module and updates chunking logic in the unitig index.
2026-05-09 19:29:06 +08:00
Eric Coissac 9e5af6b5a5 refactor(pool): optimize write batch and LRU cache handling
Replaces the `write_one` loop with threshold-based draining and inlines metadata updates. Explicitly promotes accessed entries to MRU during flush and drain operations to prevent premature LRU eviction. Updates comments to clarify two-phase locking and MRU promotion semantics.
2026-05-09 19:00:31 +08:00
Eric Coissac e008a8beb4 harden obiskio error handling with explicit variants and bounds checking
Extend `SKError` with `BadMagic`, `Truncated`, and `InvalidData` variants to replace `expect()` calls and unsafe slice indexing. Implement proper error chaining via `source()` and update `Display` formatting. Improve magic byte validation and serde error mapping for clearer debugging. Documents a broader roadmap for further crate hardening, including LRU eviction, concurrency, and benchmarking.
2026-05-09 18:30:19 +08:00
Eric Coissac 5169f65dc9 feat: implement persistent layered index and chunked binary format
Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (256) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.
2026-05-09 17:38:29 +08:00
Eric Coissac 8c17bf958b refactor: centralize k-mer config and introduce packed sequences
Centralize k-mer and minimizer configuration using a thread-safe global module, and replace manual bit-packing with a memory-efficient `PackedSeq` type. Refactor core sequence and k-mer types to use compile-time length enforcement and centralized hashing. Introduce a new De Bruijn graph implementation with compact node encoding and traversal iterators. Update I/O, partitioning, and builder modules to align with the new architecture, and add the `xxhash-rust` dependency.
2026-05-08 06:34:24 +08:00
Eric Coissac 602f414957 fix: strip AI reasoning blocks from commit messages
Adds a `_strip_think` function using `awk` to buffer stdin and track the last `</think>` tag, emitting only the subsequent content. This utility is now piped after `aichat` calls to remove AI reasoning blocks before commit message generation. Also applies minor whitespace and indentation adjustments throughout the script.
2026-05-03 17:42:17 +02:00
Eric Coissac 0b784242cf refactor: split traversal logic in start_iter for clearer chain/cycle handling
Refactors `startIter` to separate the traversal into two distinct passes: one for chain starts (nodes without left extension) and another specifically targeting cycle nodes. Also simplifies `nextUnitigKmer` by removing the redundant `_at_start_ parameter and unifying traversal direction logic. Updates `UnitigIter` to manage visited marking internally, improving encapsulation and cycle detection reliability.
2026-05-02 16:31:08 +02:00
Eric Coissac 86e9cb7026 refactor: improve de Bruijn graph traversal and longtig generation
- Refactored Node representation using compact bitfields for neighbor counts
  and nucleotides; added count_neighbors helper to compute_degrees()
- Introduced StartIter iterator for unitig/longtigu generation with revised
  traversal semantics (e.g., interior node marking)
- Added nucleotide() accessor to Kmer type for 2-bit extraction at position i
- Renamed unitig.rs → longtigs, updated CLI command and output filenames to reflect "long t ig"
- Extended extract_kmers() in scripts/compare.py with duplication statistics
```
2026-05-02 16:31:08 +02:00
Eric Coissac 35840b9d73 refactor: simplify unitig iteration logic and API
Refactors the `start_iter` function to separate chain/cycle detection into two passes, consolidates visited marking logic internally in `next_unitig_kmer`, and streamlines the public API by removing redundant methods, updating iteration parameters to `(start_first_next)`, and eliminating external state like `at_start`.
2026-05-02 16:31:08 +02:00
Eric Coissac defeeb9460 feat: enforce canonical k-mer representation throughout the codebase
Refactor core types to consistently use `CanonicalKMer` (lexicographically minimal of k-mer and its reverse complement) as the canonical representation, ensuring deterministic behavior in graph traversal (unitig decomposition), neighbor resolution (`unique_neighbor` with `[CanonicalKmer; 4]` input) and scatter output generation. Introduce `RoutableSuperKmer`, add `.seq_hash()` support, fix type syntax errors in unitig extraction methods and deduplication tests. Update all k-mer construction to use canonical-aware APIs, including unsafe unchecked constructors for performance-critical paths.
2026-05-02 16:31:08 +02:00
Eric Coissac 21ddbf1674 feat: add seq_hash() and refactor canonical hashing
Introduce `. seqhash(&self)` for direct XXH3-64 sequencing of packed bytes, and remove legacy `.hash()` method that used conditional canonicalization via revcomp. Also update partitioning logic to use `sk.hashseq_hash()` and deduplicate imports.
2026-05-01 10:35:57 +02:00
Eric Coissac 27f5e88a7b refactor: implement RoutableSuperKmer and update k-mer indexing pipeline
Replace raw SuperkMer routing with a new RoutableSuperKimer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline to yield RoutableSuperKmers throughout (builder, scatterer), refactor FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst), revise SuperKmer header to store n_kmers instead of seql (avoiding 256-byte wrap), and update documentation to reflect minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.
2026-05-01 09:33:26 +02:00
Eric Coissac 4e26e3bd40 Refactor: Simplify user authentication flow
- Remove redundant password validation logic
 - Integrate JWT-based session management for improved security and scalability
2026-04-30 07:04:03 +02:00
Eric Coissac 97e65bd831 ♻️ refactor pipeline architecture and fix macOS memory detection
- Replace WorkerPool-based pipelines with typed `Pipe` abstraction in obipipeline
  - Introduce Pipe/PipeIter for composable, sourceless/sink-less pipelines
- Update partition and superkmer commands to use new Pipe API via make_pipe!
  - Remove Arc<Mutex<...>> patterns; simplify state management
- Fix macOS available_memory() returning 0 by falling back to half total memory in dereplicate()
- Remove unused `format: "zstd"` field from partition.meta
2026-04-30 07:04:03 +02:00
Eric Coissac 4c19882f03 add PhantomData import for generic type safety
- Added `use std::marker::PhantomData;` to prepare for generic scheduler implementations
- Ensures type safety and avoids unused lifetime/type parameters warnings
2026-04-30 07:04:03 +02:00
Eric Coissac ebbfe35cbc Refactor: Extract utility function for string reversal
Extracted `inverser_chaine` into a reusable utility function with docstring and added unit test to ensure correctness.
2026-04-30 06:58:46 +02:00
Eric Coissac e7fa60a3a2 Refactor: Simplify user authentication flow
- Remove redundant validation logic in login handler
 - Consolidate session token generation into a single utility function  
- Update error handling to use consistent HTTP status codes
2026-04-28 08:38:26 +02:00
Eric Coissac 58391886a3 🔧 Replace degenerate minimizer logic with hash-based random ordering
- Add `hash` field to MmerItem for stable, randomized minimizer ordering
- Introduce hash_mMER() using mix64 with XOR seed to avoid fixed points (e.g., poly-A/T)
- Remove is_degenerate() and minimizer_worse(), simplifying comparison to hash-only
- Update push logic: compare hashes instead of canonical values with degeneracy checks
2026-04-27 20:19:43 +02:00