obikmer

Author	SHA1	Message	Date
Eric Coissac	13599dd444	feat: Implement query subcommand for sequence-to-genome mapping This change introduces the `query` CLI command and its supporting infrastructure for sequence-to-genome mapping and k-mer matching. It adds a `QueryLayer` abstraction backed by MPHF and persistent matrices, exposes the index partition for direct querying, and implements `Hash`/`Eq` for `RoutableSuperKmer`. The command ingests sequence batches, deduplicates superkmers, routes them to index partitions for parallel exact or 1-mismatch matching, and outputs results as FASTA records annotated with JSON metadata. Includes `serde_json` dependency addition, module exports, and documentation updates.	2026-05-21 18:56:41 +02:00
Eric Coissac	d9aa211b8f	feat: add k-mer index rebuild and compaction feature This commit introduces a new `rebuild` CLI subcommand that reconstructs an existing multi-layer k-mer index into a compact, single-layer index. It implements a configurable filtering pipeline supporting min/max genome fraction/count and total count thresholds, parallel partition processing via `rayon`, and CLI progress tracking. The change also restructures module declarations across `obikindex` and `obikpartitionner` to integrate the new rebuild and layer-handling logic.	2026-05-21 15:08:19 +02:00
Eric Coissac	3fa1dbf8cc	feat: add pairwise distance computation and phylogenetic trees This commit introduces a new `distance` CLI subcommand that computes pairwise genomic distance matrices using configurable metrics (Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger). It optionally generates phylogenetic trees (NJ or UPGMA) in Newick format and outputs results as CSV. The implementation adds a robust distance computation backend that dynamically routes to optimized backends based on index configuration, supports parallel iteration, and gracefully handles missing data. Additionally, it adds a `dump` task for exporting k-mer to genome mappings as CSV, introduces an `InvalidInput` error variant, updates dependencies to support numerical operations and tree construction, and performs minor module reorganizations.	2026-05-21 15:03:08 +02:00
Eric Coissac	11182005a2	feat: enhance merge label resolution, debug dump, and layer metadata This commit enhances the CLI and index pipelines by introducing `--force-presence` to normalize output to binary values, `--debug` to expose partition and layer metadata, and `--rename-duplicates` to automatically disambiguate overlapping genome labels. It updates the partitioner and index layers to auto-discover layers, persist `meta.json` for single-genome builds, and fix per-source column offsets during merging. A `DuplicateGenomeLabel` error variant is also added, and stale directories are properly managed in presence/absence mode.	2026-05-21 14:52:59 +02:00
Eric Coissac	1a1f95e59d	feat: add CLI command to export indexed k-mers to CSV This change introduces a new `dump` subcommand that exports all indexed k-mers to a CSV stream. The implementation spans multiple crates, adding core export logic to `obikindex` and partition iteration to `obikpartitionner`. The command supports a `--force-presence` flag to output binary presence/absence data instead of stored counts, and includes necessary module registrations and structural updates across the codebase.	2026-05-21 13:48:07 +02:00
Eric Coissac	e1d59fde54	feat: add merge command to consolidate k-mer indexes Introduces a new `merge` CLI subcommand and underlying implementation to consolidate multiple pre-indexed k-mer indexes into a single output. Adds `append_column` methods to persistent bit and int matrices to enable incremental genome column expansion without rebuilding the MPHF. Includes new error variants for index readiness and configuration mismatches, adds a `--force` flag to the index command, and updates documentation and navigation structure accordingly.	2026-05-21 13:44:50 +02:00
Eric Coissac	bfa436ad15	feat: add merge operation specs and partition progress bar Added implementation specifications for the `merge` operation, detailing parallel partition processing, I/O paths, and kmer matrix aggregation across multiple indexes. Integrated an `indicatif` progress bar into the `rayon` parallel loop to monitor processing position, throughput, ETA, and recent partition duration.	2026-05-21 13:36:50 +02:00
Eric Coissac	7d1b62ddf3	refactor: replace single spectrum file with per-partition outputs Replace the single `kmer_spectrum_raw.json` output with per-partition JSON files in a `spectrums/` directory. Add a `keep_intermediate` parameter to control intermediate file cleanup, and introduce a `write_spectrum` helper for serialization. Update the completion sentinel to `count.done` and align state documentation accordingly.	2026-05-21 13:35:06 +02:00
Eric Coissac	c5bcb7b8fa	feat: introduce layered MPHF indexing and partition metadata Refactors obikindex and obikpartitionner to delegate index construction to a new layered MPHF implementation. Adds resume-safe building with abundance filtering and count persistence, while introducing a PartitionMeta struct for JSON configuration persistence. Updates OKIError to wrap layer-specific errors, replaces single-path extraction with full path collection and logging, and registers new internal dependencies across the workspace.	2026-05-21 13:31:37 +02:00
Eric Coissac	f8cfb493b8	refactor: extract pipeline stages and centralize partition directory paths Extracts the scatter, dereplicate/count, and index pipeline stages into a new `steps` module to improve modularity. Centralizes partition directory path construction by introducing a `part_dir()` helper, replacing manual string formatting across multiple command files. Adds `--with-counts` and `--keep-intermediate` CLI flags to the index command and fixes a typo in the `partition_dir` parameter name.	2026-05-20 18:42:09 +02:00
Eric Coissac	9a1c0c0ee0	Add CLI progress bars and throughput metrics to partitioning Add `indicatif` v0.17 to `obikmer` and `obikpartitionner` to instrument CLI workflows with real-time progress tracking. The changes integrate progress spinners and bars into the batch processing and parallel kmer counting loops, displaying processed base pairs, throughput rates, and elapsed time. Updates occur every 0.1s to enhance observability without modifying core partitioning logic.	2026-05-20 15:46:52 +02:00
Eric Coissac	b80ab77d66	perf: Switch to sequential PHF construction to avoid thread contention The outer partition loop already saturates parallelism, making parallel PHF construction redundant and causing Rayon thread pool contention. This change switches to a sequential variant to improve performance. Additionally, explicit error handling is now added for construction failures, while preserving the existing mmap-backed kmer slice.	2026-05-20 12:48:42 +02:00
Eric Coissac	4736a7b6de	refactor: restructure k-mer partitioning pipeline for memory efficiency Replace in-memory hashing with a disk-backed external merge sort and `PersistentCompactIntVec` to drastically reduce peak RAM. Unify both phases using a custom `PtrHash` MPHF, eliminating `GOFunction` and `boomphf`. Introduce a concrete three-step `count_partition()` pipeline with adaptive chunk sizing based on available system memory. Update dependencies to `memmap2`, `ptr_hash`, and `obicompactvec`. Additionally, document strict genomics-only memory constraints and enforce an architectural feedback workflow requiring explicit user authorization before structural changes.	2026-05-17 16:08:47 +08:00
Eric Coissac	5169f65dc9	feat: implement persistent layered index and chunked binary format Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (256) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.	2026-05-09 17:38:29 +08:00
Eric Coissac	8c17bf958b	refactor: centralize k-mer config and introduce packed sequences Centralize k-mer and minimizer configuration using a thread-safe global module, and replace manual bit-packing with a memory-efficient `PackedSeq` type. Refactor core sequence and k-mer types to use compile-time length enforcement and centralized hashing. Introduce a new De Bruijn graph implementation with compact node encoding and traversal iterators. Update I/O, partitioning, and builder modules to align with the new architecture, and add the `xxhash-rust` dependency.	2026-05-08 06:34:24 +08:00
Eric Coissac	defeeb9460	feat: enforce canonical k-mer representation throughout the codebase Refactor core types to consistently use `CanonicalKMer` (lexicographically minimal of k-mer and its reverse complement) as the canonical representation, ensuring deterministic behavior in graph traversal (unitig decomposition), neighbor resolution (`unique_neighbor` with `[CanonicalKmer; 4]` input) and scatter output generation. Introduce `RoutableSuperKmer`, add `.seq_hash()` support, fix type syntax errors in unitig extraction methods and deduplication tests. Update all k-mer construction to use canonical-aware APIs, including unsafe unchecked constructors for performance-critical paths.	2026-05-02 16:31:08 +02:00
Eric Coissac	21ddbf1674	feat: add `seq_hash()` and refactor canonical hashing Introduce `.seq_hash(&self)` for direct XXH3-64 hashing of packed bytes, and remove the legacy `.hash()` method that used conditional canonicalization via revcomp. Also update partitioning logic to use `sk.seq_hash()` and deduplicate imports.	2026-05-01 10:35:57 +02:00
Eric Coissac	27f5e88a7b	refactor: implement RoutableSuperKmer and update k-mer indexing pipeline Replace raw SuperKmer routing with a new RoutableSuperKmer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline to yield RoutableSuperKmers throughout (builder, scatterer), refactor FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst), revise SuperKmer header to store n_kmers instead of seql (avoiding 256-byte wrap), and update documentation to reflect minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.	2026-05-01 09:33:26 +02:00
Eric Coissac	97e65bd831	♻️ refactor pipeline architecture and fix macOS memory detection - Replace WorkerPool-based pipelines with typed `Pipe` abstraction in obipipeline - Introduce Pipe/PipeIter for composable, sourceless/sink-less pipelines - Update partition and superkmer commands to use new Pipe API via make_pipe! - Remove Arc<Mutex<...>> patterns; simplify state management - Fix macOS available_memory() returning 0 by falling back to half total memory in dereplicate() - Remove unused `format: "zstd"` field from partition.meta	2026-04-30 07:04:03 +02:00
Eric Coissac	4c19882f03	✨ add PhantomData import for generic type safety - Added `use std::marker::PhantomData;` to prepare for generic scheduler implementations - Ensures type safety and avoids unused lifetime/type parameters warnings	2026-04-30 07:04:03 +02:00
Eric Coissac	ebbfe35cbc	Refactor: Extract utility function for string reversal Extracted `inverser_chaine` into a reusable utility function with docstring and added unit test to ensure correctness.	2026-04-30 06:58:46 +02:00
Eric Coissac	e7fa60a3a2	Refactor: Simplify user authentication flow - Remove redundant validation logic in login handler - Consolidate session token generation into a single utility function - Update error handling to use consistent HTTP status codes	2026-04-28 08:38:26 +02:00
Eric Coissac	7efec54b27	.gitignore: ignore zstandard-compressed files - Add *.zst pattern to .gitignore - Prevents tracking of zstandard-compressed archives	2026-04-27 16:56:15 +02:00
Eric Coissac	1f466bf113	Refactor: simplify user authentication flow - Replaced manual token validation with built-in middleware - Removed redundant session checks in controllers	2026-04-27 16:55:04 +02:00
Eric Coissac	c09d17401d	+ obiskio: add binary I/O with LRU pool and compression - Add new obiskio crate for high-performance SuperKmer serialization/deserialization - Implement binary codec with 2-bit packed sequence encoding and raw header format (32 bits) - Add transparent compression support via niffler: Zstd, Gzip/Bgzf/Lz4 - Implement SKFilePool with LRU-based fd management, max-concurrent-fd limiting (75% of ulimit) - Add SKFileWriter with batched writes, configurable flush threshold (8 KiB default), and two-phase locking - Add SKFileReader with sequential access, LRU recovery via reopen_and_seek() + New obikpartitionner crate: basic header/seq handling for binary super-kmer format - Bump niffler from 2.7 to v3, add dependencies: allocator-api2, bitflags(>=1), errno/fastrand/rustix/tempfile/lru/hashbrown/bzip2/thiserror - Update workspace members to include obikpartitionner and obiskio	2026-04-25 14:15:01 +02:00

25 Commits