obikmer

Author	SHA1	Message	Date
Eric Coissac	ca71e100ef	feat: Implement RAII-based file handle throttling Introduces a thread-safe, RAII-based throttling mechanism (`throttle_paths`, `FileSlots`, `SlotsGuard`) to enforce a new `max_open_files` configuration limit. This replaces direct file opening in the scatter and superkmer pipelines with a concurrency semaphore that automatically releases handles upon completion, preventing resource exhaustion and deadlocks during concurrent I/O.	2026-05-22 14:03:24 +02:00
coissac	6c1a8da2d1	Merge pull request 'feat: limit concurrent open files during scatter' (#6) from push-rkytvkympxrn into main Reviewed-on: #6	2026-05-22 09:33:39 +00:00
Eric Coissac	85e1901898	feat: limit concurrent open files during scatter Introduces a `max_open_files` CLI argument (default: 20) to cap concurrently open input files during scatter operations. The scatter phase now parallelizes sequence file partitioning across worker threads while enforcing a configurable concurrency limit using a custom semaphore and `GuardedIter` wrapper. This ensures bounded resource usage and prevents file handle exhaustion during index construction.	2026-05-22 11:28:44 +02:00
coissac	1ba6690256	Merge pull request 'refactor(scatter): move logging call into pipeline closure' (#5) from push-txyqunyttswl into main Reviewed-on: #5	2026-05-22 09:15:05 +00:00
Eric Coissac	9b37e848d4	refactor(scatter): move logging call into pipeline closure Add an explicit `PathBuf` type annotation and consolidate the "indexing" info log with the chunk-opening logic. This reduces pipeline boilerplate by keeping the logging directly in the initial stage closure.	2026-05-22 11:14:52 +02:00
coissac	df28cadc41	Merge pull request 'feat: add input file logging and optimize path traversal' (#4) from push-zoyvrpponqqo into main Reviewed-on: #4	2026-05-22 09:04:42 +00:00
Eric Coissac	fe2127c463	feat: add input file logging and optimize path traversal Instrument index and scatter stages with `tracing::info` to log input file paths for better runtime observability. Additionally, optimize the path iterator by replacing redundant `is_dir()` checks with explicit `is_file()` validation and deferring metadata resolution, eliminating unnecessary `stat()` syscalls and improving traversal performance on high-latency network filesystems like Lustre and NFS.	2026-05-22 11:04:04 +02:00
coissac	fe0832190b	Merge pull request 'feat(obiread): add static bzip2 and lzma compression support' (#3) from push-qpvrrxpnqlkw into main Reviewed-on: #3	2026-05-22 08:29:42 +00:00
Eric Coissac	72d054c06b	feat(obiread): add static bzip2 and lzma compression support Explicitly add `bzip2-sys` and `liblzma-sys` with the `static` feature to the `obiread` crate. This enforces static linking for BZ2 and LZMA/XZ backends, eliminating runtime dynamic library dependencies and ensuring consistent binary distribution.	2026-05-22 10:29:22 +02:00
coissac	3d58a32613	Merge pull request 'feat: introduce genome metadata tracking and CSV export' (#2) from push-zrlrptrsrlkk into main Reviewed-on: #2	2026-05-22 07:36:11 +00:00
Eric Coissac	0f8f61d3dd	feat: introduce genome metadata tracking and CSV export This commit replaces raw string genome labels with a structured `GenomeInfo` type for better metadata tracking. It adds a `--meta` flag to the index command, and implements a new `annotate` CLI subcommand to import metadata from CSV files or export it via `--dump`. Distance and shared-count matrices are now serialized to CSV, with UPGMA clustering trees exported as Newick files. Query outputs now include per-genome k-mer match counts in JSON. The commit also fixes syntax and variable naming issues in index merging and dump generation.	2026-05-22 09:35:20 +02:00
coissac	77a0186fae	Merge pull request 'Push qkpyqurltlpk' (#1) from push-qkpyqurltlpk into main Reviewed-on: #1	2026-05-21 16:57:19 +00:00
Eric Coissac	13599dd444	feat: Implement query subcommand for sequence-to-genome mapping This change introduces the `query` CLI command and its supporting infrastructure for sequence-to-genome mapping and k-mer matching. It adds a `QueryLayer` abstraction backed by MPHF and persistent matrices, exposes the index partition for direct querying, and implements `Hash`/`Eq` for `RoutableSuperKmer`. The command ingests sequence batches, deduplicates superkmers, routes them to index partitions for parallel exact or 1-mismatch matching, and outputs results as FASTA records annotated with JSON metadata. Includes `serde_json` dependency addition, module exports, and documentation updates.	2026-05-21 18:56:41 +02:00
Eric Coissac	c8e591fc78	feat: add superkmer CLI setup and partition bit handling This commit introduces CLI argument parsing for the `superkmer` command via a new `SuperkmerArgs` struct. It also adds a `partitions_to_bits` utility to compute the minimum bit width for partition encoding, enforcing a 1-bit floor. Finally, the index configuration automatically rounds the partition count up to the nearest power of two to ensure compatibility with bitmask-based indexing operations.	2026-05-21 15:10:48 +02:00
Eric Coissac	d9aa211b8f	feat: add k-mer index rebuild and compaction feature This commit introduces a new `rebuild` CLI subcommand that reconstructs an existing multi-layer k-mer index into a compact, single-layer index. It implements a configurable filtering pipeline supporting min/max genome fraction/count and total count thresholds, parallel partition processing via `rayon`, and CLI progress tracking. The change also restructures module declarations across `obikindex` and `obikpartitionner` to integrate the new rebuild and layer-handling logic.	2026-05-21 15:08:19 +02:00
Eric Coissac	3fa1dbf8cc	feat: add pairwise distance computation and phylogenetic trees This commit introduces a new `distance` CLI subcommand that computes pairwise genomic distance matrices using configurable metrics (Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger). It optionally generates phylogenetic trees (NJ or UPGMA) in Newick format and outputs results as CSV. The implementation adds a robust distance computation backend that dynamically routes to optimized backends based on index configuration, supports parallel iteration, and gracefully handles missing data. Additionally, it adds a `dump` task for exporting k-mer to genome mappings as CSV, introduces an `InvalidInput` error variant, updates dependencies to support numerical operations and tree construction, and performs minor module reorganizations.	2026-05-21 15:03:08 +02:00
Eric Coissac	9e1d6f2f25	feat: implement partition-based merge command for k-mer indices Implements a new `merge` command that aggregates k-mer counts and presence/absence matrices from multiple source indices using a parallelized, partition-based algorithm. Adds CLI progress bars and execution timing across the bootstrap, spectrum rebuild, and merge phases. Updates logging to report the aggregate genome count and introduces a bounds check in the perfect hash layer to safely return `None` for unknown k-mers, preventing out-of-bounds access in downstream operations.	2026-05-21 14:55:38 +02:00
Eric Coissac	11182005a2	feat: enhance merge label resolution, debug dump, and layer metadata This commit enhances the CLI and index pipelines by introducing `--force-presence` to normalize output to binary values, `--debug` to expose partition and layer metadata, and `--rename-duplicates` to automatically disambiguate overlapping genome labels. It updates the partitioner and index layers to auto-discover layers, persist `meta.json` for single-genome builds, and fix per-source column offsets during merging. A `DuplicateGenomeLabel` error variant is also added, and stale directories are properly managed in presence/absence mode.	2026-05-21 14:52:59 +02:00
Eric Coissac	1a1f95e59d	feat: add CLI command to export indexed k-mers to CSV This change introduces a new `dump` subcommand that exports all indexed k-mers to a CSV stream. The implementation spans multiple crates, adding core export logic to `obikindex` and partition iteration to `obikpartitionner`. The command supports a `--force-presence` flag to output binary presence/absence data instead of stored counts, and includes necessary module registrations and structural updates across the codebase.	2026-05-21 13:48:07 +02:00
Eric Coissac	e1d59fde54	feat: add merge command to consolidate k-mer indexes Introduces a new `merge` CLI subcommand and underlying implementation to consolidate multiple pre-indexed k-mer indexes into a single output. Adds `append_column` methods to persistent bit and int matrices to enable incremental genome column expansion without rebuilding the MPHF. Includes new error variants for index readiness and configuration mismatches, adds a `--force` flag to the index command, and updates documentation and navigation structure accordingly.	2026-05-21 13:44:50 +02:00
Eric Coissac	bfa436ad15	feat: add merge operation specs and partition progress bar Added implementation specifications for the `merge` operation, detailing parallel partition processing, I/O paths, and kmer matrix aggregation across multiple indexes. Integrated an `indicatif` progress bar into the `rayon` parallel loop to monitor processing position, throughput, ETA, and recent partition duration.	2026-05-21 13:36:50 +02:00
Eric Coissac	7d1b62ddf3	refactor: replace single spectrum file with per-partition outputs Replace the single `kmer_spectrum_raw.json` output with per-partition JSON files in a `spectrums/` directory. Add a `keep_intermediate` parameter to control intermediate file cleanup, and introduce a `write_spectrum` helper for serialization. Update the completion sentinel to `count.done` and align state documentation accordingly.	2026-05-21 13:35:06 +02:00
Eric Coissac	c5bcb7b8fa	feat: introduce layered MPHF indexing and partition metadata Refactors obikindex and obikpartitionner to delegate index construction to a new layered MPHF implementation. Adds resume-safe building with abundance filtering and count persistence, while introducing a PartitionMeta struct for JSON configuration persistence. Updates OKIError to wrap layer-specific errors, replaces single-path extraction with full path collection and logging, and registers new internal dependencies across the workspace.	2026-05-21 13:31:37 +02:00
Eric Coissac	17c9e076bd	refactor: extract obikindex crate and remove deprecated CLI commands Extracted core indexing logic, state tracking, and metadata management into a new `obikindex` crate. Refactored the `index` and `unitig` commands to leverage the `KmerIndex` abstraction and state-driven pipeline transitions. Removed obsolete CLI subcommands (`count`, `fasta`, `longtig`, `partition`) and their associated pipeline steps. Updated FASTA writing utilities for single-line output and deterministic identifiers, and refreshed workspace dependencies.	2026-05-20 18:54:12 +02:00
Eric Coissac	f8cfb493b8	refactor: extract pipeline stages and centralize partition directory paths Extracts the scatter, dereplicate/count, and index pipeline stages into a new `steps` module to improve modularity. Centralizes partition directory path construction by introducing a `part_dir()` helper, replacing manual string formatting across multiple command files. Adds `--with-counts` and `--keep-intermediate` CLI flags to the index command and fixes a typo in the `partition_dir` parameter name.	2026-05-20 18:42:09 +02:00
Eric Coissac	cc2ed4bd31	feat: Add progress tracking and timing instrumentation to index Introduces comprehensive progress tracking and timing instrumentation using indicatif and obisys::Reporter/Stage. Adds an EMA-based throughput calculator for the scatter phase and wraps parallel progress bars in Arc<Mutex> for thread-safe concurrent updates across all pipeline stages.	2026-05-20 18:34:28 +02:00
Eric Coissac	e66c4d81ef	feat(obikmer): add index subcommand for kmer counting pipeline Introduce the `index` CLI subcommand, implementing a resumable, multi-stage pipeline to partition, dereplicate, and count kmers from input sequences. The command builds a layered de Bruijn graph index per partition, applies optional abundance filtering, and persists unitigs alongside an MPHF-based count matrix. Update `Cargo.toml` and `Cargo.lock` to include new dependencies (`epserde`, `ptr_hash`, `cacheline-ef`, `obicompactvec`, `obilayeredmap`) required for the index builder, and refresh the profiling data files.	2026-05-20 18:25:12 +02:00
Eric Coissac	c20a1ed465	perf: optimize k-mer pipeline with compile-time tables This commit shifts entropy and lookup table generation to compile time via a new build script, eliminating runtime overhead. It replaces heap-allocated queues in rolling statistics with a stack-allocated, const-generic ring buffer for cache-friendly operations, and implements `size_hint` on `SuperKmerIter` for efficient iterator consumption. Additionally, it establishes the baseline profile configuration and sets global k-mer parameters.	2026-05-20 15:54:20 +02:00
Eric Coissac	9a1c0c0ee0	Add CLI progress bars and throughput metrics to partitioning Add `indicatif` v0.17 to `obikmer` and `obikpartitionner` to instrument CLI workflows with real-time progress tracking. The changes integrate progress spinners and bars into the batch processing and parallel kmer counting loops, displaying processed base pairs, throughput rates, and elapsed time. Updates occur every 0.1s to enhance observability without modifying core partitioning logic.	2026-05-20 15:46:52 +02:00
Eric Coissac	b80ab77d66	perf: Switch to sequential PHF construction to avoid thread contention The outer partition loop already saturates parallelism, making parallel PHF construction redundant and causing Rayon thread pool contention. This change switches to a sequential variant to improve performance. Additionally, explicit error handling is now added for construction failures, while preserving the existing mmap-backed kmer slice.	2026-05-20 12:48:42 +02:00
Eric Coissac	6e2a4c977b	fix: Replace unreliable memory pressure check with swap indicator The previous `major_faults > 10` check is unreliable on macOS, as it counts file-backed mmap page-ins rather than true memory pressure. This change replaces it with `swaps > 0`, a more accurate cross-platform indicator of RAM exhaustion. The swap diagnosis message is also updated to clarify that the working set exceeds available RAM, and comments are added to document this rationale.	2026-05-19 11:41:41 +02:00
Eric Coissac	8c16b79983	feat(obikmer): add obisys profiling to partition pipeline Added obisys as a local dependency and integrated its Reporter and Stage instrumentation into the partition command. Each major phase (scatter, dereplicate, and kmer-counting) is now wrapped in timing blocks, with aggregated execution times printed to stdout upon completion.	2026-05-19 11:40:20 +02:00
Eric Coissac	d0c277d5b6	feat(obisys): Add stage-based performance profiler Establishes the `obisys` crate using Rust 2024 and the `libc` dependency. Introduces a lightweight profiler that captures wall-clock time and `getrusage` system metrics per pipeline stage. Automatically computes parallelism and efficiency ratios, detects bottlenecks such as memory pressure and disk I/O, and prints a formatted diagnostic summary to stderr.	2026-05-19 11:40:20 +02:00
Eric Coissac	4736a7b6de	refactor: restructure k-mer partitioning pipeline for memory efficiency Replace in-memory hashing with a disk-backed external merge sort and `PersistentCompactIntVec` to drastically reduce peak RAM. Unify both phases using a custom `PtrHash` MPHF, eliminating `GOFunction` and `boomphf`. Introduce a concrete three-step `count_partition()` pipeline with adaptive chunk sizing based on available system memory. Update dependencies to `memmap2`, `ptr_hash`, and `obicompactvec`. Additionally, document strict genomics-only memory constraints and enforce an architectural feedback workflow requiring explicit user authorization before structural changes.	2026-05-17 16:08:47 +08:00
Eric Coissac	f36b095ce2	docs: clarify MPHF indexing, storage layout, and distance traits Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.	2026-05-17 15:59:10 +08:00
Eric Coissac	cf693f17f2	refactor: delegate MPHF construction and I/O to MphfLayer Extracts MPHF construction, evidence encoding, and unitig I/O into a new `MphfLayer` module. This removes direct dependencies on `Evidence`, `PersistentBitMatrix`, and `PersistentCompactIntMatrix` from `Layer`. The `query` method is simplified to perform direct MPHF lookups, while build logic and serialization are consolidated within the new module.	2026-05-16 19:32:40 +08:00
Eric Coissac	13e69e23c9	feat: introduce trait-based distance aggregation and layered store Introduces ColumnWeights, CountPartials, and BitPartials traits to compute and finalize partial distance matrices. Implements these traits for PersistentBitMatrix, PersistentCompactIntMatrix, and a new LayeredStore<S> wrapper that aggregates metrics across layers via parallel reduction. Adds ndarray for numerical aggregation and updates architecture documentation to reflect the trait-driven design and pending refactoring roadmap.	2026-05-16 14:41:49 +08:00
Eric Coissac	45d49ed501	docs: document k-mer index architecture and refactor distance metrics Add comprehensive documentation for the `obilayeredmap` crate, `PersistentCompactIntVec`, `PersistentBitVec`, and the hierarchical k-mer index architecture, including sidebar navigation updates across all documentation pages. Refactor the Bray-Curtis distance computation in `obicompactvec` to decouple numerator and denominator calculations, replacing direct pairwise calls with explicit loops over precomputed sums. Update tests to verify column sum accuracy and align with the simplified API.	2026-05-15 21:24:30 +08:00
Eric Coissac	8409c852ef	feat: Add parallel column counts and partial distance metrics Introduces parallel `count_ones` for `BitMatrix` and parallel column-sum aggregation alongside three pairwise distance constructors (Bray-Curtis, Euclidean, Hellinger) for `IntMatrix`. These methods support partial, layer-wise data by accepting precomputed global column sums for normalization, enabling additive decomposition across partitions. Includes unit tests verifying mathematical equivalence and partition additivity.	2026-05-15 20:44:58 +08:00
Eric Coissac	8bee9f3017	feat: add parallel distance matrix computation for bit and int matrices Introduce parallel distance matrix generation using `ndarray` and `rayon` for both `BitMatrix` and `IntMatrix`. Adds full and additive-partial variants for Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger metrics. Includes comprehensive unit tests verifying matrix symmetry, zero diagonals, and numerical correctness against pairwise calculations.	2026-05-15 17:23:12 +08:00
Eric Coissac	1881e98bad	feat(bitvec): add partial Jaccard, fix padding, optimize constructor Introduces `partial_jaccard_dist` to return raw intersection and union counts, improving Jaccard distance flexibility. Corrects `not()` to explicitly zero padding bits in the final word, ensuring accurate bit-counting for partially-filled words. Adds an optimized `build_from_counts` constructor.	2026-05-15 15:04:10 +08:00
Eric Coissac	b218bf012b	feat: introduce column-major matrix storage and migrate layered map Introduces `PersistentBitMatrix` and `PersistentCompactIntMatrix` to replace single-file vector storage with a column-major, directory-based layout. Each column is persisted as an individual file alongside a lightweight `meta.json` for dimension tracking. Migrates `obilayeredmap` to use these multi-column structures, updating Rust APIs, query return types, and build signatures. Includes comprehensive documentation, unit and integration tests for persistence and accessors, and refactors distance calculation helpers.	2026-05-14 21:19:18 +08:00
Eric Coissac	f48f7500cd	refactor(obilayeredmap): support generic payload types Replace the hardcoded `Counts` module with a generic `LayerData` trait, parameterizing `Layer` and `LayeredMap` over arbitrary payload types. This decouples read-path access from build-path logic, enabling both set membership and count-based indexing via `PersistentCompactIntVec`. Adds the `obicompactvec` dependency, implements streaming layer construction, and expands test coverage for persistence and multi-layer resolution.	2026-05-14 09:33:18 +08:00
Eric Coissac	0b3fcf3cf0	feat: add PersistentBitVec and upgrade PersistentCompactIntVec format Introduces PersistentBitVec, a dense, memory-mapped bit vector optimized for bulk u64-word operations and SIMD acceleration, complete with bitwise operators and Jaccard/Hamming distance metrics. Upgrades PersistentCompactIntVec to a unified .pciv format using 64-bit indices and offsets, consolidating the binary layout and updating builder/reader lifecycles accordingly. Adds corresponding documentation, updates MkDocs navigation, and implements a comprehensive test suite for persistence round-trips, edge cases, and metric accuracy.	2026-05-14 09:01:36 +08:00
Eric Coissac	c18c5d2600	feat: add sum and sumadd methods to PersistentCompactIntVec Adds a `sum()` method to compute the aggregate of all elements using u64 arithmetic to safely handle potential overflows. Introduces a `sumadd()` method for element-wise addition into the builder, enforcing strict length equality and using `checked_add` for safe accumulation. Includes unit tests to verify correct aggregation and overflow safety.	2026-05-13 21:49:22 +08:00
Eric Coissac	0733287de5	feat(obicompactvec): migrate to memory-mapped file storage Refactor `PersistentCompactIntVecBuilder` to replace in-memory `Vec<u8>` with `MmapMut` for primary storage. Decouples initialization from finalization by updating `finalize_pciv` to truncate, append overflow data, and overwrite the header placeholder. Adds file path tracking via a new `path()` accessor, implements `build_from` for efficient copying, and introduces arithmetic/set operations (`min`, `max`, `sum`, `diff`). Expands test coverage for persistence roundtrips, mutations, iteration, and the new vector operations.	2026-05-13 10:56:36 +08:00
Eric Coissac	dfce956162	feat(obicompactvec): implement Iterator for PersistentCompactIntVec Add an `Iter` struct that implements `Iterator` and `ExactSizeIterator` to enable idiomatic traversal of `&PersistentCompactIntVec`. The iterator maintains `slot` and `overflow_pos` state to correctly yield `u32` values from both primary and overflow memory regions. Includes three unit tests validating iteration correctness against direct indexing, accurate `len()` tracking, and proper reference-based iteration.	2026-05-13 10:28:17 +08:00
Eric Coissac	4d5fcd4340	refactor: split obicompactvec storage into primary and overflow files Refactors the storage format to separate primary and overflow data into distinct files. Introduces a cache-friendly sparse index with dynamically computed step and entry counts. Consolidates dual memory-mapped regions into a single file with explicit header parsing and validation, replacing unsafe slice casting with direct byte-offset indexing. Updates the test suite to accommodate the new file structure.	2026-05-13 10:25:14 +08:00
Eric Coissac	f2de79acde	Add persistent compact integer vector and cache-line-optimized MPHF Introduce the `obicompactvec` crate, featuring a two-tier, memory-mapped integer vector that uses a primary `u8` array with a sentinel for overflow dispatch and a sparse L1-resident index for fast random access. Implement builder and reader modules with zero-copy serialization and comprehensive test coverage. Update `obilayeredmap` to replace the default hash function with a cache-line-optimized `Mphf`, adding explicit bounds checking and duplicate-slot detection. Add documentation for both modules and update project configuration files accordingly.	2026-05-13 10:09:46 +08:00
Eric Coissac	84ed752b78	perf: optimize packed_seq sub() with direct bit-slice copying Replaces per-nucleotide iteration with direct bit-slice copying via `bitvec`. This eliminates per-element decoding overhead and intermediate allocations by computing the target byte length, copying the packed bit range `[start2, end2)` directly into a pre-allocated buffer, and constructing the result in a single pass.	2026-05-13 06:26:29 +08:00

146 Commits
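
Commit c8e591fc78 above describes a `partitions_to_bits` utility with a 1-bit floor, and an index configuration that rounds the partition count up to a power of two so that a bitmask can select the partition. The following is a minimal sketch of that idea, not the repository's code: the signatures, the `round_partitions` and `partition_of` helpers, and the use of the low bits of a hash are all assumptions made for illustration.

```rust
/// Round a requested partition count up to the next power of two.
/// Hypothetical helper; the name is not taken from the repository.
fn round_partitions(requested: usize) -> usize {
    // `max(1)` keeps the result well defined for a request of 0.
    requested.max(1).next_power_of_two()
}

/// Minimum number of bits needed to encode a partition id in
/// `0..partitions`, never less than 1 (the "1-bit floor").
/// The signature is an assumption; only the function name appears in the log.
fn partitions_to_bits(partitions: usize) -> u32 {
    if partitions <= 2 {
        return 1;
    }
    // ceil(log2(n)) for n > 2, computed from the leading zeros of n - 1.
    usize::BITS - (partitions - 1).leading_zeros()
}

/// Select a partition with a bitmask. This is only valid when
/// `partitions` is a power of two, which is why the count is rounded up.
/// Hypothetical helper: taking the low bits of a hash is an assumption.
fn partition_of(hash: u64, partitions: usize) -> usize {
    debug_assert!(partitions.is_power_of_two());
    (hash as usize) & (partitions - 1)
}
```

With these definitions, a request for 20 partitions would be rounded to 32, encoded on 5 bits, and `partition_of(h, 32)` would reduce to `h & 31`.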