obikmer

Author	SHA1	Message	Date
coissac	94e0a370b3	Merge pull request 'Push tmpsxsztwpxl' (#21) from push-tmpsxsztwpxl into main Reviewed-on: #21	2026-06-09 13:31:25 +00:00
Eric Coissac	970460be42	refactor: rename rebuild subcommand to filter Rename the `rebuild` CLI subcommand to `filter` to better reflect its primary purpose of row-level selection and k-mer filtering. Update all associated CLI arguments, logging, error messages, and module registrations accordingly. Introduce a dedicated `Rebuild` subcommand for index compaction, fully decoupling it from the filtering logic. Also refine related documentation to align with the new naming and semantics.	2026-06-09 15:26:37 +02:00
Eric Coissac	e66adef23d	feat: add select command for genome column projection and aggregation Introduces the `select` CLI command to project and aggregate genome-level k-mer data by column. Adds `filter` as an alias for `rebuild`. The implementation uses parallel partition processing, supports metadata-driven grouping with configurable aggregation operators, and performs atomic in-place rewrites or filtered exports. Updates documentation and navigation accordingly.	2026-06-09 15:09:47 +02:00
coissac	b0dab452f6	Merge pull request 'refactor: optimize dump partition iteration and add progress tracking' (#20) from push-xqswlxlvmyrq into main Reviewed-on: #20	2026-06-09 09:34:13 +00:00
Eric Coissac	db730e9cf6	refactor: optimize dump partition iteration and add progress tracking Refactor partition iteration to support a generic `on_partition` callback executed after each parallel partition completes. Split the logic into bounded and unbounded paths; the bounded path uses an `AtomicUsize` to enforce row limits, while the unbounded path eliminates atomic contention to improve throughput. Additionally, integrate a progress bar into the dump command by passing an increment callback to `idx.dump()`, ensuring proper initialization and cleanup.	2026-06-09 11:07:48 +02:00
coissac	f65ecd19cc	Merge pull request 'Push lrwmyplxxzkn' (#19) from push-lrwmyplxxzkn into main Reviewed-on: #19	2026-06-09 08:28:20 +00:00
Eric Coissac	7dd8db1409	docs: document conservative rounding strategy for filtering thresholds Specifies that minimum bounds use ceiling and maximum bounds use floor to enforce strictness. Clarifies that the implementation avoids explicit rounding by directly comparing integer counts against floating-point fractions, which is mathematically equivalent.	2026-06-09 10:26:21 +02:00
Eric Coissac	ce45e2fbe1	refactor: centralize k-mer filtering logic and add validation Refactor shared `FilterArgs` and `build_group_filter` to return a `Result` with explicit validation for fraction bounds, min/max ordering, and count constraints. Update conditional defaults for `--min-frac` and `--max-outgroup-count` to depend on explicit quorum flags, preventing silent configuration conflicts. Update documentation and MkDocs navigation to reflect the new centralized k-mer filtering system across `rebuild`, `dump`, and `unitig` commands.	2026-06-09 10:22:25 +02:00
Eric Coissac	2465cfbc4b	Parallelize partition iteration using Rayon Introduce thread-local `Vec<u8>` buffers to eliminate concurrent I/O contention. Replace the mutable row counter with an `AtomicUsize` and `fetch_update` to enable lock-free early termination when the limit is reached. Collected chunks are then written sequentially to preserve partition ordering.	2026-06-09 10:04:25 +02:00
Eric Coissac	d626d42ec7	feat: add --head and --presence-threshold to dump and distance Introduces `--head N` to the `dump` command for early iteration termination and `--presence-threshold N` to the `distance` command for Jaccard filtering on count indexes. Updates filter defaults to adapt based on explicit ingroup/outgroup declarations. Fixes a Rust type mismatch in the unitig closure and updates partition iteration callbacks to return `bool` for proper early termination support. Documentation is updated accordingly.	2026-06-09 10:04:25 +02:00
coissac	650eea43b6	Merge pull request 'Push quqlpklvxsqx' (#18) from push-quqlpklvxsqx into main Reviewed-on: #18	2026-06-08 18:15:01 +00:00
Eric Coissac	eb7805c747	feat: add configurable presence threshold to kmer distance Introduce a `--presence-threshold` CLI argument (default: 1) and update `KmerIndex::distance` to accept a `presence_threshold` parameter. This replaces hardcoded zero thresholds, enabling configurable filtering of low-abundance kmers during Jaccard distance calculations.	2026-06-08 20:14:33 +02:00
Eric Coissac	1ec65922df	feat: implement parallel pairwise distance matrices Introduces parallelized pairwise distance matrix computation for Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger metrics across `Columnar`, `Packed`, and `Implicit` matrix variants. Adds trait methods and convenience wrappers, safely handles normalization and zero-denominator edge cases, and updates test suites to import required traits for validation.	2026-06-08 20:08:09 +02:00
Eric Coissac	09d9e21744	feat: integrate tracing and enhance bit matrix operations Add the `tracing` crate to `obidebruinj`, `obisys`, and resolve it in `Cargo.lock`. Replace `eprintln!` statements with structured `debug!` and `info!` macros. Introduce a `TracedBar` wrapper for progress bars and enhance the `Stage` lifecycle to emit structured events for timing, memory metrics, and swap warnings. Add a progress spinner for unitig degree computation. Extend `PersistentBitMatrix` with columnar bit-vector operations and parallel distance methods, enabling uniform distance computations across all storage layouts while replacing previous panics with dimension-based fallbacks.	2026-06-08 19:55:06 +02:00
coissac	3f47e22083	Merge pull request 'Push pvqkqxlkkwry' (#17) from push-pvqkqxlkkwry into main Reviewed-on: #17	2026-06-06 04:44:10 +00:00
Eric Coissac	03c7bb0b99	Relax unitig assertion in debruijn test Replace the strict `unitigs.len() == 1` assertion with a non-empty check to allow multiple unitigs. Update the test comment to describe the general non-repetitive sequence recovery principle instead of a specific example. The core k-mer roundtrip validation logic remains unchanged.	2026-06-06 06:41:45 +02:00
Eric Coissac	b39eee688a	refactor(debruijn): unify graph traversal with WalkState iterator Replaces deeply nested branching with early returns and `then_some`. Introduces a cycle-detecting `find_chain_start` method and updates `UnitigNucIter` to use step-based iteration with atomic node claiming. This eliminates nested iterators and redundant state management, improving code readability and maintainability.	2026-06-06 06:38:28 +02:00
Eric Coissac	95b3461405	refactor: centralize graph traversal logic in walk Refactor `leavable` and `reachable` to eliminate duplicated graph traversal logic by mutually delegating via `WalkState`. `leavable` now returns `self.walk(graph).is_some()`, while `reachable` delegates to the inverted `direct` state's `leavable` check. This centralizes kmer extension and visited-state validation in `walk`, simplifying control flow and reducing code duplication.	2026-06-06 06:36:48 +02:00
Eric Coissac	952a21eef8	refactor: remove naked_asm and extract collect_unitigs helper Remove the `std::arch::naked_asm` import as it is no longer required. Introduce a `collect_unitigs` helper to abstract nucleotide sequence extraction from `GraphDeBruijn`, and refactor the test suite to use it, eliminating repetitive collection code and standardizing graph iteration logic.	2026-06-06 04:33:59 +02:00
Eric Coissac	5c2f48535f	refactor: rename compute_degrees and mark start nodes Renames `compute_degrees` to `compute_degrees_and_mark_starts` across the De Bruijn graph and partitioner layers to consolidate degree calculation and start-node flagging. Introduces safe neighbor iteration methods and a debug validation block to verify graph consistency. Refactors unitig extraction to use sequential execution with a `Mutex` for safe error propagation. Fixes malformed and duplicated method calls, adds auto-generation of missing `meta.json` files, and ensures persistent matrix builders are explicitly closed to finalize metadata.	2026-06-05 19:48:59 +02:00
Eric Coissac	27088ab810	refactor: optimize unitig iteration and graph traversal Switches unitig processing to a lazy, fallible `try_for_each_unitig` API across partitioner layers, reducing intermediate allocations and enabling proper error propagation. Refactors de Bruijn graph traversal into a two-pass algorithm with explicit node flags, named constants, and diagnostic logging. Introduces parallel chain processing and staged performance profiling for the unitig command, and adds a memory-efficient `FromIterator` implementation for packed nucleotide sequences.	2026-06-05 19:48:59 +02:00
coissac	ea2c594c86	Merge pull request 'Push ruqusmkoyvwm' (#16) from push-ruqusmkoyvwm into main Reviewed-on: #16	2026-06-05 08:41:08 +00:00
Eric Coissac	d202ead385	feat: parallelize unitig extraction and FASTA output Replace the non-atomic `set_visited` with atomic `fetch_or` bitmask operations to enable thread-safe node claiming. Introduce a two-phase extraction pipeline where `par_for_each_chain_unitig` builds chains in parallel and `for_each_remaining_unitig` sequentially handles residual cycles and junctions. Add `is_start` and `collect_from_start` to explicitly define unitig boundaries. Wrap `BufWriter` in a `Mutex` and use an `AtomicUsize` counter to ensure thread-safe concurrent FASTA output, refactoring the write logic into a shared closure for safe multi-threaded execution.	2026-06-05 10:33:52 +02:00
Eric Coissac	249998beed	perf: add structured performance profiling for unitig stages Wraps graph construction, degree computation, and unitig enumeration phases with `Stage` start/stop calls. Intervals are recorded in a `Reporter` instance and printed upon completion to provide granular timing metrics for each computational stage.	2026-06-05 10:28:45 +02:00
Eric Coissac	2f29ee2240	feat: Add parallel execution and thread-safe graph operations Integrate rayon to enable parallel processing of k-mer partitions and degree computation. Replace Cell with AtomicU8 to ensure thread-safe node state management, and add a merge method for combining disjoint graphs. Additionally, introduce progress tracking utilities and a test-utils feature flag for development dependencies.	2026-06-04 23:22:55 +02:00
Eric Coissac	edd5e3f8ee	feat: add bits-per-kmer diagnostic and stats module Introduce a `stats` module to compute normalized storage efficiency metrics. The new `KmerIndex::bits_per_kmer()` method parallelizes disk I/O across partitions to aggregate file sizes for MPHF, evidence, and matrix components. Publicly export `IndexBitsPerKmer` and add a `--bits-per-kmer` CLI flag to trigger the diagnostic routine and print detailed statistics.	2026-06-04 23:17:17 +02:00
Eric Coissac	bb7adc1154	docs: expand kmer indexing, filtering, and merging documentation Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.	2026-06-04 22:59:41 +02:00
Eric Coissac	9306ec1c56	perf: Replace manual window tracking with monotonic deque algorithm Eliminates intermediate allocations by computing per-genome window minimums (`win_min`) directly. Unifies the `z ≤ 1` and `z > 1` branches into a single buffer-reused accumulation loop, efficiently validating k-mer presence.	2026-06-04 21:37:09 +02:00
Eric Coissac	712a03a3a6	refactor: replace unitig extraction with de Bruijn graph pipeline This change replaces direct partition-based extraction with a pipeline that reconstructs a de Bruijn graph from filtered k-mers. It introduces `FilterArgs` for k-mer selection, collects filtered k-mers in parallel into a `GraphDeBruijn`, computes node degrees, and enumerates unitigs from the graph for output instead of reading pre-computed partition files.	2026-06-04 21:32:49 +02:00
Eric Coissac	3e62ffe010	feat: add selective k-mer filtering to dump and rebuild commands Add the `obidebruinj` dependency and introduce `FilterArgs` CLI arguments for ingroup/outgroup predicates and count/fraction thresholds. Extend `GroupFilterParams` to support outgroup filtering, and integrate the filter collection into `KmerIndex::dump` and `rebuild` commands. This enables selective k-mer filtering during index operations and CSV exports.	2026-06-04 21:29:58 +02:00
Eric Coissac	a1499e6153	feat: add kmer filtering and refactor layer iteration Introduce a `passes_all` utility to validate kmer rows against multiple filters using short-circuit logic. Integrate a `filters` parameter into the iteration functions to conditionally emit kmers based on filter results. Extract repetitive layer traversal and filtering into an `iter_src_layers` helper, refactoring Pass 1 and Pass 2 to eliminate duplication. Additionally, add a debug conditional to the dump output to include partition and layer metadata alongside kmer sequences.	2026-06-04 21:08:15 +02:00
Eric Coissac	476c7a6394	feat: add metadata-driven k-mer filtering for rebuild command Introduces a metadata-driven filtering system for the rebuild command, classifying genomes into ingroup and outgroup categories using exact, inequality, and hierarchical path predicates. Implements a GroupQuorumFilter to enforce configurable presence thresholds and fraction constraints per group. Refactors the command to replace global quorum filters with this unified approach, converts the presence flag to a threshold parameter, and adds corresponding documentation and MkDocs navigation.	2026-06-04 21:01:58 +02:00
coissac	edc18b4908	Merge pull request 'Push rrwpnquuzsvr' (#15) from push-rrwpnquuzsvr into main Reviewed-on: #15	2026-06-03 17:04:25 +00:00
Eric Coissac	02cb30c0ef	feat: add obisys crate for standardized CLI progress reporting This commit introduces the `obisys` crate, which wraps `indicatif` to provide reusable `spinner` and `progress_bar` utilities with consistent styling and tick intervals. It refactors progress reporting across `obikindex`, `obikpartitionner`, and `obikmer` to use these shared functions, eliminating inline UI configuration and ensuring uniform terminal feedback.	2026-06-03 19:03:59 +02:00
Eric Coissac	4677d6f177	refactor: improve resource cleanup and index packing Explicitly close file handles and remove temporary artifacts after serialization to prevent disk space leaks. Additionally, compact internal matrix structures immediately upon loading the KmerIndex to improve memory efficiency and prepare for downstream operations.	2026-06-03 15:35:56 +02:00
coissac	7a29ca6305	Merge pull request 'Push ywwwypqxrtmy' (#14) from push-ywwwypqxrtmy into main Reviewed-on: #14	2026-06-03 13:18:41 +00:00
Eric Coissac	bba5147f0f	fix: account for k-mer overlap in total_bases calculation Introduces a `kmer_overlap` variable (`k-1`) and modifies the `total_bases` accumulation to subtract this overlap from each sequence's length. This ensures the base count accurately reflects only valid k-mer starting positions rather than raw sequence length.	2026-06-03 15:11:48 +02:00
Eric Coissac	bfe0cb4b82	feat: integrate obikseq to configure global k-mer and minimizer sizes This change adds the `obikseq` crate as a local dependency and inserts `set_k` and `set_m` calls across index creation and command modules. By synchronizing the runtime's global k-mer and minimizer dimensions with the loaded index parameters, downstream sequence processing and partitioning operations now consistently use the correct structural constraints.	2026-06-03 14:31:14 +02:00
Eric Coissac	173ac9fb42	feat: introduce packed matrix storage and layer metadata Unifies bit and integer matrix storage into `PersistentBitMatrix` and `PersistentCompactIntMatrix` enums, supporting both columnar and memory-mapped single-file layouts. Introduces `LayerMeta` to persist layer dimensions as `layer_meta.json`, enabling correct initialization of implicit presence matrices. Adds CLI commands (`pack` and `--upgrade-index`) to convert existing columnar indices to the compact format and backfill missing metadata. Updates partitionner and layered map logic to use the new persistent builders, optimized memory allocation, and auto-detected storage backends.	2026-06-03 14:16:04 +02:00
Eric Coissac	de1a41810a	perf: enable zero-allocation queries and memory-mapped indexes Introduce zero-allocation row extraction and query result buffers across `obicompactvec` and `obikpartitionner` to eliminate per-kmer heap allocations. Replace in-memory MPHF deserialization with memory-mapped, zero-copy views to reduce runtime memory footprint. Add configurable I/O chunking, a RAM-aware `--chunk-size` parameter, and system memory monitoring via the new `sysinfo` dependency. Re-export `PreloadedIndex` for external consumers.	2026-06-03 10:24:12 +02:00
Eric Coissac	1661dd6b1c	feat: introduce preloaded index cache and thread-safe progress tracker Introduce `PreloadedIndex` to cache partition indices and eliminate redundant I/O during repeated queries. Refactor the query pipeline to route through this pre-loaded index, and expose it publicly in `obikpartitionner`. Additionally, add a thread-safe, lazily-initialized `MultiProgress` singleton for improved progress tracking.	2026-06-03 09:42:18 +02:00
Eric Coissac	2ebc5f0d75	chore: add logging infrastructure to merge routine Adds comprehensive logging for source metadata, merge modes, and forced approximation detection. Introduces `format_evidence` and `is_trivial` helpers to format `IndexMode` variants and identify single-genome presence indices. The core merge algorithm remains unmodified, with all changes focused on enhanced runtime observability.	2026-06-01 15:23:37 +02:00
coissac	4fd0eb989f	Merge pull request 'Push pnxswqpxlyso' (#13) from push-pnxswqpxlyso into main Reviewed-on: #13	2026-06-01 12:45:46 +00:00
Eric Coissac	add6d7f873	enforce uniform index mode and optimize base index selection Adds validation to ensure all input sources share the same `IndexMode`. Introduces base index selection logic that prioritizes approximate or hybrid evidence and maximizes base size to minimize newly indexed k-mers. Includes helper functions for triviality evaluation, cumulative size calculation, and mode consistency checks.	2026-06-01 14:43:51 +02:00
Eric Coissac	0350ca855b	refactor: streamline merge pipeline and MPHF indexing Replace mphf.find() with direct mphf.index() calls to eliminate absence checks and fallback vectors. Introduce a lightweight MphfOnly wrapper for faster index loading, and standardize k-mer iteration across merge and rebuild layers. Update IndexMeta configuration and n_new calculation to leverage MPHF cardinality, streamlining the overall merge pipeline.	2026-06-01 14:37:35 +02:00
Eric Coissac	1e2115a1b0	docs: Update README to reflect new indexing workflow Replace the documented direct combined indexing command with a two-step workflow. This involves building separate exact indexes per genome, merging them into a single multi-genome index via `obikmer merge`, and keeping the approximate index conversion unchanged.	2026-06-01 13:56:48 +02:00
Eric Coissac	657f964dda	refactor: abstract k-mer types and fix bit alignment Abstracts k-mer storage using a `RawKmer` alias and `KMER_BITS` constant to simplify bit manipulation and enable future extension to larger types. Updates bit-shifting and masking logic across `kmer.rs` and `packed_seq.rs` to prevent overflow and improve type safety. Adapts the MPHF layer to iterate over indexed canonical k-mers with explicit slot bounds validation and bit-level encoding. Fixes test suite compilation errors by correcting method names, adding tuple destructuring, and passing the required `IndexMode::Exact` parameter.	2026-05-31 20:46:49 +02:00
Eric Coissac	8b57c91ab7	feat: add iter_canonical_kmers iterator and tests Implements `iter_canonical_kmers` in `unitig_index` to yield canonical kmers by traversing unitig chunks. Includes unit tests to verify that the iterator exclusively yields canonical kmers and matches `iter_kmers` in count and canonical equivalence.	2026-05-30 16:30:02 +02:00
coissac	728476a0a6	Merge pull request 'Push rxrxptuxltlp' (#12) from push-rxrxptuxltlp into main Reviewed-on: #12	2026-05-30 14:03:13 +00:00
Eric Coissac	8a0b898b4b	docs: clarify query pipeline, Findere trick, and input formats Fix a stray prefix in the README heading and update documentation to reflect the query pipeline's operation on `s-mers` (`s = k - z + 1`) with post-partition z-window filtering. Clarify the Findere trick, including k-mer size reduction, consecutive match requirements, and false positive rate calculations. Additionally, expand input format documentation to cover supported file extensions, gzip compression, recursive directory handling, and `query` command specifications.	2026-05-30 15:59:12 +02:00
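
Commit 7dd8db1409 in the list above states that minimum bounds use ceiling, maximum bounds use floor, and that comparing integer counts directly against floating-point fractions is mathematically equivalent to rounding first. The sketch below illustrates that equivalence under stated assumptions: the function names and the `(count, n_genomes, frac)` signature are hypothetical and are not taken from the obikmer source.

```rust
/// Hypothetical helper (not from the obikmer source):
/// minimum bound checked without explicit rounding.
fn passes_min(count: u32, n_genomes: u32, min_frac: f64) -> bool {
    f64::from(count) >= min_frac * f64::from(n_genomes)
}

/// Same minimum bound, with the threshold rounded up first (ceiling).
fn passes_min_ceil(count: u32, n_genomes: u32, min_frac: f64) -> bool {
    count >= (min_frac * f64::from(n_genomes)).ceil() as u32
}

/// Hypothetical helper: maximum bound checked without explicit rounding.
fn passes_max(count: u32, n_genomes: u32, max_frac: f64) -> bool {
    f64::from(count) <= max_frac * f64::from(n_genomes)
}

/// Same maximum bound, with the threshold rounded down first (floor).
fn passes_max_floor(count: u32, n_genomes: u32, max_frac: f64) -> bool {
    count <= (max_frac * f64::from(n_genomes)).floor() as u32
}

fn main() {
    let n_genomes = 7u32;
    // Because `count` is an integer, count >= x holds exactly when
    // count >= ceil(x), and count <= x holds exactly when count <= floor(x).
    for &frac in &[0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0] {
        for count in 0..=n_genomes {
            assert_eq!(
                passes_min(count, n_genomes, frac),
                passes_min_ceil(count, n_genomes, frac)
            );
            assert_eq!(
                passes_max(count, n_genomes, frac),
                passes_max_floor(count, n_genomes, frac)
            );
        }
    }
    println!("direct and rounded comparisons agree");
}
```

With 7 genomes and a fraction of 0.5, the threshold is 3.5: a count of 4 satisfies the minimum bound and a count of 3 satisfies the maximum bound, whichever form of the comparison is used.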

178 Commits