Push qkpyqurltlpk #1

Merged
coissac merged 11 commits from push-qkpyqurltlpk into main 2026-05-21 16:57:19 +00:00
Owner
No description provided.
coissac added 11 commits 2026-05-21 16:57:09 +00:00
Refactors obikindex and obikpartitionner to delegate index construction to a new layered MPHF implementation. Adds resume-safe building with abundance filtering and count persistence, while introducing a PartitionMeta struct for JSON configuration persistence. Updates OKIError to wrap layer-specific errors, replaces single-path extraction with full path collection and logging, and registers new internal dependencies across the workspace.
Replace the single `kmer_spectrum_raw.json` output with per-partition JSON files in a `spectrums/` directory. Add a `keep_intermediate` parameter to control intermediate file cleanup, and introduce a `write_spectrum` helper for serialization. Update the completion sentinel to `count.done` and align state documentation accordingly.
Added implementation specifications for the `merge` operation, detailing parallel partition processing, I/O paths, and kmer matrix aggregation across multiple indexes. Integrated an `indicatif` progress bar into the `rayon` parallel loop to monitor processing position, throughput, ETA, and recent partition duration.
Introduces a new `merge` CLI subcommand and underlying implementation to consolidate multiple pre-indexed k-mer indexes into a single output. Adds `append_column` methods to persistent bit and int matrices to enable incremental genome column expansion without rebuilding the MPHF. Includes new error variants for index readiness and configuration mismatches, adds a `--force` flag to the index command, and updates documentation and navigation structure accordingly.
This change introduces a new `dump` subcommand that exports all indexed k-mers to a CSV stream. The implementation spans multiple crates, adding core export logic to `obikindex` and partition iteration to `obikpartitionner`. The command supports a `--force-presence` flag to output binary presence/absence data instead of stored counts, and includes necessary module registrations and structural updates across the codebase.
This commit enhances the CLI and index pipelines by introducing `--force-presence` to normalize output to binary values, `--debug` to expose partition and layer metadata, and `--rename-duplicates` to automatically disambiguate overlapping genome labels. It updates the partitioner and index layers to auto-discover layers, persist `meta.json` for single-genome builds, and fix per-source column offsets during merging. A `DuplicateGenomeLabel` error variant is also added, and stale directories are properly managed in presence/absence mode.
Implements a new `merge` command that aggregates k-mer counts and presence/absence matrices from multiple source indices using a parallelized, partition-based algorithm. Adds CLI progress bars and execution timing across the bootstrap, spectrum rebuild, and merge phases. Updates logging to report the aggregate genome count and introduces a bounds check in the perfect hash layer to safely return `None` for unknown k-mers, preventing out-of-bounds access in downstream operations.
This commit introduces a new `distance` CLI subcommand that computes pairwise genomic distance matrices using configurable metrics (Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger). It optionally generates phylogenetic trees (NJ or UPGMA) in Newick format and outputs results as CSV. The implementation adds a robust distance computation backend that dynamically routes to optimized backends based on index configuration, supports parallel iteration, and gracefully handles missing data. Additionally, it adds a `dump` task for exporting k-mer to genome mappings as CSV, introduces an `InvalidInput` error variant, updates dependencies to support numerical operations and tree construction, and performs minor module reorganizations.
This commit introduces a new `rebuild` CLI subcommand that reconstructs an existing multi-layer k-mer index into a compact, single-layer index. It implements a configurable filtering pipeline supporting min/max genome fraction/count and total count thresholds, parallel partition processing via `rayon`, and CLI progress tracking. The change also restructures module declarations across `obikindex` and `obikpartitionner` to integrate the new rebuild and layer-handling logic.
This commit introduces CLI argument parsing for the `superkmer` command via a new `SuperkmerArgs` struct. It also adds a `partitions_to_bits` utility to compute the minimum bit width for partition encoding, enforcing a 1-bit floor. Finally, the index configuration automatically rounds the partition count up to the nearest power of two to ensure compatibility with bitmask-based indexing operations.
This change introduces the `query` CLI command and its supporting infrastructure for sequence-to-genome mapping and k-mer matching. It adds a `QueryLayer` abstraction backed by MPHF and persistent matrices, exposes the index partition for direct querying, and implements `Hash`/`Eq` for `RoutableSuperKmer`. The command ingests sequence batches, deduplicates superkmers, routes them to index partitions for parallel exact or 1-mismatch matching, and outputs results as FASTA records annotated with JSON metadata. Includes `serde_json` dependency addition, module exports, and documentation updates.
coissac merged commit 77a0186fae into main 2026-05-21 16:57:19 +00:00
coissac deleted branch push-qkpyqurltlpk 2026-05-21 16:57:19 +00:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: OBIKmers/obikmer#1