99 Commits

Author SHA1 Message Date
coissac e22afe9621 Merge pull request 'chore: bump obikmer to 1.1.9 and update release workflow' (#38) from push-noxuppsknsol into main
CI / build (push) Successful in 3m11s
Release / build-linux-static (push) Failing after 3m4s
Reviewed-on: #38
2026-06-22 13:32:50 +00:00
Eric Coissac bdfac71e65 chore: bump obikmer to 1.1.9 and update release workflow
CI / build (push) Successful in 3m24s
CI / build (pull_request) Failing after 0s
Bumps the obikmer crate version from 1.1.7 to 1.1.9 in Cargo.toml and Cargo.lock. Updates the Gitea release workflow to dynamically locate the Zig compiler via Python, generating musl-targeted gcc/g++ wrapper scripts installed to /usr/local/bin for static Linux cross-compilation during releases.
2026-06-22 15:32:10 +02:00
coissac a00bb37478 Merge pull request 'ci: switch to Zig build toolchain and bump obikmer to 1.1.7' (#37) from push-nvvqmzmrotxx into main
CI / build (push) Successful in 3m14s
Release / build-linux-static (push) Failing after 2m42s
Reviewed-on: #37
2026-06-22 13:20:12 +00:00
Eric Coissac d30a4efd9b ci: switch to Zig build toolchain and bump obikmer to 1.1.7
CI / build (push) Successful in 3m12s
CI / build (pull_request) Successful in 3m16s
Replaces the musl-based static Linux build toolchain with Zig (`ziglang` via pip and `cargo-zigbuild`), removing `musl-tools` dependencies. The workflow now invokes `cargo zigbuild` for cross-compiling the static binary. Additionally, bumps the `obikmer` crate version to 1.1.7.
2026-06-22 15:19:39 +02:00
coissac 6baf2e64ca Merge pull request 'chore: bump obikmer to 1.1.6 and automate git tagging' (#36) from push-yxmtknzsynpx into main
CI / build (push) Successful in 3m20s
Release / build-linux-static (push) Failing after 4m30s
Reviewed-on: #36
2026-06-22 13:01:06 +00:00
Eric Coissac c0a71a2d49 chore: bump obikmer to 1.1.6 and automate git tagging
CI / build (push) Successful in 4m46s
CI / build (pull_request) Successful in 4m27s
Bumps the obikmer crate version from 0.1.4 to 1.1.6 in Cargo.toml and Cargo.lock. Updates the Makefile release target to automatically extract the version, create a Git tag, and push it to the remote repository, extending the existing workflow with standard Git publishing steps.
2026-06-22 15:00:36 +02:00
coissac a609c1af95 Merge pull request 'ci: streamline release workflow and bump obikmer to 0.1.4' (#35) from push-zokprynyqunu into main
CI / build (push) Successful in 4m41s
Release / build-linux-static (push) Failing after 6m4s
Reviewed-on: #35
2026-06-22 09:38:36 +00:00
Eric Coissac 3d32be8a83 ci: streamline release workflow and bump obikmer to 0.1.4
CI / build (push) Successful in 4m33s
CI / build (pull_request) Successful in 4m17s
Replaces the intermediate artifact upload step in the Gitea release workflow with a direct REST API call. This removes dependencies that are no longer needed and adds `jq`. Also increments the obikmer crate version to 0.1.4.
2026-06-22 11:38:19 +02:00
coissac c4c71dc892 Merge pull request 'Push mtzqmmrlmzzx' (#34) from push-mtzqmmrlmzzx into main
CI / build (push) Successful in 4m34s
Reviewed-on: #34
2026-06-22 08:47:24 +00:00
Eric Coissac 4e625afaba refactor: update CI toolchain setup and optimize parallel indexing
CI / build (push) Successful in 4m56s
CI / build (pull_request) Successful in 4m11s
Update CI workflows to explicitly install the Rust toolchain via rustup and configure musl targets for deterministic static builds in Docker containers. Bump obikmer dependency to 0.1.3. Refactor obicompactvec to reduce peak memory usage by computing column sizes from filesystem metadata, add atomic writes, and implement cleanup guards. Replace parallel iteration patterns in obikindex with a structured PartitionRunner pipeline for simplified error handling and progress tracking.
2026-06-22 10:46:24 +02:00
Eric Coissac a522c0907e feat: add CI/CD workflows, release automation, and CLI version flag
CI / build (push) Failing after 2m41s
CI / build (pull_request) Failing after 6s
Adds Gitea Actions for continuous integration and tagged releases, including static musl binary compilation and artifact upload. Introduces a Makefile target to automate semantic version bumping and publishing. Bumps the package version to 0.1.1 and enables automatic `--version` output via Clap.
2026-06-22 10:36:40 +02:00
Eric Coissac c1d6f277ce feat(select): add metrics reporting to selection methods
Integrates an obisys::Reporter across indexing and command modules to capture execution metrics. Replaces discarded timer stops with explicit rep.push() calls, adds timing instrumentation for the pack stage, and prints collected reports after each selection branch.
2026-06-22 10:25:24 +02:00
Eric Coissac 9356be4ec0 feat: introduce obitaxonomy crate for hierarchical taxonomy parsing
Adds the `obitaxonomy` crate to parse and validate hierarchical taxonomy paths using a strict `taxonomy:/name@rank/...` syntax. Replaces generic string-based path matching in predicates with structured `TaxPath` and `TaxPattern` types, enforcing explicit anchor constraints and rank-aware semantics. Updates filtering documentation to clarify optional leading slashes and segment-boundary matching rules.
2026-06-22 10:24:04 +02:00
Eric Coissac c694e1f2b0 feat: add benchmark pipeline, expose APIs, and enforce strict paths
Introduces a Make-based orchestration for simulating, indexing, merging, filtering, and verifying k-mer counts and presence. Exposes internal builder and iterator APIs publicly, enforces mandatory leading slashes for predicate patterns, registers the `obitaxonomy` crate, and updates tooling configurations alongside documentation.
2026-06-22 10:18:33 +02:00
Eric Coissac 280ca1f5a3 feat: add optimized new_ones constructor for all-ones bit vectors
Introduces `new_ones` and `add_col_ones` methods to directly initialize all-ones bit vectors and matrix columns. This replaces redundant initialization sequences that created zero-filled structures and applied bitwise NOT, with a single pass that writes contiguous 0xFF bytes to disk. The change eliminates inversion overhead, streamlines test setup, and improves performance for filter mask intersection logic while preserving identical semantics.
2026-06-22 10:00:01 +02:00
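The difference described in 280ca1f5a3 can be illustrated in memory. This is a minimal sketch, not the crate's code: `ones_bytes`, `ones_via_not`, and `count_ones` are hypothetical names, the vector is a plain `Vec<u8>` rather than a file, and it assumes LSB-first bit order with padding bits in the last byte kept at zero.

```rust
/// Build an all-ones bit vector of `n_bits` bits in one pass by filling
/// whole bytes with 0xFF. Assumption: padding bits in the last byte are
/// cleared so that population counts stay exact.
fn ones_bytes(n_bits: usize) -> Vec<u8> {
    let n_bytes = (n_bits + 7) / 8;
    let mut bytes = vec![0xFFu8; n_bytes];
    let tail = n_bits % 8;
    if tail != 0 {
        // Keep only the low `tail` bits of the final byte.
        bytes[n_bytes - 1] = (1u8 << tail) - 1;
    }
    bytes
}

/// The two-step sequence the commit message describes replacing:
/// zero-fill first, then apply a bitwise NOT to every byte.
fn ones_via_not(n_bits: usize) -> Vec<u8> {
    let n_bytes = (n_bits + 7) / 8;
    let mut bytes = vec![0u8; n_bytes];
    for b in bytes.iter_mut() {
        *b = !*b;
    }
    let tail = n_bits % 8;
    if tail != 0 {
        bytes[n_bytes - 1] &= (1u8 << tail) - 1;
    }
    bytes
}

/// Number of set bits across the whole byte slice.
fn count_ones(bytes: &[u8]) -> u32 {
    bytes.iter().map(|b| b.count_ones()).sum()
}
```

Both functions return the same bytes for any length; for 13 bits that is `[0xFF, 0x1F]`, with 13 set bits. The single-pass version simply skips the intermediate zero-filled state.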
Eric Coissac 9abb2db92f refactor: replace explicit bit-setting loops with optimized bulk operations
Refactor bitmatrix, colgroup, and layer modules to replace manual iteration with concise `or_where` predicates and bulk inversion calls. This simplifies the codebase and leverages optimized internal implementations for improved performance.
2026-06-22 09:56:41 +02:00
Eric Coissac 7c1efa9cbb feat: add vectorized column filters and optimize partitioner iteration
Adds `FilterMask` and conditional bitwise methods (`*_where`) to `obicompactvec` for composable column-based slot filtering. Extends `obikpartitionner` with a `MatrixGroupOps` trait and `column_mask_expr` method to express aggregate constraints as vectorized masks. Refactors matrix builder management into a unified `Builders` enum and introduces `try_compute_combined_mask`, enabling O(1) slot checks and skipping unnecessary row reads during partitioning and rebuilding passes.
2026-06-22 09:54:51 +02:00
Eric Coissac 4c4524766c feat(matrix): add partial group reductions and column persistence
Expands MatrixGroupOps with partial_group_min/max helpers for bitwise reductions and introduces add_col_from methods to persist external vectors as matrix columns. Refactors column aggregation in the partitioner to leverage these group operations directly, replacing iterative row processing with simplified builder lifecycle management and explicit metadata serialization.
2026-06-22 09:49:04 +02:00
Eric Coissac 7eea71fdcd docs(obicompactvec): update API docs and algorithm descriptions
Replace trait-based API documentation with concrete, zero-copy view structs and update all associated diagrams. Refine algorithmic descriptions for sentinel handling, overflow stores, and bulk operations. Clarify temporary file lifecycles and group-chunking strategies to support memory-efficient parallel aggregation.
2026-06-22 09:46:19 +02:00
Eric Coissac f91c5a3f79 refactor(obicompactvec): unify bit and int vector slice views
Refactors column and matrix access to use unified `BitSliceView` and `IntSliceView` abstractions, replacing legacy `PackedCol`/`IntColView` types. Introduces `BitSlice`/`IntSlice` traits for zero-copy, trait-based bitwise and arithmetic operations across persistent and temporary vector types. Removes deprecated in-memory `MemoryBitVec` and `MemoryIntVec` implementations and their tests, while updating dependent crates to use the new view-based API and `BitSliceMut` trait.
2026-06-17 23:51:32 +02:00
Eric Coissac fb4962c4fe refactor: replace in-memory vectors with temp-file-backed storage
Introduces `TempCompactIntVec` and `TempBitVec` as temporary, file-backed intermediates to replace eager in-memory vectors, enabling OS-level paging under memory pressure. Updates the `MatrixGroupOps` trait to return `io::Result` types, allowing proper error propagation and supporting chunked accumulation for large column groups. Includes builder patterns with `.freeze()` finalization, automatic `TempDir` cleanup on drop, and necessary test updates to handle the new fallible signatures. Also fixes `Cargo.toml` section ordering.
2026-06-17 23:36:15 +02:00
Eric Coissac 1d38d87ff9 Add column group operations and mask_with trait
Introduce the `ColGroup` struct and `MatrixGroupOps` trait to manage named subsets of column indices and perform additive aggregations (count, sum, any). Implement these operations for `PersistentBitMatrix` and `PersistentCompactIntMatrix`, applying size-optimized branches for presence counts and direct accumulation for small groups. Additionally, add a `mask_with` trait method that zeroes elements according to a mask and is optimized for sparse masks, running in O(n_zeros) time. Include comprehensive tests covering overflow handling, slot masking, and result additivity across partitioned data.
2026-06-17 23:28:52 +02:00
Eric Coissac 93559c3294 feat: introduce unified column view types for bit and int matrices
This commit introduces `BitColView` and `IntColView` to abstract over Columnar and Packed storage formats, implementing `BitSlice` and `IntSlice` for uniform column access. It adds `col_view()` accessors to `PersistentBitMatrix` and `PackedCompactIntMatrix`, explicitly panicking on implicit variants. The new types are publicly re-exported, and unit tests are added to validate per-element retrieval, aggregation methods, and parity with the original columnar representation.
2026-06-17 23:24:11 +02:00
Eric Coissac 1f0d77d5bf docs: document compact vector implementation with Mermaid diagrams
Add Mermaid diagrams to visualize the trait hierarchy, compact int storage layout, and SIMD-vectorizable arithmetic operations for MemoryIntVec and PersistentCompactIntVec. Also document concrete type structures and planned layer/partition composition rules to improve documentation clarity.
2026-06-17 23:20:55 +02:00
Eric Coissac eeba43ac4f docs: add technical reference for obicompactvec module
Document the two-tier compact integer encoding, BitSlice/IntSlice trait hierarchy, and SIMD-friendly O(n+k) algorithms. Include details on concrete memory and persistent vector types, matrix aggregation traits, and planned group-filtering APIs.
2026-06-17 23:19:31 +02:00
Eric Coissac 7ed7b26039 perf: optimize vec arithmetic and add overflow tests
Refactor `cmp_scalar`, `min`, `max`, `add`, and `diff` to operate directly on the primary byte array, deferring overflow slot resolution to a secondary pass. This eliminates HashMap lookups in the hot path and enables SIMD vectorization. Add six unit tests to validate correct promotion and demotion between storage slots when values cross the 255 threshold.
2026-06-17 23:18:18 +02:00
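The two-pass idea in 7ed7b26039 can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: `CompactVec`, `SENTINEL`, and the method names are hypothetical, the encoding (bytes 0 to 254 stored directly, 255 meaning "look in the overflow map") is assumed from the 255 threshold mentioned in the message, and no claim is made that this particular loop auto-vectorizes.

```rust
use std::collections::HashMap;

/// Assumed encoding: 255 in the primary array marks an overflow slot.
const SENTINEL: u8 = u8::MAX;

#[derive(Debug, Default, Clone, PartialEq)]
struct CompactVec {
    primary: Vec<u8>,
    overflow: HashMap<usize, u32>,
}

impl CompactVec {
    fn from_values(values: &[u32]) -> Self {
        let mut v = CompactVec {
            primary: vec![0; values.len()],
            overflow: HashMap::new(),
        };
        for (i, &x) in values.iter().enumerate() {
            v.set(i, x);
        }
        v
    }

    fn get(&self, i: usize) -> u32 {
        match self.primary[i] {
            SENTINEL => self.overflow[&i],
            b => u32::from(b),
        }
    }

    fn set(&mut self, i: usize, x: u32) {
        if x < u32::from(SENTINEL) {
            // Demotion: the value fits in a byte again.
            self.primary[i] = x as u8;
            self.overflow.remove(&i);
        } else {
            // Promotion: the value moves to the overflow map.
            self.primary[i] = SENTINEL;
            self.overflow.insert(i, x);
        }
    }

    /// Element-wise `self += other`.
    fn add_assign(&mut self, other: &CompactVec) {
        assert_eq!(self.primary.len(), other.primary.len());
        let mut deferred = Vec::new();
        // Pass 1: byte-only loop with no HashMap access. Slots that
        // involve a sentinel or would reach 255 are left untouched.
        for (i, (a, &b)) in self.primary.iter_mut().zip(&other.primary).enumerate() {
            let s = u16::from(*a) + u16::from(b);
            if *a != SENTINEL && b != SENTINEL && s < u16::from(SENTINEL) {
                *a = s as u8;
            } else {
                deferred.push(i);
            }
        }
        // Pass 2: resolve the deferred slots through the overflow maps.
        for i in deferred {
            let total = self.get(i) + other.get(i);
            self.set(i, total);
        }
    }
}
```

Adding `[2, 100, 0, 5]` to `[1, 200, 254, 300]` yields `[3, 300, 254, 305]`: the second slot is promoted to the overflow map, and the fourth stays there.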
Eric Coissac 26de90f18d feat: add iteration and aggregation to compact int vec
Implemented `sum()`, `count_nonzero()`, and `iter()` to complete the numeric vector interface. The builder now computes aggregate values across memory-mapped regions and overflow entries, while the reader delegates these operations to its inherent methods. The iterator provides zero-copy access to underlying `u32` elements.
2026-06-17 23:15:56 +02:00
Eric Coissac 497d250d8a refactor: replace byte-level bit iteration with 64-bit words
Refactor `BitIter` to process `u64` chunks using word-aligned shifts instead of byte-level operations. Introduce a dedicated `MemoryBitIter` for `MemoryBitVec`, updating its `iter()` and `IntoIterator` implementations accordingly. Hide `MemoryBitIter` from the public API to narrow the crate's interface, while leveraging explicit alignment guarantees for safer and more efficient bit extraction.
2026-06-17 23:11:12 +02:00
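Word-at-a-time bit iteration, as described in 497d250d8a, might look like the sketch below. `WordBitIter` is a hypothetical name, and the LSB-first order within each `u64` is an assumption; the actual `BitIter` layout is not given in the message.

```rust
/// Iterates over the first `n_bits` bits of a word slice,
/// least-significant bit first within each `u64` (assumed order).
struct WordBitIter<'a> {
    words: &'a [u64],
    n_bits: usize,
    pos: usize,
    current: u64,
}

impl<'a> WordBitIter<'a> {
    fn new(words: &'a [u64], n_bits: usize) -> Self {
        assert!(n_bits <= words.len() * 64);
        WordBitIter { words, n_bits, pos: 0, current: 0 }
    }
}

impl<'a> Iterator for WordBitIter<'a> {
    type Item = bool;

    fn next(&mut self) -> Option<bool> {
        if self.pos >= self.n_bits {
            return None;
        }
        if self.pos % 64 == 0 {
            // Load a new word once every 64 bits instead of once per byte.
            self.current = self.words[self.pos / 64];
        }
        let bit = self.current & 1 == 1;
        self.current >>= 1; // shift the next bit into position
        self.pos += 1;
        Some(bit)
    }
}
```

Each word is read from memory once and then consumed by shifts, so the per-bit work is a mask, a shift, and a counter increment.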
Eric Coissac aa98e82875 refactor: introduce PackedIntCol view and use iterators
Centralizes overflow handling and improves modularity by replacing manual mmap indexing and row loops with composable iterator patterns. This change leverages Rust's iterator traits for efficient, idiomatic column traversal while encapsulating data access in a dedicated view struct.
2026-06-17 23:09:18 +02:00
Eric Coissac 5ff5b04d2d refactor: replace manual bit ops with BitSlice traits
Refactors bit manipulation and distance calculations to leverage standardized `BitSlice` traits, replacing manual byte/word logic with safer, reusable methods. Extends `IntSlice` and `IntSliceMut` traits to expose direct memory-mapped access and overflow management, enabling efficient bulk data extraction and serialization. Replaces manual bit-shifting loops with optimized table-based unpacking and adds population count and distance metric methods for improved performance. Updates `PersistentBitVecBuilder` with file tracking and safe flushing, and aligns test imports with new trait bounds.
2026-06-17 15:28:44 +02:00
Eric Coissac df7b400fda perf: optimize aggregation with byte-level helpers and direct mmap
Introduce `byte_sum` and `byte_count_nonzero` to efficiently aggregate compact-int byte slices, bypassing per-element decoding and overflow map lookups. Refactor `sum()` and `count_nonzero()` across the matrix, reader, and traits modules to use direct memory-mapped slice iteration and idiomatic Rust iterators. Additionally, expose `MemoryIntIter` publicly and implement `IntoIterator` and `IntSlice` for `MemoryIntVec` to enable standard iteration and delegate aggregation to the new helpers.
2026-06-17 15:21:21 +02:00
Eric Coissac d1717688d2 refactor: extract matrix helpers and improve bit iteration ergonomics
Refactor parallel matrix construction by extracting reusable `pairwise_matrix` and `pairwise2_matrix` helpers, and consolidate binary record deserialization into dedicated parsing functions. Add `set` and `iter` methods to `BitSliceMut` and `MemoryBitVec` for ergonomic bit manipulation and iteration. Standardize JSON field extraction via `meta::field`, expose `MemoryBitIter`, and improve test reliability by automatically cleaning up temporary directories.
2026-06-17 15:15:39 +02:00
Eric Coissac cde6457eea feat: add memory vectors, slice traits, and column extraction methods
Introduce `MemoryBitVec` and `MemoryIntVec` for efficient in-memory storage with hybrid compression and overflow handling. Implement `BitSlice`, `BitSliceMut`, `IntSlice`, and `IntSliceMut` traits across persistent and memory-backed types to enable generic slice operations and bitwise/arithmetic overloads. Add `col_persist` and `col_as_memory` methods to `BitMatrix` and `IntMatrix` for efficient column extraction. Align with the new single-pass rebuild architecture by supporting fast kmer filtering and matrix rebuilding. Includes comprehensive tests and profiling instrumentation for the packing phase.
2026-06-17 15:03:18 +02:00
Eric Coissac b6fcbc545f refactor: replace rayon with NUMA-aware PartitionRunner
Replaces `rayon` parallel iteration across index, rebuild, reindex, and select modules with a custom `PartitionRunner`. This introduces NUMA-aware task distribution with CPU pinning and round-robin scheduling, eliminating `Arc`, `Mutex`, and atomic synchronization primitives in favor of a flat, pre-spawned worker architecture. Error handling is simplified via `.map_err()` and the `?` operator, while progress bar updates are decoupled into dedicated callbacks.
2026-06-15 18:53:31 +02:00
coissac 9578f991f4 Merge pull request 'Push pslsukyowzrp' (#32) from push-pslsukyowzrp into main
Reviewed-on: #32
2026-06-15 16:29:24 +00:00
Eric Coissac 1cd7916e06 refactor: replace rayon with NUMA-aware PartitionRunner
Replaces `rayon` parallel iteration across index, rebuild, reindex, and select modules with a custom `PartitionRunner`. This introduces NUMA-aware task distribution with CPU pinning and round-robin scheduling, eliminating `Arc`, `Mutex`, and atomic synchronization primitives in favor of a flat, pre-spawned worker architecture. Error handling is simplified via `.map_err()` and the `?` operator, while progress bar updates are decoupled into dedicated callbacks.
2026-06-15 18:29:04 +02:00
Eric Coissac bc92dc4592 refactor: restructure partitioner with shared utilities and pipeline
This commit restructures the partitioner crate by extracting shared utilities and the `ColBuilder` enum into a new `common` module. It introduces a multi-phase `graph_pipeline` for constructing and materializing De Bruijn graphs, replacing manual graph construction in `index_layer`, `merge_layer`, and `rebuild_layer`. All layer workflows now use centralized `build_graph` and `materialize_layer` abstractions, with standardized error context strings for improved diagnostics.
2026-06-15 14:08:16 +02:00
coissac a9567ad023 Merge pull request 'Push rtnzuqxzmkon' (#31) from push-rtnzuqxzmkon into main
Reviewed-on: #31
2026-06-15 09:40:35 +00:00
Eric Coissac 4a64718fd1 perf: replace partition processing with adaptive NUMA worker pool
Replaces the previous partition processing logic with an adaptive, NUMA-aware multi-threaded worker pool that dynamically scales active threads based on real-time CPU efficiency. Introduces pre-spawned, CPU-pinned threads managed via crossbeam channels and Rayon to optimize memory bandwidth and core utilization. Adds a `max_workers()` accessor to aggregate maximum worker capacity across NUMA nodes and updates diagnostics to report active versus maximum worker counts.
2026-06-15 11:40:14 +02:00
Eric Coissac 7a87e911b6 feat: introduce NUMA-aware PartitionRunner for adaptive parallelism
Replace NUMA-naive Rayon loops and ad-hoc adaptive pools with a unified `PartitionRunner` that manages a NUMA-aware worker pool. The implementation uses pinned Rayon thread pools per node and activates dormant threads based on real-time CPU efficiency metrics. This standardizes partition-level parallelism, optimizes memory locality, and eliminates cross-socket traffic. Includes architecture documentation and updates mkdocs navigation.
2026-06-15 11:34:41 +02:00
coissac 313d73838a Merge pull request 'feat: add pipeline concurrency throttling and HPC build docs' (#30) from push-owwylwtskwzw into main
Reviewed-on: #30
2026-06-15 08:33:41 +00:00
Eric Coissac 175ea5bbd0 feat: add pipeline concurrency throttling and HPC build docs
Introduces a counting semaphore-based throttling mechanism to limit concurrent file I/O and pipeline processing. Replaces custom path wrappers with standardized `Throttled` types across `obikmer` and `obikpartitionner`, ensuring RAII-based resource cleanup and explicit backpressure. Additionally, documents how to redirect Cargo build artifacts to local scratch storage on HPC filesystems to prevent compilation slowdowns.
2026-06-15 10:33:23 +02:00
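A counting semaphore with RAII release, the mechanism named in 175ea5bbd0, can be sketched with the standard library alone. `Semaphore` and `Permit` are hypothetical names; the repository's `Throttled` types are not reproduced here, and their API is not known from the message.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Counting semaphore: at most `permits` holders at any time.
struct Semaphore {
    permits: Mutex<usize>,
    cond: Condvar,
}

/// RAII guard: the permit is returned when the guard is dropped.
struct Permit {
    sem: Arc<Semaphore>,
}

impl Semaphore {
    fn new(permits: usize) -> Arc<Self> {
        Arc::new(Semaphore {
            permits: Mutex::new(permits),
            cond: Condvar::new(),
        })
    }

    /// Blocks until a permit is free. This is the backpressure point.
    fn acquire(self: &Arc<Self>) -> Permit {
        let mut free = self.permits.lock().unwrap();
        while *free == 0 {
            free = self.cond.wait(free).unwrap();
        }
        *free -= 1;
        Permit { sem: Arc::clone(self) }
    }

    /// Number of permits currently free.
    fn available(&self) -> usize {
        *self.permits.lock().unwrap()
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        *self.sem.permits.lock().unwrap() += 1;
        // Wake one waiter, if any, now that a permit is free.
        self.sem.cond.notify_one();
    }
}
```

A caller would hold the `Permit` for as long as a file is open or a pipeline stage is running; releasing happens on every exit path, including early returns and panics that unwind.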
coissac c6ea0c53e3 Merge pull request 'feat: implement NUMA-aware worker pools for merge command' (#29) from push-wusvurukprsr into main
Reviewed-on: #29
2026-06-14 21:57:21 +00:00
Eric Coissac ea767376bd feat: implement NUMA-aware worker pools for merge command
Replaces the global Rayon pool with per-NUMA-node thread pools that pin worker threads to their respective nodes, leveraging Linux first-touch allocation to reduce cross-NUMA memory contention and improve cache locality. Integrates the `hwlocality` crate with a vendored build, includes graceful fallbacks for single-socket or non-Linux systems, and updates dependency constraints. Also adds installation and architecture documentation, and corrects parallelism detection in the partitioner.
2026-06-14 23:56:52 +02:00
coissac f1d76f3203 Merge pull request 'refactor(merge): extract adaptive worker spawn logic' (#28) from push-yzruqtyqvopm into main
Reviewed-on: #28
2026-06-13 12:56:34 +00:00
Eric Coissac c4071eb450 refactor(merge): extract adaptive worker spawn logic
Centralize inline spawn checks into a `should_spawn_worker` function with adaptive thresholds. The first worker spawns at <95% CPU efficiency, while subsequent workers only trigger if marginal efficiency gain exceeds 25% of the expected `1/n_workers` (minimum 3%). Also increases the spawn poll interval from 10s to 20s.
2026-06-13 14:56:01 +02:00
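One possible reading of the thresholds in c4071eb450 is sketched below. The signature and parameter names are hypothetical, efficiencies and gains are assumed to be fractions in [0, 1], and "first worker" is taken to mean the decision made while a single worker is running; the real function may differ on all of these points.

```rust
/// Decide whether to add a worker.
/// - `n_workers`: workers currently running (assumed >= 1).
/// - `efficiency`: current CPU efficiency as a fraction.
/// - `last_gain`: efficiency gained by the most recent spawn.
fn should_spawn_worker(n_workers: usize, efficiency: f64, last_gain: f64) -> bool {
    if n_workers <= 1 {
        // First additional worker: spawn when efficiency is below 95%.
        return efficiency < 0.95;
    }
    // Later workers: the last spawn must have gained more than 25% of
    // the ideal 1/n_workers share, with a floor of 3%.
    let threshold = (0.25 / n_workers as f64).max(0.03);
    last_gain > threshold
}
```

With four workers the threshold is 0.0625, so a measured gain of 0.07 allows another spawn and 0.05 does not. With twenty workers the 3% floor applies, since 0.25/20 = 0.0125 is below it.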
coissac 817b02cbc1 Merge pull request 'Push zkspuxlpumpw' (#27) from push-zkspuxlpumpw into main
Reviewed-on: #27
2026-06-13 11:25:12 +00:00
Eric Coissac 547cb72d76 refactor: Enforce Rayon parallelism and fix merge_layer concurrency
Updated memory guidelines and feedback docs to explicitly classify intra-partition phases as parallel, correcting prior assumptions of sequential execution. Refactored merge_layer.rs to wrap column builders in Arc<Mutex<ColBuilder>> and use Arc::try_unwrap for safe concurrent access, eliminating race conditions and preventing double-closes during pass2.
2026-06-13 13:24:55 +02:00
Eric Coissac 6d85387077 feat: add performance instrumentation and dynamic worker scaling
This change enhances observability and adaptability in the merge pipeline. Performance timing and debug logging are added to the De Bruijn graph and partition merge layers to track phase durations and pipeline metrics. The merge module replaces blocking receives with timed polls to sample CPU efficiency, dynamically spawning workers when utilization drops below a threshold. A new script is also introduced to parse merge debug logs and generate structured Markdown reports detailing throughput, phase breakdowns, and partition performance.
2026-06-13 13:21:53 +02:00
coissac fb5b53dca9 Merge pull request 'Push ooxwzorvsqvy' (#26) from push-ooxwzorvsqvy into main
Reviewed-on: #26
2026-06-13 09:59:07 +00:00
Eric Coissac fddf630772 style: apply consistent formatting and whitespace normalization
Applies consistent formatting, whitespace normalization, and indentation standardization to `debruijn.rs` and `merge.rs`. Reorganizes imports and downgrades a unitig traversal log from `info!` to `debug!`. No functional logic or runtime behavior is altered.
2026-06-13 11:58:20 +02:00
Eric Coissac bc14346f5f feat: add CPU-aware parallel worker pool for partition merging
Introduce CpuSample to measure process-level CPU efficiency and wall-clock time. Use crossbeam-channel to distribute partition merging tasks to a dynamic worker pool that scales based on CPU utilization, capped at half the available cores. Update diagnostics to track pool usage.
2026-06-13 11:58:20 +02:00
Eric Coissac fb8c6e427c refactor: pass Unitig objects directly instead of raw byte slices
Refactored `try_for_each_unitig` and related pipelines across `obidebruinj` and `obikpartitionner` to accept `Unitig` instances directly. This eliminates manual `Unitig::from_nucleotides()` conversions, simplifies the data flow, and reduces unnecessary allocation overhead.
2026-06-13 11:52:50 +02:00
Eric Coissac 1f336fe496 refactor: replace mutex with channels for parallel debruijn processing
Add `rayon` and `crossbeam-channel` dependencies to support concurrent execution. Replace the synchronous, mutex-protected closure pattern with a channel-based producer-consumer approach using `std::thread::scope`. This decouples unitig iteration from processing, eliminating lock contention and `Mutex` overhead while enabling parallel workloads.
2026-06-13 11:49:27 +02:00
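The producer-consumer shape described in 1f336fe496 can be sketched with scoped threads. To stay free of external crates, this example swaps `crossbeam-channel` for the standard library's bounded `std::sync::mpsc::sync_channel`, which supports only a single consumer. `total_length` and the use of `&str` as a stand-in for unitigs are hypothetical.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Sums the lengths of all unitigs, with iteration and processing
/// running on separate threads connected by a bounded channel.
fn total_length(unitigs: &[&str]) -> usize {
    // Capacity 4: the producer blocks when the consumer falls behind.
    let (tx, rx) = sync_channel::<&str>(4);
    thread::scope(|s| {
        // Producer: walks the unitigs and sends them downstream.
        s.spawn(move || {
            for &u in unitigs {
                if tx.send(u).is_err() {
                    break; // the consumer hung up
                }
            }
            // `tx` is dropped here, which closes the channel.
        });
        // Consumer: runs on the current thread until the channel closes.
        rx.iter().map(|u| u.len()).sum::<usize>()
    })
}
```

No lock is shared between the two sides; the channel is the only synchronization point, and `thread::scope` guarantees the producer has finished before the function returns.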
Eric Coissac 5f98d2ef96 refactor: replace explicit collect with Unitig::from_nucleotides
Introduce a thread-local buffer to materialize nucleotide iterators into contiguous slices. Update `try_for_each_unitig` across the debruijn, index, merge, and rebuild layers to directly instantiate `Unitig` via `from_nucleotides()` instead of explicitly collecting iterators. This eliminates intermediate allocations and aligns test code with the new approach.
2026-06-13 11:47:06 +02:00
Eric Coissac 8b563d0804 refactor: migrate pipeline stages and improve graph processing
Refactored neighbor resolution to explicitly track unvisited indices for degree-1 nodes, updated display formatting, and added timing and debug logging to the degree computation routine. Migrated pipeline stages from eager vector returns to explicit flat implementations, enabling backpressure-aware streaming, configurable batch processing, incremental yielding, and progress tracking via a delta channel.
2026-06-13 11:44:17 +02:00
coissac 7208dcbb4a Merge pull request 'refactor: defer SrcLayerData lookups in RawBatch' (#25) from push-nxrynoorswrw into main
Reviewed-on: #25
2026-06-12 20:19:21 +00:00
Eric Coissac 2e69b0b7fe refactor: defer SrcLayerData lookups in RawBatch
Replace eager resolution of `Vec<u32>` values with an `Arc<SrcLayerData>` handle passed alongside `Vec<CanonicalKmer>`. This shifts the lookup logic to the subsequent transform step, reducing memory overhead and enabling shared, thread-safe access to the source layer data.
2026-06-12 22:18:57 +02:00
coissac 9ea1dff5d6 Merge pull request 'Push rwqsmuvystym' (#24) from push-rwqsmuvystym into main
Reviewed-on: #24
2026-06-12 19:33:20 +00:00
Eric Coissac b2c8373586 refactor: parallelize merge layer with WorkerPool pipeline
Replaces the synchronous sequential loop with a multi-threaded pipeline using `WorkerPool` and custom stage macros. Shared mutable state is wrapped in `Arc<Mutex<>>` for thread-safe updates, while pipeline errors are centralized via `Arc<Mutex<Option<String>>>` to propagate failures before thread join.
2026-06-12 21:32:53 +02:00
Eric Coissac ba49af6f9e refactor: parallelize merge and partition logic with obipipeline
Introduce the `obipipeline` dependency and refactor merge and partition logic to leverage parallel execution. Update `merge_partitions` to use rayon with dynamic memory budgeting and concurrency control via a pilot run. Refactor Pass 1 to concurrently read unitigs, filter kmers through a shared `LayeredMap`, and populate the graph safely. Simplify diagnostics to report total kmer counts and replace manual flags with graph length validation.
2026-06-12 21:32:04 +02:00
Eric Coissac 2bc189e962 feat: dynamically compute seed expansion based on RSS
Introduce a `peak_rss_bytes()` utility for accurate per-phase RAM measurement. Replace the genome-length heuristic with a dynamic seed expansion ratio based on actual RSS delta. Explicitly drop the `GraphDeBruijn` instance before MPHF construction to prevent resource contention and ensure proper memory management.
2026-06-12 16:39:38 +02:00
coissac db9c604199 Merge pull request 'feat: enhance memory budgeting and add rebuild diagnostics' (#23) from push-nptzpkomspkv into main
Reviewed-on: #23
2026-06-12 13:21:12 +00:00
Eric Coissac 52fd2cf801 feat: enhance memory budgeting and add rebuild diagnostics
This commit improves memory management by respecting Linux cgroup v1/v2 limits and introduces a configurable memory budget for the new `rebuild` subcommand to prevent OOM during index reconstruction. The rebuild process now supports filtering, compaction, and parallelization. Diagnostic capabilities are expanded with debug-level tracing for partition merges, k-mer expansion tracking, and utility flags for label renaming, matrix size breakdowns, per-genome counts, and partition distribution reporting. Accessor methods for active and remaining memory have also been added to the stats struct.
2026-06-12 15:20:38 +02:00
coissac 97e3fb9761 Merge pull request 'Push ylnwstyzqwrt' (#22) from push-ylnwstyzqwrt into main
Reviewed-on: #22
2026-06-12 10:10:03 +00:00
Eric Coissac b5e027f23b feat: add memory-aware parallel merge scheduling and CLI flags
Introduces a memory-aware scheduling strategy for parallel partition merging that replaces unbounded concurrency with a First-Fit Decreasing approach gated by a thread-safe `MemoryBudget` semaphore. An adaptive expansion factor, seeded by a sequential pilot run, dynamically caps concurrent workers to prevent hashbrown OOMs. Adds a `--budget-fraction` CLI flag to configure RAM allocation, enhances the CLI to accept multiple indexes, and introduces comprehensive partition diagnostics including memory utilization tracking, concurrency metrics, and statistical summaries with ASCII histograms. Updates documentation and navigation accordingly.
2026-06-12 11:44:10 +02:00
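First-Fit Decreasing, the strategy named in b5e027f23b, is shown below in its textbook bin-packing form. This is only an illustration of the ordering heuristic: how the scheduler combines it with the `MemoryBudget` semaphore is not stated in the message, and `first_fit_decreasing` is a hypothetical name. Each returned group can be read as a set of partitions whose estimated memory fits the budget together.

```rust
/// Packs `sizes` into groups whose totals do not exceed `budget`,
/// using First-Fit Decreasing. An item larger than the budget ends up
/// alone in its own group.
fn first_fit_decreasing(sizes: &[u64], budget: u64) -> Vec<Vec<u64>> {
    let mut sorted = sizes.to_vec();
    // Largest first, so big items claim space before small ones fill gaps.
    sorted.sort_unstable_by(|a, b| b.cmp(a));

    // Each bin tracks (bytes used, items placed).
    let mut bins: Vec<(u64, Vec<u64>)> = Vec::new();
    for size in sorted {
        // First fit: take the earliest bin with enough room left.
        match bins.iter_mut().find(|(used, _)| used + size <= budget) {
            Some((used, items)) => {
                *used += size;
                items.push(size);
            }
            None => bins.push((size, vec![size])),
        }
    }
    bins.into_iter().map(|(_, items)| items).collect()
}
```

For sizes `[4, 8, 1, 4, 2, 1]` and a budget of 10, the result is `[[8, 2], [4, 4, 1, 1]]`: two groups, each using the budget fully.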
Eric Coissac f44fe042bc feat: add parallel k-mer counting and stats CLI
Introduces allocation-free `sum()` and `count_nonzero()` methods for compact integer vectors, extending the `ColumnWeights` trait with `partial_kmer_counts`. Adds parallel partition scanning to the k-mer index for computing per-genome distinct k-mer counts, and exposes a new `--stats` CLI flag to output these statistics as CSV.
2026-06-12 11:29:32 +02:00
coissac 94e0a370b3 Merge pull request 'Push tmpsxsztwpxl' (#21) from push-tmpsxsztwpxl into main
Reviewed-on: #21
2026-06-09 13:31:25 +00:00
Eric Coissac 970460be42 refactor: rename rebuild subcommand to filter
Rename the `rebuild` CLI subcommand to `filter` to better reflect its primary purpose of row-level selection and k-mer filtering. Update all associated CLI arguments, logging, error messages, and module registrations accordingly. Introduce a dedicated `Rebuild` subcommand for index compaction, fully decoupling it from the filtering logic. Also refine related documentation to align with the new naming and semantics.
2026-06-09 15:26:37 +02:00
Eric Coissac e66adef23d feat: add select command for genome column projection and aggregation
Introduces the `select` CLI command to project and aggregate genome-level k-mer data by column. Adds `filter` as an alias for `rebuild`. The implementation uses parallel partition processing, supports metadata-driven grouping with configurable aggregation operators, and performs atomic in-place rewrites or filtered exports. Updates documentation and navigation accordingly.
2026-06-09 15:09:47 +02:00
coissac b0dab452f6 Merge pull request 'refactor: optimize dump partition iteration and add progress tracking' (#20) from push-xqswlxlvmyrq into main
Reviewed-on: #20
2026-06-09 09:34:13 +00:00
Eric Coissac db730e9cf6 refactor: optimize dump partition iteration and add progress tracking
Refactor partition iteration to support a generic `on_partition` callback executed after each parallel partition completes. Split the logic into bounded and unbounded paths; the bounded path uses an `AtomicUsize` to enforce row limits, while the unbounded path eliminates atomic contention to improve throughput. Additionally, integrate a progress bar into the dump command by passing an increment callback to `idx.dump()`, ensuring proper initialization and cleanup.
2026-06-09 11:07:48 +02:00
coissac f65ecd19cc Merge pull request 'Push lrwmyplxxzkn' (#19) from push-lrwmyplxxzkn into main
Reviewed-on: #19
2026-06-09 08:28:20 +00:00
Eric Coissac 7dd8db1409 docs: document conservative rounding strategy for filtering thresholds
Specifies that minimum bounds use ceiling and maximum bounds use floor to enforce strictness. Clarifies that the implementation avoids explicit rounding by directly comparing integer counts against floating-point fractions, which is mathematically equivalent.
2026-06-09 10:26:21 +02:00
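The equivalence stated in 7dd8db1409 follows from a simple fact: for an integer count c and a real bound x, c ≥ ⌈x⌉ holds exactly when c ≥ x, and c ≤ ⌊x⌋ holds exactly when c ≤ x. The sketch below checks both forms side by side; the function names and parameters are hypothetical.

```rust
/// Minimum bound with explicit ceiling.
fn passes_min_rounded(count: u64, group_size: u64, min_frac: f64) -> bool {
    count >= (min_frac * group_size as f64).ceil() as u64
}

/// Minimum bound by direct comparison, no rounding.
fn passes_min_direct(count: u64, group_size: u64, min_frac: f64) -> bool {
    count as f64 >= min_frac * group_size as f64
}

/// Maximum bound with explicit floor.
fn passes_max_rounded(count: u64, group_size: u64, max_frac: f64) -> bool {
    count <= (max_frac * group_size as f64).floor() as u64
}

/// Maximum bound by direct comparison, no rounding.
fn passes_max_direct(count: u64, group_size: u64, max_frac: f64) -> bool {
    count as f64 <= max_frac * group_size as f64
}
```

With a group of 7 genomes and a fraction of 0.5, the bound is 3.5: a minimum filter requires at least 4 genomes and a maximum filter allows at most 3, whichever form is used.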
Eric Coissac ce45e2fbe1 refactor: centralize k-mer filtering logic and add validation
Refactor shared `FilterArgs` and `build_group_filter` to return a `Result` with explicit validation for fraction bounds, min/max ordering, and count constraints. Update conditional defaults for `--min-frac` and `--max-outgroup-count` to depend on explicit quorum flags, preventing silent configuration conflicts. Update documentation and MkDocs navigation to reflect the new centralized k-mer filtering system across `rebuild`, `dump`, and `unitig` commands.
2026-06-09 10:22:25 +02:00
Eric Coissac 2465cfbc4b Parallelize partition iteration using Rayon
Introduce thread-local `Vec<u8>` buffers to eliminate concurrent I/O contention. Replace the mutable row counter with an `AtomicUsize` and `fetch_update` to enable lock-free early termination when the limit is reached. Collected chunks are then written sequentially to preserve partition ordering.
2026-06-09 10:04:25 +02:00
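A minimal sketch of the pattern this commit describes, using only the standard library (scoped threads stand in for Rayon here). `try_claim_row` and `dump_partitions` are hypothetical names, and rows are modelled as plain `u32` values; the actual obikmer code is not shown in this diff.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Try to reserve one output row under `limit` without taking a lock.
/// Returns `true` when the caller may emit a row.
fn try_claim_row(emitted: &AtomicUsize, limit: usize) -> bool {
    emitted
        .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| {
            // Returning `None` leaves the counter unchanged and signals "limit reached".
            if n < limit { Some(n + 1) } else { None }
        })
        .is_ok()
}

/// Process each partition on its own thread with a thread-local buffer,
/// then concatenate the buffers sequentially in partition order.
fn dump_partitions(partitions: &[Vec<u32>], limit: usize) -> Vec<u32> {
    let emitted = AtomicUsize::new(0);
    let chunks: Vec<Vec<u32>> = std::thread::scope(|scope| {
        let handles: Vec<_> = partitions
            .iter()
            .map(|part| {
                let emitted = &emitted;
                scope.spawn(move || {
                    let mut local = Vec::new(); // per-thread buffer, no shared writer
                    for &row in part {
                        if !try_claim_row(emitted, limit) {
                            break; // early termination once the limit is reached
                        }
                        local.push(row);
                    }
                    local
                })
            })
            .collect();
        // Joining in spawn order preserves partition ordering in the output.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    chunks.into_iter().flatten().collect()
}
```

The total number of emitted rows never exceeds `limit`, but which partitions contribute rows when the limit is hit depends on thread scheduling.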
Eric Coissac d626d42ec7 feat: add --head and --presence-threshold to dump and distance
Introduces `--head N` to the `dump` command for early iteration termination and `--presence-threshold N` to the `distance` command for Jaccard filtering on count indexes. Updates filter defaults to adapt based on explicit ingroup/outgroup declarations. Fixes a Rust type mismatch in the unitig closure and updates partition iteration callbacks to return `bool` for proper early termination support. Documentation is updated accordingly.
2026-06-09 10:04:25 +02:00
coissac 650eea43b6 Merge pull request 'Push quqlpklvxsqx' (#18) from push-quqlpklvxsqx into main
Reviewed-on: #18
2026-06-08 18:15:01 +00:00
Eric Coissac eb7805c747 feat: add configurable presence threshold to kmer distance
Introduce a `--presence-threshold` CLI argument (default: 1) and update `KmerIndex::distance` to accept a `presence_threshold` parameter. This replaces hardcoded zero thresholds, enabling configurable filtering of low-abundance kmers during Jaccard distance calculations.
2026-06-08 20:14:33 +02:00
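As an illustration of how a presence threshold could enter a Jaccard distance, here is a hedged sketch over two per-genome count vectors. It does not reproduce the signature of `KmerIndex::distance`; the slice-based interface, the `>=` comparison, and the choice to return `0.0` when both sets are empty are assumptions made for this example.

```rust
/// Jaccard distance between two genomes described by per-k-mer counts.
/// A k-mer is considered present in a genome when its count is at least
/// `presence_threshold` (assumed semantics; the default of 1 then means "count > 0").
fn jaccard_distance(a: &[u32], b: &[u32], presence_threshold: u32) -> f64 {
    assert_eq!(a.len(), b.len(), "count vectors must cover the same k-mers");
    let mut intersection = 0usize;
    let mut union = 0usize;
    for (&ca, &cb) in a.iter().zip(b) {
        let in_a = ca >= presence_threshold;
        let in_b = cb >= presence_threshold;
        if in_a && in_b {
            intersection += 1;
        }
        if in_a || in_b {
            union += 1;
        }
    }
    if union == 0 {
        // Assumption: two empty sets are treated as identical.
        return 0.0;
    }
    1.0 - intersection as f64 / union as f64
}
```

Raising the threshold drops low-abundance k-mers from both sets before the intersection and union are counted, so the same pair of genomes can yield different distances at different thresholds.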
Eric Coissac 1ec65922df feat: implement parallel pairwise distance matrices
Introduces parallelized pairwise distance matrix computation for Jaccard, Hamming, Bray-Curtis, Euclidean, and Hellinger metrics across `Columnar`, `Packed`, and `Implicit` matrix variants. Adds trait methods and convenience wrappers, safely handles normalization and zero-denominator edge cases, and updates test suites to import required traits for validation.
2026-06-08 20:08:09 +02:00
Eric Coissac 09d9e21744 feat: integrate tracing and enhance bit matrix operations
Add the `tracing` crate to `obidebruinj`, `obisys`, and resolve it in `Cargo.lock`. Replace `eprintln!` statements with structured `debug!` and `info!` macros. Introduce a `TracedBar` wrapper for progress bars and enhance the `Stage` lifecycle to emit structured events for timing, memory metrics, and swap warnings. Add a progress spinner for unitig degree computation. Extend `PersistentBitMatrix` with columnar bit-vector operations and parallel distance methods, enabling uniform distance computations across all storage layouts while replacing previous panics with dimension-based fallbacks.
2026-06-08 19:55:06 +02:00
coissac 3f47e22083 Merge pull request 'Push pvqkqxlkkwry' (#17) from push-pvqkqxlkkwry into main
Reviewed-on: #17
2026-06-06 04:44:10 +00:00
Eric Coissac 03c7bb0b99 Relax unitig assertion in debruijn test
Replace the strict `unitigs.len() == 1` assertion with a non-empty check to allow multiple unitigs. Update the test comment to describe the general non-repetitive sequence recovery principle instead of a specific example. The core k-mer roundtrip validation logic remains unchanged.
2026-06-06 06:41:45 +02:00
Eric Coissac b39eee688a refactor(debruijn): unify graph traversal with WalkState iterator
Replaces deeply nested branching with early returns and `then_some`. Introduces a cycle-detecting `find_chain_start` method and updates `UnitigNucIter` to use step-based iteration with atomic node claiming. This eliminates nested iterators and redundant state management, improving code readability and maintainability.
2026-06-06 06:38:28 +02:00
Eric Coissac 95b3461405 refactor: centralize graph traversal logic in walk
Refactor `leavable` and `reachable` to eliminate duplicated graph traversal logic by mutually delegating via `WalkState`. `leavable` now returns `self.walk(graph).is_some()`, while `reachable` delegates to the inverted `direct` state's `leavable` check. This centralizes kmer extension and visited-state validation in `walk`, simplifying control flow and reducing code duplication.
2026-06-06 06:36:48 +02:00
Eric Coissac 952a21eef8 refactor: remove naked_asm and extract collect_unitigs helper
Remove the `std::arch::naked_asm` import as it is no longer required. Introduce a `collect_unitigs` helper to abstract nucleotide sequence extraction from `GraphDeBruijn`, and refactor the test suite to use it, eliminating repetitive collection code and standardizing graph iteration logic.
2026-06-06 04:33:59 +02:00
Eric Coissac 5c2f48535f refactor: rename compute_degrees and mark start nodes
Renames `compute_degrees` to `compute_degrees_and_mark_starts` across the De Bruijn graph and partitioner layers to consolidate degree calculation and start-node flagging. Introduces safe neighbor iteration methods and a debug validation block to verify graph consistency. Refactors unitig extraction to use sequential execution with a `Mutex` for safe error propagation. Fixes malformed and duplicated method calls, adds auto-generation of missing `meta.json` files, and ensures persistent matrix builders are explicitly closed to finalize metadata.
2026-06-05 19:48:59 +02:00
Eric Coissac 27088ab810 refactor: optimize unitig iteration and graph traversal
Switches unitig processing to a lazy, fallible `try_for_each_unitig` API across partitioner layers, reducing intermediate allocations and enabling proper error propagation. Refactors de Bruijn graph traversal into a two-pass algorithm with explicit node flags, named constants, and diagnostic logging. Introduces parallel chain processing and staged performance profiling for the unitig command, and adds a memory-efficient `FromIterator` implementation for packed nucleotide sequences.
2026-06-05 19:48:59 +02:00
coissac ea2c594c86 Merge pull request 'Push ruqusmkoyvwm' (#16) from push-ruqusmkoyvwm into main
Reviewed-on: #16
2026-06-05 08:41:08 +00:00
Eric Coissac d202ead385 feat: parallelize unitig extraction and FASTA output
Replace the non-atomic `set_visited` with atomic `fetch_or` bitmask operations to enable thread-safe node claiming. Introduce a two-phase extraction pipeline where `par_for_each_chain_unitig` builds chains in parallel and `for_each_remaining_unitig` sequentially handles residual cycles and junctions. Add `is_start` and `collect_from_start` to explicitly define unitig boundaries. Wrap `BufWriter` in a `Mutex` and use an `AtomicUsize` counter to ensure thread-safe concurrent FASTA output, refactoring the write logic into a shared closure for safe multi-threaded execution.
2026-06-05 10:33:52 +02:00
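The atomic node claiming mentioned in this commit can be sketched as follows. The `VISITED` bit position and the function names are hypothetical; only the use of `fetch_or` on an atomic byte comes from the commit message.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Hypothetical flag layout: bit 0 marks a node as visited.
const VISITED: u8 = 0b0000_0001;

/// Atomically set the visited bit and report whether this caller set it first.
/// `fetch_or` returns the previous value, so exactly one thread sees the bit clear.
fn try_claim(node: &AtomicU8) -> bool {
    let previous = node.fetch_or(VISITED, Ordering::AcqRel);
    (previous & VISITED) == 0
}

/// Let `threads` threads race for the same node and count how many succeed.
fn count_winners(threads: usize) -> usize {
    let node = AtomicU8::new(0);
    std::thread::scope(|s| {
        let handles: Vec<_> = (0..threads)
            .map(|_| s.spawn(|| try_claim(&node)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .filter(|&won| won)
            .count()
    })
}
```

Because the claim is a single read-modify-write operation, no lock is needed and other flag bits stored in the same byte are left untouched.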
Eric Coissac 249998beed perf: add structured performance profiling for unitig stages
Wraps graph construction, degree computation, and unitig enumeration phases with `Stage` start/stop calls. Intervals are recorded in a `Reporter` instance and printed upon completion to provide granular timing metrics for each computational stage.
2026-06-05 10:28:45 +02:00
Eric Coissac 2f29ee2240 feat: Add parallel execution and thread-safe graph operations
Integrate rayon to enable parallel processing of k-mer partitions and degree computation. Replace Cell with AtomicU8 to ensure thread-safe node state management, and add a merge method for combining disjoint graphs. Additionally, introduce progress tracking utilities and a test-utils feature flag for development dependencies.
2026-06-04 23:22:55 +02:00
Eric Coissac edd5e3f8ee feat: add bits-per-kmer diagnostic and stats module
Introduce a `stats` module to compute normalized storage efficiency metrics. The new `KmerIndex::bits_per_kmer()` method parallelizes disk I/O across partitions to aggregate file sizes for MPHF, evidence, and matrix components. Publicly export `IndexBitsPerKmer` and add a `--bits-per-kmer` CLI flag to trigger the diagnostic routine and print detailed statistics.
2026-06-04 23:17:17 +02:00
Eric Coissac bb7adc1154 docs: expand kmer indexing, filtering, and merging documentation
Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
2026-06-04 22:59:41 +02:00
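A bitwise reverse complement of the kind mentioned above can be sketched for a k-mer packed into a `u64`. The encoding (A=0, C=1, G=2, T=3, two bits per base, first base in the most significant position, right-aligned) is an assumption for this example and may differ from the layout used by `KmerOf<L>`.

```rust
/// Pack a DNA string into a right-aligned 2-bit representation.
/// Assumed encoding: A=0, C=1, G=2, T=3. Panics on other characters.
fn encode(seq: &str) -> u64 {
    seq.bytes().fold(0u64, |acc, base| {
        let code = match base {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => panic!("non-ACGT base"),
        };
        (acc << 2) | code
    })
}

/// Reverse complement of a k-mer of length `k` (1..=32) without a per-base loop.
fn reverse_complement(kmer: u64, k: u32) -> u64 {
    assert!((1..=32).contains(&k));
    // With this encoding, complementing a base is a bitwise NOT of its 2 bits.
    let mut x = !kmer;
    // Reverse the order of the 2-bit groups: swap pairs, then nibbles, then bytes.
    x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    x = x.swap_bytes();
    // The k-mer now sits in the top 2k bits; shift it back to the right.
    x >> (64 - 2 * k)
}
```

The padding bits above the k-mer become ones after the NOT, but the reversal moves them to the low end of the word, where the final shift discards them.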
Eric Coissac 9306ec1c56 perf: Replace manual window tracking with monotonic deque algorithm
Eliminates intermediate allocations by computing per-genome window minimums (`win_min`) directly. Unifies the `z ≤ 1` and `z > 1` branches into a single buffer-reused accumulation loop, efficiently validating k-mer presence.
2026-06-04 21:37:09 +02:00
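For readers unfamiliar with the monotonic deque technique named in this commit, here is a generic sliding-window minimum sketch. It is not the obikmer `win_min` computation itself: the function name, the `u32` element type, and the output layout are assumptions for illustration.

```rust
use std::collections::VecDeque;

/// Minimum of every contiguous window of `window` values, in O(n) total time.
/// The deque holds indices whose values are strictly increasing from front to back,
/// so the front is always the minimum of the current window.
fn sliding_window_min(values: &[u32], window: usize) -> Vec<u32> {
    assert!(window > 0);
    let mut deque: VecDeque<usize> = VecDeque::new();
    let mut out = Vec::with_capacity(values.len().saturating_sub(window - 1));
    for (i, &v) in values.iter().enumerate() {
        // Drop candidates that can never be a minimum again.
        while let Some(&back) = deque.back() {
            if values[back] >= v {
                deque.pop_back();
            } else {
                break;
            }
        }
        deque.push_back(i);
        // Drop the front index once it has slid out of the window.
        if let Some(&front) = deque.front() {
            if front + window <= i {
                deque.pop_front();
            }
        }
        if i + 1 >= window {
            out.push(values[*deque.front().unwrap()]);
        }
    }
    out
}
```

Each index is pushed and popped at most once, which is what removes the need to rescan or reallocate per window.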
Eric Coissac 712a03a3a6 refactor: replace unitig extraction with de Bruijn graph pipeline
This change replaces direct partition-based extraction with a pipeline that reconstructs a de Bruijn graph from filtered k-mers. It introduces `FilterArgs` for k-mer selection, collects filtered k-mers in parallel into a `GraphDeBruijn`, computes node degrees, and enumerates unitigs from the graph for output instead of reading pre-computed partition files.
2026-06-04 21:32:49 +02:00
Eric Coissac 3e62ffe010 feat: add selective k-mer filtering to dump and rebuild commands
Add the `obidebruinj` dependency and introduce `FilterArgs` CLI arguments for ingroup/outgroup predicates and count/fraction thresholds. Extend `GroupFilterParams` to support outgroup filtering, and integrate the filter collection into `KmerIndex::dump` and `rebuild` commands. This enables selective k-mer filtering during index operations and CSV exports.
2026-06-04 21:29:58 +02:00
Eric Coissac a1499e6153 feat: add kmer filtering and refactor layer iteration
Introduce a `passes_all` utility to validate kmer rows against multiple filters using short-circuit logic. Integrate a `filters` parameter into the iteration functions to conditionally emit kmers based on filter results. Extract repetitive layer traversal and filtering into an `iter_src_layers` helper, refactoring Pass 1 and Pass 2 to eliminate duplication. Additionally, add a debug conditional to the dump output to include partition and layer metadata alongside kmer sequences.
2026-06-04 21:08:15 +02:00
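The short-circuit behaviour of a `passes_all`-style helper can be sketched like this. The `RowFilter` alias, the row representation (one count per genome), and the two example filters are hypothetical; only the idea of checking a row against several filters and stopping at the first failure comes from the commit message.

```rust
/// A row filter: takes the per-genome counts of one k-mer and accepts or rejects it.
/// (Hypothetical type, for illustration only.)
type RowFilter = Box<dyn Fn(&[u32]) -> bool + Send + Sync>;

/// Returns `true` only if every filter accepts the row.
/// `Iterator::all` stops at the first filter that returns `false`.
fn passes_all(filters: &[RowFilter], row: &[u32]) -> bool {
    filters.iter().all(|f| f(row))
}

/// Example filter: the k-mer must be present in at least `min` genomes.
fn min_present(min: usize) -> RowFilter {
    Box::new(move |row: &[u32]| row.iter().filter(|&&c| c > 0).count() >= min)
}

/// Example filter: no genome may exceed `max` occurrences.
fn max_count(max: u32) -> RowFilter {
    Box::new(move |row: &[u32]| row.iter().all(|&c| c <= max))
}
```

An empty filter list accepts every row, which matches the usual convention that an unfiltered dump emits all k-mers.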
Eric Coissac 476c7a6394 feat: add metadata-driven k-mer filtering for rebuild command
Introduces a metadata-driven filtering system for the rebuild command, classifying genomes into ingroup and outgroup categories using exact, inequality, and hierarchical path predicates. Implements a GroupQuorumFilter to enforce configurable presence thresholds and fraction constraints per group. Refactors the command to replace global quorum filters with this unified approach, converts the presence flag to a threshold parameter, and adds corresponding documentation and MkDocs navigation.
2026-06-04 21:01:58 +02:00
172 changed files with 45817 additions and 3568 deletions
Vendored
BIN
Binary file not shown.
+36
@@ -0,0 +1,36 @@
name: CI

on:
  push:
    branches: ['**']
  pull_request:

jobs:
  build:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: src
    steps:
      - uses: actions/checkout@v4
      - name: Install Rust
        run: |
          curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain stable
          echo "$HOME/.cargo/bin" >> $GITHUB_PATH
      - name: Cache cargo registry
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            src/target
          key: ${{ runner.os }}-cargo-${{ hashFiles('src/Cargo.lock') }}
          restore-keys: ${{ runner.os }}-cargo-
      - name: Build
        run: cargo build --release
      - name: Test
        run: cargo test --release
+65
@@ -0,0 +1,65 @@
name: Release

on:
  push:
    tags:
      - "v*"

jobs:
  build-linux-static:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: src
    steps:
      - uses: actions/checkout@v4
      - name: Install Rust + zigbuild
        run: |
          curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain stable
          echo "$HOME/.cargo/bin" >> $GITHUB_PATH
          sudo apt-get update -qq && sudo apt-get install -y -qq jq
          pip install ziglang --quiet
          $HOME/.cargo/bin/cargo install cargo-zigbuild
          $HOME/.cargo/bin/rustup target add x86_64-unknown-linux-musl
      - name: Create musl C/C++ wrappers
        run: |
          ZIG=$(python3 -c "import ziglang, os; print(os.path.join(os.path.dirname(ziglang.__file__), 'zig'))")
          printf '#!/bin/sh\nexec "%s" cc -target x86_64-linux-musl "$@"\n' "$ZIG" | sudo tee /usr/local/bin/x86_64-linux-musl-gcc > /dev/null
          printf '#!/bin/sh\nexec "%s" c++ -target x86_64-linux-musl "$@"\n' "$ZIG" | sudo tee /usr/local/bin/x86_64-linux-musl-g++ > /dev/null
          sudo chmod +x /usr/local/bin/x86_64-linux-musl-gcc /usr/local/bin/x86_64-linux-musl-g++
      - name: Cache cargo registry
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            src/target
          key: linux-musl-cargo-${{ hashFiles('src/Cargo.lock') }}
          restore-keys: linux-musl-cargo-
      - name: Build static binary
        run: cargo zigbuild --release --target x86_64-unknown-linux-musl
      - name: Prepare artifact
        run: |
          mkdir -p /tmp/dist
          cp target/x86_64-unknown-linux-musl/release/obikmer /tmp/dist/obikmer-linux-x86_64
          strip /tmp/dist/obikmer-linux-x86_64
      - name: Create Gitea release and upload binary
        env:
          GITEA_TOKEN: ${{ secrets.GITEATOKEN }}
          TAG: ${{ github.ref_name }}
        run: |
          release_id=$(curl -s -X POST \
            "${{ github.server_url }}/api/v1/repos/${{ github.repository }}/releases" \
            -H "Authorization: token $GITEA_TOKEN" \
            -H "Content-Type: application/json" \
            -d "{\"tag_name\":\"$TAG\",\"name\":\"$TAG\"}" | jq -r '.id')
          curl -s -X POST \
            "${{ github.server_url }}/api/v1/repos/${{ github.repository }}/releases/$release_id/assets" \
            -H "Authorization: token $GITEA_TOKEN" \
            -F "attachment=@/tmp/dist/obikmer-linux-x86_64"
+10
@@ -9,3 +9,13 @@ data-stress
./**/*.json
*.bin
Betula_exilis--IGA-24-33
benchmark/genomes
benchmark/simulated_data
benchmark/specimen_index_presence
benchmark/specimen_index_count
benchmark/global_index_presence
benchmark/global_index_count
benchmark/stats
benchmark/reference_index
benchmark/specific_index_count
benchmark/specific_index_presence
+2
@@ -0,0 +1,2 @@
/cache
/project.local.yml
+133
@@ -0,0 +1,133 @@
# the name by which the project can be referenced within Serena
project_name: "obikmer"
# list of languages for which language servers are started; choose from:
# al angular ansible bash clojure
# cpp cpp_ccls crystal csharp csharp_omnisharp
# dart elixir elm erlang fortran
# fsharp go groovy haskell haxe
# hlsl html java json julia
# kotlin lean4 lua luau markdown
# matlab msl nix ocaml pascal
# perl php php_phpactor powershell python
# python_jedi python_ty r rego ruby
# ruby_solargraph rust scala scss solidity
# svelte swift systemverilog terraform toml
# typescript typescript_vts vue yaml zig
# (This list may be outdated. For the current list, see values of Language enum here:
# https://github.com/oraios/serena/blob/main/src/solidlsp/ls_config.py
# For some languages, there are alternative language servers, e.g. csharp_omnisharp, ruby_solargraph.)
# Note:
# - For C, use cpp
# - For JavaScript, use typescript
# - For Angular projects, use angular (subsumes typescript+html; requires `npm install` in the project root)
# - For Svelte projects, use svelte (subsumes typescript/javascript for .svelte projects; requires npm)
# - For SCSS / Sass / plain CSS, use scss (some-sass-language-server handles all three)
# - For Free Pascal/Lazarus, use pascal
# Special requirements:
# Some languages require additional setup/installations.
# See here for details: https://oraios.github.io/serena/01-about/020_programming-languages.html#language-servers
# When using multiple languages, the first language server that supports a given file will be used for that file.
# The first language is the default language and the respective language server will be used as a fallback.
# Note that when using the JetBrains backend, language servers are not used and this list is correspondingly ignored.
languages:
- rust
# the encoding used by text files in the project
# For a list of possible encodings, see https://docs.python.org/3.11/library/codecs.html#standard-encodings
encoding: "utf-8"
# line ending convention to use when writing source files.
# Possible values: unset (use global setting), "lf", "crlf", or "native" (platform default)
# This does not affect Serena's own files (e.g. memories and configuration files), which always use native line endings.
line_ending:
# The language backend to use for this project.
# If not set, the global setting from serena_config.yml is used.
# Valid values: LSP, JetBrains
# Note: the backend is fixed at startup. If a project with a different backend
# is activated post-init, an error will be returned.
language_backend:
# whether to use project's .gitignore files to ignore files
ignore_all_files_in_gitignore: true
# advanced configuration option allowing to configure language server-specific options.
# Maps the language key to the options.
# Have a look at the docstring of the constructors of the LS implementations within solidlsp (e.g., for C# or PHP) to see which options are available.
# No documentation on options means no options are available.
ls_specific_settings: {}
# list of additional workspace folder paths for cross-package reference support (e.g. in monorepos).
# Paths can be absolute or relative to the project root.
# Each folder is registered as an LSP workspace folder, enabling language servers to discover
# symbols and references across package boundaries.
# Currently supported for: TypeScript.
# Example:
# additional_workspace_folders:
# - ../sibling-package
# - ../shared-lib
additional_workspace_folders: []
# list of additional paths to ignore in this project.
# Same syntax as gitignore, so you can use * and **.
# Note: global ignored_paths from serena_config.yml are also applied additively.
ignored_paths: []
# whether the project is in read-only mode
# If set to true, all editing tools will be disabled and attempts to use them will result in an error
# Added on 2025-04-18
read_only: false
# list of tool names to exclude.
# This extends the existing exclusions (e.g. from the global configuration)
# Find the list of tools here: https://oraios.github.io/serena/01-about/035_tools.html
excluded_tools: []
# list of tools to include that would otherwise be disabled (particularly optional tools that are disabled by default).
# This extends the existing inclusions (e.g. from the global configuration).
# Find the list of tools here: https://oraios.github.io/serena/01-about/035_tools.html
included_optional_tools: []
# fixed set of tools to use as the base tool set (if non-empty), replacing Serena's default set of tools.
# This cannot be combined with non-empty excluded_tools or included_optional_tools.
# Find the list of tools here: https://oraios.github.io/serena/01-about/035_tools.html
fixed_tools: []
# list of mode names that are to be activated by default, overriding the setting in the global configuration.
# The full set of modes to be activated is base_modes (from global config) + default_modes + added_modes.
# If the setting is undefined/empty, the default_modes from the global configuration (serena_config.yml) apply.
# Otherwise, this overrides the setting from the global configuration (serena_config.yml).
# Therefore, you can set this to [] if you do not want the default modes defined in the global config to apply
# for this project.
# This setting can, in turn, be overridden by CLI parameters (--mode).
# See https://oraios.github.io/serena/02-usage/050_configuration.html#modes
default_modes:
# list of mode names to be activated additionally for this project, e.g. ["query-projects"]
# The full set of modes to be activated is base_modes (from global config) + default_modes + added_modes.
# See https://oraios.github.io/serena/02-usage/050_configuration.html#modes
added_modes:
# initial prompt for the project. It will always be given to the LLM upon activating the project
# (contrary to the memories, which are loaded on demand).
initial_prompt: ""
# time budget (seconds) per tool call for the retrieval of additional symbol information
# such as docstrings or parameter information.
# This overrides the corresponding setting in the global configuration; see the documentation there.
# If null or missing, use the setting from the global configuration.
symbol_info_budget:
# list of regex patterns which, when matched, mark a memory entry as readonly.
# Extends the list from the global configuration, merging the two lists.
read_only_memory_patterns: []
# list of regex patterns for memories to completely ignore.
# Matching memories will not appear in list_memories or activate_project output
# and cannot be accessed via read_memory or write_memory.
# To access ignored memory files, use the read_file tool on the raw file path.
# Extends the list from the global configuration, merging the two lists.
# Example: ["_archive/.*", "_episodes/.*"]
ignored_memory_patterns: []
+26
@@ -73,3 +73,29 @@ When adding new Markdown files to `docmd/`, update the s
---
I continue to ask my questions and guide the discussion.
---
## MCP Tools
**Absolute rule: before any code work, call `mcp__serena__initial_instructions` to load the Serena instructions.**
### Tool hierarchy for this Rust project
**Code navigation and editing → serena first**
- Find a symbol, a declaration, or the implementations of a trait: `mcp__serena__find_symbol`, `mcp__serena__find_declaration`, `mcp__serena__find_implementations`
- Find the usages of a symbol: `mcp__serena__find_referencing_symbols`
- LSP diagnostics (compilation errors): `mcp__serena__get_diagnostics_for_file`
- Overview of a file: `mcp__serena__get_symbols_overview`
- Modify the body of a function/impl: `mcp__serena__replace_symbol_body`
- Do not use `cclsp` when serena covers the need
**Architectural analysis → jcodemunch**
- Hotspots, coupling, dead code, dependencies between modules
- Use before refactoring a critical area
**Complex reasoning → sequential-thinking**
- Architecture decisions, algorithm choices, non-trivial trade-offs
**Crate documentation → context7**
- Always consult before using an external library API
+30
@@ -22,6 +22,7 @@ $(MKDOCS): $(VENV)/bin/activate
		mkdocs mkdocs-material \
		mkdocs-mermaid2-plugin \
		mkdocs-bibtex
	$(PIP) install --quiet --upgrade InSilicoSeq

# ── obikmer binary ───────────────────────────────────────────────────────────
@@ -62,3 +63,32 @@ clean-doc:
.PHONY: clean
clean: clean-doc
	rm -rf $(VENV)

# ── release ───────────────────────────────────────────────────────────────────

CARGO_TOML := $(CARGO_DIR)/obikmer/Cargo.toml

.PHONY: bump-version
bump-version:
	@current=$$(grep '^version = ' $(CARGO_TOML) | head -n 1 | sed 's/version = "\(.*\)"/\1/'); \
	if [ -n "$(RELEASE)" ]; then \
		new_version="$(RELEASE)"; \
	else \
		major=$$(echo $$current | cut -d. -f1); \
		minor=$$(echo $$current | cut -d. -f2); \
		patch=$$(echo $$current | cut -d. -f3); \
		new_patch=$$((patch + 1)); \
		new_version="$$major.$$minor.$$new_patch"; \
	fi; \
	echo "Version: $$current -> $$new_version"; \
	sed -i.bak "s/^version = \"$$current\"/version = \"$$new_version\"/" $(CARGO_TOML) && \
	rm $(CARGO_TOML).bak

.PHONY: release
release: bump-version
	@new_version=$$(grep '^version = ' $(CARGO_TOML) | head -n 1 | sed 's/version = "\(.*\)"/\1/'); \
	git tag "v$$new_version"
	@jj auto-describe
	@jj git push --change @
	@new_version=$$(grep '^version = ' $(CARGO_TOML) | head -n 1 | sed 's/version = "\(.*\)"/\1/'); \
	git push origin "v$$new_version"
+15 -2
@@ -51,7 +51,13 @@ Non-ACGT characters act as hard breaks between k-mer segments in all formats.
              Runs scatter → dereplicate → count → layered MPHF.
              Resumes automatically if interrupted.
   merge      Merge multiple independently built indexes into one.
-  rebuild    Filter and compact an existing index: apply count thresholds,
+             Schedules partitions largest-first under a memory budget semaphore
+             to avoid OOM on machines with many cores. The worst partition runs
+             alone first to calibrate the expansion estimator; subsequent
+             partitions run in parallel within the budget.
+             --budget-fraction F   fraction of available RAM to use as budget
+                                   (default 0.5; reduce if OOM persists).
+  filter     Filter and compact an existing index: apply count thresholds,
              drop layers, rewrite as a single-layer index.
   reindex    Convert evidence in-place across all layers:
              exact (evidence.bin) ↔ approximate (fingerprint.bin).
@@ -74,7 +80,14 @@ Non-ACGT characters act as hard breaks between k-mer segments in all formats.
              Diagnostic / pipeline use.
   unitig     Dump the unitig sequences stored in a built index. Debug use.
   utils      Miscellaneous utilities.
-             --new-label NEW=OLD   renames a genome label in-place.
+             --new-label NEW=OLD   rename a genome label in-place.
+             --bits-per-kmer       print MPHF / evidence / matrix size breakdown.
+             --stats               per-genome k-mer counts as CSV.
+             --partition-stats     partition size distribution across one or more
+                                   indexes (markdown report to stdout). Useful to
+                                   diagnose minimizer imbalance before a large merge.
+             --csv FILE            write per-(partition, source) raw data to FILE
+                                   (used with --partition-stats).
 
 ## Quick start
+144
@@ -0,0 +1,144 @@
# Requires GNU Make >= 4.3 (grouped targets &:) — use gmake on macOS
BINARY := ../src/target/release/obikmer
VENV_PY := ../.venv/bin/python3
GENOMES := $(wildcard genomes/*.fna.gz)
# SPECIMENS, SPECIES, and the full dependency graph are generated by
# make_deps.py from the genome FASTA headers — like .d files in C.
# Make rebuilds deps.mk whenever genomes/ changes and restarts.
-include deps.mk
REF_NPZS := $(SPECIMENS:%=reference_index/%.npz)
PRESENCE_DONE := $(SPECIMENS:%=specimen_index_presence/%/index.done)
PRESENCE_STATS := $(SPECIMENS:%=stats/indexing_presence/%.stats)
COUNT_DONE := $(SPECIMENS:%=specimen_index_count/%/index.done)
COUNT_STATS := $(SPECIMENS:%=stats/indexing_count/%.stats)
VERIFY_PRESENCE_STATS := $(SPECIMENS:%=stats/verify_presence/%.stats)
VERIFY_COUNT_STATS := $(SPECIMENS:%=stats/verify_count/%.stats)
SPECIFIC_PRESENCE_DONE := $(SPECIES:%=specific_index_presence/%/index.done)
SPECIFIC_PRESENCE_STATS := $(SPECIES:%=stats/specific_kmer_presence/%.stats)
SPECIFIC_COUNT_DONE := $(SPECIES:%=specific_index_count/%/index.done)
SPECIFIC_COUNT_STATS := $(SPECIES:%=stats/specific_kmer_count/%.stats)
SIMULATED_READS := $(foreach s,$(SPECIMENS),simulated_data/$(subst --,/,$s)/reads_R1.fastq.gz)
.NOTPARALLEL:
.PHONY: all simulate reference \
        index_presence index_count \
        aggregate_index_presence aggregate_index_count \
        merge_presence merge_count \
        verify_presence verify_count \
        aggregate_verify_presence aggregate_verify_count \
        verify_merge_presence verify_merge_count \
        filter_presence filter_count \
        aggregate_filter_presence aggregate_filter_count

verify_merge_presence: stats/verify_merge_presence/current.csv
verify_merge_count: stats/verify_merge_count/current.csv

all: aggregate_verify_presence aggregate_verify_count \
     verify_merge_presence verify_merge_count \
     aggregate_filter_presence aggregate_filter_count

# ── dependency file ───────────────────────────────────────────────────────────

deps.mk: $(GENOMES)
	$(VENV_PY) make_deps.py $^ > $@

# ── simulation ────────────────────────────────────────────────────────────────
# Prerequisites (genome → reads) are in deps.mk; $< is the genome file.

$(SIMULATED_READS):
	bash simulate_one.sh $< $(dir $@)

simulate: $(SIMULATED_READS)

# ── reference kmer sets ───────────────────────────────────────────────────────
# Prerequisites (reads → npz) are in deps.mk.

reference_index/%.npz:
	bash build_reference.sh $*

reference: $(REF_NPZS)

# ── per-specimen indexing ─────────────────────────────────────────────────────
# Prerequisites (reads → index.done + .stats) are in deps.mk.

specimen_index_presence/%/index.done \
stats/indexing_presence/%.stats &: $(BINARY)
	bash index_one_presence.sh $*

specimen_index_count/%/index.done \
stats/indexing_count/%.stats &: $(BINARY)
	bash index_one_count.sh $*

index_presence: $(PRESENCE_DONE)
index_count: $(COUNT_DONE)

# ── indexing stats aggregation ────────────────────────────────────────────────

aggregate_index_presence: $(PRESENCE_STATS)
	bash aggregate_stats.sh indexing_presence

aggregate_index_count: $(COUNT_STATS)
	bash aggregate_stats.sh indexing_count

# ── global merge ──────────────────────────────────────────────────────────────

global_index_presence/index.done: $(PRESENCE_DONE) $(BINARY)
	bash merge_presence.sh

global_index_count/index.done: $(COUNT_DONE) $(BINARY)
	bash merge_count.sh

merge_presence: global_index_presence/index.done
merge_count: global_index_count/index.done

# ── per-specimen verification ─────────────────────────────────────────────────
# Prerequisites (index.done + npz → .stats) are in deps.mk.

stats/verify_presence/%.stats:
	bash verify_one_presence.sh $*

stats/verify_count/%.stats:
	bash verify_one_count.sh $*

verify_presence: $(VERIFY_PRESENCE_STATS)
verify_count: $(VERIFY_COUNT_STATS)

# ── verification stats aggregation ───────────────────────────────────────────

aggregate_verify_presence: $(VERIFY_PRESENCE_STATS)
	bash aggregate_stats.sh verify_presence

aggregate_verify_count: $(VERIFY_COUNT_STATS)
	bash aggregate_stats.sh verify_count

# ── species-specific indexes ──────────────────────────────────────────────────
# Prerequisites (global index → specific index) are in deps.mk.

specific_index_presence/%/index.done \
stats/specific_kmer_presence/%.stats &: $(BINARY)
	bash filter_one_presence.sh $*

specific_index_count/%/index.done \
stats/specific_kmer_count/%.stats &: $(BINARY)
	bash filter_one_count.sh $*

filter_presence: $(SPECIFIC_PRESENCE_DONE)
filter_count: $(SPECIFIC_COUNT_DONE)

aggregate_filter_presence: $(SPECIFIC_PRESENCE_STATS)
	bash aggregate_stats.sh specific_kmer_presence

aggregate_filter_count: $(SPECIFIC_COUNT_STATS)
	bash aggregate_stats.sh specific_kmer_count

# ── merged index verification ─────────────────────────────────────────────────

stats/verify_merge_presence/current.csv: $(REF_NPZS) global_index_presence/index.done
	bash verify_merge_presence.sh

stats/verify_merge_count/current.csv: $(REF_NPZS) global_index_count/index.done
	bash verify_merge_count.sh
+132
@@ -0,0 +1,132 @@
# Benchmark pipeline
Requires **GNU Make ≥ 4.3** (grouped targets `&:`). On macOS use `gmake`.
```
gmake all # full pipeline
gmake simulate # simulation only
gmake reference # reference kmer sets only
```
## Pipeline overview
```mermaid
flowchart TD
GENOMES["genomes/*.fna.gz"]
BIN["obikmer binary"]
GENOMES --> simulate
simulate --> simdata[("simulated_data/")]
simdata --> reference
reference --> refnpz[("reference_index/*.npz")]
subgraph presence ["Presence track"]
simdata --> index_presence
BIN --> index_presence
index_presence --> pres_done[("specimen_index_presence/")]
index_presence --> pres_istats[("stats/indexing_presence/")]
pres_istats --> aggregate_index_presence
pres_done --> merge_presence
BIN --> merge_presence
merge_presence --> gpres[("global_index_presence/")]
refnpz --> verify_presence
pres_done --> verify_presence
verify_presence --> vpres_stats[("stats/verify_presence/")]
vpres_stats --> aggregate_verify_presence
gpres --> filter_presence
BIN --> filter_presence
filter_presence --> spec_pres[("specific_index_presence/")]
filter_presence --> spec_pres_stats[("stats/specific_kmer_presence/")]
spec_pres_stats --> aggregate_filter_presence
refnpz --> verify_merge_presence
gpres --> verify_merge_presence
verify_merge_presence --> vmp[("stats/verify_merge_presence/")]
end
subgraph count ["Count track"]
simdata --> index_count
BIN --> index_count
index_count --> count_done[("specimen_index_count/")]
index_count --> count_istats[("stats/indexing_count/")]
count_istats --> aggregate_index_count
count_done --> merge_count
BIN --> merge_count
merge_count --> gcount[("global_index_count/")]
refnpz --> verify_count
count_done --> verify_count
verify_count --> vcount_stats[("stats/verify_count/")]
vcount_stats --> aggregate_verify_count
gcount --> filter_count
BIN --> filter_count
filter_count --> spec_count[("specific_index_count/")]
filter_count --> spec_count_stats[("stats/specific_kmer_count/")]
spec_count_stats --> aggregate_filter_count
refnpz --> verify_merge_count
gcount --> verify_merge_count
verify_merge_count --> vmc[("stats/verify_merge_count/")]
end
aggregate_verify_presence --> all
aggregate_verify_count --> all
vmp --> all
vmc --> all
all -. "$(MAKE) re-eval" .-> aggregate_filter_presence
all -. "$(MAKE) re-eval" .-> aggregate_filter_count
```
## Steps
| Target | Script | Description |
|---|---|---|
| `simulate` | `simulate.sh` | Simulate sequencing reads from the reference genomes |
| `reference` | `build_reference.sh` | Build reference kmer sets (`.npz`) from simulation truth |
| `index_presence` | `index_one_presence.sh` | Index each specimen (presence mode) |
| `index_count` | `index_one_count.sh` | Index each specimen (count mode) |
| `aggregate_index_presence` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (presence) |
| `aggregate_index_count` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (count) |
| `merge_presence` | `merge_presence.sh` | Merge all specimen presence indexes into a global index |
| `merge_count` | `merge_count.sh` | Merge all specimen count indexes into a global index |
| `verify_presence` | `verify_one_presence.sh` | Verify each specimen presence index against reference |
| `verify_count` | `verify_one_count.sh` | Verify each specimen count index against reference |
| `aggregate_verify_presence` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (presence) |
| `aggregate_verify_count` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (count) |
| `filter_presence` | `filter_one_presence.sh` | Extract species-specific presence indexes from global index |
| `filter_count` | `filter_one_count.sh` | Extract species-specific count indexes from global index |
| `aggregate_filter_presence` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (presence) |
| `aggregate_filter_count` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (count) |
| `verify_merge_presence` | `verify_merge_presence.sh` | Verify global presence index against all reference sets |
| `verify_merge_count` | `verify_merge_count.sh` | Verify global count index against all reference sets |
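
The verification steps compare the kmers of an index with a reference kmer set. The sketch below illustrates the kind of comparison involved, using the column names of the aggregated CSV headers (`ref_kmers`, `idx_kmers`, `false_neg`, `false_pos`, `count_mismatch`, `fn_pct`, `fp_pct`, `cm_pct`). It is not the code of the `verify_*` scripts: the function names are hypothetical, and the denominators used for the percentages are assumptions.

```python
def presence_stats(ref_kmers, idx_kmers):
    """Compare two collections of canonical kmer integers (presence mode)."""
    ref, idx = set(ref_kmers), set(idx_kmers)
    false_neg = len(ref - idx)  # in the reference, missing from the index
    false_pos = len(idx - ref)  # in the index, absent from the reference
    return {
        "ref_kmers": len(ref),
        "idx_kmers": len(idx),
        "false_neg": false_neg,
        "false_pos": false_pos,
        # Assumed denominators: reference size for FN, index size for FP.
        "fn_pct": 100.0 * false_neg / len(ref) if ref else 0.0,
        "fp_pct": 100.0 * false_pos / len(idx) if idx else 0.0,
    }


def count_stats(ref_counts, idx_counts):
    """Same comparison for count mode; inputs map kmer -> abundance."""
    stats = presence_stats(ref_counts, idx_counts)  # iterating a dict yields keys
    shared = ref_counts.keys() & idx_counts.keys()
    mismatch = sum(1 for kmer in shared if ref_counts[kmer] != idx_counts[kmer])
    stats["count_mismatch"] = mismatch
    # Assumed denominator: kmers present on both sides.
    stats["cm_pct"] = 100.0 * mismatch / len(shared) if shared else 0.0
    return stats
```

With a reference `{1, 2, 3, 4}` and an index `{2, 3, 4, 5}`, `presence_stats` reports one false negative (kmer 1) and one false positive (kmer 5).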
## Directory layout
```
benchmark/
├── genomes/                    # input reference genomes (.fna.gz)
├── simulated_data/             # generated by simulate
│   └── <species>/<strain>/
├── reference_index/            # reference kmer sets (.npz)
├── specimen_index_presence/    # per-specimen presence indexes
├── specimen_index_count/       # per-specimen count indexes
├── global_index_presence/      # merged global presence index
├── global_index_count/         # merged global count index
├── specific_index_presence/    # species-specific presence indexes
├── specific_index_count/       # species-specific count indexes
└── stats/                      # all benchmark statistics
    ├── indexing_presence/
    ├── indexing_count/
    ├── verify_presence/
    ├── verify_count/
    ├── specific_kmer_presence/
    ├── specific_kmer_count/
    ├── verify_merge_presence/
    └── verify_merge_count/
```
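
Per-specimen files are named after a specimen identifier of the form `<species>--<strain>` (for example `Escherichia_coli--K-12_MG1655`), while `simulated_data/` nests the strain under its species. The sketch below shows how such an identifier maps onto the layout, following the path patterns listed in `deps.mk`. The function name `specimen_paths` and the dictionary keys are hypothetical and only serve the illustration.

```python
def specimen_paths(specimen: str, mode: str = "presence") -> dict[str, str]:
    """Map a `<species>--<strain>` identifier to its benchmark paths."""
    if mode not in ("presence", "count"):
        raise ValueError(f"unknown mode: {mode!r}")
    # Split on the first `--`; strain names may contain single hyphens.
    species, sep, strain = specimen.partition("--")
    if not (sep and species and strain):
        raise ValueError(f"expected '<species>--<strain>', got {specimen!r}")
    return {
        "reads_r1": f"simulated_data/{species}/{strain}/reads_R1.fastq.gz",
        "reference": f"reference_index/{specimen}.npz",
        "index_done": f"specimen_index_{mode}/{specimen}/index.done",
        "indexing_stats": f"stats/indexing_{mode}/{specimen}.stats",
        "verify_stats": f"stats/verify_{mode}/{specimen}.stats",
    }
```

For instance, `specimen_paths("Escherichia_coli--K-12_MG1655")["reads_r1"]` gives `simulated_data/Escherichia_coli/K-12_MG1655/reads_R1.fastq.gz`.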
@@ -0,0 +1,53 @@
#!/usr/bin/env bash
# Usage: aggregate_stats.sh TYPE
# TYPE = indexing_presence | indexing_count | verify_presence | verify_count
#      | specific_kmer_presence | specific_kmer_count
#
# Reads all stats/TYPE/*.stats files (one CSV data row each, no header).
# Creates a new stats/TYPE/run_NNN.csv only if any .stats file is newer than
# the most recent run CSV (idempotent when nothing changed).
set -euo pipefail
TYPE="$1"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
STATS_DIR="${SCRIPT_DIR}/stats/${TYPE}"
case "${TYPE}" in
    indexing_presence|indexing_count)
        HEADER="run,species,strain,scatter_wall_s,scatter_rss_b,dereplicate_wall_s,dereplicate_rss_b,count_kmer_wall_s,count_kmer_rss_b,index_wall_s,index_rss_b,total_wall_s,total_rss_b"
        ;;
    verify_presence)
        HEADER="run,species,strain,ref_kmers,idx_kmers,false_neg,false_pos,fn_pct,fp_pct"
        ;;
    verify_count)
        HEADER="run,species,strain,ref_kmers,idx_kmers,false_neg,false_pos,count_mismatch,fn_pct,fp_pct,cm_pct"
        ;;
    specific_kmer_presence|specific_kmer_count)
        HEADER="run,species,rebuild_wall_s,rebuild_rss_b,pack_wall_s,pack_rss_b,filter_total_wall_s,filter_total_rss_b,select_wall_s,select_rss_b,select_total_wall_s,select_total_rss_b"
        ;;
    *)
        echo "ERROR: unknown stats type '${TYPE}'" >&2
        exit 1
        ;;
esac
# Find most recent existing run CSV (empty string if none).
latest_csv=$(find "${STATS_DIR}" -maxdepth 1 -name 'run_*.csv' 2>/dev/null | sort | tail -1)

# Check if any .stats file is newer than the latest run CSV.
if [[ -n "${latest_csv}" ]] && \
   [[ -z "$(find "${STATS_DIR}" -maxdepth 1 -name '*.stats' -newer "${latest_csv}" 2>/dev/null)" ]]; then
    echo "[${TYPE}] stats up to date (${latest_csv})"
    exit 0
fi

# Number the new run after the count of existing run CSVs (000, 001, ...).
run_n=$(printf '%03d' "$(find "${STATS_DIR}" -maxdepth 1 -name 'run_*.csv' 2>/dev/null | wc -l | tr -d ' ')")
CSV="${STATS_DIR}/run_${run_n}.csv"
echo "${HEADER}" >"${CSV}"

# Sort .stats files by name for reproducible row order.
while IFS= read -r stats_file; do
    sed "s/^/${run_n},/" "${stats_file}"
done < <(find "${STATS_DIR}" -maxdepth 1 -name '*.stats' | sort) >>"${CSV}"

echo "[${TYPE}] run ${run_n} → ${CSV}"
@@ -0,0 +1,137 @@
#!/usr/bin/env python3
"""Build a reference kmer index from paired-end FASTQ reads.

Extracts canonical kmers, i.e. min(kmer, revcomp(kmer)) encoded as uint64,
counts their abundances, and saves a sorted numpy pair (kmers, counts).

Output .npz arrays
    kmers  : uint64, sorted ascending; canonical kmer integers
    counts : uint32, same order; raw read abundances
"""

import argparse
import gzip
import sys
from collections import defaultdict

import numpy as np

# ── encoding ────────────────────────────────────────────────────────────────

_ENCODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3,
           'a': 0, 'c': 1, 'g': 2, 't': 3}

# Lookup table: revcomp of one byte (4 bases, 8 bits).
# Precomputed once at import time.
_REVCOMP8 = [0] * 256
for _i in range(256):
    _rc, _x = 0, _i
    for _ in range(4):
        _rc = (_rc << 2) | (3 - (_x & 3))
        _x >>= 2
    _REVCOMP8[_i] = _rc
del _i, _rc, _x


def revcomp_int(kmer: int, k: int) -> int:
    """Reverse-complement of a kmer encoded as an integer (2 bits/base).

    Uses byte-level lookup (4 bases at a time) for speed.
    """
    rc = 0
    bits_left = 2 * k
    while bits_left > 0:
        chunk = min(8, bits_left)
        rc_byte = _REVCOMP8[kmer & 0xFF] >> (8 - chunk)
        rc = (rc << chunk) | rc_byte
        kmer >>= chunk
        bits_left -= chunk
    return rc


# ── FASTQ parsing ────────────────────────────────────────────────────────────

def iter_sequences(path: str):
    """Yield raw sequences from a (gzipped) FASTQ file."""
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rt') as fh:
        while True:
            if not fh.readline():          # '@' header
                break
            seq = fh.readline().rstrip('\n')
            fh.readline()                  # '+'
            fh.readline()                  # quality
            yield seq


# ── kmer counting ────────────────────────────────────────────────────────────

def count_kmers(paths: list[str], k: int) -> dict[int, int]:
    mask = (1 << (2 * k)) - 1
    counts: dict[int, int] = defaultdict(int)
    n_reads = 0
    for path in paths:
        for seq in iter_sequences(path):
            n_reads += 1
            kmer = 0
            run = 0  # consecutive valid bases
            for c in seq:
                b = _ENCODE.get(c)
                if b is None:  # N or unexpected character → reset
                    kmer = 0
                    run = 0
                    continue
                kmer = ((kmer << 2) | b) & mask
                run += 1
                if run >= k:
                    rc = revcomp_int(kmer, k)
                    counts[kmer if kmer <= rc else rc] += 1
            if n_reads % 100_000 == 0:
                print(f'  {n_reads:,} reads processed, '
                      f'{len(counts):,} distinct kmers so far',
                      file=sys.stderr)
    print(f'  {n_reads:,} reads total, {len(counts):,} distinct kmers',
          file=sys.stderr)
    return counts


# ── main ─────────────────────────────────────────────────────────────────────

def main() -> None:
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument('reads', nargs='+', metavar='FASTQ',
                    help='Input reads (FASTQ, gzip OK)')
    ap.add_argument('-k', '--kmer-size', type=int, default=31,
                    metavar='K')
    ap.add_argument('--min-abundance', type=int, default=1,
                    metavar='N', help='Drop kmers with count < N (default 1)')
    ap.add_argument('-o', '--output', required=True,
                    metavar='FILE', help='Output .npz path')
    args = ap.parse_args()

    print(f'k={args.kmer_size} files={len(args.reads)}', file=sys.stderr)
    counts = count_kmers(args.reads, args.kmer_size)

    if args.min_abundance > 1:
        before = len(counts)
        counts = {k: v for k, v in counts.items() if v >= args.min_abundance}
        print(f'  min-abundance={args.min_abundance}: '
              f'{before - len(counts):,} kmers dropped, '
              f'{len(counts):,} retained',
              file=sys.stderr)

    print(f'Sorting and saving → {args.output}', file=sys.stderr)
    kmers_arr = np.fromiter(sorted(counts), dtype=np.uint64, count=len(counts))
    counts_arr = np.array([counts[int(k)] for k in kmers_arr], dtype=np.uint32)
    np.savez_compressed(args.output, kmers=kmers_arr, counts=counts_arr)
    print(f'Done {len(kmers_arr):,} kmers → {args.output}', file=sys.stderr)


if __name__ == '__main__':
    main()
@@ -0,0 +1,39 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
SIMDATA_DIR="${SCRIPT_DIR}/simulated_data"
REF_DIR="${SCRIPT_DIR}/reference_index"
PYTHON="${SCRIPT_DIR}/../.venv/bin/python3"
BUILD_PY="${SCRIPT_DIR}/build_reference.py"
KMER_SIZE="${KMER_SIZE:-31}"
MIN_ABUNDANCE="${MIN_ABUNDANCE:-1}"

mkdir -p "${REF_DIR}"

for species_dir in "${SIMDATA_DIR}"/*/; do
    [[ -d "${species_dir}" ]] || continue
    species=$(basename "${species_dir}")
    for strain_dir in "${species_dir}"*/; do
        [[ -d "${strain_dir}" ]] || continue
        strain=$(basename "${strain_dir}")
        r1="${strain_dir}/reads_R1.fastq.gz"
        r2="${strain_dir}/reads_R2.fastq.gz"
        if [[ ! -f "${r1}" || ! -f "${r2}" ]]; then
            echo "SKIP ${species}--${strain}: reads not found" >&2
            continue
        fi
        out="${REF_DIR}/${species}--${strain}.npz"
        echo "[${species}--${strain}] → ${out}"
        "${PYTHON}" "${BUILD_PY}" \
            --kmer-size "${KMER_SIZE}" \
            --min-abundance "${MIN_ABUNDANCE}" \
            --output "${out}" \
            "${r1}" "${r2}"
    done
done
@@ -0,0 +1,199 @@
SPECIMENS := Escherichia_coli--K-12_MG1655 Escherichia_coli--EDL933 Salmonella_enterica--LT2 Escherichia_coli--CFT073 Bacillus_subtilis--168 Salmonella_enterica--P125109 Shouchella_clausii--KSM-K16 Escherichia_coli--K-12_W3110 Klebsiella_pneumoniae--MGH_78578 Opitutus_terrae--PB90-1 Saccharolobus_islandicus--M.16.4 Acidobacterium_capsulatum--ATCC_51196 Salmonella_enterica--AKU_12601 Proteus_mirabilis--HI4320 Salmonella_enterica--CT18 Klebsiella_pneumoniae--HS11286 Wolbachia_endosymbiont--GCF_000306885.1_ASM30688v1 Klebsiella_pneumoniae--ATCC_13883 Yersinia_ruckeri--YRB Candidozyma_auris--GCF_003013715.1_ASM301371v2
SPECIES := Escherichia_coli Salmonella_enterica Bacillus_subtilis Shouchella_clausii Klebsiella_pneumoniae Opitutus_terrae Saccharolobus_islandicus Acidobacterium_capsulatum Proteus_mirabilis Wolbachia_endosymbiont Yersinia_ruckeri Candidozyma_auris
# Escherichia_coli--K-12_MG1655
simulated_data/Escherichia_coli/K-12_MG1655/reads_R1.fastq.gz: genomes/GCF_000005845.2_ASM584v2_genomic.fna.gz
reference_index/Escherichia_coli--K-12_MG1655.npz: simulated_data/Escherichia_coli/K-12_MG1655/reads_R1.fastq.gz
specimen_index_presence/Escherichia_coli--K-12_MG1655/index.done stats/indexing_presence/Escherichia_coli--K-12_MG1655.stats: simulated_data/Escherichia_coli/K-12_MG1655/reads_R1.fastq.gz
specimen_index_count/Escherichia_coli--K-12_MG1655/index.done stats/indexing_count/Escherichia_coli--K-12_MG1655.stats: simulated_data/Escherichia_coli/K-12_MG1655/reads_R1.fastq.gz
stats/verify_presence/Escherichia_coli--K-12_MG1655.stats: reference_index/Escherichia_coli--K-12_MG1655.npz specimen_index_presence/Escherichia_coli--K-12_MG1655/index.done
stats/verify_count/Escherichia_coli--K-12_MG1655.stats: reference_index/Escherichia_coli--K-12_MG1655.npz specimen_index_count/Escherichia_coli--K-12_MG1655/index.done
# Escherichia_coli--EDL933
simulated_data/Escherichia_coli/EDL933/reads_R1.fastq.gz: genomes/GCF_000006665.1_ASM666v1_genomic.fna.gz
reference_index/Escherichia_coli--EDL933.npz: simulated_data/Escherichia_coli/EDL933/reads_R1.fastq.gz
specimen_index_presence/Escherichia_coli--EDL933/index.done stats/indexing_presence/Escherichia_coli--EDL933.stats: simulated_data/Escherichia_coli/EDL933/reads_R1.fastq.gz
specimen_index_count/Escherichia_coli--EDL933/index.done stats/indexing_count/Escherichia_coli--EDL933.stats: simulated_data/Escherichia_coli/EDL933/reads_R1.fastq.gz
stats/verify_presence/Escherichia_coli--EDL933.stats: reference_index/Escherichia_coli--EDL933.npz specimen_index_presence/Escherichia_coli--EDL933/index.done
stats/verify_count/Escherichia_coli--EDL933.stats: reference_index/Escherichia_coli--EDL933.npz specimen_index_count/Escherichia_coli--EDL933/index.done
# Salmonella_enterica--LT2
simulated_data/Salmonella_enterica/LT2/reads_R1.fastq.gz: genomes/GCF_000006945.2_ASM694v2_genomic.fna.gz
reference_index/Salmonella_enterica--LT2.npz: simulated_data/Salmonella_enterica/LT2/reads_R1.fastq.gz
specimen_index_presence/Salmonella_enterica--LT2/index.done stats/indexing_presence/Salmonella_enterica--LT2.stats: simulated_data/Salmonella_enterica/LT2/reads_R1.fastq.gz
specimen_index_count/Salmonella_enterica--LT2/index.done stats/indexing_count/Salmonella_enterica--LT2.stats: simulated_data/Salmonella_enterica/LT2/reads_R1.fastq.gz
stats/verify_presence/Salmonella_enterica--LT2.stats: reference_index/Salmonella_enterica--LT2.npz specimen_index_presence/Salmonella_enterica--LT2/index.done
stats/verify_count/Salmonella_enterica--LT2.stats: reference_index/Salmonella_enterica--LT2.npz specimen_index_count/Salmonella_enterica--LT2/index.done
# Escherichia_coli--CFT073
simulated_data/Escherichia_coli/CFT073/reads_R1.fastq.gz: genomes/GCF_000007445.1_ASM744v1_genomic.fna.gz
reference_index/Escherichia_coli--CFT073.npz: simulated_data/Escherichia_coli/CFT073/reads_R1.fastq.gz
specimen_index_presence/Escherichia_coli--CFT073/index.done stats/indexing_presence/Escherichia_coli--CFT073.stats: simulated_data/Escherichia_coli/CFT073/reads_R1.fastq.gz
specimen_index_count/Escherichia_coli--CFT073/index.done stats/indexing_count/Escherichia_coli--CFT073.stats: simulated_data/Escherichia_coli/CFT073/reads_R1.fastq.gz
stats/verify_presence/Escherichia_coli--CFT073.stats: reference_index/Escherichia_coli--CFT073.npz specimen_index_presence/Escherichia_coli--CFT073/index.done
stats/verify_count/Escherichia_coli--CFT073.stats: reference_index/Escherichia_coli--CFT073.npz specimen_index_count/Escherichia_coli--CFT073/index.done
# Bacillus_subtilis--168
simulated_data/Bacillus_subtilis/168/reads_R1.fastq.gz: genomes/GCF_000009045.1_ASM904v1_genomic.fna.gz
reference_index/Bacillus_subtilis--168.npz: simulated_data/Bacillus_subtilis/168/reads_R1.fastq.gz
specimen_index_presence/Bacillus_subtilis--168/index.done stats/indexing_presence/Bacillus_subtilis--168.stats: simulated_data/Bacillus_subtilis/168/reads_R1.fastq.gz
specimen_index_count/Bacillus_subtilis--168/index.done stats/indexing_count/Bacillus_subtilis--168.stats: simulated_data/Bacillus_subtilis/168/reads_R1.fastq.gz
stats/verify_presence/Bacillus_subtilis--168.stats: reference_index/Bacillus_subtilis--168.npz specimen_index_presence/Bacillus_subtilis--168/index.done
stats/verify_count/Bacillus_subtilis--168.stats: reference_index/Bacillus_subtilis--168.npz specimen_index_count/Bacillus_subtilis--168/index.done
# Salmonella_enterica--P125109
simulated_data/Salmonella_enterica/P125109/reads_R1.fastq.gz: genomes/GCF_000009505.1_ASM950v1_genomic.fna.gz
reference_index/Salmonella_enterica--P125109.npz: simulated_data/Salmonella_enterica/P125109/reads_R1.fastq.gz
specimen_index_presence/Salmonella_enterica--P125109/index.done stats/indexing_presence/Salmonella_enterica--P125109.stats: simulated_data/Salmonella_enterica/P125109/reads_R1.fastq.gz
specimen_index_count/Salmonella_enterica--P125109/index.done stats/indexing_count/Salmonella_enterica--P125109.stats: simulated_data/Salmonella_enterica/P125109/reads_R1.fastq.gz
stats/verify_presence/Salmonella_enterica--P125109.stats: reference_index/Salmonella_enterica--P125109.npz specimen_index_presence/Salmonella_enterica--P125109/index.done
stats/verify_count/Salmonella_enterica--P125109.stats: reference_index/Salmonella_enterica--P125109.npz specimen_index_count/Salmonella_enterica--P125109/index.done
# Shouchella_clausii--KSM-K16
simulated_data/Shouchella_clausii/KSM-K16/reads_R1.fastq.gz: genomes/GCF_000009825.1_ASM982v1_genomic.fna.gz
reference_index/Shouchella_clausii--KSM-K16.npz: simulated_data/Shouchella_clausii/KSM-K16/reads_R1.fastq.gz
specimen_index_presence/Shouchella_clausii--KSM-K16/index.done stats/indexing_presence/Shouchella_clausii--KSM-K16.stats: simulated_data/Shouchella_clausii/KSM-K16/reads_R1.fastq.gz
specimen_index_count/Shouchella_clausii--KSM-K16/index.done stats/indexing_count/Shouchella_clausii--KSM-K16.stats: simulated_data/Shouchella_clausii/KSM-K16/reads_R1.fastq.gz
stats/verify_presence/Shouchella_clausii--KSM-K16.stats: reference_index/Shouchella_clausii--KSM-K16.npz specimen_index_presence/Shouchella_clausii--KSM-K16/index.done
stats/verify_count/Shouchella_clausii--KSM-K16.stats: reference_index/Shouchella_clausii--KSM-K16.npz specimen_index_count/Shouchella_clausii--KSM-K16/index.done
# Escherichia_coli--K-12_W3110
simulated_data/Escherichia_coli/K-12_W3110/reads_R1.fastq.gz: genomes/GCF_000010245.2_ASM1024v1_genomic.fna.gz
reference_index/Escherichia_coli--K-12_W3110.npz: simulated_data/Escherichia_coli/K-12_W3110/reads_R1.fastq.gz
specimen_index_presence/Escherichia_coli--K-12_W3110/index.done stats/indexing_presence/Escherichia_coli--K-12_W3110.stats: simulated_data/Escherichia_coli/K-12_W3110/reads_R1.fastq.gz
specimen_index_count/Escherichia_coli--K-12_W3110/index.done stats/indexing_count/Escherichia_coli--K-12_W3110.stats: simulated_data/Escherichia_coli/K-12_W3110/reads_R1.fastq.gz
stats/verify_presence/Escherichia_coli--K-12_W3110.stats: reference_index/Escherichia_coli--K-12_W3110.npz specimen_index_presence/Escherichia_coli--K-12_W3110/index.done
stats/verify_count/Escherichia_coli--K-12_W3110.stats: reference_index/Escherichia_coli--K-12_W3110.npz specimen_index_count/Escherichia_coli--K-12_W3110/index.done
# Klebsiella_pneumoniae--MGH_78578
simulated_data/Klebsiella_pneumoniae/MGH_78578/reads_R1.fastq.gz: genomes/GCF_000016305.1_ASM1630v1_genomic.fna.gz
reference_index/Klebsiella_pneumoniae--MGH_78578.npz: simulated_data/Klebsiella_pneumoniae/MGH_78578/reads_R1.fastq.gz
specimen_index_presence/Klebsiella_pneumoniae--MGH_78578/index.done stats/indexing_presence/Klebsiella_pneumoniae--MGH_78578.stats: simulated_data/Klebsiella_pneumoniae/MGH_78578/reads_R1.fastq.gz
specimen_index_count/Klebsiella_pneumoniae--MGH_78578/index.done stats/indexing_count/Klebsiella_pneumoniae--MGH_78578.stats: simulated_data/Klebsiella_pneumoniae/MGH_78578/reads_R1.fastq.gz
stats/verify_presence/Klebsiella_pneumoniae--MGH_78578.stats: reference_index/Klebsiella_pneumoniae--MGH_78578.npz specimen_index_presence/Klebsiella_pneumoniae--MGH_78578/index.done
stats/verify_count/Klebsiella_pneumoniae--MGH_78578.stats: reference_index/Klebsiella_pneumoniae--MGH_78578.npz specimen_index_count/Klebsiella_pneumoniae--MGH_78578/index.done
# Opitutus_terrae--PB90-1
simulated_data/Opitutus_terrae/PB90-1/reads_R1.fastq.gz: genomes/GCF_000019965.1_ASM1996v1_genomic.fna.gz
reference_index/Opitutus_terrae--PB90-1.npz: simulated_data/Opitutus_terrae/PB90-1/reads_R1.fastq.gz
specimen_index_presence/Opitutus_terrae--PB90-1/index.done stats/indexing_presence/Opitutus_terrae--PB90-1.stats: simulated_data/Opitutus_terrae/PB90-1/reads_R1.fastq.gz
specimen_index_count/Opitutus_terrae--PB90-1/index.done stats/indexing_count/Opitutus_terrae--PB90-1.stats: simulated_data/Opitutus_terrae/PB90-1/reads_R1.fastq.gz
stats/verify_presence/Opitutus_terrae--PB90-1.stats: reference_index/Opitutus_terrae--PB90-1.npz specimen_index_presence/Opitutus_terrae--PB90-1/index.done
stats/verify_count/Opitutus_terrae--PB90-1.stats: reference_index/Opitutus_terrae--PB90-1.npz specimen_index_count/Opitutus_terrae--PB90-1/index.done
# Saccharolobus_islandicus--M.16.4
simulated_data/Saccharolobus_islandicus/M.16.4/reads_R1.fastq.gz: genomes/GCF_000022445.1_ASM2244v1_genomic.fna.gz
reference_index/Saccharolobus_islandicus--M.16.4.npz: simulated_data/Saccharolobus_islandicus/M.16.4/reads_R1.fastq.gz
specimen_index_presence/Saccharolobus_islandicus--M.16.4/index.done stats/indexing_presence/Saccharolobus_islandicus--M.16.4.stats: simulated_data/Saccharolobus_islandicus/M.16.4/reads_R1.fastq.gz
specimen_index_count/Saccharolobus_islandicus--M.16.4/index.done stats/indexing_count/Saccharolobus_islandicus--M.16.4.stats: simulated_data/Saccharolobus_islandicus/M.16.4/reads_R1.fastq.gz
stats/verify_presence/Saccharolobus_islandicus--M.16.4.stats: reference_index/Saccharolobus_islandicus--M.16.4.npz specimen_index_presence/Saccharolobus_islandicus--M.16.4/index.done
stats/verify_count/Saccharolobus_islandicus--M.16.4.stats: reference_index/Saccharolobus_islandicus--M.16.4.npz specimen_index_count/Saccharolobus_islandicus--M.16.4/index.done
# Acidobacterium_capsulatum--ATCC_51196
simulated_data/Acidobacterium_capsulatum/ATCC_51196/reads_R1.fastq.gz: genomes/GCF_000022565.1_ASM2256v1_genomic.fna.gz
reference_index/Acidobacterium_capsulatum--ATCC_51196.npz: simulated_data/Acidobacterium_capsulatum/ATCC_51196/reads_R1.fastq.gz
specimen_index_presence/Acidobacterium_capsulatum--ATCC_51196/index.done stats/indexing_presence/Acidobacterium_capsulatum--ATCC_51196.stats: simulated_data/Acidobacterium_capsulatum/ATCC_51196/reads_R1.fastq.gz
specimen_index_count/Acidobacterium_capsulatum--ATCC_51196/index.done stats/indexing_count/Acidobacterium_capsulatum--ATCC_51196.stats: simulated_data/Acidobacterium_capsulatum/ATCC_51196/reads_R1.fastq.gz
stats/verify_presence/Acidobacterium_capsulatum--ATCC_51196.stats: reference_index/Acidobacterium_capsulatum--ATCC_51196.npz specimen_index_presence/Acidobacterium_capsulatum--ATCC_51196/index.done
stats/verify_count/Acidobacterium_capsulatum--ATCC_51196.stats: reference_index/Acidobacterium_capsulatum--ATCC_51196.npz specimen_index_count/Acidobacterium_capsulatum--ATCC_51196/index.done
# Salmonella_enterica--AKU_12601
simulated_data/Salmonella_enterica/AKU_12601/reads_R1.fastq.gz: genomes/GCF_000026565.1_ASM2656v1_genomic.fna.gz
reference_index/Salmonella_enterica--AKU_12601.npz: simulated_data/Salmonella_enterica/AKU_12601/reads_R1.fastq.gz
specimen_index_presence/Salmonella_enterica--AKU_12601/index.done stats/indexing_presence/Salmonella_enterica--AKU_12601.stats: simulated_data/Salmonella_enterica/AKU_12601/reads_R1.fastq.gz
specimen_index_count/Salmonella_enterica--AKU_12601/index.done stats/indexing_count/Salmonella_enterica--AKU_12601.stats: simulated_data/Salmonella_enterica/AKU_12601/reads_R1.fastq.gz
stats/verify_presence/Salmonella_enterica--AKU_12601.stats: reference_index/Salmonella_enterica--AKU_12601.npz specimen_index_presence/Salmonella_enterica--AKU_12601/index.done
stats/verify_count/Salmonella_enterica--AKU_12601.stats: reference_index/Salmonella_enterica--AKU_12601.npz specimen_index_count/Salmonella_enterica--AKU_12601/index.done
# Proteus_mirabilis--HI4320
simulated_data/Proteus_mirabilis/HI4320/reads_R1.fastq.gz: genomes/GCF_000069965.1_ASM6996v1_genomic.fna.gz
reference_index/Proteus_mirabilis--HI4320.npz: simulated_data/Proteus_mirabilis/HI4320/reads_R1.fastq.gz
specimen_index_presence/Proteus_mirabilis--HI4320/index.done stats/indexing_presence/Proteus_mirabilis--HI4320.stats: simulated_data/Proteus_mirabilis/HI4320/reads_R1.fastq.gz
specimen_index_count/Proteus_mirabilis--HI4320/index.done stats/indexing_count/Proteus_mirabilis--HI4320.stats: simulated_data/Proteus_mirabilis/HI4320/reads_R1.fastq.gz
stats/verify_presence/Proteus_mirabilis--HI4320.stats: reference_index/Proteus_mirabilis--HI4320.npz specimen_index_presence/Proteus_mirabilis--HI4320/index.done
stats/verify_count/Proteus_mirabilis--HI4320.stats: reference_index/Proteus_mirabilis--HI4320.npz specimen_index_count/Proteus_mirabilis--HI4320/index.done
# Salmonella_enterica--CT18
simulated_data/Salmonella_enterica/CT18/reads_R1.fastq.gz: genomes/GCF_000195995.1_ASM19599v1_genomic.fna.gz
reference_index/Salmonella_enterica--CT18.npz: simulated_data/Salmonella_enterica/CT18/reads_R1.fastq.gz
specimen_index_presence/Salmonella_enterica--CT18/index.done stats/indexing_presence/Salmonella_enterica--CT18.stats: simulated_data/Salmonella_enterica/CT18/reads_R1.fastq.gz
specimen_index_count/Salmonella_enterica--CT18/index.done stats/indexing_count/Salmonella_enterica--CT18.stats: simulated_data/Salmonella_enterica/CT18/reads_R1.fastq.gz
stats/verify_presence/Salmonella_enterica--CT18.stats: reference_index/Salmonella_enterica--CT18.npz specimen_index_presence/Salmonella_enterica--CT18/index.done
stats/verify_count/Salmonella_enterica--CT18.stats: reference_index/Salmonella_enterica--CT18.npz specimen_index_count/Salmonella_enterica--CT18/index.done
# Klebsiella_pneumoniae--HS11286
simulated_data/Klebsiella_pneumoniae/HS11286/reads_R1.fastq.gz: genomes/GCF_000240185.1_ASM24018v2_genomic.fna.gz
reference_index/Klebsiella_pneumoniae--HS11286.npz: simulated_data/Klebsiella_pneumoniae/HS11286/reads_R1.fastq.gz
specimen_index_presence/Klebsiella_pneumoniae--HS11286/index.done stats/indexing_presence/Klebsiella_pneumoniae--HS11286.stats: simulated_data/Klebsiella_pneumoniae/HS11286/reads_R1.fastq.gz
specimen_index_count/Klebsiella_pneumoniae--HS11286/index.done stats/indexing_count/Klebsiella_pneumoniae--HS11286.stats: simulated_data/Klebsiella_pneumoniae/HS11286/reads_R1.fastq.gz
stats/verify_presence/Klebsiella_pneumoniae--HS11286.stats: reference_index/Klebsiella_pneumoniae--HS11286.npz specimen_index_presence/Klebsiella_pneumoniae--HS11286/index.done
stats/verify_count/Klebsiella_pneumoniae--HS11286.stats: reference_index/Klebsiella_pneumoniae--HS11286.npz specimen_index_count/Klebsiella_pneumoniae--HS11286/index.done
# Wolbachia_endosymbiont--GCF_000306885.1_ASM30688v1
simulated_data/Wolbachia_endosymbiont/GCF_000306885.1_ASM30688v1/reads_R1.fastq.gz: genomes/GCF_000306885.1_ASM30688v1_genomic.fna.gz
reference_index/Wolbachia_endosymbiont--GCF_000306885.1_ASM30688v1.npz: simulated_data/Wolbachia_endosymbiont/GCF_000306885.1_ASM30688v1/reads_R1.fastq.gz
specimen_index_presence/Wolbachia_endosymbiont--GCF_000306885.1_ASM30688v1/index.done stats/indexing_presence/Wolbachia_endosymbiont--GCF_000306885.1_ASM30688v1.stats: simulated_data/Wolbachia_endosymbiont/GCF_000306885.1_ASM30688v1/reads_R1.fastq.gz
specimen_index_count/Wolbachia_endosymbiont--GCF_000306885.1_ASM30688v1/index.done stats/indexing_count/Wolbachia_endosymbiont--GCF_000306885.1_ASM30688v1.stats: simulated_data/Wolbachia_endosymbiont/GCF_000306885.1_ASM30688v1/reads_R1.fastq.gz
stats/verify_presence/Wolbachia_endosymbiont--GCF_000306885.1_ASM30688v1.stats: reference_index/Wolbachia_endosymbiont--GCF_000306885.1_ASM30688v1.npz specimen_index_presence/Wolbachia_endosymbiont--GCF_000306885.1_ASM30688v1/index.done
stats/verify_count/Wolbachia_endosymbiont--GCF_000306885.1_ASM30688v1.stats: reference_index/Wolbachia_endosymbiont--GCF_000306885.1_ASM30688v1.npz specimen_index_count/Wolbachia_endosymbiont--GCF_000306885.1_ASM30688v1/index.done
# Klebsiella_pneumoniae--ATCC_13883
simulated_data/Klebsiella_pneumoniae/ATCC_13883/reads_R1.fastq.gz: genomes/GCF_000742135.1_ASM74213v1_genomic.fna.gz
reference_index/Klebsiella_pneumoniae--ATCC_13883.npz: simulated_data/Klebsiella_pneumoniae/ATCC_13883/reads_R1.fastq.gz
specimen_index_presence/Klebsiella_pneumoniae--ATCC_13883/index.done stats/indexing_presence/Klebsiella_pneumoniae--ATCC_13883.stats: simulated_data/Klebsiella_pneumoniae/ATCC_13883/reads_R1.fastq.gz
specimen_index_count/Klebsiella_pneumoniae--ATCC_13883/index.done stats/indexing_count/Klebsiella_pneumoniae--ATCC_13883.stats: simulated_data/Klebsiella_pneumoniae/ATCC_13883/reads_R1.fastq.gz
stats/verify_presence/Klebsiella_pneumoniae--ATCC_13883.stats: reference_index/Klebsiella_pneumoniae--ATCC_13883.npz specimen_index_presence/Klebsiella_pneumoniae--ATCC_13883/index.done
stats/verify_count/Klebsiella_pneumoniae--ATCC_13883.stats: reference_index/Klebsiella_pneumoniae--ATCC_13883.npz specimen_index_count/Klebsiella_pneumoniae--ATCC_13883/index.done
# Yersinia_ruckeri--YRB
simulated_data/Yersinia_ruckeri/YRB/reads_R1.fastq.gz: genomes/GCF_000834255.1_ASM83425v1_genomic.fna.gz
reference_index/Yersinia_ruckeri--YRB.npz: simulated_data/Yersinia_ruckeri/YRB/reads_R1.fastq.gz
specimen_index_presence/Yersinia_ruckeri--YRB/index.done stats/indexing_presence/Yersinia_ruckeri--YRB.stats: simulated_data/Yersinia_ruckeri/YRB/reads_R1.fastq.gz
specimen_index_count/Yersinia_ruckeri--YRB/index.done stats/indexing_count/Yersinia_ruckeri--YRB.stats: simulated_data/Yersinia_ruckeri/YRB/reads_R1.fastq.gz
stats/verify_presence/Yersinia_ruckeri--YRB.stats: reference_index/Yersinia_ruckeri--YRB.npz specimen_index_presence/Yersinia_ruckeri--YRB/index.done
stats/verify_count/Yersinia_ruckeri--YRB.stats: reference_index/Yersinia_ruckeri--YRB.npz specimen_index_count/Yersinia_ruckeri--YRB/index.done
# Candidozyma_auris--GCF_003013715.1_ASM301371v2
simulated_data/Candidozyma_auris/GCF_003013715.1_ASM301371v2/reads_R1.fastq.gz: genomes/GCF_003013715.1_ASM301371v2_genomic.fna.gz
reference_index/Candidozyma_auris--GCF_003013715.1_ASM301371v2.npz: simulated_data/Candidozyma_auris/GCF_003013715.1_ASM301371v2/reads_R1.fastq.gz
specimen_index_presence/Candidozyma_auris--GCF_003013715.1_ASM301371v2/index.done stats/indexing_presence/Candidozyma_auris--GCF_003013715.1_ASM301371v2.stats: simulated_data/Candidozyma_auris/GCF_003013715.1_ASM301371v2/reads_R1.fastq.gz
specimen_index_count/Candidozyma_auris--GCF_003013715.1_ASM301371v2/index.done stats/indexing_count/Candidozyma_auris--GCF_003013715.1_ASM301371v2.stats: simulated_data/Candidozyma_auris/GCF_003013715.1_ASM301371v2/reads_R1.fastq.gz
stats/verify_presence/Candidozyma_auris--GCF_003013715.1_ASM301371v2.stats: reference_index/Candidozyma_auris--GCF_003013715.1_ASM301371v2.npz specimen_index_presence/Candidozyma_auris--GCF_003013715.1_ASM301371v2/index.done
stats/verify_count/Candidozyma_auris--GCF_003013715.1_ASM301371v2.stats: reference_index/Candidozyma_auris--GCF_003013715.1_ASM301371v2.npz specimen_index_count/Candidozyma_auris--GCF_003013715.1_ASM301371v2/index.done
# Escherichia_coli
specific_index_presence/Escherichia_coli/index.done stats/specific_kmer_presence/Escherichia_coli.stats: global_index_presence/index.done
specific_index_count/Escherichia_coli/index.done stats/specific_kmer_count/Escherichia_coli.stats: global_index_count/index.done
# Salmonella_enterica
specific_index_presence/Salmonella_enterica/index.done stats/specific_kmer_presence/Salmonella_enterica.stats: global_index_presence/index.done
specific_index_count/Salmonella_enterica/index.done stats/specific_kmer_count/Salmonella_enterica.stats: global_index_count/index.done
# Bacillus_subtilis
specific_index_presence/Bacillus_subtilis/index.done stats/specific_kmer_presence/Bacillus_subtilis.stats: global_index_presence/index.done
specific_index_count/Bacillus_subtilis/index.done stats/specific_kmer_count/Bacillus_subtilis.stats: global_index_count/index.done
# Shouchella_clausii
specific_index_presence/Shouchella_clausii/index.done stats/specific_kmer_presence/Shouchella_clausii.stats: global_index_presence/index.done
specific_index_count/Shouchella_clausii/index.done stats/specific_kmer_count/Shouchella_clausii.stats: global_index_count/index.done
# Klebsiella_pneumoniae
specific_index_presence/Klebsiella_pneumoniae/index.done stats/specific_kmer_presence/Klebsiella_pneumoniae.stats: global_index_presence/index.done
specific_index_count/Klebsiella_pneumoniae/index.done stats/specific_kmer_count/Klebsiella_pneumoniae.stats: global_index_count/index.done
# Opitutus_terrae
specific_index_presence/Opitutus_terrae/index.done stats/specific_kmer_presence/Opitutus_terrae.stats: global_index_presence/index.done
specific_index_count/Opitutus_terrae/index.done stats/specific_kmer_count/Opitutus_terrae.stats: global_index_count/index.done
# Saccharolobus_islandicus
specific_index_presence/Saccharolobus_islandicus/index.done stats/specific_kmer_presence/Saccharolobus_islandicus.stats: global_index_presence/index.done
specific_index_count/Saccharolobus_islandicus/index.done stats/specific_kmer_count/Saccharolobus_islandicus.stats: global_index_count/index.done
# Acidobacterium_capsulatum
specific_index_presence/Acidobacterium_capsulatum/index.done stats/specific_kmer_presence/Acidobacterium_capsulatum.stats: global_index_presence/index.done
specific_index_count/Acidobacterium_capsulatum/index.done stats/specific_kmer_count/Acidobacterium_capsulatum.stats: global_index_count/index.done
# Proteus_mirabilis
specific_index_presence/Proteus_mirabilis/index.done stats/specific_kmer_presence/Proteus_mirabilis.stats: global_index_presence/index.done
specific_index_count/Proteus_mirabilis/index.done stats/specific_kmer_count/Proteus_mirabilis.stats: global_index_count/index.done
# Wolbachia_endosymbiont
specific_index_presence/Wolbachia_endosymbiont/index.done stats/specific_kmer_presence/Wolbachia_endosymbiont.stats: global_index_presence/index.done
specific_index_count/Wolbachia_endosymbiont/index.done stats/specific_kmer_count/Wolbachia_endosymbiont.stats: global_index_count/index.done
# Yersinia_ruckeri
specific_index_presence/Yersinia_ruckeri/index.done stats/specific_kmer_presence/Yersinia_ruckeri.stats: global_index_presence/index.done
specific_index_count/Yersinia_ruckeri/index.done stats/specific_kmer_count/Yersinia_ruckeri.stats: global_index_count/index.done
# Candidozyma_auris
specific_index_presence/Candidozyma_auris/index.done stats/specific_kmer_presence/Candidozyma_auris.stats: global_index_presence/index.done
specific_index_count/Candidozyma_auris/index.done stats/specific_kmer_count/Candidozyma_auris.stats: global_index_count/index.done
+48
@@ -0,0 +1,48 @@
#!/usr/bin/env bash
set -euo pipefail
assemblies=(
GCF_000005845.2
GCF_000010245.2
GCF_000007445.1
GCF_000006665.1
GCF_000006945.2
GCF_000195995.1
GCF_000009505.1
GCF_000026565.1
GCF_000016305.1
GCF_000019965.1
GCF_000240185.1
GCF_000742135.1
GCF_000069965.1
GCF_000022565.1
GCF_000306885.1
GCF_003013715.1
GCF_000009045.1
GCF_000009825.1
GCF_000022445.1
GCF_000834255.1
)
mkdir -p genomes
for acc in "${assemblies[@]}"; do
  echo "Downloading ${acc}"
  datasets download genome accession "${acc}" \
    --include genome \
    --filename "${acc}.zip"
  unzip -q "${acc}.zip" -d "${acc}"
  # Read paths verbatim (-r, empty IFS) and quote them so names containing
  # spaces or backslashes survive.
  find "${acc}" -name "*.fna" |
    while IFS= read -r file; do
      obiconvert -Z "${file}" >"genomes/$(basename "${file}").gz"
    done
  rm -rf "${acc}" "${acc}.zip"
done
+108
@@ -0,0 +1,108 @@
#!/usr/bin/env bash
# Usage: filter_one_count.sh SPECIES
# Filters global_index_count to keep only kmers specific to SPECIES,
# then selects the SPECIES column in-place.
# Outputs:
# specific_index_count/SPECIES/index.done (written by obikmer select)
# stats/specific_kmer_count/SPECIES.stats (one CSV data row, no header)
set -euo pipefail
SPECIES="$1"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BINARY="${SCRIPT_DIR}/../src/target/release/obikmer"
SOURCE="${SCRIPT_DIR}/global_index_count"
OUTPUT="${SCRIPT_DIR}/specific_index_count/${SPECIES}"
STATS_DIR="${SCRIPT_DIR}/stats/specific_kmer_count"
STATS_FILE="${STATS_DIR}/${SPECIES}.stats"
mkdir -p "${STATS_DIR}"
echo "[${SPECIES}] filter (count) → ${OUTPUT}"
LOG_FILTER=$(mktemp)
LOG_SELECT=$(mktemp)
trap 'rm -f "${LOG_FILTER}" "${LOG_SELECT}"' EXIT
"${BINARY}" filter \
--output "${OUTPUT}" \
--force \
--ingroup "species=${SPECIES}" \
--outgroup all \
--min-frac 0.5 \
--max-frac 1.0 \
--max-outgroup-count 0 \
"${SOURCE}" \
2>"${LOG_FILTER}"
cat "${LOG_FILTER}" >&2
"${BINARY}" select \
--in-place \
--group "${SPECIES}:species=${SPECIES}" \
--group-op "${SPECIES}:any" \
--select "${SPECIES}" \
"${OUTPUT}" \
2>"${LOG_SELECT}"
cat "${LOG_SELECT}" >&2
python3 - "${SPECIES}" "${LOG_FILTER}" "${LOG_SELECT}" <<'PYEOF' >"${STATS_FILE}"
import sys, re
species, log_filter, log_select = sys.argv[1], sys.argv[2], sys.argv[3]
def strip_ansi(s):
return re.sub(r'\x1b\[[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]', '', s)
def parse_wall(s):
s = s.strip()
if s.endswith('ms'): return float(s[:-2]) / 1000.0
if s.endswith('s'): return float(s[:-1])
return 0.0
def parse_rss(s):
m = re.match(r'([\d.]+)\s*(GB|MB|KB|B)', s.strip())
if not m: return 0
return int(float(m.group(1)) * {'GB': 1<<30, 'MB': 1<<20, 'KB': 1024, 'B': 1}[m.group(2)])
def is_sep(s):
return bool(s) and not re.search(r'[A-Za-z0-9]', s)
def parse_reporter(logfile):
stats = {}
state = 'scan'
with open(logfile, errors='replace') as fh:
for raw in fh:
line = strip_ansi(raw.rstrip('\n'))
s = line.strip()
if state == 'scan':
if re.search(r'\bstage\b.*\bwall\b', line):
state = 'in_header'
elif state == 'in_header':
if is_sep(s): state = 'rows'
elif state == 'rows':
if is_sep(s): state = 'total'
elif s:
parts = re.split(r' +', s)
if len(parts) >= 4:
stats[parts[0]] = (parse_wall(parts[1]), parse_rss(parts[3]))
elif state == 'total':
if s:
parts = re.split(r' +', s)
if len(parts) >= 3:
stats['TOTAL'] = (parse_wall(parts[1]),
parse_rss(parts[3]) if len(parts) > 3 else 0)
break
return stats
f = parse_reporter(log_filter)
s = parse_reporter(log_select)
row = [species]
for stage, d in [('rebuild', f), ('pack', f), ('filter_total', f), ('select', s), ('select_total', s)]:
key = 'TOTAL' if stage.endswith('_total') else stage
w, r = d.get(key, ('', ''))
row += [f'{w:.3f}' if isinstance(w, float) else '', str(r)]
print(','.join(row))
PYEOF
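The helpers embedded in the heredoc above convert the human-readable wall times and RSS values of the reporter table into seconds and bytes. A minimal standalone sketch of those two conversions follows; the sample strings are hypothetical inputs chosen for illustration, not captured obikmer output.

```python
import re

# Binary multipliers, matching the table used by the scripts above.
_UNITS = {'GB': 1 << 30, 'MB': 1 << 20, 'KB': 1024, 'B': 1}


def parse_wall(s: str) -> float:
    """Convert '250ms' or '3.2s' to seconds; anything else yields 0.0."""
    s = s.strip()
    if s.endswith('ms'):      # test 'ms' first: it also ends with 's'
        return float(s[:-2]) / 1000.0
    if s.endswith('s'):
        return float(s[:-1])
    return 0.0


def parse_rss(s: str) -> int:
    """Convert '1.5 GB' style values to bytes; unparsable input yields 0."""
    m = re.match(r'([\d.]+)\s*(GB|MB|KB|B)', s.strip())
    if not m:
        return 0
    return int(float(m.group(1)) * _UNITS[m.group(2)])


# Hypothetical sample values.
print(parse_wall('250ms'))   # 0.25
print(parse_rss('1.5 GB'))   # 1610612736
```

Checking the `ms` suffix before the `s` suffix matters: with the order reversed, `250ms` would reach `float('250m')` and raise.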
+108
@@ -0,0 +1,108 @@
#!/usr/bin/env bash
# Usage: filter_one_presence.sh SPECIES
# Filters global_index_presence to keep only kmers specific to SPECIES,
# then selects the SPECIES column in-place.
# Outputs:
# specific_index_presence/SPECIES/index.done (written by obikmer select)
# stats/specific_kmer_presence/SPECIES.stats (one CSV data row, no header)
set -euo pipefail
SPECIES="$1"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BINARY="${SCRIPT_DIR}/../src/target/release/obikmer"
SOURCE="${SCRIPT_DIR}/global_index_presence"
OUTPUT="${SCRIPT_DIR}/specific_index_presence/${SPECIES}"
STATS_DIR="${SCRIPT_DIR}/stats/specific_kmer_presence"
STATS_FILE="${STATS_DIR}/${SPECIES}.stats"
mkdir -p "${STATS_DIR}"
echo "[${SPECIES}] filter (presence) → ${OUTPUT}"
LOG_FILTER=$(mktemp)
LOG_SELECT=$(mktemp)
trap 'rm -f "${LOG_FILTER}" "${LOG_SELECT}"' EXIT
"${BINARY}" filter \
--output "${OUTPUT}" \
--force \
--ingroup "species=${SPECIES}" \
--outgroup all \
--min-frac 0.5 \
--max-frac 1.0 \
--max-outgroup-count 0 \
"${SOURCE}" \
2>"${LOG_FILTER}"
cat "${LOG_FILTER}" >&2
"${BINARY}" select \
--in-place \
--group "${SPECIES}:species=${SPECIES}" \
--group-op "${SPECIES}:any" \
--select "${SPECIES}" \
"${OUTPUT}" \
2>"${LOG_SELECT}"
cat "${LOG_SELECT}" >&2
python3 - "${SPECIES}" "${LOG_FILTER}" "${LOG_SELECT}" <<'PYEOF' >"${STATS_FILE}"
import sys, re
species, log_filter, log_select = sys.argv[1], sys.argv[2], sys.argv[3]
def strip_ansi(s):
return re.sub(r'\x1b\[[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]', '', s)
def parse_wall(s):
s = s.strip()
if s.endswith('ms'): return float(s[:-2]) / 1000.0
if s.endswith('s'): return float(s[:-1])
return 0.0
def parse_rss(s):
m = re.match(r'([\d.]+)\s*(GB|MB|KB|B)', s.strip())
if not m: return 0
return int(float(m.group(1)) * {'GB': 1<<30, 'MB': 1<<20, 'KB': 1024, 'B': 1}[m.group(2)])
def is_sep(s):
return bool(s) and not re.search(r'[A-Za-z0-9]', s)
def parse_reporter(logfile):
stats = {}
state = 'scan'
with open(logfile, errors='replace') as fh:
for raw in fh:
line = strip_ansi(raw.rstrip('\n'))
s = line.strip()
if state == 'scan':
if re.search(r'\bstage\b.*\bwall\b', line):
state = 'in_header'
elif state == 'in_header':
if is_sep(s): state = 'rows'
elif state == 'rows':
if is_sep(s): state = 'total'
elif s:
parts = re.split(r' +', s)
if len(parts) >= 4:
stats[parts[0]] = (parse_wall(parts[1]), parse_rss(parts[3]))
elif state == 'total':
if s:
parts = re.split(r' +', s)
if len(parts) >= 3:
stats['TOTAL'] = (parse_wall(parts[1]),
parse_rss(parts[3]) if len(parts) > 3 else 0)
break
return stats
f = parse_reporter(log_filter)
s = parse_reporter(log_select)
row = [species]
for stage, d in [('rebuild', f), ('pack', f), ('filter_total', f), ('select', s), ('select_total', s)]:
key = 'TOTAL' if stage.endswith('_total') else stage
w, r = d.get(key, ('', ''))
row += [f'{w:.3f}' if isinstance(w, float) else '', str(r)]
print(','.join(row))
PYEOF
+103
@@ -0,0 +1,103 @@
#!/usr/bin/env bash
# Usage: index_one_count.sh SPECIMEN
# SPECIMEN = "species--strain" (Make pattern stem)
# Outputs:
# specimen_index_count/SPECIMEN/index.done (written by obikmer)
# stats/indexing_count/SPECIMEN.stats (one CSV data row, no header)
set -euo pipefail
SPECIMEN="$1"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BINARY="${SCRIPT_DIR}/../src/target/release/obikmer"
species="${SPECIMEN%%--*}"
strain="${SPECIMEN#*--}"
READS_DIR="${SCRIPT_DIR}/simulated_data/${species}/${strain}"
INDEX_PATH="${SCRIPT_DIR}/specimen_index_count/${SPECIMEN}"
STATS_DIR="${SCRIPT_DIR}/stats/indexing_count"
STATS_FILE="${STATS_DIR}/${SPECIMEN}.stats"
mkdir -p "${STATS_DIR}"
r1="${READS_DIR}/reads_R1.fastq.gz"
r2="${READS_DIR}/reads_R2.fastq.gz"
if [[ ! -f "${r1}" || ! -f "${r2}" ]]; then
echo "ERROR: reads not found in ${READS_DIR}" >&2
exit 1
fi
echo "[${SPECIMEN}] indexing (count) → ${INDEX_PATH}"
STDERR_LOG=$(mktemp)
trap 'rm -f "${STDERR_LOG}"' EXIT
"${BINARY}" index \
--output "${INDEX_PATH}" \
--force \
--theta 0 \
--with-counts \
--label "${SPECIMEN}" \
--meta "species=${species}" \
"${r1}" "${r2}" \
2>"${STDERR_LOG}"
cat "${STDERR_LOG}" >&2
python3 - "${species}" "${strain}" "${STDERR_LOG}" <<'PYEOF' >"${STATS_FILE}"
import sys, re
species, strain, logfile = sys.argv[1], sys.argv[2], sys.argv[3]
def strip_ansi(s):
return re.sub(r'\x1b\[[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]', '', s)
def parse_wall(s):
s = s.strip()
if s.endswith('ms'): return float(s[:-2]) / 1000.0
if s.endswith('s'): return float(s[:-1])
return 0.0
def parse_rss(s):
m = re.match(r'([\d.]+)\s*(GB|MB|KB|B)', s.strip())
if not m: return 0
return int(float(m.group(1)) * {'GB': 1<<30, 'MB': 1<<20, 'KB': 1024, 'B': 1}[m.group(2)])
def is_sep(s):
return bool(s) and not re.search(r'[A-Za-z0-9]', s)
stats = {}
state = 'scan'
with open(logfile, errors='replace') as fh:
for raw in fh:
line = strip_ansi(raw.rstrip('\n'))
s = line.strip()
if state == 'scan':
if re.search(r'\bstage\b.*\bwall\b', line):
state = 'in_header'
elif state == 'in_header':
if is_sep(s): state = 'rows'
elif state == 'rows':
if is_sep(s): state = 'total'
elif s:
parts = re.split(r' +', s)
if len(parts) >= 4:
stats[parts[0]] = (parse_wall(parts[1]), parse_rss(parts[3]))
elif state == 'total':
if s:
parts = re.split(r' +', s)
if len(parts) >= 3:
stats[parts[0]] = (parse_wall(parts[1]),
parse_rss(parts[3]) if len(parts) > 3 else 0)
break
STAGE_ORDER = ['scatter', 'dereplicate', 'count_kmer', 'index']
row = [species, strain]
for stage in STAGE_ORDER:
w, r = stats.get(stage, ('', ''))
row += [f'{w:.3f}' if isinstance(w, float) else '', str(r)]
tw, tr = stats.get('TOTAL', ('', ''))
row += [f'{tw:.3f}' if isinstance(tw, float) else '', str(tr)]
print(','.join(row))
PYEOF
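The script above splits the `species--strain` stem with the Bash expansions `${SPECIMEN%%--*}` and `${SPECIMEN#*--}`, which both cut at the first `--`. A rough Python equivalent is sketched below; `split_specimen` is a hypothetical helper name, and the two stems are taken from the dependency listing earlier in this diff.

```python
def split_specimen(specimen: str) -> tuple[str, str]:
    """Split 'species--strain' at the first '--'.

    Mirrors the Bash expansions: when no '--' is present, both expansions
    leave the string untouched, so both parts equal the input.
    """
    species, sep, strain = specimen.partition('--')
    if not sep:
        return specimen, specimen
    return species, strain


print(split_specimen('Yersinia_ruckeri--YRB'))
# ('Yersinia_ruckeri', 'YRB')
print(split_specimen('Candidozyma_auris--GCF_003013715.1_ASM301371v2'))
# ('Candidozyma_auris', 'GCF_003013715.1_ASM301371v2')
```

Because the split happens at the first `--`, a strain name that itself contains `--` stays intact in the second part.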
+102
@@ -0,0 +1,102 @@
#!/usr/bin/env bash
# Usage: index_one_presence.sh SPECIMEN
# SPECIMEN = "species--strain" (Make pattern stem)
# Outputs:
# specimen_index_presence/SPECIMEN/index.done (written by obikmer)
# stats/indexing_presence/SPECIMEN.stats (one CSV data row, no header)
set -euo pipefail
SPECIMEN="$1"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BINARY="${SCRIPT_DIR}/../src/target/release/obikmer"
species="${SPECIMEN%%--*}"
strain="${SPECIMEN#*--}"
READS_DIR="${SCRIPT_DIR}/simulated_data/${species}/${strain}"
INDEX_PATH="${SCRIPT_DIR}/specimen_index_presence/${SPECIMEN}"
STATS_DIR="${SCRIPT_DIR}/stats/indexing_presence"
STATS_FILE="${STATS_DIR}/${SPECIMEN}.stats"
mkdir -p "${STATS_DIR}"
r1="${READS_DIR}/reads_R1.fastq.gz"
r2="${READS_DIR}/reads_R2.fastq.gz"
if [[ ! -f "${r1}" || ! -f "${r2}" ]]; then
echo "ERROR: reads not found in ${READS_DIR}" >&2
exit 1
fi
echo "[${SPECIMEN}] indexing (presence) → ${INDEX_PATH}"
STDERR_LOG=$(mktemp)
trap 'rm -f "${STDERR_LOG}"' EXIT
"${BINARY}" index \
--output "${INDEX_PATH}" \
--force \
--theta 0 \
--label "${SPECIMEN}" \
--meta "species=${species}" \
"${r1}" "${r2}" \
2>"${STDERR_LOG}"
cat "${STDERR_LOG}" >&2
python3 - "${species}" "${strain}" "${STDERR_LOG}" <<'PYEOF' >"${STATS_FILE}"
import sys, re
species, strain, logfile = sys.argv[1], sys.argv[2], sys.argv[3]
def strip_ansi(s):
return re.sub(r'\x1b\[[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]', '', s)
def parse_wall(s):
s = s.strip()
if s.endswith('ms'): return float(s[:-2]) / 1000.0
if s.endswith('s'): return float(s[:-1])
return 0.0
def parse_rss(s):
m = re.match(r'([\d.]+)\s*(GB|MB|KB|B)', s.strip())
if not m: return 0
return int(float(m.group(1)) * {'GB': 1<<30, 'MB': 1<<20, 'KB': 1024, 'B': 1}[m.group(2)])
def is_sep(s):
return bool(s) and not re.search(r'[A-Za-z0-9]', s)
stats = {}
state = 'scan'
with open(logfile, errors='replace') as fh:
for raw in fh:
line = strip_ansi(raw.rstrip('\n'))
s = line.strip()
if state == 'scan':
if re.search(r'\bstage\b.*\bwall\b', line):
state = 'in_header'
elif state == 'in_header':
if is_sep(s): state = 'rows'
elif state == 'rows':
if is_sep(s): state = 'total'
elif s:
parts = re.split(r' +', s)
if len(parts) >= 4:
stats[parts[0]] = (parse_wall(parts[1]), parse_rss(parts[3]))
elif state == 'total':
if s:
parts = re.split(r' +', s)
if len(parts) >= 3:
stats[parts[0]] = (parse_wall(parts[1]),
parse_rss(parts[3]) if len(parts) > 3 else 0)
break
STAGE_ORDER = ['scatter', 'dereplicate', 'count_kmer', 'index']
row = [species, strain]
for stage in STAGE_ORDER:
w, r = stats.get(stage, ('', ''))
row += [f'{w:.3f}' if isinstance(w, float) else '', str(r)]
tw, tr = stats.get('TOTAL', ('', ''))
row += [f'{tw:.3f}' if isinstance(tw, float) else '', str(tr)]
print(','.join(row))
PYEOF
+118
@@ -0,0 +1,118 @@
#!/usr/bin/env python3
"""Generate deps.mk — pure dependency declarations for the benchmark pipeline.
Like C .d files: only target: prerequisites lines, no recipes.
Recipes stay in the Makefile as generic rules.
"""
import gzip
import re
import sys
from pathlib import Path
STOP_WORDS = {'complete', 'chromosome', 'whole', 'sequence', 'genome',
'endosymbiont', 'of'}
STOP_PREFIXES = ('scaffold', 'contig', 'plasmid')
def is_stop(tok):
    t = tok.lower()
    return t in STOP_WORDS or any(t.startswith(p) for p in STOP_PREFIXES)
def sanitize(s):
    return re.sub(r'[^A-Za-z0-9._-]', '_', s).strip('_')
def collect_tokens(text):
    parts = []
    for tok in text.split():
        tok = tok.rstrip(',.')
        if is_stop(tok):
            break
        parts.append(sanitize(tok))
    return '_'.join(filter(None, parts))
def parse_organism(defn, gcf_id):
    words = defn.split()
    # A definition with fewer than two tokens (e.g. the bare sequence ID that
    # first_definition falls back to) has no binomial to extract; use the
    # accession as the strain instead of raising IndexError.
    if len(words) < 2:
        return (sanitize(defn) or gcf_id), gcf_id
    species = sanitize(words[0] + '_' + words[1])
    m = re.search(r'\bstr\.\s+(\S+)(?:\s+substr\.\s+(\S+))?', defn)
    if m:
        strain = sanitize(m.group(1))
        if m.group(2):
            strain += '_' + sanitize(m.group(2))
        return species, strain
    m = re.search(r'\bstrain\b\s+(.*)', defn)
    if m:
        strain = collect_tokens(m.group(1))
        if strain:
            return species, strain
    remainder = re.sub(r'^\S+ \S+\s*', '', defn)
    remainder = re.sub(r'^subsp\.\s+\S+\s*', '', remainder)
    remainder = re.sub(r'^serovar\s+\S+\s*', '', remainder)
    strain = collect_tokens(remainder)
    return species, strain if strain else gcf_id
def first_definition(path):
    with gzip.open(path, 'rt') as fh:
        for line in fh:
            if line.startswith('>'):
                m = re.search(r'"definition":"([^"]*)"', line)
                return m.group(1) if m else line[1:].split()[0]
    return Path(path).stem
def main():
    entries = []  # (specimen, species, sim_dir, genome_path)
    species_seen = []
    for path in sorted(sys.argv[1:]):
        gcf_id = Path(path).name.replace('_genomic.fna.gz', '')
        defn = first_definition(path)
        sp, st = parse_organism(defn, gcf_id)
        specimen = f'{sp}--{st}'
        sim_dir = f'simulated_data/{sp}/{st}'
        entries.append((specimen, sp, sim_dir, path))
        if sp not in species_seen:
            species_seen.append(sp)
    specimens = [e[0] for e in entries]
    print('SPECIMENS :=', ' '.join(specimens))
    print('SPECIES :=', ' '.join(species_seen))
    for specimen, species, sim_dir, genome in entries:
        reads = f'{sim_dir}/reads_R1.fastq.gz'
        p_done = f'specimen_index_presence/{specimen}/index.done'
        p_stats = f'stats/indexing_presence/{specimen}.stats'
        c_done = f'specimen_index_count/{specimen}/index.done'
        c_stats = f'stats/indexing_count/{specimen}.stats'
        ref = f'reference_index/{specimen}.npz'
        vp = f'stats/verify_presence/{specimen}.stats'
        vc = f'stats/verify_count/{specimen}.stats'
        print()
        print(f'# {specimen}')
        print(f'{reads}: {genome}')
        print(f'{ref}: {reads}')
        print(f'{p_done} {p_stats}: {reads}')
        print(f'{c_done} {c_stats}: {reads}')
        print(f'{vp}: {ref} {p_done}')
        print(f'{vc}: {ref} {c_done}')
    print()
    for sp in species_seen:
        sp_done = f'specific_index_presence/{sp}/index.done'
        sp_stats = f'stats/specific_kmer_presence/{sp}.stats'
        sc_done = f'specific_index_count/{sp}/index.done'
        sc_stats = f'stats/specific_kmer_count/{sp}.stats'
        print(f'# {sp}')
        print(f'{sp_done} {sp_stats}: global_index_presence/index.done')
        print(f'{sc_done} {sc_stats}: global_index_count/index.done')
if __name__ == '__main__':
    main()
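The first branch of `parse_organism` above derives the strain from a `str.` tag with an optional `substr.` tag. The sketch below isolates that step so its behaviour can be checked on its own; `strain_from_str_tag` is a hypothetical helper name and the definition strings are illustrative, not read from the downloaded genomes.

```python
import re

# Same pattern as in parse_organism: 'str. X' optionally followed by 'substr. Y'.
STRAIN_RE = re.compile(r'\bstr\.\s+(\S+)(?:\s+substr\.\s+(\S+))?')


def sanitize(s):
    """Replace characters unsafe in Make targets, then trim underscores."""
    return re.sub(r'[^A-Za-z0-9._-]', '_', s).strip('_')


def strain_from_str_tag(defn):
    """Return 'X' or 'X_Y' from a 'str. X [substr. Y]' tag, else None."""
    m = STRAIN_RE.search(defn)
    if not m:
        return None
    strain = sanitize(m.group(1))
    if m.group(2):
        strain += '_' + sanitize(m.group(2))
    return strain


# Hypothetical definition line; the trailing comma is captured by \S+ and
# then removed by sanitize().
print(strain_from_str_tag(
    'Escherichia coli str. K-12 substr. MG1655, complete genome'))
# K-12_MG1655
```

The leading `\b` keeps the pattern from matching inside `substr.`, so a definition carrying only a `substr.` tag falls through to the later branches.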
+103
@@ -0,0 +1,103 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BINARY="${SCRIPT_DIR}/../src/target/release/obikmer"
IDX_DIR="${SCRIPT_DIR}/specimen_index_count"
OUTPUT="${SCRIPT_DIR}/global_index_count"
STATS_DIR="${SCRIPT_DIR}/stats/merge_count"
mkdir -p "${STATS_DIR}"
run_n=$(printf '%03d' "$(find "${STATS_DIR}" -maxdepth 1 -name 'run_*.csv' | wc -l | tr -d ' ')")
CSV="${STATS_DIR}/run_${run_n}.csv"
printf 'run,n_sources,bootstrap_wall_s,bootstrap_rss_b,spectrums_wall_s,spectrums_rss_b,merge_partitions_wall_s,merge_partitions_rss_b,pack_wall_s,pack_rss_b,total_wall_s,total_rss_b\n' >"${CSV}"
parse_reporter() {
local run="$1" n_sources="$2" logfile="$3"
python3 - "$run" "$n_sources" "$logfile" <<'PYEOF'
import sys, re
run, n_sources, logfile = sys.argv[1], sys.argv[2], sys.argv[3]
def strip_ansi(s):
return re.sub(r'\x1b\[[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]', '', s)
def parse_wall(s):
s = s.strip()
if s.endswith('ms'): return float(s[:-2]) / 1000.0
if s.endswith('s'): return float(s[:-1])
return 0.0
def parse_rss(s):
m = re.match(r'([\d.]+)\s*(GB|MB|KB|B)', s.strip())
if not m: return 0
return int(float(m.group(1)) * {'GB': 1<<30, 'MB': 1<<20, 'KB': 1024, 'B': 1}[m.group(2)])
def is_sep(s):
return bool(s) and not re.search(r'[A-Za-z0-9]', s)
stats = {}
state = 'scan'
with open(logfile, errors='replace') as fh:
for raw in fh:
line = strip_ansi(raw.rstrip('\n'))
s = line.strip()
if state == 'scan':
if re.search(r'\bstage\b.*\bwall\b', line):
state = 'in_header'
elif state == 'in_header':
if is_sep(s):
state = 'rows'
elif state == 'rows':
if is_sep(s):
state = 'total'
elif s:
parts = re.split(r' +', s)
if len(parts) >= 4:
stats[parts[0]] = (parse_wall(parts[1]), parse_rss(parts[3]))
elif state == 'total':
if s:
parts = re.split(r' +', s)
if len(parts) >= 3:
stats[parts[0]] = (parse_wall(parts[1]),
parse_rss(parts[3]) if len(parts) > 3 else 0)
break
STAGE_ORDER = ['bootstrap', 'spectrums', 'merge_partitions', 'pack']
row = [run, n_sources]
for stage in STAGE_ORDER:
w, r = stats.get(stage, ('', ''))
row += [f'{w:.3f}' if isinstance(w, float) else '', str(r)]
tw, tr = stats.get('TOTAL', ('', ''))
row += [f'{tw:.3f}' if isinstance(tw, float) else '', str(tr)]
print(','.join(row))
PYEOF
}
mapfile -t sources < <(find "${IDX_DIR}" -mindepth 1 -maxdepth 1 -type d | sort)
if [[ ${#sources[@]} -eq 0 ]]; then
echo "ERROR: no indexes found in ${IDX_DIR}" >&2
exit 1
fi
echo "Merging ${#sources[@]} count indexes → ${OUTPUT}"
printf ' %s\n' "${sources[@]}"
STDERR_LOG=$(mktemp)
trap 'rm -f "${STDERR_LOG}"' EXIT
"${BINARY}" merge \
--output "${OUTPUT}" \
--force \
"${sources[@]}" \
2>"${STDERR_LOG}"
cat "${STDERR_LOG}" >&2
parse_reporter "${run_n}" "${#sources[@]}" "${STDERR_LOG}" >>"${CSV}"
echo "Done. Run ${run_n} → ${CSV}"
+104
@@ -0,0 +1,104 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BINARY="${SCRIPT_DIR}/../src/target/release/obikmer"
IDX_DIR="${SCRIPT_DIR}/specimen_index_presence"
OUTPUT="${SCRIPT_DIR}/global_index_presence"
STATS_DIR="${SCRIPT_DIR}/stats/merge_presence"
mkdir -p "${STATS_DIR}"
run_n=$(printf '%03d' "$(find "${STATS_DIR}" -maxdepth 1 -name 'run_*.csv' | wc -l | tr -d ' ')")
CSV="${STATS_DIR}/run_${run_n}.csv"
printf 'run,n_sources,bootstrap_wall_s,bootstrap_rss_b,spectrums_wall_s,spectrums_rss_b,merge_partitions_wall_s,merge_partitions_rss_b,pack_wall_s,pack_rss_b,total_wall_s,total_rss_b\n' >"${CSV}"
parse_reporter() {
local run="$1" n_sources="$2" logfile="$3"
python3 - "$run" "$n_sources" "$logfile" <<'PYEOF'
import sys, re
run, n_sources, logfile = sys.argv[1], sys.argv[2], sys.argv[3]
def strip_ansi(s):
return re.sub(r'\x1b\[[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]', '', s)
def parse_wall(s):
s = s.strip()
if s.endswith('ms'): return float(s[:-2]) / 1000.0
if s.endswith('s'): return float(s[:-1])
return 0.0
def parse_rss(s):
m = re.match(r'([\d.]+)\s*(GB|MB|KB|B)', s.strip())
if not m: return 0
return int(float(m.group(1)) * {'GB': 1<<30, 'MB': 1<<20, 'KB': 1024, 'B': 1}[m.group(2)])
def is_sep(s):
return bool(s) and not re.search(r'[A-Za-z0-9]', s)
stats = {}
state = 'scan'
with open(logfile, errors='replace') as fh:
for raw in fh:
line = strip_ansi(raw.rstrip('\n'))
s = line.strip()
if state == 'scan':
if re.search(r'\bstage\b.*\bwall\b', line):
state = 'in_header'
elif state == 'in_header':
if is_sep(s):
state = 'rows'
elif state == 'rows':
if is_sep(s):
state = 'total'
elif s:
parts = re.split(r' +', s)
if len(parts) >= 4:
stats[parts[0]] = (parse_wall(parts[1]), parse_rss(parts[3]))
elif state == 'total':
if s:
parts = re.split(r' +', s)
if len(parts) >= 3:
stats[parts[0]] = (parse_wall(parts[1]),
parse_rss(parts[3]) if len(parts) > 3 else 0)
break
STAGE_ORDER = ['bootstrap', 'spectrums', 'merge_partitions', 'pack']
row = [run, n_sources]
for stage in STAGE_ORDER:
w, r = stats.get(stage, ('', ''))
row += [f'{w:.3f}' if isinstance(w, float) else '', str(r)]
tw, tr = stats.get('TOTAL', ('', ''))
row += [f'{tw:.3f}' if isinstance(tw, float) else '', str(tr)]
print(','.join(row))
PYEOF
}
mapfile -t sources < <(find "${IDX_DIR}" -mindepth 1 -maxdepth 1 -type d | sort)
if [[ ${#sources[@]} -eq 0 ]]; then
echo "ERROR: no indexes found in ${IDX_DIR}" >&2
exit 1
fi
echo "Merging ${#sources[@]} presence indexes → ${OUTPUT}"
printf ' %s\n' "${sources[@]}"
STDERR_LOG=$(mktemp)
trap 'rm -f "${STDERR_LOG}"' EXIT
"${BINARY}" merge \
--output "${OUTPUT}" \
--force \
--force-presence \
"${sources[@]}" \
2>"${STDERR_LOG}"
cat "${STDERR_LOG}" >&2
parse_reporter "${run_n}" "${#sources[@]}" "${STDERR_LOG}" >>"${CSV}"
echo "Done. Run ${run_n} → ${CSV}"
+12
@@ -0,0 +1,12 @@
#!/usr/bin/env bash
# Simulate all genomes. Delegates to simulate_one.sh per genome.
# Prefer running via `gmake simulate` which handles individual dependencies.
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
for genome_file in "${SCRIPT_DIR}"/genomes/*.fna.gz; do
out_dir=$("${SCRIPT_DIR}/../.venv/bin/python3" "${SCRIPT_DIR}/make_deps.py" \
--dir-for "${genome_file}")
bash "${SCRIPT_DIR}/simulate_one.sh" "${genome_file}" "${out_dir}"
done
+33
@@ -0,0 +1,33 @@
#!/usr/bin/env bash
# Usage: simulate_one.sh genome.fna.gz output_dir
# Simulates paired-end HiSeq reads for a single genome.
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ISS="${SCRIPT_DIR}/../.venv/bin/iss"
COVERAGE=15
READ_LENGTH=150
CPUS="${CPUS:-$(sysctl -n hw.logicalcpu 2>/dev/null || nproc 2>/dev/null || echo 2)}"
genome_file="$1"
out_dir="$2"
mkdir -p "${out_dir}"
tmp_fasta=$(mktemp "${TMPDIR:-/tmp}/obikmer_XXXXXX.fna")
trap 'rm -f "${tmp_fasta}"' EXIT
gzip -dc "${genome_file}" > "${tmp_fasta}"
genome_size=$(grep -v "^>" "${tmp_fasta}" | tr -d '[:space:]' | wc -c | tr -d ' ')
n_reads=$(python3 -c "import math; print(math.ceil(${COVERAGE} * ${genome_size} / (2 * ${READ_LENGTH})))")
echo "[${out_dir}] genome=${genome_size} bp → ${n_reads} read pairs (${COVERAGE}x HiSeq)"
"${ISS}" generate \
--genomes "${tmp_fasta}" \
--model HiSeq \
--n_reads "${n_reads}" \
--cpus "${CPUS}" \
--compress \
--output "${out_dir}/reads"
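The read count passed to `--n_reads` above comes from the usual coverage relation: coverage × genome size divided by the bases contributed per pair (2 × read length), rounded up. A small sketch of that arithmetic, mirroring the inline `python3 -c` expression; `read_pairs` is a hypothetical helper name and the genome sizes are made-up round numbers.

```python
import math


def read_pairs(genome_size: int, coverage: int = 15, read_length: int = 150) -> int:
    """Value computed by the script for a given genome size.

    Defaults mirror COVERAGE and READ_LENGTH in simulate_one.sh.
    """
    return math.ceil(coverage * genome_size / (2 * read_length))


# Hypothetical genome sizes, for illustration only.
print(read_pairs(4_600_000))   # 230000
print(read_pairs(5_000_001))   # 250001 (rounded up from 250000.05)
```

Rounding up means the simulated depth is never below the requested coverage, at the cost of at most one extra pair.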
+181
@@ -0,0 +1,181 @@
#!/usr/bin/env python3
"""Compare an obikmer count index against a reference kmer set (presence + counts).
Loads the reference .npz (sorted uint64 kmers + uint32 counts from build_reference.py),
streams `obikmer dump` from a --with-counts index, then reports:
- false negatives : kmers in reference absent from the index
- false positives : kmers in the index absent from the reference
- count mismatches: kmers present in both but with differing counts
Output to stdout: one CSV row
species,strain,ref_kmers,idx_kmers,false_neg,false_pos,count_mismatch,
fn_pct,fp_pct,cm_pct
"""
import argparse
import subprocess
import sys
import numpy as np
# ── encoding ──────────────────────────────────────────────────────────────────
_ENCODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3,
'a': 0, 'c': 1, 'g': 2, 't': 3}
_DECODE = ['A', 'C', 'G', 'T']
def encode_kmer(s: str) -> int:
kmer = 0
for c in s:
kmer = (kmer << 2) | _ENCODE[c]
return kmer
def decode_kmer(val: int, k: int) -> str:
bases = []
for _ in range(k):
bases.append(_DECODE[val & 3])
val >>= 2
return ''.join(reversed(bases))
# ── dump parsing ──────────────────────────────────────────────────────────────
def load_index(obikmer_bin: str, index_dir: str) -> tuple[np.ndarray, np.ndarray]:
"""Stream `obikmer dump` and return (kmers_sorted_uint64, counts_uint32)."""
cmd = [obikmer_bin, 'dump', index_dir]
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
text=True)
kmers, counts = [], []
header = True
for line in proc.stdout:
if header:
header = False
continue
parts = line.rstrip('\n').split(',')
kmers.append(encode_kmer(parts[0]))
counts.append(int(parts[1]))
proc.wait()
if proc.returncode != 0:
print(f'ERROR: obikmer dump exited {proc.returncode}', file=sys.stderr)
sys.exit(1)
order = np.argsort(np.array(kmers, dtype=np.uint64), kind='stable')
return (np.array(kmers, dtype=np.uint64)[order],
np.array(counts, dtype=np.uint32)[order])
# ── comparison ────────────────────────────────────────────────────────────────
def compare(ref_kmers: np.ndarray, ref_counts: np.ndarray,
            idx_kmers: np.ndarray, idx_counts: np.ndarray,
            ) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Return (false_neg, false_pos, cm_ref_kmers, cm_ref_counts, cm_idx_counts).
    All arrays sorted; cm_* cover kmers present in both arrays but with
    differing counts.
    """
    false_neg = np.setdiff1d(ref_kmers, idx_kmers, assume_unique=True)
    false_pos = np.setdiff1d(idx_kmers, ref_kmers, assume_unique=True)
    # An empty index shares no kmers with the reference. Return early, since
    # the clipped lookup below would index into an empty array.
    if len(idx_kmers) == 0:
        return false_neg, false_pos, ref_kmers[:0], ref_counts[:0], idx_counts[:0]
    # Count mismatches among shared kmers.
    # Both arrays are sorted so we can use searchsorted.
    pos_in_idx = np.searchsorted(idx_kmers, ref_kmers)
    pos_in_idx = np.clip(pos_in_idx, 0, len(idx_kmers) - 1)
    shared_mask = idx_kmers[pos_in_idx] == ref_kmers
    shared_ref_counts = ref_counts[shared_mask]
    shared_idx_counts = idx_counts[pos_in_idx[shared_mask]]
    mismatch_mask = shared_ref_counts != shared_idx_counts
    cm_kmers = ref_kmers[shared_mask][mismatch_mask]
    cm_ref_counts = shared_ref_counts[mismatch_mask]
    cm_idx_counts = shared_idx_counts[mismatch_mask]
    return false_neg, false_pos, cm_kmers, cm_ref_counts, cm_idx_counts
# ── main ─────────────────────────────────────────────────────────────────────
def main() -> None:
ap = argparse.ArgumentParser(description=__doc__,
formatter_class=argparse.RawDescriptionHelpFormatter)
ap.add_argument('reference', metavar='REF_NPZ', nargs='?',
help='Reference .npz file')
ap.add_argument('index', metavar='INDEX_DIR', nargs='?',
help='obikmer index directory (built with --with-counts)')
ap.add_argument('--obikmer', default='obikmer',
help='Path to obikmer binary')
ap.add_argument('--species', default='')
ap.add_argument('--strain', default='')
ap.add_argument('--header', action='store_true',
help='Print CSV header and exit')
ap.add_argument('--save-fp', metavar='FILE',
help='Save false-positive kmer strings to FILE')
ap.add_argument('--save-fn', metavar='FILE',
help='Save false-negative kmer strings to FILE')
ap.add_argument('--save-cm', metavar='FILE',
help='Save count-mismatch rows (kmer,ref_count,idx_count) to FILE')
args = ap.parse_args()
    if args.header:
        print('species,strain,ref_kmers,idx_kmers,'
              'false_neg,false_pos,count_mismatch,'
              'fn_pct,fp_pct,cm_pct')
        return

    # Detect k
    cmd1 = [args.obikmer, 'dump', '--head', '1', args.index]
    out1 = subprocess.check_output(cmd1, stderr=subprocess.DEVNULL, text=True)
    k = len(out1.splitlines()[1].split(',')[0])

    # Load reference
    print(f'Loading reference: {args.reference}', file=sys.stderr)
    npz = np.load(args.reference)
    ref_kmers = npz['kmers']    # sorted uint64
    ref_counts = npz['counts']  # uint32

    # Load index
    print(f'Streaming dump (k={k}): {args.index}', file=sys.stderr)
    idx_kmers, idx_counts = load_index(args.obikmer, args.index)
    print(f'k={k} ref={len(ref_kmers):,} idx={len(idx_kmers):,}', file=sys.stderr)

    false_neg, false_pos, cm_kmers, cm_ref, cm_idx = compare(
        ref_kmers, ref_counts, idx_kmers, idx_counts)
    n_shared = len(ref_kmers) - len(false_neg)
    fn_pct = 100.0 * len(false_neg) / len(ref_kmers) if len(ref_kmers) else 0.0
    fp_pct = 100.0 * len(false_pos) / len(idx_kmers) if len(idx_kmers) else 0.0
    cm_pct = 100.0 * len(cm_kmers) / n_shared if n_shared else 0.0
    print(f'false negatives : {len(false_neg):,} ({fn_pct:.4f}%)', file=sys.stderr)
    print(f'false positives : {len(false_pos):,} ({fp_pct:.4f}%)', file=sys.stderr)
    print(f'count mismatches: {len(cm_kmers):,} ({cm_pct:.4f}% of shared)',
          file=sys.stderr)

    if args.save_fn and len(false_neg):
        with open(args.save_fn, 'w') as fh:
            for v in false_neg:
                fh.write(decode_kmer(int(v), k) + '\n')
    if args.save_fp and len(false_pos):
        with open(args.save_fp, 'w') as fh:
            for v in false_pos:
                fh.write(decode_kmer(int(v), k) + '\n')
    if args.save_cm and len(cm_kmers):
        with open(args.save_cm, 'w') as fh:
            fh.write('kmer,ref_count,idx_count\n')
            for v, rc, ic in zip(cm_kmers, cm_ref, cm_idx):
                fh.write(f'{decode_kmer(int(v), k)},{rc},{ic}\n')

    print(f'{args.species},{args.strain},'
          f'{len(ref_kmers)},{len(idx_kmers)},'
          f'{len(false_neg)},{len(false_pos)},{len(cm_kmers)},'
          f'{fn_pct:.4f},{fp_pct:.4f},{cm_pct:.4f}')


if __name__ == '__main__':
    main()
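Reviewer note: the k-detection step above takes the length of the first field on the second line of `obikmer dump --head 1`, which raises an `IndexError` if the dump contains only a header. A minimal sketch of a guarded variant follows. The helper name `detect_k` and the sample column name `count` are assumptions for illustration; only the leading `kmer` column is confirmed by the scripts in this diff.

```python
def detect_k(dump_text: str) -> int:
    """Infer k from the first data row of a CSV dump (header + rows)."""
    lines = dump_text.splitlines()
    if len(lines) < 2:
        # Header only (or nothing at all): k cannot be inferred.
        raise ValueError('dump has no data row; cannot infer k')
    # The first CSV field of a data row is the kmer string itself.
    return len(lines[1].split(',')[0])


# Hypothetical two-line dump: a header, then one data row.
sample = 'kmer,count\nACGTACGTACG,3\n'
print(detect_k(sample))  # 11
```

Raising a `ValueError` here is one possible choice; the scripts could equally report the empty index on stderr and exit.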
+201
@@ -0,0 +1,201 @@
#!/usr/bin/env python3
"""Verify the merged count index against all per-specimen reference sets.
Streams `obikmer dump` once on the merged index, accumulates per-specimen
kmer+count pairs from each column, then compares each against its reference .npz.
Output to stdout: one CSV row per specimen (same columns as verify_count.py)
species,strain,ref_kmers,idx_kmers,false_neg,false_pos,count_mismatch,
fn_pct,fp_pct,cm_pct
"""
import argparse
import subprocess
import sys
from pathlib import Path
import numpy as np
# ── encoding ──────────────────────────────────────────────────────────────────
_ENCODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3,
           'a': 0, 'c': 1, 'g': 2, 't': 3}
_DECODE = ['A', 'C', 'G', 'T']


def encode_kmer(s: str) -> int:
    # Pack each base into 2 bits, first base in the most significant position.
    kmer = 0
    for c in s:
        kmer = (kmer << 2) | _ENCODE[c]
    return kmer


def decode_kmer(val: int, k: int) -> str:
    # Unpack k bases from the low bits upward, then restore reading order.
    bases = []
    for _ in range(k):
        bases.append(_DECODE[val & 3])
        val >>= 2
    return ''.join(reversed(bases))

# ── single-pass dump ──────────────────────────────────────────────────────────
def stream_merged_dump(obikmer_bin: str, index_dir: str,
                       ) -> tuple[list[str], dict[str, tuple[list[int], list[int]]]]:
    """Stream the merged dump once.

    Returns:
        specimen_names : column labels in dump order
        per_specimen   : mapping label → (kmer_ints, counts) for entries > 0
    """
    cmd = [obikmer_bin, 'dump', index_dir]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
                            text=True)
    header_line = proc.stdout.readline().rstrip('\n')
    cols = header_line.split(',')
    specimen_names = cols[1:]
    per_specimen: dict[str, tuple[list[int], list[int]]] = {
        name: ([], []) for name in specimen_names}
    for line in proc.stdout:
        parts = line.rstrip('\n').split(',')
        kmer_int = encode_kmer(parts[0])
        for i, name in enumerate(specimen_names):
            count = int(parts[i + 1])
            if count > 0:
                per_specimen[name][0].append(kmer_int)
                per_specimen[name][1].append(count)
    proc.wait()
    if proc.returncode != 0:
        print(f'ERROR: obikmer dump exited {proc.returncode}', file=sys.stderr)
        sys.exit(1)
    return specimen_names, per_specimen

# ── per-specimen comparison ───────────────────────────────────────────────────
def compare_specimen(name: str,
                     kmer_list: list[int],
                     count_list: list[int],
                     ref_dir: Path,
                     k: int,
                     save_fn: Path | None,
                     save_fp: Path | None,
                     save_cm: Path | None,
                     ) -> str:
    """Compare one specimen column against its reference .npz.

    Returns a CSV row string, or '' when no reference file exists.
    """
    ref_path = ref_dir / f'{name}.npz'
    if not ref_path.exists():
        print(f' SKIP {name}: no reference at {ref_path}', file=sys.stderr)
        return ''
    species = name.split('--')[0]
    strain = name[len(species) + 2:]
    npz = np.load(ref_path)
    ref_kmers = npz['kmers']    # sorted uint64
    ref_counts = npz['counts']  # uint32

    # Sort kmers and counts together; build the kmer array only once.
    idx_arr = np.array(kmer_list, dtype=np.uint64)
    order = np.argsort(idx_arr, kind='stable')
    idx_kmers = idx_arr[order]
    idx_counts = np.array(count_list, dtype=np.uint32)[order]

    false_neg = np.setdiff1d(ref_kmers, idx_kmers, assume_unique=True)
    false_pos = np.setdiff1d(idx_kmers, ref_kmers, assume_unique=True)

    # Count mismatches among shared kmers
    if len(idx_kmers):
        pos_in_idx = np.searchsorted(idx_kmers, ref_kmers)
        pos_in_idx = np.clip(pos_in_idx, 0, len(idx_kmers) - 1)
        shared_mask = idx_kmers[pos_in_idx] == ref_kmers
    else:
        # Empty index column: nothing is shared, and indexing an empty
        # array with the clipped positions would raise IndexError.
        pos_in_idx = np.zeros(len(ref_kmers), dtype=np.intp)
        shared_mask = np.zeros(len(ref_kmers), dtype=bool)
    shared_pos = pos_in_idx[shared_mask]
    mismatch_mask = ref_counts[shared_mask] != idx_counts[shared_pos]
    cm_kmers = ref_kmers[shared_mask][mismatch_mask]
    cm_ref = ref_counts[shared_mask][mismatch_mask]
    cm_idx = idx_counts[shared_pos][mismatch_mask]
    n_shared = int(shared_mask.sum())

    fn_pct = 100.0 * len(false_neg) / len(ref_kmers) if len(ref_kmers) else 0.0
    fp_pct = 100.0 * len(false_pos) / len(idx_kmers) if len(idx_kmers) else 0.0
    cm_pct = 100.0 * len(cm_kmers) / n_shared if n_shared else 0.0
    print(f' {name}: ref={len(ref_kmers):,} idx={len(idx_kmers):,} '
          f'fn={len(false_neg):,} ({fn_pct:.4f}%) '
          f'fp={len(false_pos):,} ({fp_pct:.4f}%) '
          f'cm={len(cm_kmers):,} ({cm_pct:.4f}%)',
          file=sys.stderr)

    if save_fn and len(false_neg):
        fn_file = save_fn / f'{name}_fn.txt'
        fn_file.write_text('\n'.join(decode_kmer(int(v), k) for v in false_neg) + '\n')
    if save_fp and len(false_pos):
        fp_file = save_fp / f'{name}_fp.txt'
        fp_file.write_text('\n'.join(decode_kmer(int(v), k) for v in false_pos) + '\n')
    if save_cm and len(cm_kmers):
        cm_file = save_cm / f'{name}_cm.csv'
        lines = ['kmer,ref_count,idx_count']
        for v, rc, ic in zip(cm_kmers, cm_ref, cm_idx):
            lines.append(f'{decode_kmer(int(v), k)},{rc},{ic}')
        cm_file.write_text('\n'.join(lines) + '\n')

    return (f'{species},{strain},'
            f'{len(ref_kmers)},{len(idx_kmers)},'
            f'{len(false_neg)},{len(false_pos)},{len(cm_kmers)},'
            f'{fn_pct:.4f},{fp_pct:.4f},{cm_pct:.4f}')

# ── main ─────────────────────────────────────────────────────────────────────
def main() -> None:
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument('index', metavar='INDEX_DIR', nargs='?',
                    help='Merged count index directory')
    ap.add_argument('ref_dir', metavar='REF_DIR', nargs='?',
                    help='Directory containing per-specimen .npz reference files')
    ap.add_argument('--obikmer', default='obikmer')
    ap.add_argument('--header', action='store_true',
                    help='Print CSV header and exit')
    ap.add_argument('--save-fn', metavar='DIR',
                    help='Directory for false-negative kmer lists')
    ap.add_argument('--save-fp', metavar='DIR',
                    help='Directory for false-positive kmer lists')
    ap.add_argument('--save-cm', metavar='DIR',
                    help='Directory for count-mismatch CSV files')
    args = ap.parse_args()

    if args.header:
        print('species,strain,ref_kmers,idx_kmers,'
              'false_neg,false_pos,count_mismatch,'
              'fn_pct,fp_pct,cm_pct')
        return

    # The positionals are optional only so that --header works on its own;
    # without this check a missing argument surfaces as a TypeError in Path().
    if args.index is None or args.ref_dir is None:
        ap.error('INDEX_DIR and REF_DIR are required unless --header is given')

    ref_dir = Path(args.ref_dir)
    save_fn = Path(args.save_fn) if args.save_fn else None
    save_fp = Path(args.save_fp) if args.save_fp else None
    save_cm = Path(args.save_cm) if args.save_cm else None
    for d in (save_fn, save_fp, save_cm):
        if d:
            d.mkdir(parents=True, exist_ok=True)

    # Detect k from the first data row (one cheap call before the full dump).
    out1 = subprocess.check_output(
        [args.obikmer, 'dump', '--head', '1', args.index],
        stderr=subprocess.DEVNULL, text=True)
    k = len(out1.splitlines()[1].split(',')[0])

    print(f'k={k} streaming merged dump: {args.index}', file=sys.stderr)
    specimen_names, per_specimen = stream_merged_dump(args.obikmer, args.index)
    print(f'{len(specimen_names)} specimen columns loaded', file=sys.stderr)

    for name in specimen_names:
        kmers, counts = per_specimen[name]
        row = compare_specimen(name, kmers, counts, ref_dir, k,
                               save_fn, save_fp, save_cm)
        if row:
            print(row)


if __name__ == '__main__':
    main()
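Reviewer note: the count-mismatch logic in `compare_specimen` looks up every reference kmer in the sorted index array and compares counts only where the kmer is shared. The sketch below illustrates the same idea with the standard library, using `bisect_left` as a stand-in for `np.searchsorted`. It is not the script's implementation; the function name `count_mismatches` and the toy data are assumptions made for illustration.

```python
from bisect import bisect_left


def count_mismatches(ref, idx):
    """Compare counts of shared kmers.

    ref, idx: lists of (kmer_int, count), sorted by kmer_int, kmers unique.
    Returns (n_shared, [(kmer_int, ref_count, idx_count), ...]).
    """
    idx_kmers = [kmer for kmer, _ in idx]
    mismatches = []
    n_shared = 0
    for kmer, ref_count in ref:
        pos = bisect_left(idx_kmers, kmer)
        # A kmer is shared only if the insertion point holds the same value;
        # the bounds check also covers an empty index.
        if pos < len(idx_kmers) and idx_kmers[pos] == kmer:
            n_shared += 1
            idx_count = idx[pos][1]
            if idx_count != ref_count:
                mismatches.append((kmer, ref_count, idx_count))
    return n_shared, mismatches


ref = [(1, 2), (5, 1), (9, 4)]
idx = [(1, 2), (5, 3), (7, 1)]
print(count_mismatches(ref, idx))  # (2, [(5, 1, 3)])
print(count_mismatches(ref, []))   # (0, [])
```

The explicit bounds check plays the role of the `np.clip` call in the vectorised version, and it makes the empty-index case a natural no-op.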
+27
@@ -0,0 +1,27 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BINARY="${SCRIPT_DIR}/../src/target/release/obikmer"
INDEX="${SCRIPT_DIR}/global_index_count"
REF_DIR="${SCRIPT_DIR}/reference_index"
STATS_DIR="${SCRIPT_DIR}/stats/verify_merge_count"
PYTHON="${SCRIPT_DIR}/../.venv/bin/python3"
VERIFY_PY="${SCRIPT_DIR}/verify_merge_count.py"
mkdir -p "${STATS_DIR}"
CURRENT="${STATS_DIR}/current.csv"
"${PYTHON}" "${VERIFY_PY}" --header >"${CURRENT}"
"${PYTHON}" "${VERIFY_PY}" \
--obikmer "${BINARY}" \
"${INDEX}" "${REF_DIR}" \
>>"${CURRENT}"
run_n=$(printf '%03d' "$(find "${STATS_DIR}" -maxdepth 1 -name 'count_*.csv' | wc -l | tr -d ' ')")
ARCHIVE="${STATS_DIR}/count_${run_n}.csv"
cp "${CURRENT}" "${ARCHIVE}"
echo "Done. Results → ${ARCHIVE}"
+170
@@ -0,0 +1,170 @@
#!/usr/bin/env python3
"""Verify the merged presence index against all per-specimen reference sets.
Streams `obikmer dump` once on the merged index, accumulates per-specimen
kmer sets from each column, then compares each against its reference .npz.
Output to stdout: one CSV row per specimen (same columns as verify_presence.py)
species,strain,ref_kmers,idx_kmers,false_neg,false_pos,fn_pct,fp_pct
"""
import argparse
import subprocess
import sys
from pathlib import Path
import numpy as np
# ── encoding ──────────────────────────────────────────────────────────────────
_ENCODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3,
           'a': 0, 'c': 1, 'g': 2, 't': 3}
_DECODE = ['A', 'C', 'G', 'T']


def encode_kmer(s: str) -> int:
    # Pack each base into 2 bits, first base in the most significant position.
    kmer = 0
    for c in s:
        kmer = (kmer << 2) | _ENCODE[c]
    return kmer


def decode_kmer(val: int, k: int) -> str:
    # Unpack k bases from the low bits upward, then restore reading order.
    bases = []
    for _ in range(k):
        bases.append(_DECODE[val & 3])
        val >>= 2
    return ''.join(reversed(bases))

# ── single-pass dump ──────────────────────────────────────────────────────────
def stream_merged_dump(obikmer_bin: str, index_dir: str,
                       ) -> tuple[list[str], dict[str, list[int]]]:
    """Stream the merged dump once.

    Returns:
        specimen_names : column labels in dump order (excluding 'kmer')
        per_specimen   : mapping label → list of kmer ints where presence > 0
    """
    cmd = [obikmer_bin, 'dump', index_dir]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
                            text=True)
    header_line = proc.stdout.readline().rstrip('\n')
    cols = header_line.split(',')
    specimen_names = cols[1:]  # first col is 'kmer'
    per_specimen: dict[str, list[int]] = {name: [] for name in specimen_names}
    for line in proc.stdout:
        parts = line.rstrip('\n').split(',')
        kmer_int = encode_kmer(parts[0])
        for i, name in enumerate(specimen_names):
            if int(parts[i + 1]) > 0:
                per_specimen[name].append(kmer_int)
    proc.wait()
    if proc.returncode != 0:
        print(f'ERROR: obikmer dump exited {proc.returncode}', file=sys.stderr)
        sys.exit(1)
    return specimen_names, per_specimen

# ── per-specimen comparison ───────────────────────────────────────────────────
def compare_specimen(name: str,
                     kmer_list: list[int],
                     ref_dir: Path,
                     k: int,
                     save_fn: Path | None,
                     save_fp: Path | None,
                     ) -> str:
    """Compare one specimen column against its reference .npz.

    Returns a CSV row string, or '' when no reference file exists.
    """
    ref_path = ref_dir / f'{name}.npz'
    if not ref_path.exists():
        print(f' SKIP {name}: no reference at {ref_path}', file=sys.stderr)
        return ''
    species = name.split('--')[0]
    strain = name[len(species) + 2:]
    ref_kmers = np.load(ref_path)['kmers']  # sorted uint64
    idx_kmers = np.array(sorted(kmer_list), dtype=np.uint64)

    false_neg = np.setdiff1d(ref_kmers, idx_kmers, assume_unique=True)
    false_pos = np.setdiff1d(idx_kmers, ref_kmers, assume_unique=True)
    fn_pct = 100.0 * len(false_neg) / len(ref_kmers) if len(ref_kmers) else 0.0
    fp_pct = 100.0 * len(false_pos) / len(idx_kmers) if len(idx_kmers) else 0.0
    print(f' {name}: ref={len(ref_kmers):,} idx={len(idx_kmers):,} '
          f'fn={len(false_neg):,} ({fn_pct:.4f}%) '
          f'fp={len(false_pos):,} ({fp_pct:.4f}%)',
          file=sys.stderr)

    if save_fn and len(false_neg):
        fn_file = save_fn / f'{name}_fn.txt'
        fn_file.write_text('\n'.join(decode_kmer(int(v), k) for v in false_neg) + '\n')
    if save_fp and len(false_pos):
        fp_file = save_fp / f'{name}_fp.txt'
        fp_file.write_text('\n'.join(decode_kmer(int(v), k) for v in false_pos) + '\n')

    return (f'{species},{strain},'
            f'{len(ref_kmers)},{len(idx_kmers)},'
            f'{len(false_neg)},{len(false_pos)},'
            f'{fn_pct:.4f},{fp_pct:.4f}')

# ── main ─────────────────────────────────────────────────────────────────────
def main() -> None:
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument('index', metavar='INDEX_DIR', nargs='?',
                    help='Merged presence index directory')
    ap.add_argument('ref_dir', metavar='REF_DIR', nargs='?',
                    help='Directory containing per-specimen .npz reference files')
    ap.add_argument('--obikmer', default='obikmer')
    ap.add_argument('--header', action='store_true',
                    help='Print CSV header and exit')
    ap.add_argument('--save-fn', metavar='DIR',
                    help='Directory to save false-negative kmer lists')
    ap.add_argument('--save-fp', metavar='DIR',
                    help='Directory to save false-positive kmer lists')
    args = ap.parse_args()

    if args.header:
        print('species,strain,ref_kmers,idx_kmers,'
              'false_neg,false_pos,fn_pct,fp_pct')
        return

    # The positionals are optional only so that --header works on its own;
    # without this check a missing argument surfaces as a TypeError in Path().
    if args.index is None or args.ref_dir is None:
        ap.error('INDEX_DIR and REF_DIR are required unless --header is given')

    ref_dir = Path(args.ref_dir)
    save_fn = Path(args.save_fn) if args.save_fn else None
    save_fp = Path(args.save_fp) if args.save_fp else None
    if save_fn:
        save_fn.mkdir(parents=True, exist_ok=True)
    if save_fp:
        save_fp.mkdir(parents=True, exist_ok=True)

    # Detect k
    out1 = subprocess.check_output(
        [args.obikmer, 'dump', '--head', '1', args.index],
        stderr=subprocess.DEVNULL, text=True)
    k = len(out1.splitlines()[1].split(',')[0])

    print(f'k={k} streaming merged dump: {args.index}', file=sys.stderr)
    specimen_names, per_specimen = stream_merged_dump(args.obikmer, args.index)
    print(f'{len(specimen_names)} specimen columns loaded', file=sys.stderr)

    for name in specimen_names:
        row = compare_specimen(name, per_specimen[name], ref_dir, k, save_fn, save_fp)
        if row:
            print(row)


if __name__ == '__main__':
    main()
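Reviewer note: `stream_merged_dump` turns one wide CSV (one column per specimen) into per-specimen kmer lists in a single pass. The sketch below reproduces that parsing step on an in-memory stream so it can be exercised without the `obikmer` binary. The function name `split_columns` and the two specimen labels are hypothetical; only the `kmer` first column and the `species--strain` label shape come from the scripts in this diff.

```python
import io

_ENCODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}


def encode_kmer(s: str) -> int:
    """2-bit packing, first base in the most significant position."""
    kmer = 0
    for c in s:
        kmer = (kmer << 2) | _ENCODE[c]
    return kmer


def split_columns(stream):
    """Read a merged CSV dump and collect, per column, the kmers with value > 0."""
    header = stream.readline().rstrip('\n').split(',')
    names = header[1:]  # first column holds the kmer string
    per_specimen = {name: [] for name in names}
    for line in stream:
        parts = line.rstrip('\n').split(',')
        kmer_int = encode_kmer(parts[0])
        for i, name in enumerate(names):
            if int(parts[i + 1]) > 0:
                per_specimen[name].append(kmer_int)
    return names, per_specimen


# Hypothetical merged dump with two specimen columns.
dump = io.StringIO('kmer,sp--a,sp--b\nAC,1,0\nGT,1,1\n')
names, cols = split_columns(dump)
print(names)  # ['sp--a', 'sp--b']
print(cols)   # {'sp--a': [1, 11], 'sp--b': [11]}
```

Memory grows with the total number of non-zero cells, since every specimen list is held until the comparison phase; that is the price of streaming the dump only once.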
+27
@@ -0,0 +1,27 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BINARY="${SCRIPT_DIR}/../src/target/release/obikmer"
INDEX="${SCRIPT_DIR}/global_index_presence"
REF_DIR="${SCRIPT_DIR}/reference_index"
STATS_DIR="${SCRIPT_DIR}/stats/verify_merge_presence"
PYTHON="${SCRIPT_DIR}/../.venv/bin/python3"
VERIFY_PY="${SCRIPT_DIR}/verify_merge_presence.py"
mkdir -p "${STATS_DIR}"
CURRENT="${STATS_DIR}/current.csv"
"${PYTHON}" "${VERIFY_PY}" --header >"${CURRENT}"
"${PYTHON}" "${VERIFY_PY}" \
--obikmer "${BINARY}" \
"${INDEX}" "${REF_DIR}" \
>>"${CURRENT}"
run_n=$(printf '%03d' "$(find "${STATS_DIR}" -maxdepth 1 -name 'presence_*.csv' | wc -l | tr -d ' ')")
ARCHIVE="${STATS_DIR}/presence_${run_n}.csv"
cp "${CURRENT}" "${ARCHIVE}"
echo "Done. Results → ${ARCHIVE}"
+30
@@ -0,0 +1,30 @@
#!/usr/bin/env bash
# Usage: verify_one_count.sh SPECIMEN
# SPECIMEN = "species--strain" (Make pattern stem)
# Output: stats/verify_count/SPECIMEN.stats (one CSV data row, no header)
set -euo pipefail
SPECIMEN="$1"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BINARY="${SCRIPT_DIR}/../src/target/release/obikmer"
PYTHON="${SCRIPT_DIR}/../.venv/bin/python3"
VERIFY_PY="${SCRIPT_DIR}/verify_count.py"
species="${SPECIMEN%%--*}"
strain="${SPECIMEN#*--}"
REF_NPZ="${SCRIPT_DIR}/reference_index/${SPECIMEN}.npz"
INDEX_DIR="${SCRIPT_DIR}/specimen_index_count/${SPECIMEN}"
STATS_DIR="${SCRIPT_DIR}/stats/verify_count"
STATS_FILE="${STATS_DIR}/${SPECIMEN}.stats"
mkdir -p "${STATS_DIR}"
echo "[${SPECIMEN}] verifying count"
"${PYTHON}" "${VERIFY_PY}" \
--obikmer "${BINARY}" \
--species "${species}" \
--strain "${strain}" \
"${REF_NPZ}" "${INDEX_DIR}" \
>"${STATS_FILE}"
+30
@@ -0,0 +1,30 @@
#!/usr/bin/env bash
# Usage: verify_one_presence.sh SPECIMEN
# SPECIMEN = "species--strain" (Make pattern stem)
# Output: stats/verify_presence/SPECIMEN.stats (one CSV data row, no header)
set -euo pipefail
SPECIMEN="$1"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BINARY="${SCRIPT_DIR}/../src/target/release/obikmer"
PYTHON="${SCRIPT_DIR}/../.venv/bin/python3"
VERIFY_PY="${SCRIPT_DIR}/verify_presence.py"
species="${SPECIMEN%%--*}"
strain="${SPECIMEN#*--}"
REF_NPZ="${SCRIPT_DIR}/reference_index/${SPECIMEN}.npz"
INDEX_DIR="${SCRIPT_DIR}/specimen_index_presence/${SPECIMEN}"
STATS_DIR="${SCRIPT_DIR}/stats/verify_presence"
STATS_FILE="${STATS_DIR}/${SPECIMEN}.stats"
mkdir -p "${STATS_DIR}"
echo "[${SPECIMEN}] verifying presence"
"${PYTHON}" "${VERIFY_PY}" \
--obikmer "${BINARY}" \
--species "${species}" \
--strain "${strain}" \
"${REF_NPZ}" "${INDEX_DIR}" \
>"${STATS_FILE}"
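Reviewer note: the shell expansions `${SPECIMEN%%--*}` and `${SPECIMEN#*--}` split the specimen label at its first `--`, which matches what the merged verifiers do with `name.split('--')[0]` and slicing. A minimal Python sketch of the same split is given below; the helper name `split_specimen` and the example labels are made up for illustration.

```python
def split_specimen(label: str) -> tuple[str, str]:
    """Split 'species--strain' at the first '--'; later '--' stay in the strain."""
    species, _, strain = label.partition('--')
    return species, strain


print(split_specimen('Escherichia_coli--K12'))  # ('Escherichia_coli', 'K12')
print(split_specimen('Genus_sp--lineage--7'))   # ('Genus_sp', 'lineage--7')
print(split_specimen('no_separator'))           # ('no_separator', '')
```

One difference worth keeping in mind: for a label without `--`, the shell expansion `${SPECIMEN#*--}` leaves the value unchanged, so the strain would equal the whole label, whereas the Python forms yield an empty strain.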
+139
@@ -0,0 +1,139 @@
#!/usr/bin/env python3
"""Compare an obikmer index against a reference kmer set (presence/absence).
Loads the reference .npz (sorted uint64 kmers built by build_reference.py),
streams the output of `obikmer dump`, encodes each kmer string to uint64,
then reports false negatives and false positives using numpy set operations.
Output to stdout: one CSV row
species, strain, ref_kmers, idx_kmers, false_neg, false_pos, fn_pct, fp_pct
"""
import argparse
import subprocess
import sys
import numpy as np
# ── encoding ──────────────────────────────────────────────────────────────────
_ENCODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3,
           'a': 0, 'c': 1, 'g': 2, 't': 3}
_DECODE = ['A', 'C', 'G', 'T']


def encode_kmer(s: str) -> int:
    # Pack each base into 2 bits, first base in the most significant position.
    kmer = 0
    for c in s:
        kmer = (kmer << 2) | _ENCODE[c]
    return kmer


def decode_kmer(val: int, k: int) -> str:
    # Unpack k bases from the low bits upward, then restore reading order.
    bases = []
    for _ in range(k):
        bases.append(_DECODE[val & 3])
        val >>= 2
    return ''.join(reversed(bases))

# ── dump parsing ──────────────────────────────────────────────────────────────
def load_index_kmers(obikmer_bin: str, index_dir: str) -> np.ndarray:
    """Stream `obikmer dump` and return a sorted uint64 array of kmer integers."""
    cmd = [obikmer_bin, 'dump', index_dir]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
                            text=True)
    kmers = []
    header = True
    for line in proc.stdout:
        if header:
            # Skip the CSV header line.
            header = False
            continue
        kmer_str = line.split(',', 1)[0]
        kmers.append(encode_kmer(kmer_str))
    proc.wait()
    if proc.returncode != 0:
        print(f'ERROR: obikmer dump exited {proc.returncode}', file=sys.stderr)
        sys.exit(1)
    arr = np.array(kmers, dtype=np.uint64)
    arr.sort()
    return arr

# ── comparison ────────────────────────────────────────────────────────────────
def compare(ref: np.ndarray, idx: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return (false_negatives, false_positives) as uint64 arrays."""
    false_neg = np.setdiff1d(ref, idx, assume_unique=True)
    false_pos = np.setdiff1d(idx, ref, assume_unique=True)
    return false_neg, false_pos

# ── main ─────────────────────────────────────────────────────────────────────
def main() -> None:
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument('reference', metavar='REF_NPZ', nargs='?', help='Reference .npz file')
    ap.add_argument('index', metavar='INDEX_DIR', nargs='?', help='obikmer index directory')
    ap.add_argument('--obikmer', default='obikmer', help='Path to obikmer binary')
    ap.add_argument('--species', default='', help='Species label for CSV row')
    ap.add_argument('--strain', default='', help='Strain label for CSV row')
    ap.add_argument('--header', action='store_true', help='Print CSV header and exit')
    ap.add_argument('--save-fp', metavar='FILE',
                    help='Save false-positive kmer strings to FILE')
    ap.add_argument('--save-fn', metavar='FILE',
                    help='Save false-negative kmer strings to FILE')
    args = ap.parse_args()

    if args.header:
        print('species,strain,ref_kmers,idx_kmers,'
              'false_neg,false_pos,fn_pct,fp_pct')
        return

    # The positionals are optional only so that --header works on its own;
    # fail early with a usage message instead of passing None further down.
    if args.reference is None or args.index is None:
        ap.error('REF_NPZ and INDEX_DIR are required unless --header is given')

    # Detect k from the index (one cheap call before the full dump).
    cmd1 = [args.obikmer, 'dump', '--head', '1', args.index]
    out1 = subprocess.check_output(cmd1, stderr=subprocess.DEVNULL, text=True)
    k = len(out1.splitlines()[1].split(',')[0])

    # Load reference
    print(f'Loading reference: {args.reference}', file=sys.stderr)
    npz = np.load(args.reference)
    ref_kmers = npz['kmers']  # already sorted uint64

    # Load index
    print(f'Streaming dump (k={k}): {args.index}', file=sys.stderr)
    idx_kmers = load_index_kmers(args.obikmer, args.index)
    print(f'k={k} ref={len(ref_kmers):,} idx={len(idx_kmers):,}', file=sys.stderr)

    false_neg, false_pos = compare(ref_kmers, idx_kmers)
    fn_pct = 100.0 * len(false_neg) / len(ref_kmers) if len(ref_kmers) else 0.0
    fp_pct = 100.0 * len(false_pos) / len(idx_kmers) if len(idx_kmers) else 0.0
    print(f'false negatives: {len(false_neg):,} ({fn_pct:.4f}%)', file=sys.stderr)
    print(f'false positives: {len(false_pos):,} ({fp_pct:.4f}%)', file=sys.stderr)

    if args.save_fn and len(false_neg):
        with open(args.save_fn, 'w') as fh:
            for v in false_neg:
                fh.write(decode_kmer(int(v), k) + '\n')
        print(f'False negatives saved → {args.save_fn}', file=sys.stderr)
    if args.save_fp and len(false_pos):
        with open(args.save_fp, 'w') as fh:
            for v in false_pos:
                fh.write(decode_kmer(int(v), k) + '\n')
        print(f'False positives saved → {args.save_fp}', file=sys.stderr)

    print(f'{args.species},{args.strain},'
          f'{len(ref_kmers)},{len(idx_kmers)},'
          f'{len(false_neg)},{len(false_pos)},'
          f'{fn_pct:.4f},{fp_pct:.4f}')


if __name__ == '__main__':
    main()
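Reviewer note: the presence check boils down to 2-bit kmer encoding followed by two set differences. The sketch below shows both steps on toy data with plain Python sets standing in for the `np.setdiff1d` calls; the kmer strings are invented for illustration and the set-based comparison is a simplification, not the script's vectorised code.

```python
_ENCODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
_DECODE = 'ACGT'


def encode_kmer(s: str) -> int:
    """Pack bases into 2 bits each, first base most significant."""
    kmer = 0
    for c in s:
        kmer = (kmer << 2) | _ENCODE[c]
    return kmer


def decode_kmer(val: int, k: int) -> str:
    """Inverse of encode_kmer; k is needed because leading 'A's encode as zeros."""
    bases = []
    for _ in range(k):
        bases.append(_DECODE[val & 3])
        val >>= 2
    return ''.join(reversed(bases))


# Toy reference and index kmer sets (k = 3).
ref = {encode_kmer(s) for s in ('ACG', 'CGT', 'GTA')}
idx = {encode_kmer(s) for s in ('ACG', 'GTA', 'TTT')}

false_neg = sorted(decode_kmer(v, 3) for v in ref - idx)  # in reference, missing from index
false_pos = sorted(decode_kmer(v, 3) for v in idx - ref)  # in index, absent from reference
print(false_neg)  # ['CGT']
print(false_pos)  # ['TTT']

# 'AAC' and 'C' share the integer 1, so decoding must be told k.
print(encode_kmer('AAC'), decode_kmer(1, 3), decode_kmer(1, 1))  # 1 AAC C
```

This is also why every verifier detects k before the full dump: the integers alone do not carry the kmer length.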
+84
@@ -638,6 +638,34 @@
<li class="md-nav__item">
<a href="/implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="/implementation/obilayeredmap/" class="md-nav__link"> <a href="/implementation/obilayeredmap/" class="md-nav__link">
@@ -716,6 +744,62 @@
<li class="md-nav__item">
<a href="/implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="/implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
File diff suppressed because it is too large Load Diff
@@ -9,7 +9,7 @@
<link rel="prev" href="../../../implementation/persistent_bit_vec/"> <link rel="prev" href="../../../implementation/rebuild_filter/">
<link rel="next" href="../../index_architecture/"> <link rel="next" href="../../index_architecture/">
@@ -647,6 +647,34 @@
<li class="md-nav__item">
<a href="../../../implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="../../../implementation/obilayeredmap/" class="md-nav__link"> <a href="../../../implementation/obilayeredmap/" class="md-nav__link">
@@ -725,6 +753,62 @@
<li class="md-nav__item">
<a href="../../../implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../../implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
File diff suppressed because it is too large Load Diff
+101 -38
@@ -243,19 +243,28 @@
</label> </label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix=""> <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
<li class="md-nav__item"> <li class="md-nav__item">
<a class="md-nav__link" href="#output-type-rope"> <a class="md-nav__link" href="#two-reading-paths">
<span class="md-ellipsis"> <span class="md-ellipsis">
Output type: rope Two reading paths
</span> </span>
</a> </a>
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
<a class="md-nav__link" href="#allocation-policy"> <a class="md-nav__link" href="#record-path-chunk-reader">
<span class="md-ellipsis"> <span class="md-ellipsis">
Allocation policy Record path: chunk reader
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#output-type-rope">
<span class="md-ellipsis">
Output type: Rope
</span> </span>
</a> </a>
@@ -347,6 +356,18 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../evidence_elimination/">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span> </span>
</a> </a>
</li> </li>
@@ -383,6 +404,30 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../merge/">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../rebuild_filter/">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span> </span>
</a> </a>
</li> </li>
@@ -454,19 +499,28 @@
</label> </label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix=""> <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
<li class="md-nav__item"> <li class="md-nav__item">
<a class="md-nav__link" href="#output-type-rope"> <a class="md-nav__link" href="#two-reading-paths">
<span class="md-ellipsis"> <span class="md-ellipsis">
Output type: rope Two reading paths
</span> </span>
</a> </a>
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
<a class="md-nav__link" href="#allocation-policy"> <a class="md-nav__link" href="#record-path-chunk-reader">
<span class="md-ellipsis"> <span class="md-ellipsis">
Allocation policy Record path: chunk reader
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#output-type-rope">
<span class="md-ellipsis">
Output type: Rope
</span> </span>
</a> </a>
@@ -506,68 +560,77 @@
<div class="md-content" data-md-component="content"> <div class="md-content" data-md-component="content">
<article class="md-content__inner md-typeset"> <article class="md-content__inner md-typeset">
<h1 id="chunk-reader-implementation">Chunk reader — implementation</h1> <h1 id="chunk-reader-implementation">Chunk reader — implementation</h1>
<p>The <code>obiread</code> crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.</p> <p><code>obiread</code> exposes two distinct sequence reading paths, each optimised for a different use case.</p>
<h2 id="output-type-rope">Output type: rope</h2> <h2 id="two-reading-paths">Two reading paths</h2>
<p>Each chunk is a <code>Vec&lt;Bytes&gt;</code> — a <strong>rope</strong>: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.</p>
<p>Using <code>bytes::Bytes</code> means the split at the record boundary is O(1): <code>Bytes::split_to(n)</code> adjusts a reference counter, not memory. No <code>memcpy</code> in the common case.</p>
<h2 id="allocation-policy">Allocation policy</h2>
<table> <table>
<thead> <thead>
<tr> <tr>
<th>Case</th> <th>Path</th>
<th>Cost</th> <th>API</th>
<th>Output unit</th>
<th>Per-record identity</th>
<th>Use case</th>
</tr> </tr>
</thead> </thead>
<tbody> <tbody>
<tr> <tr>
<td>Boundary found in the current block (common)</td> <td><strong>Record path</strong></td>
<td>zero extra allocation — <code>split_to</code> only</td> <td><code>read_sequence_chunks</code><code>parse_chunk</code></td>
<td><code>SeqRecord</code> (id + raw sequence + normalised rope)</td>
<td>yes</td>
<td><code>query</code> — must read complete records</td>
</tr> </tr>
<tr> <tr>
<td>Boundary straddles multiple blocks (sequence &gt; block size, rare)</td> <td><strong>Stream path</strong></td>
<td>one allocation to pack the rope into a flat buffer</td> <td><code>open_nuc_stream</code></td>
</tr> <td><code>NucPage</code> (flat normalised byte buffer)</td>
<tr> <td>no</td>
<td>EOF flush</td> <td><code>index</code>, <code>superkmer</code> — bulk throughput</td>
<td>zero extra allocation</td>
</tr> </tr>
</tbody> </tbody>
</table> </table>
<p>The record path uses <code>Rope</code>-backed chunks and is described in detail below.
The stream path (<code>NucStream</code> / <code>NucPage</code>) is described in the scatter section of <a href="../pipeline/">pipeline</a>.</p>
<hr/>
<h2 id="record-path-chunk-reader">Record path: chunk reader</h2>
<p>The chunk reader reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. <code>parse_chunk</code> then converts each chunk into a <code>Vec&lt;SeqRecord&gt;</code>, where each record carries its identifier, raw sequence bytes, and a normalised rope ready for superkmer building.</p>
<p>This path is mandatory for <code>query</code>, where superkmers must be tracked back to their originating sequence (id, kmer offset) for output annotation.</p>
<h2 id="output-type-rope">Output type: Rope</h2>
<p>Each chunk is a <code>Rope</code> — a segmented byte sequence: a <code>Vec</code> of blocks, where each block is a <code>Vec&lt;Cell&lt;u8&gt;&gt;</code>. The consumer iterates over the blocks via a forward or backward cursor.</p>
<p><code>Rope::split_off(pos)</code> splits at an absolute byte offset in O(log n) (binary search over block-start index). If <code>pos</code> falls inside a block, that block is split in two via <code>Vec::split_off</code> — no <code>memcpy</code> in the common case.</p>
<h2 id="seqchunkiter">SeqChunkIter</h2> <h2 id="seqchunkiter">SeqChunkIter</h2>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="cm">/* private */</span><span class="w"> </span><span class="p">}</span> <div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="cm">/* private */</span><span class="w"> </span><span class="p">}</span>
<span class="k">impl</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="w"> </span><span class="nb">Iterator</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span> <span class="k">impl</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="w"> </span><span class="nb">Iterator</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">io</span><span class="p">::</span><span class="nb">Result</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Bytes</span><span class="o">&gt;&gt;</span><span class="p">;</span> <span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">io</span><span class="p">::</span><span class="nb">Result</span><span class="o">&lt;</span><span class="n">Rope</span><span class="o">&gt;</span><span class="p">;</span>
<span class="p">}</span> <span class="p">}</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fasta_chunks</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span> <span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fasta_chunks</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fastq_chunks</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span> <span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fastq_chunks</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span>
</code></pre></div> </code></pre></div>
<p><code>next()</code> loop:</p> <p><code>next()</code> loop:</p>
<div class="highlight"><pre><span></span><code>1. read one block of block_size bytes → push onto rope <div class="highlight"><pre><span></span><code>1. read one block of block_size bytes → push onto Rope
2. probe check: if the boundary marker ("\n&gt;" or "\n@") is absent from the 2. call splitter(rope) → Option&lt;abs_offset&gt;
last block, skip the splitter (avoids a full backward scan for nothing) if Some(pos):
3. call splitter on last block tail = rope.split_off(pos) ← O(log n), may split one block
if found at offset n: chunk = mem::replace(&amp;mut rope, tail)
remainder = last_block.split_to(n) ← O(1), zero copy return Some(Ok(chunk))
return std::mem::take(&amp;mut self.rope) ← the chunk 3. if EOF and rope non-empty: return Some(Ok(rope)) as final chunk
4. if rope.len() &gt; 1 (multi-block accumulation): 4. if EOF and rope empty: return None
pack rope into one flat buffer ← one alloc
retry splitter on flat buffer
5. if EOF: flush remaining rope as final chunk
</code></pre></div> </code></pre></div>
<p>The <code>Splitter</code> function signature is <code>fn(&amp;Rope) -&gt; Option&lt;usize&gt;</code>. It returns the absolute byte offset of the start of the last complete record, or <code>None</code> if no boundary was found in the accumulated rope (need more data).</p>
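<p>The <code>next()</code> loop above can be sketched as follows. This is a simplified illustration, not the crate's code: a plain <code>Vec&lt;u8&gt;</code> stands in for the rope, the splitter takes <code>&amp;[u8]</code> instead of <code>&amp;Rope</code>, and the names <code>ChunkIter</code> and <code>new</code> are hypothetical.</p>

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for the chunk iterator: a `Vec<u8>` replaces the rope.
pub struct ChunkIter<R: Read> {
    source: R,
    buf: Vec<u8>,
    block_size: usize,
    /// Returns the offset where the last record starts, or None (need more data).
    splitter: fn(&[u8]) -> Option<usize>,
    eof: bool,
}

impl<R: Read> ChunkIter<R> {
    pub fn new(source: R, block_size: usize, splitter: fn(&[u8]) -> Option<usize>) -> Self {
        ChunkIter { source, buf: Vec::new(), block_size, splitter, eof: false }
    }
}

impl<R: Read> Iterator for ChunkIter<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        while !self.eof {
            // Step 1: read one block and append it to the accumulated bytes.
            let old_len = self.buf.len();
            self.buf.resize(old_len + self.block_size, 0);
            match self.source.read(&mut self.buf[old_len..]) {
                Ok(0) => {
                    self.buf.truncate(old_len);
                    self.eof = true;
                }
                Ok(n) => {
                    self.buf.truncate(old_len + n);
                    // Step 2: ask the splitter for a record boundary.
                    if let Some(pos) = (self.splitter)(&self.buf) {
                        if pos > 0 {
                            // Everything before `pos` is the chunk; the tail stays buffered.
                            let tail = self.buf.split_off(pos);
                            return Some(Ok(std::mem::replace(&mut self.buf, tail)));
                        }
                    }
                }
                Err(e) => {
                    // Report the error once, then end the iteration.
                    self.buf.clear();
                    self.eof = true;
                    return Some(Err(e));
                }
            }
        }
        // Steps 3 and 4: flush the remainder at EOF, or stop when nothing is left.
        if self.buf.is_empty() {
            None
        } else {
            Some(Ok(std::mem::take(&mut self.buf)))
        }
    }
}
```

<p>With a <code>Vec&lt;u8&gt;</code>, <code>split_off</code> copies the tail, so the sketch does not reproduce the O(log n) behaviour of the rope; it only shows the control flow.</p>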
<h2 id="boundary-detection-fasta">Boundary detection — FASTA</h2> <h2 id="boundary-detection-fasta">Boundary detection — FASTA</h2>
<p>Backward scan with a 2-state machine. Searches for <code>&gt;</code> immediately preceded by <code>\n</code> or <code>\r</code>:</p> <p>Backward scan with a 2-state machine. Searches (right to left) for <code>&gt;</code> followed by <code>\n</code> or <code>\r</code> (i.e., a <code>&gt;</code> that is preceded by a newline in forward order):</p>
<pre class="mermaid"><code>stateDiagram-v2 <pre class="mermaid"><code>stateDiagram-v2
direction LR direction LR
[*] --&gt; Scanning [*] --&gt; Scanning
Scanning --&gt; FoundGt : '&gt;' Scanning --&gt; FoundGt : '&gt;'
FoundGt --&gt; Scanning : other FoundGt --&gt; Scanning : other
FoundGt --&gt; [*] : '\\n' / '\\r' ✓</code></pre> FoundGt --&gt; [*] : '\\n' / '\\r' ✓</code></pre>
<p>Returns the byte offset of the <code>&gt;</code> that starts the last complete record.</p> <p>Returns the byte offset of the <code>&gt;</code> that starts the last complete record. Returns <code>None</code> if only one <code>&gt;</code> is found (cannot confirm there is a prior complete record).</p>
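<p>A minimal sketch of the 2-state backward scan, assuming the input is a flat byte slice rather than a rope. The function name <code>last_fasta_boundary</code> is hypothetical, and the sketch does not implement the "only one <code>&gt;</code> found" rule; it simply returns the right-most <code>&gt;</code> that directly follows a line terminator.</p>

```rust
#[derive(Clone, Copy)]
enum State {
    Scanning,
    /// A '>' was seen at this offset; the next byte to the left decides.
    FoundGt(usize),
}

/// Scans right to left and returns the offset of the last '>' that is
/// immediately preceded (in forward order) by '\n' or '\r'.
pub fn last_fasta_boundary(buf: &[u8]) -> Option<usize> {
    let mut state = State::Scanning;
    for (i, &b) in buf.iter().enumerate().rev() {
        state = match (state, b) {
            // FoundGt followed (leftwards) by a line terminator: accept.
            (State::FoundGt(pos), b'\n' | b'\r') => return Some(pos),
            // Any '>' becomes the new candidate.
            (_, b'>') => State::FoundGt(i),
            // Anything else falls back to scanning.
            _ => State::Scanning,
        };
    }
    // A '>' at offset 0 has no preceding byte, so it is never accepted.
    None
}
```

<p>A <code>&gt;</code> at the very start of the buffer is not reported, which matches the idea that a boundary must separate two records.</p>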
<h2 id="boundary-detection-fastq">Boundary detection — FASTQ</h2> <h2 id="boundary-detection-fastq">Boundary detection — FASTQ</h2>
<p>FASTQ records have a rigid 4-line structure (<code>@header</code>, sequence, <code>+</code>, quality). The <code>@</code> character (ASCII 64, Phred score 31) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate <code>@</code>.</p> <p>FASTQ records have a rigid 4-line structure (<code>@header</code>, sequence, <code>+</code>, quality). The <code>@</code> character (ASCII 64, Phred score 31) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate <code>@</code>.</p>
<p>7-state machine (port of Go's <code>EndOfLastFastqEntry</code>), scanning from <strong>right to left</strong>. Each time a <code>+</code> is found, its position is saved as <code>restart</code>; any state mismatch resets the scan to that position.</p> <p>7-state machine (states 0–6), scanning from <strong>right to left</strong>. Each time a <code>+</code> is found, its position is saved as <code>restart</code>; any state mismatch resets the scan to that position.</p>
<pre class="mermaid"><code>stateDiagram-v2 <pre class="mermaid"><code>stateDiagram-v2
direction LR direction LR
File diff suppressed because it is too large Load Diff
+210 -21
@@ -514,10 +514,21 @@
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix> <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="#memory-layout" class="md-nav__link"> <a href="#types-and-layout" class="md-nav__link">
<span class="md-ellipsis"> <span class="md-ellipsis">
Memory layout Types and layout
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#global-parameters" class="md-nav__link">
<span class="md-ellipsis">
Global parameters
</span> </span>
</a> </a>
@@ -558,10 +569,32 @@
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="#canonical-form" class="md-nav__link"> <a href="#canonical-form-and-canonicalkmerof" class="md-nav__link">
<span class="md-ellipsis"> <span class="md-ellipsis">
Canonical form Canonical form and CanonicalKmerOf
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#sliding-window-helpers" class="md-nav__link">
<span class="md-ellipsis">
Sliding window helpers
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#hashing" class="md-nav__link">
<span class="md-ellipsis">
Hashing
</span> </span>
</a> </a>
@@ -751,6 +784,34 @@
<li class="md-nav__item">
<a href="../evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="../obilayeredmap/" class="md-nav__link"> <a href="../obilayeredmap/" class="md-nav__link">
@@ -829,6 +890,62 @@
<li class="md-nav__item">
<a href="../merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
@@ -973,10 +1090,21 @@
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix> <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="#memory-layout" class="md-nav__link"> <a href="#types-and-layout" class="md-nav__link">
<span class="md-ellipsis"> <span class="md-ellipsis">
Memory layout Types and layout
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#global-parameters" class="md-nav__link">
<span class="md-ellipsis">
Global parameters
</span> </span>
</a> </a>
@@ -1017,10 +1145,32 @@
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="#canonical-form" class="md-nav__link"> <a href="#canonical-form-and-canonicalkmerof" class="md-nav__link">
<span class="md-ellipsis"> <span class="md-ellipsis">
Canonical form Canonical form and CanonicalKmerOf
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#sliding-window-helpers" class="md-nav__link">
<span class="md-ellipsis">
Sliding window helpers
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#hashing" class="md-nav__link">
<span class="md-ellipsis">
Hashing
</span> </span>
</a> </a>
@@ -1045,12 +1195,43 @@
<h1 id="kmer-implementation">Kmer — implementation</h1> <h1 id="kmer-implementation">Kmer — implementation</h1>
<h2 id="memory-layout">Memory layout</h2> <h2 id="types-and-layout">Types and layout</h2>
<p><code>Kmer</code> is a <code>#[repr(transparent)]</code> newtype over <code>u64</code>:</p> <p><code>KmerOf&lt;L&gt;</code> is a <code>#[repr(transparent)]</code> newtype over <code>u64</code> parameterized by a <code>KmerLength</code> marker:</p>
<div class="highlight"><pre><span></span><code><span class="cp">#[repr(transparent)]</span> <div class="highlight"><pre><span></span><code><span class="cp">#[repr(transparent)]</span>
<span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">Kmer</span><span class="p">(</span><span class="kt">u64</span><span class="p">);</span> <span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">KmerOf</span><span class="o">&lt;</span><span class="n">L</span><span class="p">:</span><span class="w"> </span><span class="nc">KmerLength</span><span class="o">&gt;</span><span class="p">(</span><span class="kt">u64</span><span class="p">,</span><span class="w"> </span><span class="n">PhantomData</span><span class="o">&lt;</span><span class="n">L</span><span class="o">&gt;</span><span class="p">);</span>
</code></pre></div> </code></pre></div>
<p>Nucleotides are packed 2 bits each, <strong>left-aligned</strong>, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2k bits are always zero. k is <strong>not stored</strong>: it is a parameter of every operation that needs it, and will be owned by the future collection-level indexer.</p> <p>Three marker types implement <code>KmerLength</code>:</p>
<table>
<thead>
<tr>
<th>Marker</th>
<th><code>len()</code> source</th>
<th>Used for</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>KLen</code></td>
<td><code>params::k()</code></td>
<td>k-mers</td>
</tr>
<tr>
<td><code>MLen</code></td>
<td><code>params::m()</code></td>
<td>minimizers</td>
</tr>
<tr>
<td><code>ConstLen&lt;N&gt;</code></td>
<td>const generic <code>N</code></td>
<td>tests</td>
</tr>
</tbody>
</table>
<p>Public aliases:</p>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Kmer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">KmerOf</span><span class="o">&lt;</span><span class="n">KLen</span><span class="o">&gt;</span><span class="p">;</span><span class="w"> </span><span class="c1">// k-mer, global k</span>
<span class="k">pub</span><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Minimizer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">CanonicalKmerOf</span><span class="o">&lt;</span><span class="n">MLen</span><span class="o">&gt;</span><span class="p">;</span><span class="w"> </span><span class="c1">// canonical m-mer, global m</span>
</code></pre></div>
<p>Nucleotides are packed 2 bits each, <strong>left-aligned</strong>, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2·len bits are always zero. The length is <strong>not stored</strong>: every operation reads it from <code>L::len()</code>.</p>
<table> <table>
<thead> <thead>
<tr> <tr>
@@ -1071,33 +1252,41 @@
</tr> </tr>
</tbody> </tbody>
</table> </table>
<h2 id="global-parameters">Global parameters</h2>
<p><code>params::set_k(k)</code> / <code>params::k()</code> and <code>params::set_m(m)</code> / <code>params::m()</code> are backed by <code>OnceLock&lt;usize&gt;</code> in production (write-once, panic on conflict) and by <code>thread_local! { Cell&lt;usize&gt; }</code> in test builds (per-thread, freely writable). <code>params::init(k, m)</code> sets both in one call.</p>
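<p>The production behaviour (write-once, panic on conflict) could look roughly like the sketch below. It covers only <code>k</code> and only the <code>OnceLock</code> variant; treating a repeated call with the <em>same</em> value as harmless is an assumption, not something the text states.</p>

```rust
use std::sync::OnceLock;

// Process-wide k; written at most once.
static K: OnceLock<usize> = OnceLock::new();

/// Sets k. Panics if k was already set to a different value.
/// (Assumption: setting the same value again is accepted.)
pub fn set_k(k: usize) {
    let stored = *K.get_or_init(|| k);
    assert_eq!(stored, k, "k is already set to {stored}, cannot change it to {k}");
}

/// Reads k. Panics if it was never set.
pub fn k() -> usize {
    *K.get().expect("k has not been initialised")
}
```

<p>A test build would swap the <code>OnceLock</code> for a <code>thread_local!</code> <code>Cell&lt;usize&gt;</code> so that each test thread can pick its own value.</p>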
<h2 id="encoding">Encoding</h2> <h2 id="encoding">Encoding</h2>
<p><code>Kmer::from_ascii(ascii, k)</code> encodes the first k bytes of an ASCII slice using the shared <code>ENC</code> table (see <a href="../superkmer/#ascii-encoding-and-decoding">SuperKmer — ASCII encoding</a>):</p> <p><code>KmerOf::&lt;L&gt;::from_ascii(ascii)</code> encodes the first <code>L::len()</code> bytes using the shared <code>ENC</code> table (see <a href="../superkmer/#ascii-encoding-and-decoding">SuperKmer — ASCII encoding</a>):</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="mi">0</span><span class="o">..</span><span class="n">k</span><span class="w"> </span><span class="p">{</span> <div class="highlight"><pre><span></span><code><span class="k">for</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="mi">0</span><span class="o">..</span><span class="n">k</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">val</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">val</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">encode_base</span><span class="p">(</span><span class="n">ascii</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">u64</span><span class="p">;</span> <span class="w"> </span><span class="n">val</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">val</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">encode_base</span><span class="p">(</span><span class="n">ascii</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">u64</span><span class="p">;</span>
<span class="p">}</span> <span class="p">}</span>
<span class="n">Kmer</span><span class="p">(</span><span class="n">val</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">))</span> <span class="n">KmerOf</span><span class="p">(</span><span class="n">val</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">),</span><span class="w"> </span><span class="n">PhantomData</span><span class="p">)</span>
</code></pre></div> </code></pre></div>
<p>Zero allocation — result lives on the stack.</p> <p>Zero allocation — result lives on the stack.</p>
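<p>A self-contained sketch of the packing, working on a raw <code>u64</code> with an explicit <code>k</code> instead of the <code>KmerOf&lt;L&gt;</code> type. The 2-bit codes (A=0, C=1, G=2, T=3) are an assumption standing in for the <code>ENC</code> table; they are at least consistent with a complement computed by bit inversion. Returning <code>Option</code> for invalid input is also an assumption.</p>

```rust
/// Assumed 2-bit code: A=0, C=1, G=2, T=3 (stand-in for the ENC table).
fn encode_base(b: u8) -> Option<u64> {
    match b {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None,
    }
}

/// Packs the first `k` bases of `ascii`, MSB-first and left-aligned.
/// Returns None for k outside 1..=32, short input, or a non-ACGT byte.
pub fn from_ascii(ascii: &[u8], k: usize) -> Option<u64> {
    if k == 0 || k > 32 || ascii.len() < k {
        return None;
    }
    let mut val = 0u64;
    for &b in &ascii[..k] {
        val = (val << 2) | encode_base(b)?;
    }
    // Left-align: the low 64 - 2k bits end up zero.
    Some(val << (64 - 2 * k))
}
```

<p>For example, <code>ACGT</code> with k = 4 packs to the byte <code>0b00_01_10_11</code> (0x1B) in the top 8 bits of the word.</p>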
<h2 id="decoding">Decoding</h2> <h2 id="decoding">Decoding</h2>
<p><code>write_ascii(k, buf)</code> appends k ASCII characters to a caller-supplied <code>Vec&lt;u8&gt;</code> using the shared <code>DEC4</code> table: one lookup per 4 nucleotides, two partial-byte lookups for the remainder. No allocation in the hot path.</p> <p><code>write_ascii(writer)</code> writes k ASCII characters to any <code>W: Write</code> using the shared <code>DEC4</code> table: one lookup per 4 nucleotides, one partial lookup for the remainder. No allocation in the hot path.</p>
<p><code>to_ascii(k)</code> is a convenience wrapper that allocates and returns a <code>Vec&lt;u8&gt;</code>; intended for tests and display only.</p> <p><code>to_ascii()</code> is a convenience wrapper that allocates and returns a <code>Vec&lt;u8&gt;</code>; intended for tests and display only.</p>
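<p>One way to realise a 4-nucleotides-per-lookup decoder is sketched below. The table layout (256 entries of 4 ASCII bytes, indexed by one packed byte) and the A/C/G/T code order are assumptions about <code>DEC4</code>; the function takes the raw <code>u64</code> and <code>k</code> explicitly.</p>

```rust
use std::io::{self, Write};

const BASES: [u8; 4] = *b"ACGT"; // assumed code order: A=0, C=1, G=2, T=3

/// Builds a table mapping every packed byte to its 4 ASCII bases.
const fn build_dec4() -> [[u8; 4]; 256] {
    let mut table = [[0u8; 4]; 256];
    let mut byte = 0;
    while byte < 256 {
        let mut j = 0;
        while j < 4 {
            // Base j of the byte sits in bits 7-2j and 6-2j.
            table[byte][j] = BASES[(byte >> (6 - 2 * j)) & 3];
            j += 1;
        }
        byte += 1;
    }
    table
}

static DEC4: [[u8; 4]; 256] = build_dec4();

/// Writes the first `k` bases (k <= 32) of a left-aligned packed k-mer.
pub fn write_ascii<W: Write>(raw: u64, k: usize, w: &mut W) -> io::Result<()> {
    let bytes = raw.to_be_bytes();
    let full = k / 4;
    // One lookup per complete group of 4 nucleotides.
    for &b in &bytes[..full] {
        w.write_all(&DEC4[b as usize])?;
    }
    // One partial lookup for the remaining 1 to 3 nucleotides.
    let rem = k % 4;
    if rem > 0 {
        w.write_all(&DEC4[bytes[full] as usize][..rem])?;
    }
    Ok(())
}
```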
<h2 id="reverse-complement">Reverse complement</h2> <h2 id="reverse-complement">Reverse complement</h2>
<p>Computed as pure arithmetic — no lookup table, no memory access:</p> <p>Computed as pure arithmetic — no lookup table, no memory access:</p>
<div class="highlight"><pre><span></span><code><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">!</span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="c1">// complement</span> <div class="highlight"><pre><span></span><code><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">!</span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="c1">// complement</span>
<span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">swap_bytes</span><span class="p">();</span><span class="w"> </span><span class="c1">// reverse bytes</span> <span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">swap_bytes</span><span class="p">();</span><span class="w"> </span><span class="c1">// reverse bytes</span>
<span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="mi">4</span><span class="p">)</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0x0F0F0F0F0F0F0F0F</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0x0F0F0F0F0F0F0F0F</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="mi">4</span><span class="p">);</span><span class="w"> </span><span class="c1">// swap nibbles</span> <span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="mi">4</span><span class="p">)</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0x0F0F0F0F0F0F0F0F</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0x0F0F0F0F0F0F0F0F</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="mi">4</span><span class="p">);</span><span class="w"> </span><span class="c1">// swap nibbles</span>
<span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0x3333333333333333</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0x3333333333333333</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="mi">2</span><span class="p">);</span><span class="w"> </span><span class="c1">// swap 2-bit groups</span> <span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0x3333333333333333</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0x3333333333333333</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="mi">2</span><span class="p">);</span><span class="w"> </span><span class="c1">// swap 2-bit groups</span>
<span class="n">Kmer</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">))</span> <span class="n">KmerOf</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">),</span><span class="w"> </span><span class="n">PhantomData</span><span class="p">)</span>
</code></pre></div> </code></pre></div>
<p>After complementing, bytes are reversed (<code>swap_bytes</code>), then nibbles, then 2-bit groups — restoring 2-bit nucleotides to their correct positions in reverse order. A final left-shift realigns to MSB. Zero allocation — result lives on the stack.</p> <p>After complementing, bytes are reversed (<code>swap_bytes</code>), then nibbles, then 2-bit groups — restoring 2-bit nucleotides to their correct positions in reverse order. A final left-shift realigns to MSB. Zero allocation — result lives on the stack.</p>
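<p>The same steps as a free function on the raw word, to make the bit movements easy to try out. It assumes 1 ≤ k ≤ 32 and the A=0, C=1, G=2, T=3 code, under which bitwise NOT is the complement; the function name and the explicit <code>k</code> argument are illustrative.</p>

```rust
/// Reverse complement of a left-aligned, 2-bit packed k-mer (1 <= k <= 32).
pub fn revcomp(raw: u64, k: usize) -> u64 {
    // Complement every base; the unused low bits become 1s.
    let x = !raw;
    // Reverse the order of the 32 two-bit groups in three swaps.
    let x = x.swap_bytes();
    let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    let x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    // The k bases now sit in the low 2k bits; shifting left re-aligns them
    // to the MSB and pushes the junk 1s out of the word.
    x << (64 - 2 * k)
}
```

<p>With k = 4, <code>AAAC</code> (0x01 in the top byte) maps to <code>GTTT</code> (0xBF in the top byte), and the palindromic <code>ACGT</code> maps to itself.</p>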
<h2 id="canonical-form">Canonical form</h2> <h2 id="canonical-form-and-canonicalkmerof">Canonical form and <code>CanonicalKmerOf</code></h2>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">canonical</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">Self</span><span class="w"> </span><span class="p">{</span> <p><code>canonical()</code> returns a <code>CanonicalKmerOf&lt;L&gt;</code> — a distinct newtype that carries the same <code>u64</code> layout but enforces the invariant that the stored value equals <code>min(kmer, revcomp)</code>:</p>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">rc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">revcomp</span><span class="p">(</span><span class="n">k</span><span class="p">);</span> <div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">canonical</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">CanonicalKmerOf</span><span class="o">&lt;</span><span class="n">L</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">rc</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="o">*</span><span class="bp">self</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">rc</span><span class="w"> </span><span class="p">}</span> <span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">rc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">revcomp</span><span class="p">();</span>
<span class="w"> </span><span class="n">CanonicalKmerOf</span><span class="p">(</span><span class="k">if</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">rc</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">rc</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="n">PhantomData</span><span class="p">)</span>
<span class="p">}</span> <span class="p">}</span>
</code></pre></div> </code></pre></div>
<p>Lexicographic minimum of forward and reverse-complement, comparing the raw <code>u64</code> values directly (left-aligned encoding makes this equivalent to nucleotide-wise comparison). Zero allocation — result lives on the stack.</p> <p>Lexicographic minimum of forward and reverse-complement, comparing the raw <code>u64</code> values directly (left-aligned encoding makes this equivalent to nucleotide-wise comparison). Zero allocation — result lives on the stack.</p>
<p><code>CanonicalKmerOf::from_raw_unchecked(raw)</code> is the only other public constructor, for trusted paths such as deserialisation.</p>
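<p>The sketch below shows how the marker type and the two newtypes might fit together. Only <code>ConstLen&lt;N&gt;</code> is modelled; <code>from_raw</code> and <code>raw</code> are hypothetical helpers added so the example can run on its own.</p>

```rust
use std::marker::PhantomData;

pub trait KmerLength {
    fn len() -> usize;
}

/// Length fixed at compile time (the marker the text reserves for tests).
pub struct ConstLen<const N: usize>;

impl<const N: usize> KmerLength for ConstLen<N> {
    fn len() -> usize {
        N
    }
}

#[repr(transparent)]
pub struct KmerOf<L: KmerLength>(u64, PhantomData<L>);

/// Same layout, but the value is known to be min(kmer, revcomp).
#[repr(transparent)]
pub struct CanonicalKmerOf<L: KmerLength>(u64, PhantomData<L>);

impl<L: KmerLength> KmerOf<L> {
    /// Hypothetical constructor from an already packed word.
    pub fn from_raw(raw: u64) -> Self {
        KmerOf(raw, PhantomData)
    }

    pub fn revcomp(&self) -> Self {
        let x = (!self.0).swap_bytes();
        let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
        let x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
        KmerOf(x << (64 - 2 * L::len()), PhantomData)
    }

    pub fn canonical(&self) -> CanonicalKmerOf<L> {
        let rc = self.revcomp();
        CanonicalKmerOf(if self.0 <= rc.0 { self.0 } else { rc.0 }, PhantomData)
    }
}

impl<L: KmerLength> CanonicalKmerOf<L> {
    /// Hypothetical accessor for the packed value.
    pub fn raw(&self) -> u64 {
        self.0
    }
}
```

<p>Because <code>CanonicalKmerOf</code> is a separate type, a function that needs a canonical k-mer can demand it in its signature, and a plain <code>KmerOf</code> cannot be passed by mistake.</p>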
<h2 id="sliding-window-helpers">Sliding window helpers</h2>
<p><code>push_right(nuc)</code> / <code>push_left(nuc)</code> shift the window by one base in O(1). <code>is_overlapping(other)</code> checks whether the last k−1 nucleotides of <code>self</code> equal the first k−1 of <code>other</code>.</p>
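<p>On the left-aligned layout these helpers reduce to a shift, an OR and a mask. The sketch below works on raw words with an explicit <code>k</code> (assumed 2 ≤ k ≤ 32) and takes the nucleotide as a 2-bit code; the signatures are illustrative, not the crate's.</p>

```rust
/// Mask covering the top 2k bits (1 <= k <= 32).
fn mask(k: usize) -> u64 {
    !0u64 << (64 - 2 * k)
}

/// Drops the first base and appends `nuc` as the new last base.
pub fn push_right(raw: u64, nuc: u64, k: usize) -> u64 {
    // After the shift, the slot of base k-1 is zero and ready to receive `nuc`.
    (raw << 2) | ((nuc & 3) << (64 - 2 * k))
}

/// Drops the last base and prepends `nuc` as the new first base.
pub fn push_left(raw: u64, nuc: u64, k: usize) -> u64 {
    // The old last base slides into the padding area and is masked away.
    ((raw >> 2) | ((nuc & 3) << 62)) & mask(k)
}

/// True if the last k-1 bases of `a` equal the first k-1 bases of `b`.
pub fn is_overlapping(a: u64, b: u64, k: usize) -> bool {
    (a << 2) == (b & mask(k - 1))
}
```

<p>With k = 4 and the A=0, C=1, G=2, T=3 code, pushing <code>A</code> to the right of <code>ACGT</code> gives <code>CGTA</code>, and the two overlap on <code>CGT</code>.</p>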
<h2 id="hashing">Hashing</h2>
<p><code>hash_kmer(raw: u64) -&gt; u64</code> computes <code>mix64(raw ^ 0x9e3779b97f4a7c15)</code>, the seeded splitmix64 finalizer. <code>CanonicalKmerOf::seq_hash()</code> delegates to <code>hash_kmer</code>.</p>
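<p>For reference, a sketch of the hash, assuming <code>mix64</code> is the standard splitmix64 finalizer (two xor-shift-multiply rounds followed by a final xor-shift). The crate's own <code>mix64</code> is not shown here, so the constants below are those of the published splitmix64 algorithm rather than values taken from the source.</p>

```rust
/// splitmix64 finalizer (assumed to match the crate's `mix64`).
fn mix64(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    z ^ (z >> 31)
}

/// Seeded hash of a packed k-mer.
pub fn hash_kmer(raw: u64) -> u64 {
    mix64(raw ^ 0x9e37_79b9_7f4a_7c15)
}
```

<p>Every step is invertible (xor-shifts and multiplications by odd constants), so the function is a bijection on <code>u64</code>: two distinct packed k-mers never collide before any truncation to a smaller range.</p>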
File diff suppressed because it is too large Load Diff
+175 -20
@@ -757,6 +757,28 @@
</span> </span>
</a> </a>
</li>
<li class="md-nav__item">
<a href="#evidence-modes" class="md-nav__link">
<span class="md-ellipsis">
Evidence modes
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#build-functions" class="md-nav__link">
<span class="md-ellipsis">
Build functions
</span>
</a>
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
@@ -840,6 +862,34 @@
<li class="md-nav__item">
<a href="../evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="../obilayeredmap/" class="md-nav__link"> <a href="../obilayeredmap/" class="md-nav__link">
@@ -918,6 +968,62 @@
<li class="md-nav__item">
<a href="../merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
@@ -1165,6 +1271,28 @@
</span> </span>
</a> </a>
</li>
<li class="md-nav__item">
<a href="#evidence-modes" class="md-nav__link">
<span class="md-ellipsis">
Evidence modes
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#build-functions" class="md-nav__link">
<span class="md-ellipsis">
Build functions
</span>
</a>
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
@@ -1226,26 +1354,26 @@
<h2 id="why-two-phases-are-needed">Why two phases are needed</h2> <h2 id="why-two-phases-are-needed">Why two phases are needed</h2>
<p>Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of surviving unique kmers is not known until after counting and filtering low-abundance kmers.</p> <p>Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of surviving unique kmers is not known until after counting and filtering low-abundance kmers.</p>
<h3 id="phase-1-provisional-mphf-kmer-spectrum">Phase 1 — provisional MPHF + kmer spectrum</h3> <h3 id="phase-1-provisional-mphf-kmer-spectrum">Phase 1 — provisional MPHF + kmer spectrum</h3>
<p>Implemented in <code>obikpartitionner::KmerPartition::count_kmer()</code>.</p> <p>Implemented in <code>obikpartitionner::KmerPartition::count_kmer()</code><code>count_partition()</code>.</p>
<ol> <ol>
<li><strong>Pass 1</strong>: read the dereplicated superkmer file; enumerate all unique canonical kmers into a <code>HashSet</code>. Exact count known after this pass.</li> <li><strong>External sort</strong>: read the dereplicated superkmer file; extract the raw <code>u64</code> canonical kmer value for every kmer of every superkmer. Sort in RAM-bounded chunks (adaptive budget: 40% of available RAM ÷ n_threads, minimum 1 M kmers per chunk), then k-way merge with inline dedup. Result: <code>sorted_unique.bin</code> — a flat array of f0 distinct sorted <code>u64</code> values. Exact kmer count f0 is known at this point.</li>
<li><strong>Build a provisional MPHF</strong> (<code>GOFunction</code> from the <code>ph</code> crate) over the exact kmer set. Produces <code>mphf1.bin</code>.</li> <li><strong>Build provisional MPHF</strong> (ptr_hash, same configuration as phase 2) over <code>sorted_unique.bin</code> using <code>new_from_par_iter</code>. Delete <code>sorted_unique.bin</code> immediately after. Persist to <code>mphf1.bin</code>.</li>
<li><strong>Create <code>counts1.bin</code></strong>: one zero-initialised <code>u32</code> per MPHF slot (mmap'd).</li> <li><strong>Create <code>counts1.bin</code></strong>: <code>PersistentCompactIntVec</code> with f0 slots, zero-initialised.</li>
<li><strong>Pass 2</strong>: re-read the dereplicated file; for each kmer, query <code>mphf1.get(kmer)</code> and atomically accumulate the superkmer count into <code>counts1[slot]</code>.</li> <li><strong>Accumulation pass</strong>: re-read the dereplicated superkmer file; for each kmer in each superkmer, compute <code>slot = mphf.index(kmer.raw())</code> and increment <code>counts1[slot]</code> by the superkmer's COUNT.</li>
<li><strong>Build kmer frequency spectrum</strong> from <code>counts1</code>: histogram <code>{count → n_kmers}</code>, totals f0 (distinct kmers) and f1 (total abundance). Written to <code>kmer_spectrum_raw.json</code> per partition, then merged globally.</li> <li><strong>Build kmer frequency spectrum</strong> from <code>counts1</code>: histogram <code>{count → n_kmers}</code>, totals f0 (distinct kmers) and f1 (total abundance). Written to <code>kmer_spectrum_raw.json</code> per partition, then merged globally.</li>
</ol> </ol>
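<p>The "k-way merge with inline dedup" step of the external sort can be sketched as follows. This is a minimal in-memory illustration, assuming each chunk has already been sorted; the function name <code>kway_merge_dedup</code> and the use of <code>Vec&lt;u64&gt;</code> in place of on-disk chunk files are assumptions made for the example, not the crate's actual code.</p>

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge already-sorted chunks of raw kmer values into one sorted,
/// duplicate-free vector (an in-memory stand-in for `sorted_unique.bin`).
/// Hypothetical helper: the real pipeline streams chunks from disk.
pub fn kway_merge_dedup(chunks: &[Vec<u64>]) -> Vec<u64> {
    // Min-heap of (value, chunk index, position inside that chunk).
    let mut heap = BinaryHeap::new();
    for (c, chunk) in chunks.iter().enumerate() {
        if let Some(&v) = chunk.first() {
            heap.push(Reverse((v, c, 0usize)));
        }
    }

    let mut out: Vec<u64> = Vec::new();
    while let Some(Reverse((v, c, i))) = heap.pop() {
        // Inline dedup: equal values leave the heap consecutively,
        // so comparing with the last emitted value is sufficient.
        if out.last() != Some(&v) {
            out.push(v);
        }
        // Refill the heap from the chunk that was just consumed.
        if let Some(&next) = chunks[c].get(i + 1) {
            heap.push(Reverse((next, c, i + 1)));
        }
    }
    out
}
```

<p>The length of the returned vector plays the role of f0: the exact number of distinct kmers is known as soon as the merge finishes, before any MPHF is built.</p>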
<p>Files produced per partition:</p> <p>Files produced per partition:</p>
<div class="highlight"><pre><span></span><code>part_XXXXX/ <div class="highlight"><pre><span></span><code>part_XXXXX/
mphf1.bin — GOFunction (provisional MPHF, discarded after phase 2) mphf1.bin — ptr_hash provisional MPHF (discarded after phase 2)
counts1.bin — [u32; n_kmers] kmer counts, mmap&#39;d counts1.bin — PersistentCompactIntVec, f0 × u32 kmer counts
kmer_spectrum_raw.json — local frequency spectrum kmer_spectrum_raw.json — local frequency spectrum
</code></pre></div> </code></pre></div>
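<p>The spectrum step of phase 1 (histogram <code>{count → n_kmers}</code> plus the totals f0 and f1, then a global merge) could look roughly like the sketch below. The <code>Spectrum</code> struct, the function names and the plain <code>&amp;[u32]</code> slice standing in for <code>counts1.bin</code> are assumptions for illustration only.</p>

```rust
use std::collections::BTreeMap;

/// Frequency spectrum of one partition (hypothetical in-memory form).
pub struct Spectrum {
    /// count -> number of distinct kmers observed with that count
    pub histogram: BTreeMap<u32, u64>,
    /// number of distinct kmers
    pub f0: u64,
    /// total abundance (sum of all counts)
    pub f1: u64,
}

/// Build the spectrum from one count per MPHF slot.
pub fn spectrum(counts: &[u32]) -> Spectrum {
    let mut histogram = BTreeMap::new();
    let mut f1 = 0u64;
    for &c in counts {
        *histogram.entry(c).or_insert(0u64) += 1;
        f1 += u64::from(c);
    }
    Spectrum { histogram, f0: counts.len() as u64, f1 }
}

/// Fold a per-partition spectrum into a global one.
pub fn merge(into: &mut Spectrum, other: &Spectrum) {
    for (&count, &n_kmers) in &other.histogram {
        *into.histogram.entry(count).or_insert(0) += n_kmers;
    }
    into.f0 += other.f0;
    into.f1 += other.f1;
}
```

<p>Because partitions hold disjoint kmer sets, merging is a plain sum of histograms and totals; no kmer can be counted twice across partitions.</p>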
<h3 id="phase-2-definitive-mphf">Phase 2 — definitive MPHF</h3> <h3 id="phase-2-definitive-mphf">Phase 2 — definitive MPHF</h3>
<p>After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see <a href="../pipeline/">Construction pipeline</a>), the exact filtered kmer set is available via <code>unitigs.bin</code>.</p> <p>After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see <a href="../pipeline/">Construction pipeline</a>), the exact filtered kmer set is available via <code>unitigs.bin</code>.</p>
<p><code>MphfLayer::build</code> is called on the unitig file:</p> <p><code>MphfLayer::build(dir, block_bits, mode: &amp;IndexMode, fill_slot)</code> is called on the unitig directory:</p>
<ol> <ol>
<li><strong>Pass 1</strong>: iterate all canonical kmers from <code>unitigs.bin</code> in parallel, build and store <code>mphf.bin</code> (ptr_hash).</li> <li><strong>Pass 1</strong> (parallel): a <code>CanonicalKmerIter</code> — clonable via <code>Arc&lt;Mmap&gt;</code>, no file reopening — is passed directly to <code>new_from_par_iter</code> via <code>par_bridge()</code>. No <code>.idx</code> is read or created at this stage; parallelism is at partition/layer level, not within a single MPHF. Produces <code>mphf.bin</code>.</li>
<li><strong>Pass 2</strong>: iterate sequentially, fill <code>evidence.bin</code>, call the mode-specific <code>fill_slot</code> callback.</li> <li><strong>Pass 2</strong> (sequential): iterate with <code>iter_indexed_canonical_kmers</code>; fill evidence files; call <code>fill_slot(slot, kmer)</code> callback per kmer. For Exact/Hybrid, <code>.idx</code> is written at the end of this pass — never earlier.</li>
</ol> </ol>
<p><code>mphf1.bin</code> and <code>counts1.bin</code> are no longer needed after phase 2 and can be deleted.</p> <p><code>mphf1.bin</code> and <code>counts1.bin</code> are no longer needed after phase 2 and can be deleted.</p>
<hr /> <hr />
@@ -1265,13 +1393,11 @@
<p><strong>FMPH/FMPHGO</strong> (<code>ph</code> crate, Beling, ACM JEA 2023):</p> <p><strong>FMPH/FMPHGO</strong> (<code>ph</code> crate, Beling, ACM JEA 2023):</p>
<ul> <ul>
<li>~2.1 bits/key — most compact; good query speed; deterministic construction</li> <li>~2.1 bits/key — most compact; good query speed; deterministic construction</li>
<li>Works well from an exact or slightly overestimated count</li> <li><code>GOFunction</code> (group-oriented variant) was the original phase-1 choice; eliminated when the external sort made the exact count available at phase 1 as well</li>
<li><code>GOFunction</code> (group-oriented variant) is the specific type used</li>
</ul> </ul>
<h2 id="mphf-choice-per-phase">MPHF choice per phase</h2> <h2 id="mphf-choice-per-phase">MPHF choice per phase</h2>
<p><strong>Phase 1</strong> (provisional, discarded after spectrum computation): <code>ph::fmph::GOFunction</code>. Compact, fast to build from the exact post-dedup kmer set. Query speed is secondary — the structure is only used during pass 2 of <code>count_kmer</code>.</p> <p><strong>Both phases</strong>: <strong>ptr_hash</strong>, same type alias and construction parameters. The external sort (phase 1) and the unitig index (phase 2) both provide the exact key count before MPHF construction, so ptr_hash's requirement is satisfied in both cases. Using a single MPHF implementation removes the <code>ph</code> crate dependency.</p>
<p><strong>Phase 2</strong> (persistent, queried repeatedly): <strong>ptr_hash</strong>. Exact key count is available from the unitig index; ptr_hash query speed (≥2.1×) and construction speed (≥3.1× over FMPH) are the decisive factors. The 2.4 bits/key overhead is acceptable.</p> <p>boomphf: eliminated (largest space overhead; its streaming advantage is no longer needed). FMPH/GOFunction: eliminated (the exact count is available, and ptr_hash is faster at comparable compactness: 2.4 vs ~2.1 bits/key).</p>
<p>boomphf is eliminated: largest space overhead, streaming advantage does not apply.</p>
<hr /> <hr />
<h2 id="space-at-scale">Space at scale</h2> <h2 id="space-at-scale">Space at scale</h2>
<p>For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):</p> <p>For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):</p>
@@ -1320,9 +1446,12 @@
<h3 id="layer-structure">Layer structure</h3> <h3 id="layer-structure">Layer structure</h3>
<p>Each layer is a self-contained unit. See <a href="../obilayeredmap/">obilayeredmap</a> for the full on-disk layout. The MPHF-relevant files are:</p> <p>Each layer is a self-contained unit. See <a href="../obilayeredmap/">obilayeredmap</a> for the full on-disk layout. The MPHF-relevant files are:</p>
<div class="highlight"><pre><span></span><code>layer_i/ <div class="highlight"><pre><span></span><code>layer_i/
unitigs.bin — packed 2-bit nucleotide sequences (kmer evidence) unitigs.bin — packed 2-bit nucleotide sequences (kmer evidence source)
unitigs.bin.idx — random-access block index (block_bits controls granularity)
mphf.bin — ptr_hash phase-2 MPHF mphf.bin — ptr_hash phase-2 MPHF
evidence.bin — n × u32: (chunk_id: 25 bits | rank: 7 bits) per slot evidence.bin — n × (chunk_id: 25 bits | rank: 7 bits) per slot [exact mode]
fingerprint.bin — n × b-bit fingerprints per slot [approx mode]
[no layer_meta.json — mode stored once in partition-level meta.json]
</code></pre></div> </code></pre></div>
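<p>The exact-mode <code>evidence.bin</code> entry packs a 25-bit <code>chunk_id</code> and a 7-bit <code>rank</code> into one <code>u32</code>. A minimal sketch of the packing is shown below. The bit widths come from the layout above; placing <code>chunk_id</code> in the high bits and <code>rank</code> in the low bits is an assumption, as are the function names.</p>

```rust
/// Bit widths taken from the layout listing; the field order is assumed.
const RANK_BITS: u32 = 7;
const CHUNK_BITS: u32 = 25;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1;

/// Pack (chunk_id, rank) into one u32, or None if a field overflows.
pub fn pack(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id >= (1 << CHUNK_BITS) || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Recover (chunk_id, rank) from a packed entry.
pub fn unpack(entry: u32) -> (u32, u32) {
    (entry >> RANK_BITS, entry & RANK_MASK)
}
```

<p>With these widths a layer can address up to 2^25 chunks and up to 128 kmer positions per chunk, which is why the unitig file has to be split into chunks of bounded size.</p>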
<p>Layers are <strong>disjoint</strong>: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:</p> <p>Layers are <strong>disjoint</strong>: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:</p>
<ol> <ol>
@@ -1330,17 +1459,43 @@
<li>Collect kmers of B not present in any layer → set <code>B \ A</code>.</li> <li>Collect kmers of B not present in any layer → set <code>B \ A</code>.</li>
<li>Build layer 1 from <code>B \ A</code> (dereplicate → count → De Bruijn → unitigs → <code>MphfLayer::build</code>).</li> <li>Build layer 1 from <code>B \ A</code> (dereplicate → count → De Bruijn → unitigs → <code>MphfLayer::build</code>).</li>
</ol> </ol>
<h3 id="evidence-modes">Evidence modes</h3>
<p>Three evidence modes are supported via <code>IndexMode</code>, stored once in <code>PartitionMeta</code> at partition root. There is no <code>layer_meta.json</code>.</p>
<p><strong>Exact</strong> (<code>IndexMode::Exact</code>): <code>evidence.bin</code> stores one <code>(chunk_id, rank)</code> pair per MPHF slot. Verification reconstructs the kmer and compares to the query. Zero false positives. <code>.idx</code> required at query time.</p>
<p><strong>Approx</strong> (<code>IndexMode::Approx { b, z }</code>): <code>fingerprint.bin</code> stores a b-bit hash per slot. False-positive rate 1/2^b per query; Findere z-parameter reduces window FP to ≈ 1/2^(b·z). No <code>.idx</code> written or needed.</p>
<p><strong>Hybrid</strong> (<code>IndexMode::Hybrid { b, z }</code>): both <code>fingerprint.bin</code> and <code>evidence.bin</code> + <code>.idx</code>. <code>find()</code> uses the fingerprint (O(1)); <code>find_strict()</code> uses exact evidence (O(1)).</p>
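<p>The false-positive figures quoted for the Approx mode can be checked numerically. The sketch below simply evaluates 1/2^b and 1/2^(b·z); the function names are hypothetical, and the window formula treats the z fingerprint checks as independent, which is the approximation behind the "≈" above.</p>

```rust
/// Per-kmer false-positive rate of a b-bit fingerprint: 1 / 2^b.
pub fn kmer_fp_rate(b: u8) -> f64 {
    0.5f64.powi(i32::from(b))
}

/// Approximate window false-positive rate when z consecutive kmers must
/// all match, assuming independent checks: 1 / 2^(b * z).
pub fn window_fp_rate(b: u8, z: u8) -> f64 {
    kmer_fp_rate(b).powi(i32::from(z))
}
```

<p>For example, b = 8 gives a per-kmer rate of 1/256, and requiring z = 3 consecutive matches brings the window rate down to about 1/2^24.</p>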
<h3 id="build-functions">Build functions</h3>
<div class="highlight"><pre><span></span><code>MphfLayer::build(dir, block_bits, mode: &amp;IndexMode, fill_slot)
Pass 1: CanonicalKmerIter + par_bridge() → build mphf.bin (no .idx used)
Pass 2: sequential iter → fill evidence files + call fill_slot
.idx written last for Exact/Hybrid (query-time only)
MphfLayer::build_exact_evidence(dir, block_bits)
Post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
Uses open_sequential(); no .idx required on entry
MphfLayer::build_approx_evidence(dir, b, z)
Post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
Uses open_sequential(); never writes .idx
</code></pre></div>
<p>There is no <code>build_evidence</code> dispatch wrapper. Callers choose the appropriate post-hoc build directly.</p>
<p>In <code>obikpartitionner</code>, <code>build_index_layer</code> receives <code>block_bits: u8</code> from <code>IndexConfig::block_bits</code> and forwards it directly to <code>Layer::build</code> and <code>Layer::build_approx_evidence</code>.</p>
<h3 id="membership-verification">Membership verification</h3> <h3 id="membership-verification">Membership verification</h3>
<p>ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from <code>(chunk_id, rank)</code> and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.</p> <p>ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry:</p>
<ul>
<li><strong>Exact</strong>: decode <code>(chunk_id, rank)</code> from <code>evidence.bin</code>; reconstruct the kmer via <code>unitigs.verify_canonical_kmer</code>; compare to query.</li>
<li><strong>Approx</strong>: compare <code>kmer.seq_hash()</code> to the b-bit fingerprint stored at the slot.</li>
</ul>
<p>A mismatch in either mode means the kmer is absent from this layer; probe the next layer.</p>
<h3 id="query-algorithm">Query algorithm</h3> <h3 id="query-algorithm">Query algorithm</h3>
<div class="highlight"><pre><span></span><code>fn query(kmer) → Option&lt;(layer_index, slot)&gt;: <div class="highlight"><pre><span></span><code>fn query(kmer) → Option&lt;(layer_index, slot)&gt;:
for (i, layer) in layers.iter().enumerate(): for (i, layer) in layers.iter().enumerate():
slot = layer.mphf.index(kmer) slot = layer.mphf.index(kmer)
if layer.evidence.decode(slot) matches kmer: if layer.evidence.matches(slot, kmer): // exact or approx dispatch
return Some((i, slot)) return Some((i, slot))
return None return None
</code></pre></div> </code></pre></div>
<p>Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode.</p> <p><code>MphfLayer::find</code> dispatches on <code>LayerEvidence</code> in O(1); there are no panicking <code>find_exact</code>/<code>find_approx</code> methods. <code>find_strict</code> always performs an exact check: O(1) for Exact/Hybrid, an O(n) sequential scan for Approx. Expected probe depth is 1 for kmers in layer 0; each probe costs one ptr_hash lookup (~10 ns) plus one evidence check.</p>
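<p>The layered probing loop above can be made concrete with a toy model. In the sketch below a <code>HashMap</code> stands in for the combination "MPHF lookup followed by evidence check"; <code>ToyLayer</code> and its methods are hypothetical and only illustrate the control flow, not the real <code>MphfLayer</code> API.</p>

```rust
use std::collections::HashMap;

/// Toy layer: a HashMap replaces "ptr_hash lookup + evidence verification".
pub struct ToyLayer {
    slots: HashMap<u64, usize>,
}

impl ToyLayer {
    /// Assign slots in input order (a real MPHF chooses its own slots).
    pub fn new(kmers: &[u64]) -> Self {
        let slots = kmers.iter().enumerate().map(|(slot, &k)| (k, slot)).collect();
        ToyLayer { slots }
    }

    /// Some(slot) only when the kmer is really indexed in this layer.
    pub fn find(&self, kmer: u64) -> Option<usize> {
        self.slots.get(&kmer).copied()
    }
}

/// Probe layers in order; the first verified hit wins.
/// Returns (layer_index, slot), or None if no layer contains the kmer.
pub fn query(layers: &[ToyLayer], kmer: u64) -> Option<(usize, usize)> {
    layers
        .iter()
        .enumerate()
        .find_map(|(i, layer)| layer.find(kmer).map(|slot| (i, slot)))
}
```

<p>Since layers are disjoint, at most one layer can report a verified hit, so stopping at the first match loses nothing; kmers absent from every layer pay for one probe per layer.</p>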
<h3 id="merging-layers">Merging layers</h3> <h3 id="merging-layers">Merging layers</h3>
<p>Two layer chains can be merged by re-indexing their union through the full pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.</p> <p>Two layer chains can be merged by re-indexing their union through the full pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.</p>
File diff suppressed because it is too large
+557 -84
@@ -9,7 +9,7 @@
<link rel="prev" href="../unitig_evidence/"> <link rel="prev" href="../evidence_elimination/">
<link rel="next" href="../persistent_compact_int_vec/"> <link rel="next" href="../persistent_compact_int_vec/">
@@ -647,6 +647,34 @@
<li class="md-nav__item">
<a href="../evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
@@ -729,6 +757,17 @@
</span> </span>
</a> </a>
</li>
<li class="md-nav__item">
<a href="#index-mode-homogeneity-invariant" class="md-nav__link">
<span class="md-ellipsis">
Index mode (homogeneity invariant)
</span>
</a>
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
@@ -740,6 +779,34 @@
</span> </span>
</a> </a>
<nav class="md-nav" aria-label="MphfLayer — autonomous kmer → slot mapping">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#query-api" class="md-nav__link">
<span class="md-ellipsis">
Query API
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#build-surface" class="md-nav__link">
<span class="md-ellipsis">
Build surface
</span>
</a>
</li>
</ul>
</nav>
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
@@ -751,6 +818,73 @@
</span> </span>
</a> </a>
<nav class="md-nav" aria-label="Layer\&lt;D: LayerData&gt; — MPHF + payload">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#build-signatures" class="md-nav__link">
<span class="md-ellipsis">
Build signatures
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="#fingerprintvec-and-fingerprintvecwriter" class="md-nav__link">
<span class="md-ellipsis">
FingerprintVec and FingerprintVecWriter
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#layeredmapd-collection-of-layers" class="md-nav__link">
<span class="md-ellipsis">
LayeredMap&lt;D&gt; — collection of layers
</span>
</a>
<nav class="md-nav" aria-label="LayeredMap\&lt;D&gt; — collection of layers">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#common-methods" class="md-nav__link">
<span class="md-ellipsis">
Common methods
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#push_layer" class="md-nav__link">
<span class="md-ellipsis">
push_layer
</span>
</a>
</li>
</ul>
</nav>
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
@@ -776,10 +910,10 @@
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="#evidence-encoding" class="md-nav__link"> <a href="#evidence-encoding-exact" class="md-nav__link">
<span class="md-ellipsis"> <span class="md-ellipsis">
Evidence encoding Evidence encoding (exact)
</span> </span>
</a> </a>
@@ -798,14 +932,53 @@
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="#query-path" class="md-nav__link"> <a href="#column-append-and-merge-support" class="md-nav__link">
<span class="md-ellipsis"> <span class="md-ellipsis">
Query path Column append and merge support
</span> </span>
</a> </a>
<nav class="md-nav" aria-label="Column append and merge support">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#layer-level-genome-column-append" class="md-nav__link">
<span class="md-ellipsis">
Layer-level genome column append
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#presence-matrix-initialisation" class="md-nav__link">
<span class="md-ellipsis">
Presence matrix initialisation
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#why-the-mphf-is-never-rebuilt" class="md-nav__link">
<span class="md-ellipsis">
Why the MPHF is never rebuilt
</span>
</a>
</li>
</ul>
</nav>
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
@@ -895,6 +1068,62 @@
<li class="md-nav__item">
<a href="../merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
@@ -1058,6 +1287,17 @@
</span> </span>
</a> </a>
</li>
<li class="md-nav__item">
<a href="#index-mode-homogeneity-invariant" class="md-nav__link">
<span class="md-ellipsis">
Index mode (homogeneity invariant)
</span>
</a>
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
@@ -1069,6 +1309,34 @@
</span> </span>
</a> </a>
<nav class="md-nav" aria-label="MphfLayer — autonomous kmer → slot mapping">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#query-api" class="md-nav__link">
<span class="md-ellipsis">
Query API
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#build-surface" class="md-nav__link">
<span class="md-ellipsis">
Build surface
</span>
</a>
</li>
</ul>
</nav>
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
@@ -1080,6 +1348,73 @@
</span> </span>
</a> </a>
<nav class="md-nav" aria-label="Layer\&lt;D: LayerData&gt; — MPHF + payload">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#build-signatures" class="md-nav__link">
<span class="md-ellipsis">
Build signatures
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="#fingerprintvec-and-fingerprintvecwriter" class="md-nav__link">
<span class="md-ellipsis">
FingerprintVec and FingerprintVecWriter
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#layeredmapd-collection-of-layers" class="md-nav__link">
<span class="md-ellipsis">
LayeredMap&lt;D&gt; — collection of layers
</span>
</a>
<nav class="md-nav" aria-label="LayeredMap\&lt;D&gt; — collection of layers">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#common-methods" class="md-nav__link">
<span class="md-ellipsis">
Common methods
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#push_layer" class="md-nav__link">
<span class="md-ellipsis">
push_layer
</span>
</a>
</li>
</ul>
</nav>
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
@@ -1105,10 +1440,10 @@
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="#evidence-encoding" class="md-nav__link"> <a href="#evidence-encoding-exact" class="md-nav__link">
<span class="md-ellipsis"> <span class="md-ellipsis">
Evidence encoding Evidence encoding (exact)
</span> </span>
</a> </a>
@@ -1127,14 +1462,53 @@
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="#query-path" class="md-nav__link"> <a href="#column-append-and-merge-support" class="md-nav__link">
<span class="md-ellipsis"> <span class="md-ellipsis">
Query path Column append and merge support
</span> </span>
</a> </a>
<nav class="md-nav" aria-label="Column append and merge support">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#layer-level-genome-column-append" class="md-nav__link">
<span class="md-ellipsis">
Layer-level genome column append
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#presence-matrix-initialisation" class="md-nav__link">
<span class="md-ellipsis">
Presence matrix initialisation
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#why-the-mphf-is-never-rebuilt" class="md-nav__link">
<span class="md-ellipsis">
Why the MPHF is never rebuilt
</span>
</a>
</li>
</ul>
</nav>
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
@@ -1178,7 +1552,7 @@
<h1 id="obilayeredmap-layered-kmer-index-crate">obilayeredmap — layered kmer index crate</h1> <h1 id="obilayeredmap-layered-kmer-index-crate">obilayeredmap — layered kmer index crate</h1>
<h2 id="purpose">Purpose</h2> <h2 id="purpose">Purpose</h2>
<p><code>obilayeredmap</code> implements a persistent, incrementally extensible kmer index. The index is organised in three levels: <strong>index root → partition → layer</strong>. Each layer covers a disjoint kmer set and wraps a <code>ptr_hash</code> MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.</p> <p><code>obilayeredmap</code> implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a <code>ptr_hash</code> MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.</p>
<hr /> <hr />
<h2 id="three-usage-modes">Three usage modes</h2> <h2 id="three-usage-modes">Three usage modes</h2>
<p>The MPHF + evidence infrastructure is the same for all modes. The <strong>payload</strong> varies.</p> <p>The MPHF + evidence infrastructure is the same for all modes. The <strong>payload</strong> varies.</p>
@@ -1214,34 +1588,65 @@
</table> </table>
<p>Both <code>PersistentCompactIntMatrix</code> and <code>PersistentBitMatrix</code> come from the <code>obicompactvec</code> crate.</p> <p>Both <code>PersistentCompactIntMatrix</code> and <code>PersistentBitMatrix</code> come from the <code>obicompactvec</code> crate.</p>
<hr /> <hr />
<h2 id="index-mode-homogeneity-invariant">Index mode (homogeneity invariant)</h2>
<p>A partitioned index is homogeneous: every layer within a partition shares the same mode. The mode is determined once at <code>LayeredMap::open()</code> from <code>PartitionMeta.mode</code> and passed to each <code>Layer::open()</code> — no per-layer file is read.</p>
<div class="highlight"><pre><span></span><code><span class="cp">#[derive(Serialize, Deserialize, Default)]</span>
<span class="cp">#[serde(tag = </span><span class="s">&quot;type&quot;</span><span class="cp">, rename_all = </span><span class="s">&quot;snake_case&quot;</span><span class="cp">)]</span>
<span class="k">pub</span><span class="w"> </span><span class="k">enum</span><span class="w"> </span><span class="nc">IndexMode</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="cp">#[default]</span>
<span class="w"> </span><span class="n">Exact</span><span class="p">,</span>
<span class="w"> </span><span class="n">Approx</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">b</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="n">Hybrid</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">b</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="w"> </span><span class="p">},</span>
<span class="p">}</span>
</code></pre></div>
<p><code>IndexMode</code> is stored once in <code>PartitionMeta</code> (<code>meta.json</code> at partition root). There is no <code>layer_meta.json</code>.</p>
<ul>
<li><strong>Exact</strong>: writes <code>evidence.bin</code> + <code>unitigs.bin.idx</code>. Zero false positives.</li>
<li><strong>Approx</strong>: writes <code>fingerprint.bin</code> only. FP rate per kmer = 1/2^b; with Findere z-parameter, z consecutive kmers must all match → effective window FP ≈ 1/2^(b·z). No <code>.idx</code> written or required.</li>
<li><strong>Hybrid</strong>: writes both <code>fingerprint.bin</code> and <code>evidence.bin</code> + <code>.idx</code>. <code>find()</code> uses the fingerprint (fast, O(1)); <code>find_strict()</code> uses exact evidence.</li>
</ul>
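<p>The per-mode file rules listed above can be summarised as a small lookup. The sketch below mirrors the <code>IndexMode</code> enum locally, without its serde attributes, so that it compiles on its own; <code>evidence_files</code> and <code>fingerprint_bits</code> are hypothetical helpers, not functions of the crate.</p>

```rust
/// Local mirror of the enum shown above, without the serde attributes.
pub enum IndexMode {
    Exact,
    Approx { b: u8, z: u8 },
    Hybrid { b: u8, z: u8 },
}

/// Evidence files a layer is expected to contain for a given mode.
pub fn evidence_files(mode: &IndexMode) -> Vec<&'static str> {
    match mode {
        IndexMode::Exact => vec!["evidence.bin", "unitigs.bin.idx"],
        IndexMode::Approx { .. } => vec!["fingerprint.bin"],
        IndexMode::Hybrid { .. } => {
            vec!["fingerprint.bin", "evidence.bin", "unitigs.bin.idx"]
        }
    }
}

/// Fingerprint width in bits, if the mode stores fingerprints at all.
pub fn fingerprint_bits(mode: &IndexMode) -> Option<u8> {
    match mode {
        IndexMode::Exact => None,
        IndexMode::Approx { b, .. } | IndexMode::Hybrid { b, .. } => Some(*b),
    }
}
```

<p>Because the mode lives once in <code>PartitionMeta</code>, a check of this kind can be applied uniformly to every layer directory of a partition.</p>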
<hr />
<h2 id="mphflayer-autonomous-kmer-slot-mapping">MphfLayer — autonomous kmer → slot mapping</h2> <h2 id="mphflayer-autonomous-kmer-slot-mapping">MphfLayer — autonomous kmer → slot mapping</h2>
<p><code>MphfLayer</code> encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.</p> <p><code>MphfLayer</code> encapsulates the MPHF and evidence store for one layer. It is independent of any payload.</p>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">MphfLayer</span><span class="w"> </span><span class="p">{</span> <div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">MphfLayer</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">mphf</span><span class="p">:</span><span class="w"> </span><span class="nc">Mphf</span><span class="p">,</span> <span class="w"> </span><span class="n">mphf</span><span class="p">:</span><span class="w"> </span><span class="nc">Mphf</span><span class="p">,</span>
<span class="w"> </span><span class="n">evidence</span><span class="p">:</span><span class="w"> </span><span class="nc">Evidence</span><span class="p">,</span> <span class="w"> </span><span class="n">ev</span><span class="p">:</span><span class="w"> </span><span class="nc">LayerEvidence</span><span class="p">,</span><span class="w"> </span><span class="c1">// loaded at open() time</span>
<span class="w"> </span><span class="n">unitigs</span><span class="p">:</span><span class="w"> </span><span class="nc">UnitigFileReader</span><span class="p">,</span> <span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span>
<span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="c1">// number of indexed kmers = number of MPHF slots</span>
<span class="p">}</span> <span class="p">}</span>
</code></pre></div> </code></pre></div>
<p>Public API:</p> <p><code>LayerEvidence</code> is an internal enum, not public:</p>
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">MphfLayer</span><span class="w"> </span><span class="p">{</span> <div class="highlight"><pre><span></span><code><span class="k">enum</span><span class="w"> </span><span class="nc">LayerEvidence</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">open</span><span class="p">(</span><span class="n">dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="bp">Self</span><span class="o">&gt;</span> <span class="w"> </span><span class="n">Exact</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">evidence</span><span class="p">:</span><span class="w"> </span><span class="nc">Evidence</span><span class="p">,</span><span class="w"> </span><span class="n">unitigs</span><span class="p">:</span><span class="w"> </span><span class="nc">UnitigFileReader</span><span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">find</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nb">Option</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span><span class="w"> </span><span class="c1">// Some(slot) or None</span> <span class="w"> </span><span class="n">Approx</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">fingerprint</span><span class="p">:</span><span class="w"> </span><span class="nc">FingerprintVec</span><span class="p">,</span><span class="w"> </span><span class="n">unitigs_path</span><span class="p">:</span><span class="w"> </span><span class="nc">PathBuf</span><span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">n</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">usize</span> <span class="w"> </span><span class="n">Hybrid</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">evidence</span><span class="p">:</span><span class="w"> </span><span class="nc">Evidence</span><span class="p">,</span><span class="w"> </span><span class="n">unitigs</span><span class="p">:</span><span class="w"> </span><span class="nc">UnitigFileReader</span><span class="p">,</span><span class="w"> </span><span class="n">fingerprint</span><span class="p">:</span><span class="w"> </span><span class="nc">FingerprintVec</span><span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">unitig_writer</span><span class="p">(</span><span class="n">dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="n">UnitigFileWriter</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">pub</span><span class="p">(</span><span class="k">crate</span><span class="p">)</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span>
<span class="w"> </span><span class="n">dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span>
<span class="w"> </span><span class="n">fill_slot</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">mut</span><span class="w"> </span><span class="k">impl</span><span class="w"> </span><span class="nb">FnMut</span><span class="p">(</span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="p">()</span><span class="o">&gt;</span><span class="p">,</span>
<span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="p">}</span> <span class="p">}</span>
</code></pre></div> </code></pre></div>
<p><code>find</code> returns <code>Some(slot)</code> only after verifying via evidence that the kmer is actually indexed. It returns <code>None</code> for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test).</p> <p><code>MphfLayer::open(dir, mode: &amp;IndexMode)</code> receives the mode from <code>PartitionMeta</code> — no per-layer file is read.</p>
<p><code>build</code> runs two sequential passes over <code>unitigs.bin</code>:</p> <h3 id="query-api">Query API</h3>
<p>Two public query methods, both returning <code>Option&lt;usize&gt;</code> (slot index):</p>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">find</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nb">Option</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">find_strict</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nb">Option</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
</code></pre></div>
<ul>
<li><code>find</code>: O(1) auto-dispatch. Exact/Hybrid → exact evidence check. Approx/Hybrid → fingerprint comparison.</li>
<li><code>find_strict</code>: always exact. Exact/Hybrid → O(1) evidence check. Approx → O(n) sequential scan (no <code>.idx</code>).</li>
</ul>
<p>There are no <code>find_exact</code>/<code>find_approx</code> methods; panicking dispatch is eliminated.</p>
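<p>The dispatch rules above can be sketched with a toy in-memory layer. Everything here is an assumption made for illustration: <code>ToyMode</code>, <code>ToyLayer</code>, <code>toy_fingerprint</code>, the modulo "MPHF", and the use of <code>u64</code> in place of <code>CanonicalKmer</code> are hypothetical, and Hybrid is left out of the sketch. Only the behaviour of <code>find</code> and <code>find_strict</code> follows the text.</p>

```rust
/// Hypothetical stand-in for the index mode (Hybrid omitted in this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ToyMode {
    Exact,
    Approx,
}

/// Toy layer: `keys[slot]` plays the role of exact evidence (and of the
/// sequence data scanned by `find_strict`), `fingerprints[slot]` the role
/// of approximate evidence.
pub struct ToyLayer {
    mode: ToyMode,
    keys: Vec<u64>,
    fingerprints: Vec<u8>,
}

/// Toy 4-bit fingerprint: bits 4..8 of the key (illustrative only).
fn toy_fingerprint(kmer: u64) -> u8 {
    ((kmer >> 4) & 0xF) as u8
}

impl ToyLayer {
    /// The caller must supply keys whose `key % n` values are all distinct,
    /// so that the modulo acts as a perfect hash on the indexed set.
    pub fn new(mode: ToyMode, indexed: &[u64]) -> Self {
        let n = indexed.len();
        let mut keys = vec![0u64; n];
        let mut fingerprints = vec![0u8; n];
        for &k in indexed {
            let slot = (k % n as u64) as usize;
            keys[slot] = k;
            fingerprints[slot] = toy_fingerprint(k);
        }
        ToyLayer { mode, keys, fingerprints }
    }

    /// Like a real MPHF, this maps any input (indexed or not) to a valid slot.
    fn slot_of(&self, kmer: u64) -> usize {
        (kmer % self.keys.len() as u64) as usize
    }

    /// O(1) auto-dispatch: exact check or fingerprint comparison.
    pub fn find(&self, kmer: u64) -> Option<usize> {
        if self.keys.is_empty() {
            return None;
        }
        let slot = self.slot_of(kmer);
        let hit = match self.mode {
            ToyMode::Exact => self.keys[slot] == kmer,
            ToyMode::Approx => self.fingerprints[slot] == toy_fingerprint(kmer),
        };
        hit.then_some(slot)
    }

    /// Always exact: O(1) in Exact mode, O(n) sequential scan in Approx mode.
    pub fn find_strict(&self, kmer: u64) -> Option<usize> {
        match self.mode {
            ToyMode::Exact => self.find(kmer),
            ToyMode::Approx => self.keys.iter().position(|&k| k == kmer),
        }
    }
}
```

<p>With keys 10, 11 and 12 indexed in approximate mode, the absent key 13 lands on the slot of key 10 and shares its toy fingerprint, so <code>find</code> reports a hit while <code>find_strict</code> does not. This is the false-positive case that the strict variant exists to rule out.</p>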
<h3 id="build-surface">Build surface</h3>
<div class="highlight"><pre><span></span><code><span class="c1">// Full MPHF + evidence build (two-pass)</span>
<span class="k">pub</span><span class="p">(</span><span class="k">crate</span><span class="p">)</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">dir</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">IndexMode</span><span class="p">,</span><span class="w"> </span><span class="n">fill_slot</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="c1">// Evidence-only post-hoc builds (MPHF already present)</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_exact_evidence</span><span class="p">(</span><span class="n">dir</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_approx_evidence</span><span class="p">(</span><span class="n">dir</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
</code></pre></div>
<p><code>MphfLayer::build</code> runs two passes over <code>unitigs.bin</code>:</p>
<ol> <ol>
<li><strong>Pass 1</strong>: iterate all canonical kmers in parallel via rayon, construct and store <code>mphf.bin</code>. <code>new_from_par_iter</code> avoids materialising a full key <code>Vec</code>.</li> <li><strong>Pass 1</strong> (parallel via rayon): a <code>CanonicalKmerIter</code> (clonable, <code>Arc&lt;Mmap&gt;</code>, no file reopening) is passed to <code>new_from_par_iter</code> via <code>par_bridge()</code>. Produces <code>mphf.bin</code>. No <code>.idx</code> is read or created at this stage.</li>
<li><strong>Pass 2</strong>: iterate again sequentially, fill <code>evidence.bin</code>, call <code>fill_slot(slot, kmer)</code> once per kmer for payload population. A compact <code>n/8</code>-byte seen-bitset verifies MPHF injectivity inline.</li> <li><strong>Pass 2</strong> (sequential): fill evidence files; call <code>fill_slot(slot, kmer)</code> per kmer. <code>.idx</code> is written last for Exact/Hybrid modes (query-time only).</li>
</ol> </ol>
<p>For empty layers (n = 0), <code>build</code> returns <code>Ok(0)</code> immediately after creating empty <code>mphf.bin</code> and <code>evidence.bin</code>.</p> <p>There is no <code>build_evidence</code> dispatch wrapper — callers invoke <code>build_exact_evidence</code> or <code>build_approx_evidence</code> directly.</p>
<p>For empty layers (n = 0), all build variants return <code>Ok(0)</code> immediately after creating empty output files.</p>
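<p>A minimal sketch of the second pass, assuming the keys fit in memory and that pass 1 has already produced a slot function. The names <code>fill_pass</code> and <code>slot_of</code> are hypothetical, <code>u64</code> stands in for <code>CanonicalKmer</code>, and a <code>String</code> error stands in for <code>OLMResult</code>. It shows the <code>fill_slot(slot, kmer)</code> callback, the early return for empty input, and an <code>n/8</code>-byte seen-bitset that rejects a slot assigned twice.</p>

```rust
/// Sequential pass over the keys: check injectivity with a bitset, then hand
/// each (slot, key) pair to the caller-supplied `fill_slot` callback.
/// Returns the number of keys processed.
pub fn fill_pass(
    keys: &[u64],
    slot_of: impl Fn(u64) -> usize,
    mut fill_slot: impl FnMut(usize, u64),
) -> Result<usize, String> {
    let n = keys.len();
    if n == 0 {
        // Empty layer: nothing to fill.
        return Ok(0);
    }
    // One bit per slot, rounded up to whole bytes.
    let mut seen = vec![0u8; (n + 7) / 8];
    for &k in keys {
        let slot = slot_of(k);
        if slot >= n {
            return Err(format!("slot {} out of range for key {}", slot, k));
        }
        let (byte, bit) = (slot / 8, slot % 8);
        if seen[byte] & (1u8 << bit) != 0 {
            return Err(format!("slot {} assigned twice", slot));
        }
        seen[byte] |= 1u8 << bit;
        fill_slot(slot, k);
    }
    Ok(n)
}
```

<p>The callback keeps the payload type out of the pass: a counting layer could write a count into its matrix, while a set-only layer could pass a no-op closure.</p>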
<hr /> <hr />
<h2 id="layerd-layerdata-mphf-payload">Layer\&lt;D: LayerData&gt; — MPHF + payload</h2> <h2 id="layerd-layerdata-mphf-payload">Layer\&lt;D: LayerData&gt; — MPHF + payload</h2>
<p><code>Layer&lt;D&gt;</code> pairs an <code>MphfLayer</code> with one payload store.</p> <p><code>Layer&lt;D&gt;</code> pairs an <code>MphfLayer</code> with one payload store.</p>
@@ -1261,7 +1666,7 @@
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">data</span><span class="p">:</span><span class="w"> </span><span class="nc">T</span><span class="p">,</span> <span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">data</span><span class="p">:</span><span class="w"> </span><span class="nc">T</span><span class="p">,</span>
<span class="p">}</span> <span class="p">}</span>
</code></pre></div> </code></pre></div>
<p><code>LayerData</code> covers the <strong>read path only</strong> (<code>open</code> + <code>read</code>). Build signatures differ between modes and are not in the trait.</p> <p><code>LayerData</code> covers the <strong>read path only</strong> (<code>open</code> + <code>read</code>). Build signatures differ between modes and are not part of the trait.</p>
<table> <table>
<thead> <thead>
<tr> <tr>
@@ -1288,28 +1693,89 @@
</tr> </tr>
</tbody> </tbody>
</table> </table>
<p><strong>Build signatures:</strong></p> <h3 id="build-signatures">Build signatures</h3>
<div class="highlight"><pre><span></span><code><span class="c1">// mode 1</span> <div class="highlight"><pre><span></span><code><span class="c1">// mode 1</span>
<span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o">&lt;</span><span class="p">()</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span> <span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o">&lt;</span><span class="p">()</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span> <span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">IndexMode</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="p">}</span> <span class="p">}</span>
<span class="c1">// mode 2</span> <span class="c1">// mode 2</span>
<span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o">&lt;</span><span class="n">PersistentCompactIntMatrix</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span> <span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o">&lt;</span><span class="n">PersistentCompactIntMatrix</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">count_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span> <span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">IndexMode</span><span class="p">,</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_from_map</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">counts</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">HashMap</span><span class="o">&lt;</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">u32</span><span class="o">&gt;</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span> <span class="w"> </span><span class="n">count_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_from_map</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">IndexMode</span><span class="p">,</span>
<span class="w"> </span><span class="n">counts</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">HashMap</span><span class="o">&lt;</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">u32</span><span class="o">&gt;</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="p">}</span> <span class="p">}</span>
<span class="c1">// mode 3</span> <span class="c1">// mode 3</span>
<span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o">&lt;</span><span class="n">PersistentBitMatrix</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span> <span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o">&lt;</span><span class="n">PersistentBitMatrix</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_presence</span><span class="p">(</span> <span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_presence</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">IndexMode</span><span class="p">,</span>
<span class="w"> </span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span> <span class="w"> </span><span class="n">n_genomes</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span>
<span class="w"> </span><span class="n">n_genomes</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span> <span class="w"> </span><span class="n">present_in</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">bool</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="w"> </span><span class="n">present_in</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">bool</span><span class="p">,</span>
<span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="p">}</span> <span class="p">}</span>
</code></pre></div> </code></pre></div>
<p>All build impls delegate MPHF + evidence construction to <code>MphfLayer::build</code> via a mode-specific <code>fill_slot</code> callback. Mode 2 pre-reads <code>n_kmers</code> from <code>unitigs.bin</code> to size the <code>PersistentCompactIntMatrixBuilder</code> before calling <code>MphfLayer::build</code>. Mode 3 does the same for <code>PersistentBitMatrixBuilder</code>.</p> <p>All build impls delegate to <code>MphfLayer::build</code> via a mode-specific <code>fill_slot</code> callback. The <code>mode</code> parameter is forwarded directly — no <code>LayerMeta</code> is written.</p>
<p>Evidence-only post-hoc builds are accessible directly on <code>Layer&lt;D&gt;</code>:</p>
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="o">&lt;</span><span class="n">D</span><span class="p">:</span><span class="w"> </span><span class="nc">LayerData</span><span class="o">&gt;</span><span class="w"> </span><span class="n">Layer</span><span class="o">&lt;</span><span class="n">D</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_exact_evidence</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_approx_evidence</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="p">}</span>
</code></pre></div>
<p>There is no <code>build_evidence</code> dispatch wrapper.</p>
<hr />
<h2 id="fingerprintvec-and-fingerprintvecwriter">FingerprintVec and FingerprintVecWriter</h2>
<p>Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot.</p>
<div class="highlight"><pre><span></span><code>fingerprint.bin format:
magic: b&quot;FPVF&quot; (4 bytes)
b: u8 (bits per fingerprint, 1..=64)
padding: [0u8; 3]
n: u64 LE (number of slots)
data: packed bits, ceil(n*b/8) bytes, Lsb0 order
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">FingerprintVec</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">open</span><span class="p">(</span><span class="n">path</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="bp">Self</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">get</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">slot</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">u64</span>
<span class="w"> </span><span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">matches</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">slot</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">fingerprint</span><span class="p">:</span><span class="w"> </span><span class="kt">u64</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">bool</span>
<span class="w"> </span><span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">n</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">usize</span>
<span class="w"> </span><span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">b</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">u8</span>
<span class="p">}</span>
</code></pre></div>
<p><code>matches(slot, hash)</code> extracts the b-bit fingerprint stored at <code>slot</code> and compares it to the low b bits of <code>hash</code>. It is the core operation of the fingerprint comparison that <code>find</code> performs in approximate mode.</p>
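<p>The layout and the <code>get</code>/<code>matches</code> operations can be sketched over an in-memory byte buffer. This is an illustration under stated assumptions, not the crate's implementation: <code>ToyFingerprintVec</code>, <code>encode</code>, and <code>parse</code> are hypothetical names, errors are plain strings, and the real type presumably reads from a file rather than a <code>Vec&lt;u8&gt;</code>. The header fields and the Lsb0 bit order follow the format listed above.</p>

```rust
/// Hypothetical in-memory reader for the fingerprint.bin layout.
pub struct ToyFingerprintVec {
    b: u8,
    n: usize,
    data: Vec<u8>,
}

/// Serialise fingerprints: 16-byte header, then b bits per slot in Lsb0 order.
pub fn encode(b: u8, fingerprints: &[u64]) -> Vec<u8> {
    let n = fingerprints.len();
    let mut out = Vec::new();
    out.extend_from_slice(b"FPVF");                    // magic
    out.push(b);                                       // bits per fingerprint
    out.extend_from_slice(&[0u8; 3]);                  // padding
    out.extend_from_slice(&(n as u64).to_le_bytes());  // number of slots
    let mut data = vec![0u8; (n * b as usize + 7) / 8];
    for (slot, &fp) in fingerprints.iter().enumerate() {
        for i in 0..b as usize {
            if (fp >> i) & 1 == 1 {
                let bit = slot * b as usize + i;
                data[bit / 8] |= 1u8 << (bit % 8);
            }
        }
    }
    out.extend_from_slice(&data);
    out
}

impl ToyFingerprintVec {
    /// Validate the header and copy the packed payload.
    pub fn parse(bytes: &[u8]) -> Result<Self, String> {
        if bytes.len() < 16 || !bytes.starts_with(b"FPVF") {
            return Err("bad header".to_string());
        }
        let b = bytes[4];
        if b == 0 || b > 64 {
            return Err("b must be in 1..=64".to_string());
        }
        let mut n_bytes = [0u8; 8];
        n_bytes.copy_from_slice(&bytes[8..16]);
        let n = u64::from_le_bytes(n_bytes) as usize;
        let data = bytes[16..].to_vec();
        if data.len() < (n * b as usize + 7) / 8 {
            return Err("truncated data".to_string());
        }
        Ok(ToyFingerprintVec { b, n, data })
    }

    /// Read the b-bit value stored at `slot`, one bit at a time.
    pub fn get(&self, slot: usize) -> u64 {
        assert!(slot < self.n, "slot out of range");
        let start = slot * self.b as usize;
        let mut value = 0u64;
        for i in 0..self.b as usize {
            let bit = start + i;
            let set = (self.data[bit / 8] >> (bit % 8)) & 1;
            value |= (set as u64) << i;
        }
        value
    }

    /// Compare the stored fingerprint with the low b bits of `hash`.
    pub fn matches(&self, slot: usize, hash: u64) -> bool {
        let mask = if self.b == 64 { u64::MAX } else { (1u64 << self.b) - 1 };
        self.get(slot) == (hash & mask)
    }
}
```

<p>The bit-by-bit loop favours clarity over speed; a production reader would more likely load one or two machine words per lookup and shift and mask them.</p>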
<hr />
<h2 id="layeredmapd-collection-of-layers">LayeredMap\&lt;D&gt; — collection of layers</h2>
<p><code>LayeredMap&lt;D&gt;</code> wraps <code>Vec&lt;Layer&lt;D&gt;&gt;</code> for a single partition directory.</p>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">LayeredMap</span><span class="o">&lt;</span><span class="n">D</span><span class="p">:</span><span class="w"> </span><span class="nc">LayerData</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">()</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">root</span><span class="p">:</span><span class="w"> </span><span class="nc">PathBuf</span><span class="p">,</span>
<span class="w"> </span><span class="n">meta</span><span class="p">:</span><span class="w"> </span><span class="nc">PartitionMeta</span><span class="p">,</span>
<span class="w"> </span><span class="n">layers</span><span class="p">:</span><span class="w"> </span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Layer</span><span class="o">&lt;</span><span class="n">D</span><span class="o">&gt;&gt;</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div>
<p><code>PartitionMeta</code> (<code>meta.json</code> at the partition root) stores <code>n_layers</code> and the index <code>mode</code>.</p>
<h3 id="common-methods">Common methods</h3>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">open</span><span class="p">(</span><span class="n">root</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="bp">Self</span><span class="o">&gt;</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">create</span><span class="p">(</span><span class="n">root</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="nc">IndexMode</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="bp">Self</span><span class="o">&gt;</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">n_layers</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">usize</span>
<span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">layer</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Layer</span><span class="o">&lt;</span><span class="n">D</span><span class="o">&gt;</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">mode</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">IndexMode</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">query</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nb">Option</span><span class="o">&lt;</span><span class="p">(</span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">Hit</span><span class="o">&lt;</span><span class="n">D</span><span class="p">::</span><span class="n">Item</span><span class="o">&gt;</span><span class="p">)</span><span class="o">&gt;</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">next_layer_writer</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="n">UnitigFileWriter</span><span class="o">&gt;</span>
</code></pre></div>
<p><code>open</code> reads <code>PartitionMeta</code> once, extracts <code>mode</code>, and passes it to every <code>Layer::open</code> — no per-layer file is read. <code>create</code> stores the given mode in <code>PartitionMeta</code>.</p>
<p><code>query</code> probes layers in order and returns <code>(layer_index, Hit)</code> on the first match. Expected probe depth: 1 for kmers in layer 0.</p>
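<p>The first-match probing order can be sketched as follows. This is a toy model under assumptions: <code>ToyLayeredMap</code> and <code>from_pairs</code> are hypothetical, each layer is a plain <code>HashMap</code> rather than an MPHF with evidence, and a <code>u32</code> value stands in for <code>Hit&lt;D::Item&gt;</code>.</p>

```rust
use std::collections::HashMap;

/// Toy layered map: each layer is an ordinary hash map from key to payload.
pub struct ToyLayeredMap {
    layers: Vec<HashMap<u64, u32>>,
}

impl ToyLayeredMap {
    /// Build one layer per inner vector of (key, payload) pairs.
    pub fn from_pairs(layers: Vec<Vec<(u64, u32)>>) -> Self {
        let layers = layers
            .into_iter()
            .map(|pairs| pairs.into_iter().collect())
            .collect();
        ToyLayeredMap { layers }
    }

    /// Probe layers in order and stop at the first one that holds the key.
    pub fn query(&self, kmer: u64) -> Option<(usize, u32)> {
        self.layers
            .iter()
            .enumerate()
            .find_map(|(i, layer)| layer.get(&kmer).map(|&v| (i, v)))
    }
}
```

<p>In this toy model a key present in several layers is reported from the lowest-numbered one, and a miss costs one probe per layer.</p>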
<h3 id="push_layer">push_layer</h3>
<p><code>push_layer</code> builds the next layer from a <code>unitigs.bin</code> already written via <code>next_layer_writer</code>, using <code>DEFAULT_BLOCK_BITS</code>:</p>
<div class="highlight"><pre><span></span><code><span class="c1">// mode 1</span>
<span class="k">impl</span><span class="w"> </span><span class="n">LayeredMap</span><span class="o">&lt;</span><span class="p">()</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">push_layer</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="p">}</span>
<span class="c1">// mode 2</span>
<span class="k">impl</span><span class="w"> </span><span class="n">LayeredMap</span><span class="o">&lt;</span><span class="n">PersistentCompactIntMatrix</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">push_layer</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">count_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">push_layer_from_map</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">counts</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">HashMap</span><span class="o">&lt;</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">u32</span><span class="o">&gt;</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span>
<span class="p">}</span>
</code></pre></div>
<p>Mode 3 (<code>PersistentBitMatrix</code>) has no <code>push_layer</code> on <code>LayeredMap</code>; callers build directly via <code>Layer&lt;PersistentBitMatrix&gt;::build_presence</code>.</p>
<hr /> <hr />
<h2 id="layeredstores-and-aggregation-traits">LayeredStore\&lt;S&gt; and aggregation traits</h2> <h2 id="layeredstores-and-aggregation-traits">LayeredStore\&lt;S&gt; and aggregation traits</h2>
<p><code>LayeredStore&lt;S&gt;</code> is a generic aggregation wrapper over <code>Vec&lt;S&gt;</code>. It propagates three traits from <code>obicompactvec::traits</code> up the hierarchy via blanket impls:</p> <p><code>LayeredStore&lt;S&gt;</code> is a generic aggregation wrapper over <code>Vec&lt;S&gt;</code>. It propagates three traits from <code>obicompactvec::traits</code> up the hierarchy via blanket impls:</p>
@@ -1320,11 +1786,6 @@
<span class="k">impl</span><span class="o">&lt;</span><span class="n">S</span><span class="p">:</span><span class="w"> </span><span class="nc">BitPartials</span><span class="o">&gt;</span><span class="w"> </span><span class="n">BitPartials</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">LayeredStore</span><span class="o">&lt;</span><span class="n">S</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="err">…</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="c1">// element-wise Σ partials</span> <span class="k">impl</span><span class="o">&lt;</span><span class="n">S</span><span class="p">:</span><span class="w"> </span><span class="nc">BitPartials</span><span class="o">&gt;</span><span class="w"> </span><span class="n">BitPartials</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">LayeredStore</span><span class="o">&lt;</span><span class="n">S</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="err">…</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="c1">// element-wise Σ partials</span>
</code></pre></div> </code></pre></div>
<p>Because blanket impls compose, <code>LayeredStore&lt;LayeredStore&lt;S&gt;&gt;</code> automatically inherits all three traits when <code>S</code> does — providing the partitioned level without a separate type.</p> <p>Because blanket impls compose, <code>LayeredStore&lt;LayeredStore&lt;S&gt;&gt;</code> automatically inherits all three traits when <code>S</code> does — providing the partitioned level without a separate type.</p>
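<p>The composition described above can be sketched in a few lines. This is an illustrative toy, not the crate's code: the trait method <code>partial_sums</code>, the <code>Leaf</code> type, and the tuple-struct form of <code>LayeredStore</code> are assumptions made for the example, since the real signatures in <code>obicompactvec::traits</code> are not shown here.</p>

```rust
/// Hypothetical stand-in for one aggregation trait. The method name and
/// return type are assumptions, not the real `obicompactvec::traits` API.
trait CountPartials {
    /// Per-column partial sums over all rows held by this store.
    fn partial_sums(&self) -> Vec<u64>;
}

/// Toy leaf store: a small row-major count matrix held in memory.
struct Leaf {
    rows: Vec<Vec<u64>>,
    n_cols: usize,
}

impl CountPartials for Leaf {
    fn partial_sums(&self) -> Vec<u64> {
        let mut sums = vec![0u64; self.n_cols];
        for row in &self.rows {
            for (s, v) in sums.iter_mut().zip(row) {
                *s += *v;
            }
        }
        sums
    }
}

/// Generic aggregation wrapper over `Vec<S>` (simplified to a tuple struct).
struct LayeredStore<S>(Vec<S>);

/// Blanket impl: any `LayeredStore<S>` aggregates its children element-wise.
/// Because `LayeredStore<S>` itself satisfies the bound, the nested type
/// `LayeredStore<LayeredStore<S>>` gets the same impl with no extra code.
impl<S: CountPartials> CountPartials for LayeredStore<S> {
    fn partial_sums(&self) -> Vec<u64> {
        let mut acc: Vec<u64> = Vec::new();
        for child in &self.0 {
            let part = child.partial_sums();
            if acc.len() < part.len() {
                acc.resize(part.len(), 0);
            }
            for (a, p) in acc.iter_mut().zip(part) {
                *a += p;
            }
        }
        acc
    }
}
```

<p>In this sketch, a <code>LayeredStore&lt;Leaf&gt;</code> plays the role of one partition and a <code>LayeredStore&lt;LayeredStore&lt;Leaf&gt;&gt;</code> the role of a partitioned index; calling <code>partial_sums()</code> on either goes through the same blanket impl.</p>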
<p><strong>Aggregation hierarchy:</strong></p>
<div class="highlight"><pre><span></span><code>PersistentCompactIntMatrix implements CountPartials
LayeredStore&lt;PersistentCompactIntMatrix&gt; via blanket impl (one partition)
LayeredStore&lt;LayeredStore&lt;…&gt;&gt; via blanket impl (partitioned index)
</code></pre></div>
<p><strong>Leaf implementors</strong> (in <code>obicompactvec</code>):</p> <p><strong>Leaf implementors</strong> (in <code>obicompactvec</code>):</p>
<table> <table>
<thead> <thead>
@@ -1344,69 +1805,77 @@ LayeredStore&lt;LayeredStore&lt;…&gt;&gt; via blanket impl
</tr> </tr>
</tbody> </tbody>
</table> </table>
<p><code>PersistentCompactIntVec</code> and <code>PersistentBitVec</code> do not implement these traits — they are single-column primitives, not matrix-level aggregators.</p>
<p>See <a href="../../architecture/index_architecture/">Kmer index architecture</a> for the full trait API and the two-pass normalised-metric pattern.</p> <p>See <a href="../../architecture/index_architecture/">Kmer index architecture</a> for the full trait API and the two-pass normalised-metric pattern.</p>
<hr /> <hr />
<h2 id="on-disk-structure">On-disk structure</h2> <h2 id="on-disk-structure">On-disk structure</h2>
<div class="highlight"><pre><span></span><code>index_root/ ← LayeredMap (collection) <div class="highlight"><pre><span></span><code>partition_root/ ← LayeredMap (one partition)
meta.json meta.json — {&quot;n_layers&quot;: N, &quot;mode&quot;: {&quot;type&quot;: &quot;exact&quot;|&quot;approx&quot;|&quot;hybrid&quot;, ...}}
part_00000/ ← Partition layer_0/ ← Layer
layer_0/ ← Layer mphf.bin — ptr_hash MPHF (epserde format)
mphf.bin — ptr_hash MPHF (epserde format) unitigs.bin — packed 2-bit nucleotide sequences
unitigs.bin — packed 2-bit nucleotide sequences unitigs.bin.idx — UIDX index (Exact/Hybrid only; query-time, never built during MPHF construction)
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[] evidence.bin — [u32; n], LE (Exact/Hybrid only)
evidence.bin — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE fingerprint.bin — packed b-bit array (Approx/Hybrid only)
counts/ [mode 2] PersistentCompactIntMatrix counts/ [mode 2] PersistentCompactIntMatrix
meta.json {&quot;n&quot;: N, &quot;n_cols&quot;: 1} meta.json
col_000000.pciv col_000000.pciv
presence/ [mode 3] PersistentBitMatrix presence/ [mode 3] PersistentBitMatrix
meta.json {&quot;n&quot;: N, &quot;n_cols&quot;: G} meta.json
col_000000.pbiv col_000000.pbiv
layer_1/
layer_1/
part_00001/
</code></pre></div> </code></pre></div>
<p><strong>Partition</strong> (<code>part_XXXXX/</code>): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.</p> <p>There is no <code>layer_meta.json</code>. The mode is stored once in <code>PartitionMeta</code> and is valid for all layers. <code>unitigs.bin.idx</code> is built at the end of <code>build_exact_evidence</code> — never during MPHF construction — and is consumed at query time only.</p>
<p><strong>Layer</strong> (<code>layer_N/</code>): one <code>MphfLayer</code> plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.</p>
<hr /> <hr />
<h2 id="evidence-encoding">Evidence encoding</h2> <h2 id="evidence-encoding-exact">Evidence encoding (exact)</h2>
<p><code>evidence.bin</code> is a flat <code>[u32; n]</code> array with no header. Each u32 encodes one slot:</p> <p><code>evidence.bin</code> is a flat <code>[u32; n]</code> array with no header. Each u32 encodes one slot:</p>
<div class="highlight"><pre><span></span><code>bits [31:7] = chunk_id (25 bits) — index of the unitig chunk <div class="highlight"><pre><span></span><code>bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based) bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
</code></pre></div> </code></pre></div>
<p>Decoding: <code>chunk_id = raw &gt;&gt; 7</code>, <code>rank = raw &amp; 0x7F</code>. Reconstructing the kmer: read k nucleotides at position <code>rank</code> within unitig <code>chunk_id</code>.</p> <p><code>chunk_id = raw &gt;&gt; 7</code>, <code>rank = raw &amp; 0x7F</code>. Reconstructing the kmer: read k nucleotides at position <code>rank</code> within unitig <code>chunk_id</code> (requires <code>unitigs.bin.idx</code> for random access).</p>
<p>For k=31, m=11, the observed maximum is ~46 kmers per chunk, well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.</p> <p>For k=31, m=11, the observed maximum is ~46 kmers per chunk, well within the 127-kmer u7 capacity.</p>
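<p>The bit layout above translates directly into shifts and masks. The following is a minimal sketch of encoding, decoding, and reading one slot from the raw little-endian bytes of <code>evidence.bin</code>; the function names are hypothetical and chosen for illustration only.</p>

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F
const MAX_CHUNK_ID: u32 = (1 << 25) - 1; // largest 25-bit value

/// Pack (chunk_id, rank) into one u32; `None` if either field overflows.
fn encode_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Split a raw u32 back into (chunk_id, rank).
fn decode_evidence(raw: u32) -> (u32, u32) {
    (raw >> RANK_BITS, raw & RANK_MASK)
}

/// Read slot `slot` from a headerless little-endian `[u32; n]` byte buffer.
/// Returns `None` when the slot lies outside the buffer.
fn read_slot(bytes: &[u8], slot: usize) -> Option<(u32, u32)> {
    let start = slot.checked_mul(4)?;
    let end = start.checked_add(4)?;
    let b = bytes.get(start..end)?;
    let raw = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
    Some(decode_evidence(raw))
}
```

<p>For example, <code>chunk_id = 5</code> and <code>rank = 3</code> pack to <code>(5 &lt;&lt; 7) | 3 = 643</code>, stored on disk as the bytes <code>83 02 00 00</code>.</p>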
<hr /> <hr />
<h2 id="ptr_hash-configuration">ptr_hash configuration</h2> <h2 id="ptr_hash-configuration">ptr_hash configuration</h2>
<div class="highlight"><pre><span></span><code><span class="k">type</span><span class="w"> </span><span class="nc">Mphf</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">PtrHash</span><span class="o">&lt;</span> <div class="highlight"><pre><span></span><code><span class="k">type</span><span class="w"> </span><span class="nc">Mphf</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">PtrHash</span><span class="o">&lt;</span>
<span class="w"> </span><span class="kt">u64</span><span class="p">,</span><span class="w"> </span><span class="c1">// key type: canonical kmer raw encoding</span> <span class="w"> </span><span class="kt">u64</span><span class="p">,</span><span class="w"> </span><span class="c1">// key type: canonical kmer raw encoding</span>
<span class="w"> </span><span class="n">CubicEps</span><span class="p">,</span><span class="w"> </span><span class="c1">// bucket fn: 2.4 bits/key, λ=3.5, α=0.99</span> <span class="w"> </span><span class="n">CubicEps</span><span class="p">,</span><span class="w"> </span><span class="c1">// bucket fn: 2.4 bits/key, λ=3.5, α=0.99</span>
<span class="w"> </span><span class="n">CachelineEfVec</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">CachelineEf</span><span class="o">&gt;&gt;</span><span class="p">,</span><span class="w"> </span><span class="c1">// remap: 11.6 bits/entry (Elias-Fano)</span> <span class="w"> </span><span class="n">CachelineEfVec</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">CachelineEf</span><span class="o">&gt;&gt;</span><span class="p">,</span><span class="w"> </span><span class="c1">// remap: Elias-Fano</span>
<span class="w"> </span><span class="n">Xx64</span><span class="p">,</span><span class="w"> </span><span class="c1">// hasher: XXH3-64 with seed</span> <span class="w"> </span><span class="n">Xx64</span><span class="p">,</span><span class="w"> </span><span class="c1">// hasher: XXH3-64 with seed</span>
<span class="w"> </span><span class="nb">Vec</span><span class="o">&lt;</span><span class="kt">u8</span><span class="o">&gt;</span><span class="p">,</span><span class="w"> </span><span class="c1">// pilots</span> <span class="w"> </span><span class="nb">Vec</span><span class="o">&lt;</span><span class="kt">u8</span><span class="o">&gt;</span><span class="p">,</span><span class="w"> </span><span class="c1">// pilots</span>
<span class="o">&gt;</span><span class="p">;</span> <span class="o">&gt;</span><span class="p">;</span>
</code></pre></div> </code></pre></div>
<p><code>Xx64</code> is chosen over <code>FxHash</code> because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.</p> <p><code>Xx64</code> is chosen over <code>FxHash</code> because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.</p>
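<p>The zero counts quoted above follow from a 2-bit-per-nucleotide packing: a k-mer occupies the top 2k bits of the u64, leaving 64 − 2k low bits at zero. A small sketch of that arithmetic, assuming this packing (the helper names are hypothetical):</p>

```rust
/// Number of always-zero low bits in a left-aligned, 2-bit-per-base
/// u64 encoding of a k-mer. Valid for 1 <= k <= 32.
fn structural_zero_bits(k: u32) -> u32 {
    assert!((1..=32).contains(&k), "k must be in 1..=32 for a u64 encoding");
    64 - 2 * k
}

/// Move a right-aligned 2k-bit value to the high end of the word.
fn left_align(right_aligned: u64, k: u32) -> u64 {
    right_aligned << structural_zero_bits(k)
}
```

<p>With k=11 this gives 42 zero bits and with k=31 it gives 2, matching the figures in the text. A hash built on a single multiplication mixes such inputs weakly, which is the stated reason for preferring <code>Xx64</code>.</p>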
<p><code>CubicEps</code> with <code>PtrHashParams::&lt;CubicEps&gt;::default()</code> (λ=3.5) is a balanced tradeoff: 2× slower construction than <code>Linear/λ=3.0</code>, 20% less space.</p> <p><code>CubicEps</code> with <code>PtrHashParams::&lt;CubicEps&gt;::default()</code> (λ=3.5): 2× slower construction than <code>Linear/λ=3.0</code>, ~20% less space.</p>
<hr /> <hr />
<h2 id="query-path">Query path</h2> <h2 id="column-append-and-merge-support">Column append and merge support</h2>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">query</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nb">Option</span><span class="o">&lt;</span><span class="n">Hit</span><span class="o">&lt;</span><span class="n">D</span><span class="p">::</span><span class="n">Item</span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="p">{</span> <p>These methods extend existing layers with new genome columns without touching the MPHF.</p>
<span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">mphf</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">kmer</span><span class="p">).</span><span class="n">map</span><span class="p">(</span><span class="o">|</span><span class="n">slot</span><span class="o">|</span><span class="w"> </span><span class="n">Hit</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">slot</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="p">:</span><span class="w"> </span><span class="nc">self</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">read</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span><span class="w"> </span><span class="p">})</span> <h3 id="layer-level-genome-column-append">Layer-level genome column append</h3>
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o">&lt;</span><span class="n">PersistentBitMatrix</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">append_genome_column</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">value_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">bool</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="p">()</span><span class="o">&gt;</span>
<span class="p">}</span>
<span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o">&lt;</span><span class="n">PersistentCompactIntMatrix</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">append_genome_column</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">value_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="p">()</span><span class="o">&gt;</span>
<span class="p">}</span> <span class="p">}</span>
</code></pre></div> </code></pre></div>
<p><code>MphfLayer::find</code> probes the MPHF, decodes evidence, and verifies the kmer — returning <code>Some(slot)</code> on match, <code>None</code> otherwise. <code>data.read(slot)</code> is called only on a confirmed hit.</p> <p>Both delegate to the corresponding <code>PersistentBitMatrix::append_column</code> / <code>PersistentCompactIntMatrix::append_column</code>. They write a new column file (<code>col_NNNNNN.pbiv</code> / <code>col_NNNNNN.pciv</code>) and update <code>meta.json</code> to increment <code>n_cols</code>. <code>value_of</code> is called once per slot (0..n).</p>
<p>In <code>LayeredMap</code>, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.</p> <h3 id="presence-matrix-initialisation">Presence matrix initialisation</h3>
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o">&lt;</span><span class="p">()</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">init_presence_matrix</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&amp;</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">n_kmers</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">OLMResult</span><span class="o">&lt;</span><span class="p">()</span><span class="o">&gt;</span>
<span class="p">}</span>
</code></pre></div>
<p>Called on the first merge of a Presence-mode index. Creates <code>presence/</code> with <code>meta.json {"n": n_kmers, "n_cols": 1}</code> and <code>col_000000.pbiv</code> set entirely to <code>true</code>. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.</p>
<h3 id="why-the-mphf-is-never-rebuilt">Why the MPHF is never rebuilt</h3>
<p>The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new <code>.pciv</code>/<code>.pbiv</code> file and a single <code>meta.json</code> update.</p>
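<p>The append-only behaviour can be illustrated with an in-memory stand-in. This sketch is not the on-disk implementation: <code>ToyPresenceMatrix</code> and its methods are hypothetical, and each <code>Vec&lt;bool&gt;</code> stands in for one <code>col_NNNNNN.pbiv</code> file. It shows initialisation with an all-true column for genome 0, followed by appending a column computed per slot.</p>

```rust
/// In-memory stand-in for a presence matrix: `n` slots, one Vec per column.
struct ToyPresenceMatrix {
    n: usize,
    cols: Vec<Vec<bool>>,
}

impl ToyPresenceMatrix {
    /// Mirror of the initialisation step: one column, every slot `true`,
    /// recording the original source as present everywhere.
    fn init(n_kmers: usize) -> Self {
        ToyPresenceMatrix {
            n: n_kmers,
            cols: vec![vec![true; n_kmers]],
        }
    }

    /// Mirror of a column append: `value_of` is called once per slot in
    /// 0..n, and existing columns are left untouched.
    fn append_column(&mut self, value_of: impl Fn(usize) -> bool) {
        let col: Vec<bool> = (0..self.n).map(value_of).collect();
        self.cols.push(col);
    }

    /// Column count, the analogue of `n_cols` in `meta.json`.
    fn n_cols(&self) -> usize {
        self.cols.len()
    }
}
```

<p>Slot numbers are the only link between a column and the kmer set, which is why nothing MPHF-related appears in the sketch.</p>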
<hr /> <hr />
<h2 id="add-layer-algorithm">Add-layer algorithm</h2> <h2 id="add-layer-algorithm">Add-layer algorithm</h2>
<p>When adding dataset B to an existing index:</p> <p>When adding dataset B to an existing index:</p>
<ol> <ol>
<li>For each partition, probe existing layers for kmers of B routed to that partition.</li> <li>For each partition, probe existing layers for kmers of B routed to that partition.</li>
<li>Collect kmers absent from all layers → <code>B \ index</code>.</li> <li>Collect kmers absent from all layers → <code>B \ index</code>.</li>
<li>Write <code>B \ index</code> to a new <code>unitigs.bin</code> via <code>MphfLayer::unitig_writer</code>.</li> <li>Write <code>B \ index</code> to a new <code>unitigs.bin</code> via <code>next_layer_writer()</code>.</li>
<li>Call <code>Layer&lt;D&gt;::build</code> on the new directory.</li> <li>Call <code>Layer&lt;D&gt;::build</code> (or <code>build_presence</code>) on the new layer directory.</li>
<li>Update <code>meta.json</code>.</li> <li>Call <code>push_layer</code> (or <code>append_layer</code>) to register the new layer in <code>meta.json</code>.</li>
</ol> </ol>
<p>Each partition's new layer is built independently; the operation is fully parallel across partitions.</p> <p>Each partition's new layer is built independently; the operation is fully parallel across partitions.</p>
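<p>The steps above can be sketched for a single partition using plain hash sets in place of MPHF layers. Everything here is a simplified assumption for illustration: <code>ToyPartition</code> is hypothetical, kmers are bare <code>u64</code> values, and skipping an empty new layer is a choice made for the sketch rather than documented behaviour.</p>

```rust
use std::collections::HashSet;

/// Toy partition: each layer is a set of canonical kmers (as raw u64).
struct ToyPartition {
    layers: Vec<HashSet<u64>>,
}

impl ToyPartition {
    /// Probe layers in order; return the index of the first layer that
    /// contains the kmer.
    fn find(&self, kmer: u64) -> Option<usize> {
        self.layers.iter().position(|layer| layer.contains(&kmer))
    }

    /// Add dataset B: keep only kmers absent from every existing layer
    /// (B \ index) and register them as a new layer. Returns the number of
    /// novel kmers. An empty difference adds no layer in this sketch.
    fn add_dataset(&mut self, kmers: &[u64]) -> usize {
        let novel: HashSet<u64> = kmers
            .iter()
            .copied()
            .filter(|k| self.find(*k).is_none())
            .collect();
        let n = novel.len();
        if n > 0 {
            self.layers.push(novel);
        }
        n
    }
}
```

<p>Because only absent kmers enter the new layer, layers stay disjoint by construction, and each partition can run this step without looking at any other partition.</p>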
<hr /> <hr />
@@ -1433,11 +1902,15 @@ bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
</tr> </tr>
<tr> <tr>
<td><code>memmap2 0.9</code></td> <td><code>memmap2 0.9</code></td>
<td>mmap of evidence and payload files</td> <td>mmap of evidence and fingerprint files</td>
</tr>
<tr>
<td><code>bitvec</code></td>
<td>packed b-bit fingerprint storage</td>
</tr> </tr>
<tr> <tr>
<td><code>obiskio</code></td> <td><code>obiskio</code></td>
<td>unitig file writer/reader</td> <td>unitig file writer/reader + <code>.idx</code> build</td>
</tr> </tr>
<tr> <tr>
<td><code>obicompactvec</code></td> <td><code>obicompactvec</code></td>
@@ -1448,8 +1921,8 @@ bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
<td>parallel MPHF construction pass</td> <td>parallel MPHF construction pass</td>
</tr> </tr>
<tr> <tr>
<td><code>ndarray 0.16</code></td> <td><code>serde / serde_json</code></td>
<td>aggregation output arrays</td> <td><code>PartitionMeta</code> serialisation</td>
</tr> </tr>
</tbody> </tbody>
</table> </table>
File diff suppressed because it is too large
+155 -21
@@ -662,6 +662,17 @@
</span> </span>
</a> </a>
</li>
<li class="md-nav__item">
<a href="#make_pipe-dsl" class="md-nav__link">
<span class="md-ellipsis">
make_pipe! DSL
</span>
</a>
</li> </li>
</ul> </ul>
@@ -801,6 +812,34 @@
<li class="md-nav__item">
<a href="../evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="../obilayeredmap/" class="md-nav__link"> <a href="../obilayeredmap/" class="md-nav__link">
@@ -879,6 +918,62 @@
<li class="md-nav__item">
<a href="../merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
@@ -1087,6 +1182,17 @@
</span> </span>
</a> </a>
</li>
<li class="md-nav__item">
<a href="#make_pipe-dsl" class="md-nav__link">
<span class="md-ellipsis">
make_pipe! DSL
</span>
</a>
</li> </li>
</ul> </ul>
@@ -1145,7 +1251,7 @@
<h1 id="obipipeline-parallel-pipeline-library">obipipeline — parallel pipeline library</h1> <h1 id="obipipeline-parallel-pipeline-library">obipipeline — parallel pipeline library</h1>
<p><code>obipipeline</code> is a generic, multi-threaded data pipeline crate. It connects a <strong>source</strong>, a chain of <strong>transforms</strong>, and a <strong>sink</strong> via crossbeam channels, running each stage with a shared worker pool and a biased scheduler.</p> <p><code>obipipeline</code> is a generic, multi-threaded data pipeline crate. It connects a <strong>source</strong>, a chain of <strong>stages</strong>, and a <strong>sink</strong> via crossbeam channels, running each stage with a shared worker pool and a biased scheduler.</p>
<h2 id="core-types">Core types</h2> <h2 id="core-types">Core types</h2>
<table> <table>
<thead> <thead>
@@ -1158,22 +1264,33 @@
<tbody> <tbody>
<tr> <tr>
<td><code>SourceFn&lt;D&gt;</code></td> <td><code>SourceFn&lt;D&gt;</code></td>
<td><code>Box&lt;dyn FnMut() -&gt; Result&lt;D, PipelineError&gt; + Send+Sync&gt;</code></td> <td><code>Box&lt;dyn FnMut() -&gt; Result&lt;D, PipelineError&gt; + Send&gt;</code></td>
<td>Called repeatedly; <code>FnMut</code> because it holds iterator state</td> <td>Called repeatedly; <code>FnMut</code> because it holds iterator state</td>
</tr> </tr>
<tr> <tr>
<td><code>SharedFn&lt;D&gt;</code></td> <td><code>SharedFn&lt;D&gt;</code></td>
<td><code>Arc&lt;dyn Fn(D) -&gt; Result&lt;D, PipelineError&gt; + Send+Sync&gt;</code></td> <td><code>Arc&lt;dyn Fn(D) -&gt; Result&lt;D, PipelineError&gt; + Send + Sync&gt;</code></td>
<td>Shared across workers via <code>Arc::clone</code> (no copy of the closure)</td> <td>1→1 transform shared across workers via <code>Arc::clone</code></td>
</tr>
<tr>
<td><code>SharedFlatFn&lt;D&gt;</code></td>
<td><code>Arc&lt;dyn Fn(D, &amp;Sender&lt;Result&lt;D, _&gt;&gt;, &amp;Sender&lt;isize&gt;) + Send + Sync&gt;</code></td>
<td>1→N transform; pushes items into channel, sends delta</td>
</tr> </tr>
<tr> <tr>
<td><code>SinkFn&lt;D&gt;</code></td> <td><code>SinkFn&lt;D&gt;</code></td>
<td><code>Box&lt;dyn Fn(D) -&gt; Result&lt;(), PipelineError&gt; + Send+Sync&gt;</code></td> <td><code>Box&lt;dyn Fn(D) -&gt; Result&lt;(), PipelineError&gt; + Send&gt;</code></td>
<td>Final consumer; returns <code>Result</code> so errors propagate back</td> <td>Final consumer; returns <code>Result</code> so errors propagate back</td>
</tr> </tr>
</tbody> </tbody>
</table> </table>
<p><code>Pipeline&lt;D&gt;</code> holds one <code>SourceFn</code>, a <code>Vec&lt;SharedFn&gt;</code>, and one <code>SinkFn</code>.<br /> <p>Stages come in two variants:</p>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">enum</span><span class="w"> </span><span class="nc">Stage</span><span class="o">&lt;</span><span class="n">D</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">Transform</span><span class="p">(</span><span class="n">SharedFn</span><span class="o">&lt;</span><span class="n">D</span><span class="o">&gt;</span><span class="p">),</span><span class="w"> </span><span class="c1">// 1→1</span>
<span class="w"> </span><span class="n">Flat</span><span class="p">(</span><span class="n">SharedFlatFn</span><span class="o">&lt;</span><span class="n">D</span><span class="o">&gt;</span><span class="p">),</span><span class="w"> </span><span class="c1">// 1→N</span>
<span class="p">}</span>
</code></pre></div>
<p><code>Pipeline&lt;D&gt;</code> holds one <code>SourceFn</code>, a <code>Vec&lt;Stage&gt;</code>, and one <code>SinkFn</code>.<br />
<code>WorkerPool&lt;D&gt;</code> wraps a <code>Pipeline</code> with <code>n_workers</code> and channel <code>capacity</code>.</p> <code>WorkerPool&lt;D&gt;</code> wraps a <code>Pipeline</code> with <code>n_workers</code> and channel <code>capacity</code>.</p>
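<p>To make the difference between the two stage variants concrete, here is a deliberately simplified, single-threaded sketch. It is not how <code>obipipeline</code> runs stages: the flat variant below returns a <code>Vec</code> instead of pushing into a channel, errors are plain <code>String</code>s rather than <code>PipelineError</code>, and <code>run_stages</code> and the two constructors are hypothetical helpers.</p>

```rust
use std::sync::Arc;

// Simplified signatures: the crate's flat function pushes into channels,
// while this sketch returns a Vec to stay free of dependencies.
type SharedFn<D> = Arc<dyn Fn(D) -> Result<D, String> + Send + Sync>;
type FlatFn<D> = Arc<dyn Fn(D) -> Vec<Result<D, String>> + Send + Sync>;

enum Stage<D> {
    Transform(SharedFn<D>), // 1→1
    Flat(FlatFn<D>),        // 1→N
}

impl<D> Stage<D> {
    /// Wrap a 1→1 closure.
    fn transform(f: impl Fn(D) -> Result<D, String> + Send + Sync + 'static) -> Self {
        Stage::Transform(Arc::new(f))
    }

    /// Wrap a 1→N closure.
    fn flat(f: impl Fn(D) -> Vec<Result<D, String>> + Send + Sync + 'static) -> Self {
        Stage::Flat(Arc::new(f))
    }
}

/// Apply the stages in order to every input item, stopping at the first
/// error. A real pool would spread this work across worker threads.
fn run_stages<D>(stages: &[Stage<D>], input: Vec<D>) -> Result<Vec<D>, String> {
    let mut items = input;
    for stage in stages {
        let mut next = Vec::new();
        for item in items {
            match stage {
                Stage::Transform(f) => next.push(f(item)?),
                Stage::Flat(f) => {
                    for out in f(item) {
                        next.push(out?);
                    }
                }
            }
        }
        items = next;
    }
    Ok(items)
}
```

<p>A flat stage changes the number of items in flight, which is presumably why the real <code>SharedFlatFn</code> also reports a delta through a second channel; the sketch has no need for one because it processes each stage to completion before moving on.</p>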
<h2 id="workerpool">WorkerPool</h2> <h2 id="workerpool">WorkerPool</h2>
<div class="highlight"><pre><span></span><code><span class="n">WorkerPool</span><span class="p">::</span><span class="n">new</span><span class="p">(</span><span class="n">pipeline</span><span class="p">:</span><span class="w"> </span><span class="nc">Pipeline</span><span class="o">&lt;</span><span class="n">D</span><span class="o">&gt;</span><span class="p">,</span><span class="w"> </span><span class="n">n_workers</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">capacity</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">Self</span> <div class="highlight"><pre><span></span><code><span class="n">WorkerPool</span><span class="p">::</span><span class="n">new</span><span class="p">(</span><span class="n">pipeline</span><span class="p">:</span><span class="w"> </span><span class="nc">Pipeline</span><span class="o">&lt;</span><span class="n">D</span><span class="o">&gt;</span><span class="p">,</span><span class="w"> </span><span class="n">n_workers</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">capacity</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">Self</span>
@@ -1193,7 +1310,7 @@
</tr> </tr>
<tr> <tr>
<td><code>capacity</code></td> <td><code>capacity</code></td>
<td>Bound on every crossbeam channel in the pipeline (source output, inter-stage channels, worker input, sink input, sink error). Controls memory and back-pressure: a full channel blocks the sender until a slot frees.</td> <td>Bound on every crossbeam channel in the pipeline. Controls memory and back-pressure: a full channel blocks the sender until a slot frees.</td>
</tr> </tr>
</tbody> </tbody>
</table> </table>
@@ -1208,7 +1325,7 @@
</code></pre></div> </code></pre></div>
<p>Each variant carries the concrete type for one stage's output. The macros pattern-match on this enum to route values between stages.</p> <p>Each variant carries the concrete type for one stage's output. The macros pattern-match on this enum to route values between stages.</p>
<h2 id="macros">Macros</h2> <h2 id="macros">Macros</h2>
<p>Six low-level macros build individual stages; one high-level macro (<code>make_pipeline!</code>) composes them.</p> <p>Eight low-level macros build individual stages; one high-level macro (<code>make_pipeline!</code>) composes them.</p>
<h3 id="low-level">Low-level</h3> <h3 id="low-level">Low-level</h3>
<div class="highlight"><pre><span></span><code><span class="n">make_source</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">iterator</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// iterator yields T</span> <div class="highlight"><pre><span></span><code><span class="n">make_source</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">iterator</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// iterator yields T</span>
<span class="n">make_source_fallible</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">iterator</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// iterator yields Result&lt;T, E&gt;</span> <span class="n">make_source_fallible</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">iterator</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// iterator yields Result&lt;T, E&gt;</span>
@@ -1216,6 +1333,9 @@
<span class="n">make_transform</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -&gt; U</span> <span class="n">make_transform</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -&gt; U</span>
<span class="n">make_transform_fallible</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -&gt; Result&lt;U, E&gt;</span> <span class="n">make_transform_fallible</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -&gt; Result&lt;U, E&gt;</span>
<span class="n">make_flat_transform</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -&gt; impl IntoIterator&lt;Item=U&gt;</span>
<span class="n">make_flat_transform_fallible</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -&gt; Result&lt;impl IntoIterator&lt;Item=U&gt;, E&gt;</span>
<span class="n">make_sink</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -&gt; ()</span> <span class="n">make_sink</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -&gt; ()</span>
<span class="n">make_sink_fallible</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -&gt; Result&lt;(), E&gt;</span> <span class="n">make_sink_fallible</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -&gt; Result&lt;(), E&gt;</span>
</code></pre></div> </code></pre></div>
@@ -1224,17 +1344,31 @@
<div class="highlight"><pre><span></span><code>make_pipeline! { <div class="highlight"><pre><span></span><code>make_pipeline! {
DataEnum, DataEnum,
source iterator =&gt; OutputVariant, // or source? for fallible source iterator =&gt; OutputVariant, // or source? for fallible
| func: In =&gt; Out, // non-fallible transform | func: In =&gt; Out, // 1→1 non-fallible transform
|? func: In =&gt; Out, // fallible transform |? func: In =&gt; Out, // 1→1 fallible transform
|| func: In =&gt; Out, // 1→N non-fallible flat transform
||? func: In =&gt; Out, // 1→N fallible flat transform
sink func @ InputVariant, // or sink? for fallible sink func @ InputVariant, // or sink? for fallible
} }
</code></pre></div> </code></pre></div>
<p><code>?</code> marks fallibility on source, individual transforms, or sink independently.<br /> <p><code>?</code> marks fallibility on source, individual transforms, or sink independently.<br />
Implemented as a <strong>TT muncher</strong>: the internal rule <code>@build</code> recurses over transform tokens one at a time, accumulating them into a <code>vec![]</code>, then terminates on <code>sink</code>/<code>sink?</code>.</p> Implemented as a <strong>TT muncher</strong>: the internal rule <code>@build</code> recurses over transform tokens one at a time, accumulating them into a <code>vec![]</code>, then terminates on <code>sink</code>/<code>sink?</code>.</p>
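<p>The TT-muncher technique mentioned above can be illustrated with a toy macro. This is only a sketch: <code>stage_list!</code> is a hypothetical name, it handles just the <code>|</code>, <code>|?</code> and <code>sink</code> tokens, and it collects descriptive strings instead of real pipeline stages, so it should not be read as the actual <code>make_pipeline!</code> implementation.</p>

```rust
// Toy TT muncher (hypothetical, not the real make_pipeline! macro).
// The internal `@build` rules consume one stage at a time and push a
// description of it into the accumulator between the square brackets.
macro_rules! stage_list {
    // Terminal rule: `sink` ends the recursion and emits the vec.
    (@build [$($acc:expr),*] sink $name:ident) => {
        vec![$($acc,)* concat!("sink ", stringify!($name))]
    };
    // Fallible transform: `|? func,` followed by the remaining tokens.
    (@build [$($acc:expr),*] |? $f:ident, $($rest:tt)*) => {
        stage_list!(@build [$($acc,)* concat!("fallible ", stringify!($f))] $($rest)*)
    };
    // Non-fallible transform: `| func,` followed by the remaining tokens.
    (@build [$($acc:expr),*] | $f:ident, $($rest:tt)*) => {
        stage_list!(@build [$($acc,)* concat!("plain ", stringify!($f))] $($rest)*)
    };
    // Entry point: start with an empty accumulator. Must come last,
    // otherwise it would also match the internal `@build` calls.
    ($($tokens:tt)*) => {
        stage_list!(@build [] $($tokens)*)
    };
}
```

<p>Calling <code>stage_list!(| parse, |? validate, sink write)</code> expands step by step until the <code>sink</code> rule fires, which is the same shape of recursion the paragraph above describes.</p>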
<h3 id="make_pipe-dsl">make_pipe! DSL</h3>
<p><code>make_pipe!</code> builds a sourceless/sinkless <code>Pipe&lt;D, In, Out&gt;</code> — a reusable, composable stage sequence:</p>
<div class="highlight"><pre><span></span><code>make_pipe! {
DataEnum : InType =&gt; OutType,
| func: InVariant =&gt; OutVariant,
|? func: InVariant =&gt; OutVariant,
|| func: InVariant =&gt; OutVariant,
||? func: InVariant =&gt; OutVariant,
}
</code></pre></div>
<p>Two pipes compose with <code>.then(other)</code>. Apply to an iterator with <code>.apply(iter, n_workers, capacity)</code> to get a <code>PipeIter&lt;Out&gt;</code> — an iterator over the pipeline output, backed by a background <code>WorkerPool</code>. The scatter step in <code>obikmer</code> uses <code>make_pipe!</code> and <code>.apply()</code> rather than the full <code>make_pipeline!</code> / <code>WorkerPool</code> pattern.</p>
<h2 id="scheduler-architecture">Scheduler architecture</h2> <h2 id="scheduler-architecture">Scheduler architecture</h2>
<div class="highlight"><pre><span></span><code>Source thread ──► [source_rx] ──► Scheduler ──► [worker_tx] ──► Workers (×N) <div class="highlight"><pre><span></span><code>Source thread ──► [source_rx] ──► Scheduler ──► [worker_tx] ──► Workers (×N)
▲ │ ▲ │
[stage_rxs] ────────┘◄──────────────────────────────┘ [stage_rxs] ────────┘◄──────────────────────────────┘
[flat_delta_rx] ──► Scheduler (in_flight adjustment)
[sink_err_rx] ← errors from sink (highest priority) [sink_err_rx] ← errors from sink (highest priority)
@@ -1242,20 +1376,20 @@ Implemented as a <strong>TT muncher</strong>: the internal rule <code>@build</co
</code></pre></div> </code></pre></div>
<p>The scheduler is a single thread running a biased <code>Select</code> over all input channels. Priority order (highest first):</p> <p>The scheduler is a single thread running a biased <code>Select</code> over all input channels. Priority order (highest first):</p>
<div class="highlight"><pre><span></span><code>index 0 sink_err_rx abort on sink error <div class="highlight"><pre><span></span><code>index 0 sink_err_rx abort on sink error
index 1 stage_rxs[N-1] drain last stage first index 1 flat_delta_rx adjust in_flight before dispatching
... index 2..=n+1 stage_rxs[n-1..0] drain last stage first
index N stage_rxs[0] index n+2 source_rx pull new data last
index N+1 source_rx pull new data last
</code></pre></div> </code></pre></div>
<p>This back-pressure-friendly ordering ensures downstream stages are drained before new items enter the pipeline.</p> <p>This back-pressure-friendly ordering ensures downstream stages are drained before new items enter the pipeline.</p>
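<p>The updated priority table can be expressed as a small index-to-channel mapping. The sketch below is an illustration only: the <code>Input</code> enum and <code>input_at</code> function are hypothetical names, and the real scheduler registers channels with a biased <code>Select</code> rather than calling a lookup function.</p>

```rust
/// Which input channel sits at a given priority index (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Input {
    SinkErr,
    FlatDelta,
    /// Index into `stage_rxs`.
    Stage(usize),
    Source,
}

/// Maps a select index to its channel for a pipeline with `n_stages` stages.
/// Lower indices are polled first, so later stages are drained before
/// earlier ones, and the source comes last.
fn input_at(index: usize, n_stages: usize) -> Option<Input> {
    match index {
        0 => Some(Input::SinkErr),
        1 => Some(Input::FlatDelta),
        // index 2 maps to stage n-1, index n+1 maps to stage 0
        i if i <= n_stages + 1 => Some(Input::Stage(n_stages + 1 - i)),
        i if i == n_stages + 2 => Some(Input::Source),
        _ => None,
    }
}
```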
<p><strong>Workers</strong> are generic: each receives <code>(data, SharedFn, result_tx)</code> and calls <code>f(data)</code>, sending the result to the provided channel. The scheduler decides which transform to apply and where to route the result.</p> <p><strong>Workers</strong> are generic: each receives a <code>WorkerTask</code> — either <code>Transform(data, stage_idx)</code> or <code>Flat(data, stage_idx)</code>. For <code>Transform</code>, the worker calls <code>f(data)</code> and sends the result to <code>stage_txs[stage_idx]</code>. For <code>Flat</code>, the worker calls <code>f(data, &amp;push_tx, &amp;delta_tx)</code>: the closure pushes N items into <code>push_tx</code> then sends <code>N-1</code> to <code>delta_tx</code>. The scheduler uses the delta to adjust <code>in_flight</code> without knowing N in advance.</p>
<p><strong>Termination</strong> uses an <code>in_flight</code> counter:</p> <p><strong>Termination</strong> uses an <code>in_flight: isize</code> counter and a <code>flat_workers_active: usize</code> counter:</p>
<ul> <ul>
<li>incremented when an item is dispatched from source to workers</li> <li><code>in_flight</code> incremented when an item is dispatched from source to workers</li>
<li>decremented when the item exits the last stage</li> <li><code>in_flight</code> decremented when the item exits the last stage to the sink</li>
<li>the loop exits only when <code>source_done &amp;&amp; in_flight == 0</code></li> <li><code>flat_workers_active</code> incremented when a <code>Flat</code> task is dispatched, decremented when the delta arrives</li>
<li>the loop exits only when <code>source_done &amp;&amp; in_flight == 0 &amp;&amp; flat_workers_active == 0</code></li>
</ul> </ul>
<p>This guarantees all in-flight items complete before <code>join()</code>.</p> <p>This guarantees all in-flight items complete (including all N outputs of a flat stage) before <code>join()</code>.</p>
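<p>The termination rule can be modelled with a few lines of bookkeeping. This is a minimal sketch under stated assumptions: the <code>Termination</code> struct and its method names are hypothetical, and only the two counters and the exit condition come from the text above. It also shows why <code>in_flight</code> is signed: outputs of a flat stage may reach the sink before the delta arrives.</p>

```rust
/// Hypothetical bookkeeping mirroring the scheduler's termination rule.
#[derive(Default)]
struct Termination {
    source_done: bool,
    in_flight: isize,
    flat_workers_active: usize,
}

impl Termination {
    /// An item leaves the source and is handed to a worker.
    fn dispatch_from_source(&mut self) {
        self.in_flight += 1;
    }

    /// A `Flat` task is handed to a worker; its output count is unknown.
    fn dispatch_flat(&mut self) {
        self.flat_workers_active += 1;
    }

    /// The flat worker reports N-1 (negative when N == 0).
    fn flat_delta(&mut self, delta: isize) {
        self.in_flight += delta;
        self.flat_workers_active -= 1;
    }

    /// An item exits the last stage to the sink.
    fn item_reached_sink(&mut self) {
        self.in_flight -= 1;
    }

    /// Exit condition of the scheduler loop.
    fn finished(&self) -> bool {
        self.source_done && self.in_flight == 0 && self.flat_workers_active == 0
    }
}
```

<p>With a single source item expanded into one output, <code>in_flight</code> can reach zero before the delta is received; the <code>flat_workers_active</code> check is what keeps the loop alive in that window.</p>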
<h2 id="error-handling">Error handling</h2> <h2 id="error-handling">Error handling</h2>
<p><code>PipelineError</code> has four variants:</p> <p><code>PipelineError</code> has four variants:</p>
<table> <table>
@@ -1279,7 +1413,7 @@ index N+1 source_rx pull new data last
<td>Internal routing error</td> <td>Internal routing error</td>
</tr> </tr>
<tr> <tr>
<td><code>StepError(Box&lt;dyn Error&gt;)</code></td> <td><code>StepError(Box&lt;dyn Error + Send + Sync&gt;)</code></td>
<td>Error from user code (wrapped by <code>make_*_fallible!</code>)</td> <td>Error from user code (wrapped by <code>make_*_fallible!</code>)</td>
</tr> </tr>
</tbody> </tbody>
@@ -12,7 +12,7 @@
<link rel="prev" href="../persistent_compact_int_vec/"> <link rel="prev" href="../persistent_compact_int_vec/">
<link rel="next" href="../../architecture/sequences/invariant/"> <link rel="next" href="../merge/">
@@ -649,6 +649,34 @@
<li class="md-nav__item">
<a href="../evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="../obilayeredmap/" class="md-nav__link"> <a href="../obilayeredmap/" class="md-nav__link">
@@ -1002,6 +1030,62 @@
<li class="md-nav__item">
<a href="../merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
@@ -649,6 +649,34 @@
<li class="md-nav__item">
<a href="../evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="../obilayeredmap/" class="md-nav__link"> <a href="../obilayeredmap/" class="md-nav__link">
@@ -985,6 +1013,62 @@
<li class="md-nav__item">
<a href="../merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
+146 -11
@@ -773,6 +773,34 @@
<li class="md-nav__item">
<a href="../evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="../obilayeredmap/" class="md-nav__link"> <a href="../obilayeredmap/" class="md-nav__link">
@@ -851,6 +879,62 @@
<li class="md-nav__item">
<a href="../merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
@@ -1104,7 +1188,9 @@
<li><strong>error valley</strong> → suggests min_count (typically the local minimum between the error peak and the coverage peak)</li> <li><strong>error valley</strong> → suggests min_count (typically the local minimum between the error peak and the coverage peak)</li>
</ul> </ul>
<h2 id="phase-1-scatter">Phase 1 — Scatter</h2> <h2 id="phase-1-scatter">Phase 1 — Scatter</h2>
<p>Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored. For each read:</p> <p>Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored.</p>
<p>Input files are read via <code>open_nuc_stream</code>, which opens and decompresses the file, auto-detects the format (FASTA / FASTQ / GenBank), and yields a sequence of <code>NucPage</code> buffers. Each <code>NucPage</code> is a flat 64 KB buffer of normalised bytes (<code>ACGT</code> + <code>\x00</code> separators), carrying an overlap of k-1 bytes from the preceding page so that no k-mer is lost at page boundaries. Per-record identity (sequence id, raw bytes) is not preserved; this is intentional, because the scatter phase only needs normalised bases to produce superkmers.</p>
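<p>The effect of the page overlap can be checked with a small model. The sketch below is not <code>open_nuc_stream</code>: <code>overlapping_pages</code> is a hypothetical helper that cuts one in-memory sequence into fixed-size pages, ignores the <code>\x00</code> separators, and repeats the last k-1 bytes of each page at the start of the next one.</p>

```rust
/// Splits `seq` into pages of at most `page_len` bytes, where each page
/// after the first starts with the last k-1 bytes of the previous page.
/// Hypothetical helper; the real reader works on streamed, normalised input.
fn overlapping_pages(seq: &[u8], page_len: usize, k: usize) -> Vec<&[u8]> {
    assert!(k >= 1 && page_len >= k, "a page must hold at least one k-mer");
    let step = page_len - (k - 1); // fresh bytes contributed by each page
    let mut pages = Vec::new();
    let mut start = 0;
    // Only emit a page if it still contains at least one full k-mer.
    while start + k <= seq.len() {
        let end = (start + page_len).min(seq.len());
        pages.push(&seq[start..end]);
        if end == seq.len() {
            break;
        }
        start += step;
    }
    pages
}
```

<p>Enumerating the k-mers page by page then yields exactly the k-mers of the original sequence, each once and in order, which is the property the overlap is meant to provide.</p>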
<p>For each read fragment within a page:</p>
<ol> <ol>
<li><strong>Ambiguous base filter</strong>: cut at any non-ACGT base; discard fragments shorter than k.</li> <li><strong>Ambiguous base filter</strong>: cut at any non-ACGT base; discard fragments shorter than k.</li>
<li><strong>Entropy filter</strong>: scan each fragment with a sliding window of size k. When the kmer <span class="arithmatex">\(K_i = S[i \mathinner{..} i+k-1]\)</span> ended by nucleotide <span class="arithmatex">\(S[j]\)</span> (with <span class="arithmatex">\(j = i+k-1\)</span>) has entropy below threshold <span class="arithmatex">\(\theta\)</span>, emit the current segment and start a new one (see algorithm below). <span class="arithmatex">\(K_i\)</span> belongs to neither segment, and no valid kmer is lost.</li> <li><strong>Entropy filter</strong>: scan each fragment with a sliding window of size k. When the kmer <span class="arithmatex">\(K_i = S[i \mathinner{..} i+k-1]\)</span> ended by nucleotide <span class="arithmatex">\(S[j]\)</span> (with <span class="arithmatex">\(j = i+k-1\)</span>) has entropy below threshold <span class="arithmatex">\(\theta\)</span>, emit the current segment and start a new one (see algorithm below). <span class="arithmatex">\(K_i\)</span> belongs to neither segment, and no valid kmer is lost.</li>
@@ -1154,8 +1240,13 @@ B ≈ 100 is tunable; RAM needed ≈ partition_size / B.</p>
for each kmer in sequence: for each kmer in sequence:
kmer_counts[canonical(kmer)] += COUNT kmer_counts[canonical(kmer)] += COUNT
</code></pre></div> </code></pre></div>
<p>Implemented as an external sort or a temporary HashMap, depending on partition size. At the end of this phase, each distinct canonical kmer has its exact total count.</p> <p>Implemented as a three-step pipeline in <code>count_partition()</code>:</p>
<p>Abundance filter applied here: kmers with <code>total_count &lt; q</code> are discarded. <code>q</code> is a collection parameter (0 = keep all, including singletons for ≤1x data).</p> <ol>
<li><strong>External sort</strong> (<code>kmer_sort::sort_unique_kmers</code>): read dereplicated superkmers, extract canonical kmer raw <code>u64</code> values, sort in RAM-bounded chunks (adaptive: 40% of available RAM ÷ n_threads, min 1 M kmers/chunk), k-way merge with inline dedup → <code>sorted_unique.bin</code>. f0 is now known exactly.</li>
<li><strong>Provisional MPHF</strong> (ptr_hash): built from <code>sorted_unique.bin</code> via <code>new_from_par_iter(f0, ...)</code>. Stored to <code>mphf1.bin</code>; <code>sorted_unique.bin</code> deleted immediately.</li>
<li><strong>Accumulation pass</strong>: re-read dereplicated superkmers; for each kmer, <code>slot = mphf.index(kmer.raw())</code>, increment <code>counts1[slot]</code> by the superkmer COUNT. Stored in a <code>PersistentCompactIntVec</code> (<code>counts1.bin</code>).</li>
</ol>
<p>At the end of this phase, each distinct canonical kmer has its exact total count, and the frequency spectrum (<code>spectrums/{label}.json</code>) is written to the index root.</p>
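<p>The three steps can be mimicked in memory to make the data flow concrete. In this sketch, which is an assumption-laden stand-in rather than <code>count_partition()</code> itself, a sorted and deduplicated <code>Vec</code> plays the role of <code>sorted_unique.bin</code>, a binary search replaces the provisional MPHF, and a plain <code>Vec&lt;u32&gt;</code> replaces <code>counts1.bin</code>. The function name and the input shape (canonical k-mers plus a COUNT per superkmer) are hypothetical.</p>

```rust
/// In-memory model of the counting phase.
/// `superkmers` holds, per dereplicated superkmer, its canonical k-mers
/// (as raw u64 values) and its COUNT.
fn count_kmers(superkmers: &[(Vec<u64>, u32)]) -> (Vec<u64>, Vec<u32>) {
    // Step 1: sort + dedup, standing in for the external sort.
    let mut unique: Vec<u64> = superkmers
        .iter()
        .flat_map(|(kmers, _)| kmers.iter().copied())
        .collect();
    unique.sort_unstable();
    unique.dedup();

    // Step 2: the "index" is the position in the sorted unique list
    // (the real code builds a minimal perfect hash function instead).
    let slot_of = |kmer: &u64| unique.binary_search(kmer).expect("kmer was inserted above");

    // Step 3: accumulation pass, adding the superkmer COUNT per k-mer.
    let mut counts = vec![0u32; unique.len()];
    for (kmers, count) in superkmers {
        for kmer in kmers {
            counts[slot_of(kmer)] += *count;
        }
    }
    (unique, counts)
}
```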
<p>No pre-filter on super-kmer COUNT is possible at phase 2: a super-kmer with COUNT=1 may contain only high-abundance kmers, each present in many other super-kmers across the partition.</p> <p>No pre-filter on super-kmer COUNT is possible at phase 2: a super-kmer with COUNT=1 may contain only high-abundance kmers, each present in many other super-kmers across the partition.</p>
<h2 id="phase-4-super-kmer-compaction">Phase 4 — Super-kmer compaction</h2> <h2 id="phase-4-super-kmer-compaction">Phase 4 — Super-kmer compaction</h2>
<p>The valid kmer set from phase 3 is used as a mask to rewrite the super-kmer files:</p> <p>The valid kmer set from phase 3 is used as a mask to rewrite the super-kmer files:</p>
@@ -1188,14 +1279,52 @@ branching / dead-end → unitig start or end
<p>Output: <code>unitigs.bin</code> — the permanent evidence structure for the partition. Each kmer in the partition appears at exactly one (unitig_id, offset) location.</p> <p>Output: <code>unitigs.bin</code> — the permanent evidence structure for the partition. Each kmer in the partition appears at exactly one (unitig_id, offset) location.</p>
<p><strong>Scope of local unitigs:</strong> these are unitigs of the partition's local de Bruijn graph, not global unitigs. A kmer whose k-1 successor or predecessor falls in another partition appears as a dead end locally and terminates the unitig. This does not affect correctness of verification but means partition-local unitigs cannot be directly reused for global assembly.</p> <p><strong>Scope of local unitigs:</strong> these are unitigs of the partition's local de Bruijn graph, not global unitigs. A kmer whose k-1 successor or predecessor falls in another partition appears as a dead end locally and terminates the unitig. This does not affect correctness of verification but means partition-local unitigs cannot be directly reused for global assembly.</p>
<h2 id="phase-6-mphf-construction-and-index-finalisation">Phase 6 — MPHF construction and index finalisation</h2> <h2 id="phase-6-mphf-construction-and-index-finalisation">Phase 6 — MPHF construction and index finalisation</h2>
<p>Built once on the definitive kmer set (all kmers in all unitigs of the partition). See <a href="../obilayeredmap/">obilayeredmap</a> and <a href="../mphf/">MPHF selection</a> for the current implementation.</p> <p><code>build_index_layer</code> is called per partition (in parallel via <code>build_layers</code>) with the following parameters sourced from <code>IndexConfig</code>:</p>
<div class="highlight"><pre><span></span><code>kmers from unitigs → MPHF → mphf.bin <ul>
→ evidence.bin : n × u32, each = (chunk_id: 25 bits | rank: 7 bits) <li><code>block_bits</code> — from <code>IndexConfig::block_bits</code>; controls the <code>.idx</code> block size (2^block_bits unitig chunks per block) for exact evidence</li>
→ payload : counts/ (mode 2) or presence/ (mode 3) <li><code>evidence</code>: <code>EvidenceKind::Exact</code> or <code>EvidenceKind::Approx { b, z }</code>; propagated unchanged from <code>IndexConfig::evidence</code></li>
<li><code>min_ab</code> / <code>max_ab</code> — abundance bounds applied before graph construction</li>
<li><code>with_counts</code> — whether to store kmer counts alongside set membership</li>
</ul>
<p><strong>Abundance filtering:</strong> when <code>min_ab &gt; 1</code> or <code>max_ab.is_some()</code>, the provisional <code>mphf1.bin</code> and <code>counts1.bin</code> produced in phase 3 are memory-mapped. Each canonical kmer is accepted only if its count in <code>counts1</code> satisfies the bounds. If either file is absent, filtering is skipped (all kmers accepted).</p>
<div class="highlight"><pre><span></span><code>for each kmer in dereplicated super-kmer:
ab = counts1[mphf1.index(kmer.raw())]
if ab &lt; min_ab || ab &gt; max_ab: skip
graph.push(kmer)
</code></pre></div> </code></pre></div>
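<p>The acceptance test in the pseudocode above can be written as a pair of small predicates. This is a sketch: the function names are hypothetical, and <code>max_ab</code> is modelled as an <code>Option</code> because the text refers to <code>max_ab.is_some()</code>.</p>

```rust
/// True when abundance filtering is active at all.
fn filtering_enabled(min_ab: u32, max_ab: Option<u32>) -> bool {
    min_ab > 1 || max_ab.is_some()
}

/// True when a k-mer with abundance `ab` satisfies the bounds.
/// An absent upper bound accepts every abundance at or above `min_ab`.
fn passes_abundance(ab: u32, min_ab: u32, max_ab: Option<u32>) -> bool {
    ab >= min_ab && max_ab.map_or(true, |max| ab <= max)
}
```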
<p>The MPHF is built in two passes over <code>unitigs.bin</code>: parallel pass for <code>mphf.bin</code>, sequential pass for <code>evidence.bin</code> and payload. The exact kmer count is available from the unitig index (<code>unitigs.bin.idx</code>) before the passes begin.</p> <p><strong>Graph build and unitig write:</strong></p>
<p><strong>Exact verification via unitig evidence:</strong></p> <p>The surviving kmers are fed into <code>GraphDeBruijn</code>, which computes degrees and yields unitigs. Unitigs are written to <code>layer_0/unitigs.bin</code> via a <code>UnitigFileWriter</code>.</p>
<p><code>unitigs.bin</code> serves as the evidence structure. The MPHF maps every input to <code>[0, N)</code> including absent kmers — the unitig read-back (via <code>evidence.bin</code>) is the only correct membership test.</p> <p><strong>MPHF and evidence build:</strong></p>
<p><code>Layer::build</code> (membership-only) or <code>Layer::&lt;PersistentCompactIntMatrix&gt;::build</code> (with counts) is called next. Internally, <code>MphfLayer::build</code> performs two passes:</p>
<ol>
<li><strong>Pass 1 (parallel):</strong> build <code>unitigs.bin.idx</code> (block size = 2^<code>block_bits</code>) then construct the MPHF from all canonical kmers in <code>unitigs.bin</code>; store to <code>mphf.bin</code>.</li>
<li><strong>Pass 2 (sequential):</strong> for each kmer in <code>unitigs.bin</code>, compute its slot and write <code>evidence.bin</code> (<code>chunk_id: 25 bits | rank: 7 bits</code> packed into a <code>u32</code>); also invoke the payload callback (<code>fill_slot</code>) to populate <code>counts/</code> if <code>with_counts</code>.</li>
</ol>
<p>After <code>Layer::build</code> completes, <code>layer_meta.json</code> records <code>EvidenceKind::Exact</code>.</p>
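<p>The evidence entry written in pass 2 packs two fields into one <code>u32</code>. The sketch below assumes that <code>chunk_id</code> occupies the high 25 bits and <code>rank</code> the low 7 bits; the text gives the field widths but not the bit order, so that placement and the function names are assumptions.</p>

```rust
const CHUNK_BITS: u32 = 25;
const RANK_BITS: u32 = 7;

/// Packs (chunk_id, rank) into a u32, or returns None if a field overflows.
/// Assumed layout: chunk_id in the high 25 bits, rank in the low 7 bits.
fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id >= (1 << CHUNK_BITS) || rank >= (1 << RANK_BITS) {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Inverse of `pack_evidence`: returns (chunk_id, rank).
fn unpack_evidence(entry: u32) -> (u32, u32) {
    (entry >> RANK_BITS, entry & ((1 << RANK_BITS) - 1))
}
```

<p>With 7 bits, <code>rank</code> addresses up to 128 positions inside a chunk, and 25 bits allow about 33.5 million chunks per layer.</p>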
<p><strong>Approximate evidence override:</strong></p>
<p>If <code>evidence</code> is <code>EvidenceKind::Approx { b, z }</code>, <code>build_approx_evidence</code> is called immediately after <code>Layer::build</code>. It supersedes the exact evidence with <code>fingerprint.bin</code> (b-bit hash per slot; <code>evidence.bin</code> stays on disk but is no longer read) and rewrites <code>layer_meta.json</code> with <code>EvidenceKind::Approx { b, z }</code>. No <code>.idx</code> file is needed at query time in this mode.</p>
<div class="highlight"><pre><span></span><code>// Exact path → evidence.bin + unitigs.bin.idx + layer_meta.json(Exact)
// Approx path → fingerprint.bin + layer_meta.json(Approx{b,z})
// (evidence.bin left on disk but not used)
</code></pre></div>
<p><strong>Partition metadata:</strong></p>
<p>After all layer files are written, <code>PartitionMeta { n_layers: 1 }</code> is serialised to <code>index/meta.json</code> inside the partition directory. This file is required by <code>LayeredMap::open</code> for subsequent merge operations.</p>
<p><strong>File layout per partition after phase 6:</strong></p>
<div class="highlight"><pre><span></span><code>part_XXXXX/
index/
meta.json ← PartitionMeta { n_layers: 1 }
layer_0/
unitigs.bin ← permanent evidence (all modes)
unitigs.bin.idx ← block index (exact mode only)
mphf.bin ← MPHF
evidence.bin ← exact evidence (exact mode)
fingerprint.bin ← b-bit fingerprints (approx mode)
layer_meta.json ← EvidenceKind tag
counts/ ← PersistentCompactIntMatrix (with_counts only)
</code></pre></div>
<p><strong>Cleanup:</strong> unless <code>--keep-intermediate</code> is set, <code>remove_build_artifacts</code> deletes <code>dereplicated.skmer.zst</code>, <code>mphf1.bin</code>, and <code>counts1.bin</code> after all partitions are indexed.</p>
<p>See <a href="../obilayeredmap/">obilayeredmap</a> and <a href="../mphf/">MPHF selection</a> for data structure details.</p>
<p><strong>Query path (exact evidence):</strong></p>
<div class="highlight"><pre><span></span><code>query kmer q <div class="highlight"><pre><span></span><code>query kmer q
→ canonical_minimizer(q) → hash → PART → part_XXXXX/ → canonical_minimizer(q) → hash → PART → part_XXXXX/
→ MPHF(q) → slot s → MPHF(q) → slot s
@@ -1204,7 +1333,13 @@ branching / dead-end → unitig start or end
→ match : return payload[s] ← exact hit → match : return payload[s] ← exact hit
→ no match: kmer absent ← MPHF collision on absent kmer → no match: kmer absent ← MPHF collision on absent kmer
</code></pre></div> </code></pre></div>
<p><code>superkmers.bin.gz</code> is no longer needed at this point and can be deleted.</p> <p><strong>Query path (approximate evidence):</strong></p>
<div class="highlight"><pre><span></span><code>query kmer q
→ MPHF(q) → slot s
→ fingerprint[s] matches seq_hash(q)?
→ yes : probable hit (FP rate = 1/2^b per kmer, 1/2^(b·z) per z-window)
→ no : kmer absent
</code></pre></div>
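<p>The approximate check and its false-positive rate can be sketched as follows. The hash used here (a SplitMix64-style mixer) is only a placeholder for <code>seq_hash</code>, whose actual definition is not given in this section, and the function names are hypothetical. The rate formula is the one stated in the query path above.</p>

```rust
/// Placeholder 64-bit mixer (SplitMix64 finaliser); stands in for seq_hash.
fn mix64(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xbf58_476d_1ce4_e5b9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94d0_49bb_1331_11eb);
    x ^ (x >> 31)
}

/// b-bit fingerprint of a k-mer given as a raw u64.
fn fingerprint(kmer: u64, b: u32) -> u64 {
    assert!((1..=32).contains(&b), "b must be between 1 and 32 bits");
    mix64(kmer) & ((1u64 << b) - 1)
}

/// Approximate membership: compare the stored fingerprint at `slot`.
fn probably_present(fingerprints: &[u64], slot: usize, kmer: u64, b: u32) -> bool {
    fingerprints.get(slot).map_or(false, |&stored| stored == fingerprint(kmer, b))
}

/// False-positive rate for z consecutive k-mers, each checked with b bits.
fn false_positive_rate(b: u32, z: u32) -> f64 {
    0.5f64.powi((b * z) as i32)
}
```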
<div class="footnote"> <div class="footnote">
<hr /> <hr />
<ol> <ol>
+553 -4
@@ -64,7 +64,7 @@
<div data-md-component="skip"> <div data-md-component="skip">
<a href="#on-disk-collection-structure" class="md-skip"> <a href="#on-disk-index-layout" class="md-skip">
Skip to content Skip to content
</a> </a>
@@ -575,6 +575,24 @@
<label class="md-nav__link md-nav__link--active" for="__toc">
<span class="md-ellipsis">
On-disk storage
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<a href="./" class="md-nav__link md-nav__link--active"> <a href="./" class="md-nav__link md-nav__link--active">
@@ -592,6 +610,174 @@
</a> </a>
<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#directory-tree" class="md-nav__link">
<span class="md-ellipsis">
Directory tree
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#state-machine-sentinels" class="md-nav__link">
<span class="md-ellipsis">
State machine (sentinels)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#indexmeta-indexmeta" class="md-nav__link">
<span class="md-ellipsis">
index.meta (IndexMeta)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#layer-files" class="md-nav__link">
<span class="md-ellipsis">
Layer files
</span>
</a>
<nav class="md-nav" aria-label="Layer files">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#unitigsbin" class="md-nav__link">
<span class="md-ellipsis">
unitigs.bin
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#unitigsbinidx-exact-only" class="md-nav__link">
<span class="md-ellipsis">
unitigs.bin.idx (Exact only)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#mphfbin" class="md-nav__link">
<span class="md-ellipsis">
mphf.bin
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#layer_metajson-layermeta" class="md-nav__link">
<span class="md-ellipsis">
layer_meta.json (LayerMeta)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#evidencebin-exact" class="md-nav__link">
<span class="md-ellipsis">
evidence.bin (Exact)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#fingerprintbin-approx" class="md-nav__link">
<span class="md-ellipsis">
fingerprint.bin (Approx)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#counts-persistentcompactintmatrix" class="md-nav__link">
<span class="md-ellipsis">
counts/ (PersistentCompactIntMatrix)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#presence-persistentbitmatrix" class="md-nav__link">
<span class="md-ellipsis">
presence/ (PersistentBitMatrix)
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="#metajson-partitionmeta" class="md-nav__link">
<span class="md-ellipsis">
meta.json (PartitionMeta)
</span>
</a>
</li>
</ul>
</nav>
</li> </li>
@@ -659,6 +845,34 @@
<li class="md-nav__item">
<a href="../evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="../obilayeredmap/" class="md-nav__link"> <a href="../obilayeredmap/" class="md-nav__link">
@@ -737,6 +951,62 @@
<li class="md-nav__item">
<a href="../merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
@@ -874,6 +1144,163 @@
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#directory-tree" class="md-nav__link">
<span class="md-ellipsis">
Directory tree
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#state-machine-sentinels" class="md-nav__link">
<span class="md-ellipsis">
State machine (sentinels)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#indexmeta-indexmeta" class="md-nav__link">
<span class="md-ellipsis">
index.meta (IndexMeta)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#layer-files" class="md-nav__link">
<span class="md-ellipsis">
Layer files
</span>
</a>
<nav class="md-nav" aria-label="Layer files">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#unitigsbin" class="md-nav__link">
<span class="md-ellipsis">
unitigs.bin
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#unitigsbinidx-exact-only" class="md-nav__link">
<span class="md-ellipsis">
unitigs.bin.idx (Exact only)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#mphfbin" class="md-nav__link">
<span class="md-ellipsis">
mphf.bin
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#layer_metajson-layermeta" class="md-nav__link">
<span class="md-ellipsis">
layer_meta.json (LayerMeta)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#evidencebin-exact" class="md-nav__link">
<span class="md-ellipsis">
evidence.bin (Exact)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#fingerprintbin-approx" class="md-nav__link">
<span class="md-ellipsis">
fingerprint.bin (Approx)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#counts-persistentcompactintmatrix" class="md-nav__link">
<span class="md-ellipsis">
counts/ (PersistentCompactIntMatrix)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#presence-persistentbitmatrix" class="md-nav__link">
<span class="md-ellipsis">
presence/ (PersistentBitMatrix)
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="#metajson-partitionmeta" class="md-nav__link">
<span class="md-ellipsis">
meta.json (PartitionMeta)
</span>
</a>
</li>
</ul>
</nav> </nav>
</div> </div>
</div> </div>
@@ -889,9 +1316,131 @@
<h1 id="on-disk-collection-structure">On-disk collection structure</h1> <h1 id="on-disk-index-layout">On-disk index layout</h1>
<p>See <a href="../obilayeredmap/">obilayeredmap crate</a> for the current on-disk layout.</p> <h2 id="directory-tree">Directory tree</h2>
<p>The index root contains one <code>part_XXXXX/</code> directory per partition, each holding one or more <code>layer_N/</code> directories. Each layer directory contains <code>mphf.bin</code>, <code>unitigs.bin</code>, <code>unitigs.bin.idx</code>, <code>evidence.bin</code>, and optionally a <code>counts/</code> or <code>presence/</code> payload directory.</p> <div class="highlight"><pre><span></span><code>&lt;index_root&gt;/
index.meta ← JSON: IndexMeta
scatter.done ← sentinel: scatter phase complete
count.done ← sentinel: dereplicate + count complete
index.done ← sentinel: MPHF index fully built
spectrums/
&lt;label&gt;.json ← kmer frequency spectrum per genome
partitions/
part_00000/ ← one dir per partition (zero-padded 5 digits, 0..2^n_bits-1)
index/
meta.json ← PartitionMeta { n_layers }
layer_0/
unitigs.bin ← binary unitig sequences (2-bit packed)
unitigs.bin.idx ← block-sampled offset index (exact evidence only)
mphf.bin ← serialised PtrHash MPHF
layer_meta.json ← LayerMeta { evidence: EvidenceKind }
evidence.bin ← chunk_id:rank per MPHF slot (Exact only)
fingerprint.bin ← b-bit fingerprints per MPHF slot (Approx only)
counts/ ← PersistentCompactIntMatrix (if with_counts=true)
presence/ ← PersistentBitMatrix (if presence mode, merge)
layer_1/ ← added by merge; same structure as layer_0
layer_2/ …
part_00001/ …
</code></pre></div>
<h2 id="state-machine-sentinels">State machine (sentinels)</h2>
<p>The sentinels are touched atomically at the end of each pipeline stage.
A partial run (e.g. scatter interrupted) does not create that stage's sentinel; the state is
detected as the most advanced sentinel present.</p>
<table>
<thead>
<tr>
<th>State</th>
<th>Sentinel present</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Empty</code></td>
<td></td>
<td><code>index.meta</code> exists; scatter not started or interrupted</td>
</tr>
<tr>
<td><code>Scattered</code></td>
<td><code>scatter.done</code></td>
<td>All super-kmers routed to partition files</td>
</tr>
<tr>
<td><code>Counted</code></td>
<td><code>count.done</code></td>
<td>Partitions dereplicated; <code>spectrums/</code> written</td>
</tr>
<tr>
<td><code>Indexed</code></td>
<td><code>index.done</code></td>
<td>All MPHF layers built; index ready for queries</td>
</tr>
</tbody>
</table>
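<p>State detection from the sentinel files can be sketched as below. The sketch assumes that sentinels accumulate, so the state is the most advanced stage whose sentinel exists; the <code>IndexState</code> enum and both function names are hypothetical, while the file names come from the directory tree above.</p>

```rust
use std::path::Path;

/// Pipeline state as listed in the table (hypothetical type name).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum IndexState {
    Empty,
    Scattered,
    Counted,
    Indexed,
}

/// Pure decision logic: the most advanced sentinel wins.
fn state_from_sentinels(scatter_done: bool, count_done: bool, index_done: bool) -> IndexState {
    if index_done {
        IndexState::Indexed
    } else if count_done {
        IndexState::Counted
    } else if scatter_done {
        IndexState::Scattered
    } else {
        IndexState::Empty
    }
}

/// Reads the sentinel files under the index root.
fn detect_state(root: &Path) -> IndexState {
    state_from_sentinels(
        root.join("scatter.done").exists(),
        root.join("count.done").exists(),
        root.join("index.done").exists(),
    )
}
```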
<h2 id="indexmeta-indexmeta">index.meta (IndexMeta)</h2>
<div class="highlight"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">&quot;version&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;config&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">&quot;kmer_size&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">31</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;minimizer_size&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">11</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;n_bits&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;with_counts&quot;</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;evidence&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;Exact&quot;</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;block_bits&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="nt">&quot;genomes&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;label&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;genome_A&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;meta&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;species&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;Homo sapiens&quot;</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<p><code>n_bits</code> determines the partition count: <code>2^n_bits</code> directories under <code>partitions/</code>.</p>
<p><code>evidence</code> is either the string <code>"Exact"</code> or <code>{"Approx": {"b": 8, "z": 1}}</code>.</p>
<p><code>block_bits</code> controls the <code>.idx</code> granularity: one offset entry every <code>2^block_bits</code>
chunks. <code>block_bits=0</code> stores one entry per chunk (O(1) random access, largest <code>.idx</code>).</p>
<p><code>GenomeInfo.meta</code> is a free-form string→string map for categorical metadata (e.g.
taxonomy, sample origin). It is optional; defaults to empty.</p>
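<p>As a minimal sketch of the arithmetic described above, the partition count and the lookup granularity implied by <code>block_bits</code> can be computed as follows. The function names <code>partition_count</code> and <code>locate_chunk</code> are hypothetical and not part of obikmer; the second assumes that a reader seeks to the offset entry of the enclosing block and then skips forward record by record.</p>

```rust
/// Number of partition directories under `partitions/` for a given `n_bits`.
fn partition_count(n_bits: u32) -> u64 {
    1u64 << n_bits
}

/// Hypothetical helper (assumption, not obikmer API): splits a chunk id into
/// (index of the `.idx` offset entry to seek to,
///  number of records to skip forward from that offset).
fn locate_chunk(chunk_id: u64, block_bits: u32) -> (u64, u64) {
    let mask = (1u64 << block_bits) - 1;
    (chunk_id >> block_bits, chunk_id & mask)
}
```

<p>With <code>n_bits = 8</code> this gives 256 partitions. With <code>block_bits = 0</code> the skip count is always zero, which matches the O(1) random access noted above; larger values trade a smaller <code>.idx</code> for a short forward scan.</p>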
<h2 id="layer-files">Layer files</h2>
<h3 id="unitigsbin">unitigs.bin</h3>
<p>2-bit packed binary unitig sequences. Each record: 1 byte <code>seql_minus_k</code>
(nucleotide length − k), followed by <code>ceil((seql_minus_k + k) / 4)</code> bytes of
packed sequence. Long unitigs are transparently split into overlapping chunks
(k−1 nucleotide overlap) so no k-mer crosses a chunk boundary.</p>
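<p>The record layout can be illustrated with a short sketch that computes a record's size from its length byte (nucleotide length minus k) and walks a buffer of consecutive records. Both <code>record_len</code> and <code>record_offsets</code> are hypothetical names used for illustration only, not functions from the crate.</p>

```rust
/// Size in bytes of one record: the length byte plus the packed sequence.
/// `seql_minus_k` is the stored byte, i.e. nucleotide length minus k.
fn record_len(seql_minus_k: u8, k: usize) -> usize {
    let seql = seql_minus_k as usize + k; // nucleotide length
    1 + (seql + 3) / 4                    // ceil(seql / 4) packed bytes
}

/// Hypothetical walker: returns the start offset of every record in `buf`,
/// or `None` if the last record is truncated.
fn record_offsets(buf: &[u8], k: usize) -> Option<Vec<usize>> {
    let mut offsets = Vec::new();
    let mut pos = 0;
    while pos < buf.len() {
        offsets.push(pos);
        pos += record_len(buf[pos], k);
    }
    if pos == buf.len() { Some(offsets) } else { None }
}
```

<p>For k = 31, a record holding a single k-mer (<code>seql_minus_k = 0</code>) occupies 9 bytes, and the longest record a one-byte length allows (<code>seql_minus_k = 255</code>, 286 nucleotides) occupies 73 bytes.</p>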
<h3 id="unitigsbinidx-exact-only">unitigs.bin.idx (Exact only)</h3>
<p>Magic <code>UIX3</code>, little-endian header: <code>block_bits</code> (u32), <code>n_unitigs</code> (u32),
<code>n_kmers</code> (u64), then <code>ceil(n_unitigs / 2^block_bits) + 1</code> byte-offset entries
(u32 each, last entry is a sentinel past-end offset). Absent for Approx layers.</p>
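<p>A minimal sketch of reading this header, assuming the magic is stored as four ASCII bytes immediately followed by the fields with no padding (the text does not state this explicitly). <code>IdxHeader</code> and <code>parse_idx_header</code> are hypothetical names, not the crate's API.</p>

```rust
/// Hypothetical in-memory view of the `unitigs.bin.idx` header.
#[derive(Debug, PartialEq)]
struct IdxHeader {
    block_bits: u32,
    n_unitigs: u32,
    n_kmers: u64,
    /// Offset entries that follow the header, sentinel included.
    n_entries: usize,
}

fn le_u32(b: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(&b[..4]);
    u32::from_le_bytes(a)
}

fn le_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

/// Parses magic + header; assumes a 20-byte unpadded layout.
fn parse_idx_header(buf: &[u8]) -> Option<IdxHeader> {
    if buf.len() < 20 || !buf.starts_with(b"UIX3") {
        return None;
    }
    let block_bits = le_u32(&buf[4..8]);
    let n_unitigs = le_u32(&buf[8..12]);
    let n_kmers = le_u64(&buf[12..20]);
    // ceil(n_unitigs / 2^block_bits) + 1 entries, as described above.
    let block = 1u64.checked_shl(block_bits)?;
    let n_entries = ((n_unitigs as u64 + block - 1) / block) as usize + 1;
    Some(IdxHeader { block_bits, n_unitigs, n_kmers, n_entries })
}
```

<p>For example, <code>block_bits = 2</code> and <code>n_unitigs = 10</code> yield three block offsets plus the sentinel, so four <code>u32</code> entries follow the header.</p>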
<h3 id="mphfbin">mphf.bin</h3>
<p>PtrHash MPHF serialised with epserde. Maps canonical kmer (u64, left-aligned
2-bit) to a slot index in <code>[0, n_kmers)</code>.</p>
<h3 id="layer_metajson-layermeta">layer_meta.json (LayerMeta)</h3>
<p><div class="highlight"><pre><span></span><code><span class="p">{</span><span class="w"> </span><span class="nt">&quot;evidence&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;type&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;exact&quot;</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span>
</code></pre></div>
or
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="w"> </span><span class="nt">&quot;evidence&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;type&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;approx&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;b&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;z&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span>
</code></pre></div></p>
<h3 id="evidencebin-exact">evidence.bin (Exact)</h3>
<p>One <code>(chunk_id: u32, rank: u8)</code> record per MPHF slot, packed. Used to verify
that the kmer mapped to a slot is actually present: <code>unitigs.bin[chunk_id][rank]</code>
is re-read and compared against the query.</p>
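<p>The slot lookup can be sketched as below. This assumes that "packed" means 5 bytes per record with no padding and that <code>chunk_id</code> is little-endian; neither detail is stated above, so treat both as assumptions. <code>read_evidence</code> is a hypothetical name.</p>

```rust
/// Assumed record size: 4 bytes of chunk_id followed by 1 byte of rank.
const RECORD_SIZE: usize = 5;

/// Hypothetical reader: returns (chunk_id, rank) for an MPHF slot,
/// or `None` if the slot lies outside the buffer.
fn read_evidence(buf: &[u8], slot: usize) -> Option<(u32, u8)> {
    let start = slot.checked_mul(RECORD_SIZE)?;
    let end = start.checked_add(RECORD_SIZE)?;
    let rec = buf.get(start..end)?;
    let mut id = [0u8; 4];
    id.copy_from_slice(&rec[..4]);
    // Little-endian byte order is an assumption here.
    Some((u32::from_le_bytes(id), rec[4]))
}
```

<p>The returned pair identifies the chunk in <code>unitigs.bin</code> and the k-mer's rank inside it, which is what the verification step re-reads and compares against the query.</p>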
<h3 id="fingerprintbin-approx">fingerprint.bin (Approx)</h3>
<p><code>b</code>-bit fingerprint per MPHF slot derived from the kmer's sequence hash.
False-positive rate per query ≈ <code>1/2^b</code>. With Findere parameter <code>z ≥ 2</code>,
<code>z</code> consecutive k-mers must all match, reducing the effective FP rate to
approximately <code>W / 2^(b·z)</code> per read of length <code>L</code>
(where <code>W = L − k − z + 2</code> is the number of windows of <code>z</code> consecutive k-mers).</p>
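<p>A worked example of these estimates, as a small sketch. The function names are hypothetical; the formulas are the approximations given above, with the window count taken as the read length minus k minus z plus 2.</p>

```rust
/// Approximate false-positive rate of a single k-mer query: 1 / 2^b.
fn fp_per_kmer(b: u32) -> f64 {
    0.5f64.powi(b as i32)
}

/// Approximate false-positive rate per read of length `read_len`
/// when `z` consecutive k-mers must all match: W / 2^(b*z).
fn fp_per_read(read_len: u32, k: u32, b: u32, z: u32) -> f64 {
    // W = L - k - z + 2, clamped to zero for reads that are too short.
    let windows = (read_len + 2).saturating_sub(k + z) as f64;
    windows * 0.5f64.powi((b * z) as i32)
}
```

<p>With <code>b = 8</code>, a single query has a rate of about 1/256 ≈ 0.0039. For a 150-nucleotide read with k = 31 and <code>z = 2</code>, W = 119 and the per-read estimate is 119 / 65536 ≈ 0.0018.</p>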
<h3 id="counts-persistentcompactintmatrix">counts/ (PersistentCompactIntMatrix)</h3>
<p>Present when <code>with_counts=true</code>. One column per genome; each row holds the
per-genome k-mer count for the corresponding MPHF slot. Appended column-by-column
during indexing and merge.</p>
<h3 id="presence-persistentbitmatrix">presence/ (PersistentBitMatrix)</h3>
<p>Present when the layer was built in presence/absence mode (merge path).
One bit per genome per MPHF slot. Written during merge; never present on a
freshly indexed single-genome layer.</p>
<h2 id="metajson-partitionmeta">meta.json (PartitionMeta)</h2>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="w"> </span><span class="nt">&quot;n_layers&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="p">}</span>
</code></pre></div>
<p>Records how many <code>layer_N/</code> directories exist under <code>index/</code>. Incremented by
each merge that adds a layer.</p>
+125 -58
View File
@@ -751,6 +751,34 @@
<li class="md-nav__item">
<a href="../evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="../obilayeredmap/" class="md-nav__link"> <a href="../obilayeredmap/" class="md-nav__link">
@@ -829,6 +857,62 @@
<li class="md-nav__item">
<a href="../merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
@@ -1046,61 +1130,49 @@
<h1 id="superkmer-implementation">SuperKmer — implementation</h1> <h1 id="superkmer-implementation">SuperKmer — implementation</h1>
<h2 id="memory-layout">Memory layout</h2> <h2 id="memory-layout">Memory layout</h2>
<p>A super-kmer is stored as a <strong>32-bit header</strong> followed by a <strong>byte-aligned nucleotide sequence</strong> (2 bits/base, nucleotide 0 at the MSB of the first byte):</p> <p><code>SuperKmer</code> holds two separate fields:</p>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">SuperKmer</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">pub</span><span class="p">(</span><span class="k">crate</span><span class="p">)</span><span class="w"> </span><span class="n">count</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">,</span>
<span class="w"> </span><span class="k">pub</span><span class="p">(</span><span class="k">crate</span><span class="p">)</span><span class="w"> </span><span class="n">inner</span><span class="p">:</span><span class="w"> </span><span class="nc">PackedSeq</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div>
<p><code>PackedSeq</code> stores a 2-bit packed DNA sequence as a heap-allocated <code>Box&lt;[u8]&gt;</code> plus a <code>tail: u8</code> field:</p>
<table> <table>
<thead> <thead>
<tr> <tr>
<th>Field</th> <th>Field</th>
<th>Bits</th> <th>Type</th>
<th>Role</th> <th>Role</th>
</tr> </tr>
</thead> </thead>
<tbody> <tbody>
<tr> <tr>
<td>COUNT</td> <td><code>tail</code></td>
<td>24</td> <td><code>u8</code></td>
<td>Occurrence count (≤ 16 M)</td> <td>Number of valid nucleotides in the last byte: 0 encodes 4, 1–3 are identity</td>
</tr> </tr>
<tr> <tr>
<td>NKMERS</td> <td><code>seq</code></td>
<td>8</td> <td><code>Box&lt;[u8]&gt;</code></td>
<td>Number of kmers (= seq_length − k + 1, range 1–255)</td> <td>2-bit packed bytes, nucleotide 0 at bits 7–6 of <code>seq[0]</code></td>
</tr> </tr>
</tbody> </tbody>
</table> </table>
<p>Bit layout (MSB to LSB): <code>[31:8] COUNT [7:0] NKMERS</code></p> <p>Nucleotide length is recovered without storing it explicitly:</p>
<p>NKMERS is stored as a raw <code>u8</code> in <strong>kmer units</strong>, not nucleotides. The nucleotide length is recovered as <code>NKMERS + k − 1</code>. This avoids the awkward wrapping convention (<code>0 = 256</code>) that would be needed if nucleotide length were stored directly, and gains k−1 = 30 units of headroom:</p> <div class="highlight"><pre><span></span><code>seql = (seq.len() - 1) * 4 + tail_count(tail)
<table> </code></pre></div>
<thead> <p>There is no packed header word — <code>count</code> and the sequence live in separate fields.</p>
<tr> <p>The on-disk binary format (produced by <code>write_to_binary</code>) is:</p>
<th>unit</th> <div class="highlight"><pre><span></span><code>[varint(count)] [u8: seql − k] [packed bytes…]
<th>u8 covers</th> </code></pre></div>
<th>max nucleotides</th> <p><code>seql − k</code> fits in a <code>u8</code> when <code>n_kmers = seql − k + 1 ≤ MAX_KMERS_PER_CHUNK (= 256)</code>. If a super-kmer exceeds 256 kmers, <code>write_to_binary</code> splits it into overlapping chunks (k−1 nucleotide overlap, same count per chunk), each a self-contained record readable by <code>read_from_binary</code>.</p>
</tr> <p>The public accessors operate on the struct fields directly:</p>
</thead> <div class="highlight"><pre><span></span><code><span class="k">fn</span><span class="w"> </span><span class="nf">seql</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">inner</span><span class="p">.</span><span class="n">seql</span><span class="p">()</span><span class="w"> </span><span class="p">}</span>
<tbody> <span class="k">fn</span><span class="w"> </span><span class="nf">count</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">u32</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">count</span><span class="w"> </span><span class="p">}</span>
<tr> <span class="k">fn</span><span class="w"> </span><span class="nf">increment</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">count</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
<td>nucleotides</td> <span class="k">fn</span><span class="w"> </span><span class="nf">add</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">count</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">n</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
<td>255 nt</td> <span class="k">fn</span><span class="w"> </span><span class="nf">set_count</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">count</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
<td>225 kmers</td>
</tr>
<tr>
<td><strong>kmers</strong></td>
<td><strong>255 kmers</strong></td>
<td><strong>285 nt</strong></td>
</tr>
</tbody>
</table>
<p>The public accessors:</p>
<div class="highlight"><pre><span></span><code><span class="k">fn</span><span class="w"> </span><span class="nf">n_kmers</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0xFF</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">}</span>
<span class="k">fn</span><span class="w"> </span><span class="nf">seql</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">n_kmers</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">K</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">}</span>
<span class="k">fn</span><span class="w"> </span><span class="nf">count</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">u32</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="mi">8</span><span class="w"> </span><span class="p">}</span>
<span class="k">fn</span><span class="w"> </span><span class="nf">increment</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
<span class="k">fn</span><span class="w"> </span><span class="nf">add</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
<span class="k">fn</span><span class="w"> </span><span class="nf">set_count</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0xFF</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="mi">8</span><span class="p">);</span><span class="w"> </span><span class="p">}</span>
</code></pre></div> </code></pre></div>
<p>In practice, observed super-kmer lengths on metagenomic data (k=31) are below 55 nucleotides (≤ 25 kmers), far from the 255-kmer cap. If a super-kmer ever exceeds 255 kmers, it is split with a k−1 nucleotide overlap, preserving all kmers without duplication (identical mechanism to partition-boundary splits).</p>
<p>The sequence is always stored in canonical form (lexicographic minimum of forward and reverse complement), with nucleotide 0 at the MSB of the first byte. The byte array can be hashed directly without any adjustment.</p>
<h2 id="ascii-encoding-and-decoding">ASCII encoding and decoding</h2> <h2 id="ascii-encoding-and-decoding">ASCII encoding and decoding</h2>
<p>Two lookup tables handle ASCII ↔ 2-bit conversion:</p> <p>Two lookup tables handle ASCII ↔ 2-bit conversion:</p>
<ul> <ul>
@@ -1125,7 +1197,7 @@
</code></pre></div> </code></pre></div>
<p><code>REVCOMP4</code> is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values.</p> <p><code>REVCOMP4</code> is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values.</p>
<p><strong>Step 2 — realignment.</strong> After step 1, <code>padding = n × 8 − seql × 2</code> spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using <code>BitSlice&lt;u8, Msb0&gt;::rotate_left(padding)</code> from the <code>bitvec</code> crate, which is SIMD-accelerated. The trailing <code>padding</code> bits are then zeroed:</p> <p><strong>Step 2 — realignment.</strong> After step 1, <code>padding = n × 8 − seql × 2</code> spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using <code>BitSlice&lt;u8, Msb0&gt;::rotate_left(padding)</code> from the <code>bitvec</code> crate, which is SIMD-accelerated. The trailing <code>padding</code> bits are then zeroed:</p>
<div class="highlight"><pre><span></span><code><span class="kd">let</span><span class="w"> </span><span class="n">seql</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">n_kmers</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span> <div class="highlight"><pre><span></span><code><span class="kd">let</span><span class="w"> </span><span class="n">seql</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">seql</span><span class="p">();</span>
<span class="n">shift</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">8</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">seql</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="c1">// number of padding bits</span> <span class="n">shift</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">8</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">seql</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="c1">// number of padding bits</span>
<span class="n">bits</span><span class="p">.</span><span class="n">rotate_left</span><span class="p">(</span><span class="n">shift</span><span class="p">)</span> <span class="n">bits</span><span class="p">.</span><span class="n">rotate_left</span><span class="p">(</span><span class="n">shift</span><span class="p">)</span>
<span class="n">bits</span><span class="p">[</span><span class="n">len</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">shift</span><span class="o">..</span><span class="p">].</span><span class="n">fill</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span> <span class="n">bits</span><span class="p">[</span><span class="n">len</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">shift</span><span class="o">..</span><span class="p">].</span><span class="n">fill</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span>
@@ -1143,7 +1215,7 @@
</code></pre></div> </code></pre></div>
</div> </div>
<h2 id="minimizer-sliding-window">Minimizer sliding window</h2> <h2 id="minimizer-sliding-window">Minimizer sliding window</h2>
<p>Super-kmers are built by <code>SuperKmerIter</code> (crate <code>obiskbuilder</code>), which maintains the current minimizer with a <strong>monotonic deque</strong> over a sliding window of W = k − m + 1 m-mer positions.</p> <p>Super-kmers are built by <code>SuperKmerIter</code> (crate <code>obiskbuilder</code>), which tracks the current minimizer with a <strong>monotonic deque</strong> (<code>Ring&lt;MmerItem, 32&gt;</code>) inside <code>RollingStat</code>, a rolling-window entropy and minimizer tracker.</p>
<p>Each deque entry stores:</p> <p>Each deque entry stores:</p>
<table> <table>
<thead> <thead>
@@ -1167,20 +1239,11 @@
<tr> <tr>
<td><code>hash</code></td> <td><code>hash</code></td>
<td>u64</td> <td>u64</td>
<td><span class="arithmatex">\(H(\text{canonical})\)</span> — ordering key for random minimizer selection</td> <td><code>hash_kmer(canonical &lt;&lt; (64 − 2m))</code> — ordering key for random minimizer selection</td>
</tr> </tr>
</tbody> </tbody>
</table> </table>
<p>The hash <span class="arithmatex">\(H\)</span> is the seeded splitmix64 finalizer (see <a href="../../theory/minimizer/">Minimizer selection</a>):</p> <p>The hash uses the seeded splitmix64 finalizer (<code>mix64(raw ^ 0x9e3779b97f4a7c15)</code>), the same function as <code>kmer::hash_kmer</code>.</p>
<div class="highlight"><pre><span></span><code><span class="k">fn</span><span class="w"> </span><span class="nf">hash_mmer</span><span class="p">(</span><span class="n">canonical</span><span class="p">:</span><span class="w"> </span><span class="kt">u64</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="kt">u64</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">canonical</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="mh">0x9e3779b97f4a7c15</span><span class="p">;</span><span class="w"> </span><span class="c1">// seed: eliminates fixed point at 0</span>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="mi">30</span><span class="p">);</span>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">wrapping_mul</span><span class="p">(</span><span class="mh">0xbf58476d1ce4e5b9</span><span class="p">);</span>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="mi">27</span><span class="p">);</span>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">wrapping_mul</span><span class="p">(</span><span class="mh">0x94d049bb133111eb</span><span class="p">);</span>
<span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="mi">31</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div>
<p>On each new nucleotide, once the window is full, the deque is updated:</p> <p>On each new nucleotide, once the window is full, the deque is updated:</p>
<div class="admonition abstract"> <div class="admonition abstract">
<p class="admonition-title">Algorithm — minimizer deque update</p> <p class="admonition-title">Algorithm — minimizer deque update</p>
@@ -1196,17 +1259,21 @@
</code></pre></div> </code></pre></div>
</div> </div>
<p>The front of the deque is always the current minimizer. Because the deque is maintained in strictly increasing hash order, each entry is popped at most once — O(1) amortized per nucleotide.</p> <p>The front of the deque is always the current minimizer. Because the deque is maintained in strictly increasing hash order, each entry is popped at most once — O(1) amortized per nucleotide.</p>
<p>A super-kmer boundary is emitted when the minimizer changes: <code>deque.front.hash ≠ prev_hash</code>. The <code>canonical</code> field of the front entry is <strong>not</strong> used for boundary detection — that uses the hash alone. The canonical value is stored so that the partition key <span class="arithmatex">\(H(\text{canonical})\)</span> can be recomputed independently at routing time from the stored <code>minimizer_pos</code>, without inheriting the minimum-order-statistic bias (see <a href="../../theory/minimizer/#partition-key-independence">Minimizer selection — partition key independence</a>).</p> <p>A super-kmer boundary is emitted when the minimizer changes: <code>current_minimizer != prev_minimizer</code>. <code>SuperKmerIter</code> also emits a boundary when:</p>
<ul>
<li>entropy of the current k-mer falls at or below the threshold θ (cursor retreated by k−1)</li>
<li>super-kmer length reaches 256 nucleotides (cursor retreated by k)</li>
</ul>
<h2 id="kmer-extraction">Kmer extraction</h2> <h2 id="kmer-extraction">Kmer extraction</h2>
<p>A k-mer is extracted from a super-kmer with <code>SuperKmer::kmer(i, k)</code>, which returns a <code>Kmer</code> — a left-aligned <code>u64</code> newtype (see <a href="../kmer/">Kmer implementation</a>):</p> <p>A k-mer is extracted from a super-kmer with <code>SuperKmer::kmer(i)</code>, which delegates to <code>PackedSeq::extract::&lt;KLen&gt;(i)</code> and returns a <code>Kmer</code> — a left-aligned <code>u64</code> newtype (see <a href="../kmer/">Kmer implementation</a>):</p>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">kmer</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nb">Result</span><span class="o">&lt;</span><span class="n">Kmer</span><span class="p">,</span><span class="w"> </span><span class="n">KmerError</span><span class="o">&gt;</span> <div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">kmer</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nb">Result</span><span class="o">&lt;</span><span class="n">Kmer</span><span class="p">,</span><span class="w"> </span><span class="n">KmerError</span><span class="o">&gt;</span>
</code></pre></div> </code></pre></div>
<p>The bit slice <code>seq[i*2 .. (i+k)*2]</code> (Msb0 order) is loaded as a big-endian <code>u64</code> via <code>bitvec::load_be</code>, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.</p> <p>The bit slice <code>seq[i*2 .. (i+k)*2]</code> (Msb0 order) is loaded as a <code>u64</code> via <code>bitvec::load_be</code>, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.</p>
<hr /> <hr />
<div class="admonition abstract"> <div class="admonition abstract">
<p class="admonition-title">Algorithm — Super-kmer reverse complement</p> <p class="admonition-title">Algorithm — Super-kmer reverse complement</p>
<div class="highlight"><pre><span></span><code>procedure SuperKmerRevcomp(seq, SEQL): <div class="highlight"><pre><span></span><code>procedure SuperKmerRevcomp(seq, SEQL):
seql ← NKMERS + k − 1 -- nucleotide length seql ← nucleotide length
n ← ⌈seql / 4⌉ -- number of bytes n ← ⌈seql / 4⌉ -- number of bytes
shift ← n × 8 − seql × 2 -- padding bits to flush shift ← n × 8 − seql × 2 -- padding bits to flush
+294 -1
View File
@@ -213,6 +213,17 @@
</label> </label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix> <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#subcommands" class="md-nav__link">
<span class="md-ellipsis">
Subcommands
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="#constraints" class="md-nav__link"> <a href="#constraints" class="md-nav__link">
<span class="md-ellipsis"> <span class="md-ellipsis">
@@ -222,6 +233,28 @@
</span> </span>
</a> </a>
</li>
<li class="md-nav__item">
<a href="#parameter-constraints-enforced-at-cli" class="md-nav__link">
<span class="md-ellipsis">
Parameter constraints (enforced at CLI)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#genome-label-constraints" class="md-nav__link">
<span class="md-ellipsis">
Genome label constraints
</span>
</a>
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
@@ -714,6 +747,34 @@
<li class="md-nav__item">
<a href="implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="implementation/obilayeredmap/" class="md-nav__link"> <a href="implementation/obilayeredmap/" class="md-nav__link">
@@ -792,6 +853,62 @@
<li class="md-nav__item">
<a href="implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
@@ -935,6 +1052,17 @@
</label> </label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix> <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#subcommands" class="md-nav__link">
<span class="md-ellipsis">
Subcommands
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="#constraints" class="md-nav__link"> <a href="#constraints" class="md-nav__link">
<span class="md-ellipsis"> <span class="md-ellipsis">
@@ -944,6 +1072,28 @@
</span> </span>
</a> </a>
</li>
<li class="md-nav__item">
<a href="#parameter-constraints-enforced-at-cli" class="md-nav__link">
<span class="md-ellipsis">
Parameter constraints (enforced at CLI)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#genome-label-constraints" class="md-nav__link">
<span class="md-ellipsis">
Genome label constraints
</span>
</a>
</li> </li>
<li class="md-nav__item"> <li class="md-nav__item">
@@ -976,12 +1126,155 @@
<h1 id="obikmer">obikmer</h1> <h1 id="obikmer">obikmer</h1>
<p><code>obikmer</code> is a Rust tool for manipulation, counting, indexing, and set operations on DNA sequences represented as kmer sets.</p> <p><code>obikmer</code> is a Rust tool for manipulation, counting, indexing, and set operations on DNA sequences represented as kmer sets.</p>
<h2 id="subcommands">Subcommands</h2>
<table>
<thead>
<tr>
<th>Subcommand</th>
<th>Purpose</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>superkmer</code></td>
<td>Extract super-kmers from a sequence file and write to stdout</td>
</tr>
<tr>
<td><code>index</code></td>
<td>Build a complete genome index (scatter → dereplicate → count → layered MPHF)</td>
</tr>
<tr>
<td><code>merge</code></td>
<td>Merge multiple built indexes into one</td>
</tr>
<tr>
<td><code>rebuild</code></td>
<td>Filter and compact an existing index into a new single-layer index; supports ingroup/outgroup predicates on genome metadata</td>
</tr>
<tr>
<td><code>query</code></td>
<td>Query an index with sequences and annotate matches</td>
</tr>
<tr>
<td><code>dump</code></td>
<td>Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the same ingroup/outgroup filtering as <code>rebuild</code></td>
</tr>
<tr>
<td><code>annotate</code></td>
<td>Add or update genome metadata from a CSV file; or dump metadata as CSV</td>
</tr>
<tr>
<td><code>distance</code></td>
<td>Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees</td>
</tr>
<tr>
<td><code>unitig</code></td>
<td>Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the same ingroup/outgroup filtering as <code>rebuild</code></td>
</tr>
<tr>
<td><code>estimate</code></td>
<td>Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing</td>
</tr>
<tr>
<td><code>reindex</code></td>
<td>Convert an index's evidence in-place: exact ↔ approx</td>
</tr>
<tr>
<td><code>utils</code></td>
<td>Miscellaneous index utilities: <code>--new-label NEW=OLD</code> renames a genome label; <code>--upgrade-index</code> adds missing <code>layer_meta.json</code> to old indexes</td>
</tr>
<tr>
<td><code>pack</code></td>
<td>Pack per-column matrix files into single-file format to reduce query I/O</td>
</tr>
</tbody>
</table>
<h2 id="constraints">Constraints</h2> <h2 id="constraints">Constraints</h2>
<ul> <ul>
<li>Target scale: individual genome datasets, tens of Gbases</li> <li>Target scale: individual genome datasets, tens of Gbases</li>
<li>Maximum efficiency in computation, memory, and disk usage</li> <li>Maximum efficiency in computation, memory, and disk usage</li>
<li>Input formats: FASTA, FASTQ, gzip, streaming stdin</li> <li>k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)</li>
<li>Canonical form: <code>min(kmer, revcomp(kmer))</code> reduces strand-symmetric space by half</li>
<li>Input formats for <code>index</code>/<code>superkmer</code>: FASTA (<code>.fa</code>, <code>.fasta</code>), FASTQ (<code>.fq</code>, <code>.fastq</code>), GenBank flat file (<code>.gb</code>, <code>.gbk</code>, <code>.gbff</code>), all optionally gzip-compressed; directories expanded recursively; streaming stdin via <code>-</code></li>
<li>Input formats for <code>query</code>: FASTA, FASTQ, optionally gzip-compressed; streaming stdin via <code>-</code></li>
</ul> </ul>
<h2 id="parameter-constraints-enforced-at-cli">Parameter constraints (enforced at CLI)</h2>
<p>All constraints below are checked by <code>CommonArgs::validate()</code> at the start of <code>superkmer</code> and <code>index</code>. Invalid values exit immediately with an error.</p>
<table>
<thead>
<tr>
<th>Parameter</th>
<th>Constraint</th>
<th>Reason</th>
</tr>
</thead>
<tbody>
<tr>
<td>k (<code>--kmer-size</code>)</td>
<td>odd</td>
<td>even k allows palindromic k-mers: kmer == revcomp(kmer), breaking the canonical form invariant</td>
</tr>
<tr>
<td>k (<code>--kmer-size</code>)</td>
<td>k ∈ [11, 31]</td>
<td>k &gt; 32 overflows u64 at 2 bits/base, and 31 is the largest odd value below that limit; k &lt; 11 gives insufficient specificity</td>
</tr>
<tr>
<td>m (<code>--minimizer-size</code>)</td>
<td>odd</td>
<td>same palindrome argument as k</td>
</tr>
<tr>
<td>m (<code>--minimizer-size</code>)</td>
<td>3 ≤ m ≤ k−1</td>
<td>minimizer must be strictly shorter than the kmer</td>
</tr>
<tr>
<td>z (<code>-z</code>, Findere, <code>index --approx</code> only)</td>
<td>z ≤ k−1</td>
<td>effective indexed kmer size is k−z+1; z ≥ k would make it ≤ 1</td>
</tr>
</tbody>
</table>
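<p>The table can be read as a short sequence of checks. The sketch below is a hypothetical restatement in Rust, not the body of <code>CommonArgs::validate()</code>: the function name, its signature, and the error messages are assumptions made for illustration.</p>

```rust
/// Hypothetical restatement of the constraint table above.
/// `z` is `None` unless approximate (Findere) indexing is requested.
fn validate_params(k: usize, m: usize, z: Option<usize>) -> Result<(), String> {
    if k % 2 == 0 {
        return Err(format!("k = {k} must be odd"));
    }
    if !(11..=31).contains(&k) {
        return Err(format!("k = {k} must lie in [11, 31]"));
    }
    if m % 2 == 0 {
        return Err(format!("m = {m} must be odd"));
    }
    // k >= 11 is guaranteed here, so `k - 1` cannot underflow.
    if m < 3 || m > k - 1 {
        return Err(format!("m = {m} must satisfy 3 <= m <= {}", k - 1));
    }
    if let Some(z) = z {
        if z > k - 1 {
            return Err(format!("z = {z} must satisfy z <= {}", k - 1));
        }
    }
    Ok(())
}
```

<p>Under these assumptions, <code>validate_params(31, 11, None)</code> is accepted, while an even or out-of-range k, an even m, or z ≥ k is rejected before any input is read.</p>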
<h2 id="genome-label-constraints">Genome label constraints</h2>
<p>Genome labels are arbitrary Unicode strings with the following restrictions:</p>
<table>
<thead>
<tr>
<th>Character</th>
<th>Forbidden</th>
<th>Reason</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>/</code></td>
<td>yes</td>
<td>filesystem path separator</td>
</tr>
<tr>
<td><code>=</code></td>
<td>yes</td>
<td><code>--new-label</code> parser separator</td>
</tr>
<tr>
<td><code>\0</code></td>
<td>yes</td>
<td>null byte</td>
</tr>
<tr>
<td><code>\n</code> <code>\r</code> <code>\t</code></td>
<td>yes</td>
<td>break CSV output</td>
</tr>
<tr>
<td>spaces</td>
<td><strong>allowed</strong></td>
<td>use shell quoting: <code>--new-label 'new label=old label'</code></td>
</tr>
</tbody>
</table>
<p>Empty labels are also rejected. Labels derived automatically from the index directory name (when <code>--label</code> is omitted) are not validated since they come from the filesystem and are already safe.</p>
<h2 id="priority-operations">Priority operations</h2> <h2 id="priority-operations">Priority operations</h2>
<ul> <ul>
<li>Kmer counting (frequencies)</li> <li>Kmer counting (frequencies)</li>
+87 -2
@@ -746,6 +746,34 @@
<li class="md-nav__item">
<a href="../implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="../implementation/obilayeredmap/" class="md-nav__link"> <a href="../implementation/obilayeredmap/" class="md-nav__link">
@@ -824,6 +852,62 @@
<li class="md-nav__item">
<a href="../implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
@@ -1038,11 +1122,12 @@
<h2 id="kmers">Kmers</h2> <h2 id="kmers">Kmers</h2>
<p>A <strong>kmer</strong> is a DNA subsequence of fixed length k. Two constraints govern the choice of k:</p> <p>A <strong>kmer</strong> is a DNA subsequence of fixed length k. Two constraints govern the choice of k:</p>
<ul> <ul>
<li><strong>k ∈ [11, 31]</strong>: the range ensures the kmer is long enough to be specific and short enough to fit in a single machine word.</li> <li><strong>k ∈ [11, 31]</strong>: the range ensures the kmer is long enough to be specific and short enough to fit in a single machine word (u64 at 2 bits/base requires k ≤ 32; k &lt; 11 yields insufficient specificity).</li>
<li><strong>k is odd</strong>: an odd-length sequence cannot equal its own reverse complement (no palindromes). This guarantees that the canonical form <code>min(kmer, revcomp(kmer))</code> is always strictly defined — the two orientations are always distinct — which is required for strand-independent counting.</li> <li><strong>k is odd</strong>: an odd-length sequence cannot equal its own reverse complement (no palindromes). This guarantees that the canonical form <code>min(kmer, revcomp(kmer))</code> is always strictly defined — the two orientations are always distinct — which is required for strand-independent counting.</li>
</ul> </ul>
<p>Both constraints are <strong>enforced at CLI entry</strong> by <code>CommonArgs::validate()</code> in <code>superkmer</code> and <code>index</code>. Passing an invalid k exits immediately with an error message.</p>
<h2 id="super-kmers">Super-kmers</h2> <h2 id="super-kmers">Super-kmers</h2>
<p>A <strong>super-kmer</strong> is a maximal run of consecutive kmers from a DNA read, each overlapping the next by k−1 nucleotides. Each kmer of the run carries the same <strong>canonical minimizer</strong>. The <strong>canonical minimizer</strong> of a kmer is the smallest value of <code>min(m-mer, revcomp(m-mer))</code> over all m-mers within the kmer (m &lt; k, m odd), with the constraint that <strong>non-degenerate m-mers are always preferred</strong> over degenerate ones. A degenerate m-mer is one composed of a single repeated nucleotide (all-A, all-C, all-G, or all-T); such m-mers are selected only if no non-degenerate candidate exists in the window.</p> <p>A <strong>super-kmer</strong> is a maximal run of consecutive kmers from a DNA read, each overlapping the next by k−1 nucleotides, sharing the same <strong>canonical minimizer</strong>. The <strong>canonical minimizer</strong> of a kmer is the m-mer (m &lt; k) whose canonical hash <code>hash_kmer(min(m-mer, revcomp(m-mer)))</code> is smallest over all m-mers in the kmer window. The hash function is a <code>mix64</code>-based bijection; selection is purely hash-ordered with no degeneracy filter. A super-kmer is capped at 256 nucleotides; a longer run is split at that boundary.</p>
<h3 id="canonical-super-kmers">Canonical super-kmers</h3> <h3 id="canonical-super-kmers">Canonical super-kmers</h3>
<p>A <strong>canonical super-kmer</strong> is the lexicographic minimum of a super-kmer and its reverse complement:</p> <p>A <strong>canonical super-kmer</strong> is the lexicographic minimum of a super-kmer and its reverse complement:</p>
<div class="highlight"><pre><span></span><code>canonical(super-kmer) = min(super-kmer, revcomp(super-kmer)) <div class="highlight"><pre><span></span><code>canonical(super-kmer) = min(super-kmer, revcomp(super-kmer))
Binary file not shown.
+93 -6
@@ -718,6 +718,34 @@
<li class="md-nav__item">
<a href="../../implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="../../implementation/obilayeredmap/" class="md-nav__link"> <a href="../../implementation/obilayeredmap/" class="md-nav__link">
@@ -796,6 +824,62 @@
<li class="md-nav__item">
<a href="../../implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
@@ -1010,17 +1094,20 @@
<p>The Watson-Crick complement of any base is its bitwise NOT on 2 bits: <code>complement(base) = ~base &amp; 0b11</code>.</p> <p>The Watson-Crick complement of any base is its bitwise NOT on 2 bits: <code>complement(base) = ~base &amp; 0b11</code>.</p>
<h2 id="kmer-encoding">Kmer encoding</h2> <h2 id="kmer-encoding">Kmer encoding</h2>
<p>A kmer fits in a single <code>u64</code>. Nucleotide 0 occupies bits 63 and 62, nucleotide i occupies bits 63−2i and 62−2i, and the low 64−2k bits are zero. Extraction of nucleotide i (0 ≤ i &lt; k): <code>(kmer &gt;&gt; (62 - 2*i)) &amp; 0b11</code>.</p> <p>A kmer fits in a single <code>u64</code>. Nucleotide 0 occupies bits 63 and 62, nucleotide i occupies bits 63−2i and 62−2i, and the low 64−2k bits are zero. Extraction of nucleotide i (0 ≤ i &lt; k): <code>(kmer &gt;&gt; (62 - 2*i)) &amp; 0b11</code>.</p>
<p>Reverse complement is computed via a <strong>16-bit lookup table</strong> (65 536 entries × 2 bytes = 128 KB, fits in L2 cache) storing the reverse-complement of every 8-base chunk.</p> <p>Reverse complement is computed by <strong>bit manipulation in four steps</strong>, with no lookup table:</p>
<div class="admonition abstract"> <div class="admonition abstract">
<p class="admonition-title">Algorithm — Kmer reverse complement</p> <p class="admonition-title">Algorithm — Kmer reverse complement</p>
<div class="highlight"><pre><span></span><code>procedure KmerRevcomp(kmer, k): <div class="highlight"><pre><span></span><code>procedure KmerRevcomp(kmer, k):
raw ← TABLE16[kmer &amp; 0xFFFF] &lt;&lt; 48 x ← ~kmer -- complement all bases
| TABLE16[(kmer &gt;&gt; 16) &amp; 0xFFFF] &lt;&lt; 32 x ← swap_bytes(x) -- reverse byte order
| TABLE16[(kmer &gt;&gt; 32) &amp; 0xFFFF] &lt;&lt; 16 x ← ((x &gt;&gt; 4) &amp; 0x0F0F0F0F0F0F0F0F)
| TABLE16[(kmer &gt;&gt; 48) &amp; 0xFFFF] | ((x &amp; 0x0F0F0F0F0F0F0F0F) &lt;&lt; 4) -- swap nibbles within each byte
return raw &lt;&lt; (64 - 2*k) x ← ((x &gt;&gt; 2) &amp; 0x3333333333333333)
| ((x &amp; 0x3333333333333333) &lt;&lt; 2) -- swap 2-bit pairs within each nibble
return x &lt;&lt; (64 - 2*k) -- re-align to MSB
</code></pre></div> </code></pre></div>
</div> </div>
<p>The three reorder passes together reverse the order of all 2-bit base codes across the 64-bit word. The bitwise NOT in the first step complements each base (A↔T, C↔G). The final left shift clears the low 64−2k padding bits.</p>
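<p>The four steps translate almost line for line into Rust. The sketch below assumes the 2-bit code A=00, C=01, G=10, T=11, which is consistent with the complement rule <code>~base &amp; 0b11</code> but is not stated on this page; the names <code>kmer_revcomp</code>, <code>canonical</code>, and <code>encode</code> are illustrative and not taken from the crate.</p>

```rust
/// Reverse complement of a left-aligned, 2-bit-encoded kmer of length `k`.
fn kmer_revcomp(kmer: u64, k: u32) -> u64 {
    debug_assert!((1..=32).contains(&k));
    let mut x = !kmer; // complement all bases (padding bits become ones)
    x = x.swap_bytes(); // reverse byte order
    // swap nibbles within each byte
    x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    // swap 2-bit pairs within each nibble
    x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    // the kmer now sits in the low 2k bits; re-align it to the MSB
    x << (64 - 2 * k)
}

/// Canonical form: the smaller of the kmer and its reverse complement.
fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(kmer_revcomp(kmer, k))
}

/// Test helper (assumed encoding): packs an ACGT string left-aligned into a u64.
fn encode(seq: &str) -> u64 {
    let mut v = 0u64;
    for (i, b) in seq.bytes().enumerate() {
        let code: u64 = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => panic!("non-ACGT base"),
        };
        v |= code << (62 - 2 * i);
    }
    v
}
```

<p>With this encoding, <code>kmer_revcomp(encode("AAC"), 3)</code> equals <code>encode("GTT")</code>, and applying the function twice returns the original kmer.</p>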
<p>The <strong>canonical form</strong> is the lexicographic minimum of the kmer and its reverse complement:</p> <p>The <strong>canonical form</strong> is the lexicographic minimum of the kmer and its reverse complement:</p>
<div class="highlight"><pre><span></span><code>canonical(kmer) = min(kmer, revcomp(kmer)) <div class="highlight"><pre><span></span><code>canonical(kmer) = min(kmer, revcomp(kmer))
</code></pre></div> </code></pre></div>
+85 -1
@@ -773,6 +773,34 @@
<li class="md-nav__item">
<a href="../../implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="../../implementation/obilayeredmap/" class="md-nav__link"> <a href="../../implementation/obilayeredmap/" class="md-nav__link">
@@ -851,6 +879,62 @@
<li class="md-nav__item">
<a href="../../implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
@@ -1109,7 +1193,7 @@
<h2 id="final-score">Final score</h2> <h2 id="final-score">Final score</h2>
<p>The filter computes <span class="arithmatex">\(\hat{H}(ws)\)</span> for each word size ws from 1 to ws_max and returns the <strong>minimum</strong>:</p> <p>The filter computes <span class="arithmatex">\(\hat{H}(ws)\)</span> for each word size ws from 1 to ws_max and returns the <strong>minimum</strong>:</p>
<div class="arithmatex">\[\text{entropy}(kmer) = \min_{ws=1}^{ws_{\max}} \hat{H}(ws)\]</div> <div class="arithmatex">\[\text{entropy}(kmer) = \min_{ws=1}^{ws_{\max}} \hat{H}(ws)\]</div>
<p>A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if <span class="arithmatex">\(\text{entropy}(kmer) \leq \theta\)</span>, where <span class="arithmatex">\(\theta\)</span> is a collection parameter. The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.</p> <p>A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if <span class="arithmatex">\(\text{entropy}(kmer) &lt; \theta\)</span>, where <span class="arithmatex">\(\theta\)</span> is a collection parameter (default 0.7). The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.</p>
<h2 id="interpretation-as-an-effective-number-of-classes">Interpretation as an effective number of classes</h2> <h2 id="interpretation-as-an-effective-number-of-classes">Interpretation as an effective number of classes</h2>
<p><span class="arithmatex">\(H_{\text{corr}}\)</span> is a standard Shannon entropy over raw words (after unfolding the equivalence classes), so the classical perplexity interpretation holds directly: <span class="arithmatex">\(N_{\text{eff}} = e^{H_{\text{corr}}}\)</span> is the number of equiprobable classes that would yield the same entropy.</p> <p><span class="arithmatex">\(H_{\text{corr}}\)</span> is a standard Shannon entropy over raw words (after unfolding the equivalence classes), so the classical perplexity interpretation holds directly: <span class="arithmatex">\(N_{\text{eff}} = e^{H_{\text{corr}}}\)</span> is the number of equiprobable classes that would yield the same entropy.</p>
<p>For the normalised score <span class="arithmatex">\(\hat{H}\)</span>, dividing by <span class="arithmatex">\(H_{\text{max}}\)</span> changes the logarithm base:</p> <p>For the normalised score <span class="arithmatex">\(\hat{H}\)</span>, dividing by <span class="arithmatex">\(H_{\text{max}}\)</span> changes the logarithm base:</p>
+84
@@ -718,6 +718,34 @@
<li class="md-nav__item">
<a href="../../implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="../../implementation/obilayeredmap/" class="md-nav__link"> <a href="../../implementation/obilayeredmap/" class="md-nav__link">
@@ -796,6 +824,62 @@
<li class="md-nav__item">
<a href="../../implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
+84
@@ -762,6 +762,34 @@
<li class="md-nav__item">
<a href="../../implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="../../implementation/obilayeredmap/" class="md-nav__link"> <a href="../../implementation/obilayeredmap/" class="md-nav__link">
@@ -840,6 +868,62 @@
<li class="md-nav__item">
<a href="../../implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul> </ul>
</nav> </nav>
+179
@@ -0,0 +1,179 @@
# NUMA-aware partition runner
## Problem
All partition-level parallel loops in obikindex currently fall into two
categories:
**Naive Rayon** — used in `build_layers`, `pack_matrices`, `dump`, `select`,
`stats`, `rebuild`, `reindex`:
```rust
(0..n).into_par_iter().for_each(|i| work(i));
```
Threads come from the global Rayon pool with no NUMA awareness. On
multi-socket machines this produces cross-socket memory traffic and degrades
performance super-linearly (see [NUMA-aware worker pools](numa_worker_pools.md)).
**Ad-hoc adaptive pool** — used in `merge`:
A bespoke implementation with pre-spawned workers, channel-based dispatch, and
activation control. It handles NUMA correctly but is not reusable.
Both cases should be replaced by a single generic mechanism.
## Unified model
The key insight is that **UMA is just the NUMA case with a single node**. The
runner always works the same way: one controller thread per node, each
independently managing its own workers with the same adaptive logic. The only
difference between UMA and NUMA is the number of nodes and whether workers are
pinned.
```
NUMA (k nodes)                          UMA (1 node)
controller-0    controller-1    …       controller-0
     │               │                       │
 workers[0]      workers[1]              workers[0]
  (pinned)        (pinned)             (global pool)
     └───────────────┴───────────────────────┘
                 shared work queue
```
On each node, the Rayon `ThreadPool` is pinned to that node's CPUs.
`pool.install()` ensures all internal Rayon calls (inside the work function)
use the node-local pool. Linux first-touch then places heap allocations in
local DRAM automatically.
On UMA the global Rayon pool is used directly — no pinning, no overhead.
## Adaptive mechanism
Each controller follows the same logic regardless of node count:
1. Pre-spawn `workers_per_node` dormant worker threads (blocked on `activate_rx`).
2. Activate the first worker immediately.
3. Loop on result channel with a `SPAWN_POLL` timeout:
   - On result: call `on_done`; check whether to activate the next worker.
   - On timeout: same check.
   - Activation criterion: `should_spawn_worker(active, global_efficiency, prev_efficiency)`.
4. Drop `activate_tx` when done — dormant workers exit cleanly.
**Global CPU efficiency** (`CpuSample`, reads `/proc/stat` on Linux) is used by
all controllers — no per-node measurement needed. The signal is coarser than
per-node efficiency but correct in practice: if any node saturates memory
bandwidth, the global efficiency drops and all controllers stop activating
workers. Using a standard portable primitive avoids platform-specific CPU
accounting and keeps the implementation clean.
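
A global efficiency signal of this kind can be derived from two snapshots of the aggregate `cpu` line in `/proc/stat`. The sketch below is only an illustration: `CpuTicks`, `parse_cpu_line`, and `efficiency` are hypothetical names, and defining efficiency as the non-idle fraction of elapsed ticks is an assumption, not necessarily what `CpuSample` computes.

```rust
/// Hypothetical snapshot of the aggregate `cpu` line (values in clock ticks).
#[derive(Debug, Clone, Copy, PartialEq)]
struct CpuTicks {
    busy: u64,
    total: u64,
}

/// Parses the text of /proc/stat; returns None if the `cpu` line is absent
/// or too short.
fn parse_cpu_line(stat: &str) -> Option<CpuTicks> {
    let line = stat.lines().find(|l| l.starts_with("cpu "))?;
    let fields: Vec<u64> = line
        .split_whitespace()
        .skip(1)
        .filter_map(|f| f.parse().ok())
        .collect();
    if fields.len() < 5 {
        return None;
    }
    // Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    // Guest time is already counted in user/nice, so only the first 8 are summed.
    let total: u64 = fields.iter().take(8).sum();
    let idle = fields[3] + fields[4]; // idle + iowait
    Some(CpuTicks { busy: total - idle, total })
}

/// Fraction of elapsed ticks spent non-idle between two snapshots (0.0 to 1.0).
fn efficiency(prev: CpuTicks, now: CpuTicks) -> f64 {
    let dt = now.total.saturating_sub(prev.total);
    if dt == 0 {
        return 0.0;
    }
    now.busy.saturating_sub(prev.busy) as f64 / dt as f64
}
```

A controller would take one snapshot per poll interval and pass the current and previous efficiency values to its activation check.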
## Proposed API
```rust
pub struct PartitionRunner {
    // One entry per NUMA node; one entry total on UMA.
    nodes: Vec<NodeConfig>,
}

struct NodeConfig {
    pool: Option<Arc<rayon::ThreadPool>>, // None = global Rayon pool (UMA)
    cpu_ids: Vec<usize>,                  // empty = no pinning (UMA)
    max_workers: usize,
}

impl PartitionRunner {
    /// Detect topology and build the runner.
    /// Returns a single-node runner on UMA / macOS / hwloc failure.
    pub fn new() -> Self;

    /// Run `f(i)` for every index in `order`, collecting results.
    ///
    /// `on_done(i, result, elapsed)` is called under an internal mutex as
    /// each partition completes — use it for progress bars and aggregation.
    /// The runner serialises all calls to `on_done` via an internal
    /// `Arc<Mutex<C>>`, so no `Sync` bound is required on the callback.
    /// `Send` is required because the Arc clone crosses thread boundaries.
    ///
    /// Serialisation is free in practice: a partition takes seconds to
    /// minutes; the callback takes microseconds. Contention is negligible.
    ///
    /// Returns the first error from `f`, if any.
    pub fn run<F, R, E, C>(
        &self,
        order: &[usize],
        f: F,
        on_done: C,
    ) -> Result<(), E>
    where
        F: Fn(usize) -> Result<R, E> + Send + Sync,
        R: Send,
        E: Send,
        C: FnMut(usize, R, Duration) + Send; // Send required, Sync is not
}
```
`order` is caller-supplied so each command chooses its scheduling strategy:
largest-first for `merge`, sequential for `build_layers`, etc.
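
As an illustration of a caller-supplied schedule, a largest-first `order` can be built from per-partition sizes with a stable sort. This is a sketch only: `largest_first` and the `sizes` slice are hypothetical and not part of the proposed API.

```rust
use std::cmp::Reverse;

/// Returns partition indices sorted by decreasing size.
/// The sort is stable, so equal-sized partitions keep their index order.
fn largest_first(sizes: &[u64]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..sizes.len()).collect();
    order.sort_by_key(|&i| Reverse(sizes[i]));
    order
}
```

Scheduling the largest partitions first keeps the longest jobs from starting last, which tends to shorten the tail of the run when partition sizes vary widely.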
## Migration examples
### merge.rs (before: ~180 lines of bespoke machinery)
```rust
let runner = PartitionRunner::new();
runner.run(
    &order,
    |i| dst_partition.merge_partition(i, srcs, mode, n_dst_genomes, block_bits, evidence)
        .map_err(OKIError::Partition),
    |i, g_len, dur| {
        pb.inc(1);
        debug!("partition {i}: done in {:.1}s — {g_len} new kmers", dur.as_secs_f64());
        part_stats.push(PartStat { id: i, unitig_bytes: partition_sizes[i], g_len });
    },
)?;
```
### index.rs build_layers (before: naive into_par_iter)
```rust
let order: Vec<usize> = (0..n).collect();
let runner = PartitionRunner::new();
runner.run(
    &order,
    |i| self.partition.build_index_layer(i, min_ab, max_ab, with_counts, &evidence, block_bits)
        .map_err(OKIError::Partition),
    |_, n_kmers, _| {
        total_kmers.fetch_add(n_kmers, Ordering::Relaxed);
        pb.inc(1);
    },
)?;
```
All other sites (`pack_matrices`, `dump`, `select`, etc.) follow the same
pattern.
## Placement
`PartitionRunner` lives in `obikindex/src/numa.rs` alongside `NumaSetup`.
It depends only on standard library primitives and Rayon — no new dependencies.
A single `PartitionRunner` instance can be built once per command invocation
and reused across multiple `run()` calls (e.g. `merge` runs
`merge_partitions` then `pack_matrices`).
## Open questions
- **Error handling**: `run` currently returns the first error; remaining errors
are dropped. A `Vec<E>` return would give complete diagnostics.
- **`workers_per_node` tuning**: currently `(cpus / 8).max(3).min(8)`, calibrated
for merge on BeeGFS. I/O-bound commands (`dump`, `select`) may benefit from
a higher value. A per-call override could be added to the API.
- **`on_done` ordering**: the runner serialises calls to `on_done` via an
internal `Arc<Mutex<C>>`. `Send` is required (the Arc clone crosses thread
boundaries); `Sync` is not (only one thread holds the lock at a time).
Contention is negligible because a partition takes seconds while the callback
takes microseconds. The callback is therefore simple to write (plain
`Vec::push`, plain `FnMut`) with no measurable performance cost.
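
The `workers_per_node` expression quoted in the open questions is easy to evaluate by hand. The sketch below wraps it in a hypothetical function and assumes that `cpus` is the CPU count of one node, which the text does not state explicitly.

```rust
/// The expression from the text: one worker per 8 CPUs, never fewer than 3
/// and never more than 8.
fn workers_per_node(cpus: usize) -> usize {
    (cpus / 8).max(3).min(8)
}
```

For 24 CPUs this gives 3 workers, for 40 CPUs it gives 5, and from 64 CPUs upward the result saturates at 8, so any I/O-oriented tuning would have to change the bounds rather than the divisor alone.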
+97
@@ -0,0 +1,97 @@
# NUMA-aware worker pools for merge
## Problem
The merge command's bottleneck is `compute_degrees` in `obidebruinj`: a random pointer-chase over 20 to 70 M node hash maps that saturates DRAM bandwidth. When multiple partition workers run concurrently, they contend for the shared memory bus, causing super-linear slowdown (measured: 0.016 µs/node solo → 0.95 µs/node with 45 concurrent workers, ×60 degradation).
Modern HPC nodes are multi-socket NUMA machines (observed: 2 sockets × 4 NUMA nodes × 24 cores = 192 cores). Cross-NUMA memory traffic compounds the contention:
- Full 192-core run: ~15 min/partition (×10 worse than M3 Mac)
- `taskset` restricted to 4 NUMA nodes (96 cores): ~90 s/partition
- OAR job on 1 NUMA node (24 cores): ~80 s/partition, same throughput as 96 cores
**Conclusion**: the bottleneck is memory bandwidth per NUMA node, not core count. 24 cores on one NUMA node achieve the same throughput as 96 cores across four.
## Strategy
Run N worker groups in parallel, one per NUMA node, each with its own Rayon thread pool whose threads are pinned to the NUMA node's CPUs. Linux's first-touch policy then places graph allocations on local DRAM automatically — no explicit NUMA allocator needed.
Expected throughput: N × single-NUMA throughput. On the 8-NUMA-node HPC: 8 × ~80 s = 9 to 10 min total instead of >60 min with the current single-pool approach.
## Rayon thread pool isolation
Rayon provides `ThreadPool::install(|| { ... })`: any Rayon call (`par_iter`, `current_num_threads`, etc.) inside the closure uses *that* pool exclusively. Wrapping `merge_partition` in `pool.install()` redirects all downstream Rayon calls — including those in `debruijn.rs` and `partition.rs` — without touching those crates.
```rust
// worker thread, assigned to NUMA pool `pool`
pool.install(|| {
    dst_partition.merge_partition(i, srcs, mode, n_dst_genomes, block_bits, evidence)
})
```
`rayon::current_num_threads()` inside `merge_partition` will return the pool size (e.g. 24), not the global thread count — which is the right value for buffer sizing.
## Thread pinning
`ThreadPoolBuilder::spawn_handler` provides a hook executed for each thread at creation. Inside, `libc::sched_setaffinity` pins the thread to a CPU set:
```rust
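let cpus: Vec<usize> = numa_node_cpus(node); // from /sys/devices/system/node/nodeN/cpulist
rayon::ThreadPoolBuilder::new()
    .num_threads(cpus.len())
    .spawn_handler(move |thread| {
        // The handler runs once per worker thread, so each one gets its own copy.
        let cpus = cpus.clone();
        let mut b = std::thread::Builder::new();
        // Preserve the name and stack size that Rayon requested for this thread.
        if let Some(name) = thread.name() {
            b = b.name(name.to_owned());
        }
        if let Some(stack_size) = thread.stack_size() {
            b = b.stack_size(stack_size);
        }
        b.spawn(move || {
            pin_to_cpus(&cpus); // sched_setaffinity via libc
            thread.run()
        })?;
        Ok(())
    })
    .build()?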
```
NUMA topology is read from `/sys/devices/system/node/node*/cpulist` — no `libnuma` dependency required. If the `numa` crate is linked, `numa_available()` / `numa_run_on_node()` are an alternative.
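
The `cpulist` files use the kernel's list format, for example `0-23,96-119`. A parser for that format could look like the sketch below; `parse_cpulist` is a hypothetical helper, and reading the file and mapping node numbers to paths is left out.

```rust
/// Parses a sysfs cpulist string such as "0-3,8,10-11" into CPU ids.
/// Returns None on malformed input.
fn parse_cpulist(s: &str) -> Option<Vec<usize>> {
    let mut cpus = Vec::new();
    for part in s.trim().split(',').filter(|p| !p.is_empty()) {
        match part.split_once('-') {
            // A range "lo-hi", inclusive on both ends.
            Some((lo, hi)) => {
                let lo: usize = lo.trim().parse().ok()?;
                let hi: usize = hi.trim().parse().ok()?;
                if lo > hi {
                    return None;
                }
                cpus.extend(lo..=hi);
            }
            // A single CPU id.
            None => cpus.push(part.trim().parse().ok()?),
        }
    }
    Some(cpus)
}
```

The resulting vector can be used both as the pool size (`cpus.len()`) and as the affinity set passed to the pinning routine.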
## Memory locality
Linux allocates pages on the NUMA node of the thread that first writes them (first-touch policy). Once Rayon threads are pinned to node N, all graph data built by those threads lands on node N's DRAM. No changes to the allocator, no explicit `numa_alloc_onnode` calls.
## Adaptive spawn criterion
The current criterion uses `std::thread::available_parallelism()` (returns total cores = 192) and `max_workers = n_cores / 2`. With NUMA pools:
- `n_cores` per pool = cores per NUMA node (e.g. 24)
- `max_workers` per pool = pool size / 2 (e.g. 12)
- CPU efficiency is measured per pool, not globally
Each NUMA group runs its own independent adaptive pool. Workers are distributed across NUMA groups round-robin or by workload (partition assignment can be pre-split by NUMA group index).
## Required changes
| File | Change |
|------|--------|
| `obikindex/src/merge.rs` | Detect NUMA topology; build N `ThreadPool`s with pinned threads; assign each pre-spawned worker to a pool; wrap `merge_partition` in `pool.install()` |
| `obikindex/src/merge.rs` | Replace `available_parallelism()` with per-NUMA core count for spawn criterion |
| `obikpartitionner/src/merge_layer.rs` | No change — `merge_partition` already works inside any Rayon context |
| `obidebruinj/src/debruijn.rs` | No change — `par_iter` and `current_num_threads` are pool-context-aware |
| `obikpartitionner/src/partition.rs` | No change — same reason |
## Platform guard
NUMA pinning is Linux-only. The fallback is the current single global pool:
```rust
#[cfg(target_os = "linux")]
fn build_numa_pools() -> Option<Vec<rayon::ThreadPool>> { ... }
#[cfg(not(target_os = "linux"))]
fn build_numa_pools() -> Option<Vec<rayon::ThreadPool>> { None }
```
When `build_numa_pools()` returns `None` (macOS, UMA, or single-socket), `merge.rs` uses the existing code path unchanged.
## Open questions
- **Partition assignment**: split partitions by NUMA group up-front (static) or use a shared queue with per-group workers stealing from a common pool? Static split is simpler; stealing is better for load balance when partitions vary widely in size.
- **Intra-NUMA adaptive criterion**: with 24 cores and ~3–5 effective workers per NUMA node, the current marginal-gain criterion needs re-tuning or can be left as-is with per-pool `n_cores = 24`.
- **I/O**: partition data (unitig files) is on a shared filesystem. With 8 concurrent NUMA groups, I/O concurrency increases 8× — need to verify the filesystem (Lustre or local SSD) can absorb it without becoming the new bottleneck.
+105
@@ -0,0 +1,105 @@
# Rebuild / filter — column-first design
## Problem with the current two-pass design
`rebuild_partition` currently makes **two full passes** over source data:
**Pass 1** — read unitigs → MPHF lookup (source) → read row (108 values) → apply filter → push kmer into `GraphDeBruijn`, **discard row**.
**Pass 2** — read unitigs again → MPHF lookup again → read row again → for each passing kmer, look up slot in new MPHF → fill column builders.
Both passes do random access into the source matrix: for each kmer, the MPHF returns a slot, then we read 108 values scattered across 108 column positions. This is cache-hostile even with a packed matrix (`.pbmx`), because the matrix is column-major: consecutive row reads jump across the file.
## Memory budget
The `keep` bitvector costs **1 bit per slot**. With 256 partitions and realistic kmer counts, each partition holds at most a few tens of millions of slots, i.e. a few MB per bitvector. Even in the absolute worst case (800 M slots), it needs only 100 MB. This is negligible.
The `slot_map` option (Option B, 8–16 bytes per slot) is heavier but still bounded: at 15 M slots and 8 bytes per slot, that is 120 MB per partition, acceptable for a single worker.
## Key observation
**The filter operates on column values, not on kmers.** A filter like `--max-outgroup-count 0` only needs to know, for each slot, whether any outgroup column is non-zero. It does not need to know which kmer occupies that slot.
This means filtering can be done as a **sequential column scan** that produces a `keep: BitVec[n_slots]` — no MPHF lookups, no kmer knowledge, perfectly cache-friendly.
## Proposed single-scan design
### Step 1 — column scan → `keep` bitvector
```
for each column c in source matrix:
read column c sequentially (one mmap range)
update keep[slot] according to filter contribution of column c
```
For `GroupQuorumFilter` with ingroup/outgroup:
- ingroup columns: count presence per slot → `ingroup_count[slot]`
- outgroup columns: `keep[slot] &= (value[slot] == 0)` (early-exit possible)
Result: `keep: BitVec` of size `n_slots`, computed with purely sequential IO.
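A minimal sketch of this column scan, under simplifying assumptions: columns are plain `Vec<u32>` slices of equal length, the bitvector is a `Vec<bool>`, and the only ingroup constraint is a minimum presence count. The name `compute_keep` and its signature are hypothetical; the real filter and column types differ.

```rust
/// Hypothetical column-scan filter. Each column holds one count per slot.
/// A slot is kept if no outgroup column is non-zero and at least
/// `min_ingroup` ingroup columns are non-zero.
fn compute_keep(
    ingroup: &[Vec<u32>],
    outgroup: &[Vec<u32>],
    n_slots: usize,
    min_ingroup: u32,
) -> Vec<bool> {
    let mut keep = vec![true; n_slots];
    let mut ingroup_count = vec![0u32; n_slots];
    // Ingroup columns: sequential scan, count presence per slot.
    for col in ingroup {
        for (slot, &v) in col.iter().enumerate() {
            if v > 0 {
                ingroup_count[slot] += 1;
            }
        }
    }
    // Outgroup columns: any non-zero value clears the keep bit.
    for col in outgroup {
        for (slot, &v) in col.iter().enumerate() {
            keep[slot] &= v == 0;
        }
    }
    // Final pass: apply the ingroup quorum.
    for slot in 0..n_slots {
        keep[slot] &= ingroup_count[slot] >= min_ingroup;
    }
    keep
}
```

Every loop walks one column front to back, which is the access pattern the column-major layout favours.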
### Step 2 — unitig scan → kept kmers + new MPHF
```
for each kmer in unitig files:
old_slot = old_MPHF(kmer)
if keep[old_slot]:
push kmer into new GraphDeBruijn
record (old_slot, kmer) ← or just old_slot in order
```
Build new MPHF from `GraphDeBruijn` via `materialize_layer`.
### Step 3 — fill new matrix
Two sub-options:
**Option A — from recorded (old_slot, kmer) pairs:**
```
for each (old_slot, kmer) in recorded list:
new_slot = new_MPHF(kmer)
for each column c:
new_matrix[new_slot, c] = old_matrix[old_slot, c]
```
Memory cost: `n_kept × (8 + 8)` bytes for `(old_slot: usize, kmer: CanonicalKmer)`.
For species-specific filters, `n_kept` is small. For unfiltered rebuild, `n_kept = n_slots`.
**Option B — column-by-column copy using old→new slot mapping:**
Precompute `slot_map: Vec<Option<usize>>` of size `n_slots`:
- For each kmer in unitig file: `slot_map[old_MPHF(kmer)] = Some(new_MPHF(kmer))`
Then for each source column:
```
read source column sequentially
for each slot where slot_map[slot] = Some(new_slot):
write value to new column at new_slot
```
Memory cost: `n_slots × sizeof(usize)` for the slot map (one usize per source slot).
IO pattern: sequential read of each source column → random write into new column builders.
Option B avoids storing kmer values and works uniformly regardless of filter selectivity.
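The Option B copy loop can be sketched as follows, assuming one byte per value and an in-memory destination column. `remap_column` is a hypothetical name; the actual column builders may impose other constraints (see the open questions).

```rust
/// Hypothetical Option B copy: sequential read of one source column,
/// random write into the new column through the old -> new slot map.
fn remap_column(src: &[u8], slot_map: &[Option<usize>], n_new: usize) -> Vec<u8> {
    let mut dst = vec![0u8; n_new];
    for (old_slot, &value) in src.iter().enumerate() {
        // Slots filtered out have no entry in the map and are skipped.
        if let Some(new_slot) = slot_map[old_slot] {
            dst[new_slot] = value;
        }
    }
    dst
}
```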
## Comparison
| | Current | Proposed |
|---|---|---|
| Disk reads | 2× unitigs + 2× random matrix | 1× columns (sequential) + 1× unitigs |
| MPHF lookups (source) | 2× N_kmers | 1× N_kept (step 2) or 0 (option B, col scan only) |
| Cache behavior | poor (random row access) | good (sequential column scan) |
| Extra memory | none | slot_map (option B) or (old_slot, kmer) list (option A) |
## Files to modify
- `src/obikpartitionner/src/rebuild_layer.rs`: `rebuild_partition` and `iter_src_layers`
- Possibly `src/obicompactvec/` — add column iterator API if not already present
- `src/obilayeredmap/` — check if per-column sequential access is exposed on `SrcLayerData`
## Open questions
- Does `SrcLayerData` expose per-column sequential iteration, or only `lookup(kmer, n_genomes)` random access?
- For option B: are new column builders writable in random-slot order (i.e. `set_val(slot, value)` without sequential constraint)?
- For `GroupQuorumFilter` specifically: can the filter be decomposed into independent per-column contributions, or does it need the full row?
+279
@@ -0,0 +1,279 @@
# Kmer filtering and ingroup/outgroup predicates
The `filter`, `dump`, and `unitig` commands share the same filtering system,
implemented as a shared `FilterArgs` clap argument group embedded in each command
via `#[command(flatten)]`. Filters select k-mers based on per-genome quorum
counts, optionally restricted to **ingroup** and **outgroup** genome sets derived
from genome metadata. All rules described here apply identically to all three commands.
`filter` additionally accepts `--min-total-count` / `--max-total-count` filters
that operate on the sum of counts across all genomes.
## Predicate syntax
Each `--ingroup` and `--outgroup` flag takes a predicate of the form:
```
key OP value1|value2|…
```
| Operator | Meaning |
|----------|---------|
| `*` or `all` | wildcard — every genome matches unconditionally |
| `key=v1\|v2` | exact match — genome's `key` equals `v1` or `v2` |
| `key!=v` | negation — genome's `key` equals none of the values |
| `key~path` | path ancestry — genome's `key` is `path` or a descendant |
| `key!~path` | not a descendant |
Multiple values separated by `|` are always OR-ed within the predicate.
### Path matching (`~` and `!~`)
Metadata values can represent hierarchical concept paths such as
`/Eukaryota/Viridiplantae/Streptophyta/Betulaceae/Betula/nana`.
Stored taxonomy values always start with `/` (the root of the path).
Query patterns do **not** need to start with `/` — a leading `/` is an optional
start anchor, not a requirement.
| Pattern form | Semantics |
|---|---|
| `A/B` | contiguous sub-path A then B, anywhere in the value |
| `/A/B` | value starts with A then B |
| `A/B$` | value ends with A then B |
| `/A/B$` | value is exactly A then B |
| `A@x/B` | A with class `x` followed by B with any class |
- `taxon~/Betulaceae/Betula` matches any path that starts with `Betulaceae` then `Betula`.
- `taxon~Betula` matches any path containing `Betula` as a segment, anywhere.
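The anchoring rules in the table can be sketched as a segment-wise match. This is an illustration only: `path_matches` is a hypothetical function, it ignores the `@class` form, and it treats an empty pattern as a non-match (an assumption, not documented behaviour).

```rust
/// Hypothetical segment-wise path match. A leading '/' anchors the pattern
/// at the start of the value, a trailing '$' anchors it at the end.
fn path_matches(value: &str, pattern: &str) -> bool {
    let anchored_start = pattern.starts_with('/');
    let anchored_end = pattern.ends_with('$');
    let core = pattern.trim_start_matches('/').trim_end_matches('$');
    let pat: Vec<&str> = core.split('/').filter(|s| !s.is_empty()).collect();
    let val: Vec<&str> = value.split('/').filter(|s| !s.is_empty()).collect();
    if pat.is_empty() || pat.len() > val.len() {
        return false;
    }
    let last_start = val.len() - pat.len();
    // Try every window of contiguous segments allowed by the anchors.
    (0..=last_start).any(|start| {
        (!anchored_start || start == 0)
            && (!anchored_end || start == last_start)
            && val[start..start + pat.len()] == pat[..]
    })
}
```

Matching is done on whole segments, so `Betul` does not match a `Betula` segment in this sketch.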
### Missing metadata key → NA
If a genome does not carry the queried metadata key, the predicate returns **NA**.
NA propagates through the group evaluation logic (see below), and genomes that
cannot be classified are **ignored** in all quorum counts.
## Group semantics
### Multiple predicates
| Flag | Combination rule |
|------|-----------------|
| `--ingroup` (repeated) | **AND** — genome must satisfy all predicates |
| `--outgroup` (repeated) | **OR** — genome satisfies any predicate |
### Three-value logic
Each predicate returns `true`, `false`, or `NA` (absent key).
- AND: `false` absorbs everything; `NA` propagates unless already `false`.
- OR: `true` absorbs everything; `NA` propagates unless already `true`.
### Classification and priority
For each genome:
1. Evaluate `AND(ingroup predicates)` → `in_result`
2. Evaluate `OR(outgroup predicates)` → `out_result`
3. If `in_result = true` → **Ingroup** (ingroup wins over outgroup)
4. Else if `out_result = true` → **Outgroup**
5. Otherwise → **Uncategorized** (ignored in all quorum counts)
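The three-value logic and the classification steps above can be sketched as below. All names (`Tri`, `and2`, `or2`, `Class`, `classify`) are hypothetical. The sketch assumes each list holds the already-evaluated predicate results for one genome and treats an empty list as "matches nothing"; the implicit-group rules of the next section are not modelled.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Tri { True, False, Na }

/// AND: `False` absorbs everything; otherwise `Na` propagates.
fn and2(a: Tri, b: Tri) -> Tri {
    match (a, b) {
        (Tri::False, _) | (_, Tri::False) => Tri::False,
        (Tri::Na, _) | (_, Tri::Na) => Tri::Na,
        _ => Tri::True,
    }
}

/// OR: `True` absorbs everything; otherwise `Na` propagates.
fn or2(a: Tri, b: Tri) -> Tri {
    match (a, b) {
        (Tri::True, _) | (_, Tri::True) => Tri::True,
        (Tri::Na, _) | (_, Tri::Na) => Tri::Na,
        _ => Tri::False,
    }
}

#[derive(Debug, PartialEq)]
enum Class { Ingroup, Outgroup, Uncategorized }

/// Classify one genome from its per-predicate results.
fn classify(in_preds: &[Tri], out_preds: &[Tri]) -> Class {
    // Assumption of this sketch: an empty predicate list matches no genome.
    let in_result = in_preds.iter().copied().reduce(and2).unwrap_or(Tri::False);
    let out_result = out_preds.iter().copied().reduce(or2).unwrap_or(Tri::False);
    if in_result == Tri::True {
        Class::Ingroup // ingroup wins over outgroup
    } else if out_result == Tri::True {
        Class::Outgroup
    } else {
        Class::Uncategorized
    }
}
```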
### Implicit groups
| `--ingroup` | `--outgroup` | Effective behaviour |
|-------------|--------------|---------------------|
| not set | not set | all genomes form the ingroup |
| set | not set | only ingroup quorum flags apply |
| not set | set | only outgroup quorum flags apply |
| set | set | both constraints apply simultaneously |
## Quorum flags
| Flag | Applies to | Meaning |
|------|-----------|---------|
| `--min-count N` | ingroup | k-mer present in at least N ingroup genomes |
| `--max-count N` | ingroup | k-mer present in at most N ingroup genomes |
| `--min-frac F` | ingroup | k-mer present in at least fraction F of ingroup genomes |
| `--max-frac F` | ingroup | k-mer present in at most fraction F of ingroup genomes |
| `--min-outgroup-count N` | outgroup | k-mer present in at least N outgroup genomes |
| `--max-outgroup-count N` | outgroup | k-mer present in at most N outgroup genomes |
| `--min-outgroup-frac F` | outgroup | k-mer present in at least fraction F of outgroup genomes |
| `--max-outgroup-frac F` | outgroup | k-mer present in at most fraction F of outgroup genomes |
| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N (`filter` only) |
| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`filter` only) |
| `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) |
**Conditional defaults** — the defaults for `--min-frac` and `--max-outgroup-count` depend on two conditions:
whether the corresponding group was declared, **and** whether any quorum flag for that group was explicitly set.
> **Rule**: declaring a group activates the smart default **only if no quorum flag for that group is explicitly set**.
> As soon as any quorum flag for a group is present on the command line, all defaults for that group revert to no-op values.
| `--ingroup` | Any ingroup quorum flag? | `--min-frac` default |
|-------------|--------------------------|----------------------|
| not set | — | 0.0 (no-op) |
| set | no | **1.0** — all ingroup genomes must carry the k-mer |
| set | yes | 0.0 — user controls quorum explicitly |
| `--outgroup` | Any outgroup quorum flag? | `--max-outgroup-count` default |
|--------------|---------------------------|-------------------------------|
| not set | — | outgroup size (no-op) |
| set | no | **0** — no outgroup genome may carry the k-mer |
| set | yes | outgroup size — user controls quorum explicitly |
"Any ingroup quorum flag" means any of: `--min-count`, `--max-count`, `--min-frac`, `--max-frac`.
"Any outgroup quorum flag" means any of: `--min-outgroup-count`, `--max-outgroup-count`, `--min-outgroup-frac`, `--max-outgroup-frac`.
**Why this rule?** Setting any quorum flag signals explicit intent — the defaults are there to help when the user omits quorum entirely, not to interfere with deliberate constraints. Mixing implicit and explicit quorum on the same group would risk silent incoherence (e.g. `--max-count 0` with an implicit `--min-frac 1.0`).
All other bounds default to 0 / group size / 0.0 / 1.0 regardless of whether groups are declared.
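The two conditional defaults can be written as small resolution functions. This is a sketch of the rule as stated in the tables, not the actual `FilterArgs` code; the function names and parameters are hypothetical.

```rust
/// Effective `--min-frac`: an explicit value wins; otherwise 1.0 only when
/// an ingroup is declared and no ingroup quorum flag was given.
fn effective_min_frac(
    ingroup_declared: bool,
    any_ingroup_quorum_flag: bool,
    explicit: Option<f64>,
) -> f64 {
    match explicit {
        Some(f) => f,
        None if ingroup_declared && !any_ingroup_quorum_flag => 1.0,
        None => 0.0, // no-op bound
    }
}

/// Effective `--max-outgroup-count`: an explicit value wins; otherwise 0 only
/// when an outgroup is declared and no outgroup quorum flag was given.
fn effective_max_outgroup_count(
    outgroup_declared: bool,
    any_outgroup_quorum_flag: bool,
    explicit: Option<usize>,
    outgroup_size: usize,
) -> usize {
    match explicit {
        Some(n) => n,
        None if outgroup_declared && !any_outgroup_quorum_flag => 0,
        None => outgroup_size, // no-op bound
    }
}
```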
### Validation
After resolving defaults, the following are checked and cause an immediate error:
| Condition | Error |
|-----------|-------|
| `--min-count > --max-count` | incoherent bounds |
| `--min-frac > --max-frac` | incoherent bounds |
| `--min-outgroup-count > --max-outgroup-count` | incoherent bounds |
| `--min-outgroup-frac > --max-outgroup-frac` | incoherent bounds |
| any fraction outside `[0.0, 1.0]` | invalid value |
The check applies to the **effective** values (after defaults are resolved), so an explicit `--max-frac 0.5` with an implicit `--min-frac 1.0` would have been caught — but the rule above prevents that situation from arising in the first place.
Fractions are computed over the size of the classified group, not over total
genome count. An empty group (no genome classified as ingroup/outgroup) never
triggers a filter failure.
### Conservative rounding of fraction thresholds
When a fraction threshold `F` is applied to a group of size `N`, the effective
integer threshold is determined by the direction of the bound:
| Bound | Effective count | Rounding | Rationale |
|-------|----------------|----------|-----------|
| `--min-frac F` | k-mer in ≥ ⌈F·N⌉ genomes | **ceil** | stricter: when F·N is not an integer, a k-mer present in only ⌊F·N⌋ genomes does not meet the fraction |
| `--max-frac F` | k-mer in ≤ ⌊F·N⌋ genomes | **floor** | stricter: when F·N is not an integer, a k-mer present in ⌈F·N⌉ genomes already exceeds the fraction |
The same rule applies symmetrically to `--min-outgroup-frac` (ceil) and
`--max-outgroup-frac` (floor). The outgroup direction is not inverted: the
conservative rounding depends only on whether the bound is a minimum or a
maximum, not on which group it applies to.
**Example** — `--min-frac 0.5` with an ingroup of 3 genomes:
`⌈0.5 × 3⌉ = ⌈1.5⌉ = 2` → at least 2 of 3 ingroup genomes must carry the k-mer.
**Implementation note**: the filter evaluates `n / denom < min_frac` directly
(integer `n`, float comparison) rather than pre-computing `⌈F·N⌉`. This is
mathematically equivalent for integer counts: `n / N < F` ⟺ `n < F·N` ⟺
`n ≤ ⌈F·N⌉ − 1` ⟺ `n < ⌈F·N⌉`. No explicit rounding is needed.
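The equivalence can be checked numerically with two small functions, one using the direct float comparison and one using the pre-computed ceiling. Both names are hypothetical and the sketch is not the filter's actual code.

```rust
/// Direct form: compare the observed fraction with the threshold.
fn fails_min_frac_direct(n: u32, group_size: u32, min_frac: f64) -> bool {
    (n as f64) / (group_size as f64) < min_frac
}

/// Pre-computed form: compare the count with ceil(F * N).
fn fails_min_frac_ceil(n: u32, group_size: u32, min_frac: f64) -> bool {
    n < (min_frac * group_size as f64).ceil() as u32
}
```

With `--min-frac 0.5` and a group of 3, both forms reject a count of 1 and accept a count of 2, matching the example above.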
## Examples
Keep k-mers specific to *Betula nana* — present in at least 2 *B. nana* genomes
and absent from every other genome in the index:
```sh
obikmer filter src --output dst \
--ingroup "species=Betula_nana" \
--outgroup "*" \
--min-count 2 \
--max-outgroup-count 0
```
Keep k-mers found in at least 2 *Betula nana* genomes and absent from all
other *Betula*:
```sh
obikmer filter src --output dst \
--ingroup "species=Betula_nana" \
--outgroup "genus=Betula" \
--min-count 2 \
--max-outgroup-count 0
```
Use taxonomic paths: keep k-mers present in ≥ 50 % of the *Betula* clade
and in at most 10 % of everything outside *Betulaceae*:
```sh
obikmer filter src --output dst \
--ingroup "taxon~/Betulaceae/Betula" \
--outgroup "taxon!~/Betulaceae" \
--min-frac 0.5 \
--max-outgroup-frac 0.1
```
Multiple outgroup predicates (OR): exclude k-mers present in *Alnus* or *Carpinus*:
```sh
obikmer filter src --output dst \
--ingroup "genus=Betula" \
--outgroup "genus=Alnus" \
--outgroup "genus=Carpinus" \
--max-outgroup-count 0
```
To dump only k-mers specific to *Betula nana*:
```sh
obikmer dump myindex \
--ingroup "species=Betula_nana" \
--outgroup "*" \
--min-count 1 \
--max-outgroup-count 0
```
To enumerate unitigs of the *Betula*-specific subgraph:
```sh
obikmer unitig myindex \
--ingroup "genus=Betula" \
--outgroup "*" \
--min-count 2 \
--max-outgroup-count 0
```
## Command-specific options
### `dump --head N`
Stops output after the first N k-mers that pass all active filters.
Iteration terminates immediately — subsequent partitions and layers are not scanned.
Useful for quick inspection of large indexes without loading the entire dataset.
```sh
obikmer dump myindex --head 100
obikmer dump myindex --head 20 --ingroup "species=Betula_nana" --min-count 1
```
### `distance --presence-threshold N`
When computing Jaccard distance on a **count index**, a k-mer is considered present in a genome if its count is ≥ N (default 1).
This option is independent of the `--presence-threshold` used in filtering.
```sh
# Jaccard treating kmers with count ≥ 2 as present
obikmer distance myindex --metric jaccard --presence-threshold 2
```
This parameter has no effect on presence/absence indexes (where values are already 0/1) or on metrics other than Jaccard.
## Implementation
- **`obikpartitionner::filter::GroupQuorumFilter`** — implements `KmerFilter`
using pre-computed ingroup and outgroup index vectors. The heavy logic
(predicate parsing, three-value evaluation, genome classification) happens
once before any iteration; each k-mer row evaluation is a simple index
lookup and counter.
- **`obikmer::cmd::predicate::FilterArgs`** — shared `clap` argument group
embedded via `#[command(flatten)]` in `FilterArgs`, `DumpArgs`, and
`UnitigArgs`. `FilterArgs::build_filters()` returns a ready-to-use filter
list.
- **`obikpartitionner::KmerPartition::iter_partition_kmers`** — accepts
`filters: &[Box<dyn KmerFilter>]` and applies them per-kmer before invoking
the callback. `filter`, `dump`, and `unitig` all go through this single
entry point.
+207
@@ -0,0 +1,207 @@
# Merge parallelism and memory pressure
## Problem observed
Running `obikmer merge` over 109 indexes (108 sources + 1 bootstrap) on a 192-core machine
produces a fatal OOM during the `merge_partitions` stage:
```
memory allocation of 9126805520 bytes failed
```
A single allocation of ~8.5 GB fails. This is not an aggregate; it is one `malloc` call
from hashbrown during a HashMap resize.
---
## Root cause
### The merge pipeline per partition
```
source unitigs.bin
→ iter_indexed_canonical_kmers()
→ GraphDeBruijn::push() ← HashSet<u64> + 1 byte flags, all in RAM
→ compute_degrees_and_mark_starts()
→ try_for_each_unitig()
→ unitigs.bin (new layer)
→ Layer::build() → MPHF + evidence
```
`GraphDeBruijn` is a `FastHashMap<CanonicalKmer, AtomicU8>` — a `HashSet<u64>` with
one flag byte per node. Neighbor lookup is implicit: 4 probes into the same map.
No edges are stored. The full kmer set of one partition must reside in RAM
simultaneously to compute degrees and mark unitig starts.
The matrix builders that follow (pass 2) are mmapped files — they do **not** consume
significant RAM. The pressure is entirely in pass 1.
### Unbounded Rayon parallelism
With 192 cores, Rayon ran up to 192 partitions concurrently. Each partition built its
own `GraphDeBruijn` accumulating all kmers absent from the destination. Peak memory =
192 × peak_partition_hashset.
### The 8.5 GB single allocation
hashbrown allocates the entire backing array in one call when rehashing.
At load factor 7/8: `capacity × (sizeof(K,V) + 1 control byte)`.
For `(u64, AtomicU8)` with alignment: ~16 bytes per slot.
```
9 127 MB / 16 bytes ≈ 570 M slots → ~380 M new kmers in one partition
```
Plausible for the largest partition of 108 Salix/Betula sources (~450 Mbp each).
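As a cross-check of the arithmetic above, the failed request can also be decomposed exactly. The sketch assumes hashbrown's usual table layout (entry array, one control byte per bucket, plus one trailing control group) and a 16-byte control group as on SSE2 builds; both are assumptions about the build, and the function names are hypothetical.

```rust
/// Bytes requested for a table with `buckets` buckets under the assumed layout:
/// entry array + one control byte per bucket + one trailing control group.
fn table_alloc_bytes(buckets: u64, entry_size: u64, group_width: u64) -> u64 {
    buckets * entry_size + buckets + group_width
}

/// Maximum number of entries at a 7/8 load factor.
fn max_entries(buckets: u64) -> u64 {
    buckets / 8 * 7
}
```

Under these assumptions, 2^29 buckets of 16 bytes give exactly 9 126 805 520 bytes, the size in the error message. A table grows to 2^29 buckets once it exceeds 7/8 × 2^28 ≈ 235 M entries and can then hold up to ≈ 470 M, which brackets the ~380 M estimate.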
---
## Partition size distribution
`obikmer utils --partition-stats` measures the sum of `unitigs.bin` file sizes
per partition across all source indexes (pure `stat()` syscalls, negligible cost).
Observed on a 9-genome pilot (256 partitions):
| Stat | Value |
|---|---|
| min | 30.5 MB |
| max | 232.1 MB |
| mean | 40.1 MB |
| median | 37.2 MB |
| p95 | 47.1 MB |
| max/median ratio | 6.23× |
The distribution is **bimodal with a heavy tail**:
- 238/256 partitions in a narrow 30–50 MB band
- 4 structurally extreme partitions (3–6× the median): 221, 233, 135, 191
These correspond to minimizers over-represented in repetitive regions shared across
all sources. They are extreme in every run on this dataset.
With 109 sources, outlier partitions do not scale linearly: only kmers **absent from
the destination** enter the GraphDeBruijn, and inter-source overlap is high for closely
related species. Partition 221 is the likely trigger for the 8.5 GB crash.
---
## Solution: FFD scheduling + memory budget semaphore
### Principle
Pre-sort partitions by **decreasing estimated size** (First Fit Decreasing — FFD),
then schedule them through a **continuous memory budget semaphore**. Each worker
acquires an estimated cost before starting and releases it on completion.
Large partitions run first when the full budget is available; small partitions fill
the gaps. No hard outlier threshold is needed.
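The pre-sort itself is a one-liner. The sketch below assumes `partition_sizes[i]` holds the summed `unitigs.bin` bytes of partition `i`; the function name is hypothetical.

```rust
/// Partition indices sorted by decreasing estimated size (ties keep index order).
fn decreasing_size_order(partition_sizes: &[u64]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..partition_sizes.len()).collect();
    order.sort_by_key(|&i| std::cmp::Reverse(partition_sizes[i]));
    order
}
```

`order[0]` is then the largest partition and the remaining indices follow in decreasing size.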
### `MemoryBudget` (`obisys`)
```rust
pub struct MemoryBudget { … }
impl MemoryBudget {
pub fn new(total: u64) -> Self;
pub fn acquire(&self, cost: u64); // blocks until budget available
pub fn release(&self, cost: u64);
pub fn peak_active(&self) -> usize;
}
```
Non-deadlock guarantee: when `active == 0`, acquire always succeeds regardless of cost.
Without this, a partition whose estimated cost exceeds the total budget would block forever.
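One way to obtain this behaviour is a `Mutex` plus `Condvar`. The following is a minimal sketch under that assumption, not the `obisys` implementation; `BudgetSketch` and its fields are hypothetical names.

```rust
use std::sync::{Condvar, Mutex};

struct State {
    used: u64,          // sum of reserved costs
    active: usize,      // number of holders
    peak_active: usize, // high-water mark of `active`
}

pub struct BudgetSketch {
    total: u64,
    state: Mutex<State>,
    cv: Condvar,
}

impl BudgetSketch {
    pub fn new(total: u64) -> Self {
        let state = State { used: 0, active: 0, peak_active: 0 };
        BudgetSketch { total, state: Mutex::new(state), cv: Condvar::new() }
    }

    /// Blocks until the cost fits, except when nothing is active:
    /// an oversized request is then admitted alone (non-deadlock rule).
    pub fn acquire(&self, cost: u64) {
        let mut st = self.state.lock().unwrap();
        while st.active > 0 && st.used.saturating_add(cost) > self.total {
            st = self.cv.wait(st).unwrap();
        }
        st.used = st.used.saturating_add(cost);
        st.active += 1;
        st.peak_active = st.peak_active.max(st.active);
    }

    /// Returns a reservation and wakes all waiters so they can re-check.
    pub fn release(&self, cost: u64) {
        let mut st = self.state.lock().unwrap();
        st.used = st.used.saturating_sub(cost);
        st.active -= 1;
        drop(st);
        self.cv.notify_all();
    }

    pub fn peak_active(&self) -> usize {
        self.state.lock().unwrap().peak_active
    }
}
```

Waking all waiters on release is the simplest choice here: each waiter re-evaluates its own cost against the remaining budget.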
### Adaptive expansion factor
The expansion factor converts raw `unitigs.bin` bytes into an estimated GraphDeBruijn
RAM footprint. hashbrown stores each kmer as `(u64, AtomicU8)` ≈ 16 bytes/kmer at 7/8
load factor; unitig files encode ≈ 2 bits/base. The ratio depends on average unitig
length (short unitigs: ~2×; long unitigs: up to ~50×).
**Phase 1 — sequential pilot (worst partition)**
The largest partition runs alone first. Its actual `g.len()` seeds the expansion factor
before any parallel job starts. `FALLBACK_EXPANSION = 4×` is used only for empty partitions.
```rust
let worst_g_len = dst_partition.merge_partition(worst_id, …)?;
// ↑ now returns SKResult<usize> (was SKResult<()>)
let seed_expansion = worst_g_len as u64 * 16 * 1000 / worst_bytes;
let max_expansion = AtomicU64::new(seed_expansion);
```
**Phase 2 — parallel with adaptive updates**
```rust
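order[1..].par_iter().try_for_each(|&i| -> SKResult<()> {
    // Reserve budget using the current (pessimistic) expansion estimate.
    let cost = partition_sizes[i] * max_expansion.load(Relaxed) / 1000;
    budget.acquire(cost);
    let result = dst_partition.merge_partition(i, …);
    budget.release(cost); // releases estimated cost, not actual, even on error
    let g_len = result?;
    // Observed per-mille expansion; empty partitions are skipped (no division by zero).
    if partition_sizes[i] > 0 {
        let actual = g_len as u64 * 16 * 1000 / partition_sizes[i];
        max_expansion.fetch_max(actual, Relaxed); // always pessimistic (max)
    }
    Ok(())
})?;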
```
`budget.release(cost)` uses the estimated cost, not the actual one. The budget tracks
reservations, not physical RAM; each partition pays what it promised at acquisition.
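The fixed-point arithmetic used in both phases (expansion stored per mille, 16 bytes per kmer) can be isolated in three helpers. The names are hypothetical and the zero-size guard is an assumption about how empty partitions are handled.

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

/// Expansion factor in per-mille: estimated RAM bytes per 1000 unitig bytes.
/// Falls back to `fallback` for empty partitions (assumed handling).
fn expansion_per_mille(g_len: u64, unitig_bytes: u64, fallback: u64) -> u64 {
    if unitig_bytes == 0 {
        return fallback;
    }
    g_len * 16 * 1000 / unitig_bytes
}

/// Estimated RAM cost of a partition for a given per-mille expansion.
fn estimated_cost(unitig_bytes: u64, expansion: u64) -> u64 {
    unitig_bytes * expansion / 1000
}

/// Record an observed expansion; the stored value only ever grows.
fn record_observed(max_expansion: &AtomicU64, observed: u64) -> u64 {
    max_expansion.fetch_max(observed, Relaxed);
    max_expansion.load(Relaxed)
}
```

For example, 1 M kmers built from 4 MB of unitigs give 4000 per mille (4×), so a 2 MB partition would reserve 8 MB.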
**On the safety margin**
There is no separate multiplier `k`. It is redundant with `budget_fraction`: both
reduce effective concurrency by the same amount. A single parameter is easier to
calibrate. `budget_fraction = 0.5` (default) reserves half of available RAM for the
OS, MPHF build, pass 2, and estimation error.
`--budget-fraction` is exposed as a CLI flag — the only escape hatch for pathological
cases (extreme repetitive content, unusually long unitigs) that still cause OOM.
### RAM source
`obisys::available_memory_bytes()` — wraps `sysinfo::System::available_memory()`,
falls back to `total / 2` on macOS when the memory compressor returns 0.
---
## Diagnostic report
After the parallel phase, `merge_partitions` emits a structured report via `tracing::info!`:
```
─── merge_partitions memory report ───
available RAM : 512.0 GB budget 50% = 256.0 GB
expansion factor — seed: 4.2× final max: 6.1× (mean: 1.8× median: 1.6×)
peak concurrent workers: 42
expansion factor distribution (256 partitions with data):
0.50× – 1.25× │██████████████████████████████ 148
1.25× – 2.00× │████████████████████████ 82
5.50× – 6.25× │█ 2
top partitions by actual expansion factor:
partition 221 : 6.10× (232.1 MB unitigs → 48M kmers, reserved at 4.20×)
partition 135 : 5.82× (127.3 MB unitigs → 24M kmers, reserved at 4.20×)
──────────────────────────────────────
```
Fields useful for diagnosis:
| Field | Interpretation |
|---|---|
| `seed` vs `final max` expansion | gap indicates partitions with higher expansion than the worst-by-size |
| `reserved at X×` | the factor used at acquisition; if much lower than actual, the budget was under-reserved for that partition |
| `peak concurrent workers` | effective parallelism achieved under the budget constraint |
| `mean` / `median` expansion | typical dataset characteristic; stable across runs on the same data |
---
## Parameters
| Parameter | Default | CLI flag | Notes |
|---|---|---|---|
| `fallback_expansion` | 4× | — | seed for empty partitions only |
| `budget_fraction` | 0.5 | `--budget-fraction` | reduce if OOM persists |
| RAM source | `obisys::available_memory_bytes()` | — | falls back to `total/2` on macOS |
+520
@@ -0,0 +1,520 @@
# obicompactvec — Complete Reference
## Module structure
```
src/obicompactvec/src/
lib.rs public re-exports
views.rs BitSliceView<'a>, IntSliceView<'a> — zero-copy read views
traits.rs ColumnWeights, CountPartials, BitPartials (matrix aggregation)
bitvec.rs PersistentBitVec, PersistentBitVecBuilder, BitIter
reader.rs PersistentCompactIntVec (read-only)
builder.rs PersistentCompactIntVecBuilder (read-write)
tempintvec.rs TempCompactIntVec, TempCompactIntVecBuilder (temp-file-backed)
tempbitvec.rs TempBitVec, TempBitVecBuilder (temp-file-backed)
bitmatrix.rs PersistentBitMatrix, PersistentBitMatrixBuilder
intmatrix.rs PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder
colgroup.rs ColGroup, MatrixGroupOps trait
format.rs file format constants, encode/decode helpers
layer_meta.rs LayerMeta (column metadata)
meta.rs matrix metadata
```
```mermaid
graph TD
views --> bitvec
views --> builder
views --> tempbitvec
views --> tempintvec
views --> bitmatrix
views --> intmatrix
format --> reader
format --> builder
reader --> intmatrix
reader --> tempintvec
builder --> intmatrix
builder --> tempintvec
bitvec --> tempbitvec
bitvec --> bitmatrix
tempintvec --> intmatrix
tempintvec --> bitmatrix
tempbitvec --> intmatrix
tempbitvec --> bitmatrix
colgroup --> intmatrix
colgroup --> bitmatrix
layer_meta --> bitmatrix
layer_meta --> intmatrix
meta --> bitmatrix
meta --> intmatrix
```
---
## Compact int encoding
All integer vectors use the same two-tier encoding regardless of storage backend.
**Primary array** — one `u8` per slot:
- Values **0–254** are stored directly. No overhead.
- Value **255 is a sentinel**: the slot's actual value is ≥ 255 and lives in the overflow store.
**Overflow store** — maps slot index to a `u32` value ≥ 255:
- In `PersistentCompactIntVecBuilder`: a `HashMap<usize, u32>` in RAM.
- In `PersistentCompactIntVec` (reader): a sorted `[(slot: u64, value: u32)]` array in the mmap, with a sparse L1-resident index for binary search.
```mermaid
flowchart LR
slot --> P["primary[slot]: u8"]
P -->|"< 255"| V["value = byte (0254)"]
P -->|"< 255"| V["value = byte (0–254)"]
OV -->|"Builder"| HM["HashMap&lt;usize, u32&gt;\nin RAM"]
OV -->|"PersistentCompactIntVec"| SA["sorted [(slot,value)] in mmap\n+ sparse L1 index"]
```
**Key property — sentinel 255 = +∞ on `u8`:**
- `min(a, 255) = a` for all `a ≤ 254` → correct when only one side is overflow
- `max(a, 255) = 255` → correct sentinel when either side is overflow
- Only the **both-overflow** case requires reading actual values from the overflow store.
In practice, k (overflow count) ≪ n (total slots). Observed genomic data: ~0.07% of kmer slots are in overflow.
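A minimal in-memory sketch of the two-tier encoding, using a `Vec<u8>` primary array and a `HashMap` overflow store as in the builder. `CompactIntSketch` is a hypothetical type; the real types are mmap-backed and have a richer API.

```rust
use std::collections::HashMap;

const SENTINEL: u8 = 255;

/// Hypothetical in-memory model of the compact int encoding.
struct CompactIntSketch {
    primary: Vec<u8>,
    overflow: HashMap<usize, u32>,
}

impl CompactIntSketch {
    fn new(n: usize) -> Self {
        CompactIntSketch { primary: vec![0; n], overflow: HashMap::new() }
    }

    fn set(&mut self, slot: usize, value: u32) {
        if value < SENTINEL as u32 {
            // 0..=254 fits in the primary byte; drop any stale overflow entry.
            self.primary[slot] = value as u8;
            self.overflow.remove(&slot);
        } else {
            // >= 255: write the sentinel and keep the real value in overflow.
            self.primary[slot] = SENTINEL;
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: usize) -> u32 {
        match self.primary[slot] {
            SENTINEL => self.overflow[&slot],
            b => b as u32,
        }
    }
}
```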
---
## View types
The previous trait hierarchy (`BitSlice`, `BitSliceMut`, `IntSlice`, `IntSliceMut`) has been replaced by two concrete zero-copy view structs with inherent methods. Views are **`Copy`** — passing them is free. All read operations live on these two types.
### `BitSliceView<'a>`
```rust
#[derive(Clone, Copy)]
pub struct BitSliceView<'a> { pub(crate) words: &'a [u64], pub(crate) n: usize }
```
Bit `i` is at `words[i >> 6]` bit `i & 63` (LSB-first). Padding bits in the last word are zero.
| Method | Cost |
|---|---|
| `len()`, `is_empty()` | O(1) |
| `get(slot)` | O(1) |
| `count_ones()` | POPCNT per word, O(n/64) |
| `count_zeros()` | `n − count_ones()`, O(n/64) |
| `iter() -> BitSliceIter<'a>` | O(1) setup, O(n) iteration |
| `partial_jaccard_dist(other: BitSliceView)` | `(a&b).popcount`, `(a\|b).popcount` per word, O(n/64) |
| `jaccard_dist(other: BitSliceView)` | from partial, O(n/64) |
| `hamming_dist(other: BitSliceView)` | `(a^b).popcount` per word, O(n/64) |
`BitSliceIter<'a>`: word-level scan; one word per 64 iterations.
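The word-level operations in the table reduce to a few lines over `&[u64]`. The free functions below are a sketch with hypothetical names, assuming both inputs have the same number of words and zero padding bits; they are not the `BitSliceView` methods themselves.

```rust
/// Bit `i` lives in word `i >> 6` at position `i & 63` (LSB-first).
fn get_bit(words: &[u64], i: usize) -> bool {
    (words[i >> 6] >> (i & 63)) & 1 == 1
}

/// One popcount per word.
fn count_ones(words: &[u64]) -> u64 {
    words.iter().map(|w| w.count_ones() as u64).sum()
}

/// (intersection, union) popcounts, accumulated word by word.
fn partial_jaccard(a: &[u64], b: &[u64]) -> (u64, u64) {
    a.iter().zip(b).fold((0, 0), |(inter, union), (&x, &y)| {
        (inter + (x & y).count_ones() as u64, union + (x | y).count_ones() as u64)
    })
}

/// Number of positions where the two bit vectors differ.
fn hamming(a: &[u64], b: &[u64]) -> u64 {
    a.iter().zip(b).map(|(&x, &y)| (x ^ y).count_ones() as u64).sum()
}
```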
### `IntSliceView<'a>`
```rust
#[derive(Clone, Copy)]
pub struct IntSliceView<'a> {
pub(crate) primary: &'a [u8],
pub(crate) overflow_raw: &'a [u8], // sorted [(slot:u64, value:u32)] entries
pub(crate) n_overflow: usize,
pub(crate) n: usize,
}
```
`overflow_raw` contains `n_overflow` entries of `OVERFLOW_ENTRY_SIZE` bytes each, sorted by slot. The sort invariant is established at `close()`/`freeze()` time.
| Method | Cost |
|---|---|
| `len()`, `is_empty()` | O(1) |
| `primary_bytes()` | O(1) |
| `overflow_entries() -> impl Iterator<(usize,u32)>` | O(n_overflow) iteration |
| `get(slot)` | O(1) primary; binary search O(log k) for overflow slots |
| `iter() -> IntSliceViewIter<'a>` | merge scan, O(n + k) |
| `sum()` | byte scan + overflow, O(n + k) |
| `count_nonzero()` | byte scan, O(n) |
| Distance methods (`bray_dist`, `euclidean_dist`, `jaccard_dist`, …) | O(n + k) |
`IntSliceViewIter<'a>`: merge scan using `overflow_pos` index. Requires sorted overflow — guaranteed by the construction lifecycle.
**Builder `view()` vs reader `view()`:** `PersistentCompactIntVecBuilder` stores overflow as an unsorted `HashMap`, not raw bytes. Its `view()` returns an `IntSliceView` with `overflow_raw = &[]` and `n_overflow = 0`. This is intentional — the view is primarily useful after `freeze()`. During building, callers that need overflow use `overflow_entries()` directly.
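The two read paths (point lookup by binary search, full iteration by merge scan) can be sketched over a decoded overflow array. Here the overflow is a sorted `&[(u64, u32)]` slice rather than raw bytes, and both function names are hypothetical.

```rust
/// Point lookup: primary byte, or binary search in the sorted overflow entries.
fn get(primary: &[u8], overflow: &[(u64, u32)], slot: usize) -> u32 {
    match primary[slot] {
        255 => {
            let pos = overflow
                .binary_search_by_key(&(slot as u64), |&(s, _)| s)
                .expect("sentinel slot must have an overflow entry");
            overflow[pos].1
        }
        b => b as u32,
    }
}

/// Merge scan: sentinel bytes appear in slot order, and overflow entries are
/// sorted by slot, so a running cursor is enough to pair them up.
fn iter_values<'a>(
    primary: &'a [u8],
    overflow: &'a [(u64, u32)],
) -> impl Iterator<Item = u32> + 'a {
    let mut overflow_pos = 0usize;
    primary.iter().map(move |&b| {
        if b == 255 {
            let v = overflow[overflow_pos].1;
            overflow_pos += 1;
            v
        } else {
            b as u32
        }
    })
}
```

The merge scan never searches, which is why it depends on the sort invariant established at `close()`/`freeze()` time.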
---
## Concrete types
```mermaid
classDiagram
class BitSliceView {
+words: &[u64]
+n: usize
+get(slot) bool
+count_ones() u64
+iter() BitSliceIter
+jaccard_dist/hamming_dist(other: BitSliceView)
}
class IntSliceView {
+primary: &[u8]
+overflow_raw: &[u8]
+n_overflow: usize
+n: usize
+get(slot) u32
+iter() IntSliceViewIter
+overflow_entries() Iterator
+bray_dist/euclidean_dist/…(other: IntSliceView)
}
class PersistentBitVec {
-mmap: Mmap
-n: usize
+view() BitSliceView
+get(slot) bool
+count_ones/zeros() u64
+iter() BitIter
+partial_jaccard_dist(&Self) (u64,u64)
+jaccard_dist/hamming_dist(&Self) …
}
class PersistentBitVecBuilder {
-mmap: MmapMut
-n: usize
+view() BitSliceView
+set(slot, bool)
+or/and/xor/not(BitSliceView)
+copy_from(BitSliceView)
+close() / finish() → PersistentBitVec
}
class PersistentCompactIntVec {
-mmap: Mmap
-n: usize
-n_overflow: usize
-step: usize
-index: Vec~(usize,usize)~
+view() IntSliceView
+get(slot) u32
+iter() Iter
+sum/count_nonzero() u64
+bray_dist/euclidean_dist/… (&Self)
}
class PersistentCompactIntVecBuilder {
-mmap: MmapMut
-n: usize
-overflow: HashMap~usize,u32~
+view() IntSliceView
+set(slot, u32) / get(slot) u32
+inc / inc_present / inc_present_fast
+inc_predicate / inc_predicate_fast
+add/min/max/diff/mask_with(…View)
+primary_bytes/primary_bytes_mut()
+close() / finish() → PersistentCompactIntVec
}
PersistentBitVec --> BitSliceView : view()
PersistentBitVecBuilder --> BitSliceView : view()
PersistentCompactIntVec --> IntSliceView : view()
PersistentCompactIntVecBuilder --> IntSliceView : view() (primary only)
PersistentBitVecBuilder --> PersistentBitVec : close() then open()
PersistentCompactIntVecBuilder --> PersistentCompactIntVec : close() then open()
```
### `PersistentBitVec` / `PersistentBitVecBuilder`
`PersistentBitVec` is the read-only type. `view()` returns a `BitSliceView<'_>` over the mmap word array. Direct inherent methods delegate to the view: `count_ones()`, `count_zeros()`, `partial_jaccard_dist(&Self)`, `jaccard_dist(&Self)`, `hamming_dist(&Self)`.
`BitIter<'a>` — exported iterator for `PersistentBitVec::iter()`:
```rust
pub struct BitIter<'a> { pub(crate) words: &'a [u64], pub(crate) slot: usize, pub(crate) n: usize }
```
`PersistentBitVecBuilder` is the read-write type. Mutation operations accept `BitSliceView<'_>`:
| Method | Cost |
|---|---|
| `set(slot, bool)` | O(1) |
| `view() -> BitSliceView<'_>` | O(1) |
| `or/and/xor(BitSliceView)` | word-level, O(n/64), SIMD-friendly |
| `not()` | `w ^= u64::MAX` per word, re-masks last word, O(n/64) |
| `copy_from(BitSliceView)` | `copy_from_slice`, O(n/64) |
### `PersistentCompactIntVec` / `PersistentCompactIntVecBuilder`
`PersistentCompactIntVec` is the read-only type. `view()` returns an `IntSliceView<'_>` over the mmap primary and overflow arrays. Inherent `iter()` is a merge scan (`Iter` struct). Inherent `sum()` and `count_nonzero()` use fast byte-scan helpers.
`PersistentCompactIntVecBuilder` is the read-write type. Mutation methods on the builder fall into two categories:
**Point mutations:**
| Method | Note |
|---|---|
| `set(slot, u32)` | writes primary[slot] or 255+overflow |
| `get(slot) -> u32` | reads primary byte or HashMap |
| `inc(slot)` | `get` + `set`, O(1) |
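The primary-byte plus overflow encoding behind these point mutations can be sketched as below. `CompactCounts` is a hypothetical in-memory stand-in (a `Vec<u8>` instead of an mmap); it assumes, as the sentinel convention suggests, that every value of 255 or more goes to the overflow map.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the builder: one `u8` per slot plus an overflow map.
pub struct CompactCounts {
    primary: Vec<u8>,
    overflow: HashMap<usize, u32>,
}

impl CompactCounts {
    pub fn new(n: usize) -> Self {
        Self { primary: vec![0; n], overflow: HashMap::new() }
    }

    /// Values below 255 live in the primary byte; larger ones store the
    /// sentinel 255 in the byte and the true value in the overflow map.
    pub fn set(&mut self, slot: usize, value: u32) {
        if value < 255 {
            self.primary[slot] = value as u8;
            self.overflow.remove(&slot); // drop any stale overflow entry
        } else {
            self.primary[slot] = 255;
            self.overflow.insert(slot, value);
        }
    }

    /// Reads the primary byte, or the overflow map when the byte is the sentinel.
    pub fn get(&self, slot: usize) -> u32 {
        match self.primary[slot] {
            255 => self.overflow[&slot],
            b => u32::from(b),
        }
    }

    /// `get` followed by `set`, as described in the table above.
    pub fn inc(&mut self, slot: usize) {
        let v = self.get(slot);
        self.set(slot, v + 1);
    }

    /// Number of slots currently stored in the overflow map.
    pub fn overflow_len(&self) -> usize {
        self.overflow.len()
    }
}
```

Incrementing a slot from 254 to 255 is the point where the value migrates from the byte array to the overflow map.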
**Bulk computation methods** — accept view arguments:
| Method | Semantics | Overflow |
|---|---|---|
| `inc_present(BitSliceView)` | `+= 1` at each 1-bit | via `inc`, safe for any group size |
| `inc_present_fast(BitSliceView)` | same, raw u8 `+= 1` | `debug_assert` no 255 reached |
| `inc_predicate(IntSliceView, pred)` | `+= 1` where `pred(col[s])` | two-pass, safe |
| `inc_predicate_fast(IntSliceView, pred)` | same, raw u8 | `debug_assert` no 255 reached |
| `add(IntSliceView)` | `self[s] += other[s]` | primary fast path + overflow fallback |
| `min(IntSliceView)` | byte min + both-overflow fixup | see algorithm below |
| `max(IntSliceView)` | pre-pass + byte max | see algorithm below |
| `diff(IntSliceView)` | saturating sub | self<255 hot path |
| `mask_with(BitSliceView)` | zeros slots where mask bit = 0 | O(n_zeros) |
**`inc_present_fast` / `inc_predicate_fast` invariant:** caller guarantees no counter reaches 255 during the operation (group size < 255 for `inc_present_fast`, or chunk size < 255 for `inc_predicate_fast`). Violation is caught by `debug_assert` in dev builds.
**`min` algorithm:**
Exploits 255 = +∞: byte-level min is correct unless both sides are overflow.
```
snapshot self_ov: Vec<(slot,val)>
snapshot other_ov: HashMap<slot,val>
clear_overflow()
Pass 1 — byte min, SIMD-vectorizable, O(n)
Pass 2 — both-overflow fixup, O(k_self):
    for (slot, self_val) in self_ov:
        if slot ∈ other_ov: set(slot, min(self_val, other_ov[slot]))
```
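A runnable sketch of the two passes, under the assumption that plain slices and `HashMap`s can stand in for the mmap-backed primary and overflow arrays. `min_in_place` is a hypothetical free function, not the builder's actual method.

```rust
use std::collections::HashMap;

/// Element-wise `self = min(self, other)` on the primary-byte + overflow layout.
pub fn min_in_place(
    self_primary: &mut [u8],
    self_overflow: &mut HashMap<usize, u32>,
    other_primary: &[u8],
    other_overflow: &HashMap<usize, u32>,
) {
    // Snapshot self's overflow entries and clear the map in one step.
    let self_ov: Vec<(usize, u32)> = self_overflow.drain().collect();

    // Pass 1: byte-level min. 255 acts as +infinity, so the result is already
    // correct wherever at least one side is below 255.
    for (a, &b) in self_primary.iter_mut().zip(other_primary) {
        *a = (*a).min(b);
    }

    // Pass 2: slots where both sides overflow still hold the sentinel 255.
    // The minimum of two values >= 255 is itself >= 255, so it goes straight
    // back into the overflow map.
    for (slot, self_val) in self_ov {
        if let Some(&other_val) = other_overflow.get(&slot) {
            self_overflow.insert(slot, self_val.min(other_val));
        }
    }
}
```

Only the slots that overflow on both sides cost a hash lookup; everything else is handled by the linear byte pass.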
**`max` algorithm:**
Cannot do byte max first — `max(255, b<255)=255` overwrites self's original overflow value. Pre-pass reads self's value at other's overflow slots before the byte pass.
```
Pre-pass O(k_other): for (slot, other_val) in other.overflow_entries():
    set(slot, max(self.get(slot), other_val))
Pass 1 — byte max, SIMD-vectorizable, O(n)
```
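The same ordering can be sketched in runnable form. As with any illustration here, `max_in_place` is a hypothetical free function over plain slices and `HashMap`s, not the builder's real signature.

```rust
use std::collections::HashMap;

/// Element-wise `self = max(self, other)` on the primary-byte + overflow layout.
pub fn max_in_place(
    self_primary: &mut [u8],
    self_overflow: &mut HashMap<usize, u32>,
    other_primary: &[u8],
    other_overflow: &HashMap<usize, u32>,
) {
    // Pre-pass: resolve every slot where `other` overflows, while self's
    // true value is still readable.
    for (&slot, &other_val) in other_overflow {
        let self_val = match self_primary[slot] {
            255 => self_overflow[&slot],
            b => u32::from(b),
        };
        // other_val >= 255, so the maximum always belongs in the overflow map.
        self_primary[slot] = 255;
        self_overflow.insert(slot, self_val.max(other_val));
    }

    // Pass 1: byte-level max. Slots fixed in the pre-pass are 255 on both
    // sides and stay 255; self-only overflow slots keep their entry.
    for (a, &b) in self_primary.iter_mut().zip(other_primary) {
        *a = (*a).max(b);
    }
}
```

The pre-pass touches only `other`'s overflow entries, so the overall cost stays dominated by the linear byte pass.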
---
## Matrix types
Two matrix types, each available in two on-disk formats (two encodings × two formats):
| | Columnar format | Packed format |
|---|---|---|
| **Bit** | `PersistentBitMatrix` (Columnar variant) | `PersistentBitMatrix` (Packed variant) |
| **Int** | `PersistentCompactIntMatrix` (Columnar variant) | `PersistentCompactIntMatrix` (Packed variant) |
Both matrix types are enums (`Columnar` / `Packed`, plus `Implicit` for the bit matrix) behind a transparent API. `col_view(c)` returns the appropriate view directly:
```rust
// PersistentBitMatrix
pub fn col_view(&self, c: usize) -> BitSliceView<'_>
// PersistentCompactIntMatrix
pub fn col_view(&self, c: usize) -> IntSliceView<'_>
```
No wrapper enums (`BitColView`, `IntColView`): the caller receives a `Copy` view struct immediately usable with any view method or bulk builder method.
`pack_compact_int_matrix` and `pack_bit_matrix` convert columnar → packed format.
---
## Aggregation traits (matrix level)
### ColumnWeights
```rust
trait ColumnWeights: Send + Sync {
    fn col_weights(&self) -> Array1<u64>;         // sum per column
    fn partial_kmer_counts(&self) -> Array1<u64>; // default = col_weights()
}
```
`partial_kmer_counts` is overridden for count matrices to return `count_nonzero` per column (distinct kmers) rather than total count.
### CountPartials
Abstract required methods: `partial_bray`, `partial_euclidean`, `partial_threshold_jaccard`, `partial_relfreq_bray`, `partial_relfreq_euclidean`, `partial_hellinger`.
**Additivity rule:** self-contained partials (`partial_bray`, `partial_euclidean`, `partial_threshold_jaccard`) can be element-wise summed across all `(partition, layer)` pairs. Normalised partials (`partial_relfreq_*`, `partial_hellinger`) require the **global** `col_weights` (accumulated across all layers and all partitions) as parameter.
**`partial_threshold_jaccard` returns `(inter, union)`** because `union[i,j]` depends on both columns simultaneously.
Provided finalisations:
| Finalisation | Formula |
|---|---|
| `bray_dist_matrix()` | `1 - 2·partial_bray[i,j] / (w[i] + w[j])` |
| `euclidean_dist_matrix()` | `√partial_euclidean[i,j]` |
| `threshold_jaccard_dist_matrix(t)` | `1 - inter[i,j] / union[i,j]` |
| `relfreq_bray_dist_matrix()` | `1 - partial_relfreq_bray[i,j]` |
| `relfreq_euclidean_dist_matrix()` | `√partial_relfreq_euclidean[i,j]` |
| `hellinger_dist_matrix()` | `√partial_hellinger[i,j] / √2` |
| `hellinger_euclidean_dist_matrix()` | `√partial_hellinger[i,j]` |
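To illustrate the additivity rule and the Bray finalisation for a single pair of columns, here is a minimal sketch. It assumes that the Bray partial is the sum of per-slot minima (the usual Bray-Curtis numerator); the source does not spell this out, and both function names are hypothetical scalar versions of the matrix-level methods.

```rust
/// Partial for one pair of columns in one (partition, layer):
/// sum over slots of min(a, b). Assumed definition, see the lead-in.
pub fn partial_bray(a: &[u32], b: &[u32]) -> u64 {
    a.iter().zip(b).map(|(&x, &y)| u64::from(x.min(y))).sum()
}

/// Finalisation: 1 - 2 * partial / (w_i + w_j), where `w` is the global
/// column weight (sum of the column over all partitions and layers).
/// Returns NaN when both weights are zero.
pub fn bray_dist(partial: u64, w_i: u64, w_j: u64) -> f64 {
    1.0 - 2.0 * partial as f64 / (w_i + w_j) as f64
}
```

Summing the partials of two disjoint partitions gives the same value as computing the partial on the concatenated columns, which is what makes the per-partition computation safe to parallelise.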
### BitPartials
Required: `partial_jaccard() -> (Array2<u64>, Array2<u64>)`, `partial_hamming() -> Array2<u64>`. Both additive across layers and partitions.
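For one pair of bit columns, the two partials reduce to popcounts over word-level AND, OR and XOR. The sketch below assumes the tuple returned by `partial_jaccard` is `(intersection, union)`, by analogy with `partial_threshold_jaccard`; the scalar function names are hypothetical.

```rust
/// (intersection, union) popcounts for one pair of bit columns.
pub fn partial_jaccard(a: &[u64], b: &[u64]) -> (u64, u64) {
    let mut inter = 0u64;
    let mut uni = 0u64;
    for (&x, &y) in a.iter().zip(b) {
        inter += u64::from((x & y).count_ones());
        uni += u64::from((x | y).count_ones());
    }
    (inter, uni)
}

/// Number of differing bits for one pair of bit columns.
pub fn partial_hamming(a: &[u64], b: &[u64]) -> u64 {
    a.iter().zip(b).map(|(&x, &y)| u64::from((x ^ y).count_ones())).sum()
}
```

Because all three quantities are plain counts, they can be summed across layers and partitions before the final division.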
---
## Temp-file-backed types
**All inter-function results use temp-file-backed types** so the OS can page them out under memory pressure. This matters in practice: processing dozens of layers × hundreds of partitions in parallel would otherwise accumulate gigabytes of live anonymous memory.
### Lifecycle
```
TempCompactIntVecBuilder::new(n) → writable mmap in TempDir
↓ (inc_present_fast / inc_predicate_fast / add / mask_with / …)
.freeze() → TempCompactIntVec (read-only mmap + TempDir)
↓ (optional)
.make_persistent(path) → PersistentCompactIntVec (permanent file)
```
Same pattern for `TempBitVecBuilder` → `TempBitVec` → `PersistentBitVec`.
**Drop order**: `TempCompactIntVec { vec: PersistentCompactIntVec, _temp: TempDir }` — Rust drops fields in declaration order. `vec` (mmap) released before `_temp` (directory deleted). No explicit `drop()` needed.
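The declaration-order rule can be checked with a small stand-alone sketch. `Tracked`, `TempLike` and `drop_order` are hypothetical names; `Tracked` merely records when it is dropped and stands in for both the mmap-backed vector and the `TempDir`.

```rust
use std::cell::RefCell;
use std::rc::Rc;

type Log = Rc<RefCell<Vec<&'static str>>>;

/// Records its own name in a shared log when dropped.
pub struct Tracked {
    name: &'static str,
    log: Log,
}

impl Drop for Tracked {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Same field layout as the temp-file-backed types.
pub struct TempLike {
    vec: Tracked,   // declared first, dropped first
    _temp: Tracked, // declared second, dropped second
}

/// Builds a `TempLike`, drops it, and returns the observed drop order.
pub fn drop_order() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let t = TempLike {
        vec: Tracked { name: "vec", log: Rc::clone(&log) },
        _temp: Tracked { name: "_temp", log: Rc::clone(&log) },
    };
    drop(t);
    let order = log.borrow().clone();
    order
}
```

Swapping the two field declarations would reverse the order, which in the real types would delete the directory while the file is still mapped.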
### TempCompactIntVec / TempCompactIntVecBuilder
```rust
pub struct TempCompactIntVec {
    vec: PersistentCompactIntVec,
    _temp: TempDir, // dropped after vec
}
pub(crate) struct TempCompactIntVecBuilder {
    builder: PersistentCompactIntVecBuilder,
    temp: TempDir,
}
```
`TempCompactIntVec`: read access via `get(slot)`, `sum()`, `iter()`, `view() -> IntSliceView<'_>`.
`TempCompactIntVecBuilder`: full delegation to inner `PersistentCompactIntVecBuilder` — all bulk computation methods (`inc_present_fast`, `inc_predicate_fast`, `add`, `min`, `max`, `diff`, `mask_with`) are exposed as `pub(crate)`.
### TempBitVec / TempBitVecBuilder
```rust
pub struct TempBitVec {
    vec: PersistentBitVec,
    _temp: TempDir,
}
pub(crate) struct TempBitVecBuilder {
    builder: PersistentBitVecBuilder,
    temp: TempDir,
}
```
`TempBitVec`: read access via `get(slot)`, `count_ones()`, `view() -> BitSliceView<'_>`, `iter()`.
`TempBitVecBuilder`: exposes `set(slot, bool)`, `or(BitSliceView)`, and:
```rust
pub(crate) fn or_where(&mut self, col: IntSliceView<'_>, pred: impl Fn(u32) -> bool)
```
`or_where` — two passes, no intermediate allocation:
```
Pass 1 (primary bytes, O(n)):
    for slot in 0..n:
        b = col.primary_bytes()[slot]
        if b < 255 AND pred(b as u32): self.set(slot, true)
Pass 2 (overflow, O(k)):
    for (slot, val) in col.overflow_entries():
        if pred(val): self.set(slot, true)
```
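The two passes translate directly into runnable code once the views are replaced by plain slices. In this sketch `or_where` is a free function, the overflow entries are passed as a `(slot, value)` slice, and bits are assumed to be stored LSB-first within each `u64` word; all three are illustration choices, not the crate's actual signatures.

```rust
/// Set bit `slot` of `bits` wherever `pred` holds for the count at `slot`.
pub fn or_where(
    bits: &mut [u64],
    primary: &[u8],
    overflow: &[(usize, u32)],
    pred: impl Fn(u32) -> bool,
) {
    // Pass 1: primary bytes. The sentinel 255 is skipped because the true
    // value lives in the overflow entries.
    for (slot, &b) in primary.iter().enumerate() {
        if b < 255 && pred(u32::from(b)) {
            bits[slot / 64] |= 1u64 << (slot % 64);
        }
    }
    // Pass 2: overflow entries carry the true values of the sentinel slots.
    for &(slot, val) in overflow {
        if pred(val) {
            bits[slot / 64] |= 1u64 << (slot % 64);
        }
    }
}
```

Bits that were already set stay set, which is what lets `partial_group_any` accumulate several columns into the same builder.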
---
## Filter / Select API
### ColGroup
```rust
pub struct ColGroup { pub name: String, pub indices: Vec<usize> }
```
Defined **once at the index level** from column metadata. Valid in all matrices of all layers and partitions — column structure is identical across the entire hierarchy; only rows (kmer slots) are partitioned.
### Composition axis
- **Across partitions**: kmer space is partitioned → partial results **concatenated** (disjoint kmer ranges).
- **Across layers**: same kmer space, different counts → partial results **aggregated** (add, OR, etc.).
### MatrixGroupOps
Five required primitives + two default methods derived from them. All return temp-file-backed types.
```rust
pub trait MatrixGroupOps {
    // required
    fn partial_group_presence_count(&self, g: &ColGroup, threshold: u32)
        -> io::Result<TempCompactIntVec>;
    fn partial_group_sum(&self, g: &ColGroup)
        -> io::Result<TempCompactIntVec>;
    fn partial_group_any(&self, g: &ColGroup, threshold: u32)
        -> io::Result<TempBitVec>;
    fn partial_group_min(&self, g: &ColGroup)
        -> io::Result<TempCompactIntVec>;
    fn partial_group_max(&self, g: &ColGroup)
        -> io::Result<TempCompactIntVec>;
    // defaults derived from partial_group_presence_count
    fn partial_group_all(&self, g: &ColGroup, threshold: u32)
        -> io::Result<TempBitVec>; // slot=1 iff count == g.indices.len()
    fn partial_group_none(&self, g: &ColGroup, threshold: u32)
        -> io::Result<TempBitVec>; // slot=1 iff count == 0
}
```
Implemented for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`.
For **bit matrices**: values are 0/1, so `partial_group_sum` = `partial_group_presence_count(g, 1)`; `partial_group_min` is AND (set first column then mask-with remaining); `partial_group_max` is OR via `partial_group_any` + `inc_present`.
**`partial_group_presence_count` — chunking for large groups:**
When `g.indices.len() < 255`: per-slot counts stay within `u8` range. Use `inc_present_fast` (bit) or `inc_predicate_fast(col_view(c), |v| v >= threshold)` (int) — raw u8 increment, no overflow entry written.
When `g.indices.len() ≥ 255`: process in chunks of 254 columns, accumulate via `.add(chunk_frozen.view())`.
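The chunking scheme can be sketched on in-memory data. Here each column is a `Vec<bool>` and the accumulator a `Vec<u32>`, both simplifications of the bit views and temp-file-backed vectors; `presence_count` is a hypothetical name.

```rust
/// Count, per slot, how many columns of the group have the bit set.
pub fn presence_count(columns: &[Vec<bool>], n: usize) -> Vec<u32> {
    let mut total = vec![0u32; n];
    // At most 254 columns per chunk, so a raw u8 counter can never reach 255.
    for chunk in columns.chunks(254) {
        let mut acc = vec![0u8; n];
        for col in chunk {
            for (c, &present) in acc.iter_mut().zip(col) {
                if present {
                    *c += 1; // fast path: plain byte increment
                }
            }
        }
        // Fold the chunk into the wide accumulator (the `.add(...)` step).
        for (t, &c) in total.iter_mut().zip(&acc) {
            *t += u32::from(c);
        }
    }
    total
}
```

With 600 columns this runs three chunks (254 + 254 + 92), and the byte counters stay below the sentinel in each of them.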
**`partial_group_min` (int matrix)**: copy first column via `.add(col_view(first))` (start from 0 ⇒ copy), then `.min(col_view(c))` for remaining.
**`partial_group_max` (int matrix)**: `.max(col_view(c))` for all columns (start from 0 ⇒ first column acts as copy).
**`partial_group_any`** uses `or_where` on `TempBitVecBuilder` (two-pass: primary bytes then overflow entries).
**`partial_group_all` / `partial_group_none`** (default): call `partial_group_presence_count`, then iterate slots to produce the bit result. O(n) extra pass, not chunked.
### add_col_from — matrix builder integration
Both matrix builders accept temp-file results directly:
```rust
// PersistentBitMatrixBuilder
fn add_col_from(&mut self, src: &TempBitVec) -> io::Result<()>
fn add_col_from_int(&mut self, src: &TempCompactIntVec) -> io::Result<()> // nonzero → 1
// PersistentCompactIntMatrixBuilder
fn add_col_from(&mut self, src: &TempCompactIntVec) -> io::Result<()>
fn add_col_from_bit(&mut self, src: &TempBitVec) -> io::Result<()> // bit → 0/1 u32
```
`add_col_from` copies the temp file to the matrix directory and increments `n_cols`; `close()` writes `meta.json` with the final column count. No separate `write_meta` step needed.
### mask_with
Direct method on `PersistentCompactIntVecBuilder` (and delegation via `TempCompactIntVecBuilder`). Zeros every slot where the corresponding mask bit is 0. All-ones mask words are skipped with a single comparison and only zero bits are visited, so the cost is O(n/64 + n_zeros).
```
for (w_idx, word) in mask.words():
    if word == u64::MAX: continue    // skip all-ones words
    zeros = !word
    while zeros != 0:
        bit = trailing_zeros(zeros)
        s = w_idx * 64 + bit
        if primary[s] != 0: set(s, 0)    // clears overflow entry too
        zeros &= zeros - 1
```
Terminal operation for Filter (retain only selected kmer slots in a count vector) and Select (positional selection without MPHF).
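A runnable version of the zero-bit walk, again on plain slices and a `HashMap` rather than the mmap-backed builder. The `s < n` guard for padding bits in the last mask word is an addition of this sketch, and `mask_with` as a free function is hypothetical.

```rust
use std::collections::HashMap;

/// Zero every slot whose mask bit is 0 (bits stored LSB-first per word).
pub fn mask_with(primary: &mut [u8], overflow: &mut HashMap<usize, u32>, mask: &[u64]) {
    let n = primary.len();
    for (w_idx, &word) in mask.iter().enumerate() {
        if word == u64::MAX {
            continue; // all 64 slots are kept
        }
        let mut zeros = !word;
        while zeros != 0 {
            let bit = zeros.trailing_zeros() as usize;
            let s = w_idx * 64 + bit;
            // Padding bits past the end of the vector are ignored.
            if s < n && primary[s] != 0 {
                primary[s] = 0;
                overflow.remove(&s); // drop the overflow entry, if any
            }
            zeros &= zeros - 1; // clear the lowest set bit
        }
    }
}
```

The `zeros &= zeros - 1` step removes exactly one bit per iteration, so the inner loop runs once per masked-out slot.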
+143
@@ -0,0 +1,143 @@
# `obitaxonomy` — taxonomy concept paths
`obitaxonomy` is a dependency-free crate that defines a typed representation
of hierarchical concept paths (taxonomic or otherwise) stored in genome metadata.
---
## Concept path syntax
A concept path is stored as a metadata value with the prefix `taxonomy:/`:
```
taxonomy:/enterobacteriaceae@family/Escherichia@genus/Escherichia coli@species
```
Structure:
- The `taxonomy:/` prefix is the type discriminator. Any metadata value starting
with it is parsed as a `TaxPath`; all others remain plain strings.
- The remainder is one or more `/`-separated segments.
- Each segment is `name` or `name@rank`, where `rank` is a label for the
taxonomic level (e.g. `family`, `genus`, `species`).
- Rank annotations are **optional per segment** and can be mixed freely.
- Spaces are allowed in both names and ranks.
### Reserved character
`@` is reserved throughout the taxonomy system and may **not** appear in:
| Context | Constraint |
|---------|------------|
| Segment name | forbidden |
| Rank/class label | forbidden |
| Metadata key names | forbidden (used as `key@rank` in predicate syntax) |
`@` is freely allowed in plain-text metadata values (non-taxonomy).
### Parse errors
| Condition | Error |
|-----------|-------|
| Value does not start with `taxonomy:/` | `MissingPrefix` |
| No segments after the prefix | `EmptyPath` |
| Segment with empty name (consecutive `/`) | `EmptySegmentName` |
| Segment with trailing `@` and no rank (`name@`) | `EmptyRankName` |
| Segment with more than one `@` | `AmbiguousRank` |
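The error table maps onto a short parser. The sketch below is not the crate's implementation: it returns plain `(name, rank)` tuples instead of `TaxSegment` values, and the order in which the checks are applied (for example, `@rank` with an empty name reporting `EmptySegmentName`) is an assumption.

```rust
#[derive(Debug, PartialEq)]
pub enum TaxError {
    MissingPrefix,
    EmptyPath,
    EmptySegmentName,
    EmptyRankName,
    AmbiguousRank,
}

/// Parse a metadata value into (name, optional rank) pairs.
pub fn parse_tax_path(s: &str) -> Result<Vec<(String, Option<String>)>, TaxError> {
    let rest = s.strip_prefix("taxonomy:/").ok_or(TaxError::MissingPrefix)?;
    if rest.is_empty() {
        return Err(TaxError::EmptyPath);
    }
    // Collecting into a Result stops at the first failing segment.
    rest.split('/').map(parse_segment).collect()
}

fn parse_segment(seg: &str) -> Result<(String, Option<String>), TaxError> {
    let mut parts = seg.split('@');
    let name = parts.next().unwrap_or("");
    let rank = parts.next();
    if parts.next().is_some() {
        return Err(TaxError::AmbiguousRank); // more than one `@`
    }
    if name.is_empty() {
        return Err(TaxError::EmptySegmentName);
    }
    match rank {
        Some("") => Err(TaxError::EmptyRankName), // trailing `@`
        Some(r) => Ok((name.to_string(), Some(r.to_string()))),
        None => Ok((name.to_string(), None)),
    }
}
```

Spaces need no special handling: the only structural characters are `/` and `@`.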
---
## Public API
### `TaxSegment`
A single node: a name and an optional rank.
```rust
seg.name() // &str
seg.rank() // Option<&str>
seg.to_string() // "name" or "name@rank"
TaxSegment::parse(s) // Result<TaxSegment, TaxError>
```
### `TaxPath`
```rust
TaxPath::parse(s) // Result<TaxPath, TaxError>
path.segments() // &[TaxSegment]
path.depth() // usize — number of segments
path.is_ancestor_of(&other) // bool — prefix match by name, ranks ignored
path.name_at_rank("genus") // Option<&str>
path.to_string() // reconstructs "taxonomy:/…"
```
`is_ancestor_of` compares segment **names** only — rank annotations are
informational and do not affect the ancestry relation.
```rust
let a: TaxPath = "taxonomy:/Enterobacteriaceae@family/Escherichia@genus".parse()?;
let b: TaxPath = "taxonomy:/Enterobacteriaceae@family/Escherichia@genus/Escherichia coli@species".parse()?;
assert!(a.is_ancestor_of(&b)); // true
assert!(b.is_ancestor_of(&a)); // false
assert!(a.is_ancestor_of(&a)); // true (equal ⇒ ancestor)
assert_eq!(b.name_at_rank("species"), Some("Escherichia coli"));
assert_eq!(b.name_at_rank("genus"), Some("Escherichia"));
assert_eq!(b.name_at_rank("order"), None);
```
---
## Integration with `GenomeInfo`
At index load time, every metadata value is inspected once:
- Starts with `taxonomy:/` → parsed into `TaxPath`, stored in `genome.taxonomy`.
- Otherwise → kept as-is in `genome.meta`.
```rust
struct GenomeInfo {
    label: String,
    meta: HashMap<String, String>,      // plain text metadata
    taxonomy: HashMap<String, TaxPath>, // parsed taxonomy metadata
}
```
The raw string is not duplicated. `TaxPath::to_string()` reconstructs the
original value losslessly for serialisation.
---
## Predicate operators (in `filter` / `select`)
Path predicates use the `~` / `!~` operators. The **stored value** always starts
with `/` (rooted path); the **query pattern** does not need to.
### Path pattern syntax
| Pattern | Semantics |
|---------|-----------|
| `A/B` | contiguous sub-path A then B, anywhere in the value |
| `/A/B` | value starts with A then B (start-anchored) |
| `A/B$` | value ends with A then B (end-anchored) |
| `/A/B$` | value is exactly A then B (fully anchored) |
| `A@x/B` | A with class `x` followed by B with any class |
| `A@x/B@y` | A with class `x` followed by B with class `y` |
A segment pattern without `@` matches the segment name regardless of its stored class.
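The anchoring rules of the table can be expressed as a sliding-window match over the stored segments. This is a sketch under assumptions: segments are given as `(name, class)` tuples, comparison is exact and case-sensitive, and `path_matches` is a hypothetical name rather than the predicate engine's real entry point.

```rust
/// One stored segment: (name, optional class/rank).
pub type Seg<'a> = (&'a str, Option<&'a str>);

/// Does one pattern segment (`name` or `name@class`) match one stored segment?
fn seg_matches(pat: &str, seg: Seg<'_>) -> bool {
    match pat.split_once('@') {
        Some((name, class)) => seg.0 == name && seg.1 == Some(class),
        None => seg.0 == pat, // no `@`: the stored class is ignored
    }
}

/// Match a path pattern against the stored segments.
pub fn path_matches(pattern: &str, value: &[Seg<'_>]) -> bool {
    // A trailing `$` anchors the end, a leading `/` anchors the start.
    let (pattern, end_anchored) = match pattern.strip_suffix('$') {
        Some(p) => (p, true),
        None => (pattern, false),
    };
    let (pattern, start_anchored) = match pattern.strip_prefix('/') {
        Some(p) => (p, true),
        None => (pattern, false),
    };
    let pats: Vec<&str> = pattern.split('/').collect();
    let (m, n) = (pats.len(), value.len());
    if m > n {
        return false;
    }
    // Try every window of m contiguous segments allowed by the anchors.
    (0..=n - m)
        .filter(|&start| (!start_anchored || start == 0) && (!end_anchored || start + m == n))
        .any(|start| {
            pats.iter()
                .zip(&value[start..start + m])
                .all(|(p, s)| seg_matches(p, *s))
        })
}
```

A fully anchored pattern leaves a single candidate window, and only when the pattern and the value have the same number of segments.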
### Rank-aware queries
```
key@rank=value
```
| Predicate form | Semantics |
|----------------|-----------|
| `key@rank=value` | genome's `key` has `value` at rank `rank` |
| `key@rank!=value` | does not |
| `key@rank=v1\|v2` | value at `rank` is `v1` or `v2` |
`~` combined with `@rank` on the key (e.g. `key@genus~pattern`) is not defined
and is rejected at parse time.
+234
@@ -0,0 +1,234 @@
# `select` — column projection and aggregation
`select` transforms an index by operating on its **genome columns**: projecting a
subset of columns, aggregating groups of genomes into synthetic columns, or both.
It is the column-axis counterpart of `filter` (row-axis operations).
Following relational algebra conventions:
| Command | Relational operation | Axis |
|----------|---------------------|----------|
| `filter` | σ — selection | rows (k-mers) |
| `select` | π — projection | columns (genomes) |
The two commands compose naturally: run `filter` first to restrict the kmer set,
then `select` to reshape the genome columns.
`select` never changes the kmer set. The MPHF and `unitigs.bin` of each layer
are preserved unchanged; only the data matrices are rewritten.
---
## Synopsis
```sh
obikmer select <input-index>
    { --output <dir> | --in-place }
    [--group <name>:<pred> ...]
    [--group-op <name>:<op> ...]
    [--aggregate-by <key> ]
    [--aggregate-op <op> ]
    [--select <col1,col2,...> ]
    [--presence-threshold <N> ]
```
---
## Output destination
Exactly one of `--output` or `--in-place` must be specified.
**`--output <dir>`** — writes a new index to `<dir>`. The source index is
unchanged. The MPHF and unitig files are copied; only the data matrices are
rewritten with the new column layout.
**`--in-place`** — rewrites the data matrices of the source index directly.
Removed or replaced columns are lost. The operation writes to temporary files
first, then renames atomically, so an interrupted run leaves the index intact.
---
## Defining output columns
### Named groups — `--group`
```
--group <name>:<pred>
```
Defines a named group of genomes using the same predicate syntax as `filter`.
Repeatable; a genome can belong to several groups.
```sh
--group "pub:species=Betula_pubescens"
--group "nan:species=Betula_nana"
```
### Per-group operator — `--group-op`
```
--group-op <name>:<op>
```
Assigns an aggregation operator to a named group. Optional; if absent, the
default operator applies (see below).
```sh
--group-op "pub:any"
--group-op "nan:all"
```
### Shorthand — `--aggregate-by` / `--aggregate-op`
`--aggregate-by <key>` automatically creates one group per unique value of the
metadata key `<key>`. Equivalent to one `--group <val>:<key>=<val>` per distinct
value. `--aggregate-op <op>` sets the operator for all auto-generated groups.
`--aggregate-by` and `--group` are mutually exclusive.
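The expansion performed by `--aggregate-by` amounts to grouping genome indices by the value of one metadata key. The sketch below assumes each genome's metadata is a `HashMap<String, String>` and uses a `BTreeMap` to obtain a deterministic, lexicographic value order; `groups_by_key` is a hypothetical helper, and the exact ordering used by the tool is not confirmed by the source.

```rust
use std::collections::{BTreeMap, HashMap};

/// One group per distinct value of `key`; each group lists genome indices.
/// Genomes that lack the key are skipped and join no group.
pub fn groups_by_key(
    genomes: &[HashMap<String, String>],
    key: &str,
) -> BTreeMap<String, Vec<usize>> {
    let mut groups: BTreeMap<String, Vec<usize>> = BTreeMap::new();
    for (i, meta) in genomes.iter().enumerate() {
        if let Some(value) = meta.get(key) {
            groups.entry(value.clone()).or_default().push(i);
        }
    }
    groups
}
```

Each resulting entry plays the role of one `--group <val>:<key>=<val>` definition.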
### Column selection and ordering — `--select`
```
--select col1,col2,...
```
Lists the output columns in order. Each element is either a group name (defined
by `--group` or generated by `--aggregate-by`) or a genome label from the source
index (pass-through, no aggregation).
**Default when `--select` is absent:**
all defined groups in declaration order (for `--group`), or all generated groups
in metadata-value order (for `--aggregate-by`). Individual genomes not in any
group are excluded unless named explicitly.
**When neither `--group` nor `--aggregate-by` is specified:**
`--select` can still reference genome labels for pure column projection (no
aggregation). If `--select` is also absent, all genomes are output unchanged
(identity transform — useful combined with row filtering via a prior `filter`
run).
---
## Aggregation operators
| Operator | Input | Output | Semantics |
|----------|-------------|----------|-----------|
| `any` | pres / count | presence | 1 if ≥ 1 genome in group carries the k-mer |
| `all` | pres / count | presence | 1 if every genome in group carries the k-mer |
| `none` | pres / count | presence | 1 if no genome in group carries the k-mer |
| `sum` | count | count | sum of counts across the group |
| `min` | count | count | minimum count |
| `max` | count | count | maximum count |
**Default operator:**
- Presence index: `any`
- Count index: `sum`
Logical operators (`any`/`all`/`none`) on a count index use
`--presence-threshold N` (default 0): a genome "carries" the k-mer if its count
is > N.
**Output index type:**
- If the source is a presence index, the output is always a presence index.
- If the source is a count index and every output column uses a logical operator
or is a pass-through from a presence source, the output is a presence index.
- Otherwise (at least one arithmetic operator on a count source), the output is
a count index.
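The operator table and the output-type rule can be condensed into two small functions. This is a sketch, not the tool's code: `Op`, `aggregate` and `output_is_count` are hypothetical names, the `none` operator is spelled `NoneOf` to avoid confusion with `Option::None`, and the empty-group results follow the edge-case behaviour documented for `select` (all-zeros, or all-ones for `none`).

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Op {
    Any,
    All,
    NoneOf, // CLI name: `none`
    Sum,
    Min,
    Max,
}

/// Aggregate the values of one k-mer across the genomes of a group.
/// A genome "carries" the k-mer when its count is strictly above `threshold`.
pub fn aggregate(op: Op, counts: &[u32], threshold: u32) -> u32 {
    let carriers = counts.iter().filter(|&&c| c > threshold).count();
    match op {
        Op::Any => u32::from(carriers >= 1),
        Op::All => u32::from(!counts.is_empty() && carriers == counts.len()),
        Op::NoneOf => u32::from(carriers == 0),
        Op::Sum => counts.iter().sum(),
        Op::Min => counts.iter().copied().min().unwrap_or(0),
        Op::Max => counts.iter().copied().max().unwrap_or(0),
    }
}

/// The output is a count index only if the source is one and at least one
/// output column uses an arithmetic operator.
pub fn output_is_count(source_is_count: bool, ops: &[Op]) -> bool {
    source_is_count && ops.iter().any(|op| matches!(op, Op::Sum | Op::Min | Op::Max))
}
```

On a presence index the values are 0 or 1, so the default threshold of 0 makes the logical operators behave as plain OR, AND and NOR.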
---
## Behaviour for edge cases
| Situation | Behaviour |
|-----------|-----------|
| Genome missing the metadata key in `--aggregate-by` | genome ignored (no `NA` group) |
| Genome in multiple groups | contributes independently to each |
| `--group-op` references undefined group | error |
| `--select` element is neither group name nor genome label | error |
| `--output` and `--in-place` both specified | error |
| Neither `--output` nor `--in-place` | error |
| Group with zero matching genomes | column is all-zeros (or all-ones for `none`) |
---
## Examples
### Aggregate by metadata group, default operators
```sh
obikmer select myindex --output out --aggregate-by group
# one column per unique value of "group"; presence→any, count→sum
```
### Named groups with different operators
```sh
obikmer select myindex --output out \
--group "pub:species=Betula_pubescens" \
--group "nan:species=Betula_nana" \
--group-op "pub:any" \
--group-op "nan:all" \
--select "pub,nan"
```
### Mix aggregated group and individual genome
```sh
obikmer select myindex --output out \
--group "A:group=A" \
--select "A,Betula_nana--IGA-24-39"
```
### Pure column projection (no aggregation)
```sh
obikmer select myindex --output out \
--select "Betula_nana--TROM-V-149986,Betula_nana--AG-P04-25-01"
```
### In-place: keep only group A
```sh
obikmer select myindex --in-place --group "A:group=A" --select "A"
```
### Compose with filter
```sh
# Step 1: keep only B. nana-specific k-mers
obikmer filter myindex --output filtered \
--ingroup "species=Betula_nana" --outgroup "*"
# Step 2: aggregate genome columns by collection site
obikmer select filtered --output final --aggregate-by site
```
---
## Implementation notes
`select` does not rebuild the MPHF. The 256 partitions are processed in parallel
(rayon), each writing its output independently; results require no synchronisation
because every partition owns a distinct set of files.
For each layer in each partition:
1. The slot count `n` is read by opening the source data matrix.
2. A new data matrix is built with M columns (M = number of output columns).
3. For each slot `s` in `0..n`:
   - `old_row = matrix.fill_row(s)`: reads the original `N`-column row without allocating.
   - For each output column `j`:
     - `new_row[j] = aggregate(op, old_row[group_indices])`.
     - Pass-through columns are represented as single-element groups with the
       default operator (`any` for presence, `sum` for count), so they follow the same code path.
   - The new row is written slot by slot into each column builder.
4. All plain files in the source layer directory (`mphf.bin`, `unitigs.bin`,
evidence files, `layer_meta.json`) are copied verbatim; only the `presence/`
or `counts/` subdirectory is rewritten.
5. `index.meta` is rewritten with the new genome list and updated `with_counts`.
**`--in-place` write strategy:** new data is written to a temporary sibling
directory (`presence_new/` or `counts_new/`); on success the old directory is
removed and the temporary one is renamed into place. An interrupted run leaves
at most one stale `*_new/` directory; the original data is intact until the
rename step.
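The staging-then-rename sequence described above can be sketched with the standard library alone. `replace_dir` and its `write_new` callback are hypothetical; the sketch follows the documented order (write to `<name>_new/`, remove the old directory, rename) and also clears a stale staging directory left by an earlier interrupted run, which is an assumption about how such leftovers are handled.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Rebuild `<layer>/<name>/` through a temporary sibling `<layer>/<name>_new/`.
pub fn replace_dir(
    layer: &Path,
    name: &str,
    write_new: impl FnOnce(&Path) -> io::Result<()>,
) -> io::Result<()> {
    let live = layer.join(name);
    let staged = layer.join(format!("{name}_new"));

    // A stale staging directory may remain from an interrupted run.
    if staged.exists() {
        fs::remove_dir_all(&staged)?;
    }
    fs::create_dir_all(&staged)?;

    // If writing fails here, the live directory has not been touched.
    write_new(&staged)?;

    // Swap: remove the old data, then move the new data into place.
    if live.exists() {
        fs::remove_dir_all(&live)?;
    }
    fs::rename(&staged, &live)
}
```

A caller would pass a closure that writes the new column files into the staging directory, for example `replace_dir(layer_dir, "counts", |dir| build_columns(dir))` with a hypothetical `build_columns`.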
+7 -5
@@ -9,15 +9,17 @@
 | `superkmer` | Extract super-kmers from a sequence file and write to stdout |
 | `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
 | `merge` | Merge multiple built indexes into one |
-| `rebuild` | Filter and compact an existing index into a new single-layer index |
+| `filter` | Apply row-level selection (σ) to an index: retain only k-mers matching the ingroup/outgroup predicates. Output is a new single-layer index; compaction is a consequence, not the goal. Supports the shared [kmer filtering](implementation/filtering.md) system |
 | `query` | Query an index with sequences and annotate matches |
-| `dump` | Dump all indexed kmers as CSV (kmer + per-genome counts or presence) |
+| `dump` | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the shared [kmer filtering](implementation/filtering.md) system; `--head N` limits output to the first N k-mers |
 | `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV |
-| `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees |
+| `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees; `--presence-threshold N` sets the minimum count to consider a k-mer present when computing Jaccard on count indexes (default 1) |
-| `unitig` | Dump unitigs from a built index to stdout (debug) |
+| `unitig` | Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the shared [kmer filtering](implementation/filtering.md) system |
+| `select` | Project and/or aggregate genome columns into a new or in-place index; the column-axis counterpart of `filter` (see [select](implementation/select.md)) |
 | `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
 | `reindex` | Convert an index's evidence in-place: exact ↔ approx |
-| `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label in-place (NEW gets OLD's identity) |
+| `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label; `--upgrade-index` adds missing `layer_meta.json` to old indexes |
+| `pack` | Pack per-column matrix files into single-file format to reduce query I/O |
 ## Constraints
+84
@@ -0,0 +1,84 @@
# Installation
## Prerequisites
### Rust toolchain
`obikmer` requires **Rust 1.85 or later** (edition 2024). Install or update via [rustup](https://rustup.rs):
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup update stable
```
### C build environment (required for hwloc)
`obikmer` embeds [hwloc](https://www.open-mpi.org/projects/hwloc/) (Hardware Locality) for NUMA-aware thread placement on multi-socket machines. hwloc is built from source at compile time via the `vendored` feature of the `hwlocality` crate. This requires a standard C build environment.
#### Linux (Debian/Ubuntu)
```bash
apt install build-essential automake libtool autoconf pkg-config
```
#### Linux (RHEL/Rocky/AlmaLinux)
```bash
dnf install gcc make automake libtool autoconf pkgconfig
```
#### HPC clusters
Most HPC clusters provide these tools via the module system:
```bash
module load gcc automake libtool autoconf
```
If in doubt, check whether `autoreconf --version` and `libtool --version` return successfully.
#### macOS
```bash
brew install automake libtool autoconf pkg-config
```
## Building
```bash
git clone <repository-url>
cd obikmer/src
cargo build --release
```
The compiled binary is at `target/release/obikmer`.
### Building on HPC clusters (network filesystems)
HPC home directories are typically on a network filesystem (Lustre, NFS) optimised for large sequential reads — not for the thousands of small file operations that Cargo generates during compilation. Building directly on such a filesystem can be extremely slow (0.1% CPU utilisation, tens of minutes for what should take seconds).
**Always redirect the build directory to a local scratch disk:**
```bash
CARGO_TARGET_DIR=/scratch/$USER/cargo-target cargo build --release
```
Adapt the path to the local scratch available on your cluster (`/var/tmp`, `/tmp`, `/scratch/local`, etc.). Once built, copy the binary to a permanent location:
```bash
cp /scratch/$USER/cargo-target/release/obikmer ~/bin/
```
## NUMA support
NUMA-aware thread placement is active automatically on multi-socket Linux machines (detected at runtime via hwloc). No special build flag is required — the detection is built in and falls back gracefully to the single-pool adaptive strategy on:
- macOS (Apple Silicon, unified memory)
- single-socket Linux machines
- any system where hwloc reports only one NUMA node
## Verifying the installation
```bash
obikmer --help
```
+1
@@ -2,3 +2,4 @@
 - [Project domain](project_domain.md): obikmer is for genomics (individual genomes), not metagenomics
 - [No architectural decisions without authorization](feedback_architectural_decisions.md): any architectural decision (memory, algorithm, structure) requires the user's explicit agreement before any action
+- [Intra-partition phases are parallel](feedback_phases_parallelism.md): graph build, compute_degrees, unitig traversal and MPHF use Rayon; never call them "sequential"
+12
@@ -0,0 +1,12 @@
---
name: feedback-phases-parallelism
description: The intra-partition phases (graph build, compute_degrees, unitig traversal, MPHF) all use Rayon; they are NOT sequential
metadata:
type: feedback
---
Never describe the intra-partition phases as "sequential". Each phase (graph build, compute_degrees, unitig traversal, MPHF build) uses Rayon internally and runs in parallel on several cores.
**Why:** The user has corrected this point several times. Describing these phases as "sequential" is a factual error that skews the performance analysis.
**How to apply:** When analysing CPU efficiency or the missing 25%, look for the cause in load imbalance between partitions, Rayon contention between workers, or inter-partition latency, not in a supposed serialisation of the phases.
+7
@@ -29,6 +29,7 @@ extra_javascript:
 nav:
   - Home: index.md
+  - Installation: installation.md
   - Theory:
     - Kmers and super-kmers: kmers.md
     - DNA encoding: theory/encoding.md
@@ -49,9 +50,15 @@ nav:
     - PersistentCompactIntVec: implementation/persistent_compact_int_vec.md
     - PersistentBitVec: implementation/persistent_bit_vec.md
     - Merge command: implementation/merge.md
+    - Merge parallelism & memory: implementation/merge_parallelism.md
+    - Kmer filtering: implementation/filtering.md
+    - Select command: implementation/select.md
+    - obitaxonomy crate: implementation/obitaxonomy.md
   - Architecture:
     - Sequences: architecture/sequences/invariant.md
     - Kmer index: architecture/index_architecture.md
+    - NUMA-aware worker pools: architecture/numa_worker_pools.md
+    - NUMA-aware partition runner: architecture/numa_partition_runner.md
 watch:
   - docmd
+44
@@ -0,0 +1,44 @@
# The obicompactvec crate

The current code is what it is. It is not the absolute truth, only a first implementation effort and nothing more. Below I describe the goals and the structure the crate should have: THE TRUTH TO BE REACHED.

The crate provides in-memory representations, as compact as possible, of k-mer count or presence matrices across genomes. Each column represents a genome and each row a k-mer. A matrix is a collection of vectors, each vector being one column of the matrix.

Matrices and columns alike are meant to be persistent. The data are stored in binary files and mapped into memory via `mmap`.

The structures are immutable by nature. Mutable representations of columns exist so that columns can be built. Once construction is complete, a column is closed, which makes it immutable.

A matrix can be represented in two ways:

- as a directory containing a collection of column files
- as a matrix file that is the concatenation of several column files.
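
To make the second layout concrete, here is a minimal sketch of where each column blob could start inside a concatenated matrix file. The two sizes come from the SAFETY comment in the packed-matrix Rust code of this same change set (header of 24 + n_cols×8 bytes, column blob of 16 + nwords×8 bytes). Everything else is an assumption made for illustration: the function names are hypothetical, and the sketch supposes that blobs are stored back to back right after the header, with no padding.

```python
HEADER_FIXED_BYTES = 24   # fixed part of the matrix header (from the SAFETY comment)
BLOB_HEADER_BYTES = 16    # per-column blob header (from the SAFETY comment)
WORD_BYTES = 8            # one u64 word


def blob_size(n_rows: int) -> int:
    """Size in bytes of one column blob holding n_rows bits."""
    n_words = (n_rows + 63) // 64
    return BLOB_HEADER_BYTES + n_words * WORD_BYTES


def blob_offsets(n_rows: int, n_cols: int) -> list:
    """Byte offset of each column blob (assumed contiguous, no padding)."""
    first = HEADER_FIXED_BYTES + n_cols * WORD_BYTES
    size = blob_size(n_rows)
    return [first + c * size for c in range(n_cols)]


offsets = blob_offsets(n_rows=100, n_cols=3)
# Every term is a multiple of 8, so every offset is 8-byte aligned.
assert all(off % 8 == 0 for off in offsets)
```

Under these assumptions every offset is a sum of multiples of 8, which is the property the Rust code relies on when it reinterprets the mapped bytes as `u64` words.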
## Count matrices

These are matrices of non-negative integers, most of which are small values (below 255). All values are assumed to be representable as a `u32`.
## Presence matrices

These are Boolean matrices represented as bit fields.

There is also an implicit form of presence vector, backed by no file, in which every value is true.
## Lightweight representation of columns

Columns, whether standalone (a column file) or part of a composite matrix file, can be represented by a lightweight object that gives access to their values and to the length of the vector. All computation methods must work solely from these unified lightweight column representations.
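
One consequence of this rule is that a pairwise routine only needs the number of columns and a function of two column indices; it never has to know how a column is stored. The sketch below illustrates the idea in Python. The name `pairwise_matrix` mirrors the helper called in the Rust hunk of this change set, but the body is an assumption: it is sequential, and it leaves the diagonal at zero.

```python
from itertools import combinations
from typing import Callable, List


def pairwise_matrix(n: int, dist: Callable[[int, int], int]) -> List[List[int]]:
    """Fill a symmetric n x n matrix from a function of two column indices."""
    m = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):   # upper triangle only, i < j
        v = dist(i, j)
        m[i][j] = v
        m[j][i] = v                          # mirror into the lower triangle
    return m


# Toy columns stored as plain lists; any column view with the same
# access pattern would work just as well.
cols = [[1, 0, 1, 1], [1, 1, 0, 1], [1, 0, 1, 0]]


def hamming(i: int, j: int) -> int:
    return sum(a != b for a, b in zip(cols[i], cols[j]))


matrix = pairwise_matrix(len(cols), hamming)
```

Because `dist` is evaluated once per unordered pair, the same helper can serve file-backed and packed columns alike.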
### Lightweight representation of a presence vector

The vector is represented by

- a bit field encoded as a `[u64]`
- a `usize` encoding the length of the bit field
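
The two fields above are enough to answer membership and counting queries. The following Python sketch models them with a list of 64-bit integers and a bit length. The class name `BitView`, the helper `from_bools`, and the least-significant-bit-first ordering inside each word are assumptions chosen for illustration; the note itself does not fix the bit order.

```python
from dataclasses import dataclass
from typing import Iterator, Sequence, Tuple

WORD_BITS = 64


@dataclass(frozen=True)
class BitView:
    words: Sequence[int]   # each int stands for one u64 word
    n_bits: int            # logical length of the bit field

    def get(self, i: int) -> bool:
        if not 0 <= i < self.n_bits:
            raise IndexError(i)
        # Assumed layout: bit i lives in word i // 64 at position i % 64.
        return bool((self.words[i >> 6] >> (i & 63)) & 1)

    def _masked(self) -> Iterator[int]:
        """Yield the words with any padding bits past n_bits cleared."""
        full, rem = divmod(self.n_bits, WORD_BITS)
        for k in range(full):
            yield self.words[k]
        if rem:
            yield self.words[full] & ((1 << rem) - 1)

    def count_ones(self) -> int:
        return sum(bin(w).count("1") for w in self._masked())

    def partial_jaccard(self, other: "BitView") -> Tuple[int, int]:
        """Return (intersection size, union size) of the two bit fields."""
        if self.n_bits != other.n_bits:
            raise ValueError("length mismatch")
        inter = union = 0
        for a, b in zip(self._masked(), other._masked()):
            inter += bin(a & b).count("1")
            union += bin(a | b).count("1")
        return inter, union


def from_bools(values: Sequence[bool]) -> BitView:
    words = [0] * ((len(values) + WORD_BITS - 1) // WORD_BITS)
    for i, v in enumerate(values):
        if v:
            words[i >> 6] |= 1 << (i & 63)
    return BitView(words, len(values))


a = from_bools([True, False, True, True] + [False] * 66)
b = from_bools([True, True, False, True] + [False] * 65 + [True])
inter, union = a.partial_jaccard(b)
```

Keeping the length separate from the word slice is what lets the last, partially filled word be masked correctly.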
### Lightweight representation of a count vector

The vector is represented by

- a `[u8]` vector directly encoding the small values of the vector, in the range [0, 255).
  The value 255 is a sentinel indicating that the true value is >= 255
  and is stored in an overflow structure
- an iterator of `(usize, u32)` listing the overflow values corresponding to the
  sentinel values (255) of the `[u8]`
- a `usize` encoding the length of the vector
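
The sentinel scheme above can be illustrated with a short Python sketch. The names `CountView` and `encode` are hypothetical, and the sketch assumes that the overflow entries are listed in increasing index order so that a single forward pass can resolve every sentinel. Here the length is simply the length of the byte vector.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple

SENTINEL = 255          # marks a value stored in the overflow list
U32_MAX = 2**32 - 1


@dataclass(frozen=True)
class CountView:
    small: bytes                               # one byte per row
    overflow: Tuple[Tuple[int, int], ...]      # (row index, true value), sorted by index

    def __len__(self) -> int:
        return len(self.small)

    def values(self) -> Iterator[int]:
        """Yield the true counts, resolving sentinels from the overflow list."""
        pending = iter(self.overflow)
        for i, b in enumerate(self.small):
            if b != SENTINEL:
                yield b
                continue
            entry = next(pending, None)
            if entry is None or entry[0] != i:
                raise ValueError(f"missing overflow entry for row {i}")
            yield entry[1]


def encode(counts: Iterable[int]) -> CountView:
    small = bytearray()
    overflow = []
    for i, c in enumerate(counts):
        if not 0 <= c <= U32_MAX:
            raise ValueError(f"count {c} does not fit in a u32")
        if c < SENTINEL:
            small.append(c)                    # stored directly
        else:
            small.append(SENTINEL)             # 255 itself also goes to overflow
            overflow.append((i, c))
    return CountView(bytes(small), tuple(overflow))


counts = [0, 3, 254, 255, 70000, 1]
view = encode(counts)
decoded = list(view.values())
```

Note that the value 255 itself must go through the overflow list, since the byte 255 is reserved as the sentinel.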
+347
@@ -0,0 +1,347 @@
#!/usr/bin/env python3
"""Parse obikmer merge debug log → Markdown performance report."""

import re
import sys
from datetime import datetime
from collections import defaultdict
from statistics import mean, median, stdev

ANSI = re.compile(r'\x1b\[[0-9;]*m')

def strip(s):
    return ANSI.sub('', s)

def parse_ts(s):
    return datetime.fromisoformat(s.replace('Z', '+00:00'))

def dur_s(s):
    s = s.strip()
    if s.endswith('ms'): return float(s[:-2]) / 1e3
    if s.endswith('µs'): return float(s[:-2]) / 1e6
    if s.endswith('us'): return float(s[:-2]) / 1e6
    if s.endswith('ns'): return float(s[:-2]) / 1e9
    if s.endswith('s'): return float(s[:-1])
    return float(s)

def fmt_s(s):
    if s < 0.001: return f"{s*1e6:.0f}µs"
    if s < 1: return f"{s*1e3:.0f}ms"
    if s < 60: return f"{s:.2f}s"
    return f"{s/60:.1f}min ({s:.0f}s)"

def fmt_rate(n, s):
    if s <= 0: return ""
    r = n / s
    if r >= 1e9: return f"{r/1e9:.2f}G/s"
    if r >= 1e6: return f"{r/1e6:.2f}M/s"
    if r >= 1e3: return f"{r/1e3:.2f}K/s"
    return f"{r:.0f}/s"

def pct(a, b):
    return f"{100*a/b:.1f}%" if b else ""

def stats_row(label, values, unit="s", fmt=fmt_s):
    if not values: return f"| {label} | — | — | — | — | — |"
    mn, mx, med, av = min(values), max(values), median(values), mean(values)
    sd = stdev(values) if len(values) > 1 else 0
    return f"| {label} | {fmt(mn)} | {fmt(med)} | {fmt(av)} | {fmt(mx)} | {fmt(sd)} |"

# ── patterns ──────────────────────────────────────────────────────────────────
TS = r'(\d{4}-\d{2}-\d{2}T[\d:.]+Z)'
pats = {
    'graph_done':    re.compile(TS + r'.*partition (\d+): de Bruijn graph done — (\d+) new kmers'),
    'trav_start':    re.compile(TS + r'.*partition (\d+): unitig traversal start — (\d+) nodes'),
    'trav_closing':  re.compile(TS + r'.*partition (\d+): unitig writer closing'),
    'trav_closed':   re.compile(TS + r'.*partition (\d+): unitig writer closed'),
    'graph_dropped': re.compile(TS + r'.*partition (\d+): graph dropped — starting MPHF build \((\d+) unitigs\)'),
    'mphf_done':     re.compile(TS + r'.*partition (\d+): MPHF build done'),
    'mphf_open':     re.compile(TS + r'.*partition (\d+): MPHF open in ([\d.]+)s'),
    'bld_ready':     re.compile(TS + r'.*partition (\d+): builders ready in ([\d.]+)s'),
    'pass2_done':    re.compile(TS + r'.*partition (\d+): pass2 pipeline done in ([\d.]+)s'),
    'bld_closed':    re.compile(TS + r'.*partition (\d+): builders closed in ([\d.]+)s'),
    'part_done':     re.compile(TS + r'.*partition (\d+): done in ([\d.]+)s — (\d+) new kmers'),
    'worker':        re.compile(TS + r'.*activated worker (\d+).*efficiency (\d+)%.*gain vs prev (\d+)%'),
    'worker_poll':   re.compile(TS + r'.*activated worker (\d+) \(poll\).*efficiency (\d+)%'),
    'compute_deg':   re.compile(TS + r'.*partition (\d+): compute_degrees in ([\d.]+)s — (\d+) nodes'),
    'stage_done':    re.compile(TS + r'.*done stage=merge_partitions wall_secs=([\d.]+)'),
    'workers_rep':   re.compile(r'workers spawned: (\d+) / (\d+)'),
}

# ── parse ─────────────────────────────────────────────────────────────────────
P = defaultdict(dict) # partition_id → timing dict
workers_ev = []
wall_total = None
workers_final = (None, None)
with open(sys.argv[1]) as f:
    for raw in f:
        line = strip(raw)

        m = pats['graph_done'].search(line)
        if m:
            pid = int(m.group(2))
            P[pid]['n_kmers'] = int(m.group(3))
            P[pid]['graph_done_ts'] = parse_ts(m.group(1))
            continue

        m = pats['trav_start'].search(line)
        if m:
            pid = int(m.group(2))
            P[pid]['trav_start_ts'] = parse_ts(m.group(1))
            P[pid]['n_nodes'] = int(m.group(3))
            continue

        m = pats['trav_closing'].search(line)
        if m:
            pid = int(m.group(2))
            P[pid]['trav_closing_ts'] = parse_ts(m.group(1))
            continue

        m = pats['trav_closed'].search(line)
        if m:
            pid = int(m.group(2))
            P[pid]['trav_closed_ts'] = parse_ts(m.group(1))
            continue

        m = pats['graph_dropped'].search(line)
        if m:
            pid = int(m.group(2))
            P[pid]['drop_ts'] = parse_ts(m.group(1))
            P[pid]['n_unitigs'] = int(m.group(3))
            continue

        m = pats['mphf_done'].search(line)
        if m:
            pid = int(m.group(2))
            P[pid]['mphf_done_ts'] = parse_ts(m.group(1))
            continue

        m = pats['mphf_open'].search(line)
        if m:
            pid = int(m.group(2))
            P[pid]['mphf_open_s'] = float(m.group(3))
            continue

        m = pats['bld_ready'].search(line)
        if m:
            pid = int(m.group(2))
            P[pid]['bld_ready_s'] = float(m.group(3))
            continue

        m = pats['pass2_done'].search(line)
        if m:
            pid = int(m.group(2))
            P[pid]['pass2_s'] = float(m.group(3))
            continue

        m = pats['bld_closed'].search(line)
        if m:
            pid = int(m.group(2))
            P[pid]['bld_closed_s'] = float(m.group(3))
            continue

        m = pats['part_done'].search(line)
        if m:
            pid = int(m.group(2))
            P[pid]['total_s'] = float(m.group(3))
            P[pid]['done_ts'] = parse_ts(m.group(1))
            continue

        m = pats['worker'].search(line)
        if m:
            workers_ev.append({'n': int(m.group(2)), 'eff': int(m.group(3)),
                               'gain': int(m.group(4)), 'ts': parse_ts(m.group(1)), 'poll': False})
            continue

        m = pats['worker_poll'].search(line)
        if m:
            workers_ev.append({'n': int(m.group(2)), 'eff': int(m.group(3)),
                               'gain': None, 'ts': parse_ts(m.group(1)), 'poll': True})
            continue

        m = pats['compute_deg'].search(line)
        if m:
            pid = int(m.group(2))
            P[pid]['cdeg_s'] = float(m.group(3))
            P[pid]['n_nodes'] = P[pid].get('n_nodes') or int(m.group(4))
            continue

        m = pats['stage_done'].search(line)
        if m:
            wall_total = float(m.group(2))
            continue

        m = pats['workers_rep'].search(line)
        if m:
            workers_final = (int(m.group(1)), int(m.group(2)))
            continue

# ── derive per-partition phases ───────────────────────────────────────────────
def tsdiff(p, k1, k2):
    if k1 in p and k2 in p:
        return (p[k2] - p[k1]).total_seconds()
    return None

phases = {}
for pid, p in P.items():
    row = {'pid': pid}
    row['n_kmers'] = p.get('n_kmers', 0)
    row['n_nodes'] = p.get('n_nodes', 0)
    row['n_unitigs'] = p.get('n_unitigs', 0)
    row['total_s'] = p.get('total_s')
    row['cdeg_s'] = p.get('cdeg_s')
    row['mphf_open_s'] = p.get('mphf_open_s')
    row['bld_ready_s'] = p.get('bld_ready_s')
    row['pass2_s'] = p.get('pass2_s')
    row['bld_closed_s'] = p.get('bld_closed_s')

    # Traversal: trav_start → trav_closing (= writing all unitigs)
    row['trav_s'] = tsdiff(p, 'trav_start_ts', 'trav_closing_ts')
    # Writer close: trav_closing → trav_closed
    row['close_s'] = tsdiff(p, 'trav_closing_ts', 'trav_closed_ts')
    # Graph drop: trav_closed → drop_ts
    row['drop_s'] = tsdiff(p, 'trav_closed_ts', 'drop_ts')
    # MPHF build: drop_ts → mphf_done_ts
    row['mphf_s'] = tsdiff(p, 'drop_ts', 'mphf_done_ts')
    # After MPHF: mphf_done → done_ts
    row['post_s'] = tsdiff(p, 'mphf_done_ts', 'done_ts')

    # Graph build: total - known phases (rough estimate)
    known = sum(v for v in [row['cdeg_s'], row['trav_s'], row['close_s'], row['drop_s'],
                            row['mphf_s'], row['mphf_open_s'], row['bld_ready_s'],
                            row['pass2_s'], row['bld_closed_s']] if v is not None)
    row['graph_build_s'] = (row['total_s'] - known) if row['total_s'] else None

    phases[pid] = row

# helpers
def collect(key):
    return [r[key] for r in phases.values() if r.get(key) is not None]

def rate_stats(n_key, t_key):
    """Returns list of throughput values (items/s)."""
    result = []
    for r in phases.values():
        n, t = r.get(n_key), r.get(t_key)
        if n and t and t > 0:
            result.append(n / t)
    return result

# ── output ────────────────────────────────────────────────────────────────────
out = []
w = out.append
w("# obikmer merge — performance report\n")
# Run info
n_parts = len([r for r in phases.values() if r['n_kmers'] > 0])
n_empty = len([r for r in phases.values() if r['n_kmers'] == 0])
total_kmers = sum(r['n_kmers'] for r in phases.values())
w("## Run summary\n")
w(f"- **Partitions**: {len(phases)} total — {n_parts} non-empty, {n_empty} empty")
w(f"- **New kmers (total)**: {total_kmers:,}")
if wall_total:
    w(f"- **merge_partitions wall time**: {fmt_s(wall_total)}")
if workers_final[0]:
    w(f"- **Workers spawned**: {workers_final[0]} / {workers_final[1]} (max)")
w("")

# Worker spawn timeline
if workers_ev:
    w("## Worker activation\n")
    w("| Time | Worker # | Trigger | Efficiency | Gain vs prev |")
    w("|------|----------|---------|------------|--------------|")
    t0 = workers_ev[0]['ts']
    for e in workers_ev:
        elapsed = fmt_s((e['ts'] - t0).total_seconds())
        trigger = "poll (timeout)" if e['poll'] else "partition done"
        gain = f"{e['gain']}%" if e.get('gain') is not None else ""
        w(f"| +{elapsed} | {e['n']} | {trigger} | {e['eff']}% | {gain} |")
    w("")

# Phase breakdown table
w("## Phase timing statistics\n")
w("Columns: min | median | mean | max | stdev\n")
w("| Phase | min | median | mean | max | stdev |")
w("|-------|-----|--------|------|-----|-------|")
w(stats_row("Graph build (estimated)", collect('graph_build_s')))
w(stats_row("compute_degrees", collect('cdeg_s')))
w(stats_row("Unitig traversal", collect('trav_s')))
w(stats_row("Writer close (uw.close)", collect('close_s')))
w(stats_row("Graph drop", collect('drop_s')))
w(stats_row("MPHF build", collect('mphf_s')))
w(stats_row("MPHF open", collect('mphf_open_s')))
w(stats_row("Builders ready", collect('bld_ready_s')))
w(stats_row("Pass2 pipeline", collect('pass2_s')))
w(stats_row("Builders close", collect('bld_closed_s')))
w(stats_row("Post-MPHF (residual)", collect('post_s')))
w(stats_row("**Total per partition**", collect('total_s')))
w("")
# Throughput
w("## Throughput by phase\n")
w("| Phase | metric | min | median | mean | max |")
w("|-------|--------|-----|--------|------|-----|")
def rate_row(label, rates):
    if not rates: return f"| {label} | — | — | — | — | — |"
    f = lambda x: fmt_rate(x, 1)
    mn, med, av, mx = min(rates), median(rates), mean(rates), max(rates)
    return f"| {label} | nodes/s | {f(mn)} | {f(med)} | {f(av)} | {f(mx)} |"

w(rate_row("compute_degrees", rate_stats('n_nodes', 'cdeg_s')))
w(rate_row("Unitig traversal", rate_stats('n_nodes', 'trav_s')))
w(rate_row("MPHF build", rate_stats('n_unitigs', 'mphf_s')))
w("")
# Top 10 slowest partitions
w("## Top 10 slowest partitions\n")
w("| Partition | nodes | unitigs | total | trav | MPHF | graph build |")
w("|-----------|-------|---------|-------|------|------|-------------|")
sorted_parts = sorted(phases.values(), key=lambda r: r['total_s'] or 0, reverse=True)
for r in sorted_parts[:10]:
    pid = r['pid']
    def f(k): return fmt_s(r[k]) if r.get(k) is not None else ""
    nodes = f"{r['n_nodes']/1e6:.1f}M" if r['n_nodes'] else ""
    unitigs = f"{r['n_unitigs']/1e6:.1f}M" if r['n_unitigs'] else ""
    w(f"| {pid} | {nodes} | {unitigs} | {f('total_s')} | {f('trav_s')} | {f('mphf_s')} | {f('graph_build_s')} |")
w("")
# Phase share of total time (for non-empty partitions with full data)
complete = [r for r in phases.values()
            if all(r.get(k) is not None
                   for k in ('total_s','trav_s','close_s','drop_s','mphf_s',
                             'mphf_open_s','bld_ready_s','pass2_s','bld_closed_s'))
            and r['total_s'] and r['total_s'] > 0]
if complete:
    w("## Phase share of total time (mean across complete partitions)\n")
    total_mean = mean(r['total_s'] for r in complete)
    w(f"_Based on {len(complete)} partitions with full timing data. Mean total: {fmt_s(total_mean)}_\n")
    w("| Phase | mean time | share |")
    w("|-------|-----------|-------|")
    for label, key in [
        ("Graph build", 'graph_build_s'),
        ("compute_degrees", 'cdeg_s'),
        ("Unitig traversal", 'trav_s'),
        ("Writer close", 'close_s'),
        ("Graph drop", 'drop_s'),
        ("MPHF build", 'mphf_s'),
        ("MPHF open", 'mphf_open_s'),
        ("Builders ready", 'bld_ready_s'),
        ("Pass2 pipeline", 'pass2_s'),
        ("Builders close", 'bld_closed_s'),
        ("Post-MPHF (residual)", 'post_s'),
    ]:
        # cdeg_s and post_s are not part of the completeness check above,
        # so they may still be None; skip missing values instead of
        # letting mean() fail on them.
        vals = [r[key] for r in complete if r.get(key) is not None]
        if not vals:
            continue
        m = mean(vals)
        w(f"| {label} | {fmt_s(m)} | {pct(m, total_mean)} |")
    w("")

print('\n'.join(out))
+313 -6
@@ -128,6 +128,12 @@ version = "0.4.1"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2ad8689a486416c401ea15715a4694de30054248ec627edbf31f49cb64ee4086" checksum = "2ad8689a486416c401ea15715a4694de30054248ec627edbf31f49cb64ee4086"
[[package]]
name = "arrayvec"
version = "0.7.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7c02d123df017efcdfbd739ef81735b36c5ba83ec3c59c80a9d7ecc718f92e50"
[[package]] [[package]]
name = "as-slice" name = "as-slice"
version = "0.2.1" version = "0.2.1"
@@ -143,6 +149,15 @@ version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c08606f8c3cbf4ce6ec8e28fb0014a2c086708fe954eaa885384a6165172e7e8" checksum = "c08606f8c3cbf4ce6ec8e28fb0014a2c086708fe954eaa885384a6165172e7e8"
[[package]]
name = "autotools"
version = "0.2.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ef941527c41b0fc0dd48511a8154cd5fc7e29200a0ff8b7203c5d777dbc795cf"
dependencies = [
"cc",
]
[[package]] [[package]]
name = "backtrace" name = "backtrace"
version = "0.3.76" version = "0.3.76"
@@ -224,6 +239,15 @@ dependencies = [
"generic-array", "generic-array",
] ]
[[package]]
name = "block-buffer"
version = "0.12.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d2f6c7dbe95a6ed67ad9f18e57daf93a2f034c524b99fd2b76d18fdfeb6660aa"
dependencies = [
"hybrid-array",
]
[[package]] [[package]]
name = "block-pseudorand" name = "block-pseudorand"
version = "0.1.2" version = "0.1.2"
@@ -415,6 +439,15 @@ version = "1.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c8d4a3bb8b1e0c1050499d1815f5ab16d04f0959b233085fb31653fbfc9d98f9" checksum = "c8d4a3bb8b1e0c1050499d1815f5ab16d04f0959b233085fb31653fbfc9d98f9"
[[package]]
name = "cmake"
version = "0.1.58"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c0f78a02292a74a88ac736019ab962ece0bc380e3f977bf72e376c5d78ff0678"
dependencies = [
"cc",
]
[[package]] [[package]]
name = "colorchoice" name = "colorchoice"
version = "1.0.5" version = "1.0.5"
@@ -464,6 +497,21 @@ dependencies = [
"windows-sys 0.59.0", "windows-sys 0.59.0",
] ]
[[package]]
name = "const-oid"
version = "0.10.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a6ef517f0926dd24a1582492c791b6a4818a4d94e789a334894aa15b0d12f55c"
[[package]]
name = "convert_case"
version = "0.10.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "633458d4ef8c78b72454de2d54fd6ab2e60f9e02be22f3c6104cdc8a4e0fceb9"
dependencies = [
"unicode-segmentation",
]
[[package]] [[package]]
name = "core-foundation-sys" name = "core-foundation-sys"
version = "0.8.7" version = "0.8.7"
@@ -488,6 +536,15 @@ dependencies = [
"libc", "libc",
] ]
[[package]]
name = "cpufeatures"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8b2a41393f66f16b0823bb79094d54ac5fbd34ab292ddafb9a0456ac9f87d201"
dependencies = [
"libc",
]
[[package]] [[package]]
name = "crc32fast" name = "crc32fast"
version = "1.5.0" version = "1.5.0"
@@ -601,6 +658,15 @@ dependencies = [
"typenum", "typenum",
] ]
[[package]]
name = "crypto-common"
version = "0.2.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ce6e4c961d6cd6c9a86db418387425e8bdeaf05b3c8bc1411e6dca4c252f1453"
dependencies = [
"hybrid-array",
]
[[package]] [[package]]
name = "csv" name = "csv"
version = "1.4.0" version = "1.4.0"
@@ -640,14 +706,48 @@ dependencies = [
"uuid", "uuid",
] ]
[[package]]
name = "derive_more"
version = "2.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d751e9e49156b02b44f9c1815bcb94b984cdcc4396ecc32521c739452808b134"
dependencies = [
"derive_more-impl",
]
[[package]]
name = "derive_more-impl"
version = "2.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "799a97264921d8623a957f6c3b9011f3b5492f557bbb7a5a19b7fa6d06ba8dcb"
dependencies = [
"convert_case",
"proc-macro2",
"quote",
"rustc_version",
"syn",
"unicode-xid",
]
[[package]] [[package]]
name = "digest" name = "digest"
version = "0.10.7" version = "0.10.7"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9ed9a281f7bc9b7576e61468ba615a66a5c8cfdff42420a70aa82701a3b1e292" checksum = "9ed9a281f7bc9b7576e61468ba615a66a5c8cfdff42420a70aa82701a3b1e292"
dependencies = [ dependencies = [
"block-buffer", "block-buffer 0.10.4",
"crypto-common", "crypto-common 0.1.7",
]
[[package]]
name = "digest"
version = "0.11.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f1dd6dbb5841937940781866fa1281a1ff7bd3bf827091440879f9994983d5c2"
dependencies = [
"block-buffer 0.12.1",
"const-oid",
"crypto-common 0.2.2",
] ]
[[package]] [[package]]
@@ -742,6 +842,16 @@ version = "2.4.1"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9f1f227452a390804cdb637b74a86990f2a7d7ba4b7d5693aac9b4dd6defd8d6" checksum = "9f1f227452a390804cdb637b74a86990f2a7d7ba4b7d5693aac9b4dd6defd8d6"
[[package]]
name = "filetime"
version = "0.2.29"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5c287a33c7f0a620c38e641e7f60827713987b3c0f26e8ddc9462cc69cf75759"
dependencies = [
"cfg-if",
"libc",
]
[[package]] [[package]]
name = "find-msvc-tools" name = "find-msvc-tools"
version = "0.1.9" version = "0.1.9"
@@ -884,6 +994,7 @@ checksum = "e5274423e17b7c9fc20b6e7e208532f9b19825d82dfd615708b70edd83df41f1"
dependencies = [ dependencies = [
"ahash", "ahash",
"allocator-api2", "allocator-api2",
"rayon",
] ]
[[package]] [[package]]
@@ -915,6 +1026,65 @@ version = "0.5.2"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fc0fef456e4baa96da950455cd02c081ca953b141298e41db3fc7e36b1da849c" checksum = "fc0fef456e4baa96da950455cd02c081ca953b141298e41db3fc7e36b1da849c"
[[package]]
name = "http"
version = "1.4.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6970f50e31d6fc17d3fa27329444bfa74e196cf62e95052a3f6fee181dba6425"
dependencies = [
"bytes",
"itoa",
]
[[package]]
name = "httparse"
version = "1.10.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6dbf3de79e51f3d586ab4cb9d5c3e2c14aa28ed23d180cf89b4df0454a69cc87"
[[package]]
name = "hwlocality"
version = "1.0.0-alpha.12"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4c2e65a48d3b300843ac84a2fe8e166bb5a5b00f30054593bcee8157e4b465fd"
dependencies = [
"arrayvec",
"bitflags 2.11.1",
"derive_more",
"errno",
"hwlocality-sys",
"libc",
"strum",
"thiserror 2.0.18",
"windows-sys 0.61.2",
]
[[package]]
name = "hwlocality-sys"
version = "0.7.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "10a83c43a772c1f774b806deb44891c2a9578eb33cec48aad513482e0da3d4d4"
dependencies = [
"autotools",
"cmake",
"flate2",
"libc",
"pkg-config",
"sha3",
"tar",
"ureq 3.3.0",
"windows-sys 0.61.2",
]
[[package]]
name = "hybrid-array"
version = "0.4.12"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9155a582abd142abc056962c29e3ce5ff2ad5469f4246b537ed42c5deba857da"
dependencies = [
"typenum",
]
[[package]] [[package]]
name = "icu_collections" name = "icu_collections"
version = "2.2.0" version = "2.2.0"
@@ -1144,6 +1314,16 @@ dependencies = [
"wasm-bindgen", "wasm-bindgen",
] ]
[[package]]
name = "keccak"
version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9e24a010dd405bd7ed803e5253182815b41bf2e6a80cc3bfc066658e03a198aa"
dependencies = [
"cfg-if",
"cpufeatures 0.3.0",
]
[[package]] [[package]]
name = "kodama" name = "kodama"
version = "0.2.3" version = "0.2.3"
@@ -1485,9 +1665,12 @@ name = "obidebruinj"
version = "0.1.0" version = "0.1.0"
dependencies = [ dependencies = [
"ahash", "ahash",
"crossbeam-channel",
"hashbrown 0.14.5", "hashbrown 0.14.5",
"obifastwrite", "obifastwrite",
"obikseq", "obikseq",
"rayon",
"tracing",
"xxhash-rust", "xxhash-rust",
] ]
@@ -1503,6 +1686,8 @@ dependencies = [
name = "obikindex" name = "obikindex"
version = "0.1.0" version = "0.1.0"
dependencies = [ dependencies = [
"crossbeam-channel",
"hwlocality",
"indicatif", "indicatif",
"ndarray", "ndarray",
"obicompactvec", "obicompactvec",
@@ -1519,12 +1704,13 @@ dependencies = [
[[package]] [[package]]
name = "obikmer" name = "obikmer"
version = "0.1.0" version = "1.1.9"
dependencies = [ dependencies = [
"clap", "clap",
"csv", "csv",
"indicatif", "indicatif",
"kodama", "kodama",
"obidebruinj",
"obifastwrite", "obifastwrite",
"obikindex", "obikindex",
"obikpartitionner", "obikpartitionner",
@@ -1536,6 +1722,7 @@ dependencies = [
"obiskbuilder", "obiskbuilder",
"obiskio", "obiskio",
"obisys", "obisys",
"obitaxonomy",
"pprof", "pprof",
"rayon", "rayon",
"serde_json", "serde_json",
@@ -1558,6 +1745,7 @@ dependencies = [
"obikrope", "obikrope",
"obikseq", "obikseq",
"obilayeredmap", "obilayeredmap",
"obipipeline",
"obiread", "obiread",
"obiskbuilder", "obiskbuilder",
"obiskio", "obiskio",
@@ -1629,7 +1817,7 @@ dependencies = [
"regex", "regex",
"tracing", "tracing",
"tracing-subscriber", "tracing-subscriber",
"ureq", "ureq 2.12.1",
] ]
[[package]] [[package]]
@@ -1663,8 +1851,13 @@ dependencies = [
"indicatif", "indicatif",
"libc", "libc",
"sysinfo", "sysinfo",
"tracing",
] ]
[[package]]
name = "obitaxonomy"
version = "0.1.0"
[[package]] [[package]]
name = "object" name = "object"
version = "0.37.3" version = "0.37.3"
@@ -2169,6 +2362,15 @@ version = "2.1.2"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "94300abf3f1ae2e2b8ffb7b58043de3d399c73fa6f4b73826402a5c457614dbe" checksum = "94300abf3f1ae2e2b8ffb7b58043de3d399c73fa6f4b73826402a5c457614dbe"
[[package]]
name = "rustc_version"
version = "0.4.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "cfcb3a22ef46e85b45de6ee7e79d063319ebb6594faafcf1c225ea92ab6e9b92"
dependencies = [
"semver",
]
[[package]] [[package]]
name = "rustix" name = "rustix"
version = "1.1.4" version = "1.1.4"
@@ -2255,6 +2457,12 @@ dependencies = [
"syn", "syn",
] ]
[[package]]
name = "semver"
version = "1.0.28"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8a7852d02fc848982e0c167ef163aaff9cd91dc640ba85e263cb1ce46fae51cd"
[[package]] [[package]]
name = "serde" name = "serde"
version = "1.0.228" version = "1.0.228"
@@ -2305,8 +2513,18 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a7507d819769d01a365ab707794a4084392c824f54a7a6a7862f8c3d0892b283" checksum = "a7507d819769d01a365ab707794a4084392c824f54a7a6a7862f8c3d0892b283"
dependencies = [ dependencies = [
"cfg-if", "cfg-if",
"cpufeatures", "cpufeatures 0.2.17",
"digest", "digest 0.10.7",
]
[[package]]
name = "sha3"
version = "0.11.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "be176f1a57ce4e3d31c1a166222d9768de5954f811601fb7ca06fc8203905ce1"
dependencies = [
"digest 0.11.3",
"keccak",
] ]
[[package]] [[package]]
@@ -2367,6 +2585,27 @@ version = "0.11.1"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7da8b5736845d9f2fcb837ea5d9e2628564b3b043a70948a3f0b778838c5fb4f" checksum = "7da8b5736845d9f2fcb837ea5d9e2628564b3b043a70948a3f0b778838c5fb4f"
[[package]]
name = "strum"
version = "0.28.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9628de9b8791db39ceda2b119bbe13134770b56c138ec1d3af810d045c04f9bd"
dependencies = [
"strum_macros",
]
[[package]]
name = "strum_macros"
version = "0.28.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ab85eea0270ee17587ed4156089e10b9e6880ee688791d45a905f5b1ca36f664"
dependencies = [
"heck",
"proc-macro2",
"quote",
"syn",
]
[[package]] [[package]]
name = "subtle" name = "subtle"
version = "2.6.1" version = "2.6.1"
@@ -2462,6 +2701,17 @@ version = "1.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "55937e1799185b12863d447f42597ed69d9928686b8d88a1df17376a097d8369" checksum = "55937e1799185b12863d447f42597ed69d9928686b8d88a1df17376a097d8369"
[[package]]
name = "tar"
version = "0.4.46"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3f6221d9a6003c78398e3b239969f352578258df48c8eb051caadae0015bc840"
dependencies = [
"filetime",
"libc",
"xattr",
]
[[package]] [[package]]
name = "tempfile" name = "tempfile"
version = "3.27.0" version = "3.27.0"
@@ -2637,12 +2887,24 @@ version = "1.0.24"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e6e4313cd5fcd3dad5cafa179702e2b244f760991f45397d14d4ebf38247da75" checksum = "e6e4313cd5fcd3dad5cafa179702e2b244f760991f45397d14d4ebf38247da75"
[[package]]
name = "unicode-segmentation"
version = "1.13.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c6f5d3c3b1bf09027a88a6bc961fc00497d651009560b5463668dc81b0fa87a8"
[[package]] [[package]]
name = "unicode-width" name = "unicode-width"
version = "0.2.2" version = "0.2.2"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b4ac048d71ede7ee76d585517add45da530660ef4390e49b098733c6e897f254" checksum = "b4ac048d71ede7ee76d585517add45da530660ef4390e49b098733c6e897f254"
[[package]]
name = "unicode-xid"
version = "0.2.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ebc1c04c71510c7f702b52b7c350734c9ff1295c464a03335b00bb84fc54f853"
[[package]] [[package]]
name = "untrusted" name = "untrusted"
version = "0.9.0" version = "0.9.0"
@@ -2665,6 +2927,35 @@ dependencies = [
"webpki-roots 0.26.11", "webpki-roots 0.26.11",
] ]
[[package]]
name = "ureq"
version = "3.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "dea7109cdcd5864d4eeb1b58a1648dc9bf520360d7af16ec26d0a9354bafcfc0"
dependencies = [
"base64",
"flate2",
"log",
"percent-encoding",
"rustls",
"rustls-pki-types",
"ureq-proto",
"utf8-zero",
"webpki-roots 1.0.7",
]
[[package]]
name = "ureq-proto"
version = "0.6.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e994ba84b0bd1b1b0cf92878b7ef898a5c1760108fe7b6010327e274917a808c"
dependencies = [
"base64",
"http",
"httparse",
"log",
]
[[package]] [[package]]
name = "url" name = "url"
version = "2.5.8" version = "2.5.8"
@@ -2677,6 +2968,12 @@ dependencies = [
"serde", "serde",
] ]
[[package]]
name = "utf8-zero"
version = "0.8.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b8c0a043c9540bae7c578c88f91dda8bd82e59ae27c21baca69c8b191aaf5a6e"
[[package]] [[package]]
name = "utf8_iter" name = "utf8_iter"
version = "1.0.4" version = "1.0.4"
@@ -3102,6 +3399,16 @@ dependencies = [
"tap", "tap",
] ]
[[package]]
name = "xattr"
version = "1.6.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "32e45ad4206f6d2479085147f02bc2ef834ac85886624a23575ae137c8aa8156"
dependencies = [
"libc",
"rustix",
]
[[package]] [[package]]
name = "xxhash-rust" name = "xxhash-rust"
version = "0.8.15" version = "0.8.15"
+1 -1
@@ -1,5 +1,5 @@
[workspace] [workspace]
resolver = "3" resolver = "3"
members = ["obikseq", "obiread", "obiskbuilder", "obifastwrite", "obikmer","obikrope","obipipeline", "obikpartitionner","obiskio","obidebruinj","obilayeredmap", "obicompactvec", "obisys", "obikindex"] members = ["obikseq", "obiread", "obiskbuilder", "obifastwrite", "obikmer","obikrope","obipipeline", "obikpartitionner","obiskio","obidebruinj","obilayeredmap", "obicompactvec", "obisys", "obikindex", "obitaxonomy"]
[profile.release] [profile.release]
debug = 1 debug = 1
+1 -1
@@ -7,6 +7,6 @@ edition = "2024"
memmap2 = "0.9" memmap2 = "0.9"
ndarray = "0.16" ndarray = "0.16"
rayon = "1" rayon = "1"
tempfile = "3"
[dev-dependencies] [dev-dependencies]
tempfile = "3"
+226 -79
@@ -1,5 +1,5 @@
use std::fs::{self, File}; use std::fs::{self, File};
use std::io::{self, Write as _}; use std::io::{self, BufWriter, Write as _};
use std::path::{Path, PathBuf}; use std::path::{Path, PathBuf};
use memmap2::Mmap; use memmap2::Mmap;
@@ -7,8 +7,12 @@ use ndarray::{Array1, Array2};
use rayon::prelude::*; use rayon::prelude::*;
use crate::bitvec::{PersistentBitVec, PersistentBitVecBuilder}; use crate::bitvec::{PersistentBitVec, PersistentBitVecBuilder};
use crate::colgroup::{ColGroup, MatrixGroupOps};
use crate::layer_meta::LayerMeta; use crate::layer_meta::LayerMeta;
use crate::meta::MatrixMeta; use crate::meta::MatrixMeta;
use crate::tempbitvec::{TempBitVec, TempBitVecBuilder};
use crate::tempintvec::{TempCompactIntVec, TempCompactIntVecBuilder};
use crate::views::BitSliceView;
fn col_path(dir: &Path, col: usize) -> PathBuf { fn col_path(dir: &Path, col: usize) -> PathBuf {
dir.join(format!("col_{col:06}.pbiv")) dir.join(format!("col_{col:06}.pbiv"))
@@ -53,52 +57,12 @@ impl ColumnarBitMatrix {
Array1::from_vec(counts) Array1::from_vec(counts)
} }
pub(crate) fn jaccard_dist_matrix(&self) -> Array2<f64> {
self.pairwise_f64(|i, j| self.col(i).jaccard_dist(self.col(j)))
}
pub(crate) fn hamming_dist_matrix(&self) -> Array2<u64> {
self.pairwise_u64(|i, j| self.col(i).hamming_dist(self.col(j)))
}
pub(crate) fn partial_jaccard_dist_matrix(&self) -> (Array2<u64>, Array2<u64>) { pub(crate) fn partial_jaccard_dist_matrix(&self) -> (Array2<u64>, Array2<u64>) {
let n = self.n_cols(); pairwise2_matrix(self.n_cols(), |i, j| self.col(i).partial_jaccard_dist(self.col(j)))
let results: Vec<(usize, usize, u64, u64)> = upper_pairs(n)
.into_par_iter()
.map(|(i, j)| {
let (inter, union) = self.col(i).partial_jaccard_dist(self.col(j));
(i, j, inter, union)
})
.collect();
let mut inter_m = Array2::zeros((n, n));
let mut union_m = Array2::zeros((n, n));
for (i, j, inter, union) in results {
inter_m[[i, j]] = inter; inter_m[[j, i]] = inter;
union_m[[i, j]] = union; union_m[[j, i]] = union;
}
(inter_m, union_m)
} }
pub(crate) fn partial_hamming_dist_matrix(&self) -> Array2<u64> { pub(crate) fn partial_hamming_dist_matrix(&self) -> Array2<u64> {
self.pairwise_u64(|i, j| self.col(i).hamming_dist(self.col(j))) pairwise_matrix(self.n_cols(), |i, j| self.col(i).hamming_dist(self.col(j)))
}
fn pairwise_f64(&self, f: impl Fn(usize, usize) -> f64 + Sync) -> Array2<f64> {
let n = self.n_cols();
let results: Vec<(usize, usize, f64)> = upper_pairs(n)
.into_par_iter()
.map(|(i, j)| (i, j, f(i, j)))
.collect();
fill_symmetric(n, results.into_iter().map(|(i, j, v)| (i, j, v, v)))
}
fn pairwise_u64(&self, f: impl Fn(usize, usize) -> u64 + Sync) -> Array2<u64> {
let n = self.n_cols();
let results: Vec<(usize, usize, u64)> = upper_pairs(n)
.into_par_iter()
.map(|(i, j)| (i, j, f(i, j)))
.collect();
fill_symmetric(n, results.into_iter().map(|(i, j, v)| (i, j, v, v)))
} }
pub(crate) fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> io::Result<()> { pub(crate) fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> io::Result<()> {
@@ -163,34 +127,93 @@ impl PackedBitMatrix {
(self.mmap[self.data_offsets[c] + (slot >> 3)] >> (slot & 7)) & 1 != 0 (self.mmap[self.data_offsets[c] + (slot >> 3)] >> (slot & 7)) & 1 != 0
}).collect() }).collect()
} }
fn col_bytes(&self, c: usize) -> &[u8] {
let start = self.data_offsets[c];
&self.mmap[start..start + self.n_rows.div_ceil(8)]
}
fn col_words(&self, c: usize) -> &[u64] {
let nw = self.n_rows.div_ceil(64);
// SAFETY: data_offsets[c] is always 8-byte aligned.
// PBMX header = 24 + n_cols×8 (multiple of 8); each PBIV blob =
// 16 + nwords×8 (multiple of 8); mmap base is page-aligned.
let ptr = self.mmap[self.data_offsets[c]..].as_ptr() as *const u64;
unsafe { std::slice::from_raw_parts(ptr, nw) }
}
pub(crate) fn col_slice(&self, c: usize) -> BitSliceView<'_> {
BitSliceView::new(self.col_words(c), self.n_rows)
}
pub(crate) fn col_persist(&self, c: usize, path: &Path) -> io::Result<PersistentBitVecBuilder> {
PersistentBitVecBuilder::from_raw_bytes(self.col_bytes(c), self.n_rows, path)
}
pub(crate) fn count_ones(&self) -> Array1<u64> {
Array1::from_vec(
(0..self.n_cols).into_par_iter()
.map(|c| self.col_slice(c).count_ones())
.collect()
)
}
pub(crate) fn partial_jaccard_dist_matrix(&self) -> (Array2<u64>, Array2<u64>) {
pairwise2_matrix(self.n_cols, |i, j| {
self.col_slice(i).partial_jaccard_dist(self.col_slice(j))
})
}
pub(crate) fn partial_hamming_dist_matrix(&self) -> Array2<u64> {
pairwise_matrix(self.n_cols, |i, j| {
self.col_slice(i).hamming_dist(self.col_slice(j))
})
}
} }
/// Build `presence/matrix.pbmx` from existing `col_*.pbiv` files. /// Build `presence/matrix.pbmx` from existing `col_*.pbiv` files.
pub fn pack_bit_matrix(dir: &Path) -> io::Result<()> { pub fn pack_bit_matrix(dir: &Path) -> io::Result<()> {
let packed_path = dir.join("matrix.pbmx");
if packed_path.exists() {
// Matrix complete; remove any leftover column files from a killed cleanup.
if let Ok(meta) = MatrixMeta::load(dir) {
for c in 0..meta.n_cols { let _ = fs::remove_file(col_path(dir, c)); }
let _ = fs::remove_file(dir.join("meta.json"));
}
return Ok(());
}
let meta = MatrixMeta::load(dir)?; let meta = MatrixMeta::load(dir)?;
let n_cols = meta.n_cols; let n_cols = meta.n_cols;
let col_files: Vec<Vec<u8>> = (0..n_cols) // Compute offsets from file sizes — no column data loaded into RAM.
.map(|c| fs::read(col_path(dir, c))) let col_sizes: Vec<u64> = (0..n_cols)
.map(|c| fs::metadata(col_path(dir, c)).map(|m| m.len()))
.collect::<io::Result<_>>()?; .collect::<io::Result<_>>()?;
let header_size = PBMX_HEADER + n_cols * 8; let header_size = (PBMX_HEADER + n_cols * 8) as u64;
let mut col_offset = header_size; let mut col_offset = header_size;
let mut offsets = Vec::with_capacity(n_cols); let mut offsets = Vec::with_capacity(n_cols);
for data in &col_files { for &size in &col_sizes {
offsets.push(col_offset as u64); offsets.push(col_offset);
col_offset += data.len(); col_offset += size;
} }
let packed_path = dir.join("matrix.pbmx"); // Write to a temp file; rename atomically so a killed process never leaves
let mut file = File::create(&packed_path)?; // a truncated matrix.pbmx that would be mistaken for a complete file.
file.write_all(&PBMX_MAGIC)?; let tmp_path = dir.join("matrix.pbmx.tmp");
file.write_all(&[0u8; 4])?; let mut out = BufWriter::new(File::create(&tmp_path)?);
file.write_all(&(meta.n as u64).to_le_bytes())?; out.write_all(&PBMX_MAGIC)?;
file.write_all(&(n_cols as u64).to_le_bytes())?; out.write_all(&[0u8; 4])?;
for &off in &offsets { file.write_all(&off.to_le_bytes())?; } out.write_all(&(meta.n as u64).to_le_bytes())?;
for data in &col_files { file.write_all(data)?; } out.write_all(&(n_cols as u64).to_le_bytes())?;
drop(file); for &off in &offsets { out.write_all(&off.to_le_bytes())?; }
for c in 0..n_cols {
io::copy(&mut File::open(col_path(dir, c))?, &mut out)?;
}
out.flush()?;
drop(out);
fs::rename(&tmp_path, &packed_path)?;
for c in 0..n_cols { fs::remove_file(col_path(dir, c))?; } for c in 0..n_cols { fs::remove_file(col_path(dir, c))?; }
fs::remove_file(dir.join("meta.json"))?; fs::remove_file(dir.join("meta.json"))?;
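Review note: the hunk above writes the packed matrix to `matrix.pbmx.tmp` and renames it into place, so a killed process cannot leave a truncated `matrix.pbmx`. A minimal standalone sketch of that write-then-rename pattern follows. The helper name `write_atomically` and the `sync_all` call are assumptions for illustration; the diff itself only flushes the `BufWriter` before renaming.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Hypothetical helper (not part of the diff): write `payload` to `dir/name`
/// through a sibling `.tmp` file, then rename it over the final path.
fn write_atomically(dir: &Path, name: &str, payload: &[u8]) -> io::Result<()> {
    let final_path = dir.join(name);
    let tmp_path = dir.join(format!("{name}.tmp"));

    let mut out = BufWriter::new(File::create(&tmp_path)?);
    out.write_all(payload)?;
    // Push buffered bytes to the file before it becomes visible under its final name.
    out.flush()?;
    // Assumption: also ask the OS to persist the data; the diff does not do this.
    out.get_ref().sync_all()?;
    drop(out);

    // A reader sees either no final file or a fully written one,
    // provided both paths live on the same filesystem.
    fs::rename(&tmp_path, &final_path)
}
```

If the process dies before the rename, only the `.tmp` file is left behind, and the "already packed" check at the top of `pack_bit_matrix` will not mistake it for a finished matrix.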
@@ -263,6 +286,24 @@ impl PersistentBitMatrix {
} }
} }
pub fn col_view(&self, c: usize) -> BitSliceView<'_> {
match self {
Self::Columnar(m) => m.col(c).view(),
Self::Packed(m) => m.col_slice(c),
Self::Implicit { .. } => panic!("col_view() not available on Implicit PersistentBitMatrix"),
}
}
pub fn col_persist(&self, c: usize, path: &Path) -> io::Result<PersistentBitVecBuilder> {
match self {
Self::Columnar(m) => PersistentBitVecBuilder::build_from(m.col(c), path),
Self::Packed(m) => m.col_persist(c, path),
Self::Implicit { n_rows, .. } => {
PersistentBitVecBuilder::new_ones(*n_rows, path)
}
}
}
pub fn row(&self, slot: usize) -> Box<[bool]> { pub fn row(&self, slot: usize) -> Box<[bool]> {
match self { match self {
Self::Columnar(m) => m.row(slot), Self::Columnar(m) => m.row(slot),
@@ -282,36 +323,34 @@ impl PersistentBitMatrix {
pub fn count_ones(&self) -> Array1<u64> { pub fn count_ones(&self) -> Array1<u64> {
match self { match self {
Self::Columnar(m) => m.count_ones(), Self::Columnar(m) => m.count_ones(),
_ => panic!("count_ones() only available on Columnar PersistentBitMatrix"), Self::Packed(m) => m.count_ones(),
} Self::Implicit { n_rows, n_cols } => Array1::from_elem(*n_cols, *n_rows as u64),
}
pub fn jaccard_dist_matrix(&self) -> Array2<f64> {
match self {
Self::Columnar(m) => m.jaccard_dist_matrix(),
_ => panic!("jaccard_dist_matrix() only available on Columnar PersistentBitMatrix"),
}
}
pub fn hamming_dist_matrix(&self) -> Array2<u64> {
match self {
Self::Columnar(m) => m.hamming_dist_matrix(),
_ => panic!("hamming_dist_matrix() only available on Columnar PersistentBitMatrix"),
} }
} }
pub fn partial_jaccard_dist_matrix(&self) -> (Array2<u64>, Array2<u64>) { pub fn partial_jaccard_dist_matrix(&self) -> (Array2<u64>, Array2<u64>) {
match self { match self {
Self::Columnar(m) => m.partial_jaccard_dist_matrix(), Self::Columnar(m) => m.partial_jaccard_dist_matrix(),
_ => panic!("partial_jaccard_dist_matrix() only available on Columnar PersistentBitMatrix"), Self::Packed(m) => m.partial_jaccard_dist_matrix(),
Self::Implicit { n_rows, n_cols } => {
let v = *n_rows as u64;
let n = *n_cols;
let mut inter = Array2::zeros((n, n));
let mut union = Array2::zeros((n, n));
for i in 0..n { for j in 0..n {
inter[[i, j]] = v; union[[i, j]] = v;
}}
(inter, union)
}
} }
} }
pub fn partial_hamming_dist_matrix(&self) -> Array2<u64> { pub fn partial_hamming_dist_matrix(&self) -> Array2<u64> {
match self { match self {
Self::Columnar(m) => m.partial_hamming_dist_matrix(), Self::Columnar(m) => m.partial_hamming_dist_matrix(),
_ => panic!("partial_hamming_dist_matrix() only available on Columnar PersistentBitMatrix"), Self::Packed(m) => m.partial_hamming_dist_matrix(),
Self::Implicit { n_cols, .. } => Array2::zeros((*n_cols, *n_cols)),
} }
} }
@@ -361,12 +400,93 @@ impl PersistentBitMatrixBuilder {
PersistentBitVecBuilder::new(self.n, &path) PersistentBitVecBuilder::new(self.n, &path)
} }
pub fn add_col_ones(&mut self) -> io::Result<PersistentBitVecBuilder> {
let path = col_path(&self.dir, self.n_cols);
self.n_cols += 1;
PersistentBitVecBuilder::new_ones(self.n, &path)
}
pub fn add_col_from(&mut self, src: &TempBitVec) -> io::Result<()> {
src.make_persistent(&col_path(&self.dir, self.n_cols))?;
self.n_cols += 1;
Ok(())
}
pub fn add_col_from_int(&mut self, src: &TempCompactIntVec) -> io::Result<()> {
let path = col_path(&self.dir, self.n_cols);
self.n_cols += 1;
let mut b = PersistentBitVecBuilder::new(self.n, &path)?;
b.or_where(src.view(), |v| v > 0);
b.close()
}
pub fn close(self) -> io::Result<()> { pub fn close(self) -> io::Result<()> {
MatrixMeta { n: self.n, n_cols: self.n_cols }.save(&self.dir) MatrixMeta { n: self.n, n_cols: self.n_cols }.save(&self.dir)
} }
} }
// ── Helpers ─────────────────────────────────────────────────────────────────── // ── MatrixGroupOps ────────────────────────────────────────────────────────────
impl MatrixGroupOps for PersistentBitMatrix {
fn partial_group_presence_count(&self, g: &ColGroup, _threshold: u32) -> io::Result<TempCompactIntVec> {
// Bit matrices store 0/1 — threshold is structurally always 1.
let n = self.n();
if g.indices.len() < 255 {
let mut builder = TempCompactIntVecBuilder::new(n)?;
for &c in &g.indices {
builder.inc_present_fast(self.col_view(c));
}
builder.freeze()
} else {
let mut result = TempCompactIntVecBuilder::new(n)?;
for chunk in g.indices.chunks(254) {
let mut chunk_b = TempCompactIntVecBuilder::new(n)?;
for &c in chunk {
chunk_b.inc_present_fast(self.col_view(c));
}
let frozen = chunk_b.freeze()?;
result.add(frozen.view());
}
result.freeze()
}
}
fn partial_group_sum(&self, g: &ColGroup) -> io::Result<TempCompactIntVec> {
// For bit matrices, sum = count of 1-bits — identical to presence_count.
self.partial_group_presence_count(g, 1)
}
fn partial_group_any(&self, g: &ColGroup, _threshold: u32) -> io::Result<TempBitVec> {
let n = self.n();
let mut result = TempBitVecBuilder::new(n)?;
for &c in &g.indices {
result.or(self.col_view(c));
}
result.freeze()
}
fn partial_group_min(&self, g: &ColGroup) -> io::Result<TempCompactIntVec> {
// min of 0/1 values = AND: 1 only if ALL columns are 1
let n = self.n();
let mut result = TempCompactIntVecBuilder::new(n)?;
if let Some((&first, rest)) = g.indices.split_first() {
result.inc_present_fast(self.col_view(first));
for &c in rest { result.mask_with(self.col_view(c)); }
}
result.freeze()
}
fn partial_group_max(&self, g: &ColGroup) -> io::Result<TempCompactIntVec> {
// max of 0/1 values = OR: 1 if any column is 1
let any = self.partial_group_any(g, 1)?;
let n = any.len();
let mut result = TempCompactIntVecBuilder::new(n)?;
result.inc_present(any.view());
result.freeze()
}
}
// ── Shared matrix helpers (also used by intmatrix.rs) ─────────────────────────
fn upper_pairs(n: usize) -> Vec<(usize, usize)> { fn upper_pairs(n: usize) -> Vec<(usize, usize)> {
(0..n).flat_map(|i| (i + 1..n).map(move |j| (i, j))).collect() (0..n).flat_map(|i| (i + 1..n).map(move |j| (i, j))).collect()
@@ -378,3 +498,30 @@ where T: Clone + Default {
for (i, j, vij, vji) in vals { m[[i, j]] = vij; m[[j, i]] = vji; } for (i, j, vij, vji) in vals { m[[i, j]] = vij; m[[j, i]] = vji; }
m m
} }
/// Compute a symmetric `n×n` matrix in parallel by evaluating `f(i,j)` for
/// all upper-triangle pairs. `T: Copy` avoids the `.clone()` needed for the
/// lower-triangle mirror.
pub(crate) fn pairwise_matrix<T>(n: usize, f: impl Fn(usize, usize) -> T + Sync) -> Array2<T>
where T: Copy + Default + Send {
let results: Vec<(usize, usize, T)> = upper_pairs(n)
.into_par_iter().map(|(i, j)| (i, j, f(i, j))).collect();
fill_symmetric(n, results.into_iter().map(|(i, j, v)| (i, j, v, v)))
}
/// Same as `pairwise_matrix` but `f` returns two values that fill two
/// symmetric matrices simultaneously (e.g. intersection + union for Jaccard).
pub(crate) fn pairwise2_matrix<T>(n: usize, f: impl Fn(usize, usize) -> (T, T) + Sync) -> (Array2<T>, Array2<T>)
where T: Copy + Default + Send {
let results: Vec<(usize, usize, T, T)> = upper_pairs(n)
.into_par_iter()
.map(|(i, j)| { let (a, b) = f(i, j); (i, j, a, b) })
.collect();
let mut m0 = Array2::from_elem((n, n), T::default());
let mut m1 = Array2::from_elem((n, n), T::default());
for (i, j, a, b) in results {
m0[[i, j]] = a; m0[[j, i]] = a;
m1[[i, j]] = b; m1[[j, i]] = b;
}
(m0, m1)
}
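Review note: `pairwise_matrix` and `pairwise2_matrix` evaluate `f(i, j)` only for the upper triangle and mirror each value into the lower one, leaving the diagonal at `T::default()`. The sketch below shows the same idea with the standard library alone. It is sequential and uses `Vec<Vec<T>>` in place of rayon and `ndarray::Array2`; both are simplifications, and `pairwise_matrix_sketch` and `hamming_words` are illustrative names, not functions from the crate.

```rust
/// All (i, j) pairs with i < j, mirroring `upper_pairs` in the diff.
fn upper_pairs(n: usize) -> Vec<(usize, usize)> {
    (0..n).flat_map(|i| (i + 1..n).map(move |j| (i, j))).collect()
}

/// Sequential stand-in for `pairwise_matrix`: fill a symmetric n×n table
/// from a function defined on upper-triangle index pairs.
fn pairwise_matrix_sketch<T: Copy + Default>(
    n: usize,
    f: impl Fn(usize, usize) -> T,
) -> Vec<Vec<T>> {
    let mut m = vec![vec![T::default(); n]; n];
    for (i, j) in upper_pairs(n) {
        let v = f(i, j);
        m[i][j] = v; // upper triangle
        m[j][i] = v; // mirrored lower triangle; `T: Copy` makes this free
    }
    m
}

/// Hamming distance between two equally long word slices.
fn hamming_words(a: &[u64], b: &[u64]) -> u64 {
    a.iter().zip(b).map(|(&x, &y)| (x ^ y).count_ones() as u64).sum()
}
```

For three one-word columns `0b1011`, `0b0110`, `0b1011`, calling `pairwise_matrix_sketch(3, |i, j| hamming_words(&cols[i], &cols[j]))` evaluates the closure three times rather than nine.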
+221 -179
@@ -5,29 +5,25 @@ use std::path::{Path, PathBuf};
use memmap2::{Mmap, MmapMut}; use memmap2::{Mmap, MmapMut};
use crate::reader::PersistentCompactIntVec; use crate::reader::PersistentCompactIntVec;
use crate::views::{BitSliceIter, BitSliceView, IntSliceView};
const MAGIC: [u8; 4] = *b"PBIV"; const MAGIC: [u8; 4] = *b"PBIV";
// Header: magic(4) + _pad(4) + n(8) = 16 bytes. // Header: magic(4) + _pad(4) + n(8) = 16 bytes.
// Data starts at offset 16, which is divisible by 8 → u64-aligned // Data starts at offset 16, u64-aligned (mmap base is page-aligned, 16 % 8 == 0).
// (mmap base is page-aligned, 16 % 8 == 0).
const HEADER_SIZE: usize = 16; const HEADER_SIZE: usize = 16;
#[inline] #[inline]
fn n_words(n: usize) -> usize { pub(crate) fn n_words(n: usize) -> usize { n.div_ceil(64) }
n.div_ceil(64)
}
#[inline] #[inline]
fn n_bytes_for_words(n: usize) -> usize { fn n_bytes_for_words(n: usize) -> usize { n_words(n) * 8 }
n_words(n) * 8
}
// ── Reader ──────────────────────────────────────────────────────────────────── // ── PersistentBitVec ──────────────────────────────────────────────────────────
pub struct PersistentBitVec { pub struct PersistentBitVec {
mmap: Mmap, mmap: Mmap,
n: usize, n: usize,
path: PathBuf, path: PathBuf,
} }
@@ -35,157 +31,145 @@ impl PersistentBitVec {
pub fn open(path: &Path) -> io::Result<Self> { pub fn open(path: &Path) -> io::Result<Self> {
let mmap = unsafe { Mmap::map(&File::open(path)?)? }; let mmap = unsafe { Mmap::map(&File::open(path)?)? };
if mmap.len() < HEADER_SIZE { if mmap.len() < HEADER_SIZE {
return Err(io::Error::new( return Err(io::Error::new(io::ErrorKind::InvalidData, "PBIV file too short"));
io::ErrorKind::InvalidData,
"PBIV file too short",
));
} }
if &mmap[0..4] != &MAGIC { if &mmap[0..4] != &MAGIC {
return Err(io::Error::new(io::ErrorKind::InvalidData, "bad PBIV magic")); return Err(io::Error::new(io::ErrorKind::InvalidData, "bad PBIV magic"));
} }
let n = u64::from_le_bytes(mmap[8..16].try_into().unwrap()) as usize; let n = u64::from_le_bytes(mmap[8..16].try_into().unwrap()) as usize;
Ok(Self { Ok(Self { mmap, n, path: path.to_path_buf() })
mmap,
n,
path: path.to_path_buf(),
})
} }
pub fn path(&self) -> &Path { pub fn path(&self) -> &Path { &self.path }
&self.path pub fn len(&self) -> usize { self.n }
} pub fn is_empty(&self) -> bool { self.n == 0 }
pub fn len(&self) -> usize {
self.n
}
pub fn is_empty(&self) -> bool {
self.n == 0
}
pub fn get(&self, slot: usize) -> bool { pub fn get(&self, slot: usize) -> bool {
(self.mmap[HEADER_SIZE + (slot >> 3)] >> (slot & 7)) & 1 != 0 (self.mmap[HEADER_SIZE + (slot >> 3)] >> (slot & 7)) & 1 != 0
} }
// Used by iter() and get(): exact byte window, no padding. // SAFETY: mmap is page-aligned, HEADER_SIZE=16 divisible by 8 → u64-aligned.
fn data_bytes(&self) -> &[u8] {
&self.mmap[HEADER_SIZE..HEADER_SIZE + self.n.div_ceil(8)]
}
// Bulk word view. SAFETY: mmap is page-aligned, HEADER_SIZE=16 is divisible by 8,
// so &mmap[HEADER_SIZE] is u64-aligned. Slice length is n_words * 8 bytes.
fn data_words(&self) -> &[u64] { fn data_words(&self) -> &[u64] {
let nw = n_words(self.n); let nw = n_words(self.n);
let ptr = self.mmap[HEADER_SIZE..].as_ptr() as *const u64; let ptr = self.mmap[HEADER_SIZE..].as_ptr() as *const u64;
unsafe { std::slice::from_raw_parts(ptr, nw) } unsafe { std::slice::from_raw_parts(ptr, nw) }
} }
pub fn count_ones(&self) -> u64 { pub fn view(&self) -> BitSliceView<'_> {
// Padding bits in the last word are 0, so no masking needed. BitSliceView::new(self.data_words(), self.n)
self.data_words()
.iter()
.map(|w| w.count_ones() as u64)
.sum()
} }
pub fn count_zeros(&self) -> u64 { pub fn words(&self) -> &[u64] { self.data_words() }
self.n as u64 - self.count_ones()
}
pub fn jaccard_dist(&self, other: &PersistentBitVec) -> f64 { pub fn count_ones(&self) -> u64 { self.view().count_ones() }
let (inter, union) = self.partial_jaccard_dist(other); pub fn count_zeros(&self) -> u64 { self.view().count_zeros() }
if union == 0 {
return 0.0;
}
1.0 - inter as f64 / union as f64
}
pub fn partial_jaccard_dist(&self, other: &PersistentBitVec) -> (u64, u64) { pub fn partial_jaccard_dist(&self, other: &PersistentBitVec) -> (u64, u64) {
assert_eq!(self.n, other.n, "length mismatch"); self.view().partial_jaccard_dist(other.view())
self.data_words() }
.iter() pub fn jaccard_dist(&self, other: &PersistentBitVec) -> f64 {
.zip(other.data_words()) self.view().jaccard_dist(other.view())
.fold((0u64, 0u64), |(i, u), (&a, &b)| {
(
i + (a & b).count_ones() as u64,
u + (a | b).count_ones() as u64,
)
})
} }
pub fn hamming_dist(&self, other: &PersistentBitVec) -> u64 { pub fn hamming_dist(&self, other: &PersistentBitVec) -> u64 {
assert_eq!(self.n, other.n, "length mismatch"); self.view().hamming_dist(other.view())
self.data_words()
.iter()
.zip(other.data_words())
.map(|(&a, &b)| (a ^ b).count_ones() as u64)
.sum()
} }
pub fn iter(&self) -> BitIter<'_> { pub fn iter(&self) -> BitIter<'_> {
BitIter { BitIter { words: self.data_words(), slot: 0, n: self.n }
bytes: self.data_bytes(),
slot: 0,
n: self.n,
}
} }
} }
impl<'a> IntoIterator for &'a PersistentBitVec { impl<'a> IntoIterator for &'a PersistentBitVec {
type Item = bool; type Item = bool;
type IntoIter = BitIter<'a>; type IntoIter = BitIter<'a>;
fn into_iter(self) -> BitIter<'a> { fn into_iter(self) -> BitIter<'a> { self.iter() }
self.iter()
}
} }
// ── BitIter ───────────────────────────────────────────────────────────────────
pub struct BitIter<'a> { pub struct BitIter<'a> {
bytes: &'a [u8], words: &'a [u64],
slot: usize, slot: usize,
n: usize, n: usize,
} }
impl ExactSizeIterator for BitIter<'_> {} impl ExactSizeIterator for BitIter<'_> {}
impl Iterator for BitIter<'_> { impl Iterator for BitIter<'_> {
type Item = bool; type Item = bool;
fn next(&mut self) -> Option<bool> { fn next(&mut self) -> Option<bool> {
if self.slot >= self.n { if self.slot >= self.n { return None; }
return None; let v = (self.words[self.slot >> 6] >> (self.slot & 63)) & 1 != 0;
}
let v = (self.bytes[self.slot >> 3] >> (self.slot & 7)) & 1 != 0;
self.slot += 1; self.slot += 1;
Some(v) Some(v)
} }
fn size_hint(&self) -> (usize, Option<usize>) { fn size_hint(&self) -> (usize, Option<usize>) {
let rem = self.n - self.slot; let rem = self.n - self.slot;
(rem, Some(rem)) (rem, Some(rem))
} }
} }
// ── Builder ─────────────────────────────────────────────────────────────────── // ── PersistentBitVecBuilder ───────────────────────────────────────────────────
pub struct PersistentBitVecBuilder { pub struct PersistentBitVecBuilder {
mmap: MmapMut, mmap: MmapMut,
n: usize, n: usize,
path: PathBuf,
} }
impl PersistentBitVecBuilder { impl PersistentBitVecBuilder {
pub fn new(n: usize, path: &Path) -> io::Result<Self> { pub fn new(n: usize, path: &Path) -> io::Result<Self> {
let file_size = HEADER_SIZE + n_bytes_for_words(n); let file_size = HEADER_SIZE + n_bytes_for_words(n);
let mut file = OpenOptions::new() let mut file = OpenOptions::new()
.read(true) .read(true).write(true).create(true).truncate(true)
.write(true)
.create(true)
.truncate(true)
.open(path)?; .open(path)?;
file.write_all(&MAGIC)?; file.write_all(&MAGIC)?;
file.write_all(&[0u8; 4])?; // padding file.write_all(&[0u8; 4])?;
file.write_all(&(n as u64).to_le_bytes())?; file.write_all(&(n as u64).to_le_bytes())?;
file.seek(SeekFrom::Start(0))?; file.seek(SeekFrom::Start(0))?;
file.set_len(file_size as u64)?; file.set_len(file_size as u64)?;
let mmap = unsafe { MmapMut::map_mut(&file)? }; let mmap = unsafe { MmapMut::map_mut(&file)? };
Ok(Self { mmap, n }) Ok(Self { mmap, n, path: path.to_path_buf() })
}
pub fn from_raw_bytes(bytes: &[u8], n: usize, path: &Path) -> io::Result<Self> {
let file_size = HEADER_SIZE + n_bytes_for_words(n);
let file = OpenOptions::new()
.read(true).write(true).create(true).truncate(true)
.open(path)?;
file.set_len(file_size as u64)?;
let mut mmap = unsafe { MmapMut::map_mut(&file)? };
mmap[0..4].copy_from_slice(&MAGIC);
mmap[8..16].copy_from_slice(&(n as u64).to_le_bytes());
mmap[HEADER_SIZE..HEADER_SIZE + bytes.len()].copy_from_slice(bytes);
Ok(Self { mmap, n, path: path.to_path_buf() })
}
/// Create an all-ones bit vector of length `n` at `path`.
///
/// More efficient than `new(n, path)` + `not()`: the data is written as
/// 0xFF bytes in a single sequential pass, with no intermediate all-zeros state.
pub fn new_ones(n: usize, path: &Path) -> io::Result<Self> {
let nw = n_words(n);
let file_size = HEADER_SIZE + nw * 8;
let mut file = OpenOptions::new()
.read(true).write(true).create(true).truncate(true)
.open(path)?;
file.write_all(&MAGIC)?;
file.write_all(&[0u8; 4])?;
file.write_all(&(n as u64).to_le_bytes())?;
file.write_all(&vec![0xFFu8; nw * 8])?;
file.seek(SeekFrom::Start(0))?;
file.set_len(file_size as u64)?;
let mut mmap = unsafe { MmapMut::map_mut(&file)? };
// Clear padding bits in the last word so trailing bits are always 0.
let rem = n % 64;
if rem != 0 {
let ptr = mmap[HEADER_SIZE..].as_mut_ptr() as *mut u64;
let words = unsafe { std::slice::from_raw_parts_mut(ptr, nw) };
words[nw - 1] &= (1u64 << rem) - 1;
}
Ok(Self { mmap, n, path: path.to_path_buf() })
} }
pub fn build_from(source: &PersistentBitVec, path: &Path) -> io::Result<Self> { pub fn build_from(source: &PersistentBitVec, path: &Path) -> io::Result<Self> {
@@ -193,86 +177,14 @@ impl PersistentBitVecBuilder {
let file = OpenOptions::new().read(true).write(true).open(path)?; let file = OpenOptions::new().read(true).write(true).open(path)?;
let mmap = unsafe { MmapMut::map_mut(&file)? }; let mmap = unsafe { MmapMut::map_mut(&file)? };
let n = source.len(); let n = source.len();
Ok(Self { mmap, n }) Ok(Self { mmap, n, path: path.to_path_buf() })
} }
pub fn len(&self) -> usize { pub fn build_from_counts(source: &PersistentCompactIntVec, threshold: u32, path: &Path) -> io::Result<Self> {
self.n
}
pub fn is_empty(&self) -> bool {
self.n == 0
}
pub fn get(&self, slot: usize) -> bool {
(self.mmap[HEADER_SIZE + (slot >> 3)] >> (slot & 7)) & 1 != 0
}
pub fn set(&mut self, slot: usize, value: bool) {
let byte = HEADER_SIZE + (slot >> 3);
let bit = 1u8 << (slot & 7);
if value {
self.mmap[byte] |= bit;
} else {
self.mmap[byte] &= !bit;
}
}
// SAFETY: same alignment argument as PersistentBitVec::data_words.
fn data_words_mut(&mut self) -> &mut [u64] {
let nw = n_words(self.n);
let ptr = self.mmap[HEADER_SIZE..].as_mut_ptr() as *mut u64;
unsafe { std::slice::from_raw_parts_mut(ptr, nw) }
}
pub fn and(&mut self, other: &PersistentBitVec) {
assert_eq!(self.n, other.n, "length mismatch");
for (sw, &ow) in self.data_words_mut().iter_mut().zip(other.data_words()) {
*sw &= ow;
}
}
pub fn or(&mut self, other: &PersistentBitVec) {
assert_eq!(self.n, other.n, "length mismatch");
for (sw, &ow) in self.data_words_mut().iter_mut().zip(other.data_words()) {
*sw |= ow;
}
}
pub fn xor(&mut self, other: &PersistentBitVec) {
assert_eq!(self.n, other.n, "length mismatch");
for (sw, &ow) in self.data_words_mut().iter_mut().zip(other.data_words()) {
*sw ^= ow;
}
}
pub fn not(&mut self) {
let rem = self.n % 64;
let words = self.data_words_mut();
for w in words.iter_mut() {
*w ^= u64::MAX;
}
// Zero padding bits in the last word so count_ones / jaccard remain correct.
if rem != 0 {
if let Some(last) = words.last_mut() {
*last &= (1u64 << rem) - 1;
}
}
}
/// Convert a count vector to a bit vector: bit set iff count >= threshold.
/// Fills u64 words directly from the count iterator — O(n), no bit-level set() overhead.
pub fn build_from_counts(
source: &PersistentCompactIntVec,
threshold: u32,
path: &Path,
) -> io::Result<Self> {
let n = source.len(); let n = source.len();
let file_size = HEADER_SIZE + n_bytes_for_words(n); let file_size = HEADER_SIZE + n_bytes_for_words(n);
let mut file = OpenOptions::new() let mut file = OpenOptions::new()
.read(true) .read(true).write(true).create(true).truncate(true)
.write(true)
.create(true)
.truncate(true)
.open(path)?; .open(path)?;
file.write_all(&MAGIC)?; file.write_all(&MAGIC)?;
file.write_all(&[0u8; 4])?; file.write_all(&[0u8; 4])?;
@@ -280,27 +192,157 @@ impl PersistentBitVecBuilder {
file.seek(SeekFrom::Start(0))?; file.seek(SeekFrom::Start(0))?;
file.set_len(file_size as u64)?; file.set_len(file_size as u64)?;
let mut mmap = unsafe { MmapMut::map_mut(&file)? }; let mut mmap = unsafe { MmapMut::map_mut(&file)? };
{ {
let nw = n_words(n); let nw = n_words(n);
let ptr = mmap[HEADER_SIZE..].as_mut_ptr() as *mut u64; let ptr = mmap[HEADER_SIZE..].as_mut_ptr() as *mut u64;
let words = unsafe { std::slice::from_raw_parts_mut(ptr, nw) }; let words = unsafe { std::slice::from_raw_parts_mut(ptr, nw) };
for (slot, count) in source.iter().enumerate() { for (slot, count) in source.iter().enumerate() {
if count >= threshold { if count >= threshold { words[slot >> 6] |= 1u64 << (slot & 63); }
words[slot >> 6] |= 1u64 << (slot & 63);
}
} }
} }
Ok(Self { mmap, n, path: path.to_path_buf() })
Ok(Self { mmap, n })
} }
/// Convert a count vector to a presence/absence bit vector (threshold = 1).
pub fn build_from_presence(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self> { pub fn build_from_presence(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self> {
Self::build_from_counts(source, 1, path) Self::build_from_counts(source, 1, path)
} }
pub fn close(self) -> io::Result<()> { pub fn len(&self) -> usize { self.n }
self.mmap.flush() pub fn is_empty(&self) -> bool { self.n == 0 }
pub fn get(&self, slot: usize) -> bool {
(self.mmap[HEADER_SIZE + (slot >> 3)] >> (slot & 7)) & 1 != 0
}
pub fn set(&mut self, slot: usize, value: bool) {
let bit = 1u64 << (slot & 63);
if value { self.data_words_mut()[slot >> 6] |= bit; }
else { self.data_words_mut()[slot >> 6] &= !bit; }
}
fn data_words(&self) -> &[u64] {
let nw = n_words(self.n);
let ptr = self.mmap[HEADER_SIZE..].as_ptr() as *const u64;
unsafe { std::slice::from_raw_parts(ptr, nw) }
}
// SAFETY: same alignment argument as PersistentBitVec::data_words.
fn data_words_mut(&mut self) -> &mut [u64] {
let nw = n_words(self.n);
let ptr = self.mmap[HEADER_SIZE..].as_mut_ptr() as *mut u64;
unsafe { std::slice::from_raw_parts_mut(ptr, nw) }
}
pub fn view(&self) -> BitSliceView<'_> {
BitSliceView::new(self.data_words(), self.n)
}
pub fn words(&self) -> &[u64] { self.data_words() }
pub fn copy_from(&mut self, src: BitSliceView<'_>) {
assert_eq!(self.n, src.len(), "BitSliceView length mismatch");
self.data_words_mut().copy_from_slice(src.words());
}
pub fn and(&mut self, other: BitSliceView<'_>) {
assert_eq!(self.n, other.len(), "BitSliceView length mismatch");
for (w, &o) in self.data_words_mut().iter_mut().zip(other.words()) { *w &= o; }
}
pub fn or(&mut self, other: BitSliceView<'_>) {
assert_eq!(self.n, other.len(), "BitSliceView length mismatch");
for (w, &o) in self.data_words_mut().iter_mut().zip(other.words()) { *w |= o; }
}
pub fn xor(&mut self, other: BitSliceView<'_>) {
assert_eq!(self.n, other.len(), "BitSliceView length mismatch");
for (w, &o) in self.data_words_mut().iter_mut().zip(other.words()) { *w ^= o; }
}
pub fn not(&mut self) {
let rem = self.n % 64;
let words = self.data_words_mut();
for w in words.iter_mut() { *w ^= u64::MAX; }
if rem != 0 {
if let Some(last) = words.last_mut() { *last &= (1u64 << rem) - 1; }
}
}
/// OR in bits at slots where `pred(col[slot])` is true.
pub fn or_where(&mut self, col: IntSliceView<'_>, pred: impl Fn(u32) -> bool) {
assert_eq!(self.n, col.len(), "IntSliceView length mismatch");
let n = self.n;
let primary = col.primary_bytes();
let words = self.data_words_mut();
let nw = n_words(n);
for wi in 0..nw {
let base = wi * 64;
let limit = (base + 64).min(n);
let mut mask = 0u64;
for bit in 0..(limit - base) {
let b = primary[base + bit];
if b < 255 && pred(b as u32) { mask |= 1u64 << bit; }
}
words[wi] |= mask;
}
for (slot, val) in col.overflow_entries() {
if pred(val) { words[slot >> 6] |= 1u64 << (slot & 63); }
}
}
/// Clear bits at slots where `pred(col[slot])` is false.
pub fn and_where(&mut self, col: IntSliceView<'_>, pred: impl Fn(u32) -> bool) {
assert_eq!(self.n, col.len(), "IntSliceView length mismatch");
let n = self.n;
let primary = col.primary_bytes();
let words = self.data_words_mut();
let nw = n_words(n);
for wi in 0..nw {
let base = wi * 64;
let limit = (base + 64).min(n);
let mut mask = 0u64;
for bit in 0..(limit - base) {
let b = primary[base + bit];
if b < 255 && !pred(b as u32) { mask |= 1u64 << bit; }
}
words[wi] &= !mask;
}
for (slot, val) in col.overflow_entries() {
if !pred(val) { words[slot >> 6] &= !(1u64 << (slot & 63)); }
}
}
/// Toggle bits at slots where `pred(col[slot])` is true.
pub fn xor_where(&mut self, col: IntSliceView<'_>, pred: impl Fn(u32) -> bool) {
assert_eq!(self.n, col.len(), "IntSliceView length mismatch");
let n = self.n;
let primary = col.primary_bytes();
let words = self.data_words_mut();
let nw = n_words(n);
for wi in 0..nw {
let base = wi * 64;
let limit = (base + 64).min(n);
let mut mask = 0u64;
for bit in 0..(limit - base) {
let b = primary[base + bit];
if b < 255 && pred(b as u32) { mask |= 1u64 << bit; }
}
words[wi] ^= mask;
}
for (slot, val) in col.overflow_entries() {
if pred(val) { words[slot >> 6] ^= 1u64 << (slot & 63); }
}
}
pub fn iter(&self) -> BitSliceIter<'_> {
self.view().iter()
}
pub fn close(self) -> io::Result<()> { self.mmap.flush() }
pub fn finish(self) -> io::Result<PersistentBitVec> {
let path = self.path.clone();
self.close()?;
PersistentBitVec::open(&path)
} }
} }
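Review note: `new_ones` and `not()` both clear the padding bits of the last word, which is what lets `count_ones` and the Jaccard counts work on whole `u64` words without masking. The sketch below illustrates that invariant on a plain `Vec<u64>`, without the mmap-backed file or the PBIV header. The function names are illustrative, not the crate's API.

```rust
/// Number of 64-bit words needed for `n` bits (same formula as the diff).
fn n_words(n: usize) -> usize {
    n.div_ceil(64)
}

/// In-memory analogue of `new_ones`: all `n` bits set, padding bits cleared.
fn all_ones_words(n: usize) -> Vec<u64> {
    let mut words = vec![u64::MAX; n_words(n)];
    let rem = n % 64;
    if rem != 0 {
        if let Some(last) = words.last_mut() {
            // Keep only the low `rem` bits of the last word.
            *last &= (1u64 << rem) - 1;
        }
    }
    words
}

/// Popcount over whole words; correct only because padding bits are zero.
fn count_ones(words: &[u64]) -> u64 {
    words.iter().map(|w| w.count_ones() as u64).sum()
}

/// (intersection, union) bit counts, the two halves of a Jaccard distance.
fn partial_jaccard(a: &[u64], b: &[u64]) -> (u64, u64) {
    a.iter().zip(b).fold((0u64, 0u64), |(i, u), (&x, &y)| {
        (i + (x & y).count_ones() as u64, u + (x | y).count_ones() as u64)
    })
}
```

With `n = 70` the buffer holds two words and the second keeps only its low six bits, so `count_ones` returns 70 rather than 128.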
