Replace trait-based API documentation with concrete, zero-copy view structs and update all associated diagrams. Refine algorithmic descriptions for sentinel handling, overflow stores, and bulk operations. Clarify temporary file lifecycles and group-chunking strategies to support memory-efficient parallel aggregation.
Refactors column and matrix access to use unified `BitSliceView` and `IntSliceView` abstractions, replacing legacy `PackedCol`/`IntColView` types. Introduces `BitSlice`/`IntSlice` traits for zero-copy, trait-based bitwise and arithmetic operations across persistent and temporary vector types. Removes deprecated in-memory `MemoryBitVec` and `MemoryIntVec` implementations and their tests, while updating dependent crates to use the new view-based API and `BitSliceMut` trait.
Introduces `TempCompactIntVec` and `TempBitVec` as temporary, file-backed intermediates to replace eager in-memory vectors, enabling OS-level paging under memory pressure. Updates the `MatrixGroupOps` trait to return `io::Result` types, allowing proper error propagation and supporting chunked accumulation for large column groups. Includes builder patterns with `.freeze()` finalization, automatic `TempDir` cleanup on drop, and necessary test updates to handle the new fallible signatures. Also fixes `Cargo.toml` section ordering.
Introduce the `ColGroup` struct and `MatrixGroupOps` trait to manage named subsets of column indices and perform additive aggregations (count, sum, any). Implement these operations for `PersistentBitMatrix` and `PersistentCompactIntMatrix`, applying size-optimized branches for presence counts and direct accumulation for small groups. Additionally, add a `mask_with` trait method that efficiently zero-sets elements based on a mask, optimized for sparse masks with O(n_zeros) complexity. Include comprehensive tests covering overflow handling, slot masking, and result additivity across partitioned data.
This commit introduces `BitColView` and `IntColView` to abstract over Columnar and Packed storage formats, implementing `BitSlice` and `IntSlice` for uniform column access. It adds `col_view()` accessors to `PersistentBitMatrix` and `PackedCompactIntMatrix`, explicitly panicking on implicit variants. The new types are publicly re-exported, and unit tests are added to validate per-element retrieval, aggregation methods, and parity with the original columnar representation.
Add Mermaid diagrams to visualize the trait hierarchy, compact int storage layout, and SIMD-vectorizable arithmetic operations for MemoryIntVec and PersistentCompactIntVec. Also document concrete type structures and planned layer/partition composition rules to improve documentation clarity.
Document the two-tier compact integer encoding, BitSlice/IntSlice trait hierarchy, and SIMD-friendly O(n+k) algorithms. Include details on concrete memory and persistent vector types, matrix aggregation traits, and planned group-filtering APIs.
Refactor `cmp_scalar`, `min`, `max`, `add`, and `diff` to operate directly on the primary byte array, deferring overflow slot resolution to a secondary pass. This eliminates HashMap lookups in the hot path and enables SIMD vectorization. Add six unit tests to validate correct promotion and demotion between storage slots when values cross the 255 threshold.
Implemented `sum()`, `count_nonzero()`, and `iter()` to complete the numeric vector interface. The builder now computes aggregate values across memory-mapped regions and overflow entries, while the reader delegates these operations to its inherent methods. The iterator provides zero-copy access to underlying `u32` elements.
Refactor `BitIter` to process `u64` chunks using word-aligned shifts instead of byte-level operations. Introduce a dedicated `MemoryBitIter` for `MemoryBitVec`, updating its `iter()` and `IntoIterator` implementations accordingly. Hide `MemoryBitIter` from the public API to narrow the crate's interface, while leveraging explicit alignment guarantees for safer and more efficient bit extraction.
Centralizes overflow handling and improves modularity by replacing manual mmap indexing and row loops with composable iterator patterns. This change leverages Rust's iterator traits for efficient, idiomatic column traversal while encapsulating data access in a dedicated view struct.
Refactors bit manipulation and distance calculations to leverage standardized `BitSlice` traits, replacing manual byte/word logic with safer, reusable methods. Extends `IntSlice` and `IntSliceMut` traits to expose direct memory-mapped access and overflow management, enabling efficient bulk data extraction and serialization. Replaces manual bit-shifting loops with optimized table-based unpacking and adds population count and distance metric methods for improved performance. Updates `PersistentBitVecBuilder` with file tracking and safe flushing, and aligns test imports with new trait bounds.
Introduce `byte_sum` and `byte_count_nonzero` to efficiently aggregate compact-int byte slices, bypassing per-element decoding and overflow map lookups. Refactor `sum()` and `count_nonzero()` across the matrix, reader, and traits modules to use direct memory-mapped slice iteration and idiomatic Rust iterators. Additionally, expose `MemoryIntIter` publicly and implement `IntoIterator` and `IntSlice` for `MemoryIntVec` to enable standard iteration and delegate aggregation to the new helpers.
Refactor parallel matrix construction by extracting reusable `pairwise_matrix` and `pairwise2_matrix` helpers, and consolidate binary record deserialization into dedicated parsing functions. Add `set` and `iter` methods to `BitSliceMut` and `MemoryBitVec` for ergonomic bit manipulation and iteration. Standardize JSON field extraction via `meta::field`, expose `MemoryBitIter`, and improve test reliability by automatically cleaning up temporary directories.
Introduce `MemoryBitVec` and `MemoryIntVec` for efficient in-memory storage with hybrid compression and overflow handling. Implement `BitSlice`, `BitSliceMut`, `IntSlice`, and `IntSliceMut` traits across persistent and memory-backed types to enable generic slice operations and bitwise/arithmetic overloads. Add `col_persist` and `col_as_memory` methods to `BitMatrix` and `IntMatrix` for efficient column extraction. Align with the new single-pass rebuild architecture by supporting fast kmer filtering and matrix rebuilding. Includes comprehensive tests and profiling instrumentation for the packing phase.
Replaces `rayon` parallel iteration across index, rebuild, reindex, and select modules with a custom `PartitionRunner`. This introduces NUMA-aware task distribution with CPU pinning and round-robin scheduling, eliminating `Arc`, `Mutex`, and atomic synchronization primitives in favor of a flat, pre-spawned worker architecture. Error handling is simplified via `.map_err()` and the `?` operator, while progress bar updates are decoupled into dedicated callbacks.
This commit restructures the partitioner crate by extracting shared utilities and the `ColBuilder` enum into a new `common` module. It introduces a multi-phase `graph_pipeline` for constructing and materializing De Bruijn graphs, replacing manual graph construction in `index_layer`, `merge_layer`, and `rebuild_layer`. All layer workflows now use centralized `build_graph` and `materialize_layer` abstractions, with standardized error context strings for improved diagnostics.
Replaces the previous partition processing logic with an adaptive, NUMA-aware multi-threaded worker pool that dynamically scales active threads based on real-time CPU efficiency. Introduces pre-spawned, CPU-pinned threads managed via crossbeam channels and Rayon to optimize memory bandwidth and core utilization. Adds a `max_workers()` accessor to aggregate maximum worker capacity across NUMA nodes and updates diagnostics to report active versus maximum worker counts.
Replace NUMA-naive Rayon loops and ad-hoc adaptive pools with a unified `PartitionRunner` that manages a NUMA-aware worker pool. The implementation uses pinned Rayon thread pools per node and activates dormant threads based on real-time CPU efficiency metrics. This standardizes partition-level parallelism, optimizes memory locality, and eliminates cross-socket traffic. Includes architecture documentation and updates mkdocs navigation.
Introduces a counting semaphore-based throttling mechanism to limit concurrent file I/O and pipeline processing. Replaces custom path wrappers with standardized `Throttled` types across `obikmer` and `obikpartitionner`, ensuring RAII-based resource cleanup and explicit backpressure. Additionally, documents how to redirect Cargo build artifacts to local scratch storage on HPC filesystems to prevent compilation slowdowns.
Replaces the global Rayon pool with per-NUMA-node thread pools that pin worker threads to their respective nodes, leveraging Linux first-touch allocation to reduce cross-NUMA memory contention and improve cache locality. Integrates the `hwlocality` crate with a vendored build, includes graceful fallbacks for single-socket or non-Linux systems, and updates dependency constraints. Also adds installation and architecture documentation, and corrects parallelism detection in the partitioner.
Centralize inline spawn checks into a `should_spawn_worker` function with adaptive thresholds. The first worker spawns at <95% CPU efficiency, while subsequent workers only trigger if marginal efficiency gain exceeds 25% of the expected `1/n_workers` (minimum 3%). Also increases the spawn poll interval from 10s to 20s.
Updated memory guidelines and feedback docs to explicitly classify intra-partition phases as parallel, correcting prior assumptions of sequential execution. Refactored merge_layer.rs to wrap column builders in Arc<Mutex<ColBuilder>> and use Arc::try_unwrap for safe concurrent access, eliminating race conditions and preventing double-closes during pass2.
This change enhances observability and adaptability in the merge pipeline. Performance timing and debug logging are added to the De Bruijn graph and partition merge layers to track phase durations and pipeline metrics. The merge module replaces blocking receives with timed polls to sample CPU efficiency, dynamically spawning workers when utilization drops below a threshold. A new script is also introduced to parse merge debug logs and generate structured Markdown reports detailing throughput, phase breakdowns, and partition performance.
Applies consistent formatting, whitespace normalization, and indentation standardization to `debruijn.rs` and `merge.rs`. Reorganizes imports and downgrades a unitig traversal log from `info!` to `debug!`. No functional logic or runtime behavior is altered.
Introduce CpuSample to measure process-level CPU efficiency and wall-clock time. Use crossbeam-channel to distribute partition merging tasks to a dynamic worker pool that scales based on CPU utilization, capped at half the available cores. Update diagnostics to track pool usage.
Refactored `try_for_each_unitig` and related pipelines across `obidebruinj` and `obikpartitionner` to accept `Unitig` instances directly. This eliminates manual `Unitig::from_nucleotides()` conversions, simplifies the data flow, and reduces unnecessary allocation overhead.
Add `rayon` and `crossbeam-channel` dependencies to support concurrent execution. Replace the synchronous, mutex-protected closure pattern with a channel-based producer-consumer approach using `std::thread::scope`. This decouples unitig iteration from processing, eliminating lock contention and `Mutex` overhead while enabling parallel workloads.
Introduce a thread-local buffer to materialize nucleotide iterators into contiguous slices. Update `try_for_each_unitig` across the debruijn, index, merge, and rebuild layers to directly instantiate `Unitig` via `from_nucleotides()` instead of explicitly collecting iterators. This eliminates intermediate allocations and aligns test code with the new approach.
Refactored neighbor resolution to explicitly track unvisited indices for degree-1 nodes, updated display formatting, and added timing and debug logging to the degree computation routine. Migrated pipeline stages from eager vector returns to explicit flat implementations, enabling backpressure-aware streaming, configurable batch processing, incremental yielding, and progress tracking via a delta channel.
Replace eager resolution of `Vec<u32>` values with an `Arc<SrcLayerData>` handle passed alongside `Vec<CanonicalKmer>`. This shifts the lookup logic to the subsequent transform step, reducing memory overhead and enabling shared, thread-safe access to the source layer data.
Replaces the synchronous sequential loop with a multi-threaded pipeline using `WorkerPool` and custom stage macros. Shared mutable state is wrapped in `Arc<Mutex<>>` for thread-safe updates, while pipeline errors are centralized via `Arc<Mutex<Option<String>>>` to propagate failures before thread join.
Introduce the `obipipeline` dependency and refactor merge and partition logic to leverage parallel execution. Update `merge_partitions` to use rayon with dynamic memory budgeting and concurrency control via a pilot run. Refactor Pass 1 to concurrently read unitigs, filter kmers through a shared `LayeredMap`, and populate the graph safely. Simplify diagnostics to report total kmer counts and replace manual flags with graph length validation.
Introduce a `peak_rss_bytes()` utility for accurate per-phase RAM measurement. Replace the genome-length heuristic with a dynamic seed expansion ratio based on actual RSS delta. Explicitly drop the `GraphDeBruijn` instance before MPHF construction to prevent resource contention and ensure proper memory management.
This commit improves memory management by respecting Linux cgroup v1/v2 limits and introduces a configurable memory budget for the new `rebuild` subcommand to prevent OOM during index reconstruction. The rebuild process now supports filtering, compaction, and parallelization. Diagnostic capabilities are expanded with debug-level tracing for partition merges, k-mer expansion tracking, and utility flags for label renaming, matrix size breakdowns, per-genome counts, and partition distribution reporting. Accessor methods for active and remaining memory have also been added to the stats struct.
Introduces a memory-aware scheduling strategy for parallel partition merging that replaces unbounded concurrency with a First-Fit Decreasing approach gated by a thread-safe `MemoryBudget` semaphore. An adaptive expansion factor, seeded by a sequential pilot run, dynamically caps concurrent workers to prevent hashbrown OOMs. Adds a `--budget-fraction` CLI flag to configure RAM allocation, enhances the CLI to accept multiple indexes, and introduces comprehensive partition diagnostics including memory utilization tracking, concurrency metrics, and statistical summaries with ASCII histograms. Updates documentation and navigation accordingly.
Introduces allocation-free `sum()` and `count_nonzero()` methods for compact integer vectors, extending the `ColumnWeights` trait with `partial_kmer_counts`. Adds parallel partition scanning to the k-mer index for computing per-genome distinct k-mer counts, and exposes a new `--stats` CLI flag to output these statistics as CSV.
Rename the `rebuild` CLI subcommand to `filter` to better reflect its primary purpose of row-level selection and k-mer filtering. Update all associated CLI arguments, logging, error messages, and module registrations accordingly. Introduce a dedicated `Rebuild` subcommand for index compaction, fully decoupling it from the filtering logic. Also refine related documentation to align with the new naming and semantics.
Introduces the `select` CLI command to project and aggregate genome-level k-mer data by column. Adds `filter` as an alias for `rebuild`. The implementation uses parallel partition processing, supports metadata-driven grouping with configurable aggregation operators, and performs atomic in-place rewrites or filtered exports. Updates documentation and navigation accordingly.