Replace in-memory hashing with a disk-backed external merge sort and `PersistentCompactIntVec` to drastically reduce peak RAM. Unify both phases using a custom `PtrHash` MPHF, eliminating `GOFunction` and `boomphf`. Introduce a concrete three-step `count_partition()` pipeline with adaptive chunk sizing based on available system memory. Update dependencies to `memmap2`, `ptr_hash`, and `obicompactvec`. Additionally, document strict genomics-only memory constraints and enforce an architectural feedback workflow requiring explicit user authorization before structural changes.
Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
Introduces ColumnWeights, CountPartials, and BitPartials traits to compute and finalize partial distance matrices. Implements these traits for PersistentBitMatrix, PersistentCompactIntMatrix, and a new LayeredStore<S> wrapper that aggregates metrics across layers via parallel reduction. Adds ndarray for numerical aggregation and updates architecture documentation to reflect the trait-driven design and pending refactoring roadmap.
Add comprehensive documentation for the `obilayeredmap` crate, `PersistentCompactIntVec`, `PersistentBitVec`, and the hierarchical k-mer index architecture, including sidebar navigation updates across all documentation pages. Refactor the Bray-Curtis distance computation in `obicompactvec` to decouple numerator and denominator calculations, replacing direct pairwise calls with explicit loops over precomputed sums. Update tests to verify column sum accuracy and align with the simplified API.
Introduces `partial_jaccard_dist` to return raw intersection and union counts, improving Jaccard distance flexibility. Corrects `not()` to explicitly zero padding bits in the final word, ensuring accurate bit-counting for partially-filled words. Adds an optimized `build_from_counts` constructor.
Introduces `PersistentBitMatrix` and `PersistentCompactIntMatrix` to replace single-file vector storage with a column-major, directory-based layout. Each column is persisted as an individual file alongside a lightweight `meta.json` for dimension tracking. Migrates `obilayeredmap` to use these multi-column structures, updating Rust APIs, query return types, and build signatures. Includes comprehensive documentation, unit and integration tests for persistence and accessors, and refactors distance calculation helpers.
Replace the hardcoded `Counts` module with a generic `LayerData` trait, parameterizing `Layer` and `LayeredMap` over arbitrary payload types. This decouples read-path access from build-path logic, enabling both set membership and count-based indexing via `PersistentCompactIntVec`. Adds the `obicompactvec` dependency, implements streaming layer construction, and expands test coverage for persistence and multi-layer resolution.
Introduces PersistentBitVec, a dense, memory-mapped bit vector optimized for bulk u64-word operations and SIMD acceleration, complete with bitwise operators and Jaccard/Hamming distance metrics. Upgrades PersistentCompactIntVec to a unified .pciv format using 64-bit indices and offsets, consolidating the binary layout and updating builder/reader lifecycles accordingly. Adds corresponding documentation, updates MkDocs navigation, and implements a comprehensive test suite for persistence round-trips, edge cases, and metric accuracy.
Introduce the `obicompactvec` crate, featuring a two-tier, memory-mapped integer vector that uses a primary `u8` array with a sentinel for overflow dispatch and a sparse L1-resident index for fast random access. Implement builder and reader modules with zero-copy serialization and comprehensive test coverage. Update `obilayeredmap` to replace the default hash function with a cache-line-optimized `Mphf`, adding explicit bounds checking and duplicate-slot detection. Add documentation for both modules and update project configuration files accordingly.
Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (256) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.
Refactor core types to consistently use `CanonicalKMer` (lexicographically minimal of k-mer and its reverse complement) as the canonical representation, ensuring deterministic behavior in graph traversal (unitig decomposition), neighbor resolution (`unique_neighbor` with `[CanonicalKmer; 4]` input) and scatter output generation. Introduce `RoutableSuperKmer`, add `.seq_hash()` support, fix type syntax errors in unitig extraction methods and deduplication tests. Update all k-mer construction to use canonical-aware APIs, including unsafe unchecked constructors for performance-critical paths.
Replace raw SuperkMer routing with a new RoutableSuperKimer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline to yield RoutableSuperKmers throughout (builder, scatterer), refactor FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst), revise SuperKmer header to store n_kmers instead of seql (avoiding 256-byte wrap), and update documentation to reflect minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.
- Add `hash` field to MmerItem for stable, randomized minimizer ordering
- Introduce hash_mMER() using mix64 with XOR seed to avoid fixed points (e.g., poly-A/T)
- Remove is_degenerate() and minimizer_worse(), simplifying comparison to hash-only
- Update push logic: compare hashes instead of canonical values with degeneracy checks
- Introduce `obipipline` crate with multi-threaded data pipeline architecture
- Implement core types: SourceFn, SharedFun (Arc), SinkFN with biased scheduler and crossbeam channels
- Add macros: `make_source!`, `transform!/fallible`/sink!, and high-level DSL macro
- Replace old wrapper/error modules with unified scheduler module (renamed types, improved error variants)
- Update workspace: add `obipipeline` member to Cargo.toml and lockfile
- Document pipeline in docmd/implementation with full architecture, error handling & example
- Refactor sandbox_pipeline.rs to use new DSL instead of manual channel wiring
- Introduce new `obipackage` library with pipeline stages, scheduler and worker pool
- Refactor path expansion in `obiread`: replace old list_of_files with new PathIter iterator
- Add MIME type detection using `infer` crate (fastq/fasta)
- Update dependencies in Cargo.lock: add bumpalo, byteorder, cfb (with deps), fnv,
infer, js-sys/uuid/wasm-bindgen ecosystem
- Fix formatting and improve tests in SuperKmer (canonical, revcomp)
* Note: edition = "2024" in obipipeline/Cargo.toml is invalid; should be 2021
- Update super-kmer definition in `kmERS.md` to specify that non-degenerate m-mers are preferred over degenerate ones (degeneracy = homopolymer).
- Refactor `superkmer.rs`: change `.canonical()` to mutate in-place and return bool.
- Add `m` field & canonical-aware minimizer position calculation to SuperKmerIter in obiskbuilder.
- Add helper functions `is_degenerate` and minimizer comparison logic to rolling_stat.rs for consistent tie-breaking.
- Minor formatting cleanup in superkmer command and chunk processing.