obikmer

Author	SHA1	Message	Date
Eric Coissac	1881e98bad	feat(bitvec): add partial Jaccard, fix padding, optimize constructor Introduces `partial_jaccard_dist` to return raw intersection and union counts, improving Jaccard distance flexibility. Corrects `not()` to explicitly zero padding bits in the final word, ensuring accurate bit-counting for partially-filled words. Adds an optimized `build_from_counts` constructor.	2026-05-15 15:04:10 +08:00
Eric Coissac	b218bf012b	feat: introduce column-major matrix storage and migrate layered map Introduces `PersistentBitMatrix` and `PersistentCompactIntMatrix` to replace single-file vector storage with a column-major, directory-based layout. Each column is persisted as an individual file alongside a lightweight `meta.json` for dimension tracking. Migrates `obilayeredmap` to use these multi-column structures, updating Rust APIs, query return types, and build signatures. Includes comprehensive documentation, unit and integration tests for persistence and accessors, and refactors distance calculation helpers.	2026-05-14 21:19:18 +08:00
Eric Coissac	f48f7500cd	refactor(obilayeredmap): support generic payload types Replace the hardcoded `Counts` module with a generic `LayerData` trait, parameterizing `Layer` and `LayeredMap` over arbitrary payload types. This decouples read-path access from build-path logic, enabling both set membership and count-based indexing via `PersistentCompactIntVec`. Adds the `obicompactvec` dependency, implements streaming layer construction, and expands test coverage for persistence and multi-layer resolution.	2026-05-14 09:33:18 +08:00
Eric Coissac	0b3fcf3cf0	feat: add PersistentBitVec and upgrade PersistentCompactIntVec format Introduces PersistentBitVec, a dense, memory-mapped bit vector optimized for bulk u64-word operations and SIMD acceleration, complete with bitwise operators and Jaccard/Hamming distance metrics. Upgrades PersistentCompactIntVec to a unified .pciv format using 64-bit indices and offsets, consolidating the binary layout and updating builder/reader lifecycles accordingly. Adds corresponding documentation, updates MkDocs navigation, and implements a comprehensive test suite for persistence round-trips, edge cases, and metric accuracy.	2026-05-14 09:01:36 +08:00
Eric Coissac	f2de79acde	Add persistent compact integer vector and cache-line-optimized MPHF Introduce the `obicompactvec` crate, featuring a two-tier, memory-mapped integer vector that uses a primary `u8` array with a sentinel for overflow dispatch and a sparse L1-resident index for fast random access. Implement builder and reader modules with zero-copy serialization and comprehensive test coverage. Update `obilayeredmap` to replace the default hash function with a cache-line-optimized `Mphf`, adding explicit bounds checking and duplicate-slot detection. Add documentation for both modules and update project configuration files accordingly.	2026-05-13 10:09:46 +08:00
Eric Coissac	5169f65dc9	feat: implement persistent layered index and chunked binary format Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (256) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.	2026-05-09 17:38:29 +08:00
Eric Coissac	defeeb9460	feat: enforce canonical k-mer representation throughout the codebase Refactor core types to consistently use `CanonicalKMer` (lexicographically minimal of k-mer and its reverse complement) as the canonical representation, ensuring deterministic behavior in graph traversal (unitig decomposition), neighbor resolution (`unique_neighbor` with `[CanonicalKmer; 4]` input) and scatter output generation. Introduce `RoutableSuperKmer`, add `.seq_hash()` support, fix type syntax errors in unitig extraction methods and deduplication tests. Update all k-mer construction to use canonical-aware APIs, including unsafe unchecked constructors for performance-critical paths.	2026-05-02 16:31:08 +02:00
Eric Coissac	27f5e88a7b	refactor: implement RoutableSuperKmer and update k-mer indexing pipeline Replace raw SuperKmer routing with a new RoutableSuperKmer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline to yield RoutableSuperKmers throughout (builder, scatterer), refactor FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst), revise SuperKmer header to store n_kmers instead of seql (avoiding 256-byte wrap), and update documentation to reflect minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.	2026-05-01 09:33:26 +02:00
Eric Coissac	ebbfe35cbc	Refactor: Extract utility function for string reversal Extracted `inverser_chaine` into a reusable utility function with docstring and added unit test to ensure correctness.	2026-04-30 06:58:46 +02:00
Eric Coissac	58391886a3	🔧 Replace degenerate minimizer logic with hash-based random ordering - Add `hash` field to MmerItem for stable, randomized minimizer ordering - Introduce hash_mMER() using mix64 with XOR seed to avoid fixed points (e.g., poly-A/T) - Remove is_degenerate() and minimizer_worse(), simplifying comparison to hash-only - Update push logic: compare hashes instead of canonical values with degeneracy checks	2026-04-27 20:19:43 +02:00
Eric Coissac	22951fb0e8	🔖 Add obipipeline parallel pipeline library - Introduce `obipipeline` crate with multi-threaded data pipeline architecture - Implement core types: SourceFn, SharedFun (Arc), SinkFN with biased scheduler and crossbeam channels - Add macros: `make_source!`, `transform!/fallible`/sink!, and high-level DSL macro - Replace old wrapper/error modules with unified scheduler module (renamed types, improved error variants) - Update workspace: add `obipipeline` member to Cargo.toml and lockfile - Document pipeline in docmd/implementation with full architecture, error handling & example - Refactor sandbox_pipeline.rs to use new DSL instead of manual channel wiring	2026-04-24 17:10:07 +02:00
Eric Coissac	664d0216b5	📦 Add obipipeline crate and refactor path handling - Introduce new `obipipeline` library with pipeline stages, scheduler and worker pool - Refactor path expansion in `obiread`: replace old list_of_files with new PathIter iterator - Add MIME type detection using `infer` crate (fastq/fasta) - Update dependencies in Cargo.lock: add bumpalo, byteorder, cfb (with deps), fnv, infer, js-sys/uuid/wasm-bindgen ecosystem - Fix formatting and improve tests in SuperKmer (canonical, revcomp) * Note: edition = "2024" in obipipeline/Cargo.toml is invalid; should be 2021	2026-04-23 21:06:11 +02:00
Eric Coissac	380b5a6f94	📖 Update super-kmer theory and implementation to prefer non-degenerate m-mers - Update super-kmer definition in `kmERS.md` to specify that non-degenerate m-mers are preferred over degenerate ones (degeneracy = homopolymer). - Refactor `superkmer.rs`: change `.canonical()` to mutate in-place and return bool. - Add `m` field & canonical-aware minimizer position calculation to SuperKmerIter in obiskbuilder. - Add helper functions `is_degenerate` and minimizer comparison logic to rolling_stat.rs for consistent tie-breaking. - Minor formatting cleanup in superkmer command and chunk processing.	2026-04-20 17:50:09 +02:00
Eric Coissac	de3f9b16cf	First implementation, still far from optimal	2026-04-19 12:17:16 +02:00

14 Commits