mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
8c7017a99d
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
1.6 KiB
1.6 KiB
KDI File Format and API
The obikmer package implements a compact, sorted k-mer storage format (.kdi) with delta compression for efficient disk persistence and retrieval.
Core Features
- Sorted k-mer serialization: K-mers (as
uint64) are written in ascending order. - Delta encoding: Consecutive differences (deltas) between k-mers are stored using variable-length integers (
varint), drastically reducing size for dense sequences. - Round-trip integrity: Full write/read cycles preserve exact k-mer values and counts.
File Structure
A .kdi file contains:
- Magic header (4 bytes): Identifies the format.
- Count field (8 bytes,
uint64): Number of stored k-mers. - First value (8 bytes,
uint64): Initial k-mer. - Delta-encoded tail:
(n−1)deltas, each encoded as avarint.
API
NewKdiWriter(path string): Creates a writer;Write(v uint64)appends k-mers.Writer.Count(): Returns the number of written items before closing.NewKdiReader(path string): Opens a reader;Next() (uint64, bool)yields k-mers in order.Reader.Count(): Returns total stored count.
Tests Validate
- Basic round-trip with diverse values (including large
uint64s). - Empty and single-k-mer files.
- Exact file size for minimal cases (e.g., 20 bytes for one k-mer).
- Delta compression efficiency on dense sequences (e.g., 10k even numbers → ~9,999 extra bytes).
- Real-world usage: extracting canonical k-mers from DNA sequences, sorting/deduplicating, and persisting them.
The format is optimized for memory-mapped access or streaming traversal in bioinformatics pipelines.