Files
obitools4/autodoc/docmd/pkg/obikmer/kdi_test.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

35 lines
1.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# KDI File Format and API
The `obikmer` package implements a compact, sorted k-mer storage format (`.kdi`) with delta compression for efficient disk persistence and retrieval.
## Core Features
- **Sorted k-mer serialization**: K-mers (as `uint64`) are written in ascending order.
- **Delta encoding**: Consecutive differences (deltas) between k-mers are stored using variable-length integers (`varint`), drastically reducing size for dense sequences.
- **Round-trip integrity**: Full write/read cycles preserve exact k-mer values and counts.
## File Structure
A `.kdi` file contains:
1. **Magic header** (4 bytes): Identifies the format.
2. **Count field** (8 bytes, `uint64`): Number of stored k-mers.
3. **First value** (8 bytes, `uint64`): Initial k-mer.
4. **Delta-encoded tail**: `(n1)` deltas, each encoded as a `varint`.
## API
- **`NewKdiWriter(path string)`**: Creates a writer; `Write(v uint64)` appends k-mers.
- **`Writer.Count()`**: Returns the number of written items before closing.
- **`NewKdiReader(path string)`**: Opens a reader; `Next() (uint64, bool)` yields k-mers in order.
- **`Reader.Count()`**: Returns total stored count.
## Tests Validate
1. Basic round-trip with diverse values (including large `uint64`s).
2. Empty and single-k-mer files.
3. Exact file size for minimal cases (e.g., 20 bytes for one k-mer).
4. Delta compression efficiency on dense sequences (e.g., 10k even numbers → ~9,999 extra bytes).
5. Real-world usage: extracting canonical k-mers from DNA sequences, sorting/deduplicating, and persisting them.
The format is optimized for memory-mapped access or streaming traversal in bioinformatics pipelines.