Files
obitools4/autodoc/docmd/pkg/obikmer/kdi_test.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

1.6 KiB
Raw Blame History

KDI File Format and API

The obikmer package implements a compact, sorted k-mer storage format (.kdi) with delta compression for efficient disk persistence and retrieval.

Core Features

  • Sorted k-mer serialization: K-mers (as uint64) are written in ascending order.
  • Delta encoding: Consecutive differences (deltas) between k-mers are stored using variable-length integers (varint), drastically reducing size for dense sequences.
  • Round-trip integrity: Full write/read cycles preserve exact k-mer values and counts.

File Structure

A .kdi file contains:

  1. Magic header (4 bytes): Identifies the format.
  2. Count field (8 bytes, uint64): Number of stored k-mers.
  3. First value (8 bytes, uint64): Initial k-mer.
  4. Delta-encoded tail: (n1) deltas, each encoded as a varint.

API

  • NewKdiWriter(path string): Creates a writer; Write(v uint64) appends k-mers.
  • Writer.Count(): Returns the number of written items before closing.
  • NewKdiReader(path string): Opens a reader; Next() (uint64, bool) yields k-mers in order.
  • Reader.Count(): Returns total stored count.

Tests Validate

  1. Basic round-trip with diverse values (including large uint64s).
  2. Empty and single-k-mer files.
  3. Exact file size for minimal cases (e.g., 20 bytes for one k-mer).
  4. Delta compression efficiency on dense sequences (e.g., 10k even numbers → ~9,999 extra bytes).
  5. Real-world usage: extracting canonical k-mers from DNA sequences, sorting/deduplicating, and persisting them.

The format is optimized for memory-mapped access or streaming traversal in bioinformatics pipelines.