KDI File Format and Writer

The obikmer package implements a compact storage format for sorted sequences of 64-bit k-mers, using delta encoding and sparse indexing.

Core Format (.kdi)

  • Magic header: KDI\x01 (4 bytes) identifies the file type.
  • Count field: uint64 LE, total number of k-mers (patched at close).
  • First value: uint64 LE, the initial k-mer stored as an absolute integer.
  • Deltas: Each subsequent k-mer is stored as a varint-encoded difference from the previous k-mer (delta-varint). Because the values are sorted, the differences are small and tend to need far fewer than 8 bytes each.

Writer (KdiWriter)

  • Strict ordering: K-mers must be written in strictly increasing order.
  • Efficient buffering via bufio.Writer (64 KB buffer).
  • Internally tracks:
    • Current k-mer count,
    • Previous value (for delta computation),
    • Bytes written in data section.
  • Sparse indexing: One index entry is recorded in memory every defaultKdxStride k-mers, so that readers can later seek into the file instead of scanning it.

Companion Index (.kdx)

  • Written automatically on Close() if indexing entries exist.
  • Stores (kmer, file_offset) pairs for fast seek-to-position lookups (e.g., binary search on k-mer range).
  • Enables efficient random access without full file scan.
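
A lookup through such an index might look like the sketch below. It is not the obikmer reader: kdxEntry, buildSketch, seekEntry and contains are hypothetical names, the on-disk .kdx layout is not modelled, and offsets are assumed to point into the delta stream at the first delta following the indexed k-mer.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

// kdxEntry pairs an indexed k-mer with the position of the delta that follows it.
// Assumed layout for illustration only.
type kdxEntry struct {
	kmer   uint64
	offset int
}

// buildSketch encodes the deltas of a sorted k-mer slice and keeps one
// index entry every stride k-mers.
func buildSketch(kmers []uint64, stride int) ([]byte, []kdxEntry) {
	var deltas []byte
	var index []kdxEntry
	for i, k := range kmers {
		if i > 0 {
			deltas = binary.AppendUvarint(deltas, k-kmers[i-1])
		}
		if i%stride == 0 {
			index = append(index, kdxEntry{kmer: k, offset: len(deltas)})
		}
	}
	return deltas, index
}

// seekEntry binary-searches for the last entry whose k-mer is <= query.
func seekEntry(index []kdxEntry, query uint64) (kdxEntry, bool) {
	i := sort.Search(len(index), func(i int) bool { return index[i].kmer > query })
	if i == 0 {
		return kdxEntry{}, false // query is smaller than every indexed k-mer
	}
	return index[i-1], true
}

// contains jumps to the nearest preceding index entry, then decodes at most
// one stride of deltas instead of scanning the stream from the start.
func contains(deltas []byte, index []kdxEntry, query uint64) bool {
	e, ok := seekEntry(index, query)
	if !ok {
		return false
	}
	cur, pos := e.kmer, e.offset
	for cur < query && pos < len(deltas) {
		d, n := binary.Uvarint(deltas[pos:])
		if n <= 0 {
			return false // truncated or malformed varint
		}
		cur += d
		pos += n
	}
	return cur == query
}

func main() {
	kmers := []uint64{1, 4, 7, 10, 13, 16, 19, 22, 25, 28}
	deltas, index := buildSketch(kmers, 4)
	fmt.Println(contains(deltas, index, 13), contains(deltas, index, 14))
}
```

With a stride of s, a lookup costs one binary search over roughly n/s entries plus at most s varint decodes, which is the trade-off a sparse index makes between index size and seek cost.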

Usage Pattern

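w, err := obikmer.NewKdiWriter("data.kdi")
if err != nil {
    log.Fatal(err) // a failed open would otherwise leave w unusable
}
for _, kmer := range sortedKMers { // must be in strictly increasing order
    w.Write(kmer)
}
w.Close()  // finalizes header, writes .kdx index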

The format is optimized for memory-efficient storage and fast retrieval of sorted uint64 k-mers in genomic or sequence analysis pipelines.