2026-04-19 12:17:16 +02:00

On-disk collection structure

Collections are too large to hold in RAM (hundreds of genomes, billions of kmers). The collection lives on disk as a directory of memory-mapped files:

collection/
  metadata.toml          — collection parameters (see below)
  part_XXXX/
    superkmers.bin.gz    — dereplicated super-kmers for this partition (construction artifact)
    mphf.bin             — minimal perfect hash function for this partition
    counts.bin           — packed n-bit count array (or 1-bit presence array)
    refs.bin             — back-references: one u32 nucleotide offset into unitigs.bin per kmer
    unitigs.bin          — local de Bruijn unitigs (permanent evidence structure)
    overflow.bin         — counts exceeding the packed range (optional)

superkmers.bin.gz is produced during phase 1 and consumed through phases 2–4. It can be deleted after phase 5 because it is not needed for querying. The permanent query structure is mphf.bin + counts.bin + refs.bin + unitigs.bin.

Collection parameters

Stored in metadata.toml:

| Parameter | Role |
|---|---|
| k | kmer length |
| m | minimizer length (odd, < k) |
| p | partition bits (0 ≤ p ≤ min(14, 2m − 16)) |
| mode | presence (1 bit/kmer) or count (n bits/kmer) |
| n | bits per kmer in count mode (chosen at construction) |
| min_count | singleton filtering threshold (0 = keep all) |
| hash_fn | hash function identifier |
| hash_seed | seed for the hash function |

Count storage

refs.bin capacity: unitigs.bin is a flat 2-bit-packed nucleotide stream with no separators. Each entry in refs.bin is a u32 nucleotide offset pointing to the first base of the kmer. A u32 addresses 2³² ≈ 4.3 billion nucleotide positions, i.e. 1 GiB of packed sequence per partition. In the worst case (every unitig holds a single kmer, so offsets are spaced k apart), this supports 2³² / k ≈ 138 million kmers per partition at k=31. In the typical case (long unitigs, consecutive kmers at offset +1), the limit approaches 4.3 billion kmers, well beyond any realistic partition size.
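
The capacity arithmetic and the offset-based lookup can be sketched as follows, using the exact u32 range 2³² rather than the rounded 4 billion. The base encoding (A=0, C=1, G=2, T=3) and the bit order within a byte (first base in the high bits) are assumptions; the text above only states that the stream is 2-bit packed. All function names are hypothetical.

```python
BASES = "ACGT"  # assumed encoding: A=0, C=1, G=2, T=3


def stream_bytes(offset_bits: int = 32) -> int:
    """Bytes of 2-bit-packed sequence addressable by an offset of this width."""
    return (1 << offset_bits) * 2 // 8


def max_kmers_worst_case(k: int, offset_bits: int = 32) -> int:
    """Addressable kmers when every unitig holds one kmer (offsets k apart)."""
    return (1 << offset_bits) // k


def pack(seq: str) -> bytes:
    """Pack a nucleotide string, four bases per byte, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for pos, base in enumerate(seq):
        out[pos // 4] |= BASES.index(base) << (6 - 2 * (pos % 4))
    return bytes(out)


def extract_kmer(stream: bytes, offset: int, k: int) -> str:
    """Decode the k bases starting at a nucleotide offset (not a byte offset)."""
    return "".join(
        BASES[(stream[pos // 4] >> (6 - 2 * (pos % 4))) & 0b11]
        for pos in range(offset, offset + k)
    )


print(stream_bytes())            # 1073741824 bytes = 1 GiB
print(max_kmers_worst_case(31))  # 138547332
print(extract_kmer(pack("ACGTAC"), 1, 4))  # CGTA
```

Because the offset counts nucleotides, two consecutive kmers of the same unitig differ by 1 in refs.bin, which is what makes the typical case so much cheaper than the worst case.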

Presence mode (coverage ≤ 1x, or when only presence/absence matters):

  • counts.bin is a packed 1-bit array — all bits set to 1 for indexed kmers
  • Singletons are the signal and are not filtered

Count mode (coverage > 1x):

  • counts.bin is a packed n-bit array; n chosen at construction from the observed distribution
  • Value 0: absent sentinel; values 1..2ⁿ−2: direct count; value 2ⁿ−1: overflow
  • Overflow counts stored in a separate overflow.bin as sorted (index: u32, count: u32) pairs
  • Empirically (k=31, 15x coverage): n=5 covers 97% of real kmers, n=6 covers 99%
  • min_count threshold filters low-frequency kmers (errors) before indexing; for ≤1x, min_count=0
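
The count encoding above (0 = absent, 1..2ⁿ−2 = direct, 2ⁿ−1 = overflow) can be illustrated with a small sketch. The bit layout (values packed back to back, least significant bit first) and the in-memory representation of overflow.bin as a sorted list of tuples are assumptions for illustration; the helper names are hypothetical.

```python
from bisect import bisect_left


def pack_values(values: list[int], n: int) -> bytes:
    """Pack n-bit values back to back, least significant bit first (assumed layout)."""
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * n)
    return acc.to_bytes((len(values) * n + 7) // 8, "little")


def read_packed(buf: bytes, i: int, n: int) -> int:
    """Read the i-th n-bit value from the packed array."""
    bit = i * n
    chunk = int.from_bytes(buf[bit // 8 : (bit + n + 7) // 8], "little")
    return (chunk >> (bit % 8)) & ((1 << n) - 1)


def build(raw_counts: list[int], n: int):
    """Clamp counts to the overflow marker and collect the exact overflow values."""
    marker = (1 << n) - 1
    packed = pack_values([min(c, marker) for c in raw_counts], n)
    # Built in index order, so the pairs are already sorted by index.
    overflow = [(i, c) for i, c in enumerate(raw_counts) if c >= marker]
    return packed, overflow


def lookup_count(packed: bytes, overflow: list[tuple[int, int]], i: int, n: int) -> int:
    """Return the count at index i; 0 is the absent sentinel."""
    v = read_packed(packed, i, n)
    if v != (1 << n) - 1:
        return v
    # Overflow marker: binary search the sorted (index, count) pairs.
    j = bisect_left(overflow, (i, 0))
    return overflow[j][1]


packed, overflow = build([3, 0, 30, 31, 1000], n=5)
print([lookup_count(packed, overflow, i, 5) for i in range(5)])
# → [3, 0, 30, 31, 1000]
```

With n=5 the direct range is 1..30, so both 31 and 1000 are stored as the marker 31 and resolved through the overflow pairs.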

Query protocol

query kmer q
  → canonical_minimizer(q) → hash → PART → part_XXXX/
  → MPHF(q) → index i
  → refs[i] = u32 nucleotide offset into unitigs.bin
  → read the k bases at that offset from unitigs.bin → compare with q
  → match   : return counts[i]
  → no match: kmer absent
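
The protocol can be walked through end to end with in-memory stand-ins. This is a toy sketch under several assumptions, none of which come from the design itself: `zlib.crc32` replaces the configured hash_fn, a dictionary with an arbitrary fallback index replaces the MPHF, minimizers are ordered lexicographically, kmers are compared in canonical form, and each kmer is stored as its own unitig in an unpacked string. refs holds a nucleotide offset per kmer, as in the refs.bin description.

```python
import zlib

COMP = str.maketrans("ACGT", "TGCA")


def revcomp(s: str) -> str:
    return s.translate(COMP)[::-1]


def canonical(s: str) -> str:
    return min(s, revcomp(s))


def partition_of(kmer: str, m: int, p: int) -> int:
    """canonical_minimizer(q) -> hash -> PART, keeping the low p bits."""
    minimizer = min(canonical(kmer[i:i + m]) for i in range(len(kmer) - m + 1))
    return zlib.crc32(minimizer.encode()) & ((1 << p) - 1)


def build_index(kmer_counts: dict[str, int], m: int, p: int) -> dict:
    """Toy builder: one single-kmer unitig per indexed kmer."""
    parts: dict[int, dict] = {}
    for kmer, count in kmer_counts.items():
        ck = canonical(kmer)
        part = parts.setdefault(
            partition_of(ck, m, p),
            {"keys": {}, "refs": [], "unitigs": "", "counts": []},
        )
        part["keys"][ck] = len(part["refs"])
        part["refs"].append(len(part["unitigs"]))  # nucleotide offset
        part["unitigs"] += ck
        part["counts"].append(count)
    return parts


def toy_mphf(part: dict, kmer: str) -> int:
    """Like an MPHF, returns some valid index even for kmers that were never indexed."""
    return part["keys"].get(kmer, zlib.crc32(kmer.encode()) % len(part["refs"]))


def query(parts: dict, q: str, k: int, m: int, p: int) -> int:
    q = canonical(q)
    part = parts.get(partition_of(q, m, p))
    if part is None:
        return 0  # no such partition: kmer absent
    i = toy_mphf(part, q)
    offset = part["refs"][i]
    # Verification step: the MPHF alone cannot reject foreign kmers.
    if canonical(part["unitigs"][offset:offset + k]) != q:
        return 0
    return part["counts"][i]


index = build_index({"ACGTTGCAA": 3, "GGGATCCCA": 7}, m=5, p=2)
print(query(index, "ACGTTGCAA", k=9, m=5, p=2))           # → 3
print(query(index, revcomp("GGGATCCCA"), k=9, m=5, p=2))  # → 7
print(query(index, "AAAAAAAAA", k=9, m=5, p=2))           # → 0
```

The comparison against unitigs.bin is what turns the MPHF into an exact membership test: without it, an absent kmer would silently return the count of whichever indexed kmer shares its hash slot.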