obikmer/docmd/implementation/storage.md
2026-04-19 12:17:16 +02:00

# On-disk collection structure
Collections are too large to hold in RAM (hundreds of genomes, billions of kmers). The collection lives on disk as a directory of memory-mapped files:
```
collection/
  metadata.toml          — collection parameters (see below)
  part_XXXX/
    superkmers.bin.gz    — dereplicated super-kmers for this partition (construction artifact)
    mphf.bin             — minimal perfect hash function for this partition
    counts.bin           — packed n-bit count array (or 1-bit presence array)
    refs.bin             — back-references: one u32 nucleotide offset into unitigs.bin per kmer
    unitigs.bin          — local de Bruijn unitigs (permanent evidence structure)
    overflow.bin         — counts exceeding the packed range (optional)
```
`superkmers.bin.gz` is produced during phase 1 and consumed through phases 2–4. It can be deleted after phase 5 because it is not needed for querying. The permanent query structure is `mphf.bin` + `counts.bin` + `refs.bin` + `unitigs.bin`.
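
As a minimal sketch of that split between construction artifacts and permanent query files, the hypothetical helper below reports which query files are missing from one `part_XXXX/` directory. The file names come from the layout above; the function name and the idea of running such a check are assumptions for illustration only.

```python
from pathlib import Path

# File names taken from the layout above; the helper itself is hypothetical.
QUERY_FILES = ("mphf.bin", "counts.bin", "refs.bin", "unitigs.bin")


def missing_query_files(partition_dir):
    """Return the permanent query files that are absent from one partition directory."""
    part = Path(partition_dir)
    return [name for name in QUERY_FILES if not (part / name).is_file()]
```

An empty result suggests the partition can be queried even if `superkmers.bin.gz` has already been deleted.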
## Collection parameters
Stored in `metadata.toml`:
| Parameter | Role |
|-----------|------|
| k | kmer length |
| m | minimizer length (odd, < k) |
| p | partition bits (0 ≤ p ≤ min(14, 2m − 16)) |
| mode | `presence` (1 bit/kmer) or `count` (n bits/kmer) |
| n | bits per kmer in count mode (chosen at construction) |
| min_count | singleton filtering threshold (0 = keep all) |
| hash_fn | hash function identifier |
| hash_seed | seed for the hash function |
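
The constraints in the table can be sketched as a small validation routine. This assumes the parameters are available as a flat dictionary keyed by the names in the table (for example after loading `metadata.toml` with `tomllib.load` on Python 3.11+); the actual file layout, the example values, and the function name are assumptions. Only the fixed upper bound of 14 on `p` is checked here; the second bound, which depends on `m`, is left out.

```python
VALID_MODES = ("presence", "count")


def check_parameters(params):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    k, m, p = params["k"], params["m"], params["p"]
    if m % 2 == 0:
        problems.append("m must be odd")
    if m >= k:
        problems.append("m must be smaller than k")
    if not 0 <= p <= 14:
        problems.append("p must lie between 0 and 14")
    if params["mode"] not in VALID_MODES:
        problems.append("mode must be 'presence' or 'count'")
    if params["mode"] == "count" and params.get("n", 0) < 2:
        # n bits must hold the absent value 0, at least one direct count,
        # and the overflow value, so n = 1 is not enough in count mode.
        problems.append("count mode needs n >= 2")
    return problems


# Hypothetical example values, not taken from a real collection.
example = {
    "k": 31, "m": 11, "p": 6, "mode": "count", "n": 6,
    "min_count": 2, "hash_fn": "example_hash", "hash_seed": 42,
}
```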
## Count storage
**refs.bin capacity:** `unitigs.bin` is a flat 2-bit-packed nucleotide stream with no separators. Each entry in `refs.bin` is a u32 nucleotide offset pointing to the first base of the kmer. A u32 addresses 2³² ≈ 4.3 billion nucleotide positions, i.e. 1 GiB of packed sequence per partition. In the worst case (every unitig holds a single kmer, so offsets are spaced k apart), this supports 2³² / k ≈ 138 million kmers per partition at k=31. In the typical case (long unitigs, consecutive kmers at offset +1), the limit approaches 4.3 billion kmers, well beyond any realistic partition size.
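
A minimal sketch of how a kmer could be read back from a flat 2-bit-packed stream at a nucleotide offset, together with the capacity arithmetic. The base encoding (A=0, C=1, G=2, T=3) and the bit order (first base in the high bits of each byte) are assumptions, since the text does not specify them; `pack` and `kmer_at` are hypothetical names.

```python
BASES = "ACGT"  # assumed encoding: A=0, C=1, G=2, T=3


def pack(seq):
    """Pack a nucleotide string at 4 bases per byte, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= BASES.index(base) << (6 - 2 * (i % 4))
    return bytes(out)


def kmer_at(packed, offset, k):
    """Decode the k bases that start at nucleotide offset `offset`."""
    chars = []
    for i in range(offset, offset + k):
        code = (packed[i // 4] >> (6 - 2 * (i % 4))) & 0b11
        chars.append(BASES[code])
    return "".join(chars)


# Capacity arithmetic for a u32 offset, using the exact range rather than "4 billion".
U32_POSITIONS = 2 ** 32                   # addressable nucleotide positions
PACKED_BYTES = U32_POSITIONS * 2 // 8     # 2 bits per base: 2**30 bytes (1 GiB)
WORST_CASE_KMERS = U32_POSITIONS // 31    # offsets spaced k = 31 apart: 138_547_332
```

Because offsets count nucleotides rather than bytes, consecutive kmers of the same unitig differ by exactly 1 in `refs.bin`, whatever their byte alignment.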
*Presence mode* (coverage ≤ 1x, or when only presence/absence matters):
- `counts.bin` is a packed 1-bit array — all bits set to 1 for indexed kmers
- Singletons are the signal, not filtered
*Count mode* (coverage > 1x):
- `counts.bin` is a packed n-bit array; n chosen at construction from the observed distribution
- Value 0: absent sentinel; values 1..2ⁿ−2: direct count; value 2ⁿ−1: overflow
- Overflow counts stored in a separate `overflow.bin` as sorted `(index: u32, count: u32)` pairs
- Empirically (k=31, 15x coverage): n=5 covers 97% of real kmers, n=6 covers 99%
- min_count threshold filters low-frequency kmers (errors) before indexing; for ≤1x, min_count=0
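
The count-mode encoding above can be sketched as follows. The bit layout inside `counts.bin` (little-endian, slot i starting at bit i·n) is an assumption, and the overflow table is modelled as an in-memory sorted list of `(index, count)` tuples rather than the on-disk u32 pairs; all function names are hypothetical.

```python
import bisect


def get_packed(buf, i, n):
    """Read the n-bit value in slot i of a little-endian bit-packed buffer."""
    bit = i * n
    first, last = bit // 8, (bit + n - 1) // 8
    chunk = int.from_bytes(buf[first:last + 1], "little")
    return (chunk >> (bit % 8)) & ((1 << n) - 1)


def set_packed(buf, i, n, value):
    """Write an n-bit value into slot i of a bytearray, leaving neighbours untouched."""
    bit = i * n
    first, last = bit // 8, (bit + n - 1) // 8
    chunk = int.from_bytes(buf[first:last + 1], "little")
    mask = ((1 << n) - 1) << (bit % 8)
    chunk = (chunk & ~mask) | (value << (bit % 8))
    buf[first:last + 1] = chunk.to_bytes(last - first + 1, "little")


def store_count(counts, overflow, i, n, count):
    """Store a count directly if it fits in 1..2^n - 2, otherwise mark it as overflow."""
    sentinel = (1 << n) - 1
    if count < sentinel:
        set_packed(counts, i, n, count)
    else:
        set_packed(counts, i, n, sentinel)
        bisect.insort(overflow, (i, count))  # keeps the pairs sorted by index


def lookup_count(counts, overflow, i, n):
    """Return 0 for an absent slot, the direct count, or the overflow count."""
    value = get_packed(counts, i, n)
    if value != (1 << n) - 1:
        return value
    pos = bisect.bisect_left(overflow, (i, 0))  # binary search on the index
    return overflow[pos][1]
```

Keeping the overflow pairs sorted by index is what makes the binary search possible, so the rare large counts cost a logarithmic lookup while the common case remains a single packed read.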
## Query protocol
```
query kmer q
→ canonical_minimizer(q) → hash → PART → part_XXXX/
→ MPHF(q) → index i
→ refs[i] = nucleotide offset into unitigs.bin
→ read k bases at that offset from unitigs.bin → compare with q
→ match : return counts[i]
→ no match: kmer absent
```
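
The lookup inside one partition can be sketched as below, assuming `refs[i]` is a flat nucleotide offset as described in the refs.bin capacity note. Partition routing through the minimizer hash and canonical (strand) handling are left out. The `Partition` class, the dictionary-backed stand-in for the MPHF, and the use of a decoded string for the unitigs are all illustrative assumptions. The comparison step matters because a minimal perfect hash function returns some valid slot even for kmers that were never indexed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Partition:
    mphf: Callable[[str], int]  # stand-in for mphf.bin; may return any slot for unknown kmers
    refs: List[int]             # nucleotide offset of each indexed kmer
    counts: List[int]           # count (or presence bit) per indexed kmer
    unitigs: str                # decoded for readability; 2-bit packed on disk


def query(part, q):
    """Return the stored count for kmer q, or 0 if q is not in this partition."""
    i = part.mphf(q)
    offset = part.refs[i]
    # Verify against the evidence sequence: the MPHF alone cannot reject absent kmers.
    if part.unitigs[offset:offset + len(q)] != q:
        return 0
    return part.counts[i]


# Toy partition: one unitig "ACGTAC" holding three 4-mers at offsets 0, 1, 2.
slots = {"ACGT": 0, "CGTA": 1, "GTAC": 2}
toy = Partition(
    mphf=lambda q: slots.get(q, 0),  # unknown kmers collide with slot 0
    refs=[0, 1, 2],
    counts=[3, 1, 7],
    unitigs="ACGTAC",
)
```

With this toy data, `query(toy, "GTAC")` returns 7, while `query(toy, "TTTT")` lands on slot 0, fails the comparison against `ACGT`, and returns 0.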