First implementation, but far from optimal
@@ -0,0 +1,61 @@
# On-disk collection structure

Collections are too large to hold in RAM (hundreds of genomes, billions of kmers). The collection lives on disk as a directory of memory-mapped files:

```
collection/
  metadata.toml          — collection parameters (see below)
  part_XXXX/
    superkmers.bin.gz    — dereplicated super-kmers for this partition (construction artifact)
    mphf.bin             — minimal perfect hash function for this partition
    counts.bin           — packed n-bit count array (or 1-bit presence array)
    refs.bin             — back-references: one u32 nucleotide offset into unitigs.bin per kmer
    unitigs.bin          — local de Bruijn unitigs (permanent evidence structure)
    overflow.bin         — counts exceeding the packed range (optional)
```

`superkmers.bin.gz` is produced during phase 1 and consumed through phases 2–4. It can be deleted after phase 5, since it is not needed for querying. The permanent query structure is `mphf.bin + counts.bin + refs.bin + unitigs.bin`.
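
As an illustration of how a reader might open one partition, the following Python sketch checks that the four permanent query files are present and memory-maps them read-only. The function names `missing_query_files` and `map_partition` are hypothetical, and the sketch assumes every file is non-empty (Python's `mmap` refuses to map an empty file).

```python
import mmap
from pathlib import Path

# File names are taken from the layout above; overflow.bin is optional
# and superkmers.bin.gz is a construction artifact, so neither is required.
QUERY_FILES = ("mphf.bin", "counts.bin", "refs.bin", "unitigs.bin")


def missing_query_files(part_dir):
    """Return the permanent query files absent from one partition directory."""
    part = Path(part_dir)
    return [name for name in QUERY_FILES if not (part / name).is_file()]


def map_partition(part_dir):
    """Memory-map the permanent query files of one partition, read-only."""
    missing = missing_query_files(part_dir)
    if missing:
        raise FileNotFoundError(f"{part_dir}: missing {', '.join(missing)}")
    maps = {}
    for name in QUERY_FILES:
        with open(Path(part_dir) / name, "rb") as fh:
            # Length 0 maps the whole file; the mapping remains valid
            # after the file object is closed.
            maps[name] = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    return maps
```

A caller would typically run `map_partition("collection/part_0000")` once per partition and keep the returned mappings open for the lifetime of the query session.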

## Collection parameters

Stored in `metadata.toml`:

| Parameter | Role |
|-----------|------|
| k | kmer length |
| m | minimizer length (odd, < k) |
| p | partition bits (0 ≤ p ≤ min(14, 2m−16)) |
| mode | `presence` (1 bit/kmer) or `count` (n bits/kmer) |
| n | bits per kmer in count mode (chosen at construction) |
| min_count | singleton filtering threshold (0 = keep all) |
| hash_fn | hash function identifier |
| hash_seed | seed for the hash function |
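
The constraints in the table can be checked mechanically. The sketch below is one possible validation helper; the name `validate_params` and the error messages are illustrative only, and only the three constraints stated in the table are checked.

```python
def validate_params(k, m, p):
    """Return the list of violated constraints (empty if k, m, p are consistent)."""
    errors = []
    if m % 2 == 0:
        errors.append("m must be odd")
    if not m < k:
        errors.append("m must be smaller than k")
    # Upper bound on partition bits, as given in the table: min(14, 2m - 16).
    p_max = min(14, 2 * m - 16)
    if not 0 <= p <= p_max:
        errors.append(f"p must satisfy 0 <= p <= {p_max}")
    return errors
```

For example, with k=31 and m=11 the bound is min(14, 6) = 6, so `validate_params(31, 11, 6)` returns an empty list while `validate_params(31, 11, 7)` reports a violation on `p`.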

## Count storage

**refs.bin capacity:** `unitigs.bin` is a flat 2-bit-packed nucleotide stream with no separators. Each entry in `refs.bin` is a u32 nucleotide offset pointing to the first base of the kmer. A u32 addresses 2³² ≈ 4.3 billion nucleotide positions, i.e. 1 GiB of packed sequence per partition. In the worst case (every unitig holds a single kmer, so offsets are spaced k apart), this supports 2³² / k ≈ 138 million kmers per partition at k=31. In the typical case (long unitigs, consecutive kmers at offset +1), the limit approaches 4.3 billion kmers, well beyond any realistic partition size.
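
To make the offset semantics concrete, here is a small sketch of reading a kmer from a 2-bit-packed stream at a given nucleotide offset. The base encoding (A=0, C=1, G=2, T=3) and the bit order (first base in the high bits of each byte) are assumptions made for the example; the actual encoding of `unitigs.bin` is not specified here. The helper names `pack` and `extract_kmer` are hypothetical.

```python
BASES = "ACGT"  # assumed 2-bit code: A=0, C=1, G=2, T=3


def pack(seq):
    """Pack a nucleotide string, 4 bases per byte, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= BASES.index(base) << (6 - 2 * (i % 4))
    return bytes(out)


def extract_kmer(packed, offset, k):
    """Decode the k bases starting at nucleotide position `offset`."""
    chars = []
    for pos in range(offset, offset + k):
        code = (packed[pos // 4] >> (6 - 2 * (pos % 4))) & 0b11
        chars.append(BASES[code])
    return "".join(chars)


stream = pack("ACGTTGCA")
print(extract_kmer(stream, 2, 4))  # prints GTTG
```

Because the stream has no separators, a single integer offset is enough to locate any kmer, which is what lets each `refs.bin` entry fit in a u32.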

*Presence mode* (coverage ≤ 1x, or when only presence/absence matters):

- `counts.bin` is a packed 1-bit array, with every bit set to 1 for indexed kmers
- Singletons are the signal and are not filtered

*Count mode* (coverage > 1x):

- `counts.bin` is a packed n-bit array; n is chosen at construction from the observed count distribution
- Value 0: absent sentinel; values 1..2ⁿ−2: direct count; value 2ⁿ−1: overflow
- Overflow counts are stored in a separate `overflow.bin` as sorted `(index: u32, count: u32)` pairs
- Empirically (k=31, 15x coverage): n=5 covers 97% of real kmers, n=6 covers 99%
- The `min_count` threshold filters low-frequency kmers (errors) before indexing; for coverage ≤ 1x, `min_count` = 0
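
The count-mode encoding above can be sketched as follows. The bit layout of `counts.bin` (values packed little-endian, lowest index in the lowest bits) is an assumption for the example, and the overflow table is represented as an in-memory sorted list of `(index, count)` tuples rather than the raw file. The names `pack_values`, `read_packed` and `lookup_count` are hypothetical.

```python
from bisect import bisect_left


def pack_values(values, n):
    """Pack integers (each < 2**n) into a little-endian n-bit array."""
    total = 0
    for i, v in enumerate(values):
        total |= v << (i * n)
    return total.to_bytes((len(values) * n + 7) // 8, "little")


def read_packed(buf, i, n):
    """Read the i-th n-bit value from the packed array."""
    bit = i * n
    # Load only the bytes that cover the n-bit window, then shift and mask.
    chunk = int.from_bytes(buf[bit // 8:(bit + n + 7) // 8], "little")
    return (chunk >> (bit % 8)) & ((1 << n) - 1)


def lookup_count(counts, overflow, i, n):
    """Return the count at index i, or None for the absent sentinel (0)."""
    v = read_packed(counts, i, n)
    if v == 0:
        return None
    if v < (1 << n) - 1:
        return v  # direct count in 1..2**n - 2
    # Saturated slot: binary-search the sorted (index, count) pairs.
    j = bisect_left(overflow, (i,))
    if j < len(overflow) and overflow[j][0] == i:
        return overflow[j][1]
    raise KeyError(f"index {i} is marked as overflow but has no overflow entry")
```

With n=3, the values 1 to 6 are stored directly and 7 redirects to the overflow table, so only kmers with counts above 6 pay for the extra binary search.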

## Query protocol

```
query kmer q
  → canonical_minimizer(q) → hash → PART → part_XXXX/
  → MPHF(q) → index i
  → refs[i] = nucleotide offset into unitigs.bin
  → read k bases at that offset from unitigs.bin → compare with q
      → match   : return counts[i]
      → no match: kmer absent
```
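
The verification step matters because a minimal perfect hash function returns some valid index even for a key it was never built on. The toy sketch below illustrates the lookup inside a single partition only; partition routing and canonicalization are omitted. It assumes `refs` holds flat nucleotide offsets as in the `refs.bin` description, uses a plain string in place of the packed `unitigs.bin`, and stands in for the MPHF with a dictionary that maps unknown keys to slot 0. All names are hypothetical.

```python
def query_partition(q, mphf, refs, unitigs, counts):
    """Return the stored count for kmer q, or None if q is not indexed."""
    i = mphf(q)  # always yields an index, even for unknown kmers
    start = refs[i]
    # Compare the query with the evidence stored in the unitigs.
    if unitigs[start:start + len(q)] != q:
        return None
    return counts[i]


# Toy partition with k=3: one unitig "ACGTAC" holding ACG, CGT, GTA, TAC.
unitigs = "ACGTAC"
refs = [0, 1, 2, 3]
counts = [5, 2, 7, 1]
index = {"ACG": 0, "CGT": 1, "GTA": 2, "TAC": 3}


def toy_mphf(kmer):
    # Stand-in for a real MPHF: unknown keys collide with slot 0.
    return index.get(kmer, 0)


print(query_partition("GTA", toy_mphf, refs, unitigs, counts))  # prints 7
print(query_partition("TTT", toy_mphf, refs, unitigs, counts))  # prints None
```

Without the comparison against `unitigs`, the query for `TTT` would wrongly return the count stored for `ACG`.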