# On-disk index layout

## Directory tree

```
<index_root>/
  index.meta                  ← JSON: IndexMeta
  scatter.done                ← sentinel: scatter phase complete
  count.done                  ← sentinel: dereplicate + count complete
  index.done                  ← sentinel: MPHF index fully built
  spectrums/
    <label>.json              ← kmer frequency spectrum per genome
  partitions/
    part_00000/               ← one dir per partition (zero-padded 5 digits, 0..2^n_bits−1)
      index/
        meta.json             ← PartitionMeta { n_layers }
        layer_0/
          unitigs.bin         ← binary unitig sequences (2-bit packed)
          unitigs.bin.idx     ← block-sampled offset index (exact evidence only)
          mphf.bin            ← serialised PtrHash MPHF
          layer_meta.json     ← LayerMeta { evidence: EvidenceKind }
          evidence.bin        ← chunk_id:rank per MPHF slot (Exact only)
          fingerprint.bin     ← b-bit fingerprints per MPHF slot (Approx only)
          counts/             ← PersistentCompactIntMatrix (if with_counts=true)
          presence/           ← PersistentBitMatrix (if presence mode, merge)
        layer_1/              ← added by merge; same structure as layer_0
        layer_2/ …
    part_00001/ …
```

## State machine (sentinels)

Each sentinel is touched atomically at the end of its pipeline stage, so an
interrupted stage (e.g. scatter killed midway) leaves no sentinel of its own.
The current state is the most advanced sentinel present.

| State | Sentinel present | Meaning |
|---|---|---|
| `Empty` | — | `index.meta` exists; scatter not started or interrupted |
| `Scattered` | `scatter.done` | All super-kmers routed to partition files |
| `Counted` | `count.done` | Partitions dereplicated; `spectrums/` written |
| `Indexed` | `index.done` | All MPHF layers built; index ready for queries |
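
The state table above can be read as a small lookup over the sentinel files. The following is a minimal sketch, not the project's actual code: the function names `detect_state` and `mark_done` are hypothetical, and the write-to-temp-then-rename step is only one common way to make a sentinel appear atomically.

```python
from pathlib import Path

# Sentinels ordered from the earliest to the latest pipeline stage.
SENTINELS = [
    ("Scattered", "scatter.done"),
    ("Counted", "count.done"),
    ("Indexed", "index.done"),
]


def detect_state(index_root: Path) -> str:
    """Return the most advanced state whose sentinel file exists."""
    state = "Empty"
    for name, sentinel in SENTINELS:
        if (index_root / sentinel).exists():
            state = name
    return state


def mark_done(index_root: Path, sentinel: str) -> None:
    """Create a sentinel via temp file + rename (assumed atomicity strategy)."""
    tmp = index_root / (sentinel + ".tmp")
    tmp.touch()
    tmp.replace(index_root / sentinel)  # rename is atomic on POSIX filesystems
```

A directory that contains only `index.meta` reports `Empty`; calling `mark_done(root, "scatter.done")` moves it to `Scattered`.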

## index.meta (IndexMeta)

```json
{
  "version": 1,
  "config": {
    "kmer_size": 31,
    "minimizer_size": 11,
    "n_bits": 8,
    "with_counts": false,
    "evidence": "Exact",
    "block_bits": 0
  },
  "genomes": [
    { "label": "genome_A", "meta": { "species": "Homo sapiens" } }
  ]
}
```

`n_bits` determines the partition count: `2^n_bits` directories under `partitions/`.

`evidence` is either the string `"Exact"` or `{"Approx": {"b": 8, "z": 1}}`.

`block_bits` controls the `.idx` granularity: one offset entry every `2^block_bits`
chunks. `block_bits=0` stores one entry per chunk (O(1) random access, largest `.idx`);
larger values shrink the `.idx`, and reaching a chunk then means skipping up to
`2^block_bits − 1` records from the nearest sampled offset.

`GenomeInfo.meta` is a free-form string→string map for categorical metadata (e.g.
taxonomy, sample origin). It is optional and defaults to empty.
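
To make the field semantics concrete, here is a minimal reader sketch for the example above. The helper names `partition_dirs` and `parse_evidence` are hypothetical and only mirror the rules stated in this section; they are not part of the tool.

```python
import json

EXAMPLE = """
{
  "version": 1,
  "config": {"kmer_size": 31, "minimizer_size": 11, "n_bits": 8,
             "with_counts": false, "evidence": "Exact", "block_bits": 0},
  "genomes": [{"label": "genome_A", "meta": {"species": "Homo sapiens"}}]
}
"""


def partition_dirs(n_bits: int) -> list[str]:
    """Names of the 2^n_bits partition directories (zero-padded to 5 digits)."""
    return [f"part_{i:05d}" for i in range(1 << n_bits)]


def parse_evidence(value) -> tuple[str, dict]:
    """Decode the two documented shapes of the `evidence` field."""
    if value == "Exact":
        return ("Exact", {})
    if isinstance(value, dict) and "Approx" in value:
        return ("Approx", dict(value["Approx"]))
    raise ValueError(f"unrecognised evidence value: {value!r}")


meta = json.loads(EXAMPLE)
config = meta["config"]
dirs = partition_dirs(config["n_bits"])
evidence = parse_evidence(config["evidence"])
# `meta` is optional per genome, so fall back to an empty map.
genome_meta = {g["label"]: g.get("meta", {}) for g in meta["genomes"]}
```

With `n_bits = 8` this yields 256 directory names, from `part_00000` to `part_00255`.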

## Layer files

### unitigs.bin

2-bit packed binary unitig sequences. Each record is one `seql_minus_k` byte
(nucleotide length − k) followed by `ceil((seql_minus_k + k) / 4)` bytes of
packed sequence, so a single record holds at most `255 + k` nucleotides
(256 k-mers). Longer unitigs are transparently split into chunks that overlap
by k−1 nucleotides, so every k-mer lies entirely within one chunk and none
crosses a chunk boundary.
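
The record arithmetic can be sketched as follows. The size formula comes directly from the description above; the greedy policy in `split_unitig` (fill each chunk to the one-byte maximum) is an assumption made for illustration, and the real splitter may choose chunk boundaries differently.

```python
import math

MAX_SEQL_MINUS_K = 255  # largest value the one-byte length field can hold


def record_size(seql_minus_k: int, k: int) -> int:
    """Bytes on disk for one record: length byte + 2-bit packed sequence."""
    return 1 + math.ceil((seql_minus_k + k) / 4)


def split_unitig(seq: str, k: int) -> list[str]:
    """Split a unitig (len >= k) into chunks overlapping by k-1 nucleotides.

    Assumed policy: every chunk except the last is filled to 255 + k bases.
    """
    max_len = MAX_SEQL_MINUS_K + k
    step = max_len - (k - 1)  # 256 new k-mers per full chunk
    chunks = []
    start = 0
    while True:
        chunks.append(seq[start:start + max_len])
        if start + max_len >= len(seq):
            break
        start += step
    return chunks


def kmers_in(chunk: str, k: int) -> int:
    """Number of k-mers contained in a chunk."""
    return len(chunk) - k + 1
```

Because consecutive chunks share exactly k−1 nucleotides, the k-mer counts of the chunks add up to the k-mer count of the original unitig, with no k-mer stored twice.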

### unitigs.bin.idx (Exact only)

Magic `UIX3`, little-endian header: `block_bits` (u32), `n_unitigs` (u32),
`n_kmers` (u64), then `ceil(n_unitigs / 2^block_bits) + 1` byte-offset entries
(u32 each; the last entry is a sentinel past-end offset). Absent for Approx layers.
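
A reader for this layout might look like the sketch below. It assumes the header fields follow the 4-byte magic immediately, with no padding, which the description implies but does not state outright; `parse_idx` is a hypothetical name.

```python
import struct

# magic (4 bytes), block_bits (u32), n_unitigs (u32), n_kmers (u64); "<" = little-endian, unpadded
HEADER = struct.Struct("<4sIIQ")


def parse_idx(buf: bytes) -> tuple[int, int, int, list[int]]:
    """Return (block_bits, n_unitigs, n_kmers, offsets) from a .idx buffer."""
    magic, block_bits, n_unitigs, n_kmers = HEADER.unpack_from(buf, 0)
    if magic != b"UIX3":
        raise ValueError(f"bad magic: {magic!r}")
    # ceil(n_unitigs / 2^block_bits) sampled offsets, plus the past-end sentinel
    n_entries = -(-n_unitigs // (1 << block_bits)) + 1
    offsets = struct.unpack_from(f"<{n_entries}I", buf, HEADER.size)
    return block_bits, n_unitigs, n_kmers, list(offsets)


# Synthetic example: 5 unitigs sampled every 2 records -> 3 offsets + sentinel.
example = HEADER.pack(b"UIX3", 1, 5, 100) + struct.pack("<4I", 0, 18, 36, 45)
parsed = parse_idx(example)
```

Under these assumptions the header occupies 20 bytes and the offset table starts right after it.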

### mphf.bin

PtrHash MPHF serialised with epserde. Maps each canonical kmer (u64, left-aligned
2-bit) to a slot index in `[0, n_kmers)`.
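
The key format can be illustrated with a small encoder. Everything specific here is an assumption rather than something this document states: the nucleotide codes (`A=0, C=1, G=2, T=3`), "left-aligned" meaning the first base occupies the two most significant bits, and the canonical form being the smaller of the forward and reverse-complement encodings.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}            # assumed 2-bit codes
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode_left_aligned(kmer: str) -> int:
    """Pack a k-mer (k <= 32) into a u64 with the first base in the top bits."""
    k = len(kmer)
    if not 1 <= k <= 32:
        raise ValueError("k must be in 1..32 to fit in 64 bits")
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value << (64 - 2 * k)  # unused low bits stay zero


def reverse_complement(kmer: str) -> str:
    return "".join(COMPLEMENT[b] for b in reversed(kmer))


def canonical_u64(kmer: str) -> int:
    """Smaller of the forward and reverse-complement encodings (assumed rule)."""
    return min(encode_left_aligned(kmer),
               encode_left_aligned(reverse_complement(kmer)))
```

With k = 31 the packed value uses the top 62 bits of the word, leaving the two lowest bits zero.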

### layer_meta.json (LayerMeta)

```json
{ "evidence": { "type": "exact" } }
```

or

```json
{ "evidence": { "type": "approx", "b": 8, "z": 1 } }
```

### evidence.bin (Exact)

One `(chunk_id: u32, rank: u8)` record per MPHF slot, packed. An MPHF returns
some slot even for a kmer outside the indexed set, so the record is used to
verify that the kmer mapped to a slot is actually present: `unitigs.bin[chunk_id][rank]`
is re-read and compared against the query.
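
The verification step can be sketched as below. Several details are assumptions: records are taken to be 5 bytes, little-endian, with no padding; `rank` is interpreted as the offset of the k-mer inside its chunk; and chunks are shown as already-decoded strings to keep the example short. Canonicalisation of the query is left out.

```python
import struct

RECORD = struct.Struct("<IB")  # chunk_id (u32), rank (u8): 5 bytes, unpadded


def evidence_at(evidence: bytes, slot: int) -> tuple[int, int]:
    """Read the (chunk_id, rank) record stored for an MPHF slot."""
    return RECORD.unpack_from(evidence, slot * RECORD.size)


def verify(query: str, slot: int, evidence: bytes,
           chunks: list[str], k: int) -> bool:
    """True if the k-mer stored at the slot's location equals the query."""
    chunk_id, rank = evidence_at(evidence, slot)
    stored = chunks[chunk_id][rank:rank + k]
    return stored == query


# Toy layer: one chunk, k = 4, and slot 0 points at offset 1 of chunk 0.
toy_chunks = ["ACGTAC"]
toy_evidence = RECORD.pack(0, 1)
```

Here slot 0 refers to `CGTA`, so a query for `CGTA` is confirmed while any other k-mer that the MPHF happens to map to slot 0 is rejected.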

### fingerprint.bin (Approx)

`b`-bit fingerprint per MPHF slot derived from the kmer's sequence hash.
False-positive rate per query ≈ `1/2^b`. With Findere parameter `z ≥ 2`,
`z` consecutive k-mers must all match, reducing the effective FP rate to
approximately `W / 2^(b·z)` per read of length `L`, where `W = L − k − z + 2`
is the number of windows of `z` consecutive k-mers in the read.
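
The two rates above are easy to evaluate numerically. This is a direct transcription of the formulas in this section; the function names are hypothetical and the read length of 150 is only an example value.

```python
def per_kmer_fp(b: int) -> float:
    """False-positive rate of a single b-bit fingerprint check."""
    return 2.0 ** -b


def per_read_fp(read_len: int, k: int, b: int, z: int) -> float:
    """Approximate per-read FP rate when z consecutive k-mers must match."""
    windows = read_len - k - z + 2  # W = L - k - z + 2
    if windows <= 0:
        return 0.0  # read too short to contain z consecutive k-mers
    return windows / 2 ** (b * z)


single = per_kmer_fp(8)               # 1/256
read = per_read_fp(150, 31, 8, 3)     # W = 118 windows, 118 / 2^24
```

For `b = 8`, `z = 3`, `k = 31` and a 150-nucleotide read, the bound is 118 / 2^24, roughly 7 × 10⁻⁶, compared with 1/256 for a single fingerprint check.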

### counts/ (PersistentCompactIntMatrix)

Present when `with_counts=true`. One column per genome; each row holds the
per-genome k-mer count for the corresponding MPHF slot. Appended column-by-column
during indexing and merge.

### presence/ (PersistentBitMatrix)

Present when the layer was built in presence/absence mode (merge path).
One bit per genome per MPHF slot. Written during merge; never present on a
freshly indexed single-genome layer.

## meta.json (PartitionMeta)

```json
{ "n_layers": 2 }
```

Records how many `layer_N/` directories exist under `index/`. Incremented by
each merge that adds a layer.