On-disk index layout
Directory tree
<index_root>/
index.meta ← JSON: IndexMeta
scatter.done ← sentinel: scatter phase complete
count.done ← sentinel: dereplicate + count complete
index.done ← sentinel: MPHF index fully built
spectrums/
<label>.json ← kmer frequency spectrum per genome
partitions/
part_00000/ ← one dir per partition (zero-padded 5 digits, 0..2^n_bits−1)
index/
meta.json ← PartitionMeta { n_layers }
layer_0/
unitigs.bin ← binary unitig sequences (2-bit packed)
unitigs.bin.idx ← block-sampled offset index (exact evidence only)
mphf.bin ← serialised PtrHash MPHF
layer_meta.json ← LayerMeta { evidence: EvidenceKind }
evidence.bin ← chunk_id:rank per MPHF slot (Exact only)
fingerprint.bin ← b-bit fingerprints per MPHF slot (Approx only)
counts/ ← PersistentCompactIntMatrix (if with_counts=true)
presence/ ← PersistentBitMatrix (if presence mode, merge)
layer_1/ ← added by merge; same structure as layer_0
layer_2/ …
part_00001/ …
State machine (sentinels)
The sentinels are touched atomically at the end of each pipeline stage. A partial run (e.g. scatter interrupted) leaves no sentinel; the state is detected as the lowest sentinel present.
| State | Sentinel present | Meaning |
|---|---|---|
Empty |
— | index.meta exists; scatter not started or interrupted |
Scattered |
scatter.done |
All super-kmers routed to partition files |
Counted |
count.done |
Partitions dereplicated; spectrums/ written |
Indexed |
index.done |
All MPHF layers built; index ready for queries |
index.meta (IndexMeta)
{
"version": 1,
"config": {
"kmer_size": 31,
"minimizer_size": 11,
"n_bits": 8,
"with_counts": false,
"evidence": "Exact",
"block_bits": 0
},
"genomes": [
{ "label": "genome_A", "meta": { "species": "Homo sapiens" } }
]
}
n_bits determines the partition count: 2^n_bits directories under partitions/.
evidence is either the string "Exact" or {"Approx": {"b": 8, "z": 1}}.
block_bits controls the .idx granularity: one offset entry every 2^block_bits
chunks. block_bits=0 stores one entry per chunk (O(1) random access, largest .idx).
GenomeInfo.meta is a free-form string→string map for categorical metadata (e.g.
taxonomy, sample origin). It is optional; defaults to empty.
Layer files
unitigs.bin
2-bit packed binary unitig sequences. Each record: 1 byte seql_minus_k
(nucleotide length − k), followed by ceil((seql_minus_k + k) / 4) bytes of
packed sequence. Long unitigs are transparently split into overlapping chunks
(k−1 nucleotide overlap) so no k-mer crosses a chunk boundary.
unitigs.bin.idx (Exact only)
Magic UIX3, little-endian header: block_bits (u32), n_unitigs (u32),
n_kmers (u64), then ceil(n_unitigs / 2^block_bits) + 1 byte-offset entries
(u32 each, last entry is a sentinel past-end offset). Absent for Approx layers.
mphf.bin
PtrHash MPHF serialised with epserde. Maps canonical kmer (u64, left-aligned
2-bit) to a slot index in [0, n_kmers).
layer_meta.json (LayerMeta)
{ "evidence": { "type": "exact" } }
{ "evidence": { "type": "approx", "b": 8, "z": 1 } }
evidence.bin (Exact)
One (chunk_id: u32, rank: u8) record per MPHF slot, packed. Used to verify
that the kmer mapped to a slot is actually present: unitigs.bin[chunk_id][rank]
is re-read and compared against the query.
fingerprint.bin (Approx)
b-bit fingerprint per MPHF slot derived from the kmer's sequence hash.
False-positive rate per query ≈ 1/2^b. With Findere parameter z ≥ 2,
z consecutive k-mers must all match, reducing the effective FP rate to
approximately W / 2^(b·z) per read of length L
(where W = L − k − z + 2).
counts/ (PersistentCompactIntMatrix)
Present when with_counts=true. One column per genome; each row holds the
per-genome k-mer count for the corresponding MPHF slot. Appended column-by-column
during indexing and merge.
presence/ (PersistentBitMatrix)
Present when the layer was built in presence/absence mode (merge path). One bit per genome per MPHF slot. Written during merge; never present on a freshly indexed single-genome layer.
meta.json (PartitionMeta)
{ "n_layers": 2 }
Records how many layer_N/ directories exist under index/. Incremented by
each merge that adds a layer.