Files
Eric Coissac da56c3e290 docs: update architecture and storage specs for approximate index
Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
2026-05-23 13:54:31 +02:00

137 lines
4.6 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# On-disk index layout
## Directory tree
```
<index_root>/
index.meta ← JSON: IndexMeta
scatter.done ← sentinel: scatter phase complete
count.done ← sentinel: dereplicate + count complete
index.done ← sentinel: MPHF index fully built
spectrums/
<label>.json ← kmer frequency spectrum per genome
partitions/
part_00000/ ← one dir per partition (zero-padded 5 digits, 0..2^n_bits1)
index/
meta.json ← PartitionMeta { n_layers }
layer_0/
unitigs.bin ← binary unitig sequences (2-bit packed)
unitigs.bin.idx ← block-sampled offset index (exact evidence only)
mphf.bin ← serialised PtrHash MPHF
layer_meta.json ← LayerMeta { evidence: EvidenceKind }
evidence.bin ← chunk_id:rank per MPHF slot (Exact only)
fingerprint.bin ← b-bit fingerprints per MPHF slot (Approx only)
counts/ ← PersistentCompactIntMatrix (if with_counts=true)
presence/ ← PersistentBitMatrix (if presence mode, merge)
layer_1/ ← added by merge; same structure as layer_0
layer_2/ …
part_00001/ …
```
## State machine (sentinels)
The sentinels are touched atomically at the end of each pipeline stage.
A partial run (e.g. scatter interrupted) leaves no sentinel; the state is
detected as the lowest sentinel present.
| State | Sentinel present | Meaning |
|---|---|---|
| `Empty` | — | `index.meta` exists; scatter not started or interrupted |
| `Scattered` | `scatter.done` | All super-kmers routed to partition files |
| `Counted` | `count.done` | Partitions dereplicated; `spectrums/` written |
| `Indexed` | `index.done` | All MPHF layers built; index ready for queries |
## index.meta (IndexMeta)
```json
{
"version": 1,
"config": {
"kmer_size": 31,
"minimizer_size": 11,
"n_bits": 8,
"with_counts": false,
"evidence": "Exact",
"block_bits": 0
},
"genomes": [
{ "label": "genome_A", "meta": { "species": "Homo sapiens" } }
]
}
```
`n_bits` determines the partition count: `2^n_bits` directories under `partitions/`.
`evidence` is either the string `"Exact"` or `{"Approx": {"b": 8, "z": 1}}`.
`block_bits` controls the `.idx` granularity: one offset entry every `2^block_bits`
chunks. `block_bits=0` stores one entry per chunk (O(1) random access, largest `.idx`).
`GenomeInfo.meta` is a free-form string→string map for categorical metadata (e.g.
taxonomy, sample origin). It is optional; defaults to empty.
## Layer files
### unitigs.bin
2-bit packed binary unitig sequences. Each record: 1 byte `seql_minus_k`
(nucleotide length k), followed by `ceil((seql_minus_k + k) / 4)` bytes of
packed sequence. Long unitigs are transparently split into overlapping chunks
(k1 nucleotide overlap) so no k-mer crosses a chunk boundary.
### unitigs.bin.idx (Exact only)
Magic `UIX3`, little-endian header: `block_bits` (u32), `n_unitigs` (u32),
`n_kmers` (u64), then `ceil(n_unitigs / 2^block_bits) + 1` byte-offset entries
(u32 each, last entry is a sentinel past-end offset). Absent for Approx layers.
### mphf.bin
PtrHash MPHF serialised with epserde. Maps canonical kmer (u64, left-aligned
2-bit) to a slot index in `[0, n_kmers)`.
### layer_meta.json (LayerMeta)
```json
{ "evidence": { "type": "exact" } }
```
or
```json
{ "evidence": { "type": "approx", "b": 8, "z": 1 } }
```
### evidence.bin (Exact)
One `(chunk_id: u32, rank: u8)` record per MPHF slot, packed. Used to verify
that the kmer mapped to a slot is actually present: `unitigs.bin[chunk_id][rank]`
is re-read and compared against the query.
### fingerprint.bin (Approx)
`b`-bit fingerprint per MPHF slot derived from the kmer's sequence hash.
False-positive rate per query ≈ `1/2^b`. With Findere parameter `z ≥ 2`,
`z` consecutive k-mers must all match, reducing the effective FP rate to
approximately `W / 2^(b·z)` per read of length `L`
(where `W = L k z + 2`).
### counts/ (PersistentCompactIntMatrix)
Present when `with_counts=true`. One column per genome; each row holds the
per-genome k-mer count for the corresponding MPHF slot. Appended column-by-column
during indexing and merge.
### presence/ (PersistentBitMatrix)
Present when the layer was built in presence/absence mode (merge path).
One bit per genome per MPHF slot. Written during merge; never present on a
freshly indexed single-genome layer.
## meta.json (PartitionMeta)
```json
{ "n_layers": 2 }
```
Records how many `layer_N/` directories exist under `index/`. Incremented by
each merge that adds a layer.