docs: update architecture and storage specs for approximate index

Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
2026-05-23 13:24:25 +02:00
parent b7db3a33ed
commit da56c3e290
5 changed files with 814 additions and 882 deletions
@@ -1,5 +1,136 @@
-# On-disk collection structure
+# On-disk index layout

-See [obilayeredmap crate](obilayeredmap.md) for the current on-disk layout.
+## Directory tree

-The index root contains one `part_XXXXX/` directory per partition, each holding one or more `layer_N/` directories. Each layer directory contains `mphf.bin`, `unitigs.bin`, `unitigs.bin.idx`, `evidence.bin`, and optionally a `counts/` or `presence/` payload directory.
+```
+<index_root>/
+  index.meta                  ← JSON: IndexMeta
+  scatter.done                ← sentinel: scatter phase complete
+  count.done                  ← sentinel: dereplicate + count complete
+  index.done                  ← sentinel: MPHF index fully built
+  spectrums/
+    <label>.json              ← kmer frequency spectrum per genome
+  partitions/
+    part_00000/               ← one dir per partition (zero-padded 5 digits, 0..2^n_bits−1)
+      index/
+        meta.json             ← PartitionMeta { n_layers }
+        layer_0/
+          unitigs.bin         ← binary unitig sequences (2-bit packed)
+          unitigs.bin.idx     ← block-sampled offset index (exact evidence only)
+          mphf.bin            ← serialised PtrHash MPHF
+          layer_meta.json     ← LayerMeta { evidence: EvidenceKind }
+          evidence.bin        ← chunk_id:rank per MPHF slot (Exact only)
+          fingerprint.bin     ← b-bit fingerprints per MPHF slot (Approx only)
+          counts/             ← PersistentCompactIntMatrix (if with_counts=true)
+          presence/           ← PersistentBitMatrix (if presence mode, merge)
+        layer_1/              ← added by merge; same structure as layer_0
+        layer_2/ …
+    part_00001/ …
+```
+
+## State machine (sentinels)
+
+The sentinels are touched atomically at the end of each pipeline stage.
+A partial run (e.g. scatter interrupted) leaves no sentinel; the state is
+detected as the lowest sentinel present.
+
+| State | Sentinel present | Meaning |
+|---|---|---|
+| `Empty` | — | `index.meta` exists; scatter not started or interrupted |
+| `Scattered` | `scatter.done` | All super-kmers routed to partition files |
+| `Counted` | `count.done` | Partitions dereplicated; `spectrums/` written |
+| `Indexed` | `index.done` | All MPHF layers built; index ready for queries |
+
+## index.meta (IndexMeta)
+
+```json
+{
+  "version": 1,
+  "config": {
+    "kmer_size": 31,
+    "minimizer_size": 11,
+    "n_bits": 8,
+    "with_counts": false,
+    "evidence": "Exact",
+    "block_bits": 0
+  },
+  "genomes": [
+    { "label": "genome_A", "meta": { "species": "Homo sapiens" } }
+  ]
+}
+```
+
+`n_bits` determines the partition count: `2^n_bits` directories under `partitions/`.
+
+`evidence` is either the string `"Exact"` or `{"Approx": {"b": 8, "z": 1}}`.
+
+`block_bits` controls the `.idx` granularity: one offset entry every `2^block_bits`
+chunks. `block_bits=0` stores one entry per chunk (O(1) random access, largest `.idx`).
+
+`GenomeInfo.meta` is a free-form string→string map for categorical metadata (e.g.
+taxonomy, sample origin). It is optional; defaults to empty.
+
+## Layer files
+
+### unitigs.bin
+
+2-bit packed binary unitig sequences. Each record: 1 byte `seql_minus_k`
+(nucleotide length − k), followed by `ceil((seql_minus_k + k) / 4)` bytes of
+packed sequence. Long unitigs are transparently split into overlapping chunks
+(k−1 nucleotide overlap) so no k-mer crosses a chunk boundary.
+
+### unitigs.bin.idx (Exact only)
+
+Magic `UIX3`, little-endian header: `block_bits` (u32), `n_unitigs` (u32),
+`n_kmers` (u64), then `ceil(n_unitigs / 2^block_bits) + 1` byte-offset entries
+(u32 each, last entry is a sentinel past-end offset). Absent for Approx layers.
+
+### mphf.bin
+
+PtrHash MPHF serialised with epserde. Maps canonical kmer (u64, left-aligned
+2-bit) to a slot index in `[0, n_kmers)`.
+
+### layer_meta.json (LayerMeta)
+
+```json
+{ "evidence": { "type": "exact" } }
+```
+or
+```json
+{ "evidence": { "type": "approx", "b": 8, "z": 1 } }
+```
+
+### evidence.bin (Exact)
+
+One `(chunk_id: u32, rank: u8)` record per MPHF slot, packed. Used to verify
+that the kmer mapped to a slot is actually present: `unitigs.bin[chunk_id][rank]`
+is re-read and compared against the query.
+
+### fingerprint.bin (Approx)
+
+`b`-bit fingerprint per MPHF slot derived from the kmer's sequence hash.
+False-positive rate per query ≈ `1/2^b`. With Findere parameter `z ≥ 2`,
+`z` consecutive k-mers must all match, reducing the effective FP rate to
+approximately `W / 2^(b·z)` per read of length `L`
+(where `W = L − k − z + 2`).
+
+### counts/ (PersistentCompactIntMatrix)
+
+Present when `with_counts=true`. One column per genome; each row holds the
+per-genome k-mer count for the corresponding MPHF slot. Appended column-by-column
+during indexing and merge.
+
+### presence/ (PersistentBitMatrix)
+
+Present when the layer was built in presence/absence mode (merge path).
+One bit per genome per MPHF slot. Written during merge; never present on a
+freshly indexed single-genome layer.
+
+## meta.json (PartitionMeta)
+
+```json
+{ "n_layers": 2 }
+```
+
+Records how many `layer_N/` directories exist under `index/`. Incremented by
+each merge that adds a layer.