obikmer/docmd/implementation/storage.md

# On-disk index layout

## Directory tree

```
<index_root>/
  index.meta                  ← JSON: IndexMeta
  scatter.done                ← sentinel: scatter phase complete
  count.done                  ← sentinel: dereplicate + count complete
  index.done                  ← sentinel: MPHF index fully built
  spectrums/
    <label>.json              ← kmer frequency spectrum per genome
  partitions/
    part_00000/               ← one dir per partition (zero-padded 5 digits, 0..2^n_bits−1)
      index/
        meta.json             ← PartitionMeta { n_layers }
        layer_0/
          unitigs.bin         ← binary unitig sequences (2-bit packed)
          unitigs.bin.idx     ← block-sampled offset index (exact evidence only)
          mphf.bin            ← serialised PtrHash MPHF
          layer_meta.json     ← LayerMeta { evidence: EvidenceKind }
          evidence.bin        ← chunk_id:rank per MPHF slot (Exact only)
          fingerprint.bin     ← b-bit fingerprints per MPHF slot (Approx only)
          counts/             ← PersistentCompactIntMatrix (if with_counts=true)
          presence/           ← PersistentBitMatrix (if presence mode, merge)
        layer_1/              ← added by merge; same structure as layer_0
        layer_2/ …
    part_00001/ …
```

## State machine (sentinels)

The sentinels are touched atomically at the end of each pipeline stage.
A partial run (e.g. scatter interrupted) leaves no sentinel; the state is
detected as the lowest sentinel present.

| State | Sentinel present | Meaning |
|---|---|---|
| `Empty` | — | `index.meta` exists; scatter not started or interrupted |
| `Scattered` | `scatter.done` | All super-kmers routed to partition files |
| `Counted` | `count.done` | Partitions dereplicated; `spectrums/` written |
| `Indexed` | `index.done` | All MPHF layers built; index ready for queries |

## index.meta (IndexMeta)

```json
{
  "version": 1,
  "config": {
    "kmer_size": 31,
    "minimizer_size": 11,
    "n_bits": 8,
    "with_counts": false,
    "evidence": "Exact",
    "block_bits": 0
  },
  "genomes": [
    { "label": "genome_A", "meta": { "species": "Homo sapiens" } }
  ]
}
```

`n_bits` determines the partition count: `2^n_bits` directories under `partitions/`.

`evidence` is either the string `"Exact"` or `{"Approx": {"b": 8, "z": 1}}`.

`block_bits` controls the `.idx` granularity: one offset entry every `2^block_bits`
chunks. `block_bits=0` stores one entry per chunk (O(1) random access, largest `.idx`).

`GenomeInfo.meta` is a free-form string→string map for categorical metadata (e.g.
taxonomy, sample origin). It is optional; defaults to empty.

## Layer files

### unitigs.bin

2-bit packed binary unitig sequences. Each record: 1 byte `seql_minus_k`
(nucleotide length − k), followed by `ceil((seql_minus_k + k) / 4)` bytes of
packed sequence. Long unitigs are transparently split into overlapping chunks
(k−1 nucleotide overlap) so no k-mer crosses a chunk boundary.

### unitigs.bin.idx (Exact only)

Magic `UIX3`, little-endian header: `block_bits` (u32), `n_unitigs` (u32),
`n_kmers` (u64), then `ceil(n_unitigs / 2^block_bits) + 1` byte-offset entries
(u32 each, last entry is a sentinel past-end offset). Absent for Approx layers.

### mphf.bin

PtrHash MPHF serialised with epserde. Maps canonical kmer (u64, left-aligned
2-bit) to a slot index in `[0, n_kmers)`.

### layer_meta.json (LayerMeta)

```json
{ "evidence": { "type": "exact" } }
```
or
```json
{ "evidence": { "type": "approx", "b": 8, "z": 1 } }
```

### evidence.bin (Exact)

One `(chunk_id: u32, rank: u8)` record per MPHF slot, packed. Used to verify
that the kmer mapped to a slot is actually present: `unitigs.bin[chunk_id][rank]`
is re-read and compared against the query.

### fingerprint.bin (Approx)

`b`-bit fingerprint per MPHF slot derived from the kmer's sequence hash.
False-positive rate per query ≈ `1/2^b`. With Findere parameter `z ≥ 2`,
`z` consecutive k-mers must all match, reducing the effective FP rate to
approximately `W / 2^(b·z)` per read of length `L`
(where `W = L − k − z + 2`).

### counts/ (PersistentCompactIntMatrix)

Present when `with_counts=true`. One column per genome; each row holds the
per-genome k-mer count for the corresponding MPHF slot. Appended column-by-column
during indexing and merge.

### presence/ (PersistentBitMatrix)

Present when the layer was built in presence/absence mode (merge path).
One bit per genome per MPHF slot. Written during merge; never present on a
freshly indexed single-genome layer.

## meta.json (PartitionMeta)

```json
{ "n_layers": 2 }
```

Records how many `layer_N/` directories exist under `index/`. Incremented by
each merge that adds a layer.