Files
Eric Coissac da56c3e290 docs: update architecture and storage specs for approximate index
Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
2026-05-23 13:54:31 +02:00

4.6 KiB
Raw Permalink Blame History

On-disk index layout

Directory tree

<index_root>/
  index.meta                  ← JSON: IndexMeta
  scatter.done                ← sentinel: scatter phase complete
  count.done                  ← sentinel: dereplicate + count complete
  index.done                  ← sentinel: MPHF index fully built
  spectrums/
    <label>.json              ← kmer frequency spectrum per genome
  partitions/
    part_00000/               ← one dir per partition (zero-padded 5 digits, 0..2^n_bits1)
      index/
        meta.json             ← PartitionMeta { n_layers }
        layer_0/
          unitigs.bin         ← binary unitig sequences (2-bit packed)
          unitigs.bin.idx     ← block-sampled offset index (exact evidence only)
          mphf.bin            ← serialised PtrHash MPHF
          layer_meta.json     ← LayerMeta { evidence: EvidenceKind }
          evidence.bin        ← chunk_id:rank per MPHF slot (Exact only)
          fingerprint.bin     ← b-bit fingerprints per MPHF slot (Approx only)
          counts/             ← PersistentCompactIntMatrix (if with_counts=true)
          presence/           ← PersistentBitMatrix (if presence mode, merge)
        layer_1/              ← added by merge; same structure as layer_0
        layer_2/ …
    part_00001/ …

State machine (sentinels)

The sentinels are touched atomically at the end of each pipeline stage. A partial run (e.g. scatter interrupted) leaves no sentinel; the state is detected as the lowest sentinel present.

State Sentinel present Meaning
Empty index.meta exists; scatter not started or interrupted
Scattered scatter.done All super-kmers routed to partition files
Counted count.done Partitions dereplicated; spectrums/ written
Indexed index.done All MPHF layers built; index ready for queries

index.meta (IndexMeta)

{
  "version": 1,
  "config": {
    "kmer_size": 31,
    "minimizer_size": 11,
    "n_bits": 8,
    "with_counts": false,
    "evidence": "Exact",
    "block_bits": 0
  },
  "genomes": [
    { "label": "genome_A", "meta": { "species": "Homo sapiens" } }
  ]
}

n_bits determines the partition count: 2^n_bits directories under partitions/.

evidence is either the string "Exact" or {"Approx": {"b": 8, "z": 1}}.

block_bits controls the .idx granularity: one offset entry every 2^block_bits chunks. block_bits=0 stores one entry per chunk (O(1) random access, largest .idx).

GenomeInfo.meta is a free-form string→string map for categorical metadata (e.g. taxonomy, sample origin). It is optional; defaults to empty.

Layer files

unitigs.bin

2-bit packed binary unitig sequences. Each record: 1 byte seql_minus_k (nucleotide length k), followed by ceil((seql_minus_k + k) / 4) bytes of packed sequence. Long unitigs are transparently split into overlapping chunks (k1 nucleotide overlap) so no k-mer crosses a chunk boundary.

unitigs.bin.idx (Exact only)

Magic UIX3, little-endian header: block_bits (u32), n_unitigs (u32), n_kmers (u64), then ceil(n_unitigs / 2^block_bits) + 1 byte-offset entries (u32 each, last entry is a sentinel past-end offset). Absent for Approx layers.

mphf.bin

PtrHash MPHF serialised with epserde. Maps canonical kmer (u64, left-aligned 2-bit) to a slot index in [0, n_kmers).

layer_meta.json (LayerMeta)

{ "evidence": { "type": "exact" } }

or

{ "evidence": { "type": "approx", "b": 8, "z": 1 } }

evidence.bin (Exact)

One (chunk_id: u32, rank: u8) record per MPHF slot, packed. Used to verify that the kmer mapped to a slot is actually present: unitigs.bin[chunk_id][rank] is re-read and compared against the query.

fingerprint.bin (Approx)

b-bit fingerprint per MPHF slot derived from the kmer's sequence hash. False-positive rate per query ≈ 1/2^b. With Findere parameter z ≥ 2, z consecutive k-mers must all match, reducing the effective FP rate to approximately W / 2^(b·z) per read of length L (where W = L k z + 2).

counts/ (PersistentCompactIntMatrix)

Present when with_counts=true. One column per genome; each row holds the per-genome k-mer count for the corresponding MPHF slot. Appended column-by-column during indexing and merge.

presence/ (PersistentBitMatrix)

Present when the layer was built in presence/absence mode (merge path). One bit per genome per MPHF slot. Written during merge; never present on a freshly indexed single-genome layer.

meta.json (PartitionMeta)

{ "n_layers": 2 }

Records how many layer_N/ directories exist under index/. Incremented by each merge that adds a layer.