# obilayeredmap — layered kmer index crate

## Purpose

`obilayeredmap` implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.

---

## Three usage modes

The MPHF + evidence infrastructure is the same for all modes. The **payload** varies.

| Mode | Description | Payload type | Storage |
|---|---|---|---|
| 1. Set | membership test only | `()` | — |
| 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory |
| 3. Presence/absence | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |

Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate.

---

## Evidence kinds

Each layer carries one of two evidence bundles, recorded in `layer_meta.json` at build time:

```rust
pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}
```

`EvidenceKind` is stored in `LayerMeta` (one per layer directory). `open()` reads it to decide which evidence files to load.

- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives. Requires random-access `.idx` at query time.
- **Approx**: writes `fingerprint.bin` only. False-positive rate per kmer query = 1/2^b. `z` is the Findere consecutive-kmer parameter: `z` consecutive kmers must all match, reducing the effective FP rate per read to approximately W / 2^(b·z) where W = L − k − z + 2 is the number of windows in a read of length L. No `.idx` written or required.
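
The two rates above can be evaluated numerically. The following is a minimal sketch of that arithmetic only; the function names are illustrative and are not part of the crate.

```rust
/// Per-kmer false-positive rate of a b-bit fingerprint: 1 / 2^b.
fn kmer_fp_rate(b: u8) -> f64 {
    2f64.powi(-i32::from(b))
}

/// Approximate per-read false-positive rate W / 2^(b*z),
/// with W = L - k - z + 2 windows in a read of length L.
/// Returns None when the read cannot hold z consecutive kmers.
fn approx_read_fp_rate(read_len: u32, k: u32, b: u8, z: u8) -> Option<f64> {
    if z == 0 {
        return None;
    }
    // Shortest read that contains z consecutive kmers of length k.
    let needed = k + u32::from(z) - 1;
    if read_len < needed {
        return None;
    }
    // read_len - needed + 1 == L - k - z + 2
    let windows = f64::from(read_len - needed + 1);
    let exponent = i32::from(b) * i32::from(z);
    Some(windows * 2f64.powi(-exponent))
}
```

For example, with L = 150, k = 31, b = 8 and z = 3 there are W = 118 windows, giving a per-read rate of 118 / 2^24, roughly 7.0e-6, against a per-kmer rate of 1/256.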

---

## MphfLayer — autonomous kmer → slot mapping

`MphfLayer` encapsulates the MPHF and evidence store for one layer. It is independent of any payload.

```rust
pub struct MphfLayer {
    mphf: Mphf,
    ev:   LayerEvidence,   // loaded at open() time
    n:    usize,
}
```

`LayerEvidence` is an internal enum, not public:

```rust
enum LayerEvidence {
    Exact  { evidence: Evidence, unitigs: UnitigFileReader },
    Approx { fingerprint: FingerprintVec },
}
```

### Query API

Three public query methods, all returning `Option<usize>` (slot index):

```rust
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_exact(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_approx(&self, kmer: CanonicalKmer) -> Option<usize>
```

- `find` dispatches transparently to `find_exact` or `find_approx` based on the evidence variant loaded at `open()`.
- `find_exact` panics if the layer holds approximate evidence; zero false positives.
- `find_approx` panics if the layer holds exact evidence; FP rate 1/2^b per kmer.

For exact-evidence layers, `open()` requires `unitigs.bin.idx` (random access into unitigs); approximate layers have no `.idx`. `open_sequential()` on `UnitigFileReader` does not require the `.idx` and is used during build passes.

### Build surface

```rust
// Full MPHF + exact evidence build (two-pass, parallel)
pub(crate) fn build(dir, block_bits, fill_slot) -> OLMResult<usize>

// Evidence-only builds (MPHF already present in dir)
pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
pub fn build_approx_evidence(dir, b, z)      -> OLMResult<usize>
pub fn build_evidence(dir, kind, block_bits) -> OLMResult<usize>  // dispatch
```

`MphfLayer::build` makes two successive passes over `unitigs.bin`:

1. **Pass 1** (parallel via rayon): iterate all canonical kmers, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
2. **Pass 2** (sequential): iterate again, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.
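
The injectivity check in pass 2 can be pictured as follows. This is a sketch under assumptions, not the crate's code: `SeenBitset` and `check_injective` are hypothetical names, and the slots are supplied as a plain iterator rather than computed by the MPHF.

```rust
/// One bit per MPHF slot, i.e. ceil(n / 8) bytes.
struct SeenBitset {
    bits: Vec<u8>,
}

impl SeenBitset {
    fn new(n: usize) -> Self {
        Self { bits: vec![0u8; (n + 7) / 8] }
    }

    /// Marks `slot`; returns false if it was already marked.
    fn mark(&mut self, slot: usize) -> bool {
        let byte = slot / 8;
        let mask = 1u8 << (slot % 8);
        let fresh = (self.bits[byte] & mask) == 0;
        self.bits[byte] |= mask;
        fresh
    }
}

/// Returns Err(slot) for the first slot that is out of range or hit twice.
fn check_injective(n: usize, slots: impl IntoIterator<Item = usize>) -> Result<(), usize> {
    let mut seen = SeenBitset::new(n);
    for slot in slots {
        if slot >= n || !seen.mark(slot) {
            return Err(slot);
        }
    }
    Ok(())
}
```

Because every kmer is visited once in pass 2 anyway, the check costs one bit test per kmer and no extra pass.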

`build` always produces exact evidence. For approximate evidence, use `build_approx_evidence` after MPHF construction.

For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files.

---

## Layer\<D: LayerData\> — MPHF + payload

`Layer<D>` pairs an `MphfLayer` with one payload store.

```rust
pub trait LayerData: Sized {
    type Item;
    fn open(layer_dir: &Path) -> OLMResult<Self>;
    fn read(&self, slot: usize) -> Self::Item;
}

pub struct Layer<D: LayerData = ()> {
    mphf: MphfLayer,
    data: D,
}

pub struct Hit<T = ()> {
    pub slot: usize,
    pub data: T,
}
```

`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not part of the trait.

| Type | `Item` | Description |
|---|---|---|
| `()` | `()` | mode 1 — membership only |
| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) |
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) |

### Build signatures

```rust
// mode 1
impl Layer<()> {
    pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult<usize>
}

// mode 2
impl Layer<PersistentCompactIntMatrix> {
    pub fn build(out_dir: &Path, block_bits: u8,
                 count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn build_from_map(out_dir: &Path, block_bits: u8,
                          counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}

// mode 3
impl Layer<PersistentBitMatrix> {
    pub fn build_presence(out_dir: &Path, block_bits: u8,
                          n_genomes: usize,
                          present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
}
```

All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Modes 2 and 3 pre-read `n_kmers` from `unitigs.bin` via `UnitigFileReader::open_sequential` to size the matrix builder before calling `MphfLayer::build`.

### Evidence build helpers on Layer

```rust
impl<D: LayerData> Layer<D> {
    pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
    pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8)  -> OLMResult<usize>
    pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize>
}
```

These delegate directly to the corresponding `MphfLayer` methods and are provided so call sites can remain typed at the `Layer<D>` level.

---

## FingerprintVec and FingerprintVecWriter

Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot.

```
fingerprint.bin format:
  magic:   b"FPVF"  (4 bytes)
  b:       u8       (bits per fingerprint, 1..=64)
  padding: [0u8; 3]
  n:       u64 LE   (number of slots)
  data:    packed bits, ceil(n*b/8) bytes, Lsb0 order
```
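
As an illustration of the layout above, the header can be decoded from the first 16 bytes of the file (4 + 1 + 3 + 8). This is a sketch that assumes the header contains exactly the fields listed; `FingerprintHeader`, `parse_header` and `data_len` are hypothetical names, and errors are plain strings rather than the crate's `OLMResult`.

```rust
/// Decoded fixed-size header of a fingerprint file.
#[derive(Debug, PartialEq)]
struct FingerprintHeader {
    b: u8,  // bits per fingerprint
    n: u64, // number of slots
}

const HEADER_LEN: usize = 16; // magic (4) + b (1) + padding (3) + n (8)

fn parse_header(bytes: &[u8]) -> Result<FingerprintHeader, String> {
    if bytes.len() < HEADER_LEN {
        return Err("truncated header".to_string());
    }
    if &bytes[0..4] != b"FPVF" {
        return Err("bad magic".to_string());
    }
    let b = bytes[4];
    if !(1..=64).contains(&b) {
        return Err(format!("invalid b = {}", b));
    }
    // n is stored little-endian at offset 8.
    let mut n_bytes = [0u8; 8];
    n_bytes.copy_from_slice(&bytes[8..16]);
    Ok(FingerprintHeader { b, n: u64::from_le_bytes(n_bytes) })
}

/// Size of the packed payload that follows the header: ceil(n * b / 8) bytes.
fn data_len(h: &FingerprintHeader) -> u64 {
    (h.n * u64::from(h.b) + 7) / 8
}
```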

```rust
impl FingerprintVec {
    pub fn open(path: &Path) -> OLMResult<Self>
    pub fn get(&self, slot: usize) -> u64
    pub fn matches(&self, slot: usize, fingerprint: u64) -> bool
    pub fn n(&self) -> usize
    pub fn b(&self) -> u8
}
```

`matches(slot, hash)` extracts the b-bit fingerprint stored at `slot` and compares it to the low b bits of `hash`. It is the core operation of `find_approx`.
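
A minimal in-memory sketch of the packed lookup is shown below. It assumes that bit `i` of a fingerprint is stored at bit position `slot * b + i` of the payload, counted from the least significant bit of each byte; the real `FingerprintVec` reads a memory-mapped file and may extract bits differently. `PackedFingerprints` is a hypothetical name.

```rust
/// `n` fingerprints of `b` bits each, packed back to back in Lsb0 order.
struct PackedFingerprints {
    b: u8,
    n: usize,
    data: Vec<u8>,
}

impl PackedFingerprints {
    fn new(b: u8, n: usize) -> Self {
        assert!((1..=64).contains(&b));
        Self { b, n, data: vec![0u8; (n * b as usize + 7) / 8] }
    }

    /// Mask selecting the low b bits of a hash.
    fn mask(&self) -> u64 {
        if self.b == 64 { u64::MAX } else { (1u64 << self.b) - 1 }
    }

    /// Stores the low b bits of `hash` at `slot`.
    fn set(&mut self, slot: usize, hash: u64) {
        assert!(slot < self.n);
        let value = hash & self.mask();
        let start = slot * self.b as usize;
        for i in 0..self.b as usize {
            let bit = start + i;
            let m = 1u8 << (bit % 8);
            if ((value >> i) & 1) == 1 {
                self.data[bit / 8] |= m;
            } else {
                self.data[bit / 8] &= !m;
            }
        }
    }

    /// Reads back the b-bit fingerprint stored at `slot`.
    fn get(&self, slot: usize) -> u64 {
        assert!(slot < self.n);
        let start = slot * self.b as usize;
        let mut value = 0u64;
        for i in 0..self.b as usize {
            let bit = start + i;
            let set = (self.data[bit / 8] >> (bit % 8)) & 1;
            value |= u64::from(set) << i;
        }
        value
    }

    /// True when the stored fingerprint equals the low b bits of `hash`.
    fn matches(&self, slot: usize, hash: u64) -> bool {
        self.get(slot) == (hash & self.mask())
    }
}
```

The bit-by-bit loop keeps the sketch short; a production reader would more likely load one or two machine words and shift.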

---

## LayeredMap\<D\> — collection of layers

`LayeredMap<D>` wraps `Vec<Layer<D>>` for a single partition directory.

```rust
pub struct LayeredMap<D: LayerData = ()> {
    root:   PathBuf,
    meta:   PartitionMeta,
    layers: Vec<Layer<D>>,
}
```

`PartitionMeta` (`meta.json` at the partition root) stores `n_layers`.

### Common methods

```rust
pub fn open(root: &Path)   -> OLMResult<Self>
pub fn create(root: &Path) -> OLMResult<Self>
pub fn n_layers(&self)     -> usize
pub fn layer(&self, i: usize) -> &Layer<D>
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>
```

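`query` probes layers in order and returns `(layer_index, Hit)` on the first match. A kmer stored in layer `i` therefore costs `i + 1` probes: kmers in layer 0 resolve in a single probe, while a kmer absent from the index probes every layer.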
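
The probing order can be sketched with plain hash maps standing in for layers. `ToyLayer` and `query_layers` are illustrative stand-ins, kmers are reduced to `u64`, and the payload is omitted.

```rust
use std::collections::HashMap;

/// Stand-in for one layer: canonical kmer (as u64) -> slot index.
type ToyLayer = HashMap<u64, usize>;

/// Probes layers in order and returns (layer_index, slot) for the first hit.
fn query_layers(layers: &[ToyLayer], kmer: u64) -> Option<(usize, usize)> {
    layers
        .iter()
        .enumerate()
        .find_map(|(i, layer)| layer.get(&kmer).map(|&slot| (i, slot)))
}
```

Since layers cover disjoint kmer sets, at most one layer can match, so stopping at the first hit loses nothing.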

### push_layer

`push_layer` builds the next layer from a `unitigs.bin` already written via `next_layer_writer`, using `DEFAULT_BLOCK_BITS`:

```rust
// mode 1
impl LayeredMap<()> {
    pub fn push_layer(&mut self) -> OLMResult<usize>
}

// mode 2
impl LayeredMap<PersistentCompactIntMatrix> {
    pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn push_layer_from_map(&mut self, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}
```

Mode 3 (`PersistentBitMatrix`) has no `push_layer` on `LayeredMap`; callers build directly via `Layer<PersistentBitMatrix>::build_presence`.

---

## LayeredStore\<S\> and aggregation traits

`LayeredStore<S>` is a generic aggregation wrapper over `Vec<S>`. It propagates three traits from `obicompactvec::traits` up the hierarchy via blanket impls:

```rust
pub struct LayeredStore<S>(pub Vec<S>);

impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … }  // Σ col_weights across inner stores
impl<S: CountPartials> CountPartials  for LayeredStore<S> { … }  // element-wise Σ partials
impl<S: BitPartials>   BitPartials    for LayeredStore<S> { … }  // element-wise Σ partials
```

Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type.
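
The composition can be demonstrated with a simplified trait. In this sketch the trait signature (`col_weights` returning `Vec<u64>`) and the `Leaf` type are assumptions made for illustration; the real definitions live in `obicompactvec::traits`.

```rust
/// Hypothetical, simplified stand-in for `obicompactvec::traits::ColumnWeights`.
trait ColumnWeights {
    fn col_weights(&self) -> Vec<u64>;
}

struct LayeredStore<S>(pub Vec<S>);

/// Sums the per-column weights of all inner stores.
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {
    fn col_weights(&self) -> Vec<u64> {
        let mut total: Vec<u64> = Vec::new();
        for store in &self.0 {
            let w = store.col_weights();
            if total.len() < w.len() {
                total.resize(w.len(), 0);
            }
            for (t, x) in total.iter_mut().zip(w) {
                *t += x;
            }
        }
        total
    }
}

/// Toy leaf store whose column weights are given directly.
struct Leaf(Vec<u64>);

impl ColumnWeights for Leaf {
    fn col_weights(&self) -> Vec<u64> {
        self.0.clone()
    }
}
```

With these definitions, `LayeredStore<LayeredStore<Leaf>>` satisfies `ColumnWeights` with no additional code, which is the nesting used for the partitioned level.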

**Leaf implementors** (in `obicompactvec`):

| Type | Traits |
|---|---|
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |

See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.

---

## On-disk structure

```
partition_root/                    ← LayeredMap (one partition)
  meta.json                        — {"n_layers": N}
  layer_0/                         ← Layer
    layer_meta.json                — {"type": "exact"} or {"type": "approx", "b": B, "z": Z}
    mphf.bin                       — ptr_hash MPHF (epserde format)
    unitigs.bin                    — packed 2-bit nucleotide sequences
    unitigs.bin.idx                — UIDX index (exact evidence only)
    evidence.bin                   — [u32; n], LE  (exact evidence only)
    fingerprint.bin                — packed b-bit array  (approx evidence only)
    counts/                        [mode 2] PersistentCompactIntMatrix
      meta.json
      col_000000.pciv
    presence/                      [mode 3] PersistentBitMatrix
      meta.json
      col_000000.pbiv  …
  layer_1/
    …
```

`unitigs.bin.idx` is required by `open()` for exact-evidence layers (random access); approximate layers neither write nor read it. `open_sequential()` on `UnitigFileReader` omits it and is used during build passes and approx-evidence construction.

---

## Evidence encoding (exact)

`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:

```
bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0]  = rank     (7 bits)  — kmer index within the chunk (0-based)
```

`chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id` (requires `unitigs.bin.idx` for random access).
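
The packing is simple enough to show as a round trip. The helper names below are illustrative, not the crate's API; only the bit layout is taken from the description above.

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F
const MAX_CHUNK_ID: u32 = (1 << 25) - 1;     // 25-bit chunk index

/// Packs (chunk_id, rank) into one u32; None if either field overflows.
fn encode_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Splits a raw evidence word back into (chunk_id, rank).
fn decode_evidence(raw: u32) -> (u32, u32) {
    (raw >> RANK_BITS, raw & RANK_MASK)
}
```

For instance, chunk 3 at rank 5 encodes to `3 * 128 + 5 = 389`, and decoding 389 returns `(3, 5)`.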

For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity.

---

## ptr_hash configuration

```rust
type Mphf = PtrHash<
    u64,                              // key type: canonical kmer raw encoding
    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
    CachelineEfVec<Vec<CachelineEf>>, // remap: Elias-Fano
    Xx64,                             // hasher: XXH3-64 with seed
    Vec<u8>,                          // pilots
>;
```

`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
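
The zero counts follow from `64 − 2k`, and the weakness of a single multiply can be checked directly: multiplying by an odd constant preserves the number of trailing zero bits. The sketch below assumes a 2-bit-per-base kmer shifted to the top of a `u64`; the function names and the multiplier are arbitrary choices for illustration.

```rust
/// Number of always-zero low bits in a left-aligned 2-bit kmer encoding.
fn structural_zero_bits(k: u32) -> u32 {
    assert!(k >= 1 && k <= 32);
    64 - 2 * k
}

/// Shifts a right-aligned packed kmer (2 bits per base) to the top of the word.
fn left_align(packed: u64, k: u32) -> u64 {
    packed << structural_zero_bits(k)
}

/// A single-multiply hash with an arbitrary odd constant.
fn multiply_hash(x: u64) -> u64 {
    x.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}
```

For k = 11, every hash produced by `multiply_hash` has its 42 low bits equal to zero, so any scheme that derives bucket indices from low bits would see almost no variation.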

`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5): 2× slower construction than `Linear/λ=3.0`, ~20% less space.

---

## Column append and merge support

These methods extend existing layers with new genome columns without touching the MPHF.

### Layer-level genome column append

```rust
impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}

impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}
```

Both delegate to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They write a new column file (`col_NNNNNN.pbiv` / `col_NNNNNN.pciv`) and update `meta.json` to increment `n_cols`. `value_of` is called once per slot (0..n).

### Presence matrix initialisation

```rust
impl Layer<()> {
    pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
}
```

Called on the first merge of a Presence-mode index. Creates `presence/` with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.

### Why the MPHF is never rebuilt

The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update.

---

## Add-layer algorithm

When adding dataset B to an existing index:

1. For each partition, probe existing layers for kmers of B routed to that partition.
2. Collect kmers absent from all layers → `B \ index`.
3. Write `B \ index` to a new `unitigs.bin` via `next_layer_writer()`.
4. Call `Layer<D>::build` (or `build_presence`) on the new layer directory.
5. Call `push_layer` (or `append_layer`) to register the new layer in `meta.json`.

Each partition's new layer is built independently; the operation is fully parallel across partitions.
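
Steps 1 and 2 amount to a set difference against the existing layers. The following sketch uses `HashSet<u64>` as a stand-in for a layer and a hypothetical helper name; it does not touch the unitig writer or the MPHF build.

```rust
use std::collections::HashSet;

/// Returns the kmers of `dataset` that are absent from every existing layer
/// (B \ index), deduplicated and in first-seen order.
fn new_layer_kmers(layers: &[HashSet<u64>], dataset: &[u64]) -> Vec<u64> {
    let mut emitted = HashSet::new();
    dataset
        .iter()
        .copied()
        // Step 1: probe existing layers.
        .filter(|k| !layers.iter().any(|layer| layer.contains(k)))
        // Step 2: keep each absent kmer once.
        .filter(|k| emitted.insert(*k))
        .collect()
}
```

The returned kmers are what would be written to the new `unitigs.bin` in step 3; because they are absent from all earlier layers, the disjointness of layers is preserved.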

---

## Dependencies

| crate | role |
|---|---|
| `ptr_hash 1.1` | MPHF per layer |
| `cacheline-ef 1.1` | compact remap inside ptr_hash |
| `epserde 0.8` | zero-copy MPHF serialisation |
| `memmap2 0.9` | mmap of evidence and fingerprint files |
| `bitvec` | packed b-bit fingerprint storage |
| `obiskio` | unitig file writer/reader + `.idx` build |
| `obicompactvec` | payload types + aggregation traits |
| `rayon 1` | parallel MPHF construction pass |
| `serde / serde_json` | `LayerMeta` + `PartitionMeta` serialisation |
-											feat: implement persistent layered index and chunked binary format
										
										
											2026-05-09 17:20:08 +08:00
+								# obilayeredmap — layered kmer index crate
 								## Purpose
-											docs: update architecture and storage specs for approximate index
										
										
											2026-05-23 13:24:25 +02:00
+								`obilayeredmap` implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
-											feat: implement persistent layered index and chunked binary format
										
										
											2026-05-09 17:20:08 +08:00
 								---
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
+								## Three usage modes
-											Add persistent compact integer vector and cache-line-optimized MPHF
										
										
											2026-05-13 06:24:43 +08:00
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
+								The MPHF + evidence infrastructure is the same for all modes. The **payload** varies.
-											Add persistent compact integer vector and cache-line-optimized MPHF
										
										
											2026-05-13 06:24:43 +08:00
-											feat: introduce column-major matrix storage and migrate layered map
										
										
											2026-05-14 09:31:11 +08:00
+								| Mode | Description | Payload type | Storage |
-											refactor(obilayeredmap): support generic payload types
										
										
											2026-05-14 09:24:25 +08:00
+								|---|---|---|---|
 								| 1. Set | membership test only | `()` | — |
-											feat: introduce column-major matrix storage and migrate layered map
										
										
											2026-05-14 09:31:11 +08:00
+								| 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory |
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
+								| 3. Presence/absence | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |
-											Add persistent compact integer vector and cache-line-optimized MPHF
										
										
											2026-05-13 06:24:43 +08:00
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
+								Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate.
-											Add persistent compact integer vector and cache-line-optimized MPHF
										
										
											2026-05-13 06:24:43 +08:00
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
+								---
-											docs: update architecture and storage specs for approximate index
										
										
											2026-05-23 13:24:25 +02:00
+								## Evidence kinds
 								Each layer carries one of two evidence bundles, recorded in `layer_meta.json` at build time:
 								```rust
 								pub enum EvidenceKind {
 								    Exact,
 								    Approx { b: u8, z: u8 },
 								}
 								```
 								`EvidenceKind` is stored in `LayerMeta` (one per layer directory). `open()` reads it to decide which evidence files to load.
 								- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives. Requires random-access `.idx` at query time.
 								- **Approx**: writes `fingerprint.bin` only. False-positive rate per kmer query = 1/2^b. `z` is the Findere consecutive-kmer parameter: `z` consecutive kmers must all match, reducing the effective FP rate per read to approximately W / 2^(b·z) where W = L − k − z + 2 is the number of windows in a read of length L. No `.idx` written or required.
 								---
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
+								## MphfLayer — autonomous kmer → slot mapping
-											docs: update architecture and storage specs for approximate index
										
										
											2026-05-23 13:24:25 +02:00
+								`MphfLayer` encapsulates the MPHF and evidence store for one layer. It is independent of any payload.
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
 								```rust
 								pub struct MphfLayer {
-											docs: update architecture and storage specs for approximate index
										
										
											2026-05-23 13:24:25 +02:00
+								    mphf: Mphf,
 								    ev:   LayerEvidence,   // loaded at open() time
 								    n:    usize,
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
+								}
 								```
-											docs: update architecture and storage specs for approximate index
										
										
											2026-05-23 13:24:25 +02:00
+								`LayerEvidence` is an internal enum, not public:
-											Add persistent compact integer vector and cache-line-optimized MPHF
										
										
											2026-05-13 06:24:43 +08:00
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
+								```rust
-											docs: update architecture and storage specs for approximate index
										
										
											2026-05-23 13:24:25 +02:00
+								enum LayerEvidence {
 								    Exact  { evidence: Evidence, unitigs: UnitigFileReader },
 								    Approx { fingerprint: FingerprintVec },
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
+								}
 								```
-											Add persistent compact integer vector and cache-line-optimized MPHF
										
										
											2026-05-13 06:24:43 +08:00
-											docs: update architecture and storage specs for approximate index
										
										
											2026-05-23 13:24:25 +02:00
+								### Query API
 								Three public query methods, all returning `Option<usize>` (slot index):
 								```rust
 								pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
 								pub fn find_exact(&self, kmer: CanonicalKmer) -> Option<usize>
 								pub fn find_approx(&self, kmer: CanonicalKmer) -> Option<usize>
 								```
 								- `find` dispatches transparently to `find_exact` or `find_approx` based on the evidence variant loaded at `open()`.
 								- `find_exact` panics if the layer holds approximate evidence; zero false positives.
 								- `find_approx` panics if the layer holds exact evidence; FP rate 1/2^b per kmer.
-											Add persistent compact integer vector and cache-line-optimized MPHF
										
										
											2026-05-13 06:24:43 +08:00
-											docs: update architecture and storage specs for approximate index
										
										
											2026-05-23 13:24:25 +02:00
+								`open()` requires `unitigs.bin.idx` (random access into unitigs). `open_sequential()` on `UnitigFileReader` does not require the `.idx` and is used during build passes.
-											Add persistent compact integer vector and cache-line-optimized MPHF
										
										
											2026-05-13 06:24:43 +08:00
-											docs: update architecture and storage specs for approximate index
										
										
											2026-05-23 13:24:25 +02:00
+								### Build surface
-											Add persistent compact integer vector and cache-line-optimized MPHF
										
										
											2026-05-13 06:24:43 +08:00
-											docs: update architecture and storage specs for approximate index
										
										
											2026-05-23 13:24:25 +02:00
+								```rust
 								// Full MPHF + exact evidence build (two-pass, parallel)
 								pub(crate) fn build(dir, block_bits, fill_slot) -> OLMResult<usize>
 								// Evidence-only builds (MPHF already present in dir)
 								pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
 								pub fn build_approx_evidence(dir, b, z)      -> OLMResult<usize>
 								pub fn build_evidence(dir, kind, block_bits) -> OLMResult<usize>  // dispatch
 								```
 								`MphfLayer::build` runs two sequential passes over `unitigs.bin`:
 . **Pass 1** (parallel via rayon): iterate all canonical kmers, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
 . **Pass 2** (sequential): iterate again, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.
 								`build` always produces exact evidence. For approximate evidence, use `build_approx_evidence` after MPHF construction.
 								For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files.
-											Add persistent compact integer vector and cache-line-optimized MPHF
										
										
											2026-05-13 06:24:43 +08:00
 								---
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
+								## Layer\<D: LayerData\> — MPHF + payload
-											Add persistent compact integer vector and cache-line-optimized MPHF
										
										
											2026-05-13 06:24:43 +08:00
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
+								`Layer<D>` pairs an `MphfLayer` with one payload store.
-											Add persistent compact integer vector and cache-line-optimized MPHF
										
										
											2026-05-13 06:24:43 +08:00
 								```rust
-											refactor(obilayeredmap): support generic payload types
										
										
											2026-05-14 09:24:25 +08:00
+								pub trait LayerData: Sized {
 								    type Item;
 								    fn open(layer_dir: &Path) -> OLMResult<Self>;
 								    fn read(&self, slot: usize) -> Self::Item;
 								}
 								pub struct Layer<D: LayerData = ()> {
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
+								    mphf: MphfLayer,
 								    data: D,
-											Add persistent compact integer vector and cache-line-optimized MPHF
										
										
											2026-05-13 06:24:43 +08:00
+								}
-											refactor(obilayeredmap): support generic payload types
										
										
											2026-05-14 09:24:25 +08:00
 								pub struct Hit<T = ()> {
 								    pub slot: usize,
 								    pub data: T,
 								}
-											Add persistent compact integer vector and cache-line-optimized MPHF
										
										
											2026-05-13 06:24:43 +08:00
+								```
-											docs: update architecture and storage specs for approximate index
										
										
											2026-05-23 13:24:25 +02:00
+								`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not part of the trait.
-											Add persistent compact integer vector and cache-line-optimized MPHF
										
										
											2026-05-13 06:24:43 +08:00
-											refactor(obilayeredmap): support generic payload types
										
										
											2026-05-14 09:24:25 +08:00
+								| Type | `Item` | Description |
 								|---|---|---|
 								| `()` | `()` | mode 1 — membership only |
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
+								| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) |
 								| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) |
-											docs: update architecture and storage specs for approximate index
										
										
											2026-05-23 13:24:25 +02:00
+								### Build signatures
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
 								```rust
 								// mode 1
 								impl Layer<()> {
-											docs: update architecture and storage specs for approximate index
										
										
											2026-05-23 13:24:25 +02:00
+								    pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult<usize>
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
+								}
 								// mode 2
 								impl Layer<PersistentCompactIntMatrix> {
-											docs: update architecture and storage specs for approximate index
										
										
											2026-05-23 13:24:25 +02:00
+								    pub fn build(out_dir: &Path, block_bits: u8,
 								                 count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
 								    pub fn build_from_map(out_dir: &Path, block_bits: u8,
 								                          counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
-											docs: clarify MPHF indexing, storage layout, and distance traits
										
										
											2026-05-17 10:20:22 +08:00
+								}
 								// mode 3
 								impl Layer<PersistentBitMatrix> {
-											docs: update architecture and storage specs for approximate index
										
										
											2026-05-23 13:24:25 +02:00
+								    pub fn build_presence(out_dir: &Path, block_bits: u8,
 								                          n_genomes: usize,
 								                          present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
 								}
 								```
 								All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Modes 2 and 3 pre-read `n_kmers` from `unitigs.bin` via `UnitigFileReader::open_sequential` to size the matrix builder before calling `MphfLayer::build`.
 								### Evidence build helpers on Layer
 								```rust
 								impl<D: LayerData> Layer<D> {
 								    pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
 								    pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8)  -> OLMResult<usize>
 								    pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize>
 								}
 								```
 								These delegate directly to the corresponding `MphfLayer` methods and are provided so call sites can remain typed at the `Layer<D>` level.
 								---
 								## FingerprintVec and FingerprintVecWriter
 								Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot.
 								```
 								fingerprint.bin format:
 								  magic:   b"FPVF"  (4 bytes)
 								  b:       u8       (bits per fingerprint, 1..=64)
 								  padding: [0u8; 3]
 								  n:       u64 LE   (number of slots)
 								  data:    packed bits, ceil(n*b/8) bytes, Lsb0 order
 								```
 								```rust
 								impl FingerprintVec {
 								    pub fn open(path: &Path) -> OLMResult<Self>
 								    pub fn get(&self, slot: usize) -> u64
 								    pub fn matches(&self, slot: usize, fingerprint: u64) -> bool
 								    pub fn n(&self) -> usize
 								    pub fn b(&self) -> u8
 								}
 								```
 								`matches(slot, hash)` extracts the b-bit fingerprint stored at `slot` and compares it to the low b bits of `hash`. It is the core operation of `find_approx`.
 								---
 								## LayeredMap\<D\> — collection of layers
 								`LayeredMap<D>` wraps `Vec<Layer<D>>` for a single partition directory.
 								```rust
 								pub struct LayeredMap<D: LayerData = ()> {
 								    root:   PathBuf,
 								    meta:   PartitionMeta,
 								    layers: Vec<Layer<D>>,
 								}
 								```
 								`PartitionMeta` (`meta.json` at the partition root) stores `n_layers`.
 								### Common methods
 								```rust
 								pub fn open(root: &Path)   -> OLMResult<Self>
 								pub fn create(root: &Path) -> OLMResult<Self>
 								pub fn n_layers(&self)     -> usize
 								pub fn layer(&self, i: usize) -> &Layer<D>
 								pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
 								pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>
 								```
 								`query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0.
 								### push_layer
 								`push_layer` builds the next layer from a `unitigs.bin` already written via `next_layer_writer`, using `DEFAULT_BLOCK_BITS`:
 								```rust
 								// mode 1
 								impl LayeredMap<()> {
 								    pub fn push_layer(&mut self) -> OLMResult<usize>
 								}
 								// mode 2
 								impl LayeredMap<PersistentCompactIntMatrix> {
 								    pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
 								    pub fn push_layer_from_map(&mut self, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
 								}
 								```
 								Mode 3 (`PersistentBitMatrix`) has no `push_layer` on `LayeredMap`; callers build directly via `Layer<PersistentBitMatrix>::build_presence`.
 								---
 								## LayeredStore\<S\> and aggregation traits
 								`LayeredStore<S>` is a generic aggregation wrapper over `Vec<S>`. It propagates three traits from `obicompactvec::traits` up the hierarchy via blanket impls:
 								```rust
 								pub struct LayeredStore<S>(pub Vec<S>);
 								impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … }  // Σ col_weights across inner stores
 								impl<S: CountPartials> CountPartials  for LayeredStore<S> { … }  // element-wise Σ partials
 								impl<S: BitPartials>   BitPartials    for LayeredStore<S> { … }  // element-wise Σ partials
 								```
 								Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type.
 								**Leaf implementors** (in `obicompactvec`):
 								| Type | Traits |
 								|---|---|
 								| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
 								| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |
 								See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.
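The composition of blanket impls can be illustrated with a simplified trait. The `col_weights` signature below is an assumption made for this sketch (the real traits live in `obicompactvec::traits` and may differ), and `Leaf` is a hypothetical stand-in for a persistent matrix.

```rust
/// Simplified, hypothetical version of `ColumnWeights`: one weight per column.
trait ColumnWeights {
    fn col_weights(&self) -> Vec<u64>;
}

/// Re-declared here so the sketch is self-contained.
struct LayeredStore<S>(pub Vec<S>);

/// Blanket impl: sum the per-column weights of all inner stores.
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {
    fn col_weights(&self) -> Vec<u64> {
        let mut total: Vec<u64> = Vec::new();
        for store in &self.0 {
            let w = store.col_weights();
            if total.len() < w.len() {
                total.resize(w.len(), 0);
            }
            for (t, x) in total.iter_mut().zip(w) {
                *t += x;
            }
        }
        total
    }
}

/// Hypothetical leaf store holding precomputed column weights.
struct Leaf(Vec<u64>);

impl ColumnWeights for Leaf {
    fn col_weights(&self) -> Vec<u64> {
        self.0.clone()
    }
}

fn main() {
    // Two partitions, each a LayeredStore of layers: no extra type is needed.
    let p0 = LayeredStore(vec![Leaf(vec![1, 2]), Leaf(vec![3, 4])]);
    let p1 = LayeredStore(vec![Leaf(vec![10, 20])]);
    let index = LayeredStore(vec![p0, p1]);
    println!("{:?}", index.col_weights());
}
```

The nested value has type `LayeredStore<LayeredStore<Leaf>>`, and the compiler resolves `col_weights` on it by applying the same blanket impl twice.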
 								---
 								## On-disk structure
 								```
 								partition_root/                    ← LayeredMap (one partition)
 								  meta.json                        — {"n_layers": N}
 								  layer_0/                         ← Layer
 								    layer_meta.json                — {"type": "exact"} or {"type": "approx", "b": B, "z": Z}
 								    mphf.bin                       — ptr_hash MPHF (epserde format)
 								    unitigs.bin                    — packed 2-bit nucleotide sequences
 								    unitigs.bin.idx                — UIDX index (exact evidence only)
 								    evidence.bin                   — [u32; n], LE  (exact evidence only)
 								    fingerprint.bin                — packed b-bit array  (approx evidence only)
 								    counts/                        [mode 2] PersistentCompactIntMatrix
 								      meta.json
 								      col_000000.pciv
 								    presence/                      [mode 3] PersistentBitMatrix
 								      meta.json
 								      col_000000.pbiv  …
 								  layer_1/
 								    …
 								```
 								`unitigs.bin.idx` is required by `open()` (random access). `open_sequential()` on `UnitigFileReader` omits it and is used during build passes and approx-evidence construction.
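As a rough illustration of the layout above, the sketch below lists the files one would expect in a layer directory for each evidence kind. `expected_files` and `layer_dir` are hypothetical helpers, not part of the crate; the file names are taken from the tree above, and payload directories (`counts/`, `presence/`) are left out.

```rust
use std::path::{Path, PathBuf};

/// Mirrors the enum recorded in `layer_meta.json`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

/// Hypothetical helper: directory of layer `i` under a partition root.
fn layer_dir(root: &Path, i: usize) -> PathBuf {
    root.join(format!("layer_{i}"))
}

/// Hypothetical helper: files expected in a layer directory, payload excluded.
fn expected_files(kind: EvidenceKind) -> Vec<&'static str> {
    let mut files = vec!["layer_meta.json", "mphf.bin", "unitigs.bin"];
    match kind {
        // Exact evidence needs random access into the unitigs.
        EvidenceKind::Exact => files.extend(["unitigs.bin.idx", "evidence.bin"]),
        // Approximate evidence only stores fingerprints.
        EvidenceKind::Approx { .. } => files.push("fingerprint.bin"),
    }
    files
}

fn main() {
    let dir = layer_dir(Path::new("partition_root"), 0);
    for kind in [EvidenceKind::Exact, EvidenceKind::Approx { b: 8, z: 3 }] {
        println!("{:?} in {}: {:?}", kind, dir.display(), expected_files(kind));
    }
}
```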
 								---
 								## Evidence encoding (exact)
 								`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:
 								```
 								bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
 								bits [6:0]  = rank     (7 bits)  — kmer index within the chunk (0-based)
 								```
 								`chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id` (requires `unitigs.bin.idx` for random access).
 								For k=31, m=11, the observed maximum is ~46 kmers per chunk, well within the 7-bit rank range (0 to 127).
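The bit layout translates directly into shifts and masks. The sketch below is illustrative only: `encode`, `decode` and `read_slot` are hypothetical names, and `read_slot` works on an in-memory byte slice rather than the memory-mapped `evidence.bin`.

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = 0x7F;
const CHUNK_LIMIT: u32 = 1 << 25; // chunk_id must fit in 25 bits

/// Pack (chunk_id, rank) into one u32, or None if either field overflows.
fn encode(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id >= CHUNK_LIMIT || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Split a raw entry back into (chunk_id, rank).
fn decode(raw: u32) -> (u32, u32) {
    (raw >> RANK_BITS, raw & RANK_MASK)
}

/// Read the little-endian u32 for `slot` from a headerless evidence buffer.
fn read_slot(evidence: &[u8], slot: usize) -> Option<u32> {
    let start = slot.checked_mul(4)?;
    let end = start.checked_add(4)?;
    let b = evidence.get(start..end)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

fn main() {
    let raw = encode(3, 46).expect("fields in range");
    let bytes = raw.to_le_bytes();
    let back = read_slot(&bytes, 0).map(decode);
    println!("raw = {raw}, decoded = {:?}", back);
}
```

Returning `None` on overflow keeps a chunk with more than 128 kmers from silently corrupting the `chunk_id` bits.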
 								---
 								## ptr_hash configuration
 								```rust
 								type Mphf = PtrHash<
 								    u64,                              // key type: canonical kmer raw encoding
 								    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
 								    CachelineEfVec<Vec<CachelineEf>>, // remap: Elias-Fano
 								    Xx64,                             // hasher: XXH3-64 with seed
 								    Vec<u8>,                          // pilots
 								>;
 								```
 								`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
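The effect of the structural zeros can be checked with a small sketch. The nucleotide codes (A=0, C=1, G=2, T=3) and the helper names are assumptions made for illustration, and `multiply_hash` is a generic single multiply by an odd constant, not the exact `FxHash` implementation.

```rust
/// Pack a DNA string into a left-aligned u64, 2 bits per nucleotide.
/// The A=0, C=1, G=2, T=3 code is an assumption of this sketch.
fn left_aligned(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 32 {
        return None;
    }
    let mut v = 0u64;
    for &c in seq {
        let code = match c {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        v = (v << 2) | code;
    }
    // Shift the 2k used bits to the top; the low 64 - 2k bits stay zero.
    Some(v << (64 - 2 * seq.len()))
}

/// A single wrapping multiply by an odd constant.
fn multiply_hash(x: u64) -> u64 {
    x.wrapping_mul(0x517c_c1b7_2722_0a95)
}

fn main() {
    let raw = left_aligned(b"ACGTACGTACT").expect("valid 11-mer");
    println!("raw  trailing zeros: {}", raw.trailing_zeros());
    println!("hash trailing zeros: {}", multiply_hash(raw).trailing_zeros());
}
```

A multiple of 2^42 multiplied by any constant is still a multiple of 2^42 modulo 2^64, so the low 42 bits of the product carry no information for k=11. A hasher that mixes high bits back into low bits avoids this.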
 								`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5) builds 2× slower than `Linear` with λ=3.0 but takes ~20% less space.
 								---
 								## Column append and merge support
 								These methods extend existing layers with new genome columns without touching the MPHF.
 								### Layer-level genome column append
 								```rust
 								impl Layer<PersistentBitMatrix> {
 								    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
 								}
 								impl Layer<PersistentCompactIntMatrix> {
 								    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
 								}
 								```
 								Both delegate to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They write a new column file (`col_NNNNNN.pbiv` / `col_NNNNNN.pciv`) and update `meta.json` to increment `n_cols`. `value_of` is called once per slot (0..n).
 								### Presence matrix initialisation
 								```rust
 								impl Layer<()> {
 								    pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
 								}
 								```
 								Called on the first merge of a Presence-mode index. Creates `presence/` with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.
 								### Why the MPHF is never rebuilt
 								The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update.
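An in-memory sketch of why appending a column leaves everything else untouched. `ToyCountMatrix` and `column_file_name` are hypothetical; only the `col_NNNNNN` naming pattern and the once-per-slot `value_of` call come from the description above.

```rust
/// File name for column `index`, following the `col_NNNNNN.<ext>` pattern.
fn column_file_name(index: usize, ext: &str) -> String {
    format!("col_{index:06}.{ext}")
}

/// Hypothetical column-major matrix: `n` slots fixed at build time, columns appended later.
struct ToyCountMatrix {
    n: usize,
    cols: Vec<Vec<u32>>,
}

impl ToyCountMatrix {
    fn new(n: usize) -> Self {
        ToyCountMatrix { n, cols: Vec::new() }
    }

    /// Append one column, calling `value_of` once per slot in 0..n.
    /// Returns the file name the column would be written to.
    fn append_column(&mut self, value_of: impl Fn(usize) -> u32) -> String {
        let name = column_file_name(self.cols.len(), "pciv");
        self.cols.push((0..self.n).map(value_of).collect());
        name
    }

    fn n_cols(&self) -> usize {
        self.cols.len()
    }
}

fn main() {
    let mut m = ToyCountMatrix::new(4);
    let first = m.append_column(|_| 1);
    let second = m.append_column(|slot| slot as u32 * 2);
    println!("{first}, {second}, n_cols = {}", m.n_cols());
}
```

Existing columns and the slot count `n` are never modified by `append_column`, which mirrors the claim that the MPHF and evidence files stay immutable.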
 								---
 								## Add-layer algorithm
 								When adding dataset B to an existing index:
 								1. For each partition, probe existing layers for kmers of B routed to that partition.
 								2. Collect kmers absent from all layers → `B \ index`.
 								3. Write `B \ index` to a new `unitigs.bin` via `next_layer_writer()`.
 								4. Call `Layer<D>::build` (or `build_presence`) on the new layer directory.
 								5. Call `push_layer` (or `append_layer`) to register the new layer in `meta.json`.
 								Each partition's new layer is built independently; the operation is fully parallel across partitions.
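The set logic of the steps above, for a single partition, can be sketched with hash sets in place of MPHF layers. `add_dataset` is a hypothetical helper, and skipping the new layer when `B \ index` is empty is an assumption of this sketch rather than documented behaviour.

```rust
use std::collections::HashSet;

/// Add dataset `b` to one partition: keep only kmers absent from every
/// existing layer and store them as a new layer. Returns the new layer index.
fn add_dataset(layers: &mut Vec<HashSet<u64>>, b: &[u64]) -> Option<usize> {
    // Steps 1-2: probe existing layers, collect B \ index.
    let novel: HashSet<u64> = b
        .iter()
        .copied()
        .filter(|kmer| !layers.iter().any(|layer| layer.contains(kmer)))
        .collect();
    if novel.is_empty() {
        return None; // assumption: nothing new, no layer created
    }
    // Steps 3-5: the real crate writes unitigs, builds the MPHF and registers the layer.
    layers.push(novel);
    Some(layers.len() - 1)
}

fn main() {
    let mut layers = vec![[1u64, 2, 3].into_iter().collect::<HashSet<u64>>()];
    let idx = add_dataset(&mut layers, &[2, 3, 4, 5]);
    println!("new layer: {:?}, size: {:?}", idx, idx.map(|i| layers[i].len()));
}
```

Existing layers are only read, never modified, so layers stay pairwise disjoint and different partitions can run this step concurrently.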
 								---
 								## Dependencies
 								| crate | role |
 								|---|---|
 								| `ptr_hash 1.1` | MPHF per layer |
 								| `cacheline-ef 1.1` | compact remap inside ptr_hash |
 								| `epserde 0.8` | zero-copy MPHF serialisation |
 								| `memmap2 0.9` | mmap of evidence and fingerprint files |
 								| `bitvec` | packed b-bit fingerprint storage |
 								| `obiskio` | unitig file writer/reader + `.idx` build |
 								| `obicompactvec` | payload types + aggregation traits |
 								| `rayon 1` | parallel MPHF construction pass |
 								| `serde / serde_json` | `LayerMeta` + `PartitionMeta` serialisation |