obikmer/docmd/implementation/obilayeredmap.md

# obilayeredmap — layered kmer index crate

## Purpose

`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **collection → partition → layer**. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.

---

## Four usage modes

The MPHF + evidence infrastructure is fixed for all modes. The **payload** — data associated with each slot — is orthogonal and varies by mode.

| Mode | Description | Payload type | File |
|---|---|---|---|
| 1. Set | membership test only | `()` | — |
| 2. Set with count | occurrences per kmer per sample | `PersistentCompactIntVec` | `counts.pciv` |
| 3. Presence/absence matrix | which genomes contain each kmer | `PersistentBitVec` per genome | `presence_N.pbiv` |
| 4. Count matrix | occurrences per kmer per genome | `PersistentCompactIntVec` per genome | `counts_N.pciv` |

Both `PersistentCompactIntVec` and `PersistentBitVec` come from the `obicompactvec` crate. Modes 3 and 4 are not yet implemented; the per-genome multi-file layout and query API remain to be designed.

### Payload for mode 2: PersistentCompactIntVec

`PersistentCompactIntVec` (PCIV) stores one `u32` count per MPHF slot in a single mmap'd `.pciv` file. Its encoding: a primary `u8` array (value 255 = overflow sentinel) backed by a sorted overflow section of `(slot: u64, value: u32)` entries and a sparse L1-fitting index for fast binary search. This handles the geometric count distribution efficiently — most values fit in 1 byte, overflow entries are rare.

Capacity: 0 to u32::MAX per slot. No separate decision needed on bit-width: PCIV adapts to the data.

### Payload for mode 3/4: PersistentBitVec / PersistentCompactIntVec

`PersistentBitVec` (PBIV) stores one bit per MPHF slot in a mmap'd `.pbiv` file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). One PBIV per genome gives a column-major presence/absence matrix, making per-genome set operations cache-friendly.

Mode 4 replaces PBIV with PCIV per genome. Multi-file layout and query API are not yet designed.

---

## Payload architecture

The payload is orthogonal to the MPHF + evidence layer. `Layer` is parameterised by `D: LayerData`:

```rust
pub trait LayerData: Sized {
    type Item;
    fn open(layer_dir: &Path) -> OLMResult<Self>;
    fn read(&self, slot: usize) -> Self::Item;
}

pub struct Layer<D: LayerData = ()> {
    mphf:     Mphf,
    evidence: Evidence,
    unitigs:  UnitigFileReader,
    data:     D,
}

pub struct Hit<T = ()> {
    pub slot: usize,
    pub data: T,
}
```

`LayerData` covers the **read path only** (`open` + `read`). The write path (build) is intentionally not in the trait — build signatures differ between modes (mode 1 takes no extra argument, mode 2 takes a `count_of` closure) and forcing this into a trait would require an associated `Context` type with no benefit over specialized `impl` blocks.

Implemented concrete types:

| Type | `Item` | Description |
|---|---|---|
| `()` | `()` | mode 1 — membership only |
| `PersistentCompactIntVec` | `u32` | mode 2 — per-slot count |

`LayeredMap` mirrors the same parameterisation: `LayeredMap<D: LayerData = ()>`.

---

## Three-level hierarchy

```
index_root/                        ← LayeredMap (collection)
  meta.json
  part_00000/                      ← Partition
    layer_0/                       ← Layer
      mphf.bin
      unitigs.bin
      unitigs.bin.idx
      evidence.bin
      counts.pciv         [mode 2 only]
      presence_N.pbiv     [mode 3/4, one per genome — not yet implemented]
    layer_1/
      ...
  part_00001/
    layer_0/
    ...
```

**Collection** (`index_root/`): global metadata — kmer size k, number of partitions, layer count, sample registry.

**Partition** (`part_XXXXX/`): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in `part_XXXXX`. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in `meta.json`.

**Layer** (`layer_N/`): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer.

---

## Layer file layout

```
layer_N/
  mphf.bin            — ptr_hash MPHF (epserde, ptr_hash native format)
  unitigs.bin         — packed 2-bit nucleotide sequences (obiskio binary format)
  unitigs.bin.idx     — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
  evidence.bin        — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
  counts.pciv         — [mode 2] PersistentCompactIntVec: one u32 per slot
```

`unitigs.bin` is the packed-2-bit sequence file produced by `obiskio::UnitigFileWriter`. The companion `.idx` file stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.

### Evidence encoding

Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction.

```
slot s  →  unitig_id: u25  |  rank: u7
```

Packed into a `u32` (29 bits used, 3 spare). Decoding:

```
kmer = unitigs[unitig_id][rank .. rank + k]   // 2-bit packed slice
```

`rank` is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k − m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A `u7` (0–127) is sufficient.

---

## ptr_hash configuration

The MPHF per layer is configured as:

```rust
type Mphf = PtrHash<
    u64,                              // key type: canonical kmer raw encoding
    CubicEps,                         // bucket fn: balanced (2.4 bits/key, λ=3.5)
    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry vs 32 for Vec<u32>
    Xx64,                             // hasher: XXH3-64 with seed, handles structured keys
    Vec<u8>,                          // pilots
>;
```

**Hasher choice — `Xx64`:** k-mer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). `FxHash` (single multiply) distributes these poorly. `Xx64` (XXH3 64-bit, seeded) handles structured input correctly.

**Bucket function — `CubicEps` with `PtrHashParams::<CubicEps>::default()`:** λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than `Linear/λ=3.0` (the `default_fast` preset), 20% less space. `default_compact` (λ=4.0) saves a further 12.5% at 2× more construction time and reduced reliability — not chosen.

**Remap — `CachelineEfVec`:** Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for `Vec<u32>`). Already a transitive dependency of `ptr_hash`. One cacheline per query vs one u32 read; space win dominates for billion-scale key sets.

---

## Build path

The build path is not part of `LayerData`. Each mode exposes its own `impl Layer<D>::build` with the exact signature it needs. Two private module-level helpers avoid code duplication:

**`build_mphf(out_dir, n) -> OLMResult<Mphf>`**: first pass — opens `unitigs.bin`, iterates all canonical kmers in parallel via `new_from_par_iter`, stores `mphf.bin`. O(n).

**`build_second_pass(out_dir, n, mphf, fill_slot) -> OLMResult<()>`**: second pass — opens `unitigs.bin` again, fills `evidence.bin` and a compact n/8-byte seen-bitset (MPHF correctness check inline), calls `fill_slot(slot, kmer)` once per kmer for the mode-specific payload. O(n).

```rust
// mode 1
impl Layer<()> {
    pub fn build(out_dir: &Path) -> OLMResult<usize>
}

// mode 2
impl Layer<PersistentCompactIntVec> {
    pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}
```

Any duplicate slot or out-of-bounds index detected during `build_second_pass` returns `OLMError::Mphf`. `new_from_par_iter` avoids materialising all keys as `Vec<u64>`.

---

## Query path

A kmer query routes through all three levels:

1. **Partition routing**: hash canonical minimiser of the query kmer → partition index → open `part_XXXXX/`.
2. **Layer probing**: iterate layers in order; for each layer compute `slot = mphf.index(kmer)`, decode evidence, compare to query. First match wins.
3. **Data access**: `layer.data.read(slot)` returns `D::Item`.

```rust
// pseudo-code
fn query(kmer) -> Option<(usize, Hit<D::Item>)>:
    for (i, layer) in self.layers.iter().enumerate():
        slot = layer.mphf.index(&kmer.raw())
        if layer.evidence.decode(slot) == kmer:
            return Some((i, Hit { slot, data: layer.data.read(slot) }))
    return None
```

Expected probe depth: 1 for kmers in layer 0, increasing for later layers.

---

## Add-layer algorithm

When adding dataset B to an existing index:

1. For each partition, iterate kmers of B routed to that partition.
2. Probe existing layers; collect kmers absent from all layers → `B \ index`.
3. Build a new layer from `B \ index`.
4. Append the new layer directory under each `part_XXXXX/`.
5. Update `meta.json` (layer count, sample registry).

Each partition's new layer is built independently; the operation is fully parallel across partitions.

---

## Dependencies

| crate | role |
|---|---|
| `ptr_hash 1.1` | MPHF per layer (epserde serialisation) |
| `cacheline-ef 1.1` | compact remap storage inside ptr_hash |
| `epserde 0.8` | zero-copy serialisation of MPHF |
| `memmap2` | mmap of layer files |
| `obiskio` | unitig file writer/reader |
| `obicompactvec` | payload types: `PersistentCompactIntVec`, `PersistentBitVec` |

---

## Open questions

- **Mode 3/4 multi-file layout**: one PBIV/PCIV per genome per layer means O(n_layers × n_genomes) files. Directory layout, open strategy, and query API are not yet designed.
- **Mode 4 scale**: count matrix (n_kmers × n_genomes × bytes_per_count) reaches hundreds of GB for large collections. A sparse representation may be required; access pattern and density threshold are not yet defined.
- **Presence matrix layout**: column-major (one PBIV per genome) favours per-genome operations; row-major favours per-kmer queries. Dominant access pattern not yet characterised.
- **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model.
- **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
- **`try_new_from_par_iter`**: `ptr_hash::new_from_par_iter` silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A `try_new_from_par_iter` PR upstream would close this gap.