
# PersistentCompactIntVec and PersistentCompactIntMatrix

## Purpose

`PersistentCompactIntVec` stores a dense array of non-negative integers, indexed by MPHF (minimal perfect hash function) slot, in which the vast majority of values are small (0–254) and large values are rare. It is designed for mmap-compatible random and sequential access with a small memory footprint (one byte per slot plus a sparse overflow section) and cache-friendly lookups.

The design is motivated by count distributions observed in genomics data: about 99.9% of k-mer counts fit in a single byte; overflow (count ≥ 255) affects only ~0.07% of distinct k-mers, but those counts can exceed 10⁶ (chloroplast, ribosomal repeats).

`PersistentCompactIntMatrix` wraps multiple `PersistentCompactIntVec` columns in a directory, exposing a column-major matrix with a row-access API. A vector is a matrix with one column.

---

## PersistentCompactIntVec — single-column file

### Design

Two-tier structure:

1. **Primary array** — `[u8; n]`, stored at offset 40 in the PCIV file and mmap'd. Values 0–254 are stored directly. Value **255 is a sentinel** meaning "look in overflow".
2. **Overflow section** — sorted list of `(slot: u64, value: u32)` pairs for all slots where the true value ≥ 255, with a **sparse L1-fitting index** for fast lookup.

```
primary[slot] < 255  →  return primary[slot]
primary[slot] == 255 →  binary search in overflow
```

### File format

Single `.pciv` file. Write order: header placeholder → primary → overflow + index → header overwrite at offset 0.

```
offset 0:
  magic:      [u8; 4]   = b"PCIV"
  _pad:       [u8; 4]   = 0
  n:          u64        number of slots
  n_overflow: u64        number of overflow entries
  n_index:    u64        number of sparse index entries
  step:       u64        sparse index step (0 = no index)

offset 40:
  primary:    [u8; n]    one byte per slot, 255 = overflow sentinel

offset 40 + n:
  data:       [(slot: u64, value: u32); n_overflow]   12 bytes each, sorted by slot

offset 40 + n + n_overflow × 12:
  index:      [(slot: u64, pos: u64); n_index]         16 bytes each, sparse index
```

The index entries point into `data`: `index[i] = (slot of data[i×step], i×step)`.

All integer fields are little-endian. Slot indices are stored as `u64` in the file; they are `usize` in Rust code.
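
As an illustration of the layout above, the 40-byte header can be encoded and decoded with plain byte slicing. This is a minimal sketch, not the crate's code: the `Header` struct and the `encode_header`, `decode_header` and `section_offsets` helpers are hypothetical names introduced here.

```rust
use std::convert::TryInto;

/// Fixed header size: 4 (magic) + 4 (pad) + 4 x 8 (u64 fields).
const HEADER_SIZE: usize = 40;

/// Hypothetical in-memory view of the header fields.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
struct Header {
    n: u64,
    n_overflow: u64,
    n_index: u64,
    step: u64,
}

/// Serialise the header into its 40-byte little-endian form.
fn encode_header(h: &Header) -> [u8; HEADER_SIZE] {
    let mut buf = [0u8; HEADER_SIZE];
    buf[0..4].copy_from_slice(b"PCIV");
    // Bytes 4..8 stay zero (padding).
    buf[8..16].copy_from_slice(&h.n.to_le_bytes());
    buf[16..24].copy_from_slice(&h.n_overflow.to_le_bytes());
    buf[24..32].copy_from_slice(&h.n_index.to_le_bytes());
    buf[32..40].copy_from_slice(&h.step.to_le_bytes());
    buf
}

/// Read one little-endian u64 at `offset`.
fn read_u64_le(buf: &[u8], offset: usize) -> u64 {
    u64::from_le_bytes(buf[offset..offset + 8].try_into().unwrap())
}

/// Parse a header; returns None on a short buffer or a bad magic.
fn decode_header(buf: &[u8]) -> Option<Header> {
    if buf.len() < HEADER_SIZE || !buf.starts_with(b"PCIV") {
        return None;
    }
    Some(Header {
        n: read_u64_le(buf, 8),
        n_overflow: read_u64_le(buf, 16),
        n_index: read_u64_le(buf, 24),
        step: read_u64_le(buf, 32),
    })
}

/// Byte offsets of the (data, index) sections, derived from the header.
fn section_offsets(h: &Header) -> (u64, u64) {
    let data = HEADER_SIZE as u64 + h.n;
    let index = data + h.n_overflow * 12;
    (data, index)
}
```

For a file with `n = 1000` and three overflow entries, `section_offsets` places `data` at byte 1040 and `index` at byte 1076.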

### Lifecycle

#### Builder (`PersistentCompactIntVecBuilder`)

Used during construction. The primary section is **mmap'd immediately** at construction time (both for `new` and `build_from`), so the file exists and is addressable from the start. The overflow is held in a `HashMap<usize, u32>` in RAM.

```rust
struct PersistentCompactIntVecBuilder {
    path:     PathBuf,
    mmap:     MmapMut,            // primary section live in the file from the start
    n:        usize,
    overflow: HashMap<usize, u32>, // values ≥ 255
}
```

**`new(n: usize, path: &Path) -> io::Result<Self>`**

Creates the file, pre-allocates `HEADER_SIZE + n` zero bytes, mmaps it. The primary is zero-initialised (all slots = 0). Returns immediately ready for `set` / `get`.

**`build_from(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>`**

Copies the source PCIV file to `path` (OS-level copy — no per-slot iteration), mmaps the copy, then loads the overflow section into a `HashMap`. Initialisation cost: O(file copy) + O(n_overflow), not O(n).

At `close()`, the primary section is **not rewritten**: it is already in the file via mmap. Only the overflow data, the sparse index, and the header are updated.

**`set(slot: usize, value: u32)` / `get(slot: usize) -> u32`**

Direct mmap byte access for the primary; HashMap for the overflow. Both O(1). Mutations can move a slot between tiers freely (downward mutation removes the HashMap entry; upward mutation adds it).
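
The tier migration described above can be sketched with an in-memory stand-in, where a `Vec<u8>` plays the role of the mmap'd primary section. The `TwoTier` type is a hypothetical name used only for this illustration.

```rust
use std::collections::HashMap;

const SENTINEL: u8 = 255;

/// In-memory stand-in for the builder: `primary` replaces the mmap'd section.
struct TwoTier {
    primary: Vec<u8>,
    overflow: HashMap<usize, u32>,
}

impl TwoTier {
    fn new(n: usize) -> Self {
        TwoTier { primary: vec![0; n], overflow: HashMap::new() }
    }

    fn set(&mut self, slot: usize, value: u32) {
        if value < SENTINEL as u32 {
            // Small value: store it directly and drop any stale overflow entry.
            self.primary[slot] = value as u8;
            self.overflow.remove(&slot);
        } else {
            // Large value: mark the slot and keep the true value in the map.
            self.primary[slot] = SENTINEL;
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: usize) -> u32 {
        match self.primary[slot] {
            SENTINEL => self.overflow[&slot],
            b => b as u32,
        }
    }
}
```

Note that the value 255 itself goes to the overflow tier, since the byte 255 is reserved as the sentinel.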

**Element-wise operations — `min`, `max`, `add`, `diff`**

Each takes a `&PersistentCompactIntVec` of equal length and updates `self` in place via `set`:

```rust
builder.min(&other);   // self[i] = min(self[i], other[i])
builder.max(&other);   // self[i] = max(self[i], other[i])
builder.add(&other);   // self[i] = self[i] + other[i]   (checked_add; panics on u32 overflow)
builder.diff(&other);  // self[i] = self[i].saturating_sub(other[i])
```

All four iterate `other` sequentially with `other.iter()` (merge-scan, O(n)) and write each result back through `set`.
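
The semantics of the four operations can be illustrated on plain `u32` slices. This is a sketch under the assumption that each operation is a slot-by-slot map. The helper names (`zip_apply`, `elementwise_min`, and so on) are hypothetical, and the real builder goes through `set` rather than a mutable slice.

```rust
/// Apply `f` slot by slot: dst[i] = f(dst[i], src[i]).
fn zip_apply(dst: &mut [u32], src: &[u32], f: impl Fn(u32, u32) -> u32) {
    assert_eq!(dst.len(), src.len(), "vectors must have equal length");
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = f(*d, s);
    }
}

fn elementwise_min(dst: &mut [u32], src: &[u32]) {
    zip_apply(dst, src, |a, b| a.min(b));
}

fn elementwise_max(dst: &mut [u32], src: &[u32]) {
    zip_apply(dst, src, |a, b| a.max(b));
}

/// Checked addition: panics if a sum does not fit in u32.
fn elementwise_add(dst: &mut [u32], src: &[u32]) {
    zip_apply(dst, src, |a, b| a.checked_add(b).expect("u32 overflow in add"));
}

/// Saturating subtraction: results below zero clamp to 0.
fn elementwise_diff(dst: &mut [u32], src: &[u32]) {
    zip_apply(dst, src, |a, b| a.saturating_sub(b));
}
```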

**`close(self) -> io::Result<()>`**

1. Flush and drop the mmap (primary changes are now on disk).
2. Sort the overflow HashMap into `Vec<(usize, u32)>`.
3. Truncate the file to `HEADER_SIZE + n` (removes old data+index if `build_from` was used).
4. Append sorted overflow data, then sparse index.
5. Seek to offset 0, overwrite the header with final values.
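
Steps 2 and 4 can be sketched as a pure function that turns the overflow map into the bytes appended after the primary section. The function name `serialize_overflow` is hypothetical, and the sketch assumes one index entry for every `step`-th data entry, as in the file format above.

```rust
use std::collections::HashMap;

/// Sort the overflow map by slot and serialise it, followed by the sparse index.
/// Returns (bytes to append after the primary section, n_index).
fn serialize_overflow(overflow: &HashMap<usize, u32>, step: usize) -> (Vec<u8>, usize) {
    let mut entries: Vec<(usize, u32)> = overflow.iter().map(|(&s, &v)| (s, v)).collect();
    entries.sort_unstable_by_key(|&(slot, _)| slot);

    // Data section: 12 bytes per entry (u64 slot + u32 value), little-endian.
    let mut out = Vec::with_capacity(entries.len() * 12);
    for &(slot, value) in &entries {
        out.extend_from_slice(&(slot as u64).to_le_bytes());
        out.extend_from_slice(&value.to_le_bytes());
    }

    // Index section: 16 bytes per entry, one for every `step`-th data entry.
    let mut n_index = 0;
    if step > 0 {
        for pos in (0..entries.len()).step_by(step) {
            out.extend_from_slice(&(entries[pos].0 as u64).to_le_bytes());
            out.extend_from_slice(&(pos as u64).to_le_bytes());
            n_index += 1;
        }
    }
    (out, n_index)
}
```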

#### Reader (`PersistentCompactIntVec`)

Used at query time. The whole file is mmap'd; only the sparse index is copied into a `Vec` at open time (at most 2048 entries × 16 bytes = 32 KB, sized to fit in a typical L1 data cache).

```rust
struct PersistentCompactIntVec {
    mmap:           Mmap,
    n:              usize,
    n_overflow:     usize,
    step:           usize,
    index:          Vec<(usize, usize)>,  // (slot, pos) — L1-resident
    primary_offset: usize,               // = 40 (HEADER_SIZE)
    data_offset:    usize,               // = 40 + n
    path:           PathBuf,
}
```

**`open(path: &Path) -> io::Result<Self>`**

Mmaps the file, parses the 40-byte header, copies the sparse index entries into a `Vec`. The primary and data sections stay mmap'd.

**`get(slot: usize) -> u32` — random access**

```
primary[slot] < 255  →  return it directly

step == 0:
    binary_search(data[0..n_overflow], slot)

step > 0:
    i  = upper_bound(index[..].slot, slot) − 1     // in L1-resident Vec
    hi = index[i+1].pos if i+1 < n_index, else n_overflow
    binary_search(data[index[i].pos .. hi], slot)
```
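
A possible Rust rendering of this lookup is sketched below on decoded in-memory slices rather than on the mmap. The function names and the tuple representation of `data` and `index` are assumptions made for the example. The last index bucket is bounded by the end of `data`.

```rust
const SENTINEL: u8 = 255;

/// Look up `slot` in the sorted overflow `data`, narrowing the search with the
/// sparse `index` when `step > 0`.
fn overflow_get(
    data: &[(usize, u32)],
    index: &[(usize, usize)],
    step: usize,
    slot: usize,
) -> Option<u32> {
    let (lo, hi) = if step == 0 {
        (0, data.len())
    } else {
        // Number of index entries whose slot is <= `slot` (upper bound).
        let ub = index.partition_point(|&(s, _)| s <= slot);
        if ub == 0 {
            return None; // `slot` precedes every overflow entry
        }
        let i = ub - 1;
        // The bucket ends where the next index entry starts, or at the end of data.
        let hi = index.get(i + 1).map_or(data.len(), |&(_, pos)| pos);
        (index[i].1, hi)
    };
    data[lo..hi]
        .binary_search_by_key(&slot, |&(s, _)| s)
        .ok()
        .map(|k| data[lo + k].1)
}

/// Two-tier random access: primary byte first, overflow only on the sentinel.
fn get(
    primary: &[u8],
    data: &[(usize, u32)],
    index: &[(usize, usize)],
    step: usize,
    slot: usize,
) -> u32 {
    match primary[slot] {
        SENTINEL => overflow_get(data, index, step, slot)
            .expect("sentinel without a matching overflow entry"),
        b => b as u32,
    }
}
```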

**`iter() -> Iter<'_>` — sequential scan, O(n)**

Merge-scan: reads primary bytes in order; on sentinel 255, advances a sequential pointer into the sorted data section rather than doing a binary search. This gives O(n + n_overflow), which is O(n) because n_overflow ≤ n, with no random access into the data section.

`Iter` implements `ExactSizeIterator`. `&PersistentCompactIntVec` implements `IntoIterator`.
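
The merge-scan can be sketched as an iterator over two slice cursors. `MergeScan` and `merge_scan` are hypothetical names; the real `Iter` reads from the mmap, whereas this illustration takes already-decoded slices.

```rust
const SENTINEL: u8 = 255;

/// Walks the primary bytes and the sorted overflow entries in lockstep.
struct MergeScan<'a> {
    primary: std::slice::Iter<'a, u8>,
    overflow: std::slice::Iter<'a, (usize, u32)>,
}

impl<'a> Iterator for MergeScan<'a> {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        let b = *self.primary.next()?;
        if b == SENTINEL {
            // Sentinels occur in slot order, so the next overflow entry is ours.
            let &(_, value) = self.overflow.next().expect("missing overflow entry");
            Some(value)
        } else {
            Some(b as u32)
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // Exactly one item per remaining primary byte.
        self.primary.size_hint()
    }
}

impl<'a> ExactSizeIterator for MergeScan<'a> {}

fn merge_scan<'a>(primary: &'a [u8], data: &'a [(usize, u32)]) -> MergeScan<'a> {
    MergeScan { primary: primary.iter(), overflow: data.iter() }
}
```

Because the overflow cursor only ever moves forward, a full scan touches each data entry once and never searches.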

**Aggregate**

```rust
fn sum(&self) -> u64   // Σ self[i] as u64, via iter()
```

**Distance methods**

All take `&other` of equal length, iterate both with `zip(self.iter(), other.iter())`, and return `f64`.

| Method | Formula |
|---|---|
| `bray_dist` | `1 − 2·Σmin(aᵢ,bᵢ) / (Σaᵢ + Σbᵢ)` |
| `relfreq_bray_dist` | Bray-Curtis on relative frequencies: `1 − Σmin(pᵢ,qᵢ)` where `pᵢ = aᵢ/Σa` and `qᵢ = bᵢ/Σb` |
| `euclidean_dist` | `√Σ(aᵢ − bᵢ)²` |
| `relfreq_euclidean_dist` | Euclidean on relative frequencies |
| `hellinger_euclidean_dist` | `√Σ(√pᵢ − √qᵢ)²` — Euclidean on sqrt(relfreq) |
| `hellinger_dist` | `hellinger_euclidean_dist / √2` — standard Hellinger distance ∈ [0, 1] |
| `threshold_jaccard_dist(&other, threshold: u32)` | `1 − \|A∩B\| / \|A∪B\|` where presence iff count ≥ threshold |
| `jaccard_dist` | `threshold_jaccard_dist(&other, 1)` |

Edge cases (both vectors all-zero, or union empty for Jaccard): distance = 0.0.
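
Three of the formulas in the table, including the edge cases, can be sketched on plain slices as follows. The free functions below are illustrative stand-ins for the methods and take `&[u32]` instead of `&PersistentCompactIntVec`.

```rust
/// Bray-Curtis distance on raw counts: 1 - 2*sum(min) / (sum(a) + sum(b)).
fn bray_dist(a: &[u32], b: &[u32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut sum_min, mut sum_a, mut sum_b) = (0u64, 0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        sum_min += x.min(y) as u64;
        sum_a += x as u64;
        sum_b += y as u64;
    }
    let denom = sum_a + sum_b;
    if denom == 0 {
        0.0 // both vectors all-zero
    } else {
        1.0 - 2.0 * sum_min as f64 / denom as f64
    }
}

/// Euclidean distance on raw counts.
fn euclidean_dist(a: &[u32], b: &[u32]) -> f64 {
    assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b)
        .map(|(&x, &y)| (x as f64 - y as f64).powi(2))
        .sum::<f64>()
        .sqrt()
}

/// Jaccard distance where a slot is "present" iff its count >= threshold.
fn threshold_jaccard_dist(a: &[u32], b: &[u32], threshold: u32) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut inter, mut uni) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        let (pa, pb) = (x >= threshold, y >= threshold);
        inter += (pa && pb) as u64;
        uni += (pa || pb) as u64;
    }
    if uni == 0 {
        0.0 // empty union
    } else {
        1.0 - inter as f64 / uni as f64
    }
}
```

For `a = [2, 0, 3]` and `b = [1, 1, 0]`, Σmin is 1 and the sums are 5 and 2, so the Bray-Curtis distance is 1 − 2/7 = 5/7. With threshold 1 the intersection has 1 slot and the union 3, so the Jaccard distance is 2/3.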

### Step computation

Chosen at `close()` once `n_overflow` is known:

```
L1_INDEX_ENTRIES = 2048

step = 0                                if n_overflow ≤ 2048
step = ⌈n_overflow / 2048⌉             otherwise
```
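
The rule above translates directly into integer arithmetic. In this sketch, `compute_step` follows the formula as stated, while `index_len` is a derived helper (an assumption, based on one index entry per `step` data entries) that shows the index never exceeds 2048 entries.

```rust
const L1_INDEX_ENTRIES: usize = 2048;

/// Sparse index step: 0 (no index) when the overflow is small enough,
/// otherwise ceil(n_overflow / 2048).
fn compute_step(n_overflow: usize) -> usize {
    if n_overflow <= L1_INDEX_ENTRIES {
        0
    } else {
        (n_overflow + L1_INDEX_ENTRIES - 1) / L1_INDEX_ENTRIES
    }
}

/// Number of index entries when every `step`-th data entry is indexed.
fn index_len(n_overflow: usize, step: usize) -> usize {
    if step == 0 {
        0
    } else {
        (n_overflow + step - 1) / step
    }
}
```

For example, 1 000 000 overflow entries give `step = 489` and 2045 index entries, which is 32 720 bytes at 16 bytes per entry.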

### Complexity

| Operation | Time | Notes |
|---|---|---|
| `set` / `get` (builder) | O(1) | mmap byte + HashMap |
| `get` (reader, slot not in overflow) | O(1) | single mmap byte |
| `get` (reader, with index) | O(log n_index + log step) | index search in L1, then ≤ `step` data entries; ≤ 2 memory regions |
| `get` (reader, no index) | O(log n_overflow) | n_overflow ≤ 2048, so data ≤ 24 KB |
| `iter()` full scan | O(n + n_overflow) | merge-scan, no binary search |
| `sum`, distances | O(n) | via `iter()` / `zip(iter(), iter())` |
| `min` / `max` / `add` / `diff` | O(n) | via `other.iter()` + builder `set` |
| `close` | O(n_overflow log n_overflow) | sort + sequential write |
| `open` | O(n_index) | index copy into Vec |
| `build_from` | O(file_size) + O(n_overflow) | OS copy + HashMap load |

---

## PersistentCompactIntMatrix — column-major directory

### Design

A directory containing `meta.json` and N column files `col_000000.pciv`, `col_000001.pciv`, …, each a `PersistentCompactIntVec`. This is the type that implements `LayerData`. A single-column matrix is functionally equivalent to a vector but exposes the same interface as a multi-column matrix.

```
counts/
  meta.json          {"n": <n_slots>, "n_cols": <N>}
  col_000000.pciv
  col_000001.pciv
  ...
```
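
The naming scheme can be reproduced with a zero-padded format string. The helpers `col_path` and `meta_json` below are hypothetical and only mirror the layout shown above; the actual crate may build `meta.json` with a JSON library instead.

```rust
use std::path::{Path, PathBuf};

/// Path of column `c` inside the matrix directory: col_NNNNNN.pciv.
fn col_path(dir: &Path, c: usize) -> PathBuf {
    dir.join(format!("col_{:06}.pciv", c))
}

/// Minimal meta.json body with the two fields shown in the layout.
fn meta_json(n: usize, n_cols: usize) -> String {
    format!("{{\"n\": {}, \"n_cols\": {}}}", n, n_cols)
}
```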

### Builder (`PersistentCompactIntMatrixBuilder`)

```rust
struct PersistentCompactIntMatrixBuilder {
    dir:    PathBuf,
    n:      usize,
    n_cols: usize,
}
```

**`new(n: usize, dir: &Path) -> io::Result<Self>`**

Creates the directory (including parents). Does not write `meta.json` yet.

**`add_col(&mut self) -> io::Result<PersistentCompactIntVecBuilder>`**

Creates `col_NNNNNN.pciv` for the next column and returns its builder. The caller fills the column and calls `builder.close()` before calling `add_col` again.

**`close(self) -> io::Result<()>`**

Writes `meta.json` with the final `n` and `n_cols`. Must be called after all column builders are closed.

### Reader (`PersistentCompactIntMatrix`)

```rust
struct PersistentCompactIntMatrix {
    cols: Vec<PersistentCompactIntVec>,
    n:    usize,
}
```

**`open(dir: &Path) -> io::Result<Self>`**

Reads `meta.json`, opens all `col_NNNNNN.pciv` files.

**`row(slot: usize) -> Box<[u32]>`**

Returns the full row: `[col_0[slot], col_1[slot], …, col_{N-1}[slot]]`. Each column costs one primary-byte read, plus an overflow lookup when that byte is the sentinel 255. O(N) column accesses.
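
Gathering a row from column-major storage amounts to one indexed read per column. The sketch below uses `Vec<u32>` columns as a stand-in for `PersistentCompactIntVec`, so `col[slot]` replaces `col.get(slot)`.

```rust
/// Gather one row from column-major storage: one read per column.
fn row(cols: &[Vec<u32>], slot: usize) -> Box<[u32]> {
    cols.iter().map(|col| col[slot]).collect()
}
```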

**`col(c: usize) -> &PersistentCompactIntVec`**

Direct access to a single column for column-oriented operations (distance computations, iteration).

### LayerData implementation

```rust
impl LayerData for PersistentCompactIntMatrix {
    type Item = Box<[u32]>;
    fn open(layer_dir: &Path) -> OLMResult<Self> { /* opens layer_dir/counts/ */ }
    fn read(&self, slot: usize) -> Box<[u32]>    { self.row(slot) }
}
```

---

## Aggregation traits — `obicompactvec::traits`

`PersistentCompactIntMatrix` implements two aggregation traits used by `LayeredStore<S>` for cross-layer and cross-partition distance computations.

### ColumnWeights

```rust
impl ColumnWeights for PersistentCompactIntMatrix {
    fn col_weights(&self) -> Array1<u64>   // element c = self.col(c).sum()
}
```

`col_weights()[c]` = sum of all values in column `c` across all slots.

### CountPartials

```rust
impl CountPartials for PersistentCompactIntMatrix {
    // Self-contained partials (additive across layers, no external parameter)
    fn partial_bray(&self)                                      -> Array2<u64>
    fn partial_euclidean(&self)                                 -> Array2<f64>
    fn partial_threshold_jaccard(&self, threshold: u32)         -> (Array2<u64>, Array2<u64>)

    // Normalised partials (require global col_weights across all layers/partitions)
    fn partial_relfreq_bray(&self, global: &Array1<u64>)        -> Array2<f64>
    fn partial_relfreq_euclidean(&self, global: &Array1<u64>)   -> Array2<f64>
    fn partial_hellinger(&self, global: &Array1<u64>)           -> Array2<f64>

    // Provided finalisations (default implementations on the trait)
    fn bray_dist_matrix(&self)                                  -> Array2<f64>
    fn euclidean_dist_matrix(&self)                             -> Array2<f64>
    fn threshold_jaccard_dist_matrix(&self, threshold: u32)     -> Array2<f64>
    fn relfreq_bray_dist_matrix(&self)                          -> Array2<f64>
    fn relfreq_euclidean_dist_matrix(&self)                     -> Array2<f64>
    fn hellinger_dist_matrix(&self)                             -> Array2<f64>
}
```

**Self-contained partials** are additively decomposable: summing `partial_bray()` across all `(partition, layer)` pairs and finalising gives the same result as computing on the combined data.

**Normalised partials** require the global column weights (summed across all layers and all partitions). The `global` parameter must reflect the complete index, not a per-layer sum. The provided finalisations (`relfreq_bray_dist_matrix()` and its siblings) first call `col_weights()` (pass 1), then the normalised partial (pass 2). When called on a `LayeredStore<LayeredStore<…>>`, these two-pass calls cascade automatically through the blanket impls.
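
The need for global weights can be seen on a two-column example split over two layers. This sketch works on plain slices with hypothetical helper names and assumes non-zero global weights. The per-layer contribution is Σmin(pᵢ, qᵢ), and the distance is one minus the total.

```rust
/// Sum of one column's values (pass 1, per layer).
fn col_sum(a: &[u32]) -> u64 {
    a.iter().map(|&v| v as u64).sum::<u64>()
}

/// Per-layer contribution to sum(min(p_i, q_i)) for two columns (pass 2).
/// `wa` and `wb` must be the global column sums and are assumed non-zero.
fn partial_relfreq_bray(a: &[u32], b: &[u32], wa: u64, wb: u64) -> f64 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let p = x as f64 / wa as f64;
            let q = y as f64 / wb as f64;
            p.min(q)
        })
        .sum::<f64>()
}
```

With layer 1 holding `a = [2, 0]`, `b = [1, 1]` and layer 2 holding `a = [3]`, `b = [0]`, the global weights are 5 and 2. The two partials sum to 0.4, the same as on the concatenated columns, giving a distance of 0.6. Normalising layer 1 by its own sums (2 and 2) would yield 0.5 instead, which is why a per-layer sum must not be passed as `global`.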

**`partial_bray` returns `Array2<u64>`** (sum_min only, not a tuple). The denominator is always reconstructible as `col_weights()[i] + col_weights()[j]`.

**`partial_threshold_jaccard` returns `(inter, union)`** as a pair because `union[i,j]` is not reconstructible from per-column statistics — it depends on both columns simultaneously.
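
The additivity of the Bray partial, and the reconstruction of its denominator from column weights, can be checked on a small example. The sketch uses nested `Vec`s in place of `Array2`/`Array1` to stay dependency-free, and `finalise_bray` and `add_matrices` are hypothetical helpers.

```rust
/// Column sums; `cols[c][slot]` is the value of column c at that slot.
fn col_weights(cols: &[Vec<u32>]) -> Vec<u64> {
    cols.iter()
        .map(|c| c.iter().map(|&v| v as u64).sum::<u64>())
        .collect()
}

/// Pairwise sum of minima: out[i][j] = sum over slots of min(col_i, col_j).
fn partial_bray(cols: &[Vec<u32>]) -> Vec<Vec<u64>> {
    let k = cols.len();
    let mut out = vec![vec![0u64; k]; k];
    for i in 0..k {
        for j in 0..k {
            out[i][j] = cols[i]
                .iter()
                .zip(&cols[j])
                .map(|(&a, &b)| a.min(b) as u64)
                .sum::<u64>();
        }
    }
    out
}

/// Element-wise sum of two partial matrices of the same shape.
fn add_matrices(a: &[Vec<u64>], b: &[Vec<u64>]) -> Vec<Vec<u64>> {
    a.iter()
        .zip(b)
        .map(|(ra, rb)| ra.iter().zip(rb).map(|(x, y)| x + y).collect())
        .collect()
}

/// Turn accumulated sum_min and column weights into Bray-Curtis distances.
fn finalise_bray(sum_min: &[Vec<u64>], w: &[u64]) -> Vec<Vec<f64>> {
    let k = w.len();
    let mut d = vec![vec![0.0; k]; k];
    for i in 0..k {
        for j in 0..k {
            let denom = w[i] + w[j];
            if denom > 0 {
                d[i][j] = 1.0 - 2.0 * sum_min[i][j] as f64 / denom as f64;
            }
        }
    }
    d
}
```

Summing the partials of two layers and finalising once gives the same matrix as computing on the concatenated columns, because both `sum_min` and the column weights are plain sums over slots.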