`PersistentCompactIntVec` stores a dense array of non-negative integers indexed by MPHF slot where the vast majority of values are small (0–254) and large values are rare. It is designed for mmap-compatible random and sequential access with minimal memory footprint and optimal cache behaviour.
Motivation from observed count distributions in genomics data: 99.9% of k-mer counts fit in a u8; overflow (count ≥ 255) affects ~0.07% of distinct k-mers but can reach values above 10⁶ (chloroplast, ribosomal repeats).
`PersistentCompactIntMatrix` wraps multiple `PersistentCompactIntVec` columns in a directory, exposing a column-major matrix with row-access API. A vector is a matrix with 1 column.
1.**Primary array** — `[u8; n]`, stored at offset 40 in the PCIV file and mmap'd. Values 0–254 are stored directly. Value **255 is a sentinel** meaning "look in overflow".
2.**Overflow section** — sorted list of `(slot: u64, value: u32)` pairs for all slots where the true value ≥ 255, with a **sparse L1-fitting index** for fast lookup.
Used during construction. The primary section is **mmap'd immediately** at construction time (both for `new` and `build_from`), so the file exists and is addressable from the start. The overflow is held in a `HashMap<usize, u32>` in RAM.
Creates the file, pre-allocates `HEADER_SIZE + n` zero bytes, mmaps it. The primary is zero-initialised (all slots = 0). Returns immediately ready for `set` / `get`.
Copies the source PCIV file to `path` (OS-level copy — no per-slot iteration), mmaps the copy, then loads the overflow section into a `HashMap`. Initialisation cost: O(file copy) + O(n_overflow), not O(n).
At `close()`, the primary section is **not rewritten**: it is already in the file via mmap. Only the overflow data, the sparse index, and the header are updated.
Direct mmap byte access for the primary; HashMap for the overflow. Both O(1). Mutations can move a slot between tiers freely (downward mutation removes the HashMap entry; upward mutation adds it).
Merge-scan: reads primary bytes in order; on sentinel 255, advances a sequential pointer into the sorted data section rather than doing a binary search. This gives O(n + n_overflow) with no random access into the data section.
A directory containing `meta.json` and N column files `col_000000.pciv`, `col_000001.pciv`, …, each a `PersistentCompactIntVec`. This is the type used by `LayerData` — a single-column matrix is functionally equivalent to a vector but shares the same interface as multi-column matrices.
Creates `col_NNNNNN.pciv` for the next column and returns its builder. The caller fills the column and calls `builder.close()` before calling `add_col` again.
**`close(self) -> io::Result<()>`**
Writes `meta.json` with the final `n` and `n_cols`. Must be called after all column builders are closed.
### Reader (`PersistentCompactIntMatrix`)
```rust
structPersistentCompactIntMatrix{
cols: Vec<PersistentCompactIntVec>,
n: usize,
}
```
**`open(dir: &Path) -> io::Result<Self>`**
Reads `meta.json`, opens all `col_NNNNNN.pciv` files.
**`row(slot: usize) -> Box<[u32]>`**
Returns the full row: `[col_0[slot], col_1[slot], …, col_{N-1}[slot]]`. One mmap access per column. O(N).
**`col(c: usize) -> &PersistentCompactIntVec`**
Direct access to a single column for column-oriented operations (distance computations, iteration).
**Self-contained partials** are additively decomposable: summing `partial_bray()` across all `(partition, layer)` pairs and finalising gives the same result as computing on the combined data.
**Normalised partials** require the global column weights (sum across all layers and all partitions). The `global` parameter must reflect the complete index, not a per-layer sum. The provided `relfreq_bray_dist_matrix()` etc. call `col_weights()` first (pass 1) then the normalised partial (pass 2); when called on a `LayeredStore<LayeredStore<…>>` these two-pass calls cascade automatically through the blanket impls.
**`partial_bray` returns `Array2<u64>`** (sum_min only, not a tuple). The denominator is always reconstructible as `col_weights()[i] + col_weights()[j]`.
**`partial_threshold_jaccard` returns `(inter, union)`** as a pair because `union[i,j]` is not reconstructible from per-column statistics — it depends on both columns simultaneously.