feat: introduce column-major matrix storage and migrate layered map
Introduces `PersistentBitMatrix` and `PersistentCompactIntMatrix` to replace single-file vector storage with a column-major, directory-based layout. Each column is persisted as an individual file alongside a lightweight `meta.json` for dimension tracking. Migrates `obilayeredmap` to use these multi-column structures, updating Rust APIs, query return types, and build signatures. Includes comprehensive documentation, unit and integration tests for persistence and accessors, and refactors distance calculation helpers.
@@ -1,4 +1,4 @@
|
||||
# PersistentCompactIntVec
|
||||
# PersistentCompactIntVec and PersistentCompactIntMatrix
|
||||
|
||||
## Purpose
|
||||
|
||||
@@ -6,78 +6,81 @@
|
||||
|
||||
Motivation from observed count distributions in genomics data: 99.9% of k-mer counts fit in a u8; overflow (count ≥ 255) affects ~0.07% of distinct k-mers but can reach values above 10⁶ (chloroplast, ribosomal repeats).
|
||||
|
||||
`PersistentCompactIntMatrix` wraps multiple `PersistentCompactIntVec` columns in a directory, exposing a column-major matrix with row-access API. A vector is a matrix with 1 column.
|
||||
|
||||
---
|
||||
|
||||
## Design
|
||||
## PersistentCompactIntVec — single-column file
|
||||
|
||||
### Design
|
||||
|
||||
Two-tier structure:
|
||||
|
||||
1. **Primary array** — `[u8; n]`, stored at offset 24 in the PCIV file and mmap'd. Values 0–254 are stored directly. Value **255 is a sentinel** meaning "look in overflow".
|
||||
2. **Overflow section** — sorted list of `(slot: u32, value: u32)` pairs for all slots where the true value ≥ 255, with a **sparse L1-fitting index** for fast lookup.
|
||||
1. **Primary array** — `[u8; n]`, stored at offset 40 in the PCIV file and mmap'd. Values 0–254 are stored directly. Value **255 is a sentinel** meaning "look in overflow".
|
||||
2. **Overflow section** — sorted list of `(slot: u64, value: u32)` pairs for all slots where the true value ≥ 255, with a **sparse L1-fitting index** for fast lookup.
|
||||
|
||||
```
primary[slot] < 255   → return primary[slot]
primary[slot] == 255  → binary search in overflow
```
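
The lookup rule can be sketched in a few lines. This is a minimal illustration, assuming plain in-memory slices stand in for the mmap'd primary and overflow sections; `two_tier_get` is a hypothetical name, not part of the crate's API.

```rust
/// Sentinel byte in the primary array meaning "the value lives in overflow".
const SENTINEL: u8 = 255;

/// Two-tier lookup over in-memory stand-ins for the file sections.
/// `overflow` must be sorted by slot, as in the on-disk data section.
fn two_tier_get(primary: &[u8], overflow: &[(u64, u32)], slot: usize) -> u32 {
    let byte = primary[slot];
    if byte < SENTINEL {
        // Values 0..=254 are stored directly in the primary array.
        return u32::from(byte);
    }
    // Sentinel: the true value (>= 255) is in the sorted overflow list.
    let pos = overflow
        .binary_search_by_key(&(slot as u64), |&(s, _)| s)
        .expect("a sentinel slot must have an overflow entry");
    overflow[pos].1
}
```

Note that a true value of exactly 255 also goes to overflow, since the byte 255 is reserved for the sentinel.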
|
||||
|
||||
---
|
||||
### File format
|
||||
|
||||
## Single-file format
|
||||
|
||||
Everything is stored in a single `.pciv` file. Write order matches computation order: the header placeholder is written first, then primary (known at `new()`), then overflow data and index (known at `close()`), then the header is overwritten at offset 0.
|
||||
Single `.pciv` file. Write order: header placeholder → primary → overflow + index → header overwrite at offset 0.
|
||||
|
||||
```
|
||||
offset 0:
|
||||
magic: [u8; 4] = b"PCIV"
|
||||
n: u64 number of slots
|
||||
n_overflow: u32 number of overflow entries
|
||||
step: u32 sparse index step (0 = no index)
|
||||
n_index: u32 number of index entries
|
||||
magic: [u8; 4] = b"PCIV"
|
||||
_pad: [u8; 4] = 0
|
||||
n: u64 number of slots
|
||||
n_overflow: u64 number of overflow entries
|
||||
n_index: u64 number of sparse index entries
|
||||
step: u64 sparse index step (0 = no index)
|
||||
|
||||
offset 24:
|
||||
primary: [u8; n] one byte per slot, 255 = overflow sentinel
|
||||
offset 40:
|
||||
primary: [u8; n] one byte per slot, 255 = overflow sentinel
|
||||
|
||||
offset 24 + n:
|
||||
data: [(slot: u32, value: u32); n_overflow] sorted by slot
|
||||
offset 40 + n:
|
||||
data: [(slot: u64, value: u32); n_overflow] 12 bytes each, sorted by slot
|
||||
|
||||
offset 24 + n + n_overflow × 8:
|
||||
index: [(slot: u32, pos: u32); n_index] sparse index
|
||||
offset 40 + n + n_overflow × 12:
|
||||
index: [(slot: u64, pos: u64); n_index] 16 bytes each, sparse index
|
||||
```
|
||||
|
||||
The index entries point into `data`: `index[i] = (slot of data[i×step], i×step)`.
|
||||
|
||||
---
|
||||
All integer fields are little-endian. Slot indices are stored as `u64` in the file; they are `usize` in Rust code.
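
As an illustration of the 40-byte layout, the header can be decoded with the standard library alone. This is a sketch under the field order listed above (magic, padding, `n`, `n_overflow`, `n_index`, `step`); the `Header` struct and the helper names are hypothetical, and the real reader may validate more than the magic.

```rust
/// Size of the fixed header in the 40-byte layout.
const HEADER_SIZE: usize = 40;

#[derive(Debug, PartialEq)]
struct Header {
    n: u64,
    n_overflow: u64,
    n_index: u64,
    step: u64,
}

/// Read a little-endian u64 starting at `offset`.
fn read_u64(bytes: &[u8], offset: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[offset..offset + 8]);
    u64::from_le_bytes(buf)
}

/// Decode the header; returns None if the buffer is too short or the magic is wrong.
fn parse_header(bytes: &[u8]) -> Option<Header> {
    if bytes.len() < HEADER_SIZE || &bytes[0..4] != b"PCIV" {
        return None;
    }
    // Bytes 4..8 are padding; the four u64 fields follow at 8, 16, 24, 32.
    Some(Header {
        n: read_u64(bytes, 8),
        n_overflow: read_u64(bytes, 16),
        n_index: read_u64(bytes, 24),
        step: read_u64(bytes, 32),
    })
}

/// Byte offsets of the primary, data, and index sections.
fn section_offsets(h: &Header) -> (u64, u64, u64) {
    let primary = HEADER_SIZE as u64;
    let data = primary + h.n; // one byte per slot
    let index = data + h.n_overflow * 12; // 12 bytes per overflow entry
    (primary, data, index)
}
```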
|
||||
|
||||
## Lifecycle
|
||||
### Lifecycle
|
||||
|
||||
### Builder (`PersistentCompactIntVecBuilder`)
|
||||
#### Builder (`PersistentCompactIntVecBuilder`)
|
||||
|
||||
Used during construction. The primary section is **mmap'd immediately** at construction time (both for `new` and `build_from`), so the file exists and is addressable from the start. The overflow is held in a `HashMap<u64, u32>` in RAM.
|
||||
Used during construction. The primary section is **mmap'd immediately** at construction time (both for `new` and `build_from`), so the file exists and is addressable from the start. The overflow is held in a `HashMap<usize, u32>` in RAM.
|
||||
|
||||
```rust
|
||||
struct PersistentCompactIntVecBuilder {
|
||||
path: PathBuf,
|
||||
mmap: MmapMut, // primary section live in the file from the start
|
||||
mmap: MmapMut, // primary section live in the file from the start
|
||||
n: usize,
|
||||
overflow: HashMap<u64, u32>, // values ≥ 255
|
||||
overflow: HashMap<usize, u32>, // values ≥ 255
|
||||
}
|
||||
```
|
||||
|
||||
#### `new(n: usize, path: &Path) -> io::Result<Self>`
|
||||
**`new(n: usize, path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Creates the file, pre-allocates `HEADER_SIZE + n` zero bytes, mmaps it. The primary is zero-initialised (all slots = 0). Returns immediately ready for `set` / `get`.
|
||||
|
||||
#### `build_from(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>`
|
||||
**`build_from(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Copies the source PCIV file to `path` (OS-level copy — no per-slot iteration), mmaps the copy, then loads the overflow section into a `HashMap`. Initialisation cost: O(file copy) + O(n_overflow), not O(n).
|
||||
|
||||
At `close()`, the primary section is **not rewritten**: it is already in the file via mmap. Only the overflow data, the sparse index, and the header are updated.
|
||||
|
||||
#### `set(slot: u64, value: u32)` / `get(slot: u64) -> u32`
|
||||
**`set(slot: usize, value: u32)` / `get(slot: usize) -> u32`**
|
||||
|
||||
Direct mmap byte access for the primary; HashMap for the overflow. Both O(1). Mutations can move a slot between tiers freely (downward mutation removes the HashMap entry; upward mutation adds it).
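
The tier movement described here can be sketched as follows. This is an assumption-laden illustration: a `Vec<u8>` replaces the mmap'd primary, and `BuilderSketch` is a hypothetical stand-in for `PersistentCompactIntVecBuilder`, not its actual implementation.

```rust
use std::collections::HashMap;

const SENTINEL: u8 = 255;

/// In-memory stand-in for the builder: a Vec<u8> replaces the mmap'd primary.
struct BuilderSketch {
    primary: Vec<u8>,
    overflow: HashMap<usize, u32>,
}

impl BuilderSketch {
    fn new(n: usize) -> Self {
        Self { primary: vec![0; n], overflow: HashMap::new() }
    }

    fn set(&mut self, slot: usize, value: u32) {
        if value < u32::from(SENTINEL) {
            // Downward mutation: store directly and drop any overflow entry.
            self.primary[slot] = value as u8;
            self.overflow.remove(&slot);
        } else {
            // Upward mutation: mark the slot and keep the value in the map.
            self.primary[slot] = SENTINEL;
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: usize) -> u32 {
        match self.primary[slot] {
            SENTINEL => self.overflow[&slot],
            byte => u32::from(byte),
        }
    }
}
```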
|
||||
|
||||
#### Element-wise operations — `min`, `max`, `add`, `diff`
|
||||
**Element-wise operations — `min`, `max`, `add`, `diff`**
|
||||
|
||||
Each takes a `&PersistentCompactIntVec` of equal length and updates `self` in place via `set`:
|
||||
|
||||
@@ -90,17 +93,15 @@ builder.diff(&other); // self[i] = self[i].saturating_sub(other[i])
|
||||
|
||||
All iterate `other` with `other.iter()` (merge-scan, O(n_other)).
|
||||
|
||||
#### `close(self) -> io::Result<()>`
|
||||
**`close(self) -> io::Result<()>`**
|
||||
|
||||
1. Flush and drop the mmap (primary changes are now on disk).
|
||||
2. Sort the overflow HashMap into `Vec<(u32, u32)>`.
|
||||
2. Sort the overflow HashMap into `Vec<(usize, u32)>`.
|
||||
3. Truncate the file to `HEADER_SIZE + n` (removes old data+index if `build_from` was used).
|
||||
4. Append sorted overflow data, then sparse index.
|
||||
5. Seek to offset 0, overwrite the header with final values.
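
Steps 2 and 4 can be sketched without any file I/O. The helper names below are hypothetical, and the sketch assumes the index holds one entry for every `step`-th data entry starting at position 0, which matches `index[i] = (slot of data[i×step], i×step)`.

```rust
use std::collections::HashMap;

/// Step 2: turn the overflow map into a list sorted by slot.
fn sorted_overflow(overflow: &HashMap<usize, u32>) -> Vec<(usize, u32)> {
    let mut data: Vec<(usize, u32)> = overflow.iter().map(|(&s, &v)| (s, v)).collect();
    data.sort_unstable_by_key(|&(slot, _)| slot);
    data
}

/// Step 4 (index part): one (slot, pos) entry every `step` data entries.
/// A step of 0 means "no index".
fn sparse_index(data: &[(usize, u32)], step: usize) -> Vec<(usize, usize)> {
    if step == 0 {
        return Vec::new();
    }
    (0..data.len())
        .step_by(step)
        .map(|pos| (data[pos].0, pos))
        .collect()
}
```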
|
||||
|
||||
---
|
||||
|
||||
### Reader (`PersistentCompactIntVec`)
|
||||
#### Reader (`PersistentCompactIntVec`)
|
||||
|
||||
Used at query time. The whole file is mmap'd; only the sparse index is copied into a `Vec` at open time (≤ 32 KB, L1-resident).
|
||||
|
||||
@@ -109,19 +110,19 @@ struct PersistentCompactIntVec {
|
||||
mmap: Mmap,
|
||||
n: usize,
|
||||
n_overflow: usize,
|
||||
step: u32,
|
||||
index: Vec<(u32, u32)>, // L1-resident
|
||||
primary_offset: usize, // = 24 (HEADER_SIZE)
|
||||
data_offset: usize, // = 24 + n
|
||||
step: usize,
|
||||
index: Vec<(usize, usize)>, // (slot, pos) — L1-resident
|
||||
primary_offset: usize, // = 40 (HEADER_SIZE)
|
||||
data_offset: usize, // = 40 + n
|
||||
path: PathBuf,
|
||||
}
|
||||
```
|
||||
|
||||
#### `open(path: &Path) -> io::Result<Self>`
|
||||
**`open(path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Mmaps the file, parses the 24-byte header, copies the sparse index entries into a `Vec`. The primary and data sections stay mmap'd.
|
||||
Mmaps the file, parses the 40-byte header, copies the sparse index entries into a `Vec`. The primary and data sections stay mmap'd.
|
||||
|
||||
#### `get(slot: u64) -> u32` — random access
|
||||
**`get(slot: usize) -> u32` — random access**
|
||||
|
||||
```
|
||||
primary[slot] < 255 → return it directly
|
||||
@@ -134,19 +135,19 @@ step > 0:
|
||||
binary_search(data[index[i].pos .. index[i+1].pos], slot)
|
||||
```
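
A sketch of the index-guided overflow lookup, assuming in-memory slices for `data` and `index`. `overflow_lookup` is a hypothetical helper; the real reader works on the mmap'd data section.

```rust
/// Find `slot` in the sorted overflow `data`, using the sparse `index`
/// (entries of (slot, pos)) to narrow the binary-search window.
/// An empty index means the whole data section is searched.
fn overflow_lookup(
    data: &[(usize, u32)],
    index: &[(usize, usize)],
    slot: usize,
) -> Option<u32> {
    let (lo, hi) = if index.is_empty() {
        (0, data.len())
    } else {
        // Number of index entries whose slot is <= the target.
        let k = index.partition_point(|&(s, _)| s <= slot);
        if k == 0 {
            return None; // target is smaller than every overflow slot
        }
        let lo = index[k - 1].1;
        let hi = index.get(k).map_or(data.len(), |&(_, pos)| pos);
        (lo, hi)
    };
    data[lo..hi]
        .binary_search_by_key(&slot, |&(s, _)| s)
        .ok()
        .map(|i| data[lo + i].1)
}
```

The window never holds more than `step` entries, so the second search touches only a small, contiguous part of the data section.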
|
||||
|
||||
#### `iter() -> Iter<'_>` — sequential scan, O(n)
|
||||
**`iter() -> Iter<'_>` — sequential scan, O(n)**
|
||||
|
||||
Merge-scan: reads primary bytes in order; on sentinel 255, advances a sequential pointer into the sorted data section rather than doing a binary search. This gives O(n + n_overflow) with no random access into the data section.
|
||||
|
||||
`Iter` implements `ExactSizeIterator`. `&PersistentCompactIntVec` implements `IntoIterator`.
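
The merge-scan can be illustrated with a small iterator adaptor. This is a sketch over in-memory slices; `merge_scan` is a hypothetical function and does not reproduce the actual `Iter` type or its `ExactSizeIterator` implementation.

```rust
/// Merge-scan over stand-ins for the primary and data sections.
/// `data` must be sorted by slot and contain one entry per sentinel byte.
fn merge_scan<'a>(
    primary: &'a [u8],
    data: &'a [(usize, u32)],
) -> impl Iterator<Item = u32> + 'a {
    let mut next_overflow = 0; // sequential cursor into the data section
    primary.iter().map(move |&byte| {
        if byte == 255 {
            // Sentinels appear in slot order, so the next data entry is ours.
            let value = data[next_overflow].1;
            next_overflow += 1;
            value
        } else {
            u32::from(byte)
        }
    })
}
```

Because the data section is sorted by slot, the cursor only ever moves forward, which is what avoids a binary search per sentinel.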
|
||||
|
||||
#### Aggregate
|
||||
**Aggregate**
|
||||
|
||||
```rust
fn sum(&self) -> u64   // Σ self[i] as u64, via iter()
```
|
||||
|
||||
#### Distance methods
|
||||
**Distance methods**
|
||||
|
||||
All take `&other` of equal length, iterate both with `zip(self.iter(), other.iter())`, and return `f64`.
|
||||
|
||||
@@ -158,29 +159,23 @@ All take `&other` of equal length, iterate both with `zip(self.iter(), other.ite
|
||||
| `relfreq_euclidean_dist` | Euclidean on relative frequencies |
|
||||
| `hellinger_euclidean_dist` | `√Σ(√pᵢ − √qᵢ)²` — Euclidean on sqrt(relfreq) |
|
||||
| `hellinger_dist` | `hellinger_euclidean_dist / √2` — standard Hellinger distance ∈ [0, 1] |
|
||||
| `threshold_jaccard_dist(&other, threshold: u32)` | `1 − |A∩B| / |A∪B|` where presence iff count ≥ threshold |
|
||||
| `threshold_jaccard_dist(&other, threshold: u32)` | `1 − \|A∩B\| / \|A∪B\|` where presence iff count ≥ threshold |
|
||||
| `jaccard_dist` | `threshold_jaccard_dist(&other, 1)` |
|
||||
|
||||
Edge cases (both vectors all-zero, or union empty for Jaccard): distance = 0.0.
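
Two of the distances can be sketched over plain `u32` slices. This illustrates the formulas in the table and is not the crate's code. The document only specifies the both-zero case for Hellinger, so treating the relative frequencies of an all-zero vector as 0 is an assumption of this sketch.

```rust
/// 1 − |A∩B| / |A∪B|, where a slot is "present" iff its count >= threshold.
fn threshold_jaccard_dist(a: &[u32], b: &[u32], threshold: u32) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut inter, mut union) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        let (pa, pb) = (x >= threshold, y >= threshold);
        inter += u64::from(pa && pb);
        union += u64::from(pa || pb);
    }
    // Empty union: distance is defined as 0.0.
    if union == 0 { 0.0 } else { 1.0 - inter as f64 / union as f64 }
}

/// sqrt(Σ(√p_i − √q_i)²) / √2 on relative frequencies p and q.
fn hellinger_dist(a: &[u32], b: &[u32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let total = |v: &[u32]| v.iter().map(|&x| f64::from(x)).sum::<f64>();
    let (sa, sb) = (total(a), total(b));
    // Assumption: an all-zero vector has relative frequency 0 everywhere.
    let freq = |x: u32, s: f64| if s > 0.0 { f64::from(x) / s } else { 0.0 };
    let sum_sq: f64 = a
        .iter()
        .zip(b)
        .map(|(&x, &y)| (freq(x, sa).sqrt() - freq(y, sb).sqrt()).powi(2))
        .sum();
    sum_sq.sqrt() / std::f64::consts::SQRT_2
}
```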
|
||||
|
||||
---
|
||||
|
||||
## Step computation
|
||||
### Step computation
|
||||
|
||||
Chosen at `close()` once `n_overflow` is known:
|
||||
|
||||
```
|
||||
L1_entries = 32 768 / 8 = 4096
|
||||
L1_INDEX_ENTRIES = 2048
|
||||
|
||||
step = 0 if n_overflow ≤ 4096
|
||||
step = ⌈n_overflow / 4096⌉ otherwise
|
||||
step = 0 if n_overflow ≤ 2048
|
||||
step = ⌈n_overflow / 2048⌉ otherwise
|
||||
```
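
The rule with `L1_INDEX_ENTRIES = 2048` translates directly into a small function. The name `sparse_index_step` is hypothetical; only the constant and the formula come from the text above.

```rust
/// Maximum number of sparse index entries.
const L1_INDEX_ENTRIES: usize = 2048;

/// Sparse index step for a given number of overflow entries.
/// Returns 0 ("no index") when the overflow list is small enough.
fn sparse_index_step(n_overflow: usize) -> usize {
    if n_overflow <= L1_INDEX_ENTRIES {
        0
    } else {
        // Ceiling division: ⌈n_overflow / L1_INDEX_ENTRIES⌉
        (n_overflow + L1_INDEX_ENTRIES - 1) / L1_INDEX_ENTRIES
    }
}
```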
|
||||
|
||||
For the Betula nana reference (359 044 overflows): step = 176, n_index = 2 040 entries of 16 bytes = 31.9 KB.
|
||||
|
||||
---
|
||||
|
||||
## Complexity
|
||||
### Complexity
|
||||
|
||||
| Operation | Time | Notes |
|
||||
|---|---|---|
|
||||
@@ -194,3 +189,72 @@ For the Betula nana reference (359 044 overflows): step = 88, n_index = 4 080 en
|
||||
| `close` | O(n_overflow log n_overflow) | sort + sequential write |
|
||||
| `open` | O(n_index) | index copy into Vec |
|
||||
| `build_from` | O(file_size) + O(n_overflow) | OS copy + HashMap load |

---

## PersistentCompactIntMatrix — column-major directory

### Design

A directory containing `meta.json` and N column files `col_000000.pciv`, `col_000001.pciv`, …, each a `PersistentCompactIntVec`. This is the type used by `LayerData`; a single-column matrix is functionally equivalent to a vector but shares the same interface as multi-column matrices.

```
counts/
  meta.json          {"n": <n_slots>, "n_cols": <N>}
  col_000000.pciv
  col_000001.pciv
  ...
```

### Builder (`PersistentCompactIntMatrixBuilder`)

```rust
struct PersistentCompactIntMatrixBuilder {
    dir: PathBuf,
    n: usize,
    n_cols: usize,
}
```

**`new(n: usize, dir: &Path) -> io::Result<Self>`**

Creates the directory (including parents). Does not write `meta.json` yet.

**`add_col(&mut self) -> io::Result<PersistentCompactIntVecBuilder>`**

Creates `col_NNNNNN.pciv` for the next column and returns its builder. The caller fills the column and calls `builder.close()` before calling `add_col` again.

**`close(self) -> io::Result<()>`**

Writes `meta.json` with the final `n` and `n_cols`. Must be called after all column builders are closed.
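
The on-disk naming can be sketched with the standard library. Both helpers are hypothetical: the six-digit zero padding follows the `col_000000.pciv` pattern shown in the layout, and the JSON string mirrors the two fields listed for `meta.json`. The real builder may serialise it differently.

```rust
use std::path::{Path, PathBuf};

/// Path of column `col` inside the matrix directory, e.g. `col_000001.pciv`.
fn col_path(dir: &Path, col: usize) -> PathBuf {
    dir.join(format!("col_{:06}.pciv", col))
}

/// Contents of `meta.json` for a matrix with `n` slots and `n_cols` columns.
fn meta_json(n: usize, n_cols: usize) -> String {
    format!("{{\"n\": {}, \"n_cols\": {}}}", n, n_cols)
}
```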

### Reader (`PersistentCompactIntMatrix`)

```rust
struct PersistentCompactIntMatrix {
    cols: Vec<PersistentCompactIntVec>,
    n: usize,
}
```

**`open(dir: &Path) -> io::Result<Self>`**

Reads `meta.json`, opens all `col_NNNNNN.pciv` files.

**`row(slot: usize) -> Box<[u32]>`**

Returns the full row: `[col_0[slot], col_1[slot], …, col_{N-1}[slot]]`. One mmap access per column. O(N).
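
Row assembly over column-major storage can be sketched with in-memory columns. Here `Vec<u32>` stands in for `PersistentCompactIntVec`, and this free function `row` is a hypothetical illustration rather than the actual method.

```rust
/// Gather one value per column at `slot` into a boxed row.
/// Each `Vec<u32>` stands in for one column file.
fn row(cols: &[Vec<u32>], slot: usize) -> Box<[u32]> {
    cols.iter().map(|col| col[slot]).collect()
}
```

Each row read touches every column once, which is the O(N) cost noted above; column-oriented work such as distances avoids this by operating on one column at a time.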

**`col(c: usize) -> &PersistentCompactIntVec`**

Direct access to a single column for column-oriented operations (distance computations, iteration).

### LayerData implementation

```rust
impl LayerData for PersistentCompactIntMatrix {
    type Item = Box<[u32]>;
    fn open(layer_dir: &Path) -> OLMResult<Self> { /* opens layer_dir/counts/ */ }
    fn read(&self, slot: usize) -> Box<[u32]> { self.row(slot) }
}
```