feat: introduce column-major matrix storage and migrate layered map
Introduces `PersistentBitMatrix` and `PersistentCompactIntMatrix` to replace single-file vector storage with a column-major, directory-based layout. Each column is persisted as an individual file alongside a lightweight `meta.json` for dimension tracking. Migrates `obilayeredmap` to use these multi-column structures, updating Rust APIs, query return types, and build signatures. Includes comprehensive documentation, unit and integration tests for persistence and accessors, and refactors distance calculation helpers.
@@ -10,26 +10,26 @@
|
||||
|
||||
The MPHF + evidence infrastructure is fixed for all modes. The **payload** — data associated with each slot — is orthogonal and varies by mode.
|
||||
|
||||
| Mode | Description | Payload type | File |
|
||||
| Mode | Description | Payload type | Storage |
|
||||
|---|---|---|---|
|
||||
| 1. Set | membership test only | `()` | — |
|
||||
| 2. Set with count | occurrences per kmer per sample | `PersistentCompactIntVec` | `counts.pciv` |
|
||||
| 3. Presence/absence matrix | which genomes contain each kmer | `PersistentBitVec` per genome | `presence_N.pbiv` |
|
||||
| 4. Count matrix | occurrences per kmer per genome | `PersistentCompactIntVec` per genome | `counts_N.pciv` |
|
||||
| 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory |
|
||||
| 3. Presence/absence matrix | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |
|
||||
| 4. Count matrix | occurrences per kmer per genome | `PersistentCompactIntMatrix` | `counts/` directory |
|
||||
|
||||
Both `PersistentCompactIntVec` and `PersistentBitVec` come from the `obicompactvec` crate. Modes 3 and 4 are not yet implemented; the per-genome multi-file layout and query API remain to be designed.
|
||||
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate. Mode 3 has a build path (`Layer::<PersistentBitMatrix>::build_presence`); mode 4 is not yet implemented.
|
||||
|
||||
### Payload for mode 2: PersistentCompactIntVec
|
||||
### Payload for modes 2/4: PersistentCompactIntMatrix
|
||||
|
||||
`PersistentCompactIntVec` (PCIV) stores one `u32` count per MPHF slot in a single mmap'd `.pciv` file. Its encoding: a primary `u8` array (value 255 = overflow sentinel) backed by a sorted overflow section of `(slot: u64, value: u32)` entries and a sparse L1-fitting index for fast binary search. This handles the geometric count distribution efficiently — most values fit in 1 byte, overflow entries are rare.
|
||||
`PersistentCompactIntMatrix` is a column-major matrix stored in a directory: one `col_NNNNNN.pciv` file per column, plus a `meta.json`. Each column is a `PersistentCompactIntVec` — a mmap'd PCIV file with a `u8` primary array (255 = overflow sentinel), a sorted overflow section of `(slot: u64, value: u32)` entries, and a sparse L1-fitting index.
|
||||
|
||||
Capacity: 0 to u32::MAX per slot. No separate decision needed on bit-width: PCIV adapts to the data.
|
||||
Mode 2 writes 1 column per layer (one sample). Mode 4, once implemented, will write G columns (one per genome). `read(slot)` returns `Box<[u32]>`: the full row across all columns.
|
||||
|
||||
### Payload for mode 3/4: PersistentBitVec / PersistentCompactIntVec
|
||||
### Payload for mode 3: PersistentBitMatrix
|
||||
|
||||
`PersistentBitVec` (PBIV) stores one bit per MPHF slot in a mmap'd `.pbiv` file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). One PBIV per genome gives a column-major presence/absence matrix, making per-genome set operations cache-friendly.
|
||||
`PersistentBitMatrix` is a column-major bit matrix stored in a directory: one `col_NNNNNN.pbiv` per genome, plus `meta.json`. Each column is a `PersistentBitVec` — a mmap'd PBIV file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). `read(slot)` returns `Box<[bool]>` — the presence vector across all genomes.
|
||||
|
||||
Mode 4 replaces PBIV with PCIV per genome. Multi-file layout and query API are not yet designed.
|
||||
Column-major layout makes per-genome set operations cache-friendly; the full row is assembled on demand at query time.
|
||||
|
||||
---
|
||||
|
||||
@@ -57,14 +57,15 @@ pub struct Hit<T = ()> {
|
||||
}
|
||||
```
|
||||
|
||||
`LayerData` covers the **read path only** (`open` + `read`). The write path (build) is intentionally not in the trait — build signatures differ between modes (mode 1 takes no extra argument, mode 2 takes a `count_of` closure) and forcing this into a trait would require an associated `Context` type with no benefit over specialized `impl` blocks.
|
||||
`LayerData` covers the **read path only** (`open` + `read`). The write path (build) is intentionally not in the trait — build signatures differ between modes and forcing this into a trait would require an associated `Context` type with no benefit over specialized `impl` blocks.
|
||||
|
||||
Implemented concrete types:
|
||||
|
||||
| Type | `Item` | Description |
|
||||
|---|---|---|
|
||||
| `()` | `()` | mode 1 — membership only |
|
||||
| `PersistentCompactIntVec` | `u32` | mode 2 — per-slot count |
|
||||
| `PersistentCompactIntMatrix` | `Box<[u32]>` | modes 2/4 — one count per column |
|
||||
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — one presence bit per column |
|
||||
|
||||
`LayeredMap` mirrors the same parameterisation: `LayeredMap<D: LayerData = ()>`.
|
||||
|
||||
@@ -81,8 +82,14 @@ index_root/ ← LayeredMap (collection)
|
||||
unitigs.bin
|
||||
unitigs.bin.idx
|
||||
evidence.bin
|
||||
counts.pciv [mode 2 only]
|
||||
presence_N.pbiv [mode 3/4, one per genome — not yet implemented]
|
||||
counts/ [modes 2/4]
|
||||
meta.json {"n": N, "n_cols": 1}
|
||||
col_000000.pciv
|
||||
presence/ [mode 3]
|
||||
meta.json {"n": N, "n_cols": G}
|
||||
col_000000.pbiv
|
||||
col_000001.pbiv
|
||||
...
|
||||
layer_1/
|
||||
...
|
||||
part_00001/
|
||||
@@ -106,7 +113,8 @@ layer_N/
|
||||
unitigs.bin — packed 2-bit nucleotide sequences (obiskio binary format)
|
||||
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
|
||||
evidence.bin — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
|
||||
counts.pciv — [mode 2] PersistentCompactIntVec: one u32 per slot
|
||||
counts/ — [modes 2/4] PersistentCompactIntMatrix
|
||||
presence/ — [mode 3] PersistentBitMatrix
|
||||
```
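The `(unitig_id: 25 | rank: 7)` packing of `evidence.bin` could look roughly like the sketch below. The bit order is an assumption: `unitig_id` is placed in the upper 25 bits and `rank` in the lower 7. `pack_evidence` and `unpack_evidence` are hypothetical helper names, not functions from the crate.

```rust
// Assumed layout: [ unitig_id: 25 bits | rank: 7 bits ], rank in the low bits.
const RANK_BITS: u32 = 7;
const UNITIG_BITS: u32 = 25;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1;

/// Pack one evidence word. Returns `None` if either field does not fit.
fn pack_evidence(unitig_id: u32, rank: u32) -> Option<u32> {
    if unitig_id >= (1 << UNITIG_BITS) || rank > RANK_MASK {
        return None;
    }
    Some((unitig_id << RANK_BITS) | rank)
}

/// Split an evidence word back into `(unitig_id, rank)`.
fn unpack_evidence(word: u32) -> (u32, u32) {
    (word >> RANK_BITS, word & RANK_MASK)
}
```

Under this assumption a single `u32` addresses up to 2²⁵ unitigs with up to 128 kmer positions each.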
|
||||
|
||||
`unitigs.bin` is the packed-2-bit sequence file produced by `obiskio::UnitigFileWriter`. The companion `.idx` file stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
|
||||
@@ -165,13 +173,24 @@ impl Layer<()> {
|
||||
pub fn build(out_dir: &Path) -> OLMResult<usize>
|
||||
}
|
||||
|
||||
// mode 2
|
||||
impl Layer<PersistentCompactIntVec> {
|
||||
// modes 2/4
|
||||
impl Layer<PersistentCompactIntMatrix> {
|
||||
pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
|
||||
pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
|
||||
}
|
||||
|
||||
// mode 3
|
||||
impl Layer<PersistentBitMatrix> {
|
||||
pub fn build_presence(
|
||||
out_dir: &Path,
|
||||
n_genomes: usize,
|
||||
present_in: impl Fn(CanonicalKmer, usize) -> bool,
|
||||
) -> OLMResult<usize>
|
||||
}
|
||||
```
|
||||
|
||||
Mode 2 creates a `PersistentCompactIntMatrixBuilder` with 1 column and fills it via `build_second_pass`. Mode 3 creates a `PersistentBitMatrixBuilder` with `n_genomes` columns and fills all columns in a single pass.
|
||||
|
||||
Any duplicate slot or out-of-bounds index detected during `build_second_pass` returns `OLMError::Mphf`. `new_from_par_iter` avoids materialising all keys as `Vec<u64>`.
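The duplicate and bounds checks described above can be illustrated with an in-memory sketch. `FillError` and `fill_column` are hypothetical stand-ins (the real code reports `OLMError::Mphf` and writes into a persistent builder); only the checking logic is meant to carry over.

```rust
/// Stand-in for the error reported when the MPHF misbehaves.
#[derive(Debug, PartialEq)]
enum FillError {
    OutOfBounds { slot: usize },
    Duplicate { slot: usize },
}

/// Fill one column of `n` slots from `(slot, value)` pairs.
/// Every slot may be written at most once and must be < n.
fn fill_column(
    n: usize,
    entries: impl IntoIterator<Item = (usize, u32)>,
) -> Result<Vec<u32>, FillError> {
    let mut column = vec![0u32; n];
    let mut seen = vec![false; n];
    for (slot, value) in entries {
        if slot >= n {
            return Err(FillError::OutOfBounds { slot });
        }
        if seen[slot] {
            return Err(FillError::Duplicate { slot });
        }
        seen[slot] = true;
        column[slot] = value;
    }
    Ok(column)
}
```

A `seen` bitmap costs one extra pass-local allocation but turns a silent overwrite into an explicit error.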
|
||||
|
||||
---
|
||||
@@ -196,6 +215,8 @@ fn query(kmer) -> Option<(usize, Hit<D::Item>)>:
|
||||
|
||||
Expected probe depth: 1 for kmers in layer 0, increasing for later layers.
|
||||
|
||||
For mode 2, `hit.data` is `Box<[u32]>` with 1 element; `hit.data[0]` is the count. For mode 3, `hit.data` is `Box<[bool]>` with G elements, one per genome.
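As an illustration of consuming the two payload shapes, here is a small sketch. The `Hit` struct below is a simplified stand-in that keeps only a `data` field; the real `Hit` has other fields not shown in this section. `count_of` and `genomes_containing` are hypothetical helper names.

```rust
/// Simplified stand-in for the crate's `Hit<T>`; only `data` is modelled.
struct Hit<T> {
    data: T,
}

/// Mode 2: the row has exactly one column, which holds the count.
fn count_of(hit: &Hit<Box<[u32]>>) -> u32 {
    hit.data[0]
}

/// Mode 3: indices of the genomes whose presence bit is set.
fn genomes_containing(hit: &Hit<Box<[bool]>>) -> Vec<usize> {
    hit.data
        .iter()
        .enumerate()
        .filter_map(|(genome, &present)| present.then_some(genome))
        .collect()
}
```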
|
||||
|
||||
---
|
||||
|
||||
## Add-layer algorithm
|
||||
@@ -221,15 +242,13 @@ Each partition's new layer is built independently; the operation is fully parall
|
||||
| `epserde 0.8` | zero-copy serialisation of MPHF |
|
||||
| `memmap2` | mmap of layer files |
|
||||
| `obiskio` | unitig file writer/reader |
|
||||
| `obicompactvec` | payload types: `PersistentCompactIntVec`, `PersistentBitVec` |
|
||||
| `obicompactvec` | payload types: `PersistentCompactIntMatrix`, `PersistentBitMatrix` |
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
- **Mode 3/4 multi-file layout**: one PBIV/PCIV per genome per layer means O(n_layers × n_genomes) files. Directory layout, open strategy, and query API are not yet designed.
|
||||
- **Mode 4 scale**: count matrix (n_kmers × n_genomes × bytes_per_count) reaches hundreds of GB for large collections. A sparse representation may be required; access pattern and density threshold are not yet defined.
|
||||
- **Presence matrix layout**: column-major (one PBIV per genome) favours per-genome operations; row-major favours per-kmer queries. Dominant access pattern not yet characterised.
|
||||
- **Mode 4**: count matrix (n_kmers × n_genomes × bytes_per_count) is structurally identical to mode 3 but uses `PersistentCompactIntMatrix` with G columns. Build API not yet implemented. Scale concern: hundreds of GB for large collections — a sparse representation may be required at high genome counts.
|
||||
- **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model.
|
||||
- **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
|
||||
- **`try_new_from_par_iter`**: `ptr_hash::new_from_par_iter` silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A `try_new_from_par_iter` PR upstream would close this gap.
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# PersistentBitVec
|
||||
# PersistentBitVec and PersistentBitMatrix
|
||||
|
||||
## Purpose
|
||||
|
||||
@@ -6,9 +6,13 @@
|
||||
|
||||
Typical use: converting k-mer count vectors to presence/absence vectors (with optional threshold), then computing set-theoretic distances (Jaccard) or edit distances (Hamming) between samples.
|
||||
|
||||
`PersistentBitMatrix` wraps multiple `PersistentBitVec` columns in a directory, exposing a column-major binary matrix with row-access API. A single-column bit matrix is a vector at the API level.
|
||||
|
||||
---
|
||||
|
||||
## File format
|
||||
## PersistentBitVec — single-column file
|
||||
|
||||
### File format
|
||||
|
||||
Single `.pbiv` file.
|
||||
|
||||
@@ -28,11 +32,9 @@ offset 16:
|
||||
|
||||
**Total file size**: `16 + ⌈n/64⌉ × 8` bytes.
|
||||
|
||||
---
|
||||
### Lifecycle
|
||||
|
||||
## Lifecycle
|
||||
|
||||
### Builder (`PersistentBitVecBuilder`)
|
||||
#### Builder (`PersistentBitVecBuilder`)
|
||||
|
||||
```rust
|
||||
struct PersistentBitVecBuilder {
|
||||
@@ -43,8 +45,6 @@ struct PersistentBitVecBuilder {
|
||||
|
||||
The file and mmap are created immediately at construction. The header is written once at `new()` or copied from the source at `build_from*()`. `close()` is a single flush — there is no tail to append, unlike `PersistentCompactIntVec`.
|
||||
|
||||
#### Constructors
|
||||
|
||||
**`new(n: usize, path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Creates the file, writes the header, zero-extends to `16 + ⌈n/64⌉×8` bytes, mmaps immediately. All bits default to 0.
|
||||
@@ -68,16 +68,16 @@ Handles overflow values (≥ 255) transparently — the count iterator returns t
|
||||
|
||||
Shorthand for `build_from_counts(source, 1, path)`.
|
||||
|
||||
#### Bit-level access
|
||||
**Bit-level access**
|
||||
|
||||
```rust
|
||||
fn get(&self, slot: u64) -> bool
|
||||
fn set(&mut self, slot: u64, value: bool)
|
||||
fn get(&self, slot: usize) -> bool
|
||||
fn set(&mut self, slot: usize, value: bool)
|
||||
```
|
||||
|
||||
Byte-level mmap access: `mmap[16 + slot/8]`, bit `slot % 8`. O(1).
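A minimal sketch of this addressing on a plain byte buffer standing in for the mmap. It assumes bit `slot % 8` is counted from the least significant bit of the byte; `file_size`, `get_bit` and `set_bit` are illustrative names.

```rust
const HEADER_SIZE: usize = 16;

/// Total file size for `n` bits: header plus ceil(n / 64) words of 8 bytes.
fn file_size(n: usize) -> usize {
    HEADER_SIZE + ((n + 63) / 64) * 8
}

/// Read one bit; the payload starts right after the 16-byte header.
fn get_bit(buf: &[u8], slot: usize) -> bool {
    (buf[HEADER_SIZE + slot / 8] >> (slot % 8)) & 1 == 1
}

/// Write one bit without touching its neighbours in the same byte.
fn set_bit(buf: &mut [u8], slot: usize, value: bool) {
    let byte = &mut buf[HEADER_SIZE + slot / 8];
    let mask = 1u8 << (slot % 8);
    if value {
        *byte |= mask;
    } else {
        *byte &= !mask;
    }
}
```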
|
||||
|
||||
#### Word-level bulk operations
|
||||
**Word-level bulk operations**
|
||||
|
||||
All operate on `⌈n/64⌉` u64 words. O(n/64) per call.
|
||||
|
||||
@@ -90,13 +90,11 @@ builder.not(); // self[i] = !self[i], then re-zero padding bits
|
||||
|
||||
`and`/`or`/`xor` read `other`'s word slice directly (no allocation). `not()` flips all words then masks the last word's padding bits to restore the invariant.
|
||||
|
||||
#### `close(self) -> io::Result<()>`
|
||||
**`close(self) -> io::Result<()>`**
|
||||
|
||||
Flushes the mmap. The header was written at construction and is never rewritten. O(1) in Rust code.
|
||||
|
||||
---
|
||||
|
||||
### Reader (`PersistentBitVec`)
|
||||
#### Reader (`PersistentBitVec`)
|
||||
|
||||
```rust
|
||||
struct PersistentBitVec {
|
||||
@@ -106,19 +104,19 @@ struct PersistentBitVec {
|
||||
}
|
||||
```
|
||||
|
||||
#### `open(path: &Path) -> io::Result<Self>`
|
||||
**`open(path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Mmaps the file, validates magic, reads `n` from bytes `[8..16]`. O(1).
|
||||
|
||||
#### `get(slot: u64) -> bool`
|
||||
**`get(slot: usize) -> bool`**
|
||||
|
||||
Byte-level read from `mmap[16 + slot/8]`. O(1).
|
||||
|
||||
#### `iter() -> BitIter<'_>`
|
||||
**`iter() -> BitIter<'_>`**
|
||||
|
||||
Sequential scan, byte by byte, yielding `bool` values in slot order. Implements `ExactSizeIterator`. O(n).
|
||||
|
||||
#### Aggregates
|
||||
**Aggregates**
|
||||
|
||||
```rust
|
||||
fn count_ones(&self) -> u64 // popcount over all words; padding bits are 0
|
||||
@@ -127,22 +125,20 @@ fn count_zeros(&self) -> u64 // n - count_ones()
|
||||
|
||||
`count_ones` iterates `⌈n/64⌉` words and calls `u64::count_ones()` (maps to `POPCNT`). O(n/64).
|
||||
|
||||
#### Distance methods
|
||||
**Distance methods**
|
||||
|
||||
Both operate word by word. O(n/64).
|
||||
|
||||
| Method | Formula | Notes |
|
||||
|---|---|---|
|
||||
| `jaccard_dist(&other) -> f64` | `1 − |A∩B| / |A∪B|` | `(a&b).count_ones()`, `(a\|b).count_ones()` per word |
|
||||
| `jaccard_dist(&other) -> f64` | `1 − \|A∩B\| / \|A∪B\|` | `(a&b).count_ones()`, `(a\|b).count_ones()` per word |
|
||||
| `hamming_dist(&other) -> u64` | number of differing bits | `(a^b).count_ones()` per word |
|
||||
|
||||
Edge case (both all-zero → union = 0): `jaccard_dist` returns 0.0.
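Both distances reduce to per-word popcounts. The sketch below works on plain `&[u64]` slices rather than on `PersistentBitVec`, and assumes padding bits are zero in both inputs.

```rust
/// Number of differing bits: popcount of XOR, word by word.
fn hamming_dist(a: &[u64], b: &[u64]) -> u64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    a.iter()
        .zip(b)
        .map(|(x, y)| u64::from((x ^ y).count_ones()))
        .sum()
}

/// 1 - |A∩B| / |A∪B|, with the empty-union case defined as 0.0.
fn jaccard_dist(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let mut inter_bits = 0u64;
    let mut union_bits = 0u64;
    for (x, y) in a.iter().zip(b) {
        inter_bits += u64::from((x & y).count_ones());
        union_bits += u64::from((x | y).count_ones());
    }
    if union_bits == 0 {
        0.0
    } else {
        1.0 - inter_bits as f64 / union_bits as f64
    }
}
```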
|
||||
|
||||
---
|
||||
### Implementation notes
|
||||
|
||||
## Implementation notes
|
||||
|
||||
### u64 word view
|
||||
#### u64 word view
|
||||
|
||||
The unsafe cast from `&[u8]` to `&[u64]` is sound because:
|
||||
|
||||
@@ -152,13 +148,11 @@ The unsafe cast from `&[u8]` to `&[u64]` is sound because:
|
||||
|
||||
This gives zero-copy word-level access with no intermediate allocation.
|
||||
|
||||
### Padding invariant
|
||||
#### Padding invariant
|
||||
|
||||
Writing `not()` without masking the last word would corrupt `count_ones()`, `hamming_dist()`, and `jaccard_dist()`, because the flipped padding bits would be counted as set. The mask applied after flipping is `(1u64 << (n % 64)) - 1` (skipped if `n % 64 == 0`). All other operations (`and`, `or`, `xor`) preserve the zero padding: both operands have zero padding bits, and `0 & 0`, `0 | 0`, and `0 ^ 0` are all 0.
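A sketch of the masking step on an in-memory word slice; `not_in_place` is an illustrative name, and `n` is the logical number of bits.

```rust
/// Flip every bit, then clear the padding bits of the last word so that
/// popcount-based aggregates only ever see the `n` logical bits.
fn not_in_place(words: &mut [u64], n: usize) {
    for w in words.iter_mut() {
        *w = !*w;
    }
    let tail = n % 64;
    // When n is a multiple of 64 there is no padding and nothing to mask.
    if tail != 0 {
        if let Some(last) = words.last_mut() {
            *last &= (1u64 << tail) - 1;
        }
    }
}
```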
|
||||
|
||||
---
|
||||
|
||||
## Complexity
|
||||
### Complexity
|
||||
|
||||
| Operation | Time | Notes |
|
||||
|---|---|---|
|
||||
@@ -171,3 +165,74 @@ Writing `not()` without masking the last word would corrupt `count_ones()`, `ham
|
||||
| `build_from` | O(file_size) | OS copy |
|
||||
| `build_from_counts` / `build_from_presence` | O(n) | count iter + word fill |
|
||||
| `close` | O(1) | flush only |
|
||||
|
||||
---
|
||||
|
||||
## PersistentBitMatrix — column-major directory
|
||||
|
||||
### Design
|
||||
|
||||
A directory containing `meta.json` and N column files `col_000000.pbiv`, `col_000001.pbiv`, …, each a `PersistentBitVec`. Used for presence/absence matrices: one column per genome, one bit per MPHF slot.
|
||||
|
||||
```
|
||||
presence/
|
||||
meta.json {"n": <n_slots>, "n_cols": <G>}
|
||||
col_000000.pbiv genome 0
|
||||
col_000001.pbiv genome 1
|
||||
...
|
||||
```
|
||||
|
||||
Column-major layout makes per-genome set operations (Jaccard, Hamming, AND/OR) cache-friendly — each genome is a contiguous file. Row access (which genomes contain a given kmer) requires one O(1) read per column.
|
||||
|
||||
### Builder (`PersistentBitMatrixBuilder`)
|
||||
|
||||
```rust
|
||||
struct PersistentBitMatrixBuilder {
|
||||
dir: PathBuf,
|
||||
n: usize,
|
||||
n_cols: usize,
|
||||
}
|
||||
```
|
||||
|
||||
**`new(n: usize, dir: &Path) -> io::Result<Self>`**
|
||||
|
||||
Creates the directory (including parents).
|
||||
|
||||
**`add_col(&mut self) -> io::Result<PersistentBitVecBuilder>`**
|
||||
|
||||
Creates `col_NNNNNN.pbiv` for the next column and returns its builder. The caller fills the column and calls `builder.close()` before calling `add_col` again.
|
||||
|
||||
**`close(self) -> io::Result<()>`**
|
||||
|
||||
Writes `meta.json` with the final `n` and `n_cols`.
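The on-disk naming convention can be sketched with two small helpers. They are hypothetical (the real builder may well use a JSON library); the output shape is taken from the directory listing above.

```rust
/// File name of column `c`, zero-padded to six digits, e.g. `col_000001.pbiv`.
fn col_file_name(c: usize, ext: &str) -> String {
    format!("col_{:06}.{}", c, ext)
}

/// Hand-rolled `meta.json` body recording slot count and column count.
fn meta_json(n: usize, n_cols: usize) -> String {
    format!("{{\"n\": {}, \"n_cols\": {}}}", n, n_cols)
}
```

Because `meta.json` is only written at `close()`, a directory without it can be treated as an unfinished build.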
|
||||
|
||||
### Reader (`PersistentBitMatrix`)
|
||||
|
||||
```rust
|
||||
struct PersistentBitMatrix {
|
||||
cols: Vec<PersistentBitVec>,
|
||||
n: usize,
|
||||
}
|
||||
```
|
||||
|
||||
**`open(dir: &Path) -> io::Result<Self>`**
|
||||
|
||||
Reads `meta.json`, opens all `col_NNNNNN.pbiv` files.
|
||||
|
||||
**`row(slot: usize) -> Box<[bool]>`**
|
||||
|
||||
Returns the presence vector: `[col_0[slot], col_1[slot], …, col_{G-1}[slot]]`. One byte read per column. O(G).
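Row assembly over column-major storage can be sketched as follows. `BitMatrixSketch` is a hypothetical in-memory stand-in: each column is the payload bytes of one PBIV file (header omitted), with bits assumed to be numbered from the least significant bit.

```rust
/// In-memory stand-in: one byte vector per column, `n` logical slots.
struct BitMatrixSketch {
    cols: Vec<Vec<u8>>,
    n: usize,
}

impl BitMatrixSketch {
    /// Gather bit `slot` from every column: one byte read per column, O(G).
    fn row(&self, slot: usize) -> Box<[bool]> {
        assert!(slot < self.n, "slot out of bounds");
        self.cols
            .iter()
            .map(|col| (col[slot / 8] >> (slot % 8)) & 1 == 1)
            .collect()
    }
}
```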
|
||||
|
||||
**`col(c: usize) -> &PersistentBitVec`**
|
||||
|
||||
Direct access to a single column for column-oriented operations.
|
||||
|
||||
### LayerData implementation
|
||||
|
||||
```rust
|
||||
impl LayerData for PersistentBitMatrix {
|
||||
type Item = Box<[bool]>;
|
||||
fn open(layer_dir: &Path) -> OLMResult<Self> { /* opens layer_dir/presence/ */ }
|
||||
fn read(&self, slot: usize) -> Box<[bool]> { self.row(slot) }
|
||||
}
|
||||
```
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# PersistentCompactIntVec
|
||||
# PersistentCompactIntVec and PersistentCompactIntMatrix
|
||||
|
||||
## Purpose
|
||||
|
||||
@@ -6,78 +6,81 @@
|
||||
|
||||
Motivation from observed count distributions in genomics data: 99.9% of k-mer counts fit in a u8; overflow (count ≥ 255) affects ~0.07% of distinct k-mers but can reach values above 10⁶ (chloroplast, ribosomal repeats).
|
||||
|
||||
`PersistentCompactIntMatrix` wraps multiple `PersistentCompactIntVec` columns in a directory, exposing a column-major matrix with row-access API. A vector is a matrix with 1 column.
|
||||
|
||||
---
|
||||
|
||||
## Design
|
||||
## PersistentCompactIntVec — single-column file
|
||||
|
||||
### Design
|
||||
|
||||
Two-tier structure:
|
||||
|
||||
1. **Primary array** — `[u8; n]`, stored at offset 24 in the PCIV file and mmap'd. Values 0–254 are stored directly. Value **255 is a sentinel** meaning "look in overflow".
|
||||
2. **Overflow section** — sorted list of `(slot: u32, value: u32)` pairs for all slots where the true value ≥ 255, with a **sparse L1-fitting index** for fast lookup.
|
||||
1. **Primary array** — `[u8; n]`, stored at offset 40 in the PCIV file and mmap'd. Values 0–254 are stored directly. Value **255 is a sentinel** meaning "look in overflow".
|
||||
2. **Overflow section** — sorted list of `(slot: u64, value: u32)` pairs for all slots where the true value ≥ 255, with a **sparse L1-fitting index** for fast lookup.
|
||||
|
||||
```
|
||||
primary[slot] < 255 → return primary[slot]
|
||||
primary[slot] == 255 → binary search in overflow
|
||||
```
|
||||
|
||||
---
|
||||
### File format
|
||||
|
||||
## Single-file format
|
||||
|
||||
Everything is stored in a single `.pciv` file. Write order matches computation order: the header placeholder is written first, then primary (known at `new()`), then overflow data and index (known at `close()`), then the header is overwritten at offset 0.
|
||||
Single `.pciv` file. Write order: header placeholder → primary → overflow + index → header overwrite at offset 0.
|
||||
|
||||
```
|
||||
offset 0:
|
||||
magic: [u8; 4] = b"PCIV"
|
||||
n: u64 number of slots
|
||||
n_overflow: u32 number of overflow entries
|
||||
step: u32 sparse index step (0 = no index)
|
||||
n_index: u32 number of index entries
|
||||
magic: [u8; 4] = b"PCIV"
|
||||
_pad: [u8; 4] = 0
|
||||
n: u64 number of slots
|
||||
n_overflow: u64 number of overflow entries
|
||||
n_index: u64 number of sparse index entries
|
||||
step: u64 sparse index step (0 = no index)
|
||||
|
||||
offset 24:
|
||||
primary: [u8; n] one byte per slot, 255 = overflow sentinel
|
||||
offset 40:
|
||||
primary: [u8; n] one byte per slot, 255 = overflow sentinel
|
||||
|
||||
offset 24 + n:
|
||||
data: [(slot: u32, value: u32); n_overflow] sorted by slot
|
||||
offset 40 + n:
|
||||
data: [(slot: u64, value: u32); n_overflow] 12 bytes each, sorted by slot
|
||||
|
||||
offset 24 + n + n_overflow × 8:
|
||||
index: [(slot: u32, pos: u32); n_index] sparse index
|
||||
offset 40 + n + n_overflow × 12:
|
||||
index: [(slot: u64, pos: u64); n_index] 16 bytes each, sparse index
|
||||
```
|
||||
|
||||
The index entries point into `data`: `index[i] = (slot of data[i×step], i×step)`.
|
||||
|
||||
---
|
||||
All integer fields are little-endian. Slot indices are stored as `u64` in the file; they are `usize` in Rust code.
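To make the 40-byte header and the section offsets concrete, here is a sketch that parses the header from a byte slice and derives where each section starts. `Header`, `parse_header` and `section_offsets` are illustrative names; the field order follows the layout above.

```rust
const HEADER_SIZE: usize = 40;
const DATA_ENTRY_SIZE: usize = 12; // slot: u64 + value: u32
const INDEX_ENTRY_SIZE: usize = 16; // slot: u64 + pos: u64

#[derive(Debug, PartialEq)]
struct Header {
    n: u64,
    n_overflow: u64,
    n_index: u64,
    step: u64,
}

/// Read a little-endian u64 at `offset`.
fn read_u64_le(buf: &[u8], offset: usize) -> u64 {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&buf[offset..offset + 8]);
    u64::from_le_bytes(bytes)
}

/// Validate the magic and decode the four u64 fields (bytes 4..8 are padding).
fn parse_header(buf: &[u8]) -> Option<Header> {
    if buf.len() < HEADER_SIZE || !buf.starts_with(b"PCIV") {
        return None;
    }
    Some(Header {
        n: read_u64_le(buf, 8),
        n_overflow: read_u64_le(buf, 16),
        n_index: read_u64_le(buf, 24),
        step: read_u64_le(buf, 32),
    })
}

/// Byte offsets of (primary, data, index, end of file).
fn section_offsets(h: &Header) -> (usize, usize, usize, usize) {
    let primary = HEADER_SIZE;
    let data = primary + h.n as usize;
    let index = data + h.n_overflow as usize * DATA_ENTRY_SIZE;
    let end = index + h.n_index as usize * INDEX_ENTRY_SIZE;
    (primary, data, index, end)
}
```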
|
||||
|
||||
## Lifecycle
|
||||
### Lifecycle
|
||||
|
||||
### Builder (`PersistentCompactIntVecBuilder`)
|
||||
#### Builder (`PersistentCompactIntVecBuilder`)
|
||||
|
||||
Used during construction. The primary section is **mmap'd immediately** at construction time (both for `new` and `build_from`), so the file exists and is addressable from the start. The overflow is held in a `HashMap<u64, u32>` in RAM.
|
||||
Used during construction. The primary section is **mmap'd immediately** at construction time (both for `new` and `build_from`), so the file exists and is addressable from the start. The overflow is held in a `HashMap<usize, u32>` in RAM.
|
||||
|
||||
```rust
|
||||
struct PersistentCompactIntVecBuilder {
|
||||
path: PathBuf,
|
||||
mmap: MmapMut, // primary section live in the file from the start
|
||||
mmap: MmapMut, // primary section live in the file from the start
|
||||
n: usize,
|
||||
overflow: HashMap<u64, u32>, // values ≥ 255
|
||||
overflow: HashMap<usize, u32>, // values ≥ 255
|
||||
}
|
||||
```
|
||||
|
||||
#### `new(n: usize, path: &Path) -> io::Result<Self>`
|
||||
**`new(n: usize, path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Creates the file, pre-allocates `HEADER_SIZE + n` zero bytes, mmaps it. The primary is zero-initialised (all slots = 0). Returns immediately ready for `set` / `get`.
|
||||
|
||||
#### `build_from(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>`
|
||||
**`build_from(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Copies the source PCIV file to `path` (OS-level copy — no per-slot iteration), mmaps the copy, then loads the overflow section into a `HashMap`. Initialisation cost: O(file copy) + O(n_overflow), not O(n).
|
||||
|
||||
At `close()`, the primary section is **not rewritten**: it is already in the file via mmap. Only the overflow data, the sparse index, and the header are updated.
|
||||
|
||||
#### `set(slot: u64, value: u32)` / `get(slot: u64) -> u32`
|
||||
**`set(slot: usize, value: u32)` / `get(slot: usize) -> u32`**
|
||||
|
||||
Direct mmap byte access for the primary; HashMap for the overflow. Both O(1). Mutations can move a slot between tiers freely (downward mutation removes the HashMap entry; upward mutation adds it).
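The tier migration can be sketched with an in-memory model, where a `Vec<u8>` stands in for the mmap'd primary section. `CompactIntBuilderSketch` is a hypothetical name, not the crate's builder.

```rust
use std::collections::HashMap;

const SENTINEL: u8 = 255;

/// In-memory model of the builder: primary bytes plus an overflow map.
struct CompactIntBuilderSketch {
    primary: Vec<u8>,
    overflow: HashMap<usize, u32>,
}

impl CompactIntBuilderSketch {
    fn new(n: usize) -> Self {
        Self { primary: vec![0; n], overflow: HashMap::new() }
    }

    fn set(&mut self, slot: usize, value: u32) {
        if value < u32::from(SENTINEL) {
            // Fits in the primary tier; drop any stale overflow entry.
            self.primary[slot] = value as u8;
            self.overflow.remove(&slot);
        } else {
            // Values >= 255 go to the overflow tier, marked by the sentinel.
            self.primary[slot] = SENTINEL;
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: usize) -> u32 {
        match self.primary[slot] {
            SENTINEL => self.overflow[&slot],
            byte => u32::from(byte),
        }
    }
}
```

Note that the value 255 itself lives in the overflow tier, since the byte 255 is reserved as the sentinel.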
|
||||
|
||||
#### Element-wise operations — `min`, `max`, `add`, `diff`
|
||||
**Element-wise operations — `min`, `max`, `add`, `diff`**
|
||||
|
||||
Each takes a `&PersistentCompactIntVec` of equal length and updates `self` in place via `set`:
|
||||
|
||||
@@ -90,17 +93,15 @@ builder.diff(&other); // self[i] = self[i].saturating_sub(other[i])
|
||||
|
||||
All iterate `other` with `other.iter()` (merge-scan, O(n_other)).
|
||||
|
||||
#### `close(self) -> io::Result<()>`
|
||||
**`close(self) -> io::Result<()>`**
|
||||
|
||||
1. Flush and drop the mmap (primary changes are now on disk).
|
||||
2. Sort the overflow HashMap into `Vec<(u32, u32)>`.
|
||||
2. Sort the overflow HashMap into `Vec<(usize, u32)>`.
|
||||
3. Truncate the file to `HEADER_SIZE + n` (removes old data+index if `build_from` was used).
|
||||
4. Append sorted overflow data, then sparse index.
|
||||
5. Seek to offset 0, overwrite the header with final values.
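Steps 2 and 4 can be sketched in memory as below. `sorted_overflow` and `sparse_index` are illustrative names; the index rule `index[i] = (slot of data[i×step], i×step)` is the one given in the file-format section, and `step` is taken as an input here.

```rust
use std::collections::HashMap;

/// Step 2: turn the overflow map into a slot-sorted vector.
fn sorted_overflow(overflow: &HashMap<usize, u32>) -> Vec<(usize, u32)> {
    let mut data: Vec<(usize, u32)> = overflow
        .iter()
        .map(|(&slot, &value)| (slot, value))
        .collect();
    data.sort_unstable_by_key(|&(slot, _)| slot);
    data
}

/// Step 4 (index part): sample every `step`-th entry as `(slot, pos)`.
/// A step of 0 means "no index".
fn sparse_index(data: &[(usize, u32)], step: usize) -> Vec<(usize, usize)> {
    if step == 0 {
        return Vec::new();
    }
    data.iter()
        .enumerate()
        .step_by(step)
        .map(|(pos, &(slot, _))| (slot, pos))
        .collect()
}
```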
|
||||
|
||||
---
|
||||
|
||||
### Reader (`PersistentCompactIntVec`)
|
||||
#### Reader (`PersistentCompactIntVec`)
|
||||
|
||||
Used at query time. The whole file is mmap'd; only the sparse index is copied into a `Vec` at open time (≤ 32 KB, L1-resident).
|
||||
|
||||
@@ -109,19 +110,19 @@ struct PersistentCompactIntVec {
|
||||
mmap: Mmap,
|
||||
n: usize,
|
||||
n_overflow: usize,
|
||||
step: u32,
|
||||
index: Vec<(u32, u32)>, // L1-resident
|
||||
primary_offset: usize, // = 24 (HEADER_SIZE)
|
||||
data_offset: usize, // = 24 + n
|
||||
step: usize,
|
||||
index: Vec<(usize, usize)>, // (slot, pos) — L1-resident
|
||||
primary_offset: usize, // = 40 (HEADER_SIZE)
|
||||
data_offset: usize, // = 40 + n
|
||||
path: PathBuf,
|
||||
}
|
||||
```
|
||||
|
||||
#### `open(path: &Path) -> io::Result<Self>`
|
||||
**`open(path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Mmaps the file, parses the 24-byte header, copies the sparse index entries into a `Vec`. The primary and data sections stay mmap'd.
|
||||
Mmaps the file, parses the 40-byte header, copies the sparse index entries into a `Vec`. The primary and data sections stay mmap'd.
|
||||
|
||||
#### `get(slot: u64) -> u32` — random access
|
||||
**`get(slot: usize) -> u32` — random access**
|
||||
|
||||
```
|
||||
primary[slot] < 255 → return it directly
|
||||
@@ -134,19 +135,19 @@ step > 0:
|
||||
binary_search(data[index[i].pos .. index[i+1].pos], slot)
|
||||
```
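The two-level search can be sketched on in-memory slices. `lookup_overflow` is a hypothetical name; `data` is the slot-sorted overflow section and `index` holds `(slot, pos)` samples into it (empty when `step == 0`).

```rust
/// Find the overflow value for `slot`, or `None` if it has no entry.
fn lookup_overflow(
    data: &[(usize, u32)],
    index: &[(usize, usize)],
    slot: usize,
) -> Option<u32> {
    let (lo, hi) = if index.is_empty() {
        // No sparse index: search the whole data section.
        (0, data.len())
    } else {
        // Last index entry whose slot is <= the target slot.
        let i = match index.binary_search_by_key(&slot, |&(s, _)| s) {
            Ok(i) => i,
            Err(0) => return None, // target precedes every overflow slot
            Err(i) => i - 1,
        };
        let lo = index[i].1;
        let hi = index.get(i + 1).map_or(data.len(), |&(_, pos)| pos);
        (lo, hi)
    };
    data[lo..hi]
        .binary_search_by_key(&slot, |&(s, _)| s)
        .ok()
        .map(|j| data[lo + j].1)
}
```

The first search touches only the small index; the second is confined to at most `step` data entries.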
|
||||
|
||||
#### `iter() -> Iter<'_>` — sequential scan, O(n)
|
||||
**`iter() -> Iter<'_>` — sequential scan, O(n)**
|
||||
|
||||
Merge-scan: reads primary bytes in order; on sentinel 255, advances a sequential pointer into the sorted data section rather than doing a binary search. This gives O(n + n_overflow) with no random access into the data section.
|
||||
|
||||
`Iter` implements `ExactSizeIterator`. `&PersistentCompactIntVec` implements `IntoIterator`.
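A sketch of the merge-scan over in-memory slices. `merge_scan` is an illustrative free function, and it assumes every sentinel byte in `primary` has a matching entry in `data`, in slot order.

```rust
/// Yield all values in slot order, consuming overflow entries sequentially.
fn merge_scan<'a>(
    primary: &'a [u8],
    data: &'a [(usize, u32)],
) -> impl Iterator<Item = u32> + 'a {
    let mut next_overflow = 0usize;
    primary.iter().map(move |&byte| {
        if byte == 255 {
            // Sentinel: the next overflow entry belongs to this slot.
            let (_, value) = data[next_overflow];
            next_overflow += 1;
            value
        } else {
            u32::from(byte)
        }
    })
}
```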
|
||||
|
||||
#### Aggregate
|
||||
**Aggregate**
|
||||
|
||||
```rust
|
||||
fn sum(&self) -> u64 // Σ self[i] as u64, via iter()
|
||||
```
|
||||
|
||||
#### Distance methods
|
||||
**Distance methods**
|
||||
|
||||
All take `&other` of equal length, iterate both with `zip(self.iter(), other.iter())`, and return `f64`.
|
||||
|
||||
@@ -158,29 +159,23 @@ All take `&other` of equal length, iterate both with `zip(self.iter(), other.ite
|
||||
| `relfreq_euclidean_dist` | Euclidean on relative frequencies |
|
||||
| `hellinger_euclidean_dist` | `√Σ(√pᵢ − √qᵢ)²` — Euclidean on sqrt(relfreq) |
|
||||
| `hellinger_dist` | `hellinger_euclidean_dist / √2` — standard Hellinger distance ∈ [0, 1] |
|
||||
| `threshold_jaccard_dist(&other, threshold: u32)` | `1 − |A∩B| / |A∪B|` where presence iff count ≥ threshold |
|
||||
| `threshold_jaccard_dist(&other, threshold: u32)` | `1 − \|A∩B\| / \|A∪B\|` where presence iff count ≥ threshold |
|
||||
| `jaccard_dist` | `threshold_jaccard_dist(&other, 1)` |
|
||||
|
||||
Edge cases (both vectors all-zero, or union empty for Jaccard): distance = 0.0.
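Two of these distances are sketched below on plain `&[u32]` slices rather than on `PersistentCompactIntVec`. One detail is an assumption not stated above: when exactly one vector is all-zero, its relative frequencies are treated as all 0 in `hellinger_dist`.

```rust
/// 1 - |A∩B| / |A∪B|, where a slot is present iff its count >= threshold.
fn threshold_jaccard_dist(a: &[u32], b: &[u32], threshold: u32) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let (mut inter, mut uni) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        let (in_a, in_b) = (x >= threshold, y >= threshold);
        inter += u64::from(in_a && in_b);
        uni += u64::from(in_a || in_b);
    }
    if uni == 0 { 0.0 } else { 1.0 - inter as f64 / uni as f64 }
}

/// Hellinger distance on relative frequencies: sqrt(Σ(√p − √q)² / 2).
fn hellinger_dist(a: &[u32], b: &[u32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let total_a: f64 = a.iter().map(|&x| f64::from(x)).sum();
    let total_b: f64 = b.iter().map(|&x| f64::from(x)).sum();
    if total_a == 0.0 && total_b == 0.0 {
        return 0.0;
    }
    // Assumption: a zero total yields relative frequency 0 for every slot.
    let rel = |x: u32, total: f64| if total == 0.0 { 0.0 } else { f64::from(x) / total };
    let sum_sq: f64 = a
        .iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = rel(x, total_a).sqrt() - rel(y, total_b).sqrt();
            d * d
        })
        .sum();
    (sum_sq / 2.0).sqrt()
}
```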
|
||||
|
||||
---
|
||||
|
||||
## Step computation
|
||||
### Step computation
|
||||
|
||||
Chosen at `close()` once `n_overflow` is known:
|
||||
|
||||
```
|
||||
L1_entries = 32 768 / 8 = 4096
|
||||
L1_INDEX_ENTRIES = 2048
|
||||
|
||||
step = 0 if n_overflow ≤ 4096
|
||||
step = ⌈n_overflow / 4096⌉ otherwise
|
||||
step = 0 if n_overflow ≤ 2048
|
||||
step = ⌈n_overflow / 2048⌉ otherwise
|
||||
```
|
||||
|
||||
For the Betula nana reference (359 044 overflows) with `L1_INDEX_ENTRIES = 2048` and 16-byte index entries: step = 176, n_index ≈ 2 040 entries ≈ 31.9 KB.
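The step rule with `L1_INDEX_ENTRIES = 2048` can be written out as a small function. `sparse_step` and `index_len` are illustrative names; `index_len` assumes one index entry per started block of `step` overflow entries, which follows from the `index[i] = (slot of data[i×step], i×step)` rule.

```rust
const L1_INDEX_ENTRIES: usize = 2048;

/// 0 (no index) when the overflow is small, else ceil(n_overflow / 2048).
fn sparse_step(n_overflow: usize) -> usize {
    if n_overflow <= L1_INDEX_ENTRIES {
        0
    } else {
        (n_overflow + L1_INDEX_ENTRIES - 1) / L1_INDEX_ENTRIES
    }
}

/// Number of index entries produced for a given step (assumed: ceil division).
fn index_len(n_overflow: usize, step: usize) -> usize {
    if step == 0 {
        0
    } else {
        (n_overflow + step - 1) / step
    }
}
```

By construction the index never exceeds `L1_INDEX_ENTRIES` entries, which at 16 bytes each stays within 32 KB.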
|
||||
|
||||
---
|
||||
|
||||
## Complexity
|
||||
### Complexity
|
||||
|
||||
| Operation | Time | Notes |
|
||||
|---|---|---|
|
||||
@@ -194,3 +189,72 @@ For the Betula nana reference (359 044 overflows): step = 88, n_index = 4 080 en
|
||||
| `close` | O(n_overflow log n_overflow) | sort + sequential write |
|
||||
| `open` | O(n_index) | index copy into Vec |
|
||||
| `build_from` | O(file_size) + O(n_overflow) | OS copy + HashMap load |
|
||||
|
||||
---
|
||||
|
||||
## PersistentCompactIntMatrix — column-major directory
|
||||
|
||||
### Design
|
||||
|
||||
A directory containing `meta.json` and N column files `col_000000.pciv`, `col_000001.pciv`, …, each a `PersistentCompactIntVec`. This is the type used by `LayerData` — a single-column matrix is functionally equivalent to a vector but shares the same interface as multi-column matrices.
|
||||
|
||||
```
|
||||
counts/
|
||||
meta.json {"n": <n_slots>, "n_cols": <N>}
|
||||
col_000000.pciv
|
||||
col_000001.pciv
|
||||
...
|
||||
```
|
||||
|
||||
### Builder (`PersistentCompactIntMatrixBuilder`)
|
||||
|
||||
```rust
|
||||
struct PersistentCompactIntMatrixBuilder {
|
||||
dir: PathBuf,
|
||||
n: usize,
|
||||
n_cols: usize,
|
||||
}
|
||||
```
|
||||
|
||||
**`new(n: usize, dir: &Path) -> io::Result<Self>`**
|
||||
|
||||
Creates the directory (including parents). Does not write `meta.json` yet.
|
||||
|
||||
**`add_col(&mut self) -> io::Result<PersistentCompactIntVecBuilder>`**
|
||||
|
||||
Creates `col_NNNNNN.pciv` for the next column and returns its builder. The caller fills the column and calls `builder.close()` before calling `add_col` again.
|
||||
|
||||
**`close(self) -> io::Result<()>`**
|
||||
|
||||
Writes `meta.json` with the final `n` and `n_cols`. Must be called after all column builders are closed.
|
||||
|
||||
### Reader (`PersistentCompactIntMatrix`)
|
||||
|
||||
```rust
|
||||
struct PersistentCompactIntMatrix {
|
||||
cols: Vec<PersistentCompactIntVec>,
|
||||
n: usize,
|
||||
}
|
||||
```
|
||||
|
||||
**`open(dir: &Path) -> io::Result<Self>`**
|
||||
|
||||
Reads `meta.json`, opens all `col_NNNNNN.pciv` files.
|
||||
|
||||
**`row(slot: usize) -> Box<[u32]>`**
|
||||
|
||||
Returns the full row: `[col_0[slot], col_1[slot], …, col_{N-1}[slot]]`. One mmap access per column. O(N).
|
||||
|
||||
**`col(c: usize) -> &PersistentCompactIntVec`**
|
||||
|
||||
Direct access to a single column for column-oriented operations (distance computations, iteration).
|
||||
|
||||
### LayerData implementation
|
||||
|
||||
```rust
|
||||
impl LayerData for PersistentCompactIntMatrix {
|
||||
type Item = Box<[u32]>;
|
||||
fn open(layer_dir: &Path) -> OLMResult<Self> { /* opens layer_dir/counts/ */ }
|
||||
fn read(&self, slot: usize) -> Box<[u32]> { self.row(slot) }
|
||||
}
|
||||
```
|
||||
|
||||