feat: introduce column-major matrix storage and migrate layered map

Introduces `PersistentBitMatrix` and `PersistentCompactIntMatrix` to replace single-file vector storage with a column-major, directory-based layout. Each column is persisted as an individual file alongside a lightweight `meta.json` for dimension tracking. Migrates `obilayeredmap` to use these multi-column structures, updating Rust APIs, query return types, and build signatures. Includes comprehensive documentation, unit and integration tests for persistence and accessors, and refactors distance calculation helpers.
Eric Coissac
2026-05-14 09:31:11 +08:00
parent f48f7500cd
commit b218bf012b
15 changed files with 843 additions and 201 deletions
+95 -30
@@ -1,4 +1,4 @@
# PersistentBitVec
# PersistentBitVec and PersistentBitMatrix
## Purpose
@@ -6,9 +6,13 @@
Typical use: converting k-mer count vectors to presence/absence vectors (with optional threshold), then computing set-theoretic distances (Jaccard) or edit distances (Hamming) between samples.
`PersistentBitMatrix` wraps multiple `PersistentBitVec` columns in a directory, exposing a column-major binary matrix with row-access API. A single-column bit matrix is a vector at the API level.
---
## File format
## PersistentBitVec — single-column file
### File format
Single `.pbiv` file.
@@ -28,11 +32,9 @@ offset 16:
**Total file size**: `16 + ⌈n/64⌉ × 8` bytes.
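The size formula can be checked with a few lines of arithmetic. This is a minimal sketch; `HEADER_LEN`, `word_count`, and `pbiv_file_size` are hypothetical names chosen for illustration and are not part of the crate's API.

```rust
/// Size in bytes of the fixed header that precedes the bit data.
const HEADER_LEN: u64 = 16;

/// Number of u64 words needed to hold `n` bits, i.e. ⌈n/64⌉.
fn word_count(n: u64) -> u64 {
    (n + 63) / 64
}

/// Expected total file size in bytes: 16 + ⌈n/64⌉ × 8.
fn pbiv_file_size(n: u64) -> u64 {
    HEADER_LEN + word_count(n) * 8
}
```

For example, a vector of 1000 bits needs 16 words, so the file is 16 + 128 = 144 bytes.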
---
### Lifecycle
## Lifecycle
### Builder (`PersistentBitVecBuilder`)
#### Builder (`PersistentBitVecBuilder`)
```rust
struct PersistentBitVecBuilder {
@@ -43,8 +45,6 @@ struct PersistentBitVecBuilder {
The file and mmap are created immediately at construction. The header is written once at `new()` or copied from the source at `build_from*()`. `close()` is a single flush — there is no tail to append, unlike `PersistentCompactIntVec`.
#### Constructors
**`new(n: usize, path: &Path) -> io::Result<Self>`**
Creates the file, writes the header, zero-extends to `16 + ⌈n/64⌉×8` bytes, mmaps immediately. All bits default to 0.
@@ -68,16 +68,16 @@ Handles overflow values (≥ 255) transparently — the count iterator returns t
Shorthand for `build_from_counts(source, 1, path)`.
#### Bit-level access
**Bit-level access**
```rust
fn get(&self, slot: u64) -> bool
fn set(&mut self, slot: u64, value: bool)
fn get(&self, slot: usize) -> bool
fn set(&mut self, slot: usize, value: bool)
```
Byte-level mmap access: `mmap[16 + slot/8]`, bit `slot % 8`. O(1).
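The byte-level addressing can be sketched on a plain byte slice standing in for the mmap. The helper names `get_bit` and `set_bit` are hypothetical, and the sketch assumes that bit `slot % 8` means the bit with value `1 << (slot % 8)` (least significant bit first), which the text above does not state explicitly.

```rust
/// Offset of the first data byte (the header occupies bytes 0..16).
const HEADER_LEN: usize = 16;

/// Read one bit from a buffer laid out as header + packed bits.
/// Assumption: bit `slot % 8` is the bit with value `1 << (slot % 8)`.
fn get_bit(buf: &[u8], slot: usize) -> bool {
    (buf[HEADER_LEN + slot / 8] >> (slot % 8)) & 1 == 1
}

/// Set or clear one bit in the same layout.
fn set_bit(buf: &mut [u8], slot: usize, value: bool) {
    let byte = &mut buf[HEADER_LEN + slot / 8];
    let mask = 1u8 << (slot % 8);
    if value {
        *byte |= mask;
    } else {
        *byte &= !mask;
    }
}
```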
#### Word-level bulk operations
**Word-level bulk operations**
All operate on `⌈n/64⌉` u64 words. O(n/64) per call.
@@ -90,13 +90,11 @@ builder.not(); // self[i] = !self[i], then re-zero padding bits
`and`/`or`/`xor` read `other`'s word slice directly (no allocation). `not()` flips all words then masks the last word's padding bits to restore the invariant.
#### `close(self) -> io::Result<()>`
**`close(self) -> io::Result<()>`**
Flushes the mmap. The header was written at construction and is never rewritten. O(1) in Rust code.
---
### Reader (`PersistentBitVec`)
#### Reader (`PersistentBitVec`)
```rust
struct PersistentBitVec {
@@ -106,19 +104,19 @@ struct PersistentBitVec {
}
```
#### `open(path: &Path) -> io::Result<Self>`
**`open(path: &Path) -> io::Result<Self>`**
Mmaps the file, validates magic, reads `n` from bytes `[8..16]`. O(1).
#### `get(slot: u64) -> bool`
**`get(slot: usize) -> bool`**
Byte-level read from `mmap[16 + slot/8]`. O(1).
#### `iter() -> BitIter<'_>`
**`iter() -> BitIter<'_>`**
Sequential scan, byte by byte, yielding `bool` values in slot order. Implements `ExactSizeIterator`. O(n).
#### Aggregates
**Aggregates**
```rust
fn count_ones(&self) -> u64 // popcount over all words; padding bits are 0
@@ -127,22 +125,20 @@ fn count_zeros(&self) -> u64 // n - count_ones()
`count_ones` iterates `⌈n/64⌉` words and calls `u64::count_ones()` (maps to `POPCNT`). O(n/64).
#### Distance methods
**Distance methods**
Both operate word by word. O(n/64).
| Method | Formula | Notes |
|---|---|---|
| `jaccard_dist(&other) -> f64` | `1 − |A∩B| / |A∪B|` | `(a&b).count_ones()`, `(a\|b).count_ones()` per word |
| `jaccard_dist(&other) -> f64` | `1 − \|A∩B\| / \|A∪B\|` | `(a&b).count_ones()`, `(a\|b).count_ones()` per word |
| `hamming_dist(&other) -> u64` | number of differing bits | `(a^b).count_ones()` per word |
Edge case (both all-zero → union = 0): `jaccard_dist` returns 0.0.
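The two distances can be sketched over plain `&[u64]` word slices. This is an illustration of the word-by-word popcount approach and the all-zero edge case, not the crate's actual code; the free functions below are hypothetical stand-ins for the methods, and both slices are assumed to have the same length with zero padding bits.

```rust
/// Hamming distance: number of differing bits, one XOR + popcount per word.
fn hamming_dist(a: &[u64], b: &[u64]) -> u64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x ^ y).count_ones() as u64)
        .sum()
}

/// Jaccard distance: 1 − |A∩B| / |A∪B|, accumulated word by word.
/// Returns 0.0 when both inputs are all-zero (empty union).
fn jaccard_dist(a: &[u64], b: &[u64]) -> f64 {
    let (mut inter, mut union) = (0u64, 0u64);
    for (x, y) in a.iter().zip(b) {
        inter += (x & y).count_ones() as u64;
        union += (x | y).count_ones() as u64;
    }
    if union == 0 {
        0.0
    } else {
        1.0 - inter as f64 / union as f64
    }
}
```

With `a = [0b1100]` and `b = [0b1010]`, the intersection has 1 bit and the union has 3, so the Jaccard distance is 2/3 and the Hamming distance is 2.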
---
### Implementation notes
## Implementation notes
### u64 word view
#### u64 word view
The unsafe cast from `&[u8]` to `&[u64]` is sound because:
@@ -152,13 +148,11 @@ The unsafe cast from `&[u8]` to `&[u64]` is sound because:
This gives zero-copy word-level access with no intermediate allocation.
### Padding invariant
#### Padding invariant
Implementing `not()` without masking the last word would corrupt `count_ones()`, `hamming_dist()`, and `jaccard_dist()`, because the flipped padding bits would be counted. The mask applied after flipping is `(1u64 << (n % 64)) - 1`; masking is skipped when `n % 64 == 0`, since the last word then has no padding. The other operations (`and`, `or`, `xor`) preserve zero padding because both operands already have zero padding bits, and `0 & 0`, `0 | 0`, and `0 ^ 0` are all 0.
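A minimal sketch of the flip-then-mask step on an in-memory word slice. `not_in_place` is a hypothetical helper, and it assumes valid bits occupy the low end of the last word, which is what the mask `(1u64 << (n % 64)) - 1` implies.

```rust
/// Flip every bit of an `n`-bit vector stored in `words`, then clear
/// the padding bits of the last word so popcounts stay correct.
fn not_in_place(words: &mut [u64], n: usize) {
    for w in words.iter_mut() {
        *w = !*w;
    }
    let rem = n % 64;
    // When n is a multiple of 64 there is no padding, so no mask is applied.
    if rem != 0 {
        if let Some(last) = words.last_mut() {
            *last &= (1u64 << rem) - 1;
        }
    }
}
```

For `n = 70` and two all-zero words, the result has 64 ones in the first word and 6 in the second, for a total of exactly 70.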
---
## Complexity
### Complexity
| Operation | Time | Notes |
|---|---|---|
@@ -171,3 +165,74 @@ Writing `not()` without masking the last word would corrupt `count_ones()`, `ham
| `build_from` | O(file_size) | OS copy |
| `build_from_counts` / `build_from_presence` | O(n) | count iter + word fill |
| `close` | O(1) | flush only |
---
## PersistentBitMatrix — column-major directory
### Design
A directory containing `meta.json` and N column files `col_000000.pbiv`, `col_000001.pbiv`, …, each a `PersistentBitVec`. Used for presence/absence matrices: one column per genome, one bit per MPHF slot.
```
presence/
meta.json {"n": <n_slots>, "n_cols": <G>}
col_000000.pbiv genome 0
col_000001.pbiv genome 1
...
```
Column-major layout makes per-genome set operations (Jaccard, Hamming, AND/OR) cache-friendly — each genome is a contiguous file. Row access (which genomes contain a given k-mer) requires one O(1) read per column.
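The directory layout above can be sketched with two small helpers. Both `col_path` and `meta_json` are hypothetical names, and the JSON is formatted by hand here only to keep the example dependency-free; the actual implementation may use a JSON library and different whitespace.

```rust
use std::path::{Path, PathBuf};

/// Path of column `c` inside the matrix directory, using the
/// six-digit zero-padded naming shown in the layout above.
fn col_path(dir: &Path, c: usize) -> PathBuf {
    dir.join(format!("col_{:06}.pbiv", c))
}

/// Contents of `meta.json` for `n` slots and `n_cols` columns.
fn meta_json(n: usize, n_cols: usize) -> String {
    format!("{{\"n\": {}, \"n_cols\": {}}}", n, n_cols)
}
```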
### Builder (`PersistentBitMatrixBuilder`)
```rust
struct PersistentBitMatrixBuilder {
dir: PathBuf,
n: usize,
n_cols: usize,
}
```
**`new(n: usize, dir: &Path) -> io::Result<Self>`**
Creates the directory (including parents).
**`add_col(&mut self) -> io::Result<PersistentBitVecBuilder>`**
Creates `col_NNNNNN.pbiv` for the next column and returns its builder. The caller fills the column and calls `builder.close()` before calling `add_col` again.
**`close(self) -> io::Result<()>`**
Writes `meta.json` with the final `n` and `n_cols`.
### Reader (`PersistentBitMatrix`)
```rust
struct PersistentBitMatrix {
cols: Vec<PersistentBitVec>,
n: usize,
}
```
**`open(dir: &Path) -> io::Result<Self>`**
Reads `meta.json`, opens all `col_NNNNNN.pbiv` files.
**`row(slot: usize) -> Box<[bool]>`**
Returns the presence vector: `[col_0[slot], col_1[slot], …, col_{G-1}[slot]]`. One byte read per column. O(G).
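Row access amounts to one bit lookup per column, collected into a boxed slice. In the sketch below, `Column` is a hypothetical in-memory stand-in for a mmapped `PersistentBitVec`, and the bit order within a byte (least significant bit first) is an assumption.

```rust
/// Offset of the first data byte in a column file.
const HEADER_LEN: usize = 16;

/// In-memory stand-in for one column file (header + packed bits).
struct Column {
    bytes: Vec<u8>,
}

impl Column {
    /// One byte read per lookup. Assumption: LSB-first bit order.
    fn get(&self, slot: usize) -> bool {
        (self.bytes[HEADER_LEN + slot / 8] >> (slot % 8)) & 1 == 1
    }
}

/// Presence vector for `slot`: one bit from each column, in column order.
fn row(cols: &[Column], slot: usize) -> Box<[bool]> {
    cols.iter().map(|c| c.get(slot)).collect()
}
```

The cost is O(G) for G columns, and each lookup touches a different file, which is the trade-off the column-major layout accepts in exchange for fast per-genome operations.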
**`col(c: usize) -> &PersistentBitVec`**
Direct access to a single column for column-oriented operations.
### LayerData implementation
```rust
impl LayerData for PersistentBitMatrix {
type Item = Box<[bool]>;
fn open(layer_dir: &Path) -> OLMResult<Self> { /* opens layer_dir/presence/ */ }
fn read(&self, slot: usize) -> Box<[bool]> { self.row(slot) }
}
```