refactor: replace in-memory vectors with temp-file-backed storage

Introduces `TempCompactIntVec` and `TempBitVec` as temporary, file-backed intermediates to replace eager in-memory vectors, enabling OS-level paging under memory pressure. Updates the `MatrixGroupOps` trait to return `io::Result` types, allowing proper error propagation and supporting chunked accumulation for large column groups. Includes builder patterns with `.freeze()` finalization, automatic `TempDir` cleanup on drop, and necessary test updates to handle the new fallible signatures. Also fixes `Cargo.toml` section ordering.
Eric Coissac
2026-06-17 15:13:22 +02:00
parent 1d38d87ff9
commit fb4962c4fe
11 changed files with 399 additions and 131 deletions
+114 -70
@@ -11,8 +11,11 @@ src/obicompactvec/src/
reader.rs PersistentCompactIntVec (read-only)
builder.rs PersistentCompactIntVecBuilder (read-write)
memoryintvec.rs MemoryIntVec
tempintvec.rs TempCompactIntVec, TempCompactIntVecBuilder (temp-file-backed)
tempbitvec.rs TempBitVec, TempBitVecBuilder (temp-file-backed)
bitmatrix.rs PersistentBitMatrix, PersistentBitMatrixBuilder
intmatrix.rs PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder
colgroup.rs ColGroup, MatrixGroupOps trait
format.rs file format constants, encode/decode helpers
layer_meta.rs LayerMeta (column metadata)
meta.rs matrix metadata
@@ -24,13 +27,22 @@ graph TD
traits --> memoryintvec
bitvec --> memoryvec
bitvec --> bitmatrix
bitvec --> tempbitvec
format --> reader
format --> builder
reader --> intmatrix
reader --> tempintvec
builder --> intmatrix
builder --> memoryintvec
builder --> tempintvec
memoryvec --> traits
memoryintvec --> traits
tempintvec --> intmatrix
tempintvec --> bitmatrix
tempbitvec --> intmatrix
tempbitvec --> bitmatrix
colgroup --> intmatrix
colgroup --> bitmatrix
layer_meta --> bitmatrix
layer_meta --> intmatrix
meta --> bitmatrix
@@ -479,6 +491,8 @@ See `persistent_compact_int_vec.md` for file format and lifecycle.
| `MemoryIntVec` | inherent merge-scan ✓ | `byte_sum` ✓ | `byte_count_nonzero` ✓ |
| `PersistentCompactIntVecBuilder` | default (get-per-slot) | `byte_sum` on mmap ✓ | `byte_count_nonzero` on mmap ✓ |
| `PersistentCompactIntVec` | inherent merge-scan Iter ✓ | inherent `sum()` ✓ | inherent `count_nonzero()` ✓ |
| `TempCompactIntVec` | delegates to inner `PersistentCompactIntVec` | delegates | delegates |
| `TempCompactIntVecBuilder` | default (get-per-slot) | delegates to builder | delegates to builder |
| `PackedIntCol<'a>` | inherent PackedIntColIter ✓ | byte_sum ✓ | byte_count_nonzero ✓ |
`PackedIntCol` is used internally by `PersistentCompactIntMatrix` (packed format) for column views.
@@ -557,45 +571,68 @@ Required: `partial_jaccard() -> (Array2<u64>, Array2<u64>)` (inter, union), `par
---
## Planned — Filter / Select API
## Temp-file-backed types
### Composition across layers and partitions
`MemoryBitVec` and `MemoryIntVec` are reserved for truly transient intra-method intermediates (e.g. a single `cmp_scalar` result that lives for one loop iteration). **All inter-function results use temp-file-backed types** so the OS can page them out under memory pressure. This matters in practice: processing dozens of layers × hundreds of partitions in parallel would otherwise accumulate gigabytes of live anonymous memory.
```mermaid
graph TD
subgraph Index
CG["ColGroup\nVec&lt;usize&gt; — valid everywhere"]
ACC["MemoryIntVec\nglobal accumulator"]
PRED["geq / leq / and / or\n→ MemoryBitVec mask"]
end
### Lifecycle
subgraph "Layer 1"
subgraph "Partition A kmers 0..k/2"
MA["Matrix A\npartial_group_presence_count"]
end
subgraph "Partition B kmers k/2..k"
MB["Matrix B\npartial_group_presence_count"]
end
CONCAT1["concat → MemoryIntVec\[0..k\]"]
end
subgraph "Layer 2"
CONCAT2["concat → MemoryIntVec\[0..k\]"]
end
CG -->|"same indices"| MA
CG -->|"same indices"| MB
MA -->|"kmer range A"| CONCAT1
MB -->|"kmer range B"| CONCAT1
CONCAT1 -->|"IntSliceMut::add"| ACC
CONCAT2 -->|"IntSliceMut::add"| ACC
ACC --> PRED
```
```
TempCompactIntVecBuilder::new(n) → writable mmap in TempDir
↓ (set / add / count_bits / mask_with / …)
.freeze() → TempCompactIntVec (read-only mmap + TempDir)
↓ (optional)
.make_persistent(path) → PersistentCompactIntVec (permanent file)
```
Same pattern for `TempBitVecBuilder` → `TempBitVec` → `PersistentBitVec`.
**Drop order**: in `TempCompactIntVec { vec: PersistentCompactIntVec, _temp: TempDir }`, Rust drops fields in declaration order — `vec` (mmap) is released before `_temp` (directory) is deleted. No explicit `drop()` needed.
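The drop-order guarantee can be checked with a small standalone sketch. The types below (`Noisy`, `TempLike`) and the function `observed_drop_order` are hypothetical stand-ins, not part of the crate; they only mirror the field layout of `TempCompactIntVec` and log the order in which fields are dropped.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical helper: records its label in a shared log when dropped.
struct Noisy {
    label: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.label);
    }
}

/// Hypothetical stand-in for `TempCompactIntVec`: `vec` is declared before `_temp`.
#[allow(dead_code)]
struct TempLike {
    vec: Noisy,   // plays the role of the mmap-backed vector
    _temp: Noisy, // plays the role of the `TempDir`
}

/// Builds a `TempLike`, drops it, and returns the order in which its fields were dropped.
fn observed_drop_order() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let value = TempLike {
        vec: Noisy { label: "vec", log: Rc::clone(&log) },
        _temp: Noisy { label: "_temp", log: Rc::clone(&log) },
    };
    drop(value);
    let order = log.borrow().clone();
    order
}
```

Calling `observed_drop_order()` yields `["vec", "_temp"]`, which matches the claim that the mapping is released before the directory is removed. Reordering the two fields in the struct would reverse the result.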
### TempCompactIntVec / TempCompactIntVecBuilder
```rust
pub struct TempCompactIntVec {
vec: PersistentCompactIntVec,
_temp: TempDir, // dropped after vec
}
pub(crate) struct TempCompactIntVecBuilder {
builder: PersistentCompactIntVecBuilder,
temp: TempDir,
}
```
`TempCompactIntVec` implements `IntSlice` (full delegation to inner `PersistentCompactIntVec`).
`TempCompactIntVecBuilder` implements `IntSlice` + `IntSliceMut` (delegation to inner builder).
`make_persistent(path)` copies the temp file to `path` and opens it as `PersistentCompactIntVec`.
### TempBitVec / TempBitVecBuilder
```rust
pub struct TempBitVec {
vec: PersistentBitVec,
_temp: TempDir,
}
pub(crate) struct TempBitVecBuilder {
builder: PersistentBitVecBuilder,
temp: TempDir,
}
```
`TempBitVec` implements `BitSlice`.
`TempBitVecBuilder` implements `BitSlice` + `BitSliceMut`.
`make_persistent(path)` copies the temp file and opens as `PersistentBitVec`.
---
## Filter / Select API
### ColGroup
```rust
struct ColGroup { name: String, indices: Vec<usize> }
pub struct ColGroup { pub name: String, pub indices: Vec<usize> }
```
Defined **once at the index level** from column metadata. Valid in all matrices of all layers and partitions because column structure is identical across the entire hierarchy (same samples/genomes everywhere; only rows = kmer slots are partitioned).
@@ -607,68 +644,75 @@ Defined **once at the index level** from column metadata. Valid in all matrices
- **Across partitions**: kmer space is partitioned → partial results are **concatenated** (disjoint kmer ranges).
- **Across layers**: same kmer space, different counts → partial results are **aggregated** (add, OR, etc.).
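The two composition rules above can be sketched on plain vectors. This is a minimal illustration, assuming `Vec<u32>` as a stand-in for the temp-file-backed count vectors; `concat_partitions` and `add_layers` are hypothetical names, not functions of the crate.

```rust
/// Concatenates per-partition partial results (disjoint kmer ranges)
/// into one layer-wide vector.
fn concat_partitions(parts: &[Vec<u32>]) -> Vec<u32> {
    parts.iter().flatten().copied().collect()
}

/// Adds per-layer vectors slot by slot (every layer covers the same kmer space).
fn add_layers(layers: &[Vec<u32>]) -> Vec<u32> {
    let n = layers.first().map_or(0, |l| l.len());
    let mut acc = vec![0u32; n];
    for layer in layers {
        assert_eq!(layer.len(), n, "layers must cover the same kmer space");
        for (a, v) in acc.iter_mut().zip(layer) {
            *a += *v;
        }
    }
    acc
}
```

With two partitions per layer, each layer is first rebuilt with `concat_partitions`, and the resulting layer vectors are then combined with `add_layers`. The order matters: concatenation restores the full kmer range, and only then are slots comparable across layers.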
### Additivity rules
### MatrixGroupOps
```mermaid
flowchart LR
subgraph "Matrix level — returns MemoryIntVec"
PGP["partial_group_presence_count\npartial_group_sum\npartial_group_any → MemoryBitVec"]
end
subgraph "Index level — applies predicate"
GA["group_at_least(k)\n= accumulate.geq(k)"]
GALL["group_all\n= accumulate.geq(n_cols)"]
GANY["group_any\n= OR of partial_group_any"]
end
PGP -->|"concat across partitions\nadd across layers"| GA
PGP --> GALL
PGP --> GANY
```
Non-additive predicates (`group_all`, `group_at_least`) do **not** exist at matrix level — they require the global accumulated count.
### MatrixGroupOps (planned trait)
Group operations live on the matrix and expose only **additive intermediates** (`MemoryIntVec`). Predicates (final thresholds → `MemoryBitVec`) are applied at the index level after accumulation.
Group operations live on the matrix and expose only **additive intermediates** backed by temp files. Predicates (final thresholds → `MemoryBitVec`) are applied at the index level after accumulation.
```rust
trait MatrixGroupOps {
// How many columns in group have value >= threshold, per kmer slot
fn partial_group_presence_count(&self, g: &ColGroup, threshold: u32) -> MemoryIntVec;
pub trait MatrixGroupOps {
fn partial_group_presence_count(&self, g: &ColGroup, threshold: u32)
-> io::Result<TempCompactIntVec>;
// Sum of values across group columns, per kmer slot
fn partial_group_sum(&self, g: &ColGroup) -> MemoryIntVec;
fn partial_group_sum(&self, g: &ColGroup)
-> io::Result<TempCompactIntVec>;
// Kmer present (value >= threshold) in at least one column of group
fn partial_group_any(&self, g: &ColGroup, threshold: u32) -> MemoryBitVec;
fn partial_group_any(&self, g: &ColGroup, threshold: u32)
-> io::Result<TempBitVec>;
}
```
Non-additive predicates (`group_all`, `group_at_least(k)`) are **not** on the matrix — they are composed at the index level from the additive intermediates:
Implemented for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`. For bit matrices, `partial_group_sum` delegates to `partial_group_presence_count(g, 1)` since values are 0/1.
**`partial_group_presence_count` — chunking for large groups:**
When `g.indices.len() < 255`, per-slot counts fit in a raw `u8` — fast path: accumulate directly into `primary_bytes_mut()` using `inc_primary_bits`, then `freeze()`. No overflow map needed.
When `g.indices.len() ≥ 255`, process in chunks of 254 columns — each chunk stays within `u8` range — then add chunks into a running `TempCompactIntVecBuilder` accumulator via `IntSliceMut::add`. This keeps peak memory proportional to one partition, not the number of columns × partitions.
```
fast path (< 255 cols):
builder = TempCompactIntVecBuilder::new(n)
for c in group:
mask = col_view(c).cmp_scalar(|v| v >= threshold) // MemoryBitVec
inc_primary_bits(primary_bytes_mut, mask) // u8 safe
builder.freeze()
slow path (≥ 255 cols):
result = TempCompactIntVecBuilder::new(n)
for chunk in group.chunks(254):
chunk_builder = TempCompactIntVecBuilder::new(n)
inc_primary_bits(chunk_builder, …)
chunk_frozen = chunk_builder.freeze()
IntSliceMut::add(&mut result, &chunk_frozen)
result.freeze()
```
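The chunking idea can be sketched in plain Rust. This is an illustration under stated assumptions: columns are ordinary `&[u32]` slices, a `Vec<u8>` stands in for the primary bytes of a chunk builder, and a `Vec<u32>` stands in for the running accumulator. The function name `presence_count` and the constant `CHUNK` are hypothetical; no temp files or overflow maps are involved.

```rust
/// Number of columns processed per chunk; 254 increments always fit in a `u8`.
const CHUNK: usize = 254;

/// Counts, per slot, how many of the given columns hold a value >= `threshold`.
/// Every column must have exactly `n` slots.
fn presence_count(columns: &[&[u32]], n: usize, threshold: u32) -> Vec<u32> {
    let mut result = vec![0u32; n];
    for chunk in columns.chunks(CHUNK) {
        // Narrow per-chunk counters: at most 254 increments per slot.
        let mut partial = vec![0u8; n];
        for col in chunk {
            assert_eq!(col.len(), n, "all columns must have n slots");
            for (slot, &v) in col.iter().enumerate() {
                if v >= threshold {
                    partial[slot] += 1;
                }
            }
        }
        // Fold the narrow chunk counters into the wide accumulator.
        for (acc, &p) in result.iter_mut().zip(&partial) {
            *acc += u32::from(p);
        }
    }
    result
}
```

In this sketch a group of fewer than 255 columns simply yields a single chunk, so the fast and slow paths share one loop. The pseudocode above keeps them separate so the fast path can skip the extra accumulation step.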
Non-additive predicates (`group_all`, `group_at_least(k)`) are **not** on the matrix — composed at the index level:
```
// "present in >= 2 ingroup columns with count >= 3, absent from all outgroup"
let presence = layers.map(|l| l.partial_group_presence_count(&ingroup, 3)).sum();
let in_mask = presence.geq(2); // MemoryBitVec
let presence = layers.map(|l| l.partial_group_presence_count(&ingroup, 3)?).add_all()?;
let in_mask = presence.geq(2);
let out_sum = layers.map(|l| l.partial_group_sum(&outgroup)).sum();
let out_mask = out_sum.leq(0); // MemoryBitVec
let out_sum = layers.map(|l| l.partial_group_sum(&outgroup)?).add_all()?;
let out_mask = out_sum.leq(0);
let mask = in_mask.and(&out_mask); // BitSliceMut::and — O(n/64)
let mask = in_mask & &out_mask; // BitSliceMut::and — O(n/64)
```
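The same composition can be written as a runnable sketch on plain slices. Everything here is a stand-in: `Vec<u32>` replaces the accumulated count vectors, `Vec<bool>` replaces the bit masks, and the free functions `geq`, `leq`, `and` and `select_mask` are hypothetical helpers that only mimic the operations named above.

```rust
/// True where the accumulated count is >= k.
fn geq(counts: &[u32], k: u32) -> Vec<bool> {
    counts.iter().map(|&c| c >= k).collect()
}

/// True where the accumulated count is <= k.
fn leq(counts: &[u32], k: u32) -> Vec<bool> {
    counts.iter().map(|&c| c <= k).collect()
}

/// Slot-wise logical AND of two masks of equal length.
fn and(a: &[bool], b: &[bool]) -> Vec<bool> {
    assert_eq!(a.len(), b.len(), "masks must have the same length");
    a.iter().zip(b).map(|(&x, &y)| x && y).collect()
}

/// "Present in >= 2 ingroup columns, absent from all outgroup columns",
/// applied to counts that were already accumulated across layers.
fn select_mask(ingroup_presence: &[u32], outgroup_sum: &[u32]) -> Vec<bool> {
    let in_mask = geq(ingroup_presence, 2);
    let out_mask = leq(outgroup_sum, 0);
    and(&in_mask, &out_mask)
}
```

The thresholds are applied only after accumulation. Applying `geq(2)` per layer and then combining the masks would miss a kmer seen once in each of two layers, which is why these predicates stay off the matrix.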
### mask_with (planned IntSliceMut method)
### mask_with (IntSliceMut)
Apply a bit mask to a count vector: zero slots where the mask bit is 0. Iterates only zero bits — O(n_zeros), O(1) when mask is all-ones.
Provided method on `IntSliceMut`. Zeros every slot where the corresponding mask bit is 0. Scans the mask word by word and visits only zero bits: O(n/64 + n_zeros), so an all-ones mask costs just the word scan.
```
for (w_idx, word) in mask.words():
if word == u64::MAX: continue
if word == u64::MAX: continue // skip all-ones words
zeros = !word
while zeros != 0:
bit = trailing_zeros(zeros)
s = w_idx * 64 + bit
self.set(s, 0)
if primary[s] != 0: self.set(s, 0) // clears overflow entry too
zeros &= zeros - 1
```
This is the terminal operation for both Filter (zero non-selected kmer slots in a count matrix) and Select (positional selection without MPHF).
Terminal operation for Filter (retain only selected kmer slots in a count vector) and Select (positional selection without MPHF).
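A minimal Rust rendering of the loop above, assuming the mask is exposed as a slice of `u64` words (bit `i` stored in word `i / 64` at position `i % 64`) and the counts as a plain `&mut [u32]`. The free function `mask_with` here is a hypothetical stand-in for the trait method and does not model the overflow map.

```rust
/// Zeroes every slot of `counts` whose mask bit is 0.
fn mask_with(counts: &mut [u32], mask_words: &[u64]) {
    for (w_idx, &word) in mask_words.iter().enumerate() {
        if word == u64::MAX {
            continue; // all 64 slots selected, nothing to clear
        }
        let mut zeros = !word;
        while zeros != 0 {
            let bit = zeros.trailing_zeros() as usize;
            let slot = w_idx * 64 + bit;
            // Padding bits past the end of `counts` are ignored.
            if slot < counts.len() {
                counts[slot] = 0;
            }
            zeros &= zeros - 1; // clear the lowest set bit
        }
    }
}
```

For example, applying the single mask word `0b0101` to `[7, 8, 9, 10]` keeps slots 0 and 2 and clears slots 1 and 3. The `zeros &= zeros - 1` step is safe because it runs only while `zeros` is non-zero.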