refactor: replace in-memory vectors with temp-file-backed storage
Introduces `TempCompactIntVec` and `TempBitVec` as temporary, file-backed intermediates to replace eager in-memory vectors, enabling OS-level paging under memory pressure. Updates the `MatrixGroupOps` trait to return `io::Result` types, allowing proper error propagation and supporting chunked accumulation for large column groups. Includes builder patterns with `.freeze()` finalization, automatic `TempDir` cleanup on drop, and necessary test updates to handle the new fallible signatures. Also fixes `Cargo.toml` section ordering.
@@ -11,8 +11,11 @@ src/obicompactvec/src/
|
||||
reader.rs PersistentCompactIntVec (read-only)
|
||||
builder.rs PersistentCompactIntVecBuilder (read-write)
|
||||
memoryintvec.rs MemoryIntVec
|
||||
tempintvec.rs TempCompactIntVec, TempCompactIntVecBuilder (temp-file-backed)
|
||||
tempbitvec.rs TempBitVec, TempBitVecBuilder (temp-file-backed)
|
||||
bitmatrix.rs PersistentBitMatrix, PersistentBitMatrixBuilder
|
||||
intmatrix.rs PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder
|
||||
colgroup.rs ColGroup, MatrixGroupOps trait
|
||||
format.rs file format constants, encode/decode helpers
|
||||
layer_meta.rs LayerMeta (column metadata)
|
||||
meta.rs matrix metadata
|
||||
@@ -24,13 +27,22 @@ graph TD
|
||||
traits --> memoryintvec
|
||||
bitvec --> memoryvec
|
||||
bitvec --> bitmatrix
|
||||
bitvec --> tempbitvec
|
||||
format --> reader
|
||||
format --> builder
|
||||
reader --> intmatrix
|
||||
reader --> tempintvec
|
||||
builder --> intmatrix
|
||||
builder --> memoryintvec
|
||||
builder --> tempintvec
|
||||
memoryvec --> traits
|
||||
memoryintvec --> traits
|
||||
tempintvec --> intmatrix
|
||||
tempintvec --> bitmatrix
|
||||
tempbitvec --> intmatrix
|
||||
tempbitvec --> bitmatrix
|
||||
colgroup --> intmatrix
|
||||
colgroup --> bitmatrix
|
||||
layer_meta --> bitmatrix
|
||||
layer_meta --> intmatrix
|
||||
meta --> bitmatrix
|
||||
@@ -479,6 +491,8 @@ See `persistent_compact_int_vec.md` for file format and lifecycle.
|
||||
| `MemoryIntVec` | inherent merge-scan ✓ | `byte_sum` ✓ | `byte_count_nonzero` ✓ |
|
||||
| `PersistentCompactIntVecBuilder` | default (get-per-slot) | `byte_sum` on mmap ✓ | `byte_count_nonzero` on mmap ✓ |
|
||||
| `PersistentCompactIntVec` | inherent merge-scan Iter ✓ | inherent `sum()` ✓ | inherent `count_nonzero()` ✓ |
|
||||
| `TempCompactIntVec` | delegates to inner `PersistentCompactIntVec` | delegates | delegates |
|
||||
| `TempCompactIntVecBuilder` | default (get-per-slot) | delegates to builder | delegates to builder |
|
||||
| `PackedIntCol<'a>` | inherent PackedIntColIter ✓ | byte_sum ✓ | byte_count_nonzero ✓ |
|
||||
|
||||
`PackedIntCol` is used internally by `PersistentCompactIntMatrix` (packed format) for column views.
|
||||
@@ -557,45 +571,68 @@ Required: `partial_jaccard() -> (Array2<u64>, Array2<u64>)` (inter, union), `par
|
||||
|
||||
---
|
||||
|
||||
## Planned — Filter / Select API
|
||||
## Temp-file-backed types
|
||||
|
||||
### Composition across layers and partitions
|
||||
`MemoryBitVec` and `MemoryIntVec` are reserved for truly transient intra-method intermediates (e.g. a single `cmp_scalar` result that lives for one loop iteration). **All inter-function results use temp-file-backed types** so the OS can page them out under memory pressure. This matters in practice: processing dozens of layers × hundreds of partitions in parallel would otherwise accumulate gigabytes of live anonymous memory.
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
subgraph Index
|
||||
CG["ColGroup\nVec<usize> — valid everywhere"]
|
||||
ACC["MemoryIntVec\nglobal accumulator"]
|
||||
PRED["geq / leq / and / or\n→ MemoryBitVec mask"]
|
||||
end
|
||||
### Lifecycle
|
||||
|
||||
subgraph "Layer 1"
|
||||
subgraph "Partition A kmers 0..k/2"
|
||||
MA["Matrix A\npartial_group_presence_count"]
|
||||
end
|
||||
subgraph "Partition B kmers k/2..k"
|
||||
MB["Matrix B\npartial_group_presence_count"]
|
||||
end
|
||||
CONCAT1["concat → MemoryIntVec\[0..k\]"]
|
||||
end
|
||||
|
||||
subgraph "Layer 2"
|
||||
CONCAT2["concat → MemoryIntVec\[0..k\]"]
|
||||
end
|
||||
|
||||
CG -->|"same indices"| MA
|
||||
CG -->|"same indices"| MB
|
||||
MA -->|"kmer range A"| CONCAT1
|
||||
MB -->|"kmer range B"| CONCAT1
|
||||
CONCAT1 -->|"IntSliceMut::add"| ACC
|
||||
CONCAT2 -->|"IntSliceMut::add"| ACC
|
||||
ACC --> PRED
|
||||
```
|
||||
TempCompactIntVecBuilder::new(n) → writable mmap in TempDir
|
||||
↓ (set / add / count_bits / mask_with / …)
|
||||
.freeze() → TempCompactIntVec (read-only mmap + TempDir)
|
||||
↓ (optional)
|
||||
.make_persistent(path) → PersistentCompactIntVec (permanent file)
|
||||
```
|
||||
|
||||
Same pattern for `TempBitVecBuilder` → `TempBitVec` → `PersistentBitVec`.
|
||||
|
||||
**Drop order**: in `TempCompactIntVec { vec: PersistentCompactIntVec, _temp: TempDir }`, Rust drops fields in declaration order — `vec` (mmap) is released before `_temp` (directory) is deleted. No explicit `drop()` needed.
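The declaration-order guarantee can be checked in isolation. The following is a minimal sketch, not the crate's code: `Tracked`, `TempLike` and `drop_order` are hypothetical names, and the two fields only stand in for the mmap-backed vector and the `TempDir`.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Records its label into a shared log when dropped.
struct Tracked {
    label: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Tracked {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.label);
    }
}

/// Hypothetical stand-in for `TempCompactIntVec`, with the same field order.
#[allow(dead_code)]
struct TempLike {
    vec: Tracked,   // stands in for the mmap-backed vector
    _temp: Tracked, // stands in for `TempDir`
}

/// Builds a `TempLike`, drops it, and returns the order in which its fields were dropped.
fn drop_order() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let value = TempLike {
        vec: Tracked { label: "vec", log: Rc::clone(&log) },
        _temp: Tracked { label: "_temp", log: Rc::clone(&log) },
    };
    drop(value);
    let order = log.borrow().clone();
    order
}
```

Swapping the two field declarations in `TempLike` would reverse the recorded order, which is why the field order in the real struct matters.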
|
||||
|
||||
### TempCompactIntVec / TempCompactIntVecBuilder
|
||||
|
||||
```rust
pub struct TempCompactIntVec {
    vec: PersistentCompactIntVec,
    _temp: TempDir, // dropped after vec
}

pub(crate) struct TempCompactIntVecBuilder {
    builder: PersistentCompactIntVecBuilder,
    temp: TempDir,
}
```
|
||||
|
||||
`TempCompactIntVec` implements `IntSlice` (full delegation to inner `PersistentCompactIntVec`).
|
||||
`TempCompactIntVecBuilder` implements `IntSlice` + `IntSliceMut` (delegation to inner builder).
|
||||
`make_persistent(path)` copies the temp file to `path` and opens it as `PersistentCompactIntVec`.
|
||||
|
||||
### TempBitVec / TempBitVecBuilder
|
||||
|
||||
```rust
pub struct TempBitVec {
    vec: PersistentBitVec,
    _temp: TempDir,
}

pub(crate) struct TempBitVecBuilder {
    builder: PersistentBitVecBuilder,
    temp: TempDir,
}
```
|
||||
|
||||
`TempBitVec` implements `BitSlice`.
|
||||
`TempBitVecBuilder` implements `BitSlice` + `BitSliceMut`.
|
||||
`make_persistent(path)` copies the temp file and opens as `PersistentBitVec`.
|
||||
|
||||
---
|
||||
|
||||
## Filter / Select API
|
||||
|
||||
### ColGroup
|
||||
|
||||
```rust
|
||||
struct ColGroup { name: String, indices: Vec<usize> }
|
||||
pub struct ColGroup { pub name: String, pub indices: Vec<usize> }
|
||||
```
|
||||
|
||||
Defined **once at the index level** from column metadata. Valid in all matrices of all layers and partitions because column structure is identical across the entire hierarchy (same samples/genomes everywhere; only rows = kmer slots are partitioned).
|
||||
@@ -607,68 +644,75 @@ Defined **once at the index level** from column metadata. Valid in all matrices
|
||||
- **Across partitions**: kmer space is partitioned → partial results are **concatenated** (disjoint kmer ranges).
|
||||
- **Across layers**: same kmer space, different counts → partial results are **aggregated** (add, OR, etc.).
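The two combination rules can be sketched with plain vectors. This is an illustration only: `concat_partitions`, `add_layer` and `combine` are hypothetical helpers, and `Vec<u32>` stands in for the crate's vector types.

```rust
/// Concatenates per-partition partial results (disjoint kmer ranges)
/// into one vector covering the whole kmer space of a layer.
fn concat_partitions(parts: &[Vec<u32>]) -> Vec<u32> {
    parts.iter().flat_map(|p| p.iter().copied()).collect()
}

/// Adds one layer's counts into the global accumulator, slot by slot.
fn add_layer(acc: &mut [u32], layer: &[u32]) {
    assert_eq!(acc.len(), layer.len(), "layers must cover the same kmer space");
    for (a, &v) in acc.iter_mut().zip(layer) {
        *a += v;
    }
}

/// `layers[l][p]` is the partial result of partition `p` in layer `l`.
fn combine(layers: &[Vec<Vec<u32>>]) -> Vec<u32> {
    let mut acc: Vec<u32> = Vec::new();
    for partitions in layers {
        let layer = concat_partitions(partitions);
        if acc.is_empty() {
            acc = layer; // first layer initialises the accumulator
        } else {
            add_layer(&mut acc, &layer);
        }
    }
    acc
}
```

Concatenation happens inside a layer, addition between layers, so every layer must split the kmer space into partitions that add up to the same total length.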
|
||||
|
||||
### Additivity rules
|
||||
### MatrixGroupOps
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph "Matrix level — returns MemoryIntVec"
|
||||
PGP["partial_group_presence_count\npartial_group_sum\npartial_group_any → MemoryBitVec"]
|
||||
end
|
||||
subgraph "Index level — applies predicate"
|
||||
GA["group_at_least(k)\n= accumulate.geq(k)"]
|
||||
GALL["group_all\n= accumulate.geq(n_cols)"]
|
||||
GANY["group_any\n= OR of partial_group_any"]
|
||||
end
|
||||
PGP -->|"concat across partitions\nadd across layers"| GA
|
||||
PGP --> GALL
|
||||
PGP --> GANY
|
||||
```
|
||||
|
||||
Non-additive predicates (`group_all`, `group_at_least`) do **not** exist at matrix level — they require the global accumulated count.
|
||||
|
||||
### MatrixGroupOps (planned trait)
|
||||
|
||||
Group operations live on the matrix and expose only **additive intermediates** (`MemoryIntVec`). Predicates (final thresholds → `MemoryBitVec`) are applied at the index level after accumulation.
|
||||
Group operations live on the matrix and expose only **additive intermediates** backed by temp files. Predicates (final thresholds → `MemoryBitVec`) are applied at the index level after accumulation.
|
||||
|
||||
```rust
|
||||
trait MatrixGroupOps {
|
||||
// How many columns in group have value >= threshold, per kmer slot
|
||||
fn partial_group_presence_count(&self, g: &ColGroup, threshold: u32) -> MemoryIntVec;
|
||||
pub trait MatrixGroupOps {
|
||||
fn partial_group_presence_count(&self, g: &ColGroup, threshold: u32)
|
||||
-> io::Result<TempCompactIntVec>;
|
||||
|
||||
// Sum of values across group columns, per kmer slot
|
||||
fn partial_group_sum(&self, g: &ColGroup) -> MemoryIntVec;
|
||||
fn partial_group_sum(&self, g: &ColGroup)
|
||||
-> io::Result<TempCompactIntVec>;
|
||||
|
||||
// Kmer present (value >= threshold) in at least one column of group
|
||||
fn partial_group_any(&self, g: &ColGroup, threshold: u32) -> MemoryBitVec;
|
||||
fn partial_group_any(&self, g: &ColGroup, threshold: u32)
|
||||
-> io::Result<TempBitVec>;
|
||||
}
|
||||
```
|
||||
|
||||
Non-additive predicates (`group_all`, `group_at_least(k)`) are **not** on the matrix — they are composed at the index level from the additive intermediates:
|
||||
Implemented for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`. For bit matrices, `partial_group_sum` delegates to `partial_group_presence_count(g, 1)` since values are 0/1.
|
||||
|
||||
**`partial_group_presence_count` — chunking for large groups:**
|
||||
|
||||
When `g.indices.len() < 255`, per-slot counts fit in a raw `u8` — fast path: accumulate directly into `primary_bytes_mut()` using `inc_primary_bits`, then `freeze()`. No overflow map needed.
|
||||
|
||||
When `g.indices.len() ≥ 255`, process in chunks of 254 columns — each chunk stays within `u8` range — then add chunks into a running `TempCompactIntVecBuilder` accumulator via `IntSliceMut::add`. This keeps peak memory proportional to one partition, not the number of columns × partitions.
|
||||
|
||||
```
fast path (< 255 cols):
    builder = TempCompactIntVecBuilder::new(n)
    for c in group:
        mask = col_view(c).cmp_scalar(|v| v >= threshold)   // MemoryBitVec
        inc_primary_bits(primary_bytes_mut, mask)           // u8 safe
    builder.freeze()

slow path (≥ 255 cols):
    result = TempCompactIntVecBuilder::new(n)
    for chunk in group.chunks(254):
        chunk_builder = TempCompactIntVecBuilder::new(n)
        inc_primary_bits(chunk_builder, …)
        chunk_frozen = chunk_builder.freeze()
        IntSliceMut::add(&mut result, &chunk_frozen)
    result.freeze()
```
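The chunking idea can be reproduced with in-memory vectors. The sketch below is an assumption-laden illustration, not the crate's implementation: `presence_count` is a hypothetical function, columns are plain `Vec<u32>`, and the `u32` result vector stands in for the `TempCompactIntVecBuilder` accumulator.

```rust
/// Maximum number of columns per chunk, so a per-slot `u8` counter cannot overflow.
const CHUNK: usize = 254;

/// Counts, for each of the `n` slots, how many columns of `group`
/// hold a value >= `threshold`. Every column is assumed to have `n` slots.
fn presence_count(columns: &[Vec<u32>], group: &[usize], threshold: u32, n: usize) -> Vec<u32> {
    let mut result = vec![0u32; n];
    for chunk in group.chunks(CHUNK) {
        // Per-chunk counters fit in one byte: at most 254 increments per slot.
        let mut chunk_counts = vec![0u8; n];
        for &c in chunk {
            for (count, &v) in chunk_counts.iter_mut().zip(&columns[c]) {
                if v >= threshold {
                    *count += 1;
                }
            }
        }
        // Fold the narrow chunk counters into the wide accumulator.
        for (r, &c) in result.iter_mut().zip(&chunk_counts) {
            *r += u32::from(c);
        }
    }
    result
}
```

Only one `u8` buffer per chunk and one wide accumulator are live at any time, whatever the size of the group.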
|
||||
|
||||
Non-additive predicates (`group_all`, `group_at_least(k)`) are **not** on the matrix — composed at the index level:
|
||||
|
||||
```
|
||||
// "present in >= 2 ingroup columns with count >= 3, absent from all outgroup"
|
||||
let presence = layers.map(|l| l.partial_group_presence_count(&ingroup, 3)).sum();
|
||||
let in_mask = presence.geq(2); // MemoryBitVec
|
||||
let presence = layers.map(|l| l.partial_group_presence_count(&ingroup, 3)?).add_all()?;
|
||||
let in_mask = presence.geq(2);
|
||||
|
||||
let out_sum = layers.map(|l| l.partial_group_sum(&outgroup)).sum();
|
||||
let out_mask = out_sum.leq(0); // MemoryBitVec
|
||||
let out_sum = layers.map(|l| l.partial_group_sum(&outgroup)?).add_all()?;
|
||||
let out_mask = out_sum.leq(0);
|
||||
|
||||
let mask = in_mask.and(&out_mask); // BitSliceMut::and — O(n/64)
|
||||
let mask = in_mask & &out_mask; // BitSliceMut::and — O(n/64)
|
||||
```
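To make the index-level composition concrete, here is a small sketch over plain slices. `mask_where` and `and_masks` are hypothetical helpers standing in for `geq` / `leq` and the mask AND; the bit layout (slot `i` in bit `i % 64` of word `i / 64`) is an assumption of this example.

```rust
/// Builds a bit mask with bit `i` set where `pred(values[i])` holds.
/// Slot `i` lives in bit `i % 64` of word `i / 64`.
fn mask_where(values: &[u32], pred: impl Fn(u32) -> bool) -> Vec<u64> {
    let mut words = vec![0u64; (values.len() + 63) / 64];
    for (i, &v) in values.iter().enumerate() {
        if pred(v) {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Word-wise AND of two masks of equal length.
fn and_masks(a: &[u64], b: &[u64]) -> Vec<u64> {
    a.iter().zip(b).map(|(x, y)| x & y).collect()
}
```

With an accumulated presence count and an accumulated outgroup sum, `mask_where(&presence, |v| v >= 2)` and `mask_where(&out_sum, |v| v == 0)` give the two masks, and `and_masks` yields the final selection.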
|
||||
|
||||
### mask_with (planned IntSliceMut method)
|
||||
### mask_with (IntSliceMut)
|
||||
|
||||
Apply a bit mask to a count vector: zero slots where the mask bit is 0. Iterates only zero bits — O(n_zeros), O(1) when mask is all-ones.
|
||||
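Provided method on `IntSliceMut`. Zeros every slot where the corresponding mask bit is 0. Only zero bits trigger writes: the cost is O(n/64 + n_zeros), i.e. one pass over the mask words plus one step per zero bit, so an all-ones mask costs only the word scan.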
|
||||
|
||||
```
|
||||
for (w_idx, word) in mask.words():
|
||||
if word == u64::MAX: continue
|
||||
if word == u64::MAX: continue // skip all-ones words
|
||||
zeros = !word
|
||||
while zeros != 0:
|
||||
bit = trailing_zeros(zeros)
|
||||
s = w_idx * 64 + bit
|
||||
self.set(s, 0)
|
||||
if primary[s] != 0: self.set(s, 0) // clears overflow entry too
|
||||
zeros &= zeros - 1
|
||||
```
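The loop above translates almost directly to Rust. This is a sketch over a plain `&mut [u32]` rather than the real `IntSliceMut` implementation (so there is no overflow map to clear); the free function `mask_with` and the mask layout (slot `i` in bit `i % 64` of word `i / 64`) are assumptions of the example.

```rust
/// Zeros every slot of `counts` whose mask bit is 0.
fn mask_with(counts: &mut [u32], mask: &[u64]) {
    for (w_idx, &word) in mask.iter().enumerate() {
        if word == u64::MAX {
            continue; // all 64 slots are kept, nothing to clear
        }
        let mut zeros = !word; // set bits now mark slots to clear
        while zeros != 0 {
            let bit = zeros.trailing_zeros() as usize;
            let s = w_idx * 64 + bit;
            if s >= counts.len() {
                break; // remaining bits are padding past the last slot
            }
            counts[s] = 0;
            zeros &= zeros - 1; // clear the lowest set bit
        }
    }
}
```

The early `break` relies on bits being visited in ascending order within a word, so once one index is out of range all later ones are too.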
|
||||
|
||||
This is the terminal operation for both Filter (zero non-selected kmer slots in a count matrix) and Select (positional selection without MPHF).
|
||||
Terminal operation for Filter (retain only selected kmer slots in a count vector) and Select (positional selection without MPHF).
|
||||
|
||||