refactor: replace in-memory vectors with temp-file-backed storage

Introduces `TempCompactIntVec` and `TempBitVec` as temporary, file-backed intermediates to replace eager in-memory vectors, enabling OS-level paging under memory pressure. Updates the `MatrixGroupOps` trait to return `io::Result` types, allowing proper error propagation and supporting chunked accumulation for large column groups. Includes builder patterns with `.freeze()` finalization, automatic `TempDir` cleanup on drop, and necessary test updates to handle the new fallible signatures. Also fixes `Cargo.toml` section ordering.
2026-06-17 15:13:22 +02:00
parent 1d38d87ff9
commit fb4962c4fe
11 changed files with 399 additions and 131 deletions
@@ -11,8 +11,11 @@ src/obicompactvec/src/
  reader.rs         PersistentCompactIntVec (read-only)
  builder.rs        PersistentCompactIntVecBuilder (read-write)
  memoryintvec.rs   MemoryIntVec
+  tempintvec.rs     TempCompactIntVec, TempCompactIntVecBuilder (temp-file-backed)
+  tempbitvec.rs     TempBitVec, TempBitVecBuilder (temp-file-backed)
  bitmatrix.rs      PersistentBitMatrix, PersistentBitMatrixBuilder
  intmatrix.rs      PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder
+  colgroup.rs       ColGroup, MatrixGroupOps trait
  format.rs         file format constants, encode/decode helpers
  layer_meta.rs     LayerMeta (column metadata)
  meta.rs           matrix metadata
@@ -24,13 +27,22 @@ graph TD
    traits --> memoryintvec
    bitvec --> memoryvec
    bitvec --> bitmatrix
+    bitvec --> tempbitvec
    format --> reader
    format --> builder
    reader --> intmatrix
+    reader --> tempintvec
    builder --> intmatrix
    builder --> memoryintvec
+    builder --> tempintvec
    memoryvec --> traits
    memoryintvec --> traits
+    tempintvec --> intmatrix
+    tempintvec --> bitmatrix
+    tempbitvec --> intmatrix
+    tempbitvec --> bitmatrix
+    colgroup --> intmatrix
+    colgroup --> bitmatrix
    layer_meta --> bitmatrix
    layer_meta --> intmatrix
    meta --> bitmatrix
@@ -479,6 +491,8 @@ See `persistent_compact_int_vec.md` for file format and lifecycle.
 | `MemoryIntVec` | inherent merge-scan ✓ | `byte_sum` ✓ | `byte_count_nonzero` ✓ |
 | `PersistentCompactIntVecBuilder` | default (get-per-slot) | `byte_sum` on mmap ✓ | `byte_count_nonzero` on mmap ✓ |
 | `PersistentCompactIntVec` | inherent merge-scan Iter ✓ | inherent `sum()` ✓ | inherent `count_nonzero()` ✓ |
+| `TempCompactIntVec` | delegates to inner `PersistentCompactIntVec` | delegates | delegates |
+| `TempCompactIntVecBuilder` | default (get-per-slot) | delegates to builder | delegates to builder |
 | `PackedIntCol<'a>` | inherent PackedIntColIter ✓ | byte_sum ✓ | byte_count_nonzero ✓ |

 `PackedIntCol` is used internally by `PersistentCompactIntMatrix` (packed format) for column views.
@@ -557,45 +571,68 @@ Required: `partial_jaccard() -> (Array2<u64>, Array2<u64>)` (inter, union), `par

 ---

-## Planned — Filter / Select API
+## Temp-file-backed types

-### Composition across layers and partitions
+`MemoryBitVec` and `MemoryIntVec` are reserved for truly transient intra-method intermediates (e.g. a single `cmp_scalar` result that lives for one loop iteration). **All inter-function results use temp-file-backed types** so the OS can page them out under memory pressure. This matters in practice: processing dozens of layers × hundreds of partitions in parallel would otherwise accumulate gigabytes of live anonymous memory.

-```mermaid
-graph TD
-    subgraph Index
-        CG["ColGroup\nVec&lt;usize&gt; — valid everywhere"]
-        ACC["MemoryIntVec\nglobal accumulator"]
-        PRED["geq / leq / and / or\n→ MemoryBitVec mask"]
-    end
+### Lifecycle

-    subgraph "Layer 1"
-        subgraph "Partition A  kmers 0..k/2"
-            MA["Matrix A\npartial_group_presence_count"]
-        end
-        subgraph "Partition B  kmers k/2..k"
-            MB["Matrix B\npartial_group_presence_count"]
-        end
-        CONCAT1["concat → MemoryIntVec\[0..k\]"]
-    end
-
-    subgraph "Layer 2"
-        CONCAT2["concat → MemoryIntVec\[0..k\]"]
-    end
-
-    CG -->|"same indices"| MA
-    CG -->|"same indices"| MB
-    MA -->|"kmer range A"| CONCAT1
-    MB -->|"kmer range B"| CONCAT1
-    CONCAT1 -->|"IntSliceMut::add"| ACC
-    CONCAT2 -->|"IntSliceMut::add"| ACC
-    ACC --> PRED
 ```
+TempCompactIntVecBuilder::new(n)   →  writable mmap in TempDir
+     ↓  (set / add / count_bits / mask_with / …)
+ .freeze()                          →  TempCompactIntVec  (read-only mmap + TempDir)
+     ↓  (optional)
+ .make_persistent(path)             →  PersistentCompactIntVec  (permanent file)
+```
+
+Same pattern for `TempBitVecBuilder` → `TempBitVec` → `PersistentBitVec`.
+
+**Drop order**: in `TempCompactIntVec { vec: PersistentCompactIntVec, _temp: TempDir }`, Rust drops fields in declaration order — `vec` (mmap) is released before `_temp` (directory) is deleted. No explicit `drop()` needed.
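
The declaration-order rule can be checked in isolation. The sketch below is not the crate's code: `Tracer` and `TempLike` are hypothetical stand-ins that only mirror the field layout of `TempCompactIntVec` and record the order in which their fields are dropped.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Records its label into a shared log when dropped.
struct Tracer {
    label: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Tracer {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.label);
    }
}

/// Hypothetical stand-in for `TempCompactIntVec`: same field order.
#[allow(dead_code)]
struct TempLike {
    vec: Tracer,   // plays the role of the mmap-backed vector
    _temp: Tracer, // plays the role of the TempDir
}

/// Builds a `TempLike`, drops it, and returns the observed drop order.
pub fn observed_drop_order() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let value = TempLike {
        vec: Tracer { label: "vec", log: Rc::clone(&log) },
        _temp: Tracer { label: "_temp", log: Rc::clone(&log) },
    };
    drop(value);
    // Bind the clone first so the `Ref` guard is released before `log`.
    let order = log.borrow().clone();
    order
}
```

Swapping the two fields in `TempLike` would reverse the recorded order, which is why the field order in the real struct matters.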
+
+### TempCompactIntVec / TempCompactIntVecBuilder
+
+```rust
+pub struct TempCompactIntVec {
+    vec:   PersistentCompactIntVec,
+    _temp: TempDir,        // dropped after vec
+}
+
+pub(crate) struct TempCompactIntVecBuilder {
+    builder: PersistentCompactIntVecBuilder,
+    temp:    TempDir,
+}
+```
+
+`TempCompactIntVec` implements `IntSlice` (full delegation to inner `PersistentCompactIntVec`).  
+`TempCompactIntVecBuilder` implements `IntSlice` + `IntSliceMut` (delegation to inner builder).  
+`make_persistent(path)` copies the temp file to `path` and opens it as `PersistentCompactIntVec`.
+
+### TempBitVec / TempBitVecBuilder
+
+```rust
+pub struct TempBitVec {
+    vec:   PersistentBitVec,
+    _temp: TempDir,
+}
+
+pub(crate) struct TempBitVecBuilder {
+    builder: PersistentBitVecBuilder,
+    temp:    TempDir,
+}
+```
+
+`TempBitVec` implements `BitSlice`.  
+`TempBitVecBuilder` implements `BitSlice` + `BitSliceMut`.  
+`make_persistent(path)` copies the temp file to `path` and opens it as `PersistentBitVec`.
+
+---
+
+## Filter / Select API

 ### ColGroup

 ```rust
-struct ColGroup { name: String, indices: Vec<usize> }
+pub struct ColGroup { pub name: String, pub indices: Vec<usize> }
 ```

 Defined **once at the index level** from column metadata. Valid in all matrices of all layers and partitions because column structure is identical across the entire hierarchy (same samples/genomes everywhere; only rows = kmer slots are partitioned).
@@ -607,68 +644,75 @@ Defined **once at the index level** from column metadata. Valid in all matrices
 - **Across partitions**: kmer space is partitioned → partial results are **concatenated** (disjoint kmer ranges).
 - **Across layers**: same kmer space, different counts → partial results are **aggregated** (add, OR, etc.).

-### Additivity rules
+### MatrixGroupOps

-```mermaid
-flowchart LR
-    subgraph "Matrix level — returns MemoryIntVec"
-        PGP["partial_group_presence_count\npartial_group_sum\npartial_group_any → MemoryBitVec"]
-    end
-    subgraph "Index level — applies predicate"
-        GA["group_at_least(k)\n= accumulate.geq(k)"]
-        GALL["group_all\n= accumulate.geq(n_cols)"]
-        GANY["group_any\n= OR of partial_group_any"]
-    end
-    PGP -->|"concat across partitions\nadd across layers"| GA
-    PGP --> GALL
-    PGP --> GANY
-```
-
-Non-additive predicates (`group_all`, `group_at_least`) do **not** exist at matrix level — they require the global accumulated count.
-
-### MatrixGroupOps (planned trait)
-
-Group operations live on the matrix and expose only **additive intermediates** (`MemoryIntVec`). Predicates (final thresholds → `MemoryBitVec`) are applied at the index level after accumulation.
+Group operations live on the matrix and expose only **additive intermediates** backed by temp files. Predicates (final thresholds → `MemoryBitVec`) are applied at the index level after accumulation.

 ```rust
-trait MatrixGroupOps {
-    // How many columns in group have value >= threshold, per kmer slot
-    fn partial_group_presence_count(&self, g: &ColGroup, threshold: u32) -> MemoryIntVec;
+pub trait MatrixGroupOps {
+    fn partial_group_presence_count(&self, g: &ColGroup, threshold: u32)
+        -> io::Result<TempCompactIntVec>;

-    // Sum of values across group columns, per kmer slot
-    fn partial_group_sum(&self, g: &ColGroup) -> MemoryIntVec;
+    fn partial_group_sum(&self, g: &ColGroup)
+        -> io::Result<TempCompactIntVec>;

-    // Kmer present (value >= threshold) in at least one column of group
-    fn partial_group_any(&self, g: &ColGroup, threshold: u32) -> MemoryBitVec;
+    fn partial_group_any(&self, g: &ColGroup, threshold: u32)
+        -> io::Result<TempBitVec>;
 }
 ```

-Non-additive predicates (`group_all`, `group_at_least(k)`) are **not** on the matrix — they are composed at the index level from the additive intermediates:
+Implemented for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`. For bit matrices, `partial_group_sum` delegates to `partial_group_presence_count(g, 1)` since values are 0/1.
+
+**`partial_group_presence_count` — chunking for large groups:**
+
+When `g.indices.len() < 255`, per-slot counts fit in a raw `u8`, so the fast path accumulates directly into `primary_bytes_mut()` using `inc_primary_bits`, then calls `freeze()`. No overflow map is needed.
+
+When `g.indices.len() ≥ 255`, columns are processed in chunks of 254 so that each chunk's counts stay within `u8` range. Each frozen chunk is then added into a running `TempCompactIntVecBuilder` accumulator via `IntSliceMut::add`. This keeps peak memory proportional to one partition, not to the number of columns × partitions.
+
+```
+fast path (< 255 cols):
+  builder = TempCompactIntVecBuilder::new(n)
+  for c in group:
+    mask = col_view(c).cmp_scalar(|v| v >= threshold)  // MemoryBitVec
+    inc_primary_bits(primary_bytes_mut, mask)           // u8 safe
+  builder.freeze()
+
+slow path (≥ 255 cols):
+  result = TempCompactIntVecBuilder::new(n)
+  for chunk in group.chunks(254):
+    chunk_builder = TempCompactIntVecBuilder::new(n)
+    inc_primary_bits(chunk_builder, …)
+    chunk_frozen = chunk_builder.freeze()
+    IntSliceMut::add(&mut result, &chunk_frozen)
+  result.freeze()
+```
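
The chunking arithmetic can be sketched without any file-backed storage. In the version below, plain `Vec`s stand in for the temp-file-backed builders, and the name `presence_count` and its signature are illustrative assumptions, not the crate's API. All columns are assumed to have the same length.

```rust
/// Maximum number of columns accumulated into a `u8` counter at once,
/// mirroring the chunk size of 254 described above.
const CHUNK: usize = 254;

/// Counts, per slot, how many columns of `group` hold a value >= `threshold`.
/// Plain `Vec`s replace the temp-file-backed builders (assumption).
pub fn presence_count(columns: &[Vec<u32>], group: &[usize], threshold: u32) -> Vec<u32> {
    let n = columns.first().map_or(0, |c| c.len());
    let mut result = vec![0u32; n];
    for chunk in group.chunks(CHUNK) {
        // At most 254 increments per slot, so the u8 counter cannot overflow.
        let mut partial = vec![0u8; n];
        for &c in chunk {
            for (slot, &v) in columns[c].iter().enumerate() {
                if v >= threshold {
                    partial[slot] += 1;
                }
            }
        }
        // Fold the chunk into the wide accumulator (the `IntSliceMut::add` step).
        for (acc, &p) in result.iter_mut().zip(&partial) {
            *acc += u32::from(p);
        }
    }
    result
}
```

With fewer than 255 columns the outer loop runs once, which corresponds to the fast path. Larger groups simply repeat the same narrow accumulation and widen only when folding.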
+
+Non-additive predicates (`group_all`, `group_at_least(k)`) are **not** on the matrix — composed at the index level:

 ```
 // "present in >= 2 ingroup columns with count >= 3, absent from all outgroup"
-let presence = layers.map(|l| l.partial_group_presence_count(&ingroup, 3)).sum();
-let in_mask  = presence.geq(2);                                     // MemoryBitVec
+let presence = layers.map(|l| l.partial_group_presence_count(&ingroup, 3)?).add_all()?;
+let in_mask  = presence.geq(2);

-let out_sum  = layers.map(|l| l.partial_group_sum(&outgroup)).sum();
-let out_mask = out_sum.leq(0);                                      // MemoryBitVec
+let out_sum  = layers.map(|l| l.partial_group_sum(&outgroup)?).add_all()?;
+let out_mask = out_sum.leq(0);

-let mask = in_mask.and(&out_mask);    // BitSliceMut::and — O(n/64)
+let mask = in_mask & &out_mask;    // BitSliceMut::and — O(n/64)
 ```
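
The composition above is pseudocode: in real Rust, `?` inside a closure returns from the closure, not from the enclosing function. One way to get the same short-circuiting is `Iterator::try_fold`. The sketch below assumes `Vec<u32>` and `Vec<bool>` as stand-ins for `TempCompactIntVec` and the bit mask, and `add_all` and `geq` here are free functions written for illustration, not the crate's methods.

```rust
use std::io;

/// Adds per-layer partial vectors element-wise, stopping at the first error.
/// `Vec<u32>` stands in for `TempCompactIntVec` (assumption).
pub fn add_all<I>(parts: I, n: usize) -> io::Result<Vec<u32>>
where
    I: IntoIterator<Item = io::Result<Vec<u32>>>,
{
    parts.into_iter().try_fold(vec![0u32; n], |mut acc, part| {
        let part = part?; // propagate a layer's I/O error
        if part.len() != n {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "length mismatch"));
        }
        for (a, p) in acc.iter_mut().zip(&part) {
            *a += p;
        }
        Ok(acc)
    })
}

/// Threshold predicate: one bool per slot, true where the count is >= `k`.
pub fn geq(counts: &[u32], k: u32) -> Vec<bool> {
    counts.iter().map(|&c| c >= k).collect()
}
```

A caller would pass the per-layer results as an iterator of `io::Result` values, and the first failing layer aborts the accumulation.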

-### mask_with (planned IntSliceMut method)
+### mask_with (IntSliceMut)

-Apply a bit mask to a count vector: zero slots where the mask bit is 0. Iterates only zero bits — O(n_zeros), O(1) when mask is all-ones.
+Provided method on `IntSliceMut`. Zeros every slot where the corresponding mask bit is 0. Scans the mask word by word, skips all-ones words, and visits only zero bits: O(n/64 + n_zeros), so an all-ones mask costs only the word scan.

 ```
 for (w_idx, word) in mask.words():
-  if word == u64::MAX: continue
+  if word == u64::MAX: continue   // skip all-ones words
  zeros = !word
  while zeros != 0:
    bit = trailing_zeros(zeros)
    s = w_idx * 64 + bit
-    self.set(s, 0)
+    if primary[s] != 0: self.set(s, 0)   // clears overflow entry too
     zeros &= zeros - 1
 ```
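
A minimal Rust rendering of the loop above, under stated assumptions: counts live in a plain `&mut [u32]` with no overflow map, the mask is packed 64 slots per `u64` with the least significant bit first, and the final word may carry padding bits past the end of `counts`. The free function `mask_with` is illustrative and is not the trait method itself.

```rust
/// Zeroes every slot of `counts` whose mask bit is 0.
/// `mask` packs 64 slots per word, least significant bit first (assumption).
pub fn mask_with(counts: &mut [u32], mask: &[u64]) {
    for (w_idx, &word) in mask.iter().enumerate() {
        if word == u64::MAX {
            continue; // nothing to clear in an all-ones word
        }
        let mut zeros = !word;
        while zeros != 0 {
            let bit = zeros.trailing_zeros() as usize;
            let slot = w_idx * 64 + bit;
            if slot >= counts.len() {
                break; // remaining bits are padding in the last word
            }
            counts[slot] = 0;
            zeros &= zeros - 1; // clear the lowest set bit
        }
    }
}
```

The `zeros &= zeros - 1` step removes the lowest set bit, so the inner loop runs once per zero mask bit and never once per slot.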

-This is the terminal operation for both Filter (zero non-selected kmer slots in a count matrix) and Select (positional selection without MPHF).
+Terminal operation for Filter (retain only selected kmer slots in a count vector) and Select (positional selection without MPHF).