2026-06-17 13:14:11 +02:00
# obicompactvec — Complete Reference
## Module structure
```
src/obicompactvec/src/
  lib.rs          public re-exports
  views.rs        BitSliceView<'a>, IntSliceView<'a>: zero-copy read views
  traits.rs       ColumnWeights, CountPartials, BitPartials (matrix aggregation)
  bitvec.rs       PersistentBitVec, PersistentBitVecBuilder, BitIter
  reader.rs       PersistentCompactIntVec (read-only)
  builder.rs      PersistentCompactIntVecBuilder (read-write)
  tempintvec.rs   TempCompactIntVec, TempCompactIntVecBuilder (temp-file-backed)
  tempbitvec.rs   TempBitVec, TempBitVecBuilder (temp-file-backed)
  bitmatrix.rs    PersistentBitMatrix, PersistentBitMatrixBuilder
  intmatrix.rs    PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder
  colgroup.rs     ColGroup, MatrixGroupOps trait
  format.rs       file format constants, encode/decode helpers
  layer_meta.rs   LayerMeta (column metadata)
  meta.rs         matrix metadata
```
```mermaid
graph TD
    views --> bitvec
    views --> builder
    views --> tempbitvec
    views --> tempintvec
    views --> bitmatrix
    views --> intmatrix
    format --> reader
    format --> builder
    reader --> intmatrix
    reader --> tempintvec
    builder --> intmatrix
    builder --> tempintvec
    bitvec --> tempbitvec
    bitvec --> bitmatrix
    tempintvec --> intmatrix
    tempintvec --> bitmatrix
    tempbitvec --> intmatrix
    tempbitvec --> bitmatrix
    colgroup --> intmatrix
    colgroup --> bitmatrix
    layer_meta --> bitmatrix
    layer_meta --> intmatrix
    meta --> bitmatrix
    meta --> intmatrix
```
---
## Compact int encoding
All integer vectors use the same two-tier encoding regardless of storage backend.

**Primary array** (one `u8` per slot):
- Values **0–254** are stored directly, with no overhead.
- Value **255 is a sentinel**: the slot's actual value is ≥ 255 and lives in the overflow store.

**Overflow store** (maps slot index to a `u32` value ≥ 255):
- In `PersistentCompactIntVecBuilder`: a `HashMap<usize, u32>` in RAM.
- In `PersistentCompactIntVec` (reader): a sorted `[(slot: u64, value: u32)]` array in the mmap, with a sparse L1-resident index for binary search.
```mermaid
flowchart LR
    slot --> P["primary[slot]: u8"]
    P -->|"< 255"| V["value = byte (0–254)"]
    P -->|"= 255 sentinel"| OV["overflow store"]
    OV -->|"Builder"| HM["HashMap<usize, u32>\nin RAM"]
    OV -->|"PersistentCompactIntVec"| SA["sorted [(slot,value)] in mmap\n+ sparse L1 index"]
```
**Key property: sentinel 255 = +∞ on `u8`:**
- `min(a, 255) = a` for all `a ≤ 254` → correct when only one side is overflow
- `max(a, 255) = 255` → correct sentinel when either side is overflow
- Only the **both-overflow** case requires reading actual values from the overflow store.

In practice, k (overflow count) ≪ n (total slots). On observed genomic data, ~0.07% of kmer slots are in overflow.
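The two-tier encoding can be illustrated with a minimal in-memory sketch. `SketchIntVec` is a hypothetical stand-in written for this example, not the crate's builder: it uses a plain `Vec<u8>` instead of an mmap, and assumes that writing a small value over an overflowed slot removes the stale overflow entry.

```rust
use std::collections::HashMap;

/// Illustrative two-tier counter vector (hypothetical, not the crate's type).
pub struct SketchIntVec {
    primary: Vec<u8>,              // one byte per slot; 255 marks overflow
    overflow: HashMap<usize, u32>, // slot -> value, only for values >= 255
}

impl SketchIntVec {
    pub fn new(n: usize) -> Self {
        Self { primary: vec![0; n], overflow: HashMap::new() }
    }

    /// Values 0..=254 go to the primary byte; larger ones go to overflow.
    pub fn set(&mut self, slot: usize, value: u32) {
        if value < 255 {
            self.primary[slot] = value as u8;
            self.overflow.remove(&slot); // assumption: drop any stale entry
        } else {
            self.primary[slot] = 255; // sentinel
            self.overflow.insert(slot, value);
        }
    }

    /// Reads the primary byte, falling back to overflow on the sentinel.
    pub fn get(&self, slot: usize) -> u32 {
        match self.primary[slot] {
            255 => self.overflow[&slot],
            b => u32::from(b),
        }
    }
}
```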
---
## View types
The previous trait hierarchy (`BitSlice`, `BitSliceMut`, `IntSlice`, `IntSliceMut`) has been replaced by two concrete zero-copy view structs with inherent methods. Views are **`Copy`**, so passing them is free. All read operations live on these two types.
### `BitSliceView<'a>`
```rust
#[derive(Clone, Copy)]
pub struct BitSliceView<'a> {
    pub(crate) words: &'a [u64],
    pub(crate) n: usize,
}
```
Bit `i` is at `words[i >> 6]` bit `i & 63` (LSB-first). Padding bits in the last word are zero.

| Method | Cost |
|---|---|
| `len()`, `is_empty()` | O(1) |
| `get(slot)` | O(1) |
| `count_ones()` | POPCNT per word, O(n/64) |
| `count_zeros()` | `n − count_ones()`, O(n/64) |
| `iter() -> BitSliceIter<'a>` | O(1) setup, O(n) iteration |
| `partial_jaccard_dist(other: BitSliceView)` | `(a&b).popcount`, `(a\|b).popcount` per word, O(n/64) |
| `jaccard_dist(other: BitSliceView)` | from partial, O(n/64) |
| `hamming_dist(other: BitSliceView)` | `(a^b).popcount` per word, O(n/64) |

`BitSliceIter<'a>`: word-level scan that loads one word per 64 iterations.
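The bit layout and the word-level distance partials can be sketched on plain `&[u64]` slices. The free functions below are hypothetical helpers for illustration only; the crate exposes these operations as methods on `BitSliceView`, whose exact signatures are not reproduced here.

```rust
/// Bit `i` lives in word `i >> 6`, at bit position `i & 63` (LSB-first).
pub fn get_bit(words: &[u64], i: usize) -> bool {
    (words[i >> 6] >> (i & 63)) & 1 == 1
}

/// Returns (|a AND b|, |a OR b|): the two additive partials behind Jaccard.
pub fn partial_jaccard(a: &[u64], b: &[u64]) -> (u64, u64) {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).fold((0u64, 0u64), |(and_count, or_count), (&x, &y)| {
        (
            and_count + u64::from((x & y).count_ones()),
            or_count + u64::from((x | y).count_ones()),
        )
    })
}

/// Hamming distance: popcount of the XOR, one word at a time.
pub fn hamming(a: &[u64], b: &[u64]) -> u64 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(&x, &y)| u64::from((x ^ y).count_ones())).sum()
}
```

Because padding bits are zero on both sides, they contribute nothing to any of the three popcounts, so no masking of the last word is needed in these read-only scans.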
### `IntSliceView<'a>`
```rust
#[derive(Clone, Copy)]
pub struct IntSliceView<'a> {
    pub(crate) primary: &'a [u8],
    pub(crate) overflow_raw: &'a [u8], // sorted [(slot:u64, value:u32)] entries
    pub(crate) n_overflow: usize,
    pub(crate) n: usize,
}
```
`overflow_raw` contains `n_overflow` entries of `OVERFLOW_ENTRY_SIZE` bytes each, sorted by slot. The sort invariant is established at `close()`/`freeze()` time.

| Method | Cost |
|---|---|
| `len()`, `is_empty()` | O(1) |
| `primary_bytes()` | O(1) |
| `overflow_entries() -> impl Iterator<(usize,u32)>` | O(n_overflow) iteration |
| `get(slot)` | O(1) primary; binary search O(log k) for overflow slots |
| `iter() -> IntSliceViewIter<'a>` | merge scan, O(n + k) |
| `sum()` | byte scan + overflow, O(n + k) |
| `count_nonzero()` | byte scan, O(n) |
| Distance methods (`bray_dist`, `euclidean_dist`, `jaccard_dist`, …) | O(n + k) |

`IntSliceViewIter<'a>`: merge scan using an `overflow_pos` index. It requires sorted overflow, which the construction lifecycle guarantees.

**Builder `view()` vs reader `view()`:** `PersistentCompactIntVecBuilder` stores overflow as an unsorted `HashMap`, not raw bytes. Its `view()` returns an `IntSliceView` with `overflow_raw = &[]` and `n_overflow = 0`, so the view covers the primary bytes only. This is intentional: the view is primarily useful after `freeze()`. During building, callers that need overflow use `overflow_entries()` directly.
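A frozen view's `get(slot)` (primary byte first, then binary search over the sorted overflow entries) can be sketched as below. The entry layout is an assumption made for this example: 12 bytes per entry, an 8-byte little-endian slot followed by a 4-byte little-endian value. The real `OVERFLOW_ENTRY_SIZE` and byte order are defined in `format.rs` and may differ; `ENTRY_SIZE`, `entry` and `lookup` are hypothetical names.

```rust
use std::cmp::Ordering;

const ENTRY_SIZE: usize = 12; // assumed: u64 slot + u32 value, little-endian

/// Decodes the i-th (slot, value) entry from the raw overflow bytes.
fn entry(raw: &[u8], i: usize) -> (u64, u32) {
    let e = &raw[i * ENTRY_SIZE..(i + 1) * ENTRY_SIZE];
    let mut s = [0u8; 8];
    let mut v = [0u8; 4];
    s.copy_from_slice(&e[..8]);
    v.copy_from_slice(&e[8..]);
    (u64::from_le_bytes(s), u32::from_le_bytes(v))
}

/// O(1) for primary values, O(log k) binary search for sentinel slots.
pub fn lookup(primary: &[u8], overflow_raw: &[u8], slot: usize) -> u32 {
    let b = primary[slot];
    if b < 255 {
        return u32::from(b);
    }
    let (mut lo, mut hi) = (0, overflow_raw.len() / ENTRY_SIZE);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let (s, v) = entry(overflow_raw, mid);
        match s.cmp(&(slot as u64)) {
            Ordering::Equal => return v,
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
        }
    }
    panic!("sentinel byte without a matching overflow entry");
}
```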
---
## Concrete types
```mermaid
classDiagram
    class BitSliceView {
        +words: &[u64]
        +n: usize
        +get(slot) bool
        +count_ones() u64
        +iter() BitSliceIter
        +jaccard_dist/hamming_dist(other: BitSliceView)
    }
    class IntSliceView {
        +primary: &[u8]
        +overflow_raw: &[u8]
        +n_overflow: usize
        +n: usize
        +get(slot) u32
        +iter() IntSliceViewIter
        +overflow_entries() Iterator
        +bray_dist/euclidean_dist/…(other: IntSliceView)
    }
    class PersistentBitVec {
        -mmap: Mmap
        -n: usize
        +view() BitSliceView
        +get(slot) bool
        +count_ones/zeros() u64
        +iter() BitIter
        +partial_jaccard_dist(&Self) (u64,u64)
        +jaccard_dist/hamming_dist(&Self) …
    }
    class PersistentBitVecBuilder {
        -mmap: MmapMut
        -n: usize
        +view() BitSliceView
        +set(slot, bool)
        +or/and/xor/not(BitSliceView)
        +copy_from(BitSliceView)
        +close() / finish() → PersistentBitVec
    }
    class PersistentCompactIntVec {
        -mmap: Mmap
        -n: usize
        -n_overflow: usize
        -step: usize
        -index: Vec~(usize,usize)~
        +view() IntSliceView
        +get(slot) u32
        +iter() Iter
        +sum/count_nonzero() u64
        +bray_dist/euclidean_dist/… (&Self)
    }
    class PersistentCompactIntVecBuilder {
        -mmap: MmapMut
        -n: usize
        -overflow: HashMap~usize,u32~
        +view() IntSliceView
        +set(slot, u32) / get(slot) u32
        +inc / inc_present / inc_present_fast
        +inc_predicate / inc_predicate_fast
        +add/min/max/diff/mask_with(…View)
        +primary_bytes/primary_bytes_mut()
        +close() / finish() → PersistentCompactIntVec
    }
    PersistentBitVec --> BitSliceView : view()
    PersistentBitVecBuilder --> BitSliceView : view()
    PersistentCompactIntVec --> IntSliceView : view()
    PersistentCompactIntVecBuilder --> IntSliceView : view() (primary only)
    PersistentBitVecBuilder --> PersistentBitVec : close() then open()
    PersistentCompactIntVecBuilder --> PersistentCompactIntVec : close() then open()
```
### `PersistentBitVec` / `PersistentBitVecBuilder`
`PersistentBitVec` is the read-only type. `view()` returns a `BitSliceView<'_>` over the mmap word array. Direct inherent methods delegate to the view: `count_ones()`, `count_zeros()`, `partial_jaccard_dist(&Self)`, `jaccard_dist(&Self)`, `hamming_dist(&Self)`.

`BitIter<'a>` is the exported iterator for `PersistentBitVec::iter()`:

```rust
pub struct BitIter<'a> {
    pub(crate) words: &'a [u64],
    pub(crate) slot: usize,
    pub(crate) n: usize,
}
```

`PersistentBitVecBuilder` is the read-write type. Mutation operations accept `BitSliceView<'_>`:

| Method | Cost |
|---|---|
| `set(slot, bool)` | O(1) |
| `view() -> BitSliceView<'_>` | O(1) |
| `or/and/xor(BitSliceView)` | word-level, O(n/64), SIMD-friendly |
| `not()` | `w ^= u64::MAX` per word, re-masks last word, O(n/64) |
| `copy_from(BitSliceView)` | `copy_from_slice`, O(n/64) |
### `PersistentCompactIntVec` / `PersistentCompactIntVecBuilder`
`PersistentCompactIntVec` is the read-only type. `view()` returns an `IntSliceView<'_>` over the mmap primary and overflow arrays. Inherent `iter()` is a merge scan (`Iter` struct). Inherent `sum()` and `count_nonzero()` use fast byte-scan helpers.

`PersistentCompactIntVecBuilder` is the read-write type. Mutation methods on the builder fall into two categories:

**Point mutations:**

| Method | Note |
|---|---|
| `set(slot, u32)` | writes primary[slot] or 255+overflow |
| `get(slot) -> u32` | reads primary byte or HashMap |
| `inc(slot)` | `get` + `set`, O(1) |

**Bulk computation methods** (accept view arguments):

| Method | Semantics | Overflow |
|---|---|---|
| `inc_present(BitSliceView)` | `+= 1` at each 1-bit | via `inc`, safe for any group size |
| `inc_present_fast(BitSliceView)` | same, raw u8 `+= 1` | `debug_assert` no 255 reached |
| `inc_predicate(IntSliceView, pred)` | `+= 1` where `pred(col[s])` | two-pass, safe |
| `inc_predicate_fast(IntSliceView, pred)` | same, raw u8 | `debug_assert` no 255 reached |
| `add(IntSliceView)` | `self[s] += other[s]` | primary fast path + overflow fallback |
| `min(IntSliceView)` | byte min + both-overflow fixup | see algorithm below |
| `max(IntSliceView)` | pre-pass + byte max | see algorithm below |
| `diff(IntSliceView)` | saturating sub | self<255 hot path |
| `mask_with(BitSliceView)` | zeros slots where mask bit = 0 | O(n/64 + n_zeros) |

**`inc_present_fast` / `inc_predicate_fast` invariant:** the caller guarantees that no counter reaches 255 during the operation (group size < 255 for `inc_present_fast`, or chunk size < 255 for `inc_predicate_fast`). A violation is caught by `debug_assert` in dev builds; release builds skip the check.

**`min` algorithm:**

Exploits 255 = +∞: byte-level min is correct unless both sides are overflow.
```
snapshot self_ov:  Vec<(slot,val)>
snapshot other_ov: HashMap<slot,val>
clear_overflow()
Pass 1: byte min, SIMD-vectorizable, O(n)
Pass 2: both-overflow fixup, O(k_self):
    for (slot, self_val) in self_ov:
        if slot ∈ other_ov: set(slot, min(self_val, other_ov[slot]))
```
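A runnable sketch of the two passes, operating on a primary byte slice plus a `HashMap` overflow store on each side. `min_in_place` and its parameter layout are hypothetical and chosen for illustration; the crate's `min` takes an `IntSliceView` and works on the mmap.

```rust
use std::collections::HashMap;

/// In-place `a = min(a, b)` on the two-tier representation (illustrative).
pub fn min_in_place(
    a_primary: &mut [u8],
    a_overflow: &mut HashMap<usize, u32>,
    b_primary: &[u8],
    b_overflow: &HashMap<usize, u32>,
) {
    // Snapshot a's overflow and clear it: after the byte pass, a slot can
    // still be in overflow only if both inputs were.
    let a_ov: Vec<(usize, u32)> = a_overflow.drain().collect();

    // Pass 1: byte-wise min. 255 behaves as +infinity, so the result is
    // already correct wherever at most one side overflowed.
    for (x, &y) in a_primary.iter_mut().zip(b_primary) {
        *x = (*x).min(y);
    }

    // Pass 2: both-overflow fixup. Both values are >= 255, so the minimum
    // stays in overflow and the primary byte is already 255.
    for (slot, a_val) in a_ov {
        if let Some(&b_val) = b_overflow.get(&slot) {
            a_overflow.insert(slot, a_val.min(b_val));
        }
    }
}
```

Pass 2 touches only `k_self` entries, which is why the whole operation stays close to a plain byte scan when overflow is rare.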
**`max` algorithm:**

Byte max cannot run first: `max(255, b<255)=255` overwrites self's original overflow value. The pre-pass reads self's value at other's overflow slots before the byte pass.
```
Pre-pass O(k_other):
    for (slot, other_val) in other.overflow_entries():
        set(slot, max(self.get(slot), other_val))
Pass 1: byte max, SIMD-vectorizable, O(n)
```
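The pre-pass followed by the byte pass can be sketched in the same style. `max_in_place` is a hypothetical helper over a byte slice and `HashMap` overflow stores, not the crate's method.

```rust
use std::collections::HashMap;

/// In-place `a = max(a, b)` on the two-tier representation (illustrative).
pub fn max_in_place(
    a_primary: &mut [u8],
    a_overflow: &mut HashMap<usize, u32>,
    b_primary: &[u8],
    b_overflow: &HashMap<usize, u32>,
) {
    // Pre-pass: at each of b's overflow slots, read a's current value while
    // it is still intact, then store the larger of the two in overflow.
    for (&slot, &b_val) in b_overflow {
        let a_val = match a_primary[slot] {
            255 => a_overflow[&slot],
            x => u32::from(x),
        };
        a_primary[slot] = 255;
        a_overflow.insert(slot, a_val.max(b_val));
    }

    // Pass 1: byte-wise max. Slots handled above hold 255 on both sides;
    // slots where only `a` overflowed keep their sentinel and their entry.
    for (x, &y) in a_primary.iter_mut().zip(b_primary) {
        *x = (*x).max(y);
    }
}
```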
---
## Matrix types
Two encodings × two formats give four matrix layouts:

| | Columnar format | Packed format |
|---|---|---|
| **Bit** | `PersistentBitMatrix` (Columnar variant) | `PersistentBitMatrix` (Packed variant) |
| **Int** | `PersistentCompactIntMatrix` (Columnar variant) | `PersistentCompactIntMatrix` (Packed variant) |

Both matrix types are enums (`Columnar` / `Packed`, plus `Implicit` for the bit matrix) behind a transparent API. `col_view(c)` returns the appropriate view directly:

```rust
// PersistentBitMatrix
pub fn col_view(&self, c: usize) -> BitSliceView<'_>

// PersistentCompactIntMatrix
pub fn col_view(&self, c: usize) -> IntSliceView<'_>
```

No wrapper enums (`BitColView`, `IntColView`): the caller receives a `Copy` view struct immediately usable with any view method or bulk builder method.

`pack_compact_int_matrix` and `pack_bit_matrix` convert columnar → packed format.
---
## Aggregation traits (matrix level)
### ColumnWeights
```rust
trait ColumnWeights: Send + Sync {
    fn col_weights(&self) -> Array1<u64>;         // sum per column
    fn partial_kmer_counts(&self) -> Array1<u64>; // default = col_weights()
}
```
`partial_kmer_counts` is overridden for count matrices to return `count_nonzero` per column (distinct kmers) rather than total count.
### CountPartials
Required methods: `partial_bray`, `partial_euclidean`, `partial_threshold_jaccard`, `partial_relfreq_bray`, `partial_relfreq_euclidean`, `partial_hellinger`.

**Additivity rule:** self-contained partials (`partial_bray`, `partial_euclidean`, `partial_threshold_jaccard`) can be element-wise summed across all `(partition, layer)` pairs. Normalised partials (`partial_relfreq_*`, `partial_hellinger`) require the **global** `col_weights` (accumulated across all layers and all partitions) as a parameter.

**`partial_threshold_jaccard` returns `(inter, union)`** because `union[i,j]` depends on both columns simultaneously.

Provided finalisations:

| Finalisation | Formula |
|---|---|
| `bray_dist_matrix()` | `1 − 2·partial_bray[i,j] / (w[i] + w[j])` |
| `euclidean_dist_matrix()` | `√partial_euclidean[i,j]` |
| `threshold_jaccard_dist_matrix(t)` | `1 − inter[i,j] / union[i,j]` |
| `relfreq_bray_dist_matrix()` | `1 − partial_relfreq_bray[i,j]` |
| `relfreq_euclidean_dist_matrix()` | `√partial_relfreq_euclidean[i,j]` |
| `hellinger_dist_matrix()` | `√partial_hellinger[i,j] / √2` |
| `hellinger_euclidean_dist_matrix()` | `√partial_hellinger[i,j]` |
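The additivity rule and the Bray finalisation can be illustrated with plain nested `Vec`s instead of `Array2`. This sketch assumes that `partial_bray[i,j]` is the sum over slots of `min(x_i, x_j)`, which is what the finalisation formula implies for a Bray-Curtis distance; the source does not spell this out. Both function bodies are hypothetical, and returning 0 for two empty columns is a choice made only for this example.

```rust
/// Sum of per-slot minima for every column pair (assumed meaning of the partial).
pub fn partial_bray(cols: &[Vec<u64>]) -> Vec<Vec<u64>> {
    let n = cols.len();
    let mut p = vec![vec![0u64; n]; n];
    for i in 0..n {
        for j in 0..n {
            p[i][j] = cols[i].iter().zip(&cols[j]).map(|(&a, &b)| a.min(b)).sum();
        }
    }
    p
}

/// Finalisation: 1 - 2 * p[i][j] / (w[i] + w[j]).
pub fn bray_dist_matrix(p: &[Vec<u64>], w: &[u64]) -> Vec<Vec<f64>> {
    let n = w.len();
    let mut d = vec![vec![0.0f64; n]; n];
    for i in 0..n {
        for j in 0..n {
            let denom = (w[i] + w[j]) as f64;
            // Two empty columns: distance 0 is an assumption of this sketch.
            d[i][j] = if denom == 0.0 {
                0.0
            } else {
                1.0 - 2.0 * p[i][j] as f64 / denom
            };
        }
    }
    d
}
```

Summing the partial matrices of two disjoint row partitions element-wise gives the same matrix as computing the partial over the concatenated rows, so the finalisation only needs to run once, on the accumulated sums and the global column weights.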
### BitPartials
Required: `partial_jaccard() -> (Array2<u64>, Array2<u64>)`, `partial_hamming() -> Array2<u64>`. Both are additive across layers and partitions.
---
## Temp-file-backed types
**All inter-function results use temp-file-backed types** so the OS can page them out under memory pressure. This matters in practice: processing dozens of layers × hundreds of partitions in parallel would otherwise accumulate gigabytes of live anonymous memory.
### Lifecycle
```
TempCompactIntVecBuilder::new(n)   → writable mmap in TempDir
    ↓  (inc_present_fast / inc_predicate_fast / add / mask_with / …)
.freeze()                          → TempCompactIntVec (read-only mmap + TempDir)
    ↓  (optional)
.make_persistent(path)             → PersistentCompactIntVec (permanent file)
```
Same pattern for `TempBitVecBuilder` → `TempBitVec` → `PersistentBitVec`.

**Drop order**: in `TempCompactIntVec { vec: PersistentCompactIntVec, _temp: TempDir }`, Rust drops fields in declaration order, so `vec` (the mmap) is released before `_temp` (which deletes the directory). No explicit `drop()` is needed.
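The declaration-order guarantee is easy to check in isolation. In the sketch below, `Noisy` and `TempLike` are hypothetical stand-ins for the mmap-backed vector and the `TempDir`; each records its name in a shared log when dropped.

```rust
use std::cell::RefCell;
use std::rc::Rc;

type Log = Rc<RefCell<Vec<&'static str>>>;

/// Records its name in the shared log when dropped.
struct Noisy {
    name: &'static str,
    log: Log,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Stand-in for `TempCompactIntVec`: `vec` is declared first, so it drops first.
#[allow(dead_code)]
struct TempLike {
    vec: Noisy,
    _temp: Noisy,
}

/// Builds a `TempLike`, drops it, and returns the observed drop order.
pub fn drop_order() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let t = TempLike {
        vec: Noisy { name: "vec", log: Rc::clone(&log) },
        _temp: Noisy { name: "_temp", log: Rc::clone(&log) },
    };
    drop(t);
    let order = log.borrow().clone();
    order
}
```

Swapping the two field declarations in `TempLike` would reverse the recorded order, which is why the field order in the real struct matters.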
### TempCompactIntVec / TempCompactIntVecBuilder
```rust
pub struct TempCompactIntVec {
    vec: PersistentCompactIntVec,
    _temp: TempDir, // dropped after vec
}

pub(crate) struct TempCompactIntVecBuilder {
    builder: PersistentCompactIntVecBuilder,
    temp: TempDir,
}
```
`TempCompactIntVec`: read access via `get(slot)`, `sum()`, `iter()`, `view() -> IntSliceView<'_>`.

`TempCompactIntVecBuilder`: full delegation to the inner `PersistentCompactIntVecBuilder`. All bulk computation methods (`inc_present_fast`, `inc_predicate_fast`, `add`, `min`, `max`, `diff`, `mask_with`) are exposed as `pub(crate)`.
### TempBitVec / TempBitVecBuilder
```rust
pub struct TempBitVec {
    vec: PersistentBitVec,
    _temp: TempDir,
}

pub(crate) struct TempBitVecBuilder {
    builder: PersistentBitVecBuilder,
    temp: TempDir,
}
```
`TempBitVec`: read access via `get(slot)`, `count_ones()`, `view() -> BitSliceView<'_>`, `iter()`.

`TempBitVecBuilder`: exposes `set(slot, bool)`, `or(BitSliceView)`, and:

```rust
pub(crate) fn or_where(&mut self, col: IntSliceView<'_>, pred: impl Fn(u32) -> bool)
```

`or_where` runs in two passes, with no intermediate allocation:

```
Pass 1: primary bytes, O(n):
    for slot in 0..n:
        b = col.primary_bytes()[slot]
        if b < 255 AND pred(b as u32): self.set(slot, true)
Pass 2: overflow, O(k):
    for (slot, val) in col.overflow_entries():
        if pred(val): self.set(slot, true)
```
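The two passes translate directly to a free function over raw slices. This is a hypothetical sketch: the mask is a bare `&mut [u64]` in the LSB-first layout, and the overflow entries are passed as an already decoded `(slot, value)` slice instead of coming from an `IntSliceView`.

```rust
/// Sets mask bit `slot` wherever `pred` holds for the column value at `slot`.
pub fn or_where(
    mask: &mut [u64],
    primary: &[u8],
    overflow: &[(usize, u32)],
    pred: impl Fn(u32) -> bool,
) {
    // Pass 1: primary bytes. Sentinel bytes (255) are skipped here because
    // their real value lives in the overflow entries.
    for (slot, &b) in primary.iter().enumerate() {
        if b < 255 && pred(u32::from(b)) {
            mask[slot >> 6] |= 1u64 << (slot & 63);
        }
    }
    // Pass 2: overflow entries carry the actual values >= 255.
    for &(slot, val) in overflow {
        if pred(val) {
            mask[slot >> 6] |= 1u64 << (slot & 63);
        }
    }
}
```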
---
## Filter / Select API
### ColGroup
```rust
pub struct ColGroup {
    pub name: String,
    pub indices: Vec<usize>,
}
```

Defined **once at the index level** from column metadata. Valid in all matrices of all layers and partitions: column structure is identical across the entire hierarchy; only rows (kmer slots) are partitioned.
### Composition axis
- **Across partitions**: kmer space is partitioned → partial results **concatenated** (disjoint kmer ranges).
- **Across layers**: same kmer space, different counts → partial results **aggregated** (add, OR, etc.).
### MatrixGroupOps
Group operations expose only **additive intermediates** backed by temp files. Final predicates are applied at the index level after accumulation.
```rust
pub trait MatrixGroupOps {
    fn partial_group_presence_count(&self, g: &ColGroup, threshold: u32)
        -> io::Result<TempCompactIntVec>;

    fn partial_group_sum(&self, g: &ColGroup)
        -> io::Result<TempCompactIntVec>;

    fn partial_group_any(&self, g: &ColGroup, threshold: u32)
        -> io::Result<TempBitVec>;
}
```
Implemented for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`. For bit matrices, `partial_group_sum` delegates to `partial_group_presence_count(g, 1)`.

**`partial_group_presence_count`: chunking for large groups**

When `g.indices.len() < 255`: per-slot counts stay within `u8` range. Use `inc_present_fast` (bit matrix) or `inc_predicate_fast(col_view(c), |v| v >= threshold)` (int matrix), a raw u8 increment with no overflow map written.

When `g.indices.len() ≥ 255`: process in chunks of 254 columns (each chunk stays within u8 range) and accumulate into a running builder via `.add(chunk_frozen.view())`.
```
fast path (< 255 cols):
    builder = TempCompactIntVecBuilder::new(n)
    for c in group:
        builder.inc_predicate_fast(matrix.col_view(c), |v| v >= threshold)
    builder.freeze()

slow path (≥ 255 cols):
    result = TempCompactIntVecBuilder::new(n)
    for chunk in group.chunks(254):
        chunk_b = TempCompactIntVecBuilder::new(n)
        for c in chunk:
            chunk_b.inc_predicate_fast(matrix.col_view(c), |v| v >= threshold)
        frozen = chunk_b.freeze()
        result.add(frozen.view())
    result.freeze()
```
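The chunking argument can be checked with an in-memory sketch. `presence_count` below is a hypothetical function over plain vectors: columns are `Vec<u32>` counts, the per-chunk counters are raw `u8`, and the running total is a `Vec<u32>` standing in for the accumulating builder.

```rust
/// Counts, per slot, how many group columns have a value >= threshold.
pub fn presence_count(
    cols: &[Vec<u32>],
    group: &[usize],
    threshold: u32,
    n: usize,
) -> Vec<u32> {
    let mut total = vec![0u32; n];
    for chunk in group.chunks(254) {
        // At most 254 increments per slot, so a u8 counter never reaches 255.
        let mut counts = vec![0u8; n];
        for &c in chunk {
            for (cnt, &v) in counts.iter_mut().zip(&cols[c]) {
                if v >= threshold {
                    *cnt += 1;
                }
            }
        }
        // Fold the chunk into the running total (the `.add(...)` step).
        for (t, &c) in total.iter_mut().zip(&counts) {
            *t += u32::from(c);
        }
    }
    total
}
```

With 600 columns the loop runs three chunks (254, 254 and 92 columns), and no intermediate `u8` counter exceeds 254.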
**`partial_group_any`** uses `or_where` on `TempBitVecBuilder`:

```
result = TempBitVecBuilder::new(n)
for c in group:
    result.or_where(matrix.col_view(c), |v| v >= threshold)
result.freeze()
```
**Non-additive predicates** (`group_all`, `group_at_least(k)`) are composed at the index level:

```rust
// "present in >= 2 ingroup columns with count >= 3, absent from all outgroup"
let presence = layers.map(|l| l.partial_group_presence_count(&ingroup, 3)?).add_all()?;
let in_mask = presence.view().geq(2); // IntSliceView method

let out_sum = layers.map(|l| l.partial_group_sum(&outgroup)?).add_all()?;
let out_mask = out_sum.view().leq(0);

let mut mask_b = TempBitVecBuilder::new(n)?;
mask_b.copy_from(in_mask);
mask_b.and(out_mask);
```
### mask_with

Direct method on `PersistentCompactIntVecBuilder` (and delegation via `TempCompactIntVecBuilder`). Zeros every slot where the corresponding mask bit is 0. Only zero bits are visited: the cost is O(n/64 + n_zeros), and an all-ones word is skipped with a single comparison.
```
for (w_idx, word) in mask.words():
    if word == u64::MAX: continue       // skip all-ones words
    zeros = !word
    while zeros != 0:
        bit = trailing_zeros(zeros)
        s = w_idx * 64 + bit
        if primary[s] != 0: set(s, 0)   // clears overflow entry too
        zeros &= zeros - 1
```
Terminal operation for Filter (retain only selected kmer slots in a count vector) and Select (positional selection without MPHF).
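A minimal sketch of the zero-bit scan, assuming a plain `&mut [u32]` value vector in place of the two-tier builder. `mask_with` here is a hypothetical free function; the bounds check on `s` is an addition of this sketch, needed because the zero padding bits of the last mask word would otherwise index past the end.

```rust
/// Zeros every value whose mask bit is 0 (LSB-first words, zero padding).
pub fn mask_with(values: &mut [u32], mask_words: &[u64]) {
    let n = values.len();
    for (w_idx, &word) in mask_words.iter().enumerate() {
        if word == u64::MAX {
            continue; // nothing to clear in an all-ones word
        }
        let mut zeros = !word;
        while zeros != 0 {
            let s = w_idx * 64 + zeros.trailing_zeros() as usize;
            if s >= n {
                break; // remaining zero bits are padding beyond the last slot
            }
            values[s] = 0;
            zeros &= zeros - 1; // clear the lowest set bit
        }
    }
}
```

Usage: a mask of `0b10101` over five values keeps slots 0, 2 and 4 and zeros slots 1 and 3.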