# obicompactvec: Complete Reference

## Module structure

```
src/obicompactvec/src/
  lib.rs          public re-exports
  views.rs        BitSliceView<'a>, IntSliceView<'a> (zero-copy read views)
  traits.rs       ColumnWeights, CountPartials, BitPartials (matrix aggregation)
  bitvec.rs       PersistentBitVec, PersistentBitVecBuilder, BitIter
  reader.rs       PersistentCompactIntVec (read-only)
  builder.rs      PersistentCompactIntVecBuilder (read-write)
  tempintvec.rs   TempCompactIntVec, TempCompactIntVecBuilder (temp-file-backed)
  tempbitvec.rs   TempBitVec, TempBitVecBuilder (temp-file-backed)
  bitmatrix.rs    PersistentBitMatrix, PersistentBitMatrixBuilder
  intmatrix.rs    PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder
  colgroup.rs     ColGroup, MatrixGroupOps trait
  format.rs       file format constants, encode/decode helpers
  layer_meta.rs   LayerMeta (column metadata)
  meta.rs         matrix metadata
```

```mermaid
graph TD
    views --> bitvec
    views --> builder
    views --> tempbitvec
    views --> tempintvec
    views --> bitmatrix
    views --> intmatrix
    format --> reader
    format --> builder
    reader --> intmatrix
    reader --> tempintvec
    builder --> intmatrix
    builder --> tempintvec
    bitvec --> tempbitvec
    bitvec --> bitmatrix
    tempintvec --> intmatrix
    tempintvec --> bitmatrix
    tempbitvec --> intmatrix
    tempbitvec --> bitmatrix
    colgroup --> intmatrix
    colgroup --> bitmatrix
    layer_meta --> bitmatrix
    layer_meta --> intmatrix
    meta --> bitmatrix
    meta --> intmatrix
```

---

## Compact int encoding

All integer vectors use the same two-tier encoding regardless of storage backend.

**Primary array** (one `u8` per slot):

- Values **0–254** are stored directly. No overhead.
- Value **255 is a sentinel**: the slot's actual value is ≥ 255 and lives in the overflow store.

**Overflow store** (maps slot index to a `u32` value ≥ 255):

- In `PersistentCompactIntVecBuilder`: a `HashMap<usize, u32>` in RAM.
- In `PersistentCompactIntVec` (reader): a sorted `[(slot: u64, value: u32)]` array in the mmap, with a sparse L1-resident index for binary search.

```mermaid
flowchart LR
    slot --> P["primary[slot]: u8"]
    P -->|"< 255"| V["value = byte (0–254)"]
    P -->|"= 255 sentinel"| OV["overflow store"]
    OV -->|"Builder"| HM["HashMap<usize, u32>\nin RAM"]
    OV -->|"PersistentCompactIntVec"| SA["sorted [(slot,value)] in mmap\n+ sparse L1 index"]
```
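
The two-tier scheme can be sketched with a plain in-memory stand-in. `CompactInts` and its fields are hypothetical names chosen for illustration; the crate's builder keeps its primary array in an mmap rather than a `Vec<u8>`, so this is only a model of the encoding, not the actual implementation.

```rust
use std::collections::HashMap;

/// Sentinel byte: the real value is >= 255 and lives in the overflow map.
const SENTINEL: u8 = 255;

/// Hypothetical in-memory model of the builder's two-tier layout.
struct CompactInts {
    primary: Vec<u8>,
    overflow: HashMap<usize, u32>,
}

impl CompactInts {
    fn new(n: usize) -> Self {
        Self { primary: vec![0; n], overflow: HashMap::new() }
    }

    /// Values 0..=254 go straight into the byte; larger ones go to the map.
    fn set(&mut self, slot: usize, value: u32) {
        if value < SENTINEL as u32 {
            self.primary[slot] = value as u8;
            self.overflow.remove(&slot); // drop any stale overflow entry
        } else {
            self.primary[slot] = SENTINEL;
            self.overflow.insert(slot, value);
        }
    }

    /// One byte read on the hot path; a map lookup only for sentinel slots.
    fn get(&self, slot: usize) -> u32 {
        match self.primary[slot] {
            SENTINEL => self.overflow[&slot],
            b => b as u32,
        }
    }
}
```
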

**Key property (sentinel 255 acts as +∞ on `u8`):**

- `min(a, 255) = a` for all `a ≤ 254` → correct when only one side is overflow
- `max(a, 255) = 255` → correct sentinel when either side is overflow
- Only the **both-overflow** case requires reading actual values from the overflow store.

In practice, k (overflow count) ≪ n (total slots). Observed genomic data: ~0.07% of kmer slots are in overflow.

---

## View types

The previous trait hierarchy (`BitSlice`, `BitSliceMut`, `IntSlice`, `IntSliceMut`) has been replaced by two concrete zero-copy view structs with inherent methods. Views are **`Copy`**, so passing them is free. All read operations live on these two types.

### `BitSliceView<'a>`

```rust
#[derive(Clone, Copy)]
pub struct BitSliceView<'a> { pub(crate) words: &'a [u64], pub(crate) n: usize }
```

Bit `i` is stored in `words[i >> 6]` at bit position `i & 63` (LSB-first). Padding bits in the last word are zero.

| Method | Cost |
|---|---|
| `len()`, `is_empty()` | O(1) |
| `get(slot)` | O(1) |
| `count_ones()` | POPCNT per word, O(n/64) |
| `count_zeros()` | `n − count_ones()`, O(n/64) |
| `iter() -> BitSliceIter<'a>` | O(1) setup, O(n) iteration |
| `partial_jaccard_dist(other: BitSliceView)` | `(a&b).popcount`, `(a\|b).popcount` per word, O(n/64) |
| `jaccard_dist(other: BitSliceView)` | from partial, O(n/64) |
| `hamming_dist(other: BitSliceView)` | `(a^b).popcount` per word, O(n/64) |

`BitSliceIter<'a>`: word-level scan; one word load per 64 iterations.
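
A minimal sketch of the word-level operations in the table above, written against bare `&[u64]` slices rather than the crate's `BitSliceView`. The free-function names are illustrative, and both inputs are assumed to have the same length with zeroed padding bits.

```rust
/// Read bit `i` from an LSB-first packed word array.
fn get_bit(words: &[u64], i: usize) -> bool {
    (words[i >> 6] >> (i & 63)) & 1 == 1
}

/// Returns (intersection, union) popcounts: the two additive Jaccard partials.
fn partial_jaccard(a: &[u64], b: &[u64]) -> (u64, u64) {
    a.iter().zip(b).fold((0, 0), |(inter, union), (&x, &y)| {
        (
            inter + (x & y).count_ones() as u64,
            union + (x | y).count_ones() as u64,
        )
    })
}

/// Number of positions where the two bit vectors differ.
fn hamming(a: &[u64], b: &[u64]) -> u64 {
    a.iter().zip(b).map(|(&x, &y)| (x ^ y).count_ones() as u64).sum()
}
```
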

### `IntSliceView<'a>`

```rust
#[derive(Clone, Copy)]
pub struct IntSliceView<'a> {
    pub(crate) primary:      &'a [u8],
    pub(crate) overflow_raw: &'a [u8],  // sorted [(slot:u64, value:u32)] entries
    pub(crate) n_overflow:   usize,
    pub(crate) n:            usize,
}
```

`overflow_raw` contains `n_overflow` entries of `OVERFLOW_ENTRY_SIZE` bytes each, sorted by slot. The sort invariant is established at `close()`/`freeze()` time.

| Method | Cost |
|---|---|
| `len()`, `is_empty()` | O(1) |
| `primary_bytes()` | O(1) |
| `overflow_entries() -> impl Iterator<Item = (usize, u32)>` | O(n_overflow) iteration |
| `get(slot)` | O(1) primary; binary search O(log k) for overflow slots |
| `iter() -> IntSliceViewIter<'a>` | merge scan, O(n + k) |
| `sum()` | byte scan + overflow, O(n + k) |
| `count_nonzero()` | byte scan, O(n) |
| Distance methods (`bray_dist`, `euclidean_dist`, `jaccard_dist`, …) | O(n + k) |

`IntSliceViewIter<'a>`: merge scan using the `overflow_pos` index. It requires sorted overflow, which the construction lifecycle guarantees.
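
The `get(slot)` lookup can be sketched as follows. The entry layout used here (an 8-byte slot followed by a 4-byte value, both little-endian, 12 bytes per entry) is an assumption made for illustration; the real `OVERFLOW_ENTRY_SIZE` and byte order are defined in `format.rs`, and the sparse index that narrows the search range is omitted.

```rust
use std::convert::TryInto;

/// Assumed entry layout: 8-byte slot + 4-byte value, little-endian.
const ENTRY_SIZE: usize = 12;

/// Decode the `i`-th (slot, value) entry from the raw overflow bytes.
fn entry(raw: &[u8], i: usize) -> (u64, u32) {
    let e = &raw[i * ENTRY_SIZE..(i + 1) * ENTRY_SIZE];
    let slot = u64::from_le_bytes(e[..8].try_into().unwrap());
    let value = u32::from_le_bytes(e[8..].try_into().unwrap());
    (slot, value)
}

/// O(1) for direct bytes, O(log k) binary search for sentinel slots.
fn get(primary: &[u8], raw: &[u8], slot: usize) -> u32 {
    let b = primary[slot];
    if b < 255 {
        return b as u32;
    }
    let target = slot as u64;
    let (mut lo, mut hi) = (0, raw.len() / ENTRY_SIZE);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let (s, v) = entry(raw, mid);
        if s == target {
            return v;
        }
        if s < target {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    panic!("sentinel byte without an overflow entry at slot {}", slot);
}
```
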

**Builder `view()` vs reader `view()`:** `PersistentCompactIntVecBuilder` stores overflow as an unsorted `HashMap`, not raw bytes. Its `view()` returns an `IntSliceView` with `overflow_raw = &[]` and `n_overflow = 0`. This is intentional: the view is primarily useful after `freeze()`. During building, callers that need overflow use `overflow_entries()` directly.

---

## Concrete types

```mermaid
classDiagram
    class BitSliceView {
        +words: &[u64]
        +n: usize
        +get(slot) bool
        +count_ones() u64
        +iter() BitSliceIter
        +jaccard_dist/hamming_dist(other: BitSliceView)
    }
    class IntSliceView {
        +primary: &[u8]
        +overflow_raw: &[u8]
        +n_overflow: usize
        +n: usize
        +get(slot) u32
        +iter() IntSliceViewIter
        +overflow_entries() Iterator
        +bray_dist/euclidean_dist/…(other: IntSliceView)
    }
    class PersistentBitVec {
        -mmap: Mmap
        -n: usize
        +view() BitSliceView
        +get(slot) bool
        +count_ones/zeros() u64
        +iter() BitIter
        +partial_jaccard_dist(&Self) (u64,u64)
        +jaccard_dist/hamming_dist(&Self) …
    }
    class PersistentBitVecBuilder {
        -mmap: MmapMut
        -n: usize
        +view() BitSliceView
        +set(slot, bool)
        +or/and/xor/not(BitSliceView)
        +copy_from(BitSliceView)
        +close() / finish() → PersistentBitVec
    }
    class PersistentCompactIntVec {
        -mmap: Mmap
        -n: usize
        -n_overflow: usize
        -step: usize
        -index: Vec~(usize,usize)~
        +view() IntSliceView
        +get(slot) u32
        +iter() Iter
        +sum/count_nonzero() u64
        +bray_dist/euclidean_dist/… (&Self)
    }
    class PersistentCompactIntVecBuilder {
        -mmap: MmapMut
        -n: usize
        -overflow: HashMap~usize,u32~
        +view() IntSliceView
        +set(slot, u32) / get(slot) u32
        +inc / inc_present / inc_present_fast
        +inc_predicate / inc_predicate_fast
        +add/min/max/diff/mask_with(…View)
        +primary_bytes/primary_bytes_mut()
        +close() / finish() → PersistentCompactIntVec
    }

    PersistentBitVec --> BitSliceView : view()
    PersistentBitVecBuilder --> BitSliceView : view()
    PersistentCompactIntVec --> IntSliceView : view()
    PersistentCompactIntVecBuilder --> IntSliceView : view() (primary only)
    PersistentBitVecBuilder --> PersistentBitVec : close() then open()
    PersistentCompactIntVecBuilder --> PersistentCompactIntVec : close() then open()
```

### `PersistentBitVec` / `PersistentBitVecBuilder`

`PersistentBitVec` is the read-only type. `view()` returns a `BitSliceView<'_>` over the mmap word array. Direct inherent methods delegate to the view: `count_ones()`, `count_zeros()`, `partial_jaccard_dist(&Self)`, `jaccard_dist(&Self)`, `hamming_dist(&Self)`.

`BitIter<'a>` is the exported iterator for `PersistentBitVec::iter()`:

```rust
pub struct BitIter<'a> { pub(crate) words: &'a [u64], pub(crate) slot: usize, pub(crate) n: usize }
```

`PersistentBitVecBuilder` is the read-write type. Mutation operations accept `BitSliceView<'_>`:

| Method | Cost |
|---|---|
| `set(slot, bool)` | O(1) |
| `view() -> BitSliceView<'_>` | O(1) |
| `or/and/xor(BitSliceView)` | word-level, O(n/64), SIMD-friendly |
| `not()` | `w ^= u64::MAX` per word, re-masks last word, O(n/64) |
| `copy_from(BitSliceView)` | `copy_from_slice`, O(n/64) |

### `PersistentCompactIntVec` / `PersistentCompactIntVecBuilder`

`PersistentCompactIntVec` is the read-only type. `view()` returns an `IntSliceView<'_>` over the mmap primary and overflow arrays. Inherent `iter()` is a merge scan (`Iter` struct). Inherent `sum()` and `count_nonzero()` use fast byte-scan helpers.

`PersistentCompactIntVecBuilder` is the read-write type. Mutation methods on the builder fall into two categories:

**Point mutations:**

| Method | Note |
|---|---|
| `set(slot, u32)` | writes primary[slot] or 255+overflow |
| `get(slot) -> u32` | reads primary byte or HashMap |
| `inc(slot)` | `get` + `set`, O(1) |

**Bulk computation methods** (accept view arguments):

| Method | Semantics | Overflow handling / notes |
|---|---|---|
| `inc_present(BitSliceView)` | `+= 1` at each 1-bit | via `inc`, safe for any group size |
| `inc_present_fast(BitSliceView)` | same, raw u8 `+= 1` | `debug_assert` no 255 reached |
| `inc_predicate(IntSliceView, pred)` | `+= 1` where `pred(col[s])` | two-pass, safe |
| `inc_predicate_fast(IntSliceView, pred)` | same, raw u8 | `debug_assert` no 255 reached |
| `add(IntSliceView)` | `self[s] += other[s]` | primary fast path + overflow fallback |
| `min(IntSliceView)` | byte min + both-overflow fixup | see algorithm below |
| `max(IntSliceView)` | pre-pass + byte max | see algorithm below |
| `diff(IntSliceView)` | saturating sub | self<255 hot path |
| `mask_with(BitSliceView)` | zeros slots where mask bit = 0 | O(n/64 + n_zeros); `set(s, 0)` also clears the overflow entry |

**`inc_present_fast` / `inc_predicate_fast` invariant:** caller guarantees no counter reaches 255 during the operation (group size < 255 for `inc_present_fast`, or chunk size < 255 for `inc_predicate_fast`). Violation is caught by `debug_assert` in dev builds.

**`min` algorithm:**

Exploits 255 = +∞: byte-level min is correct unless both sides are overflow.

```
snapshot self_ov:  Vec<(slot,val)>
snapshot other_ov: HashMap<slot,val>
clear_overflow()
Pass 1: byte min, SIMD-vectorizable, O(n)
Pass 2: both-overflow fixup, O(k_self):
  for (slot, self_val) in self_ov:
    if slot ∈ other_ov: set(slot, min(self_val, other_ov[slot]))
```
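
The steps above can be sketched on an in-memory model. `CompactInts`, `min_with`, and the way `other` is passed (primary bytes plus an overflow map) are hypothetical simplifications; the crate's method takes an `IntSliceView` and works on an mmap.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory model: one byte per slot plus an overflow map.
struct CompactInts {
    primary: Vec<u8>,
    overflow: HashMap<usize, u32>,
}

impl CompactInts {
    fn set(&mut self, slot: usize, value: u32) {
        if value < 255 {
            self.primary[slot] = value as u8;
            self.overflow.remove(&slot);
        } else {
            self.primary[slot] = 255;
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: usize) -> u32 {
        match self.primary[slot] {
            255 => self.overflow[&slot],
            b => b as u32,
        }
    }

    fn min_with(&mut self, other_primary: &[u8], other_overflow: &HashMap<usize, u32>) {
        // Snapshot self's overflow entries and clear the map in one step.
        let self_ov: Vec<(usize, u32)> = self.overflow.drain().collect();
        // Pass 1: byte-wise min. 255 behaves as +inf, so any slot with at
        // least one direct value already holds the correct result.
        for (a, &b) in self.primary.iter_mut().zip(other_primary) {
            *a = (*a).min(b);
        }
        // Pass 2: only slots that overflow on both sides need real values.
        for (slot, self_val) in self_ov {
            if let Some(&other_val) = other_overflow.get(&slot) {
                self.set(slot, self_val.min(other_val));
            }
        }
    }
}
```
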

**`max` algorithm:**

A byte max cannot run first: wherever `other` holds the sentinel, `max(b, 255) = 255` would overwrite self's original byte `b < 255` before the true maximum has been computed. The pre-pass therefore reads self's value at each of other's overflow slots and stores the real maximum before the byte pass.

```
Pre-pass, O(k_other):
  for (slot, other_val) in other.overflow_entries():
    set(slot, max(self.get(slot), other_val))
Pass 1: byte max, SIMD-vectorizable, O(n)
```
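
The pre-pass ordering can be sketched on the same kind of in-memory model. `CompactInts` and `max_with` are hypothetical names, and `other` is passed as primary bytes plus an overflow map instead of an `IntSliceView`.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory model: one byte per slot plus an overflow map.
struct CompactInts {
    primary: Vec<u8>,
    overflow: HashMap<usize, u32>,
}

impl CompactInts {
    fn set(&mut self, slot: usize, value: u32) {
        if value < 255 {
            self.primary[slot] = value as u8;
            self.overflow.remove(&slot);
        } else {
            self.primary[slot] = 255;
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: usize) -> u32 {
        match self.primary[slot] {
            255 => self.overflow[&slot],
            b => b as u32,
        }
    }

    fn max_with(&mut self, other_primary: &[u8], other_overflow: &HashMap<usize, u32>) {
        // Pre-pass: resolve every slot where `other` overflows while self's
        // original value is still readable. The result is always >= 255,
        // so `set` leaves the sentinel plus a correct overflow entry.
        for (&slot, &other_val) in other_overflow {
            let current = self.get(slot);
            self.set(slot, current.max(other_val));
        }
        // Pass 1: byte-wise max. Slots handled above already hold 255;
        // slots where only self overflows keep their 255 and their entry.
        for (a, &b) in self.primary.iter_mut().zip(other_primary) {
            *a = (*a).max(b);
        }
    }
}
```
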

---

## Matrix types

Four matrix variants (two encodings × two formats), exposed through two types:

| | Columnar format | Packed format |
|---|---|---|
| **Bit** | `PersistentBitMatrix` (Columnar variant) | `PersistentBitMatrix` (Packed variant) |
| **Int** | `PersistentCompactIntMatrix` (Columnar variant) | `PersistentCompactIntMatrix` (Packed variant) |

Both matrix types are enums (`Columnar` / `Packed`, plus `Implicit` for bit) behind a transparent API. `col_view(c)` returns the appropriate view directly:

```rust
// PersistentBitMatrix
pub fn col_view(&self, c: usize) -> BitSliceView<'_>

// PersistentCompactIntMatrix
pub fn col_view(&self, c: usize) -> IntSliceView<'_>
```

No wrapper enums (`BitColView`, `IntColView`): the caller receives a `Copy` view struct immediately usable with any view method or bulk builder method.

`pack_compact_int_matrix` and `pack_bit_matrix` convert columnar → packed format.

---

## Aggregation traits (matrix level)

### ColumnWeights

```rust
trait ColumnWeights: Send + Sync {
    fn col_weights(&self) -> Array1<u64>;          // sum per column
    fn partial_kmer_counts(&self) -> Array1<u64>;  // default = col_weights()
}
```

`partial_kmer_counts` is overridden for count matrices to return `count_nonzero` per column (distinct kmers) rather than total count.

### CountPartials

Abstract required methods: `partial_bray`, `partial_euclidean`, `partial_threshold_jaccard`, `partial_relfreq_bray`, `partial_relfreq_euclidean`, `partial_hellinger`.

**Additivity rule:** self-contained partials (`partial_bray`, `partial_euclidean`, `partial_threshold_jaccard`) can be element-wise summed across all `(partition, layer)` pairs. Normalised partials (`partial_relfreq_*`, `partial_hellinger`) require the **global** `col_weights` (accumulated across all layers and all partitions) as parameter.

**`partial_threshold_jaccard` returns `(inter, union)`** because `union[i,j]` depends on both columns simultaneously.
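
To illustrate the additivity rule for a single column pair `(i, j)`, the sketch below assumes that the Bray partial is the per-slot sum of `min(a[s], b[s])`, which is consistent with the finalisation `1 − 2·partial_bray[i,j] / (w[i] + w[j])` but is not stated explicitly in this document. The function names and the use of plain slices instead of `Array2<u64>` are illustrative only.

```rust
/// Assumed partial for one (partition, layer): sum of per-slot minima.
fn partial_bray(a: &[u32], b: &[u32]) -> u64 {
    a.iter().zip(b).map(|(&x, &y)| x.min(y) as u64).sum()
}

/// Column weight: total count in the column.
fn weight(a: &[u32]) -> u64 {
    a.iter().map(|&x| x as u64).sum()
}

/// Finalisation applied once, after all partials have been summed.
fn bray_dist(partial: u64, w_i: u64, w_j: u64) -> f64 {
    1.0 - 2.0 * partial as f64 / (w_i + w_j) as f64
}

/// Sum partials and weights over partitions, then finalise.
fn bray_over_partitions(parts: &[(&[u32], &[u32])]) -> f64 {
    let (mut p, mut wi, mut wj) = (0u64, 0u64, 0u64);
    for (a, b) in parts {
        p += partial_bray(a, b);
        wi += weight(a);
        wj += weight(b);
    }
    bray_dist(p, wi, wj)
}
```
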

Provided finalisations:

| Finalisation | Formula |
|---|---|
| `bray_dist_matrix()` | `1 − 2·partial_bray[i,j] / (w[i] + w[j])` |
| `euclidean_dist_matrix()` | `√partial_euclidean[i,j]` |
| `threshold_jaccard_dist_matrix(t)` | `1 − inter[i,j] / union[i,j]` |
| `relfreq_bray_dist_matrix()` | `1 − partial_relfreq_bray[i,j]` |
| `relfreq_euclidean_dist_matrix()` | `√partial_relfreq_euclidean[i,j]` |
| `hellinger_dist_matrix()` | `√partial_hellinger[i,j] / √2` |
| `hellinger_euclidean_dist_matrix()` | `√partial_hellinger[i,j]` |

### BitPartials

Required: `partial_jaccard() -> (Array2<u64>, Array2<u64>)`, `partial_hamming() -> Array2<u64>`. Both are additive across layers and partitions.

---

## Temp-file-backed types

**All inter-function results use temp-file-backed types** so the OS can page them out under memory pressure. This matters in practice: processing dozens of layers × hundreds of partitions in parallel would otherwise accumulate gigabytes of live anonymous memory.

### Lifecycle

```
TempCompactIntVecBuilder::new(n)    → writable mmap in TempDir
    ↓  (inc_present_fast / inc_predicate_fast / add / mask_with / …)
.freeze()                           → TempCompactIntVec (read-only mmap + TempDir)
    ↓  (optional)
.make_persistent(path)              → PersistentCompactIntVec (permanent file)
```

Same pattern for `TempBitVecBuilder` → `TempBitVec` → `PersistentBitVec`.

**Drop order**: `TempCompactIntVec { vec: PersistentCompactIntVec, _temp: TempDir }`. Rust drops fields in declaration order, so `vec` (the mmap) is released before `_temp` (which deletes the directory). No explicit `drop()` is needed.
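
The declaration-order guarantee can be checked with a small standalone sketch. `Tracked` and `TempLike` are hypothetical stand-ins for the mmap-backed vector and the `TempDir`; they only record the order in which their destructors run.

```rust
use std::cell::RefCell;
use std::rc::Rc;

type Log = Rc<RefCell<Vec<&'static str>>>;

/// Records its name in a shared log when dropped.
struct Tracked {
    name: &'static str,
    log: Log,
}

impl Drop for Tracked {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Mirrors the field order of `TempCompactIntVec`: mapping first, directory second.
struct TempLike {
    _vec: Tracked,
    _temp: Tracked,
}

/// Builds a `TempLike`, drops it, and returns the observed drop order.
fn drop_order() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let value = TempLike {
        _vec: Tracked { name: "vec", log: Rc::clone(&log) },
        _temp: Tracked { name: "_temp", log: Rc::clone(&log) },
    };
    drop(value);
    let order = log.borrow().clone();
    order
}
```
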

### TempCompactIntVec / TempCompactIntVecBuilder

```rust
pub struct TempCompactIntVec {
    vec:   PersistentCompactIntVec,
    _temp: TempDir,   // dropped after vec
}

pub(crate) struct TempCompactIntVecBuilder {
    builder: PersistentCompactIntVecBuilder,
    temp:    TempDir,
}
```

`TempCompactIntVec`: read access via `get(slot)`, `sum()`, `iter()`, `view() -> IntSliceView<'_>`.

`TempCompactIntVecBuilder`: full delegation to the inner `PersistentCompactIntVecBuilder`. All bulk computation methods (`inc_present_fast`, `inc_predicate_fast`, `add`, `min`, `max`, `diff`, `mask_with`) are exposed as `pub(crate)`.

### TempBitVec / TempBitVecBuilder

```rust
pub struct TempBitVec {
    vec:   PersistentBitVec,
    _temp: TempDir,
}

pub(crate) struct TempBitVecBuilder {
    builder: PersistentBitVecBuilder,
    temp:    TempDir,
}
```

`TempBitVec`: read access via `get(slot)`, `count_ones()`, `view() -> BitSliceView<'_>`, `iter()`.

`TempBitVecBuilder`: exposes `set(slot, bool)`, `or(BitSliceView)`, and:

```rust
pub(crate) fn or_where(&mut self, col: IntSliceView<'_>, pred: impl Fn(u32) -> bool)
```

`or_where` runs two passes with no intermediate allocation:

```
Pass 1: primary bytes, O(n):
  for slot in 0..n:
    b = col.primary_bytes()[slot]
    if b < 255 AND pred(b as u32): self.set(slot, true)

Pass 2: overflow, O(k):
  for (slot, val) in col.overflow_entries():
    if pred(val): self.set(slot, true)
```
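
The two passes can be sketched as a free function over plain slices. The signature is a simplification for illustration: the destination bit vector is a `&mut [u64]`, and the column is passed as its primary bytes plus a list of `(slot, value)` overflow pairs instead of an `IntSliceView`.

```rust
/// Set bit `slot` in an LSB-first packed word array.
fn set_bit(words: &mut [u64], slot: usize) {
    words[slot >> 6] |= 1u64 << (slot & 63);
}

/// OR into `words` every slot whose column value satisfies `pred`.
fn or_where(
    words: &mut [u64],
    primary: &[u8],
    overflow: &[(usize, u32)],
    pred: impl Fn(u32) -> bool,
) {
    // Pass 1: direct bytes. Sentinel slots are skipped because their byte
    // (255) is not the real value.
    for (slot, &b) in primary.iter().enumerate() {
        if b < 255 && pred(b as u32) {
            set_bit(words, slot);
        }
    }
    // Pass 2: overflow slots, tested against their real values.
    for &(slot, val) in overflow {
        if pred(val) {
            set_bit(words, slot);
        }
    }
}
```
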

---

## Filter / Select API

### ColGroup

```rust
pub struct ColGroup { pub name: String, pub indices: Vec<usize> }
```

Defined **once at the index level** from column metadata. Valid in all matrices of all layers and partitions: column structure is identical across the entire hierarchy; only rows (kmer slots) are partitioned.

### Composition axis

- **Across partitions**: kmer space is partitioned → partial results **concatenated** (disjoint kmer ranges).
- **Across layers**: same kmer space, different counts → partial results **aggregated** (add, OR, etc.).

### MatrixGroupOps

Group operations expose only **additive intermediates** backed by temp files. Final predicates are applied at the index level after accumulation.

```rust
pub trait MatrixGroupOps {
    fn partial_group_presence_count(&self, g: &ColGroup, threshold: u32)
        -> io::Result<TempCompactIntVec>;

    fn partial_group_sum(&self, g: &ColGroup)
        -> io::Result<TempCompactIntVec>;

    fn partial_group_any(&self, g: &ColGroup, threshold: u32)
        -> io::Result<TempBitVec>;
}
```

Implemented for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`. For bit matrices, `partial_group_sum` delegates to `partial_group_presence_count(g, 1)`.

**`partial_group_presence_count`: chunking for large groups**

When `g.indices.len() < 255`: per-slot counts stay within `u8` range. Use `inc_present_fast` (bit matrix) or `inc_predicate_fast(col_view(c), |v| v >= threshold)` (int matrix): a raw u8 increment with no overflow map written.

When `g.indices.len() ≥ 255`: process in chunks of 254 columns (each chunk stays within u8 range), accumulate into a running builder via `.add(chunk_frozen.view())`.

```
fast path (< 255 cols):
  builder = TempCompactIntVecBuilder::new(n)
  for c in group:
    builder.inc_predicate_fast(matrix.col_view(c), |v| v >= threshold)
  builder.freeze()

slow path (≥ 255 cols):
  result = TempCompactIntVecBuilder::new(n)
  for chunk in group.chunks(254):
    chunk_b = TempCompactIntVecBuilder::new(n)
    for c in chunk:
      chunk_b.inc_predicate_fast(matrix.col_view(c), |v| v >= threshold)
    frozen = chunk_b.freeze()
    result.add(frozen.view())
  result.freeze()
```
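
The chunking idea can be sketched without the temp-file machinery. In this simplified model, each column is a `Vec<u32>` of per-slot counts standing in for `col_view(c)`, the per-chunk counters are a plain `Vec<u8>`, and the running total is a `Vec<u32>`; `presence_count` is a hypothetical name.

```rust
/// Count, per slot, how many columns have a value >= `threshold`.
fn presence_count(columns: &[Vec<u32>], n: usize, threshold: u32) -> Vec<u32> {
    let mut total = vec![0u32; n];
    // At most 254 columns per chunk, so a u8 counter can never reach 255.
    for chunk in columns.chunks(254) {
        let mut counts = vec![0u8; n];
        for col in chunk {
            for (c, &v) in counts.iter_mut().zip(col) {
                if v >= threshold {
                    *c += 1; // raw u8 increment, no overflow handling needed
                }
            }
        }
        // Fold the chunk into the wide accumulator (the `.add(...)` step).
        for (t, &c) in total.iter_mut().zip(&counts) {
            *t += c as u32;
        }
    }
    total
}
```
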

**`partial_group_any`** uses `or_where` on `TempBitVecBuilder`:

```
result = TempBitVecBuilder::new(n)
for c in group:
  result.or_where(matrix.col_view(c), |v| v >= threshold)
result.freeze()
```

**Non-additive predicates** (`group_all`, `group_at_least(k)`) are composed at the index level:

```rust
// "present in >= 2 ingroup columns with count >= 3, absent from all outgroup"
let presence = layers.map(|l| l.partial_group_presence_count(&ingroup, 3)?).add_all()?;
let in_mask  = presence.view().geq(2);   // IntSliceView method

let out_sum  = layers.map(|l| l.partial_group_sum(&outgroup)?).add_all()?;
let out_mask = out_sum.view().leq(0);

let mut mask_b = TempBitVecBuilder::new(n)?;
mask_b.copy_from(in_mask);
mask_b.and(out_mask);
```

### mask_with

Direct method on `PersistentCompactIntVecBuilder` (and delegation via `TempCompactIntVecBuilder`). Zeros every slot where the corresponding mask bit is 0. All-ones words are skipped with a single comparison and only zero bits are visited, so the cost is O(n/64 + n_zeros).

```
for (w_idx, word) in mask.words():
  if word == u64::MAX: continue        // skip all-ones words
  zeros = !word
  while zeros != 0:
    bit = trailing_zeros(zeros)
    s   = w_idx * 64 + bit
    if s >= n: break                   // zero padding bits in the last word
    if primary[s] != 0: set(s, 0)      // clears overflow entry too
    zeros &= zeros - 1
```
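
A runnable sketch of the same loop on an in-memory model. The free function and its parameters (a `&mut [u8]` primary array and a `HashMap` overflow store) are illustrative assumptions, not the crate's signature. The bound check covers the zero padding bits of the last mask word, which would otherwise index past the end of `primary`.

```rust
use std::collections::HashMap;

/// Zero every slot whose mask bit is 0; slots with a 1 bit are untouched.
fn mask_with(primary: &mut [u8], overflow: &mut HashMap<usize, u32>, mask: &[u64]) {
    let n = primary.len();
    for (w_idx, &word) in mask.iter().enumerate() {
        if word == u64::MAX {
            continue; // all 64 slots are kept
        }
        let mut zeros = !word;
        while zeros != 0 {
            let s = w_idx * 64 + zeros.trailing_zeros() as usize;
            if s >= n {
                break; // remaining zero bits are padding past the last slot
            }
            if primary[s] != 0 {
                primary[s] = 0;
                overflow.remove(&s); // no-op for non-sentinel slots
            }
            zeros &= zeros - 1; // clear the lowest set bit
        }
    }
}
```
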

Terminal operation for Filter (retain only selected kmer slots in a count vector) and Select (positional selection without MPHF).