docs(obicompactvec): update API docs and algorithm descriptions

Replace trait-based API documentation with concrete, zero-copy view structs and update all associated diagrams. Refine algorithmic descriptions for sentinel handling, overflow stores, and bulk operations. Clarify temporary file lifecycles and group-chunking strategies to support memory-efficient parallel aggregation.
2026-06-18 07:10:08 +02:00
parent f91c5a3f79
commit 7eea71fdcd
1 changed files with 228 additions and 419 deletions
@@ -5,12 +5,11 @@
 ```
 src/obicompactvec/src/
  lib.rs            public re-exports
-  traits.rs         BitSlice, BitSliceMut, IntSlice, IntSliceMut + conversion traits
+  views.rs          BitSliceView<'a>, IntSliceView<'a> — zero-copy read views
  traits.rs         ColumnWeights, CountPartials, BitPartials (matrix aggregation)
  bitvec.rs         PersistentBitVec, PersistentBitVecBuilder, BitIter
  memoryvec.rs      MemoryBitVec
  reader.rs         PersistentCompactIntVec (read-only)
  builder.rs        PersistentCompactIntVecBuilder (read-write)
  memoryintvec.rs   MemoryIntVec
  tempintvec.rs     TempCompactIntVec, TempCompactIntVecBuilder (temp-file-backed)
  tempbitvec.rs     TempBitVec, TempBitVecBuilder (temp-file-backed)
  bitmatrix.rs      PersistentBitMatrix, PersistentBitMatrixBuilder
@@ -23,20 +22,20 @@ src/obicompactvec/src/
 ```mermaid
 graph TD
-    traits --> memoryvec
+    views --> bitvec
-    traits --> memoryintvec
+    views --> builder
-    bitvec --> memoryvec
+    views --> tempbitvec
-    bitvec --> bitmatrix
+    views --> tempintvec
-    bitvec --> tempbitvec
+    views --> bitmatrix
    views --> intmatrix
    format --> reader
    format --> builder
    reader --> intmatrix
    reader --> tempintvec
    builder --> intmatrix
    builder --> memoryintvec
    builder --> tempintvec
-    memoryvec --> traits
+    bitvec --> tempbitvec
-    memoryintvec --> traits
+    bitvec --> bitmatrix
    tempintvec --> intmatrix
    tempintvec --> bitmatrix
    tempbitvec --> intmatrix
@@ -62,7 +61,7 @@ All integer vectors use the same two-tier encoding regardless of storage backend
 **Overflow store** — maps slot index to a `u32` value ≥ 255:
- In `MemoryIntVec` and `PersistentCompactIntVecBuilder`: a `HashMap<usize, u32>` in RAM.
+- In `PersistentCompactIntVecBuilder`: a `HashMap<usize, u32>` in RAM.
 - In `PersistentCompactIntVec` (reader): a sorted `[(slot: u64, value: u32)]` array in the mmap, with a sparse L1-resident index for binary search.
 ```mermaid
@@ -70,13 +69,12 @@ flowchart LR
    slot --> P["primary[slot]: u8"]
    P -->|"< 255"| V["value = byte (0–254)"]
    P -->|"= 255 sentinel"| OV["overflow store"]
-    OV -->|"MemoryIntVec / Builder"| HM["HashMap&lt;usize, u32&gt;\nin RAM"]
+    OV -->|"Builder"| HM["HashMap&lt;usize, u32&gt;\nin RAM"]
    OV -->|"PersistentCompactIntVec"| SA["sorted [(slot,value)] in mmap\n+ sparse L1 index"]
 ```
 **Key property — sentinel 255 = +∞ on `u8`:**
 This is exploited throughout the binary operations. On a `u8` comparison, 255 behaves as positive infinity:
 - `min(a, 255) = a` for all `a ≤ 254` → correct when only one side is overflow
 - `max(a, 255) = 255` → correct sentinel when either side is overflow
 - Only the **both-overflow** case requires reading actual values from the overflow store.
@@ -85,274 +83,60 @@ In practice, k (overflow count) ≪ n (total slots). Observed genomic data: ~0.0
 ---
-## Trait hierarchy
+## View types
-```mermaid
+The previous trait hierarchy (`BitSlice`, `BitSliceMut`, `IntSlice`, `IntSliceMut`) has been replaced by two concrete zero-copy view structs with inherent methods. Views are **`Copy`** — passing them is free. All read operations live on these two types.
-classDiagram
+
-    class BitSlice {
+### `BitSliceView<'a>`
-        <<trait>>
+
-        +len() usize
+```rust
-        +words() &[u64]
+#[derive(Clone, Copy)]
-        +get(slot) bool
+pub struct BitSliceView<'a> { pub(crate) words: &'a [u64], pub(crate) n: usize }
        +count_ones() u64
        +count_zeros() u64
        +partial_jaccard_dist(other) (u64,u64)
        +jaccard_dist(other) f64
        +hamming_dist(other) u64
    }
    class BitSliceMut {
        <<trait>>
        +words_mut() &mut [u64]
        +set(slot, value)
        +copy_from(src)
        +and(other)
        +or(other)
        +xor(other)
        +not()
    }
    class IntSlice {
        <<trait>>
        +len() usize
        +get(slot) u32
        +primary_bytes() &[u8]
        +overflow_entries() Iterator
        +iter() Iterator
        +sum() u64
        +count_nonzero() u64
        +cmp_scalar(pred) MemoryBitVec
        +lt/leq/gt/geq(t) MemoryBitVec
    }
    class IntSliceMut {
        <<trait>>
        +set(slot, value)
        +primary_bytes_mut() &mut [u8]
        +clear_overflow()
        +inc/dec/add_at(slot)
        +copy_from(src)
        +min/max/add/diff(other)
        +count_bits(bits)
    }
    class IntToBit {
        <<trait blanket>>
        +to_bitvec(threshold) MemoryBitVec
        +to_presence() MemoryBitVec
    }
    class BitToInt {
        <<trait blanket>>
        +to_intvec() MemoryIntVec
    }
    BitSliceMut --|> BitSlice : extends
    IntSliceMut --|> IntSlice : extends
    IntToBit --|> IntSlice : blanket T:IntSlice
    BitToInt --|> BitSlice : blanket T:BitSlice
 ```
-### BitSlice (read-only)
+Bit `i` is at `words[i >> 6]` bit `i & 63` (LSB-first). Padding bits in the last word are zero.
-Required: `len()`, `words() -> &[u64]`.
+| Method | Cost |
 |---|---|
 | `len()`, `is_empty()` | O(1) |
 | `get(slot)` | O(1) |
 | `count_ones()` | POPCNT per word, O(n/64) |
 | `count_zeros()` | `n − count_ones()`, O(n/64) |
 | `iter() -> BitSliceIter<'a>` | O(1) setup, O(n) iteration |
 | `partial_jaccard_dist(other: BitSliceView)` | `(a&b).popcount`, `(a\|b).popcount` per word, O(n/64) |
 | `jaccard_dist(other: BitSliceView)` | from partial, O(n/64) |
 | `hamming_dist(other: BitSliceView)` | `(a^b).popcount` per word, O(n/64) |
-Bit `i` is at `words()[i >> 6]` bit `i & 63` (LSB-first). Padding bits in the last word are always zero — this invariant must be maintained by all implementors.
+`BitSliceIter<'a>`: word-level scan; one word per 64 iterations.
-| Provided method | Implementation | Cost |
+### `IntSliceView<'a>`
 |---|---|---|
 | `is_empty()` | `len() == 0` | O(1) |
 | `get(slot)` | word extract | O(1) |
 | `count_ones()` | POPCNT per word | O(n/64) |
 | `count_zeros()` | `n − count_ones()` | O(n/64) |
 | `partial_jaccard_dist(other)` | `(a&b).popcount`, `(a\|b).popcount` per word | O(n/64) |
 | `jaccard_dist(other)` | from partial | O(n/64) |
 | `hamming_dist(other)` | `(a^b).popcount` per word | O(n/64) |
-### BitSliceMut: BitSlice (mutable)
+```rust
-
+#[derive(Clone, Copy)]
-Required: `words_mut() -> &mut [u64]`.
+pub struct IntSliceView<'a> {
-
+    pub(crate) primary:      &'a [u8],
-All bulk operations work at the word level (64 bits/iteration). The compiler auto-vectorizes these loops to AVX2/AVX-512. The zero-padding invariant is maintained: `not()` re-masks the last word after flipping.
+    pub(crate) overflow_raw: &'a [u8],   // sorted [(slot:u64, value:u32)] entries
-
+    pub(crate) n_overflow:   usize,
-| Provided method | Implementation | Cost |
+    pub(crate) n:            usize,
-|---|---|---|
+}
 | `set(slot, value)` | OR / AND-NOT on one word | O(1) |
 | `copy_from(src)` | `copy_from_slice` = memcpy | O(n/64) |
 | `and(other)` | `w &= o` per word | O(n/64) |
 | `or(other)` | `w \|= o` per word | O(n/64) |
 | `xor(other)` | `w ^= o` per word | O(n/64) |
 | `not()` | `w ^= u64::MAX` per word, then mask last | O(n/64) |
 **No overflow complexity here.** The packed `u64` representation is already the natural unit for SIMD operations. No sentinel, no HashMap — just bitwise word ops.
 ---
 ### IntSlice (read-only)
 Required:
 - `len() -> usize`
 - `get(slot) -> u32` — handles sentinel transparently (binary search into overflow for persistent, HashMap for memory)
 - `primary_bytes() -> &[u8]` — raw primary array including 255 sentinels
 - `overflow_entries() -> impl Iterator<Item = (usize, u32)>` — (slot, true_value) pairs for all overflow slots
 | Provided method | Default implementation | Note |
 |---|---|---|
 | `is_empty()` | `len() == 0` | |
 | `iter()` | `(0..n).map(\|i\| self.get(i))` | Overridden in all concrete types |
 | `sum()` | `iter().map(\|v\| v as u64).sum()` | Overridden in concrete types |
 | `count_nonzero()` | `iter().filter(\|v\| *v > 0).count()` | Overridden in concrete types |
 | `lt(t)` | `cmp_scalar(\|v\| v < t)` | |
 | `leq(t)` | `cmp_scalar(\|v\| v <= t)` | |
 | `gt(t)` | `cmp_scalar(\|v\| v > t)` | |
 | `geq(t)` | `cmp_scalar(\|v\| v >= t)` | |
 | `cmp_scalar(pred)` | two-pass (see below) | |
 **`cmp_scalar` algorithm — two passes:**
 ```
 Pass 1 — byte scan, O(n):
  for s in 0..n:
    b = primary[s]
    if b < 255 AND pred(b as u32):
      set bit s in result word
 Pass 2 — overflow fixup, O(k):
  for (s, val) in overflow_entries():
    if pred(val): set bit s in result word
 ```
-Pass 1 reads only the primary byte array — no HashMap access. For simple predicates (`geq`, `lt`, etc.) the compiler inlines `pred` and can auto-vectorize the byte comparison loop. Pass 2 handles the O(k) overflow slots that were left as 0 in pass 1.
+`overflow_raw` contains `n_overflow` entries of `OVERFLOW_ENTRY_SIZE` bytes each, sorted by slot. The sort invariant is established at `close()`/`freeze()` time.
-Previous implementation: `pred(self.get(s))` for every slot → O(n log k) due to binary search in overflow. New: O(n) + O(k).
+| Method | Cost |
 |---|---|
 | `len()`, `is_empty()` | O(1) |
 | `primary_bytes()` | O(1) |
 | `overflow_entries() -> impl Iterator<(usize,u32)>` | O(n_overflow) iteration |
 | `get(slot)` | O(1) primary; binary search O(log k) for overflow slots |
 | `iter() -> IntSliceViewIter<'a>` | merge scan, O(n + k) |
 | `sum()` | byte scan + overflow, O(n + k) |
 | `count_nonzero()` | byte scan, O(n) |
 | Distance methods (`bray_dist`, `euclidean_dist`, `jaccard_dist`, …) | O(n + k) |
---
+`IntSliceViewIter<'a>`: merge scan using `overflow_pos` index. Requires sorted overflow — guaranteed by the construction lifecycle.
-### IntSliceMut: IntSlice (mutable)
+**Builder `view()` vs reader `view()`:** `PersistentCompactIntVecBuilder` stores overflow as an unsorted `HashMap`, not raw bytes. Its `view()` returns an `IntSliceView` with `overflow_raw = &[]` and `n_overflow = 0`. This is intentional — the view is primarily useful after `freeze()`. During building, callers that need overflow use `overflow_entries()` directly.
 Required:
 - `set(slot, value: u32)` — writes primary byte (or 255 + overflow entry if value ≥ 255); removes stale overflow entry if value drops below 255
 - `primary_bytes_mut() -> &mut [u8]` — direct mutable access to the primary array
 - `clear_overflow()` — empties the entire overflow store
 The required methods expose the encoding internals. All provided methods are implemented in terms of these three + the `IntSlice` required methods.
 | Provided method | Hot path | Overflow case | Cost |
 |---|---|---|---|
 | `inc(slot)` | `get` + `set` | — | O(1) or O(log k) |
 | `dec(slot)` | `get` + `set` (saturating) | — | O(1) or O(log k) |
 | `add_at(slot, delta)` | `get` + `set` (saturating) | — | O(1) or O(log k) |
 | `copy_from(src)` | `copy_from_slice` + `clear_overflow` + replay overflows | — | O(n) + O(k) |
 | `min(other)` | byte-level min, O(n) | both-overflow fixup, O(k) | O(n) |
 | `max(other)` | byte-level max, O(n) | pre-pass on other's overflows, O(k) | O(n) |
 | `add(other)` | byte add when both < 255, O(n) | `get` + `+` when either = 255 | O(n) |
 | `diff(other)` | byte saturating_sub when self < 255, O(n) | `get` + `saturating_sub` when self = 255 | O(n) |
 | `count_bits(bits)` | iterate set bits via word scan | — | O(n_ones) |
 | `cmp_scalar` | inherited from IntSlice | — | O(n) + O(k) |
 **`min` algorithm:**
 Exploits 255 = +∞: `u8::min(a, 255) = a` and `u8::min(255, b) = b`. Only the case where both sides are ≥ 255 needs actual overflow values.
 ```mermaid
 flowchart TD
    A["min(self, other)"] --> B["snapshot self_ov: Vec&lt;(slot,val)&gt;\nsnapshot other_ov: HashMap&lt;slot,val&gt;"]
    B --> C["clear_overflow()"]
    C --> D["Pass 1 — byte min, SIMD-vectorizable\nprimary[s] = min(self[s], other[s])  ∀s"]
    D --> E["Pass 2 — both-overflow fixup\nfor (slot, self_val) in self_ov"]
    E --> F{"slot ∈ other_ov?"}
    F -->|yes| G["set(slot, min(self_val, other_ov[slot]))"]
    F -->|no| H["byte pass wrote other.primary &lt; 255\nclear_overflow removed stale entry\nno action"]
    G --> I[done]
    H --> I
 ```
 Overflow entries where only self was overflow are correctly handled: after `clear_overflow` + byte pass, `self.primary[slot] = min(255, other.primary[slot]) = other.primary[slot]` (which is < 255). No overflow entry — correct.
 **`max` algorithm:**
 Exploits 255 = +∞: `u8::max(a, 255) = 255` → any slot where either side is overflow will have sentinel 255 in the primary after the byte pass. The byte pass cannot distinguish "self had overflow and other did not" from "self was just written to 255 by the byte pass".
 Solution: read and update self's original value at other's overflow slots *before* the byte pass overwrites them.
 ```mermaid
 flowchart TD
    A["max(self, other)"] --> B["Pre-pass O(k_other)\nfor (slot, other_val) in other.overflow_entries()"]
    B --> C["self_val = self.get(slot)\nself.set(slot, max(self_val, other_val))"]
    C --> D["Pass 1 — byte max, SIMD-vectorizable\nprimary[s] = max(self[s], other[s])  ∀s"]
    D --> E["Overflow slots: max(255,255)=255\nprimary unchanged\noverflow entry from pre-pass preserved"]
    E --> F[done]
 ```
 After the pre-pass, self.primary[slot] = 255 for all slots in other's overflow. The byte pass leaves those 255s intact. Self's own overflow slots not in other's overflow are also 255 in primary — byte max(255, b < 255) = 255, unchanged. Correct in all cases.
 **`add` algorithm:**
 No sentinel property useful for add: any pair (sb, ob) with sb + ob ≥ 255 creates a new overflow entry, even when neither input was overflow. Cannot simplify via byte arithmetic.
 ```
 for s in 0..n:
  sb = self.primary[s]
  ob = other.primary[s]
  if sb < 255 AND ob < 255:      // hot path: no HashMap
    sum = sb as u32 + ob as u32
    if sum < 255: self.primary[s] = sum as u8   // direct byte write
    else:         self.set(s, sum)               // creates overflow if needed
  else:                           // at least one is overflow
    self.set(s, self.get(s) + other.get(s))
 ```
 ```mermaid
 flowchart TD
    A["add(self, other)"] --> B{"sb &lt; 255\nAND ob &lt; 255"}
    B -->|"yes — hot path\nno HashMap"| C{"sb + ob &lt; 255"}
    C -->|yes| D["primary[s] = sum as u8\nsingle byte write"]
    C -->|no| E["set(s, sum)\ncreates overflow entry"]
    B -->|"no — ≥1 side is overflow"| F["self_val = self.get(s)\nother_val = other.get(s)\nset(s, self_val + other_val)"]
    D --> Z[next slot]
    E --> Z
    F --> Z
 ```
 The `+` on `u32` values is exact (no `saturating_add`). Overflow at u32 level panics in debug — not a real risk for kmer counts. The hot path (both < 255, sum < 255) is a single byte write with no HashMap access.
 **`diff` (saturating sub) algorithm:**
 `saturating_sub(a, b) = a − min(a, b) = max(0, a − b)`. Key insight: if self's primary byte < 255, the result is always < 255 (result ≤ a), so no new overflow entries are created and no overflow lookup is needed for self. Only self's overflow slots (primary = 255) need `get()`.
 | sb | ob | result | get() needed |
 |----|----|--------|-------------|
 | < 255 | < 255 | `sb.saturating_sub(ob)` < 255 | none |
 | < 255 | 255 | 0 (b ≥ 255 > a) | none |
 | 255 | < 255 | `self.get(s) − ob` | self only |
 | 255 | 255 | `self.get(s) − other.get(s)` | both |
 ```mermaid
 flowchart TD
    A["diff(self, other)"] --> B{"sb &lt; 255\nself not overflow"}
    B -->|"yes — hot path O(n)"| C{"ob &lt; 255"}
    C -->|yes| D["primary[s] = sb.saturating_sub(ob)\nbyte write, no HashMap"]
    C -->|"no: b ≥ 255 > a"| E["primary[s] = 0"]
    B -->|"no — cold path O(k_self)"| F["self_val = self.get(s)"]
    F --> G{"ob &lt; 255"}
    G -->|yes| H["other_val = ob as u32"]
    G -->|no| I["other_val = other.get(s)"]
    H --> J["set(s, self_val.saturating_sub(other_val))"]
    I --> J
    D --> Z[next slot]
    E --> Z
    J --> Z
 ```
 Overflow entries that drop below 255 (case sb=255, result < 255) are removed by `set()`. Overflow entries that remain ≥ 255 are updated. Correct in all four cases.
 **`count_bits` algorithm:**
 Increments self at each slot where the corresponding bit in `bits` is set. Iterates `bits.words()` and skips zero words entirely — O(n_ones) rather than O(n).
 ```
 for (w_idx, word) in bits.words():
  if word == 0: continue
  base = w_idx * 64
  while word != 0:
    bit = trailing_zeros(word)
    self.inc(base + bit)
    word &= word − 1        // clear lowest set bit
 ```
 ---
@@ -360,142 +144,149 @@ for (w_idx, word) in bits.words():
 ```mermaid
 classDiagram
-    class MemoryBitVec {
+    class BitSliceView {
-        -words: Vec~u64~
+        +words: &[u64]
-        -n: usize
+        +n: usize
-        +iter() BitIter
+        +get(slot) bool
-        +ones(n) Self
+        +count_ones() u64
-        +persist(path) Builder
+        +iter() BitSliceIter
        +jaccard_dist/hamming_dist(other: BitSliceView)
    }
-    class MemoryIntVec {
+    class IntSliceView {
-        -primary: Vec~u8~
+        +primary: &[u8]
-        -overflow: HashMap~usize,u32~
+        +overflow_raw: &[u8]
-        -n: usize
+        +n_overflow: usize
-        +iter() MemoryIntIter
+        +n: usize
-        +filled(n, value) Self
+        +get(slot) u32
-        +persist(path) Builder
+        +iter() IntSliceViewIter
        +overflow_entries() Iterator
        +bray_dist/euclidean_dist/…(other: IntSliceView)
    }
    class PersistentBitVec {
        -mmap: Mmap
        -n: usize
        +view() BitSliceView
        +get(slot) bool
        +count_ones/zeros() u64
        +iter() BitIter
-        +count_ones() u64
+        +partial_jaccard_dist(&Self) (u64,u64)
        +jaccard_dist/hamming_dist(&Self) …
    }
    class PersistentBitVecBuilder {
        -mmap: MmapMut
        -n: usize
-        +close()
+        +view() BitSliceView
-        +build_from(src, path)
+        +set(slot, bool)
-        +build_from_counts(src, t, path)
+        +or/and/xor/not(BitSliceView)
        +copy_from(BitSliceView)
        +close() / finish() → PersistentBitVec
    }
    class PersistentCompactIntVec {
        -mmap: Mmap
-        -n usize
+        -n: usize
-        -n_overflow usize
+        -n_overflow: usize
-        -step usize
+        -step: usize
        -index: Vec~(usize,usize)~
-        +iter() Iter
+        +view() IntSliceView
        +get(slot) u32
-        +sum() u64
+        +iter() Iter
        +sum/count_nonzero() u64
        +bray_dist/euclidean_dist/… (&Self)
    }
    class PersistentCompactIntVecBuilder {
        -mmap: MmapMut
        -n: usize
        -overflow: HashMap~usize,u32~
-        +set(slot, value)
+        +view() IntSliceView
-        +close()
+        +set(slot, u32) / get(slot) u32
-        +build_from(src, path)
+        +inc / inc_present / inc_present_fast
        +inc_predicate / inc_predicate_fast
        +add/min/max/diff/mask_with(…View)
        +primary_bytes/primary_bytes_mut()
        +close() / finish() → PersistentCompactIntVec
    }
-    MemoryBitVec ..|> BitSlice
+    PersistentBitVec --> BitSliceView : view()
-    MemoryBitVec ..|> BitSliceMut
+    PersistentBitVecBuilder --> BitSliceView : view()
-    PersistentBitVec ..|> BitSlice
+    PersistentCompactIntVec --> IntSliceView : view()
-    PersistentBitVecBuilder ..|> BitSlice
+    PersistentCompactIntVecBuilder --> IntSliceView : view() (primary only)
    PersistentBitVecBuilder ..|> BitSliceMut
    MemoryIntVec ..|> IntSlice
    MemoryIntVec ..|> IntSliceMut
    PersistentCompactIntVec ..|> IntSlice
    PersistentCompactIntVecBuilder ..|> IntSlice
    PersistentCompactIntVecBuilder ..|> IntSliceMut
    PersistentBitVecBuilder --> PersistentBitVec : close() then open()
    PersistentCompactIntVecBuilder --> PersistentCompactIntVec : close() then open()
 ```
-### Memory types
+### `PersistentBitVec` / `PersistentBitVecBuilder`
-**`MemoryBitVec`**
+`PersistentBitVec` is the read-only type. `view()` returns a `BitSliceView<'_>` over the mmap word array. Direct inherent methods delegate to the view: `count_ones()`, `count_zeros()`, `partial_jaccard_dist(&Self)`, `jaccard_dist(&Self)`, `hamming_dist(&Self)`.
-```rust
+`BitIter<'a>` — exported iterator for `PersistentBitVec::iter()`:
 struct MemoryBitVec { words: Vec<u64>, n: usize }
 ```
 Implements `BitSlice` + `BitSliceMut`. Owns its word array. Used as the result type of `cmp_scalar` / filter operations and as an intermediate for bit-level computations.
 Std ops: `BitAnd`, `BitOr`, `BitXor`, `Not` (owned and borrowed), `BitAndAssign`, `BitOrAssign`, `BitXorAssign` — all delegate to `BitSliceMut` methods.
 `iter()` returns a `BitIter<'_>` (word-level, see below).
 **`MemoryIntVec`**
 ```rust
 struct MemoryIntVec {
    primary:  Vec<u8>,
    overflow: HashMap<usize, u32>,
    n:        usize,
 }
 ```
 Implements `IntSlice` + `IntSliceMut`. Overrides: `iter()` → inherent `iter()` (merge-scan), `sum()`, `count_nonzero()`.
 `IntSlice` required impls: `primary_bytes()` → `&self.primary`; `overflow_entries()` → `self.overflow.iter().map(...)`.
 `IntSliceMut` required impls: `set()` writes to `self.primary[slot]` and inserts/removes from `self.overflow`; `primary_bytes_mut()` → `&mut self.primary`; `clear_overflow()` → `self.overflow.clear()`.
 Std ops: `Add<&B>`, `Sub<&B>` (owned and borrowed), `AddAssign<&B>`, `SubAssign<&B>` — delegate to `IntSliceMut::add` / `diff`.
 `From<&S: IntSlice>`: copies primary bytes + overflow entries. O(n) + O(k).
 ---
 ### Persistent types
 **`PersistentBitVec` / `PersistentBitVecBuilder`**
 See `persistent_bit_vec.md`. `PersistentBitVec` is read-only (implements `BitSlice`). `PersistentBitVecBuilder` is read-write (implements `BitSlice` + `BitSliceMut`).
 `BitIter<'a>` — shared iterator type for both `MemoryBitVec` and `PersistentBitVec`:
 ```rust
 pub struct BitIter<'a> { pub(crate) words: &'a [u64], pub(crate) slot: usize, pub(crate) n: usize }
 ```
-Word-level scan: `(words[slot >> 6] >> (slot & 63)) & 1 != 0`. One word serves 64 iterations. `pub type MemoryBitIter<'a> = BitIter<'a>` preserves the public API name.
+`PersistentBitVecBuilder` is the read-write type. Mutation operations accept `BitSliceView<'_>`:
-**`PersistentCompactIntVec` / `PersistentCompactIntVecBuilder`**
+| Method | Cost |
 |---|---|
 | `set(slot, bool)` | O(1) |
 | `view() -> BitSliceView<'_>` | O(1) |
 | `or/and/xor(BitSliceView)` | word-level, O(n/64), SIMD-friendly |
 | `not()` | `w ^= u64::MAX` per word, re-masks last word | O(n/64) |
 | `copy_from(BitSliceView)` | `copy_from_slice` | O(n/64) |
-See `persistent_compact_int_vec.md` for file format and lifecycle.
+### `PersistentCompactIntVec` / `PersistentCompactIntVecBuilder`
-`PersistentCompactIntVec` implements `IntSlice`. Overrides: `iter()` → inherent merge-scan `Iter`; `sum()`; `count_nonzero()`. `overflow_entries()` returns a sequential scan `(0..n_overflow).map(|i| (data_slot(i), data_value(i)))` — no binary search since entries are stored sorted.
+`PersistentCompactIntVec` is the read-only type. `view()` returns an `IntSliceView<'_>` over the mmap primary and overflow arrays. Inherent `iter()` is a merge scan (`Iter` struct). Inherent `sum()` and `count_nonzero()` use fast byte-scan helpers.
-`PersistentCompactIntVecBuilder` implements `IntSlice` + `IntSliceMut`. `iter()` is NOT overridden (default `get`-per-slot) because the overflow `HashMap` is unsorted. `sum()` and `count_nonzero()` are overridden using `byte_sum` / `byte_count_nonzero` on the mmap primary slice — avoids per-slot overhead.
+`PersistentCompactIntVecBuilder` is the read-write type. Mutation methods on the builder fall into two categories:
-**Override rationale:** the default `iter()`, `sum()`, `count_nonzero()` on `IntSlice` call `self.get(s)` per slot, which is O(log k) binary search for `PersistentCompactIntVec`. Overrides provide O(n + k) merge-scan or O(n) byte scan instead.
+**Point mutations:**
---
+| Method | Note |
 |---|---|
 | `set(slot, u32)` | writes primary[slot] or 255+overflow |
 | `get(slot) -> u32` | reads primary byte or HashMap |
 | `inc(slot)` | `get` + `set`, O(1) |
-### IntSlice implementors — override summary
+**Bulk computation methods** — accept view arguments:
-| Type | `iter()` | `sum()` | `count_nonzero()` |
+| Method | Semantics | Overflow |
-|------|----------|---------|-------------------|
+|---|---|---|
-| `MemoryIntVec` | inherent merge-scan ✓ | `byte_sum` ✓ | `byte_count_nonzero` ✓ |
+| `inc_present(BitSliceView)` | `+= 1` at each 1-bit | via `inc`, safe for any group size |
-| `PersistentCompactIntVecBuilder` | default (get-per-slot) | `byte_sum` on mmap ✓ | `byte_count_nonzero` on mmap ✓ |
+| `inc_present_fast(BitSliceView)` | same, raw u8 `+= 1` | `debug_assert` no 255 reached |
-| `PersistentCompactIntVec` | inherent merge-scan Iter ✓ | inherent `sum()` ✓ | inherent `count_nonzero()` ✓ |
+| `inc_predicate(IntSliceView, pred)` | `+= 1` where `pred(col[s])` | two-pass, safe |
-| `TempCompactIntVec` | delegates to inner `PersistentCompactIntVec` | delegates | delegates |
+| `inc_predicate_fast(IntSliceView, pred)` | same, raw u8 | `debug_assert` no 255 reached |
-| `TempCompactIntVecBuilder` | default (get-per-slot) | delegates to builder | delegates to builder |
+| `add(IntSliceView)` | `self[s] += other[s]` | primary fast path + overflow fallback |
-| `PackedIntCol<'a>` | inherent PackedIntColIter ✓ | byte_sum ✓ | byte_count_nonzero ✓ |
+| `min(IntSliceView)` | byte min + both-overflow fixup | see algorithm below |
 | `max(IntSliceView)` | pre-pass + byte max | see algorithm below |
 | `diff(IntSliceView)` | saturating sub | self<255 hot path |
 | `mask_with(BitSliceView)` | zeros slots where mask bit = 0 | O(n_zeros) |
-`PackedIntCol` is used internally by `PersistentCompactIntMatrix` (packed format) for column views.
+**`inc_present_fast` / `inc_predicate_fast` invariant:** caller guarantees no counter reaches 255 during the operation (group size < 255 for `inc_present_fast`, or chunk size < 255 for `inc_predicate_fast`). Violation is caught by `debug_assert` in dev builds.
 **`min` algorithm:**
 Exploits 255 = +∞: byte-level min is correct unless both sides are overflow.
 ```
 snapshot self_ov: Vec<(slot,val)>
 snapshot other_ov: HashMap<slot,val>
 clear_overflow()
 Pass 1 — byte min, SIMD-vectorizable, O(n)
 Pass 2 — both-overflow fixup, O(k_self):
  for (slot, self_val) in self_ov:
    if slot ∈ other_ov: set(slot, min(self_val, other_ov[slot]))
 ```
 **`max` algorithm:**
 Cannot do byte max first — `max(255, b<255)=255` overwrites self's original overflow value. Pre-pass reads self's value at other's overflow slots before the byte pass.
 ```
 Pre-pass O(k_other): for (slot, other_val) in other.overflow_entries():
  set(slot, max(self.get(slot), other_val))
 Pass 1 — byte max, SIMD-vectorizable, O(n)
 ```
 ---
@@ -505,30 +296,22 @@ Four matrix types, two encodings × two formats:
 | | Columnar format | Packed format |
 |---|---|---|
-| **Bit** | `PersistentBitMatrix` | — |
+| **Bit** | `PersistentBitMatrix` (Columnar variant) | `PersistentBitMatrix` (Packed variant) |
-| **Int** | `PersistentCompactIntMatrix` (columnar) | `PersistentCompactIntMatrix` (packed) |
+| **Int** | `PersistentCompactIntMatrix` (Columnar variant) | `PersistentCompactIntMatrix` (Packed variant) |
-`PersistentCompactIntMatrix` is an enum behind a transparent API — the caller does not see whether the on-disk format is columnar (one `.pciv` per column) or packed (one `.pcmx` file interleaving all columns). `col(c)` and `col_slice(c)` return column views that implement `IntSlice`.
+Both matrix types are enums (`Columnar` / `Packed` / `Implicit` for bit) behind a transparent API. `col_view(c)` returns the appropriate view directly:
-`pack_compact_int_matrix` and `pack_bit_matrix` convert a columnar matrix to packed format.
+```rust
 // PersistentBitMatrix
 pub fn col_view(&self, c: usize) -> BitSliceView<'_>
-For details see `persistent_compact_int_vec.md` and `persistent_bit_vec.md`.
+// PersistentCompactIntMatrix
 pub fn col_view(&self, c: usize) -> IntSliceView<'_>
 ```
---
+No wrapper enums (`BitColView`, `IntColView`): the caller receives a `Copy` view struct immediately usable with any view method or bulk builder method.
-## Conversion traits
+`pack_compact_int_matrix` and `pack_bit_matrix` convert columnar → packed format.
 Four blanket-impl traits on top of `BitSlice` / `IntSlice`:
 **`IntToBit: IntSlice`**
 - `to_bitvec(threshold: u32) -> MemoryBitVec` — bit set iff value ≥ threshold (delegates to `geq`)
 - `to_presence() -> MemoryBitVec` — bit set iff value ≥ 1 (delegates to `geq(1)`)
 **`BitToInt: BitSlice`**
 - `to_intvec() -> MemoryIntVec` — expands each bit to a `u8` (0 or 1) in a new primary array
 - Uses a `static EXPAND_BYTE: [[u8; 8]; 256]` lookup table — 8 bits expanded per byte, word-level outer loop
 Both `IntToBit` and `BitToInt` are implemented for all `T: IntSlice` / `T: BitSlice` via blanket impls.
 ---
@@ -549,37 +332,37 @@ trait ColumnWeights: Send + Sync {
 Abstract required methods: `partial_bray`, `partial_euclidean`, `partial_threshold_jaccard`, `partial_relfreq_bray`, `partial_relfreq_euclidean`, `partial_hellinger`.
-**Additivity rule:** self-contained partials (`partial_bray`, `partial_euclidean`, `partial_threshold_jaccard`) can be element-wise summed across all `(partition, layer)` pairs before applying the finalisation. Normalised partials (`partial_relfreq_*`, `partial_hellinger`) require the **global** `col_weights` (accumulated across all layers and all partitions) as parameter — not per-layer or per-partition weights.
+**Additivity rule:** self-contained partials (`partial_bray`, `partial_euclidean`, `partial_threshold_jaccard`) can be element-wise summed across all `(partition, layer)` pairs. Normalised partials (`partial_relfreq_*`, `partial_hellinger`) require the **global** `col_weights` (accumulated across all layers and all partitions) as parameter.
-**`partial_threshold_jaccard` returns `(inter, union)`**, not a single matrix, because `union[i,j]` depends on both columns simultaneously and cannot be reconstructed from per-column statistics.
+**`partial_threshold_jaccard` returns `(inter, union)`** because `union[i,j]` depends on both columns simultaneously.
-Provided finalisations (default implementations):
+Provided finalisations:
 | Finalisation | Formula |
 |---|---|
 | `bray_dist_matrix()` | `1 − 2·partial_bray[i,j] / (w[i] + w[j])` |
 | `euclidean_dist_matrix()` | `√partial_euclidean[i,j]` |
 | `threshold_jaccard_dist_matrix(t)` | `1 − inter[i,j] / union[i,j]` |
-| `relfreq_bray_dist_matrix()` | `1 − partial_relfreq_bray[i,j]` (two-pass: col_weights then partial) |
+| `relfreq_bray_dist_matrix()` | `1 − partial_relfreq_bray[i,j]` |
 | `relfreq_euclidean_dist_matrix()` | `√partial_relfreq_euclidean[i,j]` |
 | `hellinger_dist_matrix()` | `√partial_hellinger[i,j] / √2` |
 | `hellinger_euclidean_dist_matrix()` | `√partial_hellinger[i,j]` |
 ### BitPartials
-Required: `partial_jaccard() -> (Array2<u64>, Array2<u64>)` (inter, union), `partial_hamming() -> Array2<u64>`. Both additive across layers and partitions.
+Required: `partial_jaccard() -> (Array2<u64>, Array2<u64>)`, `partial_hamming() -> Array2<u64>`. Both additive across layers and partitions.
 ---
 ## Temp-file-backed types
-`MemoryBitVec` and `MemoryIntVec` are reserved for truly transient intra-method intermediates (e.g. a single `cmp_scalar` result that lives for one loop iteration). **All inter-function results use temp-file-backed types** so the OS can page them out under memory pressure. This matters in practice: processing dozens of layers × hundreds of partitions in parallel would otherwise accumulate gigabytes of live anonymous memory.
+**All inter-function results use temp-file-backed types** so the OS can page them out under memory pressure. This matters in practice: processing dozens of layers × hundreds of partitions in parallel would otherwise accumulate gigabytes of live anonymous memory.
 ### Lifecycle
 ```
 TempCompactIntVecBuilder::new(n)   →  writable mmap in TempDir
-     ↓  (set / add / count_bits / mask_with / …)
+     ↓  (inc_present_fast / inc_predicate_fast / add / mask_with / …)
 .freeze()                          →  TempCompactIntVec  (read-only mmap + TempDir)
     ↓  (optional)
 .make_persistent(path)             →  PersistentCompactIntVec  (permanent file)
@@ -587,7 +370,7 @@ TempCompactIntVecBuilder::new(n)   →  writable mmap in TempDir
 Same pattern for `TempBitVecBuilder` → `TempBitVec` → `PersistentBitVec`.
-**Drop order**: in `TempCompactIntVec { vec: PersistentCompactIntVec, _temp: TempDir }`, Rust drops fields in declaration order — `vec` (mmap) is released before `_temp` (directory) is deleted. No explicit `drop()` needed.
+**Drop order**: `TempCompactIntVec { vec: PersistentCompactIntVec, _temp: TempDir }` — Rust drops fields in declaration order. `vec` (mmap) released before `_temp` (directory deleted). No explicit `drop()` needed.
 ### TempCompactIntVec / TempCompactIntVecBuilder
@@ -603,9 +386,9 @@ pub(crate) struct TempCompactIntVecBuilder {
 }
 ```
-`TempCompactIntVec` implements `IntSlice` (full delegation to inner `PersistentCompactIntVec`).  
+`TempCompactIntVec`: read access via `get(slot)`, `sum()`, `iter()`, `view() -> IntSliceView<'_>`.
-`TempCompactIntVecBuilder` implements `IntSlice` + `IntSliceMut` (delegation to inner builder).  
+
-`make_persistent(path)` copies the temp file to `path` and opens it as `PersistentCompactIntVec`.
+`TempCompactIntVecBuilder`: full delegation to inner `PersistentCompactIntVecBuilder` — all bulk computation methods (`inc_present_fast`, `inc_predicate_fast`, `add`, `min`, `max`, `diff`, `mask_with`) are exposed as `pub(crate)`.
 ### TempBitVec / TempBitVecBuilder
@@ -621,9 +404,26 @@ pub(crate) struct TempBitVecBuilder {
 }
 ```
-`TempBitVec` implements `BitSlice`.  
+`TempBitVec`: read access via `get(slot)`, `count_ones()`, `view() -> BitSliceView<'_>`, `iter()`.
-`TempBitVecBuilder` implements `BitSlice` + `BitSliceMut`.  
+
-`make_persistent(path)` copies the temp file and opens as `PersistentBitVec`.
+`TempBitVecBuilder`: exposes `set(slot, bool)`, `or(BitSliceView)`, and:
 ```rust
 pub(crate) fn or_where(&mut self, col: IntSliceView<'_>, pred: impl Fn(u32) -> bool)
 ```
 `or_where` — two passes, no intermediate allocation:
 ```
 Pass 1 — primary bytes, O(n):
  for slot in 0..n:
    b = col.primary_bytes()[slot]
    if b < 255 AND pred(b as u32): self.set(slot, true)
 Pass 2 — overflow, O(k):
  for (slot, val) in col.overflow_entries():
    if pred(val): self.set(slot, true)
 ```
 ---
@@ -635,18 +435,16 @@ pub(crate) struct TempBitVecBuilder {
 pub struct ColGroup { pub name: String, pub indices: Vec<usize> }
 ```
-Defined **once at the index level** from column metadata. Valid in all matrices of all layers and partitions because column structure is identical across the entire hierarchy (same samples/genomes everywhere; only rows = kmer slots are partitioned).
+Defined **once at the index level** from column metadata. Valid in all matrices of all layers and partitions — column structure is identical across the entire hierarchy; only rows (kmer slots) are partitioned.
 `ColGroup` is passed by reference unchanged to any matrix — no index translation.
 ### Composition axis
- **Across partitions**: kmer space is partitioned → partial results are **concatenated** (disjoint kmer ranges).
+- **Across partitions**: kmer space is partitioned → partial results **concatenated** (disjoint kmer ranges).
- **Across layers**: same kmer space, different counts → partial results are **aggregated** (add, OR, etc.).
+- **Across layers**: same kmer space, different counts → partial results **aggregated** (add, OR, etc.).
 ### MatrixGroupOps
-Group operations live on the matrix and expose only **additive intermediates** backed by temp files. Predicates (final thresholds → `MemoryBitVec`) are applied at the index level after accumulation.
+Group operations expose only **additive intermediates** backed by temp files. Final predicates are applied at the index level after accumulation.
 ```rust
 pub trait MatrixGroupOps {
@@ -661,48 +459,59 @@ pub trait MatrixGroupOps {
 }
 ```
-Implemented for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`. For bit matrices, `partial_group_sum` delegates to `partial_group_presence_count(g, 1)` since values are 0/1.
+Implemented for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`. For bit matrices, `partial_group_sum` delegates to `partial_group_presence_count(g, 1)`.
 **`partial_group_presence_count` — chunking for large groups:**
-When `g.indices.len() < 255`, per-slot counts fit in a raw `u8` — fast path: accumulate directly into `primary_bytes_mut()` using `inc_primary_bits`, then `freeze()`. No overflow map needed.
+When `g.indices.len() < 255`: per-slot counts stay within `u8` range. Use `inc_present_fast` (bit matrix) or `inc_predicate_fast(col_view(c), |v| v >= threshold)` (int matrix) — raw u8 increment, no overflow map written.
-When `g.indices.len() ≥ 255`, process in chunks of 254 columns — each chunk stays within `u8` range — then add chunks into a running `TempCompactIntVecBuilder` accumulator via `IntSliceMut::add`. This keeps peak memory proportional to one partition, not the number of columns × partitions.
+When `g.indices.len() ≥ 255`: process in chunks of 254 columns (each chunk stays within u8 range), accumulate into a running builder via `.add(chunk_frozen.view())`.
 ```
 fast path (< 255 cols):
  builder = TempCompactIntVecBuilder::new(n)
  for c in group:
-    mask = col_view(c).cmp_scalar(|v| v >= threshold)  // MemoryBitVec
+    builder.inc_predicate_fast(matrix.col_view(c), |v| v >= threshold)
    inc_primary_bits(primary_bytes_mut, mask)           // u8 safe
  builder.freeze()
 slow path (≥ 255 cols):
  result = TempCompactIntVecBuilder::new(n)
  for chunk in group.chunks(254):
-    chunk_builder = TempCompactIntVecBuilder::new(n)
+    chunk_b = TempCompactIntVecBuilder::new(n)
-    inc_primary_bits(chunk_builder, …)
+    for c in chunk:
-    chunk_frozen = chunk_builder.freeze()
+      chunk_b.inc_predicate_fast(matrix.col_view(c), |v| v >= threshold)
-    IntSliceMut::add(&mut result, &chunk_frozen)
+    frozen = chunk_b.freeze()
    result.add(frozen.view())
  result.freeze()
 ```
-Non-additive predicates (`group_all`, `group_at_least(k)`) are **not** on the matrix — composed at the index level:
+**`partial_group_any`** uses `or_where` on `TempBitVecBuilder`:
 ```
 result = TempBitVecBuilder::new(n)
 for c in group:
  result.or_where(matrix.col_view(c), |v| v >= threshold)
 result.freeze()
 ```
 **Non-additive predicates** (`group_all`, `group_at_least(k)`) are composed at the index level:
 ```rust
 // "present in >= 2 ingroup columns with count >= 3, absent from all outgroup"
 let presence = layers.map(|l| l.partial_group_presence_count(&ingroup, 3)?).add_all()?;
-let in_mask  = presence.geq(2);
+let in_mask  = presence.view().geq(2);   // IntSliceView method
 let out_sum  = layers.map(|l| l.partial_group_sum(&outgroup)?).add_all()?;
-let out_mask = out_sum.leq(0);
+let out_mask = out_sum.view().leq(0);
-let mask = in_mask & &out_mask;    // BitSliceMut::and — O(n/64)
+let mut mask_b = TempBitVecBuilder::new(n)?;
 mask_b.copy_from(in_mask);
 mask_b.and(out_mask);
 ```
-### mask_with (IntSliceMut)
+### mask_with
-Provided method on `IntSliceMut`. Zeros every slot where the corresponding mask bit is 0. Iterates only zero bits — O(n_zeros), O(1) when mask is all-ones.
+Direct method on `PersistentCompactIntVecBuilder` (and delegation via `TempCompactIntVecBuilder`). Zeros every slot where the corresponding mask bit is 0. Iterates only zero bits — O(n_zeros), O(1) when mask is all-ones.
 ```
 for (w_idx, word) in mask.words():
@@ -711,7 +520,7 @@ for (w_idx, word) in mask.words():
  while zeros != 0:
    bit = trailing_zeros(zeros)
    s = w_idx * 64 + bit
-    if primary[s] != 0: self.set(s, 0)   // clears overflow entry too
+    if primary[s] != 0: set(s, 0)   // clears overflow entry too
    zeros &= zeros − 1
 ```