docmd/implementation/obicompactvec.md

# obicompactvec — Complete Reference

## Module structure

```
src/obicompactvec/src/
  lib.rs            public re-exports
  traits.rs         BitSlice, BitSliceMut, IntSlice, IntSliceMut + conversion traits
  bitvec.rs         PersistentBitVec, PersistentBitVecBuilder, BitIter
  memoryvec.rs      MemoryBitVec
  reader.rs         PersistentCompactIntVec (read-only)
  builder.rs        PersistentCompactIntVecBuilder (read-write)
  memoryintvec.rs   MemoryIntVec
  bitmatrix.rs      PersistentBitMatrix, PersistentBitMatrixBuilder
  intmatrix.rs      PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder
  format.rs         file format constants, encode/decode helpers
  layer_meta.rs     LayerMeta (column metadata)
  meta.rs           matrix metadata
```

```mermaid
graph TD
    traits --> memoryvec
    traits --> memoryintvec
    bitvec --> memoryvec
    bitvec --> bitmatrix
    format --> reader
    format --> builder
    reader --> intmatrix
    builder --> intmatrix
    builder --> memoryintvec
    memoryvec --> traits
    memoryintvec --> traits
    layer_meta --> bitmatrix
    layer_meta --> intmatrix
    meta --> bitmatrix
    meta --> intmatrix
```

---

## Compact int encoding

All integer vectors use the same two-tier encoding regardless of storage backend.

**Primary array** — one `u8` per slot:

- Values **0–254** are stored directly. No overhead.
- Value **255 is a sentinel**: the slot's actual value is ≥ 255 and lives in the overflow store.

**Overflow store** — maps slot index to a `u32` value ≥ 255:

- In `MemoryIntVec` and `PersistentCompactIntVecBuilder`: a `HashMap<usize, u32>` in RAM.
- In `PersistentCompactIntVec` (reader): a sorted `[(slot: u64, value: u32)]` array in the mmap, with a sparse L1-resident index for binary search.

```mermaid
flowchart LR
    slot --> P["primary[slot]: u8"]
    P -->|"< 255"| V["value = byte (0–254)"]
    P -->|"= 255 sentinel"| OV["overflow store"]
    OV -->|"MemoryIntVec / Builder"| HM["HashMap&lt;usize, u32&gt;\nin RAM"]
    OV -->|"PersistentCompactIntVec"| SA["sorted [(slot,value)] in mmap\n+ sparse L1 index"]
```

**Key property — sentinel 255 = +∞ on `u8`:**

This is exploited throughout the binary operations. On a `u8` comparison, 255 behaves as positive infinity:
- `min(a, 255) = a` for all `a ≤ 254` → correct when only one side is overflow
- `max(a, 255) = 255` → correct sentinel when either side is overflow
- Only the **both-overflow** case requires reading actual values from the overflow store.

In practice, k (overflow count) ≪ n (total slots). Observed genomic data: ~0.07% of kmer slots are in overflow.

---

## Trait hierarchy

```mermaid
classDiagram
    class BitSlice {
        <<trait>>
        +len() usize
        +words() &[u64]
        +get(slot) bool
        +count_ones() u64
        +count_zeros() u64
        +partial_jaccard_dist(other) (u64,u64)
        +jaccard_dist(other) f64
        +hamming_dist(other) u64
    }
    class BitSliceMut {
        <<trait>>
        +words_mut() &mut [u64]
        +set(slot, value)
        +copy_from(src)
        +and(other)
        +or(other)
        +xor(other)
        +not()
    }
    class IntSlice {
        <<trait>>
        +len() usize
        +get(slot) u32
        +primary_bytes() &[u8]
        +overflow_entries() Iterator
        +iter() Iterator
        +sum() u64
        +count_nonzero() u64
        +cmp_scalar(pred) MemoryBitVec
        +lt/leq/gt/geq(t) MemoryBitVec
    }
    class IntSliceMut {
        <<trait>>
        +set(slot, value)
        +primary_bytes_mut() &mut [u8]
        +clear_overflow()
        +inc/dec/add_at(slot)
        +copy_from(src)
        +min/max/add/diff(other)
        +count_bits(bits)
    }
    class IntToBit {
        <<trait blanket>>
        +to_bitvec(threshold) MemoryBitVec
        +to_presence() MemoryBitVec
    }
    class BitToInt {
        <<trait blanket>>
        +to_intvec() MemoryIntVec
    }
    BitSliceMut --|> BitSlice : extends
    IntSliceMut --|> IntSlice : extends
    IntToBit --|> IntSlice : blanket T:IntSlice
    BitToInt --|> BitSlice : blanket T:BitSlice
```

### BitSlice (read-only)

Required: `len()`, `words() -> &[u64]`.

Bit `i` is at `words()[i >> 6]` bit `i & 63` (LSB-first). Padding bits in the last word are always zero — this invariant must be maintained by all implementors.

| Provided method | Implementation | Cost |
|---|---|---|
| `is_empty()` | `len() == 0` | O(1) |
| `get(slot)` | word extract | O(1) |
| `count_ones()` | POPCNT per word | O(n/64) |
| `count_zeros()` | `n − count_ones()` | O(n/64) |
| `partial_jaccard_dist(other)` | `(a&b).popcount`, `(a\|b).popcount` per word | O(n/64) |
| `jaccard_dist(other)` | from partial | O(n/64) |
| `hamming_dist(other)` | `(a^b).popcount` per word | O(n/64) |

### BitSliceMut: BitSlice (mutable)

Required: `words_mut() -> &mut [u64]`.

All bulk operations work at the word level (64 bits/iteration). The compiler auto-vectorizes these loops to AVX2/AVX-512. The zero-padding invariant is maintained: `not()` re-masks the last word after flipping.

| Provided method | Implementation | Cost |
|---|---|---|
| `set(slot, value)` | OR / AND-NOT on one word | O(1) |
| `copy_from(src)` | `copy_from_slice` = memcpy | O(n/64) |
| `and(other)` | `w &= o` per word | O(n/64) |
| `or(other)` | `w \|= o` per word | O(n/64) |
| `xor(other)` | `w ^= o` per word | O(n/64) |
| `not()` | `w ^= u64::MAX` per word, then mask last | O(n/64) |

**No overflow complexity here.** The packed `u64` representation is already the natural unit for SIMD operations. No sentinel, no HashMap — just bitwise word ops.

---

### IntSlice (read-only)

Required:
- `len() -> usize`
- `get(slot) -> u32` — handles sentinel transparently (binary search into overflow for persistent, HashMap for memory)
- `primary_bytes() -> &[u8]` — raw primary array including 255 sentinels
- `overflow_entries() -> impl Iterator<Item = (usize, u32)>` — (slot, true_value) pairs for all overflow slots

| Provided method | Default implementation | Note |
|---|---|---|
| `is_empty()` | `len() == 0` | |
| `iter()` | `(0..n).map(\|i\| self.get(i))` | Overridden in all concrete types |
| `sum()` | `iter().map(\|v\| v as u64).sum()` | Overridden in concrete types |
| `count_nonzero()` | `iter().filter(\|v\| *v > 0).count()` | Overridden in concrete types |
| `lt(t)` | `cmp_scalar(\|v\| v < t)` | |
| `leq(t)` | `cmp_scalar(\|v\| v <= t)` | |
| `gt(t)` | `cmp_scalar(\|v\| v > t)` | |
| `geq(t)` | `cmp_scalar(\|v\| v >= t)` | |
| `cmp_scalar(pred)` | two-pass (see below) | |

**`cmp_scalar` algorithm — two passes:**

```
Pass 1 — byte scan, O(n):
  for s in 0..n:
    b = primary[s]
    if b < 255 AND pred(b as u32):
      set bit s in result word

Pass 2 — overflow fixup, O(k):
  for (s, val) in overflow_entries():
    if pred(val): set bit s in result word
```

Pass 1 reads only the primary byte array — no HashMap access. For simple predicates (`geq`, `lt`, etc.) the compiler inlines `pred` and can auto-vectorize the byte comparison loop. Pass 2 handles the O(k) overflow slots that were left as 0 in pass 1.

Previous implementation: `pred(self.get(s))` for every slot → O(n log k) due to binary search in overflow. New: O(n) + O(k).

---

### IntSliceMut: IntSlice (mutable)

Required:
- `set(slot, value: u32)` — writes primary byte (or 255 + overflow entry if value ≥ 255); removes stale overflow entry if value drops below 255
- `primary_bytes_mut() -> &mut [u8]` — direct mutable access to the primary array
- `clear_overflow()` — empties the entire overflow store

The required methods expose the encoding internals. All provided methods are implemented in terms of these three + the `IntSlice` required methods.

| Provided method | Hot path | Overflow case | Cost |
|---|---|---|---|
| `inc(slot)` | `get` + `set` | — | O(1) or O(log k) |
| `dec(slot)` | `get` + `set` (saturating) | — | O(1) or O(log k) |
| `add_at(slot, delta)` | `get` + `set` (saturating) | — | O(1) or O(log k) |
| `copy_from(src)` | `copy_from_slice` + `clear_overflow` + replay overflows | — | O(n) + O(k) |
| `min(other)` | byte-level min, O(n) | both-overflow fixup, O(k) | O(n) |
| `max(other)` | byte-level max, O(n) | pre-pass on other's overflows, O(k) | O(n) |
| `add(other)` | byte add when both < 255, O(n) | `get` + `+` when either = 255 | O(n) |
| `diff(other)` | byte saturating_sub when self < 255, O(n) | `get` + `saturating_sub` when self = 255 | O(n) |
| `count_bits(bits)` | iterate set bits via word scan | — | O(n_ones) |
| `cmp_scalar` | inherited from IntSlice | — | O(n) + O(k) |

**`min` algorithm:**

Exploits 255 = +∞: `u8::min(a, 255) = a` and `u8::min(255, b) = b`. Only the case where both sides are ≥ 255 needs actual overflow values.

```mermaid
flowchart TD
    A["min(self, other)"] --> B["snapshot self_ov: Vec&lt;(slot,val)&gt;\nsnapshot other_ov: HashMap&lt;slot,val&gt;"]
    B --> C["clear_overflow()"]
    C --> D["Pass 1 — byte min, SIMD-vectorizable\nprimary[s] = min(self[s], other[s])  ∀s"]
    D --> E["Pass 2 — both-overflow fixup\nfor (slot, self_val) in self_ov"]
    E --> F{"slot ∈ other_ov?"}
    F -->|yes| G["set(slot, min(self_val, other_ov[slot]))"]
    F -->|no| H["byte pass wrote other.primary &lt; 255\nclear_overflow removed stale entry\nno action"]
    G --> I[done]
    H --> I
```

Overflow entries where only self was overflow are correctly handled: after `clear_overflow` + byte pass, `self.primary[slot] = min(255, other.primary[slot]) = other.primary[slot]` (which is < 255). No overflow entry — correct.

**`max` algorithm:**

Exploits 255 = +∞: `u8::max(a, 255) = 255` → any slot where either side is overflow will have sentinel 255 in the primary after the byte pass. The byte pass cannot distinguish "self had overflow and other did not" from "self was just written to 255 by the byte pass".

Solution: read and update self's original value at other's overflow slots *before* the byte pass overwrites them.

```mermaid
flowchart TD
    A["max(self, other)"] --> B["Pre-pass O(k_other)\nfor (slot, other_val) in other.overflow_entries()"]
    B --> C["self_val = self.get(slot)\nself.set(slot, max(self_val, other_val))"]
    C --> D["Pass 1 — byte max, SIMD-vectorizable\nprimary[s] = max(self[s], other[s])  ∀s"]
    D --> E["Overflow slots: max(255,255)=255\nprimary unchanged\noverflow entry from pre-pass preserved"]
    E --> F[done]
```

After the pre-pass, self.primary[slot] = 255 for all slots in other's overflow. The byte pass leaves those 255s intact. Self's own overflow slots not in other's overflow are also 255 in primary — byte max(255, b < 255) = 255, unchanged. Correct in all cases.

**`add` algorithm:**

No sentinel property useful for add: any pair (sb, ob) with sb + ob ≥ 255 creates a new overflow entry, even when neither input was overflow. Cannot simplify via byte arithmetic.

```
for s in 0..n:
  sb = self.primary[s]
  ob = other.primary[s]
  if sb < 255 AND ob < 255:      // hot path: no HashMap
    sum = sb as u32 + ob as u32
    if sum < 255: self.primary[s] = sum as u8   // direct byte write
    else:         self.set(s, sum)               // creates overflow if needed
  else:                           // at least one is overflow
    self.set(s, self.get(s) + other.get(s))
```

```mermaid
flowchart TD
    A["add(self, other)"] --> B{"sb &lt; 255\nAND ob &lt; 255"}
    B -->|"yes — hot path\nno HashMap"| C{"sb + ob &lt; 255"}
    C -->|yes| D["primary[s] = sum as u8\nsingle byte write"]
    C -->|no| E["set(s, sum)\ncreates overflow entry"]
    B -->|"no — ≥1 side is overflow"| F["self_val = self.get(s)\nother_val = other.get(s)\nset(s, self_val + other_val)"]
    D --> Z[next slot]
    E --> Z
    F --> Z
```

The `+` on `u32` values is exact (no `saturating_add`). Overflow at u32 level panics in debug — not a real risk for kmer counts. The hot path (both < 255, sum < 255) is a single byte write with no HashMap access.

**`diff` (saturating sub) algorithm:**

`saturating_sub(a, b) = a − min(a, b) = max(0, a − b)`. Key insight: if self's primary byte < 255, the result is always < 255 (result ≤ a), so no new overflow entries are created and no overflow lookup is needed for self. Only self's overflow slots (primary = 255) need `get()`.

| sb | ob | result | get() needed |
|----|----|--------|-------------|
| < 255 | < 255 | `sb.saturating_sub(ob)` < 255 | none |
| < 255 | 255 | 0 (b ≥ 255 > a) | none |
| 255 | < 255 | `self.get(s) − ob` | self only |
| 255 | 255 | `self.get(s) − other.get(s)` | both |

```mermaid
flowchart TD
    A["diff(self, other)"] --> B{"sb &lt; 255\nself not overflow"}
    B -->|"yes — hot path O(n)"| C{"ob &lt; 255"}
    C -->|yes| D["primary[s] = sb.saturating_sub(ob)\nbyte write, no HashMap"]
    C -->|"no: b ≥ 255 > a"| E["primary[s] = 0"]
    B -->|"no — cold path O(k_self)"| F["self_val = self.get(s)"]
    F --> G{"ob &lt; 255"}
    G -->|yes| H["other_val = ob as u32"]
    G -->|no| I["other_val = other.get(s)"]
    H --> J["set(s, self_val.saturating_sub(other_val))"]
    I --> J
    D --> Z[next slot]
    E --> Z
    J --> Z
```

Overflow entries that drop below 255 (case sb=255, result < 255) are removed by `set()`. Overflow entries that remain ≥ 255 are updated. Correct in all four cases.

**`count_bits` algorithm:**

Increments self at each slot where the corresponding bit in `bits` is set. Iterates `bits.words()` and skips zero words entirely — O(n_ones) rather than O(n).

```
for (w_idx, word) in bits.words():
  if word == 0: continue
  base = w_idx * 64
  while word != 0:
    bit = trailing_zeros(word)
    self.inc(base + bit)
    word &= word − 1        // clear lowest set bit
```

---

## Concrete types

```mermaid
classDiagram
    class MemoryBitVec {
        -words: Vec~u64~
        -n: usize
        +iter() BitIter
        +ones(n) Self
        +persist(path) Builder
    }
    class MemoryIntVec {
        -primary: Vec~u8~
        -overflow: HashMap~usize,u32~
        -n: usize
        +iter() MemoryIntIter
        +filled(n, value) Self
        +persist(path) Builder
    }
    class PersistentBitVec {
        -mmap: Mmap
        -n: usize
        +iter() BitIter
        +count_ones() u64
    }
    class PersistentBitVecBuilder {
        -mmap: MmapMut
        -n: usize
        +close()
        +build_from(src, path)
        +build_from_counts(src, t, path)
    }
    class PersistentCompactIntVec {
        -mmap: Mmap
        -n usize
        -n_overflow usize
        -step usize
        -index: Vec~(usize,usize)~
        +iter() Iter
        +get(slot) u32
        +sum() u64
    }
    class PersistentCompactIntVecBuilder {
        -mmap: MmapMut
        -n: usize
        -overflow: HashMap~usize,u32~
        +set(slot, value)
        +close()
        +build_from(src, path)
    }

    MemoryBitVec ..|> BitSlice
    MemoryBitVec ..|> BitSliceMut
    PersistentBitVec ..|> BitSlice
    PersistentBitVecBuilder ..|> BitSlice
    PersistentBitVecBuilder ..|> BitSliceMut
    MemoryIntVec ..|> IntSlice
    MemoryIntVec ..|> IntSliceMut
    PersistentCompactIntVec ..|> IntSlice
    PersistentCompactIntVecBuilder ..|> IntSlice
    PersistentCompactIntVecBuilder ..|> IntSliceMut

    PersistentBitVecBuilder --> PersistentBitVec : close() then open()
    PersistentCompactIntVecBuilder --> PersistentCompactIntVec : close() then open()
```

### Memory types

**`MemoryBitVec`**

```rust
struct MemoryBitVec { words: Vec<u64>, n: usize }
```

Implements `BitSlice` + `BitSliceMut`. Owns its word array. Used as the result type of `cmp_scalar` / filter operations and as an intermediate for bit-level computations.

Std ops: `BitAnd`, `BitOr`, `BitXor`, `Not` (owned and borrowed), `BitAndAssign`, `BitOrAssign`, `BitXorAssign` — all delegate to `BitSliceMut` methods.

`iter()` returns a `BitIter<'_>` (word-level, see below).

**`MemoryIntVec`**

```rust
struct MemoryIntVec {
    primary:  Vec<u8>,
    overflow: HashMap<usize, u32>,
    n:        usize,
}
```

Implements `IntSlice` + `IntSliceMut`. Overrides: `iter()` → inherent `iter()` (merge-scan), `sum()`, `count_nonzero()`.

`IntSlice` required impls: `primary_bytes()` → `&self.primary`; `overflow_entries()` → `self.overflow.iter().map(...)`.

`IntSliceMut` required impls: `set()` writes to `self.primary[slot]` and inserts/removes from `self.overflow`; `primary_bytes_mut()` → `&mut self.primary`; `clear_overflow()` → `self.overflow.clear()`.

Std ops: `Add<&B>`, `Sub<&B>` (owned and borrowed), `AddAssign<&B>`, `SubAssign<&B>` — delegate to `IntSliceMut::add` / `diff`.

`From<&S: IntSlice>`: copies primary bytes + overflow entries. O(n) + O(k).

---

### Persistent types

**`PersistentBitVec` / `PersistentBitVecBuilder`**

See `persistent_bit_vec.md`. `PersistentBitVec` is read-only (implements `BitSlice`). `PersistentBitVecBuilder` is read-write (implements `BitSlice` + `BitSliceMut`).

`BitIter<'a>` — shared iterator type for both `MemoryBitVec` and `PersistentBitVec`:

```rust
pub struct BitIter<'a> { pub(crate) words: &'a [u64], pub(crate) slot: usize, pub(crate) n: usize }
```

Word-level scan: `(words[slot >> 6] >> (slot & 63)) & 1 != 0`. One word serves 64 iterations. `pub type MemoryBitIter<'a> = BitIter<'a>` preserves the public API name.

**`PersistentCompactIntVec` / `PersistentCompactIntVecBuilder`**

See `persistent_compact_int_vec.md` for file format and lifecycle.

`PersistentCompactIntVec` implements `IntSlice`. Overrides: `iter()` → inherent merge-scan `Iter`; `sum()`; `count_nonzero()`. `overflow_entries()` returns a sequential scan `(0..n_overflow).map(|i| (data_slot(i), data_value(i)))` — no binary search since entries are stored sorted.

`PersistentCompactIntVecBuilder` implements `IntSlice` + `IntSliceMut`. `iter()` is NOT overridden (default `get`-per-slot) because the overflow `HashMap` is unsorted. `sum()` and `count_nonzero()` are overridden using `byte_sum` / `byte_count_nonzero` on the mmap primary slice — avoids per-slot overhead.

**Override rationale:** the default `iter()`, `sum()`, `count_nonzero()` on `IntSlice` call `self.get(s)` per slot, which is O(log k) binary search for `PersistentCompactIntVec`. Overrides provide O(n + k) merge-scan or O(n) byte scan instead.

---

### IntSlice implementors — override summary

| Type | `iter()` | `sum()` | `count_nonzero()` |
|------|----------|---------|-------------------|
| `MemoryIntVec` | inherent merge-scan ✓ | `byte_sum` ✓ | `byte_count_nonzero` ✓ |
| `PersistentCompactIntVecBuilder` | default (get-per-slot) | `byte_sum` on mmap ✓ | `byte_count_nonzero` on mmap ✓ |
| `PersistentCompactIntVec` | inherent merge-scan Iter ✓ | inherent `sum()` ✓ | inherent `count_nonzero()` ✓ |
| `PackedIntCol<'a>` | inherent PackedIntColIter ✓ | byte_sum ✓ | byte_count_nonzero ✓ |

`PackedIntCol` is used internally by `PersistentCompactIntMatrix` (packed format) for column views.

---

## Matrix types

Four matrix types, two encodings × two formats:

| | Columnar format | Packed format |
|---|---|---|
| **Bit** | `PersistentBitMatrix` | — |
| **Int** | `PersistentCompactIntMatrix` (columnar) | `PersistentCompactIntMatrix` (packed) |

`PersistentCompactIntMatrix` is an enum behind a transparent API — the caller does not see whether the on-disk format is columnar (one `.pciv` per column) or packed (one `.pcmx` file interleaving all columns). `col(c)` and `col_slice(c)` return column views that implement `IntSlice`.

`pack_compact_int_matrix` and `pack_bit_matrix` convert a columnar matrix to packed format.

For details see `persistent_compact_int_vec.md` and `persistent_bit_vec.md`.

---

## Conversion traits

Four blanket-impl traits on top of `BitSlice` / `IntSlice`:

**`IntToBit: IntSlice`**
- `to_bitvec(threshold: u32) -> MemoryBitVec` — bit set iff value ≥ threshold (delegates to `geq`)
- `to_presence() -> MemoryBitVec` — bit set iff value ≥ 1 (delegates to `geq(1)`)

**`BitToInt: BitSlice`**
- `to_intvec() -> MemoryIntVec` — expands each bit to a `u8` (0 or 1) in a new primary array
- Uses a `static EXPAND_BYTE: [[u8; 8]; 256]` lookup table — 8 bits expanded per byte, word-level outer loop

Both `IntToBit` and `BitToInt` are implemented for all `T: IntSlice` / `T: BitSlice` via blanket impls.

---

## Aggregation traits (matrix level)

### ColumnWeights

```rust
trait ColumnWeights: Send + Sync {
    fn col_weights(&self) -> Array1<u64>;         // sum per column
    fn partial_kmer_counts(&self) -> Array1<u64>; // default = col_weights()
}
```

`partial_kmer_counts` is overridden for count matrices to return `count_nonzero` per column (distinct kmers) rather than total count.

### CountPartials

Abstract required methods: `partial_bray`, `partial_euclidean`, `partial_threshold_jaccard`, `partial_relfreq_bray`, `partial_relfreq_euclidean`, `partial_hellinger`.

**Additivity rule:** self-contained partials (`partial_bray`, `partial_euclidean`, `partial_threshold_jaccard`) can be element-wise summed across all `(partition, layer)` pairs before applying the finalisation. Normalised partials (`partial_relfreq_*`, `partial_hellinger`) require the **global** `col_weights` (accumulated across all layers and all partitions) as parameter — not per-layer or per-partition weights.

**`partial_threshold_jaccard` returns `(inter, union)`**, not a single matrix, because `union[i,j]` depends on both columns simultaneously and cannot be reconstructed from per-column statistics.

Provided finalisations (default implementations):

| Finalisation | Formula |
|---|---|
| `bray_dist_matrix()` | `1 − 2·partial_bray[i,j] / (w[i] + w[j])` |
| `euclidean_dist_matrix()` | `√partial_euclidean[i,j]` |
| `threshold_jaccard_dist_matrix(t)` | `1 − inter[i,j] / union[i,j]` |
| `relfreq_bray_dist_matrix()` | `1 − partial_relfreq_bray[i,j]` (two-pass: col_weights then partial) |
| `relfreq_euclidean_dist_matrix()` | `√partial_relfreq_euclidean[i,j]` |
| `hellinger_dist_matrix()` | `√partial_hellinger[i,j] / √2` |
| `hellinger_euclidean_dist_matrix()` | `√partial_hellinger[i,j]` |

### BitPartials

Required: `partial_jaccard() -> (Array2<u64>, Array2<u64>)` (inter, union), `partial_hamming() -> Array2<u64>`. Both additive across layers and partitions.

---

## Planned — Filter / Select API

### Composition across layers and partitions

```mermaid
graph TD
    subgraph Index
        CG["ColGroup\nVec&lt;usize&gt; — valid everywhere"]
        ACC["MemoryIntVec\nglobal accumulator"]
        PRED["geq / leq / and / or\n→ MemoryBitVec mask"]
    end

    subgraph "Layer 1"
        subgraph "Partition A  kmers 0..k/2"
            MA["Matrix A\npartial_group_presence_count"]
        end
        subgraph "Partition B  kmers k/2..k"
            MB["Matrix B\npartial_group_presence_count"]
        end
        CONCAT1["concat → MemoryIntVec\[0..k\]"]
    end

    subgraph "Layer 2"
        CONCAT2["concat → MemoryIntVec\[0..k\]"]
    end

    CG -->|"same indices"| MA
    CG -->|"same indices"| MB
    MA -->|"kmer range A"| CONCAT1
    MB -->|"kmer range B"| CONCAT1
    CONCAT1 -->|"IntSliceMut::add"| ACC
    CONCAT2 -->|"IntSliceMut::add"| ACC
    ACC --> PRED
```

### ColGroup

```rust
struct ColGroup { name: String, indices: Vec<usize> }
```

Defined **once at the index level** from column metadata. Valid in all matrices of all layers and partitions because column structure is identical across the entire hierarchy (same samples/genomes everywhere; only rows = kmer slots are partitioned).

`ColGroup` is passed by reference unchanged to any matrix — no index translation.

### Composition axis

- **Across partitions**: kmer space is partitioned → partial results are **concatenated** (disjoint kmer ranges).
- **Across layers**: same kmer space, different counts → partial results are **aggregated** (add, OR, etc.).

### Additivity rules

```mermaid
flowchart LR
    subgraph "Matrix level — returns MemoryIntVec"
        PGP["partial_group_presence_count\npartial_group_sum\npartial_group_any → MemoryBitVec"]
    end
    subgraph "Index level — applies predicate"
        GA["group_at_least(k)\n= accumulate.geq(k)"]
        GALL["group_all\n= accumulate.geq(n_cols)"]
        GANY["group_any\n= OR of partial_group_any"]
    end
    PGP -->|"concat across partitions\nadd across layers"| GA
    PGP --> GALL
    PGP --> GANY
```

Non-additive predicates (`group_all`, `group_at_least`) do **not** exist at matrix level — they require the global accumulated count.

### MatrixGroupOps (planned trait)

Group operations live on the matrix and expose only **additive intermediates** (`MemoryIntVec`). Predicates (final thresholds → `MemoryBitVec`) are applied at the index level after accumulation.

```rust
trait MatrixGroupOps {
    // How many columns in group have value >= threshold, per kmer slot
    fn partial_group_presence_count(&self, g: &ColGroup, threshold: u32) -> MemoryIntVec;

    // Sum of values across group columns, per kmer slot
    fn partial_group_sum(&self, g: &ColGroup) -> MemoryIntVec;

    // Kmer present (value >= threshold) in at least one column of group
    fn partial_group_any(&self, g: &ColGroup, threshold: u32) -> MemoryBitVec;
}
```

Non-additive predicates (`group_all`, `group_at_least(k)`) are **not** on the matrix — they are composed at the index level from the additive intermediates:

```
// "present in >= 2 ingroup columns with count >= 3, absent from all outgroup"
let presence = layers.map(|l| l.partial_group_presence_count(&ingroup, 3)).sum();
let in_mask  = presence.geq(2);                                     // MemoryBitVec

let out_sum  = layers.map(|l| l.partial_group_sum(&outgroup)).sum();
let out_mask = out_sum.leq(0);                                      // MemoryBitVec

let mask = in_mask.and(&out_mask);    // BitSliceMut::and — O(n/64)
```

### mask_with (planned IntSliceMut method)

Apply a bit mask to a count vector: zero slots where the mask bit is 0. Iterates only zero bits — O(n_zeros), O(1) when mask is all-ones.

```
for (w_idx, word) in mask.words():
  if word == u64::MAX: continue
  zeros = !word
  while zeros != 0:
    bit = trailing_zeros(zeros)
    s = w_idx * 64 + bit
    self.set(s, 0)
    zeros &= zeros − 1
```

This is the terminal operation for both Filter (zero non-selected kmer slots in a count matrix) and Select (positional selection without MPHF).