feat: add memory vectors, slice traits, and column extraction methods

Introduce `MemoryBitVec` and `MemoryIntVec` for efficient in-memory storage with hybrid compression and overflow handling. Implement `BitSlice`, `BitSliceMut`, `IntSlice`, and `IntSliceMut` traits across persistent and memory-backed types to enable generic slice operations and bitwise/arithmetic overloads. Add `col_persist` and `col_as_memory` methods to `BitMatrix` and `IntMatrix` for efficient column extraction. Align with the new single-pass rebuild architecture by supporting fast kmer filtering and matrix rebuilding. Includes comprehensive tests and profiling instrumentation for the packing phase.
2026-06-16 23:18:10 +02:00
parent b6fcbc545f
commit cde6457eea
15 changed files with 1120 additions and 70 deletions
@@ -0,0 +1,105 @@
+# Rebuild / filter — column-first design
+
+## Problem with the current two-pass design
+
+`rebuild_partition` currently makes **two full passes** over source data:
+
+**Pass 1** — read unitigs → MPHF lookup (source) → read row (108 values) → apply filter → push kmer into `GraphDeBruijn`, **discard row**.
+
+**Pass 2** — read unitigs again → MPHF lookup again → read row again → for each passing kmer, look up slot in new MPHF → fill column builders.
+
+Both passes do random access into the source matrix: for each kmer, the MPHF returns a slot, then we read 108 values scattered across 108 column positions. This is cache-hostile even with a packed matrix (`.pbmx`), because the matrix is column-major: a single row read touches one position in each of the 108 columns, so every row is spread across the whole file.
+
+## Memory budget
+
+The `keep` bitvector costs **1 bit per slot**. With 256 partitions and realistic kmer counts, each partition holds at most a few tens of millions of slots → a few MB per bitvector. Even in the absolute worst case (800 M slots), it comes to 100 MB (800 M bits / 8). This is negligible.
+
+The `slot_map` option (Option B, 8–16 bytes per slot) is heavier but still bounded: at 15 M slots and 8 bytes, that is 120 MB per partition, acceptable for a single worker.
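
The figures above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the helper names `keep_bytes`, `slot_map_bytes`, and `option_usize_entry_bytes` are hypothetical and do not exist in the codebase.

```rust
use std::mem::size_of;

/// Bytes needed for a `keep` bitvector: one bit per slot, rounded up to whole bytes.
fn keep_bytes(n_slots: u64) -> u64 {
    (n_slots + 7) / 8
}

/// Bytes needed for a slot map that spends `bytes_per_entry` bytes on each source slot.
fn slot_map_bytes(n_slots: u64, bytes_per_entry: u64) -> u64 {
    n_slots * bytes_per_entry
}

/// Size of one `Option<usize>` entry on the current target.
/// `usize` has no spare bit pattern for `None`, so this is twice `size_of::<usize>()`.
fn option_usize_entry_bytes() -> usize {
    size_of::<Option<usize>>()
}
```

With these helpers, `keep_bytes(800_000_000)` gives 100,000,000 bytes and `slot_map_bytes(15_000_000, 8)` gives 120,000,000 bytes, matching the worst-case and per-partition numbers quoted above.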
+
+## Key observation
+
+**The filter operates on column values, not on kmers.** A filter like `--max-outgroup-count 0` only needs to know, for each slot, whether any outgroup column is non-zero. It does not need to know which kmer occupies that slot.
+
+This means filtering can be done as a **sequential column scan** that produces a `keep: BitVec[n_slots]` — no MPHF lookups, no kmer knowledge, perfectly cache-friendly.
+
+## Proposed single-scan design
+
+### Step 1 — column scan → `keep` bitvector
+
+```
+for each column c in source matrix:
+    read column c sequentially (one mmap range)
+    update keep[slot] according to filter contribution of column c
+```
+
+For `GroupQuorumFilter` with ingroup/outgroup:
+- ingroup columns: count presence per slot → `ingroup_count[slot]`
+- outgroup columns: `keep[slot] &= (value[slot] == 0)` (early-exit possible)
+
+Result: `keep: BitVec` of size `n_slots`, computed with purely sequential IO.
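
A minimal sketch of the column scan, under loud assumptions: columns are plain in-memory `Vec<u32>` rather than mmap ranges, `keep` is a `Vec<bool>` rather than a packed bitvector, and the `Role` enum and the `min_ingroup` quorum parameter are hypothetical stand-ins for whatever `GroupQuorumFilter` actually exposes.

```rust
/// Hypothetical column role for this sketch; not the project's actual type.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Role {
    Ingroup,
    Outgroup,
}

/// Scan columns one at a time and return one keep flag per slot.
/// `columns[c][slot]` holds the value of column `c` at `slot`.
fn column_scan_keep(columns: &[Vec<u32>], roles: &[Role], min_ingroup: u32) -> Vec<bool> {
    let n_slots = columns.first().map_or(0, |c| c.len());
    let mut keep = vec![true; n_slots];
    let mut ingroup_count = vec![0u32; n_slots];

    // Each column is traversed front to back, so memory access is sequential.
    for (col, role) in columns.iter().zip(roles) {
        for (slot, &value) in col.iter().enumerate() {
            match role {
                // Ingroup columns only contribute a presence count.
                Role::Ingroup => ingroup_count[slot] += u32::from(value != 0),
                // Any non-zero outgroup value rejects the slot.
                Role::Outgroup => keep[slot] &= value == 0,
            }
        }
    }

    // Apply the (assumed) ingroup quorum once all columns have been seen.
    for (k, &n) in keep.iter_mut().zip(&ingroup_count) {
        *k &= n >= min_ingroup;
    }
    keep
}
```

No kmer and no MPHF appears anywhere in this function, which is the point of the key observation: the scan depends only on slot indices and column values.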
+
+### Step 2 — unitig scan → kept kmers + new MPHF
+
+```
+for each kmer in unitig files:
+    old_slot = old_MPHF(kmer)
+    if keep[old_slot]:
+        push kmer into new GraphDeBruijn
+        record (old_slot, kmer)   ← or just old_slot in order
+```
+
+Build new MPHF from `GraphDeBruijn` via `materialize_layer`.
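
The selection loop of step 2 could look roughly like the following. This is a sketch, not the project's code: kmers are modelled as `u64`, the source MPHF is replaced by a `HashMap<u64, usize>` (a real MPHF stores no keys), and pushing into `GraphDeBruijn` is replaced by collecting into a `Vec`. The function name `select_kept` is hypothetical.

```rust
use std::collections::HashMap;

/// Return `(old_slot, kmer)` for every kmer whose source slot is flagged in `keep`,
/// in the order the kmers are read.
/// `old_mphf` is a HashMap standing in for the source MPHF.
fn select_kept(
    kmers: &[u64],
    old_mphf: &HashMap<u64, usize>,
    keep: &[bool],
) -> Vec<(usize, u64)> {
    let mut kept = Vec::new();
    for &kmer in kmers {
        // An MPHF maps every key of its set to a slot; indexing mirrors that
        // assumption and panics if the kmer is unknown.
        let old_slot = old_mphf[&kmer];
        if keep[old_slot] {
            kept.push((old_slot, kmer));
        }
    }
    kept
}
```

Note that the source lookup runs once for every kmer read from the unitig files, not only for the kept ones, because `keep` is indexed by the slot the lookup returns.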
+
+### Step 3 — fill new matrix
+
+Two sub-options:
+
+**Option A — from recorded (old_slot, kmer) pairs:**
+
+```
+for each (old_slot, kmer) in recorded list:
+    new_slot = new_MPHF(kmer)
+    for each column c:
+        new_matrix[new_slot, c] = old_matrix[old_slot, c]
+```
+
+Memory cost: `n_kept × (8 + 8)` bytes for `(old_slot: usize, kmer: CanonicalKmer)`.
+For species-specific filters, `n_kept` is small. For unfiltered rebuild, `n_kept = n_slots`.
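
Option A can be sketched as below, with the same simplifying assumptions as elsewhere in this note: matrices are column-major `Vec<Vec<u32>>`, kmers are `u64`, and a `HashMap` stands in for the new MPHF. The name `fill_from_pairs` is hypothetical.

```rust
use std::collections::HashMap;

/// Build the new column-major matrix from recorded `(old_slot, kmer)` pairs.
/// `old[c][slot]` is the source value of column `c` at `slot`.
fn fill_from_pairs(
    old: &[Vec<u32>],
    recorded: &[(usize, u64)],
    new_mphf: &HashMap<u64, usize>,
) -> Vec<Vec<u32>> {
    // One new slot per kept kmer, one new column per source column.
    let mut new = vec![vec![0u32; recorded.len()]; old.len()];
    for &(old_slot, kmer) in recorded {
        let new_slot = new_mphf[&kmer];
        // Copy the whole row: one read and one write per column.
        for (new_col, old_col) in new.iter_mut().zip(old) {
            new_col[new_slot] = old_col[old_slot];
        }
    }
    new
}
```

The inner loop is still a row read across all source columns, so this variant keeps some of the random-access cost, but only for the kept kmers.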
+
+**Option B — column-by-column copy using old→new slot mapping:**
+
+Precompute `slot_map: Vec<Option<usize>>` of size `n_slots`:
+- For each kmer in unitig file: `slot_map[old_MPHF(kmer)] = Some(new_MPHF(kmer))`
+
+Then for each source column:
+```
+read source column sequentially
+for each slot where slot_map[slot] = Some(new_slot):
+    write value to new column at new_slot
+```
+
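+Memory cost: `n_slots × size_of::<Option<usize>>()` for the slot map, i.e. 16 bytes per source slot on 64-bit targets, because `Option<usize>` has no spare bit pattern to encode `None`. Encoding "no mapping" as a sentinel value in a plain `Vec<usize>` brings this down to `n_slots × sizeof(usize)` (8 bytes per slot).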
+IO pattern: sequential read of each source column → random write into new column builders.
+
+Option B avoids storing kmer values and works uniformly regardless of filter selectivity.
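
A sketch of the Option B copy, assuming column-major `Vec<Vec<u32>>` matrices in memory and a precomputed `slot_map`. The function name `copy_columns` and the explicit `n_new` parameter are hypothetical, and real column builders may not allow writes in arbitrary slot order (see the open questions).

```rust
/// Copy every source column into a new column of `n_new` slots,
/// moving each value to the slot given by `slot_map`.
/// Source slots mapped to `None` are dropped.
fn copy_columns(old: &[Vec<u32>], slot_map: &[Option<usize>], n_new: usize) -> Vec<Vec<u32>> {
    old.iter()
        .map(|src| {
            let mut dst = vec![0u32; n_new];
            // Sequential read of the source column, random write into the new one.
            for (&value, mapped) in src.iter().zip(slot_map) {
                if let Some(new_slot) = *mapped {
                    dst[new_slot] = value;
                }
            }
            dst
        })
        .collect()
}
```

Only one destination column is live at a time here, so the random writes stay within a single column-sized buffer.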
+
+## Comparison
+
+| | Current | Proposed |
+|---|---|---|
+| Disk reads | 2× unitigs + 2× random matrix | 1× columns (sequential) + 1× unitigs |
+| MPHF lookups (source) | 2× N_kmers | 1× N_kmers (step 2); 0 during the column scan (step 1) |
+| Cache behavior | poor (random row access) | good (sequential column scan) |
+| Extra memory | none | `keep` bitvector, plus slot_map (option B) or (old_slot, kmer) list (option A) |
+
+## Files to modify
+
+- `src/obikpartitionner/src/rebuild_layer.rs` — `rebuild_partition` and `iter_src_layers`
+- Possibly `src/obicompactvec/` — add column iterator API if not already present
+- `src/obilayeredmap/` — check if per-column sequential access is exposed on `SrcLayerData`
+
+## Open questions
+
+- Does `SrcLayerData` expose per-column sequential iteration, or only `lookup(kmer, n_genomes)` random access?
+- For option B: are new column builders writable in random-slot order (i.e. `set_val(slot, value)` without sequential constraint)?
+- For `GroupQuorumFilter` specifically: can the filter be decomposed into independent per-column contributions, or does it need the full row?