feat: add memory vectors, slice traits, and column extraction methods

Introduce `MemoryBitVec` and `MemoryIntVec` for efficient in-memory storage with hybrid compression and overflow handling. Implement `BitSlice`, `BitSliceMut`, `IntSlice`, and `IntSliceMut` traits across persistent and memory-backed types to enable generic slice operations and bitwise/arithmetic overloads. Add `col_persist` and `col_as_memory` methods to `BitMatrix` and `IntMatrix` for efficient column extraction. Align with the new single-pass rebuild architecture by supporting fast kmer filtering and matrix rebuilding. Includes comprehensive tests and profiling instrumentation for the packing phase.
This commit is contained in:
Eric Coissac
2026-06-16 23:18:10 +02:00
parent b6fcbc545f
commit cde6457eea
15 changed files with 1120 additions and 70 deletions
+105
View File
@@ -0,0 +1,105 @@
# Rebuild / filter — column-first design
## Problem with the current two-pass design
`rebuild_partition` currently makes **two full passes** over source data:
**Pass 1** — read unitigs → MPHF lookup (source) → read row (108 values) → apply filter → push kmer into `GraphDeBruijn`, **discard row**.
**Pass 2** — read unitigs again → MPHF lookup again → read row again → for each passing kmer, look up slot in new MPHF → fill column builders.
Both passes do random access into the source matrix: for each kmer, the MPHF returns a slot, then we read 108 values scattered across 108 column positions. This is cache-hostile even with a packed matrix (`.pbmx`), because the matrix is column-major: consecutive row reads jump across the file.
## Memory budget
The `keep` bitvector costs **1 bit per slot**. With 256 partitions and realistic kmer counts, each partition holds at most a few tens of millions of slots → a few MB per bitvector. Even in the absolute worst case (800 M slots), it stays under 100 MB. This is negligible.
The `slot_map` option (Option B, 816 bytes per slot) is heavier but still bounded: at 15 M slots and 8 bytes, that is 120 MB per partition, acceptable for a single worker.
## Key observation
**The filter operates on column values, not on kmers.** A filter like `--max-outgroup-count 0` only needs to know, for each slot, whether any outgroup column is non-zero. It does not need to know which kmer occupies that slot.
This means filtering can be done as a **sequential column scan** that produces a `keep: BitVec[n_slots]` — no MPHF lookups, no kmer knowledge, perfectly cache-friendly.
## Proposed single-scan design
### Step 1 — column scan → `keep` bitvector
```
for each column c in source matrix:
read column c sequentially (one mmap range)
update keep[slot] according to filter contribution of column c
```
For `GroupQuorumFilter` with ingroup/outgroup:
- ingroup columns: count presence per slot → `ingroup_count[slot]`
- outgroup columns: `keep[slot] &= (value[slot] == 0)` (early-exit possible)
Result: `keep: BitVec` of size `n_slots`, computed with purely sequential IO.
### Step 2 — unitig scan → kept kmers + new MPHF
```
for each kmer in unitig files:
old_slot = old_MPHF(kmer)
if keep[old_slot]:
push kmer into new GraphDeBruijn
record (old_slot, kmer) ← or just old_slot in order
```
Build new MPHF from `GraphDeBruijn` via `materialize_layer`.
### Step 3 — fill new matrix
Two sub-options:
**Option A — from recorded (old_slot, kmer) pairs:**
```
for each (old_slot, kmer) in recorded list:
new_slot = new_MPHF(kmer)
for each column c:
new_matrix[new_slot, c] = old_matrix[old_slot, c]
```
Memory cost: `n_kept × (8 + 8)` bytes for `(old_slot: usize, kmer: CanonicalKmer)`.
For species-specific filters, `n_kept` is small. For unfiltered rebuild, `n_kept = n_slots`.
**Option B — column-by-column copy using old→new slot mapping:**
Precompute `slot_map: Vec<Option<usize>>` of size `n_slots`:
- For each kmer in unitig file: `slot_map[old_MPHF(kmer)] = Some(new_MPHF(kmer))`
Then for each source column:
```
read source column sequentially
for each slot where slot_map[slot] = Some(new_slot):
write value to new column at new_slot
```
Memory cost: `n_slots × sizeof(usize)` for the slot map (one usize per source slot).
IO pattern: sequential read of each source column → random write into new column builders.
Option B avoids storing kmer values and works uniformly regardless of filter selectivity.
## Comparison
| | Current | Proposed |
|---|---|---|
| Disk reads | 2× unitigs + 2× random matrix | 1× columns (sequential) + 1× unitigs |
| MPHF lookups (source) | 2× N_kmers | 1× N_kept (step 2) or 0 (option B, col scan only) |
| Cache behavior | poor (random row access) | good (sequential column scan) |
| Extra memory | none | slot_map (option B) or (old_slot, kmer) list (option A) |
## Files to modify
- `src/obikpartitionner/src/rebuild_layer.rs``rebuild_partition` and `iter_src_layers`
- Possibly `src/obicompactvec/` — add column iterator API if not already present
- `src/obilayeredmap/` — check if per-column sequential access is exposed on `SrcLayerData`
## Open questions
- Does `SrcLayerData` expose per-column sequential iteration, or only `lookup(kmer, n_genomes)` random access?
- For option B: are new column builders writable in random-slot order (i.e. `set_val(slot, value)` without sequential constraint)?
- For `GroupQuorumFilter` specifically: can the filter be decomposed into independent per-column contributions, or does it need the full row?