feat: add memory vectors, slice traits, and column extraction methods
Introduce `MemoryBitVec` and `MemoryIntVec` for efficient in-memory storage with hybrid compression and overflow handling. Implement `BitSlice`, `BitSliceMut`, `IntSlice`, and `IntSliceMut` traits across persistent and memory-backed types to enable generic slice operations and bitwise/arithmetic overloads. Add `col_persist` and `col_as_memory` methods to `BitMatrix` and `IntMatrix` for efficient column extraction. Align with the new single-pass rebuild architecture by supporting fast kmer filtering and matrix rebuilding. Includes comprehensive tests and profiling instrumentation for the packing phase.
@@ -0,0 +1,105 @@
# Rebuild / filter: column-first design

## Problem with the current two-pass design

`rebuild_partition` currently makes **two full passes** over the source data:

**Pass 1**: read unitigs → MPHF lookup (source) → read row (108 values) → apply filter → push kmer into `GraphDeBruijn`, **discard row**.

**Pass 2**: read unitigs again → MPHF lookup again → read row again → for each passing kmer, look up its slot in the new MPHF → fill column builders.

Both passes do random access into the source matrix: for each kmer, the MPHF returns a slot, and reading that row means fetching one value from each of the 108 columns. This is cache-hostile even with a packed matrix (`.pbmx`), because the matrix is column-major: the 108 values of a row lie in 108 separate regions of the file, so consecutive row reads jump across the whole file.
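To make the access pattern concrete, here is a minimal sketch of the offsets touched by a single row read. It assumes a dense column-major layout with fixed-width cells; the actual `.pbmx` layout is not described here and may differ, and `cell_offset` and `row_offsets` are hypothetical helper names.

```rust
/// Byte offset of cell (slot, col) in a dense column-major matrix.
/// Assumption: fixed-width cells stored one full column after another.
fn cell_offset(slot: usize, col: usize, n_slots: usize, cell_bytes: usize) -> usize {
    (col * n_slots + slot) * cell_bytes
}

/// Offsets touched when reading one full row (one cell per column).
/// Consecutive entries are `n_slots * cell_bytes` apart, so a row read
/// lands in a different region of the file for every column.
fn row_offsets(slot: usize, n_cols: usize, n_slots: usize, cell_bytes: usize) -> Vec<usize> {
    (0..n_cols)
        .map(|col| cell_offset(slot, col, n_slots, cell_bytes))
        .collect()
}
```

Under these assumptions, neighbouring slots of one column are adjacent in the file, while the cells of one row are separated by a full column length.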
## Memory budget

The `keep` bitvector costs **1 bit per slot**. With 256 partitions and realistic kmer counts, each partition holds at most a few tens of millions of slots, i.e. a few MB per bitvector. Even in the absolute worst case (800 M slots), it needs about 100 MB (800 M bits / 8). This is negligible.

The `slot_map` option (Option B, 8–16 bytes per slot) is heavier but still bounded: at 15 M slots and 8 bytes per slot, that is 120 MB per partition (240 MB at 16 bytes per slot), acceptable for a single worker.
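The arithmetic behind these figures can be checked with a short sketch. The helper names are hypothetical; the only facts used are 1 bit per slot for `keep` and a fixed number of bytes per slot for `slot_map`. The last helper reports why the per-slot cost ranges from 8 to 16 bytes on 64-bit targets: `Option<usize>` has no spare bit pattern for `None`, whereas `Option<NonZeroUsize>` does.

```rust
use std::mem::size_of;
use std::num::NonZeroUsize;

/// Bytes needed for a 1-bit-per-slot `keep` bitvector (rounded up to whole bytes).
fn keep_bytes(n_slots: u64) -> u64 {
    (n_slots + 7) / 8
}

/// Bytes needed for a slot map that stores `entry_bytes` per source slot.
fn slot_map_bytes(n_slots: u64, entry_bytes: u64) -> u64 {
    n_slots * entry_bytes
}

/// Per-entry size of two candidate slot-map encodings on the current target:
/// (plain `Option<usize>`, niche-optimised `Option<NonZeroUsize>`).
fn slot_map_entry_sizes() -> (usize, usize) {
    (size_of::<Option<usize>>(), size_of::<Option<NonZeroUsize>>())
}
```

With these helpers, 800 M slots give 100,000,000 bytes for `keep`, and 15 M slots at 8 bytes give 120,000,000 bytes for `slot_map`.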
## Key observation

**The filter operates on column values, not on kmers.** A filter like `--max-outgroup-count 0` only needs to know, for each slot, whether any outgroup column is non-zero. It does not need to know which kmer occupies that slot.

This means filtering can be done as a **sequential column scan** that produces a `keep: BitVec[n_slots]`: no MPHF lookups, no kmer knowledge, perfectly cache-friendly.
## Proposed single-scan design

### Step 1: column scan → `keep` bitvector

```
for each column c in source matrix:
    read column c sequentially (one mmap range)
    update keep[slot] according to filter contribution of column c
```

For `GroupQuorumFilter` with ingroup/outgroup:
- ingroup columns: count presence per slot → `ingroup_count[slot]`
- outgroup columns: `keep[slot] &= (value[slot] == 0)` (early-exit possible)

Result: `keep: BitVec` of size `n_slots`, computed with purely sequential IO.
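A minimal Rust sketch of this column scan follows. It is an illustration under stated assumptions, not the project's implementation: `QuorumParams` is a hypothetical stand-in for the real `GroupQuorumFilter` (whose fields are not shown here), the final "at least `min_ingroup` ingroup columns present" rule is assumed, `Vec<bool>` replaces the bitvector, and in-memory `Vec<u32>` columns replace the memory-mapped ones.

```rust
/// Hypothetical filter parameters (not the real `GroupQuorumFilter` fields).
struct QuorumParams {
    ingroup: Vec<usize>,  // column indices of ingroup genomes
    outgroup: Vec<usize>, // column indices of outgroup genomes
    min_ingroup: u32,     // assumed quorum: ingroup columns with a non-zero value
}

/// Build the keep mask with one sequential pass per column.
/// `columns[c][slot]` stands in for a memory-mapped source column;
/// every column is assumed to hold exactly `n_slots` values.
fn column_scan_keep(columns: &[Vec<u32>], n_slots: usize, p: &QuorumParams) -> Vec<bool> {
    let mut keep = vec![true; n_slots];
    let mut ingroup_count = vec![0u32; n_slots];

    // Outgroup columns: any non-zero value clears the slot.
    for &c in &p.outgroup {
        for (slot, &v) in columns[c].iter().enumerate() {
            keep[slot] &= v == 0;
        }
    }
    // Ingroup columns: count presence per slot.
    for &c in &p.ingroup {
        for (slot, &v) in columns[c].iter().enumerate() {
            ingroup_count[slot] += (v != 0) as u32;
        }
    }
    // Combine: a slot survives only if the ingroup quorum is reached.
    for slot in 0..n_slots {
        keep[slot] &= ingroup_count[slot] >= p.min_ingroup;
    }
    keep
}
```

Each column is read front to back exactly once, and no MPHF or kmer is involved. The price is one `u32` counter per slot while the ingroup columns are scanned.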
### Step 2: unitig scan → kept kmers + new MPHF

```
for each kmer in unitig files:
    old_slot = old_MPHF(kmer)
    if keep[old_slot]:
        push kmer into new GraphDeBruijn
        record (old_slot, kmer)   ← or just old_slot in order
```

Build the new MPHF from `GraphDeBruijn` via `materialize_layer`.
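The selection loop of this step can be sketched as below. This is an illustration only: kmers are represented as plain `u64` values, `old_mphf` is a closure standing in for the source MPHF lookup (its real signature is not given in this note), and the push into the new `GraphDeBruijn` is left as a comment because that API is not shown here.

```rust
/// Scan the kmers once and keep those whose old slot survived the column filter.
/// Returns the recorded (old_slot, kmer) pairs in scan order.
fn select_kept<F>(kmers: &[u64], old_mphf: F, keep: &[bool]) -> Vec<(usize, u64)>
where
    F: Fn(u64) -> usize, // hypothetical stand-in for the source MPHF
{
    let mut kept = Vec::new();
    for &kmer in kmers {
        let old_slot = old_mphf(kmer);
        if keep[old_slot] {
            // The real code would also push `kmer` into the new GraphDeBruijn here.
            kept.push((old_slot, kmer));
        }
    }
    kept
}
```

Only one source-MPHF lookup is made per kmer, and the source matrix is not touched at all during this scan.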
### Step 3: fill new matrix

Two sub-options:

**Option A: from recorded `(old_slot, kmer)` pairs**

```
for each (old_slot, kmer) in recorded list:
    new_slot = new_MPHF(kmer)
    for each column c:
        new_matrix[new_slot, c] = old_matrix[old_slot, c]
```

Memory cost: `n_kept × (8 + 8)` bytes for `(old_slot: usize, kmer: CanonicalKmer)`.
For species-specific filters, `n_kept` is small. For unfiltered rebuild, `n_kept = n_slots`.
IO pattern: one random row read from the old matrix per kept kmer.

**Option B: column-by-column copy using old→new slot mapping**

Precompute `slot_map: Vec<Option<usize>>` of size `n_slots`:
- For each kept kmer in the unitig files (the new MPHF is only defined on kept kmers): `slot_map[old_MPHF(kmer)] = Some(new_MPHF(kmer))`

Then for each source column:
```
read source column sequentially
for each slot where slot_map[slot] = Some(new_slot):
    write value to new column at new_slot
```

Memory cost: one entry per source slot. On 64-bit targets a plain `Option<usize>` takes 16 bytes, so `Vec<Option<usize>>` costs `n_slots × 16` bytes; encoding "no mapping" with a sentinel value (or `Option<NonZeroUsize>`) brings this down to `n_slots × sizeof(usize)` = `n_slots × 8` bytes.
IO pattern: sequential read of each source column → random write into new column builders.

Option B avoids storing kmer values and works uniformly regardless of filter selectivity.
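The two sub-options can be sketched side by side as follows. This is a hedged illustration, not the project's code: columns are plain `Vec<u32>`, `new_mphf` is a closure standing in for the new MPHF, kmers are `u64`, and the slot map uses a `usize::MAX` sentinel (`UNMAPPED`, a hypothetical name) instead of `Option<usize>`. For brevity the slot map is built from the recorded `(old_slot, kmer)` pairs rather than from a second read of the unitig files.

```rust
/// Sentinel for a filtered-out source slot (assumption: real slots never reach usize::MAX).
const UNMAPPED: usize = usize::MAX;

/// Option A: copy row by row from the recorded (old_slot, kmer) pairs.
/// `old[c][old_slot]` is the source matrix, stored column by column.
fn fill_option_a<F: Fn(u64) -> usize>(
    old: &[Vec<u32>],
    kept: &[(usize, u64)],
    new_mphf: F,
) -> Vec<Vec<u32>> {
    let mut new = vec![vec![0u32; kept.len()]; old.len()];
    for &(old_slot, kmer) in kept {
        let new_slot = new_mphf(kmer);
        for (c, col) in old.iter().enumerate() {
            new[c][new_slot] = col[old_slot]; // one random read per column
        }
    }
    new
}

/// Option B, part 1: old -> new slot map, one entry per source slot.
fn build_slot_map<F: Fn(u64) -> usize>(
    n_slots: usize,
    kept: &[(usize, u64)],
    new_mphf: F,
) -> Vec<usize> {
    let mut slot_map = vec![UNMAPPED; n_slots];
    for &(old_slot, kmer) in kept {
        slot_map[old_slot] = new_mphf(kmer);
    }
    slot_map
}

/// Option B, part 2: read one source column sequentially, write kept values
/// at their new slots (random writes into the destination column).
fn copy_column(src: &[u32], slot_map: &[usize], n_kept: usize) -> Vec<u32> {
    let mut dst = vec![0u32; n_kept];
    for (old_slot, &v) in src.iter().enumerate() {
        let new_slot = slot_map[old_slot];
        if new_slot != UNMAPPED {
            dst[new_slot] = v;
        }
    }
    dst
}
```

Both paths produce the same new matrix. They differ in where the random access happens: Option A reads the old matrix at random, Option B writes the new columns at random.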
## Comparison

| | Current | Proposed |
|---|---|---|
| Disk reads | 2× unitigs + 2× random matrix | 1× columns (sequential, step 1) + 1× unitigs + step 3 copy (sequential column reads under option B) |
| MPHF lookups (source) | 2× N_kmers | 1× N_kmers (step 2); none during the column scans |
| Cache behavior | poor (random row access) | good (sequential column scan) |
| Extra memory | none | `keep` bitvector, plus `slot_map` (option B) or `(old_slot, kmer)` list (option A) |
## Files to modify

- `src/obikpartitionner/src/rebuild_layer.rs`: `rebuild_partition` and `iter_src_layers`
- Possibly `src/obicompactvec/`: add column iterator API if not already present
- `src/obilayeredmap/`: check if per-column sequential access is exposed on `SrcLayerData`
## Open questions

- Does `SrcLayerData` expose per-column sequential iteration, or only `lookup(kmer, n_genomes)` random access?
- For option B: are new column builders writable in random-slot order (i.e. `set_val(slot, value)` without sequential constraint)?
- For `GroupQuorumFilter` specifically: can the filter be decomposed into independent per-column contributions, or does it need the full row?