obikmer/docmd/implementation/persistent_bit_vec.md

# PersistentBitVec and PersistentBitMatrix

## Purpose

`PersistentBitVec` stores a dense bit vector (presence/absence per slot) backed by a single mmap'd file. It is the binary counterpart of `PersistentCompactIntVec` and shares the same lifecycle pattern (builder → close → reader). All bulk operations work on u64 words rather than bytes, giving 8× fewer iterations and enabling the compiler to emit POPCNT and SIMD instructions.

Typical use: converting k-mer count vectors to presence/absence vectors (with optional threshold), then computing set-theoretic distances (Jaccard) or edit distances (Hamming) between samples.

`PersistentBitMatrix` wraps multiple `PersistentBitVec` columns in a directory, exposing a column-major binary matrix with row-access API. A single-column bit matrix is a vector at the API level.

---

## PersistentBitVec — single-column file

### File format

Single `.pbiv` file.

```
offset 0:
  magic:     [u8; 4]  = b"PBIV"
  _pad:      [u8; 4]  = 0           alignment padding
  n:         u64       number of bits

offset 16:
  data:      [u64; ⌈n/64⌉]          bit words, LSB-first, zero-padded
```

**Header is 16 bytes**, so data starts at an offset divisible by 8. Since `mmap` returns page-aligned memory (≥ 4096-byte aligned), the data slice is u64-aligned, enabling a zero-copy `&[u8] → &[u64]` reinterpretation.

**Bit layout**: bit `i` is in `data[i >> 6]` at bit position `i & 63` (LSB-first). Bits `[n, ⌈n/64⌉×64)` are **always zero** (padding). This invariant is maintained by all write operations and must be restored by `not()` after flipping.

**Total file size**: `16 + ⌈n/64⌉ × 8` bytes.

### Lifecycle

#### Builder (`PersistentBitVecBuilder`)

```rust
struct PersistentBitVecBuilder {
    mmap: MmapMut,
    n:    usize,
}
```

The file and mmap are created immediately at construction. The header is written once at `new()` or copied from the source at `build_from*()`. `close()` is a single flush — there is no tail to append, unlike `PersistentCompactIntVec`.

**`new(n: usize, path: &Path) -> io::Result<Self>`**

Creates the file, writes the header, zero-extends to `16 + ⌈n/64⌉×8` bytes, mmaps immediately. All bits default to 0.

**`build_from(source: &PersistentBitVec, path: &Path) -> io::Result<Self>`**

OS-level file copy (no per-bit iteration), then mmap. Initialisation cost: O(file_size).

**`build_from_counts(source: &PersistentCompactIntVec, threshold: u32, path: &Path) -> io::Result<Self>`**

Creates a new file, iterates `source` with its merge-scan iterator (O(n)), and writes bits directly into u64 words:

```rust
// bit i = 1 iff source[i] >= threshold
words[slot >> 6] |= 1u64 << (slot & 63);
```

Handles overflow values (≥ 255) transparently — the count iterator returns the true u32 value regardless.

**`build_from_presence(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>`**

Shorthand for `build_from_counts(source, 1, path)`.

**Bit-level access**

```rust
fn get(&self, slot: usize) -> bool
fn set(&mut self, slot: usize, value: bool)
```

Byte-level mmap access: `mmap[16 + slot/8]`, bit `slot % 8`. O(1).

**Word-level bulk operations**

All operate on `⌈n/64⌉` u64 words. O(n/64) per call.

```rust
builder.and(&other);   // self[i] &= other[i]  for all i
builder.or(&other);    // self[i] |= other[i]
builder.xor(&other);   // self[i] ^= other[i]
builder.not();         // self[i]  = !self[i], then re-zero padding bits
```

`and`/`or`/`xor` read `other`'s word slice directly (no allocation). `not()` flips all words then masks the last word's padding bits to restore the invariant.

**`close(self) -> io::Result<()>`**

Flushes the mmap. The header was written at construction and is never rewritten. O(1) in Rust code.

#### Reader (`PersistentBitVec`)

```rust
struct PersistentBitVec {
    mmap: Mmap,
    n:    usize,
    path: PathBuf,
}
```

**`open(path: &Path) -> io::Result<Self>`**

Mmaps the file, validates magic, reads `n` from bytes `[8..16]`. O(1).

**`get(slot: usize) -> bool`**

Byte-level read from `mmap[16 + slot/8]`. O(1).

**`iter() -> BitIter<'_>`**

Sequential scan, byte by byte, yielding `bool` values in slot order. Implements `ExactSizeIterator`. O(n).

**Aggregates**

```rust
fn count_ones(&self)  -> u64   // popcount over all words; padding bits are 0
fn count_zeros(&self) -> u64   // n - count_ones()
```

`count_ones` iterates `⌈n/64⌉` words and calls `u64::count_ones()` (maps to `POPCNT`). O(n/64).

**Distance methods**

Both operate word by word. O(n/64).

| Method | Formula | Notes |
|---|---|---|
| `jaccard_dist(&other) -> f64` | `1 − \|A∩B\| / \|A∪B\|` | `(a&b).count_ones()`, `(a\|b).count_ones()` per word |
| `hamming_dist(&other) -> u64` | number of differing bits | `(a^b).count_ones()` per word |

Edge case (both all-zero → union = 0): `jaccard_dist` returns 0.0.

### Implementation notes

#### u64 word view

The unsafe cast from `&[u8]` to `&[u64]` is sound because:

1. `mmap` base is page-aligned (≥ 4096-byte boundary).
2. Data offset = 16, and `16 % 8 == 0` → the data pointer is 8-byte aligned.
3. Data length = `⌈n/64⌉ × 8` bytes — always a multiple of 8.

This gives zero-copy word-level access with no intermediate allocation.

#### Padding invariant

Writing `not()` without masking the last word would corrupt `count_ones()`, `hamming_dist()`, and `jaccard_dist()`. The mask applied after flipping is `(1u64 << (n % 64)) - 1` (no-op if `n % 64 == 0`). All other operations (`and`, `or`, `xor`) preserve existing zero padding since they can only clear or preserve bits already set by `not()`.

### Complexity

| Operation | Time | Notes |
|---|---|---|
| `new` / `open` | O(1) | mmap setup + header parse |
| `get` / `set` (builder or reader) | O(1) | byte-level mmap |
| `iter()` | O(n) | byte-by-byte scan |
| `count_ones` / `count_zeros` | O(n/64) | POPCNT per u64 word |
| `and` / `or` / `xor` / `not` | O(n/64) | word-level bitwise ops |
| `jaccard_dist` / `hamming_dist` | O(n/64) | word AND/OR/XOR + POPCNT |
| `build_from` | O(file_size) | OS copy |
| `build_from_counts` / `build_from_presence` | O(n) | count iter + word fill |
| `close` | O(1) | flush only |

---

## PersistentBitMatrix — column-major directory

### Design

A directory containing `meta.json` and N column files `col_000000.pbiv`, `col_000001.pbiv`, …, each a `PersistentBitVec`. Used for presence/absence matrices: one column per genome, one bit per MPHF slot.

```
presence/
  meta.json          {"n": <n_slots>, "n_cols": <G>}
  col_000000.pbiv    genome 0
  col_000001.pbiv    genome 1
  ...
```

Column-major layout makes per-genome set operations (Jaccard, Hamming, AND/OR) cache-friendly — each genome is a contiguous file. Row access (which genomes contain a given kmer) requires one O(1) read per column.

### Builder (`PersistentBitMatrixBuilder`)

```rust
struct PersistentBitMatrixBuilder {
    dir:    PathBuf,
    n:      usize,
    n_cols: usize,
}
```

**`new(n: usize, dir: &Path) -> io::Result<Self>`**

Creates the directory (including parents).

**`add_col(&mut self) -> io::Result<PersistentBitVecBuilder>`**

Creates `col_NNNNNN.pbiv` for the next column and returns its builder. The caller fills the column and calls `builder.close()` before calling `add_col` again.

**`close(self) -> io::Result<()>`**

Writes `meta.json` with the final `n` and `n_cols`.

### Reader (`PersistentBitMatrix`)

```rust
struct PersistentBitMatrix {
    cols: Vec<PersistentBitVec>,
    n:    usize,
}
```

**`open(dir: &Path) -> io::Result<Self>`**

Reads `meta.json`, opens all `col_NNNNNN.pbiv` files.

**`row(slot: usize) -> Box<[bool]>`**

Returns the presence vector: `[col_0[slot], col_1[slot], …, col_{G-1}[slot]]`. One byte read per column. O(G).

**`col(c: usize) -> &PersistentBitVec`**

Direct access to a single column for column-oriented operations.

### LayerData implementation

```rust
impl LayerData for PersistentBitMatrix {
    type Item = Box<[bool]>;
    fn open(layer_dir: &Path) -> OLMResult<Self> { /* opens layer_dir/presence/ */ }
    fn read(&self, slot: usize) -> Box<[bool]>   { self.row(slot) }
}
```

---

## Aggregation traits — `obicompactvec::traits`

`PersistentBitMatrix` implements two aggregation traits used by `LayeredStore<S>` for cross-layer and cross-partition distance computations.

### ColumnWeights

```rust
impl ColumnWeights for PersistentBitMatrix {
    fn col_weights(&self) -> Array1<u64>   // = self.count_ones()
}
```

`col_weights()[c]` = number of set bits in column `c` across all slots.

### BitPartials

```rust
impl BitPartials for PersistentBitMatrix {
    // Self-contained partials (additive across layers)
    fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>)   // (inter, union)
    fn partial_hamming(&self) -> Array2<u64>                   // differing bits

    // Provided finalisations
    fn jaccard_dist_matrix(&self) -> Array2<f64>
    fn hamming_dist_matrix(&self) -> Array2<u64>
}
```

`partial_jaccard` returns `(inter, union)` as a pair because `union` is not reconstructible from per-column `count_ones()` — it depends on both columns simultaneously. Both components are additively decomposable across `(partition, layer)` pairs; the final `jaccard_dist_matrix()` is computed from their element-wise sums.