Skip to content

PersistentBitVec and PersistentBitMatrix

Purpose

PersistentBitVec stores a dense bit vector (presence/absence per slot) backed by a single mmap'd file. It is the binary counterpart of PersistentCompactIntVec and shares the same lifecycle pattern (builder → close → reader). All bulk operations work on u64 words rather than bytes, giving 8× fewer iterations and enabling the compiler to emit POPCNT and SIMD instructions.

Typical use: converting k-mer count vectors to presence/absence vectors (with optional threshold), then computing set-theoretic distances (Jaccard) or edit distances (Hamming) between samples.

PersistentBitMatrix wraps multiple PersistentBitVec columns in a directory, exposing a column-major binary matrix with row-access API. A single-column bit matrix is a vector at the API level.


PersistentBitVec — single-column file

File format

Single .pbiv file.

offset 0:
  magic:     [u8; 4]  = b"PBIV"
  _pad:      [u8; 4]  = 0           alignment padding
  n:         u64       number of bits

offset 16:
  data:      [u64; ⌈n/64⌉]          bit words, LSB-first, zero-padded

Header is 16 bytes, so data starts at an offset divisible by 8. Since mmap returns page-aligned memory (≥ 4096-byte aligned), the data slice is u64-aligned, enabling a zero-copy &[u8] → &[u64] reinterpretation.

Bit layout: bit i is in data[i >> 6] at bit position i & 63 (LSB-first). Bits [n, ⌈n/64⌉×64) are always zero (padding). This invariant is maintained by all write operations and must be restored by not() after flipping.

Total file size: 16 + ⌈n/64⌉ × 8 bytes.

Lifecycle

Builder (PersistentBitVecBuilder)

struct PersistentBitVecBuilder {
    mmap: MmapMut,
    n:    usize,
}

The file and mmap are created immediately at construction. The header is written once at new() or copied from the source at build_from*(). close() is a single flush — there is no tail to append, unlike PersistentCompactIntVec.

new(n: usize, path: &Path) -> io::Result<Self>

Creates the file, writes the header, zero-extends to 16 + ⌈n/64⌉×8 bytes, mmaps immediately. All bits default to 0.

build_from(source: &PersistentBitVec, path: &Path) -> io::Result<Self>

OS-level file copy (no per-bit iteration), then mmap. Initialisation cost: O(file_size).

build_from_counts(source: &PersistentCompactIntVec, threshold: u32, path: &Path) -> io::Result<Self>

Creates a new file, iterates source with its merge-scan iterator (O(n)), and writes bits directly into u64 words:

// bit i = 1 iff source[i] >= threshold
words[slot >> 6] |= 1u64 << (slot & 63);

Handles overflow values (≥ 255) transparently — the count iterator returns the true u32 value regardless.

build_from_presence(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>

Shorthand for build_from_counts(source, 1, path).

Bit-level access

fn get(&self, slot: usize) -> bool
fn set(&mut self, slot: usize, value: bool)

Byte-level mmap access: mmap[16 + slot/8], bit slot % 8. O(1).

Word-level bulk operations

All operate on ⌈n/64⌉ u64 words. O(n/64) per call.

builder.and(&other);   // self[i] &= other[i]  for all i
builder.or(&other);    // self[i] |= other[i]
builder.xor(&other);   // self[i] ^= other[i]
builder.not();         // self[i]  = !self[i], then re-zero padding bits

and/or/xor read other's word slice directly (no allocation). not() flips all words then masks the last word's padding bits to restore the invariant.

close(self) -> io::Result<()>

Flushes the mmap. The header was written at construction and is never rewritten. O(1) in Rust code.

Reader (PersistentBitVec)

struct PersistentBitVec {
    mmap: Mmap,
    n:    usize,
    path: PathBuf,
}

open(path: &Path) -> io::Result<Self>

Mmaps the file, validates magic, reads n from bytes [8..16]. O(1).

get(slot: usize) -> bool

Byte-level read from mmap[16 + slot/8]. O(1).

iter() -> BitIter<'_>

Sequential scan, byte by byte, yielding bool values in slot order. Implements ExactSizeIterator. O(n).

Aggregates

fn count_ones(&self)  -> u64   // popcount over all words; padding bits are 0
fn count_zeros(&self) -> u64   // n - count_ones()

count_ones iterates ⌈n/64⌉ words and calls u64::count_ones() (maps to POPCNT). O(n/64).

Distance methods

Both operate word by word. O(n/64).

Method Formula Notes
jaccard_dist(&other) -> f64 1 − \|A∩B\| / \|A∪B\| (a&b).count_ones(), (a\|b).count_ones() per word
hamming_dist(&other) -> u64 number of differing bits (a^b).count_ones() per word

Edge case (both all-zero → union = 0): jaccard_dist returns 0.0.

Implementation notes

u64 word view

The unsafe cast from &[u8] to &[u64] is sound because:

  1. mmap base is page-aligned (≥ 4096-byte boundary).
  2. Data offset = 16, and 16 % 8 == 0 → the data pointer is 8-byte aligned.
  3. Data length = ⌈n/64⌉ × 8 bytes — always a multiple of 8.

This gives zero-copy word-level access with no intermediate allocation.

Padding invariant

Writing not() without masking the last word would corrupt count_ones(), hamming_dist(), and jaccard_dist(). The mask applied after flipping is (1u64 << (n % 64)) - 1 (no-op if n % 64 == 0). All other operations (and, or, xor) preserve existing zero padding since they can only clear or preserve bits already set by not().

Complexity

Operation Time Notes
new / open O(1) mmap setup + header parse
get / set (builder or reader) O(1) byte-level mmap
iter() O(n) byte-by-byte scan
count_ones / count_zeros O(n/64) POPCNT per u64 word
and / or / xor / not O(n/64) word-level bitwise ops
jaccard_dist / hamming_dist O(n/64) word AND/OR/XOR + POPCNT
build_from O(file_size) OS copy
build_from_counts / build_from_presence O(n) count iter + word fill
close O(1) flush only

PersistentBitMatrix — column-major directory

Design

A directory containing meta.json and N column files col_000000.pbiv, col_000001.pbiv, …, each a PersistentBitVec. Used for presence/absence matrices: one column per genome, one bit per MPHF slot.

presence/
  meta.json          {"n": <n_slots>, "n_cols": <G>}
  col_000000.pbiv    genome 0
  col_000001.pbiv    genome 1
  ...

Column-major layout makes per-genome set operations (Jaccard, Hamming, AND/OR) cache-friendly — each genome is a contiguous file. Row access (which genomes contain a given kmer) requires one O(1) read per column.

Builder (PersistentBitMatrixBuilder)

struct PersistentBitMatrixBuilder {
    dir:    PathBuf,
    n:      usize,
    n_cols: usize,
}

new(n: usize, dir: &Path) -> io::Result<Self>

Creates the directory (including parents).

add_col(&mut self) -> io::Result<PersistentBitVecBuilder>

Creates col_NNNNNN.pbiv for the next column and returns its builder. The caller fills the column and calls builder.close() before calling add_col again.

close(self) -> io::Result<()>

Writes meta.json with the final n and n_cols.

Reader (PersistentBitMatrix)

struct PersistentBitMatrix {
    cols: Vec<PersistentBitVec>,
    n:    usize,
}

open(dir: &Path) -> io::Result<Self>

Reads meta.json, opens all col_NNNNNN.pbiv files.

row(slot: usize) -> Box<[bool]>

Returns the presence vector: [col_0[slot], col_1[slot], …, col_{G-1}[slot]]. One byte read per column. O(G).

col(c: usize) -> &PersistentBitVec

Direct access to a single column for column-oriented operations.

LayerData implementation

impl LayerData for PersistentBitMatrix {
    type Item = Box<[bool]>;
    fn open(layer_dir: &Path) -> OLMResult<Self> { /* opens layer_dir/presence/ */ }
    fn read(&self, slot: usize) -> Box<[bool]>   { self.row(slot) }
}

Aggregation traits — obicompactvec::traits

PersistentBitMatrix implements two aggregation traits used by LayeredStore<S> for cross-layer and cross-partition distance computations.

ColumnWeights

impl ColumnWeights for PersistentBitMatrix {
    fn col_weights(&self) -> Array1<u64>   // = self.count_ones()
}

col_weights()[c] = number of set bits in column c across all slots.

BitPartials

impl BitPartials for PersistentBitMatrix {
    // Self-contained partials (additive across layers)
    fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>)   // (inter, union)
    fn partial_hamming(&self) -> Array2<u64>                   // differing bits

    // Provided finalisations
    fn jaccard_dist_matrix(&self) -> Array2<f64>
    fn hamming_dist_matrix(&self) -> Array2<u64>
}

partial_jaccard returns (inter, union) as a pair because union is not reconstructible from per-column count_ones() — it depends on both columns simultaneously. Both components are additively decomposable across (partition, layer) pairs; the final jaccard_dist_matrix() is computed from their element-wise sums.