obikmer/docmd/implementation/persistent_compact_int_vec.md
Eric Coissac f2de79acde Add persistent compact integer vector and cache-line-optimized MPHF
Introduce the `obicompactvec` crate, featuring a two-tier, memory-mapped integer vector that uses a primary `u8` array with a sentinel for overflow dispatch and a sparse L1-resident index for fast random access. Implement builder and reader modules with zero-copy serialization and comprehensive test coverage. Update `obilayeredmap` to replace the default hash function with a cache-line-optimized `Mphf`, adding explicit bounds checking and duplicate-slot detection. Add documentation for both modules and update project configuration files accordingly.
2026-05-13 10:09:46 +08:00


PersistentCompactIntVec

Purpose

PersistentCompactIntVec stores a dense array of non-negative integers indexed by MPHF slot, where the vast majority of values are small (0–254) and large values are rare. It is designed for mmap-compatible random access with minimal memory footprint and optimal cache behaviour.

The design is motivated by count distributions observed in genomics data: 99.9% of k-mer counts fall in the directly stored range (0–254); overflow (count ≥ 255) affects ~0.07% of distinct k-mers, but those counts can exceed 10⁶ (chloroplast, ribosomal repeats).


Design

Two-tier structure:

  1. Primary array: [u8; n], mmap'd as a flat file. Values 0–254 are stored directly. Value 255 is a sentinel meaning "look in overflow".
  2. Overflow structure: a sorted list of (slot: u32, value: u32) pairs for all slots whose true value is ≥ 255, with a sparse L1-fitting index for fast lookup.
primary[slot] < 255  →  return primary[slot]
primary[slot] == 255 →  binary search in overflow
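
The dispatch rule above can be sketched as a pair of small functions. This is an illustrative sketch only: the names `SENTINEL`, `encode`, and `decode` are hypothetical and do not come from the crate.

```rust
/// Sentinel byte meaning "the real value lives in the overflow structure".
const SENTINEL: u8 = 255;

/// Split a value into its primary byte and an optional overflow payload.
fn encode(value: u32) -> (u8, Option<u32>) {
    if value < SENTINEL as u32 {
        (value as u8, None) // stored directly in the primary array
    } else {
        (SENTINEL, Some(value)) // primary holds the sentinel, value goes to overflow
    }
}

/// Recover a value from its primary byte; `lookup` runs only on the sentinel.
fn decode(byte: u8, lookup: impl FnOnce() -> u32) -> u32 {
    if byte == SENTINEL {
        lookup()
    } else {
        byte as u32
    }
}
```

Note that the value 255 itself is routed to overflow, since the byte 255 is reserved as the sentinel.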

Lifecycle

The structure has two distinct runtime roles with different APIs.

Builder (PersistentCompactIntVecBuilder)

Used during layer construction. Holds the primary array and overflow map in memory; supports arbitrary reads and writes before finalisation.

use std::collections::HashMap;

struct PersistentCompactIntVecBuilder {
    primary:  Vec<u8>,           // in memory; written to disk at close()
    overflow: HashMap<u64, u32>, // O(1) get/set for values ≥ 255
}

Phase 1 — new(n: usize)

Allocates primary of length n, initialised to 0. overflow is empty.

Phase 2 — fill (repeated set / get)

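impl PersistentCompactIntVecBuilder {
    /// Store `value` at `slot`; values ≥ 255 go to the overflow map.
    fn set(&mut self, slot: u64, value: u32) {
        let i = slot as usize;                   // Vec indexing requires usize
        if value < 255 {
            self.primary[i] = value as u8;
            self.overflow.remove(&slot);         // in case of downward mutation
        } else {
            self.primary[i] = 255;               // sentinel
            self.overflow.insert(slot, value);
        }
    }

    /// Read the value at `slot`, following the sentinel into the overflow map.
    fn get(&self, slot: u64) -> u32 {
        match self.primary[slot as usize] {
            255 => *self.overflow.get(&slot).expect("sentinel without overflow entry"),
            v   => v as u32,
        }
    }
}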

Reads and mutations are both O(1). Overflow entries can be created, updated, or removed freely during this phase.
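
To illustrate how the fill phase behaves across the 255 boundary, here is a minimal self-contained sketch. The type name `Builder` and the `increment` helper are hypothetical and exist only for this example; the document describes only `new`, `set`, `get`, and `close`.

```rust
use std::collections::HashMap;

/// Stand-in for the builder described above (hypothetical, in-memory only).
struct Builder {
    primary: Vec<u8>,
    overflow: HashMap<u64, u32>,
}

impl Builder {
    fn new(n: usize) -> Self {
        Builder { primary: vec![0; n], overflow: HashMap::new() }
    }

    fn set(&mut self, slot: u64, value: u32) {
        let i = slot as usize;
        if value < 255 {
            self.primary[i] = value as u8;
            self.overflow.remove(&slot); // drop a stale overflow entry, if any
        } else {
            self.primary[i] = 255; // sentinel
            self.overflow.insert(slot, value);
        }
    }

    fn get(&self, slot: u64) -> u32 {
        match self.primary[slot as usize] {
            255 => self.overflow[&slot],
            v => v as u32,
        }
    }

    /// Hypothetical helper: add `delta` to a slot, saturating at u32::MAX.
    fn increment(&mut self, slot: u64, delta: u32) {
        let v = self.get(slot).saturating_add(delta);
        self.set(slot, v);
    }
}
```

In this sketch, incrementing a slot from 254 to 255 moves it into the overflow map, and a later `set` with a small value moves it back, so the map only ever holds entries for slots whose primary byte is the sentinel.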

Phase 3 — close(primary_path, overflow_path)

  1. Write primary as raw bytes to counts_primary.bin.
  2. Collect overflow into Vec<(u32, u32)>, sort by slot.
  3. Compute step from n_overflow (see below).
  4. Build sparse index.
  5. Write counts_overflow.bin.
  6. Drop all in-memory state.

The HashMap is the only extra allocation: bounded by n_overflow × (8 + 4 + overhead) bytes, typically a few MB in practice.
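
Steps 2 to 5 of close() could look roughly like the sketch below, which serialises the overflow map into a byte buffer instead of writing a file. Little-endian encoding, the function name `serialize_overflow`, and the assertion that slots fit in a u32 are assumptions of this sketch; the document does not specify them.

```rust
use std::collections::HashMap;

/// Index entries that fit the 32 KB budget (32 * 1024 / 8).
const L1_ENTRIES: usize = 4096;

/// Serialise an overflow map into the PCIV layout (little-endian assumed).
fn serialize_overflow(overflow: &HashMap<u64, u32>) -> Vec<u8> {
    // Step 2: collect and sort by slot; slots are assumed to fit in u32.
    let mut data: Vec<(u32, u32)> = overflow
        .iter()
        .map(|(&slot, &value)| {
            assert!(slot <= u32::MAX as u64, "slot does not fit in u32");
            (slot as u32, value)
        })
        .collect();
    data.sort_unstable_by_key(|&(slot, _)| slot);

    // Step 3: step is 0 when no sparse index is needed.
    let n = data.len();
    let step = if n <= L1_ENTRIES { 0 } else { (n + L1_ENTRIES - 1) / L1_ENTRIES };

    let mut out = Vec::new();
    out.extend_from_slice(b"PCIV");
    out.extend_from_slice(&(n as u32).to_le_bytes());
    out.extend_from_slice(&(step as u32).to_le_bytes());

    // Step 4: sparse index, one (slot, pos) entry per `step` data entries.
    if step > 0 {
        let n_index = (n + step - 1) / step;
        out.extend_from_slice(&(n_index as u32).to_le_bytes());
        for pos in (0..n).step_by(step) {
            out.extend_from_slice(&data[pos].0.to_le_bytes());
            out.extend_from_slice(&(pos as u32).to_le_bytes());
        }
    }

    // Step 5: the sorted data region.
    for (slot, value) in data {
        out.extend_from_slice(&slot.to_le_bytes());
        out.extend_from_slice(&value.to_le_bytes());
    }
    out
}
```

A real close() would write this buffer to counts_overflow.bin and skip the file entirely when the map is empty.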


Reader (PersistentCompactIntVec)

Used at query time. Both files are mmap'd; the sparse index is loaded into a Vec at open time (≤ 32 KB, L1-resident).

struct PersistentCompactIntVec {
    primary:  Mmap,              // mmap of counts_primary.bin
    index:    Vec<(u32, u32)>,   // sparse index, loaded into RAM at open
    data:     Mmap,              // mmap of overflow data region
    n_overflow: u32,
    step:     u32,
}

open(primary_path, overflow_path)

Mmaps both files. Parses the overflow file header; copies the sparse index into a Vec (tiny, warm in cache). The data region stays mmap'd.

get(slot: u64) -> u32 — see Lookup section.


Overflow file format

magic:       [u8; 4]  = b"PCIV"
n_overflow:  u32
step:        u32      (0 if n_overflow ≤ L1_entries → no sparse index)
[if step > 0]
  n_index:   u32      = ⌈n_overflow / step⌉
  index:     [(slot: u32, pos: u32); n_index]    ← loaded into RAM at open
data:        [(slot: u32, value: u32); n_overflow]  sorted by slot, mmap'd

index[i] stores the slot value and data-array position of the i × step-th overflow entry.
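
As a sketch of how a reader might parse this layout, the function below walks a byte slice (standing in for the mmap'd file) and copies only the sparse index. Little-endian encoding is assumed, and the names `Overflow`, `read_u32`, and `parse_overflow` are hypothetical.

```rust
/// Parsed view of an overflow file (hypothetical type for this sketch).
struct Overflow<'a> {
    n_overflow: u32,
    step: u32,
    index: Vec<(u32, u32)>, // (slot, pos) pairs, copied into RAM
    data: &'a [u8],         // raw (slot, value) pairs, left in place
}

/// Read a little-endian u32 at byte offset `at`, or None if out of bounds.
fn read_u32(bytes: &[u8], at: usize) -> Option<u32> {
    let b = bytes.get(at..at + 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Parse header, optional sparse index, and locate the data region.
fn parse_overflow(bytes: &[u8]) -> Option<Overflow<'_>> {
    if bytes.get(0..4)? != b"PCIV" {
        return None; // wrong magic
    }
    let n_overflow = read_u32(bytes, 4)?;
    let step = read_u32(bytes, 8)?;
    let mut offset = 12;
    let mut index = Vec::new();
    if step > 0 {
        let n_index = read_u32(bytes, offset)? as usize;
        offset += 4;
        for i in 0..n_index {
            let slot = read_u32(bytes, offset + 8 * i)?;
            let pos = read_u32(bytes, offset + 8 * i + 4)?;
            index.push((slot, pos));
        }
        offset += 8 * n_index;
    }
    // Each data entry is 8 bytes: (slot: u32, value: u32).
    let data = bytes.get(offset..offset + 8 * n_overflow as usize)?;
    Some(Overflow { n_overflow, step, index, data })
}
```

Returning `None` on a bad magic or a truncated file is one possible error policy; a production reader would more likely return a descriptive error type.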


Step computation

The step is chosen at close() time, once n_overflow is known:

L1_SIZE     = 32 * 1024     // 32 KB conservative target
INDEX_ENTRY = 8             // bytes: (u32, u32)
L1_entries  = L1_SIZE / INDEX_ENTRY  = 4096

if n_overflow ≤ L1_entries:
    step = 0    // no sparse index; the data itself is at most 32 KB and fits the L1 budget
else:
    step = ⌈n_overflow / L1_entries⌉

For the Betula nana reference (359 044 overflows): step = ⌈359 044 / 4 096⌉ = 88, index = ⌈359 044 / 88⌉ = 4 081 entries = 31.9 KB.
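
The step rule translates directly into integer arithmetic. The sketch below is a transcription of the pseudocode above; the function names `compute_step` and `index_len` are hypothetical.

```rust
const L1_SIZE: usize = 32 * 1024; // 32 KB conservative target
const INDEX_ENTRY: usize = 8; // bytes per (u32, u32) index entry
const L1_ENTRIES: usize = L1_SIZE / INDEX_ENTRY; // 4096

/// Step between indexed entries; 0 means "no sparse index".
fn compute_step(n_overflow: usize) -> usize {
    if n_overflow <= L1_ENTRIES {
        0
    } else {
        (n_overflow + L1_ENTRIES - 1) / L1_ENTRIES // ceiling division
    }
}

/// Number of sparse index entries: ceil(n_overflow / step), or 0 without an index.
fn index_len(n_overflow: usize) -> usize {
    match compute_step(n_overflow) {
        0 => 0,
        step => (n_overflow + step - 1) / step,
    }
}
```

For the Betula nana figure, `compute_step(359_044)` evaluates to 88, and `index_len(359_044) * INDEX_ENTRY` gives the index size in bytes, which stays below `L1_SIZE`.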


Lookup

fn get(slot: u64) -> u32:
    if primary[slot] < 255:
        return primary[slot] as u32

    if step == 0:
        return binary_search(data[0..n_overflow], slot)

    // 1. binary search in index (Vec, L1-resident)
    i = upper_bound(index[..].slot, slot) - 1
    pos_start = index[i].pos
    pos_end   = if i+1 < n_index { index[i+1].pos } else { n_overflow }

    // 2. binary search in contiguous block (mmap'd)
    return binary_search(data[pos_start..pos_end], slot)

Cache behaviour: step 1 is entirely within the L1-resident Vec<(u32,u32)>; step 2 loads a contiguous block of ≤ step × 8 bytes from the mmap.
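
A Rust rendering of the lookup pseudocode might look like the following sketch. It assumes the data region is already available as a typed slice of (slot, value) pairs, whereas the real reader would decode them from the mmap'd bytes; the free functions `overflow_get` and `get` are hypothetical stand-ins for the method.

```rust
/// Look up `slot` in sorted `(slot, value)` data via the sparse index.
/// `index` holds `(slot, pos)` for every step-th data entry; empty means step == 0.
fn overflow_get(index: &[(u32, u32)], data: &[(u32, u32)], slot: u32) -> Option<u32> {
    let (start, end) = if index.is_empty() {
        (0, data.len()) // no sparse index: search the whole data region
    } else {
        // upper_bound - 1: the last index entry whose slot is <= `slot`.
        let ub = index.partition_point(|&(s, _)| s <= slot);
        let i = ub.checked_sub(1)?; // `slot` precedes every overflow entry
        let start = index[i].1 as usize;
        let end = index.get(i + 1).map_or(data.len(), |&(_, p)| p as usize);
        (start, end)
    };
    // Binary search inside one contiguous block of at most `step` entries.
    let block = &data[start..end];
    block
        .binary_search_by_key(&slot, |&(s, _)| s)
        .ok()
        .map(|k| block[k].1)
}

/// Full read path: primary byte first, overflow only on the sentinel.
fn get(primary: &[u8], index: &[(u32, u32)], data: &[(u32, u32)], slot: u64) -> Option<u32> {
    match *primary.get(slot as usize)? {
        255 => overflow_get(index, data, slot as u32),
        v => Some(v as u32),
    }
}
```

Returning `Option` rather than panicking is a choice made for this sketch: a sentinel with no matching overflow entry indicates a corrupt or mismatched file pair, and the caller can decide how to report it.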


Files

layer_N/
  counts_primary.bin    — [u8; n_slots], raw bytes
  counts_overflow.bin   — PCIV header + sparse index + sorted data
                          (absent if n_overflow == 0)

If counts_overflow.bin is absent, no slot has value ≥ 255; all reads go directly to the primary array.


Complexity

| Operation | Time | Notes |
|---|---|---|
| set / get (builder) | O(1) | HashMap for overflow |
| get (no overflow) | O(1) | single byte read |
| get (overflow, with index) | O(log step) | ~2 memory regions |
| get (overflow, no index) | O(log n_overflow) | data is at most 32 KB, fits in L1 |
| close | O(n_overflow log n_overflow) | sort + index build |
| open | O(n_index) | index copy into Vec |

Generalisation

The sentinel (255) and primary type (u8) are fixed. The overflow value type is u32, sufficient for any realistic k-mer count. For a count matrix (mode 4), one PersistentCompactIntVec per genome column shares the primary array layout.