feat: add PersistentBitVec and upgrade PersistentCompactIntVec format
Introduces PersistentBitVec, a dense, memory-mapped bit vector optimized for bulk u64-word operations and SIMD acceleration, complete with bitwise operators and Jaccard/Hamming distance metrics. Upgrades PersistentCompactIntVec to a unified .pciv format using 64-bit indices and offsets, consolidating the binary layout and updating builder/reader lifecycles accordingly. Adds corresponding documentation, updates MkDocs navigation, and implements a comprehensive test suite for persistence round-trips, edge cases, and metric accuracy.
@@ -0,0 +1,173 @@
|
||||
# PersistentBitVec
|
||||
|
||||
## Purpose
|
||||
|
||||
`PersistentBitVec` stores a dense bit vector (presence/absence per slot) backed by a single mmap'd file. It is the binary counterpart of `PersistentCompactIntVec` and shares the same lifecycle pattern (builder → close → reader). All bulk operations work on u64 words rather than bytes, giving 8× fewer iterations and enabling the compiler to emit POPCNT and SIMD instructions.
|
||||
|
||||
Typical use: converting k-mer count vectors to presence/absence vectors (with optional threshold), then computing set-theoretic distances (Jaccard) or edit distances (Hamming) between samples.
|
||||
|
||||
---
|
||||
|
||||
## File format
|
||||
|
||||
Single `.pbiv` file.
|
||||
|
||||
```
offset 0:
  magic:  [u8; 4]        = b"PBIV"
  _pad:   [u8; 4]        = 0          alignment padding
  n:      u64                         number of bits

offset 16:
  data:   [u64; ⌈n/64⌉]               bit words, LSB-first, zero-padded
```
|
||||
|
||||
**Header is 16 bytes**, so data starts at an offset divisible by 8. Since `mmap` returns page-aligned memory (≥ 4096-byte aligned), the data slice is u64-aligned, enabling a zero-copy `&[u8] → &[u64]` reinterpretation.
|
||||
|
||||
**Bit layout**: bit `i` is in `data[i >> 6]` at bit position `i & 63` (LSB-first). Bits `[n, ⌈n/64⌉×64)` are **always zero** (padding). This invariant is maintained by all write operations and must be restored by `not()` after flipping.
|
||||
|
||||
**Total file size**: `16 + ⌈n/64⌉ × 8` bytes.
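
As a rough illustration of the layout above, the sketch below encodes and decodes the 16-byte header and computes the file size. The helper names (`word_count`, `file_size`, `encode_header`, `decode_header`) are hypothetical, and little-endian encoding of `n` is an assumption: the format description does not state the byte order.

```rust
/// Size in bytes of the fixed PBIV header (magic + padding + n).
const HEADER_SIZE: usize = 16;

/// Number of u64 words needed to hold `n` bits, i.e. ceil(n / 64).
fn word_count(n: u64) -> u64 {
    n / 64 + u64::from(n % 64 != 0)
}

/// Total file size: header plus 8 bytes per word.
fn file_size(n: u64) -> u64 {
    HEADER_SIZE as u64 + word_count(n) * 8
}

/// Builds the header. ASSUMPTION: `n` is stored little-endian.
fn encode_header(n: u64) -> [u8; HEADER_SIZE] {
    let mut h = [0u8; HEADER_SIZE];
    h[0..4].copy_from_slice(b"PBIV");
    // Bytes 4..8 stay zero: alignment padding.
    h[8..16].copy_from_slice(&n.to_le_bytes());
    h
}

/// Validates the magic and returns `n`, or None on a malformed header.
fn decode_header(h: &[u8]) -> Option<u64> {
    if h.len() < HEADER_SIZE || &h[0..4] != b"PBIV" {
        return None;
    }
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&h[8..16]);
    Some(u64::from_le_bytes(raw))
}
```

With these definitions, a 65-bit vector needs two words, so its file would occupy 32 bytes.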
|
||||
|
||||
---
|
||||
|
||||
## Lifecycle
|
||||
|
||||
### Builder (`PersistentBitVecBuilder`)
|
||||
|
||||
```rust
struct PersistentBitVecBuilder {
    mmap: MmapMut,
    n: usize,
}
```
|
||||
|
||||
The file and mmap are created immediately at construction. The header is written once at `new()` or copied from the source at `build_from*()`. `close()` is a single flush — there is no tail to append, unlike `PersistentCompactIntVec`.
|
||||
|
||||
#### Constructors
|
||||
|
||||
**`new(n: usize, path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Creates the file, writes the header, zero-extends to `16 + ⌈n/64⌉×8` bytes, mmaps immediately. All bits default to 0.
|
||||
|
||||
**`build_from(source: &PersistentBitVec, path: &Path) -> io::Result<Self>`**
|
||||
|
||||
OS-level file copy (no per-bit iteration), then mmap. Initialisation cost: O(file_size).
|
||||
|
||||
**`build_from_counts(source: &PersistentCompactIntVec, threshold: u32, path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Creates a new file, iterates `source` with its merge-scan iterator (O(n)), and writes bits directly into u64 words:
|
||||
|
||||
```rust
// bit i = 1 iff source[i] >= threshold
words[slot >> 6] |= 1u64 << (slot & 63);
```
|
||||
|
||||
Handles overflow values (≥ 255) transparently — the count iterator returns the true u32 value regardless.
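
The thresholding step can be sketched on plain slices. Here a `&[u32]` stands in for the count iterator of `PersistentCompactIntVec` and a `Vec<u64>` for the mmap'd words; the function name `bits_from_counts` is hypothetical and not part of the documented API.

```rust
/// Converts counts to a presence/absence bit vector:
/// bit `slot` is set iff `counts[slot] >= threshold`.
fn bits_from_counts(counts: &[u32], threshold: u32) -> Vec<u64> {
    // One u64 word per 64 slots, rounded up; unused high bits stay zero.
    let mut words = vec![0u64; (counts.len() + 63) / 64];
    for (slot, &count) in counts.iter().enumerate() {
        if count >= threshold {
            words[slot >> 6] |= 1u64 << (slot & 63);
        }
    }
    words
}
```

Because the words start zeroed and only bits below `counts.len()` are ever set, the padding invariant holds without any extra masking.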
|
||||
|
||||
**`build_from_presence(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Shorthand for `build_from_counts(source, 1, path)`.
|
||||
|
||||
#### Bit-level access
|
||||
|
||||
```rust
fn get(&self, slot: u64) -> bool
fn set(&mut self, slot: u64, value: bool)
```
|
||||
|
||||
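Byte-level mmap access: `mmap[16 + slot/8]`, bit `slot % 8`. This addresses the same bit as the word layout (`data[slot >> 6]`, bit `slot & 63`) provided the u64 words are stored little-endian. O(1).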
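
To check that byte-level and word-level addressing agree, the sketch below reads the same bit both ways. It assumes the words are serialised little-endian and omits the 16-byte header offset; all function names are hypothetical.

```rust
/// Word-level read: bit `slot` lives in word `slot >> 6` at position `slot & 63`.
fn get_word_level(words: &[u64], slot: u64) -> bool {
    (words[(slot >> 6) as usize] >> (slot & 63)) & 1 == 1
}

/// Byte-level read on the serialised data section (header offset omitted).
fn get_byte_level(bytes: &[u8], slot: u64) -> bool {
    (bytes[(slot / 8) as usize] >> (slot % 8)) & 1 == 1
}

/// Byte-level write: sets or clears a single bit.
fn set_byte_level(bytes: &mut [u8], slot: u64, value: bool) {
    let mask = 1u8 << (slot % 8);
    let byte = &mut bytes[(slot / 8) as usize];
    if value {
        *byte |= mask;
    } else {
        *byte &= !mask;
    }
}

/// Serialises words as little-endian bytes (ASSUMED on-disk order).
fn words_to_le_bytes(words: &[u64]) -> Vec<u8> {
    words.iter().flat_map(|w| w.to_le_bytes()).collect()
}
```

On a big-endian serialisation the two addressing schemes would disagree, which is why the byte order matters for this format.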
|
||||
|
||||
#### Word-level bulk operations
|
||||
|
||||
All operate on `⌈n/64⌉` u64 words. O(n/64) per call.
|
||||
|
||||
```rust
builder.and(&other);   // self[i] &= other[i]  for all i
builder.or(&other);    // self[i] |= other[i]
builder.xor(&other);   // self[i] ^= other[i]
builder.not();         // self[i] = !self[i], then re-zero padding bits
```
|
||||
|
||||
`and`/`or`/`xor` read `other`'s word slice directly (no allocation). `not()` flips all words then masks the last word's padding bits to restore the invariant.
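
A minimal sketch of the word-level operations on plain `u64` slices, standing in for the mmap'd word view. `apply_words` and `not_in_place` are hypothetical helper names; the actual builder methods may be structured differently.

```rust
/// Applies a binary word operation in place: dst[i] = op(dst[i], src[i]).
fn apply_words(dst: &mut [u64], src: &[u64], op: impl Fn(u64, u64) -> u64) {
    assert_eq!(dst.len(), src.len(), "vectors must have equal length");
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = op(*d, s);
    }
}

/// Flips all `n` bits, then clears the padding bits of the last word.
fn not_in_place(words: &mut [u64], n: usize) {
    for w in words.iter_mut() {
        *w = !*w;
    }
    let tail = n % 64;
    // When n is a multiple of 64 the last word has no padding: skip the mask.
    if tail != 0 {
        if let Some(last) = words.last_mut() {
            *last &= (1u64 << tail) - 1;
        }
    }
}
```

For example, `apply_words(&mut a, &b, |x, y| x & y)` plays the role of `and`, and negating an all-zero 70-bit vector leaves exactly 70 bits set.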
|
||||
|
||||
#### `close(self) -> io::Result<()>`
|
||||
|
||||
Flushes the mmap. The header was written at construction and is never rewritten. O(1) in Rust code.
|
||||
|
||||
---
|
||||
|
||||
### Reader (`PersistentBitVec`)
|
||||
|
||||
```rust
struct PersistentBitVec {
    mmap: Mmap,
    n: usize,
    path: PathBuf,
}
```
|
||||
|
||||
#### `open(path: &Path) -> io::Result<Self>`
|
||||
|
||||
Mmaps the file, validates magic, reads `n` from bytes `[8..16]`. O(1).
|
||||
|
||||
#### `get(slot: u64) -> bool`
|
||||
|
||||
Byte-level read from `mmap[16 + slot/8]`. O(1).
|
||||
|
||||
#### `iter() -> BitIter<'_>`
|
||||
|
||||
Sequential scan, byte by byte, yielding `bool` values in slot order. Implements `ExactSizeIterator`. O(n).
|
||||
|
||||
#### Aggregates
|
||||
|
||||
```rust
fn count_ones(&self) -> u64    // popcount over all words; padding bits are 0
fn count_zeros(&self) -> u64   // n - count_ones()
```
|
||||
|
||||
`count_ones` iterates `⌈n/64⌉` words and calls `u64::count_ones()` (maps to `POPCNT`). O(n/64).
|
||||
|
||||
#### Distance methods
|
||||
|
||||
Both operate word by word. O(n/64).
|
||||
|
||||
| Method | Formula | Notes |
|---|---|---|
| `jaccard_dist(&other) -> f64` | `1 − |A∩B| / |A∪B|` | `(a&b).count_ones()`, `(a\|b).count_ones()` per word |
| `hamming_dist(&other) -> u64` | number of differing bits | `(a^b).count_ones()` per word |
|
||||
|
||||
Edge case (both all-zero → union = 0): `jaccard_dist` returns 0.0.
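
The two distances can be sketched as free functions over word slices of equal length. This is an illustrative version under that assumption, not the crate's actual method bodies.

```rust
/// Number of differing bits: popcount of the XOR, word by word.
fn hamming_dist(a: &[u64], b: &[u64]) -> u64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| u64::from((x ^ y).count_ones()))
        .sum()
}

/// Jaccard distance 1 - |A ∩ B| / |A ∪ B|; returns 0.0 when the union is empty.
fn jaccard_dist(a: &[u64], b: &[u64]) -> f64 {
    let (mut inter, mut union) = (0u64, 0u64);
    for (x, y) in a.iter().zip(b) {
        inter += u64::from((x & y).count_ones());
        union += u64::from((x | y).count_ones());
    }
    if union == 0 {
        0.0
    } else {
        1.0 - inter as f64 / union as f64
    }
}
```

Both functions rely on the padding bits being zero in both operands; otherwise padding would be counted as real bits.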
|
||||
|
||||
---
|
||||
|
||||
## Implementation notes
|
||||
|
||||
### u64 word view
|
||||
|
||||
The unsafe cast from `&[u8]` to `&[u64]` is sound because:
|
||||
|
||||
1. `mmap` base is page-aligned (≥ 4096-byte boundary).
2. Data offset = 16, and `16 % 8 == 0` → the data pointer is 8-byte aligned.
3. Data length = `⌈n/64⌉ × 8` bytes, always a multiple of 8.
|
||||
|
||||
This gives zero-copy word-level access with no intermediate allocation.
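
One defensive way to perform this reinterpretation is `slice::align_to`, which splits off any misaligned prefix or leftover suffix so the three conditions can be checked at run time. The sketch below is an assumption about how such a check could look, not the crate's actual code; `words_view` is a hypothetical name.

```rust
/// Reinterprets the data section (everything after the 16-byte header)
/// as u64 words. Returns None if the buffer is too short, misaligned,
/// or its data length is not a multiple of 8.
fn words_view(file_bytes: &[u8]) -> Option<&[u64]> {
    let data = file_bytes.get(16..)?;
    // SAFETY: every bit pattern is a valid u64, and `align_to` only places
    // correctly aligned, fully covered elements in the middle slice.
    let (prefix, words, suffix) = unsafe { data.align_to::<u64>() };
    if prefix.is_empty() && suffix.is_empty() {
        Some(words)
    } else {
        None
    }
}
```

Returning `None` instead of casting unconditionally keeps the function sound even if it is handed a buffer that does not come from a page-aligned mapping.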
|
||||
|
||||
### Padding invariant
|
||||
|
||||
Writing `not()` without masking the last word would corrupt `count_ones()`, `hamming_dist()`, and `jaccard_dist()`. The mask applied to the last word after flipping is `(1u64 << (n % 64)) - 1`; masking is skipped when `n % 64 == 0`, because the last word is then fully used and the expression would evaluate to 0. The other operations (`and`, `or`, `xor`) preserve zero padding without any masking: both operands have zero padding bits, and `0 & 0`, `0 | 0`, and `0 ^ 0` are all 0.
|
||||
|
||||
---
|
||||
|
||||
## Complexity
|
||||
|
||||
| Operation | Time | Notes |
|---|---|---|
| `new` / `open` | O(1) | mmap setup + header parse |
| `get` / `set` (builder or reader) | O(1) | byte-level mmap |
| `iter()` | O(n) | byte-by-byte scan |
| `count_ones` / `count_zeros` | O(n/64) | POPCNT per u64 word |
| `and` / `or` / `xor` / `not` | O(n/64) | word-level bitwise ops |
| `jaccard_dist` / `hamming_dist` | O(n/64) | word AND/OR/XOR + POPCNT |
| `build_from` | O(file_size) | OS copy |
| `build_from_counts` / `build_from_presence` | O(n) | count iter + word fill |
| `close` | O(1) | flush only |
|
||||
@@ -2,7 +2,7 @@
|
||||
|
||||
## Purpose
|
||||
|
||||
`PersistentCompactIntVec` stores a dense array of non-negative integers indexed by MPHF slot where the vast majority of values are small (0–254) and large values are rare. It is designed for mmap-compatible random access with minimal memory footprint and optimal cache behaviour.
|
||||
`PersistentCompactIntVec` stores a dense array of non-negative integers indexed by MPHF slot where the vast majority of values are small (0–254) and large values are rare. It is designed for mmap-compatible random and sequential access with minimal memory footprint and optimal cache behaviour.
|
||||
|
||||
Motivation from observed count distributions in genomics data: 99.9% of k-mer counts fit in a u8; overflow (count ≥ 255) affects ~0.07% of distinct k-mers but can reach values above 10⁶ (chloroplast, ribosomal repeats).
|
||||
|
||||
@@ -12,8 +12,8 @@ Motivation from observed count distributions in genomics data: 99.9% of k-mer co
|
||||
|
||||
Two-tier structure:
|
||||
|
||||
1. **Primary array** — `[u8; n]`, mmap'd as a flat file. Values 0–254 are stored directly. Value **255 is a sentinel** meaning "look in overflow".
|
||||
2. **Overflow structure** — sorted list of `(slot: u32, value: u32)` pairs for all slots where the true value ≥ 255, with a **sparse L1-fitting index** for fast lookup.
|
||||
1. **Primary array** — `[u8; n]`, stored at offset 24 in the PCIV file and mmap'd. Values 0–254 are stored directly. Value **255 is a sentinel** meaning "look in overflow".
|
||||
2. **Overflow section** — sorted list of `(slot: u32, value: u32)` pairs for all slots where the true value ≥ 255, with a **sparse L1-fitting index** for fast lookup.
|
||||
|
||||
```
|
||||
primary[slot] < 255 → return primary[slot]
|
||||
@@ -22,149 +22,161 @@ primary[slot] == 255 → binary search in overflow
|
||||
|
||||
---
|
||||
|
||||
## Lifecycle
|
||||
## Single-file format
|
||||
|
||||
The structure has two distinct runtime roles with different APIs.
|
||||
Everything is stored in a single `.pciv` file. Write order matches computation order: the header placeholder is written first, then primary (known at `new()`), then overflow data and index (known at `close()`), then the header is overwritten at offset 0.
|
||||
|
||||
```
offset 0:
  magic:      [u8; 4]  = b"PCIV"
  n:          u64      number of slots
  n_overflow: u32      number of overflow entries
  step:       u32      sparse index step (0 = no index)
  n_index:    u32      number of index entries

offset 24:
  primary: [u8; n]     one byte per slot, 255 = overflow sentinel

offset 24 + n:
  data: [(slot: u32, value: u32); n_overflow]   sorted by slot

offset 24 + n + n_overflow × 8:
  index: [(slot: u32, pos: u32); n_index]       sparse index
```
|
||||
|
||||
The index entries point into `data`: `index[i] = (slot of data[i×step], i×step)`.
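
The header and section offsets can be sketched as follows. The fields are packed without padding (4 + 8 + 4 + 4 + 4 = 24 bytes, matching the primary offset above); little-endian encoding is an assumption, since the byte order is not stated. `PcivHeader` and its methods are hypothetical names.

```rust
const HEADER_SIZE: usize = 24;

#[derive(Debug, PartialEq)]
struct PcivHeader {
    n: u64,
    n_overflow: u32,
    step: u32,
    n_index: u32,
}

/// Reads a little-endian u32 at byte offset `at` (ASSUMED byte order).
fn read_u32(buf: &[u8], at: usize) -> u32 {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(&buf[at..at + 4]);
    u32::from_le_bytes(raw)
}

impl PcivHeader {
    fn encode(&self) -> [u8; HEADER_SIZE] {
        let mut h = [0u8; HEADER_SIZE];
        h[0..4].copy_from_slice(b"PCIV");
        h[4..12].copy_from_slice(&self.n.to_le_bytes());
        h[12..16].copy_from_slice(&self.n_overflow.to_le_bytes());
        h[16..20].copy_from_slice(&self.step.to_le_bytes());
        h[20..24].copy_from_slice(&self.n_index.to_le_bytes());
        h
    }

    fn decode(buf: &[u8]) -> Option<Self> {
        if buf.len() < HEADER_SIZE || &buf[0..4] != b"PCIV" {
            return None;
        }
        let mut n_raw = [0u8; 8];
        n_raw.copy_from_slice(&buf[4..12]);
        Some(PcivHeader {
            n: u64::from_le_bytes(n_raw),
            n_overflow: read_u32(buf, 12),
            step: read_u32(buf, 16),
            n_index: read_u32(buf, 20),
        })
    }

    /// Offset of the sorted overflow data: 24 + n.
    fn data_offset(&self) -> u64 {
        HEADER_SIZE as u64 + self.n
    }

    /// Offset of the sparse index: 24 + n + n_overflow × 8.
    fn index_offset(&self) -> u64 {
        self.data_offset() + u64::from(self.n_overflow) * 8
    }
}
```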
|
||||
|
||||
---
|
||||
|
||||
## Lifecycle
|
||||
|
||||
### Builder (`PersistentCompactIntVecBuilder`)
|
||||
|
||||
Used during layer construction. Holds the primary array and overflow map in memory; supports arbitrary reads and writes before finalisation.
|
||||
Used during construction. The primary section is **mmap'd immediately** at construction time (both for `new` and `build_from`), so the file exists and is addressable from the start. The overflow is held in a `HashMap<u64, u32>` in RAM.
|
||||
|
||||
```rust
|
||||
struct PersistentCompactIntVecBuilder {
|
||||
primary: Vec<u8>, // in memory; written to disk at close()
|
||||
overflow: HashMap<u64, u32>, // O(1) get/set for values ≥ 255
|
||||
path: PathBuf,
|
||||
mmap: MmapMut, // primary section live in the file from the start
|
||||
n: usize,
|
||||
overflow: HashMap<u64, u32>, // values ≥ 255
|
||||
}
|
||||
```
|
||||
|
||||
**Phase 1 — `new(n: usize)`**
|
||||
Allocates `primary` of length `n` initialised to 0. `overflow` is empty.
|
||||
#### `new(n: usize, path: &Path) -> io::Result<Self>`
|
||||
|
||||
**Phase 2 — fill (repeated `set` / `get`)**
|
||||
Creates the file, pre-allocates `HEADER_SIZE + n` zero bytes, mmaps it. The primary is zero-initialised (all slots = 0). Returns immediately ready for `set` / `get`.
|
||||
|
||||
#### `build_from(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>`
|
||||
|
||||
Copies the source PCIV file to `path` (OS-level copy — no per-slot iteration), mmaps the copy, then loads the overflow section into a `HashMap`. Initialisation cost: O(file copy) + O(n_overflow), not O(n).
|
||||
|
||||
At `close()`, the primary section is **not rewritten**: it is already in the file via mmap. Only the overflow data, the sparse index, and the header are updated.
|
||||
|
||||
#### `set(slot: u64, value: u32)` / `get(slot: u64) -> u32`
|
||||
|
||||
Direct mmap byte access for the primary; HashMap for the overflow. Both O(1). Mutations can move a slot between tiers freely (downward mutation removes the HashMap entry; upward mutation adds it).
|
||||
|
||||
#### Element-wise operations — `min`, `max`, `add`, `diff`
|
||||
|
||||
Each takes a `&PersistentCompactIntVec` of equal length and updates `self` in place via `set`:
|
||||
|
||||
```rust
|
||||
fn set(&mut self, slot: u64, value: u32) {
|
||||
if value < 255 {
|
||||
self.primary[slot] = value as u8;
|
||||
self.overflow.remove(&slot); // in case of downward mutation
|
||||
} else {
|
||||
self.primary[slot] = 255; // sentinel
|
||||
self.overflow.insert(slot, value);
|
||||
}
|
||||
}
|
||||
|
||||
fn get(&self, slot: u64) -> u32 {
|
||||
match self.primary[slot] {
|
||||
255 => *self.overflow.get(&slot).unwrap(),
|
||||
v => v as u32,
|
||||
}
|
||||
}
|
||||
builder.min(&other); // self[i] = min(self[i], other[i])
|
||||
builder.max(&other); // self[i] = max(self[i], other[i])
|
||||
builder.add(&other); // self[i] = self[i].checked_add(other[i]) (panics on u32 overflow)
|
||||
builder.diff(&other); // self[i] = self[i].saturating_sub(other[i])
|
||||
```
|
||||
|
||||
Reads and mutations are both O(1). Overflow entries can be created, updated, or removed freely during this phase.
|
||||
All iterate `other` with `other.iter()` (merge-scan, O(n_other)).
|
||||
|
||||
**Phase 3 — `close(primary_path, overflow_path)`**
|
||||
#### `close(self) -> io::Result<()>`
|
||||
|
||||
1. Write `primary` as raw bytes to `counts_primary.bin`.
|
||||
2. Collect `overflow` into `Vec<(u32, u32)>`, sort by slot.
|
||||
3. Compute `step` from `n_overflow` (see below).
|
||||
4. Build sparse index.
|
||||
5. Write `counts_overflow.bin`.
|
||||
6. Drop all in-memory state.
|
||||
|
||||
The `HashMap` is the only extra allocation: bounded by `n_overflow × (8 + 4 + overhead)` bytes, typically a few MB in practice.
|
||||
1. Flush and drop the mmap (primary changes are now on disk).
2. Sort the overflow HashMap into `Vec<(u32, u32)>`.
3. Truncate the file to `HEADER_SIZE + n` (removes old data+index if `build_from` was used).
4. Append sorted overflow data, then sparse index.
5. Seek to offset 0, overwrite the header with final values.
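
Steps 2 and 4 can be sketched in memory as below: the overflow map is sorted by slot, and every `step`-th entry is sampled into the sparse index. File I/O is omitted, `finalize_overflow` is a hypothetical name, and the cast assumes every slot fits in a `u32`, as the on-disk entry type requires.

```rust
use std::collections::HashMap;

/// Returns (sorted overflow data, sparse index) for a given `step`.
/// `step == 0` means no index is built.
fn finalize_overflow(
    overflow: &HashMap<u64, u32>,
    step: usize,
) -> (Vec<(u32, u32)>, Vec<(u32, u32)>) {
    // ASSUMPTION: slots fit in u32, matching the (slot: u32, value: u32) layout.
    let mut data: Vec<(u32, u32)> = overflow
        .iter()
        .map(|(&slot, &value)| (slot as u32, value))
        .collect();
    data.sort_unstable_by_key(|&(slot, _)| slot);

    // index[i] = (slot of data[i * step], i * step)
    let index: Vec<(u32, u32)> = if step == 0 {
        Vec::new()
    } else {
        data.iter()
            .enumerate()
            .step_by(step)
            .map(|(pos, &(slot, _))| (slot, pos as u32))
            .collect()
    };
    (data, index)
}
```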
|
||||
|
||||
---
|
||||
|
||||
### Reader (`PersistentCompactIntVec`)
|
||||
|
||||
Used at query time. Both files are mmap'd; the sparse index is loaded into a `Vec` at open time (≤ 32 KB, L1-resident).
|
||||
Used at query time. The whole file is mmap'd; only the sparse index is copied into a `Vec` at open time (≤ 32 KB, L1-resident).
|
||||
|
||||
```rust
|
||||
struct PersistentCompactIntVec {
|
||||
primary: Mmap, // mmap of counts_primary.bin
|
||||
index: Vec<(u32, u32)>, // sparse index, loaded into RAM at open
|
||||
data: Mmap, // mmap of overflow data region
|
||||
n_overflow: u32,
|
||||
step: u32,
|
||||
mmap: Mmap,
|
||||
n: usize,
|
||||
n_overflow: usize,
|
||||
step: u32,
|
||||
index: Vec<(u32, u32)>, // L1-resident
|
||||
primary_offset: usize, // = 24 (HEADER_SIZE)
|
||||
data_offset: usize, // = 24 + n
|
||||
path: PathBuf,
|
||||
}
|
||||
```
|
||||
|
||||
**`open(primary_path, overflow_path)`**
|
||||
Mmaps both files. Parses the overflow file header; copies the sparse index into a `Vec` (tiny, warm in cache). The data region stays mmap'd.
|
||||
#### `open(path: &Path) -> io::Result<Self>`
|
||||
|
||||
**`get(slot: u64) -> u32`** — see Lookup section.
|
||||
Mmaps the file, parses the 24-byte header, copies the sparse index entries into a `Vec`. The primary and data sections stay mmap'd.
|
||||
|
||||
---
|
||||
|
||||
## Overflow file format
|
||||
#### `get(slot: u64) -> u32` — random access
|
||||
|
||||
```
|
||||
magic: [u8; 4] = b"PCIV"
|
||||
n_overflow: u32
|
||||
step: u32 (0 if n_overflow ≤ L1_entries → no sparse index)
|
||||
[if step > 0]
|
||||
n_index: u32 = ⌈n_overflow / step⌉
|
||||
index: [(slot: u32, pos: u32); n_index] ← loaded into RAM at open
|
||||
data: [(slot: u32, value: u32); n_overflow] sorted by slot, mmap'd
|
||||
primary[slot] < 255 → return it directly
|
||||
|
||||
step == 0:
|
||||
binary_search(data[0..n_overflow], slot)
|
||||
|
||||
step > 0:
|
||||
i = upper_bound(index[..].slot, slot) − 1 // in L1-resident Vec
|
||||
binary_search(data[index[i].pos .. index[i+1].pos], slot)
|
||||
```
|
||||
|
||||
`index[i]` stores the slot value and data-array position of the `i × step`-th overflow entry.
|
||||
#### `iter() -> Iter<'_>` — sequential scan, O(n)
|
||||
|
||||
Merge-scan: reads primary bytes in order; on sentinel 255, advances a sequential pointer into the sorted data section rather than doing a binary search. This gives O(n + n_overflow) with no random access into the data section.
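
A minimal sketch of the merge-scan over in-memory slices, which stand in for the mmap'd primary and data sections. The function name `merge_scan` is hypothetical, and the sketch assumes a well-formed file in which every sentinel has a matching overflow entry in slot order.

```rust
/// Yields the true value of every slot in order. On sentinel 255 the next
/// overflow entry is consumed sequentially instead of being searched for.
fn merge_scan<'a>(
    primary: &'a [u8],
    data: &'a [(u32, u32)],
) -> impl Iterator<Item = u32> + 'a {
    let mut next = 0usize; // cursor into the sorted overflow data
    primary.iter().map(move |&byte| {
        if byte == 255 {
            let value = data[next].1;
            next += 1;
            value
        } else {
            u32::from(byte)
        }
    })
}
```

Since the overflow data is sorted by slot and the primary is scanned in slot order, the cursor only ever moves forward.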
|
||||
|
||||
`Iter` implements `ExactSizeIterator`. `&PersistentCompactIntVec` implements `IntoIterator`.
|
||||
|
||||
#### Aggregate
|
||||
|
||||
```rust
|
||||
fn sum(&self) -> u64 // Σ self[i] as u64, via iter()
|
||||
```
|
||||
|
||||
#### Distance methods
|
||||
|
||||
All take `&other` of equal length, iterate both with `zip(self.iter(), other.iter())`, and return `f64`.
|
||||
|
||||
| Method | Formula |
|---|---|
| `bray_dist` | `1 − 2·Σmin(aᵢ,bᵢ) / (Σaᵢ + Σbᵢ)` |
| `relfreq_bray_dist` | Bray-Curtis on relative frequencies: `1 − Σmin(pᵢ,qᵢ)` where `pᵢ = aᵢ/Σa` |
| `euclidean_dist` | `√Σ(aᵢ − bᵢ)²` |
| `relfreq_euclidean_dist` | Euclidean on relative frequencies |
| `hellinger_euclidean_dist` | `√Σ(√pᵢ − √qᵢ)²`, i.e. Euclidean on sqrt(relfreq) |
| `hellinger_dist` | `hellinger_euclidean_dist / √2`, the standard Hellinger distance ∈ [0, 1] |
| `threshold_jaccard_dist(&other, threshold: u32)` | `1 − |A∩B| / |A∪B|` where presence iff count ≥ threshold |
| `jaccard_dist` | `threshold_jaccard_dist(&other, 1)` |
|
||||
|
||||
Edge cases (both vectors all-zero, or union empty for Jaccard): distance = 0.0.
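
Two of the formulas above, sketched on plain `u32` slices that stand in for the zipped iterators. These free functions are illustrative only; the accumulator types and the handling of the empty case follow the edge-case rule stated above.

```rust
/// Bray-Curtis distance: 1 - 2·Σmin(a, b) / (Σa + Σb); 0.0 if both are all-zero.
fn bray_dist(a: &[u32], b: &[u32]) -> f64 {
    let (mut sum_min, mut sum_a, mut sum_b) = (0u64, 0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        sum_min += u64::from(x.min(y));
        sum_a += u64::from(x);
        sum_b += u64::from(y);
    }
    let total = sum_a + sum_b;
    if total == 0 {
        0.0
    } else {
        1.0 - 2.0 * sum_min as f64 / total as f64
    }
}

/// Jaccard distance where a slot is "present" iff its count >= threshold.
fn threshold_jaccard_dist(a: &[u32], b: &[u32], threshold: u32) -> f64 {
    let (mut inter, mut union) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        let (in_a, in_b) = (x >= threshold, y >= threshold);
        inter += u64::from(in_a && in_b);
        union += u64::from(in_a || in_b);
    }
    if union == 0 {
        0.0
    } else {
        1.0 - inter as f64 / union as f64
    }
}
```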
|
||||
|
||||
---
|
||||
|
||||
## Step computation
|
||||
|
||||
The step is chosen at `close()` time, once `n_overflow` is known:
|
||||
Chosen at `close()` once `n_overflow` is known:
|
||||
|
||||
```
|
||||
L1_SIZE = 32 * 1024 // 32 KB conservative target
|
||||
INDEX_ENTRY = 8 // bytes: (u32, u32)
|
||||
L1_entries = L1_SIZE / INDEX_ENTRY = 4096
|
||||
L1_entries = 32 768 / 8 = 4096
|
||||
|
||||
if n_overflow ≤ L1_entries:
|
||||
step = 0 // no sparse index; data itself fits in a few cache lines
|
||||
else:
|
||||
step = ⌈n_overflow / L1_entries⌉
|
||||
step = 0 if n_overflow ≤ 4096
|
||||
step = ⌈n_overflow / 4096⌉ otherwise
|
||||
```
|
||||
|
||||
For the Betula nana reference (359 044 overflows): step = ⌈359 044 / 4 096⌉ = 88, index = ⌈359 044 / 88⌉ = 4 081 entries ≈ 31.9 KB.
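
The step rule can be written as a small function. The constant names mirror the pseudocode above; `choose_step` is a hypothetical name for illustration.

```rust
const L1_SIZE: usize = 32 * 1024; // 32 KB conservative L1 target
const INDEX_ENTRY: usize = 8; // bytes per (u32, u32) index entry
const L1_ENTRIES: usize = L1_SIZE / INDEX_ENTRY; // 4096

/// Returns 0 (no sparse index) when the overflow data is small enough,
/// otherwise ceil(n_overflow / L1_ENTRIES).
fn choose_step(n_overflow: usize) -> usize {
    if n_overflow <= L1_ENTRIES {
        0
    } else {
        (n_overflow + L1_ENTRIES - 1) / L1_ENTRIES
    }
}
```

For the 359 044 overflow entries quoted above, this yields a step of 88.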
|
||||
|
||||
---
|
||||
|
||||
## Lookup
|
||||
|
||||
```
|
||||
fn get(slot: u64) -> u32:
|
||||
if primary[slot] < 255:
|
||||
return primary[slot] as u32
|
||||
|
||||
if step == 0:
|
||||
return binary_search(data[0..n_overflow], slot)
|
||||
|
||||
// 1. binary search in index (Vec, L1-resident)
|
||||
i = upper_bound(index[..].slot, slot) - 1
|
||||
pos_start = index[i].pos
|
||||
pos_end = if i+1 < n_index { index[i+1].pos } else { n_overflow }
|
||||
|
||||
// 2. binary search in contiguous block (mmap'd)
|
||||
return binary_search(data[pos_start..pos_end], slot)
|
||||
```
|
||||
|
||||
Cache behaviour: step 1 is entirely within the L1-resident `Vec<(u32,u32)>`; step 2 loads a contiguous block of ≤ `step × 8` bytes from the mmap.
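
The two-level lookup can be sketched over in-memory slices standing in for the primary bytes, the sorted overflow data, and the sparse index. `lookup_overflow` and `get` are illustrative free functions, not the reader's actual methods.

```rust
/// Finds the overflow value for `slot`, or None if it has no overflow entry.
fn lookup_overflow(
    data: &[(u32, u32)],
    index: &[(u32, u32)],
    step: usize,
    slot: u32,
) -> Option<u32> {
    let (lo, hi) = if step == 0 {
        (0, data.len())
    } else {
        // upper_bound: number of index entries whose slot is <= `slot`.
        let ub = index.partition_point(|&(s, _)| s <= slot);
        if ub == 0 {
            return None; // slot precedes the first overflow entry
        }
        let i = ub - 1;
        let start = index[i].1 as usize;
        let end = index.get(i + 1).map_or(data.len(), |&(_, pos)| pos as usize);
        (start, end)
    };
    // Binary search inside one contiguous block of at most `step` entries.
    data[lo..hi]
        .binary_search_by_key(&slot, |&(s, _)| s)
        .ok()
        .map(|k| data[lo + k].1)
}

/// Full lookup: primary byte first, overflow only on sentinel 255.
fn get(primary: &[u8], data: &[(u32, u32)], index: &[(u32, u32)], step: usize, slot: u32) -> u32 {
    match primary[slot as usize] {
        255 => lookup_overflow(data, index, step, slot)
            .expect("sentinel 255 without a matching overflow entry"),
        v => u32::from(v),
    }
}
```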
|
||||
|
||||
---
|
||||
|
||||
## Files
|
||||
|
||||
```
|
||||
layer_N/
|
||||
counts_primary.bin — [u8; n_slots], raw bytes
|
||||
counts_overflow.bin — PCIV header + sparse index + sorted data
|
||||
(absent if n_overflow == 0)
|
||||
```
|
||||
|
||||
If `counts_overflow.bin` is absent, no slot has value ≥ 255; all reads go directly to the primary array.
|
||||
For the Betula nana reference (359 044 overflows): step = ⌈359 044 / 4 096⌉ = 88, n_index = ⌈359 044 / 88⌉ = 4 081 entries ≈ 31.9 KB.
|
||||
|
||||
---
|
||||
|
||||
@@ -172,15 +184,13 @@ If `counts_overflow.bin` is absent, no slot has value ≥ 255; all reads go dire
|
||||
|
||||
| Operation | Time | Notes |
|
||||
|---|---|---|
|
||||
| `set` / `get` (builder) | O(1) | HashMap for overflow |
|
||||
| `get` (no overflow) | O(1) | single byte read |
|
||||
| `get` (overflow, with index) | O(log step) | ~2 memory regions |
|
||||
| `get` (overflow, no index) | O(log n_overflow) | data fits in a few cache lines |
|
||||
| `close` | O(n_overflow log n_overflow) | sort + index build |
|
||||
| `set` / `get` (builder) | O(1) | mmap byte + HashMap |
|
||||
| `get` (reader, no overflow) | O(1) | single mmap byte |
|
||||
| `get` (reader, with index) | O(log step) | ≤ 2 memory regions |
|
||||
| `get` (reader, no index) | O(log n_overflow) | data fits in a few cache lines |
|
||||
| `iter()` full scan | O(n + n_overflow) | merge-scan, no binary search |
|
||||
| `sum`, distances | O(n) | via `iter()` / `zip(iter(), iter())` |
|
||||
| `min` / `max` / `add` / `diff` | O(n) | via `other.iter()` + builder `set` |
|
||||
| `close` | O(n_overflow log n_overflow) | sort + sequential write |
|
||||
| `open` | O(n_index) | index copy into Vec |
|
||||
|
||||
---
|
||||
|
||||
## Generalisation
|
||||
|
||||
The sentinel (255) and primary type (u8) are fixed. The overflow value type is u32, sufficient for any realistic k-mer count. For a count matrix (mode 4), one `PersistentCompactIntVec` per genome column shares the primary array layout.
|
||||
| `build_from` | O(file_size) + O(n_overflow) | OS copy + HashMap load |
|
||||
|
||||