feat: introduce column-major matrix storage and migrate layered map
Introduces `PersistentBitMatrix` and `PersistentCompactIntMatrix` to replace single-file vector storage with a column-major, directory-based layout. Each column is persisted as an individual file alongside a lightweight `meta.json` for dimension tracking. Migrates `obilayeredmap` to use these multi-column structures, updating Rust APIs, query return types, and build signatures. Includes comprehensive documentation, unit and integration tests for persistence and accessors, and refactors distance calculation helpers.
@@ -10,26 +10,26 @@
|
||||
|
||||
The MPHF + evidence infrastructure is fixed for all modes. The **payload** — data associated with each slot — is orthogonal and varies by mode.
|
||||
|
||||
| Mode | Description | Payload type | File |
|
||||
| Mode | Description | Payload type | Storage |
|
||||
|---|---|---|---|
|
||||
| 1. Set | membership test only | `()` | — |
|
||||
| 2. Set with count | occurrences per kmer per sample | `PersistentCompactIntVec` | `counts.pciv` |
|
||||
| 3. Presence/absence matrix | which genomes contain each kmer | `PersistentBitVec` per genome | `presence_N.pbiv` |
|
||||
| 4. Count matrix | occurrences per kmer per genome | `PersistentCompactIntVec` per genome | `counts_N.pciv` |
|
||||
| 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory |
|
||||
| 3. Presence/absence matrix | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |
|
||||
| 4. Count matrix | occurrences per kmer per genome | `PersistentCompactIntMatrix` | `counts/` directory |
|
||||
|
||||
Both `PersistentCompactIntVec` and `PersistentBitVec` come from the `obicompactvec` crate. Modes 3 and 4 are not yet implemented; the per-genome multi-file layout and query API remain to be designed.
|
||||
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate. Mode 3 has a build path (`Layer::<PersistentBitMatrix>::build_presence`); mode 4 is not yet implemented.
|
||||
|
||||
### Payload for mode 2: PersistentCompactIntVec
|
||||
### Payload for modes 2/4: PersistentCompactIntMatrix
|
||||
|
||||
`PersistentCompactIntVec` (PCIV) stores one `u32` count per MPHF slot in a single mmap'd `.pciv` file. Its encoding: a primary `u8` array (value 255 = overflow sentinel) backed by a sorted overflow section of `(slot: u64, value: u32)` entries and a sparse L1-fitting index for fast binary search. This handles the geometric count distribution efficiently — most values fit in 1 byte, overflow entries are rare.
|
||||
`PersistentCompactIntMatrix` is a column-major matrix stored in a directory: one `col_NNNNNN.pciv` file per column, plus a `meta.json`. Each column is a `PersistentCompactIntVec` — a mmap'd PCIV file with a `u8` primary array (255 = overflow sentinel), a sorted overflow section of `(slot: u64, value: u32)` entries, and a sparse L1-fitting index.
|
||||
|
||||
Capacity: 0 to u32::MAX per slot. No separate decision needed on bit-width: PCIV adapts to the data.
|
||||
Mode 2 writes 1 column per layer (one sample). Mode 4 writes G columns (one per genome). `read(slot)` returns `Box<[u32]>` — the full row across all columns.
|
||||
|
||||
### Payload for mode 3/4: PersistentBitVec / PersistentCompactIntVec
|
||||
### Payload for mode 3: PersistentBitMatrix
|
||||
|
||||
`PersistentBitVec` (PBIV) stores one bit per MPHF slot in a mmap'd `.pbiv` file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). One PBIV per genome gives a column-major presence/absence matrix, making per-genome set operations cache-friendly.
|
||||
`PersistentBitMatrix` is a column-major bit matrix stored in a directory: one `col_NNNNNN.pbiv` per genome, plus `meta.json`. Each column is a `PersistentBitVec` — a mmap'd PBIV file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). `read(slot)` returns `Box<[bool]>` — the presence vector across all genomes.
|
||||
|
||||
Mode 4 replaces PBIV with PCIV per genome. Multi-file layout and query API are not yet designed.
|
||||
Column-major layout makes per-genome set operations cache-friendly; the full row is assembled on demand at query time.
|
||||
|
||||
---
|
||||
|
||||
@@ -57,14 +57,15 @@ pub struct Hit<T = ()> {
|
||||
}
|
||||
```
|
||||
|
||||
`LayerData` covers the **read path only** (`open` + `read`). The write path (build) is intentionally not in the trait — build signatures differ between modes (mode 1 takes no extra argument, mode 2 takes a `count_of` closure) and forcing this into a trait would require an associated `Context` type with no benefit over specialized `impl` blocks.
|
||||
`LayerData` covers the **read path only** (`open` + `read`). The write path (build) is intentionally not in the trait — build signatures differ between modes and forcing this into a trait would require an associated `Context` type with no benefit over specialized `impl` blocks.
|
||||
|
||||
Implemented concrete types:
|
||||
|
||||
| Type | `Item` | Description |
|
||||
|---|---|---|
|
||||
| `()` | `()` | mode 1 — membership only |
|
||||
| `PersistentCompactIntVec` | `u32` | mode 2 — per-slot count |
|
||||
| `PersistentCompactIntMatrix` | `Box<[u32]>` | modes 2/4 — one count per column |
|
||||
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — one presence bit per column |
|
||||
|
||||
`LayeredMap` mirrors the same parameterisation: `LayeredMap<D: LayerData = ()>`.
|
||||
|
||||
@@ -81,8 +82,14 @@ index_root/ ← LayeredMap (collection)
|
||||
unitigs.bin
|
||||
unitigs.bin.idx
|
||||
evidence.bin
|
||||
counts.pciv [mode 2 only]
|
||||
presence_N.pbiv [mode 3/4, one per genome — not yet implemented]
|
||||
counts/ [modes 2/4]
|
||||
meta.json {"n": N, "n_cols": 1}
|
||||
col_000000.pciv
|
||||
presence/ [mode 3]
|
||||
meta.json {"n": N, "n_cols": G}
|
||||
col_000000.pbiv
|
||||
col_000001.pbiv
|
||||
...
|
||||
layer_1/
|
||||
...
|
||||
part_00001/
|
||||
@@ -106,7 +113,8 @@ layer_N/
|
||||
unitigs.bin — packed 2-bit nucleotide sequences (obiskio binary format)
|
||||
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
|
||||
evidence.bin — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
|
||||
counts.pciv — [mode 2] PersistentCompactIntVec: one u32 per slot
|
||||
counts/ — [modes 2/4] PersistentCompactIntMatrix
|
||||
presence/ — [mode 3] PersistentBitMatrix
|
||||
```
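The `evidence.bin` entry above packs two fields into one `u32`. A minimal sketch of that packing follows, assuming the unitig id occupies the high 25 bits and the rank the low 7 bits; the listing gives only the field widths, so the bit order and the names `pack_evidence` / `unpack_evidence` are illustrative assumptions.

```rust
/// Low bits reserved for the rank (assumed position; the doc gives only the widths).
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1;
/// Largest unitig id that fits in the remaining 25 bits.
const MAX_UNITIG_ID: u32 = (1 << 25) - 1;

/// Packs (unitig_id, rank) into one u32; returns None if either field does not fit.
fn pack_evidence(unitig_id: u32, rank: u32) -> Option<u32> {
    if unitig_id > MAX_UNITIG_ID || rank > RANK_MASK {
        return None;
    }
    Some((unitig_id << RANK_BITS) | rank)
}

/// Splits a packed word back into (unitig_id, rank).
fn unpack_evidence(word: u32) -> (u32, u32) {
    (word >> RANK_BITS, word & RANK_MASK)
}
```

With 7 rank bits a unitig chunk can address at most 128 kmer positions, and 25 id bits allow about 33.5 million unitigs per layer.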
|
||||
|
||||
`unitigs.bin` is the packed-2-bit sequence file produced by `obiskio::UnitigFileWriter`. The companion `.idx` file stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
|
||||
@@ -165,13 +173,24 @@ impl Layer<()> {
|
||||
pub fn build(out_dir: &Path) -> OLMResult<usize>
|
||||
}
|
||||
|
||||
// mode 2
|
||||
impl Layer<PersistentCompactIntVec> {
|
||||
// modes 2/4
|
||||
impl Layer<PersistentCompactIntMatrix> {
|
||||
pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
|
||||
pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
|
||||
}
|
||||
|
||||
// mode 3
|
||||
impl Layer<PersistentBitMatrix> {
|
||||
pub fn build_presence(
|
||||
out_dir: &Path,
|
||||
n_genomes: usize,
|
||||
present_in: impl Fn(CanonicalKmer, usize) -> bool,
|
||||
) -> OLMResult<usize>
|
||||
}
|
||||
```
|
||||
|
||||
Mode 2 creates a `PersistentCompactIntMatrixBuilder` with 1 column and fills it via `build_second_pass`. Mode 3 creates a `PersistentBitMatrixBuilder` with `n_genomes` columns and fills all columns in a single pass.
|
||||
|
||||
Any duplicate slot or out-of-bounds index detected during `build_second_pass` returns `OLMError::Mphf`. `new_from_par_iter` avoids materialising all keys as `Vec<u64>`.
|
||||
|
||||
---
|
||||
@@ -196,6 +215,8 @@ fn query(kmer) -> Option<(usize, Hit<D::Item>)>:
|
||||
|
||||
Expected probe depth: 1 for kmers in layer 0, increasing for later layers.
|
||||
|
||||
For mode 2, `hit.data` is `Box<[u32]>` with 1 element; `hit.data[0]` is the count. For mode 3, `hit.data` is `Box<[bool]>` with G elements, one per genome.
|
||||
|
||||
---
|
||||
|
||||
## Add-layer algorithm
|
||||
@@ -221,15 +242,13 @@ Each partition's new layer is built independently; the operation is fully parall
|
||||
| `epserde 0.8` | zero-copy serialisation of MPHF |
|
||||
| `memmap2` | mmap of layer files |
|
||||
| `obiskio` | unitig file writer/reader |
|
||||
| `obicompactvec` | payload types: `PersistentCompactIntVec`, `PersistentBitVec` |
|
||||
| `obicompactvec` | payload types: `PersistentCompactIntMatrix`, `PersistentBitMatrix` |
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
- **Mode 3/4 multi-file layout**: one PBIV/PCIV per genome per layer means O(n_layers × n_genomes) files. Directory layout, open strategy, and query API are not yet designed.
|
||||
- **Mode 4 scale**: count matrix (n_kmers × n_genomes × bytes_per_count) reaches hundreds of GB for large collections. A sparse representation may be required; access pattern and density threshold are not yet defined.
|
||||
- **Presence matrix layout**: column-major (one PBIV per genome) favours per-genome operations; row-major favours per-kmer queries. Dominant access pattern not yet characterised.
|
||||
- **Mode 4**: count matrix (n_kmers × n_genomes × bytes_per_count) is structurally identical to mode 3 but uses `PersistentCompactIntMatrix` with G columns. Build API not yet implemented. Scale concern: hundreds of GB for large collections — a sparse representation may be required at high genome counts.
|
||||
- **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model.
|
||||
- **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
|
||||
- **`try_new_from_par_iter`**: `ptr_hash::new_from_par_iter` silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A `try_new_from_par_iter` PR upstream would close this gap.
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# PersistentBitVec
|
||||
# PersistentBitVec and PersistentBitMatrix
|
||||
|
||||
## Purpose
|
||||
|
||||
@@ -6,9 +6,13 @@
|
||||
|
||||
Typical use: converting k-mer count vectors to presence/absence vectors (with optional threshold), then computing set-theoretic distances (Jaccard) or edit distances (Hamming) between samples.
|
||||
|
||||
`PersistentBitMatrix` wraps multiple `PersistentBitVec` columns in a directory, exposing a column-major binary matrix with row-access API. A single-column bit matrix is a vector at the API level.
|
||||
|
||||
---
|
||||
|
||||
## File format
|
||||
## PersistentBitVec — single-column file
|
||||
|
||||
### File format
|
||||
|
||||
Single `.pbiv` file.
|
||||
|
||||
@@ -28,11 +32,9 @@ offset 16:
|
||||
|
||||
**Total file size**: `16 + ⌈n/64⌉ × 8` bytes.
|
||||
|
||||
---
|
||||
### Lifecycle
|
||||
|
||||
## Lifecycle
|
||||
|
||||
### Builder (`PersistentBitVecBuilder`)
|
||||
#### Builder (`PersistentBitVecBuilder`)
|
||||
|
||||
```rust
|
||||
struct PersistentBitVecBuilder {
|
||||
@@ -43,8 +45,6 @@ struct PersistentBitVecBuilder {
|
||||
|
||||
The file and mmap are created immediately at construction. The header is written once at `new()` or copied from the source at `build_from*()`. `close()` is a single flush — there is no tail to append, unlike `PersistentCompactIntVec`.
|
||||
|
||||
#### Constructors
|
||||
|
||||
**`new(n: usize, path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Creates the file, writes the header, zero-extends to `16 + ⌈n/64⌉×8` bytes, mmaps immediately. All bits default to 0.
|
||||
@@ -68,16 +68,16 @@ Handles overflow values (≥ 255) transparently — the count iterator returns t
|
||||
|
||||
Shorthand for `build_from_counts(source, 1, path)`.
|
||||
|
||||
#### Bit-level access
|
||||
**Bit-level access**
|
||||
|
||||
```rust
|
||||
fn get(&self, slot: u64) -> bool
|
||||
fn set(&mut self, slot: u64, value: bool)
|
||||
fn get(&self, slot: usize) -> bool
|
||||
fn set(&mut self, slot: usize, value: bool)
|
||||
```
|
||||
|
||||
Byte-level mmap access: `mmap[16 + slot/8]`, bit `slot % 8`. O(1).
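The addressing rule can be sketched on a plain byte buffer standing in for the mmap. This assumes bit `slot % 8` is counted from the least significant bit of the byte, which is what makes the byte view agree with a little-endian `u64` word view; the helper names are hypothetical and not part of the crate.

```rust
/// Size of the PBIV header that precedes the bit payload.
const HEADER_SIZE: usize = 16;

/// Total file size for n bits: header plus ceil(n / 64) words of 8 bytes.
fn file_size(n: usize) -> usize {
    HEADER_SIZE + ((n + 63) / 64) * 8
}

/// Reads one bit; `buf` stands in for the mmap'd file contents.
fn get_bit(buf: &[u8], slot: usize) -> bool {
    (buf[HEADER_SIZE + slot / 8] >> (slot % 8)) & 1 == 1
}

/// Writes one bit in place.
fn set_bit(buf: &mut [u8], slot: usize, value: bool) {
    let byte = &mut buf[HEADER_SIZE + slot / 8];
    let mask = 1u8 << (slot % 8);
    if value {
        *byte |= mask;
    } else {
        *byte &= !mask;
    }
}
```

Both accessors touch exactly one byte, which is where the O(1) bound comes from.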
|
||||
|
||||
#### Word-level bulk operations
|
||||
**Word-level bulk operations**
|
||||
|
||||
All operate on `⌈n/64⌉` u64 words. O(n/64) per call.
|
||||
|
||||
@@ -90,13 +90,11 @@ builder.not(); // self[i] = !self[i], then re-zero padding bits
|
||||
|
||||
`and`/`or`/`xor` read `other`'s word slice directly (no allocation). `not()` flips all words then masks the last word's padding bits to restore the invariant.
|
||||
|
||||
#### `close(self) -> io::Result<()>`
|
||||
**`close(self) -> io::Result<()>`**
|
||||
|
||||
Flushes the mmap. The header was written at construction and is never rewritten. O(1) in Rust code.
|
||||
|
||||
---
|
||||
|
||||
### Reader (`PersistentBitVec`)
|
||||
#### Reader (`PersistentBitVec`)
|
||||
|
||||
```rust
|
||||
struct PersistentBitVec {
|
||||
@@ -106,19 +104,19 @@ struct PersistentBitVec {
|
||||
}
|
||||
```
|
||||
|
||||
#### `open(path: &Path) -> io::Result<Self>`
|
||||
**`open(path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Mmaps the file, validates magic, reads `n` from bytes `[8..16]`. O(1).
|
||||
|
||||
#### `get(slot: u64) -> bool`
|
||||
**`get(slot: usize) -> bool`**
|
||||
|
||||
Byte-level read from `mmap[16 + slot/8]`. O(1).
|
||||
|
||||
#### `iter() -> BitIter<'_>`
|
||||
**`iter() -> BitIter<'_>`**
|
||||
|
||||
Sequential scan, byte by byte, yielding `bool` values in slot order. Implements `ExactSizeIterator`. O(n).
|
||||
|
||||
#### Aggregates
|
||||
**Aggregates**
|
||||
|
||||
```rust
|
||||
fn count_ones(&self) -> u64 // popcount over all words; padding bits are 0
|
||||
@@ -127,22 +125,20 @@ fn count_zeros(&self) -> u64 // n - count_ones()
|
||||
|
||||
`count_ones` iterates `⌈n/64⌉` words and calls `u64::count_ones()` (maps to `POPCNT`). O(n/64).
|
||||
|
||||
#### Distance methods
|
||||
**Distance methods**
|
||||
|
||||
Both operate word by word. O(n/64).
|
||||
|
||||
| Method | Formula | Notes |
|
||||
|---|---|---|
|
||||
| `jaccard_dist(&other) -> f64` | `1 − |A∩B| / |A∪B|` | `(a&b).count_ones()`, `(a\|b).count_ones()` per word |
|
||||
| `jaccard_dist(&other) -> f64` | `1 − \|A∩B\| / \|A∪B\|` | `(a&b).count_ones()`, `(a\|b).count_ones()` per word |
|
||||
| `hamming_dist(&other) -> u64` | number of differing bits | `(a^b).count_ones()` per word |
|
||||
|
||||
Edge case (both all-zero → union = 0): `jaccard_dist` returns 0.0.
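Both distances reduce to per-word popcounts. A minimal sketch over plain `u64` slices, assuming equal-length inputs with zero padding bits; these are free functions for illustration, not the crate's methods.

```rust
/// Number of differing bits between two equal-length word slices.
fn hamming_dist(a: &[u64], b: &[u64]) -> u64 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones() as u64).sum()
}

/// 1 - |A∩B| / |A∪B|, returning 0.0 when the union is empty.
fn jaccard_dist(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut inter, mut union) = (0u64, 0u64);
    for (x, y) in a.iter().zip(b) {
        inter += (x & y).count_ones() as u64;
        union += (x | y).count_ones() as u64;
    }
    if union == 0 {
        0.0
    } else {
        1.0 - inter as f64 / union as f64
    }
}
```

For `a = [0b1100]` and `b = [0b1010]` the intersection has 1 bit and the union 3, so the Jaccard distance is 2/3 and the Hamming distance is 2.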
|
||||
|
||||
---
|
||||
### Implementation notes
|
||||
|
||||
## Implementation notes
|
||||
|
||||
### u64 word view
|
||||
#### u64 word view
|
||||
|
||||
The unsafe cast from `&[u8]` to `&[u64]` is sound because:
|
||||
|
||||
@@ -152,13 +148,11 @@ The unsafe cast from `&[u8]` to `&[u64]` is sound because:
|
||||
|
||||
This gives zero-copy word-level access with no intermediate allocation.
|
||||
|
||||
### Padding invariant
|
||||
#### Padding invariant
|
||||
|
||||
If `not()` did not mask the last word, the flipped padding bits would corrupt `count_ones()`, `hamming_dist()`, and `jaccard_dist()`. The mask applied after flipping is `(1u64 << (n % 64)) - 1`; no mask is needed when `n % 64 == 0`. The other operations (`and`, `or`, `xor`) preserve zero padding because both operands already have zero padding bits, and combining two zero bits with any of these operators yields zero.
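The masking step can be sketched on a plain word slice. `not_in_place` is a hypothetical free function standing in for the builder's `not()`, with `n` the logical number of bits.

```rust
/// Flips every bit, then clears the padding bits of the last word
/// so that popcount-based aggregates stay correct.
fn not_in_place(words: &mut [u64], n: usize) {
    for w in words.iter_mut() {
        *w = !*w;
    }
    let rem = n % 64;
    // When n is a multiple of 64 there is no padding and no mask is applied.
    if rem != 0 {
        if let Some(last) = words.last_mut() {
            *last &= (1u64 << rem) - 1;
        }
    }
}
```

For `n = 70` the second word keeps only its 6 low bits, so the total popcount after negating an all-zero vector is exactly 70.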
|
||||
|
||||
---
|
||||
|
||||
## Complexity
|
||||
### Complexity
|
||||
|
||||
| Operation | Time | Notes |
|
||||
|---|---|---|
|
||||
@@ -171,3 +165,74 @@ Writing `not()` without masking the last word would corrupt `count_ones()`, `ham
|
||||
| `build_from` | O(file_size) | OS copy |
|
||||
| `build_from_counts` / `build_from_presence` | O(n) | count iter + word fill |
|
||||
| `close` | O(1) | flush only |
|
||||
|
||||
---
|
||||
|
||||
## PersistentBitMatrix — column-major directory
|
||||
|
||||
### Design
|
||||
|
||||
A directory containing `meta.json` and N column files `col_000000.pbiv`, `col_000001.pbiv`, …, each a `PersistentBitVec`. Used for presence/absence matrices: one column per genome, one bit per MPHF slot.
|
||||
|
||||
```
|
||||
presence/
|
||||
meta.json {"n": <n_slots>, "n_cols": <G>}
|
||||
col_000000.pbiv genome 0
|
||||
col_000001.pbiv genome 1
|
||||
...
|
||||
```
|
||||
|
||||
Column-major layout makes per-genome set operations (Jaccard, Hamming, AND/OR) cache-friendly — each genome is a contiguous file. Row access (which genomes contain a given kmer) requires one O(1) read per column.
|
||||
|
||||
### Builder (`PersistentBitMatrixBuilder`)
|
||||
|
||||
```rust
|
||||
struct PersistentBitMatrixBuilder {
|
||||
dir: PathBuf,
|
||||
n: usize,
|
||||
n_cols: usize,
|
||||
}
|
||||
```
|
||||
|
||||
**`new(n: usize, dir: &Path) -> io::Result<Self>`**
|
||||
|
||||
Creates the directory (including parents).
|
||||
|
||||
**`add_col(&mut self) -> io::Result<PersistentBitVecBuilder>`**
|
||||
|
||||
Creates `col_NNNNNN.pbiv` for the next column and returns its builder. The caller fills the column and calls `builder.close()` before calling `add_col` again.
|
||||
|
||||
**`close(self) -> io::Result<()>`**
|
||||
|
||||
Writes `meta.json` with the final `n` and `n_cols`.
|
||||
|
||||
### Reader (`PersistentBitMatrix`)
|
||||
|
||||
```rust
|
||||
struct PersistentBitMatrix {
|
||||
cols: Vec<PersistentBitVec>,
|
||||
n: usize,
|
||||
}
|
||||
```
|
||||
|
||||
**`open(dir: &Path) -> io::Result<Self>`**
|
||||
|
||||
Reads `meta.json`, opens all `col_NNNNNN.pbiv` files.
|
||||
|
||||
**`row(slot: usize) -> Box<[bool]>`**
|
||||
|
||||
Returns the presence vector: `[col_0[slot], col_1[slot], …, col_{G-1}[slot]]`. One byte read per column. O(G).
|
||||
|
||||
**`col(c: usize) -> &PersistentBitVec`**
|
||||
|
||||
Direct access to a single column for column-oriented operations.
|
||||
|
||||
### LayerData implementation
|
||||
|
||||
```rust
|
||||
impl LayerData for PersistentBitMatrix {
|
||||
type Item = Box<[bool]>;
|
||||
fn open(layer_dir: &Path) -> OLMResult<Self> { /* opens layer_dir/presence/ */ }
|
||||
fn read(&self, slot: usize) -> Box<[bool]> { self.row(slot) }
|
||||
}
|
||||
```
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# PersistentCompactIntVec
|
||||
# PersistentCompactIntVec and PersistentCompactIntMatrix
|
||||
|
||||
## Purpose
|
||||
|
||||
@@ -6,78 +6,81 @@
|
||||
|
||||
Motivation from observed count distributions in genomics data: 99.9% of k-mer counts fit in a u8; overflow (count ≥ 255) affects ~0.07% of distinct k-mers but can reach values above 10⁶ (chloroplast, ribosomal repeats).
|
||||
|
||||
`PersistentCompactIntMatrix` wraps multiple `PersistentCompactIntVec` columns in a directory, exposing a column-major matrix with row-access API. A vector is a matrix with 1 column.
|
||||
|
||||
---
|
||||
|
||||
## Design
|
||||
## PersistentCompactIntVec — single-column file
|
||||
|
||||
### Design
|
||||
|
||||
Two-tier structure:
|
||||
|
||||
1. **Primary array** — `[u8; n]`, stored at offset 24 in the PCIV file and mmap'd. Values 0–254 are stored directly. Value **255 is a sentinel** meaning "look in overflow".
|
||||
2. **Overflow section** — sorted list of `(slot: u32, value: u32)` pairs for all slots where the true value ≥ 255, with a **sparse L1-fitting index** for fast lookup.
|
||||
1. **Primary array** — `[u8; n]`, stored at offset 40 in the PCIV file and mmap'd. Values 0–254 are stored directly. Value **255 is a sentinel** meaning "look in overflow".
|
||||
2. **Overflow section** — sorted list of `(slot: u64, value: u32)` pairs for all slots where the true value ≥ 255, with a **sparse L1-fitting index** for fast lookup.
|
||||
|
||||
```
|
||||
primary[slot] < 255 → return primary[slot]
|
||||
primary[slot] == 255 → binary search in overflow
|
||||
```
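The two-tier rule can be sketched with in-memory slices standing in for the mmap'd sections. This version searches the whole overflow list and ignores the sparse index; `lookup` is a hypothetical name, and returning `Option` for an out-of-range slot is an assumption made for the sketch.

```rust
/// Primary-array value meaning "look in overflow".
const SENTINEL: u8 = 255;

/// Returns the value stored for `slot`, or None if the slot is out of range
/// or a sentinel has no matching overflow entry.
fn lookup(primary: &[u8], overflow: &[(u64, u32)], slot: usize) -> Option<u32> {
    let byte = *primary.get(slot)?;
    if byte < SENTINEL {
        // Common case: the count fits in one byte.
        return Some(byte as u32);
    }
    // Rare case: binary search in the overflow list, sorted by slot.
    overflow
        .binary_search_by_key(&(slot as u64), |&(s, _)| s)
        .ok()
        .map(|i| overflow[i].1)
}
```

Note that a true value of exactly 255 also lives in the overflow section, since the byte 255 is reserved as the sentinel.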
|
||||
|
||||
---
|
||||
### File format
|
||||
|
||||
## Single-file format
|
||||
|
||||
Everything is stored in a single `.pciv` file. Write order matches computation order: the header placeholder is written first, then primary (known at `new()`), then overflow data and index (known at `close()`), then the header is overwritten at offset 0.
|
||||
Single `.pciv` file. Write order: header placeholder → primary → overflow + index → header overwrite at offset 0.
|
||||
|
||||
```
|
||||
offset 0:
|
||||
magic: [u8; 4] = b"PCIV"
|
||||
n: u64 number of slots
|
||||
n_overflow: u32 number of overflow entries
|
||||
step: u32 sparse index step (0 = no index)
|
||||
n_index: u32 number of index entries
|
||||
magic: [u8; 4] = b"PCIV"
|
||||
_pad: [u8; 4] = 0
|
||||
n: u64 number of slots
|
||||
n_overflow: u64 number of overflow entries
|
||||
n_index: u64 number of sparse index entries
|
||||
step: u64 sparse index step (0 = no index)
|
||||
|
||||
offset 24:
|
||||
primary: [u8; n] one byte per slot, 255 = overflow sentinel
|
||||
offset 40:
|
||||
primary: [u8; n] one byte per slot, 255 = overflow sentinel
|
||||
|
||||
offset 24 + n:
|
||||
data: [(slot: u32, value: u32); n_overflow] sorted by slot
|
||||
offset 40 + n:
|
||||
data: [(slot: u64, value: u32); n_overflow] 12 bytes each, sorted by slot
|
||||
|
||||
offset 24 + n + n_overflow × 8:
|
||||
index: [(slot: u32, pos: u32); n_index] sparse index
|
||||
offset 40 + n + n_overflow × 12:
|
||||
index: [(slot: u64, pos: u64); n_index] 16 bytes each, sparse index
|
||||
```
|
||||
|
||||
The index entries point into `data`: `index[i] = (slot of data[i×step], i×step)`.
|
||||
|
||||
---
|
||||
All integer fields are little-endian. Slot indices are stored as `u64` in the file; they are `usize` in Rust code.
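A minimal sketch of decoding the 40-byte header and deriving the section offsets from it, assuming the field order magic, pad, `n`, `n_overflow`, `n_index`, `step` shown in the 40-byte variant of the layout. The struct and function names are illustrative and are not the crate's API.

```rust
const HEADER_SIZE: usize = 40;

#[derive(Debug, PartialEq)]
struct PcivHeader {
    n: u64,
    n_overflow: u64,
    n_index: u64,
    step: u64,
}

/// Reads a little-endian u64 at `offset`.
fn read_u64_le(bytes: &[u8], offset: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[offset..offset + 8]);
    u64::from_le_bytes(buf)
}

/// Parses the header; None if the buffer is too short or the magic is wrong.
fn parse_header(bytes: &[u8]) -> Option<PcivHeader> {
    if bytes.len() < HEADER_SIZE || &bytes[0..4] != b"PCIV" {
        return None;
    }
    Some(PcivHeader {
        n: read_u64_le(bytes, 8), // bytes 4..8 are padding
        n_overflow: read_u64_le(bytes, 16),
        n_index: read_u64_le(bytes, 24),
        step: read_u64_le(bytes, 32),
    })
}

/// Byte offsets of the (primary, data, index) sections.
fn section_offsets(h: &PcivHeader) -> (u64, u64, u64) {
    let primary = HEADER_SIZE as u64;
    let data = primary + h.n; // one byte per slot
    let index = data + h.n_overflow * 12; // 12 bytes per overflow entry
    (primary, data, index)
}
```

For `n = 10` and `n_overflow = 2` the sections start at bytes 40, 50, and 74.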
|
||||
|
||||
## Lifecycle
|
||||
### Lifecycle
|
||||
|
||||
### Builder (`PersistentCompactIntVecBuilder`)
|
||||
#### Builder (`PersistentCompactIntVecBuilder`)
|
||||
|
||||
Used during construction. The primary section is **mmap'd immediately** at construction time (both for `new` and `build_from`), so the file exists and is addressable from the start. The overflow is held in a `HashMap<u64, u32>` in RAM.
|
||||
Used during construction. The primary section is **mmap'd immediately** at construction time (both for `new` and `build_from`), so the file exists and is addressable from the start. The overflow is held in a `HashMap<usize, u32>` in RAM.
|
||||
|
||||
```rust
|
||||
struct PersistentCompactIntVecBuilder {
|
||||
path: PathBuf,
|
||||
mmap: MmapMut, // primary section live in the file from the start
|
||||
mmap: MmapMut, // primary section live in the file from the start
|
||||
n: usize,
|
||||
overflow: HashMap<u64, u32>, // values ≥ 255
|
||||
overflow: HashMap<usize, u32>, // values ≥ 255
|
||||
}
|
||||
```
|
||||
|
||||
#### `new(n: usize, path: &Path) -> io::Result<Self>`
|
||||
**`new(n: usize, path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Creates the file, pre-allocates `HEADER_SIZE + n` zero bytes, mmaps it. The primary is zero-initialised (all slots = 0). Returns immediately ready for `set` / `get`.
|
||||
|
||||
#### `build_from(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>`
|
||||
**`build_from(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Copies the source PCIV file to `path` with an OS-level copy (no per-slot iteration), mmaps the copy, then loads the overflow section into a `HashMap`. Initialisation cost: O(file_size) for the copy plus O(n_overflow) for the HashMap load; no per-slot loop over the `n` primary bytes runs in Rust code.
|
||||
|
||||
At `close()`, the primary section is **not rewritten**: it is already in the file via mmap. Only the overflow data, the sparse index, and the header are updated.
|
||||
|
||||
#### `set(slot: u64, value: u32)` / `get(slot: u64) -> u32`
|
||||
**`set(slot: usize, value: u32)` / `get(slot: usize) -> u32`**
|
||||
|
||||
Direct mmap byte access for the primary; HashMap for the overflow. Both O(1). Mutations can move a slot between tiers freely (downward mutation removes the HashMap entry; upward mutation adds it).
|
||||
|
||||
#### Element-wise operations — `min`, `max`, `add`, `diff`
|
||||
**Element-wise operations — `min`, `max`, `add`, `diff`**
|
||||
|
||||
Each takes a `&PersistentCompactIntVec` of equal length and updates `self` in place via `set`:
|
||||
|
||||
@@ -90,17 +93,15 @@ builder.diff(&other); // self[i] = self[i].saturating_sub(other[i])
|
||||
|
||||
All iterate `other` with `other.iter()` (merge-scan, O(n_other)).
|
||||
|
||||
#### `close(self) -> io::Result<()>`
|
||||
**`close(self) -> io::Result<()>`**
|
||||
|
||||
1. Flush and drop the mmap (primary changes are now on disk).
|
||||
2. Sort the overflow HashMap into `Vec<(u32, u32)>`.
|
||||
2. Sort the overflow HashMap into `Vec<(usize, u32)>`.
|
||||
3. Truncate the file to `HEADER_SIZE + n` (removes old data+index if `build_from` was used).
|
||||
4. Append sorted overflow data, then sparse index.
|
||||
5. Seek to offset 0, overwrite the header with final values.
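Steps 2 and 4 can be sketched in memory as follows. The sketch assumes the sparse index takes every `step`-th entry of the sorted data, following the rule `index[i] = (slot of data[i×step], i×step)`; file I/O is omitted and the function names are hypothetical.

```rust
use std::collections::HashMap;

/// Step 2: turn the in-RAM overflow map into a list sorted by slot.
fn sorted_overflow(overflow: &HashMap<usize, u32>) -> Vec<(u64, u32)> {
    let mut data: Vec<(u64, u32)> = overflow
        .iter()
        .map(|(&slot, &value)| (slot as u64, value))
        .collect();
    data.sort_unstable_by_key(|&(slot, _)| slot);
    data
}

/// Step 4 (index part): one (slot, position) entry per `step` data entries.
/// A step of 0 means "no index".
fn sparse_index(data: &[(u64, u32)], step: usize) -> Vec<(u64, u64)> {
    if step == 0 {
        return Vec::new();
    }
    data.iter()
        .enumerate()
        .step_by(step)
        .map(|(pos, &(slot, _))| (slot, pos as u64))
        .collect()
}
```

The sort dominates the cost of `close()`, which matches the O(n_overflow log n_overflow) bound in the complexity table.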
|
||||
|
||||
---
|
||||
|
||||
### Reader (`PersistentCompactIntVec`)
|
||||
#### Reader (`PersistentCompactIntVec`)
|
||||
|
||||
Used at query time. The whole file is mmap'd; only the sparse index is copied into a `Vec` at open time (≤ 32 KB, L1-resident).
|
||||
|
||||
@@ -109,19 +110,19 @@ struct PersistentCompactIntVec {
|
||||
mmap: Mmap,
|
||||
n: usize,
|
||||
n_overflow: usize,
|
||||
step: u32,
|
||||
index: Vec<(u32, u32)>, // L1-resident
|
||||
primary_offset: usize, // = 24 (HEADER_SIZE)
|
||||
data_offset: usize, // = 24 + n
|
||||
step: usize,
|
||||
index: Vec<(usize, usize)>, // (slot, pos) — L1-resident
|
||||
primary_offset: usize, // = 40 (HEADER_SIZE)
|
||||
data_offset: usize, // = 40 + n
|
||||
path: PathBuf,
|
||||
}
|
||||
```
|
||||
|
||||
#### `open(path: &Path) -> io::Result<Self>`
|
||||
**`open(path: &Path) -> io::Result<Self>`**
|
||||
|
||||
Mmaps the file, parses the 24-byte header, copies the sparse index entries into a `Vec`. The primary and data sections stay mmap'd.
|
||||
Mmaps the file, parses the 40-byte header, copies the sparse index entries into a `Vec`. The primary and data sections stay mmap'd.
|
||||
|
||||
#### `get(slot: u64) -> u32` — random access
|
||||
**`get(slot: usize) -> u32` — random access**
|
||||
|
||||
```
|
||||
primary[slot] < 255 → return it directly
|
||||
@@ -134,19 +135,19 @@ step > 0:
|
||||
binary_search(data[index[i].pos .. index[i+1].pos], slot)
|
||||
```
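A sketch of the indexed overflow search, assuming `index` holds `(slot, pos)` pairs sorted by slot and that an empty index means the whole data section is searched. The hunk above elides part of the original pseudocode, so the bucket selection shown here is one plausible reading, not the crate's code; `find_overflow` is a hypothetical name.

```rust
/// Looks up `slot` in the sorted overflow `data`, using the sparse `index`
/// to narrow the binary search to one bucket.
fn find_overflow(data: &[(u64, u32)], index: &[(u64, u64)], slot: u64) -> Option<u32> {
    let (lo, hi) = if index.is_empty() {
        (0, data.len())
    } else {
        // Number of index entries whose first slot is <= the target.
        let i = index.partition_point(|&(s, _)| s <= slot);
        if i == 0 {
            return None; // target precedes the first overflow entry
        }
        let lo = index[i - 1].1 as usize;
        let hi = index.get(i).map_or(data.len(), |&(_, p)| p as usize);
        (lo, hi)
    };
    data[lo..hi]
        .binary_search_by_key(&slot, |&(s, _)| s)
        .ok()
        .map(|k| data[lo + k].1)
}
```

The first search runs over the small in-RAM index, and only the second touches the mmap'd data section, over a window of at most `step` entries.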
|
||||
|
||||
#### `iter() -> Iter<'_>` — sequential scan, O(n)
|
||||
**`iter() -> Iter<'_>` — sequential scan, O(n)**
|
||||
|
||||
Merge-scan: reads primary bytes in order; on sentinel 255, advances a sequential pointer into the sorted data section rather than doing a binary search. This gives O(n + n_overflow) with no random access into the data section.
|
||||
|
||||
`Iter` implements `ExactSizeIterator`. `&PersistentCompactIntVec` implements `IntoIterator`.
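The merge-scan can be sketched as an iterator over two slices. It assumes every sentinel byte in `primary` has a matching entry in `data`, in slot order, so the cursor never runs past the end; `merge_scan` is a hypothetical name and the real `Iter` type is not shown in this diff.

```rust
/// Yields one value per slot, pulling overflow values sequentially
/// instead of binary-searching for each sentinel.
fn merge_scan<'a>(
    primary: &'a [u8],
    data: &'a [(u64, u32)],
) -> impl Iterator<Item = u32> + 'a {
    let mut next_overflow = 0usize;
    primary.iter().map(move |&b| {
        if b < 255 {
            b as u32
        } else {
            // Sentinel: the next unread overflow entry belongs to this slot.
            let (_, value) = data[next_overflow];
            next_overflow += 1;
            value
        }
    })
}
```

An aggregate such as `sum()` is then a single pass: `merge_scan(p, d).map(u64::from).sum::<u64>()`.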
|
||||
|
||||
#### Aggregate
|
||||
**Aggregate**
|
||||
|
||||
```rust
|
||||
fn sum(&self) -> u64 // Σ self[i] as u64, via iter()
|
||||
```
|
||||
|
||||
#### Distance methods
|
||||
**Distance methods**
|
||||
|
||||
All take `&other` of equal length, iterate both with `zip(self.iter(), other.iter())`, and return `f64`.
|
||||
|
||||
@@ -158,29 +159,23 @@ All take `&other` of equal length, iterate both with `zip(self.iter(), other.ite
|
||||
| `relfreq_euclidean_dist` | Euclidean on relative frequencies |
|
||||
| `hellinger_euclidean_dist` | `√Σ(√pᵢ − √qᵢ)²` — Euclidean on sqrt(relfreq) |
|
||||
| `hellinger_dist` | `hellinger_euclidean_dist / √2` — standard Hellinger distance ∈ [0, 1] |
|
||||
| `threshold_jaccard_dist(&other, threshold: u32)` | `1 − |A∩B| / |A∪B|` where presence iff count ≥ threshold |
|
||||
| `threshold_jaccard_dist(&other, threshold: u32)` | `1 − \|A∩B\| / \|A∪B\|` where presence iff count ≥ threshold |
|
||||
| `jaccard_dist` | `threshold_jaccard_dist(&other, 1)` |
|
||||
|
||||
Edge cases (both vectors all-zero, or union empty for Jaccard): distance = 0.0.
|
||||
|
||||
---
|
||||
|
||||
## Step computation
|
||||
### Step computation
|
||||
|
||||
Chosen at `close()` once `n_overflow` is known:
|
||||
|
||||
```
|
||||
L1_entries = 32 768 / 8 = 4096
|
||||
L1_INDEX_ENTRIES = 2048
|
||||
|
||||
step = 0 if n_overflow ≤ 4096
|
||||
step = ⌈n_overflow / 4096⌉ otherwise
|
||||
step = 0 if n_overflow ≤ 2048
|
||||
step = ⌈n_overflow / 2048⌉ otherwise
|
||||
```
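The rule can be sketched in integer arithmetic, using the `L1_INDEX_ENTRIES = 2048` variant. The `index_len` helper assumes one index entry per `step` overflow entries, rounded up, which is an inference from the index definition rather than something the step rule states.

```rust
/// Maximum number of sparse index entries kept in RAM.
const L1_INDEX_ENTRIES: usize = 2048;

/// Sparse index step: 0 (no index) for small overflow sections,
/// otherwise ceil(n_overflow / L1_INDEX_ENTRIES).
fn step_for(n_overflow: usize) -> usize {
    if n_overflow <= L1_INDEX_ENTRIES {
        0
    } else {
        (n_overflow + L1_INDEX_ENTRIES - 1) / L1_INDEX_ENTRIES
    }
}

/// Number of index entries produced for a given step (assumed: ceil(n_overflow / step)).
fn index_len(n_overflow: usize, step: usize) -> usize {
    if step == 0 {
        0
    } else {
        (n_overflow + step - 1) / step
    }
}
```

With 10 000 overflow entries this gives a step of 5 and 2 000 index entries, which at 16 bytes each is 32 000 bytes.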
|
||||
|
||||
For the Betula nana reference (359 044 overflows): step = 88, n_index = 4 080 entries = 31.9 KB.
|
||||
|
||||
---
|
||||
|
||||
## Complexity
|
||||
### Complexity
|
||||
|
||||
| Operation | Time | Notes |
|
||||
|---|---|---|
|
||||
@@ -194,3 +189,72 @@ For the Betula nana reference (359 044 overflows): step = 88, n_index = 4 080 en
|
||||
| `close` | O(n_overflow log n_overflow) | sort + sequential write |
|
||||
| `open` | O(n_index) | index copy into Vec |
|
||||
| `build_from` | O(file_size) + O(n_overflow) | OS copy + HashMap load |
|
||||
|
||||
---
|
||||
|
||||
## PersistentCompactIntMatrix — column-major directory
|
||||
|
||||
### Design
|
||||
|
||||
A directory containing `meta.json` and N column files `col_000000.pciv`, `col_000001.pciv`, …, each a `PersistentCompactIntVec`. This is the type used by `LayerData` — a single-column matrix is functionally equivalent to a vector but shares the same interface as multi-column matrices.
|
||||
|
||||
```
|
||||
counts/
|
||||
meta.json {"n": <n_slots>, "n_cols": <N>}
|
||||
col_000000.pciv
|
||||
col_000001.pciv
|
||||
...
|
||||
```
|
||||
|
||||
### Builder (`PersistentCompactIntMatrixBuilder`)
|
||||
|
||||
```rust
|
||||
struct PersistentCompactIntMatrixBuilder {
|
||||
dir: PathBuf,
|
||||
n: usize,
|
||||
n_cols: usize,
|
||||
}
|
||||
```
|
||||
|
||||
**`new(n: usize, dir: &Path) -> io::Result<Self>`**
|
||||
|
||||
Creates the directory (including parents). Does not write `meta.json` yet.
|
||||
|
||||
**`add_col(&mut self) -> io::Result<PersistentCompactIntVecBuilder>`**
|
||||
|
||||
Creates `col_NNNNNN.pciv` for the next column and returns its builder. The caller fills the column and calls `builder.close()` before calling `add_col` again.
|
||||
|
||||
**`close(self) -> io::Result<()>`**
|
||||
|
||||
Writes `meta.json` with the final `n` and `n_cols`. Must be called after all column builders are closed.
|
||||
|
||||
### Reader (`PersistentCompactIntMatrix`)
|
||||
|
||||
```rust
|
||||
struct PersistentCompactIntMatrix {
|
||||
cols: Vec<PersistentCompactIntVec>,
|
||||
n: usize,
|
||||
}
|
||||
```
|
||||
|
||||
**`open(dir: &Path) -> io::Result<Self>`**
|
||||
|
||||
Reads `meta.json`, opens all `col_NNNNNN.pciv` files.
|
||||
|
||||
**`row(slot: usize) -> Box<[u32]>`**
|
||||
|
||||
Returns the full row: `[col_0[slot], col_1[slot], …, col_{N-1}[slot]]`. One mmap access per column. O(N).
|
||||
|
||||
**`col(c: usize) -> &PersistentCompactIntVec`**
|
||||
|
||||
Direct access to a single column for column-oriented operations (distance computations, iteration).
|
||||
|
||||
### LayerData implementation
|
||||
|
||||
```rust
|
||||
impl LayerData for PersistentCompactIntMatrix {
|
||||
type Item = Box<[u32]>;
|
||||
fn open(layer_dir: &Path) -> OLMResult<Self> { /* opens layer_dir/counts/ */ }
|
||||
fn read(&self, slot: usize) -> Box<[u32]> { self.row(slot) }
|
||||
}
|
||||
```
|
||||
|
||||
@@ -0,0 +1,57 @@
use std::{fs, io, path::{Path, PathBuf}};

use crate::bitvec::{PersistentBitVec, PersistentBitVecBuilder};
use crate::meta::MatrixMeta;

fn col_path(dir: &Path, col: usize) -> PathBuf {
    dir.join(format!("col_{col:06}.pbiv"))
}

pub struct PersistentBitMatrix {
    cols: Vec<PersistentBitVec>,
    n: usize,
}

impl PersistentBitMatrix {
    pub fn open(dir: &Path) -> io::Result<Self> {
        let meta = MatrixMeta::load(dir)?;
        let cols = (0..meta.n_cols)
            .map(|c| PersistentBitVec::open(&col_path(dir, c)))
            .collect::<io::Result<Vec<_>>>()?;
        Ok(Self { cols, n: meta.n })
    }

    pub fn n(&self) -> usize { self.n }
    pub fn n_cols(&self) -> usize { self.cols.len() }
    pub fn col(&self, c: usize) -> &PersistentBitVec { &self.cols[c] }

    pub fn row(&self, slot: usize) -> Box<[bool]> {
        self.cols.iter().map(|c| c.get(slot)).collect()
    }
}

pub struct PersistentBitMatrixBuilder {
    dir: PathBuf,
    n: usize,
    n_cols: usize,
}

impl PersistentBitMatrixBuilder {
    pub fn new(n: usize, dir: &Path) -> io::Result<Self> {
        fs::create_dir_all(dir)?;
        Ok(Self { dir: dir.to_path_buf(), n, n_cols: 0 })
    }

    pub fn n(&self) -> usize { self.n }
    pub fn n_cols(&self) -> usize { self.n_cols }

    pub fn add_col(&mut self) -> io::Result<PersistentBitVecBuilder> {
        let path = col_path(&self.dir, self.n_cols);
        self.n_cols += 1;
        PersistentBitVecBuilder::new(self.n, &path)
    }

    pub fn close(self) -> io::Result<()> {
        MatrixMeta { n: self.n, n_cols: self.n_cols }.save(&self.dir)
    }
}
@@ -0,0 +1,58 @@
use std::{fs, io, path::{Path, PathBuf}};

use crate::builder::PersistentCompactIntVecBuilder;
use crate::meta::MatrixMeta;
use crate::reader::PersistentCompactIntVec;

fn col_path(dir: &Path, col: usize) -> PathBuf {
    dir.join(format!("col_{col:06}.pciv"))
}

pub struct PersistentCompactIntMatrix {
    cols: Vec<PersistentCompactIntVec>,
    n: usize,
}

impl PersistentCompactIntMatrix {
    pub fn open(dir: &Path) -> io::Result<Self> {
        let meta = MatrixMeta::load(dir)?;
        let cols = (0..meta.n_cols)
            .map(|c| PersistentCompactIntVec::open(&col_path(dir, c)))
            .collect::<io::Result<Vec<_>>>()?;
        Ok(Self { cols, n: meta.n })
    }

    pub fn n(&self) -> usize { self.n }
    pub fn n_cols(&self) -> usize { self.cols.len() }
    pub fn col(&self, c: usize) -> &PersistentCompactIntVec { &self.cols[c] }

    pub fn row(&self, slot: usize) -> Box<[u32]> {
        self.cols.iter().map(|c| c.get(slot)).collect()
    }
}

pub struct PersistentCompactIntMatrixBuilder {
    dir: PathBuf,
    n: usize,
    n_cols: usize,
}

impl PersistentCompactIntMatrixBuilder {
    pub fn new(n: usize, dir: &Path) -> io::Result<Self> {
        fs::create_dir_all(dir)?;
        Ok(Self { dir: dir.to_path_buf(), n, n_cols: 0 })
    }

    pub fn n(&self) -> usize { self.n }
    pub fn n_cols(&self) -> usize { self.n_cols }

    pub fn add_col(&mut self) -> io::Result<PersistentCompactIntVecBuilder> {
        let path = col_path(&self.dir, self.n_cols);
        self.n_cols += 1;
        PersistentCompactIntVecBuilder::new(self.n, &path)
    }

    pub fn close(self) -> io::Result<()> {
        MatrixMeta { n: self.n, n_cols: self.n_cols }.save(&self.dir)
    }
}
@@ -1,10 +1,15 @@
mod bitvec;
mod bitmatrix;
mod builder;
mod format;
mod intmatrix;
mod meta;
mod reader;

pub use bitvec::{BitIter, PersistentBitVec, PersistentBitVecBuilder};
pub use bitmatrix::{PersistentBitMatrix, PersistentBitMatrixBuilder};
pub use builder::PersistentCompactIntVecBuilder;
pub use intmatrix::{PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder};
pub use reader::PersistentCompactIntVec;

#[cfg(test)]
@@ -0,0 +1,32 @@
use std::{fs, io, path::Path};

pub struct MatrixMeta {
    pub n: usize,
    pub n_cols: usize,
}

impl MatrixMeta {
    pub fn load(dir: &Path) -> io::Result<Self> {
        let s = fs::read_to_string(dir.join("meta.json"))?;
        parse(&s).ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "bad meta.json"))
    }

    pub fn save(&self, dir: &Path) -> io::Result<()> {
        fs::write(
            dir.join("meta.json"),
            format!("{{\"n\":{},\"n_cols\":{}}}\n", self.n, self.n_cols),
        )
    }
}

fn parse(s: &str) -> Option<MatrixMeta> {
    Some(MatrixMeta { n: field(s, "n")?, n_cols: field(s, "n_cols")? })
}

fn field(s: &str, name: &str) -> Option<usize> {
    let key = format!("\"{}\":", name);
    let pos = s.find(&key)? + key.len();
    let rest = s[pos..].trim_start();
    let end = rest.find(|c: char| !c.is_ascii_digit()).unwrap_or(rest.len());
    rest[..end].parse().ok()
}
+201
-57
@@ -7,41 +7,45 @@ use memmap2::Mmap;
|
||||
use crate::format::{HEADER_SIZE, INDEX_ENTRY_SIZE, MAGIC, OVERFLOW_ENTRY_SIZE};
|
||||
|
||||
pub struct PersistentCompactIntVec {
|
||||
mmap: Mmap,
|
||||
n: usize,
|
||||
n_overflow: usize,
|
||||
pub step: usize,
|
||||
index: Vec<(usize, usize)>, // (slot, pos) — L1-resident sparse index
|
||||
primary_offset: usize, // = HEADER_SIZE
|
||||
data_offset: usize, // = HEADER_SIZE + n
|
||||
path: PathBuf,
|
||||
mmap: Mmap,
|
||||
n: usize,
|
||||
n_overflow: usize,
|
||||
pub step: usize,
|
||||
index: Vec<(usize, usize)>, // (slot, pos) — L1-resident sparse index
|
||||
primary_offset: usize, // = HEADER_SIZE
|
||||
data_offset: usize, // = HEADER_SIZE + n
|
||||
path: PathBuf,
|
||||
}
|
||||
|
||||
impl PersistentCompactIntVec {
|
||||
/// Opens a persistent compact int vector from the given path.
|
||||
pub fn open(path: &Path) -> io::Result<Self> {
|
||||
let mmap = unsafe { Mmap::map(&File::open(path)?)? };
|
||||
|
||||
if mmap.len() < HEADER_SIZE {
|
||||
return Err(io::Error::new(io::ErrorKind::InvalidData, "PCIV file too short"));
|
||||
return Err(io::Error::new(
|
||||
io::ErrorKind::InvalidData,
|
||||
"PCIV file too short",
|
||||
));
|
||||
}
|
||||
if &mmap[0..4] != &MAGIC {
|
||||
return Err(io::Error::new(io::ErrorKind::InvalidData, "bad PCIV magic"));
|
||||
}
|
||||
|
||||
let n = u64::from_le_bytes(mmap[8..16].try_into().unwrap()) as usize;
|
||||
let n = u64::from_le_bytes(mmap[8..16].try_into().unwrap()) as usize;
|
||||
let n_overflow = u64::from_le_bytes(mmap[16..24].try_into().unwrap()) as usize;
|
||||
let n_index = u64::from_le_bytes(mmap[24..32].try_into().unwrap()) as usize;
|
||||
let step = u64::from_le_bytes(mmap[32..40].try_into().unwrap()) as usize;
|
||||
let n_index = u64::from_le_bytes(mmap[24..32].try_into().unwrap()) as usize;
|
||||
let step = u64::from_le_bytes(mmap[32..40].try_into().unwrap()) as usize;
|
||||
|
||||
let primary_offset = HEADER_SIZE;
|
||||
let data_offset = primary_offset + n;
|
||||
let index_offset = data_offset + n_overflow * OVERFLOW_ENTRY_SIZE;
|
||||
let data_offset = primary_offset + n;
|
||||
let index_offset = data_offset + n_overflow * OVERFLOW_ENTRY_SIZE;
|
||||
|
||||
let mut index = Vec::with_capacity(n_index);
|
||||
for i in 0..n_index {
|
||||
let off = index_offset + i * INDEX_ENTRY_SIZE;
|
||||
let off = index_offset + i * INDEX_ENTRY_SIZE;
|
||||
let slot = u64::from_le_bytes(mmap[off..off + 8].try_into().unwrap()) as usize;
|
||||
let pos = u64::from_le_bytes(mmap[off + 8..off + 16].try_into().unwrap()) as usize;
|
||||
let pos = u64::from_le_bytes(mmap[off + 8..off + 16].try_into().unwrap()) as usize;
|
||||
index.push((slot, pos));
|
||||
}
|
||||
|
||||
@@ -57,36 +61,44 @@ impl PersistentCompactIntVec {
|
||||
})
|
||||
}
|
||||
|
||||
/// Returns the path of the compact int vector file.
|
||||
pub fn path(&self) -> &Path {
|
||||
&self.path
|
||||
}
|
||||
|
||||
/// Returns the length of the compact int vector.
|
||||
pub fn len(&self) -> usize {
|
||||
self.n
|
||||
}
|
||||
|
||||
/// Returns whether the compact int vector is empty.
|
||||
pub fn is_empty(&self) -> bool {
|
||||
self.n == 0
|
||||
}
|
||||
|
||||
/// Returns the value at the given slot.
|
||||
pub fn get(&self, slot: usize) -> u32 {
|
||||
match self.mmap[self.primary_offset + slot] {
|
||||
255 => self.overflow_get(slot),
|
||||
v => v as u32,
|
||||
v => v as u32,
|
||||
}
|
||||
}
|
||||
|
||||
/// Returns the value at the given slot from the overflow region.
|
||||
fn overflow_get(&self, slot: usize) -> u32 {
|
||||
let pos_start;
|
||||
let pos_end;
|
||||
|
||||
if self.step == 0 {
|
||||
pos_start = 0;
|
||||
pos_end = self.n_overflow;
|
||||
pos_end = self.n_overflow;
|
||||
} else {
|
||||
let i = self.index.partition_point(|&(s, _)| s <= slot).saturating_sub(1);
|
||||
let i = self
|
||||
.index
|
||||
.partition_point(|&(s, _)| s <= slot)
|
||||
.saturating_sub(1);
|
||||
pos_start = self.index[i].1;
|
||||
pos_end = if i + 1 < self.index.len() {
|
||||
pos_end = if i + 1 < self.index.len() {
|
||||
self.index[i + 1].1
|
||||
} else {
|
||||
self.n_overflow
|
||||
@@ -98,8 +110,8 @@ impl PersistentCompactIntVec {
|
||||
while lo < hi {
|
||||
let mid = lo + (hi - lo) / 2;
|
||||
match self.data_slot(mid).cmp(&slot) {
|
||||
std::cmp::Ordering::Equal => return self.data_value(mid),
|
||||
std::cmp::Ordering::Less => lo = mid + 1,
|
||||
std::cmp::Ordering::Equal => return self.data_value(mid),
|
||||
std::cmp::Ordering::Less => lo = mid + 1,
|
||||
std::cmp::Ordering::Greater => hi = mid,
|
||||
}
|
||||
}
|
||||
@@ -107,85 +119,203 @@ impl PersistentCompactIntVec {
|
||||
}
|
||||
|
||||
#[inline]
|
||||
/// Returns the slot at the given index in the overflow region.
|
||||
fn data_slot(&self, i: usize) -> usize {
|
||||
let off = self.data_offset + i * OVERFLOW_ENTRY_SIZE;
|
||||
u64::from_le_bytes(self.mmap[off..off + 8].try_into().unwrap()) as usize
|
||||
}
|
||||
|
||||
#[inline]
|
||||
/// Returns the value at the given index in the overflow region.
|
||||
fn data_value(&self, i: usize) -> u32 {
|
||||
let off = self.data_offset + i * OVERFLOW_ENTRY_SIZE + 8;
|
||||
u32::from_le_bytes(self.mmap[off..off + 4].try_into().unwrap())
|
||||
}
|
||||
|
||||
#[inline]
|
||||
/// Returns the sum of all values in the compact int vector.
|
||||
pub fn sum(&self) -> u64 {
|
||||
self.iter().map(|v| v as u64).sum()
|
||||
}
|
||||
|
||||
#[inline]
|
||||
/// Returns the Bray-Curtis distance between two compact int vectors.
|
||||
pub fn bray_dist(&self, other: &PersistentCompactIntVec) -> f64 {
|
||||
assert_eq!(self.n, other.len(), "length mismatch");
|
||||
let (sum_min, sum_a, sum_b) = self.iter().zip(other.iter()).fold(
|
||||
(0u64, 0u64, 0u64),
|
||||
|(sm, sa, sb), (a, b)| (sm + a.min(b) as u64, sa + a as u64, sb + b as u64),
|
||||
);
|
||||
let denom = sum_a + sum_b;
|
||||
if denom == 0 { return 0.0; }
|
||||
let (sum_min, denom) = self.partial_bray_dist(other);
|
||||
if denom == 0 {
|
||||
return 0.0;
|
||||
}
|
||||
1.0 - 2.0 * sum_min as f64 / denom as f64
|
||||
}
|
||||
|
||||
/// Returns the partial Bray-Curtis distance between two compact int vectors.
|
||||
///
|
||||
/// Returns a tuple `(sum_min, denom)` where `sum_min` is the sum of the minimum values
|
||||
/// at each index, and `denom` is the sum of the values in both vectors.
|
||||
/// This is used internally by [`bray_dist`] and to easily compute the Bray-Curtis distance
|
||||
/// over a set of vector pairs.
|
||||
///
|
||||
/// Returns the tuple `(sum_min, sum_a + sum_b)` where `sum_min` is the sum of the minimum
|
||||
/// values at each index, `sum_a` is the sum of the first vector's counts, and `sum_b` is
|
||||
/// the sum of the second vector's counts.
|
||||
pub fn partial_bray_dist(&self, other: &PersistentCompactIntVec) -> (u64, u64) {
|
||||
assert_eq!(self.n, other.len(), "length mismatch");
|
||||
let (sum_min, sum_a, sum_b) = self
|
||||
.iter()
|
||||
.zip(other.iter())
|
||||
.fold((0u64, 0u64, 0u64), |(sm, sa, sb), (a, b)| {
|
||||
(sm + a.min(b) as u64, sa + a as u64, sb + b as u64)
|
||||
});
|
||||
(sum_min, sum_a + sum_b)
|
||||
}
|
||||
|
||||
/// Returns the relative frequency Bray-Curtis distance between two compact int vectors.
|
||||
///
|
||||
/// This is a variant of [`bray_dist`] that uses relative frequencies instead of raw counts.
|
||||
pub fn relfreq_bray_dist(&self, other: &PersistentCompactIntVec) -> f64 {
|
||||
assert_eq!(self.n, other.len(), "length mismatch");
|
||||
let sum_a = self.sum() as f64;
|
||||
let sum_b = other.sum() as f64;
|
||||
if sum_a == 0.0 && sum_b == 0.0 { return 0.0; }
|
||||
let sum_min: f64 = self.iter().zip(other.iter())
|
||||
if sum_a == 0.0 && sum_b == 0.0 {
|
||||
return 0.0;
|
||||
}
|
||||
let sum_min = self.partial_relfreq_bray_dist(other, sum_a, sum_b);
|
||||
1.0 - sum_min
|
||||
}
|
||||
|
||||
/// Returns the partial relative frequency Bray-Curtis distance between two compact int vectors.
|
||||
///
|
||||
/// This is used internally by [`relfreq_bray_dist`] and to easily compute the relative frequency
|
||||
/// Bray-Curtis distance over a set of vector pairs.
|
||||
///
|
||||
/// Arguments:
|
||||
/// - `other`: the other compact int vector to compare with
|
||||
/// - `sum_a`: the sum of the first vector's counts
|
||||
/// - `sum_b`: the sum of the second vector's counts
|
||||
///
|
||||
/// Returns the sum of the minimum relative frequencies at each index.
|
||||
pub fn partial_relfreq_bray_dist(
|
||||
&self,
|
||||
other: &PersistentCompactIntVec,
|
||||
sum_a: f64,
|
||||
sum_b: f64,
|
||||
) -> f64 {
|
||||
assert_eq!(self.n, other.len(), "length mismatch");
|
||||
let sum_min: f64 = self
|
||||
.iter()
|
||||
.zip(other.iter())
|
||||
.map(|(a, b)| {
|
||||
let pa = if sum_a > 0.0 { a as f64 / sum_a } else { 0.0 };
|
||||
let pb = if sum_b > 0.0 { b as f64 / sum_b } else { 0.0 };
|
||||
pa.min(pb)
|
||||
})
|
||||
.sum();
|
||||
1.0 - sum_min
|
||||
sum_min
|
||||
}
|
||||
|
||||
/// Returns the euclidean distance between two compact int vectors.
|
||||
pub fn euclidean_dist(&self, other: &PersistentCompactIntVec) -> f64 {
|
||||
assert_eq!(self.n, other.len(), "length mismatch");
|
||||
let sq: f64 = self.iter().zip(other.iter())
|
||||
.map(|(a, b)| { let d = a as f64 - b as f64; d * d })
|
||||
.sum();
|
||||
sq.sqrt()
|
||||
self.partial_euclidean_dist(other).sqrt()
|
||||
}
|
||||
|
||||
/// Returns the partial euclidean distance between two compact int vectors.
|
||||
///
|
||||
/// This is used internally by [`euclidean_dist`] and to easily compute the euclidean distance
|
||||
/// over a set of vector pairs.
|
||||
///
|
||||
/// The result is the sum of the squared differences between corresponding elements of the two
|
||||
/// vectors.
|
||||
pub fn partial_euclidean_dist(&self, other: &PersistentCompactIntVec) -> f64 {
|
||||
assert_eq!(self.n, other.len(), "length mismatch");
|
||||
self.iter()
|
||||
.zip(other.iter())
|
||||
.map(|(a, b)| {
|
||||
let d = a as f64 - b as f64;
|
||||
d * d
|
||||
})
|
||||
.sum()
|
||||
}
|
||||
|
||||
/// Returns the relative frequency euclidean distance between two compact int vectors.
|
||||
///
|
||||
/// This is a variant of [`euclidean_dist`] that uses relative frequencies instead of raw counts.
|
||||
pub fn relfreq_euclidean_dist(&self, other: &PersistentCompactIntVec) -> f64 {
|
||||
assert_eq!(self.n, other.len(), "length mismatch");
|
||||
let sum_a = self.sum() as f64;
|
||||
let sum_b = other.sum() as f64;
|
||||
if sum_a == 0.0 && sum_b == 0.0 { return 0.0; }
|
||||
let sq: f64 = self.iter().zip(other.iter())
|
||||
if sum_a == 0.0 && sum_b == 0.0 {
|
||||
return 0.0;
|
||||
}
|
||||
self.partial_relfreq_euclidean_dist(other, sum_a, sum_b)
|
||||
.sqrt()
|
||||
}
|
||||
|
||||
/// Returns the partial relative frequency euclidean distance between two compact int vectors.
|
||||
///
|
||||
/// This is used internally by [`relfreq_euclidean_dist`] and to easily compute the relative frequency
|
||||
/// euclidean distance over a set of vector pairs.
|
||||
pub fn partial_relfreq_euclidean_dist(
|
||||
&self,
|
||||
other: &PersistentCompactIntVec,
|
||||
sum_a: f64,
|
||||
sum_b: f64,
|
||||
) -> f64 {
|
||||
assert_eq!(self.n, other.len(), "length mismatch");
|
||||
self.iter()
|
||||
.zip(other.iter())
|
||||
.map(|(a, b)| {
|
||||
let pa = if sum_a > 0.0 { a as f64 / sum_a } else { 0.0 };
|
||||
let pb = if sum_b > 0.0 { b as f64 / sum_b } else { 0.0 };
|
||||
let d = pa - pb;
|
||||
d * d
|
||||
})
|
||||
.sum();
|
||||
sq.sqrt()
|
||||
.sum()
|
||||
}
|
||||
|
||||
/// Returns the Euclidean distance between two compact int vectors using the Hellinger transform.
|
||||
///
|
||||
/// The Hellinger transform is applied to the raw counts of each vector, and the result is
|
||||
/// the Euclidean distance between the transformed vectors. The Hellinger transform is defined
|
||||
/// as the square root of the relative frequencies.
|
||||
pub fn hellinger_euclidean_dist(&self, other: &PersistentCompactIntVec) -> f64 {
|
||||
assert_eq!(self.n, other.len(), "length mismatch");
|
||||
let sum_a = self.sum() as f64;
|
||||
let sum_b = other.sum() as f64;
|
||||
if sum_a == 0.0 && sum_b == 0.0 { return 0.0; }
|
||||
let sq: f64 = self.iter().zip(other.iter())
|
||||
if sum_a == 0.0 && sum_b == 0.0 {
|
||||
return 0.0;
|
||||
}
|
||||
self.partial_hellinger_euclidean_dist(other, sum_a, sum_b)
|
||||
.sqrt()
|
||||
}
|
||||
|
||||
/// Returns the partial Hellinger Euclidean distance between two compact int vectors.
|
||||
///
|
||||
/// This is used internally by [`hellinger_euclidean_dist`] and to easily compute the Hellinger
|
||||
/// Euclidean distance over a set of vector pairs.
|
||||
pub fn partial_hellinger_euclidean_dist(
|
||||
&self,
|
||||
other: &PersistentCompactIntVec,
|
||||
sum_a: f64,
|
||||
sum_b: f64,
|
||||
) -> f64 {
|
||||
assert_eq!(self.n, other.len(), "length mismatch");
|
||||
self.iter()
|
||||
.zip(other.iter())
|
||||
.map(|(a, b)| {
|
||||
let pa = if sum_a > 0.0 { (a as f64 / sum_a).sqrt() } else { 0.0 };
|
||||
let pb = if sum_b > 0.0 { (b as f64 / sum_b).sqrt() } else { 0.0 };
|
||||
let pa = if sum_a > 0.0 {
|
||||
(a as f64 / sum_a).sqrt()
|
||||
} else {
|
||||
0.0
|
||||
};
|
||||
let pb = if sum_b > 0.0 {
|
||||
(b as f64 / sum_b).sqrt()
|
||||
} else {
|
||||
0.0
|
||||
};
|
||||
let d = pa - pb;
|
||||
d * d
|
||||
})
|
||||
.sum();
|
||||
sq.sqrt()
|
||||
.sum()
|
||||
}
|
||||
|
||||
pub fn hellinger_dist(&self, other: &PersistentCompactIntVec) -> f64 {
|
||||
@@ -194,16 +324,26 @@ impl PersistentCompactIntVec {
|
||||
|
||||
pub fn threshold_jaccard_dist(&self, other: &PersistentCompactIntVec, threshold: u32) -> f64 {
|
||||
assert_eq!(self.n, other.len(), "length mismatch");
|
||||
let (intersection, union) = self.iter().zip(other.iter()).fold(
|
||||
(0u64, 0u64),
|
||||
|(inter, uni), (a, b)| {
|
||||
let (intersection, union) = self.partial_threshold_jaccard_dist(other, threshold);
|
||||
if union == 0 {
|
||||
return 0.0;
|
||||
}
|
||||
1.0 - intersection as f64 / union as f64
|
||||
}
|
||||
|
||||
pub fn partial_threshold_jaccard_dist(
|
||||
&self,
|
||||
other: &PersistentCompactIntVec,
|
||||
threshold: u32,
|
||||
) -> (u64, u64) {
|
||||
assert_eq!(self.n, other.len(), "length mismatch");
|
||||
self.iter()
|
||||
.zip(other.iter())
|
||||
.fold((0u64, 0u64), |(inter, uni), (a, b)| {
|
||||
let ap = a >= threshold;
|
||||
let bp = b >= threshold;
|
||||
(inter + (ap & bp) as u64, uni + (ap | bp) as u64)
|
||||
},
|
||||
);
|
||||
if union == 0 { return 0.0; }
|
||||
1.0 - intersection as f64 / union as f64
|
||||
})
|
||||
}
|
||||
|
||||
pub fn jaccard_dist(&self, other: &PersistentCompactIntVec) -> f64 {
|
||||
@@ -211,7 +351,11 @@ impl PersistentCompactIntVec {
|
||||
}
|
||||
|
||||
pub fn iter(&self) -> Iter<'_> {
|
||||
Iter { pciv: self, slot: 0, overflow_pos: 0 }
|
||||
Iter {
|
||||
pciv: self,
|
||||
slot: 0,
|
||||
overflow_pos: 0,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -225,8 +369,8 @@ impl<'a> IntoIterator for &'a PersistentCompactIntVec {
|
||||
}
|
||||
|
||||
pub struct Iter<'a> {
|
||||
pciv: &'a PersistentCompactIntVec,
|
||||
slot: usize,
|
||||
pciv: &'a PersistentCompactIntVec,
|
||||
slot: usize,
|
||||
overflow_pos: usize,
|
||||
}
|
||||
|
||||
|
||||
@@ -0,0 +1,69 @@
use tempfile::tempdir;

use crate::{PersistentBitMatrix, PersistentBitMatrixBuilder};

#[test]
fn single_col_roundtrip() {
    let dir = tempdir().unwrap();
    let mut b = PersistentBitMatrixBuilder::new(4, dir.path()).unwrap();
    let mut col = b.add_col().unwrap();
    col.set(0, true);
    col.set(1, false);
    col.set(2, true);
    col.set(3, true);
    col.close().unwrap();
    b.close().unwrap();

    let m = PersistentBitMatrix::open(dir.path()).unwrap();
    assert_eq!(m.n_cols(), 1);
    assert_eq!(m.n(), 4);
    assert_eq!(&*m.row(0), &[true]);
    assert_eq!(&*m.row(1), &[false]);
    assert_eq!(&*m.row(2), &[true]);
    assert_eq!(&*m.row(3), &[true]);
}

#[test]
fn two_cols_roundtrip() {
    let dir = tempdir().unwrap();
    let mut b = PersistentBitMatrixBuilder::new(3, dir.path()).unwrap();
    let mut col0 = b.add_col().unwrap();
    col0.set(0, true); col0.set(1, false); col0.set(2, true);
    col0.close().unwrap();
    let mut col1 = b.add_col().unwrap();
    col1.set(0, false); col1.set(1, true); col1.set(2, false);
    col1.close().unwrap();
    b.close().unwrap();

    let m = PersistentBitMatrix::open(dir.path()).unwrap();
    assert_eq!(m.n_cols(), 2);
    assert_eq!(&*m.row(0), &[true, false]);
    assert_eq!(&*m.row(1), &[false, true]);
    assert_eq!(&*m.row(2), &[true, false]);
}

#[test]
fn col_accessor() {
    let dir = tempdir().unwrap();
    let mut b = PersistentBitMatrixBuilder::new(3, dir.path()).unwrap();
    let mut col = b.add_col().unwrap();
    col.set(0, true); col.set(1, false); col.set(2, true);
    col.close().unwrap();
    b.close().unwrap();

    let m = PersistentBitMatrix::open(dir.path()).unwrap();
    assert!(m.col(0).get(0));
    assert!(!m.col(0).get(1));
    assert!(m.col(0).get(2));
}

#[test]
fn zero_cols_roundtrip() {
    let dir = tempdir().unwrap();
    let b = PersistentBitMatrixBuilder::new(8, dir.path()).unwrap();
    b.close().unwrap();

    let m = PersistentBitMatrix::open(dir.path()).unwrap();
    assert_eq!(m.n_cols(), 0);
    assert_eq!(m.n(), 8);
}
@@ -0,0 +1,68 @@
use tempfile::tempdir;

use crate::{PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder};

#[test]
fn single_col_roundtrip() {
    let dir = tempdir().unwrap();
    let mut b = PersistentCompactIntMatrixBuilder::new(4, dir.path()).unwrap();
    let mut col = b.add_col().unwrap();
    col.set(0, 10);
    col.set(1, 200);
    col.set(2, 300);
    col.set(3, 1000);
    col.close().unwrap();
    b.close().unwrap();

    let m = PersistentCompactIntMatrix::open(dir.path()).unwrap();
    assert_eq!(m.n_cols(), 1);
    assert_eq!(m.n(), 4);
    assert_eq!(&*m.row(0), &[10u32]);
    assert_eq!(&*m.row(1), &[200u32]);
    assert_eq!(&*m.row(2), &[300u32]);
    assert_eq!(&*m.row(3), &[1000u32]);
}

#[test]
fn two_cols_roundtrip() {
    let dir = tempdir().unwrap();
    let mut b = PersistentCompactIntMatrixBuilder::new(3, dir.path()).unwrap();
    let mut col0 = b.add_col().unwrap();
    col0.set(0, 1); col0.set(1, 2); col0.set(2, 3);
    col0.close().unwrap();
    let mut col1 = b.add_col().unwrap();
    col1.set(0, 10); col1.set(1, 20); col1.set(2, 30);
    col1.close().unwrap();
    b.close().unwrap();

    let m = PersistentCompactIntMatrix::open(dir.path()).unwrap();
    assert_eq!(m.n_cols(), 2);
    assert_eq!(&*m.row(0), &[1u32, 10]);
    assert_eq!(&*m.row(1), &[2u32, 20]);
    assert_eq!(&*m.row(2), &[3u32, 30]);
}

#[test]
fn col_accessor() {
    let dir = tempdir().unwrap();
    let mut b = PersistentCompactIntMatrixBuilder::new(2, dir.path()).unwrap();
    let mut col0 = b.add_col().unwrap();
    col0.set(0, 5); col0.set(1, 7);
    col0.close().unwrap();
    b.close().unwrap();

    let m = PersistentCompactIntMatrix::open(dir.path()).unwrap();
    assert_eq!(m.col(0).get(0), 5);
    assert_eq!(m.col(0).get(1), 7);
}

#[test]
fn zero_cols_roundtrip() {
    let dir = tempdir().unwrap();
    let b = PersistentCompactIntMatrixBuilder::new(10, dir.path()).unwrap();
    b.close().unwrap();

    let m = PersistentCompactIntMatrix::open(dir.path()).unwrap();
    assert_eq!(m.n_cols(), 0);
    assert_eq!(m.n(), 10);
}
@@ -1,4 +1,6 @@
mod bitmatrix;
mod bitvec;
mod intmatrix;

use tempfile::tempdir;
@@ -4,7 +4,10 @@ use std::path::Path;
|
||||
|
||||
use cacheline_ef::{CachelineEf, CachelineEfVec};
|
||||
use epserde::prelude::*;
|
||||
use obicompactvec::{PersistentCompactIntVec, PersistentCompactIntVecBuilder};
|
||||
use obicompactvec::{
|
||||
PersistentBitMatrix, PersistentBitMatrixBuilder,
|
||||
PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder,
|
||||
};
|
||||
use obikseq::CanonicalKmer;
|
||||
use obiskio::{UnitigFileReader, UnitigFileWriter};
|
||||
use ptr_hash::{PtrHash, PtrHashParams, bucket_fn::CubicEps, hash::Xx64};
|
||||
@@ -15,7 +18,8 @@ use crate::evidence::{Evidence, EvidenceWriter};
|
||||
pub(crate) const MPHF_FILE: &str = "mphf.bin";
|
||||
pub(crate) const UNITIGS_FILE: &str = "unitigs.bin";
|
||||
const EVIDENCE_FILE: &str = "evidence.bin";
|
||||
const COUNTS_FILE: &str = "counts.pciv";
|
||||
const COUNTS_DIR: &str = "counts";
|
||||
const PRESENCE_DIR: &str = "presence";
|
||||
|
||||
type Mphf = PtrHash<u64, CubicEps, CachelineEfVec<Vec<CachelineEf>>, Xx64, Vec<u8>>;
|
||||
|
||||
@@ -33,12 +37,20 @@ impl LayerData for () {
|
||||
fn read(&self, _slot: usize) {}
|
||||
}
|
||||
|
||||
impl LayerData for PersistentCompactIntVec {
|
||||
type Item = u32;
|
||||
impl LayerData for PersistentCompactIntMatrix {
|
||||
type Item = Box<[u32]>;
|
||||
fn open(layer_dir: &Path) -> OLMResult<Self> {
|
||||
PersistentCompactIntVec::open(&layer_dir.join(COUNTS_FILE)).map_err(OLMError::Io)
|
||||
PersistentCompactIntMatrix::open(&layer_dir.join(COUNTS_DIR)).map_err(OLMError::Io)
|
||||
}
|
||||
fn read(&self, slot: usize) -> u32 { self.get(slot) }
|
||||
fn read(&self, slot: usize) -> Box<[u32]> { self.row(slot) }
|
||||
}
|
||||
|
||||
impl LayerData for PersistentBitMatrix {
|
||||
type Item = Box<[bool]>;
|
||||
fn open(layer_dir: &Path) -> OLMResult<Self> {
|
||||
PersistentBitMatrix::open(&layer_dir.join(PRESENCE_DIR)).map_err(OLMError::Io)
|
||||
}
|
||||
fn read(&self, slot: usize) -> Box<[bool]> { self.row(slot) }
|
||||
}
|
||||
|
||||
// ── Structures ────────────────────────────────────────────────────────────────
|
||||
@@ -151,27 +163,31 @@ impl Layer<()> {
|
||||
}
|
||||
}
|
||||
|
||||
// ── Mode 2 — counts (PersistentCompactIntVec) ─────────────────────────────────
|
||||
// ── Mode 2 — count matrix (1 column per layer) ────────────────────────────────
|
||||
|
||||
impl Layer<PersistentCompactIntVec> {
|
||||
impl Layer<PersistentCompactIntMatrix> {
|
||||
pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize> {
|
||||
let unitigs = UnitigFileReader::open(&out_dir.join(UNITIGS_FILE))?;
|
||||
let n = unitigs.n_kmers();
|
||||
let counts_dir = out_dir.join(COUNTS_DIR);
|
||||
if n == 0 {
|
||||
empty_layer(out_dir)?;
|
||||
PersistentCompactIntVecBuilder::new(0, &out_dir.join(COUNTS_FILE))
|
||||
.and_then(|b| b.close())
|
||||
let mut mb = PersistentCompactIntMatrixBuilder::new(0, &counts_dir)
|
||||
.map_err(OLMError::Io)?;
|
||||
mb.add_col().map_err(OLMError::Io)?.close().map_err(OLMError::Io)?;
|
||||
mb.close().map_err(OLMError::Io)?;
|
||||
return Ok(0);
|
||||
}
|
||||
let mphf = build_mphf(out_dir, n)?;
|
||||
let mut cnt = PersistentCompactIntVecBuilder::new(n, &out_dir.join(COUNTS_FILE))
|
||||
let mut mb = PersistentCompactIntMatrixBuilder::new(n, &counts_dir)
|
||||
.map_err(OLMError::Io)?;
|
||||
let mut col = mb.add_col().map_err(OLMError::Io)?;
|
||||
build_second_pass(out_dir, n, &mphf, &mut |slot, kmer| {
|
||||
cnt.set(slot, count_of(kmer));
|
||||
col.set(slot, count_of(kmer));
|
||||
Ok(())
|
||||
})?;
|
||||
cnt.close().map_err(OLMError::Io)?;
|
||||
col.close().map_err(OLMError::Io)?;
|
||||
mb.close().map_err(OLMError::Io)?;
|
||||
Ok(n)
|
||||
}
|
||||
|
||||
@@ -183,6 +199,49 @@ impl Layer<PersistentCompactIntVec> {
|
||||
}
|
||||
}
|
||||
|
||||
// ── Mode 3 — presence/absence matrix (1 column per genome) ───────────────────
|
||||
|
||||
impl Layer<PersistentBitMatrix> {
|
||||
pub fn build_presence(
|
||||
out_dir: &Path,
|
||||
n_genomes: usize,
|
||||
present_in: impl Fn(CanonicalKmer, usize) -> bool,
|
||||
) -> OLMResult<usize> {
|
||||
let unitigs = UnitigFileReader::open(&out_dir.join(UNITIGS_FILE))?;
|
||||
let n = unitigs.n_kmers();
|
||||
let presence_dir = out_dir.join(PRESENCE_DIR);
|
||||
if n == 0 {
|
||||
empty_layer(out_dir)?;
|
||||
let mut mb = PersistentBitMatrixBuilder::new(0, &presence_dir)
|
||||
.map_err(OLMError::Io)?;
|
||||
for _ in 0..n_genomes {
|
||||
mb.add_col().map_err(OLMError::Io)?.close().map_err(OLMError::Io)?;
|
||||
}
|
||||
mb.close().map_err(OLMError::Io)?;
|
||||
return Ok(0);
|
||||
}
|
||||
let mphf = build_mphf(out_dir, n)?;
|
||||
|
||||
let mut mb = PersistentBitMatrixBuilder::new(n, &presence_dir).map_err(OLMError::Io)?;
|
||||
let mut cols: Vec<_> = (0..n_genomes)
|
||||
.map(|_| mb.add_col().map_err(OLMError::Io))
|
||||
.collect::<OLMResult<_>>()?;
|
||||
|
||||
build_second_pass(out_dir, n, &mphf, &mut |slot, kmer| {
|
||||
for (g, col) in cols.iter_mut().enumerate() {
|
||||
col.set(slot, present_in(kmer, g));
|
||||
}
|
||||
Ok(())
|
||||
})?;
|
||||
|
||||
for col in cols {
|
||||
col.close().map_err(OLMError::Io)?;
|
||||
}
|
||||
mb.close().map_err(OLMError::Io)?;
|
||||
Ok(n)
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
#[path = "tests/layer.rs"]
|
||||
mod tests;
|
||||
|
||||
@@ -2,7 +2,7 @@ use std::collections::HashMap;
|
||||
use std::fs;
|
||||
use std::path::{Path, PathBuf};
|
||||
|
||||
use obicompactvec::PersistentCompactIntVec;
|
||||
use obicompactvec::PersistentCompactIntMatrix;
|
||||
use obikseq::CanonicalKmer;
|
||||
use obiskio::UnitigFileWriter;
|
||||
|
||||
@@ -96,13 +96,13 @@ impl LayeredMap<()> {
|
||||
}
|
||||
}
|
||||
|
||||
// ── Mode 2 — counts ───────────────────────────────────────────────────────────
|
||||
// ── Mode 2 — count matrix ─────────────────────────────────────────────────────
|
||||
|
||||
impl LayeredMap<PersistentCompactIntVec> {
|
||||
impl LayeredMap<PersistentCompactIntMatrix> {
|
||||
pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize> {
|
||||
let i = self.layers.len();
|
||||
let dir = layer_dir(&self.root, i);
|
||||
Layer::<PersistentCompactIntVec>::build(&dir, count_of)?;
|
||||
Layer::<PersistentCompactIntMatrix>::build(&dir, count_of)?;
|
||||
self.append_layer()?;
|
||||
Ok(i)
|
||||
}
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
use super::*;
|
||||
use obicompactvec::PersistentCompactIntVec;
|
||||
use obicompactvec::PersistentCompactIntMatrix;
|
||||
use obikseq::{set_k, Kmer, Sequence as _, Unitig};
|
||||
use tempfile::tempdir;
|
||||
|
||||
@@ -44,14 +44,14 @@ fn counts_are_stored_and_retrieved() {
     let kmers = all_canonical_kmers(dir.path(), 4);
     let count_map: HashMap<CanonicalKmer, u32> =
         kmers.iter().enumerate().map(|(i, &k)| (k, i as u32 + 1)).collect();
-    Layer::<PersistentCompactIntVec>::build(
+    Layer::<PersistentCompactIntMatrix>::build(
         dir.path(),
         |kmer| count_map.get(&kmer).copied().unwrap_or(0),
     ).unwrap();
-    let layer = Layer::<PersistentCompactIntVec>::open(dir.path()).unwrap();
+    let layer = Layer::<PersistentCompactIntMatrix>::open(dir.path()).unwrap();
     for kmer in &kmers {
         let hit = layer.query(*kmer).expect("kmer must be present");
-        assert_eq!(hit.data, count_map[kmer]);
+        assert_eq!(hit.data[0], count_map[kmer]);
     }
 }
 
@@ -71,10 +71,10 @@ fn open_after_build_is_consistent() {
     set_k(4);
     let dir = tempdir().unwrap();
     write_unitigs(dir.path(), &[b"AAAACGT"]);
-    let n = Layer::<PersistentCompactIntVec>::build(dir.path(), |_| 7).unwrap();
+    let n = Layer::<PersistentCompactIntMatrix>::build(dir.path(), |_| 7).unwrap();
     assert_eq!(n, 4);
-    let layer = Layer::<PersistentCompactIntVec>::open(dir.path()).unwrap();
+    let layer = Layer::<PersistentCompactIntMatrix>::open(dir.path()).unwrap();
     let kmer = Kmer::from_ascii(b"AAAA").unwrap().canonical();
     let hit = layer.query(kmer).expect("AAAA must be present");
-    assert_eq!(hit.data, 7);
+    assert_eq!(hit.data[0], 7);
 }

@@ -1,10 +1,10 @@
 use super::*;
-use obicompactvec::PersistentCompactIntVec;
+use obicompactvec::PersistentCompactIntMatrix;
 use obikseq::{set_k, Sequence as _, Unitig};
 use tempfile::tempdir;
 
 fn push_unitigs_and_layer(
-    map: &mut LayeredMap<PersistentCompactIntVec>,
+    map: &mut LayeredMap<PersistentCompactIntMatrix>,
     seqs: &[&[u8]],
     count: u32,
 ) {
@@ -33,10 +33,10 @@ fn open_reloads_layer_count() {
     set_k(4);
     let dir = tempdir().unwrap();
     {
-        let mut map = LayeredMap::<PersistentCompactIntVec>::create(dir.path()).unwrap();
+        let mut map = LayeredMap::<PersistentCompactIntMatrix>::create(dir.path()).unwrap();
         push_unitigs_and_layer(&mut map, &[b"AAAACGT"], 1);
     }
-    let map = LayeredMap::<PersistentCompactIntVec>::open(dir.path()).unwrap();
+    let map = LayeredMap::<PersistentCompactIntMatrix>::open(dir.path()).unwrap();
     assert_eq!(map.n_layers(), 1);
 }
 
@@ -44,37 +44,37 @@ fn open_reloads_layer_count() {
 fn query_finds_kmer_in_layer_zero() {
     set_k(4);
     let dir = tempdir().unwrap();
-    let mut map = LayeredMap::<PersistentCompactIntVec>::create(dir.path()).unwrap();
+    let mut map = LayeredMap::<PersistentCompactIntMatrix>::create(dir.path()).unwrap();
     push_unitigs_and_layer(&mut map, &[b"AAAACGT"], 3);
     let kmer = canonical(b"AAAC");
     let (layer_idx, hit) = map.query(kmer).expect("kmer must be found");
     assert_eq!(layer_idx, 0);
-    assert_eq!(hit.data, 3);
+    assert_eq!(hit.data[0], 3);
 }
 
 #[test]
 fn query_finds_kmer_in_correct_layer() {
     set_k(4);
     let dir = tempdir().unwrap();
-    let mut map = LayeredMap::<PersistentCompactIntVec>::create(dir.path()).unwrap();
+    let mut map = LayeredMap::<PersistentCompactIntMatrix>::create(dir.path()).unwrap();
     push_unitigs_and_layer(&mut map, &[b"AAAACGT"], 1);
     push_unitigs_and_layer(&mut map, &[b"GGGACGT"], 2);
     assert_eq!(map.n_layers(), 2);
 
     let (li, hit) = map.query(canonical(b"AAAA")).expect("AAAA must be found");
     assert_eq!(li, 0);
-    assert_eq!(hit.data, 1);
+    assert_eq!(hit.data[0], 1);
 
     let (li, hit) = map.query(canonical(b"GGGA")).expect("GGGA must be found");
     assert_eq!(li, 1);
-    assert_eq!(hit.data, 2);
+    assert_eq!(hit.data[0], 2);
 }
 
 #[test]
 fn query_absent_returns_none() {
     set_k(4);
     let dir = tempdir().unwrap();
-    let mut map = LayeredMap::<PersistentCompactIntVec>::create(dir.path()).unwrap();
+    let mut map = LayeredMap::<PersistentCompactIntMatrix>::create(dir.path()).unwrap();
     push_unitigs_and_layer(&mut map, &[b"AAAACGT"], 1);
     let absent = canonical(b"CCCC");
     assert!(map.query(absent).is_none());
@@ -84,7 +84,7 @@ fn query_absent_returns_none() {
 fn push_layer_from_map_convenience() {
     set_k(4);
     let dir = tempdir().unwrap();
-    let mut map = LayeredMap::<PersistentCompactIntVec>::create(dir.path()).unwrap();
+    let mut map = LayeredMap::<PersistentCompactIntMatrix>::create(dir.path()).unwrap();
     let mut w = map.next_layer_writer().unwrap();
     w.write(&Unitig::from_ascii(b"AAAACGT")).unwrap();
     w.close().unwrap();
@@ -93,5 +93,5 @@ fn push_layer_from_map_convenience() {
     ].into_iter().collect();
     map.push_layer_from_map(&counts).unwrap();
     let (_, hit) = map.query(canonical(b"AAAA")).unwrap();
-    assert_eq!(hit.data, 10);
+    assert_eq!(hit.data[0], 10);
 }