feat: introduce column-major matrix storage and migrate layered map
Introduces `PersistentBitMatrix` and `PersistentCompactIntMatrix` to replace single-file vector storage with a column-major, directory-based layout. Each column is persisted as an individual file alongside a lightweight `meta.json` for dimension tracking. Migrates `obilayeredmap` to use these multi-column structures, updating Rust APIs, query return types, and build signatures. Includes comprehensive documentation, unit and integration tests for persistence and accessors, and refactors distance calculation helpers.
This commit is contained in:
@@ -10,26 +10,26 @@
|
||||
|
||||
The MPHF + evidence infrastructure is fixed for all modes. The **payload** — data associated with each slot — is orthogonal and varies by mode.
|
||||
|
||||
| Mode | Description | Payload type | File |
|
||||
| Mode | Description | Payload type | Storage |
|
||||
|---|---|---|---|
|
||||
| 1. Set | membership test only | `()` | — |
|
||||
| 2. Set with count | occurrences per kmer per sample | `PersistentCompactIntVec` | `counts.pciv` |
|
||||
| 3. Presence/absence matrix | which genomes contain each kmer | `PersistentBitVec` per genome | `presence_N.pbiv` |
|
||||
| 4. Count matrix | occurrences per kmer per genome | `PersistentCompactIntVec` per genome | `counts_N.pciv` |
|
||||
| 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory |
|
||||
| 3. Presence/absence matrix | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |
|
||||
| 4. Count matrix | occurrences per kmer per genome | `PersistentCompactIntMatrix` | `counts/` directory |
|
||||
|
||||
Both `PersistentCompactIntVec` and `PersistentBitVec` come from the `obicompactvec` crate. Modes 3 and 4 are not yet implemented; the per-genome multi-file layout and query API remain to be designed.
|
||||
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate. Mode 3 has a build path (`Layer::<PersistentBitMatrix>::build_presence`); mode 4 is not yet implemented.
|
||||
|
||||
### Payload for mode 2: PersistentCompactIntVec
|
||||
### Payload for modes 2/4: PersistentCompactIntMatrix
|
||||
|
||||
`PersistentCompactIntVec` (PCIV) stores one `u32` count per MPHF slot in a single mmap'd `.pciv` file. Its encoding: a primary `u8` array (value 255 = overflow sentinel) backed by a sorted overflow section of `(slot: u64, value: u32)` entries and a sparse L1-fitting index for fast binary search. This handles the geometric count distribution efficiently — most values fit in 1 byte, overflow entries are rare.
|
||||
`PersistentCompactIntMatrix` is a column-major matrix stored in a directory: one `col_NNNNNN.pciv` file per column, plus a `meta.json`. Each column is a `PersistentCompactIntVec` — a mmap'd PCIV file with a `u8` primary array (255 = overflow sentinel), a sorted overflow section of `(slot: u64, value: u32)` entries, and a sparse L1-fitting index.
|
||||
|
||||
Capacity: 0 to u32::MAX per slot. No separate decision needed on bit-width: PCIV adapts to the data.
|
||||
Mode 2 writes 1 column per layer (one sample). Mode 4 writes G columns (one per genome). `read(slot)` returns `Box<[u32]>` — the full row across all columns.
|
||||
|
||||
### Payload for mode 3/4: PersistentBitVec / PersistentCompactIntVec
|
||||
### Payload for mode 3: PersistentBitMatrix
|
||||
|
||||
`PersistentBitVec` (PBIV) stores one bit per MPHF slot in a mmap'd `.pbiv` file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). One PBIV per genome gives a column-major presence/absence matrix, making per-genome set operations cache-friendly.
|
||||
`PersistentBitMatrix` is a column-major bit matrix stored in a directory: one `col_NNNNNN.pbiv` per genome, plus `meta.json`. Each column is a `PersistentBitVec` — a mmap'd PBIV file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). `read(slot)` returns `Box<[bool]>` — the presence vector across all genomes.
|
||||
|
||||
Mode 4 replaces PBIV with PCIV per genome. Multi-file layout and query API are not yet designed.
|
||||
Column-major layout makes per-genome set operations cache-friendly; the full row is assembled on demand at query time.
|
||||
|
||||
---
|
||||
|
||||
@@ -57,14 +57,15 @@ pub struct Hit<T = ()> {
|
||||
}
|
||||
```
|
||||
|
||||
`LayerData` covers the **read path only** (`open` + `read`). The write path (build) is intentionally not in the trait — build signatures differ between modes (mode 1 takes no extra argument, mode 2 takes a `count_of` closure) and forcing this into a trait would require an associated `Context` type with no benefit over specialized `impl` blocks.
|
||||
`LayerData` covers the **read path only** (`open` + `read`). The write path (build) is intentionally not in the trait — build signatures differ between modes and forcing this into a trait would require an associated `Context` type with no benefit over specialized `impl` blocks.
|
||||
|
||||
Implemented concrete types:
|
||||
|
||||
| Type | `Item` | Description |
|
||||
|---|---|---|
|
||||
| `()` | `()` | mode 1 — membership only |
|
||||
| `PersistentCompactIntVec` | `u32` | mode 2 — per-slot count |
|
||||
| `PersistentCompactIntMatrix` | `Box<[u32]>` | modes 2/4 — one count per column |
|
||||
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — one presence bit per column |
|
||||
|
||||
`LayeredMap` mirrors the same parameterisation: `LayeredMap<D: LayerData = ()>`.
|
||||
|
||||
@@ -81,8 +82,14 @@ index_root/ ← LayeredMap (collection)
|
||||
unitigs.bin
|
||||
unitigs.bin.idx
|
||||
evidence.bin
|
||||
counts.pciv [mode 2 only]
|
||||
presence_N.pbiv [mode 3/4, one per genome — not yet implemented]
|
||||
counts/ [modes 2/4]
|
||||
meta.json {"n": N, "n_cols": 1}
|
||||
col_000000.pciv
|
||||
presence/ [mode 3]
|
||||
meta.json {"n": N, "n_cols": G}
|
||||
col_000000.pbiv
|
||||
col_000001.pbiv
|
||||
...
|
||||
layer_1/
|
||||
...
|
||||
part_00001/
|
||||
@@ -106,7 +113,8 @@ layer_N/
|
||||
unitigs.bin — packed 2-bit nucleotide sequences (obiskio binary format)
|
||||
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
|
||||
evidence.bin — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
|
||||
counts.pciv — [mode 2] PersistentCompactIntVec: one u32 per slot
|
||||
counts/ — [modes 2/4] PersistentCompactIntMatrix
|
||||
presence/ — [mode 3] PersistentBitMatrix
|
||||
```
|
||||
|
||||
`unitigs.bin` is the packed-2-bit sequence file produced by `obiskio::UnitigFileWriter`. The companion `.idx` file stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
|
||||
@@ -165,13 +173,24 @@ impl Layer<()> {
|
||||
pub fn build(out_dir: &Path) -> OLMResult<usize>
|
||||
}
|
||||
|
||||
// mode 2
|
||||
impl Layer<PersistentCompactIntVec> {
|
||||
// modes 2/4
|
||||
impl Layer<PersistentCompactIntMatrix> {
|
||||
pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
|
||||
pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
|
||||
}
|
||||
|
||||
// mode 3
|
||||
impl Layer<PersistentBitMatrix> {
|
||||
pub fn build_presence(
|
||||
out_dir: &Path,
|
||||
n_genomes: usize,
|
||||
present_in: impl Fn(CanonicalKmer, usize) -> bool,
|
||||
) -> OLMResult<usize>
|
||||
}
|
||||
```
|
||||
|
||||
Mode 2 creates a `PersistentCompactIntMatrixBuilder` with 1 column and fills it via `build_second_pass`. Mode 3 creates a `PersistentBitMatrixBuilder` with `n_genomes` columns and fills all columns in a single pass.
|
||||
|
||||
Any duplicate slot or out-of-bounds index detected during `build_second_pass` returns `OLMError::Mphf`. `new_from_par_iter` avoids materialising all keys as `Vec<u64>`.
|
||||
|
||||
---
|
||||
@@ -196,6 +215,8 @@ fn query(kmer) -> Option<(usize, Hit<D::Item>)>:
|
||||
|
||||
Expected probe depth: 1 for kmers in layer 0, increasing for later layers.
|
||||
|
||||
For mode 2, `hit.data` is `Box<[u32]>` with 1 element; `hit.data[0]` is the count. For mode 3, `hit.data` is `Box<[bool]>` with G elements, one per genome.
|
||||
|
||||
---
|
||||
|
||||
## Add-layer algorithm
|
||||
@@ -221,15 +242,13 @@ Each partition's new layer is built independently; the operation is fully parall
|
||||
| `epserde 0.8` | zero-copy serialisation of MPHF |
|
||||
| `memmap2` | mmap of layer files |
|
||||
| `obiskio` | unitig file writer/reader |
|
||||
| `obicompactvec` | payload types: `PersistentCompactIntVec`, `PersistentBitVec` |
|
||||
| `obicompactvec` | payload types: `PersistentCompactIntMatrix`, `PersistentBitMatrix` |
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
- **Mode 3/4 multi-file layout**: one PBIV/PCIV per genome per layer means O(n_layers × n_genomes) files. Directory layout, open strategy, and query API are not yet designed.
|
||||
- **Mode 4 scale**: count matrix (n_kmers × n_genomes × bytes_per_count) reaches hundreds of GB for large collections. A sparse representation may be required; access pattern and density threshold are not yet defined.
|
||||
- **Presence matrix layout**: column-major (one PBIV per genome) favours per-genome operations; row-major favours per-kmer queries. Dominant access pattern not yet characterised.
|
||||
- **Mode 4**: count matrix (n_kmers × n_genomes × bytes_per_count) is structurally identical to mode 3 but uses `PersistentCompactIntMatrix` with G columns. Build API not yet implemented. Scale concern: hundreds of GB for large collections — a sparse representation may be required at high genome counts.
|
||||
- **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model.
|
||||
- **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
|
||||
- **`try_new_from_par_iter`**: `ptr_hash::new_from_par_iter` silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A `try_new_from_par_iter` PR upstream would close this gap.
|
||||
|
||||
Reference in New Issue
Block a user