Add persistent compact integer vector and cache-line-optimized MPHF

Introduce the `obicompactvec` crate, featuring a two-tier, memory-mapped integer vector that uses a primary `u8` array with a sentinel for overflow dispatch and a sparse L1-resident index for fast random access. Implement builder and reader modules with zero-copy serialization and comprehensive test coverage. Update `obilayeredmap` to replace the default hash function with a cache-line-optimized `Mphf`, adding explicit bounds checking and duplicate-slot detection. Add documentation for both modules and update project configuration files accordingly.
2026-05-13 06:24:43 +08:00
parent 84ed752b78
commit f2de79acde
14 changed files with 710 additions and 91 deletions
@@ -6,6 +6,59 @@

 ---

+## Four usage modes
+
+The MPHF + evidence infrastructure is fixed for all modes. The **payload** — data associated with each slot — is orthogonal and varies by mode.
+
+| Mode | Description | Payload |
+|---|---|---|
+| 1. Set | membership test only | none |
+| 2. Set with count | occurrences per kmer per sample | compact integer vector |
+| 3. Presence/absence matrix | which genomes contain each kmer | bit matrix n_kmers × n_genomes |
+| 4. Count matrix | occurrences per kmer per genome | integer matrix n_kmers × n_genomes |
+
+Mode 4 is architecturally identical to mode 3 but with counts instead of bits; the main open question is scale — a naive dense representation reaches hundreds of GB for large genome collections, which may require a sparse encoding.
+
+### Payload for mode 2: compact count vector
+
+Counts follow a roughly geometric distribution: the large majority are below 128, almost all below 16 000, with rare large values in highly covered regions. Options for a mmap-compatible random-access count vector:
+
+- **`Vec<u16>`** — 2 bytes/slot, covers 0–65 535, O(1) access, trivially mmap-able. Practical upper bound sufficient for most metagenomic use cases.
+- **Bit-packed fixed width** — choose B = ⌈log₂(max\_count)⌉ globally (e.g. B=14 for 99.9% coverage at 1.75 bytes/slot). O(1) access via bit-shift arithmetic.
+- **Block-varint (PForDelta, StreamVByte)** — good compression but random access requires a separate offset index; no mature Rust crate for mmap use.
+
+Decision not yet made. `Vec<u16>` is the default baseline pending profiling.
+
+### Payload for mode 3: presence/absence matrix
+
+Column-major bit matrix: column j (genome j) is a contiguous `n_slots`-bit array. This layout makes per-genome operations (union, intersection, diff) cache-friendly. For 10⁹ slots × 100 genomes ≈ 12.5 GB — large but mmap-able with no resident-memory cost at open time.
+
+---
+
+## Payload architecture
+
+The payload is orthogonal to the MPHF + evidence layer. `Layer` will be parameterised by a payload type `D`:
+
+```rust
+struct Layer<D = ()> {
+    mphf:     Mphf,
+    evidence: Evidence,
+    unitigs:  UnitigFileReader,
+    data:     D,
+}
+```
+
+Where `D` implements a `LayerData` trait covering open/close/get. Concrete instances:
+
+- `()` — mode 1
+- `CountVec` — mode 2 (u16 or bit-packed)
+- `BitMatrix` — mode 3
+- `CountMatrix` — mode 4
+
+This parameterisation is not yet implemented; the current code uses a fixed `Counts` field.
+
+---
+
 ## Three-level hierarchy

 ```
@@ -15,14 +68,14 @@ index_root/                        ← LayeredMap (collection)
    layer_0/                       ← Layer
      mphf.bin
      unitigs.bin
+      unitigs.bin.idx
      evidence.bin
-      counts.bin
-      presence.bin
+      counts.bin          [mode 2 only]
+      presence.bin        [mode 3/4 only]
    layer_1/
      ...
  part_00001/
    layer_0/
-    layer_1/
    ...
 ```

@@ -38,19 +91,15 @@ index_root/                        ← LayeredMap (collection)

 ```
 layer_N/
-  mphf.bin            — ptr_hash MPHF (exact key count, phase-2 construction)
-  unitigs.bin         — packed 2-bit nucleotide sequences (concatenated, variable-length)
-  unitig_offsets.bin  — u32 per unitig: nucleotide offset of unitig j in unitigs.bin
+  mphf.bin            — ptr_hash MPHF (epserde, ptr_hash native format)
+  unitigs.bin         — packed 2-bit nucleotide sequences (obiskio binary format)
+  unitigs.bin.idx     — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
  evidence.bin        — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
-  counts.bin          — u32 per MPHF slot (total kmer occurrences)
-  presence.bin        — bit matrix: n_slots × n_samples  [optional]
+  counts.bin          — [optional] u16 per MPHF slot (mode 2)
+  presence.bin        — [optional] bit matrix n_slots × n_samples (mode 3/4)
 ```

-Unitigs have variable lengths. Each record in `unitigs.bin` is self-delimiting: it begins with a varint `seql` (sequence length in nucleotides) followed by `(seql+3)/4` packed bytes — the streaming format defined in `obiskio`. Sequential scan is always possible using the in-record `seql`.
-
-For O(1) random access, `unitig_offsets.bin` is a **precomputed derived index**: a u32 array of byte offsets into `unitigs.bin`, with n_unitigs + 1 entries (sentinel = total byte size). Built once at construction by a single sequential scan; reconstructible from `unitigs.bin` if lost. Access: `unitigs.bin[offsets[j] .. offsets[j+1]]`.
-
-All files except `mphf.bin` are flat arrays of fixed-size elements, serialised with **rkyv** for zero-copy mmap access. `mphf.bin` uses ptr_hash's native serialisation; rkyv integration is deferred (see open questions).
+`unitigs.bin` is the packed-2-bit sequence file produced by `obiskio::UnitigFileWriter`. The companion `.idx` file stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.

 ### Evidence encoding

@@ -68,9 +117,29 @@ kmer = unitigs[unitig_id][rank .. rank + k]   // 2-bit packed slice

 `rank` is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k − m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A `u7` (0–127) is sufficient.

-### Presence/absence matrix
+---

-Column-major bit matrix: column j (sample j) is a contiguous `n_slots`-bit array. This layout makes per-sample operations (union, intersection, diff over a column) cache-friendly. For large matrices (e.g. 10 M slots × 1 000 samples ≈ 1.25 GB per partition), rkyv + mmap avoids loading the full matrix at open time.
+## ptr_hash configuration
+
+The MPHF per layer is configured as:
+
+```rust
+type Mphf = PtrHash<
+    u64,                          // key type: canonical kmer raw encoding
+    CubicEps,                     // bucket fn: balanced (2.4 bits/key, λ=3.5)
+    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry vs 32 for Vec<u32>
+    Xx64,                         // hasher: XXH3-64 with seed, handles structured keys
+    Vec<u8>,                      // pilots
+>;
+```
+
+**Hasher choice — `Xx64`:** k-mer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). `FxHash` (single multiply) distributes these poorly. `Xx64` (XXH3 64-bit, seeded) handles structured input correctly.
+
+**Bucket function — `CubicEps` with `PtrHashParams::<CubicEps>::default()`:** λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than `Linear/λ=3.0` (the `default_fast` preset), 20% less space. `default_compact` (λ=4.0) saves a further 12.5% at 2× more construction time and reduced reliability — not chosen.
+
+**Remap — `CachelineEfVec`:** Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for `Vec<u32>`). Already a transitive dependency of `ptr_hash`. One cacheline per query vs one u32 read; space win dominates for billion-scale key sets.
+
+**Construction:** `new_from_par_iter` avoids materialising all keys as `Vec<u64>`. MPHF correctness is verified inline during the second pass (evidence/counts fill) using an n/8-byte bitset; any duplicate slot or out-of-bounds index returns `OLMError::Mphf`.

 ---

@@ -79,14 +148,14 @@ Column-major bit matrix: column j (sample j) is a contiguous `n_slots`-bit array
 A kmer query routes through all three levels:

 1. **Partition routing**: hash canonical minimiser of the query kmer → partition index → open `part_XXXXX/`.
-2. **Layer probing**: iterate layers in order within the partition; for each layer compute `slot = mphf.query(kmer)`, then verify `evidence.decode(slot) == kmer`. First match wins.
-3. **Data access**: read `counts[slot]` and/or `presence[slot]` from the matching layer.
+2. **Layer probing**: iterate layers in order within the partition; for each layer compute `slot = mphf.index(kmer)`, then verify `evidence.decode(slot) == kmer`. First match wins.
+3. **Data access**: read payload at `slot` from the matching layer (counts, presence row, etc.).

 ```
 fn query(kmer) → Option<Hit>:
    part = partition_of(kmer)
    for (i, layer) in part.layers.iter().enumerate():
-        slot = layer.mphf.query(kmer)
+        slot = layer.mphf.index(&kmer.raw())
        if layer.evidence.decode(slot) == kmer:
            return Some(Hit { layer: i, slot })
    return None
@@ -102,7 +171,7 @@ When adding dataset B to an existing index:

 1. For each partition, iterate kmers of B routed to that partition.
 2. Probe existing layers; collect kmers absent from all layers → `B \ index`.
-3. Build a new layer from `B \ index` using the two-phase pipeline (FMPHGO provisional → ptr_hash definitive).
+3. Build a new layer from `B \ index`.
 4. Append the new layer directory under each `part_XXXXX/`.
 5. Update `meta.json` (layer count, sample registry).

@@ -110,59 +179,25 @@ Each partition's new layer is built independently; the operation is fully parall

 ---

-## Core API (sketch)
-
-```rust
-// Open an existing index
-let map = LayeredMap::open(path)?;
-
-// Query a canonical kmer across all partitions and layers
-match map.query(kmer) {
-    Some(hit) => {
-        let count   = hit.count();
-        let present = hit.presence_row();  // bit slice over samples
-    }
-    None => { /* absent */ }
-}
-
-// Non-destructive extension with a new dataset
-// unitigs produced by the two-phase pipeline, one per partition
-let layer_idx = map.add_layer(unitigs_dir, counts_dir, presence_path)?;
-```
-
---
-
 ## Dependencies

 | crate | role |
 |---|---|
-| `ptr_hash` | phase-2 MPHF per layer |
-| `ph` (FMPHGO) | phase-1 provisional MPHF during layer construction |
-| `rkyv` | zero-copy serialisation of flat arrays (evidence, counts, presence) |
+| `ptr_hash 1.1` | MPHF per layer (epserde serialisation) |
+| `cacheline-ef 1.1` | compact remap storage inside ptr_hash |
+| `epserde 0.8` | zero-copy serialisation of MPHF |
 | `memmap2` | mmap of layer files |
-| `bitm` | bit-packed presence matrix |
+| `obiskio` | unitig file writer/reader |

---
-
-## Serialisation strategy
-
-All flat arrays use `rkyv::Archive`:
-
-```rust
-#[derive(Archive, Serialize, Deserialize)]
-struct Evidence { slots: Vec<u32> }   // packed (unitig_id: 25 | rank: 7)
-
-#[derive(Archive, Serialize, Deserialize)]
-struct Counts { data: Vec<u32> }
-```
-
-At open time, each file is mmapped and cast to its archived type — no allocation, no copy. The MPHF is loaded via ptr_hash's own API; a rkyv wrapper is a future refinement.
+Flat arrays (evidence, counts, presence) use a simple custom binary format (raw bytes, fixed element size) opened via mmap — no additional serialisation crate required.

 ---

 ## Open questions

- **ptr_hash + rkyv**: ptr_hash's internals are flat bit arrays; a rkyv-compatible wrapper is structurally feasible. Assess upstream willingness or implement a thin newtype wrapper.
- **Presence matrix layout**: column-major favours per-sample operations; row-major favours per-kmer queries. Decide based on dominant access pattern.
+- **Mode 4 scale**: count matrix (n_kmers × n_genomes × bytes_per_count) reaches hundreds of GB for large collections. A sparse representation (store only non-zero entries) may be required; the access pattern and density threshold are not yet defined.
+- **Count vector encoding (mode 2)**: `Vec<u16>` vs bit-packed (B=14). Decision pending; depends on whether counts > 65 535 occur in practice for the target datasets.
+- **Presence matrix layout**: column-major favours per-genome operations; row-major favours per-kmer queries. Decide based on dominant access pattern.
 - **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model; maintenance operation, not query-path.
 - **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
+- **`try_new_from_par_iter`**: `ptr_hash::new_from_par_iter` silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A `try_new_from_par_iter` PR upstream would close this gap.