diff --git a/doc/implementation/mphf/index.html b/doc/implementation/mphf/index.html
index a20e083..56c46b0 100644
--- a/doc/implementation/mphf/index.html
+++ b/doc/implementation/mphf/index.html
@@ -654,22 +654,22 @@
@@ -1290,10 +1178,10 @@

obilayeredmap — layered kmer index crate

Purpose

-

obilayeredmap implements a persistent, incrementally extensible kmer index. The index is organised in three levels: collection → partition → layer. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a ptr_hash MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.

+

obilayeredmap implements a persistent, incrementally extensible kmer index. The index is organised in three levels: index root → partition → layer. Each layer covers a disjoint kmer set and wraps a ptr_hash MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.


-

Four usage modes

-

The MPHF + evidence infrastructure is fixed for all modes. The payload — data associated with each slot — is orthogonal and varies by mode.

+

Three usage modes

+

The MPHF + evidence infrastructure is the same for all modes. The payload varies.

@@ -1317,29 +1205,46 @@
 counts/ directory
-3. Presence/absence matrix
+3. Presence/absence
 which genomes contain each kmer | PersistentBitMatrix | presence/ directory
-4. Count matrix | occurrences per kmer per genome | PersistentCompactIntMatrix | counts/ directory
-

Both PersistentCompactIntMatrix and PersistentBitMatrix come from the obicompactvec crate. Mode 3 has a build path (Layer::<PersistentBitMatrix>::build_presence); mode 4 is not yet implemented.

-

Payload for modes 2/4: PersistentCompactIntMatrix

-

PersistentCompactIntMatrix is a column-major matrix stored in a directory: one col_NNNNNN.pciv file per column, plus a meta.json. Each column is a PersistentCompactIntVec — a mmap'd PCIV file with a u8 primary array (255 = overflow sentinel), a sorted overflow section of (slot: u64, value: u32) entries, and a sparse L1-fitting index.

-

Mode 2 writes 1 column per layer (one sample). Mode 4 writes G columns (one per genome). read(slot) returns Box<[u32]> — the full row across all columns.

-

Payload for mode 3: PersistentBitMatrix

-

PersistentBitMatrix is a column-major bit matrix stored in a directory: one col_NNNNNN.pbiv per genome, plus meta.json. Each column is a PersistentBitVec — a mmap'd PBIV file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). read(slot) returns Box<[bool]> — the presence vector across all genomes.

-

Column-major layout makes per-genome set operations cache-friendly; the full row is assembled on demand at query time.

+

Both PersistentCompactIntMatrix and PersistentBitMatrix come from the obicompactvec crate.


-

Payload architecture

-

The payload is orthogonal to the MPHF + evidence layer. Layer is parameterised by D: LayerData:

+

MphfLayer — autonomous kmer → slot mapping

+

MphfLayer encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.

+
pub struct MphfLayer {
+    mphf:     Mphf,
+    evidence: Evidence,
+    unitigs:  UnitigFileReader,
+    n:        usize,   // number of indexed kmers = number of MPHF slots
+}
+
+

Public API:

+
impl MphfLayer {
+    pub fn open(dir: &Path) -> OLMResult<Self>
+    pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>   // Some(slot) or None
+    pub fn n(&self) -> usize
+    pub fn unitig_writer(dir: &Path) -> OLMResult<UnitigFileWriter>
+    pub(crate) fn build(
+        dir: &Path,
+        fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
+    ) -> OLMResult<usize>
+}
+
+

find returns Some(slot) only after verifying via evidence that the kmer is actually indexed. It returns None for absent keys: ptr_hash maps any input to a valid slot, so evidence verification is the only reliable membership test.
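The difference between "maps to a slot" and "is indexed" can be illustrated with a toy sketch. This is not the crate's code: `ToyLayer`, its modulo hash and the `keys_by_slot` vector are hypothetical stand-ins for ptr_hash and for decoding evidence.bin against unitigs.bin.

```rust
/// Toy stand-in for an MPHF plus evidence. NOT ptr_hash.
struct ToyLayer {
    /// keys_by_slot[s] plays the role of "decode evidence for slot s".
    keys_by_slot: Vec<u64>,
}

impl ToyLayer {
    /// Stand-in hash: like a real MPHF, it returns a valid slot for ANY input.
    /// Callers must ensure the layer is non-empty.
    fn index(&self, key: u64) -> usize {
        (key % self.keys_by_slot.len() as u64) as usize
    }

    /// Some(slot) only if the key recorded for that slot equals the query.
    fn find(&self, key: u64) -> Option<usize> {
        if self.keys_by_slot.is_empty() {
            return None;
        }
        let slot = self.index(key);
        if self.keys_by_slot[slot] == key {
            Some(slot)
        } else {
            None
        }
    }
}
```

With the keys 21, 10 and 32 stored at slots 0, 1 and 2, the absent key 13 still hashes to slot 1; only the comparison against the stored key rejects it.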

+

build runs two sequential passes over unitigs.bin:

+
+  1. Pass 1: iterate all canonical kmers in parallel via rayon, construct and store mphf.bin. new_from_par_iter avoids materialising a full key Vec.
+  2. Pass 2: iterate again sequentially, fill evidence.bin, call fill_slot(slot, kmer) once per kmer for payload population. A compact n/8-byte seen-bitset verifies MPHF injectivity inline.
+

For empty layers (n = 0), build returns Ok(0) immediately after creating empty mphf.bin and evidence.bin.
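A minimal sketch of the seen-bitset check described for pass 2, assuming the slots arrive as a plain iterator. `verify_injective` is a hypothetical helper written for illustration, not a function of the crate, and it reports the offending slot instead of an OLMError.

```rust
/// Check that every slot is in 0..n and that no slot is produced twice,
/// using one bit per slot (about n/8 bytes). Returns Err(slot) on the first
/// out-of-range or duplicate slot. When exactly n keys are fed in, passing
/// this check means the mapping is a bijection onto 0..n.
fn verify_injective<I: IntoIterator<Item = usize>>(n: usize, slots: I) -> Result<(), usize> {
    let mut seen = vec![0u8; (n + 7) / 8];
    for slot in slots {
        if slot >= n {
            return Err(slot); // out of bounds
        }
        let byte = slot / 8;
        let bit = 1u8 << (slot % 8);
        if (seen[byte] & bit) != 0 {
            return Err(slot); // duplicate slot: not injective
        }
        seen[byte] |= bit;
    }
    Ok(())
}
```

For n = 0 the bitset is empty and an empty slot stream passes trivially, which matches the empty-layer case above.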

+
+

Layer<D: LayerData> — MPHF + payload

+

Layer<D> pairs an MphfLayer with one payload store.

pub trait LayerData: Sized {
     type Item;
     fn open(layer_dir: &Path) -> OLMResult<Self>;
@@ -1347,10 +1252,8 @@
 }
 
 pub struct Layer<D: LayerData = ()> {
-    mphf:     Mphf,
-    evidence: Evidence,
-    unitigs:  UnitigFileReader,
-    data:     D,
+    mphf: MphfLayer,
+    data: D,
 }
 
 pub struct Hit<T = ()> {
@@ -1358,8 +1261,7 @@
     pub data: T,
 }
 
-

LayerData covers the read path only (open + read). The write path (build) is intentionally not in the trait — build signatures differ between modes and forcing this into a trait would require an associated Context type with no benefit over specialized impl blocks.

-

Implemented concrete types:

+

LayerData covers the read path only (open + read). Build signatures differ between modes and are not in the trait.

@@ -1377,87 +1279,22 @@
 PersistentCompactIntMatrix | Box<[u32]>
-modes 2/4: one count per column
+mode 2: count matrix (one u32 per column per slot)
 PersistentBitMatrix | Box<[bool]>
-mode 3: one presence bit per column
+mode 3: presence matrix (one bit per genome per slot)
-

LayeredMap mirrors the same parameterisation: LayeredMap<D: LayerData = ()>.

-
-

Three-level hierarchy

-
index_root/                        ← LayeredMap (collection)
-  meta.json
-  part_00000/                      ← Partition
-    layer_0/                       ← Layer
-      mphf.bin
-      unitigs.bin
-      unitigs.bin.idx
-      evidence.bin
-      counts/              [modes 2/4]
-        meta.json          {"n": N, "n_cols": 1}
-        col_000000.pciv
-      presence/            [mode 3]
-        meta.json          {"n": N, "n_cols": G}
-        col_000000.pbiv
-        col_000001.pbiv
-        ...
-    layer_1/
-      ...
-  part_00001/
-    layer_0/
-    ...
-
-

Collection (index_root/): global metadata — kmer size k, number of partitions, layer count, sample registry.

-

Partition (part_XXXXX/): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in part_XXXXX. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in meta.json.

-

Layer (layer_N/): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer.

-
-

Layer file layout

-
layer_N/
-  mphf.bin            — ptr_hash MPHF (epserde, ptr_hash native format)
-  unitigs.bin         — packed 2-bit nucleotide sequences (obiskio binary format)
-  unitigs.bin.idx     — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
-  evidence.bin        — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
-  counts/             — [modes 2/4] PersistentCompactIntMatrix
-  presence/           — [mode 3] PersistentBitMatrix
-
-

unitigs.bin is the packed-2-bit sequence file produced by obiskio::UnitigFileWriter. The companion .idx file stores: magic UIDX, n_unitigs: u32, n_kmers: u64, seqls: [u8; n_unitigs] (kmer count − 1 per chunk), and packed_offsets: [u32; n_unitigs + 1] (byte offsets into unitigs.bin, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.

-

Evidence encoding

-

Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction.

-
slot s  →  unitig_id: u25  |  rank: u7
-
-

Packed into a u32 (29 bits used, 3 spare). Decoding:

-
kmer = unitigs[unitig_id][rank .. rank + k]   // 2-bit packed slice
-
-

rank is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k − m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A u7 (0–127) is sufficient.

-
-

ptr_hash configuration

-

The MPHF per layer is configured as:

-
type Mphf = PtrHash<
-    u64,                              // key type: canonical kmer raw encoding
-    CubicEps,                         // bucket fn: balanced (2.4 bits/key, λ=3.5)
-    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry vs 32 for Vec<u32>
-    Xx64,                             // hasher: XXH3-64 with seed, handles structured keys
-    Vec<u8>,                          // pilots
->;
-
-

Hasher choice — Xx64: k-mer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). FxHash (single multiply) distributes these poorly. Xx64 (XXH3 64-bit, seeded) handles structured input correctly.

-

Bucket function — CubicEps with PtrHashParams::<CubicEps>::default(): λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than Linear/λ=3.0 (the default_fast preset), 20% less space. default_compact (λ=4.0) saves a further 12.5% at 2× more construction time and reduced reliability — not chosen.

-

Remap — CachelineEfVec: Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for Vec<u32>). Already a transitive dependency of ptr_hash. One cacheline per query vs one u32 read; space win dominates for billion-scale key sets.

-
-

Build path

-

The build path is not part of LayerData. Each mode exposes its own impl Layer<D>::build with the exact signature it needs. Two private module-level helpers avoid code duplication:

-

build_mphf(out_dir, n) -> OLMResult<Mphf>: first pass — opens unitigs.bin, iterates all canonical kmers in parallel via new_from_par_iter, stores mphf.bin. O(n).

-

build_second_pass(out_dir, n, mphf, fill_slot) -> OLMResult<()>: second pass — opens unitigs.bin again, fills evidence.bin and a compact n/8-byte seen-bitset (MPHF correctness check inline), calls fill_slot(slot, kmer) once per kmer for the mode-specific payload. O(n).

+

Build signatures:

// mode 1
 impl Layer<()> {
     pub fn build(out_dir: &Path) -> OLMResult<usize>
 }
 
-// modes 2/4
+// mode 2
 impl Layer<PersistentCompactIntMatrix> {
     pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
     pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
@@ -1472,35 +1309,104 @@
     ) -> OLMResult<usize>
 }
 
-

Mode 2 creates a PersistentCompactIntMatrixBuilder with 1 column and fills it via build_second_pass. Mode 3 creates a PersistentBitMatrixBuilder with n_genomes columns and fills all columns in a single pass.

-

Any duplicate slot or out-of-bounds index detected during build_second_pass returns OLMError::Mphf. new_from_par_iter avoids materialising all keys as Vec<u64>.

+

All build impls delegate MPHF + evidence construction to MphfLayer::build via a mode-specific fill_slot callback. Before calling MphfLayer::build, mode 2 pre-reads n_kmers from the unitig index (unitigs.bin.idx) to size the PersistentCompactIntMatrixBuilder; mode 3 does the same for PersistentBitMatrixBuilder.

+
+

LayeredStore<S> and aggregation traits

+

LayeredStore<S> is a generic aggregation wrapper over Vec<S>. It propagates three traits from obicompactvec::traits up the hierarchy via blanket impls:

+
pub struct LayeredStore<S>(pub Vec<S>);
+
+impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {  }  // Σ col_weights across inner stores
+impl<S: CountPartials> CountPartials  for LayeredStore<S> {  }  // element-wise Σ partials
+impl<S: BitPartials>   BitPartials    for LayeredStore<S> {  }  // element-wise Σ partials
+
+

Because blanket impls compose, LayeredStore<LayeredStore<S>> automatically inherits all three traits when S does — providing the partitioned level without a separate type.

+

Aggregation hierarchy:

+
PersistentCompactIntMatrix                  implements CountPartials
+LayeredStore<PersistentCompactIntMatrix>         via blanket impl  (one partition)
+LayeredStore<LayeredStore<…>>                    via blanket impl  (partitioned index)
+
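How the blanket impl composes across nesting levels can be sketched with a simplified trait. The assumptions are loud here: this `ColumnWeights` returns a `Vec<u64>` instead of the `Array1<u64>` used by obicompactvec, and `Leaf` is a hypothetical store that simply holds precomputed weights.

```rust
/// Simplified stand-in for obicompactvec's ColumnWeights
/// (Vec<u64> here, Array1<u64> in the real trait).
trait ColumnWeights {
    fn col_weights(&self) -> Vec<u64>;
}

/// Generic aggregation wrapper over a vector of stores.
struct LayeredStore<S>(pub Vec<S>);

/// Blanket impl: element-wise sum of the inner stores' weights.
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {
    fn col_weights(&self) -> Vec<u64> {
        let mut total: Vec<u64> = Vec::new();
        for store in &self.0 {
            let w = store.col_weights();
            if total.len() < w.len() {
                total.resize(w.len(), 0);
            }
            for (t, x) in total.iter_mut().zip(w) {
                *t += x;
            }
        }
        total
    }
}

/// Hypothetical leaf store holding precomputed per-column weights.
struct Leaf(Vec<u64>);

impl ColumnWeights for Leaf {
    fn col_weights(&self) -> Vec<u64> {
        self.0.clone()
    }
}
```

Because `LayeredStore<Leaf>` satisfies the bound `S: ColumnWeights`, `LayeredStore<LayeredStore<Leaf>>` gets the same impl without any extra code, which is the property the partitioned level relies on.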
+

Leaf implementors (in obicompactvec):

+ + + + + + + + + + + + + + + + + +
Type | Traits
PersistentCompactIntMatrix | ColumnWeights (via sum()) + CountPartials
PersistentBitMatrix | ColumnWeights (via count_ones()) + BitPartials
+

PersistentCompactIntVec and PersistentBitVec do not implement these traits — they are single-column primitives, not matrix-level aggregators.

+

See Kmer index architecture for the full trait API and the two-pass normalised-metric pattern.

+
+

On-disk structure

+
index_root/                        ← LayeredMap (collection)
+  meta.json
+  part_00000/                      ← Partition
+    layer_0/                       ← Layer
+      mphf.bin           — ptr_hash MPHF (epserde format)
+      unitigs.bin        — packed 2-bit nucleotide sequences
+      unitigs.bin.idx    — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
+      evidence.bin       — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
+      counts/            [mode 2] PersistentCompactIntMatrix
+        meta.json          {"n": N, "n_cols": 1}
+        col_000000.pciv
+      presence/          [mode 3] PersistentBitMatrix
+        meta.json          {"n": N, "n_cols": G}
+        col_000000.pbiv
+        …
+    layer_1/
+      …
+  part_00001/
+    …
+
+

Partition (part_XXXXX/): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.

+

Layer (layer_N/): one MphfLayer plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.

+
+

Evidence encoding

+

evidence.bin is a flat [u32; n] array with no header. Each u32 encodes one slot:

+
bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
+bits [6:0]  = rank     (7 bits)  — kmer index within the chunk (0-based)
+
+

Decoding: chunk_id = raw >> 7, rank = raw & 0x7F. Reconstructing the kmer: read k nucleotides at position rank within unitig chunk_id.

+

For k=31, m=11, the observed maximum is ~46 kmers per chunk, well within the 128 ranks (0–127) that the 7-bit rank field can address. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.
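The bit layout above can be sketched as a pair of pack/unpack helpers plus a reader for the headerless little-endian array. The function names are hypothetical and chosen for illustration; only the shift, the 0x7F mask and the little-endian `[u32; n]` layout come from the description above.

```rust
/// Number of low bits holding the kmer rank within a chunk.
const RANK_BITS: u32 = 7;
/// Mask selecting the rank field (0x7F).
const RANK_MASK: u32 = (1 << RANK_BITS) - 1;
/// Largest chunk_id that fits in the upper 25 bits.
const MAX_CHUNK_ID: u32 = (1 << 25) - 1;

/// Pack (chunk_id, rank) into one evidence word; None if a field overflows.
fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Split an evidence word back into (chunk_id, rank).
fn unpack_evidence(raw: u32) -> (u32, u32) {
    (raw >> RANK_BITS, raw & RANK_MASK)
}

/// Read slot `s` from a headerless little-endian [u32; n] byte buffer.
/// Returns None if the slot lies outside the buffer.
fn read_slot(bytes: &[u8], s: usize) -> Option<u32> {
    let start = s.checked_mul(4)?;
    let end = start.checked_add(4)?;
    let w = bytes.get(start..end)?;
    Some(u32::from_le_bytes([w[0], w[1], w[2], w[3]]))
}
```

Rejecting an out-of-range rank at pack time matters because a silently truncated rank would still decode to a valid position and point at the wrong kmer.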

+
+

ptr_hash configuration

+
type Mphf = PtrHash<
+    u64,                              // key type: canonical kmer raw encoding
+    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
+    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
+    Xx64,                             // hasher: XXH3-64 with seed
+    Vec<u8>,                          // pilots
+>;
+
+

Xx64 is chosen over FxHash because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.

+

CubicEps with PtrHashParams::<CubicEps>::default() (λ=3.5) is a balanced tradeoff: 2× slower construction than Linear/λ=3.0, 20% less space.


Query path

-

A kmer query routes through all three levels:

-
-  1. Partition routing: hash canonical minimiser of the query kmer → partition index → open part_XXXXX/.
-  2. Layer probing: iterate layers in order; for each layer compute slot = mphf.index(kmer), decode evidence, compare to query. First match wins.
-  3. Data access: layer.data.read(slot) returns D::Item.
-
// pseudo-code
-fn query(kmer) -> Option<(usize, Hit<D::Item>)>:
-    for (i, layer) in self.layers.iter().enumerate():
-        slot = layer.mphf.index(&kmer.raw())
-        if layer.evidence.decode(slot) == kmer:
-            return Some((i, Hit { slot, data: layer.data.read(slot) }))
-    return None
+
pub fn query(&self, kmer: CanonicalKmer) -> Option<Hit<D::Item>> {
+    self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) })
+}
 
-

Expected probe depth: 1 for kmers in layer 0, increasing for later layers.

-

For mode 2, hit.data is Box<[u32]> with 1 element; hit.data[0] is the count. For mode 3, hit.data is Box<[bool]> with G elements, one per genome.

+

MphfLayer::find probes the MPHF, decodes evidence, and verifies the kmer — returning Some(slot) on match, None otherwise. data.read(slot) is called only on a confirmed hit.

+

In LayeredMap, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.
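The in-order probing can be sketched as follows. `LayerStub` is a hypothetical stand-in whose `find` plays the role of MphfLayer::find (a verified kmer to slot lookup), backed here by a HashMap rather than an MPHF; kmers are plain u64 values for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one layer: a verified kmer -> slot lookup.
struct LayerStub {
    slots: HashMap<u64, usize>,
}

impl LayerStub {
    /// Build a stub whose slots follow the order of `kmers`.
    fn from_kmers(kmers: &[u64]) -> LayerStub {
        LayerStub {
            slots: kmers.iter().enumerate().map(|(slot, &k)| (k, slot)).collect(),
        }
    }

    /// Some(slot) if the kmer is indexed in this layer, None otherwise.
    fn find(&self, kmer: u64) -> Option<usize> {
        self.slots.get(&kmer).copied()
    }
}

/// Probe layers in order; return (layer index, slot) of the first hit.
fn query(layers: &[LayerStub], kmer: u64) -> Option<(usize, usize)> {
    layers
        .iter()
        .enumerate()
        .find_map(|(i, layer)| layer.find(kmer).map(|slot| (i, slot)))
}
```

Since layers are disjoint, at most one layer can match; stopping at the first hit only saves probes and does not change the result.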


Add-layer algorithm

When adding dataset B to an existing index:

-  1. For each partition, iterate kmers of B routed to that partition.
-  2. Probe existing layers; collect kmers absent from all layers → B \ index.
-  3. Build a new layer from B \ index.
-  4. Append the new layer directory under each part_XXXXX/.
-  5. Update meta.json (layer count, sample registry).
+  1. For each partition, probe existing layers for kmers of B routed to that partition.
+  2. Collect kmers absent from all layers → B \ index.
+  3. Write B \ index to a new unitigs.bin via MphfLayer::unitig_writer.
+  4. Call Layer<D>::build on the new directory.
+  5. Update meta.json.

Each partition's new layer is built independently; the operation is fully parallel across partitions.
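The set-difference step for one partition can be sketched as below, under simplifying assumptions: existing layers are modelled as `HashSet<u64>` rather than MPHF probes, and `new_layer_kmers` is a hypothetical helper. Writing the result to unitigs.bin and building the layer are left out.

```rust
use std::collections::HashSet;

/// Kmers of `b` absent from every existing layer (B \ index),
/// returned sorted and deduplicated.
fn new_layer_kmers(existing_layers: &[HashSet<u64>], b: &[u64]) -> Vec<u64> {
    let mut fresh: Vec<u64> = b
        .iter()
        .copied()
        .filter(|k| !existing_layers.iter().any(|layer| layer.contains(k)))
        .collect();
    fresh.sort_unstable();
    fresh.dedup();
    fresh
}
```

The returned set is disjoint from every existing layer by construction, which is what keeps the layers of a partition disjoint after the new layer is appended.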


@@ -1515,19 +1421,19 @@
 ptr_hash 1.1
-MPHF per layer (epserde serialisation)
+MPHF per layer
 cacheline-ef 1.1
-compact remap storage inside ptr_hash
+compact remap inside ptr_hash
 epserde 0.8
-zero-copy serialisation of MPHF
+zero-copy MPHF serialisation
-memmap2
-mmap of layer files
+memmap2 0.9
+mmap of evidence and payload files
 obiskio
@@ -1535,21 +1441,18 @@
 obicompactvec
-payload types: PersistentCompactIntMatrix, PersistentBitMatrix
+payload types + aggregation traits
+
+
+rayon 1
+parallel MPHF construction pass
+
+
+ndarray 0.16
+aggregation output arrays
-
-

Relationship to target architecture

-

The target architecture (see Kmer index architecture) separates MphfLayer from data stores entirely and introduces a PartitionedIndex with parallel dispatch and an Aggregator pattern. The current implementation is a stepping stone: obicompactvec types are already fully decoupled from the MPHF; the remaining refactoring is within obilayeredmap itself.

-
-

Open questions

-
-  • Mode 4: count matrix (n_kmers × n_genomes × bytes_per_count) is structurally identical to mode 3 but uses PersistentCompactIntMatrix with G columns. Build API not yet implemented. Scale concern: hundreds of GB for large collections; a sparse representation may be required at high genome counts.
-  • Layer merge: merging two LayeredMap instances into a single-layer index requires full rebuild. Define API and cost model.
-  • Canonical kmer orientation: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
-  • try_new_from_par_iter: ptr_hash::new_from_par_iter silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A try_new_from_par_iter PR upstream would close this gap.
-
diff --git a/doc/implementation/persistent_bit_vec/index.html b/doc/implementation/persistent_bit_vec/index.html
index 21e6248..e1ae9f5 100644
--- a/doc/implementation/persistent_bit_vec/index.html
+++ b/doc/implementation/persistent_bit_vec/index.html
@@ -952,6 +952,45 @@
+  • Aggregation traits — obicompactvec::traits
@@ -1293,6 +1332,45 @@
+  • Aggregation traits — obicompactvec::traits
@@ -1520,6 +1598,27 @@ offset 16:
     fn read(&self, slot: usize) -> Box<[bool]> { self.row(slot) }
 }
    +
    +

    Aggregation traits — obicompactvec::traits

    +

    PersistentBitMatrix implements two aggregation traits used by LayeredStore<S> for cross-layer and cross-partition distance computations.

    +

    ColumnWeights

    +
    impl ColumnWeights for PersistentBitMatrix {
    +    fn col_weights(&self) -> Array1<u64>   // = self.count_ones()
    +}
    +
    +

    col_weights()[c] = number of set bits in column c across all slots.

    +

    BitPartials

    +
    impl BitPartials for PersistentBitMatrix {
    +    // Self-contained partials (additive across layers)
    +    fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>)   // (inter, union)
    +    fn partial_hamming(&self) -> Array2<u64>                   // differing bits
    +
    +    // Provided finalisations
    +    fn jaccard_dist_matrix(&self) -> Array2<f64>
    +    fn hamming_dist_matrix(&self) -> Array2<u64>
    +}
    +
    +

    partial_jaccard returns (inter, union) as a pair because union is not reconstructible from per-column count_ones() — it depends on both columns simultaneously. Both components are additively decomposable across (partition, layer) pairs; the final jaccard_dist_matrix() is computed from their element-wise sums.
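The additivity claim can be checked on a two-column toy case. This sketch works on plain `&[bool]` columns rather than PersistentBitMatrix, handles a single column pair instead of a full Array2, and picks 0.0 as the distance when the union is empty; all three are assumptions made for illustration.

```rust
/// Per-layer (intersection, union) counts for two presence columns.
fn partial_jaccard(a: &[bool], b: &[bool]) -> (u64, u64) {
    let mut inter_count = 0u64;
    let mut union_count = 0u64;
    for (&x, &y) in a.iter().zip(b) {
        inter_count += (x && y) as u64;
        union_count += (x || y) as u64;
    }
    (inter_count, union_count)
}

/// Jaccard distance from summed partials.
/// Returns 0.0 when both columns are empty (a convention chosen here).
fn jaccard_dist(inter_count: u64, union_count: u64) -> f64 {
    if union_count == 0 {
        0.0
    } else {
        1.0 - inter_count as f64 / union_count as f64
    }
}
```

Summing the per-layer pairs and finalising once gives the same distance as running `partial_jaccard` on the concatenated columns, because every slot contributes to exactly one layer.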

diff --git a/doc/implementation/persistent_compact_int_vec/index.html b/doc/implementation/persistent_compact_int_vec/index.html
index 5eea493..f37e659 100644
--- a/doc/implementation/persistent_compact_int_vec/index.html
+++ b/doc/implementation/persistent_compact_int_vec/index.html
@@ -907,6 +907,45 @@
+  • Aggregation traits — obicompactvec::traits
@@ -1259,6 +1298,45 @@
+  • Aggregation traits — obicompactvec::traits
@@ -1535,6 +1613,40 @@ step = ⌈n_overflow / 2048⌉ otherwise
     fn read(&self, slot: usize) -> Box<[u32]> { self.row(slot) }
 }
    +
    +

    Aggregation traits — obicompactvec::traits

    +

    PersistentCompactIntMatrix implements two aggregation traits used by LayeredStore<S> for cross-layer and cross-partition distance computations.

    +

    ColumnWeights

    +
    impl ColumnWeights for PersistentCompactIntMatrix {
    +    fn col_weights(&self) -> Array1<u64>   // = self.sum()
    +}
    +
    +

    col_weights()[c] = sum of all values in column c across all slots.

    +

    CountPartials

    +
    impl CountPartials for PersistentCompactIntMatrix {
    +    // Self-contained partials (additive across layers, no external parameter)
    +    fn partial_bray(&self)                                      -> Array2<u64>
    +    fn partial_euclidean(&self)                                 -> Array2<f64>
    +    fn partial_threshold_jaccard(&self, threshold: u32)         -> (Array2<u64>, Array2<u64>)
    +
    +    // Normalised partials (require global col_weights across all layers/partitions)
    +    fn partial_relfreq_bray(&self, global: &Array1<u64>)        -> Array2<f64>
    +    fn partial_relfreq_euclidean(&self, global: &Array1<u64>)   -> Array2<f64>
    +    fn partial_hellinger(&self, global: &Array1<u64>)           -> Array2<f64>
    +
    +    // Provided finalisations (default implementations on the trait)
    +    fn bray_dist_matrix(&self)                                  -> Array2<f64>
    +    fn euclidean_dist_matrix(&self)                             -> Array2<f64>
    +    fn threshold_jaccard_dist_matrix(&self, threshold: u32)     -> Array2<f64>
    +    fn relfreq_bray_dist_matrix(&self)                          -> Array2<f64>
    +    fn relfreq_euclidean_dist_matrix(&self)                     -> Array2<f64>
    +    fn hellinger_dist_matrix(&self)                             -> Array2<f64>
    +}
    +
    +

    Self-contained partials are additively decomposable: summing partial_bray() across all (partition, layer) pairs and finalising gives the same result as computing on the combined data.

    +

    Normalised partials require the global column weights (sum across all layers and all partitions). The global parameter must reflect the complete index, not a per-layer sum. The provided relfreq_bray_dist_matrix() etc. call col_weights() first (pass 1) then the normalised partial (pass 2); when called on a LayeredStore<LayeredStore<…>> these two-pass calls cascade automatically through the blanket impls.

    +

    partial_bray returns Array2<u64> (sum_min only, not a tuple). The denominator is always reconstructible as col_weights()[i] + col_weights()[j].
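Why sum_min alone suffices can be sketched for a single column pair. The finalisation below assumes the standard Bray-Curtis formula, 1 − 2·Σmin / (w_i + w_j); the source does not spell out the formula used by bray_dist_matrix, so treat it as an assumption. The helper names and the plain `&[u32]` columns are likewise illustrative.

```rust
/// Per-layer partial: sum over slots of min(a, b) for two count columns.
fn partial_bray(a: &[u32], b: &[u32]) -> u64 {
    a.iter().zip(b).map(|(&x, &y)| u64::from(x.min(y))).sum()
}

/// Column weight: sum of all counts in the column.
fn col_weight(col: &[u32]) -> u64 {
    col.iter().map(|&v| u64::from(v)).sum()
}

/// Bray-Curtis distance from summed partials (assumed standard formula).
/// Returns 0.0 when both columns are empty (a convention chosen here).
fn bray_dist(sum_min: u64, w_i: u64, w_j: u64) -> f64 {
    let denom = w_i + w_j;
    if denom == 0 {
        0.0
    } else {
        1.0 - 2.0 * sum_min as f64 / denom as f64
    }
}
```

Both the numerator (sum_min) and the two column weights are plain sums over slots, so each can be accumulated layer by layer and combined only at the end.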

    +

    partial_threshold_jaccard returns (inter, union) as a pair because union[i,j] is not reconstructible from per-column statistics — it depends on both columns simultaneously.

diff --git a/doc/implementation/pipeline/index.html b/doc/implementation/pipeline/index.html
index f95b9a7..a38914c 100644
--- a/doc/implementation/pipeline/index.html
+++ b/doc/implementation/pipeline/index.html
@@ -1188,23 +1188,23 @@ branching / dead-end → unitig start or end

    Output: unitigs.bin — the permanent evidence structure for the partition. Each kmer in the partition appears at exactly one (unitig_id, offset) location.

    Scope of local unitigs: these are unitigs of the partition's local de Bruijn graph, not global unitigs. A kmer whose k-1 successor or predecessor falls in another partition appears as a dead end locally and terminates the unitig. This does not affect correctness of verification but means partition-local unitigs cannot be directly reused for global assembly.

    Phase 6 — MPHF construction and index finalisation

    -

    Built once on the definitive kmer set (all kmers in all unitigs of the partition):

    +

    Built once on the definitive kmer set (all kmers in all unitigs of the partition). See obilayeredmap and MPHF selection for the current implementation.

    kmers from unitigs → MPHF → mphf.bin
    -                   → counts.bin : packed n-bit array (or 1-bit for presence mode)
    -                   → refs.bin   : u32 nucleotide offset into unitigs.bin per kmer
    +                   → evidence.bin : n × u32, each = (chunk_id: 25 bits | rank: 7 bits)
    +                   → payload      : counts/ (mode 2) or presence/ (mode 3)
     
    -

    The MPHF is built once — no rebuild. The n-bit width for counts.bin is chosen from the observed count distribution (n=5 covers ~97% of kmers at 15x; n=1 for presence mode). Counts exceeding 2ⁿ−1 go into overflow.bin as sorted (mphf_index: u32, count: u32) pairs.

    +

    The MPHF is built in two passes over unitigs.bin: parallel pass for mphf.bin, sequential pass for evidence.bin and payload. The exact kmer count is available from the unitig index (unitigs.bin.idx) before the passes begin.

    Exact verification via unitig evidence:

    -

    unitigs.bin serves as the evidence structure: for any query kmer, the stored unitig provides the ground truth to confirm or deny its presence. The MPHF maps every input to [0, N) including absent kmers — the unitig read-back is the only way to guarantee exactness.

    +

    unitigs.bin serves as the evidence structure. The MPHF maps every input to [0, N) including absent kmers — the unitig read-back (via evidence.bin) is the only correct membership test.

    query kmer q
    -  → canonical_minimizer(q) → hash → PART → part_XXXX/
    -  → MPHF(q) → index i
    -  → refs[i] = (unitig_id, kmer_offset)
    -  → read unitig from unitigs.bin → extract kmer at kmer_offset → compare with q
    -  → match   : return counts[i]   ← exact hit
    -  → no match: kmer absent        ← MPHF collision on absent kmer
    +  → canonical_minimizer(q) → hash → PART → part_XXXXX/
    +  → MPHF(q) → slot s
    +  → evidence[s] = (chunk_id, rank)
    +  → read k nucleotides at rank in unitigs[chunk_id] → compare with q
    +  → match   : return payload[s]   ← exact hit
    +  → no match: kmer absent         ← MPHF collision on absent kmer
     
    -

    One random disk access into unitigs.bin per query; the unitig is the minimal, non-redundant evidence (each kmer stored once). superkmers.bin.gz is no longer needed at this point and can be deleted.

    +

    superkmers.bin.gz is no longer needed at this point and can be deleted.


diff --git a/doc/implementation/storage/index.html b/doc/implementation/storage/index.html
index d216fcb..543a74d 100644
--- a/doc/implementation/storage/index.html
+++ b/doc/implementation/storage/index.html
@@ -575,24 +575,6 @@
@@ -610,58 +592,6 @@
@@ -944,47 +874,6 @@
@@ -1001,86 +890,8 @@

    On-disk collection structure

    -

    Collections are too large to hold in RAM (hundreds of genomes, billions of kmers). The collection lives on disk as a directory of memory-mapped files:

    -
    collection/
    -  metadata.toml          — collection parameters (see below)
    -  part_XXXX/
    -    superkmers.bin.gz    — dereplicated super-kmers for this partition (construction artifact)
    -    mphf.bin             — minimal perfect hash function for this partition
    -    counts.bin           — packed n-bit count array (or 1-bit presence array)
    -    refs.bin             — back-references u32 nucleotide offset into unitigs.bin per kmer
    -    unitigs.bin          — local de Bruijn unitigs (permanent evidence structure)
    -    overflow.bin         — counts exceeding the packed range (optional)
    -
    -

    superkmers.bin.gz is produced during phase 1 and consumed through phases 2–4. It can be deleted after phase 5 — it is not needed for querying. The permanent query structure is mphf.bin + counts.bin + refs.bin + unitigs.bin.

    -

    Collection parameters

    -

    Stored in metadata.toml:

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Parameter | Role
    k | kmer length
    m | minimizer length (odd, < k)
    p | partition bits (0 ≤ p ≤ min(14, 2m−16))
    mode | presence (1 bit/kmer) or count (n bits/kmer)
    n | bits per kmer in count mode (chosen at construction)
    min_count | singleton filtering threshold (0 = keep all)
    hash_fn | hash function identifier
    hash_seed | seed for the hash function
    -

    Count storage

    -

    refs.bin capacity: unitigs.bin is a flat 2-bit-packed nucleotide stream with no separators. Each entry in refs.bin is a u32 nucleotide offset pointing to the first base of the kmer. A u32 covers 4 billion nucleotide positions = 1 GB of sequence per partition. In the worst case (all unitigs of length 1 kmer, offsets spaced k apart), this supports 4 billion / k ≈ 130 million kmers per partition at k=31. In the typical case (long unitigs, consecutive kmers at offset +1), the limit approaches 4 billion kmers — well beyond any realistic partition size.

    -

    Presence mode (coverage ≤ 1x, or when only presence/absence matters):

    - -

    Count mode (coverage > 1x):

    - -

    Query protocol

    -
    query kmer q
    -  → canonical_minimizer(q) → hash → PART → part_XXXX/
    -  → MPHF(q) → index i
    -  → refs[i] = (unitig_id, kmer_offset)
    -  → read unitig from unitigs.bin → extract kmer at kmer_offset → compare with q
    -  → match   : return counts[i]
    -  → no match: kmer absent
    -
    +

    See obilayeredmap crate for the current on-disk layout.

    +

    The index root contains one part_XXXXX/ directory per partition, each holding one or more layer_N/ directories. Each layer directory contains mphf.bin, unitigs.bin, unitigs.bin.idx, evidence.bin, and optionally a counts/ or presence/ payload directory.

diff --git a/doc/implementation/unitig_evidence/index.html b/doc/implementation/unitig_evidence/index.html
index 1ad5be1..3cf38b7 100644
--- a/doc/implementation/unitig_evidence/index.html
+++ b/doc/implementation/unitig_evidence/index.html
@@ -428,10 +428,10 @@