diff --git a/doc/404.html b/doc/404.html index 9091b5c..0b196ed 100644 --- a/doc/404.html +++ b/doc/404.html @@ -632,6 +632,90 @@ + + + + + + +
  • + + + + + + + + obilayeredmap crate + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentCompactIntVec + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentBitVec + + + + + + + + +
  • + + + + @@ -714,6 +798,34 @@ + + + + + + +
  • + + + + + + + + Kmer index + + + + + + + + +
  • + + + + diff --git a/doc/architecture/index_architecture/index.html b/doc/architecture/index_architecture/index.html new file mode 100644 index 0000000..43ccb55 --- /dev/null +++ b/doc/architecture/index_architecture/index.html @@ -0,0 +1,1816 @@ + + + + + + + + + + + + + + + + + + + + + + + + Kmer index - obikmer + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + + Skip to content + + +
    +
    + +
    + + + + + + +
    + + +
    + +
    + + + + + + +
    +
    + + + +
    +
    +
    + + + + + +
    +
    +
    + + + + + + + +
    + +
    + + + + + +

    Kmer index architecture

    +

    Fundamental invariant

    +

    A given canonical kmer belongs to exactly one partition and exactly one layer within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.

    +
    +

    Three-level hierarchy

    +
    PartitionedIndex
    +├── LayeredPartition  (one per minimiser bucket)
    +│   ├── MphfLayer 0         kmer → slot  (immutable bijection)
    +│   │   ├── DataStore A     slot → T     (e.g. counts)
    +│   │   └── DataStore B     slot → T     (e.g. presence/absence, derived)
    +│   ├── MphfLayer 1
    +│   │   └── DataStore A
    +│   └── ...
    +├── LayeredPartition
    +│   └── ...
    +
    +

    PartitionedIndex: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.

    +

    LayeredPartition: one directory per minimiser bucket. Holds a Vec<MphfLayer>. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.

    +

    MphfLayer: the MPHF + evidence + unitig spine. Maps kmer → slot for its disjoint kmer set. Immutable once built. Independent of any data attached to it.

    +

    DataStore: a slot-indexed data array (e.g. PersistentCompactIntMatrix, PersistentBitMatrix). Attached to a MphfLayer externally. Multiple stores of different types can coexist on the same MphfLayer.

    +
    +

    MphfLayer — autonomous mapping layer

    +
    MphfLayer::find(kmer: CanonicalKmer) -> Option<usize>   // slot, or None if absent
    +MphfLayer::n() -> usize                                  // number of slots
    +MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
    +MphfLayer::open(dir: &Path) -> OLMResult<Self>
    +
    +

    find returns Some(slot) only if the kmer is actually in this layer (evidence check included). Returns None for kmers present in other layers or absent from the index.

    +

    The MPHF (mphf.bin, evidence.bin, unitigs.bin) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same MphfLayer.

    +
    +

    DataStore — slot-indexed data

    +
    trait DataStore {
    +    type Item;
    +    fn get(&self, slot: usize) -> Self::Item;
    +    fn n(&self) -> usize;
    +}
    +
    +

    Concrete types from obicompactvec:

    + + + + + + + + + + + + + + + + + + + + + + + +
    TypeItemColumn statsUse
    PersistentCompactIntMatrixBox<[u32]>sum() -> Array1<u64>count per sample per slot
    PersistentBitMatrixBox<[bool]>count_ones() -> Array1<u64>presence per sample per slot
    +

    sum() and count_ones() are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.

    +

    A DataStore knows nothing about kmers or MPHFs. It is indexed by usize slot only.

    +
    +

    Distance matrix API on DataStore types

    +

    Both PersistentCompactIntMatrix and PersistentBitMatrix expose two families of distance matrix methods.

    +

    Full distance matrices

    +

    Compute the final n_cols × n_cols distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.

    +
    // PersistentCompactIntMatrix
    +fn bray_dist_matrix(&self)              -> Array2<f64>
    +fn relfreq_bray_dist_matrix(&self)      -> Array2<f64>
    +fn euclidean_dist_matrix(&self)         -> Array2<f64>
    +fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
    +fn hellinger_dist_matrix(&self)         -> Array2<f64>
    +fn jaccard_dist_matrix(&self)           -> Array2<f64>
    +fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>
    +
    +// PersistentBitMatrix
    +fn jaccard_dist_matrix(&self)           -> Array2<f64>
    +fn hamming_dist_matrix(&self)           -> Array2<u64>
    +
    +

    These are convenience methods. For a LayeredDataStore or PartitionedDataStore they cannot be used directly — the partial API is required.

    +

    Partial distance matrices

    +

    Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.

    +

    Category 1 — self-contained partials: additive without any external parameter.

    +
    // PersistentCompactIntMatrix
    +fn partial_bray_dist_matrix(&self)
    +    -> (Array2<u64>,  // sum_min[i,j]
    +        Array1<u64>)  // col_sums[k]
    +
    +fn partial_euclidean_dist_matrix(&self)       -> Array2<f64>   // sum of squared diffs
    +fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
    +    -> (Array2<u64>,  // inter[i,j]
    +        Array2<u64>)  // union[i,j]
    +
    +// PersistentBitMatrix
    +fn partial_jaccard_dist_matrix(&self)
    +    -> (Array2<u64>,  // inter[i,j]
    +        Array2<u64>)  // union[i,j]
    +fn partial_hamming_dist_matrix(&self)         -> Array2<u64>   // differing bits
    +
    +

    Category 2 — normalised partials: require global column sums as input, computed beforehand across all (partition, layer) pairs.

    +
    // PersistentCompactIntMatrix only
    +fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
    +    -> Array2<f64>   // Σ_slot min(a_slot/sum_i, b_slot/sum_j)
    +
    +fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    +    -> Array2<f64>   // Σ_slot (a_slot/sum_i - b_slot/sum_j)²
    +
    +fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    +    -> Array2<f64>   // Σ_slot (√(a/sum_i) - √(b/sum_j))²
    +
    +

    The col_sums parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below.

    +
    +

    Progressive aggregation principle

    +

    Aggregation is hierarchical: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.

    +
    PersistentCompactIntMatrix::sum()       — column sums for one (partition, layer) matrix
    +        ↓ Σ across layers
    +LayeredCompactIntMatrix::sum()          — column sums for one partition
    +        ↓ Σ across partitions
    +PartitionedCompactIntMatrix::sum()      — global column sums
    +
    +

    The same cascade applies to every partial computation:

    +
    PersistentCompactIntMatrix::partial_bray_dist_matrix()   — one (partition, layer)
    +        ↓ element-wise Σ across layers
    +LayeredCompactIntMatrix::partial_bray()                   — one partition
    +        ↓ element-wise Σ across partitions
    +PartitionedCompactIntMatrix::partial_bray()               — global partial → final dist
    +
    +

    This means LayeredCompactIntMatrix never inspects individual PersistentCompactIntVec columns directly, and PartitionedCompactIntMatrix never inspects individual layers. Each level presents a stable API surface to the level above.

    +
    +

    LayeredDataStore — aggregation within one partition

    +

    A LayeredDataStore holds one DataStore per layer within a single partition:

    +
    struct LayeredCompactIntMatrix { layers: Vec<PersistentCompactIntMatrix> }
    +struct LayeredBitMatrix         { layers: Vec<PersistentBitMatrix> }
    +
    +

    Column statistics

    +
    // LayeredCompactIntMatrix
    +fn sum(&self) -> Array1<u64>
    +    // = layers.par_iter().map(|m| m.sum()).reduce(element-wise +)
    +
    +// LayeredBitMatrix
    +fn count_ones(&self) -> Array1<u64>
    +    // = layers.par_iter().map(|m| m.count_ones()).reduce(element-wise +)
    +
    +

    Self-contained partials

    +

    Each method reduces across layers by element-wise addition of per-layer matrices:

    +
    fn partial_bray(&self)          -> (Array2<u64>, Array1<u64>)
    +    // Σ_l layer_l.partial_bray_dist_matrix()
    +
    +fn partial_euclidean(&self)      -> Array2<f64>
    +    // Σ_l layer_l.partial_euclidean_dist_matrix()
    +
    +fn partial_jaccard(&self)        -> (Array2<u64>, Array2<u64>)
    +    // Σ_l layer_l.partial_jaccard_dist_matrix()  [bit matrix]
    +    // Σ_l layer_l.partial_threshold_jaccard_dist_matrix()  [int matrix]
    +
    +fn partial_hamming(&self)        -> Array2<u64>
    +    // Σ_l layer_l.partial_hamming_dist_matrix()  [bit matrix]
    +
    +

    Normalised partials (require global sums from above)

    +
    fn partial_relfreq_bray(&self, global_sums: &Array1<u64>) -> Array2<f64>
    +    // Σ_l layer_l.partial_relfreq_bray_dist_matrix(global_sums)
    +
    +fn partial_relfreq_euclidean(&self, global_sums: &Array1<u64>) -> Array2<f64>
    +    // Σ_l layer_l.partial_relfreq_euclidean_dist_matrix(global_sums)
    +
    +fn partial_hellinger(&self, global_sums: &Array1<u64>) -> Array2<f64>
    +    // Σ_l layer_l.partial_hellinger_euclidean_dist_matrix(global_sums)
    +
    +

    global_sums is provided by the PartitionedDataStore; this level does not compute it.

    +
    +

    PartitionedDataStore — aggregation across all partitions

    +

    A PartitionedDataStore holds one LayeredDataStore per partition:

    +
    struct PartitionedCompactIntMatrix { partitions: Vec<LayeredCompactIntMatrix> }
    +struct PartitionedBitMatrix         { partitions: Vec<LayeredBitMatrix> }
    +
    +

    Column statistics

    +
    fn sum(&self) -> Array1<u64>
    +    // = partitions.par_iter().map(|p| p.sum()).reduce(element-wise +)
    +
    +

    p.sum() is itself a reduction across layers (see above) — the cascade is preserved.

    +

    Self-contained metrics — single pass

    +
    fn bray_dist_matrix(&self) -> Array2<f64> {
    +    let (sum_min, col_sums) = partitions
    +        .par_iter()
    +        .map(|p| p.partial_bray())
    +        .reduce(element-wise +);
    +    // finalise
    +    for (i,j): dist[i,j] = 1 - 2·sum_min[i,j] / (col_sums[i] + col_sums[j])
    +}
    +
    +

    Normalised metrics — two passes

    +
    fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
    +    // pass 1 — progressive: PartitionedDataStore::sum()
    +    //   calls LayeredDataStore::sum() per partition (parallel)
    +    //     calls PersistentCompactIntMatrix::sum() per layer (parallel)
    +    let global_sums = self.sum();
    +
    +    // pass 2 — per-partition partial using global_sums (parallel)
    +    let matrix = partitions
    +        .par_iter()
    +        .map(|p| p.partial_relfreq_bray(&global_sums))
    +        .reduce(element-wise +);
    +    // finalise
    +    for (i,j): dist[i,j] = 1 - matrix[i,j]
    +}
    +
    +

    global_sums is exact because each kmer belongs to exactly one (partition, layer) pair — no double-counting. Pass 1 is itself fully parallel at every level of the hierarchy.

    +
    +

    Parallelism model

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    LevelUnitCoordination
    Across partitionsLayeredDataStorenone — fully independent
    Across layers (self-contained)(partition, layer) pairnone — disjoint kmer sets
    Across layers (normalised, pass 1)(partition, layer) pairnone — sums are additive
    Across layers (normalised, pass 2)(partition, layer) pairglobal_sums broadcast read-only
    Within a DataStore (distance matrix)upper-triangle pair (i,j)none — rayon par_iter
    +
    +

    Query model

    +

    Point query — kmer → Option<Item>

    +
    minimiser(kmer) → partition p
    +for each layer l in p:
    +    slot = MphfLayer_l.find(kmer)
    +    if slot is Some:
    +        return DataStore_l.get(slot)
    +return None
    +
    +

    O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer).

    +

    Aggregation — → Result

    +
    result = reduce(
    +    for p in partitions:            // parallel
    +        for l in layers(p):         // parallel
    +            partial(DataStore_p_l)
    +)
    +
    +

    For normalised metrics replace with the two-pass scheme above.

    +
    +

    DataStore derivation

    +

    Because the MphfLayer is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:

    +
    // count → presence/absence, parallel across (partition, layer)
    +for (p, l) in all_partition_layer_pairs().par_iter():
    +    count_store   = open PersistentCompactIntMatrix at (p, l)
    +    presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
    +
    +

    Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one (partition, layer) pair.

    +
    +

    Relationship to current implementation

    +

    The current obilayeredmap crate implements a subset of this architecture. Key divergences:

    +
      +
    • Layer<D: LayerData> fuses MphfLayer and one DataStore into a single generic type. Multiple data stores on the same MPHF are not supported.
    • +
    • LayerData::open(dir) embeds the path convention (counts/, presence/) inside the store type, preventing the PartitionedIndex from managing paths externally.
    • +
    • LayeredDataStore and PartitionedDataStore do not yet exist; LayeredMap is a single-partition structure without a distance matrix API.
    • +
    • The partial distance methods exist on PersistentCompactIntMatrix and PersistentBitMatrix and are tested; they are not yet composed across layers and partitions.
    • +
    +

    Planned refactoring: +1. Extract MphfLayer from Layer<D> as an autonomous type. +2. Replace LayerData trait with DataStore trait (no path knowledge). +3. Implement LayeredCompactIntMatrix / LayeredBitMatrix with the partial + full distance APIs described above. +4. Implement PartitionedCompactIntMatrix / PartitionedBitMatrix with two-pass support for normalised metrics. +5. Implement PartitionedIndex for point queries with parallel dispatch.

    + + + + + + + + + + + + + +
    +
    + + + +
    + +
    + + + +
    +
    +
    +
    + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/doc/architecture/sequences/invariant/index.html b/doc/architecture/sequences/invariant/index.html index 21b5634..c754d77 100644 --- a/doc/architecture/sequences/invariant/index.html +++ b/doc/architecture/sequences/invariant/index.html @@ -9,9 +9,11 @@ - + + + @@ -639,6 +641,90 @@ + + + + + + +
  • + + + + + + + + obilayeredmap crate + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentCompactIntVec + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentBitVec + + + + + + + + +
  • + + + + @@ -733,6 +819,34 @@ + + + + + + +
  • + + + + + + + + Kmer index + + + + + + + + +
  • + + + + diff --git a/doc/implementation/chunkreader/index.html b/doc/implementation/chunkreader/index.html index e48d405..b638090 100644 --- a/doc/implementation/chunkreader/index.html +++ b/doc/implementation/chunkreader/index.html @@ -347,6 +347,42 @@ + + + +
  • + + + + + obilayeredmap crate + + + + + +
  • +
  • + + + + + PersistentCompactIntVec + + + + + +
  • +
  • + + + + + PersistentBitVec + + +
  • @@ -385,6 +421,18 @@ + + + +
  • + + + + + Kmer index + + +
  • diff --git a/doc/implementation/kmer/index.html b/doc/implementation/kmer/index.html index 30326c5..91f0ea8 100644 --- a/doc/implementation/kmer/index.html +++ b/doc/implementation/kmer/index.html @@ -745,6 +745,90 @@ + + + + + + +
  • + + + + + + + + obilayeredmap crate + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentCompactIntVec + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentBitVec + + + + + + + + +
  • + + + + @@ -827,6 +911,34 @@ + + + + + + +
  • + + + + + + + + Kmer index + + + + + + + + +
  • + + + + diff --git a/doc/implementation/mphf/index.html b/doc/implementation/mphf/index.html index d3e525c..a20e083 100644 --- a/doc/implementation/mphf/index.html +++ b/doc/implementation/mphf/index.html @@ -745,6 +745,89 @@ + + +
  • + + + + Multilayer index architecture + + + + + +
  • @@ -795,6 +878,90 @@ + + + + + + +
  • + + + + + + + + obilayeredmap crate + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentCompactIntVec + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentBitVec + + + + + + + + +
  • + + + + @@ -877,6 +1044,34 @@ + + + + + + +
  • + + + + + + + + Kmer index + + + + + + + + +
  • + + + + @@ -1002,6 +1197,89 @@ + + +
  • + + + + Multilayer index architecture + + + + + +
  • @@ -1071,7 +1349,7 @@

    MPHF choice per phase

    Phase 1 (provisional, discarded after spectrum computation): FMPHGO. Tolerates overestimated capacity, compact, no need to optimise for query speed on a temporary structure.

    -

    Phase 2 (persistent, queried repeatedly): open between FMPHGO and ptr_hash. Exact key count is available, so both operate optimally. ptr_hash's query speed advantage (2.1–3.3×) is meaningful for the persistent index but carries the risk of a very young crate. FMPHGO is the conservative default; ptr_hash is worth revisiting once it has broader production use.

    +

    Phase 2 (persistent, queried repeatedly): ptr_hash. Exact key count is available at phase 2, so ptr_hash operates optimally. Its query speed (≥2.1× over FMPHGO) and construction speed (≥3.1×) are meaningful for the persistent index; the space overhead at 2.4 bits/key is acceptable. The crate's youth (Feb 2025) was previously a concern; it is now accepted given the performance profile and the fact that each layer MPHF is independently rebuildable from its unitig file if needed.

    boomphf is effectively eliminated: its space overhead is the largest and its streaming-construction advantage does not apply here.


    Space at scale

    @@ -1106,12 +1384,47 @@

    On-disk and mmap considerations

    All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the ph crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1–8 MB, the OS page cache handles this transparently — strict zero-copy is a refinement, not a blocker.

    No established Rust crate provides a natively on-disk MPHF. SSHash (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.

    +
    +

    Multilayer index architecture

    +

    Motivation

    +

    An index built from a single dataset A can be extended with a new dataset B without rebuilding. This supports incremental construction (adding species, samples, or sequencing runs) and enables set operations across heterogeneous sources.

    +

    Layer structure

    +

    Each layer is a self-contained unit:

    +
    layer_i/
    +  unitigs.bin     — packed 2-bit nucleotide sequences
    +  mphf.bin        — ptr_hash index (phase-2, exact key count)
    +  evidence.bin    — [(unitig_id, rank)] per MPHF slot  (see unitig_evidence.md)
    +  counts.bin      — [u32] per MPHF slot
    +
    +

    Layers are disjoint: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B proceeds as follows:

    +
      +
    1. For each kmer in B: query layer 0 — if found, accumulate count into counts_0[MPHF_0(kmer)].
    2. +
    3. Collect all kmers of B not present in any existing layer → set B \ A.
    4. +
    5. Build layer 1 from B \ A using the standard two-phase pipeline (spectrum, filter, ptr_hash).
    6. +
    +

    Adding a third dataset C repeats the process: probe layer 0, then layer 1, then build layer 2 from C \ A \ B.

    +

    Membership verification

    +

    ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from (unitig_id, rank) and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.

    +

    This makes the evidence layer load-bearing for correctness, not only for locality.

    +

    Query algorithm

    +
    fn query(kmer) → Option<count>:
    +    for layer in layers:
    +        slot = layer.mphf.query(kmer)
    +        if layer.evidence.decode(slot) == kmer:
    +            return Some(layer.counts[slot])
    +    return None
    +
    +

    Expected probe depth: 1 for kmers present in layer 0, increasing for rare kmers added in later layers. In practice, the dominant dataset (largest A) should be layer 0 to minimise average probe depth.

    +

    Layer count and probe cost

    +

    Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode (two array accesses). For L layers the worst case is L probes + 1 None. In practice L is small (2–5 for typical multi-species databases). No global data structure is needed to route queries; the layer chain is traversed in order.

    +

    Merging layers

    +

    Two layer chains can be merged by re-indexing their union through the standard pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.

    Open questions

    diff --git a/doc/implementation/obilayeredmap/index.html b/doc/implementation/obilayeredmap/index.html new file mode 100644 index 0000000..f203e82 --- /dev/null +++ b/doc/implementation/obilayeredmap/index.html @@ -0,0 +1,1611 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + obilayeredmap crate - obikmer + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + + Skip to content + + +
    +
    + +
    + + + + + + +
    + + +
    + +
    + + + + + + +
    +
    + + + +
    +
    +
    + + + + + +
    +
    +
    + + + + + + + +
    + +
    + + + + + +

    obilayeredmap — layered kmer index crate

    +

    Purpose

    +

    obilayeredmap implements a persistent, incrementally extensible kmer index. The index is organised in three levels: collection → partition → layer. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a ptr_hash MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.

    +
    +

    Four usage modes

    +

    The MPHF + evidence infrastructure is fixed for all modes. The payload — data associated with each slot — is orthogonal and varies by mode.

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    ModeDescriptionPayload typeStorage
    1. Setmembership test only()
    2. Countoccurrences per kmer per samplePersistentCompactIntMatrixcounts/ directory
    3. Presence/absence matrixwhich genomes contain each kmerPersistentBitMatrixpresence/ directory
    4. Count matrixoccurrences per kmer per genomePersistentCompactIntMatrixcounts/ directory
    +

    Both PersistentCompactIntMatrix and PersistentBitMatrix come from the obicompactvec crate. Mode 3 has a build path (Layer::<PersistentBitMatrix>::build_presence); mode 4 is not yet implemented.

    +

    Payload for modes 2/4: PersistentCompactIntMatrix

    +

    PersistentCompactIntMatrix is a column-major matrix stored in a directory: one col_NNNNNN.pciv file per column, plus a meta.json. Each column is a PersistentCompactIntVec — a mmap'd PCIV file with a u8 primary array (255 = overflow sentinel), a sorted overflow section of (slot: u64, value: u32) entries, and a sparse L1-fitting index.

    +

    Mode 2 writes 1 column per layer (one sample). Mode 4 writes G columns (one per genome). read(slot) returns Box<[u32]> — the full row across all columns.

    +

    Payload for mode 3: PersistentBitMatrix

    +

    PersistentBitMatrix is a column-major bit matrix stored in a directory: one col_NNNNNN.pbiv per genome, plus meta.json. Each column is a PersistentBitVec — a mmap'd PBIV file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). read(slot) returns Box<[bool]> — the presence vector across all genomes.

    +

    Column-major layout makes per-genome set operations cache-friendly; the full row is assembled on demand at query time.

    +
    +

    Payload architecture

    +

    The payload is orthogonal to the MPHF + evidence layer. Layer is parameterised by D: LayerData:

    +
    pub trait LayerData: Sized {
    +    type Item;
    +    fn open(layer_dir: &Path) -> OLMResult<Self>;
    +    fn read(&self, slot: usize) -> Self::Item;
    +}
    +
    +pub struct Layer<D: LayerData = ()> {
    +    mphf:     Mphf,
    +    evidence: Evidence,
    +    unitigs:  UnitigFileReader,
    +    data:     D,
    +}
    +
    +pub struct Hit<T = ()> {
    +    pub slot: usize,
    +    pub data: T,
    +}
    +
    +

    LayerData covers the read path only (open + read). The write path (build) is intentionally not in the trait — build signatures differ between modes and forcing this into a trait would require an associated Context type with no benefit over specialized impl blocks.

    +

    Implemented concrete types:

    + + + + + + + + + + + + + + + + + + + + + + + + + +
    TypeItemDescription
    ()()mode 1 — membership only
    PersistentCompactIntMatrixBox<[u32]>modes 2/4 — one count per column
    PersistentBitMatrixBox<[bool]>mode 3 — one presence bit per column
    +

    LayeredMap mirrors the same parameterisation: LayeredMap<D: LayerData = ()>.

    +
    +

    Three-level hierarchy

    +
    index_root/                        ← LayeredMap (collection)
    +  meta.json
    +  part_00000/                      ← Partition
    +    layer_0/                       ← Layer
    +      mphf.bin
    +      unitigs.bin
    +      unitigs.bin.idx
    +      evidence.bin
    +      counts/              [modes 2/4]
    +        meta.json          {"n": N, "n_cols": 1}
    +        col_000000.pciv
    +      presence/            [mode 3]
    +        meta.json          {"n": N, "n_cols": G}
    +        col_000000.pbiv
    +        col_000001.pbiv
    +        ...
    +    layer_1/
    +      ...
    +  part_00001/
    +    layer_0/
    +    ...
    +
    +

    Collection (index_root/): global metadata — kmer size k, number of partitions, layer count, sample registry.

    +

    Partition (part_XXXXX/): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in part_XXXXX. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in meta.json.

    +

    Layer (layer_N/): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer.

    +
    +

    Layer file layout

    +
    layer_N/
    +  mphf.bin            — ptr_hash MPHF (epserde, ptr_hash native format)
    +  unitigs.bin         — packed 2-bit nucleotide sequences (obiskio binary format)
    +  unitigs.bin.idx     — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
    +  evidence.bin        — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
    +  counts/             — [modes 2/4] PersistentCompactIntMatrix
    +  presence/           — [mode 3] PersistentBitMatrix
    +
    +

    unitigs.bin is the packed-2-bit sequence file produced by obiskio::UnitigFileWriter. The companion .idx file stores: magic UIDX, n_unitigs: u32, n_kmers: u64, seqls: [u8; n_unitigs] (kmer count − 1 per chunk), and packed_offsets: [u32; n_unitigs + 1] (byte offsets into unitigs.bin, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.

    +

    Evidence encoding

    +

    Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction.

    +
    slot s  →  unitig_id: u25  |  rank: u7
    +
    +

    Packed into a u32 (29 bits used, 3 spare). Decoding:

    +
    kmer = unitigs[unitig_id][rank .. rank + k]   // 2-bit packed slice
    +
    +

    rank is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k − m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A u7 (0–127) is sufficient.

    +
    +

    ptr_hash configuration

    +

    The MPHF per layer is configured as:

    +
    type Mphf = PtrHash<
    +    u64,                              // key type: canonical kmer raw encoding
    +    CubicEps,                         // bucket fn: balanced (2.4 bits/key, λ=3.5)
    +    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry vs 32 for Vec<u32>
    +    Xx64,                             // hasher: XXH3-64 with seed, handles structured keys
    +    Vec<u8>,                          // pilots
    +>;
    +
    +

    Hasher choice — Xx64: k-mer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). FxHash (single multiply) distributes these poorly. Xx64 (XXH3 64-bit, seeded) handles structured input correctly.

    +

    Bucket function — CubicEps with PtrHashParams::<CubicEps>::default(): λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than Linear/λ=3.0 (the default_fast preset), 20% less space. default_compact (λ=4.0) saves a further 12.5% at 2× more construction time and reduced reliability — not chosen.

    +

    Remap — CachelineEfVec: Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for Vec<u32>). Already a transitive dependency of ptr_hash. One cacheline per query vs one u32 read; space win dominates for billion-scale key sets.

    +
    +

    Build path

    +

    The build path is not part of LayerData. Each mode exposes its own impl Layer<D>::build with the exact signature it needs. Two private module-level helpers avoid code duplication:

    +

    build_mphf(out_dir, n) -> OLMResult<Mphf>: first pass — opens unitigs.bin, iterates all canonical kmers in parallel via new_from_par_iter, stores mphf.bin. O(n).

    +

    build_second_pass(out_dir, n, mphf, fill_slot) -> OLMResult<()>: second pass — opens unitigs.bin again, fills evidence.bin and a compact n/8-byte seen-bitset (MPHF correctness check inline), calls fill_slot(slot, kmer) once per kmer for the mode-specific payload. O(n).

    +
    // mode 1
    +impl Layer<()> {
    +    pub fn build(out_dir: &Path) -> OLMResult<usize>
    +}
    +
    +// modes 2/4
    +impl Layer<PersistentCompactIntMatrix> {
    +    pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    +    pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
    +}
    +
    +// mode 3
    +impl Layer<PersistentBitMatrix> {
    +    pub fn build_presence(
    +        out_dir: &Path,
    +        n_genomes: usize,
    +        present_in: impl Fn(CanonicalKmer, usize) -> bool,
    +    ) -> OLMResult<usize>
    +}
    +
    +

    Mode 2 creates a PersistentCompactIntMatrixBuilder with 1 column and fills it via build_second_pass. Mode 3 creates a PersistentBitMatrixBuilder with n_genomes columns and fills all columns in a single pass.

    +

    Any duplicate slot or out-of-bounds index detected during build_second_pass returns OLMError::Mphf. new_from_par_iter avoids materialising all keys as Vec<u64>.

    +
    +

    Query path

    +

    A kmer query routes through all three levels:

    +
      +
    1. Partition routing: hash canonical minimiser of the query kmer → partition index → open part_XXXXX/.
    2. +
    3. Layer probing: iterate layers in order; for each layer compute slot = mphf.index(kmer), decode evidence, compare to query. First match wins.
    4. +
    5. Data access: layer.data.read(slot) returns D::Item.
    6. +
    +
    // pseudo-code
    +fn query(kmer) -> Option<(usize, Hit<D::Item>)>:
    +    for (i, layer) in self.layers.iter().enumerate():
    +        slot = layer.mphf.index(&kmer.raw())
    +        if layer.evidence.decode(slot) == kmer:
    +            return Some((i, Hit { slot, data: layer.data.read(slot) }))
    +    return None
    +
    +

    Expected probe depth: 1 for kmers in layer 0, increasing for later layers.

    +

    For mode 2, hit.data is Box<[u32]> with 1 element; hit.data[0] is the count. For mode 3, hit.data is Box<[bool]> with G elements, one per genome.

    +
    +

    Add-layer algorithm

    +

    When adding dataset B to an existing index:

    +
      +
    1. For each partition, iterate kmers of B routed to that partition.
    2. +
    3. Probe existing layers; collect kmers absent from all layers → B \ index.
    4. +
    5. Build a new layer from B \ index.
    6. +
    7. Append the new layer directory under each part_XXXXX/.
    8. +
    9. Update meta.json (layer count, sample registry).
    10. +
    +

    Each partition's new layer is built independently; the operation is fully parallel across partitions.

    +
    +

    Dependencies

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    craterole
    ptr_hash 1.1MPHF per layer (epserde serialisation)
    cacheline-ef 1.1compact remap storage inside ptr_hash
    epserde 0.8zero-copy serialisation of MPHF
    memmap2mmap of layer files
    obiskiounitig file writer/reader
    obicompactvecpayload types: PersistentCompactIntMatrix, PersistentBitMatrix
    +
    +

    Relationship to target architecture

    +

    The target architecture (see Kmer index architecture) separates MphfLayer from data stores entirely and introduces a PartitionedIndex with parallel dispatch and an Aggregator pattern. The current implementation is a stepping stone: obicompactvec types are already fully decoupled from the MPHF; the remaining refactoring is within obilayeredmap itself.

    +
    +

    Open questions

    +
      +
    • Mode 4: count matrix (n_kmers × n_genomes × bytes_per_count) is structurally identical to mode 3 but uses PersistentCompactIntMatrix with G columns. Build API not yet implemented. Scale concern: hundreds of GB for large collections — a sparse representation may be required at high genome counts.
    • +
    • Layer merge: merging two LayeredMap instances into a single-layer index requires full rebuild. Define API and cost model.
    • +
    • Canonical kmer orientation: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
    • +
    • try_new_from_par_iter: ptr_hash::new_from_par_iter silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A try_new_from_par_iter PR upstream would close this gap.
    • +
    + + + + + + + + + + + + + +
    +
    + + + +
    + +
    + + + +
    +
    +
    +
    + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/doc/implementation/obipipeline/index.html b/doc/implementation/obipipeline/index.html index ef5b23b..5018f41 100644 --- a/doc/implementation/obipipeline/index.html +++ b/doc/implementation/obipipeline/index.html @@ -795,6 +795,90 @@ + + + + + + +
  • + + + + + + + + obilayeredmap crate + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentCompactIntVec + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentBitVec + + + + + + + + +
  • + + + + @@ -877,6 +961,34 @@ + + + + + + +
  • + + + + + + + + Kmer index + + + + + + + + +
  • + + + + diff --git a/doc/implementation/persistent_bit_vec/index.html b/doc/implementation/persistent_bit_vec/index.html new file mode 100644 index 0000000..21e6248 --- /dev/null +++ b/doc/implementation/persistent_bit_vec/index.html @@ -0,0 +1,1581 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + PersistentBitVec - obikmer + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + + Skip to content + + +
    +
    + +
    + + + + + + +
    + + +
    + +
    + + + + + + +
    +
    + + + +
    +
    +
    + + + + + +
    +
    +
    + + + + + + + +
    + +
    + + + + + +

    PersistentBitVec and PersistentBitMatrix

    +

    Purpose

    +

    PersistentBitVec stores a dense bit vector (presence/absence per slot) backed by a single mmap'd file. It is the binary counterpart of PersistentCompactIntVec and shares the same lifecycle pattern (builder → close → reader). All bulk operations work on u64 words rather than bytes, giving 8× fewer iterations and enabling the compiler to emit POPCNT and SIMD instructions.

    +

    Typical use: converting k-mer count vectors to presence/absence vectors (with optional threshold), then computing set-theoretic distances (Jaccard) or edit distances (Hamming) between samples.

    +

    PersistentBitMatrix wraps multiple PersistentBitVec columns in a directory, exposing a column-major binary matrix with row-access API. A single-column bit matrix is a vector at the API level.

    +
    +

    PersistentBitVec — single-column file

    +

    File format

    +

    Single .pbiv file.

    +
    offset 0:
    +  magic:     [u8; 4]  = b"PBIV"
    +  _pad:      [u8; 4]  = 0           alignment padding
    +  n:         u64       number of bits
    +
    +offset 16:
    +  data:      [u64; ⌈n/64⌉]          bit words, LSB-first, zero-padded
    +
    +

    Header is 16 bytes, so data starts at an offset divisible by 8. Since mmap returns page-aligned memory (≥ 4096-byte aligned), the data slice is u64-aligned, enabling a zero-copy &[u8] → &[u64] reinterpretation.

    +

    Bit layout: bit i is in data[i >> 6] at bit position i & 63 (LSB-first). Bits [n, ⌈n/64⌉×64) are always zero (padding). This invariant is maintained by all write operations and must be restored by not() after flipping.

    +

    Total file size: 16 + ⌈n/64⌉ × 8 bytes.

    +

    Lifecycle

    +

    Builder (PersistentBitVecBuilder)

    +
    struct PersistentBitVecBuilder {
    +    mmap: MmapMut,
    +    n:    usize,
    +}
    +
    +

    The file and mmap are created immediately at construction. The header is written once at new() or copied from the source at build_from*(). close() is a single flush — there is no tail to append, unlike PersistentCompactIntVec.

    +

    new(n: usize, path: &Path) -> io::Result<Self>

    +

    Creates the file, writes the header, zero-extends to 16 + ⌈n/64⌉×8 bytes, mmaps immediately. All bits default to 0.

    +

    build_from(source: &PersistentBitVec, path: &Path) -> io::Result<Self>

    +

    OS-level file copy (no per-bit iteration), then mmap. Initialisation cost: O(file_size).

    +

    build_from_counts(source: &PersistentCompactIntVec, threshold: u32, path: &Path) -> io::Result<Self>

    +

    Creates a new file, iterates source with its merge-scan iterator (O(n)), and writes bits directly into u64 words:

    +
    // bit i = 1 iff source[i] >= threshold
    +words[slot >> 6] |= 1u64 << (slot & 63);
    +
    +

    Handles overflow values (≥ 255) transparently — the count iterator returns the true u32 value regardless.

    +

    build_from_presence(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>

    +

    Shorthand for build_from_counts(source, 1, path).

    +

    Bit-level access

    +
    fn get(&self, slot: usize) -> bool
    +fn set(&mut self, slot: usize, value: bool)
    +
    +

    Byte-level mmap access: mmap[16 + slot/8], bit slot % 8. O(1).

    +

    Word-level bulk operations

    +

    All operate on ⌈n/64⌉ u64 words. O(n/64) per call.

    +
    builder.and(&other);   // self[i] &= other[i]  for all i
    +builder.or(&other);    // self[i] |= other[i]
    +builder.xor(&other);   // self[i] ^= other[i]
    +builder.not();         // self[i]  = !self[i], then re-zero padding bits
    +
    +

    and/or/xor read other's word slice directly (no allocation). not() flips all words then masks the last word's padding bits to restore the invariant.

    +

    close(self) -> io::Result<()>

    +

    Flushes the mmap. The header was written at construction and is never rewritten. O(1) in Rust code.

    +

    Reader (PersistentBitVec)

    +
    struct PersistentBitVec {
    +    mmap: Mmap,
    +    n:    usize,
    +    path: PathBuf,
    +}
    +
    +

    open(path: &Path) -> io::Result<Self>

    +

    Mmaps the file, validates magic, reads n from bytes [8..16]. O(1).

    +

    get(slot: usize) -> bool

    +

    Byte-level read from mmap[16 + slot/8]. O(1).

    +

    iter() -> BitIter<'_>

    +

    Sequential scan, byte by byte, yielding bool values in slot order. Implements ExactSizeIterator. O(n).

    +

    Aggregates

    +
    fn count_ones(&self)  -> u64   // popcount over all words; padding bits are 0
    +fn count_zeros(&self) -> u64   // n - count_ones()
    +
    +

    count_ones iterates ⌈n/64⌉ words and calls u64::count_ones() (maps to POPCNT). O(n/64).

    +

    Distance methods

    +

    Both operate word by word. O(n/64).

    + + + + + + + + + + + + + + + + + + + + +
    MethodFormulaNotes
    jaccard_dist(&other) -> f641 − \|A∩B\| / \|A∪B\|(a&b).count_ones(), (a\|b).count_ones() per word
    hamming_dist(&other) -> u64number of differing bits(a^b).count_ones() per word
    +

    Edge case (both all-zero → union = 0): jaccard_dist returns 0.0.

    +

    Implementation notes

    +

    u64 word view

    +

    The unsafe cast from &[u8] to &[u64] is sound because:

    +
      +
    1. mmap base is page-aligned (≥ 4096-byte boundary).
    2. +
    3. Data offset = 16, and 16 % 8 == 0 → the data pointer is 8-byte aligned.
    4. +
    5. Data length = ⌈n/64⌉ × 8 bytes — always a multiple of 8.
    6. +
    +

    This gives zero-copy word-level access with no intermediate allocation.

    +

    Padding invariant

    +

    Writing not() without masking the last word would corrupt count_ones(), hamming_dist(), and jaccard_dist(). The mask applied after flipping is (1u64 << (n % 64)) - 1 (no-op if n % 64 == 0). All other operations (and, or, xor) preserve existing zero padding since they can only clear or preserve bits already set by not().

    +

    Complexity

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    OperationTimeNotes
    new / openO(1)mmap setup + header parse
    get / set (builder or reader)O(1)byte-level mmap
    iter()O(n)byte-by-byte scan
    count_ones / count_zerosO(n/64)POPCNT per u64 word
    and / or / xor / notO(n/64)word-level bitwise ops
    jaccard_dist / hamming_distO(n/64)word AND/OR/XOR + POPCNT
    build_fromO(file_size)OS copy
    build_from_counts / build_from_presenceO(n)count iter + word fill
    closeO(1)flush only
    +
    +

    PersistentBitMatrix — column-major directory

    +

    Design

    +

    A directory containing meta.json and N column files col_000000.pbiv, col_000001.pbiv, …, each a PersistentBitVec. Used for presence/absence matrices: one column per genome, one bit per MPHF slot.

    +
    presence/
    +  meta.json          {"n": <n_slots>, "n_cols": <G>}
    +  col_000000.pbiv    genome 0
    +  col_000001.pbiv    genome 1
    +  ...
    +
    +

    Column-major layout makes per-genome set operations (Jaccard, Hamming, AND/OR) cache-friendly — each genome is a contiguous file. Row access (which genomes contain a given kmer) requires one O(1) read per column.

    +

    Builder (PersistentBitMatrixBuilder)

    +
    struct PersistentBitMatrixBuilder {
    +    dir:    PathBuf,
    +    n:      usize,
    +    n_cols: usize,
    +}
    +
    +

    new(n: usize, dir: &Path) -> io::Result<Self>

    +

    Creates the directory (including parents).

    +

    add_col(&mut self) -> io::Result<PersistentBitVecBuilder>

    +

    Creates col_NNNNNN.pbiv for the next column and returns its builder. The caller fills the column and calls builder.close() before calling add_col again.

    +

    close(self) -> io::Result<()>

    +

    Writes meta.json with the final n and n_cols.

    +

    Reader (PersistentBitMatrix)

    +
    struct PersistentBitMatrix {
    +    cols: Vec<PersistentBitVec>,
    +    n:    usize,
    +}
    +
    +

    open(dir: &Path) -> io::Result<Self>

    +

    Reads meta.json, opens all col_NNNNNN.pbiv files.

    +

    row(slot: usize) -> Box<[bool]>

    +

    Returns the presence vector: [col_0[slot], col_1[slot], …, col_{G-1}[slot]]. One byte read per column. O(G).

    +

    col(c: usize) -> &PersistentBitVec

    +

    Direct access to a single column for column-oriented operations.

    +

    LayerData implementation

    +
    impl LayerData for PersistentBitMatrix {
    +    type Item = Box<[bool]>;
    +    fn open(layer_dir: &Path) -> OLMResult<Self> { /* opens layer_dir/presence/ */ }
    +    fn read(&self, slot: usize) -> Box<[bool]>   { self.row(slot) }
    +}
    +
    + + + + + + + + + + + + + +
    +
    + + + +
    + +
    + + + +
    +
    +
    +
    + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/doc/implementation/persistent_compact_int_vec/index.html b/doc/implementation/persistent_compact_int_vec/index.html new file mode 100644 index 0000000..5eea493 --- /dev/null +++ b/doc/implementation/persistent_compact_int_vec/index.html @@ -0,0 +1,1596 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + PersistentCompactIntVec - obikmer + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + + Skip to content + + +
    +
    + +
    + + + + + + +
    + + +
    + +
    + + + + + + +
    +
    + + + +
    +
    +
    + + + + + +
    +
    +
    + + + + + + + +
    + +
    + + + + + +

    PersistentCompactIntVec and PersistentCompactIntMatrix

    +

    Purpose

    +

    PersistentCompactIntVec stores a dense array of non-negative integers indexed by MPHF slot where the vast majority of values are small (0–254) and large values are rare. It is designed for mmap-compatible random and sequential access with minimal memory footprint and optimal cache behaviour.

    +

    Motivation from observed count distributions in genomics data: 99.9% of k-mer counts fit in a u8; overflow (count ≥ 255) affects ~0.07% of distinct k-mers but can reach values above 10⁶ (chloroplast, ribosomal repeats).

    +

    PersistentCompactIntMatrix wraps multiple PersistentCompactIntVec columns in a directory, exposing a column-major matrix with row-access API. A vector is a matrix with 1 column.

    +
    +

    PersistentCompactIntVec — single-column file

    +

    Design

    +

    Two-tier structure:

    +
      +
    1. Primary array[u8; n], stored at offset 40 in the PCIV file and mmap'd. Values 0–254 are stored directly. Value 255 is a sentinel meaning "look in overflow".
    2. +
    3. Overflow section — sorted list of (slot: u64, value: u32) pairs for all slots where the true value ≥ 255, with a sparse L1-fitting index for fast lookup.
    4. +
    +
    primary[slot] < 255  →  return primary[slot]
    +primary[slot] == 255 →  binary search in overflow
    +
    +

    File format

    +

    Single .pciv file. Write order: header placeholder → primary → overflow + index → header overwrite at offset 0.

    +
    offset 0:
    +  magic:      [u8; 4]   = b"PCIV"
    +  _pad:       [u8; 4]   = 0
    +  n:          u64        number of slots
    +  n_overflow: u64        number of overflow entries
    +  n_index:    u64        number of sparse index entries
    +  step:       u64        sparse index step (0 = no index)
    +
    +offset 40:
    +  primary:    [u8; n]    one byte per slot, 255 = overflow sentinel
    +
    +offset 40 + n:
    +  data:       [(slot: u64, value: u32); n_overflow]   12 bytes each, sorted by slot
    +
    +offset 40 + n + n_overflow × 12:
    +  index:      [(slot: u64, pos: u64); n_index]         16 bytes each, sparse index
    +
    +

    The index entries point into data: index[i] = (slot of data[i×step], i×step).

    +

    All integer fields are little-endian. Slot indices are stored as u64 in the file; they are usize in Rust code.

    +

    Lifecycle

    +

    Builder (PersistentCompactIntVecBuilder)

    +

    Used during construction. The primary section is mmap'd immediately at construction time (both for new and build_from), so the file exists and is addressable from the start. The overflow is held in a HashMap<usize, u32> in RAM.

    +
    struct PersistentCompactIntVecBuilder {
    +    path:     PathBuf,
    +    mmap:     MmapMut,            // primary section live in the file from the start
    +    n:        usize,
    +    overflow: HashMap<usize, u32>, // values ≥ 255
    +}
    +
    +

    new(n: usize, path: &Path) -> io::Result<Self>

    +

    Creates the file, pre-allocates HEADER_SIZE + n zero bytes, mmaps it. The primary is zero-initialised (all slots = 0). Returns immediately ready for set / get.

    +

    build_from(source: &PersistentCompactIntVec, path: &Path) -> io::Result<Self>

    +

    Copies the source PCIV file to path (OS-level copy — no per-slot iteration), mmaps the copy, then loads the overflow section into a HashMap. Initialisation cost: O(file copy) + O(n_overflow), not O(n).

    +

    At close(), the primary section is not rewritten: it is already in the file via mmap. Only the overflow data, the sparse index, and the header are updated.

    +

    set(slot: usize, value: u32) / get(slot: usize) -> u32

    +

    Direct mmap byte access for the primary; HashMap for the overflow. Both O(1). Mutations can move a slot between tiers freely (downward mutation removes the HashMap entry; upward mutation adds it).

    +

    Element-wise operations — min, max, add, diff

    +

    Each takes a &PersistentCompactIntVec of equal length and updates self in place via set:

    +
    builder.min(&other);   // self[i] = min(self[i], other[i])
    +builder.max(&other);   // self[i] = max(self[i], other[i])
    +builder.add(&other);   // self[i] = self[i].checked_add(other[i])  (panics on u32 overflow)
    +builder.diff(&other);  // self[i] = self[i].saturating_sub(other[i])
    +
    +

    All iterate other with other.iter() (merge-scan, O(n_other)).

    +

    close(self) -> io::Result<()>

    +
      +
    1. Flush and drop the mmap (primary changes are now on disk).
    2. +
    3. Sort the overflow HashMap into Vec<(usize, u32)>.
    4. +
    5. Truncate the file to HEADER_SIZE + n (removes old data+index if build_from was used).
    6. +
    7. Append sorted overflow data, then sparse index.
    8. +
    9. Seek to offset 0, overwrite the header with final values.
    10. +
    +

    Reader (PersistentCompactIntVec)

    +

    Used at query time. The whole file is mmap'd; only the sparse index is copied into a Vec at open time (≤ 32 KB, L1-resident).

    +
    struct PersistentCompactIntVec {
    +    mmap:           Mmap,
    +    n:              usize,
    +    n_overflow:     usize,
    +    step:           usize,
    +    index:          Vec<(usize, usize)>,  // (slot, pos) — L1-resident
    +    primary_offset: usize,               // = 40 (HEADER_SIZE)
    +    data_offset:    usize,               // = 40 + n
    +    path:           PathBuf,
    +}
    +
    +

    open(path: &Path) -> io::Result<Self>

    +

    Mmaps the file, parses the 40-byte header, copies the sparse index entries into a Vec. The primary and data sections stay mmap'd.

    +

    get(slot: usize) -> u32 — random access

    +
    primary[slot] < 255  →  return it directly
    +
    +step == 0:
    +    binary_search(data[0..n_overflow], slot)
    +
    +step > 0:
    +    i = upper_bound(index[..].slot, slot) − 1     // in L1-resident Vec
    +    binary_search(data[index[i].pos .. index[i+1].pos], slot)
    +
    +

    iter() -> Iter<'_> — sequential scan, O(n)

    +

    Merge-scan: reads primary bytes in order; on sentinel 255, advances a sequential pointer into the sorted data section rather than doing a binary search. This gives O(n + n_overflow) with no random access into the data section.

    +

    Iter implements ExactSizeIterator. &PersistentCompactIntVec implements IntoIterator.

    +

    Aggregate

    +
    fn sum(&self) -> u64   // Σ self[i] as u64, via iter()
    +
    +

    Distance methods

    +

    All take &other of equal length, iterate both with zip(self.iter(), other.iter()), and return f64.

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    MethodFormula
    bray_dist1 − 2·Σmin(aᵢ,bᵢ) / (Σaᵢ + Σbᵢ)
    relfreq_bray_distBray-Curtis on relative frequencies: 1 − Σmin(pᵢ,qᵢ) where pᵢ = aᵢ/Σa
    euclidean_dist√Σ(aᵢ − bᵢ)²
    relfreq_euclidean_distEuclidean on relative frequencies
    hellinger_euclidean_dist√Σ(√pᵢ − √qᵢ)² — Euclidean on sqrt(relfreq)
    hellinger_disthellinger_euclidean_dist / √2 — standard Hellinger distance ∈ [0, 1]
    threshold_jaccard_dist(&other, threshold: u32)1 − \|A∩B\| / \|A∪B\| where presence iff count ≥ threshold
    jaccard_distthreshold_jaccard_dist(&other, 1)
    +

    Edge cases (both vectors all-zero, or union empty for Jaccard): distance = 0.0.

    +

    Step computation

    +

    Chosen at close() once n_overflow is known:

    +
    L1_INDEX_ENTRIES = 2048
    +
    +step = 0                                if n_overflow ≤ 2048
    +step = ⌈n_overflow / 2048⌉             otherwise
    +
    +

    Complexity

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    OperationTimeNotes
    set / get (builder)O(1)mmap byte + HashMap
    get (reader, no overflow)O(1)single mmap byte
    get (reader, with index)O(log step)≤ 2 memory regions
    get (reader, no index)O(log n_overflow)data fits in a few cache lines
    iter() full scanO(n + n_overflow)merge-scan, no binary search
    sum, distancesO(n)via iter() / zip(iter(), iter())
    min / max / add / diffO(n)via other.iter() + builder set
    closeO(n_overflow log n_overflow)sort + sequential write
    openO(n_index)index copy into Vec
    build_fromO(file_size) + O(n_overflow)OS copy + HashMap load
    +
    +

    PersistentCompactIntMatrix — column-major directory

    +

    Design

    +

    A directory containing meta.json and N column files col_000000.pciv, col_000001.pciv, …, each a PersistentCompactIntVec. This is the type used by LayerData — a single-column matrix is functionally equivalent to a vector but shares the same interface as multi-column matrices.

    +
    counts/
    +  meta.json          {"n": <n_slots>, "n_cols": <N>}
    +  col_000000.pciv
    +  col_000001.pciv
    +  ...
    +
    +

    Builder (PersistentCompactIntMatrixBuilder)

    +
    struct PersistentCompactIntMatrixBuilder {
    +    dir:    PathBuf,
    +    n:      usize,
    +    n_cols: usize,
    +}
    +
    +

    new(n: usize, dir: &Path) -> io::Result<Self>

    +

    Creates the directory (including parents). Does not write meta.json yet.

    +

    add_col(&mut self) -> io::Result<PersistentCompactIntVecBuilder>

    +

    Creates col_NNNNNN.pciv for the next column and returns its builder. The caller fills the column and calls builder.close() before calling add_col again.

    +

    close(self) -> io::Result<()>

    +

    Writes meta.json with the final n and n_cols. Must be called after all column builders are closed.

    +

    Reader (PersistentCompactIntMatrix)

    +
    struct PersistentCompactIntMatrix {
    +    cols: Vec<PersistentCompactIntVec>,
    +    n:    usize,
    +}
    +
    +

    open(dir: &Path) -> io::Result<Self>

    +

    Reads meta.json, opens all col_NNNNNN.pciv files.

    +

    row(slot: usize) -> Box<[u32]>

    +

    Returns the full row: [col_0[slot], col_1[slot], …, col_{N-1}[slot]]. One mmap access per column. O(N).

    +

    col(c: usize) -> &PersistentCompactIntVec

    +

    Direct access to a single column for column-oriented operations (distance computations, iteration).

    +

    LayerData implementation

    +
    impl LayerData for PersistentCompactIntMatrix {
    +    type Item = Box<[u32]>;
    +    fn open(layer_dir: &Path) -> OLMResult<Self> { /* opens layer_dir/counts/ */ }
    +    fn read(&self, slot: usize) -> Box<[u32]>    { self.row(slot) }
    +}
    +
    + + + + + + + + + + + + + +
    +
    + + + +
    + +
    + + + +
    +
    +
    +
    + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/doc/implementation/pipeline/index.html b/doc/implementation/pipeline/index.html index a9798f4..f95b9a7 100644 --- a/doc/implementation/pipeline/index.html +++ b/doc/implementation/pipeline/index.html @@ -767,6 +767,90 @@ + + + + + + +
  • + + + + + + + + obilayeredmap crate + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentCompactIntVec + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentBitVec + + + + + + + + +
  • + + + + @@ -849,6 +933,34 @@ + + + + + + +
  • + + + + + + + + Kmer index + + + + + + + + +
  • + + + + diff --git a/doc/implementation/storage/index.html b/doc/implementation/storage/index.html index 047e47d..d216fcb 100644 --- a/doc/implementation/storage/index.html +++ b/doc/implementation/storage/index.html @@ -723,6 +723,90 @@ + + + + + + +
  • + + + + + + + + obilayeredmap crate + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentCompactIntVec + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentBitVec + + + + + + + + +
  • + + + + @@ -805,6 +889,34 @@ + + + + + + +
  • + + + + + + + + Kmer index + + + + + + + + +
  • + + + + diff --git a/doc/implementation/superkmer/index.html b/doc/implementation/superkmer/index.html index f8ab366..2537ee6 100644 --- a/doc/implementation/superkmer/index.html +++ b/doc/implementation/superkmer/index.html @@ -745,6 +745,90 @@ + + + + + + +
  • + + + + + + + + obilayeredmap crate + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentCompactIntVec + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentBitVec + + + + + + + + +
  • + + + + @@ -827,6 +911,34 @@ + + + + + + +
  • + + + + + + + + Kmer index + + + + + + + + +
  • + + + + diff --git a/doc/implementation/unitig_evidence/index.html b/doc/implementation/unitig_evidence/index.html index 3a479d8..1ad5be1 100644 --- a/doc/implementation/unitig_evidence/index.html +++ b/doc/implementation/unitig_evidence/index.html @@ -6,7 +6,7 @@ - + Unitig evidence encoding - obikmer @@ -467,6 +467,37 @@
  • + + + + Non-determinism of the unitig decomposition + + + + +
  • +
  • @@ -478,6 +509,42 @@
  • +
  • + + + + + obilayeredmap crate + + + + + +
  • +
  • + + + + + PersistentCompactIntVec + + + + + +
  • +
  • + + + + + PersistentBitVec + + + + + +
  • @@ -513,6 +580,18 @@ + + + +
  • + + + + + Kmer index + + +
  • @@ -698,6 +777,37 @@
  • + + + + Non-determinism of the unitig decomposition + + + + +
  • +
  • @@ -774,7 +884,8 @@ b_B = \left\lceil \log_2 U \right\rceil + \left\lceil \log_2 L_{max} \right\rcei -

    On Betula nana (k=31, 256 partitions), m_u ≈ 37.9 kmers/unitig on average; no unitig length distribution data measured yet. The rank field (kmer index within the unitig) fits in a u8 as long as no unitig exceeds 255 kmers — guaranteed by the split strategy below.

    +

    Structural maximum from superkmer construction. For k=31 and m=11, the maximum number of consecutive kmers sharing the same minimiser is k − m + 1 = 21 kmers (the minimiser traverses from position k−m to 0 as the window slides). A unitig that is a single full superkmer therefore has exactly 21 kmers. This is confirmed by a bimodal distribution in empirical data: a sharp peak at 21 kmers appears in all partitions, including the anomalous partition 145. The observed maximum is ~46 kmers (unitigs spanning more than one superkmer), well within u8 range.

    +

    On Betula nana (k=31, 256 partitions), m_u ≈ 37.9 kmers/unitig on average. The rank field (kmer index within the unitig) fits in a u8 as long as no unitig exceeds 255 kmers — guaranteed by the split strategy below and amply satisfied by empirical maximums (~46 kmers observed).

    Split strategy for long unitigs

    For the rare cases where a unitig exceeds 255 kmers, the unitig is split into chunks of at most 255 kmers, with a k−1 nucleotide overlap at each junction — identical to the way super-kmers are delimited at partition boundaries. Each chunk is self-contained and independently decodable.

    original unitig: kmer_0 … kmer_254 | kmer_255 … kmer_N
    @@ -1026,6 +1137,43 @@ kmer      = nucleotides(unitig_id)[rank .. rank + k]   // 2-bit packed slice
     

    Forward vs reverse complement

    The De Bruijn graph stores only canonical kmers. The evidence encodes the canonical orientation. Callers that need the strand of the original kmer must compare the retrieved kmer with its revcomp at query time; this is a single 64-bit comparison.


    +

    Non-determinism of the unitig decomposition

    +

    The unitig extraction is not deterministic: two runs on identical input can produce a different number of unitigs with different sequences, while covering exactly the same canonical k-mer set.

    +

    Source of non-determinism

    +

    The graph nodes are stored in a hash map whose iteration order depends on the hash seed (random per run with ahash::RandomState::new()). The start_iter first pass emits every node whose can_extend_left flag is false — which includes not only true dead-end nodes but also branch points (nodes with 2 or more left neighbours, for which unique_neighbor returns None).

    +

    When a branch point is encountered before its upstream neighbours, it claims the downstream chain and those neighbours later produce length-k degenerate unitigs. When upstream neighbours are encountered first, they extend through the branch point and consume it.

    +

    Example — fork topology (k = 31):

    +
    A → B ← C
    +    ↓
    +    D
    +
    +

    All four nodes are in the graph. B has two left neighbours (A and C), so can_extend_left = false; B also has one right neighbour D, so can_extend_right = true.

    + + + + + + + + + + + + + + + + + + + + +
    iteration orderunitigs producedcount
    A first, then B, CABD · C2
    B first, then A, CBD · A · C3
    +

    Both tilings cover the same 4 canonical k-mers.

    +

    Pure cycles (all nodes have both extensions present) are unaffected by this: they are never emitted in the first pass and each cycle produces exactly one unitig regardless of which node the second pass starts from. Only the cycle cut point (and therefore the sequence content) varies.

    +

    Consequence for MPHF construction

    +

    The MPHF is built from the k-mer set, not from the unitig sequences themselves. Because both tilings contain the same canonical k-mers, the resulting MPHF is identical. The non-determinism is benign for this use case.

    +

    Open questions

    @@ -790,6 +874,34 @@ + + + + + + +
  • + + + + + + + + Kmer index + + + + + + + + +
  • + + + + diff --git a/doc/kmers/index.html b/doc/kmers/index.html index a066512..7b4795c 100644 --- a/doc/kmers/index.html +++ b/doc/kmers/index.html @@ -740,6 +740,90 @@ + + + + + + +
  • + + + + + + + + obilayeredmap crate + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentCompactIntVec + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentBitVec + + + + + + + + +
  • + + + + @@ -822,6 +906,34 @@ + + + + + + +
  • + + + + + + + + Kmer index + + + + + + + + +
  • + + + + diff --git a/doc/sitemap.xml.gz b/doc/sitemap.xml.gz index 25db22c..e43d52e 100644 Binary files a/doc/sitemap.xml.gz and b/doc/sitemap.xml.gz differ diff --git a/doc/theory/encoding/index.html b/doc/theory/encoding/index.html index f17a033..6adf132 100644 --- a/doc/theory/encoding/index.html +++ b/doc/theory/encoding/index.html @@ -712,6 +712,90 @@ + + + + + + +
  • + + + + + + + + obilayeredmap crate + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentCompactIntVec + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentBitVec + + + + + + + + +
  • + + + + @@ -794,6 +878,34 @@ + + + + + + +
  • + + + + + + + + Kmer index + + + + + + + + +
  • + + + + diff --git a/doc/theory/entropy/index.html b/doc/theory/entropy/index.html index 3296a6e..20e5a37 100644 --- a/doc/theory/entropy/index.html +++ b/doc/theory/entropy/index.html @@ -767,6 +767,90 @@ + + + + + + +
  • + + + + + + + + obilayeredmap crate + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentCompactIntVec + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentBitVec + + + + + + + + +
  • + + + + @@ -849,6 +933,34 @@ + + + + + + +
  • + + + + + + + + Kmer index + + + + + + + + +
  • + + + + diff --git a/doc/theory/indexing/index.html b/doc/theory/indexing/index.html index c54ca12..97513e3 100644 --- a/doc/theory/indexing/index.html +++ b/doc/theory/indexing/index.html @@ -712,6 +712,90 @@ + + + + + + +
  • + + + + + + + + obilayeredmap crate + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentCompactIntVec + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentBitVec + + + + + + + + +
  • + + + + @@ -794,6 +878,34 @@ + + + + + + +
  • + + + + + + + + Kmer index + + + + + + + + +
  • + + + + diff --git a/doc/theory/minimizer/index.html b/doc/theory/minimizer/index.html index 95da5d2..0c56933 100644 --- a/doc/theory/minimizer/index.html +++ b/doc/theory/minimizer/index.html @@ -756,6 +756,90 @@ + + + + + + +
  • + + + + + + + + obilayeredmap crate + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentCompactIntVec + + + + + + + + +
  • + + + + + + + + + + +
  • + + + + + + + + PersistentBitVec + + + + + + + + +
  • + + + + @@ -838,6 +922,34 @@ + + + + + + +
  • + + + + + + + + Kmer index + + + + + + + + +
  • + + + + diff --git a/docmd/architecture/index_architecture.md b/docmd/architecture/index_architecture.md index 102b92f..9db00a2 100644 --- a/docmd/architecture/index_architecture.md +++ b/docmd/architecture/index_architecture.md @@ -58,12 +58,230 @@ trait DataStore { Concrete types from `obicompactvec`: -| Type | `Item` | Use | -|---|---|---| -| `PersistentCompactIntMatrix` | `Box<[u32]>` | count per sample per slot | -| `PersistentBitMatrix` | `Box<[bool]>` | presence per sample per slot | +| Type | `Item` | Column stats | Use | +|---|---|---|---| +| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1` | count per sample per slot | +| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1` | presence per sample per slot | -A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only. The path to its on-disk files is managed by the `LayeredPartition`, not embedded in the store type. +`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights. + +A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only. + +--- + +## Distance matrix API on DataStore types + +Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods. + +### Full distance matrices + +Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon. + +```rust +// PersistentCompactIntMatrix +fn bray_dist_matrix(&self) -> Array2 +fn relfreq_bray_dist_matrix(&self) -> Array2 +fn euclidean_dist_matrix(&self) -> Array2 +fn relfreq_euclidean_dist_matrix(&self) -> Array2 +fn hellinger_dist_matrix(&self) -> Array2 +fn jaccard_dist_matrix(&self) -> Array2 +fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2 + +// PersistentBitMatrix +fn jaccard_dist_matrix(&self) -> Array2 +fn hamming_dist_matrix(&self) -> Array2 +``` + +These are convenience methods. For a `LayeredDataStore` or `PartitionedDataStore` they cannot be used directly — the partial API is required. + +### Partial distance matrices + +Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible. + +**Category 1 — self-contained partials**: additive without any external parameter. + +```rust +// PersistentCompactIntMatrix +fn partial_bray_dist_matrix(&self) + -> (Array2, // sum_min[i,j] + Array1) // col_sums[k] + +fn partial_euclidean_dist_matrix(&self) -> Array2 // sum of squared diffs +fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32) + -> (Array2, // inter[i,j] + Array2) // union[i,j] + +// PersistentBitMatrix +fn partial_jaccard_dist_matrix(&self) + -> (Array2, // inter[i,j] + Array2) // union[i,j] +fn partial_hamming_dist_matrix(&self) -> Array2 // differing bits +``` + +**Category 2 — normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs. + +```rust +// PersistentCompactIntMatrix only +fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1) + -> Array2 // Σ_slot min(a_slot/sum_i, b_slot/sum_j) + +fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1) + -> Array2 // Σ_slot (a_slot/sum_i - b_slot/sum_j)² + +fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1) + -> Array2 // Σ_slot (√(a/sum_i) - √(b/sum_j))² +``` + +The `col_sums` parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below. + +--- + +## Progressive aggregation principle + +Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down. + +``` +PersistentCompactIntMatrix::sum() — column sums for one (partition, layer) matrix + ↓ Σ across layers +LayeredCompactIntMatrix::sum() — column sums for one partition + ↓ Σ across partitions +PartitionedCompactIntMatrix::sum() — global column sums +``` + +The same cascade applies to every partial computation: + +``` +PersistentCompactIntMatrix::partial_bray_dist_matrix() — one (partition, layer) + ↓ element-wise Σ across layers +LayeredCompactIntMatrix::partial_bray() — one partition + ↓ element-wise Σ across partitions +PartitionedCompactIntMatrix::partial_bray() — global partial → final dist +``` + +This means `LayeredCompactIntMatrix` never inspects individual `PersistentCompactIntVec` columns directly, and `PartitionedCompactIntMatrix` never inspects individual layers. Each level presents a stable API surface to the level above. + +--- + +## LayeredDataStore — aggregation within one partition + +A `LayeredDataStore` holds one `DataStore` per layer within a single partition: + +```rust +struct LayeredCompactIntMatrix { layers: Vec } +struct LayeredBitMatrix { layers: Vec } +``` + +### Column statistics + +```rust +// LayeredCompactIntMatrix +fn sum(&self) -> Array1 + // = layers.par_iter().map(|m| m.sum()).reduce(element-wise +) + +// LayeredBitMatrix +fn count_ones(&self) -> Array1 + // = layers.par_iter().map(|m| m.count_ones()).reduce(element-wise +) +``` + +### Self-contained partials + +Each method reduces across layers by element-wise addition of per-layer matrices: + +```rust +fn partial_bray(&self) -> (Array2, Array1) + // Σ_l layer_l.partial_bray_dist_matrix() + +fn partial_euclidean(&self) -> Array2 + // Σ_l layer_l.partial_euclidean_dist_matrix() + +fn partial_jaccard(&self) -> (Array2, Array2) + // Σ_l layer_l.partial_jaccard_dist_matrix() [bit matrix] + // Σ_l layer_l.partial_threshold_jaccard_dist_matrix() [int matrix] + +fn partial_hamming(&self) -> Array2 + // Σ_l layer_l.partial_hamming_dist_matrix() [bit matrix] +``` + +### Normalised partials (require global sums from above) + +```rust +fn partial_relfreq_bray(&self, global_sums: &Array1) -> Array2 + // Σ_l layer_l.partial_relfreq_bray_dist_matrix(global_sums) + +fn partial_relfreq_euclidean(&self, global_sums: &Array1) -> Array2 + // Σ_l layer_l.partial_relfreq_euclidean_dist_matrix(global_sums) + +fn partial_hellinger(&self, global_sums: &Array1) -> Array2 + // Σ_l layer_l.partial_hellinger_euclidean_dist_matrix(global_sums) +``` + +`global_sums` is provided by the `PartitionedDataStore`; this level does not compute it. + +--- + +## PartitionedDataStore — aggregation across all partitions + +A `PartitionedDataStore` holds one `LayeredDataStore` per partition: + +```rust +struct PartitionedCompactIntMatrix { partitions: Vec } +struct PartitionedBitMatrix { partitions: Vec } +``` + +### Column statistics + +```rust +fn sum(&self) -> Array1 + // = partitions.par_iter().map(|p| p.sum()).reduce(element-wise +) +``` + +`p.sum()` is itself a reduction across layers (see above) — the cascade is preserved. + +### Self-contained metrics — single pass + +```rust +fn bray_dist_matrix(&self) -> Array2 { + let (sum_min, col_sums) = partitions + .par_iter() + .map(|p| p.partial_bray()) + .reduce(element-wise +); + // finalise + for (i,j): dist[i,j] = 1 - 2·sum_min[i,j] / (col_sums[i] + col_sums[j]) +} +``` + +### Normalised metrics — two passes + +```rust +fn relfreq_bray_dist_matrix(&self) -> Array2 { + // pass 1 — progressive: PartitionedDataStore::sum() + // calls LayeredDataStore::sum() per partition (parallel) + // calls PersistentCompactIntMatrix::sum() per layer (parallel) + let global_sums = self.sum(); + + // pass 2 — per-partition partial using global_sums (parallel) + let matrix = partitions + .par_iter() + .map(|p| p.partial_relfreq_bray(&global_sums)) + .reduce(element-wise +); + // finalise + for (i,j): dist[i,j] = 1 - matrix[i,j] +} +``` + +`global_sums` is exact because each kmer belongs to exactly one (partition, layer) pair — no double-counting. Pass 1 is itself fully parallel at every level of the hierarchy. + +--- + +## Parallelism model + +| Level | Unit | Coordination | +|---|---|---| +| Across partitions | `LayeredDataStore` | none — fully independent | +| Across layers (self-contained) | `(partition, layer)` pair | none — disjoint kmer sets | +| Across layers (normalised, pass 1) | `(partition, layer)` pair | none — sums are additive | +| Across layers (normalised, pass 2) | `(partition, layer)` pair | global_sums broadcast read-only | +| Within a DataStore (distance matrix) | upper-triangle pair `(i,j)` | none — rayon par_iter | --- @@ -80,65 +298,19 @@ for each layer l in p: return None ``` -O(n_layers) MPHF probes in the worst case; O(1) expected (kmer in layer 0). No cross-layer data fusion — the result comes from exactly one layer. +O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer). -### Sequence scan — `sequence → Vec<(kmer, Option)>` - -Decompose into canonical kmers, group by partition, dispatch to each partition in parallel. Within a partition, probe layers in order per kmer. Collect results. - -Parallelism: across partitions (independent). Within a partition: per-kmer probing is sequential across layers but different kmers are independent. - -### Aggregation — `→ Accumulator` - -For operations that traverse all kmers (distance, presence matrix, global counts): +### Aggregation — `→ Result` ``` result = reduce( - for p in partitions: // parallel - for l in layers(p): // parallel + for p in partitions: // parallel + for l in layers(p): // parallel partial(DataStore_p_l) ) ``` -Each `(partition, layer)` contributes an independent `Partial`. Global result = `reduce(all partials)`. - ---- - -## Aggregator pattern - -```rust -trait Aggregator { - type Partial: Send; - type Result; - fn partial(&self, store: &D) -> Self::Partial; - fn reduce(&self, parts: impl Iterator) -> Self::Result; -} -``` - -Concrete aggregators: - -| Aggregator | `Partial` | `Result` | -|---|---|---| -| `BrayCurtis(i, j)` | `(sum_min, sum_a, sum_b): (u64, u64, u64)` | `f64` | -| `Jaccard(i, j)` | `(inter, union): (u64, u64)` | `f64` | -| `Hellinger(i, j)` | `(sum_sqrt_prod, sum_a, sum_b): (f64, f64, f64)` | `f64` | -| `DistanceMatrix(metric)` | `n×n partial matrix` | `n×n f64 matrix` | -| `PresenceQuery(kmer)` | — | routed to point query | - -The `partial` for `BrayCurtis(i, j)` on a `PersistentCompactIntMatrix` with columns i and j already exists as `PersistentCompactIntVec::partial_bray_dist` — it needs to be lifted to the column-pair level on the matrix. - ---- - -## Parallelism model - -| Level | Unit | Coordination | -|---|---|---| -| Across partitions | `LayeredPartition` | none — fully independent | -| Across layers (aggregation) | `(partition, layer)` pair | none — disjoint kmer sets | -| Within a layer (point query) | n/a — single layer per kmer | n/a | -| DataStore derivation | one `(partition, layer)` per task | none | - -The dispatch model: `PartitionedIndex::aggregate(aggregator)` fans out over partitions (rayon `par_iter`), each partition fans out over its layers, collects partials, then a top-level `reduce` combines. +For normalised metrics replace with the two-pass scheme above. --- @@ -149,17 +321,11 @@ Because the `MphfLayer` is independent of its data stores, new stores can be der ``` // count → presence/absence, parallel across (partition, layer) for (p, l) in all_partition_layer_pairs().par_iter(): - count_store = open PersistentCompactIntMatrix at (p, l) + count_store = open PersistentCompactIntMatrix at (p, l) presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir) - attach presence_store to MphfLayer(p, l) ``` -Other derivations: -- Threshold a count matrix → binary presence matrix -- Union two presence matrices (same MPHF, different samples) -- Merge two count matrices (saturating add, column-wise) - -All derivations are local to a `(partition, layer)` pair and fully parallelisable. +Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair. --- @@ -169,11 +335,12 @@ The current `obilayeredmap` crate implements a subset of this architecture. Key - `Layer` fuses `MphfLayer` and one `DataStore` into a single generic type. Multiple data stores on the same MPHF are not supported. - `LayerData::open(dir)` embeds the path convention (`counts/`, `presence/`) inside the store type, preventing the `PartitionedIndex` from managing paths externally. -- The `Aggregator` pattern is not yet implemented; partial distance methods exist on `PersistentCompactIntVec` but are not composed across layers and partitions. -- No `PartitionedIndex` type exists; `LayeredMap` is a single-partition structure. +- `LayeredDataStore` and `PartitionedDataStore` do not yet exist; `LayeredMap` is a single-partition structure without a distance matrix API. +- The partial distance methods exist on `PersistentCompactIntMatrix` and `PersistentBitMatrix` and are tested; they are not yet composed across layers and partitions. Planned refactoring: 1. Extract `MphfLayer` from `Layer` as an autonomous type. 2. Replace `LayerData` trait with `DataStore` trait (no path knowledge). -3. Implement `LayeredPartition` that holds `Vec` and attaches data stores externally. -4. Implement `PartitionedIndex` with parallel dispatch and the `Aggregator` pattern. +3. Implement `LayeredCompactIntMatrix` / `LayeredBitMatrix` with the partial + full distance APIs described above. +4. Implement `PartitionedCompactIntMatrix` / `PartitionedBitMatrix` with two-pass support for normalised metrics. +5. Implement `PartitionedIndex` for point queries with parallel dispatch. diff --git a/src/obicompactvec/src/intmatrix.rs b/src/obicompactvec/src/intmatrix.rs index 4f6742f..9ea70a8 100644 --- a/src/obicompactvec/src/intmatrix.rs +++ b/src/obicompactvec/src/intmatrix.rs @@ -36,7 +36,20 @@ impl PersistentCompactIntMatrix { // ── Distance matrices ───────────────────────────────────────────────────── pub fn bray_dist_matrix(&self) -> Array2 { - self.pairwise(|i, j| self.col(i).bray_dist(self.col(j))) + let sum_min = self.partial_bray_dist_matrix(); + let col_sums = self.sum(); + let n = self.n_cols(); + let mut m = Array2::zeros((n, n)); + for i in 0..n { + for j in 0..n { + if i != j { + let denom = col_sums[i] + col_sums[j]; + m[[i, j]] = if denom == 0 { 0.0 } + else { 1.0 - 2.0 * sum_min[[i, j]] as f64 / denom as f64 }; + } + } + } + m } pub fn relfreq_bray_dist_matrix(&self) -> Array2 { @@ -74,23 +87,11 @@ impl PersistentCompactIntMatrix { // ── Partial matrices (additively decomposable across layers) ────────────── - /// Returns `(sum_min[n×n], col_sums[n])`. - /// `sum_min[i,j]` = Σ_slot min(col_i[slot], col_j[slot]). - /// `col_sums[k]` = Σ_slot col_k[slot]. - /// Reduce across layers by element-wise addition before computing the final distance. - pub fn partial_bray_dist_matrix(&self) -> (Array2, Array1) { - let n = self.n_cols(); - - let col_sums: Vec = (0..n) - .into_par_iter() - .map(|i| self.col(i).sum()) - .collect(); - - let sum_min = self.pairwise_u64(|i, j| { - self.col(i).partial_bray_dist(self.col(j)).0 - }); - - (sum_min, Array1::from_vec(col_sums)) + /// Returns `sum_min[n×n]` where `sum_min[i,j]` = Σ_slot min(col_i[slot], col_j[slot]). + /// The denominator `col_sums[i] + col_sums[j]` is obtained from `self.sum()`. + /// Additive across layers by element-wise addition. + pub fn partial_bray_dist_matrix(&self) -> Array2 { + self.pairwise_u64(|i, j| self.col(i).partial_bray_dist(self.col(j))) } /// Returns sum of squared differences `[n×n]`. diff --git a/src/obicompactvec/src/reader.rs b/src/obicompactvec/src/reader.rs index 8a23667..0f4ce25 100644 --- a/src/obicompactvec/src/reader.rs +++ b/src/obicompactvec/src/reader.rs @@ -141,32 +141,22 @@ impl PersistentCompactIntVec { #[inline] /// Returns the Bray-Curtis distance between two compact int vectors. pub fn bray_dist(&self, other: &PersistentCompactIntVec) -> f64 { - let (sum_min, denom) = self.partial_bray_dist(other); + let sum_min = self.partial_bray_dist(other); + let denom = self.sum() + other.sum(); if denom == 0 { return 0.0; } 1.0 - 2.0 * sum_min as f64 / denom as f64 } - /// Returns the partial Bray-Curtis distance between two compact int vectors. - /// - /// Returns a tuple `(sum_min, denom)` where `sum_min` is the sum of the minimum values - /// at each index, and `denom` is the sum of the values in both vectors. - /// This is used internally by [`bray_dist`] and to easily compute the Bray-Curtis distance - /// over a set of vector pairs. - /// - /// Returns the tuple `(sum_min, sum_a + sum_b)` where `sum_min` is the sum of the minimum - /// values at each index, `sum_a` is the sum of the first vector's counts, and `sum_b` is - /// the sum of the second vector's counts. - pub fn partial_bray_dist(&self, other: &PersistentCompactIntVec) -> (u64, u64) { + /// Returns `Σ_slot min(self[slot], other[slot])` — the additive numerator of Bray-Curtis. + /// The denominator `sum_a + sum_b` is obtained from `self.sum() + other.sum()`. + pub fn partial_bray_dist(&self, other: &PersistentCompactIntVec) -> u64 { assert_eq!(self.n, other.len(), "length mismatch"); - let (sum_min, sum_a, sum_b) = self - .iter() + self.iter() .zip(other.iter()) - .fold((0u64, 0u64, 0u64), |(sm, sa, sb), (a, b)| { - (sm + a.min(b) as u64, sa + a as u64, sb + b as u64) - }); - (sum_min, sum_a + sum_b) + .map(|(a, b)| a.min(b) as u64) + .sum() } /// Returns the relative frequency Bray-Curtis distance between two compact int vectors. diff --git a/src/obicompactvec/src/tests/intmatrix.rs b/src/obicompactvec/src/tests/intmatrix.rs index c195b42..2b9d45e 100644 --- a/src/obicompactvec/src/tests/intmatrix.rs +++ b/src/obicompactvec/src/tests/intmatrix.rs @@ -126,21 +126,17 @@ fn jaccard_dist_matrix_values_match_pairwise() { #[test] fn partial_bray_dist_matrix_consistent() { let (_d, m) = make_matrix(&[&[1, 0, 1], &[1, 1, 0], &[0, 1, 1]]); - let (sum_min, col_sums) = m.partial_bray_dist_matrix(); + let sum_min = m.partial_bray_dist_matrix(); + let col_sums = m.sum(); let n = m.n_cols(); - // symmetry of sum_min + // symmetry for i in 0..n { for j in 0..n { assert_eq!(sum_min[[i, j]], sum_min[[j, i]]); } } - // col_sums correct - for k in 0..n { - assert_eq!(col_sums[k], m.col(k).sum()); - } - // reconstruct distance from partials and compare to direct method for i in 0..n { for j in i + 1..n {