Push kztouvrzoqym #8
@@ -1,3 +1,21 @@
|
|||||||
|
## A finir dans le cadre de l'extension des index à une forme approximative
|
||||||
|
|
||||||
|
- Il faut avoir un chemin explicite pour construire en mode exact avec des méthodes qui ont ce mot exact à l'intérieur.
|
||||||
|
- pub fn find_exact (src/obilayeredmap/src/mphf_layer.rs)
|
||||||
|
- pub fn build_exact_evidence (src/obilayeredmap/src/layer.rs)
|
||||||
|
|
||||||
|
Comme elles existent actuellement pour le mode approx.
|
||||||
|
|
||||||
|
Ensuite, il faudra définir des méthodes génériques
|
||||||
|
- find()
|
||||||
|
- build_evidence()
|
||||||
|
|
||||||
|
qui utilise la bonne version suivant le mode de l'index de manière complètement transparente.
|
||||||
|
Avec ce système, tout le reste du code devrait être insensible au fait que l'on utilise un index exact ou approximatif.
|
||||||
|
|
||||||
|
Sauf qu'avec un index approximatif, les résultats seront approximatifs.
|
||||||
|
|
||||||
|
|
||||||
## commandes à ajouter
|
## commandes à ajouter
|
||||||
|
|
||||||
- aggregate : aggrege toutes les colonnes d'une matrice d'index en une seule colonne.
|
- aggregate : aggrege toutes les colonnes d'une matrice d'index en une seule colonne.
|
||||||
|
|||||||
@@ -2,169 +2,155 @@
|
|||||||
|
|
||||||
## Fundamental invariant
|
## Fundamental invariant
|
||||||
|
|
||||||
A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.
|
A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This property makes all aggregation operations decomposable and parallelisable without coordination.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Three-level hierarchy
|
## Three-level hierarchy
|
||||||
|
|
||||||
```
|
```
|
||||||
PartitionedIndex
|
KmerIndex (index.meta + KmerPartition)
|
||||||
├── LayeredPartition (one per minimiser bucket)
|
├── partition_0/index/ one directory per minimiser bucket
|
||||||
│ ├── MphfLayer 0 kmer → slot (immutable bijection)
|
│ ├── meta.json PartitionMeta { n_layers }
|
||||||
│ │ ├── DataStore A slot → T (e.g. counts)
|
│ ├── layer_0/
|
||||||
│ │ └── DataStore B slot → T (e.g. presence/absence, derived)
|
│ │ ├── layer_meta.json LayerMeta { evidence: EvidenceKind }
|
||||||
│ ├── MphfLayer 1
|
│ │ ├── mphf.bin PtrHash MPHF
|
||||||
│ │ └── DataStore A
|
│ │ ├── unitigs.bin unitig spine (never overwritten)
|
||||||
│ └── ...
|
│ │ ├── evidence.bin exact evidence (Exact only)
|
||||||
├── LayeredPartition
|
│ │ ├── unitigs.bin.idx block index (Exact only)
|
||||||
│ └── ...
|
│ │ ├── fingerprint.bin fingerprints (Approx only)
|
||||||
|
│ │ ├── counts/ PersistentCompactIntMatrix (with_counts = true)
|
||||||
|
│ │ └── presence/ PersistentBitMatrix
|
||||||
|
│ └── layer_1/
|
||||||
|
│ └── ...
|
||||||
|
└── partition_1/index/
|
||||||
|
└── ...
|
||||||
```
|
```
|
||||||
|
|
||||||
**PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.
|
**KmerIndex**: root entry point. Owns `IndexMeta` (written to `index.meta`) and a `KmerPartition` that routes canonical kmers to partition directories. All partition-level operations are dispatched in parallel via rayon.
|
||||||
|
|
||||||
**LayeredPartition**: one directory per minimiser bucket. Holds a `Vec<MphfLayer>`. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.
|
**Partition directory**: one directory per minimiser bucket. `PartitionMeta` (stored as `meta.json`) records `n_layers`. Layers within a partition cover disjoint kmer sets.
|
||||||
|
|
||||||
**MphfLayer**: the MPHF + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it.
|
**Layer directory**: one `MphfLayer` plus optional data stores. `LayerMeta` (stored as `layer_meta.json`) records which `EvidenceKind` was used. The MPHF and `unitigs.bin` are immutable once built; evidence files are the only part replaced by `reindex`.
|
||||||
|
|
||||||
**DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## MphfLayer — autonomous mapping layer
|
## IndexConfig and IndexMeta
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
MphfLayer::find(kmer: CanonicalKmer) -> Option<usize> // slot, or None if absent
|
pub struct IndexConfig {
|
||||||
MphfLayer::n() -> usize // number of slots
|
pub kmer_size: usize,
|
||||||
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
|
pub minimizer_size: usize,
|
||||||
MphfLayer::open(dir: &Path) -> OLMResult<Self>
|
pub n_bits: usize, // log2(n_partitions)
|
||||||
|
pub with_counts: bool,
|
||||||
|
pub evidence: EvidenceKind,
|
||||||
|
pub block_bits: u8, // .idx granularity: 2^block_bits unitigs/block; 0 = one entry per unitig
|
||||||
|
}
|
||||||
|
|
||||||
|
pub struct IndexMeta {
|
||||||
|
pub version: u32,
|
||||||
|
pub config: IndexConfig,
|
||||||
|
pub genomes: Vec<GenomeInfo>, // ordered; index = genome column number
|
||||||
|
}
|
||||||
|
|
||||||
|
pub struct GenomeInfo {
|
||||||
|
pub label: String,
|
||||||
|
pub meta: HashMap<String, String>, // arbitrary categorical metadata
|
||||||
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
`find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). Returns `None` for kmers present in other layers or absent from the index.
|
`IndexMeta` is serialised as `index.meta` (JSON). It is the authority for the ordered list of genomes and for the parameters that govern all subsequent operations on the index.
|
||||||
|
|
||||||
The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`.
|
---
|
||||||
|
|
||||||
|
## EvidenceKind
|
||||||
|
|
||||||
|
```rust
|
||||||
|
pub enum EvidenceKind {
|
||||||
|
Exact,
|
||||||
|
Approx { b: u8, z: u8 },
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Controls which files are written per layer and which query path is taken:
|
||||||
|
|
||||||
|
| Variant | Files written | False-positive rate |
|
||||||
|
|---|---|---|
|
||||||
|
| `Exact` | `evidence.bin`, `unitigs.bin.idx` | 0 |
|
||||||
|
| `Approx { b, z }` | `fingerprint.bin` | ≈ W / 2^(b·z) per read (Findere) |
|
||||||
|
|
||||||
|
`EvidenceKind` is stored both in `IndexConfig` (index-wide default, updated by `reindex`) and in each `LayerMeta` (per-layer record of what was actually built).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## MphfLayer — autonomous kmer → slot mapping
|
||||||
|
|
||||||
|
```rust
|
||||||
|
pub struct MphfLayer {
|
||||||
|
mphf: PtrHash<…>,
|
||||||
|
ev: LayerEvidence, // Exact { evidence, unitigs } | Approx { fingerprint }
|
||||||
|
n: usize,
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
`MphfLayer::find(kmer)` dispatches transparently to `find_exact` or `find_approx` based on the evidence loaded at `open` time (read from `layer_meta.json`). Returns `Some(slot)` only if the kmer is confirmed present; `None` for absent or out-of-range.
|
||||||
|
|
||||||
|
```
|
||||||
|
find_exact: slot = mphf(kmer); decode evidence → (chunk_id, rank); verify kmer in unitigs
|
||||||
|
find_approx: slot = mphf(kmer); check fingerprint[slot] == seq_hash(kmer)
|
||||||
|
```
|
||||||
|
|
||||||
|
`block_bits` controls the `.idx` file written alongside `evidence.bin`. At `block_bits = 0`, every unitig chunk has an index entry, giving O(1) random access; larger values trade access time for a smaller `.idx`.
|
||||||
|
|
||||||
|
The MPHF and `unitigs.bin` are never rebuilt by any post-build operation.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Layer\<D\> — MPHF + data payload
|
||||||
|
|
||||||
|
```rust
|
||||||
|
pub struct Layer<D: LayerData = ()> {
|
||||||
|
mphf: MphfLayer,
|
||||||
|
data: D,
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
`D` selects the attached data payload:
|
||||||
|
|
||||||
|
| `D` | Data directory | `Item` returned by `query` |
|
||||||
|
|---|---|---|
|
||||||
|
| `()` | — | `()` (set membership only) |
|
||||||
|
| `PersistentCompactIntMatrix` | `counts/` | `Box<[u32]>` (counts per genome) |
|
||||||
|
| `PersistentBitMatrix` | `presence/` | `Box<[bool]>` (presence per genome) |
|
||||||
|
|
||||||
|
`Layer::query(kmer)` delegates to `MphfLayer::find`, then calls `data.read(slot)` if a slot is returned. Both exact and approximate evidence are handled transparently; the caller sees only `Option<Hit<D::Item>>`.
|
||||||
|
|
||||||
|
Build-time entry points:
|
||||||
|
|
||||||
|
```rust
|
||||||
|
Layer<()>::build(out_dir, block_bits) // set membership
|
||||||
|
Layer<PersistentCompactIntMatrix>::build(out_dir, block_bits, count_of)
|
||||||
|
Layer<PersistentBitMatrix>::build_presence(out_dir, block_bits, n_genomes, present_in)
|
||||||
|
|
||||||
|
Layer::<()>::build_evidence(layer_dir, kind, block_bits) // evidence only (reindex path)
|
||||||
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## DataStore — slot-indexed data
|
## DataStore — slot-indexed data
|
||||||
|
|
||||||
```rust
|
`PersistentCompactIntMatrix` and `PersistentBitMatrix` are slot-indexed stores. They know nothing about kmers or MPHFs.
|
||||||
trait DataStore {
|
|
||||||
type Item;
|
|
||||||
fn get(&self, slot: usize) -> Self::Item;
|
|
||||||
fn n(&self) -> usize;
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Concrete types from `obicompactvec`:
|
| Type | `Item` | Aggregation method | Use |
|
||||||
|
|
||||||
| Type | `Item` | Column stats | Use |
|
|
||||||
|---|---|---|---|
|
|---|---|---|---|
|
||||||
| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1<u64>` | count per sample per slot |
|
| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() → Array1<u64>` | counts per genome per slot |
|
||||||
| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1<u64>` | presence per sample per slot |
|
| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() → Array1<u64>` | presence per genome per slot |
|
||||||
|
|
||||||
`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.
|
|
||||||
|
|
||||||
A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Distance matrix API on DataStore types
|
## Aggregation traits — `obicompactvec::traits`
|
||||||
|
|
||||||
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods.
|
Three traits unify the aggregation API across all hierarchy levels.
|
||||||
|
|
||||||
### Full distance matrices
|
|
||||||
|
|
||||||
Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.
|
|
||||||
|
|
||||||
```rust
|
|
||||||
// PersistentCompactIntMatrix
|
|
||||||
fn bray_dist_matrix(&self) -> Array2<f64>
|
|
||||||
fn relfreq_bray_dist_matrix(&self) -> Array2<f64>
|
|
||||||
fn euclidean_dist_matrix(&self) -> Array2<f64>
|
|
||||||
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
|
|
||||||
fn hellinger_dist_matrix(&self) -> Array2<f64>
|
|
||||||
fn jaccard_dist_matrix(&self) -> Array2<f64>
|
|
||||||
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>
|
|
||||||
|
|
||||||
// PersistentBitMatrix
|
|
||||||
fn jaccard_dist_matrix(&self) -> Array2<f64>
|
|
||||||
fn hamming_dist_matrix(&self) -> Array2<u64>
|
|
||||||
```
|
|
||||||
|
|
||||||
These are convenience methods. For a `LayeredDataStore` or `PartitionedDataStore` they cannot be used directly — the partial API is required.
|
|
||||||
|
|
||||||
### Partial distance matrices
|
|
||||||
|
|
||||||
Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.
|
|
||||||
|
|
||||||
**Category 1 — self-contained partials**: additive without any external parameter.
|
|
||||||
|
|
||||||
```rust
|
|
||||||
// PersistentCompactIntMatrix
|
|
||||||
fn partial_bray_dist_matrix(&self)
|
|
||||||
-> (Array2<u64>, // sum_min[i,j]
|
|
||||||
Array1<u64>) // col_sums[k]
|
|
||||||
|
|
||||||
fn partial_euclidean_dist_matrix(&self) -> Array2<f64> // sum of squared diffs
|
|
||||||
fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
|
|
||||||
-> (Array2<u64>, // inter[i,j]
|
|
||||||
Array2<u64>) // union[i,j]
|
|
||||||
|
|
||||||
// PersistentBitMatrix
|
|
||||||
fn partial_jaccard_dist_matrix(&self)
|
|
||||||
-> (Array2<u64>, // inter[i,j]
|
|
||||||
Array2<u64>) // union[i,j]
|
|
||||||
fn partial_hamming_dist_matrix(&self) -> Array2<u64> // differing bits
|
|
||||||
```
|
|
||||||
|
|
||||||
**Category 2 — normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs.
|
|
||||||
|
|
||||||
```rust
|
|
||||||
// PersistentCompactIntMatrix only
|
|
||||||
fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
|
|
||||||
-> Array2<f64> // Σ_slot min(a_slot/sum_i, b_slot/sum_j)
|
|
||||||
|
|
||||||
fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
|
|
||||||
-> Array2<f64> // Σ_slot (a_slot/sum_i - b_slot/sum_j)²
|
|
||||||
|
|
||||||
fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
|
|
||||||
-> Array2<f64> // Σ_slot (√(a/sum_i) - √(b/sum_j))²
|
|
||||||
```
|
|
||||||
|
|
||||||
The `col_sums` parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Progressive aggregation principle
|
|
||||||
|
|
||||||
Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.
|
|
||||||
|
|
||||||
```
|
|
||||||
PersistentCompactIntMatrix::col_weights() — column sums for one (partition, layer) matrix
|
|
||||||
↓ Σ across layers
|
|
||||||
LayeredStore<PersistentCompactIntMatrix>::col_weights() — column sums for one partition
|
|
||||||
↓ Σ across partitions
|
|
||||||
LayeredStore<LayeredStore<…>>::col_weights() — global column sums
|
|
||||||
```
|
|
||||||
|
|
||||||
The same cascade applies to every partial:
|
|
||||||
|
|
||||||
```
|
|
||||||
PersistentCompactIntMatrix::partial_bray() — one (partition, layer)
|
|
||||||
↓ element-wise Σ across layers
|
|
||||||
LayeredStore<PersistentCompactIntMatrix>::partial_bray() — one partition
|
|
||||||
↓ element-wise Σ across partitions
|
|
||||||
LayeredStore<LayeredStore<…>>::partial_bray() — global partial → final dist
|
|
||||||
```
|
|
||||||
|
|
||||||
Each level presents a stable trait surface to the level above; no level reaches two levels down.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Traits — `obicompactvec::traits`
|
|
||||||
|
|
||||||
Three traits unify the aggregation API across all levels of the hierarchy.
|
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
trait ColumnWeights: Send + Sync {
|
trait ColumnWeights: Send + Sync {
|
||||||
@@ -172,81 +158,130 @@ trait ColumnWeights: Send + Sync {
|
|||||||
}
|
}
|
||||||
|
|
||||||
trait CountPartials: ColumnWeights {
|
trait CountPartials: ColumnWeights {
|
||||||
// self-contained partials (additive, no parameter)
|
fn partial_bray(&self) -> Array2<u64>;
|
||||||
fn partial_bray(&self) -> Array2<u64>;
|
fn partial_euclidean(&self) -> Array2<f64>;
|
||||||
fn partial_euclidean(&self) -> Array2<f64>;
|
fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2<u64>, Array2<u64>);
|
||||||
fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2<u64>, Array2<u64>);
|
fn partial_relfreq_bray(&self, global: &Array1<u64>) -> Array2<f64>;
|
||||||
// normalised partials (global col_weights passed in cascade)
|
fn partial_relfreq_euclidean(&self, global: &Array1<u64>) -> Array2<f64>;
|
||||||
fn partial_relfreq_bray(&self, global: &Array1<u64>) -> Array2<f64>;
|
fn partial_hellinger(&self, global: &Array1<u64>) -> Array2<f64>;
|
||||||
fn partial_relfreq_euclidean(&self, global: &Array1<u64>) -> Array2<f64>;
|
// provided finalisation methods with default impls
|
||||||
fn partial_hellinger(&self, global: &Array1<u64>) -> Array2<f64>;
|
fn bray_dist_matrix(&self) -> Array2<f64> { … }
|
||||||
// provided finalisation methods (default implementations)
|
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> { … }
|
||||||
fn bray_dist_matrix(&self) -> Array2<f64> { … }
|
// …
|
||||||
fn euclidean_dist_matrix(&self) -> Array2<f64> { … }
|
|
||||||
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64> { … }
|
|
||||||
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> { … }
|
|
||||||
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64> { … }
|
|
||||||
fn hellinger_dist_matrix(&self) -> Array2<f64> { … }
|
|
||||||
}
|
}
|
||||||
|
|
||||||
trait BitPartials: ColumnWeights {
|
trait BitPartials: ColumnWeights {
|
||||||
fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>);
|
fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>);
|
||||||
fn partial_hamming(&self) -> Array2<u64>;
|
fn partial_hamming(&self) -> Array2<u64>;
|
||||||
// provided
|
// provided
|
||||||
fn jaccard_dist_matrix(&self) -> Array2<f64> { … }
|
fn jaccard_dist_matrix(&self) -> Array2<f64> { … }
|
||||||
fn hamming_dist_matrix(&self) -> Array2<u64> { … }
|
fn hamming_dist_matrix(&self) -> Array2<u64> { … }
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
**Leaf implementors** (in `obicompactvec`):
|
Leaf implementors:
|
||||||
|
|
||||||
| Type | Traits |
|
| Type | Traits |
|
||||||
|---|---|
|
|---|---|
|
||||||
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`), `CountPartials` |
|
| `PersistentCompactIntMatrix` | `ColumnWeights`, `CountPartials` |
|
||||||
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`), `BitPartials` |
|
| `PersistentBitMatrix` | `ColumnWeights`, `BitPartials` |
|
||||||
|
|
||||||
`PersistentCompactIntVec` and `PersistentBitVec` do **not** implement these traits — they are single-column primitives, not matrix-level aggregators.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## `LayeredStore<S>` — `obilayeredmap`
|
## LayeredStore\<S\> — recursive aggregation wrapper
|
||||||
|
|
||||||
A single generic wrapper replaces the need for named `LayeredDataStore` and `PartitionedDataStore` types:
|
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
pub struct LayeredStore<S>(Vec<S>);
|
pub struct LayeredStore<S>(Vec<S>);
|
||||||
```
|
```
|
||||||
|
|
||||||
Three blanket impls propagate the traits up the hierarchy:
|
Three blanket impls propagate all traits up the hierarchy:
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … } // Σ across inner stores
|
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … }
|
||||||
impl<S: CountPartials> CountPartials for LayeredStore<S> { … } // same pattern
|
impl<S: CountPartials> CountPartials for LayeredStore<S> { … }
|
||||||
impl<S: BitPartials> BitPartials for LayeredStore<S> { … } // same pattern
|
impl<S: BitPartials> BitPartials for LayeredStore<S> { … }
|
||||||
```
|
```
|
||||||
|
|
||||||
Because the blanket impl is recursive, **`LayeredStore<LayeredStore<S>>`** automatically inherits all three traits when `S` does — no separate `PartitionedStore` type is needed:
|
This makes `LayeredStore<LayeredStore<PersistentCompactIntMatrix>>` automatically implement `CountPartials` — no separate `PartitionedStore` type is needed:
|
||||||
|
|
||||||
```
|
```
|
||||||
PersistentCompactIntMatrix implements CountPartials
|
PersistentCompactIntMatrix leaf (one layer)
|
||||||
LayeredStore<PersistentCompactIntMatrix> via blanket impl (= one partition)
|
LayeredStore<PersistentCompactIntMatrix> one partition (layers are disjoint)
|
||||||
LayeredStore<LayeredStore<…>> via blanket impl (= partitioned index)
|
LayeredStore<LayeredStore<…>> whole index (partitions are independent)
|
||||||
```
|
```
|
||||||
|
|
||||||
### Normalised metrics — two-pass cascade
|
Normalised metrics require global column sums — computed in a two-pass cascade:
|
||||||
|
|
||||||
The normalised finalisation methods call `col_weights()` first (pass 1), then the normalised partial (pass 2). Both calls go through the same blanket impl, so the cascade is automatic:
|
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
// called on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>
|
// on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>
|
||||||
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
|
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
|
||||||
let global = self.col_weights(); // pass 1 — progressive sum at every level
|
let global = self.col_weights(); // pass 1 — sums up hierarchy
|
||||||
let p = self.partial_relfreq_bray(&global); // pass 2 — global passed in cascade
|
let p = self.partial_relfreq_bray(&global); // pass 2 — global broadcast read-only
|
||||||
p.mapv(|v| 1.0 - v) // finalise (diagonal zeroed separately)
|
p.mapv(|v| 1.0 - v)
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
`global` is exact: each kmer belongs to exactly one `(partition, layer)` pair, so there is no double-counting across the hierarchy.
|
Because each kmer belongs to exactly one `(partition, layer)` pair, `col_weights()` has no double-counting across the hierarchy.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Progressive aggregation principle
|
||||||
|
|
||||||
|
No level reaches two levels down. Each level sums contributions from the level immediately below:
|
||||||
|
|
||||||
|
```
|
||||||
|
PersistentCompactIntMatrix::col_weights() — one (partition, layer)
|
||||||
|
↓ Σ across layers
|
||||||
|
LayeredStore<PersistentCompactIntMatrix>::col_weights() — one partition
|
||||||
|
↓ Σ across partitions
|
||||||
|
LayeredStore<LayeredStore<…>>::col_weights() — global
|
||||||
|
```
|
||||||
|
|
||||||
|
The same cascade applies to every partial method.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Multi-genome column invariant
|
||||||
|
|
||||||
|
After any merge, every layer in every partition has exactly `n_genomes` columns, where `n_genomes` is the current total in `index.meta`. This holds for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`.
|
||||||
|
|
||||||
|
Maintained by three coordinated operations:
|
||||||
|
|
||||||
|
**Existing layers — column append.** `Layer::append_genome_column` appends one column to each existing layer. Slots matching the incoming genome receive its count or `true`; all other slots receive 0 or `false`.
|
||||||
|
|
||||||
|
**New layers — absent columns prepended.** When a new layer is created for kmers unique to the incoming genome, `n_existing_genomes` absent columns are prepended before the incoming genome's column, so the new layer immediately has the same column count as all other layers.
|
||||||
|
|
||||||
|
**First merge, Presence mode — `init_presence_matrix`.** The initial single-genome index has no `presence/` directory (presence is implicit). On the first merge, `Layer<()>::init_presence_matrix` materialises genome 0's presence column (all `true`) retroactively, raising the column count from 0 to 1 before appending column 1.
|
||||||
|
|
||||||
|
This invariant is the precondition for correct progressive aggregation: every level can blindly sum matrices from below because all matrices have the same shape.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Query model
|
||||||
|
|
||||||
|
### Point query
|
||||||
|
|
||||||
|
```
|
||||||
|
minimiser(kmer) → partition p
|
||||||
|
for each layer l in p:
|
||||||
|
if let Some(slot) = MphfLayer_l.find(kmer):
|
||||||
|
return data_l.read(slot)
|
||||||
|
return None
|
||||||
|
```
|
||||||
|
|
||||||
|
O(n_layers) MPHF probes worst case; O(1) expected. The result comes from exactly one `(partition, layer)`.
|
||||||
|
|
||||||
|
### Aggregation
|
||||||
|
|
||||||
|
```
|
||||||
|
result = reduce(
|
||||||
|
for p in partitions: // parallel
|
||||||
|
for l in layers(p): // parallel
|
||||||
|
partial(data_p_l)
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
For normalised metrics, replace with the two-pass cascade.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -254,103 +289,25 @@ fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
|
|||||||
|
|
||||||
| Level | Unit | Coordination |
|
| Level | Unit | Coordination |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| Across partitions | `LayeredStore<LayeredStore<S>>` inner stores | none — fully independent |
|
| Across partitions | inner stores of `LayeredStore<LayeredStore<S>>` | none |
|
||||||
| Across layers within a partition | `LayeredStore<S>` inner stores | none — disjoint kmer sets |
|
| Across layers within a partition | inner stores of `LayeredStore<S>` | none — disjoint kmer sets |
|
||||||
| Normalised pass 1 (`col_weights`) | per inner store | none — additive |
|
| Normalised pass 1 (`col_weights`) | per inner store | none — additive |
|
||||||
| Normalised pass 2 (partial) | per inner store | `global` broadcast read-only |
|
| Normalised pass 2 (partial) | per inner store | `global` broadcast read-only |
|
||||||
| Within a matrix (distance) | upper-triangle pair `(i,j)` | none — rayon `par_iter` |
|
| Within a matrix (distance) | upper-triangle pair `(i,j)` | none — rayon `par_iter` |
|
||||||
|
|
||||||
All levels use rayon `par_iter` internally; `reduce_with` performs a parallel tree reduction.
|
---
|
||||||
|
|
||||||
|
## reindex — evidence conversion in place
|
||||||
|
|
||||||
|
`KmerIndex::reindex(target, block_bits)` converts every layer's evidence bundle to `target` without touching the MPHF or `unitigs.bin`:
|
||||||
|
|
||||||
|
- `→ Exact`: builds `evidence.bin` + `unitigs.bin.idx`; removes `fingerprint.bin`
|
||||||
|
- `→ Approx { b, z }`: builds `fingerprint.bin`; removes `evidence.bin` + `unitigs.bin.idx`
|
||||||
|
|
||||||
|
On success, `IndexConfig::evidence` and `IndexConfig::block_bits` are updated in `index.meta`. Each layer's `layer_meta.json` is also rewritten with the new `EvidenceKind`.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Query model
|
## estimate — parameter dry-run
|
||||||
|
|
||||||
### Point query — `kmer → Option<Item>`
|
`estimate` resolves approximate-evidence parameters (`z`, `b`, target FP rate) and prints the resulting effective kmer size and per-kmer / per-z-window false-positive rates without touching any index. Used to calibrate `Approx { b, z }` before building or reindexing.
|
||||||
|
|
||||||
```
|
|
||||||
minimiser(kmer) → partition p
|
|
||||||
for each layer l in p:
|
|
||||||
slot = MphfLayer_l.find(kmer)
|
|
||||||
if slot is Some:
|
|
||||||
return DataStore_l.get(slot)
|
|
||||||
return None
|
|
||||||
```
|
|
||||||
|
|
||||||
O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer).
|
|
||||||
|
|
||||||
### Aggregation — `→ Result`
|
|
||||||
|
|
||||||
```
|
|
||||||
result = reduce(
|
|
||||||
for p in partitions: // parallel
|
|
||||||
for l in layers(p): // parallel
|
|
||||||
partial(DataStore_p_l)
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
For normalised metrics replace with the two-pass scheme above.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## DataStore derivation
|
|
||||||
|
|
||||||
Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:
|
|
||||||
|
|
||||||
```
|
|
||||||
// count → presence/absence, parallel across (partition, layer)
|
|
||||||
for (p, l) in all_partition_layer_pairs().par_iter():
|
|
||||||
count_store = open PersistentCompactIntMatrix at (p, l)
|
|
||||||
presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
|
|
||||||
```
|
|
||||||
|
|
||||||
Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Relationship to current implementation
|
|
||||||
|
|
||||||
### What is implemented
|
|
||||||
|
|
||||||
- **`obicompactvec::traits`**: `ColumnWeights`, `CountPartials`, `BitPartials` are defined and implemented on `PersistentCompactIntMatrix` and `PersistentBitMatrix`.
|
|
||||||
- **`obilayeredmap::LayeredStore<S>`**: generic wrapper with blanket impls for all three traits. `LayeredStore<LayeredStore<S>>` is the partitioned level — no separate type needed. Tests confirm that splitting data across layers and across partitions gives the same distance matrices as computing on flat combined data.
|
|
||||||
|
|
||||||
### What is not yet implemented
|
|
||||||
|
|
||||||
- `Layer<D: LayerData>` still fuses `MphfLayer` and one `DataStore`. Multiple data stores on the same MPHF are not supported.
|
|
||||||
- `LayeredMap` is a single-partition structure without distance matrix API; it does not yet use `LayeredStore`.
|
|
||||||
- No `PartitionedIndex` type for point queries with parallel partition dispatch.
|
|
||||||
|
|
||||||
### Planned refactoring
|
|
||||||
|
|
||||||
1. Extract `MphfLayer` from `Layer<D>` as an autonomous type.
|
|
||||||
2. Replace `LayerData` trait with the `DataStore` / `ColumnWeights` / `CountPartials` / `BitPartials` system.
|
|
||||||
3. Rewire `LayeredMap` to hold `LayeredStore<PersistentCompactIntMatrix>` (or bit variant) alongside the MPHF layers.
|
|
||||||
4. Implement `PartitionedIndex` using `LayeredStore<LayeredStore<S>>` for data and parallel dispatch for queries.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Multi-genome column invariant
|
|
||||||
|
|
||||||
### The invariant
|
|
||||||
|
|
||||||
After any merge operation, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the current total genome count recorded in `index.meta`. This applies to both `PersistentCompactIntMatrix` (Count mode) and `PersistentBitMatrix` (Presence mode).
|
|
||||||
|
|
||||||
### How it is maintained
|
|
||||||
|
|
||||||
The invariant is established and preserved by three coordinated operations:
|
|
||||||
|
|
||||||
**1. Existing layers — column append.**
|
|
||||||
When merging source genome G into an existing index with `n_existing_genomes` genomes, one column is appended to every existing layer via `append_genome_column`. Slots that contain a kmer present in source G receive its count or `true`; all other slots receive 0 or `false`. After this step, every pre-existing layer has `n_existing_genomes + 1` columns.
|
|
||||||
|
|
||||||
**2. New layers — absent columns prepended.**
|
|
||||||
If source G introduces kmers not found in any existing layer, a new layer is created for those kmers. Before appending source G's own column, `n_existing_genomes` absent columns (all-zero or all-false) are prepended — one per genome already in the index. This ensures the new layer starts at the same column count as every other layer in the partition immediately after creation.
|
|
||||||
|
|
||||||
**3. First merge, Presence mode — `init_presence_matrix`.**
|
|
||||||
The initial single-genome index has no `presence/` directory (presence is implicit: every kmer in the index is present in genome 0). On the first merge, before appending any column for source 1, `Layer<()>::init_presence_matrix` creates `presence/col_000000.pbiv` set entirely to `true` for each existing layer. This retroactively materialises genome 0's presence column, bringing the column count from 0 to 1 so that the subsequent append for source 1 raises it to 2.
|
|
||||||
|
|
||||||
### Why the invariant is required
|
|
||||||
|
|
||||||
The `LayeredStore` aggregation traits (`col_weights`, `partial_bray`, `partial_jaccard`, etc.) sum contributions across all `(partition, layer)` pairs without any shape check. If one layer had fewer columns than others, its contribution would silently produce a malformed result — wrong column sums, wrong distance matrices, and incorrect genome-level statistics.
|
|
||||||
|
|
||||||
The invariant is the precondition that makes the progressive aggregation principle correct: each level can blindly sum matrices from the level below because all matrices have the same shape.
|
|
||||||
|
|||||||
@@ -0,0 +1,22 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: architecture/index_architecture.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obilayeredmap/src/layer.rs` — Layer<D>, trait LayerData, modes () / PersistentCompactIntMatrix / PersistentBitMatrix
|
||||||
|
- `obilayeredmap/src/mphf_layer.rs` — MphfLayer, EvidenceKind (Exact / Approx), LayerEvidence enum
|
||||||
|
- `obilayeredmap/src/map.rs` — LayeredMap<D>
|
||||||
|
- `obilayeredmap/src/meta.rs` — LayerMeta, PartitionMeta
|
||||||
|
- `obikindex/src/meta.rs` — IndexConfig (kmer_size, n_bits, with_counts, evidence, block_bits), IndexMeta
|
||||||
|
- `obikindex/src/index.rs` — KmerIndex, build_layers
|
||||||
|
- `obicompactvec/src/` — PersistentCompactIntMatrix, PersistentBitMatrix (DataStore implementations)
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
FORT RISQUE DE DÉRIVE. Nombreux changements récents :
|
||||||
|
- Ajout de `EvidenceKind` (Exact / Approx { b, z }) dans `IndexConfig` et `LayerMeta`
|
||||||
|
- Ajout de `block_bits` dans `IndexConfig`
|
||||||
|
- `LayerEvidence` enum dans `mphf_layer.rs` remplace l'ancienne approche monolithique
|
||||||
|
- Distinction `open()` vs `open_sequential()` dans `UnitigFileReader`
|
||||||
|
- Commandes `reindex` et `estimate` ajoutées
|
||||||
|
Vérifier que la hiérarchie à 3 niveaux décrite est toujours exacte et que les nouveaux paramètres sont documentés.
|
||||||
@@ -0,0 +1,16 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: architecture/query.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obikmer/src/cmd/query.rs` — commande query, format de sortie
|
||||||
|
- `obikpartitionner/src/query_layer.rs` — routage de la requête à travers les partitions
|
||||||
|
- `obiread/src/lib.rs` — lecture des séquences d'entrée pour la requête
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
RISQUE DE DÉRIVE. Vérifier :
|
||||||
|
- La commande `unitig` a été modifiée pour utiliser `open_sequential()` — vérifier si query est concerné
|
||||||
|
- `find_exact` / `find_approx` / `find` générique ont été ajoutés dans `MphfLayer` — le chemin de requête a changé
|
||||||
|
- Si l'index est approximatif (Approx), la requête peut produire des faux positifs : la doc le mentionne-t-elle ?
|
||||||
|
- Format de sortie CSV (`obikindex/src/csv.rs` ou équivalent) à vérifier
|
||||||
@@ -0,0 +1,12 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: architecture/sequences/invariant.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obikseq/src/sequence.rs` — invariants de représentation des séquences (ACGT, longueur max)
|
||||||
|
- `obikseq/src/unitig.rs` — type Unitig, contrainte MAX_KMERS_PER_CHUNK (255 kmers par chunk)
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
Document court et stable. Vérifier que la limite de 256 nucléotides (ou 255 kmers) par chunk
|
||||||
|
est toujours la même dans `obiskio::MAX_KMERS_PER_CHUNK`.
|
||||||
@@ -0,0 +1,12 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: implementation/chunkreader.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obiread/src/chunk.rs` — SeqChunkIter, détection de frontières FASTA/FASTQ, state machines
|
||||||
|
- `obikrope/src/lib.rs` — type Rope (Vec<Bytes>), opérations zero-copy
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
Document stable (la stratégie de chunking rope ne devrait pas avoir changé).
|
||||||
|
Vérifier que le split FASTA/FASTQ reste correct si de nouveaux formats ont été ajoutés.
|
||||||
@@ -0,0 +1,178 @@
|
|||||||
|
# Approximate evidence: fingerprint-based index
|
||||||
|
|
||||||
|
## Motivation
|
||||||
|
|
||||||
|
`evidence.bin` maps each MPHF slot to the position of the k-mer that owns it,
|
||||||
|
enabling zero-FP verification. On the bacterial BCT dataset (2048 partitions,
|
||||||
|
k=31, ~33 M k-mers/partition) it accounts for 66 % of the lookup-layer footprint:
|
||||||
|
|
||||||
|
| file | size/partition | fraction |
|
||||||
|
|---|---|---|
|
||||||
|
| evidence.bin | 132 MB | 66 % |
|
||||||
|
| unitigs.bin | 58 MB | 29 % |
|
||||||
|
| mphf.bin | 10 MB | 5 % |
|
||||||
|
|
||||||
|
`evidence.bin` is a bijection from MPHF-space to unitig-position-space and
|
||||||
|
costs at minimum ⌈log₂ N⌉ bits per slot — an information-theoretic floor with
|
||||||
|
only ~22 % packing headroom. Compression is not a path to elimination.
|
||||||
|
|
||||||
|
The approximate index replaces `evidence.bin` + `unitigs.bin.idx` with a
|
||||||
|
`fingerprint.bin` file. The MPHF and `unitigs.bin` are kept unchanged. Set
|
||||||
|
operations still require an exact index; the approximate index targets query
|
||||||
|
workloads that can tolerate a bounded false-positive rate.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The Findere model
|
||||||
|
|
||||||
|
A B-bit fingerprint stored per MPHF slot provides the discrimination that
|
||||||
|
`evidence.bin` would otherwise provide through full k-mer reconstruction.
|
||||||
|
|
||||||
|
For a foreign k-mer query, the MPHF maps it to some slot `s`. The fingerprint
|
||||||
|
stored at `s` belongs to the legitimate k-mer at that slot. The FP event is:
|
||||||
|
|
||||||
|
```
|
||||||
|
P(FP per k-mer) = 1 / 2^b
|
||||||
|
```
|
||||||
|
|
||||||
|
The Findere trick raises the effective window to z consecutive k-mers. A query
|
||||||
|
succeeds only when all z fingerprint checks pass, reducing the per-window FP rate:
|
||||||
|
|
||||||
|
```
|
||||||
|
P(FP per z-window) = 1 / 2^(b·z)
|
||||||
|
```
|
||||||
|
|
||||||
|
The effective indexed k-mer length is `k − z + 1`: a query for a (k+z−1)-mer
|
||||||
|
decomposes into z overlapping k-mers, all of which must match.
|
||||||
|
|
||||||
|
Parameters b and z are stored in `layer_meta.json` (`EvidenceKind::Approx { b, z }`).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## `FingerprintVec` on disk
|
||||||
|
|
||||||
|
`fingerprint.bin` layout:
|
||||||
|
|
||||||
|
```
|
||||||
|
magic: b"FPVF" (4 bytes)
|
||||||
|
b: u8 (bits per slot, 1..=64)
|
||||||
|
padding: [0u8; 3]
|
||||||
|
n: u64 LE (number of slots)
|
||||||
|
data: packed bits, ceil(n·b/8) bytes, Lsb0 order
|
||||||
|
```
|
||||||
|
|
||||||
|
`FingerprintVec` is memory-mapped. The match check against a query k-mer:
|
||||||
|
|
||||||
|
```rust
|
||||||
|
fn matches(&self, slot: usize, fingerprint: u64) -> bool {
|
||||||
|
self.get(slot) == (fingerprint & self.mask)
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
`build_approx_evidence` iterates `unitigs.bin` sequentially, writes
|
||||||
|
`kmer.seq_hash()` into the slot assigned by the MPHF, then saves `fingerprint.bin`
|
||||||
|
and `layer_meta.json`. No `.idx` file is produced; random access into
|
||||||
|
`unitigs.bin` is not needed.
|
||||||
|
|
||||||
|
At build time, `find_approx` in `MphfLayer`:
|
||||||
|
|
||||||
|
```rust
|
||||||
|
let slot = self.mphf.index(&kmer.raw());
|
||||||
|
if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None }
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## `EvidenceKind` and metadata
|
||||||
|
|
||||||
|
`layer_meta.json` records which evidence bundle is present:
|
||||||
|
|
||||||
|
```rust
|
||||||
|
pub enum EvidenceKind {
|
||||||
|
Exact,
|
||||||
|
Approx { b: u8, z: u8 },
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
`MphfLayer::open` reads this tag and dispatches `find` to `find_exact` or
|
||||||
|
`find_approx` transparently. `find_exact` panics on an approximate layer;
|
||||||
|
`find_approx` panics on an exact layer — mode mixing is a programming error.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Parameter resolution (`resolve_approx_params`)
|
||||||
|
|
||||||
|
The identity `b·z = ⌈−log₂(fp)⌉` lets any two of (b, z, fp) derive the third.
|
||||||
|
`resolve_approx_params` implements a 2-of-3 rule with conservative ceiling
|
||||||
|
rounding:
|
||||||
|
|
||||||
|
| given | derived |
|
||||||
|
|---|---|
|
||||||
|
| b, z | fp = 1/2^(b·z) |
|
||||||
|
| z, fp | b = ⌈−log₂(fp) / z⌉ |
|
||||||
|
| b, fp | z = ⌈−log₂(fp) / b⌉ |
|
||||||
|
| z only | b = 8 (default), fp derived |
|
||||||
|
| b only | z = 1 (default), fp derived |
|
||||||
|
| fp only | b = 8 (default), z derived |
|
||||||
|
| none | b = 8, z = 1, fp = 1/256 |
|
||||||
|
|
||||||
|
When all three are given, b and z are authoritative and fp is recomputed.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## CLI flags
|
||||||
|
|
||||||
|
Both `index` and `reindex` accept the same flags:
|
||||||
|
|
||||||
|
| flag | type | meaning |
|
||||||
|
|---|---|---|
|
||||||
|
| `--approx` | bool | enable fingerprint evidence |
|
||||||
|
| `--evidence-bits` (`b`) | u8 | fingerprint bits per slot |
|
||||||
|
| `-z` | u8 | Findere z parameter |
|
||||||
|
| `--fp` | f64 | target FP rate per z-window |
|
||||||
|
| `--block-size` | usize | unitig block size for exact `.idx`; ignored in approx mode |
|
||||||
|
|
||||||
|
`--approx` must be set explicitly; the other three flags are optional and
|
||||||
|
resolved by the 2-of-3 rule. Omitting all three produces b=8, z=1.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## `reindex` command
|
||||||
|
|
||||||
|
`reindex` converts an existing index between exact and approximate evidence
|
||||||
|
in-place across all partitions and layers, running partitions in parallel via
|
||||||
|
Rayon.
|
||||||
|
|
||||||
|
Conversion to approximate (`--approx`):
|
||||||
|
|
||||||
|
- Builds `fingerprint.bin` from `unitigs.bin` + `mphf.bin`.
|
||||||
|
- Removes `evidence.bin` and `unitigs.bin.idx`.
|
||||||
|
- Updates `layer_meta.json` with `EvidenceKind::Approx { b, z }`.
|
||||||
|
|
||||||
|
Conversion to exact (default, no `--approx`):
|
||||||
|
|
||||||
|
- Builds `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`.
|
||||||
|
- Removes `fingerprint.bin`.
|
||||||
|
- Updates `layer_meta.json` with `EvidenceKind::Exact`.
|
||||||
|
|
||||||
|
The root `index.meta` is updated with the new evidence kind on success.
|
||||||
|
`mphf.bin` and `unitigs.bin` are never modified.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## `estimate` command
|
||||||
|
|
||||||
|
`estimate` is a dry-run that resolves and prints (b, z, fp) without touching
|
||||||
|
any index. It accepts the same `--evidence-bits`, `-z`, and `--fp` flags and
|
||||||
|
additionally accepts `-k` to display the effective indexed k-mer length:
|
||||||
|
|
||||||
|
```
|
||||||
|
k (query): 31
|
||||||
|
k (indexed): 31
|
||||||
|
z: 1
|
||||||
|
evidence bits (b): 8
|
||||||
|
FP per k-mer: 3.906e-3 (1/2^8)
|
||||||
|
FP per z-window: 3.906e-3 (1/2^8)
|
||||||
|
```
|
||||||
|
|
||||||
|
Useful for choosing parameters before committing to an index build.
|
||||||
@@ -0,0 +1,22 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: implementation/evidence_elimination.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obilayeredmap/src/fingerprint.rs` — FingerprintVec, FingerprintVecWriter, stockage b bits/slot, matches()
|
||||||
|
- `obilayeredmap/src/mphf_layer.rs` — build_approx_evidence(dir, b, z), find_approx()
|
||||||
|
- `obilayeredmap/src/meta.rs` — EvidenceKind::Approx { b, z }, LayerMeta
|
||||||
|
- `obikindex/src/reindex.rs` — KmerIndex::reindex(), conversion exact↔approx en place
|
||||||
|
- `obikmer/src/cmd/reindex.rs` — CLI reindex, options --approx, -z, --evidence-bits, --fp, --block-size
|
||||||
|
- `obikmer/src/cmd/index.rs` — resolve_approx_params(), options --approx, -z, --evidence-bits, --fp
|
||||||
|
- `obikmer/src/cmd/estimate.rs` — commande estimate (dry-run des paramètres)
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
Ce document était à l'origine une discussion de design (4 approches). L'implémentation
|
||||||
|
a maintenant convergé vers l'approche fingerprint (Findere-style).
|
||||||
|
FORT RISQUE DE DÉRIVE — le contenu est probablement un mélange de design et d'implémentation :
|
||||||
|
- Le modèle FP = 1/2^(b·z) et les règles de résolution (2-of-3 parmi b, z, fp) sont implémentés
|
||||||
|
- La commande `reindex` permet la conversion a posteriori exact↔approx
|
||||||
|
- La commande `estimate` fait le dry-run des paramètres
|
||||||
|
Cette page doit être réécrite pour documenter l'implémentation Findere réelle plutôt que les alternatives abandonnées.
|
||||||
@@ -0,0 +1,13 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: implementation/kmer.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obikseq/src/kmer.rs` — layout mémoire (repr(transparent) u64), encodage/décodage, revcomp, forme canonique
|
||||||
|
- `obikseq/src/params.rs` — k global (set_k / k())
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
Document d'implémentation stable. L'algorithme de revcomp bit-à-bit est décrit —
|
||||||
|
vérifier qu'il correspond à `revcomp_raw` dans `obiskio/src/unitig_index.rs` (copie locale)
|
||||||
|
et à l'implémentation dans `obikseq/src/kmer.rs`.
|
||||||
@@ -0,0 +1,19 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: implementation/merge.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obikindex/src/merge.rs` — `KmerIndex::merge()`, validation de compatibilité d'évidence, `validate_evidence_compat()`
|
||||||
|
- `obikpartitionner/src/merge_layer.rs` — `merge_partition()`, construction de la nouvelle layer, paramètre `block_bits`
|
||||||
|
- `obikpartitionner/src/rebuild_layer.rs` — `rebuild_partition()`, paramètre `block_bits`
|
||||||
|
- `obilayeredmap/src/layer.rs` — `Layer::append_genome_column()` (PersistentCompactIntMatrix et PersistentBitMatrix)
|
||||||
|
- `obicompactvec/src/intmatrix.rs` — `append_column` pour PersistentCompactIntMatrix
|
||||||
|
- `obicompactvec/src/bitmatrix.rs` — `append_column` pour PersistentBitMatrix
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
FORT RISQUE DE DÉRIVE. Changements récents :
|
||||||
|
- Ajout de la validation de compatibilité d'évidence : merge exact+approx → erreur (OKIError::IncompatibleEvidence)
|
||||||
|
- `merge_partition` reçoit maintenant `block_bits: u8`
|
||||||
|
- La commande `reindex` a été ajoutée comme outil de conversion exact↔approx avant merge
|
||||||
|
Vérifier que la doc décrit la politique de merge mixed-evidence et le recours à `reindex`.
|
||||||
@@ -0,0 +1,16 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: implementation/mphf.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obilayeredmap/src/mphf_layer.rs` — type Mphf (PtrHash + CubicEps + CachelineEfVec + Xx64), construction en 2 passes, `build()`, `build_exact_evidence()`, `build_approx_evidence()`, `build_evidence()`
|
||||||
|
- `obikpartitionner/src/index_layer.rs` — `build_index_layer()` avec passage de `block_bits`
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
FORT RISQUE DE DÉRIVE. Changements récents :
|
||||||
|
- `build_exact_evidence(dir, block_bits)` — `block_bits` maintenant paramétrisé (défaut 0)
|
||||||
|
- `build_approx_evidence(dir, b, z)` — nouvelle fonction pour l'évidence fingerprint
|
||||||
|
- `build_evidence(dir, kind, block_bits)` — dispatch selon EvidenceKind
|
||||||
|
- Construction en 2 phases : pass 1 (Rayon parallèle) + pass 2 (callback `fill_slot`)
|
||||||
|
Vérifier que la doc décrit correctement les deux nouvelles routes d'évidence et le paramètre `block_bits`.
|
||||||
@@ -2,7 +2,7 @@
|
|||||||
|
|
||||||
## Purpose
|
## Purpose
|
||||||
|
|
||||||
`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **index root → partition → layer**. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
|
`obilayeredmap` implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -20,42 +20,81 @@ Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obico
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## Evidence kinds
|
||||||
|
|
||||||
|
Each layer carries one of two evidence bundles, recorded in `layer_meta.json` at build time:
|
||||||
|
|
||||||
|
```rust
|
||||||
|
pub enum EvidenceKind {
|
||||||
|
Exact,
|
||||||
|
Approx { b: u8, z: u8 },
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
`EvidenceKind` is stored in `LayerMeta` (one per layer directory). `open()` reads it to decide which evidence files to load.
|
||||||
|
|
||||||
|
- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives. Requires random-access `.idx` at query time.
|
||||||
|
- **Approx**: writes `fingerprint.bin` only. False-positive rate per kmer query = 1/2^b. `z` is the Findere consecutive-kmer parameter: `z` consecutive kmers must all match, reducing the effective FP rate per read to approximately W / 2^(b·z) where W = L − k − z + 2 is the number of windows in a read of length L. No `.idx` written or required.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## MphfLayer — autonomous kmer → slot mapping
|
## MphfLayer — autonomous kmer → slot mapping
|
||||||
|
|
||||||
`MphfLayer` encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.
|
`MphfLayer` encapsulates the MPHF and evidence store for one layer. It is independent of any payload.
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
pub struct MphfLayer {
|
pub struct MphfLayer {
|
||||||
mphf: Mphf,
|
mphf: Mphf,
|
||||||
evidence: Evidence,
|
ev: LayerEvidence, // loaded at open() time
|
||||||
unitigs: UnitigFileReader,
|
n: usize,
|
||||||
n: usize, // number of indexed kmers = number of MPHF slots
|
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
Public API:
|
`LayerEvidence` is an internal enum, not public:
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
impl MphfLayer {
|
enum LayerEvidence {
|
||||||
pub fn open(dir: &Path) -> OLMResult<Self>
|
Exact { evidence: Evidence, unitigs: UnitigFileReader },
|
||||||
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> // Some(slot) or None
|
Approx { fingerprint: FingerprintVec },
|
||||||
pub fn n(&self) -> usize
|
|
||||||
pub fn unitig_writer(dir: &Path) -> OLMResult<UnitigFileWriter>
|
|
||||||
pub(crate) fn build(
|
|
||||||
dir: &Path,
|
|
||||||
fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
|
|
||||||
) -> OLMResult<usize>
|
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
`find` returns `Some(slot)` only after verifying via evidence that the kmer is actually indexed. It returns `None` for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test).
|
### Query API
|
||||||
|
|
||||||
`build` runs two sequential passes over `unitigs.bin`:
|
Three public query methods, all returning `Option<usize>` (slot index):
|
||||||
|
|
||||||
1. **Pass 1**: iterate all canonical kmers in parallel via rayon, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
|
```rust
|
||||||
2. **Pass 2**: iterate again sequentially, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.
|
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
|
||||||
|
pub fn find_exact(&self, kmer: CanonicalKmer) -> Option<usize>
|
||||||
|
pub fn find_approx(&self, kmer: CanonicalKmer) -> Option<usize>
|
||||||
|
```
|
||||||
|
|
||||||
For empty layers (n = 0), `build` returns `Ok(0)` immediately after creating empty `mphf.bin` and `evidence.bin`.
|
- `find` dispatches transparently to `find_exact` or `find_approx` based on the evidence variant loaded at `open()`.
|
||||||
|
- `find_exact` panics if the layer holds approximate evidence; zero false positives.
|
||||||
|
- `find_approx` panics if the layer holds exact evidence; FP rate 1/2^b per kmer.
|
||||||
|
|
||||||
|
`open()` requires `unitigs.bin.idx` (random access into unitigs). `open_sequential()` on `UnitigFileReader` does not require the `.idx` and is used during build passes.
|
||||||
|
|
||||||
|
### Build surface
|
||||||
|
|
||||||
|
```rust
|
||||||
|
// Full MPHF + exact evidence build (two-pass, parallel)
|
||||||
|
pub(crate) fn build(dir, block_bits, fill_slot) -> OLMResult<usize>
|
||||||
|
|
||||||
|
// Evidence-only builds (MPHF already present in dir)
|
||||||
|
pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
|
||||||
|
pub fn build_approx_evidence(dir, b, z) -> OLMResult<usize>
|
||||||
|
pub fn build_evidence(dir, kind, block_bits) -> OLMResult<usize> // dispatch
|
||||||
|
```
|
||||||
|
|
||||||
|
`MphfLayer::build` runs two sequential passes over `unitigs.bin`:
|
||||||
|
|
||||||
|
1. **Pass 1** (parallel via rayon): iterate all canonical kmers, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
|
||||||
|
2. **Pass 2** (sequential): iterate again, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.
|
||||||
|
|
||||||
|
`build` always produces exact evidence. For approximate evidence, use `build_approx_evidence` after MPHF construction.
|
||||||
|
|
||||||
|
For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -81,7 +120,7 @@ pub struct Hit<T = ()> {
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not in the trait.
|
`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not part of the trait.
|
||||||
|
|
||||||
| Type | `Item` | Description |
|
| Type | `Item` | Description |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
@@ -89,31 +128,118 @@ pub struct Hit<T = ()> {
|
|||||||
| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) |
|
| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) |
|
||||||
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) |
|
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) |
|
||||||
|
|
||||||
**Build signatures:**
|
### Build signatures
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
// mode 1
|
// mode 1
|
||||||
impl Layer<()> {
|
impl Layer<()> {
|
||||||
pub fn build(out_dir: &Path) -> OLMResult<usize>
|
pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult<usize>
|
||||||
}
|
}
|
||||||
|
|
||||||
// mode 2
|
// mode 2
|
||||||
impl Layer<PersistentCompactIntMatrix> {
|
impl Layer<PersistentCompactIntMatrix> {
|
||||||
pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
|
pub fn build(out_dir: &Path, block_bits: u8,
|
||||||
pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
|
count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
|
||||||
|
pub fn build_from_map(out_dir: &Path, block_bits: u8,
|
||||||
|
counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
|
||||||
}
|
}
|
||||||
|
|
||||||
// mode 3
|
// mode 3
|
||||||
impl Layer<PersistentBitMatrix> {
|
impl Layer<PersistentBitMatrix> {
|
||||||
pub fn build_presence(
|
pub fn build_presence(out_dir: &Path, block_bits: u8,
|
||||||
out_dir: &Path,
|
n_genomes: usize,
|
||||||
n_genomes: usize,
|
present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
|
||||||
present_in: impl Fn(CanonicalKmer, usize) -> bool,
|
|
||||||
) -> OLMResult<usize>
|
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Mode 2 pre-reads `n_kmers` from `unitigs.bin` to size the `PersistentCompactIntMatrixBuilder` before calling `MphfLayer::build`. Mode 3 does the same for `PersistentBitMatrixBuilder`.
|
All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Modes 2 and 3 pre-read `n_kmers` from `unitigs.bin` via `UnitigFileReader::open_sequential` to size the matrix builder before calling `MphfLayer::build`.
|
||||||
|
|
||||||
|
### Evidence build helpers on Layer
|
||||||
|
|
||||||
|
```rust
|
||||||
|
impl<D: LayerData> Layer<D> {
|
||||||
|
pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
|
||||||
|
pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult<usize>
|
||||||
|
pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize>
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
These delegate directly to the corresponding `MphfLayer` methods and are provided so call sites can remain typed at the `Layer<D>` level.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## FingerprintVec and FingerprintVecWriter
|
||||||
|
|
||||||
|
Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot.
|
||||||
|
|
||||||
|
```
|
||||||
|
fingerprint.bin format:
|
||||||
|
magic: b"FPVF" (4 bytes)
|
||||||
|
b: u8 (bits per fingerprint, 1..=64)
|
||||||
|
padding: [0u8; 3]
|
||||||
|
n: u64 LE (number of slots)
|
||||||
|
data: packed bits, ceil(n*b/8) bytes, Lsb0 order
|
||||||
|
```
|
||||||
|
|
||||||
|
```rust
|
||||||
|
impl FingerprintVec {
|
||||||
|
pub fn open(path: &Path) -> OLMResult<Self>
|
||||||
|
pub fn get(&self, slot: usize) -> u64
|
||||||
|
pub fn matches(&self, slot: usize, fingerprint: u64) -> bool
|
||||||
|
pub fn n(&self) -> usize
|
||||||
|
pub fn b(&self) -> u8
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
`matches(slot, hash)` extracts the b-bit fingerprint stored at `slot` and compares it to the low b bits of `hash`. It is the core operation of `find_approx`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## LayeredMap\<D\> — collection of layers
|
||||||
|
|
||||||
|
`LayeredMap<D>` wraps `Vec<Layer<D>>` for a single partition directory.
|
||||||
|
|
||||||
|
```rust
|
||||||
|
pub struct LayeredMap<D: LayerData = ()> {
|
||||||
|
root: PathBuf,
|
||||||
|
meta: PartitionMeta,
|
||||||
|
layers: Vec<Layer<D>>,
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
`PartitionMeta` (`meta.json` at the partition root) stores `n_layers`.
|
||||||
|
|
||||||
|
### Common methods
|
||||||
|
|
||||||
|
```rust
|
||||||
|
pub fn open(root: &Path) -> OLMResult<Self>
|
||||||
|
pub fn create(root: &Path) -> OLMResult<Self>
|
||||||
|
pub fn n_layers(&self) -> usize
|
||||||
|
pub fn layer(&self, i: usize) -> &Layer<D>
|
||||||
|
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
|
||||||
|
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>
|
||||||
|
```
|
||||||
|
|
||||||
|
`query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0.
|
||||||
|
|
||||||
|
### push_layer
|
||||||
|
|
||||||
|
`push_layer` builds the next layer from a `unitigs.bin` already written via `next_layer_writer`, using `DEFAULT_BLOCK_BITS`:
|
||||||
|
|
||||||
|
```rust
|
||||||
|
// mode 1
|
||||||
|
impl LayeredMap<()> {
|
||||||
|
pub fn push_layer(&mut self) -> OLMResult<usize>
|
||||||
|
}
|
||||||
|
|
||||||
|
// mode 2
|
||||||
|
impl LayeredMap<PersistentCompactIntMatrix> {
|
||||||
|
pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
|
||||||
|
pub fn push_layer_from_map(&mut self, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Mode 3 (`PersistentBitMatrix`) has no `push_layer` on `LayeredMap`; callers build directly via `Layer<PersistentBitMatrix>::build_presence`.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -131,14 +257,6 @@ impl<S: BitPartials> BitPartials for LayeredStore<S> { … } // element-wi
|
|||||||
|
|
||||||
Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type.
|
Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type.
|
||||||
|
|
||||||
**Aggregation hierarchy:**
|
|
||||||
|
|
||||||
```
|
|
||||||
PersistentCompactIntMatrix implements CountPartials
|
|
||||||
LayeredStore<PersistentCompactIntMatrix> via blanket impl (one partition)
|
|
||||||
LayeredStore<LayeredStore<…>> via blanket impl (partitioned index)
|
|
||||||
```
|
|
||||||
|
|
||||||
**Leaf implementors** (in `obicompactvec`):
|
**Leaf implementors** (in `obicompactvec`):
|
||||||
|
|
||||||
| Type | Traits |
|
| Type | Traits |
|
||||||
@@ -146,8 +264,6 @@ LayeredStore<LayeredStore<…>> via blanket impl (partitione
|
|||||||
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
|
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
|
||||||
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |
|
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |
|
||||||
|
|
||||||
`PersistentCompactIntVec` and `PersistentBitVec` do not implement these traits — they are single-column primitives, not matrix-level aggregators.
|
|
||||||
|
|
||||||
See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.
|
See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -155,34 +271,30 @@ See [Kmer index architecture](../architecture/index_architecture.md) for the ful
|
|||||||
## On-disk structure
|
## On-disk structure
|
||||||
|
|
||||||
```
|
```
|
||||||
index_root/ ← LayeredMap (collection)
|
partition_root/ ← LayeredMap (one partition)
|
||||||
meta.json
|
meta.json — {"n_layers": N}
|
||||||
part_00000/ ← Partition
|
layer_0/ ← Layer
|
||||||
layer_0/ ← Layer
|
layer_meta.json — {"type": "exact"} or {"type": "approx", "b": B, "z": Z}
|
||||||
mphf.bin — ptr_hash MPHF (epserde format)
|
mphf.bin — ptr_hash MPHF (epserde format)
|
||||||
unitigs.bin — packed 2-bit nucleotide sequences
|
unitigs.bin — packed 2-bit nucleotide sequences
|
||||||
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
|
unitigs.bin.idx — UIDX index (exact evidence only)
|
||||||
evidence.bin — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
|
evidence.bin — [u32; n], LE (exact evidence only)
|
||||||
counts/ [mode 2] PersistentCompactIntMatrix
|
fingerprint.bin — packed b-bit array (approx evidence only)
|
||||||
meta.json {"n": N, "n_cols": 1}
|
counts/ [mode 2] PersistentCompactIntMatrix
|
||||||
col_000000.pciv
|
meta.json
|
||||||
presence/ [mode 3] PersistentBitMatrix
|
col_000000.pciv
|
||||||
meta.json {"n": N, "n_cols": G}
|
presence/ [mode 3] PersistentBitMatrix
|
||||||
col_000000.pbiv
|
meta.json
|
||||||
…
|
col_000000.pbiv …
|
||||||
layer_1/
|
layer_1/
|
||||||
…
|
|
||||||
part_00001/
|
|
||||||
…
|
…
|
||||||
```
|
```
|
||||||
|
|
||||||
**Partition** (`part_XXXXX/`): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.
|
`unitigs.bin.idx` is required by `open()` (random access). `open_sequential()` on `UnitigFileReader` omits it and is used during build passes and approx-evidence construction.
|
||||||
|
|
||||||
**Layer** (`layer_N/`): one `MphfLayer` plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Evidence encoding
|
## Evidence encoding (exact)
|
||||||
|
|
||||||
`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:
|
`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:
|
||||||
|
|
||||||
@@ -191,9 +303,9 @@ bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
|
|||||||
bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
|
bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
|
||||||
```
|
```
|
||||||
|
|
||||||
Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id`.
|
`chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id` (requires `unitigs.bin.idx` for random access).
|
||||||
|
|
||||||
For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.
|
For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -203,7 +315,7 @@ For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the
|
|||||||
type Mphf = PtrHash<
|
type Mphf = PtrHash<
|
||||||
u64, // key type: canonical kmer raw encoding
|
u64, // key type: canonical kmer raw encoding
|
||||||
CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
|
CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
|
||||||
CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
|
CachelineEfVec<Vec<CachelineEf>>, // remap: Elias-Fano
|
||||||
Xx64, // hasher: XXH3-64 with seed
|
Xx64, // hasher: XXH3-64 with seed
|
||||||
Vec<u8>, // pilots
|
Vec<u8>, // pilots
|
||||||
>;
|
>;
|
||||||
@@ -211,21 +323,41 @@ type Mphf = PtrHash<
|
|||||||
|
|
||||||
`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
|
`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
|
||||||
|
|
||||||
`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5) is a balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, 20% less space.
|
`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5): 2× slower construction than `Linear/λ=3.0`, ~20% less space.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Query path
|
## Column append and merge support
|
||||||
|
|
||||||
|
These methods extend existing layers with new genome columns without touching the MPHF.
|
||||||
|
|
||||||
|
### Layer-level genome column append
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
pub fn query(&self, kmer: CanonicalKmer) -> Option<Hit<D::Item>> {
|
impl Layer<PersistentBitMatrix> {
|
||||||
self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) })
|
pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
|
||||||
|
}
|
||||||
|
|
||||||
|
impl Layer<PersistentCompactIntMatrix> {
|
||||||
|
pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
`MphfLayer::find` probes the MPHF, decodes evidence, and verifies the kmer — returning `Some(slot)` on match, `None` otherwise. `data.read(slot)` is called only on a confirmed hit.
|
Both delegate to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They write a new column file (`col_NNNNNN.pbiv` / `col_NNNNNN.pciv`) and update `meta.json` to increment `n_cols`. `value_of` is called once per slot (0..n).
|
||||||
|
|
||||||
In `LayeredMap`, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.
|
### Presence matrix initialisation
|
||||||
|
|
||||||
|
```rust
|
||||||
|
impl Layer<()> {
|
||||||
|
pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Called on the first merge of a Presence-mode index. Creates `presence/` with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.
|
||||||
|
|
||||||
|
### Why the MPHF is never rebuilt
|
||||||
|
|
||||||
|
The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -235,9 +367,9 @@ When adding dataset B to an existing index:
|
|||||||
|
|
||||||
1. For each partition, probe existing layers for kmers of B routed to that partition.
|
1. For each partition, probe existing layers for kmers of B routed to that partition.
|
||||||
2. Collect kmers absent from all layers → `B \ index`.
|
2. Collect kmers absent from all layers → `B \ index`.
|
||||||
3. Write `B \ index` to a new `unitigs.bin` via `MphfLayer::unitig_writer`.
|
3. Write `B \ index` to a new `unitigs.bin` via `next_layer_writer()`.
|
||||||
4. Call `Layer<D>::build` on the new directory.
|
4. Call `Layer<D>::build` (or `build_presence`) on the new layer directory.
|
||||||
5. Update `meta.json`.
|
5. Call `push_layer` (or `append_layer`) to register the new layer in `meta.json`.
|
||||||
|
|
||||||
Each partition's new layer is built independently; the operation is fully parallel across partitions.
|
Each partition's new layer is built independently; the operation is fully parallel across partitions.
|
||||||
|
|
||||||
@@ -250,62 +382,9 @@ Each partition's new layer is built independently; the operation is fully parall
|
|||||||
| `ptr_hash 1.1` | MPHF per layer |
|
| `ptr_hash 1.1` | MPHF per layer |
|
||||||
| `cacheline-ef 1.1` | compact remap inside ptr_hash |
|
| `cacheline-ef 1.1` | compact remap inside ptr_hash |
|
||||||
| `epserde 0.8` | zero-copy MPHF serialisation |
|
| `epserde 0.8` | zero-copy MPHF serialisation |
|
||||||
| `memmap2 0.9` | mmap of evidence and payload files |
|
| `memmap2 0.9` | mmap of evidence and fingerprint files |
|
||||||
| `obiskio` | unitig file writer/reader |
|
| `bitvec` | packed b-bit fingerprint storage |
|
||||||
|
| `obiskio` | unitig file writer/reader + `.idx` build |
|
||||||
| `obicompactvec` | payload types + aggregation traits |
|
| `obicompactvec` | payload types + aggregation traits |
|
||||||
| `rayon 1` | parallel MPHF construction pass |
|
| `rayon 1` | parallel MPHF construction pass |
|
||||||
| `ndarray 0.16` | aggregation output arrays |
|
| `serde / serde_json` | `LayerMeta` + `PartitionMeta` serialisation |
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Column append and merge support
|
|
||||||
|
|
||||||
These methods extend existing layers with new genome columns without touching the MPHF. They are the building blocks of the `merge` command.
|
|
||||||
|
|
||||||
### Matrix column append
|
|
||||||
|
|
||||||
```rust
|
|
||||||
impl PersistentCompactIntMatrix {
|
|
||||||
pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
|
|
||||||
}
|
|
||||||
|
|
||||||
impl PersistentBitMatrix {
|
|
||||||
pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Both methods write a new column file (`col_NNNNNN.pciv` / `col_NNNNNN.pbiv`) and update `meta.json` to increment `n_cols`. The `value_of` closure is called once per slot (indexed 0..n) to populate the column. The matrix `n` (row count) is read from the existing `meta.json` and must not change.
|
|
||||||
|
|
||||||
### Presence matrix initialisation
|
|
||||||
|
|
||||||
```rust
|
|
||||||
impl Layer<()> {
|
|
||||||
pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Called on the first merge of a Presence-mode index. Creates the `presence/` subdirectory with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records that genome 0 (the original source) is present in every slot of this layer, satisfying the column count invariant before any new-source column is appended.
|
|
||||||
|
|
||||||
### Layer-level genome column append
|
|
||||||
|
|
||||||
```rust
|
|
||||||
impl Layer<PersistentBitMatrix> {
|
|
||||||
pub fn append_genome_column(
|
|
||||||
layer_dir: &Path,
|
|
||||||
value_of: impl Fn(usize) -> bool,
|
|
||||||
) -> OLMResult<()>
|
|
||||||
}
|
|
||||||
|
|
||||||
impl Layer<PersistentCompactIntMatrix> {
|
|
||||||
pub fn append_genome_column(
|
|
||||||
layer_dir: &Path,
|
|
||||||
value_of: impl Fn(usize) -> u32,
|
|
||||||
) -> OLMResult<()>
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
These delegate directly to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They are typed at the `Layer` level to make call sites mode-aware without exposing the inner matrix path construction.
|
|
||||||
|
|
||||||
### Why the MPHF is never rebuilt
|
|
||||||
|
|
||||||
The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once from the kmer set of a layer and is immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only adds a new data column indexed by the same slot numbers. Rebuilding the MPHF would require re-running the full construction pipeline (two sequential passes over unitigs, parallel ptr_hash construction) and would invalidate any open memory maps. Column append avoids all of this: the only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update. Kmers absent from a given layer are represented as zero (count) or false (presence) values in the new column — no structural change to the layer is required.
|
|
||||||
|
|||||||
@@ -0,0 +1,22 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: implementation/obilayeredmap.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obilayeredmap/src/mphf_layer.rs` — MphfLayer, LayerEvidence enum (Exact/Approx), find(), find_exact(), find_approx()
|
||||||
|
- `obilayeredmap/src/layer.rs` — Layer<D>, trait LayerData, modes () / PersistentCompactIntMatrix / PersistentBitMatrix, build(), build_evidence(), append_genome_column()
|
||||||
|
- `obilayeredmap/src/map.rs` — LayeredMap<D>, push_layer(), query()
|
||||||
|
- `obilayeredmap/src/evidence.rs` — Evidence, EvidenceWriter, encodage chunk_id:rank
|
||||||
|
- `obilayeredmap/src/fingerprint.rs` — FingerprintVec, FingerprintVecWriter, matches()
|
||||||
|
- `obilayeredmap/src/meta.rs` — LayerMeta, EvidenceKind (Exact / Approx { b, z })
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
FORT RISQUE DE DÉRIVE. C'est le fichier le plus affecté par les changements récents :
|
||||||
|
- EvidenceKind (Exact / Approx) est désormais un concept de premier plan — toute la sémantique de query en dépend
|
||||||
|
- `LayerEvidence` enum interne à `MphfLayer` : dispatch transparent find() → find_exact() ou find_approx()
|
||||||
|
- `fingerprint.rs` : module entièrement nouveau (FingerprintVec + FingerprintVecWriter)
|
||||||
|
- `build_evidence()` / `build_exact_evidence()` / `build_approx_evidence()` sont nouveaux
|
||||||
|
- `block_bits` dans les fonctions build : O(1) garanti avec le chemin chaud explicit pour block_bits=0
|
||||||
|
- Séparation open() (accès aléatoire, requiert .idx) vs open_sequential() (itération seule)
|
||||||
|
Pratiquement toute cette page est à réécrire.
|
||||||
@@ -0,0 +1,13 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: implementation/obipipeline.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obipipeline/src/lib.rs` — WorkerPool, Pipeline, macro make_pipeline!
|
||||||
|
- `obipipeline/src/scheduler.rs` — Scheduler avec Select biaisé sur les entrées de canaux
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
Document stable (librairie générique, peu de risque de dérive).
|
||||||
|
Vérifier si `obipipeline` est toujours utilisé dans la phase scatter de `obikpartitionner`
|
||||||
|
ou s'il a été remplacé par Rayon dans certains chemins.
|
||||||
@@ -0,0 +1,13 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: implementation/persistent_bit_vec.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obicompactvec/src/bitvec.rs` — PersistentBitVec, opérations mot u64, invariant de padding
|
||||||
|
- `obicompactvec/src/bitmatrix.rs` — PersistentBitMatrix, wrapper colonne-major, append_column
|
||||||
|
- `obicompactvec/src/bitmatrix.rs` — PersistentBitMatrixBuilder
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
Document d'implémentation stable. Vérifier que `PersistentBitMatrixBuilder` et `append_column`
|
||||||
|
sont couverts (utilisés dans `Layer::<PersistentBitMatrix>::build_presence` et `append_genome_column`).
|
||||||
@@ -0,0 +1,14 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: implementation/persistent_compact_int_vec.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obicompactvec/src/builder.rs` — PersistentCompactIntVecBuilder, cycle de vie
|
||||||
|
- `obicompactvec/src/reader.rs` — PersistentCompactIntVec, accès aléatoire et séquentiel
|
||||||
|
- `obicompactvec/src/intmatrix.rs` — PersistentCompactIntMatrix, wrapper colonne-major, append_column
|
||||||
|
- `obicompactvec/src/format.rs` — format de fichier (magic PCIV, header, primary u8, overflow, index)
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
Document d'implémentation stable. Vérifier que `append_column` (utilisé dans merge et reindex)
|
||||||
|
est décrit. Vérifier que `PersistentCompactIntMatrixBuilder` est couvert (utilisé dans `layer.rs`).
|
||||||
@@ -0,0 +1,19 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: implementation/pipeline.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obikpartitionner/src/partition.rs` — estimation des paramètres (phase 0)
|
||||||
|
- `obiskbuilder/src/iter.rs` — scatter : filtre entropie, extraction superkmers, routage partition (phase 1)
|
||||||
|
- `obikpartitionner/src/filter.rs` — déduplication bucket-sort (phase 2)
|
||||||
|
- `obikpartitionner/src/kmer_sort.rs` — tri externe + agrégation de comptages (phase 3)
|
||||||
|
- `obidebruinj/src/debruijn.rs` — graphe De Bruijn, extraction des unitigs (phase 5)
|
||||||
|
- `obikpartitionner/src/index_layer.rs` — construction MPHF + évidence (phase 6), paramètre `block_bits`
|
||||||
|
- `obikindex/src/index.rs` — `build_layers()`, `dereplicate_and_count()`
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
RISQUE DE DÉRIVE modéré. Vérifier :
|
||||||
|
- Phase 6 : la doc mentionne-t-elle le filtre d'abondance (`min_ab`, `max_ab`) ?
|
||||||
|
- Phase 6 : `block_bits` passé à `build_index_layer` depuis `IndexConfig`
|
||||||
|
- Phase 6 : dispatch exact/approx selon `EvidenceKind` dans `build_index_layer`
|
||||||
@@ -1,5 +1,136 @@
|
|||||||
# On-disk collection structure
|
# On-disk index layout
|
||||||
|
|
||||||
See [obilayeredmap crate](obilayeredmap.md) for the current on-disk layout.
|
## Directory tree
|
||||||
|
|
||||||
The index root contains one `part_XXXXX/` directory per partition, each holding one or more `layer_N/` directories. Each layer directory contains `mphf.bin`, `unitigs.bin`, `unitigs.bin.idx`, `evidence.bin`, and optionally a `counts/` or `presence/` payload directory.
|
```
|
||||||
|
<index_root>/
|
||||||
|
index.meta ← JSON: IndexMeta
|
||||||
|
scatter.done ← sentinel: scatter phase complete
|
||||||
|
count.done ← sentinel: dereplicate + count complete
|
||||||
|
index.done ← sentinel: MPHF index fully built
|
||||||
|
spectrums/
|
||||||
|
<label>.json ← kmer frequency spectrum per genome
|
||||||
|
partitions/
|
||||||
|
part_00000/ ← one dir per partition (zero-padded 5 digits, 0..2^n_bits−1)
|
||||||
|
index/
|
||||||
|
meta.json ← PartitionMeta { n_layers }
|
||||||
|
layer_0/
|
||||||
|
unitigs.bin ← binary unitig sequences (2-bit packed)
|
||||||
|
unitigs.bin.idx ← block-sampled offset index (exact evidence only)
|
||||||
|
mphf.bin ← serialised PtrHash MPHF
|
||||||
|
layer_meta.json ← LayerMeta { evidence: EvidenceKind }
|
||||||
|
evidence.bin ← chunk_id:rank per MPHF slot (Exact only)
|
||||||
|
fingerprint.bin ← b-bit fingerprints per MPHF slot (Approx only)
|
||||||
|
counts/ ← PersistentCompactIntMatrix (if with_counts=true)
|
||||||
|
presence/ ← PersistentBitMatrix (if presence mode, merge)
|
||||||
|
layer_1/ ← added by merge; same structure as layer_0
|
||||||
|
layer_2/ …
|
||||||
|
part_00001/ …
|
||||||
|
```
|
||||||
|
|
||||||
|
## State machine (sentinels)
|
||||||
|
|
||||||
|
The sentinels are touched atomically at the end of each pipeline stage.
|
||||||
|
A partial run (e.g. scatter interrupted) leaves no sentinel; the state is
|
||||||
|
detected as the lowest sentinel present.
|
||||||
|
|
||||||
|
| State | Sentinel present | Meaning |
|
||||||
|
|---|---|---|
|
||||||
|
| `Empty` | — | `index.meta` exists; scatter not started or interrupted |
|
||||||
|
| `Scattered` | `scatter.done` | All super-kmers routed to partition files |
|
||||||
|
| `Counted` | `count.done` | Partitions dereplicated; `spectrums/` written |
|
||||||
|
| `Indexed` | `index.done` | All MPHF layers built; index ready for queries |
|
||||||
|
|
||||||
|
## index.meta (IndexMeta)
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"version": 1,
|
||||||
|
"config": {
|
||||||
|
"kmer_size": 31,
|
||||||
|
"minimizer_size": 11,
|
||||||
|
"n_bits": 8,
|
||||||
|
"with_counts": false,
|
||||||
|
"evidence": "Exact",
|
||||||
|
"block_bits": 0
|
||||||
|
},
|
||||||
|
"genomes": [
|
||||||
|
{ "label": "genome_A", "meta": { "species": "Homo sapiens" } }
|
||||||
|
]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
`n_bits` determines the partition count: `2^n_bits` directories under `partitions/`.
|
||||||
|
|
||||||
|
`evidence` is either the string `"Exact"` or `{"Approx": {"b": 8, "z": 1}}`.
|
||||||
|
|
||||||
|
`block_bits` controls the `.idx` granularity: one offset entry every `2^block_bits`
|
||||||
|
chunks. `block_bits=0` stores one entry per chunk (O(1) random access, largest `.idx`).
|
||||||
|
|
||||||
|
`GenomeInfo.meta` is a free-form string→string map for categorical metadata (e.g.
|
||||||
|
taxonomy, sample origin). It is optional; defaults to empty.
|
||||||
|
|
||||||
|
## Layer files
|
||||||
|
|
||||||
|
### unitigs.bin
|
||||||
|
|
||||||
|
2-bit packed binary unitig sequences. Each record: 1 byte `seql_minus_k`
|
||||||
|
(nucleotide length − k), followed by `ceil((seql_minus_k + k) / 4)` bytes of
|
||||||
|
packed sequence. Long unitigs are transparently split into overlapping chunks
|
||||||
|
(k−1 nucleotide overlap) so no k-mer crosses a chunk boundary.
|
||||||
|
|
||||||
|
### unitigs.bin.idx (Exact only)
|
||||||
|
|
||||||
|
Magic `UIX3`, little-endian header: `block_bits` (u32), `n_unitigs` (u32),
|
||||||
|
`n_kmers` (u64), then `ceil(n_unitigs / 2^block_bits) + 1` byte-offset entries
|
||||||
|
(u32 each, last entry is a sentinel past-end offset). Absent for Approx layers.
|
||||||
|
|
||||||
|
### mphf.bin
|
||||||
|
|
||||||
|
PtrHash MPHF serialised with epserde. Maps canonical kmer (u64, left-aligned
|
||||||
|
2-bit) to a slot index in `[0, n_kmers)`.
|
||||||
|
|
||||||
|
### layer_meta.json (LayerMeta)
|
||||||
|
|
||||||
|
```json
|
||||||
|
{ "evidence": { "type": "exact" } }
|
||||||
|
```
|
||||||
|
or
|
||||||
|
```json
|
||||||
|
{ "evidence": { "type": "approx", "b": 8, "z": 1 } }
|
||||||
|
```
|
||||||
|
|
||||||
|
### evidence.bin (Exact)
|
||||||
|
|
||||||
|
One `(chunk_id: u32, rank: u8)` record per MPHF slot, packed. Used to verify
|
||||||
|
that the kmer mapped to a slot is actually present: `unitigs.bin[chunk_id][rank]`
|
||||||
|
is re-read and compared against the query.
|
||||||
|
|
||||||
|
### fingerprint.bin (Approx)
|
||||||
|
|
||||||
|
`b`-bit fingerprint per MPHF slot derived from the kmer's sequence hash.
|
||||||
|
False-positive rate per query ≈ `1/2^b`. With Findere parameter `z ≥ 2`,
|
||||||
|
`z` consecutive k-mers must all match, reducing the effective FP rate to
|
||||||
|
approximately `W / 2^(b·z)` per read of length `L`
|
||||||
|
(where `W = L − k − z + 2`).
|
||||||
|
|
||||||
|
### counts/ (PersistentCompactIntMatrix)
|
||||||
|
|
||||||
|
Present when `with_counts=true`. One column per genome; each row holds the
|
||||||
|
per-genome k-mer count for the corresponding MPHF slot. Appended column-by-column
|
||||||
|
during indexing and merge.
|
||||||
|
|
||||||
|
### presence/ (PersistentBitMatrix)
|
||||||
|
|
||||||
|
Present when the layer was built in presence/absence mode (merge path).
|
||||||
|
One bit per genome per MPHF slot. Written during merge; never present on a
|
||||||
|
freshly indexed single-genome layer.
|
||||||
|
|
||||||
|
## meta.json (PartitionMeta)
|
||||||
|
|
||||||
|
```json
|
||||||
|
{ "n_layers": 2 }
|
||||||
|
```
|
||||||
|
|
||||||
|
Records how many `layer_N/` directories exist under `index/`. Incremented by
|
||||||
|
each merge that adds a layer.
|
||||||
|
|||||||
@@ -0,0 +1,18 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: implementation/storage.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obikindex/src/meta.rs` — IndexMeta, IndexConfig (version, config, genomes)
|
||||||
|
- `obikindex/src/index.rs` — layout sur disque : partitions/, index.meta
|
||||||
|
- `obilayeredmap/src/meta.rs` — LayerMeta (evidence kind), PartitionMeta (n_layers)
|
||||||
|
- `obiskio/src/unitig_index.rs` — fichiers unitigs.bin + unitigs.bin.idx
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
FORT RISQUE DE DÉRIVE. Nombreux champs ajoutés :
|
||||||
|
- `IndexConfig` : champs `evidence` (EvidenceKind) et `block_bits` ajoutés
|
||||||
|
- Nouveau fichier `fingerprint.bin` pour l'évidence approximative
|
||||||
|
- `LayerMeta` / `layer_meta.json` introduit pour stocker EvidenceKind par layer
|
||||||
|
- Structure du répertoire layer : `evidence.bin` vs `fingerprint.bin` selon le mode
|
||||||
|
Mettre à jour le schéma de layout sur disque en conséquence.
|
||||||
@@ -0,0 +1,13 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: implementation/superkmer.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obikseq/src/superkmer.rs` — layout mémoire SuperKmer (header 32 bits + séquence byte-alignée), encodage ASCII, revcomp, deque minimiseur
|
||||||
|
- `obiskbuilder/src/lib.rs` — fenêtre glissante monotone pour le maintien du minimiseur
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
Document d'implémentation détaillé. Vérifier que le layout header (longueur, orientation,
|
||||||
|
position minimiseur) n'a pas changé. La doc mentionne un revcomp SIMD — vérifier si c'est
|
||||||
|
toujours le cas ou si l'implémentation est scalaire.
|
||||||
@@ -2,243 +2,139 @@
|
|||||||
|
|
||||||
## Role of unitigs in the index
|
## Role of unitigs in the index
|
||||||
|
|
||||||
The MPHF maps each canonical kmer to an integer slot, but provides no way to reconstruct the kmer from its slot. A downstream operation (query, set operation) that receives a slot index and needs the kmer sequence must be able to retrieve it. The **evidence file** serves this purpose: it stores the kmer sequences in compact form and provides, for each MPHF slot, a pointer to where the corresponding kmer can be decoded.
|
The MPHF maps each canonical kmer to an integer slot but provides no inverse: a slot index alone cannot reconstruct the kmer. The **evidence file** supplies this inverse: for each MPHF slot it stores a pointer into the unitig sequence file, from which k nucleotides can be extracted.
|
||||||
|
|
||||||
Unitigs are the natural compact representation: a run of L nucleotides encodes L − k + 1 consecutive canonical kmers. The entire kmer set of a partition can be reconstructed from its unitig FASTA file.
|
Unitigs are the natural compact representation: a run of L nucleotides encodes L − k + 1 consecutive canonical kmers. The entire kmer set of a partition is reconstructible from its unitig binary file.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Two encoding strategies
|
## Binary file formats
|
||||||
|
|
||||||
### Strategy A — global nucleotide offset
|
### `unitigs.bin` — sequence chunks
|
||||||
|
|
||||||
Each MPHF slot stores a single integer: the byte offset of the kmer's first nucleotide within a packed 2-bit nucleotide array that concatenates all unitigs.
|
A sequence of binary records. Each record:
|
||||||
|
|
||||||
```
|
```
|
||||||
evidence[slot] = global_offset (bits: ⌈log₂ N_nuc⌉)
|
[u8: seql − k] [ceil(seql / 4) bytes: 2-bit packed nucleotides]
|
||||||
```
|
```
|
||||||
|
|
||||||
where `N_nuc` is the total number of nucleotides across all unitigs in the partition.
|
- `seql − k` (0–255): nucleotide length minus k, so `seql = byte[0] + k` and `n_kmers = byte[0] + 1`.
|
||||||
|
- Packed nucleotides: A=00, C=01, G=10, T=11, MSB-first within each byte; last byte zero-padded.
|
||||||
|
- Byte count for packed sequence: `ceil(seql / 4)`.
|
||||||
|
|
||||||
Decoding: read k nucleotides starting at `global_offset`.
|
Unitigs with more than `MAX_KMERS_PER_CHUNK = 256` k-mers are transparently split into overlapping chunks. Each chunk has at most 256 k-mers (= `seql − k + 1 ≤ 256`); consecutive chunks overlap by k−1 nucleotides so no kmer is lost:
|
||||||
|
|
||||||
### Strategy B — (unitig_id, rank within unitig)
|
|
||||||
|
|
||||||
Each MPHF slot stores a pair:
|
|
||||||
|
|
||||||
```
|
```
|
||||||
evidence[slot] = (unitig_id, rank)
|
chunk 1: nucleotides [0, MAX_KMERS_PER_CHUNK + k − 2] (256 kmers)
|
||||||
|
chunk 2: nucleotides [256, end] (remaining kmers)
|
||||||
|
overlap: k−1 nucleotides shared between the two chunks
|
||||||
```
|
```
|
||||||
|
|
||||||
- `unitig_id` : index of the unitig in the partition (0-based)
|
### `unitigs.bin.idx` — block-sampled offset index
|
||||||
- `rank` : kmer index within the unitig (0 ≤ rank < n_kmers); kmer i starts at nucleotide i, so the nucleotide offset is identical numerically but the kmer-unit interpretation is the natural one
|
|
||||||
|
|
||||||
Decoding: look up the unitig at `unitig_id`, then read k nucleotides starting at `rank`.
|
```
|
||||||
|
magic : 4 bytes = "UIX3"
|
||||||
|
block_bits: u32 LE — granularity parameter (0–31)
|
||||||
|
n_unitigs : u32 LE — total number of chunks in unitigs.bin
|
||||||
|
n_kmers : u64 LE — total number of kmers across all chunks
|
||||||
|
offsets : [u32 LE] — byte offsets into unitigs.bin, one per 2^block_bits chunks + sentinel
|
||||||
|
```
|
||||||
|
|
||||||
|
One offset entry is stored every `2^block_bits` chunks; the array is sentinel-terminated (last entry = file size). `DEFAULT_BLOCK_BITS = 0` stores one offset per chunk (exact table, no scan).
|
||||||
|
|
||||||
|
### `evidence.bin` — per-slot MPHF evidence
|
||||||
|
|
||||||
|
A flat array of u32 values, one per MPHF slot, no header:
|
||||||
|
|
||||||
|
```
|
||||||
|
bits [31:7] = chunk_id (25 bits)
|
||||||
|
bits [6:0] = rank (7 bits, 0–127)
|
||||||
|
```
|
||||||
|
|
||||||
|
File size = `n_slots × 4` bytes. `chunk_id` is the 0-based index of the record in `unitigs.bin`; `rank` is the position of the canonical kmer within that chunk (counting only canonical kmers). Encoding: `raw = (chunk_id << 7) | (rank & 0x7F)`. Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Bit-cost analysis
|
## Building and reading the index
|
||||||
|
|
||||||
Define for a partition of P kmers with average kmers-per-unitig m:
|
### `build_unitig_idx(path, block_bits)`
|
||||||
|
|
||||||
- total nucleotides: $N_{nuc} = P \cdot \left(1 + \dfrac{k-1}{m}\right)$
|
Scans `unitigs.bin` sequentially: for each chunk at byte offset `offset`, if `chunk_count & mask == 0` (where `mask = (1 << block_bits) − 1`), appends `offset as u32` to `block_offsets`. After the scan, appends a sentinel (= total file size), then writes the `.idx` file. Called after the unitig file is fully written and closed.
|
||||||
- number of unitigs: $U = P / m$
|
|
||||||
|
|
||||||
**Strategy A**
|
### `open()` vs `open_sequential()`
|
||||||
|
|
||||||
$$
|
`UnitigFileReader::open(path)` loads the `.idx` file into `block_offsets: Vec<u32>` and memory-maps `unitigs.bin`. Enables random access via `chunk_start(i)`, `unitig(i)`, `raw_kmer(i, j)`, and `verify_canonical_kmer(i, j, q)`.
|
||||||
b_A = \left\lceil \log_2 N_{nuc} \right\rceil = \left\lceil \log_2 P + \log_2\!\left(1 + \frac{k-1}{m}\right) \right\rceil
|
|
||||||
$$
|
|
||||||
|
|
||||||
**Strategy B**
|
`UnitigFileReader::open_sequential(path)` does not read `.idx`. It scans `unitigs.bin` once to count chunks and kmers, then leaves `block_offsets` empty. Only sequential iterators work: `iter_unitigs`, `iter_kmers`, `iter_canonical_kmers`, `iter_indexed_canonical_kmers`. Any call to `chunk_start()` panics with a diagnostic message.
|
||||||
|
|
||||||
$$
|
### `chunk_start(i)` — random access
|
||||||
b_B = \left\lceil \log_2 U \right\rceil + \left\lceil \log_2 L_{max} \right\rceil
|
|
||||||
$$
|
|
||||||
|
|
||||||
where $L_{max}$ is the maximum unitig length (in nucleotides). In practice $L_{max} \ll P$, so the rank field is much cheaper than the full global offset. If unitig lengths are bounded (e.g. by partition structure), the rank field width is a small constant independent of P.
|
```rust
|
||||||
|
fn chunk_start(&self, i: usize) -> usize {
|
||||||
### Empirical bound on unitig length
|
// block_bits=0: single table lookup, O(1) — hot path
|
||||||
|
if self.block_bits == 0 {
|
||||||
Lengths and ranks are expressed in **kmer units** (not nucleotides): the nucleotide length is `n_kmers + k − 1`, so storing `n_kmers` instead of `seq_length` saves k−1 = 30 units of headroom in the same field width.
|
return self.block_offsets[i] as usize;
|
||||||
|
}
|
||||||
Consequence for `u8` capacity:
|
// block_bits>0: lookup block, then scan at most 2^block_bits − 1 records
|
||||||
|
let block = i >> self.block_bits;
|
||||||
| unit | max representable | max nucleotides |
|
let rem = i & self.mask;
|
||||||
|---|---|---|
|
let mut offset = self.block_offsets[block] as usize;
|
||||||
| nucleotides | 255 nuc | 225 kmers |
|
for _ in 0..rem {
|
||||||
| **kmers** | **255 kmers** | **285 nuc** |
|
let seql_minus_k = self.mmap[offset] as usize;
|
||||||
|
offset += 1 + (seql_minus_k + self.k + 3) / 4;
|
||||||
**Structural maximum from superkmer construction.** For k=31 and m=11, the maximum number of consecutive kmers sharing the same minimiser is k − m + 1 = **21 kmers** (the minimiser traverses from position k−m to 0 as the window slides). A unitig that is a single full superkmer therefore has exactly 21 kmers. This is confirmed by a bimodal distribution in empirical data: a sharp peak at 21 kmers appears in all partitions, including the anomalous partition 145. The observed maximum is ~46 kmers (unitigs spanning more than one superkmer), well within u8 range.
|
}
|
||||||
|
offset
|
||||||
On *Betula nana* (k=31, 256 partitions), m_u ≈ 37.9 kmers/unitig on average. The `rank` field (kmer index within the unitig) fits in a `u8` as long as no unitig exceeds 255 kmers — guaranteed by the split strategy below and amply satisfied by empirical maximums (~46 kmers observed).
|
}
|
||||||
|
|
||||||
### Split strategy for long unitigs
|
|
||||||
|
|
||||||
For the rare cases where a unitig exceeds 255 kmers, the unitig is split into chunks of at most 255 kmers, with a **k−1 nucleotide overlap** at each junction — identical to the way super-kmers are delimited at partition boundaries. Each chunk is self-contained and independently decodable.
|
|
||||||
|
|
||||||
```
|
|
||||||
original unitig: kmer_0 … kmer_254 | kmer_255 … kmer_N
|
|
||||||
↑ cut here
|
|
||||||
|
|
||||||
chunk 1: nucleotides 0 … 284 (255 kmers)
|
|
||||||
chunk 2: nucleotides 255 … N+k-1 (N-255+1 kmers)
|
|
||||||
shared: nucleotides 255 … 284 (k-1 = 30 nucleotides, stored in both)
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Cost of one split: k−1 = 30 redundant nucleotides = 60 bits. This event is rare in practice (m_u ≈ 38 for *B. nana*, well below the 255-kmer cap). No kmer is lost: kmer i is in chunk 1 if i < 255, in chunk 2 (at rank i−255) otherwise.
|
With `block_bits = 0` (the default), every chunk has a direct entry in `block_offsets`: lookup is a single array index, O(1), with no sequential scan. The `if self.block_bits == 0` branch is explicit in the code and handles this hot path first.
|
||||||
|
|
||||||
### Savings from u8 length fields
|
With `block_bits > 0`, one offset covers `2^block_bits` consecutive chunks; access cost is O(`2^block_bits`) sequential mmap reads.
|
||||||
|
|
||||||
Because all chunks are guaranteed ≤ 255 kmers, the per-chunk length array in the binary index is a flat `u8` array — 1 byte per chunk instead of 8 bytes (usize) or 4 bytes (u32). For a partition with 4 M unitigs:
|
### Decoding a kmer from slot `s`
|
||||||
|
|
||||||
| length type | bytes/chunk | total (4 M chunks) |
|
```rust
|
||||||
|---|---|---|
|
let (chunk_id, rank) = evidence.decode(s); // u32 → (chunk_id: u32, rank: u8)
|
||||||
| usize (u64) | 8 | 32 MB |
|
let kmer = unitigs.raw_kmer(chunk_id, rank); // 2-bit packed slice → left-aligned u64
|
||||||
| u32 | 4 | 16 MB |
|
```
|
||||||
| **u8** | **1** | **4 MB** |
|
|
||||||
|
|
||||||
Random access to chunk i is recovered at load time by a single prefix-sum pass over the u8 array, computing a u32/u64 offset array in O(n_chunks) time and O(n_chunks × 4) bytes — paid once at open time, cached for the lifetime of the partition handle.
|
Two memory accesses: one 4-byte read from `evidence.bin`, one packed-bit extraction from `unitigs.bin` via the mmap. The retrieved sequence is already canonical (only canonical kmers are inserted into the De Bruijn graph).
|
||||||
|
|
||||||
Bit costs for *Betula nana* (k=31, 256 partitions, P ≈ 10.4 M, U ≈ 275 k, m_u ≈ 37.9):
|
|
||||||
|
|
||||||
| field | strategy A | strategy B |
|
|
||||||
|---|---|---|
|
|
||||||
| offset / id | $\lceil\log_2(P \cdot (1 + 30/m_u))\rceil = 25$ bits | $\lceil\log_2(U)\rceil = 19$ bits |
|
|
||||||
| rank | — | 8 bits (u8, fixed) |
|
|
||||||
| **total** | **25 bits** | **27 bits** |
|
|
||||||
|
|
||||||
Strategy A is 2 bits cheaper. Strategy B's main advantage is **locality**: decoding a kmer touches one unitig's cache lines rather than an arbitrary offset in a large flat array, and the `rank` field doubles as a direct index into the packed nucleotide sequence without pointer arithmetic.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Partition-size tradeoff
|
## Field widths and capacity
|
||||||
|
|
||||||
The total bits/kmer for the index (sequence + evidence + MPHF) as a function of partition size is:
|
| field | bits | range | capacity check (*B. nana*, 256 partitions) |
|
||||||
|
|------------|------|---------------|---------------------------------------------|
|
||||||
|
| `seql − k` | 8 | 0–255 | max `n_kmers` per chunk = 256 = `MAX_KMERS_PER_CHUNK` |
|
||||||
|
| `rank` | 7 | 0–127 | observed max ~46 kmers/chunk; structural max k−m+1 = 21 |
|
||||||
|
| `chunk_id` | 25 | 0–33 554 431 | avg U ≈ 275 k chunks/partition |
|
||||||
|
|
||||||
$$
|
The rank field is 7 bits (max 127) even though chunks can contain up to 256 k-mers, because rank counts only canonical kmers within the chunk, and the canonical kmer count is at most half the total.
|
||||||
\text{total} = \underbrace{2\!\left(1 + \frac{k-1}{m}\right)}_{\text{sequence}} + \underbrace{\log_2 P + \log_2\!\left(1+\frac{k-1}{m}\right)}_{\text{evidence}} + \underbrace{c_{MPHF}}_{\approx 2\text{–}4}
|
|
||||||
$$
|
|
||||||
|
|
||||||
### Empirical observation: m_u is set by De Bruijn graph topology, not partition count
|
|
||||||
|
|
||||||
Measured on *Betula nana* (k=31, m=11), summing n_kmers and sequence counts across all partition files:
|
|
||||||
|
|
||||||
| N partitions | m_sk | m_u | factor m_u/m_sk | nuc ratio (u/sk) |
|
|
||||||
|---|---|---|---|---|
|
|
||||||
| 1 | 12.13 | **41.89** | 3.45× | 0.273 |
|
|
||||||
| 16 | 12.13 | **38.19** | 3.15× | 0.376 |
|
|
||||||
| 256 | 12.13 | **37.90** | 3.12× | 0.388 |
|
|
||||||
| 1 024 | 12.13 | **37.89** | 3.12× | 0.389 |
|
|
||||||
|
|
||||||
- `m_sk` = avg kmers/super-kmer (invariant — same dataset regardless of partition scheme)
|
|
||||||
- `m_u` = avg kmers/unitig = total_n_kmers / total_unitigs, summed across all partitions
|
|
||||||
- `nuc ratio` = (u_symbols + 30·u_reads) / (sk_symbols + 30·sk_reads)
|
|
||||||
|
|
||||||
X-axis in both charts: partition bits (0 = 1 partition, 10 = 1024 partitions) — each step doubles the partition count.
|
|
||||||
|
|
||||||
```mermaid
|
|
||||||
xychart-beta
|
|
||||||
title "m_u (avg kmers/unitig) vs partition bits — B. nana k=31"
|
|
||||||
x-axis "partition bits" 0 --> 10
|
|
||||||
y-axis "m_u" 37 --> 43
|
|
||||||
line [41.89, 40.78, 39.22, 38.52, 38.19, 38.03, 37.96, 37.92, 37.90, 37.89, 37.89]
|
|
||||||
```
|
|
||||||
|
|
||||||
```mermaid
|
|
||||||
xychart-beta
|
|
||||||
title "Nucleotide storage: unitigs / super-kmers (%) vs partition bits — B. nana k=31"
|
|
||||||
x-axis "partition bits" 0 --> 10
|
|
||||||
y-axis "%" 25 --> 42
|
|
||||||
line [27.3, 29.7, 33.9, 36.3, 37.6, 38.3, 38.6, 38.7, 38.8, 38.9, 38.9]
|
|
||||||
```
|
|
||||||
|
|
||||||
Key observations:
|
|
||||||
|
|
||||||
1. **Partition boundaries have a small but non-zero effect on m_u.** Going from 1 to 1024 partitions reduces m_u by 10% (41.9 → 37.9). Within the practical range 16–1024, the variation is under 1% — m_u is effectively constant.
|
|
||||||
2. **m_u is a property of the De Bruijn graph, not the partition scheme.** The dominant factor is graph branching (heterozygosity, repeats, sequencing errors).
|
|
||||||
3. **Unitigs provide substantial compaction over super-kmers.** At 256 partitions, unitigs cover the same unique kmers using 39% of the raw nucleotide content of super-kmers (3.1× compaction factor).
|
|
||||||
|
|
||||||
#### Per-partition compaction ratio (sk_symbols / u_symbols)
|
|
||||||
|
|
||||||
The ratio measures how much super-kmer kmer-slots are "shared" across different super-kmer records: a ratio of 1.35 means each unique kmer (counted once in unitigs) appears in 1.35 super-kmer kmer-slots on average.
|
|
||||||
|
|
||||||
| bits | N partitions | median ratio | min ratio | min partition | min u_reads |
|
|
||||||
|---|---|---|---|---|---|
|
|
||||||
| 6 | 64 | 1.355 | 1.073 | — | 4.5 M |
|
|
||||||
| 7 | 128 | 1.352 | 1.037 | — | 4.1 M |
|
|
||||||
| 8 | 256 | **1.350** | **1.012** | **145** | **3.8 M** |
|
|
||||||
| 9 | 512 | 1.350 | 0.998 | 145 | 3.6 M |
|
|
||||||
| 10 | 1024 | 1.351 | 0.992 | 145 | 3.6 M |
|
|
||||||
|
|
||||||
The median stabilises at **1.35** from 64 partitions onward (stdev = 0.027 at 256 partitions). There is one persistent outlier: **partition 145** (at 256-partition resolution) is consistently anomalous across all partition depths — it contains 10–14× more super-kmers and unitigs than the average partition, with a ratio near 1.0, meaning the unitig representation provides almost no kmer deduplication. This is consistent with a highly repetitive or organellar region where the dominant minimiser belongs to a sequence that appears in many reads without forming long overlapping paths in the De Bruijn graph.
|
|
||||||
|
|
||||||
Per-partition parameters at 256 partitions (*B. nana*):
|
|
||||||
|
|
||||||
| quantity | value |
|
|
||||||
|---|---|
|
|
||||||
| P (unique kmers/partition, avg) | ≈ 10.4 M |
|
|
||||||
| U (unitigs/partition, avg) | ≈ 275 k |
|
|
||||||
| m_u | ≈ 37.9 |
|
|
||||||
| Strategy A bits/kmer | ⌈log₂(P·(1+30/m_u))⌉ = 25 |
|
|
||||||
| Strategy B bits/kmer | ⌈log₂(U)⌉ + 8 = 27 |
|
|
||||||
|
|
||||||
Consequence: **the partition count should be as large as memory and parallelism allow.** Each doubling saves 1 bit/kmer in evidence (log₂ P decreases by 1). The sequence term 2·(1 + 30/m_u) ≈ 3.6 bits/kmer is approximately constant.
|
|
||||||
|
|
||||||
Strategy B partially decouples evidence cost from P: `log₂(U) = log₂(P/m_u)` grows more slowly than `log₂(P)` by a fixed log₂(m_u) ≈ 5 bits. Strategy B's main benefit remains locality and bounded rank width, not asymptotic compression.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Implementation notes
|
## Evidence bit-cost
|
||||||
|
|
||||||
### Evidence file layout (strategy B — implemented)
|
Strategy B (chunk_id + rank) is the implemented strategy. For *B. nana* (k=31, 256 partitions, P ≈ 10.4 M unique kmers/partition, U ≈ 275 k chunks/partition, m_u ≈ 37.9 kmers/chunk):
|
||||||
|
|
||||||
`evidence.bin` is a flat `[u32; n]` array with no header:
|
| field | theoretical cost | value |
|
||||||
|
|------------|-------------------------|---------|
|
||||||
|
| chunk_id | ⌈log₂ U⌉ | 19 bits |
|
||||||
|
| rank | ⌈log₂ m_u⌉ (≈ fixed) | 6 bits |
|
||||||
|
| **stored** | aligned u32 | **32 bits/slot** |
|
||||||
|
|
||||||
```
|
The u32 layout is chosen for alignment and simplicity; no bit-addressing arithmetic is needed.
|
||||||
evidence.bin: n × 4 bytes, little-endian
|
|
||||||
each u32: bits [31:7] = chunk_id (25 bits)
|
|
||||||
bits [6:0] = rank (7 bits)
|
|
||||||
```
|
|
||||||
|
|
||||||
File size = `n × 4` bytes exactly. Decoding a slot: `chunk_id = raw >> 7`, `rank = raw & 0x7F`.
|
Comparison with strategy A (global nucleotide offset): `⌈log₂(P · (1 + (k−1)/m_u))⌉ = 25 bits`. Strategy A is theoretically 2 bits cheaper; strategy B's advantage is **locality** (decoding touches one chunk's cache lines) and a bounded, constant-width rank field independent of partition size.
|
||||||
|
|
||||||
The theoretical bit cost of strategy B (19 bits id + 8 bits rank = 27 bits) is not recovered: packing into aligned u32 costs 32 bits per slot. The u32 layout is chosen for simplicity and alignment — one word per slot, no bit-addressing arithmetic.
|
|
||||||
|
|
||||||
### Unitig file layout
|
|
||||||
|
|
||||||
Binary packed 2-bit nucleotide file (`unitigs.bin`) with a companion index (`unitigs.bin.idx`). The index stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
|
|
||||||
|
|
||||||
### Decoding a kmer from slot s
|
|
||||||
|
|
||||||
```
|
|
||||||
(chunk_id, rank) = evidence.decode(s) // u32 → (u25, u7)
|
|
||||||
kmer = unitigs.raw_kmer(chunk_id, rank) // 2-bit packed slice, k nucleotides
|
|
||||||
```
|
|
||||||
|
|
||||||
Two memory accesses: one into `evidence.bin`, one into `unitigs.bin`. The canonical kmer is the stored sequence (by construction — only canonical kmers are inserted into the De Bruijn graph).
|
|
||||||
|
|
||||||
### Field widths in practice
|
|
||||||
|
|
||||||
Rank is stored in 7 bits (0–127). On *B. nana* (k=31, m=11), the observed maximum unitig length is ~46 kmers/chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers per unitig; longer paths arise across multiple superkmers. u7 is sufficient.
|
|
||||||
|
|
||||||
chunk_id is stored in 25 bits (0–33 554 431). For *B. nana* at 256 partitions, avg U ≈ 275 k — well within the 25-bit capacity.
|
|
||||||
|
|
||||||
### Forward vs reverse complement
|
|
||||||
|
|
||||||
The De Bruijn graph stores only canonical kmers. The evidence encodes the canonical orientation. Callers that need the strand of the original kmer must compare the retrieved kmer with its revcomp at query time; this is a single 64-bit comparison.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Non-determinism of the unitig decomposition
|
## Unitig decomposition non-determinism
|
||||||
|
|
||||||
The unitig extraction is **not deterministic**: two runs on identical input can produce a different number of unitigs with different sequences, while covering exactly the same canonical k-mer set.
|
The unitig extraction from `GraphDeBruijn` is **not deterministic**: two runs on identical input can produce different unitig counts and sequences while covering exactly the same canonical kmer set.
|
||||||
|
|
||||||
### Source of non-determinism
|
The hash map (`hashbrown::HashMap` with `Xxh3Builder`) has run-dependent iteration order. The `start_iter` first pass emits every node where `can_extend_left` is false — this includes true dead-ends and branch points (nodes with ≥2 left neighbours). When a branch point is encountered before its upstream neighbours, it claims the downstream chain and those upstream neighbours later produce length-k degenerate unitigs. When upstream neighbours appear first, they extend through the branch point.
|
||||||
|
|
||||||
The graph nodes are stored in a hash map whose iteration order depends on the hash seed (random per run with `ahash::RandomState::new()`). The `start_iter` first pass emits every node whose `can_extend_left` flag is false — which includes not only true dead-end nodes but also **branch points** (nodes with 2 or more left neighbours, for which `unique_neighbor` returns `None`).
|
|
||||||
|
|
||||||
When a branch point is encountered before its upstream neighbours, it claims the downstream chain and those neighbours later produce length-k degenerate unitigs. When upstream neighbours are encountered first, they extend through the branch point and consume it.
|
|
||||||
|
|
||||||
**Example** — fork topology (k = 31):
|
**Example** — fork topology (k = 31):
|
||||||
|
|
||||||
@@ -248,23 +144,37 @@ A → B ← C
|
|||||||
D
|
D
|
||||||
```
|
```
|
||||||
|
|
||||||
All four nodes are in the graph. B has two left neighbours (A and C), so `can_extend_left = false`; B also has one right neighbour D, so `can_extend_right = true`.
|
B has two left neighbours, so `can_extend_left = false`. Two valid tilings:
|
||||||
|
|
||||||
| iteration order | unitigs produced | count |
|
| iteration order | unitigs | count |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| A first, then B, C | ABD · C | 2 |
|
| A first | ABD, C | 2 |
|
||||||
| B first, then A, C | BD · A · C | 3 |
|
| B first | BD, A, C | 3 |
|
||||||
|
|
||||||
Both tilings cover the same 4 canonical k-mers.
|
Both cover the same 4 canonical kmers. Pure cycles are unaffected: all cycle nodes have both extensions present, so none are emitted in the first pass; each cycle produces exactly one unitig regardless of entry point (only the cut point varies).
|
||||||
|
|
||||||
Pure cycles (all nodes have both extensions present) are unaffected by this: they are never emitted in the first pass and each cycle produces exactly one unitig regardless of which node the second pass starts from. Only the cycle cut point (and therefore the sequence content) varies.
|
This non-determinism is benign for MPHF construction: the MPHF is built from the kmer set, which is identical across tilings.
|
||||||
|
|
||||||
### Consequence for MPHF construction
|
---
|
||||||
|
|
||||||
The MPHF is built from the **k-mer set**, not from the unitig sequences themselves. Because both tilings contain the same canonical k-mers, the resulting MPHF is identical. The non-determinism is benign for this use case.
|
## Partition-size tradeoff
|
||||||
|
|
||||||
|
Measured on *B. nana* (k=31, m=11), summing across all partitions:
|
||||||
|
|
||||||
|
| N partitions | m_u |
|
||||||
|
|---|---|
|
||||||
|
| 1 | 41.89 |
|
||||||
|
| 16 | 38.19 |
|
||||||
|
| 256 | 37.90 |
|
||||||
|
| 1 024 | 37.89 |
|
||||||
|
|
||||||
|
`m_u` is set by De Bruijn graph topology (heterozygosity, repeats, sequencing errors), not partition count. The variation from 1 to 1024 partitions is under 10%; within 16–1024 it is under 1%. Unitigs provide ~3.1× nucleotide compaction over super-kmers at 256 partitions.
|
||||||
|
|
||||||
|
Evidence cost decreases by 1 bit/kmer with each doubling of partition count (via `log₂ U = log₂(P/m_u)`). The sequence storage term `2 · (1 + (k−1)/m_u) ≈ 3.6 bits/kmer` is approximately constant.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Open questions
|
## Open questions
|
||||||
|
|
||||||
- **Cross-partition evidence**: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m_u.
|
- **Cross-partition set operations**: strategy B allows unitig-level operations (mark entire chunks present/absent) rather than kmer-level, reducing cost by a factor of m_u.
|
||||||
|
- **Eliminating evidence.bin**: at ~66% of per-layer lookup footprint, `evidence.bin` dominates index size. See [Evidence elimination design discussion](evidence_elimination.md).
|
||||||
|
|||||||
@@ -0,0 +1,18 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: implementation/unitig_evidence.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obiskio/src/unitig_index.rs` — format unitigs.bin + unitigs.bin.idx, UnitigFileWriter, UnitigFileReader, build_unitig_idx(), DEFAULT_BLOCK_BITS=0, chemin chaud block_bits=0 dans chunk_start()
|
||||||
|
- `obilayeredmap/src/evidence.rs` — encodage Evidence (chunk_id 25 bits | rank 7 bits), EvidenceWriter
|
||||||
|
- `obidebruinj/src/debruijn.rs` — extraction unitigs, chunking à MAX_KMERS_PER_CHUNK
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
FORT RISQUE DE DÉRIVE. Changements récents :
|
||||||
|
- `DEFAULT_BLOCK_BITS` est passé de 6 à 0 (accès O(1) par défaut)
|
||||||
|
- `block_bits` est maintenant un paramètre runtime de `build_unitig_idx()` et `UnitigFileWriter`
|
||||||
|
- `chunk_start()` a un chemin chaud explicite pour block_bits=0 (accès tableau direct, 0 scan)
|
||||||
|
- `open()` vs `open_sequential()` : distinction nouvelle, importante pour la compréhension du coût
|
||||||
|
- `iter_unitigs()` ajouté comme alias public de `iter_chunks_sequential()`
|
||||||
|
Mettre à jour la description du format .idx et le modèle de coût d'accès aléatoire.
|
||||||
@@ -0,0 +1,13 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: index.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obikmer/src/main.rs` — point d'entrée CLI, contraintes globales
|
||||||
|
- `obikmer/src/cli.rs` — structure des arguments communs
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
Document de niveau projet (vue d'ensemble, motivations, contraintes fondamentales).
|
||||||
|
Pas de code Rust spécifique à vérifier au-delà des contraintes générales (k impair, formats d'entrée).
|
||||||
|
À mettre à jour si de nouvelles sous-commandes ou formats sont ajoutés.
|
||||||
@@ -0,0 +1,13 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: kmers.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obikseq/src/kmer.rs` — type Kmer, propriétés, forme canonique
|
||||||
|
- `obikseq/src/superkmer.rs` — type SuperKmer, longueur attendue
|
||||||
|
- `obiskbuilder/src/lib.rs` — extraction de superkmers par minimiseur
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
Chevauche `theory/encoding.md` (encodage 2 bits) et `theory/minimizer.md` (choix du minimiseur).
|
||||||
|
Vérifier que la définition de SuperKmer est cohérente avec les invariants actuels de `obikseq`.
|
||||||
@@ -0,0 +1,11 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: theory/encoding.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obikseq/src/kmer.rs` — encodage 2 bits/base, revcomp, forme canonique
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
Document purement théorique. Peu de risque de dérive sauf si l'encodage interne de Kmer change.
|
||||||
|
Vérifier que la table d'encodage A=00, C=01, G=10, T=11 est toujours celle du code.
|
||||||
@@ -0,0 +1,12 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: theory/entropy.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obiskbuilder/src/entropy_table.rs` — filtre Shannon sur les kmers à basse complexité
|
||||||
|
- `obiskbuilder/src/lib.rs` — application du filtre lors du scatter (phase 1)
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
Document théorique stable. Vérifier que les paramètres `theta` et `level_max` dans le CLI
|
||||||
|
(`obikmer/src/cli.rs` → `CommonArgs`) correspondent bien à ce qui est décrit.
|
||||||
@@ -0,0 +1,12 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: theory/indexing.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obikpartitionner/src/partition.rs` — routage par hash de minimiseur, choix des paramètres
|
||||||
|
- `obikpartitionner/src/lib.rs` — structure KmerPartition, nombre de partitions
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
Vérifier que la doc mentionne bien que le nombre de partitions est une puissance de 2
|
||||||
|
(converti par `partitions_to_bits` dans `obikmer/src/cli.rs`).
|
||||||
@@ -0,0 +1,12 @@
|
|||||||
|
<!-- coverage sidecar — ne pas ajouter au nav mkdocs -->
|
||||||
|
# Coverage: theory/minimizer.md
|
||||||
|
|
||||||
|
## Code couvert
|
||||||
|
|
||||||
|
- `obiskbuilder/src/lib.rs` — sélection du minimiseur par hash seedé (splitmix64 finalizer)
|
||||||
|
- `obikseq/src/superkmer.rs` — forme canonique du minimiseur, fenêtre glissante
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
Vérifier que la fonction de hash décrite (splitmix64 finalizer avec graine) correspond
|
||||||
|
au code actuel. Vérifier aussi que la définition de « minimiseur canonique » est toujours cohérente.
|
||||||
@@ -44,6 +44,7 @@ nav:
|
|||||||
- On-disk storage: implementation/storage.md
|
- On-disk storage: implementation/storage.md
|
||||||
- MPHF selection: implementation/mphf.md
|
- MPHF selection: implementation/mphf.md
|
||||||
- Unitig evidence encoding: implementation/unitig_evidence.md
|
- Unitig evidence encoding: implementation/unitig_evidence.md
|
||||||
|
- Evidence elimination (discussion): implementation/evidence_elimination.md
|
||||||
- obilayeredmap crate: implementation/obilayeredmap.md
|
- obilayeredmap crate: implementation/obilayeredmap.md
|
||||||
- PersistentCompactIntVec: implementation/persistent_compact_int_vec.md
|
- PersistentCompactIntVec: implementation/persistent_compact_int_vec.md
|
||||||
- PersistentBitVec: implementation/persistent_bit_vec.md
|
- PersistentBitVec: implementation/persistent_bit_vec.md
|
||||||
|
|||||||
Generated
+2
@@ -1529,6 +1529,7 @@ dependencies = [
|
|||||||
"obikpartitionner",
|
"obikpartitionner",
|
||||||
"obikrope",
|
"obikrope",
|
||||||
"obikseq",
|
"obikseq",
|
||||||
|
"obilayeredmap",
|
||||||
"obipipeline",
|
"obipipeline",
|
||||||
"obiread",
|
"obiread",
|
||||||
"obiskbuilder",
|
"obiskbuilder",
|
||||||
@@ -1592,6 +1593,7 @@ dependencies = [
|
|||||||
name = "obilayeredmap"
|
name = "obilayeredmap"
|
||||||
version = "0.1.0"
|
version = "0.1.0"
|
||||||
dependencies = [
|
dependencies = [
|
||||||
|
"bitvec",
|
||||||
"cacheline-ef",
|
"cacheline-ef",
|
||||||
"epserde",
|
"epserde",
|
||||||
"memmap2",
|
"memmap2",
|
||||||
|
|||||||
@@ -18,6 +18,8 @@ pub enum OKIError {
|
|||||||
DuplicateGenomeLabel(String),
|
DuplicateGenomeLabel(String),
|
||||||
/// Operation not valid for this index configuration.
|
/// Operation not valid for this index configuration.
|
||||||
InvalidInput(String),
|
InvalidInput(String),
|
||||||
|
/// Sources mix exact and approximate evidence, or use incompatible approx parameters.
|
||||||
|
IncompatibleEvidence(String),
|
||||||
}
|
}
|
||||||
|
|
||||||
pub type OKIResult<T> = Result<T, OKIError>;
|
pub type OKIResult<T> = Result<T, OKIError>;
|
||||||
@@ -32,7 +34,8 @@ impl fmt::Display for OKIError {
|
|||||||
OKIError::IncompatibleConfig => write!(f, "incompatible index configurations"),
|
OKIError::IncompatibleConfig => write!(f, "incompatible index configurations"),
|
||||||
OKIError::MismatchedMode => write!(f, "count mode requires all sources to have with_counts=true"),
|
OKIError::MismatchedMode => write!(f, "count mode requires all sources to have with_counts=true"),
|
||||||
OKIError::DuplicateGenomeLabel(l) => write!(f, "duplicate genome label across sources: {l}"),
|
OKIError::DuplicateGenomeLabel(l) => write!(f, "duplicate genome label across sources: {l}"),
|
||||||
OKIError::InvalidInput(m) => write!(f, "invalid input: {m}"),
|
OKIError::InvalidInput(m) => write!(f, "invalid input: {m}"),
|
||||||
|
OKIError::IncompatibleEvidence(m) => write!(f, "incompatible evidence: {m}"),
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -144,6 +144,8 @@ impl KmerIndex {
|
|||||||
let n = self.partition.n_partitions();
|
let n = self.partition.n_partitions();
|
||||||
let t = Stage::start("index");
|
let t = Stage::start("index");
|
||||||
let with_counts = self.meta.config.with_counts;
|
let with_counts = self.meta.config.with_counts;
|
||||||
|
let evidence = self.meta.config.evidence.clone();
|
||||||
|
let block_bits = self.meta.config.block_bits;
|
||||||
let total_kmers = AtomicUsize::new(0);
|
let total_kmers = AtomicUsize::new(0);
|
||||||
|
|
||||||
let pb = Arc::new(Mutex::new(
|
let pb = Arc::new(Mutex::new(
|
||||||
@@ -153,7 +155,7 @@ impl KmerIndex {
|
|||||||
));
|
));
|
||||||
|
|
||||||
(0..n).into_par_iter().for_each(|i| {
|
(0..n).into_par_iter().for_each(|i| {
|
||||||
match self.partition.build_index_layer(i, min_ab, max_ab, with_counts) {
|
match self.partition.build_index_layer(i, min_ab, max_ab, with_counts, &evidence, block_bits) {
|
||||||
Ok(0) => {}
|
Ok(0) => {}
|
||||||
Ok(n_kmers) => {
|
Ok(n_kmers) => {
|
||||||
total_kmers.fetch_add(n_kmers, Ordering::Relaxed);
|
total_kmers.fetch_add(n_kmers, Ordering::Relaxed);
|
||||||
|
|||||||
@@ -6,6 +6,7 @@ mod dump;
|
|||||||
mod index;
|
mod index;
|
||||||
mod merge;
|
mod merge;
|
||||||
mod rebuild;
|
mod rebuild;
|
||||||
|
mod reindex;
|
||||||
|
|
||||||
pub use error::{OKIError, OKIResult};
|
pub use error::{OKIError, OKIResult};
|
||||||
pub use distance::{DistanceMetric, DistanceOutput};
|
pub use distance::{DistanceMetric, DistanceOutput};
|
||||||
|
|||||||
@@ -9,6 +9,8 @@ use obisys::{Reporter, Stage};
|
|||||||
use rayon::prelude::*;
|
use rayon::prelude::*;
|
||||||
use tracing::info;
|
use tracing::info;
|
||||||
|
|
||||||
|
use obilayeredmap::EvidenceKind;
|
||||||
|
|
||||||
use crate::error::{OKIError, OKIResult};
|
use crate::error::{OKIError, OKIResult};
|
||||||
use crate::index::KmerIndex;
|
use crate::index::KmerIndex;
|
||||||
use crate::meta::{GenomeInfo, IndexMeta};
|
use crate::meta::{GenomeInfo, IndexMeta};
|
||||||
@@ -61,6 +63,9 @@ impl KmerIndex {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ── Validate evidence compatibility ───────────────────────────────────
|
||||||
|
let evidence = validate_evidence_compat(sources)?;
|
||||||
|
|
||||||
// ── Compute final genome labels (rename duplicates if requested) ───────
|
// ── Compute final genome labels (rename duplicates if requested) ───────
|
||||||
let (source_labels, all_genomes) = compute_labels(sources, rename_duplicates)?;
|
let (source_labels, all_genomes) = compute_labels(sources, rename_duplicates)?;
|
||||||
|
|
||||||
@@ -91,6 +96,7 @@ impl KmerIndex {
|
|||||||
let mut meta = IndexMeta::read(output).map_err(OKIError::Io)?;
|
let mut meta = IndexMeta::read(output).map_err(OKIError::Io)?;
|
||||||
meta.genomes = all_genomes;
|
meta.genomes = all_genomes;
|
||||||
meta.config.with_counts = mode == MergeMode::Count;
|
meta.config.with_counts = mode == MergeMode::Count;
|
||||||
|
meta.config.evidence = evidence;
|
||||||
meta.write(output)?;
|
meta.write(output)?;
|
||||||
|
|
||||||
// In presence/absence mode, purge counts/ directories inherited from
|
// In presence/absence mode, purge counts/ directories inherited from
|
||||||
@@ -134,13 +140,14 @@ impl KmerIndex {
|
|||||||
let pb = partition_bar(n_partitions as u64);
|
let pb = partition_bar(n_partitions as u64);
|
||||||
|
|
||||||
let dst_partition = &dst.partition;
|
let dst_partition = &dst.partition;
|
||||||
|
let block_bits = dst.meta.config.block_bits;
|
||||||
|
|
||||||
let errors: Vec<obiskio::SKError> = (0..n_partitions)
|
let errors: Vec<obiskio::SKError> = (0..n_partitions)
|
||||||
.into_par_iter()
|
.into_par_iter()
|
||||||
.filter_map(|i| {
|
.filter_map(|i| {
|
||||||
let srcs: Vec<(&obikpartitionner::KmerPartition, usize)> =
|
let srcs: Vec<(&obikpartitionner::KmerPartition, usize)> =
|
||||||
remaining_sources.iter().map(|s| (&s.partition, s.meta.genomes.len())).collect();
|
remaining_sources.iter().map(|s| (&s.partition, s.meta.genomes.len())).collect();
|
||||||
let result = dst_partition.merge_partition(i, &srcs, mode, n_dst_genomes).err();
|
let result = dst_partition.merge_partition(i, &srcs, mode, n_dst_genomes, block_bits).err();
|
||||||
pb.inc(1);
|
pb.inc(1);
|
||||||
result
|
result
|
||||||
})
|
})
|
||||||
@@ -258,6 +265,34 @@ fn partition_bar(n: u64) -> ProgressBar {
|
|||||||
pb
|
pb
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Check that all sources share the same evidence kind.
|
||||||
|
///
|
||||||
|
/// Rules:
|
||||||
|
/// - all `Exact` → OK, returns `Exact`
|
||||||
|
/// - all `Approx { b, z }` same params → OK, returns `Approx { b, z }`
|
||||||
|
/// - mixed exact/approx or different approx params → `IncompatibleEvidence`
|
||||||
|
fn validate_evidence_compat(sources: &[&KmerIndex]) -> OKIResult<EvidenceKind> {
|
||||||
|
let ref_ev = &sources[0].meta.config.evidence;
|
||||||
|
for src in &sources[1..] {
|
||||||
|
let ev = &src.meta.config.evidence;
|
||||||
|
let compat = match (ref_ev, ev) {
|
||||||
|
(EvidenceKind::Exact, EvidenceKind::Exact) => true,
|
||||||
|
(EvidenceKind::Approx { b: b1, z: z1 },
|
||||||
|
EvidenceKind::Approx { b: b2, z: z2 }) => b1 == b2 && z1 == z2,
|
||||||
|
_ => false,
|
||||||
|
};
|
||||||
|
if !compat {
|
||||||
|
return Err(OKIError::IncompatibleEvidence(format!(
|
||||||
|
"source {:?} has evidence {:?}, expected {:?} — \
|
||||||
|
convert all sources to the same evidence kind first \
|
||||||
|
(use the `reindex` command)",
|
||||||
|
src.root_path.display(), ev, ref_ev,
|
||||||
|
)));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
Ok(ref_ev.clone())
|
||||||
|
}
|
||||||
|
|
||||||
fn copy_dir_all(src: &Path, dst: &Path) -> io::Result<()> {
|
fn copy_dir_all(src: &Path, dst: &Path) -> io::Result<()> {
|
||||||
fs::create_dir_all(dst)?;
|
fs::create_dir_all(dst)?;
|
||||||
for entry in fs::read_dir(src)? {
|
for entry in fs::read_dir(src)? {
|
||||||
|
|||||||
@@ -3,6 +3,7 @@ use std::fs;
|
|||||||
use std::io;
|
use std::io;
|
||||||
use std::path::Path;
|
use std::path::Path;
|
||||||
|
|
||||||
|
use obilayeredmap::EvidenceKind;
|
||||||
use serde::{Deserialize, Serialize};
|
use serde::{Deserialize, Serialize};
|
||||||
|
|
||||||
pub const META_FILENAME: &str = "index.meta";
|
pub const META_FILENAME: &str = "index.meta";
|
||||||
@@ -28,6 +29,13 @@ pub struct IndexConfig {
|
|||||||
pub minimizer_size: usize,
|
pub minimizer_size: usize,
|
||||||
pub n_bits: usize,
|
pub n_bits: usize,
|
||||||
pub with_counts: bool,
|
pub with_counts: bool,
|
||||||
|
#[serde(default)]
|
||||||
|
pub evidence: EvidenceKind,
|
||||||
|
/// Block size for the unitig index as a power-of-two exponent.
|
||||||
|
/// The `.idx` block covers 2^block_bits consecutive unitigs.
|
||||||
|
/// 0 = one entry per unitig (O(1) access, largest `.idx`).
|
||||||
|
#[serde(default)]
|
||||||
|
pub block_bits: u8,
|
||||||
}
|
}
|
||||||
|
|
||||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||||
|
|||||||
@@ -88,12 +88,13 @@ impl KmerIndex {
|
|||||||
pb.enable_steady_tick(Duration::from_millis(100));
|
pb.enable_steady_tick(Duration::from_millis(100));
|
||||||
|
|
||||||
let src_partition = &src.partition;
|
let src_partition = &src.partition;
|
||||||
|
let block_bits = meta.config.block_bits;
|
||||||
|
|
||||||
let errors: Vec<obiskio::SKError> = (0..n_partitions)
|
let errors: Vec<obiskio::SKError> = (0..n_partitions)
|
||||||
.into_par_iter()
|
.into_par_iter()
|
||||||
.filter_map(|i| {
|
.filter_map(|i| {
|
||||||
let result = dst_partition
|
let result = dst_partition
|
||||||
.rebuild_partition(src_partition, i, filters, mode, n_genomes)
|
.rebuild_partition(src_partition, i, filters, mode, n_genomes, block_bits)
|
||||||
.err();
|
.err();
|
||||||
pb.inc(1);
|
pb.inc(1);
|
||||||
result
|
result
|
||||||
|
|||||||
@@ -0,0 +1,126 @@
|
|||||||
|
use std::fs;
|
||||||
|
use std::path::Path;
|
||||||
|
use std::time::Duration;
|
||||||
|
|
||||||
|
use indicatif::{ProgressBar, ProgressStyle};
|
||||||
|
use obilayeredmap::{EvidenceKind, layer::Layer};
|
||||||
|
use obilayeredmap::meta::PartitionMeta;
|
||||||
|
use obisys::{Reporter, Stage};
|
||||||
|
use rayon::prelude::*;
|
||||||
|
use tracing::info;
|
||||||
|
|
||||||
|
use crate::error::{OKIError, OKIResult};
|
||||||
|
use crate::index::KmerIndex;
|
||||||
|
use crate::state::IndexState;
|
||||||
|
|
||||||
|
const EVIDENCE_FILE: &str = "evidence.bin";
|
||||||
|
const FINGERPRINT_FILE: &str = "fingerprint.bin";
|
||||||
|
const UNITIG_IDX_FILE: &str = "unitigs.bin.idx";
|
||||||
|
|
||||||
|
fn olm_to_oki(e: obilayeredmap::OLMError) -> OKIError {
|
||||||
|
OKIError::InvalidInput(e.to_string())
|
||||||
|
}
|
||||||
|
|
||||||
|
impl KmerIndex {
|
||||||
|
/// Convert every layer's evidence bundle to `target` in-place.
|
||||||
|
///
|
||||||
|
/// - `Exact` → builds `evidence.bin` + `unitigs.bin.idx`, removes `fingerprint.bin`
|
||||||
|
/// - `Approx` → builds `fingerprint.bin`, removes `evidence.bin` + `unitigs.bin.idx`
|
||||||
|
///
|
||||||
|
/// The MPHF (`mphf.bin`) and unitigs (`unitigs.bin`) are never touched.
|
||||||
|
/// `index.meta` is updated with the new evidence kind on success.
|
||||||
|
pub fn reindex(
|
||||||
|
&mut self,
|
||||||
|
target: EvidenceKind,
|
||||||
|
block_bits: u8,
|
||||||
|
rep: &mut Reporter,
|
||||||
|
) -> OKIResult<()> {
|
||||||
|
if self.state() != IndexState::Indexed {
|
||||||
|
return Err(OKIError::NotIndexed(self.root_path.clone()));
|
||||||
|
}
|
||||||
|
|
||||||
|
let n = self.partition.n_partitions();
|
||||||
|
info!(
|
||||||
|
"reindex {} partition(s): {:?} → {:?}",
|
||||||
|
n, self.meta.config.evidence, target,
|
||||||
|
);
|
||||||
|
|
||||||
|
let t = Stage::start("reindex");
|
||||||
|
let pb = ProgressBar::new(n as u64).with_style(
|
||||||
|
ProgressStyle::with_template(
|
||||||
|
"reindex — [{bar:20}] {pos}/{len} | {msg}",
|
||||||
|
)
|
||||||
|
.unwrap()
|
||||||
|
.tick_strings(&["⠋","⠙","⠹","⠸","⠼","⠴","⠦","⠧","⠇","⠏"]),
|
||||||
|
);
|
||||||
|
pb.enable_steady_tick(Duration::from_millis(80));
|
||||||
|
|
||||||
|
let errors: Vec<String> = (0..n)
|
||||||
|
.into_par_iter()
|
||||||
|
.filter_map(|i| {
|
||||||
|
let res = reindex_partition(
|
||||||
|
&self.partition.part_dir(i).join("index"),
|
||||||
|
&target,
|
||||||
|
block_bits,
|
||||||
|
);
|
||||||
|
pb.inc(1);
|
||||||
|
res.err().map(|e| format!("partition {i}: {e}"))
|
||||||
|
})
|
||||||
|
.collect();
|
||||||
|
|
||||||
|
pb.finish_and_clear();
|
||||||
|
|
||||||
|
if let Some(e) = errors.into_iter().next() {
|
||||||
|
return Err(OKIError::InvalidInput(e));
|
||||||
|
}
|
||||||
|
|
||||||
|
self.meta.config.evidence = target;
|
||||||
|
if matches!(self.meta.config.evidence, EvidenceKind::Exact) {
|
||||||
|
self.meta.config.block_bits = block_bits;
|
||||||
|
}
|
||||||
|
self.meta.write(&self.root_path)?;
|
||||||
|
rep.push(t.stop());
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Process all layers of one partition's index directory.
|
||||||
|
fn reindex_partition(index_dir: &Path, target: &EvidenceKind, block_bits: u8) -> OKIResult<()> {
|
||||||
|
if !index_dir.exists() {
|
||||||
|
return Ok(());
|
||||||
|
}
|
||||||
|
let pm = PartitionMeta::load(index_dir).map_err(olm_to_oki)?;
|
||||||
|
for layer_idx in 0..pm.n_layers {
|
||||||
|
let layer_dir = index_dir.join(format!("layer_{layer_idx}"));
|
||||||
|
reindex_layer(&layer_dir, target, block_bits)?;
|
||||||
|
}
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
|
||||||
|
fn reindex_layer(layer_dir: &Path, target: &EvidenceKind, block_bits: u8) -> OKIResult<()> {
|
||||||
|
Layer::<()>::build_evidence(layer_dir, target, block_bits).map_err(olm_to_oki)?;
|
||||||
|
remove_stale_evidence(layer_dir, target)
|
||||||
|
}
|
||||||
|
|
||||||
|
fn remove_stale_evidence(layer_dir: &Path, target: &EvidenceKind) -> OKIResult<()> {
|
||||||
|
match target {
|
||||||
|
EvidenceKind::Exact => {
|
||||||
|
// fingerprint.bin is no longer valid
|
||||||
|
remove_if_exists(&layer_dir.join(FINGERPRINT_FILE));
|
||||||
|
}
|
||||||
|
EvidenceKind::Approx { .. } => {
|
||||||
|
// exact bundle is no longer valid
|
||||||
|
remove_if_exists(&layer_dir.join(EVIDENCE_FILE));
|
||||||
|
remove_if_exists(&layer_dir.join(UNITIG_IDX_FILE));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
|
||||||
|
fn remove_if_exists(path: &Path) {
|
||||||
|
if let Err(e) = fs::remove_file(path) {
|
||||||
|
if e.kind() != std::io::ErrorKind::NotFound {
|
||||||
|
eprintln!("warning: could not remove {}: {e}", path.display());
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -18,6 +18,7 @@ obikpartitionner = { path = "../obikpartitionner" }
|
|||||||
obisys = { path = "../obisys" }
|
obisys = { path = "../obisys" }
|
||||||
obiskio = { path = "../obiskio" }
|
obiskio = { path = "../obiskio" }
|
||||||
obikindex = { path = "../obikindex" }
|
obikindex = { path = "../obikindex" }
|
||||||
|
obilayeredmap = { path = "../obilayeredmap" }
|
||||||
clap = { version = "4", features = ["derive"] }
|
clap = { version = "4", features = ["derive"] }
|
||||||
serde_json = "1"
|
serde_json = "1"
|
||||||
csv = "1"
|
csv = "1"
|
||||||
|
|||||||
@@ -55,6 +55,12 @@ pub fn partitions_to_bits(n: usize) -> usize {
|
|||||||
n.max(1).next_power_of_two().trailing_zeros() as usize
|
n.max(1).next_power_of_two().trailing_zeros() as usize
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Convert a block size (number of unitigs per block) to its `block_bits` exponent.
|
||||||
|
/// `block_size=1` → `block_bits=0` (one entry per unitig, O(1) random access).
|
||||||
|
pub fn block_size_to_bits(n: usize) -> u8 {
|
||||||
|
n.max(1).next_power_of_two().trailing_zeros() as u8
|
||||||
|
}
|
||||||
|
|
||||||
impl CommonArgs {
|
impl CommonArgs {
|
||||||
pub fn seqfile_paths(&self) -> obiread::PathIter {
|
pub fn seqfile_paths(&self) -> obiread::PathIter {
|
||||||
let paths = self.inputs.iter().map(PathBuf::from).collect();
|
let paths = self.inputs.iter().map(PathBuf::from).collect();
|
||||||
|
|||||||
@@ -0,0 +1,38 @@
|
|||||||
|
use clap::Args;
|
||||||
|
|
||||||
|
use super::index::resolve_approx_params;
|
||||||
|
|
||||||
|
#[derive(Args)]
|
||||||
|
pub struct EstimateArgs {
|
||||||
|
/// k-mer size used for querying (same as --kmer-size in index)
|
||||||
|
#[arg(short = 'k', long, default_value_t = 31)]
|
||||||
|
pub kmer_size: usize,
|
||||||
|
|
||||||
|
/// Findere z parameter: number of consecutive k-mers that must all match.
|
||||||
|
/// Effective indexed k-mer size is kmer_size - z + 1.
|
||||||
|
#[arg(short = 'z', long, default_value = None)]
|
||||||
|
pub findere_z: Option<u8>,
|
||||||
|
|
||||||
|
/// Fingerprint bits per slot (b). FP per z-window = 1/2^(b·z).
|
||||||
|
#[arg(long, default_value = None)]
|
||||||
|
pub evidence_bits: Option<u8>,
|
||||||
|
|
||||||
|
/// Target false-positive rate per z-window (e.g. 0.01).
|
||||||
|
#[arg(long, default_value = None)]
|
||||||
|
pub fp: Option<f64>,
|
||||||
|
}
|
||||||
|
|
||||||
|
pub fn run(args: EstimateArgs) {
|
||||||
|
let (z, b, fp_window) = resolve_approx_params(args.findere_z, args.evidence_bits, args.fp);
|
||||||
|
|
||||||
|
let k_query = args.kmer_size;
|
||||||
|
let k_index = k_query.saturating_sub(z as usize - 1);
|
||||||
|
let fp_kmer = 1.0_f64 / 2_f64.powi(b as i32);
|
||||||
|
|
||||||
|
println!("{:<22} {}", "k (query):", k_query);
|
||||||
|
println!("{:<22} {}", "k (indexed):", k_index);
|
||||||
|
println!("{:<22} {}", "z:", z);
|
||||||
|
println!("{:<22} {}", "evidence bits (b):", b);
|
||||||
|
println!("{:<22} {:.3e} (1/2^{})", "FP per k-mer:", fp_kmer, b);
|
||||||
|
println!("{:<22} {:.3e} (1/2^{})", "FP per z-window:", fp_window, b as u32 * z as u32);
|
||||||
|
}
|
||||||
@@ -2,6 +2,7 @@ use std::path::PathBuf;
|
|||||||
|
|
||||||
use clap::Args;
|
use clap::Args;
|
||||||
use obikindex::{GenomeInfo, IndexConfig, IndexState, KmerIndex};
|
use obikindex::{GenomeInfo, IndexConfig, IndexState, KmerIndex};
|
||||||
|
use obilayeredmap::EvidenceKind;
|
||||||
|
|
||||||
fn parse_key_value(s: &str) -> Result<(String, String), String> {
|
fn parse_key_value(s: &str) -> Result<(String, String), String> {
|
||||||
let pos = s.find('=').ok_or_else(|| format!("invalid key=value: no '=' in '{s}'"))?;
|
let pos = s.find('=').ok_or_else(|| format!("invalid key=value: no '=' in '{s}'"))?;
|
||||||
@@ -11,7 +12,7 @@ use obikseq::{set_k, set_m};
|
|||||||
use obisys::Reporter;
|
use obisys::Reporter;
|
||||||
use tracing::info;
|
use tracing::info;
|
||||||
|
|
||||||
use crate::cli::{CommonArgs, partitions_to_bits};
|
use crate::cli::{CommonArgs, block_size_to_bits, partitions_to_bits};
|
||||||
use crate::steps::scatter;
|
use crate::steps::scatter;
|
||||||
|
|
||||||
#[derive(Args)]
|
#[derive(Args)]
|
||||||
@@ -48,14 +49,121 @@ pub struct IndexArgs {
|
|||||||
#[arg(long, default_value_t = false)]
|
#[arg(long, default_value_t = false)]
|
||||||
pub keep_intermediate: bool,
|
pub keep_intermediate: bool,
|
||||||
|
|
||||||
|
/// Use approximate (fingerprint-based) evidence instead of exact evidence.
|
||||||
|
/// False-positive rate per z-window: 1/2^(b·z).
|
||||||
|
#[arg(long, default_value_t = false)]
|
||||||
|
pub approx: bool,
|
||||||
|
|
||||||
|
/// Findere z parameter: number of consecutive k-mers that must all match.
|
||||||
|
/// Effective indexed k-mer size is kmer_size - z + 1.
|
||||||
|
#[arg(short = 'z', long, default_value = None)]
|
||||||
|
pub findere_z: Option<u8>,
|
||||||
|
|
||||||
|
/// Fingerprint bits per slot (b). FP per z-window = 1/2^(b·z).
|
||||||
|
#[arg(long, default_value = None)]
|
||||||
|
pub evidence_bits: Option<u8>,
|
||||||
|
|
||||||
|
/// Target false-positive rate per z-window (e.g. 0.01).
|
||||||
|
/// Used to derive missing b or z.
|
||||||
|
#[arg(long, default_value = None)]
|
||||||
|
pub fp: Option<f64>,
|
||||||
|
|
||||||
|
/// Block size for exact evidence `.idx` (number of unitigs per block).
|
||||||
|
/// Must be a power of two; rounded up if not. Default 1 = O(1) random access.
|
||||||
|
#[arg(long, default_value_t = 1)]
|
||||||
|
pub block_size: usize,
|
||||||
|
|
||||||
#[command(flatten)]
|
#[command(flatten)]
|
||||||
pub common: CommonArgs,
|
pub common: CommonArgs,
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Resolve the (z, b, fp) triplet from the user-supplied subset.
|
||||||
|
///
|
||||||
|
/// Model: FP = 1/2^(b·z) ⟹ b·z = ⌈-log₂(fp)⌉
|
||||||
|
///
|
||||||
|
/// Rules when one value is missing (conservative = ceiling):
|
||||||
|
/// given z, b → fp = 1/2^(b·z)
|
||||||
|
/// given z, fp → b = ⌈-log₂(fp) / z⌉
|
||||||
|
/// given b, fp → z = ⌈-log₂(fp) / b⌉
|
||||||
|
/// given z only → b = 8 (default), fp derived
|
||||||
|
/// given b only → z = 1 (default), fp derived
|
||||||
|
/// given fp only → b = 8 (default), z derived
|
||||||
|
/// none given → z = 1, b = 8, fp = 1/256
|
||||||
|
pub(crate) fn resolve_approx_params(
|
||||||
|
z_opt: Option<u8>,
|
||||||
|
b_opt: Option<u8>,
|
||||||
|
fp_opt: Option<f64>,
|
||||||
|
) -> (u8, u8, f64) {
|
||||||
|
const DEFAULT_B: u8 = 8;
|
||||||
|
const DEFAULT_Z: u8 = 1;
|
||||||
|
|
||||||
|
let bits_needed = |fp: f64| -> u8 {
|
||||||
|
(-fp.log2()).ceil() as u8
|
||||||
|
};
|
||||||
|
|
||||||
|
match (z_opt, b_opt, fp_opt) {
|
||||||
|
// All three given: use b and z, recompute fp conservatively.
|
||||||
|
(Some(z), Some(b), Some(_fp)) => {
|
||||||
|
let fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64;
|
||||||
|
(z, b, fp)
|
||||||
|
}
|
||||||
|
// Two given, derive third.
|
||||||
|
(Some(z), Some(b), None) => {
|
||||||
|
let fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64;
|
||||||
|
(z, b, fp)
|
||||||
|
}
|
||||||
|
(Some(z), None, Some(fp)) => {
|
||||||
|
let bz = (-fp.log2()).ceil() as u32;
|
||||||
|
let b = ((bz + z as u32 - 1) / z as u32).max(1) as u8;
|
||||||
|
let actual_fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64;
|
||||||
|
(z, b, actual_fp)
|
||||||
|
}
|
||||||
|
(None, Some(b), Some(fp)) => {
|
||||||
|
let bz = (-fp.log2()).ceil() as u32;
|
||||||
|
let z = ((bz + b as u32 - 1) / b as u32).max(1) as u8;
|
||||||
|
let actual_fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64;
|
||||||
|
(z, b, actual_fp)
|
||||||
|
}
|
||||||
|
// One given, apply defaults for the other.
|
||||||
|
(Some(z), None, None) => {
|
||||||
|
let b = DEFAULT_B;
|
||||||
|
let fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64;
|
||||||
|
(z, b, fp)
|
||||||
|
}
|
||||||
|
(None, Some(b), None) => {
|
||||||
|
let z = DEFAULT_Z;
|
||||||
|
let fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64;
|
||||||
|
(z, b, fp)
|
||||||
|
}
|
||||||
|
(None, None, Some(fp)) => {
|
||||||
|
let b = DEFAULT_B;
|
||||||
|
let z = ((bits_needed(fp) as u32 + b as u32 - 1) / b as u32).max(1) as u8;
|
||||||
|
let actual_fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64;
|
||||||
|
(z, b, actual_fp)
|
||||||
|
}
|
||||||
|
// None given: defaults.
|
||||||
|
(None, None, None) => {
|
||||||
|
let b = DEFAULT_B;
|
||||||
|
let z = DEFAULT_Z;
|
||||||
|
let fp = 1.0_f64 / (1u64 << (b as u32 * z as u32)) as f64;
|
||||||
|
(z, b, fp)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
pub fn run(args: IndexArgs) {
|
pub fn run(args: IndexArgs) {
|
||||||
let output = args.output.clone();
|
let output = args.output.clone();
|
||||||
let mut rep = Reporter::new();
|
let mut rep = Reporter::new();
|
||||||
|
|
||||||
|
// ── Resolve evidence kind ────────────────────────────────────────────────
|
||||||
|
let evidence = if args.approx {
|
||||||
|
let (z, b, fp) = resolve_approx_params(args.findere_z, args.evidence_bits, args.fp);
|
||||||
|
info!("approximate evidence: b={b}, z={z}, fp={fp:.2e}");
|
||||||
|
EvidenceKind::Approx { b, z }
|
||||||
|
} else {
|
||||||
|
EvidenceKind::Exact
|
||||||
|
};
|
||||||
|
|
||||||
// ── Open or create the index ─────────────────────────────────────────────
|
// ── Open or create the index ─────────────────────────────────────────────
|
||||||
let mut idx = if KmerIndex::exists(&output) && !args.force {
|
let mut idx = if KmerIndex::exists(&output) && !args.force {
|
||||||
info!("resuming from existing index at {}", output.display());
|
info!("resuming from existing index at {}", output.display());
|
||||||
@@ -76,11 +184,14 @@ pub fn run(args: IndexArgs) {
|
|||||||
if effective != args.common.partitions {
|
if effective != args.common.partitions {
|
||||||
info!("partitions: {} → {} (next power of 2)", args.common.partitions, effective);
|
info!("partitions: {} → {} (next power of 2)", args.common.partitions, effective);
|
||||||
}
|
}
|
||||||
|
let block_bits = block_size_to_bits(args.block_size);
|
||||||
let config = IndexConfig {
|
let config = IndexConfig {
|
||||||
kmer_size: args.common.kmer_size,
|
kmer_size: args.common.kmer_size,
|
||||||
minimizer_size: args.common.minimizer_size,
|
minimizer_size: args.common.minimizer_size,
|
||||||
n_bits,
|
n_bits,
|
||||||
with_counts: args.with_counts,
|
with_counts: args.with_counts,
|
||||||
|
evidence: evidence.clone(),
|
||||||
|
block_bits,
|
||||||
};
|
};
|
||||||
let genome_info = args.label.as_ref().map(|label| {
|
let genome_info = args.label.as_ref().map(|label| {
|
||||||
let mut info = GenomeInfo::new(label.clone());
|
let mut info = GenomeInfo::new(label.clone());
|
||||||
|
|||||||
@@ -1,9 +1,11 @@
|
|||||||
pub mod annotate;
|
pub mod annotate;
|
||||||
pub mod distance;
|
pub mod distance;
|
||||||
pub mod dump;
|
pub mod dump;
|
||||||
|
pub mod estimate;
|
||||||
pub mod index;
|
pub mod index;
|
||||||
pub mod merge;
|
pub mod merge;
|
||||||
pub mod query;
|
pub mod query;
|
||||||
pub mod rebuild;
|
pub mod rebuild;
|
||||||
|
pub mod reindex;
|
||||||
pub mod superkmer;
|
pub mod superkmer;
|
||||||
pub mod unitig;
|
pub mod unitig;
|
||||||
|
|||||||
@@ -0,0 +1,68 @@
|
|||||||
|
use std::path::PathBuf;
|
||||||
|
|
||||||
|
use clap::Args;
|
||||||
|
use obikindex::KmerIndex;
|
||||||
|
use obilayeredmap::EvidenceKind;
|
||||||
|
use obisys::Reporter;
|
||||||
|
use tracing::info;
|
||||||
|
|
||||||
|
use crate::cli::block_size_to_bits;
|
||||||
|
use super::index::resolve_approx_params;
|
||||||
|
|
||||||
|
#[derive(Args)]
|
||||||
|
pub struct ReindexArgs {
|
||||||
|
/// Index directory to convert (modified in-place)
|
||||||
|
pub index: PathBuf,
|
||||||
|
|
||||||
|
/// Convert to approximate evidence (default: convert to exact).
|
||||||
|
/// Requires --evidence-bits and/or -z and/or --fp.
|
||||||
|
#[arg(long, default_value_t = false)]
|
||||||
|
pub approx: bool,
|
||||||
|
|
||||||
|
/// Findere z parameter (≥1).
|
||||||
|
#[arg(short = 'z', long, default_value = None)]
|
||||||
|
pub findere_z: Option<u8>,
|
||||||
|
|
||||||
|
/// Fingerprint bits per slot (b).
|
||||||
|
#[arg(long, default_value = None)]
|
||||||
|
pub evidence_bits: Option<u8>,
|
||||||
|
|
||||||
|
/// Target false-positive rate per z-window.
|
||||||
|
#[arg(long, default_value = None)]
|
||||||
|
pub fp: Option<f64>,
|
||||||
|
|
||||||
|
/// Block size for exact evidence `.idx` (number of unitigs per block).
|
||||||
|
/// Ignored when converting to approximate evidence.
|
||||||
|
#[arg(long, default_value_t = 1)]
|
||||||
|
pub block_size: usize,
|
||||||
|
}
|
||||||
|
|
||||||
|
pub fn run(args: ReindexArgs) {
|
||||||
|
let target = if args.approx {
|
||||||
|
let (z, b, fp) = resolve_approx_params(args.findere_z, args.evidence_bits, args.fp);
|
||||||
|
info!("target: approximate evidence — b={b}, z={z}, fp={fp:.2e}");
|
||||||
|
EvidenceKind::Approx { b, z }
|
||||||
|
} else {
|
||||||
|
info!("target: exact evidence");
|
||||||
|
EvidenceKind::Exact
|
||||||
|
};
|
||||||
|
|
||||||
|
let mut idx = KmerIndex::open(&args.index).unwrap_or_else(|e| {
|
||||||
|
eprintln!("error opening index: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
|
||||||
|
info!(
|
||||||
|
"current evidence: {:?}",
|
||||||
|
idx.meta().config.evidence,
|
||||||
|
);
|
||||||
|
|
||||||
|
let block_bits = block_size_to_bits(args.block_size);
|
||||||
|
let mut rep = Reporter::new();
|
||||||
|
idx.reindex(target, block_bits, &mut rep).unwrap_or_else(|e| {
|
||||||
|
eprintln!("reindex error: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
|
||||||
|
rep.print();
|
||||||
|
}
|
||||||
@@ -35,13 +35,13 @@ pub fn run(args: UnitigArgs) {
|
|||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
|
|
||||||
let reader = UnitigFileReader::open(&path).unwrap_or_else(|e| {
|
// open_sequential: works with and without .idx (approx or exact index)
|
||||||
|
let reader = UnitigFileReader::open_sequential(&path).unwrap_or_else(|e| {
|
||||||
eprintln!("error opening unitigs (partition {i}): {e}");
|
eprintln!("error opening unitigs (partition {i}): {e}");
|
||||||
std::process::exit(1)
|
std::process::exit(1)
|
||||||
});
|
});
|
||||||
|
|
||||||
for j in 0..reader.len() {
|
for (j, unitig) in reader.iter_unitigs() {
|
||||||
let unitig = reader.unitig(j);
|
|
||||||
let mut out = stdout.lock().unwrap();
|
let mut out = stdout.lock().unwrap();
|
||||||
write_unitig(&unitig, k, i, j, &mut *out).unwrap_or_else(|e| {
|
write_unitig(&unitig, k, i, j, &mut *out).unwrap_or_else(|e| {
|
||||||
eprintln!("write error: {e}");
|
eprintln!("write error: {e}");
|
||||||
|
|||||||
@@ -32,6 +32,10 @@ enum Commands {
|
|||||||
Distance(cmd::distance::DistanceArgs),
|
Distance(cmd::distance::DistanceArgs),
|
||||||
/// Dump unitigs from a built index to stdout (debug)
|
/// Dump unitigs from a built index to stdout (debug)
|
||||||
Unitig(cmd::unitig::UnitigArgs),
|
Unitig(cmd::unitig::UnitigArgs),
|
||||||
|
/// Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing
|
||||||
|
Estimate(cmd::estimate::EstimateArgs),
|
||||||
|
/// Convert an index's evidence in-place: exact ↔ approx
|
||||||
|
Reindex(cmd::reindex::ReindexArgs),
|
||||||
}
|
}
|
||||||
|
|
||||||
fn main() {
|
fn main() {
|
||||||
@@ -62,6 +66,8 @@ fn main() {
|
|||||||
Commands::Annotate(args) => cmd::annotate::run(args),
|
Commands::Annotate(args) => cmd::annotate::run(args),
|
||||||
Commands::Distance(args) => cmd::distance::run(args),
|
Commands::Distance(args) => cmd::distance::run(args),
|
||||||
Commands::Unitig(args) => cmd::unitig::run(args),
|
Commands::Unitig(args) => cmd::unitig::run(args),
|
||||||
|
Commands::Estimate(args) => cmd::estimate::run(args),
|
||||||
|
Commands::Reindex(args) => cmd::reindex::run(args),
|
||||||
}
|
}
|
||||||
|
|
||||||
#[cfg(feature = "profiling")]
|
#[cfg(feature = "profiling")]
|
||||||
|
|||||||
@@ -1,4 +1,6 @@
|
|||||||
use std::path::PathBuf;
|
use std::path::PathBuf;
|
||||||
|
use std::sync::atomic::{AtomicU64, Ordering};
|
||||||
|
use std::sync::Arc;
|
||||||
use std::time::{Duration, Instant};
|
use std::time::{Duration, Instant};
|
||||||
|
|
||||||
use indicatif::{ProgressBar, ProgressStyle};
|
use indicatif::{ProgressBar, ProgressStyle};
|
||||||
@@ -42,15 +44,21 @@ pub fn scatter(
|
|||||||
// Throttle in the source thread — never in a worker — to prevent deadlock.
|
// Throttle in the source thread — never in a worker — to prevent deadlock.
|
||||||
let throttled = throttle_paths(path_source, max_open);
|
let throttled = throttle_paths(path_source, max_open);
|
||||||
|
|
||||||
|
let file_count = Arc::new(AtomicU64::new(0));
|
||||||
|
|
||||||
let t = Stage::start("scatter");
|
let t = Stage::start("scatter");
|
||||||
let pipe = obipipeline::make_pipe! {
|
let pipe = obipipeline::make_pipe! {
|
||||||
PipelineData : PathWithSlot => Vec<RoutableSuperKmer>,
|
PipelineData : PathWithSlot => Vec<RoutableSuperKmer>,
|
||||||
||? { move |pw: PathWithSlot| {
|
||? {
|
||||||
let PathWithSlot { path, _guard } = pw;
|
let file_count = Arc::clone(&file_count);
|
||||||
info!("indexing: {}", path.display());
|
move |pw: PathWithSlot| {
|
||||||
// _guard travels into GuardedIter; released when all chunks are read
|
let PathWithSlot { path, _guard } = pw;
|
||||||
open_chunks(path).map(|iter| GuardedIter { inner: iter, _guard })
|
let n = file_count.fetch_add(1, Ordering::Relaxed) + 1;
|
||||||
}} : Path => RawChunk,
|
info!("indexing [{}]: {}", n, path.display());
|
||||||
|
// _guard travels into GuardedIter; released when all chunks are read
|
||||||
|
open_chunks(path).map(|iter| GuardedIter { inner: iter, _guard })
|
||||||
|
}
|
||||||
|
} : Path => RawChunk,
|
||||||
|? { move |rope| obiread::normalize_sequence_chunk(rope, k) } : RawChunk => NormChunk,
|
|? { move |rope| obiread::normalize_sequence_chunk(rope, k) } : RawChunk => NormChunk,
|
||||||
| { move |rope| obiskbuilder::build_superkmers(rope, k, level_max, theta) } : NormChunk => Batch,
|
| { move |rope| obiskbuilder::build_superkmers(rope, k, level_max, theta) } : NormChunk => Batch,
|
||||||
};
|
};
|
||||||
@@ -84,7 +92,8 @@ pub fn scatter(
|
|||||||
} else {
|
} else {
|
||||||
(format!("{:.0} Mbp", bp / 1e6), format!("{:.0} Mbp/s", ema_rate / 1e6))
|
(format!("{:.0} Mbp", bp / 1e6), format!("{:.0} Mbp/s", ema_rate / 1e6))
|
||||||
};
|
};
|
||||||
pb.set_message(format!("{count_str} {rate_str}"));
|
let n_files = file_count.load(Ordering::Relaxed);
|
||||||
|
pb.set_message(format!("{count_str} {rate_str} {n_files} files"));
|
||||||
}
|
}
|
||||||
kp.write_batch(batch).unwrap_or_else(|e| {
|
kp.write_batch(batch).unwrap_or_else(|e| {
|
||||||
eprintln!("error: {e}");
|
eprintln!("error: {e}");
|
||||||
|
|||||||
@@ -5,7 +5,7 @@ use cacheline_ef::{CachelineEf, CachelineEfVec};
|
|||||||
use epserde::prelude::*;
|
use epserde::prelude::*;
|
||||||
use obicompactvec::{PersistentCompactIntMatrix, PersistentCompactIntVec};
|
use obicompactvec::{PersistentCompactIntMatrix, PersistentCompactIntVec};
|
||||||
use obidebruinj::GraphDeBruijn;
|
use obidebruinj::GraphDeBruijn;
|
||||||
use obilayeredmap::{OLMError, layer::Layer};
|
use obilayeredmap::{EvidenceKind, OLMError, layer::Layer};
|
||||||
use obilayeredmap::meta::PartitionMeta;
|
use obilayeredmap::meta::PartitionMeta;
|
||||||
use obiskio::{SKError, SKFileMeta, SKFileReader};
|
use obiskio::{SKError, SKFileMeta, SKFileReader};
|
||||||
use ptr_hash::{PtrHash, bucket_fn::CubicEps, hash::Xx64};
|
use ptr_hash::{PtrHash, bucket_fn::CubicEps, hash::Xx64};
|
||||||
@@ -44,6 +44,8 @@ impl KmerPartition {
|
|||||||
min_ab: u32,
|
min_ab: u32,
|
||||||
max_ab: Option<u32>,
|
max_ab: Option<u32>,
|
||||||
with_counts: bool,
|
with_counts: bool,
|
||||||
|
evidence: &EvidenceKind,
|
||||||
|
block_bits: u8,
|
||||||
) -> Result<usize, SKError> {
|
) -> Result<usize, SKError> {
|
||||||
let part_dir = self.part_dir(i);
|
let part_dir = self.part_dir(i);
|
||||||
let dedup_path = part_dir.join("dereplicated.skmer.zst");
|
let dedup_path = part_dir.join("dereplicated.skmer.zst");
|
||||||
@@ -108,7 +110,7 @@ impl KmerPartition {
|
|||||||
uw.close()?;
|
uw.close()?;
|
||||||
|
|
||||||
if with_counts {
|
if with_counts {
|
||||||
Layer::<PersistentCompactIntMatrix>::build(&layer_dir, |kmer| {
|
Layer::<PersistentCompactIntMatrix>::build(&layer_dir, block_bits, |kmer| {
|
||||||
match (&mphf1_opt, &counts1_opt) {
|
match (&mphf1_opt, &counts1_opt) {
|
||||||
(Some(mphf), Some(counts)) => counts.get(mphf.index(&kmer.raw())),
|
(Some(mphf), Some(counts)) => counts.get(mphf.index(&kmer.raw())),
|
||||||
_ => 1,
|
_ => 1,
|
||||||
@@ -116,7 +118,13 @@ impl KmerPartition {
|
|||||||
})
|
})
|
||||||
.map_err(olm_to_sk)?;
|
.map_err(olm_to_sk)?;
|
||||||
} else {
|
} else {
|
||||||
Layer::<()>::build(&layer_dir).map_err(olm_to_sk)?;
|
Layer::<()>::build(&layer_dir, block_bits).map_err(olm_to_sk)?;
|
||||||
|
}
|
||||||
|
|
||||||
|
// For approximate evidence: replace the exact evidence bundle with a
|
||||||
|
// fingerprint. For exact evidence, build() already wrote it.
|
||||||
|
if let EvidenceKind::Approx { b, z } = evidence {
|
||||||
|
Layer::<()>::build_approx_evidence(&layer_dir, *b, *z).map_err(olm_to_sk)?;
|
||||||
}
|
}
|
||||||
|
|
||||||
// Write meta.json in the index/ directory so LayeredMap::open works
|
// Write meta.json in the index/ directory so LayeredMap::open works
|
||||||
|
|||||||
@@ -161,6 +161,7 @@ impl KmerPartition {
|
|||||||
sources: &[(&KmerPartition, usize)],
|
sources: &[(&KmerPartition, usize)],
|
||||||
mode: MergeMode,
|
mode: MergeMode,
|
||||||
n_dst_genomes: usize,
|
n_dst_genomes: usize,
|
||||||
|
block_bits: u8,
|
||||||
) -> SKResult<()> {
|
) -> SKResult<()> {
|
||||||
let dst_index_dir = self.part_dir(i).join(INDEX_SUBDIR);
|
let dst_index_dir = self.part_dir(i).join(INDEX_SUBDIR);
|
||||||
if !dst_index_dir.exists() {
|
if !dst_index_dir.exists() {
|
||||||
@@ -216,7 +217,7 @@ impl KmerPartition {
|
|||||||
uw.write(&unitig)?;
|
uw.write(&unitig)?;
|
||||||
}
|
}
|
||||||
uw.close()?;
|
uw.close()?;
|
||||||
Layer::<()>::build(&new_layer_dir).map_err(olm_to_sk)?;
|
Layer::<()>::build(&new_layer_dir, block_bits).map_err(olm_to_sk)?;
|
||||||
}
|
}
|
||||||
drop(g);
|
drop(g);
|
||||||
|
|
||||||
|
|||||||
@@ -96,6 +96,7 @@ impl KmerPartition {
|
|||||||
filters: &[Box<dyn KmerFilter>],
|
filters: &[Box<dyn KmerFilter>],
|
||||||
mode: MergeMode,
|
mode: MergeMode,
|
||||||
n_genomes: usize,
|
n_genomes: usize,
|
||||||
|
block_bits: u8,
|
||||||
) -> SKResult<()> {
|
) -> SKResult<()> {
|
||||||
let src_index_dir = src.part_dir(i).join(INDEX_SUBDIR);
|
let src_index_dir = src.part_dir(i).join(INDEX_SUBDIR);
|
||||||
if !src_index_dir.exists() {
|
if !src_index_dir.exists() {
|
||||||
@@ -145,7 +146,7 @@ impl KmerPartition {
|
|||||||
uw.close()?;
|
uw.close()?;
|
||||||
drop(g);
|
drop(g);
|
||||||
|
|
||||||
Layer::<()>::build(&dst_layer_dir).map_err(olm_to_sk)?;
|
Layer::<()>::build(&dst_layer_dir, block_bits).map_err(olm_to_sk)?;
|
||||||
let dst_mphf = MphfLayer::open(&dst_layer_dir).map_err(olm_to_sk)?;
|
let dst_mphf = MphfLayer::open(&dst_layer_dir).map_err(olm_to_sk)?;
|
||||||
|
|
||||||
// ── Prepare matrix builders (one column per genome) ───────────────────
|
// ── Prepare matrix builders (one column per genome) ───────────────────
|
||||||
|
|||||||
@@ -12,6 +12,7 @@ cacheline-ef = "1.1"
|
|||||||
epserde = "0.8"
|
epserde = "0.8"
|
||||||
rayon = "1"
|
rayon = "1"
|
||||||
ndarray = "0.16"
|
ndarray = "0.16"
|
||||||
|
bitvec = "1"
|
||||||
memmap2 = "0.9"
|
memmap2 = "0.9"
|
||||||
serde = { version = "1", features = ["derive"] }
|
serde = { version = "1", features = ["derive"] }
|
||||||
serde_json = "1"
|
serde_json = "1"
|
||||||
|
|||||||
@@ -0,0 +1,151 @@
|
|||||||
|
// Packed B-bit fingerprint vector, one entry per MPHF slot.
|
||||||
|
//
|
||||||
|
// File format (fingerprint.bin):
|
||||||
|
// magic: b"FPVF" (4 bytes)
|
||||||
|
// b: u8 (bits per fingerprint, 1..=64)
|
||||||
|
// padding: [0u8; 3]
|
||||||
|
// n: u64 LE (number of slots)
|
||||||
|
// data: packed bits, ceil(n*b/8) bytes, Lsb0 order
|
||||||
|
|
||||||
|
use std::fs::File;
|
||||||
|
use std::io::{BufWriter, Write};
|
||||||
|
use std::path::Path;
|
||||||
|
|
||||||
|
use bitvec::prelude::*;
|
||||||
|
use memmap2::Mmap;
|
||||||
|
|
||||||
|
use crate::error::{OLMError, OLMResult};
|
||||||
|
|
||||||
|
const MAGIC: &[u8; 4] = b"FPVF";
|
||||||
|
const HEADER: usize = 16;
|
||||||
|
|
||||||
|
// ── Reader ────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
pub struct FingerprintVec {
|
||||||
|
_mmap: Mmap,
|
||||||
|
bits: &'static BitSlice<u8, Lsb0>,
|
||||||
|
n: usize,
|
||||||
|
b: u8,
|
||||||
|
mask: u64,
|
||||||
|
}
|
||||||
|
|
||||||
|
impl FingerprintVec {
|
||||||
|
pub fn open(path: &Path) -> OLMResult<Self> {
|
||||||
|
let f = File::open(path)?;
|
||||||
|
let mmap = unsafe { Mmap::map(&f)? };
|
||||||
|
if mmap.len() < HEADER || &mmap[..4] != MAGIC {
|
||||||
|
return Err(OLMError::InvalidLayer("bad fingerprint magic".into()));
|
||||||
|
}
|
||||||
|
let b = mmap[4];
|
||||||
|
if b == 0 || b > 64 {
|
||||||
|
return Err(OLMError::InvalidLayer("invalid fingerprint width".into()));
|
||||||
|
}
|
||||||
|
let n = u64::from_le_bytes(mmap[8..16].try_into().unwrap()) as usize;
|
||||||
|
let mask: u64 = if b == 64 { u64::MAX } else { (1u64 << b) - 1 };
|
||||||
|
// SAFETY: the mmap lives as long as Self (kept via _mmap); data is read-only.
|
||||||
|
let data: &'static [u8] = unsafe {
|
||||||
|
std::slice::from_raw_parts(mmap[HEADER..].as_ptr(), (n * b as usize + 7) / 8)
|
||||||
|
};
|
||||||
|
let bits = BitSlice::<u8, Lsb0>::from_slice(data);
|
||||||
|
Ok(Self { _mmap: mmap, bits, n, b, mask })
|
||||||
|
}
|
||||||
|
|
||||||
|
#[inline]
|
||||||
|
pub fn get(&self, slot: usize) -> u64 {
|
||||||
|
debug_assert!(slot < self.n);
|
||||||
|
let lo = slot * self.b as usize;
|
||||||
|
self.bits[lo .. lo + self.b as usize].load_le::<u64>()
|
||||||
|
}
|
||||||
|
|
||||||
|
#[inline]
|
||||||
|
pub fn matches(&self, slot: usize, fingerprint: u64) -> bool {
|
||||||
|
self.get(slot) == (fingerprint & self.mask)
|
||||||
|
}
|
||||||
|
|
||||||
|
pub fn n(&self) -> usize { self.n }
|
||||||
|
pub fn b(&self) -> u8 { self.b }
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Writer ────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
pub struct FingerprintVecWriter {
|
||||||
|
buf: Vec<u8>,
|
||||||
|
n: usize,
|
||||||
|
b: u8,
|
||||||
|
}
|
||||||
|
|
||||||
|
impl FingerprintVecWriter {
|
||||||
|
pub fn new(n: usize, b: u8) -> Self {
|
||||||
|
assert!(b > 0 && b <= 64, "fingerprint width must be 1..=64");
|
||||||
|
let data_bytes = (n * b as usize + 7) / 8;
|
||||||
|
Self { buf: vec![0u8; data_bytes], n, b }
|
||||||
|
}
|
||||||
|
|
||||||
|
#[inline]
|
||||||
|
pub fn set(&mut self, slot: usize, fingerprint: u64) {
|
||||||
|
debug_assert!(slot < self.n);
|
||||||
|
let lo = slot * self.b as usize;
|
||||||
|
let bits = BitSlice::<u8, Lsb0>::from_slice_mut(&mut self.buf);
|
||||||
|
bits[lo .. lo + self.b as usize].store_le(fingerprint);
|
||||||
|
}
|
||||||
|
|
||||||
|
pub fn write(self, path: &Path) -> OLMResult<()> {
|
||||||
|
let mut f = BufWriter::new(File::create(path).map_err(OLMError::Io)?);
|
||||||
|
f.write_all(MAGIC).map_err(OLMError::Io)?;
|
||||||
|
f.write_all(&[self.b, 0, 0, 0]).map_err(OLMError::Io)?;
|
||||||
|
f.write_all(&(self.n as u64).to_le_bytes()).map_err(OLMError::Io)?;
|
||||||
|
f.write_all(&self.buf).map_err(OLMError::Io)?;
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Tests ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
#[cfg(test)]
|
||||||
|
mod tests {
|
||||||
|
use super::*;
|
||||||
|
use tempfile::tempdir;
|
||||||
|
|
||||||
|
fn roundtrip(n: usize, b: u8, values: &[u64]) {
|
||||||
|
let dir = tempdir().unwrap();
|
||||||
|
let path = dir.path().join("fp.bin");
|
||||||
|
let mask: u64 = if b == 64 { u64::MAX } else { (1u64 << b) - 1 };
|
||||||
|
let mut w = FingerprintVecWriter::new(n, b);
|
||||||
|
for (i, &v) in values.iter().enumerate() {
|
||||||
|
w.set(i, v);
|
||||||
|
}
|
||||||
|
w.write(&path).unwrap();
|
||||||
|
let r = FingerprintVec::open(&path).unwrap();
|
||||||
|
assert_eq!(r.n(), n);
|
||||||
|
assert_eq!(r.b(), b);
|
||||||
|
for (i, &v) in values.iter().enumerate() {
|
||||||
|
assert_eq!(r.get(i), v & mask, "slot {i} b={b}");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn roundtrip_b1() { roundtrip(8, 1, &[1, 0, 1, 1, 0, 0, 1, 0]); }
|
||||||
|
#[test]
|
||||||
|
fn roundtrip_b8() { roundtrip(4, 8, &[0, 127, 200, 255]); }
|
||||||
|
#[test]
|
||||||
|
fn roundtrip_b16() { roundtrip(3, 16, &[0, 0xABCD, 0xFFFF]); }
|
||||||
|
#[test]
|
||||||
|
fn roundtrip_b7() { roundtrip(5, 7, &[0, 63, 127, 1, 42]); }
|
||||||
|
#[test]
|
||||||
|
fn roundtrip_b13_unaligned() {
|
||||||
|
let vals: Vec<u64> = (0..20).map(|i| (i * 317) % (1 << 13)).collect();
|
||||||
|
roundtrip(20, 13, &vals);
|
||||||
|
}
|
||||||
|
#[test]
|
||||||
|
fn matches_returns_true_for_exact() {
|
||||||
|
let dir = tempdir().unwrap();
|
||||||
|
let path = dir.path().join("fp.bin");
|
||||||
|
let mut w = FingerprintVecWriter::new(4, 8);
|
||||||
|
w.set(0, 42); w.set(1, 0); w.set(2, 255); w.set(3, 17);
|
||||||
|
w.write(&path).unwrap();
|
||||||
|
let r = FingerprintVec::open(&path).unwrap();
|
||||||
|
assert!( r.matches(0, 42));
|
||||||
|
assert!(!r.matches(0, 43));
|
||||||
|
assert!( r.matches(2, 255));
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -10,6 +10,7 @@ use obikseq::CanonicalKmer;
|
|||||||
use obiskio::{UnitigFileReader, UnitigFileWriter};
|
use obiskio::{UnitigFileReader, UnitigFileWriter};
|
||||||
|
|
||||||
use crate::error::{OLMError, OLMResult};
|
use crate::error::{OLMError, OLMResult};
|
||||||
|
use crate::meta::EvidenceKind;
|
||||||
use crate::mphf_layer::MphfLayer;
|
use crate::mphf_layer::MphfLayer;
|
||||||
pub(crate) use crate::mphf_layer::UNITIGS_FILE;
|
pub(crate) use crate::mphf_layer::UNITIGS_FILE;
|
||||||
|
|
||||||
@@ -76,19 +77,36 @@ impl<D: LayerData> Layer<D> {
|
|||||||
pub fn unitig_writer(out_dir: &Path) -> OLMResult<UnitigFileWriter> {
|
pub fn unitig_writer(out_dir: &Path) -> OLMResult<UnitigFileWriter> {
|
||||||
MphfLayer::unitig_writer(out_dir)
|
MphfLayer::unitig_writer(out_dir)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Build `unitigs.bin.idx` and `evidence.bin` from `unitigs.bin` and
|
||||||
|
/// `mphf.bin` already present in `layer_dir`.
|
||||||
|
/// `block_bits` controls the `.idx` block size (2^block_bits chunks/block).
|
||||||
|
pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize> {
|
||||||
|
MphfLayer::build_exact_evidence(layer_dir, block_bits)
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Build `fingerprint.bin` from `unitigs.bin` and `mphf.bin` already
|
||||||
|
/// present in `layer_dir`. `b` — fingerprint bits (1..=64); `z` — Findere
|
||||||
|
/// consecutive k-mer parameter (≥1).
|
||||||
|
pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult<usize> {
|
||||||
|
MphfLayer::build_approx_evidence(layer_dir, b, z)
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Dispatch to `build_exact_evidence` or `build_approx_evidence`.
|
||||||
|
/// `block_bits` is forwarded to exact evidence only.
|
||||||
|
pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize> {
|
||||||
|
MphfLayer::build_evidence(layer_dir, kind, block_bits)
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── Mode 1 — set membership ───────────────────────────────────────────────────
|
// ── Mode 1 — set membership ───────────────────────────────────────────────────
|
||||||
|
|
||||||
impl Layer<()> {
|
impl Layer<()> {
|
||||||
pub fn build(out_dir: &Path) -> OLMResult<usize> {
|
pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult<usize> {
|
||||||
MphfLayer::build(out_dir, &mut |_, _| Ok(()))
|
MphfLayer::build(out_dir, block_bits, &mut |_, _| Ok(()))
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Create a presence matrix for a set-membership layer (first merge).
|
/// Create a presence matrix for a set-membership layer (first merge).
|
||||||
///
|
|
||||||
/// All `n_kmers` slots are set to `true`: every kmer in this layer belongs
|
|
||||||
/// to genome_0, so genome_0 is present at every slot.
|
|
||||||
pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()> {
|
pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()> {
|
||||||
let presence_dir = layer_dir.join(PRESENCE_DIR);
|
let presence_dir = layer_dir.join(PRESENCE_DIR);
|
||||||
fs::create_dir_all(&presence_dir).map_err(OLMError::Io)?;
|
fs::create_dir_all(&presence_dir).map_err(OLMError::Io)?;
|
||||||
@@ -102,16 +120,20 @@ impl Layer<()> {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── Mode 2 — count matrix (1 column per layer) ────────────────────────────────
|
// ── Mode 2 — count matrix ─────────────────────────────────────────────────────
|
||||||
|
|
||||||
impl Layer<PersistentCompactIntMatrix> {
|
impl Layer<PersistentCompactIntMatrix> {
|
||||||
pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize> {
|
pub fn build(
|
||||||
let n = UnitigFileReader::open(&out_dir.join(UNITIGS_FILE))?.n_kmers();
|
out_dir: &Path,
|
||||||
|
block_bits: u8,
|
||||||
|
count_of: impl Fn(CanonicalKmer) -> u32,
|
||||||
|
) -> OLMResult<usize> {
|
||||||
|
let n = UnitigFileReader::open_sequential(&out_dir.join(UNITIGS_FILE))?.n_kmers();
|
||||||
let counts_dir = out_dir.join(COUNTS_DIR);
|
let counts_dir = out_dir.join(COUNTS_DIR);
|
||||||
let mut mb = PersistentCompactIntMatrixBuilder::new(n, &counts_dir)
|
let mut mb = PersistentCompactIntMatrixBuilder::new(n, &counts_dir)
|
||||||
.map_err(OLMError::Io)?;
|
.map_err(OLMError::Io)?;
|
||||||
let mut col = mb.add_col().map_err(OLMError::Io)?;
|
let mut col = mb.add_col().map_err(OLMError::Io)?;
|
||||||
let n_built = MphfLayer::build(out_dir, &mut |slot, kmer| {
|
let n_built = MphfLayer::build(out_dir, block_bits, &mut |slot, kmer| {
|
||||||
col.set(slot, count_of(kmer));
|
col.set(slot, count_of(kmer));
|
||||||
Ok(())
|
Ok(())
|
||||||
})?;
|
})?;
|
||||||
@@ -122,16 +144,16 @@ impl Layer<PersistentCompactIntMatrix> {
|
|||||||
|
|
||||||
pub fn build_from_map(
|
pub fn build_from_map(
|
||||||
out_dir: &Path,
|
out_dir: &Path,
|
||||||
|
block_bits: u8,
|
||||||
counts: &HashMap<CanonicalKmer, u32>,
|
counts: &HashMap<CanonicalKmer, u32>,
|
||||||
) -> OLMResult<usize> {
|
) -> OLMResult<usize> {
|
||||||
Self::build(out_dir, |kmer| counts.get(&kmer).copied().unwrap_or(0))
|
Self::build(out_dir, block_bits, |kmer| counts.get(&kmer).copied().unwrap_or(0))
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── Mode 2 — count matrix column append ──────────────────────────────────────
|
// ── Mode 2 — count matrix column append ──────────────────────────────────────
|
||||||
|
|
||||||
impl Layer<PersistentCompactIntMatrix> {
|
impl Layer<PersistentCompactIntMatrix> {
|
||||||
/// Append a genome column to an existing count matrix.
|
|
||||||
pub fn append_genome_column(
|
pub fn append_genome_column(
|
||||||
layer_dir: &Path,
|
layer_dir: &Path,
|
||||||
value_of: impl Fn(usize) -> u32,
|
value_of: impl Fn(usize) -> u32,
|
||||||
@@ -141,10 +163,9 @@ impl Layer<PersistentCompactIntMatrix> {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── Mode 3 — presence/absence matrix (1 column per genome) ───────────────────
|
// ── Mode 3 — presence/absence matrix ─────────────────────────────────────────
|
||||||
|
|
||||||
impl Layer<PersistentBitMatrix> {
|
impl Layer<PersistentBitMatrix> {
|
||||||
/// Append a genome column to an existing presence matrix.
|
|
||||||
pub fn append_genome_column(
|
pub fn append_genome_column(
|
||||||
layer_dir: &Path,
|
layer_dir: &Path,
|
||||||
value_of: impl Fn(usize) -> bool,
|
value_of: impl Fn(usize) -> bool,
|
||||||
@@ -155,16 +176,17 @@ impl Layer<PersistentBitMatrix> {
|
|||||||
|
|
||||||
pub fn build_presence(
|
pub fn build_presence(
|
||||||
out_dir: &Path,
|
out_dir: &Path,
|
||||||
|
block_bits: u8,
|
||||||
n_genomes: usize,
|
n_genomes: usize,
|
||||||
present_in: impl Fn(CanonicalKmer, usize) -> bool,
|
present_in: impl Fn(CanonicalKmer, usize) -> bool,
|
||||||
) -> OLMResult<usize> {
|
) -> OLMResult<usize> {
|
||||||
let n = UnitigFileReader::open(&out_dir.join(UNITIGS_FILE))?.n_kmers();
|
let n = UnitigFileReader::open_sequential(&out_dir.join(UNITIGS_FILE))?.n_kmers();
|
||||||
let presence_dir = out_dir.join(PRESENCE_DIR);
|
let presence_dir = out_dir.join(PRESENCE_DIR);
|
||||||
let mut mb = PersistentBitMatrixBuilder::new(n, &presence_dir).map_err(OLMError::Io)?;
|
let mut mb = PersistentBitMatrixBuilder::new(n, &presence_dir).map_err(OLMError::Io)?;
|
||||||
let mut cols: Vec<_> = (0..n_genomes)
|
let mut cols: Vec<_> = (0..n_genomes)
|
||||||
.map(|_| mb.add_col().map_err(OLMError::Io))
|
.map(|_| mb.add_col().map_err(OLMError::Io))
|
||||||
.collect::<OLMResult<_>>()?;
|
.collect::<OLMResult<_>>()?;
|
||||||
let n_built = MphfLayer::build(out_dir, &mut |slot, kmer| {
|
let n_built = MphfLayer::build(out_dir, block_bits, &mut |slot, kmer| {
|
||||||
for (g, col) in cols.iter_mut().enumerate() {
|
for (g, col) in cols.iter_mut().enumerate() {
|
||||||
col.set(slot, present_in(kmer, g));
|
col.set(slot, present_in(kmer, g));
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -1,5 +1,6 @@
|
|||||||
pub mod error;
|
pub mod error;
|
||||||
pub mod evidence;
|
pub mod evidence;
|
||||||
|
pub mod fingerprint;
|
||||||
pub mod layer;
|
pub mod layer;
|
||||||
pub mod layered_store;
|
pub mod layered_store;
|
||||||
pub mod map;
|
pub mod map;
|
||||||
@@ -10,4 +11,5 @@ pub use error::{OLMError, OLMResult};
|
|||||||
pub use layer::{Hit, Layer, LayerData};
|
pub use layer::{Hit, Layer, LayerData};
|
||||||
pub use layered_store::LayeredStore;
|
pub use layered_store::LayeredStore;
|
||||||
pub use map::LayeredMap;
|
pub use map::LayeredMap;
|
||||||
|
pub use meta::{EvidenceKind, LayerMeta};
|
||||||
pub use mphf_layer::MphfLayer;
|
pub use mphf_layer::MphfLayer;
|
||||||
|
|||||||
@@ -4,7 +4,7 @@ use std::path::{Path, PathBuf};
|
|||||||
|
|
||||||
use obicompactvec::PersistentCompactIntMatrix;
|
use obicompactvec::PersistentCompactIntMatrix;
|
||||||
use obikseq::CanonicalKmer;
|
use obikseq::CanonicalKmer;
|
||||||
use obiskio::UnitigFileWriter;
|
use obiskio::{UnitigFileWriter, DEFAULT_BLOCK_BITS};
|
||||||
|
|
||||||
use crate::error::OLMResult;
|
use crate::error::OLMResult;
|
||||||
use crate::layer::{Hit, Layer, LayerData};
|
use crate::layer::{Hit, Layer, LayerData};
|
||||||
@@ -90,7 +90,7 @@ impl LayeredMap<()> {
|
|||||||
pub fn push_layer(&mut self) -> OLMResult<usize> {
|
pub fn push_layer(&mut self) -> OLMResult<usize> {
|
||||||
let i = self.layers.len();
|
let i = self.layers.len();
|
||||||
let dir = layer_dir(&self.root, i);
|
let dir = layer_dir(&self.root, i);
|
||||||
Layer::<()>::build(&dir)?;
|
Layer::<()>::build(&dir, DEFAULT_BLOCK_BITS)?;
|
||||||
self.append_layer()?;
|
self.append_layer()?;
|
||||||
Ok(i)
|
Ok(i)
|
||||||
}
|
}
|
||||||
@@ -102,7 +102,7 @@ impl LayeredMap<PersistentCompactIntMatrix> {
|
|||||||
pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize> {
|
pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize> {
|
||||||
let i = self.layers.len();
|
let i = self.layers.len();
|
||||||
let dir = layer_dir(&self.root, i);
|
let dir = layer_dir(&self.root, i);
|
||||||
Layer::<PersistentCompactIntMatrix>::build(&dir, count_of)?;
|
Layer::<PersistentCompactIntMatrix>::build(&dir, DEFAULT_BLOCK_BITS, count_of)?;
|
||||||
self.append_layer()?;
|
self.append_layer()?;
|
||||||
Ok(i)
|
Ok(i)
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -5,7 +5,56 @@ use serde::{Deserialize, Serialize};
|
|||||||
|
|
||||||
use crate::error::OLMResult;
|
use crate::error::OLMResult;
|
||||||
|
|
||||||
const META_FILE: &str = "meta.json";
|
const META_FILE: &str = "meta.json";
|
||||||
|
const LAYER_META_FILE: &str = "layer_meta.json";
|
||||||
|
|
||||||
|
// ── Layer-level metadata ──────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
/// Describes the evidence bundle stored alongside the MPHF for one layer.
|
||||||
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||||
|
#[serde(tag = "type", rename_all = "snake_case")]
|
||||||
|
pub enum EvidenceKind {
|
||||||
|
/// Exact evidence: `evidence.bin` + `unitigs.bin.idx`. Zero false positives.
|
||||||
|
Exact,
|
||||||
|
/// Approximate evidence: `fingerprint.bin` only.
|
||||||
|
/// `b` — fingerprint bits; false-positive rate per k-mer query = 1/2^b.
|
||||||
|
/// `z` — consecutive k-mers that must all match (Findere trick);
|
||||||
|
/// effective FP rate per sequencing read ≈ W / 2^(b·z)
|
||||||
|
/// where W = L - k - z + 2 is the number of windows in a read of length L.
|
||||||
|
Approx { b: u8, z: u8 },
|
||||||
|
}
|
||||||
|
|
||||||
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||||
|
pub struct LayerMeta {
|
||||||
|
pub evidence: EvidenceKind,
|
||||||
|
}
|
||||||
|
|
||||||
|
impl Default for EvidenceKind {
|
||||||
|
fn default() -> Self { Self::Exact }
|
||||||
|
}
|
||||||
|
|
||||||
|
impl LayerMeta {
|
||||||
|
pub fn exact() -> Self {
|
||||||
|
Self { evidence: EvidenceKind::Exact }
|
||||||
|
}
|
||||||
|
|
||||||
|
pub fn approx(b: u8, z: u8) -> Self {
|
||||||
|
Self { evidence: EvidenceKind::Approx { b, z } }
|
||||||
|
}
|
||||||
|
|
||||||
|
pub fn load(layer_dir: &Path) -> OLMResult<Self> {
|
||||||
|
let f = File::open(layer_dir.join(LAYER_META_FILE))?;
|
||||||
|
Ok(serde_json::from_reader(f)?)
|
||||||
|
}
|
||||||
|
|
||||||
|
pub fn save(&self, layer_dir: &Path) -> OLMResult<()> {
|
||||||
|
let f = File::create(layer_dir.join(LAYER_META_FILE))?;
|
||||||
|
serde_json::to_writer_pretty(f, self)?;
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Partition-level metadata ──────────────────────────────────────────────────
|
||||||
|
|
||||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||||
pub struct PartitionMeta {
|
pub struct PartitionMeta {
|
||||||
|
|||||||
@@ -4,70 +4,215 @@ use std::path::Path;
|
|||||||
use cacheline_ef::{CachelineEf, CachelineEfVec};
|
use cacheline_ef::{CachelineEf, CachelineEfVec};
|
||||||
use epserde::prelude::*;
|
use epserde::prelude::*;
|
||||||
use obikseq::CanonicalKmer;
|
use obikseq::CanonicalKmer;
|
||||||
use obiskio::{UnitigFileReader, UnitigFileWriter};
|
use obiskio::{UnitigFileReader, UnitigFileWriter, build_unitig_idx};
|
||||||
use ptr_hash::{PtrHash, PtrHashParams, bucket_fn::CubicEps, hash::Xx64};
|
use ptr_hash::{PtrHash, PtrHashParams, bucket_fn::CubicEps, hash::Xx64};
|
||||||
|
|
||||||
use crate::error::{OLMError, OLMResult};
|
use crate::error::{OLMError, OLMResult};
|
||||||
use crate::evidence::{Evidence, EvidenceWriter};
|
use crate::evidence::{Evidence, EvidenceWriter};
|
||||||
|
use crate::fingerprint::{FingerprintVec, FingerprintVecWriter};
|
||||||
|
use crate::meta::{EvidenceKind, LayerMeta};
|
||||||
|
|
||||||
pub(crate) const MPHF_FILE: &str = "mphf.bin";
|
pub(crate) const MPHF_FILE: &str = "mphf.bin";
|
||||||
pub(crate) const UNITIGS_FILE: &str = "unitigs.bin";
|
pub(crate) const UNITIGS_FILE: &str = "unitigs.bin";
|
||||||
pub(crate) const EVIDENCE_FILE: &str = "evidence.bin";
|
pub(crate) const EVIDENCE_FILE: &str = "evidence.bin";
|
||||||
|
pub(crate) const FINGERPRINT_FILE: &str = "fingerprint.bin";
|
||||||
|
|
||||||
pub(crate) type Mphf = PtrHash<u64, CubicEps, CachelineEfVec<Vec<CachelineEf>>, Xx64, Vec<u8>>;
|
pub(crate) type Mphf = PtrHash<u64, CubicEps, CachelineEfVec<Vec<CachelineEf>>, Xx64, Vec<u8>>;
|
||||||
|
|
||||||
|
// ── Evidence store ────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
enum LayerEvidence {
|
||||||
|
Exact { evidence: Evidence, unitigs: UnitigFileReader },
|
||||||
|
Approx { fingerprint: FingerprintVec },
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── MphfLayer ─────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
/// Autonomous kmer → slot mapping for one layer.
|
/// Autonomous kmer → slot mapping for one layer.
|
||||||
///
|
///
|
||||||
/// Answers presence/absence queries without any attached DataStore.
|
/// Dispatches queries to exact or approximate evidence transparently based on
|
||||||
/// Build once, never rebuilt — data stores are attached and derived externally.
|
/// the `layer_meta.json` written at build time.
|
||||||
pub struct MphfLayer {
|
pub struct MphfLayer {
|
||||||
mphf: Mphf,
|
mphf: Mphf,
|
||||||
evidence: Evidence,
|
ev: LayerEvidence,
|
||||||
unitigs: UnitigFileReader,
|
n: usize,
|
||||||
n: usize,
|
|
||||||
}
|
}
|
||||||
|
|
||||||
impl MphfLayer {
|
impl MphfLayer {
|
||||||
pub fn open(dir: &Path) -> OLMResult<Self> {
|
pub fn open(dir: &Path) -> OLMResult<Self> {
|
||||||
|
let meta = LayerMeta::load(dir)?;
|
||||||
let mphf: Mphf = Mphf::load_full(&dir.join(MPHF_FILE))
|
let mphf: Mphf = Mphf::load_full(&dir.join(MPHF_FILE))
|
||||||
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
||||||
let unitigs = UnitigFileReader::open(&dir.join(UNITIGS_FILE))?;
|
let (ev, n) = match meta.evidence {
|
||||||
let evidence = Evidence::open(&dir.join(EVIDENCE_FILE))?;
|
EvidenceKind::Exact => {
|
||||||
let n = evidence.len();
|
let evidence = Evidence::open(&dir.join(EVIDENCE_FILE))?;
|
||||||
Ok(Self { mphf, evidence, unitigs, n })
|
let n = evidence.len();
|
||||||
|
let unitigs = UnitigFileReader::open(&dir.join(UNITIGS_FILE))?;
|
||||||
|
(LayerEvidence::Exact { evidence, unitigs }, n)
|
||||||
|
}
|
||||||
|
EvidenceKind::Approx { .. } => {
|
||||||
|
let fingerprint = FingerprintVec::open(&dir.join(FINGERPRINT_FILE))?;
|
||||||
|
let n = fingerprint.n();
|
||||||
|
(LayerEvidence::Approx { fingerprint }, n)
|
||||||
|
}
|
||||||
|
};
|
||||||
|
Ok(Self { mphf, ev, n })
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Returns `Some(slot)` if `kmer` belongs to this layer, `None` otherwise.
|
// ── Query API ─────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
/// Transparent dispatch: routes to `find_exact` or `find_approx` based on
|
||||||
|
/// the evidence loaded at `open` time.
|
||||||
#[inline]
|
#[inline]
|
||||||
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> {
|
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> {
|
||||||
|
match &self.ev {
|
||||||
|
LayerEvidence::Exact { .. } => self.find_exact(kmer),
|
||||||
|
LayerEvidence::Approx { .. } => self.find_approx(kmer),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Exact lookup: zero false positives. Panics if the layer was opened with
|
||||||
|
/// approximate evidence.
|
||||||
|
#[inline]
|
||||||
|
pub fn find_exact(&self, kmer: CanonicalKmer) -> Option<usize> {
|
||||||
|
let LayerEvidence::Exact { evidence, unitigs } = &self.ev else {
|
||||||
|
panic!("find_exact called on an approximate layer");
|
||||||
|
};
|
||||||
let slot = self.mphf.index(&kmer.raw());
|
let slot = self.mphf.index(&kmer.raw());
|
||||||
// PtrHash guarantees slot < n only for its key set; arbitrary queries may exceed bounds.
|
|
||||||
if slot >= self.n { return None; }
|
if slot >= self.n { return None; }
|
||||||
let (chunk_id, rank) = self.evidence.decode(slot);
|
let (chunk_id, rank) = evidence.decode(slot);
|
||||||
if self.unitigs.verify_canonical_kmer(chunk_id as usize, rank as usize, kmer) {
|
if unitigs.verify_canonical_kmer(chunk_id as usize, rank as usize, kmer) {
|
||||||
Some(slot)
|
Some(slot)
|
||||||
} else {
|
} else {
|
||||||
None
|
None
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Approximate lookup: false-positive rate 1/2^b per k-mer query. Panics
|
||||||
|
/// if the layer was opened with exact evidence.
|
||||||
|
#[inline]
|
||||||
|
pub fn find_approx(&self, kmer: CanonicalKmer) -> Option<usize> {
|
||||||
|
let LayerEvidence::Approx { fingerprint } = &self.ev else {
|
||||||
|
panic!("find_approx called on an exact layer");
|
||||||
|
};
|
||||||
|
let slot = self.mphf.index(&kmer.raw());
|
||||||
|
if slot >= self.n { return None; }
|
||||||
|
if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None }
|
||||||
|
}
|
||||||
|
|
||||||
pub fn n(&self) -> usize { self.n }
|
pub fn n(&self) -> usize { self.n }
|
||||||
|
|
||||||
|
// ── Build helpers ─────────────────────────────────────────────────────────
|
||||||
|
|
||||||
pub fn unitig_writer(dir: &Path) -> OLMResult<UnitigFileWriter> {
|
pub fn unitig_writer(dir: &Path) -> OLMResult<UnitigFileWriter> {
|
||||||
fs::create_dir_all(dir)?;
|
fs::create_dir_all(dir)?;
|
||||||
Ok(UnitigFileWriter::create(&dir.join(UNITIGS_FILE))?)
|
Ok(UnitigFileWriter::create(&dir.join(UNITIGS_FILE))?)
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Builds the MPHF and evidence from the unitigs file already present in `dir`.
|
/// Dispatch to `build_exact_evidence` or `build_approx_evidence` based on
|
||||||
/// Calls `fill_slot(slot, kmer)` once per kmer (second pass) for DataStore population.
|
/// `kind`. `block_bits` is forwarded to exact evidence only.
|
||||||
/// Returns the number of kmers indexed.
|
pub fn build_evidence(dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize> {
|
||||||
|
match kind {
|
||||||
|
EvidenceKind::Exact => Self::build_exact_evidence(dir, block_bits),
|
||||||
|
EvidenceKind::Approx { b, z } => Self::build_approx_evidence(dir, *b, *z),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Build `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`.
|
||||||
|
///
|
||||||
|
/// `block_bits` controls the `.idx` block size (2^block_bits chunks per block).
|
||||||
|
/// Uses sequential iteration — no `.idx` required on entry.
|
||||||
|
pub fn build_exact_evidence(dir: &Path, block_bits: u8) -> OLMResult<usize> {
|
||||||
|
let unitig_path = dir.join(UNITIGS_FILE);
|
||||||
|
let unitigs = UnitigFileReader::open_sequential(&unitig_path)?;
|
||||||
|
let n = unitigs.n_kmers();
|
||||||
|
|
||||||
|
if n == 0 {
|
||||||
|
fs::File::create(dir.join(EVIDENCE_FILE))?;
|
||||||
|
build_unitig_idx(&unitig_path, block_bits)?;
|
||||||
|
LayerMeta::exact().save(dir)?;
|
||||||
|
return Ok(0);
|
||||||
|
}
|
||||||
|
|
||||||
|
let mphf: Mphf = Mphf::load_full(&dir.join(MPHF_FILE))
|
||||||
|
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
||||||
|
|
||||||
|
let mut ev = EvidenceWriter::new(n);
|
||||||
|
let mut seen = vec![0u8; (n + 7) / 8];
|
||||||
|
|
||||||
|
for (kmer, chunk_id, rank) in unitigs.iter_indexed_canonical_kmers() {
|
||||||
|
let slot = mphf.index(&kmer.raw());
|
||||||
|
if slot >= n {
|
||||||
|
return Err(OLMError::Mphf("slot out of bounds".into()));
|
||||||
|
}
|
||||||
|
let byte = slot / 8;
|
||||||
|
let bit = 1u8 << (slot % 8);
|
||||||
|
if seen[byte] & bit != 0 {
|
||||||
|
return Err(OLMError::Mphf("duplicate slot".into()));
|
||||||
|
}
|
||||||
|
seen[byte] |= bit;
|
||||||
|
ev.set(slot, chunk_id as u32, rank as u8);
|
||||||
|
}
|
||||||
|
|
||||||
|
ev.write(&dir.join(EVIDENCE_FILE))?;
|
||||||
|
build_unitig_idx(&unitig_path, block_bits)?;
|
||||||
|
LayerMeta::exact().save(dir)?;
|
||||||
|
Ok(n)
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Build `fingerprint.bin` from `unitigs.bin` + `mphf.bin`.
|
||||||
|
/// `b` — fingerprint bits (1..=64); `z` — Findere consecutive k-mer
|
||||||
|
/// parameter (≥1). No `.idx` is written.
|
||||||
|
pub fn build_approx_evidence(dir: &Path, b: u8, z: u8) -> OLMResult<usize> {
|
||||||
|
if b == 0 || b > 64 {
|
||||||
|
return Err(OLMError::InvalidLayer("fingerprint width must be 1..=64".into()));
|
||||||
|
}
|
||||||
|
if z == 0 {
|
||||||
|
return Err(OLMError::InvalidLayer("z must be ≥ 1".into()));
|
||||||
|
}
|
||||||
|
let unitig_path = dir.join(UNITIGS_FILE);
|
||||||
|
let unitigs = UnitigFileReader::open_sequential(&unitig_path)?;
|
||||||
|
let n = unitigs.n_kmers();
|
||||||
|
|
||||||
|
if n == 0 {
|
||||||
|
FingerprintVecWriter::new(0, b).write(&dir.join(FINGERPRINT_FILE))?;
|
||||||
|
LayerMeta::approx(b, z).save(dir)?;
|
||||||
|
return Ok(0);
|
||||||
|
}
|
||||||
|
|
||||||
|
let mphf: Mphf = Mphf::load_full(&dir.join(MPHF_FILE))
|
||||||
|
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
||||||
|
|
||||||
|
let mut fw = FingerprintVecWriter::new(n, b);
|
||||||
|
|
||||||
|
for kmer in unitigs.iter_canonical_kmers() {
|
||||||
|
let slot = mphf.index(&kmer.raw());
|
||||||
|
if slot >= n {
|
||||||
|
return Err(OLMError::Mphf("slot out of bounds".into()));
|
||||||
|
}
|
||||||
|
fw.set(slot, kmer.seq_hash());
|
||||||
|
}
|
||||||
|
|
||||||
|
fw.write(&dir.join(FINGERPRINT_FILE))?;
|
||||||
|
LayerMeta::approx(b, z).save(dir)?;
|
||||||
|
Ok(n)
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Build MPHF and exact evidence from the unitigs file already present in
|
||||||
|
/// `dir`. Calls `fill_slot(slot, kmer)` once per kmer for DataStore
|
||||||
|
/// population. Returns the number of kmers indexed.
|
||||||
pub(crate) fn build(
|
pub(crate) fn build(
|
||||||
dir: &Path,
|
dir: &Path,
|
||||||
|
block_bits: u8,
|
||||||
fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
|
fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
|
||||||
) -> OLMResult<usize> {
|
) -> OLMResult<usize> {
|
||||||
use rayon::prelude::*;
|
use rayon::prelude::*;
|
||||||
|
|
||||||
let unitigs = UnitigFileReader::open(&dir.join(UNITIGS_FILE))?;
|
let unitig_path = dir.join(UNITIGS_FILE);
|
||||||
|
|
||||||
|
build_unitig_idx(&unitig_path, block_bits)?;
|
||||||
|
|
||||||
|
let unitigs = UnitigFileReader::open(&unitig_path)?;
|
||||||
let n = unitigs.n_kmers();
|
let n = unitigs.n_kmers();
|
||||||
|
|
||||||
if n == 0 {
|
if n == 0 {
|
||||||
@@ -77,10 +222,11 @@ impl MphfLayer {
|
|||||||
.ok_or_else(|| OLMError::Mphf("construction failed".into()))?;
|
.ok_or_else(|| OLMError::Mphf("construction failed".into()))?;
|
||||||
mphf.store(&dir.join(MPHF_FILE))
|
mphf.store(&dir.join(MPHF_FILE))
|
||||||
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
||||||
|
LayerMeta::exact().save(dir)?;
|
||||||
return Ok(0);
|
return Ok(0);
|
||||||
}
|
}
|
||||||
|
|
||||||
// Pass 1 — build MPHF
|
// Pass 1 — build MPHF (parallel, random access via .idx)
|
||||||
let keys = (0..unitigs.len())
|
let keys = (0..unitigs.len())
|
||||||
.into_par_iter()
|
.into_par_iter()
|
||||||
.flat_map_iter(|ci| unitigs.unitig(ci).into_canonical_kmers().map(|km| km.raw()));
|
.flat_map_iter(|ci| unitigs.unitig(ci).into_canonical_kmers().map(|km| km.raw()));
|
||||||
@@ -90,7 +236,7 @@ impl MphfLayer {
|
|||||||
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
||||||
|
|
||||||
// Pass 2 — fill evidence + mode-specific data via callback
|
// Pass 2 — fill evidence + mode-specific data via callback
|
||||||
let unitigs2 = UnitigFileReader::open(&dir.join(UNITIGS_FILE))?;
|
let unitigs2 = UnitigFileReader::open(&unitig_path)?;
|
||||||
let mut ev = EvidenceWriter::new(n);
|
let mut ev = EvidenceWriter::new(n);
|
||||||
let mut seen = vec![0u8; (n + 7) / 8];
|
let mut seen = vec![0u8; (n + 7) / 8];
|
||||||
|
|
||||||
@@ -110,6 +256,7 @@ impl MphfLayer {
|
|||||||
}
|
}
|
||||||
|
|
||||||
ev.write(&dir.join(EVIDENCE_FILE))?;
|
ev.write(&dir.join(EVIDENCE_FILE))?;
|
||||||
|
LayerMeta::exact().save(dir)?;
|
||||||
Ok(n)
|
Ok(n)
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -1,6 +1,7 @@
|
|||||||
use super::*;
|
use super::*;
|
||||||
use obicompactvec::PersistentCompactIntMatrix;
|
use obicompactvec::PersistentCompactIntMatrix;
|
||||||
use obikseq::{set_k, Kmer, Sequence as _, Unitig};
|
use obikseq::{set_k, Kmer, Sequence as _, Unitig};
|
||||||
|
use obiskio::DEFAULT_BLOCK_BITS;
|
||||||
use tempfile::tempdir;
|
use tempfile::tempdir;
|
||||||
|
|
||||||
fn write_unitigs(dir: &Path, seqs: &[&[u8]]) {
|
fn write_unitigs(dir: &Path, seqs: &[&[u8]]) {
|
||||||
@@ -11,16 +12,10 @@ fn write_unitigs(dir: &Path, seqs: &[&[u8]]) {
|
|||||||
w.close().unwrap();
|
w.close().unwrap();
|
||||||
}
|
}
|
||||||
|
|
||||||
fn all_canonical_kmers(dir: &Path, k: usize) -> Vec<CanonicalKmer> {
|
fn all_canonical_kmers(dir: &Path) -> Vec<CanonicalKmer> {
|
||||||
let r = UnitigFileReader::open(&dir.join(UNITIGS_FILE)).unwrap();
|
UnitigFileReader::open_sequential(&dir.join(UNITIGS_FILE)).unwrap()
|
||||||
let mut out = Vec::new();
|
.iter_canonical_kmers()
|
||||||
for ci in 0..r.len() {
|
.collect()
|
||||||
let n = r.seql(ci) - k + 1;
|
|
||||||
for rank in 0..n {
|
|
||||||
out.push(Kmer::from_raw(r.raw_kmer(ci, rank)).canonical());
|
|
||||||
}
|
|
||||||
}
|
|
||||||
out
|
|
||||||
}
|
}
|
||||||
|
|
||||||
#[test]
|
#[test]
|
||||||
@@ -28,8 +23,8 @@ fn build_and_query_all_kmers_found() {
|
|||||||
set_k(4);
|
set_k(4);
|
||||||
let dir = tempdir().unwrap();
|
let dir = tempdir().unwrap();
|
||||||
write_unitigs(dir.path(), &[b"AAAACGT"]);
|
write_unitigs(dir.path(), &[b"AAAACGT"]);
|
||||||
let kmers = all_canonical_kmers(dir.path(), 4);
|
let kmers = all_canonical_kmers(dir.path());
|
||||||
Layer::<()>::build(dir.path()).unwrap();
|
Layer::<()>::build(dir.path(), DEFAULT_BLOCK_BITS).unwrap();
|
||||||
let layer = Layer::<()>::open(dir.path()).unwrap();
|
let layer = Layer::<()>::open(dir.path()).unwrap();
|
||||||
for kmer in kmers {
|
for kmer in kmers {
|
||||||
assert!(layer.query(kmer).is_some(), "kmer should be present");
|
assert!(layer.query(kmer).is_some(), "kmer should be present");
|
||||||
@@ -41,11 +36,12 @@ fn counts_are_stored_and_retrieved() {
|
|||||||
set_k(4);
|
set_k(4);
|
||||||
let dir = tempdir().unwrap();
|
let dir = tempdir().unwrap();
|
||||||
write_unitigs(dir.path(), &[b"AAAACGT"]);
|
write_unitigs(dir.path(), &[b"AAAACGT"]);
|
||||||
let kmers = all_canonical_kmers(dir.path(), 4);
|
let kmers = all_canonical_kmers(dir.path());
|
||||||
let count_map: HashMap<CanonicalKmer, u32> =
|
let count_map: HashMap<CanonicalKmer, u32> =
|
||||||
kmers.iter().enumerate().map(|(i, &k)| (k, i as u32 + 1)).collect();
|
kmers.iter().enumerate().map(|(i, &k)| (k, i as u32 + 1)).collect();
|
||||||
Layer::<PersistentCompactIntMatrix>::build(
|
Layer::<PersistentCompactIntMatrix>::build(
|
||||||
dir.path(),
|
dir.path(),
|
||||||
|
DEFAULT_BLOCK_BITS,
|
||||||
|kmer| count_map.get(&kmer).copied().unwrap_or(0),
|
|kmer| count_map.get(&kmer).copied().unwrap_or(0),
|
||||||
).unwrap();
|
).unwrap();
|
||||||
let layer = Layer::<PersistentCompactIntMatrix>::open(dir.path()).unwrap();
|
let layer = Layer::<PersistentCompactIntMatrix>::open(dir.path()).unwrap();
|
||||||
@@ -60,7 +56,7 @@ fn query_absent_returns_none() {
|
|||||||
set_k(4);
|
set_k(4);
|
||||||
let dir = tempdir().unwrap();
|
let dir = tempdir().unwrap();
|
||||||
write_unitigs(dir.path(), &[b"AAAACGT"]);
|
write_unitigs(dir.path(), &[b"AAAACGT"]);
|
||||||
Layer::<()>::build(dir.path()).unwrap();
|
Layer::<()>::build(dir.path(), DEFAULT_BLOCK_BITS).unwrap();
|
||||||
let layer = Layer::<()>::open(dir.path()).unwrap();
|
let layer = Layer::<()>::open(dir.path()).unwrap();
|
||||||
let absent = Kmer::from_ascii(b"CCCC").unwrap().canonical();
|
let absent = Kmer::from_ascii(b"CCCC").unwrap().canonical();
|
||||||
assert!(layer.query(absent).is_none());
|
assert!(layer.query(absent).is_none());
|
||||||
@@ -71,7 +67,7 @@ fn open_after_build_is_consistent() {
|
|||||||
set_k(4);
|
set_k(4);
|
||||||
let dir = tempdir().unwrap();
|
let dir = tempdir().unwrap();
|
||||||
write_unitigs(dir.path(), &[b"AAAACGT"]);
|
write_unitigs(dir.path(), &[b"AAAACGT"]);
|
||||||
let n = Layer::<PersistentCompactIntMatrix>::build(dir.path(), |_| 7).unwrap();
|
let n = Layer::<PersistentCompactIntMatrix>::build(dir.path(), DEFAULT_BLOCK_BITS, |_| 7).unwrap();
|
||||||
assert_eq!(n, 4);
|
assert_eq!(n, 4);
|
||||||
let layer = Layer::<PersistentCompactIntMatrix>::open(dir.path()).unwrap();
|
let layer = Layer::<PersistentCompactIntMatrix>::open(dir.path()).unwrap();
|
||||||
let kmer = Kmer::from_ascii(b"AAAA").unwrap().canonical();
|
let kmer = Kmer::from_ascii(b"AAAA").unwrap().canonical();
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ pub use error::{SKError, SKResult};
|
|||||||
pub use meta::SKFileMeta;
|
pub use meta::SKFileMeta;
|
||||||
pub use pool::{SKFilePool, SKFileWriter, SharedPool, create_token, create_token_with};
|
pub use pool::{SKFilePool, SKFileWriter, SharedPool, create_token, create_token_with};
|
||||||
pub use reader::{SKFileIter, SKFileReader};
|
pub use reader::{SKFileIter, SKFileReader};
|
||||||
pub use unitig_index::{UnitigFileReader, UnitigFileWriter};
|
pub use unitig_index::{UnitigFileReader, UnitigFileWriter, build_unitig_idx, DEFAULT_BLOCK_BITS};
|
||||||
|
|
||||||
use std::path::{Path, PathBuf};
|
use std::path::{Path, PathBuf};
|
||||||
|
|
||||||
|
|||||||
@@ -18,6 +18,7 @@ fn write_read(seqs: &[&[u8]]) -> (tempfile::TempDir, UnitigFileReader) {
|
|||||||
w.write(&make_unitig(s)).unwrap();
|
w.write(&make_unitig(s)).unwrap();
|
||||||
}
|
}
|
||||||
w.close().unwrap();
|
w.close().unwrap();
|
||||||
|
super::build_unitig_idx(&path, super::DEFAULT_BLOCK_BITS).unwrap();
|
||||||
let r = UnitigFileReader::open(&path).unwrap();
|
let r = UnitigFileReader::open(&path).unwrap();
|
||||||
(dir, r)
|
(dir, r)
|
||||||
}
|
}
|
||||||
@@ -31,6 +32,7 @@ fn roundtrip_empty_index() {
|
|||||||
let path = dir.path().join("unitigs.bin");
|
let path = dir.path().join("unitigs.bin");
|
||||||
let w = UnitigFileWriter::create(&path).unwrap();
|
let w = UnitigFileWriter::create(&path).unwrap();
|
||||||
w.close().unwrap();
|
w.close().unwrap();
|
||||||
|
super::build_unitig_idx(&path, super::DEFAULT_BLOCK_BITS).unwrap();
|
||||||
let r = UnitigFileReader::open(&path).unwrap();
|
let r = UnitigFileReader::open(&path).unwrap();
|
||||||
assert_eq!(r.len(), 0);
|
assert_eq!(r.len(), 0);
|
||||||
}
|
}
|
||||||
@@ -145,6 +147,7 @@ fn iter_kmers_empty_file() {
|
|||||||
let dir = tempdir().unwrap();
|
let dir = tempdir().unwrap();
|
||||||
let path = dir.path().join("unitigs.bin");
|
let path = dir.path().join("unitigs.bin");
|
||||||
UnitigFileWriter::create(&path).unwrap().close().unwrap();
|
UnitigFileWriter::create(&path).unwrap().close().unwrap();
|
||||||
|
super::build_unitig_idx(&path, super::DEFAULT_BLOCK_BITS).unwrap();
|
||||||
let r = UnitigFileReader::open(&path).unwrap();
|
let r = UnitigFileReader::open(&path).unwrap();
|
||||||
assert_eq!(r.iter_kmers().count(), 0);
|
assert_eq!(r.iter_kmers().count(), 0);
|
||||||
}
|
}
|
||||||
@@ -269,3 +272,96 @@ fn long_unitig_split_no_kmer_lost() {
|
|||||||
// First kmer of chunk 1 = original nucl 256..260 — a different, adjacent kmer.
|
// First kmer of chunk 1 = original nucl 256..260 — a different, adjacent kmer.
|
||||||
assert!(r.verify_canonical_kmer(1, 0, canonical_of(&seq[256..260])));
|
assert!(r.verify_canonical_kmer(1, 0, canonical_of(&seq[256..260])));
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ── block_bits parametrisation ────────────────────────────────────────────────
|
||||||
|
|
||||||
|
fn write_read_bb(seqs: &[&[u8]], block_bits: u8) -> (tempfile::TempDir, UnitigFileReader) {
|
||||||
|
let dir = tempdir().unwrap();
|
||||||
|
let path = dir.path().join("unitigs.bin");
|
||||||
|
let mut w = UnitigFileWriter::create_with_block_bits(&path, block_bits).unwrap();
|
||||||
|
for s in seqs {
|
||||||
|
w.write(&make_unitig(s)).unwrap();
|
||||||
|
}
|
||||||
|
w.close().unwrap();
|
||||||
|
super::build_unitig_idx(&path, block_bits).unwrap();
|
||||||
|
let r = UnitigFileReader::open(&path).unwrap();
|
||||||
|
(dir, r)
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn block_bits_stored_and_read_back() {
|
||||||
|
set_k(4);
|
||||||
|
for bb in [0u8, 1, 2, 3, 6] {
|
||||||
|
let dir = tempdir().unwrap();
|
||||||
|
let path = dir.path().join("unitigs.bin");
|
||||||
|
let w = UnitigFileWriter::create_with_block_bits(&path, bb).unwrap();
|
||||||
|
assert_eq!(w.block_bits(), bb);
|
||||||
|
w.close().unwrap();
|
||||||
|
super::build_unitig_idx(&path, bb).unwrap();
|
||||||
|
let r = UnitigFileReader::open(&path).unwrap();
|
||||||
|
assert_eq!(r.block_bits(), bb, "block_bits={bb} not preserved");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn block_bits_zero_exact_offsets() {
|
||||||
|
// block_bits=0 → one offset per chunk, no sequential scan in chunk_start
|
||||||
|
set_k(4);
|
||||||
|
let seqs: &[&[u8]] = &[b"ACGTACGT", b"TTTTCCCC", b"GGGAAA", b"AAAACG"];
|
||||||
|
let (_dir, r) = write_read_bb(seqs, 0);
|
||||||
|
assert_eq!(r.len(), seqs.len());
|
||||||
|
for (i, s) in seqs.iter().enumerate() {
|
||||||
|
assert_eq!(r.unitig(i), make_unitig(s), "block_bits=0: unitig {i} mismatch");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn block_bits_one_block_size_two() {
|
||||||
|
// block_bits=1 → BLOCK_SIZE=2: every other chunk gets an offset
|
||||||
|
set_k(4);
|
||||||
|
let seqs: &[&[u8]] = &[b"ACGTACGT", b"TTTTCCCC", b"GGGAAA", b"AAAACG", b"CCCCGT"];
|
||||||
|
let (_dir, r) = write_read_bb(seqs, 1);
|
||||||
|
assert_eq!(r.len(), seqs.len());
|
||||||
|
for (i, s) in seqs.iter().enumerate() {
|
||||||
|
assert_eq!(r.unitig(i), make_unitig(s), "block_bits=1: unitig {i} mismatch");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn block_bits_larger_than_chunk_count() {
|
||||||
|
// block_bits=6 (BLOCK_SIZE=64) with only 3 chunks: all in block 0
|
||||||
|
set_k(4);
|
||||||
|
let seqs: &[&[u8]] = &[b"ACGTACGT", b"TTTTCCCC", b"GGGAAA"];
|
||||||
|
let (_dir, r) = write_read_bb(seqs, 6);
|
||||||
|
assert_eq!(r.len(), seqs.len());
|
||||||
|
for (i, s) in seqs.iter().enumerate() {
|
||||||
|
assert_eq!(r.unitig(i), make_unitig(s), "block_bits=6: unitig {i} mismatch");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn block_bits_random_access_matches_sequential() {
|
||||||
|
// Write many chunks with block_bits=2 (BLOCK_SIZE=4), verify random access
|
||||||
|
// against sequential iteration for every chunk.
|
||||||
|
set_k(4);
|
||||||
|
let seqs: Vec<Vec<u8>> = (0..20_usize)
|
||||||
|
.map(|i| (0..8_usize).map(|j| b"ACGT"[(i + j) % 4]).collect())
|
||||||
|
.collect();
|
||||||
|
let seq_refs: Vec<&[u8]> = seqs.iter().map(|s| s.as_slice()).collect();
|
||||||
|
let (_dir, r) = write_read_bb(&seq_refs, 2);
|
||||||
|
assert_eq!(r.len(), seqs.len());
|
||||||
|
let sequential: Vec<Unitig> = r.iter_chunks_sequential().map(|(_, u)| u).collect();
|
||||||
|
for i in 0..seqs.len() {
|
||||||
|
assert_eq!(r.unitig(i), sequential[i], "random vs sequential mismatch at chunk {i}");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn block_bits_kmer_count_preserved() {
|
||||||
|
set_k(4);
|
||||||
|
// "AAAACG" → 3 kmers, "CCCCAG" → 3 kmers; total = 6
|
||||||
|
for bb in [0u8, 1, 2, 6] {
|
||||||
|
let (_dir, r) = write_read_bb(&[b"AAAACG", b"CCCCAG"], bb);
|
||||||
|
assert_eq!(r.n_kmers(), 6, "block_bits={bb}: n_kmers mismatch");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|||||||
+252
-145
@@ -9,21 +9,19 @@ pub use obikseq::MAX_KMERS_PER_CHUNK;
|
|||||||
|
|
||||||
use crate::error::{SKError, SKResult};
|
use crate::error::{SKError, SKResult};
|
||||||
|
|
||||||
// ── Index file format ─────────────────────────────────────────────────────────
|
// ── Block index parameters ────────────────────────────────────────────────────
|
||||||
//
|
//
|
||||||
// magic: [u8; 4] = b"UIDX"
|
// BLOCK_SIZE = 1 << block_bits chunks share one offset entry in the index.
|
||||||
// n_unitigs: u32 LE
|
// block_bits=0 → one entry per chunk (exact offsets, no scan).
|
||||||
// n_kmers: u64 LE total kmer count across all chunks
|
// block_bits=6 → one entry per 64 chunks (default; O(64) scan per lookup).
|
||||||
// seqls: [u8; n_unitigs] max kmer index per chunk (= n_kmers − 1)
|
|
||||||
// packed_offsets: [u32; n_unitigs + 1] byte offsets to packed bytes in the
|
|
||||||
// sequence file; last entry is sentinel
|
|
||||||
//
|
//
|
||||||
// Each sequence record in the binary file: [u8: n_kmers−1][packed bytes].
|
// block_bits is stored in the index file so the reader derives all parameters
|
||||||
// Offsets point to the first packed byte of each record, past the leading u8.
|
// at runtime — no compile-time constant constrains the format.
|
||||||
// Unitigs with more than MAX_KMERS_PER_CHUNK kmers are transparently split by the
|
|
||||||
// writer into overlapping chunks (k-1 nucleotide overlap) so no kmer is lost.
|
|
||||||
|
|
||||||
const MAGIC: [u8; 4] = *b"UIDX";
|
const MAGIC: [u8; 4] = *b"UIX3";
|
||||||
|
|
||||||
|
/// Default block granularity used by [`UnitigFileWriter::create`].
|
||||||
|
pub const DEFAULT_BLOCK_BITS: u8 = 0;
|
||||||
|
|
||||||
fn idx_path(path: &Path) -> PathBuf {
|
fn idx_path(path: &Path) -> PathBuf {
|
||||||
crate::append_path_suffix(path, ".idx")
|
crate::append_path_suffix(path, ".idx")
|
||||||
@@ -32,38 +30,52 @@ fn idx_path(path: &Path) -> PathBuf {
|
|||||||
// ── Writer ────────────────────────────────────────────────────────────────────
|
// ── Writer ────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
/// Writes a sequence of [`Unitig`] to an uncompressed binary file and builds
|
/// Writes a sequence of [`Unitig`] to an uncompressed binary file and builds
|
||||||
/// an offset index at close time.
|
/// a block-sampled offset index at close time.
|
||||||
///
|
///
|
||||||
/// Unitigs with more than [`MAX_KMERS_PER_CHUNK`] kmers are transparently split
|
/// One offset is stored every `1 << block_bits` chunks; random access to chunk
|
||||||
/// into overlapping chunks (k-1 nucleotide overlap) so no kmer is lost.
|
/// `i` costs at most `(1 << block_bits) − 1` sequential chunk scans after the
|
||||||
|
/// block lookup.
|
||||||
///
|
///
|
||||||
/// The companion index file (`path.idx`) is written on [`close`].
|
/// Unitigs with more than [`MAX_KMERS_PER_CHUNK`] k-mers are transparently split
|
||||||
/// The binary format per record is `[u8: n_kmers−1][packed 2-bit bytes]`.
|
/// into overlapping chunks (k−1 nucleotide overlap) so no k-mer is lost.
|
||||||
pub struct UnitigFileWriter {
|
pub struct UnitigFileWriter {
|
||||||
path: PathBuf,
|
file: BufWriter<File>,
|
||||||
file: BufWriter<File>,
|
block_offsets: Vec<u32>,
|
||||||
seqls: Vec<u8>,
|
chunk_count: usize,
|
||||||
packed_offsets: Vec<u32>,
|
next_offset: u32,
|
||||||
next_offset: u32,
|
n_kmers: usize,
|
||||||
n_kmers: usize,
|
k: usize,
|
||||||
k: usize,
|
block_bits: u8,
|
||||||
|
mask: usize, // (1 << block_bits) - 1
|
||||||
}
|
}
|
||||||
|
|
||||||
impl UnitigFileWriter {
|
impl UnitigFileWriter {
|
||||||
|
/// Create a writer with the default block size (`DEFAULT_BLOCK_BITS = 6`).
|
||||||
pub fn create(path: &Path) -> SKResult<Self> {
|
pub fn create(path: &Path) -> SKResult<Self> {
|
||||||
|
Self::create_with_block_bits(path, DEFAULT_BLOCK_BITS)
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Create a writer with a custom block size.
|
||||||
|
///
|
||||||
|
/// `block_bits` must be in 0..=31. `block_bits=0` stores one offset per
|
||||||
|
/// chunk (exact, no scan); larger values trade index size for scan length.
|
||||||
|
pub fn create_with_block_bits(path: &Path, block_bits: u8) -> SKResult<Self> {
|
||||||
|
assert!(block_bits <= 31, "block_bits must be ≤ 31");
|
||||||
let file = File::create(path).map_err(SKError::Io)?;
|
let file = File::create(path).map_err(SKError::Io)?;
|
||||||
Ok(Self {
|
Ok(Self {
|
||||||
path: path.to_owned(),
|
|
||||||
file: BufWriter::new(file),
|
file: BufWriter::new(file),
|
||||||
seqls: Vec::new(),
|
block_offsets: Vec::new(),
|
||||||
packed_offsets: Vec::new(),
|
chunk_count: 0,
|
||||||
next_offset: 0,
|
next_offset: 0,
|
||||||
n_kmers: 0,
|
n_kmers: 0,
|
||||||
k: obikseq::params::k(),
|
k: obikseq::params::k(),
|
||||||
|
block_bits,
|
||||||
|
mask: (1usize << block_bits) - 1,
|
||||||
})
|
})
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Write a unitig, splitting it into chunks if it exceeds [`MAX_KMERS_PER_CHUNK`].
|
/// Write a unitig, splitting into overlapping chunks if it exceeds
|
||||||
|
/// [`MAX_KMERS_PER_CHUNK`].
|
||||||
pub fn write(&mut self, unitig: &Unitig) -> SKResult<()> {
|
pub fn write(&mut self, unitig: &Unitig) -> SKResult<()> {
|
||||||
let seql = unitig.seql();
|
let seql = unitig.seql();
|
||||||
let k = self.k;
|
let k = self.k;
|
||||||
@@ -77,17 +89,13 @@ impl UnitigFileWriter {
|
|||||||
return self.write_chunk(unitig);
|
return self.write_chunk(unitig);
|
||||||
}
|
}
|
||||||
|
|
||||||
// Split into overlapping chunks of MAX_KMERS_PER_CHUNK kmers.
|
|
||||||
// Overlap of k-1 nucleotides ensures no kmer is lost at boundaries.
|
|
||||||
let chunk_nucl = MAX_KMERS_PER_CHUNK + k - 1;
|
let chunk_nucl = MAX_KMERS_PER_CHUNK + k - 1;
|
||||||
let stride = MAX_KMERS_PER_CHUNK;
|
let stride = MAX_KMERS_PER_CHUNK;
|
||||||
let mut start = 0;
|
let mut start = 0;
|
||||||
while start < seql {
|
while start < seql {
|
||||||
let end = (start + chunk_nucl).min(seql);
|
let end = (start + chunk_nucl).min(seql);
|
||||||
self.write_chunk(&unitig.sub(start, end))?;
|
self.write_chunk(&unitig.sub(start, end))?;
|
||||||
if end == seql {
|
if end == seql { break; }
|
||||||
break;
|
|
||||||
}
|
|
||||||
start += stride;
|
start += stride;
|
||||||
}
|
}
|
||||||
Ok(())
|
Ok(())
|
||||||
@@ -97,162 +105,257 @@ impl UnitigFileWriter {
|
|||||||
let seql = unitig.seql();
|
let seql = unitig.seql();
|
||||||
let byte_len = (seql + 3) / 4;
|
let byte_len = (seql + 3) / 4;
|
||||||
|
|
||||||
// Header is 1 byte (u8: n_kmers − 1 = seql − k); packed bytes follow.
|
|
||||||
debug_assert!(seql - self.k <= u8::MAX as usize, "chunk exceeds MAX_KMERS_PER_CHUNK");
|
debug_assert!(seql - self.k <= u8::MAX as usize, "chunk exceeds MAX_KMERS_PER_CHUNK");
|
||||||
self.packed_offsets.push(self.next_offset + 1);
|
|
||||||
self.seqls.push((seql - self.k) as u8);
|
|
||||||
self.n_kmers += seql - self.k + 1;
|
|
||||||
|
|
||||||
unitig
|
if self.chunk_count & self.mask == 0 {
|
||||||
.write_to_binary(&mut self.file)
|
self.block_offsets.push(self.next_offset);
|
||||||
.map_err(SKError::Io)?;
|
}
|
||||||
|
|
||||||
|
self.n_kmers += seql - self.k + 1;
|
||||||
|
self.chunk_count += 1;
|
||||||
|
|
||||||
|
unitig.write_to_binary(&mut self.file).map_err(SKError::Io)?;
|
||||||
|
|
||||||
self.next_offset += 1 + byte_len as u32;
|
self.next_offset += 1 + byte_len as u32;
|
||||||
Ok(())
|
Ok(())
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Flush the sequence file and write the companion `.idx`.
|
/// Flush and close the binary sequence file.
|
||||||
|
///
|
||||||
|
/// The companion `.idx` file is **not** written here; call
|
||||||
|
/// [`build_unitig_idx`] separately when exact evidence is needed.
|
||||||
pub fn close(mut self) -> SKResult<()> {
|
pub fn close(mut self) -> SKResult<()> {
|
||||||
self.file.flush().map_err(SKError::Io)?;
|
self.file.flush().map_err(SKError::Io)?;
|
||||||
drop(self.file);
|
drop(self.file);
|
||||||
|
Ok(())
|
||||||
// Sentinel: byte offset past the last record's packed bytes.
|
|
||||||
let sentinel = match (self.packed_offsets.last(), self.seqls.last()) {
|
|
||||||
(Some(&last_off), Some(&last_seql)) => {
|
|
||||||
let seql = last_seql as u32 + self.k as u32;
|
|
||||||
last_off + (seql + 3) / 4
|
|
||||||
}
|
|
||||||
_ => 0,
|
|
||||||
};
|
|
||||||
self.packed_offsets.push(sentinel);
|
|
||||||
|
|
||||||
write_idx(&idx_path(&self.path), &self.seqls, &self.packed_offsets, self.n_kmers)
|
|
||||||
}
|
}
|
||||||
|
|
||||||
pub fn len(&self) -> usize {
|
pub fn len(&self) -> usize { self.chunk_count }
|
||||||
self.seqls.len()
|
pub fn is_empty(&self) -> bool { self.chunk_count == 0 }
|
||||||
}
|
pub fn block_bits(&self) -> u8 { self.block_bits }
|
||||||
|
|
||||||
pub fn is_empty(&self) -> bool {
|
|
||||||
self.seqls.is_empty()
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
|
|
||||||
fn write_idx(path: &Path, seqls: &[u8], packed_offsets: &[u32], n_kmers: usize) -> SKResult<()> {
|
fn write_idx(
|
||||||
|
path: &Path,
|
||||||
|
n_unitigs: u32,
|
||||||
|
n_kmers: u64,
|
||||||
|
block_bits: u8,
|
||||||
|
block_offsets: &[u32],
|
||||||
|
) -> SKResult<()> {
|
||||||
let mut w = BufWriter::new(File::create(path).map_err(SKError::Io)?);
|
let mut w = BufWriter::new(File::create(path).map_err(SKError::Io)?);
|
||||||
w.write_all(&MAGIC).map_err(SKError::Io)?;
|
w.write_all(&MAGIC).map_err(SKError::Io)?;
|
||||||
w.write_all(&(seqls.len() as u32).to_le_bytes()).map_err(SKError::Io)?;
|
w.write_all(&(block_bits as u32).to_le_bytes()).map_err(SKError::Io)?;
|
||||||
w.write_all(&(n_kmers as u64).to_le_bytes()).map_err(SKError::Io)?;
|
w.write_all(&n_unitigs.to_le_bytes()).map_err(SKError::Io)?;
|
||||||
w.write_all(seqls).map_err(SKError::Io)?;
|
w.write_all(&n_kmers.to_le_bytes()).map_err(SKError::Io)?;
|
||||||
for &off in packed_offsets {
|
for &off in block_offsets {
|
||||||
w.write_all(&off.to_le_bytes()).map_err(SKError::Io)?;
|
w.write_all(&off.to_le_bytes()).map_err(SKError::Io)?;
|
||||||
}
|
}
|
||||||
w.flush().map_err(SKError::Io)
|
w.flush().map_err(SKError::Io)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Scan an existing `unitigs.bin` file and write its companion `.idx`.
|
||||||
|
///
|
||||||
|
/// Called by the exact-evidence construction route after the sequence file is
|
||||||
|
/// closed. `block_bits` controls index granularity (1 << block_bits chunks per
|
||||||
|
/// offset entry); use [`DEFAULT_BLOCK_BITS`] for the default.
|
||||||
|
pub fn build_unitig_idx(unitigs_path: &Path, block_bits: u8) -> SKResult<()> {
|
||||||
|
assert!(block_bits <= 31, "block_bits must be ≤ 31");
|
||||||
|
|
||||||
|
let file = File::open(unitigs_path).map_err(SKError::Io)?;
|
||||||
|
let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? };
|
||||||
|
|
||||||
|
let k = obikseq::params::k();
|
||||||
|
let block_size = 1usize << block_bits;
|
||||||
|
let mask = block_size - 1;
|
||||||
|
|
||||||
|
let mut block_offsets: Vec<u32> = Vec::new();
|
||||||
|
let mut offset = 0usize;
|
||||||
|
let mut chunk_count = 0usize;
|
||||||
|
let mut n_kmers = 0usize;
|
||||||
|
|
||||||
|
while offset < mmap.len() {
|
||||||
|
if chunk_count & mask == 0 {
|
||||||
|
block_offsets.push(offset as u32);
|
||||||
|
}
|
||||||
|
let seql_minus_k = mmap[offset] as usize;
|
||||||
|
let byte_len = (seql_minus_k + k + 3) / 4;
|
||||||
|
n_kmers += seql_minus_k + 1;
|
||||||
|
offset += 1 + byte_len;
|
||||||
|
chunk_count += 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
block_offsets.push(offset as u32); // sentinel
|
||||||
|
|
||||||
|
write_idx(
|
||||||
|
&idx_path(unitigs_path),
|
||||||
|
chunk_count as u32,
|
||||||
|
n_kmers as u64,
|
||||||
|
block_bits,
|
||||||
|
&block_offsets,
|
||||||
|
)
|
||||||
|
}
|
||||||
|
|
||||||
// ── Reader ────────────────────────────────────────────────────────────────────
|
// ── Reader ────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
/// Read-only random-access view of a unitig file.
|
/// Read-only random-access view of a unitig file.
|
||||||
///
|
///
|
||||||
/// The sequence file is memory-mapped; the index is loaded into RAM on open.
|
/// The sequence file is memory-mapped; the block offset table is loaded into RAM
|
||||||
/// All per-kmer operations are O(1) and allocation-free.
|
/// on open. Random access to chunk `i`: O(1 << block_bits) sequential mmap
|
||||||
|
/// reads. Sequential iteration: O(n) via a running-offset cursor.
|
||||||
pub struct UnitigFileReader {
|
pub struct UnitigFileReader {
|
||||||
mmap: Mmap,
|
mmap: Mmap,
|
||||||
seqls: Vec<u8>,
|
block_offsets: Vec<u32>,
|
||||||
packed_offsets: Vec<u32>,
|
n_unitigs: usize,
|
||||||
n_kmers: usize,
|
n_kmers: usize,
|
||||||
k: usize,
|
k: usize,
|
||||||
|
block_bits: u8,
|
||||||
|
mask: usize, // (1 << block_bits) - 1
|
||||||
}
|
}
|
||||||
|
|
||||||
impl UnitigFileReader {
|
impl UnitigFileReader {
|
||||||
|
/// Open with `.idx` — enables both sequential iteration and random access.
|
||||||
pub fn open(path: &Path) -> SKResult<Self> {
|
pub fn open(path: &Path) -> SKResult<Self> {
|
||||||
let file = File::open(path).map_err(SKError::Io)?;
|
let file = File::open(path).map_err(SKError::Io)?;
|
||||||
let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? };
|
let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? };
|
||||||
let (seqls, packed_offsets, n_kmers) = read_idx(&idx_path(path))?;
|
let (n_unitigs, n_kmers, block_bits, block_offsets) = read_idx(&idx_path(path))?;
|
||||||
let k = obikseq::params::k();
|
let k = obikseq::params::k();
|
||||||
Ok(Self { mmap, seqls, packed_offsets, n_kmers, k })
|
Ok(Self {
|
||||||
|
mmap,
|
||||||
|
block_offsets,
|
||||||
|
n_unitigs,
|
||||||
|
n_kmers,
|
||||||
|
k,
|
||||||
|
block_bits,
|
||||||
|
mask: (1usize << block_bits) - 1,
|
||||||
|
})
|
||||||
}
|
}
|
||||||
|
|
||||||
pub fn len(&self) -> usize {
|
/// Open without `.idx` — sequential iteration only, no random access.
|
||||||
self.seqls.len()
|
///
|
||||||
|
/// Scans the binary file once to count chunks and k-mers. Use when only
|
||||||
|
/// [`Self::iter_kmers`], [`Self::iter_canonical_kmers`], or
|
||||||
|
/// [`Self::iter_indexed_canonical_kmers`] are needed.
|
||||||
|
pub fn open_sequential(path: &Path) -> SKResult<Self> {
|
||||||
|
let file = File::open(path).map_err(SKError::Io)?;
|
||||||
|
let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? };
|
||||||
|
let k = obikseq::params::k();
|
||||||
|
|
||||||
|
let mut offset = 0usize;
|
||||||
|
let mut n_unitigs = 0usize;
|
||||||
|
let mut n_kmers = 0usize;
|
||||||
|
while offset < mmap.len() {
|
||||||
|
let seql_minus_k = mmap[offset] as usize;
|
||||||
|
n_kmers += seql_minus_k + 1;
|
||||||
|
offset += 1 + (seql_minus_k + k + 3) / 4;
|
||||||
|
n_unitigs += 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
Ok(Self {
|
||||||
|
mmap,
|
||||||
|
block_offsets: Vec::new(), // empty → random access disabled
|
||||||
|
n_unitigs,
|
||||||
|
n_kmers,
|
||||||
|
k,
|
||||||
|
block_bits: DEFAULT_BLOCK_BITS,
|
||||||
|
mask: (1usize << DEFAULT_BLOCK_BITS) - 1,
|
||||||
|
})
|
||||||
}
|
}
|
||||||
|
|
||||||
pub fn is_empty(&self) -> bool {
|
pub fn len(&self) -> usize { self.n_unitigs }
|
||||||
self.seqls.is_empty()
|
pub fn is_empty(&self) -> bool { self.n_unitigs == 0 }
|
||||||
|
pub fn n_kmers(&self) -> usize { self.n_kmers }
|
||||||
|
pub fn block_bits(&self) -> u8 { self.block_bits }
|
||||||
|
|
||||||
|
/// Byte offset of the START of record `i` (the seql byte) in the mmap.
|
||||||
|
#[inline]
|
||||||
|
fn chunk_start(&self, i: usize) -> usize {
|
||||||
|
assert!(!self.block_offsets.is_empty(),
|
||||||
|
"random access requires UnitigFileReader::open(); use open_sequential() for iteration only");
|
||||||
|
if self.block_bits == 0 {
|
||||||
|
return self.block_offsets[i] as usize;
|
||||||
|
}
|
||||||
|
let block = i >> self.block_bits;
|
||||||
|
let rem = i & self.mask;
|
||||||
|
let mut offset = self.block_offsets[block] as usize;
|
||||||
|
for _ in 0..rem {
|
||||||
|
let seql_minus_k = self.mmap[offset] as usize;
|
||||||
|
offset += 1 + (seql_minus_k + self.k + 3) / 4;
|
||||||
|
}
|
||||||
|
offset
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Total number of kmers across all chunks.
|
/// Nucleotide length of chunk `i`.
|
||||||
pub fn n_kmers(&self) -> usize {
|
|
||||||
self.n_kmers
|
|
||||||
}
|
|
||||||
|
|
||||||
/// Return the nucleotide length of chunk `i`.
|
|
||||||
#[inline]
|
#[inline]
|
||||||
pub fn seql(&self, i: usize) -> usize {
|
pub fn seql(&self, i: usize) -> usize {
|
||||||
self.seqls[i] as usize + self.k
|
self.mmap[self.chunk_start(i)] as usize + self.k
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Reconstruct chunk `i` as a [`Unitig`]. Allocates a copy of the packed bytes.
|
/// Reconstruct chunk `i` as a [`Unitig`].
|
||||||
pub fn unitig(&self, i: usize) -> Unitig {
|
pub fn unitig(&self, i: usize) -> Unitig {
|
||||||
let seql = self.seqls[i] as usize + self.k;
|
let offset = self.chunk_start(i);
|
||||||
let start = self.packed_offsets[i] as usize;
|
let seql = self.mmap[offset] as usize + self.k;
|
||||||
let byte_len = (seql + 3) / 4;
|
let byte_len = (seql + 3) / 4;
|
||||||
let tail = (seql % 4) as u8;
|
let bytes = self.mmap[offset + 1..offset + 1 + byte_len].to_vec().into_boxed_slice();
|
||||||
let bytes = self.mmap[start..start + byte_len].to_vec().into_boxed_slice();
|
Unitig::new((seql % 4) as u8, bytes)
|
||||||
Unitig::new(tail, bytes)
|
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Extract the raw left-aligned u64 of the kmer at position `j` within chunk `i`.
|
/// Raw left-aligned u64 of the k-mer at position `j` within chunk `i`.
|
||||||
#[inline]
|
#[inline]
|
||||||
pub fn raw_kmer(&self, i: usize, j: usize) -> u64 {
|
pub fn raw_kmer(&self, i: usize, j: usize) -> u64 {
|
||||||
let start = self.packed_offsets[i] as usize;
|
let offset = self.chunk_start(i);
|
||||||
extract_kmer_raw(&self.mmap[start..], j, self.k)
|
extract_kmer_raw(&self.mmap[offset + 1..], j, self.k)
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Return `true` iff the kmer at position `j` of chunk `i` equals `query`.
|
/// `true` iff the k-mer at position `j` of chunk `i` equals `query` (canonical).
|
||||||
///
|
|
||||||
/// O(1), zero allocation. The chunk may store either orientation of the kmer;
|
|
||||||
/// canonicalization is applied before comparison.
|
|
||||||
#[inline]
|
#[inline]
|
||||||
pub fn verify_canonical_kmer(&self, i: usize, j: usize, query: CanonicalKmer) -> bool {
|
pub fn verify_canonical_kmer(&self, i: usize, j: usize, query: CanonicalKmer) -> bool {
|
||||||
canonical_raw(self.raw_kmer(i, j), self.k) == query.raw()
|
canonical_raw(self.raw_kmer(i, j), self.k) == query.raw()
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Iterate over all kmers in file order (all positions of chunk 0, then chunk 1, …).
|
// ── Sequential iterators (O(n) running-offset cursor) ─────────────────────
|
||||||
///
|
|
||||||
/// Each chunk is copied from the mmap once; iteration within the chunk is
|
fn iter_chunks_sequential(&self) -> impl Iterator<Item = (usize, Unitig)> + '_ {
|
||||||
/// zero-allocation (sliding-window via [`OwnedPackedSeqKmerIter`]).
|
let k = self.k;
|
||||||
|
let mmap = &*self.mmap;
|
||||||
|
let n = self.n_unitigs;
|
||||||
|
let mut offset = 0usize;
|
||||||
|
(0..n).map(move |chunk_id| {
|
||||||
|
let seql = mmap[offset] as usize + k;
|
||||||
|
let byte_len = (seql + 3) / 4;
|
||||||
|
let bytes = mmap[offset + 1..offset + 1 + byte_len].to_vec().into_boxed_slice();
|
||||||
|
offset += 1 + byte_len;
|
||||||
|
(chunk_id, Unitig::new((seql % 4) as u8, bytes))
|
||||||
|
})
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Iterate all unitigs sequentially. Works without `.idx` (sequential open).
|
||||||
|
pub fn iter_unitigs(&self) -> impl Iterator<Item = (usize, Unitig)> + '_ {
|
||||||
|
self.iter_chunks_sequential()
|
||||||
|
}
|
||||||
|
|
||||||
pub fn iter_kmers(&self) -> impl Iterator<Item = Kmer> + '_ {
|
pub fn iter_kmers(&self) -> impl Iterator<Item = Kmer> + '_ {
|
||||||
(0..self.len()).flat_map(move |i| self.unitig(i).into_kmers())
|
self.iter_chunks_sequential()
|
||||||
|
.flat_map(|(_, u)| u.into_kmers())
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Iterate over all canonical kmers in file order.
|
|
||||||
///
|
|
||||||
/// Equivalent to `iter_kmers().map(|km| km.canonical())` but uses the
|
|
||||||
/// built-in canonical iterator on each chunk, which avoids a separate
|
|
||||||
/// canonicalization pass.
|
|
||||||
pub fn iter_canonical_kmers(&self) -> impl Iterator<Item = CanonicalKmer> + '_ {
|
pub fn iter_canonical_kmers(&self) -> impl Iterator<Item = CanonicalKmer> + '_ {
|
||||||
(0..self.len()).flat_map(move |i| self.unitig(i).into_canonical_kmers())
|
self.iter_chunks_sequential()
|
||||||
|
.flat_map(|(_, u)| u.into_canonical_kmers())
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Iterate over `(kmer, chunk_id, rank)` for every canonical kmer in the file.
|
|
||||||
///
|
|
||||||
/// `chunk_id` is the index of the chunk within this file; `rank` is the
|
|
||||||
/// 0-based position of the kmer within that chunk. Used to build the
|
|
||||||
/// evidence table in `obilayeredmap`.
|
|
||||||
pub fn iter_indexed_canonical_kmers(
|
pub fn iter_indexed_canonical_kmers(
|
||||||
&self,
|
&self,
|
||||||
) -> impl Iterator<Item = (CanonicalKmer, usize, usize)> + '_ {
|
) -> impl Iterator<Item = (CanonicalKmer, usize, usize)> + '_ {
|
||||||
(0..self.len()).flat_map(move |chunk_id| {
|
self.iter_chunks_sequential()
|
||||||
self.unitig(chunk_id)
|
.flat_map(|(chunk_id, u)| {
|
||||||
.into_canonical_kmers()
|
u.into_canonical_kmers()
|
||||||
.enumerate()
|
.enumerate()
|
||||||
.map(move |(rank, kmer)| (kmer, chunk_id, rank))
|
.map(move |(rank, kmer)| (kmer, chunk_id, rank))
|
||||||
})
|
})
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
fn read_idx(path: &Path) -> SKResult<(Vec<u8>, Vec<u32>, usize)> {
|
fn read_idx(path: &Path) -> SKResult<(usize, usize, u8, Vec<u32>)> {
|
||||||
let data = std::fs::read(path).map_err(SKError::Io)?;
|
let data = std::fs::read(path).map_err(SKError::Io)?;
|
||||||
let mut pos = 0;
|
let mut pos = 0;
|
||||||
|
|
||||||
@@ -260,15 +363,27 @@ fn read_idx(path: &Path) -> SKResult<(Vec<u8>, Vec<u32>, usize)> {
|
|||||||
.ok_or(SKError::Truncated { context: "unitig index: magic" })?;
|
.ok_or(SKError::Truncated { context: "unitig index: magic" })?;
|
||||||
if magic_bytes != &MAGIC {
|
if magic_bytes != &MAGIC {
|
||||||
return Err(SKError::BadMagic {
|
return Err(SKError::BadMagic {
|
||||||
expected: "UIDX",
|
expected: "UIX3",
|
||||||
got: magic_bytes.try_into().unwrap(),
|
got: magic_bytes.try_into().unwrap(),
|
||||||
});
|
});
|
||||||
}
|
}
|
||||||
pos += 4;
|
pos += 4;
|
||||||
|
|
||||||
|
let bb_bytes = data.get(pos..pos + 4)
|
||||||
|
.ok_or(SKError::Truncated { context: "unitig index: block_bits" })?;
|
||||||
|
let block_bits_u32 = u32::from_le_bytes(bb_bytes.try_into().unwrap());
|
||||||
|
if block_bits_u32 > 31 {
|
||||||
|
return Err(SKError::InvalidData {
|
||||||
|
context: "unitig index",
|
||||||
|
detail: format!("block_bits out of range: {block_bits_u32}"),
|
||||||
|
});
|
||||||
|
}
|
||||||
|
let block_bits = block_bits_u32 as u8;
|
||||||
|
pos += 4;
|
||||||
|
|
||||||
let n_bytes = data.get(pos..pos + 4)
|
let n_bytes = data.get(pos..pos + 4)
|
||||||
.ok_or(SKError::Truncated { context: "unitig index: n_unitigs" })?;
|
.ok_or(SKError::Truncated { context: "unitig index: n_unitigs" })?;
|
||||||
let n = u32::from_le_bytes(n_bytes.try_into().unwrap()) as usize;
|
let n_unitigs = u32::from_le_bytes(n_bytes.try_into().unwrap()) as usize;
|
||||||
pos += 4;
|
pos += 4;
|
||||||
|
|
||||||
let nk_bytes = data.get(pos..pos + 8)
|
let nk_bytes = data.get(pos..pos + 8)
|
||||||
@@ -276,25 +391,22 @@ fn read_idx(path: &Path) -> SKResult<(Vec<u8>, Vec<u32>, usize)> {
|
|||||||
let n_kmers = u64::from_le_bytes(nk_bytes.try_into().unwrap()) as usize;
|
let n_kmers = u64::from_le_bytes(nk_bytes.try_into().unwrap()) as usize;
|
||||||
pos += 8;
|
pos += 8;
|
||||||
|
|
||||||
let seqls = data.get(pos..pos + n)
|
let block_size = 1usize << block_bits;
|
||||||
.ok_or(SKError::Truncated { context: "unitig index: seqls" })?
|
let n_blocks = (n_unitigs + block_size - 1) >> block_bits;
|
||||||
.to_vec();
|
let n_offsets = n_blocks + 1;
|
||||||
pos += n;
|
let mut block_offsets = Vec::with_capacity(n_offsets);
|
||||||
|
for _ in 0..n_offsets {
|
||||||
let mut packed_offsets = Vec::with_capacity(n + 1);
|
|
||||||
for _ in 0..=n {
|
|
||||||
let off_bytes = data.get(pos..pos + 4)
|
let off_bytes = data.get(pos..pos + 4)
|
||||||
.ok_or(SKError::Truncated { context: "unitig index: packed_offsets" })?;
|
.ok_or(SKError::Truncated { context: "unitig index: block_offsets" })?;
|
||||||
packed_offsets.push(u32::from_le_bytes(off_bytes.try_into().unwrap()));
|
block_offsets.push(u32::from_le_bytes(off_bytes.try_into().unwrap()));
|
||||||
pos += 4;
|
pos += 4;
|
||||||
}
|
}
|
||||||
|
|
||||||
Ok((seqls, packed_offsets, n_kmers))
|
Ok((n_unitigs, n_kmers, block_bits, block_offsets))
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── Kmer utilities ────────────────────────────────────────────────────────────
|
// ── Kmer utilities ────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
/// Reverse complement of a left-aligned 2-bit kmer (same algorithm as [`KmerOf::revcomp`]).
|
|
||||||
#[inline]
|
#[inline]
|
||||||
fn revcomp_raw(raw: u64, k: usize) -> u64 {
|
fn revcomp_raw(raw: u64, k: usize) -> u64 {
|
||||||
let x = !raw;
|
let x = !raw;
|
||||||
@@ -304,22 +416,17 @@ fn revcomp_raw(raw: u64, k: usize) -> u64 {
|
|||||||
x << (64 - 2 * k)
|
x << (64 - 2 * k)
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Canonical form of a left-aligned 2-bit kmer: `min(kmer, revcomp(kmer))`.
|
|
||||||
#[inline]
|
#[inline]
|
||||||
fn canonical_raw(raw: u64, k: usize) -> u64 {
|
fn canonical_raw(raw: u64, k: usize) -> u64 {
|
||||||
raw.min(revcomp_raw(raw, k))
|
raw.min(revcomp_raw(raw, k))
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── Bit extraction ────────────────────────────────────────────────────────────
|
|
||||||
|
|
||||||
/// Extract the kmer at nucleotide position `j` from MSB-first 2-bit packed `bytes`.
|
|
||||||
/// Returns a left-aligned u64 matching [`KmerOf`]'s internal representation.
|
|
||||||
#[inline]
|
#[inline]
|
||||||
fn extract_kmer_raw(bytes: &[u8], j: usize, k: usize) -> u64 {
|
fn extract_kmer_raw(bytes: &[u8], j: usize, k: usize) -> u64 {
|
||||||
let bit_start = j * 2;
|
let bit_start = j * 2;
|
||||||
let byte_start = bit_start / 8;
|
let byte_start = bit_start / 8;
|
||||||
let bit_offset = bit_start % 8; // always 0, 2, 4, or 6
|
let bit_offset = bit_start % 8;
|
||||||
let bytes_needed = (bit_offset + 2 * k + 7) / 8; // ≤ 9 for k ≤ 32
|
let bytes_needed = (bit_offset + 2 * k + 7) / 8;
|
||||||
|
|
||||||
let mut acc = 0u128;
|
let mut acc = 0u128;
|
||||||
for idx in 0..bytes_needed {
|
for idx in 0..bytes_needed {
|
||||||
@@ -327,8 +434,8 @@ fn extract_kmer_raw(bytes: &[u8], j: usize, k: usize) -> u64 {
|
|||||||
}
|
}
|
||||||
|
|
||||||
let shift = bytes_needed * 8 - bit_offset - 2 * k;
|
let shift = bytes_needed * 8 - bit_offset - 2 * k;
|
||||||
let mask = !0u64 >> (64 - 2 * k);
|
let mask = !0u64 >> (64 - 2 * k);
|
||||||
let raw = (acc >> shift) as u64 & mask;
|
let raw = (acc >> shift) as u64 & mask;
|
||||||
raw << (64 - 2 * k)
|
raw << (64 - 2 * k)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
Reference in New Issue
Block a user