FORT RISQUE DE DÉRIVE. Nombreux changements récents :
+- Ajout de EvidenceKind (Exact / Approx { b, z }) dans IndexConfig et LayerMeta
+- Ajout de block_bits dans IndexConfig
+- LayerEvidence enum dans mphf_layer.rs remplace l'ancienne approche monolithique
+- Distinction open() vs open_sequential() dans UnitigFileReader
+- Commandes reindex et estimate ajoutées
+Vérifier que la hiérarchie à 3 niveaux décrite est toujours exacte et que les nouveaux paramètres sont documentés.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/architecture/index_architecture/index.html b/doc/architecture/index_architecture/index.html
index 24d05a5..5565517 100644
--- a/doc/architecture/index_architecture/index.html
+++ b/doc/architecture/index_architecture/index.html
@@ -645,6 +645,34 @@
+
A given canonical kmer belongs to exactly one partition and exactly one layer within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.
+
A given canonical kmer belongs to exactly one partition and exactly one layer within that partition. This property makes all aggregation operations decomposable and parallelisable without coordination.
Three-level hierarchy
-
PartitionedIndex
-├── LayeredPartition (one per minimiser bucket)
-│ ├── MphfLayer 0 kmer → slot (immutable bijection)
-│ │ ├── DataStore A slot → T (e.g. counts)
-│ │ └── DataStore B slot → T (e.g. presence/absence, derived)
-│ ├── MphfLayer 1
-│ │ └── DataStore A
-│ └── ...
-├── LayeredPartition
-│ └── ...
+
PartitionedIndex: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.
-
LayeredPartition: one directory per minimiser bucket. Holds a Vec<MphfLayer>. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.
-
MphfLayer: the MPHF + evidence + unitig spine. Maps kmer → slot for its disjoint kmer set. Immutable once built. Independent of any data attached to it.
-
DataStore: a slot-indexed data array (e.g. PersistentCompactIntMatrix, PersistentBitMatrix). Attached to a MphfLayer externally. Multiple stores of different types can coexist on the same MphfLayer.
+
KmerIndex: root entry point. Owns IndexMeta (written to index.meta) and a KmerPartition that routes canonical kmers to partition directories. All partition-level operations are dispatched in parallel via rayon.
+
Partition directory: one directory per minimiser bucket. PartitionMeta (stored as meta.json) records n_layers. Layers within a partition cover disjoint kmer sets.
+
Layer directory: one MphfLayer plus optional data stores. LayerMeta (stored as layer_meta.json) records which EvidenceKind was used. The MPHF and unitigs.bin are immutable once built; evidence files are the only part replaced by reindex.
-
MphfLayer — autonomous mapping layer
-
MphfLayer::find(kmer:CanonicalKmer)->Option<usize>// slot, or None if absent
-MphfLayer::n()->usize// number of slots
-MphfLayer::build(dir:&Path)->OLMResult<(Self,usize)>// from unitigs.bin
-MphfLayer::open(dir:&Path)->OLMResult<Self>
-
-
find returns Some(slot) only if the kmer is actually in this layer (evidence check included). Returns None for kmers present in other layers or absent from the index.
-
The MPHF (mphf.bin, evidence.bin, unitigs.bin) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same MphfLayer.
pubstructIndexConfig{
+pubkmer_size:usize,
+pubminimizer_size:usize,
+pubn_bits:usize,// log2(n_partitions)
+pubwith_counts:bool,
+pubevidence:EvidenceKind,
+pubblock_bits:u8,// .idx granularity: 2^block_bits unitigs/block; 0 = one entry per unitig
+}
+
+pubstructIndexMeta{
+pubversion:u32,
+pubconfig:IndexConfig,
+pubgenomes:Vec<GenomeInfo>,// ordered; index = genome column number
+}
+
+pubstructGenomeInfo{
+publabel:String,
+pubmeta:HashMap<String,String>,// arbitrary categorical metadata}
-
Concrete types from obicompactvec:
+
IndexMeta is serialised as index.meta (JSON). It is the authority for the ordered list of genomes and for the parameters that govern all subsequent operations on the index.
Controls which files are written per layer and which query path is taken:
+
+
+
+
Variant
+
Files written
+
False-positive rate
+
+
+
+
+
Exact
+
evidence.bin, unitigs.bin.idx
+
0
+
+
+
Approx { b, z }
+
fingerprint.bin
+
≈ W / 2^(b·z) per read (Findere)
+
+
+
+
EvidenceKind is stored both in IndexConfig (index-wide default, updated by reindex) and in each LayerMeta (per-layer record of what was actually built).
MphfLayer::find(kmer) dispatches transparently to find_exact or find_approx based on the evidence loaded at open time (read from layer_meta.json). Returns Some(slot) only if the kmer is confirmed present; None for absent or out-of-range.
block_bits controls the .idx file written alongside evidence.bin. At block_bits = 0, every unitig chunk has an index entry, giving O(1) random access; larger values trade access time for a smaller .idx.
+
The MPHF and unitigs.bin are never rebuilt by any post-build operation.
Layer::query(kmer) delegates to MphfLayer::find, then calls data.read(slot) if a slot is returned. Both exact and approximate evidence are handled transparently; the caller sees only Option<Hit<D::Item>>.
+
Build-time entry points:
+
Layer<()>::build(out_dir,block_bits)// set membership
+Layer<PersistentCompactIntMatrix>::build(out_dir,block_bits,count_of)
+Layer<PersistentBitMatrix>::build_presence(out_dir,block_bits,n_genomes,present_in)
+
+Layer::<()>::build_evidence(layer_dir,kind,block_bits)// evidence only (reindex path)
+
+
+
DataStore — slot-indexed data
+
PersistentCompactIntMatrix and PersistentBitMatrix are slot-indexed stores. They know nothing about kmers or MPHFs.
Type
Item
-
Column stats
+
Aggregation method
Use
@@ -1471,120 +1552,46 @@
PersistentCompactIntMatrix
Box<[u32]>
-
sum() -> Array1<u64>
-
count per sample per slot
+
sum() → Array1<u64>
+
counts per genome per slot
PersistentBitMatrix
Box<[bool]>
-
count_ones() -> Array1<u64>
-
presence per sample per slot
+
count_ones() → Array1<u64>
+
presence per genome per slot
-
sum() and count_ones() are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.
-
A DataStore knows nothing about kmers or MPHFs. It is indexed by usize slot only.
-
Distance matrix API on DataStore types
-
Both PersistentCompactIntMatrix and PersistentBitMatrix expose two families of distance matrix methods.
-
Full distance matrices
-
Compute the final n_cols × n_cols distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.
These are convenience methods. For a LayeredDataStore or PartitionedDataStore they cannot be used directly — the partial API is required.
-
Partial distance matrices
-
Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.
-
Category 1 — self-contained partials: additive without any external parameter.
The col_sums parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below.
-
-
Progressive aggregation principle
-
Aggregation is hierarchical: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.
-
PersistentCompactIntMatrix::col_weights() — column sums for one (partition, layer) matrix
- ↓ Σ across layers
-LayeredStore<PersistentCompactIntMatrix>::col_weights() — column sums for one partition
- ↓ Σ across partitions
-LayeredStore<LayeredStore<…>>::col_weights() — global column sums
-
-
The same cascade applies to every partial:
-
PersistentCompactIntMatrix::partial_bray() — one (partition, layer)
- ↓ element-wise Σ across layers
-LayeredStore<PersistentCompactIntMatrix>::partial_bray() — one partition
- ↓ element-wise Σ across partitions
-LayeredStore<LayeredStore<…>>::partial_bray() — global partial → final dist
-
-
Each level presents a stable trait surface to the level above; no level reaches two levels down.
-
-
Traits — obicompactvec::traits
-
Three traits unify the aggregation API across all levels of the hierarchy.
+
Aggregation traits — obicompactvec::traits
+
Three traits unify the aggregation API across all hierarchy levels.
PersistentCompactIntVec and PersistentBitVec do not implement these traits — they are single-column primitives, not matrix-level aggregators.
-
LayeredStore<S> — obilayeredmap
-
A single generic wrapper replaces the need for named LayeredDataStore and PartitionedDataStore types:
+
LayeredStore\<S> — recursive aggregation wrapper
pubstructLayeredStore<S>(Vec<S>);
-
Three blanket impls propagate the traits up the hierarchy:
-
impl<S:ColumnWeights>ColumnWeightsforLayeredStore<S>{…}// Σ across inner stores
-impl<S:CountPartials>CountPartialsforLayeredStore<S>{…}// same pattern
-impl<S:BitPartials>BitPartialsforLayeredStore<S>{…}// same pattern
+
Three blanket impls propagate all traits up the hierarchy:
Because the blanket impl is recursive, LayeredStore<LayeredStore<S>> automatically inherits all three traits when S does — no separate PartitionedStore type is needed:
-
PersistentCompactIntMatrix implements CountPartials
-LayeredStore<PersistentCompactIntMatrix> via blanket impl (= one partition)
-LayeredStore<LayeredStore<…>> via blanket impl (= partitioned index)
+
This makes LayeredStore<LayeredStore<PersistentCompactIntMatrix>> automatically implement CountPartials — no separate PartitionedStore type is needed:
+
PersistentCompactIntMatrix leaf (one layer)
+LayeredStore<PersistentCompactIntMatrix> one partition (layers are disjoint)
+LayeredStore<LayeredStore<…>> whole index (partitions are independent)
-
Normalised metrics — two-pass cascade
-
The normalised finalisation methods call col_weights() first (pass 1), then the normalised partial (pass 2). Both calls go through the same blanket impl, so the cascade is automatic:
-
// called on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>
+
Normalised metrics require global column sums — computed in a two-pass cascade:
+
// on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>fnrelfreq_bray_dist_matrix(&self)->Array2<f64>{
-letglobal=self.col_weights();// pass 1 — progressive sum at every level
-letp=self.partial_relfreq_bray(&global);// pass 2 — global passed in cascade
-p.mapv(|v|1.0-v)// finalise (diagonal zeroed separately)
+letglobal=self.col_weights();// pass 1 — sums up hierarchy
+letp=self.partial_relfreq_bray(&global);// pass 2 — global broadcast read-only
+p.mapv(|v|1.0-v)}
-
global is exact: each kmer belongs to exactly one (partition, layer) pair, so there is no double-counting across the hierarchy.
+
Because each kmer belongs to exactly one (partition, layer) pair, col_weights() has no double-counting across the hierarchy.
+
+
Progressive aggregation principle
+
No level reaches two levels down. Each level sums contributions from the level immediately below:
+
PersistentCompactIntMatrix::col_weights() — one (partition, layer)
+ ↓ Σ across layers
+LayeredStore<PersistentCompactIntMatrix>::col_weights() — one partition
+ ↓ Σ across partitions
+LayeredStore<LayeredStore<…>>::col_weights() — global
+
+
The same cascade applies to every partial method.
+
+
Multi-genome column invariant
+
After any merge, every layer in every partition has exactly n_genomes columns, where n_genomes is the current total in index.meta. This holds for both PersistentCompactIntMatrix and PersistentBitMatrix.
+
Maintained by three coordinated operations:
+
Existing layers — column append.Layer::append_genome_column appends one column to each existing layer. Slots matching the incoming genome receive its count or true; all other slots receive 0 or false.
+
New layers — absent columns prepended. When a new layer is created for kmers unique to the incoming genome, n_existing_genomes absent columns are prepended before the incoming genome's column, so the new layer immediately has the same column count as all other layers.
+
First merge, Presence mode — init_presence_matrix. The initial single-genome index has no presence/ directory (presence is implicit). On the first merge, Layer<()>::init_presence_matrix materialises genome 0's presence column (all true) retroactively, raising the column count from 0 to 1 before appending column 1.
+
This invariant is the precondition for correct progressive aggregation: every level can blindly sum matrices from below because all matrices have the same shape.
+
+
Query model
+
Point query
+
minimiser(kmer) → partition p
+for each layer l in p:
+ if let Some(slot) = MphfLayer_l.find(kmer):
+ return data_l.read(slot)
+return None
+
+
O(n_layers) MPHF probes worst case; O(1) expected. The result comes from exactly one (partition, layer).
+
Aggregation
+
result = reduce(
+ for p in partitions: // parallel
+ for l in layers(p): // parallel
+ partial(data_p_l)
+)
+
+
For normalised metrics, replace with the two-pass cascade.
Parallelism model
@@ -1642,12 +1682,12 @@ LayeredStore<LayeredStore<…>> via blanket impl (= par
Across partitions
-
LayeredStore<LayeredStore<S>> inner stores
-
none — fully independent
+
inner stores of LayeredStore<LayeredStore<S>>
+
none
Across layers within a partition
-
LayeredStore<S> inner stores
+
inner stores of LayeredStore<S>
none — disjoint kmer sets
@@ -1667,55 +1707,17 @@ LayeredStore<LayeredStore<…>> via blanket impl (= par
-
All levels use rayon par_iter internally; reduce_with performs a parallel tree reduction.
-
Query model
-
Point query — kmer → Option<Item>
-
minimiser(kmer) → partition p
-for each layer l in p:
- slot = MphfLayer_l.find(kmer)
- if slot is Some:
- return DataStore_l.get(slot)
-return None
-
-
O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer).
-
Aggregation — → Result
-
result = reduce(
- for p in partitions: // parallel
- for l in layers(p): // parallel
- partial(DataStore_p_l)
-)
-
-
For normalised metrics replace with the two-pass scheme above.
-
-
DataStore derivation
-
Because the MphfLayer is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:
-
// count → presence/absence, parallel across (partition, layer)
-for (p, l) in all_partition_layer_pairs().par_iter():
- count_store = open PersistentCompactIntMatrix at (p, l)
- presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
-
-
Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one (partition, layer) pair.
-
-
Relationship to current implementation
-
What is implemented
+
reindex — evidence conversion in place
+
KmerIndex::reindex(target, block_bits) converts every layer's evidence bundle to target without touching the MPHF or unitigs.bin:
-
obicompactvec::traits: ColumnWeights, CountPartials, BitPartials are defined and implemented on PersistentCompactIntMatrix and PersistentBitMatrix.
-
obilayeredmap::LayeredStore<S>: generic wrapper with blanket impls for all three traits. LayeredStore<LayeredStore<S>> is the partitioned level — no separate type needed. Tests confirm that splitting data across layers and across partitions gives the same distance matrices as computing on flat combined data.
Layer<D: LayerData> still fuses MphfLayer and one DataStore. Multiple data stores on the same MPHF are not supported.
-
LayeredMap is a single-partition structure without distance matrix API; it does not yet use LayeredStore.
-
No PartitionedIndex type for point queries with parallel partition dispatch.
-
-
Planned refactoring
-
-
Extract MphfLayer from Layer<D> as an autonomous type.
-
Replace LayerData trait with the DataStore / ColumnWeights / CountPartials / BitPartials system.
-
Rewire LayeredMap to hold LayeredStore<PersistentCompactIntMatrix> (or bit variant) alongside the MPHF layers.
-
Implement PartitionedIndex using LayeredStore<LayeredStore<S>> for data and parallel dispatch for queries.
-
+
On success, IndexConfig::evidence and IndexConfig::block_bits are updated in index.meta. Each layer's layer_meta.json is also rewritten with the new EvidenceKind.
+
+
estimate — parameter dry-run
+
estimate resolves approximate-evidence parameters (z, b, target FP rate) and prints the resulting effective kmer size and per-kmer / per-z-window false-positive rates without touching any index. Used to calibrate Approx { b, z } before building or reindexing.
obikmer/src/cmd/query.rs — commande query, format de sortie
+
obikpartitionner/src/query_layer.rs — routage de la requête à travers les partitions
+
obiread/src/lib.rs — lecture des séquences d'entrée pour la requête
+
+
Notes
+
RISQUE DE DÉRIVE. Vérifier :
+- La commande unitig a été modifiée pour utiliser open_sequential() — vérifier si query est concerné
+- find_exact / find_approx / find générique ont été ajoutés dans MphfLayer — le chemin de requête a changé
+- Si l'index est approximatif (Approx), la requête peut produire des faux positifs : la doc le mentionne-t-elle ?
+- Format de sortie CSV (obikindex/src/csv.rs ou équivalent) à vérifier
Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.
+
+
Input
+
+
Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported). GenBank flat files are not supported at query time (only at index time).
+
Sequences shorter than k bases are silently skipped.
+
Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).
+
+
+
Algorithm
+
The query follows the same superkmer-based partitioning strategy used at indexing time.
+
for each chunk of sequences (parallel workers via obipipeline):
+ build QueryBatch: decompose all sequences into s-mers via superkmers, deduplicate
+ allocate seq_results[seq_idx][smer_pos] = None ← per-sequence s-mer result vectors
+ split superkmers by partition via minimiser hash
+ for each partition p:
+ query_partition(p, superkmers_routed_to_p)
+ → load QueryLayer(s) for p
+ → for each s-mer in each superkmer: MphfLayer::find(smer)
+ fill seq_results[seq_idx][kmer_offset + j] from partition results
+ for each sequence:
+ apply_findere(seq_results[seq_idx], effective_z) ← per full sequence
+ accumulate confirmed k-mer results into acc and cov
+ emit annotated sequences
+
+
Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique RoutableSuperKmer is queried once per partition, and the result is broadcast to every SKDesc entry that references it.
+
Findere requires full-sequence aggregation.apply_findere is applied once per sequence on the complete s-mer result vector, after all partitions have contributed. Applying it per superkmer would produce false negatives at superkmer boundaries, where the z-window spans two superkmers.
+
Batches are processed in parallel via obipipeline workers; the --threads flag controls the number of worker threads.
+
+
Findere z-window filter
+
For approximate index modes, the index physically stores s-mers of size s = k_user − z + 1. At query time, set_k(s) is in effect, so queries naturally produce s-mer results. apply_findere then aggregates z consecutive s-mer results into one k_user-mer answer:
+
fnapply_findere(
+results:&[Option<Box<[u32]>>],// N s-mer results
+z:usize,
+n_genomes:usize,
+)->Vec<Option<Box<[u32]>>>// N − z + 1 k_user-mer results
+
+
Input length N (s-mers), output length N − z + 1 (k_user-mers).
+
For each genome g independently, a sliding window of size z scans the input. Output position i is confirmed for genome g iff all z values results[i..i+z][g] are nonzero (None counts as zero for all genomes). The scan is O(n) per genome.
+
Output values come from results[i] (leftmost s-mer of each window); genomes not confirmed are zeroed. If all genomes are zero, the position is returned as None.
+
Short sequences: when the s-mer count is less than z, no complete window can form — apply_findere returns an empty vector. K-mers from sequences shorter than k_user are not emitted.
+
Exact indexes: z = 1, apply_findere is a passthrough (output length = input length).
The -z CLI option overrides the index metadata value. A higher z increases stringency (lower FP, some true positives may be discarded at sequence ends); a lower z increases sensitivity.
+
+
Layer lookup: MphfLayer::find
+
MphfLayer::open(dir, mode: &IndexMode) receives the mode from PartitionMeta — no per-layer file is read. The caller (QueryLayer) never chooses the dispatch path: it is fixed at open time by LayerEvidence. See obilayeredmap for the full find / find_strict API.
+
QueryLayer variant selection
+
QueryLayer::open in query_layer.rs selects the data matrix to pair with MphfLayer:
+
+
+
+
Condition
+
Variant
+
Data returned per k-mer
+
+
+
+
+
with_counts=true and counts/ exists
+
Count
+
raw count per genome
+
+
+
presence/ exists
+
Presence
+
0/1 per genome (bit matrix)
+
+
+
only counts/ exists
+
Count
+
counts used as-is
+
+
+
neither exists
+
SetOnly
+
1 for every genome
+
+
+
+
+
Presence / count mode at query time
+
The --force-presence flag and --presence-threshold control how per-genome values are accumulated, independently of what the index stores:
+
genome_totals[g] += if presence { u32::from(v >= threshold) } else { v }
+
+
presence is true when --force-presence is set or when the index has no counts (!with_counts). The default presence_threshold is 1, so any nonzero count counts as a match.
+
+
Coverage vectors (--detail)
+
When --detail is requested, a 3-D accumulator cov[seq_idx][genome][kmer_pos] is allocated after all partitions are queried, with dimensions derived from n_kmers_out = n_smers − z + 1 (k_user-mer positions, not s-mer positions):
+
cov[seq_idx][g][pos] += contribution
+where pos is the k_user-mer index in the filtered (post-Findere) vector
+
+
Coverage reflects confirmed k_user-mers only. The vectors are emitted in the JSON annotation under the key "coverage".
+
+
kmer_missing semantics
+
kmer_missing counts k_user-mer positions where the first s-mer (seq_results[seq_idx][pos]) is None — i.e. absent from the index entirely. K-mers where the z-window fails because a later s-mer is absent or zero are not counted as missing (the first s-mer being present is used as proxy for index membership).
+
+
Output format
+
Output sequences are written in OBITools4 format: the original sequence with a JSON annotation map in the title line.
Genome keys follow the iteration order of meta.genomes.
+
+
Annotation schema
+
+
+
+
Key
+
Type
+
Condition
+
Semantics
+
+
+
+
+
kmer_count
+
int
+
always
+
k-mers confirmed (post-Findere) with at least one genome match
+
+
+
kmer_missing
+
int
+
--count-missing
+
k-mers absent from the index entirely (pre-Findere None)
+
+
+
kmer_strict_matches
+
object
+
always
+
per-genome accumulated value (label → count or 0/1)
+
+
+
coverage
+
object
+
--detail
+
per-genome array of per-position contributions (label → [u32])
+
+
+
+
kmer_count + kmer_missing ≤ total k_user-mers in the sequence. The gap corresponds to k_user-mers whose z-window was not fully confirmed (at least one s-mer absent or zero for all genomes) but whose first s-mer was present in the index.
obiread/src/chunk.rs — SeqChunkIter, détection de frontières FASTA/FASTQ, state machines
+
obikrope/src/lib.rs — type Rope (Vec), opérations zero-copy
+
+
Notes
+
Document stable (la stratégie de chunking rope ne devrait pas avoir changé).
+Vérifier que le split FASTA/FASTQ reste correct si de nouveaux formats ont été ajoutés.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/implementation/chunkreader/index.html b/doc/implementation/chunkreader/index.html
index b638090..9d23527 100644
--- a/doc/implementation/chunkreader/index.html
+++ b/doc/implementation/chunkreader/index.html
@@ -243,19 +243,28 @@
The obiread crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.
-
Output type: rope
-
Each chunk is a Vec<Bytes> — a rope: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.
-
Using bytes::Bytes means the split at the record boundary is O(1): Bytes::split_to(n) adjusts a reference counter, not memory. No memcpy in the common case.
-
Allocation policy
+
obiread exposes two distinct sequence reading paths, each optimised for a different use case.
one allocation to pack the rope into a flat buffer
-
-
-
EOF flush
-
zero extra allocation
+
Stream path
+
open_nuc_stream
+
NucPage (flat normalised byte buffer)
+
no
+
index, superkmer — bulk throughput
+
The record path uses Rope-backed chunks and is described in detail below.
+The stream path (NucStream / NucPage) is described in the scatter section of pipeline.
+
+
Record path: chunk reader
+
The chunk reader reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. parse_chunk then converts each chunk into a Vec<SeqRecord>, where each record carries its identifier, raw sequence bytes, and a normalised rope ready for superkmer building.
+
This path is mandatory for query, where superkmers must be tracked back to their originating sequence (id, kmer offset) for output annotation.
+
Output type: Rope
+
Each chunk is a Rope — a segmented byte sequence: a Vec of blocks, where each block is a Vec<Cell<u8>>. The consumer iterates over the blocks via a forward or backward cursor.
+
Rope::split_off(pos) splits at an absolute byte offset in O(log n) (binary search over block-start index). If pos falls inside a block, that block is split in two via Vec::split_off — no memcpy in the common case.
1. read one block of block_size bytes → push onto rope
-2. probe check: if the boundary marker ("\n>" or "\n@") is absent from the
- last block, skip the splitter (avoids a full backward scan for nothing)
-3. call splitter on last block
- if found at offset n:
- remainder = last_block.split_to(n) ← O(1), zero copy
- return std::mem::take(&mut self.rope) ← the chunk
-4. if rope.len() > 1 (multi-block accumulation):
- pack rope into one flat buffer ← one alloc
- retry splitter on flat buffer
-5. if EOF: flush remaining rope as final chunk
+
1. read one block of block_size bytes → push onto Rope
+2. call splitter(rope) → Option<abs_offset>
+ if Some(pos):
+ tail = rope.split_off(pos) ← O(log n), may split one block
+ chunk = mem::replace(&mut rope, tail)
+ return Some(Ok(chunk))
+3. if EOF and rope non-empty: return Some(Ok(rope)) as final chunk
+4. if EOF and rope empty: return None
+
The Splitter function signature is fn(&Rope) -> Option<usize>. It returns the absolute byte offset of the start of the last complete record, or None if no boundary was found in the accumulated rope (need more data).
Boundary detection — FASTA
-
Backward scan with a 2-state machine. Searches for > immediately preceded by \n or \r:
+
Backward scan with a 2-state machine. Searches (right to left) for > followed by \n or \r (i.e., a > that is preceded by a newline in forward order):
stateDiagram-v2
direction LR
[*] --> Scanning
Scanning --> FoundGt : '>'
FoundGt --> Scanning : other
FoundGt --> [*] : '\\n' / '\\r' ✓
-
Returns the byte offset of the > that starts the last complete record.
+
Returns the byte offset of the > that starts the last complete record. Returns None if only one > is found (cannot confirm there is a prior complete record).
Boundary detection — FASTQ
FASTQ records have a rigid 4-line structure (@header, sequence, +, quality). The @ character (ASCII 64, Phred score 31) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate @.
-
7-state machine (port of Go's EndOfLastFastqEntry), scanning from right to left. Each time a + is found, its position is saved as restart; any state mismatch resets the scan to that position.
+
7-state machine (states 0–6), scanning from right to left. Each time a + is found, its position is saved as restart; any state mismatch resets the scan to that position.
obikmer/src/cmd/estimate.rs — commande estimate (dry-run des paramètres)
+
+
Notes
+
Ce document était à l'origine une discussion de design (4 approches). L'implémentation
+a maintenant convergé vers l'approche fingerprint (Findere-style).
+FORT RISQUE DE DÉRIVE — le contenu est probablement un mélange de design et d'implémentation :
+- Le modèle FP = 1/2^(b·z) et les règles de résolution (2-of-3 parmi b, z, fp) sont implémentés
+- La commande reindex permet la conversion a posteriori exact↔approx
+- La commande estimate fait le dry-run des paramètres
+Cette page doit être réécrite pour documenter l'implémentation Findere réelle plutôt que les alternatives abandonnées.
evidence.bin maps each MPHF slot to the position of the k-mer that owns it,
+enabling zero-FP verification. On the bacterial BCT dataset (2048 partitions,
+k=31, ~33 M k-mers/partition) it accounts for 66 % of the lookup-layer footprint:
+
+
+
+
file
+
size/partition
+
fraction
+
+
+
+
+
evidence.bin
+
132 MB
+
66 %
+
+
+
unitigs.bin
+
58 MB
+
29 %
+
+
+
mphf.bin
+
10 MB
+
5 %
+
+
+
+
evidence.bin is a bijection from MPHF-space to unitig-position-space and
+costs at minimum ⌈log₂ N⌉ bits per slot — an information-theoretic floor with
+only ~22 % packing headroom. Compression is not a path to elimination.
+
The approximate index replaces evidence.bin + unitigs.bin.idx with a
+fingerprint.bin file. The MPHF and unitigs.bin are kept unchanged. Set
+operations still require an exact index; the approximate index targets query
+workloads that can tolerate a bounded false-positive rate.
+
+
The Findere model
+
A B-bit fingerprint stored per MPHF slot provides the discrimination that
+evidence.bin would otherwise provide through full k-mer reconstruction.
+
For a foreign k-mer query, the MPHF maps it to some slot s. The fingerprint
+stored at s belongs to the legitimate k-mer at that slot. The FP event is:
+
P(FP per k-mer) = 1 / 2^b
+
+
The Findere trick reduces the indexed k-mer size. When the user specifies k_user
+and z, the index physically stores k-mers of size s = k_user − z + 1. At query
+time, the same s-mer size is used. After collecting per-position s-mer results
+over the full query sequence, a sliding window of size z aggregates z consecutive
+s-mer hits into one confirmed k_user-mer hit, reducing the per-window FP rate:
+
P(FP per k_user-mer) = 1 / 2^(b·z)
+
+
IndexConfig::kmer_size stores s = k_user − z + 1, not k_user. Both indexing
+and querying use this stored size via set_k(idx.kmer_size()).
+
Parameters b and z are stored in layer_meta.json (EvidenceKind::Approx { b, z }).
+
+
FingerprintVec on disk
+
fingerprint.bin layout:
+
magic: b"FPVF" (4 bytes)
+b: u8 (bits per slot, 1..=64)
+padding: [0u8; 3]
+n: u64 LE (number of slots)
+data: packed bits, ceil(n·b/8) bytes, Lsb0 order
+
+
FingerprintVec is memory-mapped. The match check against a query k-mer:
build_approx_evidence iterates unitigs.bin sequentially, writes
+kmer.seq_hash() into the slot assigned by the MPHF, then saves fingerprint.bin
+and layer_meta.json. No .idx file is produced; random access into
+unitigs.bin is not needed.
MphfLayer::open reads this tag and dispatches find to find_exact or
+find_approx transparently. find_exact panics on an approximate layer;
+find_approx panics on an exact layer — mode mixing is a programming error.
+
+
Parameter resolution (resolve_approx_params)
+
The identity b·z = ⌈−log₂(fp)⌉ lets any two of (b, z, fp) derive the third.
+resolve_approx_params implements a 2-of-3 rule with conservative ceiling
+rounding:
+
+
+
+
given
+
derived
+
+
+
+
+
b, z
+
fp = 1/2^(b·z)
+
+
+
z, fp
+
b = ⌈−log₂(fp) / z⌉
+
+
+
b, fp
+
z = ⌈−log₂(fp) / b⌉
+
+
+
z only
+
b = 8 (default), fp derived
+
+
+
b only
+
z = 1 (default), fp derived
+
+
+
fp only
+
b = 8 (default), z derived
+
+
+
none
+
b = 8, z = 1, fp = 1/256
+
+
+
+
When all three are given, b and z are authoritative and fp is recomputed.
+
+
CLI flags
+
Both index and reindex accept the same flags:
+
+
+
+
flag
+
type
+
meaning
+
+
+
+
+
--approx
+
bool
+
enable fingerprint evidence
+
+
+
--evidence-bits (b)
+
u8
+
fingerprint bits per slot
+
+
+
-z
+
u8
+
Findere z parameter
+
+
+
--fp
+
f64
+
target FP rate per z-window
+
+
+
--block-size
+
usize
+
unitig block size for exact .idx; ignored in approx mode
+
+
+
+
--approx must be set explicitly; the other three flags are optional and
+resolved by the 2-of-3 rule. Omitting all three produces b=8, z=1.
+
+
reindex command
+
reindex converts an existing index between exact and approximate evidence
+in-place across all partitions and layers, running partitions in parallel via
+Rayon.
+
Conversion to approximate (--approx):
+
+
Builds fingerprint.bin from unitigs.bin + mphf.bin.
+
Removes evidence.bin and unitigs.bin.idx.
+
Updates layer_meta.json with EvidenceKind::Approx { b, z }.
+
+
Conversion to exact (default, no --approx):
+
+
Builds evidence.bin + unitigs.bin.idx from unitigs.bin + mphf.bin.
+
Removes fingerprint.bin.
+
Updates layer_meta.json with EvidenceKind::Exact.
+
+
The root index.meta is updated with the new evidence kind on success.
+mphf.bin and unitigs.bin are never modified.
+
+
estimate command
+
estimate is a dry-run that resolves and prints (b, z, fp) without touching
+any index. It accepts the same --evidence-bits, -z, and --fp flags and
+additionally accepts -k to display the effective indexed k-mer length:
+
k (user): 31
+k (indexed, s=k-z+1): 27
+z: 5
+evidence bits (b): 8
+FP per s-mer: 3.906e-3 (1/2^8)
+FP per k-mer window: 9.537e-7 (1/2^(8·5))
+
+
Useful for choosing parameters before committing to an index build.
obikseq/src/kmer.rs — layout mémoire (repr(transparent) u64), encodage/décodage, revcomp, forme canonique
+
obikseq/src/params.rs — k global (set_k / k())
+
+
Notes
+
Document d'implémentation stable. L'algorithme de revcomp bit-à-bit est décrit —
+vérifier qu'il correspond à revcomp_raw dans obiskio/src/unitig_index.rs (copie locale)
+et à l'implémentation dans obikseq/src/kmer.rs.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/implementation/kmer/index.html b/doc/implementation/kmer/index.html
index 91f0ea8..a9a332f 100644
--- a/doc/implementation/kmer/index.html
+++ b/doc/implementation/kmer/index.html
@@ -514,10 +514,21 @@
Nucleotides are packed 2 bits each, left-aligned, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2k bits are always zero. k is not stored — it is a parameter of every operation that needs it, and will be owned by the future collection-level indexer.
+
Three marker types implement KmerLength:
+
+
+
+
Marker
+
len() source
+
Used for
+
+
+
+
+
KLen
+
params::k()
+
k-mers
+
+
+
MLen
+
params::m()
+
minimizers
+
+
+
ConstLen<N>
+
const generic N
+
tests
+
+
+
+
Public aliases:
+
pubtypeKmer=KmerOf<KLen>;// k-mer, global k
+pubtypeMinimizer=CanonicalKmerOf<MLen>;// canonical m-mer, global m
+
+
Nucleotides are packed 2 bits each, left-aligned, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2·len bits are always zero. The length is not stored — every operation reads it from L::len().
@@ -1071,33 +1252,41 @@
+
Global parameters
+
params::set_k(k) / params::k() and params::set_m(m) / params::m() are backed by OnceLock<usize> in production (write-once, panic on conflict) and by thread_local! { Cell<usize> } in test builds (per-thread, freely writable). params::init(k, m) sets both in one call.
Encoding
-
Kmer::from_ascii(ascii, k) encodes the first k bytes of an ASCII slice using the shared ENC table (see SuperKmer — ASCII encoding):
+
KmerOf::<L>::from_ascii(ascii) encodes the first L::len() bytes using the shared ENC table (see SuperKmer — ASCII encoding):
write_ascii(k, buf) appends k ASCII characters to a caller-supplied Vec<u8> using the shared DEC4 table: one lookup per 4 nucleotides, two partial-byte lookups for the remainder. No allocation in the hot path.
-
to_ascii(k) is a convenience wrapper that allocates and returns a Vec<u8>; intended for tests and display only.
+
write_ascii(writer) writes k ASCII characters to any W: Write using the shared DEC4 table: one lookup per 4 nucleotides, one partial lookup for the remainder. No allocation in the hot path.
+
to_ascii() is a convenience wrapper that allocates and returns a Vec<u8>; intended for tests and display only.
Reverse complement
Computed as pure arithmetic — no lookup table, no memory access:
After complementing, bytes are reversed (swap_bytes), then nibbles, then 2-bit groups — restoring 2-bit nucleotides to their correct positions in reverse order. A final left-shift realigns to MSB. Zero allocation — result lives on the stack.
canonical() returns a CanonicalKmerOf<L> — a distinct newtype that carries the same u64 layout but enforces the invariant that the stored value equals min(kmer, revcomp):
Lexicographic minimum of forward and reverse-complement, comparing the raw u64 values directly (left-aligned encoding makes this equivalent to nucleotide-wise comparison). Zero allocation — result lives on the stack.
+
CanonicalKmerOf::from_raw_unchecked(raw) is the only other public constructor, for trusted paths such as deserialisation.
+
Sliding window helpers
+
push_right(nuc) / push_left(nuc) shift the window by one base in O(1). is_overlapping(other) checks whether the last k−1 nucleotides of self equal the first k−1 of other.
+
Hashing
+
hash_kmer(raw: u64) -> u64 computes mix64(raw ^ 0x9e3779b97f4a7c15), the seeded splitmix64 finalizer. CanonicalKmerOf::seq_hash() delegates to hash_kmer.
obilayeredmap/src/layer.rs — Layer::append_genome_column() (PersistentCompactIntMatrix et PersistentBitMatrix)
+
obicompactvec/src/intmatrix.rs — append_column pour PersistentCompactIntMatrix
+
obicompactvec/src/bitmatrix.rs — append_column pour PersistentBitMatrix
+
+
Notes
+
FORT RISQUE DE DÉRIVE. Changements récents :
+- Ajout de la validation de compatibilité d'évidence : merge exact+approx → erreur (OKIError::IncompatibleEvidence)
+- merge_partition reçoit maintenant block_bits: u8
+- La commande reindex a été ajoutée comme outil de conversion exact↔approx avant merge
+Vérifier que la doc décrit la politique de merge mixed-evidence et le recours à reindex.
obikmer merge combines multiple existing kmer indexes into a single index. The result contains all kmers from all sources, with per-genome presence/absence or count data for every genome across every layer.
+
+
Modes
+
pubenumMergeMode{Presence,Count}
+
+
Default mode is Presence. Count mode requires all source indexes to have with_counts=true; mixing count and non-count sources is rejected at validation.
+
+
+
+
Mode
+
Column type
+
Constraint
+
+
+
+
+
Presence
+
PersistentBitMatrix (one bit per genome per slot)
+
none
+
+
+
Count
+
PersistentCompactIntMatrix (one u32 per genome per slot)
+
all sources with_counts=true
+
+
+
+
+
Input / output constraints
+
All source indexes must satisfy:
+
+
IndexState::Indexed (fully built — index.done sentinel present)
+
Same kmer_size, minimizer_size, n_partitions
+
Same evidence kind: all Exact, or all Approx with identical (b, z) parameters
+
If Count mode: all sources must have with_counts=true
+
+
--force: if the output directory already exists, it is deleted before the merge begins.
+
+
Evidence compatibility
+
validate_evidence_compat(sources) is called before any I/O. It compares each source's EvidenceKind against sources[0]:
+
+
All Exact → accepted, output uses Exact
+
All Approx { b, z } with same (b, z) → accepted, output uses those parameters
+
Any other combination → OKIError::IncompatibleEvidence, with a message directing the user to run reindex first
+
+
Mixed exact/approx is a hard error, not a silent conversion.
compute_labels(sources, rename_duplicates) assigns final genome labels across all sources before any file is written. The first occurrence of a label keeps the original name. Subsequent occurrences receive .1, .2, … suffixes when rename_duplicates is true, or trigger OKIError::DuplicateGenomeLabel otherwise.
+
+
Algorithm
+
1. Validation
+
Check all sources against the constraints above. Abort on any mismatch.
+
2. Bootstrap output from first source
+
Recursive file copy of sources[0] → output. Immediately after the copy:
+
+
index.meta is rewritten with the final genome list (all sources, possibly renamed) and the effective evidence kind.
+
In Presence mode, any counts/ directories inherited from source_0 are removed.
+
spectrums/ from source_0 is removed and rebuilt from scratch across all sources, applying the (possibly renamed) labels.
+
+
This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genomes occupy columns 0 … n_dst_genomes - 1 in the destination.
+
3. For each subsequent source (parallel across partitions)
+
KmerPartition::merge_partition(i, sources, mode, n_dst_genomes, block_bits) is called for each partition index i. block_bits is taken from dst.meta.config.block_bits.
+
Each entry in sources is (&KmerPartition, n_genomes) where n_genomes is the column count that source contributes (> 1 when the source is itself a merged index).
+
First merge, Presence mode: when n_dst_genomes == 1, Layer::<()>::init_presence_matrix is called on every existing destination layer before any source column is appended. This creates presence/col_000000.pbiv set all-true (genome 0 is present in every slot).
+
Pass 1 — classify kmers
+
Iterate all kmers from all source partitions (via UnitigFileReader + canonical kmer iteration). For each kmer, probe the destination LayeredMap<()>:
+
+
Hit: kmer already in the destination; record for Pass 2.
+
Miss: push kmer into a GraphDeBruijn accumulator.
+
+
New layer construction
+
If the accumulator is non-empty, compute de Bruijn unitigs and call Layer::<()>::build(&new_layer_dir, block_bits). All kmers absent from the destination — across all sources — accumulate into a single graph, producing one new layer per merge operation (not one per source).
+
Pass 2 — fill column builders
+
For each source and each of its layers, re-iterate unitigs and look up stored values via SrcLayerData::lookup(kmer, src_n):
+
+
SrcLayerData::SetMembership — no data directory exists; every kmer returns vec![1; n_genomes]
+
SrcLayerData::Presence — reads PersistentBitMatrix from presence/
+
SrcLayerData::Count — reads PersistentCompactIntMatrix from counts/
+
+
Hits are routed to exist_builders[dst_layer][src_col]; misses are routed to new_src_builders[src_col].
+
Column prepending for new layers
+
Before source columns are written to the new layer, n_dst_genomes absent columns (all-zero / all-false) are prepended — one per genome already in the index — so the column count invariant holds immediately after layer creation.
+
Close and update metadata
+
Close all builders; update presence/meta.json or counts/meta.json with {"n": N, "n_cols": n_dst_genomes + n_src_total}; increment PartitionMeta::n_layers if a new layer was added.
+
4. Update index metadata
+
index.meta was already written during bootstrap with the complete genome list and evidence kind. No further update is needed after the partition loop.
+
+
append_genome_column
+
Defined on two concrete specialisations of Layer<D>:
Each appends one column file to the matrix subdirectory (counts/ or presence/). In merge_partition, columns are written directly via PersistentBitVecBuilder / PersistentCompactIntVecBuilder rather than through these helpers, but the invariant they enforce is the same.
+
+
Column count invariant
+
After any merge, every layer in every partition has exactly n_genomes columns, where n_genomes is the total genome count in the index at that point.
+
Maintained by three mechanisms:
+
+
Existing layers: n_src_total columns appended (one per source genome).
+
New layers created during merge: n_dst_genomes absent columns prepended before source columns.
+
First merge, Presence mode: init_presence_matrix retroactively creates presence/col_0 all-true for genome 0.
+
+
The invariant is a precondition of LayeredStore aggregation traits: col_weights() and all partial distance methods assume every inner store has the same column count.
+
+
Error variants relevant to merge
+
+
+
+
Variant
+
Condition
+
+
+
+
+
OKIError::NotIndexed(path)
+
Source not in Indexed state
+
+
+
OKIError::IncompatibleConfig
+
Mismatched kmer_size, minimizer_size, or n_partitions
+
+
+
OKIError::MismatchedMode
+
Count mode but a source has with_counts=false
+
+
+
OKIError::IncompatibleEvidence(msg)
+
Mixed exact/approx or different approx (b, z)
+
+
+
OKIError::DuplicateGenomeLabel(label)
+
Duplicate label and rename_duplicates=false
+
+
+
+
+
On-disk impact
+
After merging G genomes (sources_0 contributes G0, subsequent sources the rest):
+
partitions/
+ part_00000/
+ index/
+ meta.json ← n_layers updated if new layer added
+ layer_0/
+ mphf.bin ← unchanged
+ unitigs.bin ← unchanged
+ evidence.bin ← unchanged
+ presence/ ← created on first merge (Presence mode)
+ meta.json {"n": N, "n_cols": G}
+ col_000000.pbiv ← all-true (genome 0 … G0-1)
+ col_000001.pbiv ← next source
+ ...
+ counts/ ← extended (Count mode)
+ meta.json {"n": N, "n_cols": G}
+ col_000000.pciv ← genome 0 counts (from original build)
+ col_000001.pciv ← next source
+ ...
+ layer_N/ ← new layer (if new kmers found)
+ mphf.bin
+ unitigs.bin
+ evidence.bin
+ presence/ or counts/
+ meta.json {"n": N1, "n_cols": G}
+ col_000000.pbiv ← all-false (absent for existing genomes)
+ ...
+spectrums/
+ <label>.json ← one file per genome, rebuilt from all sources
+index.meta ← complete genome list + evidence kind written at bootstrap
+
obilayeredmap/src/mphf_layer.rs — type Mphf (PtrHash + CubicEps + CachelineEfVec + Xx64), construction en 2 passes, build(), build_exact_evidence(), build_approx_evidence(), build_evidence()
+
obikpartitionner/src/index_layer.rs — build_index_layer() avec passage de block_bits
+
+
Notes
+
FORT RISQUE DE DÉRIVE. Changements récents :
+- build_exact_evidence(dir, block_bits) — block_bits maintenant paramétrisé (défaut 0)
+- build_approx_evidence(dir, b, z) — nouvelle fonction pour l'évidence fingerprint
+- build_evidence(dir, kind, block_bits) — dispatch selon EvidenceKind
+- Construction en 2 phases : pass 1 (Rayon parallèle) + pass 2 (callback fill_slot)
+Vérifier que la doc décrit correctement les deux nouvelles routes d'évidence et le paramètre block_bits.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/implementation/mphf/index.html b/doc/implementation/mphf/index.html
index 56c46b0..5d4b4dd 100644
--- a/doc/implementation/mphf/index.html
+++ b/doc/implementation/mphf/index.html
@@ -757,6 +757,28 @@
+
+
+
Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of surviving unique kmers is not known until after counting and filtering low-abundance kmers.
Phase 1 — provisional MPHF + kmer spectrum
-
Implemented in obikpartitionner::KmerPartition::count_kmer().
+
Implemented in obikpartitionner::KmerPartition::count_kmer() → count_partition().
-
Pass 1: read the dereplicated superkmer file; enumerate all unique canonical kmers into a HashSet. Exact count known after this pass.
-
Build a provisional MPHF (GOFunction from the ph crate) over the exact kmer set. Produces mphf1.bin.
-
Create counts1.bin: one zero-initialised u32 per MPHF slot (mmap'd).
-
Pass 2: re-read the dereplicated file; for each kmer, query mphf1.get(kmer) and atomically accumulate the superkmer count into counts1[slot].
+
External sort: read the dereplicated superkmer file; extract the raw u64 canonical kmer value for every kmer of every superkmer. Sort in RAM-bounded chunks (adaptive budget: 40% of available RAM ÷ n_threads, minimum 1 M kmers per chunk), then k-way merge with inline dedup. Result: sorted_unique.bin — a flat array of f0 distinct sorted u64 values. Exact kmer count f0 is known at this point.
+
Build provisional MPHF (ptr_hash, same configuration as phase 2) over sorted_unique.bin using new_from_par_iter. Delete sorted_unique.bin immediately after. Persist to mphf1.bin.
+
Create counts1.bin: PersistentCompactIntVec with f0 slots, zero-initialised.
+
Accumulation pass: re-read the dereplicated superkmer file; for each kmer in each superkmer, compute slot = mphf.index(kmer.raw()) and increment counts1[slot] by the superkmer's COUNT.
Build kmer frequency spectrum from counts1: histogram {count → n_kmers}, totals f0 (distinct kmers) and f1 (total abundance). Written to kmer_spectrum_raw.json per partition, then merged globally.
After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see Construction pipeline), the exact filtered kmer set is available via unitigs.bin.
-
MphfLayer::build is called on the unitig file:
+
MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot) is called on the unitig directory:
-
Pass 1: iterate all canonical kmers from unitigs.bin in parallel, build and store mphf.bin (ptr_hash).
-
Pass 2: iterate sequentially, fill evidence.bin, call the mode-specific fill_slot callback.
+
Pass 1 (parallel): a CanonicalKmerIter — clonable via Arc<Mmap>, no file reopening — is passed directly to new_from_par_iter via par_bridge(). No .idx is read or created at this stage; parallelism is at partition/layer level, not within a single MPHF. Produces mphf.bin.
+
Pass 2 (sequential): iterate with iter_indexed_canonical_kmers; fill evidence files; call fill_slot(slot, kmer) callback per kmer. For Exact/Hybrid, .idx is written at the end of this pass — never earlier.
mphf1.bin and counts1.bin are no longer needed after phase 2 and can be deleted.
@@ -1265,13 +1393,11 @@
FMPH/FMPHGO (ph crate, Beling, ACM JEA 2023):
~2.1 bits/key — most compact; good query speed; deterministic construction
-
Works well from an exact or slightly overestimated count
-
GOFunction (group-oriented variant) is the specific type used
+
GOFunction (group-oriented variant) was the original phase-1 choice; eliminated when the external sort made the exact count available at phase 1 as well
MPHF choice per phase
-
Phase 1 (provisional, discarded after spectrum computation): ph::fmph::GOFunction. Compact, fast to build from the exact post-dedup kmer set. Query speed is secondary — the structure is only used during pass 2 of count_kmer.
-
Phase 2 (persistent, queried repeatedly): ptr_hash. Exact key count is available from the unitig index; ptr_hash query speed (≥2.1×) and construction speed (≥3.1× over FMPH) are the decisive factors. The 2.4 bits/key overhead is acceptable.
-
boomphf is eliminated: largest space overhead, streaming advantage does not apply.
+
Both phases: ptr_hash, same type alias and construction parameters. The external sort (phase 1) and the unitig index (phase 2) both provide the exact key count before MPHF construction, so ptr_hash's requirement is satisfied in both cases. Using a single MPHF implementation removes the ph crate dependency.
+
boomphf: eliminated — largest space overhead, streaming advantage no longer needed. FMPH/GOFunction: eliminated — exact count available, ptr_hash is faster at equivalent compactness.
Space at scale
For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):
@@ -1320,9 +1446,12 @@
Layer structure
Each layer is a self-contained unit. See obilayeredmap for the full on-disk layout. The MPHF-relevant files are:
layer_i/
- unitigs.bin — packed 2-bit nucleotide sequences (kmer evidence)
+ unitigs.bin — packed 2-bit nucleotide sequences (kmer evidence source)
+ unitigs.bin.idx — random-access block index (block_bits controls granularity)
mphf.bin — ptr_hash phase-2 MPHF
- evidence.bin — n × u32: (chunk_id: 25 bits | rank: 7 bits) per slot
+ evidence.bin — n × (chunk_id: 25 bits | rank: 7 bits) per slot [exact mode]
+ fingerprint.bin — n × b-bit fingerprints per slot [approx mode]
+ [no layer_meta.json — mode stored once in partition-level meta.json]
Layers are disjoint: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:
@@ -1330,17 +1459,43 @@
Collect kmers of B not present in any layer → set B \ A.
Build layer 1 from B \ A (dereplicate → count → De Bruijn → unitigs → MphfLayer::build).
+
Evidence modes
+
Three evidence modes are supported via IndexMode, stored once in PartitionMeta at partition root. There is no layer_meta.json.
+
Exact (IndexMode::Exact): evidence.bin stores one (chunk_id, rank) pair per MPHF slot. Verification reconstructs the kmer and compares to the query. Zero false positives. .idx required at query time.
+
Approx (IndexMode::Approx { b, z }): fingerprint.bin stores a b-bit hash per slot. False-positive rate 1/2^b per query; Findere z-parameter reduces window FP to ≈ 1/2^(b·z). No .idx written or needed.
+
Hybrid (IndexMode::Hybrid { b, z }): both fingerprint.bin and evidence.bin + .idx. find() uses the fingerprint (O(1)); find_strict() uses exact evidence (O(1)).
+
Build functions
+
MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)
+ Pass 1: CanonicalKmerIter + par_bridge() → build mphf.bin (no .idx used)
+ Pass 2: sequential iter → fill evidence files + call fill_slot
+ .idx written last for Exact/Hybrid (query-time only)
+
+MphfLayer::build_exact_evidence(dir, block_bits)
+ Post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
+ Uses open_sequential(); no .idx required on entry
+
+MphfLayer::build_approx_evidence(dir, b, z)
+ Post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
+ Uses open_sequential(); never writes .idx
+
+
There is no build_evidence dispatch wrapper. Callers choose the appropriate post-hoc build directly.
+
In obikpartitionner, build_index_layer receives block_bits: u8 from IndexConfig::block_bits and forwards it directly to Layer::build and Layer::build_approx_evidence.
Membership verification
-
ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from (chunk_id, rank) and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.
+
ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry:
+
+
Exact: decode (chunk_id, rank) from evidence.bin; reconstruct the kmer via unitigs.verify_canonical_kmer; compare to query.
+
Approx: compare kmer.seq_hash() to the b-bit fingerprint stored at the slot.
+
+
A mismatch in either mode means the kmer is absent from this layer; probe the next layer.
Query algorithm
fn query(kmer) → Option<(layer_index, slot)>:
for (i, layer) in layers.iter().enumerate():
slot = layer.mphf.index(kmer)
- if layer.evidence.decode(slot) matches kmer:
+ if layer.evidence.matches(slot, kmer): // exact or approx dispatch
return Some((i, slot))
return None
-
Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode.
+
MphfLayer::find dispatches on LayerEvidence at O(1) — no panicking find_exact/find_approx methods. find_strict always performs an exact check: O(1) for Exact/Hybrid, O(n) sequential scan for Approx. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check.
Merging layers
Two layer chains can be merged by re-indexing their union through the full pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.
FORT RISQUE DE DÉRIVE. C'est le fichier le plus affecté par les changements récents :
+- EvidenceKind (Exact / Approx) est désormais un concept de premier plan — toute la sémantique de query en dépend
+- LayerEvidence enum interne à MphfLayer : dispatch transparent find() → find_exact() ou find_approx()
+- fingerprint.rs : module entièrement nouveau (FingerprintVec + FingerprintVecWriter)
+- build_evidence() / build_exact_evidence() / build_approx_evidence() sont nouveaux
+- block_bits dans les fonctions build : O(1) garanti avec le chemin chaud explicit pour block_bits=0
+- Séparation open() (accès aléatoire, requiert .idx) vs open_sequential() (itération seule)
+Pratiquement toute cette page est à réécrire.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/implementation/obilayeredmap/index.html b/doc/implementation/obilayeredmap/index.html
index 95fe20e..06fe65e 100644
--- a/doc/implementation/obilayeredmap/index.html
+++ b/doc/implementation/obilayeredmap/index.html
@@ -9,7 +9,7 @@
-
+
@@ -647,6 +647,34 @@
+
+
+
obilayeredmap implements a persistent, incrementally extensible kmer index. The index is organised in three levels: index root → partition → layer. Each layer covers a disjoint kmer set and wraps a ptr_hash MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
+
obilayeredmap implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a ptr_hash MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
Three usage modes
The MPHF + evidence infrastructure is the same for all modes. The payload varies.
@@ -1214,34 +1588,65 @@
Both PersistentCompactIntMatrix and PersistentBitMatrix come from the obicompactvec crate.
+
Index mode (homogeneity invariant)
+
A partitioned index is homogeneous: every layer within a partition shares the same mode. The mode is determined once at LayeredMap::open() from PartitionMeta.mode and passed to each Layer::open() — no per-layer file is read.
IndexMode is stored once in PartitionMeta (meta.json at partition root). There is no layer_meta.json.
+
+
Exact: writes evidence.bin + unitigs.bin.idx. Zero false positives.
+
Approx: writes fingerprint.bin only. FP rate per kmer = 1/2^b; with Findere z-parameter, z consecutive kmers must all match → effective window FP ≈ 1/2^(b·z). No .idx written or required.
+
Hybrid: writes both fingerprint.bin and evidence.bin + .idx. find() uses the fingerprint (fast, O(1)); find_strict() uses exact evidence.
+
+
MphfLayer — autonomous kmer → slot mapping
-
MphfLayer encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.
+
MphfLayer encapsulates the MPHF and evidence store for one layer. It is independent of any payload.
pubstructMphfLayer{
-mphf:Mphf,
-evidence:Evidence,
-unitigs:UnitigFileReader,
-n:usize,// number of indexed kmers = number of MPHF slots
+mphf:Mphf,
+ev:LayerEvidence,// loaded at open() time
+n:usize,}
find returns Some(slot) only after verifying via evidence that the kmer is actually indexed. It returns None for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test).
-
build runs two sequential passes over unitigs.bin:
+
MphfLayer::open(dir, mode: &IndexMode) receives the mode from PartitionMeta — no per-layer file is read.
+
Query API
+
Two public query methods, both returning Option<usize> (slot index):
MphfLayer::build runs two passes over unitigs.bin:
-
Pass 1: iterate all canonical kmers in parallel via rayon, construct and store mphf.bin. new_from_par_iter avoids materialising a full key Vec.
-
Pass 2: iterate again sequentially, fill evidence.bin, call fill_slot(slot, kmer) once per kmer for payload population. A compact n/8-byte seen-bitset verifies MPHF injectivity inline.
+
Pass 1 (parallel via rayon): a CanonicalKmerIter (clonable, Arc<Mmap>, no file reopening) is passed to new_from_par_iter via par_bridge(). Produces mphf.bin. No .idx is read or created at this stage.
+
Pass 2 (sequential): fill evidence files; call fill_slot(slot, kmer) per kmer. .idx is written last for Exact/Hybrid modes (query-time only).
-
For empty layers (n = 0), build returns Ok(0) immediately after creating empty mphf.bin and evidence.bin.
+
There is no build_evidence dispatch wrapper — callers invoke build_exact_evidence or build_approx_evidence directly.
+
For empty layers (n = 0), all build variants return Ok(0) immediately after creating empty output files.
Layer\<D: LayerData> — MPHF + payload
Layer<D> pairs an MphfLayer with one payload store.
@@ -1261,7 +1666,7 @@
pubdata:T,}
-
LayerData covers the read path only (open + read). Build signatures differ between modes and are not in the trait.
+
LayerData covers the read path only (open + read). Build signatures differ between modes and are not part of the trait.
All build impls delegate MPHF + evidence construction to MphfLayer::build via a mode-specific fill_slot callback. Mode 2 pre-reads n_kmers from unitigs.bin to size the PersistentCompactIntMatrixBuilder before calling MphfLayer::build. Mode 3 does the same for PersistentBitMatrixBuilder.
+
All build impls delegate to MphfLayer::build via a mode-specific fill_slot callback. The mode parameter is forwarded directly — no LayerMeta is written.
+
Evidence-only post-hoc builds are accessible directly on Layer<D>:
open reads PartitionMeta once, extracts mode, and passes it to every Layer::open — no per-layer file is read. create stores the given mode in PartitionMeta.
+
query probes layers in order and returns (layer_index, Hit) on the first match. Expected probe depth: 1 for kmers in layer 0.
+
push_layer
+
push_layer builds the next layer from a unitigs.bin already written via next_layer_writer, using DEFAULT_BLOCK_BITS:
Mode 3 (PersistentBitMatrix) has no push_layer on LayeredMap; callers build directly via Layer<PersistentBitMatrix>::build_presence.
LayeredStore\<S> and aggregation traits
LayeredStore<S> is a generic aggregation wrapper over Vec<S>. It propagates three traits from obicompactvec::traits up the hierarchy via blanket impls:
Because blanket impls compose, LayeredStore<LayeredStore<S>> automatically inherits all three traits when S does — providing the partitioned level without a separate type.
-
Aggregation hierarchy:
-
PersistentCompactIntMatrix implements CountPartials
-LayeredStore<PersistentCompactIntMatrix> via blanket impl (one partition)
-LayeredStore<LayeredStore<…>> via blanket impl (partitioned index)
-
Leaf implementors (in obicompactvec):
@@ -1344,69 +1805,77 @@ LayeredStore<LayeredStore<…>> via blanket impl
-
PersistentCompactIntVec and PersistentBitVec do not implement these traits — they are single-column primitives, not matrix-level aggregators.
See Kmer index architecture for the full trait API and the two-pass normalised-metric pattern.
Partition (part_XXXXX/): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.
-
Layer (layer_N/): one MphfLayer plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.
+
There is no layer_meta.json. The mode is stored once in PartitionMeta and is valid for all layers. unitigs.bin.idx is built at the end of build_exact_evidence — never during MPHF construction — and is consumed at query time only.
-
Evidence encoding
+
Evidence encoding (exact)
evidence.bin is a flat [u32; n] array with no header. Each u32 encodes one slot:
bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
-
Decoding: chunk_id = raw >> 7, rank = raw & 0x7F. Reconstructing the kmer: read k nucleotides at position rank within unitig chunk_id.
-
For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.
+
chunk_id = raw >> 7, rank = raw & 0x7F. Reconstructing the kmer: read k nucleotides at position rank within unitig chunk_id (requires unitigs.bin.idx for random access).
+
For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity.
Xx64 is chosen over FxHash because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
-
CubicEps with PtrHashParams::<CubicEps>::default() (λ=3.5) is a balanced tradeoff: 2× slower construction than Linear/λ=3.0, 20% less space.
+
CubicEps with PtrHashParams::<CubicEps>::default() (λ=3.5): 2× slower construction than Linear/λ=3.0, ~20% less space.
MphfLayer::find probes the MPHF, decodes evidence, and verifies the kmer — returning Some(slot) on match, None otherwise. data.read(slot) is called only on a confirmed hit.
-
In LayeredMap, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.
+
Both delegate to the corresponding PersistentBitMatrix::append_column / PersistentCompactIntMatrix::append_column. They write a new column file (col_NNNNNN.pbiv / col_NNNNNN.pciv) and update meta.json to increment n_cols. value_of is called once per slot (0..n).
Called on the first merge of a Presence-mode index. Creates presence/ with meta.json {"n": n_kmers, "n_cols": 1} and col_000000.pbiv set entirely to true. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.
+
Why the MPHF is never rebuilt
+
The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new .pciv/.pbiv file and a single meta.json update.
Add-layer algorithm
When adding dataset B to an existing index:
For each partition, probe existing layers for kmers of B routed to that partition.
Collect kmers absent from all layers → B \ index.
-
Write B \ index to a new unitigs.bin via MphfLayer::unitig_writer.
-
Call Layer<D>::build on the new directory.
-
Update meta.json.
+
Write B \ index to a new unitigs.bin via next_layer_writer().
+
Call Layer<D>::build (or build_presence) on the new layer directory.
+
Call push_layer (or append_layer) to register the new layer in meta.json.
Each partition's new layer is built independently; the operation is fully parallel across partitions.
@@ -1433,11 +1902,15 @@ bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
memmap2 0.9
-
mmap of evidence and payload files
+
mmap of evidence and fingerprint files
+
+
+
bitvec
+
packed b-bit fingerprint storage
obiskio
-
unitig file writer/reader
+
unitig file writer/reader + .idx build
obicompactvec
@@ -1448,8 +1921,8 @@ bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
obipipeline/src/scheduler.rs — Scheduler avec Select biaisé sur les entrées de canaux
+
+
Notes
+
Document stable (librairie générique, peu de risque de dérive).
+Vérifier si obipipeline est toujours utilisé dans la phase scatter de obikpartitionner
+ou s'il a été remplacé par Rayon dans certains chemins.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/implementation/obipipeline/index.html b/doc/implementation/obipipeline/index.html
index 5018f41..8693d77 100644
--- a/doc/implementation/obipipeline/index.html
+++ b/doc/implementation/obipipeline/index.html
@@ -662,6 +662,17 @@
+
+
+
obipipeline is a generic, multi-threaded data pipeline crate. It connects a source, a chain of transforms, and a sink via crossbeam channels, running each stage with a shared worker pool and a biased scheduler.
+
obipipeline is a generic, multi-threaded data pipeline crate. It connects a source, a chain of stages, and a sink via crossbeam channels, running each stage with a shared worker pool and a biased scheduler.
Bound on every crossbeam channel in the pipeline (source output, inter-stage channels, worker input, sink input, sink error). Controls memory and back-pressure: a full channel blocks the sender until a slot frees.
+
Bound on every crossbeam channel in the pipeline. Controls memory and back-pressure: a full channel blocks the sender until a slot frees.
@@ -1208,7 +1325,7 @@
Each variant carries the concrete type for one stage's output. The macros pattern-match on this enum to route values between stages.
Macros
-
Six low-level macros build individual stages; one high-level macro (make_pipeline!) composes them.
make_source!(Enum,iterator,OutputVariant)// iterator yields Tmake_source_fallible!(Enum,iterator,OutputVariant)// iterator yields Result<T, E>
@@ -1216,6 +1333,9 @@
make_transform!(Enum,func,InputVariant,OutputVariant)// func: T -> Umake_transform_fallible!(Enum,func,InputVariant,OutputVariant)// func: T -> Result<U, E>
+make_flat_transform!(Enum,func,InputVariant,OutputVariant)// func: T -> impl IntoIterator<Item=U>
+make_flat_transform_fallible!(Enum,func,InputVariant,OutputVariant)// func: T -> Result<impl IntoIterator<Item=U>, E>
+
make_sink!(Enum,func,InputVariant)// func: T -> ()make_sink_fallible!(Enum,func,InputVariant)// func: T -> Result<(), E>
@@ -1224,17 +1344,31 @@
make_pipeline! {
DataEnum,
source iterator => OutputVariant, // or source? for fallible
- | func: In => Out, // non-fallible transform
- |? func: In => Out, // fallible transform
+ | func: In => Out, // 1→1 non-fallible transform
+ |? func: In => Out, // 1→1 fallible transform
+ || func: In => Out, // 1→N non-fallible flat transform
+ ||? func: In => Out, // 1→N fallible flat transform
sink func @ InputVariant, // or sink? for fallible
}
? marks fallibility on source, individual transforms, or sink independently.
Implemented as a TT muncher: the internal rule @build recurses over transform tokens one at a time, accumulating them into a vec![], then terminates on sink/sink?.
+
make_pipe! DSL
+
make_pipe! builds a sourceless/sinkless Pipe<D, In, Out> — a reusable, composable stage sequence:
Two pipes compose with .then(other). Apply to an iterator with .apply(iter, n_workers, capacity) to get a PipeIter<Out> — an iterator over the pipeline output, backed by a background WorkerPool. The scatter step in obikmer uses make_pipe! and .apply() rather than the full make_pipeline! / WorkerPool pattern.
The scheduler is a single thread running a biased Select over all input channels. Priority order (highest first):
index 0 sink_err_rx abort on sink error
-index 1 stage_rxs[N-1] drain last stage first
-...
-index N stage_rxs[0]
-index N+1 source_rx pull new data last
+index 1 flat_delta_rx adjust in_flight before dispatching
+index 2..=n+1 stage_rxs[n-1..0] drain last stage first
+index n+2 source_rx pull new data last
This back-pressure-friendly ordering ensures downstream stages are drained before new items enter the pipeline.
-
Workers are generic: each receives (data, SharedFn, result_tx) and calls f(data), sending the result to the provided channel. The scheduler decides which transform to apply and where to route the result.
-
Termination uses an in_flight counter:
+
Workers are generic: each receives a WorkerTask — either Transform(data, stage_idx) or Flat(data, stage_idx). For Transform, the worker calls f(data) and sends the result to stage_txs[stage_idx]. For Flat, the worker calls f(data, &push_tx, &delta_tx): the closure pushes N items into push_tx then sends N-1 to delta_tx. The scheduler uses the delta to adjust in_flight without knowing N in advance.
+
Termination uses an in_flight: isize counter and a flat_workers_active: usize counter:
-
incremented when an item is dispatched from source to workers
-
decremented when the item exits the last stage
-
the loop exits only when source_done && in_flight == 0
+
in_flight incremented when an item is dispatched from source to workers
+
in_flight decremented when the item exits the last stage to the sink
+
flat_workers_active incremented when a Flat task is dispatched, decremented when the delta arrives
+
the loop exits only when source_done && in_flight == 0 && flat_workers_active == 0
-
This guarantees all in-flight items complete before join().
+
This guarantees all in-flight items complete (including all N outputs of a flat stage) before join().
Error handling
PipelineError has four variants:
@@ -1279,7 +1413,7 @@ index N+1 source_rx pull new data last
Internal routing error
-
StepError(Box<dyn Error>)
+
StepError(Box<dyn Error + Send + Sync>)
Error from user code (wrapped by make_*_fallible!)
Document d'implémentation stable. Vérifier que PersistentBitMatrixBuilder et append_column
+sont couverts (utilisés dans Layer::<PersistentBitMatrix>::build_presence et append_genome_column).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/implementation/persistent_bit_vec/index.html b/doc/implementation/persistent_bit_vec/index.html
index e1ae9f5..c7105ac 100644
--- a/doc/implementation/persistent_bit_vec/index.html
+++ b/doc/implementation/persistent_bit_vec/index.html
@@ -12,7 +12,7 @@
-
+
@@ -649,6 +649,34 @@
+
obicompactvec/src/format.rs — format de fichier (magic PCIV, header, primary u8, overflow, index)
+
+
Notes
+
Document d'implémentation stable. Vérifier que append_column (utilisé dans merge et reindex)
+est décrit. Vérifier que PersistentCompactIntMatrixBuilder est couvert (utilisé dans layer.rs).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/implementation/persistent_compact_int_vec/index.html b/doc/implementation/persistent_compact_int_vec/index.html
index f37e659..3f9ea66 100644
--- a/doc/implementation/persistent_compact_int_vec/index.html
+++ b/doc/implementation/persistent_compact_int_vec/index.html
@@ -649,6 +649,34 @@
+
RISQUE DE DÉRIVE modéré. Vérifier :
+- Phase 6 : la doc mentionne-t-elle le filtre d'abondance (min_ab, max_ab) ?
+- Phase 6 : block_bits passé à build_index_layer depuis IndexConfig
+- Phase 6 : dispatch exact/approx selon EvidenceKind dans build_index_layer
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/implementation/pipeline/index.html b/doc/implementation/pipeline/index.html
index a38914c..91d32f0 100644
--- a/doc/implementation/pipeline/index.html
+++ b/doc/implementation/pipeline/index.html
@@ -773,6 +773,34 @@
+
error valley → suggests min_count (typically the local minimum between the error peak and the coverage peak)
Phase 1 — Scatter
-
Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored. For each read:
+
Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored.
+
Input files are read via open_nuc_stream, which opens and decompresses the file, auto-detects the format (FASTA / FASTQ / GenBank), and yields a sequence of NucPage buffers. Each NucPage is a flat 64 KB buffer of normalised bytes (ACGT + \x00 separators), carrying a k−1 byte overlap from the preceding page so that no k-mer is lost at page boundaries. Per-record identity (sequence id, raw bytes) is not preserved; this is intentional — the scatter phase only needs normalised bases to produce superkmers.
+
For each read fragment within a page:
Ambiguous base filter: cut at any non-ACGT base; discard fragments shorter than k.
Entropy filter: scan each fragment with a sliding window of size k. When the kmer \(K_i = S[i \mathinner{..} i+k-1]\) ended by nucleotide \(S[j]\) (with \(j = i+k-1\)) has entropy below threshold \(\theta\), emit the current segment and start a new one (see algorithm below). \(K_i\) belongs to neither segment, and no valid kmer is lost.
@@ -1154,8 +1240,13 @@ B ≈ 100 is tunable; RAM needed ≈ partition_size / B.
for each kmer in sequence:
kmer_counts[canonical(kmer)] += COUNT
-
Implemented as an external sort or a temporary HashMap, depending on partition size. At the end of this phase, each distinct canonical kmer has its exact total count.
-
Abundance filter applied here: kmers with total_count < q are discarded. q is a collection parameter (0 = keep all, including singletons for ≤1x data).
+
Implemented as a three-step pipeline in count_partition():
+
+
External sort (kmer_sort::sort_unique_kmers): read dereplicated superkmers, extract canonical kmer raw u64 values, sort in RAM-bounded chunks (adaptive: 40% of available RAM ÷ n_threads, min 1 M kmers/chunk), k-way merge with inline dedup → sorted_unique.bin. f0 is now known exactly.
+
Provisional MPHF (ptr_hash): built from sorted_unique.bin via new_from_par_iter(f0, ...). Stored to mphf1.bin; sorted_unique.bin deleted immediately.
+
Accumulation pass: re-read dereplicated superkmers; for each kmer, slot = mphf.index(kmer.raw()), increment counts1[slot] by the superkmer COUNT. Stored in a PersistentCompactIntVec (counts1.bin).
+
+
At the end of this phase, each distinct canonical kmer has its exact total count, and the frequency spectrum (spectrums/{label}.json) is written to the index root.
No pre-filter on super-kmer COUNT is possible at phase 2: a super-kmer with COUNT=1 may contain only high-abundance kmers, each present in many other super-kmers across the partition.
Phase 4 — Super-kmer compaction
The valid kmer set from phase 3 is used as a mask to rewrite the super-kmer files:
@@ -1188,14 +1279,52 @@ branching / dead-end → unitig start or end
Output: unitigs.bin — the permanent evidence structure for the partition. Each kmer in the partition appears at exactly one (unitig_id, offset) location.
Scope of local unitigs: these are unitigs of the partition's local de Bruijn graph, not global unitigs. A kmer whose k-1 successor or predecessor falls in another partition appears as a dead end locally and terminates the unitig. This does not affect correctness of verification but means partition-local unitigs cannot be directly reused for global assembly.
Phase 6 — MPHF construction and index finalisation
-
Built once on the definitive kmer set (all kmers in all unitigs of the partition). See obilayeredmap and MPHF selection for the current implementation.
-
kmers from unitigs → MPHF → mphf.bin
- → evidence.bin : n × u32, each = (chunk_id: 25 bits | rank: 7 bits)
- → payload : counts/ (mode 2) or presence/ (mode 3)
+
build_index_layer is called per partition (in parallel via build_layers) with the following parameters sourced from IndexConfig:
+
+
block_bits — from IndexConfig::block_bits; controls the .idx block size (2^block_bits unitig chunks per block) for exact evidence
+
evidence — EvidenceKind::Exact or EvidenceKind::Approx { b, z }; propagated unchanged from IndexConfig::evidence
+
min_ab / max_ab — abundance bounds applied before graph construction
+
with_counts — whether to store kmer counts alongside set membership
+
+
Abundance filtering: when min_ab > 1 or max_ab.is_some(), the provisional mphf1.bin and counts1.bin produced in phase 3 are memory-mapped. Each canonical kmer is accepted only if its count in counts1 satisfies the bounds. If either file is absent, filtering is skipped (all kmers accepted).
+
for each kmer in dereplicated super-kmer:
+ ab = counts1[mphf1.index(kmer.raw())]
+ if ab < min_ab || ab > max_ab: skip
+ graph.push(kmer)
-
The MPHF is built in two passes over unitigs.bin: parallel pass for mphf.bin, sequential pass for evidence.bin and payload. The exact kmer count is available from the unitig index (unitigs.bin.idx) before the passes begin.
-
Exact verification via unitig evidence:
-
unitigs.bin serves as the evidence structure. The MPHF maps every input to [0, N) including absent kmers — the unitig read-back (via evidence.bin) is the only correct membership test.
+
Graph build and unitig write:
+
The surviving kmers are fed into GraphDeBruijn, which computes degrees and yields unitigs. Unitigs are written to layer_0/unitigs.bin via a UnitigFileWriter.
+
MPHF and evidence build:
+
Layer::build (membership-only) or Layer::<PersistentCompactIntMatrix>::build (with counts) is called next. Internally, MphfLayer::build performs two passes:
+
+
Pass 1 (parallel): build unitigs.bin.idx (block size = 2^block_bits) then construct the MPHF from all canonical kmers in unitigs.bin; store to mphf.bin.
+
Pass 2 (sequential): for each kmer in unitigs.bin, compute its slot and write evidence.bin (chunk_id: 25 bits | rank: 7 bits packed into a u32); also invoke the payload callback (fill_slot) to populate counts/ if with_counts.
+
+
After Layer::build completes, layer_meta.json records EvidenceKind::Exact.
+
Approximate evidence override:
+
If evidence is EvidenceKind::Approx { b, z }, build_approx_evidence is called immediately after Layer::build. It overwrites the exact evidence bundle with fingerprint.bin (b-bit hash per slot) and rewrites layer_meta.json with EvidenceKind::Approx { b, z }. No .idx file is needed at query time in this mode.
+
// Exact path → evidence.bin + unitigs.bin.idx + layer_meta.json(Exact)
+// Approx path → fingerprint.bin + layer_meta.json(Approx{b,z})
+// (evidence.bin left on disk but not used)
+
+
Partition metadata:
+
After all layer files are written, PartitionMeta { n_layers: 1 } is serialised to index/meta.json inside the partition directory. This file is required by LayeredMap::open for subsequent merge operations.
Cleanup: unless --keep-intermediate is set, remove_build_artifacts deletes dereplicated.skmer.zst, mphf1.bin, and counts1.bin after all partitions are indexed.
The rebuild, dump, and unitig commands all share the same filtering
+system. Filters can select k-mers based on per-genome quorum counts, optionally
+restricted to ingroup and outgroup genome sets derived from genome
+metadata.
+
rebuild additionally accepts --min-total-count / --max-total-count filters
+that operate on the sum of counts across all genomes.
+
Predicate syntax
+
Each --ingroup and --outgroup flag takes a predicate of the form:
+
key OP value1|value2|…
+
+
+
+
+
Operator
+
Meaning
+
+
+
+
+
* or all
+
wildcard — every genome matches unconditionally
+
+
+
key=v1\|v2
+
exact match — genome's key equals v1 or v2
+
+
+
key!=v
+
negation — genome's key equals none of the values
+
+
+
key~path
+
path ancestry — genome's key is path or a descendant
+
+
+
key!~path
+
not a descendant
+
+
+
+
Multiple values separated by | are always OR-ed within the predicate.
+
Path matching (~ and !~)
+
Metadata values can represent hierarchical taxonomic paths such as
+/Eukaryota/Viridiplantae/Streptophyta/Betulaceae/Betula/nana.
+
+
Absolute pattern (starts with /): the value must start with the pattern
+ at a segment boundary.
+ taxon~/Betulaceae/Betula matches /Betulaceae/Betula/nana and
+ /Betulaceae/Betula but not /Betulaceae/Betuloides/….
+
Bare segment (no leading /): the value must contain the pattern as an
+ exact path component anywhere.
+ taxon~Betula matches any path that has Betula as one of its segments.
+
+
Missing metadata key → NA
+
If a genome does not carry the queried metadata key, the predicate returns NA.
+NA propagates through the group evaluation logic (see below), and genomes that
+cannot be classified are ignored in all quorum counts.
+
Group semantics
+
Multiple predicates
+
+
+
+
Flag
+
Combination rule
+
+
+
+
+
--ingroup (repeated)
+
AND — genome must satisfy all predicates
+
+
+
--outgroup (repeated)
+
OR — genome satisfies any predicate
+
+
+
+
Three-value logic
+
Each predicate returns true, false, or NA (absent key).
+
+
AND: false absorbs everything; NA propagates unless already false.
+
OR: true absorbs everything; NA propagates unless already true.
+
+
Classification and priority
+
For each genome:
+
+
Evaluate AND(ingroup predicates) → in_result
+
Evaluate OR(outgroup predicates) → out_result
+
If in_result = true → Ingroup (ingroup wins over outgroup)
+
Else if out_result = true → Outgroup
+
Otherwise → Uncategorized (ignored in all quorum counts)
+
+
Implicit groups
+
+
+
+
--ingroup
+
--outgroup
+
Effective behaviour
+
+
+
+
+
not set
+
not set
+
all genomes form the ingroup
+
+
+
set
+
not set
+
only ingroup quorum flags apply
+
+
+
not set
+
set
+
only outgroup quorum flags apply
+
+
+
set
+
set
+
both constraints apply simultaneously
+
+
+
+
Quorum flags
+
+
+
+
Flag
+
Applies to
+
Meaning
+
+
+
+
+
--min-count N
+
ingroup
+
k-mer present in at least N ingroup genomes
+
+
+
--max-count N
+
ingroup
+
k-mer present in at most N ingroup genomes
+
+
+
--min-frac F
+
ingroup
+
k-mer present in at least fraction F of ingroup genomes
+
+
+
--max-frac F
+
ingroup
+
k-mer present in at most fraction F of ingroup genomes
+
+
+
--min-outgroup-count N
+
outgroup
+
k-mer present in at least N outgroup genomes
+
+
+
--max-outgroup-count N
+
outgroup
+
k-mer present in at most N outgroup genomes
+
+
+
--min-outgroup-frac F
+
outgroup
+
k-mer present in at least fraction F of outgroup genomes
+
+
+
--max-outgroup-frac F
+
outgroup
+
k-mer present in at most fraction F of outgroup genomes
+
+
+
--min-total-count N
+
all genomes
+
sum of per-genome counts ≥ N (rebuild only)
+
+
+
--max-total-count N
+
all genomes
+
sum of per-genome counts ≤ N (rebuild only)
+
+
+
--presence-threshold N
+
all
+
per-genome count > N to be considered "present" (default 0)
+
+
+
+
Defaults: mins = 0 (no lower bound), max counts = group size, max fracs = 1.0
+(no upper bound). A filter with all defaults is a no-op.
+
Fractions are computed over the size of the classified group, not over total
+genome count. An empty group (no genome classified as ingroup/outgroup) never
+triggers a filter failure.
+
Examples
+
Keep k-mers specific to Betula nana — present in at least 2 B. nana genomes
+and absent from every other genome in the index:
obikpartitionner::filter::GroupQuorumFilter — implements KmerFilter
+ using pre-computed ingroup and outgroup index vectors. The heavy logic
+ (predicate parsing, three-value evaluation, genome classification) happens
+ once before any iteration; each k-mer row evaluation is a simple index
+ lookup and counter.
+
+
+
obikmer::cmd::predicate::FilterArgs — shared clap argument group
+ embedded via #[command(flatten)] in RebuildArgs, DumpArgs, and
+ UnitigArgs. FilterArgs::build_filters() returns a ready-to-use filter
+ list.
+
+
+
obikpartitionner::KmerPartition::iter_partition_kmers — accepts
+ filters: &[Box<dyn KmerFilter>] and applies them per-kmer before invoking
+ the callback. rebuild, dump, and unitig all go through this single
+ entry point.
FORT RISQUE DE DÉRIVE. Nombreux champs ajoutés :
+- IndexConfig : champs evidence (EvidenceKind) et block_bits ajoutés
+- Nouveau fichier fingerprint.bin pour l'évidence approximative
+- LayerMeta / layer_meta.json introduit pour stocker EvidenceKind par layer
+- Structure du répertoire layer : evidence.bin vs fingerprint.bin selon le mode
+Mettre à jour le schéma de layout sur disque en conséquence.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/implementation/storage/index.html b/doc/implementation/storage/index.html
index 543a74d..d284fd8 100644
--- a/doc/implementation/storage/index.html
+++ b/doc/implementation/storage/index.html
@@ -64,7 +64,7 @@
The index root contains one part_XXXXX/ directory per partition, each holding one or more layer_N/ directories. Each layer directory contains mphf.bin, unitigs.bin, unitigs.bin.idx, evidence.bin, and optionally a counts/ or presence/ payload directory.
The sentinels are touched atomically at the end of each pipeline stage.
+A partial run (e.g. scatter interrupted) leaves no sentinel; the state is
+detected as the lowest sentinel present.
+
+
+
+
State
+
Sentinel present
+
Meaning
+
+
+
+
+
Empty
+
—
+
index.meta exists; scatter not started or interrupted
n_bits determines the partition count: 2^n_bits directories under partitions/.
+
evidence is either the string "Exact" or {"Approx": {"b": 8, "z": 1}}.
+
block_bits controls the .idx granularity: one offset entry every 2^block_bits
+chunks. block_bits=0 stores one entry per chunk (O(1) random access, largest .idx).
+
GenomeInfo.meta is a free-form string→string map for categorical metadata (e.g.
+taxonomy, sample origin). It is optional; defaults to empty.
+
Layer files
+
unitigs.bin
+
2-bit packed binary unitig sequences. Each record: 1 byte seql_minus_k
+(nucleotide length − k), followed by ceil((seql_minus_k + k) / 4) bytes of
+packed sequence. Long unitigs are transparently split into overlapping chunks
+(k−1 nucleotide overlap) so no k-mer crosses a chunk boundary.
+
unitigs.bin.idx (Exact only)
+
Magic UIX3, little-endian header: block_bits (u32), n_unitigs (u32),
+n_kmers (u64), then ceil(n_unitigs / 2^block_bits) + 1 byte-offset entries
+(u32 each, last entry is a sentinel past-end offset). Absent for Approx layers.
+
mphf.bin
+
PtrHash MPHF serialised with epserde. Maps canonical kmer (u64, left-aligned
+2-bit) to a slot index in [0, n_kmers).
+
layer_meta.json (LayerMeta)
+
{"evidence":{"type":"exact"}}
+
+or
+
{"evidence":{"type":"approx","b":8,"z":1}}
+
+
evidence.bin (Exact)
+
One (chunk_id: u32, rank: u8) record per MPHF slot, packed. Used to verify
+that the kmer mapped to a slot is actually present: unitigs.bin[chunk_id][rank]
+is re-read and compared against the query.
+
fingerprint.bin (Approx)
+
b-bit fingerprint per MPHF slot derived from the kmer's sequence hash.
+False-positive rate per query ≈ 1/2^b. With Findere parameter z ≥ 2,
+z consecutive k-mers must all match, reducing the effective FP rate to
+approximately W / 2^(b·z) per read of length L
+(where W = L − k − z + 2).
+
counts/ (PersistentCompactIntMatrix)
+
Present when with_counts=true. One column per genome; each row holds the
+per-genome k-mer count for the corresponding MPHF slot. Appended column-by-column
+during indexing and merge.
+
presence/ (PersistentBitMatrix)
+
Present when the layer was built in presence/absence mode (merge path).
+One bit per genome per MPHF slot. Written during merge; never present on a
+freshly indexed single-genome layer.
+
meta.json (PartitionMeta)
+
{"n_layers":2}
+
+
Records how many layer_N/ directories exist under index/. Incremented by
+each merge that adds a layer.
obiskbuilder/src/lib.rs — fenêtre glissante monotone pour le maintien du minimiseur
+
+
Notes
+
Document d'implémentation détaillé. Vérifier que le layout header (longueur, orientation,
+position minimiseur) n'a pas changé. La doc mentionne un revcomp SIMD — vérifier si c'est
+toujours le cas ou si l'implémentation est scalaire.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/implementation/superkmer/index.html b/doc/implementation/superkmer/index.html
index 2537ee6..d7ee7a0 100644
--- a/doc/implementation/superkmer/index.html
+++ b/doc/implementation/superkmer/index.html
@@ -751,6 +751,34 @@
+
PackedSeq stores a 2-bit packed DNA sequence as a heap-allocated Box<[u8]> plus a tail: u8 field:
Field
-
Bits
+
Type
Role
-
COUNT
-
24
-
Occurrence count (≤ 16 M)
+
tail
+
u8
+
Number of valid nucleotides in the last byte: 0 encodes 4, 1–3 are identity
-
NKMERS
-
8
-
Number of kmers (= seq_length − k + 1, range 1–255)
+
seq
+
Box<[u8]>
+
2-bit packed bytes, nucleotide 0 at bits 7–6 of seq[0]
-
Bit layout (MSB to LSB): [31:8] COUNT [7:0] NKMERS
-
NKMERS is stored as a raw u8 in kmer units, not nucleotides. The nucleotide length is recovered as NKMERS + k − 1. This avoids the awkward wrapping convention (0 = 256) that would be needed if nucleotide length were stored directly, and gains k−1 = 30 units of headroom:
Nucleotide length is recovered without storing it explicitly:
+
seql = (seq.len() - 1) * 4 + tail_count(tail)
+
+
There is no packed header word — count and the sequence live in separate fields.
+
The on-disk binary format (produced by write_to_binary) is:
+
[varint(count)] [u8: seql − k] [packed bytes…]
+
+
seql − k fits in a u8 when n_kmers = seql − k + 1 ≤ MAX_KMERS_PER_CHUNK (= 256). If a super-kmer exceeds 256 kmers, write_to_binary splits it into overlapping chunks (k−1 nucleotide overlap, same count per chunk), each a self-contained record readable by read_from_binary.
+
The public accessors operate on the struct fields directly:
In practice, observed super-kmer lengths on metagenomic data (k=31) are below 55 nucleotides (≤ 25 kmers) — far from the 255-kmer cap. If a super-kmer ever exceeds 255 kmers, it is split with a k−1 nucleotide overlap, preserving all kmers without duplication (identical mechanism to partition-boundary splits).
-
The sequence is always stored in canonical form (lexicographic minimum of forward and reverse complement), with nucleotide 0 at the MSB of the first byte. The byte array can be hashed directly without any adjustment.
ASCII encoding and decoding
Two lookup tables handle ASCII ↔ 2-bit conversion:
@@ -1125,7 +1197,7 @@
REVCOMP4 is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values.
Step 2 — realignment. After step 1, padding = n × 8 − seql × 2 spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using BitSlice<u8, Msb0>::rotate_left(padding) from the bitvec crate, which is SIMD-accelerated. The trailing padding bits are then zeroed:
-
letseql=self.n_kmers()+k-1;
+
letseql=self.seql();shift=n*8-seql*2// number of padding bitsbits.rotate_left(shift)bits[len-shift..].fill(false)
@@ -1143,7 +1215,7 @@
Minimizer sliding window
-
Super-kmers are built by SuperKmerIter (crate obiskbuilder), which maintains the current minimizer with a monotonic deque over a sliding window of W = k − m + 1 m-mer positions.
+
Super-kmers are built by SuperKmerIter (crate obiskbuilder), which tracks the current minimizer with a monotonic deque (Ring<MmerItem, 32>) inside RollingStat, a rolling-window entropy and minimizer tracker.
Each deque entry stores:
@@ -1167,20 +1239,11 @@
hash
u64
-
\(H(\text{canonical})\) — ordering key for random minimizer selection
+
hash_kmer(canonical << (64 − 2m)) — ordering key for random minimizer selection
-
The hash \(H\) is the seeded splitmix64 finalizer (see Minimizer selection):
-
fnhash_mmer(canonical:u64)->u64{
-letx=canonical^0x9e3779b97f4a7c15;// seed: eliminates fixed point at 0
-letx=x^(x>>30);
-letx=x.wrapping_mul(0xbf58476d1ce4e5b9);
-letx=x^(x>>27);
-letx=x.wrapping_mul(0x94d049bb133111eb);
-x^(x>>31)
-}
-
+
The hash uses the seeded splitmix64 finalizer (mix64(raw ^ 0x9e3779b97f4a7c15)), the same function as kmer::hash_kmer.
On each new nucleotide, once the window is full, the deque is updated:
Algorithm — minimizer deque update
@@ -1196,17 +1259,21 @@
The front of the deque is always the current minimizer. Because the deque is maintained in strictly increasing hash order, each entry is popped at most once — O(1) amortized per nucleotide.
-
A super-kmer boundary is emitted when the minimizer changes: deque.front.hash ≠ prev_hash. The canonical field of the front entry is not used for boundary detection — that uses the hash alone. The canonical value is stored so that the partition key \(H(\text{canonical})\) can be recomputed independently at routing time from the stored minimizer_pos, without inheriting the minimum-order-statistic bias (see Minimizer selection — partition key independence).
+
A super-kmer boundary is emitted when the minimizer changes: current_minimizer != prev_minimizer. SuperKmerIter also emits a boundary when:
+
+
entropy of the current k-mer falls at or below the threshold θ (cursor retreated by k−1)
+
super-kmer length reaches 256 nucleotides (cursor retreated by k)
+
Kmer extraction
-
A k-mer is extracted from a super-kmer with SuperKmer::kmer(i, k), which returns a Kmer — a left-aligned u64 newtype (see Kmer implementation):
A k-mer is extracted from a super-kmer with SuperKmer::kmer(i), which delegates to PackedSeq::extract::<KLen>(i) and returns a Kmer — a left-aligned u64 newtype (see Kmer implementation):
+
pubfnkmer(&self,i:usize)->Result<Kmer,KmerError>
-
The bit slice seq[i*2 .. (i+k)*2] (Msb0 order) is loaded as a big-endian u64 via bitvec::load_be, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.
+
The bit slice seq[i*2 .. (i+k)*2] (Msb0 order) is loaded as a u64 via bitvec::load_be, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.
obidebruinj/src/debruijn.rs — extraction unitigs, chunking à MAX_KMERS_PER_CHUNK
+
+
Notes
+
FORT RISQUE DE DÉRIVE. Changements récents :
+- DEFAULT_BLOCK_BITS est passé de 6 à 0 (accès O(1) par défaut)
+- block_bits est maintenant un paramètre runtime de build_unitig_idx() et UnitigFileWriter
+- chunk_start() a un chemin chaud explicite pour block_bits=0 (accès tableau direct, 0 scan)
+- open() vs open_sequential() : distinction nouvelle, importante pour la compréhension du coût
+- iter_unitigs() ajouté comme alias public de iter_chunks_sequential()
+Mettre à jour la description du format .idx et le modèle de coût d'accès aléatoire.
k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)
+
Canonical form: min(kmer, revcomp(kmer)) reduces strand-symmetric space by half
+
Input formats for index/superkmer: FASTA (.fa, .fasta), FASTQ (.fq, .fastq), GenBank flat file (.gb, .gbk, .gbff), all optionally gzip-compressed; directories expanded recursively; streaming stdin via -
+
Input formats for query: FASTA, FASTQ, optionally gzip-compressed; streaming stdin via -
+
Parameter constraints (enforced at CLI)
+
All constraints below are checked by CommonArgs::validate() at the start of superkmer and index. Invalid values exit immediately with an error.
+
+
+
+
Parameter
+
Constraint
+
Reason
+
+
+
+
+
k (--kmer-size)
+
odd
+
even k allows palindromic k-mers: kmer == revcomp(kmer), breaking the canonical form invariant
+
+
+
k (--kmer-size)
+
k ∈ [11, 31]
+
k > 31 overflows u64 at 2 bits/base; k < 11 gives insufficient specificity
+
+
+
m (--minimizer-size)
+
odd
+
same palindrome argument as k
+
+
+
m (--minimizer-size)
+
3 ≤ m ≤ k−1
+
minimizer must be strictly shorter than the kmer
+
+
+
z (-z, Findere, index --approx only)
+
z ≤ k−1
+
effective indexed kmer size is k−z+1; z ≥ k would make it ≤ 0
+
+
+
+
Genome label constraints
+
Genome labels are arbitrary Unicode strings with the following restrictions:
+
+
+
+
Character
+
Forbidden
+
Reason
+
+
+
+
+
/
+
yes
+
filesystem path separator
+
+
+
=
+
yes
+
--new-label parser separator
+
+
+
\0
+
yes
+
null byte
+
+
+
\n\r\t
+
yes
+
break CSV output
+
+
+
spaces
+
allowed
+
use shell quoting: --new-label 'new label=old label'
+
+
+
+
Empty labels are also rejected. Labels derived automatically from the index directory name (when --label is omitted) are not validated since they come from the filesystem and are already safe.
obikmer/src/main.rs — point d'entrée CLI, contraintes globales
+
obikmer/src/cli.rs — structure des arguments communs
+
+
Notes
+
Document de niveau projet (vue d'ensemble, motivations, contraintes fondamentales).
+Pas de code Rust spécifique à vérifier au-delà des contraintes générales (k impair, formats d'entrée).
+À mettre à jour si de nouvelles sous-commandes ou formats sont ajoutés.
obikseq/src/kmer.rs — type Kmer, propriétés, forme canonique
+
obikseq/src/superkmer.rs — type SuperKmer, longueur attendue
+
obiskbuilder/src/lib.rs — extraction de superkmers par minimiseur
+
+
Notes
+
Chevauche theory/encoding.md (encodage 2 bits) et theory/minimizer.md (choix du minimiseur).
+Vérifier que la définition de SuperKmer est cohérente avec les invariants actuels de obikseq.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/kmers/index.html b/doc/kmers/index.html
index 7b4795c..f315726 100644
--- a/doc/kmers/index.html
+++ b/doc/kmers/index.html
@@ -746,6 +746,34 @@
+
A kmer is a DNA subsequence of fixed length k. Two constraints govern the choice of k:
-
k ∈ [11, 31]: the range ensures the kmer is long enough to be specific and short enough to fit in a single machine word.
+
k ∈ [11, 31]: the range ensures the kmer is long enough to be specific and short enough to fit in a single machine word (u64 at 2 bits/base requires k ≤ 32; k < 11 yields insufficient specificity).
k is odd: an odd-length sequence cannot equal its own reverse complement (no palindromes). This guarantees that the canonical form min(kmer, revcomp(kmer)) is always strictly defined — the two orientations are always distinct — which is required for strand-independent counting.
+
Both constraints are enforced at CLI entry by CommonArgs::validate() in superkmer and index. Passing an invalid k exits immediately with an error message.
Super-kmers
-
A super-kmer is a maximal run of consecutive kmers from a DNA read, each overlapping the next by k−1 nucleotides. Each kmer of the run carries the same canonical minimizer. The canonical minimizer of a kmer is the smallest value of min(m-mer, revcomp(m-mer)) over all m-mers within the kmer (m < k, m odd), with the constraint that non-degenerate m-mers are always preferred over degenerate ones. A degenerate m-mer is one composed of a single repeated nucleotide (all-A, all-C, all-G, or all-T); such m-mers are selected only if no non-degenerate candidate exists in the window.
+
A super-kmer is a maximal run of consecutive kmers from a DNA read, each overlapping the next by k−1 nucleotides, sharing the same canonical minimizer. The canonical minimizer of a kmer is the m-mer (m < k) whose canonical hash hash_kmer(min(m-mer, revcomp(m-mer))) is smallest over all m-mers in the kmer window. The hash function is a mix64-based bijection; selection is purely hash-ordered with no degeneracy filter. A super-kmer is capped at 256 nucleotides; a longer run is split at that boundary.
Canonical super-kmers
A canonical super-kmer is the lexicographic minimum of a super-kmer and its reverse complement:
obikseq/src/kmer.rs — encodage 2 bits/base, revcomp, forme canonique
+
+
Notes
+
Document purement théorique. Peu de risque de dérive sauf si l'encodage interne de Kmer change.
+Vérifier que la table d'encodage A=00, C=01, G=10, T=11 est toujours celle du code.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/theory/encoding/index.html b/doc/theory/encoding/index.html
index 6adf132..d44aeb9 100644
--- a/doc/theory/encoding/index.html
+++ b/doc/theory/encoding/index.html
@@ -718,6 +718,34 @@
+
The Watson-Crick complement of any base is its bitwise NOT on 2 bits: complement(base) = ~base & 0b11.
Kmer encoding
A kmer fits in a single u64. Nucleotide 0 occupies bits 63–62, nucleotide i occupies bits 63−2i and 62−2i, and the low 64−2k bits are zero. Extraction of nucleotide i (0 ≤ i < k): (kmer >> (62 - 2*i)) & 0b11.
-
Reverse complement is computed via a 16-bit lookup table (65 536 entries × 2 bytes = 128 KB, fits in L2 cache) storing the reverse-complement of every 8-base chunk.
+
Reverse complement is computed by bit manipulation in four steps, with no lookup table:
Algorithm — Kmer reverse complement
procedure KmerRevcomp(kmer, k):
- raw ← TABLE16[kmer & 0xFFFF] << 48
- | TABLE16[(kmer >> 16) & 0xFFFF] << 32
- | TABLE16[(kmer >> 32) & 0xFFFF] << 16
- | TABLE16[(kmer >> 48) & 0xFFFF]
- return raw << (64 - 2*k)
+ x ← ~kmer -- complement all bases
+ x ← swap_bytes(x) -- reverse byte order
+ x ← ((x >> 4) & 0x0F0F0F0F0F0F0F0F)
+ | ((x & 0x0F0F0F0F0F0F0F0F) << 4) -- swap nibbles within each byte
+ x ← ((x >> 2) & 0x3333333333333333)
+ | ((x & 0x3333333333333333) << 2) -- swap 2-bit pairs within each nibble
+ return x << (64 - 2*k) -- re-align to MSB
+
The three reorder passes together reverse the order of all 2-bit base codes across the 64-bit word. The bitwise NOT in the first step complements each base (A↔T, C↔G). The final left shift clears the low 64−2k padding bits.
The canonical form is the lexicographic minimum of the kmer and its reverse complement:
obiskbuilder/src/entropy_table.rs — filtre Shannon sur les kmers à basse complexité
+
obiskbuilder/src/lib.rs — application du filtre lors du scatter (phase 1)
+
+
Notes
+
Document théorique stable. Vérifier que les paramètres theta et level_max dans le CLI
+(obikmer/src/cli.rs → CommonArgs) correspondent bien à ce qui est décrit.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/theory/entropy/index.html b/doc/theory/entropy/index.html
index 20e5a37..e1f2229 100644
--- a/doc/theory/entropy/index.html
+++ b/doc/theory/entropy/index.html
@@ -773,6 +773,34 @@
+
A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if \(\text{entropy}(kmer) \leq \theta\), where \(\theta\) is a collection parameter. The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.
+
A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if \(\text{entropy}(kmer) < \theta\), where \(\theta\) is a collection parameter (default 0.7). The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.
Interpretation as an effective number of classes
\(H_{\text{corr}}\) is a standard Shannon entropy over raw words (after unfolding the equivalence classes), so the classical perplexity interpretation holds directly: \(N_{\text{eff}} = e^{H_{\text{corr}}}\) is the number of equiprobable classes that would yield the same entropy.
For the normalised score \(\hat{H}\), dividing by \(H_{\text{max}}\) changes the logarithm base:
obiskbuilder/src/lib.rs — sélection du minimiseur par hash seedé (splitmix64 finalizer)
+
obikseq/src/superkmer.rs — forme canonique du minimiseur, fenêtre glissante
+
+
Notes
+
Vérifier que la fonction de hash décrite (splitmix64 finalizer avec graine) correspond
+au code actuel. Vérifier aussi que la définition de « minimiseur canonique » est toujours cohérente.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/doc/theory/minimizer/index.html b/doc/theory/minimizer/index.html
index 0c56933..1cbf856 100644
--- a/doc/theory/minimizer/index.html
+++ b/doc/theory/minimizer/index.html
@@ -762,6 +762,34 @@
+
+
+
+
+
diff --git a/docmd/implementation/rebuild_filter.md b/docmd/implementation/rebuild_filter.md
index 1904e94..88bf3b2 100644
--- a/docmd/implementation/rebuild_filter.md
+++ b/docmd/implementation/rebuild_filter.md
@@ -1,9 +1,12 @@
-# Rebuild filters and ingroup/outgroup predicates
+# Kmer filtering and ingroup/outgroup predicates
-The `rebuild` command compacts an existing index into a new single-layer index,
-optionally keeping only k-mers that satisfy a set of filters.
-Filters can operate on raw quorum counts over all genomes, or on pre-defined
-**ingroup** and **outgroup** genome sets derived from genome metadata.
+The `rebuild`, `dump`, and `unitig` commands all share the same filtering
+system. Filters can select k-mers based on per-genome quorum counts, optionally
+restricted to **ingroup** and **outgroup** genome sets derived from genome
+metadata.
+
+`rebuild` additionally accepts `--min-total-count` / `--max-total-count` filters
+that operate on the sum of counts across all genomes.
## Predicate syntax
@@ -73,8 +76,8 @@ For each genome:
| `--ingroup` | `--outgroup` | Effective behaviour |
|-------------|--------------|---------------------|
| not set | not set | all genomes form the ingroup |
-| set | not set | only `--min-count`/`--min-frac` apply to matched genomes |
-| not set | set | only `--max-count`/`--max-frac` apply to matched genomes |
+| set | not set | only ingroup quorum flags apply |
+| not set | set | only outgroup quorum flags apply |
| set | set | both constraints apply simultaneously |
## Quorum flags
@@ -89,10 +92,13 @@ For each genome:
| `--max-outgroup-count N` | outgroup | k-mer present in at most N outgroup genomes |
| `--min-outgroup-frac F` | outgroup | k-mer present in at least fraction F of outgroup genomes |
| `--max-outgroup-frac F` | outgroup | k-mer present in at most fraction F of outgroup genomes |
-| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N |
-| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N |
+| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N (`rebuild` only) |
+| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`rebuild` only) |
| `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) |
+Defaults: mins = 0 (no lower bound), max counts = group size, max fracs = 1.0
+(no upper bound). A filter with all defaults is a no-op.
+
Fractions are computed over the size of the classified group, not over total
genome count. An empty group (no genome classified as ingroup/outgroup) never
triggers a filter failure.
@@ -107,17 +113,18 @@ obikmer rebuild src --output dst \
--ingroup "species=Betula_nana" \
--outgroup "*" \
--min-count 2 \
- --max-count 0
+ --max-outgroup-count 0
```
-Keep k-mers found in at least 2 *Betula nana* genomes and absent from all *Betula*:
+Keep k-mers found in at least 2 *Betula nana* genomes and absent from all
+other *Betula*:
```sh
obikmer rebuild src --output dst \
--ingroup "species=Betula_nana" \
--outgroup "genus=Betula" \
--min-count 2 \
- --max-count 0
+ --max-outgroup-count 0
```
Use taxonomic paths — keep k-mers present in ≥ 50 % of the *Betula* clade
@@ -128,7 +135,7 @@ obikmer rebuild src --output dst \
--ingroup "taxon~/Betulaceae/Betula" \
--outgroup "taxon!~/Betulaceae" \
--min-frac 0.5 \
- --max-frac 0.1
+ --max-outgroup-frac 0.1
```
Multiple outgroup predicates (OR): exclude k-mers present in *Alnus* or *Carpinus*:
@@ -138,7 +145,28 @@ obikmer rebuild src --output dst \
--ingroup "genus=Betula" \
--outgroup "genus=Alnus" \
--outgroup "genus=Carpinus" \
- --max-count 0
+ --max-outgroup-count 0
+```
+
+The same flags work identically for `dump` and `unitig`. To dump only k-mers
+specific to *Betula nana*:
+
+```sh
+obikmer dump myindex \
+ --ingroup "species=Betula_nana" \
+ --outgroup "*" \
+ --min-count 1 \
+ --max-outgroup-count 0
+```
+
+To enumerate unitigs of the *Betula*-specific subgraph:
+
+```sh
+obikmer unitig myindex \
+ --ingroup "genus=Betula" \
+ --outgroup "*" \
+ --min-count 2 \
+ --max-outgroup-count 0
```
## Implementation
@@ -146,9 +174,15 @@ obikmer rebuild src --output dst \
- **`obikpartitionner::filter::GroupQuorumFilter`** — implements `KmerFilter`
using pre-computed ingroup and outgroup index vectors. The heavy logic
(predicate parsing, three-value evaluation, genome classification) happens
- once before the rebuild loop; each k-mer row evaluation is a simple index
+ once before any iteration; each k-mer row evaluation is a simple index
lookup and counter.
-- **`obikmer::cmd::predicate`** — predicate parsing (`MetaPred`), path matching
- (`path_matches`), three-value AND/OR evaluation, and `build_group_filter`
- which returns a ready-to-use `GroupQuorumFilter`.
+- **`obikmer::cmd::predicate::FilterArgs`** — shared `clap` argument group
+ embedded via `#[command(flatten)]` in `RebuildArgs`, `DumpArgs`, and
+ `UnitigArgs`. `FilterArgs::build_filters()` returns a ready-to-use filter
+ list.
+
+- **`obikpartitionner::KmerPartition::iter_partition_kmers`** — accepts
+ `filters: &[Box]` and applies them per-kmer before invoking
+ the callback. `rebuild`, `dump`, and `unitig` all go through this single
+ entry point.
diff --git a/docmd/index.md b/docmd/index.md
index 262a6c3..5e04a48 100644
--- a/docmd/index.md
+++ b/docmd/index.md
@@ -9,15 +9,16 @@
| `superkmer` | Extract super-kmers from a sequence file and write to stdout |
| `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
| `merge` | Merge multiple built indexes into one |
-| `rebuild` | Filter and compact an existing index into a new single-layer index |
+| `rebuild` | Filter and compact an existing index into a new single-layer index; supports ingroup/outgroup predicates on genome metadata |
| `query` | Query an index with sequences and annotate matches |
-| `dump` | Dump all indexed kmers as CSV (kmer + per-genome counts or presence) |
+| `dump` | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the same ingroup/outgroup filtering as `rebuild` |
| `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV |
| `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees |
-| `unitig` | Dump unitigs from a built index to stdout (debug) |
+| `unitig` | Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the same ingroup/outgroup filtering as `rebuild` |
| `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| `reindex` | Convert an index's evidence in-place: exact ↔ approx |
-| `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label in-place (NEW gets OLD's identity) |
+| `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label; `--upgrade-index` adds missing `layer_meta.json` to old indexes |
+| `pack` | Pack per-column matrix files into single-file format to reduce query I/O |
## Constraints
diff --git a/mkdocs.yml b/mkdocs.yml
index bee4ebb..1900320 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -49,7 +49,7 @@ nav:
- PersistentCompactIntVec: implementation/persistent_compact_int_vec.md
- PersistentBitVec: implementation/persistent_bit_vec.md
- Merge command: implementation/merge.md
- - Rebuild filters: implementation/rebuild_filter.md
+ - Kmer filtering (rebuild/dump/unitig): implementation/rebuild_filter.md
- Architecture:
- Sequences: architecture/sequences/invariant.md
- Kmer index: architecture/index_architecture.md