evidence.bin
-├── header : k (u8), n_kmers (u64), n_unitigs (u64)
-├── id_array : n_kmers × ⌈log₂ n_unitigs⌉ bits — MPHF slot → unitig_id
-└── rank_array: n_kmers × 8 bits (u8[n_kmers]) — MPHF slot → rank within unitig
+Evidence file layout (strategy B — implemented)
+evidence.bin is a flat [u32; n] array with no header:
+evidence.bin: n × 4 bytes, little-endian
+ each u32: bits [31:7] = chunk_id (25 bits)
+ bits [6:0] = rank (7 bits)
-id_array is a compact bit-packed vector (width = ⌈log₂ n_unitigs⌉; 19 bits for B. nana at 256 partitions). rank_array is a plain u8 array — no bit-packing needed. Access is O(1) with a single multiplication and mask for id_array, and a direct byte index for rank_array.
+File size = n × 4 bytes exactly. Decoding a slot: chunk_id = raw >> 7, rank = raw & 0x7F.
+The theoretical bit cost of strategy B (19 bits id + 8 bits rank = 27 bits) is not recovered: packing into aligned u32 costs 32 bits per slot. The u32 layout is chosen for simplicity and alignment — one word per slot, no bit-addressing arithmetic.
Unitig file layout
-FASTA with JSON annotation header (xxHash-64 ID, seq_length, kmer_size, n_kmers). The nucleotide sequence is stored in ASCII uppercase; a 2-bit packed version is derived at query time or stored as a parallel .2bit file for speed.
->c4a1e7f2 {"seq_length":87,"kmer_size":31,"n_kmers":57}
-ACGTGGCTA...
-
+Binary packed 2-bit nucleotide file (unitigs.bin) with a companion index (unitigs.bin.idx). The index stores: magic UIDX, n_unitigs: u32, n_kmers: u64, seqls: [u8; n_unitigs] (kmer count − 1 per chunk), and packed_offsets: [u32; n_unitigs + 1] (byte offsets into unitigs.bin, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
Decoding a kmer from slot s
-unitig_id = id_array[s]
-rank = rank_array[s]
-kmer = nucleotides(unitig_id)[rank .. rank + k] // 2-bit packed slice
+(chunk_id, rank) = evidence.decode(s) // u32 → (u25, u7)
+kmer = unitigs.raw_kmer(chunk_id, rank) // 2-bit packed slice, k nucleotides
-One array lookup per field, then a packed slice extraction. The canonical kmer is the stored sequence (by construction — only canonical kmers are inserted into the graph).
+Two memory accesses: one into evidence.bin, one into unitigs.bin. The canonical kmer is the stored sequence (by construction — only canonical kmers are inserted into the De Bruijn graph).
+Field widths in practice
+Rank is stored in 7 bits (0–127). On B. nana (k=31, m=11), the observed maximum unitig length is ~46 kmers/chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers per unitig; longer paths arise across multiple superkmers. u7 is sufficient.
+chunk_id is stored in 25 bits (0–33 554 431). For B. nana at 256 partitions, avg U ≈ 275 k — well within the 25-bit capacity.
Forward vs reverse complement
The De Bruijn graph stores only canonical kmers. The evidence encodes the canonical orientation. Callers that need the strand of the original kmer must compare the retrieved kmer with its revcomp at query time; this is a single 64-bit comparison.
@@ -1176,9 +1194,7 @@ kmer = nucleotides(unitig_id)[rank .. rank + k] // 2-bit packed slice
Open questions
-- Rank field width: u8 covers 255 kmers; storing lengths and ranks in kmer units (not nucleotides) buys k−1 extra units of headroom at no cost. On B. nana (k=31), m_u ≈ 38 — well within u8 range on average, but the maximum unitig length has not been measured yet. For genomes with very long unitigs, u16 may be needed; the header could record the actual width if portability is required.
-- Packed nucleotide cache: storing a 2-bit packed nucleotide array alongside the FASTA avoids re-encoding at query time; negligible space overhead (\(N_{nuc} / 4\) bytes per partition).
-- Cross-partition evidence: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m.
+- Cross-partition evidence: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m_u.
diff --git a/doc/index.html b/doc/index.html
index 763d9c1..22a67a0 100644
--- a/doc/index.html
+++ b/doc/index.html
@@ -978,7 +978,7 @@
obikmer is a Rust tool for manipulation, counting, indexing, and set operations on DNA sequences represented as kmer sets.
Constraints
-- Target scale: metagenomic data, tens of Gbases, billions of kmers
+- Target scale: individual genome datasets, tens of Gbases
- Maximum efficiency in computation, memory, and disk usage
- Input formats: FASTA, FASTQ, gzip, streaming stdin
diff --git a/doc/sitemap.xml.gz b/doc/sitemap.xml.gz
index 3b8a973..7a87a36 100644
Binary files a/doc/sitemap.xml.gz and b/doc/sitemap.xml.gz differ
diff --git a/docmd/implementation/mphf.md b/docmd/implementation/mphf.md
index cc7c244..3c852b0 100644
--- a/docmd/implementation/mphf.md
+++ b/docmd/implementation/mphf.md
@@ -1,57 +1,68 @@
# MPHF selection — two-phase indexing architecture
-## Indexing architecture
+## Why two phases are needed
-Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of unique kmers in a partition is not known until after counting and filtering.
+Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of surviving unique kmers is not known until after counting and filtering low-abundance kmers.
-### Superkmer vs kmer counts
+### Phase 1 — provisional MPHF + kmer spectrum
-The `SKFileMeta` sidecar written by `SKFileWriter` records `instances` (unique superkmers) and `length_sum` (total nucleotides). A superkmer of length L contains L − k + 1 kmers, so the kmer count per partition can be estimated as `length_sum − instances × (k − 1)`. This is an **overestimate** of unique kmers: two distinct superkmers (different flanking contexts, same minimizer) can share kmers. The exact count of unique kmers is only known after enumerating and deduplicating them.
+Implemented in `obikpartitionner::KmerPartition::count_kmer()`.
-Note: two superkmers sharing a kmer necessarily share the same minimizer and therefore always land in the same partition — no kmer can appear in two different partitions.
+1. **Pass 1**: read the dereplicated superkmer file; enumerate all unique canonical kmers into a `HashSet`. Exact count known after this pass.
+2. **Build a provisional MPHF** (`GOFunction` from the `ph` crate) over the exact kmer set. Produces `mphf1.bin`.
+3. **Create `counts1.bin`**: one zero-initialised `u32` per MPHF slot (mmap'd).
+4. **Pass 2**: re-read the dereplicated file; for each kmer, query `mphf1.get(kmer)` and atomically accumulate the superkmer count into `counts1[slot]`.
+5. **Build kmer frequency spectrum** from `counts1`: histogram `{count → n_kmers}`, totals f0 (distinct kmers) and f1 (total abundance). Written to `kmer_spectrum_raw.json` per partition, then merged globally.
-### Phase 1 — provisional index and spectrum
+Files produced per partition:
-1. Enumerate all kmers from the dereplicated superkmers of the partition.
-2. Build a provisional MPHF over this key set; capacity is pre-allocated from the sidecar estimate (slight overestimate, harmless).
-3. Accumulate counts: for each kmer in each superkmer, `count[MPHF(kmer)] += sk.count()`.
-4. Compute the kmer frequency spectrum (histogram: occurrences → number of kmers).
-5. Apply count filter (e.g. discard singletons). After filtering, the exact number of surviving kmers is known.
-6. Discard the provisional MPHF.
+```
+part_XXXXX/
+ mphf1.bin — GOFunction (provisional MPHF, discarded after phase 2)
+ counts1.bin — [u32; n_kmers] kmer counts, mmap'd
+ kmer_spectrum_raw.json — local frequency spectrum
+```
-### Phase 2 — definitive index
+### Phase 2 — definitive MPHF
-Build a new MPHF over the filtered kmer set only, with the exact key count available. This is the persistent per-partition index used for all downstream operations (queries, set operations).
+After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see [Construction pipeline](pipeline.md)), the exact filtered kmer set is available via `unitigs.bin`.
+
+`MphfLayer::build` is called on the unitig file:
+
+1. **Pass 1**: iterate all canonical kmers from `unitigs.bin` in parallel, build and store `mphf.bin` (ptr_hash).
+2. **Pass 2**: iterate sequentially, fill `evidence.bin`, call the mode-specific `fill_slot` callback.
+
+`mphf1.bin` and `counts1.bin` are no longer needed after phase 2 and can be deleted.
---
-## Candidates
+## MPHF candidates
**boomphf** (BBHash algorithm, maintained by 10X Genomics):
- ~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)
-- Parallel construction; well-tested with DNA kmer data at scale
-- Drawback: largest space footprint; streaming construction (no exact count needed) was its main differentiator — irrelevant here since exact count is available at phase 2
+- Supports streaming construction (no exact count needed)
+- Drawback: largest space footprint; streaming advantage is irrelevant at phase 2 since the exact count is available
**ptr_hash** (PtrHash algorithm, Groot Koerkamp, SEA 2025):
-- ~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64 in tight loops) and fastest construction (≥3.1×)
-- Requires exact key count at construction — available at phase 2
-- Drawback: published February 2025 — very young, no production track record
+- ~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64) and fastest construction (≥3.1×)
+- Requires exact key count at construction — available at both phases after pass 1
+- Published February 2025; accepted given performance profile and the fact that each MPHF is independently rebuildable from its unitig file
-**FMPHGO** (`ph` crate, Beling, ACM JEA 2023):
+**FMPH/FMPHGO** (`ph` crate, Beling, ACM JEA 2023):
-- ~2.1 bits/key — most compact of the three; good query speed; parallelisable construction
-- More established than ptr_hash; actively maintained
-- Works well with overestimated capacity → natural fit for phase 1
+- ~2.1 bits/key — most compact; good query speed; deterministic construction
+- Works well from an exact or slightly overestimated count
+- `GOFunction` (group-oriented variant) is the specific type used
## MPHF choice per phase
-**Phase 1** (provisional, discarded after spectrum computation): FMPHGO. Tolerates overestimated capacity, compact, no need to optimise for query speed on a temporary structure.
+**Phase 1** (provisional, discarded after spectrum computation): `ph::fmph::GOFunction`. Compact, fast to build from the exact post-dedup kmer set. Query speed is secondary — the structure is only used during pass 2 of `count_kmer`.
-**Phase 2** (persistent, queried repeatedly): **ptr_hash**. Exact key count is available at phase 2, so ptr_hash operates optimally. Its query speed (≥2.1× over FMPHGO) and construction speed (≥3.1×) are meaningful for the persistent index; the space overhead at 2.4 bits/key is acceptable. The crate's youth (Feb 2025) was previously a concern; it is now accepted given the performance profile and the fact that each layer MPHF is independently rebuildable from its unitig file if needed.
+**Phase 2** (persistent, queried repeatedly): **ptr_hash**. Exact key count is available from the unitig index; ptr_hash query speed (≥2.1×) and construction speed (≥3.1× over FMPH) are the decisive factors. The 2.4 bits/key overhead is acceptable.
-boomphf is effectively eliminated: its space overhead is the largest and its streaming-construction advantage does not apply here.
+boomphf is eliminated: largest space overhead, streaming advantage does not apply.
---
@@ -63,74 +74,68 @@ For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):
|----------|----------|-----------------|
| boomphf | 3.7 | ~47 GB |
| ptr_hash | 2.4 | ~31 GB |
-| FMPHGO | 2.1 | ~27 GB |
+| FMPH | 2.1 | ~27 GB |
For a human genome at 30× coverage with 1 024 partitions, realistic partition sizes are 3–30 M unique kmers → 1–8 MB per phase-2 MPHF, well within RAM.
-## On-disk and mmap considerations
+---
-All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the `ph` crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1–8 MB, the OS page cache handles this transparently — strict zero-copy is a refinement, not a blocker.
+## ptr_hash configuration (phase 2)
-No established Rust crate provides a natively on-disk MPHF. **SSHash** (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.
+```rust
+type Mphf = PtrHash<
+ u64, // key: canonical kmer raw encoding
+ CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
+ CachelineEfVec>, // remap: 11.6 bits/entry (Elias-Fano)
+ Xx64, // hasher: XXH3-64 with seed
+ Vec, // pilots
+>;
+```
+
+**Hasher — `Xx64`**: canonical kmer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). `FxHash` (single multiply) distributes these poorly; `Xx64` (XXH3-64, seeded) handles structured input correctly.
+
+**Bucket function — `CubicEps`**: λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, 20% less space. `default_compact` (λ=4.0) saves a further 12.5% at 2× more construction time — not chosen.
+
+**Remap — `CachelineEfVec`**: Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for `Vec`). One cacheline per query; space win dominates at billion-scale key counts.
---
## Multilayer index architecture
-### Motivation
-
-An index built from a single dataset A can be extended with a new dataset B without rebuilding. This supports incremental construction (adding species, samples, or sequencing runs) and enables set operations across heterogeneous sources.
-
### Layer structure
-Each layer is a self-contained unit:
+Each layer is a self-contained unit. See [obilayeredmap](obilayeredmap.md) for the full on-disk layout. The MPHF-relevant files are:
```
layer_i/
- unitigs.bin — packed 2-bit nucleotide sequences
- mphf.bin — ptr_hash index (phase-2, exact key count)
- evidence.bin — [(unitig_id, rank)] per MPHF slot (see unitig_evidence.md)
- counts.bin — [u32] per MPHF slot
+ unitigs.bin — packed 2-bit nucleotide sequences (kmer evidence)
+ mphf.bin — ptr_hash phase-2 MPHF
+ evidence.bin — n × u32: (chunk_id: 25 bits | rank: 7 bits) per slot
```
-Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B proceeds as follows:
+Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:
-1. For each kmer in B: query layer 0 — if found, accumulate count into `counts_0[MPHF_0(kmer)]`.
-2. Collect all kmers of B not present in any existing layer → set `B \ A`.
-3. Build layer 1 from `B \ A` using the standard two-phase pipeline (spectrum, filter, ptr_hash).
-
-Adding a third dataset C repeats the process: probe layer 0, then layer 1, then build layer 2 from `C \ A \ B`.
+1. For each kmer in B: probe existing layers. If found, the kmer is already indexed.
+2. Collect kmers of B not present in any layer → set `B \ A`.
+3. Build layer 1 from `B \ A` (dereplicate → count → De Bruijn → unitigs → `MphfLayer::build`).
### Membership verification
-ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from `(unitig_id, rank)` and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.
-
-This makes the evidence layer load-bearing for correctness, not only for locality.
+ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from `(chunk_id, rank)` and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.
### Query algorithm
```
-fn query(kmer) → Option:
- for layer in layers:
- slot = layer.mphf.query(kmer)
- if layer.evidence.decode(slot) == kmer:
- return Some(layer.counts[slot])
+fn query(kmer) → Option<(layer_index, slot)>:
+ for (i, layer) in layers.iter().enumerate():
+ slot = layer.mphf.index(kmer)
+ if layer.evidence.decode(slot) matches kmer:
+ return Some((i, slot))
return None
```
-Expected probe depth: 1 for kmers present in layer 0, increasing for rare kmers added in later layers. In practice, the dominant dataset (largest A) should be layer 0 to minimise average probe depth.
-
-### Layer count and probe cost
-
-Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode (two array accesses). For L layers the worst case is L probes + 1 None. In practice L is small (2–5 for typical multi-species databases). No global data structure is needed to route queries; the layer chain is traversed in order.
+Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode.
### Merging layers
-Two layer chains can be merged by re-indexing their union through the standard pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.
-
-## Open questions
-
-- Confirm actual partition sizes and overestimation factor on representative metagenomic datasets.
-- **rkyv integration**: all flat arrays in a layer (evidence, counts, presence/absence matrix) map trivially to `rkyv::Archive` — fixed-size element types, no heap indirection. The presence/absence matrix is the strongest case: at 10 M kmers × 1 000 samples ≈ 1.25 GB per partition, zero-copy mmap via rkyv avoids loading the entire matrix at open time, letting the OS page cache serve only accessed pages. ptr_hash itself is internally a flat bit array and is structurally compatible with rkyv, but requires either native crate support or a wrapper. Assess the wrapper cost and whether ptr_hash is willing to adopt rkyv upstream.
-- Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.
-- Determine optimal layer ordering heuristic (by kmer count? by query frequency?) for multi-species databases.
+Two layer chains can be merged by re-indexing their union through the full pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.
diff --git a/docmd/implementation/obilayeredmap.md b/docmd/implementation/obilayeredmap.md
index fd4f191..cec43a2 100644
--- a/docmd/implementation/obilayeredmap.md
+++ b/docmd/implementation/obilayeredmap.md
@@ -2,40 +2,66 @@
## Purpose
-`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **collection → partition → layer**. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
+`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **index root → partition → layer**. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
---
-## Four usage modes
+## Three usage modes
-The MPHF + evidence infrastructure is fixed for all modes. The **payload** — data associated with each slot — is orthogonal and varies by mode.
+The MPHF + evidence infrastructure is the same for all modes. The **payload** varies.
| Mode | Description | Payload type | Storage |
|---|---|---|---|
| 1. Set | membership test only | `()` | — |
| 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory |
-| 3. Presence/absence matrix | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |
-| 4. Count matrix | occurrences per kmer per genome | `PersistentCompactIntMatrix` | `counts/` directory |
+| 3. Presence/absence | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |
-Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate. Mode 3 has a build path (`Layer::::build_presence`); mode 4 is not yet implemented.
-
-### Payload for modes 2/4: PersistentCompactIntMatrix
-
-`PersistentCompactIntMatrix` is a column-major matrix stored in a directory: one `col_NNNNNN.pciv` file per column, plus a `meta.json`. Each column is a `PersistentCompactIntVec` — a mmap'd PCIV file with a `u8` primary array (255 = overflow sentinel), a sorted overflow section of `(slot: u64, value: u32)` entries, and a sparse L1-fitting index.
-
-Mode 2 writes 1 column per layer (one sample). Mode 4 writes G columns (one per genome). `read(slot)` returns `Box<[u32]>` — the full row across all columns.
-
-### Payload for mode 3: PersistentBitMatrix
-
-`PersistentBitMatrix` is a column-major bit matrix stored in a directory: one `col_NNNNNN.pbiv` per genome, plus `meta.json`. Each column is a `PersistentBitVec` — a mmap'd PBIV file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). `read(slot)` returns `Box<[bool]>` — the presence vector across all genomes.
-
-Column-major layout makes per-genome set operations cache-friendly; the full row is assembled on demand at query time.
+Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate.
---
-## Payload architecture
+## MphfLayer — autonomous kmer → slot mapping
-The payload is orthogonal to the MPHF + evidence layer. `Layer` is parameterised by `D: LayerData`:
+`MphfLayer` encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.
+
+```rust
+pub struct MphfLayer {
+ mphf: Mphf,
+ evidence: Evidence,
+ unitigs: UnitigFileReader,
+ n: usize, // number of indexed kmers = number of MPHF slots
+}
+```
+
+Public API:
+
+```rust
+impl MphfLayer {
+ pub fn open(dir: &Path) -> OLMResult
+ pub fn find(&self, kmer: CanonicalKmer) -> Option // Some(slot) or None
+ pub fn n(&self) -> usize
+ pub fn unitig_writer(dir: &Path) -> OLMResult
+ pub(crate) fn build(
+ dir: &Path,
+ fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
+ ) -> OLMResult
+}
+```
+
+`find` returns `Some(slot)` only after verifying via evidence that the kmer is actually indexed. It returns `None` for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test).
+
+`build` runs two sequential passes over `unitigs.bin`:
+
+1. **Pass 1**: iterate all canonical kmers in parallel via rayon, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
+2. **Pass 2**: iterate again sequentially, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.
+
+For empty layers (n = 0), `build` returns `Ok(0)` immediately after creating empty `mphf.bin` and `evidence.bin`.
+
+---
+
+## Layer\ — MPHF + payload
+
+`Layer` pairs an `MphfLayer` with one payload store.
```rust
pub trait LayerData: Sized {
@@ -45,10 +71,8 @@ pub trait LayerData: Sized {
}
pub struct Layer {
- mphf: Mphf,
- evidence: Evidence,
- unitigs: UnitigFileReader,
- data: D,
+ mphf: MphfLayer,
+ data: D,
}
pub struct Hit {
@@ -57,115 +81,15 @@ pub struct Hit {
}
```
-`LayerData` covers the **read path only** (`open` + `read`). The write path (build) is intentionally not in the trait — build signatures differ between modes and forcing this into a trait would require an associated `Context` type with no benefit over specialized `impl` blocks.
-
-Implemented concrete types:
+`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not in the trait.
| Type | `Item` | Description |
|---|---|---|
| `()` | `()` | mode 1 — membership only |
-| `PersistentCompactIntMatrix` | `Box<[u32]>` | modes 2/4 — one count per column |
-| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — one presence bit per column |
+| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) |
+| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) |
-`LayeredMap` mirrors the same parameterisation: `LayeredMap`.
-
----
-
-## Three-level hierarchy
-
-```
-index_root/ ← LayeredMap (collection)
- meta.json
- part_00000/ ← Partition
- layer_0/ ← Layer
- mphf.bin
- unitigs.bin
- unitigs.bin.idx
- evidence.bin
- counts/ [modes 2/4]
- meta.json {"n": N, "n_cols": 1}
- col_000000.pciv
- presence/ [mode 3]
- meta.json {"n": N, "n_cols": G}
- col_000000.pbiv
- col_000001.pbiv
- ...
- layer_1/
- ...
- part_00001/
- layer_0/
- ...
-```
-
-**Collection** (`index_root/`): global metadata — kmer size k, number of partitions, layer count, sample registry.
-
-**Partition** (`part_XXXXX/`): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in `part_XXXXX`. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in `meta.json`.
-
-**Layer** (`layer_N/`): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer.
-
----
-
-## Layer file layout
-
-```
-layer_N/
- mphf.bin — ptr_hash MPHF (epserde, ptr_hash native format)
- unitigs.bin — packed 2-bit nucleotide sequences (obiskio binary format)
- unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
- evidence.bin — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
- counts/ — [modes 2/4] PersistentCompactIntMatrix
- presence/ — [mode 3] PersistentBitMatrix
-```
-
-`unitigs.bin` is the packed-2-bit sequence file produced by `obiskio::UnitigFileWriter`. The companion `.idx` file stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
-
-### Evidence encoding
-
-Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction.
-
-```
-slot s → unitig_id: u25 | rank: u7
-```
-
-Packed into a `u32` (29 bits used, 3 spare). Decoding:
-
-```
-kmer = unitigs[unitig_id][rank .. rank + k] // 2-bit packed slice
-```
-
-`rank` is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k − m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A `u7` (0–127) is sufficient.
-
----
-
-## ptr_hash configuration
-
-The MPHF per layer is configured as:
-
-```rust
-type Mphf = PtrHash<
- u64, // key type: canonical kmer raw encoding
- CubicEps, // bucket fn: balanced (2.4 bits/key, λ=3.5)
- CachelineEfVec>, // remap: 11.6 bits/entry vs 32 for Vec
- Xx64, // hasher: XXH3-64 with seed, handles structured keys
- Vec, // pilots
->;
-```
-
-**Hasher choice — `Xx64`:** k-mer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). `FxHash` (single multiply) distributes these poorly. `Xx64` (XXH3 64-bit, seeded) handles structured input correctly.
-
-**Bucket function — `CubicEps` with `PtrHashParams::::default()`:** λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than `Linear/λ=3.0` (the `default_fast` preset), 20% less space. `default_compact` (λ=4.0) saves a further 12.5% at 2× more construction time and reduced reliability — not chosen.
-
-**Remap — `CachelineEfVec`:** Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for `Vec`). Already a transitive dependency of `ptr_hash`. One cacheline per query vs one u32 read; space win dominates for billion-scale key sets.
-
----
-
-## Build path
-
-The build path is not part of `LayerData`. Each mode exposes its own `impl Layer::build` with the exact signature it needs. Two private module-level helpers avoid code duplication:
-
-**`build_mphf(out_dir, n) -> OLMResult`**: first pass — opens `unitigs.bin`, iterates all canonical kmers in parallel via `new_from_par_iter`, stores `mphf.bin`. O(n).
-
-**`build_second_pass(out_dir, n, mphf, fill_slot) -> OLMResult<()>`**: second pass — opens `unitigs.bin` again, fills `evidence.bin` and a compact n/8-byte seen-bitset (MPHF correctness check inline), calls `fill_slot(slot, kmer)` once per kmer for the mode-specific payload. O(n).
+**Build signatures:**
```rust
// mode 1
@@ -173,7 +97,7 @@ impl Layer<()> {
pub fn build(out_dir: &Path) -> OLMResult
}
-// modes 2/4
+// mode 2
impl Layer {
pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult
pub fn build_from_map(out_dir: &Path, counts: &HashMap) -> OLMResult
@@ -189,33 +113,119 @@ impl Layer {
}
```
-Mode 2 creates a `PersistentCompactIntMatrixBuilder` with 1 column and fills it via `build_second_pass`. Mode 3 creates a `PersistentBitMatrixBuilder` with `n_genomes` columns and fills all columns in a single pass.
+All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Mode 2 pre-reads `n_kmers` from `unitigs.bin` to size the `PersistentCompactIntMatrixBuilder` before calling `MphfLayer::build`. Mode 3 does the same for `PersistentBitMatrixBuilder`.
-Any duplicate slot or out-of-bounds index detected during `build_second_pass` returns `OLMError::Mphf`. `new_from_par_iter` avoids materialising all keys as `Vec`.
+---
+
+## LayeredStore\ and aggregation traits
+
+`LayeredStore` is a generic aggregation wrapper over `Vec`. It propagates three traits from `obicompactvec::traits` up the hierarchy via blanket impls:
+
+```rust
+pub struct LayeredStore(pub Vec);
+
+impl ColumnWeights for LayeredStore { … } // Σ col_weights across inner stores
+impl CountPartials for LayeredStore { … } // element-wise Σ partials
+impl BitPartials for LayeredStore { … } // element-wise Σ partials
+```
+
+Because blanket impls compose, `LayeredStore>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type.
+
+**Aggregation hierarchy:**
+
+```
+PersistentCompactIntMatrix implements CountPartials
+LayeredStore via blanket impl (one partition)
+LayeredStore> via blanket impl (partitioned index)
+```
+
+**Leaf implementors** (in `obicompactvec`):
+
+| Type | Traits |
+|---|---|
+| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
+| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |
+
+`PersistentCompactIntVec` and `PersistentBitVec` do not implement these traits — they are single-column primitives, not matrix-level aggregators.
+
+See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.
+
+---
+
+## On-disk structure
+
+```
+index_root/ ← LayeredMap (collection)
+ meta.json
+ part_00000/ ← Partition
+ layer_0/ ← Layer
+ mphf.bin — ptr_hash MPHF (epserde format)
+ unitigs.bin — packed 2-bit nucleotide sequences
+ unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
+ evidence.bin — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
+ counts/ [mode 2] PersistentCompactIntMatrix
+ meta.json {"n": N, "n_cols": 1}
+ col_000000.pciv
+ presence/ [mode 3] PersistentBitMatrix
+ meta.json {"n": N, "n_cols": G}
+ col_000000.pbiv
+ …
+ layer_1/
+ …
+ part_00001/
+ …
+```
+
+**Partition** (`part_XXXXX/`): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.
+
+**Layer** (`layer_N/`): one `MphfLayer` plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.
+
+---
+
+## Evidence encoding
+
+`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:
+
+```
+bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
+bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
+```
+
+Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id`.
+
+For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.
+
+---
+
+## ptr_hash configuration
+
+```rust
+type Mphf = PtrHash<
+ u64, // key type: canonical kmer raw encoding
+ CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
+ CachelineEfVec>, // remap: 11.6 bits/entry (Elias-Fano)
+ Xx64, // hasher: XXH3-64 with seed
+ Vec, // pilots
+>;
+```
+
+`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
+
+`CubicEps` with `PtrHashParams::::default()` (λ=3.5) is a balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, 20% less space.
---
## Query path
-A kmer query routes through all three levels:
-
-1. **Partition routing**: hash canonical minimiser of the query kmer → partition index → open `part_XXXXX/`.
-2. **Layer probing**: iterate layers in order; for each layer compute `slot = mphf.index(kmer)`, decode evidence, compare to query. First match wins.
-3. **Data access**: `layer.data.read(slot)` returns `D::Item`.
-
```rust
-// pseudo-code
-fn query(kmer) -> Option<(usize, Hit)>:
- for (i, layer) in self.layers.iter().enumerate():
- slot = layer.mphf.index(&kmer.raw())
- if layer.evidence.decode(slot) == kmer:
- return Some((i, Hit { slot, data: layer.data.read(slot) }))
- return None
+pub fn query(&self, kmer: CanonicalKmer) -> Option> {
+ self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) })
+}
```
-Expected probe depth: 1 for kmers in layer 0, increasing for later layers.
+`MphfLayer::find` probes the MPHF, decodes evidence, and verifies the kmer — returning `Some(slot)` on match, `None` otherwise. `data.read(slot)` is called only on a confirmed hit.
-For mode 2, `hit.data` is `Box<[u32]>` with 1 element; `hit.data[0]` is the count. For mode 3, `hit.data` is `Box<[bool]>` with G elements, one per genome.
+In `LayeredMap`, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.
---
@@ -223,11 +233,11 @@ For mode 2, `hit.data` is `Box<[u32]>` with 1 element; `hit.data[0]` is the coun
When adding dataset B to an existing index:
-1. For each partition, iterate kmers of B routed to that partition.
-2. Probe existing layers; collect kmers absent from all layers → `B \ index`.
-3. Build a new layer from `B \ index`.
-4. Append the new layer directory under each `part_XXXXX/`.
-5. Update `meta.json` (layer count, sample registry).
+1. For each partition, probe existing layers for kmers of B routed to that partition.
+2. Collect kmers absent from all layers → `B \ index`.
+3. Write `B \ index` to a new `unitigs.bin` via `MphfLayer::unitig_writer`.
+4. Call `Layer::build` on the new directory.
+5. Update `meta.json`.
Each partition's new layer is built independently; the operation is fully parallel across partitions.
@@ -237,24 +247,11 @@ Each partition's new layer is built independently; the operation is fully parall
| crate | role |
|---|---|
-| `ptr_hash 1.1` | MPHF per layer (epserde serialisation) |
-| `cacheline-ef 1.1` | compact remap storage inside ptr_hash |
-| `epserde 0.8` | zero-copy serialisation of MPHF |
-| `memmap2` | mmap of layer files |
+| `ptr_hash 1.1` | MPHF per layer |
+| `cacheline-ef 1.1` | compact remap inside ptr_hash |
+| `epserde 0.8` | zero-copy MPHF serialisation |
+| `memmap2 0.9` | mmap of evidence and payload files |
| `obiskio` | unitig file writer/reader |
-| `obicompactvec` | payload types: `PersistentCompactIntMatrix`, `PersistentBitMatrix` |
-
----
-
-## Relationship to target architecture
-
-The target architecture (see [Kmer index architecture](../architecture/index_architecture.md)) separates `MphfLayer` from data stores entirely and introduces a `PartitionedIndex` with parallel dispatch and an `Aggregator` pattern. The current implementation is a stepping stone: `obicompactvec` types are already fully decoupled from the MPHF; the remaining refactoring is within `obilayeredmap` itself.
-
----
-
-## Open questions
-
-- **Mode 4**: count matrix (n_kmers × n_genomes × bytes_per_count) is structurally identical to mode 3 but uses `PersistentCompactIntMatrix` with G columns. Build API not yet implemented. Scale concern: hundreds of GB for large collections — a sparse representation may be required at high genome counts.
-- **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model.
-- **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.
-- **`try_new_from_par_iter`**: `ptr_hash::new_from_par_iter` silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A `try_new_from_par_iter` PR upstream would close this gap.
+| `obicompactvec` | payload types + aggregation traits |
+| `rayon 1` | parallel MPHF construction pass |
+| `ndarray 0.16` | aggregation output arrays |
diff --git a/docmd/implementation/persistent_bit_vec.md b/docmd/implementation/persistent_bit_vec.md
index db8399e..3622c7b 100644
--- a/docmd/implementation/persistent_bit_vec.md
+++ b/docmd/implementation/persistent_bit_vec.md
@@ -236,3 +236,35 @@ impl LayerData for PersistentBitMatrix {
fn read(&self, slot: usize) -> Box<[bool]> { self.row(slot) }
}
```
+
+---
+
+## Aggregation traits — `obicompactvec::traits`
+
+`PersistentBitMatrix` implements two aggregation traits used by `LayeredStore` for cross-layer and cross-partition distance computations.
+
+### ColumnWeights
+
+```rust
+impl ColumnWeights for PersistentBitMatrix {
+ fn col_weights(&self) -> Array1 // = self.count_ones()
+}
+```
+
+`col_weights()[c]` = number of set bits in column `c` across all slots.
+
+### BitPartials
+
+```rust
+impl BitPartials for PersistentBitMatrix {
+ // Self-contained partials (additive across layers)
+ fn partial_jaccard(&self) -> (Array2, Array2) // (inter, union)
+ fn partial_hamming(&self) -> Array2 // differing bits
+
+ // Provided finalisations
+ fn jaccard_dist_matrix(&self) -> Array2
+ fn hamming_dist_matrix(&self) -> Array2
+}
+```
+
+`partial_jaccard` returns `(inter, union)` as a pair because `union` is not reconstructible from per-column `count_ones()` — it depends on both columns simultaneously. Both components are additively decomposable across `(partition, layer)` pairs; the final `jaccard_dist_matrix()` is computed from their element-wise sums.
diff --git a/docmd/implementation/persistent_compact_int_vec.md b/docmd/implementation/persistent_compact_int_vec.md
index 3808262..d025919 100644
--- a/docmd/implementation/persistent_compact_int_vec.md
+++ b/docmd/implementation/persistent_compact_int_vec.md
@@ -258,3 +258,51 @@ impl LayerData for PersistentCompactIntMatrix {
fn read(&self, slot: usize) -> Box<[u32]> { self.row(slot) }
}
```
+
+---
+
+## Aggregation traits — `obicompactvec::traits`
+
+`PersistentCompactIntMatrix` implements two aggregation traits used by `LayeredStore` for cross-layer and cross-partition distance computations.
+
+### ColumnWeights
+
+```rust
+impl ColumnWeights for PersistentCompactIntMatrix {
+ fn col_weights(&self) -> Array1 // = self.sum()
+}
+```
+
+`col_weights()[c]` = sum of all values in column `c` across all slots.
+
+### CountPartials
+
+```rust
+impl CountPartials for PersistentCompactIntMatrix {
+ // Self-contained partials (additive across layers, no external parameter)
+ fn partial_bray(&self) -> Array2
+ fn partial_euclidean(&self) -> Array2
+ fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2, Array2)
+
+ // Normalised partials (require global col_weights across all layers/partitions)
+ fn partial_relfreq_bray(&self, global: &Array1) -> Array2
+ fn partial_relfreq_euclidean(&self, global: &Array1) -> Array2
+ fn partial_hellinger(&self, global: &Array1) -> Array2
+
+ // Provided finalisations (default implementations on the trait)
+ fn bray_dist_matrix(&self) -> Array2
+ fn euclidean_dist_matrix(&self) -> Array2
+ fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2
+ fn relfreq_bray_dist_matrix(&self) -> Array2
+ fn relfreq_euclidean_dist_matrix(&self) -> Array2
+ fn hellinger_dist_matrix(&self) -> Array2
+}
+```
+
+**Self-contained partials** are additively decomposable: summing `partial_bray()` across all `(partition, layer)` pairs and finalising gives the same result as computing on the combined data.
+
+**Normalised partials** require the global column weights (sum across all layers and all partitions). The `global` parameter must reflect the complete index, not a per-layer sum. The provided `relfreq_bray_dist_matrix()` etc. call `col_weights()` first (pass 1) then the normalised partial (pass 2); when called on a `LayeredStore>` these two-pass calls cascade automatically through the blanket impls.
+
+**`partial_bray` returns `Array2`** (sum_min only, not a tuple). The denominator is always reconstructible as `col_weights()[i] + col_weights()[j]`.
+
+**`partial_threshold_jaccard` returns `(inter, union)`** as a pair because `union[i,j]` is not reconstructible from per-column statistics — it depends on both columns simultaneously.
diff --git a/docmd/implementation/pipeline.md b/docmd/implementation/pipeline.md
index ebe9c08..0c49f88 100644
--- a/docmd/implementation/pipeline.md
+++ b/docmd/implementation/pipeline.md
@@ -134,28 +134,28 @@ Output: `unitigs.bin` — the permanent evidence structure for the partition. Ea
## Phase 6 — MPHF construction and index finalisation
-Built once on the definitive kmer set (all kmers in all unitigs of the partition):
+Built once on the definitive kmer set (all kmers in all unitigs of the partition). See [obilayeredmap](obilayeredmap.md) and [MPHF selection](mphf.md) for the current implementation.
```
kmers from unitigs → MPHF → mphf.bin
- → counts.bin : packed n-bit array (or 1-bit for presence mode)
- → refs.bin : u32 nucleotide offset into unitigs.bin per kmer
+ → evidence.bin : n × u32, each = (chunk_id: 25 bits | rank: 7 bits)
+ → payload : counts/ (mode 2) or presence/ (mode 3)
```
-The MPHF is built once — no rebuild. The n-bit width for `counts.bin` is chosen from the observed count distribution (n=5 covers ~97% of kmers at 15x; n=1 for presence mode). Counts exceeding 2ⁿ−1 go into `overflow.bin` as sorted `(mphf_index: u32, count: u32)` pairs.
+The MPHF is built in two passes over `unitigs.bin`: parallel pass for `mphf.bin`, sequential pass for `evidence.bin` and payload. The exact kmer count is available from the unitig index (`unitigs.bin.idx`) before the passes begin.
**Exact verification via unitig evidence:**
-`unitigs.bin` serves as the evidence structure: for any query kmer, the stored unitig provides the ground truth to confirm or deny its presence. The MPHF maps every input to [0, N) including absent kmers — the unitig read-back is the only way to guarantee exactness.
+`unitigs.bin` serves as the evidence structure. The MPHF maps every input to `[0, N)` including absent kmers — the unitig read-back (via `evidence.bin`) is the only correct membership test.
```
query kmer q
- → canonical_minimizer(q) → hash → PART → part_XXXX/
- → MPHF(q) → index i
- → refs[i] = (unitig_id, kmer_offset)
- → read unitig from unitigs.bin → extract kmer at kmer_offset → compare with q
- → match : return counts[i] ← exact hit
- → no match: kmer absent ← MPHF collision on absent kmer
+ → canonical_minimizer(q) → hash → PART → part_XXXXX/
+ → MPHF(q) → slot s
+ → evidence[s] = (chunk_id, rank)
+ → read k nucleotides at rank in unitigs[chunk_id] → compare with q
+ → match : return payload[s] ← exact hit
+ → no match: kmer absent ← MPHF collision on absent kmer
```
-One random disk access into `unitigs.bin` per query; the unitig is the minimal, non-redundant evidence (each kmer stored once). `superkmers.bin.gz` is no longer needed at this point and can be deleted.
+`superkmers.bin.gz` is no longer needed at this point and can be deleted.
diff --git a/docmd/implementation/storage.md b/docmd/implementation/storage.md
index 86a90c8..defabf4 100644
--- a/docmd/implementation/storage.md
+++ b/docmd/implementation/storage.md
@@ -1,61 +1,5 @@
# On-disk collection structure
-Collections are too large to hold in RAM (hundreds of genomes, billions of kmers). The collection lives on disk as a directory of memory-mapped files:
+See [obilayeredmap crate](obilayeredmap.md) for the current on-disk layout.
-```
-collection/
- metadata.toml — collection parameters (see below)
- part_XXXX/
- superkmers.bin.gz — dereplicated super-kmers for this partition (construction artifact)
- mphf.bin — minimal perfect hash function for this partition
- counts.bin — packed n-bit count array (or 1-bit presence array)
- refs.bin — back-references u32 nucleotide offset into unitigs.bin per kmer
- unitigs.bin — local de Bruijn unitigs (permanent evidence structure)
- overflow.bin — counts exceeding the packed range (optional)
-```
-
-`superkmers.bin.gz` is produced during phase 1 and consumed through phases 2–4. It can be deleted after phase 5 — it is not needed for querying. The permanent query structure is `mphf.bin + counts.bin + refs.bin + unitigs.bin`.
-
-## Collection parameters
-
-Stored in `metadata.toml`:
-
-| Parameter | Role |
-|-----------|------|
-| k | kmer length |
-| m | minimizer length (odd, < k) |
-| p | partition bits (0 ≤ p ≤ min(14, 2m−16)) |
-| mode | `presence` (1 bit/kmer) or `count` (n bits/kmer) |
-| n | bits per kmer in count mode (chosen at construction) |
-| min_count | singleton filtering threshold (0 = keep all) |
-| hash_fn | hash function identifier |
-| hash_seed | seed for the hash function |
-
-## Count storage
-
-**refs.bin capacity:** `unitigs.bin` is a flat 2-bit-packed nucleotide stream with no separators. Each entry in `refs.bin` is a u32 nucleotide offset pointing to the first base of the kmer. A u32 covers 4 billion nucleotide positions = 1 GB of sequence per partition. In the worst case (all unitigs of length 1 kmer, offsets spaced k apart), this supports 4 billion / k ≈ 130 million kmers per partition at k=31. In the typical case (long unitigs, consecutive kmers at offset +1), the limit approaches 4 billion kmers — well beyond any realistic partition size.
-
-*Presence mode* (coverage ≤ 1x, or when only presence/absence matters):
-
-- `counts.bin` is a packed 1-bit array — all bits set to 1 for indexed kmers
-- Singletons are the signal, not filtered
-
-*Count mode* (coverage > 1x):
-
-- `counts.bin` is a packed n-bit array; n chosen at construction from the observed distribution
-- Value 0: absent sentinel; values 1..2ⁿ−2: direct count; value 2ⁿ−1: overflow
-- Overflow counts stored in a separate `overflow.bin` as sorted `(index: u32, count: u32)` pairs
-- Empirically (k=31, 15x coverage): n=5 covers 97% of real kmers, n=6 covers 99%
-- min_count threshold filters low-frequency kmers (errors) before indexing; for ≤1x, min_count=0
-
-## Query protocol
-
-```
-query kmer q
- → canonical_minimizer(q) → hash → PART → part_XXXX/
- → MPHF(q) → index i
- → refs[i] = (unitig_id, kmer_offset)
- → read unitig from unitigs.bin → extract kmer at kmer_offset → compare with q
- → match : return counts[i]
- → no match: kmer absent
-```
+The index root contains one `part_XXXXX/` directory per partition, each holding one or more `layer_N/` directories. Each layer directory contains `mphf.bin`, `unitigs.bin`, `unitigs.bin.idx`, `evidence.bin`, and optionally a `counts/` or `presence/` payload directory.
diff --git a/docmd/implementation/unitig_evidence.md b/docmd/implementation/unitig_evidence.md
index e5a97a8..61ae3df 100644
--- a/docmd/implementation/unitig_evidence.md
+++ b/docmd/implementation/unitig_evidence.md
@@ -191,35 +191,38 @@ Strategy B partially decouples evidence cost from P: `log₂(U) = log₂(P/m_u)`
## Implementation notes
-### Evidence file layout (strategy B)
+### Evidence file layout (strategy B — implemented)
+
+`evidence.bin` is a flat `[u32; n]` array with no header:
```
-evidence.bin
-├── header : k (u8), n_kmers (u64), n_unitigs (u64)
-├── id_array : n_kmers × ⌈log₂ n_unitigs⌉ bits — MPHF slot → unitig_id
-└── rank_array: n_kmers × 8 bits (u8[n_kmers]) — MPHF slot → rank within unitig
+evidence.bin: n × 4 bytes, little-endian
+ each u32: bits [31:7] = chunk_id (25 bits)
+ bits [6:0] = rank (7 bits)
```
-`id_array` is a compact bit-packed vector (width = ⌈log₂ n_unitigs⌉; 19 bits for *B. nana* at 256 partitions). `rank_array` is a plain `u8` array — no bit-packing needed. Access is O(1) with a single multiplication and mask for `id_array`, and a direct byte index for `rank_array`.
+File size = `n × 4` bytes exactly. Decoding a slot: `chunk_id = raw >> 7`, `rank = raw & 0x7F`.
+
+The theoretical bit cost of strategy B (19 bits id + 8 bits rank = 27 bits) is not recovered: packing into aligned u32 costs 32 bits per slot. The u32 layout is chosen for simplicity and alignment — one word per slot, no bit-addressing arithmetic.
### Unitig file layout
-FASTA with JSON annotation header (xxHash-64 ID, seq_length, kmer_size, n_kmers). The nucleotide sequence is stored in ASCII uppercase; a 2-bit packed version is derived at query time or stored as a parallel `.2bit` file for speed.
-
-```
->c4a1e7f2 {"seq_length":87,"kmer_size":31,"n_kmers":57}
-ACGTGGCTA...
-```
+Binary packed 2-bit nucleotide file (`unitigs.bin`) with a companion index (`unitigs.bin.idx`). The index stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
### Decoding a kmer from slot s
```
-unitig_id = id_array[s]
-rank = rank_array[s]
-kmer = nucleotides(unitig_id)[rank .. rank + k] // 2-bit packed slice
+(chunk_id, rank) = evidence.decode(s) // u32 → (u25, u7)
+kmer = unitigs.raw_kmer(chunk_id, rank) // 2-bit packed slice, k nucleotides
```
-One array lookup per field, then a packed slice extraction. The canonical kmer is the stored sequence (by construction — only canonical kmers are inserted into the graph).
+Two memory accesses: one into `evidence.bin`, one into `unitigs.bin`. The canonical kmer is the stored sequence (by construction — only canonical kmers are inserted into the De Bruijn graph).
+
+### Field widths in practice
+
+Rank is stored in 7 bits (0–127). On *B. nana* (k=31, m=11), the observed maximum unitig length is ~46 kmers/chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers per unitig; longer paths arise across multiple superkmers. u7 is sufficient.
+
+chunk_id is stored in 25 bits (0–33 554 431). For *B. nana* at 256 partitions, avg U ≈ 275 k — well within the 25-bit capacity.
### Forward vs reverse complement
@@ -264,6 +267,4 @@ The MPHF is built from the **k-mer set**, not from the unitig sequences themselv
## Open questions
-- **Rank field width**: u8 covers 255 kmers; storing lengths and ranks in kmer units (not nucleotides) buys k−1 extra units of headroom at no cost. On *B. nana* (k=31), m_u ≈ 38 — well within u8 range on average, but the maximum unitig length has not been measured yet. For genomes with very long unitigs, u16 may be needed; the header could record the actual width if portability is required.
-- **Packed nucleotide cache**: storing a 2-bit packed nucleotide array alongside the FASTA avoids re-encoding at query time; negligible space overhead ($N_{nuc} / 4$ bytes per partition).
-- **Cross-partition evidence**: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m.
+- **Cross-partition evidence**: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m_u.
diff --git a/docmd/index.md b/docmd/index.md
index 7696949..ce4e5f8 100644
--- a/docmd/index.md
+++ b/docmd/index.md
@@ -4,7 +4,7 @@
## Constraints
-- Target scale: metagenomic data, tens of Gbases, billions of kmers
+- Target scale: individual genome datasets, tens of Gbases
- Maximum efficiency in computation, memory, and disk usage
- Input formats: FASTA, FASTQ, gzip, streaming stdin