Eric Coissac 5169f65dc9 feat: implement persistent layered index and chunked binary format

Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (256) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.

2026-05-09 17:38:29 +08:00


Unitig-based MPHF evidence encoding

Role of unitigs in the index

The MPHF maps each canonical kmer to an integer slot, but provides no way to reconstruct the kmer from its slot. A downstream operation (query, set operation) that receives a slot index and needs the kmer sequence must be able to retrieve it. The evidence file serves this purpose: it stores the kmer sequences in compact form and provides, for each MPHF slot, a pointer to where the corresponding kmer can be decoded.

Unitigs are the natural compact representation: a run of L nucleotides encodes L − k + 1 consecutive canonical kmers. The entire kmer set of a partition can be reconstructed from its unitig FASTA file.

Two encoding strategies

Strategy A — global nucleotide offset

Each MPHF slot stores a single integer: the offset, counted in nucleotides, of the kmer's first nucleotide within a packed 2-bit nucleotide array that concatenates all unitigs.

evidence[slot] = global_offset  (bits: ⌈log₂ N_nuc⌉)

where N_nuc is the total number of nucleotides across all unitigs in the partition.

Decoding: read k nucleotides starting at global_offset.
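Strategy A decoding can be sketched as follows. The `PackedSeq` type, the A=0, C=1, G=2, T=3 code and the low-bits-first layout within each byte are assumptions made for illustration; the actual on-disk packing is not specified here.

```rust
/// 2-bit packed nucleotide array: 4 nucleotides per byte, lowest bits first.
/// The code A=0, C=1, G=2, T=3 is an assumption of this sketch.
struct PackedSeq {
    bytes: Vec<u8>,
    len: usize, // number of nucleotides stored
}

impl PackedSeq {
    fn from_ascii(seq: &[u8]) -> Self {
        let mut bytes = vec![0u8; (seq.len() + 3) / 4];
        for (i, &c) in seq.iter().enumerate() {
            let code = match c {
                b'A' => 0u8,
                b'C' => 1,
                b'G' => 2,
                b'T' => 3,
                _ => panic!("non-ACGT symbol"),
            };
            bytes[i / 4] |= code << (2 * (i % 4));
        }
        PackedSeq { bytes, len: seq.len() }
    }

    /// 2-bit code of nucleotide i.
    fn get(&self, i: usize) -> u8 {
        (self.bytes[i / 4] >> (2 * (i % 4))) & 0b11
    }

    /// Strategy A: read k nucleotides starting at a global nucleotide offset.
    fn kmer_at(&self, offset: usize, k: usize) -> String {
        assert!(offset + k <= self.len, "k-mer runs past the end of the array");
        (offset..offset + k)
            .map(|i| b"ACGT"[self.get(i) as usize] as char)
            .collect()
    }
}
```

With the concatenated sequence `ACGTGGCTA`, `kmer_at(2, 4)` reads the four nucleotides starting at offset 2.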

Strategy B — (unitig_id, rank within unitig)

Each MPHF slot stores a pair:

evidence[slot] = (unitig_id, rank)

unitig_id : index of the unitig in the partition (0-based)
rank : kmer index within the unitig (0 ≤ rank < n_kmers). Kmer i starts at nucleotide i, so the rank is numerically equal to the nucleotide offset of the kmer's first base; it is counted in kmer units because that is the natural unit for bounding the field.

Decoding: look up the unitig at unitig_id, then read k nucleotides starting at rank.

Bit-cost analysis

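Define for a partition of P kmers with average kmers-per-unitig m (written m_u below when measured empirically, and not to be confused with the minimiser length m = 11 used in the super-kmer discussion):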

total nucleotides: N_{nuc} = P \cdot \left(1 + \dfrac{k-1}{m}\right)
number of unitigs: U = P / m

Strategy A


b_A = \left\lceil \log_2 N_{nuc} \right\rceil = \left\lceil \log_2 P + \log_2\!\left(1 + \frac{k-1}{m}\right) \right\rceil

Strategy B


b_B = \left\lceil \log_2 U \right\rceil + \left\lceil \log_2 L_{max} \right\rceil

where L_{max} is the maximum unitig length. It is stated here in nucleotides; measured in kmers, as in the next section, the bound is k−1 smaller. In practice L_{max} \ll P, so the rank field is much cheaper than the full global offset. If unitig lengths are bounded (e.g. by partition structure), the rank field width is a small constant independent of P.

Empirical bound on unitig length

Lengths and ranks are expressed in kmer units (not nucleotides): the nucleotide length is n_kmers + k − 1, so storing n_kmers instead of seq_length saves k−1 = 30 units of headroom in the same field width.

Consequence for u8 capacity:

unit stored in the u8 field	max representable	equivalent in the other unit (k=31)
nucleotides	255 nuc	225 kmers
kmers	255 kmers	285 nuc
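The two rows of the table follow from the relation seq_length = n_kmers + k − 1. A minimal sketch of the conversion, with hypothetical function names:

```rust
/// Number of k-mers in a sequence of `seq_length` nucleotides.
fn kmers_in(seq_length: usize, k: usize) -> usize {
    assert!(k >= 1 && seq_length >= k, "sequence shorter than k");
    seq_length - k + 1
}

/// Number of nucleotides spanned by `n_kmers` consecutive k-mers.
fn nucleotides_in(n_kmers: usize, k: usize) -> usize {
    assert!(k >= 1 && n_kmers >= 1, "need at least one k-mer");
    n_kmers + k - 1
}
```

For k = 31, a u8 holding a nucleotide length tops out at `kmers_in(255, 31)` kmers, whereas a u8 holding a kmer count reaches `nucleotides_in(255, 31)` nucleotides.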

Structural maximum from superkmer construction. For k=31 and m=11, the maximum number of consecutive kmers sharing the same minimiser is k − m + 1 = 21 kmers (the minimiser traverses from position k−m to 0 as the window slides). A unitig that is a single full superkmer therefore has exactly 21 kmers. This is confirmed by a bimodal distribution in empirical data: a sharp peak at 21 kmers appears in all partitions, including the anomalous partition 145. The observed maximum is ~46 kmers (unitigs spanning more than one superkmer), well within u8 range.

On Betula nana (k=31, 256 partitions), m_u ≈ 37.9 kmers/unitig on average. The rank field (kmer index within the unitig) fits in a u8 as long as no unitig exceeds 255 kmers — guaranteed by the split strategy below and amply satisfied by empirical maximums (~46 kmers observed).

Split strategy for long unitigs

For the rare cases where a unitig exceeds 255 kmers, the unitig is split into chunks of at most 255 kmers, with a k−1 nucleotide overlap at each junction — identical to the way super-kmers are delimited at partition boundaries. Each chunk is self-contained and independently decodable.

original unitig: kmer_0 … kmer_254 | kmer_255 … kmer_N
                                   ↑ cut here

chunk 1: nucleotides 0 … 284        (255 kmers)
chunk 2: nucleotides 255 … N+k-1    (N-255+1 kmers)
shared:  nucleotides 255 … 284      (k-1 = 30 nucleotides, stored in both)

Cost of one split: k−1 = 30 redundant nucleotides = 60 bits. This event is rare in practice (m_u ≈ 38 for B. nana, well below the 255-kmer cap). No kmer is lost: kmer i is in chunk 1 if i < 255, in chunk 2 (at rank i−255) otherwise.
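The split rule above can be sketched as follows. The function names and the ASCII input are assumptions for illustration; the real implementation presumably works on its own sequence type.

```rust
const MAX_KMERS_PER_CHUNK: usize = 255;

/// Split a unitig (ASCII nucleotides) into chunks of at most 255 k-mers.
/// Consecutive chunks share k-1 nucleotides, so each one decodes on its own.
fn split_unitig(seq: &[u8], k: usize) -> Vec<&[u8]> {
    assert!(k >= 1 && seq.len() >= k, "a unitig holds at least one k-mer");
    let n_kmers = seq.len() - k + 1;
    let mut chunks = Vec::new();
    let mut first = 0; // index of the first k-mer of the current chunk
    while first < n_kmers {
        let count = MAX_KMERS_PER_CHUNK.min(n_kmers - first);
        // `count` k-mers starting at nucleotide `first` span count + k - 1 nucleotides.
        chunks.push(&seq[first..first + count + k - 1]);
        first += count;
    }
    chunks
}

/// Map a k-mer index in the original unitig to (chunk index, rank in chunk).
fn locate(kmer_index: usize) -> (usize, u8) {
    (
        kmer_index / MAX_KMERS_PER_CHUNK,
        (kmer_index % MAX_KMERS_PER_CHUNK) as u8,
    )
}
```

For a 330-nucleotide unitig with k = 31 (300 kmers), this yields one chunk of 285 nucleotides and one of 75, and the last 30 nucleotides of the first chunk equal the first 30 of the second.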

Savings from u8 length fields

Because all chunks are guaranteed ≤ 255 kmers, the per-chunk length array in the binary index is a flat u8 array — 1 byte per chunk instead of 8 bytes (usize) or 4 bytes (u32). For a partition with 4 M unitigs:

length type	bytes/chunk	total (4 M chunks)
usize (u64)	8	32 MB
u32	4	16 MB
u8	1	4 MB

Random access to chunk i is recovered at load time by a single prefix-sum pass over the u8 array, which computes a u32 or u64 offset array in O(n_chunks) time and O(n_chunks × 4) bytes (× 8 with u64 offsets). This cost is paid once at open time and cached for the lifetime of the partition handle.
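A minimal sketch of that prefix-sum pass, assuming chunks are stored back to back and each occupies n_kmers + k − 1 nucleotides (any byte padding in the real format is ignored). `chunk_offsets` is a hypothetical name, and u64 is used here for simplicity; u32 suffices whenever the total nucleotide count fits.

```rust
/// Starting offset (in nucleotides) of every chunk in the concatenated
/// sequence, derived from the flat u8 array of per-chunk k-mer counts.
/// Returns n_chunks + 1 entries; the last one is the total length.
fn chunk_offsets(n_kmers_per_chunk: &[u8], k: usize) -> Vec<u64> {
    assert!(k >= 1);
    let mut offsets = Vec::with_capacity(n_kmers_per_chunk.len() + 1);
    let mut acc = 0u64;
    offsets.push(acc);
    for &n in n_kmers_per_chunk {
        // A chunk of n k-mers spans n + k - 1 nucleotides.
        acc += n as u64 + (k as u64 - 1);
        offsets.push(acc);
    }
    offsets
}
```

Chunk i then occupies nucleotides `offsets[i] .. offsets[i + 1]`.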

Bit costs for Betula nana (k=31, 256 partitions, P ≈ 10.4 M, U ≈ 275 k, m_u ≈ 37.9):

field	strategy A	strategy B
offset / id	`\lceil\log_2(P \cdot (1 + 30/m_u))\rceil = 25` bits	`\lceil\log_2(U)\rceil = 19` bits
rank	—	8 bits (u8, fixed)
total	25 bits	27 bits

Strategy A is 2 bits cheaper. Strategy B's main advantage is locality: decoding a kmer touches one unitig's cache lines rather than an arbitrary offset in a large flat array, and the rank field doubles as a direct index into the packed nucleotide sequence without pointer arithmetic.
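The figures in the table can be re-derived from P, m_u and k alone. The sketch below is an illustration of the two formulas, with hypothetical function names and the rank field fixed at 8 bits:

```rust
/// Bits needed to address n distinct values: ceil(log2(n)).
fn ceil_log2(n: u64) -> u32 {
    assert!(n > 0);
    if n == 1 { 0 } else { 64 - (n - 1).leading_zeros() }
}

/// Strategy A: one global nucleotide offset per slot.
fn bits_strategy_a(p: u64, m_u: f64, k: u32) -> u32 {
    let n_nuc = p as f64 * (1.0 + (k - 1) as f64 / m_u);
    ceil_log2(n_nuc.ceil() as u64)
}

/// Strategy B: unitig id plus a fixed u8 rank.
fn bits_strategy_b(p: u64, m_u: f64) -> u32 {
    let n_unitigs = (p as f64 / m_u).ceil() as u64;
    ceil_log2(n_unitigs) + 8
}
```

With P = 10.4 M, m_u = 37.9 and k = 31 these return 25 and 27 bits respectively.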

Partition-size tradeoff

The total bits/kmer for the index (sequence + evidence + MPHF, with strategy A evidence) as a function of partition size is:


\text{total} = \underbrace{2\!\left(1 + \frac{k-1}{m}\right)}_{\text{sequence}} + \underbrace{\log_2 P + \log_2\!\left(1+\frac{k-1}{m}\right)}_{\text{evidence}} + \underbrace{c_{MPHF}}_{\approx 2\text{–}4}

Empirical observation: m_u is set by De Bruijn graph topology, not partition count

Measured on Betula nana (k=31, m=11), summing n_kmers and sequence counts across all partition files:

N partitions	m_sk	m_u	factor m_u/m_sk	nuc ratio (u/sk)
1	12.13	41.89	3.45×	0.273
16	12.13	38.19	3.15×	0.376
256	12.13	37.90	3.12×	0.388
1 024	12.13	37.89	3.12×	0.389

m_sk = avg kmers/super-kmer (invariant — same dataset regardless of partition scheme)
m_u = avg kmers/unitig = total_n_kmers / total_unitigs, summed across all partitions
nuc ratio = (u_symbols + 30·u_reads) / (sk_symbols + 30·sk_reads)

X-axis in both charts: partition bits (0 = 1 partition, 10 = 1024 partitions) — each step doubles the partition count.

xychart-beta
    title "m_u (avg kmers/unitig) vs partition bits — B. nana k=31"
    x-axis "partition bits" 0 --> 10
    y-axis "m_u" 37 --> 43
    line [41.89, 40.78, 39.22, 38.52, 38.19, 38.03, 37.96, 37.92, 37.90, 37.89, 37.89]

xychart-beta
    title "Nucleotide storage: unitigs / super-kmers (%) vs partition bits — B. nana k=31"
    x-axis "partition bits" 0 --> 10
    y-axis "%" 25 --> 42
    line [27.3, 29.7, 33.9, 36.3, 37.6, 38.3, 38.6, 38.7, 38.8, 38.9, 38.9]

Key observations:

Partition boundaries have a small but non-zero effect on m_u. Going from 1 to 1024 partitions reduces m_u by 10% (41.9 → 37.9). Within the practical range 16–1024, the variation is under 1% — m_u is effectively constant.
m_u is a property of the De Bruijn graph, not the partition scheme. The dominant factor is graph branching (heterozygosity, repeats, sequencing errors).
Unitigs provide substantial compaction over super-kmers. At 256 partitions, unitigs cover the same unique kmers using 39% of the raw nucleotide content of super-kmers (about 2.6× fewer nucleotides), with 3.1× more kmers per sequence (m_u/m_sk).

Per-partition compaction ratio (sk_symbols / u_symbols)

The ratio measures how much super-kmer kmer-slots are "shared" across different super-kmer records: a ratio of 1.35 means each unique kmer (counted once in unitigs) appears in 1.35 super-kmer kmer-slots on average.

bits	N partitions	median ratio	min ratio	min partition	min u_reads
6	64	1.355	1.073	—	4.5 M
7	128	1.352	1.037	—	4.1 M
8	256	1.350	1.012	145	3.8 M
9	512	1.350	0.998	145	3.6 M
10	1024	1.351	0.992	145	3.6 M

The median stabilises at 1.35 from 64 partitions onward (stdev = 0.027 at 256 partitions). There is one persistent outlier: partition 145 (at 256-partition resolution) is consistently anomalous across all partition depths — it contains 10–14× more super-kmers and unitigs than the average partition, with a ratio near 1.0, meaning the unitig representation provides almost no kmer deduplication. This is consistent with a highly repetitive or organellar region where the dominant minimiser belongs to a sequence that appears in many reads without forming long overlapping paths in the De Bruijn graph.

Per-partition parameters at 256 partitions (B. nana):

quantity	value
P (unique kmers/partition, avg)	≈ 10.4 M
U (unitigs/partition, avg)	≈ 275 k
m_u	≈ 37.9
Strategy A bits/kmer	⌈log₂(P·(1+30/m_u))⌉ = 25
Strategy B bits/kmer	⌈log₂(U)⌉ + 8 = 27

Consequence: the partition count should be as large as memory and parallelism allow. Each doubling saves 1 bit/kmer in evidence (log₂ P decreases by 1). The sequence term 2·(1 + 30/m_u) ≈ 3.6 bits/kmer is approximately constant.

Strategy B lowers the P-dependent part of the evidence cost by a constant: log₂(U) = log₂(P) − log₂(m_u), i.e. about 5 bits less than log₂(P) for m_u ≈ 37.9, but it scales with P in the same way and the fixed 8-bit rank more than offsets the gain. Strategy B's main benefit remains locality and bounded rank width, not asymptotic compression.
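The total bits/kmer formula of this section can be evaluated directly. The sketch below assumes strategy A evidence without the ceiling, as in the formula, and takes c_MPHF as a parameter; the function name is hypothetical.

```rust
/// Total index cost in bits per k-mer: sequence + evidence + MPHF.
fn total_bits_per_kmer(p: f64, m_u: f64, k: u32, c_mphf: f64) -> f64 {
    let expansion = 1.0 + (k - 1) as f64 / m_u; // nucleotides per k-mer
    let sequence = 2.0 * expansion;
    let evidence = p.log2() + expansion.log2();
    sequence + evidence + c_mphf
}
```

For P = 10.4 M, m_u = 37.9, k = 31 and c_MPHF = 3 this gives roughly 30.7 bits/kmer, and halving P (one more partition bit) lowers the result by exactly one bit as long as m_u stays constant.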

Implementation notes

Evidence file layout (strategy B)

evidence.bin
├── header    : k (u8), n_kmers (u64), n_unitigs (u64)
├── id_array  : n_kmers × ⌈log₂ n_unitigs⌉ bits  — MPHF slot → unitig_id
└── rank_array: n_kmers × 8 bits (u8[n_kmers])    — MPHF slot → rank within unitig

id_array is a compact bit-packed vector (width = ⌈log₂ n_unitigs⌉; 19 bits for B. nana at 256 partitions). rank_array is a plain u8 array, so no bit-packing is needed. Access is O(1): a multiplication to find the bit position followed by a shift and mask for id_array, and a direct byte index for rank_array.
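A bit-packed vector of this kind can be sketched as below. `PackedIds` is a hypothetical type, not the project's own; it stores entries in u64 words, lowest bits first, and reads two adjacent words so that entries straddling a word boundary need no special case.

```rust
/// Fixed-width bit-packed vector of unsigned integers (width 1..=32 bits).
struct PackedIds {
    words: Vec<u64>,
    width: u32,
    len: usize,
}

impl PackedIds {
    fn new(len: usize, width: u32) -> Self {
        assert!((1..=32).contains(&width));
        let n_bits = len * width as usize;
        // One spare word so that the two-word access below stays in bounds.
        PackedIds { words: vec![0; n_bits / 64 + 2], width, len }
    }

    /// Write entry i. Assumes the entry is still zero (write-once sketch).
    fn set(&mut self, i: usize, value: u32) {
        assert!(i < self.len && (value as u64) < (1u64 << self.width));
        let bit = i * self.width as usize;
        let (w, s) = (bit / 64, bit % 64);
        let v = (value as u128) << s;
        self.words[w] |= v as u64;
        self.words[w + 1] |= (v >> 64) as u64;
    }

    /// Read entry i: multiply to find the bit position, then shift and mask.
    fn get(&self, i: usize) -> u32 {
        assert!(i < self.len);
        let bit = i * self.width as usize;
        let (w, s) = (bit / 64, bit % 64);
        let pair = (self.words[w] as u128) | ((self.words[w + 1] as u128) << 64);
        ((pair >> s) & ((1u128 << self.width) - 1)) as u32
    }
}
```

With a 19-bit width, entry 3 starts at bit 57 and therefore spans two words, which is the case the two-word read is meant to cover.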

Unitig file layout

FASTA with JSON annotation header (xxHash-64 ID, seq_length, kmer_size, n_kmers). The nucleotide sequence is stored in ASCII uppercase; a 2-bit packed version is derived at query time or stored as a parallel .2bit file for speed.

>c4a1e7f2 {"seq_length":87,"kmer_size":31,"n_kmers":57}
ACGTGGCTA...

Decoding a kmer from slot s

unitig_id = id_array[s]
rank      = rank_array[s]
kmer      = nucleotides(unitig_id)[rank .. rank + k]   // 2-bit packed slice

One array lookup per field, then a packed slice extraction. The canonical kmer is the stored sequence (by construction — only canonical kmers are inserted into the graph).
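The lookup above, written out as a small sketch. For brevity it keeps `id_array` as a plain `Vec<u32>` and the unitigs as ASCII rather than bit-packed; the `Evidence` struct and its fields are hypothetical, and the slot is assumed to come from the MPHF.

```rust
struct Evidence {
    id_array: Vec<u32>,     // MPHF slot -> unitig_id (bit-packed on disk)
    rank_array: Vec<u8>,    // MPHF slot -> rank within the unitig
    unitigs: Vec<Vec<u8>>,  // unitig sequences, indexed by unitig_id
    k: usize,
}

impl Evidence {
    /// Return the k-mer stored for `slot` as a slice of its unitig.
    fn kmer(&self, slot: usize) -> &[u8] {
        let unitig = &self.unitigs[self.id_array[slot] as usize];
        let rank = self.rank_array[slot] as usize;
        &unitig[rank..rank + self.k]
    }
}
```

The returned slice borrows from the unitig, so no allocation happens on the query path in this sketch.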

Forward vs reverse complement

The De Bruijn graph stores only canonical kmers. The evidence encodes the canonical orientation. Callers that need the strand of the original kmer must compare the retrieved kmer with its revcomp at query time; this is a single 64-bit comparison.
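The strand test can be sketched on 2-bit encoded kmers held in a u64 (k ≤ 32). The encoding A=0, C=1, G=2, T=3 and the definition of canonical as the numerically smaller of a kmer and its reverse complement are the usual convention, assumed here rather than taken from the implementation.

```rust
/// Encode an ASCII k-mer (k <= 32) into a u64, first nucleotide in the high bits.
fn encode(kmer: &[u8]) -> u64 {
    assert!(kmer.len() <= 32);
    kmer.iter().fold(0u64, |acc, &c| {
        let code = match c {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => panic!("non-ACGT symbol"),
        };
        (acc << 2) | code
    })
}

/// Reverse complement of a 2-bit encoded k-mer.
fn revcomp(x: u64, k: usize) -> u64 {
    let mut x = !x; // bitwise NOT complements each base: A<->T, C<->G
    let mut out = 0u64;
    for _ in 0..k {
        out = (out << 2) | (x & 3); // last base becomes the first
        x >>= 2;
    }
    out
}

/// Canonical form and whether the input was the reverse strand.
fn canonical(x: u64, k: usize) -> (u64, bool) {
    let rc = revcomp(x, k);
    if x <= rc { (x, false) } else { (rc, true) }
}
```

A caller holding the query kmer can compare its own encoding with the canonical value retrieved from the index; a mismatch means the query was on the reverse strand.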

Non-determinism of the unitig decomposition

The unitig extraction is not deterministic: two runs on identical input can produce a different number of unitigs with different sequences, while covering exactly the same canonical k-mer set.

Source of non-determinism

The graph nodes are stored in a hash map whose iteration order depends on the hash seed (random per run with ahash::RandomState::new()). The start_iter first pass emits every node whose can_extend_left flag is false — which includes not only true dead-end nodes but also branch points (nodes with 2 or more left neighbours, for which unique_neighbor returns None).

When a branch point is encountered before its upstream neighbours, it claims the downstream chain and those neighbours later produce length-k degenerate unitigs. When upstream neighbours are encountered first, they extend through the branch point and consume it.

Example — fork topology (k = 31):

A → B ← C
    ↓
    D

All four nodes are in the graph. B has two left neighbours (A and C), so can_extend_left = false; B also has one right neighbour D, so can_extend_right = true.

iteration order	unitigs produced	count
A first, then B, C	ABD · C	2
B first, then A, C	BD · A · C	3

Both tilings cover the same 4 canonical k-mers.

Pure cycles (all nodes have both extensions present) are unaffected by this: they are never emitted in the first pass and each cycle produces exactly one unitig regardless of which node the second pass starts from. Only the cycle cut point (and therefore the sequence content) varies.

Consequence for MPHF construction

The MPHF is built from the k-mer set, not from the unitig sequences themselves. Because both tilings contain the same canonical k-mers, the resulting MPHF is identical. The non-determinism is benign for this use case.
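The equivalence of two tilings can be checked by comparing their k-mer sets. The sketch below uses a toy fork with k = 3 and forward k-mers only (no canonicalisation); the sequences are invented for illustration, with A = AAC, B = ACG, C = CAC, D = CGG.

```rust
use std::collections::BTreeSet;

/// Collect the set of k-mers covered by a list of unitigs.
fn kmer_set(unitigs: &[&str], k: usize) -> BTreeSet<String> {
    unitigs
        .iter()
        .flat_map(|u| u.as_bytes().windows(k))
        .map(|w| String::from_utf8(w.to_vec()).expect("ASCII nucleotides"))
        .collect()
}
```

The tiling ABD · C is `["AACGG", "CAC"]` and the tiling BD · A · C is `["ACGG", "AAC", "CAC"]`; both produce the same four k-mers, so any structure built from the set alone sees identical input.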

Open questions

Rank field width: u8 covers 255 kmers; storing lengths and ranks in kmer units (not nucleotides) buys k−1 extra units of headroom at no cost. On B. nana (k=31), m_u ≈ 38 and the observed maximum reported above is ~46 kmers, both well within u8 range. For genomes with very long unitigs, u16 may be needed; the header could record the actual width if portability is required.
Packed nucleotide cache: storing a 2-bit packed nucleotide array alongside the FASTA avoids re-encoding at query time; negligible space overhead (N_{nuc} / 4 bytes per partition).
Cross-partition evidence: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m_u.
