docs: clarify MPHF indexing, storage layout, and distance traits

Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
Eric Coissac
2026-05-17 10:20:22 +08:00
parent cf693f17f2
commit f36b095ce2
17 changed files with 916 additions and 1031 deletions
+37 -21
@@ -428,10 +428,10 @@
<nav aria-label="Implementation notes" class="md-nav">
<ul class="md-nav__list">
<li class="md-nav__item">
<a class="md-nav__link" href="#evidence-file-layout-strategy-b">
<a class="md-nav__link" href="#evidence-file-layout-strategy-b-implemented">
<span class="md-ellipsis">
Evidence file layout (strategy B)
Evidence file layout (strategy B — implemented)
</span>
</a>
@@ -455,6 +455,15 @@
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#field-widths-in-practice">
<span class="md-ellipsis">
Field widths in practice
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#forward-vs-reverse-complement">
<span class="md-ellipsis">
@@ -738,10 +747,10 @@
<nav aria-label="Implementation notes" class="md-nav">
<ul class="md-nav__list">
<li class="md-nav__item">
<a class="md-nav__link" href="#evidence-file-layout-strategy-b">
<a class="md-nav__link" href="#evidence-file-layout-strategy-b-implemented">
<span class="md-ellipsis">
Evidence file layout (strategy B)
Evidence file layout (strategy B — implemented)
</span>
</a>
@@ -765,6 +774,15 @@
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#field-widths-in-practice">
<span class="md-ellipsis">
Field widths in practice
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#forward-vs-reverse-complement">
<span class="md-ellipsis">
@@ -1116,24 +1134,24 @@ shared: nucleotides 255 … 284 (k-1 = 30 nucleotides, stored in both)
<p>Strategy B only partially decouples evidence cost from P: <code>log₂(U) = log₂(P/m_u)</code> stays below <code>log₂(P)</code> by a constant log₂(m_u) ≈ 5 bits, so both grow at the same rate. Strategy B's main benefit remains locality and bounded rank width, not asymptotic compression.</p>
<hr/>
<h2 id="implementation-notes">Implementation notes</h2>
<h3 id="evidence-file-layout-strategy-b">Evidence file layout (strategy B)</h3>
<div class="highlight"><pre><span></span><code>evidence.bin
├── header : k (u8), n_kmers (u64), n_unitigs (u64)
├── id_array : n_kmers × ⌈log₂ n_unitigs⌉ bits — MPHF slot → unitig_id
└── rank_array: n_kmers × 8 bits (u8[n_kmers]) — MPHF slot → rank within unitig
<h3 id="evidence-file-layout-strategy-b-implemented">Evidence file layout (strategy B — implemented)</h3>
<p><code>evidence.bin</code> is a flat <code>[u32; n]</code> array with no header:</p>
<div class="highlight"><pre><span></span><code>evidence.bin: n × 4 bytes, little-endian
each u32: bits [31:7] = chunk_id (25 bits)
bits [6:0] = rank (7 bits)
</code></pre></div>
<p><code>id_array</code> is a compact bit-packed vector (width = ⌈log₂ n_unitigs⌉; 19 bits for <em>B. nana</em> at 256 partitions). <code>rank_array</code> is a plain <code>u8</code> array — no bit-packing needed. Access is O(1) with a single multiplication and mask for <code>id_array</code>, and a direct byte index for <code>rank_array</code>.</p>
<p>File size = <code>n × 4</code> bytes exactly. Decoding a slot: <code>chunk_id = raw &gt;&gt; 7</code>, <code>rank = raw &amp; 0x7F</code>.</p>
<p>The theoretical bit cost of strategy B (19 bits id + 8 bits rank = 27 bits) is not recovered: packing into aligned u32 costs 32 bits per slot. The u32 layout is chosen for simplicity and alignment — one word per slot, no bit-addressing arithmetic.</p>
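<p>The slot layout above can be sketched in a few lines of Rust. This is a minimal illustration, not the project's code: the names <code>encode_slot</code>, <code>decode_slot</code> and <code>read_slot</code> are hypothetical, and only the bit split (25 + 7), the little-endian byte order and the absence of a header come from the text.</p>

```rust
/// Number of low bits holding the rank (values 0..=127).
const RANK_BITS: u32 = 7;
/// Mask selecting the rank field (0x7F).
const RANK_MASK: u32 = (1 << RANK_BITS) - 1;
/// Largest chunk_id that fits in the remaining 25 bits.
const MAX_CHUNK_ID: u32 = (1 << (32 - RANK_BITS)) - 1;

/// Pack (chunk_id, rank) into one u32 slot; None if either field overflows.
fn encode_slot(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Unpack a slot into (chunk_id, rank), as described in the text.
fn decode_slot(raw: u32) -> (u32, u32) {
    (raw >> RANK_BITS, raw & RANK_MASK)
}

/// Read slot `s` from the raw bytes of evidence.bin.
/// The file has no header, so slot s starts at byte 4 * s.
fn read_slot(bytes: &[u8], s: usize) -> Option<(u32, u32)> {
    let start = s.checked_mul(4)?;
    let end = start.checked_add(4)?;
    let w = bytes.get(start..end)?;
    let raw = u32::from_le_bytes([w[0], w[1], w[2], w[3]]);
    Some(decode_slot(raw))
}
```

<p>Because every slot is one aligned word, <code>read_slot</code> needs no bit-addressing arithmetic; the same function works on a memory-mapped file or an in-memory buffer.</p>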
<h3 id="unitig-file-layout">Unitig file layout</h3>
<p>FASTA with JSON annotation header (xxHash-64 ID, seq_length, kmer_size, n_kmers). The nucleotide sequence is stored in ASCII uppercase; a 2-bit packed version is derived at query time or stored as a parallel <code>.2bit</code> file for speed.</p>
<div class="highlight"><pre><span></span><code>&gt;c4a1e7f2 {"seq_length":87,"kmer_size":31,"n_kmers":57}
ACGTGGCTA...
</code></pre></div>
<p>Binary packed 2-bit nucleotide file (<code>unitigs.bin</code>) with a companion index (<code>unitigs.bin.idx</code>). The index stores: magic <code>UIDX</code>, <code>n_unitigs: u32</code>, <code>n_kmers: u64</code>, <code>seqls: [u8; n_unitigs]</code> (kmer count − 1 per chunk), and <code>packed_offsets: [u32; n_unitigs + 1]</code> (byte offsets into <code>unitigs.bin</code>, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.</p>
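<p>A possible reader for the index file is sketched below. The field order follows the list above; everything else is an assumption made for illustration: little-endian integers, no padding between fields, and the hypothetical names <code>UnitigIndex</code>, <code>parse_index</code> and <code>byte_range</code>. The <code>seqls</code> bytes are kept raw rather than interpreted.</p>

```rust
use std::ops::Range;

/// Parsed contents of unitigs.bin.idx (assumed layout: fields back to back).
#[derive(Debug, PartialEq)]
struct UnitigIndex {
    n_kmers: u64,
    seqls: Vec<u8>,           // one length byte per unitig, kept as stored
    packed_offsets: Vec<u32>, // n_unitigs + 1 byte offsets into unitigs.bin
}

/// Read a little-endian u32 at byte position `at`, if in bounds.
fn le_u32(buf: &[u8], at: usize) -> Option<u32> {
    let b = buf.get(at..at.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Read a little-endian u64 at byte position `at`, if in bounds.
fn le_u64(buf: &[u8], at: usize) -> Option<u64> {
    let b = buf.get(at..at.checked_add(8)?)?;
    Some(u64::from_le_bytes([b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7]]))
}

/// Parse the whole index; None on bad magic or truncated input.
fn parse_index(buf: &[u8]) -> Option<UnitigIndex> {
    if !buf.starts_with(b"UIDX") {
        return None;
    }
    let n_unitigs = le_u32(buf, 4)? as usize;
    let n_kmers = le_u64(buf, 8)?;
    let offs_start = 16usize.checked_add(n_unitigs)?;
    let seqls = buf.get(16..offs_start)?.to_vec();
    let mut packed_offsets = Vec::with_capacity(n_unitigs + 1);
    for i in 0..=n_unitigs {
        packed_offsets.push(le_u32(buf, offs_start + 4 * i)?);
    }
    Some(UnitigIndex { n_kmers, seqls, packed_offsets })
}

impl UnitigIndex {
    /// Byte range of unitig `i` in unitigs.bin; the sentinel entry
    /// supplies the end of the last unitig.
    fn byte_range(&self, i: usize) -> Option<Range<usize>> {
        let start = *self.packed_offsets.get(i)? as usize;
        let end = *self.packed_offsets.get(i.checked_add(1)?)? as usize;
        Some(start..end)
    }
}
```

<p>The sentinel entry is what keeps <code>byte_range</code> branch-free for the last unitig: every unitig, including the final one, is delimited by two consecutive offsets.</p>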
<h3 id="decoding-a-kmer-from-slot-s">Decoding a kmer from slot s</h3>
<div class="highlight"><pre><span></span><code>unitig_id = id_array[s]
rank = rank_array[s]
kmer = nucleotides(unitig_id)[rank .. rank + k] // 2-bit packed slice
<div class="highlight"><pre><span></span><code>(chunk_id, rank) = evidence.decode(s) // u32 → (u25, u7)
kmer = unitigs.raw_kmer(chunk_id, rank) // 2-bit packed slice, k nucleotides
</code></pre></div>
<p>One array lookup per field, then a packed slice extraction. The canonical kmer is the stored sequence (by construction — only canonical kmers are inserted into the graph).</p>
<p>Two memory accesses: one into <code>evidence.bin</code>, one into <code>unitigs.bin</code>. The canonical kmer is the stored sequence (by construction — only canonical kmers are inserted into the De Bruijn graph).</p>
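<p>The second access, the packed slice extraction, might look like the sketch below. The name <code>raw_kmer</code> is borrowed from the pseudocode, but its signature here is hypothetical, and the packing convention is an assumption: A=0, C=1, G=2, T=3, four nucleotides per byte, first nucleotide in the two highest bits. <code>pack_2bit</code> is a helper added only to make the example self-contained.</p>

```rust
/// Pack an ASCII nucleotide string into 2 bits per base (assumed convention:
/// A=0, C=1, G=2, T=3, first base in the highest bits of each byte).
fn pack_2bit(seq: &[u8]) -> Option<Vec<u8>> {
    let mut out = vec![0u8; (seq.len() + 3) / 4];
    for (i, &c) in seq.iter().enumerate() {
        let code = match c {
            b'A' => 0u8,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None, // anything else is rejected in this sketch
        };
        out[i / 4] |= code << (6 - 2 * (i % 4));
    }
    Some(out)
}

/// Extract the k nucleotides starting at nucleotide index `rank` from one
/// packed unitig, returned right-aligned in a u64 (so k must be <= 32).
/// `packed` is the byte slice of a single unitig; the caller is expected to
/// ensure rank + k does not run into padding bits of the last byte.
fn raw_kmer(packed: &[u8], rank: usize, k: usize) -> Option<u64> {
    if k == 0 || k > 32 {
        return None;
    }
    let mut kmer = 0u64;
    for i in rank..rank.checked_add(k)? {
        let byte = *packed.get(i / 4)?;
        let shift = 6 - 2 * (i % 4);
        kmer = (kmer << 2) | u64::from((byte >> shift) & 0b11);
    }
    Some(kmer)
}
```

<p>The loop reads one base per iteration for clarity; a production version would more likely load whole words and shift, but the result would be the same 2k-bit value.</p>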
<h3 id="field-widths-in-practice">Field widths in practice</h3>
<p>Rank is stored in 7 bits (0–127). On <em>B. nana</em> (k=31, m=11), the observed maximum unitig length is ~46 kmers/chunk, well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers per unitig; longer paths arise across multiple superkmers. u7 is sufficient.</p>
<p>chunk_id is stored in 25 bits (0–33 554 431). For <em>B. nana</em> at 256 partitions, avg U ≈ 275 k, well within the 25-bit capacity.</p>
<h3 id="forward-vs-reverse-complement">Forward vs reverse complement</h3>
<p>The De Bruijn graph stores only canonical kmers. The evidence encodes the canonical orientation. Callers that need the strand of the original kmer must compare the retrieved kmer with its revcomp at query time; this is a single 64-bit comparison.</p>
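<p>For k ≤ 32 the strand test reduces to integer comparisons on 2-bit packed values. The sketch below assumes the encoding A=0, C=1, G=2, T=3 (so the complement of a base is its bitwise negation) and a kmer right-aligned in a <code>u64</code>; <code>revcomp</code>, <code>Strand</code> and <code>strand_of</code> are hypothetical names, not part of the documented API.</p>

```rust
/// Orientation of a query kmer relative to the stored canonical kmer.
#[derive(Debug, PartialEq)]
enum Strand {
    Forward,
    Reverse,
}

/// Reverse complement of a 2-bit packed kmer of length k (k <= 32).
/// With A=0, C=1, G=2, T=3, complementing a base is `3 - x`, i.e. `!x & 3`.
fn revcomp(kmer: u64, k: u32) -> u64 {
    let mut x = !kmer; // complement all bases at once
    let mut rc = 0u64;
    for _ in 0..k {
        rc = (rc << 2) | (x & 0b11); // append the current lowest base
        x >>= 2;
    }
    rc
}

/// Compare a query kmer with the canonical kmer retrieved from the index.
/// Returns None when the two are not the same kmer on either strand.
/// A palindromic kmer equals its own revcomp and is reported as Forward.
fn strand_of(query: u64, canonical: u64, k: u32) -> Option<Strand> {
    if query == canonical {
        Some(Strand::Forward)
    } else if revcomp(query, k) == canonical {
        Some(Strand::Reverse)
    } else {
        None
    }
}
```

<p>Returning <code>None</code> for a mismatch doubles as a cheap membership check: an MPHF maps absent kmers to arbitrary slots, so the retrieved kmer has to be compared with the query anyway.</p>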
<hr/>
@@ -1176,9 +1194,7 @@ kmer = nucleotides(unitig_id)[rank .. rank + k] // 2-bit packed slice
<hr/>
<h2 id="open-questions">Open questions</h2>
<ul>
<li><strong>Rank field width</strong>: u8 covers 255 kmers; storing lengths and ranks in kmer units (not nucleotides) buys k−1 extra units of headroom at no cost. On <em>B. nana</em> (k=31), m_u ≈ 38, well within u8 range on average, but the maximum unitig length has not been measured yet. For genomes with very long unitigs, u16 may be needed; the header could record the actual width if portability is required.</li>
<li><strong>Packed nucleotide cache</strong>: storing a 2-bit packed nucleotide array alongside the FASTA avoids re-encoding at query time; negligible space overhead (<span class="arithmatex">\(N_{nuc} / 4\)</span> bytes per partition).</li>
<li><strong>Cross-partition evidence</strong>: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m.</li>
<li><strong>Cross-partition evidence</strong>: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m_u.</li>
</ul>
</article>
</div>