docs: document k-mer index architecture and refactor distance metrics
Add comprehensive documentation for the `obilayeredmap` crate, `PersistentCompactIntVec`, `PersistentBitVec`, and the hierarchical k-mer index architecture, including sidebar navigation updates across all documentation pages. Refactor the Bray-Curtis distance computation in `obicompactvec` to decouple numerator and denominator calculations, replacing direct pairwise calls with explicit loops over precomputed sums. Update tests to verify column sum accuracy and align with the simplified API.
@@ -6,7 +6,7 @@
|
||||
<meta charset="utf-8"/>
|
||||
<meta content="width=device-width,initial-scale=1" name="viewport"/>
|
||||
<link href="../mphf/" rel="prev"/>
|
||||
<link href="../../architecture/sequences/invariant/" rel="next"/>
|
||||
<link href="../obilayeredmap/" rel="next"/>
|
||||
<link href="../../assets/images/favicon.png" rel="icon"/>
|
||||
<meta content="mkdocs-1.6.1, mkdocs-material-9.7.6" name="generator"/>
|
||||
<title>Unitig evidence encoding - obikmer</title>
|
||||
@@ -467,6 +467,37 @@
|
||||
</nav>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#non-determinism-of-the-unitig-decomposition">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Non-determinism of the unitig decomposition
|
||||
|
||||
</span>
|
||||
</a>
|
||||
<nav aria-label="Non-determinism of the unitig decomposition" class="md-nav">
|
||||
<ul class="md-nav__list">
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#source-of-non-determinism">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Source of non-determinism
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#consequence-for-mphf-construction">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Consequence for MPHF construction
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
</ul>
|
||||
</nav>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#open-questions">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
@@ -478,6 +509,42 @@
|
||||
</ul>
|
||||
</nav>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="../obilayeredmap/">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
obilayeredmap crate
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="../persistent_compact_int_vec/">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
PersistentCompactIntVec
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="../persistent_bit_vec/">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
PersistentBitVec
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
</ul>
|
||||
</nav>
|
||||
</li>
|
||||
@@ -513,6 +580,18 @@
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="../../architecture/index_architecture/">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer index
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
@@ -698,6 +777,37 @@
|
||||
</nav>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#non-determinism-of-the-unitig-decomposition">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Non-determinism of the unitig decomposition
|
||||
|
||||
</span>
|
||||
</a>
|
||||
<nav aria-label="Non-determinism of the unitig decomposition" class="md-nav">
|
||||
<ul class="md-nav__list">
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#source-of-non-determinism">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Source of non-determinism
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#consequence-for-mphf-construction">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Consequence for MPHF construction
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
</ul>
|
||||
</nav>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#open-questions">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
@@ -774,7 +884,8 @@ b_B = \left\lceil \log_2 U \right\rceil + \left\lceil \log_2 L_{max} \right\rcei
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>On <em>Betula nana</em> (k=31, 256 partitions), m_u ≈ 37.9 kmers/unitig on average; no unitig length distribution data measured yet. The <code>rank</code> field (kmer index within the unitig) fits in a <code>u8</code> as long as no unitig exceeds 255 kmers — guaranteed by the split strategy below.</p>
|
||||
<p><strong>Structural maximum from superkmer construction.</strong> For k=31 and m=11, at most k − m + 1 = <strong>21 kmers</strong> in a row can share the same minimiser: as the window slides, the minimiser's offset within the kmer moves from k−m down to 0. A unitig that is a single full superkmer therefore has exactly 21 kmers. The empirical length distribution is consistent with this: it is bimodal, with a sharp peak at 21 kmers in all partitions, including the anomalous partition 145. The observed maximum is ~46 kmers (unitigs spanning more than one superkmer), well within u8 range.</p>
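<p>The arithmetic above can be checked with a minimal sketch. The function names <code>max_kmers_per_minimiser</code> and <code>fits_u8_rank</code> are hypothetical and are not part of the obikmer crates; the 255-kmer limit is the one quoted in this section.</p>

```rust
/// Largest number of consecutive k-mers that can share one minimiser of length `m`.
/// The minimiser's offset inside the k-mer takes each value k-m, k-m-1, ..., 0
/// at most once, which gives k - m + 1 positions.
/// (Hypothetical helper, for illustration only.)
fn max_kmers_per_minimiser(k: usize, m: usize) -> usize {
    assert!(m >= 1 && m <= k, "minimiser length must be in 1..=k");
    k - m + 1
}

/// True when a unitig of `n_kmers` k-mers respects the 255-k-mer limit
/// quoted for the u8 rank field. (Hypothetical helper.)
fn fits_u8_rank(n_kmers: usize) -> bool {
    n_kmers <= u8::MAX as usize
}
```

<p>With k=31 and m=11, <code>max_kmers_per_minimiser</code> returns 21, and both that value and the observed maximum of ~46 kmers pass <code>fits_u8_rank</code>.</p>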
|
||||
<p>On <em>Betula nana</em> (k=31, 256 partitions), m_u ≈ 37.9 kmers/unitig on average. The <code>rank</code> field (kmer index within the unitig) fits in a <code>u8</code> as long as no unitig exceeds 255 kmers. This is guaranteed by the split strategy below and comfortably satisfied by the observed maximum (~46 kmers).</p>
|
||||
<h3 id="split-strategy-for-long-unitigs">Split strategy for long unitigs</h3>
|
||||
<p>For the rare cases where a unitig exceeds 255 kmers, the unitig is split into chunks of at most 255 kmers, with a <strong>k−1 nucleotide overlap</strong> at each junction — identical to the way super-kmers are delimited at partition boundaries. Each chunk is self-contained and independently decodable.</p>
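<p>A minimal sketch of this chunking, assuming the unitig is available as a plain nucleotide slice. The function <code>split_unitig</code> and its parameters are hypothetical; the actual implementation may work on 2-bit packed data instead.</p>

```rust
/// Split a unitig into chunks of at most `max_kmers` k-mers each.
/// Consecutive chunks overlap by k-1 nucleotides, so every k-mer of the
/// original unitig appears in exactly one chunk.
/// (Hypothetical helper, for illustration only.)
fn split_unitig(seq: &[u8], k: usize, max_kmers: usize) -> Vec<&[u8]> {
    assert!(k >= 1 && max_kmers >= 1 && seq.len() >= k);
    let n_kmers = seq.len() - k + 1;
    let mut chunks = Vec::new();
    let mut first = 0; // index of the first k-mer of the current chunk
    while first < n_kmers {
        let count = max_kmers.min(n_kmers - first);
        // `count` k-mers span count + k - 1 nucleotides.
        chunks.push(&seq[first..first + count + k - 1]);
        first += count;
    }
    chunks
}
```

<p>For a unitig of 300 kmers at k=31 (330 nucleotides), this yields one chunk of 255 kmers (285 nucleotides) and one of 45 kmers (75 nucleotides), sharing 30 nucleotides at the junction.</p>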
|
||||
<div class="highlight"><pre><span></span><code>original unitig: kmer_0 … kmer_254 | kmer_255 … kmer_N
|
||||
@@ -1026,6 +1137,43 @@ kmer = nucleotides(unitig_id)[rank .. rank + k] // 2-bit packed slice
|
||||
<h3 id="forward-vs-reverse-complement">Forward vs reverse complement</h3>
|
||||
<p>The De Bruijn graph stores only canonical kmers. The evidence encodes the canonical orientation. Callers that need the strand of the original kmer must compare the retrieved kmer with its revcomp at query time; this is a single 64-bit comparison.</p>
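<p>A minimal sketch of the strand test, assuming kmers are 2-bit packed into a <code>u64</code> with A=0, C=1, G=2, T=3 (so the complement of a base <code>b</code> is <code>3 - b</code>). The encoding, the <code>Strand</code> enum, and the functions <code>revcomp</code> and <code>strand</code> are assumptions made for illustration, not the crate's API.</p>

```rust
#[derive(Debug, PartialEq, Eq)]
enum Strand {
    Forward,
    Reverse,
}

/// Reverse complement of a 2-bit packed k-mer (k <= 32).
/// Assumed encoding: A=0, C=1, G=2, T=3.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut fwd = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // Take the lowest base, complement it, and push it onto `rc`.
        rc = (rc << 2) | (3 - (fwd & 3));
        fwd >>= 2;
    }
    rc
}

/// Strand of `query` relative to the stored canonical k-mer.
/// The forward case is a single 64-bit comparison.
fn strand(query: u64, canonical: u64, k: usize) -> Option<Strand> {
    if query == canonical {
        Some(Strand::Forward)
    } else if revcomp(query, k) == canonical {
        Some(Strand::Reverse)
    } else {
        None // `query` is not this k-mer in either orientation
    }
}
```

<p>Under this encoding, ACG packs to 6 and its reverse complement CGT packs to 27, so a query of 27 against a stored canonical value of 6 reports the reverse strand.</p>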
|
||||
<hr/>
|
||||
<h2 id="non-determinism-of-the-unitig-decomposition">Non-determinism of the unitig decomposition</h2>
|
||||
<p>The unitig extraction is <strong>not deterministic</strong>: two runs on identical input can produce a different number of unitigs with different sequences, while covering exactly the same canonical k-mer set.</p>
|
||||
<h3 id="source-of-non-determinism">Source of non-determinism</h3>
|
||||
<p>The graph nodes are stored in a hash map whose iteration order depends on the hash seed (random per run with <code>ahash::RandomState::new()</code>). The <code>start_iter</code> first pass emits every node whose <code>can_extend_left</code> flag is false — which includes not only true dead-end nodes but also <strong>branch points</strong> (nodes with 2 or more left neighbours, for which <code>unique_neighbor</code> returns <code>None</code>).</p>
|
||||
<p>When a branch point is encountered before its upstream neighbours, it claims the downstream chain and those neighbours later produce length-k degenerate unitigs. When upstream neighbours are encountered first, they extend through the branch point and consume it.</p>
|
||||
<p><strong>Example</strong> — fork topology (k = 31):</p>
|
||||
<div class="highlight"><pre><span></span><code>A → B ← C
    ↓
    D
</code></pre></div>
|
||||
<p>All four nodes are in the graph. B has two left neighbours (A and C), so <code>can_extend_left = false</code>; B also has one right neighbour D, so <code>can_extend_right = true</code>.</p>
|
||||
<table>
<thead>
<tr>
<th>iteration order</th>
<th>unitigs produced</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<td>A first, then B, C</td>
<td>ABD · C</td>
<td>2</td>
</tr>
<tr>
<td>B first, then A, C</td>
<td>BD · A · C</td>
<td>3</td>
</tr>
</tbody>
</table>
|
||||
<p>Both tilings cover the same 4 canonical k-mers.</p>
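<p>The effect of iteration order can be reproduced on this toy graph with a minimal sketch. It is a simplified model, not the crate's code: nodes are single characters, the function <code>tile</code> is hypothetical, the <code>order</code> slice stands in for the hash-map iteration order, and only the first pass (start nodes) is modelled, so cycles are not handled.</p>

```rust
use std::collections::{HashMap, HashSet};

/// Toy greedy tiling of a directed graph given as (from, to) edges.
/// `order` plays the role of the hash-map iteration order.
/// (Hypothetical model of the first pass only; cycles are ignored.)
fn tile(order: &[char], edges: &[(char, char)]) -> Vec<String> {
    let mut right: HashMap<char, Vec<char>> = HashMap::new();
    let mut left: HashMap<char, Vec<char>> = HashMap::new();
    for &(from, to) in edges {
        right.entry(from).or_default().push(to);
        left.entry(to).or_default().push(from);
    }

    let mut visited: HashSet<char> = HashSet::new();
    let mut unitigs = Vec::new();
    for &start in order {
        // A start node cannot extend left: it has zero or several left neighbours.
        let left_count = left.get(&start).map_or(0, |v| v.len());
        if visited.contains(&start) || left_count == 1 {
            continue;
        }
        let mut path = String::new();
        let mut node = start;
        loop {
            visited.insert(node);
            path.push(node);
            // Extend right only through a unique, still unclaimed neighbour.
            match right.get(&node).map(|v| v.as_slice()) {
                Some([next]) if !visited.contains(next) => node = *next,
                _ => break,
            }
        }
        unitigs.push(path);
    }
    unitigs
}
```

<p>With edges A→B, C→B and B→D, the order A, B, C, D gives ABD and C, while the order B, A, C, D gives BD, A and C. Both results contain the same four nodes.</p>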
|
||||
<p>Pure cycles (all nodes have both extensions present) are unaffected by this: they are never emitted in the first pass and each cycle produces exactly one unitig regardless of which node the second pass starts from. Only the cycle cut point (and therefore the sequence content) varies.</p>
|
||||
<h3 id="consequence-for-mphf-construction">Consequence for MPHF construction</h3>
|
||||
<p>The MPHF is built from the <strong>k-mer set</strong>, not from the unitig sequences themselves. Because both tilings contain the same canonical k-mers, the resulting MPHF is identical. The non-determinism is benign for this use case.</p>
|
||||
<hr/>
|
||||
<h2 id="open-questions">Open questions</h2>
|
||||
<ul>
|
||||
<li><strong>Rank field width</strong>: u8 covers 255 kmers; storing lengths and ranks in kmer units (not nucleotides) buys k−1 extra units of headroom at no cost. On <em>B. nana</em> (k=31), m_u ≈ 38 and the observed maximum is ~46 kmers, both well within u8 range. For genomes with very long unitigs, u16 may be needed; the header could record the actual width if portability is required.</li>
|
||||
|
||||