docs: clarify MPHF indexing, storage layout, and distance traits
Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
@@ -575,24 +575,6 @@
<label class="md-nav__link md-nav__link--active" for="__toc">
<span class="md-ellipsis">
On-disk storage
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<a href="./" class="md-nav__link md-nav__link--active">
@@ -610,58 +592,6 @@
</a>
<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#collection-parameters" class="md-nav__link">
<span class="md-ellipsis">
Collection parameters
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#count-storage" class="md-nav__link">
<span class="md-ellipsis">
Count storage
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#query-protocol" class="md-nav__link">
<span class="md-ellipsis">
Query protocol
</span>
</a>
</li>
</ul>
</nav>
</li>
@@ -944,47 +874,6 @@
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#collection-parameters" class="md-nav__link">
<span class="md-ellipsis">
Collection parameters
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#count-storage" class="md-nav__link">
<span class="md-ellipsis">
Count storage
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#query-protocol" class="md-nav__link">
<span class="md-ellipsis">
Query protocol
</span>
</a>
</li>
</ul>
</nav>
</div>
</div>
@@ -1001,86 +890,8 @@
<h1 id="on-disk-collection-structure">On-disk collection structure</h1>
<p>Collections are too large to hold in RAM (hundreds of genomes, billions of kmers). The collection lives on disk as a directory of memory-mapped files:</p>
<div class="highlight"><pre><span></span><code>collection/
  metadata.toml          - collection parameters (see below)
  part_XXXX/
    superkmers.bin.gz    - dereplicated super-kmers for this partition (construction artifact)
    mphf.bin             - minimal perfect hash function for this partition
    counts.bin           - packed n-bit count array (or 1-bit presence array)
    refs.bin             - back-references: one u32 nucleotide offset into unitigs.bin per kmer
    unitigs.bin          - local de Bruijn unitigs (permanent evidence structure)
    overflow.bin         - counts exceeding the packed range (optional)
</code></pre></div>
<p><code>superkmers.bin.gz</code> is produced during phase 1 and consumed through phases 2–4. It can be deleted after phase 5 — it is not needed for querying. The permanent query structure is <code>mphf.bin + counts.bin + refs.bin + unitigs.bin</code>.</p>
<h2 id="collection-parameters">Collection parameters</h2>
<p>Stored in <code>metadata.toml</code>:</p>
<table>
<thead>
<tr>
<th>Parameter</th>
<th>Role</th>
</tr>
</thead>
<tbody>
<tr>
<td>k</td>
<td>kmer length</td>
</tr>
<tr>
<td>m</td>
<td>minimizer length (odd, &lt; k)</td>
</tr>
<tr>
<td>p</td>
<td>partition bits (0 ≤ p ≤ min(14, 2m−16))</td>
</tr>
<tr>
<td>mode</td>
<td><code>presence</code> (1 bit/kmer) or <code>count</code> (n bits/kmer)</td>
</tr>
<tr>
<td>n</td>
<td>bits per kmer in count mode (chosen at construction)</td>
</tr>
<tr>
<td>min_count</td>
<td>singleton filtering threshold (0 = keep all)</td>
</tr>
<tr>
<td>hash_fn</td>
<td>hash function identifier</td>
</tr>
<tr>
<td>hash_seed</td>
<td>seed for the hash function</td>
</tr>
</tbody>
</table>
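<p>The constraints in the table can be checked mechanically. The following is a minimal sketch, assuming the parameters have already been loaded from <code>metadata.toml</code> into a flat dictionary keyed by the parameter names above; the function name <code>validate_params</code> is hypothetical, and the <code>n ≥ 2</code> check is an inference from the count-mode value layout (one absent sentinel, one overflow marker, at least one direct value), not something the table states.</p>

```python
def validate_params(params: dict) -> list[str]:
    """Return a list of constraint violations (empty when the parameters are valid)."""
    errors = []
    k, m, p = params["k"], params["m"], params["p"]

    # m: minimizer length, odd and strictly shorter than k.
    if m % 2 == 0:
        errors.append("m must be odd")
    if m >= k:
        errors.append("m must be smaller than k")

    # p: partition bits, bounded by min(14, 2m - 16).
    max_p = min(14, 2 * m - 16)
    if not (0 <= p <= max_p):
        errors.append(f"p must satisfy 0 <= p <= {max_p}")

    # mode: either 'presence' (1 bit/kmer) or 'count' (n bits/kmer).
    mode = params["mode"]
    if mode not in ("presence", "count"):
        errors.append("mode must be 'presence' or 'count'")

    # Assumption: count mode needs n >= 2 so that at least one direct
    # count value exists between the absent sentinel and the overflow marker.
    if mode == "count" and params.get("n", 0) < 2:
        errors.append("n must be at least 2 in count mode")

    return errors
```

<p>Note that the bound on <code>p</code> implies <code>m ≥ 8</code>, and therefore <code>m ≥ 9</code> once the odd-length requirement is applied.</p>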
<h2 id="count-storage">Count storage</h2>
<p><strong>refs.bin capacity:</strong> <code>unitigs.bin</code> is a flat 2-bit-packed nucleotide stream with no separators. Each entry in <code>refs.bin</code> is a u32 nucleotide offset pointing to the first base of the kmer. A u32 addresses 2³² ≈ 4.3 billion nucleotide positions, i.e. 1 GiB of packed sequence per partition. In the worst case (every unitig holds a single kmer, so offsets are spaced k apart), this supports 2³² / k ≈ 138 million kmers per partition at k=31. In the typical case (long unitigs, consecutive kmers at offset +1), the limit approaches 4.3 billion kmers, well beyond any realistic partition size.</p>
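<p>The capacity figures follow directly from the offset width and the 2-bit encoding. A short sketch of the arithmetic, with hypothetical names (<code>refs_capacity</code>, <code>OFFSET_BITS</code>, <code>BITS_PER_NUCLEOTIDE</code>) chosen only for illustration:</p>

```python
OFFSET_BITS = 32          # width of one refs.bin entry (u32)
BITS_PER_NUCLEOTIDE = 2   # unitigs.bin packs each base into 2 bits


def refs_capacity(k: int) -> dict:
    """Capacity bounds of one partition for kmer length k."""
    positions = 1 << OFFSET_BITS                       # addressable nucleotides
    stream_bytes = positions * BITS_PER_NUCLEOTIDE // 8
    return {
        "positions": positions,
        "stream_bytes": stream_bytes,
        # Worst case: every unitig is exactly one kmer, so each kmer
        # consumes k nucleotide positions of its own.
        "worst_case_kmers": positions // k,
        # Best case: one unitig fills the stream; consecutive kmers
        # overlap by k - 1 bases and start one position apart.
        "best_case_kmers": positions - k + 1,
    }
```

<p>For k=31 this gives a 1 GiB stream, about 138 million kmers in the worst case, and just under 2³² in the best case.</p>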
<p><em>Presence mode</em> (coverage ≤ 1x, or when only presence/absence matters):</p>
<ul>
<li><code>counts.bin</code> is a packed 1-bit array: every bit is set to 1 for indexed kmers</li>
<li>Singletons are the signal, so they are not filtered</li>
</ul>
<p><em>Count mode</em> (coverage &gt; 1x):</p>
<ul>
<li><code>counts.bin</code> is a packed n-bit array; n is chosen at construction from the observed count distribution</li>
<li>Value 0: absent sentinel; values 1..2ⁿ−2: direct count; value 2ⁿ−1: overflow marker</li>
<li>Overflow counts are stored in a separate <code>overflow.bin</code> as sorted <code>(index: u32, count: u32)</code> pairs</li>
<li>Empirically (k=31, 15x coverage): n=5 covers 97% of real kmers, n=6 covers 99%</li>
<li>The min_count threshold filters low-frequency kmers (likely errors) before indexing; at ≤1x coverage, min_count=0</li>
</ul>
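<p>The count-mode encoding above can be sketched as follows. This is an illustration under stated assumptions, not the on-disk format: the function names <code>pack_counts</code> and <code>read_count</code> are hypothetical, slots are laid out least-significant-bit first (the bit order is not specified in this page), and the overflow table is kept as an in-memory list of <code>(index, count)</code> tuples rather than u32 pairs in a file.</p>

```python
import bisect


def pack_counts(counts, n):
    """Pack counts into n-bit slots; return (packed bytes, sorted overflow list)."""
    overflow_marker = (1 << n) - 1
    packed = bytearray((len(counts) * n + 7) // 8)
    overflow = []
    for i, c in enumerate(counts):
        # Counts that do not fit in 1..2^n-2 are replaced by the marker
        # and recorded in the overflow table (sorted by index by construction).
        if c >= overflow_marker:
            overflow.append((i, c))
            slot = overflow_marker
        else:
            slot = c
        bit = i * n
        for b in range(n):
            if (slot >> b) & 1:
                packed[(bit + b) // 8] |= 1 << ((bit + b) % 8)
    return bytes(packed), overflow


def read_count(packed, overflow, n, i):
    """Read the count at index i; 0 means the absent sentinel."""
    overflow_marker = (1 << n) - 1
    bit = i * n
    slot = 0
    for b in range(n):
        if (packed[(bit + b) // 8] >> ((bit + b) % 8)) & 1:
            slot |= 1 << b
    if slot != overflow_marker:
        return slot
    # Binary search in the sorted overflow table; (i,) sorts before (i, count).
    j = bisect.bisect_left(overflow, (i,))
    return overflow[j][1]
```

<p>Because the overflow table is sorted by index, an overflow lookup costs one binary search, while the common case stays a single packed read.</p>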
<h2 id="query-protocol">Query protocol</h2>
<div class="highlight"><pre><span></span><code>query kmer q
  → canonical_minimizer(q) → hash → PART → part_XXXX/
  → MPHF(q) → index i
  → refs[i] = u32 nucleotide offset into unitigs.bin
  → read k nucleotides from unitigs.bin at that offset → compare with q
      → match   : return counts[i]
      → no match: kmer absent
</code></pre></div>
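<p>The verification step matters because a minimal perfect hash function returns some index even for kmers that were never indexed. A minimal sketch of the lookup inside one partition, assuming a 2-bit encoding A=0, C=1, G=2, T=3 packed least-significant bits first (both assumptions), with the MPHF passed in as any callable; partition selection and canonicalization are left out, and all function names are hypothetical:</p>

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def pack_nucleotides(seq):
    """Pack a nucleotide string into a flat 2-bit stream (4 bases per byte)."""
    packed = bytearray((len(seq) * 2 + 7) // 8)
    for i, base in enumerate(seq):
        packed[i // 4] |= ENCODE[base] << (2 * (i % 4))
    return bytes(packed)


def extract_kmer(packed, offset, k):
    """Decode the k nucleotides starting at nucleotide position `offset`."""
    return "".join(
        DECODE[(packed[i // 4] >> (2 * (i % 4))) & 3]
        for i in range(offset, offset + k)
    )


def query(q, mphf, refs, counts, unitigs, k):
    """Return the count of kmer q, or None when q is absent."""
    i = mphf(q)                      # always yields an index, even for foreign kmers
    if extract_kmer(unitigs, refs[i], k) != q:
        return None                  # evidence mismatch: q was never indexed
    return counts[i]
```

<p>In this sketch the comparison against the stored sequence is the only thing that distinguishes an indexed kmer from a foreign one that happens to hash to a valid index.</p>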
<p>See <a href="../obilayeredmap/">obilayeredmap crate</a> for the current on-disk layout.</p>
<p>The index root contains one <code>part_XXXXX/</code> directory per partition, each holding one or more <code>layer_N/</code> directories. Each layer directory contains <code>mphf.bin</code>, <code>unitigs.bin</code>, <code>unitigs.bin.idx</code>, <code>evidence.bin</code>, and optionally a <code>counts/</code> or <code>presence/</code> payload directory.</p>