docs: expand kmer indexing, filtering, and merging documentation
Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
@@ -64,7 +64,7 @@
|
||||
<div data-md-component="skip">
|
||||
|
||||
|
||||
<a href="#on-disk-collection-structure" class="md-skip">
|
||||
<a href="#on-disk-index-layout" class="md-skip">
|
||||
Skip to content
|
||||
</a>
|
||||
|
||||
@@ -575,6 +575,24 @@
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__link md-nav__link--active" for="__toc">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
On-disk storage
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
</label>
|
||||
|
||||
<a href="./" class="md-nav__link md-nav__link--active">
|
||||
|
||||
|
||||
@@ -592,6 +610,174 @@
|
||||
|
||||
</a>
|
||||
|
||||
|
||||
|
||||
<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__title" for="__toc">
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
Table of contents
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#directory-tree" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Directory tree
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#state-machine-sentinels" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
State machine (sentinels)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#indexmeta-indexmeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
index.meta (IndexMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer-files" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Layer files
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Layer files">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#unitigsbin" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
unitigs.bin
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#unitigsbinidx-exact-only" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
unitigs.bin.idx (Exact only)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#mphfbin" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
mphf.bin
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer_metajson-layermeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
layer_meta.json (LayerMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#evidencebin-exact" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
evidence.bin (Exact)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#fingerprintbin-approx" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
fingerprint.bin (Approx)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#counts-persistentcompactintmatrix" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
counts/ (PersistentCompactIntMatrix)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#presence-persistentbitmatrix" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
presence/ (PersistentBitMatrix)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#metajson-partitionmeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
meta.json (PartitionMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
|
||||
@@ -659,6 +845,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -737,6 +951,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -874,6 +1144,163 @@
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__title" for="__toc">
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
Table of contents
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#directory-tree" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Directory tree
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#state-machine-sentinels" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
State machine (sentinels)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#indexmeta-indexmeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
index.meta (IndexMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer-files" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Layer files
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Layer files">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#unitigsbin" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
unitigs.bin
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#unitigsbinidx-exact-only" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
unitigs.bin.idx (Exact only)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#mphfbin" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
mphf.bin
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer_metajson-layermeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
layer_meta.json (LayerMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#evidencebin-exact" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
evidence.bin (Exact)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#fingerprintbin-approx" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
fingerprint.bin (Approx)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#counts-persistentcompactintmatrix" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
counts/ (PersistentCompactIntMatrix)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#presence-persistentbitmatrix" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
presence/ (PersistentBitMatrix)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#metajson-partitionmeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
meta.json (PartitionMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</nav>
|
||||
</div>
|
||||
</div>
|
||||
@@ -889,9 +1316,131 @@
|
||||
|
||||
|
||||
|
||||
<h1 id="on-disk-collection-structure">On-disk collection structure</h1>
|
||||
<p>See <a href="../obilayeredmap/">obilayeredmap crate</a> for the current on-disk layout.</p>
|
||||
<p>The index root contains one <code>part_XXXXX/</code> directory per partition, each holding one or more <code>layer_N/</code> directories. Each layer directory contains <code>mphf.bin</code>, <code>unitigs.bin</code>, <code>unitigs.bin.idx</code>, <code>evidence.bin</code>, and optionally a <code>counts/</code> or <code>presence/</code> payload directory.</p>
|
||||
<h1 id="on-disk-index-layout">On-disk index layout</h1>
|
||||
<h2 id="directory-tree">Directory tree</h2>
|
||||
<div class="highlight"><pre><span></span><code>&lt;index_root&gt;/
  index.meta                ← JSON: IndexMeta
  scatter.done              ← sentinel: scatter phase complete
  count.done                ← sentinel: dereplicate + count complete
  index.done                ← sentinel: MPHF index fully built
  spectrums/
    &lt;label&gt;.json            ← kmer frequency spectrum per genome
  partitions/
    part_00000/             ← one dir per partition (zero-padded 5 digits, 0..2^n_bits−1)
      index/
        meta.json           ← PartitionMeta { n_layers }
        layer_0/
          unitigs.bin       ← binary unitig sequences (2-bit packed)
          unitigs.bin.idx   ← block-sampled offset index (exact evidence only)
          mphf.bin          ← serialised PtrHash MPHF
          layer_meta.json   ← LayerMeta { evidence: EvidenceKind }
          evidence.bin      ← chunk_id:rank per MPHF slot (Exact only)
          fingerprint.bin   ← b-bit fingerprints per MPHF slot (Approx only)
          counts/           ← PersistentCompactIntMatrix (if with_counts=true)
          presence/         ← PersistentBitMatrix (if presence mode, merge)
        layer_1/            ← added by merge; same structure as layer_0
        layer_2/ …
    part_00001/ …
</code></pre></div>
|
||||
<h2 id="state-machine-sentinels">State machine (sentinels)</h2>
|
||||
<p>The sentinels are touched atomically at the end of each pipeline stage.
A partial run (e.g. scatter interrupted) leaves no sentinel for the interrupted
stage; the state is the most advanced stage whose sentinel is present.</p>
|
||||
<table>
<thead>
<tr>
<th>State</th>
<th>Sentinel present</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Empty</code></td>
<td>none</td>
<td><code>index.meta</code> exists; scatter not started or interrupted</td>
</tr>
<tr>
<td><code>Scattered</code></td>
<td><code>scatter.done</code></td>
<td>All super-kmers routed to partition files</td>
</tr>
<tr>
<td><code>Counted</code></td>
<td><code>count.done</code></td>
<td>Partitions dereplicated; <code>spectrums/</code> written</td>
</tr>
<tr>
<td><code>Indexed</code></td>
<td><code>index.done</code></td>
<td>All MPHF layers built; index ready for queries</td>
</tr>
</tbody>
</table>
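<p>The table above can be turned into a small state probe. The following is only a sketch: it assumes the state is the most advanced stage whose sentinel file exists, and the function name <code>detect_state</code> is hypothetical, not part of the project's API.</p>

```python
from pathlib import Path

# Sentinels in pipeline order, paired with the state each one marks as reached.
SENTINELS = [
    ("scatter.done", "Scattered"),
    ("count.done", "Counted"),
    ("index.done", "Indexed"),
]


def detect_state(index_root):
    """Return the most advanced state whose sentinel file exists (hypothetical helper)."""
    root = Path(index_root)
    if not (root / "index.meta").is_file():
        raise FileNotFoundError(f"no index.meta under {root}")
    state = "Empty"  # index.meta exists, no sentinel yet
    for filename, name in SENTINELS:
        if (root / filename).exists():
            state = name
    return state
```

<p>An interrupted stage never writes its sentinel, so rerunning the probe after a crash reports the last completed stage.</p>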
|
||||
<h2 id="indexmeta-indexmeta">index.meta (IndexMeta)</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="p">{</span>
<span class="w">  </span><span class="nt">"version"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span>
<span class="w">  </span><span class="nt">"config"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w">    </span><span class="nt">"kmer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">31</span><span class="p">,</span>
<span class="w">    </span><span class="nt">"minimizer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">11</span><span class="p">,</span>
<span class="w">    </span><span class="nt">"n_bits"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span>
<span class="w">    </span><span class="nt">"with_counts"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span>
<span class="w">    </span><span class="nt">"evidence"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Exact"</span><span class="p">,</span>
<span class="w">    </span><span class="nt">"block_bits"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w">  </span><span class="p">},</span>
<span class="w">  </span><span class="nt">"genomes"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w">    </span><span class="p">{</span><span class="w"> </span><span class="nt">"label"</span><span class="p">:</span><span class="w"> </span><span class="s2">"genome_A"</span><span class="p">,</span><span class="w"> </span><span class="nt">"meta"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"species"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Homo sapiens"</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span>
<span class="w">  </span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
|
||||
<p><code>n_bits</code> determines the partition count: <code>2^n_bits</code> directories under <code>partitions/</code>.</p>
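<p>As a quick illustration of the relationship between <code>n_bits</code> and the directory names, the sketch below enumerates the partition directories. The helper name <code>partition_dirs</code> is hypothetical; the zero-padded five-digit naming comes from the directory tree.</p>

```python
def partition_dirs(n_bits):
    """Yield the relative path of every partition directory (hypothetical helper)."""
    # 2^n_bits partitions, numbered 0 .. 2^n_bits - 1, zero-padded to 5 digits.
    for i in range(1 << n_bits):
        yield f"partitions/part_{i:05d}"


dirs = list(partition_dirs(8))  # n_bits = 8 as in the example index.meta
```

<p>With <code>n_bits=8</code> this gives 256 directories, from <code>part_00000</code> to <code>part_00255</code>.</p>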
|
||||
<p><code>evidence</code> is either the string <code>"Exact"</code> or <code>{"Approx": {"b": 8, "z": 1}}</code>.</p>
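<p>Because the field is either a bare string or a single-key object, a reader has to handle both shapes. A minimal sketch, assuming only the two forms quoted above (the function name <code>parse_evidence</code> is hypothetical):</p>

```python
import json


def parse_evidence(value):
    """Normalise the index.meta `evidence` field to (kind, b, z) (hypothetical helper)."""
    if value == "Exact":
        return ("Exact", None, None)
    if isinstance(value, dict) and set(value) == {"Approx"}:
        params = value["Approx"]
        return ("Approx", int(params["b"]), int(params["z"]))
    raise ValueError(f"unrecognised evidence value: {value!r}")


# Both shapes as they would appear after decoding the JSON document.
exact = parse_evidence(json.loads('"Exact"'))
approx = parse_evidence(json.loads('{"Approx": {"b": 8, "z": 1}}'))
```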
|
||||
<p><code>block_bits</code> controls the <code>.idx</code> granularity: one offset entry every <code>2^block_bits</code>
chunks. <code>block_bits=0</code> stores one entry per chunk (O(1) random access, largest <code>.idx</code>).</p>
|
||||
<p><code>GenomeInfo.meta</code> is a free-form string→string map for categorical metadata (e.g.
taxonomy, sample origin). It is optional; defaults to empty.</p>
|
||||
<h2 id="layer-files">Layer files</h2>
|
||||
<h3 id="unitigsbin">unitigs.bin</h3>
|
||||
<p>2-bit packed binary unitig sequences. Each record: 1 byte <code>seql_minus_k</code>
(nucleotide length − k), followed by <code>ceil((seql_minus_k + k) / 4)</code> bytes of
packed sequence. Long unitigs are transparently split into overlapping chunks
(k−1 nucleotide overlap) so no k-mer crosses a chunk boundary.</p>
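<p>The record framing can be sketched as a simple scanner. This is an illustration only: it assumes records are stored back to back with no file header or padding, which the text above does not state, and the name <code>iter_unitig_records</code> is hypothetical. It does not decode the nucleotides, since the bit order within a byte is not specified here.</p>

```python
def iter_unitig_records(buf, k):
    """Yield (n_nucleotides, packed_bytes) per record (hypothetical helper).

    Assumes back-to-back records with no header; that layout is an assumption.
    """
    pos = 0
    while pos < len(buf):
        seql_minus_k = buf[pos]          # 1-byte length field
        n_nuc = seql_minus_k + k         # nucleotides in this chunk
        n_bytes = (n_nuc + 3) // 4       # ceil(n_nuc / 4), 4 nucleotides per byte
        start = pos + 1
        end = start + n_bytes
        if end > len(buf):
            raise ValueError("truncated record")
        yield n_nuc, bytes(buf[start:end])
        pos = end


# Two synthetic records for k = 31: 31 nucleotides (8 bytes) and 36 nucleotides (9 bytes).
sample = bytes([0]) + bytes(8) + bytes([5]) + bytes(9)
records = list(iter_unitig_records(sample, 31))
```

<p>The one-byte length field caps a chunk at <code>255 + k</code> nucleotides, that is at most 256 k-mers per chunk, which matches the <code>u8</code> rank used in <code>evidence.bin</code>.</p>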
|
||||
<h3 id="unitigsbinidx-exact-only">unitigs.bin.idx (Exact only)</h3>
|
||||
<p>Magic <code>UIX3</code>, little-endian header: <code>block_bits</code> (u32), <code>n_unitigs</code> (u32),
<code>n_kmers</code> (u64), then <code>ceil(n_unitigs / 2^block_bits) + 1</code> byte-offset entries
(u32 each, last entry is a sentinel past-end offset). Absent for Approx layers.</p>
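<p>A reader for this header could look like the sketch below. It assumes the magic is four ASCII bytes at offset 0 and that the fields follow with no padding; the function name <code>parse_unitig_index</code> is hypothetical.</p>

```python
import struct

# magic (4 bytes), block_bits (u32), n_unitigs (u32), n_kmers (u64), little-endian.
HEADER = struct.Struct("<4sIIQ")


def parse_unitig_index(buf):
    """Parse a unitigs.bin.idx buffer (hypothetical helper, layout assumed unpadded)."""
    magic, block_bits, n_unitigs, n_kmers = HEADER.unpack_from(buf, 0)
    if magic != b"UIX3":
        raise ValueError(f"bad magic: {magic!r}")
    # ceil(n_unitigs / 2^block_bits) sampled offsets plus one past-end sentinel.
    n_entries = -(-n_unitigs // (1 << block_bits)) + 1
    offsets = struct.unpack_from(f"<{n_entries}I", buf, HEADER.size)
    return {
        "block_bits": block_bits,
        "n_unitigs": n_unitigs,
        "n_kmers": n_kmers,
        "offsets": offsets,
    }


# Synthetic example: 5 unitigs sampled every 2 chunks gives 3 offsets + 1 sentinel.
blob = HEADER.pack(b"UIX3", 1, 5, 100) + struct.pack("<4I", 0, 20, 41, 60)
index = parse_unitig_index(blob)
```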
|
||||
<h3 id="mphfbin">mphf.bin</h3>
|
||||
<p>PtrHash MPHF serialised with epserde. Maps canonical kmer (u64, left-aligned
2-bit) to a slot index in <code>[0, n_kmers)</code>.</p>
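<p>To make "left-aligned 2-bit" concrete, here is a sketch of one possible encoding. Two points are assumptions rather than facts from this page: the code mapping <code>A=0, C=1, G=2, T=3</code>, and the convention that the canonical form is the numerically smaller of a kmer and its reverse complement. The function names are hypothetical, and the reverse complement uses a plain loop rather than the project's bitwise algorithm.</p>

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def encode_left_aligned(seq):
    """Pack a kmer (k <= 32) into the top 2k bits of a 64-bit integer."""
    k = len(seq)
    if not 1 <= k <= 32:
        raise ValueError("k must be between 1 and 32")
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value << (64 - 2 * k)  # unused low bits stay zero


def reverse_complement(value, k):
    """Reverse-complement a left-aligned kmer; with this encoding complement is 3 - code."""
    right = value >> (64 - 2 * k)
    rc = 0
    for _ in range(k):
        rc = (rc << 2) | (3 - (right & 3))
        right >>= 2
    return rc << (64 - 2 * k)


def canonical(seq):
    """Smaller of the kmer and its reverse complement (assumed convention)."""
    fwd = encode_left_aligned(seq)
    return min(fwd, reverse_complement(fwd, len(seq)))
```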
|
||||
<h3 id="layer_metajson-layermeta">layer_meta.json (LayerMeta)</h3>
|
||||
<p><div class="highlight"><pre><span></span><code><span class="p">{</span><span class="w"> </span><span class="nt">"evidence"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"exact"</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span>
</code></pre></div>
or
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="w"> </span><span class="nt">"evidence"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"approx"</span><span class="p">,</span><span class="w"> </span><span class="nt">"b"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span><span class="w"> </span><span class="nt">"z"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span>
</code></pre></div></p>
|
||||
<h3 id="evidencebin-exact">evidence.bin (Exact)</h3>
|
||||
<p>One <code>(chunk_id: u32, rank: u8)</code> record per MPHF slot, packed. Used to verify
that the kmer mapped to a slot is actually present: <code>unitigs.bin[chunk_id][rank]</code>
is re-read and compared against the query.</p>
|
||||
<h3 id="fingerprintbin-approx">fingerprint.bin (Approx)</h3>
|
||||
<p><code>b</code>-bit fingerprint per MPHF slot derived from the kmer's sequence hash.
False-positive rate per k-mer query ≈ <code>1/2^b</code>. With Findere parameter <code>z ≥ 2</code>,
<code>z</code> consecutive k-mers must all match, reducing the effective FP rate to
approximately <code>W / 2^(b·z)</code> per read of length <code>L</code>,
where <code>W = L − k − z + 2</code> is the number of windows of <code>z</code> consecutive k-mers in the read.</p>
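<p>The estimate is easy to evaluate numerically. The sketch below just applies the formula from the paragraph above; the function name <code>approx_fp_rate</code> and the example read length of 100 are illustrative choices.</p>

```python
def approx_fp_rate(read_len, k, b, z):
    """Approximate per-read false-positive rate W / 2^(b*z) (hypothetical helper)."""
    windows = read_len - k - z + 2  # W: windows of z consecutive k-mers
    if windows < 1:
        raise ValueError("read too short for the given k and z")
    return windows / 2 ** (b * z)


per_kmer = 1 / 2 ** 8                                   # b = 8, single k-mer query
per_read = approx_fp_rate(read_len=100, k=31, b=8, z=2)  # W = 69
```

<p>For a 100-nucleotide read with <code>k=31</code>, <code>b=8</code> and <code>z=2</code>, there are 69 windows and the estimate is <code>69 / 2^16</code>, roughly 0.001.</p>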
|
||||
<h3 id="counts-persistentcompactintmatrix">counts/ (PersistentCompactIntMatrix)</h3>
|
||||
<p>Present when <code>with_counts=true</code>. One column per genome; each row holds the
per-genome k-mer count for the corresponding MPHF slot. Appended column-by-column
during indexing and merge.</p>
|
||||
<h3 id="presence-persistentbitmatrix">presence/ (PersistentBitMatrix)</h3>
|
||||
<p>Present when the layer was built in presence/absence mode (merge path).
One bit per genome per MPHF slot. Written during merge; never present on a
freshly indexed single-genome layer.</p>
|
||||
<h2 id="metajson-partitionmeta">meta.json (PartitionMeta)</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="w"> </span><span class="nt">"n_layers"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="p">}</span>
</code></pre></div>
|
||||
<p>Records how many <code>layer_N/</code> directories exist under <code>index/</code>. Incremented by
each merge that adds a layer.</p>
|
||||
|
||||
|
||||
|
||||
|
||||