docs: clarify MPHF indexing, storage layout, and distance traits

Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
2026-05-17 10:20:22 +08:00
parent cf693f17f2
commit f36b095ce2
17 changed files with 916 additions and 1031 deletions
@@ -575,24 +575,6 @@
        
      
      
-        <label class="md-nav__link md-nav__link--active" for="__toc">
-          
-  
-  
-  <span class="md-ellipsis">
-    
-  
-    On-disk storage
-  
-
-    
-  </span>
-  
-  
-
-          <span class="md-nav__icon md-icon"></span>
-        </label>
-      
      <a href="./" class="md-nav__link md-nav__link--active">
        
  
@@ -610,58 +592,6 @@

      </a>
      
-        
-
-<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
-  
-  
-  
-    
-  
-  
-    <label class="md-nav__title" for="__toc">
-      <span class="md-nav__icon md-icon"></span>
-      Table of contents
-    </label>
-    <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
-      
-        <li class="md-nav__item">
-  <a href="#collection-parameters" class="md-nav__link">
-    <span class="md-ellipsis">
-      
-        Collection parameters
-      
-    </span>
-  </a>
-  
-</li>
-      
-        <li class="md-nav__item">
-  <a href="#count-storage" class="md-nav__link">
-    <span class="md-ellipsis">
-      
-        Count storage
-      
-    </span>
-  </a>
-  
-</li>
-      
-        <li class="md-nav__item">
-  <a href="#query-protocol" class="md-nav__link">
-    <span class="md-ellipsis">
-      
-        Query protocol
-      
-    </span>
-  </a>
-  
-</li>
-      
-    </ul>
-  
-</nav>
-      
    </li>
  

@@ -944,47 +874,6 @@
    
  
  
-    <label class="md-nav__title" for="__toc">
-      <span class="md-nav__icon md-icon"></span>
-      Table of contents
-    </label>
-    <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
-      
-        <li class="md-nav__item">
-  <a href="#collection-parameters" class="md-nav__link">
-    <span class="md-ellipsis">
-      
-        Collection parameters
-      
-    </span>
-  </a>
-  
-</li>
-      
-        <li class="md-nav__item">
-  <a href="#count-storage" class="md-nav__link">
-    <span class="md-ellipsis">
-      
-        Count storage
-      
-    </span>
-  </a>
-  
-</li>
-      
-        <li class="md-nav__item">
-  <a href="#query-protocol" class="md-nav__link">
-    <span class="md-ellipsis">
-      
-        Query protocol
-      
-    </span>
-  </a>
-  
-</li>
-      
-    </ul>
-  
 </nav>
                  </div>
                </div>
@@ -1001,86 +890,8 @@


 <h1 id="on-disk-collection-structure">On-disk collection structure</h1>
-<p>Collections are too large to hold in RAM (hundreds of genomes, billions of kmers). The collection lives on disk as a directory of memory-mapped files:</p>
-<div class="highlight"><pre><span></span><code>collection/
-  metadata.toml          — collection parameters (see below)
-  part_XXXX/
-    superkmers.bin.gz    — dereplicated super-kmers for this partition (construction artifact)
-    mphf.bin             — minimal perfect hash function for this partition
-    counts.bin           — packed n-bit count array (or 1-bit presence array)
-    refs.bin             — back-references u32 nucleotide offset into unitigs.bin per kmer
-    unitigs.bin          — local de Bruijn unitigs (permanent evidence structure)
-    overflow.bin         — counts exceeding the packed range (optional)
-</code></pre></div>
-<p><code>superkmers.bin.gz</code> is produced during phase 1 and consumed through phases 2–4. It can be deleted after phase 5 — it is not needed for querying. The permanent query structure is <code>mphf.bin + counts.bin + refs.bin + unitigs.bin</code>.</p>
-<h2 id="collection-parameters">Collection parameters</h2>
-<p>Stored in <code>metadata.toml</code>:</p>
-<table>
-<thead>
-<tr>
-<th>Parameter</th>
-<th>Role</th>
-</tr>
-</thead>
-<tbody>
-<tr>
-<td>k</td>
-<td>kmer length</td>
-</tr>
-<tr>
-<td>m</td>
-<td>minimizer length (odd, &lt; k)</td>
-</tr>
-<tr>
-<td>p</td>
-<td>partition bits (0 ≤ p ≤ min(14, 2m−16))</td>
-</tr>
-<tr>
-<td>mode</td>
-<td><code>presence</code> (1 bit/kmer) or <code>count</code> (n bits/kmer)</td>
-</tr>
-<tr>
-<td>n</td>
-<td>bits per kmer in count mode (chosen at construction)</td>
-</tr>
-<tr>
-<td>min_count</td>
-<td>singleton filtering threshold (0 = keep all)</td>
-</tr>
-<tr>
-<td>hash_fn</td>
-<td>hash function identifier</td>
-</tr>
-<tr>
-<td>hash_seed</td>
-<td>seed for the hash function</td>
-</tr>
-</tbody>
-</table>
-<h2 id="count-storage">Count storage</h2>
-<p><strong>refs.bin capacity:</strong> <code>unitigs.bin</code> is a flat 2-bit-packed nucleotide stream with no separators. Each entry in <code>refs.bin</code> is a u32 nucleotide offset pointing to the first base of the kmer. A u32 covers 4 billion nucleotide positions = 1 GB of sequence per partition. In the worst case (all unitigs of length 1 kmer, offsets spaced k apart), this supports 4 billion / k ≈ 130 million kmers per partition at k=31. In the typical case (long unitigs, consecutive kmers at offset +1), the limit approaches 4 billion kmers — well beyond any realistic partition size.</p>
-<p><em>Presence mode</em> (coverage ≤ 1x, or when only presence/absence matters):</p>
-<ul>
-<li><code>counts.bin</code> is a packed 1-bit array — all bits set to 1 for indexed kmers</li>
-<li>Singletons are the signal, not filtered</li>
-</ul>
-<p><em>Count mode</em> (coverage &gt; 1x):</p>
-<ul>
-<li><code>counts.bin</code> is a packed n-bit array; n chosen at construction from the observed distribution</li>
-<li>Value 0: absent sentinel; values 1..2ⁿ−2: direct count; value 2ⁿ−1: overflow</li>
-<li>Overflow counts stored in a separate <code>overflow.bin</code> as sorted <code>(index: u32, count: u32)</code> pairs</li>
-<li>Empirically (k=31, 15x coverage): n=5 covers 97% of real kmers, n=6 covers 99%</li>
-<li>min_count threshold filters low-frequency kmers (errors) before indexing; for ≤1x, min_count=0</li>
-</ul>
-<h2 id="query-protocol">Query protocol</h2>
-<div class="highlight"><pre><span></span><code>query kmer q
-  → canonical_minimizer(q) → hash → PART → part_XXXX/
-  → MPHF(q) → index i
-  → refs[i] = (unitig_id, kmer_offset)
-  → read unitig from unitigs.bin → extract kmer at kmer_offset → compare with q
-  → match   : return counts[i]
-  → no match: kmer absent
-</code></pre></div>
+<p>See <a href="../obilayeredmap/">obilayeredmap crate</a> for the current on-disk layout.</p>
+<p>The index root contains one <code>part_XXXXX/</code> directory per partition, each holding one or more <code>layer_N/</code> directories. Each layer directory contains <code>mphf.bin</code>, <code>unitigs.bin</code>, <code>unitigs.bin.idx</code>, <code>evidence.bin</code>, and optionally a <code>counts/</code> or <code>presence/</code> payload directory.</p>
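
Note on the added paragraph: the layout it describes can be sketched as path construction. This is a minimal illustration, not the `obilayeredmap` API. The helper names `layer_dir` and `layer_files` are hypothetical, and the five-digit zero padding of the partition number is an assumption read off the `part_XXXXX` placeholder. The optional `counts/` and `presence/` payload directories are left out because their contents are not specified here.

```rust
use std::path::{Path, PathBuf};

/// Files the paragraph above lists for every layer directory.
const LAYER_FILES: [&str; 4] = [
    "mphf.bin",
    "unitigs.bin",
    "unitigs.bin.idx",
    "evidence.bin",
];

/// Hypothetical helper: directory of `layer_N` inside `part_XXXXX`.
/// Assumption: the partition number is decimal, zero-padded to five digits.
fn layer_dir(root: &Path, partition: u32, layer: u32) -> PathBuf {
    root.join(format!("part_{:05}", partition))
        .join(format!("layer_{}", layer))
}

/// Hypothetical helper: paths of the four mandatory files of one layer.
fn layer_files(root: &Path, partition: u32, layer: u32) -> Vec<PathBuf> {
    let dir = layer_dir(root, partition, layer);
    LAYER_FILES.iter().map(|name| dir.join(name)).collect()
}
```

Under these assumptions, calling `layer_files(Path::new("index"), 3, 0)` yields the four file paths under `index/part_00003/layer_0/`.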