<!DOCTYPE html>
|
||
|
||
<html class="no-js" lang="en">
|
||
<head>
|
||
<meta charset="utf-8"/>
|
||
<meta content="width=device-width,initial-scale=1" name="viewport"/>
|
||
<link href="../mphf/" rel="prev"/>
|
||
<link href="../../architecture/sequences/invariant/" rel="next"/>
|
||
<link href="../../assets/images/favicon.png" rel="icon"/>
|
||
<meta content="mkdocs-1.6.1, mkdocs-material-9.7.6" name="generator"/>
|
||
<title>Unitig evidence encoding - obikmer</title>
|
||
<link href="../../assets/stylesheets/main.484c7ddc.min.css" rel="stylesheet"/>
|
||
<link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
|
||
<link href="https://fonts.googleapis.com/css?family=Roboto:300,300i,400,400i,700,700i%7CRoboto+Mono:400,400i,700,700i&display=fallback" rel="stylesheet"/>
|
||
<style>:root{--md-text-font:"Roboto";--md-code-font:"Roboto Mono"}</style>
|
||
<script>__md_scope=new URL("../..",location),__md_hash=e=>[...e].reduce(((e,_)=>(e<<5)-e+_.charCodeAt(0)),0),__md_get=(e,_=localStorage,t=__md_scope)=>JSON.parse(_.getItem(t.pathname+"."+e)),__md_set=(e,_,t=localStorage,a=__md_scope)=>{try{t.setItem(a.pathname+"."+e,JSON.stringify(_))}catch(e){}}</script>
|
||
</head>
|
||
<body dir="ltr">
|
||
<input autocomplete="off" class="md-toggle" data-md-toggle="drawer" id="__drawer" type="checkbox"/>
|
||
<input autocomplete="off" class="md-toggle" data-md-toggle="search" id="__search" type="checkbox"/>
|
||
<label class="md-overlay" for="__drawer"></label>
|
||
<div data-md-component="skip">
|
||
<a class="md-skip" href="#unitig-based-mphf-evidence-encoding">
|
||
Skip to content
|
||
</a>
|
||
</div>
|
||
<div data-md-component="announce">
|
||
</div>
|
||
<header class="md-header md-header--shadow" data-md-component="header">
|
||
<nav aria-label="Header" class="md-header__inner md-grid">
|
||
<a aria-label="obikmer" class="md-header__button md-logo" data-md-component="logo" href="../.." title="obikmer">
|
||
<svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 8a3 3 0 0 0 3-3 3 3 0 0 0-3-3 3 3 0 0 0-3 3 3 3 0 0 0 3 3m0 3.54C9.64 9.35 6.5 8 3 8v11c3.5 0 6.64 1.35 9 3.54 2.36-2.19 5.5-3.54 9-3.54V8c-3.5 0-6.64 1.35-9 3.54"></path></svg>
|
||
</a>
|
||
<label class="md-header__button md-icon" for="__drawer">
|
||
<svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M3 6h18v2H3zm0 5h18v2H3zm0 5h18v2H3z"></path></svg>
|
||
</label>
|
||
<div class="md-header__title" data-md-component="header-title">
|
||
<div class="md-header__ellipsis">
|
||
<div class="md-header__topic">
|
||
<span class="md-ellipsis">
|
||
obikmer
|
||
</span>
|
||
</div>
|
||
<div class="md-header__topic" data-md-component="header-topic">
|
||
<span class="md-ellipsis">
|
||
|
||
Unitig evidence encoding
|
||
|
||
</span>
|
||
</div>
|
||
</div>
|
||
</div>
|
||
<script>var palette=__md_get("__palette");if(palette&&palette.color){if("(prefers-color-scheme)"===palette.color.media){var media=matchMedia("(prefers-color-scheme: light)"),input=document.querySelector(media.matches?"[data-md-color-media='(prefers-color-scheme: light)']":"[data-md-color-media='(prefers-color-scheme: dark)']");palette.color.media=input.getAttribute("data-md-color-media"),palette.color.scheme=input.getAttribute("data-md-color-scheme"),palette.color.primary=input.getAttribute("data-md-color-primary"),palette.color.accent=input.getAttribute("data-md-color-accent")}for(var[key,value]of Object.entries(palette.color))document.body.setAttribute("data-md-color-"+key,value)}</script>
|
||
</nav>
|
||
</header>
|
||
<div class="md-container" data-md-component="container">
|
||
<main class="md-main" data-md-component="main">
|
||
<div class="md-main__inner md-grid">
|
||
<div class="md-sidebar md-sidebar--primary" data-md-component="sidebar" data-md-type="navigation">
|
||
<div class="md-sidebar__scrollwrap">
|
||
<div class="md-sidebar__inner">
|
||
<nav aria-label="Navigation" class="md-nav md-nav--primary" data-md-level="0">
|
||
<label class="md-nav__title" for="__drawer">
|
||
<a aria-label="obikmer" class="md-nav__button md-logo" data-md-component="logo" href="../.." title="obikmer">
|
||
<svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 8a3 3 0 0 0 3-3 3 3 0 0 0-3-3 3 3 0 0 0-3 3 3 3 0 0 0 3 3m0 3.54C9.64 9.35 6.5 8 3 8v11c3.5 0 6.64 1.35 9 3.54 2.36-2.19 5.5-3.54 9-3.54V8c-3.5 0-6.64 1.35-9 3.54"></path></svg>
|
||
</a>
|
||
obikmer
|
||
</label>
|
||
<ul class="md-nav__list" data-md-scrollfix="">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="../..">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
Home
|
||
|
||
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item md-nav__item--nested">
|
||
<input class="md-nav__toggle md-toggle" id="__nav_2" type="checkbox"/>
|
||
<label class="md-nav__link" for="__nav_2" id="__nav_2_label" tabindex="0">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
Theory
|
||
|
||
|
||
|
||
</span>
|
||
<span class="md-nav__icon md-icon"></span>
|
||
</label>
|
||
<nav aria-expanded="false" aria-labelledby="__nav_2_label" class="md-nav" data-md-level="1">
|
||
<label class="md-nav__title" for="__nav_2">
|
||
<span class="md-nav__icon md-icon"></span>
|
||
|
||
|
||
Theory
|
||
|
||
|
||
</label>
|
||
<ul class="md-nav__list" data-md-scrollfix="">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="../../kmers/">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
Kmers and super-kmers
|
||
|
||
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="../../theory/encoding/">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
DNA encoding
|
||
|
||
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="../../theory/entropy/">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
Entropy filter
|
||
|
||
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="../../theory/minimizer/">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
Minimizer selection
|
||
|
||
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="../../theory/indexing/">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
Partitioning architecture
|
||
|
||
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</li>
|
||
<li class="md-nav__item md-nav__item--active md-nav__item--nested">
|
||
<input checked="" class="md-nav__toggle md-toggle" id="__nav_3" type="checkbox"/>
|
||
<label class="md-nav__link" for="__nav_3" id="__nav_3_label" tabindex="0">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
Implementation
|
||
|
||
|
||
|
||
</span>
|
||
<span class="md-nav__icon md-icon"></span>
|
||
</label>
|
||
<nav aria-expanded="true" aria-labelledby="__nav_3_label" class="md-nav" data-md-level="1">
|
||
<label class="md-nav__title" for="__nav_3">
|
||
<span class="md-nav__icon md-icon"></span>
|
||
|
||
|
||
Implementation
|
||
|
||
|
||
</label>
|
||
<ul class="md-nav__list" data-md-scrollfix="">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="../superkmer/">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
SuperKmer
|
||
|
||
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="../kmer/">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
Kmer
|
||
|
||
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="../chunkreader/">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
Chunk reader
|
||
|
||
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="../pipeline/">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
Construction pipeline
|
||
|
||
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="../obipipeline/">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
obipipeline library
|
||
|
||
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="../storage/">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
On-disk storage
|
||
|
||
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="../mphf/">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
MPHF selection
|
||
|
||
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item md-nav__item--active">
|
||
<input class="md-nav__toggle md-toggle" id="__toc" type="checkbox"/>
|
||
<label class="md-nav__link md-nav__link--active" for="__toc">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
Unitig evidence encoding
|
||
|
||
|
||
|
||
</span>
|
||
<span class="md-nav__icon md-icon"></span>
|
||
</label>
|
||
<a class="md-nav__link md-nav__link--active" href="./">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
Unitig evidence encoding
|
||
|
||
|
||
|
||
</span>
|
||
</a>
|
||
<nav aria-label="Table of contents" class="md-nav md-nav--secondary">
|
||
<label class="md-nav__title" for="__toc">
|
||
<span class="md-nav__icon md-icon"></span>
|
||
Table of contents
|
||
</label>
|
||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#role-of-unitigs-in-the-index">
|
||
<span class="md-ellipsis">
|
||
|
||
Role of unitigs in the index
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#two-encoding-strategies">
|
||
<span class="md-ellipsis">
|
||
|
||
Two encoding strategies
|
||
|
||
</span>
|
||
</a>
|
||
<nav aria-label="Two encoding strategies" class="md-nav">
|
||
<ul class="md-nav__list">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#strategy-a-global-nucleotide-offset">
|
||
<span class="md-ellipsis">
|
||
|
||
Strategy A — global nucleotide offset
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#strategy-b-unitig_id-rank-within-unitig">
|
||
<span class="md-ellipsis">
|
||
|
||
Strategy B — (unitig_id, rank within unitig)
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#bit-cost-analysis">
|
||
<span class="md-ellipsis">
|
||
|
||
Bit-cost analysis
|
||
|
||
</span>
|
||
</a>
|
||
<nav aria-label="Bit-cost analysis" class="md-nav">
|
||
<ul class="md-nav__list">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#empirical-bound-on-unitig-length">
|
||
<span class="md-ellipsis">
|
||
|
||
Empirical bound on unitig length
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#split-strategy-for-long-unitigs">
|
||
<span class="md-ellipsis">
|
||
|
||
Split strategy for long unitigs
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#savings-from-u8-length-fields">
|
||
<span class="md-ellipsis">
|
||
|
||
Savings from u8 length fields
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#partition-size-tradeoff">
|
||
<span class="md-ellipsis">
|
||
|
||
Partition-size tradeoff
|
||
|
||
</span>
|
||
</a>
|
||
<nav aria-label="Partition-size tradeoff" class="md-nav">
|
||
<ul class="md-nav__list">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#empirical-observation-m_u-is-set-by-de-bruijn-graph-topology-not-partition-count">
|
||
<span class="md-ellipsis">
|
||
|
||
Empirical observation: m_u is set by De Bruijn graph topology, not partition count
|
||
|
||
</span>
|
||
</a>
|
||
<nav aria-label="Empirical observation: m_u is set by De Bruijn graph topology, not partition count" class="md-nav">
|
||
<ul class="md-nav__list">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#per-partition-compaction-ratio-sk_symbols-u_symbols">
|
||
<span class="md-ellipsis">
|
||
|
||
Per-partition compaction ratio (sk_symbols / u_symbols)
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#implementation-notes">
|
||
<span class="md-ellipsis">
|
||
|
||
Implementation notes
|
||
|
||
</span>
|
||
</a>
|
||
<nav aria-label="Implementation notes" class="md-nav">
|
||
<ul class="md-nav__list">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#evidence-file-layout-strategy-b">
|
||
<span class="md-ellipsis">
|
||
|
||
Evidence file layout (strategy B)
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#unitig-file-layout">
|
||
<span class="md-ellipsis">
|
||
|
||
Unitig file layout
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#decoding-a-kmer-from-slot-s">
|
||
<span class="md-ellipsis">
|
||
|
||
Decoding a kmer from slot s
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#forward-vs-reverse-complement">
|
||
<span class="md-ellipsis">
|
||
|
||
Forward vs reverse complement
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#open-questions">
|
||
<span class="md-ellipsis">
|
||
|
||
Open questions
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</li>
|
||
<li class="md-nav__item md-nav__item--nested">
|
||
<input class="md-nav__toggle md-toggle" id="__nav_4" type="checkbox"/>
|
||
<label class="md-nav__link" for="__nav_4" id="__nav_4_label" tabindex="0">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
Architecture
|
||
|
||
|
||
|
||
</span>
|
||
<span class="md-nav__icon md-icon"></span>
|
||
</label>
|
||
<nav aria-expanded="false" aria-labelledby="__nav_4_label" class="md-nav" data-md-level="1">
|
||
<label class="md-nav__title" for="__nav_4">
|
||
<span class="md-nav__icon md-icon"></span>
|
||
|
||
|
||
Architecture
|
||
|
||
|
||
</label>
|
||
<ul class="md-nav__list" data-md-scrollfix="">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="../../architecture/sequences/invariant/">
|
||
<span class="md-ellipsis">
|
||
|
||
|
||
Sequences
|
||
|
||
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</div>
|
||
</div>
|
||
</div>
|
||
<div class="md-sidebar md-sidebar--secondary" data-md-component="sidebar" data-md-type="toc">
|
||
<div class="md-sidebar__scrollwrap">
|
||
<div class="md-sidebar__inner">
|
||
<nav aria-label="Table of contents" class="md-nav md-nav--secondary">
|
||
<label class="md-nav__title" for="__toc">
|
||
<span class="md-nav__icon md-icon"></span>
|
||
Table of contents
|
||
</label>
|
||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#role-of-unitigs-in-the-index">
|
||
<span class="md-ellipsis">
|
||
|
||
Role of unitigs in the index
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#two-encoding-strategies">
|
||
<span class="md-ellipsis">
|
||
|
||
Two encoding strategies
|
||
|
||
</span>
|
||
</a>
|
||
<nav aria-label="Two encoding strategies" class="md-nav">
|
||
<ul class="md-nav__list">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#strategy-a-global-nucleotide-offset">
|
||
<span class="md-ellipsis">
|
||
|
||
Strategy A — global nucleotide offset
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#strategy-b-unitig_id-rank-within-unitig">
|
||
<span class="md-ellipsis">
|
||
|
||
Strategy B — (unitig_id, rank within unitig)
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#bit-cost-analysis">
|
||
<span class="md-ellipsis">
|
||
|
||
Bit-cost analysis
|
||
|
||
</span>
|
||
</a>
|
||
<nav aria-label="Bit-cost analysis" class="md-nav">
|
||
<ul class="md-nav__list">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#empirical-bound-on-unitig-length">
|
||
<span class="md-ellipsis">
|
||
|
||
Empirical bound on unitig length
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#split-strategy-for-long-unitigs">
|
||
<span class="md-ellipsis">
|
||
|
||
Split strategy for long unitigs
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#savings-from-u8-length-fields">
|
||
<span class="md-ellipsis">
|
||
|
||
Savings from u8 length fields
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#partition-size-tradeoff">
|
||
<span class="md-ellipsis">
|
||
|
||
Partition-size tradeoff
|
||
|
||
</span>
|
||
</a>
|
||
<nav aria-label="Partition-size tradeoff" class="md-nav">
|
||
<ul class="md-nav__list">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#empirical-observation-m_u-is-set-by-de-bruijn-graph-topology-not-partition-count">
|
||
<span class="md-ellipsis">
|
||
|
||
Empirical observation: m_u is set by De Bruijn graph topology, not partition count
|
||
|
||
</span>
|
||
</a>
|
||
<nav aria-label="Empirical observation: m_u is set by De Bruijn graph topology, not partition count" class="md-nav">
|
||
<ul class="md-nav__list">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#per-partition-compaction-ratio-sk_symbols-u_symbols">
|
||
<span class="md-ellipsis">
|
||
|
||
Per-partition compaction ratio (sk_symbols / u_symbols)
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#implementation-notes">
|
||
<span class="md-ellipsis">
|
||
|
||
Implementation notes
|
||
|
||
</span>
|
||
</a>
|
||
<nav aria-label="Implementation notes" class="md-nav">
|
||
<ul class="md-nav__list">
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#evidence-file-layout-strategy-b">
|
||
<span class="md-ellipsis">
|
||
|
||
Evidence file layout (strategy B)
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#unitig-file-layout">
|
||
<span class="md-ellipsis">
|
||
|
||
Unitig file layout
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#decoding-a-kmer-from-slot-s">
|
||
<span class="md-ellipsis">
|
||
|
||
Decoding a kmer from slot s
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#forward-vs-reverse-complement">
|
||
<span class="md-ellipsis">
|
||
|
||
Forward vs reverse complement
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</li>
|
||
<li class="md-nav__item">
|
||
<a class="md-nav__link" href="#open-questions">
|
||
<span class="md-ellipsis">
|
||
|
||
Open questions
|
||
|
||
</span>
|
||
</a>
|
||
</li>
|
||
</ul>
|
||
</nav>
|
||
</div>
|
||
</div>
|
||
</div>
|
||
<div class="md-content" data-md-component="content">
|
||
<article class="md-content__inner md-typeset">
|
||
<h1 id="unitig-based-mphf-evidence-encoding">Unitig-based MPHF evidence encoding</h1>
|
||
<h2 id="role-of-unitigs-in-the-index">Role of unitigs in the index</h2>
|
||
<p>The MPHF maps each canonical kmer to an integer slot, but provides no way to reconstruct the kmer from its slot. A downstream operation (query, set operation) that receives a slot index and needs the kmer sequence must be able to retrieve it. The <strong>evidence file</strong> serves this purpose: it stores the kmer sequences in compact form and provides, for each MPHF slot, a pointer to where the corresponding kmer can be decoded.</p>
|
||
<p>Unitigs are the natural compact representation: a run of L nucleotides encodes L − k + 1 consecutive canonical kmers. The entire kmer set of a partition can be reconstructed from its unitig FASTA file.</p>
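<p>The L − k + 1 relation can be checked with a minimal sketch. It assumes that the canonical form of a kmer is the lexicographically smaller of the kmer and its reverse complement; the index may use a different canonical rule, and the names <code>canonical</code> and <code>unitig_kmers</code> are illustrative only.</p>

```python
# Complement table for the four nucleotides.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the smaller of a kmer and its reverse complement (assumed rule)."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def unitig_kmers(unitig: str, k: int) -> list[str]:
    """List the len(unitig) - k + 1 canonical kmers encoded by one unitig."""
    n_kmers = len(unitig) - k + 1
    return [canonical(unitig[i:i + k]) for i in range(n_kmers)]


kmers = unitig_kmers("ACGTTGCA", k=5)
print(len(kmers))  # 8 - 5 + 1 = 4
```

<p>Applying <code>unitig_kmers</code> to every record of a partition's unitig file would, under the same assumption, enumerate the partition's kmer set.</p>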
|
||
<hr/>
|
||
<h2 id="two-encoding-strategies">Two encoding strategies</h2>
|
||
<h3 id="strategy-a-global-nucleotide-offset">Strategy A — global nucleotide offset</h3>
|
||
<p>Each MPHF slot stores a single integer: the offset, counted in nucleotides, of the kmer's first nucleotide within a packed 2-bit nucleotide array that concatenates all unitigs.</p>
|
||
<div class="highlight"><pre><span></span><code>evidence[slot] = global_offset    (bits: ⌈log₂ N_nuc⌉)
</code></pre></div>
|
||
<p>where <code>N_nuc</code> is the total number of nucleotides across all unitigs in the partition.</p>
|
||
<p>Decoding: read k nucleotides starting at <code>global_offset</code>.</p>
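<p>A minimal sketch of Strategy A decoding follows. The 2-bit code (A=0, C=1, G=2, T=3), the bit order inside each byte and the helper names <code>pack</code> and <code>decode_a</code> are assumptions made for illustration, not the on-disk format.</p>

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding
BASES = "ACGT"


def pack(unitigs: list[str]) -> tuple[bytearray, int]:
    """Concatenate unitigs into a 2-bit-per-nucleotide array; also return N_nuc."""
    sequence = "".join(unitigs)
    packed = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence):
        # Nucleotide i lives in byte i // 4, at bit position 2 * (i % 4).
        packed[i // 4] |= CODE[base] * 4 ** (i % 4)
    return packed, len(sequence)


def decode_a(packed: bytearray, global_offset: int, k: int) -> str:
    """Read k nucleotides starting at global_offset (a nucleotide index)."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) % 4]
        for i in range(global_offset, global_offset + k)
    )


packed, n_nuc = pack(["ACGTAC", "GGTCA"])
print(decode_a(packed, 6, 5))  # GGTCA, the first kmer of the second unitig
```

<p>Only offsets that start a real kmer are ever stored as evidence, so a read never straddles the junction between two unitigs.</p>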
|
||
<h3 id="strategy-b-unitig_id-rank-within-unitig">Strategy B — (unitig_id, rank within unitig)</h3>
|
||
<p>Each MPHF slot stores a pair:</p>
|
||
<div class="highlight"><pre><span></span><code>evidence[slot] = (unitig_id, rank)
</code></pre></div>
|
||
<ul>
|
||
<li><code>unitig_id</code> : index of the unitig in the partition (0-based)</li>
|
||
<li><code>rank</code> : kmer index within the unitig (0 ≤ rank &lt; n_kmers). Kmer i starts at nucleotide i, so the rank is numerically equal to the nucleotide offset inside the unitig; reading it in kmer units is simply the more natural interpretation.</li>
|
||
</ul>
|
||
<p>Decoding: look up the unitig at <code>unitig_id</code>, then read k nucleotides starting at <code>rank</code>.</p>
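<p>Strategy B can be sketched in the same spirit. Here a Python dict keyed by the kmer string stands in for the MPHF slot lookup, unitigs are kept as plain strings, and strand handling is ignored; <code>build_evidence</code> and <code>decode_b</code> are hypothetical names.</p>

```python
def build_evidence(unitigs: list[str], k: int) -> dict[str, tuple[int, int]]:
    """Map each kmer to its (unitig_id, rank) pair.

    A real index would store the pair at the kmer's MPHF slot; the dict
    is only a stand-in for that lookup.
    """
    evidence = {}
    for unitig_id, unitig in enumerate(unitigs):
        for rank in range(len(unitig) - k + 1):
            evidence[unitig[rank:rank + k]] = (unitig_id, rank)
    return evidence


def decode_b(unitigs: list[str], unitig_id: int, rank: int, k: int) -> str:
    """Pick the unitig, then read k nucleotides starting at rank."""
    unitig = unitigs[unitig_id]
    return unitig[rank:rank + k]


unitigs = ["ACGTAC", "GGTCA"]
evidence = build_evidence(unitigs, k=5)
print(evidence["CGTAC"])  # (0, 1)
```

<p>Decoding touches a single unitig, which is the locality property discussed in the bit-cost analysis.</p>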
|
||
<hr/>
|
||
<h2 id="bit-cost-analysis">Bit-cost analysis</h2>
|
||
<p>For a partition of P kmers with an average of m kmers per unitig, define:</p>
|
||
<ul>
|
||
<li>total nucleotides: <span class="arithmatex">\(N_{nuc} = P \cdot \left(1 + \dfrac{k-1}{m}\right)\)</span></li>
|
||
<li>number of unitigs: <span class="arithmatex">\(U = P / m\)</span></li>
|
||
</ul>
|
||
<p><strong>Strategy A</strong></p>
|
||
<div class="arithmatex">\[
b_A = \left\lceil \log_2 N_{nuc} \right\rceil = \left\lceil \log_2 P + \log_2\!\left(1 + \frac{k-1}{m}\right) \right\rceil
\]</div>
|
||
<p><strong>Strategy B</strong></p>
|
||
<div class="arithmatex">\[
b_B = \left\lceil \log_2 U \right\rceil + \left\lceil \log_2 L_{max} \right\rceil
\]</div>
|
||
<p>where <span class="arithmatex">\(L_{max}\)</span> is the maximum unitig length. Because the rank is a kmer index, <span class="arithmatex">\(L_{max}\)</span> is counted in kmers; counting it in nucleotides would only add k − 1 to it. In practice <span class="arithmatex">\(L_{max} \ll P\)</span>, so the rank field is much cheaper than the full global offset. If unitig lengths are bounded (e.g. by partition structure), the rank field width is a small constant independent of P.</p>
|
||
<h3 id="empirical-bound-on-unitig-length">Empirical bound on unitig length</h3>
|
||
<p>Lengths and ranks are expressed in <strong>kmer units</strong> (not nucleotides): the nucleotide length is <code>n_kmers + k − 1</code>, so storing <code>n_kmers</code> instead of <code>seq_length</code> frees k − 1 units of headroom (30 for k = 31) in the same field width.</p>
|
||
<p>Consequence for <code>u8</code> capacity:</p>
|
||
<table>
|
||
<thead>
|
||
<tr>
|
||
<th>unit</th>
|
||
<th>max representable</th>
|
||
<th>equivalent in the other unit</th>
|
||
</tr>
|
||
</thead>
|
||
<tbody>
|
||
<tr>
|
||
<td>nucleotides</td>
|
||
<td>255 nuc</td>
|
||
<td>225 kmers</td>
|
||
</tr>
|
||
<tr>
|
||
<td><strong>kmers</strong></td>
|
||
<td><strong>255 kmers</strong></td>
|
||
<td><strong>285 nuc</strong></td>
|
||
</tr>
|
||
</tbody>
|
||
</table>
|
||
<p>On <em>Betula nana</em> (k=31, 256 partitions), unitigs hold m_u ≈ 37.9 kmers on average; the unitig length distribution has not been measured yet. The <code>rank</code> field (kmer index within the unitig) fits in a <code>u8</code> as long as no unitig exceeds 255 kmers, which the split strategy below guarantees.</p>
|
||
<h3 id="split-strategy-for-long-unitigs">Split strategy for long unitigs</h3>
|
||
<p>For the rare cases where a unitig exceeds 255 kmers, the unitig is split into chunks of at most 255 kmers, with a <strong>k−1 nucleotide overlap</strong> at each junction — identical to the way super-kmers are delimited at partition boundaries. Each chunk is self-contained and independently decodable.</p>
|
||
<div class="highlight"><pre><span></span><code>original unitig:  kmer_0 … kmer_254 | kmer_255 … kmer_N
                                    ↑ cut here

chunk 1:  nucleotides 0   … 284     (255 kmers)
chunk 2:  nucleotides 255 … N+k-1   (N-255+1 kmers)
shared:   nucleotides 255 … 284     (k-1 = 30 nucleotides, stored in both)
</code></pre></div>
|
||
<p>Cost of one split: k−1 = 30 redundant nucleotides = 60 bits. Splits are expected to be rare (m_u ≈ 38 for <em>B. nana</em>, well below the 255-kmer cap), although the length distribution still has to be measured. No kmer is lost: kmer i is in chunk 1 if i &lt; 255, and in chunk 2 (at rank i−255) otherwise.</p>
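<p>The split rule can be sketched as follows. Unitigs are plain strings here, and the rule is generalised to more than two chunks by cutting every 255 kmers; that generalisation and the names <code>split_unitig</code> and <code>locate</code> are assumptions of this sketch.</p>

```python
MAX_KMERS = 255  # largest kmer count representable in a u8


def split_unitig(unitig: str, k: int, max_kmers: int = MAX_KMERS) -> list[str]:
    """Cut a unitig into chunks of at most max_kmers kmers with k-1 overlap."""
    n_kmers = len(unitig) - k + 1
    chunks = []
    for first in range(0, n_kmers, max_kmers):
        count = min(max_kmers, n_kmers - first)
        # A chunk holding `count` kmers spans count + k - 1 nucleotides.
        chunks.append(unitig[first:first + count + k - 1])
    return chunks


def locate(i: int, max_kmers: int = MAX_KMERS) -> tuple[int, int]:
    """Map kmer index i of the original unitig to (chunk index, rank)."""
    return divmod(i, max_kmers)


# A 330-nucleotide unitig holds 300 kmers for k = 31 and needs one split.
unitig = "".join("ACGT"[(i * i + i // 3) % 4] for i in range(330))
chunks = split_unitig(unitig, k=31)
print([len(chunk) for chunk in chunks])  # [285, 75]
```

<p>The second chunk starts at nucleotide 255, so its first 30 nucleotides repeat the last 30 of the first chunk, matching the diagram above.</p>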
|
||
<h3 id="savings-from-u8-length-fields">Savings from u8 length fields</h3>
|
||
<p>Because all chunks are guaranteed ≤ 255 kmers, the per-chunk length array in the binary index is a flat <code>u8</code> array: 1 byte per chunk instead of 8 bytes (usize on a 64-bit target) or 4 bytes (u32). For a partition with 4 M chunks:</p>
|
||
<table>
|
||
<thead>
|
||
<tr>
|
||
<th>length type</th>
|
||
<th>bytes/chunk</th>
|
||
<th>total (4 M chunks)</th>
|
||
</tr>
|
||
</thead>
|
||
<tbody>
|
||
<tr>
|
||
<td>usize (u64)</td>
|
||
<td>8</td>
|
||
<td>32 MB</td>
|
||
</tr>
|
||
<tr>
|
||
<td>u32</td>
|
||
<td>4</td>
|
||
<td>16 MB</td>
|
||
</tr>
|
||
<tr>
|
||
<td><strong>u8</strong></td>
|
||
<td><strong>1</strong></td>
|
||
<td><strong>4 MB</strong></td>
|
||
</tr>
|
||
</tbody>
|
||
</table>
|
||
<p>Random access to chunk i is recovered at load time by a single prefix-sum pass over the u8 array, which builds a u32 or u64 offset array in O(n_chunks) time. The offset array costs 4 or 8 bytes per chunk; it is computed once at open time and cached for the lifetime of the partition handle.</p>
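<p>The prefix-sum pass might look like the sketch below. It assumes that chunks are stored back to back in one packed nucleotide array and that offsets are expressed in nucleotides; the actual file layout may differ, and <code>chunk_offsets</code> is a hypothetical helper.</p>

```python
from itertools import accumulate


def chunk_offsets(n_kmers_per_chunk: bytes, k: int) -> list[int]:
    """Return the starting nucleotide offset of every chunk, plus the total.

    Each length is a kmer count in 1..255; a chunk of n kmers occupies
    n + k - 1 nucleotides in the packed sequence.
    """
    nucleotide_lengths = (n + k - 1 for n in n_kmers_per_chunk)
    return list(accumulate(nucleotide_lengths, initial=0))


lengths = bytes([255, 45, 12])  # flat u8 array, 1 byte per chunk
offsets = chunk_offsets(lengths, k=31)
print(offsets)  # [0, 285, 360, 402]
```

<p>Chunk i then spans nucleotides <code>offsets[i]</code> to <code>offsets[i + 1] - 1</code>, so a lookup is two array reads after the one-time pass.</p>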
|
||
<p>Bit costs for <em>Betula nana</em> (k=31, 256 partitions, P ≈ 10.4 M, U ≈ 275 k, m_u ≈ 37.9):</p>
|
||
<table>
|
||
<thead>
|
||
<tr>
|
||
<th>field</th>
|
||
<th>strategy A</th>
|
||
<th>strategy B</th>
|
||
</tr>
|
||
</thead>
|
||
<tbody>
|
||
<tr>
|
||
<td>offset / id</td>
|
||
<td><span class="arithmatex">\(\lceil\log_2(P \cdot (1 + 30/m_u))\rceil = 25\)</span> bits</td>
|
||
<td><span class="arithmatex">\(\lceil\log_2(U)\rceil = 19\)</span> bits</td>
|
||
</tr>
|
||
<tr>
|
||
<td>rank</td>
|
||
<td>—</td>
|
||
<td>8 bits (u8, fixed)</td>
|
||
</tr>
|
||
<tr>
|
||
<td><strong>total</strong></td>
|
||
<td><strong>25 bits</strong></td>
|
||
<td><strong>27 bits</strong></td>
|
||
</tr>
|
||
</tbody>
|
||
</table>
|
||
<p>Strategy A is 2 bits cheaper. Strategy B's main advantage is <strong>locality</strong>: decoding a kmer touches one unitig's cache lines rather than an arbitrary offset in a large flat array, and the <code>rank</code> field doubles as a direct index into the packed nucleotide sequence without pointer arithmetic.</p>
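<p>The figures in the table above can be reproduced from the two formulas of this section. The sketch below only restates them; the function names are illustrative, and the 8-bit rank width is the fixed <code>u8</code> field rather than ⌈log₂ L_max⌉.</p>

```python
from math import ceil, log2


def bits_strategy_a(p: float, m: float, k: int) -> int:
    """b_A = ceil(log2(N_nuc)) with N_nuc = P * (1 + (k - 1) / m)."""
    n_nuc = p * (1 + (k - 1) / m)
    return ceil(log2(n_nuc))


def bits_strategy_b(u: float, rank_bits: int = 8) -> int:
    """b_B = ceil(log2(U)) + rank width (8 bits for a u8 rank)."""
    return ceil(log2(u)) + rank_bits


# B. nana at 256 partitions: P = 10.4 M, U = 275 k, m_u = 37.9, k = 31.
print(bits_strategy_a(10.4e6, 37.9, 31))  # 25
print(bits_strategy_b(275e3))             # 27
```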
|
||
<hr/>
|
||
<h2 id="partition-size-tradeoff">Partition-size tradeoff</h2>
|
||
<p>The total bits/kmer for the index (sequence + evidence + MPHF) as a function of the partition size P, taking the Strategy A evidence cost without its ceiling, is:</p>
|
||
<div class="arithmatex">\[
\text{total} = \underbrace{2\!\left(1 + \frac{k-1}{m}\right)}_{\text{sequence}} + \underbrace{\log_2 P + \log_2\!\left(1+\frac{k-1}{m}\right)}_{\text{evidence}} + \underbrace{c_{MPHF}}_{\approx 2\text{–}4}
\]</div>
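<p>As a rough numerical illustration of this formula, the sketch below evaluates the three terms. The default <code>c_mphf = 3.0</code> is an arbitrary value picked inside the 2 to 4 range quoted above, not a measured constant, and the function name is hypothetical.</p>

```python
from math import log2


def total_bits_per_kmer(p: float, m: float, k: int, c_mphf: float = 3.0) -> float:
    """Sequence + evidence (Strategy A, no ceiling) + MPHF bits per kmer."""
    overhead = 1 + (k - 1) / m          # nucleotides stored per kmer
    sequence = 2 * overhead             # 2 bits per nucleotide
    evidence = log2(p) + log2(overhead)
    return sequence + evidence + c_mphf


full = total_bits_per_kmer(10.4e6, 37.9, 31)
half = total_bits_per_kmer(5.2e6, 37.9, 31)
print(round(full - half, 6))  # halving P saves 1 bit per kmer
```

<p>Only the evidence term depends on P, so at constant m each halving of the partition size lowers the total by one bit.</p>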
|
||
<h3 id="empirical-observation-m_u-is-set-by-de-bruijn-graph-topology-not-partition-count">Empirical observation: m_u is set by De Bruijn graph topology, not partition count</h3>
|
||
<p>Measured on <em>Betula nana</em> (k=31, minimizer length m=11; this m is distinct from the kmers-per-unitig average m of the bit-cost analysis), summing n_kmers and sequence counts across all partition files:</p>
|
||
<table>
|
||
<thead>
|
||
<tr>
|
||
<th>N partitions</th>
|
||
<th>m_sk</th>
|
||
<th>m_u</th>
|
||
<th>factor m_u/m_sk</th>
|
||
<th>nuc ratio (u/sk)</th>
|
||
</tr>
|
||
</thead>
|
||
<tbody>
|
||
<tr>
|
||
<td>1</td>
|
||
<td>12.13</td>
|
||
<td><strong>41.89</strong></td>
|
||
<td>3.45×</td>
|
||
<td>0.273</td>
|
||
</tr>
|
||
<tr>
|
||
<td>16</td>
|
||
<td>12.13</td>
|
||
<td><strong>38.19</strong></td>
|
||
<td>3.15×</td>
|
||
<td>0.376</td>
|
||
</tr>
|
||
<tr>
|
||
<td>256</td>
|
||
<td>12.13</td>
|
||
<td><strong>37.90</strong></td>
|
||
<td>3.12×</td>
|
||
<td>0.388</td>
|
||
</tr>
|
||
<tr>
|
||
<td>1 024</td>
|
||
<td>12.13</td>
|
||
<td><strong>37.89</strong></td>
|
||
<td>3.12×</td>
|
||
<td>0.389</td>
|
||
</tr>
|
||
</tbody>
|
||
</table>
|
||
<ul>
|
||
<li><code>m_sk</code> = avg kmers/super-kmer (invariant — same dataset regardless of partition scheme)</li>
|
||
<li><code>m_u</code> = avg kmers/unitig = total_n_kmers / total_unitigs, summed across all partitions</li>
|
||
<li><code>nuc ratio</code> = (u_symbols + 30·u_reads) / (sk_symbols + 30·sk_reads)</li>
|
||
</ul>
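<p>The three quantities defined above can be computed from four counters. The sketch reads <code>*_symbols</code> as total kmer counts and <code>*_reads</code> as record counts, which is an interpretation of the definitions; the numbers in the example are made up to land on round values and are not measurements.</p>

```python
def partition_stats(sk_symbols: int, sk_reads: int,
                    u_symbols: int, u_reads: int,
                    k: int = 31) -> tuple[float, float, float]:
    """Return (m_sk, m_u, nuc ratio) from kmer and record counts."""
    m_sk = sk_symbols / sk_reads
    m_u = u_symbols / u_reads
    # Every record stores k - 1 nucleotides on top of its kmer count.
    u_nucleotides = u_symbols + (k - 1) * u_reads
    sk_nucleotides = sk_symbols + (k - 1) * sk_reads
    return m_sk, m_u, u_nucleotides / sk_nucleotides


# Hypothetical counters, chosen only to exercise the formulas.
m_sk, m_u, nuc_ratio = partition_stats(1213, 100, 758, 20)
print(m_sk, m_u)  # 12.13 37.9
```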
|
||
<p>X-axis in both charts: partition bits (0 = 1 partition, 10 = 1024 partitions) — each step doubles the partition count.</p>
|
||
<pre class="mermaid"><code>xychart-beta
    title "m_u (avg kmers/unitig) vs partition bits — B. nana k=31"
    x-axis "partition bits" 0 --&gt; 10
    y-axis "m_u" 37 --&gt; 43
    line [41.89, 40.78, 39.22, 38.52, 38.19, 38.03, 37.96, 37.92, 37.90, 37.89, 37.89]</code></pre>
|
||
<pre class="mermaid"><code>xychart-beta
    title "Nucleotide storage: unitigs / super-kmers (%) vs partition bits — B. nana k=31"
    x-axis "partition bits" 0 --&gt; 10
    y-axis "%" 25 --&gt; 42
    line [27.3, 29.7, 33.9, 36.3, 37.6, 38.3, 38.6, 38.7, 38.8, 38.9, 38.9]</code></pre>
|
||
<p>Key observations:</p>
|
||
<ol>
|
||
<li><strong>Partition boundaries have a small but non-zero effect on m_u.</strong> Going from 1 to 1024 partitions reduces m_u by about 10% (41.9 → 37.9). Within the practical range of 16 to 1024 partitions, the variation is under 1%, so m_u is effectively constant.</li>
|
||
<li><strong>m_u is a property of the De Bruijn graph, not the partition scheme.</strong> The dominant factor is graph branching (heterozygosity, repeats, sequencing errors).</li>
|
||
<li><strong>Unitigs provide substantial compaction over super-kmers.</strong> At 256 partitions, unitigs cover the same unique kmers using 39% of the raw nucleotide content of super-kmers (about 2.6× fewer nucleotides), and the average unitig holds 3.1× more kmers than the average super-kmer.</li>
|
||
</ol>
|
||
<h4 id="per-partition-compaction-ratio-sk_symbols-u_symbols">Per-partition compaction ratio (sk_symbols / u_symbols)</h4>
|
||
<p>The ratio sk_symbols / u_symbols measures how redundantly kmers are stored across super-kmer records: a ratio of 1.35 means that each unique kmer (counted once in unitigs) occupies 1.35 super-kmer kmer-slots on average.</p>
|
||
<table>
<thead>
<tr><th>bits</th><th>N partitions</th><th>median ratio</th><th>min ratio</th><th>min partition</th><th>min u_reads</th></tr>
</thead>
<tbody>
<tr><td>6</td><td>64</td><td>1.355</td><td>1.073</td><td>—</td><td>4.5 M</td></tr>
<tr><td>7</td><td>128</td><td>1.352</td><td>1.037</td><td>—</td><td>4.1 M</td></tr>
<tr><td>8</td><td>256</td><td><strong>1.350</strong></td><td><strong>1.012</strong></td><td><strong>145</strong></td><td><strong>3.8 M</strong></td></tr>
<tr><td>9</td><td>512</td><td>1.350</td><td>0.998</td><td>145</td><td>3.6 M</td></tr>
<tr><td>10</td><td>1024</td><td>1.351</td><td>0.992</td><td>145</td><td>3.6 M</td></tr>
</tbody>
</table>
<p>The median stabilises at <strong>1.35</strong> from 64 partitions onward (stdev = 0.027 at 256 partitions). There is one persistent outlier: <strong>partition 145</strong> (at 256-partition resolution) is anomalous at every partition depth. It contains 10–14× more super-kmers and unitigs than the average partition, and its ratio is close to 1.0, meaning the unitig representation provides almost no kmer deduplication there. This is consistent with a highly repetitive or organellar region where the dominant minimiser belongs to a sequence that appears in many reads without forming long overlapping paths in the De Bruijn graph.</p>
<p>Per-partition parameters at 256 partitions (<em>B. nana</em>):</p>
<table>
<thead>
<tr><th>quantity</th><th>value</th></tr>
</thead>
<tbody>
<tr><td>P (unique kmers/partition, avg)</td><td>≈ 10.4 M</td></tr>
<tr><td>U (unitigs/partition, avg)</td><td>≈ 275 k</td></tr>
<tr><td>m_u</td><td>≈ 37.9</td></tr>
<tr><td>Strategy A bits/kmer</td><td>⌈log₂(P·(1+30/m_u))⌉ = 25</td></tr>
<tr><td>Strategy B bits/kmer</td><td>⌈log₂(U)⌉ + 8 = 27</td></tr>
</tbody>
</table>
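<p>The two bits/kmer rows can be reproduced directly from P, U and m_u. This is a minimal sketch under stated assumptions: Strategy A is read as a global nucleotide offset into the partition's concatenated unitigs (P·(1+30/m_u) positions for k=31), Strategy B as a unitig id plus an 8-bit rank, and the function names are hypothetical.</p>

```python
import math

P = 10.4e6    # unique kmers per partition (average, from the table)
U = 275e3     # unitigs per partition (average, from the table)
M_U = 37.9    # m_u, from the table
K = 31


def strategy_a_bits(p, m_u, k=K):
    """Bits to address one of the P * (1 + (k-1)/m_u) nucleotide positions."""
    return math.ceil(math.log2(p * (1 + (k - 1) / m_u)))


def strategy_b_bits(u, rank_bits=8):
    """Bits for a unitig id (ceil(log2 U)) plus a fixed-width rank."""
    return math.ceil(math.log2(u)) + rank_bits


bits_a = strategy_a_bits(P, M_U)
bits_b = strategy_b_bits(U)
print(f"Strategy A: {bits_a} bits/kmer, Strategy B: {bits_b} bits/kmer")
```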
<p>Consequence: <strong>the partition count should be as large as memory and parallelism allow.</strong> Each doubling of the partition count halves P and therefore saves 1 bit/kmer in evidence (log₂ P decreases by 1). The sequence term 2·(1 + 30/m_u) ≈ 3.6 bits/kmer is approximately constant.</p>
<p>Strategy B only shifts the evidence cost relative to P: <code>log₂(U) = log₂(P/m_u)</code> is smaller than <code>log₂(P)</code> by a fixed log₂(m_u) ≈ 5 bits, but it shrinks at the same rate as the partition count grows. Strategy B's main benefit remains locality and bounded rank width, not asymptotic compression.</p>
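<p>To see how both strategies scale with partition depth, the sketch below sweeps the number of partition bits. It rests on assumptions not stated in the measurements: the total kmer count is taken as 256 × 10.4 M, kmers are spread evenly over partitions, and m_u is held at 37.9. The name <code>evidence_bits</code> is hypothetical.</p>

```python
import math

TOTAL_KMERS = 10.4e6 * 256   # assumption: average P at 256 partitions, times 256
M_U = 37.9                   # treated as constant across partition depths
K = 31
RANK_BITS = 8


def evidence_bits(partition_bits):
    """Return (strategy A, strategy B) bits/kmer for 2**partition_bits partitions."""
    p = TOTAL_KMERS / 2**partition_bits   # unique kmers per partition
    u = p / M_U                           # unitigs per partition
    a = math.ceil(math.log2(p * (1 + (K - 1) / M_U)))
    b = math.ceil(math.log2(u)) + RANK_BITS
    return a, b


for bits in range(6, 11):
    a, b = evidence_bits(bits)
    print(f"{bits:2d} partition bits: A={a} bits/kmer, B={b} bits/kmer")
```

<p>Under these assumptions both columns lose one bit per doubling, and Strategy B stays a constant two bits above Strategy A.</p>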
<hr/>
<h2 id="implementation-notes">Implementation notes</h2>
<h3 id="evidence-file-layout-strategy-b">Evidence file layout (strategy B)</h3>
<div class="highlight"><pre><span></span><code>evidence.bin
├── header    : k (u8), n_kmers (u64), n_unitigs (u64)
├── id_array  : n_kmers × ⌈log₂ n_unitigs⌉ bits — MPHF slot → unitig_id
└── rank_array: n_kmers × 8 bits (u8[n_kmers]) — MPHF slot → rank within unitig
</code></pre></div>
<p><code>id_array</code> is a compact bit-packed vector (width = ⌈log₂ n_unitigs⌉; 19 bits for <em>B. nana</em> at 256 partitions). <code>rank_array</code> is a plain <code>u8</code> array, so no bit-packing is needed. Access is O(1): one multiplication to find the bit offset, then a shift and mask for <code>id_array</code>, and a direct byte index for <code>rank_array</code>.</p>
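<p>A fixed-width bit-packed vector of this kind can be sketched as follows. The layout chosen here (fields packed LSB-first into a little-endian byte string) is an assumption for illustration, not the documented on-disk format, and <code>pack_ids</code> and <code>get_id</code> are hypothetical names.</p>

```python
def pack_ids(ids, width):
    """Pack non-negative integers into bytes, `width` bits each, LSB-first."""
    acc = 0
    for slot, unitig_id in enumerate(ids):
        if unitig_id >= 2**width:
            raise ValueError(f"id {unitig_id} does not fit in {width} bits")
        acc |= unitig_id << (slot * width)
    return acc.to_bytes((len(ids) * width + 7) // 8, "little")


def get_id(packed, slot, width):
    """Read the `width`-bit field at index `slot`: multiply, shift, mask."""
    bit_offset = slot * width                     # the single multiplication
    first_byte, shift = divmod(bit_offset, 8)
    last_byte = (bit_offset + width + 7) // 8     # exclusive end of the field
    word = int.from_bytes(packed[first_byte:last_byte], "little")
    return (word >> shift) & (2**width - 1)       # shift, then mask


ids = [0, 1, 274405, 524287, 12345]
packed = pack_ids(ids, 19)                        # 19 bits, as for B. nana
print([get_id(packed, s, 19) for s in range(len(ids))])
```

<p>Five 19-bit ids occupy 12 bytes here, against 20 bytes as plain 32-bit integers.</p>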
<h3 id="unitig-file-layout">Unitig file layout</h3>
<p>Each unitig is a FASTA record whose header holds an xxHash-64 ID followed by a JSON annotation (seq_length, kmer_size, n_kmers). The nucleotide sequence is stored in ASCII uppercase; a 2-bit packed version is derived at query time or stored as a parallel <code>.2bit</code> file for speed.</p>
<div class="highlight"><pre><span></span><code>>c4a1e7f2 {"seq_length":87,"kmer_size":31,"n_kmers":57}
ACGTGGCTA...
</code></pre></div>
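<p>Reading such a header takes a split and a JSON parse. The sketch below assumes the ID and the JSON annotation are separated by a single space, as in the example record; the function name <code>parse_unitig_header</code> and the consistency check (a sequence of L nucleotides holds L − k + 1 kmers) are additions for illustration.</p>

```python
import json


def parse_unitig_header(line):
    """Split a FASTA header line into (unitig id, annotation dict)."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    unitig_id, _, annotation = line[1:].strip().partition(" ")
    meta = json.loads(annotation)
    # A sequence of L nucleotides contains L - k + 1 kmers.
    expected = meta["seq_length"] - meta["kmer_size"] + 1
    if meta["n_kmers"] != expected:
        raise ValueError(f"n_kmers={meta['n_kmers']}, expected {expected}")
    return unitig_id, meta


header = '>c4a1e7f2 {"seq_length":87,"kmer_size":31,"n_kmers":57}'
unitig_id, meta = parse_unitig_header(header)
print(unitig_id, meta["n_kmers"])
```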
<h3 id="decoding-a-kmer-from-slot-s">Decoding a kmer from slot s</h3>
<div class="highlight"><pre><span></span><code>unitig_id = id_array[s]
rank      = rank_array[s]
kmer      = nucleotides(unitig_id)[rank .. rank + k]   // 2-bit packed slice
</code></pre></div>
<p>Decoding takes one array lookup per field, followed by a packed slice extraction. The stored sequence is the canonical kmer by construction: only canonical kmers are inserted into the graph.</p>
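<p>The lookup can be made concrete with a small Python sketch. Everything beyond the three steps of the pseudocode is an assumption: the 2-bit encoding (A=0, C=1, G=2, T=3), packing with the first base in the most significant bits, plain lists standing in for <code>id_array</code> and <code>rank_array</code>, and the helper names.</p>

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit encoding
BASES = "ACGT"


def pack2(seq):
    """Pack a nucleotide string into (integer, length), first base in the high bits."""
    value = 0
    for base in seq:
        value = value * 4 + CODE[base]
    return value, len(seq)


def slice_kmer(packed, length, rank, k):
    """Return the kmer starting at position `rank` as a 2k-bit integer."""
    shift = 2 * (length - rank - k)        # bits to drop after the kmer
    return (packed >> shift) & (4**k - 1)


def unpack2(value, k):
    """Turn a 2-bit packed kmer back into a string."""
    return "".join(BASES[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))


def decode_slot(s, id_array, rank_array, unitigs, k):
    """One lookup per field, then a packed slice extraction."""
    packed, length = unitigs[id_array[s]]
    return unpack2(slice_kmer(packed, length, rank_array[s], k), k)


unitigs = [pack2("ACGTGGCTA"), pack2("TTGACCA")]
id_array = [1, 0, 0]       # MPHF slot to unitig_id
rank_array = [2, 0, 5]     # MPHF slot to rank within unitig
print([decode_slot(s, id_array, rank_array, unitigs, 4) for s in range(3)])
```

<p>A real implementation would slice a packed buffer rather than a Python integer; the arithmetic on <code>rank</code> and <code>k</code> is the same.</p>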
<h3 id="forward-vs-reverse-complement">Forward vs reverse complement</h3>
<p>The De Bruijn graph stores only canonical kmers, and the evidence encodes the canonical orientation. Callers that need the strand of the original kmer must compare the retrieved kmer with its revcomp at query time. For k ≤ 32 a 2-bit packed kmer fits in one 64-bit word, so this is a single 64-bit comparison.</p>
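<p>A strand check of this kind could look like the sketch below. It assumes the 2-bit encoding A=0, C=1, G=2, T=3 with the first base in the high bits, under which complementing a base is <code>3 - code</code>; the function names are hypothetical, and Python integers stand in for 64-bit words.</p>

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit encoding


def pack_kmer(seq):
    """Pack a kmer string into an integer, first base in the high bits."""
    value = 0
    for base in seq:
        value = value * 4 + CODE[base]
    return value


def revcomp(kmer, k):
    """Reverse complement of a 2-bit packed kmer of length k."""
    rc = 0
    for _ in range(k):
        rc = rc * 4 + (3 - (kmer & 3))    # complement the current last base
        kmer >>= 2
    return rc


def strand_of(query, stored, k):
    """Return '+' if the query equals the stored kmer, '-' if it is its revcomp."""
    if query == stored:
        return "+"
    if query == revcomp(stored, k):
        return "-"
    raise ValueError("query does not correspond to the stored kmer")


stored = pack_kmer("AAAC")
print(strand_of(pack_kmer("GTTT"), stored, 4))
```

<p>Palindromic kmers such as ACGT equal their own reverse complement and are reported as <code>+</code> by this sketch.</p>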
<hr/>
<h2 id="open-questions">Open questions</h2>
<ul>
<li><strong>Rank field width</strong>: u8 covers 255 kmers; storing lengths and ranks in kmer units (not nucleotides) buys k−1 extra units of headroom at no cost. On <em>B. nana</em> (k=31), m_u ≈ 38, well within u8 range on average, but the maximum unitig length has not been measured yet. For genomes with very long unitigs, u16 may be needed; the header could record the actual width if portability is required.</li>
<li><strong>Packed nucleotide cache</strong>: storing a 2-bit packed nucleotide array alongside the FASTA avoids re-encoding at query time; negligible space overhead (<span class="arithmatex">\(N_{nuc} / 4\)</span> bytes per partition).</li>
<li><strong>Cross-partition evidence</strong>: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. marking entire unitigs as present or absent) rather than kmer-level ones, potentially reducing the operation cost by a factor of m_u.</li>
</ul>
</article>
</div>
<script>var target=document.getElementById(location.hash.slice(1));target&&target.name&&(target.checked=target.name.startsWith("__tabbed_"))</script>
</div>
</main>
<footer class="md-footer">
<div class="md-footer-meta md-typeset">
<div class="md-footer-meta__inner md-grid">
<div class="md-copyright">
Made with
<a href="https://squidfunk.github.io/mkdocs-material/" rel="noopener" target="_blank">
Material for MkDocs
</a>
</div>
</div>
</div>
</footer>
</div>
<div class="md-dialog" data-md-component="dialog">
<div class="md-dialog__inner md-typeset"></div>
</div>
<script id="__config" type="application/json">{"annotate": null, "base": "../..", "features": [], "search": "../../assets/javascripts/workers/search.2c215733.min.js", "tags": null, "translations": {"clipboard.copied": "Copied to clipboard", "clipboard.copy": "Copy to clipboard", "search.result.more.one": "1 more on this page", "search.result.more.other": "# more on this page", "search.result.none": "No matching documents", "search.result.one": "1 matching document", "search.result.other": "# matching documents", "search.result.placeholder": "Type to start searching", "search.result.term.missing": "Missing", "select.version": "Select version"}, "version": null}</script>
<script src="../../assets/javascripts/bundle.79ae519e.min.js"></script>
<script src="https://unpkg.com/mathjax@3/es5/tex-mml-chtml.js"></script>
</body>
</html>