Skip to content

Partitioning and indexing architecture

The canonical minimizer of a super-kmer is hashed to produce a p-bit routing value (p is a collection-level parameter):

canonical minimizer → hash(minimizer) → p-bit value → PART → partition directory

PART is computed once at phase 1 to open the correct partition file, then discarded. It is recomputed on the fly at query time. It is never stored in the super-kmer header.

Each partition holds one MPHF instance (phase 6) that indexes kmers as plain u64 values — the minimizer plays no role inside the partition.

Why hashing is necessary

The canonical minimizer is an m-mer (m ∈ {9, 11, 13, 15}), encoded in 2m bits (18 to 30 bits). Its distribution over the \(4^m\) possible values is not uniform: because the minimizer is the lexicographic minimum of a window of m-mers, small values are systematically over-represented (Golan & Shur 2025; Kille et al. 2023; Pan & Reinert 2024; Zheng et al. 2020, 2021)1 2 3 4 5. Routing directly by the raw minimizer value would produce severely unbalanced partitions.

A hash function with good avalanche properties redistributes this skewed distribution uniformly over the \(2^p\) partition slots. The key reason this works well is the entropy gap: p is chosen to be much smaller than 2m, so the hash compresses many distinct minimizer values into each partition slot. Even under strong bias in the minimizer distribution, as long as its effective entropy exceeds p bits — which holds comfortably since the set of distinct minimizers in any real dataset is far larger than \(2^p\) — the load imbalance across partitions is negligible.

Parameter choices

m 2m (bits) Typical p Partitions
9 18 6–8 64–256
11 22 8–10 256–1 024
13 26 10–12 1 024–4 096
15 30 10–14 1 024–16 384

The hard constraint is p ≤ 2m: one cannot extract more bits of uniform randomness from a source than it contains. In practice p is chosen well below 2m, leaving a large entropy margin that absorbs the minimizer bias. For k=31, m=13, p=10: 1 024 partitions with comfortable balance.


  1. Zheng, H., Kingsford, C. & Marçais, G. (2020). Improved design and analysis of practical minimizers. Bioinformatics (Oxford, England), 36, i119--i127. 

  2. Zheng, H., Kingsford, C. & Marçais, G. (2021). Sequence-specific minimizers via polar sets. Bioinformatics (Oxford, England), 37, i187--i195. 

  3. Pan, C. & Reinert, K. (2024). A simple refined DNA minimizer operator enables 2-fold faster computation. Bioinformatics (Oxford, England), 40. 

  4. Kille, B., Garrison, E., Treangen, T.J. & Phillippy, A.M. (2023). Minmers are a generalization of minimizers that enable unbiased local jaccard estimation. Bioinformatics (Oxford, England), 39. 

  5. Golan, S. & Shur, A.M. (2025). Expected density of random minimizers. In: Lecture notes in computer science, Lecture notes in computer science. Springer Nature Switzerland, Cham, pp. 347--360.