obitools4/autodoc/docmd/pkg_obikmer.md

Semantic Description of obikmer Package

The obikmer package provides high-performance, disk-backed utilities for k-mer manipulation and comparison in biological sequences. Designed for scalability (e.g., metagenomics, NGS read processing), it supports canonical encoding, minimizer-based partitioning, streaming I/O formats (.kdi, .skm), entropy filtering, and scalable set operations — all while minimizing allocations.


Core Encoding & Canonicalization

  • EncodeKmer, DecodeKmer: Encode and decode DNA k-mers to and from compact uint64 values (2 bits per base, so at most 62 bits for k ≤ 31), reserving the top 2 bits for error metadata.
  • EncodeCanonicalKmer, CanonicalKmer: Normalizes k-mers to their biological canonical form — the lexicographically smaller of a k-mer and its reverse complement.
  • IterCanonicalKmers, IterCanonicalKmersWithErrors: Memory-efficient streaming of canonical k-mers from sequences; optionally tags ambiguous bases in top 2 bits.
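
The 2-bit encoding and canonical form described above can be sketched as follows. This is a minimal illustration, not the obikmer implementation: the helper names (encodeKmer, decodeKmer, reverseComplement, canonical) and the base-to-code mapping A=0, C=1, G=2, T=3 are assumptions made for the example. With that mapping, numeric order on codes matches lexicographic order on same-length k-mers, so the canonical form is simply the smaller of the two codes.

```go
package main

import "fmt"

// encodeKmer packs an ACGT string of at most 31 bases into a uint64,
// 2 bits per base (assumed mapping: A=0, C=1, G=2, T=3).
// Hypothetical helper, not the obikmer API.
func encodeKmer(s string) (uint64, bool) {
	if len(s) > 31 {
		return 0, false
	}
	var code uint64
	for i := 0; i < len(s); i++ {
		var b uint64
		switch s[i] {
		case 'A', 'a':
			b = 0
		case 'C', 'c':
			b = 1
		case 'G', 'g':
			b = 2
		case 'T', 't':
			b = 3
		default:
			return 0, false // ambiguous or invalid base
		}
		code = code<<2 | b
	}
	return code, true
}

// decodeKmer is the inverse of encodeKmer for a k-mer of length k.
func decodeKmer(code uint64, k int) string {
	buf := make([]byte, k)
	for i := k - 1; i >= 0; i-- {
		buf[i] = "ACGT"[code&3]
		code >>= 2
	}
	return string(buf)
}

// reverseComplement reads bases from the low end and complements each one.
// With the assumed mapping, the complement of b is b XOR 3.
func reverseComplement(code uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | ((code & 3) ^ 3)
		code >>= 2
	}
	return rc
}

// canonical returns the smaller of a k-mer code and its reverse complement.
func canonical(code uint64, k int) uint64 {
	if rc := reverseComplement(code, k); rc < code {
		return rc
	}
	return code
}

func main() {
	code, ok := encodeKmer("GGT")
	if !ok {
		panic("invalid k-mer")
	}
	fmt.Println(decodeKmer(canonical(code, 3), 3))
}
```

In this sketch, GGT and its reverse complement ACC both normalize to ACC, which is what makes the representation strand-independent.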

Minimizer-Based Partitioning

  • DefaultMinimizerSize(k), ValidateMinimizerSize(m, k, nworkers): Computes and validates minimizer size m for parallelization (e.g., ceil(k / 2.5)).
  • ExtractSuperKmers, IterSuperKmers(seq, k, m): Extracts super-k-mers — maximal contiguous regions where all embedded k-mers share the same minimizer. Uses monotone deque for O(n) time.
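
To make the monotone-deque idea concrete, here is a hedged sketch of super-k-mer extraction. It is simplified relative to what the text describes: m-mers are encoded on the forward strand only (no canonicalization), the input is assumed to contain only A, C, G, T, and all function names (mmerCodes, windowMinima, superKmers) are hypothetical rather than taken from obikmer.

```go
package main

import "fmt"

var baseCode = map[byte]uint64{'A': 0, 'C': 1, 'G': 2, 'T': 3}

// mmerCodes returns the 2-bit code of every m-mer of seq (forward strand only).
func mmerCodes(seq string, m int) []uint64 {
	if m <= 0 || m > 31 || len(seq) < m {
		return nil
	}
	mask := uint64(1)<<(2*uint(m)) - 1
	codes := make([]uint64, 0, len(seq)-m+1)
	var cur uint64
	for i := 0; i < len(seq); i++ {
		cur = (cur<<2 | baseCode[seq[i]]) & mask
		if i >= m-1 {
			codes = append(codes, cur)
		}
	}
	return codes
}

// windowMinima returns the minimum of every window of w consecutive values.
// The deque holds indices whose values increase from front to back; each
// index is pushed and popped at most once, giving O(n) total time.
func windowMinima(vals []uint64, w int) []uint64 {
	if w <= 0 || len(vals) < w {
		return nil
	}
	out := make([]uint64, 0, len(vals)-w+1)
	deque := make([]int, 0, w)
	for i, v := range vals {
		for len(deque) > 0 && vals[deque[len(deque)-1]] >= v {
			deque = deque[:len(deque)-1] // drop values that can never be a minimum again
		}
		deque = append(deque, i)
		if deque[0] <= i-w {
			deque = deque[1:] // front index fell out of the window
		}
		if i >= w-1 {
			out = append(out, vals[deque[0]])
		}
	}
	return out
}

// superKmers splits seq into maximal runs of consecutive k-mers that share
// the same minimizer value. Each k-mer contains k-m+1 m-mers.
func superKmers(seq string, k, m int) []string {
	if m > k {
		return nil
	}
	mins := windowMinima(mmerCodes(seq, m), k-m+1)
	var out []string
	start := 0
	for i := 1; i <= len(mins); i++ {
		if i == len(mins) || mins[i] != mins[start] {
			out = append(out, seq[start:i-1+k]) // k-mers start..i-1
			start = i
		}
	}
	return out
}

func main() {
	fmt.Println(superKmers("ACGTACGT", 4, 2))
}
```

For ACGTACGT with k=4 and m=2, the five k-mers have minimizers AC, CG, AC, AC, AC, so the sketch yields three super-k-mers: ACGT, CGTA and GTACGT.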

I/O Formats & Streaming

  • .kdi (K-Disk Index): Compact binary format for sorted uint64 k-mers using delta-varint encoding. Includes optional .kdx sparse index for fast SeekTo(target).
    • APIs: NewKdiWriter, NewKdiReader, .Next() → (kmer, ok).
  • .skm: Binary storage for super-k-mers, with 2-bit nucleotide packing (4× compression vs ASCII).
  • .kdx: Sparse index for .kdi, storing a (kmer, byteOffset) pair every stride entries (e.g., 4096), enabling O(log M) seeks by binary search over the M index entries.
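
The delta-varint idea behind .kdi can be illustrated with Go's standard encoding/binary package. The sketch below is not the actual .kdi layout (no header, no .kdx index, and the function names are hypothetical); it only shows why storing the gaps between sorted values as varints is compact.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeDeltaVarint writes each value of a sorted slice as the varint-encoded
// gap to its predecessor (the first value is stored as a gap from 0).
func writeDeltaVarint(w io.Writer, sorted []uint64) error {
	var buf [binary.MaxVarintLen64]byte
	var prev uint64
	for _, v := range sorted {
		n := binary.PutUvarint(buf[:], v-prev)
		if _, err := w.Write(buf[:n]); err != nil {
			return err
		}
		prev = v
	}
	return nil
}

// readDeltaVarint decodes gaps until EOF and rebuilds the sorted values.
func readDeltaVarint(r io.ByteReader) ([]uint64, error) {
	var out []uint64
	var prev uint64
	for {
		gap, err := binary.ReadUvarint(r)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		prev += gap
		out = append(out, prev)
	}
}

// roundTrip encodes then decodes, returning the decoded values and the
// encoded size in bytes.
func roundTrip(sorted []uint64) ([]uint64, int, error) {
	var buf bytes.Buffer
	if err := writeDeltaVarint(&buf, sorted); err != nil {
		return nil, 0, err
	}
	size := buf.Len()
	out, err := readDeltaVarint(&buf)
	return out, size, err
}

func main() {
	out, size, err := roundTrip([]uint64{5, 7, 300, 300000})
	if err != nil {
		panic(err)
	}
	fmt.Println(out, size)
}
```

In this example the four values occupy 7 bytes instead of the 32 bytes that fixed-width uint64 storage would need, because small gaps fit in one or two varint bytes.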

K-Way Merge & Deduplication

  • KWayMerge([]*KdiReader): Merges sorted .kdi streams, aggregating k-mer counts across inputs.
    • Uses a min-heap, costing O(log n) per output k-mer for n input streams (n is the number of readers, not the k-mer size); supports streaming and deduplication.
    • Ideal for combining k-mer sets across samples or batches.
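
A heap-based merge of this kind can be sketched with the standard container/heap package. In-memory sorted slices stand in for the .kdi readers here, and the types and function names (cursor, merged, kWayMerge) are assumptions for the illustration, not the package's API.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor holds the remaining sorted values of one input stream.
type cursor struct{ vals []uint64 }

// minHeap orders cursors by their current head value.
type minHeap []cursor

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].vals[0] < h[j].vals[0] }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(cursor)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// merged is one distinct k-mer with its number of occurrences across inputs.
type merged struct {
	Kmer  uint64
	Count int
}

// kWayMerge merges sorted inputs into one sorted, deduplicated stream.
// If every input is itself duplicate-free, Count is the number of inputs
// that contain the k-mer.
func kWayMerge(inputs [][]uint64) []merged {
	h := &minHeap{}
	for _, in := range inputs {
		if len(in) > 0 {
			*h = append(*h, cursor{in})
		}
	}
	heap.Init(h)
	var out []merged
	for h.Len() > 0 {
		top := &(*h)[0]
		v := top.vals[0]
		if n := len(out); n > 0 && out[n-1].Kmer == v {
			out[n-1].Count++
		} else {
			out = append(out, merged{v, 1})
		}
		top.vals = top.vals[1:]
		if len(top.vals) == 0 {
			heap.Pop(h) // this input is exhausted
		} else {
			heap.Fix(h, 0) // restore heap order after advancing the head
		}
	}
	return out
}

func main() {
	fmt.Println(kWayMerge([][]uint64{{1, 4, 9}, {2, 4}, {4, 9, 10}}))
}
```

Only one head value per input is held in the heap at a time, which is what lets the same pattern work on streams far larger than memory.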

Entropy Filtering & Complexity Detection

  • KmerEntropy(kmer, k, levelMax): Computes minimum normalized Shannon entropy across sub-word sizes (1 to levelMax) using circular canonical normalization.
    • Values near 0 indicate repeats (e.g., homopolymers); ~1 indicates high complexity.
  • KmerEntropyFilter: Precomputed filter for batch processing (no allocations), with Accept(kmer) and fast entropy lookup.
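
As a rough illustration of the entropy measure, the sketch below computes a normalized Shannon entropy for each sub-word size and keeps the minimum. It is deliberately simplified: it works on strings rather than encoded k-mers, it omits the circular canonical normalization mentioned above, and the normalization by log2(min(4^w, number of words)) is an assumption for this example rather than the package's exact formula.

```go
package main

import (
	"fmt"
	"math"
)

// wordEntropy returns the Shannon entropy of the overlapping words of size w
// in seq, divided by the largest entropy achievable for that word size and
// sequence length (assumed normalization).
func wordEntropy(seq string, w int) float64 {
	n := len(seq) - w + 1 // number of overlapping words
	if w <= 0 || n <= 1 {
		return 1 // too short to say anything; treat as uninformative
	}
	counts := make(map[string]int)
	for i := 0; i < n; i++ {
		counts[seq[i:i+w]]++
	}
	h := 0.0
	for _, c := range counts {
		p := float64(c) / float64(n)
		h -= p * math.Log2(p)
	}
	maxWords := math.Min(math.Pow(4, float64(w)), float64(n))
	return h / math.Log2(maxWords)
}

// minEntropy returns the minimum normalized entropy over word sizes 1..levelMax.
func minEntropy(seq string, levelMax int) float64 {
	best := 1.0
	for w := 1; w <= levelMax && w < len(seq); w++ {
		if e := wordEntropy(seq, w); e < best {
			best = e
		}
	}
	return best
}

func main() {
	for _, s := range []string{"AAAAAAAA", "ACACACAC", "ACGT"} {
		fmt.Printf("%s %.3f\n", s, minEntropy(s, 2))
	}
}
```

Taking the minimum over several word sizes is what catches dinucleotide or trinucleotide repeats such as ACACACAC, which look balanced at the single-base level but are highly repetitive at word size 2.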

K-mer Set Management (KmerSetGroup)

A KmerSetGroup represents N independent, sorted k-mer sets (e.g., one per sample), persisted on disk.

Lifecycle & Construction

  • NewKmerSetGroupBuilder(...), AppendKmerSetGroupBuilder(dir): Builds or extends groups via:
    • AddSequence(setID, bioseq): Extracts canonical k-mers (with optional filtering).
    • Supports WithMinFrequency, WithEntropyFilter, and top-N tracking.
  • Close(): Finalizes the .kdi files, spectrum.bin, and the optional top_kmers.csv.
  • OpenKmerSetGroup(dir): Loads existing group in read-only mode.

Access & Metadata

  • K(), M(), Partitions(): Accessors for the group parameters; string attributes are read via GetStringAttribute(key).
  • Contains(setID, kmer): Parallel membership check across partitions.
  • Iterator(setID): Yields sorted k-mers via k-way merge.

Set Algebra & Similarity

  • Set Operations: Union(), Intersect(), Difference(), QuorumAtLeast(q) (≥ q sets), etc.
  • Pairwise Group Ops: UnionWith(other), IntersectWith(other) (per-set, compatible groups only).
  • Similarity Metrics:
    JaccardDistanceMatrix() = 1 − |A ∩ B| / |A ∪ B|
    JaccardSimilarityMatrix() = |A ∩ B| / |A ∪ B|
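
Because each set is stored sorted, the intersection size needed for the Jaccard formulas can be obtained with a single linear merge pass. The sketch below shows this on in-memory slices; the function names and the convention that two empty sets have similarity 1 are assumptions for the example, not a statement about how the package behaves.

```go
package main

import "fmt"

// jaccardSimilarity computes |A ∩ B| / |A ∪ B| for two sorted,
// duplicate-free slices using one merge pass.
func jaccardSimilarity(a, b []uint64) float64 {
	i, j, inter := 0, 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default: // present in both sets
			inter++
			i++
			j++
		}
	}
	union := len(a) + len(b) - inter
	if union == 0 {
		return 1 // assumed convention: two empty sets are identical
	}
	return float64(inter) / float64(union)
}

// jaccardDistanceMatrix fills a symmetric matrix of 1 - similarity.
func jaccardDistanceMatrix(sets [][]uint64) [][]float64 {
	n := len(sets)
	d := make([][]float64, n)
	for i := range d {
		d[i] = make([]float64, n)
	}
	for i := 0; i < n; i++ {
		for j := i + 1; j < n; j++ {
			dist := 1 - jaccardSimilarity(sets[i], sets[j])
			d[i][j], d[j][i] = dist, dist
		}
	}
	return d
}

func main() {
	sets := [][]uint64{{1, 2, 3, 4}, {3, 4, 5, 6}, {7, 8}}
	for _, row := range jaccardDistanceMatrix(sets) {
		fmt.Printf("%.3f\n", row)
	}
}
```

Only the upper triangle is computed and then mirrored, since the distance is symmetric and the diagonal is zero.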

Utilities

  • CopySetsByIDTo(ids, destDir), RemoveSetByID(id), MatchSetIDs(patterns)
  • IsCompatibleWith(other): Validates (k, m, partitions).

K-mer Indexing & Matching (KmerMap)

Generic hash map associating canonical k-mers to sequences containing them.

  • Push(sequences): Builds index (optionally with maxocc limit).
  • Query(querySeq) → KmerMatch: Returns sequences sharing k-mers, with match counts.
  • Supports sparse mode (SparseAt ≥ 0): Ignores central base (e.g., for ambiguous-position matching).
  • Result utilities: FilterMinCount, .Max(), .Sequences().
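
The indexing and query flow can be sketched with an ordinary Go map. This is an illustration only: it keys on k-mer strings instead of encoded uint64 values, it has no maxocc limit or sparse mode, and the names (kmerIndex, buildIndex, query) are hypothetical rather than the KmerMap API.

```go
package main

import "fmt"

// revComp returns the reverse complement of an ACGT string.
func revComp(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		out[len(s)-1-i] = comp[s[i]]
	}
	return string(out)
}

// canonical returns the lexicographically smaller of s and its reverse complement.
func canonical(s string) string {
	if rc := revComp(s); rc < s {
		return rc
	}
	return s
}

// kmerIndex maps each canonical k-mer to the IDs of sequences containing it.
type kmerIndex struct {
	k    int
	hits map[string][]int
}

// buildIndex indexes every k-mer of every sequence, recording each sequence
// ID at most once per k-mer.
func buildIndex(seqs []string, k int) *kmerIndex {
	idx := &kmerIndex{k: k, hits: make(map[string][]int)}
	for id, s := range seqs {
		for i := 0; i+k <= len(s); i++ {
			key := canonical(s[i : i+k])
			ids := idx.hits[key]
			// IDs arrive in increasing order, so checking the last entry
			// is enough to avoid duplicates.
			if len(ids) == 0 || ids[len(ids)-1] != id {
				idx.hits[key] = append(ids, id)
			}
		}
	}
	return idx
}

// query returns, per sequence ID, how many k-mer positions of q hit it.
func (idx *kmerIndex) query(q string) map[int]int {
	counts := make(map[int]int)
	for i := 0; i+idx.k <= len(q); i++ {
		for _, id := range idx.hits[canonical(q[i:i+idx.k])] {
			counts[id]++
		}
	}
	return counts
}

func main() {
	idx := buildIndex([]string{"ACGTAC", "TTTTTT"}, 3)
	fmt.Println(idx.query("GTACGT"))
}
```

The query GTACGT is the reverse complement of the first indexed sequence, so all four of its 3-mers match sequence 0 thanks to canonicalization, and none match sequence 1.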

K-mer Spectrum Analysis

  • SpectrumEntry{Frequency, Count}, KmerSpectrum: Sorted frequency distribution.
  • MapToSpectrum(), MergeTopN(), binary/CSV I/O (WriteSpectrum, ReadSpectrum).
  • Top-N collector via min-heap for streaming frequency tracking.
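
A frequency spectrum can be derived from a table of k-mer counts in two passes. The sketch below assumes that Frequency means "number of times a k-mer was observed" and Count means "number of distinct k-mers observed that many times"; that reading of the SpectrumEntry fields, like the lowercase names used here, is an assumption for the illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// spectrumEntry is one point of the spectrum (assumed field semantics).
type spectrumEntry struct {
	Frequency int // how many times a k-mer was observed
	Count     int // how many distinct k-mers were observed that many times
}

// mapToSpectrum turns a k-mer -> occurrence table into a spectrum sorted
// by increasing frequency.
func mapToSpectrum(counts map[uint64]int) []spectrumEntry {
	byFreq := make(map[int]int)
	for _, c := range counts {
		byFreq[c]++
	}
	out := make([]spectrumEntry, 0, len(byFreq))
	for f, n := range byFreq {
		out = append(out, spectrumEntry{Frequency: f, Count: n})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Frequency < out[j].Frequency })
	return out
}

func main() {
	counts := map[uint64]int{10: 1, 11: 1, 12: 3, 13: 1, 14: 7}
	fmt.Println(mapToSpectrum(counts))
}
```

Here three k-mers were seen once, one was seen three times and one seven times, so the spectrum has three entries. A large Count at Frequency 1 is the typical signature of sequencing errors in read data.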

Utility & Helpers

  • HammingDistance(a, b): Bitwise distance between encoded k-mers.
  • Varint encoding/decoding (EncodeVarint, DecodeVarint): 7-bit-per-byte compression for I/O.
  • Reverse complement: Constant-time via lookup tables (revcompnuc, kmermask).
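
One plausible reading of a Hamming distance on 2-bit-encoded k-mers is the number of differing nucleotide positions rather than the number of differing bits. The sketch below implements that reading with math/bits; whether the package's HammingDistance counts bases or raw bits is not stated in the text, so treat this as an assumption.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hammingDistance counts the nucleotide positions at which two 2-bit-encoded
// k-mers differ. Hypothetical helper, not the obikmer implementation.
func hammingDistance(a, b uint64) int {
	x := a ^ b // non-zero 2-bit group wherever the bases differ
	// Fold each 2-bit group onto its low bit: the low bit ends up set if
	// either bit of the group was set.
	x = (x | x>>1) & 0x5555555555555555
	return bits.OnesCount64(x)
}

func main() {
	// With A=0, C=1, G=2, T=3: ACGT = 0b00011011 and ACGA = 0b00011000.
	fmt.Println(hammingDistance(0b00011011, 0b00011000))
}
```

A plain popcount of a XOR b would report 2 for ACGT versus ACGA because T and A differ in both bits; folding each pair first makes a single base substitution count as one difference.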

Design Principles

  • Zero-allocation where possible (buffer reuse, iterators).
  • Streaming-first: Avoids loading large datasets into memory.
  • Disk-backed persistence for reproducibility and scalability.
  • Canonicalization & symmetry: Strand-aware (reverse complement) or circular normalization for robustness.

Use Cases

  • Metagenomic read clustering & error correction
  • Minimizer-based sketching (e.g., Mash/Sourmash analogs)
  • Scalable Jaccard-based similarity matrices across thousands of samples
  • Low-complexity region detection via entropy filtering

All operations are tested, benchmarked, and optimized for high-throughput genomic workflows.