mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
8c7017a99d
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
5.6 KiB
5.6 KiB
Semantic Description of obikmer Package
The obikmer package provides high-performance, disk-backed utilities for k-mer manipulation and comparison in biological sequences. Designed for scalability (e.g., metagenomics, NGS read processing), it supports canonical encoding, minimizer-based partitioning, streaming I/O formats (.kdi, .skm), entropy filtering, and scalable set operations — all while minimizing allocations.
Core Encoding & Canonicalization
EncodeKmer,DecodeKmer: Encodes/decodes DNA sequences to/from compact 62-bituint64s (2 bits/base), preserving top 2 bits for error metadata.EncodeCanonicalKmer,CanonicalKmer: Normalizes k-mers to their biological canonical form — the lexicographically smaller of a k-mer and its reverse complement.IterCanonicalKmers,IterCanonicalKmersWithErrors: Memory-efficient streaming of canonical k-mers from sequences; optionally tags ambiguous bases in top 2 bits.
Minimizer-Based Partitioning
DefaultMinimizerSize(k),ValidateMinimizerSize(m, k, nworkers): Computes and validates minimizer sizemfor parallelization (e.g.,ceil(k / 2.5)).ExtractSuperKmers,IterSuperKmers(seq, k, m): Extracts super-k-mers — maximal contiguous regions where all embeddedk-mers share the same minimizer. Uses monotone deque for O(n) time.
I/O Formats & Streaming
.kdi(K-Disk Index): Compact binary format for sorteduint64k-mers using delta-varint encoding. Includes optional.kdxsparse index for fastSeekTo(target).- APIs:
NewKdiWriter,NewKdiReader,.Next() → (kmer, ok).
- APIs:
.skm: Binary storage for super-k-mers, with 2-bit nucleotide packing (4× compression vs ASCII)..kdx: Sparse index for.kdi, storing(kmer, byteOffset)every stride entries (e.g., 4096), enabling O(log M) seeks.
K-Way Merge & Deduplication
KWayMerge([]*KdiReader): Merges sorted.kdistreams, aggregating k-mer counts across inputs.- Uses min-heap for O(log k) per-output operations; supports streaming and deduplication.
- Ideal for combining k-mer sets across samples or batches.
Entropy Filtering & Complexity Detection
KmerEntropy(kmer, k, levelMax): Computes minimum normalized Shannon entropy across sub-word sizes (1 tolevelMax) using circular canonical normalization.- Values near 0 indicate repeats (e.g., homopolymers); ~1 indicates high complexity.
KmerEntropyFilter: Precomputed filter for batch processing (no allocations), withAccept(kmer)and fast entropy lookup.
K-mer Set Management (KmerSetGroup)
A KmerSetGroup represents N disjoint, sorted k-mer sets (e.g., per sample), persisted on disk.
Lifecycle & Construction
NewKmerSetGroupBuilder(...),AppendKmerSetGroupBuilder(dir): Builds or extends groups via:AddSequence(setID, bioseq): Extracts canonical k-mers (with optional filtering).- Supports
WithMinFrequency,WithEntropyFilter, and top-N tracking.
Close(): Finalizes.kdis,spectrum.bin, and optionaltop_kmers.csv.OpenKmerSetGroup(dir): Loads existing group in read-only mode.
Access & Metadata
K(),M(),Partitions(), attributes viaGetStringAttribute(key).Contains(setID, kmer): Parallel membership check across partitions.Iterator(setID): Yields sorted k-mers via k-way merge.
Set Algebra & Similarity
- Set Operations:
Union(),Intersect(),Difference(),QuorumAtLeast(q)(≥ q sets), etc. - Pairwise Group Ops:
UnionWith(other),IntersectWith(other)(per-set, compatible groups only). - Similarity Metrics:
JaccardDistanceMatrix()= 1 − |A ∩ B| / |A ∪ B|
JaccardSimilarityMatrix()= |A ∩ B| / |A ∪ B|
Utilities
CopySetsByIDTo(ids, destDir),RemoveSetByID(id),MatchSetIDs(patterns)IsCompatibleWith(other): Validates(k, m, partitions).
K-mer Indexing & Matching (KmerMap)
Generic hash map associating canonical k-mers to sequences containing them.
Push(sequences): Builds index (optionally withmaxocclimit).Query(querySeq) → KmerMatch: Returns sequences sharing k-mers, with match counts.- Supports sparse mode (
SparseAt ≥ 0): Ignores central base (e.g., for ambiguous-position matching). - Result utilities:
FilterMinCount,.Max(),.Sequences().
K-mer Spectrum Analysis
SpectrumEntry{Frequency, Count},KmerSpectrum: Sorted frequency distribution.MapToSpectrum(),MergeTopN(), binary/CSV I/O (WriteSpectrum,ReadSpectrum).- Top-N collector via min-heap for streaming frequency tracking.
Utility & Helpers
HammingDistance(a, b): Bitwise distance between encoded k-mers.- Varint encoding/decoding (
EncodeVarint,DecodeVarint): 7-bit-per-byte compression for I/O. - Reverse complement: Constant-time via lookup tables (
revcompnuc,kmermask).
Design Principles
- Zero-allocation where possible (buffer reuse, iterators).
- Streaming-first: Avoids loading large datasets into memory.
- Disk-backed persistence for reproducibility and scalability.
- Canonicalization & symmetry: Strand-aware (reverse complement) or circular normalization for robustness.
Use Cases
- Metagenomic read clustering & error correction
- Minimizer-based sketching (e.g., Mash/Sourmash analogs)
- Scalable Jaccard-based similarity matrices across thousands of samples
- Low-complexity region detection via entropy filtering
All operations are tested, benchmarked, and optimized for high-throughput genomic workflows.