mirror of https://github.com/metabarcoding/obitools4.git synced 2026-04-30 12:00:39 +00:00

Files

T

Eric Coissac 8c7017a99d ⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)

2026-04-13 13:34:53 +02:00

5.6 KiB

Raw Blame History

Semantic Description of `obikmer` Package

The obikmer package provides high-performance, disk-backed utilities for k-mer manipulation and comparison in biological sequences. Designed for scalability (e.g., metagenomics, NGS read processing), it supports canonical encoding, minimizer-based partitioning, streaming I/O formats (.kdi, .skm), entropy filtering, and scalable set operations — all while minimizing allocations.

Core Encoding & Canonicalization

EncodeKmer, DecodeKmer: Encodes/decodes DNA sequences to/from compact 62-bit uint64s (2 bits/base), preserving top 2 bits for error metadata.
EncodeCanonicalKmer, CanonicalKmer: Normalizes k-mers to their biological canonical form — the lexicographically smaller of a k-mer and its reverse complement.
IterCanonicalKmers, IterCanonicalKmersWithErrors: Memory-efficient streaming of canonical k-mers from sequences; optionally tags ambiguous bases in top 2 bits.

Minimizer-Based Partitioning

DefaultMinimizerSize(k), ValidateMinimizerSize(m, k, nworkers): Computes and validates minimizer size m for parallelization (e.g., ceil(k / 2.5)).
ExtractSuperKmers, IterSuperKmers(seq, k, m): Extracts super-k-mers — maximal contiguous regions where all embedded k-mers share the same minimizer. Uses monotone deque for O(n) time.

I/O Formats & Streaming

.kdi (K-Disk Index): Compact binary format for sorted uint64 k-mers using delta-varint encoding. Includes optional .kdx sparse index for fast SeekTo(target).
- APIs: NewKdiWriter, NewKdiReader, .Next() → (kmer, ok).
.skm: Binary storage for super-k-mers, with 2-bit nucleotide packing (4× compression vs ASCII).
.kdx: Sparse index for .kdi, storing (kmer, byteOffset) every stride entries (e.g., 4096), enabling O(log M) seeks.

K-Way Merge & Deduplication

KWayMerge([]*KdiReader): Merges sorted .kdi streams, aggregating k-mer counts across inputs.
- Uses min-heap for O(log k) per-output operations; supports streaming and deduplication.
- Ideal for combining k-mer sets across samples or batches.

Entropy Filtering & Complexity Detection

KmerEntropy(kmer, k, levelMax): Computes minimum normalized Shannon entropy across sub-word sizes (1 to levelMax) using circular canonical normalization.
- Values near 0 indicate repeats (e.g., homopolymers); ~1 indicates high complexity.
KmerEntropyFilter: Precomputed filter for batch processing (no allocations), with Accept(kmer) and fast entropy lookup.

K-mer Set Management (`KmerSetGroup`)

A KmerSetGroup represents N disjoint, sorted k-mer sets (e.g., per sample), persisted on disk.

Lifecycle & Construction

NewKmerSetGroupBuilder(...), AppendKmerSetGroupBuilder(dir): Builds or extends groups via:
- AddSequence(setID, bioseq): Extracts canonical k-mers (with optional filtering).
- Supports WithMinFrequency, WithEntropyFilter, and top-N tracking.
Close(): Finalizes .kdis, spectrum.bin, and optional top_kmers.csv.
OpenKmerSetGroup(dir): Loads existing group in read-only mode.

Access & Metadata

K(), M(), Partitions(), attributes via GetStringAttribute(key).
Contains(setID, kmer): Parallel membership check across partitions.
Iterator(setID): Yields sorted k-mers via k-way merge.

Set Algebra & Similarity

Set Operations: Union(), Intersect(), Difference(), QuorumAtLeast(q) (≥ q sets), etc.
Pairwise Group Ops: UnionWith(other), IntersectWith(other) (per-set, compatible groups only).
Similarity Metrics:
JaccardDistanceMatrix() = 1 − |A ∩ B| / |A ∪ B|
JaccardSimilarityMatrix() = |A ∩ B| / |A ∪ B|

Utilities

CopySetsByIDTo(ids, destDir), RemoveSetByID(id), MatchSetIDs(patterns)
IsCompatibleWith(other): Validates (k, m, partitions).

K-mer Indexing & Matching (`KmerMap`)

Generic hash map associating canonical k-mers to sequences containing them.

Push(sequences): Builds index (optionally with maxocc limit).
Query(querySeq) → KmerMatch: Returns sequences sharing k-mers, with match counts.
Supports sparse mode (SparseAt ≥ 0): Ignores central base (e.g., for ambiguous-position matching).
Result utilities: FilterMinCount, .Max(), .Sequences().

K-mer Spectrum Analysis

SpectrumEntry{Frequency, Count}, KmerSpectrum: Sorted frequency distribution.
MapToSpectrum(), MergeTopN(), binary/CSV I/O (WriteSpectrum, ReadSpectrum).
Top-N collector via min-heap for streaming frequency tracking.

Utility & Helpers

HammingDistance(a, b): Bitwise distance between encoded k-mers.
Varint encoding/decoding (EncodeVarint, DecodeVarint): 7-bit-per-byte compression for I/O.
Reverse complement: Constant-time via lookup tables (revcompnuc, kmermask).

Design Principles

Zero-allocation where possible (buffer reuse, iterators).
Streaming-first: Avoids loading large datasets into memory.
Disk-backed persistence for reproducibility and scalability.
Canonicalization & symmetry: Strand-aware (reverse complement) or circular normalization for robustness.

Use Cases

Metagenomic read clustering & error correction
Minimizer-based sketching (e.g., Mash/Sourmash analogs)
Scalable Jaccard-based similarity matrices across thousands of samples
Low-complexity region detection via entropy filtering

All operations are tested, benchmarked, and optimized for high-throughput genomic workflows.

5.6 KiB Raw Blame History Unescape Escape

Semantic Description of obikmer Package