# Semantic Description of `obikmer` Package

The `obikmer` package provides high-performance, disk-backed utilities for **k-mer manipulation and comparison** in biological sequences. Designed for scalability (e.g., metagenomics, NGS read processing), it supports canonical encoding, minimizer-based partitioning, streaming I/O formats (`.kdi`, `.skm`), entropy filtering, and scalable set operations, all while minimizing allocations.

---

## Core Encoding & Canonicalization

- **`EncodeKmer`, `DecodeKmer`**: Encode and decode DNA sequences to and from compact `uint64` values at 2 bits per base. At most 62 bits carry sequence (k ≤ 31); the top 2 bits are reserved for error metadata.
- **`EncodeCanonicalKmer`, `CanonicalKmer`**: Normalize k-mers to their *biological canonical form*: the lexicographically smaller of a k-mer and its reverse complement.
- **`IterCanonicalKmers`, `IterCanonicalKmersWithErrors`**: Memory-efficient streaming of canonical k-mers from sequences, optionally tagging ambiguous bases in the top 2 bits.

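
The 2-bit packing and canonicalization described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function names are hypothetical, and the base-to-code mapping (A=0, C=1, G=2, T=3) is an assumption chosen so that integer order matches lexicographic order.

```go
package main

import "fmt"

// encodeKmer packs a DNA string of at most 31 bases into a uint64,
// 2 bits per base. The mapping A=0, C=1, G=2, T=3 is an assumption.
func encodeKmer(seq string) (uint64, bool) {
	if len(seq) > 31 {
		return 0, false
	}
	var code uint64
	for i := 0; i < len(seq); i++ {
		var b uint64
		switch seq[i] {
		case 'A', 'a':
			b = 0
		case 'C', 'c':
			b = 1
		case 'G', 'g':
			b = 2
		case 'T', 't':
			b = 3
		default:
			return 0, false // ambiguous base: not representable in 2 bits
		}
		code = code<<2 | b
	}
	return code, true
}

// decodeKmer unpacks the k lowest bases of code back into a string.
func decodeKmer(code uint64, k int) string {
	buf := make([]byte, k)
	for i := k - 1; i >= 0; i-- {
		buf[i] = "ACGT"[code&3]
		code >>= 2
	}
	return string(buf)
}

// reverseComplement reads bases from the low end and complements each one.
// With the mapping above, the complement of base b is 3-b.
func reverseComplement(code uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | (3 - (code & 3))
		code >>= 2
	}
	return rc
}

// canonical returns the smaller of a k-mer and its reverse complement.
func canonical(code uint64, k int) uint64 {
	if rc := reverseComplement(code, k); rc < code {
		return rc
	}
	return code
}

func main() {
	code, _ := encodeKmer("TTGA")
	fmt.Println(decodeKmer(canonical(code, 4), 4)) // prints "TCAA"
}
```

The loop-based reverse complement is written for clarity; a table-driven version would be the natural optimization.
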
## Minimizer-Based Partitioning

- **`DefaultMinimizerSize(k)`**, **`ValidateMinimizerSize(m, k, nworkers)`**: Compute and validate the minimizer size `m` used for parallel partitioning (e.g., `ceil(k / 2.5)`).
- **`ExtractSuperKmers`, `IterSuperKmers(seq, k, m)`**: Extract *super-k-mers*, the maximal contiguous regions in which all embedded `k`-mers share the same minimizer. A monotone deque keeps the scan O(n) in the sequence length.

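
To make the super-k-mer definition concrete, here is a hedged sketch of the monotone-deque scan. It is a simplification under stated assumptions: the minimizer is taken as the lexicographically smallest forward-strand m-mer (no canonicalization or hashing), m-mers are compared as strings rather than encoded integers, and `superKmers` is a hypothetical name.

```go
package main

import "fmt"

// superKmers splits seq into maximal runs of consecutive k-mers that
// share the same minimizer. Each returned string is one super-k-mer.
func superKmers(seq string, k, m int) []string {
	if m < 1 || m > k || len(seq) < k {
		return nil
	}
	w := k - m + 1  // number of m-mers inside one k-mer
	var deque []int // m-mer start positions; their m-mers are non-decreasing
	var out []string
	start := 0 // start of the current super-k-mer
	cur := ""  // minimizer of the current super-k-mer
	for p := 0; p+m <= len(seq); p++ {
		mm := seq[p : p+m]
		// Drop candidates larger than the new m-mer: they can never win again.
		for len(deque) > 0 {
			last := deque[len(deque)-1]
			if seq[last:last+m] <= mm {
				break
			}
			deque = deque[:len(deque)-1]
		}
		deque = append(deque, p)

		i := p - w + 1 // start of the k-mer whose last m-mer begins at p
		if i < 0 {
			continue
		}
		// Drop candidates that slid out of the current k-mer.
		for deque[0] < i {
			deque = deque[1:]
		}
		mz := seq[deque[0] : deque[0]+m]
		if i == 0 {
			cur = mz
		} else if mz != cur {
			// The previous k-mer (starting at i-1) closes the run.
			out = append(out, seq[start:i-1+k])
			start, cur = i, mz
		}
	}
	return append(out, seq[start:])
}

func main() {
	fmt.Println(superKmers("AAACCC", 4, 2)) // prints "[AAACC ACCC]"
}
```

Each position enters and leaves the deque at most once, which is where the linear bound comes from; with string comparisons each step additionally costs O(m), so integer-encoded m-mers are needed to reach O(n) in practice.
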
## I/O Formats & Streaming

- **`.kdi` (K-Disk Index)**: Compact binary format for sorted `uint64` k-mers using delta-varint encoding. Includes an optional `.kdx` sparse index for fast `SeekTo(target)`.
  - APIs: `NewKdiWriter`, `NewKdiReader`, `.Next() → (kmer, ok)`.
- **`.skm`**: Binary storage for *super-k-mers*, with 2-bit nucleotide packing (4× compression vs ASCII).
- **`.kdx`**: Sparse index for `.kdi`, storing `(kmer, byteOffset)` every *stride* entries (e.g., 4096), enabling O(log M) seeks over the *M* index entries.

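
The delta-varint idea behind `.kdi` can be illustrated with the standard library alone. This sketch shows only the technique; it does not reproduce the actual `.kdi` layout (header, magic bytes, and so on are not described here), and the function names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeDeltas stores a sorted slice as varint-encoded gaps between
// consecutive values. Small gaps take one or two bytes instead of eight.
func encodeDeltas(sorted []uint64) []byte {
	buf := make([]byte, 0, 2*len(sorted))
	tmp := make([]byte, binary.MaxVarintLen64)
	var prev uint64
	for _, v := range sorted {
		n := binary.PutUvarint(tmp, v-prev)
		buf = append(buf, tmp[:n]...)
		prev = v
	}
	return buf
}

// decodeDeltas reverses encodeDeltas by accumulating the gaps.
func decodeDeltas(buf []byte) ([]uint64, error) {
	var out []uint64
	var prev uint64
	for len(buf) > 0 {
		d, n := binary.Uvarint(buf)
		if n <= 0 {
			return nil, errors.New("malformed varint")
		}
		prev += d
		out = append(out, prev)
		buf = buf[n:]
	}
	return out, nil
}

func main() {
	kmers := []uint64{3, 10, 300, 1000000}
	enc := encodeDeltas(kmers)
	dec, err := decodeDeltas(enc)
	fmt.Println(len(enc), dec, err) // 7 bytes instead of 32
}
```

Because the stream must be decoded from the start to recover absolute values, a sparse index of `(kmer, byteOffset)` checkpoints is what makes random seeks practical.
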
## K-Way Merge & Deduplication

- **`KWayMerge([]*KdiReader)`**: Merges sorted `.kdi` streams, aggregating k-mer counts across inputs.
  - Uses a min-heap for O(log *n*) work per output k-mer, where *n* is the number of input streams; supports streaming and deduplication.
  - Ideal for combining k-mer sets across samples or batches.

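
A heap-based merge of this kind can be sketched with `container/heap`. The example below merges in-memory slices rather than `.kdi` readers, and all type and function names are hypothetical; the count attached to each k-mer is the number of inputs that contain it, assuming each input is sorted and free of duplicates.

```go
package main

import (
	"container/heap"
	"fmt"
)

// head is the current front element of one input stream.
type head struct {
	value uint64
	src   int // index of the stream this value came from
	pos   int // position of the value inside that stream
}

type minHeap []head

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].value < h[j].value }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(head)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// merged is one output record: a k-mer and how many inputs held it.
type merged struct {
	Kmer  uint64
	Count int
}

// kWayMerge merges sorted streams into one sorted, deduplicated list.
func kWayMerge(streams [][]uint64) []merged {
	h := &minHeap{}
	for i, s := range streams {
		if len(s) > 0 {
			heap.Push(h, head{s[0], i, 0})
		}
	}
	var out []merged
	for h.Len() > 0 {
		top := heap.Pop(h).(head)
		if n := len(out); n > 0 && out[n-1].Kmer == top.value {
			out[n-1].Count++ // same k-mer seen in another stream
		} else {
			out = append(out, merged{top.value, 1})
		}
		// Refill the heap from the stream we just consumed.
		if next := top.pos + 1; next < len(streams[top.src]) {
			heap.Push(h, head{streams[top.src][next], top.src, next})
		}
	}
	return out
}

func main() {
	streams := [][]uint64{{1, 3, 5}, {1, 2, 5}, {5}}
	fmt.Println(kWayMerge(streams)) // prints "[{1 2} {2 1} {3 1} {5 3}]"
}
```

The heap never holds more than one element per stream, so memory stays proportional to the number of inputs regardless of their size.
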
## Entropy Filtering & Complexity Detection

- **`KmerEntropy(kmer, k, levelMax)`**: Computes the minimum normalized Shannon entropy across sub-word sizes (1 to `levelMax`) using circular canonical normalization.
  - Values near **0** indicate repeats (e.g., homopolymers); values near **1** indicate high complexity.
- **`KmerEntropyFilter`**: Precomputed filter for batch processing (no allocations), with `Accept(kmer)` and fast entropy lookup.

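
The following sketch shows why taking the minimum over several word sizes helps. It is deliberately simplified and should not be read as the package's algorithm: it works on plain strings, skips the circular canonical normalization, and normalizes by `log2(min(words, 4^ws))`, which is an assumption made here for illustration. Function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// wordEntropy returns the Shannon entropy of the overlapping words of
// size ws in kmer, divided by an upper bound on the attainable entropy.
// It returns 0 when fewer than two words fit.
func wordEntropy(kmer string, ws int) float64 {
	n := len(kmer) - ws + 1
	if ws < 1 || n < 2 {
		return 0
	}
	counts := make(map[string]int)
	for i := 0; i < n; i++ {
		counts[kmer[i:i+ws]]++
	}
	var h float64
	for _, c := range counts {
		p := float64(c) / float64(n)
		h -= p * math.Log2(p)
	}
	// Assumed normalization: no more distinct words than positions or 4^ws.
	maxStates := math.Min(float64(n), math.Pow(4, float64(ws)))
	return h / math.Log2(maxStates)
}

// minEntropy takes the minimum of wordEntropy over word sizes 1..levelMax,
// skipping sizes for which fewer than two words fit.
func minEntropy(kmer string, levelMax int) float64 {
	best := 1.0
	for ws := 1; ws <= levelMax && len(kmer)-ws+1 >= 2; ws++ {
		if e := wordEntropy(kmer, ws); e < best {
			best = e
		}
	}
	return best
}

func main() {
	for _, s := range []string{"AAAAAAAA", "ACACACAC", "ACGTTGCA"} {
		fmt.Printf("%s %.3f\n", s, minEntropy(s, 3))
	}
}
```

A dinucleotide repeat such as `ACACACAC` scores moderately at word size 1 but lower at word size 2, so the minimum across sizes catches repeats that a single-base entropy would let through.
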
## K-mer Set Management (`KmerSetGroup`)

A `KmerSetGroup` represents *N* disjoint, sorted k-mer sets (e.g., per sample), persisted on disk.

### Lifecycle & Construction
- **`NewKmerSetGroupBuilder(...)`**, **`AppendKmerSetGroupBuilder(dir)`**: Build or extend groups via:
  - `AddSequence(setID, bioseq)`: Extracts canonical k-mers (with optional filtering).
  - Supports `WithMinFrequency`, `WithEntropyFilter`, and top-*N* tracking.
- **`Close()`**: Finalizes the `.kdi` files, `spectrum.bin`, and the optional `top_kmers.csv`.
- **`OpenKmerSetGroup(dir)`**: Loads an existing group in read-only mode.

### Access & Metadata
- **`K()`, `M()`, `Partitions()`**: Accessors for the group parameters; attributes are read via `GetStringAttribute(key)`.
- **`Contains(setID, kmer)`**: Parallel membership check across partitions.
- **`Iterator(setID)`**: Yields sorted k-mers via k-way merge.

### Set Algebra & Similarity
- **Set Operations**: `Union()`, `Intersect()`, `Difference()`, `QuorumAtLeast(q)` (k-mers present in ≥ *q* sets), etc.
- **Pairwise Group Ops**: `UnionWith(other)`, `IntersectWith(other)` (per-set, compatible groups only).
- **Similarity Metrics**:
  - `JaccardDistanceMatrix()` = 1 − |A ∩ B| / |A ∪ B|
  - `JaccardSimilarityMatrix()` = |A ∩ B| / |A ∪ B|

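
Because the sets are stored sorted, the Jaccard formulas above can be evaluated in a single linear pass without building hash sets. The sketch below works on in-memory slices and uses hypothetical names; returning a similarity of 1 for two empty sets is a convention assumed here, not something the source specifies.

```go
package main

import "fmt"

// jaccardSimilarity computes |A ∩ B| / |A ∪ B| for two sorted,
// duplicate-free slices by walking them in lockstep.
func jaccardSimilarity(a, b []uint64) float64 {
	i, j, inter := 0, 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default: // shared element
			inter++
			i++
			j++
		}
	}
	union := len(a) + len(b) - inter
	if union == 0 {
		return 1 // assumed convention: two empty sets are identical
	}
	return float64(inter) / float64(union)
}

// jaccardDistance is the complement of the similarity.
func jaccardDistance(a, b []uint64) float64 {
	return 1 - jaccardSimilarity(a, b)
}

func main() {
	a := []uint64{1, 2, 3, 4}
	b := []uint64{3, 4, 5, 6}
	// Two shared elements out of six distinct ones.
	fmt.Printf("%.4f %.4f\n", jaccardSimilarity(a, b), jaccardDistance(a, b))
}
```

A full matrix repeats this pass for every pair of sets, so the cost grows quadratically with the number of sets.
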
### Utilities
- **`CopySetsByIDTo(ids, destDir)`**, **`RemoveSetByID(id)`**, **`MatchSetIDs(patterns)`**
- **`IsCompatibleWith(other)`**: Validates that two groups share the same `(k, m, partitions)`.

## K-mer Indexing & Matching (`KmerMap`)

Generic hash map associating canonical k-mers with the sequences that contain them.

- **`Push(sequences)`**: Builds the index (optionally with a `maxocc` limit).
- **`Query(querySeq) → KmerMatch`**: Returns sequences sharing k-mers, with match counts.
- **Sparse mode** (`SparseAt ≥ 0`): Ignores the central base (e.g., for ambiguous-position matching).
- **Result utilities**: `FilterMinCount`, `.Max()`, `.Sequences()`.

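
The push-then-query pattern can be sketched with a plain Go map. This is an illustration only: it keys on forward-strand k-mer strings rather than canonical encoded k-mers, omits the `maxocc` limit and sparse mode, and uses hypothetical type and method names instead of the `KmerMap` API.

```go
package main

import "fmt"

// kmerIndex maps each k-mer to the IDs of the sequences containing it.
type kmerIndex struct {
	k     int
	index map[string][]int
}

func newKmerIndex(k int) *kmerIndex {
	return &kmerIndex{k: k, index: make(map[string][]int)}
}

// distinctKmers lists the distinct k-mers of seq in order of first occurrence.
func distinctKmers(seq string, k int) []string {
	seen := make(map[string]bool)
	var out []string
	for i := 0; i+k <= len(seq); i++ {
		km := seq[i : i+k]
		if !seen[km] {
			seen[km] = true
			out = append(out, km)
		}
	}
	return out
}

// push registers every distinct k-mer of seq under the given ID.
func (ix *kmerIndex) push(id int, seq string) {
	for _, km := range distinctKmers(seq, ix.k) {
		ix.index[km] = append(ix.index[km], id)
	}
}

// query returns, for each indexed sequence, how many distinct k-mers
// it shares with the query sequence.
func (ix *kmerIndex) query(seq string) map[int]int {
	hits := make(map[int]int)
	for _, km := range distinctKmers(seq, ix.k) {
		for _, id := range ix.index[km] {
			hits[id]++
		}
	}
	return hits
}

func main() {
	ix := newKmerIndex(4)
	ix.push(0, "ACGTAC")
	ix.push(1, "TTGTAC")
	fmt.Println(ix.query("CGTACG")) // prints "map[0:2 1:1]"
}
```

Filtering the result by a minimum count, or picking the sequence with the highest count, then reduces to a simple scan over the returned map.
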
## K-mer Spectrum Analysis

- **`SpectrumEntry{Frequency, Count}`**, `KmerSpectrum`: Sorted frequency distribution.
- **`MapToSpectrum()`, `MergeTopN()`**, binary/CSV I/O (`WriteSpectrum`, `ReadSpectrum`).
- **Top-*N* collector** via min-heap for streaming frequency tracking.

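
As a rough illustration of what a frequency spectrum holds, the sketch below turns per-k-mer occurrence counts into sorted `(frequency, count)` pairs. The reading of the two fields (frequency is how often a k-mer occurs, count is how many distinct k-mers occur that often) is an assumption, and the names are hypothetical rather than the package's own.

```go
package main

import (
	"fmt"
	"sort"
)

// spectrumEntry says: Count distinct k-mers each occur Frequency times.
type spectrumEntry struct {
	Frequency int
	Count     int
}

// toSpectrum builds a spectrum sorted by increasing frequency from a
// map of k-mer -> number of occurrences.
func toSpectrum(occurrences map[uint64]int) []spectrumEntry {
	hist := make(map[int]int)
	for _, n := range occurrences {
		hist[n]++
	}
	out := make([]spectrumEntry, 0, len(hist))
	for f, c := range hist {
		out = append(out, spectrumEntry{Frequency: f, Count: c})
	}
	sort.Slice(out, func(i, j int) bool {
		return out[i].Frequency < out[j].Frequency
	})
	return out
}

func main() {
	occ := map[uint64]int{10: 1, 11: 1, 12: 5, 13: 5, 14: 5}
	fmt.Println(toSpectrum(occ)) // prints "[{1 2} {5 3}]"
}
```

In sequencing data the low-frequency end of such a spectrum is typically dominated by error k-mers, which is what a minimum-frequency filter targets.
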
## Utility & Helpers

- **`HammingDistance(a, b)`**: Bitwise distance between encoded k-mers.
- **Varint encoding/decoding** (`EncodeVarint`, `DecodeVarint`): Variable-length integer compression for I/O, 7 payload bits per byte.
- **Reverse complement**: Constant-time via lookup tables (`revcompnuc`, `kmermask`).

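
For 2-bit encoded k-mers, a Hamming distance that counts differing bases (rather than differing bits) needs one extra folding step after the XOR. The sketch below assumes both k-mers have the same length, that the top metadata bits are clear, and that the distance is meant per base; whether the package counts bases or raw bits is not stated in the source, and `hammingBases` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hammingBases counts the positions at which two 2-bit encoded k-mers
// hold different bases.
func hammingBases(a, b uint64) int {
	x := a ^ b
	// A base differs if either of its two bits differs. OR the high bit
	// of each pair onto the low bit, then keep only the low bits.
	diff := (x | x>>1) & 0x5555555555555555
	return bits.OnesCount64(diff)
}

func main() {
	// With A=0, C=1, G=2, T=3: TTGA = 0b11111000, TCAA = 0b11010000.
	fmt.Println(hammingBases(0b11111000, 0b11010000)) // prints "2"
}
```
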
---

## Design Principles

- **Zero-allocation where possible** (buffer reuse, iterators).
- **Streaming-first**: Avoids loading large datasets into memory.
- **Disk-backed persistence** for reproducibility and scalability.
- **Canonicalization & symmetry**: Strand-aware (reverse complement) or circular normalization for robustness.

## Use Cases

- Metagenomic read clustering & error correction
- Minimizer-based sketching (e.g., Mash/Sourmash analogs)
- Scalable Jaccard-based similarity matrices across thousands of samples
- Low-complexity region detection via entropy filtering

All operations are tested, benchmarked, and optimized for high-throughput genomic workflows.