Files
obitools4/autodoc/docmd/pkg_obikmer.md
T

102 lines
5.6 KiB
Markdown
Raw Normal View History

2026-04-07 08:36:50 +02:00
# Semantic Description of `obikmer` Package
The `obikmer` package provides high-performance, disk-backed utilities for **k-mer manipulation and comparison** in biological sequences. Designed for scalability (e.g., metagenomics, NGS read processing), it supports canonical encoding, minimizer-based partitioning, streaming I/O formats (`.kdi`, `.skm`), entropy filtering, and scalable set operations — all while minimizing allocations.
---
## Core Encoding & Canonicalization
- **`EncodeKmer`, `DecodeKmer`**: Encodes/decodes DNA sequences to/from compact 62-bit `uint64`s (2 bits/base), preserving top 2 bits for error metadata.
- **`EncodeCanonicalKmer`, ` CanonicalKmer`**: Normalizes k-mers to their *biological canonical form* — the lexicographically smaller of a k-mer and its reverse complement.
- **`IterCanonicalKmers`, `IterCanonicalKmersWithErrors`**: Memory-efficient streaming of canonical k-mers from sequences; optionally tags ambiguous bases in top 2 bits.
## Minimizer-Based Partitioning
- **`DefaultMinimizerSize(k)`**, **`ValidateMinimizerSize(m, k, nworkers)`**: Computes and validates minimizer size `m` for parallelization (e.g., `ceil(k / 2.5)`).
- **`ExtractSuperKmers`, `IterSuperKmers(seq, k, m)`**: Extracts *super-k-mers* — maximal contiguous regions where all embedded `k`-mers share the same minimizer. Uses monotone deque for O(n) time.
## I/O Formats & Streaming
- **`.kdi` (K-Disk Index)**: Compact binary format for sorted `uint64` k-mers using delta-varint encoding. Includes optional `.kdx` sparse index for fast `SeekTo(target)`.
- APIs: `NewKdiWriter`, `NewKdiReader`, `.Next() → (kmer, ok)`.
- **`.skm`**: Binary storage for *super-k-mers*, with 2-bit nucleotide packing (4× compression vs ASCII).
- **`.kdx`**: Sparse index for `.kdi`, storing `(kmer, byteOffset)` every *stride* entries (e.g., 4096), enabling O(log M) seeks.
## K-Way Merge & Deduplication
- **`KWayMerge([]*KdiReader)`**: Merges sorted `.kdi` streams, aggregating k-mer counts across inputs.
- Uses min-heap for O(log *k*) per-output operations; supports streaming and deduplication.
- Ideal for combining k-mer sets across samples or batches.
## Entropy Filtering & Complexity Detection
- **`KmerEntropy(kmer, k, levelMax)`**: Computes minimum normalized Shannon entropy across sub-word sizes (1 to `levelMax`) using circular canonical normalization.
- Values near **0** indicate repeats (e.g., homopolymers); ~1 indicates high complexity.
- **`KmerEntropyFilter`**: Precomputed filter for batch processing (no allocations), with `Accept(kmer)` and fast entropy lookup.
## K-mer Set Management (`KmerSetGroup`)
A `KmerSetGroup` represents *N* disjoint, sorted k-mer sets (e.g., per sample), persisted on disk.
### Lifecycle & Construction
- **`NewKmerSetGroupBuilder(...)`**, **`AppendKmerSetGroupBuilder(dir)`**: Builds or extends groups via:
- `AddSequence(setID, bioseq)`: Extracts canonical k-mers (with optional filtering).
- Supports `WithMinFrequency`, `WithEntropyFilter`, and top-*N* tracking.
- **`Close()`**: Finalizes `.kdi`s, `spectrum.bin`, and optional `top_kmers.csv`.
- **`OpenKmerSetGroup(dir)`**: Loads existing group in read-only mode.
### Access & Metadata
- **`K()`, `M()`, `Partitions()`**, attributes via `GetStringAttribute(key)`.
- **`Contains(setID, kmer)`**: Parallel membership check across partitions.
- **`Iterator(setID)`**: Yields sorted k-mers via k-way merge.
### Set Algebra & Similarity
- **Set Operations**: `Union()`, `Intersect()`, `Difference()`, `QuorumAtLeast(q)` (≥ *q* sets), etc.
- **Pairwise Group Ops**: `UnionWith(other)`, `IntersectWith(other)` (per-set, compatible groups only).
- **Similarity Metrics**:
`JaccardDistanceMatrix()` = 1 |A ∩ B| / |A B|
`JaccardSimilarityMatrix()` = |A ∩ B| / |A B|
### Utilities
- **`CopySetsByIDTo(ids, destDir)`**, `RemoveSetByID(id)`, `MatchSetIDs(patterns)`
- **`IsCompatibleWith(other)`**: Validates `(k, m, partitions)`.
## K-mer Indexing & Matching (`KmerMap`)
Generic hash map associating canonical k-mers to sequences containing them.
- **`Push(sequences)`**: Builds index (optionally with `maxocc` limit).
- **`Query(querySeq) → KmerMatch`**: Returns sequences sharing k-mers, with match counts.
- **Supports sparse mode** (`SparseAt ≥ 0`): Ignores central base (e.g., for ambiguous-position matching).
- **Result utilities**: `FilterMinCount`, `.Max()`, `.Sequences()`.
## K-mer Spectrum Analysis
- **`SpectrumEntry{Frequency, Count}`**, `KmerSpectrum`: Sorted frequency distribution.
- **`MapToSpectrum()`, `MergeTopN()`**, binary/CSV I/O (`WriteSpectrum`, `ReadSpectrum`).
- **Top-*N* collector** via min-heap for streaming frequency tracking.
## Utility & Helpers
- **`HammingDistance(a, b)`**: Bitwise distance between encoded k-mers.
- **Varint encoding/decoding** (`EncodeVarint`, `DecodeVarint`): 7-bit-per-byte compression for I/O.
- **Reverse complement**: Constant-time via lookup tables (`revcompnuc`, `kmermask`).
---
## Design Principles
- **Zero-allocation where possible** (buffer reuse, iterators).
- **Streaming-first**: Avoids loading large datasets into memory.
- **Disk-backed persistence** for reproducibility and scalability.
- **Canonicalization & symmetry**: Strand-aware (reverse complement) or circular normalization for robustness.
## Use Cases
- Metagenomic read clustering & error correction
- Minimizer-based sketching (e.g., Mash/Sourmash analogs)
- Scalable Jaccard-based similarity matrices across thousands of samples
- Low-complexity region detection via entropy filtering
All operations are tested, benchmarked, and optimized for high-throughput genomic workflows.