obitools4/autodoc/docmd/pkg_obikmer.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

# Semantic Description of `obikmer` Package
The `obikmer` package provides high-performance, disk-backed utilities for **k-mer manipulation and comparison** in biological sequences. Designed for scalability (e.g., metagenomics, NGS read processing), it supports canonical encoding, minimizer-based partitioning, streaming I/O formats (`.kdi`, `.skm`), entropy filtering, and scalable set operations — all while minimizing allocations.
---
## Core Encoding & Canonicalization
- **`EncodeKmer`, `DecodeKmer`**: Encodes/decodes DNA sequences to/from compact `uint64` values at 2 bits per base, using at most 62 bits (k ≤ 31) and reserving the top 2 bits for error metadata.
- **`EncodeCanonicalKmer`, `CanonicalKmer`**: Normalizes k-mers to their *biological canonical form*, i.e. the lexicographically smaller of a k-mer and its reverse complement.
- **`IterCanonicalKmers`, `IterCanonicalKmersWithErrors`**: Memory-efficient streaming of canonical k-mers from sequences; optionally tags ambiguous bases in top 2 bits.
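The 2-bit packing and canonical form described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper names (`encodeKmer`, `decodeKmer`, `reverseComplement`, `canonical`) and the base order A=0, C=1, G=2, T=3 are assumptions chosen so that numeric order matches lexicographic order.

```go
package main

import "fmt"

// encodeKmer packs a DNA string into a uint64 at 2 bits per base
// (assumed mapping: A=0, C=1, G=2, T=3). Hypothetical helper, not the obikmer API.
// It refuses inputs longer than 31 bases so the top 2 bits stay free.
func encodeKmer(s string) (uint64, bool) {
	if len(s) == 0 || len(s) > 31 {
		return 0, false
	}
	var code uint64
	for i := 0; i < len(s); i++ {
		var b uint64
		switch s[i] {
		case 'A', 'a':
			b = 0
		case 'C', 'c':
			b = 1
		case 'G', 'g':
			b = 2
		case 'T', 't':
			b = 3
		default:
			return 0, false // ambiguous base: not representable in 2 bits
		}
		code = code<<2 | b
	}
	return code, true
}

// decodeKmer unpacks the k lowest base slots back into a string.
func decodeKmer(code uint64, k int) string {
	buf := make([]byte, k)
	for i := k - 1; i >= 0; i-- {
		buf[i] = "ACGT"[code&3]
		code >>= 2
	}
	return string(buf)
}

// reverseComplement relies on the assumed mapping, where the complement of b is 3-b.
func reverseComplement(code uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | (3 - (code & 3))
		code >>= 2
	}
	return rc
}

// canonical returns the smaller of a k-mer and its reverse complement.
func canonical(code uint64, k int) uint64 {
	if rc := reverseComplement(code, k); rc < code {
		return rc
	}
	return code
}

func main() {
	code, _ := encodeKmer("TTGA")
	fmt.Println(decodeKmer(canonical(code, 4), 4)) // TCAA
}
```

With this mapping, comparing two encoded k-mers of the same length as integers gives the same result as comparing their strings, which is why the canonical choice reduces to a single integer comparison.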
## Minimizer-Based Partitioning
- **`DefaultMinimizerSize(k)`**, **`ValidateMinimizerSize(m, k, nworkers)`**: Computes and validates minimizer size `m` for parallelization (e.g., `ceil(k / 2.5)`).
- **`ExtractSuperKmers`, `IterSuperKmers(seq, k, m)`**: Extracts *super-k-mers* — maximal contiguous regions where all embedded `k`-mers share the same minimizer. Uses monotone deque for O(n) time.
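As a rough illustration of the monotone-deque idea, the sketch below finds the minimizer position of every k-mer window in O(n) and then groups consecutive k-mers that share it. All names are hypothetical, the m-mer ranks are supplied directly as integers, and "same minimizer" is taken to mean "same position", which is an assumption about the package's behaviour.

```go
package main

import (
	"fmt"
	"math"
)

// defaultMinimizerSize applies the ceil(k / 2.5) rule of thumb quoted above.
func defaultMinimizerSize(k int) int {
	return int(math.Ceil(float64(k) / 2.5))
}

// windowMinima returns, for every window of w consecutive values,
// the index of the smallest value (leftmost on ties).
// In a k-mer setting, vals would hold m-mer ranks and w = k - m + 1.
func windowMinima(vals []uint64, w int) []int {
	if w <= 0 || len(vals) < w {
		return nil
	}
	out := make([]int, 0, len(vals)-w+1)
	deque := make([]int, 0, w) // indices whose values are non-decreasing front to back
	for i, v := range vals {
		// Values strictly larger than v can never be a minimum again.
		for len(deque) > 0 && vals[deque[len(deque)-1]] > v {
			deque = deque[:len(deque)-1]
		}
		deque = append(deque, i)
		// Drop the front index once it has slid out of the window.
		if deque[0] <= i-w {
			deque = deque[1:]
		}
		if i >= w-1 {
			out = append(out, deque[0])
		}
	}
	return out
}

// superKmerRuns groups consecutive windows that share a minimizer position
// and returns [first, last] window indices for each group.
func superKmerRuns(minIdx []int) [][2]int {
	var runs [][2]int
	for i, m := range minIdx {
		if i > 0 && m == minIdx[i-1] {
			runs[len(runs)-1][1] = i
		} else {
			runs = append(runs, [2]int{i, i})
		}
	}
	return runs
}

func main() {
	fmt.Println(defaultMinimizerSize(31)) // 13
	minima := windowMinima([]uint64{5, 2, 7, 1, 3, 4}, 3)
	fmt.Println(minima)                // [1 3 3 3]
	fmt.Println(superKmerRuns(minima)) // [[0 0] [1 3]]
}
```

Each index is pushed and popped at most once, which is where the linear running time comes from.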
## I/O Formats & Streaming
- **`.kdi` (K-Disk Index)**: Compact binary format for sorted `uint64` k-mers using delta-varint encoding. Includes optional `.kdx` sparse index for fast `SeekTo(target)`.
  - APIs: `NewKdiWriter`, `NewKdiReader`, `.Next() → (kmer, ok)`.
- **`.skm`**: Binary storage for *super-k-mers*, with 2-bit nucleotide packing (4× compression vs ASCII).
- **`.kdx`**: Sparse index for `.kdi`, storing `(kmer, byteOffset)` every *stride* entries (e.g., 4096), enabling O(log M) seeks.
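Delta-varint encoding of a sorted k-mer list can be sketched with the standard library's `encoding/binary` varint helpers. This only illustrates the compression idea; the actual `.kdi` layout (header, magic bytes, index stride) is not described here, and `encodeDeltas` / `decodeDeltas` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeDeltas stores each value as the varint-encoded gap to its predecessor.
// The input must be sorted in ascending order so every gap is non-negative.
func encodeDeltas(sorted []uint64) []byte {
	out := make([]byte, 0, len(sorted)*2)
	var tmp [binary.MaxVarintLen64]byte
	var prev uint64
	for _, v := range sorted {
		n := binary.PutUvarint(tmp[:], v-prev)
		out = append(out, tmp[:n]...)
		prev = v
	}
	return out
}

// decodeDeltas reverses encodeDeltas by accumulating the decoded gaps.
func decodeDeltas(data []byte) ([]uint64, error) {
	var out []uint64
	var prev uint64
	for len(data) > 0 {
		d, n := binary.Uvarint(data)
		if n <= 0 {
			return nil, fmt.Errorf("malformed varint at %d bytes from the end", len(data))
		}
		prev += d
		out = append(out, prev)
		data = data[n:]
	}
	return out, nil
}

func main() {
	kmers := []uint64{3, 10, 10000, 10050}
	enc := encodeDeltas(kmers)
	dec, err := decodeDeltas(enc)
	fmt.Println(len(enc), dec, err) // 5 [3 10 10000 10050] <nil>
}
```

Four 8-byte values shrink to 5 bytes here because the gaps (3, 7, 9990, 50) are small; densely populated sorted k-mer sets benefit in the same way.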
## K-Way Merge & Deduplication
- **`KWayMerge([]*KdiReader)`**: Merges sorted `.kdi` streams, aggregating k-mer counts across inputs.
  - Uses a min-heap, so each output k-mer costs O(log *n*) for *n* input streams; supports streaming and deduplication.
  - Ideal for combining k-mer sets across samples or batches.
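The heap-based merge can be sketched with `container/heap`. For brevity the sketch merges in-memory sorted slices rather than `.kdi` readers; `cursor`, `merged`, and `kWayMerge` are hypothetical names, and counting one occurrence per stream is an assumption about what "aggregating counts" means.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor holds the unread tail of one sorted stream.
type cursor struct{ vals []uint64 }

// minHeap orders cursors by their current head value.
type minHeap []*cursor

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].vals[0] < h[j].vals[0] }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(*cursor)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// merged is one distinct k-mer with the number of times it was seen.
type merged struct {
	Kmer  uint64
	Count int
}

// kWayMerge merges sorted streams into one sorted, deduplicated list with counts.
func kWayMerge(streams [][]uint64) []merged {
	h := &minHeap{}
	for _, s := range streams {
		if len(s) > 0 {
			*h = append(*h, &cursor{vals: s})
		}
	}
	heap.Init(h)
	var out []merged
	for h.Len() > 0 {
		c := (*h)[0]
		v := c.vals[0]
		c.vals = c.vals[1:]
		if len(c.vals) == 0 {
			heap.Pop(h) // stream exhausted: remove its cursor
		} else {
			heap.Fix(h, 0) // head changed: restore heap order
		}
		if n := len(out); n > 0 && out[n-1].Kmer == v {
			out[n-1].Count++
		} else {
			out = append(out, merged{Kmer: v, Count: 1})
		}
	}
	return out
}

func main() {
	fmt.Println(kWayMerge([][]uint64{{1, 4, 9}, {2, 4}, {4, 9, 12}}))
	// [{1 1} {2 1} {4 3} {9 2} {12 1}]
}
```

Because equal k-mers leave the heap consecutively, deduplication only needs to compare each value with the last emitted entry.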
## Entropy Filtering & Complexity Detection
- **`KmerEntropy(kmer, k, levelMax)`**: Computes minimum normalized Shannon entropy across sub-word sizes (1 to `levelMax`) using circular canonical normalization.
  - Values near **0** indicate repeats (e.g., homopolymers); values near **1** indicate high complexity.
- **`KmerEntropyFilter`**: Precomputed filter for batch processing (no allocations), with `Accept(kmer)` and fast entropy lookup.
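The "minimum normalized entropy across word sizes" idea can be approximated as below. This is a simplified sketch under stated assumptions: it works on strings, reads words circularly, skips the canonical normalization of words that the package applies, and normalizes by `log2(min(k, 4^ws))`, which may differ from the package's normalization. `wordEntropy` and `kmerEntropy` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// wordEntropy computes the Shannon entropy of the circular words of size ws
// in seq, divided by an assumed upper bound log2(min(len(seq), 4^ws)).
func wordEntropy(seq string, ws int) float64 {
	k := len(seq)
	circ := seq + seq[:ws-1] // wrap around so that exactly k words are read
	counts := make(map[string]int)
	for i := 0; i < k; i++ {
		counts[circ[i:i+ws]]++
	}
	var h float64
	for _, c := range counts {
		p := float64(c) / float64(k)
		h -= p * math.Log2(p)
	}
	maxWords := math.Min(float64(k), math.Pow(4, float64(ws)))
	if maxWords <= 1 {
		return 0
	}
	return h / math.Log2(maxWords)
}

// kmerEntropy returns the lowest normalized entropy over word sizes 1..levelMax.
func kmerEntropy(seq string, levelMax int) float64 {
	lowest := 1.0
	for ws := 1; ws <= levelMax && ws < len(seq); ws++ {
		if e := wordEntropy(seq, ws); e < lowest {
			lowest = e
		}
	}
	return lowest
}

func main() {
	fmt.Printf("%.3f\n", kmerEntropy("AAAAAAAA", 3)) // homopolymer
	fmt.Printf("%.3f\n", kmerEntropy("ACACACAC", 2)) // dinucleotide repeat
	fmt.Printf("%.3f\n", kmerEntropy("ACGTTGCA", 1)) // balanced base composition
}
```

Taking the minimum over word sizes is what lets a dinucleotide repeat score lower than its single-base composition alone would suggest.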
## K-mer Set Management (`KmerSetGroup`)
A `KmerSetGroup` represents *N* independent, sorted k-mer sets (e.g., one per sample), persisted on disk.
### Lifecycle & Construction
- **`NewKmerSetGroupBuilder(...)`**, **`AppendKmerSetGroupBuilder(dir)`**: Builds or extends groups via:
  - `AddSequence(setID, bioseq)`: Extracts canonical k-mers (with optional filtering).
  - Supports `WithMinFrequency`, `WithEntropyFilter`, and top-*N* tracking.
- **`Close()`**: Finalizes `.kdi`s, `spectrum.bin`, and optional `top_kmers.csv`.
- **`OpenKmerSetGroup(dir)`**: Loads existing group in read-only mode.
### Access & Metadata
- **`K()`, `M()`, `Partitions()`**, attributes via `GetStringAttribute(key)`.
- **`Contains(setID, kmer)`**: Parallel membership check across partitions.
- **`Iterator(setID)`**: Yields sorted k-mers via k-way merge.
### Set Algebra & Similarity
- **Set Operations**: `Union()`, `Intersect()`, `Difference()`, `QuorumAtLeast(q)` (≥ *q* sets), etc.
- **Pairwise Group Ops**: `UnionWith(other)`, `IntersectWith(other)` (per-set, compatible groups only).
- **Similarity Metrics**:
  - `JaccardDistanceMatrix()` = 1 - |A ∩ B| / |A ∪ B|
  - `JaccardSimilarityMatrix()` = |A ∩ B| / |A ∪ B|
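Because the sets are sorted, both Jaccard quantities for one pair can be computed in a single linear pass. The sketch below works on in-memory slices; `jaccard` is a hypothetical helper, and returning similarity 1 for two empty sets is an assumed convention.

```go
package main

import "fmt"

// jaccard walks two sorted, duplicate-free k-mer slices in lockstep and
// returns |A ∩ B| / |A ∪ B| together with the corresponding distance.
func jaccard(a, b []uint64) (similarity, distance float64) {
	i, j, inter := 0, 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			inter++
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	union := len(a) + len(b) - inter
	if union == 0 {
		return 1, 0 // assumed convention: two empty sets are identical
	}
	similarity = float64(inter) / float64(union)
	return similarity, 1 - similarity
}

func main() {
	s, d := jaccard([]uint64{1, 2, 3, 4}, []uint64{3, 4, 5, 6})
	fmt.Printf("similarity=%.3f distance=%.3f\n", s, d) // similarity=0.333 distance=0.667
}
```

A full matrix would repeat this for every pair of sets, which is why streaming over sorted data matters once the number of samples grows.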
### Utilities
- **`CopySetsByIDTo(ids, destDir)`**, `RemoveSetByID(id)`, `MatchSetIDs(patterns)`
- **`IsCompatibleWith(other)`**: Validates `(k, m, partitions)`.
## K-mer Indexing & Matching (`KmerMap`)
Generic hash map associating canonical k-mers to sequences containing them.
- **`Push(sequences)`**: Builds index (optionally with `maxocc` limit).
- **`Query(querySeq) → KmerMatch`**: Returns sequences sharing k-mers, with match counts.
- **Supports sparse mode** (`SparseAt ≥ 0`): Ignores central base (e.g., for ambiguous-position matching).
- **Result utilities**: `FilterMinCount`, `.Max()`, `.Sequences()`.
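The indexing idea behind `KmerMap` can be illustrated with a small inverted index from canonical k-mers to sequence identifiers. This sketch uses string k-mers, uppercase `ACGT` input only, and hypothetical names (`kmerIndex`, `buildIndex`, `query`); it does not model `maxocc` or sparse mode.

```go
package main

import "fmt"

// revComp returns the reverse complement of an uppercase ACGT string.
func revComp(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		out[len(s)-1-i] = comp[s[i]]
	}
	return string(out)
}

// canonicalKmers lists the canonical form of every k-mer of seq, in order.
func canonicalKmers(seq string, k int) []string {
	var out []string
	for i := 0; i+k <= len(seq); i++ {
		w := seq[i : i+k]
		if rc := revComp(w); rc < w {
			w = rc
		}
		out = append(out, w)
	}
	return out
}

// kmerIndex maps each canonical k-mer to the ids of the sequences containing it.
type kmerIndex struct {
	k     int
	posts map[string][]int
}

// buildIndex records each sequence at most once per distinct k-mer.
func buildIndex(seqs []string, k int) *kmerIndex {
	idx := &kmerIndex{k: k, posts: make(map[string][]int)}
	for id, s := range seqs {
		seen := make(map[string]bool)
		for _, w := range canonicalKmers(s, k) {
			if !seen[w] {
				seen[w] = true
				idx.posts[w] = append(idx.posts[w], id)
			}
		}
	}
	return idx
}

// query counts, per indexed sequence, how many query k-mers it shares.
func (idx *kmerIndex) query(q string) map[int]int {
	hits := make(map[int]int)
	for _, w := range canonicalKmers(q, idx.k) {
		for _, id := range idx.posts[w] {
			hits[id]++
		}
	}
	return hits
}

func main() {
	idx := buildIndex([]string{"ACGTAC", "TTTTTT", "GTACGT"}, 4)
	fmt.Println(idx.query("ACGTA")) // map[0:2 2:2]
}
```

Sequences 0 and 2 are reverse complements of each other, so canonicalization makes them match the same query equally well.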
## K-mer Spectrum Analysis
- **`SpectrumEntry{Frequency, Count}`**, `KmerSpectrum`: Sorted frequency distribution.
- **`MapToSpectrum()`, `MergeTopN()`**, binary/CSV I/O (`WriteSpectrum`, `ReadSpectrum`).
- **Top-*N* collector** via min-heap for streaming frequency tracking.
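A k-mer spectrum is the histogram of occurrence counts: for each frequency, how many distinct k-mers occur that many times. The sketch below derives it from a count map; the field names follow `SpectrumEntry{Frequency, Count}` above, while the lowercase type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// spectrumEntry says that Count distinct k-mers occur exactly Frequency times.
type spectrumEntry struct {
	Frequency int
	Count     int
}

// mapToSpectrum turns per-k-mer counts into a spectrum sorted by frequency.
func mapToSpectrum(counts map[uint64]int) []spectrumEntry {
	byFreq := make(map[int]int)
	for _, c := range counts {
		byFreq[c]++
	}
	out := make([]spectrumEntry, 0, len(byFreq))
	for f, n := range byFreq {
		out = append(out, spectrumEntry{Frequency: f, Count: n})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Frequency < out[j].Frequency })
	return out
}

func main() {
	counts := map[uint64]int{10: 1, 11: 1, 12: 3, 13: 2, 14: 1}
	fmt.Println(mapToSpectrum(counts)) // [{1 3} {2 1} {3 1}]
}
```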
## Utility & Helpers
- **`HammingDistance(a, b)`**: Bitwise distance between encoded k-mers.
- **Varint encoding/decoding** (`EncodeVarint`, `DecodeVarint`): 7-bit-per-byte compression for I/O.
- **Reverse complement**: Constant-time via lookup tables (`revcompnuc`, `kmermask`).
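For 2-bit encoded k-mers, a base-level Hamming distance can be computed with a few bit operations. The sketch assumes that "distance" means the number of differing bases and that any metadata in the top 2 bits is either cleared or identical in both operands; `hammingDistance` here is a hypothetical stand-in, not the package function.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hammingDistance counts the base positions at which two 2-bit encoded
// k-mers of equal length differ.
func hammingDistance(a, b uint64) int {
	x := a ^ b                           // non-zero 2-bit slot wherever bases differ
	x = (x | x>>1) & 0x5555555555555555 // fold each slot onto its low bit
	return bits.OnesCount64(x)
}

func main() {
	const (
		acgt = 0x1B // 00 01 10 11 with A=0, C=1, G=2, T=3
		acga = 0x18 // 00 01 10 00
		tgca = 0xE4 // 11 10 01 00
	)
	fmt.Println(hammingDistance(acgt, acgt)) // 0
	fmt.Println(hammingDistance(acgt, acga)) // 1
	fmt.Println(hammingDistance(acgt, tgca)) // 4
}
```

Folding before counting matters: a plain popcount of `a ^ b` would count a T-to-A substitution (`11` vs `00`) as two differences instead of one.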
---
## Design Principles
- **Zero-allocation where possible** (buffer reuse, iterators).
- **Streaming-first**: Avoids loading large datasets into memory.
- **Disk-backed persistence** for reproducibility and scalability.
- **Canonicalization & symmetry**: Strand-aware (reverse complement) or circular normalization for robustness.
## Use Cases
- Metagenomic read clustering & error correction
- Minimizer-based sketching (e.g., Mash/Sourmash analogs)
- Scalable Jaccard-based similarity matrices across thousands of samples
- Low-complexity region detection via entropy filtering
All operations are tested, benchmarked, and optimized for high-throughput genomic workflows.