mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
# `obikmer` K-mer Set Group Builder: Functional Overview

The `KmerSetGroupBuilder` enables scalable construction of k-mer indexes from biological sequences, supporting both new and incremental (append) workflows. It operates in two phases: **collection** of super-kmers into partitioned temporary files (`.skm`), and **finalization**, where partitions are processed in parallel into final k-mer indexes (`.kdi`).

## Core Features

- **K-mer & Minimizer Configuration**:
  Supports `k ∈ [2,31]`; auto-computes optimal minimizer size (`m ≈ k/2.5`) and partition count (up to `4^m`, capped at 4096).

- **Functional Options for Filtering**:
  - `WithMinFrequency(n)`: Keep only k-mers with frequency ≥ *n* (enables deduplication).
  - `WithMaxFrequency(n)`: Discard k-mers with frequency > *n*.
  - `WithEntropyFilter(threshold, levelMax)`: Remove low-complexity k-mers (entropy ≤ threshold).
  - `WithSaveFreqKmers(n)`: Save top-*n* most frequent k-mers per set to `top_kmers.csv`.

- **Concurrent & Pipeline-Aware Processing**:
  Uses a two-stage pipeline: *I/O-bound readers* (2–4 goroutines) feed k-mers to *CPU-bound workers*, one per core, maximizing throughput.

- **Partitioned I/O & Thread Safety**:
  Super-kmers are written to per-partition `.skm` files using mutex-protected writers, enabling safe concurrent `AddSequence()` calls.
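
As a rough illustration of the parameter rules in the first bullet, the Go sketch below derives a minimizer size and a partition count from `k`. The helper names (`suggestMinimizerSize`, `suggestPartitions`), the round-to-nearest rule, and the clamping of `m` to `[1, k-1]` are assumptions made for this example; the text above only states `m ≈ k/2.5` and the cap of 4096.

```go
package main

import (
	"fmt"
	"math"
)

// Limits taken from the text above: k in [2,31], at most 4096 partitions.
const (
	minK          = 2
	maxK          = 31
	maxPartitions = 4096
)

// suggestMinimizerSize is a hypothetical helper: it rounds k/2.5 to the
// nearest integer and clamps the result to [1, k-1] (both are assumptions).
func suggestMinimizerSize(k int) (int, error) {
	if k < minK || k > maxK {
		return 0, fmt.Errorf("k=%d outside [%d,%d]", k, minK, maxK)
	}
	m := int(math.Round(float64(k) / 2.5))
	if m < 1 {
		m = 1
	}
	if m > k-1 {
		m = k - 1
	}
	return m, nil
}

// suggestPartitions returns min(4^m, 4096).
func suggestPartitions(m int) int {
	if m >= 6 { // 4^6 = 4096 already reaches the cap
		return maxPartitions
	}
	return 1 << (2 * m) // 4^m
}

func main() {
	for _, k := range []int{8, 15, 31} {
		m, err := suggestMinimizerSize(k)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("k=%d m=%d partitions=%d\n", k, m, suggestPartitions(m))
	}
}
```

Under these assumptions, `k=8` gives `m=3` and 64 partitions, while `k=15` and `k=31` both reach the 4096 cap.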
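
The filtering options listed above follow Go's functional-options idiom. The sketch below shows how such options could be wired and how the two frequency bounds combine. `filterConfig`, `newFilterConfig`, and `keep` are hypothetical, and the two `With...` functions are local stand-ins that only reuse the option names; they are not the library's code.

```go
package main

import "fmt"

// filterConfig is a hypothetical stand-in for the builder's filter settings.
type filterConfig struct {
	minFreq int // 0 means no lower bound
	maxFreq int // 0 means no upper bound
}

// Option mutates a filterConfig; this is the usual functional-options shape.
type Option func(*filterConfig)

// WithMinFrequency (local stand-in) keeps k-mers seen at least n times.
func WithMinFrequency(n int) Option {
	return func(c *filterConfig) { c.minFreq = n }
}

// WithMaxFrequency (local stand-in) drops k-mers seen more than n times.
func WithMaxFrequency(n int) Option {
	return func(c *filterConfig) { c.maxFreq = n }
}

// newFilterConfig applies the options in order to a zero-valued config.
func newFilterConfig(opts ...Option) filterConfig {
	var c filterConfig
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

// keep reports whether a k-mer seen count times passes both bounds.
func (c filterConfig) keep(count int) bool {
	if count < c.minFreq {
		return false
	}
	if c.maxFreq > 0 && count > c.maxFreq {
		return false
	}
	return true
}

func main() {
	cfg := newFilterConfig(WithMinFrequency(2), WithMaxFrequency(5))
	for _, count := range []int{1, 2, 5, 6} {
		fmt.Printf("count=%d keep=%v\n", count, cfg.keep(count))
	}
}
```

With a minimum of 2 and a maximum of 5, counts 2 and 5 pass while 1 and 6 are rejected, matching the `≥ n` and `> n` wording of the option descriptions.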
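
The two-stage pipeline described above can be sketched with two groups of goroutines connected by a channel. This is a generic illustration, not the builder's implementation: `runPipeline` and the `load` and `process` callbacks are hypothetical names standing in for the I/O-bound and CPU-bound stages.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// runPipeline hands partition IDs to nReaders reader goroutines, which pass
// loaded batches to nWorkers worker goroutines. It returns the sum of the
// values returned by process.
func runPipeline(partitions, nReaders, nWorkers int,
	load func(part int) []uint64, process func(batch []uint64) int) int {

	partCh := make(chan int)
	batchCh := make(chan []uint64, nReaders)

	// Stage 1: readers (I/O-bound in the real setting).
	var readers sync.WaitGroup
	for i := 0; i < nReaders; i++ {
		readers.Add(1)
		go func() {
			defer readers.Done()
			for p := range partCh {
				batchCh <- load(p)
			}
		}()
	}

	// Stage 2: workers (CPU-bound in the real setting).
	var workers sync.WaitGroup
	var mu sync.Mutex
	total := 0
	for i := 0; i < nWorkers; i++ {
		workers.Add(1)
		go func() {
			defer workers.Done()
			for batch := range batchCh {
				n := process(batch)
				mu.Lock()
				total += n
				mu.Unlock()
			}
		}()
	}

	for p := 0; p < partitions; p++ {
		partCh <- p
	}
	close(partCh)
	readers.Wait() // all batches have been sent
	close(batchCh)
	workers.Wait() // all batches have been processed
	return total
}

func main() {
	load := func(part int) []uint64 { return []uint64{uint64(part), uint64(part)} }
	process := func(batch []uint64) int { return len(batch) }
	// Two readers, one worker per core, as in the description above.
	fmt.Println(runPipeline(16, 2, runtime.NumCPU(), load, process))
}
```

Closing the channels in stage order (partitions first, then batches) lets each stage drain fully before the next one shuts down.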
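
To make the thread-safety point concrete, here is a minimal sketch of per-partition writers that each carry their own mutex, so concurrent callers only contend when they hit the same partition. All names are hypothetical, an in-memory `bytes.Buffer` stands in for a `.skm` file, and routing by `hash % partitions` is an assumption of this example.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// partitionWriter guards one partition's output with its own mutex.
type partitionWriter struct {
	mu  sync.Mutex
	buf bytes.Buffer // stand-in for the on-disk partition file
	n   int          // number of records written
}

func (w *partitionWriter) write(record []byte) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.buf.Write(record)
	w.buf.WriteByte('\n')
	w.n++
}

// partitionSet routes each record to a writer chosen from a hash value.
type partitionSet struct {
	writers []partitionWriter
}

func newPartitionSet(n int) *partitionSet {
	return &partitionSet{writers: make([]partitionWriter, n)}
}

func (s *partitionSet) add(hash uint64, record []byte) {
	s.writers[hash%uint64(len(s.writers))].write(record)
}

// total returns the number of records across all partitions.
func (s *partitionSet) total() int {
	sum := 0
	for i := range s.writers {
		s.writers[i].mu.Lock()
		sum += s.writers[i].n
		s.writers[i].mu.Unlock()
	}
	return sum
}

// fillConcurrently starts `goroutines` producers that each add `each` records.
func fillConcurrently(s *partitionSet, goroutines, each int) {
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := 0; i < each; i++ {
				s.add(uint64(g*each+i), []byte("ACGT"))
			}
		}(g)
	}
	wg.Wait()
}

func main() {
	s := newPartitionSet(4)
	fillConcurrently(s, 8, 100)
	fmt.Println(s.total()) // 800: no record is lost under concurrent writes
}
```

One lock per partition, rather than one global lock, keeps contention proportional to how often two callers target the same file.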

## Workflow

1. **Build Phase**:
   - Input sequences → super-kmers extracted via minimizer-based partitioning.
   - Super-kmers written to `.build/set_*/part_*.skm`.

2. **Finalization (`Close()`)**:
   - `.skm` files loaded → canonical k-mers extracted.
   - K-mers sorted, counted (frequency spectrum), and filtered per config.
   - Final `.kdi` files written, along with `spectrum.bin` and, optionally, `top_kmers.csv`.
   - Metadata (`metadata.toml`) generated; `.build/` cleaned.

3. **Append Mode**:
   `AppendKmerSetGroupBuilder()` extends an existing group, inheriting its parameters and appending new sets.
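
The "canonical k-mers extracted" step of finalization can be pictured with a common textbook scheme: encode each base in 2 bits and keep the numerically smaller of a k-mer and its reverse complement. The encoding (A=0, C=1, G=2, T=3), the skipping of non-ACGT bases, and the function names are assumptions of this sketch; the text above does not specify how `obikmer` encodes k-mers.

```go
package main

import "fmt"

// encodeBase maps a nucleotide to 2 bits (A=0, C=1, G=2, T=3).
func encodeBase(b byte) (uint64, bool) {
	switch b {
	case 'A', 'a':
		return 0, true
	case 'C', 'c':
		return 1, true
	case 'G', 'g':
		return 2, true
	case 'T', 't':
		return 3, true
	}
	return 0, false
}

// canonicalKmers returns, for every window of length k, the smaller of the
// forward and reverse-complement encodings. Windows containing a non-ACGT
// byte are skipped.
func canonicalKmers(seq string, k int) []uint64 {
	if k < 1 || k > 31 || len(seq) < k {
		return nil
	}
	mask := uint64(1)<<(2*uint(k)) - 1 // keeps the lowest 2k bits
	shift := 2 * uint(k-1)             // position of the newest base in rev
	var fwd, rev uint64
	valid := 0 // consecutive valid bases seen so far
	out := make([]uint64, 0, len(seq)-k+1)
	for i := 0; i < len(seq); i++ {
		code, ok := encodeBase(seq[i])
		if !ok {
			valid, fwd, rev = 0, 0, 0
			continue
		}
		fwd = (fwd<<2 | code) & mask    // append base on the right
		rev = rev>>2 | (3-code)<<shift // prepend its complement on the left
		valid++
		if valid >= k {
			if fwd < rev {
				out = append(out, fwd)
			} else {
				out = append(out, rev)
			}
		}
	}
	return out
}

func main() {
	fmt.Println(canonicalKmers("ACGT", 4))                            // [27]
	fmt.Println(canonicalKmers("AAAA", 4), canonicalKmers("TTTT", 4)) // [0] [0]
}
```

Because `AAAA` and `TTTT` are reverse complements, both map to the same canonical value, so a k-mer and its opposite-strand copy are counted together.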
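
The sort, count, and filter step of finalization amounts to a run-length pass over a sorted slice. The sketch below is one way to write it, with assumptions: `countSorted` is a hypothetical name, the spectrum is recorded before filtering, and `maxFreq == 0` is taken to mean "no upper bound". It also reuses the input's backing array for the output, in the spirit of the slice reuse mentioned under Design Highlights.

```go
package main

import (
	"fmt"
	"sort"
)

// countSorted sorts kmers in place, then walks runs of equal values.
// It returns the distinct k-mers whose count lies in [minFreq, maxFreq]
// and the frequency spectrum (count -> number of distinct k-mers).
func countSorted(kmers []uint64, minFreq, maxFreq int) ([]uint64, map[int]int) {
	sort.Slice(kmers, func(i, j int) bool { return kmers[i] < kmers[j] })
	spectrum := make(map[int]int)
	kept := kmers[:0] // output overwrites already-processed input slots
	for i := 0; i < len(kmers); {
		j := i
		for j < len(kmers) && kmers[j] == kmers[i] {
			j++
		}
		count := j - i // length of the run of equal k-mers
		spectrum[count]++
		if count >= minFreq && (maxFreq == 0 || count <= maxFreq) {
			kept = append(kept, kmers[i])
		}
		i = j
	}
	return kept, spectrum
}

func main() {
	kmers := []uint64{7, 3, 7, 1, 3, 7}
	kept, spectrum := countSorted(kmers, 2, 0)
	fmt.Println(kept)     // [3 7]
	fmt.Println(spectrum) // map[1:1 2:1 3:1]
}
```

Here k-mer `1` occurs once and is dropped by the minimum frequency of 2, while the spectrum still records one k-mer at each of the counts 1, 2, and 3.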

## Output Artifacts

- `.kdi`: Sorted, deduplicated (and optionally filtered) k-mers.
- `spectrum.bin`: Per-set frequency spectrum (`count → #k-mers`).
- `top_kmers.csv` (optional): Top *N* k-mers per set with counts.
- `metadata.toml`: Global and per-set metadata (k, m, partitions, counts).

## Design Highlights

- **Memory-efficient**: Streams large `.skm` files; reuses slices to minimize GC pressure.
- **Scalable**: Parallel finalization scales with CPU cores and I/O bandwidth.
- **Robust error handling**: Early termination on first failure; cleanup of partial state.
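
One standard-library way to get "early termination on first failure" across parallel partition workers is a shared cancellable context plus `sync.Once` to record the first error. The sketch below is illustrative only: `processAll` and `failAt` are hypothetical names, and the cleanup of partial state (for example removing a build directory) is left to the caller.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// processAll runs work(ctx, p) for each partition on nWorkers goroutines.
// The first error cancels the shared context, so no further partitions are
// started, and that first error is returned.
func processAll(partitions, nWorkers int, work func(ctx context.Context, part int) error) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	jobs := make(chan int)
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				if ctx.Err() != nil {
					continue // a failure occurred: drain without working
				}
				if err := work(ctx, p); err != nil {
					once.Do(func() {
						firstErr = err
						cancel()
					})
				}
			}
		}()
	}

feed:
	for p := 0; p < partitions; p++ {
		select {
		case jobs <- p:
		case <-ctx.Done():
			break feed // stop feeding after the first failure
		}
	}
	close(jobs)
	wg.Wait()
	return firstErr
}

// failAt returns a demo work function that fails on partition bad.
func failAt(bad int) func(context.Context, int) error {
	return func(_ context.Context, part int) error {
		if part == bad {
			return errors.New("partition failed")
		}
		return nil
	}
}

func main() {
	fmt.Println(processAll(100, 4, failAt(10))) // partition failed
	fmt.Println(processAll(100, 4, failAt(-1))) // <nil>
}
```

A caller that receives a non-nil error could then remove any temporary output before returning, which corresponds to the "cleanup of partial state" noted above.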