mirror of
https://github.com/metabarcoding/obitools4.git
# `obikmer` K-mer Set Group Builder: Functional Overview

The `KmerSetGroupBuilder` enables scalable construction of k-mer indexes from biological sequences, supporting both new and incremental (append) workflows. It operates in two phases: **collection** of super-kmers into partitioned temporary files (`.skm`), and **finalization**, where partitions are processed in parallel into final k-mer indexes (`.kdi`).

## Core Features

- **K-mer & Minimizer Configuration**:
  Supports `k ∈ [2,31]`; auto-computes optimal minimizer size (`m ≈ k/2.5`) and partition count (up to `4^m`, capped at 4096).

- **Functional Options for Filtering**:
  - `WithMinFrequency(n)`: Keep only k-mers with frequency ≥ *n* (enables deduplication).
  - `WithMaxFrequency(n)`: Discard k-mers with frequency > *n*.
  - `WithEntropyFilter(threshold, levelMax)`: Remove low-complexity k-mers (entropy ≤ threshold).
  - `WithSaveFreqKmers(n)`: Save top-*n* most frequent k-mers per set to `top_kmers.csv`.

- **Concurrent & Pipeline-Aware Processing**:
  Uses a two-stage pipeline: *I/O-bound readers* (2–4 goroutines) feed k-mers to *CPU-bound workers*, one per core, maximizing throughput.

- **Partitioned I/O & Thread Safety**:
  Super-kmers are written to per-partition `.skm` files using mutex-protected writers, enabling safe concurrent `AddSequence()` calls.

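The configuration rules above can be illustrated with a minimal Go sketch. It is not the `obikmer` implementation: `builderConfig`, `newConfig`, `defaultMinimizerSize`, and `partitionCount` are hypothetical names, the option bodies are guesses at what `WithMinFrequency` and `WithMaxFrequency` might store, and rounding `k/2.5` up is an assumption (the text only says `m ≈ k/2.5`).

```go
package main

import (
	"fmt"
	"math"
)

// builderConfig is a hypothetical stand-in for the builder's settings.
type builderConfig struct {
	k, m       int
	partitions int
	minFreq    int // keep k-mers seen at least this many times
	maxFreq    int // 0 means "no upper bound" in this sketch
}

// Option follows the functional-options pattern: each option edits the config.
type Option func(*builderConfig)

// WithMinFrequency and WithMaxFrequency reuse the option names from the text;
// their bodies are assumptions made for illustration.
func WithMinFrequency(n int) Option { return func(c *builderConfig) { c.minFreq = n } }
func WithMaxFrequency(n int) Option { return func(c *builderConfig) { c.maxFreq = n } }

// defaultMinimizerSize applies m ≈ k/2.5, rounded up (assumption) and
// clamped to [1, k-1] so the minimizer is always shorter than the k-mer.
func defaultMinimizerSize(k int) int {
	m := int(math.Ceil(float64(k) / 2.5))
	if m < 1 {
		m = 1
	}
	if m >= k {
		m = k - 1
	}
	return m
}

// partitionCount returns min(4^m, 4096); 4^m is computed as 1 << (2*m).
func partitionCount(m int) int {
	const maxPartitions = 4096
	if m >= 6 { // 4^6 == 4096, so any larger m hits the cap
		return maxPartitions
	}
	return 1 << (2 * m)
}

// newConfig validates k, derives m and the partition count, then applies options.
func newConfig(k int, opts ...Option) (builderConfig, error) {
	if k < 2 || k > 31 {
		return builderConfig{}, fmt.Errorf("k must be in [2,31], got %d", k)
	}
	m := defaultMinimizerSize(k)
	cfg := builderConfig{k: k, m: m, partitions: partitionCount(m), minFreq: 1}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg, nil
}

func main() {
	cfg, err := newConfig(31, WithMinFrequency(2), WithMaxFrequency(1000))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Keeping the derived values (`m`, partition count) separate from the user-supplied options makes it easy for an append workflow to reuse the stored parameters unchanged.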
## Workflow

1. **Build Phase** (collection):
   - Super-kmers are extracted from the input sequences via minimizer-based partitioning.
   - Super-kmers are written to `.build/set_*/part_*.skm`.

2. **Finalization (`Close()`)**:
   - `.skm` files are loaded and canonical k-mers are extracted.
   - K-mers are sorted, counted (frequency spectrum), and filtered per config.
   - Final `.kdi` files are written, along with `spectrum.bin` and, optionally, `top_kmers.csv`.
   - Metadata (`metadata.toml`) is generated and `.build/` is cleaned up.

3. **Append Mode**:
   `AppendKmerSetGroupBuilder()` extends an existing group, inheriting its parameters and appending new sets.

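The sort, count, and filter step of finalization can be sketched for a single partition as follows. This is an illustrative in-memory version under stated assumptions: `finalizePartition` is a hypothetical name, k-mers are assumed to be already encoded as `uint64` values, the spectrum is assumed to be recorded before filtering, and a `maxFreq` of 0 is used here to mean "no upper bound".

```go
package main

import (
	"fmt"
	"sort"
)

// finalizePartition sorts the k-mers of one partition, collapses runs of equal
// values into counts, records the frequency spectrum (count -> number of
// distinct k-mers), and keeps the k-mers whose count lies in [minFreq, maxFreq].
// The input slice is sorted in place.
func finalizePartition(kmers []uint64, minFreq, maxFreq int) (kept []uint64, spectrum map[int]int) {
	sort.Slice(kmers, func(i, j int) bool { return kmers[i] < kmers[j] })
	spectrum = make(map[int]int)

	for i := 0; i < len(kmers); {
		// Find the end of the run of identical k-mers starting at i.
		j := i
		for j < len(kmers) && kmers[j] == kmers[i] {
			j++
		}
		count := j - i
		spectrum[count]++

		if count >= minFreq && (maxFreq == 0 || count <= maxFreq) {
			kept = append(kept, kmers[i]) // output stays sorted and deduplicated
		}
		i = j
	}
	return kept, spectrum
}

func main() {
	kmers := []uint64{7, 3, 7, 1, 3, 7, 9}
	kept, spectrum := finalizePartition(kmers, 2, 0)
	fmt.Println("kept:", kept)
	fmt.Println("spectrum:", spectrum)
}
```

Because each partition holds a disjoint subset of k-mers, a step like this can run on many partitions at once without coordination between them.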
## Output Artifacts

- `.kdi`: Sorted, deduplicated (and optionally filtered) k-mers.
- `spectrum.bin`: Per-set frequency spectrum (`count → #k-mers`).
- `top_kmers.csv` (optional): Top *N* k-mers per set with counts.
- `metadata.toml`: Global and per-set metadata (k, m, partitions, counts).

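The bound `k ≤ 31` is consistent with packing each k-mer into a single 64-bit word at two bits per base (62 bits for `k = 31`), which makes sorting and deduplication plain integer operations. The sketch below shows one common way to compute a canonical code, taken as the smaller of a k-mer and its reverse complement. The bit layout (`A=0, C=1, G=2, T=3`) and the names `encodeBase` and `canonicalKmer` are assumptions; the on-disk `.kdi` format is not described here.

```go
package main

import "fmt"

// encodeBase maps a nucleotide to a 2-bit code (A=0, C=1, G=2, T=3).
// ok is false for any other character.
func encodeBase(b byte) (code uint64, ok bool) {
	switch b {
	case 'A', 'a':
		return 0, true
	case 'C', 'c':
		return 1, true
	case 'G', 'g':
		return 2, true
	case 'T', 't':
		return 3, true
	}
	return 0, false
}

// canonicalKmer encodes seq (length 1..31) and its reverse complement in a
// single pass and returns the numerically smaller of the two codes.
func canonicalKmer(seq string) (uint64, bool) {
	k := len(seq)
	if k < 1 || k > 31 {
		return 0, false
	}
	var fwd, rev uint64
	for i := 0; i < k; i++ {
		code, ok := encodeBase(seq[i])
		if !ok {
			return 0, false
		}
		fwd = fwd<<2 | code
		// With this layout the complement of a base is 3-code; the base at
		// position i of seq lands at position i from the low end of rev.
		rev |= (3 - code) << (2 * uint(i))
	}
	if rev < fwd {
		return rev, true
	}
	return fwd, true
}

func main() {
	for _, s := range []string{"ACG", "CGT", "AAAA"} {
		code, ok := canonicalKmer(s)
		fmt.Println(s, code, ok)
	}
}
```

A k-mer and its reverse complement map to the same code, so both strands of a sequence contribute to the same entry when counts are accumulated.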
## Design Highlights

- **Memory-efficient**: Streams large `.skm` files; reuses slices to minimize GC pressure.
- **Scalable**: Parallel finalization scales with CPU cores and I/O bandwidth.
- **Robust error handling**: Early termination on first failure; cleanup of partial state.
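Parallel finalization with early termination on the first failure can be sketched with the standard library alone. This is one possible arrangement, not the project's code: `processAll` and `errPartition` are hypothetical names, the per-partition work is passed in as a callback, and cleanup of partial state is left to the caller.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"runtime"
	"sync"
)

// errPartition is a placeholder failure used by the demo in main.
var errPartition = errors.New("partition failed")

// processAll runs process on partitions 0..nPartitions-1 with a fixed pool of
// workers. After the first error no further partitions are started, and that
// first error is returned.
func processAll(nPartitions, nWorkers int, process func(part int) error) error {
	if nWorkers < 1 {
		nWorkers = 1
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	jobs := make(chan int)
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for part := range jobs {
				if ctx.Err() != nil {
					continue // a failure was already recorded; skip the work
				}
				if err := process(part); err != nil {
					once.Do(func() {
						firstErr = err
						cancel() // tell the feeder and other workers to stop
					})
				}
			}
		}()
	}

feed:
	for p := 0; p < nPartitions; p++ {
		select {
		case jobs <- p:
		case <-ctx.Done():
			break feed
		}
	}
	close(jobs)
	wg.Wait()
	return firstErr
}

func main() {
	err := processAll(16, runtime.NumCPU(), func(part int) error {
		if part == 5 {
			return fmt.Errorf("partition %d: %w", part, errPartition)
		}
		return nil
	})
	fmt.Println("first error:", err)
}
```

Partitions that are already running when the failure occurs are allowed to finish; only the scheduling of new work stops.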