Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00
# `obikmer` K-mer Set Group Builder — Functional Overview
The `KmerSetGroupBuilder` enables scalable construction of k-mer indexes from biological sequences, supporting both new and incremental (append) workflows. It operates in two phases: **collection** of super-kmers into partitioned temporary files (`.skm`), and **finalization**, where partitions are processed in parallel into final k-mer indexes (`.kdi`).
## Core Features
- **K-mer & Minimizer Configuration**:
  Supports `k ∈ [2,31]`; auto-computes optimal minimizer size (`m ≈ k/2.5`) and partition count (up to `4^m`, capped at 4096).
- **Functional Options for Filtering**:
  - `WithMinFrequency(n)`: Keep only k-mers with frequency ≥ *n* (enables deduplication).
  - `WithMaxFrequency(n)`: Discard k-mers with frequency > *n*.
  - `WithEntropyFilter(threshold, levelMax)`: Remove low-complexity k-mers (entropy ≤ threshold).
  - `WithSaveFreqKmers(n)`: Save top-*n* most frequent k-mers per set to `top_kmers.csv`.
- **Concurrent & Pipeline-Aware Processing**:
  Uses a two-stage pipeline: *I/O-bound readers* (24 goroutines) feed k-mers to *CPU-bound workers*, one per core, maximizing throughput.
- **Partitioned I/O & Thread Safety**:
  Super-kmers are written to per-partition `.skm` files using mutex-protected writers, enabling safe concurrent `AddSequence()` calls.
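
The configuration rules and frequency options listed above can be illustrated with a minimal, self-contained sketch. It is not the `obikmer` implementation: `builderConfig`, `Option`, `newConfig`, and `keep` are hypothetical names, the two option functions only mirror the documented names with assumed `int` parameters, and rounding `k/2.5` to the nearest integer is an assumption about how `m ≈ k/2.5` is resolved.

```go
package main

import (
	"fmt"
	"math"
)

// builderConfig is a hypothetical stand-in for the builder's settings.
type builderConfig struct {
	k, m, partitions int
	minFreq, maxFreq int // 0 means "no bound"
}

// Option mutates a builderConfig (the functional options pattern).
type Option func(*builderConfig)

// WithMinFrequency and WithMaxFrequency mirror the documented option names;
// their real signatures are assumed here.
func WithMinFrequency(n int) Option { return func(c *builderConfig) { c.minFreq = n } }
func WithMaxFrequency(n int) Option { return func(c *builderConfig) { c.maxFreq = n } }

// newConfig validates k, derives m and the partition count, then applies opts.
func newConfig(k int, opts ...Option) (*builderConfig, error) {
	if k < 2 || k > 31 {
		return nil, fmt.Errorf("k=%d outside [2,31]", k)
	}
	// Assumed rounding rule for m ≈ k/2.5.
	m := int(math.Round(float64(k) / 2.5))
	// Partition count is 4^m, capped at 4096 (4^6 == 4096).
	partitions := 4096
	if m < 6 {
		partitions = 1 << (2 * m)
	}
	c := &builderConfig{k: k, m: m, partitions: partitions}
	for _, opt := range opts {
		opt(c)
	}
	return c, nil
}

// keep reports whether a k-mer seen count times passes the frequency bounds.
func (c *builderConfig) keep(count int) bool {
	if c.minFreq > 0 && count < c.minFreq {
		return false
	}
	return c.maxFreq == 0 || count <= c.maxFreq
}

func main() {
	cfg, err := newConfig(31, WithMinFrequency(2), WithMaxFrequency(10))
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.m, cfg.partitions, cfg.keep(1), cfg.keep(5))
}
```

Under these assumptions, `k = 31` yields `m = 12`, so `4^m` far exceeds the cap and the partition count is 4096; a small `k` such as 5 yields `m = 2` and 16 partitions.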
## Workflow
1. **Build Phase**:
   - Input sequences → super-kmers extracted via minimizer-based partitioning.
   - Super-kmers written to `.build/set_*/part_*.skm`.
2. **Finalization (`Close()`)**:
   - `.skm` files loaded → canonical k-mers extracted.
   - K-mers sorted, counted (frequency spectrum), and filtered per config.
   - Final `.kdi` files written, along with `spectrum.bin` and, optionally, `top_kmers.csv`.
   - Metadata (`metadata.toml`) generated; `.build/` cleaned.
3. **Append Mode**:
   `AppendKmerSetGroupBuilder()` extends an existing group, inheriting its parameters and appending new sets.
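
To make the build phase more concrete, here is a simplified sketch of minimizer-based super-kmer extraction. A super-kmer is a maximal run of overlapping k-mers that share the same minimizer, so all of its k-mers land in the same partition. Everything in this sketch is an assumption made for illustration: the `SuperKmer` type, the function names, the choice of the lexicographically smallest m-mer as minimizer, the FNV-1a hash used to pick a partition, and the restriction to the forward strand (the real builder works with canonical k-mers).

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// SuperKmer is a hypothetical record: a run of overlapping k-mers sharing
// one minimizer, plus the partition that minimizer is routed to.
type SuperKmer struct {
	Seq       string
	Minimizer string
	Partition int
}

// minimizerOf returns the lexicographically smallest m-mer inside kmer.
func minimizerOf(kmer string, m int) string {
	best := kmer[:m]
	for i := 1; i+m <= len(kmer); i++ {
		if s := kmer[i : i+m]; s < best {
			best = s
		}
	}
	return best
}

// partitionOf maps a minimizer to a partition index with FNV-1a (assumed).
func partitionOf(minimizer string, partitions int) int {
	h := fnv.New32a()
	h.Write([]byte(minimizer))
	return int(h.Sum32() % uint32(partitions))
}

// ExtractSuperKmers splits seq into super-kmers for the given k and m.
func ExtractSuperKmers(seq string, k, m, partitions int) []SuperKmer {
	var out []SuperKmer
	if len(seq) < k || m < 1 || m > k || partitions < 1 {
		return out
	}
	start := 0 // position of the first k-mer of the current run
	current := minimizerOf(seq[:k], m)
	for i := 1; i+k <= len(seq); i++ {
		mz := minimizerOf(seq[i:i+k], m)
		if mz != current {
			// The run covers k-mers start..i-1, i.e. bases start..i-2+k.
			out = append(out, SuperKmer{seq[start : i-1+k], current, partitionOf(current, partitions)})
			start, current = i, mz
		}
	}
	// The last run extends to the end of the sequence.
	out = append(out, SuperKmer{seq[start:], current, partitionOf(current, partitions)})
	return out
}

func main() {
	for _, sk := range ExtractSuperKmers("ACGTACGT", 4, 2, 16) {
		fmt.Printf("%-8s minimizer=%s partition=%d\n", sk.Seq, sk.Minimizer, sk.Partition)
	}
}
```

Because consecutive k-mers overlap by `k-1` bases, storing one super-kmer instead of its individual k-mers keeps the temporary files compact while preserving every k-mer for the finalization step.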
## Output Artifacts
- `.kdi`: Sorted, deduplicated (and optionally filtered) k-mers.
- `spectrum.bin`: Per-set frequency spectrum (`count → #k-mers`).
- `top_kmers.csv` (optional): Top *N* k-mers per set with counts.
- `metadata.toml`: Global and per-set metadata (k, m, partitions, counts).
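
The relationship between the sorted k-mer list, the frequency spectrum, and the top-*N* table can be sketched as follows. This is an illustrative model only: the `KmerCount` type and the function names are hypothetical, k-mers are represented as `uint64` values (an assumption, though 2 bits per base does fit in 64 bits for `k ≤ 31`), and the on-disk layouts of `.kdi`, `spectrum.bin`, and `top_kmers.csv` are not modeled.

```go
package main

import (
	"fmt"
	"sort"
)

// KmerCount pairs an encoded k-mer with its number of occurrences.
type KmerCount struct {
	Kmer  uint64
	Count int
}

// countSorted sorts kmers in place and collapses runs of equal values
// into (k-mer, count) pairs, which also deduplicates them.
func countSorted(kmers []uint64) []KmerCount {
	sort.Slice(kmers, func(i, j int) bool { return kmers[i] < kmers[j] })
	var out []KmerCount
	for i := 0; i < len(kmers); {
		j := i
		for j < len(kmers) && kmers[j] == kmers[i] {
			j++
		}
		out = append(out, KmerCount{Kmer: kmers[i], Count: j - i})
		i = j
	}
	return out
}

// spectrum maps each count to the number of distinct k-mers with that count.
func spectrum(counts []KmerCount) map[int]int {
	s := make(map[int]int)
	for _, kc := range counts {
		s[kc.Count]++
	}
	return s
}

// filterByFrequency keeps k-mers with minFreq <= count <= maxFreq;
// maxFreq == 0 means "no upper bound" in this sketch.
func filterByFrequency(counts []KmerCount, minFreq, maxFreq int) []uint64 {
	var kept []uint64
	for _, kc := range counts {
		if kc.Count >= minFreq && (maxFreq == 0 || kc.Count <= maxFreq) {
			kept = append(kept, kc.Kmer)
		}
	}
	return kept
}

// topN returns the n most frequent k-mers, ties broken by k-mer value.
func topN(counts []KmerCount, n int) []KmerCount {
	sorted := append([]KmerCount(nil), counts...) // leave the input untouched
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].Count != sorted[j].Count {
			return sorted[i].Count > sorted[j].Count
		}
		return sorted[i].Kmer < sorted[j].Kmer
	})
	if n < len(sorted) {
		sorted = sorted[:n]
	}
	return sorted
}

func main() {
	counts := countSorted([]uint64{7, 3, 7, 1, 3, 7})
	fmt.Println(counts)
	fmt.Println(spectrum(counts))
	fmt.Println(filterByFrequency(counts, 2, 0))
	fmt.Println(topN(counts, 2))
}
```

Since the input is sorted before counting, the filtered output stays sorted without a second pass, which matches the description of `.kdi` as a sorted, deduplicated list.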
## Design Highlights
- **Memory-efficient**: Streams large `.skm` files; reuses slices to minimize GC pressure.
- **Scalable**: Parallel finalization scales with CPU cores and I/O bandwidth.
- **Robust error handling**: Early termination on first failure; cleanup of partial state.
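
One common way to combine parallel per-partition processing with early termination on the first failure is a worker pool guarded by a cancellable context. The sketch below uses only the Go standard library; `processPartitions`, `errCorrupt`, and the callback signature are hypothetical, and nothing here is taken from the actual `obikmer` code. Cleanup of partial state is left out.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"runtime"
	"sync"
)

// errCorrupt is a demo error used to simulate a failing partition.
var errCorrupt = errors.New("corrupt partition")

// processPartitions runs process(p) for p in [0, n) on a pool of workers.
// After the first failure no new partition is started, and that first
// error is returned once all workers have finished.
func processPartitions(n, workers int, process func(p int) error) error {
	if workers < 1 {
		workers = 1
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	jobs := make(chan int)
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				if ctx.Err() != nil {
					continue // a failure was recorded: skip remaining work
				}
				if err := process(p); err != nil {
					once.Do(func() {
						firstErr = fmt.Errorf("partition %d: %w", p, err)
						cancel()
					})
				}
			}
		}()
	}

feed:
	for p := 0; p < n; p++ {
		select {
		case jobs <- p:
		case <-ctx.Done():
			break feed // stop handing out partitions after a failure
		}
	}
	close(jobs)
	wg.Wait()
	return firstErr
}

func main() {
	err := processPartitions(8, runtime.NumCPU(), func(p int) error {
		if p == 3 {
			return errCorrupt
		}
		return nil
	})
	fmt.Println(err, errors.Is(err, errCorrupt))
}
```

Partitions already in flight when the failure occurs still run to completion in this sketch; a callback that needs to stop sooner would have to observe the cancellation itself.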