obikmer K-mer Set Group Builder — Functional Overview
The KmerSetGroupBuilder enables scalable construction of k-mer indexes from biological sequences, supporting both new and incremental (append) workflows. It operates in two phases: collection of super-kmers into partitioned temporary files (.skm), and finalization, where partitions are processed in parallel into final k-mer indexes (.kdi).
Core Features
- K-mer & Minimizer Configuration: Supports k ∈ [2, 31]; auto-computes the optimal minimizer size (m ≈ k/2.5) and partition count (up to 4^m, capped at 4096).
- Functional Options for Filtering:
  - WithMinFrequency(n): Keep only k-mers with frequency ≥ n (enables deduplication).
  - WithMaxFrequency(n): Discard k-mers with frequency > n.
  - WithEntropyFilter(threshold, levelMax): Remove low-complexity k-mers (entropy ≤ threshold).
  - WithSaveFreqKmers(n): Save the top-n most frequent k-mers per set to top_kmers.csv.
- Concurrent & Pipeline-Aware Processing: Uses a two-stage pipeline: I/O-bound readers (2–4 goroutines) feed k-mers to CPU-bound workers, one per core, maximizing throughput.
- Partitioned I/O & Thread Safety: Super-kmers are written to per-partition .skm files using mutex-protected writers, enabling safe concurrent AddSequence() calls.
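The configuration rules above can be sketched in a few lines of Go. This is a minimal illustration, not the package's code: the names minimizerSize and partitionCount are hypothetical, and rounding k/2.5 up and clamping m to [1, k-1] are assumptions, since the text only states m ≈ k/2.5.

```go
package main

import (
	"fmt"
	"math"
)

// maxPartitions is the cap on the partition count stated in the text.
const maxPartitions = 4096

// minimizerSize returns m ≈ k/2.5.
// Assumption: the ratio is rounded up and clamped to [1, k-1].
func minimizerSize(k int) int {
	m := int(math.Ceil(float64(k) / 2.5))
	if m < 1 {
		m = 1
	}
	if m > k-1 {
		m = k - 1
	}
	return m
}

// partitionCount returns min(4^m, maxPartitions).
func partitionCount(m int) int {
	// 4^6 = 4096, so any m >= 6 already reaches the cap.
	if m >= 6 {
		return maxPartitions
	}
	return 1 << (2 * m) // 4^m computed as 2^(2m)
}

func main() {
	for _, k := range []int{5, 15, 31} {
		m := minimizerSize(k)
		fmt.Printf("k=%d m=%d partitions=%d\n", k, m, partitionCount(m))
	}
}
```

Under these assumptions the cap is reached as soon as m ≥ 6, so only small k values yield fewer than 4096 partitions.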
Workflow
- Build Phase:
  - Input sequences → super-kmers extracted via minimizer-based partitioning.
  - Super-kmers written to .build/set_*/part_*.skm.
- Finalization (Close()):
  - .skm files loaded → canonical k-mers extracted.
  - K-mers sorted, counted (frequency spectrum), and filtered per config.
  - Final .kdi files written, along with spectrum.bin and optionally top_kmers.csv.
  - Metadata (metadata.toml) generated; .build/ cleaned.
- Append Mode: AppendKmerSetGroupBuilder() extends an existing group, inheriting its parameters and appending new sets.
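The sort, count, and filter steps of finalization can be sketched as follows. This is a minimal in-memory illustration, assuming k-mers are already encoded as uint64 values (2 bits per base is enough for k ≤ 31). The function countAndFilter, its parameters, and the convention that maxFreq = 0 disables the upper bound are hypothetical and not taken from the package.

```go
package main

import (
	"fmt"
	"sort"
)

// countAndFilter sorts kmers in place, counts each run of equal values,
// and keeps one copy of every k-mer whose frequency lies in
// [minFreq, maxFreq]. maxFreq == 0 means "no upper bound" (assumption).
// The input slice is reused for the result, so the caller must not
// rely on its previous contents afterwards.
func countAndFilter(kmers []uint64, minFreq, maxFreq int) []uint64 {
	sort.Slice(kmers, func(i, j int) bool { return kmers[i] < kmers[j] })

	kept := kmers[:0] // reuse the backing array to avoid a new allocation
	for i := 0; i < len(kmers); {
		// Find the end of the run of identical k-mers starting at i.
		j := i + 1
		for j < len(kmers) && kmers[j] == kmers[i] {
			j++
		}
		count := j - i
		if count >= minFreq && (maxFreq == 0 || count <= maxFreq) {
			kept = append(kept, kmers[i])
		}
		i = j
	}
	return kept
}

func main() {
	kmers := []uint64{7, 3, 7, 1, 3, 7, 9}
	// Keep k-mers seen at least twice, with no upper bound.
	fmt.Println(countAndFilter(kmers, 2, 0)) // prints [3 7]
}
```

Writing the kept values back into the input slice is safe here because the write position never passes the start of the run currently being read.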
Output Artifacts
- .kdi: Sorted, deduplicated (and optionally filtered) k-mers.
- spectrum.bin: Per-set frequency spectrum (count → #k-mers).
- top_kmers.csv (optional): Top N k-mers per set with counts.
- metadata.toml: Global and per-set metadata (k, m, partitions, counts).
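To make the spectrum entry concrete: a frequency spectrum maps each observed count to the number of distinct k-mers that occur exactly that many times. The sketch below computes one from an already sorted k-mer list. The helper frequencySpectrum is hypothetical, and the example says nothing about the binary layout of spectrum.bin, which the text does not describe.

```go
package main

import "fmt"

// frequencySpectrum expects kmers sorted in ascending order and returns
// a map from count to the number of distinct k-mers with that count.
func frequencySpectrum(sorted []uint64) map[int]int {
	spectrum := make(map[int]int)
	for i := 0; i < len(sorted); {
		// Measure the run of identical k-mers starting at i.
		j := i + 1
		for j < len(sorted) && sorted[j] == sorted[i] {
			j++
		}
		spectrum[j-i]++
		i = j
	}
	return spectrum
}

func main() {
	// k-mers 1 and 9 occur once, 3 occurs twice, 7 occurs three times.
	sorted := []uint64{1, 3, 3, 7, 7, 7, 9}
	fmt.Println(frequencySpectrum(sorted)) // prints map[1:2 2:1 3:1]
}
```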
Design Highlights
- Memory-efficient: Streams large .skm files; reuses slices to minimize GC pressure.
- Scalable: Parallel finalization scales with CPU cores and I/O bandwidth.
- Robust error handling: Early termination on first failure; cleanup of partial state.