mirror of https://github.com/metabarcoding/obitools4.git synced 2026-04-30 12:00:39 +00:00

`obikmer` K-mer Set Group Builder — Functional Overview

The KmerSetGroupBuilder enables scalable construction of k-mer indexes from biological sequences, supporting both new and incremental (append) workflows. It operates in two phases: collection of super-kmers into partitioned temporary files (.skm), and finalization, where partitions are processed in parallel into final k-mer indexes (.kdi).

Core Features

K-mer & Minimizer Configuration:
Supports k ∈ [2,31]; auto-computes optimal minimizer size (m ≈ k/2.5) and partition count (up to 4^m, capped at 4096).
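As a rough illustration of the arithmetic above, the following sketch derives a minimizer size and a partition count from k. The helper names (minimizerSize, partitionCount) and the rounding rule are assumptions made for this example; only the m ≈ k/2.5 ratio and the 4096 cap come from the description.

```go
package main

import (
	"fmt"
	"math"
)

// minimizerSize is a hypothetical helper: it rounds k/2.5 to the nearest
// integer and clamps the result to [1, k-1]. The exact rounding used by
// the builder is an assumption here.
func minimizerSize(k int) int {
	m := int(math.Round(float64(k) / 2.5))
	if m < 1 {
		m = 1
	}
	if m > k-1 {
		m = k - 1
	}
	return m
}

// partitionCount returns min(4^m, 4096): one partition per possible
// minimizer while that stays at or below the cap.
func partitionCount(m int) int {
	const maxPartitions = 4096
	if m >= 6 { // 4^6 == 4096, so any larger m hits the cap
		return maxPartitions
	}
	return 1 << (2 * m) // 4^m
}

func main() {
	for _, k := range []int{2, 11, 21, 31} {
		m := minimizerSize(k)
		fmt.Printf("k=%d m=%d partitions=%d\n", k, m, partitionCount(m))
	}
}
```

Under these assumptions the cap applies as soon as m reaches 6, which with this rounding happens from k = 14 upward.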
Functional Options for Filtering:
- WithMinFrequency(n): Keep only k-mers with frequency ≥ n (enables deduplication).
- WithMaxFrequency(n): Discard k-mers with frequency > n.
- WithEntropyFilter(threshold, levelMax): Remove low-complexity k-mers (entropy ≤ threshold).
- WithSaveFreqKmers(n): Save top-n most frequent k-mers per set to top_kmers.csv.
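The options above follow Go's functional-options pattern. The sketch below shows how such options could be wired up; the option names are taken from the list, but their parameter types, the builderConfig struct, and the newConfig and keepByFrequency helpers are assumptions for illustration, not the package's actual definitions.

```go
package main

import "fmt"

// builderConfig is a hypothetical stand-in for the builder's settings.
// A zero frequency bound means "no bound".
type builderConfig struct {
	minFreq, maxFreq int
	entropyThreshold float64
	entropyLevelMax  int
	saveTopN         int
}

// BuilderOption mutates a configuration; each With* function returns one.
type BuilderOption func(*builderConfig)

func WithMinFrequency(n int) BuilderOption {
	return func(c *builderConfig) { c.minFreq = n }
}

func WithMaxFrequency(n int) BuilderOption {
	return func(c *builderConfig) { c.maxFreq = n }
}

func WithEntropyFilter(threshold float64, levelMax int) BuilderOption {
	return func(c *builderConfig) {
		c.entropyThreshold = threshold
		c.entropyLevelMax = levelMax
	}
}

func WithSaveFreqKmers(n int) BuilderOption {
	return func(c *builderConfig) { c.saveTopN = n }
}

// newConfig applies the options in order to a zero-valued configuration.
func newConfig(opts ...BuilderOption) builderConfig {
	var c builderConfig
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

// keepByFrequency reports whether a k-mer seen freq times passes the
// frequency bounds: freq >= minFreq and freq <= maxFreq.
func (c builderConfig) keepByFrequency(freq int) bool {
	if c.minFreq > 0 && freq < c.minFreq {
		return false
	}
	if c.maxFreq > 0 && freq > c.maxFreq {
		return false
	}
	return true
}

func main() {
	cfg := newConfig(WithMinFrequency(2), WithMaxFrequency(10), WithSaveFreqKmers(100))
	for _, f := range []int{1, 2, 10, 11} {
		fmt.Printf("freq=%d keep=%v\n", f, cfg.keepByFrequency(f))
	}
}
```

With this pattern a caller sets only the filters it needs, and unset options keep their zero value (no filtering).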
Concurrent & Pipeline-Aware Processing:
Uses a two-stage pipeline: I/O-bound readers (2–4 goroutines) feed k-mers to CPU-bound workers, one per core, maximizing throughput.
Partitioned I/O & Thread Safety:
Super-kmers are written to per-partition .skm files using mutex-protected writers, enabling safe concurrent AddSequence() calls.
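A minimal sketch of mutex-protected per-partition writers is given below. It assumes an in-memory buffer in place of each .skm file and routes records by minimizer value modulo the partition count; the type and function names (partitionWriter, partitionOf, writeConcurrently) are hypothetical, and the real record format is not shown.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// partitionWriter pairs a sink with its own mutex, so goroutines writing to
// different partitions never block each other. The bytes.Buffer stands in
// for an .skm file.
type partitionWriter struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

// write appends one newline-terminated record under the partition's lock.
func (w *partitionWriter) write(record []byte) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.buf.Write(record)
	w.buf.WriteByte('\n')
}

// partitionOf is an assumed routing rule: minimizer value modulo the
// number of partitions.
func partitionOf(minimizer uint64, partitions int) int {
	return int(minimizer % uint64(partitions))
}

func newWriters(n int) []*partitionWriter {
	ws := make([]*partitionWriter, n)
	for i := range ws {
		ws[i] = &partitionWriter{}
	}
	return ws
}

// writeConcurrently starts `producers` goroutines that each emit
// `perProducer` dummy records with distinct minimizer values.
func writeConcurrently(writers []*partitionWriter, producers, perProducer int) {
	var wg sync.WaitGroup
	for p := 0; p < producers; p++ {
		wg.Add(1)
		go func(p int) {
			defer wg.Done()
			for i := 0; i < perProducer; i++ {
				minimizer := uint64(p*perProducer + i)
				w := writers[partitionOf(minimizer, len(writers))]
				w.write([]byte("ACGT")) // placeholder super-kmer payload
			}
		}(p)
	}
	wg.Wait()
}

func main() {
	writers := newWriters(4)
	writeConcurrently(writers, 8, 100)
	for i, w := range writers {
		n := bytes.Count(w.buf.Bytes(), []byte("\n"))
		fmt.Printf("partition %d: %d records\n", i, n)
	}
}
```

One lock per partition keeps contention low: two goroutines only wait on each other when they write to the same partition at the same time.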

Workflow

Build Phase:
- Input sequences → super-kmers extracted via minimizer-based partitioning.
- Super-kmers written to .build/set_*/part_*.skm.
Finalization (Close()):
- .skm files loaded → canonical k-mers extracted.
- K-mers sorted, counted (frequency spectrum), and filtered per config.
- Final .kdi files written, along with spectrum.bin and, optionally, top_kmers.csv.
- Metadata (metadata.toml) generated; .build/ cleaned.
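The core of the finalization steps (canonical k-mer extraction, sorting, counting, and frequency filtering) can be sketched on in-memory data as follows. This example assumes a 2-bit encoding (A=0, C=1, G=2, T=3), under which any k ≤ 31 fits in a uint64; the function names and the in-memory representation are illustrative, and file I/O and the entropy filter are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// encodeBase maps A, C, G, T to 0..3; any other byte is reported as invalid.
func encodeBase(b byte) (uint64, bool) {
	switch b {
	case 'A', 'a':
		return 0, true
	case 'C', 'c':
		return 1, true
	case 'G', 'g':
		return 2, true
	case 'T', 't':
		return 3, true
	}
	return 0, false
}

// canonicalKmers returns, for every window of length k, the smaller of the
// forward and reverse-complement encodings. Windows that contain a
// non-ACGT byte are skipped.
func canonicalKmers(seq string, k int) []uint64 {
	mask := uint64(1)<<(2*uint(k)) - 1
	shift := 2 * uint(k-1)
	var fwd, rev uint64
	var out []uint64
	valid := 0 // number of consecutive valid bases seen so far
	for i := 0; i < len(seq); i++ {
		code, ok := encodeBase(seq[i])
		if !ok {
			valid, fwd, rev = 0, 0, 0
			continue
		}
		fwd = (fwd<<2 | code) & mask
		rev = rev>>2 | (3-code)<<shift // the complement enters at the high end
		valid++
		if valid >= k {
			if rev < fwd {
				out = append(out, rev)
			} else {
				out = append(out, fwd)
			}
		}
	}
	return out
}

type kmerCount struct {
	kmer  uint64
	count int
}

// countKmers sorts the k-mers in place and run-length counts equal values.
func countKmers(kmers []uint64) []kmerCount {
	sort.Slice(kmers, func(i, j int) bool { return kmers[i] < kmers[j] })
	var out []kmerCount
	for _, km := range kmers {
		if n := len(out); n > 0 && out[n-1].kmer == km {
			out[n-1].count++
		} else {
			out = append(out, kmerCount{kmer: km, count: 1})
		}
	}
	return out
}

// spectrum maps a count to the number of distinct k-mers with that count.
func spectrum(counts []kmerCount) map[int]int {
	s := make(map[int]int)
	for _, c := range counts {
		s[c.count]++
	}
	return s
}

// filterByFrequency keeps k-mers with minFreq <= count <= maxFreq;
// a bound of 0 disables that side of the filter.
func filterByFrequency(counts []kmerCount, minFreq, maxFreq int) []uint64 {
	var kept []uint64
	for _, c := range counts {
		if (minFreq == 0 || c.count >= minFreq) && (maxFreq == 0 || c.count <= maxFreq) {
			kept = append(kept, c.kmer)
		}
	}
	return kept
}

func main() {
	counts := countKmers(canonicalKmers("ACGTNACGT", 3))
	fmt.Println("spectrum:", spectrum(counts))
	fmt.Println("kept (freq >= 2):", filterByFrequency(counts, 2, 0))
}
```

Because the k-mers are sorted before counting, the kept values come out already sorted and deduplicated, which matches the description of the .kdi content.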
Append Mode:
AppendKmerSetGroupBuilder() extends an existing group, inheriting its parameters and appending new sets.

Output Artifacts

.kdi: Sorted, deduplicated (and optionally filtered) k-mers.
spectrum.bin: Per-set frequency spectrum (count → #k-mers).
top_kmers.csv (optional): Top N k-mers per set with counts.
metadata.toml: Global and per-set metadata (k, m, partitions, counts).

Design Highlights

Memory-efficient: Streams large .skm files; reuses slices to minimize GC pressure.
Scalable: Parallel finalization scales with CPU cores and I/O bandwidth.
Robust error handling: Early termination on first failure; cleanup of partial state.
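One common way to obtain early termination on first failure in a parallel stage is a shared context that is cancelled by the first worker to fail. The sketch below uses only the standard library; processAll, failOn, and errBadPartition are hypothetical names, and cleanup of partial output is left out.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// errBadPartition is a sentinel error used only by this example.
var errBadPartition = errors.New("bad partition")

// failOn returns a work function that fails for one partition index and
// succeeds for all others (a negative index never fails).
func failOn(bad int) func(ctx context.Context, part int) error {
	return func(ctx context.Context, part int) error {
		if part == bad {
			return errBadPartition
		}
		return nil
	}
}

// processAll runs work for partitions 0..n-1 on a fixed pool of workers.
// The first error cancels the shared context: no further partitions are
// handed out, and the error is returned once all workers have stopped.
func processAll(n, workers int, work func(ctx context.Context, part int) error) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	parts := make(chan int)
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range parts {
				if ctx.Err() != nil {
					continue // already cancelled: drain without working
				}
				if err := work(ctx, p); err != nil {
					once.Do(func() {
						firstErr = err
						cancel()
					})
				}
			}
		}()
	}

feed:
	for p := 0; p < n; p++ {
		select {
		case parts <- p:
		case <-ctx.Done():
			break feed // stop feeding after the first failure
		}
	}
	close(parts)
	wg.Wait()
	return firstErr
}

func main() {
	fmt.Println("with failure:", processAll(16, 4, failOn(3)))
	fmt.Println("without failure:", processAll(16, 4, failOn(-1)))
}
```

A caller would typically remove any partially written output when processAll returns a non-nil error.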
