Files
obitools4/autodoc/docmd/pkg/obikmer/kmer_set_builder.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

2.5 KiB
Raw Blame History

obikmer K-mer Set Group Builder — Functional Overview

The KmerSetGroupBuilder enables scalable construction of k-mer indexes from biological sequences, supporting both new and incremental (append) workflows. It operates in two phases: collection of super-kmers into partitioned temporary files (.skm), and finalization, where partitions are processed in parallel into final k-mer indexes (.kdi).

Core Features

  • K-mer & Minimizer Configuration:
    Supports k ∈ [2,31]; auto-computes optimal minimizer size (m ≈ k/2.5) and partition count (up to 4^m, capped at 4096).

  • Functional Options for Filtering:

    • WithMinFrequency(n): Keep only k-mers with frequency ≥ n (enables deduplication).
    • WithMaxFrequency(n): Discard k-mers with frequency > n.
    • WithEntropyFilter(threshold, levelMax): Remove low-complexity k-mers (entropy ≤ threshold).
    • WithSaveFreqKmers(n): Save top-n most frequent k-mers per set to top_kmers.csv.
  • Concurrent & Pipeline-Aware Processing:
    Uses a two-stage pipeline: I/O-bound readers (24 goroutines) feed k-mers to CPU-bound workers, one per core, maximizing throughput.

  • Partitioned I/O & Thread Safety:
    Super-kmers are written to per-partition .skm files using mutex-protected writers, enabling safe concurrent AddSequence() calls.

Workflow

  1. Build Phase:

    • Input sequences → super-kmers extracted via minimizer-based partitioning.
    • Super-kmers written to .build/set_*/part_*.skm.
  2. Finalization (Close()):

    • .skm files loaded → canonical k-mers extracted.
    • K-mers sorted, counted (frequency spectrum), and filtered per config.
    • Final .kdi files written; spectrum.bin, and optionally top_kmers.csv.
    • Metadata (metadata.toml) generated; .build/ cleaned.
  3. Append Mode:
    AppendKmerSetGroupBuilder() extends an existing group, inheriting its parameters and appending new sets.

Output Artifacts

  • .kdi: Sorted, deduplicated (and optionally filtered) k-mers.
  • spectrum.bin: Per-set frequency spectrum (count → #k-mers).
  • top_kmers.csv (optional): Top N k-mers per set with counts.
  • metadata.toml: Global and per-set metadata (k, m, partitions, counts).

Design Highlights

  • Memory-efficient: Streams large .skm files; reuses slices to minimize GC pressure.
  • Scalable: Parallel finalization scales with CPU cores and I/O bandwidth.
  • Robust error handling: Early termination on first failure; cleanup of partial state.