obitools4/autodoc/docmd/pkg_obitools_obikmersim.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00


obikmersim: K-mer-Based Sequence Similarity Analysis Package

obikmersim is a high-performance OBITools4 package for k-mer-driven sequence comparison and alignment, tailored to biological read analysis (e.g., amplicons, metagenomes). It matches query sequences against reference databases through k-mer indexing, then refines the candidates by localized alignment with quality-aware consensus building. It supports sparse k-mer representations, orientation detection (forward/reverse-complement), and configurable filtering thresholds.


Public API Overview

1. K-mer Indexing & Matching Workers

MakeCountMatchWorker(reference_sequences, k=21, min_count=2, sparse=False)

  • Purpose: Build a KmerMap from reference sequences and match queries via shared k-mers.
  • Functionality:
    • Indexes all k-mers (with optional sparsity mask) from reference sequences.
    • For each query, retrieves candidate references sharing ≥ min_count k-mers.
    • Returns annotated results: query ID, matched references, match count, k, and sparsity flag.
  • Use Case: Fast pre-screening for taxonomic assignment or read clustering.
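The indexing and matching steps above can be sketched in a few lines of Python. This is an illustration of the general technique only, not the package's implementation: the helper names (`kmers`, `build_kmer_index`, `count_shared_kmers`) are hypothetical, a plain dictionary stands in for KmerMap, and a small k is used so the toy sequences produce matches.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every overlapping k-mer of seq (upper-cased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_kmer_index(references, k):
    """Map each k-mer to the set of reference IDs that contain it."""
    index = defaultdict(set)
    for ref_id, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(ref_id)
    return index

def count_shared_kmers(query, index, k, min_count=2):
    """Return {ref_id: number of distinct shared k-mers}, keeping only
    references that reach min_count."""
    counts = defaultdict(int)
    for kmer in set(kmers(query, k)):
        for ref_id in index.get(kmer, ()):
            counts[ref_id] += 1
    return {ref_id: n for ref_id, n in counts.items() if n >= min_count}

# Toy data (hypothetical): only ref1 shares k-mers with the query.
references = {"ref1": "ACGTACGTGA", "ref2": "TTTTGGGGCC"}
index = build_kmer_index(references, k=4)
hits = count_shared_kmers("ACGTACGA", index, k=4)
print(hits)
```

Counting distinct k-mers rather than occurrences is a design choice of this sketch; it keeps low-complexity repeats from inflating the score.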

MakeKmerAlignWorker(count_match_worker, delta=50, penalty_scale=1.0, gap_factor=-2)

  • Purpose: Perform k-mer-seeded local alignment with quality-aware consensus.
  • Functionality:
    • Uses shared k-mers from count_match_worker to seed alignment candidates.
    • Runs local pairwise alignments (via internal aligner) and builds quality-weighted consensus (ReadAlign, BuildQualityConsensus).
    • Computes:
      • % identity
      • Residual similarity (k-mer-aware alignment score)
      • Alignment length & orientation (+/-)
    • Filters output by min_identity (80%) and, optionally, by a minimum alignment length.
  • Use Case: Precise read assignment, error correction via consensus.
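To make the seeding and identity-filtering idea concrete, here is a deliberately simplified Python sketch. It assumes an ungapped comparison along the diagonal most supported by shared k-mers, which stands in for the package's internal local aligner. Quality-weighted consensus is not shown, and all function names are hypothetical.

```python
from collections import Counter

def seed_offset(query, ref, k):
    """Return the (ref_pos - query_pos) offset supported by the most
    shared k-mers, or None when no k-mer is shared."""
    ref_positions = {}
    for j in range(len(ref) - k + 1):
        ref_positions.setdefault(ref[j:j + k], []).append(j)
    votes = Counter()
    for i in range(len(query) - k + 1):
        for j in ref_positions.get(query[i:i + k], ()):
            votes[j - i] += 1
    return votes.most_common(1)[0][0] if votes else None

def ungapped_identity(query, ref, offset):
    """Compare query and ref along one diagonal.
    Returns (percent identity, overlap length)."""
    start = max(0, -offset)                    # first overlapping query position
    end = min(len(query), len(ref) - offset)   # one past the last one
    length = end - start
    if length <= 0:
        return 0.0, 0
    matches = sum(query[i] == ref[i + offset] for i in range(start, end))
    return 100.0 * matches / length, length

def align_if_similar(query, ref, k=4, min_identity=80.0):
    """Seed with shared k-mers, score the diagonal, apply the identity filter."""
    offset = seed_offset(query, ref, k)
    if offset is None:
        return None
    identity, length = ungapped_identity(query, ref, offset)
    if identity < min_identity:
        return None
    return {"offset": offset, "identity": identity, "alignment_length": length}

ref = "GGGACGTACGTTTCA"
query = "ACGTACGATTCA"   # one mismatch against ref[3:15]
result = align_if_similar(query, ref)
print(result)
```

A real local aligner would also handle insertions and deletions; the diagonal vote only shows how shared k-mers narrow the search before any scoring takes place.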

2. CLI Configuration Options

KmerSimCountOptionSet

  • Defines CLI arguments for k-mer counting/matching:
    • --kmer-size (int, default=21)
    • --sparse (bool): Enable sparse k-mer masking
    • --reference <file>: Reference FASTA/FASTQ path(s)
    • --min-count (int, default=2): Minimum shared k-mer count
    • --self: Perform self-comparison (query = reference)
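For readers who want to experiment with the same option names, the listed arguments can be mirrored with Python's standard `argparse` module. This is a stand-in written for illustration, not the tool's own option parser; the program name, the `dest` names, and the repeatable `--reference` behaviour are assumptions.

```python
import argparse

def build_count_parser():
    """Build a parser that mirrors the k-mer counting options listed above."""
    parser = argparse.ArgumentParser(prog="kmersim-count-sketch")
    parser.add_argument("--kmer-size", type=int, default=21,
                        help="k-mer length")
    parser.add_argument("--sparse", action="store_true",
                        help="enable sparse k-mer masking")
    parser.add_argument("--reference", action="append", default=[],
                        metavar="FILE",
                        help="reference FASTA/FASTQ path (repeatable here)")
    parser.add_argument("--min-count", type=int, default=2,
                        help="minimum shared k-mer count")
    # 'self' is an awkward attribute name in Python, so store it elsewhere.
    parser.add_argument("--self", dest="self_mode", action="store_true",
                        help="compare the input against itself")
    return parser

args = build_count_parser().parse_args(
    ["--reference", "refs.fasta", "--kmer-size", "15", "--sparse"]
)
print(args)
```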

KmerSimMatchOptionSet

  • Extends counting options with alignment scoring parameters:
    • --delta (int, default=50): Max k-mer separation for seeding
    • --penalty-scale (float, default=1.0): Mismatch/gap scaling factor
    • --gap-factor (float, default=-2): Gap penalty coefficient
    • --fast-absolute: Use fast absolute scoring (no dynamic programming)

Composite Sets

  • CountOptionSet / MatchOptionSet: Combine k-mer options with generic I/O conversion settings (e.g., via obiconvert).

3. CLI Helpers & Accessors

CLIKmerSize(args)

  • Returns parsed k-mer size from CLI args.

CLIReference(args, format="fasta")

  • Loads reference sequences into memory (supports batched/parallel reading).
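Batched reading can be pictured as grouping a stream of records into fixed-size chunks that a worker pool could then consume. The sketch below shows only that grouping step, using a hypothetical `batches` helper; it says nothing about how CLIReference actually schedules its reads.

```python
from itertools import islice

def batches(records, size):
    """Yield successive lists of at most `size` records from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Hypothetical stream of (id, sequence) reference records.
reference_stream = ((f"ref{i}", "ACGT" * 3) for i in range(5))
sizes = [len(batch) for batch in batches(reference_stream, size=2)]
print(sizes)
```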

CLISelf(args)

  • Returns boolean flag for self-comparison mode.

4. Core CLI Wrappers

CLILookForSharedKmers(args)

  • Orchestrates k-mer counting/matching pipeline:
    • Builds count_match_worker
    • Iterates over query sequences (from stdin or file)
    • Outputs match annotations in structured format.
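The three steps of this pipeline can be sketched end to end in Python as follows. The FASTA reader, the function names, and the shape of the emitted annotation (a dictionary with per-reference counts under `match_count`) are assumptions made for the example, not the wrapper's actual behaviour.

```python
import io

def read_fasta(handle):
    """Yield (id, sequence) pairs from a FASTA-formatted text stream."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line.upper())
    if seq_id is not None:
        yield seq_id, "".join(chunks)

def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def look_for_shared_kmers(queries, references, k, min_count):
    """Index the references once, then annotate each query as it streams by."""
    ref_kmers = {ref_id: kmer_set(seq, k) for ref_id, seq in references}
    for query_id, query_seq in queries:
        query_kmers = kmer_set(query_seq, k)
        counts = {r: len(query_kmers & ks) for r, ks in ref_kmers.items()}
        matched = {r: n for r, n in counts.items() if n >= min_count}
        yield {
            "query_id": query_id,
            "reference_ids": sorted(matched),
            "match_count": matched,
            "kmer_size": k,
        }

# In-memory streams stand in for files or stdin.
refs = io.StringIO(">r1\nACGTACGT\nGA\n>r2\nTTTTGGGG\n")
reads = io.StringIO(">q1 sample\nACGTACGA\n>q2\nCCCCCCCC\n")
results = list(look_for_shared_kmers(read_fasta(reads), read_fasta(refs),
                                     k=4, min_count=2))
for record in results:
    print(record)
```

Because the queries are consumed lazily, the same function works unchanged with `sys.stdin` in place of the `io.StringIO` objects.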

CLIAlignSequences(args)

  • Runs full alignment pipeline:
    • Uses count_match_worker to seed candidates
    • Invokes kmer_align_worker
    • Outputs aligned pairs with identity, orientation, and quality metrics.

Key Technical Features

  • Sparse K-mers: Mask positions (e.g., Ns or degenerate bases) via bitmasks.
  • Orientation Handling: Auto-detect reverse-complement matches during seeding/alignment.
  • Fast Heuristic Scoring: Preliminary alignment score estimation before full path resolution (reduces compute).
  • Quality-Aware Consensus: Integrates base quality scores during alignment refinement.
  • Configurable Filtering: Thresholds on identity, length, and k-mer support.
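Two of these features, sparse k-mers and orientation handling, are easy to illustrate together. The Python sketch below assumes a string mask in which `0` marks an ignored position; it stands in for the bitmasks mentioned above. It picks the orientation that shares more masked k-mers with the reference. The function names and the mask encoding are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def masked_kmers(seq, mask):
    """Return the set of k-mers of seq (k = len(mask)) in which every
    position flagged '0' in the mask is replaced by '.'."""
    k = len(mask)
    result = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        result.add("".join(base if flag == "1" else "."
                           for base, flag in zip(window, mask)))
    return result

def best_orientation(query, ref, mask):
    """Return ('+', n) or ('-', n): the orientation sharing the most
    masked k-mers with ref, and that shared count (ties go to '+')."""
    ref_kmers = masked_kmers(ref, mask)
    forward = len(masked_kmers(query, mask) & ref_kmers)
    reverse = len(masked_kmers(reverse_complement(query), mask) & ref_kmers)
    return ("+", forward) if forward >= reverse else ("-", reverse)

ref = "ACGGTCAATGC"
# The query is the reverse complement of GGACA, which differs from the
# reference window GGTCA only at the masked middle position.
query = "TGTCC"
print(best_orientation(query, ref, "11011"))  # sparse mask, middle ignored
print(best_orientation(query, ref, "11111"))  # full k-mers
```

With the full mask the single mismatch removes the only shared k-mer, which is the situation sparse seeds are meant to tolerate.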

Typical Workflows

| Workflow | Tools Used |
|---|---|
| Taxonomic screening of amplicons | CLILookForSharedKmers + sparse mode |
| Read error correction via reference consensus | CLIAlignSequences with quality-aware alignment |
| In silico PCR specificity check | CLISelf() + min-count filtering |
| Large-scale metagenomic read assignment | Batched parallel execution with CLIReference |

Output Format

Results are returned as structured records (e.g., dictionaries or dataclasses) with fields:

  • query_id, reference_ids
  • match_count, kmer_size, sparse_mode
  • For alignments:
    • %identity, alignment_length, orientation (+1/-1)
    • residual_similarity, consensus_quality
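One way to picture such a record is as a Python dataclass carrying the listed fields. The class name, the field types, the per-reference shape of `match_count`, and the renaming of `%identity` to `identity` (a `%` is not valid in a Python identifier) are all assumptions of this sketch.

```python
from dataclasses import dataclass, asdict
from typing import Dict, List, Optional

@dataclass
class MatchRecord:
    # Fields reported for every k-mer match.
    query_id: str
    reference_ids: List[str]
    match_count: Dict[str, int]   # assumed: shared k-mer count per reference
    kmer_size: int
    sparse_mode: bool = False
    # Fields filled in only when an alignment was performed.
    identity: Optional[float] = None            # "%identity" in the list above
    alignment_length: Optional[int] = None
    orientation: Optional[int] = None           # +1 forward, -1 reverse-complement
    residual_similarity: Optional[float] = None
    consensus_quality: Optional[float] = None   # type is a guess

record = MatchRecord(
    query_id="read_1",
    reference_ids=["ref_7"],
    match_count={"ref_7": 12},
    kmer_size=21,
)
print(asdict(record))
```

Leaving the alignment fields as `None` lets the same record type serve both the counting and the alignment workflows.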

All public functions are documented with type hints and include unit tests.