mirror of https://github.com/metabarcoding/obitools4.git synced 2026-04-30 12:00:39 +00:00

`obikmersim`: K-mer–Based Sequence Similarity Analysis Package

obikmersim is a high-performance Python package for k-mer–driven sequence comparison and alignment, tailored for biological read analysis (e.g., amplicons, metagenomes). It enables rapid matching of query sequences against reference databases using efficient k-mer indexing, followed by localized alignment with quality-aware consensus refinement. Designed for scalability and flexibility, it supports sparse k-mer representations, orientation detection (forward/reverse-complement), and configurable filtering thresholds.

Public API Overview

1. K-mer Indexing & Matching Workers

`MakeCountMatchWorker(reference_sequences, k=21, min_count=2, sparse=False)`

Purpose: Build a KmerMap from reference sequences and match queries via shared k-mers.
Functionality:
- Indexes all k-mers (with optional sparsity mask) from reference sequences.
- For each query, retrieves candidate references sharing ≥ min_count k-mers.
- Returns annotated results: query ID, matched references, match count, k, and sparsity flag.
Use Case: Fast pre-screening for taxonomic assignment or read clustering.
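
The indexing and matching steps above can be illustrated with a minimal sketch. It is not the package's implementation: the helper names (`iter_kmers`, `build_kmer_index`, `count_matches`) are hypothetical, the index is a plain dictionary rather than a KmerMap, and the sketch assumes that "shared k-mers" means distinct k-mers present in both sequences.

```python
from collections import Counter, defaultdict


def iter_kmers(seq, k):
    """Yield every overlapping k-mer of seq (upper-cased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_index(references, k):
    """Map each k-mer to the set of reference IDs that contain it."""
    index = defaultdict(set)
    for ref_id, seq in references.items():
        for kmer in iter_kmers(seq, k):
            index[kmer].add(ref_id)
    return index


def count_matches(query, index, k, min_count=2):
    """Return {ref_id: number of distinct shared k-mers}, keeping counts >= min_count."""
    counts = Counter()
    for kmer in set(iter_kmers(query, k)):
        for ref_id in index.get(kmer, ()):
            counts[ref_id] += 1
    return {ref_id: n for ref_id, n in counts.items() if n >= min_count}


# Toy data, chosen only for illustration (k=4 instead of the default 21).
references = {
    "ref1": "ACGTACGTGA",
    "ref2": "TTTTGGGGCC",
}
index = build_kmer_index(references, k=4)
hits = count_matches("ACGTACGA", index, k=4)
print(hits)
```

Counting distinct k-mers keeps the score independent of repeats in the query; counting every occurrence instead is an equally plausible design and would only change the `set(...)` call.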

`MakeKmerAlignWorker(count_match_worker, delta=50, penalty_scale=1.0, gap_factor=-2)`

Purpose: Perform k-mer–seeded local alignment with quality-aware consensus.
Functionality:
- Uses shared k-mers from count_match_worker to seed alignment candidates.
- Runs local pairwise alignments (via internal aligner) and builds quality-weighted consensus (ReadAlign, BuildQualityConsensus).
- Computes:
  - % identity
  - Residual similarity (k-mer–aware alignment score)
  - Alignment length & orientation (+/−)
- Filters output by minimum identity (min_identity, default 80%) and, optionally, by minimum alignment length.
Use Case: Precise read assignment, error correction via consensus.
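
The identity computation and filtering step can be sketched as follows. This is an illustration under stated assumptions, not the package's aligner: it takes an alignment that has already been produced as two gapped strings, assumes `-` is the gap character, and assumes identity is measured over all alignment columns (gaps included). The function names and the `min_length` parameter are hypothetical.

```python
def alignment_identity(aligned_query, aligned_ref):
    """Return (percent identity, alignment length) for two equal-length gapped strings."""
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")
    length = len(aligned_query)
    if length == 0:
        return 0.0, 0
    # A column counts as a match only when both residues agree and are not gaps.
    matches = sum(
        1 for q, r in zip(aligned_query, aligned_ref)
        if q == r and q != "-"
    )
    return 100.0 * matches / length, length


def passes_filters(identity, length, min_identity=80.0, min_length=0):
    """Apply the identity threshold and the optional length threshold."""
    return identity >= min_identity and length >= min_length


identity, length = alignment_identity("ACGT-ACGTA", "ACGTTACGAA")
keep = passes_filters(identity, length)
```

In this toy alignment, 8 of 10 columns match, so the pair sits exactly at the 80% threshold and is kept.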

2. CLI Configuration Options

`KmerSimCountOptionSet`

Defines CLI arguments for k-mer counting/matching:
- --kmer-size (int, default=21)
- --sparse (bool): Enable sparse k-mer masking
- --reference <file>: Reference FASTA/FASTQ path(s)
- --min-count (int, default=2): Minimum shared k-mer count
- --self: Perform self-comparison (query = reference)

`KmerSimMatchOptionSet`

Extends counting options with alignment scoring parameters:
- --delta (int, default=50): Max k-mer separation for seeding
- --penalty-scale (float, default=1.0): Mismatch/gap scaling factor
- --gap-factor (float, default=-2): Gap penalty coefficient
- --fast-absolute: Use fast absolute scoring (no dynamic programming)
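
To make the relationship between the two option sets concrete, here is a sketch that models them with the standard-library `argparse` module. Using `argparse` is an assumption made for illustration; the package's own option-set mechanism may differ. The flag names and defaults are taken from the lists above, while the function names, the `self_mode` destination, and the choice to make `--reference` repeatable are hypothetical.

```python
import argparse


def count_option_parser():
    """Options for k-mer counting/matching (stand-in for KmerSimCountOptionSet)."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--kmer-size", type=int, default=21)
    parser.add_argument("--sparse", action="store_true")
    parser.add_argument("--reference", action="append", default=[])
    parser.add_argument("--min-count", type=int, default=2)
    parser.add_argument("--self", dest="self_mode", action="store_true")
    return parser


def match_option_parser():
    """Counting options plus alignment scoring options (stand-in for KmerSimMatchOptionSet)."""
    parser = argparse.ArgumentParser(parents=[count_option_parser()])
    parser.add_argument("--delta", type=int, default=50)
    parser.add_argument("--penalty-scale", type=float, default=1.0)
    parser.add_argument("--gap-factor", type=float, default=-2.0)
    parser.add_argument("--fast-absolute", action="store_true")
    return parser


# "refs.fasta" is a placeholder path; nothing is read from disk here.
args = match_option_parser().parse_args(
    ["--reference", "refs.fasta", "--kmer-size", "15", "--gap-factor=-3"]
)
```

The `parents=` argument mirrors the "extends" relationship: every counting option is available wherever the matching options are used.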

Composite Sets

CountOptionSet / MatchOptionSet: Combine k-mer options with generic I/O conversion settings (e.g., via obiconvert).

3. CLI Helpers & Accessors

`CLIKmerSize(args)`

Returns parsed k-mer size from CLI args.

`CLIReference(args, format="fasta")`

Loads reference sequences into memory (supports batched/parallel reading).

`CLISelf(args)`

Returns boolean flag for self-comparison mode.

4. Core CLI Wrappers

`CLILookForSharedKmers(args)`

Orchestrates k-mer counting/matching pipeline:
- Builds count_match_worker
- Iterates over query sequences (from stdin or file)
- Outputs match annotations in structured format.
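
The orchestration pattern described above (read queries, apply a worker, emit one annotation per query) can be sketched like this. It is a simplified stand-in rather than the package's code: `read_fasta`, `run_pipeline`, and `toy_worker` are hypothetical names, only FASTA input is handled, and the worker is any callable that returns a dictionary.

```python
import io


def read_fasta(handle):
    """Yield (seq_id, sequence) pairs from a FASTA text stream."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def run_pipeline(handle, worker):
    """Apply worker to each query and yield one annotation dict per query."""
    for seq_id, seq in read_fasta(handle):
        yield {"query_id": seq_id, **worker(seq)}


def toy_worker(seq):
    # Stand-in for a real matching worker: reports only the query length.
    return {"length": len(seq)}


# An in-memory stream replaces stdin or a file for this example.
queries = io.StringIO(">q1 sample\nACGT\nACG\n>q2\nTTTT\n")
records = list(run_pipeline(queries, toy_worker))
```

Passing `sys.stdin` or an open file as `handle` would cover the two input sources mentioned above without changing the loop.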

`CLIAlignSequences(args)`

Runs full alignment pipeline:
- Uses count_match_worker to seed candidates
- Invokes kmer_align_worker
- Outputs aligned pairs with identity, orientation, and quality metrics.

Key Technical Features

Sparse K-mers: Mask positions (e.g., Ns or degenerate bases) via bitmasks.
Orientation Handling: Auto-detect reverse-complement matches during seeding/alignment.
Fast Heuristic Scoring: Preliminary alignment score estimation before full path resolution (reduces compute).
Quality-Aware Consensus: Integrates base quality scores during alignment refinement.
Configurable Filtering: Thresholds on identity, length, and k-mer support.
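
Two of the features above, sparse k-mers and orientation handling, can be illustrated together. This is a sketch under assumptions: the mask is written as a string of `0`/`1` characters for readability instead of an integer bitmask, orientation is chosen by comparing shared sparse k-mer counts on both strands, and all function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def sparse_kmers(seq, mask):
    """Yield k-mers of length len(mask), keeping only positions where mask is '1'."""
    k = len(mask)
    keep = [i for i, bit in enumerate(mask) if bit == "1"]
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        yield "".join(window[i] for i in keep)


def best_orientation(query, ref_kmers, mask):
    """Return ('+' or '-', shared count) for the strand sharing more sparse k-mers."""
    forward = len(set(sparse_kmers(query, mask)) & ref_kmers)
    reverse = len(set(sparse_kmers(reverse_complement(query), mask)) & ref_kmers)
    return ("+", forward) if forward >= reverse else ("-", reverse)


mask = "1101"  # ignore the third position of every 4-mer
reference = "AAACCCGT"
ref_kmers = set(sparse_kmers(reference, mask))
query = reverse_complement(reference)  # "ACGGGTTT"
orientation, shared = best_orientation(query, ref_kmers, mask)
```

Here the query is the exact reverse complement of the reference, so it shares no sparse k-mers on the forward strand and all four on the reverse strand.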

Typical Workflows

| Workflow | Tools Used |
|---|---|
| Taxonomic screening of amplicons | `CLILookForSharedKmers` + sparse mode |
| Read error correction via reference consensus | `CLIAlignSequences` with quality-aware alignment |
| In silico PCR specificity check | `CLISelf()` + min-count filtering |
| Large-scale metagenomic read assignment | Batched parallel execution with `CLIReference` |

Output Format

Results are returned as structured records (e.g., dictionaries or dataclasses) with the following fields:

- query_id, reference_ids
- match_count, kmer_size, sparse_mode

For alignments, records also include:

- %identity, alignment_length, orientation (+1/−1)
- residual_similarity, consensus_quality
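
One possible shape for such a record, written as a dataclass, is sketched below. The field names follow the list above, except that `%identity` becomes `identity` because `%` is not valid in a Python identifier. The class name `MatchRecord`, the field types, and the choice to make alignment fields optional are assumptions made for illustration.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class MatchRecord:
    # Fields produced by k-mer matching.
    query_id: str
    reference_ids: List[str]
    match_count: int
    kmer_size: int
    sparse_mode: bool
    # Alignment fields; None when only k-mer matching was performed.
    identity: Optional[float] = None          # percent identity
    alignment_length: Optional[int] = None
    orientation: Optional[int] = None         # +1 or -1
    residual_similarity: Optional[float] = None
    consensus_quality: Optional[float] = None


# Illustrative values only, not real output.
record = MatchRecord(
    "q1", ["ref1"], 4, 21, False,
    identity=97.5, alignment_length=120, orientation=1,
)
as_dict = asdict(record)
```

`asdict` gives the dictionary form of the same record, which covers both representations mentioned above.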

All public functions are documented with type hints and include unit tests.
