mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
8c7017a99d
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
108 lines
4.8 KiB
Markdown
108 lines
4.8 KiB
Markdown
# `obikmersim`: K-mer–Based Sequence Similarity Analysis Package
|
||
|
||
`obikmersim` is a high-performance Python package for **k-mer–driven sequence comparison and alignment**, tailored for biological read analysis (e.g., amplicons, metagenomes). It enables rapid matching of query sequences against reference databases using efficient k-mer indexing, followed by localized alignment with quality-aware consensus refinement. Designed for scalability and flexibility, it supports sparse k-mer representations, orientation detection (forward/reverse-complement), and configurable filtering thresholds.
|
||
|
||
---
|
||
|
||
## Public API Overview
|
||
|
||
### 1. **K-mer Indexing & Matching Workers**
|
||
#### `MakeCountMatchWorker(reference_sequences, k=21, min_count=2, sparse=False)`
|
||
- **Purpose**: Build a `KmerMap` from reference sequences and match queries via shared k-mers.
|
||
- **Functionality**:
|
||
- Indexes all *k*-mers (with optional sparsity mask) from reference sequences.
|
||
- For each query, retrieves candidate references sharing ≥ `min_count` k-mers.
|
||
- Returns annotated results: query ID, matched references, match count, *k*, and sparsity flag.
|
||
- **Use Case**: Fast pre-screening for taxonomic assignment or read clustering.
|
||
|
||
#### `MakeKmerAlignWorker(count_match_worker, delta=50, penalty_scale=1.0, gap_factor=-2)`
|
||
- **Purpose**: Perform *k*-mer–seeded local alignment with quality-aware consensus.
|
||
- **Functionality**:
|
||
- Uses shared k-mers from `count_match_worker` to seed alignment candidates.
|
||
- Runs local pairwise alignments (via internal aligner) and builds quality-weighted consensus (`ReadAlign`, `BuildQualityConsensus`).
|
||
- Computes:
|
||
- `% identity`
|
||
- Residual similarity (k-mer–aware alignment score)
|
||
- Alignment length & orientation (`+`/`−`)
|
||
- Filters output by `min_identity=80%`, optional min alignment length.
|
||
- **Use Case**: Precise read assignment, error correction via consensus.
|
||
|
||
---
|
||
|
||
### 2. **CLI Configuration Options**
|
||
#### `KmerSimCountOptionSet`
|
||
- Defines CLI arguments for k-mer counting/matching:
|
||
- `--kmer-size` (int, default=21)
|
||
- `--sparse` (bool): Enable sparse k-mer masking
|
||
- `--reference <file>`: Reference FASTA/FASTQ path(s)
|
||
- `--min-count` (int, default=2): Minimum shared k-mer count
|
||
- `--self`: Perform self-comparison (query = reference)
|
||
|
||
#### `KmerSimMatchOptionSet`
|
||
- Extends counting options with alignment scoring parameters:
|
||
- `--delta` (int, default=50): Max k-mer separation for seeding
|
||
- `--penalty-scale` (float, default=1.0): Mismatch/gap scaling factor
|
||
- `--gap-factor` (float, default=−2): Gap penalty coefficient
|
||
- `--fast-absolute`: Use fast absolute scoring (no dynamic programming)
|
||
|
||
#### Composite Sets
|
||
- `CountOptionSet` / `MatchOptionSet`: Combine k-mer options with generic I/O conversion settings (e.g., via `obiconvert`).
|
||
|
||
---
|
||
|
||
### 3. **CLI Helpers & Accessors**
|
||
#### `CLIKmerSize(args)`
|
||
- Returns parsed k-mer size from CLI args.
|
||
|
||
#### `CLIReference(args, format="fasta")`
|
||
- Loads reference sequences into memory (supports batched/parallel reading).
|
||
|
||
#### `CLISelf(args)`
|
||
- Returns boolean flag for self-comparison mode.
|
||
|
||
---
|
||
|
||
### 4. **Core CLI Wrappers**
|
||
#### `CLILookForSharedKmers(args)`
|
||
- Orchestrates k-mer counting/matching pipeline:
|
||
- Builds `count_match_worker`
|
||
- Iterates over query sequences (from stdin or file)
|
||
- Outputs match annotations in structured format.
|
||
|
||
#### `CLIAlignSequences(args)`
|
||
- Runs full alignment pipeline:
|
||
- Uses `count_match_worker` to seed candidates
|
||
- Invokes `kmer_align_worker`
|
||
- Outputs aligned pairs with identity, orientation, and quality metrics.
|
||
|
||
---
|
||
|
||
## Key Technical Features
|
||
- **Sparse K-mers**: Mask positions (e.g., Ns or degenerate bases) via bitmasks.
|
||
- **Orientation Handling**: Auto-detect reverse-complement matches during seeding/alignment.
|
||
- **Fast Heuristic Scoring**: Preliminary alignment score estimation before full path resolution (reduces compute).
|
||
- **Quality-Aware Consensus**: Integrates base quality scores during alignment refinement.
|
||
- **Configurable Filtering**: Thresholds on identity, length, and k-mer support.
|
||
|
||
---
|
||
|
||
## Typical Workflows
|
||
| Workflow | Tools Used |
|
||
|---------|------------|
|
||
| Taxonomic screening of amplicons | `CLILookForSharedKmers` + sparse mode |
|
||
| Read error correction via reference consensus | `CLIAlignSequences` with quality-aware alignment |
|
||
| *In silico* PCR specificity check | `CLISelf()` + min-count filtering |
|
||
| Large-scale metagenomic read assignment | Batched parallel execution with `CLIReference` |
|
||
|
||
---
|
||
|
||
## Output Format
|
||
Results are returned as structured records (e.g., dictionaries or dataclasses) with fields:
|
||
- `query_id`, `reference_ids`
|
||
- `match_count`, `kmer_size`, `sparse_mode`
|
||
- For alignments:
|
||
`%identity`, `alignment_length`, `orientation` (`+1`/`−1`)
|
||
`residual_similarity`, `consensus_quality`
|
||
|
||
All public functions are documented with type hints and include unit tests.
|