mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
108 lines
4.8 KiB
Markdown
108 lines
4.8 KiB
Markdown
|
|
# `obikmersim`: K-mer–Based Sequence Similarity Analysis Package
|
|||
|
|
|
|||
|
|
`obikmersim` is a high-performance Python package for **k-mer–driven sequence comparison and alignment**, tailored for biological read analysis (e.g., amplicons, metagenomes). It enables rapid matching of query sequences against reference databases using efficient k-mer indexing, followed by localized alignment with quality-aware consensus refinement. Designed for scalability and flexibility, it supports sparse k-mer representations, orientation detection (forward/reverse-complement), and configurable filtering thresholds.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Public API Overview
|
|||
|
|
|
|||
|
|
### 1. **K-mer Indexing & Matching Workers**
|
|||
|
|
#### `MakeCountMatchWorker(reference_sequences, k=21, min_count=2, sparse=False)`
|
|||
|
|
- **Purpose**: Build a `KmerMap` from reference sequences and match queries via shared k-mers.
|
|||
|
|
- **Functionality**:
|
|||
|
|
- Indexes all *k*-mers (with optional sparsity mask) from reference sequences.
|
|||
|
|
- For each query, retrieves candidate references sharing ≥ `min_count` k-mers.
|
|||
|
|
- Returns annotated results: query ID, matched references, match count, *k*, and sparsity flag.
|
|||
|
|
- **Use Case**: Fast pre-screening for taxonomic assignment or read clustering.
|
|||
|
|
|
|||
|
|
#### `MakeKmerAlignWorker(count_match_worker, delta=50, penalty_scale=1.0, gap_factor=-2)`
|
|||
|
|
- **Purpose**: Perform *k*-mer–seeded local alignment with quality-aware consensus.
|
|||
|
|
- **Functionality**:
|
|||
|
|
- Uses shared k-mers from `count_match_worker` to seed alignment candidates.
|
|||
|
|
- Runs local pairwise alignments (via internal aligner) and builds quality-weighted consensus (`ReadAlign`, `BuildQualityConsensus`).
|
|||
|
|
- Computes:
|
|||
|
|
- `% identity`
|
|||
|
|
- Residual similarity (k-mer–aware alignment score)
|
|||
|
|
- Alignment length & orientation (`+`/`−`)
|
|||
|
|
- Filters output by `min_identity=80%`, optional min alignment length.
|
|||
|
|
- **Use Case**: Precise read assignment, error correction via consensus.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 2. **CLI Configuration Options**
|
|||
|
|
#### `KmerSimCountOptionSet`
|
|||
|
|
- Defines CLI arguments for k-mer counting/matching:
|
|||
|
|
- `--kmer-size` (int, default=21)
|
|||
|
|
- `--sparse` (bool): Enable sparse k-mer masking
|
|||
|
|
- `--reference <file>`: Reference FASTA/FASTQ path(s)
|
|||
|
|
- `--min-count` (int, default=2): Minimum shared k-mer count
|
|||
|
|
- `--self`: Perform self-comparison (query = reference)
|
|||
|
|
|
|||
|
|
#### `KmerSimMatchOptionSet`
|
|||
|
|
- Extends counting options with alignment scoring parameters:
|
|||
|
|
- `--delta` (int, default=50): Max k-mer separation for seeding
|
|||
|
|
- `--penalty-scale` (float, default=1.0): Mismatch/gap scaling factor
|
|||
|
|
- `--gap-factor` (float, default=−2): Gap penalty coefficient
|
|||
|
|
- `--fast-absolute`: Use fast absolute scoring (no dynamic programming)
|
|||
|
|
|
|||
|
|
#### Composite Sets
|
|||
|
|
- `CountOptionSet` / `MatchOptionSet`: Combine k-mer options with generic I/O conversion settings (e.g., via `obiconvert`).
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 3. **CLI Helpers & Accessors**
|
|||
|
|
#### `CLIKmerSize(args)`
|
|||
|
|
- Returns parsed k-mer size from CLI args.
|
|||
|
|
|
|||
|
|
#### `CLIReference(args, format="fasta")`
|
|||
|
|
- Loads reference sequences into memory (supports batched/parallel reading).
|
|||
|
|
|
|||
|
|
#### `CLISelf(args)`
|
|||
|
|
- Returns boolean flag for self-comparison mode.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 4. **Core CLI Wrappers**
|
|||
|
|
#### `CLILookForSharedKmers(args)`
|
|||
|
|
- Orchestrates k-mer counting/matching pipeline:
|
|||
|
|
- Builds `count_match_worker`
|
|||
|
|
- Iterates over query sequences (from stdin or file)
|
|||
|
|
- Outputs match annotations in structured format.
|
|||
|
|
|
|||
|
|
#### `CLIAlignSequences(args)`
|
|||
|
|
- Runs full alignment pipeline:
|
|||
|
|
- Uses `count_match_worker` to seed candidates
|
|||
|
|
- Invokes `kmer_align_worker`
|
|||
|
|
- Outputs aligned pairs with identity, orientation, and quality metrics.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Key Technical Features
|
|||
|
|
- **Sparse K-mers**: Mask positions (e.g., Ns or degenerate bases) via bitmasks.
|
|||
|
|
- **Orientation Handling**: Auto-detect reverse-complement matches during seeding/alignment.
|
|||
|
|
- **Fast Heuristic Scoring**: Preliminary alignment score estimation before full path resolution (reduces compute).
|
|||
|
|
- **Quality-Aware Consensus**: Integrates base quality scores during alignment refinement.
|
|||
|
|
- **Configurable Filtering**: Thresholds on identity, length, and k-mer support.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Typical Workflows
|
|||
|
|
| Workflow | Tools Used |
|
|||
|
|
|---------|------------|
|
|||
|
|
| Taxonomic screening of amplicons | `CLILookForSharedKmers` + sparse mode |
|
|||
|
|
| Read error correction via reference consensus | `CLIAlignSequences` with quality-aware alignment |
|
|||
|
|
| *In silico* PCR specificity check | `CLISelf()` + min-count filtering |
|
|||
|
|
| Large-scale metagenomic read assignment | Batched parallel execution with `CLIReference` |
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Output Format
|
|||
|
|
Results are returned as structured records (e.g., dictionaries or dataclasses) with fields:
|
|||
|
|
- `query_id`, `reference_ids`
|
|||
|
|
- `match_count`, `kmer_size`, `sparse_mode`
|
|||
|
|
- For alignments:
|
|||
|
|
`%identity`, `alignment_length`, `orientation` (`+1`/`−1`)
|
|||
|
|
`residual_similarity`, `consensus_quality`
|
|||
|
|
|
|||
|
|
All public functions are documented with type hints and include unit tests.
|