obitools4/autodoc/docmd/pkg_obitools_obikmersim.md

# `obikmersim`: K-mer–Based Sequence Similarity Analysis Package

`obikmersim` is a high-performance Python package for **k-mer–driven sequence comparison and alignment**, tailored for biological read analysis (e.g., amplicons, metagenomes). It enables rapid matching of query sequences against reference databases using efficient k-mer indexing, followed by localized alignment with quality-aware consensus refinement. Designed for scalability and flexibility, it supports sparse k-mer representations, orientation detection (forward/reverse-complement), and configurable filtering thresholds.

---

## Public API Overview

### 1. **K-mer Indexing & Matching Workers**
#### `MakeCountMatchWorker(reference_sequences, k=21, min_count=2, sparse=False)`
- **Purpose**: Build a `KmerMap` from reference sequences and match queries via shared k-mers.
- **Functionality**:
  - Indexes all *k*-mers (with optional sparsity mask) from reference sequences.
  - For each query, retrieves candidate references sharing ≥ `min_count` k-mers.
  - Returns annotated results: query ID, matched references, match count, *k*, and sparsity flag.
- **Use Case**: Fast pre-screening for taxonomic assignment or read clustering.

#### `MakeKmerAlignWorker(count_match_worker, delta=50, penalty_scale=1.0, gap_factor=-2)`
- **Purpose**: Perform *k*-mer–seeded local alignment with quality-aware consensus.
- **Functionality**:
  - Uses shared k-mers from `count_match_worker` to seed alignment candidates.
  - Runs local pairwise alignments (via internal aligner) and builds quality-weighted consensus (`ReadAlign`, `BuildQualityConsensus`).
  - Computes:
    - `% identity`
    - Residual similarity (k-mer–aware alignment score)
    - Alignment length & orientation (`+`/`−`)
  - Filters output by `min_identity=80%`, optional min alignment length.
- **Use Case**: Precise read assignment, error correction via consensus.

---

### 2. **CLI Configuration Options**
#### `KmerSimCountOptionSet`
- Defines CLI arguments for k-mer counting/matching:
  - `--kmer-size` (int, default=21)
  - `--sparse` (bool): Enable sparse k-mer masking
  - `--reference <file>`: Reference FASTA/FASTQ path(s)
  - `--min-count` (int, default=2): Minimum shared k-mer count
  - `--self`: Perform self-comparison (query = reference)

#### `KmerSimMatchOptionSet`
- Extends counting options with alignment scoring parameters:
  - `--delta` (int, default=50): Max k-mer separation for seeding
  - `--penalty-scale` (float, default=1.0): Mismatch/gap scaling factor
  - `--gap-factor` (float, default=−2): Gap penalty coefficient
  - `--fast-absolute`: Use fast absolute scoring (no dynamic programming)

#### Composite Sets
- `CountOptionSet` / `MatchOptionSet`: Combine k-mer options with generic I/O conversion settings (e.g., via `obiconvert`).

---

### 3. **CLI Helpers & Accessors**
#### `CLIKmerSize(args)`
- Returns parsed k-mer size from CLI args.

#### `CLIReference(args, format="fasta")`
- Loads reference sequences into memory (supports batched/parallel reading).

#### `CLISelf(args)`
- Returns boolean flag for self-comparison mode.

---

### 4. **Core CLI Wrappers**
#### `CLILookForSharedKmers(args)`
- Orchestrates k-mer counting/matching pipeline:
  - Builds `count_match_worker`
  - Iterates over query sequences (from stdin or file)
  - Outputs match annotations in structured format.

#### `CLIAlignSequences(args)`
- Runs full alignment pipeline:
  - Uses `count_match_worker` to seed candidates
  - Invokes `kmer_align_worker`
  - Outputs aligned pairs with identity, orientation, and quality metrics.

---

## Key Technical Features
- **Sparse K-mers**: Mask positions (e.g., Ns or degenerate bases) via bitmasks.
- **Orientation Handling**: Auto-detect reverse-complement matches during seeding/alignment.
- **Fast Heuristic Scoring**: Preliminary alignment score estimation before full path resolution (reduces compute).
- **Quality-Aware Consensus**: Integrates base quality scores during alignment refinement.
- **Configurable Filtering**: Thresholds on identity, length, and k-mer support.

---

## Typical Workflows
| Workflow | Tools Used |
|---------|------------|
| Taxonomic screening of amplicons | `CLILookForSharedKmers` + sparse mode |
| Read error correction via reference consensus | `CLIAlignSequences` with quality-aware alignment |
| *In silico* PCR specificity check | `CLISelf()` + min-count filtering |
| Large-scale metagenomic read assignment | Batched parallel execution with `CLIReference` |

---

## Output Format
Results are returned as structured records (e.g., dictionaries or dataclasses) with fields:
- `query_id`, `reference_ids`
- `match_count`, `kmer_size`, `sparse_mode`
- For alignments:
  `%identity`, `alignment_length`, `orientation` (`+1`/`−1`)
  `residual_similarity`, `consensus_quality`

All public functions are documented with type hints and include unit tests.