obitools4/autodoc/docmd/pkg_obitools_obiannotate.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

# `obiannotate`: Semantic Description of Public Features
The `obiannotate` package delivers modular, composable sequence annotation workers for biological sequences (FASTA/FASTQ) within the OBITools4 ecosystem. Each worker returns an `obiseq.SeqWorker`, enabling declarative pipeline construction via chaining or conditional execution. All functionality is exposed through both programmatic and CLI interfaces.
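The chaining idea can be sketched in a few lines of Go. This is a minimal illustration, not the OBITools4 implementation: `Seq`, the simplified `SeqWorker` signature, `Chain`, `upperCase`, and `addLength` are all hypothetical stand-ins introduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// Seq is a hypothetical stand-in for an OBITools4 sequence record.
type Seq struct {
	ID          string
	Sequence    string
	Annotations map[string]any
}

// SeqWorker is an assumed, simplified worker signature: one record in,
// zero or more records out, plus an error.
type SeqWorker func(*Seq) ([]*Seq, error)

// Chain composes workers left to right, feeding every record produced by
// one worker into the next. The first error aborts the chain.
func Chain(workers ...SeqWorker) SeqWorker {
	return func(s *Seq) ([]*Seq, error) {
		current := []*Seq{s}
		for _, w := range workers {
			var next []*Seq
			for _, rec := range current {
				out, err := w(rec)
				if err != nil {
					return nil, err
				}
				next = append(next, out...)
			}
			current = next
		}
		return current, nil
	}
}

// upperCase is an illustrative worker that normalizes the sequence case.
func upperCase(s *Seq) ([]*Seq, error) {
	s.Sequence = strings.ToUpper(s.Sequence)
	return []*Seq{s}, nil
}

// addLength is an illustrative worker that records the sequence length.
func addLength(s *Seq) ([]*Seq, error) {
	s.Annotations["seq_length"] = len(s.Sequence)
	return []*Seq{s}, nil
}

func main() {
	rec := &Seq{ID: "read1", Sequence: "acgtac", Annotations: map[string]any{}}
	out, err := Chain(upperCase, addLength)(rec)
	if err != nil {
		panic(err)
	}
	fmt.Println(out[0].Sequence, out[0].Annotations["seq_length"]) // ACGTAC 6
}
```

Because each worker has the same signature as the composed result, a chain can itself be passed wherever a single worker is expected.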
## 1️⃣ Attribute Management
Workers manipulate sequence annotations (metadata slots) with fine-grained control:
- **`DeleteAttributesWorker(keys)`**: Removes specified annotation keys; silently skips missing ones.
- **`ToBeKeptAttributesWorker(keys)`**: Retains only listed keys; discards all others.
- **`ClearAllAttributesWorker()`**: Strips *all* annotations from each sequence.
- **`RenameAttributeWorker(mapping)`**: Renames keys according to an old-to-new mapping (e.g., `{"old": "new"}`); a record is left unchanged when the source key is absent.
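The semantics of the delete, keep, and rename operations above can be sketched on a plain Go map. The `Annotations` type and the helpers `deleteKeys`, `keepOnly`, and `rename` are hypothetical names used only for illustration; they are not the package's API.

```go
package main

import "fmt"

// Annotations is a hypothetical stand-in for a record's metadata slots.
type Annotations map[string]any

// deleteKeys removes the listed keys; deleting a missing key is a no-op.
func deleteKeys(a Annotations, keys ...string) {
	for _, k := range keys {
		delete(a, k)
	}
}

// keepOnly drops every key that is not listed in keys.
func keepOnly(a Annotations, keys ...string) {
	keep := make(map[string]bool, len(keys))
	for _, k := range keys {
		keep[k] = true
	}
	for k := range a {
		if !keep[k] {
			delete(a, k) // deleting during range is safe in Go
		}
	}
}

// rename moves each value from its old key to its new key.
// Source keys that are absent are skipped silently.
func rename(a Annotations, mapping map[string]string) {
	for oldKey, newKey := range mapping {
		if v, ok := a[oldKey]; ok {
			delete(a, oldKey)
			a[newKey] = v
		}
	}
}

func main() {
	a := Annotations{"count": 3, "sample": "A1", "tmp": true}

	deleteKeys(a, "tmp", "missing") // "missing" is ignored
	rename(a, map[string]string{"sample": "sample_id", "absent": "x"})
	fmt.Println(len(a), a["sample_id"]) // 2 A1

	keepOnly(a, "count")
	fmt.Println(a) // map[count:3]
}
```

Note that Go map iteration order is unspecified, so in this sketch a mapping with chained renames (such as `a` to `b` and `b` to `c`) would give order-dependent results.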
## 2️⃣ Sequence Editing
Direct manipulation of sequence content and derived metadata:
- **`CutSequenceWorker(start, end)`**: Extracts subsequence from `start` to `end` (1-based; supports negative indices). Fails with error or discards sequence on invalid bounds.
- **`AddSeqLengthWorker()`**: Adds `seq_length = len(sequence)` annotation.
- **`EvalAttributeWorker(expr, target_slot=None)`**: Evaluates an expression (e.g., `"seq_length > 200"`) and stores the result as an annotation; used internally by `EditAttributeWorker`.
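The 1-based, inclusive cut with negative indices can be sketched as follows. The function `cut` is hypothetical, and the handling of negative indices (counting from the end, with `-1` as the last base) is an assumption of this sketch rather than a documented rule of `CutSequenceWorker`.

```go
package main

import "fmt"

// cut returns the subsequence from start to end, both 1-based and inclusive.
// ASSUMPTION: a negative index counts from the end (-1 is the last base).
// Invalid bounds produce an error instead of a silently truncated result.
func cut(seq string, start, end int) (string, error) {
	n := len(seq)
	if start < 0 {
		start = n + start + 1
	}
	if end < 0 {
		end = n + end + 1
	}
	if start < 1 || end > n || start > end {
		return "", fmt.Errorf("invalid bounds %d..%d for length %d", start, end, n)
	}
	// Convert the 1-based inclusive range to Go's 0-based half-open slice.
	return seq[start-1 : end], nil
}

func main() {
	seq := "ACGTACGT"

	sub, _ := cut(seq, 2, 5)
	fmt.Println(sub) // CGTA

	tail, _ := cut(seq, -3, -1)
	fmt.Println(tail) // CGT

	if _, err := cut(seq, 5, 2); err != nil {
		fmt.Println("error:", err)
	}
}
```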
## 3️⃣ Taxonomic Annotation
Enriches sequences with taxonomic context using NCBI taxonomy:
- **`AddTaxonAtRankWorker(rank)`**: Adds taxon name at specified rank (e.g., `"species"`) to slot `taxon_at_rank`.
- **`AddTaxonRankWorker()`**: Infers and annotates taxonomic rank (e.g., `"species"`).
- **`AddScientificNameWorker()`**: Adds `scientific_name = "Homo sapiens"`-style label.
- **`AddTaxonomicPathWorker()`**: Adds full lineage path (semicolon-separated).
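Looking up a taxon at a given rank and building a lineage path both amount to walking parent links in the taxonomy. The sketch below uses a hypothetical `Taxon` type and a deliberately shortened toy lineage; the root-first ordering of the path is an assumption, and none of this reflects the actual OBITools4 taxonomy API.

```go
package main

import (
	"fmt"
	"strings"
)

// Taxon is a hypothetical node in a toy taxonomy tree.
type Taxon struct {
	Name   string
	Rank   string
	Parent *Taxon
}

// taxonAtRank walks up the lineage and returns the first ancestor
// (or the taxon itself) that carries the requested rank.
func taxonAtRank(t *Taxon, rank string) (string, bool) {
	for cur := t; cur != nil; cur = cur.Parent {
		if cur.Rank == rank {
			return cur.Name, true
		}
	}
	return "", false
}

// lineagePath joins the names from the root down to t with semicolons.
// ASSUMPTION: the path is written root first.
func lineagePath(t *Taxon) string {
	var names []string
	for cur := t; cur != nil; cur = cur.Parent {
		names = append(names, cur.Name)
	}
	// The walk collected names leaf first; reverse them in place.
	for i, j := 0, len(names)-1; i < j; i, j = i+1, j-1 {
		names[i], names[j] = names[j], names[i]
	}
	return strings.Join(names, ";")
}

func main() {
	// Toy lineage with intermediate ranks omitted for brevity.
	primates := &Taxon{Name: "Primates", Rank: "order"}
	hominidae := &Taxon{Name: "Hominidae", Rank: "family", Parent: primates}
	homo := &Taxon{Name: "Homo", Rank: "genus", Parent: hominidae}
	sapiens := &Taxon{Name: "Homo sapiens", Rank: "species", Parent: homo}

	family, _ := taxonAtRank(sapiens, "family")
	fmt.Println(family)               // Hominidae
	fmt.Println(lineagePath(sapiens)) // Primates;Hominidae;Homo;Homo sapiens
}
```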
## 4️⃣ Pattern Matching
Detects DNA motifs with tolerance for mismatches/indels:
- **`MatchPatternWorker(pattern, max_errors=0, allow_indel=False)`**:
  - Scans both strands via reverse-complement.
  - Annotates: `slot_location` (start/end), `slot_match`, and `slot_error`.
  - Uses **Aho-Corasick** for efficient multi-pattern search (file-based via `obicorazick.AhoCorasickWorker`).
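Scanning both strands with a mismatch budget can be illustrated with a naive sliding-window search. This sketch counts substitutions only (no indels) and does not use Aho-Corasick; it is a simpler technique swapped in for clarity. `Match`, `reverseComplement`, `findOnStrand`, and `findBothStrands` are hypothetical names, and searching the read with the reverse-complemented pattern is one possible way to cover the reverse strand.

```go
package main

import "fmt"

// Match describes one hit; Start and End are 1-based and inclusive.
type Match struct {
	Start, End, Errors int
	Strand             string
}

// reverseComplement returns the reverse complement of an uppercase DNA
// string; any symbol other than A, C, G, T becomes N.
func reverseComplement(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c, ok := comp[s[i]]
		if !ok {
			c = 'N'
		}
		out[len(s)-1-i] = c
	}
	return string(out)
}

// findOnStrand returns the leftmost window of seq that differs from
// pattern in at most maxErrors positions (substitutions only).
func findOnStrand(seq, pattern string, maxErrors int) (start, errors int, found bool) {
	if pattern == "" {
		return 0, 0, false
	}
	for i := 0; i+len(pattern) <= len(seq); i++ {
		mismatches := 0
		for j := 0; j < len(pattern) && mismatches <= maxErrors; j++ {
			if seq[i+j] != pattern[j] {
				mismatches++
			}
		}
		if mismatches <= maxErrors {
			return i, mismatches, true
		}
	}
	return 0, 0, false
}

// findBothStrands tries the pattern as given, then its reverse complement.
func findBothStrands(seq, pattern string, maxErrors int) (Match, bool) {
	if i, e, ok := findOnStrand(seq, pattern, maxErrors); ok {
		return Match{Start: i + 1, End: i + len(pattern), Errors: e, Strand: "forward"}, true
	}
	if i, e, ok := findOnStrand(seq, reverseComplement(pattern), maxErrors); ok {
		return Match{Start: i + 1, End: i + len(pattern), Errors: e, Strand: "reverse"}, true
	}
	return Match{}, false
}

func main() {
	seq := "TTGACCGTAA"
	for _, p := range []string{"ACCG", "CGGT", "GGGG"} {
		m, ok := findBothStrands(seq, p, 0)
		fmt.Println(p, ok, m.Strand, m.Start, m.End, m.Errors)
	}
}
```

The sliding window costs time proportional to sequence length times pattern length for each pattern, which is why a multi-pattern automaton becomes attractive when many patterns are searched at once.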
## 5️⃣ CLI-Driven Pipeline Construction
Bridges command-line flags to composable workers:
- **`CLIAnnotationWorker(args)`**: Builds a composite worker from CLI flags (e.g., `--pattern`, `--taxonomic-rank`).
- **`CLIAnnotationPipeline(args)`**: Wraps the worker in a conditional pipeline (using `obigrep` predicates), so that only matching sequences are annotated, and runs it in parallel.
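A conditional, parallel pipeline of this kind can be sketched with a predicate wrapper and a small goroutine pool. Everything here is hypothetical: `Seq`, `Worker`, `Predicate`, `When`, and `RunParallel` are illustrative names, and the use of goroutines is an assumption of the sketch, not a description of how `CLIAnnotationPipeline` is implemented.

```go
package main

import (
	"fmt"
	"sync"
)

// Seq is a hypothetical stand-in for a sequence record.
type Seq struct {
	ID          string
	Sequence    string
	Annotations map[string]any
}

// Worker mutates a record in place; Predicate selects records.
type Worker func(*Seq)
type Predicate func(*Seq) bool

// When wraps w so that it only runs on records accepted by p.
// Rejected records pass through untouched.
func When(p Predicate, w Worker) Worker {
	return func(s *Seq) {
		if p(s) {
			w(s)
		}
	}
}

// RunParallel applies w to every record using nWorkers goroutines.
// Each record is handled by exactly one goroutine, so no locking is needed.
func RunParallel(seqs []*Seq, w Worker, nWorkers int) {
	if nWorkers < 1 {
		nWorkers = 1
	}
	jobs := make(chan *Seq)
	var wg sync.WaitGroup
	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for s := range jobs {
				w(s)
			}
		}()
	}
	for _, s := range seqs {
		jobs <- s
	}
	close(jobs)
	wg.Wait()
}

func main() {
	seqs := []*Seq{
		{ID: "a", Sequence: "ACGTACGT", Annotations: map[string]any{}},
		{ID: "b", Sequence: "ACG", Annotations: map[string]any{}},
	}
	isLong := func(s *Seq) bool { return len(s.Sequence) >= 5 }
	tagLong := func(s *Seq) { s.Annotations["long"] = true }

	RunParallel(seqs, When(isLong, tagLong), 4)
	for _, s := range seqs {
		fmt.Println(s.ID, s.Annotations)
	}
}
```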
## 6️⃣ Utility & Validation
- **`CLIHasPattern(pattern)`**: Returns a worker that filters sequences matching `pattern`.
- **`CLICut(start, end)`**: Returns a cut worker for CLI usage.
- All workers validate inputs (e.g., malformed `--cut` triggers fatal exit with log).
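Input validation for a cut specification might look like the sketch below. The `start:end` string format and the function name `parseCut` are assumptions made for illustration; the actual flag syntax accepted by `--cut` is not stated in this document.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCut parses a specification of the ASSUMED form "start:end" into two
// integers. It returns an error rather than guessing when the input is
// malformed, leaving the decision to abort to the caller.
func parseCut(spec string) (start, end int, err error) {
	parts := strings.Split(spec, ":")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("malformed cut spec %q: expected start:end", spec)
	}
	start, err = strconv.Atoi(strings.TrimSpace(parts[0]))
	if err != nil {
		return 0, 0, fmt.Errorf("malformed cut start %q: %w", parts[0], err)
	}
	end, err = strconv.Atoi(strings.TrimSpace(parts[1]))
	if err != nil {
		return 0, 0, fmt.Errorf("malformed cut end %q: %w", parts[1], err)
	}
	return start, end, nil
}

func main() {
	for _, spec := range []string{"10:-5", "10", "a:20"} {
		start, end, err := parseCut(spec)
		if err != nil {
			// A CLI front end could pass this error to log.Fatalf.
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(start, end)
	}
}
```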
All public features are **stateless**, composable via `ChainWorkers`, and designed for high-throughput, scriptable annotation workflows.