obitools4/autodoc/docmd/pkg_obitools_obiannotate.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

obiannotate: Semantic Description of Public Features

The obiannotate package provides modular, composable annotation workers for biological sequences (FASTA/FASTQ) within the OBITools4 ecosystem. Each constructor listed here returns an obiseq.SeqWorker, so pipelines can be built declaratively by chaining workers or by applying them conditionally. All functionality is available both programmatically and from the command line.
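
The chaining idea can be modelled in a few lines. The following is a minimal sketch in Python, not the package's own API: `Record`, `Worker`, `chain_workers`, `add_length`, and `drop_short` are hypothetical stand-ins, and it assumes that a worker maps one record to a list of records, with an empty list meaning the record was dropped.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-ins: a record is a plain dict, and a worker maps one
# record to a list of records (an empty list means the record was dropped).
Record = Dict[str, Any]
Worker = Callable[[Record], List[Record]]


def chain_workers(*workers: Worker) -> Worker:
    """Compose workers left to right; every output record feeds the next worker."""
    def chained(record: Record) -> List[Record]:
        current = [record]
        for worker in workers:
            following: List[Record] = []
            for rec in current:
                following.extend(worker(rec))
            current = following
        return current
    return chained


def add_length(record: Record) -> List[Record]:
    """Return a copy of the record annotated with its sequence length."""
    out = dict(record)
    out["seq_length"] = len(record["sequence"])
    return [out]


def drop_short(record: Record) -> List[Record]:
    """Keep only records whose seq_length annotation is at least 4."""
    return [record] if record["seq_length"] >= 4 else []


pipeline = chain_workers(add_length, drop_short)
kept = pipeline({"id": "s1", "sequence": "ACGTAC"})
dropped = pipeline({"id": "s2", "sequence": "AC"})
```

Because each step has the same shape, a composed pipeline is itself a worker and can be chained again.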

1️⃣ Attribute Management

Workers manipulate sequence annotations (metadata slots) with fine-grained control:

  • DeleteAttributesWorker(keys): Removes specified annotation keys; silently skips missing ones.
  • ToBeKeptAttributesWorker(keys): Retains only listed keys; discards all others.
  • ClearAllAttributesWorker(): Strips all annotations from each sequence.
  • RenameAttributeWorker(mapping): Renames keys according to an old-to-new mapping (e.g., {"old": "new"}); a rename is skipped for any record in which the source key is absent.
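
The semantics of the delete, keep, and rename workers can be illustrated on a plain dictionary. This is a sketch under stated assumptions, not the package's implementation: the function names are hypothetical, annotations are modelled as a Python dict, and each function returns a new dict rather than editing in place.

```python
from typing import Any, Dict, Iterable

Annotations = Dict[str, Any]  # hypothetical model of a record's annotation slots


def delete_attributes(annotations: Annotations, keys: Iterable[str]) -> Annotations:
    """Remove the listed keys; keys that are not present are ignored."""
    unwanted = set(keys)
    return {k: v for k, v in annotations.items() if k not in unwanted}


def keep_attributes(annotations: Annotations, keys: Iterable[str]) -> Annotations:
    """Retain only the listed keys and discard every other annotation."""
    wanted = set(keys)
    return {k: v for k, v in annotations.items() if k in wanted}


def rename_attributes(annotations: Annotations, mapping: Dict[str, str]) -> Annotations:
    """Rename old -> new for each pair; pairs whose old key is absent are skipped."""
    out = dict(annotations)
    for old, new in mapping.items():
        if old in out:
            out[new] = out.pop(old)
    return out


example = {"count": 3, "sample": "A1", "old": "x"}
after_delete = delete_attributes(example, ["count", "missing"])
after_keep = keep_attributes(example, ["sample"])
after_rename = rename_attributes(example, {"old": "new", "absent": "other"})
```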

2️⃣ Sequence Editing

Direct manipulation of sequence content and derived metadata:

  • CutSequenceWorker(start, end): Extracts the subsequence from start to end (1-based; supports negative indices). On invalid bounds, the worker either fails with an error or discards the sequence.
  • AddSeqLengthWorker(): Adds seq_length = len(sequence) annotation.
  • EvalAttributeWorker(expr, target_slot=None): Evaluates an expression (e.g., "seq_length > 200") and stores the result as an annotation; used internally by EditAttributeWorker.
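
The 1-based coordinate convention for cutting is easy to get wrong, so a small model may help. The sketch below is not the package's code. It assumes that end is inclusive, that a negative position counts from the end of the sequence (-1 being the last base), and that invalid bounds raise an error, which is one of the two outcomes described above. `cut_sequence` is a hypothetical name.

```python
def cut_sequence(sequence: str, start: int, end: int) -> str:
    """Return the subsequence between 1-based positions start and end.

    Assumptions (not confirmed by the source): end is inclusive, and a
    negative position counts from the end of the sequence (-1 = last base).
    """
    n = len(sequence)
    if start < 0:
        start = n + start + 1
    if end < 0:
        end = n + end + 1
    if not (1 <= start <= end <= n):
        raise ValueError(f"invalid bounds {start}..{end} for length {n}")
    # Convert the 1-based inclusive interval to a 0-based half-open slice.
    return sequence[start - 1:end]


middle = cut_sequence("ACGTACGT", 2, 5)
tail = cut_sequence("ACGTACGT", -3, -1)
```

Here `middle` covers positions 2 to 5 and `tail` covers the last three bases.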

3️⃣ Taxonomic Annotation

Enriches sequences with taxonomic context using NCBI taxonomy:

  • AddTaxonAtRankWorker(rank): Adds taxon name at specified rank (e.g., "species") to slot taxon_at_rank.
  • AddTaxonRankWorker(): Infers and annotates taxonomic rank (e.g., "species").
  • AddScientificNameWorker(): Adds scientific_name = "Homo sapiens"-style label.
  • AddTaxonomicPathWorker(): Adds full lineage path (semicolon-separated).
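
The lookups behind these workers amount to walking a parent chain in the taxonomy. The following sketch uses a toy three-node excerpt (real NCBI taxids and names, but with intermediate ranks omitted and the chain cut at family level) and hypothetical function names. The root-to-leaf ordering of the path is an assumption.

```python
from typing import Dict, List, Optional, Tuple

# Toy excerpt: taxid -> (scientific name, rank, parent taxid).
# Intermediate ranks are omitted and the chain is cut at family level.
TAXONOMY: Dict[int, Tuple[str, str, Optional[int]]] = {
    9606: ("Homo sapiens", "species", 9605),
    9605: ("Homo", "genus", 9604),
    9604: ("Hominidae", "family", None),
}


def lineage(taxid: int) -> List[int]:
    """Return taxids from the given taxon up to the top of the toy tree."""
    chain: List[int] = []
    current: Optional[int] = taxid
    while current is not None:
        chain.append(current)
        current = TAXONOMY[current][2]
    return chain


def taxon_at_rank(taxid: int, rank: str) -> Optional[str]:
    """Return the name of the ancestor (or self) at the given rank, if any."""
    for node in lineage(taxid):
        name, node_rank, _ = TAXONOMY[node]
        if node_rank == rank:
            return name
    return None


def taxonomic_path(taxid: int) -> str:
    """Return the semicolon-separated lineage (assumed order: root to leaf)."""
    return ";".join(TAXONOMY[node][0] for node in reversed(lineage(taxid)))


genus = taxon_at_rank(9606, "genus")
path = taxonomic_path(9606)
```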

4️⃣ Pattern Matching

Detects DNA motifs with tolerance for mismatches/indels:

  • MatchPatternWorker(pattern, max_errors=0, allow_indel=False):
    • Scans both strands via reverse-complement.
    • Annotates: slot_location (start/end), slot_match, and slot_error.
    • Uses Aho-Corasick for efficient multi-pattern search (file-based via obicorazick.AhoCorasickWorker).
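
To make the both-strand, error-tolerant search concrete, here is a deliberately naive sketch. It is not the algorithm used by the package: it replaces it with a sliding-window Hamming-distance scan, handles mismatches only (no indels), and searches the reverse strand by matching the reverse complement of the pattern. The function names, the returned dictionary keys, and the 1-based inclusive coordinates are assumptions.

```python
from typing import Any, Dict, Optional, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_on_strand(sequence: str, pattern: str, max_errors: int) -> Optional[Tuple[int, int, int]]:
    """Return (start, end, errors) of the first window within max_errors mismatches."""
    m = len(pattern)
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        errors = sum(a != b for a, b in zip(window, pattern))
        if errors <= max_errors:
            return start, start + m, errors
    return None


def match_pattern(sequence: str, pattern: str, max_errors: int = 0) -> Optional[Dict[str, Any]]:
    """Search the forward strand first, then the reverse strand."""
    hit = find_on_strand(sequence, pattern, max_errors)
    if hit is None:
        hit = find_on_strand(sequence, reverse_complement(pattern), max_errors)
    if hit is None:
        return None
    start, end, errors = hit
    return {
        "slot_location": (start + 1, end),  # assumed 1-based, inclusive
        "slot_match": sequence[start:end],
        "slot_error": errors,
    }


forward_hit = match_pattern("TTGACCGTAA", "GACC")
reverse_hit = match_pattern("TTGACCGTAA", "TACGG")
fuzzy_hit = match_pattern("TTGACCGTAA", "GATC", max_errors=1)
```

The scan costs time proportional to sequence length times pattern length, which is fine for illustration but is the reason real tools use dedicated matching algorithms.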

5️⃣ CLI-Driven Pipeline Construction

Bridges command-line flags to composable workers:

  • CLIAnnotationWorker(args): Builds a composite worker from CLI flags (e.g., --pattern, --taxonomic-rank).
  • CLIAnnotationPipeline(args): Wraps the worker in a conditional pipeline (using obigrep predicates) and runs it in parallel over the input sequences.
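
The conditional pipeline can be pictured as a predicate guarding a worker, applied across records in parallel. The sketch below is a model under assumptions, not the package's code: all names are hypothetical, records that fail the predicate are assumed to pass through unchanged, and a thread pool stands in for whatever parallel execution the package actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]  # hypothetical model of an annotated sequence


def conditional_worker(
    predicate: Callable[[Record], bool],
    worker: Callable[[Record], Record],
) -> Callable[[Record], Record]:
    """Apply worker only when predicate holds; otherwise return the record as is."""
    def apply(record: Record) -> Record:
        return worker(record) if predicate(record) else record
    return apply


def run_parallel(worker: Callable[[Record], Record], records: List[Record], n_workers: int = 2) -> List[Record]:
    """Apply worker to every record using a thread pool; input order is preserved."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, records))


def has_taxid(record: Record) -> bool:
    return "taxid" in record


def add_length(record: Record) -> Record:
    return {**record, "seq_length": len(record["sequence"])}


annotate = conditional_worker(has_taxid, add_length)
results = run_parallel(annotate, [
    {"id": "s1", "sequence": "ACGT", "taxid": 9606},
    {"id": "s2", "sequence": "ACGTAC"},
])
```

Only the first record satisfies the predicate, so only it gains a seq_length annotation.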

6️⃣ Utility & Validation

  • CLIHasPattern(pattern): Returns a worker that filters sequences matching pattern.
  • CLICut(start, end): Returns a cut worker for CLI usage.
  • All workers validate their inputs (e.g., a malformed --cut value triggers a fatal exit with a log message).

All public features are stateless, composable via ChainWorkers, and designed for high-throughput, scriptable annotation workflows.