mirror of
https://github.com/metabarcoding/obitools4.git
Semantic Description of obirefidx Package
The obirefidx package implements a taxonomic indexing pipeline for biological sequences, enabling efficient reference-based classification via alignment-free and alignment-based methods.
Core Functionality
- `IndexSequence(seqidx, references, kmers, taxa, taxo)`
  Computes a taxonomic signature for a query sequence by comparing it against reference sequences. It:
  - Identifies Least Common Ancestors (LCAs) between the query and all references using a cached LCA lookup.
  - Groups reference sequences by their shared LCAs with the query across taxonomic ranks.
  - Uses 4-mer common counts for fast pre-filtering of candidates.
  - Performs local alignment (via `FastLCSScore` or exact distance `D1Or0`) to compute error counts (substitutions + indels).
  - Builds a strictly increasing vector `closest[]` of minimal alignment errors per taxonomic rank.
  - Maps each error threshold to the most specific matching taxon (`"Taxon@Rank"`), stored in a map keyed by error count.
- `IndexReferenceDB(iterator)`
  Processes an entire reference database:
  - Loads sequences and filters out those lacking valid taxonomic IDs.
  - Precomputes 4-mer frequency tables for all sequences to accelerate k-mer comparisons.
  - Parallelizes indexing in batches (10 seqs/worker), using `IndexSequence` per sequence.
  - Attaches the resulting taxonomic index (`obitag`) to each copy of the sequence via `SetOBITagRefIndex`.
  - Returns an iterator over batches, optionally displaying a progress bar.
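The batching scheme can be sketched with plain goroutines and a channel. This is an illustration under assumptions, not the package's implementation: `indexAll` and `indexFunc` are hypothetical names, the per-sequence step is passed in as a callback instead of calling `IndexSequence`, and only the batch size of 10 comes from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

const batchSize = 10 // batch size quoted in the description

// indexFunc is a hypothetical stand-in for the per-sequence indexing step.
type indexFunc func(i int) map[int]string

// indexAll applies index to every position in [0, n) using nWorkers
// goroutines that pull half-open ranges [start, end) from a channel.
func indexAll(n, nWorkers int, index indexFunc) []map[int]string {
	if nWorkers < 1 {
		nWorkers = 1
	}
	results := make([]map[int]string, n)
	batches := make(chan [2]int)

	var wg sync.WaitGroup
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range batches {
				for i := b[0]; i < b[1]; i++ {
					// Each slot is written by exactly one worker,
					// so no extra locking is needed.
					results[i] = index(i)
				}
			}
		}()
	}

	for start := 0; start < n; start += batchSize {
		end := start + batchSize
		if end > n {
			end = n
		}
		batches <- [2]int{start, end}
	}
	close(batches)
	wg.Wait()
	return results
}

func main() {
	res := indexAll(25, 4, func(i int) map[int]string {
		return map[int]string{0: fmt.Sprintf("seq%d@species", i)}
	})
	fmt.Println(len(res), res[24][0])
}
```

Writing results into a pre-sized slice keeps the output in input order regardless of which worker finishes first.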
Key Technical Features
- Taxonomy-aware filtering: Exploits hierarchical taxonomic structure to limit alignment scope.
- Hybrid similarity search: Combines k-mer sharing (fast) with LCS-based alignment (accurate).
- Caching & optimization: LCA results are cached; memory for alignments is reused via a shared `matrix`.
- Parallelization: Uses goroutines and channels to process sequences concurrently.
- Robust error handling & logging: Leverages `logrus` for detailed diagnostics and progress tracking.
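The k-mer half of the hybrid search can be sketched as follows. The names `table4mer`, `count4mers`, and `common4mers` are hypothetical and do not reflect the package's own k-mer API; the sketch assumes that the "common count" of two sequences is the sum, over all 256 possible 4-mers, of the smaller of the two occurrence counts, and that windows containing a non-ACGT character are skipped.

```go
package main

import "fmt"

// table4mer holds the occurrence count of each of the 256 possible DNA 4-mers.
type table4mer [256]int

// encode maps a nucleotide to a 2-bit code; ok is false for any other byte.
func encode(b byte) (code int, ok bool) {
	switch b {
	case 'A', 'a':
		return 0, true
	case 'C', 'c':
		return 1, true
	case 'G', 'g':
		return 2, true
	case 'T', 't':
		return 3, true
	}
	return 0, false
}

// count4mers slides a 4-base window over seq and counts each 4-mer.
func count4mers(seq string) table4mer {
	var t table4mer
	word, valid := 0, 0
	for i := 0; i < len(seq); i++ {
		c, ok := encode(seq[i])
		if !ok {
			valid = 0 // ambiguous base: restart the window
			continue
		}
		word = (word<<2 | c) & 0xFF // keep the last four 2-bit codes
		valid++
		if valid >= 4 {
			t[word]++
		}
	}
	return t
}

// common4mers sums, per 4-mer, the smaller of the two counts.
func common4mers(a, b *table4mer) int {
	sum := 0
	for i := range a {
		if a[i] < b[i] {
			sum += a[i]
		} else {
			sum += b[i]
		}
	}
	return sum
}

func main() {
	a := count4mers("ACGTACGT")
	b := count4mers("ACGTTTTT")
	fmt.Println(common4mers(&a, &a), common4mers(&a, &b)) // prints "5 1"
}
```

Because the tables are computed once per sequence, each pairwise comparison costs a fixed 256 steps, which is why such a count is cheap enough to screen candidates before running an alignment.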
Output Format
Each indexed sequence carries a map of type `map[int]string`, where:
- Keys = alignment error counts (e.g., mismatches + gaps),
- Values = taxonomic labels like `"Homo@genus"` or `"Vertebrata@subphylum"`, enabling rank-specific classification thresholds.