obitools4/autodoc/docmd/pkg/obitools/obirefidx/obirefidx.md at a4ce24a418f2730aa00fd5f19c0a69ce7d99da3b

mirror of https://github.com/metabarcoding/obitools4.git synced 2026-04-30 12:00:39 +00:00

Files

T

Eric Coissac 8c7017a99d ⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)

2026-04-13 13:34:53 +02:00

2.3 KiB

Raw Blame History

Semantic Description of `obirefidx` Package

The obirefidx package implements a taxonomic indexing pipeline for biological sequences, enabling efficient reference-based classification via alignment-free and alignment-based methods.

Core Functionality

IndexSequence(seqidx, references, kmers, taxa, taxo)
Computes a taxonomic signature for a query sequence by comparing it against reference sequences. It:
- Identifies Least Common Ancestors (LCAs) between the query and all references using a cached LCA lookup.
- Groups reference sequences by their shared LCAs with the query across taxonomic ranks.
- Uses 4-mer common counts for fast pre-filtering of candidates.
- Performs local alignment (via FastLCSScore or exact distance D1Or0) to compute error counts (substitutions + indels).
- Builds a strictly increasing vector closest[] of minimal alignment errors per taxonomic rank.
- Maps each error threshold to the most specific matching taxon ("Taxon@Rank"), stored in a map keyed by error count.
IndexReferenceDB(iterator)
Processes an entire reference database:
- Loads sequences and filters out those lacking valid taxonomic IDs.
- Precomputes 4-mer frequency tables for all sequences to accelerate k-mer comparisons.
- Parallelizes indexing in batches (10 seqs/worker), using IndexSequence per sequence.
- Attaches the resulting taxonomic index (obitag) to each copy of the sequence via SetOBITagRefIndex.
- Returns an iterator over batches, optionally displaying a progress bar.

Key Technical Features

Taxonomy-aware filtering: Exploits hierarchical taxonomic structure to limit alignment scope.
Hybrid similarity search: Combines k-mer sharing (fast) with LCS-based alignment (accurate).
Caching & optimization: LCA results are cached; memory for alignments is reused via a shared matrix.
Parallelization: Uses goroutines and channels to process sequences concurrently.
Robust error handling & logging: Leverages logrus for detailed diagnostics and progress tracking.

Output Format

Each indexed sequence carries a map map[int]string, where:

Keys = alignment error counts (e.g., mismatches + gaps),
Values = taxonomic labels like "Homo@genus" or "Vertebrata@subphylum", enabling rank-specific classification thresholds.

2.3 KiB Raw Blame History

Semantic Description of obirefidx Package

Core Functionality

Key Technical Features

Output Format

2.3 KiB

Raw Blame History

Semantic Description of `obirefidx` Package