Files
obitools4/autodoc/docmd/pkg_obitools_obirefidx.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

5.1 KiB
Raw Blame History

obirefidx Package: Semantic Overview

The obirefidx package implements a taxonomic reference indexing pipeline for high-throughput sequencing data, optimized for family-level classification. It combines k-mer-based pre-filtering with alignment-aware similarity scoring to build compact, taxonomically annotated reference indexes—enabling fast and accurate read assignment in metabarcoding workflows.


Public Functionalities

1. Reference Database Indexing Pipeline

IndexSequence(seqidx int, references []obiio.BioSeq, kmers obikmer.Table4mer, taxa map[string]TaxonID, taxo TaxonomySlice) (map[int]string)

Computes a taxonomic error-profile for one query sequence against all references:

  • Uses cached LCA lookups to group references by shared taxonomic ancestors.
  • Filters candidate sets using 4-mer overlap counts (fast).
  • Performs local alignment (FastLCSScore or D1Or0) to compute substitution+indel error counts.
  • Builds a strictly increasing vector of minimal errors per taxonomic rank (e.g., genus, family).
  • Outputs a map: error_count → "Taxon@Rank" (e.g., {0: "Homo@genus", 3: "Primates@order"}).

Key insight: Taxonomic resolution degrades predictably with alignment error.


IndexReferenceDB(iter obiio.SequenceIterator) (obiio.BatchedSequenceIterator)

Processes an entire reference database into indexed batches:

  • Validates sequences: skips those without valid taxonomic IDs.
  • Precomputes 4-mer frequency tables for all sequences (via obikmer.Table4mer).
  • Parallelizes indexing over chunks of 10 sequences using worker goroutines.
  • Calls IndexSequence for each sequence and attaches the result (obitag) to a copy.
  • Returns an iterator over batches, optionally displaying progress.

Design note: Memory reuse and batched I/O ensure scalability to large databases.


2. Clustering & Deduplication

MakeStartClusterSliceWorker(chunkSize int, identityThreshold float64) (func([]obiio.BioSeq) []ClusterSlice)

Performs greedy hierarchical clustering at family-level identity (hardcoded ≥90%):

  • Uses LCSS alignment with error tolerance derived from identityThreshold.
  • For each sequence, outputs:
    • clusterid: ID of its cluster centroid (head).
    • clusterhead: boolean flag indicating if it is the head.
    • clusteridentity: alignment-based identity to the centroid.

Use: Reduces redundancy before indexing—only centroids are re-indexed for efficiency.


3. Taxonomy & Geography-Aware Indexing

GeomIndexSesquence(seqidx int, references []obiio.BioSeq, taxa map[string]TaxonID, taxo TaxonomySlice) (map[int]string)

Computes a spatially-aware taxonomic index:

  • Retrieves geographic coordinates (lat/long) of the query sequence; fails if missing.
  • Computes Euclidean squared distances to all others in parallel.
  • Sorts neighbors by distance while preserving original indices (obiutils.Order).
  • Iteratively updates the LCA between query and neighbors, recording:
    • distance → "Taxon@Rank" map.
  • Stops early upon reaching root taxonomy.

Use case: Models taxonomic uncertainty bands based on nearest neighbors location + taxonomy.


4. Worker Utilities & Taxonomy Annotation

MakeSetFamilyTaxaWorker(), MakeSetGenusTaxaWorker(), etc.

Helper workers to annotate sequences with family/genus/species taxonomy:

  • Uses Taxonomy.LCA() and cached taxon IDs to assign ranks.
  • Parallelized over sequence batches (10 seqs/worker).
  • Ensures all indexed sequences carry full taxonomic context.

5. CLI Integration

OptionSet(options *getoptions.GetOpt)

Configures CLI options for the obiuniq tool:

  • Delegates to obiconvert.OptionSet(false) (no verbose logging).
  • Enables only options relevant for reference deduplication.
  • Ensures consistent, minimal interface across OBITools4 tools.

Technical Highlights

Feature Description
Parallelization Goroutines with obidefault.ParallelWorkers() for indexing, distance computation & clustering.
Memory Efficiency Reused buffers (matrix), batched processing, and sequence deduplication reduce RAM footprint.
Caching LCA lookups, 4-mer tables, and alignment matrices are cached to avoid recomputation.
Logging & Validation Structured logging via logrus; panics on critical errors (e.g., missing taxonomy).
Progress Tracking Optional progress bar via progressbar/v3 during large DB processing.

Output Format

Indexed sequences carry a map:

map[int]string // error_count → "Taxon@Rank"

Example:

{
  0: "Homo@genus",
  2: "Hominoidea@superfamily",
  5: "Primates@order"
}

Enables rank-specific classification thresholds (e.g., “assign to genus if ≤2 errors”).


Use Cases

  • Metabarcoding classification: Rapid assignment of reads to reference families.
  • Reference curation: Cluster & deduplicate large databases before indexing.
  • Ecological inference: Estimate taxonomic uncertainty from spatial proximity + taxonomy.

📌 Design principle: Align with OBITools4s philosophy—modular, parallelizable, and taxonomically aware.