# `obirefidx` Package: Semantic Overview The `obirefidx` package implements a **taxonomic reference indexing pipeline** for high-throughput sequencing data, optimized for family-level classification. It combines *k*-mer-based pre-filtering with alignment-aware similarity scoring to build compact, taxonomically annotated reference indexes—enabling fast and accurate read assignment in metabarcoding workflows. --- ## Public Functionalities ### 1. **Reference Database Indexing Pipeline** #### `IndexSequence(seqidx int, references []obiio.BioSeq, kmers obikmer.Table4mer, taxa map[string]TaxonID, taxo TaxonomySlice) (map[int]string)` Computes a **taxonomic error-profile** for one query sequence against all references: - Uses cached LCA lookups to group references by shared taxonomic ancestors. - Filters candidate sets using 4-mer overlap counts (fast). - Performs local alignment (`FastLCSScore` or `D1Or0`) to compute substitution+indel error counts. - Builds a strictly increasing vector of minimal errors per taxonomic rank (e.g., genus, family). - Outputs a map: `error_count → "Taxon@Rank"` (e.g., `{0: "Homo@genus", 3: "Primates@order"}`). > ✅ *Key insight*: Taxonomic resolution degrades predictably with alignment error. --- #### `IndexReferenceDB(iter obiio.SequenceIterator) (obiio.BatchedSequenceIterator)` Processes an entire reference database into indexed batches: - Validates sequences: skips those without valid taxonomic IDs. - Precomputes 4-mer frequency tables for all sequences (via `obikmer.Table4mer`). - Parallelizes indexing over chunks of 10 sequences using worker goroutines. - Calls `IndexSequence` for each sequence and attaches the result (`obitag`) to a copy. - Returns an iterator over batches, optionally displaying progress. > ✅ *Design note*: Memory reuse and batched I/O ensure scalability to large databases. --- ### 2. **Clustering & Deduplication** #### `MakeStartClusterSliceWorker(chunkSize int, identityThreshold float64) (func([]obiio.BioSeq) []ClusterSlice)` Performs **greedy hierarchical clustering** at family-level identity (hardcoded ≥90%): - Uses LCSS alignment with error tolerance derived from `identityThreshold`. - For each sequence, outputs: - `clusterid`: ID of its cluster centroid (head). - `clusterhead`: boolean flag indicating if it *is* the head. - `clusteridentity`: alignment-based identity to the centroid. > ✅ *Use*: Reduces redundancy before indexing—only centroids are re-indexed for efficiency. --- ### 3. **Taxonomy & Geography-Aware Indexing** #### `GeomIndexSesquence(seqidx int, references []obiio.BioSeq, taxa map[string]TaxonID, taxo TaxonomySlice) (map[int]string)` Computes a **spatially-aware taxonomic index**: - Retrieves geographic coordinates (lat/long) of the query sequence; fails if missing. - Computes Euclidean squared distances to all others in parallel. - Sorts neighbors by distance while preserving original indices (`obiutils.Order`). - Iteratively updates the LCA between query and neighbors, recording: - `distance → "Taxon@Rank"` map. - Stops early upon reaching root taxonomy. > ✅ *Use case*: Models taxonomic uncertainty bands based on nearest neighbors’ location + taxonomy. --- ### 4. **Worker Utilities & Taxonomy Annotation** #### `MakeSetFamilyTaxaWorker()`, `MakeSetGenusTaxaWorker()`, etc. Helper workers to annotate sequences with family/genus/species taxonomy: - Uses `Taxonomy.LCA()` and cached taxon IDs to assign ranks. - Parallelized over sequence batches (10 seqs/worker). - Ensures all indexed sequences carry full taxonomic context. --- ### 5. **CLI Integration** #### `OptionSet(options *getoptions.GetOpt)` Configures CLI options for the `obiuniq` tool: - Delegates to `obiconvert.OptionSet(false)` (no verbose logging). - Enables only options relevant for reference deduplication. - Ensures consistent, minimal interface across OBITools4 tools. --- ## Technical Highlights | Feature | Description | |--------|-------------| | **Parallelization** | Goroutines with `obidefault.ParallelWorkers()` for indexing, distance computation & clustering. | | **Memory Efficiency** | Reused buffers (`matrix`), batched processing, and sequence deduplication reduce RAM footprint. | | **Caching** | LCA lookups, 4-mer tables, and alignment matrices are cached to avoid recomputation. | | **Logging & Validation** | Structured logging via `logrus`; panics on critical errors (e.g., missing taxonomy). | | **Progress Tracking** | Optional progress bar via `progressbar/v3` during large DB processing. | --- ## Output Format Indexed sequences carry a map: ```go map[int]string // error_count → "Taxon@Rank" ``` Example: ```json { 0: "Homo@genus", 2: "Hominoidea@superfamily", 5: "Primates@order" } ``` Enables **rank-specific classification thresholds** (e.g., “assign to genus if ≤2 errors”). --- ## Use Cases - **Metabarcoding classification**: Rapid assignment of reads to reference families. - **Reference curation**: Cluster & deduplicate large databases before indexing. - **Ecological inference**: Estimate taxonomic uncertainty from spatial proximity + taxonomy. > 📌 *Design principle*: Align with OBITools4’s philosophy—modular, parallelizable, and taxonomically aware.