mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 03:50:39 +00:00
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
This commit is contained in:
@@ -0,0 +1,120 @@
|
||||
# `obirefidx` Package: Semantic Overview
|
||||
|
||||
The `obirefidx` package implements a **taxonomic reference indexing pipeline** for high-throughput sequencing data, optimized for family-level classification. It combines *k*-mer-based pre-filtering with alignment-aware similarity scoring to build compact, taxonomically annotated reference indexes—enabling fast and accurate read assignment in metabarcoding workflows.
|
||||
|
||||
---
|
||||
|
||||
## Public Functionalities
|
||||
|
||||
### 1. **Reference Database Indexing Pipeline**
|
||||
|
||||
#### `IndexSequence(seqidx int, references []obiio.BioSeq, kmers obikmer.Table4mer, taxa map[string]TaxonID, taxo TaxonomySlice) (map[int]string)`
|
||||
Computes a **taxonomic error-profile** for one query sequence against all references:
|
||||
- Uses cached LCA lookups to group references by shared taxonomic ancestors.
|
||||
- Filters candidate sets using 4-mer overlap counts (fast).
|
||||
- Performs local alignment (`FastLCSScore` or `D1Or0`) to compute substitution+indel error counts.
|
||||
- Builds a strictly increasing vector of minimal errors per taxonomic rank (e.g., genus, family).
|
||||
- Outputs a map: `error_count → "Taxon@Rank"` (e.g., `{0: "Homo@genus", 3: "Primates@order"}`).
|
||||
|
||||
> ✅ *Key insight*: Taxonomic resolution degrades predictably with alignment error.
|
||||
|
||||
---
|
||||
|
||||
#### `IndexReferenceDB(iter obiio.SequenceIterator) (obiio.BatchedSequenceIterator)`
|
||||
Processes an entire reference database into indexed batches:
|
||||
- Validates sequences: skips those without valid taxonomic IDs.
|
||||
- Precomputes 4-mer frequency tables for all sequences (via `obikmer.Table4mer`).
|
||||
- Parallelizes indexing over chunks of 10 sequences using worker goroutines.
|
||||
- Calls `IndexSequence` for each sequence and attaches the result (`obitag`) to a copy.
|
||||
- Returns an iterator over batches, optionally displaying progress.
|
||||
|
||||
> ✅ *Design note*: Memory reuse and batched I/O ensure scalability to large databases.
|
||||
|
||||
---
|
||||
|
||||
### 2. **Clustering & Deduplication**
|
||||
|
||||
#### `MakeStartClusterSliceWorker(chunkSize int, identityThreshold float64) (func([]obiio.BioSeq) []ClusterSlice)`
|
||||
Performs **greedy hierarchical clustering** at family-level identity (hardcoded ≥90%):
|
||||
- Uses LCSS alignment with error tolerance derived from `identityThreshold`.
|
||||
- For each sequence, outputs:
|
||||
- `clusterid`: ID of its cluster centroid (head).
|
||||
- `clusterhead`: boolean flag indicating if it *is* the head.
|
||||
- `clusteridentity`: alignment-based identity to the centroid.
|
||||
|
||||
> ✅ *Use*: Reduces redundancy before indexing—only centroids are re-indexed for efficiency.
|
||||
|
||||
---
|
||||
|
||||
### 3. **Taxonomy & Geography-Aware Indexing**
|
||||
|
||||
#### `GeomIndexSesquence(seqidx int, references []obiio.BioSeq, taxa map[string]TaxonID, taxo TaxonomySlice) (map[int]string)`
|
||||
Computes a **spatially-aware taxonomic index**:
|
||||
- Retrieves geographic coordinates (lat/long) of the query sequence; fails if missing.
|
||||
- Computes Euclidean squared distances to all others in parallel.
|
||||
- Sorts neighbors by distance while preserving original indices (`obiutils.Order`).
|
||||
- Iteratively updates the LCA between query and neighbors, recording:
|
||||
- `distance → "Taxon@Rank"` map.
|
||||
- Stops early upon reaching root taxonomy.
|
||||
|
||||
> ✅ *Use case*: Models taxonomic uncertainty bands based on nearest neighbors’ location + taxonomy.
|
||||
|
||||
---
|
||||
|
||||
### 4. **Worker Utilities & Taxonomy Annotation**
|
||||
|
||||
#### `MakeSetFamilyTaxaWorker()`, `MakeSetGenusTaxaWorker()`, etc.
|
||||
Helper workers to annotate sequences with family/genus/species taxonomy:
|
||||
- Uses `Taxonomy.LCA()` and cached taxon IDs to assign ranks.
|
||||
- Parallelized over sequence batches (10 seqs/worker).
|
||||
- Ensures all indexed sequences carry full taxonomic context.
|
||||
|
||||
---
|
||||
|
||||
### 5. **CLI Integration**
|
||||
|
||||
#### `OptionSet(options *getoptions.GetOpt)`
|
||||
Configures CLI options for the `obiuniq` tool:
|
||||
- Delegates to `obiconvert.OptionSet(false)` (no verbose logging).
|
||||
- Enables only options relevant for reference deduplication.
|
||||
- Ensures consistent, minimal interface across OBITools4 tools.
|
||||
|
||||
---
|
||||
|
||||
## Technical Highlights
|
||||
|
||||
| Feature | Description |
|
||||
|--------|-------------|
|
||||
| **Parallelization** | Goroutines with `obidefault.ParallelWorkers()` for indexing, distance computation & clustering. |
|
||||
| **Memory Efficiency** | Reused buffers (`matrix`), batched processing, and sequence deduplication reduce RAM footprint. |
|
||||
| **Caching** | LCA lookups, 4-mer tables, and alignment matrices are cached to avoid recomputation. |
|
||||
| **Logging & Validation** | Structured logging via `logrus`; panics on critical errors (e.g., missing taxonomy). |
|
||||
| **Progress Tracking** | Optional progress bar via `progressbar/v3` during large DB processing. |
|
||||
|
||||
---
|
||||
|
||||
## Output Format
|
||||
|
||||
Indexed sequences carry a map:
|
||||
```go
|
||||
map[int]string // error_count → "Taxon@Rank"
|
||||
```
|
||||
Example:
|
||||
```json
|
||||
{
|
||||
0: "Homo@genus",
|
||||
2: "Hominoidea@superfamily",
|
||||
5: "Primates@order"
|
||||
}
|
||||
```
|
||||
Enables **rank-specific classification thresholds** (e.g., “assign to genus if ≤2 errors”).
|
||||
|
||||
---
|
||||
|
||||
## Use Cases
|
||||
|
||||
- **Metabarcoding classification**: Rapid assignment of reads to reference families.
|
||||
- **Reference curation**: Cluster & deduplicate large databases before indexing.
|
||||
- **Ecological inference**: Estimate taxonomic uncertainty from spatial proximity + taxonomy.
|
||||
|
||||
> 📌 *Design principle*: Align with OBITools4’s philosophy—modular, parallelizable, and taxonomically aware.
|
||||
Reference in New Issue
Block a user