mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
32 lines
1.7 KiB
Markdown
## NCBI Taxonomy Archive Support in `obiformats`

This Go package provides utilities for handling **NCBI Taxonomy dumps archived as `.tar` files**.

### Core Functionalities

1. **Archive Validation (`IsNCBITarTaxDump`)**
   - Checks whether a given `.tar` file contains all required NCBI Taxonomy dump files: `citations.dmp`, `division.dmp`, `gencode.dmp`, `names.dmp`, `delnodes.dmp`, `gc.prt`, `merged.dmp`, and `nodes.dmp`.
   - Returns a boolean indicating whether the archive is a complete NCBI taxonomy dump.

2. **Taxonomy Loading (`LoadNCBITarTaxDump`)**
   - Parses the `.tar` archive and extracts key files to build a `Taxonomy` object.
   - Steps include:
     - **Nodes**: Loads the taxonomic hierarchy (`nodes.dmp`) via `loadNodeTable`.
     - **Names**: Parses scientific and common names (`names.dmp`) via `loadNameTable`, with an option to load *only scientific names* (`onlysn`).
     - **Merged Taxa**: Integrates taxonomic aliases from `merged.dmp` using `loadMergedTable`.
     - Sets the root taxon to NCBI’s default (`taxid = 1`, i.e., *root*).

3. **Integration with Other Modules**
   - Uses `obiutils.Ropen` and `TarFileReader` to open the archive and read its members.
   - Leverages `obitax.Taxonomy`, a structured representation of taxonomic data.

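
The archive validation described above can be illustrated with the standard library alone. The sketch below is not the `obiformats` implementation: `hasAllDumpFiles` and `makeTar` are hypothetical helpers, the input is assumed to be an uncompressed tar stream, and members are matched by base name, which is an assumption about how entries are named inside the archive.

```go
package main

import (
	"archive/tar"
	"bytes"
	"errors"
	"fmt"
	"io"
	"path"
)

// requiredDumpFiles lists the members named in the description above.
var requiredDumpFiles = []string{
	"citations.dmp", "division.dmp", "gencode.dmp", "names.dmp",
	"delnodes.dmp", "gc.prt", "merged.dmp", "nodes.dmp",
}

// hasAllDumpFiles (hypothetical helper) scans a tar stream once and reports
// whether every required member was seen.
func hasAllDumpFiles(r io.Reader) (bool, error) {
	missing := make(map[string]struct{}, len(requiredDumpFiles))
	for _, name := range requiredDumpFiles {
		missing[name] = struct{}{}
	}
	tr := tar.NewReader(r)
	for len(missing) > 0 {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			break // end of archive
		}
		if err != nil {
			return false, err
		}
		// Assumption: compare base names so "./nodes.dmp" also matches.
		delete(missing, path.Base(hdr.Name))
	}
	return len(missing) == 0, nil
}

// makeTar (hypothetical helper) builds a small in-memory archive for the demo.
func makeTar(names ...string) *bytes.Buffer {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	body := []byte("placeholder\n")
	for _, name := range names {
		_ = tw.WriteHeader(&tar.Header{Name: name, Mode: 0644, Size: int64(len(body))})
		_, _ = tw.Write(body)
	}
	_ = tw.Close()
	return &buf
}

func main() {
	complete, _ := hasAllDumpFiles(makeTar(requiredDumpFiles...))
	partial, _ := hasAllDumpFiles(makeTar("nodes.dmp", "names.dmp"))
	fmt.Println(complete, partial) // prints "true false"
}
```

Tracking the missing names in a set lets the scan stop as soon as the last required member is found, instead of reading the archive once per file.
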
### Key Parameters
- `onlysn`: If true, only scientific names are loaded (reduces memory usage).
- `seqAsTaxa`: Reserved for future use; currently unused.

### Logging & Error Handling
- Uses `logrus` to log loading progress and counts.
- Returns descriptive errors if required files or the root taxon are missing.


> **Note**: Designed for efficient, standards-compliant ingestion of NCBI Taxonomy data in bioinformatics pipelines.