mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
42 lines
1.5 KiB
Markdown
42 lines
1.5 KiB
Markdown
|
|
# `obitax` Package: Taxon String Parser
|
||
|
|
|
||
|
|
The `obitax` package provides a robust parser for structured taxonomic strings used in biodiversity data processing.
|
||
|
|
|
||
|
|
## Core Functionality
|
||
|
|
|
||
|
|
- **`ParseTaxonString(taxonStr string)`**
|
||
|
|
Parses strings in the format: `code:taxid [scientific name]@rank`.
|
||
|
|
|
||
|
|
- **Input Format Requirements**
|
||
|
|
- `code`: Taxonomy identifier (e.g., "GBIF", "NCBI")
|
||
|
|
- `taxid`: Numeric or alphanumeric taxonomic ID (e.g., "123456")
|
||
|
|
- `scientific name`: Enclosed in square brackets (e.g., "[Homo sapiens]")
|
||
|
|
- `rank`: Optional taxonomic rank after `@` (e.g., "species", defaults to `"no rank"` if missing)
|
||
|
|
|
||
|
|
- **Robustness Features**
|
||
|
|
- Trims whitespace around all components.
|
||
|
|
- Handles multiple `@` symbols (returns error).
|
||
|
|
- Validates bracket pairing and ordering.
|
||
|
|
- Ensures `code:taxid` contains exactly one colon separator.
|
||
|
|
|
||
|
|
- **Error Handling**
|
||
|
|
Returns descriptive errors for:
|
||
|
|
- Missing or malformed brackets
|
||
|
|
- Invalid number of `@` separators
|
||
|
|
- Absent colon in code:taxid segment
|
||
|
|
- Empty fields (code, taxid, or scientific name)
|
||
|
|
|
||
|
|
- **Use Cases**
|
||
|
|
Ideal for parsing legacy biodiversity records (e.g., from OBIS, GBIF), where taxon strings are semi-structured and need reliable extraction before indexing or matching against reference databases.
|
||
|
|
|
||
|
|
## Example
|
||
|
|
|
||
|
|
Input: `"GBIF:248093 [Homo sapiens]@species"`
|
||
|
|
Output components:
|
||
|
|
- `code = "GBIF"`
|
||
|
|
- `taxid = "248093"`
|
||
|
|
- `scientificName = "Homo sapiens"`
|
||
|
|
- `rank = "species"`
|
||
|
|
|
||
|
|
Returns empty strings and an error for invalid inputs.
|