feat: introduce obitaxonomy crate for hierarchical taxonomy parsing
Adds the `obitaxonomy` crate to parse and validate hierarchical taxonomy paths using a strict `taxonomy:/name@rank/...` syntax. Replaces generic string-based path matching in predicates with structured `TaxPath` and `TaxPattern` types, enforcing explicit anchor constraints and rank-aware semantics. Updates filtering documentation to clarify optional leading slashes and segment-boundary matching rules.
@@ -0,0 +1,143 @@
# `obitaxonomy`: taxonomy concept paths

`obitaxonomy` is a dependency-free crate that defines a typed representation
of hierarchical concept paths (taxonomic or otherwise) stored in genome metadata.

---

## Concept path syntax

A concept path is stored as a metadata value with the prefix `taxonomy:/`:

```
taxonomy:/Enterobacteriaceae@family/Escherichia@genus/Escherichia coli@species
```

Structure:

- The `taxonomy:/` prefix is the type discriminator. Any metadata value starting
  with it is parsed as a `TaxPath`; all others remain plain strings.
- The remainder is one or more `/`-separated segments.
- Each segment is `name` or `name@rank`, where `rank` is a label for the
  taxonomic level (e.g. `family`, `genus`, `species`).
- Rank annotations are **optional per segment** and can be mixed freely.
- Spaces are allowed in both names and ranks.
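
The structure above can be sketched in a few lines of Rust. This is only an illustration under stated assumptions: `split_concept_path` is a hypothetical helper, not part of `obitaxonomy`, and it performs no validation.

```rust
/// Hypothetical helper (not the crate's API): splits a stored value into
/// `(name, optional rank)` pairs. Returns `None` for plain strings, i.e.
/// values that do not carry the `taxonomy:/` discriminator.
fn split_concept_path(value: &str) -> Option<Vec<(&str, Option<&str>)>> {
    // The prefix decides whether the value is a concept path at all.
    let rest = value.strip_prefix("taxonomy:/")?;
    Some(
        rest.split('/')
            .map(|segment| match segment.split_once('@') {
                // `name@rank`: the rank annotation is present.
                Some((name, rank)) => (name, Some(rank)),
                // Bare `name`: no rank for this segment.
                None => (segment, None),
            })
            .collect(),
    )
}
```

Because ranks are optional per segment, a value such as `taxonomy:/Enterobacteriaceae@family/Escherichia` yields one segment with a rank and one without.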

### Reserved character

`@` is reserved throughout the taxonomy system and may **not** appear in:

| Context | Constraint |
|---------|------------|
| Segment name | forbidden |
| Rank/class label | forbidden |
| Metadata key names | forbidden (used as `key@rank` in predicate syntax) |

`@` is freely allowed in plain-text metadata values (non-taxonomy).

### Parse errors

| Condition | Error |
|-----------|-------|
| Value does not start with `taxonomy:/` | `MissingPrefix` |
| No segments after the prefix | `EmptyPath` |
| Segment with empty name (consecutive `/`) | `EmptySegmentName` |
| Segment with trailing `@` and no rank (`name@`) | `EmptyRankName` |
| Segment with more than one `@` | `AmbiguousRank` |
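
One way to read the table is as a sequence of checks. The sketch below is an assumption-laden illustration: `SketchError` is a local stand-in that borrows the variant names from the table (the real `TaxError` may be shaped differently), `validate` is a hypothetical function, and the order in which the checks run is a guess.

```rust
/// Local stand-in for the crate's error type. Variant names follow the
/// table; everything else about the real `TaxError` is unknown here.
#[derive(Debug, PartialEq)]
enum SketchError {
    MissingPrefix,
    EmptyPath,
    EmptySegmentName,
    EmptyRankName,
    AmbiguousRank,
}

/// Hypothetical validator: applies each rule of the table in turn.
fn validate(value: &str) -> Result<(), SketchError> {
    let rest = value
        .strip_prefix("taxonomy:/")
        .ok_or(SketchError::MissingPrefix)?;
    if rest.is_empty() {
        return Err(SketchError::EmptyPath);
    }
    for segment in rest.split('/') {
        let mut parts = segment.split('@');
        // `split` always yields at least one item, possibly empty.
        let name = parts.next().unwrap_or("");
        let rank = parts.next();
        if parts.next().is_some() {
            return Err(SketchError::AmbiguousRank); // more than one `@`
        }
        if name.is_empty() {
            return Err(SketchError::EmptySegmentName); // e.g. `A//B`
        }
        if rank == Some("") {
            return Err(SketchError::EmptyRankName); // e.g. `name@`
        }
    }
    Ok(())
}
```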

---

## Public API

### `TaxSegment`

A single node: a name and an optional rank.

```rust
seg.name()            // &str
seg.rank()            // Option<&str>
seg.to_string()       // "name" or "name@rank"
TaxSegment::parse(s)  // Result<TaxSegment, TaxError>
```

### `TaxPath`

```rust
TaxPath::parse(s)            // Result<TaxPath, TaxError>
path.segments()              // &[TaxSegment]
path.depth()                 // usize: number of segments
path.is_ancestor_of(&other)  // bool: prefix match by name, ranks ignored
path.name_at_rank("genus")   // Option<&str>
path.to_string()             // reconstructs "taxonomy:/…"
```

`is_ancestor_of` compares segment **names** only; rank annotations are
informational and do not affect the ancestry relation.

```rust
let a: TaxPath = "taxonomy:/Enterobacteriaceae@family/Escherichia@genus".parse()?;
let b: TaxPath = "taxonomy:/Enterobacteriaceae@family/Escherichia@genus/Escherichia coli@species".parse()?;

assert!(a.is_ancestor_of(&b));   // a is a prefix of b
assert!(!b.is_ancestor_of(&a));  // b is deeper than a, so it is not an ancestor
assert!(a.is_ancestor_of(&a));   // equal paths count as ancestors

assert_eq!(b.name_at_rank("species"), Some("Escherichia coli"));
assert_eq!(b.name_at_rank("genus"), Some("Escherichia"));
assert_eq!(b.name_at_rank("order"), None);
```
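
The name-only ancestry rule and the rank lookup can be approximated without the crate. In this sketch, `is_name_prefix` and `name_at_rank` are hypothetical free functions over `(name, rank)` tuples; they stand in for the methods above and make no claim about how `TaxPath` is implemented.

```rust
/// Hypothetical stand-in for the ancestry check: `ancestor` must be a
/// prefix of `descendant` when only the names are compared.
fn is_name_prefix(
    ancestor: &[(&str, Option<&str>)],
    descendant: &[(&str, Option<&str>)],
) -> bool {
    ancestor.len() <= descendant.len()
        && ancestor.iter().zip(descendant).all(|(a, d)| a.0 == d.0)
}

/// Hypothetical stand-in for the rank lookup: the name of the first
/// segment annotated with `rank`, if any.
fn name_at_rank<'a>(path: &[(&'a str, Option<&'a str>)], rank: &str) -> Option<&'a str> {
    path.iter().find(|seg| seg.1 == Some(rank)).map(|seg| seg.0)
}
```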

---

## Integration with `GenomeInfo`

At index load time, every metadata value is inspected once:

- Starts with `taxonomy:/` → parsed into `TaxPath`, stored in `genome.taxonomy`.
- Otherwise → kept as-is in `genome.meta`.

```rust
use std::collections::HashMap;

struct GenomeInfo {
    label: String,
    meta: HashMap<String, String>,      // plain text metadata
    taxonomy: HashMap<String, TaxPath>, // parsed taxonomy metadata
}
```

The raw string is not duplicated. `TaxPath::to_string()` reconstructs the
original value losslessly for serialisation.
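
The load-time routing described above might look roughly like this. `route_metadata` is a hypothetical function, and a `String` stands in for the parsed `TaxPath` so that the sketch compiles without the crate; a real loader would parse the value and handle parse errors.

```rust
use std::collections::HashMap;

/// Hypothetical routing step: splits raw `(key, value)` metadata into
/// plain strings and taxonomy values, inspecting each value once.
/// Returns `(meta, taxonomy)`.
fn route_metadata(
    raw: Vec<(String, String)>,
) -> (HashMap<String, String>, HashMap<String, String>) {
    let mut meta = HashMap::new();
    let mut taxonomy = HashMap::new();
    for (key, value) in raw {
        if value.starts_with("taxonomy:/") {
            // Assumption: a real loader would store a parsed `TaxPath` here.
            taxonomy.insert(key, value);
        } else {
            // Plain text is kept as-is; `@` is allowed in these values.
            meta.insert(key, value);
        }
    }
    (meta, taxonomy)
}
```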

---

## Predicate operators (in `filter` / `select`)

Path predicates use the `~` / `!~` operators. Once the `taxonomy:` prefix is set
aside, the **stored value** always starts with `/` (it is a rooted path); the
**query pattern** does not need to.

### Path pattern syntax

| Pattern | Semantics |
|---------|-----------|
| `A/B` | contiguous sub-path A then B, anywhere in the value |
| `/A/B` | value starts with A then B (start-anchored) |
| `A/B$` | value ends with A then B (end-anchored) |
| `/A/B$` | value is exactly A then B (fully anchored) |
| `A@x/B` | A with rank `x` followed by B with any rank |
| `A@x/B@y` | A with rank `x` followed by B with rank `y` |

Patterns match whole segments: `A` matches a segment named `A`, not one named `AB`.
A segment pattern without `@` matches the segment name regardless of its stored rank.
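
The anchoring rules in the table can be illustrated with a small matcher. This is a sketch under assumptions, not the predicate engine itself: `pattern_matches`, `split_seg`, and `seg_matches` are hypothetical names, the stored path is given as `(name, rank)` tuples, and a leading `/` and trailing `$` are read as the only anchors.

```rust
/// One segment as `(name, optional rank)`.
type Seg<'a> = (&'a str, Option<&'a str>);

/// Splits `name` or `name@rank` (no validation).
fn split_seg(s: &str) -> Seg<'_> {
    match s.split_once('@') {
        Some((name, rank)) => (name, Some(rank)),
        None => (s, None),
    }
}

/// A pattern segment without a rank matches any stored rank.
fn seg_matches(pat: &Seg<'_>, stored: &Seg<'_>) -> bool {
    pat.0 == stored.0 && (pat.1.is_none() || pat.1 == stored.1)
}

/// Hypothetical matcher for `~`: looks for the pattern as a contiguous run
/// of whole segments, honouring the start (`/`) and end (`$`) anchors.
fn pattern_matches(pattern: &str, path: &[Seg<'_>]) -> bool {
    let start_anchored = pattern.starts_with('/');
    let end_anchored = pattern.ends_with('$');
    let body = pattern.strip_prefix('/').unwrap_or(pattern);
    let body = body.strip_suffix('$').unwrap_or(body);
    let wanted: Vec<Seg<'_>> = body.split('/').map(split_seg).collect();
    if wanted.len() > path.len() {
        return false;
    }
    // Last index at which the pattern still fits inside the path.
    let last_start = path.len() - wanted.len();
    (0..=last_start)
        .filter(|&i| (!start_anchored || i == 0) && (!end_anchored || i == last_start))
        .any(|i| wanted.iter().zip(&path[i..]).all(|(p, s)| seg_matches(p, s)))
}
```

With the example path from the syntax section, `Escherichia/Escherichia coli` matches in the middle of the value, while `/Escherichia` does not, because the first stored segment is `Enterobacteriaceae`.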

### Rank-aware queries

A predicate can test the name found at a given rank by appending `@rank` to the key:

```
key@rank=value
```

| Predicate form | Semantics |
|----------------|-----------|
| `key@rank=value` | genome's `key` has `value` at rank `rank` |
| `key@rank!=value` | genome's `key` does not have `value` at rank `rank` |
| `key@rank=v1\|v2` | value at `rank` is `v1` or `v2` |

`~` combined with `@rank` on the key (e.g. `key@genus~pattern`) is not defined
and is rejected at parse time.
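
As a rough illustration of these forms, the sketch below parses a rank-aware predicate and evaluates it against one path. Everything named here (`RankPredicate`, `parse_rank_predicate`, `eval`, and the key `taxon` used in the note) is hypothetical, and treating a missing rank as "no match" is an assumption the text above does not settle.

```rust
/// Hypothetical parsed form of `key@rank=v1|v2` or `key@rank!=value`.
#[derive(Debug, PartialEq)]
struct RankPredicate<'a> {
    key: &'a str,
    rank: &'a str,
    negated: bool,
    values: Vec<&'a str>,
}

/// Parses a rank-aware predicate. Returns `None` for anything else,
/// including the undefined `key@rank~pattern` form.
fn parse_rank_predicate(s: &str) -> Option<RankPredicate<'_>> {
    let (lhs, rhs, negated) = match s.split_once("!=") {
        Some((lhs, rhs)) => (lhs, rhs, true),
        None => {
            let (lhs, rhs) = s.split_once('=')?;
            (lhs, rhs, false)
        }
    };
    let (key, rank) = lhs.split_once('@')?;
    if key.is_empty() || rank.is_empty() || rank.contains('~') {
        return None;
    }
    Some(RankPredicate { key, rank, negated, values: rhs.split('|').collect() })
}

/// Evaluates the predicate against the `(name, rank)` segments stored
/// under `pred.key`. Assumption: a path without that rank never "hits".
fn eval(pred: &RankPredicate<'_>, path: &[(&str, Option<&str>)]) -> bool {
    let found = path.iter().find(|seg| seg.1 == Some(pred.rank)).map(|seg| seg.0);
    let hit = found.map_or(false, |name| pred.values.iter().any(|v| *v == name));
    // `!=` simply inverts the result of `=`.
    hit != pred.negated
}
```

Under these assumptions, `taxon@genus=Escherichia|Salmonella` selects a genome whose `taxon` path has `Escherichia@genus`, and `taxon@genus~Esch` fails to parse.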