# `obitaxonomy`: taxonomy concept paths

`obitaxonomy` is a dependency-free crate that defines a typed representation
of hierarchical concept paths (taxonomic or otherwise) stored in genome metadata.

---

## Concept path syntax

A concept path is stored as a metadata value with the prefix `taxonomy:/`:

```
taxonomy:/Enterobacteriaceae@family/Escherichia@genus/Escherichia coli@species
```

Structure:

- The `taxonomy:/` prefix is the type discriminator. Any metadata value starting
  with it is parsed as a `TaxPath`; all others remain plain strings.
- The remainder is one or more `/`-separated segments.
- Each segment is `name` or `name@rank`, where `rank` is a label for the
  taxonomic level (e.g. `family`, `genus`, `species`).
- Rank annotations are **optional per segment** and can be mixed freely.
- Spaces are allowed in both names and ranks.

### Reserved character

`@` is reserved throughout the taxonomy system and may **not** appear in:

| Context | Constraint |
|---------|------------|
| Segment name | forbidden |
| Rank/class label | forbidden |
| Metadata key names | forbidden (used as `key@rank` in predicate syntax) |

`@` is freely allowed in plain-text metadata values (non-taxonomy).

### Parse errors

| Condition | Error |
|-----------|-------|
| Value does not start with `taxonomy:/` | `MissingPrefix` |
| No segments after the prefix | `EmptyPath` |
| Segment with empty name (consecutive `/`) | `EmptySegmentName` |
| Segment with trailing `@` and no rank (`name@`) | `EmptyRankName` |
| Segment with more than one `@` | `AmbiguousRank` |

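The syntax and error conditions above can be illustrated with a minimal, self-contained parser sketch. It is not the crate's implementation: the `TaxError` enum here is a stand-in that only reuses the variant names from the table, segments are returned as plain `(name, rank)` tuples rather than `TaxSegment` values, and `parse_segment` / `parse_path` are hypothetical helper names. The order in which errors are checked, and the choice to report a trailing `/` as `EmptySegmentName`, are assumptions.

```rust
/// Stand-in error type; variant names follow the table above (assumed shape).
#[derive(Debug, PartialEq)]
pub enum TaxError {
    MissingPrefix,
    EmptyPath,
    EmptySegmentName,
    EmptyRankName,
    AmbiguousRank,
}

/// Parse one `name` or `name@rank` segment into (name, optional rank).
pub fn parse_segment(s: &str) -> Result<(&str, Option<&str>), TaxError> {
    let mut parts = s.split('@');
    // `split` always yields at least one item, so the fallback is never used.
    let name = parts.next().unwrap_or("");
    let rank = parts.next();
    if parts.next().is_some() {
        return Err(TaxError::AmbiguousRank); // more than one '@'
    }
    if name.is_empty() {
        return Err(TaxError::EmptySegmentName);
    }
    match rank {
        Some("") => Err(TaxError::EmptyRankName), // trailing '@' with no rank
        other => Ok((name, other)),
    }
}

/// Parse a full metadata value into its segments.
pub fn parse_path(value: &str) -> Result<Vec<(&str, Option<&str>)>, TaxError> {
    let rest = value
        .strip_prefix("taxonomy:/")
        .ok_or(TaxError::MissingPrefix)?;
    if rest.is_empty() {
        return Err(TaxError::EmptyPath);
    }
    // Collecting into `Result` stops at the first failing segment.
    rest.split('/').map(parse_segment).collect()
}
```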

---

## Public API

### `TaxSegment`

A single node: a name and an optional rank.

```rust
seg.name()            // &str
seg.rank()            // Option<&str>
seg.to_string()       // "name" or "name@rank"
TaxSegment::parse(s)  // Result<TaxSegment, TaxError>
```

### `TaxPath`

```rust
TaxPath::parse(s)            // Result<TaxPath, TaxError>
path.segments()              // &[TaxSegment]
path.depth()                 // usize: number of segments
path.is_ancestor_of(&other)  // bool: prefix match by name, ranks ignored
path.name_at_rank("genus")   // Option<&str>
path.to_string()             // reconstructs "taxonomy:/…"
```

`is_ancestor_of` compares segment **names** only; rank annotations are
informational and do not affect the ancestry relation.

```rust
let a = TaxPath::parse("taxonomy:/Enterobacteriaceae@family/Escherichia@genus")?;
let b = TaxPath::parse("taxonomy:/Enterobacteriaceae@family/Escherichia@genus/Escherichia coli@species")?;

assert!(a.is_ancestor_of(&b));   // true: a is a prefix of b
assert!(!b.is_ancestor_of(&a));  // false: b is deeper than a
assert!(a.is_ancestor_of(&a));   // true (equal ⇒ ancestor)

assert_eq!(b.name_at_rank("species"), Some("Escherichia coli"));
assert_eq!(b.name_at_rank("genus"), Some("Escherichia"));
assert_eq!(b.name_at_rank("order"), None);
```

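To make the "names only" rule concrete, the following sketch shows one way the two queries could behave on plain `(name, rank)` tuples. It is an illustration under assumptions, not the crate's code: the `Segment` alias and the free functions are hypothetical stand-ins for `TaxSegment` and the `TaxPath` methods, and returning the first segment that carries the requested rank is an assumption about how duplicate ranks would be handled.

```rust
/// Hypothetical stand-in for a parsed segment: (name, optional rank).
type Segment<'a> = (&'a str, Option<&'a str>);

/// Prefix match on names only. Ranks are never compared, and a path
/// counts as an ancestor of itself.
pub fn is_ancestor_of(ancestor: &[Segment<'_>], other: &[Segment<'_>]) -> bool {
    ancestor.len() <= other.len()
        && ancestor.iter().zip(other).all(|(a, b)| a.0 == b.0)
}

/// Name of the first segment annotated with `rank`, if any.
pub fn name_at_rank<'a>(path: &[Segment<'a>], rank: &str) -> Option<&'a str> {
    path.iter().find(|seg| seg.1 == Some(rank)).map(|seg| seg.0)
}
```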

---

## Integration with `GenomeInfo`

At index load time, every metadata value is inspected once:

- Starts with `taxonomy:/` → parsed into `TaxPath`, stored in `genome.taxonomy`.
- Otherwise → kept as-is in `genome.meta`.

```rust
struct GenomeInfo {
    label:    String,
    meta:     HashMap<String, String>,   // plain text metadata
    taxonomy: HashMap<String, TaxPath>,  // parsed taxonomy metadata
}
```

The raw string is not duplicated. `TaxPath::to_string()` reconstructs the
original value losslessly for serialisation.

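The load-time split can be sketched as a single pass over the raw metadata. This is a simplified illustration, not the actual loader: `classify` is a hypothetical name, and taxonomy values are kept as raw `String`s here where the real structure stores a parsed `TaxPath`. How parse failures are reported at load time is not covered.

```rust
use std::collections::HashMap;

/// Split raw metadata into (plain-text entries, taxonomy entries).
/// Each value is inspected exactly once and moved, never copied.
pub fn classify(
    raw: HashMap<String, String>,
) -> (HashMap<String, String>, HashMap<String, String>) {
    let mut meta = HashMap::new();
    let mut taxonomy = HashMap::new();
    for (key, value) in raw {
        if value.starts_with("taxonomy:/") {
            // The real loader would parse the value into a `TaxPath` here.
            taxonomy.insert(key, value);
        } else {
            meta.insert(key, value);
        }
    }
    (meta, taxonomy)
}
```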

---

## Predicate operators (in `filter` / `select`)

Path predicates use the `~` / `!~` operators. The **stored value** always starts
with `/` (rooted path); the **query pattern** does not need to.

### Path pattern syntax

| Pattern | Semantics |
|---------|-----------|
| `A/B` | contiguous sub-path A then B, anywhere in the value |
| `/A/B` | value starts with A then B (start-anchored) |
| `A/B$` | value ends with A then B (end-anchored) |
| `/A/B$` | value is exactly A then B (fully anchored) |
| `A@x/B` | A with class `x` followed by B with any class |
| `A@x/B@y` | A with class `x` followed by B with class `y` |

A segment pattern without `@` matches the segment name regardless of its stored class.

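The anchoring rules in the table can be read as a sliding-window match over whole segments. The sketch below illustrates that reading on plain `(name, class)` tuples; `path_matches` and `segment_matches` are hypothetical names, not the `TaxPattern` API, and pattern validation (empty segments, stray `@`) is left out.

```rust
/// Hypothetical stand-in for a stored segment: (name, optional class).
type Segment<'a> = (&'a str, Option<&'a str>);

/// True when one pattern segment (`name` or `name@class`) matches `seg`.
fn segment_matches(pat: &str, seg: &Segment<'_>) -> bool {
    match pat.split_once('@') {
        Some((name, class)) => seg.0 == name && seg.1 == Some(class),
        None => seg.0 == pat, // no '@': any stored class is accepted
    }
}

/// Match `pattern` against a parsed path, honouring the leading `/`
/// (start anchor) and trailing `$` (end anchor).
pub fn path_matches(pattern: &str, path: &[Segment<'_>]) -> bool {
    let (body, start_anchored) = match pattern.strip_prefix('/') {
        Some(rest) => (rest, true),
        None => (pattern, false),
    };
    let (body, end_anchored) = match body.strip_suffix('$') {
        Some(rest) => (rest, true),
        None => (body, false),
    };
    let pats: Vec<&str> = body.split('/').collect();
    if pats.len() > path.len() {
        return false;
    }
    // Try every window of `pats.len()` segments, narrowed by the anchors.
    let last = path.len() - pats.len();
    (0..=last)
        .filter(|&i| !start_anchored || i == 0)
        .filter(|&i| !end_anchored || i == last)
        .any(|i| {
            pats.iter()
                .zip(&path[i..])
                .all(|(p, s)| segment_matches(p, s))
        })
}
```

Because comparison happens per segment, a pattern such as `Escher` does not match a stored segment named `Escherichia`: matches always fall on segment boundaries.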
### Rank-aware queries

Appending `@rank` to the key compares the segment name stored at that rank:

```
key@rank=value
```

| Predicate form | Semantics |
|----------------|-----------|
| `key@rank=value` | genome's `key` has `value` at rank `rank` |
| `key@rank!=value` | genome's `key` does not have `value` at rank `rank` |
| `key@rank=v1\|v2` | value at `rank` is `v1` or `v2` |

`~` combined with `@rank` on the key (e.g. `key@genus~pattern`) is not defined
and is rejected at parse time.
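
A rough sketch of how such a predicate might be parsed and evaluated is shown below. Everything in it is hypothetical: `RankPredicate` and `parse_rank_predicate` are illustrative names, the path is given as plain `(name, rank)` tuples, and looking up `key` in the genome's taxonomy map is left out. It also assumes that `!=` evaluates to true when the path has no segment at the requested rank, which the table above does not specify.

```rust
/// Hypothetical parsed form of `key@rank=v1|v2` or `key@rank!=v1|v2`.
#[derive(Debug, PartialEq)]
pub struct RankPredicate<'a> {
    pub key: &'a str,
    pub rank: &'a str,
    pub negate: bool,
    pub values: Vec<&'a str>,
}

/// Parse a rank-aware predicate; returns `None` for anything else,
/// including `key@rank~pattern`.
pub fn parse_rank_predicate(s: &str) -> Option<RankPredicate<'_>> {
    // Split at the first `=`; a `!` just before it makes the operator `!=`.
    let (lhs, rhs) = s.split_once('=')?;
    let (lhs, negate) = match lhs.strip_suffix('!') {
        Some(l) => (l, true),
        None => (lhs, false),
    };
    let (key, rank) = lhs.split_once('@')?;
    if key.is_empty() || rank.is_empty() || rank.contains('@') || lhs.contains('~') {
        return None;
    }
    Some(RankPredicate { key, rank, negate, values: rhs.split('|').collect() })
}

impl RankPredicate<'_> {
    /// Evaluate against one parsed path given as (name, optional rank) pairs.
    pub fn matches(&self, path: &[(&str, Option<&str>)]) -> bool {
        let name = path
            .iter()
            .find(|seg| seg.1 == Some(self.rank))
            .map(|seg| seg.0);
        let hit = name.map_or(false, |n| self.values.iter().any(|v| *v == n));
        hit != self.negate
    }
}
```

With a path whose genus segment is `Escherichia`, the predicate `lineage@genus=Escherichia|Salmonella` (where `lineage` is a made-up key) matches, and `lineage@genus!=Escherichia` does not.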