Adds the `obitaxonomy` crate to parse and validate hierarchical taxonomy paths using a strict `taxonomy:/name@rank/...` syntax. Replaces generic string-based path matching in predicates with structured `TaxPath` and `TaxPattern` types, enforcing explicit anchor constraints and rank-aware semantics. Updates filtering documentation to clarify optional leading slashes and segment-boundary matching rules.
obitaxonomy — taxonomy concept paths
obitaxonomy is a dependency-free crate that defines a typed representation
of hierarchical concept paths (taxonomic or otherwise) stored in genome metadata.
Concept path syntax
A concept path is stored as a metadata value with the prefix taxonomy:/:
taxonomy:/Enterobacteriaceae@family/Escherichia@genus/Escherichia coli@species
Structure:
- The `taxonomy:/` prefix is the type discriminator. Any metadata value starting with it is parsed as a `TaxPath`; all others remain plain strings.
- The remainder is one or more `/`-separated segments.
- Each segment is `name` or `name@rank`, where `rank` is a label for the taxonomic level (e.g. `family`, `genus`, `species`).
- Rank annotations are optional per segment and can be mixed freely.
- Spaces are allowed in both names and ranks.
Reserved character
@ is reserved throughout the taxonomy system and may not appear in:
| Context | Constraint |
|---|---|
| Segment name | forbidden |
| Rank/class label | forbidden |
| Metadata key names | forbidden (used as key@rank in predicate syntax) |
@ is freely allowed in plain-text metadata values (non-taxonomy).
Parse errors
| Condition | Error |
|---|---|
| Value does not start with `taxonomy:/` | `MissingPrefix` |
| No segments after the prefix | `EmptyPath` |
| Segment with empty name (consecutive `/`) | `EmptySegmentName` |
| Segment with trailing `@` and no rank (`name@`) | `EmptyRankName` |
| Segment with more than one `@` | `AmbiguousRank` |
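The error table can be read as a small state machine. The sketch below is not the crate's implementation: `parse_segment` and `parse_path` are hypothetical helper names, segments are modelled as plain tuples, and the order in which overlapping errors are reported (for example a bare `@`) is an assumption the text above does not settle.

```rust
/// Hypothetical error enum; variant names follow the table above.
#[derive(Debug, PartialEq)]
enum TaxError {
    MissingPrefix,
    EmptyPath,
    EmptySegmentName,
    EmptyRankName,
    AmbiguousRank,
}

/// Parse one segment, `name` or `name@rank`, into (name, optional rank).
fn parse_segment(s: &str) -> Result<(String, Option<String>), TaxError> {
    let mut parts = s.split('@');
    let name = parts.next().unwrap_or("");
    let rank = parts.next();
    if parts.next().is_some() {
        return Err(TaxError::AmbiguousRank); // more than one '@'
    }
    if name.is_empty() {
        return Err(TaxError::EmptySegmentName); // e.g. consecutive '/'
    }
    match rank {
        Some("") => Err(TaxError::EmptyRankName), // trailing '@'
        Some(r) => Ok((name.to_string(), Some(r.to_string()))),
        None => Ok((name.to_string(), None)),
    }
}

/// Parse a full concept path into its segments.
fn parse_path(value: &str) -> Result<Vec<(String, Option<String>)>, TaxError> {
    let rest = value
        .strip_prefix("taxonomy:/")
        .ok_or(TaxError::MissingPrefix)?;
    if rest.is_empty() {
        return Err(TaxError::EmptyPath);
    }
    // Collecting into a Result stops at the first failing segment.
    rest.split('/').map(parse_segment).collect()
}
```

Because spaces are not treated specially, `Escherichia coli@species` parses as a single segment with name `Escherichia coli` and rank `species`.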
Public API
TaxSegment
A single node: a name and an optional rank.
```rust
seg.name()             // &str
seg.rank()             // Option<&str>
seg.to_string()        // "name" or "name@rank"
TaxSegment::parse(s)   // Result<TaxSegment, TaxError>
```
TaxPath
```rust
TaxPath::parse(s)            // Result<TaxPath, TaxError>
path.segments()              // &[TaxSegment]
path.depth()                 // usize: number of segments
path.is_ancestor_of(&other)  // bool: prefix match by name, ranks ignored
path.name_at_rank("genus")   // Option<&str>
path.to_string()             // reconstructs "taxonomy:/…"
```
is_ancestor_of compares segment names only — rank annotations are
informational and do not affect the ancestry relation.
```rust
let a: TaxPath = "taxonomy:/Enterobacteriaceae@family/Escherichia@genus".parse()?;
let b: TaxPath = "taxonomy:/Enterobacteriaceae@family/Escherichia@genus/Escherichia coli@species".parse()?;

assert!(a.is_ancestor_of(&b));   // true
assert!(!b.is_ancestor_of(&a));  // b is deeper than a, so it is not an ancestor
assert!(a.is_ancestor_of(&a));   // true (equal ⇒ ancestor)

assert_eq!(b.name_at_rank("species"), Some("Escherichia coli"));
assert_eq!(b.name_at_rank("genus"), Some("Escherichia"));
assert_eq!(b.name_at_rank("order"), None);
```
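To make the "names only" rule concrete, here is a minimal sketch of how the two queries could be written over a slice of segments. `Seg` and `seg` are hypothetical stand-ins for `TaxSegment`, and returning the first segment carrying a given rank is an assumption; the crate's actual code may differ.

```rust
/// Stand-in for a parsed segment: a name and an optional rank.
struct Seg {
    name: String,
    rank: Option<String>,
}

/// Convenience constructor used only in this sketch.
fn seg(name: &str, rank: Option<&str>) -> Seg {
    Seg {
        name: name.to_string(),
        rank: rank.map(str::to_string),
    }
}

/// True when `a` is a prefix of `b`, comparing names only (ranks ignored).
/// A path is considered an ancestor of itself.
fn is_ancestor_of(a: &[Seg], b: &[Seg]) -> bool {
    a.len() <= b.len() && a.iter().zip(b).all(|(x, y)| x.name == y.name)
}

/// Name of the first segment annotated with `rank`, if any.
fn name_at_rank<'a>(path: &'a [Seg], rank: &str) -> Option<&'a str> {
    path.iter()
        .find(|s| s.rank.as_deref() == Some(rank))
        .map(|s| s.name.as_str())
}
```

Since only names are compared, a path whose segments carry no rank at all can still be an ancestor of a fully annotated one.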
Integration with GenomeInfo
At index load time, every metadata value is inspected once:
- Starts with `taxonomy:/` → parsed into `TaxPath`, stored in `genome.taxonomy`.
- Otherwise → kept as-is in `genome.meta`.
```rust
struct GenomeInfo {
    label: String,
    meta: HashMap<String, String>,       // plain text metadata
    taxonomy: HashMap<String, TaxPath>,  // parsed taxonomy metadata
}
```
The raw string is not duplicated. TaxPath::to_string() reconstructs the
original value losslessly for serialisation.
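The load-time dispatch might look like the sketch below. Everything here is illustrative: `TaxPath` is reduced to a list of raw segment strings, `split_metadata` is a hypothetical helper, and treating a malformed `taxonomy:/` value as a load error (rather than, say, keeping it as plain text) is an assumption, since the text above does not say how such values are handled.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the crate's `TaxPath`: just the raw segments.
#[derive(Debug, PartialEq)]
struct TaxPath(Vec<String>);

impl TaxPath {
    /// Minimal parse: checks the prefix and rejects empty segments only.
    fn parse(value: &str) -> Option<TaxPath> {
        let rest = value.strip_prefix("taxonomy:/")?;
        if rest.is_empty() || rest.split('/').any(str::is_empty) {
            return None;
        }
        Some(TaxPath(rest.split('/').map(String::from).collect()))
    }
}

/// Inspect every metadata value once and route it to one of two maps.
fn split_metadata(
    raw: HashMap<String, String>,
) -> Result<(HashMap<String, String>, HashMap<String, TaxPath>), String> {
    let mut meta = HashMap::new();
    let mut taxonomy = HashMap::new();
    for (key, value) in raw {
        if value.starts_with("taxonomy:/") {
            // Assumption: a malformed taxonomy value aborts the load.
            let path = TaxPath::parse(&value)
                .ok_or_else(|| format!("invalid taxonomy path for key {}", key))?;
            taxonomy.insert(key, path);
        } else {
            meta.insert(key, value);
        }
    }
    Ok((meta, taxonomy))
}
```

Each key ends up in exactly one of the two maps, which is what lets the raw string be dropped once the `TaxPath` is built.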
Predicate operators (in filter / select)
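Path predicates use the ~ / !~ operators. The stored path is always rooted
(it starts with / after the taxonomy: discriminator); the query pattern does
not need to start with /, and a pattern without it is not anchored at the root.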
Path pattern syntax
| Pattern | Semantics |
|---|---|
| `A/B` | contiguous sub-path A then B, anywhere in the value |
| `/A/B` | value starts with A then B (start-anchored) |
| `A/B$` | value ends with A then B (end-anchored) |
| `/A/B$` | value is exactly A then B (fully anchored) |
| `A@x/B` | A with class x followed by B with any class |
| `A@x/B@y` | A with class x followed by B with class y |
A segment pattern without @ matches the segment name regardless of its stored class.
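The anchoring and class rules can be sketched as a match over whole segments. This is an illustration rather than the `TaxPattern` implementation: `Seg`, `parse_segs`, `seg_matches` and `path_matches` are hypothetical names, and the pattern is parsed without any validation.

```rust
/// A segment as (name, optional class), for stored paths and patterns alike.
type Seg = (String, Option<String>);

/// Split `A@x/B` into segments. No validation: this is only a sketch.
fn parse_segs(s: &str) -> Vec<Seg> {
    s.split('/')
        .map(|part| match part.split_once('@') {
            Some((name, class)) => (name.to_string(), Some(class.to_string())),
            None => (part.to_string(), None),
        })
        .collect()
}

/// Names must be equal; the class is checked only if the pattern gives one.
fn seg_matches(pat: &Seg, seg: &Seg) -> bool {
    pat.0 == seg.0 && (pat.1.is_none() || pat.1 == seg.1)
}

/// Match `pattern` against the segments of a stored path.
fn path_matches(pattern: &str, segs: &[Seg]) -> bool {
    // A leading '/' anchors at the start, a trailing '$' at the end.
    let (body, start) = match pattern.strip_prefix('/') {
        Some(rest) => (rest, true),
        None => (pattern, false),
    };
    let (body, end) = match body.strip_suffix('$') {
        Some(rest) => (rest, true),
        None => (body, false),
    };
    let pat = parse_segs(body);
    let n = pat.len();
    if n > segs.len() {
        return false;
    }
    // Does the pattern match the n segments starting at `offset`?
    let fits = |offset: usize| {
        pat.iter()
            .zip(&segs[offset..offset + n])
            .all(|(p, s)| seg_matches(p, s))
    };
    match (start, end) {
        (true, true) => n == segs.len() && fits(0),
        (true, false) => fits(0),
        (false, true) => fits(segs.len() - n),
        (false, false) => (0..=segs.len() - n).any(|i| fits(i)),
    }
}
```

Because comparison happens segment by segment, the pattern `Escherichia$` does not match a path ending in `Escherichia coli`: matches never start or stop in the middle of a segment name.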
Rank-aware queries
key@rank=value
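| Predicate form | Semantics |
|---|---|
| `key@rank=value` | the genome's `key` path has name `value` at rank `rank` |
| `key@rank!=value` | the genome's `key` path does not have name `value` at rank `rank` |
| `key@rank=v1\|v2` | the name at rank `rank` is `v1` or `v2` |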
~ combined with @rank on the key (e.g. key@genus~pattern) is not defined
and is rejected at parse time.
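A rank-aware predicate can be split into key, rank, operator and alternatives in one pass. The sketch below uses hypothetical names (`RankPredicate`, `parse_rank_predicate`, `eval`) and assumes that `!=` is the plain negation of `=`, so a genome with no segment at the requested rank satisfies `!=`.

```rust
/// Parsed form of `key@rank=v1|v2` or `key@rank!=v1|v2` (hypothetical type).
#[derive(Debug, PartialEq)]
struct RankPredicate {
    key: String,
    rank: String,
    negated: bool,
    values: Vec<String>,
}

fn parse_rank_predicate(s: &str) -> Result<RankPredicate, String> {
    // The operator starts at the first '=', '!' or '~'.
    let op_pos = s
        .find(|c: char| matches!(c, '=' | '!' | '~'))
        .ok_or_else(|| "missing operator".to_string())?;
    let (lhs, rest) = s.split_at(op_pos);
    let (key, rank) = lhs
        .split_once('@')
        .ok_or_else(|| "not a rank-aware predicate: no @rank on the key".to_string())?;
    let (negated, rhs) = if let Some(r) = rest.strip_prefix("!=") {
        (true, r)
    } else if let Some(r) = rest.strip_prefix('=') {
        (false, r)
    } else {
        // `~` and `!~` are not defined together with `@rank`.
        return Err("path operators cannot be combined with @rank".to_string());
    };
    Ok(RankPredicate {
        key: key.to_string(),
        rank: rank.to_string(),
        negated,
        values: rhs.split('|').map(String::from).collect(),
    })
}

/// Evaluate against the name found at the predicate's rank, if any.
fn eval(pred: &RankPredicate, name_at_rank: Option<&str>) -> bool {
    let hit = name_at_rank.map_or(false, |n| pred.values.iter().any(|v| v.as_str() == n));
    hit != pred.negated
}
```

A caller would look up the `TaxPath` stored under `key`, pass the result of `name_at_rank(rank)` to `eval`, and keep the genome when it returns `true`.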