obikmer/docmd/implementation/obitaxonomy.md
Eric Coissac 9356be4ec0 feat: introduce obitaxonomy crate for hierarchical taxonomy parsing
Adds the `obitaxonomy` crate to parse and validate hierarchical taxonomy paths using a strict `taxonomy:/name@rank/...` syntax. Replaces generic string-based path matching in predicates with structured `TaxPath` and `TaxPattern` types, enforcing explicit anchor constraints and rank-aware semantics. Updates filtering documentation to clarify optional leading slashes and segment-boundary matching rules.
2026-06-22 10:24:04 +02:00


obitaxonomy — taxonomy concept paths

obitaxonomy is a dependency-free crate that defines a typed representation of hierarchical concept paths (taxonomic or otherwise) stored in genome metadata.


Concept path syntax

A concept path is stored as a metadata value with the prefix taxonomy:/:

taxonomy:/Enterobacteriaceae@family/Escherichia@genus/Escherichia coli@species

Structure:

  • The taxonomy:/ prefix is the type discriminator. Any metadata value starting with it is parsed as a TaxPath; all others remain plain strings.
  • The remainder is one or more /-separated segments.
  • Each segment is name or name@rank, where rank is a label for the taxonomic level (e.g. family, genus, species).
  • Rank annotations are optional per segment and can be mixed freely.
  • Spaces are allowed in both names and ranks.
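The rules above can be sketched as a small splitting routine. This is a minimal illustration, not the crate's implementation: `split_concept_path` is a hypothetical helper name, and it performs no validation beyond checking the prefix.

```rust
/// Hypothetical helper (not part of the documented obitaxonomy API).
/// Splits a concept path into (name, optional rank) pairs.
/// Returns None when the `taxonomy:/` prefix is missing, i.e. when the
/// value would remain a plain string.
fn split_concept_path(value: &str) -> Option<Vec<(&str, Option<&str>)>> {
    // The prefix is the type discriminator.
    let rest = value.strip_prefix("taxonomy:/")?;
    Some(
        rest.split('/')
            // Each segment is either `name` or `name@rank`.
            .map(|seg| match seg.split_once('@') {
                Some((name, rank)) => (name, Some(rank)),
                None => (seg, None),
            })
            .collect(),
    )
}
```

Ranked and unranked segments can be mixed in one path, and spaces inside a name are preserved because only `/` and `@` act as separators.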

Reserved character

@ is reserved throughout the taxonomy system and may not appear in:

| Context | Constraint |
|---|---|
| Segment name | forbidden |
| Rank/class label | forbidden |
| Metadata key names | forbidden (used as key@rank in predicate syntax) |

@ is freely allowed in plain-text metadata values (non-taxonomy).

Parse errors

| Condition | Error |
|---|---|
| Value does not start with taxonomy:/ | MissingPrefix |
| No segments after the prefix | EmptyPath |
| Segment with empty name (consecutive /) | EmptySegmentName |
| Segment with trailing @ and no rank (name@) | EmptyRankName |
| Segment with more than one @ | AmbiguousRank |
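As an illustration of how these conditions might map to errors, here is a self-contained sketch. The enum mirrors the variant names listed above, but the real `TaxError` definition may differ. `validate_segment` and `validate_path` are hypothetical names, and the order in which the checks run is an assumption.

```rust
/// Sketch of an error type using the variant names from the parse-error list.
#[derive(Debug, PartialEq, Eq)]
enum TaxError {
    MissingPrefix,
    EmptyPath,
    EmptySegmentName,
    EmptyRankName,
    AmbiguousRank,
}

/// Hypothetical helper: validates one `name` or `name@rank` segment.
fn validate_segment(seg: &str) -> Result<(&str, Option<&str>), TaxError> {
    let mut parts = seg.split('@');
    let name = parts.next().unwrap_or("");
    let rank = parts.next();
    // A third part means the segment contained more than one `@`.
    if parts.next().is_some() {
        return Err(TaxError::AmbiguousRank);
    }
    if name.is_empty() {
        return Err(TaxError::EmptySegmentName);
    }
    // `name@` yields an empty rank after the separator.
    if rank == Some("") {
        return Err(TaxError::EmptyRankName);
    }
    Ok((name, rank))
}

/// Hypothetical helper: validates a whole `taxonomy:/...` value.
fn validate_path(value: &str) -> Result<Vec<(&str, Option<&str>)>, TaxError> {
    let rest = value
        .strip_prefix("taxonomy:/")
        .ok_or(TaxError::MissingPrefix)?;
    if rest.is_empty() {
        return Err(TaxError::EmptyPath);
    }
    // Collecting into a Result stops at the first invalid segment.
    rest.split('/').map(validate_segment).collect()
}
```

Under these assumptions, `validate_path("taxonomy:/A//B")` reports `EmptySegmentName` and `validate_path("taxonomy:/A@x@y")` reports `AmbiguousRank`.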

Public API

TaxSegment

A single node: a name and an optional rank.

seg.name()            // &str
seg.rank()            // Option<&str>
seg.to_string()       // "name" or "name@rank"
TaxSegment::parse(s)  // Result<TaxSegment, TaxError>

TaxPath

TaxPath::parse(s)               // Result<TaxPath, TaxError>
path.segments()                 // &[TaxSegment]
path.depth()                    // usize — number of segments
path.is_ancestor_of(&other)     // bool — prefix match by name, ranks ignored
path.name_at_rank("genus")      // Option<&str>
path.to_string()                // reconstructs "taxonomy:/…"

is_ancestor_of compares segment names only — rank annotations are informational and do not affect the ancestry relation.

let a: TaxPath = "taxonomy:/Enterobacteriaceae@family/Escherichia@genus".parse()?;
let b: TaxPath = "taxonomy:/Enterobacteriaceae@family/Escherichia@genus/Escherichia coli@species".parse()?;

assert!(a.is_ancestor_of(&b));    // true
assert!(!b.is_ancestor_of(&a));   // false: b is longer than a
assert!(a.is_ancestor_of(&a));    // true  (equal ⇒ ancestor)

assert_eq!(b.name_at_rank("species"), Some("Escherichia coli"));
assert_eq!(b.name_at_rank("genus"),   Some("Escherichia"));
assert_eq!(b.name_at_rank("order"),   None);
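For readers who want to see the semantics spelled out, the two queries can be sketched on a plain slice of segments. This is an assumption-laden illustration rather than the crate's code: `Segment`, `seg`, and the free functions below are hypothetical stand-ins for `TaxSegment` and the `TaxPath` methods, and returning the first segment with a matching rank is an assumption.

```rust
/// Hypothetical stand-in for TaxSegment.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Segment {
    name: String,
    rank: Option<String>,
}

/// Convenience constructor used only in this sketch.
fn seg(name: &str, rank: Option<&str>) -> Segment {
    Segment {
        name: name.to_string(),
        rank: rank.map(str::to_string),
    }
}

/// Prefix match by name: `a` is an ancestor of `b` when every segment name
/// of `a` equals the name at the same position in `b`. Ranks are ignored.
fn is_ancestor_of(a: &[Segment], b: &[Segment]) -> bool {
    a.len() <= b.len() && a.iter().zip(b).all(|(x, y)| x.name == y.name)
}

/// Name of the first segment annotated with `rank`, if any.
fn name_at_rank<'a>(path: &'a [Segment], rank: &str) -> Option<&'a str> {
    path.iter()
        .find(|s| s.rank.as_deref() == Some(rank))
        .map(|s| s.name.as_str())
}
```

Because only names are compared, an unranked path such as `Enterobacteriaceae` is still an ancestor of a path whose first segment is `Enterobacteriaceae@family`.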

Integration with GenomeInfo

At index load time, every metadata value is inspected once:

  • Starts with taxonomy:/ → parsed into TaxPath, stored in genome.taxonomy.
  • Otherwise → kept as-is in genome.meta.

struct GenomeInfo {
    label:    String,
    meta:     HashMap<String, String>,    // plain text metadata
    taxonomy: HashMap<String, TaxPath>,   // parsed taxonomy metadata
}

The raw string is not duplicated. TaxPath::to_string() reconstructs the original value losslessly for serialisation.
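The load-time split could look roughly like the sketch below. Everything here is illustrative: `load_genome` is a hypothetical function, the `TaxPath` shown is a simplified stand-in that keeps segments verbatim, and returning an error for a malformed taxonomy value is an assumption, since the text does not say how such values are handled.

```rust
use std::collections::HashMap;
use std::fmt;

/// Simplified stand-in for the crate's TaxPath: stores each segment
/// verbatim ("name" or "name@rank") without validating it.
#[derive(Debug, PartialEq)]
struct TaxPath {
    segments: Vec<String>,
}

impl TaxPath {
    fn parse(value: &str) -> Result<TaxPath, String> {
        let rest = value
            .strip_prefix("taxonomy:/")
            .ok_or_else(|| format!("missing prefix: {}", value))?;
        if rest.is_empty() {
            return Err(format!("empty path: {}", value));
        }
        Ok(TaxPath {
            segments: rest.split('/').map(String::from).collect(),
        })
    }
}

impl fmt::Display for TaxPath {
    /// Rebuilds the original `taxonomy:/...` string.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "taxonomy:/{}", self.segments.join("/"))
    }
}

#[derive(Debug)]
struct GenomeInfo {
    label: String,
    meta: HashMap<String, String>,      // plain text metadata
    taxonomy: HashMap<String, TaxPath>, // parsed taxonomy metadata
}

/// Hypothetical loader: inspects every value once and routes it by prefix.
fn load_genome(label: &str, raw: &[(&str, &str)]) -> Result<GenomeInfo, String> {
    let mut info = GenomeInfo {
        label: label.to_string(),
        meta: HashMap::new(),
        taxonomy: HashMap::new(),
    };
    for &(key, value) in raw {
        if value.starts_with("taxonomy:/") {
            // Assumption: a malformed taxonomy value aborts loading.
            info.taxonomy.insert(key.to_string(), TaxPath::parse(value)?);
        } else {
            info.meta.insert(key.to_string(), value.to_string());
        }
    }
    Ok(info)
}
```

A plain value such as an e-mail address keeps its `@` and lands in `meta`, while a taxonomy value round-trips through `to_string()` unchanged.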


Predicate operators (in filter / select)

Path predicates use the ~ / !~ operators. The stored value always starts with / (rooted path); the query pattern does not need to.

Path pattern syntax

| Pattern | Semantics |
|---|---|
| A/B | contiguous sub-path A then B, anywhere in the value |
| /A/B | value starts with A then B (start-anchored) |
| A/B$ | value ends with A then B (end-anchored) |
| /A/B$ | value is exactly A then B (fully anchored) |
| A@x/B | A with class x followed by B with any class |
| A@x/B@y | A with class x followed by B with class y |

A segment pattern without @ matches the segment name regardless of its stored class.
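One way to read the pattern table is as a sliding-window comparison over whole segments, with the leading `/` and trailing `$` restricting which window positions are allowed. The sketch below follows that reading. `seg_matches` and `path_matches` are hypothetical names, the stored path is modelled as a slice of (name, optional class) pairs, and this is not presented as the matcher the project actually uses.

```rust
/// A stored segment: (name, optional class/rank).
type Seg<'a> = (&'a str, Option<&'a str>);

/// A pattern segment `A` matches by name only; `A@x` also requires class x.
fn seg_matches(pat: &str, seg: Seg<'_>) -> bool {
    match pat.split_once('@') {
        Some((name, class)) => seg.0 == name && seg.1 == Some(class),
        None => seg.0 == pat,
    }
}

/// Hypothetical matcher for the pattern forms listed in the table.
fn path_matches(pattern: &str, path: &[Seg<'_>]) -> bool {
    // A leading `/` anchors the match at the first segment.
    let (body, at_start) = match pattern.strip_prefix('/') {
        Some(rest) => (rest, true),
        None => (pattern, false),
    };
    // A trailing `$` anchors the match at the last segment.
    let (body, at_end) = match body.strip_suffix('$') {
        Some(rest) => (rest, true),
        None => (body, false),
    };
    let pats: Vec<&str> = body.split('/').collect();
    if pats.len() > path.len() {
        return false;
    }
    let last_start = path.len() - pats.len();
    // Try every window position the anchors permit; comparison is done
    // segment by segment, so matches always fall on segment boundaries.
    (0..=last_start)
        .filter(|&i| (!at_start || i == 0) && (!at_end || i == last_start))
        .any(|i| pats.iter().zip(&path[i..]).all(|(p, s)| seg_matches(p, *s)))
}
```

With both anchors present the only permitted window covers the whole path, which gives the "exactly A then B" behaviour of `/A/B$`.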

Rank-aware queries

key@rank=value

| Predicate form | Semantics |
|---|---|
| key@rank=value | the genome's key has value at rank rank |
| key@rank!=value | the genome's key does not have value at rank rank |
| key@rank=v1\|v2 | the value at rank rank is v1 or v2 |

~ combined with @rank on the key (e.g. key@genus~pattern) is not defined and is rejected at parse time.
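A rank-aware predicate can be parsed and evaluated along the following lines. This is only a sketch under stated assumptions: `RankPredicate`, `parse_rank_predicate`, and `eval_rank_predicate` are hypothetical names, the key name `taxon` in the usage note is invented for illustration, and treating a missing rank as "does not have the value" (so that `!=` succeeds) is an assumption the text does not confirm.

```rust
/// Hypothetical parsed form of `key@rank=v1|v2` or `key@rank!=value`.
#[derive(Debug, PartialEq)]
struct RankPredicate {
    key: String,
    rank: String,
    negated: bool,
    values: Vec<String>,
}

fn parse_rank_predicate(expr: &str) -> Result<RankPredicate, String> {
    // Only `=` and `!=` are accepted; an expression using `~` alone
    // contains no `=` and is rejected here.
    let eq = expr.find('=').ok_or("expected `=` or `!=`")?;
    let (lhs, negated) = match expr[..eq].strip_suffix('!') {
        Some(l) => (l, true),
        None => (&expr[..eq], false),
    };
    if lhs.contains('~') {
        return Err("`~` cannot be combined with `@rank`".to_string());
    }
    let (key, rank) = lhs.split_once('@').ok_or("expected `key@rank`")?;
    // `@` is reserved, so it may not appear again in the key or the rank.
    if key.is_empty() || rank.is_empty() || rank.contains('@') {
        return Err("malformed `key@rank`".to_string());
    }
    Ok(RankPredicate {
        key: key.to_string(),
        rank: rank.to_string(),
        negated,
        // `|` separates alternative values.
        values: expr[eq + 1..].split('|').map(String::from).collect(),
    })
}

/// Evaluates the predicate against the path stored under `pred.key`,
/// modelled here as a slice of (name, optional rank) pairs.
fn eval_rank_predicate(pred: &RankPredicate, path: &[(&str, Option<&str>)]) -> bool {
    let name = path
        .iter()
        .find(|(_, r)| *r == Some(pred.rank.as_str()))
        .map(|(n, _)| *n);
    // Assumption: a path without the requested rank never "has" the value.
    let hit = name.map_or(false, |n| pred.values.iter().any(|v| v.as_str() == n));
    hit != pred.negated
}
```

For example, `taxon@genus=Escherichia|Salmonella` parses into a non-negated predicate with two alternatives, while `taxon@genus~Escherichia` is rejected by the parser.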