# `obitaxonomy`: taxonomy concept paths

`obitaxonomy` is a dependency-free crate that defines a typed representation of hierarchical concept paths (taxonomic or otherwise) stored in genome metadata.

---

## Concept path syntax

A concept path is stored as a metadata value with the prefix `taxonomy:/`:

```
taxonomy:/Enterobacteriaceae@family/Escherichia@genus/Escherichia coli@species
```

Structure:

- The `taxonomy:/` prefix is the type discriminator. Any metadata value starting with it is parsed as a `TaxPath`; all others remain plain strings.
- The remainder is one or more `/`-separated segments.
- Each segment is `name` or `name@rank`, where `rank` is a label for the taxonomic level (e.g. `family`, `genus`, `species`).
- Rank annotations are **optional per segment** and can be mixed freely.
- Spaces are allowed in both names and ranks.

### Reserved character

`@` is reserved throughout the taxonomy system and may **not** appear in:

| Context | Constraint |
|---------|------------|
| Segment name | forbidden |
| Rank/class label | forbidden |
| Metadata key names | forbidden (used as `key@rank` in predicate syntax) |

`@` is freely allowed in plain-text (non-taxonomy) metadata values.

### Parse errors

| Condition | Error |
|-----------|-------|
| Value does not start with `taxonomy:/` | `MissingPrefix` |
| No segments after the prefix | `EmptyPath` |
| Segment with empty name (consecutive `/`) | `EmptySegmentName` |
| Segment with trailing `@` and no rank (`name@`) | `EmptyRankName` |
| Segment with more than one `@` | `AmbiguousRank` |

---

## Public API

### `TaxSegment`

A single node: a name and an optional rank.

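The segment rules and the parse-error table under "Concept path syntax" can be sketched as a small standalone parser. This is only an illustration, not the crate's implementation: the `ParseError` enum name, the `parse_segment` and `parse_path` helpers, the tuple representation, and the order in which errors are checked are all assumptions; only the variant names come from the table.

```rust
/// Hypothetical error type: only the variant names are taken from the
/// parse-error table; the enum name itself is an assumption.
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingPrefix,
    EmptyPath,
    EmptySegmentName,
    EmptyRankName,
    AmbiguousRank,
}

/// Splits one segment into (name, optional rank).
/// The precedence between error conditions is an assumption of this sketch.
fn parse_segment(s: &str) -> Result<(&str, Option<&str>), ParseError> {
    let mut parts = s.split('@');
    let name = parts.next().unwrap_or("");
    let rank = parts.next();
    // A third piece means the segment contained more than one `@`.
    if parts.next().is_some() {
        return Err(ParseError::AmbiguousRank);
    }
    if name.is_empty() {
        return Err(ParseError::EmptySegmentName);
    }
    match rank {
        // `name@` leaves an empty rank after the separator.
        Some("") => Err(ParseError::EmptyRankName),
        other => Ok((name, other)),
    }
}

/// Parses a full metadata value into its segments.
fn parse_path(value: &str) -> Result<Vec<(&str, Option<&str>)>, ParseError> {
    let rest = value
        .strip_prefix("taxonomy:/")
        .ok_or(ParseError::MissingPrefix)?;
    if rest.is_empty() {
        return Err(ParseError::EmptyPath);
    }
    // Consecutive `/` produce an empty piece, which parse_segment rejects.
    rest.split('/').map(parse_segment).collect()
}
```

With these helpers, `parse_path("taxonomy:/a//b")` yields `Err(ParseError::EmptySegmentName)` and `parse_segment("name@")` yields `Err(ParseError::EmptyRankName)`. In this sketch a trailing `/` is also reported as an empty segment name, which the table does not specify.
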
```rust
seg.name()              // &str
seg.rank()              // Option<&str>
seg.to_string()         // "name" or "name@rank"
TaxSegment::parse(s)    // Result
```

### `TaxPath`

```rust
TaxPath::parse(s)             // Result
path.segments()               // &[TaxSegment]
path.depth()                  // usize: number of segments
path.is_ancestor_of(&other)   // bool: prefix match by name, ranks ignored
path.name_at_rank("genus")    // Option<&str>
path.to_string()              // reconstructs "taxonomy:/…"
```

`is_ancestor_of` compares segment **names** only. Rank annotations are informational and do not affect the ancestry relation.

```rust
let a: TaxPath = "taxonomy:/Enterobacteriaceae@family/Escherichia@genus".parse()?;
let b: TaxPath = "taxonomy:/Enterobacteriaceae@family/Escherichia@genus/Escherichia coli@species".parse()?;

assert!(a.is_ancestor_of(&b));    // a is a name-prefix of b
assert!(!b.is_ancestor_of(&a));   // b is longer than a, so it cannot be a prefix
assert!(a.is_ancestor_of(&a));    // equal ⇒ ancestor

assert_eq!(b.name_at_rank("species"), Some("Escherichia coli"));
assert_eq!(b.name_at_rank("genus"), Some("Escherichia"));
assert_eq!(b.name_at_rank("order"), None);
```

---

## Integration with `GenomeInfo`

At index load time, every metadata value is inspected once:

- Starts with `taxonomy:/` → parsed into `TaxPath`, stored in `genome.taxonomy`.
- Otherwise → kept as-is in `genome.meta`.

```rust
struct GenomeInfo {
    label: String,
    meta: HashMap<String, String>,       // plain text metadata
    taxonomy: HashMap<String, TaxPath>,  // parsed taxonomy metadata
}
```

The raw string is not duplicated. `TaxPath::to_string()` reconstructs the original value losslessly for serialisation.

---

## Predicate operators (in `filter` / `select`)

Path predicates use the `~` / `!~` operators. The **stored value** always starts with `/` (rooted path); the **query pattern** does not need to.

### Path pattern syntax

| Pattern | Semantics |
|---------|-----------|
| `A/B` | contiguous sub-path A then B, anywhere in the value |
| `/A/B` | value starts with A then B (start-anchored) |
| `A/B$` | value ends with A then B (end-anchored) |
| `/A/B$` | value is exactly A then B (fully anchored) |
| `A@x/B` | A with class `x` followed by B with any class |
| `A@x/B@y` | A with class `x` followed by B with class `y` |

A segment pattern without `@` matches the segment name regardless of its stored class.

### Rank-aware queries

```
key@rank=value
```

| Predicate form | Semantics |
|----------------|-----------|
| `key@rank=value` | genome's `key` has `value` at rank `rank` |
| `key@rank!=value` | does not |
| `key@rank=v1\|v2` | value at `rank` is `v1` or `v2` |

`~` combined with `@rank` on the key (e.g. `key@genus~pattern`) is not defined and is rejected at parse time.
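
The anchoring and class rules from the path pattern table can be sketched as a plain segment-wise matcher. This is a minimal illustration under stated assumptions, not the crate's implementation: `Seg`, `split_segments`, `seg_matches` and `path_matches` are hypothetical names, the stored value is passed without its `taxonomy:` prefix, and a trailing `$` is always read as the end anchor (so names ending in `$` are not handled).

```rust
/// One (name, optional class) pair, used for stored and pattern segments alike.
/// Hypothetical representation for this sketch only.
type Seg<'a> = (&'a str, Option<&'a str>);

/// Splits "A@x/B" into [("A", Some("x")), ("B", None)].
fn split_segments(s: &str) -> Vec<Seg<'_>> {
    s.split('/')
        .map(|seg| match seg.split_once('@') {
            Some((name, class)) => (name, Some(class)),
            None => (seg, None),
        })
        .collect()
}

/// Names must be equal; the class is compared only when the pattern gives one.
fn seg_matches(pat: Seg<'_>, stored: Seg<'_>) -> bool {
    pat.0 == stored.0 && (pat.1.is_none() || pat.1 == stored.1)
}

/// `value` is the rooted stored path, e.g. "/A@x/B@y".
fn path_matches(pattern: &str, value: &str) -> bool {
    // A leading `/` anchors the match at the first stored segment.
    let (start_anchored, pattern) = match pattern.strip_prefix('/') {
        Some(rest) => (true, rest),
        None => (false, pattern),
    };
    // A trailing `$` anchors the match at the last stored segment.
    let (end_anchored, pattern) = match pattern.strip_suffix('$') {
        Some(rest) => (true, rest),
        None => (false, pattern),
    };
    let pat = split_segments(pattern);
    let stored = split_segments(value.trim_start_matches('/'));
    if pat.len() > stored.len() {
        return false;
    }
    let last_start = stored.len() - pat.len();
    // Try every allowed offset for a contiguous run of matching segments.
    (0..=last_start)
        .filter(|&i| !start_anchored || i == 0)
        .filter(|&i| !end_anchored || i == last_start)
        .any(|i| pat.iter().zip(&stored[i..]).all(|(p, s)| seg_matches(*p, *s)))
}
```

For the stored value `/Enterobacteriaceae@family/Escherichia@genus/Escherichia coli@species`, this sketch accepts `Escherichia/Escherichia coli` and `Escherichia coli$`, and rejects `/Escherichia` because the first stored segment is `Enterobacteriaceae`.
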