- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
3.5 KiB
Semantic Description of obijoin Package
The obijoin package enables efficient, declarative sequence joins in biological data pipelines. Built on OBITools4’s streaming architecture, it supports left-outer joins between sequence datasets using user-defined attribute keys — ideal for merging paired-end reads, annotating amplicons with metadata, or enriching references.
Core Components & Functionalities
IndexedSequenceSlice
A composite structure combining a biological sequence slice (BioSequenceSlice) with precomputed indices. Each index maps attribute values (e.g., "sample=S1", "barcode=ATGC") to sets of matching sequence indices. Enables sublinear-time filtering via key-based intersection.
Get(keys...)
Performs multi-key intersection queries across indexes: returns sequences satisfying all provided attribute constraints (e.g., Get("sample=S1", "barcode=ATGC")). Keys must match exactly; supports arbitrary string attributes via GetStringAttribute().
BuildIndexedSequenceSlice()
Constructs the index structure in O(n) time by scanning sequences once and grouping them per attribute. Accepts a BioSequenceSlice and returns an IndexedSequenceSlice. Handles any annotation attribute supported by the sequence system.
MakeJoinWorker()
Returns a functional SeqWorker implementing join logic:
- For each input sequence, extracts join keys (e.g.,
sample,barcode) from annotations. - Uses the index to find matching partner sequences (
join_with). - Outputs one sequence per match, copying original data and enriching it with partner annotations.
- Optionally updates ID/sequence/quality fields only if corresponding flags (
--update-id, etc.) are enabled.
CLIJoinSequences()
Top-level CLI entry point:
- Reads primary input (stdin or file).
- Loads secondary dataset (
--join-with), builds index viaBuildIndexedSequenceSlice(). - Applies join using worker from
MakeJoinWorker()with flags (--by,-i/-s/-q). - Integrates seamlessly into OBITools4’s streaming iterator model.
Join Semantics
| Feature | Behavior |
|---|---|
| Join type | Left outer join (primary dataset fully preserved) |
| Key matching | Exact string equality; no regex/fuzzy logic implied |
| Updates | Controlled by flags: -i/--update-id, -s/--update-sequence, -q/--update-quality |
| Metadata handling | Partner annotations are appended unless fields are overwritten |
CLI Options
-j/--join-with(required): Path to secondary sequence file (FASTA/FASTQ/TAB).-b/--by: Join key mapping, e.g."id=id"or"sample=well". Defaults to["id"].-i/--update-id: Replace sequence identifiers with partner values.-s/--update-sequence: Overwrite nucleotide/amino acid sequences from partners.-q/--update-quality: Replace quality scores (FASTQ only).
Usage Example
obijoin -i input.fastq \
--join-with annotations.tsv \
--by "id=name" \
-i -s
→ Joins input.fastq with TSV annotations, matching on id == name; updates IDs and sequences.
Design Principles
- Efficiency: Indexing avoids repeated full scans; uses optimized
obiutils.Set[int]for fast intersection. - Extensibility: Works with any annotation attribute supported by
BioSequence. - Modularity: CLI logic is configuration-only — no I/O or core algorithms embedded.
- Composability: Extends
obiconvert.OptionSet(); inherits standard format options (-f,-o) and follows OBITools4 CLI conventions.