2026-04-07 08:36:50 +02:00
# Semantic Description of `obijoin` Package
The `obijoin` package enables efficient, declarative sequence joins in biological data pipelines. Built on OBITools4's streaming architecture, it supports left-outer joins between sequence datasets using user-defined attribute keys, which suits it to merging paired-end reads, annotating amplicons with metadata, or enriching references.
## Core Components & Functionalities
### `IndexedSequenceSlice`
A composite structure combining a biological sequence slice (`BioSequenceSlice`) with precomputed indices. Each index maps attribute values (e.g., `"sample=S1"`, `"barcode=ATGC"`) to sets of matching sequence indices. Enables sublinear-time filtering via key-based intersection.
### `Get(keys...)`
Performs multi-key *intersection* queries across indexes: returns sequences satisfying **all** provided attribute constraints (e.g., `Get("sample=S1", "barcode=ATGC")`). Keys must match *exactly*; supports arbitrary string attributes via `GetStringAttribute()`.
### `BuildIndexedSequenceSlice()`
Constructs the index structure in **O(*n*)** time by scanning sequences once and grouping them per attribute. Accepts a `BioSequenceSlice` and returns an `IndexedSequenceSlice`. Handles any annotation attribute supported by the sequence system.
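The indexing and lookup described above can be sketched in plain Go. This is a minimal illustration, not the package's implementation: `record`, `indexedSlice`, `buildIndex`, and `get` are hypothetical names, a built-in map of positions stands in for `obiutils.Set[int]`, and keying the index by `"attribute=value"` strings is an assumption taken from the examples in this document.

```go
package main

import (
	"fmt"
	"sort"
)

// record is a hypothetical stand-in for a sequence carrying string annotations.
type record struct {
	ID    string
	Attrs map[string]string
}

// indexedSlice pairs the records with an index from "attribute=value"
// to the set of positions of the records holding that value.
type indexedSlice struct {
	Records []record
	Index   map[string]map[int]struct{}
}

// buildIndex scans the records once and groups their positions per attribute value.
func buildIndex(records []record) indexedSlice {
	index := make(map[string]map[int]struct{})
	for pos, rec := range records {
		for attr, value := range rec.Attrs {
			key := attr + "=" + value
			if index[key] == nil {
				index[key] = make(map[int]struct{})
			}
			index[key][pos] = struct{}{}
		}
	}
	return indexedSlice{Records: records, Index: index}
}

// get returns the records that satisfy every constraint (set intersection),
// in their original order. Constraints must match exactly.
func (s indexedSlice) get(constraints ...string) []record {
	if len(constraints) == 0 {
		return nil
	}
	var positions []int
	// Start from the candidates of the first constraint and keep only
	// those also present in the sets of all remaining constraints.
	for pos := range s.Index[constraints[0]] {
		keep := true
		for _, c := range constraints[1:] {
			if _, ok := s.Index[c][pos]; !ok {
				keep = false
				break
			}
		}
		if keep {
			positions = append(positions, pos)
		}
	}
	sort.Ints(positions) // map iteration order is random; restore input order
	out := make([]record, 0, len(positions))
	for _, pos := range positions {
		out = append(out, s.Records[pos])
	}
	return out
}

func main() {
	idx := buildIndex([]record{
		{ID: "r1", Attrs: map[string]string{"sample": "S1", "barcode": "ATGC"}},
		{ID: "r2", Attrs: map[string]string{"sample": "S1", "barcode": "GGCC"}},
		{ID: "r3", Attrs: map[string]string{"sample": "S2", "barcode": "ATGC"}},
	})
	for _, rec := range idx.get("sample=S1", "barcode=ATGC") {
		fmt.Println(rec.ID)
	}
}
```

Only the candidates of the first constraint are visited during a lookup, so the cost depends on the size of that set rather than on the size of the whole dataset.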
### `MakeJoinWorker()`
Returns a functional `SeqWorker` implementing join logic:
- For each input sequence, extracts join keys (e.g., `sample`, `barcode`) from annotations.
- Uses the index to find matching partner sequences (`join_with`).
- Outputs one sequence per match, copying original data and enriching it with partner annotations.
- Optionally updates ID/sequence/quality fields *only if* corresponding flags (`--update-id`, etc.) are enabled.
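The worker steps listed above can be sketched as a closure over a pre-grouped partner table. This is a simplified illustration under stated assumptions: `record`, `joinOptions`, and `makeJoinWorker` are hypothetical names rather than the package's API, the join uses a single key, and the choice that partner annotations win on a name conflict is an assumption, since this document does not spell out the conflict rule.

```go
package main

import "fmt"

// record is a hypothetical stand-in for an annotated sequence.
type record struct {
	ID       string
	Sequence string
	Attrs    map[string]string
}

// joinOptions mirrors the update flags described in this document.
type joinOptions struct {
	UpdateID       bool
	UpdateSequence bool
}

// makeJoinWorker returns a function that joins one input record against
// partner records grouped by the value of the join attribute `by`.
func makeJoinWorker(by string, partners map[string][]record, opts joinOptions) func(record) []record {
	return func(in record) []record {
		value, ok := in.Attrs[by]
		matches := partners[value]
		if !ok || len(matches) == 0 {
			// Left outer join: an unmatched input passes through unchanged.
			return []record{in}
		}
		out := make([]record, 0, len(matches))
		for _, p := range matches {
			// Copy the original data into a fresh record, one per match.
			joined := record{
				ID:       in.ID,
				Sequence: in.Sequence,
				Attrs:    make(map[string]string, len(in.Attrs)+len(p.Attrs)),
			}
			for k, v := range in.Attrs {
				joined.Attrs[k] = v
			}
			// Assumption: partner annotations overwrite same-named input ones.
			for k, v := range p.Attrs {
				joined.Attrs[k] = v
			}
			// ID and sequence change only when the matching flag is set.
			if opts.UpdateID {
				joined.ID = p.ID
			}
			if opts.UpdateSequence {
				joined.Sequence = p.Sequence
			}
			out = append(out, joined)
		}
		return out
	}
}

func main() {
	partners := map[string][]record{
		"S1": {{ID: "p1", Sequence: "GGGG", Attrs: map[string]string{"sample": "S1", "site": "lake"}}},
	}
	worker := makeJoinWorker("sample", partners, joinOptions{UpdateID: true})
	for _, rec := range worker(record{ID: "r1", Sequence: "ACGT", Attrs: map[string]string{"sample": "S1"}}) {
		fmt.Println(rec.ID, rec.Sequence, rec.Attrs["site"])
	}
}
```

Returning a slice lets one input yield several outputs when the key matches several partners.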
### `CLIJoinSequences()`
Top-level CLI entry point:
- Reads primary input (stdin or file).
- Loads secondary dataset (`--join-with`), builds index via `BuildIndexedSequenceSlice()`.
- Applies join using worker from `MakeJoinWorker()` with flags (`--by`, `-i/-s/-q`).
- Integrates into OBITools4's streaming iterator model.
## Join Semantics
| Feature | Behavior |
|--------|----------|
| **Join type** | Left outer join (primary dataset fully preserved) |
| **Key matching** | Exact string equality; no regex/fuzzy logic implied |
| **Updates** | Controlled by flags: `-i/--update-id`, `-s/--update-sequence`, `-q/--update-quality` |
| **Metadata handling** | Partner annotations are appended unless fields are overwritten |
## CLI Options
- `-j/--join-with` *(required)*: Path to secondary sequence file (FASTA/FASTQ/TAB).
- `-b/--by`: Join key mapping, e.g. `"id=id"` or `"sample=well"`. Defaults to `["id"]`.
- `-i/--update-id`: Replace sequence identifiers with partner values.
- `-s/--update-sequence`: Overwrite nucleotide/amino acid sequences from partners.
- `-q/--update-quality`: Replace quality scores (FASTQ only).
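To illustrate the `--by` key mapping, here is a minimal sketch of how specifications such as `"id=name"` could be turned into key pairs. `keyPair` and `parseBy` are hypothetical names. Two points are assumptions rather than facts from the source: that the left-hand name refers to the primary dataset and the right-hand name to the partner, and that a bare name such as `"id"` is used on both sides.

```go
package main

import (
	"fmt"
	"strings"
)

// keyPair names the attribute to read on each side of the join.
type keyPair struct {
	Left  string // attribute in the primary dataset (assumed)
	Right string // attribute in the --join-with dataset (assumed)
}

// parseBy converts --by specifications into key pairs.
// An empty list falls back to the documented default, ["id"].
func parseBy(specs []string) []keyPair {
	if len(specs) == 0 {
		specs = []string{"id"}
	}
	pairs := make([]keyPair, 0, len(specs))
	for _, spec := range specs {
		left, right, found := strings.Cut(spec, "=")
		if !found {
			// A bare name is assumed to apply to both datasets.
			right = left
		}
		pairs = append(pairs, keyPair{
			Left:  strings.TrimSpace(left),
			Right: strings.TrimSpace(right),
		})
	}
	return pairs
}

func main() {
	for _, p := range parseBy([]string{"id=name", "sample"}) {
		fmt.Printf("primary %q joins partner %q\n", p.Left, p.Right)
	}
}
```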
## Usage Example
```bash
obijoin --join-with annotations.tsv \
--by "id=name" \
-i -s \
input.fastq
```
→ Joins `input.fastq` with TSV annotations, matching on `id == name`; updates IDs and sequences.
## Design Principles
- **Efficiency**: Indexing avoids repeated full scans; uses optimized `obiutils.Set[int]` for fast intersection.
- **Extensibility**: Works with any annotation attribute supported by `BioSequence`.
- **Modularity**: CLI logic is configuration-only — no I/O or core algorithms embedded.
- **Composability**: Extends `obiconvert.OptionSet()`; inherits standard format options (`-f`, `-o`) and follows OBITools4 CLI conventions.