mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
# Semantic Description of `obijoin` Package

The `obijoin` package implements declarative joins between sequence datasets in biological data pipelines. Built on OBITools4's streaming architecture, it performs left outer joins on user-defined attribute keys, which suits tasks such as merging paired-end reads, annotating amplicons with metadata, or enriching reference sequences.
## Core Components & Functionalities

### `IndexedSequenceSlice`

A composite structure combining a biological sequence slice (`BioSequenceSlice`) with precomputed indices. Each index maps attribute values (e.g., `"sample=S1"`, `"barcode=ATGC"`) to sets of matching sequence indices. Filtering therefore does not need to scan the whole slice: it intersects the sets selected for each key.
### `Get(keys...)`

Performs multi-key *intersection* queries across the indices: returns the sequences satisfying **all** provided attribute constraints (e.g., `Get("sample=S1", "barcode=ATGC")`). Keys must match *exactly*; arbitrary string attributes are supported via `GetStringAttribute()`.
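The intersection query can be sketched as follows. This is a minimal illustration, not the package's implementation: `intSet`, `index`, and `get` are hypothetical stand-ins, and the index is assumed to hold one value-to-positions map per join key, in key order.

```go
package main

import (
	"fmt"
	"sort"
)

// intSet is a hypothetical stand-in for a set of sequence positions.
type intSet map[int]struct{}

// index holds one value -> positions map per join key, in key order.
type index []map[string]intSet

// get returns the sorted positions that match every key value.
// A value that is absent from its map yields an empty result.
func get(idx index, values ...string) []int {
	if len(values) == 0 || len(values) != len(idx) {
		return nil
	}
	// Start from a copy of the positions matching the first key.
	result := make(intSet)
	for p := range idx[0][values[0]] {
		result[p] = struct{}{}
	}
	// Drop every position that the remaining keys do not also select.
	for i := 1; i < len(values); i++ {
		candidates := idx[i][values[i]]
		for p := range result {
			if _, ok := candidates[p]; !ok {
				delete(result, p)
			}
		}
	}
	positions := make([]int, 0, len(result))
	for p := range result {
		positions = append(positions, p)
	}
	sort.Ints(positions)
	return positions
}

func main() {
	idx := index{
		{"S1": {0: {}, 1: {}}, "S2": {2: {}}},     // index on "sample"
		{"ATGC": {1: {}, 2: {}}, "GGCC": {0: {}}}, // index on "barcode"
	}
	fmt.Println(get(idx, "S1", "ATGC")) // prints [1]
}
```

Only position 1 carries both `sample=S1` and `barcode=ATGC`, so it is the only survivor of the intersection.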
### `BuildIndexedSequenceSlice()`

Constructs the index structure in **O(*n*)** time by scanning sequences once and grouping them per attribute. Accepts a `BioSequenceSlice` and returns an `IndexedSequenceSlice`. Handles any annotation attribute supported by the sequence system.
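A single-pass build of this kind might look like the sketch below. The `record` type and `buildIndex` function are hypothetical simplifications (string annotations only, positions stored in slices rather than sets); the real function works on `BioSequenceSlice`.

```go
package main

import "fmt"

// record is a hypothetical, simplified sequence: an ID plus string annotations.
type record struct {
	ID    string
	Annot map[string]string
}

// buildIndex scans the records once. For each join key it groups the
// positions of the records by the value they carry for that key.
// Records lacking a key are simply not indexed for that key.
func buildIndex(records []record, keys []string) []map[string][]int {
	idx := make([]map[string][]int, len(keys))
	for i := range idx {
		idx[i] = make(map[string][]int)
	}
	for pos, r := range records {
		for i, k := range keys {
			if v, ok := r.Annot[k]; ok {
				idx[i][v] = append(idx[i][v], pos)
			}
		}
	}
	return idx
}

func main() {
	records := []record{
		{ID: "r1", Annot: map[string]string{"sample": "S1"}},
		{ID: "r2", Annot: map[string]string{"sample": "S2"}},
		{ID: "r3", Annot: map[string]string{"sample": "S1"}},
	}
	idx := buildIndex(records, []string{"sample"})
	fmt.Println(idx[0]["S1"], idx[0]["S2"]) // prints [0 2] [1]
}
```

Each record is visited once per key, so the cost in this sketch is linear in the number of records for a fixed set of keys.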
### `MakeJoinWorker()`

Returns a `SeqWorker` closure that implements the join logic:
- For each input sequence, extracts the join keys (e.g., `sample`, `barcode`) from its annotations.
- Uses the index to find the matching partner sequences (`join_with`).
- Outputs one sequence per match, copying the original data and enriching it with the partner's annotations.
- Updates the ID, sequence, or quality fields *only if* the corresponding flags (`--update-id`, etc.) are enabled.
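The steps above can be sketched as a closure over a partner lookup table. Everything here is illustrative: `record` and `makeJoinWorker` are hypothetical, a single join key is used, and two behaviours are assumptions rather than documented facts: unmatched records pass through unchanged (consistent with a left outer join), and partner annotations win when both sides define the same key.

```go
package main

import "fmt"

// record is a hypothetical, simplified sequence record.
type record struct {
	ID, Sequence string
	Annot        map[string]string
}

// makeJoinWorker returns a closure that joins one primary record against
// partners grouped by the value of a single join key.
func makeJoinWorker(key string, partners map[string][]record, updateID, updateSeq bool) func(record) []record {
	return func(r record) []record {
		v, ok := r.Annot[key]
		if !ok || len(partners[v]) == 0 {
			// Assumed left outer behaviour: unmatched records pass through.
			return []record{r}
		}
		out := make([]record, 0, len(partners[v]))
		for _, p := range partners[v] {
			// Copy the primary record, then layer the partner annotations on top.
			m := record{ID: r.ID, Sequence: r.Sequence, Annot: map[string]string{}}
			for k, val := range r.Annot {
				m.Annot[k] = val
			}
			for k, val := range p.Annot {
				m.Annot[k] = val // assumption: partner value wins on collision
			}
			if updateID {
				m.ID = p.ID
			}
			if updateSeq && p.Sequence != "" {
				m.Sequence = p.Sequence
			}
			out = append(out, m)
		}
		return out
	}
}

func main() {
	partners := map[string][]record{
		"S1": {{ID: "meta1", Annot: map[string]string{"sample": "S1", "site": "lake"}}},
	}
	join := makeJoinWorker("sample", partners, false, false)
	hit := join(record{ID: "r1", Sequence: "ACGT", Annot: map[string]string{"sample": "S1"}})
	miss := join(record{ID: "r2", Sequence: "TTGA", Annot: map[string]string{"sample": "S9"}})
	fmt.Println(hit[0].ID, hit[0].Annot["site"], len(miss), miss[0].ID) // prints r1 lake 1 r2
}
```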
### `CLIJoinSequences()`

Top-level CLI entry point:
- Reads the primary input (stdin or file).
- Loads the secondary dataset (`--join-with`) and builds its index via `BuildIndexedSequenceSlice()`.
- Applies the join using the worker from `MakeJoinWorker()`, configured by the flags (`--by`, `-i/-s/-q`).
- Runs within OBITools4's streaming iterator model.
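To illustrate how a worker fits a streaming model, the sketch below pushes records through a worker using Go channels. This only approximates the idea: OBITools4 has its own iterator types, and `record`, `worker`, `source`, and `apply` are hypothetical names introduced for this example.

```go
package main

import "fmt"

// record is a hypothetical, minimal sequence record.
type record struct{ ID string }

// worker maps one input record to zero or more output records.
type worker func(record) []record

// source emits the given records on a channel, then closes it.
func source(records []record) <-chan record {
	out := make(chan record)
	go func() {
		defer close(out)
		for _, r := range records {
			out <- r
		}
	}()
	return out
}

// apply streams records from in through w and emits every result,
// so the primary input is never held in memory as a whole.
func apply(in <-chan record, w worker) <-chan record {
	out := make(chan record)
	go func() {
		defer close(out)
		for r := range in {
			for _, o := range w(r) {
				out <- o
			}
		}
	}()
	return out
}

func main() {
	// A toy worker that emits each record plus one derived copy.
	dup := func(r record) []record { return []record{r, {ID: r.ID + "_copy"}} }
	for r := range apply(source([]record{{ID: "a"}, {ID: "b"}}), dup) {
		fmt.Println(r.ID)
	}
}
```

In this arrangement only the secondary dataset needs to be fully loaded and indexed; the primary records flow through one at a time.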
## Join Semantics

| Feature | Behavior |
|---------|----------|
| **Join type** | Left outer join (primary dataset fully preserved) |
| **Key matching** | Exact string equality; no regex/fuzzy logic implied |
| **Updates** | Controlled by flags: `-i/--update-id`, `-s/--update-sequence`, `-q/--update-quality` |
| **Metadata handling** | Partner annotations are appended unless fields are overwritten |
## CLI Options

- `-j/--join-with` *(required)*: Path to secondary sequence file (FASTA/FASTQ/TAB).
- `-b/--by`: Join key mapping, e.g. `"id=id"` or `"sample=well"`. Defaults to `["id"]`.
- `-i/--update-id`: Replace sequence identifiers with partner values.
- `-s/--update-sequence`: Overwrite nucleotide/amino acid sequences from partners.
- `-q/--update-quality`: Replace quality scores (FASTQ only).
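The `--by` mapping syntax can be parsed along the following lines. This is a sketch under assumptions: `keyPair` and `parseBy` are hypothetical names, the left-hand name is assumed to refer to the primary dataset and the right-hand name to the partner, and a bare name such as `id` is assumed to mean the same attribute on both sides.

```go
package main

import (
	"fmt"
	"strings"
)

// keyPair names the join attribute on each side of the join.
// Which side is "left" is an assumption of this sketch.
type keyPair struct{ Left, Right string }

// parseBy turns specs such as "sample=well" or "id" into key pairs.
// An empty spec list falls back to the documented default, ["id"].
func parseBy(specs []string) []keyPair {
	if len(specs) == 0 {
		specs = []string{"id"}
	}
	pairs := make([]keyPair, 0, len(specs))
	for _, s := range specs {
		left, right, found := strings.Cut(s, "=")
		if !found {
			right = left // bare name: same attribute on both sides
		}
		pairs = append(pairs, keyPair{
			Left:  strings.TrimSpace(left),
			Right: strings.TrimSpace(right),
		})
	}
	return pairs
}

func main() {
	fmt.Println(parseBy([]string{"sample=well", "id"})) // prints [{sample well} {id id}]
	fmt.Println(parseBy(nil))                           // prints [{id id}]
}
```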
## Usage Example

```bash
# -i is the --update-id flag, so the input file is passed as a positional argument
obijoin --join-with annotations.tsv \
        --by "id=name" \
        -i -s \
        input.fastq
```

→ Joins `input.fastq` with the TSV annotations, matching on `id == name`; updates IDs and sequences.
## Design Principles

- **Efficiency**: Indexing avoids repeated full scans; uses `obiutils.Set[int]` for fast intersection.
- **Extensibility**: Works with any annotation attribute supported by `BioSequence`.
- **Modularity**: CLI logic is configuration-only, with no I/O or core algorithms embedded.
- **Composability**: Extends `obiconvert.OptionSet()`; inherits standard format options (`-f`, `-o`) and follows OBITools4 CLI conventions.