mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
8c7017a99d
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
63 lines
3.5 KiB
Markdown
63 lines
3.5 KiB
Markdown
# Semantic Description of `obijoin` Package
|
||
|
||
The `obijoin` package enables efficient, declarative sequence joins in biological data pipelines. Built on OBITools4’s streaming architecture, it supports left-outer joins between sequence datasets using user-defined attribute keys — ideal for merging paired-end reads, annotating amplicons with metadata, or enriching references.
|
||
|
||
## Core Components & Functionalities
|
||
|
||
### `IndexedSequenceSlice`
|
||
A composite structure combining a biological sequence slice (`BioSequenceSlice`) with precomputed indices. Each index maps attribute values (e.g., `"sample=S1"`, `"barcode=ATGC"`) to sets of matching sequence indices. Enables sublinear-time filtering via key-based intersection.
|
||
|
||
### `Get(keys...)`
|
||
Performs multi-key *intersection* queries across indexes: returns sequences satisfying **all** provided attribute constraints (e.g., `Get("sample=S1", "barcode=ATGC")`). Keys must match *exactly*; supports arbitrary string attributes via `GetStringAttribute()`.
|
||
|
||
### `BuildIndexedSequenceSlice()`
|
||
Constructs the index structure in **O(*n*)** time by scanning sequences once and grouping them per attribute. Accepts a `BioSequenceSlice` and returns an `IndexedSequenceSlice`. Handles any annotation attribute supported by the sequence system.
|
||
|
||
### `MakeJoinWorker()`
|
||
Returns a functional `SeqWorker` implementing join logic:
|
||
- For each input sequence, extracts join keys (e.g., `sample`, `barcode`) from annotations.
|
||
- Uses the index to find matching partner sequences (`join_with`).
|
||
- Outputs one sequence per match, copying original data and enriching it with partner annotations.
|
||
- Optionally updates ID/sequence/quality fields *only if* corresponding flags (`--update-id`, etc.) are enabled.
|
||
|
||
### `CLIJoinSequences()`
|
||
Top-level CLI entry point:
|
||
- Reads primary input (stdin or file).
|
||
- Loads secondary dataset (`--join-with`), builds index via `BuildIndexedSequenceSlice()`.
|
||
- Applies join using worker from `MakeJoinWorker()` with flags (`--by`, `-i/-s/-q`).
|
||
- Integrates seamlessly into OBITools4’s streaming iterator model.
|
||
|
||
## Join Semantics
|
||
|
||
| Feature | Behavior |
|
||
|--------|----------|
|
||
| **Join type** | Left outer join (primary dataset fully preserved) |
|
||
| **Key matching** | Exact string equality; no regex/fuzzy logic implied |
|
||
| **Updates** | Controlled by flags: `-i/--update-id`, `-s/--update-sequence`, `-q/--update-quality` |
|
||
| **Metadata handling** | Partner annotations are appended unless fields are overwritten |
|
||
|
||
## CLI Options
|
||
|
||
- `-j/--join-with` *(required)*: Path to secondary sequence file (FASTA/FASTQ/TAB).
|
||
- `-b/--by`: Join key mapping, e.g. `"id=id"` or `"sample=well"`. Defaults to `["id"]`.
|
||
- `-i/--update-id`: Replace sequence identifiers with partner values.
|
||
- `-s/--update-sequence`: Overwrite nucleotide/amino acid sequences from partners.
|
||
- `-q/--update-quality`: Replace quality scores (FASTQ only).
|
||
|
||
## Usage Example
|
||
|
||
```bash
|
||
obijoin -i input.fastq \
|
||
--join-with annotations.tsv \
|
||
--by "id=name" \
|
||
-i -s
|
||
```
|
||
→ Joins `input.fastq` with TSV annotations, matching on `id == name`; updates IDs and sequences.
|
||
|
||
## Design Principles
|
||
|
||
- **Efficiency**: Indexing avoids repeated full scans; uses optimized `obiutils.Set[int]` for fast intersection.
|
||
- **Extensibility**: Works with any annotation attribute supported by `BioSequence`.
|
||
- **Modularity**: CLI logic is configuration-only — no I/O or core algorithms embedded.
|
||
- **Composability**: Extends `obiconvert.OptionSet()`; inherits standard format options (`-f`, `-o`) and follows OBITools4 CLI conventions.
|