obitools4/autodoc/docmd/pkg_obitools_obijoin.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

# Semantic Description of `obijoin` Package
The `obijoin` package enables efficient, declarative sequence joins in biological data pipelines. Built on OBITools4's streaming architecture, it supports left-outer joins between sequence datasets using user-defined attribute keys. Typical uses include merging paired-end reads, annotating amplicons with metadata, and enriching references.
## Core Components & Functionalities
### `IndexedSequenceSlice`
A composite structure combining a biological sequence slice (`BioSequenceSlice`) with precomputed indices. Each index maps attribute values (e.g., `"sample=S1"`, `"barcode=ATGC"`) to sets of matching sequence indices. Enables sublinear-time filtering via key-based intersection.
### `Get(keys...)`
Performs multi-key *intersection* queries across indexes: returns sequences satisfying **all** provided attribute constraints (e.g., `Get("sample=S1", "barcode=ATGC")`). Keys must match *exactly*; supports arbitrary string attributes via `GetStringAttribute()`.
### `BuildIndexedSequenceSlice()`
Constructs the index structure in **O(*n*)** time by scanning sequences once and grouping them per attribute. Accepts a `BioSequenceSlice` and returns an `IndexedSequenceSlice`. Handles any annotation attribute supported by the sequence system.
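The indexing and intersection steps described above can be sketched as follows. This is a minimal illustration under simplified assumptions, not the package's actual code: `Seq`, `Index`, `buildIndex`, and `get` are hypothetical names, and a plain `map[int]struct{}` stands in for the integer set type.

```go
package main

import (
	"fmt"
	"sort"
)

// Seq is a hypothetical, simplified stand-in for a biological sequence:
// an identifier plus string annotations.
type Seq struct {
	ID    string
	Attrs map[string]string
}

// Index maps attribute name -> attribute value -> set of slice positions.
// map[int]struct{} plays the role of an integer set in this sketch.
type Index map[string]map[string]map[int]struct{}

// buildIndex scans the sequences once and groups their positions
// by the value of each requested attribute.
func buildIndex(seqs []Seq, attrs ...string) Index {
	idx := make(Index, len(attrs))
	for _, a := range attrs {
		idx[a] = make(map[string]map[int]struct{})
	}
	for pos, s := range seqs {
		for _, a := range attrs {
			v, ok := s.Attrs[a]
			if !ok {
				continue // sequences lacking the attribute are not indexed for it
			}
			if idx[a][v] == nil {
				idx[a][v] = make(map[int]struct{})
			}
			idx[a][v][pos] = struct{}{}
		}
	}
	return idx
}

// get returns the sorted positions of sequences that satisfy every
// attribute=value constraint (set intersection across attributes).
func get(idx Index, constraints map[string]string) []int {
	var result map[int]struct{}
	for attr, val := range constraints {
		matches := idx[attr][val] // nil (empty) if attr or val is unknown
		if result == nil {
			// First constraint: start from a copy of its match set.
			result = make(map[int]struct{}, len(matches))
			for p := range matches {
				result[p] = struct{}{}
			}
			continue
		}
		// Later constraints: drop positions that do not match.
		for p := range result {
			if _, ok := matches[p]; !ok {
				delete(result, p)
			}
		}
	}
	out := make([]int, 0, len(result))
	for p := range result {
		out = append(out, p)
	}
	sort.Ints(out)
	return out
}

func main() {
	seqs := []Seq{
		{ID: "r1", Attrs: map[string]string{"sample": "S1", "barcode": "ATGC"}},
		{ID: "r2", Attrs: map[string]string{"sample": "S1", "barcode": "GGCC"}},
		{ID: "r3", Attrs: map[string]string{"sample": "S2", "barcode": "ATGC"}},
	}
	idx := buildIndex(seqs, "sample", "barcode")
	fmt.Println(get(idx, map[string]string{"sample": "S1", "barcode": "ATGC"})) // [0]
}
```

Building the index costs one pass over the data; each query then touches only the sets for the requested values rather than the whole slice.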
### `MakeJoinWorker()`
Returns a functional `SeqWorker` implementing join logic:
- For each input sequence, extracts join keys (e.g., `sample`, `barcode`) from annotations.
- Uses the index to find matching partner sequences (`join_with`).
- Outputs one sequence per match, copying original data and enriching it with partner annotations.
- Optionally updates ID/sequence/quality fields *only if* corresponding flags (`--update-id`, etc.) are enabled.
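The worker logic listed above might look roughly like this. It is a sketch under stated assumptions, not the real `SeqWorker` signature: `Seq`, `joinOptions`, and `makeJoinWorker` are hypothetical names, a single join key is used, unmatched inputs are assumed to pass through unchanged (one reading of "left outer join"), and partner annotations are assumed to overwrite same-named primary annotations.

```go
package main

import "fmt"

// Seq is a hypothetical, simplified sequence record.
type Seq struct {
	ID, Sequence string
	Attrs        map[string]string
}

// joinOptions mirrors two of the update flags (illustrative names).
type joinOptions struct {
	UpdateID, UpdateSequence bool
}

// makeJoinWorker returns a function that joins one primary sequence
// against the partners whose partnerKey value equals its primaryKey value.
func makeJoinWorker(partners []Seq, primaryKey, partnerKey string, opt joinOptions) func(Seq) []Seq {
	// Group partners by join-key value once, up front.
	byKey := make(map[string][]Seq)
	for _, p := range partners {
		if v, ok := p.Attrs[partnerKey]; ok {
			byKey[v] = append(byKey[v], p)
		}
	}
	return func(s Seq) []Seq {
		key, ok := s.Attrs[primaryKey]
		matches := byKey[key]
		if !ok || len(matches) == 0 {
			// Assumption: an unmatched primary sequence is emitted unchanged.
			return []Seq{s}
		}
		out := make([]Seq, 0, len(matches))
		for _, p := range matches {
			// Copy the primary record so the input is never mutated.
			joined := Seq{ID: s.ID, Sequence: s.Sequence, Attrs: make(map[string]string)}
			for k, v := range s.Attrs {
				joined.Attrs[k] = v
			}
			// Assumption: partner annotations win on name collisions.
			for k, v := range p.Attrs {
				joined.Attrs[k] = v
			}
			if opt.UpdateID {
				joined.ID = p.ID
			}
			if opt.UpdateSequence {
				joined.Sequence = p.Sequence
			}
			out = append(out, joined)
		}
		return out
	}
}

func main() {
	partners := []Seq{
		{ID: "w1", Sequence: "ACGT", Attrs: map[string]string{"well": "A1", "site": "lake"}},
	}
	worker := makeJoinWorker(partners, "sample", "well", joinOptions{UpdateID: true})
	in := Seq{ID: "r1", Sequence: "TTTT", Attrs: map[string]string{"sample": "A1"}}
	for _, s := range worker(in) {
		fmt.Println(s.ID, s.Sequence, s.Attrs["site"]) // w1 TTTT lake
	}
}
```

Returning a closure keeps the partner lookup table private to the worker, so the same function can be applied to each sequence of a stream without rebuilding it.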
### `CLIJoinSequences()`
Top-level CLI entry point:
- Reads primary input (stdin or file).
- Loads secondary dataset (`--join-with`), builds index via `BuildIndexedSequenceSlice()`.
- Applies join using worker from `MakeJoinWorker()` with flags (`--by`, `-i/-s/-q`).
- Integrates seamlessly into OBITools4's streaming iterator model.
## Join Semantics
| Feature | Behavior |
|--------|----------|
| **Join type** | Left outer join (primary dataset fully preserved) |
| **Key matching** | Exact string equality; no regex/fuzzy logic implied |
| **Updates** | Controlled by flags: `-i/--update-id`, `-s/--update-sequence`, `-q/--update-quality` |
| **Metadata handling** | Partner annotations are appended unless fields are overwritten |
## CLI Options
- `-j/--join-with` *(required)*: Path to secondary sequence file (FASTA/FASTQ/TAB).
- `-b/--by`: Join key mapping, e.g. `"id=id"` or `"sample=well"`. Defaults to `["id"]`.
- `-i/--update-id`: Replace sequence identifiers with partner values.
- `-s/--update-sequence`: Overwrite nucleotide/amino acid sequences from partners.
- `-q/--update-quality`: Replace quality scores (FASTQ only).
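To illustrate the `--by` key mapping, here is a minimal sketch of how such values could be parsed. `parseBy` is a hypothetical helper, not part of the package; it assumes that the left-hand side names the attribute in the primary dataset, the right-hand side names the attribute in the `--join-with` dataset, and a bare key such as `id` is shorthand for `id=id`.

```go
package main

import (
	"fmt"
	"strings"
)

// parseBy turns --by values such as "sample=well" into a
// primary-attribute -> partner-attribute mapping (hypothetical helper).
func parseBy(specs []string) (map[string]string, error) {
	if len(specs) == 0 {
		specs = []string{"id"} // documented default
	}
	keys := make(map[string]string, len(specs))
	for _, spec := range specs {
		left, right, found := strings.Cut(spec, "=")
		if !found {
			right = left // assumption: bare "id" means "id=id"
		}
		if left == "" || right == "" {
			return nil, fmt.Errorf("invalid join key %q", spec)
		}
		keys[left] = right
	}
	return keys, nil
}

func main() {
	keys, err := parseBy([]string{"sample=well", "id"})
	if err != nil {
		panic(err)
	}
	fmt.Println(keys["sample"], keys["id"]) // well id
}
```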
## Usage Example
```bash
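# The primary input is passed as a positional argument (or on stdin);
# -i is the short form of --update-id, not an input-file option.
obijoin --join-with annotations.tsv \
        --by "id=name" \
        -i -s \
        input.fastq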
```
→ Joins `input.fastq` with TSV annotations, matching on `id == name`; updates IDs and sequences.
## Design Principles
- **Efficiency**: Indexing avoids repeated full scans; uses optimized `obiutils.Set[int]` for fast intersection.
- **Extensibility**: Works with any annotation attribute supported by `BioSequence`.
- **Modularity**: CLI logic is configuration-only — no I/O or core algorithms embedded.
- **Composability**: Extends `obiconvert.OptionSet()`; inherits standard format options (`-f`, `-o`) and follows OBITools4 CLI conventions.