

Semantic Description of obijoin Package

The obijoin package performs declarative joins between sequence datasets in biological data pipelines. Built on OBITools4's streaming architecture, it supports left outer joins on user-defined attribute keys, which suits tasks such as merging paired-end reads, annotating amplicons with metadata, or enriching references.

Core Components & Functionalities

IndexedSequenceSlice

A composite structure combining a biological sequence slice (BioSequenceSlice) with precomputed indices. Each index maps attribute values (e.g., "sample=S1", "barcode=ATGC") to sets of matching sequence indices. Enables sublinear-time filtering via key-based intersection.

Get(keys...)

Performs multi-key intersection queries across indexes: returns sequences satisfying all provided attribute constraints (e.g., Get("sample=S1", "barcode=ATGC")). Keys must match exactly; supports arbitrary string attributes via GetStringAttribute().
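
The multi-key lookup described above can be sketched as a set intersection. This is a minimal illustration, not the package's code: the types `intSet` and `indexedSlice`, the method `get`, and the helper `demoSlice` are hypothetical stand-ins, and keying the index by "attribute=value" strings is an assumption taken from the examples in this document.

```go
package main

import (
	"fmt"
	"sort"
)

// intSet is a hypothetical stand-in for a set of sequence positions.
type intSet map[int]struct{}

// indexedSlice pairs sequence identifiers with an index that maps
// "attribute=value" to the positions of the sequences carrying that value.
type indexedSlice struct {
	ids   []string
	index map[string]intSet
}

// get returns the identifiers of the sequences that match every key.
// It starts from the set of the first key and intersects it with the others.
func (s indexedSlice) get(keys ...string) []string {
	if len(keys) == 0 {
		return nil
	}
	current := s.index[keys[0]]
	for _, k := range keys[1:] {
		other := s.index[k]
		next := intSet{}
		for pos := range current {
			if _, ok := other[pos]; ok {
				next[pos] = struct{}{}
			}
		}
		current = next
	}
	positions := make([]int, 0, len(current))
	for pos := range current {
		positions = append(positions, pos)
	}
	sort.Ints(positions) // map iteration order is random; sort for stable output
	out := make([]string, 0, len(positions))
	for _, pos := range positions {
		out = append(out, s.ids[pos])
	}
	return out
}

// demoSlice builds a three-sequence example with two indexed attributes.
func demoSlice() indexedSlice {
	return indexedSlice{
		ids: []string{"seq1", "seq2", "seq3"},
		index: map[string]intSet{
			"sample=S1":    {0: {}, 1: {}},
			"sample=S2":    {2: {}},
			"barcode=ATGC": {1: {}, 2: {}},
		},
	}
}

func main() {
	s := demoSlice()
	fmt.Println(s.get("sample=S1", "barcode=ATGC"))
}
```

A key that is absent from the index yields an empty set, so the whole query returns no sequences, which matches the exact-match rule stated above.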

BuildIndexedSequenceSlice()

Constructs the index structure in O(n) time by scanning sequences once and grouping them per attribute. Accepts a BioSequenceSlice and returns an IndexedSequenceSlice. Handles any annotation attribute supported by the sequence system.
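
A single-pass index build of this kind might look like the sketch below. The `record` type and the `buildIndex` function are hypothetical simplifications; the real function takes a BioSequenceSlice, and its exact signature is not shown in this document. Positions are stored in plain slices here for brevity, where the package is described as using sets.

```go
package main

import "fmt"

// record is a hypothetical, simplified sequence:
// an identifier plus string annotations.
type record struct {
	id          string
	annotations map[string]string
}

// buildIndex scans the records once and, for each requested attribute,
// groups record positions under the key "attribute=value".
func buildIndex(records []record, attributes ...string) map[string][]int {
	index := make(map[string][]int)
	for pos, r := range records {
		for _, attr := range attributes {
			value, ok := r.annotations[attr]
			if !ok {
				continue // records lacking the attribute are not indexed under it
			}
			key := attr + "=" + value
			index[key] = append(index[key], pos)
		}
	}
	return index
}

func main() {
	records := []record{
		{id: "seq1", annotations: map[string]string{"sample": "S1"}},
		{id: "seq2", annotations: map[string]string{"sample": "S2"}},
		{id: "seq3", annotations: map[string]string{"sample": "S1"}},
	}
	index := buildIndex(records, "sample")
	fmt.Println(index["sample=S1"], index["sample=S2"])
}
```

Each record is visited once, so the cost grows linearly with the number of sequences for a fixed number of indexed attributes.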

MakeJoinWorker()

Returns a functional SeqWorker implementing join logic:

  • For each input sequence, extracts join keys (e.g., sample, barcode) from annotations.
  • Uses the index to find matching partner sequences (join_with).
  • Outputs one sequence per match, copying original data and enriching it with partner annotations.
  • Updates the ID, sequence, and quality fields only when the corresponding flags (--update-id, --update-sequence, --update-quality) are enabled.
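
The steps above can be sketched as a closure that captures the partner lookup. Everything here is illustrative: `record`, `worker`, and `makeJoinWorker` are hypothetical names, the join uses a single attribute, and letting partner annotations win on a name conflict is an assumption rather than documented behavior. Unmatched inputs are passed through unchanged, following the left outer join semantics this document describes.

```go
package main

import "fmt"

// record is a hypothetical, simplified sequence.
type record struct {
	id          string
	annotations map[string]string
}

// worker maps one input record to one or more output records.
type worker func(record) []record

// makeJoinWorker returns a worker that joins on a single attribute.
// partners maps a value of that attribute to the matching partner records.
func makeJoinWorker(by string, partners map[string][]record) worker {
	return func(in record) []record {
		matches := partners[in.annotations[by]]
		if len(matches) == 0 {
			// Left outer join: keep the unmatched input unchanged.
			return []record{in}
		}
		out := make([]record, 0, len(matches))
		for _, p := range matches {
			// Copy the input so the outputs do not share one annotation map.
			merged := record{id: in.id, annotations: map[string]string{}}
			for k, v := range in.annotations {
				merged.annotations[k] = v
			}
			for k, v := range p.annotations {
				merged.annotations[k] = v // assumption: partner wins on conflict
			}
			out = append(out, merged)
		}
		return out
	}
}

func main() {
	partners := map[string][]record{
		"S1": {{id: "p1", annotations: map[string]string{"site": "lake"}}},
	}
	join := makeJoinWorker("sample", partners)
	inputs := []record{
		{id: "r1", annotations: map[string]string{"sample": "S1"}},
		{id: "r2", annotations: map[string]string{"sample": "S9"}},
	}
	for _, r := range inputs {
		for _, out := range join(r) {
			fmt.Println(out.id, out.annotations)
		}
	}
}
```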

CLIJoinSequences()

Top-level CLI entry point:

  • Reads primary input (stdin or file).
  • Loads secondary dataset (--join-with), builds index via BuildIndexedSequenceSlice().
  • Applies join using worker from MakeJoinWorker() with flags (--by, -i/-s/-q).
  • Integrates with OBITools4's streaming iterator model.

Join Semantics

| Feature | Behavior |
| --- | --- |
| Join type | Left outer join (primary dataset fully preserved) |
| Key matching | Exact string equality; no regex/fuzzy logic implied |
| Updates | Controlled by flags: -i/--update-id, -s/--update-sequence, -q/--update-quality |
| Metadata handling | Partner annotations are appended unless fields are overwritten |
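
As a rough illustration of the flag-controlled updates, the sketch below replaces a field only when its flag is set. The `record` and `updateOptions` types and the `applyUpdates` function are hypothetical; quality scores are shown as a plain string for simplicity, which is not how OBITools4 stores them.

```go
package main

import "fmt"

// record is a hypothetical, simplified sequence with the three updatable fields.
type record struct {
	id       string
	sequence string
	quality  string
}

// updateOptions mirrors the three update flags described in this document.
type updateOptions struct {
	id       bool // -i/--update-id
	sequence bool // -s/--update-sequence
	quality  bool // -q/--update-quality
}

// applyUpdates returns a copy of input in which each field is replaced by
// the partner's value only when the matching option is enabled.
func applyUpdates(input, partner record, opts updateOptions) record {
	out := input
	if opts.id {
		out.id = partner.id
	}
	if opts.sequence {
		out.sequence = partner.sequence
	}
	if opts.quality {
		out.quality = partner.quality
	}
	return out
}

func main() {
	input := record{id: "r1", sequence: "acgt", quality: "IIII"}
	partner := record{id: "p1", sequence: "ttgg", quality: "FFFF"}
	fmt.Println(applyUpdates(input, partner, updateOptions{id: true, sequence: true}))
}
```

With no flag set, the primary record keeps its own ID, sequence, and quality, and only annotations are enriched.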

CLI Options

  • -j/--join-with (required): Path to secondary sequence file (FASTA/FASTQ/TAB).
  • -b/--by: Join key mapping, e.g. "id=id" or "sample=well". Defaults to ["id"].
  • -i/--update-id: Replace sequence identifiers with partner values.
  • -s/--update-sequence: Overwrite nucleotide/amino acid sequences from partners.
  • -q/--update-quality: Replace quality scores (FASTQ only).
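
The --by mapping can be read as pairs of attribute names, one per dataset. The sketch below shows one way such values could be parsed. The `keyPair` type and `parseBy` function are hypothetical, and two points are assumptions: that the left-hand name refers to the primary dataset and the right-hand name to the secondary one, and that a bare key such as "id" means the same attribute on both sides.

```go
package main

import (
	"fmt"
	"strings"
)

// keyPair names the attribute to compare in each dataset.
type keyPair struct {
	primary   string // attribute read from the primary input (assumed left side)
	secondary string // attribute read from the --join-with dataset (assumed right side)
}

// parseBy turns values such as "sample=well" into key pairs.
// With no value given, it falls back to the documented default, ["id"].
func parseBy(specs []string) []keyPair {
	if len(specs) == 0 {
		specs = []string{"id"}
	}
	pairs := make([]keyPair, 0, len(specs))
	for _, spec := range specs {
		left, right, found := strings.Cut(spec, "=")
		if !found {
			right = left // assumption: a bare key applies to both datasets
		}
		pairs = append(pairs, keyPair{
			primary:   strings.TrimSpace(left),
			secondary: strings.TrimSpace(right),
		})
	}
	return pairs
}

func main() {
	fmt.Println(parseBy([]string{"sample=well"}))
	fmt.Println(parseBy(nil))
}
```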

Usage Example

obijoin --join-with annotations.tsv \
        --by "id=name" \
        -i -s \
        input.fastq

→ Joins input.fastq with TSV annotations, matching on id == name; updates IDs and sequences.

Design Principles

  • Efficiency: Indexing avoids repeated full scans; uses optimized obiutils.Set[int] for fast intersection.
  • Extensibility: Works with any annotation attribute supported by BioSequence.
  • Modularity: CLI logic is configuration-only — no I/O or core algorithms embedded.
  • Composability: Extends obiconvert.OptionSet(); inherits standard format options (-f, -o) and follows OBITools4 CLI conventions.