obilandmark Package: Semantic Documentation

The obilandmark package implements a reference-free, landmark-based embedding and indexing pipeline for biological sequences within the OBITools4 ecosystem. It represents each sequence of a library by its distances to a set of selected landmark sequences, which gives a scalable, low-dimensional representation suited to clustering, classification, and fast similarity search in metabarcoding or metagenomics workflows.

Public Functionalities

MapOnLandmarkSequences(library, landmarks)

Projects each sequence in a biological library onto Euclidean coordinates using pre-selected landmark sequences.

  • Input: A sequence library (e.g., FASTA/FASTQ iterator) and a list of landmark sequences.
  • Algorithm: Computes pairwise alignment scores between each sequence and all landmarks using FastLCSScore, converting them into distance-based coordinates.
  • Output: A matrix of shape (n_sequences, n_landmarks) where each row is a point in landmark space (seqworld).
  • Features: Parallel execution (configurable workers), progress bar, and buffered streaming for large datasets.
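The projection described above can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions, not the package's code: `lcsLength` is a plain dynamic-programming stand-in for FastLCSScore, the conversion `max(len) - LCS` is an assumed distance, and `mapOnLandmarks`, its string-based inputs, and the worker pool are hypothetical simplifications of the real iterator-based API.

```go
package main

import (
	"fmt"
	"sync"
)

// lcsLength returns the length of the longest common subsequence of a and b.
// Plain two-row dynamic programming; a stand-in for FastLCSScore (assumption).
func lcsLength(a, b string) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			switch {
			case a[i-1] == b[j-1]:
				curr[j] = prev[j-1] + 1
			case prev[j] >= curr[j-1]:
				curr[j] = prev[j]
			default:
				curr[j] = curr[j-1]
			}
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

// lcsDistance turns an LCS length into a distance: the number of positions of
// the longer sequence not covered by the common subsequence (assumed formula).
func lcsDistance(a, b string) float64 {
	longest := len(a)
	if len(b) > longest {
		longest = len(b)
	}
	return float64(longest - lcsLength(a, b))
}

// mapOnLandmarks returns an (n_sequences x n_landmarks) matrix of distances.
// A fixed pool of workers pulls row indices from a channel.
func mapOnLandmarks(library, landmarks []string, workers int) [][]float64 {
	if workers < 1 {
		workers = 1
	}
	coords := make([][]float64, len(library))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				row := make([]float64, len(landmarks))
				for j, lm := range landmarks {
					row[j] = lcsDistance(library[i], lm)
				}
				coords[i] = row // each index is handled once, so rows never collide
			}
		}()
	}
	for i := range library {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return coords
}

func main() {
	library := []string{"ACGTACGT", "ACGTTCGT", "TTTTGGGG"}
	landmarks := []string{"ACGTACGT", "TTTTGGGG"}
	for i, row := range mapOnLandmarks(library, landmarks, 2) {
		fmt.Println(library[i], row)
	}
}
```

Each row of the result is one sequence's position in landmark space. A sequence that is itself a landmark has a zero in the corresponding column.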

CLISelectLandmarkSequences(options)

Main orchestration function that performs landmark selection, embedding, and annotation in a single CLI-driven pipeline.

  • Landmark Selection: Iteratively selects n landmarks (default: 200) via k-means clustering on initial random samples, minimizing cluster inertia over two refinement rounds.
  • Embedding: Calls MapOnLandmarkSequences() to compute coordinates for all sequences in the library.
  • Annotation: Augments each sequence record with:
    • landmark_coord: full coordinate vector (distances to all landmarks),
    • optional landmark_id for sequences selected as landmark representatives.
  • Taxonomic Indexing: If a taxonomy is provided, builds a GeomIndexSequence for each sequence, enabling efficient taxonomic search via geometric proximity.
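One refinement round of the landmark selection can be pictured as follows. The sketch assumes that sequences have already been embedded as coordinate vectors and that, after k-means, the sequence closest to each centroid becomes a landmark. That choice, along with the names `kmeans` and `pickLandmarks` and the fixed starting centers, is an assumption made for illustration; the package's own clustering utilities are internal and may differ.

```go
package main

import (
	"fmt"
	"math"
)

// sqDist is the squared Euclidean distance between two vectors of equal length.
func sqDist(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

// nearest returns the index of the center closest to p and the squared distance to it.
func nearest(p []float64, centers [][]float64) (int, float64) {
	best, bestD := 0, math.Inf(1)
	for k, c := range centers {
		if d := sqDist(p, c); d < bestD {
			best, bestD = k, d
		}
	}
	return best, bestD
}

// kmeans runs a fixed number of Lloyd iterations from the given starting
// centers and returns the final centers with the resulting inertia
// (sum of squared distances of points to their nearest center).
func kmeans(points, start [][]float64, iters int) ([][]float64, float64) {
	centers := make([][]float64, len(start))
	for k := range start {
		centers[k] = append([]float64(nil), start[k]...)
	}
	for it := 0; it < iters; it++ {
		sums := make([][]float64, len(centers))
		counts := make([]int, len(centers))
		for k := range sums {
			sums[k] = make([]float64, len(centers[k]))
		}
		for _, p := range points {
			k, _ := nearest(p, centers)
			counts[k]++
			for i, v := range p {
				sums[k][i] += v
			}
		}
		for k := range centers {
			if counts[k] == 0 {
				continue // an empty cluster keeps its previous center
			}
			for i := range centers[k] {
				centers[k][i] = sums[k][i] / float64(counts[k])
			}
		}
	}
	inertia := 0.0
	for _, p := range points {
		_, d := nearest(p, centers)
		inertia += d
	}
	return centers, inertia
}

// pickLandmarks returns, for each center, the index of the closest point.
func pickLandmarks(points, centers [][]float64) []int {
	picked := make([]int, len(centers))
	for k, c := range centers {
		best := math.Inf(1)
		for i, p := range points {
			if d := sqDist(p, c); d < best {
				best, picked[k] = d, i
			}
		}
	}
	return picked
}

func main() {
	points := [][]float64{{0, 0}, {0, 2}, {10, 0}, {10, 2}, {10, 1}}
	start := [][]float64{{0, 0}, {10, 0}}
	centers, inertia := kmeans(points, start, 2)
	fmt.Println(pickLandmarks(points, centers), inertia)
}
```

In a full pipeline the picked indices would identify the sequences to use as landmarks for the next embedding, and the inertia would be compared between rounds to check that the refinement actually improves the clustering.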

LandmarkOptionSet(options)

Registers CLI options specific to landmark configuration.

  • Adds the -n / --center flag (type: integer), defaulting to 200, controlling the number of landmarks selected.

OptionSet(options)

Aggregates option sets required by the pipeline:

  • Input/output handling (obiconvert.InputOptionSet, .OutputOptionSet)
  • Taxonomy loading support (optional, via obioptions.LoadTaxonomyOptionSet)
  • Landmark-specific options (LandmarkOptionSet)

CLINCenter()

Returns the integer value of -n / --center, i.e., the number of landmarks to select (default: 200).
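The register/aggregate/read pattern formed by LandmarkOptionSet, OptionSet, and CLINCenter can be sketched with Go's standard `flag` package. OBITools4 uses its own option-parsing layer, so this is a swapped-in technique, and the lowercase names `landmarkOptionSet`, `optionSet`, `cliNCenter`, and `parseArgs` are hypothetical counterparts, not the package's functions.

```go
package main

import (
	"flag"
	"fmt"
)

// nCenter holds the number of landmarks; 200 mirrors the documented default.
var nCenter = 200

// landmarkOptionSet registers the landmark-specific flag under both names.
// The standard flag package has no alias mechanism, so the same variable
// is bound twice.
func landmarkOptionSet(fs *flag.FlagSet) {
	fs.IntVar(&nCenter, "center", 200, "number of landmarks to select")
	fs.IntVar(&nCenter, "n", 200, "shorthand for --center")
}

// optionSet aggregates several registration functions on one flag set.
func optionSet(fs *flag.FlagSet, sets ...func(*flag.FlagSet)) {
	for _, register := range sets {
		register(fs)
	}
}

// cliNCenter returns the value parsed from the command line, or the default.
func cliNCenter() int {
	return nCenter
}

// parseArgs builds a fresh flag set, registers the options, and parses args.
func parseArgs(args []string) (int, error) {
	fs := flag.NewFlagSet("obilandmark-sketch", flag.ContinueOnError)
	optionSet(fs, landmarkOptionSet)
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return cliNCenter(), nil
}

func main() {
	n, err := parseArgs([]string{"--center", "50"})
	fmt.Println(n, err)
}
```

Keeping registration separate from the accessor means the rest of the pipeline only ever calls the accessor and never touches the parser directly.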

Design Principles

  • Scalability: Uses buffered I/O and parallel workers to process large sequence libraries efficiently.
  • Modularity: Integrates with core OBITools4 modules (obialign, obistats, obiutils, obitax, obirefidx).
  • CLI-first: Designed for batch processing pipelines; defaults ensure sensible behavior out of the box.
  • Extensibility: The annotation schema leaves room for future fields (e.g., landmark_class, currently present only as commented-out stubs).

Use Cases

  • Reference-free sequence clustering and dimensionality reduction
  • Fast similarity search via geometric indexing in taxonomic space
  • Preprocessing for machine learning on sequence libraries (e.g., classification, anomaly detection)
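As an illustration of the similarity-search use case, the sketch below finds the nearest neighbours of a query by a linear scan over coordinate vectors such as those stored in landmark_coord. It is not the package's geometric index: `nearestNeighbours` is a hypothetical helper, and a brute-force scan is used only because it is the simplest correct baseline.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// nearestNeighbours returns the indices of the k rows of coords closest to
// query in Euclidean distance, nearest first. Ties keep their input order.
func nearestNeighbours(coords [][]float64, query []float64, k int) []int {
	idx := make([]int, len(coords))
	dist := make([]float64, len(coords))
	for i, row := range coords {
		idx[i] = i
		s := 0.0
		for j := range row {
			d := row[j] - query[j]
			s += d * d
		}
		dist[i] = math.Sqrt(s)
	}
	sort.SliceStable(idx, func(a, b int) bool {
		return dist[idx[a]] < dist[idx[b]]
	})
	if k < 0 {
		k = 0
	}
	if k > len(idx) {
		k = len(idx)
	}
	return idx[:k]
}

func main() {
	// Three sequences described by their distances to two landmarks.
	coords := [][]float64{{0, 6}, {1, 6}, {6, 0}}
	fmt.Println(nearestNeighbours(coords, []float64{1, 5}, 2))
}
```

Because every sequence is reduced to a short vector, a query costs one pass over the matrix and needs no alignment against the whole library.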

Note: Only public interfaces are documented. Internal helpers (e.g., clustering utilities, alignment wrappers) remain implementation details.