obilandmark Package: Semantic Documentation
The obilandmark package implements a reference-free, landmark-based embedding and indexing pipeline for biological sequences within the OBITools4 ecosystem. It enables scalable, low-dimensional representation of sequence libraries by projecting them into a distance space defined by curated landmark sequences—ideal for clustering, classification, and fast similarity search in metabarcoding or metagenomics workflows.
Public Functionalities
MapOnLandmarkSequences(library, landmarks)
Projects each sequence in a biological library onto Euclidean coordinates using pre-selected landmark sequences.
- Input: A sequence library (e.g., FASTA/FASTQ iterator) and a list of landmark sequences.
- Algorithm: Computes pairwise alignment scores between each sequence and all landmarks using FastLCSScore, converting them into distance-based coordinates.
- Output: A matrix of shape (n_sequences, n_landmarks) where each row is a point in landmark space (seqworld).
- Features: Parallel execution (configurable workers), progress bar, and buffered streaming for large datasets.
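The mapping step above can be illustrated with a minimal sketch. It is not the package's code: the names lcsLength, lcsDistance, and mapOnLandmarks are hypothetical, a plain dynamic-programming LCS stands in for FastLCSScore, and the score-to-distance conversion (longest length minus LCS length) is an assumption, since this document does not state the exact formula.

```go
package main

import (
	"fmt"
	"sync"
)

// lcsLength returns the length of the longest common subsequence of a and b
// using the classic two-row dynamic programme. It is a simple stand-in for
// FastLCSScore, not its implementation.
func lcsLength(a, b string) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			if a[i-1] == b[j-1] {
				curr[j] = prev[j-1] + 1
			} else if prev[j] >= curr[j-1] {
				curr[j] = prev[j]
			} else {
				curr[j] = curr[j-1]
			}
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

// lcsDistance converts an LCS length into a distance.
// Assumption: distance = max(len(a), len(b)) - LCS length.
func lcsDistance(a, b string) int {
	longest := len(a)
	if len(b) > longest {
		longest = len(b)
	}
	return longest - lcsLength(a, b)
}

// mapOnLandmarks returns a len(library) x len(landmarks) matrix of distances,
// computing rows concurrently with the requested number of workers.
func mapOnLandmarks(library, landmarks []string, workers int) [][]int {
	if workers < 1 {
		workers = 1
	}
	coords := make([][]int, len(library))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				row := make([]int, len(landmarks))
				for j, lm := range landmarks {
					row[j] = lcsDistance(library[i], lm)
				}
				coords[i] = row // each worker writes a distinct row
			}
		}()
	}
	for i := range library {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return coords
}

func main() {
	library := []string{"ACGTACGT", "ACGTTCGT", "TTTTGGGG"}
	landmarks := []string{"ACGTACGT", "TTTTGGGG"}
	for i, row := range mapOnLandmarks(library, landmarks, 2) {
		fmt.Println(library[i], row)
	}
}
```

Each worker owns the rows it is handed, so no locking is needed on the result matrix. A sequence that is itself a landmark gets a zero in that landmark's column.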
CLISelectLandmarkSequences(options)
Main orchestration function that performs landmark selection, embedding, and annotation in a single CLI-driven pipeline.
- Landmark Selection: Iteratively selects n landmarks (default: 200) via k-means clustering on initial random samples, minimizing cluster inertia over two refinement rounds.
- Embedding: Calls MapOnLandmarkSequences() to compute coordinates for all sequences in the library.
- Annotation: Augments each sequence record with:
  - landmark_coord: full coordinate vector (distances to all landmarks),
  - optional landmark_id for sequences selected as landmark representatives.
- Taxonomic Indexing: If taxonomy is provided, builds a GeomIndexSequence per sequence, enabling efficient taxonomic search via geometric proximity.
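One way to picture the landmark refinement is the following sketch. It assumes the library has already been embedded as coordinate vectors, runs a fixed number of Lloyd-style k-means rounds from a set of seed points, and then picks the point closest to each final centre as that cluster's landmark. The function names (sqDist, nearest, refineLandmarks) and the toy data are hypothetical, and the actual selection logic in the package may differ in its details.

```go
package main

import (
	"fmt"
	"math"
)

// sqDist returns the squared Euclidean distance between a and b.
func sqDist(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

// nearest returns the index of the vector in candidates closest to p.
func nearest(p []float64, candidates [][]float64) int {
	best, bestDist := 0, math.Inf(1)
	for c, cand := range candidates {
		if d := sqDist(p, cand); d < bestDist {
			best, bestDist = c, d
		}
	}
	return best
}

// refineLandmarks runs `rounds` k-means iterations starting from the points
// at indices `seeds`, then returns, for each final centre, the index of the
// point closest to it (the landmark representative of that cluster).
func refineLandmarks(points [][]float64, seeds []int, rounds int) []int {
	centres := make([][]float64, len(seeds))
	for c, s := range seeds {
		centres[c] = append([]float64(nil), points[s]...)
	}
	for r := 0; r < rounds; r++ {
		sums := make([][]float64, len(centres))
		counts := make([]int, len(centres))
		for c := range sums {
			sums[c] = make([]float64, len(points[0]))
		}
		// Assignment step: attach every point to its closest centre.
		for _, p := range points {
			c := nearest(p, centres)
			counts[c]++
			for d, v := range p {
				sums[c][d] += v
			}
		}
		// Update step: move each non-empty centre to the mean of its points.
		for c := range centres {
			if counts[c] == 0 {
				continue // an empty cluster keeps its previous centre
			}
			for d := range centres[c] {
				centres[c][d] = sums[c][d] / float64(counts[c])
			}
		}
	}
	landmarks := make([]int, len(centres))
	for c, centre := range centres {
		landmarks[c] = nearest(centre, points)
	}
	return landmarks
}

func main() {
	points := [][]float64{{0, 0}, {1, 0}, {2, 0}, {10, 10}, {11, 10}, {12, 10}}
	fmt.Println(refineLandmarks(points, []int{0, 1}, 2)) // prints [1 4]
}
```

In the toy data the two seeds start in the same group, yet after two rounds the centres settle on the middle point of each group, which is why indices 1 and 4 are returned.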
LandmarkOptionSet(options)
Registers CLI options specific to landmark configuration.
- Adds the -n/--center flag (type: integer), defaulting to 200, controlling the number of landmarks selected.
OptionSet(options)
Aggregates option sets required by the pipeline:
- Input/output handling (obiconvert.InputOptionSet, .OutputOptionSet)
- Taxonomy loading support (optional, via obioptions.LoadTaxonomyOptionSet)
- Landmark-specific options (LandmarkOptionSet)
CLINCenter()
Returns the integer value of -n / --center, i.e., the number of landmarks to select (default: 200).
Design Principles
- Scalability: Uses buffered I/O and parallel workers to process large sequence libraries efficiently.
- Modularity: Integrates with core OBITools4 modules (obialign, obistats, obiutils, obitax, obirefidx).
- CLI-first: Designed for batch processing pipelines; defaults ensure sensible behavior out-of-the-box.
- Extensibility: Annotation schema supports future enhancements (e.g., landmark_class via commented stubs).
Use Cases
- Reference-free sequence clustering and dimensionality reduction
- Fast similarity search via geometric indexing in taxonomic space
- Preprocessing for machine learning on sequence libraries (e.g., classification, anomaly detection)
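To make the similarity-search use case concrete, here is a minimal sketch of a nearest-neighbour lookup in landmark space. It assumes coordinate vectors like those stored in landmark_coord and uses a plain linear scan with Euclidean distance; nearestNeighbour is a hypothetical helper and does not reflect how GeomIndexSequence works internally.

```go
package main

import (
	"fmt"
	"math"
)

// nearestNeighbour returns the index of the row in coords closest to query
// (Euclidean distance) together with that distance.
// It returns -1 and +Inf when coords is empty.
func nearestNeighbour(query []float64, coords [][]float64) (int, float64) {
	best, bestDist := -1, math.Inf(1)
	for i, row := range coords {
		sum := 0.0
		for d := range query {
			diff := query[d] - row[d]
			sum += diff * diff
		}
		if dist := math.Sqrt(sum); dist < bestDist {
			best, bestDist = i, dist
		}
	}
	return best, bestDist
}

func main() {
	// Each row is one reference sequence expressed as distances to two landmarks.
	coords := [][]float64{{0, 8}, {1, 7}, {8, 0}}
	idx, dist := nearestNeighbour([]float64{2, 7}, coords)
	fmt.Println(idx, dist) // prints 1 1
}
```

Because every comparison works on short numeric vectors rather than on the sequences themselves, the cost per query depends on the number of landmarks and not on sequence length.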
Note: Only public interfaces are documented. Internal helpers (e.g., clustering utilities, alignment wrappers) remain implementation details.