obitools4/autodoc/docmd/pkg_obitools_obilandmark.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

# `obilandmark` Package: Semantic Documentation
The `obilandmark` package implements a **reference-free, landmark-based embedding and indexing pipeline** for biological sequences within the OBITools4 ecosystem. It represents each sequence of a library by its distances to a set of selected landmark sequences, which gives a scalable, low-dimensional representation suited to clustering, classification, and fast similarity search in metabarcoding or metagenomics workflows.
## Public Functionalities
### `MapOnLandmarkSequences(library, landmarks)`
Projects each sequence in a biological library onto Euclidean coordinates using pre-selected landmark sequences.
- **Input**: A sequence `library` (e.g., FASTA/FASTQ iterator) and a list of landmark sequences.
- **Algorithm**: Computes pairwise alignment scores between each sequence and all landmarks using `FastLCSScore`, converting them into distance-based coordinates.
- **Output**: A matrix of shape `(n_sequences, n_landmarks)` where each row is a point in landmark space (`seqworld`).
- **Features**: Parallel execution (configurable workers), progress bar, and buffered streaming for large datasets.
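
The projection described above can be sketched in a few lines of Go. This is a minimal illustration, not the package's implementation: `lcsLength`, `lcsDistance`, and `mapOnLandmarks` are hypothetical names, a plain dynamic-programming LCS stands in for `FastLCSScore`, and the distance formula (length of the longer sequence minus the LCS length) is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// lcsLength returns the length of the longest common subsequence of a and b,
// using the classic two-row dynamic programme (a stand-in for FastLCSScore).
func lcsLength(a, b string) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			switch {
			case a[i-1] == b[j-1]:
				curr[j] = prev[j-1] + 1
			case prev[j] >= curr[j-1]:
				curr[j] = prev[j]
			default:
				curr[j] = curr[j-1]
			}
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

// lcsDistance converts the LCS length into a dissimilarity.
// Assumption: distance = length of the longer sequence - LCS length.
func lcsDistance(a, b string) float64 {
	longest := len(a)
	if len(b) > longest {
		longest = len(b)
	}
	return float64(longest - lcsLength(a, b))
}

// mapOnLandmarks returns a matrix of shape (n_sequences, n_landmarks):
// row i holds the distances from library[i] to every landmark.
func mapOnLandmarks(library, landmarks []string, workers int) [][]float64 {
	if workers < 1 {
		workers = 1
	}
	coords := make([][]float64, len(library))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				row := make([]float64, len(landmarks))
				for j, lm := range landmarks {
					row[j] = lcsDistance(library[i], lm)
				}
				coords[i] = row // each job writes a distinct row, so no lock is needed
			}
		}()
	}
	for i := range library {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return coords
}

func main() {
	library := []string{"ACGTACGT", "ACGTTCGT", "TTTTGGGG"}
	landmarks := []string{"ACGTACGT", "TTTTGGGG"}
	for i, row := range mapOnLandmarks(library, landmarks, 2) {
		fmt.Println(library[i], row)
	}
}
```

A sequence that is itself a landmark gets a zero in the corresponding column, so landmarks sit on the axes of the space they define.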
### `CLISelectLandmarkSequences(options)`
Main orchestration function that performs landmark selection, embedding, and annotation in a single CLI-driven pipeline.
- **Landmark Selection**: Iteratively selects `n` landmarks (default: 200) via k-means clustering on initial random samples, minimizing cluster inertia over two refinement rounds.
- **Embedding**: Calls `MapOnLandmarkSequences()` to compute coordinates for all sequences in the library.
- **Annotation**: Augments each sequence record with:
  - `landmark_coord`: full coordinate vector (distances to all landmarks),
  - optional `landmark_id` for sequences selected as landmark representatives.
- **Taxonomic Indexing**: If a taxonomy is provided, builds a `GeomIndexSequence` per sequence, enabling efficient taxonomic search via geometric proximity.
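
To make the landmark-selection step more concrete, the sketch below runs one k-means refinement round on points that are already expressed in landmark coordinates: assign each point to its nearest center, recompute the centers, and promote the point closest to each center as the next landmark. All function names (`assign`, `centroids`, `representatives`) are hypothetical, and the exact procedure used by the package (sampling, number of iterations, tie handling) is not specified in this document, so this should be read as an illustration of the general technique only.

```go
package main

import (
	"fmt"
	"math"
)

// sqDist returns the squared Euclidean distance between two equal-length vectors.
func sqDist(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

// assign labels each point with the index of its nearest center and returns
// the labels with the inertia (sum of squared distances to assigned centers).
func assign(points, centers [][]float64) ([]int, float64) {
	labels := make([]int, len(points))
	inertia := 0.0
	for i, p := range points {
		best, bestD := 0, math.Inf(1)
		for k, c := range centers {
			if d := sqDist(p, c); d < bestD {
				best, bestD = k, d
			}
		}
		labels[i] = best
		inertia += bestD
	}
	return labels, inertia
}

// centroids recomputes each center as the mean of its assigned points.
// A cluster without members keeps its previous center.
func centroids(points [][]float64, labels []int, old [][]float64) [][]float64 {
	sums := make([][]float64, len(old))
	counts := make([]int, len(old))
	for k := range sums {
		sums[k] = make([]float64, len(old[k]))
	}
	for i, p := range points {
		k := labels[i]
		counts[k]++
		for d := range p {
			sums[k][d] += p[d]
		}
	}
	for k := range sums {
		if counts[k] == 0 {
			copy(sums[k], old[k])
			continue
		}
		for d := range sums[k] {
			sums[k][d] /= float64(counts[k])
		}
	}
	return sums
}

// representatives returns, for each center, the index of the closest point:
// the sequence that would serve as landmark in the next round.
func representatives(points, centers [][]float64) []int {
	reps := make([]int, len(centers))
	for k, c := range centers {
		bestD := math.Inf(1)
		for i, p := range points {
			if d := sqDist(p, c); d < bestD {
				reps[k], bestD = i, d
			}
		}
	}
	return reps
}

func main() {
	points := [][]float64{{0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10}}
	centers := [][]float64{points[0], points[3]} // initial landmarks
	labels, before := assign(points, centers)
	centers = centroids(points, labels, centers)
	_, after := assign(points, centers)
	fmt.Println(labels, before, after, representatives(points, centers))
}
```

Because the centers are means rather than actual sequences, the final step maps each one back to a real library member; only real sequences can be aligned against in the next embedding pass.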
### `LandmarkOptionSet(options)`
Registers CLI options specific to landmark configuration.
- Adds the `-n` / `--center` flag (type: integer), defaulting to **200**, controlling the number of landmarks selected.
### `OptionSet(options)`
Aggregates option sets required by the pipeline:
- Input/output handling (`obiconvert.InputOptionSet`, `obiconvert.OutputOptionSet`)
- Taxonomy loading support (optional, via `obioptions.LoadTaxonomyOptionSet`)
- Landmark-specific options (`LandmarkOptionSet`)
### `CLINCenter()`
Returns the integer value of `-n / --center`, i.e., the number of landmarks to select (default: 200).
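
The option-set pattern (a registration function, an aggregating function, and a getter) can be illustrated with Go's standard `flag` package. This is a stand-in only: the option-parsing library actually used by OBITools4 is not shown in this document, and the lowercase names `landmarkOptionSet`, `optionSet`, `cliNCenter`, and `parseArgs` are hypothetical analogues of the public functions above.

```go
package main

import (
	"flag"
	"fmt"
)

// nCenter holds the requested number of landmarks (documented default: 200).
var nCenter = 200

// landmarkOptionSet registers -n and --center; both names share one variable.
func landmarkOptionSet(fs *flag.FlagSet) {
	const usage = "number of landmark sequences to select"
	fs.IntVar(&nCenter, "center", 200, usage)
	fs.IntVar(&nCenter, "n", 200, usage+" (shorthand)")
}

// optionSet aggregates the option sets a command needs. Only the landmark
// options are registered here; input/output and taxonomy option sets would
// be added in the same way.
func optionSet(fs *flag.FlagSet) {
	landmarkOptionSet(fs)
}

// cliNCenter plays the role of CLINCenter: it exposes the parsed value.
func cliNCenter() int {
	return nCenter
}

// parseArgs builds a fresh flag set, parses args, and returns the landmark count.
func parseArgs(args []string) (int, error) {
	fs := flag.NewFlagSet("obilandmark-sketch", flag.ContinueOnError)
	optionSet(fs)
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return cliNCenter(), nil
}

func main() {
	n, err := parseArgs([]string{"--center", "50"})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(n) // prints 50
}
```

Keeping the parsed value behind a getter means the rest of the pipeline never touches the flag machinery directly, which is what lets several commands reuse the same option sets.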
## Design Principles
- **Scalability**: Uses buffered I/O and parallel workers to process large sequence libraries efficiently.
- **Modularity**: Integrates with core OBITools4 modules (`obialign`, `obistats`, `obiutils`, `obitax`, `obirefidx`).
- **CLI-first**: Designed for batch processing pipelines; defaults (e.g., 200 landmarks) give sensible behavior out of the box.
- **Extensibility**: Annotation schema supports future enhancements (e.g., `landmark_class` via commented stubs).
## Use Cases
- Reference-free sequence clustering and dimensionality reduction
- Fast taxonomic similarity search via geometric indexing in landmark space
- Preprocessing for machine learning on sequence libraries (e.g., classification, anomaly detection)
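
As an illustration of similarity search in landmark space, the sketch below finds the reference whose coordinate vector is closest to a query vector. It is a brute-force scan under assumed names (`euclidean`, `nearest`) and does not reproduce the package's geometric index, which would avoid comparing the query against every reference.

```go
package main

import (
	"fmt"
	"math"
)

// euclidean returns the Euclidean distance between two equal-length vectors.
func euclidean(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return math.Sqrt(s)
}

// nearest returns the index of the reference closest to query in landmark
// space and its distance; it returns (-1, +Inf) for an empty reference set.
func nearest(query []float64, refs [][]float64) (int, float64) {
	best, bestD := -1, math.Inf(1)
	for i, r := range refs {
		if d := euclidean(query, r); d < bestD {
			best, bestD = i, d
		}
	}
	return best, bestD
}

func main() {
	// Each row is a hypothetical landmark_coord vector (two landmarks).
	refs := [][]float64{{0, 6}, {1, 5}, {6, 0}}
	idx, d := nearest([]float64{2, 5}, refs)
	fmt.Println(idx, d)
}
```

The comparison costs one pass over `n_landmarks` numbers per reference, with no alignment at query time; the alignments were paid for once, when the coordinates were computed.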
> **Note**: Only public interfaces are documented. Internal helpers (e.g., clustering utilities, alignment wrappers) remain implementation details.