
# `obilandmark` Package: Semantic Documentation

The `obilandmark` package implements a **reference-free, landmark-based embedding and indexing pipeline** for biological sequences within the OBITools4 ecosystem. It produces a scalable, low-dimensional representation of a sequence library by projecting each sequence into a distance space defined by a set of selected landmark sequences. This representation suits clustering, classification, and fast similarity search in metabarcoding or metagenomics workflows.

## Public Functionalities

### `MapOnLandmarkSequences(library, landmarks)`
Projects each sequence in a biological library onto Euclidean coordinates using pre-selected landmark sequences.
- **Input**: A sequence `library` (e.g., FASTA/FASTQ iterator) and a list of landmark sequences.
- **Algorithm**: Computes pairwise alignment scores between each sequence and all landmarks using `FastLCSScore`, converting them into distance-based coordinates.
- **Output**: A matrix of shape `(n_sequences, n_landmarks)` where each row is a point in landmark space (`seqworld`).
- **Features**: Parallel execution (configurable workers), progress bar, and buffered streaming for large datasets.
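
The projection described above can be sketched in plain Go. This is a minimal illustration, not the package's implementation: it uses a textbook LCS dynamic program in place of `FastLCSScore`, and it assumes the distance is the longer sequence length minus the LCS length. The names `lcsLength`, `lcsDistance`, and `mapOnLandmarks` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// lcsLength returns the length of the longest common subsequence of a and b,
// using the classic dynamic-programming recurrence with two rolling rows.
// It is a simple stand-in for FastLCSScore, not the real function.
func lcsLength(a, b string) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			switch {
			case a[i-1] == b[j-1]:
				curr[j] = prev[j-1] + 1
			case prev[j] >= curr[j-1]:
				curr[j] = prev[j]
			default:
				curr[j] = curr[j-1]
			}
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

// lcsDistance converts the LCS length into a distance.
// Assumption: distance = longer sequence length - LCS length.
func lcsDistance(a, b string) float64 {
	longest := len(a)
	if len(b) > longest {
		longest = len(b)
	}
	return float64(longest - lcsLength(a, b))
}

// mapOnLandmarks returns a matrix of shape (len(library), len(landmarks)).
// Row i holds the distances from library[i] to every landmark.
// Rows are computed by a pool of worker goroutines.
func mapOnLandmarks(library, landmarks []string, workers int) [][]float64 {
	if workers < 1 {
		workers = 1
	}
	coords := make([][]float64, len(library))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				row := make([]float64, len(landmarks))
				for j, lm := range landmarks {
					row[j] = lcsDistance(library[i], lm)
				}
				coords[i] = row // each worker writes a distinct row
			}
		}()
	}
	for i := range library {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return coords
}

func main() {
	library := []string{"ACGTACGT", "ACGTTCGT", "TTTTGGGG"}
	landmarks := []string{"ACGTACGT", "TTTTGGGG"}
	for i, row := range mapOnLandmarks(library, landmarks, 2) {
		fmt.Println(library[i], row)
	}
}
```

Each worker writes only to its own row of the result, so no lock is needed around the output matrix.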

### `CLISelectLandmarkSequences(options)`
Main orchestration function that performs landmark selection, embedding, and annotation in a single CLI-driven pipeline.
- **Landmark Selection**: Selects `n` landmarks (default: 200) iteratively: starting from an initial random sample, k-means clustering refines the selection over two rounds to reduce cluster inertia.
- **Embedding**: Calls `MapOnLandmarkSequences()` to compute coordinates for all sequences in the library.
- **Annotation**: Augments each sequence record with:
  - `landmark_coord`: full coordinate vector (distances to all landmarks),
  - optional `landmark_id` for sequences selected as landmark representatives.
- **Taxonomic Indexing**: If a taxonomy is provided, builds a `GeomIndexSequence` for each sequence, which enables efficient taxonomic search by geometric proximity.
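
One refinement step of the landmark selection can be sketched as follows. The clustering utilities are internal to the package, so this is only an assumed reading of the description: once k-means has produced cluster centers in landmark space, the sequence closest to each center becomes the landmark for the next round, and the inertia measures how tight the clusters are. The names `representatives` and `inertia` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// squaredDistance returns the squared Euclidean distance between two points
// of equal dimension.
func squaredDistance(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// representatives returns, for each center, the index of the closest point.
// Ties are resolved in favour of the lowest index.
func representatives(points, centers [][]float64) []int {
	reps := make([]int, len(centers))
	for c, center := range centers {
		best := math.Inf(1)
		reps[c] = -1 // stays -1 only when there are no points
		for p, point := range points {
			if d := squaredDistance(point, center); d < best {
				best, reps[c] = d, p
			}
		}
	}
	return reps
}

// inertia sums, over all points, the squared distance to the nearest center.
// A lower value means tighter clusters. It expects at least one center.
func inertia(points, centers [][]float64) float64 {
	total := 0.0
	for _, point := range points {
		best := math.Inf(1)
		for _, center := range centers {
			if d := squaredDistance(point, center); d < best {
				best = d
			}
		}
		total += best
	}
	return total
}

func main() {
	// Toy coordinates in a two-landmark space, with two cluster centers.
	points := [][]float64{{0, 0}, {1, 0}, {9, 9}, {10, 10}}
	centers := [][]float64{{0.5, 0}, {9.5, 9.5}}
	fmt.Println("representatives:", representatives(points, centers))
	fmt.Println("inertia:", inertia(points, centers))
}
```

In a full pipeline, the sequences at the returned indices would be re-used as landmarks, the library re-embedded, and the step repeated for the number of refinement rounds.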

### `LandmarkOptionSet(options)`
Registers CLI options specific to landmark configuration.
- Adds the integer option `-n` / `--center` (default: **200**), which sets the number of landmarks to select.

### `OptionSet(options)`
Aggregates option sets required by the pipeline:
- Input/output handling (`obiconvert.InputOptionSet`, `obiconvert.OutputOptionSet`)
- Taxonomy loading support (optional, via `obioptions.LoadTaxonomyOptionSet`)
- Landmark-specific options (`LandmarkOptionSet`)

### `CLINCenter()`
Returns the integer value of `-n / --center`, i.e., the number of landmarks to select (default: 200).

## Design Principles

- **Scalability**: Uses buffered I/O and parallel workers to process large sequence libraries efficiently.
- **Modularity**: Integrates with core OBITools4 modules (`obialign`, `obistats`, `obiutils`, `obitax`, `obirefidx`).
- **CLI-first**: Designed for batch processing pipelines; defaults ensure sensible behavior out-of-the-box.
- **Extensibility**: The annotation schema leaves room for future fields (e.g., `landmark_class`, currently present only as a commented-out stub).

## Use Cases

- Reference-free sequence clustering and dimensionality reduction
- Fast similarity search via geometric indexing in taxonomic space
- Preprocessing for machine learning on sequence libraries (e.g., classification, anomaly detection)

> **Note**: Only public interfaces are documented. Internal helpers (e.g., clustering utilities, alignment wrappers) remain implementation details.