Files
obitools4/autodoc/docmd/pkg_obitools_obiclust.md
T

77 lines
3.5 KiB
Markdown
Raw Normal View History

2026-04-07 08:36:50 +02:00
# `obiclust` Package: Semantic Overview
The `*obiclust*` package provides object-oriented implementations for clustering algorithms, emphasizing modularity, extensibility, and semantic clarity—while `*opicluster/obiclean*` extends this to biological sequence data (e.g., amplicons, OTUs/ASVs), integrating alignment-aware similarity and abundance-sensitive heuristics.
## Core Clustering Infrastructure (`obiclust`)
### Abstract Base Class: `Clusterer`
- Defines a unified interface for all clustering algorithms.
- Public methods:
- `fit(X, sample_weight=None)`: Learns cluster structure from data.
- `predict(X)`: Assigns each sample to the nearest cluster (returns NumPy array of labels).
- `cluster_centers_`: Immutable attribute storing learned centroids.
- Designed for subclassing: custom clusterers override `_fit()` and `_predict()`.
### Concrete Algorithms
- **`KMeans`**
- Configurable initialization: `kmeans++`, random.
- Parameters: max iterations, convergence tolerance (`tol`).
- **`HierarchicalClustering`**
- Agglomerative strategy with linkage options: `single`, `complete`, `average`.
- *(Optional extensions)* DBSCAN, GaussianMixture via composition or inheritance.
### Semantic Data Handling
- Input validation: numeric-only matrices, non-empty inputs.
- Outputs are immutable NumPy arrays (labels/centers).
- Supports per-sample weights during fitting.
### Evaluation & Validation
- Built-in metrics: Silhouette score, DaviesBouldin index, WCSS.
- Cross-validation helper (`select_k`, `tune_linkage`) for hyperparameter selection.
### Serialization & Typing
- `to_dict()` / `from_dict()`: Enables JSON persistence and reproducibility.
- Fully typed (PEP 484), Google-style docstrings, and usage examples included.
### Design Principles
- **Readability**: Method names reflect intent (e.g., `assign_clusters`, not `_step2`).
- **Separation of concerns**: Core logic decoupled from plotting, I/O, or preprocessing.
- **Minimal dependencies**: NumPy (required), SciPy (optional for metrics).
## Biological Sequence Clustering (`opicluster/obiclean`)
### Distance/Similarity Mode
- Switches between:
- **Similarity mode** (default): higher scores = more related.
- **Distance mode** (`--distance`): lower distances = closer.
### Normalization Strategies
Controls how alignment scores are scaled before clustering:
- `NoNormalization`: raw score.
- `NormalizedByShortest` (`--shortest`)
- `NormalizedByLongest` (`--longest`)
- `NormalizedByAlignment` (default, via `--alignment`) — uses aligned length.
### Clustering Strategy
- **Exact clustering** (`--exact`): optimal but computationally heavy.
- Greedy heuristic (default) for scalability.
### Sample-Aware Processing
- Groups sequences by sample origin (`--sample`, `-s`).
- Filters low-sample-count variants via `--min-sample-count`.
- Ordering options:
- By length (`--length-ordered`) or abundance (`--abundance-ordered`).
- Optional ascending sort: `--ascending-sorting`.
### Abundance Refinement
- **Ratio-based merging** (`--ratio`, `-r`): merges low-abundance sequences into high-abundance parents if their ratio ≤ threshold.
- **Head selection** (`--head`, `-H`): outputs only sequences flagged as “representative” in ≥1 sample.
### Output & Diagnostics
- **Graph export** (`--save-graph`): DAG in GraphML format (for debugging).
- **Ratio table export** (`--save-ratio`): CSV of edge abundance ratios.
- Threshold control via `--distance`, `--threshold`.
### Pipeline Integration
- Extends I/O options from `obiconvert`: seamless FASTA/FASTQ input/output, compatible with standard NGS pipelines.