mirror of
https://github.com/metabarcoding/obitools4.git
# `obiclust` Package: Semantic Overview

The `obiclust` package provides object-oriented implementations of clustering algorithms, emphasizing modularity, extensibility, and semantic clarity. `opicluster/obiclean` extends this to biological sequence data (e.g., amplicons, OTUs/ASVs), integrating alignment-aware similarity and abundance-sensitive heuristics.
## Core Clustering Infrastructure (`obiclust`)
### Abstract Base Class: `Clusterer`
- Defines a unified interface for all clustering algorithms.
- Public interface:
  - `fit(X, sample_weight=None)`: Learns cluster structure from data.
  - `predict(X)`: Assigns each sample to the nearest cluster (returns a NumPy array of labels).
  - `cluster_centers_`: Immutable attribute storing the learned centroids.
- Designed for subclassing: custom clusterers override `_fit()` and `_predict()`.
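The interface described above can be sketched as a template-method base class. This is only an illustration under assumptions: the method and attribute names come from the list above, while the `SingleCentroid` subclass and all internals are hypothetical and not taken from the package.

```python
from abc import ABC, abstractmethod
from typing import Optional

import numpy as np


class Clusterer(ABC):
    """Hypothetical base class: public fit/predict delegate to _fit/_predict."""

    def __init__(self) -> None:
        self._centers: Optional[np.ndarray] = None

    @property
    def cluster_centers_(self) -> np.ndarray:
        # Read-only property: there is no setter, so the attribute cannot be rebound.
        if self._centers is None:
            raise RuntimeError("call fit() before reading cluster_centers_")
        return self._centers

    def fit(self, X, sample_weight=None) -> "Clusterer":
        X = np.asarray(X, dtype=float)
        self._centers = np.asarray(self._fit(X, sample_weight), dtype=float)
        return self

    def predict(self, X) -> np.ndarray:
        return self._predict(np.asarray(X, dtype=float))

    @abstractmethod
    def _fit(self, X, sample_weight):
        """Return the learned centroids as a 2-D array."""

    @abstractmethod
    def _predict(self, X):
        """Return one integer label per row of X."""


class SingleCentroid(Clusterer):
    """Toy subclass (assumption): a single cluster at the weighted mean."""

    def _fit(self, X, sample_weight):
        return np.average(X, axis=0, weights=sample_weight)[None, :]

    def _predict(self, X):
        return np.zeros(len(X), dtype=int)
```

A subclass only supplies `_fit()` and `_predict()`; input conversion and state handling stay in the base class.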
### Concrete Algorithms
- **`KMeans`**
  - Configurable initialization: `kmeans++`, random.
  - Parameters: max iterations, convergence tolerance (`tol`).
- **`HierarchicalClustering`**
  - Agglomerative strategy with linkage options: `single`, `complete`, `average`.
- *(Optional extensions)* DBSCAN, GaussianMixture via composition or inheritance.
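To make the `KMeans` options concrete, here is a minimal NumPy sketch of Lloyd's algorithm with k-means++ seeding. The function names, defaults, and `seed` argument are assumptions made for illustration; the package's actual implementation may differ. The sketch assumes the data contain at least `k` distinct points.

```python
import numpy as np


def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: each new center is drawn with probability ~ D^2."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        diffs = X[:, None, :] - np.array(centers)[None]
        d2 = (diffs ** 2).sum(-1).min(axis=1)  # squared distance to nearest center
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)


def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Return (labels, centers); stops when centers move by at most `tol`."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = kmeans_pp_init(X, k, rng)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Update step: mean of each cluster; an empty cluster keeps its center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift <= tol:
            break
    return labels, centers
```

Seeding with k-means++ spreads the initial centers apart, which usually reduces the number of iterations compared with purely random initialization.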
### Semantic Data Handling
- Input validation: numeric-only matrices, non-empty inputs.
- Outputs are immutable NumPy arrays (labels/centers).
- Supports per-sample weights during fitting.
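A minimal sketch of what such validation and immutability could look like with NumPy. The helper names `validate_input` and `freeze` are hypothetical and chosen only for this example.

```python
import numpy as np


def validate_input(X):
    """Return X as a 2-D float array, rejecting empty or non-numeric input."""
    arr = np.asarray(X)
    if arr.size == 0:
        raise ValueError("input must not be empty")
    if not np.issubdtype(arr.dtype, np.number):
        raise TypeError("input must be numeric")
    if arr.ndim != 2:
        raise ValueError("input must be a 2-D matrix")
    return arr.astype(float)


def freeze(arr):
    """Return a read-only copy so callers cannot mutate results in place."""
    out = np.array(arr)  # np.array copies by default
    out.setflags(write=False)
    return out
```

Writing to a frozen array raises `ValueError`, so labels and centers handed back to the caller cannot silently drift from the fitted state.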
### Evaluation & Validation
- Built-in metrics: Silhouette score, Davies–Bouldin index, WCSS (within-cluster sum of squares).
- Cross-validation helpers (`select_k`, `tune_linkage`) for hyperparameter selection.
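For reference, two of the metrics named above can be computed directly with NumPy. The functions below follow the standard textbook definitions with Euclidean distance; they are a sketch and not the package's own code, and the signatures are assumptions.

```python
import numpy as np


def wcss(X, labels, centers):
    """Within-cluster sum of squared distances to the assigned centroid."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    return float(((X - centers[np.asarray(labels)]) ** 2).sum())


def silhouette_score(X, labels):
    """Mean silhouette coefficient (b - a) / max(a, b) over all samples."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton clusters score 0 by convention
        a = dist[i, own].sum() / (n_own - 1)  # mean distance within own cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        if max(a, b) > 0:
            scores[i] = (b - a) / max(a, b)
    return float(scores.mean())
```

Lower WCSS and a silhouette score closer to 1 both indicate tighter, better separated clusters, which is the kind of signal a helper such as `select_k` could compare across candidate values of `k`.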
### Serialization & Typing
- `to_dict()` / `from_dict()`: Enable JSON persistence and reproducibility.
- Fully typed (PEP 484), with Google-style docstrings and usage examples.
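The round trip through `to_dict()` / `from_dict()` might look like the sketch below. The `KMeansConfig` dataclass and its default values are hypothetical; only the two method names come from the text above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class KMeansConfig:
    """Hypothetical typed container for clusterer hyperparameters."""

    n_clusters: int
    init: str = "kmeans++"
    max_iter: int = 300
    tol: float = 1e-4

    def to_dict(self) -> Dict[str, Any]:
        """Return a plain dict that json.dumps can serialize."""
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "KMeansConfig":
        """Rebuild the configuration from a dict produced by to_dict()."""
        return cls(**data)


config = KMeansConfig(n_clusters=3, tol=1e-3)
payload = json.dumps(config.to_dict())
restored = KMeansConfig.from_dict(json.loads(payload))
```

Because the dictionary holds only JSON-native types, the restored object compares equal to the original, which is what makes a saved configuration reproducible.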
### Design Principles
- **Readability**: Method names reflect intent (e.g., `assign_clusters`, not `_step2`).
- **Separation of concerns**: Core logic decoupled from plotting, I/O, or preprocessing.
- **Minimal dependencies**: NumPy (required), SciPy (optional for metrics).
## Biological Sequence Clustering (`opicluster/obiclean`)
### Distance/Similarity Mode
- Switches between:
  - **Similarity mode** (default): higher scores = more related.
  - **Distance mode** (`--distance`): lower distances = closer.
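The practical consequence of the two modes is the direction of the threshold comparison. A tiny sketch, with a hypothetical helper name and signature:

```python
def is_related(value: float, threshold: float, distance_mode: bool = False) -> bool:
    """Decide whether two sequences are close enough to be linked.

    In similarity mode a pair qualifies when its score is at least the
    threshold; in distance mode when its distance is at most the threshold.
    """
    if distance_mode:
        return value <= threshold
    return value >= threshold
```

The same numeric value can therefore pass in one mode and fail in the other, so the threshold must always be read together with the mode.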
### Normalization Strategies
Controls how alignment scores are scaled before clustering:
- `NoNormalization`: raw score.
- `NormalizedByShortest` (`--shortest`): normalized by the length of the shorter sequence.
- `NormalizedByLongest` (`--longest`): normalized by the length of the longer sequence.
- `NormalizedByAlignment` (default, via `--alignment`): normalized by the aligned length.
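The four strategies differ only in the denominator applied to the raw score. The sketch below assumes normalization is a plain division by the chosen length; the enum member names and the function signature are illustrative, not the tool's API.

```python
from enum import Enum


class Normalization(Enum):
    NONE = "NoNormalization"
    SHORTEST = "NormalizedByShortest"
    LONGEST = "NormalizedByLongest"
    ALIGNMENT = "NormalizedByAlignment"


def normalize_score(raw_score, len_a, len_b, alignment_length,
                    mode=Normalization.ALIGNMENT):
    """Scale a raw alignment score according to the chosen strategy."""
    if mode is Normalization.NONE:
        return float(raw_score)
    denominator = {
        Normalization.SHORTEST: min(len_a, len_b),
        Normalization.LONGEST: max(len_a, len_b),
        Normalization.ALIGNMENT: alignment_length,
    }[mode]
    if denominator <= 0:
        raise ValueError("normalization length must be positive")
    return raw_score / denominator
```

For a raw score of 90 between sequences of length 100 and 120 aligned over 125 columns, this gives 0.9 (shortest), 0.75 (longest), and 0.72 (alignment), which shows how much the choice can move a pair relative to a fixed threshold.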
### Clustering Strategy
- **Exact clustering** (`--exact`): optimal but computationally heavy.
- **Greedy heuristic** (default): trades optimality for scalability.
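A greedy heuristic of this kind is commonly implemented as centroid-based clustering: sequences are visited in order, and each one joins the first existing center it is similar enough to, or becomes a new center. The sketch below is one such variant under stated assumptions. The `identity` function is a simplistic position-wise stand-in for a real alignment score, and nothing here is taken from the tool's source. Sequences are assumed non-empty.

```python
def identity(a: str, b: str) -> float:
    """Stand-in similarity: matching positions over the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(abundances: dict, threshold: float = 0.9) -> dict:
    """Map every sequence to its cluster center.

    Sequences are visited from most to least abundant; ties are broken
    alphabetically so the result is deterministic.
    """
    centers = []
    assignment = {}
    for seq in sorted(abundances, key=lambda s: (-abundances[s], s)):
        for center in centers:
            if identity(seq, center) >= threshold:
                assignment[seq] = center
                break
        else:
            # No center was close enough: this sequence starts a new cluster.
            centers.append(seq)
            assignment[seq] = seq
    return assignment
```

Each sequence is compared only against current centers rather than against every other sequence, which is where the scalability comes from. The price is that the result depends on the visiting order.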
### Sample-Aware Processing
- Groups sequences by sample origin (`--sample`, `-s`).
- Filters out variants found in too few samples via `--min-sample-count`.
- Ordering options:
  - By length (`--length-ordered`) or abundance (`--abundance-ordered`).
  - Optional ascending sort: `--ascending-sorting`.
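The filtering and ordering steps can be pictured with the sketch below. The record layout `(sequence, sample, count)`, the function names, and the descending default are all assumptions made for the example.

```python
from collections import defaultdict


def filter_by_sample_count(records, min_sample_count):
    """Return the sequences seen in at least `min_sample_count` distinct samples.

    `records` is an iterable of (sequence, sample, count) tuples.
    """
    samples = defaultdict(set)
    for seq, sample, _count in records:
        samples[seq].add(sample)
    return {seq for seq, seen in samples.items() if len(seen) >= min_sample_count}


def order_sequences(abundances, by="abundance", ascending=False):
    """Sort sequences by total abundance or by length."""
    if by == "abundance":
        key = abundances.get
    elif by == "length":
        key = len
    else:
        raise ValueError("by must be 'abundance' or 'length'")
    # The sequence itself is the secondary key, keeping the order deterministic.
    return sorted(abundances, key=lambda s: (key(s), s), reverse=not ascending)
```

Ordering matters for any greedy procedure, because the sequences visited first are the ones that can become cluster representatives.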
### Abundance Refinement
- **Ratio-based merging** (`--ratio`, `-r`): merges a low-abundance sequence into a high-abundance parent if their abundance ratio is ≤ the threshold.
- **Head selection** (`--head`, `-H`): outputs only sequences flagged as “representative” in at least one sample.
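The ratio test can be illustrated within a single sample as follows. This sketch assumes that a candidate parent is a more abundant sequence differing by exactly one substitution, and that the most abundant qualifying candidate is chosen. Both are simplifying assumptions for the example, not a description of the tool's actual rules.

```python
def one_mismatch(a: str, b: str) -> bool:
    """True when two equal-length sequences differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def merge_by_ratio(counts: dict, ratio: float) -> dict:
    """Map each merged (child) sequence to its chosen parent.

    A child is merged when count(child) / count(parent) <= ratio for some
    more abundant one-mismatch neighbour.
    """
    parents = {}
    for child, child_count in counts.items():
        candidates = [
            (count, seq)
            for seq, count in counts.items()
            if count > child_count
            and one_mismatch(child, seq)
            and child_count / count <= ratio
        ]
        if candidates:
            parents[child] = max(candidates)[1]  # most abundant candidate wins
    return parents
```

With counts `{"ACGT": 100, "ACGA": 4, "ACGG": 60, "TTTT": 3}` and a ratio of 0.1, only `ACGA` is merged (into `ACGT`): `ACGG` is too abundant relative to its neighbour, and `TTTT` has no one-mismatch neighbour.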
### Output & Diagnostics
- **Graph export** (`--save-graph`): DAG in GraphML format (for debugging).
- **Ratio table export** (`--save-ratio`): CSV of edge abundance ratios.
- Threshold control via `--distance`, `--threshold`.
### Pipeline Integration
- Extends I/O options from `obiconvert`: FASTA/FASTQ input and output, compatible with standard NGS pipelines.