obitools4/autodoc/docmd/pkg_obitools_obiclust.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

# `obiclust` Package: Semantic Overview
The `obiclust` package provides object-oriented implementations of clustering algorithms, with an emphasis on modularity, extensibility, and semantic clarity. `opicluster/obiclean` extends this to biological sequence data (e.g., amplicons, OTUs/ASVs) by integrating alignment-aware similarity and abundance-sensitive heuristics.
## Core Clustering Infrastructure (`obiclust`)
### Abstract Base Class: `Clusterer`
- Defines a unified interface for all clustering algorithms.
- Public interface:
  - `fit(X, sample_weight=None)`: Learns cluster structure from data.
  - `predict(X)`: Assigns each sample to the nearest cluster (returns a NumPy array of labels).
  - `cluster_centers_`: Immutable attribute storing the learned centroids.
- Designed for subclassing: custom clusterers override `_fit()` and `_predict()`.
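
The interface above can be sketched as a template-method base class. This is a minimal illustration and not the package's actual source: the `_centers` field, the choice to return `self` from `fit`, and the toy `MeanClusterer` subclass are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Optional

import numpy as np


class Clusterer(ABC):
    """Hypothetical base class: public methods wrap the private hooks."""

    def __init__(self) -> None:
        self._centers: Optional[np.ndarray] = None  # assumed storage field

    @property
    def cluster_centers_(self) -> np.ndarray:
        if self._centers is None:
            raise RuntimeError("call fit() before reading cluster_centers_")
        return self._centers

    def fit(self, X, sample_weight=None) -> "Clusterer":
        X = np.asarray(X, dtype=float)
        centers = np.array(self._fit(X, sample_weight), dtype=float)
        centers.setflags(write=False)  # read-only, matching "immutable"
        self._centers = centers
        return self  # returning self is an assumption, not stated in the doc

    def predict(self, X) -> np.ndarray:
        return self._predict(np.asarray(X, dtype=float))

    @abstractmethod
    def _fit(self, X, sample_weight):
        """Return an array of centers with shape (n_clusters, n_features)."""

    @abstractmethod
    def _predict(self, X):
        """Return one integer label per row of X."""


class MeanClusterer(Clusterer):
    """Toy subclass: a single cluster centred on the (weighted) mean."""

    def _fit(self, X, sample_weight):
        return np.average(X, axis=0, weights=sample_weight)[None, :]

    def _predict(self, X):
        return np.zeros(len(X), dtype=int)
```

A subclass only supplies the two hooks; conversion to arrays and freezing of the centroids stay in the base class.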
### Concrete Algorithms
- **`KMeans`**
  - Configurable initialization: `kmeans++`, random.
  - Parameters: max iterations, convergence tolerance (`tol`).
- **`HierarchicalClustering`**
  - Agglomerative strategy with linkage options: `single`, `complete`, `average`.
- *(Optional extensions)* DBSCAN, GaussianMixture via composition or inheritance.
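
To make the roles of the iteration cap and `tol` concrete, here is a sketch of plain Lloyd iterations, the textbook k-means loop. The function name `lloyd_kmeans` and its signature are hypothetical. Initial centers are passed in explicitly, so the `kmeans++` or random seeding mentioned above is deliberately left out.

```python
import numpy as np


def _assign(X, centers):
    # Squared Euclidean distance from every sample to every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def lloyd_kmeans(X, init_centers, max_iter=100, tol=1e-4):
    """Hypothetical helper: iterate assign/update until the centers settle."""
    X = np.asarray(X, dtype=float)
    centers = np.array(init_centers, dtype=float)
    for _ in range(max_iter):
        labels = _assign(X, centers)
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members):  # an empty cluster keeps its previous center
                new_centers[k] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift <= tol:  # total center movement is below the tolerance
            break
    # Labels are recomputed so they always match the returned centers.
    return centers, _assign(X, centers)
```

A smaller `tol` asks for more iterations before stopping; `max_iter` bounds the work when the centers keep moving.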
### Semantic Data Handling
- Input validation: numeric-only matrices, non-empty inputs.
- Outputs are immutable NumPy arrays (labels/centers).
- Supports per-sample weights during fitting.
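
The checks listed above could look roughly like the following. Both helper names (`validate_matrix`, `freeze`) are invented for this sketch, and the exact exception types are an assumption; only the NumPy calls themselves are real.

```python
import numpy as np


def validate_matrix(X, sample_weight=None):
    """Hypothetical validator: numeric, non-empty, 2-D, one weight per row."""
    X = np.asarray(X)
    if X.ndim != 2 or X.size == 0:
        raise ValueError("expected a non-empty 2-D matrix")
    if not np.issubdtype(X.dtype, np.number):
        raise TypeError("expected numeric data")
    X = X.astype(float)
    if sample_weight is None:
        w = np.ones(X.shape[0])  # unweighted fit: every sample counts once
    else:
        w = np.asarray(sample_weight, dtype=float)
        if w.shape != (X.shape[0],):
            raise ValueError("sample_weight must have one entry per row")
    return X, w


def freeze(a):
    """Return a read-only copy, leaving the caller's array writable."""
    out = np.array(a)
    out.setflags(write=False)
    return out
```

Writing to an array returned by `freeze` raises `ValueError`, which is how NumPy reports assignment to a read-only array.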
### Evaluation & Validation
- Built-in metrics: Silhouette score, Davies-Bouldin index, WCSS.
- Cross-validation helpers (`select_k`, `tune_linkage`) for hyperparameter selection.
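
Of the metrics listed, WCSS (within-cluster sum of squares) is the simplest to write down: the sum, over all samples, of the squared Euclidean distance to the assigned centroid. The function below is a sketch with a hypothetical signature; `select_k` and `tune_linkage` are not reproduced because their behaviour is not described here.

```python
import numpy as np


def wcss(X, labels, centers):
    """Within-cluster sum of squared distances to the assigned centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    diffs = X - centers[labels]  # each row minus its own cluster's center
    return float((diffs ** 2).sum())
```

Lower values mean tighter clusters, but WCSS always decreases as the number of clusters grows, so it is usually compared across several values of k and not minimised directly.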
### Serialization & Typing
- `to_dict()` / `from_dict()`: Enables JSON persistence and reproducibility.
- Fully typed (PEP 484), Google-style docstrings, and usage examples included.
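
A `to_dict()` / `from_dict()` pair that round-trips through JSON might be sketched as follows. The `KMeansConfig` class and its field names are assumptions chosen to mirror the parameters mentioned earlier, not the package's real schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class KMeansConfig:
    """Hypothetical parameter container used only for this illustration."""

    n_clusters: int
    init: str = "kmeans++"
    max_iter: int = 300
    tol: float = 1e-4

    def to_dict(self) -> Dict[str, Any]:
        # Plain built-in types only, so json.dumps can handle the result.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "KMeansConfig":
        return cls(**data)


# Round trip: object -> dict -> JSON text -> dict -> object.
original = KMeansConfig(n_clusters=3, tol=1e-6)
restored = KMeansConfig.from_dict(json.loads(json.dumps(original.to_dict())))
```

Keeping the dictionary free of NumPy types is what makes the JSON step work without a custom encoder.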
### Design Principles
- **Readability**: Method names reflect intent (e.g., `assign_clusters`, not `_step2`).
- **Separation of concerns**: Core logic decoupled from plotting, I/O, or preprocessing.
- **Minimal dependencies**: NumPy (required), SciPy (optional for metrics).
## Biological Sequence Clustering (`opicluster/obiclean`)
### Distance/Similarity Mode
- Switches between:
  - **Similarity mode** (default): higher scores = more related.
  - **Distance mode** (`--distance`): lower distances = closer.
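
The practical effect of the mode switch is that the threshold comparison flips direction. A sketch, assuming inclusive comparisons and a hypothetical `is_linked` helper (neither is confirmed by the tool's documentation):

```python
def is_linked(score: float, threshold: float, distance_mode: bool = False) -> bool:
    """Decide whether two sequences are close enough to be connected."""
    if distance_mode:
        # Distance mode: smaller is closer, so stay at or below the threshold.
        return score <= threshold
    # Similarity mode: larger is closer, so reach or exceed the threshold.
    return score >= threshold
```

The same numeric threshold therefore means opposite things in the two modes, which is worth checking when a command line is reused with `--distance` added.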
### Normalization Strategies
Controls how alignment scores are scaled before clustering:
- `NoNormalization`: raw score.
- `NormalizedByShortest` (`--shortest`)
- `NormalizedByLongest` (`--longest`)
- `NormalizedByAlignment` (default, via `--alignment`): uses the aligned length.
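
The four strategies differ only in the denominator applied to the raw score. The sketch below assumes normalization is a simple division by a length; the enum and the `normalize_score` function are illustrative names, and only the strategy names come from the list above.

```python
from enum import Enum


class Normalization(Enum):
    NONE = "NoNormalization"
    SHORTEST = "NormalizedByShortest"
    LONGEST = "NormalizedByLongest"
    ALIGNMENT = "NormalizedByAlignment"


def normalize_score(raw, len_a, len_b, aligned_len, mode=Normalization.ALIGNMENT):
    """Scale a raw alignment score by the length chosen by `mode`."""
    if mode is Normalization.NONE:
        return float(raw)
    denominators = {
        Normalization.SHORTEST: min(len_a, len_b),
        Normalization.LONGEST: max(len_a, len_b),
        Normalization.ALIGNMENT: aligned_len,
    }
    denom = denominators[mode]
    if denom <= 0:
        raise ValueError("normalizing length must be positive")
    return raw / denom
```

For a raw score of 90 between sequences of length 100 and 120 with an aligned length of 125, the shortest, longest, and alignment modes give 0.9, 0.75, and 0.72 respectively, so the choice of mode can move a pair across a fixed threshold.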
### Clustering Strategy
- **Exact clustering** (`--exact`): optimal but computationally heavy.
- Greedy heuristic (default) for scalability.
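
As an illustration of why a greedy pass scales better than an exact solution, here is a generic greedy centroid scheme: each item joins the first existing center within the threshold, otherwise it founds a new cluster. This is a common textbook heuristic and is not claimed to be the algorithm the tool implements. Hamming distance on equal-length strings stands in for an alignment-based distance.

```python
from typing import Callable, Dict, List, Sequence


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(
    items: Sequence[str],
    distance: Callable[[str, str], float],
    threshold: float,
) -> Dict[str, List[str]]:
    """Map each cluster center to its members, processing items in order."""
    clusters: Dict[str, List[str]] = {}
    for item in items:
        for center in clusters:  # dicts preserve insertion order
            if distance(item, center) <= threshold:
                clusters[center].append(item)
                break
        else:
            clusters[item] = [item]  # no center was close enough
    return clusters


result = greedy_cluster(["AAAA", "AAAT", "GGGG", "GGGC", "AATT"], hamming, 1)
```

In this run `AATT` founds its own cluster even though it is one mismatch away from `AAAT`, because only centers are compared. That is the kind of result an exact method could improve on, at the cost of many more comparisons, and it also shows why input order matters for greedy schemes.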
### Sample-Aware Processing
- Groups sequences by sample origin (`--sample`, `-s`).
- Filters low-sample-count variants via `--min-sample-count`.
- Ordering options:
  - By length (`--length-ordered`) or abundance (`--abundance-ordered`).
  - Optional ascending sort: `--ascending-sorting`.
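
The filtering and ordering options can be pictured with a small sketch. It assumes that the sample count of a variant is the number of samples in which it occurs and that the default order is descending; both are guesses from the option names, and `filter_and_order` is a hypothetical function.

```python
from typing import Dict, List

# sequence -> {sample name: read count}
Counts = Dict[str, Dict[str, int]]


def filter_and_order(
    counts: Counts,
    min_sample_count: int = 1,
    by: str = "abundance",
    ascending: bool = False,
) -> List[str]:
    """Drop variants seen in too few samples, then sort the survivors."""
    kept = [
        seq
        for seq, per_sample in counts.items()
        if sum(1 for n in per_sample.values() if n > 0) >= min_sample_count
    ]
    if by == "length":
        key = len
    elif by == "abundance":
        key = lambda seq: sum(counts[seq].values())  # total reads over samples
    else:
        raise ValueError("by must be 'length' or 'abundance'")
    return sorted(kept, key=key, reverse=not ascending)


counts = {
    "ACGT": {"s1": 10, "s2": 5},
    "ACGTA": {"s1": 2},
    "AC": {"s1": 1, "s2": 1, "s3": 1},
}
```

With this data, requiring at least two samples drops `ACGTA`, and abundance ordering puts `ACGT` (15 reads) ahead of `AC` (3 reads).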
### Abundance Refinement
- **Ratio-based merging** (`--ratio`, `-r`): merges low-abundance sequences into high-abundance parents if their abundance ratio is ≤ the threshold.
- **Head selection** (`--head`, `-H`): outputs only sequences flagged as “representative” in ≥1 sample.
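
The ratio rule can be sketched for a single sample as follows. The direction of the ratio (child abundance divided by parent abundance), the tie-break toward the most abundant parent, and both function names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def merge_by_ratio(
    abundance: Dict[str, int],
    edges: Iterable[Tuple[str, str]],
    ratio: float,
) -> Dict[str, str]:
    """Return child -> parent for each child whose ratio passes the threshold."""
    merged: Dict[str, str] = {}
    for child, parent in edges:
        if abundance[parent] <= 0:
            continue
        if abundance[child] / abundance[parent] <= ratio:
            best = merged.get(child)
            # When several parents qualify, keep the most abundant one.
            if best is None or abundance[parent] > abundance[best]:
                merged[child] = parent
    return merged


def heads(abundance: Dict[str, int], merged: Dict[str, str]) -> List[str]:
    """Sequences that were not merged into any parent in this sample."""
    return sorted(seq for seq in abundance if seq not in merged)


abundance = {"A": 100, "B": 4, "C": 60}
edges = [("B", "A"), ("C", "A"), ("B", "C")]  # (child, parent) candidates
merged = merge_by_ratio(abundance, edges, ratio=0.1)
```

Here `B` (4 reads) is merged into `A` (100 reads) because 4/100 is below 0.1, while `C` (60 reads) stays a head because 60/100 is not. Running this per sample and keeping sequences that are a head in at least one sample corresponds to the head selection described above.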
### Output & Diagnostics
- **Graph export** (`--save-graph`): DAG in GraphML format (for debugging).
- **Ratio table export** (`--save-ratio`): CSV of edge abundance ratios.
- Threshold control via `--distance`, `--threshold`.
### Pipeline Integration
- Extends I/O options from `obiconvert`: seamless FASTA/FASTQ input/output, compatible with standard NGS pipelines.