# `obiclust` Package: Semantic Overview

The `obiclust` package provides object-oriented implementations of clustering algorithms, emphasizing modularity, extensibility, and semantic clarity. `opicluster/obiclean` extends this to biological sequence data (e.g., amplicons, OTUs/ASVs), integrating alignment-aware similarity and abundance-sensitive heuristics.

## Core Clustering Infrastructure (`obiclust`)

### Abstract Base Class: `Clusterer`
- Defines a unified interface for all clustering algorithms.
- Public interface:
  - `fit(X, sample_weight=None)`: Learns cluster structure from data.
  - `predict(X)`: Assigns each sample to the nearest cluster (returns a NumPy array of labels).
  - `cluster_centers_`: Read-only attribute storing the learned centroids.
- Designed for subclassing: custom clusterers override `_fit()` and `_predict()`.
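
The interface above can be sketched as a template-method base class. This is a minimal illustration, not the package's actual source: the internal `_centers` field, the read-only flag on the centroid array, and the toy `WeightedMeanClusterer` subclass are all assumptions made for the example.

```python
from abc import ABC, abstractmethod

import numpy as np


class Clusterer(ABC):
    """Public fit/predict wrap the _fit/_predict hooks that subclasses override."""

    def __init__(self):
        self._centers = None  # hypothetical internal storage

    def fit(self, X, sample_weight=None):
        X = np.asarray(X, dtype=float)
        if sample_weight is None:
            sample_weight = np.ones(len(X))
        weights = np.asarray(sample_weight, dtype=float)
        centers = np.array(self._fit(X, weights), dtype=float)
        centers.setflags(write=False)  # expose centroids as read-only
        self._centers = centers
        return self

    def predict(self, X):
        if self._centers is None:
            raise RuntimeError("call fit() before predict()")
        return self._predict(np.asarray(X, dtype=float))

    @property
    def cluster_centers_(self):
        return self._centers

    @abstractmethod
    def _fit(self, X, sample_weight):
        """Return centroids with shape (n_clusters, n_features)."""

    @abstractmethod
    def _predict(self, X):
        """Return one integer label per row of X."""


class WeightedMeanClusterer(Clusterer):
    """Toy subclass (hypothetical): a single cluster at the weighted mean."""

    def _fit(self, X, sample_weight):
        return np.average(X, axis=0, weights=sample_weight)[None, :]

    def _predict(self, X):
        # Nearest-centroid assignment by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - self._centers[None, :, :], axis=2)
        return dists.argmin(axis=1)
```

Keeping validation and bookkeeping in the public methods means a subclass only has to supply the algorithm-specific hooks.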
  
### Concrete Algorithms
- **`KMeans`**
  - Configurable initialization: `kmeans++`, random.
  - Parameters: max iterations, convergence tolerance (`tol`).
- **`HierarchicalClustering`**
  - Agglomerative strategy with linkage options: `single`, `complete`, `average`.
- *(Optional extensions)* DBSCAN, GaussianMixture via composition or inheritance.
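
To make the `KMeans` options concrete, here is a standalone sketch of k-means++ seeding followed by Lloyd iterations with a `tol`-based stopping rule. The function names (`kmeans_pp_init`, `kmeans`), the `seed` parameter, and the choice to measure convergence as the total centroid shift are assumptions for illustration; the sketch ignores sample weights and assumes the data contain at least `k` distinct points.

```python
import numpy as np


def _sq_dists(X, centers):
    # Pairwise squared Euclidean distances, shape (n_samples, n_centers).
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)


def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: draw each new center with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = _sq_dists(X, np.array(centers)).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)


def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = kmeans_pp_init(X, k, rng)
    for _ in range(max_iter):
        labels = _sq_dists(X, centers).argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift <= tol:  # converged: centroids barely moved
            break
    # Final assignment against the returned centroids.
    return centers, _sq_dists(X, centers).argmin(axis=1)
```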

### Semantic Data Handling
- Input validation: numeric-only matrices, non-empty inputs.
- Outputs are immutable NumPy arrays (labels/centers).
- Supports per-sample weights during fitting.
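
The validation and immutability rules above might look like the following. The helper names `validate_input` and `freeze` are hypothetical, and restricting input to integer and floating-point dtypes is one possible reading of "numeric-only".

```python
import numpy as np


def validate_input(X):
    """Reject empty, non-2-D, or non-numeric input; return a float matrix."""
    X = np.asarray(X)
    if X.size == 0:
        raise ValueError("input must be non-empty")
    if X.ndim != 2:
        raise ValueError("input must be a 2-D matrix")
    if X.dtype.kind not in "iuf":  # signed int, unsigned int, float
        raise TypeError("input must be numeric")
    return X.astype(float)


def freeze(arr):
    """Return a read-only copy, leaving the caller's array writable."""
    out = np.array(arr)
    out.setflags(write=False)
    return out
```

Writing to a frozen array raises `ValueError`, so downstream code cannot silently alter labels or centers.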

### Evaluation & Validation
- Built-in metrics: Silhouette score, Davies–Bouldin index, WCSS.
- Cross-validation helpers (`select_k`, `tune_linkage`) for hyperparameter selection.
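
Two of the listed metrics are short enough to sketch directly. The function names and signatures below are assumptions, not the package's API; both functions expect every cluster to be non-empty, and `davies_bouldin` needs at least two clusters.

```python
import numpy as np


def wcss(X, labels, centers):
    """Within-cluster sum of squares: squared distance of each point
    to its assigned centroid, summed over all points."""
    X, centers = np.asarray(X, dtype=float), np.asarray(centers, dtype=float)
    return float(((X - centers[np.asarray(labels)]) ** 2).sum())


def davies_bouldin(X, labels, centers):
    """Davies-Bouldin index: mean over clusters of the worst-case
    (scatter_i + scatter_j) / centroid_distance_ij. Lower is better."""
    X, centers = np.asarray(X, dtype=float), np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    k = len(centers)
    # Mean distance of each cluster's points to its own centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == i] - centers[i], axis=1).mean()
        for i in range(k)
    ])
    sep = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(sep, np.inf)  # drop the i == j terms (ratio becomes 0)
    ratios = (scatter[:, None] + scatter[None, :]) / sep
    return float(ratios.max(axis=1).mean())
```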

### Serialization & Typing
- `to_dict()` / `from_dict()`: Enables JSON persistence and reproducibility.
- Fully typed (PEP 484), with Google-style docstrings and usage examples.
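
A `to_dict()` / `from_dict()` pair that round-trips through JSON could be sketched as follows. The `KMeansState` dataclass and its fields are hypothetical; the point is only that the dictionary holds plain JSON-compatible types (centers as nested lists rather than NumPy arrays).

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class KMeansState:
    """Hypothetical container for hyperparameters and fitted centroids."""

    n_clusters: int
    tol: float = 1e-4
    max_iter: int = 100
    centers: Optional[List[List[float]]] = None

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "KMeansState":
        return cls(**data)


# Round trip: object -> dict -> JSON text -> dict -> object.
state = KMeansState(n_clusters=2, centers=[[0.0, 0.5], [10.0, 10.5]])
restored = KMeansState.from_dict(json.loads(json.dumps(state.to_dict())))
```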

### Design Principles
- **Readability**: Method names reflect intent (e.g., `assign_clusters`, not `_step2`).
- **Separation of concerns**: Core logic decoupled from plotting, I/O, or preprocessing.
- **Minimal dependencies**: NumPy (required), SciPy (optional for metrics).

## Biological Sequence Clustering (`opicluster/obiclean`)

### Distance/Similarity Mode
- Switches between:
  - **Similarity mode** (default): higher scores = more related.
  - **Distance mode** (`--distance`): lower distances = closer.

### Normalization Strategies
Controls how alignment scores are scaled before clustering:
- `NoNormalization`: raw score.
- `NormalizedByShortest` (`--shortest`)
- `NormalizedByLongest` (`--longest`)
- `NormalizedByAlignment` (default, via `--alignment`): uses aligned length.
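
As an illustration of the four strategies, the sketch below divides a raw alignment score by a reference length chosen by the mode. Division by the reference length is an assumption about how the scaling works, and the `Normalization` enum and `normalize_score` function are hypothetical names, not the tool's code.

```python
from enum import Enum


class Normalization(Enum):
    NONE = "NoNormalization"
    SHORTEST = "NormalizedByShortest"
    LONGEST = "NormalizedByLongest"
    ALIGNMENT = "NormalizedByAlignment"


def normalize_score(score, len_a, len_b, aligned_len, mode):
    """Scale a raw alignment score by the reference length for `mode`."""
    if mode is Normalization.NONE:
        return score
    reference = {
        Normalization.SHORTEST: min(len_a, len_b),
        Normalization.LONGEST: max(len_a, len_b),
        Normalization.ALIGNMENT: aligned_len,
    }[mode]
    return score / reference
```

For two sequences of lengths 100 and 120 with an alignment of length 110 and a raw score of 90, this gives 0.9 by shortest, 0.75 by longest, and about 0.818 by alignment length.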

### Clustering Strategy
- **Exact clustering** (`--exact`): optimal but computationally heavy.
- **Greedy heuristic** (default): trades optimality for scalability.
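
One common form of greedy clustering is sketched below: sequences are visited in order, each joins the first existing cluster center it matches at or above the threshold, and otherwise founds a new cluster. This is a generic centroid-based heuristic shown for orientation, not necessarily the tool's algorithm, and the `identity` function is a toy stand-in for a real alignment score.

```python
def identity(a, b):
    """Toy similarity: fraction of matching positions over the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, similarity, threshold):
    """Map each sequence to its cluster center, in a single pass."""
    centers = []
    assignment = {}
    for seq in seqs:
        for center in centers:
            if similarity(seq, center) >= threshold:
                assignment[seq] = center  # join the first matching cluster
                break
        else:
            centers.append(seq)  # no match: this sequence founds a cluster
            assignment[seq] = seq
    return assignment


clusters = greedy_cluster(
    ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "TTTTTTTA"], identity, threshold=0.8
)
```

Because the result depends on visiting order, sorting the input (for example by length or abundance) changes which sequences become centers.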

### Sample-Aware Processing
- Groups sequences by sample origin (`--sample`, `-s`).
- Filters low-sample-count variants via `--min-sample-count`.
- Ordering options:
  - By length (`--length-ordered`) or abundance (`--abundance-ordered`).
  - Optional ascending sort: `--ascending-sorting`.

### Abundance Refinement
- **Ratio-based merging** (`--ratio`, `-r`): merges a low-abundance sequence into a high-abundance parent when their abundance ratio is ≤ the threshold.
- **Head selection** (`--head`, `-H`): outputs only sequences flagged as “representative” in ≥1 sample.
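
The ratio rule can be illustrated with a small sketch. It assumes the ratio is child abundance divided by parent abundance, that candidate `(child, parent)` pairs come from a prior similarity step, and that a child with several qualifying parents goes to the most abundant one; `merge_by_ratio` is a hypothetical name.

```python
def merge_by_ratio(abundance, edges, ratio):
    """Fold low-abundance sequences into parents when
    abundance[child] / abundance[parent] <= ratio.

    abundance: dict mapping sequence id -> read count
    edges: iterable of (child, parent) pairs of similar sequences
    """
    if not 0 < ratio < 1:
        # ratio < 1 forces parents to be strictly more abundant, so no cycles.
        raise ValueError("ratio must be in (0, 1)")
    parent_of = {}
    for child, parent in edges:
        if abundance[child] / abundance[parent] <= ratio:
            best = parent_of.get(child)
            if best is None or abundance[parent] > abundance[best]:
                parent_of[child] = parent

    def root(seq):
        # Follow parent links up to the sequence that absorbs this one.
        while seq in parent_of:
            seq = parent_of[seq]
        return seq

    merged = {}
    for seq, count in abundance.items():
        head = root(seq)
        merged[head] = merged.get(head, 0) + count
    return merged


merged = merge_by_ratio(
    {"A": 1000, "B": 50, "C": 2, "D": 400},
    [("B", "A"), ("C", "B"), ("D", "A")],
    ratio=0.1,
)
```

Here `B` (50/1000) and `C` (2/50) fall under the threshold and are absorbed into `A`, while `D` (400/1000) stays separate.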

### Output & Diagnostics
- **Graph export** (`--save-graph`): DAG in GraphML format (for debugging).
- **Ratio table export** (`--save-ratio`): CSV of edge abundance ratios.
- Threshold control via `--distance`, `--threshold`.

### Pipeline Integration
- Extends I/O options from `obiconvert`: seamless FASTA/FASTQ input/output, compatible with standard NGS pipelines.