obiclust Package: Semantic Overview

The *obiclust* package provides object-oriented implementations of clustering algorithms, emphasizing modularity, extensibility, and semantic clarity. *opicluster/obiclean* extends this to biological sequence data (e.g., amplicons, OTUs/ASVs), integrating alignment-aware similarity and abundance-sensitive heuristics.

Core Clustering Infrastructure (obiclust)

Abstract Base Class: Clusterer

  • Defines a unified interface for all clustering algorithms.
  • Public interface:
    • fit(X, sample_weight=None): Learns cluster structure from data.
    • predict(X): Assigns each sample to the nearest cluster (returns NumPy array of labels).
    • cluster_centers_: Immutable attribute storing learned centroids.
  • Designed for subclassing: custom clusterers override _fit() and _predict().
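
The interface above can be sketched roughly as follows. This is a minimal illustration, not the package's actual source: the private `_centers` field, the nearest-centroid default in `_predict()`, and the toy `MeanClusterer` subclass are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Optional

import numpy as np


class Clusterer(ABC):
    """Hypothetical base class mirroring the interface described above."""

    def __init__(self) -> None:
        self._centers: Optional[np.ndarray] = None

    def fit(self, X, sample_weight=None) -> "Clusterer":
        X = np.asarray(X, dtype=float)
        centers = np.asarray(self._fit(X, sample_weight), dtype=float)
        centers.setflags(write=False)  # expose the centroids as read-only
        self._centers = centers
        return self

    def predict(self, X) -> np.ndarray:
        if self._centers is None:
            raise RuntimeError("call fit() before predict()")
        return self._predict(np.asarray(X, dtype=float))

    @property
    def cluster_centers_(self) -> np.ndarray:
        if self._centers is None:
            raise RuntimeError("call fit() before reading cluster_centers_")
        return self._centers

    @abstractmethod
    def _fit(self, X: np.ndarray, sample_weight) -> np.ndarray:
        """Return an array of centroids with shape (k, n_features)."""

    def _predict(self, X: np.ndarray) -> np.ndarray:
        # Assumed default: label = index of the nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - self._centers[None, :, :], axis=2)
        return dists.argmin(axis=1)


class MeanClusterer(Clusterer):
    """Toy subclass: a single cluster located at the (weighted) mean."""

    def _fit(self, X, sample_weight):
        return np.average(X, axis=0, weights=sample_weight)[None, :]
```

A subclass only supplies `_fit()` (and optionally `_predict()`); the public `fit()`/`predict()` wrappers handle conversion and the read-only flag in one place.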

Concrete Algorithms

  • KMeans
    • Configurable initialization: kmeans++, random.
    • Parameters: max iterations, convergence tolerance (tol).
  • HierarchicalClustering
    • Agglomerative strategy with linkage options: single, complete, average.
  • (Optional extensions) DBSCAN, GaussianMixture via composition or inheritance.
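
To make the kmeans++ initialization option concrete, here is a small sketch of the standard k-means++ seeding procedure in NumPy. The function name `kmeans_pp_init` and its signature are illustrative assumptions, not the package's API.

```python
import numpy as np


def kmeans_pp_init(X, k, rng):
    """Pick k initial centers with k-means++ seeding (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # First center: chosen uniformly at random.
    centers = [X[rng.integers(n)]]
    # Squared distance from every point to its nearest chosen center.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        total = d2.sum()
        if total == 0:
            # All points coincide with a chosen center; fall back to uniform.
            idx = rng.integers(n)
        else:
            # Sample the next center with probability proportional to d2.
            idx = rng.choice(n, p=d2 / total)
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)
```

Called as `kmeans_pp_init(X, 3, np.random.default_rng(0))`, it returns an array of shape `(3, n_features)`. Points far from the existing centers are more likely to be picked, which is what distinguishes this option from plain random initialization.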

Semantic Data Handling

  • Input validation: numeric-only matrices, non-empty inputs.
  • Outputs are immutable NumPy arrays (labels/centers).
  • Supports per-sample weights during fitting.
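
The validation rules listed above might look like the following sketch. The helper name `validate_matrix` and the specific exception types are assumptions; only the rules themselves (numeric-only, non-empty, immutable result) come from the description.

```python
import numpy as np


def validate_matrix(X):
    """Return a read-only float copy of X, or raise if X is unusable."""
    arr = np.asarray(X)
    if arr.size == 0:
        raise ValueError("input must be non-empty")
    if not np.issubdtype(arr.dtype, np.number):
        raise TypeError("input must be numeric")
    if arr.ndim != 2:
        raise ValueError("input must be a 2-D matrix (samples x features)")
    out = arr.astype(float)    # astype copies, so the caller's data is untouched
    out.setflags(write=False)  # later in-place writes raise ValueError
    return out
```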

Evaluation & Validation

  • Built-in metrics: Silhouette score, Davies-Bouldin index, within-cluster sum of squares (WCSS).
  • Cross-validation helpers (select_k, tune_linkage) for hyperparameter selection.
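
As an illustration of the simplest of these metrics, WCSS (within-cluster sum of squares) can be computed in a few lines of NumPy. The function name and the optional `sample_weight` argument are assumptions for this sketch.

```python
import numpy as np


def wcss(X, labels, centers, sample_weight=None):
    """Sum of (optionally weighted) squared distances to the assigned center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    if sample_weight is None:
        w = np.ones(len(X))
    else:
        w = np.asarray(sample_weight, dtype=float)
    # centers[labels] picks, for each sample, the center it was assigned to.
    sq_dist = np.sum((X - centers[labels]) ** 2, axis=1)
    return float(np.sum(w * sq_dist))
```

A helper such as select_k would typically evaluate a score like this for several values of k and compare them; how the package actually does so is not specified here.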

Serialization & Typing

  • to_dict() / from_dict(): Enables JSON persistence and reproducibility.
  • Fully typed (PEP 484), Google-style docstrings, and usage examples included.
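
A JSON round trip through to_dict() / from_dict() could be sketched as below. The `KMeansState` class and its fields are hypothetical stand-ins; the real classes may store different parameters.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class KMeansState:
    """Hypothetical container for the parameters worth persisting."""

    n_clusters: int
    tol: float
    centers: List[List[float]] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        # Only plain Python types, so json.dumps works without a custom encoder.
        return {
            "n_clusters": self.n_clusters,
            "tol": self.tol,
            "centers": [list(row) for row in self.centers],
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "KMeansState":
        return cls(
            n_clusters=int(data["n_clusters"]),
            tol=float(data["tol"]),
            centers=[[float(v) for v in row] for row in data["centers"]],
        )


state = KMeansState(n_clusters=2, tol=1e-4, centers=[[0.0, 1.0], [2.0, 3.0]])
restored = KMeansState.from_dict(json.loads(json.dumps(state.to_dict())))
```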

Design Principles

  • Readability: Method names reflect intent (e.g., assign_clusters, not _step2).
  • Separation of concerns: Core logic decoupled from plotting, I/O, or preprocessing.
  • Minimal dependencies: NumPy (required), SciPy (optional for metrics).

Biological Sequence Clustering (opicluster/obiclean)

Distance/Similarity Mode

  • Switches between:
    • Similarity mode (default): higher scores = more related.
    • Distance mode (--distance): lower distances = closer.

Normalization Strategies

Controls how alignment scores are scaled before clustering:

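  • NoNormalization: raw score.
  • NormalizedByShortest (--shortest): scaled by the length of the shorter sequence.
  • NormalizedByLongest (--longest): scaled by the length of the longer sequence.
  • NormalizedByAlignment (default, via --alignment): scaled by the aligned length.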
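
One plausible reading of these strategies is a division of the raw score by the chosen reference length. The sketch below assumes exactly that; the enum, the function name, and the division itself are illustrative assumptions, since the text only says which length each mode refers to.

```python
from enum import Enum


class Normalization(Enum):
    NONE = "none"
    SHORTEST = "shortest"
    LONGEST = "longest"
    ALIGNMENT = "alignment"


def normalize_score(raw, len_a, len_b, aligned_len, mode=Normalization.ALIGNMENT):
    """Scale a raw alignment score by the reference length selected by mode."""
    if mode is Normalization.NONE:
        return float(raw)
    if mode is Normalization.SHORTEST:
        denom = min(len_a, len_b)
    elif mode is Normalization.LONGEST:
        denom = max(len_a, len_b)
    else:  # Normalization.ALIGNMENT
        denom = aligned_len
    if denom <= 0:
        raise ValueError("reference length must be positive")
    return raw / denom
```

For a raw score of 90 between sequences of length 100 and 120 with an aligned length of 125, the three normalized modes give 0.9 (shortest), 0.75 (longest), and 0.72 (alignment), so the choice of mode can move a pair across a clustering threshold.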

Clustering Strategy

  • Exact clustering (--exact): optimal but computationally heavy.
  • Greedy heuristic (default) for scalability.
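
The text does not spell out the greedy heuristic, so the sketch below shows one common variant as an assumption: sequences are visited in order, and each one joins the first existing cluster center it is related to, or otherwise founds a new cluster. The `identity` function is a deliberately crude stand-in for a real alignment score, and `distance_mode` mirrors the similarity/distance switch described earlier.

```python
def identity(a, b):
    """Stand-in similarity: position-wise matches over the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold, similarity=identity, distance_mode=False):
    """Assign each sequence to the first related center, else open a cluster."""
    clusters = {}  # center sequence -> members (insertion order = creation order)
    for seq in seqs:
        for center in clusters:
            score = similarity(center, seq)
            # Similarity mode: higher is closer. Distance mode: lower is closer.
            related = score <= threshold if distance_mode else score >= threshold
            if related:
                clusters[center].append(seq)
                break
        else:
            # No existing center is close enough: seq becomes a new center.
            clusters[seq] = [seq]
    return clusters
```

Because each sequence is compared only against current centers, the result depends on input order, which is why the ordering options matter for a heuristic of this kind. An exact method would avoid that dependence at a higher computational cost.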

Sample-Aware Processing

  • Groups sequences by sample origin (--sample, -s).
  • Filters low-sample-count variants via --min-sample-count.
  • Ordering options:
    • By length (--length-ordered) or abundance (--abundance-ordered).
    • Optional ascending sort: --ascending-sorting.
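
The filtering and ordering options can be illustrated with a small sketch. It assumes that --min-sample-count refers to the number of samples in which a variant occurs and that the default order is descending; the `Variant` type and `filter_and_order` function are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Variant(NamedTuple):
    sequence: str
    counts: Dict[str, int]  # sample name -> read count


def filter_and_order(variants, min_sample_count=1, by="abundance", ascending=False):
    """Drop variants seen in too few samples, then sort the remainder."""
    if by not in ("abundance", "length"):
        raise ValueError("by must be 'abundance' or 'length'")
    kept: List[Variant] = [
        v for v in variants
        if sum(1 for c in v.counts.values() if c > 0) >= min_sample_count
    ]
    if by == "length":
        key = lambda v: len(v.sequence)
    else:
        key = lambda v: sum(v.counts.values())
    return sorted(kept, key=key, reverse=not ascending)


variants = [
    Variant("ACGT", {"s1": 10, "s2": 5}),
    Variant("ACGTA", {"s1": 1}),
    Variant("AC", {"s1": 3, "s2": 4}),
]
```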

Abundance Refinement

  • Ratio-based merging (--ratio, -r): merges a low-abundance sequence into a high-abundance parent if their abundance ratio is ≤ the threshold.
  • Head selection (--head, -H): outputs only sequences flagged as “representative” in at least one sample.
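
A minimal sketch of the ratio rule within a single sample follows. It assumes that the candidate pairs (`edges`) have already been found by the similarity step, that the ratio is child abundance over parent abundance, and that a child with several eligible parents keeps the most abundant one. None of these details are confirmed by the text.

```python
def assign_parents(abundance, edges, ratio):
    """Return (child -> parent map, list of heads) for one sample."""
    parent = {}
    for a, b in edges:
        # The less abundant sequence of the pair is the candidate child.
        child, par = (a, b) if abundance[a] < abundance[b] else (b, a)
        if abundance[par] == 0:
            continue
        if abundance[child] / abundance[par] <= ratio:
            best = parent.get(child)
            if best is None or abundance[par] > abundance[best]:
                parent[child] = par
    # Heads are sequences that were not merged into any parent.
    heads = [s for s in abundance if s not in parent]
    return parent, heads


abundance = {"A": 100, "B": 5, "C": 60}
edges = [("A", "B"), ("A", "C")]
parents, heads = assign_parents(abundance, edges, ratio=0.1)
```

Here B (ratio 0.05) is merged into A, while C (ratio 0.6) stays a head. Running the same rule per sample and keeping sequences that are a head at least once corresponds to the head selection described above.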

Output & Diagnostics

  • Graph export (--save-graph): DAG in GraphML format (for debugging).
  • Ratio table export (--save-ratio): CSV of edge abundance ratios.
  • Threshold control via --distance, --threshold.
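
For orientation, a ratio table of the kind described could be written with Python's standard csv module as sketched here. The column names and the four-decimal formatting are assumptions; the actual layout produced by --save-ratio is not documented in this overview.

```python
import csv
import io


def write_ratio_table(edges, abundance, out):
    """Write one CSV row per (parent, child) edge with its abundance ratio."""
    writer = csv.writer(out, lineterminator="\n")
    # Hypothetical header; the real tool may use different column names.
    writer.writerow(["parent", "child", "parent_count", "child_count", "ratio"])
    for parent, child in edges:
        ratio = abundance[child] / abundance[parent]
        writer.writerow(
            [parent, child, abundance[parent], abundance[child], f"{ratio:.4f}"]
        )


buffer = io.StringIO()
write_ratio_table([("A", "B")], {"A": 100, "B": 5}, buffer)
```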

Pipeline Integration

  • Extends I/O options from obiconvert: seamless FASTA/FASTQ input/output, compatible with standard NGS pipelines.