obitools4/autodoc/docmd/pkg_obitools_obimatrix.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

obimatrix Package: Semantic Overview

The obimatrix package provides high-performance construction, manipulation, and export of biological sequence count matrices (e.g., OTU/ASV tables) in the OBITools4 ecosystem. It is built around a sparse matrix representation and supports flexible attribute handling, parallelized input processing, and multiple output formats suited to downstream ecological or bioinformatic analysis.

Core Functionalities

Matrix Construction & Management

  • MakeMatrixData() / NewMatrixData(naVal string, fixedCols []string):
    Initializes a new MatrixData instance with configurable NA placeholder and fixed column headers (e.g., "id", "count").
  • Update(seq obiseq.BioSequence, mapKey string):
    Populates the matrix from a biological sequence's annotations. Extracts per-taxon counts or arbitrary map attributes (e.g., sample IDs) and inserts them into the sparse matrix under row = seq.ID, detecting new columns dynamically.
  • TransposeMatrixData(md *MatrixData):
    Swaps rows and columns: each original column name becomes the "id" of a row in the result; metadata and NA handling are preserved.

Merging & Parallelization

  • MergeMatrixData(a, b *MatrixData):
    Combines two matrices row-wise; panics on duplicate sequence IDs to prevent silent overwrites.
  • IMatrix(iter obiseq.Iterator, mapKey string):
    Builds a full matrix in parallel from an iterator of sequences. Auto-detects extra sample columns if enabled (via --auto-cols), supporting dynamic batch processing.
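
The merge-and-parallelize pattern can be sketched as follows. This is an assumption-laden illustration, not the package's code: the record and matrix types and the mergeMatrices and buildParallel functions are hypothetical names, and the real IMatrix works on an obiseq iterator rather than on pre-split batches.

```go
package main

import (
	"fmt"
	"sync"
)

// record is a hypothetical, simplified view of one annotated sequence.
type record struct {
	id     string
	counts map[string]int
}

// matrix is a hypothetical sparse count table: row ID -> sample -> count.
type matrix map[string]map[string]int

// mergeMatrices copies every row of b into a and returns a. It panics on
// a duplicate row ID so that no row is silently overwritten.
func mergeMatrices(a, b matrix) matrix {
	for id, row := range b {
		if _, dup := a[id]; dup {
			panic(fmt.Sprintf("duplicate sequence ID %q", id))
		}
		a[id] = row
	}
	return a
}

// buildParallel builds one partial matrix per batch in its own goroutine,
// then merges the partial results sequentially.
func buildParallel(batches [][]record) matrix {
	partials := make([]matrix, len(batches))
	var wg sync.WaitGroup
	for i, batch := range batches {
		wg.Add(1)
		go func(i int, batch []record) {
			defer wg.Done()
			part := make(matrix)
			for _, r := range batch {
				mergeMatrices(part, matrix{r.id: r.counts})
			}
			partials[i] = part // each goroutine writes its own slot
		}(i, batch)
	}
	wg.Wait()

	result := make(matrix)
	for _, part := range partials {
		mergeMatrices(result, part)
	}
	return result
}

func main() {
	batches := [][]record{
		{{id: "seq1", counts: map[string]int{"sampleA": 3}}},
		{{id: "seq2", counts: map[string]int{"sampleB": 7}}},
	}
	m := buildParallel(batches)
	fmt.Println(len(m), m["seq2"]["sampleB"])
}
```

Because each worker writes only to its own slot of the partials slice, no lock is needed until the final sequential merge.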

Export & CLI Integration

  • CLIWriteCSVToStdout(md *MatrixData):
    Outputs a wide-format CSV: rows = sequences, columns = fixed attributes + detected samples. Handles the Phred quality offset (ASCII 33 or 64) for quality strings and supports transposition via --transpose.
  • CLIWriteThreeColumnsToStdout(md *MatrixData):
    Outputs a long-format CSV with three columns (sample, sequence ID, and value), suited to tools that expect tidy data.
  • CLI Option Aggregation:
    Integrates with getoptions to expose flags like:
    • -m, --map-attribute: grouping key (default: "merged_sample")
    • --value-name, --sample-name: column headers (defaults: "count", "sample")
    • -t, --transpose: toggle row/column orientation
    • --allow-empty, --strict-attributes: control handling of missing annotations
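
To make the difference between the wide and the three-column layouts concrete, here is a small sketch using Go's standard encoding/csv package. The matrix type and the wideRecords and longRecords helpers are hypothetical and exist only for this example; the actual CLIWriteCSVToStdout and CLIWriteThreeColumnsToStdout functions may order columns and rows differently.

```go
package main

import (
	"encoding/csv"
	"log"
	"os"
	"sort"
	"strconv"
)

// matrix is a hypothetical sparse count table: row ID -> sample -> count.
type matrix map[string]map[string]int

// sortedRows returns the row IDs in sorted order for reproducible output.
func sortedRows(m matrix) []string {
	ids := make([]string, 0, len(m))
	for id := range m {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

// sortedSamples returns every sample name seen in any row, sorted.
func sortedSamples(m matrix) []string {
	seen := make(map[string]struct{})
	for _, row := range m {
		for s := range row {
			seen[s] = struct{}{}
		}
	}
	samples := make([]string, 0, len(seen))
	for s := range seen {
		samples = append(samples, s)
	}
	sort.Strings(samples)
	return samples
}

// wideRecords lays the matrix out with one row per sequence and one
// column per sample, filling absent cells with naValue.
func wideRecords(m matrix, naValue string) [][]string {
	samples := sortedSamples(m)
	records := [][]string{append([]string{"id"}, samples...)}
	for _, id := range sortedRows(m) {
		rec := []string{id}
		for _, s := range samples {
			if n, ok := m[id][s]; ok {
				rec = append(rec, strconv.Itoa(n))
			} else {
				rec = append(rec, naValue)
			}
		}
		records = append(records, rec)
	}
	return records
}

// longRecords emits one (sample, id, value) record per stored cell.
func longRecords(m matrix, sampleName, valueName string) [][]string {
	records := [][]string{{sampleName, "id", valueName}}
	for _, id := range sortedRows(m) {
		row := m[id]
		samples := make([]string, 0, len(row))
		for s := range row {
			samples = append(samples, s)
		}
		sort.Strings(samples)
		for _, s := range samples {
			records = append(records, []string{s, id, strconv.Itoa(row[s])})
		}
	}
	return records
}

func main() {
	m := matrix{
		"seq1": {"sampleA": 3, "sampleB": 1},
		"seq2": {"sampleB": 7},
	}
	w := csv.NewWriter(os.Stdout)
	if err := w.WriteAll(wideRecords(m, "0")); err != nil {
		log.Fatal(err)
	}
	if err := w.WriteAll(longRecords(m, "sample", "count")); err != nil {
		log.Fatal(err)
	}
}
```

In this sketch the wide layout needs the NA placeholder for every missing cell, whereas the long layout lists only the cells that are actually stored.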

Robustness & Flexibility

  • NA Handling: Replaces absent mapping attributes with a configurable placeholder (default: "0").
  • Strict Mode: Panics on type mismatches or uncastable values (e.g., non-numeric counts in numeric context).
  • Attribute Extensibility: Supports arbitrary metadata (taxonomic labels, quality strings) via dynamic column inference.
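
The interplay between NA handling and strict mode can be sketched as below. The cellValue function and its parameters are hypothetical, and the non-strict fallback to the NA placeholder for uncastable values is an assumption made for this example rather than documented behavior.

```go
package main

import (
	"fmt"
	"strconv"
)

// cellValue converts one annotation into the text of a matrix cell.
// An absent key yields naValue. A value that cannot be read as a number
// panics in strict mode; otherwise (an assumption of this sketch) it is
// replaced by naValue.
func cellValue(attrs map[string]any, key, naValue string, strict bool) string {
	raw, present := attrs[key]
	if !present {
		return naValue
	}
	switch v := raw.(type) {
	case int:
		return strconv.Itoa(v)
	case float64:
		return strconv.FormatFloat(v, 'f', -1, 64)
	case string:
		if _, err := strconv.ParseFloat(v, 64); err == nil {
			return v
		}
	}
	if strict {
		panic(fmt.Sprintf("attribute %q: cannot use %v (%T) as a count", key, raw, raw))
	}
	return naValue
}

func main() {
	attrs := map[string]any{"sampleA": 12, "sampleB": "many"}
	fmt.Println(cellValue(attrs, "sampleA", "0", true))  // numeric value kept
	fmt.Println(cellValue(attrs, "sampleC", "0", true))  // absent key, placeholder used
	fmt.Println(cellValue(attrs, "sampleB", "0", false)) // uncastable, non-strict
}
```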

Design Philosophy

The package prioritizes speed, type safety, and reproducibility in amplicon sequencing workflows. It avoids implicit defaults beyond core conventions, favoring explicit CLI configuration and clear error signaling to protect data integrity.