obitools4/autodoc/docmd/pkg_obitools_obisummary.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

obisummary Package: Semantic Description

The obisummary package delivers lightweight, high-performance statistical summarization of biological sequence data processed by OBITools4. It enables rapid profiling of metadata and content-level features across large sequence sets, which is especially useful after processing steps such as obiclean or merging, and it supports parallel execution for scalability.

Core Data Model

  • DataSummary struct: Central container tracking:
    • Global metrics: number of reads, unique variants (distinct sequences), and total symbols.
    • Presence flags for special annotations: merged_sample, obiclean_status/weight.
    • Categorized annotation metadata:
      • Scalar attributes (single-value per sequence).
      • Map-like tags (map_tags), where each key maps to counts.
      • Vector or vector-like attributes (multi-value per sequence).
    • Per-sample statistics: variant count, singleton detection, and obiclean-related flags (e.g., bad reads).
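
To make the three annotation categories concrete, here is a minimal sketch of how keys could be routed into scalar, map, and vector buckets. The type `summarySketch`, the functions `classify` and `record`, and the annotation name `scores` are hypothetical names chosen for illustration; they are not the package's actual definitions, and the real `DataSummary` tracks more than is shown here.

```go
package main

import (
	"fmt"
	"reflect"
)

// summarySketch is a hypothetical, reduced stand-in for DataSummary.
// It only keeps, per category, how many sequences carry each key.
type summarySketch struct {
	scalarKeys map[string]int
	mapKeys    map[string]int
	vectorKeys map[string]int
}

func newSummarySketch() *summarySketch {
	return &summarySketch{
		scalarKeys: map[string]int{},
		mapKeys:    map[string]int{},
		vectorKeys: map[string]int{},
	}
}

// classify reports the bucket for an annotation value, based on its kind.
// Assumption: maps go to "map", slices and arrays to "vector",
// everything else (including nil) to "scalar".
func classify(v any) string {
	if v == nil {
		return "scalar"
	}
	switch reflect.TypeOf(v).Kind() {
	case reflect.Map:
		return "map"
	case reflect.Slice, reflect.Array:
		return "vector"
	default:
		return "scalar"
	}
}

// record counts one occurrence of key in the bucket matching its value.
func (s *summarySketch) record(key string, v any) {
	switch classify(v) {
	case "map":
		s.mapKeys[key]++
	case "vector":
		s.vectorKeys[key]++
	default:
		s.scalarKeys[key]++
	}
}

func main() {
	s := newSummarySketch()
	s.record("count", 12)
	s.record("merged_sample", map[string]int{"s1": 7, "s2": 5})
	s.record("scores", []float64{0.9, 0.8}) // "scores" is an invented key
	fmt.Println(s.scalarKeys, s.mapKeys, s.vectorKeys)
}
```

Classifying by value kind keeps the routing independent of key names, so a previously unseen annotation still lands in a sensible bucket.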

Low-Level Helpers

  • Map aggregation utilities:

    • sumUpdateIntMap: Accumulates integer values across maps.
    • countUpdateIntMap, plusOne/PlusUpdateIntMap: Increment counters for keys (e.g., attribute or sample names).
  • Add() method: Thread-safe merge of two DataSummary values, which enables parallel accumulation.
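
The helpers above can be sketched as plain map operations. The names `sumIntMaps`, `countKeys`, and `partial` are hypothetical, and the signatures are assumptions rather than the package's own. This sketch also leaves synchronization to the caller, whereas the description above states that the real `Add()` is thread-safe.

```go
package main

import "fmt"

// sumIntMaps adds every value of src to the matching key of dst.
// A nil dst is replaced by a fresh map so the call never panics.
func sumIntMaps(dst, src map[string]int) map[string]int {
	if dst == nil {
		dst = make(map[string]int, len(src))
	}
	for k, v := range src {
		dst[k] += v
	}
	return dst
}

// countKeys increments dst by one for each key present in src,
// ignoring the values stored in src.
func countKeys(dst, src map[string]int) map[string]int {
	if dst == nil {
		dst = make(map[string]int, len(src))
	}
	for k := range src {
		dst[k]++
	}
	return dst
}

// partial is a hypothetical partial summary holding a read count
// and per-sample read counts.
type partial struct {
	reads     int
	perSample map[string]int
}

// add merges other into p and returns p.
// Not synchronized: callers must serialize concurrent merges.
func (p *partial) add(other *partial) *partial {
	p.reads += other.reads
	p.perSample = sumIntMaps(p.perSample, other.perSample)
	return p
}

func main() {
	a := &partial{reads: 10, perSample: map[string]int{"s1": 6, "s2": 4}}
	b := &partial{reads: 5, perSample: map[string]int{"s2": 2, "s3": 3}}
	a.add(b)
	fmt.Println(a.reads, a.perSample)
}
```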

Main Processing Logic

  • Update() method: Processes one BioSequence, updating:

    • Read count (via .Count()) and sequence-level metrics.
    • Variant detection via unique sequences; symbol count (total length).
    • Sample-aware logic: detects merged_sample or per-sample annotations to populate sample-level stats (e.g., singleton identification).
    • Annotation classification: routes keys into scalar, map, or vector buckets.
  • ISummary() function: Parallel summarization engine:

    • Distributes work across nproc goroutines.
    • Aggregates partial summaries via atomic operations (Add()).
    • Returns a structured map with:
      {
        "count": { "variants", "reads", "total_length" },
        "annotations": {
          "scalar_attributes",
          "map_attributes",
          "vector_attributes",
          "keys": { scalar: {...}, map: {...}, vector: {...} }
        },
        "samples": {
          "sample_count",
          "sample_stats": { sample_name: { reads, variants, singletons [, obiclean_bad] } }
        }
      }
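
The fan-out and merge pattern described for ISummary() can be illustrated with a small worker pool. This is a sketch under stated assumptions, not the package's implementation. The types `record` and `counts` and the function `summarize` are hypothetical. Each input record is assumed to be one distinct sequence, and total length is assumed to be the sum of sequence lengths without weighting by read count. A mutex stands in for whatever synchronization the real `Add()` uses.

```go
package main

import (
	"fmt"
	"sync"
)

// record is a hypothetical stand-in for a BioSequence:
// the sequence itself and its read count.
type record struct {
	seq   string
	count int
}

// counts holds the three global metrics of the "count" section.
type counts struct {
	reads, variants, totalLength int
}

// update folds one record into the running counts.
func (c *counts) update(r record) {
	c.reads += r.count
	c.variants++ // assumes the input is already dereplicated
	c.totalLength += len(r.seq)
}

// add merges another partial result into c.
func (c *counts) add(o counts) {
	c.reads += o.reads
	c.variants += o.variants
	c.totalLength += o.totalLength
}

// summarize distributes records over nproc workers; each worker builds
// a private partial summary and merges it once, under a mutex.
func summarize(records []record, nproc int) map[string]int {
	if nproc < 1 {
		nproc = 1
	}
	in := make(chan record)
	var (
		mu    sync.Mutex
		total counts
		wg    sync.WaitGroup
	)
	for i := 0; i < nproc; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var local counts
			for r := range in {
				local.update(r)
			}
			mu.Lock()
			total.add(local)
			mu.Unlock()
		}()
	}
	for _, r := range records {
		in <- r
	}
	close(in)
	wg.Wait()
	return map[string]int{
		"reads":        total.reads,
		"variants":     total.variants,
		"total_length": total.totalLength,
	}
}

func main() {
	data := []record{{"ACGT", 3}, {"AC", 1}, {"A", 2}}
	fmt.Println(summarize(data, 4))
}
```

Keeping a private partial summary per worker means the lock is taken once per goroutine rather than once per sequence, which is the usual reason for merging partial results at the end.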
      

CLI Integration (obisummary subpackage)

  • Option registration:

    • SummaryOptionSet(): Registers flags for output format (--json-output, --yaml-output) and map attributes to summarize (-map <attr>).
    • OptionSet(): Extends above with input-handling options (e.g., file/iterator sources) from obiconvert.
  • Runtime introspection:

    • CLIOutFormat(): Returns "yaml" (default) or "json"; YAML is used only when JSON is not requested.
    • CLIHasMapSummary() / CLIMapSummary(): Check and retrieve requested map attributes.
  • Design notes:

    • Uses global state (e.g., __json_output__, __map_summary__) for compatibility with go-getoptions.
    • Scope strictly limited to CLI configuration—no data processing logic resides here.
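
The accessor pattern described above, package-level option state read through small functions, can be sketched as follows. The variable and function names (`jsonOutput`, `mapSummary`, `outFormat`, `hasMapSummary`) are hypothetical stand-ins, and option parsing with go-getoptions is deliberately left out, so this only illustrates the format-selection and map-attribute checks as the text states them.

```go
package main

import "fmt"

// Package-level option state, mirroring the global-variable style
// described above. In a real CLI these would be set by the option parser.
var (
	jsonOutput bool     // set when JSON output is requested
	mapSummary []string // map attributes requested for summarization
)

// outFormat returns "json" when JSON was requested and "yaml" otherwise,
// following the behavior stated in the description.
func outFormat() string {
	if jsonOutput {
		return "json"
	}
	return "yaml"
}

// hasMapSummary reports whether at least one map attribute was requested.
func hasMapSummary() bool {
	return len(mapSummary) > 0
}

func main() {
	fmt.Println(outFormat(), hasMapSummary())

	// Simulate a command line that requested JSON and one map attribute.
	jsonOutput = true
	mapSummary = append(mapSummary, "merged_sample")
	fmt.Println(outFormat(), hasMapSummary())
}
```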