mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
8c7017a99d
# `obisummary` Package: Semantic Description

The `obisummary` package delivers lightweight, high-performance statistical summarization of biological sequence data processed by OBITools4. It enables rapid profiling of metadata and content-level features across large sequence sets, which is especially useful after processing steps such as `obiclean` or merging, and it supports parallel execution for scalability.

## Core Data Model

- **`DataSummary` struct**: Central container tracking:
  - Global metrics: number of reads, unique variants (distinct sequences), and total symbols.
  - Presence flags for special annotations: `merged_sample`, `obiclean_status`/`weight`.
  - Categorized annotation metadata:
    - Scalar attributes (single-value per sequence).
    - Map-like tags (`map_tags`), where each key maps to counts.
    - Vector or vector-like attributes (multi-value per sequence).
  - Per-sample statistics: variant count, singleton detection, and `obiclean`-related flags (e.g., bad reads).

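
To make the shape of this container concrete, here is a minimal sketch of what such a struct could look like. Every type and field name below (`SampleStats`, `Reads`, `MapTags`, and so on) is a hypothetical stand-in chosen for illustration; the actual `DataSummary` fields in OBITools4 may be named and organized differently.

```go
package main

import "fmt"

// SampleStats groups the per-sample figures listed above.
// Hypothetical type: the real package may store these differently.
type SampleStats struct {
	Reads       int
	Variants    int
	Singletons  int
	ObicleanBad int // only meaningful when obiclean annotations are present
}

// DataSummary is an illustrative container, not the package's actual struct.
type DataSummary struct {
	// Global metrics.
	Reads    int // total number of reads
	Variants int // number of distinct sequences
	Symbols  int // total sequence length

	// Presence flags for special annotations.
	HasMergedSample   bool
	HasObicleanStatus bool
	HasObicleanWeight bool

	// Annotation keys, counted by category.
	ScalarAttributes map[string]int
	MapAttributes    map[string]int
	VectorAttributes map[string]int
	MapTags          map[string]map[string]int // tag -> key -> count

	// Per-sample statistics, keyed by sample name.
	Samples map[string]*SampleStats
}

// NewDataSummary allocates every map so that callers can increment
// counters without nil checks.
func NewDataSummary() *DataSummary {
	return &DataSummary{
		ScalarAttributes: map[string]int{},
		MapAttributes:    map[string]int{},
		VectorAttributes: map[string]int{},
		MapTags:          map[string]map[string]int{},
		Samples:          map[string]*SampleStats{},
	}
}

func main() {
	s := NewDataSummary()
	s.Reads += 3
	s.Variants++
	s.Symbols += 120
	s.ScalarAttributes["count"]++
	s.Samples["sampleA"] = &SampleStats{Reads: 3, Variants: 1}
	fmt.Printf("reads=%d variants=%d symbols=%d samples=%d\n",
		s.Reads, s.Variants, s.Symbols, len(s.Samples))
}
```

Allocating all maps in the constructor keeps the update path free of nil checks, which matters when the same code runs once per sequence.
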
## Low-Level Helpers

- **Map aggregation utilities**:
  - `sumUpdateIntMap`: Accumulates integer values across maps.
  - `countUpdateIntMap`, `plusOne/PlusUpdateIntMap`: Increment counters for keys (e.g., attribute or sample names).

- **`Add()` method**: Thread-safe merge of two `DataSummary` values, which enables parallel accumulation.

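
The helpers above follow a common Go pattern for combining counter maps. The sketch below illustrates that pattern with hypothetical functions (`sumInto`, `plusOne`, `merge`) and a reduced `partial` type; these are not the package's actual signatures, only an assumption about how such helpers can be written.

```go
package main

import "fmt"

// sumInto adds every value of src to the matching key of dst.
// Ranging over a nil src is a no-op, so callers need no nil check.
func sumInto(dst, src map[string]int) {
	for k, v := range src {
		dst[k] += v
	}
}

// plusOne increments the counter stored under key (missing keys start at 0).
func plusOne(m map[string]int, key string) {
	m[key]++
}

// partial is a reduced, hypothetical stand-in for a summary object.
type partial struct {
	reads     int
	perSample map[string]int
}

// merge combines two partial summaries into a fresh value.
// It never writes to a or b, so it is safe to call as long as
// no other goroutine is still updating them.
func merge(a, b *partial) *partial {
	out := &partial{
		reads:     a.reads + b.reads,
		perSample: map[string]int{},
	}
	sumInto(out.perSample, a.perSample)
	sumInto(out.perSample, b.perSample)
	return out
}

func main() {
	a := &partial{reads: 5, perSample: map[string]int{"s1": 2}}
	b := &partial{reads: 7, perSample: map[string]int{"s1": 1, "s2": 4}}
	plusOne(b.perSample, "s2")

	m := merge(a, b)
	fmt.Println(m.reads, m.perSample["s1"], m.perSample["s2"]) // 12 3 5
}
```

Returning a new value instead of mutating one of the inputs is one way to keep a merge step easy to reason about under concurrency; whether the real `Add()` takes this route is not stated here.
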
## Main Processing Logic

- **`Update()` method**: Processes one `BioSequence`, updating:
  - Read count (via `.Count()`) and sequence-level metrics.
  - Variant detection via unique sequences; symbol count (total length).
  - Sample-aware logic: detects `merged_sample` or per-sample annotations to populate sample-level stats (e.g., singleton identification).
  - Annotation classification: routes keys into scalar, map, or vector buckets.

- **`ISummary()` function**: Parallel summarization engine:
  - Distributes work across `nproc` goroutines.
  - Aggregates partial summaries via atomic operations (`Add()`).
  - Returns a structured map with:

    ```json
    {
      "count": { "variants", "reads", "total_length" },
      "annotations": {
        "scalar_attributes",
        "map_attributes",
        "vector_attributes",
        "keys": { scalar: {...}, map: {...}, vector: {...} }
      },
      "samples": {
        "sample_count",
        "sample_stats": { sample_name: { reads, variants, singletons [, obiclean_bad] } }
      }
    }
    ```

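
The annotation classification step performed by `Update()` can be pictured as a dispatch on the dynamic type of each annotation value. The following sketch assumes annotations are available as a `map[string]any` and uses reflection to pick a bucket; the `buckets` type, the `classify` method, and the sample attribute names are hypothetical, and the real implementation may decide differently.

```go
package main

import (
	"fmt"
	"reflect"
)

// buckets counts how often each annotation key was seen per category.
type buckets struct {
	scalar, maps, vectors map[string]int
}

func newBuckets() *buckets {
	return &buckets{
		scalar:  map[string]int{},
		maps:    map[string]int{},
		vectors: map[string]int{},
	}
}

// classify routes each annotation key into the scalar, map, or vector
// bucket according to the kind of its value.
func (b *buckets) classify(annotations map[string]any) {
	for key, value := range annotations {
		switch reflect.ValueOf(value).Kind() {
		case reflect.Map:
			b.maps[key]++
		case reflect.Slice, reflect.Array:
			b.vectors[key]++
		default:
			// Numbers, strings, booleans, and nil all count as scalars.
			b.scalar[key]++
		}
	}
}

func main() {
	b := newBuckets()
	b.classify(map[string]any{
		"count":         12,
		"merged_sample": map[string]int{"s1": 8, "s2": 4},
		"path":          []string{"a", "b"}, // made-up vector attribute
	})
	fmt.Println(len(b.scalar), len(b.maps), len(b.vectors)) // 1 1 1
}
```
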
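
The parallel scheme described for `ISummary()` (fan work out to `nproc` goroutines, then merge the partial results) can be sketched as follows. This is a simplified illustration under stated assumptions: `seq`, `summary`, and `summarize` are hypothetical names, the input is assumed to be already dereplicated so that each entry counts as one variant, and a mutex guards the final merge, which may differ from how the package synchronizes.

```go
package main

import (
	"fmt"
	"sync"
)

// seq is a toy stand-in for a sequence record: bases plus a read count.
type seq struct {
	bases string
	count int
}

// summary holds the three global counters.
type summary struct {
	reads, variants, totalLength int
}

// update folds one sequence into the summary.
func (s *summary) update(q seq) {
	s.reads += q.count
	s.variants++ // assumes each input entry is a distinct sequence
	s.totalLength += len(q.bases)
}

// add merges another summary into s.
func (s *summary) add(o *summary) {
	s.reads += o.reads
	s.variants += o.variants
	s.totalLength += o.totalLength
}

// summarize distributes seqs over nproc workers. Each worker fills a
// private summary, then merges it into the shared total under a mutex.
func summarize(seqs []seq, nproc int) *summary {
	if nproc < 1 {
		nproc = 1 // at least one worker must drain the channel
	}
	in := make(chan seq)
	total := &summary{}
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < nproc; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			local := &summary{}
			for q := range in {
				local.update(q)
			}
			mu.Lock()
			total.add(local)
			mu.Unlock()
		}()
	}

	for _, q := range seqs {
		in <- q
	}
	close(in)
	wg.Wait()
	return total
}

func main() {
	seqs := []seq{{"ACGT", 3}, {"AC", 1}, {"ACGTA", 2}}
	s := summarize(seqs, 4)
	fmt.Println(s.reads, s.variants, s.totalLength) // 6 3 11
}
```

Because every worker accumulates into its own local value, the lock is taken only once per worker rather than once per sequence.
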
## CLI Integration (`obisummary` subpackage)

- **Option registration**:
  - `SummaryOptionSet()`: Registers flags for output format (`--json-output`, `--yaml-output`) and map attributes to summarize (`-map <attr>`).
  - `OptionSet()`: Extends the above with input-handling options (e.g., file/iterator sources) from `obiconvert`.

- **Runtime introspection**:
  - `CLIOutFormat()`: Returns `"yaml"` (default) or `"json"`; YAML is used only if JSON is *not* requested.
  - `CLIHasMapSummary()` / `CLIMapSummary()`: Check and retrieve requested map attributes.

- **Design notes**:
  - Uses global state (e.g., `__json_output__`, `__map_summary__`) for compatibility with [`go-getoptions`](https://github.com/DavidGamba/go-getoptions).
  - Scope is strictly limited to CLI configuration; no data processing logic resides here.
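
The precedence rule for the output format and the global-state design can be illustrated without any option-parsing library. In the sketch below, the package-level variables `jsonOutput`, `yamlOutput`, and `mapSummary` are hypothetical stand-ins for the globals mentioned above, and the three functions only mimic the behavior described for `CLIOutFormat()`, `CLIHasMapSummary()`, and `CLIMapSummary()`; in the real package, an option parser would fill these variables.

```go
package main

import "fmt"

// Package-level state that an option parser would normally populate.
// These names are illustrative, not the package's actual identifiers.
var (
	jsonOutput bool
	yamlOutput bool
	mapSummary []string
)

// outFormat returns "json" when JSON was requested and "yaml" otherwise,
// so YAML is both the default and the loser when both flags are set.
func outFormat() string {
	if jsonOutput {
		return "json"
	}
	return "yaml"
}

// hasMapSummary reports whether any map attribute was requested.
func hasMapSummary() bool {
	return len(mapSummary) > 0
}

// requestedMaps returns the requested map attribute names.
func requestedMaps() []string {
	return mapSummary
}

func main() {
	fmt.Println(outFormat(), hasMapSummary()) // yaml false

	// Simulate a command line that sets both format flags and one map attribute.
	jsonOutput, yamlOutput = true, true
	mapSummary = append(mapSummary, "merged_sample")
	fmt.Println(outFormat(), hasMapSummary(), requestedMaps()) // json true [merged_sample]
}
```
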