2026-04-07 08:36:50 +02:00
# `obiseq` Package: BioSequence Collection Management
The `obiseq` package provides a high-performance, memory-efficient implementation for managing collections of biological sequences (`BioSequence`) in Go. Its core type is `BioSequenceSlice`, a slice of pointers to `BioSequence` objects, optimized for batch processing in metagenomic pipelines.
### Key Functionalities
- **Memory Pooling & Allocation Control**:
  `NewBioSequenceSlice` and `MakeBioSequenceSlice` create slices, optionally with a capacity hint.
  `EnsureCapacity(capacity)` grows the underlying slice on demand, logging a warning when an allocation fails and panicking if the failure persists.
- **Efficient Element Management**:
  - `Push(sequence)`: Appends a sequence to the end.
  - `Pop()`: Removes and returns the last element (nil-safe).
  - `Pop0()`: Efficiently removes and returns the first element.
- **Collection Metadata Queries**:
  - `Len()`: Returns the number of sequences in the slice.
  - `Size()`: Computes the total sequence length by summing `.Len()` over all sequences.
  - `NotEmpty()`: Reports whether the collection contains at least one sequence.
- **Attribute Aggregation**:
  `AttributeKeys(skip_map, skip_definition)` collects the attribute keys of all sequences into a single set, which is useful for schema inference or validation.
- **Sorting Capabilities**:
  - `SortOnCount(reverse)`: Sorts by read count; `reverse` selects ascending or descending order.
  - `SortOnLength(reverse)`: Sorts by sequence length; `reverse` selects ascending or descending order.
- **Taxonomy Integration**:
  `ExtractTaxonomy(taxonomy, seqAsTaxa)` builds or extends a taxonomic tree from sequence paths.
  When `seqAsTaxa=true`, it injects pseudo-taxonomic labels for individual sequences (e.g., `OTU:SEQ0000012345 [seqID]@sequence`), enabling unified taxonomic/rarefaction workflows.
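
The element-management, query, and sorting operations listed above can be illustrated with a minimal, self-contained sketch. It does not use the real `obiseq` types: `Seq` and `SeqSlice` are hypothetical stand-ins, and the assumption that `reverse=true` means descending order is made for illustration only and may not match the package.

```go
package main

import (
	"fmt"
	"sort"
)

// Seq is a hypothetical stand-in for obiseq.BioSequence.
type Seq struct {
	ID    string
	Bases []byte
	Count int
}

// SeqSlice is a hypothetical stand-in for obiseq.BioSequenceSlice:
// a slice of pointers to sequences.
type SeqSlice []*Seq

// Push appends a sequence to the end of the slice.
func (s *SeqSlice) Push(seq *Seq) { *s = append(*s, seq) }

// Pop removes and returns the last element, or nil if the slice is empty.
func (s *SeqSlice) Pop() *Seq {
	n := len(*s)
	if n == 0 {
		return nil
	}
	last := (*s)[n-1]
	(*s)[n-1] = nil // clear the slot so the backing array keeps no reference
	*s = (*s)[:n-1]
	return last
}

// Pop0 removes and returns the first element, or nil if the slice is empty.
// Reslicing avoids copying the remaining elements.
func (s *SeqSlice) Pop0() *Seq {
	if len(*s) == 0 {
		return nil
	}
	first := (*s)[0]
	(*s)[0] = nil
	*s = (*s)[1:]
	return first
}

// Size returns the total number of bases across all sequences.
func (s SeqSlice) Size() int {
	total := 0
	for _, seq := range s {
		total += len(seq.Bases)
	}
	return total
}

// SortOnCount sorts by Count; in this sketch reverse=true means descending.
func (s SeqSlice) SortOnCount(reverse bool) {
	sort.SliceStable(s, func(i, j int) bool {
		if reverse {
			return s[i].Count > s[j].Count
		}
		return s[i].Count < s[j].Count
	})
}

func main() {
	var batch SeqSlice
	batch.Push(&Seq{ID: "s1", Bases: []byte("ACGT"), Count: 3})
	batch.Push(&Seq{ID: "s2", Bases: []byte("ACGTAC"), Count: 10})
	batch.Push(&Seq{ID: "s3", Bases: []byte("AC"), Count: 1})

	fmt.Println(len(batch), batch.Size()) // 3 sequences, 12 bases in total

	batch.SortOnCount(true)      // order is now s2, s1, s3
	fmt.Println(batch.Pop0().ID) // s2
	fmt.Println(batch.Pop().ID)  // s3
}
```

Pointer receivers are used for `Push`, `Pop`, and `Pop0` because they change the slice header; the read-only and in-place operations can take the slice by value.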
### Design Highlights
- Minimal allocations via manual slice management and `slices.Grow`.
- Explicit niling of popped elements to aid garbage collection.
- Integrated logging (via `logrus`) for allocation issues, which matters when processing large-scale NGS data.
- Designed to support `BioSequenceBatch`, a higher-level abstraction for streaming or parallelizable sequence batches.
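
The first two design points can be sketched with the standard library alone. The helpers below, `ensureCapacity` and `popLast`, are hypothetical names written for this illustration and are not the package's implementation; the `logrus` warnings and panic-on-failure behaviour described above are omitted. `slices.Grow` requires Go 1.21 or later.

```go
package main

import (
	"fmt"
	"slices"
)

// ensureCapacity returns s with capacity for at least `capacity` elements.
// slices.Grow(s, n) guarantees room for n more elements beyond len(s),
// so the shortfall is computed relative to the current length.
func ensureCapacity[T any](s []T, capacity int) []T {
	if capacity > cap(s) {
		s = slices.Grow(s, capacity-len(s))
	}
	return s
}

// popLast removes the last pointer from s and returns the shortened slice
// together with the removed element (nil if s is empty). The vacated slot
// is set to nil so the backing array does not keep the element reachable.
func popLast[T any](s []*T) ([]*T, *T) {
	n := len(s)
	if n == 0 {
		return s, nil
	}
	last := s[n-1]
	s[n-1] = nil
	return s[:n-1], last
}

func main() {
	reads := make([]*string, 0, 2)
	reads = ensureCapacity(reads, 1000)
	fmt.Println(len(reads), cap(reads) >= 1000) // 0 true

	a, b := "ACGT", "TTGA"
	reads = append(reads, &a, &b) // fits without reallocating
	reads, last := popLast(reads)
	fmt.Println(len(reads), *last) // 1 TTGA
}
```

Growing once to the expected batch size keeps later appends from reallocating, and clearing the popped slot lets the garbage collector reclaim a sequence as soon as the caller drops it.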