- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
2.1 KiB
obiseq Package: BioSequence Collection Management
The obiseq package provides a high-performance, memory-efficient implementation for managing collections of biological sequences (BioSequence) in Go. Its core type is BioSequenceSlice, a slice of pointers to BioSequence objects, optimized for batch processing in metagenomic pipelines.
Key Functionalities
-
Memory Pooling & Allocation Control:
NewBioSequenceSliceandMakeBioSequenceSliceallow creating slices with optional capacity hints.
EnsureCapacity(capacity)dynamically grows the underlying slice while logging warnings or panicking on persistent allocation failures. -
Efficient Element Management:
Push(sequence): Appends a sequence to the end.Pop(): Removes and returns the last element (nil-safe).Pop0(): Efficiently removes and returns the first element.
-
Collection Metadata Queries:
Len(): Returns number of sequences in the slice.Size(): Computes total sequence length (summing all.Len()).NotEmpty(): Boolean check for non-empty collections.
-
Attribute Aggregation:
AttributeKeys(skip_map, skip_definition)aggregates all attribute keys across sequences into a set—useful for schema inference or validation. -
Sorting Capabilities:
SortOnCount(reverse): Sorts by read count (descending/ascending).SortOnLength(reverse): Sorts by sequence length.
-
Taxonomy Integration:
ExtractTaxonomy(taxonomy, seqAsTaxa)builds or extends a taxonomic tree from sequence paths.
WhenseqAsTaxa=true, it injects pseudo-taxonomic labels for individual sequences (e.g.,OTU:SEQ0000012345 [seqID]@sequence), enabling unified taxonomic/rarefaction workflows.
Design Highlights
- Minimal allocations via manual slice management and
slices.Grow. - Explicit niling of popped elements to aid garbage collection.
- Integrated logging (via
logrus) for allocation issues—critical in large-scale NGS data processing. - Designed to support
BioSequenceBatch, a higher-level abstraction for streaming or parallelizable sequence batches.