Files
obitools4/autodoc/docmd/pkg/obiiter/fragment.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

29 lines
1.8 KiB
Markdown

# `IFragments` Functionality Overview
The `IFragments()` function in the `obiiter` package implements a parallelized sequence fragmentation pipeline for biological sequences. It is designed to split long nucleotide or protein sequences into smaller, overlapping fragments while preserving metadata and enabling concurrent processing.
## Core Parameters
- `minsize`: Minimum sequence length to skip fragmentation.
- `length`: Desired fragment size (in bases/amino acids).
- `overlap`: Number of overlapping residues between consecutive fragments.
- `size`, `nworkers`: Batch size and number of worker goroutines (currently unused in active logic).
## Workflow
1. **Batch Sorting**: Input sequences are batched and sorted for efficient processing.
2. **Parallel Fragmentation**:
- Each worker processes a subset of batches independently using goroutines.
- For each sequence longer than `minsize`, it is split into overlapping fragments of length `length` with step size = `length - overlap`.
- The final fragment is extended to cover the remainder (fusion mode), avoiding tiny trailing pieces.
3. **Resource Management**:
- Original sequences are recycled (`s.Recycle()`) to optimize memory usage.
- Fragments are reassembled into batches, sorted by source and order, then rebatched to respect memory/size limits.
## Key Features
- **Overlap handling**: Ensures contiguous coverage without gaps.
- **Memory efficiency**: Uses recycling and batched output.
- **Scalability**: Leverages Go concurrency via `nworkers`.
- **Error safety**: Panics on subsequence errors (e.g., invalid indices).
## Use Case
Ideal for preparing long-read sequencing data (e.g., PacBio, Nanopore) or assembled contigs for downstream analysis requiring fixed-length inputs (e.g., k-mer indexing, ML inference).