obitools4/autodoc/docmd/pkg/obiiter/fragment.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00


IFragments Functionality Overview

The IFragments() function in the obiiter package implements a parallel fragmentation pipeline for biological sequences. It splits long nucleotide or protein sequences into shorter, overlapping fragments while preserving their metadata and allowing batches to be processed concurrently.

Core Parameters

  • minsize: Length threshold for fragmentation; only sequences longer than minsize are split, shorter ones skip fragmentation.
  • length: Desired fragment size (in bases/amino acids).
  • overlap: Number of overlapping residues between consecutive fragments.
  • size, nworkers: Batch size and number of worker goroutines (currently unused in active logic).

Workflow

  1. Batch Sorting: Input sequences are batched and sorted for efficient processing.
  2. Parallel Fragmentation:
    • Each worker processes a subset of batches independently using goroutines.
    • Each sequence longer than minsize is split into overlapping fragments of length residues, advancing by a step of length - overlap between consecutive fragment starts.
    • The final fragment is extended to cover the remainder (fusion mode), avoiding tiny trailing pieces.
  3. Resource Management:
    • Original sequences are recycled (s.Recycle()) to optimize memory usage.
    • Fragments are reassembled into batches, sorted by source and order, then rebatched to respect memory/size limits.
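
The fragmentation step above can be sketched on plain integer coordinates. This is a minimal illustration, not the obiiter implementation: fragmentBounds is a hypothetical helper, and the exact fusion rule used here (the last fragment absorbs the tail whenever fewer than one step of residues would remain) is an assumption about how "fusion mode" behaves.

```go
package main

import "fmt"

// fragmentBounds returns half-open [start, end) intervals for one sequence
// of length seqLen. Hypothetical helper for illustration only.
func fragmentBounds(seqLen, minsize, length, overlap int) [][2]int {
	step := length - overlap
	if step <= 0 {
		panic("overlap must be smaller than length")
	}
	// Sequences no longer than minsize are kept whole.
	if seqLen <= minsize {
		return [][2]int{{0, seqLen}}
	}
	var bounds [][2]int
	for start := 0; start < seqLen; start += step {
		end := start + length
		if end > seqLen {
			end = seqLen
		}
		// Assumed fusion rule: if less than one step would remain after
		// this fragment, extend it to the end of the sequence and stop.
		if seqLen-end < step {
			bounds = append(bounds, [2]int{start, seqLen})
			break
		}
		bounds = append(bounds, [2]int{start, end})
	}
	return bounds
}

func main() {
	// length 100, overlap 20, so the step is 80.
	fmt.Println(fragmentBounds(270, 100, 100, 20))
	// prints [[0 100] [80 180] [160 270]]
}
```

In this example the third fragment would normally end at 260, leaving a 10-residue tail; under the assumed rule it is extended to 270 instead.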

Key Features

  • Overlap handling: Ensures contiguous coverage without gaps.
  • Memory efficiency: Uses recycling and batched output.
  • Scalability: Leverages Go concurrency via nworkers.
  • Error handling: Panics on subsequence errors (e.g., invalid indices) instead of returning them to the caller.
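
The concurrency and ordering features listed above can be illustrated with a generic worker pool built only on the Go standard library. This is a sketch under stated assumptions: batch, processBatches, and halves are hypothetical names that stand in for the package's own batch and iterator types, and sorting by an order field is only an analogy for the "sorted by source and order" step.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// batch is a hypothetical stand-in for a sequence batch; order records
// the batch's position in the input stream.
type batch struct {
	order int
	seqs  []string
}

// processBatches fans batches out to nworkers goroutines (nworkers >= 1),
// applies fn to every sequence, and returns the results sorted by order.
func processBatches(in []batch, nworkers int, fn func(string) []string) []batch {
	jobs := make(chan batch)
	results := make(chan batch)
	var wg sync.WaitGroup

	for w := 0; w < nworkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range jobs {
				out := batch{order: b.order}
				for _, s := range b.seqs {
					out.seqs = append(out.seqs, fn(s)...)
				}
				results <- out
			}
		}()
	}

	// Feed the workers, then signal that no more batches will arrive.
	go func() {
		for _, b := range in {
			jobs <- b
		}
		close(jobs)
	}()

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []batch
	for b := range results {
		out = append(out, b)
	}
	// Workers finish in arbitrary order; restore the input order.
	sort.Slice(out, func(i, j int) bool { return out[i].order < out[j].order })
	return out
}

// halves is a toy fragmenter that splits a sequence into two pieces.
func halves(s string) []string {
	mid := len(s) / 2
	return []string{s[:mid], s[mid:]}
}

func main() {
	in := []batch{{0, []string{"ACGT"}}, {1, []string{"TTGGCC"}}, {2, []string{"AA"}}}
	for _, b := range processBatches(in, 2, halves) {
		fmt.Println(b.order, b.seqs)
	}
}
```

Because workers complete in a nondeterministic order, the final sort is what makes the output reproducible from run to run.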

Use Case

Ideal for preparing long-read sequencing data (e.g., PacBio, Nanopore) or assembled contigs for downstream analysis requiring fixed-length inputs (e.g., k-mer indexing, ML inference).