⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
Eric Coissac
2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -23,7 +23,7 @@
/.vscode
/build
/bugs
autodoc
/ncbitaxo
!/obitests/**
@@ -0,0 +1,300 @@
# NAME
obicomplement — reverse complement of sequences
---
# SYNOPSIS
```
obicomplement [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--compress|-Z] [--csv] [--debug]
[--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
[--fasta-output] [--fastq] [--fastq-output] [--genbank]
[--help|-h|-?] [--input-OBI-header] [--input-json-header]
[--json-output] [--max-cpu <int>] [--no-order]
[--no-progressbar] [--out|-o <FILENAME>]
[--output-OBI-header|-O] [--output-json-header]
[--paired-with <FILENAME>] [--raw-taxid] [--silent-warning]
[--skip-empty] [--solexa] [--taxonomy|-t <string>] [--u-to-t]
[--update-taxid] [--with-leaves] [<args>]
```
---
# DESCRIPTION
`obicomplement` computes the reverse complement of every sequence in the
input. For each input sequence, the nucleotides are first reversed, then
each base is replaced by its Watson-Crick complement (A↔T, C↔G), yielding
the strand that would pair with the original sequence read in the opposite
direction.
When quality scores are present (FASTQ data), they are reversed together with
the sequence so that each quality value remains associated with its
corresponding base. Ambiguous IUPAC characters (e.g. `N`, `R`, `Y`) are
complemented according to IUPAC pairing rules and preserved in the output.
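The mechanics described above can be sketched in a few lines of Python (an illustration only, not the tool's actual Go implementation): complement the sequence through the full IUPAC alphabet, reverse it, and reverse any quality string in step with it.

```python
# Sketch of the reverse-complement operation (illustrative, not the
# actual implementation). The sequence is lower-cased, complemented
# through the IUPAC alphabet, and reversed; gaps map to themselves.
COMPLEMENT = str.maketrans("acgtumrwsykvhdbn-.", "tgcaakywsrmbdhvn-.")

def reverse_complement(seq, qual=None):
    """Return (reverse-complemented sequence, reversed quality or None)."""
    rc = seq.lower().translate(COMPLEMENT)[::-1]
    return rc, (qual[::-1] if qual is not None else None)

rc, q = reverse_complement("atcgatcg", "ABCDEFGH")
print(rc)  # cgatcgat
print(q)   # HGFEDCBA
```

Reversing the quality string with the same `[::-1]` slice is what keeps each quality value aligned with its base after the operation.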
This operation is commonly needed when sequences have been sequenced on the
wrong strand, when a primer is designed on the reverse strand, or when
preparing sequences for strand-aware downstream analyses.
The command reads from standard input or from one or more files, processes
sequences in parallel, and writes the result to standard output or to the
file specified with `--out`.
---
# INPUT
`obicomplement` accepts biological sequence data in FASTA, FASTQ, EMBL,
GenBank, ecoPCR output, and CSV formats. When no format flag is given, the
format is inferred automatically from the file contents or extension.
Input is read from standard input when no filename argument is provided, or
from one or more files passed as positional arguments. Gzip-compressed files
are handled transparently.
Paired-end data can be provided with `--paired-with`, which specifies the
file containing the second mate. Both mates are reverse-complemented and
written to separate output files.
---
# OUTPUT
The output is a sequence file in which every sequence is the reverse
complement of the corresponding input sequence. The output format matches
the input by default (FASTA if no quality data, FASTQ if quality data are
present), and can be overridden with `--fasta-output`, `--fastq-output`, or
`--json-output`.
All annotations (attributes stored in the sequence header) are preserved
unchanged. Quality scores, when present, are reversed to stay aligned with
their bases.
## Observed output example
```
>seq001 {"definition":"basic DNA sequence"}
cgatcgatcgatcgatcgat
>seq002 {"definition":"GC-rich sequence"}
gcgcgcgcgcgcgcgcgcgc
>seq003 {"definition":"AT-rich sequence"}
atatatatatatatatatat
>seq004 {"definition":"palindromic sequence"}
aattccggaattccggaatt
>seq005 {"definition":"mixed sequence"}
agctagcatgcatagccgat
```
---
# OPTIONS
## Input format
**`--fasta`**
: Default: false. Force parsing of input as FASTA format.
**`--fastq`**
: Default: false. Force parsing of input as FASTQ format.
**`--embl`**
: Default: false. Force parsing of input as EMBL flatfile format.
**`--genbank`**
: Default: false. Force parsing of input as GenBank flatfile format.
**`--ecopcr`**
: Default: false. Force parsing of input as ecoPCR output format.
**`--csv`**
: Default: false. Force parsing of input as CSV format.
**`--solexa`**
: Default: false. Decode quality scores using the Solexa/Illumina pre-1.3
convention instead of the standard Phred+33 encoding.
**`--input-OBI-header`**
: Default: false. Interpret FASTA/FASTQ header annotations using the OBI
key=value format.
**`--input-json-header`**
: Default: false. Interpret FASTA/FASTQ header annotations using JSON
format.
**`--no-order`**
: Default: false. When several input files are given, declare that no
ordering relationship exists among them, allowing the reader to interleave
records freely.
**`--paired-with <FILENAME>`**
: Default: none. File containing the paired (R2) reads. When set,
`obicomplement` processes both mates and writes them to separate output
files.
## Sequence preprocessing
**`--u-to-t`**
: Default: false. Convert Uracil (U) to Thymine (T) before computing the
reverse complement. Useful when processing RNA sequences that must be
treated as DNA.
**`--skip-empty`**
: Default: false. Discard sequences of length zero from the output.
## Output format
**`--fasta-output`**
: Default: false. Write output in FASTA format regardless of whether quality
scores are present.
**`--fastq-output`**
: Default: false. Write output in FASTQ format (requires quality data).
**`--json-output`**
: Default: false. Write output in JSON format.
**`--out|-o <FILENAME>`**
: Default: `-` (standard output). File used to save the output.
**`--output-OBI-header|-O`**
: Default: false. Write FASTA/FASTQ header annotations in OBI key=value
format.
**`--output-json-header`**
: Default: false. Write FASTA/FASTQ header annotations in JSON format.
**`--compress|-Z`**
: Default: false. Compress the output with gzip.
## Taxonomy
**`--taxonomy|-t <string>`**
: Default: none. Path to a taxonomy database. Required only when the input
sequences carry taxid annotations that need to be validated or updated.
**`--fail-on-taxonomy`**
: Default: false. Cause `obicomplement` to exit with an error if a taxid
referenced in the data is not a currently valid node in the loaded
taxonomy.
**`--update-taxid`**
: Default: false. Automatically replace taxids that have been declared
merged into a newer node by the taxonomy database.
**`--raw-taxid`**
: Default: false. Print taxids without appending the taxon name and rank.
**`--with-leaves`**
: Default: false. When the taxonomy is extracted from the sequence file,
attach sequences as leaves of their taxid node.
## Performance and diagnostics
**`--max-cpu <int>`**
: Default: 16 (env: `OBIMAXCPU`). Number of parallel threads used to
process sequences.
**`--batch-size <int>`**
: Default: 1 (env: `OBIBATCHSIZE`). Minimum number of sequences per
processing batch.
**`--batch-size-max <int>`**
: Default: 2000 (env: `OBIBATCHSIZEMAX`). Maximum number of sequences per
processing batch.
**`--batch-mem <string>`**
: Default: `128M` (env: `OBIBATCHMEM`). Maximum memory allocated per batch
(e.g. `128K`, `64M`, `1G`). Set to `0` to disable the memory limit.
**`--no-progressbar`**
: Default: false. Disable the progress bar printed to stderr.
**`--silent-warning`**
: Default: false (env: `OBIWARNING`). Suppress warning messages.
**`--debug`**
: Default: false (env: `OBIDEBUG`). Enable debug logging.
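Size strings like those accepted by `--batch-mem` can be parsed as sketched below. The binary interpretation of the suffixes (1K = 1024 bytes) is an assumption here, not confirmed from the OBITools source.

```python
# Hypothetical parser for --batch-mem size strings such as "128K",
# "64M", "1G". Binary multiples (1K = 1024 bytes) are assumed.
SUFFIXES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}

def parse_batch_mem(value):
    """Return the limit in bytes; 0 disables the memory limit."""
    value = value.strip().upper().removesuffix("B")
    if value and value[-1] in SUFFIXES:
        return int(value[:-1]) * SUFFIXES[value[-1]]
    return int(value)

print(parse_batch_mem("128M"))  # 134217728
print(parse_batch_mem("0"))    # 0
```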
---
# EXAMPLES
```bash
# Reverse complement all sequences in a FASTA file
obicomplement sequences.fasta > out_default.fasta
```
**Expected output:** 5 sequences written to `out_default.fasta`.
```bash
# Reverse complement a FASTQ file, preserving quality scores
obicomplement reads.fastq --fastq-output --out out_fastq.fastq
```
**Expected output:** 5 sequences written to `out_fastq.fastq`.
```bash
# Convert RNA sequences to their reverse complement DNA strand
obicomplement --u-to-t rna_sequences.fasta > out_rna_rc.fasta
```
**Expected output:** 3 sequences written to `out_rna_rc.fasta`.
```bash
# Reverse complement paired-end reads into two separate output files
obicomplement R1.fastq --paired-with R2.fastq --out out_paired.fastq
```
**Expected output:** 3 sequences written to `out_paired_R1.fastq` and 3 sequences to `out_paired_R2.fastq`.
```bash
# Reverse complement and compress output, skipping any empty sequences
obicomplement --skip-empty --compress sequences.fasta --out out_compressed.fasta.gz
```
**Expected output:** 5 sequences written to `out_compressed.fasta.gz` (gzip-compressed FASTA).
```bash
# Reverse complement with OBI-format header output
obicomplement --output-OBI-header sequences.fasta --out out_obi.fasta
```
**Expected output:** 5 sequences written to `out_obi.fasta`.
```bash
# Reverse complement with explicit JSON-format header output
obicomplement --output-json-header sequences.fasta --out out_jsonheader.fasta
```
**Expected output:** 5 sequences written to `out_jsonheader.fasta`.
```bash
# Reverse complement and write full JSON output format
obicomplement --json-output sequences.fasta --out out_json.json
```
**Expected output:** 5 sequences written to `out_json.json`.
---
# SEE ALSO
- `obiconvert` — format conversion and sequence filtering pipeline
- `obipairing` — paired-end read merging (uses reverse complement internally)
- `obigrep` — sequence filtering and selection
---
# NOTES
Quality scores (Phred-scaled) are reversed in lock-step with the sequence
so that positional quality information remains valid after the reverse
complement operation. This is essential for downstream tools that rely on
per-base quality for alignment or variant calling.
Ambiguous IUPAC characters and gap symbols (`-`) are handled gracefully:
standard ambiguous bases are complemented according to IUPAC rules, while
gap and missing-data symbols are preserved unchanged.
@@ -0,0 +1,188 @@
# obiconsensus(1) — OBITools4 Manual
## NAME
`obiconsensus` — denoise Oxford Nanopore Technology (ONT) reads by building consensus sequences
## SYNOPSIS
```
obiconsensus [OPTIONS] [FILE...]
```
## DESCRIPTION
`obiconsensus` is designed to correct sequencing errors in long reads produced by Oxford Nanopore Technology (ONT) sequencers. Because ONT reads have a relatively high error rate compared to short-read technologies, sequences originating from the same biological molecule can differ slightly from one another. `obiconsensus` groups these related reads and builds a single, more reliable consensus sequence for each group.
The tool works by constructing a *difference graph*: each unique read is represented as a node, and two nodes are connected if their sequences differ by at most a small number of positions (controlled by `--distance`). Within each sample, clusters of closely related reads are identified, and a consensus is assembled from the cluster members using a *de Bruijn graph* approach. The result is a set of high-quality representative sequences, one per cluster.
Two denoising strategies are available:
- **Standard mode** (default): identifies hub nodes (likely true sequences) in the difference graph and builds a consensus from each hub and its immediate neighbours.
- **Clustering mode** (`--cluster`): groups reads around local abundance maxima and builds a consensus from each neighbourhood.
Sequences are read from one or more files, or from standard input when no file is given. Results are written to standard output or to a file specified with `--out`.
The tool processes data on a per-sample basis. Sample identity is taken from a sequence annotation attribute (default: `sample`). Each sample's reads are denoised independently.
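The difference-graph idea can be pictured with a toy sketch (hypothetical code, not the actual implementation): unique reads become nodes, reads within `--distance` mismatches are connected, and reads at least as abundant as all their neighbours act as cluster centres.

```python
from collections import Counter
from itertools import combinations

def hamming(a, b):
    """Mismatch count between two equal-length sequences."""
    if len(a) != len(b):
        return max(len(a), len(b))  # different lengths: treat as distant
    return sum(x != y for x, y in zip(a, b))

def difference_graph(reads, d=1):
    """Nodes are unique reads; edges join reads differing by <= d positions."""
    counts = Counter(reads)
    edges = {u: set() for u in counts}
    for u, v in combinations(counts, 2):
        if hamming(u, v) <= d:
            edges[u].add(v)
            edges[v].add(u)
    return counts, edges

def local_maxima(counts, edges):
    """Reads at least as abundant as every neighbour: cluster centres."""
    return [u for u in counts
            if all(counts[u] >= counts[v] for v in edges[u])]

reads = ["acgt"] * 5 + ["acga"] * 2 + ["tggg"] * 3
counts, edges = difference_graph(reads, d=1)
print(sorted(local_maxima(counts, edges)))  # ['acgt', 'tggg']
```

In this toy example `acga` (a single-substitution neighbour of the more abundant `acgt`) is absorbed into the `acgt` cluster, while `tggg` stands alone; the real tool then assembles a consensus from each cluster via a de Bruijn graph.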
## INPUT FORMATS
`obiconsensus` recognises the following input formats automatically. A specific format can be forced with the corresponding flag:
| Flag | Format |
|------|--------|
| `--fasta` | FASTA |
| `--fastq` | FASTQ |
| `--embl` | EMBL flat file |
| `--genbank` | GenBank flat file |
| `--ecopcr` | ecoPCR output |
| `--csv` | CSV tabular format |
Header annotation styles can be selected with `--input-OBI-header` (OBITools format) or `--input-json-header` (JSON format).
## OUTPUT FORMATS
By default, the output format matches the input format (FASTQ when quality scores are present, FASTA otherwise). The format can be forced:
- `--fasta-output` — write FASTA
- `--fastq-output` — write FASTQ
- `--json-output` — write JSON
- `--output-OBI-header` / `-O` — annotate FASTA/FASTQ title lines in OBITools format
- `--output-json-header` — annotate FASTA/FASTQ title lines in JSON format
- `--compress` / `-Z` — compress output with gzip
Use `--out FILE` / `-o FILE` to write results to a file instead of standard output.
## DENOISING OPTIONS
`--distance INT`, `-d INT`
: Maximum number of differences allowed between two reads for them to be considered related and placed in the same cluster. Default: 1. A value of 1 means reads differing by a single nucleotide substitution are grouped together.
`--cluster`, `-C`
: Switch to clustering mode. Instead of identifying hub sequences, reads are grouped around local abundance maxima. This mode may produce fewer but more representative consensus sequences.
`--kmer-size SIZE`
: Size of the short words (k-mers) used when building the de Bruijn graph for consensus assembly. The default value of `-1` means the size is estimated automatically from the data. Manual adjustment is rarely needed.
`--no-singleton`
: Discard any read (or cluster) that occurs only once across the dataset. Singleton sequences are often the result of sequencing errors and carry little biological signal.
`--low-coverage FLOAT`
: Discard any sample whose sequence coverage falls below this threshold. Default: 0 (no filtering). Useful for removing poorly sequenced samples.
`--sample ATTRIBUTE`, `-s ATTRIBUTE`
: Name of the sequence annotation attribute that identifies the sample of origin. Default: `sample`. Each unique value of this attribute is treated as an independent sample during denoising.
## OUTPUT ANNOTATION OPTIONS
`--unique`, `-U`
: After denoising, dereplicate the output sequences (equivalent to running `obiuniq`). Identical consensus sequences across samples are merged into a single record carrying abundance information.
`--save-graph DIRECTORY`
: Save the difference graphs built during denoising to the specified directory. Each graph is written in GraphML format, one file per sample. Useful for inspecting the clustering structure.
`--save-ratio FILE`
: Save a table of abundance ratios on graph edges to the specified CSV file. Each row describes the relative abundance of a read compared to its neighbours. Useful for quality control and parameter tuning.
## PERFORMANCE OPTIONS
`--max-cpu INT`
: Number of parallel threads to use for computation. Default: all available processors (up to 16). Reducing this value limits memory and CPU usage.
`--batch-size INT`
: Minimum number of sequences processed together in a single batch. Default: 1.
`--batch-size-max INT`
: Maximum number of sequences in a single batch. Default: 2000.
`--batch-mem STRING`
: Maximum memory allocated per batch (e.g., `128M`, `1G`). Default: `128M`. Set to `0` to disable the memory limit.
`--no-progressbar`
: Disable the progress bar.
`--no-order`
: When reading from multiple files, indicate that there is no meaningful order among them. This can improve performance for large multi-file inputs.
## OTHER OPTIONS
`--u-to-t`
: Convert uracil (U) to thymine (T) in all input sequences. Use this option when working with RNA data stored in a DNA context.
`--skip-empty`
: Remove sequences of length zero from the output.
`--solexa`
: Interpret quality scores using the Solexa encoding rather than the standard Phred encoding.
`--silent-warning`
: Suppress warning messages.
`--debug`
: Enable detailed logging for troubleshooting.
`--version`
: Print the version number and exit.
`--help`, `-h`
: Display a brief help message and exit.
## OUTPUT ATTRIBUTES
Each output consensus sequence carries several annotation attributes describing how it was built:
| Attribute | Description |
|-----------|-------------|
| `consensus` | Boolean flag: `true` if the sequence is a true consensus, `false` if it was kept unchanged (e.g., isolated singleton) |
| `merged_sample` | Map of sample names to read counts contributing to this consensus |
| `count` | Total number of reads merged into this consensus across all samples |
| `kmer_size` | Size of the k-mers used to build the de Bruijn graph for this consensus |
| `seq_length` | Length of the consensus sequence |
## EXAMPLES
**Basic denoising of a FASTQ file:**
```sh
obiconsensus reads.fastq > denoised.fastq
```
**Increase the allowed distance between reads to 2:**
```sh
obiconsensus --distance 2 reads.fastq > denoised.fastq
```
**Use clustering mode and remove singletons:**
```sh
obiconsensus --cluster --no-singleton reads.fastq > denoised.fastq
```
**Denoise, then dereplicate the output:**
```sh
obiconsensus --unique reads.fastq > denoised_uniq.fastq
```
**Save denoising graphs for inspection:**
```sh
obiconsensus --save-graph ./graphs reads.fastq > denoised.fastq
```
**Specify the sample annotation attribute:**
```sh
obiconsensus --sample library reads.fastq > denoised.fastq
```
## SEE ALSO
`obiuniq`(1), `obiclean`(1), `obigrep`(1), `obiconvert`(1)
## NOTES
`obiconsensus` was designed primarily for Oxford Nanopore Technology amplicon data, where individual reads of the same molecule may carry different sequencing errors. For short-read Illumina data, `obiclean` may be more appropriate.
The automatic k-mer size selection (`--kmer-size -1`) works well in most cases. If the consensus assembly fails for a group (e.g., due to circular structures in the de Bruijn graph), the k-mer size is progressively increased until the assembly succeeds or a fallback strategy is used.
@@ -0,0 +1,179 @@
# NAME
obiconvert — conversion of sequence files to various formats
---
# SYNOPSIS
```
obiconvert [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--compress|-Z] [--csv] [--debug]
[--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
[--fasta-output] [--fastq] [--fastq-output] [--genbank]
[--help|-h|-?] [--input-OBI-header] [--input-json-header]
[--json-output] [--max-cpu <int>] [--no-order] [--no-progressbar]
[--out|-o <FILENAME>] [--output-OBI-header|-O]
[--output-json-header] [--paired-with <FILENAME>] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>] [--raw-taxid]
[--silent-warning] [--skip-empty] [--solexa]
[--taxonomy|-t <string>] [--u-to-t] [--update-taxid] [--version]
[--with-leaves] [<args>]
```
---
# DESCRIPTION
obiconvert is a versatile command-line tool that converts biological sequence data between multiple standard bioinformatics formats. It enables biologists to process large datasets by reading from one format and writing to another, with support for quality scores, taxonomic annotations, and various input/output combinations. The tool is optimized for high-performance processing with configurable batching, parallel execution, and memory management.
Biologists use obiconvert to standardize sequence data for compatibility with different bioinformatics tools, extract quality information from FASTQ files into more readable formats, or convert between FASTA and FASTQ when working with DNA/RNA sequences that have associated quality data. The tool automatically detects input formats and intelligently selects output formats based on data presence (e.g., FASTQ when quality scores exist, FASTA otherwise). To force a specific output format regardless of input content, use the explicit output flags (`--fasta-output`, `--fastq-output`, `--json-output`); without them, a FASTQ input with quality scores stays in FASTQ format even when the output filename has a `.fasta` extension.
---
# INPUT
obiconvert accepts input in multiple biological sequence formats:
- **FASTA**: Standard text-based format with `>` headers and sequence data
- **FASTQ**: Text-based format extending FASTA with per-base quality scores (default when both sequence and quality data are present)
- **GenBank**: Comprehensive biological record format with annotations
- **EMBL**: EMBL flatfile format for sequence and feature information
- **ecoPCR**: Specialized output format from ecoPCR amplification tools
- **CSV**: Tabular sequence data with configurable delimiters
Input is provided as positional arguments (file paths or `-` for stdin). The tool automatically detects the input format based on file content and can handle multiple files in sequence. When paired-end sequencing is used, the `--paired-with` option specifies the mate read file.
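Format auto-detection can be illustrated with a simplified sniffer (hypothetical; the real detector inspects file contents more thoroughly, as the MIME types reported in the logs suggest). The leading characters of each format are distinctive.

```python
def sniff_format(text):
    """Guess a sequence format from the first non-blank line.

    A simplification for illustration: '>' opens FASTA, '@' opens
    FASTQ, 'LOCUS' opens GenBank, and 'ID ' opens EMBL records.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            return "fasta"
        if line.startswith("@"):
            return "fastq"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("ID "):
            return "embl"
        return "unknown"
    return "unknown"

print(sniff_format(">seq1\nacgt\n"))          # fasta
print(sniff_format("@read1\nacgt\n+\nIIII"))  # fastq
```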
---
# OUTPUT
obiconvert produces sequence data in several output formats depending on input content and user selection:
- **FASTA**: Text format with sequence only (use `--fasta-output` to force)
- **FASTQ**: Format including quality scores (default when quality data present; use `--fastq-output` to force)
- **JSON**: Structured output with all sequence metadata and attributes (use `--json-output`)
The tool preserves all sequence annotations (taxonomic information, custom attributes) during conversion. When converting to FASTA/FASTQ formats, title line annotations can be formatted as OBI structured data or JSON using the `--output-OBI-header`/`--output-json-header` flags. Sequences of zero length can be suppressed with `--skip-empty`.
## Observed output example
```
>seq001 {"definition":"DNA sequence with quality scores for FASTQ to FASTA conversion"}
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>seq002 {"definition":"Second sequence with moderate quality scores"}
gctagctagctagctagctagctagctagctagctagct
>seq003 {"definition":"Third sequence with high quality scores"}
ttaaccggttaaccggttaaccggttaaccggttaaccg
>seq004 {"definition":"Fourth sequence with variable quality scores"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacg
```
---
# OPTIONS
## Input Format Options
- **--fasta**: Read data following the fasta format. (default: false)
- **--fastq**: Read data following the fastq format. (default: false)
- **--genbank**: Read data following the Genbank flatfile format. (default: false)
- **--embl**: Read data following the EMBL flatfile format. (default: false)
- **--ecopcr**: Read data following the ecoPCR output format. (default: false)
- **--csv**: Read data following the CSV format. (default: false)
## Input Header Options
- **--input-OBI-header**: FASTA/FASTQ title line annotations follow OBI format. (default: false)
- **--input-json-header**: FASTA/FASTQ title line annotations follow json format. (default: false)
## Output Format Options
- **--fasta-output**: Write sequence in fasta format (default if no quality data available). (default: false)
- **--fastq-output**: Write sequence in fastq format (default if quality data available). (default: false)
- **--json-output**: Write sequence in json format. (default: false)
## Output Header Options
- **--output-OBI-header|-O**: output FASTA/FASTQ title line annotations follow OBI format. (default: false)
- **--output-json-header**: output FASTA/FASTQ title line annotations follow json format. (default: false)
## Processing Options
- **--skip-empty**: Sequences of length equal to zero are suppressed from the output (default: false)
- **--no-order**: When several input files are provided, indicates that there is no order among them. (default: false)
- **--u-to-t**: Convert Uracil to Thymine. (default: false)
- **--update-taxid**: Automatically replace taxids that have been declared merged into a newer node. (default: false)
- **--raw-taxid**: When set, taxids are printed without any supplementary information (taxon name and rank). (default: false)
- **--fail-on-taxonomy**: Fail with an error if a taxid used in the data is not a currently valid one. (default: false)
- **--with-leaves**: If the taxonomy is extracted from a sequence file, sequences are added as leaves of their taxid node. (default: false)
## File Options
- **--out|-o <FILENAME>**: Filename used for saving the output (default: "-")
- **--paired-with <FILENAME>**: Filename containing the paired reads (default: "")
## Performance Options
- **--batch-mem <string>**: Maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M). Set to 0 to disable. (default: "", env: OBIBATCHMEM)
- **--batch-size <int>**: Minimum number of sequences per batch (floor, default 1) (default: 1, env: OBIBATCHSIZE)
- **--batch-size-max <int>**: Maximum number of sequences per batch (ceiling, default 2000) (default: 2000, env: OBIBATCHSIZEMAX)
- **--max-cpu <int>**: Number of parallel threads computing the result (default: 16, env: OBIMAXCPU)
- **--compress|-Z**: Compress all the result using gzip (default: false)
## Debug Options
- **--debug**: Enable debug mode, by setting log level to debug. (default: false, env: OBIDEBUG)
- **--silent-warning**: Stop printing of the warning message (default: false, env: OBIWARNING)
- **--no-progressbar**: Disable the progress bar printing (default: false)
## Profiling Options
- **--pprof**: Enable pprof server. Look at the log for details. (default: false)
- **--pprof-goroutine <int>**: Enable goroutine blocking profiling. (default: 6060, env: OBIPPROFGOROUTINE)
- **--pprof-mutex <int>**: Enable mutex lock profiling. (default: 10, env: OBIPPROFMUTEX)
## Utility Options
- **--taxonomy|-t <string>**: Path to the taxonomy database. (default: "")
- **--solexa**: Decodes quality string according to the Solexa specification. (default: false, env: OBISOLEXA)
- **--help|-h|-?**: Show help message (default: false)
- **--version**: Prints the version and exits. (default: false)
---
# EXAMPLES
## Convert FASTQ to FASTA
```bash
# Convert quality-score data from FASTQ to readable FASTA format
obiconvert --fastq --fasta-output input.fastq -o output.fasta
```
**Expected output:** 4 sequences written to `output.fasta`.
## Convert FASTA to JSON
```bash
# Convert sequences to structured JSON format preserving all annotations
obiconvert --fasta --json-output input.fasta -o output.json
```
**Expected output:** 3 sequences written to `output.json`.
## Process paired-end sequencing data
```bash
# Convert paired FASTQ files preserving read pairing
obiconvert --fastq --fasta-output forward.fastq --paired-with reverse.fastq -o merged_sequences.fasta
```
**Expected output:** 4 sequences written to `merged_sequences_R1.fasta` and `merged_sequences_R2.fasta`.
---
# SEE ALSO
- obiannotate: Add taxonomic and functional annotations to sequences
- obicount: Count sequences in files
- obigrep: Filter sequences based on attributes or patterns
- obisummary: Generate statistics from sequence files
- obiuniq: Remove duplicate sequences
---
# NOTES
obiconvert automatically selects the output format based on input data presence, preferring FASTQ when quality scores are available and FASTA otherwise; the output format is not determined by the output filename extension. To force a specific output format, use `--fasta-output`, `--fastq-output`, or `--json-output` explicitly.
Memory usage is controlled through batch processing, with configurable memory limits per batch to handle large datasets efficiently. Progress reporting can be disabled for scripting purposes using `--no-progressbar`.
When working with taxonomic data, ensure the taxonomy database is accessible and properly formatted to avoid failures during sequence annotation processing.
@@ -0,0 +1,190 @@
# NAME
obicount — counts the sequences present in a file of sequences
---
# SYNOPSIS
```
obicount [--batch-mem <string>] [--batch-size <int>] [--batch-size-max <int>]
[--csv] [--debug] [--ecopcr] [--embl] [--fasta] [--fastq]
[--genbank] [--help|-h|-?] [--input-OBI-header]
[--input-json-header] [--max-cpu <int>] [--no-order] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>] [--reads|-r]
[--silent-warning] [--solexa] [--symbols|-s] [--u-to-t]
[--variants|-v] [--version] [<args>]
```
---
# DESCRIPTION
obicount is a command-line tool designed to count biological sequences from various input formats. It helps biologists quickly obtain quantitative metrics about sequence collections, which is essential for quality control, data assessment, and pipeline monitoring. The tool can count reads (total sequences), variants (unique sequence strings), or symbols (sum of character lengths), providing flexibility to focus on specific aspects of sequence data depending on the analysis needs.
---
# INPUT
obicount accepts input from files or stdin, supporting multiple biological sequence formats:
- FASTA (.fasta[.gz])
- FASTQ (.fastq[.fq][.gz])
- GenBank/EMBL (.gb|.gbff|.dat[.gz])
- ecoPCR format (.ecopcr[.gz])
- CSV format (--csv flag)
Input can be provided as multiple filenames or read from stdin. The tool automatically detects file formats and parses sequences accordingly.
---
# OUTPUT
obicount outputs one or more of the following metrics, depending on the flags used:
- **Read counts**: Total number of sequences in the input
- **Variant counts**: Number of unique sequence strings (distinct sequences)
- **Symbol counts**: Sum of all character lengths across all sequences
When none of the counting flags (`-r`, `-v`, `-s`) is given, all three metrics are reported. Output is printed to stdout in CSV format with the header `entities,n`, followed by one row per metric giving the entity type and its count.
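The three metrics are straightforward to define; the sketch below (illustrative, not obicount's implementation) computes them and renders the CSV layout described above.

```python
def count_metrics(sequences):
    """reads = records, variants = distinct sequences, symbols = total length."""
    return {
        "reads": len(sequences),
        "variants": len(set(sequences)),
        "symbols": sum(len(s) for s in sequences),
    }

def as_csv(metrics):
    """Render the counts with the `entities,n` header, one row per metric."""
    return "\n".join(["entities,n"] + [f"{k},{v}" for k, v in metrics.items()])

print(as_csv(count_metrics(["acgt", "acgt", "ttgg"])))
# entities,n
# reads,3
# variants,2
# symbols,12
```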
---
# OPTIONS
## General Options
- --help|-h|-?
Show help message and exit.
- --max-cpu <int>
Number of parallel threads computing the result (default: 16, env: OBIMAXCPU).
- --debug
Enable debug mode, by setting log level to debug. (default: false, env: OBIDEBUG)
- --silent-warning
Stop printing of the warning message (default: false, env: OBIWARNING)
## Input Format Options
- --fasta
Read data following the fasta format. (default: false)
- --fastq
Read data following the fastq format. (default: false)
- --genbank
Read data following the Genbank flatfile format. (default: false)
- --embl
Read data following the EMBL flatfile format. (default: false)
- --ecopcr
Read data following the ecoPCR output format. (default: false)
- --csv
Read data following the CSV format. (default: false)
## Input Header Options
- --input-OBI-header
FASTA/FASTQ title line annotations follow OBI format. (default: false)
- --input-json-header
FASTA/FASTQ title line annotations follow json format. (default: false)
## Counting Mode Options
- --reads|-r
Prints read counts. (default: false)
- --variants|-v
Prints variant counts. (default: false)
- --symbols|-s
Prints symbol counts. (default: false)
## Processing Options
- --u-to-t
Convert Uracil to Thymine. (default: false)
- --solexa
Decodes quality string according to the Solexa specification. (default: false, env: OBISOLEXA)
- --no-order
When several input files are provided, indicates that there is no order among them. (default: false)
## Performance Options
- --batch-mem <string>
Maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M). Set to 0 to disable. (default: "", env: OBIBATCHMEM)
- --batch-size <int>
Minimum number of sequences per batch (floor, default 1) (default: 1, env: OBIBATCHSIZE)
- --batch-size-max <int>
Maximum number of sequences per batch (ceiling, default 2000) (default: 2000, env: OBIBATCHSIZEMAX)
- --max-cpu <int>
Number of parallel threads computing the result (default: 16, env: OBIMAXCPU)
## Profiling Options
- --pprof
Enable pprof server. Look at the log for details. (default: false)
- --pprof-goroutine <int>
Enable goroutine blocking profiling. (default: 6060, env: OBIPPROFGOROUTINE)
- --pprof-mutex <int>
Enable mutex lock profiling. (default: 10, env: OBIPPROFMUTEX)
- --version
Prints the version and exits. (default: false)
---
# EXAMPLES
```bash
# Count total number of sequences in a FASTA file
# Useful for quick assessment of dataset size
obicount input.fasta
```
**Expected output:** 4 sequences, out_default.txt
```bash
# Count only the number of unique sequence variants
# Helpful for identifying genetic diversity in population data
obicount --variants input.fasta
```
**Expected output:** 4 sequences, out_variants.txt
```bash
# Count sum of all sequence symbol lengths (nucleotides/amino acids)
# Useful for estimating total data volume or computing average read length
obicount --symbols input.fasta
```
**Expected output:** 4 sequences, out_symbols.txt
```bash
# Count reads from FASTQ format with quality scores
# Essential for assessing read throughput in sequencing data
obicount --fastq --reads input.fastq
```
**Expected output:** 4 sequences, out_fastq_reads.txt
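The distinction between the three counters can be sketched in Python (toy records with hypothetical values, not the tool's implementation):

```python
# Each record is (sequence, count), where count is the number of reads
# collapsed into that unique sequence (e.g. by obiuniq)
records = [("acgtacgt", 3), ("ttttcccc", 1), ("acgt", 1)]

variants = len(records)                    # --variants: number of unique sequences
reads = sum(c for _, c in records)         # --reads: sum of the count attributes
symbols = sum(len(s) for s, _ in records)  # --symbols: total number of bases
print(variants, reads, symbols)  # 3 5 20
```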
---
# OUTPUT
## Observed output example
```
time="2026-04-02T19:33:11+02:00" level=info msg="Number of workers set 16"
time="2026-04-02T19:33:11+02:00" level=info msg="Found 1 files to process"
time="2026-04-02T19:33:11+02:00" level=info msg="input.fasta mime type: text/fasta"
entities,n
variants,5
reads,5
symbols,435
```
---
# SEE ALSO
- obiconvert - Convert between biological sequence file formats
- obiuniq - Remove duplicate sequences from files
---
# NOTES
_(not available)_
# NAME
obicsv — converts sequence files to CSV format
---
# SYNOPSIS
```
obicsv [--auto] [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--compress|-Z] [--count] [--csv] [--debug]
[--definition|-d] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
[--fastq] [--genbank] [--help|-h|-?] [--ids|-i] [--input-OBI-header]
[--input-json-header] [--keep|-k <KEY>]... [--max-cpu <int>]
[--na-value <NAVALUE>] [--no-order] [--no-progressbar] [--obipairing]
[--out|-o <FILENAME>] [--pprof] [--pprof-goroutine <int>]
[--pprof-mutex <int>] [--quality|-q] [--raw-taxid] [--sequence|-s]
[--silent-warning] [--solexa] [--taxon] [--taxonomy|-t <string>]
[--u-to-t] [--update-taxid] [--version] [--with-leaves] [<args>]
```
---
# DESCRIPTION
`obicsv` converts biological sequence data into CSV format for easy inspection, spreadsheet analysis, or integration with other tools. A biologist might use it to export sequences from OBITools for quality control, taxonomic inspection, or downstream analysis in R or Python.
Columns must be explicitly selected: use `--ids` for the identifier, `--sequence` for the nucleotide sequence, `--quality` for quality scores, `--taxon` for taxonomic information, `--auto` to auto-detect annotation attributes, or `--keep` for specific named attributes. Multiple flags can be combined freely.
The command uses parallel workers to process large datasets efficiently and can write output to stdout or directly to a file.
---
# INPUT
`obicsv` accepts input from files or stdin. The input format is automatically detected based on the file extension, but can be explicitly specified using format flags.
Supported input formats:
- FASTA (`--fasta`)
- FASTQ (`--fastq`)
- GenBank (`--genbank`)
- EMBL (`--embl`)
- ecoPCR output (`--ecopcr`)
- CSV (`--csv`)
Input sources:
- Local files (specified as arguments)
- stdin (when no input file is provided)
- Remote URLs (`http://`, `https://`, `ftp://`)
- Directories (automatically scanned for valid files)
Header formats:
- OBI format (`--input-OBI-header`)
- JSON format (`--input-json-header`)
- Auto-detection (default)
Taxonomy database can be provided with `--taxonomy|-t`.
---
# OUTPUT
The output is a CSV file with one row per sequence. The columns included depend on the flags used:
| Column | Flag | Description |
|--------|------|-------------|
| id | `--ids\|-i` | Sequence identifier |
| sequence | `--sequence\|-s` | DNA/RNA sequence |
| qualities | `--quality\|-q` | Quality scores (ASCII-encoded) |
| definition | `--definition\|-d` | Sequence description/annotation |
| count | `--count` | Number of reads represented by this sequence |
| taxid | `--taxon` | NCBI taxonomy identifier |
| scientific_name | `--taxon` | Taxonomic scientific name |
| custom attributes | `--keep\|-k` | Any attribute stored in sequence annotations |
If `--auto` is used, columns are automatically determined based on the attributes present in the first batch of sequences.
Missing values are written as the NA value (default: "NA").
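How a missing attribute becomes the NA value can be sketched as follows (column names and records are hypothetical; this is not the tool's implementation):

```python
import csv
import io

# Annotation maps of two sequences; seq002 lacks the taxid attribute
rows = [{"id": "seq001", "sample": "A", "taxid": "9606"},
        {"id": "seq002", "sample": "B"}]
cols = ["id", "sample", "taxid"]
na = "NA"  # customizable with --na-value

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(cols)
for r in rows:
    # any attribute absent from a record is filled with the NA value
    writer.writerow([r.get(c, na) for c in cols])
print(buf.getvalue(), end="")
```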
## Observed output example
```csv
id,sequence
seq001,atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc
seq002,ggggaaaattttccccggggaaaattttccccggggaaaattttccccggggaaaatttt
seq003,cccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
```
---
# OPTIONS
## Output Columns
These flags control which columns appear in the CSV output.
- **`--ids|-i`**
- Default: `false`
- Meaning: Include the sequence identifier column. Useful for tracking or linking sequences.
- **`--sequence|-s`**
- Default: `false`
- Meaning: Include the nucleotide or amino acid sequence. This is the main biological data.
- **`--quality|-q`**
- Default: `false`
- Meaning: Include quality scores for each position. Essential for quality control and filtering.
- **`--definition|-d`**
- Default: `false`
- Meaning: Include the sequence description or definition from the source file.
- **`--count`**
- Default: `false`
- Meaning: Include the count attribute, representing how many original reads were collapsed into this sequence (e.g., from clustering or demultiplexing).
- **`--taxon`**
- Default: `false`
- Meaning: Include taxonomic information. Outputs both the NCBI taxid and the scientific name. Requires a taxonomy database (see `--taxonomy`).
- **`--obipairing`**
- Default: `false`
- Meaning: Include attributes that were added by the `obipairing` command (pairing scores, mismatches, etc.).
- **`--auto`**
- Default: `false`
- Meaning: Automatically detect which columns to output by examining the first batch of sequences. Outputs all annotation attributes found in the headers. Can be combined with `--ids`, `--sequence`, etc. to add those columns on top of the auto-detected ones.
- **`--keep|-k <KEY>`**
- Default: `none`
- Meaning: Keep only the specified attribute(s). Can be used multiple times to keep several columns. Useful for extracting specific annotations.
- **`--na-value <NAVALUE>`**
- Default: `"NA"`
- Meaning: String to use for missing or unavailable values in the CSV. Customize for compatibility with other tools (e.g., empty string, "NA", "null").
## Input/Output Files
- **`--out|-o <FILENAME>`**
- Default: `"-"` (stdout)
- Meaning: Write output to the specified file instead of stdout.
- **`--compress|-Z`**
- Default: `false`
- Meaning: Compress the output using gzip.
## Input Format
- **`--fasta`**, **`--fastq`**, **`--genbank`**, **`--embl`**, **`--ecopcr`**, **`--csv`**
- Default: auto-detection
- Meaning: Explicitly specify the input format.
- **`--input-OBI-header`**, **`--input-json-header`**
- Default: auto-detection
- Meaning: Specify the header format in FASTA/FASTQ files (OBI or JSON annotations).
- **`--u-to-t`**
- Default: `false`
- Meaning: Convert Uracil to Thymine. Useful for RNA sequences.
- **`--solexa`**
- Default: `false`
- Meaning: Decode quality strings according to the Solexa specification instead of Phred.
## Taxonomy
- **`--taxonomy|-t <string>`**
- Default: `""`
- Meaning: Path to the taxonomy database directory. Required for `--taxon` output.
- **`--fail-on-taxonomy`**
- Default: `false`
- Meaning: Make OBITools fail if a used taxid is not currently valid.
- **`--update-taxid`**
- Default: `false`
- Meaning: Automatically update taxids that have been merged to their newest valid taxid.
- **`--raw-taxid`**
- Default: `false`
- Meaning: Print only taxids without supplementary information (name and rank).
- **`--with-leaves`**
- Default: `false`
- Meaning: Add sequences as leaves of their taxid annotation when taxonomy is extracted from a sequence file.
## Performance
- **`--max-cpu <int>`**
- Default: `16`
- Meaning: Number of parallel threads for processing.
- **`--batch-size <int>`**
- Default: `1`
- Meaning: Minimum number of sequences per batch.
- **`--batch-size-max <int>`**
- Default: `2000`
- Meaning: Maximum number of sequences per batch.
- **`--batch-mem <string>`**
- Default: `"128M"`
- Meaning: Maximum memory per batch (e.g., 128K, 64M, 1G).
- **`--no-order`**
- Default: `false`
- Meaning: When multiple input files are provided, indicates there is no order among them.
- **`--no-progressbar`**
- Default: `false`
- Meaning: Disable the progress bar.
## Other Options
- **`--debug`**
- Default: `false`
- Meaning: Enable debug mode by setting log level to debug.
- **`--pprof`**
- Default: `false`
- Meaning: Enable pprof server.
- **`--pprof-goroutine <int>`**
- Default: `6060`
- Meaning: Enable profiling of goroutine blocking.
- **`--pprof-mutex <int>`**
- Default: `10`
- Meaning: Enable profiling of mutex lock.
- **`--silent-warning`**
- Default: `false`
- Meaning: Suppress warning messages.
- **`--version`**
- Default: `false`
- Meaning: Print version information and exit.
- **`--help|-h|-?`**
- Default: `false`
- Meaning: Print help information.
---
# EXAMPLES
**Export sequences with identifiers to CSV**
Extracts sequence IDs and sequences from a FASTQ file.
```bash
obicsv --ids --sequence sequences.fastq -o output1.csv
```
**Expected output:** 3 sequences written to `output1.csv`.
**Export sequences with quality scores**
Useful for quality control and filtering in downstream tools.
```bash
obicsv --ids --sequence --quality sequences.fastq -o output2.csv
```
**Expected output:** 3 sequences written to `output2.csv`.
**Export with taxonomic information**
Includes taxid and scientific name for taxonomic analysis.
```bash
obicsv --ids --sequence --taxon --taxonomy /path/to/taxonomy sequences.fasta -o output.csv
```
**Auto-detect annotation columns from sequence headers**
Automatically discovers all annotation attributes present in the sequence headers and outputs them as CSV columns. Combined with `--ids` to also include the sequence identifier.
```bash
obicsv --auto --ids sequences.fasta -o output4.csv
```
**Expected output:** 3 rows in `output4.csv` with columns `id`, `sample`, `taxid` (attributes found in sequence headers).
**Extract specific attributes**
Keeps only the specified attributes as columns. Attributes not present in a sequence are written as the NA value.
```bash
obicsv --keep sample --keep taxid sequences.fasta -o output5.csv
```
**Expected output:** 3 rows in `output5.csv` with columns `taxid`, `sample`.
**Export with compression**
Writes gzip-compressed CSV output for large datasets.
```bash
obicsv --ids --sequence -Z sequences.fasta -o output6.csv.gz
```
**Expected output:** 3 sequences written to `output6.csv.gz`.
---
# SEE ALSO
- `obiconvert` — input/output handling framework
- `obipairing` — pairing information (used with `--obipairing`)
- Other export commands: `obifasta`, `obifastq`, `obijson`
---
# NOTES
- Without any column selection flag (`--ids`, `--sequence`, `--quality`, `--taxon`, `--auto`, `--keep`), the output contains no columns and no data.
- The `--taxon` option requires a valid taxonomy database specified with `--taxonomy`.
- Output is written to stdout by default; use `--out` to write to a file.
- Missing attributes are written as the NA value (customizable with `--na-value`).
- Input sequences are processed using streaming iterators to minimize memory footprint, even for large files.
# obidemerge
## NAME
`obidemerge` — split merged sequence records back into individual, sample-annotated copies
## SYNOPSIS
```
obidemerge [options] [input_files...]
```
## DESCRIPTION
In a typical metabarcoding workflow, `obiuniq` or similar tools collapse identical sequences
from multiple samples into a single representative record. That record carries a statistics
attribute (for example `merged_sample`) that stores, for every original sample, how many
times the sequence was observed. This compact representation is convenient for clustering
and denoising, but some downstream analyses need the original, per-sample view.
`obidemerge` reverses that merging step. For each input sequence, it reads the statistics
stored under a chosen attribute (by default `sample`) and produces one output sequence per
entry in that statistics map. Each output sequence is a copy of the original, but:
- its `sample` attribute (or whichever slot you chose) is set to the name of the individual
sample,
- its read count is set to the abundance recorded for that sample.
The original statistics attribute is removed from all output sequences.
Sequences that carry no statistics for the chosen slot are passed through unchanged.
The command reads sequences from one or more files, or from standard input when no file is
given, and writes the results to standard output or to the file specified with `--out`.
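As an illustration (identifiers, sequences, and counts here are hypothetical), a merged record such as:

```
>seq001 {"count":9,"merged_sample":{"sampleA":5,"sampleB":3,"sampleC":1}}
acgtacgtacgtacgtacgt
```

is expanded into three records, one per sample:

```
>seq001 {"count":5,"sample":"sampleA"}
acgtacgtacgtacgtacgt
>seq001 {"count":3,"sample":"sampleB"}
acgtacgtacgtacgtacgt
>seq001 {"count":1,"sample":"sampleC"}
acgtacgtacgtacgtacgt
```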
## INPUT FORMATS
`obidemerge` accepts all sequence formats supported by OBITools4:
| Format | Description |
|--------|-------------|
| FASTA | Plain nucleotide sequences with annotation in the title line |
| FASTQ | Sequences with per-base quality scores |
| EMBL | European Nucleotide Archive flat-file format |
| GenBank | NCBI GenBank flat-file format |
| ecoPCR | Output produced by the ecoPCR tool |
| CSV | Comma-separated values with sequence and metadata columns |
The format is detected automatically from the file extension or content. You can override
detection with the format flags listed under **Input format options** below.
Annotations embedded in FASTA/FASTQ title lines can follow the OBI key=value style
(`--input-OBI-header`) or JSON style (`--input-json-header`).
## OUTPUT FORMATS
By default, the output format mirrors the input:
- If the input contains quality scores, output is FASTQ.
- Otherwise, output is FASTA with OBI-style annotations.
You can force a specific format with `--fasta-output`, `--fastq-output`, or `--json-output`.
## OPTIONS
### Demerge option
`--demerge <slot>`, `-d <slot>`
: Name of the sequence attribute that holds the per-sample statistics to expand.
Each key in that statistics map becomes a separate output sequence.
**Default:** `sample`
### Output options
`--out <FILENAME>`, `-o <FILENAME>`
: Write output to this file instead of standard output. Use `-` for standard output.
**Default:** `-` (standard output)
`--fasta-output`
: Write output in FASTA format, even when quality scores are available.
**Default:** false
`--fastq-output`
: Write output in FASTQ format (requires quality scores in the input).
**Default:** false
`--json-output`
: Write output in JSON format, one record per line.
**Default:** false
`--output-OBI-header`, `-O`
: Write FASTA/FASTQ title lines in OBI key=value annotation style.
**Default:** false (JSON-style headers)
`--output-json-header`
: Write FASTA/FASTQ title lines in JSON annotation style.
**Default:** false
`--compress`, `-Z`
: Compress the output with gzip.
**Default:** false
`--skip-empty`
: Discard sequences of length zero from the output.
**Default:** false
### Input format options
`--fasta`
: Force reading in FASTA format.
`--fastq`
: Force reading in FASTQ format.
`--embl`
: Force reading in EMBL flat-file format.
`--genbank`
: Force reading in GenBank flat-file format.
`--ecopcr`
: Force reading in ecoPCR output format.
`--csv`
: Force reading in CSV format.
`--input-OBI-header`
: Parse FASTA/FASTQ title lines as OBI-style key=value annotations.
`--input-json-header`
: Parse FASTA/FASTQ title lines as JSON annotations.
`--solexa`
: Decode quality scores using the Solexa/Illumina 1.0 convention instead of the standard
Phred scale. Use this only for very old sequencing data.
**Default:** false
`--u-to-t`
: Convert uracil (U) to thymine (T) in all sequences. Useful when working with RNA-derived
data that should be treated as DNA.
**Default:** false
`--no-order`
: When reading from several input files, do not attempt to preserve the order of records
across files. May improve speed when order does not matter.
**Default:** false
### Taxonomy options
`--taxonomy <path>`, `-t <path>`
: Path to the OBITools4 taxonomy database. Required only if taxonomic identifiers need to
be resolved or validated during output.
**Default:** none
`--fail-on-taxonomy`
: Stop with an error if a taxonomic identifier in the data is not found in the loaded
taxonomy database.
**Default:** false
`--raw-taxid`
: Print taxonomic identifiers as plain numbers, without appending the taxon name and rank.
**Default:** false
`--update-taxid`
: Automatically replace deprecated taxonomic identifiers with their current equivalents,
as declared in the taxonomy database.
**Default:** false
`--with-leaves`
: When a taxonomy is extracted from the sequence file itself, treat each sequence as a
leaf node under its annotated taxonomic identifier.
**Default:** false
### Performance options
`--max-cpu <int>`
: Maximum number of parallel processing threads. Increase for faster processing on
multi-core machines.
**Default:** 16 (or the value of the `OBIMAXCPU` environment variable)
`--batch-size <int>`
: Minimum number of sequences processed together as a group.
**Default:** 1
`--batch-size-max <int>`
: Maximum number of sequences processed together as a group.
**Default:** 2000
`--batch-mem <size>`
: Maximum memory used per processing group (e.g. `64M`, `1G`). Set to `0` to disable the
memory limit and rely on `--batch-size-max` alone.
**Default:** `128M`
### Display options
`--no-progressbar`
: Hide the progress bar.
**Default:** false
`--silent-warning`
: Suppress warning messages.
**Default:** false
`--debug`
: Enable verbose debug logging.
**Default:** false
`--version`
: Print the OBITools4 version and exit.
`--help`, `-h`, `-?`
: Print this help message and exit.
## EXAMPLES
### Example 1 — basic demerge using the default slot
After running `obiuniq`, the file `unique.fasta` contains merged sequences whose
`merged_sample` attribute records abundance per sample. Demerge back to one
sequence per sample:
```bash
obidemerge -d sample unique.fasta > per_sample_merged.fasta
```
**Expected output:** 7 sequences written to `per_sample_merged.fasta`.
### Example 2 — demerge with the default `sample` slot
If the statistics are already stored under the attribute named `sample` (the default),
no `-d` flag is needed:
```bash
obidemerge unique.fasta > per_sample_default.fasta
```
**Expected output:** 7 sequences written to `per_sample_default.fasta`.
### Example 3 — write compressed output to a file
```bash
obidemerge -d sample -o per_sample.fasta.gz --compress unique.fasta
```
**Expected output:** 7 sequences written (compressed) to `per_sample.fasta.gz`.
### Example 4 — pipeline use: cluster, then demerge
Obtain unique sequences, cluster them, then expand the clusters back to individual
sample records for ecological analysis:
```bash
obiuniq -m sample reads.fastq \
| obiclean ... \
| obidemerge -d sample -o demerged.fasta
```
### Example 5 — process multiple input files
```bash
obidemerge -d sample run1_unique.fasta run2_unique.fasta > combined_demerged.fasta
```
**Expected output:** 6 sequences written to `combined_demerged.fasta`.
## SEE ALSO
`obiuniq(1)` — collapses identical sequences and records per-sample counts (the inverse operation)
`obiclean(1)` — removes PCR/sequencing artefacts from a set of unique sequences
`obiannotate(1)` — adds or modifies sequence attributes
`obigrep(1)` — filters sequences by attributes or sequence content
`obicount(1)` — counts sequences and total reads in a file
## NOTES
**Relationship to `obiuniq`.**
`obiuniq --merge sample` stores per-sample counts under an attribute named `merged_sample`.
When you later call `obidemerge`, you must therefore pass `-d sample` to match that
attribute name. The `-d` option takes the **logical** slot name (here `sample`), not the
internal storage name (`merged_sample`).
**Read counts after demerging.**
Each output sequence has its read count set to the value recorded in the statistics map for
that sample. If you sum the counts of all output sequences that share the same identifier,
you recover the total count of the original merged record.
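This invariant can be sketched with a toy statistics map (hypothetical values):

```python
# per-sample statistics stored on one merged record (hypothetical values)
merged_sample = {"sampleA": 5, "sampleB": 3, "sampleC": 1}

# obidemerge emits one copy per entry, each carrying that sample's count
demerged = [{"id": "seq001", "sample": s, "count": c}
            for s, c in merged_sample.items()]

# summing the per-sample counts recovers the merged record's total count
total = sum(r["count"] for r in demerged)
print(total)  # 9
```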
**Order of output sequences.**
The order in which the per-sample copies of a single merged sequence appear in the output
is not guaranteed. If a stable order is required, pipe the output through `obisort`.
## OUTPUT
`obidemerge` writes one sequence record per sample entry found in the statistics attribute.
Each output record is a copy of the input sequence, with:
- the statistics attribute (`merged_<slot>`) removed,
- the `<slot>` attribute set to the sample name,
- the `count` attribute set to the abundance for that sample.
Sequences with no statistics for the chosen slot are passed through unchanged.
## Observed output example
```
>seq001 {"count":5,"sample":"sampleA"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":3,"sample":"sampleB"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":1,"sample":"sampleC"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq002 {"count":2,"sample":"sampleA"}
ttggccaattggccaattggccaattggccaattggccaa
>seq002 {"count":7,"sample":"sampleD"}
ttggccaattggccaattggccaattggccaattggccaa
>seq003 {"count":4,"sample":"sampleB"}
gctagctagctagctagctagctagctagctagctagcta
>seq004 {"count":6}
aaaaccccggggttttaaaaccccggggttttaaaacccc
```
# NAME
obidistribute — distributes an input set of sequences into subsets
---
# SYNOPSIS
```
obidistribute --pattern|-p <string> [--append|-A] [--batch-mem <string>]
[--batch-size <int>] [--batch-size-max <int>]
[--batches|-n <int>] [--classifier|-c <string>] [--compress|-Z]
[--csv] [--debug] [--directory|-d <string>] [--ecopcr] [--embl]
[--fasta] [--fasta-output] [--fastq] [--fastq-output]
[--genbank] [--hash|-H <int>] [--help|-h|-?]
[--input-OBI-header] [--input-json-header] [--json-output]
[--max-cpu <int>] [--na-value <string>] [--no-order]
[--no-progressbar] [--out|-o <FILENAME>]
[--output-OBI-header|-O] [--output-json-header] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>]
[--silent-warning] [--skip-empty] [--solexa] [--u-to-t]
[--version] [<args>]
```
---
# DESCRIPTION
`obidistribute` splits a set of biological sequences into multiple output files according to one of three distribution strategies: annotation-based classification, round-robin batch assignment, or hash-based sharding.
The most common use case in metabarcoding is demultiplexing: sequences carry a tag annotation (e.g., `sample_id`) and `obidistribute` writes each sample's sequences into its own file. The output filename for each group is built from a user-supplied pattern containing `%s`, which is replaced by the classifier value or batch index.
When no classifier is specified, sequences can be split into a fixed number of batches (`--batches`) for parallel downstream processing, or sharded deterministically by hash (`--hash`) to ensure reproducible partitioning regardless of input order.
Output files can be organised into subdirectories (one per classifier value) using `--directory`, and existing files can be extended rather than overwritten with `--append`. Sequences lacking the classifier annotation are assigned to a file whose name uses the NA value (default: `"NA"`).
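Filename construction from the pattern can be sketched as follows (a minimal model of the `%s` substitution; pattern strings and values are hypothetical):

```python
def output_filename(pattern: str, value) -> str:
    # %s is replaced by the classifier value, batch index, or shard index
    return pattern % (value,)

print(output_filename("out_%s.fastq", "sampleA"))  # out_sampleA.fastq
print(output_filename("chunk_%s.fasta", 1))        # chunk_1.fasta
```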
---
# INPUT
`obidistribute` reads biological sequences from one or more files supplied as positional arguments, or from standard input when no files are given. All major NGS and flat-file formats are supported and auto-detected:
- FASTA / FASTQ (plain or gzip-compressed)
- GenBank and EMBL flat files
- ecoPCR output
- CSV
Format can be forced with `--fasta`, `--fastq`, `--embl`, `--genbank`, `--ecopcr`, or `--csv`. Header annotation style can be specified with `--input-OBI-header` or `--input-json-header`.
---
# OUTPUT
Each distribution group produces a separate output file named according to the `--pattern` template. The `%s` placeholder in the pattern is replaced by the classifier value, batch index, or hash shard index, depending on the chosen distribution mode.
Output format follows the same rules as other OBITools commands: FASTQ is used when quality scores are present, FASTA otherwise. The format can be forced with `--fasta-output`, `--fastq-output`, or `--json-output`. All annotations present in the input sequences are preserved in the output files.
When `--directory` is used together with `--classifier`, output files are placed in subdirectories named after the classifier values, allowing hierarchical organisation of results.
## Observed output example
```
@seq001 {"sample_id":"sampleA"}
atcgatcgatcgatcgatcg
+
IIIIIIIIIIIIIIIIIIII
@seq002 {"sample_id":"sampleA"}
gctagctagctagctagcta
+
IIIIIIIIIIIIIIIIIIII
@seq003 {"sample_id":"sampleA"}
ttagctaatcggtaatcggt
+
IIIIIIIIIIIIIIIIIIII
@seq009 {"sample_id":"sampleA"}
atgatgatgatgatgatgat
+
IIIIIIIIIIIIIIIIIIII
```
---
# OPTIONS
## Distribution mode
- **`--pattern|-p <string>`** — _(required)_
Default: none.
The template used to build output filenames. The variable part is represented by `%s`. Example: `toto_%s.fastq`.
- **`--classifier|-c <string>`**
Default: `""`.
The name of an annotation tag on the sequences. Sequences are dispatched into separate files based on the value of this tag. The tag value must be a string, integer, or boolean.
- **`--batches|-n <int>`**
Default: `0`.
Splits the input into exactly *N* batches by round-robin assignment, regardless of sequence metadata.
- **`--hash|-H <int>`**
Default: `0`.
Splits the input into at most *N* batches using a hash of the sequence. Produces deterministic, reproducible sharding.
- **`--directory|-d <string>`**
Default: `""`.
Used together with `--classifier`: organises output files into subdirectories named after classifier values.
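Deterministic hash sharding can be pictured as follows (a toy model; the hash function OBITools actually uses may differ):

```python
import hashlib

def shard(sequence: str, n_shards: int) -> int:
    # The shard index depends only on the sequence itself, so the same
    # sequence always lands in the same file, whatever the input order
    digest = hashlib.sha1(sequence.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

print(shard("acgtacgt", 4))  # always the same value, in the range 0..3
```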
## Output file handling
- **`--append|-A`**
Default: `false`.
Appends sequences to output files if they already exist, instead of overwriting them.
- **`--na-value <string>`**
Default: `"NA"`.
Value used as the filename component when a sequence does not have the classifier tag defined.
- **`--compress|-Z`**
Default: `false`.
Compresses all output files using gzip.
## Input format
- **`--fasta`**
Default: `false`.
Read data following the FASTA format.
- **`--fastq`**
Default: `false`.
Read data following the FASTQ format.
- **`--embl`**
Default: `false`.
Read data following the EMBL flatfile format.
- **`--genbank`**
Default: `false`.
Read data following the GenBank flatfile format.
- **`--ecopcr`**
Default: `false`.
Read data following the ecoPCR output format.
- **`--csv`**
Default: `false`.
Read data following the CSV format.
- **`--input-OBI-header`**
Default: `false`.
FASTA/FASTQ title line annotations follow OBI format.
- **`--input-json-header`**
Default: `false`.
FASTA/FASTQ title line annotations follow JSON format.
- **`--solexa`**
Default: `false`.
Decodes quality string according to the Solexa specification.
- **`--u-to-t`**
Default: `false`.
Convert Uracil to Thymine.
- **`--skip-empty`**
Default: `false`.
Sequences of length equal to zero are suppressed from the output.
- **`--no-order`**
Default: `false`.
When several input files are provided, indicates that there is no order among them.
## Output format
- **`--fasta-output`**
Default: `false`.
Write sequences in FASTA format (default if no quality data available).
- **`--fastq-output`**
Default: `false`.
Write sequences in FASTQ format (default if quality data available).
- **`--json-output`**
Default: `false`.
Write sequences in JSON format.
- **`--output-OBI-header|-O`**
Default: `false`.
Output FASTA/FASTQ title line annotations follow OBI format.
- **`--output-json-header`**
Default: `false`.
Output FASTA/FASTQ title line annotations follow JSON format.
- **`--out|-o <FILENAME>`**
Default: `"-"`.
Filename used for saving the output.
## Performance
- **`--max-cpu <int>`**
Default: `16`.
Number of parallel threads computing the result.
- **`--batch-size <int>`**
Default: `1`.
Minimum number of sequences per batch.
- **`--batch-size-max <int>`**
Default: `2000`.
Maximum number of sequences per batch.
- **`--batch-mem <string>`**
Default: `""` (128M).
Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable.
## Diagnostic & debug
- **`--debug`**
Default: `false`.
Enable debug mode, by setting log level to debug.
- **`--no-progressbar`**
Default: `false`.
Disable the progress bar printing.
- **`--silent-warning`**
Default: `false`.
Stop printing of warning messages.
- **`--pprof`**
Default: `false`.
Enable pprof server. Look at the log for details.
- **`--pprof-goroutine <int>`**
Default: `6060`.
Enable profiling of goroutine blocking profile.
- **`--pprof-mutex <int>`**
Default: `10`.
Enable profiling of mutex lock.
---
# EXAMPLES
```bash
# Demultiplex sequences by sample_id annotation into per-sample FASTQ files
obidistribute --classifier sample_id --pattern out_ex1_%s.fastq --no-progressbar --input-json-header reads.fastq
```
**Expected output:** 10 sequences written to 4 files: `out_ex1_sampleA.fastq` (4 sequences), `out_ex1_sampleB.fastq` (3 sequences), `out_ex1_sampleC.fastq` (2 sequences), `out_ex1_NA.fastq` (1 sequence).
```bash
# Demultiplex into subdirectories, one directory per sample
obidistribute --classifier sample_id --directory --pattern %s/reads.fastq reads.fastq
```
```bash
# Split a large dataset into 3 equal batches for parallel processing
obidistribute --batches 3 --pattern chunk_%s.fasta --fasta-output --no-progressbar sequences.fasta
```
**Expected output:** 10 sequences written to 3 files: `chunk_1.fasta` (4 sequences), `chunk_2.fasta` (3 sequences), `chunk_3.fasta` (3 sequences). Batch indices are 1-based.
```bash
# Hash-based sharding into 4 reproducible shards
obidistribute --hash 4 --pattern shard_%s.fastq --no-progressbar reads.fastq
```
**Expected output:** 10 sequences written to 4 files: `shard_0.fastq` through `shard_3.fastq`. Shard indices are 0-based.
```bash
# Append new sequences to existing per-sample files (incremental demultiplexing)
obidistribute --classifier sample_id --pattern samples_%s.fastq --append new_reads.fastq
```
```bash
# Demultiplex sequences, replacing the NA label for unclassified sequences
obidistribute --classifier sample_id --na-value unclassified --pattern out_ex6_%s.fastq --no-progressbar --input-json-header reads.fastq
```
**Expected output:** 10 sequences written to 4 files including `out_ex6_unclassified.fastq` (1 sequence without `sample_id` annotation).
---
# SEE ALSO
`obiconvert`, `obisplit`, `obigrep`
---
# NOTES
- Sequences that lack the annotation specified by `--classifier` are written to the file whose name is built using the `--na-value` (default: `"NA"`).
- The three distribution modes (`--classifier`, `--batches`, `--hash`) are mutually exclusive.
- When using `--directory` together with `--classifier`, subdirectories are created automatically if they do not exist.
- Batch indices produced by `--batches` are 1-based; hash shard indices produced by `--hash` are 0-based.
# obigrep(1) — OBITools4 Manual
## NAME
`obigrep` — select a subset of sequence records on various criteria
## SYNOPSIS
```
obigrep [OPTIONS] [FILE...]
```
## DESCRIPTION
`obigrep` filters a set of biological sequence records (in FASTA or FASTQ format) and writes only those matching all specified criteria to the output. Its name is modelled on the Unix `grep` command, but instead of filtering lines in a text file, it filters sequence records.
Filtering criteria can be combined freely: only sequence records satisfying **all** specified conditions are retained. The selection can be inverted with `--inverse-match` to keep the records that would otherwise be discarded.
Sequences are read from one or more files, or from standard input if no file is given. Results are written to standard output or to a file specified with `--out`. Records that do not pass the filters can optionally be saved to a separate file with `--save-discarded`.
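The selection logic can be sketched as follows (a toy model, not the actual implementation; option values are hypothetical):

```python
def keep(record, predicates, inverse=False):
    # a record is retained only if it satisfies ALL predicates;
    # --inverse-match flips the decision
    selected = all(p(record) for p in predicates)
    return selected != inverse

# toy predicates mirroring --min-length 20 and --min-count 2
predicates = [lambda r: len(r["seq"]) >= 20,
              lambda r: r.get("count", 1) >= 2]

record = {"seq": "acgt" * 6, "count": 3}
print(keep(record, predicates))        # True: passes both filters
print(keep(record, predicates, True))  # False: --inverse-match discards it
```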
## INPUT FORMATS
`obigrep` recognises the following input formats automatically. A specific format can be forced with the corresponding flag:
| Flag | Format |
|------|--------|
| `--fasta` | FASTA |
| `--fastq` | FASTQ |
| `--embl` | EMBL flat file |
| `--genbank` | GenBank flat file |
| `--ecopcr` | ecoPCR output |
| `--csv` | CSV tabular format |
Header annotation styles can be selected with `--input-OBI-header` (OBITools format) or `--input-json-header` (JSON format).
## OUTPUT FORMATS
By default, the output format matches the input format (FASTQ when quality scores are present, FASTA otherwise). The format can be forced:
- `--fasta-output` — write FASTA
- `--fastq-output` — write FASTQ
- `--json-output` — write JSON
- `--output-OBI-header` / `-O` — annotate FASTA/FASTQ title lines in OBITools format
- `--output-json-header` — annotate FASTA/FASTQ title lines in JSON format
- `--compress` / `-Z` — compress output with gzip
Use `--out FILE` / `-o FILE` to write results to a file instead of standard output.
## FILTERING OPTIONS
### By sequence length
- `--min-length LENGTH` / `-l LENGTH`
Keep only sequences at least *LENGTH* bases long.
- `--max-length LENGTH` / `-L LENGTH`
Keep only sequences at most *LENGTH* bases long.
### By read abundance
Sequence records can carry a `count` attribute recording how many times the sequence was observed. The following options filter on that count:
- `--min-count COUNT` / `-c COUNT`
Keep only sequences observed at least *COUNT* times (default: 1).
- `--max-count COUNT` / `-C COUNT`
Keep only sequences observed at most *COUNT* times.
### By sequence pattern
- `--sequence PATTERN` / `-s PATTERN`
Keep records whose nucleotide sequence matches the regular expression *PATTERN* (case-insensitive). This option can be repeated; all patterns must match.
- `--approx-pattern PATTERN`
Keep records whose sequence contains an approximate match to *PATTERN*. The number of allowed differences is controlled by `--pattern-error`. This option can be repeated.
- `--pattern-error N`
Maximum number of mismatches (or indels, if `--allows-indels` is set) tolerated when using `--approx-pattern` (default: 0, i.e. exact match).
- `--allows-indels`
Allow insertions and deletions (in addition to substitutions) when performing approximate pattern matching.
- `--only-forward`
Search patterns on the forward strand only. By default both strands are searched.
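The mismatch-counting semantics behind `--approx-pattern` / `--pattern-error` can be illustrated with a small Python sketch. This is not obigrep's matcher — it only mirrors the documented behaviour (substitutions only, as without `--allows-indels`), and it expands just the IUPAC codes needed for the example:

```python
# Sketch of approximate matching with --pattern-error N: a window of the
# sequence matches when at most N positions disagree with the pattern.
# Only the IUPAC codes used below are listed; this is an illustration,
# not obigrep's actual matching engine.
IUPAC = {"a": "a", "c": "c", "g": "g", "t": "t", "w": "at", "y": "ct"}

def mismatches(pattern, window):
    # A position matches when the sequence base belongs to the set of
    # bases denoted by the (possibly ambiguous) pattern character.
    return sum(base not in IUPAC[p] for p, base in zip(pattern, window))

def approx_search(pattern, seq, max_err):
    # Slide the pattern along the sequence; report whether any window
    # stays within the allowed number of mismatches.
    return any(
        mismatches(pattern, seq[i:i + len(pattern)]) <= max_err
        for i in range(len(seq) - len(pattern) + 1)
    )

print(approx_search("ggwcc", "aaggtccaa", 0))  # True: W matches T exactly
print(approx_search("ggwcc", "aaggggcaa", 1))  # False: best window has 2 mismatches
```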
### By identifier or definition
- `--identifier PATTERN` / `-I PATTERN`
Keep records whose identifier matches the regular expression *PATTERN* (case-insensitive). Can be repeated.
- `--id-list FILENAME`
Keep only records whose identifier appears in *FILENAME*, a plain-text file with one identifier per line.
- `--definition PATTERN` / `-D PATTERN`
Keep records whose definition line matches the regular expression *PATTERN* (case-insensitive). Can be repeated.
### By attribute (metadata)
Sequence records can carry arbitrary key/value annotations:
- `--has-attribute KEY` / `-A KEY`
Keep records that possess an attribute named *KEY*, regardless of its value. Can be repeated.
- `--attribute KEY=PATTERN` / `-a KEY=PATTERN`
Keep records for which the value of attribute *KEY* matches the regular expression *PATTERN* (case-sensitive). Can be repeated; all constraints must be satisfied.
### By custom boolean expression
- `--predicate EXPRESSION` / `-p EXPRESSION`
Keep records for which the boolean expression *EXPRESSION* evaluates to true. Attributes are accessed via the `annotations` map (e.g. `annotations["count"]`). The special variable `sequence` refers to the sequence object; its length can be obtained with `len(sequence)`. Can be repeated; all expressions must be true.
Example: `-p 'annotations["count"] >= 10 && len(sequence) < 200'`
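The combination rule — every `--predicate` expression must hold for a record to be kept — can be sketched in Python (the names `annotations` and `sequence` mirror the variables exposed to the expression language; this is not obigrep's expression engine):

```python
# Sketch of --predicate semantics: a record is kept only when ALL
# supplied expressions evaluate to true. Illustration only.
records = [
    {"annotations": {"count": 15}, "sequence": "acgt" * 30},   # length 120
    {"annotations": {"count": 3},  "sequence": "acgt" * 30},   # count too low
    {"annotations": {"count": 40}, "sequence": "acgt" * 60},   # too long
]

# Equivalent of: -p 'annotations["count"] >= 10 && len(sequence) < 200'
predicates = [
    lambda r: r["annotations"]["count"] >= 10,
    lambda r: len(r["sequence"]) < 200,
]

kept = [r for r in records if all(p(r) for p in predicates)]
print(len(kept))  # prints 1: only the first record satisfies both
```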
### By taxonomy
Taxonomy-based filtering requires a taxonomy database to be provided with `--taxonomy`.
- `--taxonomy PATH` / `-t PATH`
Path to the taxonomy database.
- `--restrict-to-taxon TAXID` / `-r TAXID`
Keep only records whose taxon belongs to the lineage of *TAXID* (i.e. is *TAXID* itself or a descendant). Can be repeated; sequences must satisfy at least one of the provided taxids.
- `--ignore-taxon TAXID` / `-i TAXID`
Discard records whose taxon belongs to the lineage of *TAXID*. Can be repeated.
- `--valid-taxid`
Keep only records that carry a valid, recognised taxonomic identifier.
- `--require-rank RANK_NAME`
Keep only records whose taxon has a defined ancestor at the given rank (e.g. *species*, *genus*, *family*). Can be repeated.
- `--update-taxid`
Automatically update merged taxids to their current valid equivalent.
- `--fail-on-taxonomy`
Exit with an error if a taxid referenced in the data is not valid.
- `--with-leaves`
When the taxonomy is extracted from a sequence file, attach each sequence as a leaf node under its annotated taxid.
- `--raw-taxid`
Print taxids in output files without supplementary information (taxon name and rank).
### Inversion
- `--inverse-match` / `-v`
Invert the selection: output the records that would otherwise be discarded.
## PAIRED-END OPTIONS
When paired-end sequencing data are provided (forward and reverse reads stored in two files), `obigrep` can apply filters taking both reads into account.
- `--paired-with FILENAME`
File containing the reverse (paired) reads.
- `--paired-mode MODE`
How to combine the filter result from the forward and reverse reads. *MODE* is one of:
| Mode | Meaning |
|------|---------|
| `forward` | Keep the pair if the **forward** read passes (default) |
| `reverse` | Keep the pair if the **reverse** read passes |
| `and` | Keep the pair if **both** reads pass |
| `or` | Keep the pair if **at least one** read passes |
| `andnot` | Keep the pair if the **forward** passes and the **reverse** does not |
| `xor` | Keep the pair if **exactly one** read passes |
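The six modes reduce to simple boolean combinations of the two per-read filter results, which can be written out explicitly (a Python sketch of the table above, not the tool's code):

```python
# Decision rules for --paired-mode: fwd and rev are the boolean filter
# results for the forward and reverse read; the result decides whether
# the pair is kept.
PAIRED_MODES = {
    "forward": lambda fwd, rev: fwd,
    "reverse": lambda fwd, rev: rev,
    "and":     lambda fwd, rev: fwd and rev,
    "or":      lambda fwd, rev: fwd or rev,
    "andnot":  lambda fwd, rev: fwd and not rev,
    "xor":     lambda fwd, rev: fwd != rev,
}

# A pair where only the forward read passes the filter:
print(PAIRED_MODES["forward"](True, False))  # True
print(PAIRED_MODES["and"](True, False))      # False
print(PAIRED_MODES["xor"](True, False))      # True
```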
## OUTPUT CONTROL
- `--save-discarded FILENAME`
Write sequence records that do **not** pass the filters to *FILENAME*.
- `--out FILENAME` / `-o FILENAME`
Write the selected records to *FILENAME* (default: standard output).
- `--skip-empty`
Suppress sequences of length zero from the output.
## PERFORMANCE OPTIONS
- `--max-cpu N`
Number of parallel processing threads (default: number of available CPUs).
- `--batch-size N`
Minimum number of sequences per processing batch (default: 1).
- `--batch-size-max N`
Maximum number of sequences per processing batch (default: 2000).
- `--batch-mem SIZE`
Maximum memory per batch (e.g. `128M`, `1G`). Overrides `--batch-size-max` when memory is the limiting factor. Can also be set via the environment variable `OBIBATCHMEM`.
- `--no-order`
When multiple input files are provided, indicates that no ordering is assumed between them, which can improve throughput.
- `--no-progressbar`
Disable the progress bar.
## MISCELLANEOUS OPTIONS
- `--u-to-t`
Convert uracil (U) to thymine (T) in all sequences (useful for RNA data).
- `--solexa`
Decode quality scores according to the legacy Solexa specification instead of the standard Phred encoding.
- `--silent-warning`
Suppress warning messages.
- `--debug`
Enable verbose debug logging.
- `--version`
Print version information and exit.
- `--help` / `-h` / `-?`
Display the help message and exit.
## EXAMPLES
Keep all sequences longer than 100 bases:
```bash
obigrep --min-length 100 input.fasta > out_min_length.fasta
```
**Expected output:** 6 sequences written to `out_min_length.fasta`.
Select sequences observed at least 10 times:
```bash
obigrep --min-count 10 input.fasta > out_min_count.fasta
```
**Expected output:** 4 sequences written to `out_min_count.fasta`.
Keep sequences whose identifier starts with `BOLD`:
```bash
obigrep --identifier '^BOLD' input.fasta > out_bold.fasta
```
**Expected output:** 2 sequences written to `out_bold.fasta`.
Select only sequences carrying the IUPAC primer motif `GGGCWATGTTTCATAAYGGG` with up to 2 mismatches:
```bash
obigrep --approx-pattern GGGCWATGTTTCATAAYGGG --pattern-error 2 input.fasta > out_primer.fasta
```
**Expected output:** 2 sequences written to `out_primer.fasta`.
Retain sequences belonging to the genus *Homo* (taxid 9605) in an NCBI taxonomy:
```bash
obigrep --taxonomy /data/ncbi_tax --restrict-to-taxon 9605 input.fasta
```
Keep sequences that have a `sample` attribute equal to `lake1` and save the rest to a separate file:
```bash
obigrep --attribute sample='^lake1$' --save-discarded discarded.fasta \
input.fasta > lake1.fasta
```
**Expected output:** 5 sequences written to `lake1.fasta`, 5 sequences written to `discarded.fasta`.
Invert a length filter (discard sequences shorter than 50 bases):
```bash
obigrep --min-length 50 --inverse-match input.fasta > out_short.fasta
```
**Expected output:** 1 sequence written to `out_short.fasta`.
Apply a custom predicate (sequences with count ≥ 5):
```bash
obigrep -p 'annotations["count"] >= 5' input.fasta > out_predicate.fasta
```
**Expected output:** 6 sequences written to `out_predicate.fasta`.
## OUTPUT
### Attribute table
Attributes present on sequence records are preserved unchanged in the output. No new attributes are added by `obigrep` itself — only filtering occurs.
| Attribute | Type | Description |
|-----------|------|-------------|
| `count` | integer | Number of times the sequence was observed (read from input) |
| `sample` | string | Sample identifier (read from input) |
Any other annotations present in the input are carried through to the output unmodified.
### Observed output example
```
>seq001 {"count":15,"sample":"lake1"}
acgtacgtacgtacgtacgtgggcaatgtttcataatgggacgtacgtacgtacgtacgt
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
acgtacgtacgtacgtacgtacgtacgtacgt
>seq002 {"count":3,"sample":"lake1"}
tgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgca
tgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgca
>seq004 {"count":2,"sample":"lake1"}
aaacccgggtttagctagctagctagctagctagctagctagctagctagctagctagct
agctagctagctagctagctagctagctagctagctagctagctagctagctagctagct
atacgtatcgatcg
>BOLD_005 {"count":8,"sample":"pond1"}
cgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgat
cgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>seq008 {"count":7,"sample":"river2"}
ttacgatcgatcgatcgatcgggcaatgtttcataaggggacgatcgatcgatcgatcga
tcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgat
```
## SEE ALSO
`obiannotate`(1), `obiuniq`(1), `obiconvert`(1), `obitag`(1), `obisplit`(1)
## OBITools4
`obigrep` is part of the **OBITools4** suite for analysing DNA metabarcoding and environmental DNA data.
# NAME
obijoin — merge annotations from one file into another
---
# SYNOPSIS
```
obijoin --join-with|-j <string> [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--by|-b <string>]... [--compress|-Z]
[--csv] [--debug] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
[--fasta-output] [--fastq] [--fastq-output] [--genbank]
[--help|-h|-?] [--input-OBI-header] [--input-json-header]
[--json-output] [--max-cpu <int>] [--no-order] [--no-progressbar]
[--out|-o <FILENAME>] [--output-OBI-header|-O] [--output-json-header]
[--pprof] [--pprof-goroutine <int>] [--pprof-mutex <int>]
[--raw-taxid] [--silent-warning] [--skip-empty] [--solexa]
[--taxonomy|-t <string>] [--u-to-t] [--update-id|-i]
[--update-quality|-q] [--update-sequence|-s] [--update-taxid]
[--version] [--with-leaves] [<args>]
```
---
# DESCRIPTION
`obijoin` merges annotations from a secondary file into a primary sequence dataset. For each sequence in the primary input, it looks up matching records in the secondary file based on one or more shared attribute keys, then copies all annotations from matched partner records onto the primary sequence.
The join is a **left outer join**: every sequence in the primary dataset is preserved in the output, whether or not a match is found in the secondary file. Unmatched sequences simply receive no additional annotations. Key matching is exact string equality.
A common use case is enriching amplicon or read sequences with external sample metadata. The secondary file (the *annotation source*) can be a FASTA/FASTQ sequence file, a CSV table, an EMBL or GenBank flat file, or any other format accepted by OBITools4. This makes it straightforward to prepare a simple spreadsheet with sample identifiers and metadata columns, save it as CSV, and merge it directly into a sequence dataset: the CSV format is auto-detected, so no format conversion or extra flag is needed.
In addition to transferring annotations, `obijoin` can optionally replace the sequence identifier, nucleotide sequence, or quality scores of each primary sequence with values from its matched partner, controlled by the `--update-id`, `--update-sequence`, and `--update-quality` flags.
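The merge semantics described above — left outer join, exact key equality, partner values overwriting primary ones — can be sketched in a few lines of Python (field names here are illustrative, not a fixed obijoin schema):

```python
# Sketch of obijoin's left-outer-join semantics: every primary record is
# emitted; annotations from a matching secondary record are merged in,
# with secondary values overwriting primary ones on key collision.
primary = [
    {"id": "seq001", "sample": "S1"},
    {"id": "seq003", "sample": "S3"},          # no partner in `secondary`
]
secondary = {"S1": {"location": "Paris", "sample": "S1"}}

def join(rec, table, key="sample"):
    partner = table.get(rec.get(key))          # exact string equality
    return {**rec, **partner} if partner else dict(rec)

out = [join(r, secondary) for r in primary]
print(out[0]["location"])       # annotation copied from the partner: Paris
print("location" in out[1])     # unmatched record kept, unchanged: False
```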
---
# INPUT
`obijoin` accepts a primary sequence dataset on standard input or as one or more file arguments. The supported formats are automatically detected and include FASTA, FASTQ, EMBL, GenBank, ecoPCR output, CSV, and JSON. Format-specific flags (`--fasta`, `--fastq`, `--embl`, `--genbank`, `--ecopcr`, `--csv`) can force a specific parser when auto-detection is ambiguous.
The secondary file, supplied via `--join-with`, is loaded entirely into memory before processing begins, and supports the same set of formats, including CSV; the format is auto-detected.
When multiple primary input files are provided and their ordering across files is irrelevant, `--no-order` allows the reader to return batches in whichever order they complete, improving throughput.
---
# OUTPUT
The output is a sequence file in FASTA or FASTQ format (determined automatically by the presence of quality data), written to standard output or to the file specified by `--out`. Alternative output formats can be requested with `--fasta-output`, `--fastq-output`, or `--json-output`. The output can be gzip-compressed with `--compress`.
Each output sequence carries all annotations from the primary dataset, enriched with every annotation attribute copied from the matched partner record. If a field name exists in both, the partner value overwrites the primary value. When `--update-id`, `--update-sequence`, or `--update-quality` are set, the corresponding sequence-level fields are also replaced with the partner's values.
## Observed output example
```
>seq001 {"barcode":"ATGC","experiment":"amplicon_run1","location":"Paris","sample":"S1"}
atgcatgcatgcatgcatgc
>seq002 {"barcode":"GCTA","experiment":"amplicon_run2","location":"Lyon","sample":"S2"}
gctagctagctagctagcta
>seq003 {"barcode":"TTTT","sample":"S3"}
tttttttttttttttttttt
>seq004 {"barcode":"ATGC","experiment":"amplicon_run1","location":"Paris","sample":"S1"}
aaaaatttttcccccggggg
>seq005 {"barcode":"GCTA","experiment":"amplicon_run2","location":"Lyon","sample":"S2"}
gggggaaaaatttttccccc
>seq006 {"barcode":"AAAA","sample":"S4"}
ccccccgggggtttttaaaaa
```
---
# OPTIONS
## Required
`--join-with|-j <string>`
: Path to the secondary file whose records are joined onto the primary sequences. This parameter is mandatory. The file can be in any format accepted by OBITools4 (FASTA, FASTQ, CSV, EMBL, GenBank, ecoPCR); the format is auto-detected. Default: none.
## Join control
`--by|-b <string>`
: Declares a join key as an attribute name or a `primary_attr=secondary_attr` mapping. Repeat the flag to join on multiple keys simultaneously; all keys must match for a record pair to be considered a hit (intersection semantics). When omitted, the join defaults to matching by sequence identifier (`id`). Default: `[]`.
`--update-id|-i`
: Replace the identifier of each primary sequence with the identifier from its matched partner record. Default: `false`.
`--update-sequence|-s`
: Replace the nucleotide or amino acid sequence of each primary sequence with the sequence from its matched partner. Default: `false`.
`--update-quality|-q`
: Replace the per-base quality scores of each primary sequence with the quality scores from its matched partner. Relevant only when both datasets carry quality information (FASTQ). Default: `false`.
## Input format
`--csv`
: Read the primary input data in OBITools CSV format (e.g., sequences exported by `obicsv`). This flag applies to the primary input only; secondary files supplied via `--join-with` are always auto-detected. Default: `false`.
`--ecopcr`
: Read data following the ecoPCR output format. Default: `false`.
`--embl`
: Read data following the EMBL flatfile format. Default: `false`.
`--fasta`
: Read data following the FASTA format. Default: `false`.
`--fastq`
: Read data following the FASTQ format. Default: `false`.
`--genbank`
: Read data following the GenBank flatfile format. Default: `false`.
`--input-OBI-header`
: Treat FASTA/FASTQ title line annotations as OBI format. Default: `false`.
`--input-json-header`
: Treat FASTA/FASTQ title line annotations as JSON format. Default: `false`.
`--solexa`
: Decode the quality string according to the Solexa specification. Default: `false`.
`--u-to-t`
: Convert uracil (U) to thymine (T) in input sequences. Default: `false`.
`--skip-empty`
: Suppress sequences of length zero from the output. Default: `false`.
`--no-order`
: When several input files are provided, indicates that there is no order among them. Default: `false`.
## Output format
`--out|-o <FILENAME>`
: Filename used for saving the output. Default: `-` (standard output).
`--fasta-output`
: Write sequences in FASTA format (default when no quality data are available). Default: `false`.
`--fastq-output`
: Write sequences in FASTQ format (default when quality data are available). Default: `false`.
`--json-output`
: Write sequences in JSON format. Default: `false`.
`--output-OBI-header|-O`
: Output FASTA/FASTQ title line annotations in OBI format. Default: `false`.
`--output-json-header`
: Output FASTA/FASTQ title line annotations in JSON format. Default: `false`.
`--compress|-Z`
: Compress the output using gzip. Default: `false`.
## Taxonomy
`--taxonomy|-t <string>`
: Path to the taxonomy database. Default: `""`.
`--fail-on-taxonomy`
: Cause `obijoin` to fail with an error if a taxid encountered is not currently valid. Default: `false`.
`--raw-taxid`
: Print taxids in files without supplementary information (taxon name and rank). Default: `false`.
`--update-taxid`
: Automatically update taxids that are declared as merged to a newer one. Default: `false`.
`--with-leaves`
: When taxonomy is extracted from a sequence file, add sequences as leaves of their taxid annotation. Default: `false`.
## Performance
`--max-cpu <int>`
: Number of parallel threads used to compute the result. Default: `16`.
`--batch-size <int>`
: Minimum number of sequences per processing batch. Default: `1`.
`--batch-size-max <int>`
: Maximum number of sequences per processing batch. Default: `2000`.
`--batch-mem <string>`
: Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable. Default: `128M`.
## Diagnostics
`--no-progressbar`
: Disable the progress bar. Default: `false`.
`--silent-warning`
: Stop printing warning messages. Default: `false`.
`--debug`
: Enable debug mode by setting the log level to debug. Default: `false`.
---
# EXAMPLES
```bash
# Annotate amplicon sequences with sample metadata from a CSV table,
# matching on the sample attribute. CSV format is auto-detected.
obijoin --join-with metadata.csv --by sample input.fasta > out_basic.fasta
```
**Expected output:** 6 sequences written to `out_basic.fasta`.
```bash
# Join using a cross-attribute key: primary sequences have a 'sample' attribute,
# while the annotation CSV uses 'well' for the same identifier.
obijoin --join-with well_metadata.csv --by sample=well input.fasta > out_crosskey.fasta
```
**Expected output:** 6 sequences written to `out_crosskey.fasta`.
```bash
# Join on two keys simultaneously: match only when both sample and barcode agree,
# then update sequence identifiers with those from the reference file.
obijoin --join-with references.fasta \
--by sample --by barcode \
--update-id \
input.fasta > out_multikey.fasta
```
**Expected output:** 6 sequences written to `out_multikey.fasta`.
```bash
# Replace sequences and quality scores of reads with values from a corrected FASTQ file,
# joining by sequence ID (default when no --by is specified).
obijoin --join-with corrected.fastq \
--update-sequence --update-quality \
input.fastq > out_updated.fastq
```
**Expected output:** 3 sequences written to `out_updated.fastq`.
```bash
# Use an OBITools CSV file as primary input (--csv flag), join with a metadata CSV,
# then write compressed FASTA output without showing the progress bar.
obijoin --join-with metadata.csv --by sample \
--csv --fasta-output --compress \
--no-progressbar \
--out out_compressed.fasta.gz \
primary.csv
```
**Expected output:** 3 sequences written to `out_compressed.fasta.gz`.
---
# NOTES
- The secondary file supplied via `--join-with` is loaded entirely into memory before the join begins. For very large secondary files this may require significant RAM.
- Key matching is based on exact string equality; no regular expression or fuzzy matching is applied.
- The join is a left outer join: primary sequences without a matching partner in the secondary file are still emitted, unchanged, in the output.
- When the annotation source is a plain CSV spreadsheet (columns = attributes, rows = records), the format is auto-detected — no `--csv` flag is needed. The `--csv` flag applies exclusively to the primary input and is intended for sequences stored in OBITools CSV format.
# NAME
obimicrosat — looks for microsatellite sequences in a sequence file
---
# SYNOPSIS
```
obimicrosat [options] [<filename>...]
```
---
# DESCRIPTION
`obimicrosat` scans DNA sequences for simple sequence repeats (SSRs), also called
microsatellites — tandem repetitions of a short motif (1–6 bp by default). For each
sequence containing a qualifying repeat, the command annotates it with the location,
unit sequence, repeat count, and flanking regions, then writes it to output. Sequences
with no detected microsatellite are silently discarded.
The detection works in two passes. A first regular expression finds any tandem repeat
satisfying the unit-length and repeat-count constraints. The true minimal repeat unit
is then determined, and a second scan refines the exact boundaries. The repeat unit is
normalized to its lexicographically smallest rotation across all rotations and its
reverse complement, which allows equivalent loci to be grouped consistently across
samples.
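The two-pass idea can be sketched in Python. This is only an illustration of the described behaviour, not obimicrosat's code: a broad regex finds a tandem repeat whose 1–6 bp unit occurs at least 5 times, and the unit is then normalized to the lexicographically smallest rotation over the unit and its reverse complement:

```python
import re

# Canonical form of a repeat unit: smallest string among all rotations
# of the unit and all rotations of its reverse complement.
COMP = str.maketrans("acgt", "tgca")

def canonical_unit(unit):
    rc = unit.translate(COMP)[::-1]
    return min(u[i:] + u[:i] for u in (unit, rc) for i in range(len(u)))

# First-pass search: a 1-6 bp unit repeated at least 5 times in total.
tandem = re.compile(r"(.{1,6}?)\1{4,}")
m = tandem.search("agtc" + "ac" * 16 + "cgat")
print(m.group(1))                  # raw unit as first matched: "ca"
print(canonical_unit(m.group(1)))  # canonical form: "ac"
print(canonical_unit("gt"))        # a GT repeat also groups under "ac"
```

Note how the raw regex match starts one base early and reports `ca`; the normalization step is what makes equivalent loci (AC, CA, GT, TG) comparable across samples and strands.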
By default, when the canonical form of a unit requires the reverse complement, the
whole sequence is reoriented so that the microsatellite is always reported on the
direct strand of the normalized unit. This behaviour can be disabled with
`--not-reoriented`.
A common use case is identifying polymorphic SSR markers for population genetics, or
flagging repeat-rich regions before designing PCR primers.
---
# INPUT
Accepts one or more sequence files on the command line. If no file is given, sequences
are read from standard input. Supported formats include FASTA, FASTQ, JSON/OBI, GenBank,
EMBL, ecoPCR output, and CSV. Compressed files (gzip) are handled transparently.
Format is detected automatically unless overridden by input flags.
---
# OUTPUT
Outputs only the sequences in which a microsatellite was found. Each retained sequence
carries the following additional attributes:
| Attribute | Content |
|---|---|
| `microsat` | Full repeat region as a string |
| `microsat_from` | 1-based start position of the repeat |
| `microsat_to` | End position of the repeat (inclusive) |
| `microsat_unit` | Repeat unit as observed in the sequence |
| `microsat_unit_normalized` | Lexicographically smallest canonical form |
| `microsat_unit_orientation` | `direct` or `reverse` |
| `microsat_unit_length` | Length of the repeat unit (bp) |
| `microsat_unit_count` | Number of complete unit repetitions |
| `seq_length` | Total length of the (possibly reoriented) sequence |
| `microsat_left` | Flanking sequence to the left of the repeat |
| `microsat_right` | Flanking sequence to the right of the repeat |
When a sequence is reoriented (reverse-complemented), `_cmp` is appended to its
identifier.
The output format follows the same rules as the rest of OBITools4: FASTQ when quality
scores are present, FASTA or JSON/OBI otherwise, configurable via output flags.
## Observed output example
```
>seq001 {"definition":"dinucleotide AC repeat 16x with 40bp non-repetitive flanks","microsat":"acacacacacacacacacacacacacacacac","microsat_from":40,"microsat_left":"agtcgaacttgcatgccttcagggcaagtctagcttacg","microsat_right":"cgatagtcatgcaagtcttgcggcatagatcgttacca","microsat_to":71,"microsat_unit":"ac","microsat_unit_count":16,"microsat_unit_length":2,"microsat_unit_normalized":"ac","microsat_unit_orientation":"direct","seq_length":109}
agtcgaacttgcatgccttcagggcaagtctagcttacgacacacacacacacacacaca
cacacacacaccgatagtcatgcaagtcttgcggcatagatcgttacca
>seq006_cmp {"definition":"GT repeat 16x with 40bp non-repetitive flanks canonical form is AC","microsat":"acacacacacacacacacacacacacacacac","microsat_from":39,"microsat_left":"tggtaacgatctatgccgcaagacttgcatgactatcg","microsat_right":"cgtaagctagacttgccctgaaggcatgcaagttcgact","microsat_to":70,"microsat_unit":"ac","microsat_unit_count":16,"microsat_unit_length":2,"microsat_unit_normalized":"ac","microsat_unit_orientation":"reverse","seq_length":109}
tggtaacgatctatgccgcaagacttgcatgactatcgacacacacacacacacacacac
acacacacaccgtaagctagacttgccctgaaggcatgcaagttcgact
```
---
# OPTIONS
## Microsatellite detection
**`--min-unit-length` / `-m`**
- Default: `1`
- Minimum length in base pairs of the repeated motif. Set to `2` to exclude
  mononucleotide repeats, `3` to exclude both mono- and dinucleotide repeats, and so on.
**`--max-unit-length` / `-M`**
- Default: `6`
- Maximum length in base pairs of the repeated motif. Increasing this value detects
longer repeat units (minisatellites) at the cost of more complex patterns.
**`--min-unit-count`**
- Default: `5`
- Minimum number of times the motif must be repeated. A value of `5` with a 2 bp unit
requires at least 10 bp of pure repeat.
**`--min-length` / `-l`**
- Default: `20`
- Minimum total length (in bp) of the repeat region. This filter applies after the
unit-count filter and is useful to exclude very short but technically qualifying
repeats.
**`--min-flank-length` / `-f`**
- Default: `0`
- Minimum length of the flanking sequence on each side of the repeat. Sequences with
flanks shorter than this threshold are discarded, which is useful when the output
will feed a primer-design step.
**`--not-reoriented` / `-n`**
- Default: `false` (reorientation is active by default)
- When set, sequences are never reverse-complemented to match the canonical orientation
of the repeat unit. The microsatellite is reported as found, in its original
orientation.
## Input / output
Inherited from the standard OBITools4 conversion layer. Common flags include:
**`--input-OBI-header`** — parse OBI-style FASTA/FASTQ headers.
**`--input-json-header`** — parse JSON-encoded headers.
**`--skip-empty`** — skip sequences with no nucleotides.
**`--u-to-t`** — convert U to T (RNA → DNA).
**`--output-json-header`** — write JSON-encoded headers.
**`--output-obi-header`** — write OBI-style headers.
**`--gzip`** — compress output with gzip.
**`--workers` / `-p`** — number of parallel processing workers.
---
# EXAMPLES
```bash
# Detect microsatellites with the default settings (unit 1–6 bp, ≥5 repeats, ≥20 bp total)
obimicrosat sequences.fasta > out_default.fasta
```
**Expected output:** 6 sequences written to `out_default.fasta`.
```bash
# Restrict to di- and trinucleotide repeats only
obimicrosat -m 2 -M 3 sequences.fasta > out_dinucleotide.fasta
```
**Expected output:** 4 sequences written to `out_dinucleotide.fasta`
(mononucleotide and tetranucleotide repeats excluded).
```bash
# Require at least 30 bp flanking sequence on each side (for primer design)
obimicrosat -f 30 sequences.fasta > out_primer_ready.fasta
```
**Expected output:** 3 sequences written to `out_primer_ready.fasta`
(sequences with flanks shorter than 30 bp are discarded).
```bash
# Keep sequences in their original orientation (no reverse-complement)
obimicrosat --not-reoriented sequences.fasta > out_no_reorient.fasta
```
**Expected output:** 6 sequences written to `out_no_reorient.fasta`
(GT-repeat sequence kept as-is without `_cmp` suffix; `microsat_unit_orientation` is `reverse`).
```bash
# Require at least 8 repeat units and a minimum repeat length of 30 bp
obimicrosat --min-unit-count 8 -l 30 sequences.fasta > out_strict.fasta
```
**Expected output:** 4 sequences written to `out_strict.fasta`
(short or low-count repeats excluded).
---
# SEE ALSO
`obigrep` — filter sequences by annotation after microsatellite detection.
`obiannotate` — add or modify sequence annotations.
`obiconvert` — format conversion for sequence files.
---
# NOTES
- Only sequences with at least one qualifying microsatellite are written to output;
all others are silently filtered out.
- The normalization algorithm considers all rotations of the unit and their reverse
complements, selecting the lexicographically smallest string. This ensures consistent
grouping of loci regardless of which strand was sequenced.
- When reorientation is active (the default), sequences whose canonical unit falls on
the reverse strand are reverse-complemented and their ID receives the suffix `_cmp`.
Coordinate attributes (`microsat_from`, `microsat_to`) always refer to the
(possibly reoriented) output sequence.
- Repetitive low-complexity sequences may match multiple overlapping patterns; only the
first match is reported per sequence.
- Flanking sequences must be **non-repetitive** to avoid the tool detecting a tandem
repeat within the flank instead of the intended SSR. When designing synthetic test
data, ensure flanking regions do not contain tandem repeat motifs of their own.
# NAME
obiscript — executes a Lua script on the input sequences
---
# SYNOPSIS
```
obiscript [--allows-indels] [--approx-pattern <PATTERN>]...
[--attribute|-a <KEY=VALUE>]... [--batch-mem <string>]
[--batch-size <int>] [--batch-size-max <int>] [--compress|-Z]
[--csv] [--debug] [--definition|-D <PATTERN>]... [--ecopcr]
[--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq]
[--fastq-output] [--genbank] [--has-attribute|-A <KEY>]...
[--help|-h|-?] [--id-list <FILENAME>]
[--identifier|-I <PATTERN>]... [--ignore-taxon|-i <TAXID>]...
[--input-OBI-header] [--input-json-header] [--inverse-match|-v]
[--json-output] [--max-count|-C <COUNT>] [--max-cpu <int>]
[--max-length|-L <LENGTH>] [--min-count|-c <COUNT>]
[--min-length|-l <LENGTH>] [--no-order] [--no-progressbar]
[--only-forward] [--out|-o <FILENAME>] [--output-OBI-header|-O]
[--output-json-header]
[--paired-mode <forward|reverse|and|or|andnot|xor>]
[--pattern-error <int>] [--pprof] [--pprof-goroutine <int>]
[--pprof-mutex <int>] [--predicate|-p <EXPRESSION>]...
[--raw-taxid] [--require-rank <RANK_NAME>]...
[--restrict-to-taxon|-r <TAXID>]... [--script|-S <string>]
[--sequence|-s <PATTERN>]... [--silent-warning] [--skip-empty]
[--solexa] [--taxonomy|-t <string>] [--template] [--u-to-t]
[--update-taxid] [--valid-taxid] [--version] [--with-leaves]
[<args>]
```
---
# DESCRIPTION
`obiscript` applies a user-provided Lua script to a stream of biological sequences. For each input sequence record, the script's `worker(sequence)` function is called, giving the user full programmatic access to the sequence's identifier, data, quality scores, and metadata attributes. This makes it possible to implement custom annotation logic, computed filters, or record transformations that go beyond what fixed-function OBITools commands offer.
The Lua script may also define two optional lifecycle hooks: `begin()`, called once before any sequence is processed (useful for initialising counters or opening files), and `finish()`, called after the last sequence (useful for printing summary statistics or flushing output). A thread-safe shared table `obicontext` is available across all workers, allowing aggregation across parallel executions.
Sequences are read from files or standard input in any format supported by OBITools4 (FASTA, FASTQ, EMBL, GenBank, ecoPCR, CSV). The sequence filtering flags (such as `--min-length`, `--predicate`, etc.) select which sequences the Lua script is applied to; sequences that do not match the filter pass through to the output unchanged, without the script being executed on them. All sequences, scripted or not, are written to the output.
To get started, use `--template` to print a minimal Lua script skeleton with stubs for all three hooks and inline documentation.
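The pass-through semantics can be sketched in Python (an illustration of the behaviour only; `obiscript` itself runs Lua, not Python, and the record shape used here is invented):

```python
def apply_script(records, matches, worker):
    """Emit every record: run `worker` only on those selected by `matches`;
    the rest pass through unchanged, mirroring obiscript's behaviour."""
    for rec in records:
        yield worker(rec) if matches(rec) else rec

# Toy records: (identifier, nucleotide string) pairs.
records = [("seq1", "acgt"), ("seq2", "acgtacgtacgt")]

# A filter analogous to --min-length 8, and a worker that tags the identifier.
long_enough = lambda rec: len(rec[1]) >= 8
tag = lambda rec: (rec[0] + "_tagged", rec[1])

out = list(apply_script(records, long_enough, tag))
# seq1 is too short and passes through unchanged; seq2 is transformed.
```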
---
# INPUT
`obiscript` reads biological sequences from one or more files supplied as positional arguments, or from standard input if no files are given. All formats supported by OBITools4 are accepted: FASTA, FASTQ, EMBL flatfile, GenBank flatfile, ecoPCR output, and CSV. Format auto-detection is used by default; explicit format flags (`--fasta`, `--fastq`, `--embl`, `--genbank`, `--ecopcr`, `--csv`) override it. Header annotation style can be forced with `--input-OBI-header` or `--input-json-header`.
---
# OUTPUT
Sequences processed by the Lua script are written to standard output, or to the file given by `--out`. Any modifications made to sequence records inside `worker()` (identifier, sequence, attributes) are reflected in the output. The output format defaults to FASTA when no quality data are present and to FASTQ otherwise; use `--fasta-output`, `--fastq-output`, or `--json-output` to override. Header annotation style in FASTA/FASTQ output can be set with `--output-OBI-header` or `--output-json-header`. Output can be gzip-compressed with `--compress`.
## Observed output example
```
>sample1_seq001 {"definition":"control sequence for annotation test","sample":"sample1"}
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>sample1_seq002 {"definition":"another control sequence from sample1","sample":"sample1"}
gctagctagctagctagctagctagctagctagctagctagctagcta
>sample2_seq003 {"definition":"second sample sequence","sample":"sample2"}
ttaattaattaattaattaattaattaattaattaattaattaattaa
>sample2_seq004 {"definition":"second sample another sequence","sample":"sample2"}
ccggccggccggccggccggccggccggccggccggccggccggccgg
>sample3_seq005 {"definition":"third sample first sequence","sample":"sample3"}
aaaattttccccggggaaaattttccccggggaaaattttccccgggg
>sample3_seq006 {"definition":"third sample second sequence","sample":"sample3"}
ttttaaaaccccggggttttaaaaccccggggttttaaaaccccgggg
```
---
# OPTIONS
## Script
### `--script|-S <string>`
- Default: `""`
- Path to the Lua script file to execute. The file must exist and be syntactically valid Lua. The script should define a `worker(sequence)` function, and optionally `begin()` and `finish()`.
### `--template`
- Default: `false`
- Print a minimal Lua script template to standard output, with stubs for `begin()`, `worker()`, and `finish()` and inline documentation, then exit. Use this to bootstrap a new script.
## Sequence filtering (selects the sequences to which the script is applied; non-matching sequences pass through unchanged)
### `--predicate|-p <EXPRESSION>`
- Default: `[]`
- Boolean expression evaluated for each sequence record. Attribute keys are accessible as variable names; `sequence` refers to the record itself. Multiple `-p` options are combined with AND.
### `--sequence|-s <PATTERN>`
- Default: `[]`
- Regular expression matched against the nucleotide sequence. Case-insensitive. Multiple patterns are combined with AND.
### `--identifier|-I <PATTERN>`
- Default: `[]`
- Regular expression matched against the sequence identifier. Case-insensitive.
### `--definition|-D <PATTERN>`
- Default: `[]`
- Regular expression matched against the sequence definition line. Case-insensitive.
### `--approx-pattern <PATTERN>`
- Default: `[]`
- Pattern matched approximately against the sequence. Use `--pattern-error` to set the maximum number of errors allowed.
### `--pattern-error <int>`
- Default: `0`
- Maximum number of errors (mismatches) allowed during approximate pattern matching.
### `--allows-indels`
- Default: `false`
- Allow insertions and deletions (in addition to mismatches) during approximate pattern matching.
### `--only-forward`
- Default: `false`
- Restrict pattern matching to the forward strand only.
### `--has-attribute|-A <KEY>`
- Default: `[]`
- Apply the script only to records that have an attribute with key `<KEY>`; others pass through.
### `--attribute|-a <KEY=VALUE>`
- Default: `{}`
- Apply the script only to records where the attribute `KEY` matches the regular expression `VALUE`. Case-sensitive. Multiple `-a` options are combined with AND.
### `--id-list <FILENAME>`
- Default: `""`
- Path to a text file containing one sequence identifier per line. The script is applied only to records whose identifier appears in the file; others pass through.
### `--min-length|-l <LENGTH>`
- Default: `1`
- Apply the script only to sequences whose length is at least `LENGTH`; shorter sequences pass through unchanged.
### `--max-length|-L <LENGTH>`
- Default: `2000000000`
- Apply the script only to sequences whose length is at most `LENGTH`; longer sequences pass through unchanged.
### `--min-count|-c <COUNT>`
- Default: `1`
- Apply the script only to sequences with a count (abundance) of at least `COUNT`; others pass through unchanged.
### `--max-count|-C <COUNT>`
- Default: `2000000000`
- Apply the script only to sequences with a count (abundance) of at most `COUNT`; others pass through unchanged.
### `--inverse-match|-v`
- Default: `false`
- Invert the selection: apply the script to records that do NOT match the filter criteria; matching records pass through unchanged.
## Taxonomic filtering
### `--taxonomy|-t <string>`
- Default: `""`
- Path to the taxonomy database. Required for taxonomy-based options.
### `--restrict-to-taxon|-r <TAXID>`
- Default: `[]`
- Retain only sequences whose taxid belongs to the specified taxon.
### `--ignore-taxon|-i <TAXID>`
- Default: `[]`
- Exclude sequences whose taxid belongs to the specified taxon.
### `--require-rank <RANK_NAME>`
- Default: `[]`
- Retain only sequences whose taxon has the specified rank (e.g., `species`, `genus`).
### `--valid-taxid`
- Default: `false`
- Retain only sequences that carry a currently valid NCBI taxid.
### `--fail-on-taxonomy`
- Default: `false`
- Abort with an error if a taxid used during filtering is not currently valid.
### `--update-taxid`
- Default: `false`
- Automatically replace taxids declared as merged with their current equivalent.
### `--raw-taxid`
- Default: `false`
- Print taxids in output without supplementary information (taxon name and rank).
### `--with-leaves`
- Default: `false`
- When extracting taxonomy from a sequence file, attach sequences as leaves of their taxid annotation.
## Paired-end mode
### `--paired-mode <forward|reverse|and|or|andnot|xor>`
- Default: `"forward"`
- When paired reads are provided, determines how filter conditions are applied to both reads of a pair.
## Input format
### `--fasta`
- Default: `false`
- Force FASTA format parsing.
### `--fastq`
- Default: `false`
- Force FASTQ format parsing.
### `--embl`
- Default: `false`
- Force EMBL flatfile format parsing.
### `--genbank`
- Default: `false`
- Force GenBank flatfile format parsing.
### `--ecopcr`
- Default: `false`
- Force ecoPCR output format parsing.
### `--csv`
- Default: `false`
- Force CSV format parsing.
### `--input-OBI-header`
- Default: `false`
- Parse FASTA/FASTQ title line annotations as OBI format.
### `--input-json-header`
- Default: `false`
- Parse FASTA/FASTQ title line annotations as JSON format.
### `--solexa`
- Default: `false`
- Decode quality strings according to the Solexa specification.
### `--u-to-t`
- Default: `false`
- Convert uracil (U) to thymine (T) in sequences.
### `--skip-empty`
- Default: `false`
- Suppress sequences of length zero from the output.
### `--no-order`
- Default: `false`
- When multiple input files are provided, indicates that no ordering is assumed among them.
## Output format
### `--out|-o <FILENAME>`
- Default: `"-"` (standard output)
- File path for saving the output.
### `--fasta-output`
- Default: `false`
- Write output in FASTA format.
### `--fastq-output`
- Default: `false`
- Write output in FASTQ format.
### `--json-output`
- Default: `false`
- Write output in JSON format.
### `--output-OBI-header|-O`
- Default: `false`
- Write FASTA/FASTQ title line annotations in OBI format.
### `--output-json-header`
- Default: `false`
- Write FASTA/FASTQ title line annotations in JSON format.
### `--compress|-Z`
- Default: `false`
- Compress output using gzip.
## Performance
### `--max-cpu <int>`
- Default: `16` (env: `OBIMAXCPU`)
- Number of parallel threads used for processing.
### `--batch-size <int>`
- Default: `1` (env: `OBIBATCHSIZE`)
- Minimum number of sequences per processing batch.
### `--batch-size-max <int>`
- Default: `2000` (env: `OBIBATCHSIZEMAX`)
- Maximum number of sequences per processing batch.
### `--batch-mem <string>`
- Default: `""` (128M effective) (env: `OBIBATCHMEM`)
- Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable.
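A minimal parser illustrating the size-string format; treating the suffixes as binary multiples (K = 1024, and so on) is an assumption based on the examples, not the tool's documented behaviour:

```python
def parse_batch_mem(spec: str) -> int:
    """Parse a size string such as '128K', '64M', or '1G' into bytes.

    Assumes binary multiples (K = 2**10, M = 2**20, G = 2**30);
    '0' (or an empty string) disables the limit.
    """
    units = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    spec = spec.strip().upper()
    if spec and spec[-1] in units:
        return int(spec[:-1]) * units[spec[-1]]
    return int(spec or "0")

print(parse_batch_mem("128M"))  # 134217728
```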
## Diagnostics
### `--debug`
- Default: `false` (env: `OBIDEBUG`)
- Enable debug logging.
### `--no-progressbar`
- Default: `false`
- Disable the progress bar.
### `--silent-warning`
- Default: `false` (env: `OBIWARNING`)
- Suppress warning messages.
### `--pprof`
- Default: `false`
- Enable the pprof profiling HTTP server (see log for address).
### `--pprof-goroutine <int>`
- Default: `6060` (env: `OBIPPROFGOROUTINE`)
- Port for goroutine blocking profile.
### `--pprof-mutex <int>`
- Default: `10` (env: `OBIPPROFMUTEX`)
- Rate for mutex lock profiling.
---
# EXAMPLES
```bash
# Print a starter script template and save it to my_script.lua
obiscript --template > my_script.lua
```
**Expected output:** Lua template with `begin()`, `worker()`, and `finish()` stubs written to `my_script.lua`.
```bash
# Add a custom annotation to every sequence record
# (the script sets a new attribute 'sample' from the identifier prefix)
obiscript --script annotate.lua --fasta-output sequences.fasta > annotated.fasta
```
**Expected output:** 6 sequences written to `annotated.fasta`.
```bash
# Count reads per taxon using the finish() hook, filtering to a specific taxon
obiscript --script count_taxa.lua \
--restrict-to-taxon 6231 \
--taxonomy /data/ncbi_tax \
sequences.fasta > filtered_annotated.fasta
```
```bash
# Apply a script to FASTQ sequences with a length filter
obiscript --script process_pairs.lua \
--min-length 100 \
--out result.fastq \
reads.fastq
```
**Expected output:** 4 sequences written to `result.fastq`.
```bash
# Run on FASTQ input, output JSON, using 4 CPU threads
obiscript --script enrich.lua \
--json-output \
--max-cpu 4 \
sequences.fastq > enriched.json
```
**Expected output:** 4 sequences written to `enriched.json`.
---
# SEE ALSO
`obigrep` — filter sequences using the same selection criteria without scripting.
`obiannotate` — add or modify sequence attributes without scripting.
---
# NOTES
- The Lua `worker(sequence)` function is called in parallel across multiple goroutines. Use the thread-safe `obicontext` table (with `obicontext:lock()` / `obicontext:unlock()`) for any shared mutable state accessed across workers.
- The `begin()` and `finish()` hooks each run in a single goroutine and do not need locking for their own internal state.
- Sequence records modified inside `worker()` must be returned (or the original returned unmodified) for the record to appear in the output. Returning `nil` drops the sequence.
# NAME
obisummary — summarise the main information from a sequence file
---
# SYNOPSIS
```
obisummary [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--csv] [--debug] [--ecopcr] [--embl]
[--fasta] [--fastq] [--genbank] [--help|-h|-?]
[--input-OBI-header] [--input-json-header] [--json-output]
[--map <string>]... [--max-cpu <int>] [--no-order] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>] [--silent-warning]
[--solexa] [--u-to-t] [--version] [--yaml-output] [<args>]
```
---
# DESCRIPTION
`obisummary` reads a set of biological sequences and computes a statistical
summary of their content and annotations. Rather than producing a new sequence
file, it outputs a single structured record describing the dataset as a whole.
The summary covers three main areas. First, global counts: the total number of
reads (sequences weighted by their `count` attribute), the number of distinct
sequence variants, and the total sequence length across all records. Second,
annotation profiling: `obisummary` inspects every annotation key present in
the dataset and classifies it as a scalar attribute (single value per
sequence), a map attribute (key-to-count mapping), or a vector attribute
(multi-value per sequence). Third, per-sample statistics: when sequences carry
sample information (via `merged_sample` or equivalent per-sample annotations),
`obisummary` reports for each sample the number of reads, the number of
variants, and the number of singletons. If `obiclean` has been run previously,
the summary also captures `obiclean_status` and related quality flags per
sample.
The output is a single JSON record by default, or YAML when `--yaml-output` is
requested.
`obisummary` is typically used after processing steps such as
`obiclean` or `obiuniq` to quickly validate the state of a dataset before
downstream analysis.
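As a toy illustration of the three global counts (a simplified sketch; real records carry many more annotations):

```python
# Toy records: a nucleotide string plus a 'count' (abundance) attribute.
records = [
    {"seq": "acgtacgt", "count": 3},
    {"seq": "ttttaaaa", "count": 1},
    {"seq": "acgtacgt", "count": 2},  # same variant as the first record
]

variants = len({r["seq"] for r in records})         # distinct sequences -> 2
reads = sum(r["count"] for r in records)            # count-weighted total -> 6
total_length = sum(len(r["seq"]) for r in records)  # over all records -> 24
```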
---
# INPUT
`obisummary` accepts biological sequence data from one or more files supplied
as positional arguments, or from standard input when no files are given.
Supported formats include FASTA, FASTQ, GenBank flatfile, EMBL flatfile,
ecoPCR output, and CSV. By default the format is detected automatically; use
the format flags (`--fasta`, `--fastq`, `--genbank`, `--embl`, `--ecopcr`,
`--csv`) to force a specific parser.
FASTA/FASTQ annotation headers may follow the OBI format (`--input-OBI-header`)
or JSON format (`--input-json-header`). RNA sequences can be read as DNA by
converting uracil to thymine with `--u-to-t`. Quality strings encoded according
to the Solexa specification are handled with `--solexa`.
When multiple input files are provided, `obisummary` assumes they are ordered;
use `--no-order` to indicate that no ordering exists among them.
---
# OUTPUT
`obisummary` writes a single structured record to standard output. The default
format is JSON; use `--yaml-output` to obtain YAML instead.
The record contains three top-level sections:
- **`count`**: global metrics including `variants` (distinct sequences),
`reads` (total weighted count), and `total_length` (sum of all sequence
lengths).
- **`annotations`**: a breakdown of all annotation keys found in the dataset,
classified as `scalar_attributes`, `map_attributes`, or `vector_attributes`,
together with the observed keys and their occurrence counts within each
category.
- **`samples`**: when sample information is present, `sample_count` and a
per-sample `sample_stats` table with `reads`, `variants`, and `singletons`
fields. If `obiclean` data is present, an `obiclean_bad` field is also
reported per sample.
When `--map` is used, the named map attribute is included in the annotation
detail for that attribute.
## Observed output example
```
{
"annotations": {
"keys": {
"scalar": {
"count": 5
}
},
"map_attributes": 0,
"scalar_attributes": 1,
"vector_attributes": 0
},
"count": {
"reads": 21,
"total_length": 100,
"variants": 5
}
}
```
---
# OPTIONS
## Summary output
**`--json-output`**
- Default: `false`
- Print the result as a JSON record (this is the default behaviour; this flag
makes the choice explicit).
**`--yaml-output`**
- Default: `false`
- Print the result as a YAML record instead of the default JSON format.
**`--map <string>`**
- Default: `[]` (none)
- Name of a map attribute to include in the summary. This option may be
repeated to request multiple map attributes. Each named attribute will be
detailed in the `map_attributes` section of the output.
## Input format
**`--fasta`**
- Default: `false`
- Read data following the FASTA format.
**`--fastq`**
- Default: `false`
- Read data following the FASTQ format.
**`--genbank`**
- Default: `false`
- Read data following the GenBank flatfile format.
**`--embl`**
- Default: `false`
- Read data following the EMBL flatfile format.
**`--ecopcr`**
- Default: `false`
- Read data following the ecoPCR output format.
**`--csv`**
- Default: `false`
- Read data following the CSV format.
**`--input-OBI-header`**
- Default: `false`
- FASTA/FASTQ title line annotations follow OBI format.
**`--input-json-header`**
- Default: `false`
- FASTA/FASTQ title line annotations follow JSON format.
**`--solexa`**
- Default: `false`
- Decode quality strings according to the Solexa specification.
**`--u-to-t`**
- Default: `false`
- Convert uracil (U) to thymine (T) when reading RNA sequences.
## Batch control
**`--batch-size <int>`**
- Default: `1`
- Minimum number of sequences per processing batch.
**`--batch-size-max <int>`**
- Default: `2000`
- Maximum number of sequences per processing batch.
**`--batch-mem <string>`**
- Default: `""` (128M effective)
- Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable
the memory limit.
## Processing
**`--max-cpu <int>`**
- Default: `16`
- Number of parallel threads used to compute the result.
**`--no-order`**
- Default: `false`
- When several input files are provided, indicates that there is no order
among them.
## General
**`--debug`**
- Default: `false`
- Enable debug mode by setting the log level to debug.
**`--silent-warning`**
- Default: `false`
- Stop printing warning messages.
**`--version`**
- Default: `false`
- Print the version and exit.
**`--help` / `-h` / `-?`**
- Default: `false`
- Display help and exit.
**`--pprof`**
- Default: `false`
- Enable the pprof profiling server. Consult the log for the server address.
**`--pprof-goroutine <int>`**
- Default: `6060`
- Port for goroutine blocking profile.
**`--pprof-mutex <int>`**
- Default: `10`
- Rate for mutex lock profiling.
---
# EXAMPLES
```bash
# Get a JSON summary of a FASTA file produced by obiclean
obisummary cleaned.fasta > out_default.yaml
```
**Expected output:** a JSON summary record in `out_default.yaml`.
```bash
# Get the summary as an explicit JSON record for programmatic processing
obisummary --json-output cleaned.fasta > out_json.json
```
**Expected output:** a JSON summary record in `out_json.json`.
```bash
# Get a YAML record from a FASTQ file
obisummary --yaml-output --fastq reads.fastq > out_yaml.yaml
```
**Expected output:** a YAML summary record in `out_yaml.yaml`.
```bash
# Summarise data read from standard input, forcing FASTA format
obigrep -p 'annotations.count > 1' sequences.fasta | obisummary --fasta > out_pipeline.yaml
```
**Expected output:** a JSON summary record in `out_pipeline.yaml` (3 variants, 10 reads).
---
# SEE ALSO
`obiclean`, `obiuniq`, `obicount`
# NAME
obiuniq — dereplicate sequence data sets
---
# SYNOPSIS
```
obiuniq [--batch-mem <string>] [--batch-size <int>] [--batch-size-max <int>]
[--category-attribute|-c <CATEGORY>]... [--chunk-count <int>]
[--compress|-Z] [--csv] [--debug] [--ecopcr] [--embl]
[--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq]
[--fastq-output] [--genbank] [--help|-h|-?] [--in-memory]
[--input-OBI-header] [--input-json-header] [--json-output]
[--max-cpu <int>] [--merge|-m <KEY>]... [--na-value <NA_NAME>]
[--no-order] [--no-progressbar] [--no-singleton]
[--out|-o <FILENAME>] [--output-OBI-header|-O] [--output-json-header]
[--pprof] [--pprof-goroutine <int>] [--pprof-mutex <int>]
[--raw-taxid] [--silent-warning] [--skip-empty] [--solexa]
[--taxonomy|-t <string>] [--u-to-t] [--update-taxid] [--version]
[--with-leaves] [<args>]
```
---
# DESCRIPTION
`obiuniq` groups identical sequences together and replaces them with a single
representative, recording the total number of original occurrences as an
abundance count. This process — called dereplication — is a standard step in
amplicon sequencing workflows: it dramatically reduces the number of sequence
records to process, while preserving exact counts needed for downstream
statistical analyses.
By default, two sequences are considered identical if and only if their
nucleotide strings are the same. Using `--category-attribute` (repeatable),
additional metadata fields can be included in the identity criterion. For
example, grouping by sample name keeps the same sequence as separate records
when it occurs in different samples, enabling per-sample abundance tracking.
For each group of identical sequences, `obiuniq` emits one output record
carrying the merged metadata of all members. The `--merge` option (repeatable)
instructs the command to also record, in an attribute named `merged_<KEY>`, the
distribution of `KEY` attribute values across the sequences collapsed into each
group — useful for provenance tracking and quality control.
Sequences that appear only once in the entire dataset (singletons) can be
removed with `--no-singleton`. Singletons often represent sequencing errors
rather than genuine biological variants, so their removal is a common
noise-reduction step.
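The grouping, counting, and `merged_KEY` bookkeeping described above can be sketched in Python (a simplified model of the behaviour, not the actual implementation):

```python
from collections import defaultdict

def dereplicate(records, categories=(), merge_keys=(), na="NA", no_singleton=False):
    """Simplified model of obiuniq's grouping: records are merged when they
    share the same nucleotide string and the same value for every category
    attribute; counts are summed and merged_<KEY> distributions tallied."""
    groups = {}
    for rec in records:
        key = (rec["seq"],) + tuple(rec.get(c, na) for c in categories)
        grp = groups.setdefault(key, {
            "seq": rec["seq"],
            "count": 0,
            **{f"merged_{k}": defaultdict(int) for k in merge_keys},
        })
        n = rec.get("count", 1)  # a record without 'count' represents one read
        grp["count"] += n
        for k in merge_keys:
            grp[f"merged_{k}"][rec.get(k, na)] += n
    return [g for g in groups.values()
            if not (no_singleton and g["count"] == 1)]

records = [
    {"seq": "acgt", "sample": "s1"},
    {"seq": "acgt", "sample": "s2"},
    {"seq": "acgt", "sample": "s1"},
    {"seq": "tttt"},  # no 'sample' attribute: grouped under the NA value
]

# Nucleotide identity only, tracking the per-sample distribution:
uniq = dereplicate(records, merge_keys=("sample",))
# 'acgt' collapses to one record (count 3, merged_sample {'s1': 2, 's2': 1});
# 'tttt' remains a singleton and would be dropped by no_singleton=True.
```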
---
# INPUT
`obiuniq` accepts biological sequence data in FASTA, FASTQ, EMBL, GenBank,
ecoPCR, or CSV format (auto-detected by default, or forced with format flags
such as `--fasta`, `--fastq`, `--embl`, etc.). Input is read from one or more
files given as positional arguments, or from standard input when no files are
provided.
When multiple input files are provided, `obiuniq` assumes they are ordered
(e.g., paired-end reads in the same read order). If no such ordering exists,
use `--no-order` to signal that files can be consumed independently.
FASTA/FASTQ header annotations are parsed heuristically by default. Use
`--input-OBI-header` for OBI-formatted headers or `--input-json-header` for
JSON-formatted headers. RNA sequences can be normalised to DNA on the fly with
`--u-to-t`.
---
# OUTPUT
`obiuniq` writes dereplicated sequences to standard output or to the file
specified by `--out`. Each output record represents one group of identical
sequences (identical under the chosen grouping criterion). The output carries
the merged metadata from all input records in the group.
The output format defaults to FASTA. Even when the input contains quality
scores (FASTQ), quality information is not preserved across merged sequences,
so the output is written in FASTA format unless `--fastq-output` is explicitly
requested.
Output annotations follow the OBI header format when `--output-OBI-header` is
set, or JSON when `--output-json-header` is set. The output can be
gzip-compressed with `--compress`.
For each output record:
- The abundance count reflects how many input sequences were merged into the
group.
- Attributes created by `--merge KEY` are named `merged_KEY` and map each
observed value of the `KEY` attribute to the count of input sequences
carrying that value within the group.
- All other attributes are merged from the contributing records according to
the standard OBITools4 merging rules.
## Observed output example
```
>seq008 {"count":1,"primer":"p1"}
cccccccccccccccccccc
>seq001 {"count":4,"primer":"p1"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"primer":"p1","sample":"s1"}
gctagctagctagctagcta
>seq007 {"count":1,"primer":"p1","sample":"s2"}
tttttttttttttttttttt
```
---
# OPTIONS
## Dereplication Options
**`--category-attribute|-c <CATEGORY>`** (default: `[]`)
Adds one metadata attribute to the grouping criterion. Two sequences are
placed in the same group only when they are nucleotide-identical **and** share
the same value for every attribute listed with `-c`. This option can be
repeated to combine multiple attributes (e.g., `-c sample -c primer`).
Records that lack a listed attribute receive the value set by `--na-value`.
**`--chunk-count <int>`** (default: `100`)
Controls how many internal partitions the dataset is split into during
processing. A higher value reduces per-partition memory usage at the cost of
more temporary files; a lower value increases per-partition memory but reduces
I/O overhead. Tune this when processing very large or very small datasets.
**`--in-memory`** (default: `false`)
Stores intermediate data chunks in RAM rather than in temporary disk files.
Speeds up processing on datasets that fit comfortably in available memory;
omit this flag (the default) for large datasets that exceed available RAM.
**`--merge|-m <KEY>`** (default: `[]`)
Creates an output attribute named `merged_KEY` that maps each observed value
of the `KEY` attribute to the count of input sequences carrying that value
within the group. Repeat to track multiple attributes.
Useful for tracking which sample or category contributions were collapsed into each group.
**`--na-value <NA_NAME>`** (default: `"NA"`)
Value assigned to a category attribute when a sequence record does not carry
that attribute. All sequences lacking the attribute are grouped together under
this placeholder, rather than being treated as incomparable.
**`--no-singleton`** (default: `false`)
Discards all output records whose abundance count is exactly one — i.e.,
sequences that occur only once across the entire input. Removing singletons
is a standard heuristic for excluding sequencing errors from further analysis.
## Input Options
**`--batch-mem <string>`** (default: `""`, env: `OBIBATCHMEM`)
Maximum memory budget per processing batch (e.g. `128K`, `64M`, `1G`). Set
to `0` to disable the memory ceiling. Overrides `--batch-size-max` when
both are set.
**`--batch-size <int>`** (default: `10`, env: `OBIBATCHSIZE`)
Minimum number of sequences per batch (floor).
**`--batch-size-max <int>`** (default: `2000`, env: `OBIBATCHSIZEMAX`)
Maximum number of sequences per batch (ceiling).
**`--csv`** (default: `false`)
Parse input as CSV format.
**`--ecopcr`** (default: `false`)
Parse input as ecoPCR output format.
**`--embl`** (default: `false`)
Parse input as EMBL flatfile format.
**`--fasta`** (default: `false`)
Parse input as FASTA format.
**`--fastq`** (default: `false`)
Parse input as FASTQ format.
**`--genbank`** (default: `false`)
Parse input as GenBank flatfile format.
**`--input-OBI-header`** (default: `false`)
Treat FASTA/FASTQ title line annotations as OBI-format key=value pairs.
**`--input-json-header`** (default: `false`)
Treat FASTA/FASTQ title line annotations as JSON objects.
**`--no-order`** (default: `false`)
When multiple input files are provided, indicates that there is no ordering
relationship among them.
**`--skip-empty`** (default: `false`)
Suppress sequences of length zero from the output.
**`--solexa`** (default: `false`, env: `OBISOLEXA`)
Decode quality strings according to the Solexa specification rather than the
standard Phred encoding.
**`--u-to-t`** (default: `false`)
Convert uracil (U) to thymine (T) in all input sequences, normalising RNA to
DNA representation.
## Output Options
**`--compress|-Z`** (default: `false`)
Compress output using gzip.
**`--fasta-output`** (default: `false`)
Write output in FASTA format (default when no quality scores are available).
**`--fastq-output`** (default: `false`)
Write output in FASTQ format (default when quality scores are present).
**`--json-output`** (default: `false`)
Write output in JSON format.
**`--out|-o <FILENAME>`** (default: `"-"`)
Write output to the specified file instead of standard output.
**`--output-OBI-header|-O`** (default: `false`)
Write FASTA/FASTQ title line annotations in OBI format.
**`--output-json-header`** (default: `false`)
Write FASTA/FASTQ title line annotations in JSON format.
## Taxonomy Options
**`--fail-on-taxonomy`** (default: `false`)
Cause `obiuniq` to exit with an error if a taxid in the data is not a
currently valid taxon in the loaded taxonomy.
**`--raw-taxid`** (default: `false`)
Print taxids in output without supplementary information (taxon name and rank).
**`--taxonomy|-t <string>`** (default: `""`)
Path to the taxonomy database used to validate or update taxids.
**`--update-taxid`** (default: `false`)
Automatically replace merged taxids with the most recent valid taxid.
**`--with-leaves`** (default: `false`)
When taxonomy is extracted from a sequence file, add sequences as leaves of
their taxid annotation.
## Execution Options
**`--max-cpu <int>`** (default: `16`, env: `OBIMAXCPU`)
Number of parallel threads used to compute the result.
**`--debug`** (default: `false`, env: `OBIDEBUG`)
Enable debug mode by setting the log level to debug.
**`--no-progressbar`** (default: `false`)
Disable the progress bar.
**`--silent-warning`** (default: `false`, env: `OBIWARNING`)
Suppress warning messages.
**`--pprof`** (default: `false`)
Enable the pprof profiling server (address logged at startup).
**`--pprof-goroutine <int>`** (default: `6060`, env: `OBIPPROFGOROUTINE`)
Port for the goroutine blocking profile endpoint.
**`--pprof-mutex <int>`** (default: `10`, env: `OBIPPROFMUTEX`)
Rate for the mutex contention profile.
**`--version`** (default: `false`)
Print the version string and exit.
**`--help|-h|-?`** (default: `false`)
Print usage information and exit.
---
# EXAMPLES
```bash
# Dereplicate a FASTQ file of amplicon reads; write unique sequences to a FASTA output file.
obiuniq reads.fastq -o out_basic.fastq
```
**Expected output:** 4 sequences written to `out_basic.fastq`.
```bash
# Dereplicate keeping sequences separate per sample (category attribute),
# and discard singletons to remove likely sequencing errors.
obiuniq -c sample --no-singleton reads.fastq -o out_no_singleton.fastq
```
**Expected output:** 2 sequences written to `out_no_singleton.fastq`.
```bash
# Dereplicate per sample, recording the sample distribution in 'merged_sample',
# and use 'UNKNOWN' for reads missing the sample attribute.
obiuniq -c sample --merge sample --na-value UNKNOWN reads.fastq -o out_merge.fastq
```
**Expected output:** 5 sequences written to `out_merge.fastq`.
```bash
# Process a dataset entirely in memory using 200 internal partitions,
# writing gzip-compressed output.
obiuniq --in-memory --chunk-count 200 --compress -o out_inmemory.fastq.gz reads.fastq
```
**Expected output:** 4 sequences written to `out_inmemory.fastq.gz`.
```bash
# Dereplicate reads from two sample files with no assumed ordering between them,
# grouping by both sample and primer attributes.
obiuniq --no-order -c sample -c primer sample1.fastq sample2.fastq -o out_multifile.fastq
```
**Expected output:** 4 sequences written to `out_multifile.fastq`.
---
# SEE ALSO
- `obigrep` — filter dereplicated sequences by abundance, length, or annotation
- `obiannotate` — add or modify annotations on dereplicated records
- `obicount` — count sequences or groups in a dataset
- `obiclean` — remove sequencing artefacts from a dereplicated dataset
- `obisummary` — summarise annotation distributions across a sequence set
---
# NOTES
For datasets that do not fit in RAM, `obiuniq` uses temporary disk-backed
chunk files by default. The number of chunks is controlled by `--chunk-count`
(default 100). Increasing this value lowers per-chunk memory requirements;
decreasing it reduces I/O at the cost of higher peak memory. Use `--in-memory`
only when the full working set fits in available RAM, as exceeding memory will
degrade performance or cause out-of-memory failures.
Singletons (sequences with abundance = 1) are a common source of noise in
amplicon sequencing, often arising from PCR or sequencing errors. The
`--no-singleton` flag is therefore recommended for most metabarcoding
workflows, unless the study design requires retaining all observed variants.
When the `--category-attribute` option is used, records that lack the
specified attribute are grouped together under the `--na-value` placeholder
(default `"NA"`). This ensures that all records participate in dereplication
without being silently dropped, but users should be aware that heterogeneous
records with different missing attributes may be unintentionally merged.
+48
View File
@@ -0,0 +1,48 @@
# `neural-ensemble` — A Lightweight Library for Modular Neural Ensemble Learning
The `neural-ensemble` package provides tools to build, train, evaluate, and deploy ensembles of neural networks with minimal boilerplate. It emphasizes modularity, reproducibility, and scalability—supporting both homogeneous (e.g., multiple ResNets) and heterogeneous ensembles (mix of CNNs, Transformers, MLPs)—while offering unified interfaces for data handling, training orchestration, and uncertainty quantification.
## Core Functionalities
### 1. **Model Composition**
- `Ensemble`: A container class to manage multiple models (heterogeneous or homogeneous), supporting dynamic model registration, weighted averaging, voting, and stacking.
- `ModelConfig`: A dataclass to declaratively specify model architecture (e.g., backbone, input shape), training hyperparameters, and checkpoint paths.
### 2. **Training & Orchestration**
- `EnsembleTrainer`: Handles distributed or sequential training of ensemble members, with support for early stopping, learning rate scheduling per member, and custom loss weighting.
- `TrainerCallback`: Abstract base for implementing logging, checkpointing, or metric tracking hooks.
### 3. **Data Handling**
- `EnsembleDataset`: Wraps any PyTorch-compatible dataset and automatically replicates inputs across all ensemble members (with optional per-member augmentation).
- `EnsembleDataModule`: Lightning-compatible data module for seamless integration with PyTorch Lightning workflows.
### 4. **Inference & Aggregation**
- `EnsemblePredictor`: Provides `.predict()` and `.forward_ensemble()`, supporting:
- *Hard/soft voting* (classification)
- *Mean/variance aggregation* (regression)
- *Monte Carlo dropout & deep ensembles* for uncertainty estimation
- `UncertaintyMetrics`: Computes ECE, NLL, Brier score, and predictive entropy.
### 5. **Evaluation & Calibration**
- `EnsembleEvaluator`: Runs comprehensive evaluation across members and the ensemble, reporting per-member vs. aggregate metrics.
- `CalibrationWrapper`: Applies temperature scaling or isotonic regression to calibrate ensemble outputs.
### 6. **Serialization & Deployment**
- `Ensemble.save()` / `.load()`: Persists full ensemble state (weights, configs) to disk.
- `Ensemble.to_torchscript()`: Exports the ensemble for production inference (e.g., via TorchServe or ONNX).
## Key Design Principles
- **Minimal dependencies**: Built on top of PyTorch, with optional integrations (Lightning, HuggingFace).
- **No hidden state**: All ensemble behavior is controlled via explicit configuration.
- **Extensible hooks**: Custom aggregation rules, losses, or metrics can be injected via inheritance.
## Example Workflow
```python
ensemble = Ensemble([
ModelConfig(backbone="resnet18", input_shape=(3, 224, 224)),
ModelConfig(backbone="vit_b_16", input_shape=(3, 224, 224)),
])
trainer = EnsembleTrainer(ensemble=ensemble)
trainer.fit(train_loader, val_loader)
preds, uncertainties = EnsemblePredictor(ensemble).predict(test_loader, return_uncertainty=True)
```
+22
View File
@@ -0,0 +1,22 @@
# `obialign` Package: Sequence Alignment Utilities
The `obialign` package provides core functions for pairwise biological sequence alignment in Go, designed to work with `obiseq.BioSequence` objects.
- **Core Alignment Construction**: `_BuildAlignment()` and `BuildAlignment()` reconstruct aligned sequences from a precomputed alignment path (e.g., output by dynamic programming). It supports gap characters and reuses buffers for efficiency.
- **Quality-Aware Consensus Building**: `BuildQualityConsensus()` generates a consensus sequence from an alignment and per-base quality scores:
- At mismatches, it retains the higher-quality base.
- When qualities are equal and bases differ, an IUPAC ambiguity code is used (via `_FourBitsBaseCode`/`_Decode`).
- Quality values are combined and adjusted for mismatches using a Phred-like error probability model.
- Optionally records mismatch statistics in sequence attributes.
- **Performance & Memory Efficiency**: Uses preallocated buffers (via `PEAlignArena`) or fallback allocation, with slice recycling to minimize GC pressure.
- **Metadata Handling**: Preserves sequence IDs and definitions in output; supports optional mismatch reporting for downstream analysis.
- **Alignment Path Format**: The path is a sequence of signed integers encoding:
- Negative steps → deletions in seqB (insertion in A),
- Positive steps → insertions in B,
- Consecutive pairs encode match/mismatch runs.
This package is part of the OBITools4 ecosystem, targeting high-throughput amplicon or metagenomic data processing.
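Under one reading of the path format described above — pairs of (diagonal run length, signed gap length), which is an assumption about the exact layout, not the package's actual decoder — expanding a compact path into gapped strings can be sketched as:

```go
package main

import "fmt"

// decodePath expands a compact alignment path into two gapped strings.
// The path is read as (diagLen, gap) pairs: diagLen positions consumed
// from both sequences, then |gap| gap characters inserted into A when
// gap > 0 (insertion in B) or into B when gap < 0 (insertion in A).
func decodePath(a, b string, path []int) (string, string) {
	var ga, gb []byte
	i, j := 0, 0
	for p := 0; p+1 < len(path); p += 2 {
		diag, gap := path[p], path[p+1]
		for k := 0; k < diag; k++ { // match/mismatch run
			ga = append(ga, a[i])
			gb = append(gb, b[j])
			i++
			j++
		}
		if gap > 0 { // gap positions in A, bases consumed from B
			for k := 0; k < gap; k++ {
				ga = append(ga, '-')
				gb = append(gb, b[j])
				j++
			}
		} else { // gap positions in B, bases consumed from A
			for k := 0; k < -gap; k++ {
				ga = append(ga, a[i])
				gb = append(gb, '-')
				i++
			}
		}
	}
	return string(ga), string(gb)
}

func main() {
	// "ACGT" vs "ACT": two diagonal steps, one gap in B, one diagonal step.
	x, y := decodePath("ACGT", "ACT", []int{2, -1, 1, 0})
	fmt.Println(x) // ACGT
	fmt.Println(y) // AC-T
}
```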
@@ -0,0 +1,30 @@
# Semantic Description of `obialign` Backtracking Module
The `_Backtracking` function implements a **traceback algorithm** for sequence alignment, reconstructing the optimal path through an alignment matrix.
## Core Functionality
- **Input**:
- `pathMatrix`: Encodes alignment decisions (match/mismatch/gap) as integers.
- `lseqA`, `lseqB`: Lengths of sequences A and B.
- `path`: Pre-allocated slice to store the traceback path.
- **Output**: A compact representation of alignment steps, alternating between:
- Diagonal moves (`ldiag`): Matches/mismatches (one step in both sequences).
- Horizontal/vertical moves (`lleft` or `lup`): Gaps in sequence B (horizontal) or A (vertical).
## Algorithm Highlights
- **Reverse traversal** from `(lseqA-1, lseqB-1)` to origin.
- **Batching logic**: Consecutive gaps in same direction are aggregated (e.g., `lleft += step`) to compress run-length encoding.
- **Path reconstruction**: Steps are pushed *backwards* into the `path` slice using a moving pointer `p`.
- **Memory efficiency**: Uses `slices.Grow()` to preallocate space and logs resizing for debugging.
## Encoded Path Semantics
Each pair in the returned slice encodes:
- `[diag_count, move_type]`, where `move_type` is either a gap length (`lleft > 0`: horizontal, or `lup < 0`: vertical) or zero (end of diagonal run).
## Use Case
Enables efficient reconstruction and serialization of alignment paths—ideal for tools requiring low-level control over dynamic programming backtracking (e.g., pairwise aligners, edit-distance decompositions).
+26
View File
@@ -0,0 +1,26 @@
# Semantic Description of `obialign` Package
This Go package provides core utilities for **DNA sequence alignment scoring**, leveraging probabilistic models and log-space computations to ensure numerical stability.
## Key Functionalities
- **Four-bit nucleotide encoding**: Uses `_FourBitsBaseCode` (implied but not shown) to encode DNA bases as 4-bit values, enabling bitwise operations for fast comparison.
- **Bitwise match ratio (`_MatchRatio`)**: Computes a normalized overlap score between two encoded bases by counting shared bits, adjusting for presence/absence in each operand.
- **Log-space arithmetic helpers**:
- `_Logaddexp`: Stable computation of `log(exp(a) + exp(b))`.
- `_Log1mexp`, `_Logdiffexp`: Accurate log-domain operations for `log(1 - exp(a))` and `log(exp(a) - exp(b))`, critical for probability transformations.
- **Match/mismatch scoring (`_MatchScoreRatio`)**:
- Derives log-probability-based scores for observed matches/mismatches using Phred-quality inputs (`QF`, `QR`).
- Incorporates base composition priors (e.g., uniform 4-mer assumption via `log(3)`, `log(4)`).
- **Precomputed scoring matrices**:
- `_NucPartMatch`: Precomputes match ratios for all base-pair combinations.
- `_NucScorePartMatch{Match,Mismatch}`: Stores integer-scaled alignment scores (×10) for all Phred-quality pairs, enabling fast lookup during dynamic programming.
- **Thread-safe initialization**:
- `_InitDNAScoreMatrix` ensures one-time setup of all matrices using a mutex guard, preventing race conditions.
All computations are designed for high performance and numerical robustness in large-scale sequence alignment tasks.
+23
View File
@@ -0,0 +1,23 @@
# Semantic Description of `obialign` Package
The `obialign` package provides low-level utilities for efficiently encoding, decoding, and manipulating alignment-related metrics—specifically **score**, **path length**, and an **out-flag**—within compact 64-bit integers. This design supports high-performance operations in sequence alignment pipelines (e.g., OBITools4).
- **Core Encoding Strategy**:
A `uint64` encodes three fields: a *score* (upper bits), an inverted path *length*, and a single-bit flag indicating whether the value represents an "out" (i.e., terminal/invalid) state.
- **`encodeValues(score, length int, out bool)`**:
Packs `score`, `-length-1` (to preserve ordering via unsigned comparison), and the `out` flag into one integer. The most significant bit (bit 32) marks out-values.
- **`decodeValues(value uint64)`**:
Reverses encoding: extracts score, reconstructs original length via `((value + 1) ^ mask)`, and checks the out-flag.
- **Utility Bitwise Helpers**:
- `_incpath(value)`: decrements stored length (since it's negated, subtraction increases actual path).
- `_incscore(value)`: increments score by `1 << wsize`.
- `_setout(value)`: clears the out-flag, marking value as *not* terminal.
- **Predefined Constants**:
- `_empty`: neutral state (score=0, length=0).
- `_out`/`_notavail`: sentinel values for invalid or unavailable paths (high length, score=0).
This compact representation enables fast comparisons and updates during dynamic programming or alignment graph traversal—critical for scalability in large-scale metabarcoding analyses.
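A minimal sketch of the bit-packing idea, with an assumed field layout (32 low bits for the negated length, the out flag at bit 32, the score above) — the real constants and widths in the package may differ:

```go
package main

import "fmt"

const (
	lenBits = 32                   // low bits: bitwise NOT of the path length
	outBit  = uint64(1) << lenBits // single flag bit above the length field
	lenMask = outBit - 1
)

// encode packs score (high bits), ^length (low bits) and the out flag
// into one uint64. Storing ^length (= -length-1 in two's complement)
// makes "same score, shorter path" compare greater under plain unsigned
// comparison. The layout is an assumption for illustration.
func encode(score, length int, out bool) uint64 {
	v := uint64(score)<<(lenBits+1) | (uint64(^length) & lenMask)
	if out {
		v |= outBit
	}
	return v
}

func decode(v uint64) (score, length int, out bool) {
	score = int(v >> (lenBits + 1))
	length = int(^v & lenMask) // undo the bitwise NOT
	out = v&outBit != 0
	return
}

func main() {
	s, l, o := decode(encode(7, 42, true))
	fmt.Println(s, l, o) // 7 42 true
}
```

The round trip is lossless, and two values with equal score order by path length without unpacking.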
+42
View File
@@ -0,0 +1,42 @@
# Semantic Description of `obialign` Package
The `obialign` package provides high-performance functions for computing the **Longest Common Subsequence (LCS)** between two biological sequences, with support for error tolerance and end-gap-free alignment.
## Core Algorithm
- Implements a **Needleman-Wunsch** dynamic programming algorithm optimized for speed and memory efficiency.
- Uses bit-packed encoding (`uint64`) to store score, path length, and gap status in a compact form.
- Leverages **diagonal banding** to restrict computation only within the allowed error margin, reducing time and space complexity.
## Scoring Scheme
- **Match**: +1 point
- **Mismatch or gap (indel)**: 0 points
## Key Functions
1. `FastLCSEGFScoreByte(bA, bB []byte, maxError int, endgapfree bool, buffer *[]uint64) (int, int, int)`
- Computes LCS score and alignment length between raw byte sequences.
- If `endgapfree` is true, ignores leading/trailing gaps (useful for read alignment).
- Returns `(score, length, end_position)`; `end_position` marks where the LCS ends in sequence A.
- Returns `-1, -1, -1` if the actual error count exceeds `maxError`.
2. `FastLCSEGFScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`
- Wrapper for `FastLCSEGFScoreByte` with end-gap-free mode enabled by default.
- Designed for standard biosequence inputs.
3. `FastLCSScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`
- Computes standard LCS (including end gaps). Returns `(score, alignment_length)`.
## Features
- **Error-bounded**: Supports `maxError = -1` (unlimited) or a fixed max number of mismatches + gaps.
- **Memory-efficient**: Reuses user-provided or auto-created buffers to avoid allocations during repeated calls.
- **IUPAC-aware**: Uses `obiseq.SameIUPACNuc()` to handle ambiguous nucleotide codes (e.g., `R`, `Y`).
- **Optimized for short reads**: Particularly suited to high-throughput sequencing data alignment tasks (e.g., in OBITools4).
## Use Cases
- Molecular barcode/UMI clustering
- Read-to-reference alignment in amplicon sequencing
- Similarity filtering of biological sequences
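The scoring scheme above (match +1, mismatch/gap 0) is the classic LCS recurrence; a plain, unbanded sketch for comparison — the package's version adds diagonal banding, bit-packed cells, and IUPAC awareness:

```go
package main

import "fmt"

// lcsScore computes the textbook LCS length between two byte sequences
// using two rolling rows of the DP matrix (O(len(b)) memory).
func lcsScore(a, b []byte) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			if a[i-1] == b[j-1] {
				cur[j] = prev[j-1] + 1 // match: extend the diagonal
			} else {
				cur[j] = max(prev[j], cur[j-1]) // gap in a or in b
			}
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

func main() {
	fmt.Println(lcsScore([]byte("ACGTACGT"), []byte("ACTTAGT"))) // 6
}
```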
@@ -0,0 +1,15 @@
# Semantic Description of `obialign` Package
The `obialign` package provides low-level utilities for efficient nucleotide sequence encoding and decoding, specifically designed for bioinformatics alignment tasks.
- **Core functionality**: Encodes IUPAC nucleotide symbols (including ambiguous codes like `R`, `Y`, `N`) into compact 4-bit binary representations.
- **Binary encoding scheme**: Each bit in a byte corresponds to one canonical nucleotide: A (bit 0), C (bit 1), G (bit 2), T (bit 3).
- **Ambiguity support**: Codes like `R` (A/G) set both corresponding bits (`0b0101`). Fully ambiguous `N` sets all four bits (`0b1111`).
- **Gap/missing handling**: Symbols `.` and `-`, as well as non-nucleotide characters, map to `0b0000`.
- **Memory efficiency**: The encoding avoids allocations via optional buffer reuse.
- **Lookup tables**:
- `_FourBitsBaseCode`: Maps ASCII nucleotide characters (case-folded via `nuc & 31`, which sends `'A'` and `'a'` to the same index) to their binary code.
- `_FourBitsBaseDecode`: Inverse mapping for human-readable output (not exported, used internally).
- **Integration**: Works with `obiseq.BioSequence`, a generic biological sequence container from the OBITools4 ecosystem.
The `Encode4bits` function enables fast, space-efficient sequence processing—ideal for high-throughput sequencing data where alignment speed and memory usage are critical.
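The encoding reduces to a small lookup plus a bitwise-AND compatibility test. The sketch below shows only a subset of the IUPAC codes; the values follow the bit assignments stated above, while the function names are illustrative:

```go
package main

import "fmt"

// code4 maps a nucleotide symbol to its 4-bit code: bit 0 = A, bit 1 = C,
// bit 2 = G, bit 3 = T. Ambiguity codes set several bits; gaps set none.
func code4(nuc byte) byte {
	switch nuc & 31 { // folds upper and lower case to the same index
	case 'A' & 31:
		return 0b0001
	case 'C' & 31:
		return 0b0010
	case 'G' & 31:
		return 0b0100
	case 'T' & 31:
		return 0b1000
	case 'R' & 31: // A or G
		return 0b0101
	case 'Y' & 31: // C or T
		return 0b1010
	case 'N' & 31: // any base
		return 0b1111
	default: // '.', '-', non-nucleotide
		return 0b0000
	}
}

// compatible reports whether two symbols can denote the same base:
// their 4-bit codes share at least one set bit.
func compatible(x, y byte) bool {
	return code4(x)&code4(y) != 0
}

func main() {
	fmt.Println(compatible('a', 'R')) // true: R covers A
	fmt.Println(compatible('C', 'R')) // false: R is A/G only
}
```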
+19
View File
@@ -0,0 +1,19 @@
## `obialign` Package: Semantic Overview
The `obialign` package provides a lightweight, high-performance utility for **detecting single-edit-distance relationships** between biological sequences (`obiseq.BioSequence`). Its core function, `D1Or0`, determines whether two sequences are either **identical** or differ by exactly **one substitution, insertion, or deletion (indel)**.
- `abs[k]`: A generic helper computing absolute values for integers or floats (via Go generics).
- `D1Or0(...)`: Returns a 4-tuple:
- **`int` (first)**: `0` if identical, `1` if differing by one edit, `-1` otherwise.
- **`int` (second)**: Position of the differing site (`-1` if identical).
- **`byte`, `byte`**: Mismatched characters (or `'-'` for gaps indicating indels).
**Algorithmic strategy:**
1. Early rejection if length difference exceeds 1.
2. Forward scan until first mismatch → identifies left boundary of divergence.
3. Backward scan from ends to find rightmost match boundary.
4. Validates whether the mismatch region allows exactly one edit:
- Single substitution: equal lengths, single divergent position.
- Insertion/deletion: length differs by 1 and only one non-overlapping character remains.
Designed for speed in **OTU/ASV dereplication or error correction** pipelines (e.g., metabarcoding), where rapid filtering of near-identical sequences is critical. Does *not* compute full alignments; optimized for binary decision-making under strict edit constraints.
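The two-scan strategy can be sketched as follows — a simplified stand-in for `D1Or0` that returns only the first element of its 4-tuple (the edit class), without the mismatch-position bookkeeping:

```go
package main

import "fmt"

// editClass returns 0 for identical strings, 1 when a and b differ by a
// single substitution or indel, and -1 otherwise.
func editClass(a, b string) int {
	if len(a) < len(b) {
		a, b = b, a // ensure a is the longer (or equal) string
	}
	if len(a)-len(b) > 1 {
		return -1 // early rejection: lengths too different
	}
	// forward scan to the first divergence
	i := 0
	for i < len(b) && a[i] == b[i] {
		i++
	}
	// backward scan to the last divergence
	ja, jb := len(a), len(b)
	for jb > i && a[ja-1] == b[jb-1] {
		ja--
		jb--
	}
	switch {
	case i == jb && ja == i: // fully matched, no divergent region
		return 0
	case len(a) == len(b) && ja == i+1 && jb == i+1: // one substitution
		return 1
	case len(a) == len(b)+1 && ja == i+1 && jb == i: // one indel
		return 1
	}
	return -1
}

func main() {
	fmt.Println(editClass("ACGT", "AGGT")) // 1: one substitution
	fmt.Println(editClass("ACGT", "AGT"))  // 1: one deletion
	fmt.Println(editClass("ACGT", "TTGT")) // -1: two edits
}
```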
@@ -0,0 +1,29 @@
# `LocatePattern` Functionality Overview
The `obialign.LocatePattern` function implements a **local alignment algorithm** to find the best approximate match of a short DNA pattern (e.g., primer) within a longer biological sequence, using **dynamic programming**.
- **Input**:
- `id`: identifier for logging/error reporting.
- `pattern []byte`: the query sequence (e.g., primer).
- `sequence []byte`: the target read/contig.
- **Constraints**:
- Pattern must be strictly shorter than the sequence (`len(pattern) < len(sequence)`).
- **Scoring Scheme**:
- Match: `+0` (using IUPAC compatibility via `obiseq.SameIUPACNuc`).
- Mismatch/Gap: `-1`.
- **Algorithm Features**:
- End-gap free alignment (no penalty for gaps at sequence ends), enabling flexible primer positioning.
- Uses a flattened buffer (`buffIndex`) for memory-efficient matrix storage (width × height).
- Tracks alignment path via `path` array: diagonal (`0`, match/mismatch), up (`+1`, deletion in pattern/left gap), left (`-1`, insertion/deletion).
- Backtracks from the bottom-right to find optimal local alignment start/end coordinates.
- **Output**:
- `start`: starting index in `sequence`.
- `end+1`: ending index (exclusive) of best match.
- Error count: `-score`, i.e., number of mismatches/gaps in alignment.
- **Use Case**:
Designed for high-throughput amplicon processing (e.g., primer trimming in metabarcoding pipelines like OBITools4).
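The end-gap-free scheme corresponds to the classic pattern-location DP (free start anywhere in the sequence, best end chosen over the last row). A sketch returning only the end coordinate and error count — the real function also backtracks through the path matrix to recover the start:

```go
package main

import "fmt"

// locate returns the end index (exclusive) in sequence of the best
// approximate occurrence of pattern, plus its error count, scoring
// match 0 and mismatch/indel 1 (exact byte equality stands in for the
// IUPAC-aware comparison used by the package).
func locate(pattern, sequence []byte) (end, errors int) {
	m, n := len(pattern), len(sequence)
	prev := make([]int, n+1) // row 0 is all zeros: free start in sequence
	cur := make([]int, n+1)
	for i := 1; i <= m; i++ {
		cur[0] = i // aligning pattern prefix against nothing costs i
		for j := 1; j <= n; j++ {
			cost := 1
			if pattern[i-1] == sequence[j-1] {
				cost = 0
			}
			cur[j] = min(prev[j-1]+cost, min(prev[j]+1, cur[j-1]+1))
		}
		prev, cur = cur, prev
	}
	end, errors = 0, m
	for j := 0; j <= n; j++ { // best end = minimum over the last row
		if prev[j] < errors {
			errors, end = prev[j], j
		}
	}
	return
}

func main() {
	end, errs := locate([]byte("ACGT"), []byte("TTACGTTT"))
	fmt.Println(end, errs) // 6 0
}
```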
@@ -0,0 +1,37 @@
# Semantic Description of `obialign` Package
The `obialign` package provides high-performance, memory-efficient tools for **pairwise alignment of paired-end biological sequences**, optimized specifically for Next-Generation Sequencing (NGS) data.
## Core Functionalities
### 1. **Memory Arena Management**
- `PEAlignArena` is a reusable memory buffer to avoid repeated allocations during multiple alignments.
- Preallocates matrices (`scoreMatrix`, `pathMatrix`), alignment buffers, and auxiliary structures based on expected max sequence lengths.
### 2. **Dynamic Programming Alignment Functions**
Implements three specialized global alignment variants using Needleman-Wunsch with affine gap penalties (scaled per mismatch):
- **`PELeftAlign`**: Free gaps at the *start* of `seqB` and end of `seqA`. Ideal for aligning overlapping reads where the first read starts before or within the second.
- **`PERightAlign`**: Free gaps at start of `seqA` and end of `seqB`. Suited when the second read extends beyond the first.
- **`PECenterAlign`**: Free gaps at both ends of *both* sequences; requires `seqA` to be at least as long as `seqB`. Designed for full overlap scenarios (e.g., merging paired-end reads).
All use column-major matrix storage and efficient index arithmetic via helper functions `_GetMatrix`, `_SetMatrices`, etc.
### 3. **Scoring & Quality Integration**
- Pairwise base/quality scores computed by `_PairingScorePeAlign`, combining:
- Nucleotide compatibility (via precomputed `_NucPartMatch`)
- Phred quality scores (`_NucScorePartMatchMatch`, `_NucScorePartMatchMismatch`)
- A user-defined `scale` factor to modulate mismatch penalties.
### 4. **Fast Heuristic Pre-Alignment**
The main `PEAlign` function integrates a kmer-based fast pre-screening:
- Uses 4-mer indexing (`obikmer.Index4mer`) and shift estimation via `FastShiftFourMer`.
- If overlap is significant (`fastCount + 3 < over`), performs localized DP only on the predicted overlapping region (using `PELeftAlign` or `PERightAlign`) to save time.
- Otherwise, computes full alignment over entire sequences (both left and right variants), selecting the best score.
### 5. **Backtracking & Path Output**
- `_Backtracking` reconstructs the optimal alignment path from `pathMatrix`.
- Paths encoded as alternating `(offset, length)` pairs for aligned segments (diagonal = 0), with gaps encoded as `-1`/`+1`.
### Use Case
Designed for **paired-end read merging**, overlap detection, and consensus building in metagenomic pipelines (e.g., OBITOOLS4 ecosystem). Efficient, scalable for large batch processing via arena reuse.
+58
View File
@@ -0,0 +1,58 @@
# Semantic Description of `obialign.ReadAlign`
The `ReadAlign` function performs **paired-end read alignment** with quality-aware scoring, optimized for overlapping consensus construction in NGS data processing.
## Core Functionality
- **Input**: Two biological sequences (`seqA`, `seqB`) as `BioSequence` objects, plus alignment parameters:
- `gap`: gap penalty (linear)
- `scale`: scaling factor for quality scores
- `delta`: extension buffer around initial overlap estimate
- `fastScoreRel`: use relative vs absolute k-mer matching score
## Algorithm Overview
1. **Preprocessing & Initialization**
- Ensures DNA scoring matrix is initialized (`_InitDNAScoreMatrix`).
2. **Fast Overlap Estimation via 4-mer Indexing**
- Builds a k-mer index of `seqA` using `obikmer.Index4mer`.
- Computes optimal shift via `_FastShiftFourMer` in both forward and reverse-complement orientations.
- Selects orientation (direct or reversed) yielding highest k-mer match count (`fastCount`) and score (`fastScore`).
3. **Overlap Computation**
- Determines overlap length `over` based on shift:
```text
over = |seqA| - shift       if shift > 0
       |seqB| + shift       if shift < 0
       min(|seqA|, |seqB|)  otherwise
```
4. **Dynamic Programming Alignment**
- If overlap is *not* identical (`fastCount + 3 < over`):
- Extracts subregions with `delta`-buffered boundaries.
- Calls either `_FillMatrixPeLeftAlign` (left-aligned case) or `_FillMatrixPERightAlign`.
- Backtracks via `_Backtracking` to produce alignment path.
- Else (near-perfect overlap):
- Skips DP; computes score directly from quality scores using `_NucScorePartMatchMatch`.
- Returns trivial path `[extra5, partLen]`.
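The overlap rule in step 3 can be restated as a small helper (hypothetical name, directly transcribing the quoted formula):

```go
package main

import "fmt"

// overlap derives the expected overlap length between two reads from
// the k-mer shift estimate: a positive shift truncates seqA's tail,
// a negative one truncates seqB's head, and zero means full overlap
// of the shorter read.
func overlap(lenA, lenB, shift int) int {
	switch {
	case shift > 0:
		return lenA - shift
	case shift < 0:
		return lenB + shift
	default:
		return min(lenA, lenB)
	}
}

func main() {
	fmt.Println(overlap(150, 150, 30))  // 120
	fmt.Println(overlap(150, 150, -30)) // 120
	fmt.Println(overlap(150, 140, 0))   // 140
}
```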
## Output
Returns:
| Index | Type | Meaning |
|-------|----------|---------|
| 0️⃣ | `int` | Final alignment score (weighted by quality) |
| 1️⃣ | `[]int` | Alignment path (list of positions: `[startA, endA, startB, endB]` or similar) |
| 2️⃣ | `int` | K-mer match count (`fastCount`) |
| 3️⃣ | `int` | Overlap length (`over`) |
| 4️⃣ | `float64` | K-mer-based score (`fastScore`) |
| 5️⃣ | `bool` | Whether alignment was performed in direct orientation (`true`) or on reverse-complement of `seqB` |
## Key Design Highlights
- **Efficient pre-filtering** using 4-mers avoids full DP for nearly identical reads.
- **Quality-aware scoring**, leveraging Phred scores via `_NucScorePartMatchMatch`.
- Supports **asymmetric overlaps** (left/right alignment) with boundary padding (`delta`).
- Uses preallocated memory arenas to minimize GC pressure in high-throughput pipelines.
+25
View File
@@ -0,0 +1,25 @@
# Apat Package: Pattern Matching for Biological Sequences
The `obiapat` Go package provides high-performance pattern matching over biological sequences using the **Apat algorithm**, a C-based implementation wrapped in Go. It supports fuzzy matching (with mismatches and indels), reverse-complement patterns, memory-safe resource management via finalizers, and efficient filtering of non-overlapping matches.
## Core Types
- `ApatPattern`: Represents a compiled pattern (up to 64 bp), supporting IUPAC ambiguity codes (`W`, `[AT]`), negated bases (`!A`), and fixed positions (`#`).
- `ApatSequence`: Wraps a biological sequence (from `obiseq.BioSequence`) for fast matching, with optional circular topology support and memory recycling.
## Key Functions & Methods
- `MakeApatPattern(pattern string, errormax int, allowsIndel bool)`: Compiles a pattern with max error tolerance and optional indels.
- `ReverseComplement()`: Returns the reverse-complemented pattern (useful for DNA strand symmetry).
- `FindAllIndex(...)`: Returns all matches as `[start, end, errors]`, supporting partial sequence searches.
- `IsMatching(...)`: Boolean check for presence of at least one match in a region.
- `BestMatch(...)`: Finds the *best* (lowest-error) match, with local realignment for indel-containing patterns.
- `FilterBestMatch(...)`: Returns *non-overlapping* matches, prioritizing lower-error occurrences.
- `AllMatches(...)`: Filters and refines all valid matches (including indel-aware alignment).
- `Free()`, `Len()`: Explicit memory cleanup and length queries.
## Implementation Notes
Internally, the package uses `cgo` to interface with C structures (`Pattern`, `Seq`) allocated via custom memory management. Finalizers ensure safe deallocation, while unsafe pointer arithmetic avoids data copying during search (e.g., `unsafe.SliceData`). Logging is integrated via Logrus.
This package enables scalable, low-level pattern mining in NGS data preprocessing pipelines (e.g., primer detection, adapter trimming).
+32
View File
@@ -0,0 +1,32 @@
# Semantic Description of `obiapat` Package Functionality
The `obiapat` package provides utilities for constructing and representing **approximate sequence patterns**—flexible biological or symbolic string templates supporting mismatches, insertions, and deletions.
## Core Functionality
- **`MakeApatPattern(pattern string, errormax int, allowsIndel bool)`**
Parses a pattern specification (e.g., `"A[T]C!GT"`) and returns an internal representation (`*ApatPattern`) suitable for approximate matching.
- `pattern`: A string where:
- Standard characters (e.g., `'A'`, `'C'`) denote exact matches.
- Brackets `[X]` indicate *optional* or *variable positions*, e.g., ambiguity (like IUPAC codes).
- Exclamation `!` marks positions where **errors** (substitutions) are permitted.
- `errormax`: Maximum number of allowed errors (mismatches or indels, depending on flags).
- `allowsIndel`: Boolean flag enabling/disabling insertion/deletion operations.
## Behavior & Semantics
- Returns a compiled pattern object (non-nil) on success; errors may arise from malformed input or invalid parameters.
- Supports three modes:
- **Exact matching** (`errormax = 0`, `allowsIndel = false`).
- **Substitution-only approximation** (`errormax > 0`, `allowsIndel = false`).
- **Full approximate matching with indels** (`errormax > 0`, `allowsIndel = true`).
## Testing Coverage
The provided test suite validates:
- Valid pattern parsing across different configurations.
- Correct handling of `nil` vs. non-nil output pointers.
- Robustness against error conditions (e.g., invalid inputs would trigger expected errors).
In summary, `obiapat` enables efficient definition and handling of *approximate regular expressions* tailored for sequence analysis in bioinformatics or pattern recognition contexts.
+27
View File
@@ -0,0 +1,27 @@
# PCR Simulation Module (`obiapat`)
This Go package implements a **PCR (Polymerase Chain Reaction) simulation algorithm** for biological sequence analysis. It supports flexible primer matching, amplicon extraction with optional flanking extensions, and handles both linear and circular DNA topologies.
## Key Functionalities
- **Primer Matching**: Accepts forward/reverse primers with configurable mismatch tolerance (`OptionForwardPrimer`, `OptionReversePrimer`). Internally builds pattern objects and their reverse complements.
- **Amplicon Extraction**: Identifies valid amplicons bounded by primer pairs, respecting user-defined length constraints (`OptionMinLength`, `OptionMaxLength`).
- **Extension Support**: Optionally adds fixed-length flanking regions (`OptionWithExtension`) — either strict full-extension only or partial trimming allowed.
- **Topology Handling**: Supports linear (`Circular: false`) and circular DNA sequences via `OptionCircular`.
- **Batch & Parallel Processing**: Configurable batch size (`OptionBatchSize`) and parallel workers count (`OptionParallelWorkers`), enabling efficient processing of large datasets.
- **Annotation-Rich Output**: Each amplicon includes detailed annotations (primer sequences, match positions, errors, direction), preserving original sequence metadata.
## Core API
- `PCRSim(sequence, options...)`: Simulates PCR on a single sequence.
- `PCRSlice(sequencesSlice, options...)`: Applies simulation across multiple sequences in a slice.
- `PCRSliceWorker(options...)`: Returns a reusable worker function for parallel execution via `obiseq.MakeISliceWorker`.
## Implementation Details
- Uses pattern-matching (`ApatPattern`) with fuzzy search to locate primers.
- Handles circular topology by wrapping indices around sequence boundaries.
- Reuses internal memory via `MakeApatSequence`/`Free`, supporting efficient GC and large-scale processing.
- Logs critical errors with `logrus`; debug-level details for amplicon generation.
Designed to integrate within the OBITools4 ecosystem, this module enables high-fidelity *in silico* PCR for metabarcoding and NGS data validation workflows.
+23
View File
@@ -0,0 +1,23 @@
## Semantic Description of `IsPatternMatchSequence`
The function `IsPatternMatchSequence` defines a **sequence predicate** for pattern-based matching in biological sequences (e.g., DNA/RNA), supporting fuzzy and strand-aware search.
### Core Functionality:
- **Input Parameters**
- `pattern`: A regular expression-like string describing the target pattern.
- `errormax`: Maximum allowed mismatches (substitutions only by default).
- `bothStrand`: If true, also search on the reverse-complement strand.
- `allowIndels`: Enables insertion/deletion errors (beyond mismatches) when set to true.
- **Internal Workflow**
- Parses the pattern into an automaton (`apat`) via `MakeApatPattern`.
- Computes its reverse complement for dual-strand matching.
- Returns a closure (`SequencePredicate`) that tests whether a given `BioSequence` matches the pattern (or its RC), within error tolerance.
- **Matching Logic**
- Converts input sequence to `apat` format.
- Checks match on forward strand first; if failed and `bothStrand=true`, tries reverse complement.
- Uses automaton-based matching (`IsMatching`) for efficient fuzzy search.
### Semantic Use Case:
Enables flexible, error-tolerant detection of sequence motifs (e.g., primers, barcodes) in high-throughput sequencing data—supporting both *in silico* primer design validation and read filtering in metagenomic pipelines.
+15
View File
@@ -0,0 +1,15 @@
# `ISequenceChunk` Function — Semantic Description
The `ISequenceChunk` function provides a unified interface for processing biological sequence data in chunks, supporting two execution modes: **in-memory** and **on-disk**, depending on resource constraints or performance needs.
- It accepts an iterator over biological sequences (`obiiter.IBioSequence`) and a sequence classifier (`obiseq.BioSequenceClassifier`), used to annotate or categorize sequences.
- A boolean flag `onMemory` determines whether processing occurs in RAM (`ISequenceChunkOnMemory`) or on disk (`ISequenceChunkOnDisk`), enabling scalability for large datasets.
- Optional parameters allow fine-tuning:
- `dereplicate`: enables deduplication of identical sequences.
- `na`: specifies how missing or ambiguous values are handled (e.g., `"?"`, `"N"`, etc.).
- `statsOn`: configures what metadata (e.g., description fields) are tracked for statistics.
- `uniqueClassifier`: an optional secondary classifier used to assign unique identifiers or labels.
The function abstracts the underlying implementation, ensuring consistent behavior regardless of storage strategy. It returns an iterator over processed sequences (`obiiter.IBioSequence`) or an error, supporting streaming workflows and compatibility with downstream pipeline stages.
This design promotes flexibility, memory efficiency, and modularity in high-throughput sequence analysis pipelines (e.g., metabarcoding).
@@ -0,0 +1,18 @@
# `obichunk` Package: On-Disk Chunking and Dereplication of Biosequences
The `obichunk` package provides functionality to efficiently process large sets of biological sequences by splitting them into manageable, disk-based chunks. Its core feature is the `ISequenceChunkOnDisk` function, which takes a sequence iterator and distributes sequences into temporary files using a classifier. Each file corresponds to one *batch* (e.g., `chunk_*.fastx`), enabling scalable, parallel-friendly workflows.
Key capabilities include:
- **Temporary Directory Management**: Automatically creates and cleans up a system temp directory (`obiseq_chunks_*`) for intermediate storage.
- **File Discovery**: Recursively finds all `.fastx` files generated during chunking via `find`.
- **Asynchronous Streaming**: Returns an iterator (`obiiter.IBioSequence`) that yields batches asynchronously, decoupling chunk creation from consumption.
- **Optional Dereplication**: When enabled (`dereplicate = true`), sequences are deduplicated *per batch* using a composite key (sequence + classification categories). Merged duplicates retain aggregated statistics.
- **Logging & Monitoring**: Logs total batch count and per-batch processing start events for transparency.
Internally, `ISequenceChunkOnDisk` uses:
- `obiiter.MakeIBioSequence()` to build the output iterator,
- `obiformats.WriterDispatcher` for parallel writing of distributed sequences into chunk files,
- and a second goroutine to read, optionally dereplicate (via `BioSequenceClassifier`), and push batches back into the output iterator.
Designed for memory efficiency, it avoids loading all sequences in RAM by streaming and chunking on-disk—ideal for large-scale NGS data preprocessing.
@@ -0,0 +1,21 @@
# `ISequenceChunkOnMemory` Function — Semantic Description
The function `ISequenceChunkOnMemory`, from the Go package `obichunk`, implements **asynchronous in-memory chunking** of biological sequence data.
It consumes an iterator over `BioSequence` objects and distributes them into **heterogeneous batches** using a provided classifier. The core purpose is to group sequences by classification (e.g., sample, taxon, or feature), store each group in memory as a slice (`BioSequenceSlice`), and emit them sequentially via an output iterator.
Key features:
- **Parallel processing**: Each classification group (referred to as a *flux*) is processed in its own goroutine.
- **Thread-safe aggregation**: A mutex ensures safe concurrent updates to shared `chunks` and `sources` maps.
- **Lazy emission**: Batches are emitted only after all classification groups have been fully processed (`jobDone.Wait()`).
- **Ordered output**: Batches are emitted in increasing `order` index (0, 1, …), preserving determinism despite parallel internal processing.
- **Error handling**: Critical failures (e.g., channel retrieval errors) terminate the program with `log.Fatalf`.
Input:
- An iterator (`obiiter.IBioSequence`) of raw sequences.
- A `*obiseq.BioSequenceClassifier`, used to route each sequence into a classification bucket.
Output:
- A new iterator yielding `BioSequenceBatch` objects, each containing all sequences belonging to one classification group and its source identifier.
Use case: Efficient parallel preprocessing of high-throughput sequencing data into sample- or taxon-specific batches for downstream analysis.
# Semantic Description of `obichunk` Package
The `obichunk` package provides a flexible and configurable options management system for data processing pipelines, particularly in the context of biological sequence analysis (e.g., metabarcoding). It defines a typed `Options` struct and associated builder-style configuration functions.
## Core Concepts
- **Immutable Configuration Builder**: Options are constructed via `MakeOptions([]WithOption)`, applying a list of functional setters (`WithOption`) to an internal `__options__` struct.
- **Encapsulation**: The concrete options are hidden behind a pointer (`pointer *__options__`) to ensure safe sharing and mutation control.
## Supported Functionalities
- **Categorization**: `OptionSubCategory(keys...)` appends category labels (e.g., sample or marker names) to an internal list; `PopCategories()` retrieves and removes the first category.
- **Missing Value Handling**: `OptionNAValue(na string)` customizes the placeholder for missing data (default: `"NA"`).
- **Statistical Tracking**: `OptionStatOn(keys...)` registers statistical descriptions (via `obiseq.StatsOnDescription`) for per-field metrics collection.
- **Batch Processing Control**:
- `OptionBatchCount(number)` sets the number of batches.
- `OptionsBatchSize(size)` defines how many items per batch (default from `obidefault`).
- **Parallelization**: `OptionsParallelWorkers(nworkers)` configures concurrency level (default from environment).
- **Disk vs Memory Sorting**: `OptionSortOnDisk()` enables disk-backed sorting; `OptionSortOnMemory()` disables it (default).
- **Singleton Filtering**: `OptionsNoSingleton()` excludes singleton sequences; `OptionsWithSingleton()` allows them (default).
## Design Highlights
- Functional options pattern for extensibility and readability.
- Default values derived from `obidefault` where applicable (e.g., batch size, workers).
- Designed for integration with `obiseq` and `obidefault`, supporting scalable, reproducible NGS data workflows.
# Semantic Description of `obichunk.ISequenceSubChunk`
The function `ISequenceSubChunk` in the `obichunk` package implements **parallel, class-based sorting and batching of biological sequences**, preserving input order within each batch while reordering across batches by classification code.
## Core Functionality
- **Input**:
- An iterator over `BioSequence` batches (`obiiter.IBioSequence`)
- A sequence classifier (`obiseq.BioSequenceClassifier`) assigning each sequence a numeric class code
- A number of worker goroutines (`nworkers`), defaulting to system-configured parallelism
- **Processing**:
- Each worker consumes its own iterator split and classifier clone, enabling concurrent batch processing.
- For each incoming `BioSequenceBatch`:
- If the batch has >1 sequence: sequences are extracted, classified into `code`, and sorted *in-place* by class code.
- Consecutive sequences with the same `code` are grouped into new batches; a new batch is emitted upon code change.
- If the batch has ≤1 sequence, it is passed through unchanged (but reassigned a new order ID).
- **Ordering Mechanism**:
- Uses `atomic.AddInt32` to assign strictly increasing order IDs (`nextOrder`) across workers, preserving deterministic inter-batch ordering.
- Sorting within batches is performed via a custom `sort.Interface` implementation using closures for flexible comparison logic (here, by ascending class code).
- **Output**:
- Returns a new iterator (`obiiter.IBioSequence`) emitting batches grouped by classification code, with globally ordered batch IDs.
- Workers are coordinated via `newIter.Done()`, `Wait()`, and `Close()`, ensuring clean termination.
## Semantic Purpose
Enables efficient, parallel **grouping of sequences by taxonomic or functional class** (e.g., OTU assignment), optimizing downstream processing that requires class-ordered input, such as consensus building, alignment, or read merging per group.
# Semantic Description of `IUniqueSequence` Functionality
The `IUniqueSequence` function performs **dereplication** of biological sequence data — i.e., grouping identical or near-identical sequences while preserving metadata and counts. It operates on an `obiiter.IBioSequenceBatch` iterator.
## Core Workflow
1. **Input Processing**
Accepts an input sequence iterator and optional configuration via `WithOption`.
2. **Parallelization Strategy**
Supports configurable parallel workers (`nworkers`). When `SortOnDisk()` is enabled, it falls back to single-threaded processing for disk-based sorting.
3. **Data Splitting Phase**
- Uses `HashClassifier` to partition input into buckets (controlled by `BatchCount`).
- Ensures deterministic chunking for reproducibility.
4. **Storage Choice**
- *In-memory*: via `ISequenceChunkOnMemory`.
- *Disk-based*: via `ISequenceSubChunk` + external sorting (requires single worker).
5. **Uniqueness Classification**
- Builds a composite classifier combining:
- Sequence identity (`SequenceClassifier`)
- Optional annotation categories (e.g., sample, primer), with NA handling.
- If no annotations are specified, only raw sequence identity is used.
6. **Singleton Filtering**
Optionally excludes singleton reads (count = 1) via `NoSingleton()` option.
7. **Parallel Dereplication**
- Spawns worker goroutines to process chunks.
- Each worker applies `ISequenceSubChunk` + deduplication logic per classifier group.
8. **Output Merging**
- Aggregates results using `IMergeSequenceBatch`, preserving:
- Sequence counts
- Statistics (if enabled)
- NA handling and ordering
## Key Features
- **Scalable**: Supports both memory-efficient (disk) and high-speed (RAM) modes.
- **Configurable**: Via functional options (`Options`).
- **Thread-safe**: Uses `sync.Mutex` for deterministic ordering.
- **Metadata-aware**: Incorporates annotation-based grouping (e.g., sample, primer).
# Aho-Corasick-Based Sequence Analysis in `obicorazick`
This Go package provides efficient pattern-matching utilities for biological sequence data, leveraging the Aho-Corasick algorithm.
## Core Components
- **`AhoCorazickWorker(slot string, patterns []string) obiseq.SeqWorker`**
Builds *multiple* Aho-Corasick matchers in parallel (batched to manage memory), then returns a `SeqWorker` function.
- Scans each sequence *forward* and its reverse complement.
- Counts total matches (`slot`), forward-only (`_Fwd`) and reverse-complement-specific (`_Rev`) matches.
- Attaches match counts as sequence attributes.
- **`AhoCorazickPredicate(minMatches int, patterns []string) obiseq.SequencePredicate`**
Compiles a *single* matcher and returns a predicate function.
- Returns `true` if the number of matches ≥ `minMatches`.
- Useful for filtering sequences (e.g., taxonomic assignment or contamination detection).
## Technical Highlights
- **Batched compilation**: Large pattern sets are split into chunks (default `10⁷` patterns/batch) to avoid memory overload.
- **Parallelization**: Matcher construction uses goroutines, scaled by `obidefault.ParallelWorkers()`.
- **Progress tracking**: Optional CLI progress bar via `progressbar/v3`, enabled globally.
- **Logging & debugging**: Uses Logrus for info/debug messages; logs match counts per sequence.
## Use Cases
- Rapid screening of sequences against large reference databases (e.g., primers, barcodes, contaminants).
- Filtering or annotating sequences based on pattern presence/abundance.
# ObiDefault Package: Batch Configuration Module
This Go module provides centralized configuration for sequence batching in Obitools, supporting both **count-based** and **memory-aware** batch processing.
## Core Features
- `_BatchSize` / `SetBatchSize()`
Defines and configures the *minimum* number of sequences per batch (default: `1`).
Used internally as `minSeqs` in `RebatchBySize`.
- `_BatchSizeMax` / `SetBatchSizeMax()`
Sets the *maximum* sequences per batch (default: `2000`). Batches are flushed upon reaching this limit, regardless of memory.
- **CLI & Environment Integration**
Batch size is determined by `--batch-size` CLI flag and/or the `OBIBATCHSIZE` environment variable (via parsing logic not shown here but implied by comments).
- `_BatchMem` / `SetBatchMem(n int)`
Configures the *maximum memory per batch* (default: `128 MB`). A value of `0` disables memory-based batching, falling back to pure count-based logic.
- `_BatchMemStr`
Stores the *raw CLI string* passed to `--batch-mem` (e.g., `"256M"`, `"1G"`), enabling human-readable input parsing elsewhere.
## Utility Functions
- `BatchSizePtr()`, `BatchMemPtr()`
Expose pointers to internal variables for direct modification or inter-process sharing.
- `BatchSizeMaxPtr()`, `BatchMemStrPtr()`
Provide read/write access to max-size and raw memory string values.
## Design Intent
- Separates **configuration** (defaults, CLI/env parsing) from **processing logic**, enabling modular and testable batch handling.
- Supports both scalable, large-scale processing (via count limits) and memory-constrained environments (via soft RAM caps).
# Output Compression Control Module
This Go package (`obidefault`) provides a simple, global configuration mechanism for toggling output compression behavior across an application.
## Core Features
- **Global Compression Flag**: A package-level boolean variable `__compress__` (default: `false`) controls whether output should be compressed.
- **Read Access**:
- `CompressOutput()` returns the current compression setting as a boolean.
- **Write Access**:
- `SetCompressOutput(b bool)` updates the compression flag to a new value.
- **Pointer Access**:
- `CompressOutputPtr()` returns a pointer to the internal flag, enabling indirect modification (e.g., for UI bindings or reflection-based updates).
## Design Intent
- Minimal, side-effect-free API.
- Thread-safety *not* guaranteed — intended for use in single-threaded initialization or controlled environments.
- Encapsulation via unexported variable `__compress__`, enforced through accessor functions.
## Typical Usage
```go
// Enable compression globally:
obidefault.SetCompressOutput(true)
if obidefault.CompressOutput() {
// Apply compression logic (e.g., gzip, brotli)
}
```
## Notes
- The double underscore prefix (`__compress__`) signals internal/private status (convention, not enforced).
- Designed for runtime configurability without recompilation.
# `obidefault` Package — Semantic Overview
This minimal Go package provides a centralized, mutable global flag for controlling warning verbosity across an application.
## Core Functionality
- **`__silent_warning__`**:
A package-level boolean variable (unexported) that determines whether warnings should be suppressed.
- **`SilentWarning() bool`**:
A read-only accessor returning the current state of `__silent_warning__`. Enables safe, non-mutating checks elsewhere in the codebase.
- **`SilentWarningPtr() *bool`**:
Returns a pointer to `__silent_warning__`, allowing external code (e.g., CLI parsers, config loaders) to directly mutate the flag — e.g., `*SilentWarningPtr() = true`.
## Design Intent
- **Simplicity & Centralization**:
Avoids scattering warning-control logic; provides a single source of truth.
- **Flexibility**:
Supports both *read-only* inspection (via `SilentWarning()`) and *global mutation* (via pointer), useful for early initialization phases.
- **Explicit Semantics**:
When `SilentWarning()` returns `true`, all warning-generating code *should* suppress output (implementation responsibility lies outside this package).
## Usage Example
```go
// Suppress warnings globally:
*obidefault.SilentWarningPtr() = true
if !obidefault.SilentWarning() {
log.Println("⚠️ Warning: something happened")
}
```
> **Note**: The double underscore prefix on `__silent_warning__` signals internal/private status, discouraging direct access.
# Progress Bar Control Module (`obidefault`)
This Go package provides a simple, global mechanism to enable or disable progress bar display across an application.
## Core Functionality
- **`ProgressBar()`**: Returns `true` if progress bars are *enabled* (i.e., when `__no_progress_bar__` is `false`).
- **`NoProgressBar()`**: Returns the current state of `__no_progress_bar__`, i.e., whether progress bars are *disabled*.
- **`SetNoProgressBar(b bool)`**: Sets the global flag `__no_progress_bar__`. Passing `true` disables progress bars; passing `false` enables them.
- **`NoProgressBarPtr()`**: Returns a pointer to the internal `__no_progress_bar__` variable, allowing direct read/write access (e.g., for reflection or UI binding).
## Design Intent
- Centralizes progress bar visibility control in one place.
- Supports both boolean query/set and pointer-based manipulation for flexibility (e.g., CLI flags, config binding).
- Uses a *negative* flag name (`__no_progress_bar__`) internally to default progress bars **on** (i.e., `false` → enabled).
## Usage Example
```go
// Disable progress bars globally:
obidefault.SetNoProgressBar(true)
// Check status:
if !obidefault.ProgressBar() {
log.Println("Progress bars are disabled.")
}
```
## Notes
- Thread-safety is *not* guaranteed; concurrent access should be externally synchronized.
- The double underscore prefix (`__no_progress_bar__`) signals internal/private usage per Go convention (though not enforced).
# Quality Shift and Read/Write Control Module
This Go package (`obidefault`) provides configurable controls over quality score handling in sequence data processing (e.g., FASTQ files). It defines three global variables and corresponding accessor/mutator functions:
- `_Quality_Shift_Input`: Input quality score offset (default: `33`, i.e., Phred+33/Sanger format).
- `_Quality_Shift_Output`: Output quality score offset (default: `33`), allowing format conversion.
- `_Read_Qualities`: Boolean flag indicating whether quality scores should be parsed/processed (`true` by default).
## Public API
| Function | Purpose |
|---------|--------|
| `SetReadQualitiesShift(shift byte)` | Sets the quality score offset for *input* data (e.g., when reading FASTQ). |
| `ReadQualitiesShift() byte` | Returns the current input quality offset. |
| `SetWriteQualitiesShift(shift byte)` | Sets the quality score offset for *output* data (e.g., when writing FASTQ). |
| `WriteQualitiesShift() byte` | Returns the current output quality offset. |
| `SetReadQualities(read bool)` | Enables/disables reading/processing of quality scores. |
| `ReadQualities() bool` | Returns whether qualities are currently being read/used. |
## Semantic Use Cases
- **Format Interoperability**: Allows seamless conversion between Phred+33 (Sanger), Phred+64, or other quality encodings.
- **Performance Optimization**: Disabling `ReadQualities` skips parsing of quality strings, useful when only sequences are needed.
- **Centralized Configuration**: Global state enables consistent behavior across modules without passing parameters.
All functions are thread-unsafe by design—intended for initialization before concurrent processing begins.
# `obidefault` Package: Configuration State Management
This Go package provides a centralized, thread-safe(ish) configuration layer for taxonomy-related settings in the OBIDMS (Open Biological and Biomedical Data Management System) framework. It exposes simple getters, setters, and pointer accessors for five core boolean/string flags that control how taxonomic identifiers (taxids) are handled during data processing.
## Core Configuration Flags
- `__taxonomy__`: Stores the currently selected taxonomy (e.g., `"NCBI"`, `"UNIPROT"`).
- `__alternative_name__`: Enables/disables use of alternative taxonomic names (e.g., synonyms).
- `__fail_on_taxonomy__`: If true, processing halts on taxonomy mismatches/errors.
- `__update_taxid__`: If true, taxids are auto-updated to current NCBI/DB versions.
- `__raw_taxid__`: If true, raw (unprocessed) taxids are preserved instead of normalized.
## Public API
- **Getters**: `UseRawTaxids()`, `SelectedTaxonomy()`, `HasSelectedTaxonomy()`, etc., return current values.
- **Pointer Accessors**: e.g., `SelectedTaxonomyPtr()` returns a pointer for direct mutation (advanced use).
- **Setters**: `SetSelectedTaxonomy()`, `SetAlternativeNamesSelected()`, etc., update state.
## Use Case
Typically used at application startup to configure global behavior (e.g., `SetSelectedTaxonomy("NCBI")`, `SetUpdateTaxid(true)`), then referenced by downstream modules during data import, validation, or mapping. Minimalist and explicit—no external dependencies.
# Obidefault: Parallelism Configuration Module
This Go package (`obidefault`) provides a centralized, configurable interface for managing parallel execution parameters—particularly useful in I/O- and CPU-bound workloads.
## Core Concepts
- **CPU-aware defaults**: Automatically detects available cores via `runtime.NumCPU()`.
- **Configurable workers per core**:
- General: `_WorkerPerCore` (default `1.0`)
- Read-specific: `_ReadWorkerPerCore` (`0.25`, i.e., ~1 reader per 4 cores)
- Write-specific: `_WriteWorkerPerCore` (`0.25`)
- **Strict overrides**: Allow hardcoding worker counts via `SetStrictReadWorker()`/`Write...`, bypassing per-core scaling.
## Public API
| Function | Purpose |
|---------|--------|
| `ParallelWorkers()` | Total workers = `MaxCPU() × WorkerPerCore` |
| `Read/WriteParallelWorkers()` | Resolves to strict count if set, else per-core calculation (min 1) |
| `ParallelFilesRead()` | Files read in parallel: defaults to `ReadParallelWorkers()`, overridable |
| Getters (`MaxCPU`, `WorkerPerCore`, etc.) | Expose current settings safely |
| Setters (`Set*`) | Dynamically adjust behavior at runtime |
## Configuration Sources
- **Command-line flags**: e.g., `--max-cpu` or `-m`
- **Environment variable**: `OBIMAXCPU`
## Design Highlights
✅ Decouples resource discovery from policy
✅ Supports both *proportional* (per-core) and *absolute* (strict) worker definitions
✅ Ensures non-zero defaults for critical paths (`ReadParallelWorkers` ≥ 1)
⚠️ **Note**: `WriteParallelWorkers()` contains a likely bug: it returns `_StrictReadWorker` in the else branch instead of `_StrictWriteWorker`.
# `obidist` Package: Efficient Symmetric Distance/Similarity Matrix Management
The `*DistMatrix` type provides a memory-efficient, symmetric matrix implementation for distance or similarity data.
- **Storage Strategy**: Only the upper triangle (i < j) is stored, reducing storage from *n²* to *n(n−1)/2* entries.
- **Diagonal Handling**: Diagonal entries are fixed (0.0 for distances, 1.0 for similarities); assignments to diagonal indices are silently ignored.
- **Symmetry Guarantee**: `Get(i, j)` and `Set(i, j, v)` automatically handle both (i,j) and (j,i), ensuring consistency.
## Constructors
| Function | Description |
|---------|-------------|
| `NewDistMatrix(n)` / `WithLabels(labels)` | Creates *n×n* distance matrix (diag = 0). |
| `NewSimilarityMatrix(n)` / `WithLabels(labels)` | Creates *n×n* similarity matrix (diag = 1). |
## Core Operations
- `Get(i, j)` / `Set(i, j, v)`: Access/update symmetric entries.
- `Size() int`, `GetLabel(i)` / `SetLabel(i, label)`: Query/mutate element labels.
- `Labels() []string`, `GetRow(i)` / `GetColumn(j)`: Retrieve full rows/columns (as copies).
## Analysis Helpers
- `MinDistance()` / `MaxDistance()`: return the `(value, i, j)` triple of the extremal off-diagonal entry.
- `Copy() *DistMatrix`: Deep copy for immutability-safe operations.
- `ToFullMatrix() [][]float64`: Converts to a dense representation (use sparingly).
Designed for clustering, phylogenetics, or any domain requiring fast symmetric matrix access with minimal footprint.
# `obidist` Package: Semantic Feature Overview
The `obidist` Go package provides two core data structures for managing **distance** and **similarity matrices**, with built-in guarantees suitable for scientific computing (e.g., clustering, phylogenetics). Key features include:
- **`DistMatrix`**: A symmetric `n×n` matrix representing pairwise distances, where:
- Diagonal entries are *always* `0.0` (self-distance).
- Off-diagonals obey symmetry: `dist(i, j) == dist(j, i)`.
- Automatic enforcement via dedicated `Set()`/`Get()` methods.
- **`SimilarityMatrix`**: A symmetric matrix where:
- Diagonal entries are *always* `1.0`.
- Off-diagonals represent similarity scores (e.g., between `0` and `1`, though not enforced).
- Symmetry is similarly guaranteed.
Both matrix types support:
- **Optional labels**: Associate human-readable identifiers (e.g., sample names) with rows/columns.
- **Safe bounds checking**: Panics on out-of-range access (tested via `defer/recover`).
- **Deep copy support**: Ensures isolation between original and copied instances.
- **Utility methods**:
- `MinDistance()` / `MaxDistance()`: Return extremal values and their indices.
- `GetRow(i)`: Retrieve a full row as a slice (symmetric copy).
- `ToFullMatrix()`: Export the matrix as an immutable 2D slice.
Edge cases are rigorously handled:
- Empty (`n=0`) and singleton (`n=1`) matrices return `(0.0, -1, -1)` for min/max.
- Label mutations do not affect internal state via defensive copying.
All behaviors are validated through comprehensive unit tests, emphasizing correctness and robustness.
# Semantic Description of `ReadSequencesBatchFromFiles`
This function implements **concurrent, batched streaming** of biological sequences from multiple input files.
## Core Functionality
- **Input**: A slice of file paths (`[]string`), an optional batch reader interface, and a concurrency level.
- **Default behavior**: Uses `ReadSequencesFromFile` if no custom reader is provided.
## Concurrency Model
- Launches `concurrent_readers` goroutines to process files in parallel.
- Files are distributed via a shared channel (`filenameChan`) — ensuring fair load balancing.
## Streaming Interface
- Returns an `obiiter.IBioSequence`, a streaming iterator over batches of biological sequences.
- Internally uses an atomic counter (`nextCounter`) to assign unique, ordered IDs to sequence batches (via `Reorder`), preserving global order despite parallelism.
## Error Handling & Logging
- Panics on file-open failure (via `log.Panicf`).
- Logs start/end of reading per file using structured logging (`log.Printf`, `log.Println`).
## Resource Management
- Uses a barrier pattern: each reader goroutine calls `batchiter.Done()` upon completion.
- A finalizer goroutine waits for all readers (`WaitAndClose`) and logs termination.
## Design Intent
- Enables scalable, memory-efficient ingestion of large NGS datasets.
- Decouples *reading logic* (via `IBatchReader`) from orchestration — supporting pluggable formats.
- Prioritizes throughput and deterministic ordering over strict FIFO per-file semantics.
## Key Abstractions
| Type/Interface | Role |
|----------------|------|
| `IBatchReader` | Reader factory: `(filename, options...) → SequenceIterator` |
| `obiiter.IBioSequence` | Thread-safe batch iterator (push model) |
| `AtomicCounter` | Ensures globally unique, sequential batch IDs across goroutines |
# `obiformats` Package — Semantic Overview
The `obiformats` package provides a standardized interface for **format-agnostic batch reading of biological sequence data** within the OBITools4 ecosystem.
## Core Abstraction
- **`IBatchReader`** is a function type defining the contract for opening and iterating over sequence files:
```go
func(string, ...WithOption) (obiiter.IBioSequence, error)
```
- It accepts:
- A file path (`string`)
- Optional configuration via variadic `WithOption` arguments (e.g., filtering, parsing rules)
- Returns:
- An iterator over biological sequences (`obiiter.IBioSequence`)
- Or an error if the file cannot be opened/parsed
## Semantic Intent
- **Decouples format handling from iteration logic**: Enables uniform consumption of FASTA, FASTQ, SAM/BAM, etc., via a single entry point.
- **Supports extensibility**: New format readers can be registered as `IBatchReader` implementations without altering client code.
- **Enables lazy, streaming access**: Sequences are yielded on-demand via the iterator—memory-efficient for large datasets.
## Typical Usage Pattern
1. Select or compose an `IBatchReader` implementation (e.g., for FASTQ).
2. Call it with a file path and optional options.
3. Iterate over the returned `IBioSequence` to process sequences one-by-one.
## Design Principles
- **Functional, minimal API**: Single responsibility—reading and iteration.
- **Option-based configurability**: Avoids combinatorial function overloading via `With...` patterns.
- **Integration-ready**: Built to work seamlessly with the broader OBITools4 iterator and sequence abstractions.
> *Note: Actual format-specific readers (e.g., `NewFASTQBatchReader`) are expected to conform to this interface but reside outside the core type definition.*
# CSV Import Module for Biological Sequences (`obiformats`)
This Go package provides functionality to parse biological sequence data from CSV files into structured objects compatible with the OBItools4 framework.
## Core Features
- **CSV Parsing**: Reads CSV data via `io.Reader`, supporting comments (`#`), flexible field counts, and leading-space trimming.
- **Sequence Extraction**: Identifies columns named `sequence`, `id`, or `qualities` by header and maps them to corresponding biological sequence fields.
- **Quality Score Adjustment**: Applies a configurable Phred score shift (default: `33`) to quality strings.
- **Metadata Handling**:
- Special handling for taxonomic IDs (`taxid`, `*_taxid`).
- Generic attributes parsed as JSON when possible; fallback to raw string otherwise.
- **Batched Output**: Streams sequences in configurable batches (`batchSize`) via an iterator interface (`obiiter.IBioSequence`).
- **Multiple Entry Points**:
- `ReadCSV`: From any `io.Reader`.
- `ReadCSVFromFile`: Loads from a file (with source naming derived from filename).
- `ReadCSVFromStdin`: Reads from standard input.
- **Error & Edge Handling**:
- Gracefully handles empty files/streams via `ReadEmptyFile`.
- Uses structured logging (Logrus) for fatal and informational messages.
## Integration
Designed to integrate with OBITools4's core types:
- `obiseq.BioSequence`: Holds sequence, ID, qualities, taxid, and arbitrary attributes.
- `obiiter.IBioSequence`: Streaming interface for batched sequence iteration.
## Use Case
Efficient, flexible ingestion of tabular biological data (e.g., from alignment outputs or preprocessed FASTQ/FASTA conversions) into downstream analysis pipelines.
@@ -0,0 +1,22 @@
# CSVSequenceRecord Function Description
The `CSVSequenceRecord` function converts a biological sequence object (`*obiseq.BioSequence`) into a slice of strings suitable for CSV output. It dynamically constructs the record based on user-defined options (`opt Options`), enabling flexible column selection.
## Core Features
- **Sequence ID**: Includes the sequence identifier if `opt.CSVId()` is enabled.
- **Abundance Count**: Appends the sequence count (e.g., read depth) if `opt.CSVCount()` is true.
- **Taxonomic Information**: Adds both NCBI taxid and scientific name (retrieved from attributes or fallback via `opt.CSVNAValue()`).
- **Definition Line**: Includes the sequence definition/description if requested via `opt.CSVDefinition()`.
- **Custom Attributes**: Iterates over keys from `opt.CSVKeys()` and appends corresponding attribute values (or NA if missing).
- **Nucleotide Sequence**: Appends the raw sequence string when `opt.CSVSequence()` is enabled.
- **Quality Scores**: Converts Phred-quality scores to ASCII characters (using a configurable shift) if available; otherwise inserts NA.
## Design Highlights
- Uses `obiutils.InterfaceToString()` for safe type conversion of arbitrary attribute values.
- Handles missing data consistently via `opt.CSVNAValue()`.
- Supports both standard and user-defined metadata fields.
- Adapts quality encoding to common formats (e.g., Sanger/Illumina) via `obidefault.WriteQualitiesShift()`.
This function enables interoperable, configurable export of sequence data to tabular formats.
# `CSVTaxaIterator` Function — Semantic Description
The function `CSVTaxaIterator`, part of the `obiformats` package, converts a taxonomic iterator (`*obitax.ITaxon`) into an **incremental CSV record generator** via `obiitercsv.ICSVRecord`. It enables streaming, batched export of taxonomic data to CSV format with configurable fields.
### Core Functionality:
- **Input**: A pointer-based taxonomic iterator (`*obitax.ITaxon`) and optional configuration via `WithOption`.
- **Output**: An asynchronous CSV record iterator (`*obiitercsv.ICSVRecord`) that yields batches of records.
### Configurable Output Fields (via options):
- `query`: Taxon-associated query identifier, if enabled (`WithPattern`).
- `taxid`: Either raw node ID (e.g., string pointer) or formatted taxon path (`WithRawTaxid` toggle).
- `parent`: Parent taxonomic ID or string representation, if enabled (`WithParent`).
- `taxonomic_rank`: Taxon rank (e.g., "species", "genus").
- `scientific_name`: Full scientific name of the taxon.
- Custom metadata fields: Specified via `WithMetadata`, extracted from taxon metadata store.
- `path`: Full lineage path (e.g., "k__Bacteria; p__; c__..."), if enabled (`WithPath`).
### Implementation Highlights:
- Uses **goroutines** for non-blocking push of batches and clean shutdown (`WaitAndClose`, `Done`).
- Supports **batching** (configurable via `BatchSize`) to optimize I/O.
- Dynamically builds CSV headers based on selected options before processing begins.
### Use Case:
Efficient, memory-light conversion of large taxonomic datasets (e.g., from classification pipelines) into structured CSV for downstream analysis or reporting.
## CSV Taxonomy Loader for OBITools4
This Go module provides a function `LoadCSVTaxonomy` to parse and load taxonomic data from CSV files into an internal taxonomy structure.
### Key Features:
- **Robust CSV Parsing**: Uses Go's `encoding/csv` with configurable options (comment lines, lazy quotes, whitespace trimming).
- **Column Mapping**: Dynamically identifies required columns: `taxid`, `parent`, `scientific_name`, and `taxonomic_rank`.
- **Error Handling**: Validates presence of all required columns; fails early with descriptive errors.
- **Taxonomy Construction**:
- Builds a hierarchical taxonomy using `obitax.Taxon` objects.
- Ensures existence of a root node; returns error otherwise.
- **Metadata Extraction**:
- Derives taxonomy name and short code (e.g., prefix before `:` in first taxid).
- Logs key metadata for traceability.
- **Scalable Design**:
- Processes records line-by-line (memory-efficient).
- Supports large datasets via streaming CSV reading.
### Input Format:
CSV must contain exactly four columns (case-sensitive headers):
- `taxid`: Unique taxon identifier.
- `parent`: Parent taxonomic node ID (empty for root).
- `scientific_name`: Binomial or descriptive name.
- `taxonomic_rank`: e.g., *species*, *genus*.
### Output:
Returns a fully populated `obitax.Taxonomy` object ready for downstream phylogenetic or sequence classification tasks.
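The column-mapping and fail-early validation described above can be sketched as follows; `taxNode` and `loadCSVTaxonomy` are simplified stand-ins for the real `obitax.Taxon`-based construction, not the actual implementation:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// taxNode is a simplified stand-in for obitax.Taxon.
type taxNode struct {
	taxid, parent, name, rank string
}

// loadCSVTaxonomy locates the four required headers by name, then
// consumes records one by one in streaming fashion.
func loadCSVTaxonomy(data string) (map[string]taxNode, error) {
	cr := csv.NewReader(strings.NewReader(data))
	cr.TrimLeadingSpace = true
	header, err := cr.Read()
	if err != nil {
		return nil, err
	}
	col := map[string]int{}
	for i, h := range header {
		col[h] = i
	}
	// Fail early with a descriptive error if any required column is absent.
	for _, required := range []string{"taxid", "parent", "scientific_name", "taxonomic_rank"} {
		if _, ok := col[required]; !ok {
			return nil, fmt.Errorf("missing required column %q", required)
		}
	}
	nodes := map[string]taxNode{}
	for {
		rec, err := cr.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		nodes[rec[col["taxid"]]] = taxNode{
			taxid:  rec[col["taxid"]],
			parent: rec[col["parent"]],
			name:   rec[col["scientific_name"]],
			rank:   rec[col["taxonomic_rank"]],
		}
	}
	return nodes, nil
}
```

Because columns are resolved by header name rather than position, the input may list them in any order.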
# Semantic Description of `obiformats.WriterDispatcher`
The package `obiformats` provides utilities for writing biosequences (e.g., DNA/RNA/protein reads) to files in a structured, parallelized manner. Its core component is the `WriterDispatcher` function.
- **Purpose**: Enables concurrent, classifier-guided writing of biosequence batches to multiple output files based on dynamic dispatching logic.
- **Input**: Takes a prototype filename template (`prototypename`), an `IDistribute` dispatcher (which partitions and routes sequences by classification keys), a formatting/writing function (`formater` of type `SequenceBatchWriterToFile`), and optional configuration.
- **Concurrency**: Launches one goroutine per classification category (via `dispatcher.News()`), ensuring scalable parallel writes.
- **Classification Handling**: Supports simple and composite keys (e.g., dual annotations like sample + region), parsing JSON-encoded classifier values when needed.
- **File Naming & Organization**: Substitutes keys into the prototype name, appends `.gz` if compression is enabled, and creates subdirectories (e.g., for sample groups) as required.
- **Error Handling**: Uses `log.Fatalf` to abort on unrecoverable errors (e.g., failed key parsing, directory creation issues).
- **Resource Management**: Ensures all goroutines complete before returning via `sync.WaitGroup`.
- **Extensibility**: The generic `SequenceBatchWriterToFile` type allows plugging in different output formats (e.g., FASTA, JSON) without modifying the dispatcher logic.
In summary: `WriterDispatcher` is a high-level orchestrator for parallel, classifier-based batch writing of biological sequences to organized file outputs.
# EcoPCR File Parser for Biological Sequences
This Go package (`obiformats`) provides functionality to parse EcoPCR output files—tab-delimited CSV-like files containing amplified sequence data generated by the *EcoPCR* tool (used in metabarcoding pipelines). The parser supports two versions of the format (`v1` and `v2`) and extracts rich biological metadata alongside sequences.
## Key Features
- **Version Detection**: Automatically detects EcoPCR file version via the `#@ecopcr-v2` header.
- **Primer Extraction**: Reads forward and reverse primer sequences from comment lines in the file header.
- **Mode Inference**: Identifies amplification mode (e.g., `direct`, `inverted`) from header metadata.
- **Sequence Parsing**: Reads each record as a biological sequence (`obiseq.BioSequence`) with:
- Name (with deduplication support)
- Nucleotide/protein sequence
- Comment field
- **Structured Annotation**: Populates rich annotations including:
- Taxonomic hierarchy (taxid, rank, species/genus/family names)
- Primer matching info (`forward_match`, `reverse_mismatch`)
- Melting temperatures (if present in v2)
- Amplicon length and strand orientation
- **Streaming & Batching**: Returns an iterator (`obiiter.IBioSequence`) for memory-efficient, batched processing of large files.
- **File Handling**: Provides both `ReadEcoPCR` (from any `io.Reader`) and `ReadEcoPCRFromFile` convenience functions.
## Implementation Highlights
- Custom line reader (`__readline__`) for robust header parsing.
- CSV parser configured with `|` delimiter and comment support (`#`).
- Deduplication of sequence names using a running count suffix.
- Concurrent goroutine-based streaming to decouple I/O and processing.
This module integrates with the broader *OBItools4* ecosystem for high-throughput sequence analysis in environmental DNA studies.
# EMBL Format Parser for OBITools4
This Go package (`obiformats`) provides robust, streaming parsers for the **EMBL nucleotide sequence format**, supporting both standard and rope-based (memory-efficient) parsing. Key features:
- **Entry Boundary Detection**: `EndOfLastFlatFileEntry()` identifies the end of EMBL entries using the signature terminator pattern `//` (with optional CR/LF), enabling chunked file processing.
- **Two Parsing Modes**:
- `EmblChunkParser()`: Line-scanning parser for buffered I/O (`io.Reader`).
- `EmblChunkParserRope()`: Direct rope-based parser for zero-copy processing of large files.
- **Configurable Options**:
- `withFeatureTable`: Includes EMBL feature table (`FH`/`FT`) lines.
- `UtoT`: Converts RNA uracil (`u/U`) to DNA thymine (`t/T`).
- **Metadata Extraction**: Captures `ID`, `OS` (scientific name), `DE` (description), and taxonomic ID (`/db_xref="taxon:..."`) into sequence annotations.
- **Sequence Handling**: Parses multi-line EMBL sequences (10-bases-per-group, with position numbers), skipping digits and whitespace.
- **Parallel Processing**: `ReadEMBL()`/`ReadEMBLFromFile()` support concurrent parsing via worker goroutines, streaming results as `BioSequenceBatch` objects.
- **Integration**: Outputs are compatible with OBITools4's iterator framework (`obiiter.IBioSequence`) and sequence type `obiseq.BioSequence`.
Designed for scalability, the module handles large EMBL files efficiently—ideal for metagenomic or biodiversity data pipelines.
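The `//` terminator detection can be sketched like this; it is a simplified stand-in for `EndOfLastFlatFileEntry`, covering the optional CR before LF but not every corner case of the real parser:

```go
package main

import "bytes"

// endOfLastEntry finds the last "//" terminator at the start of a line
// and returns the offset just past its newline, or -1 when the buffer
// holds no complete entry.
func endOfLastEntry(buf []byte) int {
	i := bytes.LastIndex(buf, []byte("\n//"))
	if i < 0 {
		if !bytes.HasPrefix(buf, []byte("//")) {
			return -1
		}
		i = -1 // terminator line begins at offset 0
	}
	j := i + 3 // first byte after "//"
	if j < len(buf) && buf[j] == '\r' {
		j++
	}
	if j < len(buf) && buf[j] == '\n' {
		return j + 1
	}
	return -1 // "//" seen but its line is not complete yet
}
```

A chunker can cut the buffer at the returned offset and carry the remainder over into the next chunk, guaranteeing entries are never split.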
## `ReadEmptyFile` Function — Semantic Description
- **Package**: `obiformats`, part of the OBITools4 ecosystem for biological sequence handling.
- **Purpose**: Creates and returns an *empty*, closed iterator over biosequences (`IBioSequence`).
- **Signature**:
`func ReadEmptyFile(options ...WithOption) (obiiter.IBioSequence, error)`
- **Input**: Accepts variadic `WithOption` configuration functions (currently unused in this minimal implementation).
- **Behavior**:
- Instantiates a new `IBioSequence` iterator via `obiiter.MakeIBioSequence()`.
- Immediately closes the stream using `.Close()` — indicating no data will be yielded.
- **Output**:
- Returns a *terminal* iterator (no elements), suitable as a safe default or fallback.
- Error return is always `nil`, since no I/O occurs and the operation is deterministic.
### Semantic Role & Use Cases
- **Default/Placeholder**: Useful in conditional logic where a valid (but empty) sequence iterator is required when no input file exists or parsing fails.
- **Consistency**: Ensures callers always receive a well-formed iterator, avoiding `nil` checks.
- **Resource Safety**: The closed state prevents accidental iteration or memory leaks.
### Design Notes
- Reflects a *pure-functional* and *fail-safe* pattern: no side effects, deterministic behavior.
- Aligns with iterator-based I/O design principles in OBITools4 (lazy, composable streams).
# FASTA Parser Module (`obiformats`)
This Go package provides robust, streaming-capable parsing of FASTA-formatted nucleotide sequences. It supports both standard and rope-based (memory-efficient) input handling.
## Core Functionalities
- **`FastaChunkParser(UtoT bool)`**
Returns a parser function for in-memory byte streams. Converts `U→T` if enabled (for RNA/DNA normalization). Validates headers, identifiers, and sequences; rejects invalid characters or malformed entries.
- **`FastaChunkParserRope(...)`**
Parses FASTA directly from a `PieceOfChunk` rope structure, avoiding full data materialization. Optimized for large files.
- **`ReadFasta(reader io.Reader, ...)`**
High-level API to parse FASTA from any `io.Reader`. Uses chunked reading with parallel workers (configurable via options). Supports full-file batching and header annotation parsing.
- **`ReadFastaFromFile(...)` / `ReadFastaFromStdin(...)`**
Convenience wrappers for file and stdin inputs, including source naming and empty-file handling.
- **`EndOfLastFastaEntry(...)`**
Helper to locate the last complete FASTA entry in a buffer, enabling safe chunked streaming without splitting records.
## Key Features
- **Strict validation**: Ensures entries start with `>`, contain valid identifiers, and only use allowed sequence characters (`a-z`, `- . [ ]`).
- **Case normalization**: Converts uppercase to lowercase; optional `U→T` conversion.
- **Whitespace handling**: Ignores spaces/tabs in sequences, preserves line breaks only for parsing structure.
- **Parallel processing**: Configurable worker count via options; batches results by source and order for downstream sorting/aggregation.
- **Integration with `obiseq`/`obiiter`**: Yields typed sequence objects (`BioSequence`) and batched iterators compatible with OBITools4 pipelines.
## Design Highlights
- Minimal allocations via rope-based parsing (`extractFastaSeq`).
- Graceful error reporting with context (source, identifier, invalid char position).
- Extensible via `WithOption` pattern for header parsing and batching behavior.
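The validation and normalization rules above can be sketched as a single pass over the raw text; this is an illustrative simplification, not the actual parser:

```go
package main

import "fmt"

// normalizeFastaSeq folds uppercase to lowercase, optionally converts
// 'u' to 't', skips whitespace, and reports the position of any
// character outside the allowed set.
func normalizeFastaSeq(raw string, utoT bool) (string, error) {
	out := make([]byte, 0, len(raw))
	for i := 0; i < len(raw); i++ {
		c := raw[i]
		switch {
		case c >= 'A' && c <= 'Z':
			c += 'a' - 'A' // case normalization
		case c == ' ' || c == '\t' || c == '\n' || c == '\r':
			continue // whitespace is ignored inside sequences
		}
		if utoT && c == 'u' {
			c = 't' // RNA -> DNA normalization
		}
		if (c < 'a' || c > 'z') && c != '-' && c != '.' && c != '[' && c != ']' {
			return "", fmt.Errorf("invalid character %q at position %d", c, i)
		}
		out = append(out, c)
	}
	return string(out), nil
}
```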
# FASTQ Parsing Module (`obiformats`)
This Go package provides robust, streaming-capable parsing of FASTQ files — a standard format for storing nucleotide sequences along with quality scores.
## Core Functionalities
- **`EndOfLastFastqEntry(buffer []byte) int`**
Locates the start position (`@`) of the last complete FASTQ entry in a byte buffer using state-machine scanning from end to beginning. Returns `-1` if no valid entry is found.
- **`FastqChunkParser(...)`**
Returns a parser function for processing FASTQ data from an `io.Reader`. Handles:
- Header parsing (`@id [definition]`)
- Sequence normalization (uppercase → lowercase, `U→T` conversion if enabled)
- Quality score shifting (`quality_shift`)
- Strict validation (e.g., `+` line, matching sequence/length)
- **`FastqChunkParserRope(...)`**
Optimized parser for rope-based input (`PieceOfChunk`), avoiding unnecessary memory copies. Uses direct line-by-line scanning.
- **Batched File Parsing (`_ParseFastqFile`, `ReadFastq`, etc.)**
Enables concurrent, chunked parsing of large files:
- Splits input into chunks using `ReadFileChunk`
- Uses configurable parallel workers (`nworker`)
- Pushes parsed batches to an iterator interface
- **Convenience I/O Wrappers**
- `ReadFastqFromFile(filename, ...)`: Parses a file by name.
- `ReadFastqFromStdin(...)`: Reads FASTQ from standard input.
## Key Options & Features
- **Quality handling**: Optional quality extraction (`with_quality`), configurable offset (`quality_shift`)
- **Uracil-to-Thymine conversion**: `UtoT` flag for RNA→DNA normalization
- **Header annotation parsing**: Optional post-parsing header interpretation via `ParseFastSeqHeader`
- **Batch sorting & full-file mode**: Supports both streaming and complete-file aggregation
## Design Highlights
- **Memory-efficient chunking** with overlap-aware boundary detection (`EndOfLastFastqEntry`)
- **Strict error reporting**: Fails fast on malformed FASTQ (e.g., invalid chars, length mismatch)
- **Integration with `obiseq`, `obiiter`**: Returns typed biological sequence slices and iterator streams compatible with the broader OBITools4 ecosystem.
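The `quality_shift` handling reduces to a per-character subtraction of the configured ASCII offset, as this hedged sketch shows:

```go
package main

// decodeQualities converts each ASCII character of a FASTQ quality line
// into a Phred score by subtracting the configured offset
// (33 for Sanger/Illumina 1.8+, 64 for old Solexa-scaled data).
func decodeQualities(line string, shift int) []int {
	scores := make([]int, len(line))
	for i := 0; i < len(line); i++ {
		scores[i] = int(line[i]) - shift
	}
	return scores
}
```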
## Semantic Description of `obiformats` Package
The `obiformats` package provides core formatting utilities for biological sequence data in standard FASTX formats (FASTA and FASTQ). It defines two functional types:
- `BioSequenceFormater`: Converts a single biological sequence (`*obiseq.BioSequence`) into its string representation.
- `BioSequenceBatchFormater`: Converts a batch of sequences (`obiiter.BioSequenceBatch`) into raw bytes, suitable for file or stream output.
Two main constructor functions enable flexible formatting:
- `BuildFastxSeqFormater(format, header)` returns a sequence-level formatter based on the requested format (`"fasta"` or `"fastq"`), applying optional header metadata via `FormatHeader`.
- `BuildFastxFormater(format, header)` builds a batch formatter by composing the sequence-level function over all sequences in an iterator-driven batch, concatenating results with newline separators.
The package supports extensibility and type safety through function composition while integrating logging (via `logrus`) for critical errors—e.g., unsupported formats trigger a fatal log. It abstracts away low-level I/O, focusing purely on *semantic formatting logic*, making it ideal for pipeline integration in NGS data processing tools.
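The batch-over-sequence composition can be illustrated with plain function values; the types here are simplified stand-ins for `BioSequenceFormater` and `BioSequenceBatchFormater`:

```go
package main

import "strings"

// seqFormatter mirrors the idea of BioSequenceFormater: one sequence -> string.
type seqFormatter func(id, seq string) string

// batchFormatter composes a per-sequence formatter over a whole batch,
// concatenating records with newline separators.
func batchFormatter(f seqFormatter) func(ids, seqs []string) string {
	return func(ids, seqs []string) string {
		records := make([]string, len(ids))
		for i := range ids {
			records[i] = f(ids[i], seqs[i])
		}
		return strings.Join(records, "\n")
	}
}
```

Because the batch formatter only depends on the function type, a FASTA and a FASTQ per-sequence formatter plug into the same composition unchanged.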
# Semantic Description of `obiformats` Package
The `obiformats` package provides utilities for parsing sequence headers in the OBItools4 framework, supporting two distinct formats:
- **JSON-based format** (e.g., `{"id":"seq1", ...}`): Detected by a leading `{` character.
- **Legacy OBI format** (plain text, e.g., `>seq1 description`): Used when no JSON prefix is present.
## Core Functions
- **`ParseGuessedFastSeqHeader(sequence *obiseq.BioSequence)`**
Dynamically routes header parsing based on the first character of the sequence definition:
- Calls `ParseFastSeqJsonHeader` if JSON-prefixed.
- Otherwise invokes `ParseFastSeqOBIHeader`.
- **`IParseFastSeqHeaderBatch(iterator, options...) obiiter.IBioSequence`**
Applies header parsing to a *batch* of sequences:
- Takes an iterator over `BioSequence`s.
- Uses optional configuration (e.g., parallelism, parsing behavior).
- Wraps the parser in a worker pipeline via `MakeIWorker`, preserving sequence flow.
## Design Principles
- **Format agnosticism**: Automatically detects header type.
- **Iterator-based streaming**: Enables memory-efficient batch processing of large datasets (e.g., FASTQ/FASTA).
- **Extensibility**: Options pattern (`WithOption`) supports runtime customization.
This package serves as a header-decoding layer for downstream analysis in metagenomic or metabarcoding workflows.
# `FormatHeader` Function Type in `obiformats`
The `obiformats` package defines a core functional interface for sequence formatting within the OBITools4 ecosystem.
- **Package**: `obiformats`
Provides utilities for formatting biological sequences according to various output standards (e.g., FASTA, GenBank).
- **Type Definition**:
```go
type FormatHeader func(sequence *obiseq.BioSequence) string
```
- A `FormatHeader` is a *function type* that takes a pointer to an `obiseq.BioSequence` and returns its formatted header as a string.
- **Semantic Role**:
Encapsulates the logic for generating *header lines* (e.g., `>id description`) in sequence file formats.
Decouples header formatting from core data structures (`BioSequence`), enabling modular and reusable format adapters.
- **Usage Context**:
- Used by writers/formatters to produce standardized headers when exporting sequences.
- Allows custom header generation (e.g., for MIxS-compliant metadata, user-defined tags).
- Supports polymorphism: different `FormatHeader` implementations can be swapped per output format.
- **Dependencies**:
- Relies on `obiseq.BioSequence`, the core sequence data model (ID, description, annotations, etc.).
- **Design Intent**:
Promotes clean separation of concerns: data (sequence) ↔ formatting logic.
Facilitates extensibility for new output formats without modifying core types.
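The polymorphism the type enables can be illustrated with two interchangeable implementations over a minimal stand-in for `obiseq.BioSequence`:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// bioSeq is a minimal stand-in for obiseq.BioSequence.
type bioSeq struct {
	id          string
	definition  string
	annotations map[string]any
}

// formatHeader mirrors the FormatHeader function type: *seq -> header string.
type formatHeader func(s *bioSeq) string

// plainHeader emits the classic ">id description" style payload.
func plainHeader(s *bioSeq) string {
	return s.id + " " + s.definition
}

// annotatedHeader emits annotations as key=value; pairs in stable order.
func annotatedHeader(s *bioSeq) string {
	keys := make([]string, 0, len(s.annotations))
	for k := range s.annotations {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%v;", k, s.annotations[k]))
	}
	return s.id + " " + strings.Join(parts, " ")
}
```

A writer parameterized on `formatHeader` can be switched between the two without any change to its own logic, which is the separation of concerns the type exists for.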
This Go package `obiformats` provides semantic parsing and serialization utilities for FASTQ/FASTA sequence headers encoded in JSON format, primarily used within the OBITools4 framework.
- **JSON Parsing Helpers**:
It defines internal functions (`_parse_json_map_*`, `_parse_json_array_*`) to convert JSON objects/arrays into typed Go maps and slices (`map[string]string`, `[]int`, etc.), using the high-performance [`jsonparser`](https://github.com/buger/jsonparser) library for streaming parsing.
- **Header Interpretation**:
`_parse_json_header_` interprets a FASTQ/FASTA header string containing embedded JSON metadata. It extracts and assigns:
- Core fields (`id`, `definition`, `count`)
- Specialized OBITools annotations (e.g., `"obiclean_weight"`, `"taxid"` with optional taxonomic ranks)
- Generic annotations of any JSON type (string, number, bool, array, object), preserving numeric precision where possible.
- **Sequence Annotation Enrichment**:
`ParseFastSeqJsonHeader` parses the header of a `BioSequence`, extracting JSON metadata into its annotations map and reconstructing non-JSON text as the new definition.
- **Serialization Support**:
`WriteFastSeqJsonHeader` and `FormatFastSeqJsonHeader` serialize sequence annotations back into JSON format, appending them to a buffer or returning as string — enabling round-trip compatibility for annotated sequences.
- **Error Handling**:
Uses `log.Fatalf` on parsing failures, ensuring malformed headers fail fast during processing.
In summary: *structured JSON header ↔ BioSequence annotation mapping*, optimized for metabarcoding workflows.
# OBIFormats Package: Semantic Description
The `obiformats` package provides parsing and formatting utilities for **OBI-compliant FASTA headers**, enabling structured annotation of biological sequences.
- It supports parsing key-value annotations embedded in sequence definitions (e.g., `key=value;`), including nested dictionaries.
- Three core parsing functions detect value types:
- `__match__key__`: Identifies assignment patterns (`Key = ...`).
- `__obi_header_value_numeric_pattern__`: Matches floats/integers (e.g., `42.0;`).
- `__obi_header_value_string_pattern__`: Matches quoted strings (e.g., `'example';`).
- `__match__dict__`: Parses balanced `{...}` blocks, handling nested structures and string delimiters.
- Boolean detection (`__is_true__/__is_false__`) handles multiple case variants (e.g., `true`, `True`, `TRUE`).
- The main entry point, **`ParseOBIFeatures(text string, annotations obiseq.Annotation)`**, iteratively extracts key-value pairs from a header string and populates an `Annotation` map.
- Numeric values are stored as integers if they have no fractional part.
- Dictionary-like strings (e.g., `{'a':1,'b':2}`) are JSON-unmarshalled into typed maps:
- `*_count``map[string]int`,
- `merged_*` → wrapped in a statistics object (`obiseq.StatsOnValues`).
- `*_status`/`*_mutation``map[string]string`.
- **`ParseFastSeqOBIHeader(sequence *obiseq.BioSequence)`** applies parsing to a sequence's definition line, moving annotations into its metadata map and preserving leftover text.
- **`WriteFastSeqOBIHeader(buffer *bytes.Buffer, sequence)`** serializes annotations back into OBI header format:
- Strings and booleans use `key=value;`.
- Maps/dicts are JSON-encoded, then single-quoted for compatibility.
- Special handling ensures `obiseq.StatsOnValues` are safely marshalled.
- **`FormatFastSeqOBIHeader(sequence)`** returns the formatted header as a string (zero-copy via `unsafe.String` for performance).
- Designed to interoperate with the broader OBITools4 ecosystem (`obiseq`, `obiutils`), supporting both human-readable and machine-processable sequence metadata.
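A minimal sketch of the `key=value;` extraction follows; it ignores nested `{...}` dictionaries and the spacing variants the real parser accepts, and the regular expression is illustrative only:

```go
package main

import (
	"regexp"
	"strconv"
)

var kvPattern = regexp.MustCompile(`(\w+)=([^;]+);`)

// parseOBIFeatures extracts key=value; pairs, storing values without a
// fractional part as int, other numerics as float64, and the rest as string.
func parseOBIFeatures(header string) map[string]any {
	annotations := map[string]any{}
	for _, m := range kvPattern.FindAllStringSubmatch(header, -1) {
		key, raw := m[1], m[2]
		if i, err := strconv.Atoi(raw); err == nil {
			annotations[key] = i
		} else if f, err := strconv.ParseFloat(raw, 64); err == nil {
			annotations[key] = f
		} else {
			annotations[key] = raw
		}
	}
	return annotations
}
```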
# FastSeq Reader Module — Semantic Description
This Go package (`obiformats`) provides high-performance parsing of FASTA/FASTQ files using a C-backed library (`fastseq_read.h`). It enables streaming, batched reading of biological sequences with optional quality scores.
## Core Features
- **C-based FASTX parsing**: Leverages `kseq.h` via Go's cgo for efficient, low-level file/stream parsing.
- **Batched iteration**: Sequences are grouped into configurable batches (`batch_size`) for memory-efficient processing.
- **Quality score handling**: Supports FASTQ; decodes Phred quality scores using a configurable shift offset (`obidefault.ReadQualitiesShift()`).
- **Source tracking**: Each sequence carries its origin (filename or `"stdin"`), aiding provenance.
- **Header parsing hook**: Optional custom header parser (`ParseFastSeqHeader`) allows metadata extraction or transformation.
- **Full-file batching mode**: When enabled, yields a single batch containing the entire file (useful for small files or global operations).
- **Stdin & File I/O**: Two entry points:
- `ReadFastSeqFromFile(filename, ...)` for regular files.
- `ReadFastSeqFromStdin(...)` to process piped input (e.g., from upstream tools).
- **Error resilience**: Gracefully handles missing files, with logging (via `logrus`) for debugging.
- **Async streaming**: Uses goroutines to decouple reading from consumption, enabling concurrent pipelines.
## Integration
Built on top of `obitools4`'s core abstractions:
- `obiiter.IBioSequence`: Iterator interface for biological sequences.
- `obiseq.BioSequence`: Data model holding name, sequence bytes, comment, and quality.
- `obiutils`, `obidefault`: Utilities for path handling and defaults.
Designed for scalability in high-throughput metabarcoding pipelines.
# `obiformats` Package Overview
The `obiformats` package provides utilities for formatting and writing biological sequences (e.g., DNA, RNA) in standard formats—primarily **FASTA**. It is designed for high-performance batch processing and supports parallel I/O, compression-aware streaming, and flexible configuration.
## Core Formatting Functions
- **`FormatFasta(seq, formater)`**
Converts a single `BioSequence` into a FASTA string: header (`>id description`) followed by sequence lines of up to 60 characters.
- **`FormatFastaBatch(batch, formater, skipEmpty)`**
Efficiently formats a batch of sequences into FASTA using pre-allocated buffers and direct byte writes—avoiding intermediate strings. Empty sequences are either skipped (with warning) or cause a fatal error.
## File Writing Functions
- **`WriteFasta(iterator, file, options...)`**
Writes a stream of sequences to any `io.WriteCloser`. Supports:
- Parallel workers (`ParallelWorkers`)
- Chunked writing via `WriteFileChunk`
- Optional compression (e.g., gzip)
Returns a new iterator mirroring the input for pipeline chaining.
- **`WriteFastaToStdout(iterator, options...)`**
Convenience wrapper to output FASTA directly to `stdout`, with file-closing behavior configurable.
- **`WriteFastaToFile(iterator, filename, options...)`**
Writes to a named file with:
- Truncation or append mode (`AppendFile`)
- Automatic paired-end output if `HaveToSavePaired()` is enabled
(writes reverse reads to a secondary file specified via `PairedFileName`)
## Key Design Highlights
- **Memory-efficient**: Uses `bytes.Buffer.Grow()` and avoids unnecessary allocations.
- **Robust error handling**: Panics on nil sequences; logs warnings/errors via `logrus`.
- **Pipeline-friendly**: Integrates with the `obiiter` iterator abstraction for streaming workflows.
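The header-plus-60-character-wrapping layout can be sketched as follows; this illustrates the output format, not the actual `FormatFasta` code:

```go
package main

import "bytes"

// formatFasta emits a ">id definition" header followed by the sequence
// wrapped at 60 characters per line.
func formatFasta(id, definition, seq string) string {
	var b bytes.Buffer
	// Pre-size the buffer to avoid repeated reallocations.
	b.Grow(len(seq) + len(seq)/60 + len(id) + len(definition) + 3)
	b.WriteByte('>')
	b.WriteString(id)
	if definition != "" {
		b.WriteByte(' ')
		b.WriteString(definition)
	}
	for i := 0; i < len(seq); i += 60 {
		end := i + 60
		if end > len(seq) {
			end = len(seq)
		}
		b.WriteByte('\n')
		b.WriteString(seq[i:end])
	}
	return b.String()
}
```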
# FASTQ Output Module (`obiformats`)
This Go package provides utilities for formatting and writing biological sequence data in **FASTQ format**. It supports single-end, paired-end, batch processing, and parallelized I/O.
## Core Functionality
- **`FormatFastq(seq, headerFormatter)`**: Formats a single `BioSequence` into FASTQ string.
- **`FormatFastqBatch(batch, headerFormatter, skipEmpty)`**: Formats a batch of sequences efficiently with dynamic buffer growth and optional skipping/termination on empty reads.
## Header Customization
- Accepts a `FormatHeader` function to inject custom metadata (e.g., read group, sample ID) after the sequence identifier.
## Writing to Streams/Files
- **`WriteFastq(iterator, fileWriter)`**: Writes sequences from an iterator to any `io.WriteCloser`, supporting compression and parallel workers via options.
- **`WriteFastqToStdout(...)`**: Convenience wrapper for stdout output (e.g., piping).
- **`WriteFastqToFile(...)`**: Writes to a file, with support for:
- Append/truncate modes
- Paired-end output (splits iterator and writes to two files)
- Automatic compression via `obiutils.CompressStream`
## Parallelization & Robustness
- Uses goroutines to parallelize formatting/writing across multiple workers.
- Handles empty sequences gracefully: logs warning or fatal error based on `skipEmpty` option.
- Ensures ordered output via batch tracking (`Order()`) and chunked writing.
## Integration
Designed to work seamlessly with the `obitools4` ecosystem:
- Uses `obiiter.BioSequenceBatch`, `obiseq.BioSequence`, and logging via Logrus.
- Extensible through functional options (`WithOption`) for configuration.
> *Efficient, scalable FASTQ output with support for high-throughput NGS workflows.*
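The four-line record layout reduces to a short helper; the `shift` parameter mirrors the quality re-encoding idea, assuming the standard Phred+33 offset:

```go
package main

// formatFastq emits the canonical four-line FASTQ record: "@id", the
// sequence, the "+" separator, and qualities re-encoded with the given
// ASCII shift.
func formatFastq(id, seq string, qual []int, shift int) string {
	q := make([]byte, len(qual))
	for i, v := range qual {
		q[i] = byte(v + shift)
	}
	return "@" + id + "\n" + seq + "\n+\n" + string(q) + "\n"
}
```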
# `obiformats` Package Overview
The `obiformats` package provides semantic support for handling and validating structured data formats, particularly focused on biodiversity observation records. It offers:
- **Format Abstraction**: Defines common interfaces and base classes for standardized biodiversity data formats (e.g., Darwin Core, OBIS-ENV).
- **Validation Rules**: Implements semantic validation logic to ensure data integrity and compliance with community standards (e.g., required fields, controlled vocabularies).
- **Mapping Utilities**: Includes tools for transforming records between different biodiversity data schemas (e.g., from local formats to Darwin Core).
- **Ontology Integration**: Leverages semantic web technologies (e.g., RDF, OWL) to support interoperability and reasoning over observation metadata.
- **Type Safety**: Uses strongly-typed data models (e.g., `Occurrence`, `Event`) to reduce runtime errors and improve code clarity.
- **Extensibility**: Designed for easy extension—new formats or standards can be added by implementing core interfaces.
- **Test Coverage**: Includes unit and integration tests to guarantee correctness across format transformations and validations.
The package targets biodiversity data managers, informaticians building OBIS-compatible systems, and researchers working with ecological observation datasets.
# Semantic Description of `obiformats` Package Functionalities
The `obiformats` package provides robust, streaming-aware chunking utilities for processing large biological sequence files (e.g., FASTA/FASTQ) in a memory-efficient and parallel-friendly manner.
- **`PieceOfChunk`**: A rope-like linked buffer structure enabling efficient concatenation and partial reading of large data streams without full materialization. Supports dynamic chaining (`NewPieceOfChunk`, `Next()`) and final packing into a contiguous slice via `Pack()`.
- **`FileChunk`**: Encapsulates one chunk of raw data (`*bytes.Buffer`) or its rope representation, tagged with source file name and positional order for ordered downstream processing.
- **`ChannelFileChunk`**: A typed channel (`chan FileChunk`) enabling concurrent, pipeline-style data ingestion—ideal for parallel parsing or streaming workflows.
- **`LastSeqRecord`**: A callback type (`func([]byte) int`) used to locate the end of a complete biological record (e.g., last newline after full FASTQ entry), ensuring chunks split only at valid boundaries.
- **`ReadFileChunk()`**: Core function that:
- Reads from an `io.Reader` in configurable chunks (`fileChunkSize`);
- Uses a probe string (e.g., `"@M0"` for FASTQ) to early-exit non-matching segments and avoid unnecessary parsing;
- Extends chunks incrementally (e.g., +1MB) until a full record boundary is found via `splitter`;
- Returns data as an ordered stream of `FileChunk`s on a channel, closing it upon EOF;
- Optionally packs rope buffers to contiguous memory (`pack` flag), balancing speed vs. RAM usage.
- **Key semantics**:
- *Chunking by record integrity*, not fixed byte size — prevents splitting biological entries.
- *Lazy evaluation*: only reads ahead when needed to find record boundaries.
- *Streaming-first design* — supports large files without full loading into memory.
This package is foundational for scalable, robust parsing of high-throughput sequencing data in the OBITools4 ecosystem.
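The rope idea can be sketched minimally: cheap chaining of buffers, with materialization deferred to `Pack()`. The names below are simplified stand-ins for `PieceOfChunk`, not the real structure:

```go
package main

import "bytes"

// pieceOfChunk is a linked list of byte buffers.
type pieceOfChunk struct {
	data []byte
	next *pieceOfChunk
}

// Append chains a new piece after p and returns it, so callers can keep
// extending the tail without walking the list.
func (p *pieceOfChunk) Append(data []byte) *pieceOfChunk {
	n := &pieceOfChunk{data: data}
	p.next = n
	return n
}

// Pack materializes the whole rope into one contiguous slice; this is
// the only step that copies data.
func (p *pieceOfChunk) Pack() []byte {
	var b bytes.Buffer
	for q := p; q != nil; q = q.next {
		b.Write(q.data)
	}
	return b.Bytes()
}
```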
# `WriteFileChunk` Function — Semantic Description
The `WriteFileChunk` function in the `obiformats` package implements a **thread-safe, ordered chunk writer** for streaming data to an `io.WriteCloser`. It accepts a destination writer and a flag indicating whether the writer should be closed upon completion.
- **Input**:
- `writer`: An `io.WriteCloser` (e.g., file, buffer) to which data chunks are written.
- `toBeClosed`: Boolean flag specifying if the writer should be closed after all chunks are processed.
- **Core Behavior**:
- Launches a goroutine that consumes `FileChunk` items from an unbuffered channel (`chunk_channel`).
- Ensures **strict sequential ordering** of chunks by their `Order` field (intended for reassembly after parallel or out-of-order processing).
- If a chunk arrives in order (`chunk.Order == nextToPrint`), it is immediately written.
- Out-of-order chunks are buffered in a map (`toBePrinted`) until their predecessor arrives.
- **Buffer Management**:
- After writing an in-order chunk, the function checks for newly consecutive buffered chunks and writes them greedily (e.g., if order 2 arrives, it triggers writing of buffered orders 3,4,... as available).
- **Error Handling**:
- Logs fatal errors on write failures or writer closure issues using `log.Fatalf`.
- **Cleanup & Lifecycle**:
- Closes the underlying writer if requested and unregisters a pipe registration (via `obiutils`) to signal end-of-stream.
- Returns the input channel, enabling external producers to stream `FileChunk` structs.
- **Use Case**:
Designed for robust, ordered reconstruction of large binary/data streams (e.g., sequencing reads) in OBITools4 pipelines, especially where parallel chunking and reassembly occur.
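The reordering logic can be sketched with a map-buffered loop; `chunk` and `writeOrdered` are simplified stand-ins for `FileChunk` and the real goroutine body:

```go
package main

// chunk pairs a payload with its intended output position.
type chunk struct {
	order int
	data  string
}

// writeOrdered consumes chunks in arbitrary order, buffers out-of-order
// ones by their order field, and emits strictly sequential output.
func writeOrdered(in <-chan chunk, write func(string)) {
	nextToPrint := 0
	toBePrinted := map[int]chunk{}
	for c := range in {
		if c.order != nextToPrint {
			toBePrinted[c.order] = c
			continue
		}
		write(c.data)
		nextToPrint++
		// Greedily flush any buffered chunks that are now consecutive.
		for {
			buffered, ok := toBePrinted[nextToPrint]
			if !ok {
				break
			}
			write(buffered.data)
			delete(toBePrinted, nextToPrint)
			nextToPrint++
		}
	}
}
```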
# GenBank Parser Module (`obiformats`)
This Go package provides high-performance parsing of **GenBank flat files**, optimized for large-scale genomic data processing. It supports both rope-based (memory-efficient) and buffered I/O parsing strategies.
## Core Functionalities
- **State-machine parser**: Processes GenBank records through well-defined states (`inHeader`, `inEntry`, `inFeature`, etc.), ensuring robust handling of structured sections (LOCUS, DEFINITION, SOURCE, FEATURES, ORIGIN/CONTIG).
- **Rope-aware parsing** (`GenbankChunkParserRope`): Directly parses from a `PieceOfChunk` rope structure, avoiding large contiguous memory allocations—critical for chromosomal-scale sequences.
- **Sequence extraction**: Efficient byte-by-byte scanning of the `ORIGIN` section, compacting bases and optionally converting uracil (`u`) to thymine (`t`).
- **Metadata extraction**: Captures sequence ID, declared length (from LOCUS), scientific name (`SOURCE`), and taxonomic ID (`/db_xref="taxon:..."`).
- **Optional feature table support**: When enabled, stores raw FEATURES section content for downstream annotation processing.
- **Parallel streaming I/O**:
- `ReadGenbank()` and `ReadGenbankFromFile()` return an iterator (`obiiter.IBioSequence`) over parsed sequences.
- Supports concurrent parsing via configurable worker count, with chunked file reading and batch output.
## Key Design Decisions
- **Zero-copy where possible**: Rope parser avoids `Pack()` to prevent expensive reallocation.
- **Strict state validation**: Logs fatal errors on unexpected line sequences (e.g., `DEFINITION` outside entry state).
- **Fallback parsing**: Falls back to buffered I/O (`GenbankChunkParser`) when rope data is unavailable.
- **U-to-T conversion**: Optional base modification for RNA→DNA normalization (e.g., in transcriptome data).
- **Error resilience**: Warns on empty IDs but continues processing; rejects overly long lines (>100 chars) in buffered mode.
## Output
Returns a batched iterator of `BioSequence` objects, each containing:
- Identifier (`id`)
- Compact nucleotide sequence
- Definition line (as description)
- Source file origin
- Optional feature table bytes
- Annotations: `scientific_name`, `taxid`
Ideal for pipelines requiring scalable, low-memory GenBank ingestion (e.g., metagenomic databases).
# JSON Output Module for Biological Sequences (`obiformats`)
This Go package provides utilities to serialize biological sequence data (from `obiseq`) into structured JSON format, supporting batch processing and parallel I/O.
- **`JSONRecord(sequence)`**: Converts a single `BioSequence` into an indented JSON object containing:
- `"id"`: Sequence identifier.
- `"sequence"` (optional): Nucleotide/protein sequence string if present.
- `"qualities"` (optional): Quality scores as a string if available.
- `"annotations"` (optional): Metadata annotations map.
- **`FormatJSONBatch(batch)`**: Formats a batch of sequences as JSON array elements, returning a `*bytes.Buffer`. Handles comma separation and indentation.
- **`WriteJSON(iterator, file)`**: Writes a stream of sequences to an `io.Writer`, supporting:
- Parallel workers (configurable via options).
- Automatic compression (`gzip`/`bgzip`) if enabled.
- Proper JSON array wrapping: `[`, chunked batches, and final `]`.
- Atomic ordering to preserve sequence integrity across parallel writes.
- **`WriteJSONToStdout()` / `WriteJSONToFile()`**: Convenience wrappers:
- Outputs to stdout or a file (with append/truncate control).
- Supports paired-end data: writes both forward and reverse reads to separate files when configured.
- **Internal helpers**:
  - `_UnescapeUnicodeCharactersInJSON()`: Fixes double-escaped Unicode in JSON output (e.g., `\\u00E9` → `\u00E9`).
- Uses chunked concurrency with `FileChunk`, ordered by batch number to ensure valid JSON structure.
Designed for high-throughput NGS data pipelines, it ensures correctness and performance while integrating with `obitools4`'s iterator-based processing model.
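The double-escape fix can be sketched with the standard library alone; `unescapeUnicode` below is a hypothetical stand-in for `_UnescapeUnicodeCharactersInJSON`, not the package's actual code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// unescapeUnicode turns double-escaped sequences such as `\\u00E9` back into
// the character they encode, so the emitted JSON carries real escapes rather
// than literal backslash-u text.
func unescapeUnicode(raw string) (string, error) {
	// Quote the text, collapse the doubled escape, then let Unquote
	// interpret the resulting \uXXXX sequence.
	return strconv.Unquote(strings.Replace(strconv.Quote(raw), `\\u`, `\u`, -1))
}

func main() {
	fixed, _ := unescapeUnicode(`caf\u00e9`)
	fmt.Println(fixed)
}
```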
# NCBI Taxonomy Loader Module (`obiformats`)
This Go package provides functionality to parse and load NCBI taxonomy dump files into a structured `Taxonomy` object. It supports three core file types:
- **nodes.dmp**: Defines the taxonomic hierarchy via `taxid|parent_taxid|rank` records.
- **names.dmp**: Maps taxonomic IDs to names and name classes (e.g., "scientific name", "common name").
- **merged.dmp**: Tracks deprecated taxonomic IDs and their replacements.
Key features:
- Custom CSV parsing with `|` delimiter, comment support (`#`), and whitespace trimming.
- Support for loading *only scientific names* via the `onlysn` flag in `LoadNCBITaxDump`.
- Efficient buffered reading (`bufio.Reader`) for large files.
- Automatic root taxon (taxid `"1"`, i.e., *root*) assignment after loading.
- Alias resolution: deprecated taxids are mapped to current ones via `AddAlias`.
- Robust error handling with fatal logging on critical failures (e.g., missing root taxon, invalid parent references).
The main entry point is `LoadNCBITaxDump(directory string, onlysn bool)`, which constructs a fully initialized taxonomy from NCBI dump files. Designed for integration with `obitax` and `obiutils`, it enables downstream applications (e.g., metabarcoding pipelines) to perform taxonomic queries and filtering.
## NCBI Taxonomy Archive Support in `obiformats`
This Go package provides utilities for handling **NCBI Taxonomy dumps archived as `.tar` files**.
### Core Functionalities
1. **Archive Validation (`IsNCBITarTaxDump`)**
- Checks whether a given `.tar` file contains all required NCBI Taxonomy dump files: `citations.dmp`, `division.dmp`, `gencode.dmp`, `names.dmp`, `delnodes.dmp`, `gc.prt`, `merged.dmp`, and `nodes.dmp`.
- Returns a boolean indicating if the archive is a complete NCBI tax dump.
2. **Taxonomy Loading (`LoadNCBITarTaxDump`)**
- Parses the `.tar` archive and extracts key files to build a `Taxonomy` object.
- Steps include:
- **Nodes**: Loads taxonomic hierarchy (`nodes.dmp`) via `loadNodeTable`.
- **Names**: Parses scientific and common names (`names.dmp`) via `loadNameTable`, with an option to load *only scientific names* (`onlysn`).
- **Merged Taxa**: Integrates taxonomic aliases from `merged.dmp`, using `loadMergedTable`.
   - Sets the root taxon to NCBI's default (`taxid = 1`, i.e., *root*).
3. **Integration with Other Modules**
- Uses `obiutils.Ropen`, `TarFileReader` for robust file handling.
- Leverages `obitax.Taxonomy`, a structured representation of taxonomic data.
### Key Parameters
- `onlysn`: If true, only scientific names are loaded (reduces memory usage).
- `seqAsTaxa`: Reserved for future use; currently unused.
### Logging & Error Handling
- Uses `logrus` to log loading progress and counts.
- Returns descriptive errors if required files or the root taxon are missing.
> **Note**: Designed for efficient, standards-compliant ingestion of NCBI Taxonomy data in bioinformatics pipelines.
# Newick Format Export Functionality in `obiformats`
This Go package provides utilities to export taxonomic data into the **Newick format**, a standard for representing phylogenetic trees.
## Core Components
- `Tree`: A struct modeling a node in a Newick tree, containing:
- `Children`: list of child nodes (nested trees),
- `TaxNode`: reference to a taxonomic entry (`obitax.TaxNode`),
- `Length`: optional branch length (evolutionary distance).
- **`Newick()` methods**:
- `Tree.Newick(...)`: Recursively generates a Newick string for the subtree.
Supports optional annotations: `scientific_name`, `taxid` (with `'@'` for rank), and branch lengths.
- Package-level `Newick(...)`: Converts a full taxon set into a Newick tree string using the root node from `taxa.Sort().Get(0)`.
- **Writing Functions**:
- `WriteNewick(...)`: Asynchronously writes the Newick representation to any `io.WriteCloser`.
- Accepts an iterator over taxa (`*obitax.ITaxon`).
- Validates single-taxonomy input.
- Applies compression (via `obiutils.CompressStream`) if configured via options (`WithOption`).
- `WriteNewickToFile(...)`: Convenience wrapper to write directly to a file.
- `WriteNewickToStdout(...)`: Outputs Newick tree to standard output.
## Configuration Options
Options (e.g., `WithScientificName`, `WithTaxid`, `WithRank`) control annotation content and behavior (e.g., file closing, compression).
## Semantic Summary
The module enables **conversion of hierarchical taxonomic datasets into structured Newick trees**, supporting rich node labeling for downstream phylogenetic or bioinformatic tools.
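The recursive generation can be sketched with a simplified `Tree`, where a plain `Label` field stands in for the `TaxNode` reference and the annotation options are omitted:

```go
package main

import (
	"fmt"
	"strings"
)

// Tree is a stripped-down analogue of the package's Tree struct.
type Tree struct {
	Label    string
	Length   float64
	Children []*Tree
}

// Newick renders the subtree recursively as "(child,child)label:length".
func (t *Tree) Newick(withLength bool) string {
	var b strings.Builder
	if len(t.Children) > 0 {
		parts := make([]string, len(t.Children))
		for i, c := range t.Children {
			parts[i] = c.Newick(withLength)
		}
		b.WriteString("(" + strings.Join(parts, ",") + ")")
	}
	b.WriteString(t.Label)
	if withLength {
		fmt.Fprintf(&b, ":%g", t.Length)
	}
	return b.String()
}

func main() {
	root := &Tree{Label: "root", Children: []*Tree{
		{Label: "Homo", Length: 0.1},
		{Label: "Pan", Length: 0.2},
	}}
	fmt.Println(root.Newick(true) + ";")
}
```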
# NGSFilter Configuration Parser — Semantic Overview
This Go package (`obiformats`) provides robust parsing and validation of NGS (Next-Generation Sequencing) filter configurations used in the OBITools4 ecosystem. It supports two formats: a legacy line-based text format (`ReadOldNGSFilter`) and modern CSV-based configuration files with parameter headers.
## Core Functionality
- **Format Detection**:
`OBIMimeNGSFilterTypeGuesser` detects MIME type using content sniffing (via [`mimetype`](https://github.com/gabriel-vasile/mimetype)), distinguishing between `text/csv`, custom `text/ngsfilter-csv`, and plain text.
A heuristic CSV detector (`NGSFilterCsvDetector`) validates structure (consistent column count, non-empty rows).
- **Dual Input Parsing**:
- `ReadOldNGSFilter`: Parses line-based config files (e.g., lines like `"EXP1@SAMPLE1:TAGFWD-TAGREV primer_f primer_r"`), supporting:
- Primer pairs (`forward`, `reverse`)
- Tag pairs (with optional `-` for untagged direction)
- Experiment/sample metadata
- OBIFeatures annotations (via `ParseOBIFeatures`)
- `ReadCSVNGSFilter`: Parses structured CSV files with mandatory columns:
`"experiment"`, `"sample"`, `"sample_tag"`, `"forward_primer"`, `"reverse_primer"`
Additional columns are stored as annotations.
- **Parameter Configuration**:
A rich set of `@param` lines (in CSV or legacy format) configures global/primers-specific settings:
- `spacer`, `forward_spacer`, `reverse_spacer`: Tag-primer spacing (bp)
- `tag_delimiter` / directional variants: Symbol separating tags in sequences
- `matching`: Tag matching algorithm (e.g., exact, fuzzy)
- Error tolerance:
`primer_mismatches`, `forward_mismatches`, `reverse_mismatches` (max mismatches)
`tag_indels`, `forward_tag_indels`, etc. (allow indel errors)
- Indel handling:
`indels` / directional variants (`true/false`) to enable/disable indels in primer matching
- **Validation & Integrity Checks**:
- `CheckPrimerUnicity`: Ensures each primer pair is defined only once.
- Duplicate tag-pair detection per marker (error on reuse).
- Strict column/field validation with informative error messages.
- **Logging & Observability**:
Uses `logrus` for detailed info/warnings (e.g., parameter application, skipped unknown params).
## Design Highlights
- **Extensibility**: New parameters can be added via `library_parameter` map.
- **Robustness**: Handles BOM, line continuation (`ReadLines`), CSV quirks (lazy quotes, comments).
- **Semantic Clarity**: Separates *data* (samples/markers/tags) from *configuration* (parameters).
- **Integration Ready**: Returns a validated `obingslibrary.NGSLibrary` ready for downstream processing.
> **Use Case**: Enables reproducible, metadata-rich NGS filtering setups in metabarcoding workflows.
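Parsing of the quoted legacy line shape might look like the following sketch; the `filterLine` type and field handling are illustrative, not the package's types:

```go
package main

import (
	"fmt"
	"strings"
)

type filterLine struct {
	Experiment, Sample           string
	ForwardTag, ReverseTag       string
	ForwardPrimer, ReversePrimer string
}

// parseLine splits the "EXP@SAMPLE:FWD-REV primer_f primer_r" shape;
// an empty tag after the '-' marks an untagged direction.
func parseLine(s string) (filterLine, error) {
	var fl filterLine
	fields := strings.Fields(s)
	if len(fields) != 3 {
		return fl, fmt.Errorf("expected 3 fields, got %d", len(fields))
	}
	head, tags, ok := strings.Cut(fields[0], ":")
	if !ok {
		return fl, fmt.Errorf("missing ':' in %q", fields[0])
	}
	fl.Experiment, fl.Sample, ok = strings.Cut(head, "@")
	if !ok {
		return fl, fmt.Errorf("missing '@' in %q", head)
	}
	fl.ForwardTag, fl.ReverseTag, _ = strings.Cut(tags, "-")
	fl.ForwardPrimer, fl.ReversePrimer = fields[1], fields[2]
	return fl, nil
}

func main() {
	fl, _ := parseLine("EXP1@SAMPLE1:aacaagg-ttgcc GGGCAATCCTGAGCCAA CCATTGAGTCTCTGCACCTATC")
	fmt.Printf("%+v\n", fl)
}
```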
# Semantic Description of `obiformats` Package Functionalities
The Go package `obiformats` provides a flexible, configuration-driven framework for handling biological sequence data (e.g., FASTA/FASTQ) and associated metadata. Its core component is the `Options` type, which encapsulates user-defined settings via an immutable configuration pattern using functional setters (`WithOption`).
Key capabilities include:
- **I/O control**: file handling options (e.g., `OptionCloseFile`, `OptionsAppendFile`), compression support (`OptionsCompressed`), and batch processing modes (e.g., `FullFileBatch`, custom `BatchSize`).
- **Parallelism & performance tuning**: configurable number of workers (`OptionsParallelWorkers`) and memory buffer size (via `TotalSeqSize`).
- **Sequence parsing/formatting**: pluggable header parsers/writers for FASTA/FASTQ (e.g., `OptionsFastSeqHeaderParser`, `OptionFastSeqDoNotParseHeader`), with support for quality scores (`OptionsReadQualities`).
- **CSV export**: granular control over columns (ID, sequence, quality, taxon, count), separators (`CSVSeparator`), NA values (`CSVNAValue`), and auto-inferred keys (`CSVAutoColumn`).
- **Taxonomic metadata integration**: toggles for taxid, scientific name, rank, path (with/without root), parent relationships (`OptionsWithTaxid`, `OptionWithoutRootPath`), and U→T conversion for ambiguous bases.
- **Advanced features**: feature table inclusion (`WithFeatureTable`), pattern matching support (`OptionsWithPattern`), and paired-end read handling via `WritePairedReadsTo`.
- **Metadata extensibility**: arbitrary metadata fields can be attached via `OptionsWithMetadata`, with automatic cleanup (e.g., removal of `"query"` when pattern mode is active).
All options are initialized with sensible defaults (e.g., `batch_size`, `parallel_workers`) and can be composed using the `MakeOptions` constructor. This design enables declarative, reusable configuration across sequence processing pipelines in OBITools4.
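The functional-setter pattern can be sketched as follows; the field names and defaults are illustrative:

```go
package main

import "fmt"

// Options mimics the configuration pattern: a private struct mutated only
// through WithOption setters passed to a constructor.
type Options struct {
	batchSize       int
	parallelWorkers int
	compressed      bool
}

type WithOption func(*Options)

func OptionsBatchSize(n int) WithOption       { return func(o *Options) { o.batchSize = n } }
func OptionsParallelWorkers(n int) WithOption { return func(o *Options) { o.parallelWorkers = n } }
func OptionsCompressed(b bool) WithOption     { return func(o *Options) { o.compressed = b } }

// MakeOptions applies sensible defaults, then each setter in order.
func MakeOptions(setters []WithOption) Options {
	o := Options{batchSize: 5000, parallelWorkers: 4}
	for _, set := range setters {
		set(&o)
	}
	return o
}

func main() {
	o := MakeOptions([]WithOption{OptionsBatchSize(100), OptionsCompressed(true)})
	fmt.Println(o.batchSize, o.parallelWorkers, o.compressed)
}
```

Because setters compose, call sites declare only what differs from the defaults, which is what makes the configuration reusable across pipelines.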
# `ropeScanner` — Line-by-Line Text Scanning over a Rope Data Structure
The `obiformats` package provides the `ropeScanner`, an efficient line-oriented iterator over a *Rope* (a tree-based immutable string representation, implemented here as `PieceOfChunk`). This scanner supports streaming large texts without full materialization.
## Core Functionality
- **`newRopeScanner(rope *PieceOfChunk)`**
Constructs a new scanner starting at the root of the rope.
- **`ReadLine() []byte`**
  Returns the next line (without its trailing `\n` or `\r\n`) as a byte slice.
- Returns `nil` when the end of the rope is reached.
- Reuses internal buffers (`carry`) to handle lines spanning multiple nodes efficiently.
- The returned slice aliases rope data and is only valid until the next call.
- **`skipToNewline()`**
Advances internal position to just after the next newline (`\n`), discarding content. Useful for skipping unwanted lines or headers.
## Implementation Highlights
- **Buffered carry-over**: Lines split across rope nodes are assembled incrementally in the `carry` buffer, which grows dynamically.
- **Cross-platform line endings**: Automatically strips `\r\n`, leaving only the content (no trailing CR).
- **Zero-copy where possible**: When a line fits entirely within one node and no carry exists, it returns a slice directly into the rope's underlying data.
## Use Case
Ideal for parsing large text files or streams (e.g., OBIE/Obi formats) where memory efficiency and streaming behavior are critical—without loading the entire content into RAM.
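The carry-over mechanism can be sketched with a flat slice of byte chunks standing in for the rope; the names are illustrative:

```go
package main

import (
	"bytes"
	"fmt"
)

// chunkScanner iterates lines over non-contiguous byte chunks, assembling
// lines that span chunk boundaries in a carry buffer — the same idea as
// ropeScanner, with [][]byte standing in for the rope.
type chunkScanner struct {
	chunks [][]byte
	carry  []byte
}

// ReadLine returns the next line without its trailing "\n" or "\r\n",
// or nil at end of input.
func (s *chunkScanner) ReadLine() []byte {
	for len(s.chunks) > 0 {
		cur := s.chunks[0]
		if i := bytes.IndexByte(cur, '\n'); i >= 0 {
			line := append(s.carry, cur[:i]...)
			s.carry = nil
			s.chunks[0] = cur[i+1:]
			return bytes.TrimSuffix(line, []byte{'\r'})
		}
		// No newline in this chunk: carry it over and move on.
		s.carry = append(s.carry, cur...)
		s.chunks = s.chunks[1:]
	}
	if len(s.carry) > 0 {
		line := s.carry
		s.carry = nil
		return bytes.TrimSuffix(line, []byte{'\r'})
	}
	return nil
}

func main() {
	s := &chunkScanner{chunks: [][]byte{[]byte("hel"), []byte("lo\r\nwor"), []byte("ld\n")}}
	for line := s.ReadLine(); line != nil; line = s.ReadLine() {
		fmt.Println(string(line))
	}
}
```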
# Taxonomy Loading Module (`obiformats`)
This Go package provides semantic functionality to automatically detect and load taxonomic data from various file formats. It supports flexible, format-agnostic taxonomy ingestion via a unified interface.
## Core Features
1. **Format Detection**
- `DetectTaxonomyFormat(path)` identifies the taxonomy source format by inspecting file type (directory, MIME-type), filename patterns, or structure.
- Supports:
• NCBI Taxdump (both directory and `.tar` archive)
• CSV files (`text/csv`)
• FASTA/FASTQ sequences (via `mimetype` detection)
2. **Modular Loaders**
- Returns a typed `TaxonomyLoader` function, enabling deferred loading with configurable options (`onlysn`, `seqAsTaxa`).
- Each loader abstracts format-specific parsing (e.g., NCBI `nodes.dmp`, FASTA header taxonomy extraction).
3. **Sequence-Based Taxonomy Extraction**
- For sequence files (FASTA/FASTQ), taxonomy is inferred from headers or associated metadata, using `ExtractTaxonomy()`.
4. **Integration with OBITools Ecosystem**
- Leverages `obitax.Taxonomy` as the canonical output structure.
- Uses custom MIME-type registration (`obiutils.RegisterOBIMimeType()`) for robust detection of bioinformatics formats.
5. **Error Handling & Logging**
- Graceful failure with descriptive errors; informative logging via `logrus`.
## Usage Flow
```go
tax, err := LoadTaxonomy("path/to/data", true, false) // onlysn=true, seqAsTaxa=false
```
The module enables interoperability across taxonomic data sources in metabarcoding workflows.
# OBIFORMATS Package: Semantic Description
The `obiformats` package provides robust, format-agnostic sequence reading capabilities for biological data in the OBITools4 ecosystem.
It supports automatic detection and parsing of common bioinformatics file formats via MIME-type inference:
- **FASTA** (`text/fasta`): identified by lines starting with `>`.
- **FASTQ** (`text/fastq`): detected via leading `@` characters.
- **ecoPCR2**: recognized by the header line `#@ecopcr-v2`.
- **EMBL** (`text/embl`): detected by lines starting with `ID `.
- **GenBank** (`text/genbank`): identified by either `LOCUS ` or legacy `"Genetic Sequence Data Bank"` headers.
- **CSV** (`text/csv`): generic tabular support.
Core functionality is exposed through:
- `OBIMimeTypeGuesser()`: inspects the first ~1 MiB of an input stream to infer MIME type using `github.com/gabriel-vasile/mimetype`, while preserving unread data for downstream processing.
- `ReadSequencesFromFile()`: reads sequences from a file path, infers format via MIME detection, and dispatches to dedicated parsers (e.g., `ReadFasta`, `ReadFastq`).
- `ReadSequencesFromStdin()`: convenience wrapper to read from stdin, treating `"-"` as filename and auto-closing the stream.
Internally leverages:
- `obiutils.Ropen()` for unified file opening (including stdin handling).
- Path extension stripping and source tagging via `OptionsSource()`.
- Logging (`logrus`) for format diagnostics.
- Iterator interface (`obiiter.IBioSequence`) to abstract sequential access over sequences.
The package ensures extensibility: new formats can be added by extending the `switch` dispatch in `ReadSequencesFromFile()` and registering corresponding MIME types.
Error handling covers empty files, invalid streams, and unsupported formats via explicit logging or fatal exits.
# `obiformats` Package: Sequence Writing Utilities
This Go package provides utilities for writing biological sequence data to files or standard output in FASTA/FASTQ formats.
## Core Functionality
- **`WriteSequence()`**:
Main dispatcher that detects sequence quality data and writes either FASTQ (if qualities present) or FASTA.
- Accepts an `IBioSequence` iterator, a writable stream (`io.WriteCloser`), and optional configuration.
- Preserves iterator state via `PushBack()` to allow chaining.
- **`WriteSequencesToStdout()`**:
Convenience wrapper writing sequences to `stdout`. Automatically closes the output stream.
- **`WriteSequencesToFile()`**:
Writes sequences to a specified file. Supports:
- File creation/truncation or append mode (`OptionAppendFile()`).
- Paired-end output: writes mate pairs to a second file if `OptionSavePaired()` is enabled.
## Design Highlights
- **Format-Aware Dispatch**: Automatically selects FASTQ vs. FASTA based on presence of quality scores (`HasQualities()`).
- **Iterator Preservation**: Ensures non-consumed sequences remain available after write operations.
- **Error Handling & Logging**: Uses `logrus` for fatal errors during file I/O; returns structured error codes.
- **Configurable Options**: Extensible via `WithOption` pattern (e.g., append mode, paired-end handling).
## Integration
Designed for use within the OBITools4 ecosystem—works with `obiiter.IBioSequence` iterators to support streaming, memory-efficient processing of large sequencing datasets.
## Uint128 Type in `obifp`: Semantic Overview
This Go package defines a custom 128-bit unsigned integer type (`Uint128`) composed of two `uint64` limbs (high and low). It provides comprehensive arithmetic, comparison, bitwise operations, and type conversions.
- **Basic Constructors**: `Zero()`, `MaxValue()` initialize the smallest/largest possible values.
- **State Checks**: `IsZero()`, and equality/comparison methods (`Equals`, `Cmp`, `<`, `>`, etc.) enable conditional logic.
- **Type Casting**: Safe conversions to/from smaller (`Uint64`, `uint64`) and larger (`Uint256`) integer types, with overflow warnings where applicable.
- **Arithmetic**: Full support for addition (`Add`, `Add64`), subtraction (`Sub`), multiplication (`Mul`, `Mul64`) — with panic on overflow.
- **Division & Modulo**: Integer division (`Div`, `Div64`) and remainder (`Mod`, `Mod64`), implemented via optimized quotient-remainder pairs (`QuoRem`, `QuoRem64`) using hardware-assisted 64-bit operations.
- **Bit Manipulation**: Left/right shifts (`LeftShift`, `RightShift`), and bitwise logic: AND, OR, XOR, NOT.
- **Utility**: Direct access to low limb via `AsUint64()`.
All operations preserve 128-bit precision, with strict overflow checking for correctness in high-precision contexts (e.g., bioinformatics counting).
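The carry-propagating arithmetic can be sketched with `math/bits`; this is a minimal two-limb sketch following the layout and strict-overflow policy described above, not the package's implementation:

```go
package main

import (
	"fmt"
	"math/bits"
)

// Uint128 mirrors the two-limb layout (w1 high, w0 low).
type Uint128 struct{ w1, w0 uint64 }

// Add propagates the carry from the low limb into the high limb and panics
// if the final carry is non-zero.
func (u Uint128) Add(v Uint128) Uint128 {
	lo, carry := bits.Add64(u.w0, v.w0, 0)
	hi, carry := bits.Add64(u.w1, v.w1, carry)
	if carry != 0 {
		panic("Uint128 overflow")
	}
	return Uint128{w1: hi, w0: lo}
}

// Mul64 multiplies by a 64-bit scalar using the double-width primitive.
func (u Uint128) Mul64(v uint64) Uint128 {
	hi, lo := bits.Mul64(u.w0, v)
	hi2, lo2 := bits.Mul64(u.w1, v)
	if hi2 != 0 {
		panic("Uint128 overflow")
	}
	hi, carry := bits.Add64(hi, lo2, 0)
	if carry != 0 {
		panic("Uint128 overflow")
	}
	return Uint128{w1: hi, w0: lo}
}

func main() {
	x := Uint128{w0: ^uint64(0)}        // 2^64 - 1
	fmt.Println(x.Add(Uint128{w0: 1})) // the carry lands in the high limb
}
```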
# `obifp.Uint128` Package — Semantic Feature Overview
This Go package provides a 128-bit unsigned integer type (`Uint128`) with comprehensive arithmetic, comparison, and bitwise operations. Internally represented as two `uint64` limbs (`w1`: high, `w0`: low), it supports:
- **Arithmetic Operations**
- `Add`, `Sub`, `Mul` (128×128), and `Mul64` (scalar multiplication)
- Division: `Div`, `Mod`, and combined quotient/remainder via `QuoRem` (and their 64-bit variants)
- **Comparison & Equality**
- `Cmp`, `Equals`, `LessThan`/`GreaterThan`, and their inclusive variants (`≤`, `≥`)
- Support for comparing against both `Uint128` and native `uint64` values
- **Bitwise Operations**
- Logical AND (`And`), OR (`Or`), XOR (`Xor`) between two `Uint128`s
- Bitwise NOT (`Not`) — inverts all bits of the value
- **Conversion & Utility**
- `AsUint64()` safely truncates to lower 64 bits (assumes upper limb is zero)
All operations handle overflow/underflow correctly, including carry propagation in addition and borrow handling in subtraction. Tests cover edge cases: zero values, max `uint64` boundaries (e.g., wrapping in addition/subtraction), and large multiplications. Designed for cryptographic or high-precision numeric use where native integer types are insufficient.
# Uint256 Type and Operations — Semantic Overview
The `obifp` package provides a custom 256-bit unsigned integer type (`Uint256`) implemented in Go, composed of four 64-bit limbs (`w0` to `w3`). It supports arithmetic, comparison, bitwise operations, and safe casting with overflow detection.
- **Core Representation**: `Uint256` stores values as four 64-bit words, enabling arbitrary-precision unsigned integers up to $2^{256} - 1$.
- **Utility Methods**:
- `Zero()` / `MaxValue()`: Return the neutral and maximum values.
- `IsZero()`, `Equals(v)`, comparison methods (`LessThan`, etc.): Enable logical and ordering checks.
- **Casting & Conversion**:
- `Uint64()`, `Uint128()` downcast with warnings on overflow.
- `Set64(v)`: Initializes from a standard `uint64`.
- `AsUint64()`: Direct access to least-significant limb.
- **Bitwise Operations**:
- `And`, `Or`, `Xor`, `Not`: Standard bitwise logic per limb.
- **Shifts**:
- `LeftShift(n)` / `RightShift(n)`: Multi-limb shifts with carry propagation.
- **Arithmetic**:
  - `Add(v)`, `Sub(v)` / `Mul(v)`: Use Go's `math/bits` for carry-aware operations; panic on overflow.
- `Div(v)`: Implements long division via repeated subtraction of shifted multiples; panics on zero divisor.
- **Safety & Logging**:
- Warnings via `obilog.Warnf` for silent overflows during narrowing casts.
- Panics on arithmetic overflow or division-by-zero using `log.Panicf`.
This type is suitable for cryptographic, genomic (OBITools), or high-precision counting use cases requiring precise control over large unsigned integers.
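The multi-limb shift with carry propagation might be implemented along these lines; this is a sketch of the technique, not the package's code:

```go
package main

import "fmt"

// Uint256 mirrors the four-limb layout (w0 least significant .. w3 most).
type Uint256 struct{ w0, w1, w2, w3 uint64 }

// LeftShift shifts by n bits: whole limbs first, then the remaining bits
// are shifted across limb boundaries with explicit carry propagation.
func (u Uint256) LeftShift(n uint) Uint256 {
	limbs := [4]uint64{u.w0, u.w1, u.w2, u.w3}
	for ; n >= 64; n -= 64 { // whole-limb moves
		limbs = [4]uint64{0, limbs[0], limbs[1], limbs[2]}
	}
	if n > 0 {
		carry := uint64(0)
		for i := range limbs {
			limbs[i], carry = limbs[i]<<n|carry, limbs[i]>>(64-n)
		}
	}
	return Uint256{limbs[0], limbs[1], limbs[2], limbs[3]}
}

func main() {
	one := Uint256{w0: 1}
	fmt.Printf("%#x\n", one.LeftShift(64).w1) // the set bit crosses one limb
}
```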
# Uint64 Type Functionalities Overview
The `obifp` package provides a custom `Uint64` type wrapping Go's native 64-bit unsigned integer (`uint64`) to support arithmetic, bitwise operations, and type conversions in a structured way.
## Core Operations
- **`Zero()` / `MaxValue()`**: Returns the zero and maximum representable values, respectively.
- **`IsZero()` / `Equals(v)`**: Checks if the value is zero or equal to another.
- **`Cmp(v)`, `LessThan(v)`**, etc.: Standard comparison operations returning `-1/0/+1` or boolean results.
## Arithmetic with Overflow Detection
- **Add/Sub/Mul**: Performs 64-bit addition, subtraction, and multiplication.
- Uses `math/bits` for low-level operations (`bits.Add64`, etc.).
- Panics on overflow (carry ≠ 0), enforcing strict safety.
## Bitwise Operations
- **`And`, `Or`, `Xor`, `Not()`**: Standard bitwise logic operations.
- **`LeftShift(n)` / `RightShift(n)`**:
- Shifts bits left/right by *n* positions.
- Uses internal `LeftShift64`/`RightShift64`, supporting *carry-in* for multi-word arithmetic.
## Extended Precision Conversions
- **`Uint128()` / `Uint256()`**: Casts the 64-bit value into larger unsigned integer types (zero-extended).
- **`Set64(v)`**: Reassigns the internal value from a raw `uint64`.
## Utility & Logging
- **`AsUint64()`**: Extracts the underlying `uint64`.
- **Warning on overflow in shift operations** (e.g., shifts ≥ 128 bits) via `obilog.Warnf`.
> Designed for use in high-precision or cryptographic contexts where explicit overflow handling and type safety are critical.
# Obifp Package: Generic Fixed-Point Unsigned Integer Operations
This Go package (`obifp`) provides a generic, type-safe interface for fixed-point unsigned integer arithmetic over three size variants: `Uint64`, `Uint128`, and `Uint256`.
## Core Interface: `FPUint[T]`
The interface defines a unified API for unsigned integer types, supporting:
- **Initialization & Conversion**:
- `Zero()`, `Set64(v)`: Create zero or set from a `uint64`.
- `AsUint64()`: Downcast to standard `uint64`.
- **Logical Operations**:
- Bitwise: `And`, `Or`, `Xor`, `Not`.
- Shifts: `LeftShift(n)`, `RightShift(n)`.
- **Arithmetic**:
- Addition (`Add`), subtraction (`Sub`), multiplication (`Mul`). Division is commented out—likely reserved for future implementation.
- **Comparison**:
- Full ordering: `<`, `<=`, `>`, `>=`.
- **Utility Predicates**:
- `IsZero()` for zero-checking.
## Helper Functions
- `ZeroUint[T]`: Returns the neutral element (zero) for type `T`.
- `OneUint[T]`: Constructs value 1 via `Set64(1)`.
- `From64[T]`: Converts a standard Go `uint64` into the generic type.
All operations are **method-chaining friendly** (return `T`, not pointers), enabling fluent syntax. The design promotes correctness and performance in cryptographic or financial contexts where large, fixed-size integers are required.
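The self-referential constraint that makes chaining possible can be sketched as follows; the reduced interface and the `U64` implementation are illustrative:

```go
package main

import "fmt"

// FPUint is a reduced version of the constraint: any type whose methods
// return T allows fluent, width-agnostic generic code.
type FPUint[T any] interface {
	Set64(v uint64) T
	Add(v T) T
	AsUint64() uint64
	IsZero() bool
}

// U64 is a toy single-limb implementation satisfying FPUint[U64].
type U64 struct{ v uint64 }

func (u U64) Set64(v uint64) U64 { return U64{v} }
func (u U64) Add(o U64) U64      { return U64{u.v + o.v} }
func (u U64) AsUint64() uint64   { return u.v }
func (u U64) IsZero() bool       { return u.v == 0 }

// From64 works for every implementation: the zero value of T is a usable
// receiver because all methods are value methods.
func From64[T FPUint[T]](v uint64) T {
	var zero T
	return zero.Set64(v)
}

// Sum is written once, against the constraint, not a concrete width.
func Sum[T FPUint[T]](xs []T) T {
	var acc T
	for _, x := range xs {
		acc = acc.Add(x)
	}
	return acc
}

func main() {
	xs := []U64{From64[U64](1), From64[U64](2), From64[U64](3)}
	fmt.Println(Sum(xs).AsUint64())
}
```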
# `obigraph` Package: Semantic Overview
The `obigraph` package provides a generic, type-safe undirected/directed graph implementation in Go. Its core features include:
- **Generic Graph Structure**: Parametrized over vertex type `V` and edge data type `T`, enabling flexible use with arbitrary user-defined types.
- **Bidirectional Edge Tracking**: Maintains both forward (`Edges`) and reverse (`ReverseEdges`) adjacency maps for efficient neighbor/parent queries.
- **Edge Management**:
- `AddEdge`: Adds an *undirected* edge (inserted in both directions).
- `AddDirectedEdge`: Adds a *directed* edge (only one direction).
- `SetAsDirectedEdge`: Converts an existing undirected edge into a directed one by removing the reverse link.
- **Graph Queries**:
- `Neighbors(v)`: Returns all adjacent vertices (outgoing in directed case).
- `Parents(v)`: Returns incoming neighbors via reverse adjacency.
- `Degree(v)` / `ParentDegree(v)`: Compute vertex degrees (total or incoming).
- **Customizable Vertex/Edge Properties**:
- `VertexWeight`, `EdgeWeight`: Funcs to assign weights (default: constant weight = 1.0).
- `VertexId`: Custom vertex label generator (default: `"V%d"`).
- **GML Export**:
- `Gml(...)` / `WriteGml(...)`: Generates or writes a Graph Modelling Language (GML) representation.
- Supports directed/undirected modes, degree-based filtering (`min_degree`), and visual styling:
- Vertex shape: `circle` if weight ≥ threshold, else `rectangle`.
- Size scaled by square root of vertex weight.
  - Uses Go's `text/template` for rendering.
- **File I/O**: Directly writes GML to file via `WriteGmlFile(...)`.
- **Logging & Safety**: Uses Logrus for bounds-checking errors; panics on template parsing/writing failures.
The package is designed for lightweight, high-performance graph modeling and visualization-ready export.
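The dual adjacency-map layout can be sketched generically as a minimal subset of the API described above:

```go
package main

import "fmt"

// Graph keeps both forward and reverse adjacency; V is the vertex index
// type and T the per-edge payload.
type Graph[V comparable, T any] struct {
	Edges        map[V]map[V]T
	ReverseEdges map[V]map[V]T
}

func NewGraph[V comparable, T any]() *Graph[V, T] {
	return &Graph[V, T]{Edges: map[V]map[V]T{}, ReverseEdges: map[V]map[V]T{}}
}

func (g *Graph[V, T]) AddDirectedEdge(from, to V, data T) {
	if g.Edges[from] == nil {
		g.Edges[from] = map[V]T{}
	}
	if g.ReverseEdges[to] == nil {
		g.ReverseEdges[to] = map[V]T{}
	}
	g.Edges[from][to] = data
	g.ReverseEdges[to][from] = data
}

// AddEdge inserts the link in both directions (undirected semantics).
func (g *Graph[V, T]) AddEdge(a, b V, data T) {
	g.AddDirectedEdge(a, b, data)
	g.AddDirectedEdge(b, a, data)
}

func (g *Graph[V, T]) Degree(v V) int       { return len(g.Edges[v]) }
func (g *Graph[V, T]) ParentDegree(v V) int { return len(g.ReverseEdges[v]) }

func main() {
	g := NewGraph[int, float64]()
	g.AddEdge(1, 2, 0.5)
	g.AddDirectedEdge(3, 1, 1.0)
	fmt.Println(g.Degree(1), g.ParentDegree(1))
}
```

Keeping the reverse map up to date on every insertion is what makes `Parents` and `ParentDegree` O(1) lookups rather than full scans.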
# `obigraph.GraphBuffer` Feature Overview
The `GraphBuffer[V, T]` type provides a **thread-safe graph construction interface** using buffered edge insertion via Go channels.
- **Asynchronous Edge Addition**: Edges are enqueued through a `chan Edge[T]`, processed in the background by a goroutine that updates an underlying static graph (`Graph[V, T]`).
- **Non-blocking API**: `AddEdge` and `AddDirectedEdge` are asynchronous — they send to the channel without waiting for the graph mutation, enabling high-throughput edge ingestion.
- **Graph Initialization**: `NewGraphBuffer` initializes both the graph and a dedicated worker goroutine to consume edges.
- **GML Export Support**: Full support for exporting the final graph in [Graph Modelling Language (GML)](https://en.wikipedia.org/wiki/Graph_Modelling_Language), with optional filtering (`min_degree`) and layout parameters (`threshold`, `scale`).
- **File & Stream Output**: Methods `WriteGml` and `WriteGmlFile` allow writing GML to any `io.Writer`, including files.
- **Resource Cleanup**: The explicit `Close()` method terminates the worker goroutine by closing the channel, ensuring clean shutdown.
- **Generic Design**: Fully generic over vertex (`V`) and edge data types (`T`), supporting arbitrary value semantics.
> ⚠️ **Note**: Synchronization rests entirely on channel semantics; callers must ensure no `AddEdge` call can race with `Close()`, since sending on a closed channel panics.
> ✅ Ideal for producer-consumer patterns where edges are streamed from multiple goroutines into a single graph.
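The producer-consumer pattern behind the buffer can be sketched with simplified, non-generic types (the real `GraphBuffer` is generic and GML-aware):

```go
package main

import (
	"fmt"
	"sync"
)

type edge struct{ from, to int }

// edgeBuffer: producers send edges into a channel; a single worker
// goroutine owns the adjacency map, so no mutex guards the graph itself.
type edgeBuffer struct {
	ch   chan edge
	done sync.WaitGroup
	adj  map[int][]int
}

func newEdgeBuffer() *edgeBuffer {
	b := &edgeBuffer{ch: make(chan edge, 100), adj: map[int][]int{}}
	b.done.Add(1)
	go func() {
		defer b.done.Done()
		for e := range b.ch { // worker: the sole writer of adj
			b.adj[e.from] = append(b.adj[e.from], e.to)
		}
	}()
	return b
}

func (b *edgeBuffer) AddDirectedEdge(from, to int) { b.ch <- edge{from, to} }

// Close stops the worker by closing the channel and waits for it to drain.
func (b *edgeBuffer) Close() {
	close(b.ch)
	b.done.Wait()
}

func main() {
	b := newEdgeBuffer()
	for i := 1; i <= 3; i++ {
		b.AddDirectedEdge(0, i)
	}
	b.Close() // after Close returns, adj is safe to read
	fmt.Println(len(b.adj[0]))
}
```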
# BioSequenceBatch: A Container for Ordered Biological Sequences
`BioSequenceBatch` is a structured data type encapsulating an ordered collection of biological sequences (`obiseq.BioSequenceSlice`) along with metadata: a `source` identifier and an integer `order`. It serves as a lightweight, immutable-friendly container for batch processing in bioinformatics pipelines.
## Core Properties
- **`source`**: String identifying the origin (e.g., file, pipeline stage).
- **`order`**: Integer defining processing sequence or priority.
- **`slice`**: Holds the actual sequences via `obiseq.BioSequenceSlice`.
## Key Functionalities
- **Construction**:
`MakeBioSequenceBatch(source, order, sequences)` creates a new batch.
- **Accessors**:
`Source()`, `Order()` return metadata; `Slice()` exposes the sequence slice.
- **Mutation (via copy)**:
`Reorder(newOrder)` returns a new batch with updated order.
- **Size & emptiness**:
`Len()` gives sequence count; `NotEmpty()` checks non-emptiness.
- **Consumption**:
`Pop0()` removes and returns the first sequence (FIFO behavior).
- **Safety**:
`IsNil()` detects uninitialized batches; a global `NilBioSequenceBatch` sentinel exists.
## Design Notes
- Instances are value types (struct), enabling safe copying.
- Operations follow Go idioms: methods return updated values rather than mutating in place (except internal slice mutation via `Pop0`).
- Designed for interoperability with the OBITools4 ecosystem (`obiseq` package).
This abstraction supports modular, traceable sequence processing workflows—ideal for pipeline stages where ordering and provenance matter.
# `obiiter`: Stream-Based Biosequence Iterator Library
This Go package provides a concurrent, batch-oriented iterator for processing large collections of biological sequences (`BioSequence`), designed for high-throughput NGS data pipelines.
## Core Functionality
- **Batched Streaming**: Reads sequences in configurable batches (`BioSequenceBatch`) via a channel-based iterator.
- **Thread Safety**: Uses `sync.WaitGroup`, RWMutex, and atomic flags for safe concurrent access.
- **Lazy Evaluation**: Iteration is on-demand via `Next()`/`Get()`, supporting memory-efficient processing.
## Iterator Management
- **Construction**: `MakeIBioSequence()` initializes a new iterator with default settings.
- **Lifecycle Control**:
- `Add(n)`, `Done()`: Track active workers (like goroutines).
- `Lock/RLock` and `Unlock/RUnlock`: Explicit synchronization.
- `Wait()` / `Close()`, `WaitAndClose()`: Graceful shutdown.
## Batch Transformation & Reorganization
- **`Rebatch(size)`**: Redistributes sequences into fixed-size batches (requires sorting).
- **`RebatchBySize(maxBytes, maxCount)`**: Dynamic batching respecting memory and count limits.
- **`SortBatches()`**: Ensures batches are emitted in strict order (by `order` field).
- **Concatenation & Pooling**:
- `Concat(...)`: Sequentially merges multiple iterators.
- `Pool(...)`: Interleaves batches from several sources (preserves order via renumbering).
## Filtering & Predicate-Based Processing
- **`FilterOn(pred, size)`**: Applies a sequence predicate in parallel (configurable workers), recycling discarded sequences.
- **`FilterAnd(pred, size)`**: Same as `FilterOn`, but also checks paired-end consistency.
- **`DivideOn(pred, size)`**: Splits input into two iterators (`true`, `false`) based on predicate.
## Utility & Analysis
- **`Load()`**: Collects all sequences into a single slice (for small datasets).
- **`Count(recycle)`**: Returns `(variants, reads, nucleotides)`.
- **`Consume()` / `Recycle()`**: Drains iterator, optionally triggering sequence recycling.
- **`CompleteFileIterator()`**: Reads entire remaining file as one batch.
## Additional Features
- Supports **paired-end data** via `MarkAsPaired()` / `IsPaired()`.
- Batch ordering preserved for downstream reproducibility.
- Integrates with OBITools4's `obidefault` and `obiutils` packages for configuration and resource management.
> Designed for scalability, low memory footprint, and composability in bioinformatics workflows.
+32
View File
@@ -0,0 +1,32 @@
# `IDistribute`: Semantic Description of Biosequence Distribution Functionality
The `IDistribute` type implements a thread-safe mechanism for distributing biosequences into classified, batched outputs.
- **Core Purpose**: Enables concurrent processing of sequences by routing them to dedicated output channels based on classification keys.
- **Key Fields**:
- `outputs`: A map from integer class codes to output streams (`IBioSequence`).
- `news`: An unbuffered channel emitting class codes when new output streams are created.
- `classifier`: A pointer to a sequence classifier used to assign sequences to keys during distribution.
- **Thread Safety**: All access to shared state (`outputs`, `slices`) is synchronized via a mutex.
- **Batching Strategy**:
- Sequences are accumulated per class key until either `BatchSizeMax()` sequences or `BatchMem()` bytes (per key) are reached.
- Batches are flushed automatically and on finalization.
- **Asynchronous Processing**:
- The `Distribute()` method launches a goroutine that consumes the input iterator, classifies each sequence, and feeds batches to per-key outputs.
- Outputs are closed only after all sequences have been processed.
- **Notifications**:
- The `News()` channel allows consumers to be notified of newly created output streams (i.e., when a new class key appears).
- **Error Handling**:
- `Outputs(key)` returns an error if the requested key has no associated output.
- **Integration**:
- Leverages `obidefault.BatchSizeMax()` and `BatchMem()` for configurable batch limits.
- Uses `SortBatches()` on the input iterator to ensure ordered processing.
In summary, `IDistribute` provides a scalable, concurrent pipeline for classifying and batching biosequences based on user-defined classification logic.
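The core routing idea — per-key outputs created lazily on first sight of a class code, with new keys announced as they appear — can be sketched without concurrency. The `distribute` helper and slice-based outputs below are simplified stand-ins for `IDistribute`'s channel-backed streams and `News()` notifications.

```go
package main

import "fmt"

// distribute routes records into per-class output slices, creating an
// output lazily the first time a class key appears and recording that
// key in news (the role of the News() channel in IDistribute).
func distribute(classify func(string) int, in []string) (map[int][]string, []int) {
	outputs := map[int][]string{}
	var news []int
	for _, s := range in {
		k := classify(s)
		if _, ok := outputs[k]; !ok {
			outputs[k] = nil
			news = append(news, k) // a new class key appeared
		}
		outputs[k] = append(outputs[k], s)
	}
	return outputs, news
}

func main() {
	byLen := func(s string) int { return len(s) }
	outs, news := distribute(byLen, []string{"AC", "ACGT", "GT", "TTTT"})
	fmt.Println(news) // class keys in order of first appearance
	fmt.Println(outs[2], outs[4])
}
```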
@@ -0,0 +1,24 @@
# `ExtractTaxonomy` Function — Semantic Description
The `ExtractTaxonomy` method is a core utility in the `obiiter` package, designed to aggregate taxonomic information across biological sequences processed by an iterator.
- **Input**:
- A pointer to `IBioSequence`, representing a sequence iterator over biological data.
- A boolean flag `seqAsTaxa`: if true, each full sequence is treated as a single taxonomic unit; otherwise, individual elements within slices are processed separately.
- **Process**:
- Iterates through all sequences via `iterator.Next()` and retrieves each current slice using `Get().Slice()`.
- For every slice, it calls the underlying `.ExtractTaxonomy()` method (from `obitax`), progressively building or updating a shared `*obitax.Taxonomy` object.
- Stops and returns immediately upon encountering the first error during taxonomy extraction.
- **Output**:
- Returns a fully populated `*obitax.Taxonomy` object (or partial result if early failure occurs).
- Returns `nil` error on success; otherwise, returns the first encountered error.
- **Semantic Role**:
Enables scalable taxonomic profiling of high-throughput sequencing data by delegating per-slice extraction logic to the `obitax` module, while ensuring robust iteration and error handling.
- **Dependencies**:
Relies on `obitax.Taxonomy` for structured taxonomic representation and assumes slices implement the `.ExtractTaxonomy()` interface.
This function exemplifies a *map-reduce*-style pattern: mapping taxonomy extraction over slices, and reducing results into a unified taxonomic summary.
+28
View File
@@ -0,0 +1,28 @@
# `IFragments` Functionality Overview
The `IFragments()` function in the `obiiter` package implements a parallelized sequence fragmentation pipeline for biological sequences. It is designed to split long nucleotide or protein sequences into smaller, overlapping fragments while preserving metadata and enabling concurrent processing.
## Core Parameters
- `minsize`: Sequences of this length or shorter are passed through unfragmented.
- `length`: Desired fragment size (in bases/amino acids).
- `overlap`: Number of overlapping residues between consecutive fragments.
- `size`, `nworkers`: Batch size and number of worker goroutines (currently unused in active logic).
## Workflow
1. **Batch Sorting**: Input sequences are batched and sorted for efficient processing.
2. **Parallel Fragmentation**:
- Each worker processes a subset of batches independently using goroutines.
- For each sequence longer than `minsize`, it is split into overlapping fragments of length `length` with step size = `length - overlap`.
- The final fragment is extended to cover the remainder (fusion mode), avoiding tiny trailing pieces.
3. **Resource Management**:
- Original sequences are recycled (`s.Recycle()`) to optimize memory usage.
- Fragments are reassembled into batches, sorted by source and order, then rebatched to respect memory/size limits.
## Key Features
- **Overlap handling**: Ensures contiguous coverage without gaps.
- **Memory efficiency**: Uses recycling and batched output.
- **Scalability**: Leverages Go concurrency via `nworkers`.
- **Error safety**: Panics on subsequence errors (e.g., invalid indices).
## Use Case
Ideal for preparing long-read sequencing data (e.g., PacBio, Nanopore) or assembled contigs for downstream analysis requiring fixed-length inputs (e.g., k-mer indexing, ML inference).
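The splitting rule described above — windows of `length` advancing by `length - overlap`, with the final window fused to the end of the sequence so no short trailing piece is emitted — can be written down directly. This is a minimal sketch of the rule, assuming `overlap < length`; it is not the `IFragments` implementation itself.

```go
package main

import "fmt"

// fragment splits seq into overlapping windows of the given length,
// stepping by length-overlap. When the next window would overrun the
// sequence, the current window is extended to the end instead (the
// "fusion" behaviour), so no tiny trailing fragment is produced.
// Sequences no longer than minsize are returned whole.
// Assumes overlap < length.
func fragment(seq string, minsize, length, overlap int) []string {
	if len(seq) <= minsize {
		return []string{seq}
	}
	step := length - overlap
	var frags []string
	for start := 0; ; start += step {
		end := start + length
		if end+step > len(seq) { // next window would overrun: fuse
			frags = append(frags, seq[start:])
			break
		}
		frags = append(frags, seq[start:end])
	}
	return frags
}

func main() {
	// 10-base sequence, windows of 4 overlapping by 1 (step 3).
	fmt.Println(fragment("ACGTACGTAC", 5, 4, 1))
	// Short sequence: passed through untouched.
	fmt.Println(fragment("ACG", 5, 4, 1))
}
```

Note how contiguous coverage follows from the step size: each window starts `overlap` bases before the previous one ends, so no gap can open between consecutive fragments.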
+29
View File
@@ -0,0 +1,29 @@
# Memory-Limited Biosequence Iterator
This Go function extends an `IBioSequence` iterator with memory-aware throttling to prevent excessive heap allocation during data processing.
## Core Functionality
- **`LimitMemory(fraction float64)`**
Returns a new iterator that respects an upper bound on heap usage relative to total system memory.
- **Memory Monitoring**
Uses `runtime.ReadMemStats()` and `github.com/pbnjay/memory.TotalMemory()` to compute the current heap fraction (`Alloc / TotalMemory`) dynamically.
- **Backpressure Mechanism**
While the memory fraction exceeds `fraction`, the producer goroutine yields control (`runtime.Gosched()`) until sufficient memory becomes available.
- **Logging**
Warns via `obilog.Warnf` when:
- Memory pressure persists (every ~1000 yields),
- Or wait duration becomes unusually long (>10,000 yielding cycles).
- **Concurrency Model**
- A producer goroutine consumes from the original iterator and pushes items to `newIter`, pausing as needed.
- A dedicated consumer goroutine calls `WaitAndClose()` to ensure graceful termination and resource cleanup.
## Semantic Behavior
- **Non-blocking consumer**: Downstream consumers are not stalled; they read from an internal buffered channel (`newIter`).
- **Adaptive rate control**: The iterator automatically slows down when memory pressure rises, avoiding OOM conditions.
- **Predictable resource use**: Keeps heap usage at or below the specified `fraction` (e.g., 0.5 → at most 50% of total RAM) by throttling the producer.
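The backpressure loop at the heart of this design can be sketched with the standard library alone: check `runtime.MemStats.Alloc` before each send and yield until the heap fraction drops. The fixed `budget` parameter below replaces the real total-system-memory lookup (`memory.TotalMemory()`) so the sketch stays self-contained; it is an illustration of the pattern, not the `LimitMemory` implementation.

```go
package main

import (
	"fmt"
	"runtime"
)

// throttle forwards items only while heap usage stays below the given
// fraction of budget, yielding the processor otherwise — the
// backpressure loop described for LimitMemory.
func throttle(in <-chan int, budget uint64, fraction float64) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		var m runtime.MemStats
		for v := range in {
			runtime.ReadMemStats(&m)
			for float64(m.Alloc)/float64(budget) > fraction {
				runtime.Gosched() // let consumers and the GC make room
				runtime.ReadMemStats(&m)
			}
			out <- v
		}
	}()
	return out
}

func main() {
	in := make(chan int, 3)
	in <- 1
	in <- 2
	in <- 3
	close(in)
	sum := 0
	for v := range throttle(in, 1<<32, 0.5) { // 4 GiB budget, 50% cap
		sum += v
	}
	fmt.Println(sum)
}
```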
+19
View File
@@ -0,0 +1,19 @@
# Semantic Description of `IMergeSequenceBatch` and `MergePipe`
This code defines two related functions in the `obiiter` package for batch-wise merging of biological sequences during iteration.
- **`IMergeSequenceBatch(na, statsOn, sizes...) IBioSequence → IBioSequence`**
- Consumes an input sequence iterator (`IBioSequence`) and returns a new one.
- Groups incoming sequences into batches (default size: `100`, configurable via variadic argument).
- For each batch:
- Collects up to `batchsize` sequences via the input iterator.
- Applies `.Merge(na, statsOn)` on each sequence group (presumably merging reads based on `na`, e.g., nucleotide alignment or overlap).
- Wraps merged results into a `BioSequenceBatch` with ordering metadata.
- Emits batches asynchronously via goroutines; the output iterator is closed when input finishes.
- **`MergePipe(na, statsOn, sizes...) Pipeable → func(IBioSequence) IBioSequence`**
- A *pipeline combinator* (higher-order function), enabling functional composition.
- Returns a `Pipeable` — i.e., a transformation function compatible with iterator pipelines.
**Semantic Purpose**:
Enables efficient, memory-smoothed merging of biological sequence reads (e.g., paired-end merges) in streaming fashion, with optional statistics tracking (`statsOn`) and configurable batching.
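The collection step — accumulate up to `batchsize` sequences, emit the group, and flush any partial group at end of input — is the part of `IMergeSequenceBatch` that can be sketched independently of the `.Merge()` semantics. Channels and strings below stand in for the iterator and sequence types; the merge itself is elided.

```go
package main

import "fmt"

// rebatch groups a stream into fixed-size batches, emitting a final
// partial batch if the input length is not a multiple of batchsize.
func rebatch(in <-chan string, batchsize int) <-chan []string {
	out := make(chan []string)
	go func() {
		defer close(out)
		buf := make([]string, 0, batchsize)
		for s := range in {
			buf = append(buf, s)
			if len(buf) == batchsize {
				out <- buf
				buf = make([]string, 0, batchsize)
			}
		}
		if len(buf) > 0 {
			out <- buf // flush the trailing partial batch
		}
	}()
	return out
}

func main() {
	in := make(chan string, 5)
	for _, s := range []string{"a", "b", "c", "d", "e"} {
		in <- s
	}
	close(in)
	for b := range rebatch(in, 2) {
		fmt.Println(b)
	}
}
```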
+35
View File
@@ -0,0 +1,35 @@
# `NumberSequences` Function — Semantic Description
The `NumberSequences` method assigns a unique sequential identifier (`seq_number`) to each biological sequence in an `IBioSequence` iterator, preserving consistency for paired-end reads.
## Core Functionality
- **Sequential numbering**: Assigns integers (starting from `start`, defaulting to 0 or user-defined) incrementally across sequences.
- **Thread-safe**: Uses `sync.Mutex` and `atomic.Int64` to safely manage the global counter during concurrent processing.
- **Paired-read support**: When input is paired (`IsPaired()`), both reads in a pair receive the *same* `seq_number`, ensuring alignment between mates.
## Parallelization Strategy
- **Default mode**: Uses multiple workers (`ParallelWorkers()`) for performance; batches are processed concurrently.
- **Reordering mode**: If `forceReordering` is true:
- Input iterator is batch-sorted (`SortBatches()`).
- Parallelism disabled (1 worker) to ensure deterministic numbering order.
## Implementation Details
- Each goroutine processes its own split of the input iterator.
- A shared `next_first` counter tracks the next available sequence number globally.
- Locking ensures atomic increment and assignment, preventing race conditions.
## Output
Returns a new `IBioSequence` iterator:
- Contains the same sequence batches (possibly reordered if sorted).
- Each `BioSequence` object now carries a `"seq_number"` attribute.
- Paired sequences are co-numbered and marked accordingly.
## Use Cases
- Preparing data for downstream tools requiring unique sequence IDs.
- Maintaining cross-read identity in paired-end workflows (e.g., assembly, mapping).
- Reproducible numbering across pipeline stages or restarts.
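The numbering rule — one atomic counter shared by all workers, with a read and its mate receiving the same number — can be condensed to a few lines. The `read` type and `assign` helper are illustrative stand-ins; the real method stores the number as a `"seq_number"` attribute on `BioSequence` objects.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// read stands in for a BioSequence that may carry a pointer to its mate.
type read struct {
	id   string
	mate *read
	num  int64
}

var counter atomic.Int64

// assign gives a read the next free number; if the read is paired,
// its mate is co-numbered so both ends share the same seq_number.
func assign(r *read) {
	n := counter.Add(1) - 1 // atomic increment: numbers start at 0
	r.num = n
	if r.mate != nil {
		r.mate.num = n
	}
}

func main() {
	fwd := &read{id: "r1/1"}
	rev := &read{id: "r1/2"}
	fwd.mate = rev
	assign(fwd)
	solo := &read{id: "r2"}
	assign(solo)
	fmt.Println(fwd.num, rev.num, solo.num)
}
```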
+17
View File
@@ -0,0 +1,17 @@
# Paired-End Sequence Handling in `obiiter`
This Go package provides semantic functionality for managing **paired-end biological sequences** within batched iterators.
- `BioSequenceBatch` methods:
- **`IsPaired()`**: Checks whether the batch contains paired reads.
- **`PairedWith()`**: Returns a new batch containing only the mate (partner) of each read in the current batch.
- **`PairTo(*BioSequenceBatch)`**: Synchronizes and pairs reads between two batches *of identical order*; fails if orders differ.
- **`UnPair()`**: Removes pairing metadata, treating reads as unpaired.
- `IBioSequence` (iterator) methods:
- **`MarkAsPaired()`**: Marks the iterator as producing paired-end data.
- **`PairTo(IBioSequence)`**: Combines two iterators into a new paired-end iterator by aligning corresponding batches and calling `PairTo` on each pair.
- **`PairedWith()`**: Generates a new iterator yielding only the mate reads (i.e., second ends) from an existing paired-end stream.
- **`IsPaired()`**: Returns whether the iterator was explicitly marked as paired.
All operations preserve batched processing and concurrency via goroutines, ensuring efficient handling of large NGS datasets while maintaining semantic correctness for paired-end workflows.

+17
View File
@@ -0,0 +1,17 @@
# Semantic Description of `obiiter` Package Features
This Go package provides functional-style utilities for processing biological sequence data (e.g., FASTQ/FASTA), modeled via the `IBioSequence` interface.
- **`Pipeable`**: A function type representing a unary transformation on an `IBioSequence`.
- **`Pipeline(start, parts...)`**: Composes a sequence of `Pipeable` operations into a single executable pipeline. It applies transformations sequentially: input → start → part₁ → … → output.
- **`(IBioSequence).Pipe(start, parts...)`**: A convenience method enabling fluent chaining of transformations directly on a sequence object.
- **`Teeable`**: A function type for operations that split input into two independent output streams (e.g., filtering + logging).
- **`(IBioSequence).CopyTee()`**: A high-level tee operation that duplicates the input stream into two identical, concurrently readable `IBioSequence` instances.
- Uses goroutines to ensure non-blocking parallel consumption.
- Ensures proper lifecycle management: closing the second stream when the first is closed.
- Preserves paired-end status (`MarkAsPaired`) if applicable.
Together, these features support modular, composable, and concurrent biosequence processing pipelines—ideal for scalable NGS data workflows.
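The `Pipeline` combinator described above reduces to folding a list of unary transformations over a stream. The sketch below uses plain channels in place of `IBioSequence` and a hypothetical `mapper` helper to lift per-item functions into stages; it illustrates the composition pattern, not the OBITools4 API.

```go
package main

import "fmt"

// pipeable is a unary stream transformation, mirroring the role of
// obiiter's Pipeable over IBioSequence.
type pipeable func(<-chan string) <-chan string

// pipeline composes stages left to right: input → part₁ → … → output.
func pipeline(parts ...pipeable) pipeable {
	return func(in <-chan string) <-chan string {
		for _, p := range parts {
			in = p(in)
		}
		return in
	}
}

// mapper lifts a per-item function into a pipeable stage running in
// its own goroutine.
func mapper(f func(string) string) pipeable {
	return func(in <-chan string) <-chan string {
		out := make(chan string)
		go func() {
			defer close(out)
			for s := range in {
				out <- f(s)
			}
		}()
		return out
	}
}

func main() {
	doubleThenTag := pipeline(
		mapper(func(s string) string { return s + s }),
		mapper(func(s string) string { return ">" + s }),
	)
	in := make(chan string, 2)
	in <- "ac"
	in <- "gt"
	close(in)
	for s := range doubleThenTag(in) {
		fmt.Println(s)
	}
}
```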
@@ -0,0 +1,28 @@
# `MakeSetAttributeWorker` Functionality Overview
The function `MakeSetAttributeWorker(rank string) obiiter.SeqWorker` constructs a reusable sequence-processing worker for taxonomic annotation.
- **Input validation**: It first verifies that the provided `rank` is part of a predefined taxonomic hierarchy (`taxonomy.RankList()`). If invalid, it terminates execution with an informative error.
- **Worker construction**: It returns a closure (`obiiter.SeqWorker`) — essentially a function that transforms biological sequences.
- **Core behavior**: For each input `*obiseq.BioSequence`, it calls `taxonomy.SetTaxonAtRank(sequence, rank)`. This likely assigns or updates the taxonomic label (e.g., species, genus) at the specified rank in the sequence's metadata.
- **Purpose**: Enables modular, pipeline-friendly taxonomic annotation — e.g., in bioinformatics workflows where sequences must be annotated hierarchically (e.g., from phylum down to species).
- **Design pattern**: Follows the *functional factory* and *worker interface* patterns, promoting composability in sequence processing pipelines.
- **Side effects**: Modifies the input `BioSequence` *in-place* (via mutation of its taxonomic metadata), then returns it.
- **Use case example**:
```go
worker := MakeSetAttributeWorker("species")
seq = worker(seq) // annotates `seq` with species-level taxon
```
- **Assumptions**:
- `taxonomy.SetTaxonAtRank` exists and handles rank-specific taxon assignment.
- Taxonomic ranks are ordered, finite, and validated (e.g., `["domain", "phylum", ..., "species"]`).
- Sequences carry mutable taxonomic metadata.
- **Error handling**: Fails fast on invalid rank input, preventing silent misannotation.
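The fail-fast factory pattern — validate the rank once, then return a closure that does the per-sequence work — can be sketched with a stand-in record type. The `rankList`, `record`, and `makeSetRankWorker` names below are illustrative assumptions, not the actual `obiseq`/`obitax` API (which mutates `BioSequence` metadata and terminates on invalid ranks rather than returning an error).

```go
package main

import (
	"fmt"
	"slices"
)

// record stands in for a BioSequence's mutable annotation map.
type record map[string]string

// seqWorker transforms one record, mirroring obiiter.SeqWorker.
type seqWorker func(record) record

// A fixed, ordered rank list playing the role of taxonomy.RankList().
var rankList = []string{"domain", "phylum", "class", "order", "family", "genus", "species"}

// makeSetRankWorker validates rank up front and only then builds the
// closure, so an invalid rank can never silently misannotate data.
func makeSetRankWorker(rank string) (seqWorker, error) {
	if !slices.Contains(rankList, rank) {
		return nil, fmt.Errorf("invalid taxonomic rank: %q", rank)
	}
	return func(r record) record {
		r[rank] = "assigned" // stand-in for taxonomy.SetTaxonAtRank
		return r
	}, nil
}

func main() {
	worker, err := makeSetRankWorker("species")
	if err != nil {
		panic(err)
	}
	seq := worker(record{"id": "seq1"})
	fmt.Println(seq["species"])
	if _, err := makeSetRankWorker("kingdom"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```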
+31
View File
@@ -0,0 +1,31 @@
# `Speed` Functionality Description
The provided Go code defines a method and helper function to add **real-time progress tracking** to biosequence iterators in the OBITools4 framework.
## Core Features
- **Non-intrusive progress bar**:
The `Speed()` method wraps an existing iterator and displays a visual progress indicator on stderr, using the [`progressbar`](https://github.com/schollz/progressbar) library.
- **Conditional rendering**:
The progress bar is only shown when:
- `--no-progressbar` flag is *not* set (via `obidefault.ProgressBar()`),
- stderr is connected to a terminal (`os.ModeCharDevice`),
- stdout is *not* piped (to avoid interfering with file output).
- **Batch-aware counting**:
Progress is updated per batch (`batch.Len()`), not item-by-item, for efficiency and smoother UI updates (throttled to ≥100ms).
- **Paired-end support**:
If the input iterator is paired (`IsPaired()`), this property is preserved in the returned iterator.
- **Pipeable wrapper**:
`SpeedPipe()` enables integration into functional pipelines (e.g., `.Map(...).Filter(...)`) by returning a `Pipeable` function.
## Implementation Highlights
- Uses goroutines to decouple iteration and progress updates.
- Automatically closes the output iterator when input ends (`WaitAndClose()`).
- Prints a final newline to stderr upon completion.
This utility enhances user experience during long-running sequence processing (e.g., FASTQ parsing, alignment), without affecting correctness or performance in non-interactive contexts.
+20
View File
@@ -0,0 +1,20 @@
# Semantic Description of `obiiter` Package Functionalities
This Go package (`obiiter`) provides utilities for applying functional transformations to biological sequence iterators, supporting parallel execution and modular piping.
- **`MakeIWorker(worker, breakOnError bool, sizes ...int)`**:
Applies a `SeqWorker` (sequence-to-sequence transformation) to each sequence in the iterator. Supports configurable parallelism (`nworkers`) and optional channel buffering via `sizes`. Uses internal conversion to slice-based workers.
- **`MakeIConditionalWorker(predicate, worker, breakOnError bool, sizes ...int)`**:
Applies a `SeqWorker` only to sequences satisfying a given boolean `predicate`. Enables conditional, parallelized processing while preserving iterator semantics.
- **`MakeISliceWorker(worker, breakOnError bool, sizes ...int)`**:
Core method applying a `SeqSliceWorker` (batch-level transformation) across slices of sequences. Implements multi-goroutine parallelism using `nworkers`. Handles errors optionally via fatal logging (`breakOnError`). Preserves paired-end metadata.
- **`WorkerPipe(worker, breakOnError bool, sizes ...int)`**:
Returns a `Pipeable` closure wrapping `MakeIWorker`, enabling composition in pipeline chains (e.g., for CLI or DSL-style workflows).
- **`SliceWorkerPipe(worker, breakOnError bool, sizes ...int)`**:
Similar to `WorkerPipe`, but for slice-level workers (`SeqSliceWorker`). Facilitates modular, reusable pipeline stages.
All methods support optional size arguments to override default parallelism (from `obidefault`). Internally, they rely on Go concurrency primitives (`go`, channels) and structured batch processing via `IBioSequence` interface.
+33
View File
@@ -0,0 +1,33 @@
# `obiitercsv`: CSV Record Iterator for Streaming and Batch Processing
This Go package provides a thread-safe, channel-based iterator (`ICSVRecord`) for streaming and processing CSV records in batches. It supports ordered batch handling, concurrent access via mutexes, and dynamic header management.
## Core Types
- **`CSVHeader`**: A slice of strings representing column names.
- **`CSVRecord`**: A map from field name to value (`map[string]interface{}`).
- **`CSVRecordBatch`**: A batch of records with metadata: `source`, `order`, and the actual data slice.
## Key Features
- **Streaming via Channels**: Records are consumed as `CSVRecordBatch` items through a channel, enabling asynchronous producers/consumers.
- **Ordered Processing**: Batches include an `order` field, used by `SortBatches()` to reconstruct sequential order even when received out-of-order.
- **Thread Safety**: Uses `sync.RWMutex`, atomic operations (`batch_size`), and `abool.AtomicBool` for flags like `finished`.
- **Iterator Protocol**: Implements standard methods:
- `Next()` to advance,
- `Get()` to retrieve current batch,
- `PushBack()` for re-queuing the last record.
- **Batch Management**:
- `SetHeader()` / `AppendField()`: dynamic header updates.
- `Split()`: creates a new iterator sharing the same channel but with independent locking.
- **Lifecycle Control**:
- `Add()` / `Done()`: track active goroutines (via `sync.WaitGroup`).
- `WaitAndClose()` ensures all data is flushed before closing the channel.
## Utility Methods
- **`NotEmpty()`, `IsNil()`**: Check batch validity.
- **`Consume()`**: Drains the iterator (e.g., for side-effect processing).
- **`SortBatches()`**: Reorders batches by `order`, buffering out-of-sequence ones.
Designed for bioinformatics pipelines (e.g., OBITools4), it enables scalable, memory-efficient CSV processing with strict ordering guarantees.
Some files were not shown because too many files have changed in this diff.