mirror of
https://github.com/metabarcoding/obitools4.git
# `obigrep`: Command-Line Sequence Filtering for OBITools4
`obigrep` filters biological sequences (FASTA/FASTQ) from the command line. Reads can be selected or excluded by length, abundance, taxonomy, exact or approximate patterns, and metadata attributes, with optional paired-end logic.
## Core Filtering Capabilities
### Length & Abundance
- `--min-length`, `--max-length`: Filter by sequence length.
- `--min-count`, `--max-count`: Filter based on read abundance (count attribute).
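
To make the length and abundance criteria concrete, here is a minimal Go sketch of how such bounds could be checked. The `Seq` and `Bounds` types are hypothetical stand-ins, not OBITools4 types, and the sketch assumes that the limits are inclusive and that a negative value means "not set".

```go
package main

// Seq is a hypothetical, minimal stand-in for a sequence record;
// it is not the OBITools4 sequence type.
type Seq struct {
	ID       string
	Sequence string
	Count    int // read abundance (the count attribute)
}

// Bounds holds optional limits. Assumption: limits are inclusive and a
// negative value means the limit is not set.
type Bounds struct {
	MinLength, MaxLength int
	MinCount, MaxCount   int
}

// Keep reports whether s satisfies every limit that is set.
func (b Bounds) Keep(s Seq) bool {
	n := len(s.Sequence)
	if b.MinLength >= 0 && n < b.MinLength {
		return false
	}
	if b.MaxLength >= 0 && n > b.MaxLength {
		return false
	}
	if b.MinCount >= 0 && s.Count < b.MinCount {
		return false
	}
	if b.MaxCount >= 0 && s.Count > b.MaxCount {
		return false
	}
	return true
}
```

With `Bounds{MinLength: 4, MaxLength: -1, MinCount: 2, MaxCount: -1}`, a six-base read seen three times is kept, while a three-base read or a singleton is rejected.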
### Pattern Matching
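- Regular-expression matching (exact, as opposed to approximate) on the sequence via `--sequence`/`-s`, the definition via `--definition`/`-D`, or the identifier via `--identifier`/`-I`.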
- Case-insensitive by default.
- Approximate matching via `--pattern`, with options:
  - `--pattern-error`: Max edit distance.
  - `--allows-indels`: Allow insertions/deletions (default: mismatches only).
  - `--only-forward`: Restrict to forward strand.
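
The default approximate mode (mismatches only, both strands) can be illustrated with a small Go sketch. It uses a naive sliding-window Hamming comparison chosen for readability; this is not claimed to be the algorithm `obigrep` uses, it does not cover indels, and all function names are hypothetical. Lowercasing both inputs is an assumption made here for simplicity.

```go
package main

import "strings"

// hammingMatch reports whether pattern occurs somewhere in text with at most
// maxErr substitutions (no insertions or deletions), ignoring case.
func hammingMatch(text, pattern string, maxErr int) bool {
	t, p := strings.ToLower(text), strings.ToLower(pattern)
	if len(p) == 0 {
		return true
	}
	for i := 0; i+len(p) <= len(t); i++ {
		errs := 0
		// Stop comparing this window as soon as the error budget is exceeded.
		for j := 0; j < len(p) && errs <= maxErr; j++ {
			if t[i+j] != p[j] {
				errs++
			}
		}
		if errs <= maxErr {
			return true
		}
	}
	return false
}

// reverseComplement returns the lowercase reverse complement of a DNA string;
// characters other than a, c, g, t are copied unchanged.
func reverseComplement(s string) string {
	comp := map[byte]byte{'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}
	s = strings.ToLower(s)
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c := s[len(s)-1-i]
		if r, ok := comp[c]; ok {
			c = r
		}
		out[i] = c
	}
	return string(out)
}

// matchPattern searches the forward strand and, unless onlyForward is set,
// also the reverse complement of the sequence.
func matchPattern(seq, pattern string, maxErr int, onlyForward bool) bool {
	if hammingMatch(seq, pattern, maxErr) {
		return true
	}
	return !onlyForward && hammingMatch(reverseComplement(seq), pattern, maxErr)
}
```

For the sequence `AAACCCGG`, the pattern `AACGC` matches only when one error is allowed, and `GGGTT` matches only through the reverse complement, so it is missed when `onlyForward` is true.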
### Taxonomic Filtering
- `--restrict-to-taxon`/`-r`: Keep only sequences matching given taxon(s).
- `--ignore-taxon`/`-i`: Exclude specific taxa.
- `--valid-taxid`: Enforce presence of valid NCBI taxids in records.
- `--require-rank`: Require specific taxonomic rank (e.g., *species*, *genus*).
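
The restrict/ignore logic can be pictured as a walk up a taxonomy tree. The Go sketch below uses a hypothetical toy `Taxonomy` (a child-to-parent map with made-up identifiers, not NCBI taxids and not the OBITools4 taxonomy type). It assumes that "matching" a taxon means being that taxon or one of its descendants.

```go
package main

// Taxonomy is a hypothetical toy taxonomy: each taxid maps to its parent,
// and the root maps to itself. The map is assumed to contain no cycles.
type Taxonomy map[int]int

// IsValid reports whether taxid is known to the taxonomy.
func (t Taxonomy) IsValid(taxid int) bool {
	_, ok := t[taxid]
	return ok
}

// IsDescendantOf reports whether taxid equals ancestor or lies below it.
func (t Taxonomy) IsDescendantOf(taxid, ancestor int) bool {
	for {
		if taxid == ancestor {
			return true
		}
		parent, ok := t[taxid]
		if !ok || parent == taxid {
			return false // unknown taxid, or the root was reached
		}
		taxid = parent
	}
}

// KeepTaxon applies restrict/ignore lists: the taxid must fall under at least
// one restricted taxon (when any are given) and under none of the ignored ones.
func (t Taxonomy) KeepTaxon(taxid int, restrict, ignore []int) bool {
	for _, ig := range ignore {
		if t.IsDescendantOf(taxid, ig) {
			return false
		}
	}
	if len(restrict) == 0 {
		return true
	}
	for _, r := range restrict {
		if t.IsDescendantOf(taxid, r) {
			return true
		}
	}
	return false
}
```

With the toy tree `Taxonomy{1: 1, 10: 1, 20: 1, 11: 10, 12: 10}`, restricting to taxon 10 keeps 11 and 12 but drops 20, and additionally ignoring 11 drops it as well.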
### Attribute & Metadata Filtering
- `--has-attribute`/`-A`: Retain sequences with a given attribute key.
- `--attribute=key=pattern`/`-a`: Match regex against a specific attribute value.
- `--id-list FILE`: Select sequences whose identifiers appear in the file.
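
The three attribute-level checks can be sketched in Go as follows. `Record` and every function name here are hypothetical; the sketch assumes the `key=pattern` argument is split at the first `=` and that the identifier file holds whitespace-separated identifiers (for example, one per line) whose content has already been read into a string.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Record is a hypothetical stand-in for an annotated sequence record.
type Record struct {
	ID         string
	Attributes map[string]any
}

// hasAttribute reports whether the record carries the given attribute key.
func hasAttribute(r Record, key string) bool {
	_, ok := r.Attributes[key]
	return ok
}

// parseAttributeFilter splits "key=pattern" at the first '=' and compiles
// the pattern as a regular expression.
func parseAttributeFilter(spec string) (string, *regexp.Regexp, error) {
	key, pattern, found := strings.Cut(spec, "=")
	if !found || key == "" {
		return "", nil, fmt.Errorf("expected key=pattern, got %q", spec)
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", nil, err
	}
	return key, re, nil
}

// attributeMatches formats the attribute value as text and tests it against
// re; a missing attribute never matches.
func attributeMatches(r Record, key string, re *regexp.Regexp) bool {
	v, ok := r.Attributes[key]
	return ok && re.MatchString(fmt.Sprint(v))
}

// parseIDList builds a lookup set from whitespace-separated identifiers.
func parseIDList(content string) map[string]struct{} {
	ids := make(map[string]struct{})
	for _, id := range strings.Fields(content) {
		ids[id] = struct{}{}
	}
	return ids
}
```

A set is used for the identifier list so that membership tests stay constant-time regardless of how many identifiers the file contains.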
### Custom Logic
- `--predicate`/`-p`: Evaluate arbitrary boolean expressions (e.g., `"attr['quality'] > 30 && len(sequence) < 500"`).
### Paired-End Handling
- `--paired-mode`: Define how filters apply to read pairs:
  - `"forward"`: Only forward read considered.
  - `"and"`, `"or"`, `"xor"`, etc.: Logical combinations of forward/reverse filters.
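
The paired modes named above amount to lifting a single-read predicate to a pair of reads. Below is a minimal Go sketch under that assumption; `Read`, `Predicate`, and `pairedPredicate` are hypothetical names, and only the four modes listed explicitly are handled.

```go
package main

import "fmt"

// Read is a hypothetical stand-in for one mate of a read pair.
type Read struct{ Sequence string }

// Predicate decides whether a single read passes the filters.
type Predicate func(Read) bool

// pairedPredicate lifts a single-read predicate to a read pair according to
// mode. Modes other than the four handled here return an error.
func pairedPredicate(p Predicate, mode string) (func(fwd, rev Read) bool, error) {
	switch mode {
	case "forward":
		return func(f, _ Read) bool { return p(f) }, nil
	case "and":
		return func(f, r Read) bool { return p(f) && p(r) }, nil
	case "or":
		return func(f, r Read) bool { return p(f) || p(r) }, nil
	case "xor":
		return func(f, r Read) bool { return p(f) != p(r) }, nil
	default:
		return nil, fmt.Errorf("unsupported paired mode %q", mode)
	}
}
```

With a predicate that requires at least four bases, a pair made of a long forward read and a short reverse read passes under `"forward"`, `"or"`, and `"xor"` but fails under `"and"`.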
### Output Control
- `--save-discarded FILE`: Write rejected sequences to file.
- `--inverse-match`/`-v`: Globally invert selection (i.e., output *only* discarded reads).
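
Splitting a stream into kept and discarded records, with an optional global inversion, can be sketched in Go as a simple partition. The `partition` function is hypothetical, and how `obigrep` itself combines `--save-discarded` with `--inverse-match` is not stated here; the sketch assumes inversion simply swaps the two outputs.

```go
package main

// partition splits items into kept and discarded according to keep.
// When inverse is true the decision is flipped, so the two outputs swap.
func partition[T any](items []T, keep func(T) bool, inverse bool) (kept, discarded []T) {
	for _, it := range items {
		if keep(it) != inverse {
			kept = append(kept, it)
		} else {
			discarded = append(discarded, it)
		}
	}
	return kept, discarded
}
```

Both outputs preserve input order, so a discarded file written from the second slice would list rejected records in the order they were read.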
## Implementation Notes
- Filters are composed into a single predicate using `CLISequenceSelectionPredicate()`.
- Paired-end logic is layered via `PairedPredicat()` when input files are paired (`CLIHasPairedFile()`).
- Filtering is executed via `iterator.FilterOn(...)` (in-place) or `DivideOn(...)` + async write to discarded file.
- Uses structured logging (`logrus`) and graceful error handling for robust CLI operation.
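
The composition described in these notes, many filters folded into one predicate that is then applied to a stream, can be sketched generically in Go. `Predicate`, `And`, and `FilterChan` are hypothetical names; channels stand in for the iterator type, and this is not the OBITools4 implementation.

```go
package main

// Predicate decides whether an item passes.
type Predicate[T any] func(T) bool

// And returns a predicate that is true only when every non-nil predicate in
// ps is true. Nil entries model filters the user did not request; with no
// active predicate, everything is accepted.
func And[T any](ps ...Predicate[T]) Predicate[T] {
	return func(x T) bool {
		for _, p := range ps {
			if p != nil && !p(x) {
				return false
			}
		}
		return true
	}
}

// FilterChan forwards the items of in that satisfy keep and closes the
// returned channel once in has been drained.
func FilterChan[T any](in <-chan T, keep Predicate[T]) <-chan T {
	out := make(chan T)
	go func() {
		defer close(out)
		for x := range in {
			if keep(x) {
				out <- x
			}
		}
	}()
	return out
}
```

Treating unset filters as nil keeps the composition step uniform: each CLI option contributes either a predicate or nothing, and the combined predicate is evaluated once per record.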
## Semantic Role
`obigrep` acts as the **semantic filter layer** in OBITools4 workflows: it translates user CLI flags into type-safe, composable predicates that operate uniformly over `IBioSequence` iterators. In doing so, it maps high-level biological intent (e.g., “keep only *Bacillales* with ≥Q30 and no Ns”) onto low-level filtering primitives.