Files
obitools4/autodoc/docmd/pkg_obitools_obigrep.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

55 lines
2.9 KiB
Markdown

Here's a concise, semantically structured Markdown description (≤200 lines) of the **public-facing functionalities** provided by the `obigrep` package, based on your input and focusing only on *public* APIs:
```markdown
# `obigrep`: Command-Line Sequence Filtering for OBITools4
`obigrep` delivers a robust, CLI-driven filtering engine for biological sequences (FASTA/FASTQ), enabling precise selection or exclusion of reads using diverse criteria—length, abundance, taxonomy, patterns (exact/fuzzy), metadata attributes—and paired-end logic.
## Core Filtering Capabilities
### Length & Abundance
- `--min-length`, `--max-length`: Filter by sequence length.
- `--min-count`, `--max-count`: Filter based on read abundance (count attribute).
### Pattern Matching
- Exact regex via `--sequence`/`-s`, `--definition`/`-D`, or `--identifier`/`-I`.
- Case-insensitive by default.
- Approximate matching via `--pattern`, with options:
- `--pattern-error`: Max edit distance.
- `--allows-indels`: Allow insertions/deletions (default: mismatches only).
- `--only-forward`: Restrict to forward strand.
### Taxonomic Filtering
- `--restrict-to-taxon`/`-r`: Keep only sequences matching given taxon(s).
- `--ignore-taxon`/`-i`: Exclude specific taxa.
- `--valid-taxid`: Enforce presence of valid NCBI taxids in records.
- `--require-rank`: Require specific taxonomic rank (e.g., *species*, *genus*).
### Attribute & Metadata Filtering
- `--has-attribute`/`-A`: Retain sequences with a given attribute key.
- `--attribute=key=pattern`/`-a`: Match regex against a specific attribute value.
- `--id-list FILE`: Select sequences whose identifiers appear in the file.
### Custom Logic
- `--predicate`/`-p`: Evaluate arbitrary boolean expressions (e.g., `"attr['quality'] > 30 && len(sequence) < 500"`).
### Paired-End Handling
- `--paired-mode`: Define how filters apply to read pairs:
- `"forward"`: Only forward read considered.
- `"and"`, `"or"`, `"xor"`, etc.: Logical combinations of forward/reverse filters.
### Output Control
- `--save-discarded FILE`: Write rejected sequences to file.
- `--inverse-match`/`-v`: Globally invert selection (i.e., output *only* discarded reads).
## Implementation Notes
- Filters are composed into a single predicate using `CLISequenceSelectionPredicate()`.
- Paired-end logic is layered via `PairedPredicat()` when input files are paired (`CLIHasPairedFile()`).
- Filtering is executed via `iterator.FilterOn(...)` (in-place) or `DivideOn(...)` + async write to discarded file.
- Uses structured logging (`logrus`) and graceful error handling for robust CLI operation.
## Semantic Role
`obigrep` acts as the **semantic filter layer** in OBITools4 workflows—translating user CLI flags into type-safe, composable predicates that operate uniformly over `IBioSequence` iterators. It bridges high-level biological intent (e.g., “keep only *Bacillales* with ≥Q30 and no Ns”) to low-level filtering primitives.