mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
8c7017a99d
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
55 lines
2.9 KiB
Markdown
55 lines
2.9 KiB
Markdown
Here's a concise, semantically structured Markdown description (≤200 lines) of the **public-facing functionalities** provided by the `obigrep` package, based on your input and focusing only on *public* APIs:
|
|
|
|
```markdown
|
|
# `obigrep`: Command-Line Sequence Filtering for OBITools4
|
|
|
|
`obigrep` delivers a robust, CLI-driven filtering engine for biological sequences (FASTA/FASTQ), enabling precise selection or exclusion of reads using diverse criteria—length, abundance, taxonomy, patterns (exact/fuzzy), metadata attributes—and paired-end logic.
|
|
|
|
## Core Filtering Capabilities
|
|
|
|
### Length & Abundance
|
|
- `--min-length`, `--max-length`: Filter by sequence length.
|
|
- `--min-count`, `--max-count`: Filter based on read abundance (count attribute).
|
|
|
|
### Pattern Matching
|
|
- Exact regex via `--sequence`/`-s`, `--definition`/`-D`, or `--identifier`/`-I`.
|
|
- Case-insensitive by default.
|
|
- Approximate matching via `--pattern`, with options:
|
|
- `--pattern-error`: Max edit distance.
|
|
- `--allows-indels`: Allow insertions/deletions (default: mismatches only).
|
|
- `--only-forward`: Restrict to forward strand.
|
|
|
|
### Taxonomic Filtering
|
|
- `--restrict-to-taxon`/`-r`: Keep only sequences matching given taxon(s).
|
|
- `--ignore-taxon`/`-i`: Exclude specific taxa.
|
|
- `--valid-taxid`: Enforce presence of valid NCBI taxids in records.
|
|
- `--require-rank`: Require specific taxonomic rank (e.g., *species*, *genus*).
|
|
|
|
### Attribute & Metadata Filtering
|
|
- `--has-attribute`/`-A`: Retain sequences with a given attribute key.
|
|
- `--attribute=key=pattern`/`-a`: Match regex against a specific attribute value.
|
|
- `--id-list FILE`: Select sequences whose identifiers appear in the file.
|
|
|
|
### Custom Logic
|
|
- `--predicate`/`-p`: Evaluate arbitrary boolean expressions (e.g., `"attr['quality'] > 30 && len(sequence) < 500"`).
|
|
|
|
### Paired-End Handling
|
|
- `--paired-mode`: Define how filters apply to read pairs:
|
|
- `"forward"`: Only forward read considered.
|
|
- `"and"`, `"or"`, `"xor"`, etc.: Logical combinations of forward/reverse filters.
|
|
|
|
### Output Control
|
|
- `--save-discarded FILE`: Write rejected sequences to file.
|
|
- `--inverse-match`/`-v`: Globally invert selection (i.e., output *only* discarded reads).
|
|
|
|
## Implementation Notes
|
|
|
|
- Filters are composed into a single predicate using `CLISequenceSelectionPredicate()`.
|
|
- Paired-end logic is layered via `PairedPredicat()` when input files are paired (`CLIHasPairedFile()`).
|
|
- Filtering is executed via `iterator.FilterOn(...)` (in-place) or `DivideOn(...)` + async write to discarded file.
|
|
- Uses structured logging (`logrus`) and graceful error handling for robust CLI operation.
|
|
|
|
## Semantic Role
|
|
|
|
`obigrep` acts as the **semantic filter layer** in OBITools4 workflows—translating user CLI flags into type-safe, composable predicates that operate uniformly over `IBioSequence` iterators. It bridges high-level biological intent (e.g., “keep only *Bacillales* with ≥Q30 and no Ns”) to low-level filtering primitives.
|