- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
11 KiB
obigrep(1) — OBITools4 Manual
NAME
obigrep — select a subset of sequence records on various criteria
SYNOPSIS
obigrep [OPTIONS] [FILE...]
DESCRIPTION
obigrep filters a set of biological sequence records (in FASTA or FASTQ format) and writes only those matching all specified criteria to the output. Its name is modelled on the Unix grep command, but instead of filtering lines in a text file, it filters sequence records.
Filtering criteria can be combined freely: only sequence records satisfying all specified conditions are retained. The selection can be inverted with --inverse-match to keep the records that would otherwise be discarded.
Sequences are read from one or more files, or from standard input if no file is given. Results are written to standard output or to a file specified with --out. Records that do not pass the filters can optionally be saved to a separate file with --save-discarded.
INPUT FORMATS
obigrep recognises the following input formats automatically. A specific format can be forced with the corresponding flag:
| Flag | Format |
|---|---|
--fasta |
FASTA |
--fastq |
FASTQ |
--embl |
EMBL flat file |
--genbank |
GenBank flat file |
--ecopcr |
ecoPCR output |
--csv |
CSV tabular format |
Header annotation styles can be selected with --input-OBI-header (OBITools format) or --input-json-header (JSON format).
OUTPUT FORMATS
By default, the output format matches the input format (FASTQ when quality scores are present, FASTA otherwise). The format can be forced:
--fasta-output— write FASTA--fastq-output— write FASTQ--json-output— write JSON--output-OBI-header/-O— annotate FASTA/FASTQ title lines in OBITools format--output-json-header— annotate FASTA/FASTQ title lines in JSON format--compress/-Z— compress output with gzip
Use --out FILE / -o FILE to write results to a file instead of standard output.
FILTERING OPTIONS
By sequence length
-
--min-length LENGTH/-l LENGTHKeep only sequences at least LENGTH bases long. -
--max-length LENGTH/-L LENGTHKeep only sequences at most LENGTH bases long.
By read abundance
Sequence records can carry a count attribute recording how many times the sequence was observed. The following options filter on that count:
-
--min-count COUNT/-c COUNTKeep only sequences observed at least COUNT times (default: 1). -
--max-count COUNT/-C COUNTKeep only sequences observed at most COUNT times.
By sequence pattern
-
--sequence PATTERN/-s PATTERNKeep records whose nucleotide sequence matches the regular expression PATTERN (case-insensitive). This option can be repeated; all patterns must match. -
--approx-pattern PATTERNKeep records whose sequence contains an approximate match to PATTERN. The number of allowed differences is controlled by--pattern-error. This option can be repeated. -
--pattern-error NMaximum number of mismatches (or indels, if--allows-indelsis set) tolerated when using--approx-pattern(default: 0, i.e. exact match). -
--allows-indelsAllow insertions and deletions (in addition to substitutions) when performing approximate pattern matching. -
--only-forwardSearch patterns on the forward strand only. By default both strands are searched.
By identifier or definition
-
--identifier PATTERN/-I PATTERNKeep records whose identifier matches the regular expression PATTERN (case-insensitive). Can be repeated. -
--id-list FILENAMEKeep only records whose identifier appears in FILENAME, a plain-text file with one identifier per line. -
--definition PATTERN/-D PATTERNKeep records whose definition line matches the regular expression PATTERN (case-insensitive). Can be repeated.
By attribute (metadata)
Sequence records can carry arbitrary key/value annotations:
-
--has-attribute KEY/-A KEYKeep records that possess an attribute named KEY, regardless of its value. Can be repeated. -
--attribute KEY=PATTERN/-a KEY=PATTERNKeep records for which the value of attribute KEY matches the regular expression PATTERN (case-sensitive). Can be repeated; all constraints must be satisfied.
By custom boolean expression
-
--predicate EXPRESSION/-p EXPRESSIONKeep records for which the boolean expression EXPRESSION evaluates to true. Attributes are accessed via theannotationsmap (e.g.annotations["count"]). The special variablesequencerefers to the sequence object; its length can be obtained withlen(sequence). Can be repeated; all expressions must be true.Example:
-p 'annotations["count"] >= 10 && len(sequence) < 200'
By taxonomy
Taxonomy-based filtering requires a taxonomy database to be provided with --taxonomy.
-
--taxonomy PATH/-t PATHPath to the taxonomy database. -
--restrict-to-taxon TAXID/-r TAXIDKeep only records whose taxon belongs to the lineage of TAXID (i.e. is TAXID itself or a descendant). Can be repeated; sequences must satisfy at least one of the provided taxids. -
--ignore-taxon TAXID/-i TAXIDDiscard records whose taxon belongs to the lineage of TAXID. Can be repeated. -
--valid-taxidKeep only records that carry a valid, recognised taxonomic identifier. -
--require-rank RANK_NAMEKeep only records whose taxon has a defined ancestor at the given rank (e.g. species, genus, family). Can be repeated. -
--update-taxidAutomatically update merged taxids to their current valid equivalent. -
--fail-on-taxonomyExit with an error if a taxid referenced in the data is not valid. -
--with-leavesWhen the taxonomy is extracted from a sequence file, attach each sequence as a leaf node under its annotated taxid. -
--raw-taxidPrint taxids in output files without supplementary information (taxon name and rank).
Inversion
--inverse-match/-vInvert the selection: output the records that would otherwise be discarded.
PAIRED-END OPTIONS
When paired-end sequencing data are provided (forward and reverse reads stored in two files), obigrep can apply filters taking both reads into account.
-
--paired-with FILENAMEFile containing the reverse (paired) reads. -
--paired-mode MODEHow to combine the filter result from the forward and reverse reads. MODE is one of:Mode Meaning forwardKeep the pair if the forward read passes (default) reverseKeep the pair if the reverse read passes andKeep the pair if both reads pass orKeep the pair if at least one read passes andnotKeep the pair if the forward passes and the reverse does not xorKeep the pair if exactly one read passes
OUTPUT CONTROL
-
--save-discarded FILENAMEWrite sequence records that do not pass the filters to FILENAME. -
--out FILENAME/-o FILENAMEWrite the selected records to FILENAME (default: standard output). -
--skip-emptySuppress sequences of length zero from the output.
PERFORMANCE OPTIONS
-
--max-cpu NNumber of parallel processing threads (default: number of available CPUs). -
--batch-size NMinimum number of sequences per processing batch (default: 1). -
--batch-size-max NMaximum number of sequences per processing batch (default: 2000). -
--batch-mem SIZEMaximum memory per batch (e.g.128M,1G). Overrides--batch-size-maxwhen memory is the limiting factor. Can also be set via the environment variableOBIBATCHMEM. -
--no-orderWhen multiple input files are provided, indicates that no ordering is assumed between them, which can improve throughput. -
--no-progressbarDisable the progress bar.
MISCELLANEOUS OPTIONS
-
--u-to-tConvert uracil (U) to thymine (T) in all sequences (useful for RNA data). -
--solexaDecode quality scores according to the legacy Solexa specification instead of the standard Phred encoding. -
--silent-warningSuppress warning messages. -
--debugEnable verbose debug logging. -
--versionPrint version information and exit. -
--help/-h/-?Display the help message and exit.
EXAMPLES
Keep all sequences longer than 100 bases:
obigrep --min-length 100 input.fasta > out_min_length.fasta
Expected output: 6 sequences written to out_min_length.fasta.
Select sequences observed at least 10 times:
obigrep --min-count 10 input.fasta > out_min_count.fasta
Expected output: 4 sequences written to out_min_count.fasta.
Keep sequences whose identifier starts with BOLD:
obigrep --identifier '^BOLD' input.fasta > out_bold.fasta
Expected output: 2 sequences written to out_bold.fasta.
Select only sequences carrying the IUPAC primer motif GGGCWATGTTTCATAAYGGG with up to 2 mismatches:
obigrep --approx-pattern GGGCWATGTTTCATAAYGGG --pattern-error 2 input.fasta > out_primer.fasta
Expected output: 2 sequences written to out_primer.fasta.
Retain sequences belonging to the genus Homo (taxid 9605) in an NCBI taxonomy:
obigrep --taxonomy /data/ncbi_tax --restrict-to-taxon 9605 input.fasta
Keep sequences that have a sample attribute equal to lake1 and save the rest to a separate file:
obigrep --attribute sample='^lake1$' --save-discarded discarded.fasta \
input.fasta > lake1.fasta
Expected output: 5 sequences written to lake1.fasta, 5 sequences written to discarded.fasta.
Invert a length filter (discard sequences shorter than 50 bases):
obigrep --min-length 50 --inverse-match input.fasta > out_short.fasta
Expected output: 1 sequence written to out_short.fasta.
Apply a custom predicate (sequences with count ≥ 5):
obigrep -p 'annotations["count"] >= 5' input.fasta > out_predicate.fasta
Expected output: 6 sequences written to out_predicate.fasta.
OUTPUT
Attribute table
Attributes present on sequence records are preserved unchanged in the output. No new attributes are added by obigrep itself — only filtering occurs.
| Attribute | Type | Description |
|---|---|---|
count |
integer | Number of times the sequence was observed (read from input) |
sample |
string | Sample identifier (read from input) |
Any other annotations present in the input are carried through to the output unmodified.
Observed output example
>seq001 {"count":15,"sample":"lake1"}
acgtacgtacgtacgtacgtgggcaatgtttcataatgggacgtacgtacgtacgtacgt
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
acgtacgtacgtacgtacgtacgtacgtacgt
>seq002 {"count":3,"sample":"lake1"}
tgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgca
tgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgca
>seq004 {"count":2,"sample":"lake1"}
aaacccgggtttagctagctagctagctagctagctagctagctagctagctagctagct
agctagctagctagctagctagctagctagctagctagctagctagctagctagctagct
atacgtatcgatcg
>BOLD_005 {"count":8,"sample":"pond1"}
cgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgat
cgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>seq008 {"count":7,"sample":"river2"}
ttacgatcgatcgatcgatcgggcaatgtttcataaggggacgatcgatcgatcgatcga
tcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgat
SEE ALSO
obiannotate(1), obiuniq(1), obiconvert(1), obitag(1), obisplit(1)
OBITools4
obigrep is part of the OBITools4 suite for analysing DNA metabarcoding and environmental DNA data.