obitools4/doc/build/_man/man1/obigrep.man

.\" Automatically generated by Pandoc 2.19.2
.\"
.\" Define V font for inline verbatim, using C font in formats
.\" that render this, and otherwise B font.
.ie "\f[CB]x\f[]"x" \{\
. ftr V B
. ftr VI BI
. ftr VB B
. ftr VBI BI
.\}
.el \{\
. ftr V CR
. ftr VI CI
. ftr VB CB
. ftr VBI CBI
.\}
.TH "obigrep" "1" "" "" ""
.hy
.SH NAME
.PP
obigrep \[en] filters sequence files according to numerous conditions
.SH SYNOPSIS
.PP
\f[B]obigrep\f[R] [\f[B]--attribute\f[R] | \f[B]-a\f[R]
\f[I]KEY=VALUE\f[R]]\&...
[\f[B]--compress\f[R] | \f[B]-Z\f[R]] [\f[B]--debug\f[R]]
[\f[B]--definition\f[R]|\f[B]-D\f[R] \f[I]PATTERN\f[R]]\&...
.PD 0
.P
.PD
[\f[B]--ecopcr\f[R]] [\f[B]--embl\f[R]] [\f[B]--fasta-output\f[R]]
[\f[B]--fastq-output\f[R]] [\f[B]--genbank\f[R]]
[\f[B]--has-attribute\f[R] | \f[B]-A\f[R] \f[I]KEY\f[R]]\&...
[\f[B]--help\f[R] | \f[B]-h\f[R] | \f[B]-?\f[R]] [\f[B]--id-list\f[R]
\f[I]FILENAME\f[R]] [\f[B]--identifier\f[R] | \f[B]-I\f[R]
\f[I]PATTERN\f[R]]\&...
[\f[B]--ignore-taxon\f[R] | \f[B]-i\f[R] \f[I]TAXID\f[R]]\&...
[\f[B]--input-OBI-header\f[R]] [\f[B]--input-json-header\f[R]]
[\f[B]--inverse-match\f[R] | \f[B]-v\f[R]]
[\f[B]--max-count\f[R]|\f[B]-C\f[R] \f[I]COUNT\f[R]]
[\f[B]--max-cpu\f[R] \f[I]INT\f[R]] [\f[B]--max-length\f[R] |
\f[B]-L\f[R] \f[I]LENGTH\f[R]] [\f[B]--min-count\f[R] | \f[B]-c\f[R]
\f[I]COUNT\f[R]] [\f[B]--min-length\f[R] | \f[B]-l\f[R]
\f[I]LENGTH\f[R]] [\f[B]--no-order\f[R]] [\f[B]--no-progressbar\f[R]]
[\f[B]--out\f[R] | \f[B]-o\f[R] \f[I]FILENAME\f[R]]
[\f[B]--output-OBI-header\f[R] | \f[B]-O\f[R]]
[\f[B]--output-json-header\f[R]] [\f[B]--paired-mode\f[R]
\f[I]forward|reverse|and|or|andnot|xor\f[R]] [\f[B]--paired-with\f[R]
\f[I]FILENAME\f[R]] [\f[B]--predicate\f[R]|\f[B]-p\f[R]
\f[I]EXPRESSION\f[R]]\&...
[\f[B]--require-rank\f[R] \f[I]RANK_NAME\f[R]]\&...
[\f[B]--restrict-to-taxon\f[R] | \f[B]-r\f[R] \f[I]TAXID\f[R]]\&...
[\f[B]--save-discarded\f[R] \f[I]FILENAME\f[R]]
[\f[B]--sequence\f[R]|\f[B]-s\f[R] \f[I]PATTERN\f[R]]\&...
[\f[B]--solexa\f[R]] [\f[B]--taxdump\f[R] | \f[B]-t\f[R]
\f[I]DIRECTORY\f[R]] [\f[B]--workers\f[R] | \f[B]-w\f[R] \f[I]INT\f[R]]
[\f[I]FILENAMES\f[R]]
.SH DESCRIPTION
.PP
The \f[V]obigrep\f[R] command is somewhat analogous to the standard Unix
\f[V]grep\f[R] command.
It selects a subset of sequence records from a sequence file.
A sequence record is a complex object consisting of an identifier, a set
of attributes (a key, defined by its name, associated with a value), a
definition, and the sequence itself.
Instead of working text line by text line like the standard Unix tool,
\f[V]obigrep\f[R] selection is done sequence record by sequence record.
A large number of options allow you to refine the selection on any
element of the sequence.
\f[V]obigrep\f[R] allows you to specify multiple conditions
simultaneously (which take on the value \f[V]TRUE\f[R] or
\f[V]FALSE\f[R]) and only those sequence records which meet all
conditions (all conditions are \f[V]TRUE\f[R]) are selected.
\f[V]obigrep\f[R] is able to work on two paired read files.
The selection criteria apply to one or the other of the readings in each
pair depending on the mode chosen (\f[B]--paired-mode\f[R] option).
In all cases the selection is applied in the same way to both files,
thus maintaining their consistency.
.SH OPTIONS
.SS General options
.PP
\f[B]Helpful options\f[R]
.TP
\f[B]--help\f[R], \f[B]-h\f[R]
Display a friendly help message.
.PP
\f[B]--no-progressbar\f[R]
.PP
\f[B]Managing parallel execution\f[R]
.TP
\f[B]--max-cpu\f[R]
OBITools V4 are able to run in parallel on all the CPU cores available
on the computer.
It is sometime required to limit the computation to a smaller number of
cores.
That option specify the maximum number of cores that the OBITools
command can use.
This behaviour can also be set up using the \f[V]OBIMAXCPU\f[R]
environment variable.
.PP
\f[B]--workers\f[R], \f[B]-w\f[R]
.PP
\f[B]OBITools debuging related options\f[R]
.PP
\f[B]--debug\f[R]
.SS Input format options
.PP
The OBITools are centered around the [FASTA]
(https://en.wikipedia.org/wiki/FASTA_format) and [FASTQ]
(https://en.wikipedia.org/wiki/FASTQ_format) formats.
These formats are automaticaly recognized when data are read both from
files, and from standard input (\f[V]stdin\f[R]).
Other formats (genbank, EMBL, ecopcr) are also automatically identified
when data are read from files, but for stdin input, input format must be
indicated using one of the following options.
.SS Output format options
.PP
\f[B]--fasta-output\f[R]
.PP
\f[B]--fastq-output\f[R]
.PP
\f[B]--output-OBI-header\f[R], \f[B]-O\f[R]
.PP
\f[B]--output-json-header\f[R]
.TP
\f[B]--out\f[R] \f[I]FILENAME\f[R], \f[B]-o\f[R]
OBITools, as all standard UNIX tools, print their results to the
standard output (\f[V]stdout\f[R]).
To save them, stdout must be redirected to a file.
That option allows to specify explicitely an output file to the command.
This is especially useful when OBITools are processing paired files.
In that later case, the indicated output file names is modified by
adding to it the \f[I]_R1\f[R] (forward file) and \f[I]_R2\f[R] (reverse
file) suffix just before the extensions (\f[I]e.g.\f[R] sequence.fasta
becomes sequence_R1.fasta and sequence_R2.fasta).
If that option is not specified and paired files are processed only the
forward data are ouputed to the \f[I]stdout\f[R].
.TP
\f[B]--compress\f[R], \f[B]-Z\f[R]
The ouput is compressed following the
gzip (https://en.wikipedia.org/wiki/Gzip) format.
.SS Paired reads options
.PP
\f[B]--paired-with\f[R] \f[I]FILENAME\f[R]
.PP
\f[B]--paired-mode\f[R] \f[I]forward|reverse|and|or|andnot|xor\f[R]
.SS Taxonomy related options
.PP
\f[B]--taxdump\f[R] | \f[B]-t\f[R] \f[I]DIRECTORY\f[R]
.PP
\f[B]--ignore-taxon\f[R] | \f[B]-i\f[R] \f[I]TAXID\f[R]
.PP
\f[B]--require-rank\f[R] \f[I]RANK_NAME\f[R]
.PP
\f[B]--restrict-to-taxon\f[R] | \f[B]-r\f[R] \f[I]TAXID\f[R]
.SS Filtering options
.PP
\f[B]--has-attribute\f[R] | \f[B]-A\f[R] \f[I]KEY\f[R]\&...
.PP
\f[B]--id-list\f[R] \f[I]FILENAME\f[R]
.PP
\f[B]--identifier\f[R] | \f[B]-I\f[R] \f[I]PATTERN\f[R]
.TP
\f[B]--max-count\f[R] | \f[B]-C\f[R] \f[I]COUNT\f[R]
only sequences reprensenting no more than \f[I]COUNT\f[R] reads will be
selected.
That option rely on the \f[V]count\f[R] attribute.
If the \f[V]count\f[R] attribute is not defined for a sequence record,
it is assumed equal to 1.
.TP
\f[B]--min-count\f[R] | \f[B]-c\f[R] \f[I]COUNT\f[R]
only sequences reprensenting at least \f[I]COUNT\f[R] reads will be
selected.
That option rely on the \f[V]count\f[R] attribute.
If the \f[V]count\f[R] attribute is not defined for a sequence record,
it is assumed equal to 1.
.TP
\f[B]--max-length\f[R] | \f[B]-L\f[R] \f[I]LENGTH\f[R]
Keeps sequence records whose sequence length is equal or shorter than
\f[I]LENGTH\f[R].
.TP
\f[B]--min-length\f[R] | \f[B]-l\f[R] \f[I]LENGTH\f[R]
Keeps sequence records whose sequence length is equal or longer than
\f[I]LENGTH\f[R].
.PP
\f[B]--predicate\f[R]|\f[B]-p\f[R] \f[I]EXPRESSION\f[R]
.TP
\f[B]--sequence\f[R]|\f[B]-s\f[R] \f[I]PATTERN\f[R]
Regular expression pattern to be tested against the sequence itself.
The pattern is case insensitive.
A complete description of the regular pattern grammar is available
here (https://yourbasic.org/golang/regexp-cheat-sheet/#cheat-sheet).
.PP
\f[B]--inverse-match\f[R] | \f[B]-v\f[R]
.PP
\f[B]--save-discarded\f[R] \f[I]FILENAME\f[R]
.SH ENVIRONMENT
.PP
\f[B]OBICPUMAX\f[R]
.SH EXAMPLES
.IP \[bu] 2
Filtering sequence file to keep only barcodes between 8 and 130 bp.
.RS 2
.IP
.nf
\f[C]
obigrep -l 8 -L 130 data_SPER01.fasta > data_goodLength_SPER01.fasta
\f[R]
.fi
.RE
.IP \[bu] 2
Filtering reads without anbiguity base code in its sequence.
.RS 2
.IP
.nf
\f[C]
obigrep -s \[aq]\[ha][acgt]+$\[aq] data_SPER01.fasta > data_onlyACGT_SPER01.fasta
\f[R]
.fi
.RE
.IP \[bu] 2
Filtering paired files for keeping only pairs of read without ambiguity.
.RS 2
.IP
.nf
\f[C]
obigrep  -s \[aq]\[ha][acgt]+$\[aq] \[rs]
         --paired-mode and --paired-with wolf_R.fastq.gz \[rs]
         --out wolf_good.fastq \[rs]
         wolf_F.fastq.gz
\f[R]
.fi
.PP
That command produces two files \f[V]wolf_good_R1.fastq\f[R] and
\f[V]wolf_good_R1.fastq\f[R] containing respectively the filtered
forward and reverse reads.
.RE
.SH SEE ALSO
.PP
\f[V]obiannotate\f[R]
.SH HISTORY
.SH BUGS
.PP
Submit bug reports online at:
https://git.metabarcoding.org/obitools/obitools4/obitools4/-/issues
.SH AUTHORS
Eric Coissac <eric.coissac@metabarcoding.org>.