mirror of
https://github.com/metabarcoding/obitools4.git
synced 2025-06-29 16:20:46 +00:00
248 lines
8.3 KiB
Groff
248 lines
8.3 KiB
Groff
.\" Automatically generated by Pandoc 2.19.2
|
|
.\"
|
|
.\" Define V font for inline verbatim, using C font in formats
|
|
.\" that render this, and otherwise B font.
|
|
.ie "\f[CB]x\f[]"x" \{\
|
|
. ftr V B
|
|
. ftr VI BI
|
|
. ftr VB B
|
|
. ftr VBI BI
|
|
.\}
|
|
.el \{\
|
|
. ftr V CR
|
|
. ftr VI CI
|
|
. ftr VB CB
|
|
. ftr VBI CBI
|
|
.\}
|
|
.TH "obigrep" "1" "" "" ""
|
|
.hy
|
|
.SH NAME
|
|
.PP
|
|
obigrep \[en] filters sequence files according to numerous conditions
|
|
.SH SYNOPSIS
|
|
.PP
|
|
\f[B]obigrep\f[R] [\f[B]--attribute\f[R] | \f[B]-a\f[R]
|
|
\f[I]KEY=VALUE\f[R]]\&...
|
|
[\f[B]--compress\f[R] | \f[B]-Z\f[R]] [\f[B]--debug\f[R]]
|
|
[\f[B]--definition\f[R]|\f[B]-D\f[R] \f[I]PATTERN\f[R]]\&...
|
|
.PD 0
|
|
.P
|
|
.PD
|
|
[\f[B]--ecopcr\f[R]] [\f[B]--embl\f[R]] [\f[B]--fasta-output\f[R]]
|
|
[\f[B]--fastq-output\f[R]] [\f[B]--genbank\f[R]]
|
|
[\f[B]--has-attribute\f[R] | \f[B]-A\f[R] \f[I]KEY\f[R]]\&...
|
|
[\f[B]--help\f[R] | \f[B]-h\f[R] | \f[B]-?\f[R]] [\f[B]--id-list\f[R]
|
|
\f[I]FILENAME\f[R]] [\f[B]--identifier\f[R] | \f[B]-I\f[R]
|
|
\f[I]PATTERN\f[R]]\&...
|
|
[\f[B]--ignore-taxon\f[R] | \f[B]-i\f[R] \f[I]TAXID\f[R]]\&...
|
|
[\f[B]--input-OBI-header\f[R]] [\f[B]--input-json-header\f[R]]
|
|
[\f[B]--inverse-match\f[R] | \f[B]-v\f[R]]
|
|
[\f[B]--max-count\f[R]|\f[B]-C\f[R] \f[I]COUNT\f[R]]
|
|
[\f[B]--max-cpu\f[R] \f[I]INT\f[R]] [\f[B]--max-length\f[R] |
|
|
\f[B]-L\f[R] \f[I]LENGTH\f[R]] [\f[B]--min-count\f[R] | \f[B]-c\f[R]
|
|
\f[I]COUNT\f[R]] [\f[B]--min-length\f[R] | \f[B]-l\f[R]
|
|
\f[I]LENGTH\f[R]] [\f[B]--no-order\f[R]] [\f[B]--no-progressbar\f[R]]
|
|
[\f[B]--out\f[R] | \f[B]-o\f[R] \f[I]FILENAME\f[R]]
|
|
[\f[B]--output-OBI-header\f[R] | \f[B]-O\f[R]]
|
|
[\f[B]--output-json-header\f[R]] [\f[B]--paired-mode\f[R]
|
|
\f[I]forward|reverse|and|or|andnot|xor\f[R]] [\f[B]--paired-with\f[R]
|
|
\f[I]FILENAME\f[R]] [\f[B]--predicate\f[R]|\f[B]-p\f[R]
|
|
\f[I]EXPRESSION\f[R]]\&...
|
|
[\f[B]--require-rank\f[R] \f[I]RANK_NAME\f[R]]\&...
|
|
[\f[B]--restrict-to-taxon\f[R] | \f[B]-r\f[R] \f[I]TAXID\f[R]]\&...
|
|
[\f[B]--save-discarded\f[R] \f[I]FILENAME\f[R]]
|
|
[\f[B]--sequence\f[R]|\f[B]-s\f[R] \f[I]PATTERN\f[R]]\&...
|
|
[\f[B]--solexa\f[R]] [\f[B]--taxdump\f[R] | \f[B]-t\f[R]
|
|
\f[I]DIRECTORY\f[R]] [\f[B]--workers\f[R] | \f[B]-w\f[R] \f[I]INT\f[R]]
|
|
[\f[I]FILENAMES\f[R]]
|
|
.SH DESCRIPTION
|
|
.PP
|
|
The \f[V]obigrep\f[R] command is somewhat analogous to the standard Unix
|
|
\f[V]grep\f[R] command.
|
|
It selects a subset of sequence records from a sequence file.
|
|
A sequence record is a complex object consisting of an identifier, a set
|
|
of attributes (a key, defined by its name, associated with a value), a
|
|
definition, and the sequence itself.
|
|
Instead of working text line by text line like the standard Unix tool,
|
|
\f[V]obigrep\f[R] selection is done sequence record by sequence record.
|
|
A large number of options allow you to refine the selection on any
|
|
element of the sequence.
|
|
\f[V]obigrep\f[R] allows you to specify multiple conditions
|
|
simultaneously (which take on the value \f[V]TRUE\f[R] or
|
|
\f[V]FALSE\f[R]) and only those sequence records which meet all
|
|
conditions (all conditions are \f[V]TRUE\f[R]) are selected.
|
|
\f[V]obigrep\f[R] is able to work on two paired read files.
|
|
The selection criteria apply to one or the other of the readings in each
|
|
pair depending on the mode chosen (\f[B]--paired-mode\f[R] option).
|
|
In all cases the selection is applied in the same way to both files,
|
|
thus maintaining their consistency.
|
|
.SH OPTIONS
|
|
.SS General options
|
|
.PP
|
|
\f[B]Helpful options\f[R]
|
|
.TP
|
|
\f[B]--help\f[R], \f[B]-h\f[R]
|
|
Display a friendly help message.
|
|
.PP
|
|
\f[B]--no-progressbar\f[R]
|
|
.PP
|
|
\f[B]Managing parallel execution\f[R]
|
|
.TP
|
|
\f[B]--max-cpu\f[R]
|
|
OBITools V4 are able to run in parallel on all the CPU cores available
|
|
on the computer.
|
|
It is sometime required to limit the computation to a smaller number of
|
|
cores.
|
|
That option specify the maximum number of cores that the OBITools
|
|
command can use.
|
|
This behaviour can also be set up using the \f[V]OBIMAXCPU\f[R]
|
|
environment variable.
|
|
.PP
|
|
\f[B]--workers\f[R], \f[B]-w\f[R]
|
|
.PP
|
|
\f[B]OBITools debuging related options\f[R]
|
|
.PP
|
|
\f[B]--debug\f[R]
|
|
.SS Input format options
|
|
.PP
|
|
The OBITools are centered around the [FASTA]
|
|
(https://en.wikipedia.org/wiki/FASTA_format) and [FASTQ]
|
|
(https://en.wikipedia.org/wiki/FASTQ_format) formats.
|
|
These formats are automaticaly recognized when data are read both from
|
|
files, and from standard input (\f[V]stdin\f[R]).
|
|
Other formats (genbank, EMBL, ecopcr) are also automatically identified
|
|
when data are read from files, but for stdin input, input format must be
|
|
indicated using one of the following options.
|
|
.SS Output format options
|
|
.PP
|
|
\f[B]--fasta-output\f[R]
|
|
.PP
|
|
\f[B]--fastq-output\f[R]
|
|
.PP
|
|
\f[B]--output-OBI-header\f[R], \f[B]-O\f[R]
|
|
.PP
|
|
\f[B]--output-json-header\f[R]
|
|
.TP
|
|
\f[B]--out\f[R] \f[I]FILENAME\f[R], \f[B]-o\f[R]
|
|
OBITools, as all standard UNIX tools, print their results to the
|
|
standard output (\f[V]stdout\f[R]).
|
|
To save them, stdout must be redirected to a file.
|
|
That option allows to specify explicitely an output file to the command.
|
|
This is especially useful when OBITools are processing paired files.
|
|
In that later case, the indicated output file names is modified by
|
|
adding to it the \f[I]_R1\f[R] (forward file) and \f[I]_R2\f[R] (reverse
|
|
file) suffix just before the extensions (\f[I]e.g.\f[R] sequence.fasta
|
|
becomes sequence_R1.fasta and sequence_R2.fasta).
|
|
If that option is not specified and paired files are processed only the
|
|
forward data are ouputed to the \f[I]stdout\f[R].
|
|
.TP
|
|
\f[B]--compress\f[R], \f[B]-Z\f[R]
|
|
The ouput is compressed following the
|
|
gzip (https://en.wikipedia.org/wiki/Gzip) format.
|
|
.SS Paired reads options
|
|
.PP
|
|
\f[B]--paired-with\f[R] \f[I]FILENAME\f[R]
|
|
.PP
|
|
\f[B]--paired-mode\f[R] \f[I]forward|reverse|and|or|andnot|xor\f[R]
|
|
.SS Taxonomy related options
|
|
.PP
|
|
\f[B]--taxdump\f[R] | \f[B]-t\f[R] \f[I]DIRECTORY\f[R]
|
|
.PP
|
|
\f[B]--ignore-taxon\f[R] | \f[B]-i\f[R] \f[I]TAXID\f[R]
|
|
.PP
|
|
\f[B]--require-rank\f[R] \f[I]RANK_NAME\f[R]
|
|
.PP
|
|
\f[B]--restrict-to-taxon\f[R] | \f[B]-r\f[R] \f[I]TAXID\f[R]
|
|
.SS Filtering options
|
|
.PP
|
|
\f[B]--has-attribute\f[R] | \f[B]-A\f[R] \f[I]KEY\f[R]\&...
|
|
.PP
|
|
\f[B]--id-list\f[R] \f[I]FILENAME\f[R]
|
|
.PP
|
|
\f[B]--identifier\f[R] | \f[B]-I\f[R] \f[I]PATTERN\f[R]
|
|
.TP
|
|
\f[B]--max-count\f[R] | \f[B]-C\f[R] \f[I]COUNT\f[R]
|
|
only sequences reprensenting no more than \f[I]COUNT\f[R] reads will be
|
|
selected.
|
|
That option rely on the \f[V]count\f[R] attribute.
|
|
If the \f[V]count\f[R] attribute is not defined for a sequence record,
|
|
it is assumed equal to 1.
|
|
.TP
|
|
\f[B]--min-count\f[R] | \f[B]-c\f[R] \f[I]COUNT\f[R]
|
|
only sequences reprensenting at least \f[I]COUNT\f[R] reads will be
|
|
selected.
|
|
That option rely on the \f[V]count\f[R] attribute.
|
|
If the \f[V]count\f[R] attribute is not defined for a sequence record,
|
|
it is assumed equal to 1.
|
|
.TP
|
|
\f[B]--max-length\f[R] | \f[B]-L\f[R] \f[I]LENGTH\f[R]
|
|
Keeps sequence records whose sequence length is equal or shorter than
|
|
\f[I]LENGTH\f[R].
|
|
.TP
|
|
\f[B]--min-length\f[R] | \f[B]-l\f[R] \f[I]LENGTH\f[R]
|
|
Keeps sequence records whose sequence length is equal or longer than
|
|
\f[I]LENGTH\f[R].
|
|
.PP
|
|
\f[B]--predicate\f[R]|\f[B]-p\f[R] \f[I]EXPRESSION\f[R]
|
|
.TP
|
|
\f[B]--sequence\f[R]|\f[B]-s\f[R] \f[I]PATTERN\f[R]
|
|
Regular expression pattern to be tested against the sequence itself.
|
|
The pattern is case insensitive.
|
|
A complete description of the regular pattern grammar is available
|
|
here (https://yourbasic.org/golang/regexp-cheat-sheet/#cheat-sheet).
|
|
.PP
|
|
\f[B]--inverse-match\f[R] | \f[B]-v\f[R]
|
|
.PP
|
|
\f[B]--save-discarded\f[R] \f[I]FILENAME\f[R]
|
|
.SH ENVIRONMENT
|
|
.PP
|
|
\f[B]OBICPUMAX\f[R]
|
|
.SH EXAMPLES
|
|
.IP \[bu] 2
|
|
Filtering sequence file to keep only barcodes between 8 and 130 bp.
|
|
.RS 2
|
|
.IP
|
|
.nf
|
|
\f[C]
|
|
obigrep -l 8 -L 130 data_SPER01.fasta > data_goodLength_SPER01.fasta
|
|
\f[R]
|
|
.fi
|
|
.RE
|
|
.IP \[bu] 2
|
|
Filtering reads without anbiguity base code in its sequence.
|
|
.RS 2
|
|
.IP
|
|
.nf
|
|
\f[C]
|
|
obigrep -s \[aq]\[ha][acgt]+$\[aq] data_SPER01.fasta > data_onlyACGT_SPER01.fasta
|
|
\f[R]
|
|
.fi
|
|
.RE
|
|
.IP \[bu] 2
|
|
Filtering paired files for keeping only pairs of read without ambiguity.
|
|
.RS 2
|
|
.IP
|
|
.nf
|
|
\f[C]
|
|
obigrep -s \[aq]\[ha][acgt]+$\[aq] \[rs]
|
|
--paired-mode and --paired-with wolf_R.fastq.gz \[rs]
|
|
--out wolf_good.fastq \[rs]
|
|
wolf_F.fastq.gz
|
|
\f[R]
|
|
.fi
|
|
.PP
|
|
That command produces two files \f[V]wolf_good_R1.fastq\f[R] and
|
|
\f[V]wolf_good_R1.fastq\f[R] containing respectively the filtered
|
|
forward and reverse reads.
|
|
.RE
|
|
.SH SEE ALSO
|
|
.PP
|
|
\f[V]obiannotate\f[R]
|
|
.SH HISTORY
|
|
.SH BUGS
|
|
.PP
|
|
Submit bug reports online at:
|
|
https://git.metabarcoding.org/obitools/obitools4/obitools4/-/issues
|
|
.SH AUTHORS
|
|
Eric Coissac <eric.coissac@metabarcoding.org>.
|