diff --git a/doc/build/_man/man1/obigrep.man b/doc/build/_man/man1/obigrep.man new file mode 100644 index 0000000..4c8e9a1 --- /dev/null +++ b/doc/build/_man/man1/obigrep.man @@ -0,0 +1,239 @@ +.\" Automatically generated by Pandoc 2.19.2 +.\" +.\" Define V font for inline verbatim, using C font in formats +.\" that render this, and otherwise B font. +.ie "\f[CB]x\f[]"x" \{\ +. ftr V B +. ftr VI BI +. ftr VB B +. ftr VBI BI +.\} +.el \{\ +. ftr V CR +. ftr VI CI +. ftr VB CB +. ftr VBI CBI +.\} +.TH "obigrep" "1" "" "" "" +.hy +.SH NAME +.PP +obigrep \[en] filters sequence files according to numerous conditions +.SH SYNOPSIS +.PP +\f[B]obigrep\f[R] [\f[B]--attribute\f[R] | \f[B]-a\f[R] +\f[I]KEY=VALUE\f[R]]\&... +[\f[B]--compress\f[R] | \f[B]-Z\f[R]] [\f[B]--debug\f[R]] +[\f[B]--definition\f[R]|\f[B]-D\f[R] \f[I]PATTERN\f[R]]\&... +.PD 0 +.P +.PD +[\f[B]--ecopcr\f[R]] [\f[B]--embl\f[R]] [\f[B]--fasta-output\f[R]] +[\f[B]--fastq-output\f[R]] [\f[B]--genbank\f[R]] +[\f[B]--has-attribute\f[R] | \f[B]-A\f[R] \f[I]KEY\f[R]]\&... +[\f[B]--help\f[R] | \f[B]-h\f[R] | \f[B]-?\f[R]] [\f[B]--id-list\f[R] +\f[I]FILENAME\f[R]] [\f[B]--identifier\f[R] | \f[B]-I\f[R] +\f[I]PATTERN\f[R]]\&... +[\f[B]--ignore-taxon\f[R] | \f[B]-i\f[R] \f[I]TAXID\f[R]]\&... +[\f[B]--input-OBI-header\f[R]] [\f[B]--input-json-header\f[R]] +[\f[B]--inverse-match\f[R] | \f[B]-v\f[R]] +[\f[B]--max-count\f[R]|\f[B]-C\f[R] \f[I]COUNT\f[R]] +[\f[B]--max-cpu\f[R] \f[I]INT\f[R]] [\f[B]--max-length\f[R] | +\f[B]-L\f[R] \f[I]LENGTH\f[R]] [\f[B]--min-count\f[R] | \f[B]-c\f[R] +\f[I]COUNT\f[R]] [\f[B]--min-length\f[R] | \f[B]-l\f[R] +\f[I]LENGTH\f[R]] [\f[B]--no-order\f[R]] [\f[B]--no-progressbar\f[R]] +[\f[B]--out\f[R] | \f[B]-o\f[R] \f[I]FILENAME\f[R]] +[\f[B]--output-OBI-header\f[R] | \f[B]-O\f[R]] +[\f[B]--output-json-header\f[R]] [\f[B]--paired-mode\f[R] +\f[I]forward|reverse|and|or|andnot|xor\f[R]] [\f[B]--paired-with\f[R] +\f[I]FILENAME\f[R]] [\f[B]--predicate\f[R]|\f[B]-p\f[R] +\f[I]EXPRESSION\f[R]]\&... 
+[\f[B]--require-rank\f[R] \f[I]RANK_NAME\f[R]]\&... +[\f[B]--restrict-to-taxon\f[R] | \f[B]-r\f[R] \f[I]TAXID\f[R]]\&... +[\f[B]--save-discarded\f[R] \f[I]FILENAME\f[R]] +[\f[B]--sequence\f[R]|\f[B]-s\f[R] \f[I]PATTERN\f[R]]\&... +[\f[B]--solexa\f[R]] [\f[B]--taxdump\f[R] | \f[B]-t\f[R] +\f[I]DIRECTORY\f[R]] [\f[B]--workers\f[R] | \f[B]-w\f[R] \f[I]INT\f[R]] +[\f[I]FILENAMES\f[R]] +.SH DESCRIPTION +.PP +The \f[V]obigrep\f[R] command is somewhat analogous to the standard Unix +\f[V]grep\f[R] command. +It selects a subset of sequence records from a sequence file. +A sequence record is a complex object consisting of an identifier, a set +of attributes (a key, defined by its name, associated with a value), a +definition, and the sequence itself. +Instead of working text line by text line like the standard Unix tool, +\f[V]obigrep\f[R] selection is done sequence record by sequence record. +A large number of options allow you to refine the selection on any +element of the sequence. +\f[V]obigrep\f[R] allows you to specify multiple conditions +simultaneously (which take on the value \f[V]TRUE\f[R] or +\f[V]FALSE\f[R]) and only those sequence records which meet all +conditions (all conditions are \f[V]TRUE\f[R]) are selected. +\f[V]obigrep\f[R] is able to work on two paired read files. +The selection criteria apply to one or the other of the reads in each +pair depending on the mode chosen (\f[B]--paired-mode\f[R] option). +In all cases the selection is applied in the same way to both files, +thus maintaining their consistency. +.SH OPTIONS +.SS General options +.PP +\f[B]Helpful options\f[R] +.TP +\f[B]--help\f[R], \f[B]-h\f[R] +Display a friendly help message. +.PP +\f[B]--no-progressbar\f[R] +.PP +\f[B]Managing parallel execution\f[R] +.TP +\f[B]--max-cpu\f[R] +OBITools V4 are able to run in parallel on all the CPU cores available +on the computer. +It is sometimes required to limit the computation to a smaller number of +cores.
+This option specifies the maximum number of cores that the OBITools +command can use. +This behaviour can also be set up using the \f[V]OBIMAXCPU\f[R] +environment variable. +.PP +\f[B]--workers\f[R], \f[B]-w\f[R] +.PP +\f[B]OBITools debugging related options\f[R] +.PP +\f[B]--debug\f[R] +.SS Input format options +.PP +The OBITools are centered around the [FASTA] +(https://en.wikipedia.org/wiki/FASTA_format) and [FASTQ] +(https://en.wikipedia.org/wiki/FASTQ_format) formats. +These formats are automatically recognized when data are read both from +files and from standard input (\f[V]stdin\f[R]). +Other formats (genbank, EMBL, ecopcr) are also automatically identified +when data are read from files, but for stdin input, the input format must be +indicated using one of the following options. +.SS Output format options +.PP +\f[B]--fasta-output\f[R] +.PP +\f[B]--fastq-output\f[R] +.PP +\f[B]--output-OBI-header\f[R], \f[B]-O\f[R] +.PP +\f[B]--output-json-header\f[R] +.TP +\f[B]--out\f[R] \f[I]FILENAME\f[R], \f[B]-o\f[R] +OBITools, like all standard UNIX tools, print their results to the +standard output (\f[V]stdout\f[R]). +To save them, stdout must be redirected to a file. +This option allows you to explicitly specify an output file for the command. +This is especially useful when OBITools are processing paired files. +In that latter case, the indicated output file name is modified by +adding to it the \f[I]_R1\f[R] (forward file) and \f[I]_R2\f[R] (reverse +file) suffixes just before the extension (\f[I]e.g.\f[R] sequence.fasta +becomes sequence_R1.fasta and sequence_R2.fasta). +If that option is not specified and paired files are processed, only the +forward data are output to the \f[I]stdout\f[R]. +.TP +\f[B]--compress\f[R], \f[B]-Z\f[R] +The output is compressed following the +gzip (https://en.wikipedia.org/wiki/Gzip) format.
+.SS Paired reads options +.PP +\f[B]--paired-with\f[R] \f[I]FILENAME\f[R] +.PP +\f[B]--paired-mode\f[R] \f[I]forward|reverse|and|or|andnot|xor\f[R] +.SS Taxonomy related options +.PP +\f[B]--taxdump\f[R] | \f[B]-t\f[R] \f[I]DIRECTORY\f[R] +.PP +\f[B]--ignore-taxon\f[R] | \f[B]-i\f[R] \f[I]TAXID\f[R] +.PP +\f[B]--require-rank\f[R] \f[I]RANK_NAME\f[R] +.PP +\f[B]--restrict-to-taxon\f[R] | \f[B]-r\f[R] \f[I]TAXID\f[R] +.SS Filtering options +.PP +\f[B]--has-attribute\f[R] | \f[B]-A\f[R] \f[I]KEY\f[R]\&... +.PP +\f[B]--id-list\f[R] \f[I]FILENAME\f[R] +.PP +\f[B]--identifier\f[R] | \f[B]-I\f[R] \f[I]PATTERN\f[R] +.TP +\f[B]--max-count\f[R] | \f[B]-C\f[R] \f[I]COUNT\f[R] +Only sequences representing no more than \f[I]COUNT\f[R] reads will be +selected. +This option relies on the \f[V]count\f[R] attribute. +If the \f[V]count\f[R] attribute is not defined for a sequence record, +it is assumed equal to 1. +.TP +\f[B]--min-count\f[R] | \f[B]-c\f[R] \f[I]COUNT\f[R] +Only sequences representing at least \f[I]COUNT\f[R] reads will be +selected. +This option relies on the \f[V]count\f[R] attribute. +If the \f[V]count\f[R] attribute is not defined for a sequence record, +it is assumed equal to 1. +.PP +\f[B]--max-length\f[R] | \f[B]-L\f[R] \f[I]LENGTH\f[R] +.PP +\f[B]--min-length\f[R] | \f[B]-l\f[R] \f[I]LENGTH\f[R] +.PP +\f[B]--predicate\f[R]|\f[B]-p\f[R] \f[I]EXPRESSION\f[R] +.PP +\f[B]--sequence\f[R]|\f[B]-s\f[R] \f[I]PATTERN\f[R] +.PP +\f[B]--inverse-match\f[R] | \f[B]-v\f[R] +.PP +\f[B]--save-discarded\f[R] \f[I]FILENAME\f[R] +.SH ENVIRONMENT +.PP +\f[B]OBIMAXCPU\f[R] +.SH EXAMPLES +.IP \[bu] 2 +Filtering a sequence file to keep only barcodes between 8 and 130 bp. +.RS 2 +.IP +.nf +\f[C] +obigrep -l 8 -L 130 data_SPER01.fasta > data_goodLength_SPER01.fasta +\f[R] +.fi +.RE +.IP \[bu] 2 +Filtering reads to keep only those without ambiguous base codes in their sequence.
+.RS 2 +.IP +.nf +\f[C] +obigrep -s \[aq]\[ha][acgt]+$\[aq] data_SPER01.fasta > data_onlyACGT_SPER01.fasta +\f[R] +.fi +.RE +.IP \[bu] 2 +Filtering paired files to keep only pairs of reads without ambiguity. +.RS 2 +.IP +.nf +\f[C] +obigrep -s \[aq]\[ha][acgt]+$\[aq] \[rs] + --paired-mode and --paired-with wolf_R.fastq.gz \[rs] + --out wolf_good.fastq \[rs] + wolf_F.fastq.gz +\f[R] +.fi +.PP +That command produces two files, \f[V]wolf_good_R1.fastq\f[R] and +\f[V]wolf_good_R2.fastq\f[R], containing respectively the filtered +forward and reverse reads. +.RE +.SH SEE ALSO +.PP +\f[V]obiannotate\f[R] +.SH HISTORY +.SH BUGS +.PP +Submit bug reports online at: +https://git.metabarcoding.org/obitools/obitools4/obitools4/-/issues +.SH AUTHORS +Eric Coissac . diff --git a/doc/man/Makefile b/doc/man/Makefile new file mode 100644 index 0000000..17f11d7 --- /dev/null +++ b/doc/man/Makefile @@ -0,0 +1,41 @@ +MANPAGES= obigrep + +BUILDDIR=../build +MANDIR=$(BUILDDIR)/_man +MANDEST=$(MANDIR)/man1 +HTMLDEST=$(MANDIR)/html + +MANSRC=$(MANPAGES:=.qmd) +DEPS=$(patsubst %,depends/%,$(MANPAGES:=.d)) +MAN=$(patsubst %,$(MANDEST)/%,$(MANSRC:.qmd=.man)) + + + +all: $(MAN) + +clean: + rm -f $(MAN) + rm -rf depends + +.PHONY: all clean + +$(MANDEST): + @echo Creating $@ directory + @mkdir -p $@ + +$(MAN) : $(MANDEST)/%.man : %.qmd $(MANDEST) + @echo "Rendering the man page for " $(notdir $(@:.man=)) + @quarto render $< --to man + @mv $(notdir $@) $@ + @echo ===================================================== + @echo + +depends/%.d: %.qmd + @mkdir -p depends + @echo Generating depends file for $(notdir $(@:.d=)) + @awk -v src=$< 'BEGIN {printf("%s: ",src)} \ + /\{\{< *include *[^>]+>\}\}/ {sub(/^ *\{\{< *include */,"",$$0); \ + sub(/ *> *\}\} */,"",$$0); \ + printf("%s ",$$0)}' $< > $@ + +-include $(DEPS) \ No newline at end of file diff --git a/doc/man/obigrep.qmd b/doc/man/obigrep.qmd new file mode 100644 index 0000000..05b4e6d --- /dev/null +++ b/doc/man/obigrep.qmd @@ -0,0 +1,153 @@ +---
+title: "obigrep" +section: 1 +author: Eric Coissac +format: + html: default + man: default +--- + +# NAME + +obigrep -- filters sequence files according to numerous conditions + +# SYNOPSIS + + +**obigrep** \[**\--attribute** | **-a** _KEY=VALUE_]... + \[**\--compress** | **-Z**] + \[**\--debug**] + \[**\--definition**|**-D** _PATTERN_]... + \[**\--ecopcr**] + \[**\--embl**] + \[**\--fasta-output**] + \[**\--fastq-output**] + \[**\--genbank**] + \[**\--has-attribute** | **-A** _KEY_]... + \[**\--help** | **-h** | **-?**] + \[**\--id-list** _FILENAME_] + \[**\--identifier** | **-I** _PATTERN_]... + \[**\--ignore-taxon** | **-i** _TAXID_]... + \[**\--input-OBI-header**] + \[**\--input-json-header**] + \[**\--inverse-match** | **-v**] + \[**\--max-count**|**-C** _COUNT_] + \[**\--max-cpu** _INT_] + \[**\--max-length** | **-L** _LENGTH_] + \[**\--min-count** | **-c** _COUNT_] + \[**\--min-length** | **-l** _LENGTH_] + \[**\--no-order**] + \[**\--no-progressbar**] + \[**\--out** | **-o** _FILENAME_] + \[**\--output-OBI-header** | **-O**] + \[**\--output-json-header**] + \[**\--paired-mode** _forward|reverse|and|or|andnot|xor_] + \[**\--paired-with** _FILENAME_] + \[**\--predicate**|**-p** _EXPRESSION_]... + \[**\--require-rank** _RANK_NAME_]... + \[**\--restrict-to-taxon** | **-r** _TAXID_]... + \[**\--save-discarded** _FILENAME_] + \[**\--sequence**|**-s** _PATTERN_]... + \[**\--solexa**] + \[**\--taxdump** | **-t** _DIRECTORY_] + \[**\--workers** | **-w** _INT_] [_FILENAMES_] + +# DESCRIPTION + +{{< include ../lib/descriptions/_obigrep.qmd >}} + +# OPTIONS + +## General options + +{{< include ../lib/options/_system.qmd >}} + +## Input format options + +The OBITools are centered around the [FASTA] (https://en.wikipedia.org/wiki/FASTA_format) and [FASTQ] (https://en.wikipedia.org/wiki/FASTQ_format) formats. These formats are automatically recognized when data are read both from files and from standard input (`stdin`).
Other formats (genbank, EMBL, ecopcr) are also automatically identified when data are read from files, but for stdin input, the input format must be indicated using one of the following options. + + +## Output format options + +{{< include ../lib/options/_output.qmd >}} + +## Paired reads options + +**\--paired-with** _FILENAME_ + +**\--paired-mode** _forward|reverse|and|or|andnot|xor_ + +## Taxonomy related options + +**\--taxdump** | **-t** _DIRECTORY_ + +**\--ignore-taxon** | **-i** _TAXID_ + +**\--require-rank** _RANK_NAME_ + +**\--restrict-to-taxon** | **-r** _TAXID_ + +## Filtering options + +**\--has-attribute** | **-A** _KEY_... + +**\--id-list** _FILENAME_ + +**\--identifier** | **-I** _PATTERN_ + +{{< include ../lib/options/selection/_max-count.qmd >}} + +{{< include ../lib/options/selection/_min-count.qmd >}} + +**\--max-length** | **-L** _LENGTH_ + +**\--min-length** | **-l** _LENGTH_ + +**\--predicate**|**-p** _EXPRESSION_ + +**\--sequence**|**-s** _PATTERN_ + +**\--inverse-match** | **-v** + +**\--save-discarded** _FILENAME_ + +# ENVIRONMENT + +**OBIMAXCPU** + +# EXAMPLES + +- Filtering a sequence file to keep only barcodes between 8 and 130 bp. + + ```bash + obigrep -l 8 -L 130 data_SPER01.fasta > data_goodLength_SPER01.fasta + ``` + +- Filtering reads to keep only those without ambiguous base codes in their sequence. + + ```bash + obigrep -s '^[acgt]+$' data_SPER01.fasta > data_onlyACGT_SPER01.fasta + ``` +- Filtering paired files to keep only pairs of reads without ambiguity. + + ```bash + obigrep -s '^[acgt]+$' \ + --paired-mode and --paired-with wolf_R.fastq.gz \ + --out wolf_good.fastq \ + wolf_F.fastq.gz + ``` + + That command produces two files, `wolf_good_R1.fastq` and `wolf_good_R2.fastq`, + containing respectively the filtered forward and reverse reads. + +# SEE ALSO + +`obiannotate` + +# HISTORY + +# BUGS + +Submit bug reports online at: https://git.metabarcoding.org/obitools/obitools4/obitools4/-/issues + +