12 Sequence sampling and filtering
12.1 obigrep – filters sequence files according to numerous conditions
The obigrep command is somewhat analogous to the standard Unix grep command. It selects a subset of sequence records from a sequence file. A sequence record is a complex object consisting of an identifier, a set of attributes (each one a key, defined by its name, associated with a value), a definition, and the sequence itself. Instead of working line by line like the standard Unix tool, obigrep selects sequence record by sequence record. A large number of options allow you to refine the selection on any element of the sequence record. obigrep allows you to specify multiple conditions simultaneously (each evaluating to TRUE or FALSE), and only those sequence records that meet all conditions (all conditions are TRUE) are selected. obigrep can also work on two paired read files. The selection criteria apply to one or the other of the reads in each pair, depending on the mode chosen (--paired-mode option). In all cases the selection is applied in the same way to both files, thus maintaining their consistency.
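The rule that a record is kept only when every condition is TRUE can be sketched in a few lines of Python. This is an illustrative model only, not obigrep's implementation: the `SeqRecord` class, the `select` function, and the two example conditions are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SeqRecord:
    """Hypothetical stand-in for a sequence record (not an OBITools type)."""
    id: str
    sequence: str
    definition: str = ""
    attributes: Dict[str, object] = field(default_factory=dict)


# A condition is any function that maps a record to True or False.
Condition = Callable[[SeqRecord], bool]


def select(records: List[SeqRecord], conditions: List[Condition]) -> List[SeqRecord]:
    """Keep only the records for which every condition evaluates to True."""
    return [r for r in records if all(cond(r) for cond in conditions)]


records = [
    SeqRecord("seq1", "AACCTTGG", attributes={"count": 7}),
    SeqRecord("seq2", "AACC", attributes={"count": 7}),      # too short
    SeqRecord("seq3", "AACCTTGG", attributes={"count": 1}),  # too rare
]

# Two illustrative conditions, combined with a logical AND by select().
conditions = [
    lambda r: len(r.sequence) >= 6,
    lambda r: r.attributes.get("count", 1) >= 5,
]

selected_ids = [r.id for r in select(records, conditions)]
print(selected_ids)
```

Only `seq1` satisfies both conditions, so it is the only record selected. Adding a condition can therefore only narrow the selection, never widen it.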
12.1.1 The options usable with obigrep
12.1.1.1 Selecting sequences based on their characteristics
Sequences can be selected on several of their characteristics: their length, their id, their sequence. Options allow you to specify the selection condition.
Selection based on the sequence
Sequence records can be selected according to whether or not they match a pattern. The simplest pattern is a short sequence (e.g. AACCTT). Regular expressions make it possible to look for more complex patterns. For example, A[TG]C+G matches an A, followed by a T or a G, then one or several C, and finally a G.
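To make the A[TG]C+G example concrete, the sketch below tests that pattern against a few toy sequences. It uses Python's `re` module purely as a stand-in; the regular expression engine used by obigrep may differ in details, and the sequences are invented for illustration. The `re.IGNORECASE` flag is used so that lower-case sequences are matched as well.

```python
import re

# A, then T or G, then one or more C, then G; case is ignored.
pattern = re.compile(r"A[TG]C+G", re.IGNORECASE)

# Toy sequences invented for this illustration.
sequences = {
    "s1": "TTATCGAA",    # contains ATCG: A, T, one C, G
    "s2": "ccagcccgtt",  # contains agcccg: lower case, three C
    "s3": "TTAACGTT",    # ACG lacks the T or G after the A
    "s4": "ATGG",        # no C between the T and the final G
}

# search() looks for the pattern anywhere in the sequence.
matches = {name: bool(pattern.search(seq)) for name, seq in sequences.items()}

for name, matched in matches.items():
    print(name, "match" if matched else "no match")
```

Here `s1` and `s2` match while `s3` and `s4` do not. Because the pattern is searched anywhere in the sequence, anchors such as `^` and `$` are needed when the whole sequence must conform to the pattern.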
- --sequence | -s PATTERN
Regular expression pattern to be tested against the sequence itself. The pattern is case insensitive. A complete description of the regular expression grammar is available here.
- Examples:
Selects only the sequence records that contain an EcoRI restriction site.
obigrep -s 'GAATTC' seq1.fasta > seq2.fasta
Selects only the sequence records that contain a stretch of at least 10 A.
obigrep -s 'A{10,}' seq1.fasta > seq2.fasta
Selects only the sequence records that do not contain ambiguous nucleotides.
obigrep -s '^[ACGT]+$' seq1.fasta > seq2.fasta
- --min-count | -c COUNT
Only sequences representing at least COUNT reads are selected. This option relies on the count attribute. If the count attribute is not defined for a sequence record, it is assumed to be equal to 1.
- --max-count | -C COUNT
Only sequences representing no more than COUNT reads are selected. This option relies on the count attribute. If the count attribute is not defined for a sequence record, it is assumed to be equal to 1.
- Examples:
Selects the sequence records representing at least five reads in the dataset.
obigrep -c 5 data_SPER01.fasta > data_norare_SPER01.fasta
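The behaviour of the two count options, including the default of 1 when the count attribute is missing, can be modelled as follows. This is a minimal sketch and not obigrep's code: the function `passes_count_filter`, its parameter names, and the toy records are assumptions made for illustration.

```python
from typing import Dict, Optional


def passes_count_filter(
    attributes: Dict[str, int],
    min_count: Optional[int] = None,
    max_count: Optional[int] = None,
) -> bool:
    """Return True if the record's count lies within the requested bounds."""
    # A record without a count attribute is treated as representing one read.
    count = attributes.get("count", 1)
    if min_count is not None and count < min_count:
        return False
    if max_count is not None and count > max_count:
        return False
    return True


# Toy attribute sets invented for this illustration.
records = {
    "a": {"count": 12},
    "b": {"count": 5},
    "c": {},  # no count attribute: assumed equal to 1
}

# Analogue of a minimum count of 5.
kept_min = [name for name, attrs in records.items()
            if passes_count_filter(attrs, min_count=5)]

# Analogue of a maximum count of 1.
kept_max = [name for name, attrs in records.items()
            if passes_count_filter(attrs, max_count=1)]

print(kept_min, kept_max)
```

With a minimum of 5, records `a` and `b` are kept and `c` is discarded, since its missing count is read as 1. With a maximum of 1, only `c` is kept.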