12  Sequence sampling and filtering

12.1 obigrep – filters sequence files according to numerous conditions

The obigrep command is somewhat analogous to the standard Unix grep command. It selects a subset of sequence records from a sequence file. A sequence record is a complex object consisting of an identifier, a set of attributes (each attribute being a key, defined by its name, associated with a value), a definition, and the sequence itself. Instead of working line by line on text like the standard Unix tool, obigrep works sequence record by sequence record. A large number of options allow you to refine the selection on any element of the sequence record. obigrep allows you to specify multiple conditions simultaneously (each evaluating to TRUE or FALSE), and only those sequence records that meet all conditions (all conditions are TRUE) are selected.

obigrep is also able to work on two paired read files. The selection criteria apply to one or the other of the reads in each pair, depending on the mode chosen (--paired-mode option). In all cases the selection is applied in the same way to both files, thus maintaining their consistency.

12.1.1 The options usable with obigrep

12.1.1.1 Selecting sequences based on their characteristics

Sequences can be selected on several of their characteristics: their length, their id, or their sequence. Options allow you to specify the selection condition.

Selection based on the sequence

Sequence records can be selected according to whether or not they match a pattern. The simplest pattern is a short sequence (e.g. AACCTT), but regular expressions allow you to look for more complex patterns. For example, A[TG]C+G matches an A, followed by a T or a G, then one or several C, and finally a G.

--sequence|-s PATTERN

Regular expression pattern to be tested against the sequence itself. The pattern is case insensitive. A complete description of the regular expression grammar is available here.
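To get a feel for what such a pattern matches before running it on real data, it can be tried on a few toy strings with the standard grep tool. The sketch below is only an approximation: it assumes that the simple constructs used here (character classes, + and {n,} repetitions, ^ and $ anchors) behave the same way in grep's extended regular expressions as in the obigrep pattern grammar, and the function toy_sequences and its four sequences are hypothetical.

```shell
# Toy sequences, one per line (hypothetical data, not taken from a real file).
toy_sequences() {
  printf '%s\n' 'TTATCCGAA' 'aagcgtt' 'ATCA' 'ACCG'
}

# -E: extended regular expressions.
# -i: ignore case, to mimic the case-insensitive matching of --sequence.
# Prints only the lines containing an A, then a T or a G,
# then one or several C, then a G.
toy_sequences | grep -E -i 'A[TG]C+G'
```

Only the first two toy sequences are printed: the third one has no G after the C, and in the fourth one the A is not followed by a T or a G.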

Examples:

Selects only the sequence records that contain an EcoRI restriction site.

obigrep -s 'GAATTC' seq1.fasta > seq2.fasta

Selects only the sequence records that contain a stretch of at least 10 A.

obigrep -s 'A{10,}' seq1.fasta > seq2.fasta

Selects only the sequence records that do not contain ambiguous nucleotides.

obigrep -s '^[ACGT]+$' seq1.fasta > seq2.fasta

--min-count | -c COUNT

Only sequences representing at least COUNT reads are selected. This option relies on the count attribute. If the count attribute is not defined for a sequence record, it is assumed to be equal to 1.

--max-count | -C COUNT

Only sequences representing no more than COUNT reads are selected. This option relies on the count attribute. If the count attribute is not defined for a sequence record, it is assumed to be equal to 1.
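The effect of the default value can be illustrated with a small emulation written with standard tools. This is not how obigrep is implemented; it is a sketch that assumes a missing count attribute is treated as 1 and, purely for illustration, that the count is written as count=N; in the header line. The functions toy_fasta and min_count_filter are hypothetical names.

```shell
# Toy FASTA file; the "count=N;" header annotation is an assumed layout
# used for illustration only.
toy_fasta() {
  printf '>seq1 count=7;\nACGT\n>seq2\nACGA\n>seq3 count=2;\nTTGA\n'
}

# Keep the records whose count is at least $1.
# A record without a count attribute is given a count of 1 (assumed default).
min_count_filter() {
  awk -v min="$1" '
    /^>/ {
      count = 1
      if (match($0, /count=[0-9]+/))
        count = substr($0, RSTART + 6, RLENGTH - 6) + 0
      keep = (count >= min)
    }
    keep { print }
  '
}

# Emulates a minimum count of 5: only seq1 is kept.
toy_fasta | min_count_filter 5
```

With a threshold of 2, seq2 is still discarded because its missing count is taken as 1, whereas a threshold of 1 keeps all three records.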

Examples:

Selects only the sequence records representing at least five reads in the dataset.

obigrep -c 5 data_SPER01.fasta > data_norare_SPER01.fasta