12  Sequence sampling and filtering

12.1 obigrep – filters sequence files according to numerous conditions

The obigrep command is somewhat analogous to the standard Unix grep command. It selects a subset of sequence records from a sequence file. A sequence record is a complex object consisting of an identifier, a set of attributes (a key, defined by its name, associated with a value), a definition, and the sequence itself. Instead of working text line by text line like the standard Unix tool, obigrep selection is done sequence record by sequence record. A large number of options allow you to refine the selection on any element of the sequence. obigrep allows you to specify multiple conditions simultaneously (which take on the value TRUE or FALSE) and only those sequence records which meet all conditions (all conditions are TRUE) are selected. obigrep is able to work on two paired read files. The selection criteria apply to one or the other of the readings in each pair depending on the mode chosen (--paired-mode option). In all cases the selection is applied in the same way to both files, thus maintaining their consistency.

12.1.1 The options usable with obigrep

12.1.1.1 Selecting sequences based on their caracteristics

Sequences can be selected on several of their caracteristics, their length, their id, their sequence. Options allow for specifying the condition if selection.

--min-count | -c COUNT

only sequences reprensenting at least COUNT reads will be selected. That option rely on the count attribute. If the count attribute is not defined for a sequence record, it is assumed equal to \(1\).

--max-count | -C COUNT

only sequences reprensenting no more than COUNT reads will be selected. That option rely on the count attribute. If the count attribute is not defined for a sequence record, it is assumed equal to \(1\).

Example

Selecting sequence records representing at least five reads in the dataset.

obigrep -c 5 data_SPER01.fasta > data_norare_SPER01.fasta