12 Sequence sampling and filtering

12.1 `obigrep` – filters sequence files according to numerous conditions

The obigrep command is somewhat analogous to the standard Unix grep command. It selects a subset of sequence records from a sequence file. A sequence record is a complex object consisting of an identifier, a set of attributes (each a key, identified by its name, associated with a value), a definition, and the sequence itself. Whereas the standard Unix tool works line by line, obigrep works sequence record by sequence record. A large number of options allow you to refine the selection on any element of the record. obigrep lets you specify several conditions at once, each evaluating to TRUE or FALSE, and only the sequence records that meet all of them are selected.

obigrep can also work on two paired read files. The selection criteria apply to one or the other of the reads in each pair, depending on the mode chosen (--paired-mode option). In all cases the selection is applied in the same way to both files, thus maintaining their consistency.
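The record-by-record selection described above can be pictured with a small Python sketch. It is only an illustration of the logic (keep a record when every condition is TRUE), not the actual obigrep implementation: the `SeqRecord` class, the `select` function, and the `sample` attribute are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator


@dataclass
class SeqRecord:
    """Hypothetical in-memory model of a sequence record."""
    identifier: str
    sequence: str
    definition: str = ""
    attributes: dict = field(default_factory=dict)


Predicate = Callable[[SeqRecord], bool]


def select(records: Iterable[SeqRecord],
           predicates: Iterable[Predicate]) -> Iterator[SeqRecord]:
    """Yield only the records for which every predicate is True."""
    predicates = list(predicates)
    for record in records:
        if all(p(record) for p in predicates):
            yield record


records = [
    SeqRecord("seq1", "acgtacgtac", attributes={"sample": "A"}),
    SeqRecord("seq2", "acgt", attributes={"sample": "A"}),
    SeqRecord("seq3", "acgtacgtacgt", attributes={"sample": "B"}),
]

# Two illustrative conditions: one on the sequence, one on an attribute.
long_enough = lambda r: len(r.sequence) >= 10
from_sample_a = lambda r: r.attributes.get("sample") == "A"

selected_ids = [r.identifier
                for r in select(records, [long_enough, from_sample_a])]
print(selected_ids)
```

Here seq2 fails the length condition and seq3 fails the attribute condition, so only seq1 satisfies both and is kept.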

12.1.1 The options usable with `obigrep`

12.1.1.1 Selecting sequences based on their characteristics

Sequences can be selected on several of their characteristics: their length, their id, their sequence. Options allow you to specify the selection condition.

--min-count | -c COUNT: only sequences representing at least COUNT reads will be selected. This option relies on the count attribute. If the count attribute is not defined for a sequence record, it is assumed to be equal to 1.
--max-count | -C COUNT: only sequences representing no more than COUNT reads will be selected. This option relies on the count attribute. If the count attribute is not defined for a sequence record, it is assumed to be equal to 1.
Example: Selecting sequence records representing at least five reads in the dataset.

obigrep -c 5 data_SPER01.fasta > data_norare_SPER01.fasta
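The effect of the count thresholds, including the default of 1 for records without a count attribute, can be sketched in Python as follows. This is a minimal illustration of the rule stated above, assuming attributes are held in a plain dictionary; the function names `read_count` and `passes_count_filter` and the sample records are hypothetical and are not part of obigrep.

```python
from typing import Optional


def read_count(attributes: dict) -> int:
    """Return the count attribute, assumed equal to 1 when it is missing."""
    return int(attributes.get("count", 1))


def passes_count_filter(attributes: dict,
                        min_count: Optional[int] = None,
                        max_count: Optional[int] = None) -> bool:
    """Mimic the --min-count / --max-count conditions on one record."""
    count = read_count(attributes)
    if min_count is not None and count < min_count:
        return False
    if max_count is not None and count > max_count:
        return False
    return True


# Hypothetical records: identifier -> attributes.
records = {
    "seq1": {"count": 12},
    "seq2": {"count": 5},
    "seq3": {},  # no count attribute: treated as a single read
}

# Same condition as `-c 5` in the example above.
kept = [name for name, attrs in records.items()
        if passes_count_filter(attrs, min_count=5)]
print(kept)
```

With a minimum count of 5, seq3 is discarded because its missing count attribute is treated as 1, while seq1 and seq2 are kept.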
