mirror of
https://github.com/metabarcoding/obitools4.git
synced 2025-06-29 16:20:46 +00:00
230 lines
9.5 KiB
Plaintext
230 lines
9.5 KiB
Plaintext
|
# The OBITools
|
|||
|
|
|||
|
## Aims of *OBITools*
|
|||
|
|
|||
|
## File formats usable with *OBITools*
|
|||
|
|
|||
|
### The sequence files
|
|||
|
|
|||
|
Sequences can be stored following various format. OBITools knows some of
|
|||
|
them. The central formats for sequence files manipulated by OBITools
|
|||
|
scripts are the `fasta` and fastq format. OBITools extends the both
|
|||
|
these formats by specifying a syntax to include in the definition line
|
|||
|
data qualifying the sequence. All file formats use the `IUPAC` code for
|
|||
|
encoding nucleotides.
|
|||
|
|
|||
|
### The IUPAC Code
|
|||
|
|
|||
|
The International Union of Pure and Applied Chemistry (IUPAC\_) defined
|
|||
|
the standard code for representing protein or DNA sequences.
|
|||
|
|
|||
|
#### Nucleic IUPAC Code {#DNA-IUPAC}
|
|||
|
|
|||
|
| **Code** | **Nucleotide** |
|
|||
|
|----------|-----------------------------|
|
|||
|
| A | Adenine |
|
|||
|
| C | Cytosine |
|
|||
|
| G | Guanine |
|
|||
|
| T | Thymine |
|
|||
|
| U | Uracil |
|
|||
|
| R | Purine (A or G) |
|
|||
|
| Y | Pyrimidine (C, T, or U) |
|
|||
|
| M | C or A |
|
|||
|
| K | T, U, or G |
|
|||
|
| W | T, U, or A |
|
|||
|
| S | C or G |
|
|||
|
| B | C, T, U, or G (not A) |
|
|||
|
| D | A, T, U, or G (not C) |
|
|||
|
| H | A, T, U, or C (not G) |
|
|||
|
| V | A, C, or G (not T, not U) |
|
|||
|
| N | Any base (A, C, G, T, or U) |
|
|||
|
|
|||
|
### The *fasta* format {#classical-fasta}
|
|||
|
|
|||
|
The **fasta format** is certainly the most widely used sequence file
|
|||
|
format. This is certainly due to its great simplicity. It was originally
|
|||
|
created for the Lipman and Pearson [FASTA
|
|||
|
program](http://www.ncbi.nlm.nih.gov/pubmed/3162770?dopt=Citation).
|
|||
|
OBITools use in more of the classical :ref:`fasta` format an
|
|||
|
:ref:`extended version` of this format where structured data are
|
|||
|
included in the title line.
|
|||
|
|
|||
|
In *fasta* format a sequence is represented by a title line beginning
|
|||
|
with a **`>`** character and the sequences by itself following the
|
|||
|
:doc:`iupac` code. The sequence is usually split other severals lines of
|
|||
|
the same length (expect for the last one)
|
|||
|
|
|||
|
>my_sequence this is my pretty sequence
|
|||
|
ACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT
|
|||
|
GTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT
|
|||
|
AACGACGTTGCAGTACGTTGCAGT
|
|||
|
|
|||
|
This is no special format for the title line excepting that this line
|
|||
|
should be unique. Usually the first word following the **\>** character
|
|||
|
is considered as the sequence identifier. The end of the title line
|
|||
|
corresponding to a description of the sequence. Several sequences can be
|
|||
|
concatenated in a same file. The description of the next sequence is
|
|||
|
just pasted at the end of the record of the previous one
|
|||
|
|
|||
|
>sequence_A this is my first pretty sequence
|
|||
|
ACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT
|
|||
|
GTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT
|
|||
|
AACGACGTTGCAGTACGTTGCAGT
|
|||
|
>sequence_B this is my second pretty sequence
|
|||
|
ACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT
|
|||
|
GTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT
|
|||
|
AACGACGTTGCAGTACGTTGCAGT
|
|||
|
>sequence_C this is my third pretty sequence
|
|||
|
ACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT
|
|||
|
GTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT
|
|||
|
AACGACGTTGCAGTACGTTGCAGT
|
|||
|
|
|||
|
### The *fastq* sequence format {#classical-fastq}
|
|||
|
|
|||
|
.. note::
|
|||
|
|
|||
|
This article uses material from the Wikipedia article
|
|||
|
`FASTQ format `
|
|||
|
which is released under the
|
|||
|
`Creative Commons Attribution-Share-Alike License 3.0 `
|
|||
|
|
|||
|
**fastq format** is a text-based format for storing both a biological
|
|||
|
sequence (usually nucleotide sequence) and its corresponding quality
|
|||
|
scores. Both the sequence letter and quality score are encoded with a
|
|||
|
single ASCII character for brevity. It was originally developed at the
|
|||
|
`Wellcome Trust Sanger Institute` to bundle a [fasta](#classical-fasta)
|
|||
|
sequence and its quality data, but has recently become the *de facto*
|
|||
|
standard for storing the output of high throughput sequencing
|
|||
|
instruments such as the Illumina Genome Analyzer Illumina. [1]\_
|
|||
|
|
|||
|
#### Format
|
|||
|
|
|||
|
A fastq file normally uses four lines per sequence.
|
|||
|
|
|||
|
- Line 1 begins with a '\@' character and is followed by a sequence
|
|||
|
identifier and an *optional* description (like a :ref:`fasta` title
|
|||
|
line).
|
|||
|
- Line 2 is the raw sequence letters.
|
|||
|
- Line 3 begins with a '+' character and is *optionally* followed by
|
|||
|
the same sequence identifier (and any description) again.
|
|||
|
- Line 4 encodes the quality values for the sequence in Line 2, and
|
|||
|
must contain the same number of symbols as letters in the sequence.
|
|||
|
|
|||
|
A fastq file containing a single sequence might look like this:
|
|||
|
|
|||
|
@SEQ_ID
|
|||
|
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
|
|||
|
+
|
|||
|
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
|
|||
|
|
|||
|
The character '!' represents the lowest quality while '\~' is the
|
|||
|
highest. Here are the quality value characters in left-to-right
|
|||
|
increasing order of quality (`ASCII`):
|
|||
|
|
|||
|
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|
|||
|
|
|||
|
The original Sanger FASTQ files also allowed the sequence and quality
|
|||
|
strings to be wrapped (split over multiple lines), but this is generally
|
|||
|
discouraged as it can make parsing complicated due to the unfortunate
|
|||
|
choice of "\@" and "+" as markers (these characters can also occur in
|
|||
|
the quality string).
|
|||
|
|
|||
|
#### Variations
|
|||
|
|
|||
|
##### Quality
|
|||
|
|
|||
|
A quality value *Q* is an integer mapping of *p* (i.e., the probability
|
|||
|
that the corresponding base call is incorrect). Two different equations
|
|||
|
have been in use. The first is the standard Sanger variant to assess
|
|||
|
reliability of a base call, otherwise known as Phred quality score:
|
|||
|
|
|||
|
$$
|
|||
|
Q_\text{sanger} = -10 \, \log_{10} p
|
|||
|
$$
|
|||
|
|
|||
|
The Solexa pipeline (i.e., the software delivered with the Illumina
|
|||
|
Genome Analyzer) earlier used a different mapping, encoding the odds
|
|||
|
$\mathbf{p}/(1-\mathbf{p})$ instead of the probability $\mathbf{p}$:
|
|||
|
|
|||
|
$$
|
|||
|
Q_\text{solexa-prior to v.1.3} = -10 \, \log_{10} \frac{p}{1-p}
|
|||
|
$$
|
|||
|
|
|||
|
Although both mappings are asymptotically identical at higher quality
|
|||
|
values, they differ at lower quality levels (i.e., approximately
|
|||
|
$\mathbf{p} > 0.05$, or equivalently, $\mathbf{Q} < 13$).
|
|||
|
|
|||
|
\|Relationship between *Q* and *p* using the Sanger (red) and Solexa
|
|||
|
(black) equations (described above). The vertical dotted line indicates
|
|||
|
$\mathbf{p}= 0.05$, or equivalently, $Q = 13$.\|
|
|||
|
|
|||
|
#### Encoding
|
|||
|
|
|||
|
- Sanger format can encode a Phred quality score from 0 to 93 using
|
|||
|
ASCII 33 to 126 (although in raw read data the Phred quality score
|
|||
|
rarely exceeds 60, higher scores are possible in assemblies or read
|
|||
|
maps).
|
|||
|
- Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality
|
|||
|
score from -5 to 62 using ASCII 59 to 126 (although in raw read data
|
|||
|
Solexa scores from -5 to 40 only are expected)
|
|||
|
- Starting with Illumina 1.3 and before Illumina 1.8, the format
|
|||
|
encoded a Phred quality score from 0 to 62 using ASCII 64 to 126
|
|||
|
(although in raw read data Phred scores from 0 to 40 only are
|
|||
|
expected).
|
|||
|
- Starting in Illumina 1.5 and before Illumina 1.8, the Phred scores 0
|
|||
|
to 2 have a slightly different meaning. The values 0 and 1 are no
|
|||
|
longer used and the value 2, encoded by ASCII 66 "B".
|
|||
|
|
|||
|
Sequencing Control Software, Version 2.6, Catalog \# SY-960-2601, Part
|
|||
|
\# 15009921 Rev. A, November
|
|||
|
2009] [[http://watson.nci.nih.gov/solexa/Using_SCSv2.6_15009921_A.pdf\\\\](http://watson.nci.nih.gov/solexa/Using_SCSv2.6_15009921_A.pdf\){.uri}]([http://watson.nci.nih.gov/solexa/Using_SCSv2.6_15009921_A.pdf\\](http://watson.nci.nih.gov/solexa/Using_SCSv2.6_15009921_A.pdf)%7B.uri%7D){.uri}
|
|||
|
(page 30) states the following: *If a read ends with a segment of mostly
|
|||
|
low quality (Q15 or below), then all of the quality values in the
|
|||
|
segment are replaced with a value of 2 (encoded as the letter B in
|
|||
|
Illumina's text-based encoding of quality scores)... This Q2 indicator
|
|||
|
does not predict a specific error rate, but rather indicates that a
|
|||
|
specific final portion of the read should not be used in further
|
|||
|
analyses.* Also, the quality score encoded as "B" letter may occur
|
|||
|
internally within reads at least as late as pipeline version 1.6, as
|
|||
|
shown in the following example:
|
|||
|
|
|||
|
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
|
|||
|
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTTGAGATTTGTTGGGGGAGACATTTTTGTGATTGCCTTGAT
|
|||
|
+HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
|
|||
|
efcfffffcfeefffcffffffddf`feed]`]_Ba_^__[YBBBBBBBBBBRTT\]][]dddd`ddd^dddadd^BBBBBBBBBBBBBBBBBBBBBBBB
|
|||
|
|
|||
|
An alternative interpretation of this ASCII encoding has been proposed.
|
|||
|
Also, in Illumina runs using PhiX controls, the character 'B' was
|
|||
|
observed to represent an "unknown quality score". The error rate of 'B'
|
|||
|
reads was roughly 3 phred scores lower the mean observed score of a
|
|||
|
given run.
|
|||
|
|
|||
|
- Starting in Illumina 1.8, the quality scores have basically returned
|
|||
|
to the use of the Sanger format (Phred+33).
|
|||
|
|
|||
|
## File extension
|
|||
|
|
|||
|
There is no standard file extension for a FASTQ file, but .fq and
|
|||
|
.fastq, are commonly used.
|
|||
|
|
|||
|
## See also
|
|||
|
|
|||
|
- :ref:`fasta`
|
|||
|
|
|||
|
## References
|
|||
|
|
|||
|
.. [1] Cock et al (2009) The Sanger FASTQ file format for sequences with
|
|||
|
quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids
|
|||
|
Research,
|
|||
|
|
|||
|
.. [2] Illumina Quality Scores, Tobias Mann, Bioinformatics, San Diego,
|
|||
|
Illumina `1`\_\_
|
|||
|
|
|||
|
.. \|Relationship between *Q* and *p* using the Sanger (red) and Solexa
|
|||
|
(black) equations (described above). The vertical dotted line indicates
|
|||
|
*p* = 0.05, or equivalently, *Q* Å 13.\| image:: Probability metrics.png
|
|||
|
|
|||
|
See <http://en.wikipedia.org/wiki/FASTQ_format>
|
|||
|
|
|||
|
|