obitools4/doc/book/_freeze/formats/execute-results/html.json

{
"hash": "e80972301bc122bdde66c980a1b1cea3",
"result": {
"markdown": "# File formats usable with *OBITools*\n\n*OBITools* manipulate have to manipulate DNA sequence data and taxonomical data. They can use some supplentary metadata describing the experiment and produce some stats about the processed DNA data. All the manipulated data are stored in text files, following standard data format.\n\n## The DNA sequence data\n\nSequences can be stored following various format. *OBITools* knows some of them. The central formats for sequence files manipulated by *OBITools* scripts are the [*FASTA*](#sec-fasta) and [*FASTQ*](#sec-fastq) format. *OBITools* extends the both these formats by specifying a syntax to include in the definition line data qualifying the sequence. All file formats use the [`IUPAC`](#sec-iupac) code for encoding nucleotides.\n\nMoreover these two formats that can be used as input and output formats, *OBITools4* can read the following format :\n\n- [EBML flat file](https://ena-docs.readthedocs.io/en/latest/submit/fileprep/flat-file-example.html) format (use by ENA)\n- [Genbank flat file format](https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html)\n- [ecoPCR output files](https://pythonhosted.org/OBITools/scripts/ecoPCR.html)\n\n### The IUPAC Code {#sec-iupac}\n\nThe International Union of Pure and Applied Chemistry ([IUPAC]()) defined the standard code for representing protein or DNA sequences.\n\n| **Code** | **Nucleotide** |\n|----------|-----------------------------|\n| A | Adenine |\n| C | Cytosine |\n| G | Guanine |\n| T | Thymine |\n| U | Uracil |\n| R | Purine (A or G) |\n| Y | Pyrimidine (C, T, or U) |\n| M | C or A |\n| K | T, U, or G |\n| W | T, U, or A |\n| S | C or G |\n| B | C, T, U, or G (not A) |\n| D | A, T, U, or G (not C) |\n| H | A, T, U, or C (not G) |\n| V | A, C, or G (not T, not U) |\n| N | Any base (A, C, G, T, or U) |\n\n### The *FASTA* sequence format {#sec-fasta}\n\nThe [*FASTA*](#sec-fasta) format is certainly the most widely used sequence file format. 
This is certainly due to its great simplicity. It was originally created for the Lipman and Pearson [`FASTA` program](http://www.ncbi.nlm.nih.gov/pubmed/3162770?dopt=Citation). In addition to the classical [*FASTA*](#sec-fasta) format, *OBITools* use an `extended version` of this format in which structured data are included in the title line.\n\nIn [*FASTA*](#sec-fasta) format, a sequence is represented by a title line beginning with a **`>`** character, followed by the sequence itself written in the [`IUPAC`](#sec-iupac) code. The sequence is usually split over several lines of the same length (except for the last one).\n\n``` \n>my_sequence this is my pretty sequence\nACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT\nGTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT\nAACGACGTTGCAGTACGTTGCAGT\n```\n\nThere is no special format for the title line, except that this line should be unique. Usually the first word following the **\\>** character is considered to be the sequence identifier. The rest of the title line corresponds to a description of the sequence. Several sequences can be concatenated in the same file. The record of the next sequence is simply appended after the record of the previous one.\n\n``` \n>sequence_A this is my first pretty sequence\nACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT\nGTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT\nAACGACGTTGCAGTACGTTGCAGT\n>sequence_B this is my second pretty sequence\nACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT\nGTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT\nAACGACGTTGCAGTACGTTGCAGT\n>sequence_C this is my third pretty sequence\nACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT\nGTGCTGACGTTGCAGTAC
"supporting": [
"formats_files/figure-html"
],
"filters": [],
"includes": {}
}
}