"markdown":"#Fileformatsusablewith*OBITools*\n\n*OBITools*manipulatehavetomanipulateDNAsequencedataandtaxonomicaldata.TheycanusesomesupplentarymetadatadescribingtheexperimentandproducesomestatsabouttheprocessedDNAdata.Allthemanipulateddataarestoredintextfiles,followingstandarddataformat.\n\n##TheDNAsequencedata\n\nSequencescanbestoredfollowingvariousformat.*OBITools*knowssomeofthem.Thecentralformatsforsequencefilesmanipulatedby*OBITools*scriptsarethe[*FASTA*](#sec-fasta)and[*FASTQ*](#sec-fastq)format.*OBITools*extendstheboththeseformatsbyspecifyingasyntaxtoincludeinthedefinitionlinedataqualifyingthesequence.Allfileformatsusethe[`IUPAC`](#sec-iupac)codeforencodingnucleotides.\n\nMoreoverthesetwoformatsthatcanbeusedasinputandoutputformats,*OBITools4*canreadthefollowingformat:\n\n-[EBMLflatfile](https://ena-docs.readthedocs.io/en/latest/submit/fileprep/flat-file-example.html) format (use by ENA)\n- [Genbank flat file format](https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html)\n- [ecoPCR output files](https://pythonhosted.org/OBITools/scripts/ecoPCR.html)\n\n### The IUPAC Code {#sec-iupac}\n\nThe International Union of Pure and Applied Chemistry ([IUPAC]()) defined the standard code for representing protein or DNA sequences.\n\n| **Code** | **Nucleotide** |\n|----------|-----------------------------|\n| A | Adenine |\n| C | Cytosine |\n| G | Guanine |\n| T | Thymine |\n| U | Uracil |\n| R | Purine (A or G) |\n| Y | Pyrimidine (C, T, or U) |\n| M | C or A |\n| K | T, U, or G |\n| W | T, U, or A |\n| S | C or G |\n| B | C, T, U, or G (not A) |\n| D | A, T, U, or G (not C) |\n| H | A, T, U, or C (not G) |\n| V | A, C, or G (not T, not U) |\n| N | Any base (A, C, G, T, or U) |\n\n### The *FASTA* sequence format {#sec-fasta}\n\nThe [*FASTA*](#sec-fasta) format is certainly the most widely used sequence file format. This is certainly due to its great simplicity. 
It was originally created for the Lipman and Pearson [`FASTA` program](http://www.ncbi.nlm.nih.gov/pubmed/3162770?dopt=Citation). In addition to the classical [*FASTA*](#sec-fasta) format, *OBITools* use an `extended version` of this format in which structured data are included in the title line.\n\nIn [*FASTA*](#sec-fasta) format, a sequence is represented by a title line beginning with a **`>`** character, followed by the sequence itself written in the [`IUPAC`](#sec-iupac) code. The sequence is usually split over several lines of the same length (except for the last one).\n\n``` \n>my_sequence this is my pretty sequence\nACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT\nGTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT\nAACGACGTTGCAGTACGTTGCAGT\n```\n\nThere is no special format for the title line, except that this line should be unique. Usually the first word following the **\\>** character is considered to be the sequence identifier. The rest of the title line is a description of the sequence. Several sequences can be concatenated in the same file: the record of the next sequence is simply appended after the record of the previous one.\n\n``` \n>sequence_A this is my first pretty sequence\nACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT\nGTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT\nAACGACGTTGCAGTACGTTGCAGT\n>sequence_B this is my second pretty sequence\nACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT\nGTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT\nAACGACGTTGCAGTACGTTGCAGT\n>sequence_C this is my third pretty sequence\nACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT\nGTGCTGACGTTGCAGTAC