[{"id":0,"href":"/obidoc/docs/cookbook/illumina/","title":"Analysing an Illumina data set","section":"Cookbook","content":" The wolf diet tutorial # Here is a short tutorial for analyzing metabarcoding data, on an Illumina dataset from a wolf diet study, using the OBITools 4 and basic unix commands. It presents the following analysis steps:\nPairing (i.e. partial alignment) of forward and reverse reads Exclusion of unpaired reads Reads demultiplexing (i.e. assignment to their original sample) Reads dereplication Dataset denoising Sequence taxonomic assignment Exporting the results in a tabular format The dataset to analyze and the reference database # The dataset used in this tutorial corresponds to data obtained from the analysis of four wolf scats using the protocol published in ( Citation: Shehzad,\u0026#32;Riaz \u0026amp; al.,\u0026#32;2012 Shehzad,\u0026#32; W.,\u0026#32; Riaz,\u0026#32; T.,\u0026#32; Nawaz,\u0026#32; M.,\u0026#32; Miquel,\u0026#32; C.,\u0026#32; Poillot,\u0026#32; C.,\u0026#32; Shah,\u0026#32; S.,\u0026#32; Pompanon,\u0026#32; F.,\u0026#32; Coissac,\u0026#32; E.\u0026#32;\u0026amp;\u0026#32;Taberlet,\u0026#32; P. \u0026#32; (2012). \u0026#32;Carnivore diet analysis based on next-generation sequencing: application to the leopard cat (Prionailurus bengalensis) in Pakistan: LEOPARD CAT DIET. Molecular ecology,\u0026#32;21(8).\u0026#32;1951–1965. https://doi.org/10.1111/j.1365-294X.2011.05424.x ) for carnivore diet assessment. After extraction of DNA from feces, DNA amplification was performed using the Vert01 primers (TTAGATACCCCACTATGC and TAGAACAGGCTCCTCTAG amplifying the 12S-V5 region ( Citation: Riaz,\u0026#32;Shehzad \u0026amp; al.,\u0026#32;2011 Riaz,\u0026#32; T.,\u0026#32; Shehzad,\u0026#32; W.,\u0026#32; Viari,\u0026#32; A.,\u0026#32; Pompanon,\u0026#32; F.,\u0026#32; Taberlet,\u0026#32; P.\u0026#32;\u0026amp;\u0026#32;Coissac,\u0026#32; E. \u0026#32; (2011). 
\u0026#32;ecoPrimers: inference of new DNA barcode markers from whole genome sequence analysis. Nucleic acids research,\u0026#32;39(21).\u0026#32;e145. https://doi.org/10.1093/nar/gkr732 ) ), together with a wolf blocking oligonucleotide.\nAn archive containing all the files needed for the analysis can be downloaded by clicking here: wolf_diet_dataset\nThe downloaded archive can be extracted using the following unix command:\ntar zxvf wolf_diet_dataset.tgz It creates a directory named wolf_data, containing the following files:\nTwo fastq files generated by the sequencing of DNA extracted and amplified from four wolf feces using the Genome Analyzer IIx platform (Illumina) and the paired-end (2 x 108 bp) sequencing chemistry:\nwolf_F.fastq.gz with the forward sequences wolf_R.fastq.gz with the reverse sequences A CSV file for the read demultiplexing step, named wolf_diet_ngsfilter.csv. This file contains the primer and tag sequences used for each sample. The tags are short, specific sequences added to the 5' end of each primer to distinguish the different samples.\nA reference database in fasta format named db_v05_r117.fasta.gz, extracted from the EMBL release 117 following the procedure indicated in the tutorial build a reference database.\nWe recommend creating a new folder to store the results and keep them separate from the raw data:\nmkdir results Recover full length sequences from forward and reverse reads # When working with paired-end sequencing data in which the forward and reverse reads are expected to overlap, the first step is to assemble them in order to recover the corresponding full length sequence.\nThe forward and reverse reads of the same fragment are located at the same line position in both fastq files. These two files are used as inputs by the obipairing program to assemble the forward and reverse reads. 
This program then returns the reconstructed sequence as output:\nobipairing --min-identity=0.8 \\ --min-overlap=10 \\ -F wolf_data/wolf_F.fastq.gz \\ -R wolf_data/wolf_R.fastq.gz \\ \u0026gt; results/wolf.fastq The --min-identity and --min-overlap options set the thresholds below which an alignment is considered to be of low quality. In the example command above, a low quality alignment corresponds to paired-end reads overlapping over less than 10 base pairs, or to paired-end reads whose alignment exhibits less than 80% identity. Paired-end reads producing such low quality alignments are not assembled: they are returned concatenated, with the attribute \u0026quot;mode\u0026quot;:\u0026quot;join\u0026quot;. Paired-end reads whose alignment passes both thresholds are assembled and the result is returned with the attribute \u0026quot;mode\u0026quot;:\u0026quot;alignment\u0026quot;. For more information, please refer to the documentation of the obipairing command.\nThe output of the above procedure can be rapidly checked by looking at the first sequence record of results/wolf.fastq. 
This can be done with the unix command:\nhead -n 4 results/wolf.fastq @HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 {\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:62,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_fast_count\u0026#34;:53,\u0026#34;pairing_fast_overlap\u0026#34;:62,\u0026#34;pairing_fast_score\u0026#34;:0.898,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(T:26)-\u0026gt;(G:13)\u0026#34;:62,\u0026#34;(T:34)-\u0026gt;(G:18)\u0026#34;:48},\u0026#34;score\u0026#34;:1826,\u0026#34;score_norm\u0026#34;:0.968,\u0026#34;seq_a_single\u0026#34;:46,\u0026#34;seq_ab_match\u0026#34;:60,\u0026#34;seq_b_single\u0026#34;:46} ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC\u0026lt;CcDccbe[`F`accXV=TA\\RYU\\\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC The -n 4 option of the head command indicates to print only the first four lines of the file, i.e. to print only the first sequence record (each sequence record in the fastq format is stored on four lines).\nExclude unpaired reads # Sequences corresponding to unpaired reads exhibit an attribute \u0026quot;mode\u0026quot;:\u0026quot;join\u0026quot; and cannot be reliably used in downstream analyses. 
They can be removed from the dataset using the obigrep command, as follows:\nobigrep -p \u0026#39;annotations.mode != \u0026#34;join\u0026#34;\u0026#39; \\ results/wolf.fastq \u0026gt; results/wolf_assembled.fastq The -p option requires an OBITools4 expression, here annotations.mode != \u0026quot;join\u0026quot;, which means that if the value of the mode annotation of a sequence is different from join, then the corresponding sequence record is kept in the output.\nAssign each sequence record to the corresponding sample and marker combination # Each sequence record is assigned to its corresponding sample and marker using the information provided in the file wolf_diet_ngsfilter.csv. This file, which is in CSV format, exemplifies the type of information necessary for the obimultiplex program to run.\n📄 wolf_diet_ngsfilter.csv @param,matching,strict @param,primer_mismatches,2 @param,indels,false experiment,sample,sample_tag,forward_primer,reverse_primer wolf_diet,13a_F730603,aattaac,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG wolf_diet,15a_F730814,gaagtag,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG wolf_diet,26a_F040644,gaatatc,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG wolf_diet,29a_F260619,gcctcct,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG The file must contain at least the five columns below. The order of the columns does not matter.\nexperiment: the name/identifier of the experiment/project (several experiments/projects can be included in the same file) sample: the name/identifier of the sample or of the PCR sample_tag: the sequences of the tags (e.g. aattaac if the same tag has been used on both ends of the PCR products, or aattaac:gaagtag if two different tags were used) forward_primer: the sequence of the forward primer reverse_primer: the sequence of the reverse primer Other information can be added as extra columns (e.g. 
position of the sample/PCR in the PCR plate, type of sample or control, etc.)\nSome extra lines can be added at the top of this file. Each of them starts with @param. Here three parameters have been provided:\n@param,matching,strict: The match between the sequence of the tags in the file wolf_diet_ngsfilter.csv and their corresponding sequences in the sequencing data should be strict, without any mismatches. @param,primer_mismatches,2: The match between the primers and their corresponding sequences in the sequencing data can exhibit at most two mismatches. @param,indels,false: The mismatches between the primers and their corresponding sequences in the sequencing data cannot be insertions or deletions, but only substitutions. See obimultiplex for more details.\nobimultiplex -s wolf_data/wolf_diet_ngsfilter.csv \\ -u results/unidentified.fastq \\ results/wolf_assembled.fastq \\ \u0026gt; results/wolf_assembled_assigned.fastq The obimultiplex command above creates two files:\nunidentified.fastq containing the sequence records that failed to be assigned to a sample/marker combination wolf_assembled_assigned.fastq containing the sequence records that were properly assigned to a sample/marker combination Note that each sequence record of the wolf_assembled_assigned.fastq file contains only the barcode sequence, as the sequences of primers and tags are removed by the obimultiplex program. 
Information concerning the experiment, sample, primers and tags is added as attributes in the sequence header.\nFor example, the first sequence record of wolf_assembled_assigned.fastq is:\n@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_sub[28..127] {\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:62,\u0026#34;experiment\u0026#34;:\u0026#34;wolf_diet\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;obimultiplex_amplicon_rank\u0026#34;:\u0026#34;1/1\u0026#34;,\u0026#34;obimultiplex_direction\u0026#34;:\u0026#34;forward\u0026#34;,\u0026#34;obimultiplex_forward_error\u0026#34;:0,\u0026#34;obimultiplex_forward_match\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_forward_primer\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_proposed_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_forward_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_forward_tag_dist\u0026#34;:0,\u0026#34;obimultiplex_reverse_error\u0026#34;:0,\u0026#34;obimultiplex_reverse_match\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;obimultiplex_reverse_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_reverse_primer\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;obimultiplex_reverse_proposed_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_reverse_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_reverse_tag_dist\u0026#34;:0,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(T:26)-\u0026gt;(G:13)\u0026#34;:35,\u0026#34;(T:34)-\u0026gt;(G:18)\u0026#34;:21},\u0026#34;paring_fast_count\u0026#34;:53,\u0026#34;paring_fast_overlap\u0026#34;:62,\u0026#34;paring_fast_score\u0026#34;:0.898,\u0026#34;sample\u0026#34;:\u0026#34;29a_F260619\u0026#34;,\u0026#34;score\u0026#34;:1826,\u0026#34;score_norm\u0026#34;:0.968,\u00
26#34;seq_a_single\u0026#34;:46,\u0026#34;seq_ab_match\u0026#34;:60,\u0026#34;seq_b_single\u0026#34;:46} ttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt + CCCBCCCCCBCCCCCCC\u0026lt;CcDccbe[`F`accXV=TA\\RYU\\\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCC The sample to which the sequence above belongs is given by the attribute \u0026quot;sample\u0026quot;:\u0026quot;29a_F260619\u0026quot;. The other added attributes describe how the tags and primers matched the sequence.\nReads dereplication # A DNA metabarcoding experiment inherently yields the same DNA sequence several times (i.e. replicated reads). Such redundancy can be reduced by processing unique sequences instead of reads, so as to reduce both file size and computation time, as well as to obtain more interpretable results. Dereplicating replicated reads into unique sequences can be done with the obiuniq command.\nThe program performs a pairwise comparison of all reads in the dataset. For reads that are strictly identical, only one representative sequence is kept while its frequency in the dataset is saved in the count attribute.\nIn the command below, we use the obiuniq command with the -m sample option to also store the frequency of the sequence in each sample. 
The program returns a fasta file.\nobiuniq -m sample \\ results/wolf_assembled_assigned.fastq \\ \u0026gt; results/wolf_assembled_assigned_uniq.fasta The first sequence record of the output, wolf_assembled_assigned_uniq.fasta is:\n\u0026gt;HELIUM_000100422_612GNAAXX:7:99:12017:19418#0/1_sub[28..127] {\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:62,\u0026#34;count\u0026#34;:1,\u0026#34;experiment\u0026#34;:\u0026#34;wolf_diet\u0026#34;,\u0026#34;merged_sample\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:1},\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;obimultiplex_amplicon_rank\u0026#34;:\u0026#34;1/1\u0026#34;,\u0026#34;obimultiplex_direction\u0026#34;:\u0026#34;forward\u0026#34;,\u0026#34;obimultiplex_forward_error\u0026#34;:0,\u0026#34;obimultiplex_forward_match\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_forward_primer\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_proposed_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_forward_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_forward_tag_dist\u0026#34;:0,\u0026#34;obimultiplex_reverse_error\u0026#34;:0,\u0026#34;obimultiplex_reverse_match\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;obimultiplex_reverse_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_reverse_primer\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;obimultiplex_reverse_proposed_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_reverse_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_reverse_tag_dist\u0026#34;:0,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(A:02)-\u0026gt;(C:07)\u0026#34;:54,\u0026#34;(A:02)-\u0026gt;(G:17)\u0026#34;:59,\u0026#34;(C:02)-\u0026gt;(G:10)\u0026#34;:42},\u0026#34;paring_fast_count\u0026#34;:43,\u0026#34;paring_fast_o
verlap\u0026#34;:62,\u0026#34;paring_fast_score\u0026#34;:0.729,\u0026#34;sample\u0026#34;:\u0026#34;29a_F260619\u0026#34;,\u0026#34;score\u0026#34;:567,\u0026#34;score_norm\u0026#34;:0.935,\u0026#34;seq_a_single\u0026#34;:46,\u0026#34;seq_ab_match\u0026#34;:58,\u0026#34;seq_b_single\u0026#34;:46} ttagccctaaacacaagtaattaatataacaaaattattcggcagagtactaccggcagt agcttaaaactcaaaggacttggcggtgctttatacccct The obiuniq command has added two key:value entries in the sequences attributes:\n\u0026quot;merged_sample\u0026quot;:{\u0026quot;29a_F260619\u0026quot;:1}: means that this sequence has been found once, in a single sample called \u0026ldquo;29a_F260619\u0026rdquo;. \u0026quot;count\u0026quot;:1 : represents the number of times, i.e. 1, this sequence has been found in the whole dataset. To keep only these two attributes in the sequence definition, we can use the obiannotate command:\nobiannotate -k count -k merged_sample \\ results/wolf_assembled_assigned_uniq.fasta \\ \u0026gt; results/wolf_assembled_assigned_simple.fasta The first five sequence records of the result, wolf_assembled_assigned_simple.fasta, become:\n\u0026gt;HELIUM_000100422_612GNAAXX:7:99:12017:19418#0/1_sub[28..127] {\u0026#34;count\u0026#34;:1,\u0026#34;merged_sample\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:1}} ttagccctaaacacaagtaattaatataacaaaattattcggcagagtactaccggcagt agcttaaaactcaaaggacttggcggtgctttatacccct \u0026gt;HELIUM_000100422_612GNAAXX:7:56:19300:10949#0/1_sub[28..127] {\u0026#34;count\u0026#34;:37,\u0026#34;merged_sample\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:37}} ttagccctaaacacaagtaattaatataacaaaattgttcaccagagtactagcggcaac agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;HELIUM_000100422_612GNAAXX:7:117:10934:7472#0/1_sub[28..127] {\u0026#34;count\u0026#34;:1,\u0026#34;merged_sample\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:1}} ttagccctaaacacaagtaattattataacaaaattattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttatacccgt 
\u0026gt;HELIUM_000100422_612GNAAXX:7:28:9432:2506#0/1_sub[28..127] {\u0026#34;count\u0026#34;:4,\u0026#34;merged_sample\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:4}} ccagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;HELIUM_000100422_612GNAAXX:7:94:11447:14902#0/1_sub[28..127] {\u0026#34;count\u0026#34;:1,\u0026#34;merged_sample\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:1}} ttagccctaaacacaagtaattagtataacaaaattattccccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt Dataset denoising # Having all sequences assigned to their respective samples does not mean that all these sequences are biologically meaningful. Some of these sequences can correspond to PCR/sequencing errors, or chimeras.\nFlagging PCR errors # The obiclean program flags sequence variants as:\npotential errors generated during PCR amplification (flagged as internal sequences), genuine sequences (flagged as head), or singleton sequences, i.e. sequences for which the program could not identify a variant. In the example below, a sequence is considered as a variant of another one if:\nboth occur in the same sample (-s sample), there is only a single difference between both sequences (substitution, insertion, or deletion), the abundance of the variant is less than 5% of the abundance of the main sequence (-r 0.05 option). We ask obiclean to keep only the sequences that are considered as genuine (head or singleton) in at least one sample (-H option). See the obiclean documentation for details. 
obiclean -s sample -r 0.05 --detect-chimera -H \\ results/wolf_assembled_assigned_simple.fasta \\ \u0026gt; results/wolf_assembled_assigned_simple_clean.fasta Below is an example of a sequence record of wolf_assembled_assigned_simple_clean.fasta:\n\u0026gt;HELIUM_000100422_612GNAAXX:7:3:3008:16359#0/1_sub[28..127] {\u0026#34;count\u0026#34;:1,\u0026#34;merged_sample\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:1},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:0,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:1,\u0026#34;obiclean_singletoncount\u0026#34;:1,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:\u0026#34;s\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:1}} ttagccctaaacacaagtaattaatataacaaaattattcgacagagtaccaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt The attribute \u0026quot;obiclean_head\u0026quot;:true indicates that this sequence record is considered as a head, hence a genuine sequence, but the attribute \u0026quot;obiclean_status\u0026quot;:{\u0026quot;29a_F260619\u0026quot;:\u0026quot;s\u0026quot;} also indicates that this sequence is actually a \u0026ldquo;singleton\u0026rdquo; sequence.\nGetting some statistics on the dataset size # It is good practice to monitor the effect of each filtering step on the dataset characteristics. Basic statistics can be obtained with the obicount command. This command counts the number of sequence variants, of reads and of symbols (i.e. nucleotides) in the dataset. 
The output is a CSV file with two columns: the first one being the type of entity/statistic and the second one its corresponding count in the whole dataset.\nobicount results/wolf_assembled_assigned_simple_clean.fasta entities,n variants,715 reads,33762 symbols,70775 Being in CSV format, the result can be easily read by many tools, such as the csvlook command-line tool from the csvkit package, which pretty-prints it in a more readable way:\nobicount results/wolf_assembled_assigned_simple_clean.fasta \\ | csvlook | entities | n | | -------- | ------ | | variants | 715 | | reads | 33 762 | | symbols | 70 775 | At this stage of the analysis, the wolf_assembled_assigned_simple_clean.fasta file contains 715 sequence variants corresponding to 33762 sequencing reads. We expect many of these variants to occur only once in the whole data set, i.e. to be singletons. Using the obigrep command, we can see how many singletons there are:\nobigrep -p \u0026#39;sequence.Count() == 1\u0026#39; \\ results/wolf_assembled_assigned_simple_clean.fasta |\\ obicount | csvlook | entities | n | | -------- | ------ | | variants | 604 | | reads | 604 | | symbols | 60 101 | To understand the obigrep command, you need to know more about the -p option. This option allows you to specify a predicate function to be applied to each sequence in the dataset. If the function returns True, the sequence is included in the output; if it returns False, it is excluded. In this case, we use a predicate that checks whether the count of the sequence (which is what sequence.Count() gives us) is equal to 1. In our data set, there are 604 singletons, i.e. 604 variants each supported by a single read. These singleton sequences are more likely to be errors than genuine sequences, and it is common practice to exclude them from the dataset. 
The obigrep command below keeps only sequences that occur at least twice in the data set.\nobigrep -c 2 \\ results/wolf_assembled_assigned_simple_clean.fasta \\ \u0026gt; results/wolf_assembled_no_singleton.fasta We can also get insights into the distribution of the sequences across samples with obisummary . This command provides a summary of the dataset including the number of sequencing reads, sequence variants and singletons occurring in each sample. Here a singleton has to be interpreted as a sequence variant occurring only once in the sample.\nobisummary --yaml results/wolf_assembled_no_singleton.fasta annotations: keys: map: merged_sample: 111 obiclean_mutation: 5 obiclean_status: 111 obiclean_weight: 111 scalar: count: 111 obiclean_head: 111 obiclean_headcount: 111 obiclean_internalcount: 111 obiclean_samplecount: 111 obiclean_singletoncount: 111 map_attributes: 4 scalar_attributes: 6 vector_attributes: 0 count: reads: 33158 total_length: 10674 variants: 111 samples: sample_count: 4 sample_stats: 13a_F730603: obiclean_bad: 0 reads: 7318 singletons: 1 variants: 22 15a_F730814: obiclean_bad: 0 reads: 7503 singletons: 5 variants: 18 26a_F040644: obiclean_bad: 0 reads: 10963 singletons: 1 variants: 49 29a_F260619: obiclean_bad: 0 reads: 7374 singletons: 7 variants: 36 In this example, the sample 29a_F260619 produced 7374 reads that are distributed over 36 sequence variants. Amongst these variants, 7 occur only once in that sample, i.e. are singletons.\nIn a diet analysis, as in many other DNA metabarcoding applications, we are often not interested in sequences that represent less than one percent of the diet. Taking roughly 7000 reads as the size of a sample (see the summary above), this means that we can filter out any sequence whose count in the dataset is below one percent of 7000, i.e. 
less than 70 times.\nTo get an idea of the effect of this filtering, we can run the following command to plot the distribution of the count attribute in the data set:\nobicsv -k count results/wolf_assembled_no_singleton.fasta \\ | tail -n +2 \\ | sort -n \\ | uniq -c \\ | awk \u0026#39;{print $2,$1}\u0026#39; \\ | uplot -d \u0026#39; \u0026#39; barplot ┌ ┐ 2 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 43.0 3 ┤■■■■■■■■ 10.0 4 ┤■■■■■■ 8.0 5 ┤■■■■■■■ 9.0 6 ┤■■■■ 5.0 7 ┤■■■ 4.0 8 ┤■■ 2.0 9 ┤■■ 2.0 10 ┤■■ 2.0 11 ┤■ 1.0 12 ┤■■ 2.0 13 ┤■■ 2.0 14 ┤■ 1.0 15 ┤■ 1.0 16 ┤■ 1.0 17 ┤■ 1.0 19 ┤■ 1.0 20 ┤■ 1.0 22 ┤■ 1.0 26 ┤■ 1.0 37 ┤■ 1.0 38 ┤■ 1.0 43 ┤■ 1.0 69 ┤■ 1.0 87 ┤■ 1.0 95 ┤■ 1.0 260 ┤■ 1.0 319 ┤■ 1.0 366 ┤■ 1.0 2007 ┤■ 1.0 7146 ┤■ 1.0 10172 ┤■ 1.0 12004 ┤■ 1.0 └ ┘ The y-axis represents the \u0026lsquo;count\u0026rsquo; attribute, which is the number of occurrences of a sequence in the dataset. The x-axis represents the number of sequences that occur that many times. For example, 43 sequences occur twice in the data set.\nIn this sequence abundance distribution, we can see that with a 1% filter, we will only keep 9 sequence variants, i.e. those that occur at least 87 times in the entire dataset.\nobigrep -c 70 \\ results/wolf_assembled_no_singleton.fasta \\ \u0026gt; results/wolf_assembled_1_percent.fasta obicount results/wolf_assembled_1_percent.fasta \\ | csvlook | entities | n | | -------- | ------ | | variants | 9 | | reads | 32 456 | | symbols | 800 | Another criterion commonly used to filter out sequences relies on their length. We know the expected length of the marker, as well as that of the sequences in our dataset. Therefore, we can define the sequences that are too long or too short as potential errors. 
Inspired by the command above, we can build another command plotting the distribution of sequence lengths in the dataset:\nobiannotate --length \\ results/wolf_assembled_1_percent.fasta \\ | obicsv -k seq_length \\ | uplot -H hist -n 20 seq_length ┌ ┐ [ 0.0, 5.0) ┤▇▇▇▇▇▇▇▇▇ 1 [ 5.0, 10.0) ┤ 0 [ 10.0, 15.0) ┤ 0 [ 15.0, 20.0) ┤ 0 [ 20.0, 25.0) ┤ 0 [ 25.0, 30.0) ┤ 0 [ 30.0, 35.0) ┤ 0 [ 35.0, 40.0) ┤ 0 [ 40.0, 45.0) ┤ 0 [ 45.0, 50.0) ┤ 0 [ 50.0, 55.0) ┤ 0 [ 55.0, 60.0) ┤ 0 [ 60.0, 65.0) ┤ 0 [ 65.0, 70.0) ┤ 0 [ 70.0, 75.0) ┤ 0 [ 75.0, 80.0) ┤ 0 [ 80.0, 85.0) ┤ 0 [ 85.0, 90.0) ┤ 0 [ 90.0, 95.0) ┤ 0 [ 95.0, 100.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4 [100.0, 105.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4 └ ┘ Frequency The DNA marker amplified here, i.e. the v5 region of the mitochondrial 12S rRNA gene, should be about 100 bp long. Here, one sequence is very short (\u0026lt;5 bp). We can filter this sequence out with obigrep :\nobigrep -l 50 \\ results/wolf_assembled_1_percent.fasta \\ \u0026gt; results/wolf_assembled_no_short.fasta To check the effectiveness of your filtering command, you can plot the distribution of sequence lengths in the new file wolf_assembled_no_short.fasta:\nobiannotate --length \\ results/wolf_assembled_no_short.fasta \\ | obicsv -k seq_length \\ | uplot -H hist seq_length ┌ ┐ [ 99.0, 99.5) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4 [ 99.5, 100.0) ┤ 0 [100.0, 100.5) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4 └ ┘ Frequency 📄 wolf_assembled_no_short.fasta \u0026gt;HELIUM_000100422_612GNAAXX:7:118:3572:14633#0/1_sub[28..126] 
{\u0026#34;count\u0026#34;:10172,\u0026#34;merged_sample\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:10172},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:1,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:1,\u0026#34;obiclean_singletoncount\u0026#34;:0,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:12205}} ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaaca gcctgaaactcaaaggacttggcggtgctttacatccct \u0026gt;HELIUM_000100422_612GNAAXX:7:99:9351:13090#0/1_sub[28..127] {\u0026#34;count\u0026#34;:260,\u0026#34;merged_sample\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:260},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:1,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:1,\u0026#34;obiclean_singletoncount\u0026#34;:0,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:337}} ttagccctaaacacaaataattacacaaacaaaattgttcaccagagtactagcggcaac agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;HELIUM_000100422_612GNAAXX:7:108:10111:9078#0/1_sub[28..127] {\u0026#34;count\u0026#34;:7146,\u0026#34;merged_sample\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:7146},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:1,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:1,\u0026#34;obiclean_singletoncount\u0026#34;:0,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:8039}} ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;HELIUM_000100422_612GNAAXX:7:38:14204:12725#0/1_sub[28..126] 
{\u0026#34;count\u0026#34;:87,\u0026#34;merged_sample\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:87},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:1,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:1,\u0026#34;obiclean_singletoncount\u0026#34;:0,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:202}} ttagccctaaacataaacattcaataaacaagaatgttcgccagaggactactagcaata gcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;HELIUM_000100422_612GNAAXX:7:30:9942:4495#0/1_sub[28..126] {\u0026#34;count\u0026#34;:95,\u0026#34;merged_sample\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:11,\u0026#34;29a_F260619\u0026#34;:84},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:1,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:2,\u0026#34;obiclean_singletoncount\u0026#34;:1,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:\u0026#34;s\u0026#34;,\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:12,\u0026#34;29a_F260619\u0026#34;:105}} ttagccctaaacataagctattccataacaaaataattcgccagagaactactagcaaca gattaaacctcaaaggacttggcagtgctttatacccct \u0026gt;HELIUM_000100422_612GNAAXX:7:51:16702:19393#0/1_sub[28..127] 
{\u0026#34;count\u0026#34;:12004,\u0026#34;merged_sample\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:7465,\u0026#34;29a_F260619\u0026#34;:4539},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:2,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:2,\u0026#34;obiclean_singletoncount\u0026#34;:0,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:\u0026#34;h\u0026#34;,\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:8822,\u0026#34;29a_F260619\u0026#34;:5789}} ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;HELIUM_000100422_612GNAAXX:7:84:14502:1617#0/1_sub[28..127] {\u0026#34;count\u0026#34;:319,\u0026#34;merged_sample\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:319},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:1,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:1,\u0026#34;obiclean_singletoncount\u0026#34;:0,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:376}} ttagccctaaacacaagtaattattataacaaaattattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;HELIUM_000100422_612GNAAXX:7:50:10637:6527#0/1_sub[28..126] 
{\u0026#34;count\u0026#34;:366,\u0026#34;merged_sample\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:13,\u0026#34;15a_F730814\u0026#34;:5,\u0026#34;26a_F040644\u0026#34;:347,\u0026#34;29a_F260619\u0026#34;:1},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:1,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:4,\u0026#34;obiclean_singletoncount\u0026#34;:3,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:\u0026#34;s\u0026#34;,\u0026#34;15a_F730814\u0026#34;:\u0026#34;s\u0026#34;,\u0026#34;26a_F040644\u0026#34;:\u0026#34;h\u0026#34;,\u0026#34;29a_F260619\u0026#34;:\u0026#34;s\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:17,\u0026#34;15a_F730814\u0026#34;:5,\u0026#34;26a_F040644\u0026#34;:468,\u0026#34;29a_F260619\u0026#34;:1}} ttagccctaaacatagataattttacaacaaaataattcgccagaggactactagcaata gcttaaaactcaaaggacttggcggtgctttatatccct Sequences taxonomic assignment # Once the dataset is curated, the next step in a classical diet metabarcoding analysis is to assign a taxon name (species, genus, etc.) to each barcode, in order to retrieve the list of taxa detected in each sample.\nThe taxonomic assignment of sequences requires a reference database against which the taxa present in the samples can be identified. It is provided in this tutorial as db_v05_r117.fasta.gz (see the tutorial build a reference database to learn how to obtain this reference database). The taxonomic annotation is then based on a comparison of the metabarcoding sequences against a pool of reference sequences. This operation is done with the obitag program.\nDownloading of the taxonomy # The obitag program requires access to the full taxonomy in order to compute its inferences. The NCBI taxonomy is complete and available online. 
It is possible to download a copy of it with the following command:\nobitaxonomy --download-ncbi --out ncbitaxo.tgz The full copy of the NCBI taxonomy is now locally stored in the ncbitaxo.tgz file of your current working directory.\nAssigning taxa to the sequences # Using the reference database db_v05_r117.fasta.gz and the full NCBI taxonomy, assigning taxa to the sequences can be done with obitag as follows:\nobitag -t ncbitaxo.tgz \\ -R wolf_data/db_v05_r117.fasta.gz \\ results/wolf_assembled_no_short.fasta \\ \u0026gt; results/wolf_assembled_taxo.fasta The resulting file, containing only a few sequences in this tutorial, looks like this:\n📄 wolf_assembled_taxo.fasta \u0026gt;HELIUM_000100422_612GNAAXX:7:118:3572:14633#0/1_sub[28..126] {\u0026#34;count\u0026#34;:10172,\u0026#34;merged_sample\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:10172},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:1,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:1,\u0026#34;obiclean_singletoncount\u0026#34;:0,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:12205},\u0026#34;obitag_bestid\u0026#34;:0.9797979797979798,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AY227529\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9992 [Marmota]@genus\u0026#34;} ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaaca gcctgaaactcaaaggacttggcggtgctttacatccct \u0026gt;HELIUM_000100422_612GNAAXX:7:99:9351:13090#0/1_sub[28..127] 
{\u0026#34;count\u0026#34;:260,\u0026#34;merged_sample\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:260},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:1,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:1,\u0026#34;obiclean_singletoncount\u0026#34;:0,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:337},\u0026#34;obitag_bestid\u0026#34;:0.9405940594059405,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AF154263\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:9,\u0026#34;obitag_rank\u0026#34;:\u0026#34;infraorder\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:35500 [Pecora]@infraorder\u0026#34;} ttagccctaaacacaaataattacacaaacaaaattgttcaccagagtactagcggcaac agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;HELIUM_000100422_612GNAAXX:7:108:10111:9078#0/1_sub[28..127] {\u0026#34;count\u0026#34;:7146,\u0026#34;merged_sample\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:7146},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:1,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:1,\u0026#34;obiclean_singletoncount\u0026#34;:0,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:8039},\u0026#34;obitag_bestid\u0026#34;:1,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AB245427\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;species\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9860 [Cervus elaphus]@species\u0026#34;} ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt 
\u0026gt;HELIUM_000100422_612GNAAXX:7:38:14204:12725#0/1_sub[28..126] {\u0026#34;count\u0026#34;:87,\u0026#34;merged_sample\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:87},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:1,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:1,\u0026#34;obiclean_singletoncount\u0026#34;:0,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:202},\u0026#34;obitag_bestid\u0026#34;:0.9494949494949495,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AY227530\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:2,\u0026#34;obitag_rank\u0026#34;:\u0026#34;tribe\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:337730 [Marmotini]@tribe\u0026#34;} ttagccctaaacataaacattcaataaacaagaatgttcgccagaggactactagcaata gcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;HELIUM_000100422_612GNAAXX:7:30:9942:4495#0/1_sub[28..126] 
{\u0026#34;count\u0026#34;:95,\u0026#34;merged_sample\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:11,\u0026#34;29a_F260619\u0026#34;:84},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:1,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:2,\u0026#34;obiclean_singletoncount\u0026#34;:1,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:\u0026#34;s\u0026#34;,\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:12,\u0026#34;29a_F260619\u0026#34;:105},\u0026#34;obitag_bestid\u0026#34;:0.9595959595959596,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AC187326\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;subspecies\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9615 [Canis lupus familiaris]@subspecies\u0026#34;} ttagccctaaacataagctattccataacaaaataattcgccagagaactactagcaaca gattaaacctcaaaggacttggcagtgctttatacccct \u0026gt;HELIUM_000100422_612GNAAXX:7:51:16702:19393#0/1_sub[28..127] 
{\u0026#34;count\u0026#34;:12004,\u0026#34;merged_sample\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:7465,\u0026#34;29a_F260619\u0026#34;:4539},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:2,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:2,\u0026#34;obiclean_singletoncount\u0026#34;:0,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:\u0026#34;h\u0026#34;,\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:8822,\u0026#34;29a_F260619\u0026#34;:5789},\u0026#34;obitag_bestid\u0026#34;:1,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AJ885202\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;species\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9858 [Capreolus capreolus]@species\u0026#34;} ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;HELIUM_000100422_612GNAAXX:7:84:14502:1617#0/1_sub[28..127] {\u0026#34;count\u0026#34;:319,\u0026#34;merged_sample\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:319},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:1,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:1,\u0026#34;obiclean_singletoncount\u0026#34;:0,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:376},\u0026#34;obitag_bestid\u0026#34;:1,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AJ972683\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;species\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9858 [Capreolus capreolus]@species\u0026#34;} 
ttagccctaaacacaagtaattattataacaaaattattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;HELIUM_000100422_612GNAAXX:7:50:10637:6527#0/1_sub[28..126] {\u0026#34;count\u0026#34;:366,\u0026#34;merged_sample\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:13,\u0026#34;15a_F730814\u0026#34;:5,\u0026#34;26a_F040644\u0026#34;:347,\u0026#34;29a_F260619\u0026#34;:1},\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:1,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:4,\u0026#34;obiclean_singletoncount\u0026#34;:3,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:\u0026#34;s\u0026#34;,\u0026#34;15a_F730814\u0026#34;:\u0026#34;s\u0026#34;,\u0026#34;26a_F040644\u0026#34;:\u0026#34;h\u0026#34;,\u0026#34;29a_F260619\u0026#34;:\u0026#34;s\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:17,\u0026#34;15a_F730814\u0026#34;:5,\u0026#34;26a_F040644\u0026#34;:468,\u0026#34;29a_F260619\u0026#34;:1},\u0026#34;obitag_bestid\u0026#34;:1,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AB048590\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9611 [Canis]@genus\u0026#34;} ttagccctaaacatagataattttacaacaaaataattcgccagaggactactagcaata gcttaaaactcaaaggacttggcggtgctttatatccct The obitag command adds several attributes in the sequence record header, like:\nobitag_bestmatch:ACCESSION where ACCESSION is the id of the sequence in the reference database that best aligns to the query sequence obitag_bestid:FLOAT where FLOAT*100 is the percentage of identity between the best match sequence and the query sequence taxid:TAXID where TAXID is the taxonomic ID of the taxon assigned to the sequence by obitag Exporting the results in a tabular format # To reduce the file size and make it easier to analyze, we can 
make some cosmetic changes to the data file, for example by removing some useless information that OBITools4 inserts in the sequence header to explain its decisions.\nobiannotate is the tool to make such changes. In the next command, we will remove some tags inserted by obiclean. obiannotate --delete-tag=obiclean_head \\ --delete-tag=obiclean_headcount \\ --delete-tag=obiclean_internalcount \\ --delete-tag=obiclean_samplecount \\ --delete-tag=obiclean_singletoncount \\ results/wolf_assembled_taxo.fasta \\ \u0026gt; results/wolf_minimal.fasta The effect of the above command can be seen below:\n📄 wolf_minimal.fasta \u0026gt;HELIUM_000100422_612GNAAXX:7:118:3572:14633#0/1_sub[28..126] {\u0026#34;count\u0026#34;:10172,\u0026#34;merged_sample\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:10172},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:12205},\u0026#34;obitag_bestid\u0026#34;:0.9797979797979798,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AY227529\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9992 [Marmota]@genus\u0026#34;} ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaaca gcctgaaactcaaaggacttggcggtgctttacatccct \u0026gt;HELIUM_000100422_612GNAAXX:7:99:9351:13090#0/1_sub[28..127] 
{\u0026#34;count\u0026#34;:260,\u0026#34;merged_sample\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:260},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:337},\u0026#34;obitag_bestid\u0026#34;:0.9405940594059405,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AF154263\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:9,\u0026#34;obitag_rank\u0026#34;:\u0026#34;infraorder\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:35500 [Pecora]@infraorder\u0026#34;} ttagccctaaacacaaataattacacaaacaaaattgttcaccagagtactagcggcaac agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;HELIUM_000100422_612GNAAXX:7:108:10111:9078#0/1_sub[28..127] {\u0026#34;count\u0026#34;:7146,\u0026#34;merged_sample\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:7146},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:8039},\u0026#34;obitag_bestid\u0026#34;:1,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AB245427\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;species\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9860 [Cervus elaphus]@species\u0026#34;} ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;HELIUM_000100422_612GNAAXX:7:38:14204:12725#0/1_sub[28..126] 
{\u0026#34;count\u0026#34;:87,\u0026#34;merged_sample\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:87},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:202},\u0026#34;obitag_bestid\u0026#34;:0.9494949494949495,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AY227530\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:2,\u0026#34;obitag_rank\u0026#34;:\u0026#34;tribe\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:337730 [Marmotini]@tribe\u0026#34;} ttagccctaaacataaacattcaataaacaagaatgttcgccagaggactactagcaata gcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;HELIUM_000100422_612GNAAXX:7:30:9942:4495#0/1_sub[28..126] {\u0026#34;count\u0026#34;:95,\u0026#34;merged_sample\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:11,\u0026#34;29a_F260619\u0026#34;:84},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:\u0026#34;s\u0026#34;,\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:12,\u0026#34;29a_F260619\u0026#34;:105},\u0026#34;obitag_bestid\u0026#34;:0.9595959595959596,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AC187326\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;subspecies\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9615 [Canis lupus familiaris]@subspecies\u0026#34;} ttagccctaaacataagctattccataacaaaataattcgccagagaactactagcaaca gattaaacctcaaaggacttggcagtgctttatacccct \u0026gt;HELIUM_000100422_612GNAAXX:7:51:16702:19393#0/1_sub[28..127] 
{\u0026#34;count\u0026#34;:12004,\u0026#34;merged_sample\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:7465,\u0026#34;29a_F260619\u0026#34;:4539},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:\u0026#34;h\u0026#34;,\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:8822,\u0026#34;29a_F260619\u0026#34;:5789},\u0026#34;obitag_bestid\u0026#34;:1,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AJ885202\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;species\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9858 [Capreolus capreolus]@species\u0026#34;} ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;HELIUM_000100422_612GNAAXX:7:84:14502:1617#0/1_sub[28..127] {\u0026#34;count\u0026#34;:319,\u0026#34;merged_sample\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:319},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:376},\u0026#34;obitag_bestid\u0026#34;:1,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AJ972683\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;species\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9858 [Capreolus capreolus]@species\u0026#34;} ttagccctaaacacaagtaattattataacaaaattattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;HELIUM_000100422_612GNAAXX:7:50:10637:6527#0/1_sub[28..126] 
{\u0026#34;count\u0026#34;:366,\u0026#34;merged_sample\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:13,\u0026#34;15a_F730814\u0026#34;:5,\u0026#34;26a_F040644\u0026#34;:347,\u0026#34;29a_F260619\u0026#34;:1},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:\u0026#34;s\u0026#34;,\u0026#34;15a_F730814\u0026#34;:\u0026#34;s\u0026#34;,\u0026#34;26a_F040644\u0026#34;:\u0026#34;h\u0026#34;,\u0026#34;29a_F260619\u0026#34;:\u0026#34;s\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:17,\u0026#34;15a_F730814\u0026#34;:5,\u0026#34;26a_F040644\u0026#34;:468,\u0026#34;29a_F260619\u0026#34;:1},\u0026#34;obitag_bestid\u0026#34;:1,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AB048590\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9611 [Canis]@genus\u0026#34;} ttagccctaaacatagataattttacaacaaaataattcgccagaggactactagcaata gcttaaaactcaaaggacttggcggtgctttatatccct The sequence id is very long and carries information that is useful for analyzing the sequencing process but not for us, especially after an obiuniq command, since the sequence id corresponds to the id of only one of the merged sequences. We can thus change it to make it more readable. This is done in two steps. First, a first obiannotate command adds a seq_number attribute to the sequence header, numbering the sequences from 1 to n, where n is the number of sequence variants. Second, a second obiannotate command uses the value of this new attribute to create a new, more readable sequence identifier with the sprintf function of the OBITools4 expression language. 
The new sequence identifier is a string consisting of the prefix \u0026ldquo;seq\u0026rdquo; followed by the sequence number zero-padded to four digits (e.g., seq0001, seq0002, etc.).\nobiannotate --number results/wolf_minimal.fasta \\ | obiannotate --set-id \u0026#39;sprintf(\u0026#34;seq%04d\u0026#34;,annotations.seq_number)\u0026#39; \\ \u0026gt; results/wolf_final.fasta 📄 wolf_final.fasta \u0026gt;seq0001 {\u0026#34;count\u0026#34;:10172,\u0026#34;merged_sample\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:10172},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:12205},\u0026#34;obitag_bestid\u0026#34;:0.9797979797979798,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AY227529\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;seq_number\u0026#34;:1,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9992 [Marmota]@genus\u0026#34;} ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaaca gcctgaaactcaaaggacttggcggtgctttacatccct \u0026gt;seq0002 {\u0026#34;count\u0026#34;:260,\u0026#34;merged_sample\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:260},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:337},\u0026#34;obitag_bestid\u0026#34;:0.9405940594059405,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AF154263\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:9,\u0026#34;obitag_rank\u0026#34;:\u0026#34;infraorder\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;seq_number\u0026#34;:2,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:35500 [Pecora]@infraorder\u0026#34;} ttagccctaaacacaaataattacacaaacaaaattgttcaccagagtactagcggcaac agcttaaaactcaaaggacttggcggtgctttataccctt 
\u0026gt;seq0003 {\u0026#34;count\u0026#34;:7146,\u0026#34;merged_sample\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:7146},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:8039},\u0026#34;obitag_bestid\u0026#34;:1,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AB245427\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;species\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;seq_number\u0026#34;:3,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9860 [Cervus elaphus]@species\u0026#34;} ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;seq0004 {\u0026#34;count\u0026#34;:87,\u0026#34;merged_sample\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:87},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:202},\u0026#34;obitag_bestid\u0026#34;:0.9494949494949495,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AY227530\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:2,\u0026#34;obitag_rank\u0026#34;:\u0026#34;tribe\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;seq_number\u0026#34;:4,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:337730 [Marmotini]@tribe\u0026#34;} ttagccctaaacataaacattcaataaacaagaatgttcgccagaggactactagcaata gcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;seq0005 
{\u0026#34;count\u0026#34;:95,\u0026#34;merged_sample\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:11,\u0026#34;29a_F260619\u0026#34;:84},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:\u0026#34;s\u0026#34;,\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;26a_F040644\u0026#34;:12,\u0026#34;29a_F260619\u0026#34;:105},\u0026#34;obitag_bestid\u0026#34;:0.9595959595959596,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AC187326\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;subspecies\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;seq_number\u0026#34;:5,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9615 [Canis lupus familiaris]@subspecies\u0026#34;} ttagccctaaacataagctattccataacaaaataattcgccagagaactactagcaaca gattaaacctcaaaggacttggcagtgctttatacccct \u0026gt;seq0006 {\u0026#34;count\u0026#34;:12004,\u0026#34;merged_sample\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:7465,\u0026#34;29a_F260619\u0026#34;:4539},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:\u0026#34;h\u0026#34;,\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:8822,\u0026#34;29a_F260619\u0026#34;:5789},\u0026#34;obitag_bestid\u0026#34;:1,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AJ885202\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;species\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;seq_number\u0026#34;:6,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9858 [Capreolus capreolus]@species\u0026#34;} ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;seq0007 
{\u0026#34;count\u0026#34;:319,\u0026#34;merged_sample\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:319},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:\u0026#34;h\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:376},\u0026#34;obitag_bestid\u0026#34;:1,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AJ972683\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;species\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;seq_number\u0026#34;:7,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9858 [Capreolus capreolus]@species\u0026#34;} ttagccctaaacacaagtaattattataacaaaattattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;seq0008 {\u0026#34;count\u0026#34;:366,\u0026#34;merged_sample\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:13,\u0026#34;15a_F730814\u0026#34;:5,\u0026#34;26a_F040644\u0026#34;:347,\u0026#34;29a_F260619\u0026#34;:1},\u0026#34;obiclean_status\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:\u0026#34;s\u0026#34;,\u0026#34;15a_F730814\u0026#34;:\u0026#34;s\u0026#34;,\u0026#34;26a_F040644\u0026#34;:\u0026#34;h\u0026#34;,\u0026#34;29a_F260619\u0026#34;:\u0026#34;s\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;13a_F730603\u0026#34;:17,\u0026#34;15a_F730814\u0026#34;:5,\u0026#34;26a_F040644\u0026#34;:468,\u0026#34;29a_F260619\u0026#34;:1},\u0026#34;obitag_bestid\u0026#34;:1,\u0026#34;obitag_bestmatch\u0026#34;:\u0026#34;AB048590\u0026#34;,\u0026#34;obitag_match_count\u0026#34;:1,\u0026#34;obitag_rank\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obitag_similarity_method\u0026#34;:\u0026#34;lcs\u0026#34;,\u0026#34;seq_number\u0026#34;:8,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9611 [Canis]@genus\u0026#34;} ttagccctaaacatagataattttacaacaaaataattcgccagaggactactagcaata gcttaaaactcaaaggacttggcggtgctttatatccct It is now possible to extract the useful information for our ecological analysis from our sequence file. 
The result of this extraction consists of two CSV files: one describing the occurrence of each sequence variant in the different samples, and one holding the metadata describing each sequence variant, which can at this stage of the analysis be considered as a Molecular Operational Taxonomic Unit (MOTU).\nThe MOTU occurrence table # In the results file wolf_final.fasta, two attributes inform us about the distribution of MOTU abundances across samples (which here correspond to individual PCRs): the merged_sample attribute and the obiclean_weight attribute.\nThe merged_sample attribute was set by obiuniq during the initial reads dereplication procedure. It contains the observed number of reads for each sequence variant in the different samples. The obiclean_weight attribute is the number of reads assigned to each sequence variant after the obiclean denoising (or clustering) step. The number of reads shown in this attribute takes into account not only the number of reads observed for this variant, but also the number of reads observed for the erroneous sequences clustered to this estimated genuine sequence. According to obiclean, obiclean_weight is a better estimate of the true sequence occurrence than the merged_sample attribute.\nThe obimatrix command creates a CSV file representing any map attribute of an OBITools4 sequence file. By default, it dumps the merged_sample attribute, but you can specify any other map attribute. 
Here we decided to use the obiclean_weight attribute, as we prefer to report the abundances of the MOTUs.\nobimatrix --map obiclean_weight \\ results/wolf_final.fasta \\ \u0026gt; results/wolf_final_occurrency.csv csvlook results/wolf_final_occurrency.csv | id | seq0001 | seq0002 | seq0003 | seq0004 | seq0005 | seq0006 | seq0007 | seq0008 | | ----------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | | 29a_F260619 | 0 | 337 | 0 | 0 | 105 | 5 789 | 376 | 1 | | 15a_F730814 | 0 | 0 | 0 | 0 | 0 | 8 822 | 0 | 5 | | 13a_F730603 | 0 | 0 | 8 039 | 0 | 0 | 0 | 0 | 17 | | 26a_F040644 | 12 205 | 0 | 0 | 202 | 12 | 0 | 0 | 468 | | | | | | | | | | | To create the CSV metadata file describing the MOTU attributes, you can use obicsv with the --auto option. This will create a CSV file from the wolf_final.fasta file and automatically determine which columns to include based on the contents of the first sequence records of the input dataset. In the example below, the -i and -s options are used to include the sequence identifier and the sequence itself in the output CSV file. 
The result can be viewed with csvlook:\nobicsv --auto -i -s \\ results/wolf_final.fasta \\ \u0026gt; results/wolf_final_motus.csv csvlook results/wolf_final_motus.csv | id | count | obitag_bestid | obitag_bestmatch | obitag_match_count | obitag_rank | obitag_similarity_method | seq_number | taxid | sequence | | ------- | ------ | ------------- | ---------------- | ------------------ | ----------- | ------------------------ | ---------- | ---------------------------------------------- | ---------------------------------------------------------------------------------------------------- | | seq0001 | 10 172 | 0,980… | AY227529 | 1 | genus | lcs | 1 | taxon:9992 [Marmota]@genus | ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaacagcctgaaactcaaaggacttggcggtgctttacatccct | | seq0002 | 260 | 0,941… | AF154263 | 9 | infraorder | lcs | 2 | taxon:35500 [Pecora]@infraorder | ttagccctaaacacaaataattacacaaacaaaattgttcaccagagtactagcggcaacagcttaaaactcaaaggacttggcggtgctttataccctt | | seq0003 | 7 146 | 1,000… | AB245427 | 1 | species | lcs | 3 | taxon:9860 [Cervus elaphus]@species | ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt | | seq0004 | 87 | 0,949… | AY227530 | 2 | tribe | lcs | 4 | taxon:337730 [Marmotini]@tribe | ttagccctaaacataaacattcaataaacaagaatgttcgccagaggactactagcaatagcttaaaactcaaaggacttggcggtgctttatatccct | | seq0005 | 95 | 0,960… | AC187326 | 1 | subspecies | lcs | 5 | taxon:9615 [Canis lupus familiaris]@subspecies | ttagccctaaacataagctattccataacaaaataattcgccagagaactactagcaacagattaaacctcaaaggacttggcagtgctttatacccct | | seq0006 | 12 004 | 1,000… | AJ885202 | 1 | species | lcs | 6 | taxon:9858 [Capreolus capreolus]@species | ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt | | seq0007 | 319 | 1,000… | AJ972683 | 1 | species | lcs | 7 | taxon:9858 [Capreolus capreolus]@species | 
ttagccctaaacacaagtaattattataacaaaattattcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt | | seq0008 | 366 | 1,000… | AB048590 | 1 | genus | lcs | 8 | taxon:9611 [Canis]@genus | ttagccctaaacatagataattttacaacaaaataattcgccagaggactactagcaatagcttaaaactcaaaggacttggcggtgctttatatccct | References # Riaz,\u0026#32; Shehzad,\u0026#32; Viari,\u0026#32; Pompanon,\u0026#32; Taberlet\u0026#32;\u0026amp;\u0026#32;Coissac (2011) Riaz,\u0026#32; T.,\u0026#32; Shehzad,\u0026#32; W.,\u0026#32; Viari,\u0026#32; A.,\u0026#32; Pompanon,\u0026#32; F.,\u0026#32; Taberlet,\u0026#32; P.\u0026#32;\u0026amp;\u0026#32;Coissac,\u0026#32; E. \u0026#32; (2011). \u0026#32;ecoPrimers: inference of new DNA barcode markers from whole genome sequence analysis. Nucleic acids research,\u0026#32;39(21).\u0026#32;e145. https://doi.org/10.1093/nar/gkr732 Shehzad,\u0026#32; Riaz,\u0026#32; Nawaz,\u0026#32; Miquel,\u0026#32; Poillot,\u0026#32; Shah,\u0026#32; Pompanon,\u0026#32; Coissac\u0026#32;\u0026amp;\u0026#32;Taberlet (2012) Shehzad,\u0026#32; W.,\u0026#32; Riaz,\u0026#32; T.,\u0026#32; Nawaz,\u0026#32; M.,\u0026#32; Miquel,\u0026#32; C.,\u0026#32; Poillot,\u0026#32; C.,\u0026#32; Shah,\u0026#32; S.,\u0026#32; Pompanon,\u0026#32; F.,\u0026#32; Coissac,\u0026#32; E.\u0026#32;\u0026amp;\u0026#32;Taberlet,\u0026#32; P. \u0026#32; (2012). \u0026#32;Carnivore diet analysis based on next-generation sequencing: application to the leopard cat (Prionailurus bengalensis) in Pakistan: LEOPARD CAT DIET. Molecular ecology,\u0026#32;21(8).\u0026#32;1951–1965. 
https://doi.org/10.1111/j.1365-294X.2011.05424.x "},{"id":1,"href":"/obidoc/docs/file_format/taxonomy_file/csv_taxdump/","title":"CSV formatted taxdump","section":"Taxonomy file formats","content":" The CSV format to describe a taxonomy # OBITools4 allows a taxonomy to be described with a CSV file of four columns, which must be named as below:\nField Description taxid A unique taxonomic identifier composed only of digits (0-9), lower case (a-z) and upper case (A-Z) characters parent The taxid of the parent taxon of the current taxon scientific_name The name used by the OBITools as the scientific name of the taxon taxonomic_rank The taxonomic rank of the taxon (e.g. species, genus, family, etc.) The four columns can be freely ordered.\nSome constraints exist on the order of the rows describing the taxa in the CSV file. The first row must contain the taxid of the root taxon (i.e. the taxid of the first taxon in the taxonomic hierarchy). The taxid of the parent taxon of the root taxon must be the same as the taxid of the root taxon. 
For all other taxa, the row declaring a parent taxon must precede the rows of the taxa that use it as parent.\nExample of a taxonomy formatted in CSV # Following this format, here is a four-taxa example with the root taxon, the Betula genus and two species, Betula nana and Betula pubescens.\ntaxid,parent,scientific_name,taxonomic_rank 1,1,root,root 2ABC,1,Betula,genus 3,2ABC,Betula nana,species 4,2ABC,Betula pubescens,species The corresponding taxonomic hierarchy is displayed below:\ngraph RL 1[/\"root (1)\"\\] 2ABC[\"Betula (2ABC)\"] 3[\"Betula nana (3)\"] 4[\"Betula pubescens (4)\"] 2ABC --\u003e 1 3 --\u003e 2ABC 4 --\u003e 2ABC classDef root fill:#fff,stroke:#333,stroke-width:2px classDef genus fill:#bbf,stroke:#333,stroke-width:1px classDef species fill:#dfd,stroke:#333,stroke-width:1px class 1 root class 2ABC genus class 3,4 species This simple format makes it easy to convert any available taxonomic hierarchy, with a small UNIX script, into a format usable by OBITools4.\nGenerating a CSV taxonomy file from a larger taxonomy # The obitaxonomy command can be used to generate a CSV file from another taxonomy format. The main purpose of this functionality is to extract the subtaxonomy corresponding to a clade from a larger taxonomy.\nIf this has not already been done, a copy of the NCBI taxonomy can be downloaded and saved into the ncbitaxo.tgz file.\nobitaxonomy --download-ncbi --out ncbitaxo.tgz obitaxonomy is used to identify the taxid of the taxon of interest, here the genus Betula.\nThe -t option specifies the file containing the taxonomy The --rank option restricts the search to the taxa with the genus taxonomic rank The --fixed option requests an exact match with the taxon name Betula is the pattern used to match the taxon name.
The result of the following obitaxonomy command is CSV formatted, and the piped result can be displayed as a nice table with the csvlook command:\nobitaxonomy -t ncbitaxo.tgz \\ --rank genus \\ --fixed Betula \\ | csvlook | taxid | parent | taxonomic_rank | scientific_name | | ------------------------- | ------------------------------ | -------------- | --------------- | | taxon:3504 [Betula]@genus | taxon:3514 [Betulaceae]@family | genus | Betula | A single taxon meets all the specified criteria. It has the taxid 3504 or taxon:3504 if we include the taxonomy code.\nIt is now possible to request obitaxonomy for dumping the sub taxonomy corresponding to the taxon:3504 taxon. The result is saved by redirecting the stdout to the file betula_subtaxo.csv.\nobitaxonomy -t ncbitaxo.tgz \\ --dump taxon:3504 \u0026gt; betula_subtaxo.csv As usual with obitaxonomy the result is CSV formatted. That allows for using the csvtk dim UNIX command from csvtk program to display the number of columns (four as expected) and of rows, here 131 taxa. 
Once again csvlook is used to print out the result in the form of a nice ASCII table:\ncsvtk dim betula_subtaxo.csv \\ | csvlook | file | num_cols | num_rows | | ------------------ | -------- | -------- | | betula_subtaxo.csv | 4 | 131 | head -30 betula_subtaxo.csv \\ | csvlook | taxid | parent | taxonomic_rank | scientific_name | | --------------------------------------------- | ----------------------------------------- | -------------- | --------------------- | | taxon:1 [root]@no rank | taxon:1 [root]@no rank | no rank | root | | taxon:131567 [cellular organisms]@no rank | taxon:1 [root]@no rank | no rank | cellular organisms | | taxon:2759 [Eukaryota]@superkingdom | taxon:131567 [cellular organisms]@no rank | superkingdom | Eukaryota | | taxon:33090 [Viridiplantae]@kingdom | taxon:2759 [Eukaryota]@superkingdom | kingdom | Viridiplantae | | taxon:35493 [Streptophyta]@phylum | taxon:33090 [Viridiplantae]@kingdom | phylum | Streptophyta | | taxon:131221 [Streptophytina]@subphylum | taxon:35493 [Streptophyta]@phylum | subphylum | Streptophytina | | taxon:3193 [Embryophyta]@clade | taxon:131221 [Streptophytina]@subphylum | clade | Embryophyta | | taxon:58023 [Tracheophyta]@clade | taxon:3193 [Embryophyta]@clade | clade | Tracheophyta | | taxon:78536 [Euphyllophyta]@clade | taxon:58023 [Tracheophyta]@clade | clade | Euphyllophyta | | taxon:58024 [Spermatophyta]@clade | taxon:78536 [Euphyllophyta]@clade | clade | Spermatophyta | | taxon:3398 [Magnoliopsida]@class | taxon:58024 [Spermatophyta]@clade | class | Magnoliopsida | | taxon:1437183 [Mesangiospermae]@clade | taxon:3398 [Magnoliopsida]@class | clade | Mesangiospermae | | taxon:71240 [eudicotyledons]@clade | taxon:1437183 [Mesangiospermae]@clade | clade | eudicotyledons | | taxon:91827 [Gunneridae]@clade | taxon:71240 [eudicotyledons]@clade | clade | Gunneridae | | taxon:1437201 [Pentapetalae]@clade | taxon:91827 [Gunneridae]@clade | clade | Pentapetalae | | taxon:71275 [rosids]@clade | taxon:1437201 
[Pentapetalae]@clade | clade | rosids | | taxon:91835 [fabids]@clade | taxon:71275 [rosids]@clade | clade | fabids | | taxon:3502 [Fagales]@order | taxon:91835 [fabids]@clade | order | Fagales | | taxon:3514 [Betulaceae]@family | taxon:3502 [Fagales]@order | family | Betulaceae | | taxon:3504 [Betula]@genus | taxon:3514 [Betulaceae]@family | genus | Betula | | taxon:361421 [Betula middendorffii]@species | taxon:3504 [Betula]@genus | species | Betula middendorffii | | taxon:1603696 [Betula austrosinensis]@species | taxon:3504 [Betula]@genus | species | Betula austrosinensis | | taxon:216993 [Betula fruticosa]@species | taxon:3504 [Betula]@genus | species | Betula fruticosa | | taxon:361422 [Betula ovalifolia]@species | taxon:3504 [Betula]@genus | species | Betula ovalifolia | | taxon:253223 [Betula uber]@species | taxon:3504 [Betula]@genus | species | Betula uber | | taxon:1685986 [Betula megrelica]@species | taxon:3504 [Betula]@genus | species | Betula megrelica | | taxon:1685997 [Betula tianschanica]@species | taxon:3504 [Betula]@genus | species | Betula tianschanica | | taxon:312792 [Betula raddeana]@species | taxon:3504 [Betula]@genus | species | Betula raddeana | | taxon:1685980 [Betula bomiensis]@species | taxon:3504 [Betula]@genus | species | Betula bomiensis | From taxon:1 (the root taxon) to taxon:3504 (the taxon of interest Betula), the command obitaxonomy has dumped the taxonomic path classifying the Betula genus. The following taxa correspond to the species belonging to the Betula genus.\nThis new taxonomy saved as a CSV file betula_subtaxo.csv can be used by any OBITools as a taxonomy. 
For example, obitaxonomy can use it to identify the taxid of Betula megrelica:\nobitaxonomy -t betula_subtaxo.csv \u0026#34;Betula megrelica\u0026#34; \\ | csvlook | taxid | parent | taxonomic_rank | scientific_name | | ---------------------------------------- | ------------------------- | -------------- | ---------------- | | taxon:1685986 [Betula megrelica]@species | taxon:3504 [Betula]@genus | species | Betula megrelica | or to just dump the subtree of the Betula nana species:\nobitaxonomy -t betula_subtaxo.csv \\ --dump taxon:216990 \\ | csvlook | taxid | parent | taxonomic_rank | scientific_name | | ------------------------------------------------------- | ----------------------------------------- | -------------- | ---------------------------- | | taxon:1 [root]@no rank | taxon:1 [root]@no rank | no rank | root | | taxon:131567 [cellular organisms]@no rank | taxon:1 [root]@no rank | no rank | cellular organisms | | taxon:2759 [Eukaryota]@superkingdom | taxon:131567 [cellular organisms]@no rank | superkingdom | Eukaryota | | taxon:33090 [Viridiplantae]@kingdom | taxon:2759 [Eukaryota]@superkingdom | kingdom | Viridiplantae | | taxon:35493 [Streptophyta]@phylum | taxon:33090 [Viridiplantae]@kingdom | phylum | Streptophyta | | taxon:131221 [Streptophytina]@subphylum | taxon:35493 [Streptophyta]@phylum | subphylum | Streptophytina | | taxon:3193 [Embryophyta]@clade | taxon:131221 [Streptophytina]@subphylum | clade | Embryophyta | | taxon:58023 [Tracheophyta]@clade | taxon:3193 [Embryophyta]@clade | clade | Tracheophyta | | taxon:78536 [Euphyllophyta]@clade | taxon:58023 [Tracheophyta]@clade | clade | Euphyllophyta | | taxon:58024 [Spermatophyta]@clade | taxon:78536 [Euphyllophyta]@clade | clade | Spermatophyta | | taxon:3398 [Magnoliopsida]@class | taxon:58024 [Spermatophyta]@clade | class | Magnoliopsida | | taxon:1437183 [Mesangiospermae]@clade | taxon:3398 [Magnoliopsida]@class | clade | Mesangiospermae | | taxon:71240 [eudicotyledons]@clade | taxon:1437183 
[Mesangiospermae]@clade | clade | eudicotyledons | | taxon:91827 [Gunneridae]@clade | taxon:71240 [eudicotyledons]@clade | clade | Gunneridae | | taxon:1437201 [Pentapetalae]@clade | taxon:91827 [Gunneridae]@clade | clade | Pentapetalae | | taxon:71275 [rosids]@clade | taxon:1437201 [Pentapetalae]@clade | clade | rosids | | taxon:91835 [fabids]@clade | taxon:71275 [rosids]@clade | clade | fabids | | taxon:3502 [Fagales]@order | taxon:91835 [fabids]@clade | order | Fagales | | taxon:3514 [Betulaceae]@family | taxon:3502 [Fagales]@order | family | Betulaceae | | taxon:3504 [Betula]@genus | taxon:3514 [Betulaceae]@family | genus | Betula | | taxon:216990 [Betula nana]@species | taxon:3504 [Betula]@genus | species | Betula nana | | taxon:2820156 [Betula nana subsp. tundrarum]@subspecies | taxon:216990 [Betula nana]@species | subspecies | Betula nana subsp. tundrarum | | taxon:717482 [Betula nana subsp. exilis]@subspecies | taxon:216990 [Betula nana]@species | subspecies | Betula nana subsp. exilis | | taxon:3080005 [Betula nana var. macrophylla]@varietas | taxon:216990 [Betula nana]@species | varietas | Betula nana var. macrophylla | | taxon:1623466 [Betula nana subsp. nana]@subspecies | taxon:216990 [Betula nana]@species | subspecies | Betula nana subsp. nana | Using an appropriate sub-taxonomy can significantly reduce the time needed for an OBITools to read the taxonomy, compared with the time needed to read the entire taxonomy.\n"},{"id":2,"href":"/obidoc/docs/cookbook/ecoprimers/","title":"Designing new barcodes","section":"Cookbook","content":" Designing new barcodes with ecoPrimers # ecoPrimers ( Citation: Riaz,\u0026#32;Shehzad \u0026amp; al.,\u0026#32;2011 Riaz,\u0026#32; T.,\u0026#32; Shehzad,\u0026#32; W.,\u0026#32; Viari,\u0026#32; A.,\u0026#32; Pompanon,\u0026#32; F.,\u0026#32; Taberlet,\u0026#32; P.\u0026#32;\u0026amp;\u0026#32;Coissac,\u0026#32; E. \u0026#32; (2011). 
\u0026#32;ecoPrimers: inference of new DNA barcode markers from whole genome sequence analysis. Nucleic acids research,\u0026#32;39(21).\u0026#32;e145. https://doi.org/10.1093/nar/gkr732 ) is a tool for designing new DNA metabarcodes. It is capable of working with a collection of mitochondrial genomes, chloroplast genomes or rRNA nuclear gene clusters. It is an alignment-free method, which contributes to its efficiency.\nThe ecoPrimers program was developed to be used in conjunction with the original OBITools. Therefore, using it with the new OBITools4 requires some special care in data preparation.\nIn this recipe we will use ecoPrimers to design a new bony fish DNA metabarcode.\nInstallation of ecoPrimers # ecoPrimers is available from the git repository of the metabarcoding site at\nhttps://git.metabarcoding.org/obitools/ecoprimers Installation can be done by cloning the project:\ngit clone https://git.metabarcoding.org/obitools/ecoprimers.git This will create a new ecoprimers directory with a src subdirectory containing the source code. You will need to change your current working directory to this ecoprimers/src directory.\ncd ecoprimers/src It is now possible to compile the ecoPrimers program using the make command:\nmake This command will produce a series of messages on your screen similar to the following. You may get some extra warning messages, but no errors should be reported.
If compilation is successful, an ecoPrimers executable will be created in the current directory.\ngcc -DMAC_OS_X -M -o ecoprimer.d ecoprimer.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecoprimer.o ecoprimer.c /Library/Developer/CommandLineTools/usr/bin/make -C libecoPCR gcc -DMAC_OS_X -M -o econame.d econame.c gcc -DMAC_OS_X -M -o ecofilter.d ecofilter.c gcc -DMAC_OS_X -M -o ecotax.d ecotax.c gcc -DMAC_OS_X -M -o ecoseq.d ecoseq.c gcc -DMAC_OS_X -M -o ecorank.d ecorank.c gcc -DMAC_OS_X -M -o ecoMalloc.d ecoMalloc.c gcc -DMAC_OS_X -M -o ecoIOUtils.d ecoIOUtils.c gcc -DMAC_OS_X -M -o ecoError.d ecoError.c gcc -DMAC_OS_X -M -o ecodna.d ecodna.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecodna.o ecodna.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecoError.o ecoError.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecoIOUtils.o ecoIOUtils.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecoMalloc.o ecoMalloc.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecorank.o ecorank.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecoseq.o ecoseq.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecotax.o ecotax.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecofilter.o ecofilter.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o econame.o econame.c ar -cr libecoPCR.a ecodna.o ecoError.o ecoIOUtils.o ecoMalloc.o ecorank.o ecoseq.o ecotax.o ecofilter.o econame.o ranlib libecoPCR.a /Library/Developer/CommandLineTools/usr/bin/make -C libecoprimer gcc -DMAC_OS_X -M -o ahocorasick.d ahocorasick.c gcc -DMAC_OS_X -M -o PrimerSets.d PrimerSets.c gcc -DMAC_OS_X -M -o filtering.d filtering.c gcc -DMAC_OS_X -M -o apat_search.d apat_search.c gcc -DMAC_OS_X -M -o taxstats.d taxstats.c gcc -DMAC_OS_X -M -o pairs.d pairs.c gcc -DMAC_OS_X -M -o pairtree.d pairtree.c gcc -DMAC_OS_X -M -o sortmatch.d sortmatch.c gcc -DMAC_OS_X -M -o libstki.d libstki.c gcc -DMAC_OS_X -M -o queue.d queue.c gcc -DMAC_OS_X -M -o merge.d merge.c gcc -DMAC_OS_X -M -o aproxpattern.d aproxpattern.c gcc -DMAC_OS_X -M -o strictprimers.d strictprimers.c gcc -DMAC_OS_X -M -o hashsequence.d 
hashsequence.c gcc -DMAC_OS_X -M -o sortword.d sortword.c gcc -DMAC_OS_X -M -o smothsort.d smothsort.c gcc -DMAC_OS_X -M -o readdnadb.d readdnadb.c gcc -DMAC_OS_X -M -o goodtaxon.d goodtaxon.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o goodtaxon.o goodtaxon.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o readdnadb.o readdnadb.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o smothsort.o smothsort.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o sortword.o sortword.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o hashsequence.o hashsequence.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o strictprimers.o strictprimers.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o aproxpattern.o aproxpattern.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o merge.o merge.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o queue.o queue.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o libstki.o libstki.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o sortmatch.o sortmatch.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o pairtree.o pairtree.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o pairs.o pairs.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o taxstats.o taxstats.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o apat_search.o apat_search.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o filtering.o filtering.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o PrimerSets.o PrimerSets.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ahocorasick.o ahocorasick.c ar -cr libecoprimer.a goodtaxon.o readdnadb.o smothsort.o sortword.o hashsequence.o strictprimers.o aproxpattern.o merge.o queue.o libstki.o sortmatch.o pairtree.o pairs.o taxstats.o apat_search.o filtering.o PrimerSets.o ahocorasick.o ranlib libecoprimer.a /Library/Developer/CommandLineTools/usr/bin/make -C libthermo gcc -DMAC_OS_X -M -o thermostats.d thermostats.c gcc -DMAC_OS_X -M -o nnparams.d nnparams.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o nnparams.o nnparams.c gcc -DMAC_OS_X -W -Wall -m64 -g -c -o thermostats.o thermostats.c ar -cr libthermo.a nnparams.o thermostats.o ranlib libthermo.a gcc -g -O5 -m64 -o ecoPrimers ecoprimer.o -LlibecoPCR -Llibecoprimer -Llibthermo 
-L/usr/local/lib -lecoprimer -lecoPCR -lthermo -lz -lm You can now copy the ecoPrimers executable to a directory that is part of your PATH environment variable. The following command lists all these directories (it relies on the zsh $path array; in other shells, the PATH variable must be split on the : character instead). For example, the result may be:\nfor p in $path; do echo $p; done | sort -u /Users/coissac/bin /Users/coissac/go/bin /bin /opt/X11/bin /sbin /usr/bin /usr/local/bin /usr/local/go/bin /usr/sbin From this list you can choose the directory where you want to install the ecoPrimers executable. Here we can choose the folder /Users/coissac/bin to store it, as it is located in the user's home directory, and therefore does not require root privileges to copy the ecoPrimers executable into. /usr/local/bin is also a good choice, as it is the default directory for installing non-standard software on a UNIX system. When software is installed in /usr/local/bin, it is available to all users of the system. However, copying the ecoPrimers executable to /usr/local/bin requires root privileges.\nTo install the software without root privileges:\ncp ecoPrimers /Users/coissac/bin To install the software for all users on the system, using root privileges:\nsudo cp ecoPrimers /usr/local/bin Preparing the data # What do we need? # To design a new animal DNA metabarcode, we have to download the following data from the NCBI website:\nThe complete set of whole mitochondrial genomes The NCBI taxonomy Downloading the mitochondrial genomes # The file containing the complete set of mitochondrial genomes can be downloaded using your favourite web browser from the NCBI FTP website.\nYou will need to download the GenBank flat file format of the data, with extension gbff.gz. This is the only one that contains the link to the NCBI taxonomy for each sequence.\nIf you need to download the data on a UNIX computer, you may not have access to a web browser on that system.
In this case, use the curl command to download the file:\ncurl \u0026#39;https://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/mitochondrion.1.genomic.gbff.gz\u0026#39; \\ \u0026gt; mito.all.gb.gz Because the file is compressed, you must use the zless command instead of the classic less command to inspect the file without decompressing it first:\nzless mito.all.gb.gz LOCUS NW_009243181 45189 bp DNA linear CON 06-OCT-2014 DEFINITION Fonticula alba strain ATCC 38817 mitochondrial scaffold supercont2.211, whole genome shotgun sequence. ACCESSION NW_009243181 NZ_AROH01000000 VERSION NW_009243181.1 DBLINK BioProject: PRJNA262900 Assembly: GCF_000388065.1 KEYWORDS WGS; RefSeq. SOURCE mitochondrion Fonticula alba ORGANISM Fonticula alba Eukaryota; Rotosphaerida; Fonticulaceae; Fonticula. REFERENCE 1 (bases 1 to 45189) AUTHORS Russ,C., Cuomo,C., Burger,G., Gray,M.W., Holland,P.W.H., King,N., Lang,F.B.F., Roger,A.J., Ruiz-Trillo,I., Brown,M., Walker,B., Young,S., Zeng,Q., Gargeya,S., Fitzgerald,M., Haas,B., Abouelleil,A., Allen,A.W., Alvarado,L., Arachchi,H.M., Berlin,A.M., Chapman,S.B., Gainer-Dewar,J., Goldberg,J., Griggs,A., Gujja,S., Hansen,M., Howarth,C., Imamovic,A., Ireland,A., Larimer,J., McCowan,C., Murphy,C., Pearson,M., Poon,T.W., Priest,M., Roberts,A., Saif,S., Shea,T., Sisk,P., Sykes,S., Wortman,J., Nusbaum,C. and Birren,B. 
CONSRTM The Broad Institute Genomics Platform TITLE The Genome Sequence of Fonticula alba ATCC 38817 JOURNAL Unpublished REFERENCE 2 (bases 1 to 45189) CONSRTM NCBI Genome Project TITLE Direct Submission JOURNAL Submitted (06-OCT-2014) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA REFERENCE 3 (bases 1 to 45189) AUTHORS Russ,C., Cuomo,C., Burger,G., Gray,M.W., Holland,P.W.H., King,N., Lang,F.B.F., Roger,A.J., Ruiz-Trillo,I., Brown,M., Walker,B., Young,S., Zeng,Q., Gargeya,S., Fitzgerald,M., Haas,B., Abouelleil,A., Allen,A.W., Alvarado,L., Arachchi,H.M., Berlin,A.M., Chapman,S.B., Gainer-Dewar,J., Goldberg,J., Griggs,A., Gujja,S., Hansen,M., Howarth,C., Imamovic,A., Ireland,A., Larimer,J., McCowan,C., Murphy,C., Pearson,M., Poon,T.W., Priest,M., Roberts,A., Saif,S., Shea,T., Sisk,P., Sykes,S., Wortman,J., Nusbaum,C. and Birren,B. CONSRTM The Broad Institute Genomics Platform TITLE Direct Submission JOURNAL Submitted (26-APR-2013) Broad Institute of MIT and Harvard, 7 Cambridge Center, Cambridge, MA 02142, USA COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review. The reference sequence is identical to KB932304. ##Genome-Assembly-Data-START## Assembly Method :: ALLPATHS v. R44024; Mito ALLPATHS v. 
R43919 Assembly Name :: Font_alba_ATCC_38817_V2 Genome Coverage :: 317.0x; Mito 63.0x Sequencing Technology :: Illumina ##Genome-Assembly-Data-END## FEATURES Location/Qualifiers source 1..45189 /organism=\u0026#34;Fonticula alba\u0026#34; /organelle=\u0026#34;mitochondrion\u0026#34; /mol_type=\u0026#34;genomic DNA\u0026#34; /strain=\u0026#34;ATCC 38817\u0026#34; /isolation_source=\u0026#34;dog dung\u0026#34; /culture_collection=\u0026#34;ATCC:38817\u0026#34; /db_xref=\u0026#34;taxon:691883\u0026#34; /geo_loc_name=\u0026#34;USA: Grainfield, Kansas\u0026#34; /collection_date=\u0026#34;1960\u0026#34; At the end of the top of the file shown above, we can see the /db_xref=\u0026quot;taxon:691883\u0026quot; field, which provides the link to the NCBI taxonomy for this first entry in the file.\nDownload the full taxonomy # The NCBI taxonomy is available as a tarball file. It can be downloaded in the same way as the RefSeq mitochondrial database. You can also download the NCBI taxonomy using the obitaxonomy command with the --download-ncbi option.\nobitaxonomy --download-ncbi INFO[0000] Number of workers set 16 INFO[0000] Downloading NCBI Taxdump to ncbitaxo_20250211.tgz downloading 100% ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| (66/66 MB, 5.1 MB/s) By default, obitaxonomy downloads the latest version of the NCBI taxonomy available from the NCBI FTP site and saves it to the current directory in a file named ncbitaxo_YYYYMMDD.tgz where YYYY is the year, MM is the month and DD is the day of the download. Here the date is 2025/02/11, so the filename is ncbitaxo_20250211.tgz.\nYou can also specify the filename of the downloaded file using the --out filename option. For example:\nobitaxonomy --download-ncbi --out ncbitaxo.tgz The archive contains several files # The NCBI taxonomy dump file contains all the relationships between taxa. 
This information is stored in two files: nodes.dmp and names.dmp.\nThe nodes.dmp file:\nIt contains the taxonomic hierarchy of the NCBI taxonomy. It is a tabular file where the columns are separated by a | character and some whitespace.\nThe first column is the taxid of the taxon. The second column is the parent taxid of the taxon. The third column is the taxonomic rank of the taxon. The remaining columns are not used by the OBITools.\n1 | 1 | no rank |\t|\t8\t|\t0\t| ... 2\t|\t131567\t|\tsuperkingdom\t|\t|\t0\t|\t0\t| 6\t|\t335928\t|\tgenus\t|\t|\t0\t|\t1\t| 7\t|\t6\t|\tspecies\t|\tAC\t|\t0\t|\t1\t| 9\t|\t32199\t|\tspecies\t|\tBA\t|\t0\t| 10\t|\t135621\t|\tgenus\t|\t|\t0\t| 11\t|\t1707\t|\tspecies\t|\tCG\t|\t0\t|\t1\t| 13\t|\t203488\t|\tgenus\t|\t|\t0\t|\t1\t| 14\t|\t13\t|\tspecies\t|\tDT\t|\t0\t|\t1\t| The names.dmp file:\nIt contains the scientific names, and a set of alternative names, for all the taxa. It is also a tabular file where the columns are separated by a | character and some whitespace.\nThe first column is the taxid of the taxon. The second column is the name of the taxon. The third column is the class name of this name (e.g scientific name, or blast name\u0026hellip;) 1\t|\troot\t|\t|\tscientific name\t| 2\t|\tBacteria\t|\tBacteria \u0026lt;prokaryote\u0026gt;\t|\tscientific name\t| 2\t|\tMonera\t|\tMonera \u0026lt;Bacteria\u0026gt;\t|\tin-part\t| 2\t|\tProcaryotae\t|\tProcaryotae \u0026lt;Bacteria\u0026gt;\t|\tin-part\t| 2\t|\tProkaryota\t|\tProkaryota \u0026lt;Bacteria\u0026gt;\t|\tin-part\t| 2\t|\tProkaryotae\t|\tProkaryotae \u0026lt;Bacteria\u0026gt;\t|\tin-part\t| 2\t|\tbacteria\t|\tbacteria \u0026lt;blast2\u0026gt;\t|\tblast name\t| 2\t|\teubacteria\t|\t|\tgenbank common name\t| 2\t|\tprokaryote\t|\tprokaryote \u0026lt;Bacteria\u0026gt;\t|\tin-part\t| ... 
10\t|\tCellvibrio\t|\t|\tscientific name\t| 11\t|\t[Cellvibrio] gilvus\t|\t|\tscientific name\t| 13\t|\tDictyoglomus\t|\t|\tscientific name\t| 14\t|\tDictyoglomus thermophilum\t|\t|\tscientific name\t| A readme.txt file is present in the archive for more information about the NCBI taxonomy dump file.\nPreparing the set of complete genomes # With OBITools, the preferred format for storing sequences is the fasta format. Therefore, we will use the obiconvert tool to convert the GenBank files into fasta format.\nobiconvert --skip-empty \\ --update-taxid \\ -t ncbitaxo_20250211.tgz \\ mito.all.gb.gz \\ \u0026gt; mito.all.fasta head -5 mito.all.fasta Downloading the fasta formatted file directly from the NCBI FTP site is not equivalent to downloading a GenBank file and converting it to fasta format with obiconvert. Only the file converted from GenBank format contains the taxid of each sequence.\nHere are the first lines of the mito.all.fasta file:\n\u0026gt;NC_072933 {\u0026#34;definition\u0026#34;:\u0026#34;Echinosophora koreensis mitochondrion, complete genome.\u0026#34;,\u0026#34;scientific_name\u0026#34;:\u0026#34;mitochondrion Echinosophora koreensis\u0026#34;,\u0026#34;taxid\u0026#34;:228658} ctttcgggtcggaaatagaagatctggattagatcccttctcgatagctttagtcagagc tcatccctcgaaaaagggagtagtgagatgagaaaagggtgactagaatacggaaattca actagtgaagtcagatccgggaattccactattgaagttatccgtcttaggcttcaagca agctatctttcaaggaagtcagtctaagccctaagccaagatctgctttttgccagtcaa Preparing a database for new barcode inference # Preparing a database for new barcode inference involves three steps:\nAnnotate the sequences by their species taxid. Make sure that no species is represented much more than the others. Extract only vertebrate genomes. Searching for the taxid of vertebrates. # First we will search for the taxid of Vertebrata, as the taxid is the only way to pass taxonomic information to the OBITools. The --fixed option asks for exact matches of the name.
The name search is not case-sensitive.\nobitaxonomy -t ncbitaxo_20250211.tgz \\ --fixed \\ \u0026#39;vertebrata\u0026#39; taxid,parent,taxonomic_rank,scientific_name taxon:1261581 [Vertebrata]@genus,taxon:2008651 [Polysiphonioideae]@subfamily,genus,Vertebrata taxon:7742 [Vertebrata]@clade,taxon:89593 [Craniata]@subphylum,clade,Vertebrata The csvlook command allows to have a pretty and more readable table:\nobitaxonomy -t ncbitaxo_20250211.tgz \\ --fixed \\ \u0026#39;vertebrata\u0026#39; \\ | csvlook | taxid | parent | taxonomic_rank | scientific_name | | -------------------------------- | ------------------------------------------- | -------------- | --------------- | | taxon:1261581 [Vertebrata]@genus | taxon:2008651 [Polysiphonioideae]@subfamily | genus | Vertebrata | | taxon:7742 [Vertebrata]@clade | taxon:89593 [Craniata]@subphylum | clade | Vertebrata | Surprisingly, the Latin name Vertebrata is shared by two different taxa. The first is a genus and obviously not the one we are looking for. 
The second is a clade, and it is the one we are looking for.\nLooking for the Vertebrata genus taxid # Just out of curiosity, we are going to display the taxonomic path of the Vertebrata genus, starting from its parent taxon (taxid 2008651).\nobitaxonomy -t ncbitaxo_20250211.tgz \\ -p 2008651 \\ | csvlook | taxid | parent | taxonomic_rank | scientific_name | | ------------------------------------------- | ------------------------------------------- | -------------- | ------------------ | | taxon:2008651 [Polysiphonioideae]@subfamily | taxon:2803 [Rhodomelaceae]@family | subfamily | Polysiphonioideae | | taxon:2803 [Rhodomelaceae]@family | taxon:2802 [Ceramiales]@order | family | Rhodomelaceae | | taxon:2802 [Ceramiales]@order | taxon:2045261 [Rhodymeniophycidae]@subclass | order | Ceramiales | | taxon:2045261 [Rhodymeniophycidae]@subclass | taxon:2806 [Florideophyceae]@class | subclass | Rhodymeniophycidae | | taxon:2806 [Florideophyceae]@class | taxon:2763 [Rhodophyta]@phylum | class | Florideophyceae | | taxon:2763 [Rhodophyta]@phylum | taxon:2759 [Eukaryota]@superkingdom | phylum | Rhodophyta | | taxon:2759 [Eukaryota]@superkingdom | taxon:131567 [cellular organisms]@no rank | superkingdom | Eukaryota | | taxon:131567 [cellular organisms]@no rank | taxon:1 [root]@no rank | no rank | cellular organisms | | taxon:1 [root]@no rank | taxon:1 [root]@no rank | no rank | root | You can see that the Vertebrata genus belongs to the Rhodophyta phylum, which corresponds to red algae.\nRe-annotation of sequences to species level and selection of genomes # In order to know how species are represented in the database, and more specifically how many sequences represent each species, we will annotate the sequences with taxonomic information at the species level. We need to do this because some mitochondrial genomes can be annotated at other taxonomic levels, such as subspecies.\nobiannotate can perform this task using the --with-taxon-at-rank option.
This option requires you to specify the taxonomic rank at which the annotation should be performed. In this example case, we have to use the rank species. The species taxid is stored in the species_taxid tag of the sequence.\nIn the following command we combine three obiannotate commands with one obiuniq command using the | pipe operator (see the General operating principles section):\nobiannotate -t ncbitaxo_20250211.tgz \\ --with-taxon-at-rank species \\ mito.all.fasta | \\ obiannotate -S \u0026#39;ori_taxid=annotations.taxid\u0026#39; | \\ obiannotate -S \u0026#39;taxid=annotations.species_taxid\u0026#39; | \\ obiuniq -c taxid \u0026gt; mito.one.fasta Looking at the sequence of NC_050066, it is annotated with taxon 2756270, which corresponds to the subspecies Monochamus alternatus alternatus:\n\u0026gt;NC_050066 {\u0026#34;definition\u0026#34;:\u0026#34;Monochamus alternatus alternatus mitochondrion, complete genome.\u0026#34;,\u0026#34;scientific_name\u0026#34;:\u0026#34;mitochondrion Monochamus alternatus alternatus\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:2756270 [Monochamus alternatus alternatus]@subspecies\u0026#34;} aatgaagtgcctgagcaaagggtaattttgatagaattagtaacgtgaattttcaccttc attaattatatttaatagaattaaactatttccttagatatcaaaaatctttatacatca ... 
The first obiannotate command adds the species_taxid tag to the sequences.\n\u0026gt;NC_050066 {\u0026#34;definition\u0026#34;:\u0026#34;Monochamus alternatus alternatus mitochondrion, complete genome.\u0026#34;,\u0026#34;scientific_name\u0026#34;:\u0026#34;mitochondrion Monochamus alternatus alternatus\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Monochamus alternatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:\u0026#34;taxon:192382 [Monochamus alternatus]@species\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:2756270 [Monochamus alternatus alternatus]@subspecies\u0026#34;} aatgaagtgcctgagcaaagggtaattttgatagaattagtaacgtgaattttcaccttc attaattatatttaatagaattaaactatttccttagatatcaaaaatctttatacatca ... The second obiannotate copies the original taxid tag into a new tag named ori_taxid to preserve the original taxid for possible future use.\n\u0026gt;NC_050066 {\u0026#34;definition\u0026#34;:\u0026#34;Monochamus alternatus alternatus mitochondrion, complete genome.\u0026#34;,\u0026#34;ori_taxid\u0026#34;:\u0026#34;taxon:2756270 [Monochamus alternatus alternatus]@subspecies\u0026#34;,\u0026#34;scientific_name\u0026#34;:\u0026#34;mitochondrion Monochamus alternatus alternatus\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Monochamus alternatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:\u0026#34;taxon:192382 [Monochamus alternatus]@species\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:2756270 [Monochamus alternatus alternatus]@subspecies\u0026#34;} aatgaagtgcctgagcaaagggtaattttgatagaattagtaacgtgaattttcaccttc attaattatatttaatagaattaaactatttccttagatatcaaaaatctttatacatca ... The third obiannotate then copies the species_taxid tag into the main taxid tag. 
From now on, the OBITools will use the species taxid stored in the taxid tag as the taxonomic annotation for the sequence.\n\u0026gt;NC_050066 {\u0026#34;definition\u0026#34;:\u0026#34;Monochamus alternatus alternatus mitochondrion, complete genome.\u0026#34;,\u0026#34;ori_taxid\u0026#34;:\u0026#34;taxon:2756270 [Monochamus alternatus alternatus]@subspecies\u0026#34;,\u0026#34;scientific_name\u0026#34;:\u0026#34;mitochondrion Monochamus alternatus alternatus\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Monochamus alternatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:\u0026#34;taxon:192382 [Monochamus alternatus]@species\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:192382 [Monochamus alternatus]@species\u0026#34;} aatgaagtgcctgagcaaagggtaattttgatagaattagtaacgtgaattttcaccttc attaattatatttaatagaattaaactatttccttagatatcaaaaatctttatacatca ... Look carefully at this latest version of the sequence. The taxid tag has been updated to the species taxid, the ori_taxid tag contains the original taxid as provided by GenBank, and the species_taxid tag also contains the species taxid.\nFinally, obiuniq merges all strictly identical sequences into a single sequence entry. Here, the -c taxid option ensures that only sequences with the same taxid are merged. Therefore, two strictly identical sequences not annotated with the same taxid will be kept as two sequence entries.\nLook at the evenness of the species representation # The goal here is to create a histogram representing the number of sequences per species, using UNIX commands. More specifically, it shows how many species are represented by one, two, three or more sequences.\nThe final command to run is the following:\nobicsv -k taxid mito.one.fasta \\ | tail -n +2 \\ | sort \\ | uniq -c \\ | sort -nk1 \\ | cut -w -f 2 \\ | uplot count But first, try to understand what is going on.\nobicsv converts a sequence file into a CSV file.
Here, because of the -k taxid option, the CSV file will only contain the taxid tag of every sequence. The head command is used to display the first ten lines of the result.\nobicsv -k taxid mito.one.fasta \\ | head taxid taxon:2065826 [Sineleotris saccharae]@species taxon:2219250 [Ocinara albicollis]@species taxon:8306 [Ambystoma talpoideum]@species taxon:80600 [Rhizopogon vinicolor]@species taxon:270463 [Vanessa indica]@species taxon:1028098 [Hierodula patellifera]@species taxon:56258 [Sagittarius serpentarius]@species taxon:457650 [Myadora brevis]@species taxon:763200 [Arma chinensis]@species The tail command is used to remove the header line from the CSV file and keep only its data part. This is done by extracting the end of the file, starting from its second line (option -n +2).\nobicsv -k taxid mito.one.fasta \\ | tail -n +2 \\ | head taxon:2065826 [Sineleotris saccharae]@species taxon:2219250 [Ocinara albicollis]@species taxon:8306 [Ambystoma talpoideum]@species taxon:80600 [Rhizopogon vinicolor]@species taxon:270463 [Vanessa indica]@species taxon:1028098 [Hierodula patellifera]@species taxon:56258 [Sagittarius serpentarius]@species taxon:457650 [Myadora brevis]@species taxon:763200 [Arma chinensis]@species taxon:2060314 [Neotrygon indica]@species As you can see, the first line of the output no longer contains the taxid column header present in the previous output. 
In the next command, sort is used to order the lines so that identical taxid values end up on consecutive lines.\nobicsv -k taxid mito.one.fasta \\ | tail -n +2 \\ | sort \\ | head \u0026#34;taxon:1030158 [Ficus variegata Roding, 1798]@species\u0026#34; \u0026#34;taxon:244488 [Pillucina pisidium (Dunker, 1860)]@species\u0026#34; \u0026#34;taxon:352057 [Anopheles albitarsis F Brochero et al., 2007]@species\u0026#34; \u0026#34;taxon:646521 [Contracaecum rudolphii B Bullini et al., 1986]@species\u0026#34; \u0026#34;taxon:908352 [Anopheles albitarsis G Krzywinski et al., 2011]@species\u0026#34; taxon:1000982 [Steindachneridion melanodermatum]@species taxon:1001283 [Calameuta idolon]@species taxon:1001291 [Trachelus tabidus]@species taxon:1001332 [Phylloporia weberiana]@species taxon:1001553 [Dephomys defua]@species We can then add the uniq -c command to count the number of times each taxid appears in the file.\nobicsv -k taxid mito.one.fasta \\ | tail -n +2 \\ | sort \\ | uniq -c \\ | head 1 \u0026#34;taxon:1030158 [Ficus variegata Roding, 1798]@species\u0026#34; 1 \u0026#34;taxon:244488 [Pillucina pisidium (Dunker, 1860)]@species\u0026#34; 1 \u0026#34;taxon:352057 [Anopheles albitarsis F Brochero et al., 2007]@species\u0026#34; 1 \u0026#34;taxon:646521 [Contracaecum rudolphii B Bullini et al., 1986]@species\u0026#34; 1 \u0026#34;taxon:908352 [Anopheles albitarsis G Krzywinski et al., 2011]@species\u0026#34; 1 taxon:1000982 [Steindachneridion melanodermatum]@species 1 taxon:1001283 [Calameuta idolon]@species 1 taxon:1001291 [Trachelus tabidus]@species 1 taxon:1001332 [Phylloporia weberiana]@species 1 taxon:1001553 [Dephomys defua]@species The uniq command added a first column to the output, which contains the number of times each taxid appears in the original file.\nThe next step is to remove the taxid column from the output and keep only the first column, the count. 
Because the uniq command pads the count with leading spaces, the cut command treats the count as the second column, even though it looks like the first one to us.\nobicsv -k taxid mito.one.fasta \\ | tail -n +2 \\ | sort \\ | uniq -c \\ | cut -w -f 2 \\ | head 1 1 1 1 1 1 1 1 1 1 The -w option specifies that columns are separated by whitespace. The -f 2 option specifies that the second column is the one to keep. The last step is to sort the counts numerically (sort -nk1) and send them to the uplot command to plot the histogram.\nobicsv -k taxid mito.one.fasta \\ | tail -n +2 \\ | sort \\ | uniq -c \\ | sort -nk1 \\ | cut -w -f 2 \\ | uplot count ┌ ┐ 1 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 17769.0 2 ┤ 90.0 3 ┤ 17.0 4 ┤ 5.0 5 ┤ 4.0 6 ┤ 2.0 7 ┤ 1.0 └ ┘ Very few taxa are represented by more than one mitochondrial genome, while 17769 species are represented by a single genome. We can therefore assume that the set of mitochondrial genomes is not strongly biased in favour of a particular taxon.\nSelection of vertebrate genomes # The mitochondrial database we have downloaded contains mitochondrial genomes from vertebrates, but also from invertebrates, fungi, plants\u0026hellip; Since ecoPrimers requires that every sequence in the learning database can potentially contain the barcode we are looking for, we restrict the learning database to vertebrate genomes.\nThe obigrep command will do this for us. 
We just need to provide the taxid of the Vertebrata taxon with the -r option, and the taxonomy with the -t option.\nobigrep -t ncbitaxo_20250211.tgz \\ -r 7742 \\ mito.one.fasta \u0026gt; mito.vert.fasta Now we can count the number of sequences in the new learning database.\nobicount mito.vert.fasta \\ | csvlook | entities | n | | -------- | ----------- | | variants | 7,822 | | reads | 7,823 | | symbols | 131,378,756 | Formatting data for ecoPrimers # As mentioned in the introduction, ecoPrimers was designed to work with the original version of OBITools. 
We now need to perform three more steps to prepare the data for the ecoPrimers.\nUnarchiving the taxonomy # The old OBITools cannot use archived and compressed taxonomies. So we need to\nCreate a new directory to store the unarchived taxonomy using the mkdir command. Change to the new directory using the `cd\u0026rsquo; command. Extract the taxonomy from the compressed file using the tar command. Return to the original directory using the `cd\u0026rsquo; command. mkdir ncbitaxo_20250211 cd ncbitaxo_20250211 tar zxvf ../ncbitaxo_20250211.tgz cd .. Converting the database to the old obitools format # Now OBITools4 stores the annotations in JSON format.\n\u0026gt;NC_050066 {\u0026#34;definition\u0026#34;:\u0026#34;Monochamus alternatus alternatus mitochondrion, complete genome. \u0026#34;,\u0026#34;ori_taxid\u0026#34;:\u0026#34;taxon:2756270 [Monochamus alternatus alternatus]@subspecies\u0026#34;,\u0026#34;scientific_name\u0026#34;:\u0026#34;mitochondrion Monochamus alternatus alternatus\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Monochamus alternatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:\u0026#34;taxon:192382 [Monochamus alternatus]@species\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:192382 [Monochamus alternatus]@species\u0026#34;} aatgaagtgcctgagcaaagggtaattttgatagaattagtaacgtgaattttcaccttc attaattatatttaatagaattaaactatttccttagatatcaaaaatctttatacatca ... The original OBITools stored the annotation in a key=value; format.\n\u0026gt;NC_050066 ori_taxid=taxon:2756270 [Monochamus alternatus alternatus]@subspecies; scientific_name=mitochondrion Monochamus alternatus alternatus; species_name=Monochamus alternatus; species_taxid=taxon:192382 [Monochamus alternatus]@species; taxid=taxon:192382 [Monochamus alternatus]@species; count=1; Monochamus alternatus alternatus mitochondrion, complete genome. aatgaagtgcctgagcaaagggtaattttgatagaattagtaacgtgaattttcaccttc attaattatatttaatagaattaaactatttccttagatatcaaaaatctttatacatca ... 
When the -O option is added to an OBITools4 command, the old OBITools format is used instead of the new JSON-based format.\nobiconvert -O mito.vert.fasta \u0026gt; mito.vert.old.fasta head -5 mito.vert.old.fasta \u0026gt;NC_071784 taxid=taxon:2065826 [Sineleotris saccharae]@species; count=1; ori_taxid=taxon:2065826 [Sineleotris saccharae]@species; scientific_name=mitochondrion Sineleotris saccharae; species_name=Sineleotris saccharae; species_taxid=taxon:2065826 [Sineleotris saccharae]@species; Sineleotris saccharae mitochondrion, complete genome. gctagcgtagcttaaccaaagcataacactgaagatgttaagatgggccctagaaagccc cgcaagcacaaaagcttggtcctggctttactatcagcttaggctaaacttacacatgca agtatccgcatccccgtgagaatgcccttaagctcccaccgctaacaggagtcaaggagc cggtatcaggcacaaccctgagttagcccacgacaccttgctcagccacacccccaaggg Indexing the mitochondrial learning database # The last step in preparing the data for ecoPrimers is to index the learning database. The original OBITools did this job, but OBITools4 does not.\nWith the ecoPCRFormat python script, you can do that indexing without the original OBITools.\nOnce you have downloaded the ecoPCRFormat python script by clicking here, you have to make it executable and copy it to the same directory as the ecoPrimers program.\nHere is an example of how to do that:\ncurl http://localhost:1313/obitools4-doc/docs/cookbook/ecoprimers/ecoPCRFormat \u0026gt; ecoPCRFormat chmod +x ecoPCRFormat cp ecoPCRFormat /Users/coissac/bin You can now run the ecoPCRFormat script to create the index files.\necoPCRFormat -t ncbitaxo_20250211 \\ -f \\ -n vertebrata \\ mito.vert.old.fasta The -t option specifies the directory where the taxonomy database is located. The -f option specifies that the input file is in fasta format. The -n option specifies the name of the indexed learning database. The last parameter, mito.vert.old.fasta, is the name of the input file containing the sequences to be indexed. 
This command creates the following index files:\nls -l vertebrata* -rw-r--r--@ 1 coissac staff 260899785 Feb 11 11:53 vertebrata.ndx -rw-r--r--@ 1 coissac staff 546 Feb 11 11:53 vertebrata.rdx -rw-r--r--@ 1 coissac staff 121379751 Feb 11 11:53 vertebrata.tdx -rw-r--r--@ 1 coissac staff 40446318 Feb 11 11:54 vertebrata_001.sdx Selecting the best primer pairs # Searching the Teleostei taxid # To design a new DNA metabarcode for bony fish, we first have to find the Teleostei taxid.\nobitaxonomy -t ncbitaxo_20250211.tgz \\ --fixed \\ \u0026#39;Teleostei\u0026#39; \\ | csvlook | taxid | parent | taxonomic_rank | scientific_name | | ---------------------------------- | ---------------------------------- | -------------- | --------------- | | taxon:32443 [Teleostei]@infraclass | taxon:41665 [Neopterygii]@subclass | infraclass | Teleostei | Running the ecoPrimers program # The ecoPrimers command is responsible for looking for the priming sites. ecoPrimers is an alignment-free program able to identify conserved regions among a large set of sequences.\necoPrimers -d vertebrata \\ -e 3 -3 2 \\ -l 30 -L 150 \\ -r 32443 \\ -c \u0026gt; Teleostei.ecoprimers The -d option allows you to specify the learning database, here the vertebrate mitochondrial genome database indexed above. The -e option specifies the maximum number of mismatches allowed between the primer and the priming site. The number of mismatches is per primer. The -3 option, used here with the 2 argument (-3 2), indicates that no mismatches are allowed on the last two nucleotides (3\u0026rsquo; end) of the primer. The -l option specifies the minimum length of the barcode (excluding primers) to search for. The -L option specifies the maximum length of the barcode (excluding primers) to search for. The -r option indicates which taxon (here Teleostei) ecoPrimers will focus on. The -c option indicates that the learning database consists of circular genomes. 
After running for a few minutes and writing information about its progress to the terminal, ecoPrimers ends by reporting the number of primer pairs it has identified:\nTotal number of pairs : 9407 Total number of good pairs : 407 We can now have a look at the beginning of the result file.\nhead -35 Teleostei.ecoprimers # # ecoPrimer version 0.5 # Rank level optimisation : species # max error count by oligonucleotide : 3 # # Restricted to taxon: # 32443 : Teleostei (infraclass) # # strict primer quorum : 0.70 # example quorum : 0.90 # counterexample quorum : 0.10 # # database : vertebrata # Database is constituted of 3909 examples corresponding to 3876 species # and 0 counterexamples corresponding to 0 species # # amplifiat length between [30,150] bp # DB sequences are considered as circular # Pairs having specificity less than 0.60 will be ignored # 0 AGAGTGACGGGCGGTGTG CGTCAGGTCGAGGTGTAG 62.8 42.4 57.5 34.1 12 11 GG 3864 0 0.988 3832 0 0.989 2731 0.713 134 146 138.22 1 CGTCAGGTCGAGGTGTAG GAGTGACGGGCGGTGTGT 57.5 34.1 63.1 42.9 11 12 GG 3863 0 0.988 3831 0 0.988 2730 0.713 133 145 137.22 2 CGTCAGGTCGAGGTGTAG GGGAGAGTGACGGGCGGT 57.5 34.1 64.5 37.0 11 13 GG 3811 0 0.975 3779 0 0.975 2689 0.712 137 149 141.22 3 CGTCAGGTCGAGGTGTAG GGGGAGAGTGACGGGCGG 57.5 34.1 65.5 38.4 11 14 GG 3804 0 0.973 3772 0 0.973 2682 0.711 138 149 142.22 4 ACACCGCCCGTCACTCTC ACCTTCCGGTACACTTAC 62.5 36.8 54.0 16.6 12 9 GG 3850 0 0.985 3818 0 0.985 2658 0.696 46 132 66.51 5 AACGTCAGGTCGAGGTGT AGAGTGACGGGCGGTGTG 58.8 28.4 62.8 41.7 10 12 GG 3779 0 0.967 3746 0 0.966 2653 0.708 137 148 140.23 6 ACACCGCCCGTCACTCTC CACCTTCCGGTACACTTA 62.5 36.8 54.0 16.6 12 9 GG 3846 0 0.984 3814 0 0.984 2654 0.696 47 133 67.51 7 AACGTCAGGTCGAGGTGT GAGTGACGGGCGGTGTGT 58.8 28.4 63.1 42.1 10 12 GG 3778 0 0.966 3745 0 0.966 2652 0.708 136 147 139.23 8 ACCTTCCGGTACACTTAC CACACCGCCCGTCACTCT 54.0 16.6 62.8 37.3 9 12 GG 3845 0 0.984 3813 0 0.984 2653 0.696 47 133 67.51 9 ACACCGCCCGTCACTCTC TCCGGTACACTTACCATG 62.5 36.8 54.1 18.1 12 9 GG 
3851 0 0.985 3819 0 0.985 2651 0.694 42 128 62.51 10 ACACCGCCCGTCACTCTC CCGGTACACTTACCATGT 62.5 36.8 54.4 18.6 12 9 GG 3851 0 0.985 3819 0 0.985 2651 0.694 41 127 61.51 11 ACACCGCCCGTCACTCTC CCAAGTGCACCTTCCGGT 62.5 36.8 60.7 28.9 12 11 GG 3837 0 0.982 3805 0 0.982 2650 0.696 54 140 74.51 12 ACACCGCCCGTCACTCTC GCACCTTCCGGTACACTT 62.5 36.8 57.7 22.5 12 10 GG 3842 0 0.983 3810 0 0.983 2650 0.696 48 134 68.51 13 ACACCGCCCGTCACTCTC CGGTACACTTACCATGTT 62.5 36.8 52.4 15.7 12 8 GG 3850 0 0.985 3818 0 0.985 2650 0.694 40 126 60.51 14 ACACCGCCCGTCACTCTC CACTTACCATGTTACGAC 62.5 36.8 51.1 27.7 12 8 GG 3850 0 0.985 3817 0 0.985 2649 0.694 35 121 55.51 The result file consists of two parts. The header, consisting of lines starting with the # character, contains all the parameters used by the ecoPrimer algorithms and some statistics about the database and the current search.\nThe second part is a tabular text describing all potential primer pairs identified. Immediately below this is a detailed description of the information contained in each column.\nTable result description :\ncolumn 1 : serial number column 2 : primer1 column 3 : primer2 column 4 : primer1 Tm without mismatch column 5 : primer1 lowest Tm against exemple sequences column 6 : primer2 Tm without mismatch column 7 : primer2 lowest Tm against exemple sequences column 8 : primer1 G+C count column 9 : primer2 G+C count column 10 : good/bad column 11 : amplified example sequence count column 12 : amplified counterexample sequence count column 13 : yule column 14 : amplified example taxa count column 15 : amplified counterexample taxa count column 16 : ratio of amplified example taxa versus all example taxa (Bc index) column 17 : unambiguously identified example taxa count column 18 : ratio of specificity unambiguously identified example taxa versus all example taxa (Bs index) column 19 : minimum amplified length column 20 : maximum amplified length column 21 : average amplified length Suppose we decide to focus on the 
pair with serial number 11 because it seems to have relatively good properties and, in particular, a relatively balanced melting temperature between the two primers.\nPrimer ID : 11 Primer sequence tm max tm min GC count Forward ACACCGCCCGTCACTCTC 62.5 36.8 12 Reverse CCAAGTGCACCTTCCGGT 60.7 28.9 11 amplifying 3837/3909 sequences identify 2650/3876 Species Size ranging from 54bp to 140bp (mean: 74.51 bp) Testing the new primer pair # To better characterise this pair, we can now use the obipcr tool to extract the barcode sequence corresponding to this pair from the learning database.\nobipcr --forward ACACCGCCCGTCACTCTC \\ --reverse CCAAGTGCACCTTCCGGT \\ -e 5 \\ -l 30 -L 150 \\ -c \\ mito.vert.fasta \\ \u0026gt; Teleostei_11.fasta head Teleostei_11.fasta \u0026gt;NC_022183_sub[925..998] {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Acrossocheilus hemispinus mitochondrion, complete genome.\u0026#34;,\u0026#34;direction\u0026#34;:\u0026#34;forward\u0026#34;,\u0026#34;forward_error\u0026#34;:1,\u0026#34;forward_match\u0026#34;:\u0026#34;acaccgcccgtcaccctc\u0026#34;,\u0026#34;forward_primer\u0026#34;:\u0026#34;ACACCGCCCGTCACTCTC\u0026#34;,\u0026#34;ori_taxid\u0026#34;:\u0026#34;taxon:356810 [Acrossocheilus hemispinus]@species\u0026#34;,\u0026#34;reverse_error\u0026#34;:0,\u0026#34;reverse_match\u0026#34;:\u0026#34;ccaagtgcaccttccggt\u0026#34;,\u0026#34;reverse_primer\u0026#34;:\u0026#34;CCAAGTGCACCTTCCGGT\u0026#34;,\u0026#34;scientific_name\u0026#34;:\u0026#34;mitochondrion Acrossocheilus hemispinus\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Acrossocheilus hemispinus\u0026#34;,\u0026#34;species_taxid\u0026#34;:\u0026#34;taxon:356810 [Acrossocheilus hemispinus]@species\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:356810 [Acrossocheilus hemispinus]@species\u0026#34;} cccgtcaaaatacaccaaaaatacttaatacaataacactaacaaggggaggcaagtcgt aacatggtaagtgt \u0026gt;NC_018560_sub[916..988] 
{\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Astatotilapia calliptera mitochondrion, complete genome.\u0026#34;,\u0026#34;direction\u0026#34;:\u0026#34;forward\u0026#34;,\u0026#34;forward_error\u0026#34;:0,\u0026#34;forward_match\u0026#34;:\u0026#34;acaccgcccgtcactctc\u0026#34;,\u0026#34;forward_primer\u0026#34;:\u0026#34;ACACCGCCCGTCACTCTC\u0026#34;,\u0026#34;ori_taxid\u0026#34;:\u0026#34;taxon:8154 [Astatotilapia calliptera]@species\u0026#34;,\u0026#34;reverse_error\u0026#34;:1,\u0026#34;reverse_match\u0026#34;:\u0026#34;ccaagtacaccttccggt\u0026#34;,\u0026#34;reverse_primer\u0026#34;:\u0026#34;CCAAGTGCACCTTCCGGT\u0026#34;,\u0026#34;scientific_name\u0026#34;:\u0026#34;mitochondrion Astatotilapia calliptera (eastern happy)\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Astatotilapia calliptera\u0026#34;,\u0026#34;species_taxid\u0026#34;:\u0026#34;taxon:8154 [Astatotilapia calliptera]@species\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:8154 [Astatotilapia calliptera]@species\u0026#34;} cccaagccaacaacatcctataaataatacattttaccggtaaaggggaggcaagtcgta acatggtaagtgt \u0026gt;NC_056117_sub[923..997] {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Pseudocrossocheilus tridentis mitochondrion, complete genome.\u0026#34;,\u0026#34;direction\u0026#34;:\u0026#34;forward\u0026#34;,\u0026#34;forward_error\u0026#34;:0,\u0026#34;forward_match\u0026#34;:\u0026#34;acaccgcccgtcactctc\u0026#34;,\u0026#34;forward_primer\u0026#34;:\u0026#34;ACACCGCCCGTCACTCTC\u0026#34;,\u0026#34;ori_taxid\u0026#34;:\u0026#34;taxon:887881 [Pseudocrossocheilus tridentis]@species\u0026#34;,\u0026#34;reverse_error\u0026#34;:0,\u0026#34;reverse_match\u0026#34;:\u0026#34;ccaagtgcaccttccggt\u0026#34;,\u0026#34;reverse_primer\u0026#34;:\u0026#34;CCAAGTGCACCTTCCGGT\u0026#34;,\u0026#34;scientific_name\u0026#34;:\u0026#34;mitochondrion Pseudocrossocheilus tridentis\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Pseudocrossocheilus 
tridentis\u0026#34;,\u0026#34;species_taxid\u0026#34;:\u0026#34;taxon:887881 [Pseudocrossocheilus tridentis]@species\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:887881 [Pseudocrossocheilus tridentis]@species\u0026#34;} ccctgtcaaaaagcatcaaatatatataataaattagcaatgacaaggggaggcaagtcg taacacggtaagtgt \u0026gt;NC_045904_sub[919..997] {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Eospalax fontanierii mitochondrion, complete genome.\u0026#34;,\u0026#34;direction\u0026#34;:\u0026#34;forward\u0026#34;,\u0026#34;forward_error\u0026#34;:1,\u0026#34;forward_match\u0026#34;:\u0026#34;acaccgcccgtcgctctc\u0026#34;,\u0026#34;forward_primer\u0026#34;:\u0026#34;ACACCGCCCGTCACTCTC\u0026#34;,\u0026#34;ori_taxid\u0026#34;:\u0026#34;taxon:146134 [Eospalax fontanierii]@species\u0026#34;,\u0026#34;reverse_error\u0026#34;:4,\u0026#34;reverse_match\u0026#34;:\u0026#34;ccaagcacactttccagt\u0026#34;,\u0026#34;reverse_primer\u0026#34;:\u0026#34;CCAAGTGCACCTTCCGGT\u0026#34;,\u0026#34;scientific_name\u0026#34;:\u0026#34;mitochondrion Eospalax fontanierii\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Eospalax fontanierii\u0026#34;,\u0026#34;species_taxid\u0026#34;:\u0026#34;taxon:146134 [Eospalax fontanierii]@species\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:146134 [Eospalax fontanierii]@species\u0026#34;} To be able to process the fasta file with R and produce some statistics describing the conservation of barcodes between taxa and the ability of the barcode to discriminate between taxa, we need to convert the fasta file to CSV format. This can be done with the command obicsv . 
The command, when run with the --auto option, will automatically identify all tags present in the annotations of the first few records and create a CSV file with the corresponding columns.\nobicsv --auto -s -i Teleostei_11.fasta \u0026gt; Teleostei_11.csv It is now possible to view the first few lines of the generated CSV file using a combination of the head and csvlook commands.\nhead Teleostei_11.csv | csvlook | id | count | direction | forward_error | forward_match | forward_primer | ori_taxid | reverse_error | reverse_match | reverse_primer | scientific_name | species_name | species_taxid | taxid | sequence | | ------------------------- | ----- | --------- | ------------- | ------------------ | ------------------ | ---------------------------------------------------- | ------------- | ------------------ | ------------------ | ------------------------------------------------------ | ----------------------------- | ---------------------------------------------------- | ---------------------------------------------------- | ------------------------------------------------------------------------------- | | NC_022183_sub[925..998] | True | forward | True | acaccgcccgtcaccctc | ACACCGCCCGTCACTCTC | taxon:356810 [Acrossocheilus hemispinus]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Acrossocheilus hemispinus | Acrossocheilus hemispinus | taxon:356810 [Acrossocheilus hemispinus]@species | taxon:356810 [Acrossocheilus hemispinus]@species | cccgtcaaaatacaccaaaaatacttaatacaataacactaacaaggggaggcaagtcgtaacatggtaagtgt | | NC_018560_sub[916..988] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:8154 [Astatotilapia calliptera]@species | 1 | ccaagtacaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Astatotilapia calliptera (eastern happy) | Astatotilapia calliptera | taxon:8154 [Astatotilapia calliptera]@species | taxon:8154 [Astatotilapia calliptera]@species | 
cccaagccaacaacatcctataaataatacattttaccggtaaaggggaggcaagtcgtaacatggtaagtgt | | NC_056117_sub[923..997] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:887881 [Pseudocrossocheilus tridentis]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Pseudocrossocheilus tridentis | Pseudocrossocheilus tridentis | taxon:887881 [Pseudocrossocheilus tridentis]@species | taxon:887881 [Pseudocrossocheilus tridentis]@species | ccctgtcaaaaagcatcaaatatatataataaattagcaatgacaaggggaggcaagtcgtaacacggtaagtgt | | NC_045904_sub[919..997] | True | forward | True | acaccgcccgtcgctctc | ACACCGCCCGTCACTCTC | taxon:146134 [Eospalax fontanierii]@species | 4 | ccaagcacactttccagt | CCAAGTGCACCTTCCGGT | mitochondrion Eospalax fontanierii | Eospalax fontanierii | taxon:146134 [Eospalax fontanierii]@species | taxon:146134 [Eospalax fontanierii]@species | ctcaagtacataaacttggatatattcttaataacccaacaaaaatattagaggagataagtcgtaacaaggtaagcat | | NC_018546_sub[916..987] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:30732 [Oryzias melastigma]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Oryzias melastigma (Indian medaka) | Oryzias melastigma | taxon:30732 [Oryzias melastigma]@species | taxon:30732 [Oryzias melastigma]@species | cccgacccattttaaaaattaaataaaagatttcaggaactaaggggaggcaagtcgtaacatggtaagtgt | | NC_044151_sub[922..993] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:2597641 [Sicyopterus squamosissimus]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Sicyopterus squamosissimus (cling goby) | Sicyopterus squamosissimus | taxon:2597641 [Sicyopterus squamosissimus]@species | taxon:2597641 [Sicyopterus squamosissimus]@species | cccaaaacaaacacacacataaataagaaaaaatgaaaataaaggggaggcaagtcgtaacatggtaagtgt | | NC_044152_sub[922..994] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:2597642 [Sicyopterus stiphodonoides]@species | 0 | 
ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Sicyopterus stiphodonoides (cling goby) | Sicyopterus stiphodonoides | taxon:2597642 [Sicyopterus stiphodonoides]@species | taxon:2597642 [Sicyopterus stiphodonoides]@species | cccaaaacaaacacacacataaataagaaaaaantgaaaataaaggggaggcaagtcgtaacatggtaagtgt | | NC_026976_sub[1453..1531] | True | forward | True | acaccgcccgtcactccc | ACACCGCCCGTCACTCTC | taxon:9545 [Macaca nemestrina]@species | 1 | ccaagtgcaccttccagt | CCAAGTGCACCTTCCGGT | mitochondrion Macaca nemestrina (pig-tailed macaque) | Macaca nemestrina | taxon:9545 [Macaca nemestrina]@species | taxon:9545 [Macaca nemestrina]@species | ctcaaatatatttaaggaacatcttaactaaacgccctaatatttatatagaggggataagtcgtaacatggtaagtgt | | NC_031553_sub[921..995] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:643337 [Puntioplites proctozystron]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Puntioplites proctozystron | Puntioplites proctozystron | taxon:643337 [Puntioplites proctozystron]@species | taxon:643337 [Puntioplites proctozystron]@species | ccctgtcaaaacgcactaaaaatatctaatacaaaagcaccgacaaggggaggcaagtcgtaacacggtaagtgt | References # Riaz,\u0026#32; Shehzad,\u0026#32; Viari,\u0026#32; Pompanon,\u0026#32; Taberlet\u0026#32;\u0026amp;\u0026#32;Coissac (2011) Riaz,\u0026#32; T.,\u0026#32; Shehzad,\u0026#32; W.,\u0026#32; Viari,\u0026#32; A.,\u0026#32; Pompanon,\u0026#32; F.,\u0026#32; Taberlet,\u0026#32; P.\u0026#32;\u0026amp;\u0026#32;Coissac,\u0026#32; E. \u0026#32; (2011). \u0026#32;ecoPrimers: inference of new DNA barcode markers from whole genome sequence analysis. Nucleic acids research,\u0026#32;39(21).\u0026#32;e145. 
https://doi.org/10.1093/nar/gkr732 "},{"id":3,"href":"/obidoc/docs/file_format/taxonomy_file/ncbi_taxdump/","title":"NCBI taxdump","section":"Taxonomy file formats","content":" The NCBI taxonomy dump # The NCBI provides a taxonomy that is used as a reference taxonomy for all molecular data published by NCBI, EBI and DDBJ. This taxonomy is available via a web interface, but can also be downloaded from the NCBI FTP server.\nThe NCBI taxonomy can be used by OBITools4 by downloading the taxdump from the NCBI FTP server. The file is a gzipped tarball archive containing the following files required by OBITools4:\nnodes.dmp : a tab-separated file containing the taxonomic hierarchy names.dmp : a tab-separated file containing the scientific names of the organisms merged.dmp : a tab-separated file containing the information about reassignment of taxids delnodes.dmp : a tab-separated file containing the information about old taxids today deleted from the taxonomy. Downloading the NCBI taxonomy dump # The obitaxonomy command provides the --download-ncbi option, which downloads a copy of the NCBI taxonomy dump tarball from the NCBI FTP server. By default, the file is downloaded to the current directory with the name ncbitaxo_YYYYMMDD.tgz, where YYYY is the year, MM the month and DD the current date. The filename used to save the tarball can be specified with the --out option, as in the following example:\nobitaxonomy --download-ncbi --out ncbitaxo.tgz Note OBITools4 do not require extracting the downloaded file. The name of the compressed file can be passed directly to any OBITools4 command using the --taxonomy option.\nStructure of the NCBI taxonomy directory # The ncbitaxo.tgz archive can be unpacked using the following bash commands:\nmkdir ncbitaxo cd ncbitaxo tar -zxvf ../ncbitaxo.tgz cd .. The ncbitaxo directory contains all the files provided by NCBI. The readme.txt file describes the content of each file provided. 
Only the files used by OBITools4 are described below.\nThe nodes.dmp file # The nodes.dmp file is a tab-separated file, here is the description of the first columns used by OBITools4:\nField Description tax_id A unique taxonomic identifier composed only of digits (0-9) parent tax_id The taxid of the parent taxon of the current taxon rank The taxonomic rank of the taxon (e.g. species, genus, family, etc.) Here are the first lines of this file:\n1 | 1 | no rank | | 8 | 0 | 1 | 0 | 0 | 0 | 0| 0 | | 2 | 131567 | superkingdom | | 0 | 0 | 11 | 0 | 0 | 0 |0 | 0 | | 6 | 335928 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0| 0 | | 7 | 6 | species | AC | 0 | 1 | 11 | 1 | 0 | 1 | 1| 0 | | 9 | 32199 | species | BA | 0 | 1 | 11 | 1 | 0 | 1 | 1| 0 | | 10 | 1706371 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0| 0 | | 11 | 1707 | species | CG | 0 | 1 | 11 | 1 | 0 | 1 | 1| 0 | effective current name; | 13 | 203488 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0| 0 | | 14 | 13 | species | DT | 0 | 1 | 11 | 1 | 0 | 1 | 1| 0 | | 16 | 32011 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0| 0 | | The names.dmp file # The names.dmp file is a tab-separated file with the following columns:\nField Description tax_id The node identifier associated with this name name_txt The name itself unique name The unique variant of this name if name not unique name class (synonym, common name, \u0026hellip;) Here are the first lines of this file:\n1 | all | | synonym | 1 | root | | scientific name | 2 | Bacteria | Bacteria \u0026lt;bacteria\u0026gt; | scientific name | 2 | bacteria | | blast name | 2 | eubacteria | | genbank common name | 2 | Monera | Monera \u0026lt;bacteria\u0026gt; | in-part | 2 | Procaryotae | Procaryotae \u0026lt;bacteria\u0026gt; | in-part | 2 | Prokaryotae | Prokaryotae \u0026lt;bacteria\u0026gt; | in-part | 2 | Prokaryota | Prokaryota \u0026lt;bacteria\u0026gt; | in-part | 2 | prokaryote | prokaryote \u0026lt;bacteria\u0026gt; | in-part | The merged.dmp file # The merged.dmp file is a tab-separated file with the 
following columns:\nField Description old_tax_id The node identifier which has been merged new_tax_id The node identifier which is result of merging Here are the first lines of this file:\n12 | 74109 | 30 | 29 | 36 | 184914 | 37 | 42 | 46 | 39 | 67 | 32033 | 76 | 155892 | 77 | 74311 | 79 | 74313 | 80 | 155892 | The delnodes.dmp file # The delnodes.dmp file is a tab-separated file with the following columns:\nField Description tax_id The deleted node ID Here are the first lines of this file:\n3025011 | 3025010 | 3025009 | 3025008 | 3025007 | 3025006 | 3025005 | 3025004 | 3025003 | 3025002 | "},{"id":4,"href":"/obidoc/docs/cookbook/local_genbank/","title":"Prepare a local copy of Genbank","section":"Cookbook","content":" Prepare a local copy of Genbank # A local copy of the GenBank database requires a lot of disk space. A whole copy of GenBank stored as compressed fasta files takes up about 1TB of disk space.\nThree bioinformatics centres distribute all publicly available DNA sequences worldwide. They are\nNCBI: distributes GenBank EMBL-EBI: distributes EMBL DDBJ: distributes DDBJ The three centres are associated in an international agreement, the International Nucleotide Sequence Database Collaboration (INSDC). This agreement allows the three centres to share the sequences submitted by biologists. As a result, all sequences are available in the three databases, where they are identified by the same accession number.\nThe content of these databases is available via a web interface, but can also be downloaded to have a local copy. The NCBI and the EMBL-EBI have two different strategies for distributing data. The EMBL-EBI distributes fewer large files, whereas the NCBI platform prefers to distribute many small files. This is why we choose to download the sequences from GenBank here.\nEach of these databases is divided into several taxonomic divisions. 
The main GenBank divisions useful for metabarcoding are:\nbct: Bacteria inv: Invertebrates mam: Mammals phg: Phages pln: Plants pri: Primates rod: Rodents vrl: Viruses vrt: Vertebrates Other divisions exist, but are less useful for metabarcoding (click here for more information).\nDownload GenBank # GenBank is distributed in two main formats: fasta and GenBank. The fasta format has the advantage of being smaller than the GenBank format because the sequence annotations stored in the GenBank format are absent from the fasta format. For metabarcoding, however, the disadvantage is that the fasta format does not contain the sequence taxonomic information stored as a taxon identifier (taxid).\nTo combine the advantages of both formats, you can download the GenBank format and convert it to the fasta format using the obiconvert command. The obiconvert command ensures that taxonomic information is preserved during conversion.\nNetwork interruptions can occur quite frequently while downloading all these files, so there is a risk of the download failing. To solve this problem, here is a make script that downloads the GenBank files and converts them into fasta files. Using make allows the download process to be restarted at the point of failure.\nTo download GenBank, copy the Makefile file to your local computer in the directory where you want to store the GenBank files.\nThe Makefile script must be called Makefile without any extension. Then, execute the following command:\nmake By default, the script downloads the divisions of GenBank listed above. To download one or more specific divisions of GenBank, you can use the GBDIV variable. 
For example, to download only the mam division, enter the following command:\nmake GBDIV=mam To download several divisions like mam and rod, separate the names by a space:\nmake GBDIV=\u0026#34;mam rod\u0026#34; If the download fails, restart the download process by using the make command again, without specifying the GBDIV variable again:\nmake The Makefile will create a directory called Release_###, where ### is the number of the current release. This directory will contain the following files:\n. 📂 Release_264 └── 📂 depends/ │ ├── 📄 gbfiles.d │ ├── 📄 gbfiles.d.full └── 📂 fasta/ │ └── 📂 mam/ │ ├── 📄 gbmam1.fasta.gz │ ├── 📄 gbmam10.fasta.gz │ └── 📄 ... │ └── 📂 rod/ │ ├── 📄 gbrod1.fasta.gz │ └── 📄 ... └── 📂 stamp/ │ ├── 📄 gbmam1.seq.gz.stamp │ ├── 📄 gbmam10.seq.gz.stamp │ ├── 📄 gbrod1.seq.gz.stamp └── 📂 taxonomy/ ├── 📄 citations.dmp ├── 📄 delnodes.dmp ├── 📄 division.dmp ├── 📄 gc.prt ├── 📄 gencode.dmp ├── 📄 images.dmp ├── 📄 merged.dmp ├── 📄 names.dmp ├── 📄 nodes.dmp └── 📄 readme.txt The taxonomy directory contains a copy of the NCBI taxonomy database at the time of download. The fasta directory contains the fasta files sorted by taxonomic division in subdirectories, here mam and rod. The stamp directory allows the Makefile script to restart the download process if it fails, without having to download the whole GenBank database again. To free up space, the stamp directory can be deleted at the end of the download process. The depends directory contains a make script with all the instructions for downloading the GenBank files. It is first created by the Makefile script. It contains instructions for downloading the files that need to be downloaded according to the specified GenBank division. To free up space, the depends directory can be deleted at the end of the download process. 
The tmp directory is used to store the downloaded GenBank files before they are converted into fasta. It does not normally persist after the download process. To free up space, the tmp directory can be deleted at the end of the download process if it persists. The Makefile script for downloading GenBank # 📄 Makefile\nSHELL := /bin/bash FTPNCBI=ftp.ncbi.nlm.nih.gov GBURL=https://$(FTPNCBI)/genbank GBRELEASE_URL=$(GBURL)/GB_Release_Number TAXOURL=https://$(FTPNCBI)/pub/taxonomy/taxdump.tar.gz GBRELEASE:=$(shell curl $(GBRELEASE_URL)) GBDIV_ALL:=$(shell curl -L ${GBURL} \\ | grep -E \u0026#39;gb.+\\.seq\\.gz\u0026#39; \\ | sed -E \u0026#39;s@^.*\u0026lt;a href=\u0026#34;gb([^0-9]+)[0-9]+\\.seq.gz.*$$@\\1@\u0026#39; \\ | sort \\ | uniq) GBDIV=bct inv mam phg pln pri rod vrl vrt DIRECTORIES=fasta fasta_fgs GBFILE_ALL:=$(shell curl -L ${GBURL} \\ | grep -E \u0026#34;gb($$(tr \u0026#39; \u0026#39; \u0026#39;|\u0026#39; \u0026lt;\u0026lt;\u0026lt; \u0026#34;${GBDIV}\u0026#34;))[0-9]+\u0026#34; \\ | sed -E \u0026#39;s@^\u0026lt;a href=\u0026#34;(gb.+.seq.gz)\u0026#34;\u0026gt;.*$$@\\1@\u0026#39;) SUFFIXES += .d NODEPS:=clean taxonomy DEPFILES:=$(wildcard Release_$(GBRELEASE)/depends/*.d) ifeq (0, $(words $(findstring $(MAKECMDGOALS), $(NODEPS)))) #Chances are, these files don\u0026#39;t exist. 
GMake will create them and #clean up automatically afterwards -include $(DEPFILES) endif all: depends directories FORCE @make downloads downloads: taxonomy fasta_files @echo Genbank Release number $(GBRELEASE) @echo all divisions : $(GBDIV_ALL) FORCE: .PHONY: all directories depends taxonomy fasta_files FORCE depends: directories Release_$(GBRELEASE)/depends/gbfiles.d Makefile division: $(GBDIV) taxonomy: directories Release_$(GBRELEASE)/taxonomy directories: Release_$(GBRELEASE)/fasta Release_$(GBRELEASE)/stamp Release_$(GBRELEASE)/tmp Release_$(GBRELEASE): @mkdir -p $@ @echo Create $@ directory Release_$(GBRELEASE)/fasta: Release_$(GBRELEASE) @mkdir -p $@ @echo Create $@ directory Release_$(GBRELEASE)/stamp: Release_$(GBRELEASE) @mkdir -p $@ @echo Create $@ directory Release_$(GBRELEASE)/tmp: Release_$(GBRELEASE) @mkdir -p $@ @echo Create $@ directory Release_$(GBRELEASE)/depends/gbfiles.d: Makefile @echo Create depends directory @mkdir -p Release_$(GBRELEASE)/depends @for f in ${GBFILE_ALL} ; do \\ echo -e \u0026#34;Release_$(GBRELEASE)/stamp/$$f.stamp:\u0026#34; ; \\ echo -e \u0026#34;\\t@echo Downloading file : $$f...\u0026#34; ; \\ echo -e \u0026#34;\\t@mkdir -p Release_$(GBRELEASE)/tmp\u0026#34; ; \\ echo -e \u0026#34;\\t@mkdir -p Release_$(GBRELEASE)/stamp\u0026#34; ; \\ echo -e \u0026#34;\\t@curl -L ${GBURL}/$$f \u0026gt; Release_$(GBRELEASE)/tmp/$$f \u0026amp;\u0026amp; touch \\$$@\u0026#34; ; \\ echo ; \\ div=$$(sed -E \u0026#39;s@^gb(...).*$$@\\1@\u0026#39; \u0026lt;\u0026lt;\u0026lt; $$f) ; \\ fasta=\u0026#34;Release_$(GBRELEASE)/fasta/$$div/$${f/.seq.gz/.fasta.gz}\u0026#34; ; \\ fasta_fgs=\u0026#34;Release_$(GBRELEASE)/fasta_fgs/$$div/$${f/.seq.gz/.fasta.gz}\u0026#34; ; \\ fasta_files=\u0026#34;$$fasta_files $$fasta\u0026#34; ; \\ fasta_fgs_files=\u0026#34;$$fasta_fgs_files $$fasta_fgs\u0026#34; ; \\ echo -e \u0026#34;$$fasta: Release_$(GBRELEASE)/stamp/$$f.stamp\u0026#34; ; \\ echo -e \u0026#34;\\t@echo converting file : \\$$\u0026lt; in 
fasta\u0026#34; ; \\ echo -e \u0026#34;\\t@mkdir -p Release_$(GBRELEASE)/fasta/$$div\u0026#34; ; \\ echo -e \u0026#34;\\t@obiconvert -Z --fasta-output --skip-empty \\\\\u0026#34; ; \\ echo -e \u0026#34;\\t Release_$(GBRELEASE)/tmp/$$f \u0026gt; Release_$(GBRELEASE)/tmp/$${f/.seq.gz/.fasta.gz} \\\\\u0026#34; ; \\ echo -e \u0026#34;\\t \u0026amp;\u0026amp; mv Release_$(GBRELEASE)/tmp/$${f/.seq.gz/.fasta.gz} \\$$@ \\\\\u0026#34; ; \\ echo -e \u0026#34;\\t \u0026amp;\u0026amp; rm -f Release_$(GBRELEASE)/tmp/$$f \\\\\u0026#34; ; \\ echo -e \u0026#34;\\t || rm -f \\$$@\u0026#34; ; \\ echo -e \u0026#34;\\t@echo conversion of $$@ done.\u0026#34; ; \\ echo ; \\ done \u0026gt; $@ ; \\ echo \u0026gt;\u0026gt; $@ ; \\ echo \u0026#34;fasta_files: $$fasta_files\u0026#34; \u0026gt;\u0026gt; $@ ; Release_$(GBRELEASE)/taxonomy: mkdir -p $@ curl -iL $(TAXOURL) \\ | tar -C $@ -zxf - "},{"id":5,"href":"/obidoc/formats/csv/","title":"The CSV format","section":"File formats","content":" The CSV (Comma-Separated Values) flat file format # The CSV format is a simple text format that is widely used to exchange tabular data between spreadsheets, databases, and other data storage systems. As its name indicates, the values in a row are separated by commas.\nEach line of the file corresponds to a record, and all records have the same number of fields. The first row of the file is a header row that contains the field names. 
A field can itself contain the field delimiter (the comma), provided that the field is enclosed in double quotation marks (\u0026quot;).\nHere is an example with two sequences in a fasta file:\n📄 two_sequences.fasta \u0026gt;AB061527 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Sorex unguiculatus mitochondrial NA, complete genome.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Soricidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9376,\u0026#34;genus_name\u0026#34;:\u0026#34;Sorex\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9379,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;family\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:2.2137847111025621e-13,\u0026#34;species_name\u0026#34;:\u0026#34;Sorex unguiculatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:62275,\u0026#34;taxid\u0026#34;:62275} ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat agcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;AL355887 {\u0026#34;count\u0026#34;:2,\u0026#34;definition\u0026#34;:\u0026#34;Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Hominidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9604,\u0026#34;genus_name\u0026#34;:\u0026#34;Homo\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9605,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:0,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;species_taxid\u0026#34;:9606,\u0026#34;taxid\u0026#34;:9606} ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac agcttaaaactcaaaggacctggcagttctttatatccct The following command counts the records and writes the result in CSV format:\nobicount two_sequences.fasta entities,n variants,2 reads,3 symbols,200 In a prettier presentation:\nobicount two_sequences.fasta | csvlook | entities | n | | -------- | --- | | variants | 2 | | reads | 3 | | symbols | 200 | The CSV 
format of the result of obicount is an easy way to make plots with uplot:\nobicount two_sequences.fasta \\ | uplot barplot -d, -H --xscale log10 n ┌ ┐ variants ┤■■■■ 2.0 reads ┤■■■■■■■ 3.0 symbols ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 200.0 └ ┘ [log10] "},{"id":6,"href":"/obidoc/docs/commands/alignments/obipairing/fasta-like/","title":"The FASTA-like alignment","section":"obipairing","content":" The FASTA-like first step of alignment # The FASTA algorithm (Lipman \u0026amp; Pearson, 1985) can be considered as the ancestor of BLAST (Altschul, Gish \u0026amp; al., 1990). It has the advantage of being easy to implement. It primarily calculates the best shift to apply between the two sequences under consideration to minimize the Hamming distance (number of differences) between them. This alignment algorithm is used in obipairing to determine the position and the size of the overlapping region of paired-end reads. 
These two criteria will guide the exact alignment method for the subsequent step, and determine the segments of the reads to align.\nThe FASTA-like algorithm builds a table of 4mers (DNA word of length 4) with their positions for the forward and reverse reads.\nTo illustrate, let us consider two short reads, A: ACGTTAGCTAGCTAGCTAA and B: CGCTAGCTAGCTAATTTGG, each of 19 nucleotides, with positions indexed from 00 to 18:\nThe sequences are both composed of \\(19 - 4 + 1 = 16\\) overlapping 4mers. An illustration of the overlapping 4mers is shown below for sequence A:\n0000000000111111111 0123456789012345678 ACGTTAGCTAGCTAGCTAA ACGT AGCT GCTA CTAA CGTT GCTA CTAG GTTA CTAG TAGC TTAG TAGC AGCT TAGC AGCT GCTA The 4mer indices of sequences A and B can then be compared as follows:\ngraph LR %%{init: {'flowchart': {'nodeSpacing': 10, 'rankSpacing': 30, 'htmlLabels': true}} }%% subgraph Sequence_A A_ACGT:::list@{ shape: hex, label: \"00\"} A_AGCT:::list@{ shape: hex, label: \"05, 09, 13\"} A_CGTT:::list@{ shape: hex, label: \"01\"} A_CTAA:::list@{ shape: hex, label: \"15\"} A_CTAG:::list@{ shape: hex, label: \"07, 11\"} A_GCTA:::list@{ shape: hex, label: \"06, 12, 14\"} A_GTTA:::list@{ shape: hex, label: \"02\"} A_TAGC:::list@{ shape: hex, label: \"04, 08, 12\"} A_TTAG:::list@{ shape: hex, label: \"03\"} end subgraph Sequence_B B_AATT:::list@{shape: hex, label: \"12\"} B_AGCT:::list@{shape: hex, label: \"04, 08\"} B_CAGC:::list@{shape: hex, label: \"02\"} B_CGCT:::list@{shape: hex, label: \"00\"} B_CTAA:::list@{shape: hex, label: \"10\"} B_GAGC:::list@{shape: hex, label: \"01\"} B_GCTA:::list@{shape: hex, label: \"05, 09\"} B_GGCT:::list@{shape: hex, label: \"18\"} B_TAAT:::list@{shape: hex, label: \"13\"} B_TAGC:::list@{shape: hex, label: \"03\"} B_TTGG:::list@{shape: hex, label: \"16\"} B_TTTC:::list@{shape: hex, label: \"15\"} end AATT:::red --\u003e B_AATT A_ACGT --\u003e ACGT:::red A_AGCT --\u003e AGCT:::green --\u003e B_AGCT CAGC:::red --\u003e B_CAGC CGCT:::red 
--\u003e B_CGCT A_CGTT --\u003e CGTT:::red A_CTAA --\u003e CTAA:::green --\u003e B_CTAA A_CTAG --\u003e CTAG:::red GAGC:::red --\u003e B_GAGC A_GCTA --\u003e GCTA:::green --\u003e B_GCTA GGCT:::red --\u003e B_GGCT A_GTTA --\u003e GTTA:::red TAAT:::red --\u003e B_TAAT A_TAGC --\u003e TAGC:::green --\u003e B_TAGC A_TTAG --\u003e TTAG:::red TTGG:::red --\u003e B_TTGG TTTC:::red --\u003e B_TTTC classDef green fill:#9f6, width:80px, height:50px, font-size:14px, align-items:center, text-align:center classDef red fill:#FF8080, width:80px, height:50px, font-size:14px, align-items:center classDef list stroke:#333, stroke-width:2px, width:80px, height:30px, font-size:14px The diagram above indicates that the two sequences share four 4mers, highlighted in green.\nNext, the algorithm computes \\(\\Delta = Pos_A - Pos_B\\) for each 4mer shared between sequences A and B. If a 4mer occurs more than once, all the combinations of the differences are considered for that 4mer.\n4mer positions on A positions on B \\(\\Delta\\) AGCT 05, 09, 13 04, 08 1, -3, 5, 1, 9, 5 CTAA 15 10 5 GCTA 06, 12, 14 05, 09 1, -3, 7, 3, 9, 5 TAGC 04, 08, 12 03 1, 5, 9 Then, for each \\(\\Delta\\) value, the algorithm computes its frequency, as well as its relative score (RelScore), expressed as the frequency of the \\(\\Delta\\) value normalized by the number of 4mers involved in the overlap. This makes it possible to identify the best \\(\\Delta\\) value, i.e. the one having the highest RelScore\n\\[ RelScore = \\frac{Frequency}{length(overlap) - (4-1)} \\] \\(\\Delta\\) Frequency RelScore = Frequency/(Overlap - 3) 5 5 0.455 9 3 0.428 1 4 0.267 -3 2 0.154 7 1 0.111 3 1 0.077 The table above is the equivalent of a DNA Dot Plot. A \\(\\Delta\\) value corresponds to one diagonal in the dot plot. By default, obipairing considers the diagonal with the highest RelScore as the optimal alignment, i.e. 
the best shift-only alignment (no insertions or deletions allowed in the overlap).\nIf the --fasta-absolute option is used, the best \\(\\Delta\\) (or diagonal) is the one exhibiting the highest Frequency.\nShared 4mer positions: this plot is a thresholded DNA dot plot of both sequences. Each point corresponds to a shared 4mer and is located at its respective positions on sequences A \u0026amp; B. The red dotted line indicates the diagonal with the greatest number of dots, here representing five overlapping 4mers and corresponding to a positional difference of 5 between sequences.\nIn this example, the diagonal having the largest RelScore (0.455) corresponds to a \\(\\Delta\\) value of 5, which was observed five times. Thus, the similarity between sequences A and B is maximized by shifting B by 5 positions relative to A.\nSequence A: ACGTTAGCTAGCTAGCTAA----- Sequence B: -----CGCTAGCTAGCTAATTTGG Overlap : .+++++++++++++ This delimits the overlapping region between the two reads.\nReferences # Altschul, Gish, Miller, Myers \u0026amp; Lipman (1990) Altschul, S., Gish, W., Miller, W., Myers, E. \u0026amp; Lipman, D. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3). 403–410. https://doi.org/10.1006/jmbi.1990.9999 Lipman \u0026amp; Pearson (1985) Lipman, D. \u0026amp; Pearson, W. (1985). Rapid and sensitive protein similarity searches. 
Science, 227(4693). 1435–1441. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/2983426 "},{"id":7,"href":"/obidoc/docs/commands/alignments/obipairing/exact-alignment/","title":"Exact alignment","section":"obipairing","content":" Exact alignment of paired reads # The obipairing command uses an exact alignment algorithm based on dynamic programming to finalize the alignment between forward and reverse reads (the reverse read being reverse-complemented before being aligned). It corresponds to a semi-global (end-gap free) alignment algorithm, but an asymmetric one, in the sense that gaps are not penalized in the same way at both ends of the reads (see the dedicated section below).\nScoring system # The alignment algorithm used by obipairing relies on the following scoring principles:\nMatch: A positive score ( \\( score \u003e 0 \\) ) is assigned when two nucleotides at the same position in the alignment are identical. Thus, the accumulation of matches during the alignment process will increase the alignment score. Mismatch: A negative score ( \\( score \u003c 0 \\) ) is assigned when two nucleotides at the same position in the alignment are different. Thus, the accumulation of mismatches during the alignment process will decrease the alignment score. Gap: A negative score ( \\( score \u003c 0 \\) ) is assigned for a nucleotide insertion or deletion in one of the two reads. Thus, the accumulation of insertions or deletions during the alignment process will decrease the alignment score. The numerical values of these scores are based on the sequencing quality scores \\(Q_F\\) and \\(Q_R\\) of the nucleotides from the forward and reverse reads, respectively. The quality score \\(Q\\) represents the likelihood of a base-calling error, i.e. 
that the sequencer assigned (or \u0026ldquo;called\u0026rdquo;) an incorrect base for a given nucleotide, or more exactly, the probability \\(P(X\\, \\text{is unknown}) \\) that the base called, X, is actually unknown:\n\\[ P(X\\, \\text{is unknown}) = 10^{-\\frac{Q}{10}} \\] If the base called by the sequencer is X at position i, where X is one of the nucleotides A, C, G, or T, with a quality score Q, the probability \\(P(truth = X) \\) that the base called by the sequencer, X, is actually X is:\n\\[P(truth = X) = 1 - 10^{-\\frac{Q}{10}}\\] The complementary probability corresponds to the case where the nucleotide corresponding to X is unknown. Assuming equal probability of each of the four nucleotides (A, C, G, T), the probability that the base called, X, is in fact Y, one of the four possible nucleotides, is:\n\\[P(Y | Obs(X)) = \\frac{10^{-\\frac{Q}{10}}}{4} \\] Estimating the probability of a true match # Let\u0026rsquo;s then define the natural logarithm of the base-calling uncertainty, so that \\(e^{q} = 10^{-\\frac{Q}{10}}\\), where \\(Q_F\\) and \\(Q_R\\) are the forward and reverse sequencing quality scores, respectively:\n\\[ \\begin{aligned} q_F \u0026= -\\frac{Q_F}{10} \\cdot \\log(10) \\\\ q_R \u0026= -\\frac{Q_R}{10} \\cdot \\log(10) \\end{aligned} \\] When a match is observed # If an alignment between a forward and a reverse read shows two X nucleotides at the same position, it could represent either a genuine match, or a true mismatch.\nThere are three reasons why it could be a true match:\nFor both reads, the two base calls, X, are actually the nucleotide X. \\[P(case\\,1) = (1-e^{q_F}) (1-e^{q_R})\\] . For one of the reads, the base call, X, is correct, while for the other read the base call is unreliable but the true nucleotide nevertheless happens to be X. \\[P(case\\,2) = (1-e^{q_F})\\frac{e^{q_R}}{4}+ (1-e^{q_R})\\frac{e^{q_F}}{4}\\] . For both reads, the base calls are unreliable, but both true nucleotides happen to be the same. \\[P(case\\,3) = \\frac{e^{q_F+q_R}}{4} \\] . 
When a match is observed, the probability of it being a true match is the sum of the probabilities of these three cases.\n\\[ \\begin{aligned} P(match | Observed(match), QF, QR) \u0026= 1 - \\frac{3}{4}\\left(e^{q_F}+e^{q_R}-e^{q_F+q_R}\\right) \\end{aligned} \\] When a mismatch is observed # Even if a mismatch XY is observed, it can still correspond to a true match. There are three explanations for this:\nThe base call X is actually the nucleotide X, but the base call Y is actually the nucleotide X. \\[P(case\\,1') = \\frac{(1-e^{q_F})e^{q_R}}{4} \\] The base call Y is actually the nucleotide Y, but the base call X is actually the nucleotide Y. \\[P(case\\,2') = \\frac{(1-e^{q_R})e^{q_F}}{4}\\] Neither of the base calls is reliable, but both true nucleotides are the same. \\[P(case\\,3') = \\frac{e^{q_F+q_R}}{4}\\] When a mismatch is observed, the probability that it is in fact a genuine match is thus the sum of the probabilities of these three cases.\n\\[ \\begin{aligned} P(match | Observed(mismatch), QF, QR) \u0026= \\frac{e^{q_F} + e^{q_R} - e^{q_F+q_R}}{4} \\end{aligned} \\] Match and Mismatch Score # At each position of the alignment, obipairing assigns a score (a match score or a mismatch score, depending on the observation) that corresponds to the log odds ratio between the probability of being a true match given the observation and the probability of being a true mismatch given the observation.\n\\[ \\begin{aligned} Score(QF,QR) = \\log \u0026P(match | Observed(match|mismatch), QF, QR) - \\\\ \u0026 \\log P(mismatch | Observed(match|mismatch), QF, QR) \\end{aligned} \\] With:\n\\[ P(mismatch | Observed(match|mismatch), QF, QR) = \\\\ 1 - P(match | Observed(match|mismatch), QF, QR) \\] This compares the probability of a match given the observation to the probability of a mismatch given the same observation. If the match hypothesis is the most likely, the score is positive, otherwise it is negative. The score is equal to zero when both hypotheses are equally probable. 
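The formulas above can be checked numerically with a minimal Python sketch. It is only an illustration under stated assumptions: the function names true_match_probability and pairing_score are hypothetical (they are not part of OBITools4), and the logarithm is assumed to be the natural logarithm.

```python
import math


def true_match_probability(q_forward, q_reverse, observed_match):
    # Hypothetical helper: probability of a true match given the observation.
    # exp(q) with q = -Q/10 * ln(10) is the same as 10^(-Q/10).
    err_f = 10 ** (-q_forward / 10)
    err_r = 10 ** (-q_reverse / 10)
    uncertain = err_f + err_r - err_f * err_r
    if observed_match:
        # Sum of cases 1, 2 and 3 when a match is observed
        return 1 - 0.75 * uncertain
    # Sum of cases 1', 2' and 3' when a mismatch is observed
    return uncertain / 4


def pairing_score(q_forward, q_reverse, observed_match):
    # Log odds ratio: true match versus true mismatch (natural logarithm assumed)
    p_match = true_match_probability(q_forward, q_reverse, observed_match)
    return math.log(p_match) - math.log(1 - p_match)


print(round(pairing_score(40, 40, True), 1))   # 8.8
print(round(pairing_score(40, 40, False), 1))  # -9.9
print(round(pairing_score(0, 30, True), 1))    # -1.1
```

With a quality score of 0 on one read, an observed match and an observed mismatch receive the same score, because the observation then carries no information.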
On Illumina sequencers, the sequencing quality score \\(Q\\) ranges from 0 to 40, with 0 corresponding to complete ambiguity: \\(P_{Error} = 10^{-\\frac{0}{10}}= 1 \\) .\nMatch scores as a function of forward and reverse reads sequencing quality scores. The score is calculated as the log odds ratio of the probability of a true match vs. a true mismatch. When a match is observed in the alignment, if the sequencing quality score is 0 for at least one of the two reads, the probability of a match is 0.25, and the probability of a mismatch is 0.75. The resulting match score is -1.1. In contrast, when both reads have a sequencing quality score of 40, the probability of a match approaches 1, while the probability of a mismatch approaches 1.5/10,000, resulting in a match score of 8.8.\nMismatch scores as a function of forward and reverse reads sequencing quality scores. The score is calculated as the log odds ratio of the probability of a true match vs. a true mismatch. When a mismatch is observed in the alignment, if the sequencing quality score is 0 for at least one of the two reads, the probability of a match is 0.25, and the probability of a mismatch is 0.75. The resulting mismatch score is therefore equal to the match score: -1.1. In contrast, when both reads have a sequencing quality score of 40, the probability of a mismatch is close to 1, while the probability of a match approaches 5/100,000, resulting in a mismatch score of -9.9.\nGap Penalty # The gap penalty is a negative value applied to the alignment score when a gap is inserted during the alignment process. obipairing estimates this penalty as the score of a mismatch between two nucleotides with a quality score of 40, scaled by a constant value (GapWeight). 
This gap weight, which is set to 2 by default, can be modified with the --gap-penalty option of the obipairing command.\n\\[ \\begin{aligned} Gap\\,Penalty = Score(40,40 | Observed(mismatch)) \\times GapWeight \\end{aligned} \\] Left and right alignments # When aligning paired reads, two scenarios arise depending on the amplicon and read lengths:\nThe amplicon is longer than the read. In that case, the forward and reverse reads must be aligned to reconstruct the full barcode. Aligning paired reads when the amplicon is longer than the read. When aligning two paired reads (arrows) to reconstruct a full-length amplicon that is longer than the reads (blue rectangle), the alignment is done in the reads 3\u0026rsquo; ends (green rectangle), leaving two flanking, unaligned regions (red rectangles) located in the reads 5\u0026rsquo; ends.\nThe amplicon is shorter than the read. In that case, each read contains the entire amplicon. Aligning paired reads when the amplicon is shorter than the read. When aligning two paired reads (arrows) to reconstruct a full-length amplicon that is shorter than the reads (blue rectangle), the alignment is done in the reads 5’ ends (green rectangle), leaving two flanking, unaligned regions (red rectangles) located in the reads 3’ ends.\nFor both scenarios, an exact alignment algorithm based on dynamic programming will consider the red regions as gaps. However, when matching paired-end reads, these gaps should not penalize the alignment score. The solution for finding the overlap between paired-end reads is to use a semi-global (i.e. end-gap free) alignment, but an asymmetric one, since, depending on the scenario, the gap penalty should be applied at only one of the two ends of the reads. In the first scenario, the gap penalty should not apply at the reads 5\u0026rsquo; ends but should apply at the reads 3\u0026rsquo; ends. 
In the second scenario, gap penalties should apply the other way round. In both cases, gap insertions within the overlapping region (green part) should be penalized.\nLeft Alignment # The left alignment mode corresponds to the first scenario, where the amplicon is longer than the reads. It does not penalize the alignment score for gaps introduced at the reads 5\u0026rsquo; ends, but does penalize the alignment score for gaps introduced in the reads 3\u0026rsquo; ends or within the paired-end reads overlapping region. It is referred to as a left alignment because the forward read is shifted to the left compared to the reverse read.\nRight Alignment # The right alignment mode corresponds to the second scenario, where the amplicon is shorter than the reads. It penalizes the alignment score for gaps introduced in the reads 5\u0026rsquo; ends and within the reads overlapping region, but does not penalize the alignment score for gaps introduced in the reads 3\u0026rsquo; ends. It is referred to as a right alignment because the forward read is shifted to the right compared to the reverse read.\nChoosing the alignment method # The default behavior of obipairing is to choose between the left or right alignment mode depending on the result of a draft alignment obtained with the FASTA-derived algorithm. If the FASTA-derived algorithm indicates that forward reads must be shifted to the left, the left alignment mode is used, otherwise the right alignment mode is used. If the FASTA-derived algorithm is not run (by setting the --exact-mode option), both left and right alignments are run and the one that gives the best alignment score is selected.\nWhich part of the read is aligned when using the exact alignment algorithm? # The default behavior of obipairing is to rely on the FASTA-derived algorithm to decide which part of the reads to align. 
The FASTA-derived algorithm heuristically estimates the best shift to apply, and therefore the position of the overlapping region. The exact alignment algorithm is then applied to that overlapping region extended by \\(\\Delta\\) nucleotides on each side. By default, the value of \\(\\Delta\\) is set to 5. The user can change this value by setting the --delta option. Extending the overlapping region in this way makes the exact alignment algorithm less sensitive to the approximation made by the FASTA-derived algorithm when determining the overlapping region.\nIn the exact mode (option --exact-mode), the exact alignment algorithm is applied to the full pair of reads, increasing calculation time.\n"},{"id":8,"href":"/obidoc/obitools/obiannotate/","title":"obiannotate","section":"Basics","content":" obiannotate: edit sequence annotations # Description # obiannotate is a tool for editing the sequence records of a dataset. It allows you to add, delete or modify annotations of sequence records, as well as edit the identifier, definition or sequence itself.\nThere are two particularly important groups of options in obiannotate . The first group is shared with obigrep and enables the selection of sequences. The second group specifies the changes to be made to the sequence records. In obigrep , the selection options determine which sequences the program will retain in its output. In contrast, every sequence in the input dataset is included in the result produced by obiannotate ; however, only the sequences selected by the selection options are modified according to the editing options. Non-selected sequences are transferred to the result without modification.\nThe selection options # The edition options # Edition of the annotations # OBITools4 stores the annotations attached to each sequence using a tag/value mechanism. The annotation of a sequence is a set of tags, each of them being associated with a value. 
Therefore, annotating a sequence means changing this set of tags by adding new tags, deleting others, or changing the value associated with a tag.\nAdding annotations # To add a new tag/value pair to a sequence, obiannotate provides the generic option --set-tag. Considering the following file:\n📄 empty.fasta \u0026gt;seqA1 cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 tagctagctagctagctagctagctagcta \u0026gt;seqA2 gtagctagctagctagctagctagctaga \u0026gt;seqC1 cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 cgatggctccatgctagtgctagtcgatga To add a foo tag associated with the numeric value 3 to each sequence, the command is:\nobiannotate --set-tag foo=3 empty.fasta \u0026gt;seqA1 {\u0026#34;foo\u0026#34;:3} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;foo\u0026#34;:3} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;foo\u0026#34;:3} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;foo\u0026#34;:3} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;foo\u0026#34;:3} cgatggctccatgctagtgctagtcgatga The argument of the --set-tag option, foo=3, can be decomposed into two parts separated by the equal sign. The left part, foo, is the name of the target tag; the right part is the value to assign to the tag.\nThe left part must be a string, whereas the right part is an expression written in the OBITools4 expression language. Here the expression is simply 3, which evaluates to the integer value 3.\nTo assign a string value to a tag, the right part of the option argument must be a valid OBITools4 expression that evaluates to a string: \u0026quot;bar\u0026quot;, with double quotes flanking the text to be assigned. 
But to prevent the Bash UNIX shell from interpreting the option parameter foo=\u0026quot;bar\u0026quot; itself, the whole parameter has to be protected by single quotes.\nobiannotate --set-tag \u0026#39;foo=\u0026#34;bar\u0026#34;\u0026#39; empty.fasta \u0026gt;seqA1 {\u0026#34;foo\u0026#34;:\u0026#34;bar\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;foo\u0026#34;:\u0026#34;bar\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;foo\u0026#34;:\u0026#34;bar\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;foo\u0026#34;:\u0026#34;bar\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;foo\u0026#34;:\u0026#34;bar\u0026#34;} cgatggctccatgctagtgctagtcgatga As the right part is an expression, it can be more complex and perform some basic computations. In the next example, the foo tag is set to the sequence identifier prefixed by \u0026quot;bar-\u0026quot;.\nobiannotate --set-tag \u0026#39;foo=\u0026#34;bar-\u0026#34; + sequence.Id()\u0026#39; empty.fasta \u0026gt;seqA1 {\u0026#34;foo\u0026#34;:\u0026#34;bar-seqA1\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;foo\u0026#34;:\u0026#34;bar-seqB1\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;foo\u0026#34;:\u0026#34;bar-seqA2\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;foo\u0026#34;:\u0026#34;bar-seqC1\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;foo\u0026#34;:\u0026#34;bar-seqB2\u0026#34;} cgatggctccatgctagtgctagtcgatga The complete description of the OBITools4 expression language is available here.\nAll the previous examples tag every sequence in the same way, but you can also use obiannotate to modify the annotations of only a subset of the sequences. 
As explained in the introduction of this documentation, this is achieved by combining selection and edition options.\nFor instance, adding a foo tag only to the sequence with the id seqA2 is achieved by combining the selection option -I seqA2 and the edition option --set-tag 'foo=\u0026quot;bar\u0026quot;'\nobiannotate -I seqA2 --set-tag \u0026#39;foo=\u0026#34;bar\u0026#34;\u0026#39; empty.fasta \u0026gt;seqA1 cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;foo\u0026#34;:\u0026#34;bar\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 cgatggctccatgctagtgctagtcgatga Used with obigrep , the -I seqA2 option would have selected only the modified sequence.\nobigrep -I seqA2 empty.fasta \u0026gt;seqA2 gtagctagctagctagctagctagctaga Since the selection options are shared between obiannotate and obigrep , a good way to check which sequences will be modified by obiannotate is to first test the selection options with obigrep . Only the sequences present in the obigrep output will be edited by obiannotate .\nobigrep -l 30 empty.fasta \u0026gt;seqB1 tagctagctagctagctagctagctagcta \u0026gt;seqB2 cgatggctccatgctagtgctagtcgatga obiannotate -l 30 \\ --set-tag \u0026#39;foo=\u0026#34;bar-\u0026#34; + sequence.Id()\u0026#39; \\ empty.fasta \u0026gt;seqA1 cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;foo\u0026#34;:\u0026#34;bar-seqB1\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 gtagctagctagctagctagctagctaga \u0026gt;seqC1 cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;foo\u0026#34;:\u0026#34;bar-seqB2\u0026#34;} cgatggctccatgctagtgctagtcgatga Renaming tags # Renaming tags can be useful for accounting for changes in a pipeline, adapting old datasets to new scripts or saving annotations produced by an OBITools command before rerunning it with different parameters. 
Consider the following fasta file:\n📄 five_tags.fasta\n\u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:63221 [Homo sapiens neanderthalensis]@subspecies\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tata\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;count\u0026#34;:25,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;} cgatggctccatgctagtgctagtcgatga If you want to keep the taxonomic annotations as a reference before running the obitag command to produce a new one and then be able to compare the new one to the old one later, you can rename the taxid tag to ref_taxid and then run the obitag command, which will set a new \u0026rsquo;taxid\u0026rsquo; tag.\nobiannotate --rename-tag ref_taxid=taxid five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;ref_taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;ref_taxid\u0026#34;:\u0026#34;taxon:63221 [Homo sapiens 
neanderthalensis]@subspecies\u0026#34;,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tata\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;ref_taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;ref_taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;count\u0026#34;:25,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;} cgatggctccatgctagtgctagtcgatga Adding a serial number to each sequence # It can be useful to add a serial number to each sequence. This can be done by using the obiannotate command with the --number option. That option adds a new tag named seq_number to each sequence, holding an integer value that is incremented for each sequence.\nobiannotate --number empty.fasta \u0026gt;seqA1 {\u0026#34;seq_number\u0026#34;:1} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;seq_number\u0026#34;:2} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;seq_number\u0026#34;:3} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;seq_number\u0026#34;:4} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;seq_number\u0026#34;:5} cgatggctccatgctagtgctagtcgatga Adding sequence related annotations # \u0026ndash;length\n\u0026ndash;aho-corasick \u0026ndash;pattern \u0026ndash;pattern-name \u0026ndash;pattern-error \u0026ndash;allows-indels\nEdit taxonomy related annotations # \u0026ndash;scientific-name\n\u0026ndash;with-taxon-at-rank \u0026lt;RANK_NAME\u0026gt;\n\u0026ndash;taxonomic-rank\n\u0026ndash;taxonomic-path\n\u0026ndash;raw-taxid\n\u0026ndash;add-lca-in \u0026lt;SLOT_NAME\u0026gt; 
\u0026ndash;lca-error \u0026lt;#.###\u0026gt;\nDeleting annotations # There are three options that allow for deleting annotations associated with a sequence. The easiest one is --clear. It removes every annotation associated with a sequence.\nConsider the following fasta sequence file:\n📄 five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:63221 [Homo sapiens neanderthalensis]@subspecies\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tata\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;count\u0026#34;:25,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;} cgatggctccatgctagtgctagtcgatga The next command removes all the annotations:\nobiannotate --clear five_tags.fasta \u0026gt;seqA1 cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 tagctagctagctagctagctagctagcta \u0026gt;seqA2 gtagctagctagctagctagctagctaga \u0026gt;seqC1 cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 cgatggctccatgctagtgctagtcgatga If you combine a selection option, here -C 10 which selects all the sequences occurring at most ten times, and the --clear option, you will delete annotations only on selected sequences. 
For the other sequences, the annotations are kept.\nobiannotate -C 10 --clear five_tags.fasta \u0026gt;seqA1 cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 tagctagctagctagctagctagctagcta \u0026gt;seqA2 gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;count\u0026#34;:25,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;} cgatggctccatgctagtgctagtcgatga It is possible to delete a given tag based on its name using the --delete-tag option. In the following example the taxid tag is deleted. As the seqB2 sequence does not have a taxid tag, it is not affected.\nobiannotate --delete-tag taxid five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tata\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;count\u0026#34;:25,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;} cgatggctccatgctagtgctagtcgatga Several --delete-tag options can be combined in a single obiannotate command.\nobiannotate --delete-tag taxid \\ --delete-tag count \\ five_tags.fasta \u0026gt;seqA1 {\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 
{\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tata\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;} cgatggctccatgctagtgctagtcgatga The last way to delete annotations is indirect. It is based on the --keep option, which indicates the annotation to be kept. Consequently, all the other tags are deleted.\nobiannotate --keep taxid five_tags.fasta \u0026gt;seqA1 {\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;taxid\u0026#34;:\u0026#34;taxon:63221 [Homo sapiens neanderthalensis]@subspecies\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 cgatggctccatgctagtgctagtcgatga As with --delete-tag, several --keep options can be provided to keep several annotations.\nobiannotate --keep taxid \\ --keep count \\ five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;taxid\u0026#34;:\u0026#34;taxon:63221 [Homo sapiens neanderthalensis]@subspecies\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 
[Hominidae]@family\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;count\u0026#34;:25} cgatggctccatgctagtgctagtcgatga Changing annotation values # Edition of the identifier # The identifier of a sequence can be updated by using the --set-id option. One of the most useful applications of this option is to replace the long id generated by the sequencer with a short one based on a number incremented from sequence to sequence, such as the one generated by the --number option. To achieve this, one can use two piped obiannotate commands. The first adds the seq_number annotation to the sequences, and the second updates the sequence id from this newly added seq_number tag.\nobiannotate --number empty.fasta \\ | obiannotate --set-id \u0026#39;sprintf(\u0026#34;motus_%04d\u0026#34;, annotations.seq_number)\u0026#39; \u0026gt;motus_0001 {\u0026#34;seq_number\u0026#34;:1} cgatgctgcatgctagtgctagtcgat \u0026gt;motus_0002 {\u0026#34;seq_number\u0026#34;:2} tagctagctagctagctagctagctagcta \u0026gt;motus_0003 {\u0026#34;seq_number\u0026#34;:3} gtagctagctagctagctagctagctaga \u0026gt;motus_0004 {\u0026#34;seq_number\u0026#34;:4} cgatgctccatgctagtgctagtcgatga \u0026gt;motus_0005 {\u0026#34;seq_number\u0026#34;:5} cgatggctccatgctagtgctagtcgatga The sprintf function of the OBITools4 expression language is used to format sequence identifiers. It requires a format string, in this case \u0026quot;motus_%04d\u0026quot;, which describes how the new identifier will be generated. Here, %04d is replaced by the second argument of the sprintf() function, annotations.seq_number, which is the serial number associated with the sequence in the file. The d indicates a decimal integer value, the 4 specifies that the number is padded to a width of 4 digits, and the 0 before the 4 indicates that the padding is done with zeros.\nThe result of the sprintf function can be seen in the results presented. 
The first sequence is given the identifier motus_0001, the second is given the identifier motus_0002, and so on.\nEdition of the definition # Edition of the sequence # \u0026ndash;cut \u0026lt;###:###\u0026gt; \u0026ndash;sequence\nSynopsis # obiannotate [--add-lca-in \u0026lt;SLOT_NAME\u0026gt;] [--aho-corasick \u0026lt;string\u0026gt;] [--allows-indels] [--approx-pattern \u0026lt;PATTERN\u0026gt;]... [--attribute|-a \u0026lt;KEY=VALUE\u0026gt;]... [--batch-size \u0026lt;int\u0026gt;] [--clear] [--compress|-Z] [--csv] [--cut \u0026lt;###:###\u0026gt;] [--debug] [--definition|-D \u0026lt;PATTERN\u0026gt;]... [--delete-tag \u0026lt;KEY\u0026gt;]... [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--has-attribute|-A \u0026lt;KEY\u0026gt;]... [--help|-h|-?] [--id-list \u0026lt;FILENAME\u0026gt;] [--identifier|-I \u0026lt;PATTERN\u0026gt;]... [--ignore-taxon|-i \u0026lt;TAXID\u0026gt;]... [--input-OBI-header] [--input-json-header] [--inverse-match|-v] [--json-output] [--keep|-k \u0026lt;KEY\u0026gt;]... [--lca-error \u0026lt;#.###\u0026gt;] [--length] [--max-count|-C \u0026lt;COUNT\u0026gt;] [--max-cpu \u0026lt;int\u0026gt;] [--max-length|-L \u0026lt;LENGTH\u0026gt;] [--min-count|-c \u0026lt;COUNT\u0026gt;] [--min-length|-l \u0026lt;LENGTH\u0026gt;] [--no-order] [--no-progressbar] [--number] [--only-forward] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--paired-mode \u0026lt;forward|reverse|and|or|andnot|xor\u0026gt;] [--pattern \u0026lt;string\u0026gt;] [--pattern-error \u0026lt;int\u0026gt;] [--pattern-name \u0026lt;string\u0026gt;] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--predicate|-p \u0026lt;EXPRESSION\u0026gt;]... [--raw-taxid] [--rename-tag|-R \u0026lt;NEW_NAME=OLD_NAME\u0026gt;]... [--require-rank \u0026lt;RANK_NAME\u0026gt;]... [--restrict-to-taxon|-r \u0026lt;TAXID\u0026gt;]... 
[--scientific-name] [--sequence|-s \u0026lt;PATTERN\u0026gt;]... [--set-identifier \u0026lt;EXPRESSION\u0026gt;] [--set-tag|-S \u0026lt;KEY=EXPRESSION\u0026gt;]... [--silent-warning] [--skip-empty] [--solexa] [--taxonomic-path] [--taxonomic-rank] [--taxonomy|-t \u0026lt;string\u0026gt;] [--u-to-t] [--update-taxid] [--valid-taxid] [--version] [--with-leaves] [--with-taxon-at-rank \u0026lt;RANK_NAME\u0026gt;]... [\u0026lt;args\u0026gt;] Options # obiannotate specific options # Identifier modification # --set-identifier \u0026lt;EXPRESSION\u003e: An expression used to assign the new id of the sequence. Attribute modification # --clear: Clears all attributes associated with the sequence records. --delete-tag \u0026lt;KEY\u003e: Deletes the attribute named KEY. When this attribute is missing, the sequence record is left unchanged and the next one is examined. --keep | -k \u0026lt;KEY\u003e: Keeps only the attribute named KEY. Several -k options can be combined. --rename-tag | -R \u0026lt;NEW_NAME=OLD_NAME\u003e: Renames the attribute OLD_NAME to NEW_NAME. When the attribute named OLD_NAME is missing, the sequence record is left unchanged and the next one is examined. --set-tag | -S \u0026lt;KEY=EXPRESSION\u003e: Creates a new attribute named KEY, set to the value computed from EXPRESSION. Sequence-related annotation # --aho-corasick \u0026lt;string\u003e: Adds an aho-corasick attribute with the count of matches of the provided patterns. --length: Adds an attribute with seq_length as a key and the sequence length as a value. --pattern \u0026lt;string\u003e: Adds a pattern attribute containing the pattern, a pattern_match attribute indicating the matched sequence, and a pattern_error attribute indicating the number of differences between the pattern and its match in the sequence. --pattern-name \u0026lt;string\u003e: specifies the name to use as prefix for the attributes reporting the match. 
(default: \u0026ldquo;pattern\u0026rdquo;) Sequence modification # --cut \u0026lt;###:###\u003e: A pattern describing how to cut the sequence. Taxonomy annotation # --add-lca-in \u0026lt;KEY\u003e: From the taxonomic annotation of the sequence (taxid attribute or merged_taxid attribute), a new attribute named KEY is added with the taxid of the lowest common ancestor corresponding to the current annotation. --lca-error \u0026lt;#.###\u003e: Error rate tolerated on the taxonomic description when computing the lowest common ancestor. At most a fraction of lca-error of the taxonomic information can disagree with the estimated LCA. (default: 0.000000) --scientific-name: Annotates the sequence with its scientific name. Taxonomy options # Check taxids against a taxonomy # OBITools4 commands can load a taxonomy database when they process sequence data. If one is loaded, the command checks the validity of taxids during processing. Three cases can occur: The taxon is valid. The taxon is no longer valid, but a new one replaces it. The taxon is no longer valid, and no new taxid exists to replace it. In the first case, the obitools normalize the writing of the taxid in the form: TAXCOD:TAXID [SCIENTIFIC NAME]@RANK For example, with the NCBI taxonomy the human taxid looks like: taxon:9606 [Homo sapiens]@species That rewriting doesn't happen if the --raw-taxid option is set. In that case only the raw taxid is kept: 9606 In the second case, a warning message is logged on the standard error. If the --update-taxid option is set, the command will update the expired taxid to the new equivalent one, and the valid taxon rules apply. Otherwise, the old taxid is maintained in the result. In the last case, a warning message is also issued to the standard error, and the non-valid taxid is kept as is. If the --fail-on-taxonomy option is set, the command stops and exits with an error at the first non-valid taxid encountered in the input data. 
--taxonomy | -t \u0026lt;string\u003e: Path to the taxonomic database. --raw-taxid: Displays the raw taxid for each displayed taxon. (default: false) --update-taxid: Makes obitools automatically update the taxids that are declared merged to the newest one (default: false). --fail-on-taxonomy: Makes obitools fail with an error if a used taxid is not a currently valid one (default: false). --taxonomic-rank: Annotates the sequence with its taxonomic rank. --taxonomic-path: Annotates the sequence with its taxonomic path. --with-taxon-at-rank: Adds taxonomic annotation at taxonomic rank RANK_NAME. Selecting sequence records # Selection based on the sequence # Strict matching # --sequence | -s \u0026lt;PATTERN\u003e: A regular expression pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. Regular expression patterns are case-insensitive. Approximate matching # --approx-pattern \u0026lt;PATTERN\u003e: A DNA pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. DNA patterns are case-insensitive. They can be matched allowing for errors: mismatches, insertions or deletions. --allows-indels: allows for indels during DNA pattern matching (see --approx-pattern option). --pattern-error \u0026lt;INTEGER\u003e: maximum number of errors allowed when searching for patterns in DNA (default 0, see --approx-pattern option). Selection based on the sequence identifier # --identifier | -I \u0026lt;REGEX\u003e: Regular expression pattern to be tested against the sequence identifier. The pattern is case-insensitive. --id-list \u0026lt;FILENAME\u003e: points to a text file containing the list of sequence record identifiers to be selected. The file format consists of a single identifier per line. Selection based on the sequence definition # --definition | -D \u0026lt;REGEX\u003e: Regular expression pattern to be tested against the sequence definition. The pattern is case-insensitive. 
Selection based on the sequence properties # --min-count | -c \u0026lt;COUNT\u003e: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or greater than the defined minimum count. --max-count | -C \u0026lt;COUNT\u003e: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or smaller than the defined maximum count. --min-length | -l \u0026lt;LENGTH\u003e: selects the sequence records for which the sequence length is equal to or greater than the defined minimum sequence length. --max-length | -L \u0026lt;LENGTH\u003e: selects the sequence records for which the sequence length is equal to or less than the defined maximum sequence length. Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. 
(default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. 
Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obiannotate --help "},{"id":9,"href":"/obidoc/obitools/obiclean/","title":"obiclean","section":"Sequence alignments","content":" obiclean: a PCR aware denoising algorithm # Description # obiclean implements the denoising algorithms provided by OBITools4.\nThe original obiclean algorithm is a denoising (clustering) algorithm designed to filter out potential PCR-generated spurious sequences.\nThis new version of obiclean adds two additional filters:\nA filter to set a threshold for the minimum number of samples (PCRs) in which a sequence must be present to be retained (default: 1, can be changed using the --min-sample-count option). A naive chimera detection algorithm. This is an experimental feature. It is not run by default. It can be enabled with the --detect-chimera option. obiclean can run in two modes:\nA tagging mode where no sequences are actually removed from the data set; they are just tagged. It is your responsibility to remove the sequences you do not want based on these tags and your filter rules, using obigrep . 
A filter mode in which sequences that are considered to be artifactual sequences by obiclean are removed from the data set. obiclean relies on per-sample (PCR) sequence abundance information to apply its algorithms. Therefore, the input data set must first be dereplicated using the obiuniq command with the -m sample option.\ngraph TD A@{ shape: doc, label: \"my_sequences_uniq.fasta\" } C[obiclean] D@{ shape: doc, label: \"my_sequences_clean.fasta\" } A --\u003e C:::obitools C --\u003e D classDef obitools fill:#99d57c The clustering algorithm # The algorithm implemented in obiclean aims to remove point errors introduced by PCR (nucleotide substitutions, insertions or deletions). Therefore, it is not applied to the whole data set at once, but to each sample (PCR) independently.\nTwo pieces of information are used:\nThe count attributes of the sequence set. The pairwise sequence similarities calculated in each set of sequences belonging to a sample. The result of the obiclean algorithm is the classification of each sequence into one of three classes: head, internal or singleton.\nConsider two sequences S1 and S2 that occur in the same sample (PCR). S1 is a sequence variant of S2 if and only if\nThe ratio of the number of occurrences of S1 to that of S2 is less than the parameter R. \\[ \\frac{Count_{S1}}{Count_{S2}} \u003c R \\] The default value of R is 1 and can be set between 0 and 1 using the -r option.\nThe number of differences between S1 and S2 when aligning these sequences is at most the maximum number of differences specified with the -d option (default = 1 error). \\[ dist(S1,S2) \\leq d \\] This relation, is a sequence variant of, defines a Directed Acyclic Graph (DAG) on the sequences belonging to a sample. obiclean gives access to this graph using the --save-graph option. 
The following is an example of a command to run obiclean and create the graph files:\nobiclean -r 0.1 \\ -Z \\ --save-graph sample-graph \\ wolf_uniq.fasta.gz \u0026gt; wolf_clean.fasta.gz The -r option is used to set the ratio threshold between the sequence abundances. The --save-graph option tells obiclean to save the graph defined by the \u0026ldquo;is a sequence variant of\u0026rdquo; relation in one file per sample, using the GML format, in the directory named sample-graph. . 📂 sample-graph ├── 📄 13a_F730603.gml ├── 📄 15a_F730814.gml ├── 📄 26a_F040644.gml └── 📄 29a_F260619.gml The -Z option is used to compress the output file. The program yEd ( https://www.yworks.com/products/yed ) allows you to visualize the graph described for each sample.\nobiclean graph for the sample 13a_F730603. Each dot represents one sequence. The area of the dot is proportional to the abundance of the sequence in the sample. The arrows represent the relationship is a sequence variant of, pointing from the derived sequence to its presumed original version. The number on each arrow indicates the distance between the two sequences, here 1 everywhere. This sample corresponds to the dietary analysis of a wolf. Therefore, one true sequence (the prey) is expected. It corresponds to the big blue circle.\nFrom the graph topology, each sequence S is classified into one of the following three classes:\nhead\nThere is at least one sequence in the sample that is a variant of S. There is no sequence in the sample such that S is a variant of that sequence. internal\nThere is at least one sequence in the sample such that S is a variant of this sequence. singleton\nThere is no sequence in the sample that is a variant of S. There is no sequence in the sample such that S is a variant of that sequence. 
This class is sample dependent, as a graph is built per sample and recorded in the obiclean_status tag, as shown below for one of the sequences extracted from the result file wolf_clean.fasta.gz.\n\u0026gt;HELIUM_000100422_612GNAAXX:7:91:7524:17193#0/1_sub[28..127] {\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:62,\u0026#34;count\u0026#34;:8,\u0026#34;direction\u0026#34;:\u0026#34;reverse\u0026#34;,\u0026#34;experiment\u0026#34;:\u0026#34;wolf_diet\u0026#34;,\u0026#34;forward_match\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;forward_mismatches\u0026#34;:0,\u0026#34;forward_primer\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;merged_sample\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:5,\u0026#34;29a_F260619\u0026#34;:3},\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;obiclean_head\u0026#34;:false,\u0026#34;obiclean_headcount\u0026#34;:0,\u0026#34;obiclean_internalcount\u0026#34;:2,\u0026#34;obiclean_mutation\u0026#34;:{\u0026#34;HELIUM_000100422_612GNAAXX:7:22:2603:18023#0/1_sub[28..127]\u0026#34;:\u0026#34;(a)-\u0026gt;(g)@26\u0026#34;},\u0026#34;obiclean_samplecount\u0026#34;:2,\u0026#34;obiclean_singletoncount\u0026#34;:0,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:\u0026#34;i\u0026#34;,\u0026#34;29a_F260619\u0026#34;:\u0026#34;i\u0026#34;},\u0026#34;obiclean_weight\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:5,\u0026#34;29a_F260619\u0026#34;:3},\u0026#34;reverse_match\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;reverse_mismatches\u0026#34;:0,\u0026#34;reverse_primer\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;seq_a_single\u0026#34;:46,\u0026#34;seq_b_single\u0026#34;:46} ttagccctaaacacaagtaattaatgtaacaaaattattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctttataccctt Tags added to each sequence by the clustering algorithm # obiclean_head: true if the sequence is a head or singleton in at least one sample, false 
otherwise.\nobiclean_samplecount: the number of samples in the data set in which the sequence occurs (here 2).\nobiclean_headcount: the number of samples where the sequence is classified as head (here 0).\nobiclean_internalcount: the number of samples where the sequence is classified as internal (here 2).\nobiclean_singletoncount: the number of samples where the sequence is classified as singleton (here 0).\nobiclean_status: a JSON map indexed by the name of the sample in which the sequence was found. The value indicates the classification of the sequence in this sample: i for internal, s for singleton or h for head.\nobiclean_weight: a JSON map indexed by the name of the sample in which the sequence was found. The value indicates the number of times the sequence and its derivatives were found in this sample (here 5 for sample 15a_F730814).\nobiclean_mutation: a JSON map indexed by the ids of the parent sequences. Each value describes the mutation that distinguishes the variant from that parent sequence, together with its position. Only sequences belonging to the class internal in at least one sample are annotated with this tag.\nHere: (a)-\u0026gt;(g)@26 indicates that the a in the parent sequence HELIUM_000100422_612GNAAXX:7:22:2603:18023#0/1_sub[28..127] has been replaced by a g at position 26 in this variant.\nThe Chimera Detection Algorithm # This new version of obiclean implements a naive chimera detection algorithm. It is an experimental feature. The algorithm is only run when the --detect-chimera option is used. 
It is applied to sequences that have already been classified by the clustering algorithm presented above.\nThe algorithm defines a chimeric sequence S as a sequence classified as head or singleton by the clustering algorithm, for which there exists in the sample a pair of sequences \\(\\{S_{Pre} ; S_{Suf}\\}\\) that are more frequent than S, and such that the concatenation of the shared prefix between S and \\(S_{Pre}\\) and the shared suffix between S and \\(S_{Suf}\\) is equal to S.\n\\[ S = Common\\_prefix(S,S_{Pre}) + Common\\_suffix(S,S_{Suf}) \\] obiclean -r 0.1 \\ -Z \\ --detect-chimera \\ wolf_uniq.fasta.gz \u0026gt; wolf_clean_chimera.fasta.gz Extracted from the result file wolf_clean_chimera.fasta.gz, the sequence shown below illustrates how a chimeric sequence is annotated.\n\u0026gt;HELIUM_000100422_612GNAAXX:7:21:6999:18567#0/1_sub[28..127] {\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:62,\u0026#34;chimera\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:\u0026#34;{HELIUM_000100422_612GNAAXX:7:26:10054:16185#0/1_sub[28..127]}/{HELIUM_000100422_612GNAAXX:7:102:9724:19316#0/1_sub[28..127]}@(24)\u0026#34;},\u0026#34;count\u0026#34;:1,\u0026#34;direction\u0026#34;:\u0026#34;reverse\u0026#34;,\u0026#34;experiment\u0026#34;:\u0026#34;wolf_diet\u0026#34;,\u0026#34;forward_match\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;forward_mismatches\u0026#34;:0,\u0026#34;forward_primer\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;forward_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;merged_sample\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:1},\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;obiclean_head\u0026#34;:true,\u0026#34;obiclean_headcount\u0026#34;:0,\u0026#34;obiclean_internalcount\u0026#34;:0,\u0026#34;obiclean_samplecount\u0026#34;:1,\u0026#34;obiclean_singletoncount\u0026#34;:1,\u0026#34;obiclean_status\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:\u0026#34;s\u0026#34;},\u0026#
34;obiclean_weight\u0026#34;:{\u0026#34;29a_F260619\u0026#34;:1},\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(A:21)-\u0026gt;(G:02)\u0026#34;:67,\u0026#34;(A:34)-\u0026gt;(C:02)\u0026#34;:31,\u0026#34;(A:34)-\u0026gt;(G:02)\u0026#34;:29,\u0026#34;(C:28)-\u0026gt;(G:02)\u0026#34;:42,\u0026#34;(C:34)-\u0026gt;(A:02)\u0026#34;:30,\u0026#34;(G:32)-\u0026gt;(T:02)\u0026#34;:55,\u0026#34;(T:33)-\u0026gt;(G:02)\u0026#34;:35},\u0026#34;reverse_match\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;reverse_mismatches\u0026#34;:0,\u0026#34;reverse_primer\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;reverse_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;score\u0026#34;:306,\u0026#34;score_norm\u0026#34;:0.806,\u0026#34;seq_a_single\u0026#34;:46,\u0026#34;seq_ab_match\u0026#34;:50,\u0026#34;seq_b_single\u0026#34;:46} ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat agcttaaaactcaaaggacttggcggtgctgtatacccgt Tags added to each chimeric sequence by the chimera detection algorithm # A chimera tag is added to the sequence. The tag contains a JSON map indexed by the names of the samples in which the chimeric sequence was detected. The value indicates the two parental sequences and the position of the transition between the two sequences in the chimera:\n{\u0026#34;29a_F260619\u0026#34;:\u0026#34;{HELIUM_000100422_612GNAAXX:7:26:10054:16185#0/1_sub[28..127]}/{HELIUM_000100422_612GNAAXX:7:102:9724:19316#0/1_sub[28..127]}@(24)\u0026#34;} Which reads as\nSequence: HELIUM_000100422_612GNAAXX:7:21:6999:18567#0/1_sub[28..127] was detected as chimera in sample: 29a_F260619 between the sequences: HELIUM_000100422_612GNAAXX:7:26:10054:16185#0/1_sub[28..127] as prefix HELIUM_000100422_612GNAAXX:7:102:9724:19316#0/1_sub[28..127] as suffix The junction is at position 24 on the chimeric sequence HELIUM_000100422_612GNAAXX:7:21:6999:18567#0/1_sub[28..127]. Filtering the output # Removal of sequences annotated as artifacts. 
# By default, obiclean only annotates each sequence with different tags describing its classification in the different samples. Therefore, there are as many sequences in the result file as in the input file. This can be verified using the obicount command on the previous input and result files, wolf_uniq.fasta.gz and wolf_clean_chimera.fasta.gz respectively.\nobicount wolf_uniq.fasta.gz | csvlook | entities | n | | -------- | ------- | | variants | 4 313 | | reads | 42 452 | | symbols | 428 403 | obicount wolf_clean_chimera.fasta.gz | csvlook | entities | n | | -------- | ------- | | variants | 4 313 | | reads | 42 452 | | symbols | 428 403 | obiclean can be run in filter mode, allowing a sequence to be removed from the resulting sequence set if it is considered artifactual in all samples where it appears. Artifactual sequences are those classified as internal or chimeric.\nThis filtering is done by setting the -H option.\nobiclean -r 0.1 \\ -Z \\ --detect-chimera \\ -H \\ wolf_uniq.fasta.gz \u0026gt; wolf_clean_chimera_head.fasta.gz obicount wolf_clean_chimera_head.fasta.gz | csvlook | entities | n | | -------- | ------- | | variants | 2 322 | | reads | 35 623 | | symbols | 230 953 | Remove sequences occurring in less than k samples (PCRs) # It may be considered reasonable to eliminate a sequence present in fewer than k samples, particularly if technical PCR replicates have been performed and several samples in the dataset actually correspond to these technical replicates of a single biological sample. By default, the minimum number of samples is set to 1, meaning that no sequences are rejected by this filter. The --min-sample-count option can be used to set this threshold to a higher value. 
A value of 2 already has a significant effect:\nobiclean -r 0.1 \\ --detect-chimera \\ -H \\ --min-sample-count 2 \\ wolf_uniq.fasta.gz \\ | obicount | csvlook | entities | n | | -------- | ------ | | variants | 12 | | reads | 12 695 | | symbols | 1 197 | This is equivalent to post-filtering the result of the obiclean command using the following obigrep command:\nobiclean -r 0.1 \\ --detect-chimera \\ -H \\ wolf_uniq.fasta.gz \\ | obigrep -p \u0026#39;annotations.obiclean_samplecount\u0026gt;=2\u0026#39; \\ | obicount | csvlook | entities | n | | -------- | ------ | | variants | 12 | | reads | 12 695 | | symbols | 1 197 | Synopsis # obiclean [--batch-size \u0026lt;int\u0026gt;] [--compressed|-Z] [--debug] [--distance|-d \u0026lt;int\u0026gt;] [--ecopcr] [--embl] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--head|-H] [--help|-h|-?] [--input-OBI-header] [--input-json-header] [--json-output] [--max-cpu \u0026lt;int\u0026gt;] [--min-eval-rate \u0026lt;int\u0026gt;] [--min-sample-count \u0026lt;int\u0026gt;] [--no-order] [--no-progressbar] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--ratio|-r \u0026lt;float64\u0026gt;] [--sample|-s \u0026lt;string\u0026gt;] [--save-graph \u0026lt;string\u0026gt;] [--save-ratio \u0026lt;string\u0026gt;] [--skip-empty] [--solexa] [--version] [\u0026lt;args\u0026gt;] Options # obiclean specific options # Clustering algorithm options # --distance | -d \u0026lt;INTEGER\u003e: maximum numbers of differences between two variant sequences. (default: 1) --ratio | -r \u0026lt;FLOAT\u003e: threshold ratio between counts (rare/abundant counts) of two sequence records so that the less abundant one is a variant of the more abundant (default: 1.00). --sample | -s \u0026lt;STRING\u003e: name of the attribute containing sample descriptions (default: \u0026ldquo;sample\u0026rdquo;). 
Chimera detection options # --detect-chimera: enable chimera detection. (default: false) Filtering options # --head | -H : remove from the result data set the sequences annotated as spurious in all the samples (default: false). --min-sample-count \u0026lt;INTEGER\u003e: minimum number of samples a sequence must be present in to be considered in the analysis. (default: 1) Dumping internal clustering data # --save-graph \u0026lt;DIRNAME\u003e: save the clustering graph for each sample (PCR) in a GML file in the directory specified as the parameter of the option (default: false). --save-ratio \u0026lt;FILENAME\u003e: create a CSV file containing abundance ratio statistics for the edges of the clustering graphs above the --min-eval-rate threshold. If the option -Z is used together with the option --save-ratio, in addition to the result file, the ratio CSV file is also compressed using GZIP. --min-eval-rate \u0026lt;INTEGER\u003e: the minimum abundance of the destination sequence of an edge to be stored in the CSV file produced by the --save-ratio option (default: 1000). shared options # Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. 
Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. 
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # Determining the ratio parameter # The ratio parameter (option -r) defines the ratio threshold between the frequency of the variant of a sequence and that of its original sequence. It can be used to distinguish between two closely related true sequences and a true sequence with its variant. To get an idea of the ratio threshold to use, the obiclean command with the --save-ratio option can be used. This option creates a CSV file containing the abundance ratio statistics from the edges of the clustering graphs. 
Only a subset of the edges are kept in the CSV file:\nThose corresponding to a single mutation (distance between the original and the mutated sequence is 1). Those where the original sequence has a weight greater than the threshold (determined by the --min-eval-rate option). The last condition is used to avoid estimating the ratio from edges with too few sequences, in order to limit the stochastic effect on ratio estimation.\nobiclean -Z \\ --save-ratio wolf_ratio_R1.csv.gz \\ wolf_uniq.fasta.gz \u0026gt; wolf_clean_R1.fasta.gz The --save-ratio requires a parameter FILENAME that is the name of the CSV file to create. The file is compressed using GZIP if the option -Z is used.\ngzcat wolf_ratio_R1.csv.gz | head | csvlook -I | Sample | Origin_id | Origin_status | Origin | Mutant | Origin_Weight | Mutant_Weight | Origin_Count | Mutant_Count | Position | Origin_length | A | C | G | T | | ----------- | ---------------------------------------------------------- | ------------- | ------ | ------ | ------------- | ------------- | ------------ | ------------ | -------- | ------------- | -- | -- | -- | -- | | 26a_F040644 | HELIUM_000100422_612GNAAXX:7:5:15939:5437#0/1_sub[28..126] | h | a | - | 12830 | 1 | 10385 | 1 | 44 | 99 | 35 | 25 | 16 | 23 | | 26a_F040644 | HELIUM_000100422_612GNAAXX:7:5:15939:5437#0/1_sub[28..126] | h | a | - | 12830 | 1 | 10385 | 1 | 72 | 99 | 35 | 25 | 16 | 23 | | 26a_F040644 | HELIUM_000100422_612GNAAXX:7:5:15939:5437#0/1_sub[28..126] | h | a | - | 12830 | 1 | 10385 | 1 | 42 | 99 | 35 | 25 | 16 | 23 | | 26a_F040644 | HELIUM_000100422_612GNAAXX:7:5:15939:5437#0/1_sub[28..126] | h | a | - | 12830 | 1 | 10385 | 1 | 57 | 99 | 35 | 25 | 16 | 23 | | 26a_F040644 | HELIUM_000100422_612GNAAXX:7:5:15939:5437#0/1_sub[28..126] | h | a | - | 12830 | 1 | 10385 | 1 | 76 | 99 | 35 | 25 | 16 | 23 | | 26a_F040644 | HELIUM_000100422_612GNAAXX:7:5:15939:5437#0/1_sub[28..126] | h | a | - | 12830 | 1 | 10385 | 1 | 73 | 99 | 35 | 25 | 16 | 23 | | 26a_F040644 | 
HELIUM_000100422_612GNAAXX:7:5:15939:5437#0/1_sub[28..126] | h | a | - | 12830 | 1 | 10385 | 1 | 16 | 99 | 35 | 25 | 16 | 23 | | 26a_F040644 | HELIUM_000100422_612GNAAXX:7:5:15939:5437#0/1_sub[28..126] | h | a | - | 12830 | 1 | 10385 | 1 | 32 | 99 | 35 | 25 | 16 | 23 | | 26a_F040644 | HELIUM_000100422_612GNAAXX:7:5:15939:5437#0/1_sub[28..126] | h | a | - | 12830 | 1 | 10385 | 1 | 73 | 99 | 35 | 25 | 16 | 23 | The ratio CSV file wolf_ratio_R1.csv.gz contains the following columns:\nSample: The name of the sample where the observation is done. Origin_id: The ID of the original sequence corresponding to described mutant. Origin_status: The status of the original sequence in the sample. Origin: Original sequence at the mutation site. Mutant: Mutant sequence at the mutation site. Origin_Weight: Observed weight of the original sequence in the sample. Mutant_Weight: Observed weight of the mutant sequence in the sample. Origin_Count: Observed count of the original sequence in the sample. Mutant_Count: Observed count of the mutant sequence in the sample. Position: Position of the mutation in the original sequence. Origin_length: Length of the original sequence. A: Count of A nucleotides in the original sequence. C: Count of C nucleotides in the original sequence. G: Count of G nucleotides in the original sequence. T: Count of T nucleotides in the original sequence. 
From the file wolf_ratio_R1.csv.gz, a histogram of the ratio of the weight of the mutant to the weight of the original can be plotted using the following command:\ngzcat wolf_ratio_R1.csv.gz \\ | octosql -o csv \u0026#34;select log10(float(Mutant_Weight) / float(Origin_Weight)) as ratio from stdin.csv\u0026#34; \\ | uplot -H hist -n 25 ratio ā”Œ ┐ [-4.2, -4.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 107 [-4.0, -3.8) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 200 [-3.8, -3.6) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 208 [-3.6, -3.4) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 119 [-3.4, -3.2) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 145 [-3.2, -3.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 146 [-3.0, -2.8) ┤▇▇▇▇▇▇▇▇▇▇▇▇ 71 [-2.8, -2.6) ┤▇▇▇▇▇▇▇▇ 45 [-2.6, -2.4) ┤▇▇▇▇ 26 [-2.4, -2.2) ┤▇ 6 [-2.2, -2.0) ┤▇ 7 [-2.0, -1.8) ┤ 2 [-1.8, -1.6) ┤ 0 [-1.6, -1.4) ┤ 0 [-1.4, -1.2) ┤ 2 The file wolf_ratio_R1.csv.gz describes the following number of edges (look the number of rows in the CSV file):\ngzcat wolf_ratio_R1.csv.gz \\ | csvtk dim file num_cols num_rows - 15 1,084 Most edges in the graph connect a PCR variant sequence to its parent sequence. Only a few edges correspond to a connection between two closely related true sequences that differ only by a single mutation; they are not frequent enough to distort the shape of the distribution. Therefore, this histogram can be considered as the distribution of the ratio between a variant sequence and its parent sequence. We can observe that no ratio in this histogram is greater than \\(10^{-1}\\) , and only 4 out of 1084 edges have a ratio greater than \\(10^{-2}\\) . Using the --ratio 0.1 option will not split any edges, using the --ratio 0.01 option will split 4 edges over the edges used for the statistics. Because of all the edges discarded from the ratio table (involving too few original sequences), the effect on the number of MOTUs produced may be greater.\nBelow we run the obiclean command with several different values for the --ratio option, ranging from 1 to 0.01. 
For each run, the number of MOTUs produced is printed by piping the output of obiclean to the obicount and csvlook commands.\nRun with a ratio of 1 obiclean -r 1 -H wolf_uniq.fasta.gz \\ | obicount | csvlook | entities | n | | -------- | ------- | | variants | 2 046 | | reads | 35 111 | | symbols | 203 349 | Run with a ratio of 1/2 obiclean -r 0.5 -H wolf_uniq.fasta.gz \\ | obicount | csvlook | entities | n | | -------- | ------- | | variants | 2 046 | | reads | 35 111 | | symbols | 203 349 | Run with a ratio of 1/10 obiclean -r 0.1 -H wolf_uniq.fasta.gz \\ | obicount | csvlook | entities | n | | -------- | ------- | | variants | 2 449 | | reads | 35 757 | | symbols | 243 515 | Run with a ratio of 1/100 obiclean -r 0.01 -H wolf_uniq.fasta.gz \\ | obicount | csvlook | entities | n | | -------- | ------- | | variants | 3 215 | | reads | 37 546 | | symbols | 319 820 | As you can see, the number of MOTUs produced increases as the value of the --ratio option decreases, but a ratio of 0.5 has no effect on the number of MOTUs produced compared to the default ratio of 1.0.\n"},{"id":10,"href":"/obidoc/docs/commands/options/","title":"Shared command options","section":"The OBITools4 commands","content":" Customising the execution of OBITools # OBITools are a set of UNIX commands that can be used from a UNIX shell. They can be used interactively from a terminal, or as part of a shell script to automate a data analysis pipeline. Each OBITools command implements an algorithm to process the data. 
For example, the obicount command implements an algorithm to count the number of sequences in a sequence file.\nšŸ“„ two_sequences.fasta \u0026gt;AB061527 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Sorex unguiculatus mitochondrial NA, complete genome.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Soricidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9376,\u0026#34;genus_name\u0026#34;:\u0026#34;Sorex\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9379,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;family\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:2.2137847111025621e-13,\u0026#34;species_name\u0026#34;:\u0026#34;Sorex unguiculatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:62275,\u0026#34;taxid\u0026#34;:62275} ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat agcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;AL355887 {\u0026#34;count\u0026#34;:2,\u0026#34;definition\u0026#34;:\u0026#34;Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Hominidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9604,\u0026#34;genus_name\u0026#34;:\u0026#34;Homo\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9605,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:0,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;species_taxid\u0026#34;:9606,\u0026#34;taxid\u0026#34;:9606} ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac agcttaaaactcaaaggacctggcagttctttatatccct obicount two_sequences.fasta entities,n variants,2 reads,3 symbols,200 In addition to its name, an OBITools command has a number of options that allow you to customise its behaviour. 
For example, the obicount command has the --symbols option, which tells it to count only the total number of nucleotides in the sequence file.\nobicount --symbols two_sequences.fasta entities,n symbols,200 If you compare the two outputs, you will notice that the first version of the obicount command without the --symbols option counts the total number of nucleotides, but also the number of sequence variants and the number of reads, while the second version with the --symbols option counts only the total number of nucleotides.\nMultiple ways to specify the same option # Unix options are specified on the command line by adding them after the command name. They can take two forms:\nThe long option name, which is the name of the option preceded by two hyphens, for example --help. For some options, such as the help option, there is also a short version of the option. This consists of a single character preceded by a single hyphen, for example -h. If multiple forms of the same option exist, they are separated in the documentation by a vertical bar |, e.g. the option help exists in its long form --help and in one of its short forms -h or -?. These different forms are represented as follows --help|-h|-?.\nSpecifying an option through environment variables # Options such as --max-cpu, which specifies the maximum number of CPU cores used by OBITools, can be specified when running the command\nobicount --max-cpu=4 my_sequence.fasta or by declaring an environment variable. For this example, the environment variable corresponding to the --max-cpu option is OBIMAXCPU. When using bash or zsh shells, the environment variable can be set using the export command:\nexport OBIMAXCPU=4 Once the environment variable is set, any OBITools command run in the same shell session will use the value of four CPU cores, in this case without the need to specify the --max-cpu option.\nSome OBITools options are shared by most of the commands. 
These options are listed in the following table.\nControlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). 
--fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. 
(default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) "},{"id":11,"href":"/obidoc/docs/about/","title":"About","section":"Docs","content":" What are OBITools4? # The development of OBITools began at LECA ( University of Grenoble) in the early 2000s, at the same time as the development of DNA metabarcoding methods in the same laboratory. The idea behind the OBITools project was to provide a set of UNIX command-line tools that mimic standard UNIX shell commands such as grep, uniq or wc, but which work on DNA sequence files. Unlike standard UNIX tools, where the processing unit is a line of text, with OBITools the processing unit is a sequence record. In addition, some commands implementing algorithms specific to the processing of DNA metabarcoding data have been added, making OBITools one of the sequence processing tools widely used on UNIX-like systems, suitable for the analysis of DNA metabarcoding data.\nOBITools were originally developed in Python version 2, with some computationally intensive code written in C. They were adapted to the volumes of DNA metabarcoding data generated by 454 pyrosequencing in the early 2000s. With the advent of Solexa and Illumina sequencers, data sizes increased considerably, making OBITools less efficient. Coupled with the switch to Python version 3, OBITools became difficult to implement.\nOBITools4 are the latest version of OBITools . Unlike OBITools3, OBITools4 follow the same philosophy as OBITools1 and OBITools2. OBITools4 are a complete rewrite of the OBITools code in GO, an efficient compiled programming language. This language also enables code to be parallelized to take advantage of the multi-core architectures of today\u0026rsquo;s computers.\nThe most important thing to understand about OBITools is that it is not a pipeline for processing DNA metabarcoding data. 
OBITools are a set of tools that allow you to easily build your own pipeline for processing your DNA metabarcoding data according to your biological questions.\n"},{"id":12,"href":"/obidoc/docs/programming/lua/obitools_classes/biosequence/","title":"BioSequence","section":"Obitools Classes","content":" The BioSequence class # The BioSequence class corresponds to the OBITools representation of a nucleic sequence.\nConstructor of BioSequence # The BioSequence constructor accepts three parameters:\nsequence_id: a string corresponding to the sequence id. The sequence id cannot contain whitespace characters. dna_sequence: a string representing a nucleic sequence. The string is converted to lowercase and must only contain characters corresponding to IUPAC codes. definition: this parameter is optional. If present, it corresponds to an unstructured text, used to describe the sequence. sequence = BioSequence.new(\u0026#34;sequence_id\u0026#34;,\u0026#34;gctagctgtgatgctgatgctagct\u0026#34;) BioSequence Methods # id : the sequence identifier # When used with no parameter, the method extracts the sequence identifier from a BioSequence object and returns it as a string.\nsequence = BioSequence.new(\u0026#34;sequence_id\u0026#34;,\u0026#34;gctagctgtgatgctgatgctagct\u0026#34;) print(sequence:id()) sequence_id A string parameter representing a new identifier can be provided to the id method. In this case, the id of the BioSequence object is substituted by the new string. The method does not return anything.\nsequence = BioSequence.new(\u0026#34;sequence_id\u0026#34;,\u0026#34;gctagctgtgatgctgatgctagct\u0026#34;) print(sequence:id()) sequence:id(\u0026#34;new_id\u0026#34;) print(sequence:id()) sequence_id new_id sequence : the nucleic sequence # When used with no parameter, the method extracts the nucleic sequence itself from the BioSequence object. 
In this condition, it returns a string.\nsequence = BioSequence.new(\u0026#34;sequence_id\u0026#34;,\u0026#34;gctagctgtgatgctgatgctagct\u0026#34;) print(sequence:sequence()) gctagctgtgatgctgatgctagct A string parameter representing a nucleic sequence can be provided to the sequence method. In this case, the current sequence of the object is substituted by the new string. The method does not return anything.\nsequence = BioSequence.new(\u0026#34;sequence_id\u0026#34;,\u0026#34;gctagctgtgatgctgatgctagct\u0026#34;) print(sequence:sequence()) sequence:sequence(\u0026#34;cgatctagcta\u0026#34;) print(sequence:sequence()) gctagctgtgatgctgatgctagct cgatctagcta qualities # definition # count # taxid # taxon # attribute # len # has_sequence # has_qualities # source # md5 # md5_string # subsequence # reverse_complement # fasta # fastq # string # "},{"id":13,"href":"/obidoc/docs/programming/expression/","title":"Expression language","section":"Programming OBITools","content":" OBITools Expression Language # The OBITools expression language is based on Gval and extends it with extra functions useful for bioinformatics tasks, as well as predefined variables. It is designed to evaluate simple expressions used as arguments in some OBITools commands (e.g., obigrep , obiannotate ). For more complex scripting, you can use Lua through the obiscript command.\nBasic Expressions # Expressions can be literal values, arithmetic or logical operations, or string manipulations.\nExamples:\nLiteral values:\n42 // Number \u0026#34;hello\u0026#34; // String true // Boolean null // Null value Arithmetic operations:\n10 + 20 * 2 // 50 Logical operations:\nx \u0026gt; 0 \u0026amp;\u0026amp; y \u0026lt; 100 // Combined conditions Parameterized Expressions # Variables can be used in expression to parameterize them. 
They can be accessed directly or nested inside the expression, depending on their structure and how they are passed to the OBITools commands.\nExamples:\nDirect access to parameters:\nfoo \u0026gt; 0 // Checks if `foo` is greater than 0 foo.bar == \u0026#34;ok\u0026#34; // Access to nested key `bar` in `foo` Nested parameters:\nsequence.Qualities()[0] // Access first element of the array returned by `Qualities()` annotations[\u0026#34;count\u0026#34;] // Access key `count` in map `annotations` Selectors: Brackets vs Dot # OBITools expression language supports two ways to access nested data:\nBracket selector ([]): for dynamic or complex keys.\nfoo[\u0026#34;key\u0026#34; + \u0026#34;name\u0026#34;] // Dynamic key concatenation data[1] // Access second item in an array Dot selector (.): for fixed and alphanumeric keys.\nfoo.bar.baz // Access `baz` field in `bar` field of `foo` Struct Fields and Methods # If the parameters are Go structs, fields and methods can be accessed directly.\nExamples:\nAccess struct fields:\nannotations.seq_length + sequence.Count() // Combine field and method Nested structures:\nannotations.merged_sample.sample_1 // Access nested struct fields Built-in Features # The expression language includes a rich set of operators and data types:\nCategory Examples Operators +, -, *, /, \u0026gt;, ==, \u0026amp;\u0026amp;, || Constants 42, \u0026ldquo;hello\u0026rdquo;, true, null Functions date(), strlen(), format() Control if-else, ternary ? :, null coalescence ?? Example:\nWith the ternary conditional operator, you can conditionally assign a value to a variable. If value is greater than 100, it will be \u0026ldquo;high\u0026rdquo;, otherwise \u0026ldquo;low\u0026rdquo;.\nresult = (value \u0026gt; 100 ? \u0026#34;high\u0026#34; : \u0026#34;low\u0026#34;) ?? \u0026#34;default\u0026#34; The null coalescence operator (??) returns the left-hand side if it\u0026rsquo;s not null, otherwise it will return the right-hand side. 
So in this case, if value is null, it will be replaced with \u0026ldquo;default\u0026rdquo;.\nvalue ?? \u0026#34;default\u0026#34; You can chain several ?? operations together:\na ?? b ?? c ?? \u0026#34;fallback\u0026#34; 🧩 List of variables Added to the Gval Language # The expressions are evaluated in the context of a sequence. When evaluating an expression, these variables are available:\nsequence - a variable representing the current sequence being processed by the OBITools command. It is an object of type BioSequence. annotations - a variable representing the annotations of the current sequence being processed by the OBITools command. It is an object of type Annotations, actually a map indexed by string. Each string is the tag name that you can observe in the header of a sequence in a fasta or fastq file.\nThe expression language gives access to the methods of the BioSequence class through the sequence variable. For example, sequence.Len() returns the length of the sequence and sequence.Id() returns its identifier. The same applies to the methods of the Annotations class through the annotations variable.\n🧩 The useful methods for the BioSequence class are: # Len() int - Returns the length of the sequence. String() string - Returns the sequence itself as a string. Id() string - Returns the identifier of the sequence. Definition() string - Returns the definition part of the header line of the sequence HasAnnotation() bool - Returns true if at least one annotation exists for this sequence. HasDefinition() bool - Returns true if a definition exists for this sequence. HasSequence() bool - Returns true if the sequence is not empty. HasQualities() bool - Returns true if quality scores exist for this sequence. Count() int - Returns the number of occurrences of the sequence in the data set. Taxid() string - Returns the taxonomy id associated with this sequence. 🧩 List of Functions Added to the Gval Language # len # Calculates the length of an object (e.g., string, sequence). 
The function accepts as input a sequence, a string, a vector or a map. On sequence and string, it returns the length of the input (number of nucleotides or characters respectively). On maps and vectors, the len function returns the number of elements stored in the container object.\nExample: # Here we use the len function to compute the length of the current sequence.\nlen(sequence) // Returns the length of the biological sequence contains # Checks if a key exists in a map or a substring exists in a string. This function applies to map objects. OBITools maps are indexed by string keys. The contains function requires a map object as first parameter and a string object as second parameter. It returns the logical value true if the map contains the key defined by the second parameter. Otherwise, the function returns false.\nExample: # Check if the annotations map of the sequence contains the key count, that is, whether the sequence is annotated with a count tag.\ncontains(annotations,\u0026#34;count\u0026#34;) // Checks if \u0026#34;count\u0026#34; is a key in the annotations map ismap # Checks if an object is a map (key-value structure). This function is a type assertion function: it checks that the object provided as parameter is a map. It returns the logical value true if the object is a map, otherwise it returns false.\nExample: # Check if the annotations.merged_sample object is a map. annotations is itself a map containing every annotation of the current sequence. annotations.merged_sample is the object contained in the annotations at the index merged_sample. This tag is normally set by obiuniq and is a map indexed by sample ids and containing the number of times this sequence has been observed in the different samples. 
If the file is correctly annotated, the annotations.merged_sample object is therefore a map and the ismap function must return true.\nismap(annotations.merged_sample) // Returns `true` if the `merged_sample` // `annotations` is a map isvector # Checks if a value is a vector (list or array). Returns true if the object is a list, and false otherwise.\nExample: # isvector({\u0026#34;toto\u0026#34;:3}) // returns false isvector([1,2,3]) // returns true elementof # Extracts an element from a vector, a map or a string. The function requires two arguments: the container object and the index of the element to extract. If the index is out of range, it returns an error. If the object is not a vector, map, or string, it returns an error. When the container object is a vector or a string, the index is expected to be a non-negative integer, and when it is a map the index should be a string key.\nExample: # elementof([1,2,3], 0) // returns 1 elementof({\u0026#34;a\u0026#34;:1,\u0026#34;b\u0026#34;:2}, \u0026#34;a\u0026#34;) // returns 1 elementof(\u0026#34;abc\u0026#34;, 0) // returns \u0026#34;a\u0026#34; min # Returns the minimum value of a vector or a map. If the object is not a vector nor a map, but the value is comparable it will return the value itself. If it is not comparable it returns an error. An error is also returned if the vector or the map is empty.\nExample: # min([10,2,3]) // returns 2 min({\u0026#34;a\u0026#34;:10,\u0026#34;b\u0026#34;:2}) // returns 2 min(12) // returns 12 max # Returns the maximum value of a vector or a map. If the object is not a vector nor a map, but the value is comparable it will return the value itself. If it is not comparable it returns an error. An error is also returned if the vector or the map is empty.\nExample: # max([10,2,3]) // returns 10 max({\u0026#34;a\u0026#34;:10,\u0026#34;b\u0026#34;:2}) // returns 10 max(12) // returns 12 sprintf # Formats a string by replacing placeholders with values, enabling dynamic text generation. 
It is commonly used to construct messages, file paths, or structured data by inserting variables into predefined templates.\nHow It Works\nPlaceholders (e.g., %s, %d, %f) act as markers for values to be inserted. The function replaces each placeholder with the corresponding argument in order. Examples # Basic String Insertion\nsprintf(\u0026#34;Sample: %s\u0026#34;, \u0026#34;Sper01\u0026#34;) // Output: \u0026#34;Sample: Sper01\u0026#34; Numeric Formatting\nsprintf(\u0026#34;Length: %d bp\u0026#34;, 84) // Output: \u0026#34;Length: 84 bp\u0026#34; Floating-Point Precision\nsprintf(\u0026#34;GC Content: %.2f%%\u0026#34;, 52.345) // Output: \u0026#34;GC Content: 52.34%\u0026#34; Combining Multiple Values\nsprintf(\u0026#34;Primer: %s (position %d)\u0026#34;, \u0026#34;GGGCAATCCTGAGCCAA\u0026#34;, 10) // Output: \u0026#34;Primer: GGGCAATCCTGAGCCAA (position 10)\u0026#34; Placeholders like %s (string), %d (integer), %f (float), and %v (generic value) are typical of the printf family of functions found in many languages.\nPadding Add padding to values using 0 (zero) or space ( ) for alignment.\nFormat Description Example Output %5d Minimum width of 5, right-aligned sprintf(\u0026quot;%5d\u0026quot;, 42) ' 42' %-5d Minimum width of 5, left-aligned sprintf(\u0026quot;%-5d\u0026quot;, 42) '42 ' %05d Zero-padded to 5 digits sprintf(\u0026quot;%05d\u0026quot;, 42) '00042' %05.2f Zero-padded float with precision sprintf(\u0026quot;%05.2f\u0026quot;, 3.14) '03.14' Precision Control the number of decimal places for floats or the maximum length for strings.\nFormat Description Example Output %.2f 2 decimal places sprintf(\u0026quot;%.2f\u0026quot;, 3.14159) '3.14' %.3s First 3 characters of a string sprintf(\u0026quot;%.3s\u0026quot;, \u0026ldquo;hello\u0026rdquo;) 'hel' %05.2f Zero-padded float with precision 
sprintf(\u0026quot;%05.2f\u0026quot;, 12.3) '12.30' Alignment Use - to left-align values within a field.\nFormat Description Example Output %-10s Left-align string in 10 chars sprintf(\u0026quot;%-10s\u0026quot;, \u0026ldquo;cat\u0026rdquo;) 'cat ' %-5.2f Left-align float with precision sprintf(\u0026quot;%-5.2f\u0026quot;, 3.14) '3.14 ' Special Verbs\n%v: Default formatting (e.g., for slices, maps, or custom types).\nsprintf(\u0026#34;%v\u0026#34;, [1, 2, 3]) // \u0026#34;[1 2 3]\u0026#34; sprintf(\u0026#34;%v\u0026#34;, {\u0026#34;name\u0026#34;: \u0026#34;Alice\u0026#34;}) // \u0026#34;{name: Alice}\u0026#34; %T: Print the type of a value.\nsprintf(\u0026#34;%T\u0026#34;, 42) // \u0026#34;int\u0026#34; sprintf(\u0026#34;%T\u0026#34;, \u0026#34;hello\u0026#34;) // \u0026#34;string\u0026#34; Hexadecimal and Binary\n%x/%X: Lowercase/uppercase hex.\nsprintf(\u0026#34;%x\u0026#34;, 255) // \u0026#34;ff\u0026#34; sprintf(\u0026#34;%X\u0026#34;, 255) // \u0026#34;FF\u0026#34; %b: Binary representation.\nsprintf(\u0026#34;%b\u0026#34;, 5) // \u0026#34;101\u0026#34; Scientific Notation\n%e/%E: Scientific notation with lowercase/uppercase e.\nsprintf(\u0026#34;%e\u0026#34;, 123456.789) // \u0026#34;1.234568e+05\u0026#34; sprintf(\u0026#34;%E\u0026#34;, 123456.789) // \u0026#34;1.234568E+05\u0026#34; Use %% to escape a literal % character\nsprintf(\u0026#34;Percentage: %d%%\u0026#34;, 50) // \u0026#34;Percentage: 50%\u0026#34; subspc # The function accept a string parameter and replaces spaces in a string with underscores (_). It returns the new substituted string.\nExamples # subspc(\u0026#34;Abies alba\u0026#34;) // returns \u0026#34;Abies_alba\u0026#34; int # Converts a value to an integer (int). Fails if conversion is not possible.\nExamples # int(\u0026#34;324\u0026#34;) # Returns the integer value 324 int(3.24) # Returns the integer value 3 numeric # Converts a value to a floating-point number (float64). 
Fails if conversion is not possible.\nExamples # numeric(\u0026#34;3.14159\u0026#34;) # Returns the float value 3.14159 numeric(3) # Returns the float value 3.0 bool # Converts a value to a boolean (bool). Fails if conversion is not possible. Every non-zero numeric value is considered as true. For strings, once converted to lowercase, values equal to \u0026quot;true\u0026quot;, \u0026quot;t\u0026quot;, \u0026quot;yes\u0026quot;, \u0026quot;1\u0026quot; or \u0026quot;on\u0026quot; are considered as true, all others are false.\nExamples # bool(\u0026#34;TRUE\u0026#34;) // returns true bool(\u0026#34;Toto\u0026#34;) // returns false bool(3) // returns true bool(0) // returns false string # Converts a value to a string. Fails if conversion is not possible.\nExamples # string([1,2,4]) // returns \u0026#34;[1,2,4]\u0026#34; string(\u0026#34;Toto\u0026#34;) // returns \u0026#34;Toto\u0026#34; string(3) // returns \u0026#34;3\u0026#34; string(10.14) // returns \u0026#34;10.14\u0026#34; ifelse # Conditional operator: returns args[1] if args[0] is true, otherwise args[2].\nExamples # ifelse(bool(\u0026#34;true\u0026#34;), \u0026#34;yes\u0026#34;, \u0026#34;no\u0026#34;) // returns \u0026#34;yes\u0026#34; ifelse(bool(\u0026#34;false\u0026#34;), \u0026#34;yes\u0026#34;, \u0026#34;no\u0026#34;) // returns \u0026#34;no\u0026#34; gcskew # Calculates the GC skew (difference between G and C bases) of a biological sequence.\n\\[ GC_{skew} = \\frac{G - C}{G + C} \\] Examples # For example for sequence \u0026quot;GATCG\u0026quot;: \\(GC_{skew} = \\frac{2 - 1}{2 + 1} = \\frac{1}{3}= 0.33 \\) gcskew(\u0026#34;GATCG\u0026#34;) // returns 0.3333 (1/3) gc # Calculates the fraction of G and C bases in a biological sequence.\n\\[ GC = \\frac{G + C}{len(sequence) - O} \\] With G and C the number of corresponding nucleotides and O the number of ambiguous characters (Ns).\nThe function accepts a single argument of type biological sequence.\nExamples # gc(\u0026#34;GATCG\u0026#34;) // returns 0.6 (3/5, 
as there are two Gs and one Cs (Three in total) // in a sequence of five nucleotides) composition # Returns the base composition of a biological sequence as a map (map[string]float64) containing five keys: \u0026ldquo;a\u0026rdquo;, \u0026ldquo;c\u0026rdquo;, \u0026ldquo;g\u0026rdquo;, \u0026ldquo;t\u0026rdquo;, and \u0026ldquo;o\u0026rdquo;. The value for each key is the number of occurrences of that base in the sequence, case-insensitive (i.e., both \u0026lsquo;A\u0026rsquo; and \u0026lsquo;a\u0026rsquo; are considered as \u0026lsquo;a\u0026rsquo;). The \u0026ldquo;o\u0026rdquo; key represents the number of other characters (nucleotides that are not A, C, G or T) in the sequence.\nThe function accepts a single argument of type biological sequence.\nExamples # composition(\u0026#34;GATCG\u0026#34;) // returns map[string]float64{\u0026#34;a\u0026#34;:1, \u0026#34;c\u0026#34;:1, \u0026#34;g\u0026#34;:2, \u0026#34;t\u0026#34;:1, \u0026#34;o\u0026#34;:0} qualities # Returns the quality scores of a biological sequence as an array of float values representing the Phred quality scores for each base in the sequence. The function accepts a single argument of type BioSequence.\nExamples # qualities(sequence) replace # Replaces all occurrences of a regular expressions pattern in a string. The function accepts three arguments: the first one is the input string and the second one is the pattern to be replaced. The last argument is what will replace the found pattern in the string. It returns the modified string.\nExamples # replace(\u0026#34;GATCG\u0026#34;, \u0026#34;A.\u0026#34;, \u0026#34;xx\u0026#34;) // returns \u0026#34;GxxCG\u0026#34; replace(\u0026#34;GATCG\u0026#34;, \u0026#34;[ACGT]+\u0026#34;, \u0026#34;X\u0026#34;) // returns \u0026#34;X\u0026#34; substr # Extracts a substring from the input string. The function accepts three arguments. 
The first one is the input string, the second one is the start index and the third one is the length of the substring to be extracted. It returns the extracted substring. Position in the string is zero-based.\nExamples # substr(\u0026#34;GATCG\u0026#34;, 0, 3) // returns \u0026#34;GAT\u0026#34; substr(\u0026#34;GATCG\u0026#34;, 1, 4) // returns \u0026#34;ATCG\u0026#34; "},{"id":14,"href":"/obidoc/docs/patterns/regular/","title":"Regular Expressions","section":"Patterns","content":" Regular Expressions # Regular expressions are a powerful tool for describing patterns in text. They are used in several OBITools like obigrep , obiannotate or obiscript .\nSingle characters # Pattern Description . any character, possibly including newline (flag s=true) [xyz] character class [^xyz] negated character class [[:alpha:]] ASCII character class [[:^alpha:]] negated ASCII character class Composites # Pattern Description xy x followed by y x|y x or y (prefer x) Repetitions # Pattern Description x* zero or more x, prefer more x+ one or more x, prefer more x? zero or one x, prefer one x{n,m} n or n+1 or \u0026hellip; or m x, prefer more x{n,} n or more x, prefer more x{n} exactly n x x*? zero or more x, prefer fewer x+? one or more x, prefer fewer x?? zero or one x, prefer zero x{n,m}? n or n+1 or \u0026hellip; or m x, prefer fewer x{n,}? n or more x, prefer fewer x{n}? 
exactly n x Grouping # Pattern Description (re) numbered capturing group (submatch) (?P\u0026lt;name\u0026gt;re) named \u0026amp; numbered capturing group (submatch) (?\u0026lt;name\u0026gt;re) named \u0026amp; numbered capturing group (submatch) (?:re) non-capturing group (?flags) set flags within current group; non-capturing (?flags:re) set flags during re; non-capturing Character classes # Pattern Description [\\d] digits (== \\d) [^\\d] not digits (== \\D) [\\D] not digits (== \\D) [^\\D] not not digits (== \\d) [[:name:]] named ASCII class inside character class (== [:name:]) [^[:name:]] named ASCII class inside negated character class (== [:^name:]) [\\p{Name}] named Unicode property inside character class (== \\p{Name}) [^\\p{Name}] named Unicode property inside negated character class (== \\P{Name}) Named character classes # Pattern Description [[:alnum:]] alphanumeric (== [0-9A-Za-z]) [[:alpha:]] alphabetic (== [A-Za-z]) [[:ascii:]] ASCII (== [\\x00-\\x7F]) [[:blank:]] blank (== [\\t ]) [[:cntrl:]] control (== [\\x00-\\x1F\\x7F]) [[:digit:]] digits (== [0-9]) [[:graph:]] graphical (== [!-~] == [A-Za-z0-9!\u0026quot;#$%\u0026amp;'()*+,\\-./:;\u0026lt;=\u0026gt;?@[\\\\\\]^_`{|}~]) [[:lower:]] lower case (== [a-z]) [[:print:]] printable (== [ -~] == [[:graph:]]) [[:punct:]] punctuation (== [!-/:-@[-\\`{-~]) [[:space:]] whitespace (== [\\t\\n\\v\\f\\r ]) [[:upper:]] upper case (== [A-Z]) [[:word:]] word characters (== [0-9A-Za-z_]) [[:xdigit:]] hex digit (== [0-9A-Fa-f]) "},{"id":15,"href":"/obidoc/obitools/obicomplement/","title":"obicomplement","section":"Basics","content":" obicomplement: get sequences reverse complement # Description # Compute the reverse complement of the sequence entries. 
The output is written by default in fasta format if the input sequence file does not include quality scores, otherwise it is written in fastq format.\nSynopsis # obicomplement [--batch-size \u0026lt;int\u0026gt;] [--compress|-Z] [--csv] [--debug] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--help|-h|-?] [--input-OBI-header] [--input-json-header] [--json-output] [--max-cpu \u0026lt;int\u0026gt;] [--no-order] [--no-progressbar] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--paired-with \u0026lt;FILENAME\u0026gt;] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--raw-taxid] [--silent-warning] [--skip-empty] [--solexa] [--taxonomy|-t \u0026lt;string\u0026gt;] [--u-to-t] [--update-taxid] [--version] [--with-leaves] [\u0026lt;args\u0026gt;] Options # obicomplement specific options # --paired-with \u0026lt;FILENAME\u003e: filename containing the paired reads. Taxonomic options # --taxonomy | -t \u0026lt;string\u003e: Path to the taxonomic database. Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. 
Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ format --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the \u0026ndash;no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. 
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obicomplement --help "},{"id":16,"href":"/obidoc/obitools/obipairing/","title":"obipairing","section":"Sequence alignments","content":" obipairing: align forward and reverse paired reads # Description # When DNA metabarcoding sequences are generated as paired reads on the Illumina platform, obipairing aims to align forward and reverse reads to generate full length amplicon sequences.\nInput data # The obipairing command requires two input files:\nOne file contains the forward reads. The second file contains the reverse reads. Both files must contain the same number of sequences, and the sequences must be in the same order. 
This means that the first sequence of the forward reads file must correspond to the first sequence of the reverse reads file. obipairing will take this order into account and will only align sequences that are at the same rank.\nConsider the following example, where the forward reads file is forward.fastq and the reverse reads file is reverse.fastq and both consist of 4 sequences:\n📄 forward.fastq @M01334:147:000000000-LBRVD:1:1101:14968:1570 1:N:0:CTCACCAA+CTAGGCAA TGTTCCACGGGCAATCCTGAGCCAAATCTTTCATTTTGAAAAAATGAGAGATATAATGTATCTCTTATTTATTATAAGAAATAAAATATTTCTTATCTAATATTAAAGTTAGGTGCAGAGACTCAATGGGTGGAACTAGATCGGATGTGCA + 11\u0026gt;A\u0026gt;@3@A11\u0026gt;ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0\u0026gt;BF11B210B\u0026gt;//11B1\u0026lt;1BB\u0026lt;///\u0026lt;1122 @M01334:147:000000000-LBRVD:1:1101:15946:1586 1:N:0:CTCACCAA+CTAGGCAA TCCTAACCCCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAATATTTTATTTCTTATAATAAATAAGAGATATTTTATATCTCTCATTTTTTCAAAATGAAAGATTTGGCTCAGGATTGCCCACGTAACGGAGATCGGAAGAGC + 1\u0026gt;\u0026gt;A111\u0026gt;\u0026gt;\u0026gt;AFGGB1FFGFGFF3BBF1GGHHH33D2GH2B1D211110D1DGHHBFGGGGG2FA2F221F21A1F0D1DGHH2FAFFGFHFFGHHHHGG22@1BD111@0FFHE11GC1001BGF1B1B/EF00??////BF////\u0026lt;000 @M01334:147:000000000-LBRVD:1:1101:15399:1590 1:N:0:CTCACCAA+CTAGGCAA TGTTCCACCCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAATATTTTACTTCTTATAATAAATAAGAGTTATTTTATATCTCTCATTTTTTCAAAATGAAAGATTTGGCTCAGGATTGCCCGTGGAACTAGATCGGAAGAGCA + 11\u0026gt;A\u0026gt;@3B\u0026gt;\u0026gt;1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1\u0026gt;F1B1FD/////00\u0026lt;1 @M01334:147:000000000-LBRVD:1:1101:13773:1687 1:N:0:CTCACCAA+CTAGGCAA CTCGGATCACCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAAAAATATTATTTCTTATCTGAAATAAGAAATATTTTATATATTTCTTTTTCTCAAAATGAAAGATTTGGCTCAGGATTGCCCTGATCCGAGGGATAGCACCA + 
3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHHHHHHHFFHHFBDFBCGHHF4BGHFGFFHHBDGFHHEHHFAAEECEGF3FDGFC šŸ“„ reverse.fastq @M01334:147:000000000-LBRVD:1:1101:14968:1570 2:N:0:CTCACCAA+CTAGGCAA TTTTCCTCCCTTTTTTTCTCTGCACCTTTCTTTTTTATTAGTTTTTTATTATTTTTTTTCTTTTTTTATTTTATTGATACTTTATATCTCTCTTTTTTTCTTTTTTATTGATTTTTCTCTGGTTTTCCCTTGTTACTTGTTCTTTTTTGCT + 11\u0026gt;\u0026gt;1131111BB111A0B3B313A0B1BAFGG11E/DG222B22///1D2DDGG1AE\u0026gt;\u0026gt;FG1D1/\u0026gt;/12B221212@21BFD2B2B2B2F11BFGHEEC1111B//1212BBF110@22111@@/2111?01111@111?111111--11 @M01334:147:000000000-LBRVD:1:1101:15946:1586 2:N:0:CTCACCAA+CTAGGCAA CCGTTACGTGGGCAATCCTGAGCCAATTCTTTCTTTTTGAAAAAATGAGAGATATAAAATATCTCTTATTTATTATAAGAAATAAAATATTTCTTATCTAATATTAATGATAGGTGCAGTGACTCTATGGGGTTAGGTAGTTCGGATGAGC + 111\u0026gt;\u0026gt;111B111111BA0B1101B001BAGGH22DGGH?01110/B11111/D1D2221D1DBEDGH1GHH2GG2F222110D@111D1DFGEGFBG@GB1B2FG22222B220B11111111B@11B210/?E/00B211B2/////111 @M01334:147:000000000-LBRVD:1:1101:15399:1590 2:N:0:CTCACCAA+CTAGGCAA TTTTCCTCGGGCTATCCTGAGCCAAATCTTTCCTTTTGAAAAATTTAGAGATATAAAATATCTCTTATTTATTTTATGTAGTATTATATTTCTTATCTAATATTAAATTTAGTTGCTTTTTCTCATTTTGTTTTACTTTTTCTTTTTTGCT + 11\u0026gt;\u0026gt;1131111111B11B1101A000B1DFF21DDFG1011100B122111D1D2221D1DADAFG1DGH2FG2D212222D2222D2DAF2FG2D@F21B2DE22122B221@11111110B222B222B00021B221B011111//11 @M01334:147:000000000-LBRVD:1:1101:13773:1687 2:N:0:CTCACCAA+CTAGGCAA TGATAGCAGGGCTATCCTGAGCCAAATCCGTGTTTTGAGAAAACAAGGGGGTTCTCGAACTAGAATACAAAAGAAAAGGATAGGTGCAGAGACTCAATGGTGCTATCCCTCGGATCAGGGCAATCCTTAGCCAAATCTTTCATTTTTTGAA + 111\u0026gt;13@1111\u0026gt;11B1AF11BABC00B110BAFGGH0000DFAB//0///EEECGFA10AG1111D@@11100/0000/0F110B11@11/0\u0026gt;FC@1B\u0026gt;1B11FEFEC\u0026gt;E\u0026gt;///?\u0026lt;0110/?/FF\u0026lt;G22111@00@\u0026lt;GHHB\u0026gt;FHHH1///1 The first sequence of the forward.fastq file having the id M01334:147:000000000-LBRVD:1:1101:14968:1570 will be paired with the first sequence of the reverse.fastq file 
having the same id M01334:147:000000000-LBRVD:1:1101:14968:1570, not because they have the same identifier but because they are both the first sequence of their respective files.\nThe simplest obipairing command # The minimal obipairing command to align the forward.fastq and reverse.fastq files is:\nobipairing -F forward.fastq -R reverse.fastq \u0026gt; paired.fastq graph TD A@{ shape: doc, label: \"forward.fastq\" } B@{ shape: doc, label: \"reverse.fastq\" } C[obipairing] D@{ shape: doc, label: \"paired.fastq\" } A --\u003e C B --\u003e C:::obitools C --\u003e D classDef obitools fill:#99d57c it will produce a file named paired.fastq with the following content:\nšŸ“„ paired.fastq @M01334:147:000000000-LBRVD:1:1101:14968:1570 {\u0026#34;ali_length\u0026#34;:137,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;join\u0026#34;,\u0026#34;score\u0026#34;:1687,\u0026#34;score_norm\u0026#34;:0.679,\u0026#34;seq_ab_match\u0026#34;:93} tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca..........agcaaaaaagaacaagtaacaagggaaaaccagagaaaaatcaataaaaaagaaaaaaagagagatataaagtatcaataaaataaaaaaagaaaaaaaataataaaaaactaataaaaaagaaaggtgcagagaaaaaaagggaggaaaa + 11\u0026gt;A\u0026gt;@3@A11\u0026gt;ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0\u0026gt;BF11B210B\u0026gt;//11B1\u0026lt;1BB\u0026lt;///\u0026lt;1122!!!!!!!!!!11--111111?111@11110?1112/@@11122@011FBB2121//B1111CEEHGFB11F2B2B2B2DFB12@212122B21/\u0026gt;/1D1GF\u0026gt;\u0026gt;EA1GGDD2D1///22B222GD/E11GGFAB1B0A313B3B0A111BB1111311\u0026gt;\u0026gt;11 @M01334:147:000000000-LBRVD:1:1101:15946:1586 
{\u0026#34;ali_dir\u0026#34;:\u0026#34;right\u0026#34;,\u0026#34;ali_length\u0026#34;:138,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(T:16)-\u0026gt;(A:33)\u0026#34;:14,\u0026#34;(T:33)-\u0026gt;(A:17)\u0026#34;:118,\u0026#34;(T:37)-\u0026gt;(A:16)\u0026#34;:125,\u0026#34;(T:38)-\u0026gt;(A:16)\u0026#34;:32,\u0026#34;(T:39)-\u0026gt;(A:17)\u0026#34;:44},\u0026#34;paring_fast_count\u0026#34;:114,\u0026#34;paring_fast_overlap\u0026#34;:138,\u0026#34;paring_fast_score\u0026#34;:0.844,\u0026#34;score\u0026#34;:5446,\u0026#34;score_norm\u0026#34;:0.957,\u0026#34;seq_a_single\u0026#34;:13,\u0026#34;seq_ab_match\u0026#34;:132,\u0026#34;seq_b_single\u0026#34;:13} gctcatccgaactacctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc + 111/////2B112CMMOUO?MNObVHfcAVVHVWVVTQSWRXXIYYYXUSWiXaWeWWUWVSTTTWXgeUWWXXXWWgXWYYWVYWdUgSTTTXYYUVdTVWVXVgUWXXXVeYXfTCUXWW`QGUWfA@WSR?PRRWVARAc?UVMMOO?///BF////\u0026lt;000 @M01334:147:000000000-LBRVD:1:1101:15399:1590 {\u0026#34;ali_length\u0026#34;:4,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;join\u0026#34;,\u0026#34;score\u0026#34;:126,\u0026#34;score_norm\u0026#34;:1,\u0026#34;seq_ab_match\u0026#34;:4} tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca..........agcaaaaaagaaaaagtaaaacaaaatgagaaaaagcaactaaatttaatattagataagaaatataatactacataaaataaataagagatattttatatctctaaatttttcaaaaggaaagatttggctcaggatagcccgaggaaaa + 
11\u0026gt;A\u0026gt;@3B\u0026gt;\u0026gt;1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1\u0026gt;F1B1FD/////00\u0026lt;1!!!!!!!!!!11//111110B122B12000B222B222B01111111@122B22122ED2B12F@D2GF2FAD2D2222D222212D2GF2HGD1GFADAD1D1222D1D111221B0011101GFDD12FFD1B000A1011B11B1111111311\u0026gt;\u0026gt;11 @M01334:147:000000000-LBRVD:1:1101:13773:1687 {\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:54,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(C:39)-\u0026gt;(A:16)\u0026#34;:102,\u0026#34;(C:39)-\u0026gt;(A:17)\u0026#34;:121,\u0026#34;(T:39)-\u0026gt;(A:14)\u0026#34;:101},\u0026#34;paring_fast_count\u0026#34;:42,\u0026#34;paring_fast_overlap\u0026#34;:54,\u0026#34;paring_fast_score\u0026#34;:0.824,\u0026#34;score\u0026#34;:2888,\u0026#34;score_norm\u0026#34;:0.944,\u0026#34;seq_a_single\u0026#34;:97,\u0026#34;seq_ab_match\u0026#34;:51,\u0026#34;seq_b_single\u0026#34;:97} ctcggatcaccattgagtctctgcacctatctttaatattagataagaaaaaatattatttcttatctgaaataagaaatattttatatatttctttttctcaaaatgaaagatttggctcaggattgccctgatccgagggatagcaccattgagtctctgcacctatccttttcttttgtattctagttcgagaacccccttgttttctcaaaacacggatttggctcaggatagccctgctatca + 3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHXVVJIommmegikl]bVWgVDRXIlbkkVfPSWVWccVVT^ebggjkkCVeWcd1@CF\u0026gt;0/11@11B011F0/0000/00111@@D1111GA01AFGCEEE///0//BAFD0000HGGFAB011B00CBAB11FA1B11\u0026gt;1111@31\u0026gt;111 The alignment process # obipairing will align the reads following a two-step procedure to increase computation speed.\nA fast alignment to determine quickly the overlap # The first step aligns the reads using a FASTA-derived algorithm. 
Based on the results of the first step, a second alignment step is performed on the overlapping region only, using an exact dynamic programming algorithm that takes into account the sequence quality scores present in the fastq files. It is possible to disable this first alignment step, at the cost of an increase in computation time, by using the --exact-mode option.\nThe first fast alignment step adds three tags to the FASTQ header of each sequence record to report the results of this first alignment step.\nparing_fast_count : Number of 4-mers shared on the main diagonal of the fasta dot plot. pairing_fast_overlap : Length in nucleotides of the overlap as detected by this algorithm. pairing_fast_score : The pairing fast score is the number of shared 4-mers on the main diagonal of the fasta dot plot (pairing_fast_count) divided by the number of 4-mers involved in the overlapping region of the forward and reverse reads ( \\(pairing\\_fast\\_overlap - 3\\) ) \\[ pairing\\_fast\\_score = \\frac{pairing\\_fast\\_count}{pairing\\_fast\\_overlap - 3} \\] There are two options for controlling this first step.\nThe --fast-absolute option changes the selection of the best alignment from the one with the highest pairing_fast_score (the default behavior) to the one with the highest pairing_fast_count.\nThe --exact-mode option tells obipairing to bypass this first alignment step and proceed directly to exact alignment, at the cost of a longer computation time.\nThe exact alignment of the overlapping regions # Once the overlap has been quickly identified using the FASTA-derived algorithm, the overlapping region detected in this first step is extended by \\(\\Delta\\) nucleotides at each end ( \\(\\Delta = 5\\) by default; it can be set with the --delta option) and then exactly aligned using a semi-global alignment algorithm that takes into account the sequence quality scores present in the fastq files. There are two versions of this algorithm, the left-align and the right-align version. 
The version used, left or right, depends on the length of the amplicon. Amplicons longer than the read length are aligned with the left version. The shorter ones are aligned with the right version.\nWhen the --exact-mode option is used, full length reads are aligned twice, once with the left version and once with the right version. The alignment with the highest score is used. This consequently increases computation time.\nThe exact alignment step adds the following tags to the FASTQ header of each read to report the quality of the alignment.\nali_dir: indicates which version of the exact alignment was used, left or right. ali_length: the length of the aligned overlapping region (including gaps). seq_a_single: the length of the unaligned region on the forward read. seq_ab_match: the number of matches in the aligned overlapping region. seq_b_single: the length of the unaligned region on the reverse read. score: the raw score of the alignment (the sum of the elementary scores for each aligned position). score_norm: seq_ab_match divided by ali_length. pairing_mismatches: a description of the mismatches between the reads (this tag is not added if the --without-stat option is set). It is expressed as a JSON map with keys describing the mismatch and values corresponding to the position of the mismatch in the reconstructed full length amplicon. {\u0026#34;(C:39)-\u0026gt;(A:16)\u0026#34;:102,\u0026#34;(C:39)-\u0026gt;(A:17)\u0026#34;:121,\u0026#34;(T:39)-\u0026gt;(A:14)\u0026#34;:101} This example describes the three mismatches found in the overlapping region of the fourth sequence pair: A C with a quality score of 39 on the forward read is aligned to an A with a quality score of 16 on the reverse read at position 102. A C with a quality score of 39 on the forward read is aligned to an A with a quality score of 17 on the reverse read at position 121. A T with a quality score of 39 on the forward read is aligned to an A with a quality score of 14 on the reverse read at position 101. 
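The tags described above can be checked by hand. The following Python sketch is not part of OBITools: it decodes a trimmed copy of the annotations of the fourth record of the example output, recomputes the fast score and score_norm from their definitions, and unpacks the pairing_mismatches keys. The helper name decode_mismatches and the regular expression are assumptions made for illustration, and the tag spellings (paring_fast_count, paring_fast_overlap) are the ones that appear in the example output.

```python
import json
import re

# Annotations copied from the fourth record of the example output
# (trimmed to the tags used below).
header = (
    '{"ali_dir":"left","ali_length":54,"mode":"alignment",'
    '"pairing_mismatches":{"(C:39)->(A:16)":102,"(C:39)->(A:17)":121,'
    '"(T:39)->(A:14)":101},'
    '"paring_fast_count":42,"paring_fast_overlap":54,'
    '"score_norm":0.944,"seq_ab_match":51}'
)
tags = json.loads(header)

# Fast score: shared 4-mers divided by the number of 4-mers in the overlap.
fast_score = tags["paring_fast_count"] / (tags["paring_fast_overlap"] - 3)

# score_norm: number of matches divided by the aligned overlap length.
score_norm = tags["seq_ab_match"] / tags["ali_length"]

# Hypothetical pattern for keys of the form "(X:q1)->(Y:q2)".
MISMATCH = re.compile(r"\((.):(\d+)\)->\((.):(\d+)\)")


def decode_mismatches(mapping):
    """Return (position, fwd_base, fwd_q, rev_base, rev_q) tuples sorted by position."""
    rows = []
    for key, position in mapping.items():
        match = MISMATCH.fullmatch(key)
        if match is None:
            continue  # ignore keys that do not follow the assumed pattern
        fwd_base, fwd_q, rev_base, rev_q = match.groups()
        rows.append((position, fwd_base, int(fwd_q), rev_base, int(rev_q)))
    return sorted(rows)


mismatches = decode_mismatches(tags["pairing_mismatches"])
print(round(fast_score, 3), round(score_norm, 3))
for row in mismatches:
    print(row)
```

Rounded to three decimals, the two recomputed ratios agree with the paring_fast_score and score_norm values reported for that record.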
Building the consensus sequence # If the overlap length is below a threshold (20 by default; it can be set with the --min-overlap option), or the score_norm is below an identity threshold (0.9 by default; it can be set with the --min-identity option), no consensus is computed for the read pair. The two sequences are simply pasted together, with a run of . characters separating the forward read from the reverse complement of the reverse read. In this case, the sequence is tagged with a mode attribute set to join.\nIf the overlap is long enough and the identity is sufficient, a consensus sequence is built to maximize the global sequencing quality of the reconstructed amplicon. The non-aligned regions are reported as is. The overlapping regions are transcribed as follows:\nFor each match, the nucleotide observed on both reads is retained, and the quality score is increased to reflect the congruence of the two reads. \\[Q_{consensus} = Q_F + Q_R\\] If there is a mismatch, the nucleotide with the highest quality score is retained and its quality score is decreased to reflect the discrepancy between the two reads (with \\(Q_{max} = max(Q_F, Q_R)\\) and \\(Q_{min} = min(Q_F, Q_R)\\) ). \\[Q_{consensus} = \\log_{10} \\left(10^{-\\frac{Q_{max}}{10}} \\cdot \\frac{1 - 10^{-\\frac{Q_{min}}{10}}}{4} \\right)\\] In case of an insertion or deletion, the gap is assigned a quality of 0 and the mismatch rule is applied. This means that insertions and deletions will always be considered as insertions in the consensus sequence. 
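The decision between join and alignment described in this section can be sketched in a few lines of Python. This is an illustration of the two documented thresholds, not the obipairing implementation: the function names pairing_mode and match_quality are hypothetical, and the overlap length is assumed to be the value reported in the ali_length tag. Only the match rule of the consensus is sketched; the mismatch rule is left out.

```python
def pairing_mode(ali_length, seq_ab_match, min_overlap=20, min_identity=0.9):
    """Return 'alignment' if a consensus would be built, 'join' otherwise.

    Defaults mirror the documented --min-overlap and --min-identity values.
    """
    if ali_length < min_overlap:
        return "join"  # overlap too short
    score_norm = seq_ab_match / ali_length
    return "alignment" if score_norm >= min_identity else "join"


def match_quality(q_forward, q_reverse):
    """Consensus quality for a position where both reads agree (Q_F + Q_R)."""
    return q_forward + q_reverse


# (ali_length, seq_ab_match) of the four records of the example output
records = [(137, 93), (138, 132), (4, 4), (54, 51)]
modes = [pairing_mode(length, matches) for length, matches in records]
print(modes)
print(match_quality(39, 39))
```

Applied to the four example records, the sketch returns the same mode values as the ones shown in paired.fastq: the first pair fails on identity, the third on overlap length.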
A mode attribute set to alignment will be added to the consensus sequence annotations.\nSynopsis # obipairing --forward-reads|-F \u0026lt;FILENAME_F\u0026gt; --reverse-reads|-R \u0026lt;FILENAME_R\u0026gt; [--batch-size \u0026lt;int\u0026gt;] [--compress|-Z] [--debug] [--delta|-D \u0026lt;int\u0026gt;] [--ecopcr] [--embl] [--exact-mode] [--fast-absolute] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--gap-penality|-G \u0026lt;float64\u0026gt;] [--genbank] [--help|-h|-?] [--input-OBI-header] [--input-json-header] [--json-output] [--max-cpu \u0026lt;int\u0026gt;] [--min-identity|-X \u0026lt;float64\u0026gt;] [--min-overlap \u0026lt;int\u0026gt;] [--no-order] [--no-progressbar] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--penality-scale \u0026lt;float64\u0026gt;] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--skip-empty] [--solexa] [--version] [--without-stat|-S] [\u0026lt;args\u0026gt;] Options # obipairing mandatory options # --forward-reads | -F \u0026lt;FILENAME\u003e: The name of the file containing the forward reads. --reverse-reads | -R \u0026lt;FILENAME\u003e: The name of the file containing the reverse reads. Other obipairing specific options # --delta | -D \u0026lt;INTEGER\u003e: length added to the overlap detected by the fast algorithm before being forwarded to the exact alignment algorithm (default: 5 nucleotides). --exact-mode: do not run fast alignment heuristic. (default: a fast algorithm is run at first to accelerate the final exact alignment). --fast-absolute: compute absolute fast score, this option has no effect in exact mode (default: false). --gap-penalty | -G \u0026lt;FLOAT64\u003e: gap penalty expressed as the multiply factor applied to the mismatch score between two nucleotides with a quality of 40 (default 2). 
--min-identity | -X \u0026lt;FLOAT64\u003e: minimum identity between overlapped regions of the reads to consider the alignment (default: 0.900000). --min-overlap \u0026lt;INTEGER\u003e: minimum overlap between both the reads to consider the alignment (default: 20). --penalty-scale \u0026lt;FLOAT64\u003e: scale factor applied to the mismatch score and the gap penalty (default 1). --without-stat | -S : remove alignment statistics from the produced consensus sequences (default: false). Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats. --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ format. --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: by default, the OBITools ensure that the order of the sequences is the same in the input and output files. When multiple files are processed, they are processed one at a time. 
If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can also be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). 
--batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # Basic example # Consider the two small fastq files presented above, each containing four sequences and named forward.fastq and reverse.fastq. The following command will align them and create a file named paired.fastq containing the full-length amplicon sequences:\nobipairing -F forward.fastq -R reverse.fastq \u0026gt; paired.fastq A bar graph showing the frequencies of the aligned and joined read pairs can be generated by combining the output of the obicsv command with the uplot command:\nobicsv -k mode paired.fastq | uplot -H count mode ┌ ┐ alignment ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 2.0 join ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 2.0 └ ┘ It is possible to use the obidistribute tool to separate the reads according to their mode attribute, which is set to join or alignment:\nobidistribute -p \u0026#34;paired_%s.fastq\u0026#34; \\ -c mode \\ paired.fastq This command will produce two files named paired_join.fastq and paired_alignment.fastq containing the sequences with mode set to join and alignment respectively.\nLooking at the content of the paired_join.fastq file, we can see that the first pair of reads was not aligned because the score_norm tag is less than the default identity threshold of 0.9, while the second pair of reads was not aligned because the length of the overlap (ali_length tag) is less than the default minimum overlap of 20.\n📄 paired_join.fastq 
@M01334:147:000000000-LBRVD:1:1101:14968:1570 {\u0026#34;ali_length\u0026#34;:137,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;join\u0026#34;,\u0026#34;score\u0026#34;:1687,\u0026#34;score_norm\u0026#34;:0.679,\u0026#34;seq_ab_match\u0026#34;:93} tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca..........agcaaaaaagaacaagtaacaagggaaaaccagagaaaaatcaataaaaaagaaaaaaagagagatataaagtatcaataaaataaaaaaagaaaaaaaataataaaaaactaataaaaaagaaaggtgcagagaaaaaaagggaggaaaa + 11\u0026gt;A\u0026gt;@3@A11\u0026gt;ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0\u0026gt;BF11B210B\u0026gt;//11B1\u0026lt;1BB\u0026lt;///\u0026lt;1122!!!!!!!!!!11--111111?111@11110?1112/@@11122@011FBB2121//B1111CEEHGFB11F2B2B2B2DFB12@212122B21/\u0026gt;/1D1GF\u0026gt;\u0026gt;EA1GGDD2D1///22B222GD/E11GGFAB1B0A313B3B0A111BB1111311\u0026gt;\u0026gt;11 @M01334:147:000000000-LBRVD:1:1101:15399:1590 {\u0026#34;ali_length\u0026#34;:4,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;join\u0026#34;,\u0026#34;score\u0026#34;:126,\u0026#34;score_norm\u0026#34;:1,\u0026#34;seq_ab_match\u0026#34;:4} tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca..........agcaaaaaagaaaaagtaaaacaaaatgagaaaaagcaactaaatttaatattagataagaaatataatactacataaaataaataagagatattttatatctctaaatttttcaaaaggaaagatttggctcaggatagcccgaggaaaa + 
11\u0026gt;A\u0026gt;@3B\u0026gt;\u0026gt;1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1\u0026gt;F1B1FD/////00\u0026lt;1!!!!!!!!!!11//111110B122B12000B222B222B01111111@122B22122ED2B12F@D2GF2FAD2D2222D222212D2GF2HGD1GFADAD1D1222D1D111221B0011101GFDD12FFD1B000A1011B11B1111111311\u0026gt;\u0026gt;11 Looking at the contents of the paired_alignment.fastq file, we can see (ali_dir tag) that the first pair of reads was aligned using the right version of the exact alignment algorithm, while the second pair of reads was aligned using the left version.\nšŸ“„ paired_alignment.fastq @M01334:147:000000000-LBRVD:1:1101:15946:1586 {\u0026#34;ali_dir\u0026#34;:\u0026#34;right\u0026#34;,\u0026#34;ali_length\u0026#34;:138,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(T:16)-\u0026gt;(A:33)\u0026#34;:14,\u0026#34;(T:33)-\u0026gt;(A:17)\u0026#34;:118,\u0026#34;(T:37)-\u0026gt;(A:16)\u0026#34;:125,\u0026#34;(T:38)-\u0026gt;(A:16)\u0026#34;:32,\u0026#34;(T:39)-\u0026gt;(A:17)\u0026#34;:44},\u0026#34;paring_fast_count\u0026#34;:114,\u0026#34;paring_fast_overlap\u0026#34;:138,\u0026#34;paring_fast_score\u0026#34;:0.844,\u0026#34;score\u0026#34;:5446,\u0026#34;score_norm\u0026#34;:0.957,\u0026#34;seq_a_single\u0026#34;:13,\u0026#34;seq_ab_match\u0026#34;:132,\u0026#34;seq_b_single\u0026#34;:13} gctcatccgaactacctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc + 111/////2B112CMMOUO?MNObVHfcAVVHVWVVTQSWRXXIYYYXUSWiXaWeWWUWVSTTTWXgeUWWXXXWWgXWYYWVYWdUgSTTTXYYUVdTVWVXVgUWXXXVeYXfTCUXWW`QGUWfA@WSR?PRRWVARAc?UVMMOO?///BF////\u0026lt;000 @M01334:147:000000000-LBRVD:1:1101:13773:1687 
{\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:54,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(C:39)-\u0026gt;(A:16)\u0026#34;:102,\u0026#34;(C:39)-\u0026gt;(A:17)\u0026#34;:121,\u0026#34;(T:39)-\u0026gt;(A:14)\u0026#34;:101},\u0026#34;paring_fast_count\u0026#34;:42,\u0026#34;paring_fast_overlap\u0026#34;:54,\u0026#34;paring_fast_score\u0026#34;:0.824,\u0026#34;score\u0026#34;:2888,\u0026#34;score_norm\u0026#34;:0.944,\u0026#34;seq_a_single\u0026#34;:97,\u0026#34;seq_ab_match\u0026#34;:51,\u0026#34;seq_b_single\u0026#34;:97} ctcggatcaccattgagtctctgcacctatctttaatattagataagaaaaaatattatttcttatctgaaataagaaatattttatatatttctttttctcaaaatgaaagatttggctcaggattgccctgatccgagggatagcaccattgagtctctgcacctatccttttcttttgtattctagttcgagaacccccttgttttctcaaaacacggatttggctcaggatagccctgctatca + 3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHXVVJIommmegikl]bVWgVDRXIlbkkVfPSWVWccVVT^ebggjkkCVeWcd1@CF\u0026gt;0/11@11B011F0/0000/00111@@D1111GA01AFGCEEE///0//BAFD0000HGGFAB011B00CBAB11FA1B11\u0026gt;1111@31\u0026gt;111 Pairing the reads in exact mode # The --exact-mode option can be used to align the reads in exact mode. This option bypasses the first fast alignment step and aligns the overlapping region of the reads using the exact alignment algorithm. 
This option increases the computation time.\nobipairing -F forward.fastq -R reverse.fastq \\ --exact-mode \u0026gt; paired_exact.fastq šŸ“„ paired_exact.fastq @M01334:147:000000000-LBRVD:1:1101:14968:1570 {\u0026#34;ali_length\u0026#34;:137,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;join\u0026#34;,\u0026#34;score\u0026#34;:1687,\u0026#34;score_norm\u0026#34;:0.679,\u0026#34;seq_ab_match\u0026#34;:93} tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca..........agcaaaaaagaacaagtaacaagggaaaaccagagaaaaatcaataaaaaagaaaaaaagagagatataaagtatcaataaaataaaaaaagaaaaaaaataataaaaaactaataaaaaagaaaggtgcagagaaaaaaagggaggaaaa + 11\u0026gt;A\u0026gt;@3@A11\u0026gt;ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0\u0026gt;BF11B210B\u0026gt;//11B1\u0026lt;1BB\u0026lt;///\u0026lt;1122!!!!!!!!!!11--111111?111@11110?1112/@@11122@011FBB2121//B1111CEEHGFB11F2B2B2B2DFB12@212122B21/\u0026gt;/1D1GF\u0026gt;\u0026gt;EA1GGDD2D1///22B222GD/E11GGFAB1B0A313B3B0A111BB1111311\u0026gt;\u0026gt;11 @M01334:147:000000000-LBRVD:1:1101:15946:1586 {\u0026#34;ali_dir\u0026#34;:\u0026#34;right\u0026#34;,\u0026#34;ali_length\u0026#34;:138,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(T:16)-\u0026gt;(A:33)\u0026#34;:14,\u0026#34;(T:33)-\u0026gt;(A:17)\u0026#34;:118,\u0026#34;(T:37)-\u0026gt;(A:16)\u0026#34;:125,\u0026#34;(T:38)-\u0026gt;(A:16)\u0026#34;:32,\u0026#34;(T:39)-\u0026gt;(A:17)\u0026#34;:44},\u0026#34;score\u0026#34;:5446,\u0026#34;score_norm\u0026#34;:0.957,\u0026#34;seq_a_single\u0026#34;:13,\u0026#34;seq_ab_match\u0026#34;:132,\u0026#34;seq_b_single\u0026#34;:13} 
gctcatccgaactacctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc + 111/////2B112CMMOUO?MNObVHfcAVVHVWVVTQSWRXXIYYYXUSWiXaWeWWUWVSTTTWXgeUWWXXXWWgXWYYWVYWdUgSTTTXYYUVdTVWVXVgUWXXXVeYXfTCUXWW`QGUWfA@WSR?PRRWVARAc?UVMMOO?///BF////\u0026lt;000 @M01334:147:000000000-LBRVD:1:1101:15399:1590 {\u0026#34;ali_length\u0026#34;:137,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;join\u0026#34;,\u0026#34;score\u0026#34;:3033,\u0026#34;score_norm\u0026#34;:0.796,\u0026#34;seq_ab_match\u0026#34;:109} tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca..........agcaaaaaagaaaaagtaaaacaaaatgagaaaaagcaactaaatttaatattagataagaaatataatactacataaaataaataagagatattttatatctctaaatttttcaaaaggaaagatttggctcaggatagcccgaggaaaa + 11\u0026gt;A\u0026gt;@3B\u0026gt;\u0026gt;1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1\u0026gt;F1B1FD/////00\u0026lt;1!!!!!!!!!!11//111110B122B12000B222B222B01111111@122B22122ED2B12F@D2GF2FAD2D2222D222212D2GF2HGD1GFADAD1D1222D1D111221B0011101GFDD12FFD1B000A1011B11B1111111311\u0026gt;\u0026gt;11 @M01334:147:000000000-LBRVD:1:1101:13773:1687 {\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:54,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(C:39)-\u0026gt;(A:16)\u0026#34;:102,\u0026#34;(C:39)-\u0026gt;(A:17)\u0026#34;:121,\u0026#34;(T:39)-\u0026gt;(A:14)\u0026#34;:101},\u0026#34;score\u0026#34;:2888,\u0026#34;score_norm\u0026#34;:0.944,\u0026#34;seq_a_single\u0026#34;:97,\u0026#34;seq_ab_match\u0026#34;:51,\u0026#34;seq_b_single\u0026#34;:97} 
ctcggatcaccattgagtctctgcacctatctttaatattagataagaaaaaatattatttcttatctgaaataagaaatattttatatatttctttttctcaaaatgaaagatttggctcaggattgccctgatccgagggatagcaccattgagtctctgcacctatccttttcttttgtattctagttcgagaacccccttgttttctcaaaacacggatttggctcaggatagccctgctatca + 3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHXVVJIommmegikl]bVWgVDRXIlbkkVfPSWVWccVVT^ebggjkkCVeWcd1@CF\u0026gt;0/11@11B011F0/0000/00111@@D1111GA01AFGCEEE///0//BAFD0000HGGFAB011B00CBAB11FA1B11\u0026gt;1111@31\u0026gt;111 For this trivial data set, both results, paired.fastq and paired_exact.fastq, are identical with respect to the consensus sequence. But the annotations are different. Using the UNIX diff command, it is possible to compare the two files:\ndiff -u paired.fastq paired_exact.fastq --- paired.fastq 2025-02-23 16:50:12 +++ paired_exact.fastq 2025-02-23 17:24:37 @@ -2,15 +2,15 @@ tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca..........agcaaaaaagaacaagtaacaagggaaaaccagagaaaaatcaataaaaaagaaaaaaagagagatataaagtatcaataaaataaaaaaagaaaaaaaataataaaaaactaataaaaaagaaaggtgcagagaaaaaaagggaggaaaa + 11\u0026gt;A\u0026gt;@3@A11\u0026gt;ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0\u0026gt;BF11B210B\u0026gt;//11B1\u0026lt;1BB\u0026lt;///\u0026lt;1122!!!!!!!!!!11--111111?111@11110?1112/@@11122@011FBB2121//B1111CEEHGFB11F2B2B2B2DFB12@212122B21/\u0026gt;/1D1GF\u0026gt;\u0026gt;EA1GGDD2D1///22B222GD/E11GGFAB1B0A313B3B0A111BB1111311\u0026gt;\u0026gt;11 -@M01334:147:000000000-LBRVD:1:1101:15946:1586 
{\u0026#34;ali_dir\u0026#34;:\u0026#34;right\u0026#34;,\u0026#34;ali_length\u0026#34;:138,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(T:16)-\u0026gt;(A:33)\u0026#34;:14,\u0026#34;(T:33)-\u0026gt;(A:17)\u0026#34;:118,\u0026#34;(T:37)-\u0026gt;(A:16)\u0026#34;:125,\u0026#34;(T:38)-\u0026gt;(A:16)\u0026#34;:32,\u0026#34;(T:39)-\u0026gt;(A:17)\u0026#34;:44},\u0026#34;paring_fast_count\u0026#34;:114,\u0026#34;paring_fast_overlap\u0026#34;:138,\u0026#34;paring_fast_score\u0026#34;:0.844,\u0026#34;score\u0026#34;:5446,\u0026#34;score_norm\u0026#34;:0.957,\u0026#34;seq_a_single\u0026#34;:13,\u0026#34;seq_ab_match\u0026#34;:132,\u0026#34;seq_b_single\u0026#34;:13} +@M01334:147:000000000-LBRVD:1:1101:15946:1586 {\u0026#34;ali_dir\u0026#34;:\u0026#34;right\u0026#34;,\u0026#34;ali_length\u0026#34;:138,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(T:16)-\u0026gt;(A:33)\u0026#34;:14,\u0026#34;(T:33)-\u0026gt;(A:17)\u0026#34;:118,\u0026#34;(T:37)-\u0026gt;(A:16)\u0026#34;:125,\u0026#34;(T:38)-\u0026gt;(A:16)\u0026#34;:32,\u0026#34;(T:39)-\u0026gt;(A:17)\u0026#34;:44},\u0026#34;score\u0026#34;:5446,\u0026#34;score_norm\u0026#34;:0.957,\u0026#34;seq_a_single\u0026#34;:13,\u0026#34;seq_ab_match\u0026#34;:132,\u0026#34;seq_b_single\u0026#34;:13} gctcatccgaactacctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc + 111/////2B112CMMOUO?MNObVHfcAVVHVWVVTQSWRXXIYYYXUSWiXaWeWWUWVSTTTWXgeUWWXXXWWgXWYYWVYWdUgSTTTXYYUVdTVWVXVgUWXXXVeYXfTCUXWW`QGUWfA@WSR?PRRWVARAc?UVMMOO?///BF////\u0026lt;000 -@M01334:147:000000000-LBRVD:1:1101:15399:1590 
{\u0026#34;ali_length\u0026#34;:4,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;join\u0026#34;,\u0026#34;score\u0026#34;:126,\u0026#34;score_norm\u0026#34;:1,\u0026#34;seq_ab_match\u0026#34;:4} +@M01334:147:000000000-LBRVD:1:1101:15399:1590 {\u0026#34;ali_length\u0026#34;:137,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;join\u0026#34;,\u0026#34;score\u0026#34;:3033,\u0026#34;score_norm\u0026#34;:0.796,\u0026#34;seq_ab_match\u0026#34;:109} tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca..........agcaaaaaagaaaaagtaaaacaaaatgagaaaaagcaactaaatttaatattagataagaaatataatactacataaaataaataagagatattttatatctctaaatttttcaaaaggaaagatttggctcaggatagcccgaggaaaa + 11\u0026gt;A\u0026gt;@3B\u0026gt;\u0026gt;1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1\u0026gt;F1B1FD/////00\u0026lt;1!!!!!!!!!!11//111110B122B12000B222B222B01111111@122B22122ED2B12F@D2GF2FAD2D2222D222212D2GF2HGD1GFADAD1D1222D1D111221B0011101GFDD12FFD1B000A1011B11B1111111311\u0026gt;\u0026gt;11 -@M01334:147:000000000-LBRVD:1:1101:13773:1687 
{\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:54,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(C:39)-\u0026gt;(A:16)\u0026#34;:102,\u0026#34;(C:39)-\u0026gt;(A:17)\u0026#34;:121,\u0026#34;(T:39)-\u0026gt;(A:14)\u0026#34;:101},\u0026#34;paring_fast_count\u0026#34;:42,\u0026#34;paring_fast_overlap\u0026#34;:54,\u0026#34;paring_fast_score\u0026#34;:0.824,\u0026#34;score\u0026#34;:2888,\u0026#34;score_norm\u0026#34;:0.944,\u0026#34;seq_a_single\u0026#34;:97,\u0026#34;seq_ab_match\u0026#34;:51,\u0026#34;seq_b_single\u0026#34;:97} +@M01334:147:000000000-LBRVD:1:1101:13773:1687 {\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:54,\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(C:39)-\u0026gt;(A:16)\u0026#34;:102,\u0026#34;(C:39)-\u0026gt;(A:17)\u0026#34;:121,\u0026#34;(T:39)-\u0026gt;(A:14)\u0026#34;:101},\u0026#34;score\u0026#34;:2888,\u0026#34;score_norm\u0026#34;:0.944,\u0026#34;seq_a_single\u0026#34;:97,\u0026#34;seq_ab_match\u0026#34;:51,\u0026#34;seq_b_single\u0026#34;:97} ctcggatcaccattgagtctctgcacctatctttaatattagataagaaaaaatattatttcttatctgaaataagaaatattttatatatttctttttctcaaaatgaaagatttggctcaggattgccctgatccgagggatagcaccattgagtctctgcacctatccttttcttttgtattctagttcgagaacccccttgttttctcaaaacacggatttggctcaggatagccctgctatca + 3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHXVVJIommmegikl]bVWgVDRXIlbkkVfPSWVWccVVT^ebggjkkCVeWcd1@CF\u0026gt;0/11@11B011F0/0000/00111@@D1111GA01AFGCEEE///0//BAFD0000HGGFAB011B00CBAB11FA1B11\u0026gt;1111@31\u0026gt;111 You can see that only the description line of the sequences has been changed. 
They are the only ones that start with a + or a - in the first column. The lines starting with - are from the paired.fastq file. The lines starting with + are from the paired_exact.fastq file. Lines starting with a space are identical in both files.\nFor the two aligned sequences, the tags describing the fast alignment performed first are missing in the paired_exact.fastq file because the FASTA-derived algorithm is not run when the --exact-mode option is used.\nWith the --exact-mode option, the second joined sequence pair now has a very long overlap of 137 bases, as opposed to 4 bases with the previous command, but its score_norm value is only 0.796, which is much lower than the threshold of 0.9, leading to a rejection of the alignment.\n"},{"id":17,"href":"/obidoc/docs/programming/lua/","title":"Lua: for scripting OBITools","section":"Programming OBITools","content":" Lua: the scripting language for OBITools # Lua is a scripting language. Its name means moon in Portuguese. Its purpose is to provide an interpreter that can be easily integrated into software to add scripting capabilities. OBITools provides the obiscript command for this purpose. The obiscript command allows a small script to be applied to each selected sequence in a sequence file. The Lua interpreter used by obiscript is GopherLua Version 1.1.1 which implements Lua version 5.1.\nThe aim of this section is not to be a full introduction to Lua, but to show how to write a Lua script that can be used with obiscript.
Full documentation of Lua is available on the official website of the language (https://www.lua.org/manual/5.1).\nThe structure of a Lua script for obiscript # 📄 example.lua function begin() obicontext.item(\u0026#34;compteur\u0026#34;,0) end function worker(sequence) samples = sequence:attribute(\u0026#34;merged_sample\u0026#34;) samples[\u0026#34;tutu\u0026#34;]=4 sequence:attribute(\u0026#34;merged_sample\u0026#34;,samples) sequence:attribute(\u0026#34;toto\u0026#34;,44444) nb = obicontext.inc(\u0026#34;compteur\u0026#34;) sequence:id(\u0026#34;seq_\u0026#34; .. nb) return sequence end function finish() print(\u0026#34;compteur = \u0026#34; .. obicontext.item(\u0026#34;compteur\u0026#34;)) end The begin and finish functions # The worker function # The obicontext to share information # The OBITools classes # The BioSequence class # The BioSequenceSlice class # The Taxonomy class # The Taxon class # Dealing with the OBITools4 multithreading # 📄 extrem_quality.lua -- -- Script for obiscript extracting the qualities of -- the first `size` and last `size` base pairs of -- all the reads longer than 2 x `size` -- -- The result is a csv file on the stdout -- -- obiscript -S ../qualities.lua FAZ61712_c61e82f1_69e58200_0_nolambda.fastq \u0026gt; xxxx -- Import the io module local io = require(\u0026#34;io\u0026#34;) -- Set the output stream to the stdout of the Go program io.output(io.stdout) size = 60 function begin() obicontext.item(\u0026#34;locker\u0026#34;, Mutex:new()) header = \u0026#34;id\u0026#34; for i = 1, size do header = header .. \u0026#34;, L\u0026#34; .. i end for i = size, 1, -1 do header = header .. \u0026#34;, R\u0026#34; ..
i end obicontext.item(\u0026#34;locker\u0026#34;):lock() print(header) obicontext.item(\u0026#34;locker\u0026#34;):unlock() end function worker(sequence) l = sequence:len() if l \u0026gt; size * 2 then qualities = sequence:qualities() rep = sequence:id() for i = 1, size do rep = rep .. \u0026#34;, \u0026#34; .. qualities[i] end for i = size, 1, -1 do rep = rep .. \u0026#34;, \u0026#34; .. qualities[l - i + 1] end obicontext.item(\u0026#34;locker\u0026#34;):lock() print(rep) obicontext.item(\u0026#34;locker\u0026#34;):unlock() end return BioSequenceSlice.new() end "},{"id":18,"href":"/obidoc/docs/installation/","title":"Installation","section":"Docs","content":" Availability # The OBITools are open source and distributed under the CeCILL 2.1 license.\nAll the sources of the OBITools4 can be downloaded from the metabarcoding GitHub server (https://github.com/metabarcoding).\nPrerequisites # The OBITools4 are developed in the Go programming language, and we stick to the latest version of the language. If you want to download and compile the sources yourself, you first need to install the corresponding compiler on your system. Some parts of the software are also written in C, so a recent C compiler is also required: GCC on Linux or Windows, the Developer Tools on Mac.\nWhichever installation you choose, you will need to ensure that a C compiler is available on your system.\nInstallation # Using the installation script # An installation script that compiles the new OBITools4 on your Unix-like system is available online. The easiest way to run it is to copy and paste the following command into your terminal:\ncurl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh | bash By default, the script installs the OBITools4 commands and other associated files into the /usr/local directory. The names of the commands in the new OBITools4 are mostly identical to those in previous OBITools.
Therefore, installing the new OBITools may hide or delete the old ones. If you want both versions to be available on your system, the installation script offers two options:\n--install-dir|-i \u0026lt;PATH\u0026gt;: Directory where obitools are installed (for example, use /usr/local, not /usr/local/bin).\n--obitools-prefix|-p: Prefix added to the obitools command names if you want to have several versions of obitools at the same time on your system (for example, -p g will produce the gobigrep command instead of obigrep).\nYou can use these options by appending them to the installation command:\ncurl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh | \\ bash -s -- --install-dir test_install --obitools-prefix k In this case, the binaries will be installed in the test_install directory and all command names will be prefixed with the letter k. Thus obigrep will be named kobigrep.\n"},{"id":19,"href":"/obidoc/docs/programming/lua/obitools_classes/biosequenceslice/","title":"BioSequenceSlice","section":"Obitools Classes","content":" The BioSequenceSlice class # Constructor of BioSequenceSlice # BioSequenceSlice Methods # push # pop # sequence # len # fasta # fastq # string # "},{"id":20,"href":"/obidoc/docs/patterns/dnagrep/","title":"DNA Patterns","section":"Patterns","content":" DNA Patterns # DNA patterns are useful for describing short DNA sequences like oligonucleotides. They are used by several OBITools like obimultiplex, obipcr or obigrep. The advantage of using DNA patterns over classical regular expressions is that they can be matched with errors. Allowed errors can be simple mismatches, or mismatches and insertions/deletions.\nSyntax of a DNA Pattern # Patterns are limited to sequences up to 63 bases long. Like all DNA sequences, they are represented from the 5\u0026rsquo; end to the 3\u0026rsquo; end. Each base is represented by a single letter (A, C, G, T).
IUPAC codes can be used to represent ambiguous bases (M, K, R, Y, S, W, B, D, H, V, N; see table below). Ambiguous positions can also be denoted by a range of base characters (i.e. ATGC) surrounded by square brackets ([]): [ATC]. A range of bases can be negated by prefixing it with a !: [!AC]. Patterns do not allow for ambiguity in the number of occurrences of a base. Positions where errors are not allowed are denoted by a sharp (#) symbol after the base. Patterns are case-insensitive. Example A DNA pattern corresponding to the forward primer of the Euka02 marker with no errors allowed at the last two bases on the 3\u0026rsquo; end:\nTTTGTCTGSTTAATTSC#G#\nExample The same pattern using base ranges for indicating the second S ambiguity:\nTTTGTCTGSTTAATT[CG]C#G#\nIUPAC Codes for Ambiguous Bases # IUPAC DNA ambiguity codes\nSymbol | Bases | Origin of designation\nG | G | Guanine\nA | A | Adenine\nT | T | Thymine\nC | C | Cytosine\nR | G or A | puRine\nY | T or C | pYrimidine\nM | A or C | aMino\nK | G or T | Keto\nS | G or C | Strong interaction (3 H bonds)\nW | A or T | Weak interaction (2 H bonds)\nH | A or C or T | not-G, H follows G in the alphabet\nB | G or T or C | not-A, B follows A\nV | G or C or A | not-T (not-U), V follows U\nD | G or A or T | not-C, D follows C\nN | G or A or T or C | aNy "},{"id":21,"href":"/obidoc/formats/fasta/","title":"FASTA file format","section":"Sequence file formats","content":" The FASTA sequence file format # The FASTA sequence file format is the most widely used sequence file format. This is probably due to its simplicity. It was originally created for the Lipman and Pearson FASTA program ( Citation: Pearson\u0026#32;\u0026amp;\u0026#32;Lipman,\u0026#32;1988 Pearson,\u0026#32; W.\u0026#32;\u0026amp;\u0026#32;Lipman,\u0026#32; D. \u0026#32; (1988). \u0026#32;Improved tools for biological sequence comparison.
Proceedings of the National Academy of Sciences of the United States of America,\u0026#32;85(8).\u0026#32;2444–2448.\u0026#32;Retrieved from\u0026#32; http://www.ncbi.nlm.nih.gov/pubmed/3162770 ) .\nIn the FASTA format, a sequence is represented by a title line starting with a \u0026gt; character, followed by the sequence itself written using the IUPAC code. The sequence is usually split into several lines of the same length (except for the last one). Several sequences can be stored in the same file. The first line of the next sequence also marks the end of the previous one.\n\u0026gt;my_sequence this is my pretty sequence ACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT GTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT AACGACGTTGCAGTACGTTGCAGT The first word in the title line is the sequence identifier. The rest of the line is a description of the sequence. The OBITools extend this format by adding structured data to the title line. In the previous version of the OBITools, the structured data was stored after the sequence identifier in a key=value; format, as shown below. The sequence definition was stored as free text after the last key=value; pair.\n📄 two_sequences_obi2.fasta \u0026gt;AB061527 obicleandb_level=family; count=1; family_name=Soricidae; genus_name=Sorex; genus_taxid=9379; obicleandb_trusted=2.2137847111025621e-13; species_name=Sorex unguiculatus; species_taxid=62275; taxid=62275; family_taxid=9376; Sorex unguiculatus mitochondrial NA, complete genome. ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat agcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;AL355887 species_name=Homo sapiens; family_taxid=9604; genus_name=Homo; obicleandb_trusted=0; genus_taxid=9605; obicleandb_level=genus; species_taxid=9606; taxid=9606; count=2; family_name=Hominidae; Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.
ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac agcttaaaactcaaaggacctggcagttctttatatccct With OBITools4 a new format has been introduced to store structured data in the title line. The key/value annotation pairs are now formatted as a JSON map object. The definition is stored as an additional key/value pair using the key \u0026lsquo;definition\u0026rsquo;.\nšŸ“„ two_sequences_obi4.fasta \u0026gt;AB061527 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Sorex unguiculatus mitochondrial NA, complete genome.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Soricidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9376,\u0026#34;genus_name\u0026#34;:\u0026#34;Sorex\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9379,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;family\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:2.2137847111025621e-13,\u0026#34;species_name\u0026#34;:\u0026#34;Sorex unguiculatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:62275,\u0026#34;taxid\u0026#34;:62275} ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat agcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;AL355887 {\u0026#34;count\u0026#34;:2,\u0026#34;definition\u0026#34;:\u0026#34;Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Hominidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9604,\u0026#34;genus_name\u0026#34;:\u0026#34;Homo\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9605,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:0,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;species_taxid\u0026#34;:9606,\u0026#34;taxid\u0026#34;:9606} ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac agcttaaaactcaaaggacctggcagttctttatatccct The obiconvert command, like all other OBITools4 commands, has two options --output-json-header and 
--output-OBI-header to choose between the new JSON format and the old OBITools format. The --output-OBI-header option can be abbreviated to -O. By default, the new JSON OBITools4 format is used, so only the -O option is really useful if the old format is required for compatibility with other software.\nConverting from the new JSON format to the old OBITools format:\nobiconvert -O two_sequences_obi4.fasta \u0026gt;AB061527 obicleandb_level=family; count=1; family_name=Soricidae; genus_name=Sorex; genus_taxid=9379; obicleandb_trusted=2.2137847111025621e-13; species_name=Sorex unguiculatus; species_taxid=62275; taxid=62275; family_taxid=9376; Sorex unguiculatus mitochondrial NA, complete genome. ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat agcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;AL355887 species_name=Homo sapiens; family_taxid=9604; genus_name=Homo; obicleandb_trusted=0; genus_taxid=9605; obicleandb_level=genus; species_taxid=9606; taxid=9606; count=2; family_name=Hominidae; Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN. 
ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac agcttaaaactcaaaggacctggcagttctttatatccct Converting from the old OBITools format to the new JSON format:\nobiconvert two_sequences_obi2.fasta \u0026gt;AB061527 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Sorex unguiculatus mitochondrial NA, complete genome.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Soricidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9376,\u0026#34;genus_name\u0026#34;:\u0026#34;Sorex\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9379,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;family\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:2.2137847111025621e-13,\u0026#34;species_name\u0026#34;:\u0026#34;Sorex unguiculatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:62275,\u0026#34;taxid\u0026#34;:62275} ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat agcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;AL355887 {\u0026#34;count\u0026#34;:2,\u0026#34;definition\u0026#34;:\u0026#34;Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Hominidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9604,\u0026#34;genus_name\u0026#34;:\u0026#34;Homo\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9605,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:0,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;species_taxid\u0026#34;:9606,\u0026#34;taxid\u0026#34;:9606} ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac agcttaaaactcaaaggacctggcagttctttatatccct The actual format of the header is automatically detected when OBITools4 commands read a FASTA file.\nReferences # Pearson\u0026#32;\u0026amp;\u0026#32;Lipman (1988) Pearson,\u0026#32; W.\u0026#32;\u0026amp;\u0026#32;Lipman,\u0026#32; D. \u0026#32; (1988). 
\u0026#32;Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America,\u0026#32;85(8).\u0026#32;2444–2448.\u0026#32;Retrieved from\u0026#32; http://www.ncbi.nlm.nih.gov/pubmed/3162770 "},{"id":22,"href":"/obidoc/docs/principles/","title":"General operating principles","section":"Docs","content":" General operating principles for OBITools # OBITools are not a metabarcoding data analysis pipeline, but a set of tools for developing customized analyses, while avoiding the black-box effect of a ready-to-use pipeline. A particular effort in the development of OBITools4 has been to use data formats that can be easily interfaced with other software.\nOBITools are a set of UNIX commands that are executed from a command line interface, also known as a terminal, to perform various tasks on DNA sequence files. A UNIX command can be considered as a process that takes a set of inputs and produces a set of outputs.\ngraph LR A@{ shape: doc, label: \"input 1\" } B@{ shape: doc, label: \"input 2\" } C[Unix command] D@{ shape: doc, label: \"output 1\" } E@{ shape: doc, label: \"output 2\" } F@{ shape: doc, label: \"output 3\" } A --\u003e C B --\u003e C:::obitools C --\u003e D C --\u003e E C --\u003e F classDef obitools fill:#99d57c Most OBITools take a single file as input and produce a single file as output. Among the inputs, one has a special status: the standard input (stdin). Symmetrically, there is the standard output (stdout). By default, like any other UNIX command, the OBITools read their data from stdin and write their results to stdout.\ngraph LR A@{ shape: doc, label: \"stdin\" } C[Unix command] D@{ shape: doc, label: \"stdout\" } A --\u003e C:::obitools C --\u003e D classDef obitools fill:#99d57c If nothing is specified, the UNIX system connects standard input to the terminal keyboard and standard output to the terminal screen.
So, for example, if you enter the obiconvert command in your terminal without any arguments, it will appear to stop and do nothing, when in fact it is waiting for you to type something on the keyboard. To stop it, just press Ctrl+D or Ctrl+C. Ctrl+D signals the end of the input, which lets the program finish. Ctrl+C kills the program.\nSpecifying the input data # OBITools are designed to process DNA sequence files. Most of them therefore accept DNA sequence files as input. They can be formatted in the most common sequence file formats: fasta, fastq, EMBL-ENA and GenBank flat files. Data can also be supplied as CSV files. The OBITools generally recognize the file format of input data, but options are provided to force a specific format (i.e. --fasta, --fastq, --genbank, --embl).\nThe most common way to specify the file containing the DNA sequences to be processed is to give its name as an argument. Here is an example using obicount to count the number of DNA sequences in a file named my_file.fasta.\nobicount my_file.fasta But it is also possible to pass data using the Unix redirection mechanism (i.e. \u0026gt; and \u0026lt;).\nobicount \u0026lt; my_file.fasta OBITools can also be used to process a set of files. In this case, OBITools will process the files in the order in which they appear on the command line.\nobicount my_file1.fasta my_file2.fasta my_file3.fasta The wildcard character (i.e. *) can be used to specify a set of files to be processed.\nobicount my_file*.fasta If the files are located in a subdirectory, the directory name can be specified, without the need to specify any file name. In that case, OBITools will process all the sequence files present in the subdirectory. Sequence files are searched for recursively in the specified directory and all its sub-directories.\nobicount my_sub_directory Files considered to be DNA sequence files are those with the extension .fasta, .fastq, .genbank, .embl, .seq or .dat. Files with an additional .gz extension (e.g.
.fasta.gz) are also considered to be DNA sequence files and are processed without the need for decompression.\nImagine a folder called Genbank containing a complete copy of the Genbank database organized into subdirectories, one per division. Each division subdirectory contains a set of compressed (.gz) fasta files.\n. 📂 Genbank └── 📂 bct │ └── 📄 gbbct1.fasta.gz │ ├── 📄 gbbct2.fasta.gz │ ├── 📄 gbbct3.fasta.gz │ └── 📄 ... └── 📂 inv │ └── 📄 gbinv1.fasta.gz │ ├── 📄 gbinv2.fasta.gz │ ├── 📄 gbinv3.fasta.gz │ └── 📄... └── 📂 mam │ └── 📄 gbmam1.fasta.gz │ ├── 📄 gbmam2.fasta.gz │ ├── 📄 gbmam3.fasta.gz │ └── 📄... └── 📂... │ It is possible to count entries in the gbbct1.fasta.gz file with the command\nobicount Genbank/bct/gbbct1.fasta.gz to count the entries in the bct (bacterial) division with the command\nobicount Genbank/bct or to count the entries in the complete Genbank copy with the command\nobicount Genbank Specifying what to do with the output # By default, OBITools write their output to standard output (stdout), which means that the results of a command are printed out on the terminal screen.\nMost OBITools produce sequence files as output. The output sequence file is in fasta or fastq format, depending on whether it contains quality scores (fastq) or not (fasta). The output format of sequence files can be forced using the --fasta-output or --fastq-output options. If the --fastq-output option is used for a dataset without quality information, a default quality score of 40 will be assigned to each nucleotide. A third option is the --json-output option, which produces data in JSON format.\nWith the exception of the obisummary command, the OBITools which produce other types of data return them in CSV format.
The obisummary command returns its results in JSON or YAML formats.\nThe obicomplement command computes the reverse-complement of the DNA sequences provided as input.\nšŸ“„ two_sequences.fasta \u0026gt;AB061527 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Sorex unguiculatus mitochondrial NA, complete genome.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Soricidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9376,\u0026#34;genus_name\u0026#34;:\u0026#34;Sorex\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9379,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;family\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:2.2137847111025621e-13,\u0026#34;species_name\u0026#34;:\u0026#34;Sorex unguiculatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:62275,\u0026#34;taxid\u0026#34;:62275} ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat agcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;AL355887 {\u0026#34;count\u0026#34;:2,\u0026#34;definition\u0026#34;:\u0026#34;Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Hominidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9604,\u0026#34;genus_name\u0026#34;:\u0026#34;Homo\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9605,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:0,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;species_taxid\u0026#34;:9606,\u0026#34;taxid\u0026#34;:9606} ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac agcttaaaactcaaaggacctggcagttctttatatccct If the two_sequences.fasta file is processed with the obicomplement command, without indicating the name of the output file, the result is written to the terminal screen.\nobicomplement two_sequences.fasta \u0026gt;AB061527 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Sorex unguiculatus 
mitochondrial NA, complete genome.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Soricidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:\u0026#34;9376\u0026#34;,\u0026#34;genus_name\u0026#34;:\u0026#34;Sorex\u0026#34;,\u0026#34;genus_taxid\u0026#34;:\u0026#34;9379\u0026#34;,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;family\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:2.2137847111025621e-13,\u0026#34;species_name\u0026#34;:\u0026#34;Sorex unguiculatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:\u0026#34;62275\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;62275\u0026#34;} agggatataaagcaccgccaagtcctttgagttttaagctattgctagtagttctctgac gggtatttttgttagattaaatacctaagtttagggctaa \u0026gt;AL355887 {\u0026#34;count\u0026#34;:2,\u0026#34;definition\u0026#34;:\u0026#34;Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Hominidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:\u0026#34;9604\u0026#34;,\u0026#34;genus_name\u0026#34;:\u0026#34;Homo\u0026#34;,\u0026#34;genus_taxid\u0026#34;:\u0026#34;9605\u0026#34;,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:0,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;species_taxid\u0026#34;:\u0026#34;9606\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;9606\u0026#34;} agggatataaagaactgccaggtcctttgagttttaagctgttgctcgtagtattctgac gaatggttttgttaatgtaactactagagtttagggctaa There are two options for saving the results to a file. 
The first is to redirect the output to a file, as in the following example.\nobicomplement two_sequences.fasta \u0026gt; two_sequences_comp.fasta The second option is to use the --out option.\nobicomplement two_sequences.fasta --out two_sequences_comp.fasta Both methods will produce the same result, a file named two_sequences_comp.fasta containing the reverse-complement of the DNA sequences contained in two_sequences.fasta.\nCombining OBITools commands using pipes # Since OBITools are UNIX commands, and their default behaviour is to read their input from stdin and write their output to stdout, it is possible to combine them using the Unix pipe mechanism (i.e. |). For example, you can reverse-complement the file two_sequences.fasta with the command obicomplement, and then count the number of DNA sequences in the resulting file with the command obicount, without saving the intermediate results, by linking the stdout of obicomplement to the stdin of obicount.\nobicomplement two_sequences.fasta | obicount entities,n variants,2 reads,3 symbols,200 The result of the obicount command is a CSV file. Therefore, it can itself be piped to another command, like csvtomd to reformat the result in a Markdown table.\nobicomplement two_sequences.fasta | obicount | csvtomd entities | n ----------|----- variants | 2 reads | 3 symbols | 200 It can also be plotted with the uplot command.\nobicomplement two_sequences.fasta | obicount | uplot barplot -H -d, n ┌ ┐ variants ┤ 2.0 reads ┤ 3.0 symbols ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 200.0 └ ┘ The tagging system of OBITools # OBITools provide several tools for performing computations on the sequences. The result of such a computation may be the selection of a subset of the input sequences, a modification of the sequences themselves, or it may only lead to the estimation of some sequence properties. In the latter case, OBITools store the estimated properties alongside the relevant sequence in a fasta or fastq file.
To achieve this, OBITools add structured information in the form of a JSON map to the header of each sequence. The JSON map allows calculation results to be stored in key-value pairs. Each OBITools command adds one or more key-value pairs to the JSON map as required to annotate a sequence. Below is an example of a fasta-formatted sequence with a JSON map added to its header containing three key-value pairs: count associated with the value 2, is_annotated associated with the value true and xxxx associated with the value yyy.\n\u0026gt;sequence1 {\u0026#34;count\u0026#34;: 2, \u0026#34;is_annotated\u0026#34;: true, \u0026#34;xxxx\u0026#34;: \u0026#34;yyy\u0026#34;} cgacgtagctgtgatgcagtgcagttatattttacgtgctatgtttcagtttttttt fdcgacgcagcggag The key names # Keys can be any string of characters. Their names are case-sensitive. The keys count, Count and COUNT are all considered to be different keys. Some key names are reserved by OBITools and have special meanings (e.g. count contains, if present, an integer value indicating how many times this sequence has been observed, taxid contains a string corresponding to a taxonomic identifier from a taxonomy).\nThe tag values # Values can be strings, integers, floats, or boolean values. Values can also be of composite types but with some limitations compared to the JSON format. In OBITools4 annotations it is not possible to nest composite types. A list cannot contain a list or a map.\nA list is an ordered collection of values, in this case a list of integer values:\n[1,3,2,12] A map is a set of values indexed by a key, which is a string.
As an example, here is a map of integer values:\n{\u0026#34;toto\u0026#34;:4,\u0026#34;titi\u0026#34;:10,\u0026#34;tutu\u0026#34;:1} Maps are notably used by obiuniq to aggregate information collected from the merged sequence records.\n\u0026gt;my_seq_O1 {\u0026#34;merged_sample\u0026#34;:{\u0026#34;sample1\u0026#34;:45,\u0026#34;sample_2\u0026#34;:33}} gctagctagctgtgatgtcgtagttgctgatgctagtgctagtcgtaaaaaat Using the obiannotate command, it is possible to edit these annotations, adding new ones, deleting others, renaming keys or changing values.\nCaution You are free to add, edit and delete even the OBITools4 reserved keys to mimic the results of an OBITools4 command. But beware of the impact of these manually modified values. It is best not to modify reserved annotation keys.\nOBITools4 and the taxonomic information # One of the advantages of OBITools is their ability to handle taxonomy annotations. Each sequence in a sequence file can be individually taxonomically annotated by adding a taxid tag. Although several annotation tags can be related to taxonomic information, only the taxid tag really matters.\nThe tags associated with taxonomic annotations fall into three categories:\ntaxid The main taxonomic annotation. Any tag ending with the _taxid suffix contains secondary taxid annotations, such as family_taxid which contains the taxid at the family level. Text tags ending with _name, such as scientific_name or family_name, which contain the textual representation corresponding to the taxids. The last category is intended solely to facilitate the user\u0026rsquo;s task by making taxonomic information easier for humans to read. The second category is also intended to help the user, bearing in mind that any taxonomy-based selection implemented by OBITools4 is based solely on the taxid tag.\nTaxonomic identifiers, taxid, are short strings that uniquely identify a taxon within a taxonomy.
It is important to rely on taxids rather than Latin names to identify taxa, as distinct taxa may share the same Latin name (e.g. Vertebrata is also a genus of red algae).\nFor example, in the NCBI taxonomy, the species Homo sapiens has the taxid 9606 and belongs to the genus Homo, which has the taxid 9605. Although all NCBI taxids are numeric, OBITools4 treat them as strings: \u0026quot;9606\u0026quot; and \u0026quot;9605\u0026quot;.\nThe simplest way to specify a taxid to OBITools is to provide this short string: \u0026quot;9606\u0026quot; or \u0026quot;9605\u0026quot;.\nIf the --taxonomy or -t option, which takes a filename as parameter, is used when calling an OBITools command, the corresponding taxonomy will be loaded and every taxid present in a file (taxid and *_taxid tags) will be checked against the taxonomy. To download a copy of the NCBI taxonomy you can use the obitaxonomy command:\nobitaxonomy --download-ncbi --out ncbitaxo.tgz This will create a new ncbitaxo.tgz file containing a local copy of the complete taxonomy.\nThe first consequence of this check is that all taxids are rewritten in their long form. \u0026quot;9606\u0026quot; becomes \u0026quot;taxon:9606 [Homo sapiens]@species\u0026quot;:\ntaxon: is the taxonomy code (TAXOCOD is taxon for the NCBI taxonomy).
9606: is the taxid Homo sapiens: is the scientific name species: is the taxonomic rank So the long form of a taxid can be written as \u0026quot;TAXOCOD:TAXID [SCIENTIFIC NAME]@RANK\u0026quot;.\nIf you look at the following file, you can see that the taxid tag is set to 62275 and 9606 for the first and second sequences respectively:\n📄 two_sequences.fasta \u0026gt;AB061527 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Sorex unguiculatus mitochondrial NA, complete genome.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Soricidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9376,\u0026#34;genus_name\u0026#34;:\u0026#34;Sorex\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9379,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;family\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:2.2137847111025621e-13,\u0026#34;species_name\u0026#34;:\u0026#34;Sorex unguiculatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:62275,\u0026#34;taxid\u0026#34;:62275} ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat agcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;AL355887 {\u0026#34;count\u0026#34;:2,\u0026#34;definition\u0026#34;:\u0026#34;Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Hominidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9604,\u0026#34;genus_name\u0026#34;:\u0026#34;Homo\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9605,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:0,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;species_taxid\u0026#34;:9606,\u0026#34;taxid\u0026#34;:9606} ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac agcttaaaactcaaaggacctggcagttctttatatccct If you use obiconvert without specifying a taxonomy, its only action is to convert potential old numeric taxids (62275 and 9606) to their string 
equivalents (\u0026quot;62275\u0026quot; and \u0026quot;9606\u0026quot;).\nobiconvert two_sequences.fasta \u0026gt;AB061527 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Sorex unguiculatus mitochondrial NA, complete genome.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Soricidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:\u0026#34;9376\u0026#34;,\u0026#34;genus_name\u0026#34;:\u0026#34;Sorex\u0026#34;,\u0026#34;genus_taxid\u0026#34;:\u0026#34;9379\u0026#34;,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;family\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:2.2137847111025621e-13,\u0026#34;species_name\u0026#34;:\u0026#34;Sorex unguiculatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:\u0026#34;62275\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;62275\u0026#34;} ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat agcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;AL355887 {\u0026#34;count\u0026#34;:2,\u0026#34;definition\u0026#34;:\u0026#34;Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Hominidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:\u0026#34;9604\u0026#34;,\u0026#34;genus_name\u0026#34;:\u0026#34;Homo\u0026#34;,\u0026#34;genus_taxid\u0026#34;:\u0026#34;9605\u0026#34;,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:0,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;species_taxid\u0026#34;:\u0026#34;9606\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;9606\u0026#34;} ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac agcttaaaactcaaaggacctggcagttctttatatccct If the previously downloaded NCBI taxonomy is specified to obiconvert , the output of the command will be as follows. You will notice that, this time, the taxa are given in their long form. 
The scientific name and taxonomic rank are also given.\nobiconvert -t ncbitaxo.tgz two_sequences.fasta \u0026gt;AB061527 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Sorex unguiculatus mitochondrial NA, complete genome.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Soricidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:\u0026#34;taxon:9376 [Soricidae]@family\u0026#34;,\u0026#34;genus_name\u0026#34;:\u0026#34;Sorex\u0026#34;,\u0026#34;genus_taxid\u0026#34;:\u0026#34;taxon:9379 [Sorex]@genus\u0026#34;,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;family\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:2.2137847111025621e-13,\u0026#34;species_name\u0026#34;:\u0026#34;Sorex unguiculatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:\u0026#34;taxon:62275 [Sorex unguiculatus]@species\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:62275 [Sorex unguiculatus]@species\u0026#34;} ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat agcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;AL355887 {\u0026#34;count\u0026#34;:2,\u0026#34;definition\u0026#34;:\u0026#34;Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Hominidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;,\u0026#34;genus_name\u0026#34;:\u0026#34;Homo\u0026#34;,\u0026#34;genus_taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:0,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;species_taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;} ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac agcttaaaactcaaaggacctggcagttctttatatccct If the check reveals that taxid is 
not present in the taxonomy, a warning is issued by OBITools4. As an example, the obiconvert command applied to the following file:\n📄 four_sequences.fasta \u0026gt;AY189646 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Homo sapiens clone arCan119 12S ribosomal RNA gene, partial sequence; mitochondrial gene for mitochondrial product.\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;} ttagccctaaacctcaacagttaaatcaacaaaactgctcgccagaacactacgrgccac agcttaaaactcaaaggacctggcggtgcttcatatccct \u0026gt;AF023201 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Snyderichthys copei 12S ribosomal RNA gene, mitochondrial gene for mitochondrial RNA, complete sequence.\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Snyderichthys copei\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;67561\u0026#34;} tcagccataaacctagatgtccagctacagttagacatccgcccgggtactacgagcatt agcttgaaacccaaaggacctgacggtgccttagaccccc \u0026gt;JN897380 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Nihonotrypaea thermophila mitochondrion, complete genome.\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Nihonotrypaea thermophila\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;1114968\u0026#34;} tagccttaacaaacatactaaaatattaaaagttatggtctctaaatttaaaggatttgg cggtaatttagtccag \u0026gt;KC236422 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Nihonotrypaea japonica mitochondrion, complete genome.\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Nihonotrypaea japonica\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;5799994\u0026#34;} cagctttaacaaacatactaaaatattaaaagttatggtctctaaatttaaaggatttgg cggtaatttagtccag displays the following warnings:\nobiconvert -t ncbitaxo.tgz four_sequences.fasta INFO[0000] Number of workers set 16 INFO[0000] Found 1 files to process INFO[0000] four_sequences.fasta mime type: text/fasta 
INFO[0000] On output use JSON headers INFO[0000] Output is done on stdout INFO[0000] Data is writen to stdout INFO[0000] NCBI Taxdump Tar Archive detected: ncbitaxo.tgz INFO[0000] Loading Taxonomy nodes INFO[0003] 2653519 Taxonomy nodes read INFO[0003] Loading Taxon names INFO[0005] 2653519 taxon names read INFO[0005] Loading Merged taxa INFO[0005] 88919 merged taxa read WARN[0005] AF023201: Taxid 67561 has to be updated to taxon:305503 [Lepidomeda copei]@species WARN[0005] JN897380: Taxid 1114968 has to be updated to taxon:2734678 [Neotrypaea thermophila]@species WARN[0005] KC236422: Taxid: 5799994 is unknown from taxonomy (Taxid 5799994 is not part of the taxonomy NCBI Taxonomy) Of the four sequences, only the first sequence has a taxid known from the NCBI taxonomy. The other three sequences have taxids that are not part of the NCBI taxonomy. In fact, the second and third sequences have taxids that were known in the NCBI taxonomy, but are now transferred to other taxids. The fourth sequence has a taxid that is actually unknown in the NCBI taxonomy.\nSince only the first sequence AY189646 has a known taxid in the output, the taxids are rewritten in long form for this sequence only. For the other three sequences, the taxids are left as they were before. 
Nevertheless, all four sequences are present in the output.\n\u0026gt;AY189646 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Homo sapiens clone arCan119 12S ribosomal RNA gene, partial sequence; mitochondrial gene for mitochondrial product.\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;} ttagccctaaacctcaacagttaaatcaacaaaactgctcgccagaacactacgrgccac agcttaaaactcaaaggacctggcggtgcttcatatccct \u0026gt;AF023201 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Snyderichthys copei 12S ribosomal RNA gene, mitochondrial gene for mitochondrial RNA, complete sequence.\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Snyderichthys copei\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;67561\u0026#34;} tcagccataaacctagatgtccagctacagttagacatccgcccgggtactacgagcatt agcttgaaacccaaaggacctgacggtgccttagaccccc \u0026gt;JN897380 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Nihonotrypaea thermophila mitochondrion, complete genome.\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Nihonotrypaea thermophila\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;1114968\u0026#34;} tagccttaacaaacatactaaaatattaaaagttatggtctctaaatttaaaggatttgg cggtaatttagtccag \u0026gt;KC236422 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Nihonotrypaea japonica mitochondrion, complete genome.\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Nihonotrypaea japonica\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;5799994\u0026#34;} cagctttaacaaacatactaaaatattaaaagttatggtctctaaatttaaaggatttgg cggtaatttagtccag If the --update-taxid option is used, the OBITools4 command will update the taxids of sequences that have been transferred to other taxids. 
When executed on the same sequence file, the same three warnings appear, but the first two warnings announce that the taxids have been updated.\nobiconvert -t ncbitaxo.tgz --update-taxid four_sequences.fasta WARN[0007] AF023201: Taxid: 67561 is updated to taxon:305503 [Lepidomeda copei]@species WARN[0007] JN897380: Taxid: 1114968 is updated to taxon:2734678 [Neotrypaea thermophila]@species WARN[0007] KC236422: Taxid: 5799994 is unknown from taxonomy (Taxid 5799994 is not part of the taxonomy NCBI Taxonomy) In the output, the taxids are rewritten in long format for the first sequence as before, but also for the next two sequences, taking into account their updated taxids.\n\u0026gt;AY189646 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Homo sapiens clone arCan119 12S ribosomal RNA gene, partial sequence; mitochondrial gene for mitochondrial product.\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;} ttagccctaaacctcaacagttaaatcaacaaaactgctcgccagaacactacgrgccac agcttaaaactcaaaggacctggcggtgcttcatatccct \u0026gt;AF023201 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Snyderichthys copei 12S ribosomal RNA gene, mitochondrial gene for mitochondrial RNA, complete sequence.\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Snyderichthys copei\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:305503 [Lepidomeda copei]@species\u0026#34;} tcagccataaacctagatgtccagctacagttagacatccgcccgggtactacgagcatt agcttgaaacccaaaggacctgacggtgccttagaccccc \u0026gt;JN897380 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Nihonotrypaea thermophila mitochondrion, complete genome.\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Nihonotrypaea thermophila\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:2734678 [Neotrypaea thermophila]@species\u0026#34;} tagccttaacaaacatactaaaatattaaaagttatggtctctaaatttaaaggatttgg 
cggtaatttagtccag \u0026gt;KC236422 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Nihonotrypaea japonica mitochondrion, complete genome.\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Nihonotrypaea japonica\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;5799994\u0026#34;} cagctttaacaaacatactaaaatattaaaagttatggtctctaaatttaaaggatttgg cggtaatttagtccag If the --fail-on-taxonomy option is used, the OBITools4 command will abort if it encounters a taxid that is not in the NCBI taxonomy. If it is run on the same sequence file, you will see the error message that stops the command when reading the last sequence annotated with a taxid that is not in the NCBI taxonomy. If the --update-taxid option was not used, the command would also have been aborted on the sequence AF023201.\nobiconvert -t ncbitaxo.tgz --update-taxid --fail-on-taxonomy four_sequences.fasta WARN[0007] AF023201: Taxid: 67561 is updated to taxon:305503 [Lepidomeda copei]@species WARN[0007] JN897380: Taxid: 1114968 is updated to taxon:2734678 [Neotrypaea thermophila]@species FATA[0007] KC236422: Taxid: 5799994 is unknown from taxonomy (Taxid 5799994 is not part of the taxonomy NCBI Taxonomy) To remove invalid taxids from your file, you can use obigrep to keep only sequences with a valid taxid. 
This is the role of the --valid-taxid option.\nobigrep -t ncbitaxo.tgz \\ --update-taxid \\ --valid-taxid four_sequences.fasta WARN[0006] KC236422: Taxid: 5799994 is unknown from taxonomy (Taxid 5799994 is not part of the taxonomy NCBI Taxonomy) WARN[0006] AF023201: Taxid: 67561 is updated to taxon:305503 [Lepidomeda copei]@species WARN[0006] JN897380: Taxid: 1114968 is updated to taxon:2734678 [Neotrypaea thermophila]@species Although the same three warnings occur, only the first three sequences are preserved in the resulting file.\n\u0026gt;AY189646 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Homo sapiens clone arCan119 12S ribosomal RNA gene, partial sequence; mitochondrial gene for mitochondrial product.\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;} ttagccctaaacctcaacagttaaatcaacaaaactgctcgccagaacactacgrgccac agcttaaaactcaaaggacctggcggtgcttcatatccct \u0026gt;AF023201 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Snyderichthys copei 12S ribosomal RNA gene, mitochondrial gene for mitochondrial RNA, complete sequence.\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Snyderichthys copei\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:305503 [Lepidomeda copei]@species\u0026#34;} tcagccataaacctagatgtccagctacagttagacatccgcccgggtactacgagcatt agcttgaaacccaaaggacctgacggtgccttagaccccc \u0026gt;JN897380 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Nihonotrypaea thermophila mitochondrion, complete genome.\u0026#34;,\u0026#34;species_name\u0026#34;:\u0026#34;Nihonotrypaea thermophila\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:2734678 [Neotrypaea thermophila]@species\u0026#34;} tagccttaacaaacatactaaaatattaaaagttatggtctctaaatttaaaggatttgg cggtaatttagtccag Manipulating paired sequence files with OBITools4 # Sequencing machines, particularly 
Illumina machines, produce paired-read data sets. The two paired reads correspond to two sequencings of the same DNA molecule from either end. They are commonly referred to as \u0026lsquo;forward reads\u0026rsquo; and \u0026lsquo;reverse reads\u0026rsquo;.\nToday, these paired reads are provided to the biologist in the form of two fastq files. These files assume that the two reads corresponding to the sequencing of the same DNA molecule are in the same position in the two files. If the data manipulations that delete or insert sequences in these files are not performed symmetrically, it is very likely that they will be out of phase, so that the two sequences will no longer be in the same position.\n📄 forward.fastq @M01334:147:000000000-LBRVD:1:1101:14968:1570 1:N:0:CTCACCAA+CTAGGCAA TGTTCCACGGGCAATCCTGAGCCAAATCTTTCATTTTGAAAAAATGAGAGATATAATGTATCTCTTATTTATTATAAGAAATAAAATATTTCTTATCTAATATTAAAGTTAGGTGCAGAGACTCAATGGGTGGAACTAGATCGGATGTGCA + 11\u0026gt;A\u0026gt;@3@A11\u0026gt;ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0\u0026gt;BF11B210B\u0026gt;//11B1\u0026lt;1BB\u0026lt;///\u0026lt;1122 @M01334:147:000000000-LBRVD:1:1101:15946:1586 1:N:0:CTCACCAA+CTAGGCAA TCCTAACCCCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAATATTTTATTTCTTATAATAAATAAGAGATATTTTATATCTCTCATTTTTTCAAAATGAAAGATTTGGCTCAGGATTGCCCACGTAACGGAGATCGGAAGAGC + 1\u0026gt;\u0026gt;A111\u0026gt;\u0026gt;\u0026gt;AFGGB1FFGFGFF3BBF1GGHHH33D2GH2B1D211110D1DGHHBFGGGGG2FA2F221F21A1F0D1DGHH2FAFFGFHFFGHHHHGG22@1BD111@0FFHE11GC1001BGF1B1B/EF00??////BF////\u0026lt;000 @M01334:147:000000000-LBRVD:1:1101:15399:1590 1:N:0:CTCACCAA+CTAGGCAA TGTTCCACCCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAATATTTTACTTCTTATAATAAATAAGAGTTATTTTATATCTCTCATTTTTTCAAAATGAAAGATTTGGCTCAGGATTGCCCGTGGAACTAGATCGGAAGAGCA + 
11\u0026gt;A\u0026gt;@3B\u0026gt;\u0026gt;1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1\u0026gt;F1B1FD/////00\u0026lt;1 @M01334:147:000000000-LBRVD:1:1101:13773:1687 1:N:0:CTCACCAA+CTAGGCAA CTCGGATCACCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAAAAATATTATTTCTTATCTGAAATAAGAAATATTTTATATATTTCTTTTTCTCAAAATGAAAGATTTGGCTCAGGATTGCCCTGATCCGAGGGATAGCACCA + 3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHHHHHHHFFHHFBDFBCGHHF4BGHFGFFHHBDGFHHEHHFAAEECEGF3FDGFC 📄 reverse.fastq @M01334:147:000000000-LBRVD:1:1101:14968:1570 2:N:0:CTCACCAA+CTAGGCAA TTTTCCTCCCTTTTTTTCTCTGCACCTTTCTTTTTTATTAGTTTTTTATTATTTTTTTTCTTTTTTTATTTTATTGATACTTTATATCTCTCTTTTTTTCTTTTTTATTGATTTTTCTCTGGTTTTCCCTTGTTACTTGTTCTTTTTTGCT + 11\u0026gt;\u0026gt;1131111BB111A0B3B313A0B1BAFGG11E/DG222B22///1D2DDGG1AE\u0026gt;\u0026gt;FG1D1/\u0026gt;/12B221212@21BFD2B2B2B2F11BFGHEEC1111B//1212BBF110@22111@@/2111?01111@111?111111--11 @M01334:147:000000000-LBRVD:1:1101:15946:1586 2:N:0:CTCACCAA+CTAGGCAA CCGTTACGTGGGCAATCCTGAGCCAATTCTTTCTTTTTGAAAAAATGAGAGATATAAAATATCTCTTATTTATTATAAGAAATAAAATATTTCTTATCTAATATTAATGATAGGTGCAGTGACTCTATGGGGTTAGGTAGTTCGGATGAGC + 111\u0026gt;\u0026gt;111B111111BA0B1101B001BAGGH22DGGH?01110/B11111/D1D2221D1DBEDGH1GHH2GG2F222110D@111D1DFGEGFBG@GB1B2FG22222B220B11111111B@11B210/?E/00B211B2/////111 @M01334:147:000000000-LBRVD:1:1101:15399:1590 2:N:0:CTCACCAA+CTAGGCAA TTTTCCTCGGGCTATCCTGAGCCAAATCTTTCCTTTTGAAAAATTTAGAGATATAAAATATCTCTTATTTATTTTATGTAGTATTATATTTCTTATCTAATATTAAATTTAGTTGCTTTTTCTCATTTTGTTTTACTTTTTCTTTTTTGCT + 11\u0026gt;\u0026gt;1131111111B11B1101A000B1DFF21DDFG1011100B122111D1D2221D1DADAFG1DGH2FG2D212222D2222D2DAF2FG2D@F21B2DE22122B221@11111110B222B222B00021B221B011111//11 @M01334:147:000000000-LBRVD:1:1101:13773:1687 2:N:0:CTCACCAA+CTAGGCAA 
TGATAGCAGGGCTATCCTGAGCCAAATCCGTGTTTTGAGAAAACAAGGGGGTTCTCGAACTAGAATACAAAAGAAAAGGATAGGTGCAGAGACTCAATGGTGCTATCCCTCGGATCAGGGCAATCCTTAGCCAAATCTTTCATTTTTTGAA + 111\u0026gt;13@1111\u0026gt;11B1AF11BABC00B110BAFGGH0000DFAB//0///EEECGFA10AG1111D@@11100/0000/0F110B11@11/0\u0026gt;FC@1B\u0026gt;1B11FEFEC\u0026gt;E\u0026gt;///?\u0026lt;0110/?/FF\u0026lt;G22111@00@\u0026lt;GHHB\u0026gt;FHHH1///1 In the two files above, the first sequence of the forward.fastq file with the ID M01334:147:000000000-LBRVD:1:1101:14968:1570 is paired with the first sequence of the reverse.fastq file with the same ID M01334:147:000000000-LBRVD:1:1101:14968:1570, not because they have the same identifier, but because they are both the first sequence of their respective files.\nSome of the OBITools4 commands, such as obiconvert , obigrep or obiannotate offer a --paired-with option. This option takes a filename as a parameter. It tells the OBITools4 command that the file given as an argument is paired with the file being processed. Therefore, the OBITools4 commands will process both the forward and reverse files in parallel.\nAs the --paired-with option allows the OBITools4 command to process two files, it also produces two result files. As a result, standard output cannot be used to return the results. Therefore, when using the --paired-with option, the --out option must be used. The --out option takes a filename as a parameter and tells the OBITools4 command to write the result to the specified file. As a single filename is given, the OBITools4 command modifies this filename by adding a suffix _R1 or _R2 to create two filenames.\nobiconvert --paired-with reverse.fastq \\ --out result.fasta \\ --fasta-output \\ forward.fastq This command processes the forward.fastq and the reverse.fastq as two paired files. It then converts them into two fasta files named result_R1.fasta and result_R2.fasta for the forward and reverse reads respectively.\nls -l *.fast? 
-rw-r--r--@ 1 myself staff 1504 8 mar 15:09 forward.fastq -rw-r-----@ 1 myself staff 964 8 mar 17:36 result_R1.fasta -rw-r-----@ 1 myself staff 964 8 mar 17:36 result_R2.fasta -rw-r--r--@ 1 myself staff 1504 8 mar 15:09 reverse.fastq The ls command is used here to see the results of the above obiconvert command, with the two resulting files and their names built by adding the suffixes _R1 or _R2 at the end of the filename just before the extension.\n"},{"id":23,"href":"/obidoc/obitools/obiconvert/","title":"obiconvert","section":"Basics","content":" obiconvert: convert a sequence file # Description # Convert a sequence file to fasta, fastq, or JSON format.\nSynopsis # obiconvert [--batch-size \u0026lt;int\u0026gt;] [--compress|-Z] [--csv] [--debug] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--help|-h|-?] [--input-OBI-header] [--input-json-header] [--json-output] [--max-cpu \u0026lt;int\u0026gt;] [--no-order] [--no-progressbar] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--paired-with \u0026lt;FILENAME\u0026gt;] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--raw-taxid] [--silent-warning] [--skip-empty] [--solexa] [--taxonomy|-t \u0026lt;string\u0026gt;] [--u-to-t] [--update-taxid] [--version] [--with-leaves] [\u0026lt;args\u0026gt;] Options # obiconvert specific options # --paired-with \u0026lt;FILENAME\u003e: filename containing the paired reads. Check taxids against a taxonomy # OBITools4 allow loading a taxonomy database when they are processing sequence data. When a taxonomy is loaded, the command checks the validity of taxids as it processes the data. Three cases can occur: The taxon is valid The taxon is no longer valid, but a new one replaces it The taxon is no longer valid, and no new taxid exists to replace it. 
In the first case, the obitools normalize the writing of the taxid in the form: TAXCOD:TAXID [SCIENTIFIC NAME]@RANK As an example, with the NCBI taxonomy the human taxid looks like: taxon:9606 [Homo sapiens]@species That rewriting doesn't happen if the --raw-taxid option is set. In that case only the raw taxid is conserved. 9606 In the second case, a warning message is logged on the standard error. If the --update-taxid option is set, the command will update the expired taxid to the new equivalent one, and the valid taxon rules apply. Otherwise, the old taxid is maintained in the result. In the last case, a warning message is also issued to the standard error, and the non-valid taxid is conserved as is. If the --fail-on-taxonomy option is set, the command stops and exits with an error at the first non-valid taxid encountered in the input data. --taxonomy | -t \u0026lt;string\u003e: Path to the taxonomic database. --raw-taxid: Displays the raw taxid for each displayed taxon. (default: false) --update-taxid: Makes obitools automatically update the taxids that are declared merged to a newer one (default: false). --fail-on-taxonomy: Makes obitools fail with an error if a used taxid is not a currently valid one (default: false). Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. 
Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. 
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obiconvert --help "},{"id":24,"href":"/obidoc/obitools/obipcr/","title":"obipcr","section":"Sequence alignments","content":" obipcr: the electronic PCR tool # Description # The obipcr program is the successor of ecoPCR. 
It is an in silico PCR program.\nSynopsis # obipcr --forward \u0026lt;string\u0026gt; --max-length|-L \u0026lt;int\u0026gt; --reverse \u0026lt;string\u0026gt; [--allowed-mismatches|-e \u0026lt;int\u0026gt;] [--batch-size \u0026lt;int\u0026gt;] [--circular|-c] [--compress|-Z] [--debug] [--delta|-D \u0026lt;int\u0026gt;] [--ecopcr] [--embl] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--fragmented] [--genbank] [--help|-h|-?] [--input-OBI-header] [--input-json-header] [--json-output] [--max-cpu \u0026lt;int\u0026gt;] [--min-length|-l \u0026lt;int\u0026gt;] [--no-order] [--no-progressbar] [--only-complete-flanking] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--paired-with \u0026lt;FILENAME\u0026gt;] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--skip-empty] [--solexa] [--version] [\u0026lt;args\u0026gt;] Options # obipcr mandatory options # --forward \u0026lt;PATTERN\u003e: The forward primer used for the electronic PCR. IUPAC codes can be used in the pattern. --reverse \u0026lt;PATTERN\u003e: The reverse primer used for the electronic PCR. IUPAC codes can be used in the pattern. --max-length | -L \u0026lt;INTEGER\u003e: Maximum length of the barcode, primers excluded. Other obipcr specific options # --allowed-mismatches | -e \u0026lt;INTEGER\u003e: Maximum number of mismatches allowed for each primer (default: 0). --min-length | -l \u0026lt;INTEGER\u003e: Minimum length of the barcode, primers excluded (default: no minimum length). --circular | -c : Considers that sequences are circular. (default: sequences are considered linear) --delta | -D \u0026lt;INTEGER\u003e: Without this option, only the barcode sequences will be output, without the priming sites. This option adds the priming sites and their flanking sequences, over a length of delta, to each side of the barcode. 
--only-complete-flanking: Works in conjunction with --delta. Prints only sequences with full-length flanking sequences (default: prints every sequence regardless of whether the flanking sequences are present). Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. 
Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. 
Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # The minimal obipcr command looks like this:\nobipcr -L 220 \\ --forward GGGCAATCCTGAGCCAA \\ --reverse CCATTGAGTCTCTGCACCTATC \\ /data/Genbank/Release_261 \\ \u0026gt; Sper01_obipcr.fasta It retrieves the sequences from the NCBI Genbank Release 261 database located in the /data/Genbank/Release_261 directory. The output is saved in the file Sper01_obipcr.fasta. The primer pair is specified as GGGCAATCCTGAGCCAA and CCATTGAGTCTCTGCACCTATC using the --forward and --reverse options. These primers correspond to the Sper01 marker. The -L option specifies the maximum length of the barcode excluding the primers, here 220 nucleotides. By default, no mismatches are allowed between the primers and the priming sites.\nTo allow mismatches between the primers and the priming sites, use the --allowed-mismatches option or its short form -e. Here, the maximum number of mismatches allowed is 3. This maximum applies to each primer separately. Mismatches can occur anywhere in the primer.\nobipcr -e 3 \\ -L 220 \\ --forward GGGCAATCCTGAGCCAA \\ --reverse CCATTGAGTCTCTGCACCTATC \\ /data/Genbank/Release_261 \\ \u0026gt; Sper01_obipcr.fasta To disallow mismatches at specific positions, add a # sign after the blocked position. For example, GGGCAATCCTGAGCCAA# disallows mismatches at the last position of the forward primer. 
Since # is also used to introduce comments in a bash script, the primer containing the # sign must be enclosed in single or double quotes.\nobipcr -e 3 \\ -L 220 \\ --forward \u0026#34;GGGCAATCCTGAGCCAA#\u0026#34; \\ --reverse \u0026#34;CCATTGAGTCTCTGCACCTATC\u0026#34; \\ /data/Genbank/Release_261 \\ \u0026gt; Sper01_obipcr.fasta "},{"id":25,"href":"/obidoc/formats/fastq/","title":"FASTQ file format","section":"Sequence file formats","content":" The FASTQ sequence file format # The FASTQ sequence file format is widely used for storing biological sequences and their corresponding quality scores. It was originally developed at the Wellcome Trust Sanger Institute to bundle a fasta sequence together with its quality data ( Citation: Cock,\u0026#32;Fields \u0026amp; al.,\u0026#32;2010 Cock,\u0026#32; P.,\u0026#32; Fields,\u0026#32; C.,\u0026#32; Goto,\u0026#32; N.,\u0026#32; Heuer,\u0026#32; M.\u0026#32;\u0026amp;\u0026#32;Rice,\u0026#32; P. \u0026#32; (2010). \u0026#32;The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic acids research,\u0026#32;38(6).\u0026#32;1767–1771. https://doi.org/10.1093/nar/gkp1137 ) . The format has become the de facto standard for storing the output of high-throughput sequencing instruments.\nIn FASTQ format, each sequence entry consists of four lines:\nA sequence identifier line beginning with an @ character The raw sequence letters using the IUPAC code A separator line beginning with a + character (optionally followed by the same sequence identifier) The quality scores encoded in ASCII format @my_sequence this is my pretty sequence ACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT + CCCCCCC\u0026lt;CcCccbe[`F`accXV\u0026lt;TA\\RYU\\\\ee_e[XZ The first word after the \u0026lsquo;@\u0026rsquo; symbol in the identifier line is the sequence identifier. 
The rest of the line is a description of the sequence.\nThe quality line gives the quality score assigned to each base by the sequencing machine during the sequencing process. Each score indicates the probability that the corresponding base was read incorrectly.\n\\[ P(error) = 10^{-\\frac{Q}{10}} \\] Sequencers typically provide quality scores in the range of \\(0\\) to \\(40\\) , which corresponds to a probability of error \\(P(error)\\) in the range of \\(10^{0} = 1\\) to \\(10^{-4}\\) . The higher the score, the lower the probability of error.\nFigure: Quality scores and chance of sequencing error.\nIn FASTQ format, the sequence of quality scores is encoded as an ASCII string where each score is mapped to an ASCII character. The quality score \\(0\\) is encoded as the character !. The quality score \\(40\\) is encoded as the character I (uppercase i).\n\\[ASCII\\,CODE = Q + 33 \\] The OBITools extend this format by adding structured data to the identifier line. In the previous version of the OBITools, the structured data was stored after the sequence identifier in a key=value; format, as shown below. 
The sequence definition was stored as free text after the last key=value; pair.\n📄 two_sequences_obi2.fastq @HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 ali_length=62; mode=alignment; pairing_mismatches={\u0026#39;(T:26)-\u0026gt;(G:13)\u0026#39;:62,\u0026#39;(T:34)-\u0026gt;(G:18)\u0026#39;:48}; score=484; seq_b_single=46; ali_dir=left; score_norm=0.968; seq_a_single=46; seq_ab_match=60; sequence definition here ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC\u0026lt;CcCccbe[`F`accXV\u0026lt;TA\\RYU\\\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC @HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1 mode=alignment; seq_a_single=46; seq_ab_match=52; score=283; score_norm=0.839; seq_b_single=46; ali_dir=left; ali_length=62; pairing_mismatches={\u0026#39;(A:02)-\u0026gt;(G:30)\u0026#39;:104,\u0026#39;(A:34)-\u0026gt;(G:14)\u0026#39;:64,\u0026#39;(C:02)-\u0026gt;(A:30)\u0026#39;:86,\u0026#39;(C:02)-\u0026gt;(T:20)\u0026#39;:108,\u0026#39;(C:27)-\u0026gt;(G:32)\u0026#39;:83,\u0026#39;(C:34)-\u0026gt;(G:18)\u0026#39;:57,\u0026#39;(T:02)-\u0026gt;(G:26)\u0026#39;:87,\u0026#39;(T:22)-\u0026gt;(G:14)\u0026#39;:66,\u0026#39;(T:29)-\u0026gt;(G:11)\u0026#39;:62,\u0026#39;(T:32)-\u0026gt;(G:30)\u0026#39;:48}; sequence definition here ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaagagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + CCCCCCCCCCCCCCCCCCCCCCCBBCCC?BCCCCCBC?CCCC@@;AVA`cWeb_TYC\\UIN?IDP8QJMKRPVGLQAFPPc`AbAFB5A4\u0026gt;AAA56A\u0026gt;\u0026lt;\u0026gt;8\u0026gt;\u0026gt;F@A\u0026gt;\u0026lt;8??@BB+\u0026lt;?;?C@9CCCCCC\u0026lt;CC=CCCCCCCCCBC?CBCCCCC@CC With OBITools4 
a new format has been introduced to store structured data in the identifier line. The key/value annotation pairs are now formatted as a JSON map object. The definition is stored as an additional key/value pair using the key definition.\n📄 two_sequences_obi4.fastq @HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 {\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:62,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(T:26)-\u0026gt;(G:13)\u0026#34;:62,\u0026#34;(T:34)-\u0026gt;(G:18)\u0026#34;:48},\u0026#34;score\u0026#34;:484,\u0026#34;score_norm\u0026#34;:0.968,\u0026#34;seq_a_single\u0026#34;:46,\u0026#34;seq_ab_match\u0026#34;:60,\u0026#34;seq_b_single\u0026#34;:46,\u0026#34;definition\u0026#34;:\u0026#34;sequence definition here\u0026#34;} ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC\u0026lt;CcCccbe[`F`accXV\u0026lt;TA\\RYU\\\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC @HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1 
{\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:62,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(A:02)-\u0026gt;(G:30)\u0026#34;:104,\u0026#34;(A:34)-\u0026gt;(G:14)\u0026#34;:64,\u0026#34;(C:02)-\u0026gt;(A:30)\u0026#34;:86,\u0026#34;(C:02)-\u0026gt;(T:20)\u0026#34;:108,\u0026#34;(C:27)-\u0026gt;(G:32)\u0026#34;:83,\u0026#34;(C:34)-\u0026gt;(G:18)\u0026#34;:57,\u0026#34;(T:02)-\u0026gt;(G:26)\u0026#34;:87,\u0026#34;(T:22)-\u0026gt;(G:14)\u0026#34;:66,\u0026#34;(T:29)-\u0026gt;(G:11)\u0026#34;:62,\u0026#34;(T:32)-\u0026gt;(G:30)\u0026#34;:48},\u0026#34;score\u0026#34;:283,\u0026#34;score_norm\u0026#34;:0.839,\u0026#34;seq_a_single\u0026#34;:46,\u0026#34;seq_ab_match\u0026#34;:52,\u0026#34;seq_b_single\u0026#34;:46,\u0026#34;definition\u0026#34;:\u0026#34;sequence definition here\u0026#34;} ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaagagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + CCCCCCCCCCCCCCCCCCCCCCCBBCCC?BCCCCCBC?CCCC@@;AVA`cWeb_TYC\\UIN?IDP8QJMKRPVGLQAFPPc`AbAFB5A4\u0026gt;AAA56A\u0026gt;\u0026lt;\u0026gt;8\u0026gt;\u0026gt;F@A\u0026gt;\u0026lt;8??@BB+\u0026lt;?;?C@9CCCCCC\u0026lt;CC=CCCCCCCCCBC?CBCCCCC@CC The obiconvert command, like all other OBITools4 commands, has two options --output-json-header and --output-OBI-header to choose between the new JSON format and the old OBITools format. The --output-OBI-header option can be abbreviated to -O. 
By default, the new JSON OBITools4 format is used, so the -O option is only needed when the old format is required for compatibility with other software.\nConverting from the new JSON format to the old OBITools format:\nobiconvert -O two_sequences_obi4.fastq @HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 ali_length=62; mode=alignment; pairing_mismatches={\u0026#39;(T:26)-\u0026gt;(G:13)\u0026#39;:62,\u0026#39;(T:34)-\u0026gt;(G:18)\u0026#39;:48}; score=484; seq_b_single=46; ali_dir=left; score_norm=0.968; seq_a_single=46; seq_ab_match=60; sequence definition here ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC\u0026lt;CcCccbe[`F`accXV\u0026lt;TA\\RYU\\\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC @HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1 mode=alignment; seq_a_single=46; seq_ab_match=52; score=283; score_norm=0.839; seq_b_single=46; ali_dir=left; ali_length=62; pairing_mismatches={\u0026#39;(A:02)-\u0026gt;(G:30)\u0026#39;:104,\u0026#39;(A:34)-\u0026gt;(G:14)\u0026#39;:64,\u0026#39;(C:02)-\u0026gt;(A:30)\u0026#39;:86,\u0026#39;(C:02)-\u0026gt;(T:20)\u0026#39;:108,\u0026#39;(C:27)-\u0026gt;(G:32)\u0026#39;:83,\u0026#39;(C:34)-\u0026gt;(G:18)\u0026#39;:57,\u0026#39;(T:02)-\u0026gt;(G:26)\u0026#39;:87,\u0026#39;(T:22)-\u0026gt;(G:14)\u0026#39;:66,\u0026#39;(T:29)-\u0026gt;(G:11)\u0026#39;:62,\u0026#39;(T:32)-\u0026gt;(G:30)\u0026#39;:48}; sequence definition here ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaagagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + 
CCCCCCCCCCCCCCCCCCCCCCCBBCCC?BCCCCCBC?CCCC@@;AVA`cWeb_TYC\\UIN?IDP8QJMKRPVGLQAFPPc`AbAFB5A4\u0026gt;AAA56A\u0026gt;\u0026lt;\u0026gt;8\u0026gt;\u0026gt;F@A\u0026gt;\u0026lt;8??@BB+\u0026lt;?;?C@9CCCCCC\u0026lt;CC=CCCCCCCCCBC?CBCCCCC@CC Converting from the old OBITools format to the new JSON format:\nobiconvert two_sequences_obi2.fastq @HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 {\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:62,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(T:26)-\u0026gt;(G:13)\u0026#34;:62,\u0026#34;(T:34)-\u0026gt;(G:18)\u0026#34;:48},\u0026#34;score\u0026#34;:484,\u0026#34;score_norm\u0026#34;:0.968,\u0026#34;seq_a_single\u0026#34;:46,\u0026#34;seq_ab_match\u0026#34;:60,\u0026#34;seq_b_single\u0026#34;:46,\u0026#34;definition\u0026#34;:\u0026#34;sequence definition here\u0026#34;} ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC\u0026lt;CcCccbe[`F`accXV\u0026lt;TA\\RYU\\\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC @HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1 
{\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:62,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(A:02)-\u0026gt;(G:30)\u0026#34;:104,\u0026#34;(A:34)-\u0026gt;(G:14)\u0026#34;:64,\u0026#34;(C:02)-\u0026gt;(A:30)\u0026#34;:86,\u0026#34;(C:02)-\u0026gt;(T:20)\u0026#34;:108,\u0026#34;(C:27)-\u0026gt;(G:32)\u0026#34;:83,\u0026#34;(C:34)-\u0026gt;(G:18)\u0026#34;:57,\u0026#34;(T:02)-\u0026gt;(G:26)\u0026#34;:87,\u0026#34;(T:22)-\u0026gt;(G:14)\u0026#34;:66,\u0026#34;(T:29)-\u0026gt;(G:11)\u0026#34;:62,\u0026#34;(T:32)-\u0026gt;(G:30)\u0026#34;:48},\u0026#34;score\u0026#34;:283,\u0026#34;score_norm\u0026#34;:0.839,\u0026#34;seq_a_single\u0026#34;:46,\u0026#34;seq_ab_match\u0026#34;:52,\u0026#34;seq_b_single\u0026#34;:46,\u0026#34;definition\u0026#34;:\u0026#34;sequence definition here\u0026#34;} ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaagagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + CCCCCCCCCCCCCCCCCCCCCCCBBCCC?BCCCCCBC?CCCC@@;AVA`cWeb_TYC\\UIN?IDP8QJMKRPVGLQAFPPc`AbAFB5A4\u0026gt;AAA56A\u0026gt;\u0026lt;\u0026gt;8\u0026gt;\u0026gt;F@A\u0026gt;\u0026lt;8??@BB+\u0026lt;?;?C@9CCCCCC\u0026lt;CC=CCCCCCCCCBC?CBCCCCC@CC The actual format of the header is automatically detected when OBITools4 commands read a FASTQ file.\nReferences # Cock,\u0026#32; Fields,\u0026#32; Goto,\u0026#32; Heuer\u0026#32;\u0026amp;\u0026#32;Rice (2010) Cock,\u0026#32; P.,\u0026#32; Fields,\u0026#32; C.,\u0026#32; Goto,\u0026#32; N.,\u0026#32; Heuer,\u0026#32; M.\u0026#32;\u0026amp;\u0026#32;Rice,\u0026#32; P. \u0026#32; (2010). \u0026#32;The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic acids research,\u0026#32;38(6).\u0026#32;1767–1771. 
https://doi.org/10.1093/nar/gkp1137 "},{"id":26,"href":"/obidoc/docs/programming/lua/obitools_classes/taxonomy/","title":"Taxonomy","section":"Obitools Classes","content":" The Taxonomy class # Constructor of Taxonomy # Taxonomy Methods # "},{"id":27,"href":"/obidoc/obitools/obicount/","title":"obicount","section":"Basics","content":" obicount: counting sequence records # Description # Count the sequence records in a sequence file. It returns three pieces of information. The first is the number of sequence records. Each sequence record is associated with a count attribute (equal to 1 if absent), this number corresponds to the number of times that sequence has been observed in the non-dereplicated data set. In the following example, the first sequence record has no count attribute and therefore counts for 1, when the second sequence record has a count attribute equal to 2.\n\u0026gt;AB061527 {\u0026#34;definition\u0026#34;:\u0026#34;Sorex unguiculatus mitochondrial NA, complete genome.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Soricidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9376,\u0026#34;genus_name\u0026#34;:\u0026#34;Sorex\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9379,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;family\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:2.2137847111025621e-13,\u0026#34;species_name\u0026#34;:\u0026#34;Sorex unguiculatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:62275,\u0026#34;taxid\u0026#34;:62275} ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat agcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;AL355887 {\u0026#34;count\u0026#34;:2,\u0026#34;definition\u0026#34;:\u0026#34;Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; 
HTGS_ACTIVFIN.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Hominidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9604,\u0026#34;genus_name\u0026#34;:\u0026#34;Homo\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9605,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:0,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;species_taxid\u0026#34;:9606,\u0026#34;taxid\u0026#34;:9606} ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac agcttaaaactcaaaggacctggcagttctttatatccct Thus, the second value returned is the sum of the count values for all sequences, 3 for the example file above. The last value is the number of nucleotides stored in the file, that is, the sum of the sequence lengths, without accounting for the count tag.\ngraph TD A@{ shape: doc, label: \"my_sequences.fastq\" } C[obicount] D@{ shape: doc, label: \"counts.csv\" } A --\u003e C:::obitools C --\u003e D classDef obitools fill:#99d57c Synopsis # obicount [--batch-size \u0026lt;int\u0026gt;] [--csv] [--debug] [--ecopcr] [--embl] [--fasta] [--fastq] [--force-one-cpu] [--genbank] [--help|-h|-?] [--input-OBI-header] [--input-json-header] [--max-cpu \u0026lt;int\u0026gt;] [--no-order] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--reads|-r] [--silent-warning] [--solexa] [--symbols|-s] [--u-to-t] [--variants|-v] [--version] [\u0026lt;args\u0026gt;] Options # obicount specific options # --variants | -v : when present, output only the number of sequence records in the file. --reads | -r : when present, output only the sum of sequence counts in the file. --symbols | -s : when present, output only the number of nucleotides in the file. It is possible to combine two of the above options.\nControlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. 
But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). 
--batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # By default, the obicount command outputs the number of sequence records (variants), the sum of counts (reads), and the number of nucleotides (symbols) in the sequence file.\nobicount my_sequence_file.fasta INFO[0000] Number of workers set 16 INFO[0000] Found 1 files to process INFO[0000] xxx.fastq.gz mime type: text/fastq entities,n variants,43221 reads,43221 symbols,4391530 The output is in CSV format and can be transformed into Markdown for a prettier output using the csvtomd command.\nobicount my_sequence_file.fasta | csvtomd entities | n ----------|--------- variants | 43221 reads | 43221 symbols | 4391530 The conversion can also be done with the csvlook command from the csvkit package.\nobicount my_sequence_file.fasta | csvlook | entities | n | | -------- | --------- | | variants | 43 221 | | reads | 43 221 | | symbols | 4 391 530 | When using the --variants, --reads or --symbols option, the output only contains the number(s) corresponding to the specified option(s).\nobicount -v --reads my_sequence_file.fasta | csvlook | entities | n | | -------- | ------ | | variants | 43 221 | | reads | 43 221 | As for all the OBITools commands, a GZIP compressed input file can be used.\nobicount my_sequence_file.fasta.gz | csvlook | entities | n | | -------- | --------- | | variants | 43 221 | | reads | 43 221 | | symbols | 4 391 530 | "},{"id":28,"href":"/obidoc/obitools/obirefidx/","title":"obirefidx","section":"Sequence 
alignments","content":" obirefidx # Description # Synopsis # obirefidx [OPTIONS] [ARGS] Options # obirefidx specific options # --opt1 | -o \u0026lt;PARAM\u003e: Here the description of the option Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the \u0026ndash;no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. 
Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. 
Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obirefidx --help "},{"id":29,"href":"/obidoc/formats/genbank/","title":"GenBank Flat File format","section":"Sequence file formats","content":" The GenBank Flat File format # The GenBank Flat File format is a widely used text-based format for storing nucleotide sequence data and their associated annotations. It is maintained by the National Center for Biotechnology Information (NCBI) and serves as a primary repository for sequence data in the United States.\nOverview # The GenBank format is designed to be both human-readable and machine-readable, making it suitable for manual inspection and automated processing. Each flat file contains a sequence record that includes metadata about the sequence, as well as the sequence itself. Each file can contain one or more records, with each record separated by a line containing only a // (slash-slash) string.\n📄 sample.gb\nLOCUS HQ324066 84 bp DNA linear PLN 18-NOV-2011 DEFINITION Trinia glauca tRNA-Leu (trnL) gene, intron; chloroplast. ACCESSION HQ324066 VERSION HQ324066.1 KEYWORDS . SOURCE chloroplast Trinia glauca ORGANISM Trinia glauca Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliopsida; eudicotyledons; Gunneridae; Pentapetalae; asterids; campanulids; Apiales; Apiaceae; Apioideae; apioid superclade; Selineae; Trinia. REFERENCE 1 (bases 1 to 84) AUTHORS Raye,G., Miquel,C., Coissac,E., Redjadj,C., Loison,A. and Taberlet,P. TITLE New insights on diet variability revealed by DNA barcoding and high-throughput pyrosequencing: chamois diet in autumn as a case study JOURNAL Ecol. Res. 26 (2), 265-276 (2011) REFERENCE 2 (bases 1 to 84) AUTHORS Raye,G. 
TITLE Direct Submission JOURNAL Submitted (25-SEP-2010) LECA, Universite Joseph Fourier, Bp 53, 2233 rue de la Piscine, Grenoble 38041, France FEATURES Location/Qualifiers source 1..84 /organism=\u0026#34;Trinia glauca\u0026#34; /organelle=\u0026#34;plastid:chloroplast\u0026#34; /mol_type=\u0026#34;genomic DNA\u0026#34; /db_xref=\u0026#34;taxon:1000432\u0026#34; /geo_loc_name=\u0026#34;France\u0026#34; gene \u0026lt;1..\u0026gt;84 /gene=\u0026#34;trnL\u0026#34; /note=\u0026#34;tRNA-Leu; tRNA-Leu(UAA)\u0026#34; intron \u0026lt;1..\u0026gt;84 /gene=\u0026#34;trnL\u0026#34; /note=\u0026#34;P6 loop\u0026#34; ORIGIN 1 gggcaatcct gagccaaatc ctattttaca aaaacaaaca aaggcccaga aggtgaaaaa 61 aggataggtg cagagactca atgg // Structure of the GenBank Flat File record # A GenBank flat file consists of several sections, each containing specific information about the sequence. The main sections include:\nHeader section # The header section contains essential metadata about the sequence. The following fields are commonly found in this section:\nLOCUS: A unique identifier for the sequence, including its length, type (e.g., DNA, RNA), and whether it is linear or circular. DEFINITION: A brief description of the sequence, summarizing its biological significance. ACCESSION: Accession number(s) associated with the sequence, which can be used to retrieve the record. VERSION: The version number of the sequence record, indicating updates or changes. KEYWORDS: Keywords associated with the sequence, making it easier to categorise and search. SOURCE: The organism from which the sequence is derived, including the scientific name. REFERENCE: Citations for the sequence, linking it to relevant literature. LOCUS HQ324066 84 bp DNA linear PLN 18-NOV-2011 DEFINITION Trinia glauca tRNA-Leu (trnL) gene, intron; chloroplast. ACCESSION HQ324066 VERSION HQ324066.1 KEYWORDS . 
SOURCE chloroplast Trinia glauca ORGANISM Trinia glauca Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliopsida; eudicotyledons; Gunneridae; Pentapetalae; asterids; campanulids; Apiales; Apiaceae; Apioideae; apioid superclade; Selineae; Trinia. REFERENCE 1 (bases 1 to 84) AUTHORS Raye,G., Miquel,C., Coissac,E., Redjadj,C., Loison,A. and Taberlet,P. TITLE New insights on diet variability revealed by DNA barcoding and high-throughput pyrosequencing: chamois diet in autumn as a case study JOURNAL Ecol. Res. 26 (2), 265-276 (2011) REFERENCE 2 (bases 1 to 84) AUTHORS Raye,G. TITLE Direct Submission JOURNAL Submitted (25-SEP-2010) LECA, Universite Joseph Fourier, Bp 53, 2233 rue de la Piscine, Grenoble 38041, France Feature table section # The feature table section contains information about the annotations or features of the sequence, such as genes, transcripts, or regions. Each feature is represented by a set of fields split across multiple lines. The first line of each feature contains the feature type, such as \u0026ldquo;gene\u0026rdquo;, \u0026ldquo;transcript\u0026rdquo;, or \u0026ldquo;region\u0026rdquo;, and its location in the sequence. 
The subsequent lines contain the feature-specific information, such as the gene name, gene function, cross-references to other databases, or its translation to protein for protein-coding genes.\nFEATURES Location/Qualifiers source 1..84 /organism=\u0026#34;Trinia glauca\u0026#34; /organelle=\u0026#34;plastid:chloroplast\u0026#34; /mol_type=\u0026#34;genomic DNA\u0026#34; /db_xref=\u0026#34;taxon:1000432\u0026#34; /geo_loc_name=\u0026#34;France\u0026#34; gene \u0026lt;1..\u0026gt;84 /gene=\u0026#34;trnL\u0026#34; /note=\u0026#34;tRNA-Leu; tRNA-Leu(UAA)\u0026#34; intron \u0026lt;1..\u0026gt;84 /gene=\u0026#34;trnL\u0026#34; /note=\u0026#34;P6 loop\u0026#34; Sequence section # The sequence section contains the sequence data itself, starting with a line containing only the keyword ORIGIN (in uppercase), followed by the sequence data. The sequence data is written in blocks of 10 characters separated by spaces, and each full line contains 60 nucleotides. The number on the left of each sequence line indicates the start position of the line in the sequence.\nORIGIN 1 gggcaatcct gagccaaatc ctattttaca aaaacaaaca aaggcccaga aggtgaaaaa 61 aggataggtg cagagactca atgg Terminator # The record concludes with a // line, indicating the end of the record. This terminator is crucial for distinguishing between multiple records in a single file.\n// Converting GenBank Flat File to FASTA format # To convert a GenBank flat file to fasta format, you can use the obiconvert command. 
The obiconvert command extracts the taxid and scientific name associated with each GenBank record and stores them in the taxid and scientific_name tags in the FASTA header.\nobiconvert sample.gb \u0026gt;HQ324066 {\u0026#34;definition\u0026#34;:\u0026#34;Trinia glauca tRNA-Leu (trnL) gene, intron; chloroplast.\u0026#34;,\u0026#34;scientific_name\u0026#34;:\u0026#34;chloroplast Trinia glauca\u0026#34;,\u0026#34;taxid\u0026#34;:1000432} gggcaatcctgagccaaatcctattttacaaaaacaaacaaaggcccagaaggtgaaaaa aggataggtgcagagactcaatgg Note The DDBJ database uses a format very similar to GenBank, so obiconvert recognizes it as a GenBank file and correctly converts it to FASTA.\nReferences # For more detailed specifications and guidelines regarding the GenBank Flat File format, refer to the following resource:\nGenBank Flat File format "},{"id":30,"href":"/obidoc/docs/programming/lua/obitools_classes/taxon/","title":"Taxon","section":"Obitools Classes","content":" The Taxon class # Constructor of Taxon # Taxon Methods # "},{"id":31,"href":"/obidoc/obitools/obicsv/","title":"obicsv","section":"Basics","content":" obicsv # Description # Convert a sequence file to a CSV file.\nSynopsis # obicsv [--auto] [--batch-size \u0026lt;int\u0026gt;] [--compress|-Z] [--count] [--csv] [--debug] [--definition|-d] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta] [--fastq] [--force-one-cpu] [--genbank] [--help|-h|-?] [--ids|-i] [--input-OBI-header] [--input-json-header] [--keep|-k \u0026lt;KEY\u0026gt;]... 
[--max-cpu \u0026lt;int\u0026gt;] [--na-value \u0026lt;NAVALUE\u0026gt;] [--no-order] [--no-progressbar] [--obipairing] [--out|-o \u0026lt;FILENAME\u0026gt;] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--quality|-q] [--raw-taxid] [--sequence|-s] [--silent-warning] [--solexa] [--taxon] [--taxonomy|-t \u0026lt;string\u0026gt;] [--u-to-t] [--update-taxid] [--version] [--with-leaves] [\u0026lt;args\u0026gt;] Options # obicsv specific options # --auto: Based on the first sequences, propose a list of attributes to print (default: false). --count: Prints the count attribute in the output (default: false). --definition | -d : Prints sequence definition in the output. --ids | -i : Prints sequence ids in the output. --keep | -k \u0026lt;KEY\u003e: Keeps only the attribute with key KEY. Several -k options can be combined. --na-value \u0026lt;NAVALUE\u003e: A string representing unavailable values in the CSV file. (default: \u0026ldquo;NA\u0026rdquo;) --obipairing: prints the attributes added by obipairing. --quality | -q : Prints sequence quality in the output. --sequence | -s : Prints sequence itself in the output. --taxon: Prints the NCBI taxid and its related scientific name. Taxonomic options # --taxonomy | -t \u0026lt;string\u003e: Path to the taxonomic database. Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. 
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the \u0026ndash;no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? 
: shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obicsv --help "},{"id":32,"href":"/obidoc/obitools/obitag/","title":"obitag","section":"Sequence alignments","content":" obitag: annotate sequences with taxonomic information # Description # A Least Common Ancestor-based algorithm for taxonomic sequence annotation.\nSynopsis # obitag --reference-db|-R \u0026lt;FILENAME\u0026gt; [--batch-size \u0026lt;int\u0026gt;] [--compressed|-Z] [--debug] [--ecopcr] [--embl] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--geometric|-G] [--help|-h|-?] 
[--input-OBI-header] [--input-json-header] [--json-output] [--max-cpu \u0026lt;int\u0026gt;] [--no-order] [--no-progressbar] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--paired-with \u0026lt;FILENAME\u0026gt;] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--save-db \u0026lt;FILENAME\u0026gt;] [--skip-empty] [--solexa] [--taxonomy|-t \u0026lt;string\u0026gt;] [--version] [\u0026lt;args\u0026gt;] Options # obitag specific options # --reference-db | -R \u0026lt;FILENAME\u003e: The name of the file containing the reference database. --geometric | -G : Activate the experimental geometric similarity heuristic (default: false) --paired-with \u0026lt;FILENAME\u003e: filename containing the paired reads. --save-db \u0026lt;FILENAME\u003e: The name of a file where to save the reference DB with its indices (default: \u0026ldquo;\u0026rdquo;) Taxonomic options # --taxonomy | -t \u0026lt;string\u003e: Path to the taxonomic database. Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. 
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the \u0026ndash;no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. 
Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obitag --help "},{"id":33,"href":"/obidoc/formats/embl/","title":"EMBL Flat File format","section":"Sequence file formats","content":" The EMBL-ENA Flat File format # The EMBL (European Molecular Biology Laboratory) Flat File format is a widely used text-based format for storing nucleotide and protein sequence data. It is primarily utilized in the European Nucleotide Archive (ENA) database for the exchange of biological sequence information including their associated metadata.\nOverview # The EMBL format is designed to be human-readable and machine-readable, making it suitable for both manual inspection and automated processing. Each flat file contains a sequence record that includes metadata about the sequence, as well as the sequence itself. 
Each file is composed of a single record or a series of records, separated by a line containing only a // (slash-slash) string.\n📄 sample.embl\nID HQ324066; SV 1; linear; genomic DNA; STD; PLN; 84 BP. XX AC HQ324066; XX DT 30-MAR-2011 (Rel. 108, Created) DT 20-NOV-2011 (Rel. 110, Last updated, Version 2) XX DE Trinia glauca tRNA-Leu (trnL) gene, intron; chloroplast. XX KW . XX OS Trinia glauca OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; OC Spermatophyta; Magnoliopsida; eudicotyledons; Gunneridae; Pentapetalae; OC asterids; campanulids; Apiales; Apiaceae; Apioideae; apioid superclade; OC Selineae; Trinia. OG Plastid:Chloroplast XX RN [1] RP 1-84 RA Raye G., Miquel C., Coissac E., Redjadj C., Loison A., Taberlet P.; RT \u0026#34;New insights on diet variability revealed by DNA barcoding and RT high-throughput pyrosequencing: chamois diet in autumn as a case study\u0026#34;; RL Ecol. Res. 26(2):265-276(2011). XX RN [2] RP 1-84 RA Raye G.; RT ; RL Submitted (25-SEP-2010) to the INSDC. RL LECA, Universite Joseph Fourier, Bp 53, 2233 rue de la Piscine, Grenoble RL 38041, France XX DR MD5; f2c1ac0a050529656590007b565d327d. XX FH Key Location/Qualifiers FH FT source 1..84 FT /organism=\u0026#34;Trinia glauca\u0026#34; FT /organelle=\u0026#34;plastid:chloroplast\u0026#34; FT /mol_type=\u0026#34;genomic DNA\u0026#34; FT /country=\u0026#34;France\u0026#34; FT /db_xref=\u0026#34;taxon:1000432\u0026#34; FT gene \u0026lt;1..\u0026gt;84 FT /gene=\u0026#34;trnL\u0026#34; FT /note=\u0026#34;tRNA-Leu; tRNA-Leu(UAA)\u0026#34; FT intron \u0026lt;1..\u0026gt;84 FT /gene=\u0026#34;trnL\u0026#34; FT /note=\u0026#34;P6 loop\u0026#34; XX SQ Sequence 84 BP; 35 A; 16 C; 20 G; 13 T; 0 other; gggcaatcct gagccaaatc ctattttaca aaaacaaaca aaggcccaga aggtgaaaaa 60 aggataggtg cagagactca atgg 84 // Structure of the EMBL Flat File record # An EMBL flat file consists of several sections, each containing specific information about the sequence. 
The main sections include:\nHeader section # The header section contains essential metadata about the sequence. The following fields are commonly found in this section:\nID: A unique identifier for the sequence. This field is mandatory and typically includes the sequence type (e.g., mRNA, DNA, protein) and its length. AC: Accession number(s) associated with the sequence. This field can contain multiple accession numbers separated by semicolons. DE: Description of the sequence, providing a brief overview of its biological significance or function. KW: Keywords associated with the sequence, which can help in categorizing and searching for the sequence in databases. OS: Organism name, indicating the species from which the sequence is derived. OC: Organism classification, providing a taxonomic hierarchy (e.g., domain, kingdom, phylum). RN: Reference number for citations, linking the sequence to relevant literature. RP: Reference position, indicating the specific positions in the sequence that are referenced. RA: Authors of the reference, listing the individuals who contributed to the cited work. RL: Reference line, providing the complete citation for the reference. ID HQ324066; SV 1; linear; genomic DNA; STD; PLN; 84 BP. XX AC HQ324066; XX DT 30-MAR-2011 (Rel. 108, Created) DT 20-NOV-2011 (Rel. 110, Last updated, Version 2) XX DE Trinia glauca tRNA-Leu (trnL) gene, intron; chloroplast. XX KW . XX OS Trinia glauca OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; OC Spermatophyta; Magnoliopsida; eudicotyledons; Gunneridae; Pentapetalae; OC asterids; campanulids; Apiales; Apiaceae; Apioideae; apioid superclade; OC Selineae; Trinia. OG Plastid:Chloroplast XX RN [1] RP 1-84 RA Raye G., Miquel C., Coissac E., Redjadj C., Loison A., Taberlet P.; RT \u0026#34;New insights on diet variability revealed by DNA barcoding and RT high-throughput pyrosequencing: chamois diet in autumn as a case study\u0026#34;; RL Ecol. Res. 26(2):265-276(2011). 
XX RN [2] RP 1-84 RA Raye G.; RT ; RL Submitted (25-SEP-2010) to the INSDC. RL LECA, Universite Joseph Fourier, Bp 53, 2233 rue de la Piscine, Grenoble RL 38041, France XX DR MD5; f2c1ac0a050529656590007b565d327d. XX Feature table section # The feature table section contains information about the annotations or features of the sequence, such as genes, transcripts, or regions. Each feature is represented by a set of fields split across multiple lines. The first line of each feature contains the feature type, such as \u0026ldquo;gene\u0026rdquo;, \u0026ldquo;transcript\u0026rdquo;, or \u0026ldquo;region\u0026rdquo;, and its location in the sequence. The subsequent lines contain the feature-specific information, such as the gene name, gene function, cross-references to other databases, or its translation to protein for protein-coding genes.\nFH Key Location/Qualifiers FH FT source 1..84 FT /organism=\u0026#34;Trinia glauca\u0026#34; FT /organelle=\u0026#34;plastid:chloroplast\u0026#34; FT /mol_type=\u0026#34;genomic DNA\u0026#34; FT /country=\u0026#34;France\u0026#34; FT /db_xref=\u0026#34;taxon:1000432\u0026#34; FT gene \u0026lt;1..\u0026gt;84 FT /gene=\u0026#34;trnL\u0026#34; FT /note=\u0026#34;tRNA-Leu; tRNA-Leu(UAA)\u0026#34; FT intron \u0026lt;1..\u0026gt;84 FT /gene=\u0026#34;trnL\u0026#34; FT /note=\u0026#34;P6 loop\u0026#34; XX Sequence section # The sequence section contains the actual nucleotide sequence. This section is formatted to enhance readability and is represented in lines of 60 characters. The sequence letters represent nucleotides (A, T, C, G for DNA; A, U, C, G for RNA).\nThe sequence section begins with SQ in first line and includes metadata about the sequence, such as its length and the count of each nucleotide. The sequence itself follows, formatted in lines of 60 characters for clarity. 
The number on the right of each of the sequence lines indicates the end position of the line in the sequence.\nSQ Sequence 84 BP; 35 A; 16 C; 20 G; 13 T; 0 other; gggcaatcct gagccaaatc ctattttaca aaaacaaaca aaggcccaga aggtgaaaaa 60 aggataggtg cagagactca atgg 84 Terminator # The record concludes with a // line, indicating the end of the record. This terminator is crucial for distinguishing between multiple records in a single file.\n// Converting EMBL Flat File to FASTA format # To convert an EMBL flat file to fasta format, you can use the obiconvert command. For each EMBL record, obiconvert extracts the taxid and scientific name from the source feature and stores them in the taxid and scientific_name tags of the FASTA header.\nobiconvert sample.embl \u0026gt;HQ324066 {\u0026#34;definition\u0026#34;:\u0026#34;Trinia glauca tRNA-Leu (trnL) gene, intron; chloroplast.\u0026#34;,\u0026#34;scientific_name\u0026#34;:\u0026#34;Trinia glauca\u0026#34;,\u0026#34;taxid\u0026#34;:1000432} gggcaatcctgagccaaatcctattttacaaaaacaaacaaaggcccagaaggtgaaaaa aggataggtgcagagactcaatgg References # For more detailed specifications and guidelines regarding the EMBL Flat File Format, refer to the following resources:\nEMBL (European Molecular Biology Laboratory) Flat File Format EMBL-EBI Sequence Databases "},{"id":34,"href":"/obidoc/obitools/obidemerge/","title":"obidemerge","section":"Basics","content":" obidemerge # Description # Synopsis # obidemerge [--batch-size \u0026lt;int\u0026gt;] [--compress|-Z] [--csv] [--debug] [--demerge|-d \u0026lt;string\u0026gt;] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--help|-h|-?] 
[--input-OBI-header] [--input-json-header] [--json-output] [--max-cpu \u0026lt;int\u0026gt;] [--no-order] [--no-progressbar] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--raw-taxid] [--silent-warning] [--skip-empty] [--solexa] [--taxonomy|-t \u0026lt;string\u0026gt;] [--u-to-t] [--update-taxid] [--version] [--with-leaves] [\u0026lt;args\u0026gt;] Options # obidemerge specific options # --demerge | -d \u0026lt;string\u003e: Indicates which slot has to be demerged. (default: \u0026ldquo;sample\u0026rdquo;) --paired-with \u0026lt;FILENAME\u003e: filename containing the paired reads. Taxonomic options # --taxonomy | -t \u0026lt;string\u003e: Path to the taxonomic database. Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. 
(default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the \u0026ndash;no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. 
Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obidemerge --help "},{"id":35,"href":"/obidoc/obitools/obidistribute/","title":"obidistribute","section":"Basics","content":" obidistribute: split a sequence file into multiple files # Description # Distribute a sequence file across multiple output files.\nSynopsis # obidistribute --pattern|-p \u0026lt;string\u0026gt; [--append|-A] [--batch-size \u0026lt;int\u0026gt;] [--batches|-n \u0026lt;int\u0026gt;] [--classifier|-c \u0026lt;string\u0026gt;] [--compress|-Z] [--csv] [--debug] [--directory|-d \u0026lt;string\u0026gt;] [--ecopcr] [--embl] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--hash|-H \u0026lt;int\u0026gt;] [--help|-h|-?] 
[--input-OBI-header] [--input-json-header] [--json-output] [--max-cpu \u0026lt;int\u0026gt;] [--na-value \u0026lt;string\u0026gt;] [--no-order] [--no-progressbar] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--silent-warning] [--skip-empty] [--solexa] [--u-to-t] [--version] [\u0026lt;args\u0026gt;] Options # Required options # --pattern | -p \u0026lt;string\u003e: The template used to build the names of the output files. The variable part is represented by \u0026lsquo;%s\u0026rsquo;. Example : toto_%s.fastq. obidistribute specific options # --append | -A \u0026lt;string\u003e: Appends sequences to the output files if they already exist. --batches | -n \u0026lt;int\u003e: Indicates into how many batches the input file must be split. (default: 0) --classifier | -c \u0026lt;string\u003e: The name of a tag annotating the sequences. The tag must correspond to a string, an integer or a boolean value. That value will be used to dispatch sequences among the different files. --directory | -d \u0026lt;string\u003e: The name of a tag annotating the sequences. The tag must correspond to a string, an integer or a boolean value. That value will be used to dispatch sequences among the different directories, in conjunction with the -c|\u0026ndash;classifier option. --hash | -H \u0026lt;int\u003e: Indicates to split the input into at most that number of batches, based on a hash code of the sequence. (default: 0) --na-value \u0026lt;string\u003e: Value used when the classifier tag is not defined for a sequence. (default: \u0026ldquo;NA\u0026rdquo;) Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. 
The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the \u0026ndash;no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. 
--out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. 
(default: 6060, env: OBIPPROFGOROUTINE) Examples # obidistribute --help "},{"id":36,"href":"/obidoc/obitools/obigrep/","title":"obigrep","section":"Basics","content":" obigrep: filter a sequence file # Description # obigrep is a tool for selecting a subset of sequences based on a set of criteria. Sequences from the input dataset that match all the criteria are retained and printed in the result, while other sequences are discarded.\nSelection criteria can be based on different aspects of the sequence data, such as\nThe sequence identifier (ID) The sequence annotations The sequence itself Selection based on sequence identifier (ID) # There are two ways of selecting sequences according to their identifier:\nUsing a regular pattern with option -I Using a list of identifiers (IDs) provided in a file with option --id-list On the following five-sequences sample file:\nšŸ“„ five_ids.fasta \u0026gt;seqA1 cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 tagctagctagctagctagctagctagcta \u0026gt;seqA2 gtagctagctagctagctagctagctaga \u0026gt;seqC1 cgatgctgcatgctagtgctagtcgatga \u0026gt;seqB2 tagctagctagctagctagctagctagcta To select sequences with IDs \u0026ldquo;seqA1\u0026rdquo; and \u0026ldquo;seqB1\u0026rdquo;, you can use the command\nobigrep -I \u0026#39;^seq[AB]1$\u0026#39; five_ids.fasta \u0026gt;seqA1 cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 tagctagctagctagctagctagctagcta The explanations for the regular pattern ^seq[AB]1$ are\nthe ^ at the beginning means that the string must start with that pattern seq is an exact match for that string [AB] means any character in the set {A, B} 1 is an exact match for that character $ at the end of the pattern means that the string must end with that pattern. 
If the starting ^ had been omitted, the pattern would have matched any sequence ID containing \u0026ldquo;seq\u0026rdquo; followed by a character from the set {A, B} and ending with \u0026ldquo;1\u0026rdquo;, for example the IDs my_seqA1 or my_seqB1 would have been selected.\nIf the ending \u0026lsquo;$\u0026rsquo; had been omitted, the pattern would have matched any sequence ID starting with \u0026lsquo;seq\u0026rsquo; followed by a character in the set {A, B} and containing \u0026lsquo;1\u0026rsquo;, e.g. the ids seqA102 or seqB1023456789 would have been selected.\nAnother solution to extract these sequence IDs would be to use a text file containing them, one per line, as follows\nšŸ“„ seqAB.txt 1 2 seqA1 seqB1 This seqAB.txt can then be used as an index file by obigrep :\nobigrep --id-list seqAB.txt five_ids.fasta \u0026gt;seqA1 cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 tagctagctagctagctagctagctagcta Selection based on sequence definition # Each sequence record can have a sequence definition describing the sequence. In fasta or fastq format, this definition is found in the header of each sequence record after the second word (the first being the sequence id), or after the annotations between braces in the OBITools4 extended version of these formats.\nšŸ“„ three_def.fasta \u0026gt;seqA1 cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 my beautiful sequence tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:10} my pretty sequence gtagctagctagctagctagctagctaga In the three_def.fasta example file:\nseqA1 has no definition seqB1 definition is my beautiful sequence seqA2 definition is my pretty sequence The -D or --definition option lets you specify a regular pattern to select only those sequences whose definition matches the pattern. 
The example below selects sequences whose definition contains the word pretty.\nobigrep -D pretty three_def.fasta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:10,\u0026#34;definition\u0026#34;:\u0026#34;my pretty sequence\u0026#34;} gtagctagctagctagctagctagctaga As you can see in the results, all the OBITools4 include the definition present in the original file as a new annotation tag called definition. So it is actually this tag that is tested by the -D option.\nSelection based on the annotations # Selection based on any annotation # The obigrep tool can also be used to select sequences based on their annotations. Annotation are constituted by all the tags and values added to each sequence header in the fasta / fastq file. For instance, if you have a sequence file with the following headers:\nšŸ“„ five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:63221 [Homo sapiens neanderthalensis]@subspecies\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tata\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;count\u0026#34;:25,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;} cgatggctccatgctagtgctagtcgatga Selecting sequences having a tag whatever its value # 
The -A option allows for selecting sequences having the given attribute whatever its value. In the following example, all the sequences having the count attribute are selected.\nobigrep -A \u0026#34;count\u0026#34; five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctgcatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;count\u0026#34;:25,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;} tagctagctagctagctagctagctagcta Only four sequences are retained, the sequence seqB1 is excluded because it does not have the tag count.\nSelecting sequences having a tag with a specific value # The -a option allows for selecting sequences having the given attribute affected to a value matching the provided regular pattern. 
In the following example, only the sequence seqA1 having the toto attribute containing the value titi is selected.\nobigrep -a toto=\u0026#34;titi\u0026#34; five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat As the value is a regular pattern, it is possible to be less strict, and for example, the following command will select all sequences with the toto attribute containing a value beginning (^ at the start of the expression) with t.\nobigrep -a toto=\u0026#34;^t\u0026#34; five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:63221 [Homo sapiens neanderthalensis]@subspecies\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tata\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga The sequence seqC1 is excluded because its toto attribute contains the value foo, which does not begin with t, while seqB2 is excluded because it does not have a toto attribute.\nSelection based on the sequence abundances # In amplicon sequencing experiments, a sequence may be observed many times. The obiuniq command can be used to dereplicate strictly identical sequences. The number of strictly identical sequence reads merged into a single sequence record is stored in the count annotation tag of that sequence record. 
It is common to filter out sequences that are too rare or too abundant, depending on the purpose of the experiment. There are two ways to select sequence records based on this count tag.\nThe --min-count or -c options, followed by a numeric argument, select sequence records with a count greater than or equal to that argument. The --max-count or -C options, followed by a numeric argument, select sequence records with a count less than or equal to that argument. Note If the count tag is missing from a sequence record, it is assumed to be equal to 1.\nobigrep -c 2 five_tags.fasta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;count\u0026#34;:25,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;} cgatggctccatgctagtgctagtcgatga This command removes singleton sequences (sequences observed only once): here seqA1, whose count tag is equal to 1, and seqB1, which has no count tag and is therefore assumed to have a count of 1.\nThe next command excludes from its results all the sequences occurring more than ten times.\nobigrep -C 10 five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:63221 [Homo sapiens neanderthalensis]@subspecies\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tata\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 
{\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga As usual, both options can be combined\nobigrep -c 2 -C 10 five_tags.fasta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga Selection based on taxonomic annotation. # Taxonomy-based selection is always performed on the taxid attribute of a sequence, even if it contains other taxonomic information stored in other attributes such as scientific_name or family_taxid. To use taxonomy-based selection with obigrep , it is mandatory to load a taxonomy using the -t or --taxonomy option.\nSelecting sequences belonging to a clade # If you do not have a taxonomy dump already downloaded, you must first download one using the following obitaxonomy command. The taxonomy will be stored in a file named ncbitaxo.tgz. This compressed archive can be supplied to other OBITools4 at a later date.\nobitaxonomy --download-ncbi --out ncbitaxo.tgz To select the sequences belonging to the Homo sapiens species, the first step is to extract the taxid corresponding to the species of interest from the downloaded taxonomy using the obitaxonomy command.\nThe -t option indicates the taxonomy to load The --fixed option indicates that the query string must be treated as the exact name of the species, not as a regular pattern. The --rank species option indicates that only taxa having the species taxonomic rank are of interest. \u0026quot;Homo sapiens\u0026quot; is the query string used to match the taxonomy names. 
The csvlook command formats the CSV output of obitaxonomy as a readable table.\nobitaxonomy -t ncbitaxo.tgz --fixed --rank species \u0026#34;Homo sapiens\u0026#34; | csvlook -I | taxid | parent | taxonomic_rank | scientific_name | | --------------------------------- | ----------------------- | -------------- | --------------- | | taxon:9606 [Homo sapiens]@species | taxon:9605 [Homo]@genus | species | Homo sapiens | The obigrep option to select sequences belonging to a taxon is -r or --restrict-to-taxon. The option requires as argument the taxid of the clade of interest, here 9606 for Homo sapiens.\nobigrep -t ncbitaxo.tgz -r taxon:9606 five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:63221 [Homo sapiens neanderthalensis]@subspecies\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tata\u0026#34;} tagctagctagctagctagctagctagcta Only sequences seqA1 and seqB1 are retained: they are annotated as belonging to the target clade Homo sapiens or to one of its subspecies, Homo sapiens neanderthalensis. Sequence seqA2 is not retained as it is annotated at genus level as Homo and therefore does not belong to the Homo sapiens clade, nor is sequence seqC1, annotated at family level as Hominidae. The last sequence seqB2 has no taxonomic annotation and is therefore considered to be annotated at the root of the taxonomy, which is not part of the Homo sapiens species clade.\nExcluding sequences belonging to a clade # The -i option, or --ignore-taxon in its long form, performs the reverse selection of the -r option presented above. 
It only retains sequences that do not belong to the target clade whose taxid is passed as an argument.\nobigrep -t ncbitaxo.tgz -i taxon:9606 five_tags.fasta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;count\u0026#34;:25,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;} cgatggctccatgctagtgctagtcgatga Here, only the sequences seqA2, seqC1 and seqB2 are retained as none of them belongs to the Homo sapiens species.\nKeep only sequences with taxonomic information at a given rank # A taxid, when associated with a taxonomy, not only provides information at its taxonomic rank, but also makes it possible to retrieve information at any higher rank. For example, a species taxid makes it possible, by querying the taxonomy, to retrieve the corresponding genus or family taxid. 
obigrep allows you to select sequences annotated by a taxid capable of providing information at a given taxonomic rank using the --require-rank option.\nTo retrieve all ranks defined by a taxonomy, it is possible to use the obitaxonomy command with the -l option.\nobitaxonomy -t ncbitaxo.tgz -l | csvlook | rank | | ---------------- | | domain | | phylum | | class | | suborder | | subcohort | | superphylum | | subspecies | | varietas | | subgenus | | parvorder | | acellular root | | genotype | | subtribe | | subkingdom | | subfamily | | kingdom | | isolate | | superorder | | section | | subvariety | | genus | | serogroup | | tribe | | forma | | infraclass | | superclass | | serotype | | no rank | | family | | species group | | subclass | | infraorder | | pathogroup | | realm | | order | | biotype | | species subgroup | | species | | strain | | clade | | cohort | | series | | cellular root | | morph | | subphylum | | forma specialis | | superfamily | | subsection | This allows us to check that the species rank is defined and to filter the five_tags.fasta test file to retain only sequences with information available at the species level.\nobigrep -t ncbitaxo.tgz --require-rank species five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:63221 [Homo sapiens neanderthalensis]@subspecies\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tata\u0026#34;} tagctagctagctagctagctagctagcta Only two sequences are selected by this command, because seqA1 is annotated at the species level, and seqB1 is annotated at the subspecies taxonomic rank, which allows for retrieving species level information.\nseqA2 and seqC1 are discarded as they are annotated at genus and family 
levels, respectively. seqB2 is discarded as it is not taxonomically annotated and is therefore considered to be annotated at the root of the taxonomy.\nKeep only sequences annotated with valid taxids # šŸ“„ six_invalid.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 {\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:63221 [Homo sapiens neanderthalensis]@subspecies\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tata\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctgcatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;count\u0026#34;:25,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqD1 {\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9607\u0026#34;} gctagctagctgacgatgcatgcgtaggtgcagttgcgta obigrep -t ncbitaxo.tgz --valid-taxid six_invalid.fasta WARN[0005] seqD1: Taxid: taxon:9607 is unknown from taxonomy (Taxid taxon:9607 is not part of the taxonomy NCBI Taxonomy) \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqB1 
{\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:63221 [Homo sapiens neanderthalensis]@subspecies\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tata\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctgcatgctagtgctagtcgatga Selection based on the sequence # Selection based on the sequence length # Two options -l (--min-length) and -L (--max-length) allow to select sequences based on their length. A sequence is selected if its length is greater or equal to the --min-length and less or equal to the --max-length. 
If only one of these options is used, only the specified limit is applied.\nIn the five_tags.fasta, one sequence is 27 base pairs (bp) long, two are 29 bp and the two last 30 bp long.\nTo select only sequences with a minimum length of 29 bp, the following command can be executed\nobigrep -l 29 five_tags.fasta \u0026gt;seqB1 {\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:63221 [Homo sapiens neanderthalensis]@subspecies\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tata\u0026#34;} tagctagctagctagctagctagctagcta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;count\u0026#34;:25,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;} cgatggctccatgctagtgctagtcgatga To select only sequences with a maximum length of 29 bp, the following command can be executed\nobigrep -L 29 five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 
[Hominidae]@family\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctccatgctagtgctagtcgatga Note that, in both cases, the two 29-bp sequences were selected, because both limits are inclusive.\nSelection based on the sequence # Sequence records can be selected on the sequence itself. There are two pattern matching algorithms available, depending on the options used:\n--sequence or -s : The pattern is a regular pattern used to match the sequence records. The pattern is not case-sensitive. --approx-pattern : This option uses the same algorithm as obipcr and obimultiplex to locate primers. The description of the pattern follows the same grammar. While regular patterns allow more complex expressions to describe the sequence being looked up, DNA patterns have the advantage of tolerating discrepancies between the pattern and the actual sequence (mismatches and indels). To set the number and the type of allowed errors use the --pattern-error and the --allows-indels options.\nIn the next example, the search is for sequences containing the pattern tgc at least twice, possibly separated by any number of bases (.*). 
This can be expressed as the regular pattern : tgc.*tgc\nobigrep -s \u0026#39;tgc.*tgc\u0026#39; five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;count\u0026#34;:25,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;} cgatggctccatgctagtgctagtcgatga If we are interested in sequence matching this pattern gatgctgcat, but want to allow a certain number of errors, we can use the --approx-pattern option. Despite its name, this option does not allow any errors by default, so for simple patterns like the one we have here, both the --approx-pattern and the -s options are equivalent.\nobigrep --approx-pattern gatgctgcat \\ five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat obigrep -s gatgctgcat \\ five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat However, --approx-pattern can be parameterized using the --pattern-error option. The following example allows two errors (differences) between the pattern and the matched sequence. Without a further option, these errors can only be substitutions. 
Thus, the value defined by --pattern-error is the maximum Hamming distance between the pattern and the matched sequence.\nobigrep --approx-pattern gatgctgcat \\ --pattern-error 2 \\ five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctccatgctagtgctagtcgatga By adding the --allows-indels option, obigrep will allow indels in the pattern. This means that it can match sequences where the differences between the pattern and the matched sequence are insertions or deletions. Insertion or deletion of a symbol is considered one error. Therefore, with --pattern-error 2 and --allows-indels you can allow two mismatches, two insertions or deletions, or one mismatch and one indel. 
In this case, the --pattern-error option defines the maximum Levenshtein distance allowed between the pattern and the matched sequence.\nobigrep --approx-pattern gatgctgcat \\ --pattern-error 2 \\ --allows-indels \\ five_tags.fasta \u0026gt;seqA1 {\u0026#34;count\u0026#34;:1,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9606 [Homo sapiens]@species\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;titi\u0026#34;} cgatgctgcatgctagtgctagtcgat \u0026gt;seqC1 {\u0026#34;count\u0026#34;:15,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9604 [Hominidae]@family\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;foo\u0026#34;} cgatgctccatgctagtgctagtcgatga \u0026gt;seqB2 {\u0026#34;count\u0026#34;:25,\u0026#34;tata\u0026#34;:\u0026#34;bar\u0026#34;} cgatggctccatgctagtgctagtcgatga Defining your own predicate # You can create your own predicate to filter your dataset. A predicate is an expression that returns a logical value of true or false when evaluated. It is defined using the --predicate (-p) option and the OBITools4 expression language. The predicate is evaluated on each sequence in the dataset. 
Sequences that result in a true value are retained in the result, while those that result in a false value are discarded.\nThe following command, for example, filters out all sequences with a count annotation of less than 2 or greater than 10.\nobigrep -c 2 -C 10 five_tags.fasta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga The following predicate is an equivalent alternative:\nobigrep -p \u0026#39;sequence.Count() \u0026gt;= 2 \u0026amp;\u0026amp; sequence.Count() \u0026lt;= 10\u0026#39; five_tags.fasta \u0026gt;seqA2 {\u0026#34;count\u0026#34;:5,\u0026#34;tata\u0026#34;:\u0026#34;foo\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;taxon:9605 [Homo]@genus\u0026#34;,\u0026#34;toto\u0026#34;:\u0026#34;tutu\u0026#34;} gtagctagctagctagctagctagctaga The OBITools4 expression language provides min and max functions. 
These functions extract the minimum and maximum values from a map or vector, respectively.\nIn the file some_uniq_seq.fasta, the merged_sample tag on each sequence indicates how the corresponding reads are distributed among samples.\n📄 some_uniq_seq.fasta \u0026gt;Seq_1 {\u0026#34;count\u0026#34;:2,\u0026#34;merged_sample\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:1,\u0026#34;29a_F260619\u0026#34;:1}} ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat agctyaaaactcaaaggacttggcggtgctttataccctt \u0026gt;Seq_2 {\u0026#34;count\u0026#34;:22,\u0026#34;merged_sample\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:12,\u0026#34;29a_F260619\u0026#34;:10}} ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat atcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;Seq_3 {\u0026#34;count\u0026#34;:22,\u0026#34;merged_sample\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:15,\u0026#34;29a_F260619\u0026#34;:7}} ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcgat agcttaaaactcaaaggacttggcggtgctttataccctt It is possible to extract the contingency table from this file using the obimatrix command. 
The --transpose option transposes the matrix so that sequences are in rows and samples are in columns.\nobimatrix --transpose some_uniq_seq.fasta \\ | csvtomd id | 15a_F730814 | 29a_F260619 -------|---------------|------------- Seq_1 | 1 | 1 Seq_2 | 12 | 10 Seq_3 | 15 | 7 To select sequences that occur at least ten times in at least one sample, you have to determine the maximum value of the merged_sample tag and compare it to the value ten.\nThis can be done using a predicate expression:\nobigrep -p \u0026#39;max(annotations.merged_sample) \u0026gt;= 10\u0026#39; some_uniq_seq.fasta \u0026gt;Seq_2 {\u0026#34;count\u0026#34;:22,\u0026#34;merged_sample\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:12,\u0026#34;29a_F260619\u0026#34;:10}} ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat atcttaaaactcaaaggacttggcggtgctttataccctt \u0026gt;Seq_3 {\u0026#34;count\u0026#34;:22,\u0026#34;merged_sample\u0026#34;:{\u0026#34;15a_F730814\u0026#34;:15,\u0026#34;29a_F260619\u0026#34;:7}} ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcgat agcttaaaactcaaaggacttggcggtgctttataccctt As you can see from the results, Seq_1 is discarded because it does not occur at least ten times in any sample: its maximum number of occurrences in a sample is 1.\nWorking with paired sequence files: # OBITools4 can handle paired sequence files. This means that it processes the paired sequences in the two files together. In particular, for obigrep , it will apply the same filtering to both files. This ensures that each sequence in the result files is paired with its correct counterpart.\nThe most important option for manipulating paired sequence files is the --paired-with option. This allows you to specify the name of a file containing sequences to be paired with those in the main sequence file. Since an obitools4 command that processes paired sequences produces two paired result files, the standard output cannot be used to store the results. 
Instead, you must use the --out option to specify where the results should be written.\nConsidering the two paired input files:\nšŸ“„ forward.fastq @M01334:147:000000000-LBRVD:1:1101:14968:1570 1:N:0:CTCACCAA+CTAGGCAA TGTTCCACGGGCAATCCTGAGCCAAATCTTTCATTTTGAAAAAATGAGAGATATAATGTATCTCTTATTTATTATAAGAAATAAAATATTTCTTATCTAATATTAAAGTTAGGTGCAGAGACTCAATGGGTGGAACTAGATCGGATGTGCA + 11\u0026gt;A\u0026gt;@3@A11\u0026gt;ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0\u0026gt;BF11B210B\u0026gt;//11B1\u0026lt;1BB\u0026lt;///\u0026lt;1122 @M01334:147:000000000-LBRVD:1:1101:15946:1586 1:N:0:CTCACCAA+CTAGGCAA TCCTAACCCCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAATATTTTATTTCTTATAATAAATAAGAGATATTTTATATCTCTCATTTTTTCAAAATGAAAGATTTGGCTCAGGATTGCCCACGTAACGGAGATCGGAAGAGC + 1\u0026gt;\u0026gt;A111\u0026gt;\u0026gt;\u0026gt;AFGGB1FFGFGFF3BBF1GGHHH33D2GH2B1D211110D1DGHHBFGGGGG2FA2F221F21A1F0D1DGHH2FAFFGFHFFGHHHHGG22@1BD111@0FFHE11GC1001BGF1B1B/EF00??////BF////\u0026lt;000 @M01334:147:000000000-LBRVD:1:1101:15399:1590 1:N:0:CTCACCAA+CTAGGCAA TGTTCCACCCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAATATTTTACTTCTTATAATAAATAAGAGTTATTTTATATCTCTCATTTTTTCAAAATGAAAGATTTGGCTCAGGATTGCCCGTGGAACTAGATCGGAAGAGCA + 11\u0026gt;A\u0026gt;@3B\u0026gt;\u0026gt;1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1\u0026gt;F1B1FD/////00\u0026lt;1 @M01334:147:000000000-LBRVD:1:1101:13773:1687 1:N:0:CTCACCAA+CTAGGCAA CTCGGATCACCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAAAAATATTATTTCTTATCTGAAATAAGAAATATTTTATATATTTCTTTTTCTCAAAATGAAAGATTTGGCTCAGGATTGCCCTGATCCGAGGGATAGCACCA + 3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHHHHHHHFFHHFBDFBCGHHF4BGHFGFFHHBDGFHHEHHFAAEECEGF3FDGFC šŸ“„ reverse.fastq @M01334:147:000000000-LBRVD:1:1101:14968:1570 2:N:0:CTCACCAA+CTAGGCAA 
TTTTCCTCCCTTTTTTTCTCTGCACCTTTCTTTTTTATTAGTTTTTTATTATTTTTTTTCTTTTTTTATTTTATTGATACTTTATATCTCTCTTTTTTTCTTTTTTATTGATTTTTCTCTGGTTTTCCCTTGTTACTTGTTCTTTTTTGCT + 11\u0026gt;\u0026gt;1131111BB111A0B3B313A0B1BAFGG11E/DG222B22///1D2DDGG1AE\u0026gt;\u0026gt;FG1D1/\u0026gt;/12B221212@21BFD2B2B2B2F11BFGHEEC1111B//1212BBF110@22111@@/2111?01111@111?111111--11 @M01334:147:000000000-LBRVD:1:1101:15946:1586 2:N:0:CTCACCAA+CTAGGCAA CCGTTACGTGGGCAATCCTGAGCCAATTCTTTCTTTTTGAAAAAATGAGAGATATAAAATATCTCTTATTTATTATAAGAAATAAAATATTTCTTATCTAATATTAATGATAGGTGCAGTGACTCTATGGGGTTAGGTAGTTCGGATGAGC + 111\u0026gt;\u0026gt;111B111111BA0B1101B001BAGGH22DGGH?01110/B11111/D1D2221D1DBEDGH1GHH2GG2F222110D@111D1DFGEGFBG@GB1B2FG22222B220B11111111B@11B210/?E/00B211B2/////111 @M01334:147:000000000-LBRVD:1:1101:15399:1590 2:N:0:CTCACCAA+CTAGGCAA TTTTCCTCGGGCTATCCTGAGCCAAATCTTTCCTTTTGAAAAATTTAGAGATATAAAATATCTCTTATTTATTTTATGTAGTATTATATTTCTTATCTAATATTAAATTTAGTTGCTTTTTCTCATTTTGTTTTACTTTTTCTTTTTTGCT + 11\u0026gt;\u0026gt;1131111111B11B1101A000B1DFF21DDFG1011100B122111D1D2221D1DADAFG1DGH2FG2D212222D2222D2DAF2FG2D@F21B2DE22122B221@11111110B222B222B00021B221B011111//11 @M01334:147:000000000-LBRVD:1:1101:13773:1687 2:N:0:CTCACCAA+CTAGGCAA TGATAGCAGGGCTATCCTGAGCCAAATCCGTGTTTTGAGAAAACAAGGGGGTTCTCGAACTAGAATACAAAAGAAAAGGATAGGTGCAGAGACTCAATGGTGCTATCCCTCGGATCAGGGCAATCCTTAGCCAAATCTTTCATTTTTTGAA + 111\u0026gt;13@1111\u0026gt;11B1AF11BABC00B110BAFGGH0000DFAB//0///EEECGFA10AG1111D@@11100/0000/0F110B11@11/0\u0026gt;FC@1B\u0026gt;1B11FEFEC\u0026gt;E\u0026gt;///?\u0026lt;0110/?/FF\u0026lt;G22111@00@\u0026lt;GHHB\u0026gt;FHHH1///1 To conserve only sequences starting with a t, use the following command:\nobigrep -s \u0026#39;^t\u0026#39; \\ --paired-with reverse.fastq \\ --out start_t.fastq \\ forward.fastq After running the obigrep command, if you check the directory contents, you will obtain two new files named start_t_R1.fastq and start_t_R2.fastq, in addition to the two input files, forward.fastq and reverse.fastq. 
These file names are created by adding the suffixes _R1 and _R2 to the start_t.fastq file name specified in the --out option. The start_t_R1.fastq file (suffix _R1) contains the reads from the main file (forward.fastq), while start_t_R2.fastq (suffix _R2) contains the reads from the file specified by the --paired-with option (reverse.fastq).\n% ls -l total 135568 -rw-r--r--@ 1 coissac staff 1504 13 mai 18:09 forward.fastq -rw-r--r--@ 1 coissac staff 1504 13 mai 18:09 reverse.fastq -rw-r-----@ 1 coissac staff 1179 13 mai 18:14 start_t_R1.fastq -rw-r-----@ 1 coissac staff 1179 13 mai 18:14 start_t_R2.fastq Inspecting the file start_t_R1.fastq makes the effect of obigrep clear. Every sequence starts with t.\n📄 start_t_R1.fastq @M01334:147:000000000-LBRVD:1:1101:14968:1570 {\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;} tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca + 11\u0026gt;A\u0026gt;@3@A11\u0026gt;ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0\u0026gt;BF11B210B\u0026gt;//11B1\u0026lt;1BB\u0026lt;///\u0026lt;1122 @M01334:147:000000000-LBRVD:1:1101:15946:1586 {\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;} tcctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc + 1\u0026gt;\u0026gt;A111\u0026gt;\u0026gt;\u0026gt;AFGGB1FFGFGFF3BBF1GGHHH33D2GH2B1D211110D1DGHHBFGGGGG2FA2F221F21A1F0D1DGHH2FAFFGFHFFGHHHHGG22@1BD111@0FFHE11GC1001BGF1B1B/EF00??////BF////\u0026lt;000 @M01334:147:000000000-LBRVD:1:1101:15399:1590 {\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;} 
tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca + 11\u0026gt;A\u0026gt;@3B\u0026gt;\u0026gt;1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1\u0026gt;F1B1FD/////00\u0026lt;1 However, when we look at the file start_t_R2.fastq, the second sequence starts with a c. In fact, the obigrep constraint was only applied to the forward.fastq file. The sequences were selected from the reverse.fastq file because they are paired with one of the sequences selected from the forward.fastq file.\nšŸ“„ start_t_R2.fastq @M01334:147:000000000-LBRVD:1:1101:14968:1570 {\u0026#34;definition\u0026#34;:\u0026#34;2:N:0:CTCACCAA+CTAGGCAA\u0026#34;} ttttcctccctttttttctctgcacctttcttttttattagttttttattattttttttctttttttattttattgatactttatatctctctttttttcttttttattgatttttctctggttttcccttgttacttgttcttttttgct + 11\u0026gt;\u0026gt;1131111BB111A0B3B313A0B1BAFGG11E/DG222B22///1D2DDGG1AE\u0026gt;\u0026gt;FG1D1/\u0026gt;/12B221212@21BFD2B2B2B2F11BFGHEEC1111B//1212BBF110@22111@@/2111?01111@111?111111--11 @M01334:147:000000000-LBRVD:1:1101:15946:1586 {\u0026#34;definition\u0026#34;:\u0026#34;2:N:0:CTCACCAA+CTAGGCAA\u0026#34;} ccgttacgtgggcaatcctgagccaattctttctttttgaaaaaatgagagatataaaatatctcttatttattataagaaataaaatatttcttatctaatattaatgataggtgcagtgactctatggggttaggtagttcggatgagc + 111\u0026gt;\u0026gt;111B111111BA0B1101B001BAGGH22DGGH?01110/B11111/D1D2221D1DBEDGH1GHH2GG2F222110D@111D1DFGEGFBG@GB1B2FG22222B220B11111111B@11B210/?E/00B211B2/////111 @M01334:147:000000000-LBRVD:1:1101:15399:1590 {\u0026#34;definition\u0026#34;:\u0026#34;2:N:0:CTCACCAA+CTAGGCAA\u0026#34;} ttttcctcgggctatcctgagccaaatctttccttttgaaaaatttagagatataaaatatctcttatttattttatgtagtattatatttcttatctaatattaaatttagttgctttttctcattttgttttactttttcttttttgct + 
11\u0026gt;\u0026gt;1131111111B11B1101A000B1DFF21DDFG1011100B122111D1D2221D1DADAFG1DGH2FG2D212222D2222D2DAF2FG2D@F21B2DE22122B221@11111110B222B222B00021B221B011111//11 The --paired-mode option can be used to specify how the obigrep filtering constraints are applied to both files. The option requires an argument that can take six different values:\nforward: the selection rules apply only to the forward reads; the reverse reads are selected because they are paired with a selected forward read. This is the default behaviour presented above. reverse: the selection rules apply only to the reverse reads; the forward reads are selected because they are paired with a selected reverse read. obigrep -s \u0026#39;^t\u0026#39; \\ --paired-with reverse.fastq \\ --paired-mode reverse \\ --out start_t_rev.fastq \\ forward.fastq 📄 start_t_rev_R1.fastq @M01334:147:000000000-LBRVD:1:1101:14968:1570 {\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;} tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca + 11\u0026gt;A\u0026gt;@3@A11\u0026gt;ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0\u0026gt;BF11B210B\u0026gt;//11B1\u0026lt;1BB\u0026lt;///\u0026lt;1122 @M01334:147:000000000-LBRVD:1:1101:15399:1590 {\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;} tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca + 11\u0026gt;A\u0026gt;@3B\u0026gt;\u0026gt;1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1\u0026gt;F1B1FD/////00\u0026lt;1 @M01334:147:000000000-LBRVD:1:1101:13773:1687 {\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;} 
ctcggatcaccattgagtctctgcacctatctttaatattagataagaaaaaatattatttcttatctgaaataagaaatattttatatatttctttttctcaaaatgaaagatttggctcaggattgccctgatccgagggatagcacca + 3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHHHHHHHFFHHFBDFBCGHHF4BGHFGFFHHBDGFHHEHHFAAEECEGF3FDGFC šŸ“„ start_t_rev_R2.fastq @M01334:147:000000000-LBRVD:1:1101:14968:1570 {\u0026#34;definition\u0026#34;:\u0026#34;2:N:0:CTCACCAA+CTAGGCAA\u0026#34;} ttttcctccctttttttctctgcacctttcttttttattagttttttattattttttttctttttttattttattgatactttatatctctctttttttcttttttattgatttttctctggttttcccttgttacttgttcttttttgct + 11\u0026gt;\u0026gt;1131111BB111A0B3B313A0B1BAFGG11E/DG222B22///1D2DDGG1AE\u0026gt;\u0026gt;FG1D1/\u0026gt;/12B221212@21BFD2B2B2B2F11BFGHEEC1111B//1212BBF110@22111@@/2111?01111@111?111111--11 @M01334:147:000000000-LBRVD:1:1101:15399:1590 {\u0026#34;definition\u0026#34;:\u0026#34;2:N:0:CTCACCAA+CTAGGCAA\u0026#34;} ttttcctcgggctatcctgagccaaatctttccttttgaaaaatttagagatataaaatatctcttatttattttatgtagtattatatttcttatctaatattaaatttagttgctttttctcattttgttttactttttcttttttgct + 11\u0026gt;\u0026gt;1131111111B11B1101A000B1DFF21DDFG1011100B122111D1D2221D1DADAFG1DGH2FG2D212222D2222D2DAF2FG2D@F21B2DE22122B221@11111110B222B222B00021B221B011111//11 @M01334:147:000000000-LBRVD:1:1101:13773:1687 {\u0026#34;definition\u0026#34;:\u0026#34;2:N:0:CTCACCAA+CTAGGCAA\u0026#34;} tgatagcagggctatcctgagccaaatccgtgttttgagaaaacaagggggttctcgaactagaatacaaaagaaaaggataggtgcagagactcaatggtgctatccctcggatcagggcaatccttagccaaatctttcattttttgaa + 111\u0026gt;13@1111\u0026gt;11B1AF11BABC00B110BAFGGH0000DFAB//0///EEECGFA10AG1111D@@11100/0000/0F110B11@11/0\u0026gt;FC@1B\u0026gt;1B11FEFEC\u0026gt;E\u0026gt;///?\u0026lt;0110/?/FF\u0026lt;G22111@00@\u0026lt;GHHB\u0026gt;FHHH1///1 and: the selection rules must be true for both reads of the pair obigrep -s \u0026#39;^t\u0026#39; \\ --paired-with reverse.fastq \\ --paired-mode and \\ --out start_t_and.fastq \\ forward.fastq šŸ“„ start_t_and_R1.fastq 
@M01334:147:000000000-LBRVD:1:1101:14968:1570 {\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;} tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca + 11\u0026gt;A\u0026gt;@3@A11\u0026gt;ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0\u0026gt;BF11B210B\u0026gt;//11B1\u0026lt;1BB\u0026lt;///\u0026lt;1122 @M01334:147:000000000-LBRVD:1:1101:15399:1590 {\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;} tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca + 11\u0026gt;A\u0026gt;@3B\u0026gt;\u0026gt;1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1\u0026gt;F1B1FD/////00\u0026lt;1 šŸ“„ start_t_and_R2.fastq @M01334:147:000000000-LBRVD:1:1101:14968:1570 {\u0026#34;definition\u0026#34;:\u0026#34;2:N:0:CTCACCAA+CTAGGCAA\u0026#34;} ttttcctccctttttttctctgcacctttcttttttattagttttttattattttttttctttttttattttattgatactttatatctctctttttttcttttttattgatttttctctggttttcccttgttacttgttcttttttgct + 11\u0026gt;\u0026gt;1131111BB111A0B3B313A0B1BAFGG11E/DG222B22///1D2DDGG1AE\u0026gt;\u0026gt;FG1D1/\u0026gt;/12B221212@21BFD2B2B2B2F11BFGHEEC1111B//1212BBF110@22111@@/2111?01111@111?111111--11 @M01334:147:000000000-LBRVD:1:1101:15399:1590 {\u0026#34;definition\u0026#34;:\u0026#34;2:N:0:CTCACCAA+CTAGGCAA\u0026#34;} ttttcctcgggctatcctgagccaaatctttccttttgaaaaatttagagatataaaatatctcttatttattttatgtagtattatatttcttatctaatattaaatttagttgctttttctcattttgttttactttttcttttttgct + 11\u0026gt;\u0026gt;1131111111B11B1101A000B1DFF21DDFG1011100B122111D1D2221D1DADAFG1DGH2FG2D212222D2222D2DAF2FG2D@F21B2DE22122B221@11111110B222B222B00021B221B011111//11 or: the selection rules must be true for at least one read of the pair. 
The second read is selected because its counterpart has been selected by the obigrep rules. obigrep -s \u0026#39;^t\u0026#39; \\ --paired-with reverse.fastq \\ --paired-mode or \\ --out start_t_or.fastq \\ forward.fastq šŸ“„ start_t_or_R1.fastq @M01334:147:000000000-LBRVD:1:1101:14968:1570 {\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;} tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca + 11\u0026gt;A\u0026gt;@3@A11\u0026gt;ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0\u0026gt;BF11B210B\u0026gt;//11B1\u0026lt;1BB\u0026lt;///\u0026lt;1122 @M01334:147:000000000-LBRVD:1:1101:15946:1586 {\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;} tcctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc + 1\u0026gt;\u0026gt;A111\u0026gt;\u0026gt;\u0026gt;AFGGB1FFGFGFF3BBF1GGHHH33D2GH2B1D211110D1DGHHBFGGGGG2FA2F221F21A1F0D1DGHH2FAFFGFHFFGHHHHGG22@1BD111@0FFHE11GC1001BGF1B1B/EF00??////BF////\u0026lt;000 @M01334:147:000000000-LBRVD:1:1101:15399:1590 {\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;} tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca + 11\u0026gt;A\u0026gt;@3B\u0026gt;\u0026gt;1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1\u0026gt;F1B1FD/////00\u0026lt;1 @M01334:147:000000000-LBRVD:1:1101:13773:1687 {\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;} ctcggatcaccattgagtctctgcacctatctttaatattagataagaaaaaatattatttcttatctgaaataagaaatattttatatatttctttttctcaaaatgaaagatttggctcaggattgccctgatccgagggatagcacca + 
3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHHHHHHHFFHHFBDFBCGHHF4BGHFGFFHHBDGFHHEHHFAAEECEGF3FDGFC šŸ“„ start_t_or_R2.fastq @M01334:147:000000000-LBRVD:1:1101:14968:1570 {\u0026#34;definition\u0026#34;:\u0026#34;2:N:0:CTCACCAA+CTAGGCAA\u0026#34;} ttttcctccctttttttctctgcacctttcttttttattagttttttattattttttttctttttttattttattgatactttatatctctctttttttcttttttattgatttttctctggttttcccttgttacttgttcttttttgct + 11\u0026gt;\u0026gt;1131111BB111A0B3B313A0B1BAFGG11E/DG222B22///1D2DDGG1AE\u0026gt;\u0026gt;FG1D1/\u0026gt;/12B221212@21BFD2B2B2B2F11BFGHEEC1111B//1212BBF110@22111@@/2111?01111@111?111111--11 @M01334:147:000000000-LBRVD:1:1101:15946:1586 {\u0026#34;definition\u0026#34;:\u0026#34;2:N:0:CTCACCAA+CTAGGCAA\u0026#34;} ccgttacgtgggcaatcctgagccaattctttctttttgaaaaaatgagagatataaaatatctcttatttattataagaaataaaatatttcttatctaatattaatgataggtgcagtgactctatggggttaggtagttcggatgagc + 111\u0026gt;\u0026gt;111B111111BA0B1101B001BAGGH22DGGH?01110/B11111/D1D2221D1DBEDGH1GHH2GG2F222110D@111D1DFGEGFBG@GB1B2FG22222B220B11111111B@11B210/?E/00B211B2/////111 @M01334:147:000000000-LBRVD:1:1101:15399:1590 {\u0026#34;definition\u0026#34;:\u0026#34;2:N:0:CTCACCAA+CTAGGCAA\u0026#34;} ttttcctcgggctatcctgagccaaatctttccttttgaaaaatttagagatataaaatatctcttatttattttatgtagtattatatttcttatctaatattaaatttagttgctttttctcattttgttttactttttcttttttgct + 11\u0026gt;\u0026gt;1131111111B11B1101A000B1DFF21DDFG1011100B122111D1D2221D1DADAFG1DGH2FG2D212222D2222D2DAF2FG2D@F21B2DE22122B221@11111110B222B222B00021B221B011111//11 @M01334:147:000000000-LBRVD:1:1101:13773:1687 {\u0026#34;definition\u0026#34;:\u0026#34;2:N:0:CTCACCAA+CTAGGCAA\u0026#34;} tgatagcagggctatcctgagccaaatccgtgttttgagaaaacaagggggttctcgaactagaatacaaaagaaaaggataggtgcagagactcaatggtgctatccctcggatcagggcaatccttagccaaatctttcattttttgaa + 
111\u0026gt;13@1111\u0026gt;11B1AF11BABC00B110BAFGGH0000DFAB//0///EEECGFA10AG1111D@@11100/0000/0F110B11@11/0\u0026gt;FC@1B\u0026gt;1B11FEFEC\u0026gt;E\u0026gt;///?\u0026lt;0110/?/FF\u0026lt;G22111@00@\u0026lt;GHHB\u0026gt;FHHH1///1 andnot: the selection rules must be true on the forward sequence but not on the reverse one. obigrep -s \u0026#39;^t\u0026#39; \\ --paired-with reverse.fastq \\ --paired-mode andnot \\ --out start_t_andnot.fastq \\ forward.fastq šŸ“„ start_t_andnot_R1.fastq @M01334:147:000000000-LBRVD:1:1101:15946:1586 {\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;} tcctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc + 1\u0026gt;\u0026gt;A111\u0026gt;\u0026gt;\u0026gt;AFGGB1FFGFGFF3BBF1GGHHH33D2GH2B1D211110D1DGHHBFGGGGG2FA2F221F21A1F0D1DGHH2FAFFGFHFFGHHHHGG22@1BD111@0FFHE11GC1001BGF1B1B/EF00??////BF////\u0026lt;000 šŸ“„ start_t_andnot_R2.fastq @M01334:147:000000000-LBRVD:1:1101:15946:1586 {\u0026#34;definition\u0026#34;:\u0026#34;2:N:0:CTCACCAA+CTAGGCAA\u0026#34;} ccgttacgtgggcaatcctgagccaattctttctttttgaaaaaatgagagatataaaatatctcttatttattataagaaataaaatatttcttatctaatattaatgataggtgcagtgactctatggggttaggtagttcggatgagc + 111\u0026gt;\u0026gt;111B111111BA0B1101B001BAGGH22DGGH?01110/B11111/D1D2221D1DBEDGH1GHH2GG2F222110D@111D1DFGEGFBG@GB1B2FG22222B220B11111111B@11B210/?E/00B211B2/////111 xor: the selection rules must be true on only one read of the pair, not on both. 
obigrep -s \u0026#39;^t\u0026#39; \\ --paired-with reverse.fastq \\ --paired-mode xor \\ --out start_t_xor.fastq \\ forward.fastq šŸ“„ start_t_xor_R1.fastq @M01334:147:000000000-LBRVD:1:1101:15946:1586 {\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;} tcctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc + 1\u0026gt;\u0026gt;A111\u0026gt;\u0026gt;\u0026gt;AFGGB1FFGFGFF3BBF1GGHHH33D2GH2B1D211110D1DGHHBFGGGGG2FA2F221F21A1F0D1DGHH2FAFFGFHFFGHHHHGG22@1BD111@0FFHE11GC1001BGF1B1B/EF00??////BF////\u0026lt;000 @M01334:147:000000000-LBRVD:1:1101:13773:1687 {\u0026#34;definition\u0026#34;:\u0026#34;1:N:0:CTCACCAA+CTAGGCAA\u0026#34;} ctcggatcaccattgagtctctgcacctatctttaatattagataagaaaaaatattatttcttatctgaaataagaaatattttatatatttctttttctcaaaatgaaagatttggctcaggattgccctgatccgagggatagcacca + 3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHHHHHHHFFHHFBDFBCGHHF4BGHFGFFHHBDGFHHEHHFAAEECEGF3FDGFC šŸ“„ start_t_xor_R2.fastq @M01334:147:000000000-LBRVD:1:1101:15946:1586 {\u0026#34;definition\u0026#34;:\u0026#34;2:N:0:CTCACCAA+CTAGGCAA\u0026#34;} ccgttacgtgggcaatcctgagccaattctttctttttgaaaaaatgagagatataaaatatctcttatttattataagaaataaaatatttcttatctaatattaatgataggtgcagtgactctatggggttaggtagttcggatgagc + 111\u0026gt;\u0026gt;111B111111BA0B1101B001BAGGH22DGGH?01110/B11111/D1D2221D1DBEDGH1GHH2GG2F222110D@111D1DFGEGFBG@GB1B2FG22222B220B11111111B@11B210/?E/00B211B2/////111 @M01334:147:000000000-LBRVD:1:1101:13773:1687 {\u0026#34;definition\u0026#34;:\u0026#34;2:N:0:CTCACCAA+CTAGGCAA\u0026#34;} tgatagcagggctatcctgagccaaatccgtgttttgagaaaacaagggggttctcgaactagaatacaaaagaaaaggataggtgcagagactcaatggtgctatccctcggatcagggcaatccttagccaaatctttcattttttgaa + 
111\u0026gt;13@1111\u0026gt;11B1AF11BABC00B110BAFGGH0000DFAB//0///EEECGFA10AG1111D@@11100/0000/0F110B11@11/0\u0026gt;FC@1B\u0026gt;1B11FEFEC\u0026gt;E\u0026gt;///?\u0026lt;0110/?/FF\u0026lt;G22111@00@\u0026lt;GHHB\u0026gt;FHHH1///1 Synopsis # obigrep [--allows-indels] [--approx-pattern \u0026lt;PATTERN\u0026gt;]... [--attribute|-a \u0026lt;KEY=VALUE\u0026gt;]... [--batch-size \u0026lt;int\u0026gt;] [--compress|-Z] [--csv] [--debug] [--definition|-D \u0026lt;PATTERN\u0026gt;]... [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--has-attribute|-A \u0026lt;KEY\u0026gt;]... [--help|-h|-?] [--id-list \u0026lt;FILENAME\u0026gt;] [--identifier|-I \u0026lt;PATTERN\u0026gt;]... [--ignore-taxon|-i \u0026lt;TAXID\u0026gt;]... [--input-OBI-header] [--input-json-header] [--inverse-match|-v] [--json-output] [--max-count|-C \u0026lt;COUNT\u0026gt;] [--max-cpu \u0026lt;int\u0026gt;] [--max-length|-L \u0026lt;LENGTH\u0026gt;] [--min-count|-c \u0026lt;COUNT\u0026gt;] [--min-length|-l \u0026lt;LENGTH\u0026gt;] [--no-order] [--no-progressbar] [--only-forward] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--paired-mode \u0026lt;forward|reverse|and|or|andnot|xor\u0026gt;] [--paired-with \u0026lt;FILENAME\u0026gt;] [--pattern-error \u0026lt;int\u0026gt;] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--predicate|-p \u0026lt;EXPRESSION\u0026gt;]... [--raw-taxid] [--require-rank \u0026lt;RANK_NAME\u0026gt;]... [--restrict-to-taxon|-r \u0026lt;TAXID\u0026gt;]... [--save-discarded \u0026lt;FILENAME\u0026gt;] [--sequence|-s \u0026lt;PATTERN\u0026gt;]... 
[--silent-warning] [--skip-empty] [--solexa] [--taxonomy|-t \u0026lt;string\u0026gt;] [--u-to-t] [--update-taxid] [--valid-taxid] [--version] [--with-leaves] [\u0026lt;args\u0026gt;] Options # Selecting sequence records # Selection based on the sequence # Strict matching # --sequence | -s \u0026lt;PATTERN\u003e: A regular expression pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. Regular expression patterns are case-insensitive. Approximate matching # --approx-pattern \u0026lt;PATTERN\u003e: A DNA pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. DNA patterns are case-insensitive. They can be matched allowing for errors: mismatches, insertions or deletions. --allows-indels: allows for indels during DNA pattern matching (see --approx-pattern option). --pattern-error \u0026lt;INTEGER\u003e: maximum number of errors allowed when searching for patterns in DNA (default 0, see --approx-pattern option). Selection based on the sequence identifier # --identifier | -I \u0026lt;REGEX\u003e: Regular expression pattern to be tested against the sequence identifier. The pattern is case-insensitive. --id-list \u0026lt;FILENAME\u003e: points to a text file containing the list of sequence record identifiers to be selected. The file format consists of a single identifier per line. Selection based on the sequence definition # --definition | -D \u0026lt;REGEX\u003e: Regular expression pattern to be tested against the sequence definition. The pattern is case-insensitive. Selection based on the sequence properties # --min-count | -c \u0026lt;COUNT\u003e: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or greater than the defined minimum count. --max-count | -C \u0026lt;COUNT\u003e: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or smaller than the defined maximum count. 
--min-length | -l \u0026lt;LENGTH\u003e: selects the sequence records for which the sequence length is equal to or greater than the defined minimum sequence length. --max-length | -L \u0026lt;LENGTH\u003e: selects the sequence records for which the sequence length is equal to or less than the defined maximum sequence length. Matching the sequence annotations # Taxonomy based filtering # If the user specifies a taxonomy when calling *OBITools* (see --taxonomy option), it is possible to filter the sequences based on taxonomic properties. Each of the following options can be used multiple times if needed to specify multiple taxids or ranks.\n--restrict-to-taxon | -r \u0026lt;TAXID\u003e: Only sequences having a taxid belonging to the provided taxid are conserved. --ignore-taxon | -i \u0026lt;TAXID\u003e: Sequences having a taxid belonging to the provided taxid are discarded. --require-rank \u0026lt;RANK_NAME\u003e: Only sequences having a taxid able to provide information at the \u0026lt;RANK_NAME\u0026gt; level are conserved. As an example, the NCBI taxid 74635 corresponding to Rosa canina is able to provide information at the species, genus or family level. But taxid 3764 (Rosa genus) is not able to provide information at the species level. Many of the taxids related to environmental samples have a partial classification, and a taxon at the species level is not always connected to a parent taxon at the genus level. It can sometimes be connected to a taxon at a higher level. Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. 
--csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). 
--skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obigrep --help "},{"id":37,"href":"/obidoc/obitools/obijoin/","title":"obijoin","section":"Basics","content":" obijoin: join annotations from a file to a sequence file # Description # Perform a join operation to transfer annotations from a file to a sequence file.\nSynopsis # obijoin --join-with|-j \u0026lt;string\u0026gt; [--batch-size \u0026lt;int\u0026gt;] [--by|-b \u0026lt;string\u0026gt;]... 
[--compress|-Z] [--csv] [--debug] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--help|-h|-?] [--input-OBI-header] [--input-json-header] [--json-output] [--max-cpu \u0026lt;int\u0026gt;] [--no-order] [--no-progressbar] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--raw-taxid] [--silent-warning] [--skip-empty] [--solexa] [--taxonomy|-t \u0026lt;string\u0026gt;] [--u-to-t] [--update-id|-i] [--update-quality|-q] [--update-sequence|-s] [--update-taxid] [--version] [--with-leaves] [\u0026lt;args\u0026gt;] Options # Required options # --join-with | -j \u0026lt;string\u003e: File name of the file to join with. obijoin specific options # --by | -b \u0026lt;string\u003e: To declare join keys. --paired-with \u0026lt;FILENAME\u003e: filename containing the paired reads. --update-id | -i : Update the sequence IDs in the joined file. --update-quality | -q : Update the quality in the joined file. --update-sequence | -s : Update the sequence in the joined file. Taxonomic options # --taxonomy | -t \u0026lt;string\u003e: Path to the taxonomic database. Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. 
Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits.
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obijoin --help "},{"id":38,"href":"/obidoc/docs/file_format/sequence_files/csv/","title":"CSV format","section":"Sequence file formats","content":" The CSV sequence file format # CSV (Comma-Separated Values) files are plain text files in which each line represents a data record, and the fields within a record are separated by commas.\nConverting FASTA file to CSV # Use the obicsv command to convert a fasta file to CSV format, with the -i and -s options to print the sequence identifier and the nucleotide sequence respectively, and the -k option to retain the desired attributes.
Each record in the FASTA file corresponds to a line in the output file:\n📄 two_sequences.fasta \u0026gt;AB061527 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Sorex unguiculatus mitochondrial NA, complete genome.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Soricidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9376,\u0026#34;genus_name\u0026#34;:\u0026#34;Sorex\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9379,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;family\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:2.2137847111025621e-13,\u0026#34;species_name\u0026#34;:\u0026#34;Sorex unguiculatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:62275,\u0026#34;taxid\u0026#34;:62275} ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat agcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;AL355887 {\u0026#34;count\u0026#34;:2,\u0026#34;definition\u0026#34;:\u0026#34;Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Hominidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9604,\u0026#34;genus_name\u0026#34;:\u0026#34;Homo\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9605,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:0,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;species_taxid\u0026#34;:9606,\u0026#34;taxid\u0026#34;:9606} ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac agcttaaaactcaaaggacctggcagttctttatatccct obicsv -k count -k taxid -k family_taxid -k family_name \\ -i -s \\ two_sequences.fasta \u0026gt; two_sequences.csv id,count,taxid,scientific_name,family_taxid,family_name,sequence AB061527,1,62275,NA,9376,Soricidae,ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaatagcttaaaactcaaaggacttggcggtgctttatatccct
AL355887,2,9606,NA,9604,Hominidae,ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaacagcttaaaactcaaaggacctggcagttctttatatccct The result of the obicsv can be reformatted with the csvlook command (the -I option disables the reformatting of values):\ncsvlook -I two_sequences.csv | id | count | taxid | family_taxid | family_name | sequence | | -------- | ----- | ----- | ------------ | ----------- | ---------------------------------------------------------------------------------------------------- | | AB061527 | 1 | 62275 | 9376 | Soricidae | ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaatagcttaaaactcaaaggacttggcggtgctttatatccct | | AL355887 | 2 | 9606 | 9604 | Hominidae | ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaacagcttaaaactcaaaggacctggcagttctttatatccct | Converting CSV file to FASTA format # To convert a sequence file in CSV format to fasta format, you can use the obiconvert command:\nobiconvert two_sequences.csv \u0026gt;AB061527 {\u0026#34;count\u0026#34;:1,\u0026#34;family_name\u0026#34;:\u0026#34;Soricidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:\u0026#34;9376\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;62275\u0026#34;} ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat agcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;AL355887 {\u0026#34;count\u0026#34;:2,\u0026#34;family_name\u0026#34;:\u0026#34;Hominidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:\u0026#34;9604\u0026#34;,\u0026#34;taxid\u0026#34;:\u0026#34;9606\u0026#34;} ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac agcttaaaactcaaaggacctggcagttctttatatccct Converting FASTQ file to CSV # In the same way as for fasta files, use the obicsv command to convert a fastq file to CSV format (the -q option prints the quality of the sequence in the output):\nšŸ“„ two_sequences.fastq @HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 
{\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:62,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(T:26)-\u0026gt;(G:13)\u0026#34;:62,\u0026#34;(T:34)-\u0026gt;(G:18)\u0026#34;:48},\u0026#34;score\u0026#34;:484,\u0026#34;score_norm\u0026#34;:0.968,\u0026#34;seq_a_single\u0026#34;:46,\u0026#34;seq_ab_match\u0026#34;:60,\u0026#34;seq_b_single\u0026#34;:46} ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC\u0026lt;CcCccbe[`F`accXV\u0026lt;TA\\RYU\\\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC @HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1 {\u0026#34;ali_dir\u0026#34;:\u0026#34;left\u0026#34;,\u0026#34;ali_length\u0026#34;:62,\u0026#34;mode\u0026#34;:\u0026#34;alignment\u0026#34;,\u0026#34;pairing_mismatches\u0026#34;:{\u0026#34;(A:02)-\u0026gt;(G:30)\u0026#34;:104,\u0026#34;(A:34)-\u0026gt;(G:14)\u0026#34;:64,\u0026#34;(C:02)-\u0026gt;(A:30)\u0026#34;:86,\u0026#34;(C:02)-\u0026gt;(T:20)\u0026#34;:108,\u0026#34;(C:27)-\u0026gt;(G:32)\u0026#34;:83,\u0026#34;(C:34)-\u0026gt;(G:18)\u0026#34;:57,\u0026#34;(T:02)-\u0026gt;(G:26)\u0026#34;:87,\u0026#34;(T:22)-\u0026gt;(G:14)\u0026#34;:66,\u0026#34;(T:29)-\u0026gt;(G:11)\u0026#34;:62,\u0026#34;(T:32)-\u0026gt;(G:30)\u0026#34;:48},\u0026#34;score\u0026#34;:283,\u0026#34;score_norm\u0026#34;:0.839,\u0026#34;seq_a_single\u0026#34;:46,\u0026#34;seq_ab_match\u0026#34;:52,\u0026#34;seq_b_single\u0026#34;:46} ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaagagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + 
CCCCCCCCCCCCCCCCCCCCCCCBBCCC?BCCCCCBC?CCCC@@;AVA`cWeb_TYC\\UIN?IDP8QJMKRPVGLQAFPPc`AbAFB5A4\u0026gt;AAA56A\u0026gt;\u0026lt;\u0026gt;8\u0026gt;\u0026gt;F@A\u0026gt;\u0026lt;8??@BB+\u0026lt;?;?C@9CCCCCC\u0026lt;CC=CCCCCCCCCBC?CBCCCCC@CC obicsv -i -s -q two_sequences.fastq \u0026gt; two_sequences.csv csvlook -I two_sequences.csv | id | sequence | qualities | | ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | | HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 | ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg | CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC\u0026lt;CcCccbe[`F`accXV\u0026lt;TA\\RYU\\\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC | | HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1 | ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaagagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg | CCCCCCCCCCCCCCCCCCCCCCCBBCCC?BCCCCCBC?CCCC@@;AVA`cWeb_TYC\\UIN?IDP8QJMKRPVGLQAFPPc`AbAFB5A4\u0026gt;AAA56A\u0026gt;\u0026lt;\u0026gt;8\u0026gt;\u0026gt;F@A\u0026gt;\u0026lt;8??@BB+\u0026lt;?;?C@9CCCCCC\u0026lt;CC=CCCCCCCCCBC?CBCCCCC@CC | Converting CSV file to FASTQ format # obiconvert --fastq-output two_sequences.csv @HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + 
CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC\u0026lt;CcCccbe[`F`accXV\u0026lt;TA\\RYU\\\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC @HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1 ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaagagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + CCCCCCCCCCCCCCCCCCCCCCCBBCCC?BCCCCCBC?CCCC@@;AVA`cWeb_TYC\\UIN?IDP8QJMKRPVGLQAFPPc`AbAFB5A4\u0026gt;AAA56A\u0026gt;\u0026lt;\u0026gt;8\u0026gt;\u0026gt;F@A\u0026gt;\u0026lt;8??@BB+\u0026lt;?;?C@9CCCCCC\u0026lt;CC=CCCCCCCCCBC?CBCCCCC@CC "},{"id":39,"href":"/obidoc/obitools/obimatrix/","title":"obimatrix","section":"Basics","content":" obimatrix: convert a sequence file into a data matrix file # Description # Convert a mapping tag from a sequence file to a matrix file in CSV format.\nSynopsis # obimatrix [--batch-size \u0026lt;int\u0026gt;] [--csv] [--debug] [--ecopcr] [--embl] [--fasta] [--fastq] [--force-one-cpu] [--genbank] [--help|-h|-?] [--input-OBI-header] [--input-json-header] [--map \u0026lt;string\u0026gt;] [--max-cpu \u0026lt;int\u0026gt;] [--na-value \u0026lt;string\u0026gt;] [--no-order] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--sample-name \u0026lt;string\u0026gt;] [--silent-warning] [--solexa] [--three-columns] [--transpose] [--u-to-t] [--value-name \u0026lt;string\u0026gt;] [--version] [\u0026lt;args\u0026gt;] Options # obimatrix specific options # --map \u0026lt;string\u003e: Which attribute is used to produce the matrix. (default: \u0026ldquo;merged_sample\u0026rdquo;) --na-value \u0026lt;string\u003e: Value used when the map attribute is not defined for a sequence. (default: \u0026ldquo;0\u0026rdquo;) --sample-name \u0026lt;string\u003e: Name of the column containing the sample names in the three-column format. (default: \u0026ldquo;sample\u0026rdquo;) --three-columns: Prints the matrix in three-column format.
--transpose: Prints the transposed matrix. (default: true) --value-name \u0026lt;string\u003e: Name of the column containing the values in the three-column format. (default: \u0026ldquo;count\u0026rdquo;) Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file.
Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server.
Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obimatrix --help "},{"id":40,"href":"/obidoc/formats/json/","title":"JSON format","section":"Sequence file formats","content":" The JSON sequence file format # To facilitate the exchange of data between different systems, and to allow easy parsing of the data in any programming language, OBITools can export sequence data in JSON format. JSON output is requested by adding the --json-output option to the command line.\nConverting FASTA to JSON format # Here is an example of two sequences in fasta format:\n📄 json_example.fasta \u0026gt;AB061527 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Sorex unguiculatus mitochondrial NA, complete genome.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Soricidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9376,\u0026#34;genus_name\u0026#34;:\u0026#34;Sorex\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9379,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;family\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:2.2137847111025621e-13,\u0026#34;species_name\u0026#34;:\u0026#34;Sorex unguiculatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:62275,\u0026#34;taxid\u0026#34;:62275} ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat agcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;AL355887 {\u0026#34;count\u0026#34;:2,\u0026#34;definition\u0026#34;:\u0026#34;Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.;
HTGS_ACTIVFIN.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Hominidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9604,\u0026#34;genus_name\u0026#34;:\u0026#34;Homo\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9605,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:0,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;species_taxid\u0026#34;:9606,\u0026#34;taxid\u0026#34;:9606} ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac agcttaaaactcaaaggacctggcagttctttatatccct that you can convert to JSON with the obiconvert command:\nobiconvert --json-output json_example.fasta \u0026gt; json_example.json which produces the following JSON output:\n📄 json_example.json [ { \u0026#34;annotations\u0026#34;: { \u0026#34;count\u0026#34;: 1, \u0026#34;definition\u0026#34;: \u0026#34;Sorex unguiculatus mitochondrial NA, complete genome.\u0026#34;, \u0026#34;family_name\u0026#34;: \u0026#34;Soricidae\u0026#34;, \u0026#34;family_taxid\u0026#34;: \u0026#34;9376\u0026#34;, \u0026#34;genus_name\u0026#34;: \u0026#34;Sorex\u0026#34;, \u0026#34;genus_taxid\u0026#34;: \u0026#34;9379\u0026#34;, \u0026#34;obicleandb_level\u0026#34;: \u0026#34;family\u0026#34;, \u0026#34;obicleandb_trusted\u0026#34;: 2.2137847111025621e-13, \u0026#34;species_name\u0026#34;: \u0026#34;Sorex unguiculatus\u0026#34;, \u0026#34;species_taxid\u0026#34;: \u0026#34;62275\u0026#34;, \u0026#34;taxid\u0026#34;: \u0026#34;62275\u0026#34; }, \u0026#34;id\u0026#34;: \u0026#34;AB061527\u0026#34;, \u0026#34;sequence\u0026#34;: \u0026#34;ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaatagcttaaaactcaaaggacttggcggtgctttatatccct\u0026#34; }, { \u0026#34;annotations\u0026#34;: { \u0026#34;count\u0026#34;: 2, \u0026#34;definition\u0026#34;: \u0026#34;Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of
Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.\u0026#34;, \u0026#34;family_name\u0026#34;: \u0026#34;Hominidae\u0026#34;, \u0026#34;family_taxid\u0026#34;: \u0026#34;9604\u0026#34;, \u0026#34;genus_name\u0026#34;: \u0026#34;Homo\u0026#34;, \u0026#34;genus_taxid\u0026#34;: \u0026#34;9605\u0026#34;, \u0026#34;obicleandb_level\u0026#34;: \u0026#34;genus\u0026#34;, \u0026#34;obicleandb_trusted\u0026#34;: 0, \u0026#34;species_name\u0026#34;: \u0026#34;Homo sapiens\u0026#34;, \u0026#34;species_taxid\u0026#34;: \u0026#34;9606\u0026#34;, \u0026#34;taxid\u0026#34;: \u0026#34;9606\u0026#34; }, \u0026#34;id\u0026#34;: \u0026#34;AL355887\u0026#34;, \u0026#34;sequence\u0026#34;: \u0026#34;ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaacagcttaaaactcaaaggacctggcagttctttatatccct\u0026#34; } ] Converting FASTQ to JSON format # To convert a fastq file such as the following example:\n📄 json_example.fasta \u0026gt;AB061527 {\u0026#34;count\u0026#34;:1,\u0026#34;definition\u0026#34;:\u0026#34;Sorex unguiculatus mitochondrial NA, complete genome.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Soricidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9376,\u0026#34;genus_name\u0026#34;:\u0026#34;Sorex\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9379,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;family\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:2.2137847111025621e-13,\u0026#34;species_name\u0026#34;:\u0026#34;Sorex unguiculatus\u0026#34;,\u0026#34;species_taxid\u0026#34;:62275,\u0026#34;taxid\u0026#34;:62275} ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat agcttaaaactcaaaggacttggcggtgctttatatccct \u0026gt;AL355887 {\u0026#34;count\u0026#34;:2,\u0026#34;definition\u0026#34;:\u0026#34;Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.;
HTGS_ACTIVFIN.\u0026#34;,\u0026#34;family_name\u0026#34;:\u0026#34;Hominidae\u0026#34;,\u0026#34;family_taxid\u0026#34;:9604,\u0026#34;genus_name\u0026#34;:\u0026#34;Homo\u0026#34;,\u0026#34;genus_taxid\u0026#34;:9605,\u0026#34;obicleandb_level\u0026#34;:\u0026#34;genus\u0026#34;,\u0026#34;obicleandb_trusted\u0026#34;:0,\u0026#34;species_name\u0026#34;:\u0026#34;Homo sapiens\u0026#34;,\u0026#34;species_taxid\u0026#34;:9606,\u0026#34;taxid\u0026#34;:9606} ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac agcttaaaactcaaaggacctggcagttctttatatccct use the equivalent command:\nobiconvert --json-output json_example.fastq \u0026gt; json_example_qual.json which produces the following JSON output:\n📄 json_example_qual.json [ { \u0026#34;annotations\u0026#34;: { \u0026#34;count\u0026#34;: 1, \u0026#34;definition\u0026#34;: \u0026#34;Sorex unguiculatus mitochondrial NA, complete genome.\u0026#34;, \u0026#34;family_name\u0026#34;: \u0026#34;Soricidae\u0026#34;, \u0026#34;family_taxid\u0026#34;: \u0026#34;9376\u0026#34;, \u0026#34;genus_name\u0026#34;: \u0026#34;Sorex\u0026#34;, \u0026#34;genus_taxid\u0026#34;: \u0026#34;9379\u0026#34;, \u0026#34;obicleandb_level\u0026#34;: \u0026#34;family\u0026#34;, \u0026#34;species_name\u0026#34;: \u0026#34;Sorex unguiculatus\u0026#34;, \u0026#34;species_taxid\u0026#34;: \u0026#34;62275\u0026#34;, \u0026#34;taxid\u0026#34;: \u0026#34;62275\u0026#34; }, \u0026#34;id\u0026#34;: \u0026#34;AB061527\u0026#34;, \u0026#34;qualities\u0026#34;: \u0026#34;BBFFBFFHHHHHFFFFFHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH\u0026#34;, \u0026#34;sequence\u0026#34;: \u0026#34;ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat\u0026#34; }, { \u0026#34;annotations\u0026#34;: { \u0026#34;count\u0026#34;: 2, \u0026#34;definition\u0026#34;: \u0026#34;Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo
sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.\u0026#34;, \u0026#34;family_name\u0026#34;: \u0026#34;Hominidae\u0026#34;, \u0026#34;family_taxid\u0026#34;: \u0026#34;9604\u0026#34;, \u0026#34;genus_name\u0026#34;: \u0026#34;Homo\u0026#34;, \u0026#34;genus_taxid\u0026#34;: \u0026#34;9605\u0026#34;, \u0026#34;obicleandb_level\u0026#34;: \u0026#34;genus\u0026#34;, \u0026#34;species_name\u0026#34;: \u0026#34;Homo sapiens\u0026#34;, \u0026#34;species_taxid\u0026#34;: \u0026#34;9606\u0026#34;, \u0026#34;taxid\u0026#34;: \u0026#34;9606\u0026#34; }, \u0026#34;id\u0026#34;: \u0026#34;AL355887\u0026#34;, \u0026#34;qualities\u0026#34;: \u0026#34;@BBFFBFFHHHGGGFFFFFHHHJJJHHHHHHHHHJHHHHHHHHHHHHHHHHHHHHHHHHH\u0026#34;, \u0026#34;sequence\u0026#34;: \u0026#34;ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac\u0026#34; } ] "},{"id":41,"href":"/obidoc/docs/programming/lua/obitools_classes/mutex/","title":"Mutex","section":"Obitools Classes","content":" The Mutex class # Constructor of Mutex # Mutex Methods # lock # unlock # "},{"id":42,"href":"/obidoc/obitools/obisplit/","title":"obisplit","section":"Basics","content":" obisplit: # Description # Synopsis # obisplit [--allows-indels] [--batch-size \u0026lt;int\u0026gt;] [--compress|-Z] [--config|-C \u0026lt;string\u0026gt;] [--csv] [--debug] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--help|-h|-?] 
[--input-OBI-header] [--input-json-header] [--json-output] [--max-cpu \u0026lt;int\u0026gt;] [--no-order] [--no-progressbar] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--pattern-error \u0026lt;int\u0026gt;] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--raw-taxid] [--silent-warning] [--skip-empty] [--solexa] [--taxonomy|-t \u0026lt;string\u0026gt;] [--template] [--u-to-t] [--update-taxid] [--version] [--with-leaves] [\u0026lt;args\u0026gt;] Options # obisplit specific options # --allows-indels: Allows for indels during pattern DNA pattern matching. --config | -C \u0026lt;string\u003e: The configuration file. --pattern-error \u0026lt;INTEGER\u003e: Maximum number of allowed error during pattern matching (default: 4). --template: Print on the standard output a script template. (default: false) Taxonomic options # --taxonomy | -t \u0026lt;string\u003e: Path to the taxonomic database. Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. 
Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obisplit --help "},{"id":43,"href":"/obidoc/obitools/obisummary/","title":"obisummary","section":"Basics","content":" obisummary: generate summary statistics # Description # Generate summary statistics describing the sequence content of a sequence file.\nSynopsis # obisummary [--batch-size \u0026lt;int\u0026gt;] [--csv] [--debug] [--ecopcr] [--embl] [--fasta] [--fastq] [--force-one-cpu] [--genbank] [--help|-h|-?] [--input-OBI-header] [--input-json-header] [--json-output] [--map \u0026lt;string\u0026gt;]...
[--max-cpu \u0026lt;int\u0026gt;] [--no-order] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--silent-warning] [--solexa] [--u-to-t] [--version] [--yaml-output] [\u0026lt;args\u0026gt;] Options # obisummary specific options # --map \u0026lt;string\u003e: Name of a map attribute. --yaml-output \u0026lt;string\u003e: Print results as YAML record. (default: false) Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. 
If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). 
--batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obisummary --help "},{"id":44,"href":"/obidoc/obitools/obiuniq/","title":"obiuniq","section":"Basics","content":" obiuniq: dereplicate a sequence file # Description # Dereplicate a sequence file, by merging identical sequences.\nSynopsis # obiuniq [--batch-size \u0026lt;int\u0026gt;] [--category-attribute|-c \u0026lt;CATEGORY\u0026gt;]... [--chunk-count \u0026lt;int\u0026gt;] [--compress|-Z] [--csv] [--debug] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--help|-h|-?] [--in-memory] [--input-OBI-header] [--input-json-header] [--json-output] [--max-cpu \u0026lt;int\u0026gt;] [--merge|-m \u0026lt;KEY\u0026gt;]... [--na-value \u0026lt;NA_NAME\u0026gt;] [--no-order] [--no-progressbar] [--no-singleton] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--raw-taxid] [--silent-warning] [--skip-empty] [--solexa] [--taxonomy|-t \u0026lt;string\u0026gt;] [--u-to-t] [--update-taxid] [--version] [--with-leaves] [\u0026lt;args\u0026gt;] Options # obiuniq specific options # --category-attribute | -c \u0026lt;CATEGORY\u003e: Adds one attribute to the list of attributes used to define sequence groups (this option can be used several times). 
--chunk-count \u0026lt;INTEGER\u003e: Number of chunks into which the dataset is pre-divided to speed up the process. (default: 100) --in-memory: Use memory instead of disk to store data chunks. (default: false) --merge | -m \u0026lt;KEY\u003e: Adds a merged attribute containing the list of sequence record ids merged within this group. --na-value \u0026lt;NA_NAME\u003e: Value used when the classifier tag is not defined for a sequence. (default: \u0026ldquo;NA\u0026rdquo;) --no-singleton: If set, sequences occurring a single time in the data set are discarded. (default: false) --paired-with \u0026lt;FILENAME\u003e: filename containing the paired reads. Taxonomic options # --taxonomy | -t \u0026lt;string\u003e: Path to the taxonomic database. Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. 
(default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. 
Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obiuniq --help "},{"id":45,"href":"/obidoc/docs/file_format/sequence_files/annotations/","title":"Annotation of sequences","section":"Sequence file formats","content":" Annotation of sequence data # OBITools manipulate files of sequence data. Each file can contain either a single sequence or millions of them. OBITools generate information from each sequence and must keep this computed information associated with the sequence data.\n"},{"id":46,"href":"/obidoc/docs/commands/tags/","title":"Glossary of tags","section":"The OBITools4 commands","content":" Glossary of tags # - D - # definition :\ntext information about the sequence present in the original sequence file.\ndirection :\nset to “forward” if the original sequence did not need to be reverse-complemented to be processed, set to “reverse” otherwise. 
( obipcr)\n- F - # forward_error :\nNumber of mismatches between forward primer and priming site ( obipcr)\nforward_match :\nForward primer priming site sequence ( obipcr)\nforward_primer :\nForward primer sequence ( obipcr)\n- R - # reverse_error :\nNumber of mismatches between reverse primer and priming site ( obipcr)\nreverse_match :\nReverse primer priming site sequence ( obipcr)\nreverse_primer :\nReverse primer sequence ( obipcr)\n- S - # scientific_name :\nScientific name associated with the taxid ( obipcr, obitag)\n- T - # taxid :\nTaxonomic identifier ( obipcr, obitag)\n"},{"id":47,"href":"/obidoc/obitools/obicleandb/","title":"obicleandb","section":"Experimentals","content":" obicleandb: clean a sequence reference database # Description # Clean a sequence reference database for trivial wrong taxonomic annotations.\nSynopsis # obicleandb [--batch-size \u0026lt;int\u0026gt;] [--compress|-Z] [--debug] [--ecopcr] [--embl] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--help|-h|-?] [--ignore-taxon|-i \u0026lt;TAXID\u0026gt;]... [--input-OBI-header] [--input-json-header] [--json-output] [--max-cpu \u0026lt;int\u0026gt;] [--no-order] [--no-progressbar] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--require-rank \u0026lt;RANK_NAME\u0026gt;]... [--restrict-to-taxon|-r \u0026lt;TAXID\u0026gt;]... [--skip-empty] [--solexa] [--taxonomy|-t \u0026lt;string\u0026gt;] [--update-taxids] [--version] [\u0026lt;args\u0026gt;] Options # obicleandb specific options # --opt1 | -o \u0026lt;PARAM\u003e: Here the description of the option Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. 
But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. 
--out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. 
(default: 6060, env: OBIPPROFGOROUTINE) Examples # obicleandb --help "},{"id":48,"href":"/obidoc/obitools/obiconsensus/","title":"obiconsensus","section":"Experimentals","content":" obiconsensus: denoise MinION data using consensus sequences # Description # Denoise MinIon sequence data by constructing consensus sequences.\nSynopsis # obiconsensus [--batch-size \u0026lt;int\u0026gt;] [--cluster|-C] [--compress|-Z] [--debug] [--distance|-d \u0026lt;int\u0026gt;] [--ecopcr] [--embl] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--help|-h|-?] [--input-OBI-header] [--input-json-header] [--json-output] [--kmer-size \u0026lt;SIZE\u0026gt;] [--low-coverage \u0026lt;float64\u0026gt;] [--max-cpu \u0026lt;int\u0026gt;] [--no-order] [--no-progressbar] [--no-singleton] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--sample|-s \u0026lt;string\u0026gt;] [--save-graph \u0026lt;string\u0026gt;] [--save-ratio \u0026lt;string\u0026gt;] [--skip-empty] [--solexa] [--unique|-U] [--version] [\u0026lt;args\u0026gt;] Options # obiconsensus specific options # --opt1 | -o \u0026lt;PARAM\u003e: Here the description of the option Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. 
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? 
: shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obiconsensus --help "},{"id":49,"href":"/obidoc/obitools/obilandmark/","title":"obilandmark","section":"Experimentals","content":" obilandmark # Description # Synopsis # obilandmark [--batch-size \u0026lt;int\u0026gt;] [--center|-n \u0026lt;int\u0026gt;] [--compress|-Z] [--debug] [--ecopcr] [--embl] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--help|-h|-?] 
[--input-OBI-header] [--input-json-header] [--json-output] [--max-cpu \u0026lt;int\u0026gt;] [--no-order] [--no-progressbar] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--skip-empty] [--solexa] [--taxonomy|-t \u0026lt;string\u0026gt;] [--version] [\u0026lt;args\u0026gt;] Options # obilandmark specific options # --opt1 | -o \u0026lt;PARAM\u003e: Here the description of the option Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. 
When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. 
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obilandmark --help "},{"id":50,"href":"/obidoc/obitools/obimicrosat/","title":"obimicrosat","section":"Others","content":" obimicrosat: extract microsatellite sequences # Description # Extract sequence entries containing a microsatellite from a sequence file.\nSynopsis # obimicrosat [--batch-size \u0026lt;int\u0026gt;] [--compress|-Z] [--debug] [--ecopcr] [--embl] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--help|-h|-?] 
[--input-OBI-header] [--input-json-header] [--json-output] [--max-cpu \u0026lt;int\u0026gt;] [--max-unit-length|-M \u0026lt;int\u0026gt;] [--min-flank-length|-f \u0026lt;int\u0026gt;] [--min-length|-l \u0026lt;int\u0026gt;] [--min-unit-count \u0026lt;int\u0026gt;] [--min-unit-length|-m \u0026lt;int\u0026gt;] [--no-order] [--no-progressbar] [--not-reoriented|-n] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--paired-with \u0026lt;FILENAME\u0026gt;] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--skip-empty] [--solexa] [--taxonomy|-t \u0026lt;string\u0026gt;] [--version] [\u0026lt;args\u0026gt;] Options # obimicrosat specific options: # --opt1 | -o \u0026lt;PARAM\u003e: Here the description of the option Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. 
(default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. 
Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obimicrosat --help "},{"id":51,"href":"/obidoc/obitools/obimultiplex/","title":"obimultiplex","section":"Demultiplexing samples","content":" obimultiplex: demultiplex the sequence reads # Description # The obimultiplex command demultiplexes sequencing reads by identifying sample-specific tags (barcodes) and PCR primers in the sequences. It assigns each sequence to its corresponding sample based on the tag combinations and primer sequences provided in a sample description file.\nThe demultiplexing process involves:\nIdentifying forward and reverse PCR primers in the sequences. Detecting sample-specific tags. Assigning sequences to samples based on the tag/primer combinations. Trimming primers and tags from the sequences. Reverse complementing the sequences if needed. Adding comprehensive annotations about the identification process. 
The new obimultiplex sample description file format # Although obimultiplex can still read the old ngsfilter format used by the legacy OBITools, the new format is now preferred.\nThe new format is a CSV file, which can easily be prepared using an export from your favourite spreadsheet program.\n# primer matching options @param,primer_mismatches,2 @param,indels,false # tag matching options @param,matching,strict experiment,sample,sample_tag,forward_primer,reverse_primer wolf_diet,13a_F730603,aattaac,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG wolf_diet,15a_F730814,gaagtag,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG The CSV file is divided into two sections. The first section consists of lines beginning with @param in the first cell. These lines specify the parameters used to match the primers and tags to the sequence. The second section provides a description of all the samples (PCRs) included in the sequencing library. This section begins with a line containing the names of the columns used to describe the samples in the subsequent lines. Only the second section is required.\nBasic format and required columns # Below is an example of the minimal description of the PCRs multiplexed in the sequencing library. In the new version of OBITools4 this file is a CSV file.\nThe first line is mandatory and must contain at least the five column names presented below:\nexperiment: the name of the experiment that allows for grouping of samples;\nsample: the sample (PCR) name;\nsample_tag: the tag identifying the sample:\nEach sample tag must be unique within the library for each pair of primers. Tags can be provided in upper or lower case. No distinction is made between the two.\nA tag can be a single DNA word, as here. This means that the same tag is used for both forward and reverse primers (e.g. aattaac).\nIt can also be two DNA words separated by a colon. For example, aagtag:gaagtag. 
This means that the first tag is used for the forward primer and the second for the reverse primer.\nThe example presented above, aattaac, is equivalent to aattaac:aattaac. In the two-word syntax, if a forward or reverse primer is not tagged, the tag is replaced by a hyphen. For example, aagtag:- or -:aagtag. Consequently, an experiment conducted without primer tags must declare a dummy tag: -:-.\nFor a given primer, all the tags must have the same length. forward_primer: the forward primer sequence\nreverse_primer: the reverse primer sequence\n📄 samples_simple.csv experiment,sample,sample_tag,forward_primer,reverse_primer wolf_diet,13a_F730603,aattaac,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG wolf_diet,15a_F730814,gaagtag,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG wolf_diet,26a_F040644,gaatatc,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG wolf_diet,29a_F260619,gcctcct,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG 📄 wolf_4seq.fastq @HELIUM_000100422_612GNAAXX:7:6:9274:14951#0/1 ccaattaactagaacaggctcgtctagaagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggcgaatagttttgtttgcataactatttgtgtttaaggctaggcatagtggggtatctaagttaattgg + CCCCCCCCCDCCCCCCCCCCCCCCCCCCCCCC=CBCCBCBCCCCCCDEFAEDEEEEBEAEJEJ?D?CD@^aVca\\C????CEBC\u0026gt;I?D\u0026lt;\u0026gt;EEDDDEEEEEEEAFEEDECCCCCCCCCCCCCCCBCCCCCCBCCCCCCCCCDCCCCCCCCCCCBC @HELIUM_000100422_612GNAAXX:7:57:18459:16145#0/1 ccgaatatcttagataccccactatgcttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaacagcctgaaactcaaaggacttggcggtgctttacatccctctagaggagcctgttctagatattcgg + CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCacZXceafbd_e_bVb`cb[WZb]aaaaV`ECDDCEDCDKECFFEEEEEDEDEEJEEE@EEJECCCCCBCCCCCCCCCCCCCCCCCCCCDCCCCCCCCCCCCCCCBCC @HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + 
CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC\u0026lt;CcCccbe[`F`accXV\u0026lt;TA\\RYU\\\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC @HELIUM_000100422_612GNAAXX:7:108:6440:4223#0/1 ccgcctcctttagatcccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + CCCCCCCBCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC\u0026lt;CcCccbe[`F`accXV\u0026lt;TA\\RYU\\\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC obimultiplex -s samples_simple.csv \\ wolf_4seq.fastq \\ \u0026gt; wolf_4seq_simple.fastq šŸ“„ wolf_4seq_simple.fastq @HELIUM_000100422_612GNAAXX:7:6:9274:14951#0/1_sub[28..127] {\u0026#34;experiment\u0026#34;:\u0026#34;wolf_diet\u0026#34;,\u0026#34;obimultiplex_amplicon_rank\u0026#34;:\u0026#34;1/1\u0026#34;,\u0026#34;obimultiplex_direction\u0026#34;:\u0026#34;reverse\u0026#34;,\u0026#34;obimultiplex_forward_error\u0026#34;:0,\u0026#34;obimultiplex_forward_match\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_forward_primer\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_proposed_tag\u0026#34;:\u0026#34;aattaac\u0026#34;,\u0026#34;obimultiplex_forward_tag\u0026#34;:\u0026#34;aattaac\u0026#34;,\u0026#34;obimultiplex_forward_tag_dist\u0026#34;:0,\u0026#34;obimultiplex_reverse_error\u0026#34;:1,\u0026#34;obimultiplex_reverse_match\u0026#34;:\u0026#34;tagaacaggctcgtctag\u0026#34;,\u0026#34;obimultiplex_reverse_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_reverse_primer\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;obimultiplex_reverse_proposed_tag\u0026#34;:\u0026#34;aattaac\u0026#34;,\u0026#34;obimultiplex_reverse_tag\u0026#34;:\u0026#34;aattaac\u0026#34;,\u0026#34;obimultiplex_reverse_tag_dist\u0026#34;:0,\u0026#34;sample\u0026#34;:\u0026#34;13a_F730603\u0026#34
;} ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt + CCCBCCCCCCCCCCCCCCCEDEEFAEEEEEEEDDDEE\u0026gt;\u0026lt;D?I\u0026gt;CBEC????C\\acVa^@DC?D?JEJEAEBEEEEDEAFEDCCCCCCBCBCCBC=CCCCC @HELIUM_000100422_612GNAAXX:7:57:18459:16145#0/1_sub[28..126] {\u0026#34;experiment\u0026#34;:\u0026#34;wolf_diet\u0026#34;,\u0026#34;obimultiplex_amplicon_rank\u0026#34;:\u0026#34;1/1\u0026#34;,\u0026#34;obimultiplex_direction\u0026#34;:\u0026#34;forward\u0026#34;,\u0026#34;obimultiplex_forward_error\u0026#34;:0,\u0026#34;obimultiplex_forward_match\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_forward_primer\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_proposed_tag\u0026#34;:\u0026#34;gaatatc\u0026#34;,\u0026#34;obimultiplex_forward_tag\u0026#34;:\u0026#34;gaatatc\u0026#34;,\u0026#34;obimultiplex_forward_tag_dist\u0026#34;:0,\u0026#34;obimultiplex_reverse_error\u0026#34;:0,\u0026#34;obimultiplex_reverse_match\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;obimultiplex_reverse_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_reverse_primer\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;obimultiplex_reverse_proposed_tag\u0026#34;:\u0026#34;gaatatc\u0026#34;,\u0026#34;obimultiplex_reverse_tag\u0026#34;:\u0026#34;gaatatc\u0026#34;,\u0026#34;obimultiplex_reverse_tag_dist\u0026#34;:0,\u0026#34;sample\u0026#34;:\u0026#34;26a_F040644\u0026#34;} ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaacagcctgaaactcaaaggacttggcggtgctttacatccct + CCCCCCCCCCCCCCCCCCacZXceafbd_e_bVb`cb[WZb]aaaaV`ECDDCEDCDKECFFEEEEEDEDEEJEEE@EEJECCCCCBCCCCCCCCCCCC @HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_sub[28..127] 
{\u0026#34;experiment\u0026#34;:\u0026#34;wolf_diet\u0026#34;,\u0026#34;obimultiplex_amplicon_rank\u0026#34;:\u0026#34;1/1\u0026#34;,\u0026#34;obimultiplex_direction\u0026#34;:\u0026#34;forward\u0026#34;,\u0026#34;obimultiplex_forward_error\u0026#34;:0,\u0026#34;obimultiplex_forward_match\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_forward_primer\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_proposed_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_forward_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_forward_tag_dist\u0026#34;:0,\u0026#34;obimultiplex_reverse_error\u0026#34;:0,\u0026#34;obimultiplex_reverse_match\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;obimultiplex_reverse_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_reverse_primer\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;obimultiplex_reverse_proposed_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_reverse_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_reverse_tag_dist\u0026#34;:0,\u0026#34;sample\u0026#34;:\u0026#34;29a_F260619\u0026#34;} ttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt + CCCBCCCCCBCCCCCCC\u0026lt;CcCccbe[`F`accXV\u0026lt;TA\\RYU\\\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCC Sample description\nexperiment: \u0026ldquo;wolf_diet\u0026rdquo;\nThe experiment name imputed to the barcode sequence\nsample: \u0026ldquo;13a_F730603\u0026rdquo;\nThe sample (PCR) name imputed to the barcode sequence\nAmplicon description\nobimultiplex_amplicon_rank: \u0026ldquo;1/1\u0026rdquo;\nobimultiplex is able to detect concatemer of several amplicons. This information is reported in the `obimultiplex_amplicon_rank` as a ratio here \"1/1\" meaning the first among one in the read. 
A value of \"2/3\" would mean the second amplicon detected among three in the read. obimultiplex_direction: \u0026ldquo;reverse\u0026rdquo;\nThe direction in which the amplicon has been detected:\n\u0026ldquo;forward\u0026rdquo; means that the forward primer has been identified, followed by the reverse complement of the reverse primer.\n\u0026ldquo;reverse\u0026rdquo; means that the reverse primer has been identified, followed by the reverse complement of the forward primer. The barcode sequence is then reverse complemented so that it is always reported oriented from the forward primer to the reverse primer.\nPrimer matching\nForward primer:\nobimultiplex_forward_primer: \u0026ldquo;ttagataccccactatgc\u0026rdquo;\nThe true forward primer sequence as provided in the obimultiplex sample description file.\nobimultiplex_forward_match: \u0026ldquo;ttagataccccactatgc\u0026rdquo;\nThe primer sequence as detected in the sequence read.\nobimultiplex_forward_error: 0\nThe number of differences between the obimultiplex_forward_primer and the obimultiplex_forward_match attribute values. obimultiplex by default allows up to two mismatches. 
That threshold can be changed using the --allowed-mismatches option (or -e, its short form).\nReverse primer:\n\u0026ldquo;obimultiplex_reverse_primer\u0026rdquo;:\u0026ldquo;tagaacaggctcctctag\u0026rdquo;\nThe true reverse primer sequence as provided in the obimultiplex sample description file.\n\u0026ldquo;obimultiplex_reverse_match\u0026rdquo;:\u0026ldquo;tagaacaggctcgtctag\u0026rdquo;\nThe primer sequence as detected in the sequence read.\n\u0026ldquo;obimultiplex_reverse_error\u0026rdquo;:1\nHere, one mismatch has been detected between the primer sequence and its match in the read.\nTag identification\nForward tag: \u0026ldquo;obimultiplex_forward_tag\u0026rdquo;:\u0026ldquo;gcctcct\u0026rdquo; \u0026ldquo;obimultiplex_forward_proposed_tag\u0026rdquo;:\u0026ldquo;gcctcct\u0026rdquo; \u0026ldquo;obimultiplex_forward_matching\u0026rdquo;:\u0026ldquo;strict\u0026rdquo; \u0026ldquo;obimultiplex_forward_tag_dist\u0026rdquo;:0 Reverse tag: \u0026ldquo;obimultiplex_reverse_tag\u0026rdquo;:\u0026ldquo;gcctcct\u0026rdquo; \u0026ldquo;obimultiplex_reverse_proposed_tag\u0026rdquo;:\u0026ldquo;gcctcct\u0026rdquo; \u0026ldquo;obimultiplex_reverse_matching\u0026rdquo;:\u0026ldquo;strict\u0026rdquo; \u0026ldquo;obimultiplex_reverse_tag_dist\u0026rdquo;:0 Supplementary columns # 📄 samples_extra.csv experiment,sample,sample_tag,forward_primer,reverse_primer,sex,age,plate,position wolf_diet,13a_F730603,aattaac,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG,male,adult,02,A03 wolf_diet,15a_F730814,gaagtag,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG,male,juvenile,02,A01 wolf_diet,26a_F040644,gaatatc,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG,female,adult,01,B08 wolf_diet,29a_F260619,gcctcct,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG,female,adult,01,B12 obimultiplex -s samples_extra.csv \\ wolf_4seq.fastq \\ \u0026gt; wolf_4seq_extra.fastq 📄 wolf_4seq_extra.fastq @HELIUM_000100422_612GNAAXX:7:6:9274:14951#0/1_sub[28..127] 
{\u0026#34;age\u0026#34;:\u0026#34;adult\u0026#34;,\u0026#34;experiment\u0026#34;:\u0026#34;wolf_diet\u0026#34;,\u0026#34;obimultiplex_amplicon_rank\u0026#34;:\u0026#34;1/1\u0026#34;,\u0026#34;obimultiplex_direction\u0026#34;:\u0026#34;reverse\u0026#34;,\u0026#34;obimultiplex_forward_error\u0026#34;:0,\u0026#34;obimultiplex_forward_match\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_forward_primer\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_proposed_tag\u0026#34;:\u0026#34;aattaac\u0026#34;,\u0026#34;obimultiplex_forward_tag\u0026#34;:\u0026#34;aattaac\u0026#34;,\u0026#34;obimultiplex_forward_tag_dist\u0026#34;:0,\u0026#34;obimultiplex_reverse_error\u0026#34;:1,\u0026#34;obimultiplex_reverse_match\u0026#34;:\u0026#34;tagaacaggctcgtctag\u0026#34;,\u0026#34;obimultiplex_reverse_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_reverse_primer\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;obimultiplex_reverse_proposed_tag\u0026#34;:\u0026#34;aattaac\u0026#34;,\u0026#34;obimultiplex_reverse_tag\u0026#34;:\u0026#34;aattaac\u0026#34;,\u0026#34;obimultiplex_reverse_tag_dist\u0026#34;:0,\u0026#34;plate\u0026#34;:\u0026#34;02\u0026#34;,\u0026#34;position\u0026#34;:\u0026#34;A03\u0026#34;,\u0026#34;sample\u0026#34;:\u0026#34;13a_F730603\u0026#34;,\u0026#34;sex\u0026#34;:\u0026#34;male\u0026#34;} ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt + CCCBCCCCCCCCCCCCCCCEDEEFAEEEEEEEDDDEE\u0026gt;\u0026lt;D?I\u0026gt;CBEC????C\\acVa^@DC?D?JEJEAEBEEEEDEAFEDCCCCCCBCBCCBC=CCCCC @HELIUM_000100422_612GNAAXX:7:57:18459:16145#0/1_sub[28..126] 
{\u0026#34;age\u0026#34;:\u0026#34;adult\u0026#34;,\u0026#34;experiment\u0026#34;:\u0026#34;wolf_diet\u0026#34;,\u0026#34;obimultiplex_amplicon_rank\u0026#34;:\u0026#34;1/1\u0026#34;,\u0026#34;obimultiplex_direction\u0026#34;:\u0026#34;forward\u0026#34;,\u0026#34;obimultiplex_forward_error\u0026#34;:0,\u0026#34;obimultiplex_forward_match\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_forward_primer\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_proposed_tag\u0026#34;:\u0026#34;gaatatc\u0026#34;,\u0026#34;obimultiplex_forward_tag\u0026#34;:\u0026#34;gaatatc\u0026#34;,\u0026#34;obimultiplex_forward_tag_dist\u0026#34;:0,\u0026#34;obimultiplex_reverse_error\u0026#34;:0,\u0026#34;obimultiplex_reverse_match\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;obimultiplex_reverse_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_reverse_primer\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;obimultiplex_reverse_proposed_tag\u0026#34;:\u0026#34;gaatatc\u0026#34;,\u0026#34;obimultiplex_reverse_tag\u0026#34;:\u0026#34;gaatatc\u0026#34;,\u0026#34;obimultiplex_reverse_tag_dist\u0026#34;:0,\u0026#34;plate\u0026#34;:\u0026#34;01\u0026#34;,\u0026#34;position\u0026#34;:\u0026#34;B08\u0026#34;,\u0026#34;sample\u0026#34;:\u0026#34;26a_F040644\u0026#34;,\u0026#34;sex\u0026#34;:\u0026#34;female\u0026#34;} ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaacagcctgaaactcaaaggacttggcggtgctttacatccct + CCCCCCCCCCCCCCCCCCacZXceafbd_e_bVb`cb[WZb]aaaaV`ECDDCEDCDKECFFEEEEEDEDEEJEEE@EEJECCCCCBCCCCCCCCCCCC @HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_sub[28..127] 
{\u0026#34;age\u0026#34;:\u0026#34;adult\u0026#34;,\u0026#34;experiment\u0026#34;:\u0026#34;wolf_diet\u0026#34;,\u0026#34;obimultiplex_amplicon_rank\u0026#34;:\u0026#34;1/1\u0026#34;,\u0026#34;obimultiplex_direction\u0026#34;:\u0026#34;forward\u0026#34;,\u0026#34;obimultiplex_forward_error\u0026#34;:0,\u0026#34;obimultiplex_forward_match\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_forward_primer\u0026#34;:\u0026#34;ttagataccccactatgc\u0026#34;,\u0026#34;obimultiplex_forward_proposed_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_forward_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_forward_tag_dist\u0026#34;:0,\u0026#34;obimultiplex_reverse_error\u0026#34;:0,\u0026#34;obimultiplex_reverse_match\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;obimultiplex_reverse_matching\u0026#34;:\u0026#34;strict\u0026#34;,\u0026#34;obimultiplex_reverse_primer\u0026#34;:\u0026#34;tagaacaggctcctctag\u0026#34;,\u0026#34;obimultiplex_reverse_proposed_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_reverse_tag\u0026#34;:\u0026#34;gcctcct\u0026#34;,\u0026#34;obimultiplex_reverse_tag_dist\u0026#34;:0,\u0026#34;plate\u0026#34;:\u0026#34;01\u0026#34;,\u0026#34;position\u0026#34;:\u0026#34;B12\u0026#34;,\u0026#34;sample\u0026#34;:\u0026#34;29a_F260619\u0026#34;,\u0026#34;sex\u0026#34;:\u0026#34;female\u0026#34;} ttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt + CCCBCCCCCBCCCCCCC\u0026lt;CcCccbe[`F`accXV\u0026lt;TA\\RYU\\\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCC obimultiplex -s samples_simple.csv \\ -u wolf_4seq_bad.fastq \\ wolf_4seq.fastq \\ \u0026gt; wolf_4seq_simple.fastq šŸ“„ wolf_4seq_bad.fastq @HELIUM_000100422_612GNAAXX:7:108:6440:4223#0/1 {\u0026#34;obimultiplex_error\u0026#34;:\u0026#34;No barcode identified\u0026#34;} 
ccgcctcctttagatcccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg + CCCCCCCBCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC\u0026lt;CcCccbe[`F`accXV\u0026lt;TA\\RYU\\\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC Synopsis # obimultiplex [--allowed-mismatches|-e \u0026lt;int\u0026gt;] [--batch-size \u0026lt;int\u0026gt;] [--compress|-Z] [--debug] [--ecopcr] [--embl] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--help|-h|-?] [--input-OBI-header] [--input-json-header] [--json-output] [--keep-errors] [--max-cpu \u0026lt;int\u0026gt;] [--no-order] [--no-progressbar] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--paired-with \u0026lt;FILENAME\u0026gt;] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--skip-empty] [--solexa] [--tag-list|-s \u0026lt;string\u0026gt;] [--taxonomy|-t \u0026lt;string\u0026gt;] [--template] [--unidentified|-u \u0026lt;string\u0026gt;] [--version] [--with-indels] [\u0026lt;args\u0026gt;] Options # obimultiplex specific options # --allowed-mismatches | -e \u0026lt;INTEGER\u003e: Used to specify the number of errors allowed for matching primers. (default: -1) --keep-errors: Prints symbol counts. (default: false) --paired-with \u0026lt;FILENAME\u003e: filename containing the paired reads. --tag-list | -s \u0026lt;string\u003e: File name of the NGSFilter file describing PCRs. --template: Print on the standard output an example of CSV configuration file. (default: false) --unidentified | -u \u0026lt;string\u003e: Filename used to store the sequences unassigned to any sample. --with-indels: Allows for indels during the primers matching. (default: false) Taxonomic options # --taxonomy | -t \u0026lt;string\u003e: Path to the taxonomic database. 
Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the \u0026ndash;no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). 
--fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. 
(default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obimultiplex --help "},{"id":52,"href":"/obidoc/obitools/obiscript/","title":"obiscript","section":"Advanced tools","content":" obiscript: apply a LUA script on sequences # Description # Apply a LUA script to each sequence in a sequence file.\nSynopsis # obiscript [--allows-indels] [--approx-pattern \u0026lt;PATTERN\u0026gt;]... [--attribute|-a \u0026lt;KEY=VALUE\u0026gt;]... [--batch-size \u0026lt;int\u0026gt;] [--compressed|-Z] [--csv] [--debug] [--definition|-D \u0026lt;PATTERN\u0026gt;]... [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--genbank] [--has-attribute|-A \u0026lt;KEY\u0026gt;]... [--help|-h|-?] [--id-list \u0026lt;FILENAME\u0026gt;] [--identifier|-I \u0026lt;PATTERN\u0026gt;]... [--ignore-taxon|-i \u0026lt;TAXID\u0026gt;]... [--input-OBI-header] [--input-json-header] [--inverse-match|-v] [--json-output] [--max-count|-C \u0026lt;COUNT\u0026gt;] [--max-cpu \u0026lt;int\u0026gt;] [--max-length|-L \u0026lt;LENGTH\u0026gt;] [--min-count|-c \u0026lt;COUNT\u0026gt;] [--min-length|-l \u0026lt;LENGTH\u0026gt;] [--no-order] [--no-progressbar] [--only-forward] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--paired-mode \u0026lt;forward|reverse|and|or|andnot|xor\u0026gt;] [--paired-with \u0026lt;FILENAME\u0026gt;] [--pattern-error \u0026lt;int\u0026gt;] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--predicate|-p \u0026lt;EXPRESSION\u0026gt;]... [--raw-taxid] [--require-rank \u0026lt;RANK_NAME\u0026gt;]... [--restrict-to-taxon|-r \u0026lt;TAXID\u0026gt;]... [--script|-S \u0026lt;string\u0026gt;] [--sequence|-s \u0026lt;PATTERN\u0026gt;]... 
[--skip-empty] [--solexa] [--taxonomy|-t \u0026lt;string\u0026gt;] [--template] [--update-taxid] [--valid-taxid] [--version] [\u0026lt;args\u0026gt;] Options # obiscript mandatory option # --script | -S \u0026lt;STRING\u003e: The script to execute. Other obiscript specific option # --template: Print on the standard output a script template. Selecting sequence records # Selection based on the sequence # Strict matching # --sequence | -s \u0026lt;PATTERN\u003e: A Regular expression pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. Regular expression patterns are case-insensitive. Approximate matching # --approx-pattern \u0026lt;PATTERN\u003e: A DNA pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. DNA patterns are case-insensitive. They can be matched allowing for errors: mismatches or insertions or deletions. --allows-indels: allows for indels during pattern DNA pattern matching (see --approx-pattern option). --pattern-error \u0026lt;INTEGER\u003e: maximum number of errors allowed when searching for patterns in DNA (default 0, see --approx-pattern option). Selection based on the sequence identifier # --identifier | -I \u0026lt;REGEX\u003e: Regular expression pattern to be tested against the sequence identifier. The pattern is case-insensitive. --id-list \u0026lt;FILENAME\u003e: points to a text file containing the list of sequence record identifiers to be selected. The file format consists in a single identifier per line. Selection based on the sequence definition # --definition | -D \u0026lt;REGEX\u003e: Regular expression pattern to be tested against the sequence definition. The pattern is case-insensitive. Selection based on the sequence properties # --min-count | -c \u0026lt;COUNT\u003e: selects the sequence records for which the number of occurrences (i.e the count attribute) is equal to or greater than the defined minimum count. 
--max-count | -C \u0026lt;COUNT\u003e: Select the sequence records for which the occurrence count (i.e the count attribute) is equal to or smaller than the defined maximum count. --min-length | -l \u0026lt;LENGTH\u003e: selects the sequence records for which the sequence length is equal to or greater than the defined minimum sequence length. --max-length | -L \u0026lt;LENGTH\u003e: selects sequence records for which the sequence length is equal to or less than the defined maximum sequence length. Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 are formatting annotations # These options only apply to the FASTA and FASTQ formats --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ formats --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. 
If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. --out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). 
--batch-size \u0026lt;INTEGER\u003e: number of sequence per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obiscript --help "},{"id":53,"href":"/obidoc/obitools/obitagpcr/","title":"obitagpcr","section":"Demultiplexing samples","content":" obitagpcr: # Description # Synopsis # obitagpcr --forward-reads|-F \u0026lt;FILENAME_F\u0026gt; --reverse-reads|-R \u0026lt;FILENAME_R\u0026gt; [--allowed-mismatches|-e \u0026lt;int\u0026gt;] [--batch-size \u0026lt;int\u0026gt;] [--compress|-Z] [--debug] [--delta|-D \u0026lt;int\u0026gt;] [--ecopcr] [--embl] [--exact-mode] [--fast-absolute] [--fasta] [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu] [--gap-penalty|-G \u0026lt;float64\u0026gt;] [--genbank] [--help|-h|-?] [--input-OBI-header] [--input-json-header] [--json-output] [--keep-errors] [--max-cpu \u0026lt;int\u0026gt;] [--min-identity|-X \u0026lt;float64\u0026gt;] [--min-overlap \u0026lt;int\u0026gt;] [--no-order] [--no-progressbar] [--out|-o \u0026lt;FILENAME\u0026gt;] [--output-OBI-header|-O] [--output-json-header] [--penalty-scale \u0026lt;float64\u0026gt;] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--reorientate] [--skip-empty] [--solexa] [--tag-list|-t \u0026lt;string\u0026gt;] [--template] [--unidentified|-u \u0026lt;string\u0026gt;] [--version] [--with-indels] [--without-stat|-S] [\u0026lt;args\u0026gt;] Options # obitagpcr specific options # --forward-reads | -F \u0026lt;FILENAME_F\u003e: The file names containing the forward reads. 
--reverse-reads | -R \u0026lt;FILENAME_R\u003e: The file names containing the reverse reads. --allowed-mismatches | -e \u0026lt;INTEGER\u003e: Used to specify the number of errors allowed for matching primers. (default: -1) --delta | -D \u0026lt;int\u003e: Length added to the fast detected overlap for the precise alignment (default: 5) --exact-mode: Do not run fast alignment heuristic. (default: false) --fast-absolute: Compute absolute fast score (no action in exact mode). (default: false) --gap-penalty | -G \u0026lt;float64\u003e: Gap penalty expressed as the multiplicative factor applied to the mismatch score between two nucleotides with a quality of 40. (default: 2.000000) --keep-errors: Prints symbol counts. (default: false) --min-identity | -X \u0026lt;float64\u003e: Minimum identity between the overlapping regions of the reads to consider the alignment (default: 0.900000) --min-overlap \u0026lt;int\u003e: Minimum overlap between the two reads to consider the alignment (default: 20) --penalty-scale \u0026lt;float64\u003e: Scale factor applied to the mismatch score and the gap penalty. (default: 1.000000) --reorientate: Reverse complement reads if needed to store all the sequences in the same orientation with respect to the forward and reverse primers (default: false) --tag-list | -s \u0026lt;string\u003e: File name of the NGSFilter file describing PCRs. --template: Print on the standard output an example of CSV configuration file. (default: false) --unidentified | -u \u0026lt;string\u003e: Filename used to store the sequences unassigned to any sample. --with-indels: Allows for indels during primer matching. (default: false) --without-stat | -S : Remove alignment statistics from the produced consensus sequences. (default: false) Controlling the input data # OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. 
But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step. The file format options # --fasta: indicates that sequence data is in fasta format. --fastq: indicates that sequence data is in fastq format. --embl: indicates that sequence data is in EMBL-ENA flatfile format. --csv: indicates that sequence data is in CSV format. --genbank: indicates that sequence data is in GenBank flatfile format. --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format. Controlling the way OBITools4 formats annotations # These options only apply to the FASTA and FASTQ formats. --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format. --input-json-header: FASTA/FASTQ title line annotations follow the JSON format. Controlling quality score decoding # This option only applies to the FASTQ format. --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA) Controlling the output data # --compress | -Z : output is compressed using gzip. (default: false) --no-order: by default, the OBITools preserve the order of the sequences between the input file and the output file. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation. --fasta-output: writes sequence data in fasta format (default if quality data is not available). --fastq-output: writes sequence data in fastq format (default if quality data is available). --json-output: writes sequence data in JSON format. 
--out | -o \u0026lt;FILENAME\u003e: filename used for saving the output (default: \u0026ldquo;-\u0026rdquo;, the standard output) --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON). --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format). --skip-empty: sequences of length equal to zero are removed from the output (default: false). --no-progressbar: deactivates progress bar display (default: false). General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. (default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. 
(default: 6060, env: OBIPPROFGOROUTINE) Examples # obitagpcr --help "},{"id":54,"href":"/obidoc/obitools/obitaxonomy/","title":"obitaxonomy","section":"Taxonomy","content":" obitaxonomy: manage and request a taxonomy database # Description # Synopsis # obitaxonomy [--alternative-names|-a] [--batch-size \u0026lt;int\u0026gt;] [--debug] [--download-ncbi] [--dump|-D \u0026lt;TAXID\u0026gt;] [--extract-taxonomy] [--fixed|-F] [--force-one-cpu] [--help|-h|-?] [--max-cpu \u0026lt;int\u0026gt;] [--no-progressbar] [--out|-o \u0026lt;FILENAME\u0026gt;] [--parents|-p \u0026lt;TAXID\u0026gt;] [--pprof] [--pprof-goroutine \u0026lt;int\u0026gt;] [--pprof-mutex \u0026lt;int\u0026gt;] [--rank \u0026lt;RANK\u0026gt;] [--rank-list|-l] [--raw-taxid] [--restrict-to-taxon|-r \u0026lt;string\u0026gt;]... [--solexa] [--sons|-s \u0026lt;TAXID\u0026gt;] [--taxonomy|-t \u0026lt;string\u0026gt;] [--version] [--with-path] [--with-query|-P] [--without-parent] [--without-rank|-R] [--without-scientific-name|-S] [\u0026lt;args\u0026gt;] Options # obitaxonomy specific options # --alternative-names | -a : Enable the search on all alternative names and not only scientific names. (default: false) --download-ncbi: Download the current NCBI taxonomy taxdump (default: false) --dump | -D \u0026lt;TAXID\u003e: Dump a sub-taxonomy corresponding to the specified clade (default: \u0026ldquo;\u0026rdquo;) --extract-taxonomy: Extract taxonomy from a sequence file (default: false) --fixed: Match taxon names using a fixed pattern, not a regular expression (default: false) --parents | -p \u0026lt;TAXID\u003e: Displays every parental tree\u0026rsquo;s information for the provided taxid. (default: \u0026ldquo;NA\u0026rdquo;) --rank \u0026lt;RANK\u003e: Restrict to the given taxonomic rank. (default: \u0026ldquo;\u0026rdquo;) --rank-list | -l : List every taxonomic rank available in the taxonomy. (default: false) --restrict-to-taxon | -r \u0026lt;string\u003e: Restrict output to some subclades. 
(default: []) --sons | -s \u0026lt;TAXID\u003e: Displays every sons\u0026rsquo; tree\u0026rsquo;s information for the provided taxid. (default: \u0026ldquo;NA\u0026rdquo;) --with-path: Adds a column containing the full path for each displayed taxon. (default: false) --with-query | -P : Adds a column containing the query used to filter taxon names for each displayed taxon. (default: false) --without-parent: Suppress the column containing the parent\u0026rsquo;s taxid from the output. (default: false) --without-rank | -R : Suppress the column containing the taxonomic rank from the output. (default: false) --without-scientific-name | -S : Suppress the column containing the scientific name from the output. (default: false) Taxonomic options # --taxonomy | -t \u0026lt;string\u003e: Path to the taxonomic database. General options # --help | -h|-? : shows this help. --version: prints the version and exits. --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable. Computation related options # --max-cpu \u0026lt;INTEGER\u003e: OBITools can take advantage of your computer\u0026rsquo;s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable. --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false). --batch-size \u0026lt;INTEGER\u003e: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE) Debug related options # --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG) --pprof: enables pprof server. Look at the log for details. 
(default: false). --pprof-mutex \u0026lt;INTEGER\u003e: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX) --pprof-goroutine \u0026lt;INTEGER\u003e: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE) Examples # obitaxonomy --help "},{"id":55,"href":"/obidoc/docs/cookbook/reference_db/","title":"Build a reference database","section":"Cookbook","content":" Build a reference database # One of the crucial steps in the analysis of environmental DNA data is the taxonomic assignment of sequences, i.e. assigning a species, genus or other taxonomic rank to the sequences present in the collected samples.\nTaxonomic assignment requires annotated reference sequences, against which the sequences of interest are compared. These reference sequences form what is known as a reference database, which is a sequence file in fasta format, for a given metabarcoding marker.\nHere is a quick step-by-step guide to creating a reference database, in this case for assigning sequences from wolf fecal samples to study the wolf\u0026rsquo;s diet, a dataset used in the metabarcoding analysis tutorial here.\nOne way to build a reference database is to use the obipcr program to simulate a PCR and extract all sequences from a general purpose DNA database such as GenBank or EMBL that can be amplified in silico by the two primers used for PCR amplification.\nThe steps to create a reference database are:\nDownload sequences from a public database such as GenBank or EMBL Perform an in silico PCR amplification of these sequences with a given marker with obipcr Clean up the database by deleting sequences that do not provide sufficient taxonomic information or are redundant Since GenBank and the taxonomy associated with sequences are constantly evolving, you may not get exactly the same results when using the following commands.\nDownload the sequences # In this example, the sequences are downloaded from the GenBank FTP server. 
Please note that the download takes more than a day and currently occupies around 1.5 TB, so make sure you have the necessary storage capacity before launching it. To have a local copy of GenBank sequences, please go to the Prepare a local copy of GenBank page.\nPerform an in silico PCR amplification # In this example, we amplify the 12S-V5 region [@Riaz2011-gn] with the forward primer TTAGATACCCCACTATGC and the reverse primer TAGAACAGGCTCCTCTAG, with the following command, to study the wolf diet (see the tutorial). Do not forget to update the release number of GenBank in the command line.\nobipcr -e 3 -l 50 -L 150 \\ --forward TTAGATACCCCACTATGC \\ --reverse TAGAACAGGCTCCTCTAG \\ --no-order \\ genbank/Release_264/fasta/* \u0026gt; v05_pcr.fasta The -l and -L options define the minimum and maximum sizes of sequence fragments to be amplified. Three mismatches with primer sequences are allowed here (-e 3), and we recommend using the --no-order option to speed up the program (see obipcr documentation).\nThe previous command produces a fasta file containing the sequences amplified in silico.\nClean the database # We apply the following filtering steps to clean up the sequences obtained with obipcr :\nKeep the sequences with a taxid and a taxonomic description to family, genus and species ranks ( obigrep ) Remove redundant sequences (dereplicate) Ensure that the dereplicated sequences have a taxid (taxon identifier) at the family level Ensure that sequences each have a unique identification ID with obiannotate Index the database Keep annotated sequences # To use the -t taxonomy option of the OBITools commands, you can either give the path to the taxonomy downloaded along with the sequences (see the help page here), which looks like Release_264/taxonomy, or download the taxdump file with curl.\ncurl -O http://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz The obigrep program filters sequences, keeping only those with a taxid and a sufficient taxonomic 
description.\nobigrep -t taxdump.tar.gz \\ -A taxid \\ --require-rank species \\ --require-rank genus \\ --require-rank family \\ v05_pcr.fasta \u0026gt; v05_clean.fasta Dereplicate sequences # The obiuniq program dereplicates the sequences.\nobiuniq -c taxid v05_clean.fasta \u0026gt; v05_clean_uniq.fasta Ensure that the dereplicated sequences have a taxid at the family level # Some sequences lose taxonomic information at the dereplication stage if certain versions of the sequence did not have this information beforehand. So we apply a second filter of this type, writing the result to a new file so that the input is not overwritten while it is being read.\nobigrep -t taxdump.tar.gz --require-rank=family v05_clean_uniq.fasta \u0026gt; v05_clean_uniq_family.fasta Ensure that sequences each have a unique identifier # Index the database # obirefidx -t taxdump.tar.gz v05_clean_uniq_family.fasta \u0026gt; v05_clean_uniq_indexed.fasta The database provided in the tutorial is called wolf_data/db_v05_r117_indexed.fasta.\n"},{"id":56,"href":"/obidoc/docs/cookbook/minion/","title":"Oxford Nanopore data analysis","section":"Cookbook","content":" Nanopore data analysis # MinION\n"}]