11  Computations on sequences

11.1 obipairing

Replace the illuminapairedends original OBITools

Alignment procedure

obipairing is introducing a new alignment algorithm compared to the illuminapairedend command of the OBITools V2. Nethertheless this new algorithm has been design to produce the same results than the previous, except in very few cases.

The new algorithm is a two-step procedure. First, a FASTN-type algorithm (Lipman and Pearson 1985) identifies the best offset between the two matched readings. This identifies the region of overlap.

In the second step, the matching regions of the two reads are extracted along with a flanking sequence of Δ\Delta base pairs. The two subsequences are then aligned using a “one side free end-gap” dynamic programming algorithm. This latter step is only called if at least one mismatch is detected by the FASTP step.

Unless the similarity between the two reads at their overlap region is very low, the addition of the flanking regions in the second step of the alignment ensures the same alignment as if the dynamic programming alignment was performed on the full reads.

The scoring system

In the dynamic programming step, the match and mismatch scores take into account the quality scores of the two aligned nucleotides. By taking these into account, the probability of a true match can be calculated for each aligned base pair.

If we consider a nucleotide read with a quality score QQ, the probability of misreading this base (PEP_E) is : PE=10Q10 P_E = 10^{-\frac{Q}{10}}

Thus, when a given nucleotide XX is observed with the quality score QQ. The probability that XX is really an XX is :

P(X=X)=1PE P(X=X) = 1 - P_E

Otherwise, XX is actually one of the three other possible nucleotides (XE1X_{E1}, XE2X_{E2} or XE3X_{E3}). If we suppose that the three reading error have the same probability :

P(X=XE1)=P(X=XE3)=P(X=XE3)=PE3 P(X=X_{E1}) = P(X=X_{E3}) = P(X=X_{E3}) = \frac{P_E}{3}

At each position in an alignment where the two nucleotides X1X_1 and X2X_2 face each other (not a gapped position), the probability of a true match varies depending on whether X1=X2X_1=X_2, an observed match, or X1X2X_1 \neq X_2, an observed mismatch.

Probability of a true match when X1=X2X_1=X_2

That probability can be divided in two parts. First X1X_1 and X2X_2 have been correctly read. The corresponding probability is :

PTM=(1PE1)(1PE2)=(110Q110)(110Q210) \begin{aligned} P_{TM} &= (1- PE_1)(1-PE_2)\\ &=(1 - 10^{-\frac{Q_1}{10} } )(1 - 10^{-\frac{Q_2}{10}} ) \end{aligned}

Secondly, a match can occure if the true nucleotides read as X1X_1 and X2X_2 are not X1X_1 and X2X_2 but identical.

P(X1==XE1)P(X2==XE1)=PE1PE29P(X1==XEx)P(X2==XEx)=PE1PE23 \begin{aligned} P(X_1==X_{E1}) \cap P(X_2==X_{E1}) &= \frac{P_{E1} P_{E2}}{9} \\ P(X_1==X_{Ex}) \cap P(X_2==X_{Ex}) & = \frac{P_{E1} P_{E2}}{3} \end{aligned}

The probability of a true match between X1X_1 and X2X_2 when X1=X2X_1 = X_2 an observed match :

P(MATCH|X1=X2)=(1PE1)(1PE2)+PE1PE23 \begin{aligned} P(MATCH | X_1 = X_2) = (1- PE_1)(1-PE_2) + \frac{P_{E1} P_{E2}}{3} \end{aligned}

Probability of a true match when X1X2X_1 \neq X_2

That probability can be divided in three parts.

  1. X1X_1 has been correctly read and X2X_2 is a sequencing error and is actually equal to X1X_1. Pa=(1PE1)PE23 P_a = (1-P_{E1})\frac{P_{E2}}{3}
  2. X2X_2 has been correctly read and X1X_1 is a sequencing error and is actually equal to X2X_2. Pb=(1PE2)PE13 P_b = (1-P_{E2})\frac{P_{E1}}{3}
  3. X1X_1 and X2X_2 corresponds to sequencing error but are actually the same base XExX_{Ex} Pc=2PE1PE29 P_c = 2\frac{P_{E1} P_{E2}}{9}

Consequently : P(MATCH|X1X2)=(1PE1)PE23+(1PE2)PE13+2PE1PE29 \begin{aligned} P(MATCH | X_1 \neq X_2) = (1-P_{E1})\frac{P_{E2}}{3} + (1-P_{E2})\frac{P_{E1}}{3} + 2\frac{P_{E1} P_{E2}}{9} \end{aligned}

Probability of a match under the random model

The second considered model is a pure random model where every base is equiprobable, hence having a probability of occurrence of a nucleotide equals 0.250.25. Under that hypothesis

P(MATCH|Random model)=0.25 P(MATCH | \text{Random model}) = 0.25

The score is a log ration of likelyhood

Score is define as the logarithm of the ratio between the likelyhood of the observations considering the sequencer error model over tha likelyhood u

Evolution of the match and mismatch scores when the quality of base is 20 while the second range from 10 to 40.

11.2 obimultiplex

Replace the ngsfilter original OBITools

11.3 obicomplement

11.4 obiclean

11.5 obiuniq