Probalign

Probalign is a sequence alignment tool that calculates a maximum expected accuracy alignment using partition function posterior probabilities.

[1] Base pair probabilities are estimated using an estimate similar to Boltzmann distribution.

The partition function is calculated using a dynamic programming approach.

The following describes the algorithm used by probalign to determine the base pair probabilities.

[2] To score an alignment of two sequences two things are needed: The score

{\displaystyle S(a)}

of an alignment a is defined as:

∑

σ (

gap cost

{\displaystyle S(a)=\sum _{x_{i}-y_{j}\in a}\sigma (x_{i},y_{j})+{\text{gap cost}}}

Now the boltzmann weighted score of an alignment a is:

σ (

gap cost

σ (

{\displaystyle e^{\frac {S(a)}{T}}=e^{\frac {\sum _{x_{i}-y_{j}\in a}\sigma (x_{i},y_{j})+{\text{gap cost}}}{T}}=\left(\prod _{x_{i}-y_{i}\in a}e^{\frac {\sigma (x_{i},y_{j})}{T}}\right)\cdot e^{\frac {gapcost}{T}}}

is a scaling factor.

The probability of an alignment assuming boltzmann distribution is given by

{\displaystyle Pr[a|x,y]={\frac {e^{\frac {S(a)}{T}}}{Z}}}

is the partition function, i.e. the sum of the boltzmann weights of all alignments.

denote the partition function of the prefixes

Three different cases are considered: Then we have:

The matrixes are initialized as follows: The partition function for the alignments of two sequences

, which can be recursively computed: Finally the probability that positions

form a base pair is given by:

σ (

are the respective values for the recalculated

with inversed base pair strings.