Probalign is a sequence alignment tool that calculates a maximum expected accuracy alignment using partition function posterior probabilities.
[1] Base pair probabilities are estimated using an estimate similar to Boltzmann distribution.
The partition function is calculated using a dynamic programming approach.
The following describes the algorithm used by probalign to determine the base pair probabilities.
[2] To score an alignment of two sequences two things are needed: The score
{\displaystyle S(a)}
of an alignment a is defined as:
∑
i
σ (
i
gap cost
{\displaystyle S(a)=\sum _{x_{i}-y_{j}\in a}\sigma (x_{i},y_{j})+{\text{gap cost}}}
Now the boltzmann weighted score of an alignment a is:
σ (
gap cost
σ (
{\displaystyle e^{\frac {S(a)}{T}}=e^{\frac {\sum _{x_{i}-y_{j}\in a}\sigma (x_{i},y_{j})+{\text{gap cost}}}{T}}=\left(\prod _{x_{i}-y_{i}\in a}e^{\frac {\sigma (x_{i},y_{j})}{T}}\right)\cdot e^{\frac {gapcost}{T}}}
is a scaling factor.
The probability of an alignment assuming boltzmann distribution is given by
{\displaystyle Pr[a|x,y]={\frac {e^{\frac {S(a)}{T}}}{Z}}}
is the partition function, i.e. the sum of the boltzmann weights of all alignments.
denote the partition function of the prefixes
Three different cases are considered: Then we have:
The matrixes are initialized as follows: The partition function for the alignments of two sequences
, which can be recursively computed: Finally the probability that positions
form a base pair is given by:
σ (
are the respective values for the recalculated
with inversed base pair strings.