[1] They scanned the BLOCKS database for highly conserved regions of protein families (regions without gaps in the sequence alignment) and then counted the relative frequencies of amino acids and their substitution probabilities.
Then, they calculated a log-odds score for each of the 210 possible substitution pairs of the 20 standard amino acids.
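The figure of 210 follows from counting unordered pairs of the 20 amino acids, with identities included: 20 × 21 / 2 = 210. A short Python check, in which the ordering of the one-letter codes is an arbitrary choice made for illustration:

```python
from itertools import combinations_with_replacement

# One-letter codes of the 20 standard amino acids (the order is arbitrary here).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Unordered pairs, including identities such as ("A", "A").
pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
identities = [p for p in pairs if p[0] == p[1]]

print(len(pairs), len(identities))  # 210 20
```

Of these 210 entries, 20 are identities and 190 are substitutions between two different amino acids.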
The genetic instructions of every replicating cell in a living organism are contained within its DNA.
At the molecular level, there are regulatory systems that correct most — but not all — of these changes to the DNA before it is replicated.
[6] Conversely, the change may allow the cell to continue functioning albeit differently, and the mutation can be passed on to the organism's offspring.
If this change does not result in any significant physical disadvantage to the offspring, the possibility exists that this mutation will persist within the population.
The 20 amino acids translated by the genetic code vary greatly in the physical and chemical properties of their side chains.
This helps researchers better understand the origin and function of genes through the nature of homology and conservation.
Substitution matrices are used in algorithms to calculate the similarity of different protein sequences; however, the utility of the Dayhoff PAM matrix has decreased over time because it requires sequences with a similarity greater than 85%.
To fill this gap, Henikoff and Henikoff introduced the BLOSUM (BLOcks SUbstitution Matrix) matrix, which led to marked improvements in alignments and in searches using queries from each of the groups of related proteins.
[1] Several sets of BLOSUM matrices exist using different alignment databases, named with numbers.
The percentage used was appended to the name, giving, for example, BLOSUM80, in which sequences that were more than 80% identical were clustered.
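The clustering step can be sketched in Python as follows. This is a minimal illustration, not the original implementation: the function names are hypothetical, the segments are assumed to be ungapped and of equal length, clusters are assumed to form transitively (single linkage), and treating the threshold as inclusive is also an assumption.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two equal-length, ungapped segments."""
    if len(a) != len(b):
        raise ValueError("block segments must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_block(segments: list[str], threshold: float) -> list[list[str]]:
    """Group segments so that any pair at or above the threshold ends up in one cluster."""
    parent = list(range(len(segments)))  # union-find forest over segment indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if percent_identity(segments[i], segments[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters: dict[int, list[str]] = {}
    for i, seg in enumerate(segments):
        clusters.setdefault(find(i), []).append(seg)
    return list(clusters.values())


# Toy block: the first two segments are 90% identical, the third is unrelated.
toy_block = ["AAAAAAAAAA", "AAAAAAAAAC", "WWWWWWWWWW"]
print(cluster_block(toy_block, threshold=80))
```

With a threshold of 80, the first two segments fall into one cluster and the third stays on its own; raising the threshold above 90 would separate all three.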
The BLOCKS database stores the sequence alignments of the most conserved regions of protein families.
Using each block, the pairs of amino acids in each column of the multiple alignment are counted.
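Counting pairs column by column can be sketched as below. The function name and the toy block are hypothetical, and the sketch counts every sequence equally, so it leaves out any weighting of clustered sequences.

```python
from collections import Counter
from itertools import combinations


def count_column_pairs(block: list[str]) -> Counter:
    """Count unordered amino acid pairs over every column of an ungapped block."""
    counts: Counter = Counter()
    for column in zip(*block):  # one tuple of residues per alignment column
        for a, b in combinations(column, 2):  # every pair of sequences in the column
            counts[tuple(sorted((a, b)))] += 1  # ("L", "I") and ("I", "L") are the same pair
    return counts


# Toy block of three aligned segments, three columns wide.
block = ["AIK", "ALK", "AIR"]
pair_counts = count_column_pairs(block)
print(pair_counts)
```

A column with n sequences contributes n(n − 1)/2 pairs, so this block of three sequences and three columns yields nine pairs in total.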
Both are based on taking sets of high-confidence alignments of many homologous proteins and assessing the frequencies of all substitutions, but they are computed using different methods.
[7] Scores within a BLOSUM are log-odds scores that measure, in an alignment, the logarithm of the ratio of the likelihood of two amino acids appearing together for a biological reason to the likelihood of the same amino acids appearing together by chance.
The matrices are based on the minimum percentage identity of the aligned protein sequences used in calculating them.
[12] Every possible identity or substitution is assigned a score based on its observed frequencies in the alignment of related proteins.
The log-odds values are scaled by a factor chosen such that the matrix contains easily computable integer values.
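In symbols, a log-odds score of this kind can be written as follows. The notation is an assumption made here for illustration: p_ij denotes the probability of amino acids i and j being aligned in homologous sequences, q_i and q_j denote the background probabilities of finding each amino acid in any protein sequence, and λ denotes the scaling factor.

```latex
S_{ij} = \frac{1}{\lambda} \, \log\!\left( \frac{p_{ij}}{q_i \, q_j} \right)
```

A positive score means the pair is observed more often than chance would predict, a negative score means it is observed less often, and the result is rounded to the nearest integer.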
[14] The BLOSUM62 matrix with the amino acids in the table grouped according to the chemistry of the side chain, as in (a).
Each value in the matrix is calculated by dividing the frequency of occurrence of the amino acid pair in the BLOCKS database, clustered at the 62% level, by the probability that the same two amino acids might align by chance.
The ratio is then converted to a logarithm and expressed as a log odds score, as for PAM.
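The calculation can be sketched in Python starting from unordered pair counts. The function name and the toy counts are hypothetical, and the choice of base-2 logarithms multiplied by 2 (half-bit units) before rounding is an assumption of this sketch.

```python
import math


def log_odds_scores(pair_counts: dict, scale: float = 2.0) -> dict:
    """Turn unordered pair counts into rounded, scaled log-odds scores (log base 2)."""
    total = sum(pair_counts.values())
    # Observed frequency of each unordered pair.
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each amino acid, derived from the pair frequencies:
    # an identity pair contributes fully, a mixed pair contributes half to each member.
    background: dict = {}
    for (a, b), freq in observed.items():
        if a == b:
            background[a] = background.get(a, 0.0) + freq
        else:
            background[a] = background.get(a, 0.0) + freq / 2
            background[b] = background.get(b, 0.0) + freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        # Chance frequency of the pair; mixed pairs can arise in two orders.
        expected = background[a] * background[b] * (1 if a == b else 2)
        scores[(a, b)] = round(scale * math.log2(freq / expected))
    return scores


# Toy counts, not taken from any real database.
toy_counts = {("A", "A"): 6, ("A", "S"): 2, ("S", "S"): 2}
print(log_odds_scores(toy_counts))  # {('A', 'A'): 1, ('A', 'S'): -2, ('S', 'S'): 2}
```

In this toy example the A–S substitution is seen less often than chance predicts (0.2 observed against 0.42 expected) and so receives a negative score, while both identities score positively.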
BLOSUM scores were used to predict and understand the surface gene variants among hepatitis B virus carriers[15] and T-cell epitopes.
This form of scoring system is utilized by a wide range of alignment software including BLAST.
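As a rough illustration of how alignment software applies such a matrix, the sketch below sums the substitution scores over an ungapped pairwise alignment. The function names are hypothetical, only a handful of BLOSUM62 entries are included, and real tools such as BLAST add gap penalties and search heuristics on top of this.

```python
# Small excerpt of BLOSUM62 values; a full matrix has 210 entries.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4,
    ("W", "W"): 11,
    ("I", "L"): 2,
    ("D", "E"): 2,
    ("K", "R"): 2,
}


def pair_score(a: str, b: str, matrix: dict) -> int:
    """Look up a symmetric score, whichever order the pair is stored in."""
    key = (a, b) if (a, b) in matrix else (b, a)
    return matrix[key]


def ungapped_score(seq1: str, seq2: str, matrix: dict) -> int:
    """Sum the substitution scores over each aligned position."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(ungapped_score("AWIDK", "AWLER", BLOSUM62_EXCERPT))  # 21
```

The two identities contribute 4 and 11, and the three conservative substitutions contribute 2 each, giving a total of 21.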
[18] There are several software packages in different programming languages that allow easy use of BLOSUM matrices.