Gap penalty

The five main types of gap penalties are constant, linear, affine, convex, and profile-based.
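The first four types can be written as simple functions of the gap length. The sketch below is illustrative only: the parameter names `A`, `B` and `C` and their values are assumptions, and the exact formulas (for example, whether the affine form charges the extension cost for the first position) vary between tools. Profile-based penalties are omitted because they depend on position-specific data rather than on gap length alone.

```python
import math

# Hypothetical parameter values; real tools tune these per application.
A = 10.0  # gap-opening cost (assumed)
B = 1.0   # per-position extension cost (assumed)
C = 2.0   # scale of the logarithmic term (assumed)


def constant_penalty(length):
    """Every gap costs the same, whatever its length."""
    return A if length > 0 else 0.0


def linear_penalty(length):
    """Cost grows in proportion to the gap length."""
    return B * length


def affine_penalty(length):
    """Opening cost for the first position, extension cost for each further one."""
    return A + B * (length - 1) if length > 0 else 0.0


def convex_penalty(length):
    """Each additional position costs less than the one before it."""
    return A + C * math.log(length) if length > 0 else 0.0


if __name__ == "__main__":
    for length in (1, 2, 5, 10):
        print(length,
              constant_penalty(length),
              linear_penalty(length),
              affine_penalty(length),
              round(convex_penalty(length), 2))
```

Printing the values for a few lengths shows the relative shapes: the constant penalty is flat, the linear and affine penalties rise at a fixed rate, and the convex penalty rises ever more slowly.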

Ideally, this alignment technique is most suitable for closely related sequences of similar lengths.

The Needleman-Wunsch algorithm is a dynamic programming technique used to conduct global alignment.
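A minimal sketch of the score computation is given below, assuming a linear gap penalty and a simple match/mismatch scheme. The function name and the default scores are illustrative choices, not values prescribed by the algorithm, and the traceback step that recovers the alignment itself is left out.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of a and b.

    Uses a linear gap penalty: every gap position adds `gap` to the score.
    The scoring values are assumptions chosen for illustration.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per position.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )

    # Global alignment: both sequences must be consumed completely.
    return score[-1][-1]
```

With these assumed scores, aligning `"ACGT"` with `"AGT"` yields three matches and one gap position, for a total of 1.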

Unlike global alignment, it does not penalize end gaps in one or both sequences.
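One common way to leave end gaps unpenalized is sketched below, under the assumption that end gaps are free in both sequences: the first row and column of the dynamic programming table are initialized to zero, and the final score is the best value found in the last row or last column. The function name and scoring values are illustrative assumptions.

```python
def semiglobal_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best alignment score with free end gaps in both sequences.

    Internal gaps still cost `gap` per position (a linear penalty, assumed
    here for simplicity); only leading and trailing gaps are free.
    """
    rows, cols = len(a) + 1, len(b) + 1

    # First row and column stay at 0, so leading end gaps cost nothing.
    score = [[0] * cols for _ in range(rows)]

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trailing end gaps are free: take the best cell on the bottom or right edge.
    best_last_row = max(score[-1])
    best_last_col = max(row[-1] for row in score)
    return max(best_last_row, best_last_col)
```

With these settings a short sequence contained in a longer one, such as `"ACGT"` inside `"TTACGTTT"`, scores as four matches, because the unaligned flanks of the longer sequence are not charged.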

When comparing proteins, one uses a similarity matrix which assigns a score to each possible residue pair.
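As a sketch of how such a matrix is used, the example below scores two equal-length, ungapped protein fragments by summing the scores of their residue pairs. The matrix is a toy table covering four residues, with values invented for illustration; it is not taken from BLOSUM, PAM, or any other published matrix.

```python
# Toy similarity scores for illustration only (NOT from a published matrix).
# Each unordered residue pair is stored once.
TOY_MATRIX = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4, ("W", "W"): 10,
    ("I", "L"): 2, ("A", "L"): -1, ("A", "I"): -1,
    ("A", "W"): -3, ("L", "W"): -2, ("I", "W"): -3,
}


def pair_score(x, y):
    """Look up the score of a residue pair, treating the matrix as symmetric."""
    if (x, y) in TOY_MATRIX:
        return TOY_MATRIX[(x, y)]
    return TOY_MATRIX[(y, x)]


def ungapped_score(s, t):
    """Sum the pair scores of two aligned sequences of equal length."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(s, t))
```

In this toy table, the conservative pair I/L receives a positive score while dissimilar pairs are negative, so `ungapped_score("ALW", "AIW")` adds 4, 2 and 10.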

A single matrix may be reasonably efficient over a relatively broad range of evolutionary change.

[6] BLOSUM matrices with high numbers are designed for comparing closely related sequences, while those with low numbers are designed for comparing distantly related sequences.

Short alignments are more easily detected using a matrix with a higher "relative entropy" than that of BLOSUM-62.

The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries.

[7] Indels can have severe biological consequences by causing mutations in the DNA strand that could result in the inactivation or overactivation of the target protein.

For example, if a one- or two-nucleotide indel occurs in a coding sequence, the result will be a shift in the reading frame, or a frameshift mutation, that may render the protein inactive.
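The effect on the reading frame can be illustrated with a short sketch that splits a coding sequence into codons before and after a deletion. The sequence is made up for the example, and the helper `codons` is a hypothetical name.

```python
def codons(seq):
    """Split a sequence into complete codons, dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


# Made-up coding sequence, read in frame from its first base.
original = "ATGGCCTTAGGA"

# Delete one nucleotide (index 4): every downstream codon changes.
one_deleted = original[:4] + original[5:]

# Delete three nucleotides (one whole codon): the downstream frame is preserved.
three_deleted = original[:3] + original[6:]

print(codons(original))       # reference reading frame
print(codons(one_deleted))    # frameshifted codons after the deletion
print(codons(three_deleted))  # one codon missing, the rest unchanged
```

After the single-nucleotide deletion, all codons downstream of the deletion differ from the original, whereas removing a whole codon leaves the remaining codons intact.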

Aligning two short DNA sequences, with '-' depicting a gap of one base pair.

In general, if the interest is to find closely related matches (e.g. removal of vector sequence during genome sequencing), a higher gap penalty should be used to reduce gap openings.

On the other hand, the gap penalty should be lowered when the aim is to find a more distant match.

and was proposed as studies had shown that the distribution of indel sizes obeys a power law.

[13] Rather than using substitution matrices to measure the similarity of amino acid pairs, profile–profile alignment methods require a profile-based scoring function to measure the similarity of profile vector pairs.

The gap information is usually used in the form of indel frequency profiles, which are more specific to the sequences being aligned.

ClustalW and MAFFT adopted this kind of gap penalty determination for their multiple sequence alignments.

[13] Alignment accuracies can be improved using this model, especially for proteins with low sequence identity.

When working with popular algorithms, there seems to be little theoretical basis for the form of the gap penalty functions.

[14] Consequently, for any alignment situation, gap placement must be empirically determined.

[14] Some algorithms use predicted or actual structural information to bias the placement of gaps.

Example of Protein Sequence Alignment
BLOSUM-62 Matrix
This graph shows the difference between types of gap penalties. The exact numbers will change for different applications but this shows the relative shape of each function.