Structural alignment

For this reason it is common for structural alignment methods to use by default only the backbone atoms included in the peptide bond.

For simplicity and efficiency, often only the alpha carbon positions are considered, since the peptide bond has a minimally variant planar conformation.

This method traditionally uses a simple least-squares fitting algorithm, in which the optimal rotations and translations are found by minimizing the sum of the squared distances among all structures in the superposition.

[7] More recently, maximum likelihood and Bayesian methods have greatly increased the accuracy of the estimated rotations, translations, and covariance matrices for the superposition.

[8][9] Algorithms based on multidimensional rotations and modified quaternions have been developed to identify topological relationships between protein structures without the need for a predetermined alignment.

[10] The SuperPose Archived 2015-10-31 at the Wayback Machine method is sufficiently extensible to correct for relative domain rotations and other structural pitfalls.

[1][2][3] A subtle but important distinction from maximal structural superposition is the conversion of an alignment to a meaningful similarity score.

Critically, even if a method's estimated E-value is precisely correct on average, if it lacks a low standard deviation on its estimated value generation process, then the rank ordering of the relative similarities of a query protein to a comparison set will rarely agree with the "true" ordering.

[14] These measures can be rigorously optimized using an algorithm capable of maximizing the number of atoms in two proteins that can be superimposed under a predefined distance cutoff.

[15] Unfortunately, the algorithm for optimal solution is not practical, since its running time depends not only on the lengths but also on the intrinsic geometry of input proteins.

This is typically achieved by constructing a sequence-to-sequence matrix or series of matrices that encompass comparative metrics: rather than absolute distances relative to a fixed coordinate space.

DALI's actual alignment process requires a similarity search after the two proteins' distance matrices are built; this is normally conducted via a series of overlapping submatrices of size 6x6.

Submatrix matches are then reassembled into a final alignment via a standard score-maximization algorithm — the original version of DALI used a Monte Carlo simulation to maximize a structural similarity score that is a function of the distances between putative corresponding atoms.

In particular, more distant atoms within corresponding features are exponentially downweighted to reduce the effects of noise introduced by loop mobility, helix torsions, and other minor structural variations.

[20] Because DALI relies on an all-to-all distance matrix, it can account for the possibility that structurally aligned features might appear in different orders within the two sequences being compared.

The combinatorial extension (CE) method is similar to DALI in that it too breaks each structure in the query set into a series of fragments that it then attempts to reassemble into a complete alignment.

Only AFPs that meet given criteria for local similarity are included in the matrix as a means of reducing the necessary search space and thereby increasing efficiency.

Extensions then proceed with the next AFP that meets given distance criteria restricting the alignment to low gap sizes.

[22] Like DALI and SSAP, CE has been used to construct an all-to-all fold classification database Archived 1998-12-03 at the Wayback Machine from the known protein structures in the PDB.

It converts the statistics of the number of residues with high-confidence matches and the size of the protein to compute an Expectation value for the outcome by chance.

[24] It is often used in conjunction with these slower tools to pre-screen large data bases to extract the just the best E-value related structures for more exhaustive superposition or expensive calculations.

In this twilight remote homology regime, Mammoth's e-values for the CASP[1] protein structure prediction evaluation have been shown to be significantly more correlated with human ranking than SSAP or DALI.

[12] Mammoths ability to extract the multi-criteria partial overlaps with proteins of known structure and rank these with proper E-values, combined with its speed facilitates scanning vast numbers of decoy models against the PDB data base for identifying the most likely correct decoys based on their remote homology to known proteins.

SSAP works by first constructing a series of inter-residue distance vectors between each residue and its nearest non-contiguous neighbors on each protein.

The MultiBind and MAPPIS servers [32][33] allow the identification of common spatial arrangements of physicochemical properties such as H-bond donor, acceptor, aliphatic, aromatic or hydrophobic in a set of user provided protein binding sites defined by interactions with small molecules (MultiBind) or in a set of user-provided protein–protein interfaces (MAPPIS).

[37] However, as algorithmic improvements and computer performance have erased purely technical deficiencies in older approaches, it has become clear that there is no one universal criterion for the 'optimal' structural alignment.

[44] Choosing a software tool for structural alignment can be a challenge due to the large variety of available packages that differ significantly in methodology and reliability.

Structural alignment of thioredoxins from humans and the fly Drosophila melanogaster . The proteins are shown as ribbons, with the human protein in red, and the fly protein in yellow. Generated from PDB 3TRX and 1XWC .
Illustration of the atom-to-atom vectors calculated in SSAP. From these vectors a series of vector differences, e.g., between (FA) in Protein 1 and (SI) in Protein 2 would be constructed. The two sequences are plotted on the two dimensions of a matrix to form a difference matrix between the two proteins. Dynamic programming is applied to all possible difference matrices to construct a series of optimal local alignment paths that are then summed to form the summary matrix, on which a second round of dynamic programming is performed.