Homology modeling

However, the errors are significantly higher in the loop regions, where the amino acid sequences of the target and template proteins may be completely different.

The method of homology modeling is based on the observation that protein tertiary structure is better conserved than amino acid sequence.

The simplest method of template identification relies on serial pairwise sequence alignments aided by database search techniques such as FASTA and BLAST.
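As a rough illustration of template ranking by pairwise alignment, the sketch below scores a query against each candidate with a Needleman-Wunsch-style global alignment and sorts the candidates by score. It is not the heuristic used by FASTA or BLAST, which rely on word seeding and substitution matrices. The function names and the match, mismatch, and gap values are assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not a real
    substitution matrix such as BLOSUM62.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def rank_templates(query, candidates):
    """Return candidate names sorted from best to worst alignment score.

    `candidates` is a hypothetical mapping of template name -> sequence.
    """
    return sorted(
        candidates,
        key=lambda name: global_alignment_score(query, candidates[name]),
        reverse=True,
    )
```

Exhaustive dynamic programming of this kind is too slow for searching large databases, which is why heuristic tools are used in practice.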

More sensitive methods based on multiple sequence alignment – of which PSI-BLAST is the most common example – iteratively update their position-specific scoring matrix to successively identify more distantly related homologs.
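The idea of a position-specific scoring matrix can be sketched as follows. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes a uniform background frequency, a fixed pseudocount, and no sequence weighting, and the function names are hypothetical. In an iterative search, sequences scoring well against the matrix would be added to the alignment and the matrix rebuilt.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Build per-column log-odds scores from gapped, equal-length sequences.

    Assumptions: uniform background frequencies and a flat pseudocount.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned_seqs[0])):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(
            seq[col] for seq in aligned_seqs if seq[col] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score_sequence(pssm, seq):
    """Sum the column scores for an ungapped sequence of matching length."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, seq))
```

Because the scores depend on position, a residue that is conserved in one column is rewarded there but not elsewhere, which is what lets profile methods detect more distant homologs than a single pairwise comparison.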

Protein threading,[10] also known as fold recognition or 3D-1D alignment, can also be used as a search technique for identifying templates to be used in traditional homology modeling methods.

Other factors may tip the balance in marginal cases; for example, the template may have a function similar to that of the query sequence, or it may belong to a homologous operon.

Perhaps most important is the coverage of the aligned regions: the fraction of the query sequence structure that can be predicted from the template, and the plausibility of the resulting model.
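Coverage and sequence identity can both be read directly from a pairwise alignment. The sketch below assumes two gapped strings of equal length with "-" as the gap character; the function name is hypothetical. It defines identity over aligned (non-gap) columns only, which is one of several conventions in use.

```python
def alignment_stats(query_aln, template_aln, gap="-"):
    """Return (identity, coverage) for a gapped pairwise alignment.

    coverage = aligned query residues / total query residues
    identity = identical pairs / aligned pairs (one common convention)
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    query_len = sum(1 for q in query_aln if q != gap)
    aligned = 0
    identical = 0
    for q, t in zip(query_aln, template_aln):
        if q != gap and t != gap:
            aligned += 1
            if q == t:
                identical += 1
    coverage = aligned / query_len if query_len else 0.0
    identity = identical / aligned if aligned else 0.0
    return identity, coverage
```

For the alignment `ACDEFG-H` against `AC-EYGKH`, six of the seven query residues are aligned and five of those six pairs are identical.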

It is possible to use the sequence alignment generated by the database search technique as the basis for the subsequent model production; however, more sophisticated approaches have also been explored.

[13] Given a template and an alignment, the information contained therein must be used to generate a three-dimensional structural model of the target, represented as a set of Cartesian coordinates for each atom in the protein.

The segment-matching method divides the target into a series of short segments, each of which is matched to its own template fitted from the Protein Data Bank.

[18] The most common current homology modeling method takes its inspiration from calculations required to construct a three-dimensional structure from data generated by NMR spectroscopy.

Local scores are important in the context of modeling because they can give an estimate of the reliability of different regions of a predicted structure.

This information can be used in turn to determine which regions should be refined, which should be considered for modeling by multiple templates, and which should be predicted ab initio.
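One simple way to turn per-residue scores into candidate regions for refinement is to smooth the scores over a window and report contiguous stretches that fall below a cutoff. The sketch below is illustrative only: it assumes that higher scores mean higher reliability (some assessment programs report energies, where lower is better), and the function name, window size, and threshold are hypothetical.

```python
def low_confidence_regions(scores, threshold, window=5):
    """Return inclusive, 0-based (start, end) ranges of low-scoring residues.

    Scores are averaged over a centered window (truncated at the chain
    ends) before being compared with the threshold.
    """
    n = len(scores)
    half = window // 2
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))

    regions = []
    start = None
    for i, value in enumerate(smoothed):
        if value < threshold:
            if start is None:
                start = i  # open a new low-confidence region
        elif start is not None:
            regions.append((start, i - 1))  # close the current region
            start = None
    if start is not None:
        regions.append((start, n - 1))
    return regions
```

The returned ranges could then be passed to a loop-modeling or refinement step, while the remaining residues are kept as built from the template.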

ProsaII (Sippl 1993), which is based on a combination of a pairwise statistical potential and a solvation term, is also applied extensively in model evaluation.

The data presented in Wallner and Elofsson's study suggest that their machine-learning approach based on structural features is indeed superior to statistics-based methods.

Several large-scale benchmarking efforts have been made to assess the relative quality of various current homology modeling methods.

LiveBench and EVA run continuously to assess participating servers' performance in prediction of imminently released structures from the PDB.

[14] This low-identity region is often referred to as the "twilight zone" within which homology modeling is extremely difficult, and to which it is possibly less suited than fold recognition methods.

[27] The two most common and large-scale sources of error in homology modeling are poor template selection and inaccuracies in target-template sequence alignment.

The PDBREPORT database lists several million, mostly very small but occasionally dramatic, errors in experimental (template) structures that have been deposited in the PDB.

Serious local errors can arise in homology models where an insertion or deletion mutation, or a gap in a solved structure, results in a region of target sequence for which there is no corresponding template.

Although some guidance is provided even with a single template by the positioning of the ends of the missing region, the longer the gap, the more difficult it is to model.

[3] Larger regions are often modeled individually using ab initio structure prediction techniques, although this approach has met with only isolated success.

[29] The rotameric states of side chains and their internal packing arrangement also present difficulties in homology modeling, even in targets for which the backbone structure is relatively easy to predict.

[30] One method of addressing this problem requires searching a rotameric library to identify locally low-energy combinations of packing states.
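A rotamer library search can be sketched as a combinatorial minimization: each residue has a small set of candidate rotamers, and the goal is a combination with low total energy. The example below is a toy illustration with an exhaustive search, which is feasible only for a handful of residues because the number of combinations grows exponentially. The energy model (a per-rotamer term plus a pairwise term) and all names are assumptions made for the example; practical programs use pruning or stochastic search instead of enumeration.

```python
from itertools import product


def best_rotamer_combination(self_energy, pair_energy):
    """Exhaustively find the lowest-energy rotamer assignment.

    self_energy[i][r]   : energy of rotamer r at residue position i
    pair_energy(i,r,j,s): interaction energy between rotamer r at
                          position i and rotamer s at position j
    Returns (tuple of chosen rotamer indices, total energy).
    """
    n = len(self_energy)
    best_combo, best_energy = None, float("inf")
    # Enumerate every combination of one rotamer index per position.
    for combo in product(*(range(len(e)) for e in self_energy)):
        energy = sum(self_energy[i][combo[i]] for i in range(n))
        energy += sum(
            pair_energy(i, combo[i], j, combo[j])
            for i in range(n)
            for j in range(i + 1, n)
        )
        if energy < best_energy:
            best_combo, best_energy = combo, energy
    return best_combo, best_energy
```

In the toy case where two residues each prefer rotamer 0 in isolation but clash when both adopt it, the search selects the combination that avoids the clash at the smallest cost in individual energy.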

[31] It has been suggested that a major reason that homology modeling is so difficult when target-template sequence identity lies below 30% is that such proteins have broadly similar folds but widely divergent side chain packing arrangements.