Understanding protein–protein interactions is important for investigating intracellular signaling pathways, modelling protein complex structures, and gaining insights into various biochemical processes.
Experimentally, physical interactions between pairs of proteins can be inferred using a variety of techniques, including yeast two-hybrid systems, protein-fragment complementation assays (PCA), affinity purification/mass spectrometry, protein microarrays, fluorescence resonance energy transfer (FRET), and microscale thermophoresis (MST).
In addition, a number of bound protein complexes have been structurally solved and can be used to identify the residues that mediate the interaction so that similar motifs can be located in other organisms.
The phylogenetic profile method is based on the hypothesis that if two or more proteins are concurrently present or absent across several genomes, then they are likely functionally related.
[5] Figure A illustrates a hypothetical situation in which proteins A and B are identified as functionally linked due to their identical phylogenetic profiles across 5 different genomes.
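The comparison of phylogenetic profiles can be sketched in a few lines of Python. The snippet below is only an illustration under stated assumptions: the presence/absence values are invented toy data (they are not taken from Figure A), and the names `profiles`, `hamming_distance`, and `linked_pairs` are hypothetical. Each protein is represented as a tuple of 1s and 0s, one entry per genome, and two proteins are reported as candidates for a functional link when their profiles disagree in at most `max_mismatches` genomes.

```python
from itertools import combinations

# Hypothetical presence (1) / absence (0) profiles across five genomes.
# These values are toy data for illustration only.
profiles = {
    "A": (1, 0, 1, 1, 0),
    "B": (1, 0, 1, 1, 0),
    "C": (1, 1, 0, 1, 0),
    "D": (0, 1, 1, 0, 1),
}


def hamming_distance(p, q):
    """Number of genomes in which the two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


def linked_pairs(profiles, max_mismatches=0):
    """Protein pairs whose profiles differ in at most max_mismatches genomes."""
    return [
        (x, y)
        for x, y in combinations(sorted(profiles), 2)
        if hamming_distance(profiles[x], profiles[y]) <= max_mismatches
    ]


print(linked_pairs(profiles))  # [('A', 'B')]
```

With `max_mismatches=0` only identical profiles are paired, which mirrors the idealized case described above; a real analysis would probably tolerate some mismatches and assess statistical significance, which this sketch does not attempt.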
However, comparisons between phylogenetic trees are difficult, and current methods circumvent this by simply comparing distance matrices[4].
The distance matrices of the two proteins are used to calculate a correlation coefficient, with a larger value indicating stronger evidence of co-evolution.
The downside is that distance matrices are not perfect representations of phylogenetic trees, and inaccuracies may result from using such a shortcut.
[4] Another notable factor is that the phylogenetic trees of any two proteins share background similarities, even when the proteins do not interact.
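The distance-matrix comparison described above can be illustrated with a short sketch. It assumes that both matrices are symmetric, cover the same species in the same order, and are compared with a Pearson correlation over their upper triangles; the matrices and the names `upper_triangle`, `pearson`, and `matrix_correlation` are hypothetical, and no correction for the background similarity between trees is applied.

```python
from math import sqrt


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def matrix_correlation(d1, d2):
    """Correlate two distance matrices built over the same ordered species."""
    return pearson(upper_triangle(d1), upper_triangle(d2))


# Toy pairwise distances between orthologs in four species (invented values).
dist_a = [
    [0.0, 0.1, 0.4, 0.5],
    [0.1, 0.0, 0.45, 0.55],
    [0.4, 0.45, 0.0, 0.2],
    [0.5, 0.55, 0.2, 0.0],
]
# dist_b evolves "in step" with dist_a (every distance doubled).
dist_b = [[2 * value for value in row] for row in dist_a]
# dist_c follows an unrelated pattern.
dist_c = [
    [0.0, 0.5, 0.2, 0.3],
    [0.5, 0.0, 0.4, 0.1],
    [0.2, 0.4, 0.0, 0.6],
    [0.3, 0.1, 0.6, 0.0],
]

print(matrix_correlation(dist_a, dist_b))
print(matrix_correlation(dist_a, dist_c))
```

In this toy case the first pair of matrices is perfectly correlated while the second is not. Because real trees share background similarity, a high coefficient on its own should be interpreted with caution.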
Figure B depicts the BLAST sequence alignment of succinyl-CoA transferase with its two separate homologs in E. coli.
[3] The conserved neighborhood method is based on the hypothesis that if genes encoding two proteins are neighbors on a chromosome in many genomes, then they are likely functionally related.
The method is based on an observation by Bork et al. of gene pair conservation across nine bacterial and archaeal genomes.
[8] For instance, the trpA and trpB genes in Escherichia coli encode the two subunits of the tryptophan synthase enzyme, which are known to interact to catalyze a single reaction.
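A minimal sketch of the conserved neighborhood idea follows. It assumes that each genome is given as an ordered list of orthologous gene names on a single linear chromosome, ignoring strand and circularity. The gene orders are invented toy data, `geneX`, `geneY`, and `geneZ` are placeholders, and the function names are hypothetical. The code counts, for every pair of genes, the number of genomes in which they are immediate neighbors.

```python
from collections import Counter

# Invented gene orders for three hypothetical genomes (toy data only).
genomes = {
    "genome_1": ["geneX", "trpB", "trpA", "geneY", "geneZ"],
    "genome_2": ["trpA", "trpB", "geneZ", "geneX", "geneY"],
    "genome_3": ["geneY", "geneZ", "trpB", "trpA", "geneX"],
}


def adjacent_pairs(gene_order):
    """Unordered pairs of genes that sit next to each other."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}


def neighbor_counts(genomes):
    """Count in how many genomes each gene pair is adjacent."""
    counts = Counter()
    for gene_order in genomes.values():
        counts.update(adjacent_pairs(gene_order))
    return counts


def conserved_neighbors(genomes, min_genomes):
    """Gene pairs that are adjacent in at least min_genomes genomes."""
    return sorted(
        tuple(sorted(pair))
        for pair, n in neighbor_counts(genomes).items()
        if n >= min_genomes
    )


print(conserved_neighbors(genomes, min_genomes=3))  # [('trpA', 'trpB')]
```

Requiring adjacency in all three toy genomes leaves only the trpA/trpB pair. A practical implementation would also need ortholog assignment and some allowance for short intervening distances, neither of which is shown here.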
RFD produces results based on the domain composition of interacting and non-interacting protein pairs.
On the other hand, the ability of these methods to make predictions is constrained by the limited number of known protein complex structures.
[18] This library is then used to identify potential interactions between pairs of targets, provided that they have known structures (i.e., are present in the PDB).
The probabilities required in the formula are calculated using an Expectation Maximization procedure, which is a method for estimating parameters in statistical models.
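To give a sense of how an Expectation Maximization procedure can estimate such probabilities, the sketch below uses a deliberately simplified model that is an assumption of this example and not necessarily the formulation referred to above: a protein pair is taken to interact if at least one of its domain pairs interacts (a noisy-OR model), and experimental false positives and false negatives are ignored. The function name, the data layout, and the toy observations are all hypothetical.

```python
def em_domain_pair_probabilities(observations, n_iterations=50, initial=0.5):
    """Estimate domain-pair interaction probabilities with a simple EM loop.

    observations: list of (domain_pairs, interacts) tuples, where domain_pairs
    is the set of domain pairs found in one protein pair and interacts is a
    bool stating whether that protein pair was observed to interact.
    Assumed model (simplified): a protein pair interacts if and only if at
    least one of its domain pairs interacts.
    """
    all_pairs = {dp for dps, _ in observations for dp in dps}
    prob = {dp: initial for dp in all_pairs}
    for _ in range(n_iterations):
        expected = {dp: 0.0 for dp in all_pairs}
        occurrences = {dp: 0 for dp in all_pairs}
        for dps, interacts in observations:
            # Probability that at least one domain pair interacts.
            p_none = 1.0
            for dp in dps:
                p_none *= 1.0 - prob[dp]
            p_interact = 1.0 - p_none
            for dp in dps:
                occurrences[dp] += 1
                # E-step: posterior that this domain pair interacts here.
                # It is zero when the protein pair does not interact.
                if interacts and p_interact > 0.0:
                    expected[dp] += prob[dp] / p_interact
        # M-step: average posterior over all occurrences of the domain pair.
        prob = {dp: expected[dp] / occurrences[dp] for dp in all_pairs}
    return prob


# Toy observations (invented): domain pairs per protein pair, and the outcome.
observations = [
    ({("d1", "d2")}, True),
    ({("d1", "d2"), ("d3", "d4")}, True),
    ({("d3", "d4")}, False),
    ({("d3", "d4"), ("d5", "d6")}, False),
]

estimates = em_domain_pair_probabilities(observations)
for domain_pair, p in sorted(estimates.items()):
    print(domain_pair, round(p, 3))
```

In this toy data set the estimate for the (d1, d2) pair rises towards 1, because it alone can explain both observed interactions, while the other domain pairs fall towards 0. Published approaches typically add error rates for the experimental data, which this sketch omits.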
This is a useful mode of inquiry when both proteins in the pair have known structures and are known (or at least strongly suspected) to interact. However, because so many proteins lack experimentally determined structures, sequence-based interaction prediction methods are especially useful in conjunction with experimental studies of an organism's interactome.