CS-BLAST

Using CS-BLAST doubles sensitivity and significantly improves alignment quality without a loss of speed in comparison to BLAST.

CSI-BLAST (Context-Specific Iterated BLAST) is the context-specific analog of PSI-BLAST[5] (Position-Specific Iterated BLAST), which computes the mutation profile with substitution probabilities and mixes it with the query profile.

Homology is the relationship between biological structures or sequences derived from a common ancestor.

Inferring homologous relationships involves calculating scores of aligned pairs minus penalties for gaps.

For two sequences to be inferred as homologous, the sum of scores over all the aligned pairs of amino acids or nucleotides must be sufficiently high [2].
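The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the BLAST implementation: the substitution scores in `TOY_SCORES` are made-up values, and the affine gap model (`gap_open`, `gap_extend`) is an assumption, since the text only says that gaps are penalized.

```python
# Hypothetical toy substitution scores (not BLOSUM62 or any real matrix).
# Keys are residue pairs in alphabetical order.
TOY_SCORES = {
    ("A", "A"): 4, ("K", "K"): 5, ("L", "L"): 4,
    ("A", "K"): -1, ("A", "L"): -1, ("K", "L"): -2,
}


def pair_score(a, b):
    """Look up the symmetric substitution score for two residues."""
    return TOY_SCORES[tuple(sorted((a, b)))]


def alignment_score(row_a, row_b, gap_open=5, gap_extend=1):
    """Sum of aligned-pair scores minus gap penalties for a fixed alignment.

    row_a and row_b are the two rows of a gapped alignment ('-' marks a gap).
    The first position of a gap run costs gap_open, each further one gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    prev_gap = None  # which row the previous column's gap was in, if any
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            score -= gap_extend if prev_gap == side else gap_open
            prev_gap = side
        else:
            score += pair_score(a, b)
            prev_gap = None
    return score


print(alignment_score("AK--L", "AKALL"))  # 4 + 5 - 5 - 1 + 4 = 7
```

A search program would compare such a score against a threshold (or convert it to an E-value) before reporting a match as a candidate homolog.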

Since context information is encoded in transition probabilities between states, mixing mutation probabilities from substitution matrices weighted for corresponding states achieves improved alignment qualities when compared to standard substitution matrices.

The query profile results from adding artificial mutations to the query sequence; when the profile is drawn as a histogram, the bar heights are proportional to the corresponding amino acid probabilities.

Figure (not shown). Caption: “Sequence search/alignment algorithms find the path that maximizes the sum of similarity scores (color-coded blue to red). Substitution matrix scores are equivalent to profile scores if the sequence profile (colored histogram) is generated from the query sequence by adding artificial mutations with the substitution matrix pseudocount scheme. Histogram bar heights represent the fraction of amino acids in profile columns.”

Figure (not shown). CS-BLAST offers improved sensitivity and alignment quality in sequence comparison.

It produces higher quality alignments and generates reliable E-values without a loss of speed.

To predict the mutation probabilities at a given residue, Biegert and Söding compared the sequence window surrounding that residue to a library with thousands of context profiles.

The library is generated by clustering a representative set of sequence profile windows.

The actual predicting of mutation probabilities is achieved by weighted mixing of the central columns of the most similar context profiles.
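The weighted mixing step can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the CS-BLAST code: the three-letter alphabet, the two context profiles, the priors, and the per-position window weights (`POS_WEIGHTS`) are all hypothetical, and the similarity of a window to a profile is assumed to be a prior-weighted product of per-position probabilities.

```python
import math

ALPHABET = "AKL"  # toy alphabet; real proteins use 20 amino acids

# Two hypothetical context profiles for windows of length 3.
# PROFILES[k][j][a] = probability of residue a at window position j in profile k.
PROFILES = [
    [{"A": 0.8, "K": 0.1, "L": 0.1},
     {"A": 0.1, "K": 0.8, "L": 0.1},
     {"A": 0.1, "K": 0.1, "L": 0.8}],
    [{"A": 0.1, "K": 0.1, "L": 0.8},
     {"A": 0.2, "K": 0.2, "L": 0.6},
     {"A": 0.8, "K": 0.1, "L": 0.1}],
]
PRIORS = [0.5, 0.5]            # assumed prior weight of each profile
POS_WEIGHTS = [0.5, 1.0, 0.5]  # assumed weights; the central position counts most


def context_weights(window, profiles, priors, pos_weights):
    """Normalized similarity of a sequence window to each context profile."""
    log_w = []
    for profile, prior in zip(profiles, priors):
        lw = math.log(prior)
        for j, residue in enumerate(window):
            lw += pos_weights[j] * math.log(profile[j][residue])
        log_w.append(lw)
    top = max(log_w)  # subtract the maximum for numerical stability
    unnorm = [math.exp(lw - top) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def mutation_probabilities(window, profiles, priors, pos_weights):
    """Mix the central columns of all profiles, weighted by similarity."""
    center = len(window) // 2
    weights = context_weights(window, profiles, priors, pos_weights)
    return {
        a: sum(w * p[center][a] for w, p in zip(weights, profiles))
        for a in ALPHABET
    }


probs = mutation_probabilities("AKL", PROFILES, PRIORS, POS_WEIGHTS)
print(probs)
```

For the window `AKL`, the first profile matches far better than the second, so the predicted distribution for the central residue is dominated by that profile's central column.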

This makes computation simpler and allows for runtime to be scaled linearly instead of quadratically.

The image illustrates the calculation of expected mutation probabilities for a specific residue at a certain position.

Predicting substitution probabilities from only the amino acid’s local sequence context has the advantage that the structure of the query protein does not need to be known, while still allowing the detection of more homologous proteins than standard substitution matrices [4].

Biegert and Söding’s approach to predicting substitution probabilities was based on a generative model.

In another paper, written in collaboration with Angermüller, they developed a discriminative machine learning method that improves prediction accuracy [2].

With the discriminative model, the goal is to predict a context-specific substitution probability given a query sequence.

The weight of each context state is modeled directly by the exponential of an affine function of the context count profile.

As with the generative model, the target distribution is obtained by mixing the emission probabilities of each context state, weighted by the similarity.
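One way to write this down is given below. The notation is assumed here for illustration and is not taken from the source; in particular, the normalization over context states and the parameter names are assumptions about how "the exponential of an affine function" is turned into mixing weights.

```latex
% Assumed notation (illustrative only):
%   X_i                 sequence window centered on query position i
%   x_{i+j}             residue at offset j within the window
%   k                   index of a context state
%   \lambda_k           bias of state k (hypothetical parameter name)
%   \lambda_k(j, x)     weight of residue x at offset j in state k (hypothetical)
%   p_k(a)              emission probability of amino acid a in state k
\begin{align}
  p(k \mid X_i) &=
    \frac{\exp\bigl(\lambda_k + \sum_{j} \lambda_k(j, x_{i+j})\bigr)}
         {\sum_{k'} \exp\bigl(\lambda_{k'} + \sum_{j} \lambda_{k'}(j, x_{i+j})\bigr)} \\
  P(a \mid X_i) &= \sum_{k} p(k \mid X_i)\, p_k(a)
\end{align}
```

The first line expresses the similarity of the window to each context state; the second line is the mixing step shared with the generative model.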

The MPI Bioinformatics Toolkit is an interactive website and service that allows anyone to do comprehensive and collaborative protein analysis with a variety of different tools, including CS-BLAST as well as PSI-BLAST [1].

The tool takes a protein sequence as input and provides options for customizing the analysis.

[1] Alva, Vikram, Seung-Zin Nam, Johannes Söding, and Andrei N. Lupas. “The MPI Bioinformatics Toolkit as an Integrative Platform for Advanced Protein Sequence and Structure Analysis.” Nucleic Acids Research 44.Web Server Issue (2016): W410-415.

[2] Angermüller, Christof, Andreas Biegert, and Johannes Söding. “Discriminative Modelling of Context-specific Amino Acid Substitution Probabilities.” Bioinformatics 28.24 (2012): 3240-3247.

[3] Altschul, Stephen F., et al. “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs.” Nucleic Acids Research 25.17 (1997): 3389-402.

[4] Biegert, Andreas, and Johannes Söding. “Sequence Context-specific Profiles for Homology Searching.” Proceedings of the National Academy of Sciences 106.10 (2009): 3770-3775.