Position weight matrix

Position weight matrices (PWMs) are often derived from a set of aligned sequences that are thought to be functionally related, and they have become an important part of many software tools for computational motif discovery.

The first step in constructing a PWM is to create a basic position frequency matrix (PFM) by counting the occurrences of each nucleotide at each position.
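The counting step can be sketched in a few lines of Python. The function name `position_frequency_matrix`, the dictionary-of-lists layout, and the four-sequence toy alignment are illustrative assumptions, not taken from any particular tool.

```python
ALPHABET = "ACGT"

def position_frequency_matrix(sequences, alphabet=ALPHABET):
    """Return {symbol: [count at position 0, 1, ...]} for aligned sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must all have the same length")
    pfm = {symbol: [0] * length for symbol in alphabet}
    for seq in sequences:
        for j, symbol in enumerate(seq):
            pfm[symbol][j] += 1  # one more occurrence of `symbol` at position j
    return pfm

# Hypothetical toy alignment, for illustration only.
toy_alignment = ["ACGT", "ACGA", "TCGT", "ACTT"]
pfm = position_frequency_matrix(toy_alignment)
```

Each row of the result holds the counts for one nucleotide, and each column sums to the number of sequences in the alignment.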

From the PFM, a position probability matrix (PPM) can then be created by dividing each nucleotide count at each position by the number of sequences, thereby normalising the values.

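Formally, given a set X of N aligned sequences of length l, the elements of the PPM M are calculated:

$$M_{k,j} = \frac{1}{N} \sum_{i=1}^{N} I(X_{i,j} = k),$$

where i ∈ (1, ..., N) indexes the sequences, j ∈ (1, ..., l) indexes the positions, k ranges over the symbols of the alphabet, and I(X_{i,j} = k) is an indicator function that equals 1 when X_{i,j} = k and 0 otherwise.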
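A minimal Python sketch of this normalisation, dividing each count by the number of sequences, might look as follows. The function name and the toy alignment are hypothetical and serve only as an illustration.

```python
def position_probability_matrix(sequences, alphabet="ACGT"):
    """Return {symbol k: [probability of k at each position j]}."""
    n = len(sequences)
    length = len(sequences[0])
    counts = {k: [0] * length for k in alphabet}
    for seq in sequences:                 # loop over the N sequences
        for j, symbol in enumerate(seq):  # loop over the l positions
            counts[symbol][j] += 1        # count this symbol at position j
    # Normalise every count by the number of sequences.
    return {k: [c / n for c in row] for k, row in counts.items()}

# Hypothetical toy alignment, for illustration only.
toy_alignment = ["ACGT", "ACGA", "TCGT", "ACTT"]
ppm = position_probability_matrix(toy_alignment)

# Every column of a PPM is a probability distribution over the alphabet.
column_sums = [sum(ppm[k][j] for k in "ACGT") for j in range(4)]
```

Because each position contributes exactly one symbol per sequence, every entry of `column_sums` equals 1.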

For example, the probability of the sequence S = GAGGTAAAC given a PPM M is the product of the entries of M selected by each letter of S at its position, because the PPM treats positions as independent:

$$p(S \mid M) = \prod_{j=1}^{l} M_{S_j,\,j}.$$

Pseudocounts (or Laplace estimators) are often applied when calculating PPMs from a small dataset, in order to avoid matrix entries having a value of 0.[2] This is equivalent to multiplying each column of the PPM by a Dirichlet distribution and allows the probability to be calculated for new sequences (that is, sequences which were not part of the original dataset).
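The effect of pseudocounts on a previously unseen sequence can be illustrated with a short Python sketch. It assumes the common add-a-constant scheme, in which the same pseudocount is added to every cell and the denominator grows by the pseudocount times the alphabet size. The function names, the toy alignment, and the query sequence `GCGT` are hypothetical.

```python
import math

def ppm_with_pseudocounts(sequences, pseudocount=1.0, alphabet="ACGT"):
    """Build a PPM, adding `pseudocount` to every count before normalising."""
    n = len(sequences)
    length = len(sequences[0])
    counts = {k: [0] * length for k in alphabet}
    for seq in sequences:
        for j, symbol in enumerate(seq):
            counts[symbol][j] += 1
    denominator = n + pseudocount * len(alphabet)
    return {k: [(c + pseudocount) / denominator for c in row]
            for k, row in counts.items()}

def sequence_probability(seq, ppm):
    """Multiply the PPM entries selected by each symbol at its position."""
    return math.prod(ppm[symbol][j] for j, symbol in enumerate(seq))

# Hypothetical toy alignment; G never occurs at the first position.
toy_alignment = ["ACGT", "ACGA", "TCGT", "ACTT"]
unsmoothed = ppm_with_pseudocounts(toy_alignment, pseudocount=0.0)
smoothed = ppm_with_pseudocounts(toy_alignment, pseudocount=1.0)

p_unsmoothed = sequence_probability("GCGT", unsmoothed)  # 0.0: one zero entry wipes out the product
p_smoothed = sequence_probability("GCGT", smoothed)      # small but non-zero
```

Without pseudocounts, a single unobserved symbol forces the probability of the whole sequence to zero; with them, the sequence is merely scored as unlikely.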

The simplest background model assumes that each letter appears equally frequently in the dataset.

Entries of −∞ in the matrix (the logarithm of a zero probability) make clear the advantage of adding pseudocounts, especially when using small datasets to construct M. The background model need not have equal values for each symbol: for example, when studying organisms with a high GC-content, the values for C and G may be increased with a corresponding decrease for the A and T values.
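The influence of the background model can be sketched in Python under the usual log-odds convention, in which each PWM element is log2 of the PPM entry divided by the background probability of that symbol. The base-2 logarithm, the two-position toy PPM, and the GC-rich background values below are all assumptions made for illustration.

```python
import math

def log_odds_matrix(ppm, background):
    """Convert a PPM to log-odds scores relative to a background model."""
    return {
        k: [math.log2(p / background[k]) if p > 0 else float("-inf")
            for p in row]
        for k, row in ppm.items()
    }

# Hypothetical two-position PPM built without pseudocounts (note the zeros).
ppm = {
    "A": [0.75, 0.0],
    "C": [0.0, 1.0],
    "G": [0.0, 0.0],
    "T": [0.25, 0.0],
}

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
gc_rich = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}  # assumed values

pwm_uniform = log_odds_matrix(ppm, uniform)
pwm_gc_rich = log_odds_matrix(ppm, gc_rich)
```

Under the GC-rich background, a C at the second position earns a lower score than under the uniform background, because C is already expected there by chance, while an A at the first position earns a higher one.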

When the PWM elements are calculated using log likelihoods, the score of a sequence can be calculated by adding (rather than multiplying) the relevant values at each position in the PWM.
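The equivalence between adding log scores and multiplying probability ratios can be checked with a small Python sketch. The three-position PPM, the uniform background of 0.25, and the function names are hypothetical.

```python
import math

BACKGROUND = 0.25  # assumed uniform background probability per nucleotide

# Hypothetical PPM with no zero entries (as if pseudocounts had been applied).
ppm = {
    "A": [0.7, 0.1, 0.1],
    "C": [0.1, 0.7, 0.1],
    "G": [0.1, 0.1, 0.7],
    "T": [0.1, 0.1, 0.1],
}
pwm = {k: [math.log2(p / BACKGROUND) for p in row] for k, row in ppm.items()}

def score(seq, pwm):
    """Sum the PWM entries selected by each symbol at its position."""
    return sum(pwm[symbol][j] for j, symbol in enumerate(seq))

def probability_ratio(seq, ppm, background=BACKGROUND):
    """Multiply the per-position ratios of motif to background probability."""
    return math.prod(ppm[symbol][j] / background for j, symbol in enumerate(seq))

consensus_score = score("ACG", pwm)  # matches the motif at every position
poor_score = score("TTT", pwm)       # matches the motif nowhere
```

A positive score means the sequence is more likely under the motif model than under the background, and a negative score means the opposite. Summing logarithms also avoids the numerical underflow that a long product of small probabilities can cause.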

The information content (IC) of a PWM can be built up in three steps. Writing M_{k,j} for the probability of symbol k at position j, the self-information of observing a particular symbol at a particular position of the motif is:

$$-\log M_{k,j}$$

The expected (average) self-information of a particular element in the PWM is then:

$$-M_{k,j} \log M_{k,j}$$

Finally, the IC of the PWM is the sum of the expected self-information of every element:

$$-\sum_{k,j} M_{k,j} \log M_{k,j}$$

Often, it is more useful to calculate the information content with the background letter frequencies of the sequences under study rather than assuming equal probabilities of each letter (e.g., the GC-content of DNA of thermophilic bacteria ranges from 65.3% to 70.8%,[3] thus a motif of ATAT would contain much more information than a motif of CCGG).
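Both quantities can be sketched in Python. The first function sums the expected self-information over all elements. The second uses relative entropy against the background, which is one common way of taking background frequencies into account and is an assumption here. The base-2 logarithm, the function names, and the 70% GC background are likewise illustrative choices.

```python
import math

def information_content(ppm):
    """Sum of expected self-information, -sum p*log2(p), with 0*log(0) taken as 0."""
    return -sum(p * math.log2(p) for row in ppm.values() for p in row if p > 0)

def relative_information(ppm, background):
    """Relative entropy of the motif against the background, in bits."""
    return sum(p * math.log2(p / background[k])
               for k, row in ppm.items() for p in row if p > 0)

def exact_motif_ppm(motif, alphabet="ACGT"):
    """PPM of a motif that always shows exactly the given symbols."""
    return {k: [1.0 if s == k else 0.0 for s in motif] for k in alphabet}

# Assumed background for a genome with 70% GC-content.
gc_rich = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}

atat = relative_information(exact_motif_ppm("ATAT"), gc_rich)
ccgg = relative_information(exact_motif_ppm("CCGG"), gc_rich)
```

With this background, `atat` comes out larger than `ccgg`: A and T are rare in a GC-rich genome, so a motif made of them is less likely to arise by chance.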

However, it has been shown that when a position-specific scoring matrix (PSSM) is used to search genomic sequences (see below), this uniform correction can lead to overestimation of the importance of the different bases in a motif, due to the uneven distribution of n-mers in real genomes, leading to a significantly larger number of false positives.

[6] More sophisticated algorithms for fast database searching with nucleotide as well as amino acid PWMs/PSSMs are implemented in the possumsearch software.
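For contrast with such optimised tools, the naive baseline is a sliding-window scan that scores every window of the sequence against the matrix. The Python sketch below is only that baseline, not the algorithm used by any particular software. The toy PWM, the target sequence, and the score threshold of 3.0 are hypothetical.

```python
import math

def scan(sequence, pwm):
    """Yield (start, score) for every window as wide as the motif."""
    width = len(next(iter(pwm.values())))
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        yield start, sum(pwm[symbol][j] for j, symbol in enumerate(window))

# Hypothetical log-odds PWM for the motif ACG against a uniform background.
ppm = {
    "A": [0.7, 0.1, 0.1],
    "C": [0.1, 0.7, 0.1],
    "G": [0.1, 0.1, 0.7],
    "T": [0.1, 0.1, 0.1],
}
pwm = {k: [math.log2(p / 0.25) for p in row] for k, row in ppm.items()}

THRESHOLD = 3.0  # assumed cut-off; real applications calibrate this value
hits = [(start, s) for start, s in scan("TTACGTTACG", pwm) if s >= THRESHOLD]
hit_positions = [start for start, _ in hits]
```

This scan costs time proportional to the sequence length times the motif width, which is the cost that faster search algorithms aim to reduce.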

A PSSM with additional probabilities for insertion and deletion at each position can be interpreted as a hidden Markov model.

PWMs are often represented graphically as sequence logos.