[3][4] The human genome contains on the order of 20,000 genes, which work in concert to produce roughly 1,000,000 distinct proteins.
While knowledge of the precise proteins a cell makes (proteomics) is more relevant than knowing how much messenger RNA is made from each gene,
gene expression profiling provides the most global picture possible in a single experiment.
More commonly, expression profiling takes place before enough is known about how genes interact with experimental conditions for a testable hypothesis to exist.
[8] Apart from selecting a clustering algorithm, the user usually has to choose an appropriate proximity measure (distance or similarity) between data objects.
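As a minimal sketch of why the choice of proximity measure matters, the following Python example compares Euclidean distance with a correlation-based distance on three expression profiles. The gene names and values are invented for illustration, and these two measures are only examples of the many in use.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: small when two profiles share a shape."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical log2 expression values across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [5.0, 6.0, 7.0, 8.0]  # same trend as gene_a, higher overall level
gene_c = [4.0, 3.0, 2.0, 1.0]  # similar level to gene_a, opposite trend

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(name, euclidean(gene_a, other), pearson_distance(gene_a, other))
```

With these values, Euclidean distance places gene_a closer to gene_c, while the correlation-based distance places it closer to gene_b, so the same clustering algorithm could group the genes differently depending on the measure chosen.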
The simplest form of class discovery would be to list all the genes that changed by more than a certain amount between two experimental conditions.
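Such a list can be sketched in a few lines of Python. The function name `genes_changed`, the twofold default threshold, and the intensity values are all assumptions made for illustration.

```python
import math

def genes_changed(control, treated, min_fold=2.0):
    """Return genes whose mean expression changed by at least min_fold
    in either direction, mapped to their signed log2 fold change."""
    threshold = math.log2(min_fold)
    changed = {}
    for gene in control:
        mean_c = sum(control[gene]) / len(control[gene])
        mean_t = sum(treated[gene]) / len(treated[gene])
        log2_fc = math.log2(mean_t / mean_c)
        if abs(log2_fc) >= threshold:
            changed[gene] = log2_fc
    return changed

# Hypothetical linear-scale intensities, three replicates per group.
control = {"geneA": [100, 110, 90], "geneB": [200, 210, 190], "geneC": [80, 75, 85]}
treated = {"geneA": [310, 290, 300], "geneB": [205, 195, 200], "geneC": [20, 22, 18]}

changed = genes_changed(control, treated)
print(sorted(changed))
```

Here geneA (threefold up) and geneC (fourfold down) pass the cutoff and geneB does not. The filter looks only at group means and ignores variability between replicates.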
In general, expression profiling studies report those genes that showed statistically significant differences under changed experimental conditions.
Newer microarray analysis techniques automate certain aspects of attaching biological significance to expression profiling results, but this remains a very difficult problem.
Both DNA microarrays and quantitative PCR exploit the preferential binding or "base pairing" of complementary nucleic acid sequences, and both are used in gene expression profiling, often in a serial fashion.
While high-throughput DNA microarrays lack the quantitative accuracy of qPCR, it takes about the same time to measure the gene expression of a few dozen genes via qPCR as it would to measure an entire genome using DNA microarrays.
[10] Simply stating that a group of genes were regulated by at least twofold, once a common practice, lacks a solid statistical footing.
With five or fewer replicates in each group, typical for microarrays, a single outlier observation can create an apparent difference greater than twofold.
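The effect of a single outlier is easy to reproduce with made-up numbers. In this Python sketch all intensities are hypothetical, and the treated group differs from the unchanged case by one aberrant measurement.

```python
from statistics import mean

control = [100, 105, 95, 102, 98]          # hypothetical intensities
treated_clean = [101, 97, 104, 99, 103]    # no real change
treated_outlier = [101, 97, 104, 99, 700]  # one aberrant measurement

fold_clean = mean(treated_clean) / mean(control)
fold_outlier = mean(treated_outlier) / mean(control)

print(f"fold change without outlier: {fold_clean:.2f}")
print(f"fold change with outlier:    {fold_outlier:.2f}")
```

Four of the five treated replicates are indistinguishable from the controls, yet the group mean alone suggests a change of more than twofold.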
Rather than identify differentially expressed genes using a fold change cutoff, one can use a variety of statistical tests or omnibus tests such as ANOVA, all of which consider both fold change and variability to create a p-value, an estimate of how often we would observe the data by chance alone.
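To illustrate how such tests weigh fold change against variability, the Python sketch below computes Welch's t statistic, one possible choice among many, for two hypothetical genes with the same mean difference. Converting the statistic to a p-value requires the t distribution, which is omitted here; a statistics package would normally handle that step.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic: difference in means divided by its standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / se

# Hypothetical log2 expression values. Both genes differ by 1.0 between
# groups on the log2 scale, i.e. a twofold change.
tight_control, tight_treated = [8.0, 8.1, 7.9, 8.0], [9.0, 9.1, 8.9, 9.0]
noisy_control, noisy_treated = [8.0, 9.5, 6.5, 8.0], [9.0, 10.5, 7.5, 9.0]

t_tight = welch_t(tight_control, tight_treated)
t_noisy = welch_t(noisy_control, noisy_treated)
print(f"low variability:  t = {t_tight:.2f}")
print(f"high variability: t = {t_noisy:.2f}")
```

A fold change cutoff would treat the two genes identically, whereas the test statistic is far larger for the gene whose replicates agree closely.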
Applying p-values to microarrays is complicated by the large number of multiple comparisons (genes) involved.
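For example, a p-value of 0.05 is typically thought to indicate significance, since it estimates a 5% probability of observing the data by chance. At that threshold, however, an array measuring thousands of genes would be expected to flag about 5% of them even if no gene truly differed between the groups.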
One obvious solution is to consider significant only those genes meeting a much more stringent p-value criterion; for example, one could perform a Bonferroni correction on the p-values, or use a false discovery rate calculation to adjust p-values in proportion to the number of parallel tests involved.
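Both kinds of adjustment can be sketched briefly in Python. The false discovery rate procedure shown is the Benjamini-Hochberg step-up method, chosen here as one common example, and the p-values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for five genes.
p_values = [0.001, 0.008, 0.02, 0.03, 0.60]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)

print("Bonferroni significant at 0.05:", sum(p < 0.05 for p in bonf))
print("BH significant at 0.05:        ", sum(p < 0.05 for p in bh))
```

On these five values the Bonferroni correction retains two genes at the 0.05 level and the Benjamini-Hochberg adjustment retains four, reflecting the more conservative nature of the former.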
Many tests begin with the assumption of a normal distribution in the data, because that seems like a sensible starting point and often produces results that appear more significant.
Many modern microarray analysis techniques involve bootstrapping, machine learning or Monte Carlo methods.
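As one illustration of a Monte Carlo approach, the Python sketch below estimates a p-value by randomly permuting group labels, which avoids assuming a normal distribution. The function name, the number of permutations, the fixed seed, and the data are assumptions for illustration and are not drawn from any particular analysis package.

```python
import random

def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided Monte Carlo permutation test on the difference in means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a = len(a)
    observed = abs(sum(b) / len(b) - sum(a) / n_a)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(perm_b) / len(perm_b) - sum(perm_a) / n_a)
        if diff >= observed - 1e-12:  # small tolerance for rounding error
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical log2 expression values for one gene, five replicates per group.
control = [8.0, 8.1, 7.9, 8.2, 7.8]
treated = [9.0, 9.2, 8.9, 9.1, 8.8]
p_shifted = permutation_p_value(control, treated)
print(f"estimated p-value: {p_shifted:.4f}")
```

With only five replicates per group there are 252 distinct ways to split the ten values, so the smallest two-sided p-value such a test can approach is about 0.008, regardless of how large the difference is.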
[14] As the number of replicate measurements in a microarray experiment increases, various statistical approaches yield increasingly similar results, but lack of concordance between different statistical methods makes array results appear less trustworthy.
While the statistics may identify which gene products change under experimental conditions, making biological sense of expression profiling rests on knowing which protein each gene product makes and what function this protein performs.
Observing these links, we may begin to suspect that they represent much more than chance associations in the results, and that they are all on our list because of an underlying biological process.
Fairly straightforward statistics provide estimates of whether associations between genes on lists are greater than what one would expect by chance.
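One such calculation is the hypergeometric tail probability: the chance that a randomly drawn gene list of the same size would contain at least as many members of a given category. The Python sketch below assumes this test as a representative example; the function name and the counts are hypothetical.

```python
from math import comb

def enrichment_p_value(genome_size, category_size, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn at random from a
    genome in which category_size genes belong to the category."""
    total = comb(genome_size, list_size)
    upper = min(category_size, list_size)
    tail = sum(
        comb(category_size, k) * comb(genome_size - category_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20,000 genes, 200 of them in the category of interest,
# and 8 of those appear on a list of 100 differentially expressed genes.
p = enrichment_p_value(20000, 200, 100, 8)
print(f"enrichment p-value: {p:.2e}")
```

By chance alone, about one category member would be expected on a list of this size, so an overlap of eight yields a very small p-value. A small p-value indicates over-representation only; it does not by itself establish that the category explains the experimental result.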
While this may be true, there are a number of reasons why making this a firm conclusion based on enrichment alone represents an unwarranted leap of faith.
Gene set enrichment analysis (GSEA) uses a Kolmogorov–Smirnov-style statistic to see whether any previously defined gene sets exhibit unusual behavior in the current expression profile.
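The idea behind a Kolmogorov–Smirnov-style statistic can be sketched as a running sum over a ranked gene list. The Python version below is a simplified, unweighted illustration and not the published GSEA procedure, which additionally weights genes by their correlation with the phenotype and assesses significance by permutation. Gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Maximum deviation from zero of a running sum that steps up at
    gene-set members and down at all other genes."""
    members = set(gene_set) & set(ranked_genes)
    n_hits = len(members)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        if gene in members:
            running += 1.0 / n_hits    # step up at a set member
        else:
            running -= 1.0 / n_misses  # step down otherwise
        if abs(running) > abs(best):
            best = running
    return best

# Ten hypothetical genes ranked from most up-regulated to most down-regulated.
ranked = [f"g{i}" for i in range(1, 11)]
es_top = enrichment_score(ranked, {"g1", "g2", "g3"})      # clustered at the top
es_spread = enrichment_score(ranked, {"g1", "g5", "g10"})  # spread evenly
print(es_top, es_spread)
```

A set whose members cluster at one end of the ranking produces a score near 1 in magnitude, while a set scattered through the list stays much closer to zero.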
In many cases, analyzing expression profiling results takes far more effort than performing the initial experiments.
Most researchers use multiple statistical methods and exploratory data analysis before publishing their expression profiling results, coordinating their efforts with a bioinformatician or other expert in DNA microarrays.