Weighted correlation network analysis

Weighted correlation network analysis, also known as weighted gene co-expression network analysis (WGCNA), is a widely used data mining method, especially for studying biological networks, that is based on pairwise correlations between variables.

While it can be applied to most high-dimensional data sets, it has been most widely used in genomic applications.

WGCNA can be used as a data reduction technique (related to oblique factor analysis), as a clustering method (fuzzy clustering), as a feature selection method (e.g. as a gene screening method), as a framework for integrating complementary (genomic) data (based on weighted correlations between quantitative variables), and as a data exploration technique.

[1] Although WGCNA incorporates traditional data exploratory techniques, its intuitive network language and analysis framework transcend any standard analysis technique.

Since it uses network methodology and is well suited for integrating complementary genomic data sets, it can be interpreted as a systems biologic or systems genetic data analysis method.

By selecting intramodular hubs in consensus modules, WGCNA also gives rise to network-based meta-analysis techniques.

[2] The WGCNA method was developed by Steve Horvath, a professor of human genetics at the David Geffen School of Medicine at UCLA and of biostatistics at the UCLA Fielding School of Public Health, together with his colleagues at UCLA and (former) lab members (in particular Peter Langfelder, Bin Zhang, and Jun Dong).

In particular, weighted correlation networks were developed in joint discussions with cancer researchers Paul Mischel and Stanley F. Nelson and neuroscientists Daniel H. Geschwind and Michael C. Oldham, according to the acknowledgement section in.

However, using the absolute value of the correlation may obfuscate biologically relevant information, since no distinction is made between gene repression and activation.

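Various transformation (or scaling) approaches can be considered if a signed co-expression measure between gene expression profiles $x_i$ and $x_j$ is desired; a common choice is $s^{\text{signed}}_{ij} = 0.5 + 0.5\,\operatorname{cor}(x_i, x_j)$, which, like the unsigned measure $s^{\text{unsigned}}_{ij} = |\operatorname{cor}(x_i, x_j)|$, lies between 0 and 1.

Note that the unsigned similarity between two oppositely expressed genes ($\operatorname{cor}(x_i, x_j) = -1$) equals 1, whereas their signed similarity equals 0.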
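The difference between the two similarity measures can be made concrete with a small sketch. It assumes the unsigned measure is the absolute Pearson correlation and the signed measure is the commonly used transform 0.5 + 0.5·cor; the function names and the toy expression profiles are illustrative and not part of any WGCNA API.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def unsigned_similarity(x, y):
    """Assumed unsigned measure: |cor(x, y)|, ignores the direction of the relationship."""
    return abs(pearson(x, y))

def signed_similarity(x, y):
    """Assumed signed measure: 0.5 + 0.5 * cor(x, y), maps [-1, 1] onto [0, 1]."""
    return 0.5 + 0.5 * pearson(x, y)

# Hypothetical expression profiles across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [8.0, 6.0, 4.0, 2.0]    # perfectly anti-correlated with gene_a
gene_c = [1.0, -1.0, -1.0, 1.0]  # uncorrelated with gene_a

for name, other in (("anti-correlated", gene_b), ("uncorrelated", gene_c)):
    print(name,
          "unsigned:", round(unsigned_similarity(gene_a, other), 3),
          "signed:", round(signed_similarity(gene_a, other), 3))
```

Under these assumptions the anti-correlated pair is treated as maximally similar by the unsigned measure and as maximally dissimilar by the signed one, while an uncorrelated pair sits at 0 and 0.5 respectively.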

Because hard thresholding encodes gene connections in a binary fashion, it can be sensitive to the choice of the threshold and result in the loss of co-expression information.

[3] The continuous nature of the co-expression information can be preserved by employing soft thresholding, which results in a weighted network.

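Specifically, WGCNA uses the following power function to assess the connection strength between genes $i$ and $j$ with co-expression similarity $s_{ij}$:

$a_{ij} = (s_{ij})^{\beta}$, where the power $\beta \geq 1$ is the soft-thresholding parameter.

The power $\beta$ can be chosen using the scale-free topology criterion, which amounts to choosing the smallest value of $\beta$ for which the network approximately exhibits scale-free topology.

Since $\log(a_{ij}) = \beta \log(s_{ij})$, the weighted network adjacency is linearly related to the co-expression similarity on a logarithmic scale.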

Since this soft-thresholding procedure applied to a pairwise correlation matrix leads to a weighted adjacency matrix, the ensuing analysis is referred to as weighted gene co-expression network analysis.
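As a rough illustration of the contrast between hard and soft thresholding, the sketch below applies both to a small similarity matrix. It assumes the soft-threshold adjacency is the similarity raised to a power beta; the value beta = 6, the cut-off tau = 0.8, and the matrix entries are illustrative choices, and the function names are hypothetical.

```python
import math

def soft_threshold(similarity, beta=6):
    """Weighted adjacency: each similarity s is mapped to s ** beta (assumed power form)."""
    return [[s ** beta for s in row] for row in similarity]

def hard_threshold(similarity, tau=0.8):
    """Unweighted adjacency: 1 if the similarity exceeds tau, otherwise 0."""
    return [[1 if s > tau else 0 for s in row] for row in similarity]

# Hypothetical co-expression similarities for three genes.
similarity = [
    [1.0, 0.9, 0.79],
    [0.9, 1.0, 0.2],
    [0.79, 0.2, 1.0],
]

weighted = soft_threshold(similarity, beta=6)
unweighted = hard_threshold(similarity, tau=0.8)

# Similarities of 0.90 and 0.79 fall on opposite sides of the hard cut-off,
# but both keep a non-zero, graded weight under soft thresholding.
print("hard:", unweighted[0][1], unweighted[0][2])
print("soft:", round(weighted[0][1], 4), round(weighted[0][2], 4))

# On a log scale the weighted adjacency is beta times the log similarity.
print(math.isclose(math.log(weighted[0][1]), 6 * math.log(similarity[0][1])))
```

The example is meant to show why a small change in the threshold can flip a hard-thresholded edge while the soft-thresholded weight changes smoothly.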

Roughly speaking, a pair of genes has a high proximity if it is closely interconnected.

Typically, WGCNA uses the topological overlap measure (TOM) as proximity.

The TOM is a highly robust measure of network interconnectedness (proximity).

This proximity (converted to a dissimilarity, 1 − TOM) is used as input to average linkage hierarchical clustering.
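The text does not spell out the topological overlap formula, so the following sketch assumes the form commonly given in the WGCNA literature for a weighted adjacency matrix with entries in [0, 1]: the shared-neighbour weight plus the direct adjacency, divided by the smaller of the two connectivities plus one minus the direct adjacency. The function name and the toy adjacency matrix are hypothetical.

```python
def topological_overlap(adj):
    """Topological overlap matrix for a symmetric weighted adjacency matrix (assumed formula)."""
    n = len(adj)
    # Connectivity k_i: sum of adjacencies between node i and all other nodes.
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]  # diagonal is set to 1 by convention
    for i in range(n):
        for j in range(i + 1, n):
            # Weight of neighbours shared by i and j (excluding i and j themselves).
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
            value = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            tom[i][j] = tom[j][i] = value
    return tom

# Hypothetical weighted adjacency: genes 0-2 are tightly connected, gene 3 is peripheral.
adj = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.7, 0.1],
    [0.6, 0.7, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]

tom = topological_overlap(adj)
# Dissimilarity used as clustering input.
dissimilarity = [[1.0 - t for t in row] for row in tom]

print("TOM(0,1):", round(tom[0][1], 3), "TOM(0,3):", round(tom[0][3], 3))
```

The resulting dissimilarity matrix could then be handed, in condensed form, to an average linkage routine such as `scipy.cluster.hierarchy.linkage` with `method="average"`; that step is omitted here to keep the sketch dependency-free.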

Eigengenes define robust biomarkers,[12] and can be used as features in complex machine learning models such as Bayesian networks.

[13] To find modules that relate to a clinical trait of interest, module eigengenes are correlated with that trait, which gives rise to an eigengene significance measure.
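A minimal sketch of the eigengene significance calculation follows. It assumes the module eigengene is the first principal component (across samples) of the standardized expression profiles of the module genes, and approximates it with plain power iteration instead of a full singular value decomposition. The function names, the iteration count, and the toy data are illustrative assumptions.

```python
import math

def standardize(profile):
    """Scale one gene's expression profile to mean 0 and sample variance 1."""
    n = len(profile)
    mean = sum(profile) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in profile) / (n - 1))
    return [(v - mean) / sd for v in profile]

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def module_eigengene(module_expr, iterations=200):
    """Approximate first principal component over samples via power iteration on Z^T Z."""
    z = [standardize(gene) for gene in module_expr]
    n_samples = len(z[0])
    v = list(z[0])  # start from the first gene's profile (assumed not orthogonal to the answer)
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(gene, v)) for gene in z]            # Z v
        v = [sum(scores[g] * z[g][s] for g in range(len(z))) for s in range(n_samples)]  # Z^T (Z v)
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    # Orient the eigengene so it agrees in sign with the average standardized expression.
    avg = [sum(gene[s] for gene in z) / len(z) for s in range(n_samples)]
    if sum(a * b for a, b in zip(avg, v)) < 0:
        v = [-x for x in v]
    return v

# Hypothetical module of three co-expressed genes measured in five samples.
module_expr = [
    [2.0, 4.0, 6.0, 8.0, 10.0],
    [1.0, 2.1, 2.9, 4.2, 5.0],
    [5.0, 7.2, 8.8, 11.1, 13.0],
]
trait = [0.5, 1.1, 1.4, 2.2, 2.4]  # hypothetical clinical trait per sample

eigengene = module_eigengene(module_expr)
eigengene_significance = pearson(eigengene, trait)
print("eigengene significance:", round(eigengene_significance, 3))
```

In practice one eigengene would be computed per module and each correlated with the trait, so that modules can be ranked by the absolute value of their eigengene significance.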

Eigengenes can be used as features in more complex predictive models including decision trees and Bayesian networks.

[14] To identify intramodular hub genes inside a given module, one can use two types of connectivity measures.

The first, referred to as kME, is defined based on correlating each gene with the respective module eigengene.

The second, referred to as kIN, is defined as a sum of adjacencies with respect to the module genes.
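The two connectivity measures can be sketched as follows, taking the eigengene-based measure (commonly written kME) as the Pearson correlation between a gene and the module eigengene, and kIN as the sum of a gene's adjacencies to the other genes of the same module. The eigengene vector, the expression values, and the adjacency matrix below are hypothetical toy inputs that are not derived from one another, and the function names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def k_me(module_expr, eigengene):
    """Eigengene-based connectivity: correlation of each module gene with the eigengene."""
    return [pearson(gene, eigengene) for gene in module_expr]

def k_in(adj, module):
    """Intramodular connectivity: sum of adjacencies to the other genes in the module."""
    return {i: sum(adj[i][j] for j in module if j != i) for i in module}

# Hypothetical expression of three module genes across five samples.
module_expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 1.5, 3.5, 3.0, 5.0],
    [5.0, 1.0, 4.0, 2.0, 3.0],
]
# Hypothetical precomputed module eigengene (one value per sample).
eigengene = [-0.632, -0.316, 0.0, 0.316, 0.632]

# Hypothetical weighted adjacency; genes 0-2 form the module, gene 3 lies outside it.
adj = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.7, 0.1],
    [0.6, 0.7, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]
module = [0, 1, 2]

kme = k_me(module_expr, eigengene)
kin = k_in(adj, module)
hub_by_kin = max(kin, key=kin.get)  # gene with the largest intramodular connectivity

print("kME:", [round(v, 3) for v in kme])
print("kIN:", {g: round(v, 3) for g, v in kin.items()}, "hub:", hub_by_kin)
```

Genes with the highest kME or kIN within a module would be the candidates for intramodular hubs; adjacencies to genes outside the module (gene 3 here) do not contribute to kIN.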

[4] To test whether a module is preserved in another data set, one can use various network statistics.

[2][15] In one application, a WGCNA study revealed novel transcription factors associated with the bisphenol A (BPA) dose-response.

[27] The WGCNA R software package[28] provides functions for carrying out all aspects of weighted network analysis (module construction, hub gene selection, module preservation statistics, differential network analysis, network statistics).