Polygenic score

A polygenic score gives an estimate of how likely an individual is to have a given trait based only on genetics, without taking environmental factors into account; it is typically calculated as a weighted sum of trait-associated alleles.[2][3][4][5][6]
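
Written out, this weighted sum takes a standard form (the notation here is illustrative):

$$\mathrm{PGS}_i = \sum_{j=1}^{M} \hat{\beta}_j \, x_{ij}$$

where $x_{ij} \in \{0, 1, 2\}$ counts the trait-associated alleles carried by individual $i$ at variant $j$, and $\hat{\beta}_j$ is the estimated per-allele effect size of variant $j$, typically taken from a genome-wide association study.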

Although polygenic risk scores from studies in humans have gained the most attention, the basic idea was first introduced for selective plant and animal breeding.

Similar to modern approaches to constructing a polygenic risk score, the breeding value of an individual animal or plant was calculated as a combination of several single-nucleotide polymorphisms (SNPs), weighted by their individual effects on a trait.[20]

Genome-wide association studies make it possible to map phenotypes onto variation in nucleotide bases across human populations.

Learning which variants influence specific traits, and how strongly they do so, is the key task in constructing polygenic scores for humans.

The concept was successfully applied in 2009 by researchers who organized a genome-wide association study (GWAS) of schizophrenia with the objective of constructing scores of risk propensity.[22]

That study was the first to use the term polygenic score for a prediction drawn from a linear combination of single-nucleotide polymorphism (SNP) genotypes; the score was able to explain 3% of the variance in schizophrenia.

The simplest approach, the so-called "pruning and thresholding" method, sets the weights equal to the coefficient estimates from a regression of the trait on each genetic variant.

SNPs that are physically close to each other are more likely to be in linkage disequilibrium, meaning they are typically inherited together and therefore do not provide independent predictive power.
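
A minimal sketch of this pruning-and-thresholding idea, assuming GWAS summary statistics (position, effect estimate, and p-value per SNP) are already in hand; the function names, the thresholds, and the use of physical distance as a crude stand-in for measured linkage disequilibrium are all illustrative assumptions:

```python
def prune_and_threshold(snps, p_cutoff=5e-8, window_bp=250_000):
    """Greedy pruning by physical distance, then p-value thresholding.

    `snps` is a list of dicts with keys: 'pos' (base-pair position on one
    chromosome), 'beta' (GWAS effect estimate), and 'p' (association
    p-value). Real pipelines prune on r^2 estimated from a reference
    panel rather than on physical distance alone.
    """
    # Keep only SNPs passing the significance threshold.
    candidates = [s for s in snps if s['p'] < p_cutoff]
    # Visit SNPs from most to least significant; drop any SNP that lies
    # within `window_bp` of an already-selected, more significant SNP.
    candidates.sort(key=lambda s: s['p'])
    selected = []
    for snp in candidates:
        if all(abs(snp['pos'] - kept['pos']) > window_bp for kept in selected):
            selected.append(snp)
    return selected  # the retained 'beta' values serve as the score weights

def score(individual_genotypes, selected):
    """Polygenic score: weighted sum of allele counts (0, 1, or 2)."""
    return sum(s['beta'] * individual_genotypes[s['pos']] for s in selected)
```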

Penalized regression methods instead use prior information to assign probabilities to: 1) how many genetic variants are expected to affect a trait, and 2) the distribution of their effect sizes.[2][27]
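
A minimal sketch of the penalized-regression idea using an L1 (LASSO) penalty, which encodes the prior that only a limited number of variants have nonzero effects; the data are simulated and all parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Simulated genotypes: 1,000 individuals x 5,000 SNPs, allele counts 0/1/2.
n, m = 1_000, 5_000
X = rng.binomial(2, 0.3, size=(n, m)).astype(float)

# Sparse ground truth: only 50 SNPs affect the trait, matching the prior
# belief that a limited number of variants have nonzero effects.
true_beta = np.zeros(m)
causal = rng.choice(m, size=50, replace=False)
true_beta[causal] = rng.normal(0.0, 0.5, size=50)
y = X @ true_beta + rng.normal(0.0, 1.0, size=n)

# The L1 penalty shrinks most coefficients to exactly zero; alpha sets
# its strength and thus the assumed sparsity of the genetic architecture.
model = Lasso(alpha=0.1).fit(X, y)
pgs = model.predict(X)  # fitted values act as polygenic scores
print(f"non-zero weights: {np.count_nonzero(model.coef_)} of {m}")
```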

As the number of genome-wide association studies has grown explosively, along with rapid advances in methods for calculating polygenic scores, the most obvious application of such scores is in clinical settings, for disease prediction or risk stratification.[33]

The use of polygenic scores for embryo selection has been criticised due to alleged ethical and safety issues as well as limited practical utility.[41][42][43][44]

However, trait-specific evaluations claiming the contrary have been put forth,[45][46] and ethical arguments for PGS-based embryo selection have also been made.

A common metric for evaluating such continuous estimates of a yes/no outcome (see binary classification) is the area under the ROC curve (AUC).
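
A minimal sketch of such an evaluation, with simulated case and control score distributions (all values are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Simulated standardized PRS: cases are drawn from a distribution shifted
# slightly upward relative to controls.
controls = rng.normal(0.0, 1.0, size=5_000)
cases = rng.normal(0.6, 1.0, size=500)

scores = np.concatenate([controls, cases])
labels = np.concatenate([np.zeros(5_000), np.ones(500)])

# AUC = probability that a randomly chosen case scores higher than a
# randomly chosen control; 0.5 is chance, 1.0 is perfect separation.
print(f"AUC: {roc_auc_score(labels, scores):.3f}")
```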

Recent scientific progress in prediction power relies heavily on the creation and expansion of large biobanks containing genotype and phenotype data for very large numbers of individuals.

The construction of more diverse biobanks, with successful recruitment across all ancestries, is required to rectify the currently skewed access to, and benefits from, PGS-based medicine.

Such comparisons are important because clinical practice can be influenced by knowing which individuals have a rare genetic cause of cardiovascular disease.

As of January 2021, providing PRS directly to individuals was undergoing research trials in health systems around the world, but was not yet offered as standard of care.[9]

Consumers can download their genotype (genetic variant) data and upload them to online PRS calculators, e.g. Scripps Health, Impute.me or Color Genomics.

At a fundamental level, the use of polygenic scores in a clinical context raises technical issues similar to those of existing clinical tools.[56][59]

Unlike many other clinical laboratory or imaging methods, an individual's germline genetic risk for a variety of diseases can be calculated at birth, after sequencing their DNA just once.[64]

Likewise, a polygenic risk score based approach may reduce the need for invasive diagnostic procedures, as has been demonstrated in celiac disease.[70]

The goal of population-level screening is to identify patients at high risk for a disease who would benefit from an existing treatment.
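
A minimal sketch of how such a screening step might flag a high-risk group; the top-5% cutoff and the simulated scores are illustrative assumptions, not a clinically validated choice:

```python
import numpy as np

rng = np.random.default_rng(2)
prs = rng.normal(0.0, 1.0, size=100_000)  # standardized scores for a cohort

# Flag individuals above the 95th percentile for follow-up; a real program
# would choose the cutoff by weighing the costs and benefits of the
# available intervention.
cutoff = np.percentile(prs, 95)
high_risk = prs > cutoff
print(f"flagged {high_risk.sum()} of {prs.size} individuals")
```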

Several clinical studies are underway in breast cancer,[76][77] and heart disease is another area that could benefit from a polygenic score based screening program.

In humans, polygenic scores were originally computed in an effort to predict the prevalence and etiology of complex, heritable diseases, which are typically affected by many genetic variants that individually confer only a small effect on overall risk.

Additionally, a polygenic score can be used in several other ways:

- as a lower bound to test whether heritability estimates may be biased;
- as a measure of the genetic overlap between traits (genetic correlation), which might indicate e.g. shared genetic bases for groups of mental disorders;
- as a means to assess group differences in a trait such as height, or to examine changes in a trait over time due to natural selection, indicative of a soft selective sweep (as e.g. for intelligence, where the changes in allele frequency would be too small to detect at each individual variant but detectable in the overall polygenic score);
- in Mendelian randomization (assuming no pleiotropy with relevant traits);
- to detect and control for the presence of genetic confounds in outcomes (e.g. the correlation of schizophrenia with poverty); or
- to investigate gene–environment interactions and correlations.

For example, members of plant and animal breeds that humans have effectively created, such as modern maize or domestic cattle, are all technically "related".

In human genomic prediction, by contrast, unrelated individuals in large populations are selected to estimate the effects of common SNPs.

Because of the smaller effective population size in livestock, the mean coefficient of relationship between any two individuals is likely to be high, and common SNPs will tag causal variants at a greater physical distance than in humans; this is the major reason for the lower SNP-based heritability estimates in humans compared to livestock.

The two graphics illustrate sampling distributions of polygenic scores and the ability of a polygenic risk score to stratify risk with increasing age. The left panel shows how risk (the standardized PRS on the x-axis) can separate 'cases' (individuals with a certain disease, in red) from 'controls' (individuals without the disease, in blue); the y-axis indicates how many in each group are assigned a certain score. In the right panel, the same population is divided into three groups according to their predicted risk, i.e. their assigned score: high (red), middle (gray), or low (blue). The y-axis shows the observed risk, and the x-axis shows the groups separating in risk as they age, in correspondence with their predicted risk scores.
An early (2006) example of a genetic risk score applied to type 2 diabetes in humans.[19] The authors of the study concluded that individual risk alleles only moderately identify an increase in disease risk, but that identifiable risk is "multiplicatively increased" when information from several known risk polymorphisms is combined. Using such combined information allows for identifying subgroups of a population with odds of disease significantly greater than when using a single polymorphism.
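
The "multiplicative" combination of risk information described in the caption above can be sketched as a product of per-allele odds ratios; the odds ratios and allele counts below are invented for illustration:

```python
import math

# Hypothetical per-allele odds ratios for three risk polymorphisms.
odds_ratios = [1.2, 1.3, 1.15]
# Risk-allele counts (0, 1, or 2) carried by one individual at each locus.
allele_counts = [2, 1, 2]

# Under a multiplicative model, each risk allele scales the odds by its
# per-allele odds ratio, so the combined effect is the product OR_j ** count_j.
combined_or = math.prod(or_j ** c for or_j, c in zip(odds_ratios, allele_counts))
print(f"combined odds ratio vs. zero risk alleles: {combined_or:.2f}")
```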
Predicted vs actual height using a polygenic risk score
PGS predictor performance increases with the size of the dataset available for training, illustrated here for hypertension, hypothyroidism and type 2 diabetes. The x-axis shows the number of cases (i.e. individuals with the disease) present in the training data, on a logarithmic scale; the full range is from 1,000 to over 100,000 cases. The number of controls (i.e. individuals without the disease) in the training data was much larger than the number of cases. These particular predictors were trained using the LASSO algorithm.[17]