Population structure (genetics)

In a randomly mating (or panmictic) population, allele frequencies are expected to be roughly similar between groups.

For example, a barrier like a river can separate two groups of the same species and make it difficult for potential mates to cross; if a mutation occurs, over many generations it can spread and become common in one subpopulation while being completely absent in the other.
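The scenario can be illustrated with a minimal Wright-Fisher simulation. This is a sketch under simplifying assumptions that are not in the text above: constant population size, no selection, no further mutation, and no migration across the barrier. The function name `wright_fisher`, the population labels, and the parameter values are hypothetical.

```python
import random

def wright_fisher(p0, n_individuals, generations, rng):
    """Return the allele-frequency trajectory under pure genetic drift."""
    n_copies = 2 * n_individuals  # diploid: two allele copies per individual
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Each allele copy in the next generation is sampled at frequency p.
        count = sum(1 for _copy in range(n_copies) if rng.random() < p)
        p = count / n_copies
        trajectory.append(p)
    return trajectory

rng = random.Random(42)  # fixed seed so the run is reproducible
# Assumed setup: the mutation exists only on one side of the barrier.
east = wright_fisher(0.05, n_individuals=100, generations=200, rng=rng)
west = wright_fisher(0.0, n_individuals=100, generations=200, rng=rng)
print(f"east: {east[-1]:.2f}, west: {west[-1]:.2f}")
```

Without migration the western frequency stays at exactly zero, while the eastern frequency wanders by chance. In most runs a rare allele is lost, but in some it drifts to a high frequency, which is the divergence the example describes.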

Genetic variants do not necessarily cause observable changes in organisms, but can be correlated with traits by coincidence because of population structure: a variant that is common in a population with a high rate of disease may erroneously be thought to cause the disease.

Population structure commonly arises from physical separation by distance or barriers, like mountains and rivers, followed by genetic drift.

Other causes include gene flow from migrations, population bottlenecks and expansions, founder effects, evolutionary pressure, random chance, and (in humans) cultural factors.

Even in the absence of these factors, individuals tend to stay close to where they were born, which means that alleles will not be distributed at random with respect to the full range of the species.

Misspecification of such models, for instance by not taking into account the existence of structure in an ancestral population, can give rise to heavily biased parameter estimates.

This reduction in heterozygosity can be thought of as an extension of inbreeding, with individuals in subpopulations being more likely to share a recent common ancestor.

This motivates the derivation of Wright's F-statistics (also called "fixation indices"), which measure inbreeding through observed versus expected heterozygosity.
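The comparison of observed and expected heterozygosity can be sketched in a few lines. The example below assumes a single biallelic locus, Hardy-Weinberg proportions within each subpopulation, and equally sized subpopulations. The function names are illustrative, not part of any standard library.

```python
def expected_heterozygosity(p):
    """Hardy-Weinberg expected heterozygosity at a biallelic locus."""
    return 2.0 * p * (1.0 - p)

def inbreeding_coefficient(h_observed, h_expected):
    """F = 1 - H_obs / H_exp; positive when heterozygotes are deficient."""
    return 1.0 - h_observed / h_expected

def fst(subpop_freqs):
    """F_ST = (H_T - H_S) / H_T, assuming equally sized subpopulations.

    Undefined (division by zero) when the pooled frequency is 0 or 1.
    """
    n = len(subpop_freqs)
    h_s = sum(expected_heterozygosity(p) for p in subpop_freqs) / n
    h_t = expected_heterozygosity(sum(subpop_freqs) / n)
    return (h_t - h_s) / h_t

# Two subpopulations with allele frequencies 0.1 and 0.9: about 0.64.
print(fst([0.1, 0.9]))
# Identical frequencies give no differentiation.
print(fst([0.5, 0.5]))
```

In the first case the pooled population would be expected to have heterozygosity 0.5, but the average within subpopulations is only 0.18, so the index is high.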

In 2000, Jonathan K. Pritchard introduced the STRUCTURE algorithm to estimate these proportions via Markov chain Monte Carlo, modelling allele frequencies at each locus with a Dirichlet distribution.

[9] Though clustering methods are popular, they are open to misinterpretation: for non-simulated data, there is never a "true" value of K, but rather an approximation considered useful for a given question.

Principal component analysis (PCA) was first applied in population genetics in 1978 by Cavalli-Sforza and colleagues and saw renewed use with the advent of high-throughput sequencing.

[9][17] Initially PCA was used on allele frequencies at known genetic markers for populations, though later it was found that by coding SNPs as integers (for example, as the number of non-reference alleles) and normalizing the values, PCA could be applied at the level of individuals.
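The individual-level approach can be sketched with NumPy. The example assumes an individuals-by-SNPs matrix of non-reference allele counts (0, 1, or 2) with no missing data. It scales each SNP by its binomial standard deviation, which is one common normalization among several. The function name `genotype_pca` and the simulated toy data are hypothetical.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """PCA of individuals from an (individuals x SNPs) matrix of 0/1/2 counts."""
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                 # per-SNP non-reference allele frequency
    keep = (p > 0.0) & (p < 1.0)             # drop monomorphic SNPs (zero variance)
    g, p = g[:, keep], p[keep]
    z = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))  # centre and scale each SNP
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]     # PC scores per individual

rng = np.random.default_rng(0)
# Toy data (assumed): two groups of 20 individuals typed at 200 SNPs,
# with allele frequencies that differ between the groups.
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, size=200), 0.05, 0.95)
geno = np.vstack([rng.binomial(2, freq_a, size=(20, 200)),
                  rng.binomial(2, freq_b, size=(20, 200))])
pcs = genotype_pca(geno)
print(pcs[:20, 0].mean(), pcs[20:, 0].mean())
```

Because each SNP is centred, the scores on every component sum to zero across individuals, so the two groups fall on opposite sides of the origin along a component that captures their difference.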

[13] Individuals with admixed ancestries will tend to fall between clusters, and when there is homogeneous isolation by distance in the data, the top PC vectors will reflect geographic variation.

[20] Multidimensional scaling and discriminant analysis have been used to study differentiation and population assignment, and to analyze genetic distances.

[21] Neighborhood graph approaches like t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) can visualize continental and subcontinental structure in human data.

[23][24] Variational autoencoders can generate artificial genotypes with structure representative of the input data, though they do not recreate linkage disequilibrium patterns.

Admixed populations will have haplotype chunks from their ancestral groups, which gradually shrink over time because of recombination.

By exploiting this fact and matching shared haplotype chunks from individuals within a genetic dataset, researchers may trace and date the origins of population admixture and reconstruct historic events such as the rise and fall of empires, slave trades, colonialism, and population expansions.

[27] Also, genuine associations may be overlooked if the variant is less prevalent in the population from which the case subjects are drawn.

For this reason, it was common in the 1990s to use family-based data where the effect of population structure can easily be controlled for using methods such as the transmission disequilibrium test (TDT).
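The TDT statistic itself is simple to compute. The sketch below assumes the standard biallelic form of the test, in which heterozygous parents of affected offspring are counted according to whether they transmitted the candidate allele. Under the null hypothesis, the statistic approximately follows a chi-square distribution with one degree of freedom. The function name and the counts are illustrative.

```python
import math

def tdt(transmitted, untransmitted):
    """TDT statistic and its 1-df chi-square p-value.

    transmitted:   heterozygous parents who passed the candidate allele
                   to an affected child
    untransmitted: heterozygous parents who passed the other allele
    """
    b, c = transmitted, untransmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square with 1 df, via the error function.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

chi2, p_value = tdt(transmitted=60, untransmitted=40)
print(chi2, p_value)  # chi2 is 4.0; the p-value is about 0.046
```

Because each heterozygous parent serves as its own control, differences in allele frequency between subpopulations do not affect the comparison.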

[28] Phenotypes (measurable traits), such as height or risk for heart disease, are the product of some combination of genes and environment.

To construct a score, researchers first enroll participants in an association study to estimate the contribution of each genetic variant.

Then, they can use the estimated contributions of each genetic variant to calculate a score for the trait for an individual who was not in the original association study.
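In its simplest additive form, the score is a weighted sum of allele counts. The sketch below assumes that per-variant effect sizes have already been estimated and that genotypes are coded as counts of the effect allele. The variant identifiers and weights are invented for illustration.

```python
def polygenic_score(effect_sizes, dosages):
    """Weighted sum of effect-allele counts over variants present in both inputs."""
    return sum(beta * dosages[variant]
               for variant, beta in effect_sizes.items()
               if variant in dosages)

# Hypothetical effect sizes from an association study.
effect_sizes = {"variant_a": 0.20, "variant_b": -0.10, "variant_c": 0.05}
# Effect-allele counts (0, 1, or 2) for an individual outside that study.
new_individual = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

score = polygenic_score(effect_sizes, new_individual)
print(round(score, 3))  # 0.2*2 + (-0.1)*1 + 0.05*0 = 0.3
```

Variants missing from the individual's genotype data are skipped here. Practical pipelines handle missing data, variant selection, and linkage between variants more carefully.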

If structure in the study population is correlated with environmental variation, then the polygenic score is no longer measuring the genetic component alone.

[30] It is also possible to use unlinked genetic markers to estimate each individual's ancestry proportions from some K subpopulations, which are assumed to be unstructured.
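The idea can be sketched for the simplest case. The example assumes K = 2, unlinked biallelic markers, and subpopulation allele frequencies that are already known, which is a strong simplification since practical methods estimate frequencies and proportions jointly. Under these assumptions, an individual's ancestry proportion q is found by a grid search over the likelihood. All names and numbers are hypothetical.

```python
import math

def log_likelihood(q, genotypes, freqs_a, freqs_b):
    """Log-likelihood of 0/1/2 genotypes given ancestry proportion q from A."""
    total = 0.0
    for g, pa, pb in zip(genotypes, freqs_a, freqs_b):
        f = q * pa + (1.0 - q) * pb          # expected allele frequency
        f = min(max(f, 1e-9), 1.0 - 1e-9)    # guard against log(0)
        total += (math.log(math.comb(2, g))
                  + g * math.log(f) + (2 - g) * math.log(1.0 - f))
    return total

def estimate_ancestry(genotypes, freqs_a, freqs_b, steps=100):
    """Return the grid value of q in [0, 1] that maximizes the likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: log_likelihood(q, genotypes, freqs_a, freqs_b))

freqs_a = [0.9, 0.8, 0.9, 0.1]   # assumed allele frequencies in subpopulation A
freqs_b = [0.1, 0.2, 0.1, 0.9]   # assumed allele frequencies in subpopulation B
print(estimate_ancestry([2, 2, 2, 0], freqs_a, freqs_b))  # genotypes typical of A
print(estimate_ancestry([1, 1, 1, 1], freqs_a, freqs_b))  # heterozygous throughout
```

The first individual is assigned entirely to subpopulation A, and the second is estimated as an even mixture. The estimated proportions can then be included as covariates in an association test.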

[31] More recent approaches make use of principal component analysis (PCA), as demonstrated by Alkes Price and colleagues,[32] or by deriving a genetic relationship matrix (also called a kinship matrix) and including it in a linear mixed model (LMM).
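A genetic relationship matrix can be computed directly from standardized genotypes. The sketch below uses one widely used estimator, the average over SNPs of the product of standardized allele counts. It assumes 0/1/2 genotype coding with no missing data, and the function name and toy genotypes are illustrative. Fitting the mixed model itself is not shown.

```python
import numpy as np

def genetic_relationship_matrix(genotypes):
    """GRM from an (individuals x SNPs) matrix of 0/1/2 allele counts."""
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                 # per-SNP allele frequency
    keep = (p > 0.0) & (p < 1.0)             # monomorphic SNPs carry no information
    g, p = g[:, keep], p[keep]
    z = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))  # standardize each SNP
    return z @ z.T / z.shape[1]              # average over SNPs

# Toy genotypes (assumed): individuals 0 and 1 are identical,
# and individual 2 carries the opposite alleles at two SNPs.
geno = np.array([[0, 1, 2, 1],
                 [0, 1, 2, 1],
                 [2, 1, 0, 1],
                 [1, 2, 1, 0]])
grm = genetic_relationship_matrix(geno)
print(np.round(grm, 2))
```

Entries are larger for pairs of individuals who share more alleles than expected from the sample frequencies. In a linear mixed model, a matrix of this kind specifies the covariance of a random effect, so that phenotypic similarity attributable to shared ancestry is not credited to the variant being tested.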

[29] For many traits, the role of structure is complex and not fully understood, and incorporating it into genetic studies remains a challenge and is an active area of research.