SNV calling from NGS data

These are computational techniques, and are in contrast to special experimental methods based on known population-wide single nucleotide polymorphisms (see SNP genotyping).

[1] In addition to the usual application domain of SNP genotyping, these techniques have been successfully adapted to identify rare SNPs within a population,[2] as well as detecting somatic SNVs within an individual using multiple tissue samples.

Very often, the searched for variants occur with some (possibly rare) frequency, throughout the population, in which case they may be referred to as single nucleotide polymorphisms (SNPs).

Technically the term SNP only refers to these kinds of variations, however in practice they are often used synonymously with SNV in the literature on variant calling.

In an ideal error free world with high read coverage, the task of variant calling from the results of a NGS data alignment would be simple; at each locus (position on the genome) the number of occurrences of each distinct nucleotide among the reads aligned at that position can be counted, and the true genotype would be obvious; either AA if all nucleotides match allele A, BB if they match allele B, or AB if there is a mixture.

[4] The nucleotide counts used for base calling contain errors and bias, both due do the sequenced reads themselves, and the alignment process.

The formula is: In the above equation: Given the above framework, different software solutions for detecting SNVs vary based on how they calculate the prior probabilities

Instead of modelling the distribution of the observed data and using Bayesian statistics to calculate genotype probabilities, variant calls are made based on a variety of heuristic factors, such as minimum allele counts, read quality cut-offs, bounds on read depth, etc.

[1] Various methods exist for filtering data in variant calling experiments, in order to remove sources of error/bias.

For instance, strand bias can occur, where there is a highly unequal distribution of forward vs reverse directions in the reads aligned in some neighborhood.

[1] In addition to methods that align reads from individual sample(s) to a reference genome in order to detect germline genetic variants, reads from multiple tissue samples within a single individual can be aligned and compared in order to detect somatic variants.

Such investigations have resulted in diagnostic tools that have seen clinical application, and are used to improve scientific understanding of the disease, for instance by the discovery of new cancer-related genes, identification of involved gene regulatory networks and metabolic pathways, and by informing models of how tumors grow and evolve.

[11] Until recently, software tools for carrying out this form of analysis have been heavily underdeveloped, and were based on the same algorithms used to detect germline variations.

Such procedures are not optimized for this task, because they do not adequately model the statistical correlation between the genotypes present in multiple tissue samples from the same individual.

[3] More recent investigations have resulted in the development of software tools especially optimized for the detection of somatic mutations from multiple tissue samples.

Probabilistic techniques have been developed that pool allele counts from all tissue samples at each locus, and using statistical models for the likelihoods of joint-genotypes for all the tissues, and the distribution of allele counts given the genotype, are able to calculate relatively robust probabilities of somatic mutations at each locus using all available data.