DNA sequencing theory

The permanent archive of work is primarily mathematical, although numerical calculations are often conducted for particular problems too.

DNA sequencing theory addresses the physical processes of sequencing DNA and should not be confused with theories of analyzing the resultant sequences, such as sequence alignment; publications[1] sometimes do not make a careful distinction, but the latter are primarily concerned with algorithmic issues.

Sequencing theory is based on elements of mathematics, biology, and systems engineering, so it is highly interdisciplinary.

The target is considered "sequenced" when adequate coverage accumulates (e.g., when no gaps remain).

Exact results from the pure mathematics of covering processes often cannot be readily evaluated for realistic problem sizes; that is, they involve inordinately large amounts of computer time for parameters characteristic of DNA sequencing.

Model the process as covering a target of length $G$ with $N$ fragments, each of length $L$, so that a single fragment covers any given position with probability $L/G$. The probability of covering a given location on the target with at least one fragment is therefore

$1 - \left(1 - L/G\right)^N.$

This equation was first used to characterize plasmid libraries,[5] but it may appear in a modified form. For most projects $N \gg 1$, so to a good approximation $\left(1 - L/G\right)^N \approx e^{-NL/G} = e^{-R}$, where $R = NL/G$ is called the redundancy.

Note the significance of redundancy as representing the average number of times a position is covered with fragments.

Note also that in considering the covering process over all positions in the target, this probability is identical to the expected value of the random variable $C$, the fraction of target coverage.

The final result,

$E\langle C \rangle = 1 - e^{-R},$

remains in widespread use as a "back of the envelope" estimator and predicts that coverage for all projects evolves along a universal curve that is a function only of the redundancy $R$.
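
For quick project planning this estimator can be evaluated directly. A minimal sketch in Python (the function names and the example figures are illustrative assumptions, not values from any cited source):

    import math

    def redundancy(n_fragments: int, fragment_len: float, target_len: float) -> float:
        """Redundancy R = N*L/G: the average number of fragments covering a position."""
        return n_fragments * fragment_len / target_len

    def expected_coverage(r: float) -> float:
        """Expected covered fraction, E<C> = 1 - exp(-R), under the idealized
        model of uniform, independent fragment placement."""
        return 1.0 - math.exp(-r)

    # Illustrative numbers only: 3 Gb target, 500 bp reads, 30 million reads.
    R = redundancy(30_000_000, 500, 3_000_000_000)   # R = 5.0
    print(f"R = {R:.1f}, expected coverage = {expected_coverage(R):.4f}")
    # At R = 5 the model predicts about 99.3% of positions covered at least once.

Because the predicted curve depends only on $R$, the same calculation applies unchanged to any project with that redundancy, which is the "universal curve" property noted above.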

In 1988, Eric Lander and Michael Waterman published an important paper[6] examining the covering problem from the standpoint of gaps.

They furnished a number of useful results that were adopted as the standard theory from the earliest days of "large-scale" genome sequencing.[7]

Their model was also used in designing the Human Genome Project and continues to play an important role in DNA sequencing.

Michael Wendl and Bob Waterston[9] confirmed, based on Stevens' method,[4] that the Lander–Waterman model and Roach's subsequent refinement of it produced similar results when the number of contigs was substantial, such as in low-coverage mapping or sequencing projects.
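
To give a sense of the quantities such comparisons involve, the sketch below evaluates the widely quoted Lander–Waterman estimate of the expected number of apparent islands (contigs, counting singletons), $N e^{-c\sigma}$, where $c = NL/G$ is the redundancy and $\sigma = 1 - \theta$, with $\theta$ the fraction of a fragment's length that must overlap for the overlap to be detected. This illustrates the basic Lander–Waterman formula only, not Roach's refinement or Stevens' exact method, and the Python names and numbers are assumptions:

    import math

    def lw_expected_islands(n: int, read_len: float, target_len: float,
                            min_overlap_frac: float) -> float:
        """Lander-Waterman estimate of the expected number of apparent islands
        (contigs, counting singletons): N * exp(-c * sigma), where c = N*L/G
        and sigma = 1 - theta, theta being the fractional overlap needed to
        detect that two fragments overlap."""
        c = n * read_len / target_len
        sigma = 1.0 - min_overlap_frac
        return n * math.exp(-c * sigma)

    # Illustrative numbers only: 100 kb target, 1,000 bp reads, theta = 0.2.
    for n in (50, 125, 300, 600):
        print(n, round(lw_expected_islands(n, 1_000, 100_000, 0.2), 1))
    # The expected island count rises with N at low redundancy, peaks, and
    # then falls as islands merge into fewer, larger contigs.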

The basic ideas of Lander–Waterman theory led to a number of additional results for particular variations in mapping techniques.

In 1995, Roach et al.[14] proposed and demonstrated through simulations a generalization of a set of strategies explored earlier by Edwards and Caskey.

The physical processes and protocols of DNA sequencing have continued to evolve, largely driven by advancements in biochemical methods, instrumentation, and automation.

Biologists have developed methods to filter highly repetitive, essentially unsequenceable regions of genomes.

Wendl and Barbazuk[16] proposed an extension to Lander–Waterman theory to account for "gaps" in the target due to filtering and the so-called "edge effect".
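
The "edge effect" mentioned here is a position-specific sampling bias: on a linear target, a base near either end can be reached by far fewer fragment placements than an interior base, so its chance of being covered is correspondingly lower. The sketch below illustrates this under a deliberately simple placement model (fragments of fixed length falling entirely within the target, start positions uniform); it is an idealization for intuition, not the cited extension, and the Python names and numbers are assumptions:

    def coverage_prob(pos: int, n: int, frag_len: int, target_len: int) -> float:
        """P(position 'pos' is covered at least once) on a linear target when
        n fragments of length frag_len are placed uniformly among the
        target_len - frag_len + 1 start positions that keep a fragment
        entirely inside the target; 'pos' is 1-based."""
        starts_total = target_len - frag_len + 1
        # Start positions whose fragment would cover 'pos':
        lo = max(1, pos - frag_len + 1)
        hi = min(pos, starts_total)
        starts_covering = max(0, hi - lo + 1)
        return 1.0 - (1.0 - starts_covering / starts_total) ** n

    # Illustrative numbers only: 100 kb target, 500 bp fragments, N = 1,000 (R ~ 5).
    G, L, N = 100_000, 500, 1_000
    for p in (1, 100, 250, 500, 50_000):
        print(p, round(coverage_prob(p, N, L, G), 4))
    # Interior positions approach the bulk value 1 - (1 - L/G)^N (~0.993 here);
    # the first base is covered far less often because only one start position
    # can reach it.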

Read-pairing and fragment size evidently have negligible influence for large, whole-genome class targets.

In medical applications such as cancer sequencing, the ability to detect heterozygous mutations is important, and this can only be done if the sequence of the diploid genome is obtained.

Calculations show that around 50-fold redundancy is needed to avoid false-positive errors at a 1% threshold.
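
The extra redundancy demanded by diploid targets, relative to the haploid estimator earlier in the article, can be illustrated with a back-of-the-envelope Poisson calculation; this is an assumption-laden sketch rather than any of the published diploid-coverage models. It assumes the total read depth at a heterozygous site is Poisson(R) and that each read derives from either haplotype with probability 1/2, so the per-allele depths are independent Poisson(R/2):

    import math

    def poisson_tail(k: int, mean: float) -> float:
        """P(X >= k) for X ~ Poisson(mean)."""
        return 1.0 - sum(math.exp(-mean) * mean**j / math.factorial(j)
                         for j in range(k))

    def both_alleles_covered(redundancy: float, min_reads: int) -> float:
        """Probability that BOTH alleles of a heterozygous diploid site are each
        seen in at least min_reads reads, assuming total depth ~ Poisson(R) and
        each read drawn from either haplotype with probability 1/2, i.e.
        per-allele depths are independent Poisson(R/2)."""
        per_allele = poisson_tail(min_reads, redundancy / 2.0)
        return per_allele ** 2

    # Illustrative: requiring 3 reads per allele at various redundancies.
    for R in (10, 20, 30, 50):
        print(R, round(both_alleles_covered(R, 3), 4))
    # Adequate diploid coverage requires substantially more redundancy than
    # the haploid estimator alone would suggest.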

For example, to observe a rare allele at least twice (to rule out the possibility that it is unique to a single individual), a little less than 4-fold redundancy should be used, regardless of the sample size.
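
One ingredient of that kind of calculation can be sketched as follows; this is a hedged Poisson illustration, not the cited optimization, which also weighs per-sample redundancy against the number of samples sequenced. It assumes a heterozygous carrier sequenced to redundancy R contributes Poisson(R/2) variant-supporting reads at the site, so the probability of observing the allele at least twice is $1 - e^{-R/2}(1 + R/2)$:

    import math

    def rare_allele_seen_twice(redundancy: float) -> float:
        """Probability that a heterozygous rare allele carried by one sampled
        individual is supported by at least two reads, assuming the carrier's
        depth at the site is Poisson(R) and each read shows the variant
        haplotype with probability 1/2 (variant reads ~ Poisson(R/2))."""
        m = redundancy / 2.0
        return 1.0 - math.exp(-m) * (1.0 + m)

    for R in (2, 4, 6, 8):
        print(R, round(rare_allele_seen_twice(R), 3))
    # This is only one ingredient of the design problem, which also involves
    # how many individuals are sampled for a fixed total sequencing budget.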

For example, Stanhope[25] developed a probabilistic model for the amount of sequence needed to obtain at least one contig of a given size from each novel organism of the community, while Wendl et al. reported analysis of the average contig size and of the probability of completely recovering a novel organism of a given rareness within the community.[26]

Conversely, Hooper et al. proposed a semi-empirical model based on the gamma distribution.[27]

DNA sequencing theories often invoke the assumption that certain random variables in a model are independent and identically distributed.

In general, theory will agree well with observation up to the point that enough data have been generated to expose latent biases.
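
The practical content of the independence assumption is easy to see in a small simulation: when fragment placement really is uniform and independent, observed coverage tracks the $1 - e^{-R}$ prediction closely, and systematic departures in real data point to the latent biases just mentioned. A minimal sketch (Python; the circular target and all parameter values are illustrative assumptions):

    import math
    import random

    def simulate_coverage(n: int, frag_len: int, target_len: int, seed: int = 0) -> float:
        """Fraction of a circular target covered when n fragments of length
        frag_len are dropped at independent, uniformly random start positions
        (the i.i.d. assumption in its simplest form)."""
        rng = random.Random(seed)
        covered = bytearray(target_len)
        for _ in range(n):
            start = rng.randrange(target_len)
            for offset in range(frag_len):
                covered[(start + offset) % target_len] = 1
        return sum(covered) / target_len

    G, L = 100_000, 500
    for n in (200, 600, 1_000):
        R = n * L / G
        print(f"R = {R:.1f}  simulated = {simulate_coverage(n, L, G):.4f}"
              f"  predicted = {1 - math.exp(-R):.4f}")
    # Agreement is close because the simulation enforces the same uniformity
    # and independence the theory assumes; biased placement (e.g. cloning or
    # composition effects) would break it.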