Conserved sequence

The study of sequence conservation overlaps with the fields of genomics, proteomics, evolutionary biology, phylogenetics, bioinformatics and mathematics.

[3][4] Studies in the 1960s used DNA hybridization and protein cross-reactivity techniques to measure similarity between known orthologous proteins, such as hemoglobin[5] and cytochrome c.[6] In 1965, Émile Zuckerkandl and Linus Pauling introduced the concept of the molecular clock,[7] proposing that steady rates of amino acid replacement could be used to estimate the time since two organisms diverged.
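The molecular clock idea can be illustrated with a short calculation. The sketch below assumes a strict clock (one constant replacement rate on both lineages) and a simple Poisson correction for multiple replacements at the same site. The function names and the example rate are illustrative assumptions, not values from the cited studies.

```python
import math


def poisson_distance(fraction_different: float) -> float:
    """Convert the observed fraction of differing sites into an estimated
    number of replacements per site, correcting for multiple hits."""
    if not 0.0 <= fraction_different < 1.0:
        raise ValueError("fraction_different must be in [0, 1)")
    return -math.log(1.0 - fraction_different)


def divergence_time(replacements_per_site: float, rate_per_site_per_year: float) -> float:
    """Estimate years since two lineages diverged under a strict clock.

    The factor of 2 reflects that changes accumulate independently
    along both lineages after the split.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate_per_site_per_year must be positive")
    return replacements_per_site / (2.0 * rate_per_site_per_year)


# Hypothetical inputs: 2% corrected distance, 1e-9 replacements per site per year.
years = divergence_time(0.02, 1e-9)
```

Real analyses typically relax the constant-rate assumption, since rates differ between genes and lineages.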

While initial phylogenies closely matched the fossil record, observations that some genes appeared to evolve at different rates led to the development of theories of molecular evolution.

[8] Over many generations, nucleic acid sequences in the genome of an evolutionary lineage can gradually change due to random mutations and deletions.

[12][13] The extent to which a sequence is conserved can be affected by varying selection pressures, its robustness to mutation, population size and genetic drift.

[16] Within a sequence, amino acids that are important for folding, structural stability, or that form a binding site may be more highly conserved.
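One common way to quantify how conserved each position is in a set of aligned protein sequences is a per-column score based on Shannon entropy. The following is a minimal sketch under simple assumptions: the sequences are already aligned to equal length, gaps are written as "-" and ignored, and the score is normalized by the maximum entropy of a 20-letter amino acid alphabet. The function names are hypothetical.

```python
import math
from collections import Counter


def column_conservation(column: str) -> float:
    """Return a score in [0, 1]; 1.0 means every non-gap residue is identical."""
    residues = [c for c in column.upper() if c != "-"]
    if not residues:
        return 0.0  # all-gap column: treated as unconserved by assumption
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 1.0 - entropy / math.log2(20)


def alignment_conservation(sequences: list[str]) -> list[float]:
    """Score every column of an alignment given as equal-length strings."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return [column_conservation("".join(col)) for col in zip(*sequences)]


# Toy alignment: the first and last columns are invariant, the middle one varies.
scores = alignment_conservation(["MKV", "MRV", "MKV"])
```

Entropy-based scores treat all replacements as equally disruptive; measures that weight substitutions by physicochemical similarity are a frequent alternative.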

[19][20] Non-coding sequences important for gene regulation, such as the binding or recognition sites of ribosomes and transcription factors, may be conserved within a genome.

Currently the accuracy and scalability of whole-genome alignment (WGA) tools remain limited due to the computational complexity of dealing with rearrangements, repeat regions and the large size of many eukaryotic genomes.

The GERP (Genomic Evolutionary Rate Profiling) framework scores conservation of genetic sequences across species.
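GERP expresses constraint at a site as "rejected substitutions": the number of substitutions expected under neutral evolution minus the number actually observed. The sketch below shows only that final subtraction. It assumes the expected and observed counts have already been estimated from an alignment and a phylogenetic tree, and the function name and input values are hypothetical.

```python
def rejected_substitutions(expected_neutral: float, observed: float) -> float:
    """Rejected-substitution score for one alignment column.

    Positive values suggest fewer substitutions than neutral evolution
    predicts (constraint); negative values suggest more than expected.
    """
    if expected_neutral < 0 or observed < 0:
        raise ValueError("substitution counts must be non-negative")
    return expected_neutral - observed


# Hypothetical per-site estimates as (expected under neutrality, observed).
sites = [(3.0, 0.5), (3.0, 2.75), (3.0, 4.0)]
rs_scores = [rejected_substitutions(e, o) for e, o in sites]
```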

Other approaches such as PhyloP and PhyloHMM incorporate statistical phylogenetics methods to compare probability distributions of substitution rates, which allows the detection of both conservation and accelerated mutation.

First, a background probability distribution is generated for the number of substitutions expected to occur in a column of a multiple sequence alignment, based on a phylogenetic tree.
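The idea of comparing an observed substitution count against a background distribution can be sketched as follows. This is a deliberate simplification: it assumes the background is a Poisson distribution whose mean is the expected number of substitutions on the tree, whereas tools such as PhyloP derive the distribution from a full phylogenetic substitution model. The function names are hypothetical.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k substitutions under a Poisson background."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def substitution_p_values(observed: int, expected: float) -> tuple[float, float]:
    """Return (p_conserved, p_accelerated) for one alignment column.

    p_conserved   = P(X <= observed): small when the site has changed
                    less often than the background predicts.
    p_accelerated = P(X >= observed): small when the site has changed
                    more often than the background predicts.
    """
    if observed < 0 or expected <= 0:
        raise ValueError("observed must be >= 0 and expected must be > 0")
    p_conserved = sum(poisson_pmf(k, expected) for k in range(observed + 1))
    p_accelerated = 1.0 - sum(poisson_pmf(k, expected) for k in range(observed))
    return p_conserved, p_accelerated


# Hypothetical column: 3 substitutions expected on the tree, none observed.
p_cons, p_acc = substitution_p_values(0, 3.0)
```

Testing both tails is what allows a single framework to flag conserved sites as well as sites evolving faster than the background.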

[44] While the origin and function of UCEs are poorly understood,[45] they have been used to investigate deep-time divergences in amniotes,[46] insects,[47] and between animals and plants.

These consist mainly of the ncRNAs and proteins required for transcription and translation, which are assumed to have been conserved from the last universal common ancestor of all life.

For example, the most highly conserved genes, such as the 16S rRNA gene and other ribosomal sequences, are useful for reconstructing deep phylogenetic relationships and identifying bacterial phyla in metagenomics studies.

[58][59][60][61] As highly conserved sequences often have important biological functions, they can be a useful starting point for identifying the cause of genetic diseases.

Genetic diseases may be predicted by identifying sequences that are conserved between humans and lab organisms such as mice[62] or fruit flies,[63] and studying the effects of knock-outs of these genes.