DNA annotation

[5] Furthermore, due to the size and complexity of sequenced genomes, DNA annotation is not performed manually, but is instead automated by computational means.

[9][10] They appeared as a necessity to handle the enormous amount of data produced by the Maxam-Gilbert and Sanger DNA sequencing techniques developed in the late 1970s.

In fact, codon usage was the main strategy used by several early protein coding sequence (CDS) prediction methods,[12][13][14] based on the assumption that the most translated regions in a genome contain codons with the most abundant corresponding tRNAs (the molecules responsible for carrying amino acids to the ribosome during protein synthesis) allowing a more efficient translation.
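The counting step behind codon usage statistics can be sketched as follows. This is a minimal illustration, not the method of any cited tool: the function names are hypothetical, the reading frame is assumed to start at the first base, and a real predictor would go on to compare these frequencies against a reference table built from highly expressed genes.

```python
from collections import Counter


def codon_counts(cds: str) -> Counter:
    """Count in-frame codons, assuming the reading frame starts at index 0."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def codon_frequencies(cds: str) -> dict:
    """Convert codon counts into relative frequencies (empty dict if no codons)."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


print(codon_frequencies("ATGGCTGCTTAA"))
# {'ATG': 0.25, 'GCT': 0.5, 'TAA': 0.25}
```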

[2][9][10] In the late 2000s, genome annotation shifted its attention towards identifying non-coding regions in DNA, which was achieved thanks to the appearance of methods to analyze transcription factor binding sites, DNA methylation sites, chromatin structure, and other RNA and regulatory region analysis techniques.

[21][22] Structural annotation describes the precise location of the different elements in a genome, such as open reading frames (ORFs), coding sequences (CDS), exons, introns, repeats, splice sites, regulatory motifs, start and stop codons, and promoters.
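As an illustration of locating one such element, the sketch below scans the three forward reading frames of a sequence for open reading frames. It is deliberately simplified: it assumes ATG as the only start codon and the three standard stop codons, ignores the reverse strand, and uses hypothetical names (`find_orfs`, `min_codons`) that do not come from any particular annotation tool.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2) -> list:
    """Return (frame, start, end) tuples for forward-strand ORFs.

    start is the 0-based index of the ATG; end is exclusive and includes
    the stop codon. min_codons counts the codons before the stop codon,
    start codon included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None  # close the ORF and look for the next ATG
    return orfs


print(find_orfs("AAATGGCCTAAGG"))
# [(2, 2, 11)]
```

In the example, the only ORF lies in the third frame and spans `ATGGCCTAA`.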

[19] To solve this problem, proteogenomics-based approaches are employed, which utilize information from expressed proteins, often derived from mass spectrometry.

In fact, the primary task in genome annotation is gene prediction, which is why numerous methods have been developed for this purpose.

However, because there are numerous ways to define gene functions, the annotation process may be hindered when it is performed by different research groups.

[19] Probabilistic methods may be paired with a controlled vocabulary, such as GO; for example, protein-protein interaction (PPI) networks usually place proteins with similar functions close to each other.
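The intuition that interacting proteins tend to share functions can be illustrated with a simple neighbor-voting sketch. This is an assumption-laden toy rather than a probabilistic model: the function name, the edge-list representation, and the protein identifiers are hypothetical, each protein is given a single term, and ties are broken alphabetically purely to keep the result deterministic.

```python
from collections import Counter


def predict_function(protein, interactions, known_functions):
    """Assign the most common GO term among a protein's annotated neighbors.

    interactions is a list of undirected (protein_a, protein_b) pairs;
    known_functions maps annotated proteins to one GO term each.
    Returns None when no neighbor is annotated.
    """
    neighbors = {b if a == protein else a
                 for a, b in interactions if protein in (a, b)}
    neighbors.discard(protein)  # ignore self-interactions
    votes = Counter(known_functions[n] for n in neighbors if n in known_functions)
    if not votes:
        return None
    # Highest vote count wins; ties fall back to alphabetical order.
    return min(votes, key=lambda term: (-votes[term], term))


ppi = [("P1", "P2"), ("P1", "P3"), ("P4", "P1"), ("P2", "P3")]
known = {"P2": "GO:0005524", "P3": "GO:0005524", "P4": "GO:0003677"}
print(predict_function("P1", ppi, known))
# GO:0005524
```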

The support vector machine (SVM) is the most widely used binary classifier in functional annotation; however, other algorithms, such as k-nearest neighbors (kNN) and convolutional neural networks (CNNs), have also been employed.

[40] Binary or multiclass classification methods for functional annotation generally produce less accurate results because they do not take into account the interrelations between GO terms.

More advanced methods that consider these interrelations do so by either a flat or hierarchical approach, which are distinguished by the fact that the former does not take into account the ontology structure, while the latter does.
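One way a hierarchical approach can use the ontology structure is to make predictions consistent with it: if a protein is assigned a term, it is also assigned every ancestor of that term. The sketch below shows this propagation on a toy directed acyclic graph. The term labels (`A` to `D`) and function names are hypothetical, and actual hierarchical classifiers differ in how they use the structure.

```python
def ancestors(term, parents):
    """Collect every ancestor of a term in a DAG given as child -> list of parents."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(terms, parents):
    """Close a set of predicted terms under the ancestor relation."""
    closed = set(terms)
    for term in terms:
        closed |= ancestors(term, parents)
    return closed


# Toy ontology: D has two parents (B and C), which share the root A.
toy_parents = {"D": ["B", "C"], "B": ["A"], "C": ["A"]}
print(sorted(propagate({"D"}, toy_parents)))
# ['A', 'B', 'C', 'D']
```

A flat approach, by contrast, would treat `A` to `D` as unrelated labels and could predict `D` without `B` or `A`.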

[28] Pseudogenes are mutated copies of protein-coding genes that lost their coding function due to a disruption in their open reading frame (ORF), making them untranslatable.
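Two simple signs of a disrupted reading frame, a length that is not a multiple of three and an in-frame stop codon before the end, can be checked as in the sketch below. This is only a crude illustration with hypothetical function names: it assumes the sequence is already in frame from its first base, and practical pseudogene detection typically also relies on comparison with the functional parent gene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds: str):
    """Return the 0-based codon index of the first in-frame stop codon
    that occurs before the final codon, or None if there is none."""
    seq = cds.upper()
    n_codons = len(seq) // 3
    for k in range(n_codons - 1):  # the final codon is allowed to be a stop
        if seq[3 * k:3 * k + 3] in STOP_CODONS:
            return k
    return None


def looks_disrupted(cds: str) -> bool:
    """Flag a frameshift-compatible length or a premature stop codon."""
    return len(cds) % 3 != 0 or first_premature_stop(cds) is not None


print(looks_disrupted("ATGGCTTAA"))     # False
print(looks_disrupted("ATGTAAGCTTAA"))  # True
```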

Ab initio prediction of RNA genes in a single genome often yields inaccurate results (with an exception being miRNA), so multi-genome comparative methods are used instead.

Homology search may also be employed to identify RNA genes, but this procedure is complicated, especially in eukaryotes, due to the presence of a large number of repeats and pseudogenes.

[50] Visualization of annotations in a genome browser requires a descriptive output file, which should describe the intron-exon structures of each annotation, their start and stop codons, UTRs and alternative transcripts, and ideally should include information about the sequence alignments and gene predictions that support each gene model.
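As an illustration of what such a descriptive file looks like to software, the sketch below parses one feature line of GFF3, a widely used tab-separated annotation format with nine columns. Choosing GFF3 is an assumption made for this example, since the text above does not name a format. The `Feature` class and `parse_gff3_line` function are hypothetical, and details such as percent-encoded attribute values are not handled.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Feature:
    seqid: str
    source: str
    feature_type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    strand: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Optional[Feature]:
    """Parse one GFF3 feature line; return None for blank or comment lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, _score, strand, _phase, attrs = cols
    # Column 9 holds semicolon-separated key=value pairs.
    attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if "=" in pair)
    return Feature(seqid, source, ftype, int(start), int(end), strand, attributes)


example = "chr1\texample\tCDS\t10\t90\t.\t+\t0\tID=cds1;product=hypothetical protein"
feature = parse_gff3_line(example)
print(feature.feature_type, feature.start, feature.end, feature.attributes["ID"])
# CDS 10 90 cds1
```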

[24] Some of these formats use controlled vocabularies and ontologies to define their descriptive terminologies and guarantee interoperability between analysis and visualization tools.

The latter are not necessarily linked to a specific genome database but are general-purpose browsers that can be downloaded and installed as an application on a local computer.

This process, known as reannotation, can provide users with new information about the genome, including details about genes and protein functions.

On the other hand, when anyone can enter a project and coordination is accomplished in a decentralized manner, it is called unsupervised community annotation.

[67] A great diversity of catabolic enzymes involved in hydrocarbon degradation by some bacterial strains are encoded by genes located in their mobile genetic elements (MGEs).

The study of these elements is of great importance in the field of bioremediation, since researchers have recently sought to inoculate wild or genetically modified strains with these MGEs so that the strains acquire these hydrocarbon degradation capacities.

[68] In 2013, Phale et al.[69] published the genome annotation of a strain of Pseudomonas putida (CSV86), a bacterium known for its preference for naphthalene and other aromatic compounds over glucose as a carbon and energy source.

This was the approach of the investigation and identification of Halomonas zincidurans strain B6(T), a bacterium with thirty-one genes encoding resistance to heavy metals, especially zinc[71] and Stenotrophomonas sp.

A variety of software tools, such as MAKER, have been developed that allow scientists to view and share genome annotations.

Genome annotation is an active area of investigation and involves a number of different organizations in the life science community which publish the results of their efforts in publicly available biological databases accessible via the web and other electronic means.

A visualization of Porphyra umbilicalis chloroplast genome annotation (GenBank accession: MF385003.1) made with Chloroplot.[1] The number of genes, the genome length, and the GC content are placed in the middle black circle. The outer gray circle shows the GC content in every section of the genome. All individual genes are placed on the outermost circle according to their position in the genome, their transcription direction and their length; they are color-coded based on the cellular function or component they are part of. Arrows indicate the transcription direction: clockwise for the inner genes and anticlockwise for the outer genes.
A release timeline of genome annotators. The dotted boxes indicate the four different generations of genome annotators and their most representative characteristics: the first generation (blue), in which annotators used ab initio methods at a local scale; the second generation (red), with genome-wide ab initio methods; the third generation (green), characterized by a combination of ab initio methods and homology-based annotations; and the fourth generation (orange), in which the identification of non-coding regions of DNA and study at the population level, represented by the pangenome, began.
Generalized flowchart of a structural genome annotation pipeline. First, the repetitive regions of an assembled genome are masked by using a repeat library. Then, optionally, the masked sequence is aligned with all the available evidence (ESTs, RNAs, and proteins) of the organism being annotated. In eukaryotic genomes, splice sites must be identified. Finally, the coding and noncoding sequences contained in the genome are predicted with the help of databases of known DNA, RNA and protein sequences, as well as other supporting information.
An example Gene Ontology (GO) ancestor chart organized as a directed acyclic graph taken from QuickGO.[39] It shows the molecular functions, biological processes, and cellular components in which the matrilin complex, a component of the extracellular matrix, is involved. Every box is an ontology term that falls into one of the three GO categories and is color-coded accordingly. Ontology terms are related to each other through specific qualifiers (such as "is a", "part of", etc.), which are represented by different kinds of arrows.
A snapshot of an annotated GBK file created with Prokka.[51] It shows the components (features) of a small portion of Candidatus Carsonella ruddii's genome, including their positions (structural annotation) and inferred functions (functional annotation).
A linear comparative genome visualization of several type species of phylogenetically related viral families and genera. Functional annotations of proteins are displayed in distinct colors and homologies in different tones.