DNA annotation

[5] Furthermore, due to the size and complexity of sequenced genomes, DNA annotation is not performed manually, but is instead automated by computational means.

[9][10] They appeared as a necessity to handle the enormous amount of data produced by the Maxam-Gilbert and Sanger DNA sequencing techniques developed in the late 1970s.

In fact, codon usage was the main strategy used by several early protein coding sequence (CDS) prediction methods,[12][13][14] based on the assumption that the most highly translated regions in a genome contain codons whose corresponding tRNAs (the molecules that carry amino acids to the ribosome during protein synthesis) are the most abundant, allowing more efficient translation.
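As an illustration of the quantity these early methods relied on, the following minimal sketch computes relative codon frequencies for a sequence that is assumed to be already in frame. The function name `codon_usage` and the example sequence are hypothetical and are not taken from any of the cited methods.

```python
from collections import Counter


def codon_usage(cds: str) -> dict[str, float]:
    """Return the relative frequency of each codon in an in-frame sequence."""
    cds = cds.upper()
    # Split into consecutive triplets, dropping any incomplete trailing codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical five-codon sequence: ATG GCT GCT GCC TAA
usage = codon_usage("ATGGCTGCTGCCTAA")
```

A predictor built on this idea would compare such frequencies in a candidate region against those observed in known, highly expressed genes of the same organism; that comparison step is not shown here.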

[2][9][10] In the late 2000s, genome annotation shifted its attention towards identifying non-coding regions in DNA, which was achieved thanks to the appearance of methods to analyze transcription factor binding sites, DNA methylation sites, chromatin structure, and other RNA and regulatory region analysis techniques.

[21][22] Structural annotation describes the precise location of the different elements in a genome, such as open reading frames (ORFs), coding sequences (CDS), exons, introns, repeats, splice sites, regulatory motifs, start and stop codons, and promoters.
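To make the idea of locating one such element concrete, the sketch below scans the three forward reading frames of a sequence for simple ORFs, defined here as an ATG followed in frame by a stop codon. It is a deliberately naive illustration under stated assumptions: it ignores the reverse strand, alternative start codons, and introns, and the name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) pairs for forward-strand ORFs.

    Coordinates are 0-based and end-exclusive; the end includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                # Count codons from the start up to, but not including, the stop.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


# Hypothetical sequence with one ORF in the third forward frame.
example = "CCATGAAATTTTAGGG"
orfs = find_orfs(example)
```

Real structural annotation pipelines combine many more signals than this, but the coordinate pairs returned here are the kind of location information a structural annotation records.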

[19] To solve this problem, proteogenomics-based approaches are employed, which use information from expressed proteins, often derived from mass spectrometry.

In fact, the primary task in genome annotation is gene prediction, which is why numerous methods have been developed for this purpose.

However, because there are numerous ways to define gene functions, the annotation process may be hindered when it is performed by different research groups.

[19] Probabilistic methods may be paired with a controlled vocabulary, such as GO; for example, protein-protein interaction (PPI) networks usually place proteins with similar functions close to each other.
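The observation about PPI networks can be sketched as a simple neighbor vote: an unannotated protein is assigned the term that is most frequent among its annotated interaction partners. This is only a toy illustration of the principle, not a probabilistic method from the cited literature. The function name, the protein identifiers, and the small network are hypothetical, and the GO identifiers are used purely as example labels.

```python
from collections import Counter


def predict_function(protein, interactions, known_functions):
    """Guess a term for `protein` by majority vote among annotated neighbors."""
    neighbors = set()
    for a, b in interactions:
        # Interactions are treated as undirected pairs.
        if a == protein:
            neighbors.add(b)
        elif b == protein:
            neighbors.add(a)
    votes = Counter(
        term
        for neighbor in neighbors
        for term in known_functions.get(neighbor, ())
    )
    if not votes:
        return None  # no annotated neighbors, so no prediction
    return votes.most_common(1)[0][0]


# Hypothetical network: P1 is unannotated and interacts with P2, P3 and P4.
interactions = [("P1", "P2"), ("P1", "P3"), ("P4", "P1"), ("P2", "P3")]
known_functions = {
    "P2": ["GO:0006412"],
    "P3": ["GO:0006412"],
    "P4": ["GO:0006260"],
}
prediction = predict_function("P1", interactions, known_functions)
```

Ties are not resolved in any principled way in this sketch; a probabilistic method would instead weight the evidence and report a confidence for each candidate term.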

The support vector machine (SVM) is the most widely used binary classifier in functional annotation; however, other algorithms, such as k-nearest neighbors (kNN) and convolutional neural networks (CNNs), have also been employed.

[40] Binary or multiclass classification methods for functional annotation generally produce less accurate results because they do not take into account the interrelations between GO terms.

More advanced methods that consider these interrelations take either a flat or a hierarchical approach: the former does not take the ontology structure into account, while the latter does.
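One simple way to see what using the ontology structure can mean is to expand a set of predicted terms with all of their ancestors, so that a prediction of a specific term also implies its more general parents. The sketch below assumes this propagation step as an illustration only; it is not the procedure of any particular cited method, and the three-term hierarchy is a simplified, hypothetical stand-in for the real GO graph.

```python
def propagate_to_ancestors(terms, parents):
    """Expand a set of ontology terms with all of their ancestor terms."""
    expanded = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in expanded:
            continue  # already visited via another path in the graph
        expanded.add(term)
        stack.extend(parents.get(term, ()))
    return expanded


# Hypothetical, simplified hierarchy: child term -> list of parent terms.
toy_parents = {
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
}
consistent = propagate_to_ancestors({"kinase activity"}, toy_parents)
```

A flat method would treat the three terms as unrelated labels, whereas the propagation above keeps a set of predictions consistent with the parent-child relations.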

[28] Pseudogenes are mutated copies of protein-coding genes that lost their coding function due to a disruption in their open reading frame (ORF), making them untranslatable.
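One kind of ORF disruption, a premature in-frame stop codon, can be checked with a few lines of code. The sketch below assumes the input is an in-frame candidate coding sequence whose last codon is the normal stop; the function name `first_premature_stop` is hypothetical. It does not detect frameshifts or other disabling mutations, so it is not a pseudogene finder on its own.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds):
    """Return the 0-based codon index of the first internal stop, or None."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    # The final codon is expected to be the normal stop, so it is skipped.
    for k in range(n_codons - 1):
        if cds[3 * k:3 * k + 3] in STOP_CODONS:
            return k
    return None


# Hypothetical sequences: the first has an internal TAG at codon index 2.
disrupted = first_premature_stop("ATGAAATAGGGGTAA")
intact = first_premature_stop("ATGAAAGGGTAA")
```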

Ab initio prediction of RNA genes in a single genome often yields inaccurate results (miRNA being an exception), so multi-genome comparative methods are used instead.

Homology search may also be employed to identify RNA genes, but this procedure is complicated, especially in eukaryotes, by the presence of a large number of repeats and pseudogenes.

[50] Visualization of annotations in a genome browser requires a descriptive output file, which should describe the intron-exon structures of each annotation, their start and stop codons, UTRs and alternative transcripts, and ideally should include information about the sequence alignments and gene predictions that support each gene model.

[24] Some of these formats use controlled vocabularies and ontologies to define their descriptive terminologies and guarantee interoperability between analysis and visualization tools.
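As an example of what such a descriptive file looks like to a program, the sketch below parses one feature line in the style of GFF3, a widely used annotation format with nine tab-separated columns whose feature types are drawn from the Sequence Ontology. Choosing GFF3 here is an assumption made for illustration; the function name and the example record are hypothetical, and percent-encoded characters and comment lines are not handled.

```python
def parse_gff3_line(line):
    """Parse one GFF3-style feature line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    # The last column holds semicolon-separated key=value pairs.
    attributes = dict(
        field.split("=", 1) for field in attrs.split(";") if "=" in field
    )
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


# Hypothetical record describing a gene on the forward strand.
record = parse_gff3_line(
    "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=demo"
)
```

Because the feature type in the third column comes from a shared vocabulary, a browser or analysis tool can interpret "gene" the same way regardless of which pipeline wrote the file.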

The latter are not necessarily linked to a specific genome database but are general-purpose browsers that can be downloaded and installed as an application on a local computer.

This process, known as reannotation, can provide users with new information about the genome, including details about genes and protein functions.

On the other hand, when anyone can enter a project and coordination is accomplished in a decentralized manner, it is called unsupervised community annotation.

[67] A great diversity of catabolic enzymes involved in hydrocarbon degradation by some bacterial strains are encoded by genes located in their mobile genetic elements (MGEs).

The study of these elements is of great importance in the field of bioremediation, since the inoculation of wild or genetically modified strains with these MGEs has recently been pursued as a way to acquire these hydrocarbon degradation capacities.

[68] In 2013, Phale et al.[69] published the genome annotation of a strain of Pseudomonas putida (CSV86), a bacterium known for its preference for naphthalene and other aromatic compounds over glucose as a carbon and energy source.

This was the approach of the investigation and identification of Halomonas zincidurans strain B6(T), a bacterium with thirty-one genes encoding resistance to heavy metals, especially zinc[71] and Stenotrophomonas sp.

A variety of software tools, such as MAKER, have been developed that allow scientists to view and share genome annotations.

Genome annotation is an active area of investigation involving a number of different organizations in the life science community, which publish the results of their efforts in publicly available biological databases accessible via the web and other electronic means.