Sequence analysis

It can be performed on the entire genome, transcriptome or proteome of an organism, and can also involve only selected segments or regions, like tandem repeats and transposable elements.

[7] In 1970, Saul B. Needleman and Christian D. Wunsch published the first computer algorithm for aligning two sequences.
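The idea behind this kind of pairwise global alignment can be illustrated with a short dynamic-programming sketch. It uses the linear gap penalty formulation commonly taught today rather than the exact scoring scheme of the 1970 paper, and the function name `needleman_wunsch` and the scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    The scoring values are illustrative assumptions, not canonical parameters.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

With these assumed scores, aligning `"ACGT"` against `"AGT"` places a single gap opposite the `C`. When several alignments share the best score, the traceback shown here returns only one of them.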

[8] During this period, methods for obtaining nucleotide sequences improved greatly, leading to the publication of the first complete genome of a bacteriophage in 1977.

[9] Robert Holley and his team at Cornell University are believed to have been the first to sequence an RNA molecule.

It is the first step in sequence analysis and limits wrong conclusions drawn from poor-quality data.

The following analysis steps are specific to DNA sequences: Identifying variants is a popular aspect of sequence analysis because variants often contain information of biological significance, such as explaining the mechanism of drug resistance in an infectious disease.

The read alignments are sorted using SAMtools, after which variant callers such as GATK[20] are used to identify differences compared to the reference sequence.

The output of variant calling is typically in VCF format, and can be filtered by allele frequency, quality score, or other factors depending on the research question at hand.
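As a rough illustration of such filtering, the sketch below keeps only records whose QUAL column and `AF` INFO value pass user-chosen thresholds. It assumes tab-separated VCF text lines with the standard eight fixed columns and an `AF` key in the INFO column. The function names and the default thresholds (`min_qual`, `min_af`) are hypothetical, and in practice dedicated tools are normally used for this step.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=20;AF=0.25' into a dict."""
    info = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
        elif item and item != ".":
            info[item] = ""  # flag-style entry without a value
    return info


def filter_vcf_lines(lines, min_qual=30.0, min_af=0.05):
    """Return header lines plus records passing both thresholds.

    Thresholds are illustrative assumptions, not recommended values.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            kept.append(line)  # keep meta-information and column header
            continue
        fields = line.split("\t")
        qual, info = fields[5], parse_info(fields[7])
        if qual == "." or float(qual) < min_qual:
            continue
        # AF may hold one value per alternate allele, separated by commas.
        afs = [float(x) for x in info.get("AF", "").split(",") if x not in ("", ".")]
        if not afs or max(afs) < min_af:
            continue
        kept.append(line)
    return kept
```

Records with a missing QUAL or no `AF` value are dropped here. That is a design choice for this sketch, and a real analysis might prefer to keep and flag them instead.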

[14] This step adds context to the variant data using curated information from peer-reviewed papers and publicly available databases like gnomAD and Ensembl.

Variants can be annotated with information about genomic features, functional consequences, regulatory elements, and population frequencies using tools like ANNOVAR or SnpEff,[23] or custom scripts and pipelines.

The following steps are specific to RNA sequences: Mapped RNA sequences are analyzed to estimate gene expression levels using quantification tools such as HTSeq,[24] and to identify differentially expressed genes (DEGs) between experimental conditions using statistical methods like DESeq2.

[25] This is carried out to compare the expression levels of genes or isoforms between or across different samples, and infer biological relevance.

The results in the table can be further visualized using volcano plots and heatmaps, where colors represent the estimated expression level.
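A volcano plot places each gene at its log2 fold change on the x-axis and the negative log10 of its adjusted p-value on the y-axis. The sketch below computes those coordinates and a simple up/down/not-significant label. The input layout (dicts with `gene`, `log2fc`, and `padj` keys), the function name, and the cutoffs are assumptions for illustration, not the output format of any particular tool.

```python
import math


def volcano_points(results, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Return (gene, x, y, label) tuples for a volcano plot.

    x is the log2 fold change and y is -log10(adjusted p-value).
    The cutoffs are illustrative assumptions.
    """
    points = []
    for row in results:
        lfc, padj = row["log2fc"], row["padj"]
        # A p-value of exactly zero has no finite -log10; report infinity.
        y = -math.log10(padj) if padj > 0 else float("inf")
        if padj < padj_cutoff and lfc >= lfc_cutoff:
            label = "up"
        elif padj < padj_cutoff and lfc <= -lfc_cutoff:
            label = "down"
        else:
            label = "ns"  # not significant under the chosen cutoffs
        points.append((row["gene"], lfc, y, label))
    return points
```

The resulting tuples can be passed to any plotting library, with the label driving the point color.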

[14][12][13][26] RNA sequence analysis explores gene expression dynamics and regulatory mechanisms underlying biological processes and diseases.

Proteome sequence analysis studies the complete set of proteins expressed by an organism or a cell under specific conditions.

It describes protein structure, function, post-translational modifications, and interactions within biological systems.

[14] Raw MS data are first preprocessed to remove noise, normalize intensities, and detect peaks, and proprietary file formats (e.g., RAW) are converted to open formats (mzML, mzXML) for compatibility with downstream analysis tools. Other analytical steps include peptide identification, peptide quantification, protein inference and quantification, quality control reporting, and normalization, imputation, and significance testing.
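The normalization and imputation steps can be sketched as follows, using median centering of log2 intensities and minimum-value imputation. These are only two of many possible choices, picked here for simplicity. The function names and the input layout (a dict mapping protein identifiers to raw intensities, with `None` for missing values) are assumptions.

```python
import math
from statistics import median


def median_normalize(sample):
    """Log2-transform intensities and center them on the sample median.

    `sample` maps protein IDs to raw intensities; None marks a missing value.
    Assumes at least one observed, positive intensity.
    """
    logged = {p: math.log2(v) for p, v in sample.items() if v is not None and v > 0}
    center = median(logged.values())
    return {p: (logged[p] - center if p in logged else None) for p in sample}


def impute_min(sample):
    """Replace missing values with the lowest observed value in the sample."""
    floor = min(v for v in sample.values() if v is not None)
    return {p: (floor if v is None else v) for p, v in sample.items()}
```

Minimum-value imputation reflects the assumption that missing proteins fell below the detection limit. If values are missing for other reasons, a different imputation strategy may be more appropriate.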

All browsers support multiple data formats for upload and download and provide links to external tools and resources for sequence analyses, which contributes to their versatility.

In 1987, Michael Gribskov, Andrew McLachlan, and David Eisenberg introduced the method of profile comparison for identifying distant similarities between proteins.

In 1993, a probabilistic interpretation of profiles was introduced by Anders Krogh and colleagues using hidden Markov models.

[37] Machine learning has played a significant role in predicting the sequence of transcription factors.