Transcriptomics technologies

Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions.

Subsequent technological advances since the late 1990s have repeatedly transformed the field and made transcriptomics a widespread discipline in biological sciences.

Transcriptome analysis has enabled the study of how gene expression changes in different organisms and has been instrumental in the understanding of human disease.

In the late 1970s, libraries of silkmoth mRNA transcripts were collected and converted to complementary DNA (cDNA) for storage using reverse transcriptase.

[19][20] In 1995, serial analysis of gene expression (SAGE), one of the earliest sequencing-based transcriptomic methods, was developed; it worked by Sanger sequencing of concatenated random transcript fragments.

[34][35] Microarray technology allowed the assay of thousands of transcripts simultaneously, at a greatly reduced cost per gene and with a saving in labour.

[50] Serial analysis of gene expression (SAGE) was a development of EST methodology to increase the throughput of the tags generated and allow some quantitation of transcript abundance.
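The quantitation step in SAGE amounts to splitting each sequenced concatemer back into its tags and counting how often each tag occurs. The following is a minimal sketch, assuming fixed-length 11-nucleotide tags joined with no spacer sequence; it ignores details of real SAGE libraries such as restriction sites between tags. The function name `count_tags` and the toy sequences are hypothetical.

```python
from collections import Counter

TAG_LENGTH = 11  # assumed tag length in nucleotides


def count_tags(concatemer, tag_length=TAG_LENGTH):
    """Split a concatemer into fixed-length tags and count each tag.

    Any trailing bases that do not fill a whole tag are discarded.
    """
    usable = len(concatemer) - len(concatemer) % tag_length
    tags = (concatemer[i:i + tag_length] for i in range(0, usable, tag_length))
    return Counter(tags)


# Toy concatemer built from two made-up tags; tag_a appears twice.
tag_a = "ACGTACGTACG"
tag_b = "TTTTGGGGCCC"
concatemer = tag_a + tag_b + tag_a

tag_counts = count_tags(concatemer)
for tag, n in tag_counts.most_common():
    print(tag, n)
```

The tag frequencies stand in for transcript abundance only to the extent that each tag maps to a single gene, which this sketch does not check.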

[21] The cap analysis gene expression (CAGE) method is a variant of SAGE that sequences tags from the 5’ end of an mRNA transcript only.

SAGE and CAGE methods produce information on more genes than was possible when sequencing single ESTs, but sample preparation and data analysis are typically more labour-intensive.

Spotted arrays use two different fluorophores to label the test and control samples, and the ratio of fluorescence is used to calculate a relative measure of abundance.
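The ratio calculation for a two-colour array can be sketched as follows. Taking the base-2 logarithm of the ratio is a common convention rather than something specified here, and the sketch omits background subtraction and dye normalisation. The spot names and intensity values are made up for illustration.

```python
import math


def log2_ratio(test_intensity, control_intensity):
    """Relative abundance of a spot as log2(test / control)."""
    if test_intensity <= 0 or control_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(test_intensity / control_intensity)


# Hypothetical (test, control) fluorescence intensities per spot.
spots = {
    "gene_a": (2000.0, 500.0),   # brighter in the test sample
    "gene_b": (300.0, 1200.0),   # brighter in the control sample
    "gene_c": (800.0, 800.0),    # unchanged
}

ratios = {gene: log2_ratio(test, control) for gene, (test, control) in spots.items()}
for gene, value in ratios.items():
    print(f"{gene}: log2 ratio = {value:+.2f}")
```

On the log scale, equal increases and decreases in abundance are symmetric around zero, which is one reason the transform is widely used.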

RNA-Seq refers to the combination of a high-throughput sequencing methodology with computational methods to capture and quantify transcripts present in an RNA extract.

[9] Both low-abundance and high-abundance RNAs can be quantified in an RNA-Seq experiment (a dynamic range of 5 orders of magnitude), which is a key advantage over microarray transcriptomes.

RNA-Seq methodology has constantly improved, primarily through the development of DNA sequencing technologies to increase throughput, accuracy, and read length.

Methods differ in the use of transcript enrichment, fragmentation, amplification, single or paired-end sequencing, and whether to preserve strand information.

[72] UMIs provide an absolute scale for quantification, an opportunity to correct for subsequent amplification bias introduced during library construction, and a way to estimate the initial sample size accurately.

UMIs are particularly well-suited to single-cell RNA-Seq transcriptomics, where the amount of input RNA is restricted and extended amplification of the sample is required.
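The way UMIs correct for amplification bias can be illustrated by counting distinct UMIs per gene instead of raw reads: reads that share both gene and UMI are treated as copies of one original molecule. This is a minimal sketch that assumes reads have already been assigned to genes and that UMIs contain no sequencing errors; real pipelines generally also have to handle errors in the UMI sequence itself. The function name and read data are hypothetical.

```python
from collections import defaultdict


def count_molecules(reads):
    """Return {gene: (read_count, unique_umi_count)}.

    reads is an iterable of (gene, umi) pairs.
    """
    umis_per_gene = defaultdict(set)
    reads_per_gene = defaultdict(int)
    for gene, umi in reads:
        umis_per_gene[gene].add(umi)
        reads_per_gene[gene] += 1
    return {
        gene: (reads_per_gene[gene], len(umis))
        for gene, umis in umis_per_gene.items()
    }


# Hypothetical reads: gene_a has one heavily amplified molecule (AAAC).
reads = [
    ("gene_a", "AAAC"), ("gene_a", "AAAC"), ("gene_a", "AAAC"),
    ("gene_a", "GGTA"),
    ("gene_b", "CCAT"), ("gene_b", "CCAT"),
]

molecule_counts = count_molecules(reads)
for gene, (n_reads, n_molecules) in molecule_counts.items():
    print(f"{gene}: {n_reads} reads, {n_molecules} molecules")
```

In this toy data gene_a has twice as many reads as gene_b only because one molecule was amplified more; the UMI counts give the molecule-level picture.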

[85][86] A large number of reads are needed to ensure sufficient coverage of the transcriptome, enabling detection of low abundance transcripts.

In addition, every species has a different number of genes and therefore requires a tailored sequence yield for an effective transcriptome.
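A rough way to see how the required sequence yield scales with transcriptome size is the standard average-coverage relation: coverage = (number of reads × read length) / target size. The sketch below rearranges it to estimate a read count. It assumes uniform coverage, which does not hold for transcriptomes because transcript abundance varies widely, so the result is at best a lower bound for detecting rare transcripts. All numbers are hypothetical.

```python
import math


def reads_needed(transcriptome_size_bp, read_length_bp, target_coverage):
    """Reads required to reach an average coverage over the transcriptome."""
    if read_length_bp <= 0:
        raise ValueError("read length must be positive")
    return math.ceil(target_coverage * transcriptome_size_bp / read_length_bp)


# Hypothetical example: 50 Mb transcriptome, 100 bp reads, 30x average coverage.
estimate = reads_needed(50_000_000, 100, 30)
print(f"{estimate:,} reads")
```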

[88][89][90] Transcriptomics methods are highly parallel and require significant computation to produce meaningful data for both microarray and RNA-Seq experiments.

Multiple short probes matching a single transcript can reveal details about the intron-exon structure, requiring statistical models to determine the authenticity of the resulting signal.

Raw data is examined to ensure: quality scores for base calls are high, the GC content matches the expected distribution, short sequence motifs (k-mers) are not over-represented, and the read duplication rate is acceptably low.
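The four checks listed above can each be reduced to a simple summary statistic. The following sketch computes them for a handful of reads, assuming Phred+33 quality encoding (as used in current FASTQ files) and treating only exact sequence matches as duplicates. Dedicated quality-control tools do considerably more; the function names and the toy reads are hypothetical.

```python
from collections import Counter


def mean_phred(quality_string, offset=33):
    """Mean Phred score of one read, assuming Phred+33 ASCII encoding."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def gc_fraction(sequence):
    """Fraction of bases in a read that are G or C."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def duplication_rate(sequences):
    """Fraction of reads that are exact copies of another read already seen."""
    return 1 - len(set(sequences)) / len(sequences)


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Hypothetical (sequence, quality) pairs; 'I' encodes Q40 and '#' encodes Q2.
reads = [
    ("ACGTACGT", "IIIIIIII"),
    ("ACGTACGT", "IIII####"),
    ("GGGGCCCC", "IIIIIIII"),
    ("ATATATAT", "IIIIIIII"),
]
sequences = [seq for seq, _ in reads]

for seq, qual in reads:
    print(seq, f"meanQ={mean_phred(qual):.1f}", f"GC={gc_fraction(seq):.2f}")
print("duplication rate:", duplication_rate(sequences))
print("most common 2-mers:", kmer_counts(sequences, 2).most_common(3))
```

Thresholds for what counts as "high" quality or an "acceptably low" duplication rate depend on the experiment and are deliberately left out.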

Legend: RAM – random access memory; MPI – message passing interface; EST – expressed sequence tag.

The final outputs of these analyses are gene lists with associated pair-wise tests for differential expression between treatments and the probability estimates of those differences.
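The effect-size part of such a gene list can be sketched as a log2 fold change between two treatments after scaling each sample to counts per million (CPM). This is only an illustration of the comparison: it uses one sample per treatment and produces no probability estimate, whereas differential expression tools fit statistical models to replicated data. The pseudocount, function names, and counts are assumptions made for the example.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def log2_fold_change(treated, control, pseudocount=1.0):
    """Per-gene log2(treated CPM / control CPM), with a pseudocount to avoid log(0)."""
    treated_cpm = counts_per_million(treated)
    control_cpm = counts_per_million(control)
    return {
        gene: math.log2(
            (treated_cpm[gene] + pseudocount) / (control_cpm[gene] + pseudocount)
        )
        for gene in control
    }


# Hypothetical raw counts; the treated sample was sequenced twice as deeply.
control = {"gene_a": 100, "gene_b": 100, "gene_c": 800}
treated = {"gene_a": 400, "gene_b": 100, "gene_c": 1500}

fold_changes = log2_fold_change(treated, control)
for gene, lfc in sorted(fold_changes.items(), key=lambda item: item[1], reverse=True):
    print(f"{gene}: log2FC = {lfc:+.2f}")
```

Without the CPM step, gene_b would appear unchanged and gene_c nearly doubled simply because the treated library is larger.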

[10][141] RNA-Seq approaches have allowed the large-scale identification of transcriptional start sites and have uncovered alternative promoter usage and novel splicing alterations.

[142] RNA-Seq can also identify disease-associated single nucleotide polymorphisms (SNPs), allele-specific expression, and gene fusions, all of which contribute to the understanding of disease causal variants.

[145][146] RNA-Seq of human pathogens has become an established method for quantifying gene expression changes, identifying novel virulence factors, predicting antibiotic resistance, and unveiling host-pathogen immune interactions.

This technique enables the study of the dynamic response and interspecies gene regulatory networks in both interaction partners from initial contact through to invasion and the final persistence of the pathogen or clearance by the host immune system.

[156] Integration of RNA-Seq datasets across different tissues has been used to improve annotation of gene functions in commercially important organisms (e.g. cucumber)[157] or threatened species (e.g.

For example, a database of SNPs used in Douglas fir breeding programs was created by de novo transcriptome analysis in the absence of a sequenced genome.

This article was adapted from the following source under a CC BY 4.0 license (2017) (reviewer reports): Rohan Lowe; Neil Shirley; Mark Bleackley; Stephen Dolan; Thomas Shafee (18 May 2017).

Transcriptomics method use over time. Published papers referring to RNA-Seq (black), RNA microarray (red), expressed sequence tag (blue), digital differential display (green), and serial/cap analysis of gene expression (yellow) since 1990. [1]
Summary of SAGE. Within the organisms, genes are transcribed and spliced (in eukaryotes) to produce mature mRNA transcripts (red). The mRNA is extracted from the organism, and reverse transcriptase is used to copy the mRNA into stable double-stranded cDNA (ds-cDNA; blue). In SAGE, the ds-cDNA is digested by restriction enzymes (at location ‘X’ and ‘X’+11) to produce 11-nucleotide "tag" fragments. These tags are concatenated and sequenced using long-read Sanger sequencing (different shades of blue indicate tags from different genes). The sequences are deconvoluted to find the frequency of each tag. The tag frequency can be used to report on transcription of the gene that the tag came from. [51]
Summary of DNA microarrays. Within the organisms, genes are transcribed and spliced (in eukaryotes) to produce mature mRNA transcripts (red). The mRNA is extracted from the organism and reverse transcriptase is used to copy the mRNA into stable ds-cDNA (blue). In microarrays, the ds-cDNA is fragmented and fluorescently labelled (orange). The labelled fragments bind to an ordered array of complementary oligonucleotides, and measurement of fluorescent intensity across the array indicates the abundance of a predetermined set of sequences. These sequences are typically chosen specifically to report on genes of interest within the organism's genome. [51]
Summary of RNA-Seq. Within the organisms, genes are transcribed and spliced (in eukaryotes) to produce mature mRNA transcripts (red). The mRNA is extracted from the organism, fragmented, and copied into stable ds-cDNA (blue). The ds-cDNA is sequenced using high-throughput, short-read sequencing methods. These sequences can then be aligned to a reference genome sequence to reconstruct which genome regions were being transcribed. This data can be used to annotate where expressed genes are, their relative expression levels, and any alternative splice variants. [51]
Microarray and sequencing flow cell. Microarrays and RNA-Seq rely on image analysis in different ways. In a microarray chip, each spot on a chip is a defined oligonucleotide probe, and fluorescence intensity directly detects the abundance of a specific sequence (Affymetrix). In a high-throughput sequencing flow cell, spots are sequenced one nucleotide at a time, with the colour at each round indicating the next nucleotide in the sequence (Illumina HiSeq). Other variations of these techniques use more or fewer colour channels. [51][101]
Heatmap identification of gene co-expression patterns across different samples. Each column contains the measurements for gene expression change for a single sample. Relative gene expression is indicated by colour: high expression (red), median expression (white) and low expression (blue). Genes and samples with similar expression profiles can be automatically grouped (left and top trees). Samples may be different individuals, tissues, environments or health conditions. In this example, expression of gene set 1 is high and expression of gene set 2 is low in samples 1, 2, and 3. [51][129]
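The automatic grouping of genes with similar expression profiles rests on a pairwise similarity measure. One common choice, assumed here rather than taken from the figure, is the Pearson correlation between expression profiles; hierarchical clustering then builds the trees from such a measure. The sketch below computes only the correlations, and the expression values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical expression of three genes across five samples.
profiles = {
    "gene_1": [5.0, 6.0, 5.0, 1.0, 1.0],
    "gene_2": [1.0, 1.0, 2.0, 6.0, 5.0],
    "gene_3": [10.0, 12.0, 10.0, 2.0, 2.0],  # same pattern as gene_1, doubled
}

genes = sorted(profiles)
for i, first in enumerate(genes):
    for second in genes[i + 1:]:
        r = pearson(profiles[first], profiles[second])
        print(f"{first} vs {second}: r = {r:+.2f}")
```

Genes whose profiles correlate strongly, such as gene_1 and gene_3 here, would end up close together in the tree, while anti-correlated genes fall into separate groups.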