Pan-genome

In the fields of molecular biology and genetics, a pan-genome (pangenome or supragenome) is the entire set of genes from all strains within a clade.

[2] The genetic repertoire of a bacterial species is much larger than the gene content of an individual strain.

[13] An open access book reviewing the pangenome concept and its implications, edited by Tettelin and Medini, was published in the spring of 2020.

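[citation needed] The pan-genome can be somewhat arbitrarily classified as open or closed based on the alpha value of Heaps' law, N = k n^(−α), where N is the number of new gene families found when the n-th genome is added, k is a proportionality constant, and α is the fitted exponent. The pangenome is considered open if α ≤ 1 and closed if α > 1.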

[23][15] Pangenome software can usually estimate the parameters of Heaps' law that best describe the behavior of the data.
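
To make the fitting step concrete, the following is a minimal sketch, not the procedure of any particular pangenome tool. It assumes the power-law form N = k * n^(-alpha), where N is the number of new gene families seen when the n-th genome is added, and estimates k and alpha by ordinary least squares on log-transformed values. The function names `fit_heaps_law` and `classify_pangenome`, and the synthetic counts, are hypothetical and chosen only for illustration.

```python
import math


def fit_heaps_law(new_families):
    """Estimate (k, alpha) for N = k * n**(-alpha) in log-log space.

    new_families[i] is the number of new gene families observed when
    genome number i + 1 is added (illustrative input, not real data).
    Zero counts are skipped because their logarithm is undefined.
    """
    points = [
        (math.log(n), math.log(count))
        for n, count in enumerate(new_families, start=1)
        if count > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two non-zero counts to fit")

    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

    slope = sxy / sxx                      # slope of the log-log line is -alpha
    intercept = mean_y - slope * mean_x    # intercept is log(k)
    return math.exp(intercept), -slope


def classify_pangenome(alpha):
    """Apply the conventional cutoff: open if alpha <= 1, closed otherwise."""
    return "open" if alpha <= 1 else "closed"


# Synthetic counts generated from a known power law (k = 500, alpha = 1.5).
synthetic_counts = [500 * n ** -1.5 for n in range(1, 21)]
k_hat, alpha_hat = fit_heaps_law(synthetic_counts)
print(k_hat, alpha_hat, classify_pangenome(alpha_hat))
```

Real accumulation curves depend on the order in which genomes are added, so in practice the counts are typically averaged over many random orderings before fitting. That step is omitted here.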

Parasitic species and species specialized to a particular ecological niche are thought to tend toward closed pangenomes.

[26] Some studies suggest that prokaryote pangenomes are the result of adaptive, not neutral, evolution that gives species the ability to migrate to new niches.

In 2011 genomic fluidity was proposed as a measure to categorize the gene-level similarity among groups of sequenced isolates.
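
As a rough illustration, genomic fluidity is usually computed as the average, over all pairs of genomes, of the ratio of gene families unique to either genome in the pair to the total number of gene families in both. The sketch below assumes that pairwise definition and assumes each genome is already represented as a non-empty set of gene family identifiers. The function name `genomic_fluidity` and the toy genomes are hypothetical.

```python
from itertools import combinations


def genomic_fluidity(genomes):
    """Average pairwise ratio of unique to total gene families.

    genomes maps a genome name to the set of gene family ids it contains.
    For each pair (a, b) the ratio is
        (families only in a + families only in b) / (|a| + |b|).
    Assumes every genome has at least one gene family.
    """
    pairs = list(combinations(genomes.values(), 2))
    if not pairs:
        raise ValueError("need at least two genomes")

    total = 0.0
    for a, b in pairs:
        unique = len(a - b) + len(b - a)
        total += unique / (len(a) + len(b))
    return total / len(pairs)


# Toy example with three genomes and seven gene families (illustrative only).
toy_genomes = {
    "strain_A": {1, 2, 3, 4},
    "strain_B": {1, 2, 3, 5},
    "strain_C": {1, 2, 6, 7},
}
fluidity = genomic_fluidity(toy_genomes)
print(round(fluidity, 4))
```

A value of 0 means every pair of genomes carries exactly the same gene families, and a value of 1 means no pair shares any family.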

[31] 'Metapangenome' has been defined as the outcome of analyzing pangenomes in conjunction with their environment, where the abundance and prevalence of gene clusters and genomes are recovered from shotgun metagenomes.

[33] Other authors consider that metapangenomics expands the concept of the pangenome by incorporating gene sequences obtained from uncultivated microorganisms through a metagenomics approach.

[35] The Anvi'o platform developed a workflow that integrates the analysis and visualization of metapangenomes by generating pangenomes and studying them in conjunction with metagenomes.

[32] In 2018, 87% of the available whole-genome sequences were bacterial, fueling researchers' interest in calculating prokaryote pangenomes at different taxonomic levels.

[22] In 2015, the pangenome of 44 strains of Streptococcus pneumoniae showed that few new genes were discovered with each new genome sequenced (see figure).

[45] Among plants, there are examples of pangenome studies in model species, both diploid [9] and polyploid,[10] and a growing list of crops.

They have been reviewed by Eizenga et al. [52] As interest in pangenomes has increased, several software tools have been developed to help analyze this kind of data.

[55] Seven kinds of software have been developed to analyze pangenomes: tools that cluster homologous genes; identify SNPs; plot pangenomic profiles; build phylogenetic relationships of orthologous genes or gene families across strains or isolates; support function-based searching; perform annotation and/or curation; and provide visualization.

[11][59] In 2018, panX was released, an interactive web tool that allows inspection of the evolutionary history of gene families.

[65] panX can display an alignment of genomes, a phylogenetic tree, mapping of mutations and inference about gain and loss of the family on the core-genome phylogeny.

In 2019, OrthoVenn 2.0 [66] allowed comparative visualization of families of homologous genes in Venn diagrams of up to 12 genomes.

In 2023, BRIDGEcereal was developed to survey and graph indel-based haplotypes from a pan-genome through a gene model ID.

In 2020, a computational comparison of tools for extracting gene-based pangenomic contents (such as GET_HOMOLOGUES, PanDelos, Roary, and others) was released.

The analysis took into account different bacterial populations, which were synthetically generated by varying evolutionary parameters.

Also in 2020, several tools introduced a graphical representation of pangenomes that shows the contiguity of genes (PPanGGOLiN,[46] Panaroo[65]).

Pangenome analysis of Streptococcus agalactiae genomes made with the Anvi'o [1] software, whose development is led by A. Murat Eren. Genomes obtained from Tettelin et al. (2005). [2] Each circle corresponds to one genome and each radius represents a gene family. The core genome families are located at the bottom and at the right. Some families in the core may have more than one homologous gene per genome. The shell genome appears in the middle left of the figure. Families from the dispensable genome and singletons are shown at the top left.
In the pangenome, three sets of genes can be identified: the core, shell, and cloud genomes. The core genome comprises the genes present in all genomes analyzed. To avoid dismissing families because of sequencing artifacts, some authors use the softcore (>95% occurrence). The shell genome consists of the genes shared by the majority of genomes (10-95% occurrence). Gene families present in only one genome, or with <10% occurrence, are described as the dispensable or cloud genome.
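
The occurrence thresholds in this caption can be sketched as a simple classifier. This is a minimal illustration that assumes a presence/absence mapping from each gene family to the set of genomes containing it. The function name `classify_families`, the label strings, and the toy data are hypothetical, and authors differ on how the boundary cases are handled.

```python
def classify_families(presence, n_genomes):
    """Label each gene family as core, softcore, shell, or cloud.

    presence maps a gene family id to the set of genomes that contain it.
    Thresholds follow the caption: core = all genomes, softcore = more
    than 95%, shell = 10-95%, cloud = a single genome or under 10%.
    Integer arithmetic is used to avoid floating-point boundary errors.
    """
    labels = {}
    for family, genomes in presence.items():
        count = len(genomes)
        if count == n_genomes:
            labels[family] = "core"
        elif count <= 1 or count * 10 < n_genomes:
            labels[family] = "cloud"
        elif count * 100 > 95 * n_genomes:
            labels[family] = "softcore"
        else:
            labels[family] = "shell"
    return labels


# Toy data set: 40 hypothetical genomes named g0 ... g39.
all_genomes = [f"g{i}" for i in range(40)]
toy_presence = {
    "famA": set(all_genomes),        # 40 of 40 genomes
    "famB": set(all_genomes[:39]),   # 39 of 40 genomes (97.5%)
    "famC": set(all_genomes[:20]),   # 20 of 40 genomes (50%)
    "famD": set(all_genomes[:3]),    # 3 of 40 genomes (7.5%)
    "famE": {"g7"},                  # singleton
}
family_labels = classify_families(toy_presence, len(all_genomes))
print(family_labels)
```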
a) Closed pangenomes are characterized by large core genomes and small accessory genomes. b) Open pangenomes tend to have small core genomes and large accessory genomes. c) The size of an open pangenome tends to increase with every added genome, whereas the size of a closed pangenome tends toward an asymptote as more genomes are added. Because of this, the complete size of a closed pangenome can be predicted.
The supergenome is defined as all genes accessible to a given species, that is, the pangenome that would be obtained if the genomes of all members of the species were sequenced. The metapangenome is pangenome analysis applied to metagenomic samples, in which the union of the genes of several species is evaluated for a given habitat.
The S. pneumoniae pan-genome. (a) Number of new genes as a function of the number of sequenced genomes. The predicted number of new genes drops sharply to zero when the number of genomes exceeds 50. (b) Number of core genes as a function of the number of sequenced genomes. The number of core genes converges to 1,647 as the number of genomes n→∞. From Donati et al. [36]
Pangenome analysis of genomes of Streptococcus agalactiae. [2] Example of phylogenies made with BPGA software. This software generates phylogenies based on the clustering of the core genome or the pangenome. Core and pan phylogenetic reconstructions do not necessarily match.
Pangenome graph of 3 117 Acinetobacter baumannii genomes generated with PPanGGOLiN software. Edges correspond to genomic colocalization and nodes correspond to genes. The thickness of the edges is proportional to the number of genomes sharing that link. The edges between persistent (similar to core genes), shell and cloud nodes are colored in orange, green and blue, respectively.
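
The graph structure described in this caption can be approximated with a short sketch. It is not how PPanGGOLiN is implemented. It assumes each genome is given as a single linear list of gene family identifiers in chromosomal order, ignores strand and circularity, and counts each link at most once per genome. The function name `build_pangenome_graph` and the toy gene orders are hypothetical.

```python
from collections import Counter


def build_pangenome_graph(gene_orders):
    """Count nodes and colocalization edges across genomes.

    gene_orders maps a genome name to a list of gene family ids in the
    order they occur on the (assumed linear) chromosome. Returns two
    Counters: how many genomes contain each family (node weight) and how
    many genomes have two families as direct neighbours (edge weight).
    """
    node_counts = Counter()
    edge_counts = Counter()
    for order in gene_orders.values():
        node_counts.update(set(order))
        # A set ensures each link is counted once per genome;
        # frozenset makes the edge undirected.
        links = {
            frozenset((left, right))
            for left, right in zip(order, order[1:])
            if left != right
        }
        edge_counts.update(links)
    return node_counts, edge_counts


# Toy gene orders for three hypothetical genomes.
toy_orders = {
    "genome_1": ["A", "B", "C", "D"],
    "genome_2": ["A", "B", "D"],
    "genome_3": ["B", "A", "C"],
}
nodes, edges = build_pangenome_graph(toy_orders)
for link, weight in sorted(edges.items(), key=lambda item: -item[1]):
    print(sorted(link), weight)
```

In a drawing of such a graph, the edge weight would determine the line thickness.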
Example of possible outputs of BPGA software. Pangenome analysis of genomes of Streptococcus agalactiae. On the left, the distribution of GO terms by core, dispensable, and unique genome is shown. In this example, the category replication, recombination, and repair is enriched in unique gene families. On the right, a typical pan/core plot is shown: as more genomes are added, the size of the core decreases while the size of the pangenome increases. [2]