k-mer

k-mers are used primarily in computational genomics and sequence analysis, where they are composed of nucleotides (i.e. A, T, G, and C). In this context they are used to assemble DNA sequences,[1] improve heterologous gene expression,[2][3] identify species in metagenomic samples,[4] and create attenuated vaccines.

The frequency of k-mer usage is affected by numerous forces, working at multiple levels, which are often in conflict.

[10] Indeed, if natural selection were the driving force behind GC-content variation, single nucleotide changes, which are often silent, would have to alter the fitness of an organism.

[11] Rather, current evidence suggests that GC‐biased gene conversion (gBGC) is a driving factor behind variation in GC content.

[16] That recombination is able to drive up GC content in all domains of life suggests that gBGC is universally conserved.

Whether gBGC is a (mostly) neutral byproduct of the molecular machinery of life or is itself under selection remains to be determined.

What is known is that these dinucleotide biases are relatively constant throughout the genome, unlike GC-content, which, as seen above, can vary considerably.

[21] This interaction highlights the interrelationship between the forces affecting k-mers for varying values of k. Dinucleotide bias can also serve as a "distance" measurement between phylogenetically similar genomes.
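As an illustration of how such a distance might be computed, the following Python sketch uses one common formulation: the odds ratio of each dinucleotide's observed frequency to the product of its component nucleotide frequencies, with the distance taken as the mean absolute difference between two such profiles. The function names are hypothetical, and the choice of formula is an assumption rather than something stated above. For simplicity the sketch considers only one strand, whereas published variants often also include the reverse complement.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_odds(sequence: str) -> dict:
    """Odds ratio rho_xy = f_xy / (f_x * f_y) for all 16 dinucleotides."""
    seq = sequence.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    # Only unambiguous bases contribute to the totals.
    n_mono = sum(mono[b] for b in BASES)
    n_di = sum(di[x + y] for x, y in product(BASES, repeat=2))
    if n_mono == 0 or n_di == 0:
        raise ValueError("sequence must contain at least one ACGT dinucleotide")
    odds = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        observed = di[x + y] / n_di
        # A dinucleotide whose bases never occur gets an odds ratio of 0.
        odds[x + y] = observed / expected if expected > 0 else 0.0
    return odds


def dinucleotide_distance(seq_a: str, seq_b: str) -> float:
    """Mean absolute difference between two odds-ratio profiles."""
    a, b = dinucleotide_odds(seq_a), dinucleotide_odds(seq_b)
    return sum(abs(a[d] - b[d]) for d in a) / len(a)
```

Under this formulation, a sequence compared with itself has a distance of zero, and larger values indicate more dissimilar dinucleotide usage.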

[23] This suggests that selection for translational efficiency or accuracy is the driving force behind CUB variation.

[4] The exact cause of variation in tetranucleotide bias is not well understood, but it has been hypothesized to be the result of the maintenance of genetic stability at the molecular level.

[25][26] In order to create a De Bruijn Graph, the k-mers stored in each edge with length

This is due to read errors and, more importantly, to coverage holes that occur during sequencing.

[27] Furthermore, splitting the k-mers into smaller sizes also helps alleviate the problem of different initial read lengths.
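The splitting step and the resulting graph can be sketched in Python as follows. This is a minimal illustration, not any particular assembler's implementation: the function names are hypothetical, reads are assumed to be plain strings, and each distinct k-mer is treated as an edge from its (k-1)-base prefix to its (k-1)-base suffix. Because every read is reduced to k-mers of the same length, reads of different lengths are handled uniformly.

```python
from collections import defaultdict


def split_into_kmers(reads, k):
    """Return the set of distinct k-mers found in any of the reads."""
    kmers = set()
    for read in reads:
        # A read shorter than k contributes nothing (empty range).
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def de_bruijn_graph(kmers):
    """Build an adjacency map in which each k-mer is an edge from its
    prefix (first k-1 bases) to its suffix (last k-1 bases)."""
    graph = defaultdict(list)
    for kmer in sorted(kmers):  # sorted only to make the output deterministic
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


reads = ["ATGGC", "GGCGT"]  # toy reads of a hypothetical sequence ATGGCGT
graph = de_bruijn_graph(split_into_kmers(reads, 4))
# graph maps ATG -> TGG -> GGC -> GCG -> CGT, spelling out ATGGCGT
```

Real assemblers additionally track k-mer multiplicities and prune edges caused by sequencing errors, which this sketch omits.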

[28] k-mers are also used to detect bacterial contamination during eukaryotic genome assembly, an approach borrowed from the field of metagenomics.

With respect to disease, dinucleotide bias has been applied to the detection of genetic islands associated with pathogenicity.

[11] Prior work has also shown that tetranucleotide biases are able to effectively detect horizontal gene transfer in both prokaryotes[32] and eukaryotes.

[34] Similar to the direct use of GC-content for taxonomic purposes is the use of Tm, the melting temperature of DNA.

In 1987, the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics proposed the use of ΔTm as a factor in determining species boundaries as part of the phylogenetic species concept, though this proposal does not appear to have gained traction within the scientific community.

[35] Other applications within genetics and genomics include:

k-mer frequency and spectrum variation is heavily used in metagenomics for both analysis[47][48] and binning.

TETRA is a notable tool that takes metagenomic samples and bins them into organisms based on their tetranucleotide (k = 4) frequencies.

[49] Other tools that similarly rely on k-mer frequency for metagenomic binning are CompostBin (k = 6),[50] PCAHIER,[51] PhyloPythia (5 ≤ k ≤ 6),[52] CLARK (k ≥ 20),[53] and TACOA (2 ≤ k ≤ 6).
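The general idea behind frequency-based binning can be illustrated with a short Python sketch that builds a tetranucleotide (k = 4) frequency profile for each sequence and compares two profiles by Pearson correlation. This is a simplified illustration only: it does not reproduce the statistics used by TETRA or any of the other tools named above, and the function names and the choice of Pearson correlation as the similarity measure are assumptions made for the example.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 4**4 = 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_profile(sequence):
    """Relative frequency of each of the 256 tetranucleotides."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        raise ValueError("sequence contains no unambiguous tetranucleotide")
    return [counts[t] / total for t in TETRAMERS]


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_u * var_v)
```

In a binning setting, fragments whose profiles correlate strongly would be candidates for the same bin; a practical tool would also need a clustering step and a minimum fragment length, since short fragments give noisy profiles.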

[61] In addition, codon usage bias has been modified to create synonymous sequences with greater protein expression rates.

[62] The most studied application of k-mers for decreasing translational efficiency is codon-pair manipulation for attenuating viruses in order to create vaccines.

[63] Though containing an identical amino-acid sequence, the recoded virus demonstrated significantly weakened pathogenicity while eliciting a strong immune response.

[65] Notably, the codon-pair bias manipulation employed to attenuate MDV did not effectively reduce the oncogenicity of the virus, highlighting a potential weakness in the biotechnology applications of this approach.

GC-content, due to its effect on DNA melting point, is used to predict annealing temperature in PCR, another important biotechnology tool.
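As a rough illustration of the relationship between GC-content and melting temperature, the sketch below computes GC-content and applies the Wallace rule, a simple approximation for short oligonucleotides in which each A or T contributes 2 °C and each G or C contributes 4 °C. The use of this particular rule is an assumption made for the example; primer-design software generally relies on more accurate nearest-neighbor thermodynamic models, and the annealing temperature is typically chosen somewhat below the estimated melting temperature.

```python
def gc_content(sequence: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer: str) -> float:
    """Approximate melting temperature (deg C) of a short primer using
    the Wallace rule: Tm = 2 * (A + T) + 4 * (G + C)."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2.0 * at + 4.0 * gc


primer = "ATGCATGC"             # hypothetical 8-base primer
print(gc_content(primer))       # 0.5
print(wallace_tm(primer))       # 24.0
```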

While simple implementations such as the above pseudocode work for small values of k, they need to be adapted for high-throughput applications or when k is large.
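A simple implementation of this kind can be sketched in Python as follows. It assumes the sequence is held in memory as a plain string and slides a window of length k along it; the function names are hypothetical. The second function derives a k-mer spectrum from the counts, that is, the number of distinct k-mers observed at each multiplicity.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in `sequence` with a sliding window."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length L contains L - k + 1 overlapping k-mers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_spectrum(counts: Counter) -> Counter:
    """Map each multiplicity to the number of distinct k-mers having it."""
    return Counter(counts.values())


counts = count_kmers("ATGG", 3)
# counts holds ATG and TGG, each with multiplicity 1
```

Because this approach stores every distinct k-mer as a separate string key, memory use grows quickly with k and with genome size, which is why high-throughput tools use more compact representations.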

The sequence ATGG has two 3-mers: ATG and TGG.
An example 8-mer spectrum for E. coli comparing 8-mers' frequency (i.e. multiplicities) with their number of occurrences.
This figure shows the process of splitting reads into smaller k-mers (4-mers in this case) so that they can be used in a De Bruijn graph. (A) Shows the initial segment of DNA being sequenced. (B) Shows the reads output from sequencing and how they align. The problem with this alignment is that the reads overlap by k-2, not by the k-1 that De Bruijn graphs require. (C) Shows the reads being split into smaller 4-mers. (D) Discards the repeated 4-mers and then shows their alignment. Note that these k-mers overlap by k-1 and can then be used in a De Bruijn graph.