The sequencing reads can then be used to create a de Bruijn graph, whose structure can be used in various ways to find errors.
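As a minimal sketch of how such a graph can be built, the following Python splits each read into k-mers and links overlapping (k-1)-mers. The function name `build_de_bruijn`, the toy reads and the choice of k = 4 are illustrative assumptions, not part of any particular assembler.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Return (edges, coverage) for a de Bruijn graph built from reads.

    edges maps each (k-1)-mer to the set of (k-1)-mers that follow it;
    coverage counts how many times each k-mer was observed.
    """
    edges = defaultdict(set)
    coverage = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            edges[kmer[:-1]].add(kmer[1:])
            coverage[kmer] += 1
    return edges, coverage

# Toy input (hypothetical): the third read differs from the first by one base.
reads = ["ACGTAC", "CGTACG", "ACGTTC"]
edges, coverage = build_de_bruijn(reads, k=4)

# k-mers seen only once are candidates for sequencing errors.
low_coverage = sorted(kmer for kmer, n in coverage.items() if n == 1)
```

In this toy example the node `CGT` has two successors, which is the kind of branching that error-detection methods inspect, and the k-mers on the divergent branch have lower coverage than those on the shared path.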
A tip is a dead-end branch formed where an error during the sequencing process has caused the graph to end prematurely; it includes both correct and incorrect k-mers.
If no reference genome is available, tips are eliminated by tracing the branches backward until a point of ambiguity is found.
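The backward tracing can be sketched as follows. This is an illustrative simplification: the graph is assumed to be a plain dictionary from node to successor set, and the length threshold `max_tip_len` is a hypothetical parameter added here so that the genuine end of the sequence is not mistaken for a tip.

```python
def remove_tips(edges, max_tip_len):
    """Remove short dead-end branches from a graph {node: set(successors)}."""
    graph = {u: set(vs) for u, vs in edges.items()}
    for vs in edges.values():
        for v in vs:
            graph.setdefault(v, set())

    # Predecessor lookup, needed to walk backward from a dead end.
    preds = {u: set() for u in graph}
    for u, vs in graph.items():
        for v in vs:
            preds[v].add(u)

    dead_ends = [u for u, vs in graph.items() if not vs]
    for end in dead_ends:
        path = [end]
        node = end
        # Trace backward while each node has exactly one predecessor.
        while len(preds[node]) == 1:
            parent = next(iter(preds[node]))
            if len(graph[parent]) > 1:
                # Point of ambiguity: the parent branches.
                if len(path) <= max_tip_len:
                    graph[parent].discard(node)
                    for n in path:
                        del graph[n]
                break
            path.append(parent)
            node = parent
    return graph

# Toy graph (hypothetical): CGT branches to the main path and to a tip GTT.
edges = {
    "ACG": {"CGT"},
    "CGT": {"GTA", "GTT"},
    "GTA": {"TAC"},
    "TAC": {"ACC"},
    "ACC": {"CCA"},
    "GTT": set(),
}
cleaned = remove_tips(edges, max_tip_len=2)
```

Here the one-node branch `GTT` is removed, while the longer branch ending at `CCA` exceeds the threshold and is kept.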
These bubbles and tips can then be removed, since they can be identified as products of errors in the reads, giving a graph structure that should accurately and completely reflect the original sequence.
When comparing two strands of DNA, colored de Bruijn graphs are frequently used to identify errors.
This algorithm can have a high false positive rate because repeat-induced and variant-induced bubbles are difficult to separate; however, there is often a reference genome to help improve reliability.
Because a reference is available in most cases, the path divergence algorithm is useful, especially for deletions and for variants so complex that they are constrained to the reference allele.
In the simplest cases, the samples are combined into a group of a single color and the data is analysed as described previously.
However, maintaining a separate color for each sample set provides additional information on how the bubbles were formed, whether by error or by repeats.
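One simple way to picture the coloring is to record, for every k-mer, how often it occurs in each sample. The sketch below does this with nested dictionaries; the sample names, the reads and the function name `build_colored_kmers` are hypothetical, and real implementations use far more compact data structures.

```python
from collections import defaultdict

def build_colored_kmers(samples, k):
    """Map each k-mer to {color: count}, where the color is the sample name."""
    colors = defaultdict(lambda: defaultdict(int))
    for color, reads in samples.items():
        for read in reads:
            for i in range(len(read) - k + 1):
                colors[read[i:i + k]][color] += 1
    return colors

# Toy input (hypothetical): sample_b carries one read that differs by a base.
samples = {
    "sample_a": ["ACGTAC", "ACGTAC"],
    "sample_b": ["ACGTAC", "ACGGAC"],
}
colored = build_colored_kmers(samples, k=4)

# k-mers on a bubble branch that appear in only one color.
private_to_b = sorted(
    kmer for kmer, counts in colored.items() if set(counts) == {"sample_b"}
)
```

Keeping the counts per color shows which branch of a bubble each sample supports, information that is lost when all samples are merged into a single color.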
[5] In 1997, the Department of Technology at Genzyme Genetics in Framingham, Massachusetts developed a new approach that provided a breakthrough in dealing with bubbles using the multiplex allele-specific diagnostic assay (MASDA).
This program combines forward dot-blot, complex simultaneous probe hybridization and direct mutation detection to help solve the dual problem of multiple sample analysis.
[9] The colored de Bruijn graphs can be used to genotype any DNA sample at known loci, even when the coverage is insufficient for variant assembly.
The algorithm then calculates the likelihood of each genotype, accounting for the structure of the graph in both the local and the genome-wide sequence.
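To illustrate the idea of comparing genotype likelihoods, the sketch below scores three genotypes at a biallelic site from the k-mer coverage on each allele. It assumes a deliberately simplified Poisson coverage model with a hypothetical `error_rate` parameter and ignores the graph-structure terms mentioned above, so it should be read as an illustration rather than as the algorithm itself.

```python
import math

def poisson_logpmf(count, mean):
    """Log of the Poisson probability of observing `count` given `mean` > 0."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)

def genotype_log_likelihoods(ref_cov, alt_cov, depth, error_rate=0.01):
    """Log-likelihood of each genotype given coverage on the two alleles.

    Assumed model: a homozygous allele is covered at the full depth, a
    heterozygous allele at half depth, and an absent allele only through
    errors, at depth * error_rate.
    """
    expected = {
        "ref/ref": (depth, depth * error_rate),
        "ref/alt": (depth / 2, depth / 2),
        "alt/alt": (depth * error_rate, depth),
    }
    return {
        genotype: poisson_logpmf(ref_cov, ref_mean) + poisson_logpmf(alt_cov, alt_mean)
        for genotype, (ref_mean, alt_mean) in expected.items()
    }

# Toy input (hypothetical): roughly balanced coverage on both alleles.
likelihoods = genotype_log_likelihoods(ref_cov=9, alt_cov=11, depth=20)
best = max(likelihoods, key=likelihoods.get)
```

With balanced coverage on both alleles the heterozygous genotype scores highest, whereas coverage concentrated on one allele favors the corresponding homozygous call.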