Compression of genomic sequencing data

[1][2][3] With the availability of a reference template, only differences (e.g., single nucleotide substitutions and insertions/deletions) need to be recorded, thereby greatly reducing the amount of information to be stored.

The cost is the modest arithmetic calculation required to recover the absolute coordinates plus the storage of the correction factor (‘123’ in this example).

Encoding schemes are used to convert coordinate integers into binary form to provide additional compression gains.

A universal approach to compressing genomic data may not necessarily be optimal, as a particular method may be more suitable for specific purposes and aims.

[4] Brandon et al. (2009)[4] alluded to the potential use of ethnic group-specific reference sequence templates, using the compression of mitochondrial DNA variant data as an example (see Figure 2).

Their result suggests that the revised Cambridge Reference Sequence may not always be optimal because a greater number of variants need to be stored when it is used against data from ethnically distant individuals.

Additionally, a reference sequence can be designed based on statistical properties [1][4] or engineered [11][12] to improve the compression ratio.

Figure 1: The principal steps of a workflow for compressing genomic re-sequencing data: (1) processing of the original sequencing data (e.g., reducing the original dataset to only variations relative to a specified reference sequence; (2) Encoding the processed data into binary form; and (3) decoding the data back to text form.