Theoretical and technological backgrounds

biological barcodes or comments encoding in living cells, the encoded DNA sequences should not share the same sequence space as the natural ones to avoid interference with cellular functions. In other words, they should be orthogonal to exclude biological relevance.

One unique feature of information storage in DNA is that many copies of the DNA molecules representing the same data are always synthesized. In other words, there is a high inherent data redundancy. In this study, using a novel way of adding error-detection codes block by block, an efficient self-error-detecting, three-base block encoding scheme (SED3B) was established that takes full advantage of this inherent redundancy for error correction. SED3B can effectively repress the error enrichment that emerges from DNA replication, as shown by in silico and experimental verification. With merely 30 sequences for error correction, the SED3B scheme can tolerate an error rate as high as 19.1%. Errors at a rate of 40% can still be corrected with 180 DNA sequences, as shown by in silico simulations.
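The block structure of SED3B is not described in this section, so the following sketch does not implement it. It only illustrates the general idea of exploiting copy redundancy, using a plain per-position majority vote over several reads of the same sequence. The function name `consensus` and the example reads are hypothetical, and the sketch assumes substitution errors only (all reads have the same length).

```python
from collections import Counter


def consensus(reads):
    """Return the per-position majority base over equal-length reads.

    Illustration of copy redundancy only; this is NOT the SED3B scheme.
    Ties are resolved in favour of the base seen first at that position.
    """
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads must have the same length")
    # zip(*reads) yields one column of bases per sequence position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


# Hypothetical reads of the same molecule, two of them with one substitution.
reads = ["ACGTAC", "ACGTTC", "ACCTAC", "ACGTAC"]
print(consensus(reads))  # prints ACGTAC
```

A majority vote of this kind fails once the same error is carried by most copies, which is the error enrichment during replication that the text says SED3B is designed to repress.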

Over 12,100 years of continuous replication are estimated to be required to make SED3B-encoded information in growing E. coli cells unrecoverable, as shown by in vivo, error-prone PCR experiments. In addition to having limited extreme GC content, limited homopolymers, and simple secondary structures, SED3B-encoded sequences also show very low biological relevance, as shown by comparative studies with naturally formed sequences. High error tolerance and low biological relevance make SED3B promising for orthogonal information encoding in living cells, e.g. as a comment language in cell programming or for biological barcode encoding. To facilitate the use of SED3B as a universal information encoding scheme in living cells, an online encoding-decoding system with examples of comment and barcode encoding has been implemented and released at http://biosystem.bt1.tu-harburg.de/sed3b/.

5.2 Theoretical and technological backgrounds

The focus of this part of the work is to design an encoding scheme for reliable digital data encoding in DNA with regard to the unique features of DNA as a data storage medium.

Many methods are available for data encoding in DNA, and they were briefly introduced in the previous section. In this section, four representative state-of-the-art methods released in recent years are described in detail, together with their merits and limitations. Other efforts on associative memory and DNA computing are not included because they were designed for purposes other than DNA information storage. In a recent study, Erlich et al. reported a storage strategy called DNA Fountain. They showed that 2.14 × 10⁶ bytes of encoded data could be retrieved 2.18 × 10¹⁵ times, indicating a highly robust system. However, this method is not detailed here because, in the author's view, the strategy is not applicable to in vivo applications.

5.2.1 The method of Church et al.

In their 2012 study, Church et al. used a “one bit per base” coding system with the bases “A/C” for zero and “G/T” for one [23]. To avoid extreme GC content, homopolymers, and secondary structures in the encoded DNA sequences, they applied a random disruption mechanism. They encoded an HTML draft of a book that included 53,426 words, 11 JPG images, and one JavaScript program into a 5.27-megabit bitstream, and all data blocks were recovered with merely 10 bit errors. The errors identified after sequencing are mainly due to the lack of an error correction mechanism. Furthermore, this method sacrifices half of the storage capacity, which in turn doubles the costs.
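The “one bit per base” mapping can be sketched in a few lines of Python. The mapping itself (A/C for zero, G/T for one) is taken from the text. How the choice between the two candidate bases is made is not specified here, so the uniform random choice below is an assumption standing in for the random disruption mechanism, and the function names are hypothetical.

```python
import random

ZERO_BASES = "AC"  # either base encodes a 0 bit
ONE_BASES = "GT"   # either base encodes a 1 bit


def encode_bits(bits, rng=None):
    """Encode a string of '0'/'1' characters with one bit per base.

    Assumption: the base is picked uniformly at random from the two
    candidates; the original selection rule is not given in the text.
    """
    rng = rng or random.Random()
    bases = []
    for bit in bits:
        if bit not in "01":
            raise ValueError(f"not a bit: {bit!r}")
        bases.append(rng.choice(ZERO_BASES if bit == "0" else ONE_BASES))
    return "".join(bases)


def decode_bases(seq):
    """Recover the bit string: A/C -> '0', G/T -> '1'."""
    bits = []
    for base in seq:
        if base in ZERO_BASES:
            bits.append("0")
        elif base in ONE_BASES:
            bits.append("1")
        else:
            raise ValueError(f"not a DNA base: {base!r}")
    return "".join(bits)


bits = "0110100001101001"
dna = encode_bits(bits, random.Random(0))
assert decode_bases(dna) == bits  # decoding ignores which candidate was chosen
assert len(dna) == len(bits)      # one base per bit, i.e. half of 2 bits/base
```

Because each base carries one bit instead of the two bits that four symbols could carry, the sketch also makes the 50% capacity loss mentioned above explicit.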

5.2.2 The method of Goldman et al.

In the study of Goldman et al., a base-3 encoding scheme was applied. Digital information was first converted to base 3 using a Huffman code that replaces each byte with five or six base-3 digits (trits) [24]. This in turn was converted in silico to DNA code by replacing each trit with one of the three nucleotides different from the previous one used, as shown in Table 5.1. DNA homopolymers are thereby eliminated at the cost of one fourth of the encoding capacity. However, this method cannot effectively avoid extreme GC content and complex secondary structures. Furthermore, to ensure full coverage of every fragment during sequencing, a fourfold redundancy was created by fragment overlapping, which resulted in an efficiency of (3/4)/4 = 18.75% without considering indexing and compression.
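The trit-to-nucleotide step can be sketched directly from Table 5.1. The lookup table below reproduces that table; the Huffman step that turns bytes into trits is not reproduced. The starting value of the "previous base" (`"A"`) and the function names are assumptions made for illustration.

```python
# Table 5.1: for each previously written base, the bases encoding trit 0, 1, 2.
NEXT_BASE = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def trits_to_dna(trits, previous="A"):
    """Map base-3 digits to nucleotides so that no base repeats its predecessor.

    `previous` is the base assumed to precede the first encoded base
    (an assumption of this sketch, not a value given in the text).
    """
    encoded = []
    for trit in trits:
        previous = NEXT_BASE[previous][trit]
        encoded.append(previous)
    return "".join(encoded)


def dna_to_trits(seq, previous="A"):
    """Invert trits_to_dna using the same starting base."""
    trits = []
    for base in seq:
        trits.append(NEXT_BASE[previous].index(base))
        previous = base
    return trits


trits = [0, 0, 1, 2, 2, 1]
dna = trits_to_dna(trits)
print(dna)  # prints CGATGA
assert dna_to_trits(dna) == trits
assert all(a != b for a, b in zip(dna, dna[1:]))  # no homopolymers
```

Since each row of the table never contains the previous base itself, adjacent bases always differ, which is why homopolymers cannot occur in the encoded sequence.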

Together with a simple parity check for single-base error detection and 1.2×10⁵ copies of each DNA string, the information could be recovered without any errors (1.2×10⁷ copies of each DNA string were actually used in Goldman’s experiments, and the authors assumed that 1% of them would be enough for reliable information storage). However, such high coverage reduces the data density and raises the cost of information storage in DNA.

5.2.3 The method of Grass et al.

In 2015, Grass et al. reported an encoding strategy that applies Reed–Solomon (RS) coding to data storage in DNA [245]. First, two bytes of a digital file are mapped to three elements of the Galois field of size 47 (GF(47)) by base conversion (256² to 47³). Second, RS codes are employed to add redundancy A to the individual blocks. Finally, the data blocks are converted into DNA by mapping every element of GF(47) to three nucleotides using the GF(47)-to-DNA codon wheel shown in Figure 5.1, thereby guaranteeing that no base is repeated more than three times. By encapsulating the DNA in an inorganic matrix, they estimated that information can be stored reliably in DNA for 2000 years, which is far beyond the capabilities of traditional digital information storage media (<50 years).

Table 5.1 Base-3 to DNA encoding ensuring no repeated nucleotides in Goldman’s method

Previous base written    Next base to encode
                         0    1    2
A                        C    G    T
C                        G    T    A
G                        T    A    C
T                        A    C    G
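The first step, the base conversion from 256² to 47³, works because 47³ = 103,823 is at least 256² = 65,536, so every pair of bytes fits into three GF(47) elements. A minimal sketch follows. The digit order (most significant first) and the function names are assumptions; the RS coding and the codon wheel of Figure 5.1 are not reproduced because their details are not given in this section.

```python
FIELD_SIZE = 47  # size of GF(47), as stated in the text

assert FIELD_SIZE ** 3 >= 256 ** 2  # 103823 >= 65536, so the mapping is lossless


def bytes_to_gf47(b0, b1):
    """Convert two bytes into three base-47 digits (most significant first)."""
    if not (0 <= b0 < 256 and 0 <= b1 < 256):
        raise ValueError("both inputs must be byte values (0..255)")
    value = b0 * 256 + b1  # integer in 0..65535
    d0, remainder = divmod(value, FIELD_SIZE ** 2)
    d1, d2 = divmod(remainder, FIELD_SIZE)
    return d0, d1, d2


def gf47_to_bytes(d0, d1, d2):
    """Invert bytes_to_gf47; rejects digit triples that encode no byte pair."""
    if not all(0 <= d < FIELD_SIZE for d in (d0, d1, d2)):
        raise ValueError("digits must be in 0..46")
    value = (d0 * FIELD_SIZE + d1) * FIELD_SIZE + d2
    if value >= 256 ** 2:
        raise ValueError("triple does not correspond to two bytes")
    return divmod(value, 256)


print(bytes_to_gf47(72, 105))  # the bytes of "Hi"; prints (8, 18, 19)
assert gf47_to_bytes(8, 18, 19) == (72, 105)
```

Only 65,536 of the 103,823 possible triples are used by this conversion, so a decoder can treat an out-of-range triple as a sign of corruption.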

Fig. 5.1 GF(47) to DNA codon wheel for mapping every element of GF(47) to three nucleotides


Table 5.2 Comparison of capabilities of currently available encoding schemes for digital information storage in DNA

Feature                    Church et al. 2012   Goldman et al. 2013   Grass et al. 2015   This study
Extreme GC                 Yes                  No                    No                  GC% <66.7%
Length of homopolymers     Up to 1              Up to 1               Up to 3             Up to 3G, 5A, 7T, 5C
Secondary structures       Yes                  No                    No                  Yes
Error correction codes     No                   Parity checking       RS codes            SED3B
Encoding efficiencyⁱ       50%                  18.75%                63.37%              66.70%
Low biological relevance   No                   No                    No                  Yes

ⁱ The encoding efficiencies were calculated without consideration of indexing and compression.
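Two of the criteria compared in Table 5.2, GC content and homopolymer length, can be measured on any candidate sequence with a short helper. The sketch below is only a generic check, not part of any of the compared schemes; the function names and the example sequence are hypothetical.

```python
from itertools import groupby


def gc_percent(seq):
    """Return the percentage of G and C bases in a DNA sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymers(seq):
    """Return, for each base, the length of its longest uninterrupted run."""
    longest = {base: 0 for base in "ACGT"}
    # groupby yields one group per run of identical consecutive characters.
    for base, run in groupby(seq.upper()):
        if base in longest:
            longest[base] = max(longest[base], sum(1 for _ in run))
    return longest


seq = "ATGGGCAAAAT"  # hypothetical example sequence
print(round(gc_percent(seq), 1))   # prints 36.4
print(longest_homopolymers(seq))   # prints {'A': 4, 'C': 1, 'G': 3, 'T': 1}
```

Comparing the returned values with the limits in the last column of Table 5.2 (GC% below 66.7%, runs of at most 3 G, 5 A, 7 T, or 5 C) gives a quick consistency check for a sequence claimed to follow those limits.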

From the document: Analysis and engineering of biomolecules and microorganisms: from genome-scale study of pathogens to programming of DNA and cells (pages 117-121)