
5.3 Principles of a self-error-detecting, three-base block encoding scheme (SED3B)

To fully exploit the redundancy of DNA molecules for error correction, a novel self-error-detecting, three-base block (SED3B) encoding scheme was proposed for effective and flexible error correction. In detail, binary data are first converted into data-encoding DNA bases four bits at a time, two bases per four bits, using the scheme shown in Figure 5.2. One error-checking base is then inserted after every two data-encoding bases, so that the encoded string consists of three-base blocks.
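The two encoding steps can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the 2-bit-to-base mapping (00→A, 01→C, 10→G, 11→T) is an assumption, since the actual assignment is defined in Figure 5.2, and the names `bits_to_pairs` and `to_blocks` are hypothetical.

```python
# Assumed 2-bit-to-base mapping; the real assignment is given in Figure 5.2.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}


def bits_to_pairs(bits):
    """Split a bit string into 4-bit chunks and map each chunk to two bases."""
    if len(bits) % 4 != 0:
        raise ValueError("bit string length must be a multiple of 4")
    pairs = []
    for i in range(0, len(bits), 4):
        chunk = bits[i:i + 4]
        pairs.append(BITS_TO_BASE[chunk[:2]] + BITS_TO_BASE[chunk[2:]])
    return pairs


def to_blocks(pairs, check_base):
    """Append one error-checking base to every pair of data-encoding bases.

    `check_base` is any function mapping a two-base string to one base;
    how that base is chosen is the subject of the rest of this section.
    """
    return [pair + check_base(pair) for pair in pairs]


if __name__ == "__main__":
    pairs = bits_to_pairs("0001101100011011")
    print(pairs)
    # "N" is a placeholder check base used only to show the block layout.
    print(to_blocks(pairs, lambda pair: "N"))
```

Keeping the check-base choice as a separate function reflects the structure of the scheme: the data-encoding bases are fixed by the input bits, while the third base is assigned by a rule that can be changed without touching the data.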

The third base is designed to detect whether errors have occurred in the two data-encoding bases. A simple way to implement error checking with the third base is the checksum principle [252].

However, the checksum method offers no way to optimize against homopolymers and extreme GC content. Instead, a novel strategy was used to enable error checking by the third base. First, all 16 possible two-base combinations were divided into four groups such that no two combinations within a group share the same base at either the first or the second position. Each group was then assigned an error-detecting base, as shown in Figure 5.2. Consequently, if an error occurs in any one of the three bases, the two data-encoding bases and the error-detecting base no longer match. In other words, a single-base error at any of the three positions can be detected.
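The grouping property and its consequence can be checked exhaustively. The sketch below uses a hypothetical grouping and check-base assignment that satisfies the stated property; it is not the table of Figure 5.2, and all names are illustrative.

```python
from itertools import product

BASES = "ACGT"

# Hypothetical grouping: check base -> the four two-base combinations it covers.
# This is NOT the assignment of Figure 5.2, only one that fulfils the property.
GROUPS = {
    "T": ["AA", "TT", "GC", "CG"],
    "A": ["AT", "TA", "GG", "CC"],
    "C": ["AC", "CA", "GT", "TG"],
    "G": ["AG", "GA", "CT", "TC"],
}
CHECK = {pair: base for base, pairs in GROUPS.items() for pair in pairs}


def group_is_valid(pairs):
    """True if no two pairs share a base at the first or the second position."""
    firsts = {p[0] for p in pairs}
    seconds = {p[1] for p in pairs}
    return len(firsts) == 4 and len(seconds) == 4


def block_is_consistent(block):
    """True if the third base matches the check base of the first two."""
    return CHECK[block[:2]] == block[2]


def all_single_errors_detected():
    """Try every single-base substitution in every valid three-base block."""
    for pair, check in CHECK.items():
        block = pair + check
        for pos, new in product(range(3), BASES):
            if new == block[pos]:
                continue
            mutated = block[:pos] + new + block[pos + 1:]
            if block_is_consistent(mutated):
                return False
    return True


if __name__ == "__main__":
    print(all(group_is_valid(pairs) for pairs in GROUPS.values()))
    print(all_single_errors_detected())
```

The reason the check works is visible in the grouping: changing one data base while keeping the other always yields a combination from a different group, which therefore carries a different check base.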

To avoid extreme GC content, long homopolymers and complex secondary structures in the encoded DNA strings, three additional principles are followed when assigning error-checking bases to the four groups of two-base combinations: 1) no more than two G/C bases in a three-base block, to avoid extreme GC content; 2) no three-base block consisting of three identical bases, to avoid long homopolymers; 3) no complementarily matched three-base blocks. However, principles 1) and 2) cannot be satisfied simultaneously. To address this issue, two different rules were introduced for assigning error-detecting bases, as shown in the rows labelled “error detecting base rules” in Figure 5.2. Rule I satisfies the principle that no more than two G/C bases are present in a three-base block, but it does contain a “TTT” homopolymer. Rule II eliminates all three-base homopolymers but allows blocks in which all three bases are G/C. During encoding, Rule I is used by default. Only when a “TTT” block is produced does the assignment of the error-detecting base switch temporarily to Rule II; after one block has been encoded, it switches back to Rule I. Thus, continuous “T” homopolymers are avoided because the error-detecting base for “TT” is “G” rather than “T” in Rule II, and extreme GC content is also avoided because blocks with three G/C bases can only occur under Rule II, that is, only if the preceding three-base block is “TTT”. As a result, the encoded DNA strings can contain runs of at most seven consecutive “T”, five consecutive “A” or “C”, and three consecutive “G”, which has been shown to be acceptable for current DNA synthesis and sequencing technologies [26]. The GC content can be kept below 67.7%. Since two-thirds of the total bases are used for data encoding, the SED3B scheme has a theoretical encoding efficiency of 66.7%, not counting the overhead of addressing and compression.
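The rule switching can be sketched as a small state machine. Both rule tables below are hypothetical: they were chosen only to reproduce the properties stated in the text (Rule I has at most two G/C per block but contains “TTT”; Rule II assigns “G” to “TT”, contains no homopolymer block and allows all-G/C blocks) and are not the tables of Figure 5.2. Principle 3) is not modelled, and the function names are illustrative.

```python
# The four groups of two-base combinations (hypothetical, see lead-in).
G1 = ("AA", "TT", "GC", "CG")
G2 = ("AT", "TA", "GG", "CC")
G3 = ("AC", "CA", "GT", "TG")
G4 = ("AG", "GA", "CT", "TC")


def make_table(assignment):
    """Turn {check_base: group} into a lookup {two_base_pair: check_base}."""
    return {pair: base for base, pairs in assignment.items() for pair in pairs}


# Hypothetical rule tables, NOT those of Figure 5.2.
RULE_I = make_table({"T": G1, "A": G2, "C": G3, "G": G4})
RULE_II = make_table({"G": G1, "A": G2, "T": G3, "C": G4})


def encode_pairs(pairs):
    """Append check bases, using Rule II only for the block right after 'TTT'."""
    out = []
    use_rule_ii = False
    for pair in pairs:
        table = RULE_II if use_rule_ii else RULE_I
        block = pair + table[pair]
        out.append(block)
        # Rule II never yields "TTT", so the state returns to Rule I by itself.
        use_rule_ii = block == "TTT"
    return "".join(out)


def find_bad_blocks(dna):
    """Return indices of blocks whose check base does not match the active rule."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    bad = []
    use_rule_ii = False
    for i in range(0, len(dna), 3):
        block = dna[i:i + 3]
        table = RULE_II if use_rule_ii else RULE_I
        if table[block[:2]] != block[2]:
            bad.append(i // 3)
        use_rule_ii = block == "TTT"
    return bad


if __name__ == "__main__":
    print(encode_pairs(["GC"]))               # Rule I only
    print(encode_pairs(["TT", "TT", "TT"]))   # Rule I, Rule II, Rule I
    print(find_bad_blocks("TTTTTGTAT"))
```

One limitation of this sketch is that the checker infers the active rule from the received blocks, so an error that creates or destroys a “TTT” block can also cause the following block to be judged under the wrong rule.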

Fig. 5.2 Illustration of encoding binary data into a DNA string using the SED3B encoding scheme.

5.4 High error tolerance revealed by in silico simulations

To test the error detection capability, random errors were introduced at different rates into SED3B-encoded DNA fragments, and the percentage of errors that could be detected by SED3B was calculated. As shown by the green triangles in Figure 5.3, more than 90% of errors can be detected when the error rate is below 10%, and 78% of errors can still be detected even when the error rate is as high as 30%. The error rates after error repression, shown as red crosses, are more than one order of magnitude lower than the untreated ones.

Next, the error correction capability was tested using varying numbers of DNA sequences. Simulations with 10 and 100 DNA sequences were first performed separately. As shown in Figure 5.4, the error tolerance with 100 DNA sequences is higher than with 10 DNA sequences, as expected. Error rates of up to 5% are tolerated using 10 DNA sequences and up to 33% using 100 DNA sequences.

To estimate the number of sequences required for reliable correction at a specific error rate, a series of simulations was performed with error rates ranging from 1% to 40% in steps of 1%. At each simulated error rate, the simulation started with a small number of sequences and attempted to retrieve the data in 500 iterations. If errors occurred in any of the 500 iterations, the number of sequences was increased by one, and the process was repeated until no errors occurred in any of the 500 iterations. As shown in Figure 5.5, although the required number of sequences increased exponentially with the error rate, 200 sequences were sufficient to correct a very high error rate of 40%.
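The kind of simulation described above can be sketched as follows. This is not the simulation code used in this work: it assumes substitution errors only, a single hypothetical check-base table without rule switching, a per-block (rather than per-base) detection measure, and a simple majority vote over the copies whose block passes the check. All names are illustrative.

```python
import random
from collections import Counter

BASES = "ACGT"

# Hypothetical check-base table (NOT the table of Figure 5.2).
GROUPS = {
    "T": ["AA", "TT", "GC", "CG"],
    "A": ["AT", "TA", "GG", "CC"],
    "C": ["AC", "CA", "GT", "TG"],
    "G": ["AG", "GA", "CT", "TC"],
}
CHECK = {pair: base for base, pairs in GROUPS.items() for pair in pairs}


def encode(pairs):
    """Append the check base to each pair of data-encoding bases."""
    return [pair + CHECK[pair] for pair in pairs]


def is_consistent(block):
    return CHECK[block[:2]] == block[2]


def mutate(blocks, rate, rng):
    """Substitute each base independently with probability `rate`."""
    noisy = []
    for block in blocks:
        chars = [
            rng.choice([b for b in BASES if b != c]) if rng.random() < rate else c
            for c in block
        ]
        noisy.append("".join(chars))
    return noisy


def detection_rate(original, noisy):
    """Fraction of corrupted blocks that fail the check-base test."""
    corrupted = [n for o, n in zip(original, noisy) if o != n]
    if not corrupted:
        return 1.0
    return sum(not is_consistent(n) for n in corrupted) / len(corrupted)


def consensus(copies):
    """Per block position, majority vote over copies that pass the check."""
    recovered = []
    for column in zip(*copies):
        votes = Counter(block[:2] for block in column if is_consistent(block))
        recovered.append(votes.most_common(1)[0][0] if votes else None)
    return recovered


if __name__ == "__main__":
    rng = random.Random(0)
    pairs = [rng.choice(BASES) + rng.choice(BASES) for _ in range(1000)]
    original = encode(pairs)
    for rate in (0.01, 0.10, 0.30):
        noisy = mutate(original, rate, rng)
        print(rate, round(detection_rate(original, noisy), 3))
    copies = [mutate(original, 0.05, rng) for _ in range(10)]
    print(consensus(copies) == pairs)
```

Discarding blocks that fail the check before voting is what allows additional copies to raise the tolerated error rate: a corrupted block only influences the result if it happens to pass the check.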

Fig. 5.3 Error detection and repression by using the SED3B encoding scheme.

▲ Percentage of errors detected by the SED3B method. + Percentage of errors remaining in the DNA fragments after removal of the detected errors. × Percentage of random errors introduced during the simulations. Errors were introduced into the DNA fragments randomly, base by base. Error rates from 1% to 30% were simulated in steps of 1%. At each step, random errors were introduced at the specified error rate, and each step was repeated 500 times. More than 90% of errors could be detected when the error rate was below 10%, and more than 78% of errors were still detected even at an error rate as high as 30%. The error rates after error repression, shown as red crosses, are more than one order of magnitude lower than the untreated ones.

98 Orthogonal information encoding in living cells

Fig. 5.4 Error correction capability using multiple DNA sequences encoded with the SED3B encoding scheme.

+ Percentage of errors introduced during the simulations. ■ Percentage of errors remaining in the DNA strings after removal of the detected errors. ▲ Percentage of errors in the finally recovered information when 10 error-containing DNA strings were used for information retrieval. × Percentage of errors in the finally recovered information when 100 DNA strings were used for information retrieval. Errors were introduced into the DNA fragments randomly, base by base. Error rates from 1% to 40% were simulated in steps of 1%. At each step, random errors were introduced at the specified error rate, and each step was repeated 500 times.

5.5 SED3B encoded DNA sequences show low biological

From the document: Analysis and engineering of biomolecules and microorganisms: from genome-scale study of pathogens to programming of DNA and cells (pages 121-124)