
A Machine Discovery from Amino Acid Sequences by Decision Trees over Regular Patterns

5 PAC-Learnable Class

This section provides a theoretical foundation for the classes of sets classified by decision trees over regular patterns from the viewpoint of algorithmic learning theory [Valiant 1984].

For integers k, d ≥ 0, we consider decision trees over k-variable regular patterns whose depth is at most d. We denote by DTRP(d, k) the class of languages defined by decision trees over k-variable regular patterns with depth at most d.
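To fix the objects concretely, here is a minimal Python sketch of such a tree. The representation of a regular pattern by the tuple of its constant segments (one variable between consecutive segments, substituted by nonempty strings) and all identifiers are our illustrative assumptions, not notation from this paper; following Lemma 1 below, the left subtree is the branch taken when a string does not match the pattern.

    import re
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        """A decision tree over regular patterns: internal nodes are labelled
        by a regular pattern, leaves by a class label in {0, 1}."""
        segments: Optional[tuple] = None  # constant segments w0, ..., wm; None at a leaf
        left: Optional["Node"] = None     # T0: taken when the string does NOT match
        right: Optional["Node"] = None    # T1: taken when the string matches
        label: int = 0                    # class label, used only at leaves

    def matches(segments, w):
        """Decide whether w matches the regular pattern w0 x1 w1 ... xm wm.
        Each variable occurs only once, so matching reduces to one regex with
        a (.+) per variable (assuming variables stand for nonempty strings)."""
        regex = "(.+)".join(re.escape(seg) for seg in segments)
        return re.fullmatch(regex, w) is not None

    def classify(tree, w):
        """Membership test w ∈ L(T): descend right on a match, left otherwise."""
        while tree.segments is not None:
            tree = tree.right if matches(tree.segments, w) else tree.left
        return tree.label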

Theorem 1 DTRP(d, k) is polynomial-time learnable for all d, k ≥ 0.

We need some terminology for the above theorem.

When we are concerned with learning, we call a subset of Σ* a concept. A concept class C is a nonempty collection of concepts. For a concept c ∈ C, a pair (x, c(x)) is called an example of c for x ∈ Σ*, where c(x) = 1 (c(x) = 0) if x is in c (is not in c). For an alphabet Σ and an integer n ≥ 0, Σ^{≤n} denotes the set {x ∈ Σ* | |x| ≤ n}.

A concept class C is said to be polynomial-time learnable [Blumer et al. 1989, Natarajan 1989, Valiant 1984] if there is an algorithm A which satisfies (1) and (2).

(1) A takes a sequence of examples as an input and runs in polynomial time with respect to the length of the input.

(2) There exists a polynomial p(·, ·, ·) such that for any integer n ≥ 0, any concept c ∈ C, any real numbers ε, δ (0 < ε, δ < 1), and any probability distribution P on Σ^{≤n}, if A takes p(n, 1/ε, 1/δ) examples which are generated randomly according to P, then A outputs, with probability at least 1 − δ, a representation of a hypothesis h with P(c ⊕ h) < ε, where c ⊕ h is the symmetric difference of c and h.

Theorem 2 [Blumer et al. 1989, Natarajan 1989] A concept class C is polynomial-time learnable if the following conditions hold.

1. C is of polynomial dimension, i.e., there is a polynomial d(n) such that |{c ∩ Σ^{≤n} | c ∈ C}| ≤ 2^{d(n)} for all n ≥ 0.

2. There is a polynomial-time algorithm, called a polynomial-time hypothesis finder for C, which produces a hypothesis from a sequence of examples such that the hypothesis is consistent with the given examples.

Moreover, a polynomial-time hypothesis finder for C is a learning algorithm satisfying (1) and (2) if C satisfies condition 1.
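For intuition, the sample size in (2) that a consistent hypothesis finder needs can be made explicit; the following is the standard argument from [Blumer et al. 1989] restated in our notation, not a new claim of this paper. A fixed hypothesis h with P(c ⊕ h) ≥ ε is consistent with m random examples with probability at most (1 − ε)^m ≤ e^{−εm}, so a union bound over the at most 2^{d(n)} concepts of C restricted to Σ^{≤n} shows that

    m ≥ (1/ε) (d(n) ln 2 + ln(1/δ))

examples suffice to push the total failure probability 2^{d(n)} e^{−εm} below δ; this m is polynomial in n, 1/ε, and 1/δ whenever d(n) is a polynomial.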

The following lemma can be easily shown.

Lemma 1 Let T be a decision tree over regular patterns and Tv be the subtree of T at node v. We denote Tv by π(T0, T1), where π is the label of node v and T0, T1 are the left and right subtrees of Tv, respectively. Let S be a set of strings and let T' be the tree obtained from T by replacing Tv with T0 at node v. If no string in S matches π, then L(T) ∩ S = L(T') ∩ S.
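A short sketch of the pruning move that Lemma 1 licenses, reusing the hypothetical Node and matches helpers introduced above:

    def prune(tree, S):
        """If no string in S matches the pattern at a node, Lemma 1 lets us
        replace the subtree π(T0, T1) by T0 without changing how any string
        in S is classified."""
        if tree.segments is None:                        # leaf: nothing to prune
            return tree
        if not any(matches(tree.segments, s) for s in S):
            return prune(tree.left, S)                   # every s in S takes the T0 branch anyway
        tree.left = prune(tree.left, S)
        tree.right = prune(tree.right, S)
        return tree

This is precisely the observation the proof of Theorem 1 exploits: on a fixed set of strings, patterns that match none of them can be eliminated.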

Proof of Theorem 1.

First we show that the concept class DTRP(d, k) is of polynomial dimension. Let DTRP(d, k)_n = {L ∩ Σ^{≤n} | L ∈ DTRP(d, k)} for n ≥ 0. We evaluate the cardinality of DTRP(d, k)_n. Let π be a regular pattern with |π| > n + k; then no string of length at most n matches π. By Lemma 1, we need to consider only regular patterns of length at most n + k. The number of such patterns is roughly bounded by (|Σ| + 1)^{n+k}. Since a tree of depth at most d has at most 2^d − 1 internal nodes and at most 2^d leaves, |DTRP(d, k)_n| ≤ ((|Σ| + 1)^{n+k})^{2^d − 1} · 2^{2^d}. This shows that the dimension of DTRP(d, k)_n is O(n).
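Taking logarithms makes the claimed dimension explicit (recall that d and k are fixed constants):

    log_2 |DTRP(d, k)_n| ≤ (2^d − 1)(n + k) log_2(|Σ| + 1) + 2^d = O(n).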

Next we show that there is a polynomial-time hypothesis finder for DTRP(d, k). Let P and N be the sets of strings which appear in positive and negative examples, respectively. Let Π(k, P, N) be the set of regular patterns π, considered up to renaming of variables, such that π contains at most k variable occurrences and the string obtained from π by erasing its variables is a substring of some s in P ∪ N. By Lemma 1, we need to consider only patterns in Π(k, P, N) in order to find a decision tree over regular patterns which is consistent with P and N. Then |Π(k, P, N)| ≤ Σ_{s ∈ P∪N} (|s|^2)^{k+1}. Therefore the number of possible trees is bounded by |Π(k, P, N)|^{2^d − 1} · 2^{2^d}, which is bounded by a polynomial with respect to the input length Σ_{s ∈ P∪N} |s|.
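A brute-force enumeration of Π(k, P, N) under this reading (the constant part of each candidate is a substring of some example, and at most k variables are inserted at distinct positions); the helper below is our illustration, not the authors' code:

    from itertools import combinations

    def candidate_patterns(k, strings):
        """Enumerate regular patterns, as tuples of constant segments, whose
        variable-erased form is a substring of some example string and which
        contain at most k variables.  An empty end segment encodes a leading
        or trailing variable."""
        patterns = set()
        for s in strings:
            for i in range(len(s)):
                for j in range(i + 1, len(s) + 1):
                    sub = s[i:j]
                    for m in range(k + 1):                    # number of variables
                        for cuts in combinations(range(len(sub) + 1), m):
                            bounds = (0, *cuts, len(sub))
                            patterns.add(tuple(sub[a:b] for a, b in zip(bounds, bounds[1:])))
        return patterns

The loop structure mirrors the count in the proof: O(|s|^2) substrings per string and O(|s|^k) choices of cut positions per substring stay within the Σ_{s ∈ P∪N} (|s|^2)^{k+1} bound.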

It is known that, given a regular pattern π and a string w, we can decide in polynomial time whether w matches π or not. Therefore, given a string w and a decision tree T over k-variable regular patterns whose depth is at most d, we can decide whether w ∈ L(T) or not in polynomial time.

The required polynomial-time algorithm enumerates the decision trees T over regular patterns in Π(k, P, N) with depth at most d. Then it checks whether s ∈ L(T) for each s ∈ P and t ∉ L(T) for each t ∈ N. If such a tree T is found, the algorithm outputs T as a hypothesis. □
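Putting the hypothetical helpers together, the enumeration algorithm of the proof can be sketched as follows; its running time grows roughly as |Π(k, P, N)|^{2^d}, polynomial in the input length only because d and k are fixed, which is exactly the impracticality noted next.

    def consistent(tree, P, N):
        """Check that the tree accepts every positive and rejects every negative example."""
        return all(classify(tree, s) == 1 for s in P) and \
               all(classify(tree, t) == 0 for t in N)

    def find_hypothesis(d, k, P, N):
        """Enumerate decision trees of depth at most d over patterns in
        Π(k, P, N) and return the first consistent one (None if none exists)."""
        pats = candidate_patterns(k, set(P) | set(N))

        def trees(depth):
            yield Node(label=0)                       # the two possible leaves
            yield Node(label=1)
            if depth > 0:
                for pi in pats:
                    for t0 in trees(depth - 1):       # left subtree: no match
                        for t1 in trees(depth - 1):   # right subtree: match
                            yield Node(segments=pi, left=t0, right=t1)

        return next((t for t in trees(d) if consistent(t, P, N)), None)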

We should say that the polynomial-time learning algorithm in the proof of Theorem 1 consumes an enormous amount of time and is not suited for practical use.

We may understand the relationship of Algorithms 1 and 2 in Section 3 to Theorem 1 in the following way: when we set Π(P, N) to be the family of k-variable regular patterns made from P and N, Algorithms 1 and 2 run sufficiently fast in practical use (of course, in polynomial time) and produce a decision tree over k-variable regular patterns which classifies the given positive and negative examples. But the produced decision tree is not guaranteed to be of depth at most d. Hence, these algorithms are not learning algorithms in the exact sense of (2).

However, experience tells us that these algorithms usually find small enough decision trees over regular patterns in our experiments on transmembrane domains.

For the class DTRP(d, k), Theorem 2 asserts that if a polynomial-time algorithm A produces a decision tree over k-variable regular patterns with depth at most d which classifies given positive and negative examples, then A is a polynomial-time learning algorithm. In this sense, we may say that Algorithms 1 and 2 are polynomial-time algorithms for DTRP(d, k) which often produce reasonable hypotheses, although there is no mathematical proof showing how often such small hypotheses are obtained. This aspect is very important and useful when we are concerned with machine discovery.

Ehrenfeucht and Haussler [1989] have considered learning of decision trees of a fixed rank. For learning decision trees over regular patterns, the restriction by rank can be shown to be meaningless. Instead, we consider the depth of a decision tree. It is also reasonable to put a restriction on the number of variables in a regular pattern. It has been shown that the class of regular pattern languages is not polynomial-time learnable unless NP = RP [Miyano et al. 1991]. Therefore, unless restrictions such as a bound on the number of variables in a regular pattern are given, we may not expect any positive results for polynomial-time learning.

6 Conclusion

We have shown that the idea of combining regular patterns and decision trees works quite well for transmembrane domain identification. The experiments have also shown the importance of negative motifs.

A union of regular patterns can be regarded as a special form of a decision tree called a decision list. We have reported in [Arikawa et al. 1992] that a union of a small number of regular patterns can also recognize transmembrane domains with high accuracy. However, the time spent in finding hypotheses in [Arikawa et al. 1992] is much larger than that reported in this paper.

Our system constructs a decision tree over regular patterns just from strings called positive and negative examples. We need not take care of which attributes to specify, as in ID3. Therefore it can be applied to other classification problems for proteins and DNA sequences.

We believe that our approach provides a new application of algorithmic learning to Molecular Biology.

We are now in the process of examining our method on some other related problems such as predicting the secondary structure of proteins.

References

[Arikawa et al. 1992] S. Arikawa, S. Kuhara, S. Miyano, A. Shinohara and T. Shinohara. A Learning Algorithm for Elementary Formal Systems and its Experiments on Identification of Transmembrane Domains. In Proc. 25th Hawaii Int. Conf. on System Sciences, IEEE, Hawaii, 1992, pp. 675-684.

[Bairoch 1991] A. Bairoch. PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Res., Vol. 19 (1991), pp. 2241-2245.

[Blumer et al. 1989] A. Blumer, A. Ehrenfeucht, D. Haussler and M.K. Warmuth. Learnability and the Vapnik-Chervonenkis Dimension. JACM, Vol. 36 (1989), pp. 929-965.

[Ehrenfeucht and Haussler 1989] A. Ehrenfeucht and D. Haussler. Learning Decision Trees from Random Examples. Inform. Comput., Vol. 82 (1989), pp. 231-246.

[Engelman et al. 1986] D.M. Engelman, T.A. Steitz and A. Goldman. Identifying Nonpolar Transbilayer Helices in Amino Acid Sequences of Membrane Proteins. Ann. Rev. Biophys. Biophys. Chem., Vol. 15 (1986), pp. 321-353.

[Gusev and Chuzhanova 1990] V. Gusev and N. Chuzhanova. The Algorithms for Recognition of the Functional Sites in Genetic Texts. In Proc. 1st Workshop on Algorithmic Learning Theory, Tokyo, 1990, pp. 109-119.

[Hartmann et al. 1989] E. Hartmann, T.A. Rapoport and H.F. Lodish. Predicting the Orientation of Eukaryotic Membrane-Spanning Proteins. Proc. Natl. Acad. Sci. USA, Vol. 86 (1989), pp. 5786-5790.

[Holley and Karplus 1989] L.H. Holley and M. Karplus. Protein Secondary Structure Prediction with a Neural Network. Proc. Natl. Acad. Sci. USA, Vol. 86 (1989), pp. 152-156.

[Kyte and Doolittle 1982] J. Kyte and R.F. Doolittle. A Simple Method for Displaying the Hydropathic Character of a Protein. J. Mol. Biol., Vol. 157 (1982), pp. 105-132.


[Lipp et al. 1989] J. Lipp, N. Flint, M.T. Haeuptle and B. Dobberstein. Structural Requirements for Membrane Assembly of Proteins Spanning the Membrane Several Times. J. Cell Biol., Vol. 109 (1989), pp. 2013-2022.

[Miyano et al. 1991] S. Miyano, A. Shinohara and T. Shinohara. Which Classes of Elementary Formal Systems are Polynomial-Time Learnable? In Proc. 2nd Workshop on Algorithmic Learning Theory, Tokyo, 1991, pp. 139-150.

[Natarajan 1989] B.K. Natarajan. On Learning Sets and Functions. Machine Learning, Vol. 4 (1989), pp. 67-97.

[PIR] Protein Identification Resource, National Biomedical Research Foundation.

[Quinlan 1986] J.R. Quinlan. Induction of Decision Trees. Machine Learning, Vol. 1 (1986), pp. 81-106.

[Quinlan and Rivest 1989] J.R. Quinlan and R.L. Rivest. Inferring Decision Trees Using the Minimum Description Length Principle. Inform. Comput., Vol. 80 (1989), pp. 227-248.

[Rao and Argos 1986] J.K.M. Rao and P. Argos. A Conformational Preference Parameter to Predict Helices in Integral Membrane Proteins. Biochim. Biophys. Acta, Vol. 869 (1986), pp. 197-214.

[Shinohara 1982] T. Shinohara. Polynomial Time Inference of Pattern Languages and its Applications. In Proc. 7th IBM Symp. Mathematical Foundations of Computer Science, 1982, pp. 191-209.

[Shinohara 1983] T. Shinohara. Polynomial Time Inference of Regular Pattern Languages. In Proc. RIMS Symp. Software Science and Engineering (Lecture Notes in Computer Science, Vol. 147), 1983, pp. 115-127.

[Utgoff 1989] P.E. Utgoff. Incremental Induction of Decision Trees. Machine Learning, Vol. 4 (1989), pp. 161-186.

[Valiant 1984] L. Valiant. A Theory of the Learnable. Commun. ACM, Vol. 27 (1984), pp. 1134-1142.

[von Heijne 1988] G. von Heijne. Transcending the Impenetrable: How Proteins Come to Terms with Membranes. Biochim. Biophys. Acta, Vol. 947 (1988), pp. 307-333.

[Wu et al. 1990] C.H. Wu, G.M. Whitson and G.J. Montllor. PROCANS: A Protein Classification System Using a Neural Network. In Proc. IJCNN Int. Joint Conf. on Neural Networks, Vol. 2 (1990), pp. 91-96.
