Decision Tree and Automata Learning
Stefan Edelkamp
1 Overview
- Decision tree representation
- Top-down induction, attribute selection: entropy, information gain
- ID3 learning algorithm, overfitting
- Continuous-valued, many-valued, and costly attributes; unknown attribute values
- Grammar and DFA learning
- Angluin's ID algorithm
2 Decision Tree Learning
PlayTennis:

Outlook
├─ Sunny → Humidity
│    ├─ High → No
│    └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
     ├─ Strong → No
     └─ Weak → Yes
Training Examples:
Day Outlook Temp. Humidity Wind Play?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Decision Trees
DT representation: each internal node tests an attribute, each branch corresponds to an attribute value, each leaf node assigns a classification

When to consider DTs:
- Instances describable by attribute–value pairs
- Target function is discrete-valued
- Disjunctive hypothesis may be required
- Possibly noisy training data

Examples: equipment or medical diagnosis, credit risk analysis, modeling calendar scheduling preferences
Top-Down Induction
Main loop:
- pick A, the "best" decision attribute for the next node, and assign A as the decision attribute for node
- for each value of A, create a new descendant of node
- sort the training examples to the leaf nodes
- if the training examples are perfectly classified, stop; else iterate over the new leaf nodes

(A code sketch of this loop follows the example below.)

Best attribute:
(Figure: S = [29+,35−]; splitting on A1 gives [21+,5−] for t and [8+,30−] for f, splitting on A2 gives [18+,33−] for t and [11+,2−] for f.)
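A minimal Python sketch of this main loop, not from the slides: examples are dicts, the grown tree is a nested dict mapping the tested attribute to one subtree per value, and choose_best is a scoring function (instantiated with information gain further below).

from collections import Counter

def id3(examples, attributes, target, choose_best):
    # examples: list of dicts mapping attribute names (and the target) to values
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:              # perfectly classified: stop
        return labels[0]
    if not attributes:                     # nothing left to test: majority vote
        return Counter(labels).most_common(1)[0][0]
    a = choose_best(examples, attributes, target)
    node = {a: {}}                         # internal node testing attribute a
    for v in {ex[a] for ex in examples}:   # one branch per observed value of a
        subset = [ex for ex in examples if ex[a] == v]
        rest = [b for b in attributes if b != a]
        node[a][v] = id3(subset, rest, target, choose_best)
    return node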
Entropy
(Plot: Entropy(S) as a function of p⊕, rising from 0 at p⊕ = 0 to a maximum of 1.0 at p⊕ = 0.5 and falling back to 0 at p⊕ = 1.)
- S is a sample of training examples
- p⊕: proportion of positive examples in S; p⊖: proportion of negative examples in S
- Entropy measures the impurity of S:

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

= expected # bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code)
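As a sketch in Python (the two-count interface is an illustrative choice; 0 · log2 0 is taken to be 0, as usual):

import math

def entropy(pos, neg):
    # entropy of a sample with pos positive and neg negative examples
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                 # skip the 0 * log2(0) = 0 case
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(9, 5))   # PlayTennis sample [9+,5-]: ~0.940
print(entropy(7, 7))   # maximally impure sample: 1.0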
Selecting the Next Attribute
Information gain: the expected reduction in entropy due to sorting on A:

Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ DA} (|Sv| / |S|) · Entropy(Sv)
Which attribute is the best classifier?

S: [9+,5−], E = 0.940

Humidity: High → [3+,4−] (E = 0.985), Normal → [6+,1−] (E = 0.592)
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151

Wind: Weak → [6+,2−] (E = 0.811), Strong → [3+,3−] (E = 1.00)
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.00 = 0.048

⇒ Humidity provides the greater information gain.
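The same computation in code, reproducing both gains (the count-based interface is again an illustrative simplification):

import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

def gain(pos, neg, splits):
    # splits: one (pos, neg) pair of counts per value of the attribute
    total = pos + neg
    remainder = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(pos, neg) - remainder

print(gain(9, 5, [(3, 4), (6, 1)]))   # Humidity: ~0.152 (the .151 above uses rounded entropies)
print(gain(9, 5, [(6, 2), (3, 3)]))   # Wind: ~0.048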
ID3: Hypothesis Space Search
(Figure: ID3's search through the hypothesis space — starting from the empty tree, candidate trees grow by adding one attribute test (A1, A2, A3, A4, . . .) at a time.)
Target function: surely in there . . . , but no backtracking ⇒ local minima . . .

Statistical choices: robust to noisy data; inductive bias: "prefer the shortest tree"
Occam’s Razor
Bias: a preference for some hypotheses, rather than a restriction of the hypothesis space . . . prefer the shortest hypothesis that fits the data

Arguments in favor of short hypotheses:
- a short hypothesis that fits the data is unlikely to be a coincidence; a long hypothesis that fits the data may well be a coincidence

Arguments opposed to short hypotheses:
- There are many ways to define small sets of hypotheses, e.g., all trees with a prime number of nodes that use attributes beginning with "Z"
- What's so special about small sets based on the size of the hypothesis?
Overfitting in Decision Trees
Consider adding the noisy training example ⟨Sunny, Hot, Normal, Strong⟩, PlayTennis = No.

Consider the error of hypothesis h over
- the training data: error_train(h)
- the entire distribution D of data: error_D(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that

error_train(h) < error_train(h′) and error_D(h) > error_D(h′)
(Plot: accuracy vs. size of tree (number of nodes) — accuracy on the training data keeps rising as the tree grows, while accuracy on the test data peaks and then declines.)
Avoiding Overfitting
Option 1: stop growing the tree when a split is not statistically significant
Option 2: grow the full tree, then post-prune

How to select the best tree:
- measure performance over the training data
- measure performance over a separate validation data set
- minimize |tree| + |misclassifications(tree)| (sketched below)
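The third criterion as code, a sketch assuming the nested-dict tree representation of the ID3 sketch above (function names are mine):

def classify(tree, example):
    # follow attribute tests down to a leaf label
    # (assumes every attribute value was seen during training)
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][example[attr]]
    return tree

def tree_size(tree):
    # number of nodes: one per attribute test plus one per leaf
    if not isinstance(tree, dict):
        return 1
    attr = next(iter(tree))
    return 1 + sum(tree_size(sub) for sub in tree[attr].values())

def score(tree, examples, target):
    # |tree| + |misclassifications(tree)|: among candidate pruned trees,
    # prefer the one minimizing this value
    errors = sum(classify(tree, ex) != ex[target] for ex in examples)
    return tree_size(tree) + errors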
Rule Post-Pruning
- Convert the tree to an equivalent set of rules
- Prune each rule independently of the others
- Sort the final rules into the desired sequence for use

Perhaps the most frequently used method
Converting A Tree to Rules
(The PlayTennis tree from above: Sunny → Humidity, Overcast → Yes, Rain → Wind.)

IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes
. . .
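A sketch of the conversion for the nested-dict representation; the tree literal below encodes the PlayTennis tree shown above:

def tree_to_rules(tree, conditions=()):
    # one rule per root-to-leaf path: (list of attribute tests, class)
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    attr = next(iter(tree))
    rules = []
    for value, subtree in tree[attr].items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

tennis = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}

for tests, label in tree_to_rules(tennis):
    print("IF", " ∧ ".join(f"({a} = {v})" for a, v in tests),
          "THEN PlayTennis =", label)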
Continuous Valued Attributes
Create a discrete attribute to test a continuous one, e.g. (Temperature > 72.3) = t, f

Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
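A sketch of threshold selection for this example: candidate thresholds lie midway between adjacent sorted values where the class label changes, and the best one is picked by information gain (helper names are mine):

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(labels).values())

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    # midpoints between adjacent values with a class change
    cands = [(a + b) / 2
             for (a, la), (b, lb) in zip(pairs, pairs[1:]) if la != lb]
    def gain(t):
        lo = [l for v, l in pairs if v <= t]
        hi = [l for v, l in pairs if v > t]
        return entropy(labels) - (len(lo) * entropy(lo)
                                  + len(hi) * entropy(hi)) / len(labels)
    return max(cands, key=gain)

temp = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temp, play))   # 54.0 (the candidates are 54 and 85)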
Attributes with Many Values

Problem: Gain will select an attribute with many values, e.g., Date = Jun 3 1996

One approach: use

GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) ≡ − Σ_{i=1}^{c} (|Si|/|S|) log2 (|Si|/|S|)

where Si is the subset of S for which A has value vi
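SplitInformation in code, showing why it penalizes many-valued attributes such as Date (the subset-size interface is an illustrative choice):

import math

def split_information(sizes):
    # sizes: |S1|, ..., |Sc| of the subsets induced by A's values
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes)

def gain_ratio(gain, sizes):
    return gain / split_information(sizes)

print(split_information([1] * 14))   # Date-like attribute: log2(14) ~ 3.81
print(split_information([7, 7]))     # binary even split: 1.0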
Attributes with Costs
e.g., medical diagnosis: a BloodTest has a cost

How to learn a consistent tree with low expected cost? Replace Gain by
- Gain²(S, A) / Cost(A), or
- (2^Gain(S,A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the importance of cost
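A sketch of the two measures, assuming the exponent w applies to (Cost(A) + 1) as in the usual statement of the second measure (e.g. in Mitchell's Machine Learning):

def gain2_per_cost(gain, cost):
    # replace Gain by Gain^2(S, A) / Cost(A)
    return gain ** 2 / cost

def discounted_gain(gain, cost, w=0.5):
    # replace Gain by (2^Gain(S, A) - 1) / (Cost(A) + 1)^w,
    # where w in [0, 1] sets the importance of cost
    return (2 ** gain - 1) / (cost + 1) ** w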
Unknown Attribute Values
Use the training example anyway, and sort it through the tree:
- if node n tests A ⇒ assign the most common value of A among the other examples sorted to node n
- or: assign the most common value of A among the other examples with the same target value
- or: assign probability pi to each possible value vi of A, and pass fraction pi of the example to each descendant

Classify new examples in the same fashion
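A sketch of the probabilistic strategy (None marks a missing value; names are mine): the known values of A among the examples sorted to node n define the fractions pi.

from collections import Counter

def value_fractions(examples, attr):
    # distribution of A's known values among the examples at this node
    counts = Counter(ex[attr] for ex in examples if ex[attr] is not None)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

node_examples = [{"Humidity": "High"}, {"Humidity": "High"},
                 {"Humidity": "Normal"}, {"Humidity": None}]
print(value_fractions(node_examples, "Humidity"))
# {'High': 0.666..., 'Normal': 0.333...}: the incomplete example goes down
# the High branch with weight 2/3 and down Normal with weight 1/3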
3 Automata Learning
Grammar inference: the process of learning an unknown grammar from a finite set of labeled examples

Regular grammar: recognized by a DFA

Given: a finite set of positive examples and a finite, possibly empty, set of negative examples

Task: learn a minimum-state DFA equivalent to the target . . . this is NP-hard

Simplifications:
- criteria on the samples (e.g., structural completeness)
- a knowledgeable teacher (oracle) who responds to queries generated by the learner
Applications
- Inference of control structures in learning by examples
- Inference of normal models of systems under test
- Inference of consistent environments in partial models
(Figure slides: a trace tree built from sample runs, the DFA obtained from it, and a step-by-step chart-parsing example.)
Some Notation
Σ: set of symbols; Σ∗: set of strings; λ: the empty string

M = (Q, δ, Σ, q0, F): DFA; L(M): the language accepted by M

A state q in M is alive if it can be reached by some string α and left by some string β such that αβ ∈ L(M) ⇒ in the minimal DFA there is exactly one non-alive (dead) state d0

A set of strings P is live-complete w.r.t. M if for every live state q in M there is an α ∈ P with δ(q0, α) = q

⇒ P0 = P ∪ {d0} represents all states in M

Define f : P0 × Σ → Σ∗ ∪ {d0} by f(d0, b) = d0 and f(α, b) = αb

Transition set: T0 = P0 ∪ {f(α, b) | (α, b) ∈ P × Σ}; T = T0 − {d0}
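These definitions in miniature (the string 'd0' stands for the dead state; names are mine):

def transition_set(P, sigma, d0="d0"):
    # T0 = P0 ∪ { f(alpha, b) | (alpha, b) in P x Sigma }, f(alpha, b) = alpha + b;
    # the dead state satisfies f(d0, b) = d0 for every b
    T = set(P) | {alpha + b for alpha in P for b in sigma}
    return T | {d0}

print(sorted(transition_set({"", "1"}, "01"), key=str))
# ['', '0', '1', '10', '11', 'd0']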
Angluin’s ID-Algorithm
Aim: construct a partition of T0 that places all equivalent elements in one state

Equivalence relation: Nerode's ⇒ the resulting DFA is minimal

Start: one accepting and one non-accepting state

Partitioning:
- for each i a string vi is drawn, s.t. for all states q, q′ there is a j ≤ i with δ(q, vj) ∈ F and δ(q′, vj) ∉ F, or vice versa
⇒ i-th partition Ei: Ei(d0) = ∅ and Ei(α) = {vj | j ≤ i, αvj ∈ L(M)}
- for all α, β ∈ T with δ(q0, α) = δ(q0, β) we have Ej(α) = Ej(β) for all j ≤ i

Constructing the (i+1)-th Partition

Separation:
- search for α, β and b s.t. Ei(α) = Ei(β) but Ei(f(α, b)) ≠ Ei(f(β, b)); let γ be an element in Ei(f(α, b)) but not in Ei(f(β, b)), or vice versa
- set vi+1 = bγ and query the string αvi+1 for every α ∈ T: if αvi+1 ∈ L(M) then Ei+1(α) ← Ei(α) ∪ {vi+1}; otherwise Ei+1(α) ← Ei(α)

. . . iterate until no separating pair α, β exists
Pseudo Code
Input: live-complete set P, a teacher answering membership queries
Output: canonical DFA M for the target regular grammar

i ← 0; v0 ← λ; V ← {λ}
T ← P ∪ {f(α, b) | (α, b) ∈ P × Σ}
T0 ← T ∪ {d0}; E0(d0) ← ∅
for each α ∈ T
    if (α ∈ L) E0(α) ← {λ} else E0(α) ← ∅
while (∃ α, β ∈ P0 and b ∈ Σ: Ei(α) = Ei(β), but Ei(f(α, b)) ≠ Ei(f(β, b)))
    γ ← Select(Ei(f(α, b)) ⊕ Ei(f(β, b)))
    vi+1 ← bγ; V ← V ∪ {vi+1}; i ← i + 1
    for each α ∈ T
        if (αvi ∈ L) Ei(α) ← Ei−1(α) ∪ {vi} else Ei(α) ← Ei−1(α)
return the DFA M for L extracted from Ei and T
Extracting the Automaton M

. . . from the sets Ei and the transition set T:
- the states of M are the sets Ei(α), for α ∈ T
- the initial state of M is Ei(λ)
- the accepting states of M are the sets Ei(α) with α ∈ T and λ ∈ Ei(α)
- if Ei(α) = ∅, we add self-loops δ(Ei(α), b) = Ei(α) for all b ∈ Σ; otherwise we set δ(Ei(α), b) = Ei(f(α, b)), for all α ∈ P and b ∈ Σ
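A compact Python sketch of ID together with the extraction step (the None marker for d0, the oracle interface, and the demo language are illustrative assumptions; λ is assumed to be in P so that Ei(λ) exists):

def id_algorithm(P, sigma, member):
    # Angluin's ID: learn the canonical DFA of an unknown regular language
    # from a live-complete set P (with lambda in P) and a membership oracle.
    D0 = None                                   # marker for the dead state d0
    T = set(P) | {a + c for a in P for c in sigma}
    T0 = T | {D0}
    P0 = set(P) | {D0}

    def f(alpha, c):                            # f(d0, b) = d0, f(alpha, b) = alpha b
        return D0 if alpha is D0 else alpha + c

    # E[alpha]: distinguishing suffixes v tried so far with alpha v in L
    E = {a: frozenset([""] if member(a) else []) for a in T}
    E[D0] = frozenset()

    while True:
        sep = next(((a, b, c) for a in P0 for b in P0 for c in sigma
                    if E[a] == E[b] and E[f(a, c)] != E[f(b, c)]), None)
        if sep is None:                         # no separating pair left
            break
        a, b, c = sep
        gamma = next(iter(E[f(a, c)] ^ E[f(b, c)]))   # symmetric difference
        v = c + gamma                           # new distinguishing suffix v_{i+1}
        E = {x: E[x] | ({v} if x is not D0 and member(x + v) else set())
             for x in T0}

    # extraction: states are the sets E(alpha); the empty set is the dead state
    states = {E[a] for a in T0}
    delta = {(E[a], c): (E[a] if not E[a] else E[f(a, c)])
             for a in P0 for c in sigma}
    return states, E[""], {s for s in states if "" in s}, delta

# Demo: strings over {0,1} with an even number of 1s; P = {lambda, "1"}
# reaches both live states, so it is live-complete.
states, start, accepting, delta = id_algorithm(
    {"", "1"}, "01", lambda w: w.count("1") % 2 == 0)
print(len(states))                        # 3: even, odd, and the dead state
print(delta[(start, "1")] in accepting)   # False: a single '1' leads to "odd"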
Time Complexity
Theorem: if n is the number of states in M, then ID asks no more than n · |Σ| · |P| membership queries.

Proof sketch:
- Each iteration of the while-loop partitions at least one set Ei (corresponding to a state) into two subsets; since M has only n states, the loop runs at most n times
- Each iteration asks |T| queries, where T contains no more than |Σ| · |P| elements