Decision Tree and Automata Learning
Stefan Edelkamp
1 Overview
- Decision tree representation
- Top-down induction, attribute selection: entropy, information gain
- ID3 learning algorithm, overfitting
- Continuous-valued, many-valued, and costly attributes; unknown attribute values
- Grammar and DFA learning
- Angluin's ID algorithm
2 Decision Tree Learning
PlayTennis:

Outlook
├─ Sunny → Humidity
│    ├─ High → No
│    └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
     ├─ Strong → No
     └─ Weak → Yes
Training Examples:
Day Outlook Temp. Humidity Wind Play?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Decision Trees
DT representation: each internal node tests an attribute, each branch corresponds to an attribute value, each leaf node assigns a classification

When to consider DTs:
- Instances describable by attribute–value pairs
- Target function is discrete-valued
- Disjunctive hypothesis may be required
- Possibly noisy training data

Examples: equipment or medical diagnosis, credit risk analysis, modeling calendar scheduling preferences
Top-Down Induction
Main loop:
- pick A, the "best" decision attribute for the next node, and assign A as the decision attribute for node
- for each value of A, create a new descendant of node
- sort the training examples to the leaf nodes
- if the training examples are perfectly classified, stop; else iterate over the new leaf nodes

(A code sketch of this loop follows the example below.)

Best attribute:
(Figure: S = [29+,35−]; splitting on A1 gives [21+,5−] for t and [8+,30−] for f, splitting on A2 gives [18+,33−] for t and [11+,2−] for f.)
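A minimal Python sketch of this main loop, not from the slides: examples are dicts, the grown tree is a nested dict mapping the tested attribute to one subtree per value, and choose_best is a scoring function (instantiated with information gain further below).

from collections import Counter

def id3(examples, attributes, target, choose_best):
    # examples: list of dicts mapping attribute names (and the target) to values
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:              # perfectly classified: stop
        return labels[0]
    if not attributes:                     # nothing left to test: majority vote
        return Counter(labels).most_common(1)[0][0]
    a = choose_best(examples, attributes, target)
    node = {a: {}}                         # internal node testing attribute a
    for v in {ex[a] for ex in examples}:   # one branch per observed value of a
        subset = [ex for ex in examples if ex[a] == v]
        rest = [b for b in attributes if b != a]
        node[a][v] = id3(subset, rest, target, choose_best)
    return node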
Entropy
(Plot: Entropy(S) as a function of p⊕, rising from 0 at p⊕ = 0 to a maximum of 1.0 at p⊕ = 0.5 and falling back to 0 at p⊕ = 1.)
- S is a sample of training examples
- p⊕: proportion of positive examples in S; p⊖: proportion of negative examples in S
- Entropy measures the impurity of S:

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

= expected # bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code)
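As a sketch in Python (the two-count interface is an illustrative choice; 0 · log2 0 is taken to be 0, as usual):

import math

def entropy(pos, neg):
    # entropy of a sample with pos positive and neg negative examples
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                 # skip the 0 * log2(0) = 0 case
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(9, 5))   # PlayTennis sample [9+,5-]: ~0.940
print(entropy(7, 7))   # maximally impure sample: 1.0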
Selecting the Next Attribute
Information gain: the expected reduction in entropy due to sorting on A:

Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ DA} (|Sv| / |S|) · Entropy(Sv)
Which attribute is the best classifier?

S: [9+,5−], E = 0.940

Humidity: High → [3+,4−] (E = 0.985), Normal → [6+,1−] (E = 0.592)
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151

Wind: Weak → [6+,2−] (E = 0.811), Strong → [3+,3−] (E = 1.00)
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.00 = 0.048

⇒ Humidity provides the greater information gain.
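The same computation in code, reproducing both gains (the count-based interface is again an illustrative simplification):

import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

def gain(pos, neg, splits):
    # splits: one (pos, neg) pair of counts per value of the attribute
    total = pos + neg
    remainder = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(pos, neg) - remainder

print(gain(9, 5, [(3, 4), (6, 1)]))   # Humidity: ~0.152 (the .151 above uses rounded entropies)
print(gain(9, 5, [(6, 2), (3, 3)]))   # Wind: ~0.048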
ID3: Hypothesis Space Search
(Figure: ID3's search through the hypothesis space — starting from the empty tree, candidate trees grow by adding one attribute test (A1, A2, A3, A4, . . .) at a time.)
Target function: surely in there . . . , but no backtracking ⇒ local minima . . .

Statistical choices: robust to noisy data; inductive bias: "prefer the shortest tree"
Occam’s Razor
Bias: a preference for some hypotheses, rather than a restriction of the hypothesis space . . . prefer the shortest hypothesis that fits the data

Arguments in favor of short hypotheses:
- a short hypothesis that fits the data is unlikely to be a coincidence; a long hypothesis that fits the data may well be a coincidence

Arguments opposed to short hypotheses:
- There are many ways to define small sets of hypotheses, e.g., all trees with a prime number of nodes that use attributes beginning with "Z"
- What's so special about small sets based on the size of the hypothesis?
Overfitting in Decision Trees
Consider adding the noisy training example ⟨Sunny, Hot, Normal, Strong⟩, PlayTennis = No.

Consider the error of hypothesis h over
- the training data: error_train(h)
- the entire distribution D of data: error_D(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that

error_train(h) < error_train(h′) and error_D(h) > error_D(h′)
(Plot: accuracy vs. size of tree (number of nodes) — accuracy on the training data keeps rising as the tree grows, while accuracy on the test data peaks and then declines.)
Avoiding Overfitting
Option 1: stop growing the tree when a split is not statistically significant
Option 2: grow the full tree, then post-prune

How to select the best tree:
- measure performance over the training data
- measure performance over a separate validation data set
- minimize |tree| + |misclassifications(tree)| (sketched below)
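The third criterion as code, a sketch assuming the nested-dict tree representation of the ID3 sketch above (function names are mine):

def classify(tree, example):
    # follow attribute tests down to a leaf label
    # (assumes every attribute value was seen during training)
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][example[attr]]
    return tree

def tree_size(tree):
    # number of nodes: one per attribute test plus one per leaf
    if not isinstance(tree, dict):
        return 1
    attr = next(iter(tree))
    return 1 + sum(tree_size(sub) for sub in tree[attr].values())

def score(tree, examples, target):
    # |tree| + |misclassifications(tree)|: among candidate pruned trees,
    # prefer the one minimizing this value
    errors = sum(classify(tree, ex) != ex[target] for ex in examples)
    return tree_size(tree) + errors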
Rule Post-Pruning
- Convert the tree to an equivalent set of rules
- Prune each rule independently of the others
- Sort the final rules into the desired sequence for use

Perhaps the most frequently used method
Converting A Tree to Rules
(The PlayTennis tree from above: Sunny → Humidity, Overcast → Yes, Rain → Wind.)

IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes
. . .
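A sketch of the conversion for the nested-dict representation; the tree literal below encodes the PlayTennis tree shown above:

def tree_to_rules(tree, conditions=()):
    # one rule per root-to-leaf path: (list of attribute tests, class)
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    attr = next(iter(tree))
    rules = []
    for value, subtree in tree[attr].items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

tennis = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}

for tests, label in tree_to_rules(tennis):
    print("IF", " ∧ ".join(f"({a} = {v})" for a, v in tests),
          "THEN PlayTennis =", label)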
Continuous Valued Attributes
Create a discrete attribute to test a continuous one, e.g. (Temperature > 72.3) = t, f

Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
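A sketch of threshold selection for this example: candidate thresholds lie midway between adjacent sorted values where the class label changes, and the best one is picked by information gain (helper names are mine):

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(labels).values())

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    # midpoints between adjacent values with a class change
    cands = [(a + b) / 2
             for (a, la), (b, lb) in zip(pairs, pairs[1:]) if la != lb]
    def gain(t):
        lo = [l for v, l in pairs if v <= t]
        hi = [l for v, l in pairs if v > t]
        return entropy(labels) - (len(lo) * entropy(lo)
                                  + len(hi) * entropy(hi)) / len(labels)
    return max(cands, key=gain)

temp = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temp, play))   # 54.0 (the candidates are 54 and 85)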
Attributes with Many Values

Problem: Gain will select an attribute with many values, e.g., Date = Jun 3 1996

One approach: use

GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) ≡ − Σ_{i=1}^{c} (|Si|/|S|) log2 (|Si|/|S|)

where Si is the subset of S for which A has value vi
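SplitInformation in code, showing why it penalizes many-valued attributes such as Date (the subset-size interface is an illustrative choice):

import math

def split_information(sizes):
    # sizes: |S1|, ..., |Sc| of the subsets induced by A's values
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes)

def gain_ratio(gain, sizes):
    return gain / split_information(sizes)

print(split_information([1] * 14))   # Date-like attribute: log2(14) ~ 3.81
print(split_information([7, 7]))     # binary even split: 1.0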
Attributes with Costs
e.g., medical diagnosis: a BloodTest has a cost

How to learn a consistent tree with low expected cost? Replace Gain by
- Gain²(S, A) / Cost(A), or
- (2^Gain(S,A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the importance of cost
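A sketch of the two measures, assuming the exponent w applies to (Cost(A) + 1) as in the usual statement of the second measure (e.g. in Mitchell's Machine Learning):

def gain2_per_cost(gain, cost):
    # replace Gain by Gain^2(S, A) / Cost(A)
    return gain ** 2 / cost

def discounted_gain(gain, cost, w=0.5):
    # replace Gain by (2^Gain(S, A) - 1) / (Cost(A) + 1)^w,
    # where w in [0, 1] sets the importance of cost
    return (2 ** gain - 1) / (cost + 1) ** w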
Unknown Attribute Values
Use the training example anyway, and sort it through the tree:
- if node n tests A ⇒ assign the most common value of A among the other examples sorted to node n
- or: assign the most common value of A among the other examples with the same target value
- or: assign probability pi to each possible value vi of A, and pass fraction pi of the example to each descendant

Classify new examples in the same fashion
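A sketch of the probabilistic strategy (None marks a missing value; names are mine): the known values of A among the examples sorted to node n define the fractions pi.

from collections import Counter

def value_fractions(examples, attr):
    # distribution of A's known values among the examples at this node
    counts = Counter(ex[attr] for ex in examples if ex[attr] is not None)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

node_examples = [{"Humidity": "High"}, {"Humidity": "High"},
                 {"Humidity": "Normal"}, {"Humidity": None}]
print(value_fractions(node_examples, "Humidity"))
# {'High': 0.666..., 'Normal': 0.333...}: the incomplete example goes down
# the High branch with weight 2/3 and down Normal with weight 1/3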
3 Automata Learning
Grammar inference: the process of learning an unknown grammar from a finite set of labeled examples

Regular grammar: recognized by a DFA

Given: a finite set of positive examples and a finite, possibly empty, set of negative examples

Task: learn a minimum-state DFA equivalent to the target . . . this is NP-hard

Simplifications:
- criteria on the samples (e.g., structural completeness)
- a knowledgeable teacher (oracle) who responds to queries generated by the learner
Applications
- Inference of control structures in learning by examples
- Inference of normal models of systems under test
- Inference of consistent environments in partial models
(Figure slides: a trace tree built from sample runs, the DFA obtained from it, and a step-by-step chart-parsing example.)
Some Notation
Σ: set of symbols; Σ∗: set of strings; λ: the empty string

M = (Q, δ, Σ, q0, F): DFA; L(M): the language accepted by M

A state q in M is alive if it can be reached by some string α and left by some string β such that αβ ∈ L(M) ⇒ in the minimal DFA there is exactly one non-alive (dead) state d0

A set of strings P is live-complete w.r.t. M if for every live state q in M there is an α ∈ P with δ(q0, α) = q

⇒ P0 = P ∪ {d0} represents all states in M

Define f : P0 × Σ → Σ∗ ∪ {d0} by f(d0, b) = d0 and f(α, b) = αb

Transition set: T0 = P0 ∪ {f(α, b) | (α, b) ∈ P × Σ}; T = T0 − {d0}
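These definitions in miniature (the string 'd0' stands for the dead state; names are mine):

def transition_set(P, sigma, d0="d0"):
    # T0 = P0 ∪ { f(alpha, b) | (alpha, b) in P x Sigma }, f(alpha, b) = alpha + b;
    # the dead state satisfies f(d0, b) = d0 for every b
    T = set(P) | {alpha + b for alpha in P for b in sigma}
    return T | {d0}

print(sorted(transition_set({"", "1"}, "01"), key=str))
# ['', '0', '1', '10', '11', 'd0']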
Angluin’s ID-Algorithm
Aim: construct a partition of T0 that places all equivalent elements in one state

Equivalence relation: Nerode's ⇒ the resulting DFA is minimal

Start: one accepting and one non-accepting state

Partitioning:
- for each i a string vi is drawn, s.t. for all states q, q′ there is a j ≤ i with δ(q, vj) ∈ F and δ(q′, vj) ∉ F, or vice versa
⇒ i-th partition Ei: Ei(d0) = ∅ and Ei(α) = {vj | j ≤ i, αvj ∈ L(M)}
- for all α, β ∈ T with δ(q0, α) = δ(q0, β) we have Ej(α) = Ej(β) for all j ≤ i

Constructing the (i+1)-th Partition

Separation:
- search for α, β and b s.t. Ei(α) = Ei(β) but Ei(f(α, b)) ≠ Ei(f(β, b)); let γ be an element in Ei(f(α, b)) but not in Ei(f(β, b)), or vice versa
- set vi+1 = bγ and query the string αvi+1 for every α ∈ T: if αvi+1 ∈ L(M) then Ei+1(α) ← Ei(α) ∪ {vi+1}; otherwise Ei+1(α) ← Ei(α)

. . . iterate until no separating pair α, β exists
Pseudo Code
Input: live-complete set P, a teacher answering membership queries
Output: canonical DFA M for the target regular grammar

i ← 0; v0 ← λ; V ← {λ}
T ← P ∪ {f(α, b) | (α, b) ∈ P × Σ}
T0 ← T ∪ {d0}; E0(d0) ← ∅
for each α ∈ T
    if (α ∈ L) E0(α) ← {λ} else E0(α) ← ∅
while (∃ α, β ∈ P0 and b ∈ Σ: Ei(α) = Ei(β), but Ei(f(α, b)) ≠ Ei(f(β, b)))
    γ ← Select(Ei(f(α, b)) ⊕ Ei(f(β, b)))
    vi+1 ← bγ; V ← V ∪ {vi+1}; i ← i + 1
    for each α ∈ T
        if (αvi ∈ L) Ei(α) ← Ei−1(α) ∪ {vi} else Ei(α) ← Ei−1(α)
return the DFA M for L extracted from Ei and T
Extracting the Automaton M

. . . from the sets Ei and the transition set T:
- the states of M are the sets Ei(α), for α ∈ T
- the initial state of M is Ei(λ)
- the accepting states of M are the sets Ei(α) with α ∈ T and λ ∈ Ei(α)
- if Ei(α) = ∅, we add self-loops δ(Ei(α), b) = Ei(α) for all b ∈ Σ; otherwise we set δ(Ei(α), b) = Ei(f(α, b)), for all α ∈ P and b ∈ Σ
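A compact Python sketch of ID together with the extraction step (the None marker for d0, the oracle interface, and the demo language are illustrative assumptions; λ is assumed to be in P so that Ei(λ) exists):

def id_algorithm(P, sigma, member):
    # Angluin's ID: learn the canonical DFA of an unknown regular language
    # from a live-complete set P (with lambda in P) and a membership oracle.
    D0 = None                                   # marker for the dead state d0
    T = set(P) | {a + c for a in P for c in sigma}
    T0 = T | {D0}
    P0 = set(P) | {D0}

    def f(alpha, c):                            # f(d0, b) = d0, f(alpha, b) = alpha b
        return D0 if alpha is D0 else alpha + c

    # E[alpha]: distinguishing suffixes v tried so far with alpha v in L
    E = {a: frozenset([""] if member(a) else []) for a in T}
    E[D0] = frozenset()

    while True:
        sep = next(((a, b, c) for a in P0 for b in P0 for c in sigma
                    if E[a] == E[b] and E[f(a, c)] != E[f(b, c)]), None)
        if sep is None:                         # no separating pair left
            break
        a, b, c = sep
        gamma = next(iter(E[f(a, c)] ^ E[f(b, c)]))   # symmetric difference
        v = c + gamma                           # new distinguishing suffix v_{i+1}
        E = {x: E[x] | ({v} if x is not D0 and member(x + v) else set())
             for x in T0}

    # extraction: states are the sets E(alpha); the empty set is the dead state
    states = {E[a] for a in T0}
    delta = {(E[a], c): (E[a] if not E[a] else E[f(a, c)])
             for a in P0 for c in sigma}
    return states, E[""], {s for s in states if "" in s}, delta

# Demo: strings over {0,1} with an even number of 1s; P = {lambda, "1"}
# reaches both live states, so it is live-complete.
states, start, accepting, delta = id_algorithm(
    {"", "1"}, "01", lambda w: w.count("1") % 2 == 0)
print(len(states))                        # 3: even, odd, and the dead state
print(delta[(start, "1")] in accepting)   # False: a single '1' leads to "odd"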
Time Complexity
Theorem: if n is the number of states in M, then ID asks no more than n · |Σ| · |P| membership queries.

Proof sketch:
- Each iteration of the while-loop partitions at least one set Ei (corresponding to a state) into two subsets; since M has only n states, the loop runs at most n times
- Each iteration asks |T| queries, where T contains no more than |Σ| · |P| elements