Planning and Optimization G6. Monte-Carlo Tree Search Algorithms (Part II) Malte Helmert and Gabriele Röger

Academic year: 2022

Aktie "Planning and Optimization G6. Monte-Carlo Tree Search Algorithms (Part II) Malte Helmert and Gabriele R¨oger"

Copied!
28
0
0

Wird geladen.... (Jetzt Volltext ansehen)

Volltext

(1)

G6. Monte-Carlo Tree Search Algorithms (Part II)

Malte Helmert and Gabriele Röger

Universität Basel

(2)


Content of this Course

[Overview diagram: Planning is divided into Classical (Foundations, Logic, Heuristics, Constraints) and Probabilistic (Explicit MDPs, Factored MDPs).]

(3)

Content of this Course: Factored MDPs

[Overview diagram: Factored MDPs are treated under Foundations, Heuristic Search, and Monte-Carlo Methods; the Monte-Carlo part covers Suboptimal Algorithms and MCTS.]

(4)


ε-greedy

(5)

ε-greedy: Idea

tree policy parametrized with constant parameter ε
with probability 1−ε, pick one of the greedy actions uniformly at random
otherwise, pick a non-greedy successor uniformly at random

ε-greedy Tree Policy

$$\pi(a \mid d) = \begin{cases} \frac{1-\varepsilon}{|A^\star_k(d)|} & \text{if } a \in A^\star_k(d) \\[4pt] \frac{\varepsilon}{|A(s(d)) \setminus A^\star_k(d)|} & \text{otherwise,} \end{cases}$$

with $A^\star_k(d) = \{a(c) \in A(s(d)) \mid c \in \arg\min_{c' \in \mathrm{children}(d)} \hat{Q}_k(c')\}$.
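As a minimal sketch of this tree policy in Python (the pair-based child representation and the function names are illustrative assumptions, not part of the slides; greedy actions are those with minimal Q̂ because we minimize cost):

```python
import random

def epsilon_greedy_distribution(children, epsilon):
    """Return {action: probability} for the epsilon-greedy tree policy.

    children: list of (action, q_estimate) pairs for the children of d.
    Greedy actions are those whose Q-hat estimate is minimal (cost setting).
    """
    q_min = min(q for _, q in children)
    greedy = [a for a, q in children if q == q_min]
    non_greedy = [a for a, q in children if q != q_min]

    probs = {a: (1 - epsilon) / len(greedy) for a in greedy}
    # assumes at least one non-greedy action exists, as in the formula above
    probs.update({a: epsilon / len(non_greedy) for a in non_greedy})
    return probs

def epsilon_greedy_select(children, epsilon):
    """Sample an action according to the epsilon-greedy tree policy."""
    probs = epsilon_greedy_distribution(children, epsilon)
    actions = list(probs)
    return random.choices(actions, weights=[probs[a] for a in actions])[0]
```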

(6)

ε-greedy: Example

[Tree with root d and children c1, ..., c4, where Q̂(c1) = 6, Q̂(c2) = 12, Q̂(c3) = 6, Q̂(c4) = 9.]

Assuming a(ci) = ai and ε = 0.2, we get:

π(a1 | d) = 0.4    π(a2 | d) = 0.1
π(a3 | d) = 0.4    π(a4 | d) = 0.1
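These probabilities can be checked with a few lines of Python (the dictionary-based representation is an illustrative assumption):

```python
# Q-hat estimates of the children from the example above
q_hat = {"a1": 6, "a2": 12, "a3": 6, "a4": 9}
epsilon = 0.2
greedy = [a for a, v in q_hat.items() if v == min(q_hat.values())]  # a1 and a3
pi = {a: (1 - epsilon) / len(greedy) if a in greedy
      else epsilon / (len(q_hat) - len(greedy))
      for a in q_hat}
print({a: round(p, 2) for a, p in pi.items()})
# {'a1': 0.4, 'a2': 0.1, 'a3': 0.4, 'a4': 0.1}
```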

(8)


ε-greedy: Asymptotic Optimality

Asymptotic Optimality of ε-greedy
explores forever
not greedy in the limit
not asymptotically optimal

asymptotically optimal variant uses decaying ε, e.g. ε = 1/k


(10)

ε-greedy: Weakness

Problem: when ε-greedy explores, all non-greedy actions are treated equally

[Tree with root d and children c1, ..., cℓ+2, where Q̂(c1) = 8, Q̂(c2) = 9, and Q̂(c3) = ... = Q̂(cℓ+2) = 50 for the last ℓ children.]

Assuming a(ci) = ai, ε = 0.2 and ℓ = 9, we get:

π(a1 | d) = 0.8
π(a2 | d) = π(a3 | d) = ... = π(a11 | d) = 0.02

The promising action a2 is selected just as rarely as each of the nine clearly inferior actions a3, ..., a11.


(12)


Softmax

(13)

Softmax: Idea

tree policy with constant parameter τ
select actions proportionally to their action-value estimate
most popular softmax tree policy uses Boltzmann exploration
⇒ selects actions proportionally to $e^{-\hat{Q}_k(c)/\tau}$

Tree Policy based on Boltzmann Exploration

$$\pi(a(c) \mid d) = \frac{e^{-\hat{Q}_k(c)/\tau}}{\sum_{c' \in \mathrm{children}(d)} e^{-\hat{Q}_k(c')/\tau}}$$
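A minimal Python sketch of this tree policy (function and variable names are illustrative assumptions; subtracting the minimal estimate before exponentiating is only for numerical stability and leaves the normalized probabilities unchanged):

```python
import math
import random

def boltzmann_distribution(children, tau):
    """Return {action: probability} proportional to exp(-Q_hat(c) / tau).

    children: list of (action, q_estimate) pairs for the children of d.
    Lower cost estimates receive higher selection probability.
    """
    q_min = min(q for _, q in children)
    weights = {a: math.exp(-(q - q_min) / tau) for a, q in children}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def boltzmann_select(children, tau):
    """Sample an action according to Boltzmann exploration."""
    probs = boltzmann_distribution(children, tau)
    actions = list(probs)
    return random.choices(actions, weights=[probs[a] for a in actions])[0]
```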

(14)

Softmax: Example

[Tree with root d and children c1, ..., cℓ+2, where Q̂(c1) = 8, Q̂(c2) = 9, and Q̂(c3) = ... = Q̂(cℓ+2) = 50 for the last ℓ children.]

Assuming a(ci) = ai, τ = 10 and ℓ = 9, we get:

π(a1 | d) = 0.49    π(a2 | d) = 0.45
π(a3 | d) = ... = π(a11 | d) = 0.007

In contrast to ε-greedy, the reasonably good action a2 is now selected almost as often as the greedy action a1, while the nine poor actions are selected only rarely.

(15)

Boltzmann Exploration: Asymptotic Optimality

Asymptotic Optimality of Boltzmann Exploration
explores forever
not greedy in the limit:
state- and action-value estimates converge to finite values
therefore, probabilities also converge to positive, finite values
not asymptotically optimal

asymptotically optimal variant uses decaying τ, e.g. τ = 1/log k
careful: τ must not decay faster than logarithmically
(i.e., must have τ ≥ const/log k) to explore infinitely
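A sketch of such decay schedules, where k is the trial counter (the guard for small k and the constant are illustrative assumptions):

```python
import math

def decayed_epsilon(k):
    """epsilon = 1/k: decays to 0, so epsilon-greedy becomes greedy in the limit."""
    return 1.0 / k

def decayed_tau(k, const=1.0):
    """tau = const / log k: decays to 0, but only logarithmically,
    so Boltzmann exploration keeps exploring forever."""
    # infinite temperature for the first trials corresponds to uniform selection
    return const / math.log(k) if k >= 2 else float("inf")
```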

(16)

Boltzmann Exploration: Weakness

[Figure: two scenarios in which actions a1, a2, a3 have the same expected costs but differently spread cost distributions.]

Boltzmann exploration and ε-greedy only consider the mean of sampled action-values
as we sample the same node many times, we can also gather information about the variance (how reliable the information is)
Boltzmann exploration ignores the variance, treating the two scenarios equally

(17)

UCB1

(18)

Upper Confidence Bounds: Idea

Balance exploration and exploitation by preferring actions that
have been successful in earlier iterations (exploit)
have been selected rarely (explore)

(19)

Upper Confidence Bounds: Idea

select successor c of d that minimizes Q̂_k(c) − E_k(d) · B_k(c), based on
action-value estimate Q̂_k(c),
exploration factor E_k(d) and
bonus term B_k(c)

select B_k(c) such that
Q⋆(s(c), a(c)) ≥ Q̂_k(c) − E_k(d) · B_k(c) with high probability

Idea: Q̂_k(c) − E_k(d) · B_k(c) is a lower confidence bound on Q⋆(s(c), a(c)) under the collected information

(20)

Bonus Term of UCB1

use $B_k(c) = \sqrt{\frac{2 \cdot \ln N_k(d)}{N_k(c)}}$ as bonus term

bonus term is derived from the Chernoff-Hoeffding bound:
it bounds the probability that a sampled value (here: Q̂_k(c)) is far from its true expected value (here: Q⋆(s(c), a(c))) in dependence of the number of samples (here: N_k(c))

UCB1 picks the optimal action exponentially more often

the concrete MCTS algorithm that uses UCB1 is called UCT
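A Python sketch of the resulting UCT-style selection at a decision node d (the triple-based child representation, the handling of unvisited children, and the exploration_factor argument are illustrative assumptions; the exploration factor E_k(d) is discussed on the following slides):

```python
import math

def ucb1_select(children, visits_d, exploration_factor=1.0):
    """Select the child action minimizing Q_hat(c) - E(d) * B(c),
    with bonus term B(c) = sqrt(2 * ln N(d) / N(c)).

    children: list of (action, q_estimate, n_visits) triples.
    visits_d: number of visits N(d) of the decision node d.
    """
    def score(child):
        _action, q_hat, n_c = child
        if n_c == 0:
            # unvisited children get an unbounded bonus and are tried first
            return float("-inf")
        bonus = math.sqrt(2 * math.log(visits_d) / n_c)
        return q_hat - exploration_factor * bonus

    return min(children, key=score)[0]
```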

(21)

Exploration Factor (1)

Exploration factor E_k(d) serves two roles in SSPs:

UCB1 is designed for MABs with rewards in [0, 1]
⇒ Q̂_k(c) ∈ [0, 1] for all k and c
bonus term $B_k(c) = \sqrt{\frac{2 \cdot \ln N_k(d)}{N_k(c)}}$ is always ≥ 0
when d is visited,
B_{k+1}(c) > B_k(c) if a(c) is not selected
B_{k+1}(c) < B_k(c) if a(c) is selected
if B_k(c) ≥ 2 for some c, UCB1 must explore
hence, Q̂_k(c) and B_k(c) are always of similar size
⇒ set E_k(d) to a value that depends on V̂_k(d)

(22)

Exploration Factor (2)

Exploration factor E_k(d) serves two roles in SSPs:

E_k(d) allows us to adjust the balance between exploration and exploitation
search with E_k(d) = V̂_k(d) is very greedy
in practice, E_k(d) is therefore often multiplied with a constant > 1
UCB1 often requires a hand-tailored E_k(d) to work well
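As a sketch of this, E_k(d) could be set from the current state-value estimate, following the slide (the multiplier value 2 and the function name are illustrative assumptions):

```python
def exploration_factor(v_hat_d, multiplier=2.0):
    """E(d) proportional to the state-value estimate V_hat(d);
    a multiplier > 1 makes the search less greedy than E(d) = V_hat(d)."""
    return multiplier * v_hat_d

# hypothetical usage together with the ucb1_select sketch from above:
# action = ucb1_select(children, visits_d, exploration_factor(v_hat_d))
```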

(23)

Asymptotic Optimality

Asymptotic Optimality of UCB1
explores forever
greedy in the limit
asymptotically optimal

However:
no theoretical justification to use UCB1 for SSPs/MDPs
(MAB proof requires stationary rewards)
development of tree policies is an active research topic


(25)

Symmetric Search Tree up to depth 4

[Figure: the full search tree up to depth 4.]

(26)

Asymmetric Search Tree of UCB1

[Figure: the asymmetric search tree built by UCB1 with an equal number of search nodes.]

(27)

Summary

(28)

Summary

ε-greedy, Boltzmann exploration and UCB1 balance exploration and exploitation
ε-greedy selects a greedy action with probability 1−ε and another action uniformly at random otherwise
ε-greedy selects all non-greedy actions with the same probability
Boltzmann exploration selects each action proportionally to its action-value estimate
Boltzmann exploration does not take the confidence of the estimate into account
UCB1 selects actions greedily w.r.t. an upper confidence bound on the action-value estimate
