
Planning and Optimization

G6. Monte-Carlo Tree Search Algorithms (Part II)

Malte Helmert and Gabriele Röger

Universität Basel

December 14, 2020



G6.1 ε-greedy
G6.2 Softmax
G6.3 UCB1
G6.4 Summary


Content of this Course

(Course overview: Planning is divided into Classical planning (Foundations, Logic, Heuristics, Constraints) and Probabilistic planning (Explicit MDPs, Factored MDPs).)

Content of this Course: Factored MDPs

(Overview: within Factored MDPs, the topics are Foundations, Heuristic Search and Monte-Carlo Methods; MCTS is covered among the suboptimal algorithms.)


G6.1 ε-greedy


ε-greedy: Idea

- tree policy parametrized with constant parameter ε
- with probability 1 − ε, pick one of the greedy actions uniformly at random
- otherwise, pick a non-greedy successor uniformly at random

ε-greedy Tree Policy

π(a | d) = (1 − ε) / |A*_k(d)|            if a ∈ A*_k(d)
π(a | d) = ε / |A(s(d)) \ A*_k(d)|        otherwise,

with A*_k(d) = {a(c) ∈ A(s(d)) | c ∈ argmin_{c' ∈ children(d)} Q̂_k(c')}.
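To make the definition concrete, here is a minimal Python sketch of an ε-greedy tree policy in the cost-minimizing setting of the slides. The node representation (a list of children plus a dictionary of Q̂ estimates) and the guard for the case that every child is greedy are simplifications of mine, not part of the lecture material.

```python
import random

def epsilon_greedy_select(children, q_est, epsilon):
    """Pick a child of the decision node according to the epsilon-greedy tree policy.

    children: list of child nodes of the decision node d
    q_est:    dict mapping each child c to its current action-value estimate Q̂_k(c)
    epsilon:  constant exploration parameter in (0, 1)
    """
    # Greedy children: those whose estimate attains the minimum (cost setting).
    best = min(q_est[c] for c in children)
    greedy = [c for c in children if q_est[c] == best]
    non_greedy = [c for c in children if q_est[c] != best]

    # With probability 1 - epsilon pick a greedy child uniformly at random;
    # otherwise pick a non-greedy child uniformly at random.
    # If every child is greedy, there is nothing to explore (guard added here).
    if not non_greedy or random.random() < 1.0 - epsilon:
        return random.choice(greedy)
    return random.choice(non_greedy)
```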


ε-greedy: Example

Decision node d with children c1, ..., c4 and action-value estimates
Q̂(c1) = 6, Q̂(c2) = 12, Q̂(c3) = 6, Q̂(c4) = 9.

Assuming a(ci) = ai and ε = 0.2, we get:

- π(a1 | d) = 0.4
- π(a2 | d) = 0.1
- π(a3 | d) = 0.4
- π(a4 | d) = 0.1
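These probabilities follow directly from the definition; a quick, self-contained check (with hypothetical child names) looks like this:

```python
# Children c1..c4 with estimates 6, 12, 6, 9 and epsilon = 0.2.
q = {"c1": 6, "c2": 12, "c3": 6, "c4": 9}
eps = 0.2
best = min(q.values())
greedy = [c for c in q if q[c] == best]            # ["c1", "c3"]
pi = {c: (1 - eps) / len(greedy) if c in greedy
      else eps / (len(q) - len(greedy)) for c in q}
print(pi)  # -> {'c1': 0.4, 'c2': 0.1, 'c3': 0.4, 'c4': 0.1}
```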


ε-greedy: Asymptotic Optimality

Asymptotic Optimality of ε-greedy

- explores forever
- not greedy in the limit

⇒ not asymptotically optimal

asymptotically optimal variant uses decaying ε, e.g. ε = 1/k


ε-greedy: Weakness

Problem: when ε-greedy explores, all non-greedy actions are treated equally

Decision node d with children c1, c2, c3, ..., cℓ+2 and estimates
Q̂(c1) = 8, Q̂(c2) = 9 and Q̂(c3) = · · · = Q̂(cℓ+2) = 50 (ℓ nodes with estimate 50).

Assuming a(ci) = ai, ε = 0.2 and ℓ = 9, we get:

- π(a1 | d) = 0.8
- π(a2 | d) = π(a3 | d) = · · · = π(a11 | d) = 0.02


G6.2 Softmax


Softmax: Idea

- tree policy with constant parameter τ
- select actions proportionally to their action-value estimate
- most popular softmax tree policy uses Boltzmann exploration
- ⇒ selects actions proportionally to e^(−Q̂_k(c)/τ)

Tree Policy based on Boltzmann Exploration

π(a(c) | d) = e^(−Q̂_k(c)/τ) / Σ_{c' ∈ children(d)} e^(−Q̂_k(c')/τ)
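A minimal Python sketch of this Boltzmann tree policy, again for the cost-minimizing setting. Subtracting the minimum estimate before exponentiating is a standard numerical-stability trick that I add here; it is not stated on the slides and does not change the resulting distribution.

```python
import math
import random

def boltzmann_select(children, q_est, tau):
    """Pick a child with probability proportional to exp(-Q̂_k(c) / tau).

    children: list of child nodes of the decision node d
    q_est:    dict mapping each child c to its current estimate Q̂_k(c)
    tau:      constant temperature parameter > 0
    """
    # Shift by the minimum estimate for numerical stability; the shift cancels
    # out in the normalization, so the probabilities are unchanged.
    q_min = min(q_est[c] for c in children)
    weights = [math.exp(-(q_est[c] - q_min) / tau) for c in children]
    total = sum(weights)
    probs = [w / total for w in weights]
    # random.choices draws one child according to the given weights.
    return random.choices(children, weights=probs, k=1)[0]
```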


Softmax: Example

Decision node d with children c1, c2, c3, ..., cℓ+2 and estimates
Q̂(c1) = 8, Q̂(c2) = 9 and Q̂(c3) = · · · = Q̂(cℓ+2) = 50 (ℓ nodes with estimate 50).

Assuming a(ci) = ai, τ = 10 and ℓ = 9, we get:

- π(a1 | d) ≈ 0.49
- π(a2 | d) ≈ 0.45
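Plugging the example numbers into the Boltzmann formula reproduces this distribution up to rounding; the nine remaining actions share the small rest of the probability mass. A self-contained check:

```python
import math

# One child with estimate 8, one with 9, and nine children with estimate 50; tau = 10.
q_values = [8, 9] + [50] * 9
tau = 10
weights = [math.exp(-q / tau) for q in q_values]
total = sum(weights)
probs = [w / total for w in weights]
print([round(p, 3) for p in probs])
# -> roughly [0.49, 0.444, 0.007, 0.007, ..., 0.007]
```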


Boltzmann Exploration: Asymptotic Optimality

Asymptotic Optimality of Boltzmann Exploration

- explores forever
- not greedy in the limit:
  - state- and action-value estimates converge to finite values
  - therefore, probabilities also converge to positive, finite values

⇒ not asymptotically optimal

asymptotically optimal variant uses decaying τ, e.g. τ = 1 / log k

careful: τ must not decay faster than logarithmically
(i.e., must have τ ≥ const / log k) to explore infinitely


Boltzmann Exploration: Weakness

(Figure: two sampled cost distributions P over actions a1, a2, a3 with equal means but different variance.)

- Boltzmann exploration and ε-greedy only consider the mean of the sampled action-values
- as we sample the same node many times, we can also gather information about the variance (how reliable the information is)
- Boltzmann exploration ignores the variance, treating the two scenarios equally


G6.3 UCB1


Upper Confidence Bounds: Idea

Balance exploration and exploitation by preferring actions that

- have been successful in earlier iterations (exploit)
- have been selected rarely (explore)


Upper Confidence Bounds: Idea

- select successor c of d that minimizes Q̂_k(c) − E_k(d) · B_k(c), based on
  - action-value estimate Q̂_k(c),
  - exploration factor E_k(d) and
  - bonus term B_k(c)
- select B_k(c) such that
  Q*(s(c), a(c)) ≥ Q̂_k(c) − E_k(d) · B_k(c) with high probability
- Idea: Q̂_k(c) − E_k(d) · B_k(c) is a lower confidence bound on Q*(s(c), a(c)) under the collected information


Bonus Term of UCB1

- use B_k(c) = √(2 · ln N_k(d) / N_k(c)) as bonus term
- bonus term is derived from the Chernoff-Hoeffding bound:
  - it gives the probability that a sampled value (here: Q̂_k(c))
  - is far from its true expected value (here: Q*(s(c), a(c)))
  - depending on the number of samples (here: N_k(c))
- picks the optimal action exponentially more often
- the concrete MCTS algorithm that uses UCB1 is called UCT
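A compact Python sketch of UCB1-based child selection as used in UCT, for the cost setting of the slides: it minimizes Q̂_k(c) − E_k(d) · B_k(c) with the bonus term from above. The argument names are hypothetical, and the rule of selecting unvisited children first is a common convention that the slide does not spell out.

```python
import math

def ucb1_select(children, q_est, visits_d, visits, exploration_factor):
    """Select the child c of d that minimizes Q̂_k(c) - E_k(d) * B_k(c).

    children:           list of child nodes of the decision node d
    q_est:              dict child -> action-value estimate Q̂_k(c)
    visits_d:           number of visits N_k(d) of the decision node
    visits:             dict child -> number of visits N_k(c)
    exploration_factor: the value E_k(d)
    """
    def score(c):
        # Unvisited children get the lowest possible score, i.e. they are
        # selected before any visited child (common convention, added here).
        if visits[c] == 0:
            return float("-inf")
        bonus = math.sqrt(2.0 * math.log(visits_d) / visits[c])
        return q_est[c] - exploration_factor * bonus

    return min(children, key=score)
```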


Exploration Factor (1)

Exploration factor E_k(d) serves two roles in SSPs:

- UCB1 is designed for MABs with rewards in [0, 1]
  ⇒ Q̂_k(c) ∈ [0, 1] for all k and c
- bonus term B_k(c) = √(2 · ln N_k(d) / N_k(c)) is always ≥ 0
- when d is visited,
  - B_{k+1}(c) > B_k(c) if a(c) is not selected
  - B_{k+1}(c) < B_k(c) if a(c) is selected
- if B_k(c) ≥ 2 for some c, UCB1 must explore
- hence, Q̂_k(c) and B_k(c) are always of similar size

⇒ set E_k(d) to a value that depends on V̂_k(d)


Exploration Factor (2)

Exploration factor E_k(d) serves two roles in SSPs:

- E_k(d) allows adjusting the balance between exploration and exploitation
- search with E_k(d) = V̂_k(d) is very greedy
- in practice, E_k(d) is often multiplied with a constant > 1
- UCB1 often requires a hand-tailored E_k(d) to work well
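As a hypothetical illustration of the last two points, the exploration factor could be set to the current state-value estimate scaled by a hand-picked constant; the value 2.0 below is an arbitrary example, not taken from the slides.

```python
def exploration_factor(v_est_d, scale=2.0):
    # E_k(d) proportional to the state-value estimate V̂_k(d);
    # scale > 1 makes the search less greedy than E_k(d) = V̂_k(d).
    return scale * v_est_d
```

Such a value could then be passed as the exploration_factor argument of the UCB1 sketch above.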


Asymptotic Optimality

Asymptotic Optimality of UCB1

- explores forever
- greedy in the limit

⇒ asymptotically optimal

However:

- no theoretical justification to use UCB1 for SSPs/MDPs (the MAB proof requires stationary rewards)
- development of tree policies is an active research topic


Symmetric Search Tree up to depth 4

(Figure: full tree up to depth 4.)


Asymmetric Search Tree of UCB1

(Figure: asymmetric tree grown by UCB1, with an equal number of search nodes.)


G6.4 Summary


Summary

- ε-greedy, Boltzmann exploration and UCB1 balance exploration and exploitation
- ε-greedy selects a greedy action with probability 1 − ε and another action uniformly at random otherwise
- ε-greedy selects all non-greedy actions with the same probability
- Boltzmann exploration selects each action proportionally to its action-value estimate
- Boltzmann exploration does not take the confidence of the estimate into account
- UCB1 selects actions greedily w.r.t. an upper confidence bound on the action-value estimate

