Planning and Optimization

(1)

G4. Monte-Carlo Tree Search: Framework

Malte Helmert and Gabriele Röger

Universität Basel

(2)

Content of this Course

Planning

Classical: Foundations, Logic, Heuristics, Constraints

Probabilistic: Explicit MDPs, Factored MDPs

(3)

Content of this Course: Factored MDPs

Factored MDPs

Foundations

Heuristic Search

Monte-Carlo Methods: Suboptimal Algorithms, MCTS

(4)

Motivation

(5)

Motivation

Previously discussed Monte-Carlo methods:

Hindsight Optimization suffers from assumption of clairvoyance

Policy Simulation overcomes assumption of clairvoyance by sampling the execution of a policy

Policy Simulation is suboptimal due to inability of policy to improve

Sparse Sampling achieves near-optimality without considering all outcomes

Sparse Sampling wastes time in non-promising parts of state space

(6)

Monte-Carlo Tree Search

Monte-Carlo Tree Search (MCTS) has several similarities with algorithms we have already seen:

Like (L)RTDP, MCTS performs trials (also called rollouts)

Like Policy Simulation, trials simulate execution of a policy

Like other Monte-Carlo methods, Monte-Carlo backups are performed

Like Sparse Sampling, an outcome is only explicated if it is sampled in a trial

(9)

MCTS Tree

(10)

MCTS Tree

Unlike previous methods, the SSP is explicated as a tree

Duplicates (also: transpositions) possible, i.e., multiple search nodes with identical associated state

Search tree can (and often will) have unbounded depth

(11)

Tree Structure

Differentiate between two types of search nodes:

Decision nodes

Chance nodes

Search nodes correspond 1:1 to traces from the initial state

Decision and chance nodes alternate

Decision nodes correspond to states in a trace

Chance nodes correspond to actions in a trace

Decision nodes have one child node for each applicable action (if all children are explicated)

Chance nodes have one child node for each outcome (if all children are explicated)

(12)

MCTS Tree

Definition (MCTS Tree)

An MCTS tree is given by a tuple G = ⟨d0, D, C, E⟩, where

D and C are disjoint sets of decision and chance nodes (simply search node if the type does not matter)

d0 ∈ D is the root node

E ⊆ (D × C) ∪ (C × D) is the set of edges such that the graph ⟨D ∪ C, E⟩ is a tree

Note: can be regarded as an AND/OR tree

(13)

Search Node Annotations

Definition (Search Node Annotations)
Let G = ⟨d0, D, C, E⟩ be an MCTS tree.

Each search node n ∈ D ∪ C is annotated with

a visit counter N(n)

a state s(n)

Each decision node d ∈ D is annotated with

a state-value estimate V̂(d)

a probability p(d)

Each chance node c ∈ C is annotated with

an action-value estimate (or Q-value estimate) Q̂(c)

an action a(c)

Note: some annotations can be computed on the fly to save memory
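
To make these annotations concrete, here is a minimal Python sketch of the two node types; the class and field names (DecisionNode, ChanceNode, children, ...) are illustrative choices, not prescribed by these slides.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class DecisionNode:
    state: Any                 # s(d): associated state
    prob: float = 1.0          # p(d): probability annotation
    visits: int = 0            # N(d): visit counter
    value: float = 0.0         # V̂(d): state-value estimate
    children: List["ChanceNode"] = field(default_factory=list)

@dataclass
class ChanceNode:
    state: Any                 # s(c): associated state
    action: Any                # a(c): associated action
    visits: int = 0            # N(c): visit counter
    qvalue: float = 0.0        # Q̂(c): action-value estimate
    children: List[DecisionNode] = field(default_factory=list)
```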

(14)

Framework

(15)

Trials

The MCTS tree is built in trials

Trials are performed as long as resources (deliberation time, memory) allow

Initially, the MCTS tree consists of only the root node for the initial state

Trials (may) add search nodes to the tree

MCTS tree at the end of the i-th trial is denoted by G^i

Use same superscript for annotations of search nodes
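
How "as long as resources allow" is realized is not fixed by the framework; a simple time-budget loop like the following sketch is one option (perform_trial is a hypothetical callback that runs one trial on the tree rooted at root).

```python
import time

def run_trials(root, deliberation_time_s, perform_trial):
    """Perform MCTS trials until the deliberation-time budget is exhausted.

    Minimal sketch: only the time budget is checked here; a memory bound
    could be added analogously. Returns the number of trials performed.
    """
    deadline = time.monotonic() + deliberation_time_s
    trials = 0
    while time.monotonic() < deadline:
        perform_trial(root)   # one trial; may add search nodes to the tree
        trials += 1
    return trials
```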

(16)

Trials

[Figure taken from Browne et al., "A Survey of Monte Carlo Tree Search Methods", 2012]

(17)

Phases of Trials

Each trial consists of (up to) four phases:

Selection: traverse the tree by sampling the execution of the tree policy until

1. an action is applicable that is not explicated, or
2. an outcome is sampled that is not explicated, or
3. a goal state is reached (jump to backpropagation)

Expansion: create search nodes for the applicable action and a sampled outcome (case 1) or just the outcome (case 2)

Simulation: simulate default policy until a goal is reached

Backpropagation: update visited nodes in reverse order by

increasing visit counter by 1

performing Monte-Carlo backup of state-/action-value estimate

(18)

Monte-Carlo Backups in MCTS Tree

Let d_0, c_0, . . . , c_{n−1}, d_n be the decision and chance nodes that were visited in a trial of MCTS (including explicated ones), and let h be the cost incurred by the simulation of the default policy until a goal state is reached.

Each decision node d_j for 0 ≤ j ≤ n is updated by

V̂^i(d_j) := V̂^{i−1}(d_j) + (1 / N^i(d_j)) · ( Σ_{k=j}^{n−1} cost(a(c_k)) + h − V̂^{i−1}(d_j) )

Each chance node c_j for 0 ≤ j < n is updated by

Q̂^i(c_j) := Q̂^{i−1}(c_j) + (1 / N^i(c_j)) · ( Σ_{k=j}^{n−1} cost(a(c_k)) + h − Q̂^{i−1}(c_j) )
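
Both updates are the standard incremental running average: after the i-th trial, V̂^i(d_j) (resp. Q̂^i(c_j)) equals the mean of the costs-to-go observed in all trials that visited the node. A minimal sketch of this update (names are illustrative):

```python
def monte_carlo_backup(old_estimate, new_visit_count, observed_cost_to_go):
    """Incremental mean: estimate after incorporating one new cost-to-go sample."""
    return old_estimate + (observed_cost_to_go - old_estimate) / new_visit_count

# Example: an estimate of 10.0 based on 3 samples, updated with an observed
# cost-to-go of 18.0, becomes 10.0 + (18.0 - 10.0) / 4 = 12.0,
# i.e. the mean of all four samples.
print(monte_carlo_backup(10.0, 4, 18.0))   # 12.0
```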

(19)–(29)

MCTS: (Unit-cost) Example

[Figures: an MCTS tree annotated with state-/action-value estimates and visit counters, shown step by step over one trial]

Selection phase: apply tree policy to traverse tree

Expansion phase: create search nodes

Simulation phase: apply default policy until goal

Backpropagation phase: update visited nodes

(30)

MCTS Framework

Members of the MCTS framework are specified in terms of:

Tree policy

Default policy

(31)

MCTS Tree Policy

Definition (Tree Policy)

Let T be an SSP. An MCTS tree policy is a probability distribution π(a | d) over all a ∈ A(s(d)) for each decision node d.

Note: The tree policy may take information annotated in the current tree into account.
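
One common instantiation (not the only one) is an ε-greedy tree policy over the current action-value estimates. The sketch below assumes the DecisionNode/ChanceNode classes from the earlier sketch and, matching the pseudocode later in this chapter, returns the child chance node whose action is sampled.

```python
import random

def epsilon_greedy_tree_policy(d, epsilon=0.1):
    """Return a child chance node of decision node d.

    With probability 1 - epsilon pick the child with the lowest Q-value
    estimate (greedy w.r.t. expected cost); otherwise pick one of the
    remaining children uniformly at random. Illustrative sketch only.
    """
    greedy = min(d.children, key=lambda c: c.qvalue)
    others = [c for c in d.children if c is not greedy]
    if others and random.random() < epsilon:
        return random.choice(others)
    return greedy
```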

(32)

MCTS Default Policy

Definition (Default Policy)

Let T be an SSP. An MCTS default policy is a probability distribution π(a | s) over actions a ∈ A(s) for each state s.

Note: The default policy is independent of the MCTS tree.
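
A frequently used baseline default policy is the uniform random policy. The sketch below simulates it until a goal state is reached; the ssp interface (is_goal, applicable_actions, cost, sample_successor) is an assumed illustration, not notation from this course.

```python
import random

def sample_uniform_default_policy(ssp, state, max_steps=100_000):
    """Simulate a uniformly random default policy from `state` until a goal
    state is reached and return the accumulated cost.

    Illustrative sketch; max_steps is only a safeguard against overly long
    simulations and not part of the definition above.
    """
    total_cost = 0.0
    for _ in range(max_steps):
        if ssp.is_goal(state):
            break
        action = random.choice(ssp.applicable_actions(state))
        total_cost += ssp.cost(state, action)
        state = ssp.sample_successor(state, action)
    return total_cost
```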

(33)

Monte-Carlo Tree Search

MCTS for SSP T = ⟨S, A, c, T, s0, S⋆⟩:
  d0 := create root node associated with s0
  while time allows:
    visit decision node(d0, T)
  return a(argmin_{c ∈ children(d0)} Q̂(c))

(34)

MCTS: Visit a Decision Node

visit decision node for decision node d, SSP T = ⟨S, A, c, T, s0, S⋆⟩:
  if s(d) ∈ S⋆ then return 0
  if there is a ∈ A(s(d)) s.t. a(c) ≠ a for all c ∈ children(d):
    select such an a and add node c with a(c) = a to children(d)
  else:
    c = tree policy(d)
  cost = visit chance node(c, T)
  N(d) := N(d) + 1
  V̂(d) := V̂(d) + (1 / N(d)) · (cost − V̂(d))
  return cost

(35)

MCTS: Visit a Chance Node

visit chance node for chance node c, SSP T = ⟨S, A, c, T, s0, S⋆⟩:
  s′ ∼ succ(s(c), a(c))
  let d be the node in children(c) with s(d) = s′
  if there is no such node:
    add node d with s(d) = s′ to children(c)
    cost = sample default policy(s′)
    N(d) := 1, V̂(d) := cost
  else:
    cost = visit decision node(d, T)
  cost = cost + cost(s(c), a(c))
  N(c) := N(c) + 1
  Q̂(c) := Q̂(c) + (1 / N(c)) · (cost − Q̂(c))
  return cost
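
Putting the pieces together, here is a hedged Python rendering of the two procedures above. It reuses the DecisionNode/ChanceNode classes, the assumed ssp interface and the policy sketches from earlier; where the pseudocode says "select such an a" without fixing the choice, the sketch picks uniformly at random.

```python
import random

def visit_decision_node(d, ssp, tree_policy, default_policy):
    # mirrors the "visit decision node" pseudocode above
    if ssp.is_goal(d.state):
        return 0.0
    explicated = {c.action for c in d.children}
    unexplicated = [a for a in ssp.applicable_actions(d.state) if a not in explicated]
    if unexplicated:
        c = ChanceNode(state=d.state, action=random.choice(unexplicated))
        d.children.append(c)
    else:
        c = tree_policy(d)
    cost = visit_chance_node(c, ssp, tree_policy, default_policy)
    d.visits += 1
    d.value += (cost - d.value) / d.visits
    return cost

def visit_chance_node(c, ssp, tree_policy, default_policy):
    # mirrors the "visit chance node" pseudocode above
    successor = ssp.sample_successor(c.state, c.action)
    d = next((child for child in c.children if child.state == successor), None)
    if d is None:
        d = DecisionNode(state=successor)
        c.children.append(d)
        cost = default_policy(ssp, successor)   # simulation of the default policy
        d.visits, d.value = 1, cost
    else:
        cost = visit_decision_node(d, ssp, tree_policy, default_policy)
    cost += ssp.cost(c.state, c.action)
    c.visits += 1
    c.qvalue += (cost - c.qvalue) / c.visits
    return cost
```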

(36)

Summary

(37)

Summary

Monte-Carlo Tree Search is a framework for algorithms

MCTS algorithms perform trials

Each trial consists of (up to) 4 phases

MCTS algorithms are specified by two policies:

a tree policy that describes behavior "in" the tree

and a default policy that describes behavior "outside" of the tree
