(1)

Planning and Optimization

G5. Monte-Carlo Tree Search Algorithms (Part I)

Malte Helmert and Gabriele Röger

Universität Basel

(2)


Content of this Course

Planning
  Classical: Foundations, Logic, Heuristics, Constraints
  Probabilistic: Explicit MDPs, Factored MDPs

(3)


Content of this Course: Factored MDPs

Factored MDPs
  Foundations
  Heuristic Search
  Monte-Carlo Methods: Suboptimal Algorithms, MCTS

(4)


Introduction

(5)


Monte-Carlo Tree Search: Reminder

Performs iterations with 4 phases:

selection: use given tree policy to traverse explicated tree
expansion: add node(s) to the tree
simulation: use given default policy to simulate run
backpropagation: update visited nodes with Monte-Carlo backups


(9)


Motivation

Monte-Carlo Tree Search is a framework of algorithms:
concrete MCTS algorithms are specified in terms of
a tree policy and
a default policy.

For most tasks, a well-suited MCTS configuration exists,
but for each task, many MCTS configurations perform poorly,
and every MCTS configuration that works well in one problem performs poorly in another problem.

⇒ There is no “Swiss army knife” configuration for MCTS

(10)


Role of Tree Policy

used to traverse explicated tree from root node to a leaf
maps decision nodes to a probability distribution over actions
(usually as a function over a decision node and its children)
exploits information from the search tree
able to learn over time
requires MCTS tree to memorize collected information

(11)


Role of Default Policy

used to simulate run from some state to a goal
maps states to a probability distribution over actions
independent from MCTS tree
does not improve over time
can be computed quickly
constant memory requirements
accumulated cost of simulated run used to initialize state-value estimate of decision node
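To make the contrast between the two roles concrete, here is a minimal Python sketch of their interfaces; representing a decision node only by its per-action statistics is a simplifying assumption of this sketch, not part of the lecture.

import random
from typing import Callable, Dict, Hashable

State = Hashable    # illustrative placeholder for a state representation
Action = Hashable   # illustrative placeholder for an action representation

# statistics a decision node stores about its children, e.g. the current
# action-value estimates collected by the search (illustrative)
NodeStatistics = Dict[Action, float]

# tree policy: decision node (represented here only by its statistics)
# -> probability distribution over actions; exploits the tree, can learn over time
TreePolicy = Callable[[NodeStatistics], Dict[Action, float]]

# default policy: state -> probability distribution over actions;
# independent of the tree, memoryless, cheap to evaluate
DefaultPolicy = Callable[[State], Dict[Action, float]]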

(12)


Default Policy

(13)


MCTS Simulation

MCTS simulation with default policy π from state s:

cost := 0
while s ∉ S⋆:
    a :∼ π(s)
    cost := cost + c(a)
    s :∼ succ(s, a)
return cost

Default policy must be proper
to guarantee termination of the procedure and a finite cost
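A minimal Python sketch of this simulation procedure; the helper functions is_goal, cost, and sample_successor stand in for the task model (goal test S⋆, cost function c, and successor distribution succ) and are assumptions of this sketch, not part of the lecture.

import random
from typing import Callable, Dict, Hashable

State = Hashable
Action = Hashable

def simulate(s: State,
             default_policy: Callable[[State], Dict[Action, float]],
             is_goal: Callable[[State], bool],
             cost: Callable[[Action], float],
             sample_successor: Callable[[State, Action], State]) -> float:
    """Simulate a run under the default policy and return its accumulated cost.

    Terminates with a finite cost only if the default policy is proper,
    i.e. reaches a goal state with probability 1.
    """
    total = 0.0
    while not is_goal(s):
        dist = default_policy(s)                    # a :~ pi(s)
        actions, probs = zip(*dist.items())
        a = random.choices(actions, weights=probs)[0]
        total += cost(a)                            # cost := cost + c(a)
        s = sample_successor(s, a)                  # s :~ succ(s, a)
    return total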

(14)

Default Policy: Example

[Figure: an SSP with states s, t, u, v, w and goal state g, and actions a0 (cost 10), a1 (cost 0), a2 (cost 50), a3 (cost 0) and a4 (cost 100); two of the actions have two possible outcomes with probability 0.5 each.]

Consider the deterministic default policy π.
State-value of s under π: 60
Accumulated cost of the simulated run: 110 (built up step by step as 0, 10, 60, 110)

(18)


Default Policy Realizations

Early MCTS implementations used the random default policy:

π(a | s) = 1/|A(s)| if a ∈ A(s), and 0 otherwise

only proper if a goal can be reached from each state
poor guidance, and due to high variance even misguidance
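A minimal sketch of this random default policy in Python, assuming a hypothetical applicable_actions(s) helper; it plugs directly into the simulate sketch above.

from typing import Callable, Dict, Hashable, Iterable

State = Hashable
Action = Hashable

def make_random_default_policy(
        applicable_actions: Callable[[State], Iterable[Action]]
) -> Callable[[State], Dict[Action, float]]:
    """Return pi with pi(a|s) = 1/|A(s)| for applicable a, and 0 otherwise.

    Only proper if a goal state can be reached from every state.
    """
    def pi(s: State) -> Dict[Action, float]:
        actions = list(applicable_actions(s))
        return {a: 1.0 / len(actions) for a in actions}
    return pi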

(19)


Default Policy Realizations

There are only few alternatives to the random default policy, e.g.,
heuristic-based policy
domain-specific policy

Reason: no matter how good the policy, the result of a simulation can be arbitrarily poor

(20)

Default Policy: Example (2)

[Figure: the same SSP as in the previous example.]

Consider the deterministic default policy π.
State-value of s under π: 60
Accumulated cost of the simulated run: 110 (built up step by step as 0, 10, 60, 110)

(24)


Default Policy Realizations

Possible solution to overcome this weakness:
average over multiple random walks
converges to true action-values of the policy
computationally often very expensive

Cheaper and more successful alternative:
skip simulation step of MCTS
use heuristic directly for initialization of state-value estimates instead of simulating execution of a heuristic-guided policy
much more successful (e.g., neural networks of AlphaGo)
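A hedged sketch of the two alternatives in Python; rollout is any function returning the accumulated cost of one simulated run from s (e.g. the simulate sketch above with its arguments fixed), and heuristic is a hypothetical heuristic estimator.

from statistics import mean
from typing import Callable, Hashable

State = Hashable

def averaged_rollout_estimate(s: State,
                              rollout: Callable[[State], float],
                              num_walks: int = 20) -> float:
    """Average the accumulated cost of several simulated runs from s.

    Converges to the state-value of the default policy, but is expensive.
    """
    return mean(rollout(s) for _ in range(num_walks))

def heuristic_initialization(s: State,
                             heuristic: Callable[[State], float]) -> float:
    """Skip the simulation step entirely and initialize from a heuristic value."""
    return heuristic(s)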

(25)


Asymptotic Optimality

(26)


Optimal Search

Heuristic search algorithms (like RTDP) achieve optimality by combining
greedy search,
an admissible heuristic, and
Bellman backups.

In Monte-Carlo Tree Search,
search behavior is defined by a tree policy,
admissibility of the default policy / heuristic is irrelevant (and usually not given),
and backups are Monte-Carlo backups.

MCTS requires a different idea for optimal behavior in the limit.

(27)


Asymptotic Optimality

Asymptotic Optimality

Let an MCTS algorithm build an MCTS tree G = ⟨d0, D, C, E⟩.
The MCTS algorithm is asymptotically optimal if
lim_{k→∞} Q̂_k(c) = Q*(s(c), a(c)) for all c ∈ C_k,
where k is the number of trials.

this is just one special form of asymptotic optimality
some optimal MCTS algorithms are not asymptotically optimal by this definition
(e.g., lim_{k→∞} Q̂_k(c) = ℓ · Q*(s(c), a(c)) for some ℓ ∈ ℝ+)
all practically relevant optimal MCTS algorithms are asymptotically optimal by this definition

(28)


Asymptotically Optimal Tree Policy

An MCTS algorithm is asymptotically optimal if

1. its tree policy explores forever:
the (infinite) sum of the probabilities that a decision node is visited must diverge
every search node is explicated eventually and visited infinitely often

2. its tree policy is greedy in the limit:
probability that an optimal action is selected converges to 1
in the limit, backups based on iterations where only an optimal policy is followed dominate suboptimal backups

3. its default policy initializes decision nodes with finite values

(29)


Example: Random Tree Policy

Example

Consider the random tree policy for decision node d, where:

π(a | d) = 1/|A(s(d))| if a ∈ A(s(d)), and 0 otherwise

The random tree policy explores forever:
Let ⟨d0, c0, . . . , dn, cn, d⟩ be a sequence of connected nodes in G_k
and let p := min_{0 ≤ i ≤ n−1} T(s(d_i), a(c_i), s(d_{i+1})).
Let P_k be the probability that d is visited in trial k.
With P_k ≥ ((1/|A|) · p)^n, we have that

lim_{k→∞} Σ_{i=1}^{k} P_i ≥ lim_{k→∞} k · ((1/|A|) · p)^n = ∞

(30)


Example: Random Tree Policy

Example

Consider the random tree policy for decision node d, where:

π(a | d) = 1/|A(s(d))| if a ∈ A(s(d)), and 0 otherwise

The random tree policy is not greedy in the limit unless all actions are always optimal:
the probability that an optimal action a is selected in decision node d is

lim_{k→∞} (1 − Σ_{a′ ∉ π_{V*}(s(d))} 1/|A(s(d))|) < 1.

MCTS with random tree policy is not asymptotically optimal

(31)


Example: Greedy Tree Policy

Example

Consider the greedy tree policy for decision node d, where:

π(a | d) = 1/|A_k*(d)| if a ∈ A_k*(d), and 0 otherwise,

with A_k*(d) = {a(c) ∈ A(s(d)) | c ∈ argmin_{c′ ∈ children(d)} Q̂_k(c′)}.

The greedy tree policy is greedy in the limit.
The greedy tree policy does not explore forever.

MCTS with greedy tree policy is not asymptotically optimal
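A minimal Python sketch contrasting these two tree policies at a single decision node; it assumes the per-action statistics are given as a dictionary estimates mapping each applicable action to the current estimate Q̂_k of its child node, and it samples an action directly instead of returning the full distribution.

import random
from typing import Dict, Hashable

Action = Hashable

def random_tree_policy(estimates: Dict[Action, float]) -> Action:
    """Pick an applicable action uniformly at random:
    explores forever, but is not greedy in the limit."""
    return random.choice(list(estimates))

def greedy_tree_policy(estimates: Dict[Action, float]) -> Action:
    """Pick an action whose child has the lowest current cost estimate:
    greedy in the limit, but does not explore forever."""
    best = min(estimates.values())
    return random.choice([a for a, q in estimates.items() if q == best])

Neither selection rule satisfies both requirements at once, which motivates tree policies that balance exploration and exploitation, as discussed next.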

(32)


Tree Policy: Objective

To satisfy both requirements, MCTS tree policies have two contradictory objectives:

explore parts of the search space that have not been investigated thoroughly
exploit knowledge about good actions to focus the search on promising areas of the search space

central challenge: balance exploration and exploitation
⇒ borrow ideas from the related multi-armed bandit problem

(33)


Multi-armed Bandit Problem

(34)


Multi-armed Bandit Problem

most commonly used tree policies are inspired by research on the multi-armed bandit problem (MAB)
MAB is a learning scenario (model not revealed to agent)
agent repeatedly faces the same decision: pull one of several arms of a slot machine
pulling an arm yields a stochastic reward
⇒ in MABs, we have rewards rather than costs
can be modeled as an MDP

(35)

Multi-armed Bandit Problem: Planning Scenario

[Figure: a multi-armed bandit with three arms a1, a2, a3; arm a1 yields reward 8 with probability 0.2 and 3 with probability 0.8, arm a2 yields 0, 6 or 12 with probabilities 0.2, 0.6 and 0.2, and arm a3 yields 0 with probability 0.9 and 80 with probability 0.1.]

Compute Q*(a) for a ∈ {a1, a2, a3}
Pull arm argmax_{a ∈ {a1,a2,a3}} Q*(a) = a3 forever
Expected accumulated reward after k trials is 8 · k
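With the reward distributions as given in the figure placeholder above (a reading of the original diagram, so treat the exact numbers as an assumption), the optimal action-values work out as

Q*(a1) = 0.2 · 8 + 0.8 · 3 = 4
Q*(a2) = 0.2 · 0 + 0.6 · 6 + 0.2 · 12 = 6
Q*(a3) = 0.9 · 0 + 0.1 · 80 = 8

so a3 is the optimal arm and pulling it forever yields an expected reward of 8 per trial, i.e., 8 · k after k trials.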

(37)

Multi-armed Bandit Problem: Learning Scenario

[Figure: the same three-armed bandit; the agent maintains a value estimate Q̂ and a visit counter N for each arm, updated after every pull.]

Pull arms following a policy to explore or exploit
Update Q̂ and N based on observations
Example run: accumulated reward after 6 trials is 3 + 6 + 0 + 6 + 0 + 8 = 23
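A minimal Python sketch of this learning loop; the reward distributions are the ones from the figure placeholder above, and the uniformly random arm choice is only a stand-in for a real exploration/exploitation policy.

import random

# assumed reward distributions, read off the figure: (reward, probability) pairs
ARMS = {
    "a1": [(8, 0.2), (3, 0.8)],
    "a2": [(0, 0.2), (6, 0.6), (12, 0.2)],
    "a3": [(0, 0.9), (80, 0.1)],
}

def pull(arm: str) -> float:
    """Sample a stochastic reward for the chosen arm."""
    rewards, probs = zip(*ARMS[arm])
    return random.choices(rewards, weights=probs)[0]

def run_bandit(trials: int = 6):
    """Learning scenario: the model is unknown; keep sample averages Q_hat and counts N."""
    q_hat = {a: 0.0 for a in ARMS}
    n = {a: 0 for a in ARMS}
    total = 0.0
    for _ in range(trials):
        arm = random.choice(list(ARMS))               # stand-in explore/exploit policy
        reward = pull(arm)
        n[arm] += 1
        q_hat[arm] += (reward - q_hat[arm]) / n[arm]  # incremental sample average
        total += reward
    return q_hat, n, total

With many trials, the sample averages q_hat of arms that are pulled infinitely often converge to the Q* values of the planning scenario.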

(44)


Policy Quality

Since the model is unknown to the MAB agent, it cannot achieve an accumulated reward of k · V* with V* := max_a Q*(a) in k trials.
Quality of an MAB policy π is measured in terms of regret, i.e., the difference between k · V* and the expected reward of π in k trials.
Regret cannot grow slower than logarithmically in the number of trials.
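Written out, with R_i denoting the (random) reward obtained in trial i when following π, the definition above reads

Regret_k(π) = k · V* − E[Σ_{i=1}^{k} R_i],  with V* = max_a Q*(a).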

(45)


MABs in MCTS Tree

many tree policies treat each decision node as an MAB where each action yields a stochastic reward
dependence of the reward on future decisions is ignored
MCTS planner uses simulations to learn reasonable behavior
SSP model is not considered

(46)


Summary

(47)


Summary

The simulation phase simulates the execution of the default policy.

MCTS algorithms are optimal in the limit if
the tree policy is greedy in the limit,
the tree policy explores forever, and
the default policy initializes with finite values.

Central challenge of most tree policies: balance exploration and exploitation.
Each decision of an MCTS tree policy can be viewed as a multi-armed bandit problem.
