Planning and Optimization

G5. Monte-Carlo Tree Search Algorithms (Part I)

Malte Helmert and Gabriele Röger

Universität Basel

December 14, 2020


Planning and Optimization

December 14, 2020 — G5. Monte-Carlo Tree Search Algorithms (Part I)

G5.1 Introduction
G5.2 Default Policy
G5.3 Asymptotic Optimality
G5.4 Multi-armed Bandit Problem
G5.5 Summary


Content of this Course

[Course overview diagram: Planning is split into Classical (Foundations, Logic, Heuristics, Constraints) and Probabilistic (Explicit MDPs, Factored MDPs).]

Content of this Course: Factored MDPs

[Chapter overview diagram: Factored MDPs covers Foundations, Heuristic Search and Monte-Carlo Methods; Monte-Carlo Methods covers Suboptimal Algorithms and MCTS.]



G5.1 Introduction



Monte-Carlo Tree Search: Reminder

Performs iterations with 4 phases:

▶ selection: use given tree policy to traverse explicated tree
▶ expansion: add node(s) to the tree
▶ simulation: use given default policy to simulate run
▶ backpropagation: update visited nodes with Monte-Carlo backups

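A rough, purely illustrative Python sketch of one such iteration (not the slides' own implementation). It assumes a hypothetical model object with methods is_goal, applicable_actions, cost and sample_successor, and a tree_policy that directly returns the selected child node:

    class Node:
        # Minimal search node: state, statistics, and explicated children.
        def __init__(self, state, edge_cost=0.0):
            self.state = state
            self.edge_cost = edge_cost   # cost of the action leading into this node
            self.children = []
            self.visits = 0
            self.value = 0.0             # current state-value estimate

    def mcts_iteration(root, tree_policy, default_policy, model):
        # selection: use the tree policy to traverse the explicated tree
        path = [root]
        while path[-1].children:
            path.append(tree_policy(path[-1]))
        leaf = path[-1]
        # expansion: add node(s) for the applicable actions of the leaf
        if not model.is_goal(leaf.state):
            for a in model.applicable_actions(leaf.state):
                succ = model.sample_successor(leaf.state, a)
                leaf.children.append(Node(succ, model.cost(a)))
        # simulation: use the default policy to simulate a run to the goal
        s, rollout_cost = leaf.state, 0.0
        while not model.is_goal(s):
            a = default_policy(s)
            rollout_cost += model.cost(a)
            s = model.sample_successor(s, a)
        # backpropagation: Monte-Carlo backups of the accumulated cost-to-go
        cost_to_go = rollout_cost
        for node in reversed(path):
            node.visits += 1
            node.value += (cost_to_go - node.value) / node.visits
            cost_to_go += node.edge_cost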



Motivation

▶ Monte-Carlo Tree Search is a framework of algorithms
▶ concrete MCTS algorithms are specified in terms of
  ▶ a tree policy
  ▶ and a default policy
▶ for most tasks, a well-suited MCTS configuration exists
▶ but for each task, many MCTS configurations perform poorly
▶ and every MCTS configuration that works well in one problem performs poorly in another problem

⇒ There is no “Swiss army knife” configuration for MCTS



Role of Tree Policy

▶ used to traverse explicated tree from root node to a leaf
▶ maps decision nodes to a probability distribution over actions
  (usually as a function over a decision node and its children)
▶ exploits information from search tree
▶ able to learn over time
▶ requires MCTS tree to memorize collected information


Role of Default Policy

▶ used to simulate run from some state to a goal
▶ maps states to a probability distribution over actions
▶ independent from MCTS tree
▶ does not improve over time
▶ can be computed quickly
▶ constant memory requirements
▶ accumulated cost of simulated run used to initialize state-value estimate of decision node



G5.2 Default Policy



MCTS Simulation

MCTS simulation with default policy π from state s:

    cost := 0
    while s ∉ S⋆:
        a :∼ π(s)
        cost := cost + c(a)
        s :∼ succ(s, a)
    return cost

Default policy must be proper
▶ to guarantee termination of the procedure
▶ and a finite cost

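A minimal Python sketch of this simulation loop, assuming the same hypothetical model interface (is_goal, cost, sample_successor) as in the earlier iteration sketch:

    def simulate(s, default_policy, model):
        # Run the default policy from state s until a goal state is reached;
        # return the accumulated cost of the simulated run.
        cost = 0.0
        while not model.is_goal(s):
            a = default_policy(s)               # a :~ π(s)
            cost += model.cost(a)               # cost := cost + c(a)
            s = model.sample_successor(s, a)    # s :~ succ(s, a)
        return cost

As on the slide, this only terminates with a finite cost if the default policy is proper.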

Default Policy: Example

[Figure: an SSP with states s, t, u, v, w and goal state g; actions a0 (cost 10), a1 (cost 0), a2 (cost 50), a3 (cost 0) and a4 (cost 100); a0 and a1 each have two outcomes with probability 0.5.]

Consider deterministic default policy π
State-value of s under π: 60
Accumulated cost of one simulated run: 110; of another: 60


Default Policy Realizations

▶ Early MCTS implementations used random default policy:

    π(a | s) = 1/|A(s)| if a ∈ A(s), and 0 otherwise

▶ only proper if goal can be reached from each state
▶ poor guidance, and due to high variance even misguidance
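A minimal sketch of this random default policy in Python, again assuming the hypothetical model interface with applicable_actions(s) returning A(s):

    import random

    def random_default_policy(s, model):
        # Uniform choice among the applicable actions A(s).
        return random.choice(list(model.applicable_actions(s)))

It can be passed to the simulation sketch above as, e.g., lambda s: random_default_policy(s, model).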



Default Policy Realizations

There are only a few alternatives to the random default policy, e.g.,
▶ heuristic-based policy
▶ domain-specific policy

Reason: no matter how good the policy, the result of a simulation can be arbitrarily poor


Default Policy: Example (2)

[Same SSP figure as before with deterministic default policy π; state-value of s under π: 60; the simulated run shown accumulates cost 110.]



Default Policy Realizations

Possible solution to overcome this weakness:
▶ average over multiple random walks
▶ converges to true action-values of policy
▶ computationally often very expensive

Cheaper and more successful alternative:
▶ skip simulation step of MCTS
▶ use heuristic directly for initialization of state-value estimates
  instead of simulating execution of heuristic-guided policy
▶ much more successful (e.g., neural networks of AlphaGo)
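A small sketch of the two initialization variants for a newly expanded decision node, reusing the simulate sketch from above; the function name and the heuristic h are illustrative assumptions, not part of the slides:

    def initialize_estimate(s, model, default_policy, h=None):
        # Initialize the state-value estimate of a new decision node for state s.
        if h is not None:
            return h(s)    # cheaper alternative: use the heuristic value directly
        # classic MCTS: accumulated cost of one run of the default policy
        return simulate(s, default_policy, model)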


G5.3 Asymptotic Optimality



Optimal Search

Heuristic search algorithms (like RTDP) achieve optimality by combining
▶ greedy search
▶ admissible heuristic
▶ Bellman backups

In Monte-Carlo Tree Search
▶ search behavior defined by a tree policy
▶ admissibility of default policy / heuristic irrelevant (and usually not given)
▶ Monte-Carlo backups

MCTS requires a different idea for optimal behavior in the limit.



Asymptotic Optimality

Asymptotic Optimality

Let an MCTS algorithm build an MCTS tree G = ⟨d_0, D, C, E⟩.
The MCTS algorithm is asymptotically optimal if

    lim_{k→∞} Q̂_k(c) = Q*(s(c), a(c)) for all c ∈ C_k,

where k is the number of trials.

▶ this is just one special form of asymptotic optimality
▶ some optimal MCTS algorithms are not asymptotically optimal by this definition
  (e.g., lim_{k→∞} Q̂_k(c) = ℓ · Q*(s(c), a(c)) for some ℓ ∈ ℝ⁺)
▶ all practically relevant optimal MCTS algorithms are asymptotically optimal by this definition



Asymptotically Optimal Tree Policy

An MCTS algorithm is asymptotically optimal if

1. its tree policy explores forever:
   ▶ the (infinite) sum of the probabilities that a decision node is visited must diverge
   ▶ ⇒ every search node is explicated eventually and visited infinitely often

2. its tree policy is greedy in the limit:
   ▶ probability that optimal action is selected converges to 1
   ▶ ⇒ in the limit, backups based on iterations where only an optimal policy is followed dominate suboptimal backups

3. its default policy initializes decision nodes with finite values

(See the sketch below for a tree policy that satisfies conditions 1 and 2.)
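One common way to satisfy the first two conditions (an illustration, not taken from these slides) is an ε-greedy tree policy whose exploration rate decays with the node's visit count; a minimal sketch using the Node fields from the earlier iteration sketch:

    import random

    def decaying_epsilon_greedy(node):
        # With probability eps pick a child uniformly at random; eps decays like
        # 1/k and the sum over k of 1/k diverges, so the policy explores forever.
        # Otherwise pick a child with minimal current cost estimate; since eps
        # converges to 0, the policy is greedy in the limit.
        eps = 1.0 / max(1, node.visits)
        if random.random() < eps:
            return random.choice(node.children)
        return min(node.children, key=lambda c: c.value)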


Example: Random Tree Policy

Example

Consider the random tree policy for decision node d where:

    π(a | d) = 1/|A(s(d))| if a ∈ A(s(d)), and 0 otherwise

The random tree policy explores forever:

Let ⟨d_0, c_0, ..., d_n, c_n, d⟩ be a sequence of connected nodes in G_k
and let p := min_{0<i<n−1} T(s(d_i), a(c_i), s(d_{i+1})).
Let P_k be the probability that d is visited in trial k.
With P_k ≥ (1/|A| · p)^n, we have that ∑_{k=1}^{∞} P_k = ∞.


Example: Random Tree Policy

Example

Consider the random tree policy for decision node d where:

    π(a | d) = 1/|A(s(d))| if a ∈ A(s(d)), and 0 otherwise

The random tree policy is not greedy in the limit unless all actions are always optimal:

The probability that an optimal action a is selected in decision node d is

    lim_{k→∞} 1 − ∑_{a′ ∉ π_{V*}(s)} 1/|A(s(d))| < 1.

MCTS with random tree policy not asymptotically optimal



Example: Greedy Tree Policy

Example

Consider the greedy tree policy for decision node d where:

    π(a | d) = 1/|A_k*(d)| if a ∈ A_k*(d), and 0 otherwise,

with A_k*(d) = {a(c) ∈ A(s(d)) | c ∈ arg min_{c′ ∈ children(d)} Q̂_k(c′)}.

▶ Greedy tree policy is greedy in the limit
▶ Greedy tree policy does not explore forever

MCTS with greedy tree policy not asymptotically optimal

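A minimal sketch of this greedy selection over the children's current estimates Q̂_k, breaking ties uniformly at random (using the Node fields of the earlier sketches; illustrative only):

    import random

    def greedy_tree_policy(node):
        # Select uniformly among the children with minimal current cost estimate,
        # i.e., among the actions in A_k*(d).
        best = min(c.value for c in node.children)
        return random.choice([c for c in node.children if c.value == best])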


Tree Policy: Objective

To satisfy both requirements, MCTS tree policies have two contradictory objectives:

▶ explore parts of the search space that have not been investigated thoroughly
▶ exploit knowledge about good actions to focus search on promising areas of the search space

central challenge: balance exploration and exploitation

⇒ borrow ideas from related multi-armed bandit problem


G5.4 Multi-armed Bandit Problem



Multi-armed Bandit Problem

▶ most commonly used tree policies are inspired by research on the multi-armed bandit problem (MAB)
▶ MAB is a learning scenario (model not revealed to agent)
▶ agent repeatedly faces the same decision: to pull one of several arms of a slot machine
▶ pulling an arm yields a stochastic reward
  ⇒ in MABs, we have rewards rather than costs
▶ can be modeled as an MDP

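A tiny bandit simulator as a sketch of this setting; representing each arm as a list of (probability, reward) pairs is an assumption for illustration, and the agent only ever observes the sampled rewards:

    import random

    class Bandit:
        def __init__(self, arms):
            # arms: dict mapping each arm to a list of (probability, reward) pairs
            self.arms = arms

        def pull(self, arm):
            # Sample a stochastic reward for the chosen arm.
            r, acc = random.random(), 0.0
            for prob, reward in self.arms[arm]:
                acc += prob
                if r < acc:
                    return reward
            return self.arms[arm][-1][1]   # fallback guards against rounding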

Multi-armed Bandit Problem: Planning Scenario

[Figure: MAB with initial state s_0 and arms a_1, a_2, a_3, each yielding a stochastic reward; the best arm a_3 has expected reward 8.]

▶ Compute Q*(a) for a ∈ {a_1, a_2, a_3}
▶ Pull arm arg max_{a ∈ {a_1, a_2, a_3}} Q*(a) = a_3 forever
▶ Expected accumulated reward after k trials is 8 · k

Multi-armed Bandit Problem: Learning Scenario

[Same MAB figure as in the planning scenario; Q̂ and N are updated after each trial.]

▶ Pull arms following policy to explore or exploit
▶ Update Q̂ and N based on observations
▶ Accumulated reward after 1 trial: 3
▶ Accumulated reward after 2 trials: 3 + 6 = 9
▶ Accumulated reward after 3 trials: 3 + 6 + 0 = 9
▶ Accumulated reward after 4 trials: 3 + 6 + 0 + 6 = 15
▶ Accumulated reward after 5 trials: 3 + 6 + 0 + 6 + 0 = 15
▶ Accumulated reward after 6 trials: 3 + 6 + 0 + 6 + 0 + 8 = 23
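A minimal sketch of such a learning agent with incremental updates of the counters N and the estimates Q̂, using the Bandit simulator sketched above; the ε-greedy choice between exploring and exploiting is an illustrative assumption (the slides do not fix a particular scheme):

    import random

    def run_bandit(bandit, arms, trials, eps=0.1):
        # arms: list of arm names
        # N[a]: number of pulls of arm a, Q[a]: average observed reward of arm a
        N = {a: 0 for a in arms}
        Q = {a: 0.0 for a in arms}
        total = 0.0
        for _ in range(trials):
            if random.random() < eps:                # explore
                a = random.choice(arms)
            else:                                    # exploit
                a = max(arms, key=lambda x: Q[x])
            r = bandit.pull(a)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]                # incremental average
            total += r
        return Q, N, total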


Policy Quality

▶ Since the model is unknown to the MAB agent, it cannot achieve the accumulated reward of k · V* with V* := max_a Q*(a) in k trials
▶ Quality of an MAB policy π is measured in terms of regret, i.e., the difference between k · V* and the expected reward of π in k trials
▶ Regret cannot grow slower than logarithmically in the number of trials

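As a small worked illustration of the realized (empirical) counterpart of this regret: in the learning scenario above, V* = 8 and the six trials yielded rewards 3, 6, 0, 6, 0, 8, so the empirical regret after 6 trials is 6 · 8 − 23 = 25:

    def empirical_regret(v_star, observed_rewards):
        # Difference between the best achievable expected reward and what was obtained.
        return v_star * len(observed_rewards) - sum(observed_rewards)

    print(empirical_regret(8, [3, 6, 0, 6, 0, 8]))   # 48 - 23 = 25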


MABs in MCTS Tree

▶ many tree policies treat each decision node as MAB
  ▶ where each action yields a stochastic reward
  ▶ dependence of reward on future decisions is ignored
▶ MCTS planner uses simulations to learn reasonable behavior
▶ SSP model is not considered



G5.5 Summary


Summary

▶ The simulation phase simulates the execution of the default policy
▶ MCTS algorithms are optimal in the limit if
  ▶ the tree policy is greedy in the limit,
  ▶ the tree policy explores forever, and
  ▶ the default policy initializes with finite values
▶ Central challenge of most tree policies: balance exploration and exploitation
▶ Each decision of an MCTS tree policy can be viewed as a multi-armed bandit problem.
