Planning and Optimization
G5. Monte-Carlo Tree Search Algorithms (Part I)
Malte Helmert and Gabriele Röger
Universität Basel
Content of this Course
Planning
- Classical: Foundations, Logic, Heuristics, Constraints
- Probabilistic: Explicit MDPs, Factored MDPs
Content of this Course: Factored MDPs
Factored MDPs
- Foundations
- Heuristic Search
- Monte-Carlo Methods: Suboptimal Algorithms, MCTS
Introduction
Monte-Carlo Tree Search: Reminder
Performs iterations with 4 phases:
- selection: use given tree policy to traverse explicated tree
- expansion: add node(s) to the tree
- simulation: use given default policy to simulate run
- backpropagation: update visited nodes with Monte-Carlo backups
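To make the interplay of the four phases concrete, here is a minimal control-flow sketch in Python (not from the slides; the node attributes and all five function parameters are assumed interfaces, and the alternation between decision and chance nodes is glossed over):

    def mcts_iteration(root, tree_policy, default_policy, expand, simulate, backup):
        # One MCTS iteration: selection, expansion, simulation, backpropagation.
        # selection: traverse the explicated tree with the tree policy
        path = [root]
        node = root
        while node.children:                 # descend until a leaf of the tree
            node = tree_policy(node)
            path.append(node)
        # expansion: add node(s) to the tree
        node = expand(node)
        path.append(node)
        # simulation: run the default policy from the new node's state
        cost = simulate(node.state, default_policy)
        # backpropagation: Monte-Carlo backups along the visited path
        for visited in reversed(path):
            backup(visited, cost)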
Motivation
Monte-Carlo Tree Search is a framework of algorithms:
- concrete MCTS algorithms are specified in terms of a tree policy and a default policy
- for most tasks, a well-suited MCTS configuration exists
- but for each task, many MCTS configurations perform poorly
- and every MCTS configuration that works well in one problem performs poorly in another problem
⇒ There is no “Swiss army knife” configuration for MCTS
Role of Tree Policy
- used to traverse explicated tree from root node to a leaf
- maps decision nodes to a probability distribution over actions (usually as a function over a decision node and its children)
- exploits information from search tree:
  - able to learn over time
  - requires MCTS tree to memorize collected information
Role of Default Policy
- used to simulate run from some state to a goal
- maps states to a probability distribution over actions
- independent from MCTS tree:
  - does not improve over time
  - can be computed quickly
  - constant memory requirements
- accumulated cost of simulated run used to initialize state-value estimate of decision node
Default Policy
MCTS Simulation
MCTS simulation with default policy π from state s:

    cost := 0
    while s ∉ S*:
        a :∼ π(s)
        cost := cost + c(a)
        s :∼ succ(s, a)
    return cost

Default policy must be proper to guarantee termination of the procedure and a finite cost.
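The same procedure as runnable Python, for illustration only: is_goal, cost, sample_successor, and default_policy are assumed interfaces to a generative SSP model (default_policy(s) returns a sampled action), and the step cap is an added safeguard against non-proper policies.

    def simulate(s, default_policy, is_goal, cost, sample_successor, max_steps=100_000):
        # Simulate a run of the default policy from state s; return accumulated cost.
        total = 0.0
        for _ in range(max_steps):
            if is_goal(s):              # s in S*: goal reached, simulation ends
                return total
            a = default_policy(s)       # a :~ pi(s)
            total += cost(a)            # cost := cost + c(a)
            s = sample_successor(s, a)  # s :~ succ(s, a)
        raise RuntimeError("no goal reached; default policy may not be proper")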
Default Policy: Example

[Figure: SSP with states s, t, u, v, w and goal state g; actions a0 (cost 10), a1 (cost 0), a2 (cost 50), a3 (cost 0), a4 (cost 100); stochastic action outcomes with probability 0.5 each.]

Consider deterministic default policy π.
State-value of s under π: 60
Accumulated cost during the simulated run: 0, then 10, then 60
Default Policy Realizations
Early MCTS implementations used the random default policy:

    π(a | s) = 1/|A(s)| if a ∈ A(s), and 0 otherwise

- only proper if the goal can be reached from each state
- poor guidance, and due to high variance even misguidance
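A one-line realization for illustration (applicable_actions is an assumed interface returning A(s)):

    import random

    def random_default_policy(s, applicable_actions):
        # Uniform random default policy: pi(a|s) = 1/|A(s)| for a in A(s)
        return random.choice(applicable_actions(s))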
Default Policy Realizations
There are only few alternatives to the random default policy, e.g.,
- heuristic-based policy
- domain-specific policy
Reason: no matter how good the policy, the result of a single simulation can be arbitrarily poor
Default Policy: Example (2)

[Figure: same SSP as above.]

Consider deterministic default policy π.
State-value of s under π: 60
Accumulated cost during the simulated run: 0, then 10, then 60, then 110
(here the run costs 110 although the state-value of s is only 60)
Default Policy Realizations
Possible solution to overcome this weakness:
- average over multiple random walks
  - converges to true action-values of the policy
  - computationally often very expensive
Cheaper and more successful alternative:
- skip the simulation step of MCTS
- use a heuristic directly to initialize state-value estimates instead of simulating the execution of a heuristic-guided policy
- much more successful (e.g., the neural networks of AlphaGo)
Asymptotic Optimality
Optimal Search
Heuristic search algorithms (like RTDP) achieve optimality by combining
- greedy search,
- an admissible heuristic, and
- Bellman backups.
In Monte-Carlo Tree Search,
- search behavior is defined by a tree policy,
- admissibility of the default policy / heuristic is irrelevant (and usually not given), and
- backups are Monte-Carlo backups.
MCTS requires a different idea for optimal behavior in the limit.
Asymptotic Optimality
Asymptotic Optimality
Let an MCTS algorithm build an MCTS tree G = ⟨d0, D, C, E⟩.
The MCTS algorithm is asymptotically optimal if

    lim_{k→∞} Q̂_k(c) = Q*(s(c), a(c)) for all c ∈ C_k,

where k is the number of trials.

- this is just one special form of asymptotic optimality
- some optimal MCTS algorithms are not asymptotically optimal by this definition (e.g., lim_{k→∞} Q̂_k(c) = ℓ · Q*(s(c), a(c)) for some ℓ ∈ ℝ+)
- all practically relevant optimal MCTS algorithms are asymptotically optimal by this definition
Asymptotically Optimal Tree Policy
An MCTS algorithm is asymptotically optimal if
1. its tree policy explores forever:
   the (infinite) sum of the probabilities that a decision node is visited must diverge
   ⇒ every search node is explicated eventually and visited infinitely often
2. its tree policy is greedy in the limit:
   the probability that an optimal action is selected converges to 1
   ⇒ in the limit, backups based on iterations where only an optimal policy is followed dominate suboptimal backups
3. its default policy initializes decision nodes with finite values
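For illustration (not from the slides): a tree policy that mixes greedy selection with a decaying amount of uniform exploration satisfies conditions 1 and 2. A minimal sketch, assuming decision nodes store their child chance nodes with Q̂-estimates and a visit counter:

    import math
    import random

    def decaying_eps_greedy(node):
        # Decaying epsilon-greedy tree policy (illustrative sketch).
        # With eps_k = 1/sqrt(k), the exploration probabilities sum to
        # infinity (explores forever), while eps_k -> 0 makes the policy
        # greedy in the limit.
        k = node.visits + 1
        eps = 1.0 / math.sqrt(k)
        if random.random() < eps:
            return random.choice(node.children)   # explore uniformly
        best_q = min(c.q_estimate for c in node.children)
        best = [c for c in node.children if c.q_estimate == best_q]
        return random.choice(best)                # greedy w.r.t. Q-hat (min cost)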
Example: Random Tree Policy
Example
Consider the random tree policy for decision node d, where

    π(a | d) = 1/|A(s(d))| if a ∈ A(s(d)), and 0 otherwise.

The random tree policy explores forever:
Let ⟨d0, c0, ..., dn, cn, d⟩ be a sequence of connected nodes in G_k and let p := min_{0 ≤ i ≤ n−1} T(s(d_i), a(c_i), s(d_{i+1})). Let P_k be the probability that d is visited in trial k. With P_k ≥ (p / |A|)^n, we have

    lim_{k→∞} Σ_{i=1}^{k} P_i ≥ lim_{k→∞} k · (p / |A|)^n = ∞
Example: Random Tree Policy
Example
Consider the random tree policy for decision node d (as on the previous slide).

The random tree policy is not greedy in the limit unless all actions are always optimal: the probability that an optimal action a is selected in decision node d is

    lim_{k→∞} (1 − Σ_{a′ ∉ π_{V*}(s(d))} 1/|A(s(d))|) < 1.

⇒ MCTS with random tree policy is not asymptotically optimal
Example: Greedy Tree Policy
Example
Consider the greedy tree policy for decision node d, where

    π(a | d) = 1/|A*_k(d)| if a ∈ A*_k(d), and 0 otherwise,

with A*_k(d) = {a(c) ∈ A(s(d)) | c ∈ arg min_{c′ ∈ children(d)} Q̂_k(c′)}.

- the greedy tree policy is greedy in the limit
- the greedy tree policy does not explore forever
⇒ MCTS with greedy tree policy is not asymptotically optimal
Tree Policy: Objective
To satisfy both requirements, MCTS tree policies have two contradictory objectives:
- explore parts of the search space that have not been investigated thoroughly
- exploit knowledge about good actions to focus the search on promising areas of the search space
Central challenge: balance exploration and exploitation
⇒ borrow ideas from the related multi-armed bandit problem
Multi-armed Bandit Problem
Multi-armed Bandit Problem
- most commonly used tree policies are inspired by research on the multi-armed bandit problem (MAB)
- MAB is a learning scenario (model not revealed to agent)
- agent repeatedly faces the same decision: pull one of several arms of a slot machine
- pulling an arm yields a stochastic reward
  ⇒ in MABs, we have rewards rather than costs
- can be modeled as an MDP
Multi-armed Bandit Problem
s0
a1 a2 a3
4 3
3 1
8
5.5 2 6 0
6
6 1
6
6 2
0
4 3 80
0
0 1
8 3 0 6 12 0 80
.2 .8 .2 .6 .2 .9 .1
ComputeQ?(a) for a∈ {a1,a2,a3}
Pull arm arg maxa∈{a1,a2,a3}Q?(a) =a3 forever Expected accumulated reward after k trials is 8·k
Introduction Default Policy Optimality MAB Summary
Multi-armed Bandit Problem: Planning Scenario
s0
a1 a2 a3
4
3
3 1
8
5.5 2
6
0 6
6 1
6
6 2
0
4 3
8
0 0
0 1
8 3 0 6 12 0 80
.2 .8 .2 .6 .2 .9 .1
ComputeQ?(a) for a∈ {a1,a2,a3}
Pull arm arg maxa∈{a1,a2,a3}Q?(a) =a3 forever Expected accumulated reward after k trials is 8·k
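A worked check of the action-values, assuming the reward distributions given in the figure:

    Q*(a1) = 0.2 · 8 + 0.8 · 3 = 4
    Q*(a2) = 0.2 · 0 + 0.6 · 6 + 0.2 · 12 = 6
    Q*(a3) = 0.9 · 0 + 0.1 · 80 = 8

so a3 maximizes the expected reward per trial, which gives the 8 · k above.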
Multi-armed Bandit Problem: Learning Scenario

[Figure: same three-armed bandit; the agent maintains estimates Q̂ and visit counts N for each arm, updated after every pull.]

- pull arms following a policy to explore or exploit
- update Q̂ and N based on observations
- example run: accumulated reward is 3 after 1 trial, 3 + 6 = 9 after 2 trials, 3 + 6 + 0 = 9 after 3 trials, 3 + 6 + 0 + 6 = 15 after 4 trials, 3 + 6 + 0 + 6 + 0 = 15 after 5 trials, and 3 + 6 + 0 + 6 + 0 + 8 = 23 after 6 trials
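A minimal learning-scenario sketch in Python (illustrative, not from the slides): an epsilon-greedy agent pulling the bandit above, maintaining Q̂ as a running average and N as visit counts. The reward tables follow the figure as read above.

    import random

    ARMS = {
        "a1": ([8, 3], [0.2, 0.8]),        # assumed reward distributions
        "a2": ([0, 6, 12], [0.2, 0.6, 0.2]),
        "a3": ([0, 80], [0.9, 0.1]),
    }

    def pull(arm):
        rewards, probs = ARMS[arm]
        return random.choices(rewards, weights=probs)[0]

    def run(trials=1000, eps=0.1):
        q_hat = {a: 0.0 for a in ARMS}     # Q-hat: running average reward per arm
        n = {a: 0 for a in ARMS}           # N: number of pulls per arm
        total = 0.0
        for _ in range(trials):
            if random.random() < eps:      # explore: uniform random arm
                arm = random.choice(list(ARMS))
            else:                          # exploit: greedy w.r.t. Q-hat
                arm = max(q_hat, key=q_hat.get)
            r = pull(arm)
            n[arm] += 1
            q_hat[arm] += (r - q_hat[arm]) / n[arm]   # incremental mean update
            total += r
        return q_hat, n, total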
Policy Quality
- since the model is unknown to the MAB agent, it cannot achieve an accumulated reward of k · V* with V* := max_a Q*(a) in k trials
- the quality of a MAB policy π is measured in terms of regret, i.e., the difference between k · V* and the expected reward of π in k trials
- regret cannot grow slower than logarithmically in the number of trials
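Continuing the sketch above: the empirical regret of the epsilon-greedy learner can be estimated by comparing its accumulated reward against k · V* = 8k (hypothetical continuation, reusing the run function from the previous block):

    k = 1000
    q_hat, n, total = run(trials=k, eps=0.1)
    print("empirical regret after", k, "trials:", k * 8 - total)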
MABs in MCTS Tree
- many tree policies treat each decision node as a MAB where each action yields a stochastic reward
- the dependence of the reward on future decisions is ignored:
  - the MCTS planner uses simulations to learn reasonable behavior
  - the SSP model is not considered
Summary
Summary
- the simulation phase simulates the execution of the default policy
- MCTS algorithms are optimal in the limit if
  - the tree policy is greedy in the limit,
  - the tree policy explores forever, and
  - the default policy initializes with finite values
- central challenge of most tree policies: balance exploration and exploitation
- each decision of an MCTS tree policy can be viewed as a multi-armed bandit problem