Planning and Optimization
G5. Monte-Carlo Tree Search Algorithms (Part I)
Malte Helmert and Gabriele Röger
Universität Basel
Content of this Course
Planning
- Classical: Foundations, Logic, Heuristics, Constraints
- Probabilistic: Explicit MDPs, Factored MDPs
Content of this Course: Factored MDPs
Factored MDPs
- Foundations
- Heuristic Search
- Monte-Carlo Methods: Suboptimal Algorithms, MCTS
Introduction
Monte-Carlo Tree Search: Reminder
Performs iterations with 4 phases:
- selection: use given tree policy to traverse explicated tree
- expansion: add node(s) to the tree
- simulation: use given default policy to simulate run
- backpropagation: update visited nodes with Monte-Carlo backups
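To make the interplay of the four phases concrete, here is a minimal control-flow sketch in Python (not from the slides; the node attributes and all five function parameters are assumed interfaces, and the alternation between decision and chance nodes is glossed over):

    def mcts_iteration(root, tree_policy, default_policy, expand, simulate, backup):
        # One MCTS iteration: selection, expansion, simulation, backpropagation.
        # selection: traverse the explicated tree with the tree policy
        path = [root]
        node = root
        while node.children:                 # descend until a leaf of the tree
            node = tree_policy(node)
            path.append(node)
        # expansion: add node(s) to the tree
        node = expand(node)
        path.append(node)
        # simulation: run the default policy from the new node's state
        cost = simulate(node.state, default_policy)
        # backpropagation: Monte-Carlo backups along the visited path
        for visited in reversed(path):
            backup(visited, cost)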
Motivation
Monte-Carlo Tree Search is a framework of algorithms:
- concrete MCTS algorithms are specified in terms of a tree policy and a default policy
- for most tasks, a well-suited MCTS configuration exists
- but for each task, many MCTS configurations perform poorly
- and every MCTS configuration that works well in one problem performs poorly in another problem
⇒ There is no “Swiss army knife” configuration for MCTS
Role of Tree Policy
- used to traverse explicated tree from root node to a leaf
- maps decision nodes to a probability distribution over actions (usually as a function over a decision node and its children)
- exploits information from search tree:
  - able to learn over time
  - requires MCTS tree to memorize collected information
Role of Default Policy
- used to simulate run from some state to a goal
- maps states to a probability distribution over actions
- independent from MCTS tree:
  - does not improve over time
  - can be computed quickly
  - constant memory requirements
- accumulated cost of simulated run used to initialize state-value estimate of decision node
Default Policy
MCTS Simulation
MCTS simulation with default policy π from state s:

    cost := 0
    while s ∉ S*:
        a :∼ π(s)
        cost := cost + c(a)
        s :∼ succ(s, a)
    return cost

Default policy must be proper to guarantee termination of the procedure and a finite cost.
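The same procedure as runnable Python, for illustration only: is_goal, cost, sample_successor, and default_policy are assumed interfaces to a generative SSP model (default_policy(s) returns a sampled action), and the step cap is an added safeguard against non-proper policies.

    def simulate(s, default_policy, is_goal, cost, sample_successor, max_steps=100_000):
        # Simulate a run of the default policy from state s; return accumulated cost.
        total = 0.0
        for _ in range(max_steps):
            if is_goal(s):              # s in S*: goal reached, simulation ends
                return total
            a = default_policy(s)       # a :~ pi(s)
            total += cost(a)            # cost := cost + c(a)
            s = sample_successor(s, a)  # s :~ succ(s, a)
        raise RuntimeError("no goal reached; default policy may not be proper")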
Default Policy: Example

[Figure: SSP with states s, t, u, v, w and goal state g; actions a0 (cost 10), a1 (cost 0), a2 (cost 50), a3 (cost 0), a4 (cost 100); stochastic action outcomes with probability 0.5 each.]

Consider deterministic default policy π.
State-value of s under π: 60
Accumulated cost during the simulated run: 0, then 10, then 60
Default Policy Realizations
Early MCTS implementations used the random default policy:

    π(a | s) = 1/|A(s)| if a ∈ A(s), and 0 otherwise

- only proper if the goal can be reached from each state
- poor guidance, and due to high variance even misguidance
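A one-line realization for illustration (applicable_actions is an assumed interface returning A(s)):

    import random

    def random_default_policy(s, applicable_actions):
        # Uniform random default policy: pi(a|s) = 1/|A(s)| for a in A(s)
        return random.choice(applicable_actions(s))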
Default Policy Realizations
There are only few alternatives to the random default policy, e.g.,
- heuristic-based policy
- domain-specific policy
Reason: no matter how good the policy, the result of a single simulation can be arbitrarily poor
Default Policy: Example (2)

[Figure: same SSP as above.]

Consider deterministic default policy π.
State-value of s under π: 60
Accumulated cost during the simulated run: 0, then 10, then 60, then 110
(here the run costs 110 although the state-value of s is only 60)
Default Policy Realizations
Possible solution to overcome this weakness:
- average over multiple random walks
  - converges to true action-values of the policy
  - computationally often very expensive
Cheaper and more successful alternative:
- skip the simulation step of MCTS
- use a heuristic directly to initialize state-value estimates instead of simulating the execution of a heuristic-guided policy
- much more successful (e.g., the neural networks of AlphaGo)
Asymptotic Optimality
Optimal Search
Heuristic search algorithms (like RTDP) achieve optimality by combining
- greedy search,
- an admissible heuristic, and
- Bellman backups.
In Monte-Carlo Tree Search,
- search behavior is defined by a tree policy,
- admissibility of the default policy / heuristic is irrelevant (and usually not given), and
- backups are Monte-Carlo backups.
MCTS requires a different idea for optimal behavior in the limit.
Asymptotic Optimality
Asymptotic Optimality
Let an MCTS algorithm build an MCTS tree G = ⟨d0, D, C, E⟩.
The MCTS algorithm is asymptotically optimal if

    lim_{k→∞} Q̂_k(c) = Q*(s(c), a(c)) for all c ∈ C_k,

where k is the number of trials.

- this is just one special form of asymptotic optimality
- some optimal MCTS algorithms are not asymptotically optimal by this definition (e.g., lim_{k→∞} Q̂_k(c) = ℓ · Q*(s(c), a(c)) for some ℓ ∈ ℝ+)
- all practically relevant optimal MCTS algorithms are asymptotically optimal by this definition
Asymptotically Optimal Tree Policy
An MCTS algorithm is asymptotically optimal if
1. its tree policy explores forever:
   the (infinite) sum of the probabilities that a decision node is visited must diverge
   ⇒ every search node is explicated eventually and visited infinitely often
2. its tree policy is greedy in the limit:
   the probability that an optimal action is selected converges to 1
   ⇒ in the limit, backups based on iterations where only an optimal policy is followed dominate suboptimal backups
3. its default policy initializes decision nodes with finite values
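For illustration (not from the slides): a tree policy that mixes greedy selection with a decaying amount of uniform exploration satisfies conditions 1 and 2. A minimal sketch, assuming decision nodes store their child chance nodes with Q̂-estimates and a visit counter:

    import math
    import random

    def decaying_eps_greedy(node):
        # Decaying epsilon-greedy tree policy (illustrative sketch).
        # With eps_k = 1/sqrt(k), the exploration probabilities sum to
        # infinity (explores forever), while eps_k -> 0 makes the policy
        # greedy in the limit.
        k = node.visits + 1
        eps = 1.0 / math.sqrt(k)
        if random.random() < eps:
            return random.choice(node.children)   # explore uniformly
        best_q = min(c.q_estimate for c in node.children)
        best = [c for c in node.children if c.q_estimate == best_q]
        return random.choice(best)                # greedy w.r.t. Q-hat (min cost)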
Example: Random Tree Policy
Example
Consider the random tree policy for decision node d, where

    π(a | d) = 1/|A(s(d))| if a ∈ A(s(d)), and 0 otherwise.

The random tree policy explores forever:
Let ⟨d0, c0, ..., dn, cn, d⟩ be a sequence of connected nodes in G_k and let p := min_{0 ≤ i ≤ n−1} T(s(d_i), a(c_i), s(d_{i+1})). Let P_k be the probability that d is visited in trial k. With P_k ≥ (p / |A|)^n, we have

    lim_{k→∞} Σ_{i=1}^{k} P_i ≥ lim_{k→∞} k · (p / |A|)^n = ∞
Example: Random Tree Policy
Example
Consider the random tree policy for decision node d (as on the previous slide).

The random tree policy is not greedy in the limit unless all actions are always optimal: the probability that an optimal action a is selected in decision node d is

    lim_{k→∞} (1 − Σ_{a′ ∉ π_{V*}(s(d))} 1/|A(s(d))|) < 1.

⇒ MCTS with random tree policy is not asymptotically optimal
Example: Greedy Tree Policy
Example
Consider the greedy tree policy for decision node d, where

    π(a | d) = 1/|A*_k(d)| if a ∈ A*_k(d), and 0 otherwise,

with A*_k(d) = {a(c) ∈ A(s(d)) | c ∈ arg min_{c′ ∈ children(d)} Q̂_k(c′)}.

- the greedy tree policy is greedy in the limit
- the greedy tree policy does not explore forever
⇒ MCTS with greedy tree policy is not asymptotically optimal
Tree Policy: Objective
To satisfy both requirements, MCTS tree policies have two contradictory objectives:
- explore parts of the search space that have not been investigated thoroughly
- exploit knowledge about good actions to focus the search on promising areas of the search space
Central challenge: balance exploration and exploitation
⇒ borrow ideas from the related multi-armed bandit problem
Multi-armed Bandit Problem
Multi-armed Bandit Problem
- most commonly used tree policies are inspired by research on the multi-armed bandit problem (MAB)
- MAB is a learning scenario (model not revealed to agent)
- agent repeatedly faces the same decision: pull one of several arms of a slot machine
- pulling an arm yields a stochastic reward
  ⇒ in MABs, we have rewards rather than costs
- can be modeled as an MDP
Multi-armed Bandit Problem
s0
a1 a2 a3
4 3
3 1
8
5.5 2 6 0
6
6 1
6
6 2
0
4 3 80
0
0 1
8 3 0 6 12 0 80
.2 .8 .2 .6 .2 .9 .1
ComputeQ?(a) for a∈ {a1,a2,a3}
Pull arm arg maxa∈{a1,a2,a3}Q?(a) =a3 forever Expected accumulated reward after k trials is 8·k
Introduction Default Policy Optimality MAB Summary
Multi-armed Bandit Problem: Planning Scenario
s0
a1 a2 a3
4
3
3 1
8
5.5 2
6
0 6
6 1
6
6 2
0
4 3
8
0 0
0 1
8 3 0 6 12 0 80
.2 .8 .2 .6 .2 .9 .1
ComputeQ?(a) for a∈ {a1,a2,a3}
Pull arm arg maxa∈{a1,a2,a3}Q?(a) =a3 forever Expected accumulated reward after k trials is 8·k
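A worked check of the action-values, assuming the reward distributions given in the figure:

    Q*(a1) = 0.2 · 8 + 0.8 · 3 = 4
    Q*(a2) = 0.2 · 0 + 0.6 · 6 + 0.2 · 12 = 6
    Q*(a3) = 0.9 · 0 + 0.1 · 80 = 8

so a3 maximizes the expected reward per trial, which gives the 8 · k above.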
Multi-armed Bandit Problem: Learning Scenario

[Figure: same three-armed bandit; the agent maintains estimates Q̂ and visit counts N for each arm, updated after every pull.]

- pull arms following a policy to explore or exploit
- update Q̂ and N based on observations
- example run: accumulated reward is 3 after 1 trial, 3 + 6 = 9 after 2 trials, 3 + 6 + 0 = 9 after 3 trials, 3 + 6 + 0 + 6 = 15 after 4 trials, 3 + 6 + 0 + 6 + 0 = 15 after 5 trials, and 3 + 6 + 0 + 6 + 0 + 8 = 23 after 6 trials
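A minimal learning-scenario sketch in Python (illustrative, not from the slides): an epsilon-greedy agent pulling the bandit above, maintaining Q̂ as a running average and N as visit counts. The reward tables follow the figure as read above.

    import random

    ARMS = {
        "a1": ([8, 3], [0.2, 0.8]),        # assumed reward distributions
        "a2": ([0, 6, 12], [0.2, 0.6, 0.2]),
        "a3": ([0, 80], [0.9, 0.1]),
    }

    def pull(arm):
        rewards, probs = ARMS[arm]
        return random.choices(rewards, weights=probs)[0]

    def run(trials=1000, eps=0.1):
        q_hat = {a: 0.0 for a in ARMS}     # Q-hat: running average reward per arm
        n = {a: 0 for a in ARMS}           # N: number of pulls per arm
        total = 0.0
        for _ in range(trials):
            if random.random() < eps:      # explore: uniform random arm
                arm = random.choice(list(ARMS))
            else:                          # exploit: greedy w.r.t. Q-hat
                arm = max(q_hat, key=q_hat.get)
            r = pull(arm)
            n[arm] += 1
            q_hat[arm] += (r - q_hat[arm]) / n[arm]   # incremental mean update
            total += r
        return q_hat, n, total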
Policy Quality
- since the model is unknown to the MAB agent, it cannot achieve an accumulated reward of k · V* with V* := max_a Q*(a) in k trials
- the quality of a MAB policy π is measured in terms of regret, i.e., the difference between k · V* and the expected reward of π in k trials
- regret cannot grow slower than logarithmically in the number of trials
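Continuing the sketch above: the empirical regret of the epsilon-greedy learner can be estimated by comparing its accumulated reward against k · V* = 8k (hypothetical continuation, reusing the run function from the previous block):

    k = 1000
    q_hat, n, total = run(trials=k, eps=0.1)
    print("empirical regret after", k, "trials:", k * 8 - total)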
MABs in MCTS Tree
- many tree policies treat each decision node as a MAB where each action yields a stochastic reward
- the dependence of the reward on future decisions is ignored:
  - the MCTS planner uses simulations to learn reasonable behavior
  - the SSP model is not considered
Summary
Summary
- the simulation phase simulates the execution of the default policy
- MCTS algorithms are optimal in the limit if
  - the tree policy is greedy in the limit,
  - the tree policy explores forever, and
  - the default policy initializes with finite values
- central challenge of most tree policies: balance exploration and exploitation
- each decision of an MCTS tree policy can be viewed as a multi-armed bandit problem