
Planning and Optimization

G3. Asymptotically Suboptimal Monte-Carlo Methods

Malte Helmert and Gabriele Röger

Universität Basel

December 9, 2020


Planning and Optimization

December 9, 2020 — G3. Asymptotically Suboptimal Monte-Carlo Methods

G3.1 Motivation

G3.2 Monte-Carlo Methods

G3.3 Hindsight Optimization

G3.4 Policy Simulation

G3.5 Sparse Sampling

G3.6 Summary


Content of this Course

[Course overview: Planning splits into Classical planning (Foundations, Logic, Heuristics, Constraints) and Probabilistic planning (Explicit MDPs, Factored MDPs).]

Content of this Course: Factored MDPs

[Overview of the Factored MDPs part: Foundations, Heuristic Search and Monte-Carlo Methods; the Monte-Carlo Methods block consists of Suboptimal Algorithms (this chapter) and MCTS.]


G3.1 Motivation


Monte-Carlo Methods: Brief History

▶ 1930s: first researchers experiment with Monte-Carlo methods
▶ 1998: Ginsberg's GIB player competes with Bridge experts
▶ 2002: Kearns et al. propose Sparse Sampling
▶ 2002: Auer et al. present UCB1 action selection for multi-armed bandits
▶ 2006: Coulom coins the term Monte-Carlo Tree Search (MCTS)
▶ 2006: Kocsis and Szepesvári combine UCB1 and MCTS into the famous MCTS variant, UCT
▶ 2007–2016: constant progress of MCTS in Go culminates in AlphaGo's historic defeat of 9-dan player Lee Sedol


G3.2 Monte-Carlo Methods


Monte-Carlo Methods: Idea

▶ "Monte-Carlo methods" summarizes a broad family of algorithms
▶ Decisions are based on random samples (Monte-Carlo sampling)
▶ Results of the samples are aggregated by computing the average (Monte-Carlo backups)
▶ Apart from that, the algorithms can differ significantly

Careful: there are many different definitions of MC methods in the literature.


Types of Random Samples

Random samples have in common that something is drawn from a given probability distribution. Some examples:

▶ a determinization is sampled (Hindsight Optimization)
▶ runs under a fixed policy are simulated (Policy Simulation)
▶ considered outcomes are sampled (Sparse Sampling)
▶ runs under an evolving policy are simulated (Monte-Carlo Tree Search)


Reminder: Bellman Backups

Algorithms like Value Iteration or (L)RTDP use the Bellman equation as an update procedure.

The i-th state-value estimate of state s, \hat{V}_i(s), is computed with Bellman backups as

\hat{V}_i(s) := \min_{a \in A(s)} \Big( c(a) + \sum_{s' \in S} T(s, a, s') \cdot \hat{V}_{i-1}(s') \Big).

(Some algorithms use a heuristic if a state-value estimate on the right-hand side of the Bellman backup is undefined.)
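The following is a minimal Python sketch of a single Bellman backup, shown here as a contrast to the Monte-Carlo backups below; the SSP encoding (dictionaries A, cost and T) and the heuristic fallback are illustrative assumptions, not part of the slides.

def bellman_backup(s, A, cost, T, V, heuristic=lambda s: 0.0):
    """One Bellman backup for state s (hypothetical SSP encoding).

    A[s]      -- list of actions applicable in s
    cost[a]   -- cost c(a) of action a
    T[(s, a)] -- dict mapping successor states s' to T(s, a, s')
    V         -- dict with the current state-value estimates
    heuristic -- fallback for successors without an estimate yet
    """
    def q_value(a):
        # c(a) + sum over successors s' of T(s, a, s') * V^_{i-1}(s')
        return cost[a] + sum(prob * V.get(succ, heuristic(succ))
                             for succ, prob in T[(s, a)].items())

    return min(q_value(a) for a in A[s])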


Monte-Carlo Backups

Monte-Carlo methods instead estimate state-values by averaging over all samples.

Let N_i(s) be the number of samples for state s in the first i algorithm iterations, and let cost_k(s) be the cost for s in the k-th sample (cost_k(s) = 0 if the k-th sample has no estimate for s).

The i-th state-value estimate of state s, \hat{V}_i(s), is computed with Monte-Carlo backups as

\hat{V}_i(s) := \frac{1}{N_i(s)} \cdot \sum_{k=1}^{i} \mathit{cost}_k(s).
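As a one-function sketch (illustrative encoding, not from the slides): if the costs cost_k(s) of exactly those samples that visit s are collected in a list, the Monte-Carlo backup is simply their mean.

def monte_carlo_backup(sample_costs):
    """Average over all samples for one state s; len(sample_costs) = N_i(s)."""
    return sum(sample_costs) / len(sample_costs) if sample_costs else 0.0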


Monte-Carlo Backups: Properties

▶ no need to store cost_k(s) for k = 1, ..., i: it is possible to compute Monte-Carlo backups iteratively as

  \hat{V}_i(s) := \hat{V}_{i-1}(s) + \frac{1}{N_i(s)} \big( \mathit{cost}_i(s) - \hat{V}_{i-1}(s) \big)

▶ no need to know the SSP model for backups
▶ if s is a random variable, \hat{V}_i(s) converges to E[s] due to the strong law of large numbers
▶ if s is not a random variable, this is not always the case
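A sketch of the iterative form as a small estimator object that stores only the sample count and the running value; the class and method names are illustrative assumptions.

class MonteCarloEstimate:
    """Running Monte-Carlo state-value estimate for a single state s."""

    def __init__(self):
        self.n = 0        # N_i(s): number of samples that visited s so far
        self.value = 0.0  # current estimate V^_i(s)

    def add_sample(self, sample_cost):
        # V^_i(s) := V^_{i-1}(s) + (cost_i(s) - V^_{i-1}(s)) / N_i(s)
        self.n += 1
        self.value += (sample_cost - self.value) / self.n
        return self.value

Feeding in the sample costs 2, 4 and 6 yields the estimates 2.0, 3.0 and 4.0, i.e. exactly the averages of the samples seen so far, without storing the individual cost_k(s).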


G3.3 Hindsight Optimization


Hindsight Optimization: Idea

Repeat as long as resources (deliberation time, memory) allow:

▶ Sample outcomes of all actions
  ⇒ deterministic (classical) planning problem
▶ For each applicable action a ∈ A(s_0), compute a plan in the sample that starts with a

Execute the action with the lowest average plan cost.
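A sketch of this decision loop, assuming hypothetical helpers sample_determinization (draws one outcome for every probabilistic action, yielding a classical problem) and plan_cost_starting_with (plan cost in that determinization when the first action is fixed); neither helper is defined in the slides, and the fixed sample budget stands in for "as long as resources allow".

import statistics

def hop_decision(s0, applicable_actions, sample_determinization,
                 plan_cost_starting_with, num_samples=100):
    """Hindsight optimization sketch with hypothetical helpers."""
    plan_costs = {a: [] for a in applicable_actions}
    for _ in range(num_samples):          # stands in for the resource limit
        det = sample_determinization()    # deterministic (classical) problem
        for a in applicable_actions:
            plan_costs[a].append(plan_cost_starting_with(det, s0, a))
    # execute the action with the lowest average plan cost
    return min(plan_costs, key=lambda a: statistics.mean(plan_costs[a]))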


Hindsight Optimization: Example

South to play, three tricks to win, trump suit ♣

[Figure: Bridge card-play example; over the sampled deals, the three candidate plays accumulate success rates 0% (0/1), 100% (1/1), 0% (0/1) after one sample, 50% (1/2), 100% (2/2), 0% (0/2) after two samples, and 67% (2/3), 100% (3/3), 33% (1/3) after three samples.]


Hindsight Optimization: Evaluation

▶ HOP well-suited for some problems
▶ must be possible to solve the sampled SSP efficiently:
  ▶ domain-dependent knowledge (e.g., games like Bridge or Skat)
  ▶ classical planner (FF-Hindsight, Yoon et al., 2008)
▶ What about optimality in the limit?
  ⇒ often not optimal due to the assumption of clairvoyance

Hindsight Optimization: Non-optimality in the Limit

[Figure: an SSP with states s_0, ..., s_6 in which actions a_1 and a_2 are applicable in s_0; the probabilistic outcomes have probabilities 2/5 and 3/5, and the arc costs include 0, 10, 20 and 6. Two determinizations are shown, sampled with probability 60% and 40%, respectively.]

With k → ∞:

\hat{Q}_k(s_0, a_1) \to 4 \quad \text{and} \quad \hat{Q}_k(s_0, a_2) \to 6,

so in the limit HOP prefers a_1. Each sampled determinization is solved with full knowledge of the sampled outcomes, so this estimate assumes clairvoyance; in the actual SSP, a_1 is the more expensive choice, which makes HOP's decision suboptimal even in the limit.


Hindsight Optimization: Evaluation

▶ HOP well-suited for some problems
▶ must be possible to solve the sampled MDP efficiently:
  ▶ domain-dependent knowledge (e.g., games like Bridge or Skat)
  ▶ classical planner (FF-Hindsight, Yoon et al., 2008)
▶ What about optimality in the limit?
  ⇒ in general not optimal due to the assumption of clairvoyance


G3.4 Policy Simulation


Policy Simulation: Idea

Repeat as long as resources (deliberation time, memory) allow:

▶ For each applicable action a ∈ A(s_0), start a run from s_0 with a and then follow a given policy π
▶ Execute the action with the lowest average simulation cost

Avoids clairvoyance by evaluating the policy through simulating its execution.
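A sketch of policy simulation under an assumed simulator interface (sample_successor, cost, is_goal) and a given base policy pi; all of these names are hypothetical stand-ins, and runs are truncated at max_steps to stay finite.

import statistics

def policy_simulation_decision(s0, applicable_actions, pi, sample_successor,
                               cost, is_goal, runs_per_action=100,
                               max_steps=1000):
    """Pick the action with the lowest average simulation cost (sketch)."""

    def simulate(first_action):
        # one run: take first_action in s0, then follow the base policy pi
        total, s, a = 0.0, s0, first_action
        for _ in range(max_steps):
            total += cost(s, a)
            s = sample_successor(s, a)    # draw s' according to T(s, a, .)
            if is_goal(s):
                break
            a = pi(s)
        return total

    avg_cost = {
        a: statistics.mean(simulate(a) for _ in range(runs_per_action))
        for a in applicable_actions
    }
    # execute the action with the lowest average simulation cost
    return min(avg_cost, key=avg_cost.get)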


Policy Simulation: Evaluation

▶ Base policy is static
▶ No mechanism to overcome the weaknesses of the base policy (if there were no weaknesses, we would not need policy simulation)
▶ Suboptimal decisions in the simulation affect policy quality
▶ What about optimality in the limit?
  ⇒ in general not optimal due to the inability of the policy to improve


G3.5 Sparse Sampling


Sparse Sampling: Idea

Sparse Sampling (Kearns et al., 2002) addresses the problem that the number of reachable states under a policy can be too large.

▶ creates a search tree up to a given lookahead horizon
▶ a constant number of outcomes is sampled for each state-action pair
▶ outcomes that were not sampled are ignored
▶ near-optimal: the expected cost of the resulting policy is close to the expected cost of the optimal policy
▶ runtime independent of the number of states
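A sketch of the sparse-sampling value estimate, assuming accessor functions applicable_actions, cost and sample_successor as well as constants C (sampled outcomes per state-action pair) and H (lookahead horizon); this interface is an illustrative assumption, not the original formulation.

def sparse_sampling_value(s, depth, applicable_actions, cost,
                          sample_successor, is_goal, C=5, H=3):
    """Estimate the expected cost-to-goal of s in the spirit of
    Kearns et al. (2002); hypothetical interface."""
    if is_goal(s) or depth == H:
        return 0.0   # at the horizon, a heuristic estimate could be used

    def q_estimate(a):
        # draw C outcomes for (s, a); outcomes never sampled are ignored
        successors = [sample_successor(s, a) for _ in range(C)]
        future = sum(sparse_sampling_value(t, depth + 1, applicable_actions,
                                           cost, sample_successor, is_goal,
                                           C, H)
                     for t in successors) / C
        return cost(s, a) + future

    return min(q_estimate(a) for a in applicable_actions(s))

The recursion touches on the order of (number of actions · C)^H nodes, so its runtime does not depend on the total number of states, but it is exponential in the horizon H.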

Sparse Sampling: Search Tree

[Figure: the full search tree without sparse sampling compared with the tree under sparse sampling, which keeps only a constant number of sampled outcomes per state-action pair.]


Sparse Sampling: Problems

▶ independent of the number of states, but still exponential in the lookahead horizon
▶ the constants giving the number of sampled outcomes and the lookahead horizon must be large to obtain good near-optimality bounds
▶ search time difficult to predict
▶ same amount of sampling everywhere in the tree
  ⇒ resources are wasted in non-promising parts of the tree


G3.6 Summary


Summary

▶ Monte-Carlo methods have a long history but no successful applications until the 1990s
▶ Monte-Carlo methods use sampling and backups that average over the sample results
▶ Hindsight optimization averages over plan costs in sampled determinizations
▶ Policy simulation simulates the execution of a given policy
▶ Sparse sampling considers only a fixed number of outcomes
▶ None of the three methods is optimal in the limit
