Planning and Optimization G8. Trial-based Heuristic Tree Search Gabriele Röger and Thomas Keller


(1)

Planning and Optimization

G8. Trial-based Heuristic Tree Search

Gabriele Röger and Thomas Keller

Universität Basel

December 17, 2018

(2)

Motivation THTS Framework THTS Algorithms Summary

Content of this Course

Planning
    Classical: Tasks, Progression/Regression, Complexity, Heuristics
    Probabilistic: MDPs, Blind Methods, Heuristic Search, Monte-Carlo Methods

(3)

Motivation

(4)

AO* & LAO*: Recap

Iteratively build explicated graph
Extend explicated graph by expanding fringe node in partial solution graph
State-value estimates are initialized with admissible heuristic
Propagate information with Bellman backups in partial solution graph
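As a reminder of what a Bellman backup computes, here is a minimal sketch on a toy SSP. The state names, costs, probabilities and the `transitions` layout are assumptions made for illustration only; they do not come from the slides. The estimate of a state is replaced by the minimum, over its actions, of the action cost plus the expected estimate of the successors.

```python
# Hypothetical toy SSP (illustrative numbers only).
# transitions[state][action] = (cost, {successor: probability})
transitions = {
    "s0": {
        "safe": (2.0, {"goal": 1.0}),
        "risky": (1.0, {"goal": 0.5, "s0": 0.5}),
    },
}
goal_states = {"goal"}


def q_value(state, action, values):
    """Expected cost of applying `action` in `state` under the estimates `values`."""
    cost, outcomes = transitions[state][action]
    return cost + sum(p * values[succ] for succ, p in outcomes.items())


def bellman_backup(state, values):
    """Return the new state-value estimate: the minimum Q-value over all actions."""
    if state in goal_states:
        return 0.0
    return min(q_value(state, a, values) for a in transitions[state])


# Initialization with the trivially admissible heuristic h = 0.
values = {"s0": 0.0, "goal": 0.0}
for _ in range(50):
    values["s0"] = bellman_backup("s0", values)
```

Starting from an admissible estimate, the repeated backups in this toy example approach the expected cost 2.0 from below.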

(5)

(Labeled) Real-Time Dynamic Programming: Recap

Iteratively performs trials
Simulates greedy policy in each trial
Encountered states are updated with Bellman backup
Admissible heuristic used if no state-value estimate available
Labeling procedure marks states that have converged

(6)

Monte-Carlo Tree Search: Recap

Iteratively explicates search tree in trials
Uses tree policy to traverse tree
First encountered state not yet in tree added to search tree
State-value estimates are initialized with default policy
Propagates information with Monte-Carlo backups in reverse order through visited states
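The Monte-Carlo backup mentioned above can be sketched as a running average that is updated in reverse order along the visited nodes. The `Node` class, the function names and the numbers are hypothetical and only serve the illustration.

```python
class Node:
    """Hypothetical search node with a value estimate and a visit counter."""

    def __init__(self):
        self.value = 0.0
        self.visits = 0


def monte_carlo_backup(node, cost):
    """Update the estimate to the running average of all costs seen so far."""
    node.visits += 1
    node.value += (cost - node.value) / node.visits


def backup_trial(path, step_costs, leaf_estimate):
    """Propagate one trial in reverse order; step_costs[i] is paid when leaving path[i]."""
    cost_to_go = leaf_estimate
    for node, step_cost in zip(reversed(path), reversed(step_costs)):
        cost_to_go += step_cost
        monte_carlo_backup(node, cost_to_go)


root, child = Node(), Node()
backup_trial([root, child], [1.0, 2.0], leaf_estimate=3.0)  # child sees 5.0, root 6.0
backup_trial([root, child], [1.0, 2.0], leaf_estimate=1.0)  # child sees 3.0, root 4.0
```

After the two trials, each estimate is the average of the two costs-to-go observed at that node.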

(7)

Trial-based Heuristic Tree Search

All are asymptotically optimal (or such a version exists)
In practice, all have complementary strengths
There are significant differences between these algorithms, but they also have a lot in common
A common framework allows describing all three: Trial-based Heuristic Tree Search (THTS)

(8)

Trial-based Heuristic Tree Search

Framework

(9)

Trial-based Heuristic Tree Search

Perform trials to explicate search tree
    decision (OR) nodes for states
    chance (AND) nodes for actions
Annotate nodes with
    state-/action-value estimate
    visit counter
    solved label
Initialize search nodes with heuristic
6 variable ingredients:
    action selection
    outcome selection
    initialization
    trial length
    backup function
    recommendation function

(14)

Trial-based Heuristic Tree Search

THTS for SSP $\mathcal{T} = \langle S, L, c, T, s_0, S_\star \rangle$
    d0 = create root node associated with $s_0$
    while time allows:
        visit_decision_node(d0, $\mathcal{T}$)
    return recommend(d0)

(15)

THTS: Visit a Decision Node

visit_decision_node for decision node d, SSP $\mathcal{T} = \langle S, L, c, T, s_0, S_\star \rangle$
    if $s(d) \in S_\star$ then return 0
    a := select_action(d)
    if a not explicated:
        cost = expand_and_initialize(d, a)
    if not trial_length_reached(d):
        let c be the node in children(d) with a(c) = a
        cost = visit_chance_node(c, $\mathcal{T}$)
    backup(d, cost)
    return cost

(16)

THTS: Visit a Chance Node

visit_chance_node for chance node c, SSP $\mathcal{T} = \langle S, L, c, T, s_0, S_\star \rangle$
    s' = select_outcome(s(c), a(c))
    if s' not explicated:
        cost = expand_and_initialize(c, s')
    if not trial_length_reached(c):
        let d be the node in children(c) with s(d) = s'
        cost = visit_decision_node(d, $\mathcal{T}$)
    cost = cost + c(s(c), a(c))
    backup(c, cost)
    return cost
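The three procedures above can be sketched in Python as a small skeleton with pluggable ingredients. This is an illustration under stated assumptions, not the Prost implementation: the class and function names, the dictionary layout of the SSP, and the chosen ingredients (untried actions first, then greedy; sampled outcomes; a zero heuristic; trials that end once a node was explicated; Monte-Carlo backups; expected best arm) are all hypothetical.

```python
import random


class Node:
    """Search node: decision node if action is None, otherwise chance node."""

    def __init__(self, state, action=None):
        self.state, self.action = state, action
        self.children = {}  # decision: action -> node; chance: successor -> node
        self.value, self.visits = 0.0, 0


class THTS:
    def __init__(self, ssp, ingredients, seed=0):
        self.ssp, self.ing = ssp, ingredients
        self.rng = random.Random(seed)
        self.expanded = False  # True once the current trial explicated a node

    def search(self, trials):
        root = Node(self.ssp["init"])
        for _ in range(trials):  # stands in for "while time allows"
            self.expanded = False
            self.visit_decision_node(root)
        return self.ing["recommend"](root)

    def expand_and_initialize(self, parent, key, child):
        parent.children[key] = child
        child.value = self.ing["initialize"](child)
        self.expanded = True
        return child.value

    def visit_decision_node(self, d):
        if d.state in self.ssp["goals"]:
            return 0.0
        a = self.ing["select_action"](self, d)
        if a not in d.children:
            cost = self.expand_and_initialize(d, a, Node(d.state, a))
        if not self.ing["trial_length_reached"](self, d):
            cost = self.visit_chance_node(d.children[a])
        self.ing["backup"](d, cost)
        return cost

    def visit_chance_node(self, c):
        step_cost, outcomes = self.ssp["actions"][c.state][c.action]
        s = self.ing["select_outcome"](self, outcomes)
        if s not in c.children:
            cost = self.expand_and_initialize(c, s, Node(s))
        if not self.ing["trial_length_reached"](self, c):
            cost = self.visit_decision_node(c.children[s])
        cost += step_cost
        self.ing["backup"](c, cost)
        return cost


def untried_then_greedy(planner, d):
    """Try each action once, then pick the one with the lowest estimate."""
    untried = [a for a in planner.ssp["actions"][d.state] if a not in d.children]
    return untried[0] if untried else min(d.children, key=lambda a: d.children[a].value)


def sample_outcome(planner, outcomes):
    return planner.rng.choices(list(outcomes), weights=list(outcomes.values()))[0]


def monte_carlo_backup(node, cost):
    # With a visit count of 0, the first backup replaces the initial estimate.
    node.visits += 1
    node.value += (cost - node.value) / node.visits


ingredients = {
    "select_action": untried_then_greedy,
    "select_outcome": sample_outcome,
    "initialize": lambda node: 0.0,  # trivial heuristic, assumed for the sketch
    "trial_length_reached": lambda planner, node: planner.expanded,
    "backup": monte_carlo_backup,
    "recommend": lambda root: min(root.children, key=lambda a: root.children[a].value),
}

# Hypothetical toy SSP: two actions in s0, both reach the goal, with different costs.
toy_ssp = {
    "init": "s0",
    "goals": {"goal"},
    "actions": {"s0": {"short": (1.0, {"goal": 1.0}), "long": (5.0, {"goal": 1.0})}},
}
best = THTS(toy_ssp, ingredients).search(trials=20)
```

Swapping one entry of `ingredients` changes the algorithm without touching the two visit procedures, which is the point of the framework. Note that the sketch relies on the chosen trial length: other trial-length ingredients must make sure `cost` is always assigned before the backup.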

(17)

THTS Algorithms

(18)

MCTS in the THTS Framework

Trial length: terminate trial when node is explicated
Action selection: tree policy
Outcome selection: sample
Initialization: add single node to the tree and initialize with heuristic that simulates the default policy
Backup function: Monte-Carlo backups
Recommendation function: expected best arm

(19)

AO* (Tree Search Version) in the THTS Framework

Trial length: terminate trial when node is expanded
Action selection: greedy
Outcome selection: depends on AO* version
Initialization: expand decision node and all its chance node successors, then initialize all $\hat{V}^k$ with admissible heuristic
Backup function: Bellman backups & solved labels
Recommendation function: expected best arm

(20)

LRTDP (Tree Search Version) in the THTS Framework

Trial length: finish trials only in goal states
Action selection: greedy
Outcome selection: sample unsolved outcome
Initialization: expand decision node and all its chance node successors, then initialize all $\hat{V}^k$ with admissible heuristic
Backup function: Bellman backups & solved labels
Recommendation function: expected best arm

(21)

Further Ingredients from Literature

Recommendation function:
    Most played arm [Bubeck et al. 2009, Chaslot et al. 2008]
    Empirical distribution of plays [Bubeck et al. 2009]
    Secure arm [Chaslot et al. 2008]
Initialization:
    Expand decision node and initialize chance nodes with heuristic for state-action pairs [Keller & Eyerich, 2012]
    Any classical heuristic on any determinization
    Occupation measure heuristic [Trevizan et al., 2017]
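To illustrate how two of the recommendation functions can disagree, here is a small sketch comparing expected best arm with most played arm at the root. The statistics, action names and the dictionary layout are made up for the example.

```python
def expected_best_arm(stats):
    """Recommend the action with the lowest estimated expected cost."""
    return min(stats, key=lambda a: stats[a]["value"])


def most_played_arm(stats):
    """Recommend the action that was selected most often during search."""
    return max(stats, key=lambda a: stats[a]["visits"])


# Hypothetical root statistics: a1 looks slightly cheaper but was barely explored.
root_stats = {
    "a1": {"value": 3.9, "visits": 5},
    "a2": {"value": 4.0, "visits": 400},
}
eba_choice = expected_best_arm(root_stats)
mpa_choice = most_played_arm(root_stats)
```

Here expected best arm picks `a1` on the basis of five samples, while most played arm picks the heavily sampled `a2`.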

(22)

Further Ingredients from Literature

Backup functions:
    Temporal Differences [Sutton & Barto, 1987]
    Q-Learning [Watkins, 1989]
    Selective Backups [Feldman & Domshlak, 2012; Keller, 2015]
    Max Monte-Carlo [Keller & Helmert, 2013]
    Partial Bellman [Keller & Helmert, 2013]

(23)

Further Ingredients from Literature

Action selections:
    Uniform sampling (UNI)
    ε-greedy (ε-G)
    ε-G with decaying ε:
        εLIN-G [Singh et al., 2000; Auer et al., 2002]
        εRT-G [Keller, 2015]
        εLOG-G [Keller, 2015]
    Boltzmann exploration (BE)
    BE with logarithmically decaying τ (BE-DT) [Singh et al., 2000]
    UCB1 [Auer et al., 2002]
    Root-valued UCB (RT-UCB) [Keller, 2015]
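Three of the action selections listed above can be sketched as follows for a cost-minimization setting. The function names, the cost-based sign conventions (exploration bonus subtracted, lower estimates preferred) and the exploration constant are assumptions of this sketch; the decaying variants and RT-UCB are not covered.

```python
import math
import random


def epsilon_greedy(values, epsilon, rng):
    """With probability epsilon pick uniformly at random, otherwise greedily."""
    if rng.random() < epsilon:
        return rng.choice(sorted(values))
    return min(values, key=values.get)


def boltzmann_probabilities(values, tau):
    """Selection probabilities proportional to exp(-Q(a) / tau) for cost estimates Q."""
    lowest = min(values.values())  # shift for numerical stability
    weights = {a: math.exp(-(q - lowest) / tau) for a, q in values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def ucb1(values, visits, exploration=math.sqrt(2)):
    """UCB1 mirrored for costs: lowest estimate minus exploration bonus wins."""
    untried = [a for a in values if visits[a] == 0]
    if untried:
        return untried[0]
    total = sum(visits.values())

    def score(a):
        return values[a] - exploration * math.sqrt(math.log(total) / visits[a])

    return min(values, key=score)


# Hypothetical action-value estimates (expected costs).
values = {"a": 1.0, "b": 2.0}
greedy_choice = epsilon_greedy(values, epsilon=0.0, rng=random.Random(0))
probabilities = boltzmann_probabilities(values, tau=1.0)
explore_choice = ucb1(values, visits={"a": 100, "b": 1})
exploit_choice = ucb1(values, visits={"a": 50, "b": 50})
```

With equal visit counts UCB1 behaves greedily, whereas a rarely visited action receives a large bonus and is selected despite its higher estimate.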

(24)

Experimental Comparison

THTS allows mixing and matching ingredients
Not all combinations are asymptotically optimal
Analysis based on properties of ingredients is possible
In [Keller, 2015], comparison of:
    1 trial length, 1 outcome selection, 1 initialization
    2 different recommendation functions
    9 different backup functions
    9 different action selections
⇒ 162 different THTS algorithms
115 shown to be asymptotically optimal
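The count of 162 is simply the size of the cross product 2 × 9 × 9. A short sketch, using the abbreviations that appear elsewhere in these slides (the label "EBA" for expected best arm is an assumption of this example):

```python
from itertools import product

recommendations = ["EBA", "MPA"]
backups = ["MC", "LSMC", "ESMC", "TD", "LSTD", "ESTD", "QL", "MaxMC", "PB"]
action_selections = ["UNI", "ε-G", "εLIN-G", "εRT-G", "εLOG-G",
                     "BE", "BE-DT", "UCB1", "RT-UCB"]

# One configuration per combination of the three varied ingredients.
configurations = list(product(recommendations, backups, action_selections))
```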

(26)

Asymptotic Optimality

[Table: asymptotic optimality of each combination of action selection (UNI, ε-G, εLOG-G, εRT-G, εLIN-G, BE, BE-DT, RT-UCB, UCB1) and backup function (LSMC, MC, ESMC, LSTD, TD, ESTD, QL, MaxMC, PB); the table entries are not recoverable]

(27)

Experimental Evaluation

Most played arm recommendation function often better than same configuration with expected best arm
Boltzmann exploration and root-valued UCB1 perform best in most domains
Monte-Carlo and Partial Bellman backups perform best in most domains
Almost all action selections and backup functions perform best in at least one domain

Domain        MCUCB1MPA   Prost2011
Academic      27          26
Crossing      65          62
Elevators     78          49
Game          86          84
Navigation    45          42
Recon         92          90
Skill         77          69
Sysadmin      89          88
Tamarisk      86          83
Traffic       71          60
Triangle      46          49
Wildfire      84          85
Total         70          66

(28)

Experimental Evaluation

[Figure: number of domains in which each ingredient performs best.
Action selections: UCB1 1, RT-UCB 4, BE 4, BE-DT 2, ε-G 1, εRT-G 1, εLOG-G 1, εLIN-G 1.
Backup functions: MC 6, PB 4, TD 2, MaxMC 2, SMC 1, QL 1.]

(31)

Implementation: Prost

The Prost planner implements the THTS framework
Mixing and matching of ingredients is very simple
To add new ingredients, just inherit from the corresponding class
https://bitbucket.org/tkeller/prost/

(32)

Summary

(33)

Summary

MCTS, AO* and RTDP have complementary strengths
But also a similar structure
THTS allows combining ideas from MCTS, Heuristic Search and DP
Mixing and matching ingredients leads to novel and sometimes better algorithms
