Planning and Optimization
G2. Heuristic Search: AO∗ & LAO∗ Part II
Gabriele Röger and Thomas Keller
Universität Basel
December 3, 2018
Content of this Course
Planning
Classical: Tasks, Progression/Regression, Complexity, Heuristics
Probabilistic: MDPs, Blind Methods, Heuristic Search, Monte-Carlo Methods
AO∗
From A∗ with Backward Induction to AO∗
A∗ with backward induction is already very similar to AO∗
Support for uncertain outcomes is missing
We focus on SSPs in these slides
Adaptation to FH-MDPs is simple
Careful: an admissible heuristic in the reward setting must not underestimate the true reward
Still two steps ahead:
restrict to acyclic probabilistic tasks ⇒ AO∗
allow general probabilistic tasks ⇒ LAO∗
Transition Systems
AO∗ distinguishes three transition systems:
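The acyclic SSP $\mathcal{T} = \langle S, L, c, T, s_0, S_\star \rangle$
⇒ given implicitly
The explicated graph $\hat{\mathcal{T}}_t = \langle \hat{S}_t, L, c, \hat{T}_t, s_0, S_\star \rangle$
⇒ the part of $\mathcal{T}$ explicitly considered during search
The partial solution graph $\hat{\mathcal{T}}^\star_t = \langle \hat{S}^\star_t, L, c, \hat{T}^\star_t, s_0, S_\star \rangle$
⇒ the part of $\hat{\mathcal{T}}_t$ that contains the best solution
[Figure: nested regions around $s_0$, with $\hat{\mathcal{T}}^\star_t$ inside $\hat{\mathcal{T}}_t$ inside $\mathcal{T}$]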
Explicated Graph
Expanding a state $s$ at time step $t$ explicates all outcomes $s' \in \mathrm{succ}(s, \ell)$ for all $\ell \in L(s)$ by adding them to the explicated graph:
$\hat{\mathcal{T}}_t = \langle \hat{S}_{t-1} \cup \mathrm{succ}(s), L, c, \hat{T}_t, s_0, S_\star \rangle$, where $\hat{T}_t = \hat{T}_{t-1}$ except that $\hat{T}_t(s, \ell, s') = T(s, \ell, s')$ for all $\ell \in L(s)$ and $s' \in \mathrm{succ}(s, \ell)$
Explicated states are annotated with a state-value estimate $\hat{V}_t(s)$ that describes the estimated expected cost to goal at step $t$
When a state $s'$ is explicated and $s' \notin \hat{S}_{t-1}$, its state-value estimate is initialized to $\hat{V}_t(s') := h(s')$
We call leaf states of $\hat{\mathcal{T}}_t$ fringe states
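The expansion step above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the SSP is encoded as a dictionary that maps each state to its applicable labels, each label to a pair (cost, outcome distribution), and the names `expand`, `explicated` and `value` as well as all numbers are hypothetical.

```python
from typing import Callable, Dict, Tuple

# Hypothetical encoding (an assumption, not from the slides):
# state -> label -> (cost, {successor: probability})
SSP = Dict[str, Dict[str, Tuple[float, Dict[str, float]]]]


def expand(ssp: SSP, s: str, explicated: SSP,
           value: Dict[str, float], h: Callable[[str], float]) -> None:
    """Explicate all outcomes of all labels applicable in s."""
    explicated[s] = ssp[s]                 # copy T(s, l, .) into the explicated graph
    for _cost, outcomes in ssp[s].values():
        for succ in outcomes:
            if succ not in value:          # s' was not explicated before
                value[succ] = h(succ)      # initialize V(s') := h(s')


# Toy task with invented numbers (not the blackboard example).
ssp = {"s0": {"a1": (1.0, {"s1": 0.5, "s2": 0.5}),
              "a2": (2.0, {"s3": 1.0})}}
h = {"s0": 5.0, "s1": 10.0, "s2": 6.0, "s3": 3.0}.get

explicated: SSP = {}
value = {"s0": h("s0")}
expand(ssp, "s0", explicated, value, h)

# Fringe states: explicated (they carry a value) but not yet expanded.
fringe = sorted(s for s in value if s not in explicated)
print(fringe)  # ['s1', 's2', 's3']
```

In this encoding, a state belongs to the explicated graph as soon as it has an entry in `value`, and it is a fringe state as long as it has no entry in `explicated`.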
Partial Solution Graph
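The partial solution graph $\hat{\mathcal{T}}^\star_t$ is the subgraph of $\hat{\mathcal{T}}_t$ that is spanned by the smallest set of states $\hat{S}^\star_t$ that satisfies:
$s_0 \in \hat{S}^\star_t$
if $s \in \hat{S}^\star_t$, $s' \in \hat{S}_t$ and $\hat{T}_t(s, a_{\hat{V}_t}(s), s') > 0$, then $s' \in \hat{S}^\star_t$
Here $a_{\hat{V}_t}(s)$ denotes the greedy action in $s$ with respect to $\hat{V}_t$
The partial solution graph forms a partial acyclic policy defined in the initial state $s_0$ and all non-leaf states that can be reached by its execution
Leaf states that can be reached by the policy described by the partial solution graph are the states in the greedy fringe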
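As a rough sketch of this definition, the following Python code collects the states of the partial solution graph by following only the greedy label in each expanded state. The dictionary encoding, the function names and the numbers are assumptions made for illustration; every expanded state is assumed to have at least one applicable label.

```python
def greedy_label(s, explicated, value):
    """Label minimizing c(l) + sum over s' of T(s, l, s') * V(s')."""
    def q(label):
        cost, outcomes = explicated[s][label]
        return cost + sum(p * value[t] for t, p in outcomes.items())
    return min(explicated[s], key=q)


def partial_solution_graph(s0, explicated, value):
    """Return (states of the partial solution graph, greedy fringe)."""
    states, fringe, stack = set(), set(), [s0]
    while stack:
        s = stack.pop()
        if s in states:
            continue
        states.add(s)
        if s not in explicated:            # leaf of the explicated graph
            fringe.add(s)
            continue
        _cost, outcomes = explicated[s][greedy_label(s, explicated, value)]
        stack.extend(t for t, p in outcomes.items() if p > 0)
    return states, fringe


# Toy explicated graph with invented numbers: only s0 has been expanded.
explicated = {"s0": {"a1": (1.0, {"s1": 0.5, "s2": 0.5}),
                     "a2": (2.0, {"s3": 1.0})}}
value = {"s0": 5.0, "s1": 10.0, "s2": 6.0, "s3": 3.0}

states, fringe = partial_solution_graph("s0", explicated, value)
print(sorted(states), sorted(fringe))  # ['s0', 's3'] ['s3']
```

With these estimates, a1 has value 1 + 0.5 · 10 + 0.5 · 6 = 9 and a2 has value 2 + 3 = 5, so a2 is greedy and only s3 ends up in the greedy fringe.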
Bellman backups
AO∗ does not maintain a static open list
State-value estimates determine the partial solution graph
The partial solution graph determines which state is a candidate for expansion
Different strategies to select among candidates exist
(Some) state-value estimates are updated in time step $t$ by Bellman backups:
\[
\hat{V}_t(s) = \min_{\ell \in L(s)} \Bigl( c(\ell) + \sum_{s' \in \hat{S}_t} \hat{T}_t(s, \ell, s') \cdot \hat{V}_t(s') \Bigr)
\]
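A single Bellman backup can be written directly from the formula. The snippet below is a minimal sketch that assumes a dictionary encoding (state → label → (cost, outcome distribution)); the function name and the numbers are invented for illustration.

```python
def bellman_backup(s, explicated, value):
    """Set V(s) to the minimum over labels of c(l) + sum of T(s, l, s') * V(s')."""
    value[s] = min(
        cost + sum(p * value[t] for t, p in outcomes.items())
        for cost, outcomes in explicated[s].values()
    )
    return value[s]


# Toy data with invented numbers.
explicated = {"s0": {"a1": (1.0, {"s1": 0.5, "s2": 0.5}),
                     "a2": (2.0, {"s3": 1.0})}}
value = {"s0": 4.0, "s1": 10.0, "s2": 6.0, "s3": 3.0}

new_value = bellman_backup("s0", explicated, value)
print(new_value)  # 5.0
```

Here a1 evaluates to 1 + 0.5 · 10 + 0.5 · 6 = 9 and a2 to 2 + 1 · 3 = 5, so the estimate of s0 rises from 4 to 5.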
AO∗
AO∗ for acyclic SSP $\mathcal{T}$
explicate $s_0$
while there is a greedy fringe state not in $S_\star$:
    select a greedy fringe state $s \notin S_\star$
    expand $s$
    perform Bellman backups of states in $\hat{\mathcal{T}}^\star_{t-1}$ in reverse order
return $\hat{\mathcal{T}}^\star_t$
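The loop above can be turned into a small runnable Python sketch. It is one possible reading of the pseudocode, not a reference implementation: the dictionary encoding of the SSP, the choice of the first greedy fringe state, the use of a depth-first post-order as "reverse order", and the toy task are all assumptions.

```python
def q_value(cost, outcomes, value):
    return cost + sum(p * value[t] for t, p in outcomes.items())


def greedy_label(s, explicated, value):
    return min(explicated[s], key=lambda l: q_value(*explicated[s][l], value))


def greedy_post_order(s0, explicated, value):
    """States of the partial solution graph, successors before predecessors."""
    order, seen = [], set()

    def visit(s):
        if s in seen:
            return
        seen.add(s)
        if s in explicated:
            _c, outcomes = explicated[s][greedy_label(s, explicated, value)]
            for t, p in outcomes.items():
                if p > 0:
                    visit(t)
        order.append(s)

    visit(s0)
    return order


def ao_star(ssp, s0, goals, h):
    """AO* sketch for an acyclic SSP; returns (greedy policy, value estimates)."""
    value = {s0: 0.0 if s0 in goals else h(s0)}      # explicate s0
    explicated = {}
    while True:
        order = greedy_post_order(s0, explicated, value)
        fringe = [s for s in order if s not in explicated and s not in goals]
        if not fringe:                               # no greedy fringe state outside the goals
            policy = {s: greedy_label(s, explicated, value)
                      for s in order if s in explicated}
            return policy, value
        s = fringe[0]                                # assumed selection strategy
        explicated[s] = ssp[s]                       # expand s
        for _c, outcomes in ssp[s].values():
            for t in outcomes:
                value.setdefault(t, 0.0 if t in goals else h(t))
        for u in order:                              # Bellman backups in reverse order
            if u in explicated:
                value[u] = min(q_value(c, o, value)
                               for c, o in explicated[u].values())


# Toy acyclic task with invented numbers (not the blackboard example).
ssp = {"s0": {"a1": (1.0, {"s1": 0.5, "s2": 0.5}), "a2": (2.0, {"s3": 1.0})},
       "s1": {"a3": (1.0, {"g": 1.0})},
       "s2": {"a4": (2.0, {"g": 1.0})},
       "s3": {"a5": (4.0, {"g": 1.0})}}
h = {"s0": 2.0, "s1": 1.0, "s2": 1.0, "s3": 1.0}.get  # admissible for this task

policy, value = ao_star(ssp, "s0", {"g"}, h)
print(policy["s0"], value["s0"])  # a1 2.5
```

In this run s3 is never expanded, so its estimate stays at the heuristic value 1.0, which illustrates that an optimal solution can be found without explicating all states.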
AO∗: Example (Blackboard)
[Figure: acyclic SSP over states s0, ..., s8 with actions a1, ..., a8, labeled with action costs, outcome probabilities and heuristic values]
h(s) = 0 for goal states, otherwise in blue above or below s
Theoretical properties
Theorem
Using an admissible heuristic, AO∗ converges to an optimal solution without (necessarily) explicating all states.
Proof omitted.
LAO∗
A∗ with backward induction finds sequential solutions (a plan) in classical planning tasks
AO∗ finds acyclic solutions with branches (an acyclic policy) in acyclic SSPs
LAO∗ is the generalization of AO∗ to cyclic solutions in cyclic SSPs
LAO∗
From plans to acyclic policies, we only changed the backup procedure from backward induction to Bellman backups
When solutions may be cyclic, we cannot perform updates in reverse order
Bellman backups are essentially an acyclic version of value iteration
Replacing Bellman backups with value iteration yields a LAO∗ variant
The original algorithm of Hansen & Zilberstein (1998) uses policy iteration instead
LAO∗
LAO∗ for SSP $\mathcal{T}$
explicate $s_0$
while there is a greedy fringe state not in $S_\star$:
    select a greedy fringe state $s \notin S_\star$
    expand $s$
    perform policy iteration in $\hat{\mathcal{T}}_t$
return $\hat{\mathcal{T}}^\star_t$
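The following Python sketch illustrates the loop on a task with a cycle. Note that it deliberately swaps in value iteration over the explicated graph where the pseudocode uses policy iteration, so it corresponds to the value-iteration variant rather than the original algorithm. The dictionary encoding, the residual threshold `eps`, the selection of the first greedy fringe state and the toy task are assumptions.

```python
def q_value(cost, outcomes, value):
    return cost + sum(p * value[t] for t, p in outcomes.items())


def greedy_label(s, explicated, value):
    return min(explicated[s], key=lambda l: q_value(*explicated[s][l], value))


def greedy_states(s0, explicated, value):
    """States reachable from s0 under the greedy policy (cycles allowed)."""
    seen, stack = [], [s0]
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.append(s)
        if s in explicated:
            _c, outcomes = explicated[s][greedy_label(s, explicated, value)]
            stack.extend(t for t, p in outcomes.items() if p > 0)
    return seen


def lao_star(ssp, s0, goals, h, eps=1e-9):
    """LAO* sketch with value iteration in place of policy iteration."""
    value = {s0: 0.0 if s0 in goals else h(s0)}      # explicate s0
    explicated = {}
    while True:
        states = greedy_states(s0, explicated, value)
        fringe = [s for s in states if s not in explicated and s not in goals]
        if not fringe:
            policy = {s: greedy_label(s, explicated, value)
                      for s in states if s in explicated}
            return policy, value
        s = fringe[0]                                # assumed selection strategy
        explicated[s] = ssp[s]                       # expand s
        for _c, outcomes in ssp[s].values():
            for t in outcomes:
                value.setdefault(t, 0.0 if t in goals else h(t))
        while True:                                  # value iteration on the explicated graph
            residual = 0.0
            for u in explicated:
                new = min(q_value(c, o, value) for c, o in explicated[u].values())
                residual = max(residual, abs(new - value[u]))
                value[u] = new
            if residual < eps:
                break


# Toy cyclic task with invented numbers: a1 loops back to s0 with probability 0.5.
ssp = {"s0": {"a1": (1.0, {"s1": 0.5, "s0": 0.5}), "a2": (10.0, {"s2": 1.0})},
       "s1": {"a3": (1.0, {"g": 1.0})},
       "s2": {"a4": (1.0, {"g": 1.0})}}

policy, value = lao_star(ssp, "s0", {"g"}, lambda s: 0.0)
print(policy, round(value["s0"], 6))
```

For this task the fixed point satisfies V(s0) = 1 + 0.5 · V(s1) + 0.5 · V(s0) with V(s1) = 1, so V(s0) = 3, and s2 is never expanded because a2 never becomes greedy.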
LAO∗: Optimizations
Several optimizations for LAO∗ have been proposed:
Use value iteration instead of policy iteration
Terminate VI when the partial solution graph changes
Expand all states in the greedy fringe before backup
Order states (arbitrarily within cycles) and use backward induction for updates
⇒ last two combine to the famous variant iLAO∗
Theoretical properties
Theorem
Using an admissible heuristic, LAO∗ converges to an optimal solution without (necessarily) explicating all states.
Proof omitted.
Summary
AO∗ finds optimal solutions for acyclic SSPs
LAO∗ finds optimal solutions for SSPs
Both algorithms differ from A∗ with backward induction in the way backups are performed
Unlike previous optimal algorithms, both are able to find an optimal solution without explicating all states