Planning and Optimization G2. Heuristic Search: AO

(1)

Planning and Optimization

G2. Heuristic Search: AO^∗ & LAO^∗ Part II

Gabriele Röger and Thomas Keller

Universität Basel

December 3, 2018

(2)

Content of this Course

Planning
  Classical: Tasks, Progression/Regression, Complexity, Heuristics
  Probabilistic: MDPs, Blind Methods, Heuristic Search, Monte-Carlo Methods

(3)

AO^∗

(4)

From A^∗ with Backward Induction to AO^∗

A^∗ with backward induction is already very similar to AO^∗
Support for uncertain outcomes is still missing

We focus on SSPs in these slides
Adaptation to FH-MDPs is simple
Careful: an admissible heuristic in the reward setting must not underestimate the true reward

Still two steps ahead:
  restrict to acyclic probabilistic tasks ⇒ AO^∗
  allow general probabilistic tasks ⇒ LAO^∗
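The caveat about the reward setting can be written out. Writing V_⋆ for the optimal state-value function (a symbol assumed here, not defined on the slide), admissibility means the heuristic is optimistic in the direction of optimization:

```latex
\begin{align*}
\text{cost setting (SSP):} \quad & h(s) \le V_\star(s) \quad \text{for all } s \in S \\
\text{reward setting (FH-MDP):} \quad & h(s) \ge V_\star(s) \quad \text{for all } s \in S
\end{align*}
```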

(5)

Transition Systems

AO^∗ distinguishes three transition systems:

The acyclic SSP T = ⟨S, L, c, T, s₀, S_⋆⟩
  ⇒ given implicitly

The explicated graph T̂_t = ⟨Ŝ_t, L, c, T̂_t, s₀, S_⋆⟩
  ⇒ the part of T explicitly considered during search

The partial solution graph T̂_t^⋆ = ⟨Ŝ_t^⋆, L, c, T̂_t^⋆, s₀, S_⋆⟩
  ⇒ the part of T̂_t that contains the best solution

[Figure: nested regions around s₀, with T̂_t^⋆ inside T̂_t inside T]

(6)

Explicated Graph

Expanding a state s at time step t explicates all outcomes s′ ∈ succ(s, ℓ) for all ℓ ∈ L(s) by adding them to the explicated graph:

T̂_t = ⟨Ŝ_{t−1} ∪ succ(s), L, c, T̂_t, s₀, S_⋆⟩,
where T̂_t = T̂_{t−1} except that T̂_t(s, ℓ, s′) = T(s, ℓ, s′) for all ℓ ∈ L(s) and s′ ∈ succ(s, ℓ)

Explicated states are annotated with a state-value estimate V̂_t(s) that describes the estimated expected cost to goal at step t
When a state s′ is explicated and s′ ∉ Ŝ_{t−1}, its state-value estimate is initialized to V̂_t(s′) := h(s′)

We call leaf states of T̂_t fringe states
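The expansion step can be sketched in a few lines of Python. This is only an illustration under assumed encodings: the dictionary layout, the names `expand`, `succ`, `explicated`, and the toy task are all hypothetical and not part of the slides.

```python
from typing import Callable, Dict, List, Tuple

# Assumed encoding: label -> list of (probability, successor) outcomes.
Outcomes = Dict[str, List[Tuple[float, str]]]

def expand(s: str,
           succ: Callable[[str], Outcomes],
           h: Callable[[str], float],
           explicated: Dict[str, Outcomes],
           V: Dict[str, float]) -> None:
    """Explicate all outcomes of s; initialize newly seen states with h."""
    explicated[s] = succ(s)
    for outcomes in explicated[s].values():
        for _prob, s_next in outcomes:
            if s_next not in V:          # s' was not explicated before
                V[s_next] = h(s_next)

# Toy task (made up for illustration): one label with two outcomes.
toy = {"s0": {"a": [(0.5, "s1"), (0.5, "g")]}}
V = {"s0": 2.0}
explicated: Dict[str, Outcomes] = {}
expand("s0", lambda s: toy[s], lambda s: 0.0 if s == "g" else 3.0, explicated, V)

# Fringe states are the explicated states that have not been expanded.
fringe = set(V) - set(explicated)
```

In this encoding the keys of `V` play the role of Ŝ_t, and the keys of `explicated` are the states that have already been expanded.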

(7)

Partial Solution Graph

The partial solution graph T̂_t^⋆ is the subgraph of T̂_t that is spanned by the smallest set of states Ŝ_t^⋆ that satisfies:

s₀ ∈ Ŝ_t^⋆
if s ∈ Ŝ_t^⋆, s′ ∈ Ŝ_t and T̂_t(s, a_{V̂_t}(s), s′) > 0, then s′ ∈ Ŝ_t^⋆

The partial solution graph forms a partial acyclic policy defined in the initial state s₀ and all non-leaf states that can be reached by its execution

Leaf states that can be reached by the policy described by the partial solution graph are the states in the greedy fringe
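One way to collect the greedy fringe is to follow the greedy label from s₀ and record every unexpanded state that is reached. The sketch below assumes that every expanded state has at least one label and uses hypothetical names (`greedy_label`, `greedy_fringe`) and made-up numbers.

```python
from typing import Dict, List, Set, Tuple

Outcomes = Dict[str, List[Tuple[float, str]]]  # label -> [(probability, successor)]

def greedy_label(s: str, explicated: Dict[str, Outcomes],
                 cost: Dict[str, float], V: Dict[str, float]) -> str:
    """Label minimizing cost plus expected successor estimate."""
    def q(label: str) -> float:
        return cost[label] + sum(p * V[t] for p, t in explicated[s][label])
    return min(explicated[s], key=q)

def greedy_fringe(s0: str, explicated: Dict[str, Outcomes],
                  cost: Dict[str, float], V: Dict[str, float]) -> Set[str]:
    """Leaves reachable from s0 when only greedy labels are followed."""
    fringe, seen, stack = set(), {s0}, [s0]
    while stack:
        s = stack.pop()
        if s not in explicated:          # unexpanded state: leaf of the graph
            fringe.add(s)
            continue
        for p, t in explicated[s][greedy_label(s, explicated, cost, V)]:
            if p > 0 and t not in seen:
                seen.add(t)
                stack.append(t)
    return fringe

# Made-up example: label b looks cheaper (1 + 1 = 2) than a (1 + 1.5 = 2.5).
explicated = {"s0": {"a": [(0.5, "s1"), (0.5, "g")], "b": [(1.0, "s2")]}}
cost = {"a": 1.0, "b": 1.0}
V = {"s0": 0.0, "s1": 3.0, "s2": 1.0, "g": 0.0}
fringe = greedy_fringe("s0", explicated, cost, V)
```

Here only `s2` ends up in the greedy fringe, since `s1` and `g` are reachable only through the non-greedy label `a`.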

(8)

Bellman backups

AO^∗ does not maintain a static open list
State-value estimates determine the partial solution graph
The partial solution graph determines which states are candidates for expansion
Different strategies to select among candidates exist

(Some) state-value estimates are updated in time step t by Bellman backups:

\hat{V}_t(s) = \min_{\ell \in L(s)} \Big( c(\ell) + \sum_{s' \in \hat{S}_t} \hat{T}_t(s, \ell, s') \cdot \hat{V}_t(s') \Big)
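As a small worked instance of the backup, the following sketch evaluates the minimum over two labels. The function name `bellman_backup`, the dictionary encoding, and all numbers are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Outcomes = Dict[str, List[Tuple[float, str]]]  # label -> [(probability, successor)]

def bellman_backup(s: str, labels: Dict[str, Outcomes],
                   cost: Dict[str, float], V: Dict[str, float]) -> float:
    """Set V[s] to the minimum over labels of cost plus expected estimate."""
    V[s] = min(cost[l] + sum(p * V[t] for p, t in outcomes)
               for l, outcomes in labels[s].items())
    return V[s]

labels = {"s": {"a1": [(0.5, "u"), (0.5, "v")], "a2": [(1.0, "w")]}}
cost = {"a1": 1.0, "a2": 2.0}
V = {"s": 0.0, "u": 4.0, "v": 2.0, "w": 3.0}

# a1: 1 + 0.5*4 + 0.5*2 = 4,  a2: 2 + 1.0*3 = 5,  so the backup yields 4.
new_value = bellman_backup("s", labels, cost, V)
```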

(9)

AO^∗

AO^∗ for acyclic SSP T
  explicate s₀
  while there is a greedy fringe state not in S_⋆:
      select a greedy fringe state s ∉ S_⋆
      expand s
      perform Bellman backups of states in T̂_{t−1}^⋆ in reverse order
  return T̂_t^⋆
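The loop above might look as follows in Python. This is a sketch under stated assumptions, not the course's reference implementation: the task encoding and all names are hypothetical, h is assumed to be 0 on goal states, and every non-goal state is assumed to have at least one label. As a simplification, the sketch backs up every expanded state of the explicated graph (children before parents), which is a superset of the states in T̂_{t−1}^⋆; this costs more per step but keeps all estimates consistent.

```python
from typing import Callable, Dict, List, Set, Tuple

Outcomes = Dict[str, List[Tuple[float, str]]]  # label -> [(probability, successor)]

def ao_star(s0: str, goals: Set[str], succ: Callable[[str], Outcomes],
            cost: Dict[str, float], h: Callable[[str], float]):
    V: Dict[str, float] = {s0: h(s0)}
    explicated: Dict[str, Outcomes] = {}

    def q(s: str, l: str) -> float:
        return cost[l] + sum(p * V[t] for p, t in explicated[s][l])

    def greedy(s: str) -> str:
        return min(explicated[s], key=lambda l: q(s, l))

    def postorder(follow_greedy: bool) -> List[str]:
        """DFS from s0; every state is listed after its descendants."""
        order: List[str] = []
        seen: Set[str] = set()
        def visit(s: str) -> None:
            seen.add(s)
            if s in explicated:
                labels = [greedy(s)] if follow_greedy else list(explicated[s])
                for l in labels:
                    for p, t in explicated[s][l]:
                        if p > 0 and t not in seen:
                            visit(t)
            order.append(s)
        visit(s0)
        return order

    while True:
        # Greedy fringe: unexpanded non-goal leaves of the partial solution graph.
        fringe = [s for s in postorder(True)
                  if s not in explicated and s not in goals]
        if not fringe:
            break
        s = fringe[0]                          # any selection strategy works
        explicated[s] = succ(s)                # expand s
        for outcomes in explicated[s].values():
            for _p, t in outcomes:
                V.setdefault(t, h(t))
        for u in postorder(False):             # backups, children first
            if u in explicated:
                V[u] = q(u, greedy(u))

    policy = {s: greedy(s) for s in postorder(True) if s in explicated}
    return V, policy

# Made-up acyclic task; h is exact for s2 and 0 elsewhere (admissible).
task = {"s0": {"a": [(0.5, "s1"), (0.5, "g")], "b": [(1.0, "s2")]},
        "s1": {"c": [(1.0, "g")]},
        "s2": {"d": [(1.0, "g")]}}
cost = {"a": 1.0, "b": 1.0, "c": 2.0, "d": 3.0}
h = {"s0": 0.0, "s1": 0.0, "s2": 3.0, "g": 0.0}.__getitem__
V, policy = ao_star("s0", {"g"}, lambda s: task[s], cost, h)
```

In this run `s2` is never expanded, because its heuristic value already makes label `b` unattractive at `s0`.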

(10)

AO^∗: Example (Blackboard)

[Figure: example acyclic SSP with states s0 to s8 and actions a1 to a8, annotated with action costs, outcome probabilities, and heuristic values]

h(s) = 0 for goal states, otherwise in blue above or below s

(11)

Theoretical properties

Theorem

Using an admissible heuristic, AO^∗ converges to an optimal solution without (necessarily) explicating all states.

Proof omitted.

(12)

LAO^∗

(13)

LAO^∗

A^∗ with backward induction finds sequential solutions (a plan) in classical planning tasks

AO^∗ finds acyclic solutions with branches (an acyclic policy) in acyclic SSPs

LAO^∗ is the generalization of AO^∗ to cyclic solutions in cyclic SSPs

(14)

LAO^∗

From plans to acyclic policies, we only changed the backup procedure from backward induction to Bellman backups
When solutions may be cyclic, we cannot perform updates in reverse order

(16)

LAO^∗

Bellman backups are essentially an acyclic version of value iteration
Replacing Bellman backups with value iteration is an LAO^∗ variant
The original algorithm of Hansen & Zilberstein (1998) uses policy iteration instead

(17)

LAO^∗

LAO^∗ for SSP T
  explicate s₀
  while there is a greedy fringe state not in S_⋆:
      select a greedy fringe state s ∉ S_⋆
      expand s
      perform policy iteration in T̂_t
  return T̂_t^⋆
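A compact Python sketch of this loop is given below. Note the swap: it uses value iteration over the expanded states in place of policy iteration, that is, the LAO^∗ variant rather than the original formulation. The encoding, the names, the tolerance `eps`, and the toy task are assumptions; h is assumed to be 0 on goals and every non-goal state is assumed to have at least one label.

```python
from typing import Callable, Dict, List, Set, Tuple

Outcomes = Dict[str, List[Tuple[float, str]]]  # label -> [(probability, successor)]

def lao_star(s0: str, goals: Set[str], succ: Callable[[str], Outcomes],
             cost: Dict[str, float], h: Callable[[str], float],
             eps: float = 1e-9):
    V: Dict[str, float] = {s0: h(s0)}
    explicated: Dict[str, Outcomes] = {}

    def q(s: str, l: str) -> float:
        return cost[l] + sum(p * V[t] for p, t in explicated[s][l])

    def greedy(s: str) -> str:
        return min(explicated[s], key=lambda l: q(s, l))

    def greedy_reachable() -> Set[str]:
        """States of the partial solution graph (cycles are allowed)."""
        seen, stack = {s0}, [s0]
        while stack:
            s = stack.pop()
            if s in explicated:
                for p, t in explicated[s][greedy(s)]:
                    if p > 0 and t not in seen:
                        seen.add(t)
                        stack.append(t)
        return seen

    while True:
        fringe = [s for s in greedy_reachable()
                  if s not in explicated and s not in goals]
        if not fringe:
            break
        s = fringe[0]
        explicated[s] = succ(s)                # expand s
        for outcomes in explicated[s].values():
            for _p, t in outcomes:
                V.setdefault(t, h(t))
        residual = float("inf")
        while residual > eps:                  # value iteration sweeps
            residual = 0.0
            for u in explicated:
                new = q(u, greedy(u))
                residual = max(residual, abs(new - V[u]))
                V[u] = new

    policy = {s: greedy(s) for s in greedy_reachable() if s in explicated}
    return V, policy

# Made-up cyclic task: label a loops back to s0 with probability 0.5.
task = {"s0": {"a": [(0.5, "s0"), (0.5, "s1")], "b": [(1.0, "g")]},
        "s1": {"c": [(1.0, "g")]}}
cost = {"a": 1.0, "b": 5.0, "c": 1.0}
V, policy = lao_star("s0", {"g"}, lambda s: task[s], cost, lambda s: 0.0)
```

For this task the fixed point satisfies V(s0) = 1 + 0.5·V(s0) + 0.5·1, so the estimate approaches 3, which is cheaper than label `b` with cost 5.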

(18)

LAO^∗: Optimizations

Several optimizations for LAO^∗ have been proposed:

Use value iteration instead of policy iteration
Terminate VI when the partial solution graph changes
Expand all states in the greedy fringe before backup
Order states (arbitrarily within cycles) and use backward induction for updates
  ⇒ last two combine to the famous variant iLAO^∗
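The last two optimizations can be combined in a single depth-first pass: every greedy fringe state met during the pass is expanded, and states are backed up once on the way back (postorder, with cycles broken arbitrarily by the visit order). The sketch below is a simplified reading of that idea, not a definitive iLAO^∗ implementation; the names, the stopping tolerance `eps`, and the toy task are assumptions, and h is assumed to be 0 on goals.

```python
from typing import Callable, Dict, List, Set, Tuple

Outcomes = Dict[str, List[Tuple[float, str]]]  # label -> [(probability, successor)]

def ilao_star(s0: str, goals: Set[str], succ: Callable[[str], Outcomes],
              cost: Dict[str, float], h: Callable[[str], float],
              eps: float = 1e-9):
    V: Dict[str, float] = {s0: h(s0)}
    explicated: Dict[str, Outcomes] = {}

    def q(s: str, l: str) -> float:
        return cost[l] + sum(p * V[t] for p, t in explicated[s][l])

    def sweep() -> Tuple[bool, float]:
        """One DFS pass: expand greedy fringe states, back up in postorder."""
        seen = {s0}
        expanded_any, residual = False, 0.0

        def visit(s: str) -> None:
            nonlocal expanded_any, residual
            if s in goals:
                return
            if s not in explicated:            # greedy fringe state: expand
                explicated[s] = succ(s)
                for outcomes in explicated[s].values():
                    for _p, t in outcomes:
                        V.setdefault(t, h(t))
                expanded_any = True
            else:                              # descend along the greedy label
                best = min(explicated[s], key=lambda l: q(s, l))
                for p, t in explicated[s][best]:
                    if p > 0 and t not in seen:
                        seen.add(t)
                        visit(t)
            old = V[s]
            V[s] = min(q(s, l) for l in explicated[s])
            residual = max(residual, abs(V[s] - old))

        visit(s0)
        return expanded_any, residual

    while True:
        expanded_any, residual = sweep()
        if not expanded_any and residual <= eps:
            break

    policy = {s: min(explicated[s], key=lambda l: q(s, l)) for s in explicated}
    return V, policy

# Same made-up cyclic task as a quick check.
task = {"s0": {"a": [(0.5, "s0"), (0.5, "s1")], "b": [(1.0, "g")]},
        "s1": {"c": [(1.0, "g")]}}
cost = {"a": 1.0, "b": 5.0, "c": 1.0}
V, policy = ilao_star("s0", {"g"}, lambda s: task[s], cost, lambda s: 0.0)
```

The stopping rule used here (a pass with no expansion and a residual of at most `eps`) is one simple choice; the returned `policy` lists a greedy label for every expanded state, not only for those in the final partial solution graph.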

(19)

Theoretical properties

Theorem

Using an admissible heuristic, LAO^∗ converges to an optimal solution without (necessarily) explicating all states.

Proof omitted.

(20)

Summary

(21)

Summary

AO^∗ finds optimal solutions for acyclic SSPs
LAO^∗ finds optimal solutions for SSPs

Both algorithms differ from A^∗ with backward induction in the way backups are performed

Unlike previous optimal algorithms, both are able to find an optimal solution without explicating all states
