
Planning and Optimization

F4. Value Iteration

Malte Helmert and Gabriele Röger

Universität Basel

December 02, 2020


F4. Value Iteration

F4.1 Introduction
F4.2 Value Iteration
F4.3 Asynchronous VI
F4.4 Summary


Content of this Course

[Overview diagram: Planning splits into Classical (Foundations, Logic, Heuristics, Constraints) and Probabilistic (Explicit MDPs, Factored MDPs).]

Content of this Course: Explicit MDPs

[Overview diagram: Explicit MDPs covers Foundations, Linear Programming, Policy Iteration, and Value Iteration.]


F4.1 Introduction


From Policy Iteration to Value Iteration

- Policy Iteration:
  - search over policies
  - by evaluating their state-values
- Value Iteration:
  - search directly over state-values
  - optimal policy induced by final state-values


F4.2 Value Iteration


Value Iteration: Idea

- Value Iteration (VI) was first proposed by Bellman in 1957
- computes estimates V̂_0, V̂_1, ... of V⋆ in an iterative process
- starts with an arbitrary V̂_0
- bases estimate V̂_{i+1} on the values of estimate V̂_i by treating the Bellman equation as an update rule on all states (see the sketch after this list):

    V̂_{i+1}(s) := min_{a ∈ A(s)} ( c(a) + Σ_{s' ∈ S} T(s, a, s') · V̂_i(s') )

  (for SSPs; for MDPs accordingly)

- converges to the state-values of an optimal policy
- terminates when the difference between consecutive estimates is small
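To make the update rule concrete, here is a minimal sketch (not from the slides) of a single synchronous Bellman backup for an SSP; the dictionary-based encoding of A, c, T and V is an illustrative assumption.

```python
def bellman_backup(s, A, c, T, V):
    """Return the updated estimate for state s, computed from the previous estimate V.

    Assumed (illustrative) encoding:
      A[s]      -- list of actions applicable in state s
      c[a]      -- cost of action a
      T[(s, a)] -- dict mapping successor state s' to probability T(s, a, s')
      V[s']     -- previous value estimate for s'
    """
    return min(
        c[a] + sum(p * V[t] for t, p in T[(s, a)].items())
        for a in A[s]
    )
```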


Example: Value Iteration

[Figure: 4×5 grid world with initial state s_0 and goal state s⋆; cell values show V̂_0, which is 0.00 in every cell.]

- cost of 3 to move from striped cells (cost is 1 otherwise)
- moving from gray cells unsuccessful with probability 0.6


Example: Value Iteration

[Figure: same 4×5 grid; cell values show V̂_1: 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 3.00 1.00 1.00 1.00 1.00 0.00]

- cost of 3 to move from striped cells (cost is 1 otherwise)
- moving from gray cells unsuccessful with probability 0.6


Example: Value Iteration

[Figure: same 4×5 grid; cell values show V̂_2: 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 5.20 1.60 2.00 2.00 1.00 0.00]

- cost of 3 to move from striped cells (cost is 1 otherwise)
- moving from gray cells unsuccessful with probability 0.6
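As a worked instance of the update rule (the cell layout here is an assumption: a gray cell with unit move cost lying directly next to the goal, where an unsuccessful move leaves the agent in place), the 1.60 entry arises as V̂_2(s) = 1 + 0.6 · V̂_1(s) + 0.4 · V̂_1(s⋆) = 1 + 0.6 · 1.00 + 0.4 · 0.00 = 1.60; the 5.20 entry analogously combines the cost of 3 with the 0.6 failure probability: 3 + 0.6 · 3.00 + 0.4 · 1.00 = 5.20.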


Example: Value Iteration

[Figure: same 4×5 grid; cell values show V̂_5: 5.00 5.00 5.00 4.97 5.00 5.00 4.84 4.76 5.00 4.00 4.49 3.96 4.60 3.00 7.79 2.31 3.96 2.00 1.00 0.00]

- cost of 3 to move from striped cells (cost is 1 otherwise)
- moving from gray cells unsuccessful with probability 0.6


Example: Value Iteration

[Figure: same 4×5 grid; cell values show V̂_10: 8.18 7.31 7.00 8.50 8.30 6.38 6.00 6.95 6.38 4.00 5.00 4.87 5.43 3.00 8.44 2.48 4.46 2.00 1.00 0.00]

- cost of 3 to move from striped cells (cost is 1 otherwise)
- moving from gray cells unsuccessful with probability 0.6


Example: Value Iteration

[Figure: same 4×5 grid; cell values show V̂_20: 8.50 7.50 7.00 9.49 8.99 6.50 6.00 7.49 6.50 4.00 5.00 5.00 5.50 3.00 8.50 2.50 4.50 2.00 1.00 0.00]

- cost of 3 to move from striped cells (cost is 1 otherwise)
- moving from gray cells unsuccessful with probability 0.6


Example: Value Iteration

[Figure: same 4×5 grid; cell values show the converged estimate V̂_29: 8.50 7.50 7.00 9.50 9.00 6.50 6.00 7.50 6.50 4.00 5.00 5.00 5.50 3.00 8.50 2.50 4.50 2.00 1.00 0.00]

- cost of 3 to move from striped cells (cost is 1 otherwise)
- moving from gray cells unsuccessful with probability 0.6


Example: Value Iteration

[Figure: same 4×5 grid with the converged values V̂_29 (8.50 7.50 7.00 9.50 9.00 6.50 6.00 7.50 6.50 4.00 5.00 5.00 5.50 3.00 8.50 2.50 4.50 2.00 1.00 0.00); arrows in each cell show the induced greedy policy π⋆ leading from s_0 to s⋆.]

- cost of 3 to move from striped cells (cost is 1 otherwise)
- moving from gray cells unsuccessful with probability 0.6


Value Iteration for SSPs

Value Iteration for SSP T = ⟨S, A, c, T, s_0, S⋆⟩ and ε > 0:

  initialize V̂_0 to 0 for goal states, otherwise arbitrarily
  for i = 0, 1, 2, ...:
      for all states s ∈ S \ S⋆:
          V̂_{i+1}(s) := min_{a ∈ A(s)} ( c(a) + Σ_{s' ∈ S} T(s, a, s') · V̂_i(s') )
      if max_{s ∈ S} |V̂_{i+1}(s) − V̂_i(s)| < ε:
          return a greedy policy π_{V̂_{i+1}}
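The following is a minimal executable sketch of this procedure (not the authors' code); the dictionary-based encoding of S, A, c, T and the goal set is an illustrative assumption.

```python
def value_iteration_ssp(S, A, c, T, goals, eps=1e-4):
    """Synchronous Value Iteration for an explicit SSP.

    Assumed (illustrative) encoding:
      S         -- iterable of states, goals a subset of S
      A[s]      -- actions applicable in s, c[a] the cost of action a
      T[(s, a)] -- dict mapping successor s' to probability T(s, a, s')
    """
    V = {s: 0.0 for s in S}  # 0 for goal states; 0 elsewhere is one arbitrary choice
    while True:
        V_new = dict(V)
        for s in S:
            if s in goals:
                continue  # goal states keep value 0
            V_new[s] = min(
                c[a] + sum(p * V[t] for t, p in T[(s, a)].items())
                for a in A[s]
            )
        converged = max(abs(V_new[s] - V[s]) for s in S) < eps
        V = V_new
        if converged:
            break
    # greedy policy induced by the final estimates
    policy = {
        s: min(A[s], key=lambda a: c[a] + sum(p * V[t] for t, p in T[(s, a)].items()))
        for s in S if s not in goals
    }
    return V, policy
```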


Value Iteration for MDPs

Value Iteration for MDP T = ⟨S, A, R, T, s_0, γ⟩ and ε > 0:

  initialize V̂_0 arbitrarily
  for i = 0, 1, 2, ...:
      for all states s ∈ S:
          V̂_{i+1}(s) := max_{a ∈ A(s)} ( R(s) + γ · Σ_{s' ∈ S} T(s, a, s') · V̂_i(s') )
      if max_{s ∈ S} |V̂_{i+1}(s) − V̂_i(s)| < ε:
          return π_{V̂_{i+1}}
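A corresponding sketch for the discounted-reward case, under the same illustrative encoding, with a state reward R[s] and discount factor gamma; the greedy policy can be extracted exactly as in the SSP sketch, with min replaced by max.

```python
def value_iteration_mdp(S, A, R, T, gamma, eps=1e-4):
    """Synchronous Value Iteration for an explicit discounted-reward MDP."""
    V = {s: 0.0 for s in S}  # arbitrary initialization
    while True:
        V_new = {
            s: max(
                R[s] + gamma * sum(p * V[t] for t, p in T[(s, a)].items())
                for a in A[s]
            )
            for s in S
        }
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new
```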


F4.3 Asynchronous VI


Asynchronous Value Iteration

- Updating all states simultaneously is called a synchronous backup
- Asynchronous VI performs backups for individual states (see the sketch after this list)
- Different approaches lead to different backup orders
- Can significantly reduce computation
- Guaranteed to converge if all states are selected repeatedly
  ⇒ optimal VI with asynchronous backups is possible
  ⇒ there is no obvious termination criterion
  ⇒ often used in an any-time setting (run until you need the result)
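A minimal sketch of this idea (not from the slides): individual states are backed up in an arbitrary, here random, order, and the loop runs for a fixed budget of backups because of the missing termination criterion. The SSP encoding is the same illustrative one as in the earlier sketches.

```python
import random

def asynchronous_vi(S, A, c, T, goals, num_backups=100_000, select_state=None):
    """Asynchronous VI: back up one state at a time in the order given by select_state."""
    V = {s: 0.0 for s in S}
    non_goal = [s for s in S if s not in goals]
    select_state = select_state or (lambda: random.choice(non_goal))
    for _ in range(num_backups):  # any-time: stop whenever the result is needed
        s = select_state()
        V[s] = min(
            c[a] + sum(p * V[t] for t, p in T[(s, a)].items())
            for a in A[s]
        )
    return V
```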


In-place Value Iteration

- Synchronous value iteration creates a new copy of the value function (two copies are required simultaneously):

    V̂_{i+1}(s) := min_{a ∈ A(s)} ( c(a) + Σ_{s' ∈ S} T(s, a, s') · V̂_i(s') )

- In-place value iteration only requires a single copy of the value function:

    V̂(s) := min_{a ∈ A(s)} ( c(a) + Σ_{s' ∈ S} T(s, a, s') · V̂(s') )

- In-place VI is asynchronous because some backups are based on "old" values and some on "new" values (see the sketch after this list)
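A minimal sketch of one in-place sweep (same illustrative SSP encoding as above): only a single value dictionary is kept, so backups later in the sweep already see the values written earlier in the same sweep.

```python
def in_place_sweep(S, A, c, T, goals, V):
    """Perform one in-place VI sweep over all non-goal states, mutating V."""
    for s in S:
        if s in goals:
            continue
        V[s] = min(
            c[a] + sum(p * V[t] for t, p in T[(s, a)].items())
            for a in A[s]
        )
    return V
```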


F4.4 Summary


Linear Programming, Policy Iteration, or Value Iteration?

- Linear Programming is the only technique where the solution is guaranteed to be optimal (independent of ε)
- PI and VI are often faster than LP
- Policy evaluation is slightly cheaper than a VI iteration
- PI is faster than VI if few iterations are required
- VI is faster than PI if the number of PI iterations outweighs the per-iteration savings of policy evaluation over a VI iteration
- Asynchronous VI is the basis of more sophisticated algorithms that can be applied to large MDPs and SSPs


Summary

- Value Iteration searches in the space of state-values
- VI iteratively applies the Bellman equation as an update rule
- VI converges to the optimal state-values
- VI remains optimal with asynchronous backups as long as all states are selected repeatedly
