
Planning and Optimization

F4. Blind Methods: Value Iteration & Linear Programming

Gabriele Röger and Thomas Keller

Universität Basel

November 26, 2018


F4.1 Value Iteration
F4.2 Linear Programming
F4.3 Summary

Content of this Course

• Planning
  – Classical
    · Tasks
    · Progression/Regression
    · Complexity
    · Heuristics
  – Probabilistic
    · MDPs
    · Blind Methods (this chapter)
    · Heuristic Search
    · Monte-Carlo

From Policy Iteration to Value Iteration

• Policy Iteration:
  – search over policies
  – by evaluating their state-values
• Value Iteration:
  – search directly over state-values
  – optimal policy induced by final state-values


F4.1 Value Iteration


Value Iteration: Idea

• Value Iteration (VI) was first proposed by Bellman in 1957
• computes estimates V̂_0, V̂_1, ... of V⋆ in an iterative process
• starts with an arbitrary V̂_0
• bases estimate V̂_{i+1} on the values of estimate V̂_i by applying the Bellman optimality equation to all states (a code sketch of this backup follows below):

  $$\hat V_{i+1}(s) := \min_{\ell \in L(s)} \Big( c(\ell) + \sum_{s' \in S} T(s, \ell, s') \cdot \hat V_i(s') \Big)$$

  (for SSPs; for FH-MDPs and DR-MDPs accordingly)
• converges to the state-values of an optimal policy
• terminates when the difference between consecutive estimates is small
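A minimal Python sketch of a single backup, assuming a hypothetical SSP interface that is not part of the slides: `labels(s)` returns the applicable labels L(s), `cost(l)` the cost c(ℓ), and `trans(s, l)` a dict mapping each successor s' to its probability T(s, ℓ, s'):

```python
def bellman_backup(s, V, labels, cost, trans):
    """Compute V_{i+1}(s) from the previous estimate V (a dict state -> value):
    minimize c(l) + sum over successors s' of T(s, l, s') * V(s') over labels l."""
    return min(
        cost(l) + sum(p * V[t] for t, p in trans(s, l).items())
        for l in labels(s)
    )
```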


Example: Value Iteration

[Figure: value iteration on a 4×5 grid world with initial state s_0 and goal state s_⋆; panels labelled V̂_0 (the arbitrary initialization) and V̂_19 (the converged estimate)]


Example: Value Iteration

[Figure: the same 4×5 grid world; panels labelled V̂_1 and V̂_19]


Example: Value Iteration

[Figure: the same 4×5 grid world; panels labelled V̂_2 and V̂_19]


Example: Value Iteration

[Figure: the same 4×5 grid world; panels labelled V̂_5 and V̂_19]


Example: Value Iteration

[Figure: the same 4×5 grid world; panels labelled V̂_10 and V̂_19]


Example: Value Iteration

[Figure: the same 4×5 grid world; the converged estimate V̂_19]


Example: Value Iteration

[Figure: the same 4×5 grid world; optimal policy π⋆ induced by V̂_19, shown as arrows:
⇒ ⇑ ⇑ ⇐
⇑ ⇑ ⇑ ⇑
⇒ ⇑ ⇐ ⇑
⇒ ⇑ ⇑ ⇑
⇒ ⇒ ⇒ s⋆ ]
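Extracting such an induced policy amounts to picking a minimizing label in each non-goal state. A sketch under the same hypothetical `labels`/`cost`/`trans` interface as above:

```python
def greedy_policy(V, states, goal_states, labels, cost, trans):
    """Policy induced by state-values V: in each non-goal state, choose a label
    that minimizes the right-hand side of the Bellman optimality equation."""
    return {
        s: min(labels(s),
               key=lambda l: cost(l) + sum(p * V[t]
                                           for t, p in trans(s, l).items()))
        for s in states if s not in goal_states
    }
```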


Value Iteration

Value Iteration for SSP T and ε > 0:

    initialize V̂_0 arbitrarily
    for i = 0, 1, 2, ...:
        for all states s ∈ S:
            V̂_{i+1}(s) := min_{ℓ ∈ L(s)} c(ℓ) + Σ_{s' ∈ S} T(s, ℓ, s') · V̂_i(s')
        if max_{s ∈ S} |V̂_{i+1}(s) − V̂_i(s)| < ε:
            return π_{V̂_{i+1}}

Note: VI for FH-MDPs and DR-MDPs obtained by replacing Bellman optimality equation with corresponding version.
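A direct, if naive, Python translation of this pseudocode might look as follows. The SSP interface (`states`, `goal_states`, `labels`, `cost`, `trans`) is a hypothetical stand-in for T = ⟨S, L, c, T, s_0, S_⋆⟩; keeping goal states fixed at value 0 is an assumption of this sketch, matching the LP formulation later in this chapter:

```python
def value_iteration(states, goal_states, labels, cost, trans, epsilon=1e-4):
    """Value Iteration for an SSP: iterate Bellman backups over all states
    until the largest change between consecutive estimates drops below epsilon."""
    V = {s: 0.0 for s in states}          # arbitrary initialization
    while True:
        V_new = {}
        for s in states:
            if s in goal_states:
                V_new[s] = 0.0            # goal states keep value 0
            else:
                V_new[s] = min(
                    cost(l) + sum(p * V[t] for t, p in trans(s, l).items())
                    for l in labels(s)
                )
        if max(abs(V_new[s] - V[s]) for s in states) < epsilon:
            return V_new                  # induces the greedy policy pi_V
        V = V_new
```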


Policy Iteration or Value Iteration?

• PI and VI both have their advantages:
  – often, PI requires only a few iterations
  – VI iterations are significantly cheaper
• Better versions of both PI and VI exist:
  – Modified PI (approximate policy evaluation)
  – Asynchronous VI (update a subset of states in each iteration; a sketch follows this list)
• However, both suffer from the problem that the whole state space must eventually be visited
• Impossible in large MDPs / SSPs
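Asynchronous VI can be sketched by replacing the full sweep with backups of individual states in some order. This minimal version (same hypothetical SSP interface as before, and a fixed number of passes rather than a convergence test) updates the values in place, so later backups already see the newer values:

```python
def asynchronous_vi(states, goal_states, labels, cost, trans, order, sweeps=100):
    """Asynchronous VI: back up one state at a time, in the given order,
    mutating V in place. Converges to the optimal state-values if every
    state occurs infinitely often in the backup order."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in order:
            if s not in goal_states:
                V[s] = min(
                    cost(l) + sum(p * V[t] for t, p in trans(s, l).items())
                    for l in labels(s)
                )
    return V
```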


F4.2 Linear Programming


Linear Programming for SSPs

• VI iteratively computes a solution to the set of Bellman optimality equations
• Linear Programming offers an alternative way to solve optimization problems (see E3)
• Get a solution to

  $$V^\star(s) := \min_{\ell \in L(s)} \Big( c(\ell) + \sum_{s' \in S} T(s, \ell, s') \cdot V^\star(s') \Big)$$

  without an iterative process
• Problem: the equations are not linear due to the minimization
• But: the minimization can be moved to the objective function


Linear Programming

The solution to the following LP provides the state-values V⋆(s) (through the variables X_s) of an optimal policy for an SSP T = ⟨S, L, c, T, s_0, S_⋆⟩:

$$\begin{array}{ll}
\text{maximize} & \sum_{s \in S} X_s \\
\text{subject to} & X_s = 0 \quad \text{for all } s \in S_\star \\
& X_s \le c(\ell) + \sum_{s' \in S} T(s, \ell, s') \cdot X_{s'} \quad \text{for all } s \in S \text{ and } \ell \in L(s) \\
& X_s \ge 0 \quad \text{for all } s \in S
\end{array}$$

Note: Versions for FH-MDPs and DR-MDPs exist.
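As an illustration, the LP can be handed to an off-the-shelf solver. This sketch uses `scipy.optimize.linprog` (a choice made for this example, not prescribed by the slides) with states indexed 0..n−1 and the same hypothetical `labels`/`cost`/`trans` interface as before; since `linprog` minimizes, the objective is negated, and X_s = 0 for goal states is encoded via variable bounds:

```python
import numpy as np
from scipy.optimize import linprog

def solve_ssp_lp(n_states, goal_states, labels, cost, trans):
    """Solve the SSP LP: maximize sum_s X_s subject to
    X_s <= c(l) + sum_s' T(s, l, s') * X_s' and X_s >= 0 (X_s = 0 at goals)."""
    c_obj = -np.ones(n_states)  # linprog minimizes, so negate the objective
    A_ub, b_ub = [], []
    for s in range(n_states):
        if s in goal_states:
            continue  # X_s is pinned to 0 by its bounds; constraints redundant
        for l in labels(s):
            # rewrite X_s <= c(l) + sum T * X_s' as X_s - sum T * X_s' <= c(l)
            row = np.zeros(n_states)
            row[s] += 1.0
            for t, p in trans(s, l).items():
                row[t] -= p
            A_ub.append(row)
            b_ub.append(cost(l))
    bounds = [(0.0, 0.0) if s in goal_states else (0.0, None)
              for s in range(n_states)]
    res = linprog(c_obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=bounds, method="highs")
    return res.x  # X_s = V*(s) at the optimum
```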


Linear Programming

• Allows solving SSPs with an existing LP solver
• But: |S| variables and |S| · |L| constraints
• Interesting problems are usually not solvable with LP solvers (but neither with PI or VI)
⇒ For large SSPs and MDPs, we need different techniques.


F4.3 Summary


Summary

• Value Iteration searches in the space of state-values
• VI applies the Bellman optimality equation iteratively
• VI converges to the optimal state-values
• An alternative way to compute state-values is by compilation to an LP
