(1)

F1. Markov Decision Processes

Gabriele Röger and Thomas Keller

Universität Basel

November 21, 2018

(2)


Content of this Course

Planning
  Classical: Tasks, Progression/Regression, Complexity, Heuristics
  Probabilistic: MDPs, Blind Methods, Heuristic Search, Monte-Carlo Methods

(3)

Motivation

(4)


Limitations of Classical Planning

timetable for astronauts on ISS

concurrency required for some experiments
optimize makespan

(5)

Generalization of Classical Planning: Temporal Planning

timetable for astronauts on ISS

concurrency required for some experiments
optimize makespan

(6)


Limitations of Classical Planning

kinematics of robotic arm

state space is continuous

preconditions and effects described by complex functions

(7)

Generalization of Classical Planning: Numeric Planning

kinematics of robotic arm
state space is continuous

preconditions and effects described by complex functions

(8)


Limitations of Classical Planning

(Figure: satellite image area divided into a 5×5 grid of patches.)

satellite takes images of patches on earth

weather forecast is uncertain

find solution with lowest expected cost

(9)

Generalization of Classical Planning: MDPs

(Figure: satellite image area divided into a 5×5 grid of patches.)

satellite takes images of patches on earth
weather forecast is uncertain

find solution with lowest expected cost

(10)


Limitations of Classical Planning

Chess

there is an opponent with contradictory objective

(11)

Generalization of Classical Planning: Multiplayer Games

Chess

there is an opponent with contradictory objective

(12)


Limitations of Classical Planning

Solitaire

some state information cannot be observed
must reason over belief for good behaviour

(13)

Generalization of Classical Planning: POMDPs

Solitaire

some state information cannot be observed
must reason over belief for good behaviour

(14)


Limitations of Classical Planning

many applications are combinations of these
all of these are active research areas

we focus on one of them:

probabilistic planning with Markov decision processes
MDPs are closely related to games (Why?)

(15)

Markov Decision Processes

(16)


Markov Decision Processes

Markov decision processes (MDPs) studied since the 1950s
Work up to the 1980s mostly on theory and basic algorithms for small to medium-sized MDPs

Today, focus on large (typically factored) MDPs
Fundamental data structure for reinforcement learning (not covered in this course) and for probabilistic planning

different variants exist

(17)

Reminder: Transition Systems

Definition (Transition System)

A transition system is a 6-tuple $\mathcal{T} = \langle S, L, c, T, s_0, S_\star \rangle$ where

$S$ is a finite set of states,
$L$ is a finite set of (transition) labels,
$c : L \to \mathbb{R}^+_0$ is a label cost function,
$T \subseteq S \times L \times S$ is the transition relation,
$s_0 \in S$ is the initial state, and
$S_\star \subseteq S$ is the set of goal states.
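To make the tuple concrete, here is a minimal sketch of a transition system as plain Python data. All names are illustrative and not from the slides; states and labels are strings for simplicity.

```python
from dataclasses import dataclass

State = str  # illustrative; any hashable type works
Label = str

@dataclass(frozen=True)
class TransitionSystem:
    states: frozenset[State]                      # S
    labels: frozenset[Label]                      # L
    cost: dict[Label, float]                      # c : L -> nonnegative reals
    transitions: set[tuple[State, Label, State]]  # T, a subset of S x L x S
    init: State                                   # s0
    goals: frozenset[State]                       # S_star
```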

(18)


Reminder: Transition System Example

(Figure: transition graph over the states LL, TL, RL, LR, TR, RR.)

Logistics problem with one package, one truck, two locations:

location of package: {L, R, T}
location of truck: {L, R}

(19)

Stochastic Shortest Path Problem

Definition (Stochastic Shortest Path Problem)

A stochastic shortest path problem (SSP) is a 6-tuple $\mathcal{T} = \langle S, L, c, T, s_0, S_\star \rangle$, where

$S$ is a finite set of states,
$L$ is a finite set of (transition) labels,
$c : L \to \mathbb{R}^+_0$ is a label cost function,
$T : S \times L \times S \mapsto [0, 1]$ is the transition function,
$s_0 \in S$ is the initial state, and
$S_\star \subseteq S$ is the set of goal states.

For all $s \in S$ and $\ell \in L$ with $T(s, \ell, s') > 0$ for some $s' \in S$, we require $\sum_{s' \in S} T(s, \ell, s') = 1$.

Note: An SSP is the probabilistic counterpart of a transition system.
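The normalization requirement can be checked mechanically once $T$ is stored as a probability map. A hedged sketch, mirroring the illustrative TransitionSystem above but with the relation replaced by a transition function:

```python
from dataclasses import dataclass

State = str
Label = str

@dataclass(frozen=True)
class SSP:
    states: frozenset[State]
    labels: frozenset[Label]
    cost: dict[Label, float]                              # c : L -> nonnegative reals
    trans: dict[tuple[State, Label], dict[State, float]]  # (s, l) -> {s': T(s, l, s')}
    init: State
    goals: frozenset[State]

    def check_normalization(self) -> None:
        # For all s and l with T(s, l, s') > 0 for some s',
        # the outgoing probabilities must sum to 1.
        for (s, l), dist in self.trans.items():
            total = sum(dist.values())
            if dist and abs(total - 1.0) > 1e-9:
                raise ValueError(f"T({s}, {l}, .) sums to {total}, not 1")
```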

(20)


Reminder: Transition System Example

(Figure: the same transition graph, with the move-with-package transitions branching into outcomes with probabilities 0.8 and 0.2.)

Logistics problem with one package, one truck, two locations:

location of package: {L, R, T}
location of truck: {L, R}

if truck moves with package, 20% chance of losing package

(21)

Terminology (1)

If $p := T(s, \ell, s') > 0$, we write $s \xrightarrow{p:\ell} s'$, or $s \xrightarrow{p} s'$ if not interested in $\ell$.

If $T(s, \ell, s') = 1$, we also write $s \xrightarrow{\ell} s'$, or $s \to s'$ if not interested in $\ell$.

If $T(s, \ell, s') > 0$ for some $s'$, we say that $\ell$ is applicable in $s$.

The set of applicable labels in $s$ is $L(s)$.

(22)


Terminology (2)

the successor set of $s$ and $\ell$ is $\mathrm{succ}(s, \ell) = \{s' \in S \mid T(s, \ell, s') > 0\}$

$s'$ is a successor of $s$ if $s' \in \mathrm{succ}(s, \ell)$ for some $\ell$
$s$ is a predecessor of $s'$ if $s' \in \mathrm{succ}(s, \ell)$ for some $\ell$

with $s' \sim \mathrm{succ}(s, \ell)$ we denote that successor $s' \in \mathrm{succ}(s, \ell)$ of $s$ and $\ell$ is sampled according to the probability distribution $T$
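These notions translate directly into code; in particular, $s' \sim \mathrm{succ}(s, \ell)$ is an ordinary weighted random draw. A sketch against the illustrative SSP representation above (function names are assumptions, not course API):

```python
import random

def applicable_labels(m, s):
    """L(s): all labels l with T(s, l, s') > 0 for some s'."""
    return [l for (s2, l), dist in m.trans.items() if s2 == s and dist]

def succ(m, s, l):
    """succ(s, l) = {s' in S | T(s, l, s') > 0}."""
    return {s2 for s2, p in m.trans.get((s, l), {}).items() if p > 0}

def sample_successor(m, s, l):
    """Draw s' ~ succ(s, l) according to the distribution T(s, l, .)."""
    outcomes, probs = zip(*m.trans[(s, l)].items())
    return random.choices(outcomes, weights=probs, k=1)[0]
```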

(23)

Terminology (3)

$s'$ is reachable from $s$ if there exists a sequence of transitions $s_0 \xrightarrow{p_1:\ell_1} s_1, \ldots, s_{n-1} \xrightarrow{p_n:\ell_n} s_n$ s.t. $s_0 = s$ and $s_n = s'$

Note: $n = 0$ possible; then $s = s'$

$s_0, \ldots, s_n$ is called (state) path from $s$ to $s'$
$\ell_1, \ldots, \ell_n$ is called (label) path from $s$ to $s'$
$s_0 \xrightarrow{p_1:\ell_1} s_1, \ldots, s_{n-1} \xrightarrow{p_n:\ell_n} s_n$ is called trace from $s$ to $s'$

length of path/trace is $n$
cost of label path/trace is $\sum_{i=1}^{n} c(\ell_i)$
probability of path/trace is $\prod_{i=1}^{n} p_i$
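Cost and probability of a trace are direct transcriptions of the two formulas above; a small sketch against the same illustrative representation:

```python
import math

def label_path_cost(m, labels):
    """Cost of a label path/trace: the sum of c(l_i)."""
    return sum(m.cost[l] for l in labels)

def trace_probability(m, states, labels):
    """Probability of a trace: the product of p_i = T(s_{i-1}, l_i, s_i)."""
    return math.prod(
        m.trans[(states[i], labels[i])][states[i + 1]]
        for i in range(len(labels))
    )
```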

(24)


Finite-horizon Markov Decision Process

Definition (Finite-horizon Markov Decision Process)

A finite-horizon Markov decision process (FH-MDP) is a 6-tuple $\mathcal{T} = \langle S, L, R, T, s_0, H \rangle$, where

$S$ is a finite set of states,
$L$ is a finite set of (transition) labels,
$R : S \times L \to \mathbb{R}$ is the reward function,
$T : S \times L \times S \mapsto [0, 1]$ is the transition function,
$s_0 \in S$ is the initial state, and
$H \in \mathbb{N}$ is the finite horizon.

For all $s \in S$ and $\ell \in L$ with $T(s, \ell, s') > 0$ for some $s' \in S$, we require $\sum_{s' \in S} T(s, \ell, s') = 1$.
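To make the role of the horizon $H$ concrete: the maximal expected reward collectable in at most $H$ steps can be computed by backward induction. Value functions are only introduced later in the course, so this is merely an anticipating sketch; the mdp object is assumed to carry `states`, `reward[(s, l)]` for $R$, `trans` as in the SSP sketch, and `horizon` for $H$, and `applicable_labels` is reused from above.

```python
def backward_induction(mdp):
    V = {s: 0.0 for s in mdp.states}  # V_0(s) = 0: no steps remaining
    for _ in range(mdp.horizon):      # compute V_k from V_{k-1}, k = 1..H
        V = {
            s: max(
                (mdp.reward[(s, l)]
                 + sum(p * V[s2] for s2, p in mdp.trans[(s, l)].items())
                 for l in applicable_labels(mdp, s)),
                default=0.0,          # no applicable label: nothing to collect
            )
            for s in mdp.states
        }
    return V                          # V_H(s): best expected H-step reward from s
```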

(25)

Example: Push Your Luck

(Figure: MDP for the Push Your Luck dice game; each die outcome has probability 1/6.)

(26)


Discounted Reward Markov Decision Process

Definition (Discounted Reward Markov Decision Process)

A discounted reward Markov decision process (DR-MDP) is a 6-tuple $\mathcal{T} = \langle S, L, R, T, s_0, \gamma \rangle$, where

$S$ is a finite set of states,
$L$ is a finite set of (transition) labels,
$R : S \times L \to \mathbb{R}$ is the reward function,
$T : S \times L \times S \mapsto [0, 1]$ is the transition function,
$s_0 \in S$ is the initial state, and
$\gamma \in (0, 1)$ is the discount factor.

For all $s \in S$ and $\ell \in L$ with $T(s, \ell, s') > 0$ for some $s' \in S$, we require $\sum_{s' \in S} T(s, \ell, s') = 1$.
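Because $\gamma^t$ shrinks rewards lying $t$ steps in the future geometrically, the infinite-horizon value is well defined and can be approximated by value iteration. Again a hedged, anticipating sketch rather than the course's own algorithm; `gamma` stores $\gamma$, other fields as in the sketches above.

```python
def value_iteration(mdp, eps=1e-6):
    V = {s: 0.0 for s in mdp.states}
    while True:
        V_new = {
            s: max(
                (mdp.reward[(s, l)]
                 + mdp.gamma * sum(p * V[s2]
                                   for s2, p in mdp.trans[(s, l)].items())
                 for l in applicable_labels(mdp, s)),
                default=0.0,
            )
            for s in mdp.states
        }
        # stop when no state value changes by more than eps
        if max(abs(V_new[s] - V[s]) for s in mdp.states) < eps:
            return V_new
        V = V_new
```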

(27)

Example: Grid World

(Figure: 4×3 grid world with initial state $s_0$; cell (4,3) is marked $+1$, cell (4,2) is marked $-1$.)

each move goes in orthogonal direction with some probability
(4,3) gives reward of +1 and sets position back to (1,1)
(4,2) gives reward of −1

(28)


Summary

(29)

Summary

Many planning scenarios beyond classical planning
We focus on probabilistic planning

SSPs are classical planning + probabilistic transition function
FH-MDPs and DR-MDPs allow state-dependent rewards
FH-MDPs consider finite number of steps
DR-MDPs discount rewards over infinite horizon
