F1. Markov Decision Processes
Gabriele Röger and Thomas Keller
Universität Basel
November 21, 2018
Content of this Course
Planning
- Classical: Tasks, Progression/Regression, Complexity, Heuristics
- Probabilistic: MDPs, Blind Methods, Heuristic Search, Monte-Carlo Methods
Motivation
Limitations of Classical Planning
timetable for astronauts on the ISS:
- concurrency required for some experiments
- optimize makespan
Generalization of Classical Planning: Temporal Planning
Limitations of Classical Planning
kinematics of a robotic arm:
- state space is continuous
- preconditions and effects described by complex functions
Generalization of Classical Planning: Numeric Planning
Limitations of Classical Planning
(figure: satellite with five image patches, numbered 1-5)
satellite takes images of patches on Earth:
- weather forecast is uncertain
- find solution with lowest expected cost
Generalization of Classical Planning: MDPs
Limitations of Classical Planning
Chess:
- there is an opponent with a contradictory objective
Generalization of Classical Planning: Multiplayer Games
Limitations of Classical Planning
Solitaire:
- some state information cannot be observed
- must reason over beliefs for good behaviour
Generalization of Classical Planning: POMDPs
Limitations of Classical Planning
- many applications are combinations of these
- all of these are active research areas
- we focus on one of them: probabilistic planning with Markov decision processes
- MDPs are closely related to games (Why?)
Markov Decision Processes
Markov Decision Processes
- Markov decision processes (MDPs) studied since the 1950s
- work up to the 1980s mostly on theory and basic algorithms for small to medium-sized MDPs
- today, focus on large (typically factored) MDPs
- fundamental data structure for reinforcement learning (not covered in this course) and for probabilistic planning
- different variants exist
Reminder: Transition Systems
Definition (Transition System)
A transition system is a 6-tuple $\mathcal{T} = \langle S, L, c, T, s_0, S_\star \rangle$ where
- $S$ is a finite set of states,
- $L$ is a finite set of (transition) labels,
- $c: L \to \mathbb{R}^+_0$ is a label cost function,
- $T \subseteq S \times L \times S$ is the transition relation,
- $s_0 \in S$ is the initial state, and
- $S_\star \subseteq S$ is the set of goal states.
Reminder: Transition System Example
(figure: transition graph over the states LL, LR, TL, TR, RL, RR)
Logistics problem with one package, one truck, two locations:
- location of package: {L, R, T}
- location of truck: {L, R}
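A minimal Python sketch of this transition system as a data structure. The state encoding (first letter: package location, second letter: truck location), the label names, the unit costs, and the goal set are illustrative assumptions, not part of the slides.

```python
from dataclasses import dataclass

@dataclass
class TransitionSystem:
    states: set         # S: finite set of states
    labels: set         # L: finite set of (transition) labels
    cost: dict          # c: label -> non-negative cost
    transitions: set    # T: set of (s, label, s') triples
    initial_state: str  # s0
    goal_states: set    # S*: set of goal states

# One package, one truck, two locations; state "XY" means
# package at X (L, R, or T for "in truck") and truck at Y.
ts = TransitionSystem(
    states={"LL", "LR", "TL", "TR", "RL", "RR"},
    labels={"move", "load", "unload"},
    cost={"move": 1, "load": 1, "unload": 1},    # assumed unit costs
    transitions={
        ("LL", "move", "LR"), ("LR", "move", "LL"),  # empty truck moves
        ("RL", "move", "RR"), ("RR", "move", "RL"),
        ("TL", "move", "TR"), ("TR", "move", "TL"),  # truck carries package
        ("LL", "load", "TL"), ("TL", "unload", "LL"),
        ("RR", "load", "TR"), ("TR", "unload", "RR"),
    },
    initial_state="LL",
    goal_states={"RL", "RR"},  # assumed goal: package at R
)
```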
Stochastic Shortest Path Problem
Definition (Stochastic Shortest Path Problem)
A stochastic shortest path problem (SSP) is a 6-tuple $\mathcal{T} = \langle S, L, c, T, s_0, S_\star \rangle$, where
- $S$ is a finite set of states,
- $L$ is a finite set of (transition) labels,
- $c: L \to \mathbb{R}^+_0$ is a label cost function,
- $T: S \times L \times S \mapsto [0,1]$ is the transition function,
- $s_0 \in S$ is the initial state, and
- $S_\star \subseteq S$ is the set of goal states.
For all $s \in S$ and $\ell \in L$ with $T(s, \ell, s') > 0$ for some $s' \in S$, we require $\sum_{s' \in S} T(s, \ell, s') = 1$.
Note: An SSP is the probabilistic counterpart of a transition system.
Reminder: Transition System Example
(figure: SSP over the states LL, LR, TL, TR, RL, RR; moves with the package succeed with probability 0.8 and lose the package with probability 0.2)
Logistics problem with one package, one truck, two locations:
- location of package: {L, R, T}
- location of truck: {L, R}
- if truck moves with package, 20% chance of losing package
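The same example as an SSP, sketched in Python with the transition function $T$ as a mapping from (state, label) to a distribution over successors. Only the 0.2/0.8 split is from the slide; the successor of a failed move (the package is dropped at its current location) is an assumption for illustration.

```python
# T: (state, label) -> {successor: probability}
T = {
    ("TL", "move"): {"TR": 0.8, "LR": 0.2},  # carry package L -> R, may drop it at L
    ("TR", "move"): {"TL": 0.8, "RL": 0.2},  # carry package R -> L, may drop it at R
    ("LL", "move"): {"LR": 1.0},             # empty truck moves deterministically
    ("LR", "move"): {"LL": 1.0},
    ("LL", "load"): {"TL": 1.0},
    ("TL", "unload"): {"LL": 1.0},
}

# The SSP definition requires: whenever a label is applicable in a state,
# the outcome probabilities sum to 1.
for (s, label), outcomes in T.items():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-9, (s, label)
```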
Terminology (1)
- If $p := T(s, \ell, s') > 0$, we write $s \xrightarrow{p:\ell} s'$, or $s \xrightarrow{p} s'$ if not interested in $\ell$.
- If $T(s, \ell, s') = 1$, we also write $s \xrightarrow{\ell} s'$, or $s \to s'$ if not interested in $\ell$.
- If $T(s, \ell, s') > 0$ for some $s'$, we say that $\ell$ is applicable in $s$.
- The set of labels applicable in $s$ is $L(s)$.
Terminology (2)
- the successor set of $s$ and $\ell$ is $\mathit{succ}(s, \ell) = \{ s' \in S \mid T(s, \ell, s') > 0 \}$
- $s'$ is a successor of $s$ if $s' \in \mathit{succ}(s, \ell)$ for some $\ell$
- $s$ is a predecessor of $s'$ if $s' \in \mathit{succ}(s, \ell)$ for some $\ell$
- with $s' \sim \mathit{succ}(s, \ell)$ we denote that a successor $s' \in \mathit{succ}(s, \ell)$ of $s$ and $\ell$ is sampled according to the probability distribution given by $T$
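A sketch of how $s' \sim \mathit{succ}(s, \ell)$ could be realized, assuming the dictionary representation of $T$ from the SSP example above:

```python
import random

def sample_successor(T, s, label):
    """Draw s' ~ succ(s, label): a successor of s and label, sampled
    according to the transition function T, where T maps (state, label)
    to a {successor: probability} dict."""
    outcomes = T[(s, label)]
    successors = list(outcomes.keys())
    weights = list(outcomes.values())
    return random.choices(successors, weights=weights, k=1)[0]
```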
Terminology (3)
- $s'$ is reachable from $s$ if there exists a sequence of transitions $s_0 \xrightarrow{p_1:\ell_1} s_1, \ldots, s_{n-1} \xrightarrow{p_n:\ell_n} s_n$ s.t. $s_0 = s$ and $s_n = s'$
  Note: $n = 0$ possible; then $s = s'$
- $s_0, \ldots, s_n$ is called (state) path from $s$ to $s'$
- $\ell_1, \ldots, \ell_n$ is called (label) path from $s$ to $s'$
- $s_0 \xrightarrow{\ell_1} s_1, \ldots, s_{n-1} \xrightarrow{\ell_n} s_n$ is called trace from $s$ to $s'$
- length of path/trace is $n$
- cost of label path/trace is $\sum_{i=1}^{n} c(\ell_i)$
- probability of path/trace is $\prod_{i=1}^{n} p_i$
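The cost and probability formulas computed directly in Python; representing a trace as a list of $(p_i, \ell_i)$ pairs is an assumption for illustration:

```python
import math

def trace_cost(trace, cost):
    """Cost of a trace: sum_{i=1}^{n} c(l_i)."""
    return sum(cost[label] for _, label in trace)

def trace_probability(trace):
    """Probability of a trace: prod_{i=1}^{n} p_i."""
    return math.prod(p for p, _ in trace)

# Example: carrying the package twice in the logistics SSP above.
trace = [(0.8, "move"), (0.8, "move")]
print(trace_cost(trace, {"move": 1}))  # 2
print(trace_probability(trace))        # 0.64 (up to float rounding)
```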
Finite-horizon Markov Decision Process
Definition (Finite-horizon Markov Decision Process)
A finite-horizon Markov decision process (FH-MDP) is a 6-tuple $\mathcal{T} = \langle S, L, R, T, s_0, H \rangle$, where
- $S$ is a finite set of states,
- $L$ is a finite set of (transition) labels,
- $R: S \times L \to \mathbb{R}$ is the reward function,
- $T: S \times L \times S \mapsto [0,1]$ is the transition function,
- $s_0 \in S$ is the initial state, and
- $H \in \mathbb{N}$ is the finite horizon.
For all $s \in S$ and $\ell \in L$ with $T(s, \ell, s') > 0$ for some $s' \in S$, we require $\sum_{s' \in S} T(s, \ell, s') = 1$.
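A sketch of the finite-horizon semantics: rewards are collected for $H$ steps, and the quantity of interest is the expected total reward. Fixing one label per state via `choice` is a simplifying assumption for illustration, not a concept from this slide:

```python
def expected_reward(T, R, choice, s, steps):
    """Expected total reward over the remaining steps when in every
    state s the label choice[s] is applied. T maps (state, label) to a
    {successor: probability} dict; R maps (state, label) to a reward."""
    if steps == 0:
        return 0.0
    label = choice[s]
    return R[(s, label)] + sum(
        p * expected_reward(T, R, choice, succ, steps - 1)
        for succ, p in T[(s, label)].items()
    )
```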
Example: Push Your Luck
(figure: FH-MDP for the Push Your Luck dice game; each die outcome occurs with probability 1/6, with rewards 0 and 2 on the depicted transitions)
Discounted Reward Markov Decision Process
Definition (Discounted Reward Markov Decision Process)
A discounted reward Markov decision process (DR-MDP) is a 6-tuple $\mathcal{T} = \langle S, L, R, T, s_0, \gamma \rangle$, where
- $S$ is a finite set of states,
- $L$ is a finite set of (transition) labels,
- $R: S \times L \to \mathbb{R}$ is the reward function,
- $T: S \times L \times S \mapsto [0,1]$ is the transition function,
- $s_0 \in S$ is the initial state, and
- $\gamma \in (0,1)$ is the discount factor.
For all $s \in S$ and $\ell \in L$ with $T(s, \ell, s') > 0$ for some $s' \in S$, we require $\sum_{s' \in S} T(s, \ell, s') = 1$.
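Why $\gamma \in (0,1)$ matters: it keeps the reward accumulated over an infinite horizon finite. A standard bound (not on the slide), assuming $|r_t| \le R_{\max}$ for all collected rewards $r_0, r_1, \ldots$:

```latex
\Bigl|\sum_{t=0}^{\infty} \gamma^t r_t\Bigr|
  \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max}
  \;=\; \frac{R_{\max}}{1-\gamma}
  \qquad \text{for } \gamma \in (0,1).
```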
Example: Grid World
(figure: 4x3 grid with initial state s0 at (1,1), reward −1 at (4,2), reward +1 at (4,3))
- each move goes in an orthogonal direction with some probability
- (4,3) gives reward of +1 and sets position back to (1,1)
- (4,2) gives reward of −1
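A sketch of the noisy move dynamics; the slide only says "with some probability", so the slip probability of 0.1 per orthogonal side is an assumption, and walls are ignored:

```python
import random

def noisy_move(position, direction, slip_prob=0.1):
    """Apply a grid-world move: with probability slip_prob each, the agent
    slips to one of the two orthogonal directions instead of moving as
    intended. direction is a (dx, dy) unit vector."""
    x, y = position
    dx, dy = direction
    left, right = (-dy, dx), (dy, -dx)  # the two orthogonal directions
    r = random.random()
    if r < slip_prob:
        dx, dy = left
    elif r < 2 * slip_prob:
        dx, dy = right
    return (x + dx, y + dy)
```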
Summary
- many planning scenarios beyond classical planning
- we focus on probabilistic planning
- SSPs are classical planning + probabilistic transition function
- FH-MDPs and DR-MDPs allow state-dependent rewards
- FH-MDPs consider a finite number of steps
- DR-MDPs discount rewards over an infinite horizon