F1. Markov Decision Processes
Gabriele Röger and Thomas Keller
Universität Basel
November 21, 2018
Content of this Course
Planning
- Classical: Tasks, Progression/Regression, Complexity, Heuristics
- Probabilistic: MDPs, Blind Methods, Heuristic Search, Monte-Carlo Methods
Motivation
Limitations of Classical Planning
timetable for astronauts on the ISS:
- concurrency required for some experiments
- optimize makespan
Generalization of Classical Planning: Temporal Planning
Limitations of Classical Planning
kinematics of a robotic arm:
- state space is continuous
- preconditions and effects described by complex functions
Generalization of Classical Planning: Numeric Planning
Limitations of Classical Planning
(figure: satellite with five image patches, numbered 1-5)
satellite takes images of patches on Earth:
- weather forecast is uncertain
- find solution with lowest expected cost
Generalization of Classical Planning: MDPs
Limitations of Classical Planning
Chess:
- there is an opponent with a contradictory objective
Generalization of Classical Planning: Multiplayer Games
Limitations of Classical Planning
Solitaire:
- some state information cannot be observed
- must reason over beliefs for good behaviour
Generalization of Classical Planning: POMDPs
Limitations of Classical Planning
- many applications are combinations of these
- all of these are active research areas
- we focus on one of them: probabilistic planning with Markov decision processes
- MDPs are closely related to games (Why?)
Markov Decision Processes
Markov Decision Processes
- Markov decision processes (MDPs) studied since the 1950s
- work up to the 1980s mostly on theory and basic algorithms for small to medium-sized MDPs
- today, focus on large (typically factored) MDPs
- fundamental data structure for reinforcement learning (not covered in this course) and for probabilistic planning
- different variants exist
Reminder: Transition Systems
Definition (Transition System)
A transition system is a 6-tuple $\mathcal{T} = \langle S, L, c, T, s_0, S_\star \rangle$ where
- $S$ is a finite set of states,
- $L$ is a finite set of (transition) labels,
- $c: L \to \mathbb{R}^+_0$ is a label cost function,
- $T \subseteq S \times L \times S$ is the transition relation,
- $s_0 \in S$ is the initial state, and
- $S_\star \subseteq S$ is the set of goal states.
Reminder: Transition System Example
(figure: transition graph over the states LL, LR, TL, TR, RL, RR)
Logistics problem with one package, one truck, two locations:
- location of package: {L, R, T}
- location of truck: {L, R}
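A minimal Python sketch of this transition system as a data structure. The state encoding (first letter: package location, second letter: truck location), the label names, the unit costs, and the goal set are illustrative assumptions, not part of the slides.

```python
from dataclasses import dataclass

@dataclass
class TransitionSystem:
    states: set         # S: finite set of states
    labels: set         # L: finite set of (transition) labels
    cost: dict          # c: label -> non-negative cost
    transitions: set    # T: set of (s, label, s') triples
    initial_state: str  # s0
    goal_states: set    # S*: set of goal states

# One package, one truck, two locations; state "XY" means
# package at X (L, R, or T for "in truck") and truck at Y.
ts = TransitionSystem(
    states={"LL", "LR", "TL", "TR", "RL", "RR"},
    labels={"move", "load", "unload"},
    cost={"move": 1, "load": 1, "unload": 1},    # assumed unit costs
    transitions={
        ("LL", "move", "LR"), ("LR", "move", "LL"),  # empty truck moves
        ("RL", "move", "RR"), ("RR", "move", "RL"),
        ("TL", "move", "TR"), ("TR", "move", "TL"),  # truck carries package
        ("LL", "load", "TL"), ("TL", "unload", "LL"),
        ("RR", "load", "TR"), ("TR", "unload", "RR"),
    },
    initial_state="LL",
    goal_states={"RL", "RR"},  # assumed goal: package at R
)
```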
Stochastic Shortest Path Problem
Definition (Stochastic Shortest Path Problem)
A stochastic shortest path problem (SSP) is a 6-tuple $\mathcal{T} = \langle S, L, c, T, s_0, S_\star \rangle$, where
- $S$ is a finite set of states,
- $L$ is a finite set of (transition) labels,
- $c: L \to \mathbb{R}^+_0$ is a label cost function,
- $T: S \times L \times S \mapsto [0,1]$ is the transition function,
- $s_0 \in S$ is the initial state, and
- $S_\star \subseteq S$ is the set of goal states.
For all $s \in S$ and $\ell \in L$ with $T(s, \ell, s') > 0$ for some $s' \in S$, we require $\sum_{s' \in S} T(s, \ell, s') = 1$.
Note: An SSP is the probabilistic counterpart of a transition system.
Reminder: Transition System Example
(figure: SSP over the states LL, LR, TL, TR, RL, RR; moves with the package succeed with probability 0.8 and lose the package with probability 0.2)
Logistics problem with one package, one truck, two locations:
- location of package: {L, R, T}
- location of truck: {L, R}
- if truck moves with package, 20% chance of losing package
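The same example as an SSP, sketched in Python with the transition function $T$ as a mapping from (state, label) to a distribution over successors. Only the 0.2/0.8 split is from the slide; the successor of a failed move (the package is dropped at its current location) is an assumption for illustration.

```python
# T: (state, label) -> {successor: probability}
T = {
    ("TL", "move"): {"TR": 0.8, "LR": 0.2},  # carry package L -> R, may drop it at L
    ("TR", "move"): {"TL": 0.8, "RL": 0.2},  # carry package R -> L, may drop it at R
    ("LL", "move"): {"LR": 1.0},             # empty truck moves deterministically
    ("LR", "move"): {"LL": 1.0},
    ("LL", "load"): {"TL": 1.0},
    ("TL", "unload"): {"LL": 1.0},
}

# The SSP definition requires: whenever a label is applicable in a state,
# the outcome probabilities sum to 1.
for (s, label), outcomes in T.items():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-9, (s, label)
```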
Terminology (1)
- If $p := T(s, \ell, s') > 0$, we write $s \xrightarrow{p:\ell} s'$, or $s \xrightarrow{p} s'$ if not interested in $\ell$.
- If $T(s, \ell, s') = 1$, we also write $s \xrightarrow{\ell} s'$, or $s \to s'$ if not interested in $\ell$.
- If $T(s, \ell, s') > 0$ for some $s'$, we say that $\ell$ is applicable in $s$.
- The set of labels applicable in $s$ is $L(s)$.
Terminology (2)
- the successor set of $s$ and $\ell$ is $\mathit{succ}(s, \ell) = \{ s' \in S \mid T(s, \ell, s') > 0 \}$
- $s'$ is a successor of $s$ if $s' \in \mathit{succ}(s, \ell)$ for some $\ell$
- $s$ is a predecessor of $s'$ if $s' \in \mathit{succ}(s, \ell)$ for some $\ell$
- with $s' \sim \mathit{succ}(s, \ell)$ we denote that a successor $s' \in \mathit{succ}(s, \ell)$ of $s$ and $\ell$ is sampled according to the probability distribution given by $T$
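A sketch of how $s' \sim \mathit{succ}(s, \ell)$ could be realized, assuming the dictionary representation of $T$ from the SSP example above:

```python
import random

def sample_successor(T, s, label):
    """Draw s' ~ succ(s, label): a successor of s and label, sampled
    according to the transition function T, where T maps (state, label)
    to a {successor: probability} dict."""
    outcomes = T[(s, label)]
    successors = list(outcomes.keys())
    weights = list(outcomes.values())
    return random.choices(successors, weights=weights, k=1)[0]
```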
Terminology (3)
- $s'$ is reachable from $s$ if there exists a sequence of transitions $s_0 \xrightarrow{p_1:\ell_1} s_1, \ldots, s_{n-1} \xrightarrow{p_n:\ell_n} s_n$ s.t. $s_0 = s$ and $s_n = s'$
  Note: $n = 0$ possible; then $s = s'$
- $s_0, \ldots, s_n$ is called (state) path from $s$ to $s'$
- $\ell_1, \ldots, \ell_n$ is called (label) path from $s$ to $s'$
- $s_0 \xrightarrow{\ell_1} s_1, \ldots, s_{n-1} \xrightarrow{\ell_n} s_n$ is called trace from $s$ to $s'$
- length of path/trace is $n$
- cost of label path/trace is $\sum_{i=1}^{n} c(\ell_i)$
- probability of path/trace is $\prod_{i=1}^{n} p_i$
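The cost and probability formulas computed directly in Python; representing a trace as a list of $(p_i, \ell_i)$ pairs is an assumption for illustration:

```python
import math

def trace_cost(trace, cost):
    """Cost of a trace: sum_{i=1}^{n} c(l_i)."""
    return sum(cost[label] for _, label in trace)

def trace_probability(trace):
    """Probability of a trace: prod_{i=1}^{n} p_i."""
    return math.prod(p for p, _ in trace)

# Example: carrying the package twice in the logistics SSP above.
trace = [(0.8, "move"), (0.8, "move")]
print(trace_cost(trace, {"move": 1}))  # 2
print(trace_probability(trace))        # 0.64 (up to float rounding)
```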
Finite-horizon Markov Decision Process
Definition (Finite-horizon Markov Decision Process)
A finite-horizon Markov decision process (FH-MDP) is a 6-tuple $\mathcal{T} = \langle S, L, R, T, s_0, H \rangle$, where
- $S$ is a finite set of states,
- $L$ is a finite set of (transition) labels,
- $R: S \times L \to \mathbb{R}$ is the reward function,
- $T: S \times L \times S \mapsto [0,1]$ is the transition function,
- $s_0 \in S$ is the initial state, and
- $H \in \mathbb{N}$ is the finite horizon.
For all $s \in S$ and $\ell \in L$ with $T(s, \ell, s') > 0$ for some $s' \in S$, we require $\sum_{s' \in S} T(s, \ell, s') = 1$.
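A sketch of the finite-horizon semantics: rewards are collected for $H$ steps, and the quantity of interest is the expected total reward. Fixing one label per state via `choice` is a simplifying assumption for illustration, not a concept from this slide:

```python
def expected_reward(T, R, choice, s, steps):
    """Expected total reward over the remaining steps when in every
    state s the label choice[s] is applied. T maps (state, label) to a
    {successor: probability} dict; R maps (state, label) to a reward."""
    if steps == 0:
        return 0.0
    label = choice[s]
    return R[(s, label)] + sum(
        p * expected_reward(T, R, choice, succ, steps - 1)
        for succ, p in T[(s, label)].items()
    )
```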
Example: Push Your Luck
(figure: FH-MDP for the Push Your Luck dice game; each die outcome occurs with probability 1/6, with rewards 0 and 2 on the depicted transitions)
Discounted Reward Markov Decision Process
Definition (Discounted Reward Markov Decision Process)
A discounted reward Markov decision process (DR-MDP) is a 6-tuple $\mathcal{T} = \langle S, L, R, T, s_0, \gamma \rangle$, where
- $S$ is a finite set of states,
- $L$ is a finite set of (transition) labels,
- $R: S \times L \to \mathbb{R}$ is the reward function,
- $T: S \times L \times S \mapsto [0,1]$ is the transition function,
- $s_0 \in S$ is the initial state, and
- $\gamma \in (0,1)$ is the discount factor.
For all $s \in S$ and $\ell \in L$ with $T(s, \ell, s') > 0$ for some $s' \in S$, we require $\sum_{s' \in S} T(s, \ell, s') = 1$.
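Why $\gamma \in (0,1)$ matters: it keeps the reward accumulated over an infinite horizon finite. A standard bound (not on the slide), assuming $|r_t| \le R_{\max}$ for all collected rewards $r_0, r_1, \ldots$:

```latex
\Bigl|\sum_{t=0}^{\infty} \gamma^t r_t\Bigr|
  \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max}
  \;=\; \frac{R_{\max}}{1-\gamma}
  \qquad \text{for } \gamma \in (0,1).
```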
Example: Grid World
(figure: 4x3 grid with initial state s0 at (1,1), reward −1 at (4,2), reward +1 at (4,3))
- each move goes in an orthogonal direction with some probability
- (4,3) gives reward of +1 and sets position back to (1,1)
- (4,2) gives reward of −1
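A sketch of the noisy move dynamics; the slide only says "with some probability", so the slip probability of 0.1 per orthogonal side is an assumption, and walls are ignored:

```python
import random

def noisy_move(position, direction, slip_prob=0.1):
    """Apply a grid-world move: with probability slip_prob each, the agent
    slips to one of the two orthogonal directions instead of moving as
    intended. direction is a (dx, dy) unit vector."""
    x, y = position
    dx, dy = direction
    left, right = (-dy, dx), (dy, -dx)  # the two orthogonal directions
    r = random.random()
    if r < slip_prob:
        dx, dy = left
    elif r < 2 * slip_prob:
        dx, dy = right
    return (x + dx, y + dy)
```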
Summary
- many planning scenarios beyond classical planning
- we focus on probabilistic planning
- SSPs are classical planning + probabilistic transition function
- FH-MDPs and DR-MDPs allow state-dependent rewards
- FH-MDPs consider a finite number of steps
- DR-MDPs discount rewards over an infinite horizon