Planning and Optimization
G1. Factored MDPs
Malte Helmert and Gabriele Röger
Universität Basel
December 7, 2020
M. Helmert, G. Röger (Universität Basel) Planning and Optimization December 7, 2020
Planning and Optimization
December 7, 2020 — G1. Factored MDPs
G1.1 Factored MDPs
G1.2 Probabilistic Planning Tasks
G1.3 Complexity
G1.4 Estimated Policy Evaluation
G1.5 Summary
Content of this Course
Planning
Classical
Foundations Logic Heuristics Constraints
Probabilistic
Explicit MDPs Factored MDPs
Content of this Course: Factored MDPs
Factored MDPs
Foundations
Heuristic Search
Monte-Carlo Methods
G1.1 Factored MDPs
Factored MDPs
We would like to specify MDPs and SSPs with large state spaces.
In classical planning, we introduced planning tasks to represent large transition systems compactly:
▸ represent aspects of the world in terms of state variables
▸ states are a valuation of state variables
▸ n (propositional) state variables induce 2^n states
⇒ exponentially more compact than "explicit" representation
Finite-Domain State Variables
Definition (Finite-Domain State Variable)
A finite-domain state variable is a symbol v with an associated domain dom(v), which is a finite non-empty set of values.
Let V be a finite set of finite-domain state variables. A state s over V is an assignment s : V → ⋃_{v ∈ V} dom(v) such that s(v) ∈ dom(v) for all v ∈ V.
A formula over V is a propositional logic formula whose atomic propositions are of the form v = d where v ∈ V and d ∈ dom(v).
For simplicity, we only consider finite-domain state variables here.
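Such a factored state space can be sketched directly as a small data structure; the variable names and domains below are illustrative examples, not taken from the lecture.

```python
# Sketch: finite-domain state variables as a dict from variable name to
# its (finite, non-empty) domain; a state assigns each variable a value
# from its own domain. All names here are illustrative.
variables = {"location": {"home", "office"}, "battery": {0, 1, 2}}

def is_state(s, variables):
    """Check that s is a state over the given variables."""
    return set(s) == set(variables) and all(
        s[v] in dom for v, dom in variables.items())

s = {"location": "home", "battery": 2}
print(is_state(s, variables))                      # True
print(is_state({"location": "moon"}, variables))   # False
```

Note that the dict of domains describes exponentially many states (here 2 · 3 = 6) without ever enumerating them.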
Syntax of Operators
Definition (SSP and MDP Operators)
An SSP operator o over a set of state variables V has three components:
▸ a precondition pre(o), a logical formula over V,
▸ an effect eff(o) over V, defined on the following slides, and
▸ a cost cost(o) ∈ ℝ₀⁺.
An MDP operator o over a set of state variables V has three components:
▸ a precondition pre(o), a logical formula over V,
▸ an effect eff(o) over V, defined on the following slides, and
▸ a reward reward(o) over V, defined on the following slides.
Whenever we just say operator (without SSP or MDP), the statement applies to both kinds.
Syntax of Effects
Definition (Effect)
Effects over state variables V are inductively defined as follows:
▸ If v ∈ V is a finite-domain state variable and d ∈ dom(v), then v := d is an effect (atomic effect).
▸ If e1, ..., en are effects, then (e1 ∧ ··· ∧ en) is an effect (conjunctive effect). The special case with n = 0 is the empty effect ⊤.
▸ If e1, ..., en are effects and p1, ..., pn ∈ [0, 1] such that ∑_{i=1}^{n} pi = 1, then (p1 : e1 | ... | pn : en) is an effect (probabilistic effect).
Note: To simplify definitions, conditional effects are omitted.
Effects: Intuition
Intuition for effects:
▸ Atomic effects can be understood as assignments that update the value of a state variable.
▸ A conjunctive effect e = (e1 ∧ ··· ∧ en) means that all subeffects e1, ..., en take place simultaneously.
▸ A probabilistic effect e = (p1 : e1 | ... | pn : en) means that exactly one subeffect ei ∈ {e1, ..., en} takes place, with probability pi.
Semantics of Effects
Definition
The effect set [e] of an effect e is a set of pairs ⟨p, w⟩, where p is a probability with 0 < p ≤ 1 and w is a partial assignment. The effect set [e] is obtained recursively as

[v := d] = {⟨1.0, {v ↦ d}⟩},
[e ∧ e′] = ⨄_{⟨p,w⟩ ∈ [e], ⟨p′,w′⟩ ∈ [e′]} {⟨p · p′, w ∪ w′⟩},
[p1 : e1 | ... | pn : en] = ⨄_{i=1}^{n} {⟨pi · p, w⟩ | ⟨p, w⟩ ∈ [ei]},

where ⨄ is like ∪ but merges ⟨p, w′⟩ and ⟨p′, w′⟩ to ⟨p + p′, w′⟩.
Semantics of Operators
Definition (Applicable, Outcomes)
Let V be a set of finite-domain state variables, let s be a state over V, and let o be an operator over V. Operator o is applicable in s if s ⊨ pre(o).
The outcomes of applying an operator o in s, written s⟦o⟧, are

s⟦o⟧ = ⨄_{⟨p,w⟩ ∈ [eff(o)]} {⟨p, s′_w⟩}

with s′_w(v) = d if v = d ∈ w and s′_w(v) = s(v) otherwise, and where ⨄ is like ∪ but merges ⟨p, s′⟩ and ⟨p′, s′⟩ to ⟨p + p′, s′⟩.
Rewards
Definition (Reward)
A reward over state variables V is inductively defined as follows:
▸ c ∈ ℝ is a reward,
▸ if χ is a propositional formula over V, then [χ] is a reward, and
▸ if r and r′ are rewards, then r + r′, r − r′, r · r′ and r/r′ are rewards.
Applying an MDP operator o in s induces the reward reward(o)(s), i.e., the value of the arithmetic function reward(o) where all occurrences of v ∈ V are replaced with s(v).
G1.2 Probabilistic Planning Tasks
Probabilistic Planning Tasks
Definition (SSP and MDP Planning Task)
An SSP planning task is a 4-tuple Π = ⟨V, I, O, γ⟩ where
▸ V is a finite set of finite-domain state variables,
▸ I is a valuation over V called the initial state,
▸ O is a finite set of SSP operators over V, and
▸ γ is a formula over V called the goal.
An MDP planning task is a 4-tuple Π = ⟨V, I, O, d⟩ where
▸ V is a finite set of finite-domain state variables,
▸ I is a valuation over V called the initial state,
▸ O is a finite set of MDP operators over V, and
▸ d ∈ (0, 1) is the discount factor.
A probabilistic planning taskis an SSP or MDP planning task.
Mapping SSP Planning Tasks to SSPs
Definition (SSP Induced by an SSP Planning Task)
The SSP planning task Π = ⟨V, I, O, γ⟩ induces the SSP T = ⟨S, A, c, T, s0, S⋆⟩, where
▸ S is the set of all states over V,
▸ A is the set of operators O,
▸ c(o) = cost(o) for all o ∈ O,
▸ T(s, o, s′) = p if o is applicable in s and ⟨p, s′⟩ ∈ s⟦o⟧, and T(s, o, s′) = 0 otherwise,
▸ s0 = I, and
▸ S⋆ = {s ∈ S | s ⊨ γ}.
Mapping MDP Planning Tasks to MDPs
Definition (MDP Induced by an MDP Planning Task)
The MDP planning task Π = ⟨V, I, O, d⟩ induces the MDP T = ⟨S, A, R, T, s0, γ⟩, where
▸ S is the set of all states over V,
▸ A is the set of operators O,
▸ R(s, o) = reward(o)(s) for all o ∈ O and s ∈ S,
▸ T(s, o, s′) = p if o is applicable in s and ⟨p, s′⟩ ∈ s⟦o⟧, and T(s, o, s′) = 0 otherwise,
▸ s0 = I, and
▸ γ = d.
G1.3 Complexity
Complexity of Probabilistic Planning
Definition (Policy Existence)
Policy existence (PolicyEx) is the following decision problem:
Given: SSP planning task Π
Question: Is there a proper policy for Π?
Membership in EXP
Theorem
PolicyEx ∈ EXP
Proof.
The number of states of an SSP planning task is exponential in the number of variables. The induced SSP can be solved in time polynomial in |S| · |A| via linear programming, and hence in time exponential in the input size.
EXP-completeness of Probabilistic Planning
Theorem
PolicyEx is EXP-complete.
Proof Sketch.
Membership for PolicyEx: see previous slide.
Hardness is shown by Littman (1997) by reducing the EXP-complete game G4 to PolicyEx.
G1.4 Estimated Policy Evaluation
Large SSPs and MDPs
▸ Before: optimal policies and exact state-values for small SSPs and MDPs
▸ Now: focus on large SSPs and MDPs
▸ Further algorithms are not necessarily optimal (they may generate suboptimal policies)
Interleaved Planning & Execution
▸ The number of reachable states of a policy is usually exponential in the number of state variables.
▸ For large SSPs and MDPs, policies cannot be provided explicitly.
▸ Solution: a (possibly approximate) compact representation of the policy is required to describe a solution ⇒ not part of this lecture.
▸ Alternative solution: interleave planning and execution.
Interleaved Planning & Execution for SSPs
Plan-execute-monitor cycle for SSP T:
▸ plan action a for the current state s
▸ execute a
▸ observe new current state s′
▸ set s := s′
▸ repeat until s ∈ S⋆
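The cycle above can be sketched as a simple loop; `planner`, `step` and `is_goal` are hypothetical stand-ins for the planner, the observed environment and the goal test, and the `max_steps` safety bound is an addition for illustration.

```python
# Sketch of the plan-execute-monitor cycle for an SSP.
def plan_execute_monitor(s, planner, step, is_goal, max_steps=1000):
    trajectory = [s]
    for _ in range(max_steps):
        if is_goal(s):                 # s in S*: done
            break
        a = planner(s)                 # plan action a for current state s
        s = step(s, a)                 # execute a, observe new state s'
        trajectory.append(s)           # set s := s'
    return trajectory

# Toy usage: increment a counter until it reaches the goal value 3.
traj = plan_execute_monitor(0, planner=lambda s: "inc",
                            step=lambda s, a: s + 1,
                            is_goal=lambda s: s >= 3)
print(traj)   # [0, 1, 2, 3]
```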
Interleaved Planning & Execution for MDPs
Plan-execute-monitor cycle for MDP T:
▸ plan action a for the current state s
▸ execute a
▸ observe new current state s′
▸ set s := s′
▸ repeat until discounted reward is sufficiently small
Interleaved Planning & Execution in Practice
▸ avoids the loss of precision that often comes with a compact description of the policy
▸ does not waste time with planning for states that are never reached during execution
▸ poor decisions can be avoided by spending more time with planning before execution
▸ in SSPs, poor decisions can even mean that the computed policy is not proper and execution never reaches the goal
▸ in MDPs, it is not clear when the discounted reward is sufficiently small
Estimated Policy Evaluation
▸ The quality of a policy π is described by the state-value of the initial state, Vπ(s0).
▸ The quality of a given policy π can be computed (via LP or backward induction) or approximated arbitrarily closely (via iterative policy evaluation) in small SSPs or MDPs.
▸ This is impossible if planning and execution are interleaved, as the policy is incomplete.
⇒ Estimate the quality of policy π by executing it n ∈ ℕ times.
Executing a Policy
Definition (Run in SSP)
Let T be an SSP and π be a proper policy for T. A sequence of transitions
ρπ = s0 −p1 : π(s0)→ s1, ..., s_{n−1} −pn : π(s_{n−1})→ sn
is a run ρπ of π if s_{i+1} ∼ si⟦π(si)⟧ and sn ∈ S⋆.
The cost of run ρπ is cost(ρπ) = ∑_{i=0}^{n−1} cost(π(si)).
A run in an SSP can easily be generated by executing π from s0 until a state s ∈ S⋆ is encountered.
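Generating one run by sampling can be sketched as follows; `outcomes`, `cost` and `is_goal` are hypothetical interfaces to the SSP, and π is assumed proper so the loop terminates.

```python
import random

# Sketch: execute a proper policy pi from s until a goal state is reached,
# sampling each successor s_{i+1} ~ s_i[[pi(s_i)]]; returns the run's cost.
def sample_run(s, pi, outcomes, cost, is_goal, rng=None):
    rng = rng or random.Random(0)
    total = 0.0
    while not is_goal(s):
        a = pi(s)
        total += cost(a)
        probs = [p for p, _ in outcomes(s, a)]
        succs = [t for _, t in outcomes(s, a)]
        s = rng.choices(succs, weights=probs)[0]
    return total

# Toy usage: deterministic chain 0 -> 1 -> 2, unit costs, goal s >= 2.
c = sample_run(0, pi=lambda s: "step",
               outcomes=lambda s, a: [(1.0, s + 1)],
               cost=lambda a: 1.0,
               is_goal=lambda s: s >= 2)
print(c)   # 2.0
```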
Executing a Policy
Definition (Run in MDP)
Let T be an MDP and π be a policy for T. A sequence of transitions
ρπ = s0 −p1 : π(s0)→ s1, ..., s_{n−1} −pn : π(s_{n−1})→ sn
is a run ρπ of π if s_{i+1} ∼ si⟦π(si)⟧.
The reward of run ρπ is reward(ρπ) = ∑_{i=0}^{n−1} γ^i · reward(si, π(si)).
To generate a run, a termination criterion (e.g., based on the change of the accumulated reward) must be specified.
Estimated Policy Evaluation
Definition (Estimated Policy Evaluation)
Let T be an SSP, π be a policy for T and ⟨ρ¹π, ..., ρⁿπ⟩ be a sequence of runs of π.
The estimated quality of π via estimated policy evaluation is
Ṽπ := (1/n) · ∑_{i=1}^{n} cost(ρⁱπ).
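The estimator is just the sample mean of run costs. In the sketch below, the argument `sample_run` is a hypothetical function that draws one run of π and returns its cost; the toy cost distribution is invented for illustration.

```python
import random

# Monte-Carlo estimate of V_pi(s0): average the costs of n sampled runs.
def estimate_policy_quality(sample_run, n):
    return sum(sample_run() for _ in range(n)) / n

# Toy usage: a run costs 1 or 3 with probability 0.5 each, so the true
# expected cost is 2; with n = 10000 the estimate should be close.
rng = random.Random(1)
est = estimate_policy_quality(lambda: rng.choice([1.0, 3.0]), n=10000)
print(abs(est - 2.0) < 0.1)   # True
```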
Convergence of Estimated Policy Evaluation in SSPs
Theorem
Let T be an SSP, π be a policy for T and ⟨ρ¹π, ..., ρⁿπ⟩ be a sequence of runs of π.
Then Ṽπ → Vπ(s0) for n → ∞.
Proof.
Holds due to the strong law of large numbers.
⇒ Ṽπ is a good approximation of Vπ(s0) if n is sufficiently large.
G1.5 Summary
Summary
▸ MDP and SSP planning tasks represent MDPs and SSPs compactly.
▸ Policy existence in SSPs is EXP-complete.
▸ Interleaving planning and execution avoids the representation issues of a (typically exponentially sized) policy.
▸ The quality of such an incomplete policy can be estimated by executing it a fixed number of times.
▸ In SSPs, estimated policy evaluation converges to the true quality of the policy.