
Planning and Optimization

G1. Factored MDPs

Malte Helmert and Gabriele Röger

Universität Basel

December 7, 2020


Planning and Optimization

December 7, 2020 — G1. Factored MDPs

G1.1 Factored MDPs

G1.2 Probabilistic Planning Tasks
G1.3 Complexity

G1.4 Estimated Policy Evaluation
G1.5 Summary


Content of this Course

[Course overview diagram: Planning splits into Classical (Foundations, Logic, Heuristics, Constraints) and Probabilistic (Explicit MDPs, Factored MDPs).]

Content of this Course: Factored MDPs

[Overview diagram of the Factored MDPs part: Foundations, Heuristic Search, Monte-Carlo Methods.]


G1.1 Factored MDPs


Factored MDPs

We would like to specify MDPs and SSPs with large state spaces.

In classical planning, we introduced planning tasks to represent large transition systems compactly.

• represent aspects of the world in terms of state variables
• states are a valuation of state variables
• n (propositional) state variables induce 2^n states

exponentially more compact than the "explicit" representation


Finite-Domain State Variables

Definition (Finite-Domain State Variable)

A finite-domain state variable is a symbol v with an associated domain dom(v), which is a finite non-empty set of values.

Let V be a finite set of finite-domain state variables.

A state s over V is an assignment s : V → ⋃_{v∈V} dom(v) such that s(v) ∈ dom(v) for all v ∈ V.

A formula over V is a propositional logic formula whose atomic propositions are of the form v = d, where v ∈ V and d ∈ dom(v).

For simplicity, we only consider finite-domain state variables here.
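
To make the definition concrete, here is a minimal Python sketch (not part of the lecture): states are represented as dicts, and the variable names and helper functions is_state and holds_atom are illustrative assumptions.

```python
# Illustrative only: V is modelled as a dict mapping each state variable
# to its finite, non-empty domain dom(v).
dom = {
    "location": {"home", "office", "cafe"},   # dom(location)
    "has-coffee": {True, False},              # dom(has-coffee)
}

def is_state(s: dict, dom: dict) -> bool:
    """A state s over V assigns every v in V a value s(v) in dom(v)."""
    return set(s) == set(dom) and all(s[v] in dom[v] for v in dom)

def holds_atom(s: dict, v, d) -> bool:
    """Atomic proposition 'v = d' of a formula over V."""
    return s[v] == d

s = {"location": "office", "has-coffee": False}
assert is_state(s, dom)
assert holds_atom(s, "location", "office")
```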


Syntax of Operators

Definition (SSP and MDP Operators)

An SSP operator o over a set of state variables V has three components:

• a precondition pre(o), a logical formula over V
• an effect eff(o) over V, defined on the following slides
• a cost cost(o) ∈ ℝ⁺₀

An MDP operator o over a set of state variables V has three components:

• a precondition pre(o), a logical formula over V
• an effect eff(o) over V, defined on the following slides
• a reward reward(o) over V, defined on the following slides

Whenever we just say operator (without SSP or MDP), the statement applies to both SSP and MDP operators.


Syntax of Effects

Definition (Effect)

Effects over state variables V are inductively defined as follows:

• If v ∈ V is a finite-domain state variable and d ∈ dom(v), then v := d is an effect (atomic effect).
• If e1, . . . , en are effects, then (e1 ∧ · · · ∧ en) is an effect (conjunctive effect). The special case with n = 0 is the empty effect ⊤.
• If e1, . . . , en are effects and p1, . . . , pn ∈ [0, 1] such that p1 + · · · + pn = 1, then (p1 : e1 | . . . | pn : en) is an effect (probabilistic effect).

Note: To simplify definitions, conditional effects are omitted.
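
As a sketch (not from the slides), the effect grammar can be encoded as nested Python tuples; the tags "assign", "and" and "prob" are naming assumptions.

```python
# Atomic effect v := d
e_atomic = ("assign", "has-coffee", True)

# Conjunctive effect (e1 ∧ ... ∧ en); n = 0 gives the empty effect ⊤
e_empty = ("and", [])
e_conj = ("and", [("assign", "location", "cafe"), e_atomic])

# Probabilistic effect (p1 : e1 | ... | pn : en) with p1 + ... + pn = 1
e_prob = ("prob", [(0.8, e_atomic), (0.2, e_empty)])

def probabilities_sum_to_one(effect) -> bool:
    """Check the side condition of probabilistic effects."""
    tag, outcomes = effect
    return tag == "prob" and abs(sum(p for p, _ in outcomes) - 1.0) < 1e-9

assert probabilities_sum_to_one(e_prob)
```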


Effects: Intuition

Intuition for effects:

• Atomic effects can be understood as assignments that update the value of a state variable.

• A conjunctive effect e = (e1 ∧ · · · ∧ en) means that all subeffects e1, . . . , en take place simultaneously.

• A probabilistic effect e = (p1 : e1 | . . . | pn : en) means that exactly one subeffect ei ∈ {e1, . . . , en} takes place, with probability pi.


Semantics of Effects

Definition

The effect set [e] of an effect e is a set of pairs ⟨p, w⟩, where p is a probability with 0 < p ≤ 1 and w is a partial assignment. The effect set [e] is obtained recursively as

[v := d] = {⟨1.0, {v ↦ d}⟩}

[e ∧ e′] = ⨄_{⟨p,w⟩ ∈ [e], ⟨p′,w′⟩ ∈ [e′]} {⟨p · p′, w ∪ w′⟩}

[p1 : e1 | . . . | pn : en] = ⨄_{i=1}^{n} {⟨pi · p, w⟩ | ⟨p, w⟩ ∈ [ei]}

where ⨄ is like ∪ but merges ⟨p, w⟩ and ⟨p′, w⟩ into ⟨p + p′, w⟩.
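
The recursive definition translates directly into code. The sketch below (an illustration under the tuple encoding assumed earlier, not the lecture's implementation) represents partial assignments as frozensets of (variable, value) pairs and keys a dict by them, so that duplicate assignments are merged by summing probabilities, as ⨄ requires.

```python
from itertools import product

def effect_set(effect):
    """Compute [e] as a dict: partial assignment (frozenset of (v, d) pairs) -> probability."""
    tag = effect[0]
    if tag == "assign":                       # [v := d] = {<1.0, {v -> d}>}
        _, v, d = effect
        return {frozenset({(v, d)}): 1.0}
    if tag == "and":                          # fold subeffects pairwise: <p * p', w ∪ w'>
        result = {frozenset(): 1.0}           # the empty conjunction is the empty effect
        for sub in effect[1]:
            combined = {}
            for (w, p), (w2, p2) in product(result.items(), effect_set(sub).items()):
                key = w | w2
                combined[key] = combined.get(key, 0.0) + p * p2
            result = combined
        return result
    if tag == "prob":                         # weight each subeffect's pairs by p_i, merge duplicates
        result = {}
        for p_i, sub in effect[1]:
            for w, p in effect_set(sub).items():
                result[w] = result.get(w, 0.0) + p_i * p
        return result
    raise ValueError(f"unknown effect: {effect!r}")

e = ("prob", [(0.8, ("assign", "has-coffee", True)), (0.2, ("and", []))])
print(effect_set(e))   # {frozenset({('has-coffee', True)}): 0.8, frozenset(): 0.2}
```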


Semantics of Operators

Definition (Applicable, Outcomes)

Let V be a set of finite-domain state variables.

Let s be a state over V, and let o be an operator over V. Operator o is applicable in s if s ⊨ pre(o).

The outcomes of applying an operator o in s, written s⟦o⟧, are

s⟦o⟧ = ⨄_{⟨p,w⟩ ∈ [eff(o)]} {⟨p, s′_w⟩}

where s′_w(v) = d if v ↦ d ∈ w and s′_w(v) = s(v) otherwise, and ⨄ is like ∪ but merges ⟨p, s′⟩ and ⟨p′, s′⟩ into ⟨p + p′, s′⟩.
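
A small sketch of s⟦o⟧ under the same assumptions: the effect set is passed in as (probability, partial assignment) pairs (e.g. produced by the previous sketch), and successor states that coincide are merged.

```python
def outcomes(s: dict, eff_set):
    """Compute s[[o]] from state s and the effect set [eff(o)], given as an
    iterable of (p, w) pairs with w a dict (partial assignment).
    Assumes o is applicable in s, i.e. s |= pre(o)."""
    result = {}
    for p, w in eff_set:
        succ = dict(s)                          # s'_w(v) = s(v) by default ...
        succ.update(w)                          # ... overridden by v -> d in w
        key = frozenset(succ.items())           # hashable key so duplicates merge
        result[key] = result.get(key, 0.0) + p  # <p, s'> and <p', s'> become <p + p', s'>
    return [(p, dict(key)) for key, p in result.items()]

s = {"location": "office", "has-coffee": False}
eff_set = [(0.8, {"has-coffee": True}), (0.2, {})]
print(outcomes(s, eff_set))
# [(0.8, {'location': 'office', 'has-coffee': True}),
#  (0.2, {'location': 'office', 'has-coffee': False})]
```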


Rewards

Definition (Reward)

A reward over state variables V is inductively defined as follows:

• c ∈ ℝ is a reward
• If χ is a propositional formula over V, then [χ] is a reward
• If r and r′ are rewards, then r + r′, r − r′, r · r′ and r/r′ are rewards

Applying an MDP operator o in s induces reward reward(o)(s), i.e., the value of the arithmetic function reward(o) where all occurrences of v ∈ V are replaced with s(v).
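
One possible evaluator for reward expressions, again only a sketch: the tuple encoding is an assumption, and the indicator case is simplified to atomic formulas v = d rather than arbitrary propositional formulas.

```python
def eval_reward(r, s: dict) -> float:
    """Evaluate a reward expression in state s.
    Assumed encoding: ("const", c), ("ind", v, d) for the indicator [v = d],
    and ("+", r1, r2), ("-", r1, r2), ("*", r1, r2), ("/", r1, r2)."""
    tag = r[0]
    if tag == "const":
        return float(r[1])
    if tag == "ind":
        _, v, d = r
        return 1.0 if s[v] == d else 0.0       # [χ] is 1 if s |= χ, else 0
    lhs, rhs = eval_reward(r[1], s), eval_reward(r[2], s)
    if tag == "+": return lhs + rhs
    if tag == "-": return lhs - rhs
    if tag == "*": return lhs * rhs
    if tag == "/": return lhs / rhs
    raise ValueError(f"unknown reward expression: {r!r}")

s = {"location": "cafe", "has-coffee": True}
r = ("-", ("*", ("const", 10), ("ind", "has-coffee", True)), ("const", 1))
print(eval_reward(r, s))   # 9.0
```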


G1.2 Probabilistic Planning Tasks


Probabilistic Planning Tasks

Definition (SSP and MDP Planning Task)

An SSP planning task is a 4-tuple Π = ⟨V, I, O, γ⟩ where
• V is a finite set of finite-domain state variables,
• I is a valuation over V called the initial state,
• O is a finite set of SSP operators over V, and
• γ is a formula over V called the goal.

An MDP planning task is a 4-tuple Π = ⟨V, I, O, d⟩ where
• V is a finite set of finite-domain state variables,
• I is a valuation over V called the initial state,
• O is a finite set of MDP operators over V, and
• d ∈ (0, 1) is the discount factor.

A probabilistic planning task is an SSP or MDP planning task.


Mapping SSP Planning Tasks to SSPs

Definition (SSP Induced by an SSP Planning Task)

The SSP planning task Π = ⟨V, I, O, γ⟩ induces the SSP T = ⟨S, A, c, T, s0, S⋆⟩, where
• S is the set of all states over V,
• A is the set of operators O,
• c(o) = cost(o) for all o ∈ O,
• T(s, o, s′) = p if o is applicable in s and ⟨p, s′⟩ ∈ s⟦o⟧, and T(s, o, s′) = 0 otherwise,
• s0 = I, and
• S⋆ = {s ∈ S | s ⊨ γ}.
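
A sketch of the induced transition function T under the same assumptions; passing the applicability test and the outcome function in as callbacks is purely illustrative.

```python
def induced_transition(s, o, s_prime, applicable, outcomes) -> float:
    """T(s, o, s') = p if o is applicable in s and <p, s'> is in s[[o]], else 0.
    `applicable(s, o)` decides s |= pre(o); `outcomes(s, o)` returns s[[o]]
    as (p, successor) pairs, e.g. via the earlier sketch."""
    if not applicable(s, o):
        return 0.0
    # s[[o]] contains at most one pair per successor, so this sum has at most one term
    return sum(p for p, succ in outcomes(s, o) if succ == s_prime)
```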


Mapping MDP Planning Tasks to MDPs

Definition (MDP Induced by an MDP Planning Task)

The MDP planning task Π = ⟨V, I, O, d⟩ induces the MDP T = ⟨S, A, R, T, s0, γ⟩, where
• S is the set of all states over V,
• A is the set of operators O,
• R(s, o) = reward(o)(s) for all o ∈ O and s ∈ S,
• T(s, o, s′) = p if o is applicable in s and ⟨p, s′⟩ ∈ s⟦o⟧, and T(s, o, s′) = 0 otherwise,
• s0 = I, and
• γ = d.


G1.3 Complexity


Complexity of Probabilistic Planning

Definition (Policy Existence)

Policy existence (PolicyEx) is the following decision problem:

Given: SSP planning task Π

Question: Is there a proper policy for Π?


Membership in EXP

Theorem

PolicyEx ∈ EXP

Proof.

The number of states in an SSP planning task is exponential in the number of variables. The induced SSP can be solved in time polynomial in |S| · |A| via linear programming, and hence in time exponential in the input size.


EXP-completeness of Probabilistic Planning

Theorem

PolicyEx is EXP-complete.

Proof Sketch.

Membership for PolicyEx: see previous slide.

Hardness is shown by Littman (1997) by reducing the EXP-complete game G4 to PolicyEx.


G1.4 Estimated Policy Evaluation


Large SSPs and MDPs

• Before: optimal policies and exact state-values for small SSPs and MDPs
• Now: focus on large SSPs and MDPs
• Further algorithms not necessarily optimal (may generate suboptimal policies)


Interleaved Planning & Execution

• Number of reachable states of a policy is usually exponential in the number of state variables.
• For large SSPs and MDPs, policies cannot be provided explicitly.
• Solution: a (possibly approximate) compact representation of the policy is required to describe a solution ⇒ not part of this lecture.
• Alternative solution: interleave planning and execution.


Interleaved Planning & Execution for SSPs

Plan-execute-monitor cycle for SSP T:
• plan action a for the current state s
• execute a
• observe new current state s′
• set s := s′
• repeat until s ∈ S⋆
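
A compact sketch of this cycle; the callbacks plan, execute and goal_test, and the toy example below, are assumptions for illustration rather than part of the lecture.

```python
import random

def plan_execute_monitor(s0, plan, execute, goal_test, max_steps=10_000):
    """Plan-execute-monitor cycle for an SSP: plan an action for the current
    state, execute it, observe the successor, repeat until a goal state."""
    s = s0
    for _ in range(max_steps):        # safety bound around "repeat until s in S*"
        if goal_test(s):
            return s
        a = plan(s)                   # plan action a for the current state s
        s = execute(s, a)             # execute a and observe the new current state s'
    raise RuntimeError("no goal state reached within the step bound")

# Toy example: walk right on a line until position 3; each step succeeds with prob. 0.9.
print(plan_execute_monitor(
    0,
    plan=lambda s: "right",
    execute=lambda s, a: s + 1 if random.random() < 0.9 else s,
    goal_test=lambda s: s == 3,
))   # 3
```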


Interleaved Planning & Execution for MDPs

Plan-execute-monitor cycle for MDP T:
• plan action a for the current state s
• execute a
• observe new current state s′
• set s := s′
• repeat until the discounted reward is sufficiently small


Interleaved Planning & Execution in Practice

• avoids the loss of precision that often comes with a compact description of the policy
• does not waste time planning for states that are never reached during execution
• poor decisions can be avoided by spending more time with planning before execution
• in SSPs, this can even mean that the computed policy is not proper and execution never reaches the goal
• in MDPs, it is not clear when the discounted reward is sufficiently small


Estimated Policy Evaluation

• The quality of a policy π is described by the state-value of the initial state, Vπ(s0).
• The quality of a given policy π can be computed (via LP or backward induction) or approximated arbitrarily closely (via iterative policy evaluation) in small SSPs or MDPs.
• This is impossible if planning and execution are interleaved, as the policy is incomplete.

⇒ Estimate the quality of policy π by executing it n ∈ ℕ times.


Executing a Policy

Definition (Run in SSP)

Let T be an SSP and π be a proper policy for T. A sequence of transitions

ρπ = s0 —p1 : π(s0)→ s1, . . . , sn−1 —pn : π(sn−1)→ sn

is a run ρπ of π if si+1 ∼ si⟦π(si)⟧ and sn ∈ S⋆.

The cost of run ρπ is cost(ρπ) = ∑_{i=0}^{n−1} cost(π(si)).

A run in an SSP can easily be generated by executing π from s0 until a state s ∈ S⋆ is encountered.
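
Generating such a run can be sketched as follows; this is an illustration only, where `outcomes(s, a)` is assumed to return s⟦a⟧ as (probability, successor) pairs, as in the earlier sketches.

```python
import random

def sample_run_cost(s0, policy, outcomes, cost, goal_test):
    """Execute the proper policy pi from s0 until a goal state is reached,
    sampling s_{i+1} ~ s_i[[pi(s_i)]]; return cost(rho_pi) = sum_i cost(pi(s_i))."""
    s, total = s0, 0.0
    while not goal_test(s):
        a = policy(s)
        total += cost(a)
        pairs = outcomes(s, a)
        successors = [succ for _, succ in pairs]
        weights = [p for p, _ in pairs]
        s = random.choices(successors, weights=weights)[0]   # sample the successor
    return total
```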


Executing a Policy

Definition (Run in MDP)

Let T be an MDP and π be a policy for T. A sequence of transitions

ρπ = s0 —p1 : π(s0)→ s1, . . . , sn−1 —pn : π(sn−1)→ sn

is a run ρπ of π if si+1 ∼ si⟦π(si)⟧.

The reward of run ρπ is reward(ρπ) = ∑_{i=0}^{n−1} γ^i · reward(si, π(si)).

To generate a run, a termination criterion (e.g., based on the change of the accumulated reward) must be specified.


Estimated Policy Evaluation

Definition (Estimated Policy Evaluation)

Let T be an SSP, π be a policy for T, and ⟨ρπ^1, . . . , ρπ^n⟩ be a sequence of runs of π.

The estimated quality of π via estimated policy evaluation is

Ṽπ := (1/n) · ∑_{i=1}^{n} cost(ρπ^i).
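
Estimated policy evaluation is then just the average cost over n sampled runs, as in this sketch; `sample_one_run_cost` is the hypothetical run generator from the previous sketch, applied with fixed arguments.

```python
def estimated_policy_quality(n, sample_one_run_cost):
    """V~_pi = (1/n) * sum_{i=1}^{n} cost(rho_pi^i), where sample_one_run_cost()
    generates one run of pi and returns its cost."""
    return sum(sample_one_run_cost() for _ in range(n)) / n
```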


Convergence of Estimated Policy Evaluation in SSPs

Theorem

Let T be an SSP, π be a policy for T, and ⟨ρπ^1, . . . , ρπ^n⟩ be a sequence of runs of π.

Then Ṽπ → Vπ(s0) for n → ∞.

Proof.

Holds due to the strong law of large numbers.

⇒ Ṽπ is a good approximation of Vπ(s0) if n is sufficiently large.


G1.5 Summary


Summary

• MDP and SSP planning tasks represent MDPs and SSPs compactly.
• Policy existence in SSPs is EXP-complete.
• Interleaving planning and execution avoids representation issues of the (typically exponentially sized) policy.
• The quality of such an incomplete policy can be estimated by executing it a fixed number of times.
• In SSPs, estimated policy evaluation converges to the true quality of the policy.

