
Planning and Optimization

G1. Factored MDPs

Malte Helmert and Gabriele Röger

Universität Basel

December 7, 2020


Planning and Optimization

December 7, 2020 — G1. Factored MDPs

G1.1 Factored MDPs

G1.2 Probabilistic Planning Tasks
G1.3 Complexity

G1.4 Estimated Policy Evaluation
G1.5 Summary


Content of this Course

[Course overview diagram: Planning splits into Classical (Foundations, Logic, Heuristics, Constraints) and Probabilistic (Explicit MDPs, Factored MDPs).]

Content of this Course: Factored MDPs

[Overview diagram of the Factored MDPs part: Foundations, Heuristic Search, Monte-Carlo Methods.]


G1.1 Factored MDPs


Factored MDPs

We would like to specify MDPs and SSPs with large state spaces.

In classical planning, we introduced planning tasks to represent large transition systems compactly.

• represent aspects of the world in terms of state variables
• states are a valuation of state variables
• n (propositional) state variables induce 2^n states

exponentially more compact than the "explicit" representation


Finite-Domain State Variables

Definition (Finite-Domain State Variable)

A finite-domain state variable is a symbol v with an associated domain dom(v), which is a finite non-empty set of values.

Let V be a finite set of finite-domain state variables.

A state s over V is an assignment s : V → ⋃_{v∈V} dom(v) such that s(v) ∈ dom(v) for all v ∈ V.

A formula over V is a propositional logic formula whose atomic propositions are of the form v = d, where v ∈ V and d ∈ dom(v).

For simplicity, we only consider finite-domain state variables here.
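
To make the definition concrete, here is a minimal Python sketch (not part of the lecture): states are represented as dicts, and the variable names and helper functions is_state and holds_atom are illustrative assumptions.

```python
# Illustrative only: V is modelled as a dict mapping each state variable
# to its finite, non-empty domain dom(v).
dom = {
    "location": {"home", "office", "cafe"},   # dom(location)
    "has-coffee": {True, False},              # dom(has-coffee)
}

def is_state(s: dict, dom: dict) -> bool:
    """A state s over V assigns every v in V a value s(v) in dom(v)."""
    return set(s) == set(dom) and all(s[v] in dom[v] for v in dom)

def holds_atom(s: dict, v, d) -> bool:
    """Atomic proposition 'v = d' of a formula over V."""
    return s[v] == d

s = {"location": "office", "has-coffee": False}
assert is_state(s, dom)
assert holds_atom(s, "location", "office")
```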


Syntax of Operators

Definition (SSP and MDP Operators)

An SSP operator o over a set of state variables V has three components:

• a precondition pre(o), a logical formula over V
• an effect eff(o) over V, defined on the following slides
• a cost cost(o) ∈ ℝ⁺₀

An MDP operator o over a set of state variables V has three components:

• a precondition pre(o), a logical formula over V
• an effect eff(o) over V, defined on the following slides
• a reward reward(o) over V, defined on the following slides

Whenever we just say operator (without SSP or MDP), the statement applies to both SSP and MDP operators.


Syntax of Effects

Definition (Effect)

Effects over state variables V are inductively defined as follows:

• If v ∈ V is a finite-domain state variable and d ∈ dom(v), then v := d is an effect (atomic effect).
• If e1, . . . , en are effects, then (e1 ∧ · · · ∧ en) is an effect (conjunctive effect). The special case with n = 0 is the empty effect ⊤.
• If e1, . . . , en are effects and p1, . . . , pn ∈ [0, 1] such that p1 + · · · + pn = 1, then (p1 : e1 | . . . | pn : en) is an effect (probabilistic effect).

Note: To simplify definitions, conditional effects are omitted.
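
As a sketch (not from the slides), the effect grammar can be encoded as nested Python tuples; the tags "assign", "and" and "prob" are naming assumptions.

```python
# Atomic effect v := d
e_atomic = ("assign", "has-coffee", True)

# Conjunctive effect (e1 ∧ ... ∧ en); n = 0 gives the empty effect ⊤
e_empty = ("and", [])
e_conj = ("and", [("assign", "location", "cafe"), e_atomic])

# Probabilistic effect (p1 : e1 | ... | pn : en) with p1 + ... + pn = 1
e_prob = ("prob", [(0.8, e_atomic), (0.2, e_empty)])

def probabilities_sum_to_one(effect) -> bool:
    """Check the side condition of probabilistic effects."""
    tag, outcomes = effect
    return tag == "prob" and abs(sum(p for p, _ in outcomes) - 1.0) < 1e-9

assert probabilities_sum_to_one(e_prob)
```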


Effects: Intuition

Intuition for effects:

• Atomic effects can be understood as assignments that update the value of a state variable.

• A conjunctive effect e = (e1 ∧ · · · ∧ en) means that all subeffects e1, . . . , en take place simultaneously.

• A probabilistic effect e = (p1 : e1 | . . . | pn : en) means that exactly one subeffect ei ∈ {e1, . . . , en} takes place, with probability pi.


Semantics of Effects

Definition

The effect set [e] of an effect e is a set of pairs ⟨p, w⟩, where p is a probability with 0 < p ≤ 1 and w is a partial assignment. The effect set [e] is obtained recursively as

[v := d] = {⟨1.0, {v ↦ d}⟩}

[e ∧ e′] = ⨄_{⟨p,w⟩ ∈ [e], ⟨p′,w′⟩ ∈ [e′]} {⟨p · p′, w ∪ w′⟩}

[p1 : e1 | . . . | pn : en] = ⨄_{i=1}^{n} {⟨pi · p, w⟩ | ⟨p, w⟩ ∈ [ei]}

where ⨄ is like ∪ but merges ⟨p, w⟩ and ⟨p′, w⟩ into ⟨p + p′, w⟩.
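
The recursive definition translates directly into code. The sketch below (an illustration under the tuple encoding assumed earlier, not the lecture's implementation) represents partial assignments as frozensets of (variable, value) pairs and keys a dict by them, so that duplicate assignments are merged by summing probabilities, as ⨄ requires.

```python
from itertools import product

def effect_set(effect):
    """Compute [e] as a dict: partial assignment (frozenset of (v, d) pairs) -> probability."""
    tag = effect[0]
    if tag == "assign":                       # [v := d] = {<1.0, {v -> d}>}
        _, v, d = effect
        return {frozenset({(v, d)}): 1.0}
    if tag == "and":                          # fold subeffects pairwise: <p * p', w ∪ w'>
        result = {frozenset(): 1.0}           # the empty conjunction is the empty effect
        for sub in effect[1]:
            combined = {}
            for (w, p), (w2, p2) in product(result.items(), effect_set(sub).items()):
                key = w | w2
                combined[key] = combined.get(key, 0.0) + p * p2
            result = combined
        return result
    if tag == "prob":                         # weight each subeffect's pairs by p_i, merge duplicates
        result = {}
        for p_i, sub in effect[1]:
            for w, p in effect_set(sub).items():
                result[w] = result.get(w, 0.0) + p_i * p
        return result
    raise ValueError(f"unknown effect: {effect!r}")

e = ("prob", [(0.8, ("assign", "has-coffee", True)), (0.2, ("and", []))])
print(effect_set(e))   # {frozenset({('has-coffee', True)}): 0.8, frozenset(): 0.2}
```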


Semantics of Operators

Definition (Applicable, Outcomes)

Let V be a set of finite-domain state variables.

Let s be a state over V, and let o be an operator over V. Operator o is applicable in s if s ⊨ pre(o).

The outcomes of applying an operator o in s, written s⟦o⟧, are

s⟦o⟧ = ⨄_{⟨p,w⟩ ∈ [eff(o)]} {⟨p, s′_w⟩}

where s′_w(v) = d if v ↦ d ∈ w and s′_w(v) = s(v) otherwise, and ⨄ is like ∪ but merges ⟨p, s′⟩ and ⟨p′, s′⟩ into ⟨p + p′, s′⟩.
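
A small sketch of s⟦o⟧ under the same assumptions: the effect set is passed in as (probability, partial assignment) pairs (e.g. produced by the previous sketch), and successor states that coincide are merged.

```python
def outcomes(s: dict, eff_set):
    """Compute s[[o]] from state s and the effect set [eff(o)], given as an
    iterable of (p, w) pairs with w a dict (partial assignment).
    Assumes o is applicable in s, i.e. s |= pre(o)."""
    result = {}
    for p, w in eff_set:
        succ = dict(s)                          # s'_w(v) = s(v) by default ...
        succ.update(w)                          # ... overridden by v -> d in w
        key = frozenset(succ.items())           # hashable key so duplicates merge
        result[key] = result.get(key, 0.0) + p  # <p, s'> and <p', s'> become <p + p', s'>
    return [(p, dict(key)) for key, p in result.items()]

s = {"location": "office", "has-coffee": False}
eff_set = [(0.8, {"has-coffee": True}), (0.2, {})]
print(outcomes(s, eff_set))
# [(0.8, {'location': 'office', 'has-coffee': True}),
#  (0.2, {'location': 'office', 'has-coffee': False})]
```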


Rewards

Definition (Reward)

A reward over state variables V is inductively defined as follows:

• c ∈ ℝ is a reward
• If χ is a propositional formula over V, then [χ] is a reward
• If r and r′ are rewards, then r + r′, r − r′, r · r′ and r/r′ are rewards

Applying an MDP operator o in s induces reward reward(o)(s), i.e., the value of the arithmetic function reward(o) where all occurrences of v ∈ V are replaced with s(v).
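
One possible evaluator for reward expressions, again only a sketch: the tuple encoding is an assumption, and the indicator case is simplified to atomic formulas v = d rather than arbitrary propositional formulas.

```python
def eval_reward(r, s: dict) -> float:
    """Evaluate a reward expression in state s.
    Assumed encoding: ("const", c), ("ind", v, d) for the indicator [v = d],
    and ("+", r1, r2), ("-", r1, r2), ("*", r1, r2), ("/", r1, r2)."""
    tag = r[0]
    if tag == "const":
        return float(r[1])
    if tag == "ind":
        _, v, d = r
        return 1.0 if s[v] == d else 0.0       # [χ] is 1 if s |= χ, else 0
    lhs, rhs = eval_reward(r[1], s), eval_reward(r[2], s)
    if tag == "+": return lhs + rhs
    if tag == "-": return lhs - rhs
    if tag == "*": return lhs * rhs
    if tag == "/": return lhs / rhs
    raise ValueError(f"unknown reward expression: {r!r}")

s = {"location": "cafe", "has-coffee": True}
r = ("-", ("*", ("const", 10), ("ind", "has-coffee", True)), ("const", 1))
print(eval_reward(r, s))   # 9.0
```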


G1.2 Probabilistic Planning Tasks


Probabilistic Planning Tasks

Definition (SSP and MDP Planning Task)

An SSP planning task is a 4-tuple Π = ⟨V, I, O, γ⟩ where
• V is a finite set of finite-domain state variables,
• I is a valuation over V called the initial state,
• O is a finite set of SSP operators over V, and
• γ is a formula over V called the goal.

An MDP planning task is a 4-tuple Π = ⟨V, I, O, d⟩ where
• V is a finite set of finite-domain state variables,
• I is a valuation over V called the initial state,
• O is a finite set of MDP operators over V, and
• d ∈ (0, 1) is the discount factor.

A probabilistic planning task is an SSP or MDP planning task.


Mapping SSP Planning Tasks to SSPs

Definition (SSP Induced by an SSP Planning Task)

The SSP planning task Π = ⟨V, I, O, γ⟩ induces the SSP T = ⟨S, A, c, T, s0, S⋆⟩, where
• S is the set of all states over V,
• A is the set of operators O,
• c(o) = cost(o) for all o ∈ O,
• T(s, o, s′) = p if o is applicable in s and ⟨p, s′⟩ ∈ s⟦o⟧, and T(s, o, s′) = 0 otherwise,
• s0 = I, and
• S⋆ = {s ∈ S | s ⊨ γ}.
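
A sketch of the induced transition function T under the same assumptions; passing the applicability test and the outcome function in as callbacks is purely illustrative.

```python
def induced_transition(s, o, s_prime, applicable, outcomes) -> float:
    """T(s, o, s') = p if o is applicable in s and <p, s'> is in s[[o]], else 0.
    `applicable(s, o)` decides s |= pre(o); `outcomes(s, o)` returns s[[o]]
    as (p, successor) pairs, e.g. via the earlier sketch."""
    if not applicable(s, o):
        return 0.0
    # s[[o]] contains at most one pair per successor, so this sum has at most one term
    return sum(p for p, succ in outcomes(s, o) if succ == s_prime)
```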


Mapping MDP Planning Tasks to MDPs

Definition (MDP Induced by an MDP Planning Task)

The MDP planning task Π = ⟨V, I, O, d⟩ induces the MDP T = ⟨S, A, R, T, s0, γ⟩, where
• S is the set of all states over V,
• A is the set of operators O,
• R(s, o) = reward(o)(s) for all o ∈ O and s ∈ S,
• T(s, o, s′) = p if o is applicable in s and ⟨p, s′⟩ ∈ s⟦o⟧, and T(s, o, s′) = 0 otherwise,
• s0 = I, and
• γ = d.


G1.3 Complexity


Complexity of Probabilistic Planning

Definition (Policy Existence)

Policy existence (PolicyEx) is the following decision problem:

Given: SSP planning task Π

Question: Is there a proper policy for Π?


Membership in EXP

Theorem

PolicyEx ∈ EXP

Proof.

The number of states in an SSP planning task is exponential in the number of variables. The induced SSP can be solved in time polynomial in |S| · |A| via linear programming, and hence in time exponential in the input size.


EXP-completeness of Probabilistic Planning

Theorem

PolicyEx is EXP-complete.

Proof Sketch.

Membership for PolicyEx: see previous slide.

Hardness is shown by Littman (1997) by reducing the EXP-complete game G4 to PolicyEx.


G1.4 Estimated Policy Evaluation


Large SSPs and MDPs

• Before: optimal policies and exact state-values for small SSPs and MDPs
• Now: focus on large SSPs and MDPs
• Further algorithms not necessarily optimal (may generate suboptimal policies)


Interleaved Planning & Execution

• Number of reachable states of a policy is usually exponential in the number of state variables.
• For large SSPs and MDPs, policies cannot be provided explicitly.
• Solution: a (possibly approximate) compact representation of the policy is required to describe a solution ⇒ not part of this lecture.
• Alternative solution: interleave planning and execution.


Interleaved Planning & Execution for SSPs

Plan-execute-monitor cycle for SSP T:
• plan action a for the current state s
• execute a
• observe new current state s′
• set s := s′
• repeat until s ∈ S⋆
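
A compact sketch of this cycle; the callbacks plan, execute and goal_test, and the toy example below, are assumptions for illustration rather than part of the lecture.

```python
import random

def plan_execute_monitor(s0, plan, execute, goal_test, max_steps=10_000):
    """Plan-execute-monitor cycle for an SSP: plan an action for the current
    state, execute it, observe the successor, repeat until a goal state."""
    s = s0
    for _ in range(max_steps):        # safety bound around "repeat until s in S*"
        if goal_test(s):
            return s
        a = plan(s)                   # plan action a for the current state s
        s = execute(s, a)             # execute a and observe the new current state s'
    raise RuntimeError("no goal state reached within the step bound")

# Toy example: walk right on a line until position 3; each step succeeds with prob. 0.9.
print(plan_execute_monitor(
    0,
    plan=lambda s: "right",
    execute=lambda s, a: s + 1 if random.random() < 0.9 else s,
    goal_test=lambda s: s == 3,
))   # 3
```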


Interleaved Planning & Execution for MDPs

Plan-execute-monitor cycle for MDP T:
• plan action a for the current state s
• execute a
• observe new current state s′
• set s := s′
• repeat until the discounted reward is sufficiently small


Interleaved Planning & Execution in Practice

• avoids the loss of precision that often comes with a compact description of the policy
• does not waste time planning for states that are never reached during execution
• poor decisions can be avoided by spending more time with planning before execution
• in SSPs, this can even mean that the computed policy is not proper and execution never reaches the goal
• in MDPs, it is not clear when the discounted reward is sufficiently small


Estimated Policy Evaluation

• The quality of a policy π is described by the state-value of the initial state, Vπ(s0).
• The quality of a given policy π can be computed (via LP or backward induction) or approximated arbitrarily closely (via iterative policy evaluation) in small SSPs or MDPs.
• This is impossible if planning and execution are interleaved, as the policy is incomplete.

⇒ Estimate the quality of policy π by executing it n ∈ ℕ times.


Executing a Policy

Definition (Run in SSP)

Let T be an SSP and π be a proper policy for T. A sequence of transitions

ρπ = s0 —p1 : π(s0)→ s1, . . . , sn−1 —pn : π(sn−1)→ sn

is a run ρπ of π if si+1 ∼ si⟦π(si)⟧ and sn ∈ S⋆.

The cost of run ρπ is cost(ρπ) = ∑_{i=0}^{n−1} cost(π(si)).

A run in an SSP can easily be generated by executing π from s0 until a state s ∈ S⋆ is encountered.
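
Generating such a run can be sketched as follows; this is an illustration only, where `outcomes(s, a)` is assumed to return s⟦a⟧ as (probability, successor) pairs, as in the earlier sketches.

```python
import random

def sample_run_cost(s0, policy, outcomes, cost, goal_test):
    """Execute the proper policy pi from s0 until a goal state is reached,
    sampling s_{i+1} ~ s_i[[pi(s_i)]]; return cost(rho_pi) = sum_i cost(pi(s_i))."""
    s, total = s0, 0.0
    while not goal_test(s):
        a = policy(s)
        total += cost(a)
        pairs = outcomes(s, a)
        successors = [succ for _, succ in pairs]
        weights = [p for p, _ in pairs]
        s = random.choices(successors, weights=weights)[0]   # sample the successor
    return total
```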


Executing a Policy

Definition (Run in MDP)

Let T be an MDP and π be a policy for T. A sequence of transitions

ρπ = s0 —p1 : π(s0)→ s1, . . . , sn−1 —pn : π(sn−1)→ sn

is a run ρπ of π if si+1 ∼ si⟦π(si)⟧.

The reward of run ρπ is reward(ρπ) = ∑_{i=0}^{n−1} γ^i · reward(si, π(si)).

To generate a run, a termination criterion (e.g., based on the change of the accumulated reward) must be specified.


Estimated Policy Evaluation

Definition (Estimated Policy Evaluation)

Let T be an SSP, π be a policy for T, and ⟨ρπ^1, . . . , ρπ^n⟩ be a sequence of runs of π.

The estimated quality of π via estimated policy evaluation is

Ṽπ := (1/n) · ∑_{i=1}^{n} cost(ρπ^i).
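
Estimated policy evaluation is then just the average cost over n sampled runs, as in this sketch; `sample_one_run_cost` is the hypothetical run generator from the previous sketch, applied with fixed arguments.

```python
def estimated_policy_quality(n, sample_one_run_cost):
    """V~_pi = (1/n) * sum_{i=1}^{n} cost(rho_pi^i), where sample_one_run_cost()
    generates one run of pi and returns its cost."""
    return sum(sample_one_run_cost() for _ in range(n)) / n
```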


Convergence of Estimated Policy Evaluation in SSPs

Theorem

Let T be an SSP, π be a policy for T, and ⟨ρπ^1, . . . , ρπ^n⟩ be a sequence of runs of π.

Then Ṽπ → Vπ(s0) for n → ∞.

Proof.

Holds due to the strong law of large numbers.

⇒ Ṽπ is a good approximation of Vπ(s0) if n is sufficiently large.


G1.5 Summary


Summary

• MDP and SSP planning tasks represent MDPs and SSPs compactly.
• Policy existence in SSPs is EXP-complete.
• Interleaving planning and execution avoids representation issues of the (typically exponentially sized) policy.
• The quality of such an incomplete policy can be estimated by executing it a fixed number of times.
• In SSPs, estimated policy evaluation converges to the true quality of the policy.

