Planning and Optimization: G1. Factored MDPs. Malte Helmert and Gabriele Röger

(1)

G1. Factored MDPs

Malte Helmert and Gabriele Röger

Universität Basel

(2)

Content of this Course

Planning
- Classical: Foundations, Logic, Heuristics, Constraints
- Probabilistic: Explicit MDPs, Factored MDPs

(3)

Content of this Course: Factored MDPs

Factored MDPs: Foundations, Heuristic Search, Monte-Carlo Methods

(4)

Factored MDPs

(5)

Factored MDPs

We would like to specify MDPs and SSPs with large state spaces.

In classical planning, we introduced planning tasks to represent large transition systems compactly:
- represent aspects of the world in terms of state variables
- states are a valuation of state variables
- n (propositional) state variables induce 2^n states
- exponentially more compact than the "explicit" representation

(6)

Finite-Domain State Variables

Definition (Finite-Domain State Variable)

A finite-domain state variable is a symbol v with an associated domain dom(v), which is a finite non-empty set of values.

Let V be a finite set of finite-domain state variables.

A state s over V is an assignment s : V → ⋃_{v ∈ V} dom(v) such that s(v) ∈ dom(v) for all v ∈ V.

A formula over V is a propositional logic formula whose atomic propositions are of the form v = d where v ∈ V and d ∈ dom(v).

For simplicity, we only consider finite-domain state variables here.
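As a small illustration (not part of the original slides), states over finite-domain variables map directly onto Python dictionaries; the variable names and domains below are made up.

```python
# A minimal sketch (illustrative names): finite-domain variables and states.
V = {
    "location": {"home", "office", "cafe"},   # dom(location)
    "battery": {0, 1, 2, 3},                  # dom(battery)
}

def is_state(s, V):
    """Check that s is a state over V: it assigns every v in V a value from dom(v)."""
    return set(s) == set(V) and all(s[v] in dom for v, dom in V.items())

s = {"location": "office", "battery": 2}
assert is_state(s, V)
```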

(7)

Syntax of Operators

Definition (SSP and MDP Operators)

An SSP operator o over a set of state variables V has three components:
- a precondition pre(o), a logical formula over V
- an effect eff(o) over V, defined on the following slides
- a cost cost(o) ∈ ℝ⁺₀

An MDP operator o over a set of state variables V has three components:
- a precondition pre(o), a logical formula over V
- an effect eff(o) over V, defined on the following slides
- a reward reward(o) over V, defined on the following slides

Whenever we just say operator (without SSP or MDP), both kinds of operators are allowed.
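A possible encoding of the two operator kinds as plain records; the field names are illustrative and not fixed by the slides, and effects are kept opaque here (one concrete encoding follows after the effect syntax).

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SSPOperator:
    name: str
    precondition: Callable[[dict], bool]   # pre(o), evaluated on a state (dict)
    effect: Any                            # eff(o), see the effect encoding below
    cost: float                            # cost(o) >= 0

@dataclass
class MDPOperator:
    name: str
    precondition: Callable[[dict], bool]   # pre(o)
    effect: Any                            # eff(o)
    reward: Callable[[dict], float]        # reward(o), a function of the state
```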

(8)

Syntax of Effects

Definition (Effect)

Effects over state variables V are inductively defined as follows:
- If v ∈ V is a finite-domain state variable and d ∈ dom(v), then v := d is an effect (atomic effect).
- If e1, . . . , en are effects, then (e1 ∧ ··· ∧ en) is an effect (conjunctive effect). The special case with n = 0 is the empty effect ⊤.
- If e1, . . . , en are effects and p1, . . . , pn ∈ [0, 1] such that p1 + ··· + pn = 1, then (p1 : e1 | . . . | pn : en) is an effect (probabilistic effect).

Note: To simplify definitions, conditional effects are omitted.
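To make the inductive definition concrete, here is one possible (hypothetical) encoding of effects as nested tuples, together with an example effect:

```python
# Illustrative effect encoding mirroring the inductive definition:
#   atomic:        ("assign", v, d)                      for v := d
#   conjunctive:   ("and", [e1, ..., en])                 empty list = empty effect
#   probabilistic: ("prob", [(p1, e1), ..., (pn, en)])    with p1 + ... + pn = 1

# Example: with probability 0.8 the robot moves and drains the battery,
# with probability 0.2 nothing happens.
move_effect = ("prob", [
    (0.8, ("and", [("assign", "location", "office"),
                   ("assign", "battery", 1)])),
    (0.2, ("and", [])),   # empty effect
])
```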

(9)

Effects: Intuition

Intuition for effects:
- Atomic effects can be understood as assignments that update the value of a state variable.
- A conjunctive effect e = (e1 ∧ ··· ∧ en) means that all subeffects e1, . . . , en take place simultaneously.
- A probabilistic effect e = (p1 : e1 | . . . | pn : en) means that exactly one subeffect ei ∈ {e1, . . . , en} takes place, with probability pi.

(10)

Semantics of Effects

Definition

The effect set [e] of an effect e is a set of pairs ⟨p, w⟩, where p is a probability with 0 < p ≤ 1 and w is a partial assignment. The effect set [e] is obtained recursively as

[v := d] = {⟨1.0, {v ↦ d}⟩},

[e ∧ e′] = ⨄_{⟨p,w⟩ ∈ [e], ⟨p′,w′⟩ ∈ [e′]} {⟨p · p′, w ∪ w′⟩},

[p1 : e1 | . . . | pn : en] = ⨄_{i=1}^{n} {⟨pi · p, w⟩ | ⟨p, w⟩ ∈ [ei]},

where ⨄ is like ∪ but merges ⟨p, w⟩ and ⟨p′, w⟩ into ⟨p + p′, w⟩.
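The recursion for [e] translates almost directly into code. The sketch below is my own and uses the tuple encoding from the previous sketch: a partial assignment w is a frozenset of (variable, value) pairs, and an effect set is a dict from partial assignments to probabilities, so the merging union ⨄ becomes addition of probabilities under the same key.

```python
from collections import defaultdict
from itertools import product

def effect_set(e):
    """Compute [e] as {partial assignment (frozenset of (var, value)): probability}."""
    kind = e[0]
    if kind == "assign":                              # [v := d] = {<1.0, {v -> d}>}
        _, v, d = e
        return {frozenset({(v, d)}): 1.0}
    if kind == "and":                                 # conjunctive effect
        result = {frozenset(): 1.0}                   # empty effect as base case
        for sub in e[1]:
            merged = defaultdict(float)
            for (w, p), (w2, p2) in product(result.items(), effect_set(sub).items()):
                merged[w | w2] += p * p2              # <p*p', w u w'>, merged by key
            result = dict(merged)
        return result
    if kind == "prob":                                # probabilistic effect
        merged = defaultdict(float)
        for pi, sub in e[1]:
            for w, p in effect_set(sub).items():
                merged[w] += pi * p                   # <pi*p, w>, merged by key
        return dict(merged)
    raise ValueError(f"unknown effect: {e!r}")
```

Applied to the move_effect example above, this yields the partial assignment {location ↦ office, battery ↦ 1} with probability 0.8 and the empty assignment with probability 0.2.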

(11)

Semantics of Operators

Definition (Applicable, Outcomes)

Let V be a set of finite-domain state variables, let s be a state over V, and let o be an operator over V. Operator o is applicable in s if s ⊨ pre(o).

The outcomes of applying an operator o in s, written s⟦o⟧, are

s⟦o⟧ = ⨄_{⟨p,w⟩ ∈ [eff(o)]} {⟨p, s′_w⟩},

where s′_w(v) = d if v ↦ d ∈ w and s′_w(v) = s(v) otherwise, and ⨄ is like ∪ but merges ⟨p, s′⟩ and ⟨p′, s′⟩ into ⟨p + p′, s′⟩.
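Building on that, the outcome distribution s⟦o⟧ applies each partial assignment in [eff(o)] to s and again merges identical successors. Sketch (illustrative): it assumes operator objects with precondition and effect fields as in the earlier operator sketch, and an effect_set function like the one above, passed in explicitly.

```python
from collections import defaultdict

def outcomes(s, o, effect_set):
    """Return s[[o]] as {successor state (frozenset of items): probability}.

    s is a state (dict), o an operator with .precondition and .effect,
    effect_set a function computing [eff(o)] as sketched above.
    """
    assert o.precondition(s), "operator not applicable in s"
    succ = defaultdict(float)
    for w, p in effect_set(o.effect).items():
        s_w = dict(s)                      # s'_w(v) = s(v) by default ...
        s_w.update(dict(w))                # ... overridden by v -> d for (v, d) in w
        succ[frozenset(s_w.items())] += p  # merge <p, s'> and <p', s'> to <p+p', s'>
    return dict(succ)
```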

(12)

Rewards

Definition (Reward)

A reward over state variables V is inductively defined as follows:
- c ∈ ℝ is a reward
- If χ is a propositional formula over V, then [χ] is a reward (evaluating to 1 if χ holds and to 0 otherwise)
- If r and r′ are rewards, then r + r′, r − r′, r · r′ and r/r′ are rewards

Applying an MDP operator o in s induces reward reward(o)(s), i.e., the value of the arithmetic function reward(o) where all occurrences of v ∈ V are replaced with s(v).
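Reward expressions can be evaluated by straightforward structural recursion. The encoding below is my own (the slide fixes only the grammar), and the fourth arithmetic case is read here as division.

```python
def eval_reward(r, s):
    """Evaluate a reward expression r in state s (a dict).

    Illustrative encoding: ("const", c), ("ind", chi) with chi a predicate on
    states, and ("+", r1, r2), ("-", r1, r2), ("*", r1, r2), ("/", r1, r2).
    """
    kind = r[0]
    if kind == "const":
        return r[1]
    if kind == "ind":                      # [chi]: 1 if s satisfies chi, else 0
        return 1.0 if r[1](s) else 0.0
    a, b = eval_reward(r[1], s), eval_reward(r[2], s)
    if kind == "+":
        return a + b
    if kind == "-":
        return a - b
    if kind == "*":
        return a * b
    if kind == "/":
        return a / b
    raise ValueError(f"unknown reward expression: {r!r}")

# Example: the reward 10 * [location = office] - 1
at_office = lambda s: s["location"] == "office"
r = ("-", ("*", ("const", 10.0), ("ind", at_office)), ("const", 1.0))
print(eval_reward(r, {"location": "office", "battery": 2}))   # 9.0
```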

(13)

Probabilistic Planning Tasks

(14)

Probabilistic Planning Tasks

Definition (SSP and MDP Planning Task)

An SSP planning task is a 4-tuple Π = ⟨V, I, O, γ⟩ where
- V is a finite set of finite-domain state variables,
- I is a valuation over V called the initial state,
- O is a finite set of SSP operators over V, and
- γ is a formula over V called the goal.

An MDP planning task is a 4-tuple Π = ⟨V, I, O, d⟩ where
- V is a finite set of finite-domain state variables,
- I is a valuation over V called the initial state,
- O is a finite set of MDP operators over V, and
- d ∈ (0, 1) is the discount factor.

A probabilistic planning task is an SSP or MDP planning task.

(15)

Mapping SSP Planning Tasks to SSPs

Definition (SSP Induced by an SSP Planning Task)

The SSP planning task Π = ⟨V, I, O, γ⟩ induces the SSP T = ⟨S, A, c, T, s0, S⋆⟩, where
- S is the set of all states over V,
- A is the set of operators O,
- c(o) = cost(o) for all o ∈ O,
- T(s, o, s′) = p if o is applicable in s and ⟨p, s′⟩ ∈ s⟦o⟧, and T(s, o, s′) = 0 otherwise,
- s0 = I, and
- S⋆ = {s ∈ S | s ⊨ γ}.
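Materializing the induced SSP means enumerating the full state set S, whose size is the product of the domain sizes; this is exactly the exponential blowup the factored representation avoids. A minimal sketch of that enumeration (variable names hypothetical):

```python
from itertools import product

def all_states(V):
    """Enumerate the set S of all states over V, given as {variable: domain}."""
    variables = sorted(V)
    for values in product(*(sorted(V[v]) for v in variables)):
        yield dict(zip(variables, values))

V = {"location": {"home", "office", "cafe"}, "battery": {0, 1, 2, 3}}
print(sum(1 for _ in all_states(V)))   # 3 * 4 = 12 states
```

The entries T(s, o, s′) could then be filled in from the outcomes sketched earlier, with value 0 whenever o is not applicable in s.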

(16)

Mapping MDP Planning Tasks to MDPs

Definition (MDP Induced by an MDP Planning Task)

The MDP planning task Π = ⟨V, I, O, d⟩ induces the MDP T = ⟨S, A, R, T, s0, γ⟩, where
- S is the set of all states over V,
- A is the set of operators O,
- R(s, o) = reward(o)(s) for all o ∈ O and s ∈ S,
- T(s, o, s′) = p if o is applicable in s and ⟨p, s′⟩ ∈ s⟦o⟧, and T(s, o, s′) = 0 otherwise,
- s0 = I, and
- γ = d.

(17)

Complexity

(18)

Complexity of Probabilistic Planning

Definition (Policy Existence)

Policy existence (PolicyEx) is the following decision problem:

Given: SSP planning task Π

Question: Is there a proper policy for Π?

(19)

Membership in EXP

Theorem

PolicyEx ∈ EXP

Proof.

The number of states of an SSP planning task is exponential in the number of variables. The induced SSP can be solved in time polynomial in |S| · |A| via linear programming and hence in time exponential in the input size.

(20)

EXP-completeness of Probabilistic Planning

Theorem

PolicyEx is EXP-complete.

Proof Sketch.

Membership for PolicyEx: see previous slide.

Hardness is shown by Littman (1997) by reducing the EXP-complete game G4 to PolicyEx.

(21)

Estimated Policy Evaluation

(22)

Large SSPs and MDPs

Before: optimal policies and exact state-values for small SSPs and MDPs.

Now: focus on large SSPs and MDPs. Further algorithms are not necessarily optimal (they may generate suboptimal policies).

(23)

Interleaved Planning & Execution

The number of reachable states of a policy is usually exponential in the number of state variables.

For large SSPs and MDPs, policies cannot be provided explicitly.

Solution: a (possibly approximate) compact representation of the policy is required to describe the solution
⇒ not part of this lecture.

Alternative solution: interleave planning and execution.

(24)

Interleaved Planning & Execution for SSPs

Plan-execute-monitor cycle for SSP T:
- plan action a for the current state s
- execute a
- observe new current state s′
- set s := s′
- repeat until s ∈ S⋆

(25)

Interleaved Planning & Execution for MDPs

Plan-execute-monitor cycle for MDP T (see the sketch below):
- plan action a for the current state s
- execute a
- observe new current state s′
- set s := s′
- repeat until the discounted reward is sufficiently small
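One way to picture the cycle is as a loop around a planner and an environment. The sketch below covers the SSP variant; plan_action, execute and is_goal are hypothetical hooks not defined on the slides, and operators are assumed to carry a cost field.

```python
def plan_execute_monitor(s0, plan_action, execute, is_goal, max_steps=10_000):
    """Interleaved planning and execution for an SSP (illustrative sketch).

    plan_action(s) -> action      planning for the current state only
    execute(s, a)  -> s'          executes a and observes the successor state
    is_goal(s)     -> bool        membership test for S*
    """
    s, total_cost = s0, 0.0
    for _ in range(max_steps):          # safety bound: the policy might not be proper
        if is_goal(s):
            return s, total_cost
        a = plan_action(s)              # plan
        s_next = execute(s, a)          # execute and observe
        total_cost += a.cost            # assumes operators carry a .cost field
        s = s_next                      # monitor: continue from the observed state
    raise RuntimeError("goal not reached within the step bound")
```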

(26)

Interleaved Planning & Execution in Practice

- avoids the loss of precision that often comes with a compact description of the policy
- does not waste time planning for states that are never reached during execution
- poor decisions can be avoided by spending more time on planning before execution
  - in SSPs, this can even mean that the computed policy is not proper and execution never reaches the goal
  - in MDPs, it is not clear when the discounted reward is sufficiently small

(27)

Estimated Policy Evaluation

The quality of a policy is described by the state-value of the initial state, V^π(s0).

In small SSPs or MDPs, the quality of a given policy π can be computed (via LP or backward induction) or approximated arbitrarily closely (via iterative policy evaluation). This is impossible if planning and execution are interleaved, as the policy is incomplete.

⇒ Estimate the quality of policy π by executing it n ∈ ℕ times.

(28)

Executing a Policy

Definition (Run in SSP)

Let T be an SSP and π be a proper policy for T. A sequence of transitions

ρ_π = s0 −(p1:π(s0))→ s1, . . . , s_{n−1} −(pn:π(s_{n−1}))→ s_n

is a run ρ_π of π if s_{i+1} ∼ s_i⟦π(s_i)⟧ and s_n ∈ S⋆. The cost of run ρ_π is

cost(ρ_π) = Σ_{i=0}^{n−1} cost(π(s_i)).

A run in an SSP can easily be generated by executing π from s0 until a state s ∈ S⋆ is encountered.
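Sampling a run and its cost only requires a way to draw a successor from s⟦π(s)⟧. The sketch below is illustrative; it assumes a successor distribution given as a dict from states to probabilities (for instance produced by the outcomes sketch earlier), operators with a cost field, and a proper policy so the loop terminates.

```python
import random

def sample_run_cost(s0, policy, outcome_dist, is_goal, rng=random):
    """Sample one run of a proper policy from s0 and return its cost.

    policy(s)          -> operator pi(s)
    outcome_dist(s, o) -> {successor state: probability}, i.e. s[[o]]
    is_goal(s)         -> True iff s is in S*
    """
    s, cost = s0, 0.0
    while not is_goal(s):
        o = policy(s)
        cost += o.cost                                       # cost(pi(s_i))
        successors, probs = zip(*outcome_dist(s, o).items())
        s = rng.choices(successors, weights=probs, k=1)[0]   # s_{i+1} ~ s_i[[pi(s_i)]]
    return cost
```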

(29)

Executing a Policy

Definition (Run in MDP)

Let T be an MDP and π be a policy for T. A sequence of transitions

ρ_π = s0 −(p1:π(s0))→ s1, . . . , s_{n−1} −(pn:π(s_{n−1}))→ s_n

is a run ρ_π of π if s_{i+1} ∼ s_i⟦π(s_i)⟧. The reward of run ρ_π is

reward(ρ_π) = Σ_{i=0}^{n−1} γ^i · reward(s_i, π(s_i)).

To generate a run, a termination criterion (e.g., based on the change of the accumulated reward) must be specified.
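For MDPs the same sampling idea applies, with γ^i-discounted rewards and an explicit termination criterion. The sketch below stops once the maximum possible remaining per-step contribution drops below a threshold; the reward bound r_max is a made-up assumption, since the slides leave the criterion open.

```python
import random

def sample_discounted_reward(s0, policy, outcome_dist, reward, gamma,
                             r_max=100.0, eps=1e-6, rng=random):
    """Sample one run of policy in an MDP and return its discounted reward.

    reward(s, o) gives the immediate reward, gamma in (0, 1) is the discount
    factor, and r_max is an assumed bound on |reward(s, o)| used only to decide
    when further steps change the accumulated reward negligibly.
    """
    s, total, discount = s0, 0.0, 1.0
    while discount * r_max > eps:
        o = policy(s)
        total += discount * reward(s, o)                 # gamma^i * reward(s_i, pi(s_i))
        successors, probs = zip(*outcome_dist(s, o).items())
        s = rng.choices(successors, weights=probs, k=1)[0]
        discount *= gamma
    return total
```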

(30)

Estimated Policy Evaluation

Definition (Estimated Policy Evaluation)

Let T be an SSP, π be a policy for T and ⟨ρ¹_π, . . . , ρⁿ_π⟩ be a sequence of runs of π.

The estimated quality of π via estimated policy evaluation is

Ṽ^π := (1/n) · Σ_{i=1}^{n} cost(ρ^i_π).
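Estimated policy evaluation is then a plain Monte-Carlo average over n sampled runs; the sketch assumes a zero-argument sampler such as the run sampler above with its arguments already bound.

```python
def estimated_quality(sample_run, n):
    """Estimate the quality of a policy as the mean cost of n sampled runs.

    sample_run() -> cost of a single run of pi.
    """
    return sum(sample_run() for _ in range(n)) / n
```

By the strong law of large numbers (next slide), this estimate converges to V^π(s0) as n grows.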

(31)

Convergence of Estimated Policy Evaluation in SSPs

Theorem

Let T be an SSP, π be a policy for T and ⟨ρ¹_π, . . . , ρⁿ_π⟩ be a sequence of runs of π. Then Ṽ^π → V^π(s0) for n → ∞.

Proof.

Holds due to the strong law of large numbers.

⇒ Ṽ^π is a good approximation of V^π(s0) if n is sufficiently large.

(32)

Summary

(33)

Summary

MDP and SSP planning tasks represent MDPs and SSPs compactly.

Policy existence in SSPs is EXP-complete.

Interleaving planning and execution avoids the representation issues of the (typically exponentially sized) policy.

The quality of such an incomplete policy can be estimated by executing it a fixed number of times.

In SSPs, estimated policy evaluation converges to the true quality of the policy.
