
Planning and Optimization

F2. Policies & Compact Description

Gabriele Röger and Thomas Keller

Universität Basel

November 21, 2018


Planning and Optimization

November 21, 2018 — F2. Policies & Compact Description

F2.1 Policies & Value Functions
F2.2 Factored MDPs
F2.3 Summary


Content of this Course

[Course overview: Planning splits into Classical (Tasks, Progression/Regression, Complexity, Heuristics) and Probabilistic (MDPs, Blind Methods, Heuristic Search, Monte-Carlo).]


F2.1 Policies & Value Functions


Solutions in SSPs

[Figure: transition system over the states LL, LR, TL, TR, RL and RR with the plan move-L, pickup, move-R, drop.]

▶ a solution in deterministic transition systems is a plan, i.e., a goal path from s0 to some s⋆ ∈ S⋆
▶ the cheapest plan is an optimal solution
▶ a deterministic agent that executes the plan will reach the goal


Solutions in SSPs

[Figure: the same transition system with probabilistic action outcomes (probabilities 0.8 and 0.2); one branch is annotated "can't drop!".]

▶ a probabilistic agent will not reach the goal or cannot execute the plan
▶ non-determinism can lead to a different outcome than anticipated in the plan
▶ we require a more general solution: a policy


Solutions in SSPs

[Figure: the same transition system with a policy that assigns move-L to LR, pickup to LL, move-R to TL and drop to TR.]

▶ a policy must be allowed to be cyclic
▶ a policy must be able to branch over outcomes
▶ a policy assigns applicable labels to states


Policy for SSPs

Definition (Policy for SSPs)

Let T = ⟨S, L, c, T, s0, S⋆⟩ be an SSP. A policy for T is a mapping π : S → L ∪ {⊥} such that π(s) ∈ L(s) ∪ {⊥} for all s.

The set of reachable states Sπ(s) from s under π is defined recursively as the smallest set satisfying the rules

▶ s ∈ Sπ(s) and
▶ succ(s′, π(s′)) ⊆ Sπ(s) for all s′ ∈ Sπ(s) \ S⋆ where π(s′) ≠ ⊥.

If π(s′) ≠ ⊥ for all s′ ∈ Sπ(s) \ S⋆, then π is executable in s.
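To make the definition concrete, here is a minimal Python sketch (not part of the slides; the dict-based policy, the succ function and all names are illustrative assumptions). It computes Sπ(s) by the two closure rules above and checks executability.

# Hypothetical illustration of the definition above. A policy is a dict
# mapping states to labels; None stands in for the "undefined" symbol ⊥.

def reachable_states(s, policy, succ, goal_states):
    """Smallest set S_pi(s) satisfying the two rules of the definition."""
    reached = {s}
    frontier = [s]
    while frontier:
        s1 = frontier.pop()
        if s1 in goal_states or policy.get(s1) is None:
            continue  # goal states and states with pi(s1) = ⊥ are not expanded
        for s2 in succ(s1, policy[s1]):
            if s2 not in reached:
                reached.add(s2)
                frontier.append(s2)
    return reached

def is_executable(s, policy, succ, goal_states):
    """pi is executable in s if it is defined on all reachable non-goal states."""
    return all(policy.get(s1) is not None
               for s1 in reachable_states(s, policy, succ, goal_states)
               if s1 not in goal_states)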


Policy Representation

▶ the size of the explicit representation of an executable policy π is |Sπ(s0)|
▶ often, |Sπ(s0)| is similar to |S|
▶ compact policy representation, e.g. via value function approximation or neural networks, is an active research area
  ⇒ not covered in this course
▶ instead, we consider small state spaces for basic algorithms
▶ or online planning, where planning for the current state s0 is interleaved with execution of π(s0)


Value Functions of SSPs

Definition (Value Functions of SSPs)

Let T = ⟨S, L, c, T, s0, S⋆⟩ be an SSP and π be an executable policy for T. The state-value Vπ(s) of s under π is defined as

V^\pi(s) := \begin{cases} 0 & \text{if } s \in S_\star \\ Q^\pi(s, \pi(s)) & \text{otherwise,} \end{cases}

where the action-value Qπ(s, ℓ) under π is defined as

Q^\pi(s, \ell) := c(\ell) + \sum_{s' \in \mathrm{succ}(s, \ell)} T(s, \ell, s') \cdot V^\pi(s').
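As an illustration only (not from the slides), the two equations can be turned into a simple fixed-point computation for a fixed executable policy. The dict-based encoding of T and c is an assumption, and convergence relies on the policy reaching the goal with probability 1.

# Iterative evaluation of V_pi as defined above (illustrative sketch).
# T[(s, l)] is assumed to map each successor s' to its probability T(s, l, s'),
# and c[l] is the cost of label l.

def evaluate_policy(states, goal_states, policy, T, c, iterations=1000):
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        for s in states:
            if s in goal_states:
                V[s] = 0.0                          # V_pi(s) = 0 for s in S*
            else:
                l = policy[s]                       # Q_pi(s, pi(s)):
                V[s] = c[l] + sum(p * V[s2] for s2, p in T[(s, l)].items())
    return V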


Example: Value Functions of SSPs

Example

Consider the example task and π with π(LR) = move-L, π(LL) = pickup, π(TL) = move-R and π(TR) = drop.

V*(LR) = 1 + V*(LL)
V*(LL) = 1 + V*(TL)
V*(TL) = 1 + (0.8 · V*(RR)) + (0.2 · V*(LR))
V*(TR) = 1 + V*(RR)
V*(RL) = 0
V*(RR) = 0

What is the solution of this? ⇒ next week!


Bellman Optimality Equation

Definition (Optimal Policy in SSPs)

Let the Bellman optimality equation for a state s of an SSP be the set of equations that describes V*(s), where

V^*(s) := \begin{cases} 0 & \text{if } s \in S_\star \\ \min_{\ell \in L(s)} Q^*(s, \ell) & \text{otherwise,} \end{cases}

Q^*(s, \ell) := c(\ell) + \sum_{s' \in \mathrm{succ}(s, \ell)} T(s, \ell, s') \cdot V^*(s').

A policy π* is an optimal policy if π*(s) ∈ arg min_{ℓ ∈ L(s)} Q*(s, ℓ) for all s ∈ S, and the expected cost of π* in T is V*(s0).
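The following hypothetical sketch (same assumed encoding as in the previous sketch) reads off an optimal policy from a given V* exactly as in the definition, by picking a minimizer of Q*(s, ℓ) in every non-goal state.

def greedy_policy(states, goal_states, labels, V_star, T, c):
    """pi*(s) in argmin over l in L(s) of Q*(s, l), per the definition above."""
    policy = {}
    for s in states:
        if s in goal_states:
            policy[s] = None                        # nothing left to do (⊥)
            continue
        q = {l: c[l] + sum(p * V_star[s2] for s2, p in T[(s, l)].items())
             for l in labels(s)}                    # Q*(s, l) for each applicable l
        policy[s] = min(q, key=q.get)               # any minimizer is optimal
    return policy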


Dead-end States

▶ dead-end states are a problem with our formalization
▶ each policy with non-zero probability of reaching a dead-end has infinite state-value
▶ one solution is to search for the policy with the highest probability of reaching the goal
▶ unfortunately, this ignores costs
▶ there is also research on dead-end detection
▶ in this course, we only consider SSPs, FH-MDPs and DR-MDPs that are dead-end free


Policies for FH-MDPs

▶ What is the optimal policy for the SSP at the blackboard?
▶ Can we do better if we regard this as an FH-MDP?
▶ Yes, by acting differently close to the horizon.


Policy for FH-MDPs

Definition (Policy for FH-MDPs)

Let T = ⟨S, L, R, T, s0, H⟩ be an FH-MDP. A policy for T is a mapping π : S × {1, . . . , H} → L ∪ {⊥} such that π(s, d) ∈ L(s) ∪ {⊥} for all s and d.

The set of reachable states Sπ(s, d) from s with d steps-to-go under π is defined recursively as the smallest set satisfying the rules

▶ ⟨s, d⟩ ∈ Sπ(s, d) and
▶ ⟨s′′, d′ − 1⟩ ∈ Sπ(s, d) for all s′′ ∈ succ(s′, π(s′, d′)) and ⟨s′, d′⟩ ∈ Sπ(s, d) with d′ > 0 and π(s′, d′) ≠ ⊥.

If π(s′, d′) ≠ ⊥ for all ⟨s′, d′⟩ ∈ Sπ(s, d) with d′ > 0, then π is executable in s.


Value Functions for FH-MDPs

Definition (Value Functions for FH-MDPs)

Let T = ⟨S, L, R, T, s0, H⟩ be an FH-MDP and π be an executable policy for T. The state-value Vπ(s, d) of s and d under π is defined as

V^\pi(s, d) := \begin{cases} 0 & \text{if } d = 0 \\ Q^\pi(s, d, \pi(s, d)) & \text{otherwise,} \end{cases}

where the action-value Qπ(s, d, ℓ) under π is defined as

Q^\pi(s, d, \ell) := R(s, \ell) + \sum_{s' \in \mathrm{succ}(s, \ell)} T(s, \ell, s') \cdot V^\pi(s', d - 1).
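Because of the terminal case d = 0, Vπ(s, d) can be computed by a single backward sweep over steps-to-go. Here is a hedged Python sketch; the encoding of T and R is an assumption, not given in the slides.

# Backward recursion for V_pi(s, d) as defined above (illustrative sketch).
# policy maps (s, d) to a label, T[(s, l)] maps successors to probabilities,
# and R(s, l) is the reward of applying l in s.

def evaluate_fh_policy(states, policy, T, R, H):
    V = {(s, 0): 0.0 for s in states}               # V_pi(s, 0) = 0
    for d in range(1, H + 1):
        for s in states:
            l = policy[(s, d)]                      # Q_pi(s, d, pi(s, d)):
            V[(s, d)] = R(s, l) + sum(p * V[(s2, d - 1)]
                                      for s2, p in T[(s, l)].items())
    return V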


Bellman Optimality Equation

Definition (Optimal Policy in FH-MDPs)

Let the Bellman optimality equation for a state s of an FH-MDP be the set of equations that describes V*(s, d), where

V^*(s, d) := \begin{cases} 0 & \text{if } d = 0 \\ \max_{\ell \in L(s)} Q^*(s, d, \ell) & \text{otherwise,} \end{cases}

Q^*(s, d, \ell) := R(s, \ell) + \sum_{s' \in \mathrm{succ}(s, \ell)} T(s, \ell, s') \cdot V^*(s', d - 1).

A policy π* is an optimal policy if π*(s, d) ∈ arg max_{ℓ ∈ L(s)} Q*(s, d, ℓ) for all s ∈ S and d ∈ {1, . . . , H}, and the expected reward of π* in T is V*(s0, H).
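The same backward sweep as before, with a maximization in each state, yields V* and an optimal policy (backward induction). Again a sketch under the assumed encoding, not code from the course.

def backward_induction(states, labels, T, R, H):
    """Compute V*(s, d) and a policy with pi*(s, d) in argmax_l Q*(s, d, l)."""
    V = {(s, 0): 0.0 for s in states}               # terminal case d = 0
    policy = {}
    for d in range(1, H + 1):
        for s in states:
            q = {l: R(s, l) + sum(p * V[(s2, d - 1)]
                                  for s2, p in T[(s, l)].items())
                 for l in labels(s)}                # Q*(s, d, l)
            best = max(q, key=q.get)
            policy[(s, d)] = best
            V[(s, d)] = q[best]
    return V, policy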


(Optimal) Policy and Value Functions for DR-MDPs

▶ the policy does not distinguish states based on steps-to-go (or rather the reverse, "distance-from-init")
▶ the value functions have no "terminal case"
▶ the value functions discount rewards with γ
▶ the Bellman optimality equation is derived from the value functions as for FH-MDPs (a sketch of the resulting equations follows below)
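The slides do not spell the DR-MDP equations out; under the stated analogy to the FH-MDP case (no terminal case, rewards discounted by γ) they take the following standard form, given here as a sketch in LaTeX with the notation assumed analogous:

V^\pi(s)      := Q^\pi(s, \pi(s))
Q^\pi(s,\ell) := R(s,\ell) + \gamma \sum_{s' \in \mathrm{succ}(s,\ell)} T(s,\ell,s') \cdot V^\pi(s')

V^*(s)        := \max_{\ell \in L(s)} Q^*(s,\ell)
Q^*(s,\ell)   := R(s,\ell) + \gamma \sum_{s' \in \mathrm{succ}(s,\ell)} T(s,\ell,s') \cdot V^*(s')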


F2.2 Factored MDPs


Factored SSPs

We would like to specify huge SSPs without enumerating states. In classical planning, we achieved this via propositional planning tasks:

▶ represent different aspects of the world in terms of different Boolean state variables
▶ treat state variables as atomic propositions: a state is a valuation of state variables
▶ n state variables induce 2^n states
  ⇒ exponentially more compact than "flat" representations

⇒ can also be used for SSPs (a small sketch follows below)
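A tiny illustrative Python snippet (variable names are hypothetical, not from the course example) showing that a valuation-based representation over n Boolean variables induces exactly 2^n states:

from itertools import product

# Illustrative only: a state is a valuation of Boolean state variables.
variables = ["robot-at-right", "ball-at-right", "ball-in-hand"]   # hypothetical
states = [dict(zip(variables, values))
          for values in product([False, True], repeat=len(variables))]
assert len(states) == 2 ** len(variables)   # n state variables induce 2^n states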


Reminder: Syntax of Operators

Definition (Operator)

An operator o over state variables V is an object with three properties:

▶ a precondition pre(o), a logical formula over V
▶ an effect eff(o) over V, defined on the following slides
▶ a cost cost(o) ∈ ℝ₀⁺

⇒ can also be used for SSPs (a hypothetical rendering follows below)
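One possible rendering of this reminder as a Python data structure (a sketch; the course does not prescribe an implementation, and precondition/effect are kept abstract until the next slides):

from dataclasses import dataclass

@dataclass
class Operator:
    precondition: object   # pre(o): a logical formula over V
    effect: object         # eff(o): an effect over V (see the following slides)
    cost: float            # cost(o): a non-negative real number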


Reminder: Syntax of Effects

Definition (Effect)

Effects over state variables V are inductively defined as follows:

▶ If v ∈ V is a state variable, then v and ¬v are effects (atomic effect).
▶ If e1, . . . , en are effects, then (e1 ∧ · · · ∧ en) is an effect (conjunctive effect). The special case with n = 0 is the empty effect ⊤.
▶ If χ is a logical formula and e is an effect, then (χ ▷ e) is an effect (conditional effect).

Parentheses can be omitted when this does not cause ambiguity.
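The inductive definition maps naturally onto a small class hierarchy; the following Python sketch is one hypothetical representation (not from the course):

from dataclasses import dataclass, field

@dataclass
class AtomicEffect:            # v or ¬v for a state variable v
    variable: str
    negated: bool = False

@dataclass
class ConjunctiveEffect:       # (e1 ∧ ... ∧ en); an empty list is the empty effect ⊤
    effects: list = field(default_factory=list)

@dataclass
class ConditionalEffect:       # (χ ▷ e)
    condition: object          # the logical formula χ
    effect: object             # the effect e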


Syntax of Probabilistic Effects

Definition (Effect)

Effects over state variables V are inductively defined as follows:

▶ If v ∈ V is a state variable, then v and ¬v are effects (atomic effect).
▶ If e1, . . . , en are effects, then (e1 ∧ · · · ∧ en) is an effect (conjunctive effect). The special case with n = 0 is the empty effect ⊤.
▶ If χ is a logical formula and e is an effect, then (χ ▷ e) is an effect (conditional effect).
▶ If e1, . . . , en are effects and p1, . . . , pn ∈ [0, 1] such that p1 + · · · + pn = 1, then (p1 : e1 | . . . | pn : en) is an effect (probabilistic effect).

Parentheses can be omitted when this does not cause ambiguity.
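Continuing the hypothetical representation sketched above, the probabilistic case adds one more effect kind whose outcome probabilities must sum to 1:

from dataclasses import dataclass

@dataclass
class ProbabilisticEffect:     # (p1 : e1 | ... | pn : en)
    outcomes: list             # list of (probability, effect) pairs

    def __post_init__(self):
        total = sum(p for p, _ in self.outcomes)
        assert abs(total - 1.0) < 1e-9, "outcome probabilities must sum to 1"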


▶ FDR tasks can be generalized to SSPs in the same way
▶ both propositional and FDR tasks can be generalized to FH-MDPs and DR-MDPs


F2.3 Summary


Summary

▶ Policies allow branching over outcomes and cycles
▶ The state-value of a policy describes the expected cost (SSPs) or expected reward (MDPs) of following that policy
▶ The related Bellman optimality equation describes an optimal policy
▶ Compact descriptions that induce SSPs and MDPs are analogous to classical planning tasks
