
Planning and Optimization

F2. Policies & Compact Description

Gabriele Röger and Thomas Keller

Universität Basel

November 21, 2018


Planning and Optimization

November 21, 2018 — F2. Policies & Compact Description

F2.1 Policies & Value Functions
F2.2 Factored MDPs
F2.3 Summary


Content of this Course

[Course overview: Planning splits into Classical (Tasks, Progression/Regression, Complexity, Heuristics) and Probabilistic (MDPs, Blind Methods, Heuristic Search, Monte-Carlo).]


F2.1 Policies & Value Functions


Solutions in SSPs

[Figure: transition system over the states LL, LR, TL, TR, RL and RR with the plan move-L, pickup, move-R, drop.]

▶ a solution in deterministic transition systems is a plan, i.e., a goal path from s0 to some s⋆ ∈ S⋆
▶ the cheapest plan is an optimal solution
▶ a deterministic agent that executes the plan will reach the goal


Solutions in SSPs

[Figure: the same transition system with probabilistic action outcomes (probabilities 0.8 and 0.2); one branch is annotated "can't drop!".]

▶ a probabilistic agent will not reach the goal or cannot execute the plan
▶ non-determinism can lead to a different outcome than anticipated in the plan
▶ we require a more general solution: a policy


Solutions in SSPs

[Figure: the same transition system with a policy that assigns move-L to LR, pickup to LL, move-R to TL and drop to TR.]

▶ a policy must be allowed to be cyclic
▶ a policy must be able to branch over outcomes
▶ a policy assigns applicable labels to states


Policy for SSPs

Definition (Policy for SSPs)

Let T = ⟨S, L, c, T, s0, S⋆⟩ be an SSP. A policy for T is a mapping π : S → L ∪ {⊥} such that π(s) ∈ L(s) ∪ {⊥} for all s.

The set of reachable states Sπ(s) from s under π is defined recursively as the smallest set satisfying the rules

▶ s ∈ Sπ(s) and
▶ succ(s′, π(s′)) ⊆ Sπ(s) for all s′ ∈ Sπ(s) \ S⋆ where π(s′) ≠ ⊥.

If π(s′) ≠ ⊥ for all s′ ∈ Sπ(s) \ S⋆, then π is executable in s.
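To make the definition concrete, here is a minimal Python sketch (not part of the slides; the dict-based policy, the succ function and all names are illustrative assumptions). It computes Sπ(s) by the two closure rules above and checks executability.

# Hypothetical illustration of the definition above. A policy is a dict
# mapping states to labels; None stands in for the "undefined" symbol ⊥.

def reachable_states(s, policy, succ, goal_states):
    """Smallest set S_pi(s) satisfying the two rules of the definition."""
    reached = {s}
    frontier = [s]
    while frontier:
        s1 = frontier.pop()
        if s1 in goal_states or policy.get(s1) is None:
            continue  # goal states and states with pi(s1) = ⊥ are not expanded
        for s2 in succ(s1, policy[s1]):
            if s2 not in reached:
                reached.add(s2)
                frontier.append(s2)
    return reached

def is_executable(s, policy, succ, goal_states):
    """pi is executable in s if it is defined on all reachable non-goal states."""
    return all(policy.get(s1) is not None
               for s1 in reachable_states(s, policy, succ, goal_states)
               if s1 not in goal_states)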


Policy Representation

▶ the size of the explicit representation of an executable policy π is |Sπ(s0)|
▶ often, |Sπ(s0)| is similar to |S|
▶ compact policy representation, e.g. via value function approximation or neural networks, is an active research area
  ⇒ not covered in this course
▶ instead, we consider small state spaces for basic algorithms
▶ or online planning, where planning for the current state s0 is interleaved with execution of π(s0)


Value Functions of SSPs

Definition (Value Functions of SSPs)

Let T = ⟨S, L, c, T, s0, S⋆⟩ be an SSP and π be an executable policy for T. The state-value Vπ(s) of s under π is defined as

V^\pi(s) := \begin{cases} 0 & \text{if } s \in S_\star \\ Q^\pi(s, \pi(s)) & \text{otherwise,} \end{cases}

where the action-value Qπ(s, ℓ) under π is defined as

Q^\pi(s, \ell) := c(\ell) + \sum_{s' \in \mathrm{succ}(s, \ell)} T(s, \ell, s') \cdot V^\pi(s').
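As an illustration only (not from the slides), the two equations can be turned into a simple fixed-point computation for a fixed executable policy. The dict-based encoding of T and c is an assumption, and convergence relies on the policy reaching the goal with probability 1.

# Iterative evaluation of V_pi as defined above (illustrative sketch).
# T[(s, l)] is assumed to map each successor s' to its probability T(s, l, s'),
# and c[l] is the cost of label l.

def evaluate_policy(states, goal_states, policy, T, c, iterations=1000):
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        for s in states:
            if s in goal_states:
                V[s] = 0.0                          # V_pi(s) = 0 for s in S*
            else:
                l = policy[s]                       # Q_pi(s, pi(s)):
                V[s] = c[l] + sum(p * V[s2] for s2, p in T[(s, l)].items())
    return V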


Example: Value Functions of SSPs

Example

Consider the example task and π with π(LR) = move-L, π(LL) = pickup, π(TL) = move-R and π(TR) = drop.

V*(LR) = 1 + V*(LL)
V*(LL) = 1 + V*(TL)
V*(TL) = 1 + (0.8 · V*(RR)) + (0.2 · V*(LR))
V*(TR) = 1 + V*(RR)
V*(RL) = 0
V*(RR) = 0

What is the solution of this? ⇒ next week!


Bellman Optimality Equation

Definition (Optimal Policy in SSPs)

Let the Bellman optimality equation for a state s of an SSP be the set of equations that describes V*(s), where

V^*(s) := \begin{cases} 0 & \text{if } s \in S_\star \\ \min_{\ell \in L(s)} Q^*(s, \ell) & \text{otherwise,} \end{cases}

Q^*(s, \ell) := c(\ell) + \sum_{s' \in \mathrm{succ}(s, \ell)} T(s, \ell, s') \cdot V^*(s').

A policy π* is an optimal policy if π*(s) ∈ arg min_{ℓ ∈ L(s)} Q*(s, ℓ) for all s ∈ S, and the expected cost of π* in T is V*(s0).
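The following hypothetical sketch (same assumed encoding as in the previous sketch) reads off an optimal policy from a given V* exactly as in the definition, by picking a minimizer of Q*(s, ℓ) in every non-goal state.

def greedy_policy(states, goal_states, labels, V_star, T, c):
    """pi*(s) in argmin over l in L(s) of Q*(s, l), per the definition above."""
    policy = {}
    for s in states:
        if s in goal_states:
            policy[s] = None                        # nothing left to do (⊥)
            continue
        q = {l: c[l] + sum(p * V_star[s2] for s2, p in T[(s, l)].items())
             for l in labels(s)}                    # Q*(s, l) for each applicable l
        policy[s] = min(q, key=q.get)               # any minimizer is optimal
    return policy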


Dead-end States

▶ dead-end states are a problem with our formalization
▶ each policy with non-zero probability of reaching a dead-end has infinite state-value
▶ one solution is to search for the policy with the highest probability of reaching the goal
▶ unfortunately, this ignores costs
▶ there is also research on dead-end detection
▶ in this course, we only consider SSPs, FH-MDPs and DR-MDPs that are dead-end free


Policies for FH-MDPs

▶ What is the optimal policy for the SSP at the blackboard?
▶ Can we do better if we regard this as an FH-MDP?
▶ Yes, by acting differently close to the horizon.


Policy for FH-MDPs

Definition (Policy for FH-MDPs)

Let T = ⟨S, L, R, T, s0, H⟩ be an FH-MDP. A policy for T is a mapping π : S × {1, . . . , H} → L ∪ {⊥} such that π(s, d) ∈ L(s) ∪ {⊥} for all s and d.

The set of reachable states Sπ(s, d) from s with d steps-to-go under π is defined recursively as the smallest set satisfying the rules

▶ ⟨s, d⟩ ∈ Sπ(s, d) and
▶ ⟨s′′, d′ − 1⟩ ∈ Sπ(s, d) for all s′′ ∈ succ(s′, π(s′, d′)) and ⟨s′, d′⟩ ∈ Sπ(s, d) with d′ > 0 and π(s′, d′) ≠ ⊥.

If π(s′, d′) ≠ ⊥ for all ⟨s′, d′⟩ ∈ Sπ(s, d) with d′ > 0, then π is executable in s.


Value Functions for FH-MDPs

Definition (Value Functions for FH-MDPs)

Let T = ⟨S, L, R, T, s0, H⟩ be an FH-MDP and π be an executable policy for T. The state-value Vπ(s, d) of s and d under π is defined as

V^\pi(s, d) := \begin{cases} 0 & \text{if } d = 0 \\ Q^\pi(s, d, \pi(s, d)) & \text{otherwise,} \end{cases}

where the action-value Qπ(s, d, ℓ) under π is defined as

Q^\pi(s, d, \ell) := R(s, \ell) + \sum_{s' \in \mathrm{succ}(s, \ell)} T(s, \ell, s') \cdot V^\pi(s', d - 1).
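Because of the terminal case d = 0, Vπ(s, d) can be computed by a single backward sweep over steps-to-go. Here is a hedged Python sketch; the encoding of T and R is an assumption, not given in the slides.

# Backward recursion for V_pi(s, d) as defined above (illustrative sketch).
# policy maps (s, d) to a label, T[(s, l)] maps successors to probabilities,
# and R(s, l) is the reward of applying l in s.

def evaluate_fh_policy(states, policy, T, R, H):
    V = {(s, 0): 0.0 for s in states}               # V_pi(s, 0) = 0
    for d in range(1, H + 1):
        for s in states:
            l = policy[(s, d)]                      # Q_pi(s, d, pi(s, d)):
            V[(s, d)] = R(s, l) + sum(p * V[(s2, d - 1)]
                                      for s2, p in T[(s, l)].items())
    return V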


Bellman Optimality Equation

Definition (Optimal Policy in FH-MDPs)

Let the Bellman optimality equation for a state s of an FH-MDP be the set of equations that describes V*(s, d), where

V^*(s, d) := \begin{cases} 0 & \text{if } d = 0 \\ \max_{\ell \in L(s)} Q^*(s, d, \ell) & \text{otherwise,} \end{cases}

Q^*(s, d, \ell) := R(s, \ell) + \sum_{s' \in \mathrm{succ}(s, \ell)} T(s, \ell, s') \cdot V^*(s', d - 1).

A policy π* is an optimal policy if π*(s, d) ∈ arg max_{ℓ ∈ L(s)} Q*(s, d, ℓ) for all s ∈ S and d ∈ {1, . . . , H}, and the expected reward of π* in T is V*(s0, H).
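The same backward sweep as before, with a maximization in each state, yields V* and an optimal policy (backward induction). Again a sketch under the assumed encoding, not code from the course.

def backward_induction(states, labels, T, R, H):
    """Compute V*(s, d) and a policy with pi*(s, d) in argmax_l Q*(s, d, l)."""
    V = {(s, 0): 0.0 for s in states}               # terminal case d = 0
    policy = {}
    for d in range(1, H + 1):
        for s in states:
            q = {l: R(s, l) + sum(p * V[(s2, d - 1)]
                                  for s2, p in T[(s, l)].items())
                 for l in labels(s)}                # Q*(s, d, l)
            best = max(q, key=q.get)
            policy[(s, d)] = best
            V[(s, d)] = q[best]
    return V, policy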


(Optimal) Policy and Value Functions for DR-MDPs

▶ the policy does not distinguish states based on steps-to-go (or rather the reverse, "distance-from-init")
▶ the value functions have no "terminal case"
▶ the value functions discount rewards with γ
▶ the Bellman optimality equation is derived from the value functions as for FH-MDPs (a sketch of the resulting equations follows below)
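The slides do not spell the DR-MDP equations out; under the stated analogy to the FH-MDP case (no terminal case, rewards discounted by γ) they take the following standard form, given here as a sketch in LaTeX with the notation assumed analogous:

V^\pi(s)      := Q^\pi(s, \pi(s))
Q^\pi(s,\ell) := R(s,\ell) + \gamma \sum_{s' \in \mathrm{succ}(s,\ell)} T(s,\ell,s') \cdot V^\pi(s')

V^*(s)        := \max_{\ell \in L(s)} Q^*(s,\ell)
Q^*(s,\ell)   := R(s,\ell) + \gamma \sum_{s' \in \mathrm{succ}(s,\ell)} T(s,\ell,s') \cdot V^*(s')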


F2.2 Factored MDPs


Factored SSPs

We would like to specify huge SSPs without enumerating states. In classical planning, we achieved this via propositional planning tasks:

▶ represent different aspects of the world in terms of different Boolean state variables
▶ treat state variables as atomic propositions: a state is a valuation of state variables
▶ n state variables induce 2^n states
  ⇒ exponentially more compact than "flat" representations

⇒ can also be used for SSPs (a small sketch follows below)
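A tiny illustrative Python snippet (variable names are hypothetical, not from the course example) showing that a valuation-based representation over n Boolean variables induces exactly 2^n states:

from itertools import product

# Illustrative only: a state is a valuation of Boolean state variables.
variables = ["robot-at-right", "ball-at-right", "ball-in-hand"]   # hypothetical
states = [dict(zip(variables, values))
          for values in product([False, True], repeat=len(variables))]
assert len(states) == 2 ** len(variables)   # n state variables induce 2^n states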


Reminder: Syntax of Operators

Definition (Operator)

An operator o over state variables V is an object with three properties:

▶ a precondition pre(o), a logical formula over V
▶ an effect eff(o) over V, defined on the following slides
▶ a cost cost(o) ∈ ℝ₀⁺

⇒ can also be used for SSPs (a hypothetical rendering follows below)
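One possible rendering of this reminder as a Python data structure (a sketch; the course does not prescribe an implementation, and precondition/effect are kept abstract until the next slides):

from dataclasses import dataclass

@dataclass
class Operator:
    precondition: object   # pre(o): a logical formula over V
    effect: object         # eff(o): an effect over V (see the following slides)
    cost: float            # cost(o): a non-negative real number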


Reminder: Syntax of Effects

Definition (Effect)

Effects over state variables V are inductively defined as follows:

▶ If v ∈ V is a state variable, then v and ¬v are effects (atomic effect).
▶ If e1, . . . , en are effects, then (e1 ∧ · · · ∧ en) is an effect (conjunctive effect). The special case with n = 0 is the empty effect ⊤.
▶ If χ is a logical formula and e is an effect, then (χ ▷ e) is an effect (conditional effect).

Parentheses can be omitted when this does not cause ambiguity.
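The inductive definition maps naturally onto a small class hierarchy; the following Python sketch is one hypothetical representation (not from the course):

from dataclasses import dataclass, field

@dataclass
class AtomicEffect:            # v or ¬v for a state variable v
    variable: str
    negated: bool = False

@dataclass
class ConjunctiveEffect:       # (e1 ∧ ... ∧ en); an empty list is the empty effect ⊤
    effects: list = field(default_factory=list)

@dataclass
class ConditionalEffect:       # (χ ▷ e)
    condition: object          # the logical formula χ
    effect: object             # the effect e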


Syntax of Probabilistic Effects

Definition (Effect)

Effects over state variables V are inductively defined as follows:

▶ If v ∈ V is a state variable, then v and ¬v are effects (atomic effect).
▶ If e1, . . . , en are effects, then (e1 ∧ · · · ∧ en) is an effect (conjunctive effect). The special case with n = 0 is the empty effect ⊤.
▶ If χ is a logical formula and e is an effect, then (χ ▷ e) is an effect (conditional effect).
▶ If e1, . . . , en are effects and p1, . . . , pn ∈ [0, 1] such that p1 + · · · + pn = 1, then (p1 : e1 | . . . | pn : en) is an effect (probabilistic effect).

Parentheses can be omitted when this does not cause ambiguity.
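Continuing the hypothetical representation sketched above, the probabilistic case adds one more effect kind whose outcome probabilities must sum to 1:

from dataclasses import dataclass

@dataclass
class ProbabilisticEffect:     # (p1 : e1 | ... | pn : en)
    outcomes: list             # list of (probability, effect) pairs

    def __post_init__(self):
        total = sum(p for p, _ in self.outcomes)
        assert abs(total - 1.0) < 1e-9, "outcome probabilities must sum to 1"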


▶ FDR tasks can be generalized to SSPs in the same way
▶ both propositional and FDR tasks can be generalized to FH-MDPs and DR-MDPs


F2.3 Summary


Summary

▶ Policies allow branching over outcomes and cycles
▶ The state-value of a policy describes the expected cost (SSPs) or expected reward (MDPs) of following that policy
▶ The related Bellman optimality equation describes an optimal policy
▶ Compact descriptions that induce SSPs and MDPs are analogous to classical planning tasks
