Planning and Optimization

F1. Markov Decision Processes

Gabriele Röger and Thomas Keller

Universität Basel

November 21, 2018



F1.1 Motivation
F1.2 Markov Decision Processes
F1.3 Summary


Content of this Course

Planning

- Classical: Tasks, Progression/Regression, Complexity, Heuristics
- Probabilistic: MDPs, Blind Methods, Heuristic Search, Monte-Carlo Methods


F1.1 Motivation


Generalization of Classical Planning: Temporal Planning

- timetable for astronauts on the ISS
- concurrency required for some experiments
- optimize the makespan


Generalization of Classical Planning: Numeric Planning

- kinematics of a robotic arm
- state space is continuous
- preconditions and effects described by complex functions


Generalization of Classical Planning: MDPs

(Figure: a 5×5 grid of image patches)

- satellite takes images of patches on earth
- weather forecast is uncertain
- find a solution with the lowest expected cost


Generalization of Classical Planning: Multiplayer Games

- Chess
- there is an opponent with conflicting goals


Generalization of Classical Planning: POMDPs

- Solitaire
- some state information cannot be observed
- must reason over beliefs for good behaviour


Limitations of Classical Planning

- many applications are combinations of these
- all of these are active research areas
- we focus on one of them: probabilistic planning with Markov decision processes
- MDPs are closely related to games (Why?)


F1.2 Markov Decision Processes


Markov Decision Processes

- Markov decision processes (MDPs) have been studied since the 1950s
- work up to the 1980s mostly on theory and basic algorithms for small to medium-sized MDPs
- today, the focus is on large (typically factored) MDPs
- fundamental data structure for reinforcement learning (not covered in this course) and for probabilistic planning
- different variants exist


Reminder: Transition Systems

Definition (Transition System)

A transition system is a 6-tuple $\mathcal{T} = \langle S, L, c, T, s_0, S_\star \rangle$ where

- $S$ is a finite set of states,
- $L$ is a finite set of (transition) labels,
- $c: L \to \mathbb{R}_0^+$ is a label cost function,
- $T \subseteq S \times L \times S$ is the transition relation,
- $s_0 \in S$ is the initial state, and
- $S_\star \subseteq S$ is the set of goal states.
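To make the 6-tuple concrete, here is a minimal sketch of how a transition system could be represented in code; the class and field names are illustrative choices, not part of the course material:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransitionSystem:
    """A transition system <S, L, c, T, s0, S*>."""
    states: frozenset        # S: finite set of states
    labels: frozenset        # L: finite set of (transition) labels
    cost: dict               # c: maps each label to a cost >= 0
    transitions: frozenset   # T: set of (s, label, s') triples
    initial_state: object    # s0, an element of S
    goal_states: frozenset   # S*, a subset of S
```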


Reminder: Transition System Example

(Figure: transition system over the states LL, LR, TL, TR, RL, RR)

Logistics problem with one package, one truck, two locations:

- location of package: {L, R, T}
- location of truck: {L, R}


Stochastic Shortest Path Problem

Definition (Stochastic Shortest Path Problem)

A stochastic shortest path problem (SSP) is a 6-tuple $\mathcal{T} = \langle S, L, c, T, s_0, S_\star \rangle$, where

- $S$ is a finite set of states,
- $L$ is a finite set of (transition) labels,
- $c: L \to \mathbb{R}_0^+$ is a label cost function,
- $T: S \times L \times S \mapsto [0, 1]$ is the transition function,
- $s_0 \in S$ is the initial state, and
- $S_\star \subseteq S$ is the set of goal states.

For all $s \in S$ and $\ell \in L$ with $T(s, \ell, s') > 0$ for some $s' \in S$, we require $\sum_{s' \in S} T(s, \ell, s') = 1$.

Note: An SSP is the probabilistic counterpart of a transition system.
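Analogously, a hedged sketch of an SSP (again with illustrative names): the only change to the transition system above is that $T$ becomes a probability function, which can be validated against the normalization requirement:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SSP:
    """A stochastic shortest path problem <S, L, c, T, s0, S*>."""
    states: frozenset
    labels: frozenset
    cost: dict            # c: label -> cost >= 0
    transition: dict      # T: (s, label) -> {s': T(s, label, s')}
    initial_state: object
    goal_states: frozenset

    def validate(self) -> None:
        """Check that probabilities sum to 1 for every applicable label."""
        for (s, label), dist in self.transition.items():
            if dist and abs(sum(dist.values()) - 1.0) > 1e-9:
                raise ValueError(f"T({s}, {label}, .) must sum to 1")
```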


Reminder: Transition System Example

(Figure: the transition system from above with probabilistic transitions of probability 0.8 and 0.2)

Logistics problem with one package, one truck, two locations:

- location of package: {L, R, T}
- location of truck: {L, R}
- if the truck moves with the package, there is a 20% chance of losing the package


Terminology (1)

- If $p := T(s, \ell, s') > 0$, we write $s \xrightarrow{p:\ell} s'$, or $s \xrightarrow{p} s'$ if not interested in $\ell$.
- If $T(s, \ell, s') = 1$, we also write $s \xrightarrow{\ell} s'$, or $s \to s'$ if not interested in $\ell$.
- If $T(s, \ell, s') > 0$ for some $s'$, we say that $\ell$ is applicable in $s$.
- The set of applicable labels in $s$ is $L(s)$.


Terminology (2)

- the successor set of $s$ and $\ell$ is $\mathrm{succ}(s, \ell) = \{s' \in S \mid T(s, \ell, s') > 0\}$
- $s'$ is a successor of $s$ if $s' \in \mathrm{succ}(s, \ell)$ for some $\ell$
- $s$ is a predecessor of $s'$ if $s' \in \mathrm{succ}(s, \ell)$ for some $\ell$
- with $s' \sim \mathrm{succ}(s, \ell)$ we denote that successor $s' \in \mathrm{succ}(s, \ell)$ of $s$ and $\ell$ is sampled according to the probability distribution $T$
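The sampling notation $s' \sim \mathrm{succ}(s, \ell)$ has a direct operational reading; a minimal sketch, assuming the distribution over successors is given as a dictionary:

```python
import random

def sample_successor(dist: dict):
    """Sample s' ~ succ(s, l); dist maps each s' with T(s, l, s') > 0
    to its probability T(s, l, s')."""
    successors = list(dist)
    weights = list(dist.values())
    return random.choices(successors, weights=weights, k=1)[0]

# Example (cf. the logistics SSP above): moving the truck with the
# package succeeds with probability 0.8 and loses it with 0.2.
print(sample_successor({"arrived": 0.8, "lost": 0.2}))
```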


Terminology (3)

- $s'$ is reachable from $s$ if there exists a sequence of transitions $s_0 \xrightarrow{p_1:\ell_1} s_1, \dots, s_{n-1} \xrightarrow{p_n:\ell_n} s_n$ s.t. $s_0 = s$ and $s_n = s'$
  (Note: $n = 0$ is possible; then $s = s'$)
- $s_0, \dots, s_n$ is called a (state) path from $s$ to $s'$
- $\ell_1, \dots, \ell_n$ is called a (label) path from $s$ to $s'$
- $s_0 \ell_1 s_1, \dots, s_{n-1} \ell_n s_n$ is called a trace from $s$ to $s'$
- the length of a path/trace is $n$
- the cost of a label path/trace is $\sum_{i=1}^{n} c(\ell_i)$
- the probability of a path/trace is $\prod_{i=1}^{n} p_i$
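The last two definitions translate directly into code; a small sketch (the helper names are mine, not from the slides):

```python
import math

def label_path_cost(cost: dict, labels: list) -> float:
    """Cost of a label path: sum over i of c(l_i)."""
    return sum(cost[label] for label in labels)

def path_probability(probs: list) -> float:
    """Probability of a path/trace: product over i of p_i."""
    return math.prod(probs)

# Example: two 'move' steps, each with cost 1 and success probability 0.8.
print(label_path_cost({"move": 1.0}, ["move", "move"]))  # 2.0
print(path_probability([0.8, 0.8]))                      # 0.64 (up to rounding)
```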


Finite-horizon Markov Decision Process

Definition (Finite-horizon Markov Decision Process)

A finite-horizon Markov decision process (FH-MDP) is a 6-tuple $\mathcal{T} = \langle S, L, R, T, s_0, H \rangle$, where

- $S$ is a finite set of states,
- $L$ is a finite set of (transition) labels,
- $R: S \times L \to \mathbb{R}$ is the reward function,
- $T: S \times L \times S \mapsto [0, 1]$ is the transition function,
- $s_0 \in S$ is the initial state, and
- $H \in \mathbb{N}$ is the finite horizon.

For all $s \in S$ and $\ell \in L$ with $T(s, \ell, s') > 0$ for some $s' \in S$, we require $\sum_{s' \in S} T(s, \ell, s') = 1$.
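The finite horizon allows optimal expected rewards to be computed by simple backward induction over the number of remaining steps. The following is a sketch under the assumption that $R$ and $T$ are given as dictionaries; it is not the course's reference implementation:

```python
def fh_optimal_values(states, R, T, H):
    """Backward induction for an FH-MDP:
    V_0(s) = 0 and
    V_k(s) = max_l [ R(s, l) + sum_s' T(s, l, s') * V_{k-1}(s') ],
    maximizing over the labels applicable in s.
    R: dict (s, l) -> reward; T: dict (s, l) -> {s': probability}."""
    V = {s: 0.0 for s in states}  # V_0: no steps remaining
    for _ in range(H):
        V_new = {}
        for s in states:
            q_values = [
                R[s2, l] + sum(p * V[t] for t, p in dist.items())
                for (s2, l), dist in T.items() if s2 == s
            ]
            # states without applicable labels collect no further reward
            V_new[s] = max(q_values, default=0.0)
        V = V_new
    return V  # V[s]: optimal expected reward within H steps from s
```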


Example: Push Your Luck

(Figure: Push Your Luck MDP with die-roll transitions, each outcome with probability 1/6, and rewards 0 and 2)


Discounted Reward Markov Decision Process

Definition (Discounted Reward Markov Decision Process)

A discounted reward Markov decision process (DR-MDP) is a 6-tuple $\mathcal{T} = \langle S, L, R, T, s_0, \gamma \rangle$, where

- $S$ is a finite set of states,
- $L$ is a finite set of (transition) labels,
- $R: S \times L \to \mathbb{R}$ is the reward function,
- $T: S \times L \times S \mapsto [0, 1]$ is the transition function,
- $s_0 \in S$ is the initial state, and
- $\gamma \in (0, 1)$ is the discount factor.

For all $s \in S$ and $\ell \in L$ with $T(s, \ell, s') > 0$ for some $s' \in S$, we require $\sum_{s' \in S} T(s, \ell, s') = 1$.
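To see what the discount factor does, here is a small sketch computing the discounted return $\sum_{t \ge 0} \gamma^t r_t$ of an observed (finite prefix of a) reward sequence:

```python
def discounted_return(rewards, gamma: float) -> float:
    """Sum of gamma^t * r_t over a finite prefix of a reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example: reward 1 in every step with gamma = 0.9; the infinite-horizon
# value is 1 / (1 - 0.9) = 10, and a long prefix approaches it.
print(discounted_return([1.0] * 100, 0.9))  # ~9.9997
```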


Example: Grid World

(Figure: 4×3 grid world with initial state $s_0$ and cells with rewards −1 and +1)

- each move goes in an orthogonal direction with some probability
- (4,3) gives a reward of +1 and sets the position back to (1,1)
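The slide leaves the exact move probabilities open; a common convention (an assumption here, not stated above) is that a move succeeds with probability 0.8 and slips to each orthogonally adjacent direction with probability 0.1. A sketch of such a transition model:

```python
# Assumed slip model: 0.8 intended direction, 0.1 for each orthogonal one.
DELTA = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
ORTHOGONAL = {"up": ("left", "right"), "down": ("left", "right"),
              "left": ("up", "down"), "right": ("up", "down")}

def move_distribution(pos, action, width=4, height=3):
    """Distribution over successor cells for one move on the 4x3 grid;
    moves that would leave the grid keep the position unchanged."""
    dist = {}
    outcomes = [(action, 0.8)] + [(o, 0.1) for o in ORTHOGONAL[action]]
    for direction, p in outcomes:
        x, y = pos[0] + DELTA[direction][0], pos[1] + DELTA[direction][1]
        target = (x, y) if 1 <= x <= width and 1 <= y <= height else pos
        dist[target] = dist.get(target, 0.0) + p
    return dist

print(move_distribution((1, 1), "up"))
# {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}
```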


F1.3 Summary


Summary

- Many planning scenarios go beyond classical planning
- We focus on probabilistic planning
- SSPs are classical planning plus a probabilistic transition function
- FH-MDPs and DR-MDPs allow state-dependent rewards
- FH-MDPs consider a finite number of steps
- DR-MDPs discount rewards over an infinite horizon
