
NOT FOR QUOTATION WITHOUT PERMISSION OF THE AUTHOR

LEARNING BEHAVIORS OF STOCHASTIC AUTOMATA

AND

SOME APPLICATIONS

Norio Baba

November 1983

WP-83-119

Working Papers are interim reports on work of the International Institute for Applied Systems Analysis and have received only limited review. Views or opinions expressed herein do not necessarily represent those of the Institute or of its National Member Organizations.

INTERNATIONAL INSTITUTE FOR APPLIED SYSTEMS ANALYSIS
2361 Laxenburg, Austria


PREFACE

It is known that stochastic automata can be applied to describe the behavior of a decision maker or manager under conditions of uncertainty. This paper discusses the learning behaviors of stochastic automata under an unknown nonstationary multi-teacher environment. The consistency of sequential decision-making procedures is proved under some mild conditions.

V. Fedorov


The author would like to thank Prof. Fedorov, Prof. Holling, Prof. Kindler, Prof. Sawaragi, Prof. Shoman, Prof. Soeda, Dr. Staley, Prof. Walters, Prof. Wets, Prof. Wierzbicki, and Prof. Yashin for their kind advice and encouragement.


CONTENTS

Introduction

Basic Model of a Learning Automaton Operating in an Unknown Environment

Learning Automaton Model under the Nonstationary Multi-Teacher Environment

ε-Optimal Reinforcement Scheme under the Nonstationary Multi-Teacher Environment

Application to Noise-Corrupted, Multi-Objective Problem

Conclusion

References


LEARNING BEHAVIORS OF STOCHASTIC AUTOMATA AND SOME APPLICATIONS

Norio Baba

INTRODUCTION

The concept of learning automata operating in an unknown random environment was first introduced by Tsetlin (1961). He studied the learning behaviors of deterministic automata and showed that they are asymptotically optimal under some conditions. Later, Varshavskii and Vorontsova (1963) found that stochastic automata also have learning properties. Since then, the learning behaviors of stochastic automata have been studied extensively by many researchers. Chandrasekaran and Shen (1968), Norman (1968; 1972), Lakshmivarahan and Thathachar (1973), Narendra and Thathachar (1974), and others have contributed fruitful results to the literature of learning automata.

Despite active research in this field, almost all research so far has dealt with the learning behaviors of a single automaton operating in a stationary single-teacher environment, although Koditschek and Narendra (1977) considered the learning behavior of fixed-structure automata operating in a stationary multi-teacher environment. Thathachar and Bhakthavathsalam (1978) then studied variable-structure stochastic automata operating in two distinct teacher environments. Recently, Baba (1983) studied the learning behaviors of variable-structure stochastic automata under the general n-teacher environment. He proposed the GAE reinforcement scheme as a learning algorithm and proved that this reinforcement scheme has good learning properties, such as ε-optimality and absolute expediency, in the general n-teacher environment.

In this paper, we consider the learning behaviors of variable-structure stochastic automata operating in a nonstationary multi-teacher environment from which the stochastic automata receive responses taking an arbitrary value between 0 and 1. As a generalized form of the GAE reinforcement scheme, we propose the MGAE scheme and show that this scheme ensures ε-optimality in the nonstationary multi-teacher environment of an S-model. We also consider the parameter self-optimization problem with noise-corrupted, multi-objective functions by stochastic automata.

Since the theory of the learning behavior of stochastic automata operating in the NMT environment has been developed only recently, its application to real problems has not been discussed in the literature. However, the author believes that it could be applied to problems where one input elicits multiple responses from multi-criteria environments. In the following, we shall suggest two applications:

Commercial Game

Suppose that n players $(A_1, \dots, A_n)$ are taking part in a game in which they wish to open a store somewhere in r regions $(B_1, \dots, B_r)$. The m-th player $A_m$ will choose the region $B_k$ with a probability $p_{mk}$ (m = 1, ..., n; k = 1, ..., r). It is assumed that we cannot obtain any information about these probabilities.

However, if a player is to succeed, he must avoid regions containing a lot of other players. The MGAE reinforcement scheme, which will be proposed in this paper, can be used to find an appropriate region where there is a minimum of overlapping. The learning behavior of automata using the MGAE scheme in various commercial games has been simulated by computer, and results indicating the effectiveness of the scheme have been obtained.
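The following Python sketch is a hypothetical rendering of this setup (the region probabilities, the player count, and all function names are invented for illustration; the paper's own simulations are not reproduced here): each player acts as a teacher that penalizes the automaton whenever it has picked the same region.

```python
import random

# Hypothetical sketch of the commercial-game setup described above.
# The automaton proposes a region; each of the n players (teachers)
# independently picks a region with its own (unknown) probabilities and
# answers with penalty 1 if it picked the same region, reward 0 otherwise.

def make_player(region_probs):
    """Return a teacher: given the automaton's region, emit 0 (reward) or 1 (penalty)."""
    regions = list(range(len(region_probs)))
    def respond(chosen_region):
        picked = random.choices(regions, weights=region_probs)[0]
        return 1 if picked == chosen_region else 0
    return respond

# Example with r = 3 regions and n = 2 players whose preferences are unknown to the automaton.
players = [make_player([0.7, 0.2, 0.1]), make_player([0.6, 0.3, 0.1])]
responses = [player(chosen_region=2) for player in players]   # n teacher responses for action y_3
print(responses)
```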

Fishing

Suppose that there are r sea-areas in which a group of ships $(S_1, \dots, S_n)$ must catch fish. The learning behaviors of stochastic automata under multi-teacher environments can also be applied to find an appropriate sea-area. In this case, the n ships and the r sea-areas become the n teachers and the r states of the stochastic automaton, respectively. If the numbers (or volume) of the catches of the i-th ship $S_i$ are low, $S_i$ emits a penalty response. On the contrary, if great numbers of catches have been obtained, then $S_i$ emits a reward response. Depending upon the n responses from the n teachers, the stochastic automaton changes its state probability vector.

BASIC MODEL OF A LEARNING AUTOMATON OPERATING IN AN UNKNOWN ENVIRONMENT

The learning behaviors of a variable-structure stochastic automaton operating in an unknown random environment have been discussed extensively under the model shown in Figure 1. First, let us briefly explain the learning mechanism of the stochastic automaton A under the unknown random environment (teacher environment) $R(C_1, \dots, C_r)$. Then, we will explain the basic norms of the learning behaviors of the stochastic automaton A.

The stochastic automaton A is defined by the sextuple $\{S, W, Y, g, P(t), T\}$. S denotes the set of two inputs $\{0, 1\}$, where 0 indicates the reward response from $R(C_1, \dots, C_r)$ and 1 indicates the penalty response. (If the set S consists of only the two elements 0 and 1, the environment is said to be a P-model. When the input into A assumes a finite number of values in the closed interval [0,1], it is said to be a Q-model. An S-model is one in which the input into A takes an arbitrary number in the closed line segment [0,1]. In the next section, we will deal with the S-model environment.) W denotes the set of r internal states $(w_1, \dots, w_r)$. Y denotes the set of r outputs $(y_1, \dots, y_r)$. g denotes the output function $y(t) = g[w(t)]$, that is, a one-to-one deterministic mapping. P(t) denotes the probability vector $[p_1(t), \dots, p_r(t)]'$ at time t, and its i-th component $p_i(t)$ indicates the probability with which the i-th state $w_i$ is chosen at time t (i = 1, ..., r):

$$p_i(t) = \Pr\{w(t) = w_i\}, \qquad p_i(t) \ge 0, \qquad \sum_{i=1}^{r} p_i(t) = 1 .$$

Figure 1. Basic model of a learning automaton operating in an unknown random environment.

T denotes the reinforcement scheme which generates P(t+1) from P(t). Suppose that the state $w_i$ is chosen at time t. Then, the stochastic automaton A performs action $y_i$ on the random environment $R(C_1, \dots, C_r)$. In response to the action $y_i$, the environment emits output $s(t) = 1$ (penalty) with probability $C_i$ and output $s(t) = 0$ (reward) with probability $1 - C_i$ (i = 1, ..., r). If all of the $C_i$ (i = 1, ..., r) are constant, the random environment $R(C_1, \dots, C_r)$ is said to be a stationary random environment. (The term "single-teacher environment" is also used instead of the term "random environment.") On the other hand, if the $C_i$ (i = 1, ..., r) are not constant, it is said to be a nonstationary random environment. Depending upon the action of the stochastic automaton A and the environmental response to it, the reinforcement scheme T changes the probability vector P(t) to P(t+1).

The values of the $C_i$ (i = 1, ..., r) are not known a priori. Therefore, it is necessary to reduce the average penalty

$$M(t) = E\{s(t) \mid P(t)\} = \sum_{i=1}^{r} C_i\, p_i(t)$$

by selecting an appropriate reinforcement scheme. To judge the effectiveness of a learning automaton operating in a stationary random environment $R(C_1, \dots, C_r)$, various performance measures have been set up. (See Chandrasekaran and Shen 1968; Lakshmivarahan and Thathachar 1973; Narendra and Thathachar 1974.)
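A minimal sketch of this interaction loop, under assumed penalty probabilities $C_i$ and with the reinforcement scheme left as an abstract placeholder, is given below; the helper average_penalty computes $M(t) = \sum_i C_i\, p_i(t)$, the quantity a good scheme should reduce.

```python
import random

def average_penalty(C, p):
    """M(t) = sum_i C_i * p_i(t), the quantity a good reinforcement scheme should reduce."""
    return sum(ci * pi for ci, pi in zip(C, p))

def one_step(C, p, reinforcement_scheme):
    """One interaction of automaton A with a stationary P-model environment R(C_1,...,C_r)."""
    r = len(p)
    i = random.choices(range(r), weights=p)[0]      # choose state w_i with probability p_i(t)
    s = 1 if random.random() < C[i] else 0          # environment: penalty (1) with probability C_i
    return reinforcement_scheme(p, i, s)            # T maps P(t) to P(t+1)

# Example: assumed penalty probabilities and a placeholder scheme that leaves P(t) unchanged.
C = [0.8, 0.4, 0.2]
p = [1 / 3, 1 / 3, 1 / 3]
identity_scheme = lambda p, i, s: p
p = one_step(C, p, identity_scheme)
print(average_penalty(C, p))
```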

DEFINITION 1. A reinforcement scheme is said to be expedient if

$$\lim_{t\to\infty} E\{M(t)\} < \frac{1}{r}\sum_{i=1}^{r} C_i .$$

($E[\cdot]$ is the mathematical expectation.)

DEFINITION 2. A reinforcement scheme is said to be optimal if

$$\lim_{t\to\infty} E\{p_\alpha(t)\} = 1 ,$$

where $C_\alpha = \min_i \{C_i\}$.

DEFINITION 3. A reinforcement scheme is said to be ε-optimal if

$$\lim_{\theta\to 0}\ \lim_{t\to\infty} E\{p_\alpha(t)\} = 1 ,$$

where θ is a parameter included in the reinforcement scheme.

DEFINITION 4. A reinforcement scheme is said to be absolutely expedient if

$$E\{M(t+1) \mid P(t)\} < M(t)$$

for all t, all $p_i(t)$ ($0 < p_i(t) < 1$, $1 \le i \le r$), and all possible values of $C_i$ (i = 1, ..., r). ($E\{M(t+1) \mid P(t)\}$ is the conditional expectation.)

Remarks.

(a) The definition of ε-optimality can also be described by using M(t).

(b) In Definition 4, the trivial case in which all the values of $C_i$ (i = 1, ..., r) are equal is precluded.

The learning behaviors of a variable-structure stochastic automaton operating in the stationary random environment $R(C_1, \dots, C_r)$ have been extensively studied by many researchers. Norman (1968) proved that the $L_{R-I}$ scheme ensures ε-optimality in the two-state case. Sawaragi and Baba (1973) showed that this property also holds in the general r-state case. Lakshmivarahan and Thathachar (1973) introduced the concept of absolutely expedient learning algorithms.

Remark.

(c) $L_{R-I}$ scheme (Reward-Inaction scheme). Assume $y(t) = y_i$.

If $s(t) = 0$, then

$$p_i(t+1) = p_i(t) + \theta\,[1 - p_i(t)], \qquad p_m(t+1) = (1-\theta)\, p_m(t) \qquad (m \ne i;\ m = 1, \dots, r).$$

If $s(t) = 1$, then

$$p_m(t+1) = p_m(t) \qquad (m = 1, \dots, r),$$

where θ ($0 < \theta < 1$) is the learning parameter.
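A direct transcription of this reward-inaction update reads as follows (the value theta = 0.05 is an arbitrary illustrative choice):

```python
def l_r_i_update(p, i, s, theta=0.05):
    """Linear reward-inaction (L_R-I) update of the probability vector.

    p     : current probability vector P(t)
    i     : index of the action y_i performed at time t
    s     : environment response, 0 = reward, 1 = penalty
    theta : learning parameter, 0 < theta < 1
    """
    if s == 1:                                        # penalty: inaction, P(t+1) = P(t)
        return list(p)
    new_p = [(1.0 - theta) * pm for pm in p]          # shrink every component
    new_p[i] = p[i] + theta * (1.0 - p[i])            # move the freed mass onto the rewarded action
    return new_p

# Example: action y_1 (index 0) was rewarded.
print(l_r_i_update([0.5, 0.3, 0.2], i=0, s=0))
```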

Compared with the great number of studies related to the behavior of learning automata in a stationary environment, only a few and specialized results have been obtained concerning those in a nonstationary environment.

Baba and Sawaragi (1975) considered the nonstationary random environment which has the property that

$$C_\alpha(t, \omega) + \delta_1 \le C_j(t, \omega)$$

holds for some α, some $\delta_1 > 0$, all j (≠ α), all t, and all ω (ω is a point of a basic ω-space Ω). They showed that the $L_{R-I}$ scheme ensures ε-optimality under the above environment. Recently, Srikantakumar and Narendra (1982) studied the learning behaviors of stochastic automata under the following nonstationary random environment:

(i) $C_i[P(n)]$ (i = 1, ..., r; n = 0, 1, ...) are continuous functions of $p_i$ (i = 1, ..., r);

(ii) $\dfrac{\partial C_i}{\partial p_i} > 0$ for all i;

(iii) $\dfrac{\partial C_i}{\partial p_i} \gg \dfrac{\partial C_i}{\partial p_j}$ for all (i ≠ j).

This work has a very interesting application in the area of telephone network routing.

LEARNING AUTOMATON MODEL UNDER THE NONSTATIONARY MULTI-TEACHER ENVIRONMENT

In this section, we generalize the model given in Figure 1 and discuss the learning behaviors of the variable-structure stochastic automaton B in the nonstationary multi-teacher environment (NMT) as illustrated in Figure 2.

The stochastic automaton B is defined by the set $\{S, W, Y, g, P(t), T\}$. S is the set of inputs $(s^i_1, \dots, s^i_n)$, where $s^i_j$ (j = 1, ..., n) is the response from the j-th teacher $R_j$ (j = 1, ..., n) and the value of $s^i_j$ can be an arbitrary number in the closed line segment [0,1]. (We are dealing with an S-model multi-teacher environment.) In this model, the definitions of W, Y, g, P(t), and T are the same as in the last section.

Assume now that the state $w_i$ is chosen at time t. Then, the stochastic automaton B performs action $y_i$ on the nonstationary multi-teacher environment (NMT). In response to $y_i$, the j-th teacher $R_j$ emits output $s^i_j$. In this section, we shall deal with the case in which $s^i_j$ is a function of t and ω. ($\omega \in \Omega$; Ω is the basic ω-space of the probability measure space (Ω, E, μ), and E is the smallest Borel field including $\bigcup_{t=0}^{\infty} F_t$, where $F_t = \sigma[P(0), \dots, P(t), C(0), \dots, C(t)]$.) Consequently, from now on we shall use the notation $s^i_j(t, \omega)$ to represent the input into the stochastic automaton B.

Depending upon the action $y_i$ and the n responses $s^i_1(t,\omega), \dots, s^i_n(t,\omega)$ from the n teachers $R_1, \dots, R_n$, the stochastic automaton B changes the probability vector P(t) by the reinforcement scheme T.

The nonstationary multi-teacher environment (NMT) considered in this paper has the property that the relation

$$\int \xi\, dF_{\alpha,t}(\xi) + \delta \le \int \xi\, dF_{j,t}(\xi), \qquad (5)$$

where $F_{i,t}(\cdot)$ ($1 \le i \le r$) is the distribution function of

$$s^i_1(t,\omega) + \dots + s^i_n(t,\omega),$$

holds for some state $w_\alpha$, some δ > 0, all time t, all j (≠ α), and all ω (∈ Ω).
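As a concrete illustration (an assumed toy construction, not taken from the paper), the sketch below builds an S-model NMT environment whose penalty strengths lie in [0,1] and drift slowly with time, while one distinguished action keeps the smallest expected total penalty, in the spirit of condition (5):

```python
import math
import random

def nmt_response(action, t, n_teachers=3, best_action=0, gap=0.2):
    """S-model NMT sketch: n penalty strengths in [0,1] for the chosen action at time t.

    The expected strength drifts with t, but the distinguished 'best_action'
    always stays below the others by 'gap', mimicking condition (5).
    """
    base = 0.4 + 0.1 * math.sin(0.01 * t)                 # slow nonstationary drift
    mean = base if action == best_action else base + gap
    responses = []
    for _ in range(n_teachers):
        s = min(1.0, max(0.0, random.gauss(mean, 0.1)))   # clip noisy strength into [0,1]
        responses.append(s)
    return responses

print(nmt_response(action=2, t=100))
```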

Figure 2. Stochastic automaton B operating in the nonstationary multi-teacher environment (NMT).

The objective of the stochastic automaton B is to reduce

$$E\Big\{\sum_{j=1}^{n} s^i_j(t,\omega)\Big\},$$

the expectation of the sum of the penalty strengths. Therefore, condition (5) means that the α-th action $y_\alpha$ is the best among the r actions $y_1, \dots, y_r$, since $y_\alpha$ receives the least sum of the penalty strengths in the sense of mathematical expectation.

Before we proceed to introduce the norms of learning behaviors of stochastic automata under an NMT environment, we will explain several basic norms of the learning behaviors of stochastic automata under a stationary multi-teacher environment of a P-model. Baba (1983) discussed the learning behaviors of stochastic automata operating in the general stationary n-teacher environment, in which there exists a β-th state $w_\beta$ such that

$$\sum_{j=1}^{n} j\, D_{\beta,j} > \sum_{j=1}^{n} j\, D_{i,j} \qquad \text{for all } 1 \le i \le r\ (i \ne \beta).$$

He gave the following definitions:

DEFINITION 1. The average weighted reward in the n-teacher environment, W(t), is defined as follows:

$$W(t) = \sum_{i=1}^{r} p_i(t)\Big[\sum_{j=1}^{n} j\, D_{i,j}\Big],$$

where $D_{i,j}$ is the probability that j teachers approve of the i-th action $y_i$ of the stochastic automaton B (j = 1, ..., n).

DEFINITION 2. The stochastic automaton B is said to be "absolutely expedient in the general n-teacher environment" if

$$E\{W(t+1) \mid P(t)\} > W(t)$$

for all t and all $p_i(t)$ ($0 < p_i(t) < 1$, i = 1, ..., r).

DEFINITION 3. The stochastic automaton B is said to be "expedient in the general n-teacher environment" if

$$\lim_{t\to\infty} E\{W(t)\} > F_0 ,$$

where

$$F_0 = \frac{1}{r}\sum_{i=1}^{r}\Big[\sum_{j=1}^{n} j\, D_{i,j}\Big].$$

DEFINITION 4. The stochastic automaton B is said to be "optimal in the general n-teacher environment" if

$$\lim_{t\to\infty} p_\beta(t) = 1 \qquad \text{with probability 1.}$$

DEFINITION 5. The stochastic automaton B is said to be "ε-optimal in the general n-teacher environment" if one can choose a parameter θ included in the reinforcement scheme of the stochastic automaton B such that

$$\lim_{\theta\to 0}\ \lim_{t\to\infty} E\{p_\beta(t)\} = 1 . \qquad (11)$$

Baba proposed the GAE reinforcement scheme and proved that it ensures ε-optimality and absolute expediency in the general n-teacher environment.

By analogy with Definitions 4 and 5 given above, we can give the following definitions concerning the learning norms of stochastic automata under the nonstationary multi-teacher environment NMT satisfying condition (5):

DEFINITION 6. The stochastic automaton B is said to be optimal in NMT if

$$\lim_{t\to\infty} p_\alpha(t) = 1 \qquad \text{with probability 1.}$$

DEFINITION 7. The stochastic automaton B is said to be ε-optimal in NMT if one can choose a parameter θ included in the reinforcement scheme T of the stochastic automaton B such that the following equality holds:

$$\lim_{\theta\to 0}\ \lim_{t\to\infty} E\{p_\alpha(t)\} = 1 .$$

On the other hand, the extensions of Definitions 2 and 3 cannot be easily given. Presumably, we need a different interpretation.

ε-OPTIMAL REINFORCEMENT SCHEME UNDER THE NONSTATIONARY MULTI-TEACHER ENVIRONMENT

The GAE reinforcement scheme (Baba 1983) has been introduced as a class of learning algorithms for stochastic automata operating in a multi-teacher environment which emits 0 (reward) or 1 (penalty) responses. This scheme cannot be applied to the S-model environment in which teachers emit arbitrary responses between 0 and 1.

In the following, let us propose the MGAE scheme, which can be used for the S-model environment.

MGAE SCHEME:

Suppose that $y(t) = y_i$ and the responses from the NMT are $(s^i_1, s^i_2, \dots, s^i_n)$. ($s^i_j$ (j = 1, ..., n) means the response from the j-th teacher.) Then,

$$p_j(t+1) = p_j(t) - \Big[\frac{s^i_1 + \dots + s^i_n}{n}\Big]\varphi_j[P(t)] + \Big[1 - \frac{s^i_1 + \dots + s^i_n}{n}\Big]\psi_j[P(t)] \qquad (j \ne i), \qquad (15)$$

and $p_i(t+1)$ is determined by the normalization $\sum_{k=1}^{r} p_k(t+1) = 1$, that is,

$$p_i(t+1) = p_i(t) + \Big[\frac{s^i_1 + \dots + s^i_n}{n}\Big]\sum_{j\ne i}\varphi_j[P(t)] - \Big[1 - \frac{s^i_1 + \dots + s^i_n}{n}\Big]\sum_{j\ne i}\psi_j[P(t)],$$

where the functions $\varphi_i, \psi_i$ (i = 1, ..., r) satisfy the following relations:

$$\frac{\varphi_1[P(t)]}{p_1(t)} = \frac{\varphi_2[P(t)]}{p_2(t)} = \dots = \frac{\varphi_r[P(t)]}{p_r(t)} = \lambda[P(t)], \qquad (16)$$

$$\frac{\psi_1[P(t)]}{p_1(t)} = \frac{\psi_2[P(t)]}{p_2(t)} = \dots = \frac{\psi_r[P(t)]}{p_r(t)} = \mu[P(t)], \qquad (17)$$

$$p_i(t) + \sum_{j\ne i}\varphi_j[P(t)] > 0, \qquad (18)$$

together with the corresponding conditions which guarantee that every component of P(t+1) remains in the open interval (0,1).
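A compact sketch of this update, for the simple special case $\varphi_j[P(t)] = \theta\hat\lambda\, p_j(t)$ and $\psi_j[P(t)] = \theta\hat\mu\, p_j(t)$ with constants $\hat\lambda, \hat\mu \le 0$ (one admissible choice under (16)–(18) and the conditions of Theorem 1 below; the scheme itself allows more general functions), is given below.

```python
def mgae_update(p, i, responses, theta=0.05, lam_hat=-1.0, mu_hat=-1.0):
    """MGAE-type update (S-model) with phi_j = theta*lam_hat*p_j and psi_j = theta*mu_hat*p_j.

    p         : probability vector P(t)
    i         : index of the chosen action y_i
    responses : list of n teacher responses, each in [0, 1]
    """
    s_bar = sum(responses) / len(responses)           # normalized penalty strength
    phi = [theta * lam_hat * pj for pj in p]
    psi = [theta * mu_hat * pj for pj in p]
    new_p = list(p)
    for j in range(len(p)):
        if j != i:
            new_p[j] = p[j] - s_bar * phi[j] + (1.0 - s_bar) * psi[j]
    new_p[i] = p[i] + s_bar * sum(phi[j] for j in range(len(p)) if j != i) \
                    - (1.0 - s_bar) * sum(psi[j] for j in range(len(p)) if j != i)
    return new_p

# Example: action y_2 (index 1) chosen, two teachers answered with strengths 0.8 and 0.6.
print(mgae_update([0.4, 0.3, 0.3], i=1, responses=[0.8, 0.6]))
```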

As to the learning performance of the MGAE reinforcement scheme, the following theorem can be obtained.

THEOREM 1. Suppose that

$$\lambda[P(t)] = \theta\,\hat\lambda[P(t)] \qquad (\theta > 0) \qquad (19)$$

and

$$\mu[P(t)] = \theta\,\hat\mu[P(t)], \qquad (20)$$

where $\hat\lambda[P(t)]$ and $\hat\mu[P(t)]$ are bounded functions which satisfy the following conditions:

$$\hat\lambda[P(t)] \le 0 \quad (21), \qquad \hat\mu[P(t)] \le 0 \quad (22), \qquad \hat\lambda[P(t)] + \hat\mu[P(t)] < 0 \quad (23),$$

for all P(t) and t.

Then, the stochastic automaton B with the MGAE reinforcement scheme is ε-optimal under the nonstationary multi-teacher environment NMT satisfying condition (5).

Since the proof of the above theorem is rather lengthy, we will begin by deriving several important lemmas.

LEMMA 1. Suppose that all of the assumptions of the above theorem hold. Then, the MGAE reinforcement scheme has the following learning performance under the NMT environment satisfying condition (5):

$$E\{p_\alpha(t+1) \mid P(t)\} \ge p_\alpha(t) \qquad \text{for all } t \text{ and all } P(t).$$

Proof. For notational convenience, let us abbreviate the time t and the probability vector P(t): we write $p_i$ for $p_i(t)$, $\varphi_i$ for $\varphi_i[P(t)]$, $\psi_i$ for $\psi_i[P(t)]$, λ for $\lambda[P(t)]$, and μ for $\mu[P(t)]$. Let $F_{i,t}(\xi)$ be the distribution function of $s^i_1(t,\omega) + \dots + s^i_n(t,\omega)$. Then, the conditional expectation $E\{p_\alpha(t+1) \mid P(t)\}$ can be calculated as follows:

$$E\{p_\alpha(t+1) \mid P(t)\} = p_\alpha \int \Big[p_\alpha + \tfrac{\xi}{n}\sum_{j\ne\alpha}\varphi_j - \big(1 - \tfrac{\xi}{n}\big)\sum_{j\ne\alpha}\psi_j\Big] dF_{\alpha,t}(\xi) + \sum_{j\ne\alpha} p_j \int \Big[p_\alpha - \tfrac{\xi}{n}\,\varphi_\alpha + \big(1 - \tfrac{\xi}{n}\big)\,\psi_\alpha\Big] dF_{j,t}(\xi).$$

Let

$$C_i(t) = \frac{1}{n}\int \xi\, dF_{i,t}(\xi) \qquad (i = 1, \dots, r),$$

the expected normalized penalty strength received when action $y_i$ is performed at time t. Then, using the relations (16) and (17), the above equality can be represented as:

$$E\{p_\alpha(t+1) \mid P(t)\} = p_\alpha + \big(\lambda + \mu\big)\, p_\alpha \sum_{j\ne\alpha} p_j\,\big[C_\alpha(t) - C_j(t)\big]. \qquad (28)$$

From the definition of the distribution functions $F_{i,t}(\xi)$ and from condition (5),

$$C_\alpha(t) + \frac{\delta}{n} \le C_j(t) \qquad \text{for all } j \ne \alpha. \qquad (29)$$

Let

$$C_\beta(t) = \min_{j\ne\alpha} C_j(t).$$

Then, from the relations (19)–(29), we can get

$$E\{p_\alpha(t+1) \mid P(t)\} \ge p_\alpha(t) + \big[\lambda + \mu\big]\,\big(1 - p_\alpha(t)\big)\, p_\alpha(t)\,\big[C_\alpha(t) - C_\beta(t)\big] \ge p_\alpha(t) \qquad (30)$$

(because $C_\alpha(t) - C_\beta(t) < 0$ and $\lambda + \mu < 0$).

Remark. Inequality (30) is the semi-martingale inequality (Doob 1953). From this inequality, we can get $E[p_\alpha(t+1)] \ge E[p_\alpha(t)]$ for all t. This means that the mathematical expectation of $p_\alpha(t)$ increases monotonically with time t.
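To make this monotonicity concrete, the following self-contained Monte Carlo sketch (a toy NMT environment and the same simple choice $\varphi_j = \theta\hat\lambda p_j$, $\psi_j = \theta\hat\mu p_j$ as in the earlier sketch; all numerical values are assumptions for illustration) estimates $E[p_\alpha(t)]$ by averaging independent runs. The printed estimates should be (approximately) non-decreasing in t.

```python
import random

def mgae_step(p, i, s_bar, theta=0.05, lam_hat=-1.0, mu_hat=-1.0):
    """One MGAE update with phi_j = theta*lam_hat*p_j and psi_j = theta*mu_hat*p_j."""
    new_p = [pj - s_bar * theta * lam_hat * pj + (1.0 - s_bar) * theta * mu_hat * pj
             for pj in p]
    new_p[i] = 1.0 - sum(new_p[j] for j in range(len(p)) if j != i)   # normalization
    new_p = [max(pj, 1e-6) for pj in new_p]   # numerical floor; condition (18) plays this role in theory
    total = sum(new_p)
    return [pj / total for pj in new_p]

def run(T=200, r=3, n=2, best=0, gap=0.3, seed=None):
    """One trajectory in a toy NMT: the action `best` always has the smaller mean penalty."""
    rng = random.Random(seed)
    p = [1.0 / r] * r
    history = []
    for _ in range(T):
        i = rng.choices(range(r), weights=p)[0]
        mean = 0.3 if i == best else 0.3 + gap
        s_bar = sum(min(1.0, max(0.0, rng.gauss(mean, 0.1))) for _ in range(n)) / n
        p = mgae_step(p, i, s_bar)
        history.append(p[best])
    return history

runs = [run(seed=k) for k in range(200)]
for t in (0, 49, 99, 199):
    print(t + 1, sum(h[t] for h in runs) / len(runs))   # estimated E[p_alpha(t)]
```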

LEMMA 2. Suppose that all of the assumptions of the theorem hold. Let

$$p_\alpha'(t) = 1 - p_\alpha(t). \qquad (32)$$

Then, there exists some positive constant x which satisfies the inequality

$$E\{h_{x,\theta}[p_\alpha'(t+1)] \mid P(t)\} \le h_{x,\theta}[p_\alpha'(t)] \qquad \text{for all } t \text{ and } P(t), \qquad (31)$$

where

$$h_{x,\theta}(p) = \frac{\exp(x\,p/\theta) - 1}{\exp(x/\theta) - 1}.$$

Proof. The conditional expectation $E\{h_{x,\theta}[p_\alpha'(t+1)] \mid P(t)\}$ can be calculated, as in the proof of Lemma 1, as an integral of $h_{x,\theta}$ evaluated at the possible values of $p_\alpha'(t+1)$ with respect to the distribution functions $F_{i,t}(\xi)$ (33). Since $\hat\lambda[P(t)]$ and $\hat\mu[P(t)]$ are bounded, assume that

$$|\hat\lambda[P(t)]| \le \theta_1, \qquad |\hat\mu[P(t)]| \le \theta_1 \qquad (\theta_1:\ \text{positive constant}). \qquad (34)$$

Then, by using Taylor's expansion theorem, the exponential factors appearing in (33), such as $\exp[x\, p_\alpha(\xi\hat\lambda - (1-\xi)\hat\mu)]$, can be bounded from above by their first-order expansions plus a second-order remainder term (inequalities (35) and (36)). From (33), (35), and (36), we can get an upper bound on $E\{h_{x,\theta}[p_\alpha'(t+1)] \mid P(t)\} - h_{x,\theta}[p_\alpha'(t)]$ in which the remainder terms are controlled by

$$f_1(x, P) = 4\, x\, \theta_1\, p_\alpha'\, \exp(2\theta_1 x), \qquad (38)$$

while, from (28), the leading term is proportional to $p_\alpha\, p_\alpha'\,(\hat\lambda + \hat\mu)\,[C_\alpha(t) - C_\beta(t)]$ (39). In (39), $\lim_{x\to 0} 4\, x\,\theta_1 \exp(2\theta_1 x) = 0$, $p_\alpha\, p_\alpha'\,(\hat\lambda + \hat\mu) \le 0$, and $\delta/n$ is a positive constant. Hence, there exists some positive constant x which satisfies the inequality (31).

LEMMA 3. Suppose that all of the assumptions of the theorem hold. Then, the MGAE reinforcement scheme has the following convergence property under the nonstationary multi-teacher environment (NMT) satisfying condition (5): $p_\alpha(t)$ converges with probability 1. Further, let $\lim_{t\to\infty} p_\alpha(t) = p_\alpha^*$ with probability 1. Then, $p_\alpha^* = 1$ or 0 with probability 1.

Proof. $|p_\alpha(t)| \le 1$ for all t. Then, from Lemma 1, $p_\alpha(t)$ converges with probability 1 (Doob 1953). Now we will prove that $p_\alpha^* = 1$ or 0 with probability 1. Assume that there is a region D such that $\mu(D) \ne 0$ and $0 < p_\alpha^* < 1$ in D. It follows from (30) that

$$E[p_\alpha(t+1)] - E[p_\alpha(t)] \ge E\big\{[\lambda + \mu]\,\big(1 - p_\alpha(t)\big)\, p_\alpha(t)\,\big[C_\alpha(t) - C_\beta(t)\big]\big\} \ge 0 . \qquad (41)$$

Since $p_\alpha(t)$ converges with probability 1 to $p_\alpha^*$ and $|p_\alpha(t)| \le 1$ for all t,

$$\lim_{t\to\infty} E[p_\alpha(t)] = E[p_\alpha^*]. \qquad (42)$$

Hence,

$$\lim_{t\to\infty}\big\{E[p_\alpha(t+1)] - E[p_\alpha(t)]\big\} = \lim_{t\to\infty}E[p_\alpha(t+1)] - \lim_{t\to\infty}E[p_\alpha(t)] = 0 . \qquad (43)$$

Let

$$\hat\lambda + \hat\mu \le -G \qquad (G > 0).$$

Then, from (28),

$$E[p_\alpha(t+1)] - E[p_\alpha(t)] \ge \theta\, G\, \frac{\delta}{n}\int_D \big(1 - p_\alpha(t)\big)\, p_\alpha(t)\, d\mu , \qquad (44)$$

and the right-hand side of (44) is bounded away from zero for all sufficiently large t, because $0 < p_\alpha^* < 1$ on D and $\mu(D) \ne 0$. It is clear from (41) that (43) is incompatible with (44). Therefore $p_\alpha^* = 1$ or 0 with probability 1.

Taking advantage of the above three lemmas, the Theorem can easily be proved.

Proof. From Lemma 2, the sequence $h_{x,\theta}[p_\alpha'(t)]$ (t = 0, 1, 2, ...) is a non-negative supermartingale, so that

$$h_{x,\theta}[p_\alpha'(0)] \ge \int_\Omega h_{x,\theta}[p_\alpha'(1)]\, d\mu \ge \int_\Omega h_{x,\theta}[p_\alpha'(2)]\, d\mu \ge \dots \qquad (45)$$

Consequently,

$$h_{x,\theta}[p_\alpha'(0)] \ge \int_\Omega h_{x,\theta}[p_\alpha'(t)]\, d\mu \quad \text{for all } t, \qquad (46)$$

$$h_{x,\theta}[p_\alpha'(0)] \ge \lim_{t\to\infty}\int_\Omega h_{x,\theta}[p_\alpha'(t)]\, d\mu . \qquad (47)$$

Since $h_{x,\theta}[p_\alpha'(t)]$ is bounded above ($\le 1$),

$$\lim_{t\to\infty}\int_\Omega h_{x,\theta}[p_\alpha'(t)]\, d\mu = \int_\Omega \lim_{t\to\infty} h_{x,\theta}[p_\alpha'(t)]\, d\mu . \qquad (48)$$

Let $\lim_{t\to\infty} p_\alpha'(t) = p_\alpha'^{*}$. Then, from Lemma 3, $p_\alpha'^{*} = 0$ or 1 with probability 1 (50). Since $h_{x,\theta}(p)$ is a continuous function of p, we obtain the following equality with probability 1:

$$\lim_{t\to\infty} h_{x,\theta}[p_\alpha'(t)] = h_{x,\theta}(p_\alpha'^{*}). \qquad (49)$$

Furthermore,

$$0 < h_{x,\theta}(p) < 1 \ \text{ when } 0 < p < 1, \qquad h_{x,\theta}(0) = 0, \qquad h_{x,\theta}(1) = 1 . \qquad (51)$$

It follows from (49) and (50) that

$$\lim_{t\to\infty} h_{x,\theta}[p_\alpha'(t)] = p_\alpha'^{*} \qquad (52)$$

with probability 1. Therefore, from (46), (47), and (52),

$$h_{x,\theta}[p_\alpha'(0)] \ge \int_\Omega p_\alpha'^{*}\, d\mu . \qquad (53)$$

It is clear that (note that $p_\alpha'(0) < 1$, since all components of P(0) are positive)

$$\lim_{\theta\to 0} h_{x,\theta}[p_\alpha'(0)] = \lim_{\theta\to 0} \frac{\exp[x\,\theta^{-1} p_\alpha'(0)] - 1}{\exp[x\,\theta^{-1}] - 1} = 0 . \qquad (54)$$

Hence, from (53) and (54),

$$\lim_{\theta\to 0}\ \lim_{t\to\infty} E[p_\alpha(t)] = 1 .$$

APPLICATION TO NOISE-CORRUPTED, MULTI-OBJECTIVE PROBLEM

In this section, we consider a parameter self-optimization problem with noise-corrupted, multi-objective functions as an application of the learning behaviors of stochastic automata operating in an unknown nonstationary multi-teacher environment.

Suppose that $J_1(\alpha), \dots, J_n(\alpha)$ are unknown objective functions of a parameter $\alpha \in \{\alpha_1, \dots, \alpha_r\}$, except that they are bounded ($-M \le J_1(\alpha), \dots, J_n(\alpha) \le M$). It is assumed that measurements of $J_j(\alpha)$ (j = 1, ..., n) can be obtained only from the noise-corrupted observations

$$g_j(\alpha, t) = J_j(\alpha) + \xi_j(t) \qquad (j = 1, \dots, n), \qquad (55)$$

where $\xi_j(t)$ denotes the observation noise at time t. Here, $J_j(\alpha)$ is assumed to have the unique maximum $J_j(\alpha_{\beta_j})$:

$$J_j(\alpha_{\beta_j}) = \max\big[J_j(\alpha_1), \dots, J_j(\alpha_r)\big]. \qquad (56)$$

Each objective function $J_j(\alpha)$ has the claim to be maximized (j = 1, ..., n). However, generally, the relation

$$\alpha_{\beta_1} = \alpha_{\beta_2} = \dots = \alpha_{\beta_n}$$

does not hold. This is one of the most difficult points of multi-objective optimization problems.

The learning behaviors of stochastic automata discussed in the last section can be used to find an appropriate parameter in this problem. Let us identify the i-th action $y_i$ of the stochastic automaton B with the i-th parameter value $\alpha_i$ (i = 1, ..., r). Choosing the i-th parameter $\alpha_i$ at time t corresponds to B producing the output $y_i$ at time t. For simplicity, we consider the stochastic automaton B under a P-model environment.

Let $k_j^t$ be a measurement of $g_j(\alpha, t)$ at time t. Further, let $\bar{k}_j^{t}$ (t = 0, 1, ...; j = 1, ..., n) be the comparison value formed from the measurements of the j-th objective obtained up to time t.

Using these values, we define reward and penalty as follows:

Suppose that $\alpha(t) = \alpha_i$ (i = 1, ..., r). If $k_j^t \ge \bar{k}_j^{t-1}$, then the stochastic automaton B receives the reward response $s_j^i = 0$ from the j-th teacher $R_j$ (j = 1, ..., n). (This means that the j-th noise-corrupted objective function $J_j(\alpha)$ gives an affirmative answer to the parameter $\alpha_i$.) On the contrary, if $k_j^t < \bar{k}_j^{t-1}$, then the stochastic automaton B receives the penalty response $s_j^i = 1$ from the j-th teacher $R_j$ (j = 1, ..., n).

The stochastic automaton changes the state probability vector P(t) to P(t+1) by means of the n responses $(s_1^i, \dots, s_n^i)$ which it has received from the n teachers $R_1, \dots, R_n$.
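The following sketch (with made-up objective functions and Gaussian noise, purely for illustration) shows how the n noisy objectives generate the n teacher responses described above; the previous measurement of the same objective is used here as one simple choice for the comparison value $\bar{k}_j^{t-1}$.

```python
import random

# Hypothetical example: r = 4 parameter values, n = 2 noisy objectives.
ALPHAS = [0.0, 1.0, 2.0, 3.0]
OBJECTIVES = [lambda a: -(a - 2.0) ** 2,            # J_1, maximized at alpha = 2
              lambda a: -(a - 1.0) ** 2]            # J_2, maximized at alpha = 1

def measure(j, alpha_index, noise_std=0.3):
    """Noise-corrupted observation k_j^t of J_j at the chosen parameter value."""
    return OBJECTIVES[j](ALPHAS[alpha_index]) + random.gauss(0.0, noise_std)

def teacher_responses(alpha_index, previous_measurements, noise_std=0.3):
    """Return (responses, new_measurements): s_j = 0 (reward) if the new measurement
    does not fall below the comparison value, and s_j = 1 (penalty) otherwise."""
    new_measurements = [measure(j, alpha_index, noise_std) for j in range(len(OBJECTIVES))]
    responses = [0 if k >= k_bar else 1
                 for k, k_bar in zip(new_measurements, previous_measurements)]
    return responses, new_measurements

prev = [measure(j, alpha_index=0) for j in range(len(OBJECTIVES))]
resp, prev = teacher_responses(alpha_index=2, previous_measurements=prev)
print(resp)
```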

Now let us consider the learning behavior of B. If the parameter $\alpha_i$ is selected at time t, B receives a penalty from the j-th teacher $R_j$ with the probability

$$C_j^i(t, \omega) = \Pr\big\{k_j^t < \bar{k}_j^{t-1}\big\}. \qquad (58)$$

From (55),

$$C_j^i(t, \omega) = \Phi_j\big(\bar{k}_j^{t-1} - J_j(\alpha_i)\big), \qquad (59)$$

where $\Phi_j(\cdot)$ is the distribution function of $\xi_j$ (j = 1, ..., n). Since $J_j(\alpha)$ is assumed to have the unique maximum $J_j(\alpha_{\beta_j})$,

$$C_j^{\beta_j}(t, \omega) < C_j^i(t, \omega) \qquad (60)$$

for all $R_j$ and all $\alpha_i$ ($\alpha_i \ne \alpha_{\beta_j}$) (j = 1, ..., n). (See Figure 3.)

The reason why we use the notation $C_j^i(t, \omega)$ is to represent the probability that the stochastic automaton B receives a penalty response from the j-th teacher when it selects the i-th parameter $\alpha_i$ at time t. (Here, $\omega \in \Omega$, Ω being the supporting set of the probability measure space (Ω, E, μ).)

Figure 3. The value of $1 - \mu\big\{\xi_j(t) \le \bar{k}_j^{t-1} - J_j(\alpha_{\beta_j})\big\}$.

($F_t = \sigma[P(0), \dots, P(t), k^0, \dots, k^t]$ is the smallest Borel field of ω-sets with respect to which $P(0), \dots, P(t)$ and $k^0, \dots, k^t$ are all measurable, where $k^t = (k_1^t, \dots, k_n^t)$. E is the smallest Borel field which contains $\bigcup_{t=0}^{\infty} F_t$. μ is the probability measure which satisfies $\mu(\Omega) = 1$.)

Therefore, it follows from (56), (58), (59), and (60) that

$$C_j^{\beta_j}(t, \omega) + \delta_j \le C_j^i(t, \omega) \qquad (61)$$

for all t, all i (i = 1, ..., r; $i \ne \beta_j$), all $\omega \in \Omega$, and some positive number $\delta_j$ (j = 1, ..., n). If the strict condition

$$\alpha_{\beta_1} = \alpha_{\beta_2} = \dots = \alpha_{\beta_n} \ (=\alpha_{\beta^*}) \qquad (62)$$

holds, then it can easily be derived from (61) that

$$\sum_{j=1}^{n} C_j^{\beta^*}(t, \omega) + \delta \le \sum_{j=1}^{n} C_j^i(t, \omega)$$

for all t, all i (i = 1, ..., r; $i \ne \beta^*$), all $\omega \in \Omega$, and the positive number $\delta = \delta_1 + \dots + \delta_n$. Therefore, using the theoretical results obtained in the last section, we can prove that

$$\lim_{\theta\to 0}\ \lim_{t\to\infty} E[p_{\beta^*}(t)] = 1$$

is ensured by the MGAE reinforcement scheme.

Even if the strict condition (62) does not hold, the MGAE reinforcement scheme finds the parameter α which satisfies the relation (5). (The result obtained so far is a generalization of the work done by Baba (1978).)

Remark: Although we have used the P-model, all of the studies done in the last section can be applied to this case.

CONCLUSION

We have discussed the learning behavior of stochastic automata under the nonstationary multi-teacher environment (NMT), in which the penalty strengths are functions of t and ω, where t represents time and ω is a point of the basic ω-space Ω. It has been proved that the MGAE reinforcement scheme, which is an extended form of the GAE reinforcement scheme, ensures ε-optimality under the nonstationary multi-teacher environment (NMT) which satisfies condition (5). We have also considered the parameter self-optimization problem with noise-corrupted, multi-objective functions by stochastic automata and showed that this problem can be reduced to the learning behaviors of stochastic automata operating in the nonstationary multi-teacher environment (NMT) satisfying condition (5).

REFERENCES

Baba, N. and Y. Sawaragi. 1975. On the learning behavior of stochastic automata under a nonstationary random environment. IEEE Trans. Syst., Man, and Cybernetics. 5:273-275.

Baba, N. 1978. Theoretical considerations of the parameter self-optimization by stochastic automata. Int. J. Control. 27:271-276.

Baba, N. 1983. The absolutely expedient nonlinear reinforcement schemes under the unknown multi-teacher environment. IEEE Trans. Syst., Man, and Cybernetics. 13:100-108.

Chandrasekaran, B. and D.W.C. Shen. 1968. On expediency and convergence in variable-structure automata. IEEE Trans. Syst., Sci., and Cybernetics. 4:52-60.

Doob, J.L. 1953. Stochastic Processes. New York: Wiley.

Koditschek, D.E. and K.S. Narendra. 1977. Fixed structure automata in a multi-teacher environment. IEEE Trans. Syst., Man, and Cybernetics. 7:616-624.

Lakshmivarahan, S. and M.A.L. Thathachar. 1973. Absolutely expedient learning algorithms for stochastic automata. IEEE Trans. Syst., Man, and Cybernetics. 3:281-286.

Narendra, K.S. and M.A.L. Thathachar. 1974. Learning automata--a survey. IEEE Trans. Syst., Man, and Cybernetics. 4:323-334.

Norman, M.F. 1968. On linear models with two absorbing barriers. Journal of Mathematical Psychology. 5:225-241.

Norman, M.F. 1972. Markov Processes and Learning Models. New York: Academic Press.

Sawaragi, Y. and N. Baba. 1973. A note on the learning behavior of variable-structure stochastic automata. IEEE Trans. Syst., Man, and Cybernetics. 3:644-647.

Srikantakumar, P.R. and K.S. Narendra. 1982. A learning model for routing in telephone networks. SIAM J. Control and Optimization. 20:34-57.

Thathachar, M.A.L. and B. Bhakthavathsalam. 1978. Learning automaton operating in parallel environments. J. of Cybernetics and Information Science. 1:121-127.

Tsetlin, M.L. 1961. On the behavior of finite automata in random media. Automation and Remote Control. 22:1345-1354.

Varshavskii, V.I. and I.P. Vorontsova. 1963. On the behavior of stochastic automata with variable structure. Automation and Remote Control. 24:327-333.
