Some Aspects of Model Tuning Procedure: Information-Theoretic Analysis


NOT FOR QUOTATION WITHOUT PERMISSION OF THE AUTHOR


April 1985
WP-85-24

Working Papers are interim reports on work of the International Institute for Applied Systems Analysis and have received only limited review. Views or opinions expressed herein do not necessarily represent those of the Institute or of its National Member Organizations.

INTERNATIONAL INSTITUTE FOR APPLIED SYSTEMS ANALYSIS 2361 Laxenburg, Austria


I would like to thank Susanne Stock for her help in preparing the manuscript.


1. INTRODUCTION

Computer or mathematical models are not exact representations of reality: lack of knowledge, technical restrictions and particular modeling goals make it necessary to approximate the real system in various ways. Nevertheless, the procedures by which models are adjusted to observed data are often based on the assumption that the real system has the same structure as the model and differs only in the values of certain parameters. These particular values are usually assumed to belong to the feasible set of parameter values, and this fact, together with some additional conditions, usually provides the convergence property for many individual algorithms [1].

However, in reality all of these assumptions are generally false. Even if the structure of the system corresponds to the structure of the model, the real parameter values often do not belong to the presupposed feasible set. Moreover, mathematicians often consciously diminish this set in order to simplify the estimation algorithms. For instance, they approximate the bounded compact set of parameter values by a set consisting of a finite number of points, thus increasing the chances that the real parameter values will be excluded.


It is therefore both remarkable and surprising to find that, despite these false assumptions and approximations, the parameter estimation algorithms often still converge! The model resulting from this tuning procedure will of course not coincide with the real system, and this raises the natural question: how far is this computer model from reality?

When considering this question it is necessary to have some way of measuring the distance between individual models. One measure of divergence was introduced by Bhattacharya [2]; Kullback [3] also formulated a measure of information distance. However, these measures were not proper metrics. Baram and Sandell [4] later introduced a modified version of the Kullback measure, which has been shown to be a proper distance metric. They applied this approach to linear Gaussian systems and models; in this paper it is generalized to a wider class of systems.

2. NOTATIONS AND DEFINITIONS

Assume that the variety of models of the real system may be characterized by a parameter $\beta$, which takes values from the parameter set $B$. In view of the Bayesian formulation of the problem, we will assume $\beta$ to be a random variable defined on some probability space $(\Omega, H, P)$.

Let $\xi_n(\omega)$, $n \ge 0$, be some random process (observation) adapted to some nondecreasing family of $\sigma$-algebras $H = (H_n)_{n \ge 0}$, $H_\infty = H$, in $\Omega$. We shall denote by $\bar{H} = (\bar{H}_n)_{n \ge 0}$, $\bar{H}_\infty = \bar{H}$, the family of $\sigma$-algebras generated by the process $\xi_n$, $n \ge 0$, where $\bar{H}_n$ is the $\sigma$-algebra in $\Omega$ generated by the process $\xi$ up to time $n$.

In the case of a continuous-time observation process $\xi_t$, $t \ge 0$, we assume the nondecreasing right-continuous family of $\sigma$-algebras $H = (H_t)_{t \ge 0}$ to be given, where $H_\infty = H$ and $H_0$ is completed by the $P$-zero sets from $H$. We also introduce the family of $\sigma$-algebras $\bar{H} = (\bar{H}_t)_{t \ge 0}$, where $\bar{H}_t$ is generated by the observations $\xi_u$, $u \le t$.

If the set of parameter values is finite or denumerable, we will denote by $\pi_j(n)$ (or $\pi_j(t)$) the a posteriori probabilities of the events $\{\beta = \beta_j\}$, $j \in B$, given the observations $\xi_k$, $k \le n$ ($\xi_u$, $u \le t$).

For any $A \in H$, $z \in B$, we denote by $P^z(A)$ the family of probability measures

$$P^z(A) = P(A \mid \beta = z).$$

Let $P_n^z(A)$, $\bar{P}_n^z(A)$, $z \in B$, $n \ge 0$, be the restrictions of $P^z(A)$ to the $\sigma$-algebras $H_n$, $\bar{H}_n$, respectively. Assume also that for any $z, y \in B$ we have $\bar{P}_n^z \sim \bar{P}_n^y$. Define $\alpha_n^{z,y}$ as the Radon-Nikodym derivative

$$\alpha_n^{z,y} = \frac{d\bar{P}_n^z}{d\bar{P}_n^y}$$

and let

$$a_n^{z,y} = \alpha_n^{z,y} \left(\alpha_{n-1}^{z,y}\right)^{-1}.$$

It is easy to see that if the $\bar{H}_{n-1}$-conditional distributions of $\xi_n$, $n \ge 0$, have densities $f_z(x \mid \bar{H}_{n-1})$, $z \in B$, then

$$a_n^{z,y} = \frac{f_z(\xi_n \mid \bar{H}_{n-1})}{f_y(\xi_n \mid \bar{H}_{n-1})}.$$

3. SOME BAYESIAN PARAMETER ESTIMATION ALGORITHMS

Before deriving our main results, we will first consider some Bayesian parameter estimation algorithms for different observation schemes.

a) Assume that $\xi_n$, $n \ge 0$, is given by the formula

where $\theta_n$ satisfies the recursive stochastic equation

Here $\varepsilon_{1n}$, $\varepsilon_{2n}$, $n \ge 0$, are sequences of independent Gaussian random variables with zero mean and variance equal to one, and $\beta$ is an unknown parameter. Assuming that $\beta$ takes its values from some finite set $B_k = \{\beta_1, \beta_2, \ldots, \beta_k\}$, the a posteriori probabilities are

where $\hat{\theta}_n^j$ are Kalman estimates of $\theta_n$ given $\{\beta = \beta_j\}$ and $D_j(n)$ are functions of the conditional variance $\gamma_j(n)$ [5].

b) Consider the continuous (in time) observation process $\xi_t$ given by the stochastic differential equation

where $W_t$, $t \ge 0$, is the $H$-adapted Wiener process, $\beta$ is an unknown parameter and $C_t$ is an $H$-adapted positive function. Assuming again that the number of parameter values is finite, we have for $\pi_j(t) = P(\beta = \beta_j \mid \bar{H}_t)$ [6]

where

c) Consider an observation made by a continuous-state jumping process with unknown transition intensities $\lambda_t^{\beta}$. Once again assuming a finite number of values for $\beta$, we have the following equations for the a posteriori probability $\pi_j(t)$ [7]

where

The necessary and sufficient conditions for convergence with probability one of the a posteriori probabilities to the respective indicators were given in the papers [1, 8, 9] in terms of absolute continuity and singularity of some special families of probability distributions. These papers demonstrated applications of the general theory to various particular forms of random processes.

A central element in the proof of the main convergence result in [5, 9, 1] was the relation between the a posteriori probabilities and the likelihood ratio in the case of a denumerable or finite number of parameter values. More exactly, the following lemma is true:

Lemma 1. Let, for any $i \ne j$ and $n \ge 0$, the measure $\bar{P}_n^i$ be equivalent to the measure $\bar{P}_n^j$. Then $P$-a.s. the following equality holds:

The proof of this lemma follows from the definition of the likelihood ratio $\alpha_n^{i,j}$.
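The relation between a posteriori probabilities and the likelihood ratio can be checked numerically. The sketch below assumes one standard form of such a relation, $\pi_i(n)/\pi_j(n) = (\pi_i(0)/\pi_j(0))\,\alpha_n^{i,j}$, and a purely illustrative observation model (independent Gaussian observations whose mean is the unknown parameter); the names `gauss_pdf`, `bayes_step`, `means` and `prior` are hypothetical and do not come from the paper.

```python
import math
import random


def gauss_pdf(x, mean, sd=1.0):
    """Density of N(mean, sd^2) at x (illustrative observation model)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def bayes_step(post, x, means):
    """One recursive Bayes update of the posterior over a finite parameter set."""
    weights = [p * gauss_pdf(x, m) for p, m in zip(post, means)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical finite parameter set and prior (assumptions for illustration).
means = [0.0, 0.5, 1.0]
prior = [0.2, 0.3, 0.5]

rng = random.Random(1)
observations = [rng.gauss(0.5, 1.0) for _ in range(50)]

post = prior[:]
lik_ratio = 1.0  # accumulated likelihood ratio between parameters 0 and 2
for x in observations:
    post = bayes_step(post, x, means)
    lik_ratio *= gauss_pdf(x, means[0]) / gauss_pdf(x, means[2])

lhs = post[0] / post[2]                  # ratio of posteriors after n steps
rhs = prior[0] / prior[2] * lik_ratio    # prior ratio times likelihood ratio
print(lhs, rhs)
```

The two printed numbers agree up to rounding, which is the property the convergence argument relies on: the behaviour of the posterior ratio is governed entirely by the likelihood ratio.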

The equality (1) yields that

According to the papers [1, 8, 9], this property guarantees the following convergence result (recall that we still deal with the case when the parameter value corresponding to the real system belongs to the feasible set of parameter values $B$).

Theorem 1. Let $\bar{P}_n^i \sim \bar{P}_n^j$ for any $i \ne j$, $n \ge 0$. Then the condition

is equivalent to the condition

$$\lim_{n \to \infty} \pi_j(n) = I(\beta = \beta_j), \quad P\text{-a.s.}$$

The proof of this theorem is based on the property that the singularity set for the measures $\bar{P}^j$ and $\bar{P}$ coincides $\bar{P}$-a.s. with the set $\{\beta = \beta_j\}$.

If the real parameter value $\beta_k$ does not belong to the feasible set, the variables $\pi_i(n)$, $i \in B$, calculated in Section 3 are no longer a posteriori probabilities, but some functionals of the observable process $\xi_n$.

Taking them as a posteriori probabilities, the observer expects one of the $\pi_i(n)$, $i \in B$ (say $\pi_{i_0}(n)$) to converge to 1 and interprets this result as if the real parameter value were equal to $i_0$. However, this is actually a false conclusion. The questions which arise in this connection are: When does convergence of some of the $\pi_i(n)$, $i \in B$, really take place? What does it mean when $\pi_{i_0}(n)$ tends to 1 for some $i_0 \in B$? In order to answer these questions we need some auxiliary results.

Assume that the real system corresponds to a parameter value $k$ such that $k \notin B$. Introduce the function $I_n^k(z,y) = E_k \ln a_n^{z,y}$ [4] and define the measure of distance

Lemma 2. The function $d_n(i,j)$ is a pseudo-metric. That is, the following inequalities hold:

$$d_n(z,k) + d_n(k,y) \ge d_n(z,y).$$

The proof of this lemma is given in [4].

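Lemma 3. For any $z, y \in B$, $n \ge 0$, we have

$$I_n^z(z,y) \ge 0.$$

Proof. From the definition of $I_n^z(z,y)$,

$$I_n^z(z,y) = E_z\left(\ln a_n^{z,y}\right) = E_z\left(E_z\left(\ln a_n^{z,y} \mid \bar{H}_{n-1}\right)\right) = E_z\left(E_y\left(\varphi(a_n^{z,y}) \mid \bar{H}_{n-1}\right)\right),$$

where $\varphi(t) = t \ln t$. According to the mean value theorem, $\varphi(a_n^{z,y})$ can be represented as follows:

$$\varphi(a_n^{z,y}) = (a_n^{z,y} - 1) + \frac{(a_n^{z,y} - 1)^2}{2\,\delta_n^{z,y}},$$

where $\delta_n^{z,y}$ varies between $a_n^{z,y}$ and 1. Since $E_y(a_n^{z,y} \mid \bar{H}_{n-1}) = 1$, it is not difficult to see that

$$E_y\left(\varphi(a_n^{z,y}) \mid \bar{H}_{n-1}\right) = \frac{1}{2}\, E_y\left(\frac{(a_n^{z,y} - 1)^2}{\delta_n^{z,y}} \,\Big|\, \bar{H}_{n-1}\right) \ge 0.$$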
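The argument of Lemma 3 can be checked numerically in the simplest setting. The sketch below assumes a single observation taking three values, with hypothetical densities `f_z` and `f_y` (not taken from the paper), and verifies that $E_z \ln a = E_y\,\varphi(a)$, that the quadratic representation gives the same value, and that each intermediate point $\delta$ indeed lies between $a$ and 1.

```python
import math


def phi(t):
    """phi(t) = t ln t, the function used in the proof of Lemma 3."""
    return t * math.log(t)


# Hypothetical one-step densities of the observation under z and under y.
f_z = [0.5, 0.3, 0.2]
f_y = [0.2, 0.5, 0.3]

# Likelihood ratio a = f_z / f_y at each possible observation value.
a = [pz / py for pz, py in zip(f_z, f_y)]

# E_z ln a (the information function) and E_y phi(a) (after change of measure).
info = sum(pz * math.log(r) for pz, r in zip(f_z, a))
e_phi = sum(py * phi(r) for py, r in zip(f_y, a))

# Intermediate points from phi(a) = (a - 1) + (a - 1)^2 / (2 delta).
delta = [(r - 1) ** 2 / (2 * (phi(r) - (r - 1))) for r in a]

# (1/2) E_y[(a - 1)^2 / delta]; the linear term vanishes because E_y a = 1.
quad = 0.5 * sum(py * (r - 1) ** 2 / d for py, r, d in zip(f_y, a, delta))

print(info, e_phi, quad)
```

All three printed values coincide and are positive, in line with the nonnegativity claimed by the lemma.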

Lemma 4. Let $d_n(k,z) \le d_n(k,y)$. Then

$$I_n^k(z,y) \ge 0.$$

Proof. From the definition of $I_n^k(z,y)$, we can write

$$I_n^k(z,y) = E_k \ln a_n^{z,y} = E_k \ln f_z(\xi_n \mid \bar{H}_{n-1}) - E_k \ln f_y(\xi_n \mid \bar{H}_{n-1}).$$

From Lemma 3, for any $z \in B$,

$$E_k \ln a_n^{k,z} \ge 0$$

and thus
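The closing step of this proof can be sketched as follows. The sketch assumes that the distance is defined, following [4], as $d_n(z,y) = |I_n^k(z,y)|$; this definition is an assumption made here for illustration. Under it, Lemma 3 gives $d_n(k,z) = E_k \ln a_n^{k,z}$, and adding and subtracting $E_k \ln f_k(\xi_n \mid \bar{H}_{n-1})$ in the expression for $I_n^k(z,y)$ yields:

```latex
\begin{aligned}
I_n^k(z,y)
  &= E_k \ln \frac{f_k(\xi_n \mid \bar{H}_{n-1})}{f_y(\xi_n \mid \bar{H}_{n-1})}
   - E_k \ln \frac{f_k(\xi_n \mid \bar{H}_{n-1})}{f_z(\xi_n \mid \bar{H}_{n-1})} \\
  &= E_k \ln a_n^{k,y} - E_k \ln a_n^{k,z} \\
  &= d_n(k,y) - d_n(k,z) \;\ge\; 0 .
\end{aligned}
```

Read this way, the sign of $I_n^k(z,y)$ simply records which of the two models $z$, $y$ is closer to the real system $k$.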

5. RESULTS

Assume that the process $\ln a_n^{z,y}$ is ergodic, i.e.,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \ln a_m^{z,y} = I^k(z,y), \quad P\text{-a.s.}$$

Theorem 2. If $d(k,z) > d(k,y)$, then $\alpha_n^{z,y} \to 0$, $P$-a.s. If it is known that $\alpha_n^{z,y} \to 0$ $P$-a.s., then $d(k,z) \ge d(k,y)$.

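Proof. Note that, from Lemma 4, the inequality $d(k,z) > d(k,y)$ yields $I^k(z,y) < 0$ and consequently

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \ln a_m^{z,y} < 0, \quad P\text{-a.s.}$$

This means that

$$\sum_{m=1}^{n} \ln a_m^{z,y} \to -\infty, \quad P\text{-a.s.},$$

and consequently

$$\alpha_n^{z,y} = \alpha_0^{z,y} \prod_{m=1}^{n} a_m^{z,y} \to 0, \quad P\text{-a.s.},$$

thus proving the first part of the theorem.

In order to prove the second part of the theorem, we assume that $\alpha_n^{z,y} \to 0$ but that $d(k,z) < d(k,y)$. This yields

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \ln a_m^{z,y} > 0, \quad P\text{-a.s.},$$

from which

$$\alpha_n^{z,y} \to \infty, \quad P\text{-a.s.},$$

and the theorem is proved by contradiction.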

Example. Assume that the sequence $\xi_n$ is a finite-state ergodic Markov chain on any of the probability spaces $(\Omega, H, P^i)$, $i \in B$, where $B$ is a finite set. Let $p_{lm}^i$, $l, m = 1, \ldots, k$, be the one-step transition probabilities. It is not difficult to find (see also [8]) that $a_n^{i,j}$ is given by the formula

Well-known results from Markov chain theory (see [10], for instance) show that the process $\ln a_n^{i,j}$ is ergodic. Thus, if the Bayesian algorithm for $\pi_j(n)$ converges to 1 for some particular $j_0$, it means that this $j_0$ is the point of $B$ that is nearest (in the sense of the information distance $d(k,z)$) to the real parameter value $k$.
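The conclusion of the example can be illustrated by a small simulation. The sketch below rests on several assumptions that are not taken from the paper: the one-step likelihood ratio is taken as the ratio of transition probabilities along the observed path, the information distance is computed as the stationary-weighted Kullback divergence between rows of the transition matrices, and the matrices `P_TRUE` and `MODELS` are hypothetical. The real chain is deliberately not among the candidate models.

```python
import math
import random


def stationary(P, iters=500):
    """Stationary distribution of an ergodic chain by power iteration."""
    k = len(P)
    mu = [1.0 / k] * k
    for _ in range(iters):
        mu = [sum(mu[l] * P[l][m] for l in range(k)) for m in range(k)]
    return mu


def info_distance(P_true, P_model):
    """Assumed form of d(k, z): sum_l mu_l sum_m p^k_lm ln(p^k_lm / p^z_lm)."""
    mu = stationary(P_true)
    k = len(P_true)
    return sum(mu[l] * P_true[l][m] * math.log(P_true[l][m] / P_model[l][m])
               for l in range(k) for m in range(k))


def simulate(P, n, rng, start=0):
    """Sample a path of n transitions from the chain with matrix P."""
    path = [start]
    for _ in range(n):
        row = P[path[-1]]
        path.append(rng.choices(range(len(row)), weights=row)[0])
    return path


def posteriors(path, models, prior):
    """Bayesian weights over candidate models, accumulated in the log domain."""
    log_post = [math.log(p) for p in prior]
    for prev, cur in zip(path, path[1:]):
        for j, P in enumerate(models):
            log_post[j] += math.log(P[prev][cur])
    top = max(log_post)
    w = [math.exp(v - top) for v in log_post]
    total = sum(w)
    return [v / total for v in w]


# Real system (not in the feasible set) and hypothetical candidate models.
P_TRUE = [[0.7, 0.3], [0.4, 0.6]]
MODELS = [
    [[0.65, 0.35], [0.45, 0.55]],
    [[0.2, 0.8], [0.9, 0.1]],
    [[0.5, 0.5], [0.5, 0.5]],
]

rng = random.Random(0)
path = simulate(P_TRUE, 5000, rng)
post = posteriors(path, MODELS, [1.0 / 3] * 3)
dist = [info_distance(P_TRUE, M) for M in MODELS]
print("distances:", dist)
print("posterior weights:", post)
```

In this run the weight concentrates on the candidate with the smallest distance to the real chain, even though no candidate coincides with it. The weights are accumulated as log-likelihoods to avoid numerical underflow over long observation sequences.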

REFERENCES

1. A.I. Yashin, Bayesian Approach to Parameter Estimation: Convergence Analysis, WP-83-67, International Institute for Applied Systems Analysis, Laxenburg, Austria (July 1983).

2. A. Bhattacharya, "On Measure of Divergence Between Two Statistical Populations Defined by Probability Distributions," Bulletin of the Calcutta Mathematical Society 35, pp. 99-104 (1943).

3. S. Kullback, Information Theory and Statistics, Wiley, New York (1959).

4. Y. Baram and N.R. Sandell, "An Information Theoretic Approach to Dynamical System Modeling and Identification," IEEE Transactions on Automatic Control AC-23(1), pp. 61-66 (1978).

5. N.M. Kuznetsov, A.V. Lubkov, and A.I. Yashin, "About Consistency of Bayesian Estimates in Adaptive Kalman Filtration Scheme," Automatic and Remote Control (translated from Russian) (4), pp. 47-56 (1981).

6. R.S. Liptzer and A.N. Shiryaev, Statistics of Random Processes, Springer-Verlag, Berlin and New York (1978).

7. A.I. Yashin, "Filtering of Jumping Processes," Automatic and Remote Control 5, pp. 52-58 (1970).

8. A.I. Yashin, "Sostoyatelnost Bayesovskich Otcenok Parametrov (Consistency of Bayesian Parameter Estimates)," Problemi Peredachi Informacii (in Russian) (1), pp. 62-72 (1981).

9. N.M. Kuznetsov and A.I. Yashin, "On the Conditions of the Identifiability of Partially Observed Systems," Docladi Akademii Nauk SSSR (in Russian) 259(4), pp. 790-793 (1981).

10. S. Karlin, A First Course in Stochastic Processes, Academic Press, New York and London (1968).
