NOT FOR QUOTATION WITHOUT THE PERMISSION OF THE AUTHORS
Derivative-free Gauss-Newton-like Algorithm for Parameter Estimation
Sergei Scherbov and Viktor Golubkov
November 1986
WP-86-63

Working Papers are interim reports on work of the International Institute for Applied Systems Analysis and have received only limited review. Views or opinions expressed herein do not necessarily represent those of the Institute or of its National Member Organizations.
INTERNATIONAL INSTITUTE FOR APPLIED SYSTEMS ANALYSIS A-2361 Laxenburg, Austria
Foreword
The Population Program at IIASA deals with the analysis of the consequences of demographic changes. To estimate these consequences one needs to analyze data from various sources, to develop models, and to identify their parameters. This paper by Dr. Scherbov and Dr. Golubkov develops the idea of parameter estimation using a derivative-free nonlinear least-squares algorithm. The algorithm was found efficient and convenient for many applications.
Anatoli Yashin Deputy Leader Population Program
Derivative-free Gauss-Newton-like Algorithm for Parameter Estimation
Sergei Scherbov and Viktor Golubkov
Institute for Systems Studies (VNIISI)
Prospect 60 Let Octyabria 9, 117312 Moscow, USSR
An important problem, the estimation of parameters of different models from statistical data about the modeled process, can very often be reduced to a least-squares problem. In this paper a derivative-free nonlinear least-squares algorithm is presented which was found to be very efficient in terms of the number of response function evaluations.
Let $\hat y_i$ ($i = 1,\ldots,N$) be the components of an observed data vector $\hat y = (\hat y_1, \hat y_2, \ldots, \hat y_N)^T$, and let $\phi_i(q)$ ($i = 1,\ldots,N$) be the components of a vector-valued response function $\phi(q) = (\phi_1(q), \phi_2(q), \ldots, \phi_N(q))^T$. Let $q = (q_1, q_2, \ldots, q_n)^T$ be the vector of estimated parameters. Then, according to the Generalized Least Squares Method (GLSM), the estimate $\hat c = (\hat c_1, \hat c_2, \ldots, \hat c_n)^T$ could be found as a solution to the following nonlinear programming problem:

$$\min_q F(q) = \bigl\langle \hat y - \phi(q),\; P\,(\hat y - \phi(q)) \bigr\rangle, \qquad (1)$$

where $P$ is a given symmetric, positive semi-definite square matrix of weights, and $\langle a, b\rangle$ denotes the scalar product.
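As an illustration of the objective (1), the following minimal Python sketch evaluates $F(q)$ for a given response function and weight matrix. The exponential response model, the synthetic data and all names in the sketch are illustrative assumptions and are not taken from the algorithm described here.

```python
import numpy as np

def gls_objective(q, y_obs, phi, P):
    """GLSM objective (1): F(q) = <y - phi(q), P (y - phi(q))>."""
    r = y_obs - phi(q)           # residual vector y_hat - phi(q)
    return float(r @ P @ r)      # scalar product <r, P r>

# Illustrative response model phi(q) = q1 * exp(-q2 * t) on a small grid.
t = np.linspace(0.0, 1.0, 5)
phi = lambda q: q[0] * np.exp(-q[1] * t)
y_obs = phi(np.array([2.0, 3.0]))    # synthetic "observations"
P = np.eye(t.size)                   # identity weights: ordinary least squares
print(gls_objective(np.array([1.5, 2.5]), y_obs, phi, P))
```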
Gauss-Newton-like algorithms are known to be the most efficient iterative methods for solving problem (1). According to those algorithms, the linear approximation of $\phi(q)$ about the current value of the parameter vector $q$ is calculated at each iteration, and a linear least-squares problem is solved to obtain a new value of the parameter vector $q$.
In the presented algorithm, as in many other derivative-free Gauss-Newton-like algorithms [1,2,3,4], the linear approximation of $\phi(q)$ is evaluated from $n+1$ values of $\phi(q)$ calculated at the previous iterations in order to estimate the new parameter vector and to pass the updated set of parameters to the next iteration. Gauss-Newton-like algorithms differ from each other generally in the linear approximation of $\phi(q)$ and in the use of the information obtained at the previous iterations. Let us dwell upon the most important features that distinguish the presented algorithm from others of the same type.

Let us assume that on the $k$-th iteration we have $q_1^k, q_2^k, \ldots, q_{n+1}^k$, estimates of the solution computed at the previous iterations, and a corresponding set $\phi_1^k, \phi_2^k, \ldots, \phi_{n+1}^k$ which is necessary for the linear approximation of $\phi(q)$, where $\phi_i^k = \phi(q_i^k)$. The upper index shows the iteration number. Let us assume that on the first iteration $F_{n+1}^0 = \min_{1 \le i \le n+1} F_i^0$, where $F_i^0 = F(q_i^0)$.
The new estimate of the parameters $q$ on the $k$-th iteration, $q_{n+1}^{k+1}$, will be computed with the help of an $n \times n$ matrix $\Delta Q^k$ of parameter increments and a vector $\Delta q^k$, both of which are evaluated on the $k$-th iteration.

The linear approximation $\psi(q)$ of the vector $\phi(q)$ could be written in the form

$$\psi(q) = \phi^k_{n+1} + \Delta\Phi^k \,(\Delta Q^k)^{-1}\,\bigl(q - q^k_{n+1}\bigr), \qquad (3)$$

whose matrix $\Delta\Phi^k(\Delta Q^k)^{-1}$ is derived from the $n+1$ previously calculated values $q_i^k$, $\phi_i^k$ ($i = 1,\ldots,n+1$) so as to satisfy the conditions

$$\psi(q_i^k) = \phi_i^k, \quad i = 1,\ldots,n+1, \qquad (3')$$

the columns of $\Delta Q^k$ and $\Delta\Phi^k$ being

$$\Delta q_i^k = q_i^k - q_{n+1}^k, \quad i = 1,\ldots,n, \qquad (3'')$$

$$\Delta\phi_i^k = \phi_i^k - \phi_{n+1}^k, \quad i = 1,\ldots,n. \qquad (3''')$$

Supposing nonsingularity of the matrix $\Delta Q^k$, we obtain the approximation (3) directly from (3')-(3'''). The closer the points $q_i^k$ are to each other, the higher is the accuracy of approximation (3). The vector $\Delta q^k$ is a solution of the linear least-squares problem obtained by replacing $\phi(q)$ with $\psi(q)$ in (1), and is calculated from

$$\Delta q^k = \Delta Q^k \bigl[(\Delta\Phi^k)^T P\,\Delta\Phi^k\bigr]^{-1} (\Delta\Phi^k)^T P\,\bigl(\hat y - \phi^k_{n+1}\bigr).$$
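To make the construction concrete, the sketch below assembles $\Delta Q^k$ and $\Delta\Phi^k$ from the $n+1$ stored points, forms the secant approximation $\Delta\Phi^k(\Delta Q^k)^{-1}$ of the Jacobian, and solves the linearized weighted least-squares problem for $\Delta q^k$. The array layout and the function name are our own choices for the sketch, not the authors' implementation.

```python
import numpy as np

def gauss_newton_step(Q, Phi, y_obs, P):
    """Derivative-free Gauss-Newton step from n+1 stored points.

    Q   : (n, n+1) array; column i holds q_i, the last column holds q_{n+1}
    Phi : (N, n+1) array; column i holds phi(q_i)
    Returns dq, the solution of the linearized weighted least-squares problem.
    """
    q_base, phi_base = Q[:, -1], Phi[:, -1]
    dQ = Q[:, :-1] - q_base[:, None]        # Delta Q: parameter increments
    dPhi = Phi[:, :-1] - phi_base[:, None]  # Delta Phi: response increments
    J = dPhi @ np.linalg.inv(dQ)            # secant approximation of the Jacobian
    r = y_obs - phi_base                    # current residual
    # Normal equations of the linearized problem: (J^T P J) dq = J^T P r
    return np.linalg.solve(J.T @ P @ J, J.T @ P @ r)
```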
The new, $(k+1)$-th estimate $q_{n+1}^{k+1}$ is

$$q_{n+1}^{k+1} = q_{n+1}^k + h_k\,\Delta q^k,$$

where $h_k$ is a step length along the direction $\Delta q^k$, obtained from the one-dimensional function minimization

$$\Phi(h) = F\bigl(q_{n+1}^k + h\,\Delta q^k\bigr) \to \min_h,$$

or under the condition that $F(q_{n+1}^{k+1}) < F(q_{n+1}^k)$. In the present version of the algorithm the one-dimensional minimization was based on a second-order approximation of $\Phi(h)$ and on the well-known fact that in regular cases the value $h_k$ which minimizes $\Phi(h)$ approaches a value close to 1 as the Gauss-Newton algorithm converges.
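A second-order line search of this kind can be sketched as follows. The sample points, the convexity test and the fallback rule are our own assumptions; the paper only states that a quadratic model of $\Phi(h)$ is used and that $h_k$ tends to 1 near convergence.

```python
import numpy as np

def quadratic_line_search(F, q, dq, h_trial=1.0):
    """Approximate minimization of Phi(h) = F(q + h*dq) by a parabolic model."""
    hs = np.array([0.0, 0.5 * h_trial, h_trial])   # trial step lengths
    fs = np.array([F(q + h * dq) for h in hs])     # sampled objective values
    a, b, c = np.polyfit(hs, fs, 2)                # parabola a*h**2 + b*h + c
    h_star = -b / (2.0 * a) if a > 0 else hs[np.argmin(fs)]
    # Accept h_star only if it really decreases F; otherwise keep the best sample.
    if F(q + h_star * dq) >= fs[0]:
        h_star = hs[np.argmin(fs)]
    return h_star
```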
The $(k+1)$-th iteration begins with the calculation of $\phi(q_{n+1}^{k+1})$, and then the $\Delta\Phi^{k+1}$ and $\Delta Q^{k+1}$ matrices are calculated. Before describing these matrix calculations we should mention that during the iterative process the nonsingularity of $\overline{\Delta Q}^k$ must be controlled, where $\overline{\Delta Q}^k$ is the matrix whose columns are the normalized increments $\Delta q_i^k / \lVert \Delta q_i^k \rVert$ and $\lVert \Delta q_i^k \rVert$ is the standard Euclidean norm of the vector $\Delta q_i^k$. That is because ill-conditioning of $\overline{\Delta Q}^k$ leads to ill-conditioning of $\Delta Q^k$ and $\Delta\Phi^k$, and consequently of $\Delta q^k$, which in turn leads to a worsening of the algorithm's convergence. Let $\varepsilon_d$ be the minimal feasible value of $\lvert \det \overline{\Delta Q}^k \rvert$ for which convergence is still normal. Then $\Delta Q^k$ ($k \ge 1$) should be constructed in such a way that

$$\bigl|\det \overline{\Delta Q}^{\,k+1}\bigr| \ge \varepsilon_d \quad \text{when } k \ge 1; \qquad 0 < \varepsilon_d. \qquad (4)$$
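The quantity controlled in (4) is easy to compute; a short sketch is given below, with an illustrative (not prescribed) value of $\varepsilon_d$.

```python
import numpy as np

def normalized_determinant(dQ):
    """|det| of Delta Q after scaling every column to unit Euclidean norm."""
    norms = np.linalg.norm(dQ, axis=0)     # ||Delta q_i|| for each column
    return abs(np.linalg.det(dQ / norms))  # determinant of the normalized matrix

def satisfies_condition_4(dQ, eps_d=1e-3):
    """Condition (4): |det(normalized Delta Q)| >= eps_d (eps_d value illustrative)."""
    return normalized_determinant(dQ) >= eps_d
```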
Let us assume that (4) is true. Then the building of $\Delta Q^{k+1}$ starts with changing in $\Delta Q^k$ the column whose number is $L_k$, replacing it by $h_k\,\Delta q^k$, where $L_k$ is defined from

$$L_k = \operatorname*{Argmax}_{1 \le i \le n} \lvert s_i^k \rvert. \qquad (5)$$

The values $s_i^k$ ($i = 1,\ldots,n$) are the scalar coefficients in the representation of $\Delta q^k$ in the columns of $\Delta Q^k$, which may be obtained from (3'), (3'') and (3'''). Such a choice of $L_k$ provides the best conditioning of $\overline{\Delta Q}^{\,k+1}$. This can easily be derived from the expression relating $\det \overline{\Delta Q}^{\,k+1}$ to $\det \overline{\Delta Q}^{\,k}$ and the coefficients $s_i^k$; from this expression and from (5) it follows that the column of $\overline{\Delta Q}^{\,k}$ whose replacement provides the best conditioning of $\overline{\Delta Q}^{\,k+1}$ is precisely the one substituted by $h_k\,\Delta q^k$. If $\lvert \det \overline{\Delta Q}^{\,k+1} \rvert \ge \varepsilon_d$, then $\Delta\Phi^{k+1}$ is constructed by changing the single column with number $L_k$ according to

$$\Delta\phi_{L_k}^{k+1} = \phi\bigl(q_{n+1}^{k+1}\bigr) - \phi\bigl(q_{n+1}^{k}\bigr). \qquad (5')$$
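The bookkeeping behind (5) and (5') can be sketched as follows: expand the new step in the columns of $\Delta Q^k$, select the column index, and replace the matching columns of $\Delta Q$ and $\Delta\Phi$. The helper names are ours, and the selection by the largest coefficient reflects our reading of (5).

```python
import numpy as np

def replace_column(dQ, dPhi, dq_new, dphi_new):
    """Replace one column of Delta Q (and of Delta Phi) by the new increment.

    dq_new   : h_k * dq, the accepted parameter increment
    dphi_new : the corresponding response increment, cf. (5')
    """
    s = np.linalg.solve(dQ, dq_new)  # coefficients s_i of dq_new in the columns of Delta Q
    L = int(np.argmax(np.abs(s)))    # column chosen as in (5)
    dQ_new, dPhi_new = dQ.copy(), dPhi.copy()
    dQ_new[:, L] = dq_new
    dPhi_new[:, L] = dphi_new
    return dQ_new, dPhi_new, L
```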
One of the distinctions of the presented algorithm from other variants of derivative-free Gauss-Newton-like algorithms is the choice of the column in $\Delta Q^k$ that will be substituted by $q_{n+1}^{k+1} - q_{n+1}^{k}$ after the $k$-th iteration. In those algorithms, the column that was calculated before the others (on the iteration with the smallest number), and hence usually corresponds to parameter values which differ most from their current values, is the one substituted by $q_{n+1}^{k+1} - q_{n+1}^{k}$, or, which is the same, by $h_k\,\Delta q^k$. Thus the vectors of parameters which contribute to the computation of $\Delta Q^k$ are all the time "pulled up" to their latest values. It means that the columns of $\Delta Q^k$ are generally calculated with parameters which are close to their current values.

In the algorithms mentioned above, all the columns of $\Delta Q$ will be renewed during $n$ iterations and the correctness of approximation (3) will be maintained. For the objective functions with a long valley which are widespread in practice, the level surfaces in parameter space are badly stretched along some directions. It is a well-known fact that descent directions for these functions are nearly tangent to the level surface, and therefore they are very close to the direction in which the level surface is stretched. Thus, as the iteration number $k$ increases, it can occur that the columns of $\overline{\Delta Q}^k$ and $\Delta\Phi^k$, and the descent direction $\Delta q^k$, are calculated incorrectly due to computational errors.

In the presented algorithm the worsening of the conditioning of $\overline{\Delta Q}^k$ occurs essentially more rarely than in many other algorithms of the same type. This was confirmed by numerical experiments. That is because in the algorithm presented here it is not the columns of $\overline{\Delta Q}^k$ that were calculated earliest which are substituted, as in [2,3,4], but rather those whose replacement provides the maximum linear independence of the columns in $\overline{\Delta Q}^{\,k+1}$. This is illustrated by Figures 1 and 2. In Figure 1, the substitution of columns in $\overline{\Delta Q}^k$ is shown as it is usually done in most Gauss-Newton-like algorithms; in Figure 2, the same is shown for the algorithm presented here.

Figure 1. Substitution of columns in $\overline{\Delta Q}^k$ as done in most Gauss-Newton-like algorithms.

Figure 2. Substitution of columns in $\overline{\Delta Q}^k$ in the algorithm presented here.

During the functioning of the presented algorithm it can happen that some of the columns of $\Delta Q^k$ are not substituted for a long period. That means that these columns were calculated with parameters that are far from their current values. In this case the linear approximation (3) may not be valid, and the descent direction $\Delta q^k$ will be calculated incorrectly even though condition (4) is satisfied.
To get rid of this, the algorithm operates in such a way that during a given number of iterations all the columns of $\Delta Q^k$, and hence of $\Delta\Phi^k$, are renewed. This is achieved in the following way.

Let us assume we are at the $k$-th iteration and $\Delta q^k$ has already been calculated.
Let us denote by $n_i^k$ ($i = 1,\ldots,n$) the number of substitutions of the $i$-th column of $\Delta Q^k$ during $k$ iterations, and by $I_k$ the set of numbers of the columns whose substitution counts lag behind the most frequently renewed one,

$$I_k = \{\, i : n_*^k - n_i^k \ge N_0 \,\}, \qquad n_*^k = \max_{1 \le i \le n} n_i^k, \quad k = 0, 1, 2, \ldots$$

Here $N_0$ is a control parameter of the algorithm which accepts integer values and satisfies the condition $N_0 \ge 1$ ($N_0 = 0$ corresponds to the case without "pulling up" the columns, because in this case $I_k \equiv I_0$). Then the column with number $l_k$ that will be substituted by $h_k\,\Delta q^k$ is found, instead of from (5) and (5'), from

$$l_k = \operatorname*{Argmax}_{i \in I_k} \lvert s_i^k \rvert. \qquad (6)$$
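A sketch of this "pulling up" bookkeeping is given below. Since the definition of $I_k$ is only partially legible in the source, the lag-based rule used here (a column qualifies once its substitution count lags the maximum by at least $N_0$) and the tie to the coefficients $s_i^k$ are our assumptions.

```python
import numpy as np

def choose_pulled_up_column(counts, s, N0):
    """Pick the column to renew according to the 'pulling up' rule.

    counts : counts[i] = n_i^k, number of times column i has been replaced
    s      : coefficients s_i^k of the new step in the columns of Delta Q^k
    N0     : integer control parameter (assumed lag threshold)
    Returns a column index, or None if no column lags (fall back to rule (5)).
    """
    counts = np.asarray(counts)
    lag = counts.max() - counts          # how far each column lags behind
    I_k = np.flatnonzero(lag >= N0)      # candidate "stale" columns
    if I_k.size == 0:
        return None
    return int(I_k[np.argmax(np.abs(np.asarray(s)[I_k]))])
```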
If, after the substitution of the $l_k$-th column, $\lvert \det \overline{\Delta Q}^{\,k+1} \rvert \ge \varepsilon_d$, then we pass to the $(k+1)$-th iteration as already described. Otherwise an orthogonalization procedure is invoked (this is also an important feature of the algorithm). According to this procedure, each column $\Delta q_i^{k+1}$, $i \in I_k \setminus \{l_k\}$, of $\Delta Q^{k+1}$ is tested for substitution by a vector $\Delta\tilde q_i^{\,k+1}$ which is an orthogonal complement to the subspace spanned by the other columns of $\Delta Q^{k+1}$, and for each case the determinant of the thus-modified matrix is calculated. The vector $\Delta\tilde q_i^{\,k+1}$ is calculated by

$$\Delta\tilde q_i^{\,k+1} = \begin{cases} \alpha\,\lVert \Delta q_i^{k+1}\rVert\,\tilde b_i^{\,k+1}, & \text{when } \lVert \Delta q_i^{k+1}\rVert \ge r_j,\\ \beta\, r_j\,\tilde b_i^{\,k+1}, & \text{when } \lVert \Delta q_i^{k+1}\rVert < r_j, \end{cases}$$

where $\alpha$ and $\beta$ are positive control parameters of the algorithm; $r_j$ ($j = 1,\ldots,n$) is a vector of precisions for the parameter estimation (the iterative process stops when simultaneously $\lvert q_{n+1,j}^{k+1} - q_{n+1,j}^{k}\rvert \le r_j$ and $\lvert \Delta q_j^k \rvert \le r_j$, $j = 1,\ldots,n$); $\tilde b_i^{\,k+1}$ is a unit vector which is orthogonal to the other columns $\Delta q_r^{k+1}$ ($r \ne i$, $r = 1,\ldots,n$); and $j$ is the vector's component number.

If (4) is true, then the substitution of the corresponding column by its orthogonal column is performed and the iteration process continues. If, during the examination of all the columns $\Delta q_i^{k+1}$, $i \in I_k \setminus \{l_k\}$, condition (4) could not be satisfied, the substitution that provides the maximum value of $\lvert \det \overline{\Delta Q}^{\,k+1} \rvert$ is made. Then the procedure is repeated, but now those columns of $\Delta Q^{k+1}$ which have not yet been substituted are replaced by orthogonal columns. If all $\Delta q_i^{k+1}$, $i \in I_k \setminus \{l_k\}$, are substituted but condition (4) is still not satisfied, then it is allowed to examine for substitution the columns whose numbers were not initially included in $I_k$, besides the column with number $l_k$. Condition (4) will be satisfied after not more than $n-1$ substitutions of columns of $\Delta Q^{k+1}$ by orthogonal ones, since then all columns in $\Delta Q^{k+1}$ are orthogonal. During orthogonalization, after the substitution of the $i$-th column in $\Delta Q^{k+1}$ by the corresponding orthogonal one $\Delta\tilde q_i^{\,k+1}$, the value $\phi(q_{n+1}^{k+1} + \Delta\tilde q_i^{\,k+1})$ is calculated, the $i$-th column of $\Delta\Phi^{k+1}$ is substituted by $\phi(q_{n+1}^{k+1} + \Delta\tilde q_i^{\,k+1}) - \phi(q_{n+1}^{k+1})$, and $n_i^{k+1} = n_i^k + 1$. It should be mentioned that while testing (4) during orthogonalization, after substituting a column by the corresponding orthogonal one, the determinant $\det(\overline{\Delta Q}^{\,k+1})$ is calculated recursively. This reduces the computational effort.
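One way to realize the orthogonal replacement column is through a complete QR factorization of the remaining columns, as sketched below; the `scale` argument stands in for the length prescribed by the $\alpha$/$\beta$ rule above. This is a possible realization under our assumptions, not the authors' code.

```python
import numpy as np

def orthogonal_replacement(dQ, i, scale):
    """Build a replacement for column i of Delta Q orthogonal to the other columns."""
    others = np.delete(dQ, i, axis=1)                 # columns that stay fixed, shape (n, n-1)
    Qfull, _ = np.linalg.qr(others, mode="complete")  # orthonormal basis of R^n
    b = Qfull[:, others.shape[1]]                     # unit vector orthogonal to all of them
    return scale * b
```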
The $n+1$ initial values of the parameters $q$ required by the algorithm are generated from the starting value $q_{n+1}^0$. For $i = 1,\ldots,n$, $q_i^0$ is computed from $q_{n+1}^0$ by changing its $i$-th component by a non-zero step $h_i$. The corresponding set of $\phi_i^0$ ($i = 1,\ldots,n+1$) is also calculated. The convergence of the algorithm presented here was proved analytically.
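The generation of the $n+1$ initial points can be written down directly from this description; the array layout below is our own choice.

```python
import numpy as np

def initial_points(q_start, steps):
    """Return an (n, n+1) array of starting parameter vectors.

    The last column is the user-supplied starting point q_{n+1}^0; the i-th
    column is obtained from it by a non-zero step steps[i] in the i-th component.
    """
    q_start = np.asarray(q_start, dtype=float)
    n = q_start.size
    Q = np.tile(q_start[:, None], (1, n + 1))
    for i in range(n):
        Q[i, i] += steps[i]      # perturb the i-th component by h_i
    return Q
```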
A comparative numerical analysis of this algorithm with several well-known and efficient derivative-free algorithms for minimizing a sum of squares was carried out. The total number of function evaluations needed to find the minimum with a given precision was considered as the efficiency criterion. This criterion is computer-independent and characterizes the real computational expenditure in cases when evaluating $\phi(q)$ demands much more effort than the algorithm itself. In this paper we present some results of the comparison of the discussed algorithm with two derivative-free Gauss-Newton-like algorithms. The first is Powell's method for least-squares problems; its code, VA02A, was taken from the Harwell Subroutine Library (HSL). The second method is a compromise between three different algorithms for minimizing a sum of squares, namely Newton-Raphson, Steepest Descent and Marquardt; its code is VA05A, also from HSL. The last two have been compared by many authors with various other algorithms and were found to be very efficient. Comparison of the discussed algorithm with VA02A and VA05A was performed on standard test problems found in the literature. Four test problems were considered.
1. $F(q) = 100\,(q_1^2 - q_2)^2 + (1 - q_1)^2$; $\hat c = (1,1)^T$, $F(\hat c) = 0$. Here $\hat c$ is the vector of parameters at which the minimum occurs.
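Test function 1 is the classical Rosenbrock function; in residual (least-squares) form, as it would be supplied to the algorithms compared here, it reads:

```python
import numpy as np

def rosenbrock_residuals(q):
    """Residuals r with F(q) = 100*(q1**2 - q2)**2 + (1 - q1)**2 = r1**2 + r2**2."""
    return np.array([10.0 * (q[0] ** 2 - q[1]), 1.0 - q[0]])

def rosenbrock(q):
    r = rosenbrock_residuals(q)
    return float(r @ r)

print(rosenbrock(np.array([1.0, 1.0])))   # 0.0 at the minimizer (1, 1)
```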
Minimization of functions 1-4 begins from different starting values. The obtained results are presented in Table 1. The algorithm described in this paper is denoted MMGN. Here NFE is the number of function evaluations; log(F) is the logarithm of the convergence error, $\log(F) = \log F(q^{k_l})$, where $k_l$ is the last iteration number. The symbol * means that the algorithm failed to find the solution.

From Table 1 it can be seen that in all cases except one our algorithm needs fewer function evaluations than the other two to converge with the same precision. Only in one case, for function 1 and starting vector (-1.2, 1), did algorithm VA05A perform fewer function evaluations than MMGN. But, for example, for function 3 VA05A failed once, and VA02A failed every time, while MMGN succeeded. If we compare all the presented results, then VA05A and VA02A spend, on average, 2.2 times more function evaluations than MMGN to find the solution. (The cases where VA05A and VA02A failed are not taken into account.)

Table 1.
Method    NFE  log(F)    NFE  log(F)    NFE  log(F)    NFE  log(F)

Function 1
MMGN
VA05A
VA02A

Function 2
MMGN
VA05A
VA02A

Function 3    (0,1)         (-1,1)        (0,-1)        (1,0)
MMGN       35   -14      73    -0     119   -14      72    -0
VA05A     123   -10     143   -11       *     *     244    -0
VA02A       *     *       *     *       *     *       *     *

Function 4    (10,10,10,-10)    (10,10,10,10)
MMGN       25   -15      35   -15
VA05A      57   -15      57   -15
VA02A      63   -15      63   -15
Thus MMGN was found to be more efficient than the algorithms from the HSL library. This gives reason to expect that it will also be more efficient than many of the algorithms that were compared with the latter two and were found less efficient. Besides, as follows from [6], the MMGN algorithm was in many cases more efficient for demographic data fitting than the algorithms from the IMSL library. In conclusion it should be noted that the presented algorithm has been widely used for data processing and for parameter estimation in demographic, socio-economic, ecological and other models. Extensive practical experience with this algorithm has proved its reliability and efficiency.

REFERENCES
[1] Gelovani, V., V. Golubkov, and S. Scherbov (1979) Parameter Estimation by a Derivative-free Analog of the Gauss-Newton Method, Vol. 8 (in Russian). Moscow: VNIISI.
[2] Ralston, M. and R. Jennrich (1978) Dud, a derivative-free algorithm for nonlinear least squares. Technometrics, Vol. 20, No. 1.
[3] Peckham, G. (1970) A new method for minimizing a sum of squares without calculating gradients. Computer Journal, No. 7.
[4] Vereskov, A., N. Levin, and V. Fedorov (1980) Regularized derivative-free least squares algorithm. Issues of Cybernetics, No. 71 (in Russian).
[5] Hopper, M.J. (Ed.) (1978) Harwell Subroutine Library. A Report. AERE-R-9185.
[6] Rogers, A. and F. Planck (1983) A General Program for Estimating Parametrized Model Schedules of Fertility, Mortality, Migration, and Marital and Labor Force Status Transitions. Working Paper WP-83-102. Laxenburg, Austria: International Institute for Applied Systems Analysis.