
Conjugate Gradient Method


Conjugate Directions

Steepest Descent often finds itself taking steps in the same direction as earlier steps. The idea is now to find a suitable set of N orthogonal search directions d_0, d_1, ..., d_{N-1} (they will be "A-orthogonal"). Together they form a basis of R^N, so that the solution vector x can be expanded as

x = x_0 + \sum_{i=0}^{N-1} \lambda_i d_i .   (2.18)

The goal is to take exactly one step in each search direction, with the right coefficient \lambda_i. After N steps we will be done. Accomplishing this with low computational effort requires a special set of directions and a new scalar product that redefines "orthogonality".

For the first step, we set x_1 = x_0 + \lambda_0 d_0. At step n+1 we choose a new point at the line minimum

x_{n+1} = x_n + \lambda_n d_n ,   (2.19)

and so on, until x_N = x after N steps.

Let us collect a few simple relations which follow directly from (2.18) and (2.19). It is useful to introduce the deviation ("error vector") from the exact solution x (which contains the steps yet to be done)

e_{n+1} := x_{n+1} - x                              (2.20a)
         = e_n + \lambda_n d_n                      (2.20b)
         = -\sum_{i=n+1}^{N-1} \lambda_i d_i .      (2.20c)

Similar relations hold for the residuals r_n, namely

r_{n+1} := -\nabla f(x_{n+1}) = b - A x_{n+1}              (2.21a)
         = \underbrace{b - A x}_{=0} - A e_{n+1}           (2.21b)
         = A \sum_{i=n+1}^{N-1} \lambda_i d_i ,            (2.21c)

and also the recursion

r_{n+1} = r_n - \lambda_n A d_n .                          (2.21d)

Figure 2.3: Sketch of the expansion of the solution x in directions d_i (with x_0 = 0). This figure is a projection from a higher-dimensional space; therefore there are more than two "orthogonal" directions d_i.

The expansion is sketched in Fig. 2.3.

Let us now see what would happen if we demanded that the directions d_n obey the usual 90-degree orthogonality d_i^T d_j = 0 for i \neq j. We want the search direction d_n to be orthogonal to all other search directions, including all future directions making up the error vector e_{n+1}:

0 \overset{?}{=} d_n^T e_{n+1} \overset{(2.20b)}{=} d_n^T e_n + \lambda_n d_n^T d_n .

This would imply

\lambda_n \overset{?}{=} -\frac{d_n^T e_n}{d_n^T d_n} .

However, this equation is not useful, because this \lambda_n cannot be calculated without knowing e_n; but if we knew e_n, then the problem would already be solved. This choice of \lambda_n would also mean that we completely ignore the line minima during the search (e.g. when we try to reach the solution in Fig. 2.1 with two vectors that are at 90 degrees to each other).

Line minimum

The successful strategy is instead to still follow each search direction d_i until the line minimum is reached. Thus

0 \overset{!}{=} \frac{d}{d\lambda} f(x_{n+1}) = \nabla f(x_{n+1})^T \frac{d}{d\lambda} x_{n+1}
  = -r_{n+1}^T d_n                                            (2.22)
  \overset{(2.21d)}{=} -d_n^T r_n + \lambda_n d_n^T A d_n ,

and we get

\lambda_n = \frac{d_n^T r_n}{d_n^T A d_n} .                   (2.23)

This equation for the line minimum is valid for any set of search directions.

It contains the current search direction d_n and the current residual r_n, which are both known.
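As a concrete illustration, a single exact line search can be coded directly from (2.23). The following NumPy sketch is our own (function and variable names are not part of the text); it assumes a symmetric positive definite A:

import numpy as np

def line_minimum_step(A, b, x, d):
    """One exact line search along direction d for f(x) = 1/2 x^T A x - b^T x.

    Returns x_{n+1} = x_n + lambda_n d_n with lambda_n from Eq. (2.23).
    """
    r = b - A @ x                   # current residual r_n = -grad f(x_n)
    lam = (d @ r) / (d @ (A @ d))   # lambda_n = d_n^T r_n / (d_n^T A d_n)
    return x + lam * d, lam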

Orthogonality

Eq. (2.22) tells us again that at the line minimum, the current search direction is orthogonal to the residual. We rewrite this equation:

0 = -d_n^T r_{n+1} \overset{(2.21c)}{=} -d_n^T A \sum_{i=n+1}^{N-1} \lambda_i d_i \overset{(2.20c)}{=} d_n^T A e_{n+1} .   (2.24)

Our goal is that all search directions are "orthogonal". As argued above, this means that d_n should be orthogonal to e_{n+1}. This is consistent with (2.24) if we use a new scalar product

(u, v)_A := u^T A v   (2.25)

to define orthogonality.

We now demand that all search directions are mutually "A-orthogonal",

d_i^T A d_j = 0   (i \neq j) .   (2.26)

We will construct such a set of "conjugate" directions d_i. Since u^T A v is a scalar product, these vectors form an orthogonal basis of R^N, in which the solution vector x can be expanded as in (2.18).

We are therefore also guaranteed that the solution will take at most N steps (up to effects of rounding errors).
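To make the new scalar product concrete, the short NumPy sketch below (helper names are our own) implements (u, v)_A from (2.25) and checks a set of directions for mutual A-orthogonality in the sense of (2.26):

import numpy as np

def a_inner(u, v, A):
    """Scalar product (u, v)_A = u^T A v of Eq. (2.25)."""
    return u @ (A @ v)

def is_a_orthogonal(D, A, tol=1e-10):
    """Check d_i^T A d_j = 0 for i != j (Eq. 2.26); the d_i are the columns of D."""
    G = D.T @ A @ D                      # Gram matrix in the A scalar product
    off_diag = G - np.diag(np.diag(G))
    return np.max(np.abs(off_diag)) < tol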

From (2.21c) and (2.26) we can deduce

d_m^T r_n = d_m^T A \sum_{i=n}^{N-1} \lambda_i d_i = 0   for m < n ,   (2.27)

meaning that the residual r_n is orthogonal in the usual sense (90 degrees) to all old search directions.

Gram-Schmidt Conjugation

We still need a set of N A-orthogonal search directions {d_i}. There is a simple (but inefficient) way to generate them iteratively: the conjugate Gram-Schmidt process.

Let {u_i} with i = 0, ..., N-1 be a set of N linearly independent vectors, for instance unit vectors in the coordinate directions. Suppose that the search directions d_k for k < i are already mutually A-orthogonal. To construct the next direction d_i, take u_i and subtract out all components that are not A-orthogonal to the previous d-vectors, as is demonstrated in Fig. 2.4. Thus, we set d_0 = u_0 and for i > 0 we choose

d_i = u_i + \sum_{k=0}^{i-1} \beta_{ik} d_k ,   (2.28)

Figure 2.4: Gram-Schmidt conjugation of two vectors. Begin with two linearly independent vectors u_0 and u_1. Set d_0 = u_0. The vector u_1 is composed of two components: u^* which is A-orthogonal (in this figure: 90 degrees, in general: A-orthogonal) to d_0, and u^+ which is parallel to d_0. We subtract u^+, so that only the A-orthogonal portion remains: d_1 = u^* = u_1 - u^+ =: u_1 + \beta_{10} d_0.

with the \beta_{ik} defined for k < i. To find their values we impose the A-orthogonality of the new direction d_i with respect to the previous ones:

0 \overset{i>j}{=} d_i^T A d_j = u_i^T A d_j + \sum_{k=0}^{i-1} \beta_{ik} d_k^T A d_j
  = u_i^T A d_j + \beta_{ij} d_j^T A d_j ,   (2.29)

\beta_{ij} = -\frac{u_i^T A d_j}{d_j^T A d_j} .   (2.30)

Note that in the first line, the sum reduces to the term k = j because of the mutual A-orthogonality of the previous search vectors d_k, k < i.

Equation (2.30) provides the necessary coefficients for (2.28). However, there is a difficulty in using this method, namely that all the old search vectors must be kept and processed to construct the new search vector.
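For illustration, a direct (and deliberately naive) transcription of (2.28)-(2.30) into NumPy might look as follows. It is only a sketch under the assumption that the columns u_i are linearly independent, and it stores all previous directions, which is exactly the drawback just mentioned:

import numpy as np

def conjugate_gram_schmidt(U, A):
    """A-orthogonalize the columns of U via Eqs. (2.28)-(2.30).

    U: (N, N) array with linearly independent columns u_i.
    Returns D with columns satisfying d_i^T A d_j = 0 for i != j.
    """
    N = U.shape[1]
    D = np.zeros_like(U, dtype=float)
    D[:, 0] = U[:, 0]                       # d_0 = u_0
    for i in range(1, N):
        d = U[:, i].astype(float).copy()
        for k in range(i):
            # beta_{ik} = - u_i^T A d_k / (d_k^T A d_k), Eq. (2.30)
            beta = -(U[:, i] @ (A @ D[:, k])) / (D[:, k] @ (A @ D[:, k]))
            d += beta * D[:, k]             # subtraction step of Eq. (2.28)
        D[:, i] = d
    return D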

Construction from Gradients

This method is an efficient version of Gram-Schmidt. We will not need to keep the old search vectors. The new ansatz here is to choose a specific set of directions u_i, namely

u_i := r_i .   (2.31)

We first use the fact that the residual is orthogonal to all previous search directions, Eq. (2.27). Together with (2.28) we get, for i < j and for as yet general u_i:

0 \overset{i<j}{=} d_i^T r_j \overset{(2.28)}{=} u_i^T r_j + \underbrace{\sum_{k=0}^{i-1} \beta_{ik} d_k^T r_j}_{=0 \ \text{by} \ (2.27)} ,

hence

0 = u_i^T r_j .   (2.32a)

In the same way one gets

d_i^T r_i = u_i^T r_i .   (2.32b)

With our particular choice u_i = r_i, (2.32a) becomes

r_i^T r_j = 0 ,   i \neq j .   (2.33)

We see that all residual vectors r_i will actually be orthogonal to each other.

We can now compute the Gram-Schmidt coefficients (2.30). The recursion (2.21d) implies

r_i^T r_{j+1} = r_i^T r_j - \lambda_j \, r_i^T A d_j .

The last term contains the numerator of (2.30). Because of the orthogonality (2.33), it simplifies to

r_i^T A d_j = \begin{cases} \frac{1}{\lambda_i}\, r_i^T r_i , & i = j \\ -\frac{1}{\lambda_{i-1}}\, r_i^T r_i , & i = j+1 \\ 0 , & \text{otherwise.} \end{cases}

Thus we obtain all the coefficients needed in (2.28):

\beta_{ij} = \begin{cases} \frac{1}{\lambda_{i-1}}\, \frac{r_i^T r_i}{d_{i-1}^T A d_{i-1}} , & i = j+1 \\ 0 , & i > j+1 . \end{cases}

Most of the \beta_{ij} terms have now become zero. It is no longer necessary to store old search vectors to ensure A-orthogonality. We now denote, for simplification, \beta_i := \beta_{i,i-1} and plug in \lambda_{i-1} from (2.23), using (2.32b), to get the final form

\beta_i = \frac{r_i^T r_i}{r_{i-1}^T r_{i-1}} .

Putting all our results together we obtain the Conjugate Gradient algorithm, which is presented in symbolic form as Algorithm 4.

Comments

Note that the first step of the Conjugate Gradient method is the same as for Steepest Descent. Convergence is again estimated by the norm of r_n.

The computationally most expensive part of the algorithm is one matrix-vector multiplication A d_n per step. It can be implemented efficiently, especially for sparse matrices. All other parts are vector-vector multiplications.
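As a usage sketch (our own example, not part of the text), a sparse matrix stored for instance in SciPy's CSR format performs the product A d_n in a number of operations proportional to the number of nonzero entries:

import numpy as np
import scipy.sparse as sp

N = 100000
# 1D Laplacian (tridiagonal): a standard sparse, symmetric, positive definite matrix.
main = 2.0 * np.ones(N)
off = -1.0 * np.ones(N - 1)
A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csr")

d = np.ones(N)
a = A @ d        # A d_n touches only the ~3N stored nonzero entries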

Algorithm 4: Conjugate Gradient method for quadratic functions

  Choose a suitable initial vector x_0
  Set d_0 := r_0 = -\nabla f(x_0) = b - A x_0
  for n = 0 to N do
      a_n = A d_n
      \lambda_n = (r_n^T r_n) / (d_n^T a_n)
      x_{n+1} = x_n + \lambda_n d_n
      r_{n+1} = r_n - \lambda_n a_n
      \beta_{n+1} = (r_{n+1}^T r_{n+1}) / (r_n^T r_n)
      d_{n+1} = r_{n+1} + \beta_{n+1} d_n
      if converged then EXIT
  end for

Because of rounding errors, orthogonality can get lost during the iteration. It can therefore be advantageous to occasionally calculate the residual directly as r_{n+1} = b - A x_{n+1}, which involves a second matrix-vector multiplication.
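As a concrete sketch, Algorithm 4 translates almost line by line into NumPy. The function below and its parameter names are our own; the tolerance and the interval for recomputing the residual directly are arbitrary illustrative choices:

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, recompute_every=50):
    """Solve A x = b for symmetric positive definite A (cf. Algorithm 4)."""
    x = x0.astype(float).copy()
    r = b - A @ x                     # r_0 = -grad f(x_0)
    d = r.copy()                      # d_0 = r_0
    rr = r @ r
    for n in range(len(b)):
        a = A @ d                     # the one matrix-vector product per step
        lam = rr / (d @ a)            # lambda_n = (r_n^T r_n)/(d_n^T A d_n)
        x += lam * d                  # x_{n+1} = x_n + lambda_n d_n
        if (n + 1) % recompute_every == 0:
            r = b - A @ x             # direct residual to fight rounding errors
        else:
            r = r - lam * a           # r_{n+1} = r_n - lambda_n A d_n
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:     # convergence estimated by the norm of r
            break
        beta = rr_new / rr            # beta_{n+1} = (r_{n+1}^T r_{n+1})/(r_n^T r_n)
        d = r + beta * d              # d_{n+1} = r_{n+1} + beta_{n+1} d_n
        rr = rr_new
    return x

For a symmetric positive definite A (dense or sparse), a call such as conjugate_gradient(A, b, np.zeros_like(b)) then reproduces the iteration described above.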

In each step, the Conjugate Gradient algorithm invokes one multiplication with A. The solution is thus in fact constructed in the space spanned by the vectors {r_0, A r_0, A^2 r_0, A^3 r_0, ...}, which is called a Krylov space. There are many other related methods also acting in Krylov space. One of the most important ones is the closely related Lanczos algorithm for computing low-lying eigenvalues of a big (sparse) matrix. It was developed before CG. Other methods exist for nonsymmetric matrices.

The numerical stability and convergence of CG is again governed by the condition number of the matrix A. It can be improved greatly by a class of transformations called Preconditioning. Instead of solving A x = b, one solves the equivalent equation M^{-1} A x = M^{-1} b. The ideal case would be the exact solution M^{-1} = A^{-1}, giving the identity matrix M^{-1} A with condition number unity. Much simpler transformations can already improve convergence greatly, e.g. just taking M to be the diagonal of A.
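As a minimal sketch of the diagonal choice mentioned last (Jacobi scaling), the example below is our own construction. The text writes the preconditioned system as M^{-1} A x = M^{-1} b; here the diagonal is instead applied symmetrically, D^{-1/2} A D^{-1/2} with D = diag(A), which has the same effect on the condition number but keeps the matrix symmetric so that CG remains applicable:

import numpy as np

rng = np.random.default_rng(1)
n = 200

# Symmetric positive definite test matrix with badly scaled rows and columns.
B = rng.standard_normal((n, n))
B = B @ B.T + n * np.eye(n)            # well-conditioned SPD core
s = 10.0 ** rng.uniform(-3, 3, n)      # widely varying scales
A = (B * s).T * s                      # A = diag(s) B diag(s), still SPD

# Jacobi preconditioning with the diagonal of A, applied symmetrically.
d = np.sqrt(np.diag(A))
A_pre = A / np.outer(d, d)             # D^{-1/2} A D^{-1/2}

# For this badly scaled example the condition number drops by many orders of magnitude.
print(np.linalg.cond(A), np.linalg.cond(A_pre))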
