ITERATIVE PROJECTION METHODS FOR
SPARSE LINEAR SYSTEMS AND
EIGENPROBLEMS
CHAPTER 4 : CONJUGATE GRADIENT METHOD
Heinrich Voss
voss@tu-harburg.de
Hamburg University of Technology Institute of Numerical Simulation
Introduction
The conjugate gradient method, CG method for short, was developed independently byHestenes(1951) andStiefel(1952). Theoretically, it is a direct method for symmetric and positive definite linear systems, i.e. it yields the exact solution in a finite and predetermined number of operations. Differently from elimination methods, however, at different stages of the CG method one obtains approximate solutions of the system.
For a system of dimension n, at most n steps of the CG method (using exact arithmetic) are needed, where at each step the approximate solution is refined progressively. Generally it is found that after a very small number of iterations a sufficiently good approximation is obtained and the process can be
Introduction ct.
The iterative nature of the CG method was already mentioned inHestenes & Stiefel(1952) but its importance for the numerical solution of large and sparse system was recognized only after the seminal paper ofReid(1971).
The history of the CG method and the Lanczos algorithm (which is closely related to the CG method) up to 1976 is presented in a review paper ofGolub & O’Leary(1989) which contains an annotated bibliography of 291 items.
A minimization problem
Suppose that the system matrix A of the linear system of equations Ax = b (1)
is symmetric and positive definite. Then the quadratic functional
φ(x ) :=1 2x
TAx − xTb,
x ∈ Rn, is strictly convex.
Hence, there exists a unique solution x∗of the minimum problem φ(x ) = min!, x ∈ Rn, (2)
1D minimizer
We now take advantage of the characterization of x∗as the minimizer of φ to construct iterative methods for the numerical solution of (1).
Let xk −1, k ≥ 1, be an approximate solution of (2), and assume that the residual
rk −1 :=b − Axk −1 6= 0
(for otherwise, xk −1already would be the solution of problem (2)).
We improve xk −1by a one dimensional line search. To this end we choose a search direction dk 6= 0 and determine the minimum of φ on the line
xk −1+ τdk, τ ∈ R, i.e. we minimize the real equation ψ(τ ) := φ(xk −1+ τdk) = 1 2(x k −1+ τdk)TA(xk −1+ τdk) − (xk −1+ τdk)Tb = 1 2τ 2(dk)TAdk − τ (dk)Trk −1+ φ(xk −1). (3)
1D minimizer ct.
From (dk)TAdk >0 we obtain that ψ has a unique minimum, which is characterized by ψ0(τ ) = τ (dk)TAdk − (dk)Trk −1 =0, i.e.
τk =
(dk)Trk −1
(dk)TAdk. (4)
As a new approximation to the minimizer x∗of (2) we therefore choose xk :=xk −1+ τkdk.
If rk :=b − Axk 6= 0, then we repeat the one dimensional line search with a new search direction dk +1.
Reduction of φ
From (3) and (4) we obtain the following reduction of the functional φ in the step above: φ(xk −1) − φ(xk) = φ(xk −1) − φ(xk −1+ τkdk) = φ(xk −1) − ψ(τk) = −1 2 (dk)Trk −1 (dk)TAdk 2 · (dk)TAdk+(dk)Trk −1 (dk)TAdk · (d k)Trk −1 = 1 2 · ((dk)Trk −1)2 (dk)TAdk .
Hence, a reduction of the size of φ occurs in the line search step, if and only if the residual rk −1and the search direction dk are not orthogonal.
Energy norm
Obviously, for the solution x∗of Ax = b
(x − x∗)TA(x − x∗) = xTAx − 2xTAx∗+ (x∗)TAx∗
= xTAx − 2xTb − (x∗)TAx∗+2(x∗)Tb = 2φ(x ) − φ(x∗) for every x ∈ Rn.
Hence for any subset D ⊂ Rnthe vector ˜x ∈ D is a minimizer of φ in D, if and only if it minimizes the energy norm
kx∗− ˜x kA:= q
(x∗− ˜x )TA(x∗− ˜x )
Example
The Gauß-Seidel method can be interpreted as a line search method if we choose the unit vectors cyclically as search directions, i.e.
djn+k :=uk = (δik)i=1,...,n, k = 1, . . . , n, j ∈ N ∪ {0}.
Suppose that we arrived at the approximate solution xjn+k −1. Then the search direction is uk, and hence,
τjn+k = (rjn+k −1)Tuk (uk)TAuk = rk(nj+k −1) akk = 1 akk bk − n X i=1 akixi(nj+k −1),
and the new iterate is given by
xi(nj+k )= ( xi(nj+k −1) if i 6= k xk(nj+k −1)+a1 kk bk −Pnν=1ak νxν(nj+k −1) if i = k
Example ct.
If additionally we relax the increment: xk(nj+k ):=xk(nj+k −1)+ ω 1 akk bk− n X ν=1 ak νxν(nj+k −1)
for some ω ∈ R, then we obtain the SOR method.
Similarly as in (4) we obtain for a relaxed line search the decrease of the functional
φ(xk −1) − φ(xk −1+ ωτkdk) = ω(1 − 0.5ω)
((dk)Trk −1)2 (dk)TAdk .
Hence, the SOR method decreases the functional φ in every step if and only if ω ∈ (0, 2), and this is exactly the set of parameters for which convergence of
Steepest descent method
Given the approximation xk −1to the minimizer x∗, the functional φ decreases most rapidly in the direction of the negative gradient
−∇φ(xk −1) =b − Axk −1 =rk −1.
Hence, locally the optimum choice of the search direction is the residual dk :=rk −1. The appertaining algorithm is thesteepest descent method.
1: r0=b − Ax0
2: for k = 1, 2, . . . until convergence do 3: sk =Ark −1 4: τk = krk −1k22(rk −1)Tsk 5: xk =xk −1+ τ krk −1 6: rk =rk −1− τ ksk 7: end for
Cost of steepest descent method: 1 matrix-vector product
2 scalar products 2 _axpys
Zig-zagging
For ill-conditioned problems the convergence rate of the steepest descent method can be very slow.
Example
The steepest descent method yields for the functional φ(x ) := xT 10−3 0 0 103 x
and the initial vector x0:= (100, 10−3)T after 100 and 1000 iterations, respectively, the approximations
x100 = (99.4912, 0.000994912)T, x1000= (95.0276, 0.000950276)T to the solution x∗=0, which is not a substantial improvement upon the initial guess.
Zig-zagging ct.
The reason for the bad performance is that in this example the level curves of φare very elongated ellipses.
The gradient directions that arise in the iteration mainly point into the x2-direction whereas the minimum is situated mainly in the x1-direction. Hence, the method is zig-zagging through the flat and steep-sided valley {x ∈ R2 : φ(x ) ≤ φ(x0)}.
For ill-conditioned problems this is the typical behavior of the steepest
Example
10−4 10−3 10−2 10−1 100Steepest descent for −∆ u = 0; stepsize = 1/128
Theorem 4.1
Let A be a symmetric and positive definite matrix and denote by λ1≤ λ2≤ · · · ≤ λnthe eigenvalues of A.
Then for every initial vector x0the sequence {xk}, which is constructed by the steepest descent method, converges to the solution x∗of Ax = b, and the following a priori error estimate with respect to the energy norm holds:
kxk − x∗kA≤
λn− λ1 λn+ λ1
k
Proof
For x ∈ Rnand r := b − Ax one step of the steepest descent method yields the iterate
y := x + kr k 2 2
rTAr(b − Ax ) =: (I − θ(x )A)x + θ(x )b, from which we obtain for the error vectors
y − x∗= (I − θ(x )A)(x − x∗).
Let
z := (I − σA)x + σb Then it follows
Proof ct.
We prove that for σ = ˜σ :=2/(λn+ λ1)the inequality kz − x∗k A≤ λn− λ1 λn+ λ1 kx − x∗k A holds. Then the error bound is obtained by induction. For e := x − x∗and ˜e := z − x∗we get
˜
e = (I − ˜σA)e =: Me.
Proof ct.
Hence, the spectral norm of M is given by kMk2=max 1 − 2λ1 λn+ λ1 , 1 − 2λn λn+ λ1 = λn− λ1 λn+ λ1 =: η, from which we obtain
ky − x∗kA ≤ k˜ekA= kA1/2˜ek2= kA1/2(I − ˜σA)ek2 = k(I − ˜σA)A1/2ek2≤ ηkA1/2ek2
Remark
If λ1and λnare the extreme eigenvalues of A, and if κ2(A) :=
λn λ1
denotes the condition number of A with respect to the Euclidean norm then the error reduction factor of the steepest descent method with respect to the energy norm can be written as
η := λn− λ1 λn+ λ1
= κ2(A) − 1 κ2(A) + 1 .
Hence, the smaller the condition number of A is, the faster the steepest descent method converges.
In particular, for κ2(A) = 1 one obtains η = 0.
Actually in this case A = I, and the gradient method yields the minimizer in the
Example
We consider the discrete version of the model problem with stepsize h = 1/m. The eigenvalues are
λi,j=2 2 − cos iπ m + 1 − cos jπ m + 1 , i, j = 1, . . . , m.
Hence, the minimum eigenvalue is λ1,1=4(1 − cosm+1π )and the maximum eigenvalue is λm,m =4(1 + cosm+1π ), and we obtain the error reduction factor
η = λm,m− λ1,1 λm,m+ λ1,1
=cos π m + 1.
Ideal line search
To avoid the zig-zagging of the steepest descent method we ask for more suitable search directions. An ideal line search method would have the following property:
If the approximations x1,x2, . . . are determined by exact line searches along the search directions d1,d2, . . ., then the minimizer xk of φ on the line xk −1+ τdk, τ ∈ R, even minimizes the functional φon the affine space x0+span{d1, . . . ,dk}, which is spanned by all preceding directions dj, j = 1, . . . , k .
If we are able to construct linear independent directions with this property, then the corresponding line search method is not only convergent but it is finite, because after at most n steps it finds the minimizer of φ on
x0+span{d1, . . . ,dn} = Rn. The line search method is said to have thefinite
Ideal line search ct.
Let x ∈ x0+span{d1, . . . ,dk}. Then x has the representation x = ˜x + τ dk, τ ∈ R, ˜x = x0+
k −1 X
j=1 τjdj.
Similarly as for the 1D-line search φ(x ) = φ(˜x ) + τ k −1 X j=1 τj(dj)TAdk+ τ (dk)Tr0+ 1 2τ 2(dk)TAdk. If (dj)TAdk =0 for j = 1, . . . , k − 1, then φ(x ) = φ(˜x ) + τ (dk)Tr0+1 2τ 2(dk)TAdk,
and the problem of minimizing φ over x0+span{d1, . . . ,dk} decouples into a minimization over x0+span{d1, . . . ,dk −1} and the minimization of the real equation
Ideal line search ct.
The minimizer τk of ψ is τk = (dk)Tr0/(dk)TAdk, and from
(dk)Trk −1 = dkTb − A(x0+ k −1 X j=1 τjdj) = (dk)Tr0− k −1 X j=1 τj(dk)TAdj = (dk)Tr0
one obtains again
τk =
(dk)Trk −1 (dk)TAdk.
Hence, if the search directions dj satisfy (di)TAdj =0 for i 6= j, then the minimization of φ over the affine space x0+span{d1, . . . ,dk} can be replaced by the consecutive line searches along d1, . . . ,dk.
Conjugate directions
Definition: Let A ∈ Rn×n be symmetric and positive definite. The set of vectors
d1, . . . ,dk is calledA-conjugate, if dj 6= 0 for j = 1, . . . , k and (di)TAdj =0 for i, j = 1, . . . , k , i 6= j.
With this notion we can formulate our result as follows: Theorem 4.2
Let x0∈ Rnbe given and assume that x1,x2, . . . have been determined by a line search method, where the search directions d1,d2, . . . are mutually A-conjugate. Then xk minimizes the functional φ over the affine space x0+span{d1, . . . ,dk}.
Remark
From the minimality of φ(xk)over x0+span{d1, . . . ,dk} it follows that (dj)Trk =0, j = 1, . . . , k ,
for otherwise, if (di)Trk 6= 0 for some i ∈ {1, . . . , k }, then there exists α ∈ R
Method of conjugate directions
1: r0=b − Ax0;
2: for k = 1, 2, . . . until rk =0do 3: choose dk such that
(dk)TAdj =0, j = 1, . . . , k − 1, and (dk)Trk −16= 0 4: τk = (dk)Trk −1(dk)TAdk 5: xk =xk −1+ τ kdk 6: rk =b − Axk 7: end for
Remark
Two vectors x , y ∈ Rn\ {0} are A-conjugate, if and only if they are orthogonal with respect to the scalar product
hx, y iA:=yTAx .
Hence, if d1, . . . ,dk are A-conjugate, then they are linearly independent. Moreover, given any set of linearly independent vectors v1, . . . ,vn∈ Rn, one can construct a set of n A-conjugate vectors by Gram-Schmidt
orthogonalization of v1, . . . ,vnwith respect to the inner product h·, ·iA. We will see that there is a less costly method to obtain an A-conjugate basis of Rn.
Conjugate Gradient Method
In order to execute the conjugate direction method we have to choose in the k -th step a vector dk such that (dk)TAdj =0, j = 1, . . . , k − 1, and
(dk)Trk −16= 0.
The first condition is satisfied by the A-orthogonal complement of any vector v 6∈ span{d1, . . . ,dk −1} with respect to span{d1, . . . ,dk −1}.
To fulfill also the second requirement, we choose v := rk −1, i.e.
dk =rk −1− k −1 X j=1 (dj)TArk −1 (dj)TAdj d j.
CG method
For dk we actually obtain
(dk)Trk −1 = (rk −1)Trk −1− k −1 X j=1 (dj)TArk −1 (dj)TAdj (d j)Trk −1 | {z } =0 = (rk −1)Trk −1 >0,
and hence, dk is an admissible search direction. The choice dk =rk −1− k −1 X j=1 (dj)TArk −1 (dj)TAdj d j.
seems to have the disadvantages that it is very costly to perform and that all previous search directions dj have to be stored.
Actually the sum reduces to the last term. This follows from the following Lemma.
Lemma 4.3
Let x0∈ Rnbe given and assume that the approximate solutions xj, j = 1, . . . , k , are determined by the conjugate direction method, where the search directions are given by .
dk =rk −1− k −1 X j=1 (dj)TArk −1 (dj)TAdj d j. (1)
Then the vectors rj and dj satisfy
(rk)Trj =0, j = 0, . . . , k − 1, (2) and
Proof
From (1) it follows that
rj =dj+1+ j X i=1 (di)TArj (di)TAdid i,
and therefore, (dj)Trk =0, j = 1, . . . , k yields
(rk)Trj = (rk)Tdj+1 | {z } =0 + j X i=1 (di)TArj (di)TAdi (r k)Tdi | {z } =0 =0 for j = 1, . . . , k − 1, i.e. (2).
Proof ct.
To prove (3) (i.e. (dj)TArk =0) we rewrite dj+1, j = 1, . . . , k − 1, in the following way: dj+1 = rj− j X i=1 (di)TArj (di)TAdi d i = b − Axj− j X i=1 (di)TArj (di)TAdi d i = b − A(xj−1+ τjdj) − j X i=1 (di)TArj (di)TAdi d i = b − Axj−1− τjAdj− j X i=1 (di)TArj (di)TAdi d i.
Proof ct.
From τj 6= 0 (otherwise, xj−1would have been the minimizer of φ and the algorithm would have been terminated) we obtain
Adj = 1 τj rj−1− dj+1− j X i=1 (di)TArj (di)TAdi d i ! ,
and therefore, for j = 1, . . . , k − 1
(rk)TAdj = 1 τj (rk)Trj−1− (rk)Tdj+1− j X i=1 (di)TArj (di)TAdi (r k)Tdi ! .
Finally from (rk)Trj =0 for j = 0, . . . , k − 1 and from (dj)Trk =0, for j = 1, . . . , k , we get that all terms on the right hand side vanish. This
From (1) and the last Lemma it follows that the current search direction dk is given by dk =rk −1− (d k −1)TArk −1 (dk −1)TAdk −1 d k −1=:rk −1+ β kdk −1.
Hence, we do not have to store the hole history of search directions to perform the A-orthogonalization of rk −1, but only the most recent direction is needed. It will turn out that the execution of the conjugate direction method requires the storage of four vectors.
Improvement of CG method
We can even improve the efficiency of the algorithm by some simple conversions.
Firstly, the current residual can be updated according to
rk =b − Axk =b − A(xk −1+ τkdk) =rk −1− τkAdk.
This manipulation saves one matrix-vector product Axk in every iteration because the matrix-vector product Adk is needed in the calculation of τk, anyhow.
From rk =rk −1− τ
kAdk one gets for k ≥ 2 Adk −1= 1
τk −1(r
k −2− rk −1),
and the orthogonality of the residuals yields (dk −1)TArk −1= 1
τk −1
(rk −2− rk −1)Trk −1= − 1 τk −1
(rk −1)Trk −1,
and from dk =rk −1+ βkdk −1and rk =rk −1− τkAdk we obtain (dk −1)TAdk −1 = (dk −1)TA(dk −1− βk −2dk −2) = (dk −1)TArk −2= 1 τk −1(r k −2− rk −1)Trk −2 = 1 τk −1 (rk −2)Trk −2.
Thus, the current search direction reads dk =rk −1+(r
k −1)Trk −1 (rk −2)Trk −2d
k −1.
Finally, we can replace (dk)Trk −1in τk by
(dk)Trk −1= (rk −1+ βkdk −1)Trk −1= (rk −1)Trk −1, which is needed in the determination of βk, anyway.
Putting these simplifications together we finally arrive at the following method of conjugate gradients, which is essentially the method of Hestenes and Stiefel.
Conjugate Gradient Algorithm
1: r0:=b − Ax0; 2: α0:= (r0)Tr0. 3: d1=r0; 4: for k = 1, 2, . . . until dk ==0do 5: sk =Adk; 6: τk = αk −1(dk)Tsk; 7: xk =xk −1+ τ kdk; 8: rk =rk −1− τ ksk; 9: βk =1/αk −1; 10: αk = (rk)Trk; 11: βk = βk · αk; 12: dk +1=rk+ β kdk; 13: end forCost of CG method
1 matrix-vector product 2 scalar products 3 _axpys
Storage requirements: 4 vectors
The termination criterion in the CG algorithm is unrealistic because rounding errors destroy the conjugacy among the search directions and finite
termination usually will not appear.
The termination criterion should be based on a combination of a sufficient decrease of the residual and a maximum number of iterations.
Example
10−5 10−4 10−3 10−2 10−1 100Steepest descent and CG for −∆ u = 0; stepsize = 1/128
||x st. descent||A
Error bound
Because the conjugate gradient method is a direct method, and hence, in exact arithmetic the solution is obtained in a finite number of steps, terms like ‘convergence’ or ‘asymptotic reduction rate’ do not make sense. In this section we shall derive an error estimate.
We already realized that for the solution x∗of Ax = b it holds kx − x∗k2A=2φ(x ) − φ(x∗) for every x ∈ Rn.
Hence, xk minimizes the error e := x∗− x with respect to the energy norm k · kAover the affine space x0+span{d1, . . . ,dk}.
The representation of span{d1, . . . ,dk} which is contained in the following lemma is the key to the error bound for the conjugate gradient method.
Lemma 4.4
Assume that the search directions d1, . . . ,dk are constructed by the conjugate gradient method. Then
span{d1, . . . ,dk} = span{r0,Ar0, . . . ,Ak −1r0}.
Definition
Let B ∈ Rn×nand v ∈ Rnbe given. Then the linear space Kk(v , B) := span{v , Bv , . . . , Bk −1v } is called the k -thKrylov spaceof B corresponding to v .
Proof
For k = 1 the statement is trivial since d1=r0.
Assume that it already has been shown for some k < n. Then (with the notation d0=0) we obtain dk +1 = rk+ βkdk =rk −1− τkAdk+ βkdk = dk− βk −1dk −1− τkAdk + βkdk, (∗) and from dk,dk −1 ∈ K k(r0,A) it follows that span{d1, . . . ,dk +1} ⊂ K k +1(r0,A).
Error bound ct.
As before we denote by e0:=x∗− x0the error of x0.Then from r0=b − Ax0=A(x∗− x0) =Ae0
one obtains Ajr0=Aj+1e0, and thus the elements of the Krylov space z ∈ K(r0,A) can be written as z = k X j=1 αjAje0, αj ∈ R.
Hence, it follows from (*) that the error ek :=x∗− xk of the k -th iterate xk satisfies kekk2A = min x ∈x0+Kk(r0,A)kx ∗− xk2 A = min z∈Kk(r0,A) kx∗− x0+zk2 A
Error bound ct.
Let
˜
Πk := {p : p is a polynomial of degree ≤ k , p(0) = 1}
be the set of polynomials p of maximum degree k such that p(0) = 1. Then (**) gets the form
kekk2 A= min
p∈ ˜Πk
kp(A)e0k2 A.
Hence, the error of the k -th iterate can be represented as the product of a polynomial p(A) in A and the initial error e0. Therefore, the CG method can be considered as a polynomial iteration, where in contrast to the Chebyshev iteration the underlying polynomial is constructed by the CG method itself. More important, the polynomial does not depend on parameters (estimates of the bounds of the spectrum of A) which have to be prepared by the user in the Chebyshev method.
Error bound ct.
To obtain a more convenient form of the error representation we consider the spectral decomposition A = n X j=1 λjzj(zj)T
of A, where zj, j = 1, . . . , n, is an orthonormal set of eigenvectors of A and 0 < λ1≤ λ2≤ · · · ≤ λnare the corresponding eigenvalues.
Then A2= n X i,j=1 λiλjzi(zi)Tzj | {z } =δij (zj)T = n X j=1 λ2jzj(zj)T, and by induction Ai = n X λijzj(zj)T.
Error bound ct.
Therefore, from e0=: n X j=1 γjzj one gets kekk2 A= min p∈ ˜Πk k n X j=1 γjp(λj)zjk2A.The definition of the energy norm yields n X j=1 γjp(λj)zj 2 A = n X i=1 γip(λi)zi !T A n X j=1 γjp(λj)zj = n X j=1 γ2jλjp(λj) 2 .
Error bound ct.
Hence, we obtain the very important representation of the error of xk in the energy norm kekk2 A=min p∈ ˜Πk n X j=1 γj2λj p(λj) 2 (1) which will be discussed in detail at the end of this section.
Coarse upper bound
First we deduce from (1) a coarse upper bound of the convergence properties of the CG method. From ke0k2A= n X j=1 λjγ2j
we obtain by estimating the values |p(λj)|by the maximum of |p| on the spectrum of A kekkA≤ min p∈ ˜Πk max λ∈spec(A) |p(λ)| · ke 0k A, and even coarser
kekkA≤ min p∈ ˜Πk
max λ1≤λ≤λn
|p(λ)| · ke0kA, and the special choice
p(λ) = Ck 2λ − λn− λ1 λn− λ1 Ck −λn− λ1 λn− λ1
Theorem 4.5
Let A ∈ Rn×nbe symmetric and positive definite, and denote by λnand λ 1the largest and smallest eigenvalue of A, respectively.
Let x0be any initial vector, and denote by xk the k -th approximate to the solution x∗ of Ax = b, which is obtained by the CG method.
Then the following a priori error estimate with respect to the energy norm holds: kxk− x∗kA≤ 2Rk 1 + R2kkx 0− x∗k A, R := √ λn− √ λ1 √ λn+ √ λ1 . (2)
Replacing the denominator in (2) by 1 we receive the error estimate
kxk − x∗kA≤ 2 p
κ2(A) − 1 !k
Example
To demonstrate the dependence of the convergence of the CG method on the condition number, we consider the linear system of equations Ax = 0, where A is a diagonal matrix of dimension 500, such that the entries are pseudo random numbers in the interval [1, a].
The next figure contains the reduction of the Euclidean norms of the errors for the cases a = 10, a = 100 and a = 10000 in a semilogarithmic scale.
Example ct.
10−9 10−8 10−7 10−6 10−5 10−4 10−3 10−2 10−1 100Speed of convergence of CG method
a=10
a=100 a=10000
Example
To demonstrate the quality of last Theorem the following figure contains the real reduction factors kekkA/ke0kAas well as those ones, which are predicted by (2) for the discrete version of the model problem with stepsize h = 1/128. In this problem the condition number is
κ2(A) =
1 + cos128π 1 − cos π
128
Example ct.
10−14 10−12 10−10 10−8 10−6 10−4 10−2 100 error bound error reductionImproved error bound
The last Theorem provides only very coarse upper bounds of the speed of convergence of the CG method.
In particular, the finite termination property is not reflected by (2). This can be obtained from the finer estimate (1).
Because the matrix A ∈ Rn×nhas at most n distinct eigenvalues, we obtain from (1) for every polynomial q ∈ ˜Πk, such that qk(λj) =0, j = 1, . . . , n, that
kekk2A= min p∈ ˜Πk n X j=1 γj2λj p(λj) 2 ≤ n X j=1 γj2λj qk(λj) 2 =0. (4) Moreover, we get
Theorem 4.6
Let A ∈ Rn×nbe symmetric and positive definite, and let xk be the approximates to the solution x∗of Ax = b, which are obtained by the CG method.
If A has m ≤ n distinct eigenvalues, then in exact arithmetic the CG algorithm stops with xk =x∗after at most m steps.
Proof
If A has the distinct eigenvalues λ1< · · · < λm, then
qm(λ) := m Y j=1 (λ − λj) in (4) yields kemk2 A=0.
Clustered eigenvalues
If the matrix A does not have a very small number m n of distinct eigenvalues but they are contained in m disjoint intervals Ij of very small length, then the CG method will not stop with the exact solution after m steps, but it will produce a very small residual after m steps.
To see this we choose a polynomial q of degree m such that q(0) = 1, which possesses a root in each of the intervals Ij. Then by assumption all
eigenvalues are close to one of the roots of q, and thus max{|q(λj)| : λj is an eigenvalue of A},
Clustered eigenvalues ct.
To demonstrate the effect of clustered eigenvalues on the convergence properties of the CG method, we consider a homogeneous linear system Ax = 0, where A is a diagonal matrix of dimension n = 500 with condition number κ2(A) = 10000.
The following figure contains the reduction of the energy norms of the errors for the cases that all eigenvalues are uniformly distributed in 2, 5 and 10 intervals of length 1, respectively, and for the case that the eigenvalues are uniformly distributed in the interval [1, 10000] in a semilogarithmic scale.
Clustered eigenvalues ct.
10−10 10−8 10−6 10−4 10−2 100 uniformly distributed 10 intervals 5 intervals 2 intervalsOutlying eigenvalues
Knowing only the largest and the smallest eigenvalues λnand λ1of a positive definite matrix A, the bound in Theorem 4.5 is the best possible. If the interior eigenvalues of A lie at the points where the shifted Chebyshev polynomial Ck attains its maximum absolute value in [λ1, λn], then for a certain initial error e0, the CG polynomial will be equal to the Chebyshev polynomial, and the bound in Theorem 4.5 will be actually be attained at step k .
If additional information is available about interior eigenvalues of A, one can often improve the simpler estimate of Th. 4.5 while maintaining a simple expression.
Suppose, for example that A has one eigenvalue λnwhich is much larger than the others. Then we replace the polynomial in the minmax characterization of the error by the product of a linear factor which is zero at λn, and the shifted and scaled Chebyshev polynomial on the interval [λ1, λn−1]of degree k − 1:
Outlying eigenvalues ct.
pk(λ) = Ck −1 2λ − λn−1− λ1 λn−1− λ1 Ck −1 −λn−1− λ1 λn−1− λ1 λn− λ λn .Since the second factor is zero at λnand less than one in absolute value on each of the other eigenvalues, the maximum absolute value of this polynomial on {λ1, . . . , λn} is less than the maximum absolute value of the first factor on {λ1, . . . , λn−1}.
Using arguments like those in Chapter 3 it follows that kxk − x∗k A≤ 2 √κn−1− 1 √ κn−1+1 k −1 kx0− x∗k A, κn−1= λn−1 λ1
Outlying eigenvalues ct.
Similarly, if A has a few outlying eigenvalues:
λ1≤ λ2≤ · · · ≤ λn−` λn−`+1≤ · · · ≤ λn, λn−`/λn−`+1 1, one can consider a polynomial pk which is the product of a polynomial of degree ` that is zero at each of the outlyers and less than one in magnitude at each of the other eigenvalues, and a shifted and scaled Chebyshev
polynomial of degree k − ` on the interval [λ1, λn−`].
pk(λ) = Ck −` 2λ − λn−`− λ1 λn−`− λ1 Ck −1 −λn−`− λ1 λn−`− λ1 n Y j=n−`+1 λj− λ λj .
With this polynomial one obtains from the minmax characterization of the error the upper bound
kxk− x∗k A≤ 2 √κn−`− 1 √ κn−`+1 k −` kx0− x∗k A, κn−` = λn−` λ1
Local effects in convergence behavior
The following figure shows the convergence behavior of the CG method for A = diag([0.001; 0.1+rand (98, 1); 1.2]); b = ones(100, 0); x 0 = zeros(100, 1).
Local effects in convergence behavior
In the beginning the convergence is quite slow in agreement with the condition number κ(A) = 1200.
After iteration 10 the convergence gets noticeably faster than in the first phase.
To explain this effect we need a different variant of the CG method, namely its connection to the Lanczos process (Lanczos1950).
Lanczos method
The Lanczos algorithm determines an orthonormal basis {q1, . . . ,qk} of the Krylov space
Kk(r0,A) := span{r0,Ar0, . . . ,Ak −1r0}, k = 1, . . . , n, such that
Tk :=QkTAQk, Qk := (q1, . . . ,qk) is tridiagonal.
Lanczos method ct.
Assume that we already computed the orthonormal basis q1, . . . ,qk of the Krylov space Kk(v , A).
Then Ak −1v ∈ span{q1, . . . ,qk}, and therefore, there exist γ
1, . . . , γk ∈ R such that Akv = A(Ak −1v ) = A k X j=1 γkqj = γkAqk+A k −1X j=1 γjqj .
The second term on the right hand side is contained in Kk(v , A). Hence, to obtain an orthonormal basis of Kk +1(v , A) it suffices to compute the orthonormal complement of the uk :=Aqk with respect to the vectors q1, . . . ,qk.
Lanczos method ct.
Since Aqj ∈ Kj+1(v , A) ⊂ Kk −1(v , A) for every j < k − 1, we have (qj)Tuk = (qj)TAqk = (Aqj)Tqk =0.
Hence, (uk)Tqj =0 for j = 1, . . . , k − 2, and therefore Aqk = γkqk +1+ αkqk+ βk −1qk −1.
Lanczos method ct.
The coefficients are obtained fromβk −1 = (qk −1)TAqk = (Aqk −1)Tqk = (γk −1qk + αk −1qk −1+ βk −2qk −2)Tqk = γk −1, i.e. Aqk = βkqk +1+ αkqk + βk −1qk −1. (1) Thus, αk = (qk)TAqk, and the condition kqk +1k
2=1 yields
βk =1/kAqk− αkqk− βk −1qk −1k2. (2)
If the denominator in (2) vanishes, then Aqk ∈ K
k(v , A), and therefore, Kk(v , A) is a k -dimensional invariant subspace of A.
Lanczos method ct.
1: q0=0; k = 1 2: β0= kr0k 3: while rk −16= 0 do 4: qk =rk −1/β k −1 5: rk =Aqk 6: rk =rk− βk −1qk −1 7: αk = (qk)Trk 8: rk =rk− α kqk 9: βk = krkk 10: end whileThen with Tk =tridiag{βj−1, αj, βj}
AQk =QkTk +rk(ek)T, rk =Aqk − αkqk − βk −1qk −1.
Lanczos method ct.
Consider the CG method for the linear system
Ax = b, A symmetric and positive definite. Let x0=0 (else consider Ay = b − Ax0=: ˜b)
The approximation xk after k steps minimizes φ(x ) :=1
2x
TAx − bTx in K
Lanczos method ct.
Restricting φ to Kk(b, A) yields φk(y ) := 1 2y TQT kAQky − yTQTkb, y ∈ Rk, and the CG iterate xk satisfiesTkyk =QkTb, xk =Qkyk.
Disadvantage(at a first glance): All qj, j = 1, . . . , k , are needed to determine xk. We will come back to this point in the next section.
Local convergence behavior
The following analysis of the local convergence behavior of the CG method is contained in a paper ofvan der Sluis & van der Vorst(1986)
Multiplying
AQk =QkTk + βkqk +1(ek)T by e1from the right yields
Aq1=QkTke1.
Premultiplying this with A from the left gives
A2q1=AQkTke1= (QkTk+ βkqk +1(ek)T))Tke1=QkTk2e1for k>2.
And similarly one obtains
Ajq1=QkTkje1, for j = 3, . . . , k − 1, Akq1=QkTkke1+cqk +1 for some constant c
Local convergence behavior ct.
Since xk ∈ K
k(b, A) it follows that there exists some polynomial pk(t) =Pkj=0γjtj of degree k such that
rk =pk(A)b, and from rk ⊥ K k(b, A) and q1=b it follows 0 = QkTpk(A)q1=QTk k X j=0 γjAjq1 = k X j=0 γjQkTQkT j ke1+ γkcQTkqk +1 = pk(Tk)e1
Local convergence behavior ct.
Since Tk is symmetric there exists an orthonormal basis of Rk of eigenvectors yj, j = 1, . . . , k of Tk. Denote by θj the eigenvalue corresponding to yj, and assume that the eigenvalues are ordered by magnitude θ1≤ θ2≤ · · · ≤ θk. Let e1=Pk
j=1τjyj. Then τj 6= 0 for j = 1, . . . , k :
Assume that τj =0, i.e. (yj)Te1=0. Then it follows from Tkyj = θjyj 0 = (Tkyj)Te1= (yj)T(Tke1) = (yj)T(α1e1+ β1e2) ⇒ (yj)Te2=0, and by induction we obtain the contradiction (yj)Tei =0, i = 1, . . . , k .
Local convergence behavior ct.
Since τj 6= 0 for every j we obtain from
0 = pk(Tk)e1= k X j=1 τjp(Tk)yj = k X j=1 τjp(θj)yj
that the eigenvalues of θj are the roots of the polynomial pk, and from pk(0) = 1 we therefore obtain pk(t) = k Y j=1 (θj − t) , k Y j=1 θj (1)
Exploiting this characterization of the residual polynomial pk we can further analyze the convergence behavior of the CG method.
Local convergence behavior ct.
Let zj, j = 1, . . . , n be an orthonormal set of eigenvectors of A, and let λj be the eigenvalue corresponding to zj, where the eigenvalues again are ordered by magnitude.
Write the initial error as
x − x0= n X j=1 µjzj. From
A(x − xk) =b − Axk =rk =pk(A)r0=pk(A)A(x − x0) =Apk(A)(x − x0) it follows x − xk =pk(A)(x − x0) = n X j=1 µjpk(A)zj = n X j=1 µjpk(λj)zj.
Local convergence behavior ct.
We now replace the first component of xk (with respect to the basis zj) by the first component of x and take the modified xk as starting vector ˜x0of another CG process (with the same A and b):
x − ˜x0= n X
j=2
µjpk(λj)zj.
The new CG process generates a sequence ˜xmfor which the error is characterized by a polynomial ˜pm.
Local convergence behavior ct.
Let qk(t) = (λ1− t)(θ2− t) . . . (θk − t) λ1θ2. . . θk =θ1(λ1− t) λ1(θ1− t) pk(t). From ˜pm(0)qk(0) = 1 it follows kx − xk +mk2 A≤ k˜pm(A)qk(A)(x − x0)k2A= n X j=2 λjpm˜ (λj)2qk(λj)2µ2j. Defining Fk := θ1 λ1 max j=2,...,n λj − λ1 λj − θ1 it follows that |qk(λj)| ≤Fk|pk(λj)|Local convergence behavior ct.
Hence, the upper bound can be further simplified askx − xk +mk2A ≤ Fk2 n X j=2 λjp˜m(λj)2pk(λj)2µ2j = Fk2kx − ˜xmk2A≤ Fk2kx − ˜x mk2 A kx − ˜x0k2 A kx − xkk2A.
If the smallest eigenvalue θ1is close to λ1, then Fk ≈ 1, and we may expect a reduction of the error in the next steps that is bounded by the reduction that we obtain for a CG process for Ax = b in which λ1is missing.
With κ2(A) := λn/λ2we have kx − xm+kk A≤ 2 p κ2(A) − 1 pκ2(A) + 1 !m kx − xkk A.
Preconditioning
The estimate kxk − x∗k A≤ 2 p κ2(A) − 1 pκ2(A) + 1 !k kx0− x∗k A (1)suggests that the convergence of the CG method depends heavily on the condition number of the system matrix A.
We therefore transform problem Ax = b into a system ˜
A˜x = ˜b, (2) where
˜
A := C−1AC−T, b := C˜ −1b, x := C−Tx ,˜ (3)
and C ∈ Rn×nis a nonsingular matrix, such that linear systems of equations Cy = d and CTy = d (d ∈ Rngiven)
can be solved easily and that the condition number of ˜A is as small as possible.
CG for ˜
A˜
x = ˜
b
1: ˜r = ˜b − ˜A˜x 2: d = ˜˜ r 3: α = (˜r )T˜r 4: while√α > εdo 5: s = ˜˜ A˜d 6: τ = α.(˜d )T˜s 7: x = ˜˜ x + τ ˜d 8: ˜r = ˜r − τ ˜s 9: β =1/α 10: α = (˜r )T˜r 11: β = β · α 12: d = ˜˜ r + β ˜d 13: end whileCG for ˜
A˜
x = ˜
b original variables; ˜
d := C
Td ; ˜
s = C
−1s
1: C−1r = C−1b − C−1AC−Tx = C˜ −1(b − Ax ) 2: CTd = C−1r 3: α =rTC−TC−1r 4: while√α > εdo 5: C−1s = C−1AC−TCTd = C−1Ad 6: τ = α.(dTC)(C−1s) = αdTs 7: CTx = CTx + τ CTd 8: C−1r = C−1r − τ C−1s 9: β =1/α 10: α =rTC−TC−1r 11: β = β · α 12: CTd = C−1r + βCTd 13: end whilePreconditioned CG. M = CC
T 1: r = b − Ax 2: Solve Md = r for d 3: α =rTd 4: while√α > εdo 5: s = Ad 6: τ = αdTs 7: x = x + τ d 8: r = r − τ s 9: β =1/α 10: Solve Mz = r for z 11: α =rTz 12: β = β · α 13: d = z + βd 14: end whilePCG
Thepreconditioned conjugate gradient algorithm,PCG algorithmfor short,
requires as the original CG method one matrix-vector product and five level-1-operations, and additionally the solution of one system of linear equations Mz = r .
Notice that the matrix C, which we used in the derivation of the PCG method, does not appear in the algorithm, but only thepreconditionerM.
M = CCT obviously is symmetric and, by the regularity of C, positive definite, and every symmetric and positive definite matrix principally is a suitable preconditioner (we can choose C := M1/2in the derivation of the PCG method).
Preconditioner
In practical problems, M must have the following properties: Linear systems Mz = r can be solved easily
M is a good approximation of A.
The choice of the preconditioner can effect the speed of the convergence dramatically. Preconditioning is a field of active research.
The most important preconditioner is obtained by incomplete Cholesky factorization.
Preconditioning by Splitting Methods
To obtain a fast preconditioned CG method we have to find a symmetric and positive definite matrix M such that firstly, M approximates A as good as possible, and secondly, the linear system of equations Mz = r can be solved easily.
When we considered splitting methods, we studied nearly the same question. All splitting matrices constructed in Chapter 2, which inherit the symmetry and positive definiteness from the system matrix A are suitable preconditioners.
Preconditioning by Splitting ct.
If A := D − E − ET is the standard partition, where D denotes the diagonal part of A and −E the strictly lower triangular part, then the following preconditioners are at hand:
M = D Jacobi
M := (D − E )D−1(D − ET) symmetric GS M :=ω(2−ω)1 (D − ωE )D−1(D − ωET), ω ∈ (0, 2), SSOR.
Of course, these matrices are symmetric and positive definite, if the system matrix A has these properties.
The solution of Mz = r for M := B can be obtained by one step of the corresponding splitting iteration
Preconditioning by Splitting ct.
It is obvious that more than one step of the splitting method can be performed to get an improved preconditioning.
If in an inner iteration k steps of (1) are in use, then the preconditioner is given by M = k −1 X j=0 h B−1(B − A)ijB−1. (2)
Since A and B are symmetric, each term in (2) is symmetric. Thus M is symmetric, too. Concerning the positive definiteness of M the following theorem was proved byAdams(1985).
Theorem 4.7
Let A and B be a symmetric matrices, and suppose that A is positive definite and B is nonsingular. Then the matrix M given in (2) is symmetric.
(i) For k odd, M is positive definite if and only if B is positive definite.
(ii) For k even, M is positive definite if and only if 2B − A is positive definite. We next discuss the condition of Theorem 4.7 which seems to be quite strange.
Theorem 4.8
Suppose that A ∈ Rn×nand B ∈ Rn×nare symmetric and positive definite. Then the matrix 2B − A is positive definite if and only if ρ(B−1(B − A)) < 1. Proof: Obviously, λ is an eigenvalue of B−1(B − A) if and only if it is an eigenvalue of the generalized eigenproblem
Proof ct.
Since B is positive definite all eigenvalues of B−1(B − A) are real. From Rayleigh’s principle we obtain that ρ(B−1(B − A)) < 1 if and only if
−1 < x
T(B − A)x
xTBx <1 for every x 6= 0, which is equivalent to
−xTBx < xT(B − A)x < xTBx for every x 6= 0, i.e.
xT(2B − A)x > 0 and xTAx > 0 for every x 6= 0.
Since A is supposed to be positive definite ρ(B−1(B − A)) < 1 if and only if
Remark
Adams(1985) studied k -step preconditioners.
For SSOR preconditioners she proved that the condition number is a nonincreasing function of k .
She reports however, that the actual decrease in the number of iterations is not enough to balance the extra work involved in the preconditioner, at least if the dimension of the problem is not too large.
Incomplete Cholesky Preconditioner
Cholesky Factorization
1: for i = 1 : n do 2: for j = 1 : i − 1 do
3: cij=aij−Pj−1k =1cikcjk/cjj
4: end for 5: cii=
q
aii−Pi−1k =1cik2
6: end for
If the Cholesky factor C of A is known, then the linear system of equations Ax = b can be solved directly:
Solve Cy = b, solve CTx = y where both systems are triangular.
Incomplete Cholesky ct.
Let
m(i) := min{j : 1 ≤ j ≤ i, aij 6= 0}, i = 1, . . . , n, i.e. ai,m(i)is the first nonzero entry in the i-th row of A.
Then, the set of index pairs
S(A) := {(i, j) : m(i) ≤ j ≤ i, 1 ≤ i ≤ n} is called theenvelopeof the matrix A.
From the algorithm above it follows immediately, that all elements cijsuch that (i, j) 6∈ S(A) vanish. Hence, if A is a band matrix, then C has the same band structure.
Incomplete Cholesky ct.
Matrices that arise from discretizations of boundary value problems of partial differential equations usually possess an envelope that is very large but only sparsely populated by nonzero elements. This sparsity pattern unfortunately is not inherited by the Cholesky factor, but most of the envelope is filled in during the elimination process.
To reduce the storage requirements one chooses a subset J ⊂ S(A) of the envelope of A, which always contains the diagonal index pairs
(1, 1), . . . , (n, n), and modifies the algorithm above in the following way: All entries cijsuch that (i, j) 6∈ J are set to zero.
Incomplete Cholesky Factorization
1: for i = 1 : n do 2: for j = 1 : i − 1 do 3: if (i, j) 6∈ J then 4: cij=0 5: else6: cij=aij−Pj−1k =1cikcjk/cjj;
7: end if 8: end for 9: cii= q aii−Pi−1j=1c2ij 10: end for
Incomplete Cholesky Factorization ct.
If J = S(A), then C is the Cholesky factor and A = CCT.
If J ⊂ S(A), J 6= S(A), then the factorization is calledincomplete Cholesky
decomposition,IC decompositionfor short.
A reduction of J usually decreases the storage requirements, the factorization time and the time needed to solve the linear systems CTy = r and Cz = y in the CG step. It also increases the defect A − CCT which reduces the rate of convergence of the preconditioned CG method.
These conflicting influences make it difficult to determine the optimum choice of J.
No fill principle
The most common choice, calledno-fill principle, is
J = J(A) := {(i, j) : 1 ≤ j ≤ i ≤ n, aij 6= 0}.
For diagonally sparse matrices, where the nonvanishing entries appear on parallels of the main diagonal (e.g. finite difference approximations of boundary value problems on rectangles are of this type)Meijerink & van der Vorst(1977) proposed to allow the incomplete factor C to have a few more nonzero diagonals than the matrix A itself (counting only the number of diagonals of A on one side of the main diagonal, of course).
IC decomposition
One disadvantage of the IC factorization is, that there exist symmetric and positive definite matrices A such that the IC decomposition does not exist. The algorithm may fail because the square root of a negative number is required to compute a diagonal element cii.
The following theorem, that we state without proof, contains a class of matrices for which the IC decomposition is guaranteed to exist.
Definition
A regular matrix A ∈ Rn×nis called anM-matrix, if A has nonpositive off-diagonal elements and if its inverse has nonnegative entries. For A ∈ Rn×nthe matrix
hAi = (haiji) where haiji =
−|aij| for i 6= j |aii| for I = j is calledOstrowski’s comparison matrix(Ostrowski1937). A ∈ Rn×nis calledH-matrix, if hAi is an M-matrix.
IC factorization
Theorem
Let A be a symmetric H-matrix. Then for any choice of the index set J, the IC factorization can be carried out.
In case that the conditions of the last Theorem are not satisfied the
incomplete Cholesky decomposition may not exist. There have been several suggestions made to remedy the nonexistence.
Suppose that the computation of ciirequires the square root of a nonpositive number. Then, one simple idea is to replace cii by an arbitrary positive number.
One strategy in use is the choice c2 ii =
Pi−1
IC factorization
A different approach which was advocated byManteuffel(1980): It is well known that strictly diagonally dominant matrices with positive diagonal and nonpositive off-diagonal entries are M-matrices. Hence, for α ≥ 0 sufficiently large, the matrix ˆA := A + αI is an H-matrix, and by the last Theorem its IC factorization exists.
Gustafsson(1978) proposed the so calledmodified incomplete Cholesky
decomposition,MIC decompositionfor short, where the diagonal elements
are increased whenever an off-diagonal element of C is neglected.
Munksgaard(1980) did not prescribe the set J of index pairs to be filled in the IC decomposition in advance, but he proposed to drop elements in the factorization if they are numerically small compared with the diagonal elements of their row and column. He combined this strategy with that of the MIC decomposition of Gustafsson.
Example
10−9 10−8 10−7 10−6 10−5 10−4 10−3 10−2 10−1 100 cg without preconditioning 1.51E8 flopsnofill cholinc, 7.17E7+1.44E5 flops cholinc(a,0.1), 7.11E7+6.40E6 flops
cholinc(a,0.01), 4.34E7+1.31E7 flops
cholinc(a,0.001), 2.96E7+3.58E7 flops