Summer school on Iterative Projection methods for sparse linear systems and eigenproblems, Jyväskylä, Finland 2006 Chapters: 4

(1)

ITERATIVE PROJECTION METHODS FOR

SPARSE LINEAR SYSTEMS AND

EIGENPROBLEMS

CHAPTER 4 : CONJUGATE GRADIENT METHOD

Heinrich Voss

voss@tu-harburg.de

Hamburg University of Technology Institute of Numerical Simulation

(2)

Introduction

The conjugate gradient method, CG method for short, was developed independently byHestenes(1951) andStiefel(1952). Theoretically, it is a direct method for symmetric and positive definite linear systems, i.e. it yields the exact solution in a finite and predetermined number of operations. Differently from elimination methods, however, at different stages of the CG method one obtains approximate solutions of the system.

For a system of dimension n, at most n steps of the CG method (using exact arithmetic) are needed, where at each step the approximate solution is refined progressively. Generally it is found that after a very small number of iterations a sufficiently good approximation is obtained and the process can be

(3)

Introduction ct.

The iterative nature of the CG method was already mentioned inHestenes & Stiefel(1952) but its importance for the numerical solution of large and sparse system was recognized only after the seminal paper ofReid(1971).

The history of the CG method and the Lanczos algorithm (which is closely related to the CG method) up to 1976 is presented in a review paper ofGolub & O’Leary(1989) which contains an annotated bibliography of 291 items.

(4)

A minimization problem

Suppose that the system matrix A of the linear system of equations Ax = b (1)

is symmetric and positive definite. Then the quadratic functional

φ(x ) :=1 2x

T_{Ax − x}T_b,

x ∈ Rn, is strictly convex.

Hence, there exists a unique solution x∗_{of the minimum problem} φ(x ) = min!, x ∈ Rn, (2)

(5)

1D minimizer

We now take advantage of the characterization of x∗_{as the minimizer of φ to} construct iterative methods for the numerical solution of (1).

Let xk −1_{, k ≥ 1, be an approximate solution of (2), and assume that the} residual

rk −1 :=b − Axk −1 6= 0

(for otherwise, xk −1_{already would be the solution of problem (2)).}

We improve xk −1_{by a one dimensional line search. To this end we choose a} search direction dk _{6= 0 and determine the minimum of φ on the line}

xk −1_{+ τ}_dk_{, τ ∈ R, i.e. we minimize the real equation} ψ(τ ) := φ(xk −1+ τdk) = 1 2(x k −1_{+ τ}_dk₎T_A(xk −1_{+ τ}_dk_{) − (x}k −1_{+ τ}_dk₎T_b = 1 2τ 2_(dk₎T_Adk _{− τ (d}k₎T_rk −1_{+ φ(x}k −1_). ₍₃₎

(6)

1D minimizer ct.

From (dk₎T_Adk _>_{0 we obtain that ψ has a unique minimum, which is} characterized by ψ0_{(τ ) = τ (d}k₎T_Adk _{− (d}k₎T_rk −1 ₌_{0, i.e.}

τk =

(dk₎T_rk −1

(dk₎T_Adk. (4)

As a new approximation to the minimizer x∗of (2) we therefore choose xk :=xk −1+ τkdk.

If rk _:=_{b − Ax}k _{6= 0, then we repeat the one dimensional line search with a} new search direction dk +1_.

(7)

Reduction of φ

From (3) and (4) we obtain the following reduction of the functional φ in the step above: φ(xk −1) − φ(xk) = φ(xk −1) − φ(xk −1+ τkdk) = φ(xk −1) − ψ(τk) = −1 2 (dk₎T_rk −1 (dk₎T_Adk 2 · (dk₎T_Adk₊(dk)Trk −1 (dk₎T_Adk · (d k₎T_rk −1 = 1 2 · ((dk₎T_rk −1₎2 (dk₎T_Adk .

Hence, a reduction of the size of φ occurs in the line search step, if and only if the residual rk −1_{and the search direction d}k _{are not orthogonal.}

(8)

Energy norm

Obviously, for the solution x∗of Ax = b

(x − x∗)TA(x − x∗) = xTAx − 2xTAx∗+ (x∗)TAx∗

= xTAx − 2xTb − (x∗)TAx∗+2(x∗)Tb = 2φ(x ) − φ(x∗) _{for every x ∈ R}n.

Hence for any subset D ⊂ Rn_{the vector ˜}_{x ∈ D is a minimizer of φ in D, if and} only if it minimizes the energy norm

kx∗− ˜x kA:= q

(x∗_{− ˜}_{x )}T_A(x∗_{− ˜}_{x )}

(9)

Example

The Gauß-Seidel method can be interpreted as a line search method if we choose the unit vectors cyclically as search directions, i.e.

djn+k :=uk = (δik)i=1,...,n, k = 1, . . . , n, j ∈ N ∪ {0}.

Suppose that we arrived at the approximate solution xjn+k −1_{. Then the search} direction is uk_{, and hence,}

τjn+k = (rjn+k −1)Tuk (uk₎T_Auk = r_k(nj+k −1) akk = 1 akk bk − n X i=1 akix_i(nj+k −1),

and the new iterate is given by

x_i(nj+k )= ( x_i(nj+k −1) if i 6= k x_k(nj+k −1)+_a1 kk bk −Pnν=1ak νxν(nj+k −1) if i = k

(10)

Example ct.

If additionally we relax the increment: x_k(nj+k ):=x_k(nj+k −1)+ ω 1 akk bk− n X ν=1 ak νxν(nj+k −1)

for some ω ∈ R, then we obtain the SOR method.

Similarly as in (4) we obtain for a relaxed line search the decrease of the functional

φ(xk −1) − φ(xk −1+ ωτkdk) = ω(1 − 0.5ω)

((dk₎T_rk −1₎2 (dk₎T_Adk .

Hence, the SOR method decreases the functional φ in every step if and only if ω ∈ (0, 2), and this is exactly the set of parameters for which convergence of

(11)

Steepest descent method

Given the approximation xk −1_{to the minimizer x}∗_{, the functional φ decreases} most rapidly in the direction of the negative gradient

−∇φ(xk −1_{) =}_{b − Ax}k −1 ₌_rk −1_.

Hence, locally the optimum choice of the search direction is the residual dk _:=_rk −1_{. The appertaining algorithm is the}_{steepest descent method}_.

1: r0₌_{b − Ax}0

2: for k = 1, 2, . . . until convergence do 3: sk =Ark −1 4: τk = krk −1k22(rk −1)Tsk 5: xk ₌_xk −1_{+ τ} krk −1 6: rk ₌_rk −1_{− τ} ksk 7: end for

Cost of steepest descent method: 1 matrix-vector product

2 scalar products 2 _axpys

(12)

Zig-zagging

For ill-conditioned problems the convergence rate of the steepest descent method can be very slow.

Example

The steepest descent method yields for the functional φ(x ) := xT 10−3 ₀ 0 103 x

and the initial vector x0_{:= (100, 10}−3₎T _{after 100 and 1000 iterations,} respectively, the approximations

x100 = (99.4912, 0.000994912)T, x1000= (95.0276, 0.000950276)T to the solution x∗=0, which is not a substantial improvement upon the initial guess.

(13)

Zig-zagging ct.

The reason for the bad performance is that in this example the level curves of φare very elongated ellipses.

The gradient directions that arise in the iteration mainly point into the x2-direction whereas the minimum is situated mainly in the x1-direction. Hence, the method is zig-zagging through the flat and steep-sided valley {x ∈ R2 _{: φ(x ) ≤ φ(x}0_)}.

For ill-conditioned problems this is the typical behavior of the steepest

(14)

Example

10−4 10−3 10−2 10−1 100

Steepest descent for −∆ u = 0; stepsize = 1/128

(15)

Theorem 4.1

Let A be a symmetric and positive definite matrix and denote by λ1≤ λ2≤ · · · ≤ λnthe eigenvalues of A.

Then for every initial vector x0_{the sequence {x}k_{}, which is constructed by the} steepest descent method, converges to the solution x∗of Ax = b, and the following a priori error estimate with respect to the energy norm holds:

kxk − x∗kA≤

λ_n− λ₁ λn+ λ1

k

(16)

Proof

For x ∈ Rnand r := b − Ax one step of the steepest descent method yields the iterate

y := x + kr k 2 2

rT_Ar(b − Ax ) =: (I − θ(x )A)x + θ(x )b, from which we obtain for the error vectors

y − x∗= (I − θ(x )A)(x − x∗).

Let

z := (I − σA)x + σb Then it follows

(17)

Proof ct.

We prove that for σ = ˜σ :=2/(λn+ λ1)the inequality kz − x∗_k A≤ λn− λ1 λn+ λ1 kx − x∗_k A holds. Then the error bound is obtained by induction. For e := x − x∗and ˜e := z − x∗we get

˜

e = (I − ˜σA)e =: Me.

(18)

Proof ct.

Hence, the spectral norm of M is given by kMk2=max 1 − 2λ1 λ_n+ λ₁ , 1 − 2λn λ_n+ λ₁ = λn− λ1 λ_n+ λ₁ =: η, from which we obtain

ky − x∗kA ≤ k˜ekA= kA1/2˜ek2= kA1/2(I − ˜σA)ek2 = k(I − ˜σA)A1/2ek2≤ ηkA1/2_ek2

(19)

Remark

If λ1and λnare the extreme eigenvalues of A, and if κ2(A) :=

λn λ1

denotes the condition number of A with respect to the Euclidean norm then the error reduction factor of the steepest descent method with respect to the energy norm can be written as

η := λn− λ1 λn+ λ1

= κ2(A) − 1 κ2(A) + 1 .

Hence, the smaller the condition number of A is, the faster the steepest descent method converges.

In particular, for κ2(A) = 1 one obtains η = 0.

Actually in this case A = I, and the gradient method yields the minimizer in the

(20)

Example

We consider the discrete version of the model problem with stepsize h = 1/m. The eigenvalues are

λi,j=2 2 − cos iπ m + 1 − cos jπ m + 1 , i, j = 1, . . . , m.

Hence, the minimum eigenvalue is λ1,1=4(1 − cos_m+1π )and the maximum eigenvalue is λm,m =4(1 + cos_m+1π ), and we obtain the error reduction factor

η = λm,m− λ1,1 λm,m+ λ1,1

=cos π m + 1.

(21)

Ideal line search

To avoid the zig-zagging of the steepest descent method we ask for more suitable search directions. An ideal line search method would have the following property:

If the approximations x1_,_x2_{, . . .} _{are determined by exact line} searches along the search directions d1_,_d2_{, . . .}_{, then the minimizer} xk _{of φ on the line x}k −1_{+ τ}_dk_{, τ ∈ R, even minimizes the functional} φon the affine space x0₊_span{d1_{, . . . ,}_dk_{}, which is spanned by all} preceding directions dj_{, j = 1, . . . , k .}

If we are able to construct linear independent directions with this property, then the corresponding line search method is not only convergent but it is finite, because after at most n steps it finds the minimizer of φ on

x0₊_span{d1_{, . . . ,}_dn_{} = R}n_{. The line search method is said to have the}_finite

(22)

Ideal line search ct.

Let x ∈ x0₊_span{d1_{, . . . ,}_dk_{}. Then x has the representation} x = ˜x + τ dk, _{τ ∈ R, ˜}x = x0+

k −1 X

j=1 τjdj.

Similarly as for the 1D-line search φ(x ) = φ(˜x ) + τ k −1 X j=1 τj(dj)TAdk+ τ (dk)Tr0+ 1 2τ 2_(dk₎T_Adk_. If (dj₎T_Adk ₌_{0 for j = 1, . . . , k − 1, then} φ(x ) = φ(˜x ) + τ (dk)Tr0+1 2τ 2_(dk₎T_Adk_,

and the problem of minimizing φ over x0₊_span{d1_{, . . . ,}_dk_{} decouples into a} minimization over x0₊_span{d1_{, . . . ,}_dk −1_{} and the minimization of the real} equation

(23)

Ideal line search ct.

The minimizer τk of ψ is τk = (dk)Tr0/(dk)TAdk, and from

(dk)Trk −1 = dkTb − A(x0+ k −1 X j=1 τ_jdj) = (dk)Tr0− k −1 X j=1 τj(dk)TAdj = (dk)Tr0

one obtains again

τk =

(dk₎T_rk −1 (dk₎T_Adk.

Hence, if the search directions dj satisfy (di)TAdj =0 for i 6= j, then the minimization of φ over the affine space x0+span{d1, . . . ,dk} can be replaced by the consecutive line searches along d1_{, . . . ,}_dk_.

(24)

Conjugate directions

Definition: Let A ∈ Rn×n _{be symmetric and positive definite. The set of vectors}

d1_{, . . . ,}_dk _{is called}_A-conjugate_{, if d}j _{6= 0 for j = 1, . . . , k and} (di)TAdj =0 for i, j = 1, . . . , k , i 6= j.

With this notion we can formulate our result as follows: Theorem 4.2

Let x0_{∈ R}n_{be given and assume that x}1_,_x2_{, . . .} _{have been determined by a} line search method, where the search directions d1_,_d2_{, . . .} _{are mutually} A-conjugate. Then xk _{minimizes the functional φ over the affine space} x0₊_span{d1_{, . . . ,}_dk_}.

(25)

Remark

From the minimality of φ(xk₎_{over x}0₊_span{d1_{, . . . ,}_dk_{} it follows that} (dj)Trk =0, j = 1, . . . , k ,

for otherwise, if (di₎T_rk _{6= 0 for some i ∈ {1, . . . , k }, then there exists α ∈ R}

(26)

Method of conjugate directions

1: r0₌_{b − Ax}0_;

2: for k = 1, 2, . . . until rk =0do 3: choose dk such that

(dk)TAdj =0, j = 1, . . . , k − 1, and (dk)Trk −16= 0 4: τk = (dk)Trk −1(dk)TAdk 5: xk ₌_xk −1_{+ τ} kdk 6: rk ₌_{b − Ax}k 7: end for

(27)

Remark

Two vectors x , y ∈ Rn_{\ {0} are A-conjugate, if and only if they are orthogonal} with respect to the scalar product

hx, y iA:=yTAx .

Hence, if d1_{, . . . ,}_dk _{are A-conjugate, then they are linearly independent.} Moreover, given any set of linearly independent vectors v1_{, . . . ,}_vn_{∈ R}n_{, one} can construct a set of n A-conjugate vectors by Gram-Schmidt

orthogonalization of v1_{, . . . ,}_vn_{with respect to the inner product h·, ·iA. We will} see that there is a less costly method to obtain an A-conjugate basis of Rn_.

(28)

Conjugate Gradient Method

In order to execute the conjugate direction method we have to choose in the k -th step a vector dk _{such that (d}k₎T_Adj ₌_{0, j = 1, . . . , k − 1, and}

(dk₎T_rk −1_{6= 0.}

The first condition is satisfied by the A-orthogonal complement of any vector v 6∈ span{d1_{, . . . ,}_dk −1_{} with respect to span{d}1_{, . . . ,}_dk −1_}.

To fulfill also the second requirement, we choose v := rk −1_{, i.e.}

dk =rk −1− k −1 X j=1 (dj₎T_Ark −1 (dj₎T_Adj d j_.

(29)

CG method

For dk we actually obtain

(dk)Trk −1 = (rk −1)Trk −1− k −1 X j=1 (dj)TArk −1 (dj₎T_Adj (d j₎T_rk −1 | {z } =0 = (rk −1)Trk −1 >0,

and hence, dk _{is an admissible search direction.} The choice dk =rk −1− k −1 X j=1 (dj)TArk −1 (dj₎T_Adj d j_.

seems to have the disadvantages that it is very costly to perform and that all previous search directions dj _{have to be stored.}

Actually the sum reduces to the last term. This follows from the following Lemma.

(30)

Lemma 4.3

Let x0∈ Rn_{be given and assume that the approximate solutions x}j_, j = 1, . . . , k , are determined by the conjugate direction method, where the search directions are given by .

dk =rk −1− k −1 X j=1 (dj₎T_Ark −1 (dj₎T_Adj d j_. ₍₁₎

Then the vectors rj _{and d}j _satisfy

(rk)Trj =0, j = 0, . . . , k − 1, (2) and

(31)

Proof

From (1) it follows that

rj =dj+1+ j X i=1 (di₎T_Arj (di₎T_Adid i_,

and therefore, (dj₎T_rk ₌_0, _{j = 1, . . . , k yields}

(rk)Trj = (rk)Tdj+1 | {z } =0 + j X i=1 (di₎T_Arj (di₎T_Adi (r k₎T_di | {z } =0 =0 for j = 1, . . . , k − 1, i.e. (2).

(32)

Proof ct.

To prove (3) (i.e. (dj₎T_Ark ₌_{0) we rewrite d}j+1_{, j = 1, . . . , k − 1, in the} following way: dj+1 = rj− j X i=1 (di₎T_Arj (di₎T_Adi d i = b − Axj− j X i=1 (di₎T_Arj (di₎T_Adi d i = b − A(xj−1+ τjdj) − j X i=1 (di₎T_Arj (di₎T_Adi d i = b − Axj−1− τjAdj− j X i=1 (di₎T_Arj (di₎T_Adi d i_.

(33)

Proof ct.

From τj 6= 0 (otherwise, xj−1would have been the minimizer of φ and the algorithm would have been terminated) we obtain

Adj = 1 τj rj−1− dj+1− j X i=1 (di₎T_Arj (di₎T_Adi d i ! ,

and therefore, for j = 1, . . . , k − 1

(rk)TAdj = 1 τj (rk)Trj−1− (rk)Tdj+1− j X i=1 (di₎T_Arj (di₎T_Adi (r k₎T_di ! .

Finally from (rk₎T_rj ₌_{0 for j = 0, . . . , k − 1 and from (d}j₎T_rk ₌_{0, for} j = 1, . . . , k , we get that all terms on the right hand side vanish. This

(34)

From (1) and the last Lemma it follows that the current search direction dk is given by dk =rk −1− (d k −1₎T_Ark −1 (dk −1₎T_Adk −1 d k −1_=:_rk −1_{+ β} kdk −1.

Hence, we do not have to store the hole history of search directions to perform the A-orthogonalization of rk −1, but only the most recent direction is needed. It will turn out that the execution of the conjugate direction method requires the storage of four vectors.

(35)

Improvement of CG method

We can even improve the efficiency of the algorithm by some simple conversions.

Firstly, the current residual can be updated according to

rk =b − Axk =b − A(xk −1+ τ_kdk) =rk −1− τkAdk.

This manipulation saves one matrix-vector product Axk _{in every iteration} because the matrix-vector product Adk _{is needed in the calculation of τk}_, anyhow.

(36)

From rk ₌_rk −1_{− τ}

kAdk one gets for k ≥ 2 Adk −1= 1

τ_{k −1}(r

k −2_{− r}k −1_),

and the orthogonality of the residuals yields (dk −1)TArk −1= 1

τk −1

(rk −2− rk −1₎T_rk −1_{= −} 1 τk −1

(rk −1)Trk −1,

and from dk =rk −1+ βkdk −1and rk =rk −1− τkAdk we obtain (dk −1)TAdk −1 = (dk −1)TA(dk −1− βk −2dk −2) = (dk −1)TArk −2= 1 τ_{k −1}(r k −2_{− r}k −1₎T_rk −2 = 1 τk −1 (rk −2)Trk −2.

(37)

Thus, the current search direction reads dk =rk −1+(r

k −1₎T_rk −1 (rk −2₎T_rk −2d

k −1_.

Finally, we can replace (dk₎T_rk −1_{in τk} _by

(dk)Trk −1= (rk −1+ βkdk −1)Trk −1= (rk −1)Trk −1, which is needed in the determination of βk, anyway.

Putting these simplifications together we finally arrive at the following method of conjugate gradients, which is essentially the method of Hestenes and Stiefel.

(38)

Conjugate Gradient Algorithm

1: r0_:=_{b − Ax}0_; 2: α0:= (r0)Tr0. 3: d1=r0; 4: for k = 1, 2, . . . until dk ==0do 5: sk =Adk; 6: τk = αk −1(dk)Tsk; 7: xk ₌_xk −1_{+ τ} kdk; 8: rk ₌_rk −1_{− τ} ksk; 9: βk =1/αk −1; 10: αk = (rk)Trk; 11: βk = βk · αk; 12: dk +1₌_rk_{+ β} kdk; 13: end for

(39)

Cost of CG method

1 matrix-vector product 2 scalar products 3 _axpys

Storage requirements: 4 vectors

The termination criterion in the CG algorithm is unrealistic because rounding errors destroy the conjugacy among the search directions and finite

termination usually will not appear.

The termination criterion should be based on a combination of a sufficient decrease of the residual and a maximum number of iterations.

(40)

Example

10−5 10−4 10−3 10−2 10−1 100

Steepest descent and CG for −∆ u = 0; stepsize = 1/128

||x st. descent||A

(41)

Error bound

Because the conjugate gradient method is a direct method, and hence, in exact arithmetic the solution is obtained in a finite number of steps, terms like ‘convergence’ or ‘asymptotic reduction rate’ do not make sense. In this section we shall derive an error estimate.

We already realized that for the solution x∗of Ax = b it holds kx − x∗k2_A=2φ(x ) − φ(x∗) for every x ∈ Rn.

Hence, xk _{minimizes the error e := x}∗_{− x with respect to the energy norm} k · kAover the affine space x0+span{d1, . . . ,dk}.

The representation of span{d1_{, . . . ,}_dk_{} which is contained in the following} lemma is the key to the error bound for the conjugate gradient method.

(42)

Lemma 4.4

Assume that the search directions d1, . . . ,dk are constructed by the conjugate gradient method. Then

span{d1, . . . ,dk} = span{r0,Ar0, . . . ,Ak −1r0}.

Definition

Let B ∈ Rn×n_{and v ∈ R}n_{be given. Then the linear space} Kk(v , B) := span{v , Bv , . . . , Bk −1v } is called the k -thKrylov spaceof B corresponding to v .

(43)

Proof

For k = 1 the statement is trivial since d1₌_r0_.

Assume that it already has been shown for some k < n. Then (with the notation d0₌_{0) we obtain} dk +1 = rk+ β_kdk =rk −1− τ_kAdk+ β_kdk = dk− βk −1dk −1− τkAdk + βkdk, (∗) and from dk_,_dk −1 _{∈ K} k(r0,A) it follows that span{d1_{, . . . ,}_dk +1_{} ⊂ K} k +1(r0,A).

(44)

Error bound ct.

As before we denote by e0_:=_x∗_{− x}0_{the error of x}0_{.Then from} r0=b − Ax0=A(x∗− x0_{) =}_Ae0

one obtains Aj_r0₌_Aj+1_e0_{, and thus the elements of the Krylov space} z ∈ K(r0_,_{A) can be written as} z = k X j=1 α_jAje0, α_j _{∈ R.}

Hence, it follows from (*) that the error ek _:=_x∗_{− x}k _{of the k -th iterate x}k satisfies kekk2_A = min x ∈x0_+K_k_(r0_,A)kx ∗_{− xk}2 A = min z∈Kk(r0,A) kx∗_{− x}0₊_zk2 A

(45)

Error bound ct.

Let

˜

Π_k := {p : p is a polynomial of degree ≤ k , p(0) = 1}

be the set of polynomials p of maximum degree k such that p(0) = 1. Then (**) gets the form

kek_k2 A= min

p∈ ˜Πk

kp(A)e0_k2 A.

Hence, the error of the k -th iterate can be represented as the product of a polynomial p(A) in A and the initial error e0_{. Therefore, the CG method can be} considered as a polynomial iteration, where in contrast to the Chebyshev iteration the underlying polynomial is constructed by the CG method itself. More important, the polynomial does not depend on parameters (estimates of the bounds of the spectrum of A) which have to be prepared by the user in the Chebyshev method.

(46)

Error bound ct.

To obtain a more convenient form of the error representation we consider the spectral decomposition A = n X j=1 λjzj(zj)T

of A, where zj_{, j = 1, . . . , n, is an orthonormal set of eigenvectors of A and} 0 < λ1≤ λ2≤ · · · ≤ λnare the corresponding eigenvalues.

Then A2= n X i,j=1 λiλjzi(zi)Tzj | {z } =δij (zj)T = n X j=1 λ2_jzj(zj)T, and by induction Ai = n X λi_jzj(zj)T.

(47)

Error bound ct.

Therefore, from e0=: n X j=1 γjzj one gets kek_k2 A= min p∈ ˜Πk k n X j=1 γjp(λj)zjk2A.

The definition of the energy norm yields n X j=1 γjp(λj)zj 2 A = n X i=1 γip(λi)zi !T A   n X j=1 γjp(λj)zj   = n X j=1 γ2_jλ_jp(λj) 2 .

(48)

Error bound ct.

Hence, we obtain the very important representation of the error of xk _{in the} energy norm kek_k2 A=min p∈ ˜Πk n X j=1 γ_j2λj p(λj) 2 (1) which will be discussed in detail at the end of this section.

(49)

Coarse upper bound

First we deduce from (1) a coarse upper bound of the convergence properties of the CG method. From ke0k2_A= n X j=1 λjγ2j

we obtain by estimating the values |p(λj)|by the maximum of |p| on the spectrum of A kekkA≤ min p∈ ˜Πk max λ∈spec(A) |p(λ)| · ke 0_k A, and even coarser

kekkA≤ min p∈ ˜Πk

max λ1≤λ≤λn

|p(λ)| · ke0kA, and the special choice

p(λ) = Ck 2λ − λn− λ1 λn− λ1 Ck −λ_n− λ1 λn− λ1

(50)

Theorem 4.5

Let A ∈ Rn×n_{be symmetric and positive definite, and denote by λn}_{and λ} 1the largest and smallest eigenvalue of A, respectively.

Let x0_{be any initial vector, and denote by x}k _{the k -th approximate to the} solution x∗ of Ax = b, which is obtained by the CG method.

Then the following a priori error estimate with respect to the energy norm holds: kxk− x∗kA≤ 2Rk 1 + R2kkx 0_{− x}∗_k A, R := √ λn− √ λ₁ √ λn+ √ λ1 . (2)

Replacing the denominator in (2) by 1 we receive the error estimate

kxk − x∗kA≤ 2 p

κ2(A) − 1 !k

(51)

Example

To demonstrate the dependence of the convergence of the CG method on the condition number, we consider the linear system of equations Ax = 0, where A is a diagonal matrix of dimension 500, such that the entries are pseudo random numbers in the interval [1, a].

The next figure contains the reduction of the Euclidean norms of the errors for the cases a = 10, a = 100 and a = 10000 in a semilogarithmic scale.

(52)

Example ct.

10−9 10−8 10−7 10−6 10−5 10−4 10−3 10−2 10−1 100

Speed of convergence of CG method

a=10

a=100 a=10000

(53)

Example

To demonstrate the quality of last Theorem the following figure contains the real reduction factors kekkA/ke0kAas well as those ones, which are predicted by (2) for the discrete version of the model problem with stepsize h = 1/128. In this problem the condition number is

κ2(A) =

1 + cos₁₂₈π 1 − cos π

128

(54)

Example ct.

10−14 10−12 10−10 10−8 10−6 10−4 10−2 100 error bound error reduction

(55)

Improved error bound

The last Theorem provides only very coarse upper bounds of the speed of convergence of the CG method.

In particular, the finite termination property is not reflected by (2). This can be obtained from the finer estimate (1).

Because the matrix A ∈ Rn×nhas at most n distinct eigenvalues, we obtain from (1) for every polynomial q ∈ ˜Πk, such that qk(λj) =0, j = 1, . . . , n, that

kekk2_A= min p∈ ˜Πk n X j=1 γ_j2λj p(λj) 2 ≤ n X j=1 γ_j2λj qk(λj) 2 =0. (4) Moreover, we get

(56)

Theorem 4.6

Let A ∈ Rn×n_{be symmetric and positive definite, and let x}k _{be the} approximates to the solution x∗_{of Ax = b, which are obtained by the CG} method.

If A has m ≤ n distinct eigenvalues, then in exact arithmetic the CG algorithm stops with xk ₌_x∗_{after at most m steps.}

Proof

If A has the distinct eigenvalues λ1< · · · < λm, then

qm(λ) := m Y j=1 (λ − λj) in (4) yields kem_k2 A=0.

(57)

Clustered eigenvalues

If the matrix A does not have a very small number m n of distinct eigenvalues but they are contained in m disjoint intervals Ij of very small length, then the CG method will not stop with the exact solution after m steps, but it will produce a very small residual after m steps.

To see this we choose a polynomial q of degree m such that q(0) = 1, which possesses a root in each of the intervals Ij. Then by assumption all

eigenvalues are close to one of the roots of q, and thus max{|q(λj)| : λj is an eigenvalue of A},

(58)

(59)

Clustered eigenvalues ct.

To demonstrate the effect of clustered eigenvalues on the convergence properties of the CG method, we consider a homogeneous linear system Ax = 0, where A is a diagonal matrix of dimension n = 500 with condition number κ2(A) = 10000.

The following figure contains the reduction of the energy norms of the errors for the cases that all eigenvalues are uniformly distributed in 2, 5 and 10 intervals of length 1, respectively, and for the case that the eigenvalues are uniformly distributed in the interval [1, 10000] in a semilogarithmic scale.

(60)

Clustered eigenvalues ct.

10−10 10−8 10−6 10−4 10−2 100 uniformly distributed 10 intervals 5 intervals 2 intervals

(61)

Outlying eigenvalues

Knowing only the largest and the smallest eigenvalues λnand λ1of a positive definite matrix A, the bound in Theorem 4.5 is the best possible. If the interior eigenvalues of A lie at the points where the shifted Chebyshev polynomial Ck attains its maximum absolute value in [λ1, λn], then for a certain initial error e0, the CG polynomial will be equal to the Chebyshev polynomial, and the bound in Theorem 4.5 will be actually be attained at step k .

If additional information is available about interior eigenvalues of A, one can often improve the simpler estimate of Th. 4.5 while maintaining a simple expression.

Suppose, for example that A has one eigenvalue λnwhich is much larger than the others. Then we replace the polynomial in the minmax characterization of the error by the product of a linear factor which is zero at λn, and the shifted and scaled Chebyshev polynomial on the interval [λ1, λn−1]of degree k − 1:

(62)

Outlying eigenvalues ct.

pk(λ) = Ck −1 2λ − λn−1− λ1 λ_n−1− λ1 Ck −1 −λn−1− λ1 λ_n−1− λ1 λn− λ λ_n .

Since the second factor is zero at λnand less than one in absolute value on each of the other eigenvalues, the maximum absolute value of this polynomial on {λ1, . . . , λn} is less than the maximum absolute value of the first factor on {λ1, . . . , λn−1}.

Using arguments like those in Chapter 3 it follows that kxk _{− x}∗_k A≤ 2 √κ_n−1− 1 √ κn−1+1 k −1 kx0_{− x}∗_k A, κn−1= λn−1 λ1

(63)

Outlying eigenvalues ct.

Similarly, if A has a few outlying eigenvalues:

λ1≤ λ2≤ · · · ≤ λn−` λn−`+1≤ · · · ≤ λn, λn−`/λn−`+1 1, one can consider a polynomial pk which is the product of a polynomial of degree ` that is zero at each of the outlyers and less than one in magnitude at each of the other eigenvalues, and a shifted and scaled Chebyshev

polynomial of degree k − ` on the interval [λ1, λn−`].

pk(λ) = Ck −` 2λ − λn−`− λ1 λ_n−`− λ1 Ck −1 −λn−`− λ1 λ_n−`− λ1 n Y j=n−`+1 λj− λ λ_j .

With this polynomial one obtains from the minmax characterization of the error the upper bound

kxk_{− x}∗_k A≤ 2 √κn−`− 1 √ κ_n−`+1 k −` kx0_{− x}∗_k A, κn−` = λn−` λ₁

(64)

Local effects in convergence behavior

The following figure shows the convergence behavior of the CG method for A = diag([0.001; 0.1+rand (98, 1); 1.2]); b = ones(100, 0); x 0 = zeros(100, 1).

(65)

Local effects in convergence behavior

In the beginning the convergence is quite slow in agreement with the condition number κ(A) = 1200.

After iteration 10 the convergence gets noticeably faster than in the first phase.

To explain this effect we need a different variant of the CG method, namely its connection to the Lanczos process (Lanczos1950).

(66)

Lanczos method

The Lanczos algorithm determines an orthonormal basis {q1_{, . . . ,}_qk_{} of the} Krylov space

Kk(r0,A) := span{r0,Ar0, . . . ,Ak −1r0}, k = 1, . . . , n, such that

Tk :=Q_kTAQk, Qk := (q1, . . . ,qk) is tridiagonal.

(67)

Lanczos method ct.

Assume that we already computed the orthonormal basis q1_{, . . . ,}_qk _{of the} Krylov space Kk(v , A).

Then Ak −1_{v ∈ span{q}1_{, . . . ,}_qk_{}, and therefore, there exist γ}

1, . . . , γk ∈ R such that Akv = A(Ak −1v ) = A k X j=1 γkqj = γkAqk+A k −1X j=1 γjqj .

The second term on the right hand side is contained in Kk(v , A). Hence, to obtain an orthonormal basis of Kk +1(v , A) it suffices to compute the orthonormal complement of the uk :=Aqk with respect to the vectors q1, . . . ,qk.

(68)

Lanczos method ct.

Since Aqj ∈ Kj+1(v , A) ⊂ Kk −1(v , A) for every j < k − 1, we have (qj)Tuk = (qj)TAqk = (Aqj)Tqk =0.

Hence, (uk₎T_qj ₌_{0 for j = 1, . . . , k − 2, and therefore} Aqk = γkqk +1+ αkqk+ βk −1qk −1.

(69)

Lanczos method ct.

The coefficients are obtained from

β_{k −1} = (qk −1)TAqk = (Aqk −1)Tqk = (γk −1qk + αk −1qk −1+ βk −2qk −2)Tqk = γk −1, i.e. Aqk = βkqk +1+ αkqk + βk −1qk −1. (1) Thus, αk = (qk)TAqk, and the condition kqk +1_k

2=1 yields

βk =1/kAqk− αkqk− βk −1qk −1k2. (2)

If the denominator in (2) vanishes, then Aqk _{∈ K}

k(v , A), and therefore, Kk(v , A) is a k -dimensional invariant subspace of A.

(70)

Lanczos method ct.

1: q0₌_{0; k = 1} 2: β0= kr0k 3: while rk −1_{6= 0 do} 4: qk ₌_rk −1_/β k −1 5: rk ₌_Aqk 6: rk ₌_rk_{− β}_{k −1q}k −1 7: αk = (qk)Trk 8: rk ₌_rk_{− α} kqk 9: βk = krkk 10: end while

Then with Tk =tridiag{βj−1, αj, βj}

AQk =QkTk +rk(ek)T, rk =Aqk − αkqk − βk −1qk −1.

(71)

Lanczos method ct.

Consider the CG method for the linear system

Ax = b, A symmetric and positive definite. Let x0₌_{0 (else consider Ay = b − Ax}0_{=: ˜}_b)

The approximation xk _{after k steps minimizes} φ(x ) :=1

2x

T_{Ax − b}T_x _{in K}

(72)

Lanczos method ct.

Restricting φ to Kk(b, A) yields φk(y ) := 1 2y T_QT kAQky − yTQTkb, y ∈ Rk, and the CG iterate xk _satisfies

Tkyk =QkTb, xk =Qkyk.

Disadvantage(at a first glance): All qj_{, j = 1, . . . , k , are needed to determine} xk_{. We will come back to this point in the next section.}

(73)

Local convergence behavior

The following analysis of the local convergence behavior of the CG method is contained in a paper ofvan der Sluis & van der Vorst(1986)

Multiplying

AQk =QkTk + βkqk +1(ek)T by e1from the right yields

Aq1=QkTke1.

Premultiplying this with A from the left gives

A2q1=AQkTke1= (QkTk+ βkqk +1(ek)T))Tke1=QkTk2e1for k>2.

And similarly one obtains

Ajq1=QkT_kje1, for j = 3, . . . , k − 1, Akq1=QkTkke1+cqk +1 for some constant c

(74)

Local convergence behavior ct.

Since xk _{∈ K}

k(b, A) it follows that there exists some polynomial pk(t) =Pkj=0γjtj of degree k such that

rk =pk(A)b, and from rk _{⊥ K} k(b, A) and q1=b it follows 0 = Q_kTpk(A)q1=QTk k X j=0 γ_jAjq1 = k X j=0 γjQkTQkT j ke1+ γkcQTkqk +1 = pk(Tk)e1

(75)

Local convergence behavior ct.

Since Tk is symmetric there exists an orthonormal basis of Rk of eigenvectors yj, j = 1, . . . , k of Tk. Denote by θj the eigenvalue corresponding to yj, and assume that the eigenvalues are ordered by magnitude θ1≤ θ2≤ · · · ≤ θk. Let e1₌Pk

j=1τjyj. Then τj 6= 0 for j = 1, . . . , k :

Assume that τj =0, i.e. (yj)Te1=0. Then it follows from Tkyj = θjyj 0 = (Tkyj)Te1= (yj)T(Tke1) = (yj)T(α1e1+ β1e2) ⇒ (yj)Te2=0, and by induction we obtain the contradiction (yj)Tei =0, i = 1, . . . , k .

(76)

Local convergence behavior ct.

Since τj 6= 0 for every j we obtain from

0 = pk(Tk)e1= k X j=1 τjp(Tk)yj = k X j=1 τjp(θj)yj

that the eigenvalues of θj are the roots of the polynomial pk, and from pk(0) = 1 we therefore obtain pk(t) = k Y j=1 (θj − t) , _k Y j=1 θj (1)

Exploiting this characterization of the residual polynomial pk we can further analyze the convergence behavior of the CG method.

(77)

Local convergence behavior ct.

Let zj, j = 1, . . . , n be an orthonormal set of eigenvectors of A, and let λj be the eigenvalue corresponding to zj_{, where the eigenvalues again are ordered} by magnitude.

Write the initial error as

x − x0= n X j=1 µjzj. From

A(x − xk) =b − Axk =rk =pk(A)r0=pk(A)A(x − x0) =Apk(A)(x − x0) it follows x − xk =pk(A)(x − x0) = n X j=1 µjpk(A)zj = n X j=1 µjpk(λj)zj.

(78)

Local convergence behavior ct.

We now replace the first component of xk _{(with respect to the basis z}j_{) by the} first component of x and take the modified xk _{as starting vector ˜}_x0_{of another} CG process (with the same A and b):

x − ˜x0= n X

j=2

µjpk(λj)zj.

The new CG process generates a sequence ˜xm_{for which the error is} characterized by a polynomial ˜pm.

(79)

Local convergence behavior ct.

Let qk(t) = (λ1− t)(θ2− t) . . . (θk − t) λ1θ2. . . θk =θ1(λ1− t) λ1(θ1− t) pk(t). From ˜pm(0)qk(0) = 1 it follows kx − xk +m_k2 A≤ k˜pm(A)qk(A)(x − x0)k2A= n X j=2 λjpm˜ (λj)2qk(λj)2µ2j. Defining Fk := θ1 λ1 max j=2,...,n λj − λ1 λj − θ1 it follows that |qk(λj)| ≤Fk|pk(λj)|

(80)

Local convergence behavior ct.

Hence, the upper bound can be further simplified as

kx − xk +mk2_A ≤ F_k2 n X j=2 λjp˜m(λj)2pk(λj)2µ2j = F_k2kx − ˜xmk2_A≤ F_k2kx − ˜x m_k2 A kx − ˜x0_k2 A kx − xkk2A.

If the smallest eigenvalue θ1is close to λ1, then Fk ≈ 1, and we may expect a reduction of the error in the next steps that is bounded by the reduction that we obtain for a CG process for Ax = b in which λ1is missing.

With κ2(A) := λn/λ2we have kx − xm+k_k A≤ 2 p κ2(A) − 1 pκ₂(A) + 1 !m kx − xk_k A.

(81)

Preconditioning

The estimate kxk _{− x}∗_k A≤ 2 p κ2(A) − 1 pκ2(A) + 1 !k kx0_{− x}∗_k A (1)

suggests that the convergence of the CG method depends heavily on the condition number of the system matrix A.

We therefore transform problem Ax = b into a system ˜

A˜x = ˜b, (2) where

˜

A := C−1AC−T, b := C˜ −1b, x := C−Tx ,˜ (3)

and C ∈ Rn×n_{is a nonsingular matrix, such that linear systems of equations} Cy = d and CTy = d (d ∈ Rngiven)

can be solved easily and that the condition number of ˜A is as small as possible.

(82)

CG for ˜

A˜

x = ˜

b

1: ˜r = ˜b − ˜A˜x 2: d = ˜˜ r 3: α = (˜r )T_˜_r 4: while√α > εdo 5: s = ˜˜ A˜d 6: τ = α.(˜d )T_˜_s 7: x = ˜˜ x + τ ˜d 8: ˜r = ˜r − τ ˜s 9: β =1/α 10: α = (˜r )T_˜_r 11: β = β · α 12: d = ˜˜ r + β ˜d 13: end while

(83)

CG for ˜

A˜

x = ˜

b original variables; ˜

d := C

T

d ; ˜

s = C

−1

s

1: C−1r = C−1b − C−1AC−Tx = C˜ −1(b − Ax ) 2: CT_{d = C}−1_r 3: α =rT_C−T_C−1_r 4: while√α > εdo 5: C−1s = C−1AC−TCT_{d = C}−1_Ad 6: τ = α.(dT_C)(C−1_s) _{= α}_dT_s 7: CT_{x = C}T_{x + τ C}T_d 8: C−1r = C−1r − τ C−1s 9: β =1/α 10: α =rTC−TC−1r 11: β = β · α 12: CTd = C−1r + βCTd 13: end while

(84)

Preconditioned CG. M = CC

T 1: r = b − Ax 2: Solve Md = r for d 3: α =rT_d 4: while√α > εdo 5: s = Ad 6: τ = αdT_s 7: x = x + τ d 8: r = r − τ s 9: β =1/α 10: Solve Mz = r for z 11: α =rTz 12: β = β · α 13: d = z + βd 14: end while

(85)

PCG

Thepreconditioned conjugate gradient algorithm,PCG algorithmfor short,

requires as the original CG method one matrix-vector product and five level-1-operations, and additionally the solution of one system of linear equations Mz = r .

Notice that the matrix C, which we used in the derivation of the PCG method, does not appear in the algorithm, but only thepreconditionerM.

M = CCT _{obviously is symmetric and, by the regularity of C, positive definite,} and every symmetric and positive definite matrix principally is a suitable preconditioner (we can choose C := M1/2_{in the derivation of the PCG} method).

(86)

Preconditioner

In practical problems, M must have the following properties: Linear systems Mz = r can be solved easily

M is a good approximation of A.

The choice of the preconditioner can effect the speed of the convergence dramatically. Preconditioning is a field of active research.

The most important preconditioner is obtained by incomplete Cholesky factorization.

(87)

Preconditioning by Splitting Methods

To obtain a fast preconditioned CG method we have to find a symmetric and positive definite matrix M such that firstly, M approximates A as good as possible, and secondly, the linear system of equations Mz = r can be solved easily.

When we considered splitting methods, we studied nearly the same question. All splitting matrices constructed in Chapter 2, which inherit the symmetry and positive definiteness from the system matrix A are suitable preconditioners.

(88)

Preconditioning by Splitting ct.

If A := D − E − ET is the standard partition, where D denotes the diagonal part of A and −E the strictly lower triangular part, then the following preconditioners are at hand:

M = D Jacobi

M := (D − E )D−1(D − ET₎ _{symmetric GS} M :=_ω(2−ω)1 (D − ωE )D−1(D − ωET), ω ∈ (0, 2), SSOR.

Of course, these matrices are symmetric and positive definite, if the system matrix A has these properties.

The solution of Mz = r for M := B can be obtained by one step of the corresponding splitting iteration

(89)

(90)

Preconditioning by Splitting ct.

It is obvious that more than one step of the splitting method can be performed to get an improved preconditioning.

If in an inner iteration k steps of (1) are in use, then the preconditioner is given by M = k −1 X j=0 h B−1(B − A)ijB−1. (2)

Since A and B are symmetric, each term in (2) is symmetric. Thus M is symmetric, too. Concerning the positive definiteness of M the following theorem was proved byAdams(1985).

(91)

Theorem 4.7

Let A and B be a symmetric matrices, and suppose that A is positive definite and B is nonsingular. Then the matrix M given in (2) is symmetric.

(i) For k odd, M is positive definite if and only if B is positive definite.

(ii) For k even, M is positive definite if and only if 2B − A is positive definite. We next discuss the condition of Theorem 4.7 which seems to be quite strange.

(92)

Theorem 4.8

Suppose that A ∈ Rn×n_{and B ∈ R}n×n_{are symmetric and positive definite.} Then the matrix 2B − A is positive definite if and only if ρ(B−1(B − A)) < 1. Proof: Obviously, λ is an eigenvalue of B−1(B − A) if and only if it is an eigenvalue of the generalized eigenproblem

(93)

Proof ct.

Since B is positive definite all eigenvalues of B−1(B − A) are real. From Rayleigh’s principle we obtain that ρ(B−1(B − A)) < 1 if and only if

−1 < x

T_{(B − A)x}

xT_Bx <1 for every x 6= 0, which is equivalent to

−xTBx < xT(B − A)x < xTBx for every x 6= 0, i.e.

xT(2B − A)x > 0 and xTAx > 0 for every x 6= 0.

Since A is supposed to be positive definite ρ(B−1(B − A)) < 1 if and only if

(94)

Remark

Adams(1985) studied k -step preconditioners.

For SSOR preconditioners she proved that the condition number is a nonincreasing function of k .

She reports however, that the actual decrease in the number of iterations is not enough to balance the extra work involved in the preconditioner, at least if the dimension of the problem is not too large.

(95)

Incomplete Cholesky Preconditioner

Cholesky Factorization

1: for i = 1 : n do 2: for j = 1 : i − 1 do

3: cij=aij−Pj−1_{k =1}cikcjk/cjj

4: end for 5: cii=

q

aii−Pi−1k =1cik2

6: end for

If the Cholesky factor C of A is known, then the linear system of equations Ax = b can be solved directly:

Solve Cy = b, solve CTx = y where both systems are triangular.

(96)

Incomplete Cholesky ct.

Let

m(i) := min{j : 1 ≤ j ≤ i, aij 6= 0}, i = 1, . . . , n, i.e. ai,m(i)is the first nonzero entry in the i-th row of A.

Then, the set of index pairs

S(A) := {(i, j) : m(i) ≤ j ≤ i, 1 ≤ i ≤ n} is called theenvelopeof the matrix A.

From the algorithm above it follows immediately, that all elements cijsuch that (i, j) 6∈ S(A) vanish. Hence, if A is a band matrix, then C has the same band structure.

(97)

Incomplete Cholesky ct.

Matrices that arise from discretizations of boundary value problems of partial differential equations usually possess an envelope that is very large but only sparsely populated by nonzero elements. This sparsity pattern unfortunately is not inherited by the Cholesky factor, but most of the envelope is filled in during the elimination process.

To reduce the storage requirements one chooses a subset J ⊂ S(A) of the envelope of A, which always contains the diagonal index pairs

(1, 1), . . . , (n, n), and modifies the algorithm above in the following way: All entries cijsuch that (i, j) 6∈ J are set to zero.

(98)

Incomplete Cholesky Factorization

1: for i = 1 : n do 2: for j = 1 : i − 1 do 3: if (i, j) 6∈ J then 4: cij=0 5: else

6: cij=aij−Pj−1_{k =1}cikcjk/cjj;

7: end if 8: end for 9: cii= q aii−Pi−1j=1c2ij 10: end for

(99)

Incomplete Cholesky Factorization ct.

If J = S(A), then C is the Cholesky factor and A = CCT_.

If J ⊂ S(A), J 6= S(A), then the factorization is calledincomplete Cholesky

decomposition,IC decompositionfor short.

A reduction of J usually decreases the storage requirements, the factorization time and the time needed to solve the linear systems CTy = r and Cz = y in the CG step. It also increases the defect A − CCT which reduces the rate of convergence of the preconditioned CG method.

These conflicting influences make it difficult to determine the optimum choice of J.

(100)

No fill principle

The most common choice, calledno-fill principle, is

J = J(A) := {(i, j) : 1 ≤ j ≤ i ≤ n, aij 6= 0}.

For diagonally sparse matrices, where the nonvanishing entries appear on parallels of the main diagonal (e.g. finite difference approximations of boundary value problems on rectangles are of this type)Meijerink & van der Vorst(1977) proposed to allow the incomplete factor C to have a few more nonzero diagonals than the matrix A itself (counting only the number of diagonals of A on one side of the main diagonal, of course).

(101)

IC decomposition

One disadvantage of the IC factorization is, that there exist symmetric and positive definite matrices A such that the IC decomposition does not exist. The algorithm may fail because the square root of a negative number is required to compute a diagonal element cii.

The following theorem, that we state without proof, contains a class of matrices for which the IC decomposition is guaranteed to exist.

Definition

A regular matrix A ∈ Rn×nis called anM-matrix, if A has nonpositive off-diagonal elements and if its inverse has nonnegative entries. For A ∈ Rn×n_{the matrix}

hAi = (haiji) where haiji =

−|a_ij| for i 6= j |aii| for I = j is calledOstrowski’s comparison matrix(Ostrowski1937). A ∈ Rn×nis calledH-matrix, if hAi is an M-matrix.

(102)

IC factorization

Theorem

Let A be a symmetric H-matrix. Then for any choice of the index set J, the IC factorization can be carried out.

In case that the conditions of the last Theorem are not satisfied the

incomplete Cholesky decomposition may not exist. There have been several suggestions made to remedy the nonexistence.

Suppose that the computation of ciirequires the square root of a nonpositive number. Then, one simple idea is to replace cii by an arbitrary positive number.

One strategy in use is the choice c2 ii =

Pi−1

(103)

IC factorization

A different approach which was advocated byManteuffel(1980): It is well known that strictly diagonally dominant matrices with positive diagonal and nonpositive off-diagonal entries are M-matrices. Hence, for α ≥ 0 sufficiently large, the matrix ˆA := A + αI is an H-matrix, and by the last Theorem its IC factorization exists.

Gustafsson(1978) proposed the so calledmodified incomplete Cholesky

decomposition,MIC decompositionfor short, where the diagonal elements

are increased whenever an off-diagonal element of C is neglected.

Munksgaard(1980) did not prescribe the set J of index pairs to be filled in the IC decomposition in advance, but he proposed to drop elements in the factorization if they are numerically small compared with the diagonal elements of their row and column. He combined this strategy with that of the MIC decomposition of Gustafsson.

(104)

Example

10−9 10−8 10−7 10−6 10−5 10−4 10−3 10−2 10−1 100 cg without preconditioning 1.51E8 flops

nofill cholinc, 7.17E7+1.44E5 flops cholinc(a,0.1), 7.11E7+6.40E6 flops

cholinc(a,0.01), 4.34E7+1.31E7 flops

cholinc(a,0.001), 2.96E7+3.58E7 flops