Eigenvector Perturbations for General Kernel Functions

$$\|\Lambda\Psi^\top v_j\| \le m_j\|\Psi^+\|. \tag{4.39}$$

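Now, since the 2-norm trivially bounds each individual component of a vector, and $[\Lambda\Psi^\top v_j]_\ell = \lambda_\ell\psi_\ell(X)^\top v_j/\sqrt{n}$ (because $[\Lambda\Psi^\top]_{\ell i} = \lambda_\ell[\Psi]_{i\ell} = \lambda_\ell\psi_\ell(X_i)/\sqrt{n}$),

$$\frac{1}{\sqrt{n}}\,|\lambda_\ell\psi_\ell(X)^\top v_j| \le m_j\|\Psi^+\|. \tag{4.40}$$

The kernel matrix $K$ has rank at most $r$. Therefore, if $r < n$ and the columns of $V$ are sorted such that the eigenvalues are in non-increasing order, $m_{r+1} = \dots = m_n = 0$, and $v_{r+1}, \dots, v_n$ lie in the null space of $K$ and are orthogonal to the image of $K$, which lies inside the span of $v_1, \dots, v_r$. On the other hand, the image of $K$ is also spanned by the columns of $\Psi$. Therefore, $\Psi^\top v_j = 0$ for $r+1 \le j \le n$.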

We conclude this section by relating the size of the pseudo-inverse to $\|\Psi^\top\Psi - I\|$, which was called the relative error term $\|C\|$ in Chapter 3.

Recall that the norm of the pseudo-inverse $\Psi^+$ is the inverse of the smallest singular value of $\Psi$:

$$\|\Psi^+\| = 1/\sigma_n(\Psi). \tag{4.41}$$

The singular values are the square roots of the eigenvalues of $\Psi^\top\Psi$. First of all, note that $\|\Psi^\top\Psi - I\| = \max_i|\lambda_i(\Psi^\top\Psi) - 1|$, and therefore $\|\Psi^\top\Psi - I\| \to 0$ implies that $\lambda_i(\Psi^\top\Psi) \to 1$ for all $1 \le i \le n$.

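Furthermore,

$$1 - \lambda_n(\Psi^\top\Psi) \le \max_{1\le i\le n}|\lambda_i(\Psi^\top\Psi) - 1| \le \|\Psi^\top\Psi - I\| \;\Rightarrow\; \lambda_n(\Psi^\top\Psi) \ge 1 - \|\Psi^\top\Psi - I\|. \tag{4.42}$$

Therefore, $\sigma_n(\Psi) = \sqrt{\lambda_n(\Psi^\top\Psi)} \ge \sqrt{1 - \|\Psi^\top\Psi - I\|}$, and it follows that

$$\|\Psi^+\| = \frac{1}{\sigma_n(\Psi)} \le \frac{1}{\sqrt{1 - \|\Psi^\top\Psi - I\|}}. \tag{4.43}$$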

We have proven the following lemma:

Lemma 4.44 Under the conditions of the previous theorem, it holds that

$$\|\Psi^+\| \le \frac{1}{\sqrt{1 - \|\Psi^\top\Psi - I\|}}. \tag{4.45}$$

Thus, since $\|\Psi^\top\Psi - I\| \to 0$ almost surely, it follows that $\|\Psi^+\| \to 1$.
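The bound of Lemma 4.44 can be illustrated numerically. The following is a minimal sketch, not part of the original derivation. It assumes a hypothetical pair of orthonormal functions $\psi_1(x) = \sqrt{2}\cos(2\pi x)$ and $\psi_2(x) = \sqrt{2}\sin(2\pi x)$ with $X_i$ uniform on $[0, 1]$, so that $\Psi$ has only two columns and the eigenvalues of $\Psi^\top\Psi$ are available in closed form. The function names are illustrative only.

```python
import math
import random

def gram_2col(n, seed=0):
    """Entries (a, b, c) of G = Psi^T Psi = [[a, b], [b, c]] for an n x 2 matrix Psi."""
    rng = random.Random(seed)
    a = b = c = 0.0
    for _ in range(n):
        x = rng.random()  # X_i ~ Uniform[0, 1] (assumption of this sketch)
        # [Psi]_{i,l} = psi_l(X_i) / sqrt(n), with the two assumed functions
        p1 = math.sqrt(2.0) * math.cos(2.0 * math.pi * x) / math.sqrt(n)
        p2 = math.sqrt(2.0) * math.sin(2.0 * math.pi * x) / math.sqrt(n)
        a += p1 * p1
        b += p1 * p2
        c += p2 * p2
    return a, b, c

def pinv_norm_and_bound(n, seed=0):
    """Return (||Psi^+||, ||Psi^T Psi - I||, right-hand side of (4.45))."""
    a, b, c = gram_2col(n, seed)
    mean, rad = 0.5 * (a + c), math.hypot(0.5 * (a - c), b)
    lam_max, lam_min = mean + rad, mean - rad               # eigenvalues of G
    rel_err = max(abs(lam_max - 1.0), abs(lam_min - 1.0))   # ||G - I||
    pinv_norm = 1.0 / math.sqrt(lam_min)                    # 1 / smallest singular value
    bound = 1.0 / math.sqrt(1.0 - rel_err)                  # requires rel_err < 1
    return pinv_norm, rel_err, bound

for n in (100, 1000, 10000):
    print(n, pinv_norm_and_bound(n))
```

As $n$ grows, the relative error term shrinks and both the pseudo-inverse norm and its bound approach 1, in line with the statement above.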

4.8 Eigenvector Perturbations for General Kernel Functions

The next step consists in relating the scalar products between sample vectors of eigenfunctions and the eigenvectors of the degenerate kernels $K^{[r]}$. In Lemma 4.28, we have seen that this is accomplished by multiplying these scalar products with the scalar products between the eigenvectors of $K$ and $K^{[r]}$:

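$$\frac{1}{\sqrt{n}}\,|\lambda_\ell\psi_\ell(X)^\top u_i| \le \sum_{j=1}^{r} |u_i^\top v_j|\,\frac{1}{\sqrt{n}}\,|\lambda_\ell\psi_\ell(X)^\top v_j|. \tag{4.46}$$

In Theorem 4.34, we have proved that

$$\frac{1}{\sqrt{n}}\,|\lambda_\ell\psi_\ell(X)^\top v_j| \le m_j\|\Psi^+\|. \tag{4.47}$$

Therefore,

$$\sum_{j=1}^{r} |u_i^\top v_j|\,\frac{1}{\sqrt{n}}\,|\lambda_\ell\psi_\ell(X)^\top v_j| \le \|\Psi^+\| \sum_{j=1}^{r} |u_i^\top v_j|\,m_j. \tag{4.48}$$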

The last sum is the expression we will study in this section.

Recall that $u_1, \dots, u_n$ are the eigenvectors of $K$ and $v_1, \dots, v_r$ are those of $K^{[r]}$. We interpret $K$ as being an additive perturbation of $K^{[r]}$, $K = K^{[r]} + E^r$. The vector

$$s = \left(u_i^\top v_1, \dots, u_i^\top v_r\right) \tag{4.49}$$

contains the coefficients of $u_i$ with respect to the eigenbasis of $K^{[r]}$. Therefore, these scalar products measure the perturbation of $v_i$ to $u_i$ induced by the additive perturbation $E^r$. If $\|E^r\| = 0$, then $u_i = v_i$, and since the $v_j$ are orthogonal, $[s]_i = 1$ while all other entries are zero. For non-zero perturbations, $u_i$ will be perturbed away from $v_i$, which spreads the coefficients away from this configuration. We wish to study the amount of this perturbation and its effect on the sum $\sum_{j=1}^{r} |u_i^\top v_j|\,m_j$. The first question is addressed by a family of general results on the perturbation of eigenvectors, known as sin-theta theorems.

The following lemma is a special case of (Davis and Kahan, 1970, Theorem 6.2) (see also (Eisenstat and Ipsen, 1994; Stewart and Sun, 1990)).

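Lemma 4.50 Let $A$ be a symmetric $n \times n$ matrix with spectral decomposition $ULU^\top$. Let $U$ and $L$ be partitioned as follows:

$$U = [U_1 \; U_2], \qquad L = \begin{pmatrix} L_1 & 0 \\ 0 & L_2 \end{pmatrix}, \tag{4.51}$$

where $U_1 \in M_{n,k}$, $L_1 \in M_k$, $U_2 \in M_{n,n-k}$, and $L_2 \in M_{n-k}$. Furthermore, let $E$ be another symmetric matrix and $\tilde{A} = A + E$. Let $\tilde{l}$ be an eigenvalue of $\tilde{A}$ and $\tilde{x}$ an associated unit-length eigenvector. Then,

$$\|U_2^\top\tilde{x}\| \le \frac{\|E\|}{\min_{k+1\le i\le n}|\tilde{l} - l_i|}. \tag{4.52}$$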

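Proof It holds that

$$(A + E)\tilde{x} = \tilde{l}\tilde{x} \;\Rightarrow\; E\tilde{x} = (\tilde{l}I - A)\tilde{x}. \tag{4.53}$$

Therefore,

$$\|E\| \ge \|E\tilde{x}\| = \|(\tilde{l}I - A)\tilde{x}\| = \|(\tilde{l}I - ULU^\top)\tilde{x}\|. \tag{4.54}$$

This norm becomes smaller when we only consider the last $n-k$ components of the resulting vector with respect to the basis $U$, that is, when we apply the orthogonal projection $U_2U_2^\top$. Since $U_2^\top U = [0 \; I]$, this part is computed by $U_2U_2^\top(\tilde{l}I - ULU^\top)\tilde{x} = U_2(\tilde{l}I - L_2)U_2^\top\tilde{x}$. Therefore, we continue (4.54):

$$\|E\tilde{x}\| \ge \|U_2U_2^\top(\tilde{l}I - ULU^\top)\tilde{x}\| = \|U_2(\tilde{l}I - L_2)U_2^\top\tilde{x}\|. \tag{4.55}$$

Finally, because $U_2^\top U_2 = I$,

$$\|U_2(\tilde{l}I - L_2)U_2^\top\tilde{x}\| = \|(\tilde{l}I - L_2)U_2^\top\tilde{x}\| \ge \min_{k+1\le i\le n}|\tilde{l} - l_i|\,\|U_2^\top\tilde{x}\|. \tag{4.56}$$

Dividing by $\min_i|\tilde{l} - l_i|$ concludes the proof of the lemma.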

This lemma has a simple corollary for the case where one considers scalar products between individual eigenvectors of $A$ and $\tilde{A}$:

Corollary 4.57 Denote the eigenvalues of $K$ by $l_i$ and those of $K^{[r]}$ by $m_j$. Let the corresponding eigenvectors be $u_i$ and $v_j$, respectively. Then,

$$|u_i^\top v_j| \le \frac{\|E^r\|}{|l_i - m_j|} \wedge 1 =: \omega_{ij}, \tag{4.58}$$

where $a \wedge b = \min(a, b)$. The numbers $\omega_{ij}$ will be called perturbation coefficients.

Proof The corollary follows from the previous lemma by setting $A = K^{[r]}$, $\tilde{A} = K$, $E = E^r$, and setting $U_2$ equal to the $n \times 1$ matrix $v_j$. The scalar product cannot become larger than 1 because $|u_i^\top v_j| \le \|u_i\|\|v_j\| = 1$, as $u_i$ and $v_j$ are unit-length vectors.
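As a small sanity check of Corollary 4.57, the bound can be verified on a $2 \times 2$ example where all eigenpairs are available in closed form. This is only a sketch under stated assumptions: the matrix entries are hypothetical values chosen for illustration, $K^{[r]}$ is taken to be diagonal so that $v_1, v_2$ are the standard basis vectors, and the helper `eig_sym_2x2` is not from the original text.

```python
import math

def eig_sym_2x2(a, b, c):
    """Eigenpairs of the symmetric matrix [[a, b], [b, c]], largest eigenvalue first.

    Assumes the matrix is not a multiple of the identity.
    """
    mean, rad = 0.5 * (a + c), math.hypot(0.5 * (a - c), b)
    pairs = []
    for lam in (mean + rad, mean - rad):
        # Both (b, lam - a) and (lam - c, b) solve (M - lam I) x = 0;
        # use the longer candidate to limit cancellation errors.
        candidates = [(b, lam - a), (lam - c, b)]
        x, y = max(candidates, key=lambda v: math.hypot(v[0], v[1]))
        norm = math.hypot(x, y)
        pairs.append((lam, (x / norm, y / norm)))
    return pairs

# Hypothetical example: K^[r] = diag(m_1, m_2), hence v_1 = e_1 and v_2 = e_2.
m = (1.0, 0.3)
e11, e12, e22 = 0.02, 0.05, -0.01            # entries of the symmetric perturbation E^r

norm_E = max(abs(lam) for lam, _ in eig_sym_2x2(e11, e12, e22))   # spectral norm of E^r
eig_K = eig_sym_2x2(m[0] + e11, e12, m[1] + e22)                  # pairs (l_i, u_i) of K

def omega(l_i, m_j):
    """Perturbation coefficient min(||E^r|| / |l_i - m_j|, 1) from (4.58)."""
    gap = abs(l_i - m_j)
    return 1.0 if gap <= norm_E else norm_E / gap

overlaps = [[abs(u[j]) for j in range(2)] for _, u in eig_K]      # |u_i^T v_j|
omegas = [[omega(l, m[j]) for j in range(2)] for l, _ in eig_K]   # omega_ij

for i in range(2):
    for j in range(2):
        print(i + 1, j + 1, overlaps[i][j], omegas[i][j])
```

In this example the diagonal coefficients are capped at 1, while the off-diagonal scalar products stay below $\|E^r\|/|l_i - m_j|$.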


[Figure 4.2, panel (a): The original and perturbed eigenvalues $m_j$ and $l_i$ for $\|E\| = 10^{-9}$. Horizontal axis: $i, j$ (0 to 100); vertical axis: eigenvalue (logarithmic, $10^{-12}$ to $10^{0}$).]

[Figure 4.2, panel (b): The resulting perturbation coefficients $\omega_{ij}$ for $i = 5, 10, 20, 30$. Vertical axis: $\|E\|/|l_i - m_j|$ (logarithmic, $10^{-10}$ to $10^{0}$).]

Figure 4.2: Example plots for perturbation coefficients $\omega_{ij}$. If $\|E^r\|$ is small and the eigenvalues decay quickly, the large eigenvalues are well-separated such that the perturbation of the eigenvectors is negligibly small.

This is a classical result which is usually paraphrased as the perturbation being small if the eigenvalues are well-separated. In our case, we assume that the eigenvalues decay to zero, such that the eigenvalues become clustered around 0 and seem anything but well-separated. However, note that the separation is measured at the scale of $\|E^r\|$. In Chapter 3, we have seen that $\|E^r\| \to 0$ as $r$ increases, such that $\|E^r\|$ will typically be rather small, and eigenvalues which are close together can nevertheless be well-separated. Now, for $|l_i - m_j| > \|E^r\|$ we can rewrite

$$\omega_{ij} = \frac{1}{|l_i - m_j| / \|E^r\|}. \tag{4.59}$$

Typically, $j \mapsto \omega_{ij}$ will have the following shape for fixed $i$ (see Figure 4.2). In Figure 4.2(b), each line describes the characteristics of the perturbation of a single eigenvector. The perturbation coefficient $\omega_{ij}$ will be 1 for eigenvalues $m_j$ which are closer than $\|E^r\|$ to $l_i$. For larger eigenvalues, $\omega_{ij}$ drops off fairly quickly, as it does for smaller eigenvalues, although there it eventually reaches a plateau and does not decay further. Roughly stated, if $l_i$ is still much larger than $\|E^r\|$ and $l_i$ is isolated, then $\omega_{ij}$ will have a single peak of 1 at $\omega_{ii}$ and be negligibly small for $\omega_{i1}, \dots, \omega_{i,i-1}, \omega_{i,i+1}, \dots, \omega_{in}$. For small eigenvalues $l_i$, the perturbation can be rather severe, although we see that the perturbation will occur mostly in the direction of eigenspaces belonging to comparably small eigenvalues.
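The qualitative shape of $j \mapsto \omega_{ij}$ described above can be reproduced with a few lines of code. The sketch below rests on assumptions that are not taken from the text: a hypothetical geometrically decaying spectrum $m_j = 10^{-0.12 j}$ for $j = 1, \dots, 100$, the value $\|E^r\| = 10^{-9}$, and the idealisation $l_i = m_i$.

```python
norm_E = 1e-9                                          # assumed size of the perturbation
r = 100
m = [10.0 ** (-0.12 * j) for j in range(1, r + 1)]     # m[j - 1] holds m_j (hypothetical)

def omega(l_i, m_j, norm_E):
    """Perturbation coefficient min(norm_E / |l_i - m_j|, 1)."""
    gap = abs(l_i - m_j)
    return 1.0 if gap <= norm_E else norm_E / gap

def profile(i):
    """List of omega_ij for j = 1..r, using the idealisation l_i = m_i."""
    l_i = m[i - 1]
    return [omega(l_i, m_j, norm_E) for m_j in m]

for i in (5, 20, 60):
    w = profile(i)
    # neighbour above, peak, neighbour below, and the plateau value at j = r
    print(i, w[i - 2], w[i - 1], w[i], w[-1])
```

For a large eigenvalue such as $l_5$ the profile is a single peak of 1 with negligible neighbours, and for $m_j \to 0$ it levels off near $\|E^r\|/l_i$. For a small eigenvalue such as $l_{60}$ the neighbouring coefficients are several orders of magnitude larger.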

Now, we return to the sum

$$\sum_{j=1}^{r} \omega_{ij}m_j. \tag{4.60}$$

We will show that outside of a relatively small set around $l_i$, the terms $\omega_{ij}m_j$ will be of the order of $\|E^r\|$.

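Lemma 4.61 Consider

$$\omega_{ij}m_j = \left(\frac{\|E^r\|}{|l_i - m_j|} \wedge 1\right) m_j. \tag{4.62}$$

Then,

$$m_j \ge 2l_i \;\Rightarrow\; \omega_{ij}m_j \le 2\|E^r\|, \tag{4.63}$$

$$m_j \le \tfrac{1}{2}l_i \;\Rightarrow\; \omega_{ij}m_j \le \|E^r\|. \tag{4.64}$$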

Proof For this proof, we will drop the superscript $r$ on $E^r$ for convenience. First, note that for $m_j = 2l_i$,

$$\frac{\|E\|m_j}{|l_i - m_j|} = \frac{2\|E\|l_i}{2l_i - l_i} = 2\|E\|. \tag{4.65}$$

Furthermore, it holds that for $m_j > l_i$, $m_j \mapsto \|E\|m_j/(m_j - l_i)$ decreases monotonically as $m_j$ increases.

For the second inequality, observe that for $m_j = \tfrac{1}{2}l_i$,

$$\frac{\|E\|m_j}{|l_i - m_j|} = \frac{\tfrac{1}{2}\|E\|l_i}{l_i - \tfrac{1}{2}l_i} = \|E\|, \tag{4.66}$$

and if $m_j < l_i$, $m_j \mapsto \|E\|m_j/(l_i - m_j)$ decreases monotonically as $m_j$ decreases.
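The two implications of Lemma 4.61 are easy to check on a grid of values. The following sketch uses hypothetical numbers for $\|E\|$ and $l_i$ (including one case with $l_i < \|E\|$, where the cap at 1 becomes active). It is an illustration, not a proof.

```python
def omega_times_m(l_i, m_j, norm_E):
    """The product omega_ij * m_j with omega_ij = min(norm_E / |l_i - m_j|, 1)."""
    gap = abs(l_i - m_j)
    w = 1.0 if gap <= norm_E else norm_E / gap
    return w * m_j

norm_E = 1e-3                                # hypothetical perturbation size
results = []
for l_i in (0.5, 1e-2, 5e-4):                # hypothetical eigenvalues l_i
    # grid with m_j >= 2 l_i, starting exactly at m_j = 2 l_i
    upper = [omega_times_m(l_i, 2.0 * l_i * (1.0 + 0.1 * k), norm_E) for k in range(50)]
    # grid with 0 < m_j <= l_i / 2, starting exactly at m_j = l_i / 2
    lower = [omega_times_m(l_i, 0.5 * l_i * (1.0 - 0.02 * k), norm_E) for k in range(50)]
    results.append((l_i, max(upper), max(lower)))

for l_i, max_upper, max_lower in results:
    print(l_i, max_upper, "<=", 2.0 * norm_E, "|", max_lower, "<=", norm_E)
```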

Based on the last lemma, we can bound the sum (4.60) as follows:

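Lemma 4.67 Define the set

$$J(l_i) = \left\{ j \in \{1, \dots, r\} \;\middle|\; \tfrac{1}{2}l_i \le m_j \le 2l_i \right\}. \tag{4.68}$$

Then, with $C(l_i) = |J(l_i)|$,

$$\sum_{j=1}^{r} \omega_{ij}m_j \le 2l_iC(l_i) + 2r\|E^r\|. \tag{4.69}$$

Proof It holds that

$$\sum_{j=1}^{r} \omega_{ij}m_j = \sum_{j \in J(l_i)} \omega_{ij}m_j + \sum_{j \notin J(l_i)} \omega_{ij}m_j. \tag{4.70}$$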

For $j \in J(l_i)$, $\omega_{ij}m_j \le m_j \le 2l_i$, and for $j \notin J(l_i)$, $\omega_{ij}m_j \le 2\|E^r\|$ by the previous lemma.

Therefore,

$$\sum_{j=1}^{r} \omega_{ij}m_j \le \sum_{j \in J(l_i)} 2l_i + \sum_{j \notin J(l_i)} 2\|E^r\| \le 2l_iC(l_i) + 2(r - C(l_i))\|E^r\|. \tag{4.71}$$

Since $C(l_i)$ will typically be rather small, we can simplify the bound by omitting the second occurrence of the $C(l_i)$ term. This completes the proof of the lemma.

Note that of the two terms in (4.69), only the first term $2l_iC(l_i)$ does not scale with $\|E^r\|$.

This term relates to the number of eigenvalues which cluster around $l_i$. Therefore, we see that the perturbation is basically confined to the cluster around $l_i$.
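To see how the two terms of (4.69) compare in practice, the sum and its bound can be evaluated for a synthetic spectrum. The sketch below assumes a hypothetical geometric decay $m_j = 0.7^j$ with $r = 50$, a perturbation of size $\|E^r\| = 10^{-6}$, and the idealisation $l_i = m_i$. None of these values come from the text.

```python
r = 50
norm_E = 1e-6                                  # assumed ||E^r||
m = [0.7 ** j for j in range(1, r + 1)]        # hypothetical eigenvalues m_1..m_r

def omega(l_i, m_j, norm_E):
    """Perturbation coefficient min(norm_E / |l_i - m_j|, 1)."""
    gap = abs(l_i - m_j)
    return 1.0 if gap <= norm_E else norm_E / gap

def sum_and_bound(l_i):
    """Return (sum_j omega_ij m_j, C(l_i), right-hand side of (4.69))."""
    cluster = [m_j for m_j in m if 0.5 * l_i <= m_j <= 2.0 * l_i]   # the set J(l_i)
    total = sum(omega(l_i, m_j, norm_E) * m_j for m_j in m)
    bound = 2.0 * l_i * len(cluster) + 2.0 * r * norm_E
    return total, len(cluster), bound

for i in (1, 5, 20, 40):
    l_i = m[i - 1]                             # idealisation: l_i = m_i
    print(i, sum_and_bound(l_i))
```

With this spectrum the cluster $J(l_i)$ contains at most three indices, so the first term of the bound is a small multiple of $l_i$ while the second term is of the order of $r\|E^r\|$.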

In the document Spectral Properties of the Kernel Matrix and their Relation to Kernel Methods in Machine Learning (pages 69-72)