bound,

P{ max_{1≤ℓ,m≤r} |c_ℓm| ≥ ε/r } ≤ Σ_{ℓ≥m} P{ |c_ℓm| ≥ ε/r } ≤ r(r+1) exp( −nε² / (2M⁴r²) )    (3.91)

by Equation (3.89). Equating the right hand side with δ and solving (3.91) for ε results in the claimed inequality.

From this theorem, we see that for each fixed r, the speed of the convergence ‖C_r^n‖ → 0 depends only on the size r of C_r^n and on M. A relative bound for a larger number of eigenvalues will necessarily be less tight, but this effect is due solely to the increased number of eigenfunctions which are considered.

In particular, the eigenvalues themselves do not appear in the bound.
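As a numerical illustration (not part of the original derivation), the following sketch builds a rank-r kernel with a known spectrum from uniformly bounded Fourier eigenfunctions on [0, 1] (so that sup-norm bound M = √2), samples the error matrix C_r^n, and checks that the relative eigenvalue errors are controlled by ‖C_r^n‖. The eigenvalue decay and sample size are illustrative choices.

```python
import numpy as np

# Illustrative setup: Fourier eigenfunctions on [0, 1] are uniformly
# bounded by M = sqrt(2); eigenvalues decay as lambda_i = 2^-i.
rng = np.random.default_rng(0)
n, r = 2000, 5
lam = 0.5 ** np.arange(1, r + 1)

def psi(x):
    # Evaluate r orthonormal eigenfunctions at the sample points x.
    fns = [np.ones_like(x)]
    for j in range(1, r // 2 + 1):
        fns.append(np.sqrt(2) * np.cos(2 * np.pi * j * x))
        fns.append(np.sqrt(2) * np.sin(2 * np.pi * j * x))
    return np.stack(fns[:r])

x = rng.uniform(0.0, 1.0, n)        # n-sample from mu (uniform on [0, 1])
P = psi(x)                          # r x n matrix of psi_l(X_i)
C = P @ P.T / n - np.eye(r)         # the error matrix C_r^n

# Nonzero eigenvalues of the normalized kernel matrix K_n agree with
# those of the r x r matrix diag(lam)^(1/2) (P P^T / n) diag(lam)^(1/2).
s = np.sqrt(lam)
l = np.sort(np.linalg.eigvalsh(s[:, None] * (P @ P.T / n) * s[None, :]))[::-1]

norm_C = np.linalg.norm(C, 2)                # spectral norm of C_r^n
rel_err = np.max(np.abs(l - lam) / lam)      # relative eigenvalue errors
print(norm_C, rel_err)
```

For this exact rank-r kernel the absolute error term vanishes, so the relative errors are bounded by ‖C_r^n‖ alone, as the printed values confirm.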

Next we turn to the absolute error term, which is governed by the truncation function e_r. Since the eigenfunctions are uniformly bounded,

|e_r(x, y)| = | Σ_{i=r+1}^∞ λ_i ψ_i(x)ψ_i(y) | ≤ M² Σ_{i=r+1}^∞ λ_i.    (3.92)

Therefore, if E_r^n = (e_ij) ∈ M_n with e_ij = e_r(X_i, X_j)/n,

‖E_r^n‖ ≤ n max_{1≤i,j≤n} |e_ij| ≤ M² Σ_{i=r+1}^∞ λ_i.    (3.93)

Using (3.93) and Theorem 3.85, we obtain the following relative-absolute bound.

Theorem 3.94 (Relative-Absolute Bound, Bounded Eigenfunctions)

Let k be a Mercer kernel on H_µ(X) with eigenvalues (λ_i)_{i∈N} and eigenfunctions (ψ_i)_{i∈N}. Let sup_i ‖ψ_i‖_∞ = M < ∞. Let µ be a probability measure on X, and K_n be the normalized kernel matrix based on an n-sample from µ. Then, for 1 ≤ r ≤ n and 0 < δ < 1, with probability larger than 1 − δ,

|λ_i(K_n) − λ_i| ≤ λ_i C(r, n) + E(r, n)    (3.95)

with

C(r, n) < M² r √( (2/n) log( r(r+1)/δ ) ),

E(r, n) < λ_r + M² Σ_{i=r+1}^∞ λ_i.    (3.96)

In words, the eigenvalues of K_n converge to their limits with a relative-absolute bound whose relative error term is independent of the eigenvalues and increases almost linearly in r. The absolute error term is given by the sum of the remaining eigenvalues, which will be small if the eigenvalues decay quickly. We can thus say that the eigenvalues converge essentially on a relative scale with respect to the true eigenvalues of the kernel function.
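The tradeoff between the two error terms can be tabulated directly: the relative term grows almost linearly in r, while the absolute term decays with the eigenvalue tail. A small sketch with illustrative values (M, n, δ and the geometric spectrum are assumptions, not values from the text):

```python
import numpy as np

# Evaluate the two terms of the relative-absolute bound (Theorem 3.94)
# for lambda_i = 2^-i; all constants here are illustrative.
M, n, delta = 1.0, 10_000, 0.05
lam = 0.5 ** np.arange(1, 101)

def C_rel(r):
    # relative term: M^2 r sqrt((2/n) log(r(r+1)/delta))
    return M**2 * r * np.sqrt(2.0 / n * np.log(r * (r + 1) / delta))

def E_abs(r):
    # absolute term: lambda_r + M^2 * (tail sum of eigenvalues)
    return lam[r - 1] + M**2 * lam[r:].sum()

for r in (2, 5, 10, 20):
    print(r, C_rel(r), E_abs(r))
```

Increasing r shrinks the absolute term at the price of a larger relative term, which is exactly the behavior the theorem describes.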

For a more detailed asymptotic analysis, see Section 3.11.

3.10 Estimates II: Bounded Kernel Functions

The class of kernel functions with bounded eigenfunctions is rather restrictive, although it is possible to construct interesting learning algorithms using, for example, a kernel function based on the sine basis. An example which leads to unbounded eigenfunctions is the rbf-kernel (see (2.17)).

Since the eigenfunctions can in principle become arbitrarily large (measured in the supremum norm or the fourth moment, while keeping the 2-norm fixed), we need to impose some form of regularity condition. In this section, we assume that the "diagonal" x ↦ k(x, x) of the kernel function is bounded,

sup_{x∈X} |k(x, x)| = K < ∞.    (3.97)

This condition is quite natural for the class of kernels built from radial basis functions. These are kernel functions which can be written as

k(x, y) = g(‖x − y‖)    (3.98)

with an appropriate g: R → R.
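For instance, for the Gaussian rbf-kernel the diagonal is constant, k(x, x) = g(0), so condition (3.97) holds with K = g(0). A minimal check (the bandwidth value is an arbitrary choice):

```python
import numpy as np

# For a radial kernel k(x, y) = g(||x - y||), the diagonal k(x, x) = g(0)
# is a constant, so condition (3.97) holds with K = g(0). Gaussian example:
def rbf(x, y, w=1.0):
    return np.exp(-np.dot(x - y, x - y) / (2.0 * w**2))

pts = np.random.default_rng(1).normal(size=(5, 3))
diag = [rbf(p, p) for p in pts]
print(diag)  # every entry equals g(0) = 1.0
```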

The first consequence of condition (3.97) is that the eigenfunctions and the error function e_r cannot become arbitrarily large.

Lemma 3.99 Let k be a Mercer kernel with eigenvalues (λ_i) and eigenfunctions (ψ_i) such that the diagonal of k is uniformly bounded by K. Then, for I ⊆ N,

0 ≤ Σ_{i∈I} λ_i ψ_i²(x) ≤ k(x, x) ≤ K    (3.100)

for all x ∈ X. In particular,

|ψ_i(x)| ≤ √(K/λ_i).    (3.101)

Consequently, the error function e_r is bounded, 0 ≤ e_r(x, x) ≤ K, for all r ∈ N.

Proof Since all the summands λ_i ψ_i²(x) are positive,

K ≥ |k(x, x)| = Σ_{i=1}^∞ λ_i ψ_i²(x) ≥ Σ_{i∈I} λ_i ψ_i²(x).    (3.102)

The bound on ψ_i follows for I = {i}. The bound on e_r follows for I = {r + 1, r + 2, . . .}.

The error estimates will depend on certain regularity parameters of ψ_i and e_r. First of all, we are interested in the variance of ψ_ℓψ_m under µ, since the expectation of ψ_ℓψ_m is approximated via empirical means in C_r^n. Using the standard notation that E_µ(f) is the expectation of f with respect to the measure µ, and Var_µ(f) the respective variance, define

γ_ℓm² = Var_µ(ψ_ℓψ_m).    (3.103)

Moreover, we introduce the following expectation, which is closely related to the absolute error term. For r ∈ N, let

t_r = E(e_r(X_1, X_1)).    (3.104)

3.10.1 The Relative Error Term

We begin by treating the relative error. The first step is to upper bound the variance of the random variables from which C_r^n is constructed.

Lemma 3.105 Let (ψ_i) be the eigenfunctions of a Mercer kernel whose diagonal is uniformly bounded by K. Then, E_µ(ψ_ℓ²ψ_m²) ≤ min(K/λ_ℓ, K/λ_m), and

γ_ℓm² = Var_µ(ψ_ℓψ_m − δ_ℓm) ≤ min(K/λ_ℓ, K/λ_m) − δ_ℓm.    (3.106)

Proof By the Hölder inequality,

Eµ`2ψm2)≤ kψ`2k1`2k≤K

λ`, (3.107)

3.10. Estimates II: Bounded Kernel Functions 43 becausekψ2`k1=kψ`k22= 1. The same bound holds with`andminterchanged which proves the first inequality.

The second inequality follows from the definition of the variance and the fact that E_µ(ψ_iψ_j) = δ_ij:

Var_µ(ψ_ℓψ_m − δ_ℓm) = Var_µ(ψ_ℓψ_m) = E_µ(ψ_ℓ²ψ_m²) − (E_µ(ψ_ℓψ_m))² ≤ min(K/λ_ℓ, K/λ_m) − δ_ℓm,    (3.108)

and the proof is completed.

Theorem 3.109 Let k be a Mercer kernel with eigenvalues (λ_i), and let the diagonal of k be uniformly bounded by K. Then, with probability larger than 1 − δ,

‖C_r^n‖ < r √( (2K/(nλ_r)) log( r(r+1)/δ ) ) + (4rK/(3nλ_r)) log( r(r+1)/δ ).    (3.110)

Proof Let c_ℓm be the entries of C_r^n. It holds that

c_ℓm = (1/n) Σ_{i=1}^n ψ_ℓ(X_i)ψ_m(X_i) − δ_ℓm.    (3.111)

Therefore, for 1 ≤ ℓ, m ≤ r, by Lemma 3.99, sup_{x,y∈X} |ψ_ℓ(x)ψ_m(y)| ≤ K/λ_r, so that

−K/λ_r − δ_ℓm ≤ c_ℓm ≤ K/λ_r − δ_ℓm,    (3.112)

and the range of c_ℓm has size M := 2K/λ_r.

We can bound the variance of ψ_ℓ(X_i)ψ_m(X_i) − δ_ℓm using Lemma 3.105:

Var_µ(ψ_ℓψ_m − δ_ℓm) ≤ K/λ_r =: σ².    (3.113)

By the Bernstein inequality (Theorem 2.42),

P{ |c_ℓm| ≥ ε } ≤ 2 exp( −nε² / (2σ² + 2Mε/3) ).    (3.114)

In the proof of Theorem 3.85, we showed that

P{ ‖C_r^n‖ ≥ ε } ≤ Σ_{ℓ≥m} P{ |c_ℓm| ≥ ε/r }.    (3.115)

Thus,

P{ ‖C_r^n‖ ≥ ε } ≤ r(r+1) exp( − n(ε/r)² / (2σ² + 2Mε/(3r)) ).    (3.116)

Setting the right hand side equal to δ and solving for ε yields (compare Theorem 2.44) that with probability larger than 1 − δ,

‖C_r^n‖ < (2Mr/(3n)) log( r(r+1)/δ ) + r √( (2σ²/n) log( r(r+1)/δ ) ).    (3.117)

Substituting the values for σ² and M yields the claimed upper bound.

Remark 3.118 The previous theorem contains a term which scales as O(1/n). However, using the Chebychev inequality, one can show that this term is not essential, because with probability larger than 1 − δ,

‖C_r^n‖ < r √( r(r+1)K / (λ_r nδ) ).    (3.119)

Proof By the Chebychev inequality,

P{ |c_ℓm| ≥ ε } ≤ Var_µ(ψ_ℓψ_m − δ_ℓm) / (nε²) ≤ K / (λ_r nε²).    (3.120)

Thus,

P{ ‖C_r^n‖ ≥ ε } ≤ (r(r+1)/2) · Kr² / (λ_r nε²).    (3.121)

Equating the right hand side to δ and solving for ε proves the remark.

Note, however, that the scaling with δ and r is much worse for the bound based on the Chebychev inequality than for the bound based on the Bernstein inequality.
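The difference in scaling can be made concrete by evaluating both bounds numerically; the constants below (K, λ_r, n, r) are illustrative choices, not values from the text.

```python
import numpy as np

# Compare the delta-dependence of the two relative-error bounds:
# Bernstein (3.110) scales like log(1/delta), Chebychev (3.119)
# like 1/sqrt(delta). All constants are illustrative.
K, lam_r, n, r = 1.0, 0.05, 10_000, 5

def bernstein_bound(delta):
    L = np.log(r * (r + 1) / delta)
    return r * np.sqrt(2 * K / (n * lam_r) * L) + 4 * r * K / (3 * n * lam_r) * L

def chebychev_bound(delta):
    return r * np.sqrt(r * (r + 1) * K / (lam_r * n * delta))

for delta in (0.1, 0.01, 0.001):
    print(delta, bernstein_bound(delta), chebychev_bound(delta))
```

Shrinking δ by a factor of 100 inflates the Chebychev bound by a factor of 10, while the Bernstein bound grows only logarithmically.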

3.10.2 Absolute Error Term

Let us now turn to the absolute error term. The absolute error term in the relative-absolute bound (3.72) measures the size of E_r^n in the operator norm. Fortunately, the norm can be bounded rather efficiently, since E_r^n is also positive definite: by construction, e_r is also a Mercer kernel.

Then, since the eigenvalues are all positive,

‖E_r^n‖ ≤ trace E_r^n = (1/n) Σ_{i=1}^n e_r(X_i, X_i).    (3.122)
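The norm-trace inequality used in (3.122) is elementary and easy to check numerically on any positive semi-definite matrix (the random matrix below is purely illustrative):

```python
import numpy as np

# For a positive semi-definite matrix, the largest eigenvalue (the
# operator norm) is at most the trace, since all eigenvalues are >= 0.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))
E = A @ A.T / 50                     # psd by construction (Gram-type matrix)
top = np.linalg.eigvalsh(E).max()    # operator norm of a psd matrix
print(top, np.trace(E))
```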

By the strong law of large numbers, it follows that

(1/n) Σ_{i=1}^n e_r(X_i, X_i) → E(e_r(X_1, X_1)) = t_r almost surely,    (3.123)

where t_r has been defined in (3.104). In the following, we will first relate t_r to the eigenvalues of k, compute certain statistical properties of e_r, and then derive a finite sample size bound on ‖E_r^n‖.

We introduce the following handy notation for the tail sum of the eigenvalues:

Λ_{>r} = Σ_{i=r+1}^∞ λ_i.    (3.124)

First we show that t_r is actually equal to the tail sum of the eigenvalues.

Lemma 3.125 Let k be a Mercer kernel with diagonal bounded by K < ∞ and eigenvalues (λ_i). Then,

t_r = Σ_{i=r+1}^∞ λ_i = Λ_{>r}.    (3.126)

Proof Using (3.19), we compute

t_r = ∫_X e_r(x, x) µ(dx) = ∫_X ( Σ_{i=r+1}^∞ λ_i ψ_i²(x) ) µ(dx)
    = Σ_{i=r+1}^∞ λ_i ∫_X ψ_i²(x) µ(dx) = Σ_{i=r+1}^∞ λ_i ‖ψ_i‖² = Σ_{i=r+1}^∞ λ_i = Λ_{>r}.    (3.127)

Note that summation and integration commute because Σ_{i=r+1}^R λ_i ψ_i²(x) is bounded by K for all R > r, so Lebesgue's dominated convergence theorem applies.
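A quick Monte Carlo sketch of Lemma 3.125 (the setup is illustrative, not from the text): with Fourier eigenfunctions on [0, 1] and λ_i = 2^-i, the empirical trace (1/n) Σ_i e_r(X_i, X_i) concentrates at the tail sum Λ_{>r}.

```python
import numpy as np

# Illustrative kernel: d Fourier eigenfunctions on [0, 1], lambda_i = 2^-i.
# The averaged diagonal of the truncation error e_r should approximate
# t_r = Lambda_{>r}, the tail sum of the eigenvalues (Lemma 3.125).
rng = np.random.default_rng(0)
n, d, r = 20_000, 41, 5
lam = 0.5 ** np.arange(1, d + 1)

def psi(x):
    fns = [np.ones_like(x)]
    for j in range(1, d // 2 + 1):
        fns.append(np.sqrt(2) * np.cos(2 * np.pi * j * x))
        fns.append(np.sqrt(2) * np.sin(2 * np.pi * j * x))
    return np.stack(fns[:d])

x = rng.uniform(0.0, 1.0, n)
P = psi(x)                                           # d x n
er_diag = (lam[r:, None] * P[r:] ** 2).sum(axis=0)   # e_r(X_i, X_i)
tail = lam[r:].sum()                                 # Lambda_{>r}
print(er_diag.mean(), tail)                          # empirical trace vs t_r
```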

Next, we compute statistical properties of e_r(X_1, X_1), which are necessary for the application of the Bernstein inequality.

Lemma 3.128 Let E_r^n be the truncation error matrix as defined in (3.78), and let X ∼ µ, the common distribution of the X_i. Then, for a kernel function with a diagonal uniformly bounded by K,

0 ≤ e_r(X, X) ≤ K,    Var(e_r(X, X)) ≤ K E(e_r(X, X)) = K t_r.    (3.129)

Proof The first inequality has already been proven in Lemma 3.99. With respect to the variance, note that

Var(e_r(X, X)) = E(e_r(X, X)²) − (E(e_r(X, X)))²
              ≤ E(|e_r(X, X)|) sup_{x∈X} |e_r(x, x)| − (E(e_r(X, X)))²
              ≤ E(e_r(X, X)) sup_{x∈X} |e_r(x, x)| = t_r K    (3.130)

by the H¨older inequality and Lemma 3.99.

We are now prepared to prove the error bound on ‖E_r^n‖.

Theorem 3.131 Let k be a Mercer kernel on H_µ(X) with eigenvalues (λ_i) and eigenfunctions (ψ_i), whose diagonal is bounded by K < ∞. Then, for a given confidence 0 < δ < 1 and 1 ≤ r ≤ n, with probability larger than 1 − δ,

‖E_r^n‖ < t_r + √( (2Kt_r/n) log(1/δ) ) + (2K/(3n)) log(1/δ).    (3.132)

Proof In Lemma 3.128, we have proven that the range of e_r(X_i, X_i) has size K, and that Var(e_r(X_i, X_i)) ≤ Kt_r. Thus, by the (one-sided) Bernstein inequality,

P{ ‖E_r^n‖ − t_r ≥ ε } ≤ exp( − nε² / (2Kt_r + 2Kε/3) ).

Setting the right hand side equal to δ and solving for ε results in the claimed upper bound (compare (2.47)).

Remark 3.133 As in the case of the relative error term, using the Chebychev inequality, one can show that with probability larger than 1 − δ,

‖E_r^n‖ < t_r + √( Kt_r / (nδ) ).    (3.134)

3.10.3 Relative-Absolute Bound for Bounded Kernel Functions

We can now derive the main result for bounded kernel functions. Combining Theorem 3.109 and Theorem 3.131, we obtain a final bound for the case where the kernel function is bounded.

Theorem 3.135 (Relative-Absolute Bound, Bounded Kernel Functions)

Let k be a Mercer kernel on H_µ(X) with eigenvalues (λ_i) and a diagonal which is uniformly bounded by K < ∞. Let K_n be the normalized kernel matrix based on an n-sample from µ. Then, for 1 ≤ r ≤ n and 0 < δ < 1, with probability larger than 1 − δ,

|l_i − λ_i| ≤ λ_i C(r, n) + E(r, n)    (3.136)

with

C(r, n) < r √( (2K/(nλ_r)) log( 2r(r+1)/δ ) ) + (4Kr/(3nλ_r)) log( 2r(r+1)/δ ),

E(r, n) < λ_r + Λ_{>r} + √( (2KΛ_{>r}/n) log(2/δ) ) + (2K/(3n)) log(2/δ).    (3.137)

Proof The basic relative-absolute bound holds by Theorem 3.71. The upper bounds on the relative error term ‖C_r^n‖ and the absolute error term ‖E_r^n‖ were derived in Theorems 3.109 and 3.131. The bound on ‖E_r^n‖ follows by Theorem 3.131, substituting for t_r the term from Lemma 3.125.

Each of the two bounds holds with probability larger than 1 − δ. Therefore, combining both bounds at confidence δ/2 each, using Lemma 2.57, leads to a bound on the sum which holds with probability 1 − δ.

Remark 3.138 Again, note that the O(1/n) terms in C(r, n) and E(r, n) are not essential. As stated in Remarks 3.118 and 3.133, using the Chebychev inequality, one can show that a similar bound holds with

C(r, n) = r √( r(r+1)K / (λ_r nδ) ),    E(r, n) = λ_r + Λ_{>r} + √( 2KΛ_{>r} / (nδ) ).    (3.139)