[Figure: mean and standard deviation of $Y_0$ as a function of the number of paths; curves for importance sampling for the energy derivative, crude least-squares Monte Carlo, and the sequential gradient method.]

Figure 4.29: Convergence of $\widehat{Y}^{\,n_{\mathrm{stop}},L}_{t_0}$ in the case of the energy derivative with EIS and different optimization methods.

sequences $({}^{\kappa}h_b)_{b\in\mathbb{N}}$ for different $\kappa$ reveal all kinds of patterns with respect to convergence: Sometimes they seem to converge, sometimes no regular behavior is observable, and in many cases there are several limit points. Nevertheless, it is possible to obtain variance reduction effects by simply using one haphazardly chosen element of the particular sequence.

The next concern is the choice of the optimization method: In the superhedging framework the direct simplex method turns out to be superior to the sequential methods. However, despite the convergence problems, both sequential optimization methods reveal the best efficiency in the out-of-the-money lookback option example. Consequently, there is no general rule as to which method should be preferred.

Furthermore, in a few applications we are faced with biased estimators when using the EIS approach. It is not clear how this bias arises, and we therefore cannot always rely on the outcomes even if there seems to be convergence towards an optimal $h$, which is the bad message.

As a last problem one should mention the high computational burden, which is especially heavy for the sequential optimization procedures because of the dependency of the 'optimal' $h$ on the starting value ${}^{\kappa}h_0$. This property limits to some extent the effectiveness of the induced variance reduction, since one should include the computation time in such a consideration, see e.g. the discussion in Boyle et al. [10], p. 1270 f.

Conclusively, for practical purposes one could give the following recommendation: If any tailor-made methods from option pricing for the linear BSDE are available, one should rather use them (even if $f$ is nonlinear) because of their higher potential variance reduction effect and their, in general, lower computational requirements. Having no such method from option pricing at hand, one should test EIS with any of the optimization methods and even experiment with results of non-converging sequences in the sequential optimization methods. Unfortunately, the use of the EIS methodology for BSDEs seems to be more problematic than in econometrics, though it is far from useless and sometimes even highly successful.

4.4 Nonparametric methods

In the former chapters we only considered least-squares Monte Carlo methods. More precisely, we chose as approximation for conditional expectations orthogonal projections on spaces spanned by a finite number of square-integrable basis functions. However, this is not the only possible choice. This subsection is devoted to the approach via nonparametric methods, especially the use of the Nadaraya-Watson estimator. It also tries to compare the numerical results to the findings of Bender and Denk [2] concerning the least-squares Monte Carlo method without variance reduction.

The Nadaraya-Watson estimator is a special case of an estimator resulting from local polynomial regression. In the latter case, the basic modeling assumption is the following: We have a sample of pairs $(X_1,Y_1),\dots,(X_L,Y_L)$, which are independent and identically distributed random vectors. $Y_\lambda$ is assumed to be scalar and $X_\lambda \in \mathbb{R}^{M_0}$ with density $f_X$.

In statistics the model equation is usually written as

$$Y_\lambda = m(X_\lambda) + v^{1/2}(X_\lambda)\,\varepsilon_\lambda, \qquad \lambda = 1,\dots,L, \tag{4.10}$$

where $v(x)$ and $m(x)$ denote the unknown conditional variance and the conditional mean at $x$ respectively, i.e. $v(x) = \operatorname{Var}[Y\,|\,X=x]$ and $m(x) = \mathrm{E}[Y\,|\,X=x]$. The noise terms $\varepsilon_\lambda$ are assumed to be i.i.d. with $\mathrm{E}[\varepsilon_\lambda]=0$ and $\operatorname{Var}[\varepsilon_\lambda]=1$, and $(\varepsilon_1,\dots,\varepsilon_L,X_1,\dots,X_L)$ are assumed to be independent.

At first sight the notation is somewhat confusing: Our task in the numerical approximation of BSDEs is to compute a conditional expectation, hence in this notation the determination of the unknown function $m$. In contrast, the left-hand side of equation (4.10) has to be read as an observation, which is explained on the right-hand side by the conditional expectation given $X$ plus some error term.
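To make this modeling assumption concrete, the following NumPy sketch draws a sample from (4.10) for one hypothetical choice of $m$, $v$ and $f_X$ (the concrete functions and all names are ours, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
L = 1000

# hypothetical ingredients of model (4.10)
m = lambda x: np.sin(2.0 * x)          # conditional mean m(x) = E[Y | X = x]
v = lambda x: 0.04 * (1.0 + x**2)      # conditional variance v(x) = Var[Y | X = x]

X = rng.normal(size=L)                 # X_1, ..., X_L i.i.d. with density f_X
eps = rng.standard_normal(L)           # noise with E[eps] = 0, Var[eps] = 1, independent of X
Y = m(X) + np.sqrt(v(X)) * eps         # observations Y_lambda
```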

To get an idea of the methods of local polynomial regression, we want to sketch their basic form: For the moment suppose $X$ is one-dimensional and the unknown function $m$ is smooth in the sense that the first $(r+1)$ derivatives exist in a neighborhood of $x_0$. Thus we can write $m(x)$ as a Taylor polynomial around $x_0$ plus a remainder term, i.e.

$$m(x) = \sum_{j=0}^{r} \frac{m^{(j)}(x_0)}{j!}\,(x-x_0)^j + \frac{(x-x_0)^{r+1}}{(r+1)!}\, m^{(r+1)}\bigl(x_0 + \theta(x-x_0)\bigr)$$

for some $\theta$ with $0 \le \theta \le 1$. With $\beta_j(x_0) := \frac{m^{(j)}(x_0)}{j!}$ we obtain the local polynomial approximation

$$m(x) \approx \sum_{j=0}^{r} \beta_j(x_0)\,(x-x_0)^j.$$

This motivates the following optimization problem:

$$\widehat{\beta}(x_0) = \operatorname*{arg\,inf}_{\beta \in \mathbb{R}^{r+1}} \left\{ \sum_{\lambda=1}^{L} \left( Y_\lambda - \sum_{j=0}^{r} \beta_j(x_0)\,(X_\lambda-x_0)^j \right)^{\!2} K\!\left( \frac{X_\lambda - x_0}{h_L} \right) \right\}. \tag{4.11}$$

Thereby, $K$ is a kernel function, which usually has support $[-1,1]$. Typical examples are

Uniform kernel: $K(u) = \tfrac{1}{2}\,\mathbf{1}_{\{|u|\le 1\}}$,
Triangle kernel: $K(u) = (1-|u|)\,\mathbf{1}_{\{|u|\le 1\}}$,
Epanechnikov kernel: $K(u) = \tfrac{3}{4}(1-u^2)\,\mathbf{1}_{\{|u|\le 1\}}$,
Gaussian kernel: $K(u) = \tfrac{1}{\sqrt{2\pi}}\exp\bigl(-\tfrac{1}{2}u^2\bigr)$.
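As an illustration of problem (4.11), here is a small NumPy sketch that solves the weighted least-squares problem at a single point $x_0$ with the Epanechnikov kernel (function names and the interface are ours; this is only a sketch, not code from the thesis):

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 3/4 (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def local_poly_fit(x0, X, Y, h, r=1, kernel=epanechnikov):
    """Solve the weighted least-squares problem (4.11) at the point x0.

    Returns beta_hat(x0); beta_hat[0] estimates m(x0) and
    j! * beta_hat[j] estimates the j-th derivative of m at x0.
    Assumes enough observations fall into the kernel window around x0.
    """
    k = kernel((X - x0) / h)                          # kernel weights
    D = np.vander(X - x0, N=r + 1, increasing=True)   # D[l, j] = (X_l - x0)^j
    W = np.diag(k)
    # normal equations of the weighted least-squares problem
    return np.linalg.solve(D.T @ W @ D, D.T @ W @ Y)
```

For $r = 0$ the solution reduces to the kernel-weighted average discussed below, i.e. the Nadaraya-Watson estimator.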

Since we are able to choose this function, the question naturally arises how to do this optimally.

However, "for practical purposes the choice of the kernel function is almost irrelevant for the efficiency of the estimate" as Härdle et al. [27], p. 61 remark.

The form of the different kernel functions explains the notion 'locally': We only include observations with $|X_\lambda - x_0| \le h_L$ in our estimation if we use compactly supported kernel functions, or we at least downweight observations 'far away' from $x_0$ if we use the Gaussian kernel. This kind of modeling is reasonable, since we can expect that those observations have no, or at least less, impact on the resulting outcome of $\mathrm{E}[Y\,|\,X = x_0]$. Consequently, the role of the so-called bandwidth $h_L$ becomes important: It determines the region around $x_0$ where we conjecture some influence. In contrast to the choice of the kernel function, the bandwidth has an important impact on the quality of the estimator, as we will see later on.

The corresponding estimators for $m(x_0)$ and the first $r$ derivatives of $m$ at $x_0$ based on problem (4.11) are given by

$$\widehat{m}^{(j)}(x_0) = j!\,\widehat{\beta}_j(x_0), \qquad j = 0,\dots,r.$$

We now describe the simplest case of local polynomial regression, where $r = 0$, i.e. locally constant regression, whose resulting estimator is usually called the Nadaraya-Watson estimator. We give up the temporary assumption that $X$ is one-dimensional. The functional form of the estimator in the multivariate case is

$$\widehat{m}(x) = \frac{\sum_{\lambda=1}^{L} K_H(X_\lambda - x)\, Y_\lambda}{\sum_{\lambda=1}^{L} K_H(X_\lambda - x)},$$

where $K_H$ represents a multivariate kernel function and the subindex $H$ symbolizes the dependency on the nonsingular, symmetric bandwidth matrix. This matrix determines the region where we conjecture some influence on the dependent variable.

For our purposes it is enough to restrict to product kernel functions of the form

$$K_H(u) = \prod_{j=1}^{M_0} K^j_H(u_j),$$

where $K^j_H$ is a one-dimensional kernel function; this corresponds to a diagonal bandwidth matrix $H_L$ containing on its main diagonal the bandwidths in the specific directions.
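A minimal NumPy sketch of the multivariate Nadaraya-Watson estimator with a Gaussian product kernel and a diagonal bandwidth matrix may help to fix ideas (names and interface are ours, purely illustrative):

```python
import numpy as np

def nw_product_gauss(x, X, Y, h):
    """Nadaraya-Watson estimate of E[Y | X = x] with a Gaussian product kernel.

    X: (L, M0) sample of regressors, Y: (L,) responses,
    x: (M0,) evaluation point, h: (M0,) componentwise bandwidths,
    i.e. bandwidth matrix H_L = diag(h)."""
    U = (X - x) / h                                            # scaled differences
    K = np.prod(np.exp(-0.5 * U**2) / np.sqrt(2.0 * np.pi), axis=1)
    return np.dot(K, Y) / np.sum(K)

# toy usage
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
Y = np.sin(X[:, 0]) + X[:, 1]**2 + 0.1 * rng.standard_normal(500)
print(nw_product_gauss(np.zeros(2), X, Y, h=np.array([0.3, 0.3])))
```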

The representation of the estimator of $m$ reveals that it is simply a weighted average of the observed variables $Y_\lambda$, i.e. $\widehat{m}(x) = \sum_{\lambda=1}^{L} w_\lambda(x)\, Y_\lambda$, where $w_\lambda(x) = K_H(X_\lambda - x) \big/ \sum_{l=1}^{L} K_H(X_l - x)$. This property is quite disadvantageous in our setting, since it produces dependent approximations of the solution of the BSDE starting from the first Picard iteration. Consequently, we are not able to use the results about the asymptotic behavior of the resulting estimators. Nonetheless, in order to get an idea about the convergence speed in situations where the condition of independence is fulfilled, we cite the following result of Ruppert and Wand [43] concerning the asymptotic conditional bias and the asymptotic conditional variance of the multivariate Nadaraya-Watson estimator:

Theorem 4.4.1. Let $x$ be an interior point of the support of $f_X$ and assume that the following properties are satisfied:

1. The multivariate kernel $K_H$ is a compactly supported, bounded kernel such that $\int u u^{\top} K_H(u)\,du = \mu_2(K_H)\, I$, where $\mu_2(K_H) \neq 0$ is scalar and $I$ is the $M_0 \times M_0$ identity matrix. In addition, all odd-order moments of $K_H$ vanish, that is, $\int u_1^{l_1} \cdots u_{M_0}^{l_{M_0}} K_H(u)\,du = 0$ for all nonnegative integers $l_1,\dots,l_{M_0}$ such that their sum is odd. (This last condition is satisfied by spherically symmetric kernels and product kernels based on symmetric univariate kernels.)

2. At $x$ the function $v$ is continuous, $f_X$ and $m$ are continuously differentiable, and all second-order derivatives of $m$ are continuous. Additionally, $f_X(x) > 0$ and $v(x) > 0$.

3. The sequence of bandwidth matrices $H_L$ is such that $L^{-1}\det(H_L)^{-1}$ and each component of $H_L$ tend to zero as $L \to \infty$, and $H_L$ remains symmetric and positive definite. Moreover, there is a constant $C$ such that the ratio of the largest to the smallest eigenvalue of $H_L$ is at most $C$ for all $L$.


Then the asymptotic conditional bias and variance of the Nadaraya-Watson kernel regression estimator are

$$\operatorname{Bias}\bigl(\widehat{m}(x)\,\big|\,X_1,\dots,X_L\bigr) = \mu_2(K_H)\,\frac{\nabla m(x)^{\top} H_L H_L^{\top}\,\nabla f_X(x)}{f_X(x)} + \frac{1}{2}\,\mu_2(K_H)\operatorname{tr}\bigl\{H_L^{\top}\,\mathcal{H}_m(x)\,H_L\bigr\} + o_P\bigl(\operatorname{tr}(H_L)\bigr),$$

$$\operatorname{Var}\bigl(\widehat{m}(x)\,\big|\,X_1,\dots,X_L\bigr) = \frac{1}{L\det(H_L)}\int |K_H(u)|^2\,du\;\frac{v(x)}{f_X(x)}\,\bigl(1+o_P(1)\bigr).$$

Thereby $\nabla$ and $\mathcal{H}$ denote the gradient and the Hessian, respectively, $\operatorname{tr}$ denotes the trace of a matrix, and for two sequences of random variables $u_\lambda, v_\lambda$ we write $u_\lambda = o_P(v_\lambda)$ if and only if $P(|u_\lambda / v_\lambda| > \varepsilon) \to 0$ for all $\varepsilon > 0$.

Proof. See Ruppert and Wand [43], Theorem 2.1, where we have to note that in that paper the local linear estimator is examined ($r = 1$ in (4.11)). Since we are interested in the asymptotics of the Nadaraya-Watson estimator, we have to add the first summand to the asymptotic bias. The above formula can be found e.g. in Härdle et al. [27], Theorem 4.8.

The third assumption ensures that the asymptotic conditional bias and variance converge to zero. This shows that the bandwidth choice has to be made carefully. At the same time, the expressions for the conditional asymptotic bias and variance reveal the basic trade-off between a too large and a too small bandwidth: If each component of $H_L$ is very small, the bias will be small, but the variance grows because only few observations fall into the local neighborhood. On the other hand, larger bandwidths create modeling errors and possibly oversmoothing. We further want to mention that at boundary points the asymptotic conditional bias and variance of the Nadaraya-Watson estimator are of larger order than in the interior, since in general fewer observations are available in the neighborhood.

This drawback would be avoided if we used the local linear estimator instead. Its asymptotic bias is also lower, since the leading summand in the above formula vanishes, while its asymptotic variance equals that of the Nadaraya-Watson estimator; this suggests that we should rather use this methodology to get better convergence results. However, the computational effort would also rise considerably and, as we will see, this is one of the real problems with the nonparametric approach to the numerical approximation of the solution of BSDEs.

Now we sketch some ideas how to get a reasonable bandwidth $h$ in the case where we choose a bandwidth matrix $H_L = h \cdot I_{M_0 \times M_0}$. There are several proposals how to choose the optimal bandwidth $h_{\mathrm{opt}}$, and we have to distinguish between global and local criteria. An example of a local criterion is to choose $h_{\mathrm{opt}}$ as the minimizer of the asymptotic mean squared error (AMSE) at each point $x_0$, i.e.

$$\mathrm{AMSE}(L,h,x_0) = \frac{1}{L h^{M_0}}\, C_1 + h^4\, C_2$$

for two constants $C_1$ and $C_2$ depending on the kernel function, the curvature of $m$, $\sigma$ and the density $f_X$ at the chosen point. Consequently, the optimal bandwidth $h_{\mathrm{opt}}$ is proportional to $L^{-1/(4+M_0)}$. If we apply this criterion, we have to calculate a special bandwidth for every point where we want to evaluate the estimator, which in turn means a high computational burden in an auxiliary step of our problem.
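The proportionality $h_{\mathrm{opt}} \propto L^{-1/(4+M_0)}$ follows from a one-line minimization of the AMSE in $h$ (a sketch, using the constants $C_1$, $C_2$ from above):

$$\frac{\partial}{\partial h}\,\mathrm{AMSE}(L,h,x_0) = -\frac{M_0\, C_1}{L\, h^{M_0+1}} + 4\, C_2\, h^{3} = 0 \quad\Longleftrightarrow\quad h^{M_0+4} = \frac{M_0\, C_1}{4\, C_2\, L} \quad\Longleftrightarrow\quad h_{\mathrm{opt}} = \Bigl(\frac{M_0\, C_1}{4\, C_2}\Bigr)^{1/(M_0+4)} L^{-1/(4+M_0)}.$$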

The global criterion selects a bandwidth $h_{\mathrm{opt}}$ which is the same for all points, where we again assume that the bandwidth is chosen the same for all components, $H_L = h \cdot I_{M_0 \times M_0}$. The computational cost is lower in this case, and we will apply such a criterion in our simulations. The simplest among them is the cross-validation technique: Here, we simply look for the minimizer of the expression

$$\mathrm{CV}(h) = \frac{1}{L}\sum_{\lambda=1}^{L}\bigl(Y_\lambda - \widehat{m}_{-\lambda}(X_\lambda)\bigr)^2\, w(X_\lambda),$$

where $w(x)$ is some weight function and $\widehat{m}_{-\lambda}(X_\lambda)$ denotes the leave-one-out estimator

$$\widehat{m}_{-\lambda}(X_\lambda) = \frac{\sum_{j\neq\lambda} K_H\bigl(X_j - X_\lambda\bigr)\, Y_j}{\sum_{j\neq\lambda} K_H\bigl(X_j - X_\lambda\bigr)}.$$

This approach is (on average) equivalent to minimizing the averaged squared error (ASE)

$$\mathrm{ASE}(h) = \frac{1}{L}\sum_{\lambda=1}^{L}\bigl(\widehat{m}(X_\lambda) - m(X_\lambda)\bigr)^2\, w(X_\lambda),$$

which in turn is a discrete approximation of the integrated squared error (ISE)

$$\mathrm{ISE}(h) = \int_{\mathbb{R}^{M_0}} \bigl(\widehat{m}(x) - m(x)\bigr)^2\, w(x)\, f_X(x)\,dx.$$

The latter clearly represents a global discrepancy measure.
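For concreteness, a small NumPy sketch of the cross-validation bandwidth selection for a one-dimensional Nadaraya-Watson estimator with the Gaussian kernel and weight function $w \equiv 1$ (a grid search; all names are ours and chosen for illustration):

```python
import numpy as np

def gauss_kernel(u):
    """One-dimensional Gaussian kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def nw_estimate(x0, X, Y, h, kernel=gauss_kernel):
    """Nadaraya-Watson estimate of E[Y | X = x0] with bandwidth h."""
    w = kernel((X - x0) / h)
    return np.dot(w, Y) / np.sum(w)

def cv_score(h, X, Y, kernel=gauss_kernel):
    """Leave-one-out cross-validation criterion CV(h) with w(x) = 1."""
    L = len(X)
    resid = np.empty(L)
    for lam in range(L):
        mask = np.arange(L) != lam          # leave observation lam out
        resid[lam] = Y[lam] - nw_estimate(X[lam], X[mask], Y[mask], h, kernel)
    return np.mean(resid**2)

def cv_bandwidth(X, Y, grid, kernel=gauss_kernel):
    """Minimize CV(h) over a grid of candidate bandwidths."""
    scores = [cv_score(h, X, Y, kernel) for h in grid]
    return grid[int(np.argmin(scores))]

# toy usage: select a bandwidth and evaluate the fit at x = 0
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, 200)
Y = np.sin(X) + 0.3 * rng.standard_normal(200)
h_opt = cv_bandwidth(X, Y, grid=np.linspace(0.05, 1.0, 20))
print(h_opt, nw_estimate(0.0, X, Y, h_opt))
```

The quadratic cost in $L$ of the leave-one-out loop is one reason for the computation times reported later in this section.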

In our special situation without importance sampling there are $L$ simulations for any point of the time grid $i = 0,\dots,N-1$, given by

$${}^{\lambda}X_{t_i} = \begin{pmatrix} {}^{\lambda}S_{t_i} \\ \vdots \end{pmatrix}, \qquad \phi\bigl({}^{\lambda}X_{t_N}\bigr) + \sum_{j=i}^{N-1} f\bigl(t_j,\,{}^{\lambda}S_{t_j},\,{}^{\lambda}\widehat{Y}^{\,n-1}_{t_j},\,{}^{\lambda}\widehat{Z}^{\,n-1}_{t_j}\bigr)\Delta_j,$$

and

$${}^{\lambda}W^{d,i}_{i}\Bigl( \phi\bigl({}^{\lambda}X_{t_N}\bigr) + \sum_{j=i+1}^{N-1} f\bigl(t_j,\,{}^{\lambda}S_{t_j},\,{}^{\lambda}\widehat{Y}^{\,n-1}_{t_j},\,{}^{\lambda}\widehat{Z}^{\,n-1}_{t_j}\bigr)\Delta_j \Bigr), \qquad d = 1,\dots,D.$$

The missing components of the vector ${}^{\lambda}X_{t_i}$ have to be chosen in the same way as in the least-squares Monte Carlo approach. Now, the corresponding Nadaraya-Watson estimators for the solution of the BSDE evaluated at ${}^{\lambda}X_{t_i}$ are given by

$${}^{\lambda}\widehat{Y}^{\,n}_{t_i} = \frac{\sum_{l=1}^{L}\Bigl(\phi\bigl({}^{l}X_{t_N}\bigr) + \sum_{j=i}^{N-1} f\bigl(t_j,\,{}^{l}S_{t_j},\,{}^{l}\widehat{Y}^{\,n-1}_{t_j},\,{}^{l}\widehat{Z}^{\,n-1}_{t_j}\bigr)\Delta_j\Bigr)\, K^{i,n}_H\bigl({}^{l}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}{\sum_{l=1}^{L} K^{i,n}_H\bigl({}^{l}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}$$

and

$${}^{\lambda}\widehat{Z}^{\,n}_{d,t_i} = \frac{\sum_{l=1}^{L} {}^{l}W^{d,i}_{i}\Bigl(\phi\bigl({}^{l}X_{t_N}\bigr) + \sum_{j=i+1}^{N-1} f\bigl(t_j,\,{}^{l}S_{t_j},\,{}^{l}\widehat{Y}^{\,n-1}_{t_j},\,{}^{l}\widehat{Z}^{\,n-1}_{t_j}\bigr)\Delta_j\Bigr)\, \widetilde{K}^{i,n}_H\bigl({}^{l}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}{\sum_{l=1}^{L} \widetilde{K}^{i,n}_H\bigl({}^{l}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}$$

for $\lambda = 1,\dots,L$ and $i = 0,\dots,N-1$, where $K^{i,n}_H$ and $\widetilde{K}^{i,n}_H$ are the corresponding multivariate kernel functions.

Note that for $t_0 = 0$ the estimator simply boils down to a mean, since the initial value ${}^{\lambda}X_{t_0}$ is non-random.

Now, our algorithm for $d = 1$ is organized as follows:

We have to choose $K^{i,n}_H$ and $\widetilde{K}^{i,n}_H$. As already mentioned, this choice has only minor influence, hence we take $K^{i,n}_H = \widetilde{K}^{i,n}_H = K_H$ for all $n$ and $i = 1,\dots,N-1$.

Afterwards, we initialize the Picard iteration as usual: For $\lambda = 1,\dots,L$

$${}^{\lambda}\widehat{Y}^{\,n}_{t_N} = \phi\bigl({}^{\lambda}X_{t_N}\bigr), \quad n \in \mathbb{N}, \qquad {}^{\lambda}\widehat{Z}^{\,n}_{t_N} = 0, \quad n \in \mathbb{N}, \qquad {}^{\lambda}\widehat{Y}^{\,0}_{t_i} = {}^{\lambda}\widehat{Z}^{\,0}_{t_i} = 0, \quad i = 0,\dots,N-1.$$


In analogy to the parametric approach, given the estimators of the $(n-1)$-th iteration evaluated at the simulations of ${}^{\lambda}X_{t_i}$, i.e. ${}^{\lambda}\widehat{Y}^{\,n-1}_{t_i}$ and ${}^{\lambda}\widehat{Z}^{\,n-1}_{t_i}$ for $\lambda = 1,\dots,L$ and $i = 0,\dots,N-1$, we can calculate the estimators of the $n$-th iteration with the following procedure:

We select the bandwidth matrices $H^{i,n}_L$ and $\widetilde{H}^{i,n}_L$ according to the cross-validation technique, where we assume equal bandwidths for all components, i.e. $H^{i,n}_L = h_{i,n}\cdot I_{M_0\times M_0}$ and $\widetilde{H}^{i,n}_L = \widetilde{h}_{i,n}\cdot I_{M_0\times M_0}$, in order to reduce the computational costs. This means that the optimal bandwidth $h_{i,n}$ concerning the estimation of $Y^{n,\pi}_{t_i}$ is chosen as the minimizer of the cross-validation criterion applied to the regression of the simulated values $\phi\bigl({}^{\lambda}X_{t_N}\bigr) + \sum_{j=i}^{N-1} f\bigl(t_j,{}^{\lambda}S_{t_j},{}^{\lambda}\widehat{Y}^{\,n-1}_{t_j},{}^{\lambda}\widehat{Z}^{\,n-1}_{t_j}\bigr)\Delta_j$ on ${}^{\lambda}X_{t_i}$, $\lambda = 1,\dots,L$; for the optimal bandwidth $\widetilde{h}_{i,n}$ concerning the estimation of $Z^{n}_{t_i}$ we have an analogous expression.

Afterwards, we compute the $L\times L$ weight matrices $w^{i,n}_{\lambda,\kappa}$ and $\widetilde{w}^{i,n}_{\lambda,\kappa}$ for $i = 1,\dots,N-1$. They are given by

$$w^{i,n}_{\lambda,\kappa} = \frac{K^{i,n}_H\bigl({}^{\kappa}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}{\sum_{l=1}^{L} K^{i,n}_H\bigl({}^{l}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}, \qquad \widetilde{w}^{i,n}_{\lambda,\kappa} = \frac{\widetilde{K}^{i,n}_H\bigl({}^{\kappa}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}{\sum_{l=1}^{L} \widetilde{K}^{i,n}_H\bigl({}^{l}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}.$$

Finally, we can compute the estimators of the $n$-th Picard iteration:

$${}^{\lambda}\widehat{Y}^{\,n}_{t_i} = \sum_{\kappa=1}^{L} w^{i,n}_{\lambda,\kappa}\Bigl(\phi\bigl({}^{\kappa}X_{t_N}\bigr) + \sum_{j=i}^{N-1} f\bigl(t_j,{}^{\kappa}S_{t_j},{}^{\kappa}\widehat{Y}^{\,n-1}_{t_j},{}^{\kappa}\widehat{Z}^{\,n-1}_{t_j}\bigr)\Delta_j\Bigr), \qquad {}^{\lambda}\widehat{Z}^{\,n}_{t_i} = \sum_{\kappa=1}^{L} \widetilde{w}^{i,n}_{\lambda,\kappa}\,{}^{\kappa}W^{1,i}_{i}\Bigl(\phi\bigl({}^{\kappa}X_{t_N}\bigr) + \sum_{j=i+1}^{N-1} f\bigl(t_j,{}^{\kappa}S_{t_j},{}^{\kappa}\widehat{Y}^{\,n-1}_{t_j},{}^{\kappa}\widehat{Z}^{\,n-1}_{t_j}\bigr)\Delta_j\Bigr).$$

Again, we iterate until the difference of two subsequent estimators for $Y_0$ is less than 0.001.
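To illustrate this step, here is a NumPy sketch of how the weight matrices and the resulting estimators could be computed for a single time point in the one-dimensional case with a Gaussian kernel (the function names and the interface are ours and only mirror the pseudo-code given further below):

```python
import numpy as np

def nw_weights(X_eval, X_data, h):
    """Row-stochastic L x L Nadaraya-Watson weight matrix with Gaussian kernel:
    entry (lam, l) is K((X_data[l] - X_eval[lam]) / h), normalized over l."""
    U = (X_data[None, :] - X_eval[:, None]) / h
    K = np.exp(-0.5 * U**2) / np.sqrt(2.0 * np.pi)
    return K / K.sum(axis=1, keepdims=True)

def picard_time_step(X_ti, b_Y, b_Z, dW_i, delta_i, h_y, h_z):
    """One time point of the n-th Picard iteration via locally constant regression.

    b_Y: empirical backward values phi(X_tN) + sum_j f(...) Delta_j per path,
    b_Z: the same values without the f-term at time t_i,
    dW_i, delta_i: Brownian increments and step size at time t_i."""
    w_y = nw_weights(X_ti, X_ti, h_y)
    w_z = nw_weights(X_ti, X_ti, h_z)
    Y_hat = w_y @ b_Y
    Z_hat = w_z @ (dW_i / delta_i * b_Z)
    return Y_hat, Z_hat
```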

Here we see why this kind of approach is numerically very costly: We need to evaluate the kernel function $L(L-1)$ times to find an optimal bandwidth for each time partition point and each Picard iteration.

In some sense the bandwidth choice can be seen as the analogue of the choice of basis functions in the least-squares Monte Carlo approach; however, it is not possible to get a fast numerical scheme to compute an 'optimal' one, even though we simplified the setting considerably. Maybe this drawback should not be overestimated, since we do not even care about the quality of a special kind of basis functions in the parametric approach: We simply choose one set and work with it instead of thinking of some criterion for the superiority of one set of basis functions over another. Hence, one could argue that the search for an 'optimal' bandwidth should be excluded when comparing the nonparametric and the parametric approach with respect to computation time.

In the following, the pseudo-MATLAB code for the numerical computation of the solution of the BSDE via locally constant regression is given. Again, .* denotes a product of matrices taken component-by-component:


% Simulation of the forward Markov process Xti (only its first component
% S is given here, other component(s) denoted by X2 are problem specific)
for λ = 1:L
    St0(λ) = x
    X2t0(λ) = ...
end
for i = 0:N-1
    Sti+1(:) = Sti(:) + (b(ti,Sti(:)) + σ(ti,Sti(:))*hti)*∆i + σ(ti,Sti(:)).*∆Wi(:)
    X2ti+1(:) = ...
end

% Initialization
YtN(:) = φ(XtN(:))
ZtN(:) = 0
for i = 0:N-1
    Y0ti(:) = 0
    Z0ti(:) = 0
end
n = 0
% Ensure that the while loop is entered in the first pass
Y−1t0 = Y0t0 + 1

% Picard-iteration
while |Ynt0 − Yn−1t0| > 0.001 and n < 100
    n = n+1
    b_Y(:) = YtN(:)
    for i = 1:N-1
        b_Z(:) = b_Y(:)
        % The application of f has to be read component-by-component
        b_Y(:) = b_Z(:) + f(tN−i,StN−i(:),Yn−1tN−i(:),Zn−1tN−i(:))*∆N−i
        % Bandwidth selection (bandw is a function computing hopt
        % according to some criterion, e.g. cross-validation, for some
        % chosen kernel function 'ker')
        hy = bandw(b_Y(:),XtN−i(:),'ker')
        hz = bandw(∆WN−i(:)/∆N−i .* b_Z(:),XtN−i(:),'ker')
        % Computation of weight matrices (weights is a function
        % calculating the weight matrix for some chosen kernel function 'ker')
        wy(:,:) = weights(XtN−i(:),XtN−i(:),hy,'ker')
        wz(:,:) = weights(XtN−i(:),XtN−i(:),hz,'ker')
        YntN−i(:) = wy(:,:) * b_Y(:)
        ZntN−i(:) = wz(:,:) * (∆WN−i(:)/∆N−i .* b_Z(:))
    end
    b_Z(:) = b_Y(:)
    % The application of f has to be read component-by-component
    b_Y(:) = b_Z(:) + f(t0,St0(:),Yn−1t0(:),Zn−1t0(:))*∆0
    % mean denotes the mean function
    Ynt0 = mean(b_Y(:))
    Znt0 = mean(∆W0(:)/∆0 .* b_Z(:))
end

We tested the method along the example of a straddle with payoff function $|S_T - K|$ in the Bergman model. The price of this option is given by $Y_0$, where $(S,Y,Z)$ is the solution of the FBSDE

$$\begin{aligned}
dS_t &= b\,S_t\,dt + \sigma\,S_t\,dW_t,\\
dY_t &= \Bigl( r\,Y_t + \frac{b-r}{\sigma}\,Z_t - (R-r)\Bigl(Y_t - \frac{Z_t}{\sigma}\Bigr)^{-}\Bigr)dt + Z_t\,dW_t,\\
S_0 &= s_0, \qquad Y_T = |S_T - K|.
\end{aligned}$$

We use the same parameters as Bender and Denk [2], that is:

b      σ      r      R      T     s0      K
0.05   0.2    0.01   0.06   2     100     100

The forward equation is simulated by the log-Euler scheme, which guarantees strictly positive stock prices.
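For reference, a minimal sketch of this simulation step (for the constant coefficients above the Euler scheme applied to $\log S$ coincides with exact simulation of geometric Brownian motion; variable names are ours):

```python
import numpy as np

def log_euler_paths(s0, b, sigma, T, N, L, rng):
    """Simulate L paths of dS = b S dt + sigma S dW on N time steps by applying
    the Euler scheme to log S, which keeps all paths strictly positive."""
    dt = T / N
    S = np.full((L, N + 1), float(s0))
    for i in range(N):
        dW = np.sqrt(dt) * rng.standard_normal(L)
        S[:, i + 1] = S[:, i] * np.exp((b - 0.5 * sigma**2) * dt + sigma * dW)
    return S

# parameters from the table above, N = 2 time steps and L = 5,500 paths as in the text
S = log_euler_paths(s0=100.0, b=0.05, sigma=0.2, T=2.0, N=2, L=5500,
                    rng=np.random.default_rng(0))
```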

Furthermore, we choose the Gaussian kernel as kernel function and the bandwidth according to the cross-validation rule. Note that theoretically the use of the Gaussian kernel cannot be supported, since it is not compactly supported as required by Theorem 4.4.1; however, since we cannot use the asymptotic results anyway, the minor practical influence of the kernel function renders this almost irrelevant.

Again we stop the Picard iteration if the difference of two subsequent estimators for $Y_0$ is less than 0.001. Our findings are as follows: Since the calculation of the weight matrices and especially the selection of the optimal bandwidth by cross-validation require many evaluations of the kernel function, we are not able to include as many time partition points and as many simulations as in the least-squares Monte Carlo approach. Hence, we only use $N = 2$ and at most $L = 5{,}500$ simulations, whereas for the least-squares Monte Carlo method $N = 35$ and $L = 100{,}000$ create no difficulties concerning memory overflows. Also, the difference in computation time is significant: The nonparametric approach for 5,500 simulated paths requires about 54 minutes to get an estimator for the time-0 option value, compared to 0.02 seconds in the parametric least-squares approach. This enormous increase in computation time is almost exclusively due to the bandwidth selection: 52 minutes of the whole time are devoted to that task, and the rest of the time is mainly due to the computation of the weight matrices. Hence, even if we exclude the bandwidth selection from the comparison with respect to computation time, the nonparametric approach is inferior by far.

Besides this unfavorable aspect, we cannot improve the (empirical) standard deviation of our estimator $\widehat{Y}^{\,n_{\mathrm{stop}},L}_{t_0}$: It is almost the same as in a comparable simulation using the crude least-squares Monte Carlo approach with monomials up to order 2 as basis functions and two time steps.

Bender and Denk [2] find in their simulations an estimated initial price of about 24.6 to 24.7. Hence Figure 4.30, which depicts the empirical mean plus/minus two empirical standard deviations of 100 estimations of $\widehat{Y}^{\,n_{\mathrm{stop}},L}_{t_0}$, indicates that in the situation with very few time steps the nonparametric estimator does better than the parametric one with respect to the exactness of the estimators. A minor advantage can also be seen in the required number of Picard iterations: We obtain $n_{\mathrm{stop}} = 5$ for all parameter choices, whereas the use of a polynomial basis in the least-squares Monte Carlo approach leads to slightly more iterations in general.

In conclusion, we cannot recommend further research on nonparametric methods for the required estimation of the conditional expectations within the framework of the Picard algorithm for decoupled FBSDEs. Even in our very simple setting, where we only use a one-dimensional Nadaraya-Watson estimator, we cannot use enough simulations to achieve a precision of the estimator similar to that of the least-squares Monte Carlo approach.

Maybe these drawbacks are not as severe in examples where the density $f_X(x)$ is known. Then the calculation of the optimal bandwidth and the weight matrices simplifies considerably, implying less computation time and lower memory requirements.


[Figure: mean and standard deviation of $Y_0$ as a function of the number of paths (500 to 5,500); curves for the nonparametric approach and least-squares Monte Carlo.]

Figure 4.30: Convergence of $\widehat{Y}^{\,n_{\mathrm{stop}},L}_{t_0}$ in the case of a nonlinear BSDE, using the Nadaraya-Watson estimator versus crude least-squares Monte Carlo.

The most severe drawback, however, is that we cannot use the asymptotic expressions for the conditional bias and variance, since the crucial independence assumption of Theorem 4.4.1 is violated. At the moment we cannot overcome this difficulty, and an analysis of the error induced by this kind of estimation remains open.

Appendix A

Appendix

A.1 Least-squares problem and singular value decomposition of a