[Figure: mean and standard deviation of $Y_0$ as a function of the number of paths; curves for importance sampling for the energy derivative, crude least-squares Monte Carlo, and the sequential gradient method.]

Figure 4.29: Convergence of $\widehat{Y}^{\,n_{\mathrm{stop}},L}_{t_0}$ in the case of the energy derivative with EIS and different optimization methods.

sequences $({}^{\kappa}h_b)_{b\in\mathbb{N}}$ for different $\kappa$ reveal all kinds of patterns with respect to convergence: Sometimes they seem to converge, sometimes no regular behavior is observable, and in many cases there are several limit points. Nevertheless, it is possible to obtain variance reduction effects by simply using one haphazardly chosen element of the particular sequence.

The next concern is the choice of the optimization method: In the superhedging framework the direct simplex method turns out to be superior to the sequential methods. However, despite the convergence problems, both sequential optimization methods reveal the best efficiency in the out-of-the-money lookback option example. Consequently, there is no general rule as to which method should be preferred.

Furthermore, in a few applications we are faced with biased estimators when using the EIS approach. It is not clear how this bias arises, and we therefore cannot always rely on the outcomes even if there seems to be convergence towards an optimal $h$, which is the bad message.

As a last problem one should mention the high computational burden, which is especially heavy for the sequential optimization procedures because of the dependency of the 'optimal' $h$ on the starting value ${}^{\kappa}h_0$. This property limits to some extent the effectiveness of the induced variance reduction, since one should include the computation time in such a consideration, see e.g. the discussion in Boyle et al. [10], p. 1270 f.

Conclusively, for practical purposes one could give the following recommendation: If any tailor-made methods from option pricing for the linear BSDE are available, one should rather use them (even if $f$ is nonlinear) because of their higher potential variance reduction effect and their, in general, lower computational requirements. Having no such method from option pricing at hand, one should test EIS with any of the optimization methods and even experiment with results of non-converging sequences in the sequential optimization methods. Unfortunately, the use of the EIS methodology for BSDEs seems to be more problematic than in econometrics, though it is far from useless and sometimes even highly successful.

4.4 Nonparametric methods

In the former chapters we only considered least-squares Monte Carlo methods. More precisely, we chose as approximation for conditional expectations orthogonal projections on spaces spanned by a finite number of square-integrable basis functions. However, this is not the only possible choice. This subsection is devoted to the approach via nonparametric methods, especially the use of the Nadaraya-Watson estimator. It also tries to compare the numerical results to the findings of Bender and Denk [2] concerning the least-squares Monte Carlo method without variance reduction.

The Nadaraya-Watson estimator is a special case of an estimator resulting from local polynomial regression. In the latter case, the basic modeling assumption is the following: We have a sample of pairs $(X_1,Y_1),\dots,(X_L,Y_L)$, which are independent and identically distributed random vectors. $Y_\lambda$ is assumed to be scalar and $X_\lambda \in \mathbb{R}^{M_0}$ with density $f_X$.

In statistics the model equation is usually written as

$$Y_\lambda = m(X_\lambda) + v^{1/2}(X_\lambda)\,\varepsilon_\lambda, \qquad \lambda = 1,\dots,L, \tag{4.10}$$

where $v(x)$ and $m(x)$ denote the unknown conditional variance and the conditional mean at $x$ respectively, i.e. $v(x) = \operatorname{Var}[Y\,|\,X=x]$ and $m(x) = \mathrm{E}[Y\,|\,X=x]$. The noise terms $\varepsilon_\lambda$ are assumed to be i.i.d. with $\mathrm{E}[\varepsilon_\lambda]=0$ and $\operatorname{Var}[\varepsilon_\lambda]=1$, and $(\varepsilon_1,\dots,\varepsilon_L,X_1,\dots,X_L)$ are assumed to be independent.

At first sight the notation is somewhat confusing: Our task in the numerical approximation of BSDEs is to compute a conditional expectation, hence in this notation the determination of the unknown function $m$. In contrast, the left-hand side of equation (4.10) has to be read as an observation, which is explained on the right-hand side by the conditional expectation given $X$ plus some error term.
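To make this modeling assumption concrete, the following NumPy sketch draws a sample from (4.10) for one hypothetical choice of $m$, $v$ and $f_X$ (the concrete functions and all names are ours, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
L = 1000

# hypothetical ingredients of model (4.10)
m = lambda x: np.sin(2.0 * x)          # conditional mean m(x) = E[Y | X = x]
v = lambda x: 0.04 * (1.0 + x**2)      # conditional variance v(x) = Var[Y | X = x]

X = rng.normal(size=L)                 # X_1, ..., X_L i.i.d. with density f_X
eps = rng.standard_normal(L)           # noise with E[eps] = 0, Var[eps] = 1, independent of X
Y = m(X) + np.sqrt(v(X)) * eps         # observations Y_lambda
```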

To get an idea of the methods of local polynomial regression, we want to sketch their basic form: For the moment suppose $X$ is one-dimensional and the unknown function $m$ is smooth in the sense that the first $(r+1)$ derivatives exist in a neighborhood of $x_0$. Thus we can write $m(x)$ as a Taylor polynomial around $x_0$ plus a remainder term, i.e.

$$m(x) = \sum_{j=0}^{r} \frac{m^{(j)}(x_0)}{j!}\,(x-x_0)^j + \frac{(x-x_0)^{r+1}}{(r+1)!}\, m^{(r+1)}\bigl(x_0 + \theta(x-x_0)\bigr)$$

for some $\theta$ with $0 \le \theta \le 1$. With $\beta_j(x_0) := \frac{m^{(j)}(x_0)}{j!}$ we obtain the local polynomial approximation

$$m(x) \approx \sum_{j=0}^{r} \beta_j(x_0)\,(x-x_0)^j.$$

This motivates the following optimization problem:

$$\widehat{\beta}(x_0) = \operatorname*{arg\,inf}_{\beta \in \mathbb{R}^{r+1}} \left\{ \sum_{\lambda=1}^{L} \left( Y_\lambda - \sum_{j=0}^{r} \beta_j(x_0)\,(X_\lambda-x_0)^j \right)^{\!2} K\!\left( \frac{X_\lambda - x_0}{h_L} \right) \right\}. \tag{4.11}$$

Thereby, $K$ is a kernel function, which usually has support $[-1,1]$. Typical examples are

Uniform kernel: $K(u) = \tfrac{1}{2}\,\mathbf{1}_{\{|u|\le 1\}}$,
Triangle kernel: $K(u) = (1-|u|)\,\mathbf{1}_{\{|u|\le 1\}}$,
Epanechnikov kernel: $K(u) = \tfrac{3}{4}(1-u^2)\,\mathbf{1}_{\{|u|\le 1\}}$,
Gaussian kernel: $K(u) = \tfrac{1}{\sqrt{2\pi}}\exp\bigl(-\tfrac{1}{2}u^2\bigr)$.
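As an illustration of problem (4.11), here is a small NumPy sketch that solves the weighted least-squares problem at a single point $x_0$ with the Epanechnikov kernel (function names and the interface are ours; this is only a sketch, not code from the thesis):

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 3/4 (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def local_poly_fit(x0, X, Y, h, r=1, kernel=epanechnikov):
    """Solve the weighted least-squares problem (4.11) at the point x0.

    Returns beta_hat(x0); beta_hat[0] estimates m(x0) and
    j! * beta_hat[j] estimates the j-th derivative of m at x0.
    Assumes enough observations fall into the kernel window around x0.
    """
    k = kernel((X - x0) / h)                          # kernel weights
    D = np.vander(X - x0, N=r + 1, increasing=True)   # D[l, j] = (X_l - x0)^j
    W = np.diag(k)
    # normal equations of the weighted least-squares problem
    return np.linalg.solve(D.T @ W @ D, D.T @ W @ Y)
```

For $r = 0$ the solution reduces to the kernel-weighted average discussed below, i.e. the Nadaraya-Watson estimator.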

Since we are able to choose this function, the question naturally arises how to do this optimally.

However, "for practical purposes the choice of the kernel function is almost irrelevant for the efficiency of the estimate" as Härdle et al. [27], p. 61 remark.

The form of the different kernel functions explains the notion 'locally': We only include observations with $|X_\lambda - x_0| \le h_L$ in our estimation if we use compactly supported kernel functions, or we at least downweight observations 'far away' from $x_0$ if we use the Gaussian kernel. This kind of modeling is reasonable, since we can expect that those observations have no, or at least less, impact on the resulting outcome of $\mathrm{E}[Y\,|\,X = x_0]$. Consequently, the role of the so-called bandwidth $h_L$ becomes important: It determines the region around $x_0$ where we conjecture some influence. In contrast to the choice of the kernel function, the bandwidth has an important impact on the quality of the estimator, as we will see later on.

The corresponding estimators for $m(x_0)$ and the first $r$ derivatives of $m$ at $x_0$ based on problem (4.11) are given by

$$\widehat{m}^{(j)}(x_0) = j!\,\widehat{\beta}_j(x_0), \qquad j = 0,\dots,r.$$

We now describe the simplest case of local polynomial regression, where $r = 0$, i.e. locally constant regression, whose resulting estimator is usually called the Nadaraya-Watson estimator. We give up the temporary assumption that $X$ is one-dimensional. The functional form of the estimator in the multivariate case is

$$\widehat{m}(x) = \frac{\sum_{\lambda=1}^{L} K_H(X_\lambda - x)\, Y_\lambda}{\sum_{\lambda=1}^{L} K_H(X_\lambda - x)},$$

where $K_H$ represents a multivariate kernel function and the subindex $H$ symbolizes the dependency on the nonsingular, symmetric bandwidth matrix. This matrix determines the region where we conjecture some influence on the dependent variable.

For our purposes it is enough to restrict to product kernel functions of the form

$$K_H(u) = \prod_{j=1}^{M_0} K^j_H(u_j),$$

where $K^j_H$ is a one-dimensional kernel function; this corresponds to a diagonal bandwidth matrix $H_L$ containing on its main diagonal the bandwidths in the specific directions.
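A minimal NumPy sketch of the multivariate Nadaraya-Watson estimator with a Gaussian product kernel and a diagonal bandwidth matrix may help to fix ideas (names and interface are ours, purely illustrative):

```python
import numpy as np

def nw_product_gauss(x, X, Y, h):
    """Nadaraya-Watson estimate of E[Y | X = x] with a Gaussian product kernel.

    X: (L, M0) sample of regressors, Y: (L,) responses,
    x: (M0,) evaluation point, h: (M0,) componentwise bandwidths,
    i.e. bandwidth matrix H_L = diag(h)."""
    U = (X - x) / h                                            # scaled differences
    K = np.prod(np.exp(-0.5 * U**2) / np.sqrt(2.0 * np.pi), axis=1)
    return np.dot(K, Y) / np.sum(K)

# toy usage
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
Y = np.sin(X[:, 0]) + X[:, 1]**2 + 0.1 * rng.standard_normal(500)
print(nw_product_gauss(np.zeros(2), X, Y, h=np.array([0.3, 0.3])))
```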

The representation of the estimator of $m$ reveals that it is simply a weighted average of the observed variables $Y_\lambda$, i.e. $\widehat{m}(x) = \sum_{\lambda=1}^{L} w_\lambda(x)\, Y_\lambda$, where $w_\lambda(x) = K_H(X_\lambda - x) \big/ \sum_{l=1}^{L} K_H(X_l - x)$. This property is quite disadvantageous in our setting, since it produces dependent approximations of the solution of the BSDE starting from the first Picard iteration. Consequently, we are not able to use the results about the asymptotic behavior of the resulting estimators. Nonetheless, in order to get an idea about the convergence speed in situations where the condition of independence is fulfilled, we cite the following result of Ruppert and Wand [43] concerning the asymptotic conditional bias and the asymptotic conditional variance of the multivariate Nadaraya-Watson estimator:

Theorem 4.4.1. Let $x$ be an interior point of the support of $f_X$ and assume that the following properties are satisfied:

1. The multivariate kernel $K_H$ is a compactly supported, bounded kernel such that $\int u u^{\top} K_H(u)\,du = \mu_2(K_H)\, I$, where $\mu_2(K_H) \neq 0$ is scalar and $I$ is the $M_0 \times M_0$ identity matrix. In addition, all odd-order moments of $K_H$ vanish, that is, $\int u_1^{l_1} \cdots u_{M_0}^{l_{M_0}} K_H(u)\,du = 0$ for all nonnegative integers $l_1,\dots,l_{M_0}$ such that their sum is odd. (This last condition is satisfied by spherically symmetric kernels and product kernels based on symmetric univariate kernels.)

2. At $x$ the function $v$ is continuous, $f_X$ and $m$ are continuously differentiable, and all second-order derivatives of $m$ are continuous. Additionally, $f_X(x) > 0$ and $v(x) > 0$.

3. The sequence of bandwidth matrices $H_L$ is such that $L^{-1}\det(H_L)^{-1}$ and each component of $H_L$ tend to zero as $L \to \infty$, and $H_L$ remains symmetric and positive definite. Moreover, there is a constant $C$ such that the ratio of the largest to the smallest eigenvalue of $H_L$ is at most $C$ for all $L$.


Then the asymptotic conditional bias and variance of the Nadaraya-Watson kernel regression estimator are

$$\operatorname{Bias}\bigl(\widehat{m}(x)\,\big|\,X_1,\dots,X_L\bigr) = \mu_2(K_H)\,\frac{\nabla m(x)^{\top} H_L H_L^{\top}\,\nabla f_X(x)}{f_X(x)} + \frac{1}{2}\,\mu_2(K_H)\operatorname{tr}\bigl\{H_L^{\top}\,\mathcal{H}_m(x)\,H_L\bigr\} + o_P\bigl(\operatorname{tr}(H_L)\bigr),$$

$$\operatorname{Var}\bigl(\widehat{m}(x)\,\big|\,X_1,\dots,X_L\bigr) = \frac{1}{L\det(H_L)}\int |K_H(u)|^2\,du\;\frac{v(x)}{f_X(x)}\,\bigl(1+o_P(1)\bigr).$$

Thereby $\nabla$ and $\mathcal{H}$ denote the gradient and the Hessian, respectively, $\operatorname{tr}$ denotes the trace of a matrix, and for two sequences of random variables $u_\lambda, v_\lambda$ we write $u_\lambda = o_P(v_\lambda)$ if and only if $P(|u_\lambda / v_\lambda| > \varepsilon) \to 0$ for all $\varepsilon > 0$.

Proof. See Ruppert and Wand [43], Theorem 2.1, where we have to note that in that paper the local linear estimator is examined ($r = 1$ in (4.11)). Since we are interested in the asymptotics of the Nadaraya-Watson estimator, we have to add the first summand to the asymptotic bias. The above formula can be found e.g. in Härdle et al. [27], Theorem 4.8.

The third assumption ensures that the asymptotic conditional bias and variance converge to zero. This shows that the bandwidth choice has to be made carefully. At the same time, the expressions for the conditional asymptotic bias and variance reveal the basic trade-off between a too large and a too small bandwidth: If each component of $H_L$ is very small, the bias will be small, but the variance grows because only few observations fall into the local neighborhood. On the other hand, larger bandwidths create modeling errors and possibly oversmoothing. We further want to mention that at boundary points the asymptotic conditional bias and variance of the Nadaraya-Watson estimator are of larger order than in the interior, since in general fewer observations are available in the neighborhood.

This drawback would be avoided if we used the local linear estimator instead. Its asymptotic bias is also lower, since the leading summand in the above formula vanishes, while its asymptotic variance equals that of the Nadaraya-Watson estimator; this suggests that we should rather use this methodology to get better convergence results. However, the computational effort would also rise considerably and, as we will see, this is one of the real problems with the nonparametric approach to the numerical approximation of the solution of BSDEs.

Now we sketch some ideas how to get a reasonable bandwidth $h$ in the case where we choose a bandwidth matrix $H_L = h \cdot I_{M_0 \times M_0}$. There are several proposals how to choose the optimal bandwidth $h_{\mathrm{opt}}$, and we have to distinguish between global and local criteria. An example of a local criterion is to choose $h_{\mathrm{opt}}$ as the minimizer of the asymptotic mean squared error (AMSE) at each point $x_0$, i.e.

$$\mathrm{AMSE}(L,h,x_0) = \frac{1}{L h^{M_0}}\, C_1 + h^4\, C_2$$

for two constants $C_1$ and $C_2$ depending on the kernel function, the curvature of $m$, $\sigma$ and the density $f_X$ at the chosen point. Consequently, the optimal bandwidth $h_{\mathrm{opt}}$ is proportional to $L^{-1/(4+M_0)}$. If we apply this criterion, we have to calculate a special bandwidth for every point where we want to evaluate the estimator, which in turn means a high computational burden in an auxiliary step of our problem.
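The proportionality $h_{\mathrm{opt}} \propto L^{-1/(4+M_0)}$ follows from a one-line minimization of the AMSE in $h$ (a sketch, using the constants $C_1$, $C_2$ from above):

$$\frac{\partial}{\partial h}\,\mathrm{AMSE}(L,h,x_0) = -\frac{M_0\, C_1}{L\, h^{M_0+1}} + 4\, C_2\, h^{3} = 0 \quad\Longleftrightarrow\quad h^{M_0+4} = \frac{M_0\, C_1}{4\, C_2\, L} \quad\Longleftrightarrow\quad h_{\mathrm{opt}} = \Bigl(\frac{M_0\, C_1}{4\, C_2}\Bigr)^{1/(M_0+4)} L^{-1/(4+M_0)}.$$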

The global criterion selects a bandwidth $h_{\mathrm{opt}}$ which is the same for all points, where we again assume that the bandwidth is chosen the same for all components, $H_L = h \cdot I_{M_0 \times M_0}$. The computational cost is lower in this case, and we will apply such a criterion in our simulations. The simplest among them is the cross-validation technique: Here, we simply look for the minimizer of the expression

$$\mathrm{CV}(h) = \frac{1}{L}\sum_{\lambda=1}^{L}\bigl(Y_\lambda - \widehat{m}_{-\lambda}(X_\lambda)\bigr)^2\, w(X_\lambda),$$

where $w(x)$ is some weight function and $\widehat{m}_{-\lambda}(X_\lambda)$ denotes the leave-one-out estimator

$$\widehat{m}_{-\lambda}(X_\lambda) = \frac{\sum_{j\neq\lambda} K_H\bigl(X_j - X_\lambda\bigr)\, Y_j}{\sum_{j\neq\lambda} K_H\bigl(X_j - X_\lambda\bigr)}.$$

This approach is (on average) equivalent to minimizing the averaged squared error (ASE)

$$\mathrm{ASE}(h) = \frac{1}{L}\sum_{\lambda=1}^{L}\bigl(\widehat{m}(X_\lambda) - m(X_\lambda)\bigr)^2\, w(X_\lambda),$$

which in turn is a discrete approximation of the integrated squared error (ISE)

$$\mathrm{ISE}(h) = \int_{\mathbb{R}^{M_0}} \bigl(\widehat{m}(x) - m(x)\bigr)^2\, w(x)\, f_X(x)\,dx.$$

The latter clearly represents a global discrepancy measure.
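For concreteness, a small NumPy sketch of the cross-validation bandwidth selection for a one-dimensional Nadaraya-Watson estimator with the Gaussian kernel and weight function $w \equiv 1$ (a grid search; all names are ours and chosen for illustration):

```python
import numpy as np

def gauss_kernel(u):
    """One-dimensional Gaussian kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def nw_estimate(x0, X, Y, h, kernel=gauss_kernel):
    """Nadaraya-Watson estimate of E[Y | X = x0] with bandwidth h."""
    w = kernel((X - x0) / h)
    return np.dot(w, Y) / np.sum(w)

def cv_score(h, X, Y, kernel=gauss_kernel):
    """Leave-one-out cross-validation criterion CV(h) with w(x) = 1."""
    L = len(X)
    resid = np.empty(L)
    for lam in range(L):
        mask = np.arange(L) != lam          # leave observation lam out
        resid[lam] = Y[lam] - nw_estimate(X[lam], X[mask], Y[mask], h, kernel)
    return np.mean(resid**2)

def cv_bandwidth(X, Y, grid, kernel=gauss_kernel):
    """Minimize CV(h) over a grid of candidate bandwidths."""
    scores = [cv_score(h, X, Y, kernel) for h in grid]
    return grid[int(np.argmin(scores))]

# toy usage: select a bandwidth and evaluate the fit at x = 0
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, 200)
Y = np.sin(X) + 0.3 * rng.standard_normal(200)
h_opt = cv_bandwidth(X, Y, grid=np.linspace(0.05, 1.0, 20))
print(h_opt, nw_estimate(0.0, X, Y, h_opt))
```

The quadratic cost in $L$ of the leave-one-out loop is one reason for the computation times reported later in this section.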

In our special situation without importance sampling there are $L$ simulations for any point of the time grid $i = 0,\dots,N-1$, given by

$${}^{\lambda}X_{t_i} = \begin{pmatrix} {}^{\lambda}S_{t_i} \\ \vdots \end{pmatrix}, \qquad \phi\bigl({}^{\lambda}X_{t_N}\bigr) + \sum_{j=i}^{N-1} f\bigl(t_j,\,{}^{\lambda}S_{t_j},\,{}^{\lambda}\widehat{Y}^{\,n-1}_{t_j},\,{}^{\lambda}\widehat{Z}^{\,n-1}_{t_j}\bigr)\Delta_j,$$

and

$${}^{\lambda}W^{d,i}_{i}\Bigl( \phi\bigl({}^{\lambda}X_{t_N}\bigr) + \sum_{j=i+1}^{N-1} f\bigl(t_j,\,{}^{\lambda}S_{t_j},\,{}^{\lambda}\widehat{Y}^{\,n-1}_{t_j},\,{}^{\lambda}\widehat{Z}^{\,n-1}_{t_j}\bigr)\Delta_j \Bigr), \qquad d = 1,\dots,D.$$

The missing components of the vector ${}^{\lambda}X_{t_i}$ have to be chosen in the same way as in the least-squares Monte Carlo approach. Now, the corresponding Nadaraya-Watson estimators for the solution of the BSDE evaluated at ${}^{\lambda}X_{t_i}$ are given by

$${}^{\lambda}\widehat{Y}^{\,n}_{t_i} = \frac{\sum_{l=1}^{L}\Bigl(\phi\bigl({}^{l}X_{t_N}\bigr) + \sum_{j=i}^{N-1} f\bigl(t_j,\,{}^{l}S_{t_j},\,{}^{l}\widehat{Y}^{\,n-1}_{t_j},\,{}^{l}\widehat{Z}^{\,n-1}_{t_j}\bigr)\Delta_j\Bigr)\, K^{i,n}_H\bigl({}^{l}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}{\sum_{l=1}^{L} K^{i,n}_H\bigl({}^{l}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}$$

and

$${}^{\lambda}\widehat{Z}^{\,n}_{d,t_i} = \frac{\sum_{l=1}^{L} {}^{l}W^{d,i}_{i}\Bigl(\phi\bigl({}^{l}X_{t_N}\bigr) + \sum_{j=i+1}^{N-1} f\bigl(t_j,\,{}^{l}S_{t_j},\,{}^{l}\widehat{Y}^{\,n-1}_{t_j},\,{}^{l}\widehat{Z}^{\,n-1}_{t_j}\bigr)\Delta_j\Bigr)\, \widetilde{K}^{i,n}_H\bigl({}^{l}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}{\sum_{l=1}^{L} \widetilde{K}^{i,n}_H\bigl({}^{l}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}$$

for $\lambda = 1,\dots,L$ and $i = 0,\dots,N-1$, where $K^{i,n}_H$ and $\widetilde{K}^{i,n}_H$ are the corresponding multivariate kernel functions.

Note that for $t_0 = 0$ the estimator simply boils down to a mean, since the initial value ${}^{\lambda}X_{t_0}$ is non-random.

Now, our algorithm for $d = 1$ is organized as follows:

We have to choose $K^{i,n}_H$ and $\widetilde{K}^{i,n}_H$. As already mentioned, this choice has only minor influence, hence we take $K^{i,n}_H = \widetilde{K}^{i,n}_H = K_H$ for all $n$ and $i = 1,\dots,N-1$.

Afterwards, we initialize the Picard iteration as usual: For $\lambda = 1,\dots,L$

$${}^{\lambda}\widehat{Y}^{\,n}_{t_N} = \phi\bigl({}^{\lambda}X_{t_N}\bigr), \quad n \in \mathbb{N}, \qquad {}^{\lambda}\widehat{Z}^{\,n}_{t_N} = 0, \quad n \in \mathbb{N}, \qquad {}^{\lambda}\widehat{Y}^{\,0}_{t_i} = {}^{\lambda}\widehat{Z}^{\,0}_{t_i} = 0, \quad i = 0,\dots,N-1.$$


In analogy to the parametric approach, given the estimators of the $(n-1)$-th iteration evaluated at the simulations of ${}^{\lambda}X_{t_i}$, i.e. ${}^{\lambda}\widehat{Y}^{\,n-1}_{t_i}$ and ${}^{\lambda}\widehat{Z}^{\,n-1}_{t_i}$ for $\lambda = 1,\dots,L$ and $i = 0,\dots,N-1$, we can calculate the estimators of the $n$-th iteration with the following procedure:

We select the bandwidth matrices $H^{i,n}_L$ and $\widetilde{H}^{i,n}_L$ according to the cross-validation technique, where we assume equal bandwidths for all components, i.e. $H^{i,n}_L = h_{i,n}\cdot I_{M_0\times M_0}$ and $\widetilde{H}^{i,n}_L = \widetilde{h}_{i,n}\cdot I_{M_0\times M_0}$, in order to reduce the computational costs. This means that the optimal bandwidth $h_{i,n}$ concerning the estimation of $Y^{n,\pi}_{t_i}$ is chosen as the minimizer of the cross-validation criterion applied to the regression of the simulated values $\phi\bigl({}^{\lambda}X_{t_N}\bigr) + \sum_{j=i}^{N-1} f\bigl(t_j,{}^{\lambda}S_{t_j},{}^{\lambda}\widehat{Y}^{\,n-1}_{t_j},{}^{\lambda}\widehat{Z}^{\,n-1}_{t_j}\bigr)\Delta_j$ on ${}^{\lambda}X_{t_i}$, $\lambda = 1,\dots,L$; for the optimal bandwidth $\widetilde{h}_{i,n}$ concerning the estimation of $Z^{n}_{t_i}$ we have an analogous expression.

Afterwards, we compute the $L\times L$ weight matrices $w^{i,n}_{\lambda,\kappa}$ and $\widetilde{w}^{i,n}_{\lambda,\kappa}$ for $i = 1,\dots,N-1$. They are given by

$$w^{i,n}_{\lambda,\kappa} = \frac{K^{i,n}_H\bigl({}^{\kappa}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}{\sum_{l=1}^{L} K^{i,n}_H\bigl({}^{l}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}, \qquad \widetilde{w}^{i,n}_{\lambda,\kappa} = \frac{\widetilde{K}^{i,n}_H\bigl({}^{\kappa}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}{\sum_{l=1}^{L} \widetilde{K}^{i,n}_H\bigl({}^{l}X_{t_i} - {}^{\lambda}X_{t_i}\bigr)}.$$

Finally, we can compute the estimators of the $n$-th Picard iteration:

$${}^{\lambda}\widehat{Y}^{\,n}_{t_i} = \sum_{\kappa=1}^{L} w^{i,n}_{\lambda,\kappa}\Bigl(\phi\bigl({}^{\kappa}X_{t_N}\bigr) + \sum_{j=i}^{N-1} f\bigl(t_j,{}^{\kappa}S_{t_j},{}^{\kappa}\widehat{Y}^{\,n-1}_{t_j},{}^{\kappa}\widehat{Z}^{\,n-1}_{t_j}\bigr)\Delta_j\Bigr), \qquad {}^{\lambda}\widehat{Z}^{\,n}_{t_i} = \sum_{\kappa=1}^{L} \widetilde{w}^{i,n}_{\lambda,\kappa}\,{}^{\kappa}W^{1,i}_{i}\Bigl(\phi\bigl({}^{\kappa}X_{t_N}\bigr) + \sum_{j=i+1}^{N-1} f\bigl(t_j,{}^{\kappa}S_{t_j},{}^{\kappa}\widehat{Y}^{\,n-1}_{t_j},{}^{\kappa}\widehat{Z}^{\,n-1}_{t_j}\bigr)\Delta_j\Bigr).$$

Again, we iterate until the difference of two subsequent estimators for $Y_0$ is less than 0.001.
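To illustrate this step, here is a NumPy sketch of how the weight matrices and the resulting estimators could be computed for a single time point in the one-dimensional case with a Gaussian kernel (the function names and the interface are ours and only mirror the pseudo-code given further below):

```python
import numpy as np

def nw_weights(X_eval, X_data, h):
    """Row-stochastic L x L Nadaraya-Watson weight matrix with Gaussian kernel:
    entry (lam, l) is K((X_data[l] - X_eval[lam]) / h), normalized over l."""
    U = (X_data[None, :] - X_eval[:, None]) / h
    K = np.exp(-0.5 * U**2) / np.sqrt(2.0 * np.pi)
    return K / K.sum(axis=1, keepdims=True)

def picard_time_step(X_ti, b_Y, b_Z, dW_i, delta_i, h_y, h_z):
    """One time point of the n-th Picard iteration via locally constant regression.

    b_Y: empirical backward values phi(X_tN) + sum_j f(...) Delta_j per path,
    b_Z: the same values without the f-term at time t_i,
    dW_i, delta_i: Brownian increments and step size at time t_i."""
    w_y = nw_weights(X_ti, X_ti, h_y)
    w_z = nw_weights(X_ti, X_ti, h_z)
    Y_hat = w_y @ b_Y
    Z_hat = w_z @ (dW_i / delta_i * b_Z)
    return Y_hat, Z_hat
```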

Here we see why this kind of approach is numerically very costly: We need to evaluate the kernel function $L(L-1)$ times to find an optimal bandwidth for each time partition point and each Picard iteration.

In some sense the bandwidth choice can be seen as the analogue of the choice of basis functions in the least-squares Monte Carlo approach; however, it is not possible to get a fast numerical scheme to compute an 'optimal' one, even though we simplified the setting considerably. Maybe this drawback should not be overestimated, since we do not even care about the quality of a special kind of basis functions in the parametric approach: We simply choose one set and work with it instead of thinking of some criterion for the superiority of one set of basis functions over another. Hence, one could argue that the search for an 'optimal' bandwidth should be excluded when comparing the nonparametric and the parametric approach with respect to computation time.

In the following, the pseudo-MATLAB code for the numerical computation of the solution of the BSDE via locally constant regression is given. Again, .* denotes a product of matrices taken component-by-component:


% Simulation of the forward Markov process Xti (only its first component
% S is given here, other component(s) denoted by X2 are problem specific)
for λ = 1:L
    St0(λ) = x
    X2t0(λ) = ...
end
for i = 0:N-1
    Sti+1(:) = Sti(:) + (b(ti,Sti(:)) + σ(ti,Sti(:))*hti)*∆i + σ(ti,Sti(:)).*∆Wi(:)
    X2ti+1(:) = ...
end

% Initialization
YtN(:) = φ(XtN(:))
ZtN(:) = 0
for i = 0:N-1
    Y0ti(:) = 0
    Z0ti(:) = 0
end
n = 0
% Ensure that the while loop is entered in the first pass
Y−1t0 = Y0t0 + 1

% Picard-iteration
while |Ynt0 − Yn−1t0| > 0.001 and n < 100
    n = n+1
    b_Y(:) = YtN(:)
    for i = 1:N-1
        b_Z(:) = b_Y(:)
        % The application of f has to be read component-by-component
        b_Y(:) = b_Z(:) + f(tN−i,StN−i(:),Yn−1tN−i(:),Zn−1tN−i(:))*∆N−i
        % Bandwidth selection (bandw is a function computing hopt
        % according to some criterion, e.g. cross-validation, for some
        % chosen kernel function 'ker')
        hy = bandw(b_Y(:),XtN−i(:),'ker')
        hz = bandw(∆WN−i(:)/∆N−i .* b_Z(:),XtN−i(:),'ker')
        % Computation of weight matrices (weights is a function
        % calculating the weight matrix for some chosen kernel function 'ker')
        wy(:,:) = weights(XtN−i(:),XtN−i(:),hy,'ker')
        wz(:,:) = weights(XtN−i(:),XtN−i(:),hz,'ker')
        YntN−i(:) = wy(:,:) * b_Y(:)
        ZntN−i(:) = wz(:,:) * (∆WN−i(:)/∆N−i .* b_Z(:))
    end
    b_Z(:) = b_Y(:)
    % The application of f has to be read component-by-component
    b_Y(:) = b_Z(:) + f(t0,St0(:),Yn−1t0(:),Zn−1t0(:))*∆0
    % mean denotes the mean function
    Ynt0 = mean(b_Y(:))
    Znt0 = mean(∆W0(:)/∆0 .* b_Z(:))
end

We tested the method along the example of a straddle with payoff function $|S_T - K|$ in the Bergman model. The price of this option is given by $Y_0$, where $(S,Y,Z)$ is the solution of the FBSDE

$$\begin{aligned}
dS_t &= b\,S_t\,dt + \sigma\,S_t\,dW_t,\\
dY_t &= \Bigl( r\,Y_t + \frac{b-r}{\sigma}\,Z_t - (R-r)\Bigl(Y_t - \frac{Z_t}{\sigma}\Bigr)^{-}\Bigr)dt + Z_t\,dW_t,\\
S_0 &= s_0, \qquad Y_T = |S_T - K|.
\end{aligned}$$

We use the same parameters as Bender and Denk [2], that is:

b      σ      r      R      T     s0      K
0.05   0.2    0.01   0.06   2     100     100

The forward equation is simulated by the log-Euler scheme, which guarantees strictly positive stock prices.
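For reference, a minimal sketch of this simulation step (for the constant coefficients above the Euler scheme applied to $\log S$ coincides with exact simulation of geometric Brownian motion; variable names are ours):

```python
import numpy as np

def log_euler_paths(s0, b, sigma, T, N, L, rng):
    """Simulate L paths of dS = b S dt + sigma S dW on N time steps by applying
    the Euler scheme to log S, which keeps all paths strictly positive."""
    dt = T / N
    S = np.full((L, N + 1), float(s0))
    for i in range(N):
        dW = np.sqrt(dt) * rng.standard_normal(L)
        S[:, i + 1] = S[:, i] * np.exp((b - 0.5 * sigma**2) * dt + sigma * dW)
    return S

# parameters from the table above, N = 2 time steps and L = 5,500 paths as in the text
S = log_euler_paths(s0=100.0, b=0.05, sigma=0.2, T=2.0, N=2, L=5500,
                    rng=np.random.default_rng(0))
```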

Furthermore, we choose the Gaussian kernel as kernel function and the bandwidth according to the cross-validation rule. Note that theoretically the use of the Gaussian kernel cannot be supported, since it is not compactly supported as required by Theorem 4.4.1; however, since we cannot use the asymptotic results anyway, the minor practical influence of the kernel function renders this almost irrelevant.

Again we stop the Picard iteration if the difference of two subsequent estimators for $Y_0$ is less than 0.001. Our findings are as follows: Since the calculation of the weight matrices and especially the selection of the optimal bandwidth by cross-validation require many evaluations of the kernel function, we are not able to include as many time partition points and as many simulations as in the least-squares Monte Carlo approach. Hence, we only use $N = 2$ and at most $L = 5{,}500$ simulations, whereas for the least-squares Monte Carlo method $N = 35$ and $L = 100{,}000$ create no difficulties concerning memory overflows. Also, the difference in computation time is significant: The nonparametric approach for 5,500 simulated paths requires about 54 minutes to get an estimator for the time-0 option value, compared to 0.02 seconds in the parametric least-squares approach. This enormous increase in computation time is almost exclusively due to the bandwidth selection: 52 minutes of the whole time are devoted to that task, and the rest of the time is mainly due to the computation of the weight matrices. Hence, even if we exclude the bandwidth selection from the comparison with respect to computation time, the nonparametric approach is inferior by far.

Besides this unfavorable aspect, we cannot improve the (empirical) standard deviation of our estimator $\widehat{Y}^{\,n_{\mathrm{stop}},L}_{t_0}$: It is almost the same as in a comparable simulation using the crude least-squares Monte Carlo approach with monomials up to order 2 as basis functions and two time steps.

Bender and Denk [2] find in their simulations an estimated initial price of about 24.6 to 24.7. Hence Figure 4.30, which depicts the empirical mean plus/minus two empirical standard deviations of 100 estimations of $\widehat{Y}^{\,n_{\mathrm{stop}},L}_{t_0}$, indicates that in the situation with very few time steps the nonparametric estimator does better than the parametric one with respect to the exactness of the estimators. A minor advantage can also be seen in the required number of Picard iterations: We obtain $n_{\mathrm{stop}} = 5$ for all parameter choices, whereas the use of a polynomial basis in the least-squares Monte Carlo approach leads to slightly more iterations in general.

In conclusion, we cannot recommend further research on nonparametric methods for the required estimation of the conditional expectations within the framework of the Picard algorithm for decoupled FBSDEs. Even in our very simple setting, where we only use a one-dimensional Nadaraya-Watson estimator, we cannot use enough simulations to achieve a precision of the estimator similar to that of the least-squares Monte Carlo approach.

Maybe these drawbacks are not as severe in examples where the density $f_X(x)$ is known. Then the calculation of the optimal bandwidth and the weight matrices simplifies considerably, implying less computation time and lower memory requirements.


[Figure: mean and standard deviation of $Y_0$ as a function of the number of paths (500 to 5,500); curves for the nonparametric approach and least-squares Monte Carlo.]

Figure 4.30: Convergence of $\widehat{Y}^{\,n_{\mathrm{stop}},L}_{t_0}$ in the case of a nonlinear BSDE, using the Nadaraya-Watson estimator versus crude least-squares Monte Carlo.

The most severe drawback, however, is that we cannot use the asymptotic expressions for the conditional bias and variance, since the crucial independence assumption of Theorem 4.4.1 is violated. At the moment we cannot overcome this difficulty, and an analysis of the error induced by this kind of estimation remains open.

Appendix A

Appendix

A.1 Least-squares problem and singular value decomposition of a