
the previously described competition between internal models remains the same. To make the rejection adaptive to each internal model's current level of performance, each internal model uses a rejection threshold that is proportional to its current mean squared error. Thus, based on all samples that have been assigned to an internal model for training, the mean squared error is computed as

$$\mathrm{MSE}_i^k = \frac{1}{\sum_{t=1}^{T} I_i^{t,k}} \sum_{t=1}^{T} I_i^{t,k} \cdot \left\| y^t - \varphi_i(w_i^k, x_i^t) \right\|^2, \qquad (x_i^t, y^t) \in S_i^k, \tag{5.6}$$

and a threshold for rejection is updated for the next iteration as

$$\rho_i^{k+1} = \lambda \cdot \mathrm{MSE}_i^k, \tag{5.7}$$

where $\lambda \in \mathbb{R}_{>0}$ is a parameter of the method. The weighting scheme is adapted to take the threshold for rejection into account as

$$I_i^{t,k} = \begin{cases} 1 & \text{if } \mathrm{SE}_i^{t,k} \leq \rho_i^k \text{ and } \forall j \neq i: \mathrm{SE}_i^{t,k} \leq \mathrm{SE}_j^{t,k}, \\ 0 & \text{otherwise.} \end{cases} \tag{5.8}$$

As before, the model parameters of both internal models are initialized using all samples from the initial training sets $S_1^0$ and $S_2^0$, without any form of assignment.
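To make the interplay of Eqs. (5.6)–(5.8) concrete, the following sketch shows what one iteration of the adaptive assignment could look like. It is an illustration, not the original implementation; the `models` objects with `predict` and `fit` methods are assumed placeholders for the learners used as internal models.

```python
import numpy as np

def competitive_iteration(models, thresholds, X_in, Y, lam):
    """One iteration of the adaptive assignment (cf. Eqs. 5.6-5.8).

    models     : list of learners, each offering predict(X) and fit(X, Y)
    thresholds : current rejection thresholds rho_i^k, one per internal model
    X_in       : list of per-model input arrays x_i^t, each of shape (T, d_i)
    Y          : observed outputs y^t, shape (T, d_out)
    lam        : the parameter lambda > 0
    """
    # Squared error of every sample under every internal model, shape (n_models, T).
    se = np.stack([np.sum((Y - m.predict(x)) ** 2, axis=1)
                   for m, x in zip(models, X_in)])

    new_thresholds = []
    for i, (model, x) in enumerate(zip(models, X_in)):
        # Eq. (5.8): accept sample t iff model i wins the competition
        # (smallest squared error among all models) AND the error is
        # below the model's current rejection threshold rho_i^k.
        winner = np.all(se[i] <= np.delete(se, i, axis=0), axis=0)
        accept = winner & (se[i] <= thresholds[i])      # indicator I_i^{t,k}

        if np.any(accept):
            model.fit(x[accept], Y[accept])             # train on accepted samples only
            mse = se[i][accept].mean()                  # Eq. (5.6)
        else:
            mse = thresholds[i] / lam                   # nothing accepted: keep old threshold

        new_thresholds.append(lam * mse)                # Eq. (5.7): rho_i^{k+1}
    return new_thresholds
```

In the degenerate case where a model accepts no samples, the sketch simply keeps the previous threshold; how this case is handled in the actual method is not specified in this section.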

To test the method, two multilayer perceptrons were used as the learning technique, each with 10 units in a single hidden layer. Before each iteration, 250 samples were generated using each of $f_1(x_1^t)$, $f_2(x_2^t)$ and $f_3(x_3^t)$, resulting in a total of $T = 750$ training samples in each training set $S_1^k$ and $S_2^k$. Figure 5.4 shows an example run of the method, using $\lambda = 7$. Figure 5.4(a) shows the result of initializing the multilayer perceptrons using the initial training sets $S_1^0$ and $S_2^0$. Blue dots correspond to correctly assigned samples, and red dots correspond to wrongly assigned samples (on the one hand false negatives, i.e. samples that were rejected but should have been accepted, and on the other hand false positives, i.e. samples that were accepted but should have been rejected). The mean squared error of both internal models is indicated by dashed lines as a belt around the estimates. Figures 5.4(b)–(e) show the situation after iterations 1, 5, 10 and 25, respectively. It can be seen that the method is able to learn both mappings despite the additional noise, although convergence to the final solution is slower than when there is no additional noise (cf. Figure 5.3). Furthermore, while the learned estimates of both internal models are already very precise after iteration 10 (see Figure 5.4(d)), some samples are still wrongly assigned. However, these wrongly assigned samples correspond to “noisy” samples (i.e. samples that were generated using an input that is not known to the internal model) that happen to be similar to the function to be learned. For example, some of these samples will have been generated as $y^t = f_3(x_3^t)$, but by chance this value happens to be close to $f_1(x_1^t)$. Thus, even though these samples are de facto wrongly assigned, they only add a negligible amount of noise to the assigned training samples.
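As an illustration of this setup, the following sketch generates one iteration's training data; the functions $f_1$, $f_2$, $f_3$, the input range and the noise level $\sigma$ are placeholders here, since their actual definitions appear earlier in the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.1                                         # placeholder noise standard deviation
f1, f2, f3 = np.sin, np.cos, (lambda x: 0.5 * x)    # placeholder target functions

def generate_training_sets(n=250, x_range=(-5.0, 5.0)):
    """Build one iteration's training sets S_1^k and S_2^k with T = 3*n samples each."""
    T = 3 * n
    # Each sample t carries an input for every source; only one source produced y^t.
    x1, x2, x3 = (rng.uniform(*x_range, size=T) for _ in range(3))
    y = np.concatenate([f1(x1[:n]), f2(x2[n:2 * n]), f3(x3[2 * n:])])
    y += rng.normal(0.0, SIGMA, size=T)             # additive observation noise
    # S_1^k pairs (x_1^t, y^t) and S_2^k pairs (x_2^t, y^t); samples whose y^t stems
    # from another source act as structured noise for the respective internal model.
    return (x1, y), (x2, y)
```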

To evaluate the influence of the parameter $\lambda$ on the learning performance, the training was systematically repeated for different values, using $\lambda \in \{1, \ldots, 13\}$.


[Figure 5.4: five panels, (a) Initialization, (b) Iteration 1, (c) Iteration 5, (d) Iteration 10, (e) Iteration 25; each panel shows the two estimates over the input range $[-5, 5]$ with outputs in $[-2, 4]$.]

Figure 5.4: Simulation results for the example of learning estimates for two arbitrary functions, using the competitive learning method proposed in this chapter. See text for description.

For each value of $\lambda$, 15 independent training trials were performed, each for 25 iterations. Figure 5.5 shows the mean learning performance of the internal model learning $f_1$, exemplified for the values 1, 7, 10 and 13. Figures 5.5(a), (c), (e) and (g) show the development of the mean squared error, with the standard deviation indicated as a colored belt around the mean value. For comparison, 50 feedforward networks (with the same network architecture as the ones used for the internal models) were trained using only correct training samples, and their mean performance was computed to obtain a reference performance level, which is shown as a dotted red line in the plots. Figures 5.5(b), (d), (f) and (h) show the number of noise samples (i.e. ones that were computed as $y^t = f_3(x_3^t)$) that were wrongly accepted by the internal model in blue, and the total number of wrong decisions for the internal model (both false negatives and false positives) in red, again showing the mean as a solid line and the standard deviation as a colored belt. As stated above, samples that do not belong to the respective target function are randomly distributed in the input space, with some non-trivial distribution in the $y$-dimension.

However, there is a certain probability that samples will fall onto, or lie close to, the target function. These samples are by chance similar to the target function and thus cannot be distinguished from correct samples by the system. The expected numbers of samples for which this happens¹ are shown as dotted lines, on the one hand for samples originating from $f_3$ in blue, and on the other hand for samples originating from either $f_2$ or $f_3$ in red. Thus, for an optimally performing system, the solid blue line should coincide with the dotted blue line, meaning that the mean number of wrongly accepted noise samples equals the expected number of noise samples that are close to the target function; likewise, the solid red line should coincide with the dotted red line, meaning that the mean total number of false positives and false negatives equals the expected total number of samples that do not correspond to the target function but happen to lie close to it.
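The expected numbers of such coincidental matches can be estimated numerically along the lines of the criterion given in footnote 1. The sketch below, reusing the placeholder functions from the data-generation sketch above, counts for a batch of samples how often a foreign sample happens to lie close to the target function $f_1$; it is an assumption-laden illustration rather than the evaluation code used here.

```python
import numpy as np
# f1, f2, f3 are the placeholder functions from the data-generation sketch above.

def expected_coincidences(x1, x2, x3, lam, sigma):
    """Estimate how many foreign samples are indistinguishable from f1 by chance.

    Following the criterion in footnote 1, a sample is counted when the value
    produced by another function lies within lam * sigma**2 (squared distance)
    of f1 evaluated at the input x_1^t known to internal model 1.
    """
    close_f2 = (f1(x1) - f2(x2)) ** 2 < lam * sigma ** 2
    close_f3 = (f1(x1) - f3(x3)) ** 2 < lam * sigma ** 2
    n_from_f3 = int(np.count_nonzero(close_f3))                     # blue dotted line
    n_from_f2_or_f3 = n_from_f3 + int(np.count_nonzero(close_f2))   # red dotted line
    return n_from_f3, n_from_f2_or_f3
```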

Several things can be observed: First of all, the best performance is achieved with the choice of $\lambda = 7$, where after around 15 iterations the mean model performance (in terms of the mean squared error) has reached the reference performance level and training samples are optimally assigned. Choosing a too small value of $\lambda$ results in a situation in which the models reduce the rejection threshold too much, eventually producing a positive feedback loop of overfitting: As the mean squared error of the models gets lower during training, fewer training samples are accepted in the following iterations, causing the models to overfit due to the small number of training samples, which in turn further reduces the number of accepted training samples. Figure 5.6 shows an example where the choice of $\lambda = 1$ has caused such a situation. This is mirrored in Figures 5.5(a)–(b), where on the one hand it can be seen that the internal model does not converge to a good performance, and on the other hand only few noise samples are accepted while at the same time the number of false negatives is very high.

¹The expected number of samples that are close to the target value was numerically estimated by keeping track of the number of times that this happened across all training trials. During the generation of training samples, normally distributed noise with variance $\sigma^2$ was added to the $y^t$. Thus, optimally performing internal models should produce a mean squared error of $\mathrm{MSE} = \sigma^2$. Therefore, in each trial the number of times that $\|f_1(x_1^t) - f_2(x_2^t)\|^2 < \lambda \cdot \sigma^2$ and $\|f_1(x_1^t) - f_3(x_3^t)\|^2 < \lambda \cdot \sigma^2$ were counted, providing an estimate of the expected number of samples for which this is the case.


[Figure 5.5: eight panels; (a), (c), (e), (g) show the MSE over iterations 0–25 and (b), (d), (f), (h) show sample counts over iterations 0–25, for $\lambda = 1$, $7$, $10$ and $13$, respectively.]

Figure 5.5: (a), (c), (e), (g) Mean model performance; (b), (d), (f), (h) mean number of wrongly accepted noise samples (blue) and mean number of false negatives and false positives (red). See text for description.

[Figure 5.6: two panels showing the two internal models' estimates over the input range $[-5, 5]$.]

Figure 5.6: Example where the choice of a too low value for the parameter $\lambda = 1$ has led to overfitting of the internal models. The internal model on the left is close to the target function in some regions of the input space, which allows it to maintain a relatively low mean squared error. However, in other regions it has drifted away from the target function, thus producing many false negatives, which keeps the internal model from finding the target function. The internal model on the right has found a sub-optimal solution, rejecting too many samples, which also causes overfitting.

[Figure 5.7: two panels showing the two internal models' estimates over the input range $[-5, 5]$.]

Figure 5.7: Example where the choice of a too large value for the parameter $\lambda = 13$ keeps the internal models from reaching optimal performance. Many false positives retain large amounts of noise in the training sets of the internal models, causing them to show sub-optimal learning performance.

In contrast, choosing a too large value of $\lambda$ causes the model to continue producing large numbers of false positives. As can be seen in Figures 5.5(g)–(h) for the case of $\lambda = 13$, the performance of the model does improve over the iterations, but never reaches optimality, and the number of accepted noise samples remains large. Figure 5.7 shows an example training result for this choice of parametrization.

In conclusion, it can be said that with the right choice of the parameter $\lambda$ (in this case $\lambda = 7$), the proposed method can successfully self-organize the learning process even in the presence of large amounts of noise. The performance of the model could be improved further if the global rejection threshold used by each internal model were replaced by locally varying thresholds. This could be achieved by estimating the model performance not with a single value (here the mean squared error in conjunction with the parameter $\lambda$), but separately for subregions of the input space. This would prevent situations in which an internal model that on average produces good estimates for large parts of the input space is unable to approach the target function in other regions, where it produces many false negatives instead.
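A minimal sketch of such locally varying thresholds, under the assumption of a simple equal-width binning of the one-dimensional input space (not part of the proposed method), could look as follows:

```python
import numpy as np

def local_thresholds(x, sq_err, accepted, lam, x_range=(-5.0, 5.0), n_bins=10):
    """Compute one rejection threshold per input-space bin instead of a single global one.

    x        : 1-D inputs x_i^t of all samples offered to internal model i
    sq_err   : squared errors SE_i^{t,k} of the model on those samples
    accepted : boolean indicators I_i^{t,k} from the previous assignment
    """
    edges = np.linspace(*x_range, n_bins + 1)
    bin_idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)

    # Bins without accepted samples fall back to the global estimate.
    global_mse = sq_err[accepted].mean() if np.any(accepted) else np.inf
    thresholds = np.full(n_bins, lam * global_mse)
    for b in range(n_bins):
        mask = accepted & (bin_idx == b)
        if np.any(mask):
            thresholds[b] = lam * sq_err[mask].mean()    # local MSE estimate for bin b
    return thresholds

# During assignment, sample t would be tested against the threshold of its own bin:
# accept if sq_err[t] <= thresholds[bin_idx[t]] and the winner-take-all condition holds.
```

The acceptance test of Eq. (5.8) would remain unchanged except that $\rho_i^k$ is looked up per bin, which would keep a model with a low average error from rejecting all samples in a region where its estimate is still poor.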