
Figure 4.6: Development of the Ignorance over the EM iterations for different values of the size regularization parameter b, with a = 1, sx = 4, sy = 1 and σmin = 10^−4.

Therefore, we decreased the overall score from 1 to 0.81 by combining the rescaling of the clusters (through parameter sx) with the added size and renormalization (through parameter b). It is interesting to note that applying the added size b without scaling does not reduce the error in this manner.

4.4 Cluster weights

Another important set of parameters are the weights associated with the clusters.

Similar to the variance distribution shown in figure 4.3, the spread between the weights can become quite large without regularization, though not over several orders of magnitude as seen for the variances (see figure 4.7). Still, the speckling of the distribution could stem from a combination of a small variance with high cluster weights, hence the modeling might benefit from weight regularization as well.

We basically use the same method we already used for the variances [11]:

wm_new = (wm + bw) / (1 + M·bw),

where M is the number of clusters. Again, the regularization parameter bw specifies the weight increase, and for bw → ∞ we get a uniform distribution. Since the weights wm are probabilities, we must ensure that Σm wm = 1 after regularization, so no additional scaling of the parameter can be done.
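As a minimal numerical sketch of this update (assuming the regularized weight is computed as (wm + bw)/(1 + M·bw) for M clusters, so that the weights keep summing to one; the function name is our own):

```python
import numpy as np

def regularize_weights(w, b_w):
    """Add b_w to every cluster weight and renormalize so that the
    weights remain a probability distribution (they still sum to 1)."""
    w = np.asarray(w, dtype=float)
    return (w + b_w) / (1.0 + len(w) * b_w)

w = np.array([0.7, 0.2, 0.1])          # strongly unbalanced weights
print(regularize_weights(w, 0.0))      # b_w = 0: weights unchanged
print(regularize_weights(w, 0.2))      # moderate b_w: spread reduced
print(regularize_weights(w, 1e6))      # b_w -> infinity: nearly uniform
```

Note that the denominator 1 + M·bw is exactly the renormalization required by the constraint Σm wm = 1.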

[Figure: cluster weights plotted against the cluster index (0–20), with weights between roughly 0.02 and 0.14.]

Figure 4.7: Cluster weights wm for the Chua example from figure 4.1.

[Figure: Empirical Ignorance vs. number of iterations for bw = 0, 0.1, 0.2, 0.3 and 0.5, with a zoomed inset around iterations 9–12.]

Figure 4.8: Development of the Ignorance score over the EM iterations for different weight regularization parameters bw. The zoomed region contains the area with the minimum score, showing that the difference is very small.

It turns out, though, that weight regularization does not have the effect on the model's output one would hope for. Figure 4.8 shows the development of the Ignorance score over the EM iterations, with bw ranging from 0 to 0.5 and the other regularization parameters being sx = 3, b = 0.5 and σmin = 10^−4. While the lowest score is reached for bw = 0.1, as can be seen in the zoomed region, the difference is minuscule. And while the overfitting after roughly 20 iterations is somewhat reduced through the weight regularization, this alone hardly justifies an additional parameter. It seems that the PCTR and size regularization already influence the weight distribution, so that an additional, explicit regularization does not yield a further reduction of the model error.

Active Learning

So far we have dealt with the construction of good models based on some data set that was simply given beforehand. In this chapter, we will discuss the problem of how such a data set can be obtained efficiently in the first place. This question is motivated by the fact that in practice, data is usually obtained through some kind of experiment, which is often a costly process, be it in terms of time or money. In such a situation, it is therefore of great importance to choose query points carefully so as not to waste resources.

In general, the aim of creating an experimental design is to get a good model with as few data points as possible, which is known under the term Response Surface Modeling (RSM). However, there often exist additional objectives for which the constructed model should be used:

• Screening, which aims to find those input components with the greatest impact on the output variable.

• Optimization, i.e., finding the point in input space which maximizes or minimizes the target output.

• Robustness and Stability, meaning that small deviations in the input variables should not lead to drastic changes in the target output.

The problem of creating such an experimental design is of course not new, and there is a whole branch of statistics dedicated to this topic, known as Optimal Experiments [25], or more generally as Design of Experiments (in the following abbreviated as DoE), which will be discussed in the first part of this chapter. DoE is well understood for linear models and Gaussian errors, and here it turns out that the data set can be constructed beforehand, meaning that it is completely independent of the actual output of the experiment. After a short review of the linear case, we will then deal with nonlinear models, leading to Active Learning, where, in contrast to DoE, the current model output is integrated into the data set construction.



5.1 Design of experiments (passive learning)

In the following, an experiment is defined as a procedure to obtain a target output variable y, dependent on some controllable factors x(1), . . . , x(d). Given a fixed number N, which defines the number of times the experiment can be performed, we call the chosen query points optimal when we gain a maximum amount of information from the resulting data set

Ω = {(x1, y1), (x2, y2), . . . , (xN, yN)}, (5.1)

where x = (x(1), . . . , x(d)) ∈ R^d. This directly leads us to the question of how we should quantify information, and how we can derive optimality criteria from this quantity.

The classical theory of DoE deals with linear parametrized models, meaning models of the form

f(x) = Σ_{i=1}^{M} βi · gi(x), (5.2)

with gi(x) being some suitable basis functions (e.g. monomials or radial basis functions) and βi ∈ R being the linear coefficients of the model. With β = (β1, . . . , βM)^T and G(x) = (g1(x), . . . , gM(x)), we can write this as

f(x) = G(x) · β. (5.3)

Given the above data set (5.1) with N > M, meaning we have more data points than basis functions (and preferably many more), we can estimate these parameters through a least squares approximation by minimizing

Σ_{n=1}^{N} (yn − G(xn)·β)² = ∥Y − G(X)·β∥², (5.4)

with Y = (y1, . . . , yN)^T and

G(X) = (G(x1)^T, . . . , G(xN)^T)^T ∈ R^{N×M} (5.5)

being the so-called design matrix, whose n-th row is G(xn). The parameter vector minimizing (5.4) is given by

β̂ = (G(X)^T·G(X))^{−1}·G(X)^T·Y = G(X)⁺·Y, (5.6)

where G(X)⁺ denotes the pseudo-inverse of G(X), as already described in section 2.5.1. This parameter vector β̂ is called the least squares estimate and is a maximum-likelihood estimator, meaning that the probability density that the observations Ω were generated by a parameter vector β has a maximum for β = β̂.
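The estimate (5.6) can be illustrated with a few lines of code; the basis functions, noise level, and true coefficients below are invented for the example, and `np.linalg.pinv` plays the role of the pseudo-inverse G(X)⁺:

```python
import numpy as np

# Basis functions g_i(x): here the monomials 1, x, x^2 (M = 3).
def G(x):
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x**2], axis=-1)

rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])            # "true" coefficients beta_f
X = np.linspace(-3, 3, 50)                        # N = 50 > M = 3 design points
Y = G(X) @ beta_true + rng.normal(0, 0.1, 50)     # noisy observations

# Least squares estimate via the pseudo-inverse, as in eq. (5.6)
beta_hat = np.linalg.pinv(G(X)) @ Y
print(beta_hat)   # close to beta_true
```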

Additionally, it is an unbiased estimator, such that E[β̂] = βf for any true parameter vector βf.

If we now look at the difference between the estimated and the true (unknown) parameter vector βf, we get

β̂ − βf = G(X)⁺·Y − G(X)⁺·G(X)·βf
       = G(X)⁺·(Y − G(X)·βf). (5.7)

We now have to make two important assumptions:

• The true, unknown mapping from X to Y can be described by the linear parametrized model, and

• the measurement error is Gaussian and i.i.d.

Given these two assumptions, Y − G(X)·βf is distributed according to N(0, σ²·I), and consequently β̂ − βf is Gaussian with zero mean and covariance matrix

σ²·G(X)⁺·(G(X)⁺)^T = σ²·(G(X)^T·G(X))^{−1}. (5.8)

From this result, we can see that our aim should be to minimize the matrix given by (G(X)^T·G(X))^{−1}. However, there is no analytical procedure to uniquely minimize the entries of a matrix, hence we need some other kind of optimality criterion based on (5.8).
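This covariance result can be checked empirically; the following sketch (our own, with an invented linear basis and noise level) compares the sample covariance of repeated least-squares estimates against σ²·(G(X)^T·G(X))^{−1}:

```python
import numpy as np

def G(x):
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x], axis=-1)   # simple linear model, M = 2

rng = np.random.default_rng(1)
sigma, beta_f = 0.5, np.array([1.0, 2.0])
X = np.linspace(0, 1, 20)                            # fixed design, N = 20
GX = G(X)
pinv = np.linalg.pinv(GX)

# Repeat the "experiment" many times with fresh Gaussian noise
estimates = np.array([pinv @ (GX @ beta_f + rng.normal(0, sigma, len(X)))
                      for _ in range(20000)])

empirical = np.cov(estimates.T)                      # sample covariance of beta_hat
theoretical = sigma**2 * np.linalg.inv(GX.T @ GX)    # eq. (5.8)
print(np.round(empirical, 4))
print(np.round(theoretical, 4))                      # the two should agree closely
```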

One possibility is to minimize the determinant, leading to a D-optimal experimental design X_d-opt, satisfying

det[(G(X_d-opt)^T·G(X_d-opt))^{−1}] = min. (5.9)

Generally speaking, this criterion minimizes the volume of the confidence ellipsoid of the parameters β. This, however, is just one possible criterion alongside several others for an optimal experimental design. For example, A-optimal designs minimize the trace,

trace[(G(X_a-opt)^T·G(X_a-opt))^{−1}] = min, (5.10)

while E-optimal designs minimize the largest eigenvalue of (G(X)^T·G(X))^{−1}. Among the other optimality criteria, two further ones shall be mentioned, since they directly depend on the actual model error, which for one specific point in input space x ∈ M ⊂ R^d is given by e(x) = G(x)·(β̂ − βf). Given the above two assumptions, that our model is sufficient for describing the process and that the measurement error is Gaussian, this model error is also a zero-mean Gaussian random variable, with variance

Σ = σ²·G(x)·(G(X)^T·G(X))^{−1}·G(x)^T. (5.11)
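To illustrate the criteria above (hypothetical candidate designs, our own code), all three can be evaluated directly from (G(X)^T·G(X))^{−1}; for a simple linear model, a design with points spread to the boundary beats one with points bunched together on every criterion:

```python
import numpy as np

def G(x):
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x], axis=-1)   # linear model in 1D, M = 2

def information_inverse(X):
    GX = G(X)
    return np.linalg.inv(GX.T @ GX)                  # (G(X)^T G(X))^-1

# Two candidate designs with N = 4 points in [-1, 1]
X_spread  = np.array([-1.0, -1.0, 1.0, 1.0])         # points at the boundary
X_bunched = np.array([-0.1, 0.0, 0.0, 0.1])          # points clustered near 0

for name, X in [("spread", X_spread), ("bunched", X_bunched)]:
    C = information_inverse(X)
    print(name,
          "D:", np.linalg.det(C),                    # D-optimality: determinant
          "A:", np.trace(C),                         # A-optimality: trace
          "E:", np.linalg.eigvalsh(C).max())         # E-optimality: largest eigenvalue
# the spread design yields smaller (better) values on all three criteria
```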
