

5.1.3 Datamining

The collection of parameters from the OSC preparation (chapter 3.2), the automation of the main characterisation methods (section 4.2) and the OSC properties deduced from measured data with the help of splines (section 5.1.1) or least-squares fitting (section 5.1.2) lead to a very large amount of data. This data is both too large to be searched manually for patterns and, even for small data sets, exhibits a very complex interdependence. Yet one would like to answer the following two main questions:

• If the preparation parameters are kept the same, how large are the effects of statistical variations on the measured data?

• If some preparation parameters are intentionally varied, is the effect seen in the measured data (if there is any) due to the intentional variations or due to statistical effects?

The process of searching for answers and discovering patterns in data, generally with the help of computers and database software, is called datamining.

The selection of the statistical method for datamining is critical for being able to find patterns which help to explain the correlation between OSC preparation parameters and trends in the measured data. These patterns are sometimes called structural patterns, because they show a structure which helps to understand properties of the data. The goal is to use them to identify latent, i.e. hidden, factors in the data which determine the data's properties. As the dependencies between several variables have to be examined, there is a vital need for multivariate methods, because analysing the effect of only one variable in the presence of other variables only reveals partial information. This is of particular importance for the complex OSC system. The data has to be analysed for the effect of intentional as well as unintentional variations of preparation parameters on the experimental results.

The principal component analysis (PCA) is one of the best known multivariate methods. It has proven to be a suitable method for the analysis of the data and was chosen also with a view to the development of a statistical model (section 6.3). With its help it is possible to determine the variables which have the largest influence on the properties of the data, find the correlations between the variables and thus detect patterns in the data.

Principal Component Analysis

One suitable method to analyse the correlations and interdependencies between not only two, but several variables is the principal component analysis (PCA) [72, 78, 79]. This method is also known as the Karhunen-Loève transform and is probably the oldest and best known multivariate method.

The term variable does not denote a variable in the classical sense, but a realisation function. The variables in this case are given by the parameters recorded during the production process and the measured properties of the OSCs. The realisation of such a variable describes the mapping of an OSC parameter to a scalar value. The realisations of several variables for one OSC yield a data set, e.g. the values of all parameters the OSC has been made with.

The PCA is a multivariate statistical method for numerically examining the correlations in a set of data. The PCA uses the fact that the variance of the data is a measure of its information content. The focus is on analysing the influence of many interdependent variables, and not only one isolated variable, in order to find the dominating correlations between production parameters and OSC characteristics. The theory of the PCA will be described in the following paragraphs, starting with the introduction of multivariate terminology and closing with a small example.

Manufacturing and characterisation of an OSC leads to a data set which can be described as a p × 1 observation vector

\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix},

where the y_i represent the realisations of the preparation parameters (e.g. concentration of the absorber solution, spin-coating speed, etc.) and scalar OSC properties (e.g. open circuit voltage V_oc, etc.) of the OSC the observation vector belongs to. Since the y_i all belong to the same object, they tend to be correlated. Having a set of n observation vectors y_1, y_2, ..., y_n of n OSCs, the observation vectors can be transposed and combined into an n × p data matrix Y:

\mathbf{Y} = \begin{pmatrix} \mathbf{y}_1^T \\ \mathbf{y}_2^T \\ \vdots \\ \mathbf{y}_n^T \end{pmatrix}
= \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1p} \\ y_{21} & y_{22} & \cdots & y_{2p} \\ \vdots & \vdots & & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{np} \end{pmatrix}. \qquad (5.17)
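To make the notation concrete, a minimal Python/NumPy sketch of how such a data matrix is assembled is given below. The three observation vectors and their values are invented placeholders for illustration only, not actual OSC data.

```python
import numpy as np

# Hypothetical observation vectors: each holds the realisations of p = 3
# variables (two preparation parameters and one measured property) for one
# OSC. The numbers are invented placeholders, not measured values.
y1 = np.array([20.0, 1500.0, 0.58])   # e.g. concentration, spin speed, V_oc
y2 = np.array([22.0, 1500.0, 0.61])
y3 = np.array([20.0, 2000.0, 0.55])

# Stacking the transposed observation vectors as rows gives the n x p
# data matrix Y (here n = 3 OSCs, p = 3 variables).
Y = np.vstack([y1, y2, y3])
print(Y.shape)   # (3, 3)
```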

The first subscript represents a particular OSC, whereas the second subscript corresponds to a particular variable. Thus one row contains the observed values of all variables for the same OSC, whereas a column lists the values of one variable for all OSCs. An alternative way of writing Y is using the columns:

\mathbf{Y} = (\mathbf{y}^{(1)}, \mathbf{y}^{(2)}, \ldots, \mathbf{y}^{(p)}), \qquad (5.18)

where e.g. y^(2) is the n × 1 vector containing the observations of the second variable on all n OSCs. This notation is useful when considering the realisation of one variable for all OSCs, e.g. for the calculation of the mean or standard deviation of a variable.

Plotting the data matrix results in a scatter plot with n data points in a Cartesian coordinate system spanned by the p axes of the variables. The approach of the PCA is to find a new set of axes for the data such that the first new axis in the scatter plot lies in such a way that the variance of the data set (a measure of information content) is maximised along this axis. The second axis is chosen to capture the second highest variance while being orthogonal to the first axis, and so forth. All new axes are linear combinations of the axes given by the original variables. The transformation of the axes can be obtained as follows:

Let one data set be described by an n × p-dimensional data matrix Y as defined in (5.17). The p × p-dimensional covariance matrix S has the elements

s_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} \left( y_{ij} - \bar{y}^{(j)} \right)\left( y_{ik} - \bar{y}^{(k)} \right) = \mathrm{Cov}\left(\mathbf{y}^{(j)}, \mathbf{y}^{(k)}\right), \qquad (5.19)

where ȳ^(j) denotes the average value of y^(j), i.e. the average of the jth variable. A diagonal element s_jj of S equals the variance of the jth variable.
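As a cross-check of equation (5.19), the covariance matrix can be computed directly from the data matrix. The following sketch assumes Y is an n × p NumPy array as above and is equivalent to NumPy's built-in estimator.

```python
import numpy as np

def covariance_matrix(Y):
    """Sample covariance matrix S with elements s_jk as in eq. (5.19).

    Y is an n x p data matrix: one row per OSC, one column per variable.
    """
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)        # subtract the column means  y_bar^(j)
    return Yc.T @ Yc / (n - 1)     # p x p symmetric matrix

# Agrees with NumPy's estimator:
# np.allclose(covariance_matrix(Y), np.cov(Y, rowvar=False))
```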

S is symmetric and positive definite. Hence it is possible to diagonalise it through an orthogonal transformation such that

\mathbf{\Lambda} = \mathbf{\Gamma}^T \mathbf{S} \mathbf{\Gamma}, \qquad (5.20)

with Λ being diagonal and having the positive eigenvalues λ_α (α = 1, ..., p) of S, chosen in decreasing order, as diagonal entries. Since S is positive definite, its eigenvalues are positive. The matrix Γ for the transformation has the corresponding normalised eigenvectors ê_α as columns. The eigenvectors are called principal component vectors (PCV), because they span the new coordinate system. The new data matrix Z is composed of p variables z^(α) and defined as
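A short sketch of this diagonalisation step: NumPy's eigh routine is used here because S is symmetric, and the eigenpairs are reordered so that the eigenvalues appear in decreasing order, as assumed in the text.

```python
import numpy as np

def principal_axes(S):
    """Diagonalise the covariance matrix S as in eq. (5.20).

    Returns Lambda (diagonal matrix of eigenvalues in decreasing order)
    and Gamma (normalised eigenvectors, i.e. the PCVs, as columns).
    """
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh exploits the symmetry of S
    order = np.argsort(eigvals)[::-1]      # largest eigenvalue first
    return np.diag(eigvals[order]), eigvecs[:, order]
```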

\mathbf{Z} = \mathbf{Y}\mathbf{\Gamma}. \qquad (5.21)

This equals an orthogonal transformation of the original data matrix Y by the eigenvectors of the corresponding covariance matrix S. The z^(α) are called principal components (PCs) of Z, i.e. the variables describing the data in the transformed coordinate system. The covariance matrix of Z is diagonal and the diagonal entries Cov(z^(α), z^(α)) = Var(z^(α)) are given by λ_α. Hence the new variables z^(α) are uncorrelated. Given that the variance measures the information content and the principal components z^(α) are sorted by descending λ_α, the first few principal components can often adequately describe the variations in the data, because they account for most of the variance. Thus a reduction of the number of variables needed to describe the data is often possible while retaining most of the information content. However, the core interest in this work is finding interdependencies in the data.
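The projection (5.21) and the variance bookkeeping can then be written as follows. The sketch assumes Y has already been mean-centred, so that the column variances of Z equal the eigenvalues λ_α.

```python
import numpy as np

def to_principal_components(Y, Gamma):
    """Z = Y Gamma (eq. 5.21); Y is assumed to be mean-centred."""
    return Y @ Gamma

def explained_variance_ratio(Lambda):
    """Fraction of the total variance carried by each principal component."""
    lam = np.diag(Lambda)          # eigenvalues lambda_alpha
    return lam / lam.sum()
```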

The eigenvectors ê_α contain the principal component coefficients. Their elements ê_i^(α) can be considered as the contribution of y^(i) to the principal component z^(α).

The principal components obtained via S are not scale invariant; consequently, changing the units of a variable, e.g. from seconds to hours, will change both the eigenvectors and the eigenvalues of S. Hence for the analysis it is preferable to have the variables in the data set commensurate, i.e. similar in measurement scale and variance. If the variables are not commensurate, S is replaced by the correlation matrix R, for which the variables y^(i) are additionally normalised by their standard deviation σ_{y^(i)},

\mathbf{y}^{(i)} \rightarrow \tilde{\mathbf{y}}^{(i)} = \frac{\mathbf{y}^{(i)} - \bar{y}^{(i)}}{\sigma_{y^{(i)}}}, \qquad (5.22)

before calculating the principal components as described above. However, when the principal components of R are expressed in terms of the original variables, they will no longer be orthogonal.
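If the variables are not commensurate, the standardisation (5.22) can be applied before the steps above; a minimal sketch:

```python
import numpy as np

def standardise(Y):
    """Centre each variable and divide by its standard deviation (eq. 5.22).

    The covariance matrix of the standardised data is the correlation
    matrix R of the original variables.
    """
    return (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)

# R can also be obtained directly via np.corrcoef(Y, rowvar=False).
```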

One important feature of the PCA, which has to be considered when interpreting the results, is that the PCA does not differentiate between production parameters and OSC properties, i.e. governing and dependent variables. It treats all variables equally and shows only their correlation. Consequently great care is necessary when interpreting the principal components.

[Figure 5.1 appears here: scatter plot of Variable A versus Variable B for the 100 example OSCs, with the directions of the principal component vectors PC 1 and PC 2 drawn on top of the data.]

Figure 5.1: The plot shows the realisations of the two variables A and B for 100 solar cells. The variables A and B can be, e.g., one production parameter and one measured OSC property, respectively. The plot shows that the two are clearly correlated. The principal component vectors (PCV) are the unit vectors ê_PC1 and ê_PC2, and their directions are shown on top of the data. They are linear combinations of the original coordinate basis vectors. The principal components (PCs) are the new variables and give the coordinates of the OSC data with respect to the PCVs.

Still, from the PCA it is possible to identify parameters in the production process of organic solar cells which have a significant influence on the device properties and performance.

The main principles of the PCA will be explained with a small example comprising 100 solar cells described by two arbitrary variables, A and B, which can for example be one preparation parameter and a measured OSC property. The data, after subtracting the mean from A and B, is shown as a scatter plot in figure 5.1.

The covariance matrix S and the principal components are determined as described above. The details of the resulting principal components are shown in table 5.1 and the directions of the two PCVs are drawn on top of the original data in figure 5.1.

In this example, representing the data with only the first PC, i.e. only one dimension, would retain more than 80% of the variance in the data set.
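The whole procedure can be reproduced on synthetic data. The following sketch generates 100 correlated realisations of two variables (invented values; the actual data behind figure 5.1 and table 5.1 are not reproduced here) and prints the loadings and the explained variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the example data: 100 "OSCs" described by two
# correlated variables A and B (placeholder values, not the measured data).
A = rng.normal(0.0, 2.0, 100)
B = 0.3 * A + rng.normal(0.0, 0.5, 100)
Y = np.column_stack([A, B])
Y = Y - Y.mean(axis=0)                 # subtract the mean, as in figure 5.1

S = np.cov(Y, rowvar=False)            # 2 x 2 covariance matrix
eigvals, Gamma = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]      # sort eigenpairs, largest first
eigvals, Gamma = eigvals[order], Gamma[:, order]

Z = Y @ Gamma                          # principal components
print("Gamma (columns = PCVs):\n", Gamma)
print("fraction of variance per PC:", eigvals / eigvals.sum())
```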

        PC 1     PC 2
A       0.850   -0.527
B       0.527    0.850

        Eigenvalue   % of Variance   Cumulated %
PC 1    4.05         82.3            82.3
PC 2    0.75         17.7            100.0

Table 5.1: The results of a PCA on the OSC data shown in figure 5.1. The table on the left shows as row and column headers the original variables and the new variables, i.e. the principal components (PCs), respectively. The matrix in the table is the matrix Γ, which describes the transformation between the original variables and the PCs. The columns of Γ correspond to the normalised eigenvectors of the covariance matrix S of the OSC data. These vectors are called principal component vectors (PCVs) and form the new basis for the OSC data. If ê_A and ê_B are the original basis vectors, then the first unit vector of the new basis is given by ê_PC1 = 0.850 ê_A + 0.527 ê_B. This means that variable A contributes 0.850 and variable B 0.527 to PC 1. The corresponding eigenvalues λ_α of the PCVs are shown in the table on the right. It shows that, when expressing the OSC data with the principal components, PC 1 already accounts for more than 82% of the variance in the data set.