Regression analysis - IWA Publishing would like to thank all of the libraries for pledging to

Regression analysis is a method for determining a formula that describes the relationship between two variables. Knowing this relationship is useful in many situations. Examples from a smart water utility setting include:

• Determine the relationship between the average daily dissolved oxygen concentration and the effluent ammonium concentration, in order to find the optimal dissolved oxygen setpoint;

• Determine the relationship between valve opening and the flow through the valve, for example to design a good control scheme for the valve;

• Determine the relationship between the time elapsed since backflushing a filter and the turbidity, in order to find when turbidity reaches an acceptable level and thus the optimal time to return to normal filter operation.

Linear regression is a simple analysis that determines the best-fitting straight line for predicting one variable, y, from another variable, x. The best fit could in principle be defined in a variety of ways; by convention, however, it is defined as the line that minimises the sum of the squared prediction errors.
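The least-squares criterion can be illustrated with a minimal sketch. The data points below are made up for the illustration and are not taken from the book, and NumPy's `polyfit` is used only to obtain a reference line. The sketch shows that changing the slope or the intercept away from the least-squares values increases the sum of squared errors.

```python
import numpy as np


def sum_squared_errors(a, b, x, y):
    """Sum of squared prediction errors for the line y = a*x + b."""
    residuals = y - (a * x + b)
    return float(np.sum(residuals ** 2))


# Illustrative data (assumed for this sketch, not from the book).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.4, 3.4, 3.9, 4.7])

# Least-squares line; polyfit returns the highest-order coefficient first.
a_best, b_best = np.polyfit(x, y, 1)

sse_best = sum_squared_errors(a_best, b_best, x, y)
sse_steeper = sum_squared_errors(a_best + 0.1, b_best, x, y)  # slope changed
sse_shifted = sum_squared_errors(a_best, b_best + 0.2, x, y)  # intercept changed

print(f"least-squares line: SSE = {sse_best:.4f}")
print(f"steeper line:       SSE = {sse_steeper:.4f}")
print(f"shifted line:       SSE = {sse_shifted:.4f}")
```

Any other straight line through the same data would likewise give a larger sum of squared errors than the least-squares line.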

When prediction errors are defined in this way, it is quite straightforward to determine the parameters of the linear regression, as can be seen from the formula. With as few data points as in this example, the calculation can easily be done by hand. In most cases, however, and especially as datasets get larger, a software package such as Excel is used.

Regressions are mostly calculated in software; however, it is instructive to try one by hand, to assure yourself that it involves little more than adding, subtracting, multiplying and dividing.

Figure 4.11: Various types of regression. Fitted curves shown in the panels: linear, f(x) = 0.665x + 1.286; quadratic, y = 0.019x² + 0.55x + 1.41 (r² = 0.98); fifth-order polynomial, y = 0.095x⁵ – 1.14x⁴ + 4.81x³ – 8.39x² + 6.29x + 0.32 (r² = 0.95); exponential, y = 1.36e^(0.28x) (r² = 0.90); logarithmic, y = 1.60 ln(x) + 1.77 (r² = 0.93).

To find the best-fitting line, calculate the standard deviation of each data set, x and y, and the correlation coefficient r between the two. It is then possible to find the parameters a and b in the linear equation

$$y = ax + b, \qquad a = r\,\frac{\mathrm{stdev}(y)}{\mathrm{stdev}(x)}, \qquad b = \mathrm{mean}(y) - a\,\mathrm{mean}(x),$$

where E(x) denotes the mean of x,

$$\mathrm{stdev}(x) = \sqrt{\frac{1}{n-1}\sum \bigl(x - E(x)\bigr)^{2}}$$

and

$$r = \frac{\sum \bigl(x - E(x)\bigr)\bigl(y - E(y)\bigr)}{\sqrt{\sum \bigl(x - E(x)\bigr)^{2}\,\sum \bigl(y - E(y)\bigr)^{2}}}.$$
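The hand calculation can be sketched in a few lines of plain Python, using the relations a = r·stdev(y)/stdev(x) and b = mean(y) – a·mean(x). The five data points are invented for the illustration and are not the data behind Figure 4.11. The correlation coefficient is computed from deviations about the means, which is an interpretation of the formula for r.

```python
import math

# Illustrative data (assumed for this sketch, not from the book).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.4, 3.4, 3.9, 4.7]


def mean(values):
    return sum(values) / len(values)


def stdev(values):
    """Sample standard deviation with the n - 1 denominator."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Correlation coefficient from deviations about the means."""
    mx, my = mean(xs), mean(ys)
    numerator = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    denominator = math.sqrt(
        sum((xi - mx) ** 2 for xi in xs) * sum((yi - my) ** 2 for yi in ys)
    )
    return numerator / denominator


r = correlation(x, y)
a = r * stdev(y) / stdev(x)   # slope
b = mean(y) - a * mean(x)     # intercept

print(f"r = {r:.3f}, a = {a:.3f}, b = {b:.3f}")  # a = 0.670, b = 1.290
```

For these points the means are 3.0 and 3.3, the sum of the products of deviations is 6.7 and the sum of squared deviations of x is 10, so the slope is 0.67 and the intercept 3.3 – 0.67 × 3 = 1.29.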

Figure 4.12: Principal components analysis – finding the principal components among three variables.

Figure 4.13: Principal components – forming clusters around a normal condition.

Multivariate analysis

Multivariate analysis is a tool for detecting patterns in the operation. A large dataset comprising a number of variables can be thought of as a data “cloud”.

This is illustrated in Figure 4.12, in which the values of three variables are recorded. The three measurements taken at one instant make up one point in the cloud. Usually there are many more variables, and the space is then multidimensional.

Many of the variables are usually correlated, since most of them reflect the same underlying mechanisms that drive the process in different ways. For example, consider the flow rate in a pipe. There may be two different measurements indicating the same thing: data from a flow rate sensor and data on the pump speed. The information in these two variables is strongly correlated, and one of them may be sufficient to indicate the flow rate. Similarly, there are many other measurements that are more or less dependent on each other.
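The flow rate example can be sketched with synthetic data. All signals, units and noise levels below are assumptions made for the illustration: a hidden flow rate drives both a flow sensor reading and a pump speed (assumed here to be proportional to flow), while a third signal is unrelated.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 500

# Hidden underlying mechanism: the actual flow rate (hypothetical units).
true_flow = rng.uniform(50.0, 150.0, size=n)

# Two measurements that reflect the same mechanism, each with its own noise.
flow_sensor = true_flow + rng.normal(0.0, 2.0, size=n)
pump_speed = 10.0 * true_flow + rng.normal(0.0, 30.0, size=n)

# A signal that does not depend on the flow at all.
unrelated = rng.normal(20.0, 1.0, size=n)

# Each row is one variable; corr[i, j] is the correlation between rows i and j.
corr = np.corrcoef(np.vstack([flow_sensor, pump_speed, unrelated]))

print(f"flow sensor vs pump speed: {corr[0, 1]:.3f}")
print(f"flow sensor vs unrelated:  {corr[0, 2]:.3f}")
```

A correlation close to 1 between the first two signals indicates that they carry largely the same information, whereas the unrelated signal shows a correlation near zero.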

Therefore the true dimension of the cloud is usually much lower than the number of measurement signals. By projecting the data cloud onto a lower dimension it is often possible to find key variables that can explain the changing behaviour. In the simplest case, with two variables, a regression line or curve can represent both of them. In Figure 4.12 most of the variation of the cloud takes place along a plane defined by the two orthogonal axes PC1 and PC2. Of course, some data points are located above or below the plane. The distance from a cloud point to the plane (measured along the axis PC3) indicates an error. If the errors are sufficiently small, then most of the variation can be explained by only two variables instead of three. Similarly, there may be ten variables that relate to an alarm situation. However, the changing conditions may be explained by only two variables, each of which is a combination of the ten measurements.
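The projection onto PC1 and PC2 can be sketched as follows. The data are synthetic: three measured variables are generated as assumed linear mixtures of two hidden drivers plus a little noise, so the cloud lies close to a plane. The principal axes are obtained here from a singular value decomposition of the mean-centred data, which is one common way to compute them.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 400

# Two hidden drivers (assumed) and small measurement noise.
t1 = rng.normal(0.0, 3.0, size=n)
t2 = rng.normal(0.0, 1.5, size=n)
noise = rng.normal(0.0, 0.2, size=(n, 3))

# Three measured variables, each a linear mixture of the two drivers.
X = np.column_stack([t1 + 0.5 * t2, 0.8 * t1 - t2, 0.3 * t1 + 0.7 * t2]) + noise

# Centre the cloud; the rows of Vt are the orthogonal axes PC1, PC2, PC3.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = s ** 2 / np.sum(s ** 2)   # share of variance along each axis
scores = Xc @ Vt.T                    # coordinates of each point on PC1..PC3
pc3_error = scores[:, 2]              # signed distance from the PC1-PC2 plane

# Approximate every point using PC1 and PC2 only.
X_plane = scores[:, :2] @ Vt[:2]

print("explained variance share:", np.round(explained, 4))
print("largest distance from plane:", float(np.max(np.abs(pc3_error))))
```

When the first two shares add up to nearly 1, the two scores along PC1 and PC2 can stand in for the three original measurements, and the PC3 coordinate plays the role of the error described above.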

Multivariate analysis had been used for many years in the chemical process industry before it was introduced into the wastewater industry in the late 1990s.

[Figure axis labels: Variable 1, Variable 2, Deviation; X2, X3]

Under normal operating conditions, the cloud is confined within a certain volume, illustrated by the ellipse in Figure 4.13. In other words, the normal range of all the measurements can be defined. In the figure, a low and a high alarm limit have been indicated. There are different methods available to determine how this volume or surface is defined. Often it is possible to use data from normal operation and define the normal variation as a cluster with a certain centre point. Then, if some variables deviate from normal, the cloud will move outside the normal volume and an operator can readily detect that something has happened. In principle the operator gets an automatic detection not only when a single variable has exceeded its permitted amplitude, but also when the combination of many signals has crossed an alarm limit. The analysis then allows backtracking of the data, so that the real physical signals that caused the deviation can be identified. Each one of the variables may have changed within its permitted limits, but the combination of their deviations has caused an alarm.
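A combined alarm of this kind can be sketched with an elliptical limit around the centre of the normal cluster. The text leaves the choice of method open. The squared Mahalanobis distance used here, the 99 % chi-square limit and the ±3 standard deviation single-variable limits are all assumptions made for the illustration, as are the synthetic data.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Synthetic "normal operation" data: two strongly correlated variables.
cov_true = np.array([[1.0, 0.9], [0.9, 1.0]])
normal_data = rng.multivariate_normal([0.0, 0.0], cov_true, size=1000)

# The normal cluster: centre point, covariance and per-variable spread.
centre = normal_data.mean(axis=0)
cov = np.cov(normal_data, rowvar=False)
cov_inv = np.linalg.inv(cov)
sigma = np.sqrt(np.diag(cov))

# Assumed limit: 99 % quantile of chi-square with 2 degrees of freedom.
ALARM_LIMIT = 9.21


def mahalanobis_sq(sample):
    """Squared distance from the cluster centre, scaled by the covariance."""
    d = sample - centre
    return float(d @ cov_inv @ d)


def within_single_limits(sample, n_sigma=3.0):
    """True if every variable is inside its own +/- n_sigma band."""
    return bool(np.all(np.abs((sample - centre) / sigma) < n_sigma))


typical = np.array([1.0, 0.9])     # follows the usual correlation
unusual = np.array([1.5, -1.5])    # each value modest, combination abnormal

for name, sample in [("typical", typical), ("unusual", unusual)]:
    d2 = mahalanobis_sq(sample)
    print(name, "single limits ok:", within_single_limits(sample),
          "combined alarm:", d2 > ALARM_LIMIT)
```

Both samples pass the single-variable checks, but only the second one breaks the usual relationship between the two variables and therefore falls outside the ellipse.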

The best-known multivariate method is Principal Component Analysis (PCA). However, standard PCA is insufficient for data that are highly variable in time, such as influent flow rates and compositions in wastewater systems. Standard PCA assumes that there is a linear relationship between the variables. If there are nonlinear relationships, then the PCA has to be adapted to this case, in analogy with the simple regression methods above. Furthermore, the wide range of time constants in a wastewater treatment system makes it difficult to look at correlations of data on just one time scale.

Consequently, more sophisticated methods have been developed to deal with these challenges. One is called adaptive PCA. Another, multi-scale PCA, decomposes the measurement data into different time scales. The various multivariate methods are powerful tools for giving the operator an early warning when the operation deviates from normal.

Hydraulic modelling
