
Generalized Linear Models and Network Analysis – Project 1 (a/b)


Academic year: 2021


The data file BacteriaData.xlsx (available from the link below) contains the number of colony-forming units (cfu's) of airborne bacteria and of fungi in the air in and around Graz. Some students may want to analyze the bacteria data, others the fungi data.

1. (Data Access) Access the data. You find the data either at http://www.stat.tugraz.at/courses/files/BacteriaData.xlsx (sheet bacteria) or by clicking on Bacteria Data at http://www.stat.tugraz.at/courses/glmLjubljana.html (the latter only for those of you having trouble using read.xls from within R).
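For readers who cannot get read.xls to work, one possible alternative is sketched here. It assumes the readxl package is installed, which the assignment does not mention; the helper name load_cfu_data and the local file name are made up for illustration.

```r
# Hypothetical helper: download the workbook once and read one sheet.
# Assumes the 'readxl' package is installed; the assignment itself refers to read.xls.
load_cfu_data <- function(sheet = "bacteria",
                          url = "http://www.stat.tugraz.at/courses/files/BacteriaData.xlsx",
                          destfile = "BacteriaData.xlsx") {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the binary .xlsx file intact on Windows
    download.file(url, destfile, mode = "wb")
  }
  as.data.frame(readxl::read_excel(destfile, sheet = sheet))
}

# Usage (needs network access on the first call):
# dat <- load_cfu_data("bacteria")
# str(dat)
```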

The data come from a one-year study in which bacteria/fungi (more precisely, the number of colony-forming units, cfu's) in the outdoor air were monitored at 7 different sites, characterized as follows:

(a) village zone, near big farms with liquid manure pits and dung-hills;

(b) grassland and arable land, without buildings;

(c) suburban area with one-family houses and small farms;

(d) busy crossing, near a slaughter-house;

(e) public park on top of the Schloßberg in the center of Graz;

(f) living area with apartment buildings and gardens;

(g) as for (f) but with compost arrangements.

2. (Data Manipulations) Treat the variable site as a factor with 7 labels.

The concentration of airborne bacteria/fungi was observed every 2 weeks, together with the temperature (temp) and humidity (humi) at that time. The gauge (measurement equipment) was a six-stage microbial air sampler (Andersen). The variables b1, ..., b6 (or f1, ..., f6) are the cfu counts observed on each stage j = 1, ..., 6 of the gauge, obtained from 128.3 liters of air. Define the variable bac (or fun) as the total number of cfu's (sum of b1, ..., b6 or sum of f1, ..., f6) in 1 m³ of air.
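Since each sample covers 128.3 liters and 1 m³ is 1000 liters, the stage counts have to be summed and scaled by 1000/128.3. A minimal sketch of this step follows; the toy rows are invented, and the integer coding 1 to 7 for site is an assumption about the sheet.

```r
# Toy rows standing in for the real sheet (values invented, not taken from BacteriaData.xlsx)
dat <- data.frame(
  site = c(1, 2, 7),
  b1 = c(3, 0, 5), b2 = c(4, 1, 2), b3 = c(6, 2, 8),
  b4 = c(10, 3, 12), b5 = c(7, 1, 9), b6 = c(2, 0, 4)
)

# site as a factor with 7 labels, (a) to (g); assumes sites are coded 1..7
dat$site <- factor(dat$site, levels = 1:7, labels = letters[1:7])

# Sum the six stages, then rescale from 128.3 liters to 1 m^3 = 1000 liters
liters_sampled <- 128.3
dat$bac <- rowSums(dat[, paste0("b", 1:6)]) * 1000 / liters_sampled
```

For the fungi variant, the same lines apply with f1, ..., f6 and fun in place of b1, ..., b6 and bac.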

3. (Linear Regression) Concentrate on the response variable bac (or fun) and analyze its linear relationship with humi, temp, and site. Don't consider date here, because this information should be sufficiently described by the temperature and the humidity observed on the same day.

Find the best linear model for the response variable bac (or fun). Also check whether an interaction between temperature and humidity is needed. Don't forget to additionally check the relevance of the quadratic effects temp^2 and humi^2 in your model. Such effects help to account for an optimal temperature and/or optimal humidity which bacteria (or fungi) like most. Denote your model by mylm.

Assess your resulting regression model with respect to departures from the normal distribution and from the assumption of constant variance (homoscedasticity) by means of suitable plots.
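The model search and the diagnostic plots could look roughly like the sketch below. It runs on synthetic data invented for illustration, and backward selection by AIC via step() is only one of several possible selection strategies, not one prescribed by the assignment. Note that inside an R formula the quadratic terms must be written as I(temp^2) and I(humi^2); a bare temp^2 is interpreted as temp.

```r
# Synthetic stand-in data (invented; replace with the real data frame)
set.seed(1)
n <- 140
sim <- data.frame(
  site = factor(rep(1:7, each = 20)),
  temp = runif(n, -5, 30),
  humi = runif(n, 30, 95)
)
sim$bac <- exp(5 + 0.05 * sim$temp - 0.001 * sim$temp^2 + rnorm(n, sd = 0.5))

# Full candidate model: site, temp-by-humi interaction, and both quadratic effects
full <- lm(bac ~ site + temp * humi + I(temp^2) + I(humi^2), data = sim)

# One possible selection strategy (an assumption): backward selection by AIC
mylm <- step(full, trace = 0)
summary(mylm)

# Residuals vs. fitted (constant variance) and normal Q-Q plot (normality)
par(mfrow = c(1, 2))
plot(mylm, which = 1)
plot(mylm, which = 2)
```

A funnel shape in the first plot or a curved Q-Q plot would suggest that a transformation of the response is worth trying.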

4. (Box-Cox Transformation) Now search for the optimal Box-Cox transformation. Use a meaningful value close to the estimate λ̂ just found and transform your response variable.

Check if all predictors from the linear model are still necessary. Denote the resulting model by mylmBC and test both the general necessity of such a transformation (H0: λ = 1) and the adequacy of a log-transformation (H0: λ = 0). What are your findings?

Compare the goodness-of-fit of the linear regression model mylm with that of the optimal Box-Cox model mylmBC.

Has the structure in the respective diagnostic plots from the Box-Cox-model now improved (compared with that from the multiple linear regression model from before)?
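A minimal sketch of the Box-Cox step, assuming the MASS package and synthetic data invented for illustration. The predictor set, the grid for λ, and the rounding of λ̂ to the nearest multiple of 0.5 are assumptions, not choices made by the assignment. The two hypotheses are tested with likelihood-ratio statistics read off the profile log-likelihood.

```r
library(MASS)  # provides boxcox()

# Synthetic stand-in data (invented; replace with the real data frame)
set.seed(2)
sim <- data.frame(
  site = factor(rep(1:7, each = 20)),
  temp = runif(140, -5, 30),
  humi = runif(140, 30, 95)
)
sim$bac <- exp(5 + 0.05 * sim$temp - 0.001 * sim$temp^2 + rnorm(140, sd = 0.5))

# Stand-in for mylm; the predictor set is an assumption
mylm <- lm(bac ~ site + temp + I(temp^2) + humi, data = sim)

# Profile log-likelihood of lambda on a grid; its maximizer is the estimate
bc <- boxcox(mylm, lambda = seq(-1, 2, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Likelihood-ratio test of H0: lambda = lambda0 (chi-square with 1 df)
lr_pvalue <- function(lambda0) {
  at_null <- bc$y[which.min(abs(bc$x - lambda0))]
  pchisq(2 * (max(bc$y) - at_null), df = 1, lower.tail = FALSE)
}
p_identity <- lr_pvalue(1)  # is a transformation needed at all?
p_log <- lr_pvalue(0)       # is the log-transformation adequate?

# Box-Cox transform; lambda = 0 corresponds to the logarithm
bc_transform <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
lambda_used <- round(lambda_hat * 2) / 2  # one possible "meaningful" value
sim$bac_bc <- bc_transform(sim$bac, lambda_used)
mylmBC <- lm(bac_bc ~ site + temp + I(temp^2) + humi, data = sim)
```

When comparing mylm with mylmBC, keep in mind that their residual sums of squares and plain log-likelihoods live on different response scales; the profile log-likelihood in bc already includes the Jacobian of the transformation and is therefore comparable across values of λ.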

5. (Generalized Linear Model) Now try to model the cfu’s directly by using a GLM based on a normal distribution but a log-link function to assure positive means. How does the model fit compare with the one when using a standard linear model with identity link function (as before in mylm)?

Would a GLM based on gamma responses with log-link function even give better results?

Try to graphically compare the prediction regions under the normal and under the gamma model (both based on using the log-link) for site 6 and humidity 60% depending on temperature (shown on the horizontal axis).
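The two GLMs and their prediction regions might be set up as in the following sketch. The data are synthetic, the linear predictor is an assumption, and the 95% regions are simple plug-in quantiles of the fitted distributions that ignore the uncertainty of the parameter estimates.

```r
# Synthetic stand-in data (invented; replace with the real data frame)
set.seed(3)
sim <- data.frame(
  site = factor(rep(1:7, each = 20)),
  temp = runif(140, -5, 30),
  humi = runif(140, 30, 95)
)
mu_true <- exp(5 + 0.05 * sim$temp - 0.001 * sim$temp^2)
sim$bac <- rgamma(140, shape = 4, rate = 4 / mu_true)

form <- bac ~ site + temp + I(temp^2) + humi  # assumed linear predictor
glm_norm <- glm(form, family = gaussian(link = "log"), data = sim)
glm_gamma <- glm(form, family = Gamma(link = "log"), data = sim)
AIC(glm_norm, glm_gamma)

# Prediction grid: site 6, humidity 60%, temperature varying
grid <- data.frame(
  site = factor("6", levels = levels(sim$site)),
  temp = seq(-5, 30, by = 1),
  humi = 60
)
mu_norm <- predict(glm_norm, newdata = grid, type = "response")
mu_gamma <- predict(glm_gamma, newdata = grid, type = "response")

# Normal model: constant variance equal to the estimated dispersion
sigma_norm <- sqrt(summary(glm_norm)$dispersion)
lo_norm <- qnorm(0.025, mean = mu_norm, sd = sigma_norm)
hi_norm <- qnorm(0.975, mean = mu_norm, sd = sigma_norm)

# Gamma model: shape = 1 / dispersion, scale = mean * dispersion
phi <- summary(glm_gamma)$dispersion
lo_gamma <- qgamma(0.025, shape = 1 / phi, scale = mu_gamma * phi)
hi_gamma <- qgamma(0.975, shape = 1 / phi, scale = mu_gamma * phi)

# Dashed black: normal region; solid red: gamma region
matplot(grid$temp, cbind(lo_norm, hi_norm, lo_gamma, hi_gamma), type = "l",
        lty = c(2, 2, 1, 1), col = c("black", "black", "red", "red"),
        xlab = "temp", ylab = "bac (cfu per m^3)")
```

The normal region has the same width at every temperature and may extend below zero, whereas the gamma region stays positive and widens as the mean grows.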

6. (Multilevel Model) Also consider a Gaussian multilevel analysis with macro level date (this treats all the observations from one and the same day as correlated).

In order to ensure that the responses can be considered to stem from a normal population, transform them in the same way as in model mylmBC and also use the same linear predictor.

First convert the date (used to specify the groups) to a factor. Do we still need all the predictors or are some of them now redundant? Carefully interpret the results of this final model.

Also try a multilevel model that assumes gamma distributed responses and uses a log-link function. How does this compare to the normal model from before?
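One way to set up both multilevel models with packages shipped with R is sketched below, on synthetic data invented for illustration. The log-transformation stands in for whatever transformation was chosen in mylmBC, and the linear predictor is an assumption. The gamma model is fitted with MASS::glmmPQL, which uses penalized quasi-likelihood rather than full maximum likelihood; this is a swapped-in approximation, and other tools (for example glmer from the lme4 package) could be used instead.

```r
library(nlme)  # lme() for the Gaussian multilevel model
library(MASS)  # glmmPQL() for the gamma multilevel model

# Synthetic stand-in data: 26 sampling days x 7 sites (invented values)
set.seed(4)
n_dates <- 26
sim <- expand.grid(site = factor(1:7), date = 1:n_dates)
day_temp <- runif(n_dates, -5, 30)
day_effect <- rnorm(n_dates, sd = 0.3)  # shared by all sites on one day
sim$temp <- day_temp[sim$date] + rnorm(nrow(sim), sd = 1)
sim$humi <- runif(nrow(sim), 30, 95)
sim$bac <- exp(5 + 0.04 * sim$temp + day_effect[sim$date] +
               rnorm(nrow(sim), sd = 0.4))

# Convert the grouping variable to a factor first
sim$date <- factor(sim$date)

# Gaussian model with random intercept per date; ML so fixed effects can be compared
mlm_norm <- lme(log(bac) ~ site + temp + I(temp^2) + humi,
                random = ~ 1 | date, data = sim, method = "ML")
mlm_red <- lme(log(bac) ~ site + temp + I(temp^2),
               random = ~ 1 | date, data = sim, method = "ML")
anova(mlm_red, mlm_norm)  # is humi still needed?

# Share of the variance due to the day: correlation of two observations on one day
v <- as.numeric(VarCorr(mlm_norm)[, "Variance"])
icc <- v[1] / sum(v)

# Gamma responses with log-link and the same random intercept (PQL approximation)
mlm_gamma <- glmmPQL(bac ~ site + temp + I(temp^2) + humi,
                     random = ~ 1 | date, family = Gamma(link = "log"),
                     data = sim, verbose = FALSE)
```

Because glmmPQL does not maximize a true likelihood, an AIC-based comparison with the normal model is not available from this fit; comparing fitted values and residual plots is a more cautious option.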
