5 Model development
5.1 The Task
Which explanatory variables shall appear in the model formula, and in which form?
Example: Construction cost of nuclear power plants
Symbol  Explanation                                                          Type     Trsf.
K       Construction cost                                                    amount   log
G       Capacity                                                             amount   log
D       Date of permission                                                   contin.  –
WZ      Waiting time between application and permission                      amount   –
BZ      Construction time                                                    amount   –
Z       Follow-up plant (existing plant on site)                             binary   –
NE      Site in NE of the US                                                 binary   –
KT      Cooling tower                                                        binary   –
BW      Reactor by Babcock-Wilcox                                            binary   –
N       Number of plants built earlier by the same architects/engineers, +1  count    sqrt
KG      Partial price guarantee of the project-leading enterprise            binary   –
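c  First aid transformations:
d  $$\log_{10}(K) = \beta_0 + \beta_1 \log_{10}(G) + \beta_2 D + \beta_3 \mathit{WZ} + \beta_4 \mathit{BZ} + \beta_5 Z + \beta_6 \mathit{NE} + \beta_7 \mathit{KT} + \beta_8 \mathit{BW} + \beta_9 \sqrt{N} + \beta_{10} \mathit{KG} + \text{Error}$$
e  A single term
• t test for a $\beta_j$
• factor (nominal variable) → F test for several $\beta_j$
Does a significance test make sense in this context?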
Coefficients:
               Value  Std. Error  t value  Pr(>|t|)  Signif
(Intercept) -6.02586     2.34729    -2.57     0.018  *
lg10(G)      0.69254     0.13713     5.05     0.000  ***
D            0.09525     0.03580     2.66     0.015  *
WZ           0.00263     0.00955     0.28     0.785
BZ           0.00229     0.00198     1.16     0.261
Z           -0.04573     0.03561    -1.28     0.213
NE           0.11045     0.03391     3.26     0.004  **
KT           0.05340     0.02970     1.80     0.087  .
BW           0.01278     0.04537     0.28     0.781
sqrt(N)     -0.02997     0.01780    -1.68     0.107
KG          -0.09951     0.05562    -1.79     0.088  .
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

5.2 Automatic Model Selection
a  Stepwise backward
b  Delete WZ!
   Delete BW, BZ, Z, √N and KT!
Coefficients:
               Value  Std. Error  t value  Pr(>|t|)  Signif
(Intercept)  -3.4612      1.1458    -3.02     0.005  **
log10(G)      0.6629      0.1295     5.12     0.000  ***
D             0.0610      0.0160     3.82     0.001  ***
NE            0.0831      0.0330     2.52     0.018  *
KG           -0.1844      0.0424    -4.35     0.000  ***

c  Stepwise forward ...
e  All subsets.
f  Criteria
1. “Coefficient of determination” $R^2$ or multiple correlation $R$,
2. Value of the F test statistic for the model,
3. P value for the F test,
4. Estimated variance of the error, $\hat\sigma^2$,
g  5. “Adjusted” coefficient of determination: $R^2_{adj} = 1 - \frac{n-1}{n-p'}\,(1 - R^2)$,
6. $C_p := SSQ^{(E)}/\hat\sigma_m^2 + 2p' - n = n\,(MSE/\hat\sigma_m^2 - 1 + 2p'/n)$, where the second form takes $MSE = SSQ^{(E)}/n$,
7. Akaike's information criterion AIC ≈ $C_p$.
Larger models are not always better!
h  $C_p$ in the example: Add KT and √N! P value for KG: 0.049.
i  Lasso
Penalized regression: penalty on large coefficients.
Minimize
$$Q(\beta; \lambda) = \sum_i R_i^2 + \lambda \sum_j |\beta_j| .$$
$\lambda$: weight of the penalty.
Variation of $\lambda$ → some coefficients = 0 → model selection
• Standardize the $X^{(j)}$ to render the $\beta_j$ comparable.
• Adaptive Lasso: attach weights to the $|\beta_j|$: weights $1/|\hat\beta_j|$ with $\hat\beta_j$ from a preliminary Lasso estimate.
• Lasso is suitable for large numbers of $X^{(j)}$, $p \gg n$. → Genomics, Proteomics
[Figure: Lasso coefficient paths. Horizontal axis: bounds (0.0 to 1.0); vertical axis: coeff | bounds (−0.4 to 0.6); one path per explanatory variable.]
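To see how the L1 penalty sets coefficients exactly to zero, the criterion $Q(\beta; \lambda) = \sum_i R_i^2 + \lambda \sum_j |\beta_j|$ can be minimized by cyclic coordinate descent with soft thresholding. This is a minimal sketch, not the method used to produce the slides: it assumes centered data without an intercept, and the function names and the toy data are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z towards 0 by t; values with |z| <= t become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=100):
    """Minimize sum_i R_i^2 + lam * sum_j |beta_j| by coordinate descent.

    X is a list of rows; columns are assumed centered (no intercept).
    """
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            z = 0.0    # inner product of column j with the partial residuals
            sxx = 0.0  # sum of squares of column j
            for i in range(n):
                partial = y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                z += X[i][j] * partial
                sxx += X[i][j] ** 2
            # The factor 1/2 appears because the residual sum of squares
            # is not halved in Q.
            beta[j] = soft_threshold(z, lam / 2.0) / sxx
    return beta


# Hypothetical toy data with two orthogonal, centered columns.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.5, -1.5, 1.5, -2.5]  # equals 2 * column 1 + 0.5 * column 2

for lam in (0.0, 4.0, 100.0):
    print(lam, lasso_cd(X, y, lam))
```

With $\lambda = 0$ the sketch returns the least squares coefficients; for larger $\lambda$ the smaller coefficient reaches zero first, which is the model selection effect described in item i.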
j  Choice of the weight $\lambda$ of the L1 penalty term:
“Cross validation”, 10-fold.
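The idea of 10-fold cross validation can be sketched as follows: split the observations into 10 folds, fit on 9 of them, predict the held-out fold, and pick the $\lambda$ with the smallest average squared prediction error. The sketch below is only an illustration under strong assumptions: a single centered explanatory variable without intercept, interleaved folds without shuffling, and hypothetical function names, data and $\lambda$ grid.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k disjoint, interleaved folds (no shuffling)."""
    return [list(range(f, n, k)) for f in range(k)]


def fit_lasso_1d(x, y, lam):
    """Minimize sum (y_i - beta x_i)^2 + lam |beta| for one predictor."""
    z = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(z) - lam / 2.0, 0.0)
    return (shrunk if z >= 0 else -shrunk) / sxx


def cv_error(x, y, lam, k=10):
    """Mean squared prediction error over k held-out folds."""
    n = len(y)
    total = 0.0
    for test_idx in kfold_indices(n, k):
        held_out = set(test_idx)
        train = [i for i in range(n) if i not in held_out]
        beta = fit_lasso_1d([x[i] for i in train], [y[i] for i in train], lam)
        total += sum((y[i] - beta * x[i]) ** 2 for i in test_idx)
    return total / n


# Hypothetical, deterministic toy data (centered x, alternating "noise").
x = [i - 9.5 for i in range(20)]
y = [0.5 * xi + (1.0 if i % 2 == 0 else -1.0) for i, xi in enumerate(x)]

grid = [0.0, 1.0, 10.0, 100.0, 1000.0]  # hypothetical candidate values
errors = {lam: cv_error(x, y, lam) for lam in grid}
best_lam = min(grid, key=lambda lam: errors[lam])
print(best_lam, errors[best_lam])
```

In practice the folds would usually be formed at random, and the full multi-variable Lasso fit would replace `fit_lasso_1d`.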
k  Is the “best” model the true model?
Consider several models as the result of the analysis.
Among all “good” models, choose one or more suitable one(s)
by plausibility and subject matter knowledge!
Exploratory data analysis will NOT find the “true” model
but several which fit the data well.
5.3 Collinearity
a  Model $Y = X\beta + E$.
$X$ is singular, and the $X^{(j)}$'s are collinear, if
$\det(X^T X) = 0$ ⟺ there is a $c \ne 0$ with $Xc = 0$ ⟺ there is a $j$ with
$$x_i^{(j)} = \tilde c_0 + \sum_{k \ne j} \tilde c_k\, x_i^{(k)} .$$
The parameters are not unique, since $X\beta = X(\beta + \gamma c)$ for arbitrary $\gamma$.
b  Solution: Delete a column!
Caution: Interpretation of parameters may change!
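The non-uniqueness $X\beta = X(\beta + \gamma c)$ is easy to check numerically. In the following sketch the design matrix, the coefficient vectors and the helper name `mat_vec` are hypothetical; the last column is constructed as the sum of the two preceding ones, so that $c = (0, 1, 1, -1)$ satisfies $Xc = 0$.

```python
def mat_vec(X, b):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, b)) for row in X]


# Hypothetical design: intercept, x1, x2, and x3 = x1 + x2 (exactly collinear).
X = [
    [1.0, 1.0, 2.0, 3.0],
    [1.0, 2.0, 1.0, 3.0],
    [1.0, 3.0, 5.0, 8.0],
    [1.0, 4.0, 0.0, 4.0],
]
c = [0.0, 1.0, 1.0, -1.0]  # X c = 0 because x1 + x2 - x3 = 0

beta = [1.0, 2.0, 3.0, 0.0]
gamma = 5.0  # arbitrary
beta_alt = [b + gamma * cj for b, cj in zip(beta, c)]

print(mat_vec(X, c))         # all zeros
print(mat_vec(X, beta))      # fitted values for beta
print(mat_vec(X, beta_alt))  # identical fitted values for beta + gamma * c
```

Both coefficient vectors give the same fitted values, so the data cannot distinguish between them; deleting the last column removes the ambiguity.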
c  Approximate collinearity → parameters ill-determined
[Figure: sketch with axes $x^{(1)}$, $x^{(2)}$ and $Y$; labels: estimated, model.]
d  Large standard errors of estimates → coefficients insignificant
e  How to detect collinearity?
– Standard errors of the $\hat\beta_j$'s
– Is there a relation $x_i^{(j)} \approx \tilde c_0 + \sum_{k \ne j} \tilde c_k\, x_i^{(k)}$? This is a regression problem!
  Coefficient of determination $R_j^2$ or variance inflation factor $VIF_j = 1/(1 - R_j^2)$
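As a small illustration of the variance inflation factor, consider the special case of only two explanatory variables. Regressing one on the other (with intercept) is a simple regression, so $R_j^2$ equals their squared correlation. The sketch below relies on that special case; the function names and the data are hypothetical. With more variables, $R_j^2$ has to come from a multiple regression of $x^{(j)}$ on all the others.

```python
def squared_correlation(u, v):
    """Squared Pearson correlation; equals R^2 of a simple regression of u on v."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv * s_uv / (s_uu * s_vv)


def vif(r2_j):
    """Variance inflation factor VIF_j = 1 / (1 - R_j^2)."""
    return 1.0 / (1.0 - r2_j)


# Hypothetical data: two nearly collinear variables.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 2.0, 3.0, 5.0]
print(vif(squared_correlation(x1, x2)))  # about 29.2

# Hypothetical data: two uncorrelated variables give VIF = 1.
u = [1.0, -1.0, 1.0, -1.0]
v = [1.0, 1.0, -1.0, -1.0]
print(vif(squared_correlation(u, v)))  # 1.0
```

A VIF near 1 indicates no inflation of the standard error; large values indicate that $x^{(j)}$ is almost a linear function of the other variables.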
f  What remedies against collinearity?
– Choice of experimental conditions,
g  – linear transformation of the $x^{(j)}$'s, e.g., sum and difference, or the “more important” variable plus the residuals of the other one,
h  – delete the variable with the highest $R_j^2$! (Usually insignificant!)
i* Ridge Regression = Penalized Regression.
– Penalize the squared $\beta_j$:
$$Q(\beta; \lambda) = \sum_i R_i^2 + \lambda \sum_j \beta_j^2 .$$
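For two centered explanatory variables without intercept, minimizing the ridge criterion $Q(\beta; \lambda)$ leads to the linear system $(X^T X + \lambda I)\beta = X^T y$, which can be solved directly. The sketch below does this with Cramer's rule; the function name and the data are hypothetical. The two columns are chosen to be identical, so that least squares ($\lambda = 0$) has no unique solution while ridge with $\lambda > 0$ does.

```python
def ridge_two_predictors(x1, x2, y, lam):
    """Solve (X'X + lam * I) beta = X'y for two predictors (no intercept)."""
    a11 = sum(v * v for v in x1) + lam
    a22 = sum(v * v for v in x2) + lam
    a12 = sum(u * v for u, v in zip(x1, x2))
    b1 = sum(u * v for u, v in zip(x1, y))
    b2 = sum(u * v for u, v in zip(x2, y))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("singular system: the coefficients are not unique")
    # Cramer's rule for the 2 x 2 system
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


# Hypothetical data: x2 is an exact copy of x1 (perfect collinearity).
x1 = [1.0, -1.0, 1.0, -1.0]
x2 = [1.0, -1.0, 1.0, -1.0]
y = [2.0, -2.0, 2.0, -2.0]

print(ridge_two_predictors(x1, x2, y, lam=1.0))  # two equal coefficients, 8/9 each

try:
    ridge_two_predictors(x1, x2, y, lam=0.0)
except ValueError as err:
    print("least squares fails:", err)
```

The penalty splits the effect evenly between the two identical columns and shrinks their sum (16/9) slightly below the value 2 that would reproduce the data exactly.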
5.4 Strategies of Model Development
a  Model selection is an interplay between
• available knowledge from subject matter & statistics,
• residual analysis, “detective work”,
• automatic model selection procedures,
• principle of simplicity,
• assessment of plausibility and critique by subject matter knowledge.
0. Read data, define variable names (sounds trivial ...), check plausibility (screening), get acquainted
1. “First aid” transformations.
2. A large model
   • all variables (main effects),
   • result of a stepwise forward selection
3. Examination of the random part:
   • Outliers in residuals,
   • Distribution of residuals,
   • Equality of variances,
   • Independence of errors.
   It may be warranted in view of the results to
   • transform the target variable,
   • introduce weights,
   • use robust methods (if not done routinely)
4. Non-linearities.
5. Automatic model selection
6. Add variables
7. Interactions
8. Influential observations
9. Critique by subject matter knowledge
10. Examine fit
11. Revision
12. Check deleted terms again
Celebrate!
b  Example construction cost
Question: Does price guarantee help?
Detective work gives the most convincing answer!
Messages: Model development
1. Automatic model selection procedures are a helpful tool
but do not find “the truth”!
2. Model selection is an interplay between