
AML Cheat Sheet

Prob., distributions and identities

SVD: X = U D V^T, U ∈ R^{n×d}, V ∈ R^{d×d}
Cauchy-Schwarz: (Σ_i u_i v_i)^2 ≤ (Σ_j |u_j|^2)(Σ_k |v_k|^2)
Cov. (univariate): Cov[x, y] = E[(x − E[x])(y − E[y])]
Cov. (multivariate): Cov[x, y] = E_{x,y}[x y^T] − E_x[x] E_y[y]^T
V[x ± y] = V[x] + V[y] ± 2 Cov[x, y]
V_x[Ax + b] = V_x[Ax] = A V_x[x] A^T
Sum rule: P(X = x) = Σ_y p(X = x, Y = y)
Conditional: P(X|Y) = P(X, Y)/P(Y)
Bayes' rule: P(Y|X) = P(X|Y) P(Y)/P(X)
Multivariate Gaussian: p(x|µ, Σ) = ((2π)^d |Σ|)^{−1/2} exp(−(1/2)(x − µ)^T Σ^{−1}(x − µ))
Univariate Gaussian: P(x_i|c, µ_c, σ_c^2) = (2πσ_c^2)^{−1/2} exp(−(x_i − µ_c)^2/(2σ_c^2))
Markov: P[X ≥ ε] ≤ E[X]/ε for X ≥ 0, ε > 0
Hoeffding lemma: E[e^{sX}] ≤ exp(s^2(b − a)^2/8)
Hoeffding theorem: P[S_n − E S_n ≥ t] ≤ exp(−2t^2/Σ_i(b_i − a_i)^2) and P[S_n − E S_n ≤ −t] ≤ exp(−2t^2/Σ_i(b_i − a_i)^2); for the sample mean substitute S_n → X̄_n and t → nε.
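The identity V_x[Ax + b] = A V_x[x] A^T can be sanity-checked numerically; the following sketch (illustrative choices of A, b, Σ, using numpy) compares a Monte-Carlo covariance of Ax + b against the closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0], [0.0, 3.0]])      # arbitrary linear map (illustrative)
b = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])  # V[x]

x = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=200_000)
y = x @ A.T + b                             # y = Ax + b, applied row-wise

emp_cov = np.cov(y, rowvar=False)           # Monte-Carlo estimate of V[Ax + b]
theory = A @ Sigma @ A.T                    # A V[x] A^T

print(np.round(emp_cov, 2))
print(np.round(theory, 2))                  # should agree up to sampling noise
```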

Kernels

• Gaussian (RBF) kernel: k(x, x') = exp(−‖x − x'‖_2^2 / h^2)
• Dimension of the feature space of k(x, x') = (x^T x' + c)^d is the binomial coefficient (N+d choose d)

Properties

Symmetry: k(x, x') = k(x', x)
Pos. semi-def.: ∫∫ k(x', x) f(x) f(x') dx dx' ≥ 0 for all f ∈ L_2, Ω ⊆ R^d
Construct feature map: φ: x_i ↦ (√λ_t v_{it})_{t=1}^n

Identities

Addition: k(x, x') = k_1(x, x') + k_2(x, x')
Multiplication: k(x, x') = k_1(x, x') k_2(x, x')
Scalar: k(x, x') = c k_1(x, x') for c > 0
Transform: k(x, x') = f(k_1(x, x')), f a polynomial with positive coefficients, or exp
Function multiply: k(x, x') = f(x) k_1(x, x') f(x') for any f
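A small numpy sketch (random illustrative points; `rbf`, `poly` and `min_eig` are hypothetical helper names) that builds RBF and polynomial kernel matrices and checks symmetry, positive semi-definiteness, and the closure rules above:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))              # 50 illustrative points in R^3

def rbf(A, B, h=1.0):
    # k(x, x') = exp(-||x - x'||^2 / h^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / h**2)

def poly(A, B, c=1.0, d=2):
    # k(x, x') = (x^T x' + c)^d
    return (A @ B.T + c) ** d

def min_eig(K):
    return np.linalg.eigvalsh(K).min()    # >= -1e-8 indicates PSD numerically

K1, K2 = rbf(X, X), poly(X, X)
for K in (K1, K2, K1 + K2, K1 * K2, 3.0 * K1):   # closure: sum, product, positive scaling
    assert np.allclose(K, K.T)            # symmetry
    assert min_eig(K) > -1e-8             # positive semi-definiteness
print("all kernel matrices symmetric and PSD")
```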

Risks

Q(y, f(x)) =
  (y − f(x))^2     quadratic loss (regr.)
  I{y ≠ f(x)}      0-1 loss (class.)
  exp(−β y f(x))   exponential loss (class.)
Cond. exp. risk: R(f, X) = ∫_Y Q(Y, f(X)) P(Y|X) dY
(Total) exp. risk: R(f) = E_X[R(f, X)]
Emp. error: R̂(f, X) = (1/n) Σ_{i=1}^n Q(y_i, f(X_i))
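A small sketch of the empirical error R̂(f, X) for the three losses, assuming numpy and toy predictions:

```python
import numpy as np

def empirical_risk(y, y_hat, loss):
    # \hat{R}(f, X) = (1/n) sum_i Q(y_i, f(x_i))
    return np.mean([loss(a, b) for a, b in zip(y, y_hat)])

sq = lambda y, f: (y - f) ** 2                        # quadratic loss (regression)
zero_one = lambda y, f: float(y != f)                 # 0-1 loss (classification)
expo = lambda y, f, beta=1.0: np.exp(-beta * y * f)   # exponential loss, y in {-1, +1}

y_reg, f_reg = np.array([1.0, 2.0, 0.5]), np.array([0.8, 2.5, 0.4])
print(empirical_risk(y_reg, f_reg, sq))

y_cls, f_cls = np.array([1, -1, 1, 1]), np.array([1, 1, 1, -1])
print(empirical_risk(y_cls, f_cls, zero_one))
print(empirical_risk(y_cls, f_cls, expo))
```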

Maximum Likelihood Estimators

θ̂_ML ∈ arg max_θ P(X|θ), i.i.d.: = Π_{i=1}^n P(x_i|θ)

Definitions:
Bias: bias(θ̂_n) = E[θ̂_n] − θ; bias[f̂(x)] = E_D[f̂(x)] − E[Y|x]
Consistency: ∀ε, P[|θ̂_n − θ| > ε] → 0 as n → ∞
Score: Λ(θ) := ∂ log P(x|θ)/∂θ
• E_{X|θ}[Λ] = 0  • E_{X|θ}[Λ θ̂] = ∂_θ b_θ̂ + 1
Fisher information: I(θ) = V[∂ log P(x|θ)/∂θ]
Asymptotic efficiency: lim_{n→∞} (V[θ̂_n(x_1, ..., x_n)] I(θ))^{−1} = 1

Results

Rao-Cramér: E_{X|θ}[(θ̂ − θ)^2] ≥ (1 + ∂_θ b_θ̂)^2 / I^{(n)}(θ) + b_θ̂^2
ML convergence: √n (θ̂_n^ML − θ_0) →_D N(0, J^{−1}(θ_0) I(θ_0) J^{−1}(θ_0)), with J(θ) = −E[∂^2 log P(x|θ)/∂θ∂θ^T]
ML consistency: θ̂_n^ML →_p θ_0
Equivariance: if θ̂_ML is the MLE for θ, then g(θ̂_ML) is the MLE for g(θ).
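As a worked illustration of the bias definition, the ML variance estimator of a Gaussian (dividing by n) has bias −σ²/n; a Monte-Carlo check with illustrative parameter values:

```python
import numpy as np

# Monte-Carlo estimate of bias(sigma2_hat_ML) = E[sigma2_hat] - sigma2 = -sigma2/n
rng = np.random.default_rng(2)
mu, sigma2, n, reps = 0.0, 4.0, 10, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
sigma2_ml = samples.var(axis=1, ddof=0)       # ML estimator divides by n (biased)

print(sigma2_ml.mean() - sigma2)              # empirical bias, approx -sigma2/n = -0.4
print(-sigma2 / n)
```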

Bayesian Learning

Maximum a posteriori: θ̂ ∈ arg max_θ p(x|θ) p(θ)
Prediction: p(X = x|𝒳) = ∫ p(x|θ) p(θ|𝒳) dθ
Recursive Bayesian estimation: p(θ|X_n) = p(x_n|θ) p(θ|X_{n−1}) / ∫ p(x_n|θ) p(θ|X_{n−1}) dθ
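A sketch of the recursive Bayesian update p(θ|X_n) ∝ p(x_n|θ) p(θ|X_{n−1}), illustrated with a Bernoulli likelihood and a conjugate Beta prior (this concrete model is an assumption for the example, not part of the sheet), so each step stays in closed form:

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = 1.0, 1.0                     # Beta(1, 1) prior, i.e. uniform on [0, 1]
theta_true = 0.7

for x_n in rng.binomial(1, theta_true, size=100):
    a, b = a + x_n, b + (1 - x_n)   # posterior p(theta | X_n) = Beta(a, b) after observing x_n

print(a / (a + b))                  # posterior mean, should be close to 0.7
```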

Regression

ε ∼ N(0, σ^2), y = Xβ + ε
Least-squares fit: β̂ = (X^T X)^{−1} X^T y; β̂ is the optimal (unbiased linear) estimator: β̂ ∼ N(β, (X^T X)^{−1} σ^2); for any unbiased θ̂ = c^T y: V[a^T β̂] ≤ V[c^T y]
MSE: E_D E_{X,Y}(f̂(X) − Y)^2 = variance + bias^2 + noise, where
  variance = E_X E_D(f̂(X) − E_D f̂(X))^2
  bias^2 = E_X(E_D f̂(X) − E_Y[Y|X])^2
  noise = E_{X,Y}(Y − E_Y[Y|X])^2
Ridge: β̂ = (X^T X + λI)^{−1} X^T y
Gen. regularization: β̂ = arg min_β RSS(β) + λ Σ_{j=1}^d |β_j|^q
bias(f̂) = E[f̂] − f, Var[f̂] = E[(f̂ − E[f̂])^2]
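A minimal sketch of the least-squares and ridge closed forms on synthetic data (dimensions and λ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 200, 5, 0.5
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
y = X @ beta_true + sigma * rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)                       # (X^T X)^{-1} X^T y
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # (X^T X + lam I)^{-1} X^T y

print(np.round(beta_true, 3))
print(np.round(beta_ols, 3))      # unbiased, higher variance
print(np.round(beta_ridge, 3))    # shrunk toward 0: adds bias, reduces variance
```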

Bayesian Linear Regression

Model: Y = Xβ + ε, ε ∼ N(0, σ^2)
Likelihood: p(Y|X, β, σ^2) = N(Xβ, σ^2 I)
Prior: p(β|Λ) = N(0, Λ^{−1})
Posterior: p(β|X, y, Λ) = N(µ_β, Σ_β) with
  µ_β = (X^T X + σ^2 Λ)^{−1} X^T y,  Σ_β = σ^2 (X^T X + σ^2 Λ)^{−1}
Completing the square: (β − µ_β)^T Σ_β^{−1} (β − µ_β) = β^T Σ_β^{−1} β − 2 β^T Σ_β^{−1} µ_β + µ_β^T Σ_β^{−1} µ_β
Conditioning a Gaussian: p(x_a|x_b) = N(x_a|µ_{a|b}, Σ_{a|b}) with
  µ_{a|b} = µ_a + Σ_{ab} Σ_{bb}^{−1}(x_b − µ_b),  Σ_{a|b} = Σ_{aa} − Σ_{ab} Σ_{bb}^{−1} Σ_{ba}
  Λ_{aa} = (Σ_{aa} − Σ_{ab} Σ_{bb}^{−1} Σ_{ba})^{−1},  Λ_{ab} = −Λ_{aa} Σ_{ab} Σ_{bb}^{−1}

Gaussian Process

Joint distribution of [y, y_{n+1}]:
y|X, σ^2 ∼ N(0, X^T Λ^{−1} X + σ^2 I); kernelized version:
p([y; y_{n+1}] | x_{n+1}, X, σ^2) = N(0, [C_n, k; k^T, c]) with
  C_n = K + σ^2 I_n,  c = k(x_{n+1}, x_{n+1}) + σ^2,  k = k(x_{n+1}, X),  K = k(X, X)
Partitioned Gaussian: p([a_1; a_2]) = N([a_1; a_2] | [u_1; u_2], [Σ_11, Σ_12; Σ_21, Σ_22])
  a_1, u_1 ∈ R^e; Σ_11 ∈ R^{e×e} PSD; Σ_12 ∈ R^{e×f}
  a_2, u_2 ∈ R^f; Σ_22 ∈ R^{f×f} PSD; Σ_21 ∈ R^{f×e}
Predictive density:
p(y_{n+1}|x_{n+1}, X, y) = N(µ_{n+1}, σ^2_{n+1}),  µ_{n+1} = k^T C_n^{−1} y,  σ^2_{n+1} = c − k^T C_n^{−1} k
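A minimal GP regression sketch implementing µ_{n+1} = k^T C_n^{−1} y and σ²_{n+1} = c − k^T C_n^{−1} k with an RBF kernel (synthetic data and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(20, 1))                 # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)      # noisy targets
x_new, sigma2, h = np.array([[0.5]]), 0.1**2, 1.0

def rbf(A, B, h=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / h**2)

K = rbf(X, X, h)                                     # K = k(X, X)
C_n = K + sigma2 * np.eye(len(X))                    # C_n = K + sigma^2 I_n
k = rbf(X, x_new, h)[:, 0]                           # k = k(x_{n+1}, X)
c = rbf(x_new, x_new, h)[0, 0] + sigma2              # c = k(x_{n+1}, x_{n+1}) + sigma^2

mu_new = k @ np.linalg.solve(C_n, y)                 # mu_{n+1} = k^T C_n^{-1} y
var_new = c - k @ np.linalg.solve(C_n, k)            # sigma^2_{n+1} = c - k^T C_n^{-1} k
print(mu_new, var_new)
```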

Numerical Estimation Techniques

Cross-Validation

Fitted models: f̂^{−ν} ∈ arg min_{f∈F} (1/|Z \ Z_ν|) Σ_{i∉Z_ν} (y_i − f(x_i))^2
Prediction error: R̂_cv = (1/n) Σ_{i≤n} (y_i − f̂^{−κ(i)}(x_i))^2
Unbiasedness: N(k−1)/k ≥ m (exam 2018, k-fold CV)
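A sketch of k-fold cross-validation computing R̂_cv with squared loss, using ordinary least squares as the model family (the helper names `fit`/`predict` are illustrative):

```python
import numpy as np

def kfold_cv_error(X, y, fit, predict, k=5, seed=0):
    # \hat{R}_cv = (1/n) sum_i (y_i - f^{-kappa(i)}(x_i))^2
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    sq_errs = np.empty(n)
    for fold in folds:
        train = np.setdiff1d(idx, fold)                # indices outside fold nu
        model = fit(X[train], y[train])                # f^{-nu}: fit without fold nu
        sq_errs[fold] = (y[fold] - predict(model, X[fold])) ** 2
    return sq_errs.mean()

# usage with ordinary least squares as the model family F
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=100)
print(kfold_cv_error(X, y, fit, predict, k=5))
```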

Bootstrap

It works if R_n^{str}(F̂, F̂) − R_n^{str}(F, F̂) →_P 0
Bootstrap avg. risk: R̂ = (1/B)(1/n) Σ_{b=1}^B Σ_{i=1}^n Q(y_i, f̂^{*b}(x_i))
Solution for overlap: C^{−i} := {b ∈ [B] : x_i ∉ Z^{*b}}
R̂^{(1)} = (1/n) Σ_{i=1}^n (1/|C^{−i}|) Σ_{b∈C^{−i}} l(y_i, f^{*b}(x_i))   (R̂ is fit on the training set)
R̂^{.632} = 0.368 R̂ + 0.632 R̂^{(1)}
R̂^{(0.632+)} = (1 − ŵ) R̂ + ŵ R̂^{(1)}, ŵ = 0.632/(1 − 0.368 Ĝ), Ĝ = (R̂^{(1)} − R̂)/(γ̂ − R̂), γ̂ = (1/n^2) Σ_{i=1}^N Σ_{j=1}^N l(y_i, f̂(x_j))
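A sketch of the out-of-bag estimate R̂^{(1)} and the .632 combination on synthetic data (B, n and the least-squares model are illustrative choices; the .632+ variant is omitted):

```python
import numpy as np

rng = np.random.default_rng(7)
n, B = 80, 200
X = rng.normal(size=(n, 2))
y = X @ np.array([1.0, -1.0]) + 0.5 * rng.normal(size=n)
loss = lambda y, f: (y - f) ** 2
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]

beta_full = fit(X, y)
R_hat = loss(y, X @ beta_full).mean()                # apparent (training-set) error \hat{R}

oob_losses = [[] for _ in range(n)]                  # losses over b in C^{-i}
for _ in range(B):
    boot = rng.integers(0, n, size=n)                # bootstrap sample Z^{*b}
    beta_b = fit(X[boot], y[boot])
    out = np.setdiff1d(np.arange(n), boot)           # i with x_i not in Z^{*b}
    for i in out:
        oob_losses[i].append(loss(y[i], X[i] @ beta_b))

R_1 = np.mean([np.mean(l) for l in oob_losses if l]) # \hat{R}^{(1)}
R_632 = 0.368 * R_hat + 0.632 * R_1                  # \hat{R}^{.632}
print(R_hat, R_1, R_632)
```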

Jackknife

Ŝ_{n−1}^{−i} := Ŝ_{n−1}(x_1, ..., x_{i−1}, x_{i+1}, ..., x_n)
S̃_n := (1/n) Σ_{i=1}^n Ŝ_{n−1}^{−i}
bias_JK := (n − 1)(S̃_n − Ŝ_n)   (Jackknife)
Debiased estimator: Ŝ_JK = Ŝ_n − bias_JK
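A jackknife sketch using the biased (divide-by-n) variance estimator as Ŝ_n; the debiased Ŝ_JK should come out close to the unbiased (divide-by-(n−1)) estimate:

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(loc=0.0, scale=2.0, size=30)

S_hat = x.var(ddof=0)                                # biased variance estimator \hat{S}_n
loo = np.array([np.delete(x, i).var(ddof=0) for i in range(len(x))])  # \hat{S}_{n-1}^{-i}
S_tilde = loo.mean()                                 # \tilde{S}_n

bias_jk = (len(x) - 1) * (S_tilde - S_hat)           # (n-1)(\tilde{S}_n - \hat{S}_n)
S_jk = S_hat - bias_jk                               # debiased estimate \hat{S}_JK
print(S_hat, S_jk, x.var(ddof=1))                    # S_jk should be close to the ddof=1 value
```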

Tests and criteria

Let X_1, ..., X_n ∼ Q(x) i.i.d. and H_0: Q = P_0, H_1: Q = P_1. Test:
g(x_1, ..., x_n) = 0 (H_0 accepted) if P_0(x_1, ..., x_n)/P_1(x_1, ..., x_n) > T; 1 (H_0 rejected) if P_0(x_1, ..., x_n)/P_1(x_1, ..., x_n) ≤ T
Then α = E_0[g(x_1, ..., x_n)] and β = 1 − E_1[g(x_1, ..., x_n)].
Assume that we know the log-likelihood function (loss) of the model.
Bayes factor: p(X|M_k)/p(X|M_l); if > 1, take M_k; p(X|M_k) = ∫ p(X|θ_k, M_k) p(θ_k|M_k) dθ_k
BIC (minimise): −2 log p̂(X|θ̂_k, M_k) + k_0 log n
Laplace approx. (k_0 = #free params in M_k): log p(X|M_k) = log p(X|θ̂_k, M_k) − (k_0/2) log n + O(1)
MDL: −log p(X|θ_k) − log p(θ_k)
AIC: −2 log p̂(X|θ̂_k) + 2k
KL: D(p‖p̂) = −∫ p(x) log( p̂(x|θ̂_k)/p(x) ) dx
TIC: −2 log p̂(X|θ̂_k) + 2 trace[I_1(θ̂_k) J_1^{−1}(θ̂_k)]

AIC is asymptotically equivalent to LOOCV for ordinary linear regression models.
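A sketch computing AIC and BIC for a fitted univariate Gaussian (k_0 = 2 free parameters; the data and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(loc=1.0, scale=2.0, size=500)

mu_hat, sigma2_hat = x.mean(), x.var(ddof=0)         # ML estimates, k0 = 2 free parameters
loglik = np.sum(-0.5 * np.log(2 * np.pi * sigma2_hat)
                - (x - mu_hat) ** 2 / (2 * sigma2_hat))

k0, n = 2, len(x)
AIC = -2 * loglik + 2 * k0
BIC = -2 * loglik + k0 * np.log(n)
print(AIC, BIC)      # lower is better; BIC penalises parameters more strongly for large n
```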

Linear Discriminant Functions

Gradient descent: a_{k+1} = a_k − η_k ∇J(a_k)
J(a_{k+1}) ≈ J(a_k) + ∇J^T(a_{k+1} − a_k) + (1/2)(a_{k+1} − a_k)^T H (a_{k+1} − a_k)
η_OPT = ‖∇J‖^2 / (∇J^T H ∇J)
Newton's rule: a_{k+1} = a_k − H^{−1} ∇J(a_k)
Perceptron loss: J(a) = Σ_{x̃∈X̃_mc} (−a^T x̃)
Perceptron update: a_{k+1} = a_k + η_k Σ_{x̃∈X̃_mc} x̃
γ = min_{i∈X̃_mc}(â^T x̃_i), β^2 = max_{i∈X̃_mc} ‖x̃_i‖^2, max. steps: β^2 ‖â‖^2 / γ^2
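A batch perceptron sketch on clearly separable synthetic blobs, implementing a_{k+1} = a_k + η_k Σ_{x̃∈X̃_mc} x̃ on label-normalised augmented samples:

```python
import numpy as np

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(-5, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])
z = np.concatenate([-np.ones(50), np.ones(50)])      # class labels in {-1, +1}

# augment with a bias coordinate and multiply by the label ("normalised" samples x_tilde):
# the goal becomes a^T x_tilde > 0 for every sample
Xt = np.hstack([np.ones((100, 1)), X]) * z[:, None]

a, eta = np.zeros(3), 1.0
for step in range(1000):
    mis = Xt[Xt @ a <= 0]                            # misclassified set X_tilde_mc
    if len(mis) == 0:
        break                                        # converged: all a^T x_tilde > 0
    a = a + eta * mis.sum(axis=0)                    # a_{k+1} = a_k + eta * sum over X_tilde_mc
print(step, a)
```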

Bayesian view:


Prior: P(Y = y) = π_y
Posterior density: p(y|x) = π_y p(x|y) / Σ_z π_z p_z(x)
Classifier with rejection:
c(x) = y if Σ_z L(z, y) p(z|x) = min_{ρ≤k} Σ_z L(z, ρ) p(z|x) ≤ d; D (reject) otherwise
Outlier classification: π_O p_O(x) ≥ max{(1 − d) p(x), max_z π_z p_z(x)}

Fisher's Linear Discriminant Analysis (LDA)

Sample avg.: m_α = (1/n_α) Σ_{x∈X_α} x, n_α = |X_α|
Projected avg.: m̃_α = (1/n_α) Σ_{x∈X_α} w^T x = w^T m_α
Class scatter: Σ_α = Σ_{x∈X_α} (x − m_α)(x − m_α)^T
Within scatter: Σ_W = Σ_{1≤α≤k} Σ_α
Projected scatter: Σ̃_α = w^T Σ_α w
Fisher's separation: J(w) = (w^T (m_1 − m_2)(m_1 − m_2)^T w) / (w^T Σ_W w)
yields w ∝ Σ_W^{−1}(m_1 − m_2)
Between-class (mean) scatter: Σ_B = (m_1 − m_2)(m_1 − m_2)^T; result: Σ_W^{−1} Σ_B w = λ w with λ = (w^T Σ_B w)/(w^T Σ_W w)
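A sketch of Fisher's discriminant direction w ∝ Σ_W^{−1}(m_1 − m_2) computed from two synthetic Gaussian classes (all data illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)
cov = [[1.0, 0.3], [0.3, 1.0]]
X1 = rng.multivariate_normal([0, 0], cov, size=100)   # class 1
X2 = rng.multivariate_normal([2, 1], cov, size=100)   # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)             # sample averages m_alpha
S1 = (X1 - m1).T @ (X1 - m1)                          # class scatter Sigma_1
S2 = (X2 - m2).T @ (X2 - m2)
S_W = S1 + S2                                         # within-class scatter Sigma_W

w = np.linalg.solve(S_W, m1 - m2)                     # w proportional to Sigma_W^{-1}(m_1 - m_2)
w = w / np.linalg.norm(w)
print(w)
print((X1 @ w).mean(), (X2 @ w).mean())               # projected class means, well separated
```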

Lagrangian Optimization

min f(w), w ∈ Ω ⊆ R^d s.t. g_i(w) ≤ 0, 1 ≤ i ≤ k; h_j(w) = 0, 1 ≤ j ≤ m
L(w, α, β) = f(w) + Σ_{i=1}^k α_i g_i(w) + Σ_{j=1}^m β_j h_j(w)
∂L/∂w|_{w=w*} = 0; max_{α,β} θ(α, β) with θ(α, β) = inf_w L(w, α, β) s.t. α_i ≥ 0
Duality gap: Δ := L(w, α, β) − θ(α, β)
Strong duality (convex objective f and convex domain): the duality gap is zero.
KKT conditions: if f ∈ C^1 and g_i, h_j are affine, then w is an optimum if α satisfies
  ∂L(w)/∂w = 0, ∂L(w)/∂β = 0, α_i g_i(w) = 0, g_i(w) ≤ 0, α_i ≥ 0

SVM

Soft-margin geometric problem formulation
Primal: min_{w,ξ} (1/2) w^T w + C Σ_{i=1}^n ξ_i s.t. z_i(w^T y_i + w_0) ≥ 1 − ξ_i, ξ_i ≥ 0
Dual: max_α Σ_{i≤n} α_i − (1/2) Σ_{i≤n} Σ_{j≤n} α_i α_j z_i z_j y_i^T y_j s.t. C ≥ α_i ≥ 0, Σ_{i≤n} z_i α_i = 0
Solution: w_0 = (max_{i:z_i=−1} w*^T y_i + min_{i:z_i=1} w*^T y_i)/2, w* = Σ_{i∈SV} α_i z_i y_i, g(y) = Σ_{i∈SV} z_i α_i y_i^T y + w_0
By the KKT condition ξ_i(α_i − C) = 0, a non-zero slack variable can only occur if α_i = C.
The optimal margin is given by: w^T w = Σ_{i∈SV} α_i

Multi-class SVM: w^T = (w_1^T, ..., w_n^T)
Primal: min_{w,ξ} (1/2) w^T w + C Σ_{i≤n} ξ_i s.t. ξ_i ≥ 0, (w_{z_i}^T y_i + w_{z_i,0}) − max_{z≠z_i}(w_z^T y_i + w_{z,0}) ≥ 1 − ξ_i

Structured SVM:

Primal: min_{w,ξ} (1/2) w^T w + C Σ_{i≤n} ξ_i s.t. ξ_i ≥ 0, w^T Ψ(z_i, y_i) − max_{z≠z_i}[Δ(z, z_i) + w^T Ψ(z, y_i)] ≥ −ξ_i
equivalently w^T Ψ(z_i, y_i) − w^T Ψ(z, y_i) ≥ Δ(z, z_i) − ξ_i ∀i, ∀z ≠ z_i
Dual: min_α (1/2) Σ_{i=1}^n Σ_{j=1}^n Σ_{z_k∈K} α_{ik} α_{jk} Ψ_i(z_k)^T Ψ_j(z_k) + Σ_{i=1}^n Σ_{z_k∈K} α_{ik} Δ_i(z_k)
  s.t. C ≥ Σ_{z_k∈K} α_{ik} ≥ 0, α_{ik} ≥ 0, ∀i, ∀k
Prediction: h(y) = arg max_{z∈K} w^T ψ(z, y)

Ensemble

If we combine different regressors: V[f̂(x)] ≈ σ^2/B, bias[f̂(x)] = (1/B) Σ_{i=1}^B bias[f̂_i(x)]
Boosting: weighted models and weighted training data instead of bootstrapping.
ε_b = Σ_{i=1}^n w_i^{(b)} I{c_b(x_i) ≠ y_i} / Σ_{i=1}^n w_i^{(b)}
α_b ← log((1 − ε_b)/ε_b) = log( p(y = 1|x)/p(y = −1|x) )   (log-odds ratio)
∀i: w_i ← w_i exp(α_b I{c_b(x_i) ≠ y_i})
ĉ_B(x) = sign( Σ_{b=1}^B α_b c_b(x) )
Avg. exp. loss = (1/N) Σ_{i=1}^N exp(−y_i ĉ_B(x_i))
Err_AdaBoost = exp(−h(x) sign(Σ_{b=1}^B α_b y_b(x)))
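An AdaBoost sketch with axis-aligned threshold stumps as weak learners (the stump learner and the synthetic data are illustrative assumptions), following the ε_b, α_b and reweighting formulas above:

```python
import numpy as np

rng = np.random.default_rng(12)
n = 200
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)           # labels in {-1, +1}

def best_stump(X, y, w):
    # weak learner: axis-aligned threshold stump minimising the weighted error eps_b
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] > thr, 1, -1)
                err = w[pred != y].sum() / w.sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, s)
    return best

w = np.ones(n) / n
stumps, alphas = [], []
for b in range(30):
    eps, j, thr, s = best_stump(X, y, w)
    eps = np.clip(eps, 1e-10, 1 - 1e-10)
    alpha = np.log((1 - eps) / eps)                  # alpha_b = log((1 - eps_b)/eps_b)
    pred = s * np.where(X[:, j] > thr, 1, -1)
    w = w * np.exp(alpha * (pred != y))              # upweight misclassified points
    stumps.append((j, thr, s)); alphas.append(alpha)

F = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, (j, t, s) in zip(alphas, stumps))
print((np.sign(F) != y).mean())                      # training 0-1 error of sign(sum_b alpha_b c_b)
```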

PAC Learning

Error: error(h) = P_{x∼D}[c(x) ≠ h(x)]
(ε, δ) criterion: P_{X,Y}[R(ĉ_n) ≤ R(c_Bayes) + ε] > 1 − δ
Strong PAC learning: holds for arbitrarily small ε
Weak PAC learning: ε non-trivially large
PAC learnability: P[R(ĉ_n) ≤ ε] ≥ 1 − δ
Efficiently: A runs in poly time in 1/ε and 1/δ
Results:
R(ĉ_n) − inf_{c∈C} R(c) ≤ 2 sup_{c∈C} |R̂_n(c) − R(c)|
P[sup_{c∈C} |R̂_n(c) − R(c)| > ε] ≤ 2N exp(−2nε^2)
Implying: R(c) ≤ R̂_n(c) + √((log N − log(δ/2))/(2n))
X shattered by 𝒜 if {X ∩ A | A ∈ 𝒜} contains all subsets of X
VC dim of 𝒜 = max{n : ∃X s.t. X shattered by 𝒜, |X| = n}
Score: score(𝒜, X) = |{X ∩ A | A ∈ 𝒜}|
Shattering coeff.: s(𝒜, n) = max_{X:|X|=n} score(𝒜, X)
If V_𝒜 > 2 (V_𝒜 = VC dim. of 𝒜): s(𝒜, n) ≤ n^{V_𝒜}
P[R(c_n) − inf_{c∈C} R(c) > ε] ≤ 8 s(𝒜, n) exp(−nε^2/32)

Non-Parametric Bayesian Methods

Beta function: B(a, b) = Γ(a)Γ(b)/Γ(a + b), a, b > 0; Γ(a) = ∫_0^∞ e^{−x} x^{a−1} dx
Beta(x|a, b) = x^{a−1}(1 − x)^{b−1}/B(a, b), x ∈ [0, 1]
Dir(x|α) = (Π_{k=1}^n x_k^{α_k − 1}) / B(α)
Finite Gaussian mixture: p(x_i|θ) = Σ_{k=1}^K ρ_k N(x_i|µ_k, σ_k)
Stick-breaking process (GEM distribution):
β_k ∼ Beta(1, α), ρ_k = β_k (1 − Σ_{i=1}^{k−1} ρ_i), k = 1, 2, ...
Chinese Restaurant Process: P(𝒫) = (α^{|𝒫|}/α^{(n)}) Π_{τ∈𝒫} (|τ| − 1)!
E[#tables] = Σ_{i=1}^N α/(α + i) ∼ O(α log N)
P[customer n+1 joins table τ ∈ 𝒫 ∪ {∅} | 𝒫] = |τ|/(α + n) if τ ∈ 𝒫, α/(α + n) otherwise
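A sketch sampling stick-breaking weights (GEM) and simulating the Chinese Restaurant Process seating rule (the truncation level K and α are illustrative):

```python
import numpy as np

rng = np.random.default_rng(13)
alpha = 2.0

# Stick-breaking (GEM): beta_k ~ Beta(1, alpha), rho_k = beta_k * (1 - sum_{i<k} rho_i)
K = 20                                    # truncation level for the sketch
betas = rng.beta(1.0, alpha, size=K)
remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
rho = betas * remaining
print(rho.sum())                          # approaches 1 as K grows

# CRP: customer n+1 joins table tau w.p. |tau|/(alpha+n), opens a new table w.p. alpha/(alpha+n)
tables = []
N = 100
for n in range(N):
    probs = np.array([len(t) for t in tables] + [alpha]) / (alpha + n)
    choice = rng.choice(len(tables) + 1, p=probs)
    if choice == len(tables):
        tables.append([n])                # open a new table
    else:
        tables[choice].append(n)
print(len(tables), alpha * np.log(N))     # number of tables vs the O(alpha log N) scale
```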

Dirichlet Mixture Model

Base measure: µ_k ∼ N(µ_0, σ_0)
Cluster prob.: ρ = (ρ_1, ρ_2, ...) ∼ GEM(α)
Category assignment: z_i ∼ Categorical(ρ)
Data sample: x_i ∼ N(µ_{z_i}, σ)
De Finetti's theorem: p(X_1, ..., X_n) = ∫ Π_i p(x_i|G) dP(G)
Gibbs sampling:
p(z_i = k|z_{−i}, x, α, µ) ∝ p(z_i = k|z_{−i}, α) [prior] · p(x_i|x_{−i}, z_i = k, z_{−i}, µ) [likelihood]
p(z_i = k|z_{−i}, x, α, µ) = (N_{k,−i}/(α + N − 1)) p(x_i|x_{−i,k}, µ) for existing k; (α/(α + N − 1)) p(x_i|µ) otherwise
p(x_i, x_{−i,k}|µ) = ∫ p(x_i|µ_k) (Π_{j≠i} p(x_j|µ_k)) p(µ_k|µ_0, σ_0) dµ_k

Gaussian-mixtures and EM estimation

Parameters: θ = {π_c, µ_c, σ_c^2}_{c=1}^k
k-Gaussian mixture: P(x_i|θ) = Σ_{c=1}^k π_c P(x_i|c, µ_c, σ_c^2), Σ_{c=1}^k π_c = 1
Log-likelihood: L(X|θ) = log P(X|θ) = Σ_{i=1}^n log P(x_i|θ) = Σ_{i=1}^n log Σ_{c=1}^k π_c P(x_i|c, µ_c, σ_c^2)
Define binary latent variables M_ic ∈ {0, 1}, where M_ic indicates that x_i is generated by component c. Complete-data log-likelihood:
L(X, M|θ) = log Π_{i=1}^n Π_{c=1}^k (π_c P(x_i|c, µ_c, σ_c^2))^{M_ic}
L(X, M|θ) = Σ_{i=1}^n Σ_{c=1}^k M_ic log(π_c P(x_i|c, µ_c, σ_c^2))
Expectation over the latent variables: γ_ic := E_{M|X,θ}[M_ic]
Q(θ) := E_{M|X}[L(X, M|θ)] = Σ_{i=1}^n Σ_{c=1}^k γ_ic log(π_c P(x_i|c, µ_c, σ_c^2))
EM estimation algorithm: • E-step: compute γ_ic, θ constant • M-step: compute θ, γ_ic constant
E-step: E_{M|X}[M_ic] = 1 · P(M_ic = 1|x_i, θ) + 0 · P(M_ic = 0|x_i, θ)
  = P(M_ic = 1|x_i, θ) = P(c|x_i, θ) = P(x_i|c, θ) P(c|θ) / P(x_i|θ)
  = π_c P(x_i|c, µ_c, σ_c^2) / Σ_{j=1}^k π_j P(x_i|j, µ_j, σ_j^2)
M-step:
(i) µ_c from arg max_θ Q(θ): ∂_{µ_c} Q(θ) = 0 ⟹ µ_c = Σ_{i=1}^n γ_ic x_i / Σ_{i=1}^n γ_ic
(ii) σ_c from arg max_θ Q(θ): ∂_{σ_c} Q(θ) = 0 ⟹ σ_c^2 = Σ_{i=1}^n γ_ic (x_i − µ_c)^2 / Σ_{i=1}^n γ_ic
(iii) π_c from arg max_θ Q(θ) with constraint Σ_c π_c = 1: L(θ, λ) = −Q(θ) + λ(Σ_{c=1}^k π_c − 1) ⟹ ∂_{π_c} L(θ, λ) = 0
  ⟺ Σ_{i=1}^n γ_ic = λπ_c ⟺ Σ_{c=1}^k Σ_{i=1}^n γ_ic = λ Σ_{c=1}^k π_c
  ⟺ Σ_{i=1}^n Σ_{c=1}^k γ_ic = Σ_{i=1}^n 1 = n = λ ⟺ π_c = (Σ_{i=1}^n γ_ic) / n
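A compact EM sketch for a 1-D Gaussian mixture implementing the E-step responsibilities γ_ic and the closed-form M-step updates above (synthetic data and initialisation are illustrative):

```python
import numpy as np

rng = np.random.default_rng(14)
# data from a 2-component 1-D Gaussian mixture
x = np.concatenate([rng.normal(-2.0, 0.7, 300), rng.normal(3.0, 1.2, 700)])
n, k = len(x), 2

# initial parameters theta = {pi_c, mu_c, sigma2_c}
pi = np.ones(k) / k
mu = rng.choice(x, size=k, replace=False)
sigma2 = np.ones(k)

def normal_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

for _ in range(100):
    # E-step: gamma_ic = pi_c N(x_i|mu_c, sigma2_c) / sum_j pi_j N(x_i|mu_j, sigma2_j)
    dens = np.stack([pi[c] * normal_pdf(x, mu[c], sigma2[c]) for c in range(k)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: closed-form updates derived from setting the derivatives of Q(theta) to zero
    Nc = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nc                  # mu_c
    sigma2 = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nc  # sigma_c^2
    pi = Nc / n                                                 # pi_c

print(np.round(pi, 2), np.round(mu, 2), np.round(np.sqrt(sigma2), 2))
```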
