AML Cheat Sheet
Prob., distributions and identities
SVD X = UDV^T, U ∈ R^{n×d}, V ∈ R^{d×d}
Cauchy-Schwarz (Σ_i u_i v_i)^2 ≤ (Σ_j |u_j|^2)(Σ_k |v_k|^2)
Cov. (univariate) Cov[x, y] = E[(x − E[x])(y − E[y])]
Cov. (mult'vari.) Cov[x, y] = E_{x,y}[xy^T] − E_x[x] E_y[y]^T
V[x ± y] = V[x] + V[y] ± 2 Cov[x, y]
V_x[Ax + b] = V_x[Ax] = A V_x[x] A^T
Sum Rule P(X = x) = Σ_y p(X = x, Y = y)
Conditional P(X|Y) = P(X, Y)/P(Y)
Bayes' Rule P(Y|X) = P(X|Y) P(Y)/P(X)
Multi Gaussian p(x|µ, Σ) = ((2π)^d |Σ|)^{−1/2} exp(−(1/2)(x − µ)^T Σ^{−1}(x − µ))
Univariate Gaussian P(x_i|c, µ_c, σ_c^2) = (2πσ_c^2)^{−1/2} exp(−(x_i − µ_c)^2/(2σ_c^2))
Markov P[X ≥ ε] ≤ E[X]/ε for X ≥ 0, ε > 0
Hoeffding Lemma E[e^{sX}] ≤ exp(s^2 (b − a)^2/8)
Hoeffding Thm P[S_n − E S_n ≥ t] ≤ exp(−2t^2/Σ_i (b_i − a_i)^2), P[S_n − E S_n ≤ −t] ≤ exp(−2t^2/Σ_i (b_i − a_i)^2); for the sample mean X̄_n = S_n/n, replace t by nt.
Kernels
•Gaussian (RBF) kernel: k(x, x′) = exp(−‖x − x′‖_2^2 / h^2)
•Feature-space dimension of the polynomial kernel k(x, x′) = (x^T x′ + c)^d is the binomial coefficient (N+d choose d).
Properties
Symmetry k(x, x′) = k(x′, x)
Pos. semi-def. ∫∫_Ω k(x′, x) f(x) f(x′) dx dx′ ≥ 0 for all f ∈ L_2, Ω ⊆ R^d
Construct feature map θ: x_i ↦ (√λ_t v_{it})_{t=1}^n (λ_t, v_t eigenpairs of the kernel matrix)
Identities
Addition k(x, x′) = k_1(x, x′) + k_2(x, x′)
Multiply k(x, x′) = k_1(x, x′) k_2(x, x′)
Scalar k(x, x′) = c k_1(x, x′) for c > 0
Transform k(x, x′) = f(k_1(x, x′)) for f a polynomial with positive coefficients, or f = exp
Function multiply k(x, x′) = f(x) k_1(x, x′) f(x′) for any f
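A quick numerical sanity check of the PSD property (not from the source; assumes numpy, with h, c, d chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                      # 20 points in R^3
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_rbf = np.exp(-sq / 2.0 ** 2)                    # k(x,x') = exp(-||x-x'||^2/h^2), h = 2
K_poly = (X @ X.T + 1.0) ** 3                     # k(x,x') = (x^T x' + c)^d, c = 1, d = 3
for K in (K_rbf, K_poly):
    assert np.linalg.eigvalsh(K).min() > -1e-8    # PSD up to numerical noise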
Risks
Q(y, f(x)) = (y − f(x))^2 quadratic loss (regr.)
Q(y, f(x)) = I{y ≠ f(x)} 0-1 loss (class.)
Q(y, f(x)) = exp(−βyf(x)) exponential loss (class.)
Cond. Exp. Risk R(f, X) = ∫_Y Q(Y, f(X)) P(Y|X) dY
(Total) Exp. Risk R(f) = E_X[R(f, X)]
Emp. Error R̂(f, X) = (1/n) Σ_{i=1}^n Q(y_i, f(X_i))
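A minimal sketch of the empirical quadratic risk of a fixed predictor (assumes numpy; f, X, y are illustrative):

import numpy as np

rng = np.random.default_rng(1)
X = np.linspace(0, 1, 50)
y = 2 * X + rng.normal(0, 0.1, 50)
f = lambda x: 2 * x                       # candidate predictor
R_hat = np.mean((y - f(X)) ** 2)          # R̂(f) = (1/n) Σ Q(y_i, f(x_i)), quadratic loss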
Maximum Likelihood Estimators
θ̂_ML ∈ arg max_θ P(X|θ) = Π_{i=1}^n P(x_i|θ) (i.i.d.)
Definitions:
Bias bias(θ̂_n) = E[θ̂_n] − θ
bias[f̂(x)] = E_D[f̂(x)] − E[Y|x]
Consistency ∀ε > 0: P[|θ̂_n − θ| > ε] → 0 as n → ∞
Score Λ(θ) := ∂ log P(x|θ)/∂θ
• E_{X|θ}[Λ] = 0; • E_{X|θ}[Λθ̂] = ∂b_{θ̂}/∂θ + 1
Fisher Information I(θ) = V[∂ log P(x|θ)/∂θ]
Asymptotic efficiency lim_{n→∞} (V[θ̂_n(x_1, ..., x_n)] I(θ))^{−1} = 1
Results
Rao-Cramér E_{X|θ}[(θ̂ − θ)^2] ≥ (1 + ∂b_{θ̂}/∂θ)^2 / I_n(θ) + b_{θ̂}^2
ML converg. √n(θ̂_n^{ML} − θ_0) →_D N(0, J^{−1}(θ_0) I(θ_0) J^{−1}(θ_0)), J(θ) = −E[∂^2 log P(x|θ)/∂θ∂θ^T]
ML consist. θ̂_n^{ML} →_p θ_0
Invariance: if θ̂_ML is the MLE for θ, then g(θ̂_ML) is the MLE for g(θ).
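A minimal sketch illustrating the bias of an MLE: for a univariate Gaussian the variance MLE divides by n, so E[σ̂²] = (n−1)σ²/n (assumes numpy; data illustrative):

import numpy as np

x = np.random.default_rng(2).normal(loc=3.0, scale=1.5, size=1000)
mu_ml = x.mean()                       # θ̂_ML for µ
var_ml = ((x - mu_ml) ** 2).mean()     # θ̂_ML for σ²; biased: E[var_ml] = (n-1)/n · σ²
print(mu_ml, var_ml)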
Bayesian Learning
Maximum a Posteriori θ̂ ∈ arg max_θ p(x|θ) p(θ)
Prediction: p(X = x|𝒳) = ∫ p(x|θ) p(θ|𝒳) dθ
Rec. Bayesian Est. p(θ|X_n) = p(x_n|θ) p(θ|X_{n−1}) / ∫ p(x_n|θ) p(θ|X_{n−1}) dθ
Regression
Model y = Xβ + ε, ε ∼ N(0, σ^2)
Least-squares fit β̂ = (X^T X)^{−1} X^T y, β̂ ∼ N(β, (X^T X)^{−1} σ^2)
f* = optimal estimator; for any unbiased θ̂ = c^T y: V[a^T β̂] ≤ V[c^T y]
MSE E_D E_{X,Y}(f̂(X) − Y)^2 = variance + bias^2 + noise, where
variance = E_X E_D(f̂(X) − E_D f̂(X))^2
bias^2 = E_X(E_D f̂(X) − E_Y[Y|X])^2
noise = E_{X,Y}(Y − E_Y[Y|X])^2
Ridge β̂ = (X^T X + λI)^{−1} X^T y
Gen. Reg. β̂ = arg min_β RSS(β) + λ Σ_{j=1}^d |β_j|^q
bias(f̂) = E[f̂] − f*, Var[f̂] = E[(f̂ − E[f̂])^2]
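A minimal sketch of the closed-form OLS and ridge fits (assumes numpy; λ and data are illustrative):

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(0, 0.5, 100)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)                   # (X^T X)^{-1} X^T y
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)  # ridge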
Bayesian Linear Regression
Model: Y = Xβ + ε, ε ∼ N(0, σ^2) Likelihood: p(Y|X, β, σ^2) = N(Xβ, σ^2 I)
Prior: p(β|Λ) = N(0, Λ^{−1}) Posterior: p(β|X, y, Λ) = N(µ_β, Σ_β)
µ_β = (X^T X + σ^2 Λ)^{−1} X^T y, Σ_β = σ^2 (X^T X + σ^2 Λ)^{−1}
Completing the square: (β − µ_β)^T Σ_β^{−1}(β − µ_β) = β^T Σ_β^{−1} β − 2β^T Σ_β^{−1} µ_β + µ_β^T Σ_β^{−1} µ_β
Conditioning a Gaussian: p(x_a|x_b) = N(x_a|µ_{a|b}, Σ_{a|b})
µ_{a|b} = µ_a + Σ_{ab} Σ_{bb}^{−1}(x_b − µ_b), Σ_{a|b} = Σ_{aa} − Σ_{ab} Σ_{bb}^{−1} Σ_{ba}
Λ_{aa} = (Σ_{aa} − Σ_{ab} Σ_{bb}^{−1} Σ_{ba})^{−1}, Λ_{ab} = −Λ_{aa} Σ_{ab} Σ_{bb}^{−1}
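A minimal sketch of the posterior µ_β, Σ_β above (assumes numpy; the isotropic prior precision Λ is an illustrative choice):

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3)); sigma2 = 0.25
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(0, np.sqrt(sigma2), 50)

Lam = np.eye(3) / 10.0                         # prior precision Λ (assumed)
A = X.T @ X + sigma2 * Lam
mu_beta = np.linalg.solve(A, X.T @ y)          # (X^T X + σ²Λ)^{-1} X^T y
Sigma_beta = sigma2 * np.linalg.inv(A)         # σ²(X^T X + σ²Λ)^{-1}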
Gaussian Process
Joint distribution of [y, y_{n+1}]:
y|X, σ^2 ∼ N(0, XΛ^{−1}X^T + σ^2 I); kernelized version:
p([y; y_{n+1}] | x_{n+1}, X, σ^2) = N(0, [[C_n, k], [k^T, c]])
C_n = K + σ^2 I_n, c = k(x_{n+1}, x_{n+1}) + σ^2, k = k(x_{n+1}, X), K = k(X, X)
Partitioned Gaussian: p([a_1; a_2]) = N([a_1; a_2] | [u_1; u_2], [[Σ_11, Σ_12], [Σ_21, Σ_22]])
a_1, u_1 ∈ R^e, a_2, u_2 ∈ R^f; Σ_11 ∈ R^{e×e} and Σ_22 ∈ R^{f×f} PSD; Σ_12 ∈ R^{e×f}, Σ_21 ∈ R^{f×e}
Predictive density:
p(y_{n+1}|x_{n+1}, X, y) = N(µ_{n+1}, σ_{n+1}^2), µ_{n+1} = k^T C_n^{−1} y, σ_{n+1}^2 = c − k^T C_n^{−1} k
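A minimal sketch of the GP predictive equations above with an RBF kernel (assumes numpy; h and σ² are illustrative hyperparameters):

import numpy as np

def rbf(A, B, h=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / h ** 2)

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(30, 1)); y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 30)
x_new = np.array([[0.5]]); sigma2 = 0.01

C = rbf(X, X) + sigma2 * np.eye(30)           # C_n = K + σ² I_n
k = rbf(X, x_new)[:, 0]                       # k = k(x_{n+1}, X)
c = rbf(x_new, x_new)[0, 0] + sigma2          # c = k(x_{n+1}, x_{n+1}) + σ²
mu = k @ np.linalg.solve(C, y)                # µ_{n+1} = k^T C_n^{-1} y
var = c - k @ np.linalg.solve(C, k)           # σ²_{n+1} = c − k^T C_n^{-1} k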
Numerical Estimation Techniques
Cross-Validation
fitted models f̂^{−ν} ∈ arg min_{f∈F} (1/|Z\Z_ν|) Σ_{i∉Z_ν} (y_i − f(x_i))^2
pred. error R̂_cv = (1/n) Σ_{i≤n} (y_i − f̂^{−κ(i)}(x_i))^2
unbiasedness N(k−1)/k ≥ m (exam 2018, k-fold CV)
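A minimal sketch of the k-fold CV estimate R̂_cv for OLS (assumes numpy; k = 5 and the data are illustrative):

import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 4)); y = X @ rng.normal(size=4) + rng.normal(0, 0.3, 100)
k = 5
folds = np.array_split(rng.permutation(100), k)
errs = []
for holdout in folds:
    train = np.setdiff1d(np.arange(100), holdout)
    beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]   # f̂^{-ν}
    errs.append((y[holdout] - X[holdout] @ beta) ** 2)
R_cv = np.concatenate(errs).mean()            # (1/n) Σ (y_i − f̂^{-κ(i)}(x_i))²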
Bootstrap
It works if R_n^{str}(F̂, F̂*) − R_n^{str}(F, F̂) →_P 0
bootst. avg. risk: R̂* = (1/B)(1/n) Σ_{b=1}^B Σ_{i=1}^n Q(y_i, f̂*_b(x_i))
Sols for overlap: C_{−i} := {b ∈ [B] : x_i ∉ Z*_b}
R̂^{(1)} = (1/n) Σ_{i=1}^n (1/|C_{−i}|) Σ_{b∈C_{−i}} l(y_i, f*_b(x_i)) (R̂* fit on train set)
R̂^{.632} = 0.368 R̂* + 0.632 R̂^{(1)}
R̂^{(0.632+)} = (1 − ŵ) R̂* + ŵ R̂^{(1)}, ŵ = 0.632/(1 − 0.368 Ĝ)
Ĝ = (R̂^{(1)} − R̂*)/(γ̂ − R̂*), γ̂ = (1/n^2) Σ_{i=1}^N Σ_{j=1}^N l(y_i, f̂(x_j))
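A minimal sketch of the apparent risk R̂*, the leave-one-out bootstrap risk R̂^{(1)}, and the .632 combination for an OLS predictor (assumes numpy; n, B, and the data are illustrative):

import numpy as np

rng = np.random.default_rng(7)
n, B = 60, 100
X = rng.normal(size=(n, 2)); y = X @ np.array([1.0, -1.0]) + rng.normal(0, 0.5, n)

fit = lambda Xt, yt: np.linalg.lstsq(Xt, yt, rcond=None)[0]
beta_full = fit(X, y)
R_star = np.mean((y - X @ beta_full) ** 2)     # apparent (train-set) risk R̂*

loo = [[] for _ in range(n)]
for b in range(B):
    idx = rng.integers(0, n, n)                # bootstrap sample Z*_b
    beta = fit(X[idx], y[idx])
    out = np.setdiff1d(np.arange(n), idx)      # i with x_i ∉ Z*_b, i.e. b ∈ C_{-i}
    for i in out:
        loo[i].append((y[i] - X[i] @ beta) ** 2)
R1 = np.mean([np.mean(e) for e in loo if e])   # R̂^{(1)}
R_632 = 0.368 * R_star + 0.632 * R1            # R̂^{.632}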
Jackknife
Ŝ_{n−1}^{−i}(x_1, ..., x_n) := Ŝ_{n−1}(x_1, ..., x_{i−1}, x_{i+1}, ..., x_n)
S̃_n := (1/n) Σ_{i=1}^n Ŝ_{n−1}^{−i}
bias_JK := (n − 1)(S̃_n − Ŝ_n)
(Jackknife) Debiased estimator Ŝ_JK = Ŝ_n − bias_JK
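A minimal sketch of the jackknife bias correction applied to the biased variance MLE (assumes numpy; data illustrative):

import numpy as np

x = np.random.default_rng(8).normal(0, 2, 50)
S_hat = lambda a: ((a - a.mean()) ** 2).mean()          # Ŝ_n, divides by n (biased)
S_tilde = np.mean([S_hat(np.delete(x, i)) for i in range(len(x))])  # S̃_n
bias_jk = (len(x) - 1) * (S_tilde - S_hat(x))           # (n−1)(S̃_n − Ŝ_n)
S_jk = S_hat(x) - bias_jk                               # ≈ unbiased sample variance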
Tests and criteria
Let X_1, ..., X_n ∼ Q(x) i.i.d. and H_0: Q = P_0, H_1: Q = P_1. Test
g(x_1, ..., x_n) = 0 (accepted) if P_0(x_1, ..., x_n)/P_1(x_1, ..., x_n) > T
g(x_1, ..., x_n) = 1 (rejected) if P_0(x_1, ..., x_n)/P_1(x_1, ..., x_n) ≤ T
Then α* = E_0[g(x_1, ..., x_n)] and β* = 1 − E_1[g(x_1, ..., x_n)].
Assume that we know the log-likelihood function (loss) of the model.
Bayes Factor p(X|M_k)/p(X|M_l); if > 1, take M_k; p(X|M_k) = ∫ p(X|θ_k, M_k) p(θ_k|M_k) dθ_k
BIC (minimise) −2 log p̂(X|θ̂_k, M_k) + k_0 log n (k_0 = #free params in M_k)
Laplace approx. log p(X|M_k) = log p(X|θ̂_k, M_k) − (log n) k_0/2 + O(1)
MDL −log p(X|θ_k) − log p(θ_k)
AIC −2 log p̂(X|θ̂_k) + 2k
KL D(p‖p̂) = −∫ p(x) log(p̂(x|θ̂_k)/p(x)) dx
TIC −2 log p̂(X|θ̂_k) + 2 trace[I_1(θ_k) J_1^{−1}(θ_k)]
AIC is asymptotically equivalent to LOOCV for ordinary linear regression models.
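A minimal sketch computing AIC and BIC for a univariate Gaussian fit with k = k_0 = 2 free parameters (assumes numpy; data illustrative):

import numpy as np

x = np.random.default_rng(9).normal(1.0, 2.0, 200)
mu, var = x.mean(), x.var()
loglik = -0.5 * len(x) * (np.log(2 * np.pi * var) + 1)   # log p̂(X|θ̂) at the Gaussian MLE
k = 2
AIC = -2 * loglik + 2 * k
BIC = -2 * loglik + k * np.log(len(x))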
Linear Discriminant Functions
Gradient Descent a_{k+1} = a_k − η_k ∇J(a_k)
J(a_{k+1}) ≈ J(a_k) + ∇J^T(a_{k+1} − a_k) + (1/2)(a_{k+1} − a_k)^T H (a_{k+1} − a_k), η_OPT = ‖∇J‖^2/(∇J^T H ∇J)
Newton's Rule a_{k+1} = a_k − H^{−1} ∇J(a_k)
Percep. loss J(a) = Σ_{x̃∈X̃^mc} (−a^T x̃) (X̃^mc = misclassified samples)
Percep. update a_{k+1} = a_k + η_k Σ_{x̃∈X̃^mc} x̃
γ = min_{i∈X̃^mc} (â^T x̃_i), β^2 = max_{i∈X̃^mc} ‖x̃_i‖^2, Max steps γ^{−2} β^2 ‖â‖^2
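A minimal sketch of the batch perceptron update above on separable 2D data (assumes numpy; labels are folded into x̃_i = z_i[x_i; 1], so the criterion is a^T x̃_i > 0 for all i):

import numpy as np

rng = np.random.default_rng(10)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)), rng.normal([-2, -2], 0.5, (20, 2))])
z = np.r_[np.ones(20), -np.ones(20)]
Xt = z[:, None] * np.c_[X, np.ones(40)]        # x̃_i = z_i [x_i; 1]

a = np.zeros(3)
while True:
    mc = Xt[Xt @ a <= 0]                       # misclassified set X̃^mc
    if len(mc) == 0:
        break                                  # converges in ≤ γ^{-2}β²‖â‖² steps
    a += mc.sum(axis=0)                        # a_{k+1} = a_k + η Σ x̃, η = 1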
Bayesian view:
Prior P(Y = y) = π_y
Posterior density p(y|x) = π_y p(x|y) / Σ_z π_z p_z(x)
c(x) = y if Σ_z L(z, y) p(z|x) = min_{ρ≤k} Σ_z L(z, ρ) p(z|x) ≤ d, else D (doubt)
Outlier classif.: π_O p_O(x) ≥ max{(1 − d) p(x), max_z π_z p_z(x)}
Fisher's Linear Discriminant Analysis (LDA)
sample avg m_α = (1/n_α) Σ_{x∈X_α} x, n_α = |X_α|
projected avg m̃_α = (1/n_α) Σ_{x∈X_α} w^T x = w^T m_α
class scatter Σ_α = Σ_{x∈X_α} (x − m_α)(x − m_α)^T
within scatter Σ_W = Σ_{1≤α≤k} Σ_α
projected scatter Σ̃_α = w^T Σ_α w
Fisher's separation J(w) = (w^T (m_1 − m_2)(m_1 − m_2)^T w)/(w^T Σ_W w)
yields w ∝ Σ_W^{−1}(m_1 − m_2)
Mean (between-class) scatter Σ_B = (m_1 − m_2)(m_1 − m_2)^T
generalized eigenproblem Σ_W^{−1} Σ_B w = λw, λ = (w^T Σ_B w)/(w^T Σ_W w)
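A minimal sketch computing the Fisher direction w ∝ Σ_W^{−1}(m_1 − m_2) (assumes numpy; the two Gaussian classes are illustrative):

import numpy as np

rng = np.random.default_rng(11)
X1 = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], 100)
X2 = rng.multivariate_normal([3, 1], [[1, 0.5], [0.5, 1]], 100)

m1, m2 = X1.mean(0), X2.mean(0)
S = lambda Xa, m: (Xa - m).T @ (Xa - m)       # class scatter Σ_α
Sw = S(X1, m1) + S(X2, m2)                    # within scatter Σ_W
w = np.linalg.solve(Sw, m1 - m2)              # w ∝ Σ_W^{-1}(m_1 − m_2)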
Lagrangian Optimization
min f(w), w ∈ Ω ⊆ R^d s.t. g_i(w) ≤ 0, 1 ≤ i ≤ k; h_j(w) = 0, 1 ≤ j ≤ m
L(w, α, β) = f(w) + Σ_{i=1}^k α_i g_i(w) + Σ_{j=1}^m β_j h_j(w)
∂L/∂w |_{w=w*} = 0
Dual: max_{α,β} θ(α, β) with θ(α, β) = inf_w L(w, α, β), s.t. α_i ≥ 0
Duality gap ∆ :=L(w∗, α∗, β∗)−θ(α∗, β∗)
Strong duality: for a convex objective f and a convex domain, the duality gap is zero.
KKT Conditions: if f ∈ C^1 and the g_i, h_j are affine, then w* is an optimum if α*, β* satisfy
∂L(w*, α*, β*)/∂w = 0, ∂L(w*, α*, β*)/∂β = 0
α*_i g_i(w*) = 0, g_i(w*) ≤ 0, α*_i ≥ 0
SVM
Soft Margin (geometric problem formulation)
Primal min_{w,ξ} (1/2) w^T w + C Σ_{i=1}^n ξ_i s.t. z_i(w^T y_i + w_0) ≥ 1 − ξ_i, ξ_i ≥ 0
Dual max_α Σ_{i≤n} α_i − (1/2) Σ_{i≤n} Σ_{j≤n} α_i α_j z_i z_j y_i^T y_j s.t. C ≥ α_i ≥ 0, Σ_{i≤n} z_i α_i = 0
Solution w_0* = −(max_{i:z_i=−1} w*^T y_i + min_{i:z_i=1} w*^T y_i)/2
w* = Σ_{i∈SV} α*_i z_i y_i, g*(y) = Σ_{i∈SV} z_i α*_i y_i^T y + w_0*
By the KKT condition ξ_i(α_i − C) = 0, a non-zero slack variable can only occur if α_i = C.
The optimal margin is given by: w^T w = Σ_{i∈SV} α*_i
Multi-class SVM: w^T = (w_1^T, ..., w_n^T).
Primal min_{w,ξ} (1/2) w^T w + C Σ_{i≤n} ξ_i s.t. ξ_i ≥ 0,
(w_{z_i}^T y_i + w_{z_i,0}) − max_{z≠z_i}(w_z^T y_i + w_{z,0}) ≥ 1 − ξ_i
Structured SVM:
Primal min_{w,ξ} (1/2) w^T w + C Σ_{i≤n} ξ_i s.t. ξ_i ≥ 0,
w^T Ψ(z_i, y_i) − max_{z≠z_i}[∆(z, z_i) + w^T Ψ(z, y_i)] ≥ −ξ_i,
equivalently w^T Ψ(z_i, y_i) − w^T Ψ(z, y_i) ≥ ∆(z, z_i) − ξ_i ∀i, ∀z ≠ z_i
Dual max_α −(1/2) Σ_{i=1}^n Σ_{j=1}^n Σ_{z_k∈K} α_ik α_jk Ψ_i(z_k)^T Ψ_j(z_k) + Σ_{i=1}^n Σ_{z_k∈K} α_ik ∆_i(z_k)
s.t. C ≥ Σ_{z_k∈K} α_ik ≥ 0, α_ik ≥ 0 ∀i, ∀k
Prediction h(y) = arg max_{z∈K} w^T ψ(z, y)
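A minimal sketch of a soft-margin linear SVM (assumes numpy). Instead of the dual QP above, it minimizes the equivalent hinge-loss primal (1/2)‖w‖² + C Σ max(0, 1 − z_i(w^T y_i + w_0)) by subgradient descent; data and step sizes are illustrative:

import numpy as np

rng = np.random.default_rng(12)
Y = np.vstack([rng.normal([2, 2], 1, (50, 2)), rng.normal([-2, -2], 1, (50, 2))])
z = np.r_[np.ones(50), -np.ones(50)]

C, w, w0 = 1.0, np.zeros(2), 0.0
for t in range(1, 2001):
    eta = 1.0 / t
    margins = z * (Y @ w + w0)
    viol = margins < 1                                  # points with active slack ξ_i > 0
    w -= eta * (w - C * (z[viol, None] * Y[viol]).sum(0))   # subgradient in w
    w0 += eta * C * z[viol].sum()                       # subgradient in w_0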
Ensemble
If we combine B independent regressors: V[f̂(x)] ≈ σ^2/B, bias[f̂(x)] = (1/B) Σ_{i=1}^B bias[f̂_i(x)]
Boosting: Weighted models and weighted training data instead of bootstrapping.
ε_b ← Σ_{i=1}^n w_i^{(b)} I{c_b(x_i) ≠ y_i} / Σ_{i=1}^n w_i^{(b)}
α_b ← log((1 − ε_b)/ε_b) = log(p(y = 1|x)/p(y = −1|x)) (log-odds ratio)
∀i: w_i ← w_i exp(α_b I{c_b(x_i) ≠ y_i})
ĉ_B(x) = sign(Σ_{b=1}^B α_b c_b(x))
avg. exp. loss = (1/N) Σ_{i=1}^N exp(−y_i ĉ_B(x_i))
Err_AdaBoost = exp(−h(x) sign(Σ_{b=1}^B α_b c_b(x)))
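A minimal sketch of the AdaBoost loop above with decision stumps as the weak learners c_b (assumes numpy; the exhaustive stump search and the data are illustrative):

import numpy as np

rng = np.random.default_rng(13)
X = rng.normal(size=(200, 2)); y = np.sign(X[:, 0] + X[:, 1])

def best_stump(X, y, w):
    # search over (feature, threshold, sign) minimizing the weighted error ε_b
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thr in X[:, j]:
            for s in (1, -1):
                pred = s * np.sign(X[:, j] - thr + 1e-12)
                err = w[pred != y].sum() / w.sum()
                if err < best[0]:
                    best = (err, (j, thr, s))
    return best

w = np.ones(200); alphas, stumps = [], []
for b in range(20):
    eps, (j, thr, s) = best_stump(X, y, w)
    alpha = np.log((1 - eps) / eps)                  # α_b = log((1−ε_b)/ε_b)
    pred = s * np.sign(X[:, j] - thr + 1e-12)
    w *= np.exp(alpha * (pred != y))                 # w_i ← w_i exp(α_b I{c_b(x_i) ≠ y_i})
    alphas.append(alpha); stumps.append((j, thr, s))

F = sum(a * (s * np.sign(X[:, j] - thr + 1e-12)) for a, (j, thr, s) in zip(alphas, stumps))
c_B = np.sign(F)                                     # ĉ_B(x) = sign(Σ α_b c_b(x))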
PAC Learning
error error(h) = P_{x∼D}[c(x) ≠ h(x)]
(ε, δ) criterion: P_{X,Y}[R(ĉ_n) ≤ R(c_Bayes) + ε] > 1 − δ
Strong PAC L.: holds for arbitrarily small ε
Weak PAC L.: non-trivially large ε
PAC learnability P[R(ĉ_n) ≤ ε] ≥ 1 − δ
efficiently: A runs in poly time in 1/ε and 1/δ
Results:
R(ĉ*_n) − inf_{c∈C} R(c) ≤ 2 sup_{c∈C} |R̂_n(c) − R(c)|
P[sup_{c∈C} |R̂_n(c) − R(c)| > ε] ≤ 2N exp(−2nε^2)
Implying: R(c) ≤ R̂_n(c) + √((log N − log(δ/2))/(2n))
X shattered by 𝒜 if {X∩A | A ∈ 𝒜} contains all subsets of X
VC Dim of 𝒜 = max{n : ∃X s.t. X shattered by 𝒜, |X| = n}
score score(𝒜, X) = |{X∩A | A ∈ 𝒜}|
shattering coeff. s(𝒜, n) = max_{X:|X|=n} score(𝒜, X)
If V_𝒜 > 2 (V_𝒜 = VC dim. of 𝒜): s(𝒜, n) ≤ n^{V_𝒜}
P[R(c*_n) − inf_{c∈C} R(c) > ε] ≤ 8 s(𝒜, n) exp(−nε^2/32)
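A minimal sketch of score(𝒜, X) for the illustrative class of 1D threshold classifiers 𝒜 = {(−∞, t] : t ∈ R}: the score grows as n + 1 rather than 2^n, so no 2-point set is shattered and the VC dimension is 1 (assumes numpy):

import numpy as np

def score(points):
    # distinct subsets {X ∩ (−∞, t]} over all thresholds t
    pts = np.sort(points)
    labelings = {tuple(pts <= t) for t in np.r_[pts.min() - 1, pts]}
    return len(labelings)

print(score(np.array([0.3, 1.2, 2.7])))   # 4 = n + 1, not 2^3, so X is not shattered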
Non-Parametric Bayesian Methods
Beta function B(a, b) = Γ(a)Γ(b)/Γ(a + b), a, b > 0; Γ(a) = ∫_0^∞ e^{−x} x^{a−1} dx
Beta(x|a, b) = x^{a−1}(1 − x)^{b−1}/B(a, b), x ∈ [0, 1]
Dir(x|α) = Π_{k=1}^n x_k^{α_k−1} / B(α)
Finite Gaussian Mix p(x_i|θ) = Σ_{k=1}^K ρ_k N(x_i|µ_k, σ_k)
Stick breaking process (GEM distribution):
β_k ∼ Beta(1, α), ρ_k = β_k(1 − Σ_{i=1}^{k−1} ρ_i), k = 1, 2, ...
Chinese Restaurant Process P(P) = (α^{|P|}/α^{(n)}) Π_{τ∈P} (|τ| − 1)!
E[#tables] = Σ_{i=1}^N α/(α + i) ∈ O(α log N)
P[customer n+1 joins table τ ∈ P ∪ {∅}|P] = |τ|/(α + n) if τ ∈ P, α/(α + n) otherwise
Dirichlet Mixture Model
Base measure µ_k ∼ N(µ_0, σ_0)
cluster prob. ρ = (ρ_1, ρ_2, ...) ∼ GEM(α)
Category assignment z_i ∼ Categorical(ρ)
Data sample x_i ∼ N(µ_{z_i}, σ)
De Finetti's Theorem p(X_1, ..., X_n) = ∫ Π_i p(x_i|G) dP(G)
Gibbs Sampling
p(z_i = k|z_{−i}, x, α, µ) ∝ p(z_i = k|z_{−i}, α) [prior] · p(x_i|x_{−i}, z_i = k, z_{−i}, µ) [likelihood]
p(z_i = k|z_{−i}, x, α, µ) = N_{k,−i}/(α + N − 1) · p(x_i|x_{−i,k}, µ) for existing k, α/(α + N − 1) · p(x_i|µ) otherwise
p(x_i, x_{−i,k}|µ) = ∫ p(x_i|µ_k) [Π_{j≠i} p(x_j|µ_k)] p(µ_k|µ_0, σ_0) dµ_k
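A minimal sketch simulating the CRP seating rule above (assumes numpy; α and N are illustrative), comparing the number of occupied tables to the O(α log N) rate:

import numpy as np

rng = np.random.default_rng(14)
alpha, N = 2.0, 1000
tables = []                                        # tables[τ] = customers at table τ
for n in range(N):
    p = np.array(tables + [alpha]) / (alpha + n)   # |τ|/(α+n), new table with α/(α+n)
    tau = rng.choice(len(p), p=p)
    if tau == len(tables):
        tables.append(1)                           # open a new table
    else:
        tables[tau] += 1
print(len(tables), alpha * np.log(N))              # empirical vs. O(α log N)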
Gaussian-mixtures and EM estimation
Parameters θ = {π_c, µ_c, σ_c^2}_{c=1}^k
k-Gaussian Mix. P(x_i|θ) = Σ_{c=1}^k π_c P(x_i|c, µ_c, σ_c^2), Σ_{c=1}^k π_c = 1
log likelihood L(X|θ) = log P(X|θ) = Σ_{i=1}^n log P(x_i|θ) = Σ_{i=1}^n log Σ_{c=1}^k π_c P(x_i|c, µ_c, σ_c^2)
Define binary latent variables M_ic ∈ {0, 1} where M_ic indicates that x_i is generated by component c.
complete-data log likelihood L(X, M|θ) = log Π_{i=1}^n Π_{c=1}^k (π_c P(x_i|c, µ_c, σ_c^2))^{M_ic}
= Σ_{i=1}^n Σ_{c=1}^k M_ic log(π_c P(x_i|c, µ_c, σ_c^2))
Expectation over the latent variables γ_ic := E_{M|X,θ}[M_ic]
Q(θ) := E_{M|X,θ}[L(X, M|θ)] = Σ_{i=1}^n Σ_{c=1}^k γ_ic log(π_c P(x_i|c, µ_c, σ_c^2))
EM-estimation algo: • E-step: compute γ_ic, θ const • M-step: compute θ, γ_ic const
E-step: E_{M|X,θ}[M_ic] = 1·P(M_ic = 1|x_i, θ) + 0·P(M_ic = 0|x_i, θ)
= P(M_ic = 1|x_i, θ) = P(c|x_i, θ) = P(x_i|c, θ) P(c|θ)/P(x_i|θ) = π_c P(x_i|c, µ_c, σ_c^2) / Σ_{j=1}^k π_j P(x_i|j, µ_j, σ_j^2)
M-step:
(i) µ_c from arg max_θ Q(θ): ∂Q(θ)/∂µ_c = 0 ⟹ µ_c = Σ_{i=1}^n γ_ic x_i / Σ_{i=1}^n γ_ic
(ii) σ_c from arg max_θ Q(θ): ∂Q(θ)/∂σ_c = 0 ⟹ σ_c^2 = Σ_{i=1}^n γ_ic (x_i − µ_c)^2 / Σ_{i=1}^n γ_ic
(iii) π_c from arg max_θ Q(θ) with constraint Σ_c π_c = 1: L(θ, λ) = −Q(θ) + λ(Σ_{c=1}^k π_c − 1)
⇒ ∂L(θ, λ)/∂π_c = 0 ⇔ Σ_{i=1}^n γ_ic = λπ_c ⇔ Σ_{c=1}^k Σ_{i=1}^n γ_ic = λ Σ_{c=1}^k π_c
⇔ Σ_{i=1}^n Σ_{c=1}^k γ_ic = Σ_{i=1}^n 1 = λ ⇔ π_c = Σ_{i=1}^n γ_ic / n
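A minimal sketch of the full EM loop for a 1D k-component Gaussian mixture, implementing the E-step (γ_ic) and M-step (µ_c, σ_c², π_c) formulas above (assumes numpy; k, initialization, and iteration count are illustrative):

import numpy as np

rng = np.random.default_rng(15)
x = np.r_[rng.normal(-2, 0.5, 150), rng.normal(3, 1.0, 150)]
k, n = 2, len(x)
pi, mu, var = np.ones(k) / k, rng.choice(x, k), np.ones(k)

for _ in range(50):
    # E-step: γ_ic = π_c N(x_i|µ_c, σ_c²) / Σ_j π_j N(x_i|µ_j, σ_j²)
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step
    Nc = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nc               # Σ γ_ic x_i / Σ γ_ic
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nc  # Σ γ_ic (x_i − µ_c)² / Σ γ_ic
    pi = Nc / n                                              # π_c = Σ γ_ic / n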