Inference in Belief Networks
Stefan Edelkamp
Universität Bremen
May 6, 2015
Overview
Bayes Theorem
Notation: P(x) is short for P(X = x).
Theorem (Bayes, 1763):
P(h) = prior probability of hypothesis h
P(d) = prior probability of training data d
P(h|d) = probability of h given d
P(d|h) = probability of d given h

P(h|d) = P(d|h) · P(h) / P(d)

Proof: By definition of conditional probability,
P(h|d) := P(h ∩ d) / P(d) = P(d|h) · P(h) / P(d).
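A minimal numeric sketch in Python (all numbers hypothetical, not from the lecture): computing the posterior P(h|d) from an assumed prior and likelihood via the theorem above.

    # Hypothetical numbers: 2% base rate, 90% sensitivity, 10% false-positive rate
    p_h = 0.02              # prior P(h)
    p_d_given_h = 0.90      # likelihood P(d|h)
    p_d_given_not_h = 0.10  # P(d|not h)

    # Total probability: P(d) = P(d|h)*P(h) + P(d|not h)*P(not h)
    p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

    # Bayes: P(h|d) = P(d|h) * P(h) / P(d)
    p_h_given_d = p_d_given_h * p_h / p_d
    print(p_h_given_d)      # ~ 0.155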
Warm-Up: Naive Bayes Classifier
Each instance is described by attributes a1, ..., an. Most probable value:

v* = argmax_{vj ∈ V} P(vj | a1, ..., an)
   = argmax_{vj ∈ V} P(a1, ..., an | vj) · P(vj) / P(a1, ..., an)
   = argmax_{vj ∈ V} P(a1, ..., an | vj) · P(vj)

Naive Bayes assumption P(a1, ..., an | vj) = ∏_i P(ai | vj) yields

v* = argmax_{vj ∈ V} P(vj) · ∏_i P(ai | vj)

Example: P(y) P(sun|y) P(wind|y) < P(n) P(sun|n) P(wind|n)
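A minimal sketch of this decision rule in Python, assuming the priors and per-attribute likelihoods are given as tables (all numbers hypothetical, chosen so the example inequality above holds):

    prior = {'y': 0.6, 'n': 0.4}
    likelihood = {
        'y': {'sun': 0.7, 'wind': 0.3},
        'n': {'sun': 0.4, 'wind': 0.8},
    }

    def classify(attributes):
        # v* = argmax_v P(v) * prod_i P(a_i | v)
        def score(v):
            s = prior[v]
            for a in attributes:
                s *= likelihood[v][a]
            return s
        return max(prior, key=score)

    # P(y)P(sun|y)P(wind|y) = 0.126 < P(n)P(sun|n)P(wind|n) = 0.128
    print(classify(['sun', 'wind']))  # -> 'n'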
Overview
Belief Networks
Belief/Bayesian Network (BN): graphical representation of causality and independence using conditional probability tables (CPTs).
BNs provide a way to structure knowledge and to exploit that structure for computational gain.
Definition
Definition: BN = (V, A, P), where V are the vertices, A the adjacencies (arcs), and P a compact representation of the joint probability distribution over all variables.
A BN is fully defined by the graph (V, A) plus the CPTs P(yi | φ(Yi)), where φ(Yi) denotes the immediate predecessors of Yi in the graph, i.e.,

P(y1, ..., yn) = ∏_{i=1}^{n} P(yi | φ(Yi))

Naive BN: input nodes attached to one output node.
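A sketch of this factored joint for a hypothetical two-node network Rain → WetGrass (names and numbers are illustrative assumptions, not from the lecture):

    # CPTs: P(Rain) and P(WetGrass | Rain), keyed by (value, parent value)
    p_rain = {True: 0.2, False: 0.8}
    p_wet_given_rain = {
        (True, True): 0.9,  (False, True): 0.1,
        (True, False): 0.1, (False, False): 0.9,
    }

    def joint(rain, wet):
        # Product over the graph: P(rain, wet) = P(rain) * P(wet | rain)
        return p_rain[rain] * p_wet_given_rain[(wet, rain)]

    print(joint(True, True))  # 0.2 * 0.9 = 0.18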
Inference in Bayesian Networks
How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
The BN contains all the information needed for this inference. If only one variable has an unknown value, it is easy to infer. In the general case, the problem is hard!
In practice
Monte Carlo methods "simulate" the network randomly to calculate approximate solutions.
Exact inference methods work well for some network structures, e.g., polytrees (at most one undirected path between any two nodes).
Overview
Inventor
Judea Pearl, UCLA, Turing Award Winner
Hardness Results
Gregory F. Cooper (Stanford)
The Computational Complexity of Probabilistic Inference using Bayesian Belief Networks
Bucket Elimination
Rina Dechter, UCI, AAAI Fellow
Bucket Elimination: A Unifying Framework for Probabilistic Inference
Overview
SAT
In SAT we are given a formula f as a conjunction of clauses over the literals {x1, ..., xn} ∪ {¬x1, ..., ¬xn}.
The task is to search for an assignment a = (a1, ..., an) ∈ {0,1}^n for x1, ..., xn such that f(a) = true.
Theorem (Cook, 1971): SAT is NP-complete.
In 3-SAT the instances consist of clauses of the form l1 ∨ l2 ∨ l3, with li ∈ {x1, ..., xn} ∪ {¬x1, ..., ¬xn}.
Theorem (Garey/Johnson, 1979): 3-SAT is NP-complete.
Probabilistic Inference is NP-hard
W.l.o.g. assume Boolean (propositional) variables.
Typically, PI in BNs means calculating P(Exp1 | Exp2), where Expi is a conjunction of (instantiated) random variables.
Example: P(X = T | Y = T ∧ Z = F)
Most restricted decision problem PIBN: is P(Y = T) > 0?
Theorem (Cooper): PIBN is NP-hard.
Towards a Proof
Let C = {c1, ..., cm} be a set of 3-SAT clauses over {u1, ..., un}. We construct a belief network for which we show:
P(Y = T) > 0 if and only if C is satisfiable.
Belief-Network Structure
Probabilities
Truth-Setting Component: one for each variable
P(u1 = T) = P(u2 = T) = ... = P(un = T) = 1/2
Clause-Satisfaction Testing Component: one for each clause, w.l.o.g. ux ∨ uy ∨ uz
Variables: ux, uy, uz, Cj
Adjacencies: {(ux, Cj), (uy, Cj), (uz, Cj)}
Conditional Probabilities:
P(Cj = T | ux = F, uy = F, uz = F) = 0
P(Cj = T | ux = F, uy = F, uz = T) = 1
...
P(Cj = T | ux = T, uy = T, uz = T) = 1
Overall-Satisfaction Testing Component
Variables: Y
Adjacencies: link from each Cj to Y
Conditional Probabilities: P(Y = T | C1 = T, ..., Cm = T) = 1 and P(Y = T | ·) = 0 otherwise (Y is the conjunction of all Cj).
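A brute-force sketch of why the construction works (my enumeration for illustration, not Cooper's algorithm): summing the deterministic network over all 2^n assignments, each with prior (1/2)^n, gives P(Y = T) > 0 exactly when some assignment satisfies all clauses.

    from itertools import product

    # Hypothetical 3-SAT instance over u1..u3; a literal is (index, positive?)
    clauses = [[(1, True), (2, False), (3, True)],
               [(1, False), (2, True), (3, True)]]

    def p_y_true(n, clauses):
        total = 0.0
        for bits in product([False, True], repeat=n):  # truth-setting: prior (1/2)^n
            # Cj = T iff its clause is satisfied; Y = T iff all Cj = T
            sat = all(any(bits[i - 1] == pos for i, pos in c) for c in clauses)
            total += 0.5 ** n if sat else 0.0
        return total

    print(p_y_true(3, clauses) > 0)  # True iff the formula is satisfiable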
Overview
Bucket Elimination (BE)
Given:
BN structure (V, A, P),
ordering π on the n variables with attached CPTs,
evidence nodes Ei with values ei,
query node X.
Algorithm to compute P(X = x | E1 = e1, ..., En = en) for all x:
Create n+1 buckets: b∅ and one bucket bi per variable Xi.
Store each CPT in the bucket of its highest-indexed variable (w.r.t. π).
Process the buckets from highest index down, eliminating the associated variable in each; a bucket-creation sketch follows below.
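A minimal sketch of the bucket setup, assuming each CPT is a pair (scope, table) and the ordering π is a list of variable names (all names hypothetical; the extra bucket b∅ for fully-instantiated factors is omitted):

    def make_buckets(order, cpts):
        # One bucket per variable in the elimination ordering pi;
        # each CPT goes into the bucket of its highest-ranked variable.
        rank = {v: i for i, v in enumerate(order)}
        buckets = {v: [] for v in order}
        for scope, table in cpts:
            highest = max(scope, key=rank.get)
            buckets[highest].append((scope, table))
        return buckets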
Algorithm
3 operators:
Join (combine 2 CPTs): h(x) = (f ⊗ g)(x) = f(x | Vars(f)) × g(x | Vars(g))
Eliminate (remove a variable): elim_X[f](y) = Σ_x f(Y = y, X = x)
Project (delete a variable): h_{X,−Y}(x, y) = h_X(x)
Loop:
1 Project the evidence into the CPTs
2 Process the buckets from highest to lowest:
1 g_X = elim_X[f_{X,1} ⊗ f_{X,2} ⊗ ... ⊗ f_{X,k}] is a function of ∪_i Vars(f_{X,i}) \ {X}
2 Store g_X into bucket b_Y, where Y is the highest-indexed variable in Vars(g_X)
Example: [Blackboard]
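A sketch of the two core operators on factors, each represented as a pair (scope, table) over Boolean variables; the usage line reuses the hypothetical Rain → WetGrass network from above:

    from itertools import product

    def join(f, g):
        # (f ⊗ g): pointwise product over the union of the two scopes
        fv, ft = f
        gv, gt = g
        hv = fv + tuple(v for v in gv if v not in fv)
        ht = {}
        for vals in product([False, True], repeat=len(hv)):
            a = dict(zip(hv, vals))
            ht[vals] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
        return hv, ht

    def eliminate(f, x):
        # elim_x[f]: sum variable x out of the scope of f
        fv, ft = f
        hv = tuple(v for v in fv if v != x)
        ht = {}
        for vals, p in ft.items():
            key = tuple(v for name, v in zip(fv, vals) if name != x)
            ht[key] = ht.get(key, 0.0) + p
        return hv, ht

    f_rain = (('R',), {(True,): 0.2, (False,): 0.8})
    f_wet = (('W', 'R'), {(True, True): 0.9, (False, True): 0.1,
                          (True, False): 0.1, (False, False): 0.9})
    # P(W) = elim_R[f_rain ⊗ f_wet]  ->  P(W=T) = 0.26
    print(eliminate(join(f_rain, f_wet), 'R'))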
Overview
Learning of Bayesian Networks
Learning (LEARN): three cases
structure known and all variables observed ⇒ as easy as training a Naive Bayes classifier
structure known, variables partially observable ⇒ learn the network's CPTs using gradient ascent
structure not known ⇒ greedy search to add/remove edges and nodes
Further Inference Tasks
Most Probable Explanation (MPE): most likely assignment to all hidden variables given the evidence
Algorithm: like BE, but replace Σ by max (see the sketch after this list)
Complexity: NP-complete for general graphs, easy for polytrees
Maximum a Posteriori (MAP): most likely assignment to some hidden variables given the evidence
Complexity: NP^PP-complete for general graphs, NP-hard for polytrees
Value of Information: which evidence to seek?
Sensitivity Analysis: which variables are most critical?
Explanation: why does this happen?
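In the factor sketch above, the MPE variant only changes the eliminate operator, replacing the sum by a max (same hypothetical factor format as before):

    def eliminate_max(f, x):
        # MPE: maximize x out instead of summing it out
        fv, ft = f
        hv = tuple(v for v in fv if v != x)
        ht = {}
        for vals, p in ft.items():
            key = tuple(v for name, v in zip(fv, vals) if name != x)
            ht[key] = max(ht.get(key, 0.0), p)
        return hv, ht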
Overview
Wrap-up and Outlook
Conclusion
Bayesian a.k.a. Belief Network represents a joint probability function compactly
natural model of causality; Naive BN is a special case
PI is hard; MPE, MAP, and LEARN as well; BE is efficient in practice
Outlook
BE is linear (in the size of the CPTs) for chains/trees/polytrees, exponential in the size of the largest bucket
BE for PI and MAP more efficient with pruning rules (e.g., prune all