Computer Vision I -
Algorithms and Applications:
Semantic Segmentation
Carsten Rother
Roadmap this lecture (chapter 14.4.3, 5.5 in book)
• Interactive Image Segmentation
• From Generative models to
• Discriminative models to
• Discriminative functions
• Image Segmentation using GrabCut
• Semantic Segmentation
Probabilities - Reminder
• Discrete probability distribution: p(x) satisfies Σ_x p(x) = 1, where x ∈ {0, …, K}
• Joint distribution of two variables: p(x, z)
• Conditional distribution: p(x | z)
• Sum rule: p(z) = Σ_x p(x, z)
• Product rule: p(x, z) = p(z | x) p(x)
• Bayes' rule: p(x | z) = p(z | x) p(x) / p(z)
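A quick numeric sanity check of these rules, with a made-up 2×2 joint distribution (illustration only, not from the slides):

```python
import numpy as np

# Made-up joint distribution p(x, z) over x in {0,1} (rows) and z in {0,1} (columns)
p_xz = np.array([[0.1, 0.3],
                 [0.2, 0.4]])

p_z = p_xz.sum(axis=0)                     # sum rule:     p(z) = sum_x p(x, z)
p_x = p_xz.sum(axis=1)                     # sum rule:     p(x) = sum_z p(x, z)
p_z_given_x = p_xz / p_x[:, None]          # product rule: p(x, z) = p(z | x) p(x)
p_x_given_z = (p_z_given_x * p_x[:, None]) / p_z[None, :]   # Bayes' rule

assert np.isclose(p_xz.sum(), 1.0)
assert np.allclose(p_x_given_z, p_xz / p_z[None, :])        # consistent with p(x|z) = p(x,z)/p(z)
```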
A Machine Learning View on Models
Modelling a problem:
• The data is z and the desired output x
We can identify three different approaches [see details in Bishop, page 42ff]:
• Generative (probabilistic) models: p(x, z)
• Discriminative (probabilistic) models: p(x | z)
• Discriminative functions: f(x, z)
Generative Model
Models explicitly (or implicitly) the distribution of both the input z and the output x.

Joint probability: p(x, z) = p(z | x) p(x)   (likelihood × prior)

Comments:
1. The joint distribution does not necessarily have to be decomposed into likelihood and prior, but in practice it (nearly) always is.
2. Generative models are used successfully when input z and output x are closely related, e.g. image denoising.

Pros:
1. Possible to sample both x and z.
2. Can quite easily be used for many applications (since prior and likelihood are modelled separately).
3. In some applications, e.g. biology, people want to model likelihood and prior explicitly, since they want to understand the model as much as possible.
4. The probability can be used in bigger systems.

Cons:
1. It might not always be possible to write down the full distribution (it involves a distribution over images z).
Generative Model – Example De-noising

Joint probability: p(x, z) = p(z | x) p(x)   (likelihood × prior)

Pixel-wise likelihood (pixel-independent Gaussian noise):
p(z | x) = ∏_i N(z_i; x_i, σ) ∝ ∏_i exp{ −(z_i − x_i)² / (2σ²) }

[Figure: data z, label x, and a sketched Gaussian N(z_i; x_i, σ)]
Generative Model – Example De-noising

Prior: p(x) = (1/f) exp{ − Σ_{i,j ∈ N} |x_i − x_j| }

Robust prior: p(x) = (1/f) exp{ − Σ_{i,j ∈ N} min(|x_i − x_j|, τ) }

[Figure (sketched): the pairwise potential as a function of x_i − x_j; it follows the statistics of gradients in natural images]

Joint probability: p(x, z) = p(z | x) p(x)   (likelihood × prior)

Pixel-wise likelihood: p(z | x) = ∏_i N(z_i; x_i, σ) ∝ ∏_i exp{ −(z_i − x_i)² / (2σ²) }

Results of more advanced prior models:
• Field of Experts (FoE): learned prior on 5 × 5 patches [Roth et al. IJCV 2008]
• [Komodakis et al. CVPR 2009]
Change of application: in-painting
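To make the de-noising model concrete, a minimal numpy sketch (my own, not from the slides; the values of σ and τ are placeholders) that evaluates the negative log-joint −log p(x, z), up to constants, for a candidate restoration x:

```python
import numpy as np

def denoise_neg_log_joint(x, z, sigma=10.0, tau=50.0):
    """-log p(x, z) (up to constants) for the de-noising model above.

    x: (H, W) candidate clean image, z: (H, W) noisy observation.
    Likelihood: Gaussian per pixel; prior: robust pairwise term min(|x_i - x_j|, tau).
    """
    x = x.astype(np.float64); z = z.astype(np.float64)
    data_term = np.sum((z - x) ** 2) / (2.0 * sigma ** 2)   # -log likelihood
    dh = np.abs(x[:, 1:] - x[:, :-1])                       # horizontal neighbor differences
    dv = np.abs(x[1:, :] - x[:-1, :])                       # vertical neighbor differences
    prior_term = np.sum(np.minimum(dh, tau)) + np.sum(np.minimum(dv, tau))
    return data_term + prior_term                           # lower value = more probable
```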
Same prior, new likelihood (in-painting):

Joint probability: p(x, z) = p(z | x) p(x)   (likelihood × prior)

Pixel-wise likelihood (de-noising): p(z | x) = ∏_i N(z_i; x_i, σ) ∝ ∏_i exp{ −(z_i − x_i)² / (2σ²) }

Pixel-wise likelihood (in-painting):
p(z_i | x_i) = const            for pixels covered by the (red) text
p(z_i | x_i) = δ(x_i = z_i)     otherwise

[Figure: data z and label x]
Generative Model for image segmentation
Interactive Segmentation

Goal: given z, derive a binary labeling x.
Optimal solution: x* = argmax_x p(x, z) for the fixed, given z
(user-specified pixels are not optimized over)

z_i = (R, G, B)    (color image)
x_i ∈ {0, 1}       (binary label per pixel)

Statistical model p(x, z) for both the image z and the labels x
(we later come to p(x | z) and f(x, z))
Generative Model for image segmentation – likelihood

Joint probability: p(x, z) = p(z | x) p(x)   (likelihood × prior)

The red brush strokes give training data for foreground pixels; the blue brush strokes give training data for background pixels.

[Figure: user-labelled pixels and the fitted Gaussian Mixture Models, plotted in the red–green color plane]
Gaussian Mixture Model (GMM)
• Mixture Model: p(z) = Σ_{k=1}^K p(k) p(z | k)
• "k" is a latent variable we are not interested in
• k ∈ {1, …, K} indexes the K mixture components
• Each mixture component k is a 3D Gaussian distribution N(z; μ_k, Σ_k), where μ_k is a 3D vector and Σ_k a 3 × 3 (positive-semidefinite) matrix called the covariance matrix:

  N(z; μ, Σ) = 1 / ( (2π)^{d/2} |Σ|^{1/2} ) · exp{ −½ (z − μ)^T Σ^{−1} (z − μ) }

• p(z) = Σ_{k=1}^K π_k N(z; μ_k, Σ_k), where the π_k are the mixture coefficients
Gaussian Mixture Model (GMM)
• GMM probability: p(z) = Σ_{k=1}^K π_k N(z; μ_k, Σ_k)
• Unknown parameters: Θ = (π_1, …, π_K, μ_1, …, μ_K, Σ_1, …, Σ_K)
• How to learn Θ given data {z_i}:
  • Maximum likelihood estimation using EM (see machine learning lecture ML 1)
  • Next: a simpler procedure to learn GMMs (close to k-means)

A simple procedure for GMM learning / fitting:
Introduce an assignment variable for each data point (pixel) saying to which Gaussian it belongs: k_1, …, k_n with k_i ∈ {1, …, K}, where

  k_i = argmax_k N(z_i; μ_k, Σ_k)
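A minimal numpy sketch of this hard-assignment fitting procedure (the function name and the small covariance regularization are my own choices; the assignment step uses the slide's rule k_i = argmax_k N(z_i; μ_k, Σ_k)):

```python
import numpy as np

def fit_gmm_hard(z, K, n_iters=20, seed=0):
    """Simple hard-assignment GMM fitting (k-means-like), as on the slide.

    z: (n, d) array of data points (e.g. RGB pixel colors), K: number of Gaussians.
    Returns mixture weights pi, means mu, covariances sigma.
    """
    rng = np.random.default_rng(seed)
    n, d = z.shape
    mu = z[rng.choice(n, K, replace=False)].astype(np.float64)   # init means with random points
    sigma = np.stack([np.cov(z.T) + 1e-3 * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iters):
        # Hard assignment: k_i = argmax_k N(z_i; mu_k, Sigma_k)  (done in the log domain)
        logp = np.stack([
            -0.5 * np.einsum('nd,dc,nc->n', z - mu[k],
                             np.linalg.inv(sigma[k]), z - mu[k])
            - 0.5 * np.linalg.slogdet(sigma[k])[1]
            for k in range(K)
        ], axis=1)
        k_i = logp.argmax(axis=1)

        # Re-estimate parameters from the hard assignments
        for k in range(K):
            zk = z[k_i == k]
            if len(zk) == 0:
                continue
            pi[k] = len(zk) / n
            mu[k] = zk.mean(axis=0)
            sigma[k] = np.cov(zk.T) + 1e-3 * np.eye(d)   # regularize for stability
    return pi, mu, sigma
```

Including log π_k in the assignment step would bring this closer to (hard) EM; the slide's rule uses the Gaussian densities only.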
Extensions of K-means
• Choose K automatically
• Go to a probabilistic version using Expectation Maximization (EM).
  Now the assignments are probabilistic (soft) assignments to all Gaussians (not only one)
• Faster versions:
  • Fit a GMM to all data points and then only change the mixture coefficients
  • Use histograms instead of GMMs
Illustration EM
Soft assignment: p(k_i)
[Bishop page 437]
Some comments on clustering
• More in CV 2
• Clustering without spatial constraints (k-means, mean-shift, etc.)
• Clustering with spatial constraints (super-pixels, normalized cut, etc.)
• Gestalt Theory
[Figure (sketched): normalized cut]
Joint Probability - Likelihood
Joint probability: p(x, z) = p(z | x) p(x)   (likelihood × prior)

Likelihood:
p(z | x) = ∏_i p(z_i | x_i) = ∏_i ( Σ_{k=1}^K π_k^{x_i} N(z_i; μ_k^{x_i}, Σ_k^{x_i}) )

Θ = (π_1^0, …, π_K^0, μ_1^0, …, μ_K^0, Σ_1^0, …, Σ_K^0, π_1^1, …, π_K^1, μ_1^1, …, μ_K^1, Σ_1^1, …, Σ_K^1)

All parameters with superscript 0 belong to the background and all with superscript 1 to the foreground.

[Figure: likelihoods p(z_i | x_i = 1) and p(z_i | x_i = 0) evaluated for a new query image pixel z_i]

Maximum likelihood estimation (likelihood only, no prior):
x* = argmax_x p(z | x) = argmax_x ∏_i p(z_i | x_i)
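A minimal sketch of this likelihood-only (ML) classification, assuming foreground and background GMM parameters have been fitted (e.g. with fit_gmm_hard above); the helper names are my own:

```python
import numpy as np

def gmm_loglik(z, pi, mu, sigma):
    """log( sum_k pi_k N(z; mu_k, Sigma_k) ) for each row of z, an (n, d) array."""
    n, d = z.shape
    comp = []
    for k in range(len(pi)):
        diff = z - mu[k]
        maha = np.einsum('nd,dc,nc->n', diff, np.linalg.inv(sigma[k]), diff)
        logdet = np.linalg.slogdet(sigma[k])[1]
        comp.append(np.log(pi[k]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi)))
    return np.logaddexp.reduce(np.stack(comp, axis=1), axis=1)

def ml_segmentation(pixels, fg_params, bg_params):
    """Label each pixel by the more likely GMM (likelihood only, no prior).

    pixels: (n, 3) RGB values; fg_params / bg_params: (pi, mu, sigma) tuples.
    """
    return (gmm_loglik(pixels, *fg_params) > gmm_loglik(pixels, *bg_params)).astype(int)
```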
Joint Probability – Prior

Joint probability: p(x, z) = p(z | x) p(x)   (likelihood × prior)

Prior: p(x) = (1/f) ∏_{i,j ∈ N_4} θ_ij(x_i, x_j)

θ_ij(x_i, x_j) = exp{ −|x_i − x_j| }   called the "Ising prior"
(exp{−1} = 0.37; exp{0} = 1)

f = Σ_x ∏_{i,j ∈ N_4} θ_ij(x_i, x_j)   Partition function: a sum over all possible labelings x
Joint Probability – Prior (4×4 grid example)

Pure prior model: p(x) = (1/f) ∏_{i,j ∈ N_4} exp{ −|x_i − x_j| }

[Figure: the best and the worst solutions, sorted by probability]
"The smoothness prior needs the likelihood"

[Figure: the distribution and samples from it, plotted over all 2^16 configurations of the 4×4 grid]
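The 4×4 grid is small enough to evaluate the Ising prior exactly; a brute-force sketch (my own code, matching the formula above) over all 2^16 labelings:

```python
import itertools
import numpy as np

def ising_prior_4x4():
    """Enumerate all 2^16 labelings of a 4x4 grid and compute the Ising prior
    p(x) = (1/f) * prod_{(i,j) in N4} exp(-|x_i - x_j|) by brute force."""
    H = W = 4
    # 4-connected neighbor pairs
    pairs = [((r, c), (r, c + 1)) for r in range(H) for c in range(W - 1)] + \
            [((r, c), (r + 1, c)) for r in range(H - 1) for c in range(W)]

    configs = list(itertools.product([0, 1], repeat=H * W))   # 65536 labelings
    scores = np.empty(len(configs))
    for idx, flat in enumerate(configs):
        x = np.array(flat).reshape(H, W)
        disagree = sum(abs(int(x[a]) - int(x[b])) for a, b in pairs)
        scores[idx] = np.exp(-disagree)            # unnormalized prior
    f = scores.sum()                               # partition function
    return configs, scores / f

configs, p = ising_prior_4x4()
best = np.argsort(-p)
print([configs[i] for i in best[:2]])   # the two most probable labelings: all-0 and all-1
```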
Joint Probability β Result
Global optimum: x* = argmax_x p(x, z)
ML solution:    x* = argmax_x p(z | x)

Joint probability:
p(x, z) = p(z | x) p(x) = ∏_i ( Σ_{k=1}^K π_k^{x_i} N(z_i; μ_k^{x_i}, Σ_k^{x_i}) ) · (1/f) ∏_{i,j ∈ N_4} exp{ −|x_i − x_j| }

Hard constraint (user-specified pixels): p(x_i = 0) = 0; p(x_i = 1) = 1

Sample from the model
[Figure: samples z from the model p(x, z) = p(z | x) p(x), the true image, and the most likely image]

Why does it still work?
• We only evaluate x for a given z
[Figure (sketched): the global optimum; other likely solutions will look similar]
Best Prior Models for Images
Best prior models for images p(x) can give results like this:
[Figure: sampled x — looks good at the texture level but not at the global level (e.g. scene layout)]
[Field of Experts, Roth et al. IJCV 2008]

Simple model for segmentations:
Recall de-noising: p(x, z) = p(z | x) p(x)
Pixel-wise likelihood: p(z | x) = ∏_i N(z_i; x_i, σ) ∝ ∏_i exp{ −(z_i − x_i)² / (2σ²) }
Is it the best we can do?
4-connected segmentation
[Figure: zoom-in on the image]
Reminder: Going to 8-connectivity
Larger connectivity can model the true Euclidean length (other metrics are also possible)
[Table/figure: lengths of example paths under the Euclidean metric vs. the 4-connected and 8-connected grid metrics (values 5.65, 8, 6.28, 6.75, 5.08, 1)]
[Boykov et al. '03; '05]
Going to 8-connectivity
[Figure: 4-connected vs. 8-connected (MRF) results, each compared with the Euclidean ideal; zoom-in on the image]
Modelling edges
• How can we put this into our model?
• p(x) cannot depend on the data!
• p(z | x) = ∏_{i,j ∈ N_4} p(z_i, z_j | x_i, x_j) must be extended to model all possible pairwise transitions from training data (e.g. with a 6D Gaussian). But:
  • this is difficult for the user to label
  • it is hard to get from other images
• There is a much simpler way: model only p(x | z)
  • a transition is likely when two neighboring pixels have different colors
Half way slide
3 Minutes break
Roadmap this lecture (chapter 14.4.3, 5.5 in book)
• Interactive Image Segmentation
• From Generative models to
• Discriminative models to
• Discriminative functions
• Image Segmentation using GrabCut
• Semantic Segmentation
Discriminative model
p(x | z) = (1/f(z)) exp{ −E(x, z) }   where   f(z) = Σ_x exp{ −E(x, z) }

Models that model the posterior directly are discriminative models.
In Computer Vision we mostly use the Gibbs distribution with an energy E.
These models are also called "Conditional Random Fields".

Pros:
1. Simpler to write down than a generative model (no need to model z), and goes directly for the desired output x
2. More flexible, since the energy is arbitrary
3. The probability can be used in bigger systems

Cons: we can no longer sample images z
Discriminative model
• Relation between posterior and joint: p(x | z) = (1/p(z)) p(x, z)
• p(x, z), p(x | z) and E(x, z) all have the same optimal solution x* given z:
  • x* = argmax_x p(x, z) given z
  • x* = argmax_x p(x | z) given z   (since p(x | z) = (1/p(z)) p(x, z))
  • x* = argmin_x E(x, z)            (since −log p(x | z) = log f(z) + E(x, z))
What does E look like for our segmentation example?

• So that p(x | z) and p(x, z) have the same optimal solution x*, we need:
  −log p(x, z) = E(x, z) + constant
  with E(x, z) = Σ_i θ_i(x_i, z) + Σ_{i,j ∈ N_4} θ_ij(x_i, x_j, z)
• p(x, z) ∝ p(x | z) = (1/f) exp{ −E(x, z) }   (∝ means equal up to scale)

Comment on Generative Models
One may also write the joint distribution π(π, π) as a Gibbs distribution:
p(x, z) = (1/f) exp{ −E(x, z) }   where   f = Σ_{x,z} exp{ −E(x, z) }

If likelihood and prior are no longer modelled separately:
• sampling x, z gets very difficult
• we can no longer learn prior and likelihood separately (as in de-noising)
• we train p(x, z) = (1/f) exp{ −E(x, z) } and p(x | z) = (1/f(z)) exp{ −E(x, z) } in a similar way (see CV 2 lectures)

[Figure: samples z]

The advantages of a generative model over a discriminative model are then gone.
But … it also loses the meaning of a "generative" model, since we no longer have a likelihood which says how the data was "generated".
Adding a contrast term
E(x, z) = Σ_i θ_i(x_i, z) + Σ_{i,j ∈ N_4} θ_ij(x_i, x_j, z)

θ_ij(x_i, x_j, z) = |x_i − x_j| · exp( −β ||z_i − z_j||² )

β = ( 2 ⟨ ||z_i − z_j||² ⟩ )^(−1)   (average taken over all neighboring pairs i, j ∈ N_4)

[Figure: the contrast weight exp( −β ||z_i − z_j||² ) as a function of the color difference]
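A minimal numpy sketch for computing β and the contrast weights exp(−β ||z_i − z_j||²) on a 4-connected grid (the function name and the small numerical guard are my own choices):

```python
import numpy as np

def contrast_weights(img):
    """Contrast-sensitive pairwise weights on a 4-connected grid.

    img: (H, W, 3) float array. Returns horizontal and vertical weights
    w = exp(-beta * ||z_i - z_j||^2) with beta = 1 / (2 * <||z_i - z_j||^2>).
    """
    img = img.astype(np.float64)
    dh = np.sum((img[:, 1:] - img[:, :-1]) ** 2, axis=2)   # horizontal color differences
    dv = np.sum((img[1:, :] - img[:-1, :]) ** 2, axis=2)   # vertical color differences
    beta = 1.0 / (2.0 * np.mean(np.concatenate([dh.ravel(), dv.ravel()])) + 1e-12)
    w_h = np.exp(-beta * dh)   # weight for edge (r, c) - (r, c+1)
    w_v = np.exp(-beta * dv)   # weight for edge (r, c) - (r+1, c)
    return w_h, w_v, beta
```

These weights enter the pairwise term θ_ij = w_ij |x_i − x_j|: transitions are cheap across strong color edges and expensive in homogeneous regions.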
Roadmap this lecture (chapter 14.4.3, 5.5 in book)
• Interactive Image Segmentation
• From Generative models to
• Discriminative models to
• Discriminative functions
• Image Segmentation using GrabCut
• Semantic Segmentation
Discriminative functions
E(x, z): 𝒦^n → ℝ   (for a given z, assigns a real value to every labeling x)

Models that address the classification problem directly via a function.

Examples:
• Energy
• Support vector machines
• Nearest neighbour classifier

Pros: most direct approach to model the problem
Cons: no probabilities

x* = argmin_x E(x, z)

This is the most used approach in computer vision!
Recap
Modelling a problem:
• The input data is z and the desired output x
We can identify three different approaches [see details in Bishop, page 42ff]:
• Generative (probabilistic) models: p(x, z)
• Discriminative (probabilistic) models: p(x | z)
• Discriminative functions: f(x, z)

The key differences are:
• Probabilistic or non-probabilistic model
• Generative models also model the data z
• Differences in training (see CV 2)
Simple example: Learning Discriminative functions
[Figure: segmentation results for different smoothness weights λ = 0, 10, 40, 200]

E(x, z) = Σ_i θ_i(x_i, z) + λ Σ_{i,j ∈ N_4} θ_ij(x_i, x_j, z)

Simple example: Learning Discriminative functions

Training phase: infer λ from a set of training images { (x^t, z^t) } → λ*,
where t runs over all training images (here around 50 images)
[Figure: training pairs (z^t, x^t) for t = 1, 2, …]

Testing phase: apply λ* to segment a new test image
A simple procedure: Learning Discriminative functions

1. Iterate λ = 0, …, 500
2. Compute x*,t for all training images (z^t, x^t)
3. Compute the average error: Error(λ) = (1/T) Σ_t C(x^t, x*,t)
   with loss/cost function C(x, x') = Σ_i |x_i − x'_i| (called the Hamming error)
4. Take the λ with the smallest Error

Hamming error: number of misclassified pixels

Questions:
• Is this the best and only way?
• Can we over-fit to the training data?

[Figure: training error as a function of λ]
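A sketch of this grid search in Python; `segment` stands in for a solver returning x* = argmin_x E(x, z; λ) (e.g. via graph cut) and is a placeholder of mine, not something defined in the slides:

```python
import numpy as np

def hamming_error(x_true, x_pred):
    """C(x, x') = sum_i |x_i - x'_i|: number of misclassified pixels."""
    return np.sum(np.abs(x_true.astype(int) - x_pred.astype(int)))

def learn_lambda(training_set, segment, lambdas=range(0, 501)):
    """Grid search over the smoothness weight, as in the procedure above.

    training_set: list of (z_t, x_t) image / ground-truth pairs.
    segment: hypothetical solver, segment(z, lam) -> predicted 0/1 labeling.
    """
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        err = np.mean([hamming_error(x_t, segment(z_t, lam))
                       for z_t, x_t in training_set])
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err
```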
Big Picture: Learning
Probabilistic Learning (for generative and discriminative models):
1. Training: fit the distribution p(x, z) or p(x | z) to a set of training images (using e.g. maximum likelihood learning)
2. Test: make a decision according to some cost (loss) function C
   (depending on the cost function one computes the optimal solution, marginals, etc.)

Loss-based Learning (for discriminative functions):
1. Training: fit f(x, z) given a certain cost (loss) function (see above)
2. Test: compute the optimal value x* of the function wrt the test image

These are only high-level comments; we dive into this in CV 2!
Roadmap this lecture (chapter 14.4.3, 5.5 in book)
• Interactive Image Segmentation
• From Generative models to
• Discriminative models to
• Discriminative functions
• Image Segmentation using GrabCut
• Semantic Segmentation
Are we done?
[GrabCut, Rother et al. Siggraph 2004]
GrabCut Segmentation
Image z and user input
Global optimal solution
How to prevent the trivial solution?
E(x) = Σ_i θ_i(x_i) + λ Σ_{i,j ∈ N_4} w_ij |x_i − x_j|

So far we had both foreground and background brushes.

Hard constraints:
θ_i(x_i = 0) = ∞,  θ_i(x_i = 1) = 0    (foreground brush pixels)
θ_i(x_i = 0) = 0,  θ_i(x_i = 1) = ∞    (background brush pixels)
What is a good segmentation?
Objects (fore- and background) are self-similar wrt appearance
[Figure: input image and three segmentation options, each split into foreground and background, with Energy = 460000, 482000, 483000]

Energy(x, θ^F, θ^B) = −log p(z | x, θ^F, θ^B)
                    = Σ_i [ −log p(z_i | θ^F, x_i = 1) · x_i − log p(z_i | θ^B, x_i = 0) · (1 − x_i) ]
Full GrabCut functional
Gibbs distribution with energy:

E(x, θ^F, θ^B) = Σ_i [ −log p(z_i | θ^F, x_i = 1) · x_i − log p(z_i | θ^B, x_i = 0) · (1 − x_i) ]
               + λ Σ_{i,j ∈ N_4} exp{ −β ||z_i − z_j||² } · |x_i − x_j|

Goal is to compute the optimal solution (we could also marginalize over Θ):
x* = argmin_x ( min_{θ^F, θ^B} E(x, θ^F, θ^B) )

• So far, Θ was determined from the brush strokes (training data)
• Now we estimate Θ from the segmentation x:

p(z_i | x_i = 0, θ^B) = Σ_{k=1}^K π_k^B N(z_i; μ_k^B, Σ_k^B)
p(z_i | x_i = 1, θ^F) = Σ_{k=1}^K π_k^F N(z_i; μ_k^F, Σ_k^F)
Full GrabCut functional
[Figure: image z and user input; output x ∈ {0, 1} and output GMMs θ^F, θ^B shown in the red–green color plane]

Problem: the joint optimization of x, θ^F, θ^B is NP-hard.

Goal is to compute the optimal solution:
x* = argmin_x ( min_{θ^F, θ^B} E(x, θ^F, θ^B) )
Comment: Using histograms as color models, one can transform the problem into a higher-order Random Field model which can (sometimes) be solved globally optimally for all unknowns, segmentation x and Θ, with Dual Decomposition [Vicente, Kolmogorov, Rother, ICCV '09] (see CV 2).
GrabCut - optimization
Iterate two steps:
1. GMM fitting to the current segmentation:  (θ^F, θ^B) ← argmin_{θ^F, θ^B} E(x, θ^F, θ^B)
2. Graph cut to infer the segmentation:      x ← argmin_x E(x, θ^F, θ^B)

[Figure: image z and user input; initial segmentation x]
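A sketch of this alternation in Python, re-using the helper sketches from earlier (fit_gmm_hard, gmm_loglik, contrast_weights); `graph_cut` is a placeholder for a min-cut solver, and `lam`, `K`, `n_iters` are illustrative defaults of mine, not the values from the paper:

```python
import numpy as np

def grabcut_alternation(img, init_mask, graph_cut, lam=50.0, n_iters=5, K=5):
    """Sketch of GrabCut's alternating minimization (not the authors' implementation).

    img: (H, W, 3) image, init_mask: (H, W) initial 0/1 segmentation from the user input.
    graph_cut: placeholder solver taking per-pixel unary costs (H, W, 2) and
               contrast-weighted pairwise weights, returning a 0/1 mask.
    """
    x = init_mask.copy()
    pixels = img.reshape(-1, 3).astype(np.float64)
    w_h, w_v, _ = contrast_weights(img)                   # pairwise weights (fixed)

    for _ in range(n_iters):
        # Step 1: fit fg/bg GMMs theta^F, theta^B to the current segmentation x
        fg = fit_gmm_hard(pixels[x.ravel() == 1], K)
        bg = fit_gmm_hard(pixels[x.ravel() == 0], K)

        # Step 2: unary costs -log p(z_i | theta), then graph cut for the new x
        unary = np.stack([-gmm_loglik(pixels, *bg),       # cost of label 0 (background)
                          -gmm_loglik(pixels, *fg)],      # cost of label 1 (foreground)
                         axis=1).reshape(*x.shape, 2)
        x = graph_cut(unary, lam * w_h, lam * w_v)
    return x
```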
GrabCut - Optimization
[Figure: energy after each iteration (0, 1, 2, 3, 4) and the corresponding result]
GrabCut - Optimization
[Figure: color models in the red–green plane — at initialization the foreground model still covers foreground & background; in the end foreground and background are separated]
Comparison
input image
Roadmap this lecture (chapter 14.4.3, 5.5 in book)
• Interactive Image Segmentation
• From Generative models to
• Discriminative models to
• Discriminative functions
• Image Segmentation using GrabCut
• Semantic Segmentation
Semantic Segmentation
The desired output
Label each pixel with one out of 21 classes
[TextonBoost; Shotton et al. '06]
Failure cases
TextonBoost: How it is done
Define the energy as a sum of four terms: a class term, a color model, a location prior, and an edge-aware smoothness prior.
[TextonBoost; Shotton et al. '06]
Location prior: e.g. "grass" tends to occur in certain image regions.
Smoothness prior, as in GrabCut: ψ_ij(x_i, x_j) = w_ij |x_i − x_j|