Computer Vision –
a challenge in Artificial Intelligence
Prof. Carsten Rother
Computer Vision Lab Dresden, Institute of Artificial Intelligence
Computer Vision – a hard case for AI 11/12/2013
Roadmap for this lecture
• A few more words on the history of AI and subareas of AI
• An introduction to Computer Vision
• What is it?
• Why is it hard?
• How can we solve it?
• What can we do with it?
• Roadmap for the remaining lecture
From first lecture
Going back to 1973
• Sir James Lighthill report to the British Parliament
Full report on YouTube: http://www.youtube.com/watch?v=FLnqHzpLPws&list=PL27303EC6EC90FD5A
The general purpose robot is a mirage
“A robot that can do everything is an illusion”
What do we have today … Personal Conclusion
• He is correct … we don’t have the general purpose robot.
• AI Research split into many sub/related areas:
Machine Learning, Computer Vision, … (more later)
• In some areas we are doing a very good job:
• Natural Language Processing (NLP)
• Playing chess
• Some areas turned out to be very hard:
• Robotics
• Computer Vision seems like one of the hardest ones (a few success stories come later)
Scene understanding … in the 70s
Scene understanding - today
We are getting there … 40 years later
[Xiao et al. NIPS 2012]
Today: Topics / Subareas in AI
Applications:
• Natural Language Processing
• Planning
• Computer Vision
• Robotics
• Biology
• Human-Computer Interaction
Theory:
• Logic
• Machine Learning
• Probability Theory
• Decision Theory
• Automated Reasoning
[derived from first lecture]
Models:
• Knowledge representation
• Undirected graphical models
• Directed Graphical models
• Unstructured models
Algorithms:
• Search
• Discrete Optimization
• Continuous Optimization
• Probabilistic Inference
• Learning
• AI overlaps with many disciplines
• There is not one unique, overarching theory
• AI has impact in many domains
Books for the following lecture
• Artificial Intelligence: A Modern Approach. Russell, Norvig (third edition, English)
(we cover parts of chapters 4, 5, 6)
• Pattern Recognition and Machine Learning. Bishop. Springer 2006
• Learning from Data: A Short Course. Abu-Mostafa, Magdon-Ismail, Hsuan-Tien Lin. AMLbook
• Markov Random Fields for Vision and Image Processing. Blake, Kohli, Rother. MIT Press 2011
Roadmap for this lecture
• A few more words on history of AI and subareas of AI
• An introduction to Computer Vision
• What is it?
• Why is it hard?
• How can we solve it?
• What can we do with it?
• Roadmap for the remaining lecture
What is Computer Vision?
(Potential) Definition:
Developing computational models and algorithms
to interpret digital images and visual data in order
to understand the visual world we live in.
What does it mean to “understand”?
Physics-based vision:
• Geometry, segmentation
• Camera parameters, emitted light (sun)
• Surface properties: reflectance, material
Semantic-based vision:
• Objects: class, pose
• Scene: outdoor, …
• Attributes/properties
Image-formation model
[Slide Credits: John Winn, ICML 2008]
The image arises from very many sources of variability, which the model adds one by one:
• Scene type (here: street scene)
• Scene geometry
• Object classes (here: sky, building ×3, road, sidewalk, tree ×3, person ×4, bicycle, car ×5, bench, bollard)
• Object position
• Object orientation
• Object shape
• Depth/occlusions
• Object appearance
• Illumination
• Shadows
• Motion blur
• Camera effects
The “Scene Parsing” challenge ---
a “grand challenge” of computer vision
(Probabilistic) Script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}
Single image
Why is “scene parsing” hard?
Computer Graphics goes from the rich 3D representation (the script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}) to the 2D pixel representation. Computer Vision can be seen as “inverse graphics”: recovering the script from the 2D pixels.
Example of a recent work
Input
Scene graph
Example: General Object recognition & segmentation
[TextonBoost; Shotton et al, ‘06]
Good results …
Example: General Object recognition & segmentation
Failure cases…
Comparison: CV to NLP
Computer Vision (scene understanding):
• Amount of input data: 10 Mpixel/second for a robot
• Images are 2D (much harder inference!)
• Rules/models are hard to define since images are so varied (see next lecture)
• Scene understanding is far from being solved; the best method has a 47% chance of being correct for 20 object classes
Natural Language Processing:
• Amount of input data: audiobooks have 2.2 words per second, i.e. ~20 letters per second
• Sound is 1D
• Strong rules exist (context-free grammars)
• Real-time speech translation exists, more or less
What is Computer Vision?
(Potential) Definition:
Developing computational models and algorithms
to interpret digital images and visual data in order
to understand the visual world we live in.
Visual Data is everywhere
• Visual Data is dense, structured data
• Real world:
• RGB photo/video cameras
• Mobile phones
• Depth cameras
• Laser scanners
• Robotics
• Medicine
• Microscopy
• Surveillance
How can we interpret visual data?
• What general (prior) knowledge of the world (not necessarily visual) can be exploited?
• What properties / cues from the image can be used?
Computer Graphics goes from the rich 3D representation (script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}) to the 2D pixel representation; Computer Vision is the inverse. Both aspects are quite well understood (a lot is based on physics) … but how to use them efficiently is an open challenge (see later).
Prior knowledge (examples)
• “Hard” prior knowledge
• Trains do not fly in the air
• Objects are connected in 3D
• “Soft” prior knowledge:
• The camera is more likely to be 1.70 m above the ground than 0.1 m.
• Self-similarity: “all black pixels belong to the same object”
Prior knowledge – harder to describe
• Describe image texture
• Microscopy images: what is the true shape of these objects?
(Figures: a zoom into a non-real image vs. a zoom into a real image)
The importance of Prior knowledge
[Edward Adelson]
Which patch is brighter: A or B?
The importance of Prior knowledge
In the 2D image the two patches are locally identical; that is what the computer sees. Humans implicitly see the most likely 3D representation, with direct light, ambient light, and a cast shadow, under which the true colours of A and B in the 3D world differ. An unlikely 3D representation that also explains the image is hard for a human to see. Ideally the computer sees the same.
The importance of Prior knowledge
Humans do not see an image as a set of 2D pixels; they understand it as a projection of the 3D world we live in (light maps the 3D representation to the 2D image). Humans have prior knowledge about the world encoded, such as:
• Light casts shadows
Male or Female?
How can we interpret visual data?
• What general (prior) knowledge of the world (not necessarily visual) can be exploited?
• What properties / cues from the image can be used?
Computer Graphics goes from the rich 3D representation (script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}) to the 2D pixel representation; Computer Vision is the inverse.
Cue: Appearance (Colour, Texture) for object recognition
To what object does the patch belong?
Cue: Outlines (shape) for object recognition
Guess the Object
Colour
Texture
Shape
[from John Winn, ICML 2008]
Cue: Context for object recognition
Cue: stereo vision (2 frames) for geometry estimation
Cue: Multiple Frames for geometry estimation
Cue: Shading & shadows for geometry and light estimation
Cue: Texture gradient for geometry estimation
The “Scene Parsing” challenge ---
a “grand challenge” of computer vision
(Probabilistic) Script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}
Many applications do not have to extract the full probabilistic script but only a subset, e.g. “does the image contain a car?”
Single image
… many application scenarios are in reach
To simplify the problem:
1) Richer input:
• Modern sensing technology
• Moving images
• User involvement
2) Rich data to learn from:
• Use the web
• Crowdsourcing to get labels (online games, Mechanical Turk)
• Powerful graphics engines
Real-time pedestrian detection
Animate the world
[Chen et al. UIST ‘12]
Example: Xbox people tracking
Example: people tracking (test data)
Body tracking and Gesture Recognition has many applications
Start-up 2012: try fashion online
Start-Up Company: Like.com
What is Computer Vision?
(Potential) Definition:
Developing computational models and algorithms
to interpret digital images and visual data in order
to understand the visual world we live in.
Example: Image Segmentation
Input: image with user input. Output: y ∈ {0,1}^n; typically n is large (≥ 1M).
The model is an undirected graphical model with unary potentials θ_i(y_i) and pairwise potentials θ_ij(y_i, y_j).
Modelling: how to formulate the graphical model, e.g. P(y | θ) (this is one of many tasks).
Inference/Optimization: y* = argmax_y P(y | θ)
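The modelling and inference steps above can be sketched in a few lines. This is a minimal toy sketch, not the lecture's actual segmentation model: a hypothetical 4-pixel "image" arranged as a chain, with made-up unary potentials and a smoothness weight; real images have n ≥ 1M pixels on a grid, where exhaustive enumeration is impossible.

```python
import itertools

# Toy binary segmentation as an undirected graphical model (hypothetical data).
# Unary potentials theta_i(y_i): how well pixel i matches background (0) or
# foreground (1). Pairwise potentials encourage neighbours to agree.
unary = [
    [0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9],
]
edges = [(0, 1), (1, 2), (2, 3)]  # 4-pixel chain; a real image uses a grid
LAMBDA = 0.5                      # strength of the smoothness prior (assumed)

def score(y):
    """Unnormalised log-probability of a labelling y."""
    s = sum(unary[i][yi] for i, yi in enumerate(y))
    s += sum(LAMBDA * (y[i] == y[j]) for i, j in edges)
    return s

# Inference y* = argmax_y P(y | theta): with n = 4 we can enumerate all 2^n
# labellings; for large n this is why graph cuts, BP, etc. are needed.
y_star = max(itertools.product([0, 1], repeat=4), key=score)
print(y_star)
```

The unaries pull the left pixels towards background and the right pixels towards foreground, and the pairwise term resolves the ambiguous middle pixels consistently.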
What is Learning?
Training: given images with ground truth and an error function that says how we compare results, find the weights θ* of the probabilistic model P(y | θ*) (there can be up to 10M parameters).
Testing: inference by maximum probability: y* = argmax_y P(y | θ*)
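The training/testing split can be sketched with a deliberately simple stand-in model (my own toy setup, not the lecture's): θ* is just P(foreground | colour), estimated by counting labelled training pixels, and testing labels each pixel by the maximum-probability class. The colour names and counts are hypothetical.

```python
from collections import defaultdict

# Training data: (colour, ground-truth label) pairs from labelled images
# (hypothetical values for illustration).
train = [("green", 0), ("green", 0), ("blue", 0),
         ("red", 1), ("red", 1), ("red", 0), ("blue", 1)]

counts = defaultdict(lambda: [0, 0])  # colour -> [#background, #foreground]
for colour, label in train:
    counts[colour][label] += 1

# Training: theta* is the maximum-likelihood estimate of P(y=1 | colour).
theta = {c: fg / (bg + fg) for c, (bg, fg) in counts.items()}

def infer(pixels):
    """Testing: per-pixel y* = argmax_y P(y | theta*)."""
    return [1 if theta[c] > 0.5 else 0 for c in pixels]

print(infer(["red", "green", "blue"]))
```

A per-pixel independent model like this ignores the pairwise smoothness terms of the graphical model above; it only illustrates where learned weights enter the pipeline.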
Model versus Inference (Algorithm)
Another Example: Model versus Algorithm
Input: image sequence. Output: new view. [Data courtesy of Oliver Woodford]
The same model [Rother et al. ‘05] is optimized with different algorithms and compared against the ground truth: ICM, Simulated Annealing, Belief Propagation, and Graph Cut with truncation give approximate solutions; QPBOP [Boros et al. ’06; Rother et al. ‘07] gives an exact solution. Why is the result not perfect: is it the model or the inference?
Summary: The key questions for the upcoming lectures
• What is the modelling language:
undirected / directed graphical models; unstructured models
• What does the model look like:
• What is the structure?
• What do the functions look like?
• Can we learn the model from data:
• Learn structure
• Learn potential functions
• Probabilistic learning / discriminative learning
• How do we optimize the model (perform inference)?
Is Machine Learning feasible?
• We are looking at a mapping:
f: X = {0,1}^3 → Y = {0,1}
• We are given 5 training data instances:
[example from the book Learning from Data; Abu-Mostafa et al.]
Is Machine Learning feasible?
• Let us look at all possible functions f(x_1, x_2, x_3) = y
• There are 2^(2^3) = 256 possible functions in total
• With the 5 training instances fixed, 8 functions remain (the 3 unseen inputs can be completed in 2^3 ways)
• Without any information about f, any of these solutions for f is as good as any other!
• We need information about f
[example from the book Learning from Data; Abu-Mostafa et al.]
Is Machine Learning feasible?
Assume f is “smooth” in the 3D space (x_1, x_2, x_3), i.e. it has few “0-1” transitions in Manhattan space (neighbouring cube corners, drawn as lines, differ in exactly one coordinate). Among the remaining functions, the optimum has 6 transitions.
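The counting argument and the smoothness prior can be checked directly. This sketch uses hypothetical training instances (the lecture's actual five are not reproduced here): it enumerates all 256 candidate functions, keeps the 8 consistent with the training data, and picks the completion with the fewest 0-1 transitions along cube edges.

```python
import itertools

# A function f: {0,1}^3 -> {0,1} is a table of 8 output bits, so there are
# 2^(2^3) = 256 candidates; fixing 5 inputs leaves 2^3 = 8 consistent ones.
inputs = list(itertools.product([0, 1], repeat=3))  # the 8 cube corners
train = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1,
         (1, 0, 0): 1, (0, 1, 1): 0}                # hypothetical instances

candidates = [dict(zip(inputs, bits))
              for bits in itertools.product([0, 1], repeat=8)]
consistent = [f for f in candidates
              if all(f[x] == y for x, y in train.items())]

# Smoothness prior: few 0-1 transitions along cube edges (Manhattan
# neighbours differ in exactly one coordinate).
edges = [(a, b) for a in inputs for b in inputs
         if a < b and sum(ai != bi for ai, bi in zip(a, b)) == 1]

def transitions(f):
    return sum(f[a] != f[b] for a, b in edges)

best = min(consistent, key=transitions)
print(len(candidates), len(consistent), transitions(best))
```

Without the prior, all 8 completions are equally good; the prior is exactly the extra information about f that makes a choice possible.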
Roadmap for this lecture
• A few more words on history of AI and subareas of AI
• An introduction to Computer Vision
• What is it?
• Why is it hard?
• How can we solve it?
• What can we do with it?
• Roadmap for the remaining lecture
Roadmap for next lectures
• 11.12 (1): Computer Vision – a hard case for AI
• 11.12 (2): Introduction to probability theory
• 18.12 (1): Exercise: probability theory
• 18.12 (2): Unstructured models: Decision theory
• 8.1 (1): Unstructured models: Probabilistic Learning
• 8.1 (2): Unstructured models: Discriminative Learning Intro
• 15.1 (1): Exercise: Learning
• 15.1 (2): Unstructured models: Discriminative Learning
Roadmap for next lectures
• 22.1 (1): Undirected Graphical models: Models and Inference
• 22.1 (2): Undirected Graphical models: Models and Inference
• 29.1 (1): Exercise: Learning
• 29.1 (2): Undirected Graphical models: Learning
• 5.2 (1): Directed Graphical models
• 5.2 (2): Wrap up; Putting theory to practice