Computer Vision –
a challenge in Artificial Intelligence
Prof. Carsten Rother
Computer Vision Lab Dresden, Institute of Artificial Intelligence
Computer Vision – a hard case for AI 11/12/2013
Roadmap for this lecture
• A few more words on the history of AI and subareas of AI
• An introduction to Computer Vision
• What is it?
• Why is it hard?
• How can we solve it?
• What can we do with it?
• Roadmap for the remaining lecture
From first lecture
Going back to 1973
• Sir James Lighthill report to the British Parliament
Full report on YouTube: http://www.youtube.com/watch?v=FLnqHzpLPws&list=PL27303EC6EC90FD5A
The general purpose robot is a mirage
“A robot that can do everything is an illusion”
What do we have today … Personal Conclusion
• He is correct … we don’t have the general purpose robot.
• AI Research split into many sub/related areas:
Machine Learning, Computer Vision, … (more later)
• In some areas we are doing a very good job:
• Natural Language Processing (NLP)
• Playing chess
• Some areas turned out to be very hard:
• Robotics
• Computer Vision seems like one of the hardest ones (a few success stories come later)
Scene understanding … in the 70s
Scene understanding - today
We are getting there … 40 years later
[Xiao et al. NIPS 2012]
Today: Topics / Subareas in AI
Applications:
• Natural Language Processing
• Planning
• Computer Vision
• Robotics
• Biology
• Human-Computer Interaction
Theory:
• Logic
• Machine Learning
• Probability Theory
• Decision Theory
• Automated Reasoning
[derived from first lecture]
Models:
• Knowledge representation
• Undirected graphical models
• Directed Graphical models
• Unstructured models
Algorithms:
• Search
• Discrete Optimization
• Continuous Optimization
• Probabilistic Inference
• Learning
• AI overlaps with many disciplines
• There is not one unique, overarching theory
• AI has impact in many domains
Books for the following lecture
• Artificial Intelligence: A Modern Approach. Russell, Norvig (third edition, English)
(we cover parts of chapters 4, 5, 6)
• Pattern Recognition and Machine Learning. Bishop. Springer 2006
• Learning from Data: A Short Course. Abu-Mostafa, Magdon-Ismail, Hsuan-Tien Lin. AMLbook
• Markov Random Fields for Vision and Image Processing. Blake, Kohli, Rother. MIT Press 2011
Roadmap for this lecture
• A few more words on history of AI and subareas of AI
• An introduction to Computer Vision
• What is it?
• Why is it hard?
• How can we solve it?
• What can we do with it?
• Roadmap for the remaining lecture
What is Computer Vision?
(Potential) Definition:
Developing computational models and algorithms
to interpret digital images and visual data in order
to understand the visual world we live in.
What does it mean to “understand”?
Physics-based vision:
• Geometry, segmentation
• Camera parameters, emitted light (sun)
• Surface properties: reflectance, material
Semantic-based vision:
• Objects: class, pose
• Scene: outdoor, …
• Attributes/properties
Image-formation model
[Slide Credits: John Winn, ICML 2008]
The image arises from very many sources of variability, which the model adds one by one:
• Scene type (here: street scene)
• Scene geometry
• Object classes (here: sky, building ×3, road, sidewalk, tree ×3, person ×4, bicycle, car ×5, bench, bollard)
• Object position
• Object orientation
• Object shape
• Depth/occlusions
• Object appearance
• Illumination
• Shadows
• Motion blur
• Camera effects
The “Scene Parsing” challenge ---
a “grand challenge” of computer vision
(Probabilistic) Script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}
Single image
Why is “scene parsing” hard?
Computer Graphics goes from the rich 3D representation (the script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}) to the 2D pixel representation. Computer Vision can be seen as “inverse graphics”: recovering the script from the 2D pixels.
Example of a recent work
Input
Scene graph
Example: General Object recognition & segmentation
[TextonBoost; Shotton et al, ‘06]
Good results …
Example: General Object recognition & segmentation
Failure cases…
Comparison: CV to NLP
Computer Vision (scene understanding):
• Amount of input data: 10 Mpixel/second for a robot
• Images are 2D (much harder inference!)
• Rules/models are hard to define since images are so varied (see next lecture)
• Scene understanding is far from being solved; the best method has a 47% chance of being correct for 20 object classes
Natural Language Processing:
• Amount of input data: audiobooks have 2.2 words per second, i.e. ~20 letters per second
• Sound is 1D
• Strong rules exist (context-free grammars)
• Real-time speech translation exists, more or less
What is Computer Vision?
(Potential) Definition:
Developing computational models and algorithms
to interpret digital images and visual data in order
to understand the visual world we live in.
Visual Data is everywhere
• Visual Data is dense, structured data
• Real world:
• RGB photo/video cameras
• Mobile phones
• Depth cameras
• Laser scanners
• Robotics
• Medicine
• Microscopy
• Surveillance
How can we interpret visual data?
• What general (prior) knowledge of the world (not necessarily visual) can be exploited?
• What properties / cues from the image can be used?
Computer Graphics goes from the rich 3D representation (script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}) to the 2D pixel representation; Computer Vision is the inverse. Both aspects are quite well understood (a lot is based on physics) … but how to use them efficiently is an open challenge (see later).
Prior knowledge (examples)
• “Hard” prior knowledge
• Trains do not fly in the air
• Objects are connected in 3D
• “Soft” prior knowledge:
• The camera is more likely to be 1.70 m above the ground than 0.1 m.
• Self-similarity: “all black pixels belong to the same object”
Prior knowledge – harder to describe
• Describe image texture
• Microscopy images: what is the true shape of these objects?
(Figures: a zoom into a non-real image vs. a zoom into a real image)
The importance of Prior knowledge
[Edward Adelson]
Which patch is brighter: A or B?
The importance of Prior knowledge
In the 2D image the two patches are locally identical; that is what the computer sees. Humans implicitly see the most likely 3D representation, with direct light, ambient light, and a cast shadow, under which the true colours of A and B in the 3D world differ. An unlikely 3D representation that also explains the image is hard for a human to see. Ideally the computer sees the same.
The importance of Prior knowledge
Humans do not see an image as a set of 2D pixels; they understand it as a projection of the 3D world we live in (light maps the 3D representation to the 2D image). Humans have prior knowledge about the world encoded, such as:
• Light casts shadows
Male or Female?
How can we interpret visual data?
• What general (prior) knowledge of the world (not necessarily visual) can be exploited?
• What properties / cues from the image can be used?
Computer Graphics goes from the rich 3D representation (script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}) to the 2D pixel representation; Computer Vision is the inverse.
Cue: Appearance (Colour, Texture) for object recognition
To what object does the patch belong?
Cue: Outlines (shape) for object recognition
Guess the Object
Colour
Texture
Shape
[from John Winn, ICML 2008]
Cue: Context for object recognition
Cue: stereo vision (2 frames) for geometry estimation
Cue: Multiple Frames for geometry estimation
Cue: Shading & shadows for geometry and light estimation
Cue: Texture gradient for geometry estimation
The “Scene Parsing” challenge ---
a “grand challenge” of computer vision
(Probabilistic) Script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}
Many applications do not have to extract the full probabilistic script but only a subset, e.g. “does the image contain a car?”
Single image
… many application scenarios are in reach
To simplify the problem:
1) Richer input:
• Modern sensing technology
• Moving images
• User involvement
2) Rich data to learn from:
• Use the web
• Crowdsourcing to get labels (online games, Mechanical Turk)
• Powerful graphics engines
Real-time pedestrian detection
Animate the world
[Chen et al. UIST ‘12]
Example: Xbox people tracking
Example: people tracking (test data)
Body tracking and Gesture Recognition has many applications
Start-up 2012: try fashion online
Start-Up Company: Like.com
What is Computer Vision?
(Potential) Definition:
Developing computational models and algorithms
to interpret digital images and visual data in order
to understand the visual world we live in.
Example: Image Segmentation
Input: image with user input. Output: y ∈ {0,1}^n; typically n is large (≥ 1M).
The model is an undirected graphical model with unary potentials θ_i(y_i) and pairwise potentials θ_ij(y_i, y_j).
Modelling: how to formulate the graphical model, e.g. P(y | θ) (this is one of many tasks).
Inference/Optimization: y* = argmax_y P(y | θ)
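The modelling and inference steps above can be sketched in a few lines. This is a minimal toy sketch, not the lecture's actual segmentation model: a hypothetical 4-pixel "image" arranged as a chain, with made-up unary potentials and a smoothness weight; real images have n ≥ 1M pixels on a grid, where exhaustive enumeration is impossible.

```python
import itertools

# Toy binary segmentation as an undirected graphical model (hypothetical data).
# Unary potentials theta_i(y_i): how well pixel i matches background (0) or
# foreground (1). Pairwise potentials encourage neighbours to agree.
unary = [
    [0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9],
]
edges = [(0, 1), (1, 2), (2, 3)]  # 4-pixel chain; a real image uses a grid
LAMBDA = 0.5                      # strength of the smoothness prior (assumed)

def score(y):
    """Unnormalised log-probability of a labelling y."""
    s = sum(unary[i][yi] for i, yi in enumerate(y))
    s += sum(LAMBDA * (y[i] == y[j]) for i, j in edges)
    return s

# Inference y* = argmax_y P(y | theta): with n = 4 we can enumerate all 2^n
# labellings; for large n this is why graph cuts, BP, etc. are needed.
y_star = max(itertools.product([0, 1], repeat=4), key=score)
print(y_star)
```

The unaries pull the left pixels towards background and the right pixels towards foreground, and the pairwise term resolves the ambiguous middle pixels consistently.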
What is Learning?
Training: given images with ground truth and an error function that says how we compare results, find the weights θ* of the probabilistic model P(y | θ*) (there can be up to 10M parameters).
Testing: inference by maximum probability: y* = argmax_y P(y | θ*)
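The training/testing split can be sketched with a deliberately simple stand-in model (my own toy setup, not the lecture's): θ* is just P(foreground | colour), estimated by counting labelled training pixels, and testing labels each pixel by the maximum-probability class. The colour names and counts are hypothetical.

```python
from collections import defaultdict

# Training data: (colour, ground-truth label) pairs from labelled images
# (hypothetical values for illustration).
train = [("green", 0), ("green", 0), ("blue", 0),
         ("red", 1), ("red", 1), ("red", 0), ("blue", 1)]

counts = defaultdict(lambda: [0, 0])  # colour -> [#background, #foreground]
for colour, label in train:
    counts[colour][label] += 1

# Training: theta* is the maximum-likelihood estimate of P(y=1 | colour).
theta = {c: fg / (bg + fg) for c, (bg, fg) in counts.items()}

def infer(pixels):
    """Testing: per-pixel y* = argmax_y P(y | theta*)."""
    return [1 if theta[c] > 0.5 else 0 for c in pixels]

print(infer(["red", "green", "blue"]))
```

A per-pixel independent model like this ignores the pairwise smoothness terms of the graphical model above; it only illustrates where learned weights enter the pipeline.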
Model versus Inference (Algorithm)
Another Example: Model versus Algorithm
Input: image sequence. Output: new view. [Data courtesy of Oliver Woodford]
The same model [Rother et al. ‘05] is optimized with different algorithms and compared against the ground truth: ICM, Simulated Annealing, Belief Propagation, and Graph Cut with truncation give approximate solutions; QPBOP [Boros et al. ’06; Rother et al. ‘07] gives an exact solution. Why is the result not perfect: is it the model or the inference?
Summary: The key questions for the upcoming lectures
• What is the modelling language:
undirected / directed graphical models; unstructured models
• What does the model look like:
• What is the structure?
• What do the functions look like?
• Can we learn the model from data:
• Learn structure
• Learn potential functions
• Probabilistic learning / discriminative learning
• How do we optimize the model (perform inference)?
Is Machine Learning feasible?
• We are looking at a mapping:
f: X = {0,1}^3 → Y = {0,1}
• We are given 5 training data instances:
[example from the book Learning from Data; Abu-Mostafa et al.]
Is Machine Learning feasible?
• Let us look at all possible functions f(x_1, x_2, x_3) = y
• There are 2^(2^3) = 256 possible functions in total
• With the 5 training instances fixed, 8 functions remain (the 3 unseen inputs can be completed in 2^3 ways)
• Without any information about f, any of these solutions for f is as good as any other!
• We need information about f
[example from the book Learning from Data; Abu-Mostafa et al.]
Is Machine Learning feasible?
Assume f is “smooth” in the 3D space (x_1, x_2, x_3), i.e. it has few “0-1” transitions in Manhattan space (neighbouring cube corners, drawn as lines, differ in exactly one coordinate). Among the remaining functions, the optimum has 6 transitions.
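The counting argument and the smoothness prior can be checked directly. This sketch uses hypothetical training instances (the lecture's actual five are not reproduced here): it enumerates all 256 candidate functions, keeps the 8 consistent with the training data, and picks the completion with the fewest 0-1 transitions along cube edges.

```python
import itertools

# A function f: {0,1}^3 -> {0,1} is a table of 8 output bits, so there are
# 2^(2^3) = 256 candidates; fixing 5 inputs leaves 2^3 = 8 consistent ones.
inputs = list(itertools.product([0, 1], repeat=3))  # the 8 cube corners
train = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1,
         (1, 0, 0): 1, (0, 1, 1): 0}                # hypothetical instances

candidates = [dict(zip(inputs, bits))
              for bits in itertools.product([0, 1], repeat=8)]
consistent = [f for f in candidates
              if all(f[x] == y for x, y in train.items())]

# Smoothness prior: few 0-1 transitions along cube edges (Manhattan
# neighbours differ in exactly one coordinate).
edges = [(a, b) for a in inputs for b in inputs
         if a < b and sum(ai != bi for ai, bi in zip(a, b)) == 1]

def transitions(f):
    return sum(f[a] != f[b] for a, b in edges)

best = min(consistent, key=transitions)
print(len(candidates), len(consistent), transitions(best))
```

Without the prior, all 8 completions are equally good; the prior is exactly the extra information about f that makes a choice possible.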
Roadmap for this lecture
• A few more words on history of AI and subareas of AI
• An introduction to Computer Vision
• What is it?
• Why is it hard?
• How can we solve it?
• What can we do with it?
• Roadmap for the remaining lecture
Roadmap for next lectures
• 11.12 (1): Computer Vision – a hard case for AI
• 11.12 (2): Introduction to probability theory
• 18.12 (1): Exercise: probability theory
• 18.12 (2): Unstructured models: Decision theory
• 8.1 (1): Unstructured models: Probabilistic Learning
• 8.1 (2): Unstructured models: Discriminative Learning Intro
• 15.1 (1): Exercise: Learning
• 15.1 (2): Unstructured models: Discriminative Learning
Roadmap for next lectures
• 22.1 (1): Undirected Graphical models: Models and Inference
• 22.1 (2): Undirected Graphical models: Models and Inference
• 29.1 (1): Exercise: Learning
• 29.1 (2): Undirected Graphical models: Learning
• 5.2 (1): Directed Graphical models
• 5.2 (2): Wrap up; Putting theory to practice