(1)

Computer Vision I - Introduction

Carsten Rother

21/10/2014

(2)

Admin Stuff

• Language: German/English; Slides: English (all the terminology and books are in English)

• Lecturer: Carsten Rother and Holger Heidrich

• Exercises: Dmitri Schlesinger and Holger Heidrich

• Staff Email: carsten.rother@tu-dresden.de

• Announcements: online (to be set up)

• Course Books:

• Image Processing/Geometry:

Computer Vision: Algorithms and Applications

by Rick Szeliski; Springer 2011. An earlier version of the book is online:

http://szeliski.org/Book/

• Geometry:

Multiple View Geometry; Hartley and Zisserman;

Cambridge Press 2004. Second edition. Parts of book are online:

http://www.robots.ox.ac.uk/~vgg/hzbook/

• Also pointers to conference and journal articles

(3)

Course Overview (total 14 lectures)

VL1 (17.10): Introduction Ex1: Intro to OpenCV

VL2 (24.10): Image Processing – Part 1 Ex2: Intro to exercise: Image Processing

VL3 (7.11): Image Processing – Part 2 Ex3: homework

VL4 (14.11): Fourier Analysis – Part 1 Ex4: homework

VL5 (21.11): Fourier Analysis – Part 2 Ex5: homework

VL6 (28.11): Projective Geometry Ex6: Intro to exercise: Fourier Analysis

VL7 (5.12): Image Formation Process Ex7: homework

(4)

Course Overview (total 14 lectures)

VL8 (12.12): 2-view Geometry – Part 1 Ex8: homework

VL9 (19.12): 2-view Geometry – Part 2 Ex9: homework

VL10 (9.1): Multi-View Geometry – Part 1 Ex10: Intro to exercise: Panoramic Stitching / Geometry

VL11 (16.1): Multi-View Geometry – Part 2 Ex11: homework

VL12 (23.1): Tracking – Part 1 Ex12: homework

VL13 (30.1): Tracking – Part 2 Ex13: homework

VL14 (6.2): Wrap-Up: 100 things we have learned Ex14: homework

(5)

Exams and Exercises

• Exam: in person

• Exercises/homework:

• There are 3 blocks

• Each block has several exercises with different points

• The exercises have to be handed in by the end of the semester (ideally after each block)

• The last possible date to hand them in is the end of the semester (end of January)

• Collaboration:

• You are encouraged to discuss the topics

• You are not allowed to copy any code for the homework from other people

(6)

CVLD Lectures

• WS 14/15

• Computer Vision 1 (2+2)

• Machine Learning 1 (2+2)

• Intelligent Systems (Vordiplom) (2+2)

• SS 15

• Computer Vision 2 (2+2)

• Machine Learning 2 (2+2)

• Image processing (2+2)

• For doing a Master/PhD in the CVLD one should do the computer vision or machine learning track

• Computer graphics (Prof. Gumhold) (Introduction, I, II)

3D Scanning with structured light; Illumination models; Geometry

(7)

Before we start … some Advertisement

CVLD Overview

Interactive Image and Data manipulation

Applied Optimization, Models, and Learning

3D Scene Understanding

Inverse rendering from moving images

Benchmarking and Label collection

BioImaging

(8)

Future in Computer Vision

Project work in the CVLD is a good stepping stone if you:

• want to do a PhD in computer vision, graphics, or machine learning

• want to become a researcher or software developer in one of the big research labs (Microsoft Research, Google, Adobe, TechniColor, etc.)

• are interested in founding a start-up

• want to work in other “computer vision related” industry

(9)

Introduction to Computer Vision

What is Computer Vision?

(Potential) Definition:

Developing computational models and algorithms to interpret digital images and visual data in order to understand the visual world we live in.

(10)

Introduction to Computer Vision

(11)

What does it mean to “understand”?

Physics-based vision:

• Geometry

• Segmentation

• Camera parameters

• Emitted light (sun)

• Surface properties: reflectance, material

Semantic-based vision:

• Objects: class, pose

• Scene: outdoor, …

• Attributes/Properties: old-fashioned train, A-on-top-of-B

Developing computational models and algorithms to interpret digital images and visual data in order to understand the visual world we live in.

(12)

Image-formation model

[Slide Credits: John Winn, ICML 2008]

Image

Very many sources of variability

(13)

Image-formation model

Scene type, Scene geometry

Street scene

(14)

Image-formation model

Scene type, Scene geometry, Object classes

Street scene

Sky, Building×3, Road, Sidewalk, Tree×3, Person×4, Bicycle, Car×5, Bench, Bollard

(15)

Image-formation model

Street scene

Sky, Building×3, Road, Sidewalk, Tree×3, Person×4, Bicycle, Car×5, Bench, Bollard

Scene type, Scene geometry, Object classes, Object position, Object orientation

(16)

Image-formation model

Scene type, Scene geometry, Object classes, Object position, Object orientation, Object shape

Street scene

(17)

Image-formation model

Scene type, Scene geometry, Object classes, Object position, Object orientation, Object shape, Depth/occlusions

(18)

Image-formation model

Scene type, Scene geometry, Object classes, Object position, Object orientation, Object shape, Depth/occlusions, Object appearance

(19)

Image-formation model

Scene type, Scene geometry, Object classes, Object position, Object orientation, Object shape, Depth/occlusions, Object appearance, Illumination, Shadows

(20)

Image-formation model

Scene type, Scene geometry, Object classes, Object position, Object orientation, Object shape, Depth/occlusions, Object appearance, Illumination, Shadows

(21)

Image-formation model

Scene type, Scene geometry, Object classes, Object position, Object orientation, Object shape, Depth/occlusions, Object appearance, Illumination, Shadows, Motion blur, Camera effects

(22)

Image-formation model

Scene type, Scene geometry, Object classes, Object position, Object orientation, Object shape, Depth/occlusions, Object appearance, Illumination, Shadows, Motion blur, Camera effects

(23)

The “Scene Parsing” challenge: a “grand challenge” of computer vision

(Probabilistic) Script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}

Many applications do not have to extract the full probabilistic script but only a subset, e.g. “does the image contain a car?”

… many examples to come later

Single image

(24)

Why is “scene parsing” hard?

Computer Graphics: 3D rich representation (script) → 2D pixel representation

Computer Vision: 2D pixel representation → 3D rich representation (script)

Computer Vision can be seen as “inverse graphics”

Script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}

(25)

Example of a recent work

Input

Scene graph

[Gupta, Efros, Hebert, ECCV ‘10]

(26)

Why is “scene parsing” hard?

[Sussman, Lamport, Guzman 1966]

[Slide credits Andrew Blake]

[Xiao et al. NIPS 2012]

(27)

Introduction to Computer Vision

(28)

How can we interpret visual data?

• What general (prior) knowledge of the world (not necessarily visual) can we exploit?

• What properties / cues from the image can be used?

Both aspects are quite well understood (a lot is based on physics) … but how to use them efficiently is an open challenge (see later)

Computer Graphics

Computer Vision

Script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}

(29)

How can we interpret visual data?

Computer Graphics

Computer Vision

Script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}

(30)

Prior knowledge (examples)

• “Hard” prior knowledge

• Trains do not fly in the air

• Objects are connected in 3D

• “Soft” prior knowledge:

• The camera is more likely to be 1.70 m above the ground than 0.1 m.

• Self-similarity: “all black pixels belong to the same object”
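One way to read the “soft” camera-height prior is as a probability density over heights rather than a strict rule. The sketch below is only an illustration: the Gaussian form, the mean of 1.70 m and the standard deviation of 0.30 m are assumptions, not values given in the lecture.

```python
import math

# Assumed values for illustration only (not from the lecture), in metres.
MEAN_HEIGHT_M = 1.70
STD_HEIGHT_M = 0.30


def gaussian_log_density(x, mean, std):
    """Log of the normal density N(x; mean, std^2)."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)


def camera_height_log_prior(height_m):
    """Soft prior: every height is possible, but some are far more likely."""
    return gaussian_log_density(height_m, MEAN_HEIGHT_M, STD_HEIGHT_M)


if __name__ == "__main__":
    for h in (1.70, 1.20, 0.10):
        print(f"height {h:.2f} m -> log prior {camera_height_log_prior(h):.2f}")
```

A hard prior, by contrast, would assign zero probability (a log prior of minus infinity) to the forbidden configurations.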

(31)

Prior knowledge – harder to describe

• Describe Image Texture

• Microscopic images: what is the true shape of these objects?

Figure panels: not a real image (zoom); real image (zoom)

(32)

The importance of Prior knowledge

[Edward Adelson]

Which patch is brighter: A or B?

(33)

The importance of Prior knowledge

[Edward Adelson]

Which patch is brighter: A or B?

(34)

The importance of Prior knowledge

2D image (local): what the computer sees, with patches A and B

The most likely 3D representation (direct light): this is what humans see implicitly. Ideally the computer sees the same.

An unlikely 3D representation (ambient light; hard to see for a human)

Each 3D representation shows the true colours of A and B in the 3D world.

(35)

The importance of Prior knowledge

2D Image

Light

3D representation

Humans see an image not as a set of 2D pixels. They understand an image as a projection of the 3D world we live in.

Humans have prior knowledge about the world encoded, such as:

• Light casts shadows

• Objects do not fly in the air

• A car is likely to move but a table is unlikely to move

We have to teach the computer this prior knowledge so that it understands 2D images as pictures of the 3D world

(36)

The importance of Prior knowledge

Which monster is bigger?

(37)

The importance of Prior knowledge

Which monster is bigger?

In the 2D Image

In the 3D world (true)

1 meter, 2 meters

(38)

Two Explanations:

a) The people have different heights and the room has a regular shape

b) The people have the same height but the room is weirdly shaped

(39)

Human Vision can be fooled

(40)

Male or Female

(41)

How can we interpret visual data?

Computer Graphics

Computer Vision

Script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}

(42)

Cue: Appearance (Colour, Texture) for object recognition

To which object does the patch belong?

(43)

Cue: Outlines (shape) for object recognition

(44)

Guess the Object

Cues: Colour, Texture, Shape

[from John Winn, ICML 2008]

(45)

Guess the Object

?

Cues: Colour, Texture, Shape

[from John Winn, ICML 2008]

(46)

Cue: Context for object recognition

(47)

Cue: Context for object recognition

(48)

Cue: stereo vision (2 frames) for geometry estimation

Ground truth Algorithmic output

(49)

Cue: Multiple Frames for geometry estimation

(50)

Cue: Convergence for geometry estimation

Lines with the same vanishing point (vp) may also be parallel in 3D
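The vanishing point of two image lines can be computed with homogeneous coordinates: a line through two points is their cross product, and the intersection of two lines is again a cross product. This is a minimal sketch using NumPy; the helper names and the example coordinates are made up for illustration.

```python
import numpy as np


def line_through(p, q):
    """Homogeneous line through two image points p = (x, y) and q = (x, y)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])


def intersection(l1, l2, eps=1e-12):
    """Intersection of two homogeneous lines, or None if they are parallel in the image."""
    v = np.cross(l1, l2)
    if abs(v[2]) < eps:
        return None  # the lines meet at infinity
    return v[:2] / v[2]


# Hypothetical example: two converging image lines, e.g. the rails of a track.
left = line_through((0.0, 0.0), (4.0, 8.0))
right = line_through((10.0, 0.0), (6.0, 8.0))
vp = intersection(left, right)

if __name__ == "__main__":
    print("vanishing point:", vp)
```

A third image line `l` passes through the same vanishing point if `np.dot(l, [vp[0], vp[1], 1.0])` is close to zero, which makes it a candidate for being parallel to the other two in 3D.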

(51)

Cue: Shading & shadows for geometry and Light estimation

(52)

Cue: Texture gradient for geometry estimation

(53)

The “Scene Parsing” challenge: a “grand challenge” of computer vision

(Probabilistic) Script = {Camera, Light, Geometry, Material, Objects, Scene, Attributes, Others}

Many applications do not have to extract the full probabilistic script but only a subset, e.g. “does the image contain a car?”

… many examples to come later

Single image

(54)

… many application scenarios are in reach

To simplify the problem:

1) Richer Input:

- Modern sensing technology

- Moving images

- User involvement

2) Rich Data to learn from:

- use the web

- crowdsourcing to get labels (online games, Mechanical Turk)

- Powerful graphics engines

3) For many practical applications:

We do not have to infer the full probabilistic script

(55)

Kinect has simplified (revolutionized) computer vision

[Izadi et al. ´11]

(56)

Animate the world

[Chen et al. UIST ‘12]

(57)

New hardware design …

(58)

Kinect Body Pose estimation and tracking

(59)

Kinect Body Pose estimation and tracking

(60)

behind the scene …

Graphics simulation

Synthetic (graphics) Real (hand-labelled)

(61)

Body tracking and Gesture Recognition has many applications

Very large impact in many fields:

Gaming, Robotics, HCI, Medicine, …

StartUp 2012: Try Fashion online

(62)

Real-time pedestrian detection

(63)

Real-time Face recognition

e.g. Canon PowerShot

(64)

General Object recognition & segmentation

[TextonBoost; Shotton et al, ‘06]

Good results …

(65)

General Object recognition & segmentation

[TextonBoost; Shotton et al, ‘06]

Failure cases…

(66)

Start-Up Company: Like.com

(67)

Interactive Image manipulation

[Agrawal et al ’04]

(68)

Interactive Image manipulation

(69)

Image de-convolution

Panels: Input; Output; Output – kernel

[Schmidt, Rother, Nowozin, Jancsary, Roth 2013] Best Student Paper award

(70)

Image de-convolution (other domains)

input output

(71)

Video Editing

[Rav-Acha et al. ‘08]

(72)

Automatic Video Summary (StartUp: Magisto)

(73)

Automatic Photo Summary - Commercial

AutoCollage 2008 - Microsoft Research [Rother et al. Siggraph 2006]

(74)

Movie Industry

Pirates of the Caribbean, Industrial Light and Magic

(75)

Robotics

Robocup

NASA Mars exploration

(76)

Introduction to Computer Vision

(77)

Interactive Segmentation

(78)

Model versus Algorithm

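Example: Interactive Segmentation

Goal: given $\mathbf{z}$, derive binary $\mathbf{x}$, with $\mathbf{z} \in (R, G, B)^n$ and $\mathbf{x} \in \{0, 1\}^n$

Model: energy function $E(\mathbf{x})$ (implicitly models a statistical model $P(\mathbf{x} \mid \mathbf{z})$)

Algorithm for minimization: $\mathbf{x}^* = \arg\min_{\mathbf{x}} E(\mathbf{x})$ (user-specified pixels are not optimized for)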
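The remark that the energy implicitly models $P(\mathbf{x} \mid \mathbf{z})$ is commonly made precise with a Gibbs distribution. The form below is the standard convention and is an assumption here, since the slide does not write it out; $Z$ is the normalization constant.

```latex
P(\mathbf{x} \mid \mathbf{z}) = \frac{1}{Z} \exp\bigl(-E(\mathbf{x})\bigr),
\qquad
Z = \sum_{\mathbf{x}' \in \{0,1\}^n} \exp\bigl(-E(\mathbf{x}')\bigr)
```

Under this convention the labelling with the lowest energy is the one with the highest probability, so $\arg\min_{\mathbf{x}} E(\mathbf{x}) = \arg\max_{\mathbf{x}} P(\mathbf{x} \mid \mathbf{z})$.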

(79)

Model for a starfish

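Goal: formulate $E(\mathbf{x})$ such that better segmentations get lower energy; the four candidate segmentations in the figure have $E(\mathbf{x}) = 0.01$, $0.05$, $0.05$ and $0.1$

Optimal solution: $\mathbf{x}^* = \arg\min_{\mathbf{x}} E(\mathbf{x})$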

(80)

What does the energy look like?

Energy function (sum of terms $\theta$):

$E(\mathbf{x}) = \sum_{i} \theta_i(x_i) + \sum_{i,j} \theta_{ij}(x_i, x_j)$

The first sum contains the unary terms, the second sum the pairwise terms.
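To make the sum of unary and pairwise terms concrete, here is a minimal sketch that evaluates such an energy for a binary labelling. It assumes a 4-connected pixel grid and one pairwise table shared by all edges; both are assumptions for illustration, and the function name is hypothetical.

```python
import numpy as np


def energy(x, unary, pairwise):
    """Evaluate E(x) = sum_i theta_i(x_i) + sum_{i,j} theta_ij(x_i, x_j).

    x:        (H, W) integer array with labels in {0, 1}
    unary:    (H, W, 2) array, unary[r, c, k] = theta_i(x_i = k)
    pairwise: (2, 2) array, pairwise[a, b] = theta_ij(a, b), shared by all
              edges of a 4-connected grid (an assumption of this sketch)
    """
    rows, cols = np.indices(x.shape)
    e = unary[rows, cols, x].sum()              # unary terms
    e += pairwise[x[:, :-1], x[:, 1:]].sum()    # horizontal neighbour pairs
    e += pairwise[x[:-1, :], x[1:, :]].sum()    # vertical neighbour pairs
    return float(e)


if __name__ == "__main__":
    x = np.array([[0, 1], [1, 1]])
    unary = np.zeros((2, 2, 2))
    unary[..., 1] = 0.5                          # label 1 costs 0.5 everywhere
    pairwise = np.array([[0.0, 1.0], [1.0, 0.0]])
    print(energy(x, unary, pairwise))
```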

(81)

What does the energy look like?

Visualization: undirected graphical model

Nodes: $x_i$, $x_j$

Node potentials $\theta_i(x_i)$: “unary terms”

Edge potentials $\theta_{ij}(x_i, x_j)$: “pairwise terms”

(82)

Unary term

Figure panels (axes: Red vs. Green): user-labelled pixels; Gaussian Mixture Model fit

(83)

Unary term

New query image $z_i$

$\theta_i(x_i = 0)$: dark means likely background

$\theta_i(x_i = 1)$: dark means likely foreground

Optimum with unary terms only
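Unary terms of this kind can be sketched as negative log-likelihoods of each pixel colour under a background and a foreground colour model. To stay short, the sketch below fits a single Gaussian per class instead of the Gaussian Mixture Model named on the slide; this simplification, the convention that label 0 is background and label 1 is foreground, and all function names are assumptions.

```python
import numpy as np


def fit_gaussian(colours):
    """Mean and (slightly regularised) covariance of an (N, 3) array of RGB samples."""
    mean = colours.mean(axis=0)
    cov = np.cov(colours, rowvar=False) + 1e-6 * np.eye(colours.shape[1])
    return mean, cov


def neg_log_likelihood(pixels, mean, cov):
    """-log N(z; mean, cov) for each row z of an (N, 3) array."""
    d = pixels - mean
    maha = np.einsum("ni,ij,nj->n", d, np.linalg.inv(cov), d)
    _, logdet = np.linalg.slogdet(cov)
    k = pixels.shape[1]
    return 0.5 * (maha + logdet + k * np.log(2.0 * np.pi))


def unary_terms(image, bg_samples, fg_samples):
    """Return an (H, W, 2) array: [..., 0] = theta_i(x_i = 0), [..., 1] = theta_i(x_i = 1)."""
    h, w, c = image.shape
    flat = image.reshape(-1, c).astype(float)
    bg = neg_log_likelihood(flat, *fit_gaussian(bg_samples))  # cost of "background"
    fg = neg_log_likelihood(flat, *fit_gaussian(fg_samples))  # cost of "foreground"
    return np.stack([bg, fg], axis=-1).reshape(h, w, 2)
```

Taking `unary_terms(...).argmin(axis=2)` gives the optimum with unary terms only: each pixel independently picks the cheaper label.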

(84)

Pairwise term

“Ising prior”: $\theta_{ij}(x_i, x_j) = |x_i - x_j|$

When is $\theta_{ij}(x_i, x_j)$ small, i.e. a likely configuration?

Example configurations (figure): most likely; most likely; intermediately likely; most unlikely

This models the assumption that the object is spatially coherent.

Next step could be: model shapes of starfishes
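The Ising term is 0 when two neighbours carry the same label and 1 when they differ, so the total pairwise cost simply counts label changes between neighbours. A small sketch, assuming a 4-connected grid and made-up 3×3 patches, shows how uniform, split and checkerboard labellings rank:

```python
import numpy as np


def ising_cost(x):
    """Sum of |x_i - x_j| over all horizontally and vertically adjacent pixel pairs."""
    x = np.asarray(x, dtype=int)
    horizontal = np.abs(x[:, :-1] - x[:, 1:]).sum()
    vertical = np.abs(x[:-1, :] - x[1:, :]).sum()
    return int(horizontal + vertical)


# Hypothetical 3x3 patches for illustration.
uniform = np.zeros((3, 3), dtype=int)              # one coherent region
split = np.array([[0, 0, 1]] * 3)                  # two coherent regions
checkerboard = np.indices((3, 3)).sum(axis=0) % 2  # every neighbour pair differs

if __name__ == "__main__":
    for name, patch in (("uniform", uniform), ("split", split), ("checkerboard", checkerboard)):
        print(name, ising_cost(patch))
```

The lower the cost, the more likely the configuration under this prior, which is why spatially coherent labellings are preferred.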

(85)

Energy minimization (optimization)

$E(\mathbf{x}) = \sum_{i} \theta_i(x_i) + \omega \sum_{i,j} |x_i - x_j|$

Figure panels: results for $\omega = 0$, $\omega = 10$, $\omega = 40$ and $\omega = 200$
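The effect of the weight ω can be tried out with a simple approximate minimizer. The sketch below uses iterated conditional modes (ICM), which repeatedly gives each pixel the label that is cheapest given its current neighbours; it is a stand-in chosen for brevity and is not claimed to be the method that produced the results on the slide. The 4-connected grid and the function name are assumptions.

```python
import numpy as np


def icm(unary, omega, sweeps=10):
    """Approximately minimise E(x) = sum_i theta_i(x_i) + omega * sum_{i,j} |x_i - x_j|.

    unary: (H, W, 2) array with unary[r, c, k] = theta_i(x_i = k).
    Returns an (H, W) array of labels in {0, 1} (a local minimum, not
    necessarily the global one).
    """
    h, w, _ = unary.shape
    x = unary.argmin(axis=2)  # start from the optimum with unary terms only
    for _ in range(sweeps):
        changed = False
        for r in range(h):
            for c in range(w):
                neighbours = [x[rr, cc]
                              for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                              if 0 <= rr < h and 0 <= cc < w]
                costs = [unary[r, c, k] + omega * sum(abs(k - n) for n in neighbours)
                         for k in (0, 1)]
                best = int(np.argmin(costs))
                if best != x[r, c]:
                    x[r, c] = best
                    changed = True
        if not changed:
            break  # no pixel changed in a full sweep: local minimum reached
    return x
```

With `omega = 0` the result equals the unary-only optimum; increasing `omega` removes isolated labels and yields smoother segmentations.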

(86)

The key Questions

• What type of modelling language should be chosen:

undirected or directed discrete Graphical models, Continuous-Domain models

• What does the exact model look like:

• What is the structure

• What do the terms look like

• Can we learn the Model from Data:

• Learn structure

• Learn potential functions

• How do we optimize the model (perform inference):

• fast, approximate

• Exactly solvable?

• NP-hard?

This is the focus of the courses in SS 15: Computer Vision 2 and Machine Learning 2

This lecture is more physics-based vision: Geometry, Image Processing and Tracking

(87)

Another Example: Model versus Algorithm

[Data courtesy of Oliver Woodford]

Model: Minimize a binary 4-connected pair-wise graph

(choose a colour-mode at each pixel)

Input:

Image sequence

Output: New view

[Fitzgibbon et al. ‘03]

(88)

Another Example: Model versus Algorithm

Belief Propagation ICM, Simulated Annealing

Ground Truth Graph Cut with truncation

[Rother et al. ‘05]

Why is the result not perfect? Model or optimization?

(approximate solution) (exact solution)

QPBOP [Boros et al. ’06; Rother et al. ‘07]

(approximate solution)

(89)

Why is computer vision interesting (to you)?

• It is a challenging problem that is far from being solved

• It combines insights and tools from many fields and disciplines:

• Mathematics and statistics

• Cognition and perception

• Engineering (signal processing)

• And of course, computer science

(90)

Why is computer vision interesting (to you)?

• Allows you to apply theoretical skills

... that you may otherwise only use rarely.

• Quite rewarding:

• Often visually intuitive and encouraging results.

• It is a growing field:

• Cameras are becoming more and more popular

• There are a lot of companies (big, small, startups) working in vision

• Conferences are growing rapidly.

(91)

Relationship to other fields

[Wikipedia]

(92)

Relationship to other fields – my personal view

Biology Robotics

AI (many more)

Human-Computer Interaction

Applications Medicine

Computer Vision

(93)

Reading for next class

This lecture: Chapter 1 (in particular: 1.1)

Next lecture:

• Chapter 3 (in particular: 3.2, 3.3) - Basics of Digital Image Processing

• Chapter 4.2 and 4.3 - Edge and Line detection