
From Point Clouds to High-Fidelity Models — Advanced Methods for Image-Based 3D Reconstruction

A thesis submitted to attain the degree of DOCTOR OF SCIENCES of ETH ZURICH

(Dr. sc. ETH Zurich)

presented by Audrey Richard

Master of Science (MSc), INSA de Lyon
born on 25 March 1991

citizen of France

accepted on the recommendation of

Prof. Dr. Konrad Schindler, ETH Zurich, Switzerland
Prof. Dr. Marc Pollefeys, ETH Zurich, Switzerland
Prof. Dr. Vincent Lepetit, ENPC ParisTech, France

2021


Audrey Richard, From Point Clouds to High-Fidelity Models — Advanced Methods for Image-Based 3D Reconstruction

Copyright © 2021, Audrey Richard

Published by:
Institute of Geodesy and Photogrammetry
ETH Zurich
CH-8093 Zurich

All rights reserved

ISBN 978-3-03837-011-6

ISSN 0252-9335


Abstract

Automatically capturing a virtual 3D model of an object or a scene from a collection of images is a useful capability with a wide range of applications, including virtual/augmented reality, heritage preservation, consumer digital entertainment, autonomous robotics, navigation, industrial vision or metrology, and many more. Since the early days of photogrammetry and computer vision, it has been a topic of intensive research, yet a general solution has remained elusive. 3D modeling requires more than reconstructing a cloud of 3D points from images; it requires a high-fidelity representation whose form often depends on the individual object.

This thesis guides you through the journey of image-based 3D reconstruction via several advanced methods that aim to push its boundaries, from precise and complete geometry to detailed appearance, drawing both on elegant mathematical theory and on more recent breakthroughs in deep learning. To evaluate these methods, thorough experiments are conducted at scene level (and large scale), where efficiency is of key importance, and at object level, where accuracy, completeness and photorealism can be better appreciated. To show the individual potential of each of these methods, as well as the wide range of possible applications, different scenarios are considered and serve as proofs of concept.

Thereby, the journey starts with large-scale city modeling using aerial photography from the cities of Zürich (Switzerland), Enschede (Netherlands) and Dortmund (Germany), followed by single-object completion using the synthetic dataset ShapeNet, which includes objects like cars, benches or planes that can be found in every city, and finishes with the embellishment of these digital models via high-resolution texture mapping using a multi-view 3D dataset of real and synthetic objects, for example statues and fountains that also adorn the landscape of cities. Combining them into an incremental pipeline dedicated to a specific application would require further tailoring but is quite possible.

Keywords: image-based modeling, dense 3D reconstruction, semantic understanding, multi-view, shape completion, appearance modeling, texture mapping, texture super-resolution, convex optimisation, finite elements discretisation, deep learning, convolutional neural network.


Résumé

Automatically capturing a virtual 3D model of an object or a scene from a collection of images is of great use for a wide range of applications, notably virtual or augmented reality, heritage preservation, consumer digital entertainment, autonomous robotics, navigation, industrial vision or metrology, and many more. Since the early days of photogrammetry and computer vision, this topic has been the subject of intensive research, without however leading to a generic solution. 3D modeling is not limited to reconstructing a simple cloud of 3D points from images; it requires a high-fidelity representation whose form must often be adapted individually to the object or scene in question.

This thesis guides you through image-based 3D reconstruction by means of advanced methods that aim to push its limits, from precise and complete geometry to detailed appearance, using powerful mathematical tools as well as recent advances in deep learning (neural networks). To evaluate these methods, rigorous experiments are conducted both at scene level (and at large scale), where efficiency is paramount, and at object level, where accuracy, completeness and photorealism can be better appreciated. To show the individual potential of each of these methods, as well as the breadth of applications concerned, different scenarios are considered and serve as proofs of concept. Thus, the journey begins with large-scale city modeling based on aerial photographs of the cities of Zurich (Switzerland), Enschede (Netherlands) and Dortmund (Germany), continues with the complete reconstruction of objects with missing parts using the synthetic dataset ShapeNet, which includes everyday objects such as cars, benches or planes that can be found in any city, and finishes with the embellishment of these digital models through high-resolution texture mapping using multi-view 3D data of real and synthetic objects, such as the statues and fountains that also dot urban landscapes. It is entirely possible to combine these different methods, with some adjustments, into a complete incremental system dedicated to a specific application. Enjoy the read!

Keywords: image-based modeling, dense 3D reconstruction, semantic interpretation, multi-view, shape completion, appearance modeling, texture mapping, texture super-resolution, convex optimisation, finite element method, deep learning, convolutional neural network.


Acknowledgements

This thesis is the fruit of five intensive years of research and self-development, during which I had the chance to collaborate with talented people from whom I have learned a lot. Freshly graduated in electrical engineering in France, I had little experience in computer vision at the time. The last six months of my Master's degree, during which I did an internship in 3D reconstruction, were decisive. It was crystal clear: I wanted to pursue my career in that field!

First of all, I want to thank my supervisor, Prof. Dr. Konrad Schindler, very much for giving me the opportunity to pursue my PhD under his supervision and for guiding me into the world of 3D reconstruction. His deep expertise, combined with his valuable feedback and the time he devoted to my work, created an excellent learning environment for me. He gave me the invaluable time and freedom to build stronger knowledge, to explore my interests and to develop my research. I am also grateful that I could leave the group twice to do internships in a company.

Monique Berger Lande, "MERCI POUR TOUT" (thanks for everything)! Beyond being an amazing secretary, taking care of all the administrative details, coffee-related matters and mental support during deadline crises, she is my Swiss mummy. I really enjoyed our many discussions, about everything, during coffees or lunches, and I will definitely miss them. Her support started immediately with the first email for my apartment search and hasn't stopped since. I am thankful to her for all of this. The thesis ends with many memories in mind, some crazier than others, but for sure more are yet to come.

Maros is another special person who has marked my doctorate. He was my first and best office-mate, creating an office atmosphere with the perfect mix of serious work and fun. We collaborated a lot in the first two years of my thesis, leading to several publications and incredible conferences together. This also allowed me to gently gain confidence and experience without being alone in my research. I could always count on him. His impeccable French led to many laughs, not to mention our Friday afternoon breakdowns with NERF battles.

In the second half of my PhD, I was fortunate to work closely with people, now friends, from the Computer Vision Group at ETH: Ian, Martin and Vagia.

I quickly entered the deep learning world with Ian's help and support. In addition to immediately getting along with each other, we were very complementary in the way we worked. It was thus very pleasant to work with him, and we produced great scientific publications together. Being able to take breaks and talk about everything and nothing in French was important for clearing my mind, and I really enjoyed it. Moreover, our numerous scientific discussions and often very long meetings were so beneficial to me that I can only keep repeating a deep thank you.

Martin provided us with very good project supervision. His strong knowledge and varied experience were of key importance. He always found time to receive us for discussions and to help us find solutions whenever we got stuck. I am very grateful to him for his advice, explanations and mentorship, which allowed me to mature even more in my research.


At last, thanks to Vagia, who gave me all the time I needed to get to grips with her code and data used in our joint publication. Always cheerful with a big smile, she was highly supportive and available to me, making the task easier for me.

Scientific publications are not the result of one person's work. I would therefore like to thank my other co-authors for their teamwork and their knowledge, from which I benefited greatly: Dr. Christoph Vogel, Dr. Jan D. Wegner (as well as for the pleasant moments and tips around dinners and coffees, not forgetting the discovery of Northern Germany), Dr. Mathias Rothermel, Dr. Torsten Sattler and Prof. Dr. Thomas Pock.

Thanks to everyone with whom I shared time and memories in the lab: my dear Laura who left the lab too soon after my arrival, Michal, Charis, Silvano, Manu, Nikolai, Nico, Andres, Nadine, Mikhail for his continuous and efficient IT support (you saved me from a couple of nervous breakdowns), Riccardo, Ozgur, Corinne, Priyanka, Cenek, Stefano, Jillian, Katrin, Yujia, Binbin. As well as Daniel, Prof. Dr. Klingelé, Christian and Andrea, whom I had the chance to meet at ETH.

This thesis could not have happened without the professional IT support of Gürkan, Patrick and Christian who were always available in case of problems.

Not forgetting a last special colleague and friend, Helena. We met thanks to Maros and we shared so many memories together. I already miss our numerous coffees, discussions and sofa time together, which made my PhD life sweeter.

My gratitude also goes to Prof. Dr. Marc Pollefeys and Prof. Dr. Vincent Lepetit, who agreed to be my co-examiners and spent time reviewing my work. At this point, I also take the opportunity to give warm thanks to all the people who proofread this thesis, whether at the technical or the spelling level, for their precious feedback: Konrad, Ian, Martin and my dear Warren.

Finally, I would like to deeply thank my family, who have always supported me in what I was undertaking and who have given me the means to do so. They taught me the values of seriousness and work well done, while knowing how to relax and keep amusement in my life.

Last but not least, I would like to thank from the bottom of my heart my partner Charles, who did not hesitate for a second to quit everything and follow me for this new adventure in Zürich. This implied many sacrifices, difficult steps and new challenges for him to overcome, but he gave me the best support and love I could ever have received. He truly put me ahead of himself. That is why I want to dedicate this thesis to him. After INSA Lyon, we did it together once again and I am really proud of us.

Audrey Richard,

May 2020


Contents

List of Figures 11

List of Tables 14

1 Introduction 15

1.1 Topics in this Thesis . . . . 19

1.1.1 Semantic 3D Reconstruction . . . . 20

1.1.2 Point Cloud Completion . . . . 21

1.1.3 Texture Super-Resolution . . . . 21

1.2 Relevance for Society and Economy . . . . 22

1.3 Publications . . . . 24

2 Preliminaries 25
2.1 Image-based Modeling . . . . 25

2.1.1 3D Representations . . . . 26

2.1.2 Structure-from-Motion (SfM) . . . . 30

2.1.3 Dense Stereo . . . . 31

2.1.4 Volumetric Fusion . . . . 32

2.1.5 Semantic Understanding . . . . 34

2.1.6 Appearance Modeling . . . . 35

2.1.6.1 Lighting and Shading . . . . 36

2.1.6.2 Texture Mapping . . . . 37

2.1.6.3 Rethinking Texture Mapping . . . . 41

2.2 Inference and Discretisation . . . . 41

2.2.1 Convex Optimisation . . . . 41

2.2.1.1 Terminology and Problem Definition . . . . 42

2.2.1.2 Duality . . . . 43

2.2.1.3 Primal-Dual Algorithm . . . . 47

2.2.1.4 Convex Relaxation and Functional Lifting . . . . 48

2.2.2 Discretisation with Finite Elements . . . . 49

2.2.2.1 Quick Overview of FEM . . . . 52

2.2.2.2 Weak Formulation . . . . 52

2.2.2.3 Discretisation Strategy . . . . 56

2.2.2.4 Comparison of Finite Elements and Finite Differences . . . . 61

2.3 Neural Networks . . . . 63

2.3.1 Multi-Layer Perceptrons (MLP) . . . . 64

2.3.2 Convolutional Neural Networks . . . . 65



2.3.3 Training a CNN . . . . 67

2.3.4 Applications and Architectures . . . . 70

3 Semantic 3D Reconstruction with Finite Element Basis 73
3.1 Related Work . . . . 74

3.2 Method . . . . 76

3.2.1 Convex Relaxation . . . . 76

3.2.2 Finite Element Spaces . . . . 76

3.2.2.1 Lagrange Elements . . . . 77

3.2.2.2 Raviart-Thomas Elements . . . . 78

3.2.3 Discretisation . . . . 80

3.2.3.1 Lagrange Basis . . . . 80

3.2.3.2 Raviart-Thomas Basis . . . . 80

3.2.4 Non-Metric Extension . . . . 81

3.2.4.1 Non-Metric Priors: Continuous vs. Discrete . . . . 82

3.2.4.2 Lagrange Basis . . . . 82

3.2.4.3 Raviart-Thomas Basis . . . . 84

3.3 Semantic Reconstruction Model . . . . 84

3.3.1 Data Term for Lagrange Basis . . . . 84

3.3.2 Optimisation . . . . 85

3.3.3 Interesting Features of the Scheme . . . . 86

3.3.3.1 Grid vs. P1 . . . . 86

3.3.3.2 Adaptiveness . . . . 87

3.4 Evaluation . . . . 89

3.4.1 Input Data . . . . 89

3.4.2 2D Lagrange Results . . . . 90

3.4.3 Influence of the Control Mesh . . . . 91

3.4.4 3D Lagrange Results . . . . 92

3.4.5 Raviart-Thomas in 2D . . . . 94

3.4.6 Raviart-Thomas in 3D . . . . 95

4 Shape Completion 98
4.1 Related Work . . . . 99

4.1.1 Traditional Shape Completion Approaches . . . . 99

4.1.2 Learned Shape Completion Approaches . . . . 99

4.1.3 Learned Shape Representations . . . 100

4.1.4 Point Cloud-based Learning and Descriptors . . . 100

4.2 Method . . . 101

4.2.1 From 3D Points to 2D Images . . . 102

4.2.2 Descriptor Network . . . 103

4.2.3 Coarse-to-Fine Approach . . . 103

4.2.4 Loss Function . . . 104

4.3 Implementation Details . . . 105

4.3.1 KAPLAN Precomputation (3D to 2D) . . . 105

4.3.1.1 Valid Flag Attribution . . . 105

4.3.1.2 Depth Aggregation . . . 106

4.3.1.3 Difference between Valid Flag and Depth Value . . . 107


4.3.2 Point Prediction (2D to 3D) . . . 107

4.3.3 Prediction Filtering . . . 108

4.3.3.1 Inter-KAPLAN Consistency . . . 108

4.3.3.2 Representative Query Points . . . 108

4.3.4 Architecture of our U-net Encoder-Decoder . . . 108

4.4 Experiments . . . 109

4.4.1 Data Generation . . . 109

4.4.2 Design Choices using Ground Truth KAPLAN . . . 110

4.4.3 Baselines . . . 112

4.4.4 Choice of Metrics . . . 112

4.4.5 Results . . . 113

4.4.5.1 Missing Region Detection . . . 113

4.4.5.2 Quantitative Evaluation . . . 113

4.4.5.3 Qualitative Evaluation . . . 114

4.4.5.4 Ablation and Parameter Study . . . 118

4.4.5.5 Discussion . . . 119

5 Learned Multi-View Texture Super-Resolution (SR) 121
5.1 Related Work . . . 122

5.1.1 Prior-based Single-Image SR . . . 122

5.1.2 Redundancy-based Multi-Image SR . . . 123

5.1.3 Multi-View Texture Mapping . . . 123

5.1.4 Multi-View Texture SR . . . 124

5.1.5 Multi-View Learning-based Texture SR . . . 124

5.2 Method . . . 125

5.2.1 Multi-View Aggregation (MVA) . . . 125

5.2.2 Single-Image Prior (SIP) . . . 127

5.2.3 Loss Function . . . 128

5.3 Experiments . . . 128

5.3.1 Datasets . . . 128

5.3.2 Implementation Details . . . 129

5.3.3 Training Setup . . . 129

5.3.4 Results . . . 131

5.3.4.1 Comparison with State-of-the-Art . . . 131

5.3.4.2 Ablation Study . . . 132

5.3.4.3 Study of Varying Number of Input Views . . . 134

5.3.4.4 Discussion . . . 135

6 Conclusion 137
6.1 Summary . . . 137

6.2 Limitations and Outlook . . . 138

6.2.1 Semantic 3D reconstruction with Finite Elements . . . 138

6.2.2 KAPLAN: A 3D Point Descriptor for Shape Completion . . . 139

6.2.3 Learned Multi-View Texture Super-Resolution . . . 140

6.2.4 The Big Picture . . . 142

A Bibliography 144


B Acronyms 163

C Proofs 164

C.1 Optimisation for Lagrange FEM . . . 164
C.1.1 Proxmap for the Minkowski Sum of Convex Sets . . . 164
C.2 Primal-Dual Update Equations . . . 166

D Curriculum Vitae 169


List of Figures

1.1 Illustration of some challenges in image-based modeling. . . . . 17

1.2 Teaser collage illustrating the three main contributions of this thesis. . . . . 19

1.3 Collage of a few possible applications in city modeling and entertainment. . . 23

2.1 Overview of a generic dense 3D reconstruction pipeline. . . . 26

2.2 Different volumetric representations for a same scene. . . . 27

2.3 Illustration of the Delaunay condition and Delaunay triangulation. . . . 29

2.4 Comparison of the key properties of the three main 3D representations used in this thesis. . . . 30

2.5 Overview of different regularisers (anisotropic and isotropic). . . . 33

2.6 Illustration of different shading techniques from computer graphics for the rendering of a sphere model. . . . 36

2.7 (a) Texture mapping function between the texture and the object’s surface, (b) Overview of the different mappings connecting the texture T, the object model M and the images I i . . . . 38

2.8 Overview of the pipeline for the generation of a texture atlas. . . . . 39

2.9 Collage of alternative methods to texture mapping. . . . 41

2.10 Illustration of the common terminology in convex optimisation. . . . . 43

2.11 Illustration of the concepts of subdifferential, strong and weak duality. . . . . 44

2.12 Discretisation of arbitrary continuous objects yields an unavoidable discretisation error. . . . 50

2.13 Illustration of P k elements in 1D, 2D and 3D. . . . . 57

2.14 Illustration of functions approximation in FEM. . . . 58

2.15 Example of piece-wise quadratic basis functions in 1D. . . . . 59

2.16 Illustration of FEM meshes in 1D to create different orders of approximation with Lagrange polynomials. . . . 60

2.17 Illustration of a perceptron and of a multi-layer perceptron (MLP). . . . 64

2.18 Illustration of the CNN terminology. . . . . 66

2.19 Common activation functions of neural networks. . . . 67

2.20 Illustration of the U-net architecture and of the main building block of ResNet. . . 70
3.1 Semantic 3D model, estimated from aerial views with our FEM method. . . 73

3.2 Illustration of the P 1 basis function shape, the scalar field defined as a convex combination of basis coefficients and the gradient definition in a simplex. . . 78

3.3 Illustration of the non-metric extension and the solution implemented. . . . . 83

3.4 Illustrations of the Wulff-shape regularizer, the split of a simplex, the comparison of finite differences versus finite elements on a regular grid. . . . . 86



3.5 Comparison of the grid-based finite differences with finite elements discretisation using P1 basis elements. . . . . 87

3.6 Updated data term after adding a new vertex. . . . 89

3.7 Input data for semantic 3D reconstruction. . . . 89

3.8 Illustration of our FEM-based method (Lagrange) for semantic 3D reconstruc- tion on a simple 2D synthetic scene. . . . 90

3.9 Example scenes of our 2D data set and results obtained with our Lagrange FEM method. . . . 91

3.10 Illustration of the control mesh foundation and quantitative evaluation of Lagrange FEM method with respect to different degradations of the input data. . . 92
3.11 Quantitative evaluation of our FEM-based method (3D Lagrange) of Scene 1 from Enschede. . . . 93

3.12 Additional datasets. . . . . 94

3.13 Illustration of our FEM-based method (Raviart-Thomas) for semantic 3D reconstruction on a simple 2D synthetic scene. . . . 94

3.14 Quantitative evaluation of the Raviart-Thomas FEM method with respect to different degradations of the input data. . . . . 95

3.15 Quantitative evaluation of our FEM-based method (3D Raviart-Thomas) of Scene 1 from Enschede. . . . 95

3.16 Reconstruction with the Raviart-Thomas basis and with the Lagrange basis. . . 96
3.17 Large-scale semantic 3D reconstruction of Enschede (Netherlands). . . . . 97

4.1 Shape completion with the KAPLAN descriptor. . . . 98

4.2 Computation of KAPLAN for a query point q. . . 102

4.3 Overview of our multi-scale pipeline for shape completion. . . 104

4.4 Valid flag attribution. . . 105

4.5 Effect of the average-to-center distance constraint for valid cells. . . 106

4.6 Effect of τ for depth aggregation in a cell. . . . 107

4.7 Description of the U-shaped encoder-decoder. . . . 109

4.8 Reconstruction using ground truth KAPLAN for different configurations (only coarse level). . . 110

4.9 Reconstruction steps of the coarse-to-fine scheme for shape completion. . . . 111

4.10 Illustration of KAPLAN predictions at coarse level. . . 113

4.11 Quantitative comparison to the state-of-the-art shape completion methods. . 114

4.12 Qualitative and quantitative comparison to state-of-the-art shape completion methods. . . 115

4.13 Qualitative and quantitative comparison to state-of-the-art shape completion methods (meshed version). . . 116

4.14 Additional qualitative results on ShapeNet. . . 117

4.15 Failure cases of our shape completion approach. . . 118

5.1 Learned super-resolution result compared to the state-of-the-art (upscaling ×4). . . 121
5.2 The proposed multi-view super-resolution network combining the concepts of redundancy-based multi-view SR and prior-based single-image SR. . . 125

5.3 Texture patches at different steps of training (upsampling factor ×4). . . 130

5.4 Qualitative comparison to state-of-the-art multi-view SR methods (upscaling factor ×2). . . 132


5.5 Ablation study and comparison (upsampling factor ×4). . . . 133
5.6 Additional ablation study and comparison (upsampling factor ×4) on some training scenes. . . 136


List of Tables

3.1 Quantitative comparison with octree model [Bláha et al., 2016] and MultiBoost input data. . . . . 90
3.2 Quantitative comparison of our two proposed FEM methods with octree model [Bláha et al., 2016] and MultiBoost input data [Benbouzid et al., 2012]. . . 93
4.1 F1-score and runtime for different ground truth KAPLAN configurations. . . 111
4.2 Parameter study at coarse level ℓ0. . . 119
5.1 Quantitative comparison of different texture super-resolution techniques (upscaling factor ×4). . . . 131
5.2 Evaluation of our network performance with a varying number of input views on each testing scene (upscaling factor ×4). . . . 134



1 Introduction

High-fidelity 3D models are becoming tremendously important as more and more applications require a high level of realism or a greater feeling of immersion. Virtual and augmented reality (VR/AR) experiences surround us nowadays. Many museums and exhibitions offer virtual tours (e.g. of historical buildings or times), or even augmented artworks to add extra content to them. For example, the Louvre in Paris organised a unique VR experience for the 500th anniversary of the death of Leonardo da Vinci. To rediscover the Mona Lisa, "Beyond the Glass" lets the public see the painting up close, in all its rich detail, with a precision and closeness that could never be obtained on site or from a photo.

At the same time, a digital 3D model also allows us to keep a faithful and timeless copy of cultural heritage (e.g. for restoration or reproduction purposes); the Digital Michelangelo project [Levoy et al., 2000], for instance, scanned several of Michelangelo's statues. Movies, video games and interactive training (e.g. flight simulation or medical training) also rely on visually realistic content to create special effects or to provoke strong emotions in the user. Although humans already have incredible imaginative powers, a photorealistic 3D model is often needed as a support to project themselves and appreciate possible outcomes. Many real estate agencies and furnishing companies offer applications to visualise new interior designs, and clothing stores even let you try an outfit directly on your own reconstructed body. While a visually pleasing and realistic experience is of key importance for numerous applications, others rely more on an accurate and detailed shape description. Prominent examples are autonomous robotics (e.g. understanding and interacting with the environment via accurate object detection and recognition), navigation/planning (e.g. via city modeling), or industrial/metrology applications where precise measurements of geometry are necessary (e.g. for inspection/reverse-engineering tasks or 3D printing). Depending on the application's final goal, different requirements are posed to the 3D modeling system.

The widespread use of high-definition cameras and displays intensifies the need for automated generation of high-quality and photorealistic 3D models. To generate such a digital model, a sensor that is able to measure the real-world 3D geometry (i.e. depth measurement) is first required. We distinguish two types of sensors: passive and active sensors, which come with their own benefits and drawbacks according to the application.

Passive sensors are principally photo cameras. The common procedure consists in finding matching pixels in the images and converting their 2D positions into 3D depths via triangulation. This process is referred to as stereo matching. Different stereo setups can be built with two or multiple cameras, mounted on a fixed rig or not, respectively called binocular or multi-view stereo setups. In the binocular case, similar to the human visual system, the two images are always taken at exactly the same time. All these measurements are usually combined into a consistent representation to output a 3D model.

Active sensors make use of a controlled illumination, e.g. a laser beam or a visible light pattern, to recover the 3D geometry of objects by means of triangulation or of the time delay between light emission and reception. They are often called active rangefinders or 3D scanners. The most popular active sensors are LiDAR, structured-light scanners and time-of-flight cameras. Although they are a prevalent technology for capturing 3D geometry, they are invasive and cannot scan certain materials. They also have a limited range, which depends on the amount of light they emit. More importantly, they only generate 3D points, which are of higher accuracy and density, but still need to be modeled into objects. Due to the usually poor texture images captured by 3D scanners, data-driven modeling becomes more difficult, as image-based description and recognition are often the keys to successful modeling. Cameras thus appear more flexible and scalable for objects of different sizes, while being especially power-efficient and affordable.

The rest of this thesis will focus on passive techniques that generate a full 3D model from one or more input images of an object or a scene, also known as Image-based Modeling.

Nevertheless, it should be noted that this work is not limited to passive techniques. On the contrary, it is in principle possible to apply it to active techniques such as structured lighting or range scanning.

One of the biggest challenges in image-based 3D reconstruction comes from the intrinsic ill-posedness of the task, from both geometric and appearance perspectives, which makes the problem under-constrained. This can be even more challenging in a multi-view setting, which concurrently makes the problem over-constrained. Prominent reasons are the following:

• Loss of one dimension in the 3D-to-2D projection process, which is often a perspective projection. Although it is usually easy for a human to get an idea of the 3D structure shown in an image, infinitely many different 3D surfaces may produce the same set of images, making the task extremely hard for a computer program.

• Multiple objects aligned along the line of sight between the camera and the object destroy valuable information. This is known as occlusion and can lead to partial 3D reconstructions.

• Intensity variations in the images, e.g. changes of object illumination with the camera viewpoint or the presence of reflective or transparent surfaces, also introduce ambiguity in pixel matching between images.

• The intrinsic composition of appearance, which requires separating texture/material information from (view-dependent) lighting and shadows.

• Multi-view redundancy when multiple images are available. This raises the question of how to blend them effectively to reach a high-quality appearance model that corresponds to the observations.

Another challenge arises from the absence of a unique 3D representation. While the 2D realm is easy to represent with one very common representation, namely pixels, the 3D realm can be represented arbitrarily as point clouds, polygonal meshes (e.g. triangular meshes, quad-meshes), parametric surfaces (e.g. Bézier curves, splines and B-splines), implicit functions (e.g. the zero level set of a signed distance function), or voxels 1, among others. Last but not least, images are limited in terms of resolution and the observed 2D data are irregularly distributed in the 3D domain. Figure 1.1 shows some examples.

To find a unique solution to this problem, additional prior knowledge is necessary. Prior knowledge is something we know about the objects we are trying to reconstruct. A typical prior is to assume that objects have a smooth surface. Fortunately, scenes naturally contain an even richer prior structure that can be exploited directly using, notably, scene understanding, e.g. buildings are vertical and always stand on the ground, or an airplane is symmetric and equipped with at least two similar horizontal wings.

Figure 1.1: Illustration of some challenges in image-based modeling: (left) scale ambiguity for the 3D reconstruction; without prior knowledge, the real object's scale cannot be recovered; (middle) occlusions lead to missing data since some parts of the scene are unobserved; (right) varying illumination conditions may cause different pixel colours for same points in the scene.

In practice, many approaches in the literature follow a generic three-stage procedure. Starting from an unstructured collection of images, the first step extracts the camera motion, i.e. the locations and orientations, also called camera poses, and the scene structure in the form of a sparse set of 3D points 2. This is called structure-from-motion (see Part 2.1.2). Using the obtained camera calibrations, the second step consists in computing dense measurements of the 3D geometry. The most popular approach is to compute dense depth maps, which are essentially images measuring the per-pixel projective depth from a specific camera viewpoint to the observed object or scene. This is called dense stereo (see Part 2.1.3). The last step brings all the information from all views together into a richer and consistent representation, e.g. a dense point cloud or a textured mesh (see Parts 2.1.4 and 2.1.6).

1 A voxel is simply the cubic counterpart of a pixel (for more details, see Part 2.1.1).

2 Some systems infer the motion using additional sensors such as Inertial Measurement Units (IMUs) or wheel odometry.


This thesis assumes that camera poses are known and starts from RGB images and dense depth maps. Although the depth maps used here are computed with stereo matching techniques, the presented methods are not limited to this type of data. It is entirely possible to use RGB-D data available from depth sensors, e.g. Kinect or the Google Tango project.

The next step consists in fusing this information to obtain a 3D model. Traditionally, the dense 3D reconstruction problem is formulated as an energy minimisation problem over a chosen 3D representation. The energy functional commonly consists of two terms: the data fidelity term, which measures the fidelity of the solution to the observed data, and the regularisation term, which incorporates prior assumptions about the expected solution. The latter is all the more important as it provides robustness against possible noise in the data, but more importantly against missing data, so that the initially ill-posed problem becomes in fact well-posed. The final reconstructed geometry corresponds to the optimal solution of this energy. Our research work mostly relies on convex variational methods, which present the following main benefit: due to the convexity of the energy, any (local) minimiser corresponds to the global minimum. Furthermore, these methods are independent of the initialisation, and the solution can thus be found with iterative numerical optimisation techniques (see Part 2.2.1).
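As an illustration of this generic structure, a common total-variation flavoured instance for a binary (occupied/free) occupancy field reads as follows; this is a schematic example rather than the exact functional used later in this thesis:

$$\min_{u:\,V \rightarrow [0,1]} \; \int_{V} \rho(x)\, u(x)\, \mathrm{d}x \;+\; \lambda \int_{V} \lVert \nabla u(x) \rVert \, \mathrm{d}x,$$

where $u$ is the relaxed occupancy indicator over the reconstruction volume $V$, $\rho$ is a per-point data cost derived from the depth observations (negative where the evidence favours occupied space), and $\lambda$ balances data fidelity against the smoothness prior.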

The recent advent of deep learning as a tool for image analysis opens up new perspectives for image-based 3D modeling. Since 2015, convolutional neural networks (CNNs) for 3D reconstruction have attracted increasing interest and demonstrated impressive performance (see Section 2.3). In contrast to classical energy formulations, which rely on a combination of data observations and explicitly user-defined prior knowledge, pure deep learning approaches are data-driven and reproduce patterns seen in the training data. This is at the same time a strength and a weakness. On the one hand, to get a solution that looks really good and close to reality, lots of training data from which patterns can be copied are required. On the other hand, to keep the solution under control by defining the rules that produce it, a large number of parameters need to be adjusted.

The visually pleasing aspect of the model is added in a final post-processing step by wrapping a texture map around the reconstructed 3D model. A texture map is simply a 2D image that describes the appearance and characteristics of a surface, see Part 2.1.6. There are different ways to compute it: from the simplest, which stitches or blends pieces of the input images, to more sophisticated ones which perform super-resolution or intrinsic decomposition in order to recover solely the true colour or albedo, i.e. with any lighting effects removed.

Driven by the above observations, the main goal of this thesis is to push the boundaries of 3D modeling from images by leveraging:

• different aspects of the world (e.g. geometry, appearance, semantic),

• different surface representations (e.g. volumetric representation, point cloud, mesh),

• various inference methods (e.g. classical optimisation tools, pure deep learning tech- niques or hybrid approaches),

into coherent advanced methods. Although relying on input data computed from structure-from-motion, our proposed scheme outputs a dense, high-quality textured 3D model. The next part of this chapter gives an overview of the methods developed and highlights the main contributions of this thesis.

1.1 Topics in this Thesis

The research in this thesis proceeds in three parts. Individually, each one represents an independent contribution and is considered in a stand-alone fashion. Combining them to build an incremental pipeline towards high-fidelity models, i.e. recovering an accurate geometry without holes and with a high-quality texture, is conceivable. Figure 1.2 is a collage that illustrates our main contributions in a possible common scenario of city modeling. The related work is discussed separately in the corresponding chapters.

In Part 1.1.1, we introduce a novel generic framework for better discretisation of reconstruction problems based on finite elements. Within the same model, it allows variable and adaptive resolution to better fit the irregular distribution of observed data. In practice, we consider semantic 3D reconstruction, a powerful joint formulation for 3D reconstruction that leverages the obvious synergy between geometry and semantic scene understanding to output watertight and finer meshes. In Part 1.1.2, we describe a novel shape completion method that directly operates on unstructured point clouds with missing geometry. The input data can simply be obtained by sampling the previously obtained mesh, which can still contain holes due to occlusions, or from any active 3D scanner. We introduce KAPLAN, a novel 3D point descriptor that efficiently aggregates local shape information from a point's neighbourhood into a local grid of 2D features (i.e. k planes of multiple orientations). The core idea is to inpaint missing areas of these 2D grids with convolutional neural networks in order to recover the complete 3D geometry. In Part 1.1.3, we present a method to retrieve a high-resolution texture of an observed 3D object from multiple (overlapping) low-resolution viewpoints. The method combines two core concepts of super-resolution, namely the physics-based SR principle, which relies on the image formation model to simulate the physical capturing process, and prior-based single-view SR, which leverages deep learning to derive a data-driven prior of what high-resolution patterns look like.

Figure 1.2: Teaser collage illustrating the three main contributions of this thesis.

1.1.1 Semantic 3D Reconstruction

A number of computer vision tasks, such as segmentation, multi-view reconstruction, stitching and inpainting, can be formulated as multi-label problems on continuous domains by functional lifting [Pock et al., 2010; Cremers et al., 2011; Lellmann and Schnörr, 2011; Chambolle et al., 2012; Nieuwenhuis et al., 2013]. A recent example is semantic 3D reconstruction, e.g. [Häne et al., 2013; Bláha et al., 2016], which solves the following problem: given a set of images of a scene, reconstruct both its 3D shape and a segmentation into semantic object classes. The task is particularly challenging, because the evidence is irregularly distributed in the 3D domain; but it also possesses a rich, anisotropic prior structure that can be exploited. Jointly reasoning about shape and class allows one to take into account class-specific shape priors (e.g. building walls should be smooth and vertical, and vice versa, smooth vertical surfaces are likely to be building walls), leading to improved reconstruction results.
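Schematically, the lifted convex formulation assigns to each point of the domain a relaxed indicator vector over the $L$ labels; the regulariser used in Chapter 3 is anisotropic and non-metric, so the form below is only meant to convey the general structure:

$$\min_{u} \; \sum_{i=1}^{L} \int_{\Omega} \rho_i(x)\, u_i(x)\, \mathrm{d}x \;+\; R(u) \qquad \text{s.t.} \quad \sum_{i=1}^{L} u_i(x) = 1, \;\; u_i(x) \ge 0,$$

where $u_i(x) \in [0,1]$ is the relaxed indicator of label $i$, $\rho_i$ the corresponding data cost and $R$ a (possibly anisotropic) pairwise regulariser.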

So far, models for the mentioned multi-label problems, and in particular for semantic 3D reconstruction, have been limited to axis-aligned discretisations. Unless the scenes are aligned with the coordinate axes, this leads to an unnecessarily large number of elements. Moreover, since the evidence is (inevitably) distributed unevenly in 3D, it also causes biased reconstructions. Thus, it is desirable to adapt the discretisation to the scene content, as often done for purely geometric surface reconstruction, e.g. [Labatut et al., 2007].

Our contributions. We propose a novel framework for the discretisation of multi-label problems on arbitrary, continuous domains. Our work bridges the gap between general FEM discretisations, and labeling problems that arise in a variety of computer vision tasks, including for instance those derived from the generalised Potts model. Starting from the popular formulation of labeling as a convex relaxation by functional lifting, we show that FEM discretisation is valid for the most general case, where the regulariser is anisotropic and non-metric. While our findings are generic and applicable to different vision problems, we demonstrate their practical implementation in the context of semantic 3D reconstruction, where such regularisers have proved particularly beneficial.

The proposed FEM approach leads to a smaller memory footprint as well as faster computation, and it constitutes a very simple way to enable variable, adaptive resolution within the same model. This means we can refine or coarsen the discretisation as appropriate, to adapt to the scene to be reconstructed. Finer tessellation will therefore be preferred in regions that are likely to contain a surface, leveraging both high spatial resolution and high numerical precision only in those regions.


1.1.2 Point Cloud Completion

Shape completion is the task of filling holes so as to obtain a complete representation of an object's shape. The aim of this work is to perform shape completion directly on the point cloud, without having to transform it into a memory-demanding volumetric scene representation (e.g., a global voxel grid or signed distance function). In other words, we must learn the shape statistics of local surface patches, so that we can then sample points on the expected surface and fill the holes. With the natural decision to center the patches on existing 3D points, this is equivalent to predicting 3D point descriptors from incomplete data.

We present a 3D shape completion method that operates directly on unstructured point clouds, thus avoiding the aforementioned resource-intensive data structures. To this end, we introduce KAPLAN, a 3D point descriptor that aggregates local shape information via a series of 2D convolutions. The key idea is to project the points in a local neighborhood onto multiple planes with different orientations. In each of those planes, point properties like normals or point-to-plane distances are aggregated into a 2D grid and abstracted into a feature representation with an efficient 2D convolutional encoder. Since all planes are encoded jointly, the resulting representation nevertheless can capture their correlations and retains knowledge about the underlying 3D shape, without expensive 3D convolutions. Experiments on public datasets show that KAPLAN achieves state-of-the-art performance for 3D shape completion.
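For illustration only, the following NumPy sketch shows the plane-projection idea behind such a descriptor; the grid size, plane orientations and choice of feature (mean point-to-plane distance) are placeholder assumptions, not the actual KAPLAN configuration.

```python
# Illustrative sketch of projecting a point's neighbourhood onto k oriented planes and
# rasterising signed point-to-plane distances into small 2D grids (KAPLAN-style idea).
import numpy as np

def plane_grids(points, query, normals, radius=0.2, res=16):
    """points: (N,3) cloud, query: (3,) centre, normals: (k,3) ndarray of unit plane normals."""
    nbrs = points[np.linalg.norm(points - query, axis=1) < radius] - query
    grids = np.zeros((len(normals), res, res))
    counts = np.zeros_like(grids)
    for k, n in enumerate(normals):
        # Build an orthonormal in-plane basis (u, v) for the plane with normal n.
        u = np.cross(n, [0.0, 0.0, 1.0])
        if np.linalg.norm(u) < 1e-6:              # n is (anti)parallel to z, pick another axis
            u = np.cross(n, [0.0, 1.0, 0.0])
        u /= np.linalg.norm(u)
        v = np.cross(n, u)
        dist = nbrs @ n                            # signed point-to-plane distance
        coords = np.stack([nbrs @ u, nbrs @ v], axis=1)
        cells = ((coords / radius + 1.0) * 0.5 * (res - 1)).astype(int)
        inside = np.all((cells >= 0) & (cells < res), axis=1)
        for (i, j), d in zip(cells[inside], dist[inside]):
            grids[k, i, j] += d                    # accumulate distances per cell
            counts[k, i, j] += 1
    return np.where(counts > 0, grids / np.maximum(counts, 1), 0.0)  # mean distance per cell

# The resulting (k, res, res) stack of 2D grids can then be fed to a 2D convolutional encoder.
```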

Our contributions. We propose a novel shape completion approach that fills in 3D points without costly convolutions on volumetric 3D grids. The approach operates both locally and globally via a multi-scale pyramid. To that end, we design KAPLAN, a 3D descriptor that combines projections from 3D space onto multiple 2D planes with a modern convolutional feature extractor to obtain an efficient, scalable 3D representation. Moreover, the combination of KAPLAN and our multi-scale pyramid approach automatically detects the missing region and allows shape completion to be performed without regenerating the whole object. While our target application is shape completion, KAPLAN can potentially be used for a range of other point cloud analysis tasks. It is simple, computationally efficient, and easy to implement, yet still leverages the power of deep convolutional learning to obtain expressive visual representations.

1.1.3 Texture Super-Resolution

Besides reconstructing the best possible 3D geometry, an equally important, but perhaps less appreciated, step of the image-based modeling process is to generate a high-fidelity surface texture. However, the vast majority of image-based 3D reconstruction methods ignore the texture component and merely stitch or blend pieces of the input images into a texture map in a post-processing step, at the resolution of the inputs, e.g. [Debevec et al., 1996; Bernardini et al., 2001; Eisemann et al., 2008; Waechter et al., 2014].

We present a super-resolution method capable of creating a high-resolution texture map for a virtual 3D object from a set of lower-resolution images of that object. Our architecture unifies the concepts of (i) multi-view super-resolution based on the redundancy of overlapping views and (ii) single-view super-resolution based on a learned prior of high-resolution (HR) image structure. The principle of multi-view super-resolution is to invert the image formation process and recover the latent HR texture from multiple lower-resolution projections. We map that inverse problem into a block of suitably designed neural network layers, and combine it with a standard encoder-decoder network for learned single-image super-resolution. Wiring the image formation model into the network avoids having to learn perspective mapping from textures to images, and elegantly handles a varying number of input views. Experiments demonstrate that the combination of multi-view observations and learned prior yields improved texture maps.
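Schematically, the image formation model that is inverted can be written as follows (generic multi-view SR notation for illustration, not necessarily the exact operators of Chapter 5):

$$I_i \;\approx\; D\,B\,W_i\,T + \varepsilon_i, \qquad \hat{T} \;=\; \arg\min_{T} \; \sum_{i=1}^{N} \big\lVert D\,B\,W_i\,T - I_i \big\rVert_1 \;+\; R(T),$$

where $T$ is the latent HR texture, $W_i$ warps the texture into view $i$ (perspective projection), $B$ models blur, $D$ downsampling, $\varepsilon_i$ noise, and $R$ is the learned single-image prior playing the role of the regulariser.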

Our contributions. We propose the first super-resolution framework capable of combining, in a general multi-view setting, redundancy-based multi-view SR with single-image SR based on a learned HR image prior. We do point out that a rudimentary combination has been explored in the special case of video SR, where it has been proposed to enrich learned, single-view super-resolution with redundant observations from adjacent frames [Mitzel et al., 2009; Unger et al., 2010]. Moreover, our network architecture merges state-of-the-art deep learning and traditional variational SR methods. This unifying architecture has multiple advantages: (i) it seamlessly handles an arbitrary number of input images and is invariant to their ordering, including the special case of a single image (falling back to pure single-view SR); (ii) it does not waste resources, potentially sacrificing robustness, on learning known operations such as the perspective projection that relates images taken from different viewpoints; (iii) it focuses the learning effort on small residual corrections, in both the single- and multi-view branches, thus reducing the amount of training data needed.

1.2 Relevance for Society and Economy

Since the 1980s, 3D reconstruction of objects or scenes of our real world has played a central role in computer vision. It is probably only in the last decade that it has started to have a great and visible impact on our society, notably fuelled by the increasing availability of 3D acquisition technologies and datasets, by maturing research (algorithms), and by unleashed computational power (GPUs).

As previously listed, there is a multitude of applications that require the efficient and automatic generation of high-quality 3D models from images. Nevertheless, given my research environment (Department of Civil, Environmental and Geomatic Engineering of ETH Zürich), as well as my industrial experience (virtual reality and video games), I will focus here on the two types of applications most relevant in my view (see Figure 1.3).

City modeling. According to the United Nations 3, more than half of the world's population (55.3%) was living in urban areas in 2018, and it is projected that more than two-thirds (66%) will live in urban settlements by 2050. There are already 467 cities with between 1 and 5 million inhabitants and 598 cities with between 500,000 and 1 million inhabitants. The world's cities are constantly growing in both size and number.

City modeling can actually play a key role in their sustainable development, including urban planning, prediction of hazard scenarios, catastrophe response planning, road infrastructure management, landscaping of green spaces, simulation of traffic flows, development of smart city models, etc. These models notably enable interactive measurements of 3D distances, surface areas and volumes without physical presence at the location. If augmented with semantic understanding, further high-level tasks can be performed, such as path planning or quantifying the roads to be maintained; if augmented with a realistic texture, the tourism and real-estate industries can also benefit from them. Among the major players interested in this sector, we can cite for instance: municipalities via the towns' urban planning departments and tourism offices, architects, ESRI, Garmin, Uber, Waze, smarterbettercities. VarCity, a multi-year research project financed by the European Research Council and carried out by the Computer Vision Lab of Prof. Dr. Van Gool at ETH Zürich, also reflects the great importance given by institutions to this research topic.

3 Report available at https://www.un.org/en/events/citiesday/assets/pdf/the_worlds_cities_in_2018_data_booklet.pdf.

Entertainment. Another growing business is the entertainment sector, including augmented/virtual reality, movies and games. The graphic design work in these sectors is a major component, still often done manually by artists and therefore relatively expensive. Being able to replicate our real world by automatic vision-based (or photogrammetric) measurements is thus extremely interesting, as evidenced by the disproportionate budgets invested by big companies to prevail in this market.

The main objective is to offer a lifelike experience to the users, either by inserting virtual objects into their real world (holograms), or by immersing them in a realistic virtual world (TV screen, headset). In the first case, the virtual object should be added seamlessly into the real world so as not to disturb the human brain with the introduction of this non-real part. But this gets compromised if the geometry or appearance of this non-real part is not correctly reconstructed. That is where vision-based reconstruction comes into play. Among the major players interested in this sector, we can cite for instance: Facebook (e.g. Oculus Rift), Microsoft (e.g. HoloLens), Disney Research (e.g. Medusa Performance Capture), Sony (e.g. PlayStation Eye), ESC Entertainment (e.g. The Matrix or Mission Impossible II, which used [Debevec et al., 1996]), and Tilt Five (e.g. holographic tabletop games).

Figure 1.3: Collage of a few possible applications in city modeling and entertainment.


1.3 Publications

The following works have been published in the context of this thesis:

[1] A. Richard, Ian Cherabier, M. R. Oswald, M. Pollefeys, K. Schindler, KAPLAN: A 3D Point Descriptor for Shape Completion, International Conference on 3D Vision (3DV) (2020), oral presentation

[2] A. Richard, Ian Cherabier, M. R. Oswald, V. Tsiminaki, M. Pollefeys, K. Schindler, Learned Multi-View Texture Super-Resolution, International Conference on 3D Vision (3DV) (2019), oral presentation, best paper honorable mention award

[3] C. Stucker, A. Richard, J. D. Wegner, K. Schindler, Supervised outlier detection in large-scale MVS point clouds for 3D city modeling applications, ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences (2018), oral presentation, best paper award

[4] A. Richard, C. Vogel, M. Bláha, T. Pock, K. Schindler, Semantic 3D Reconstruction with Finite Element Bases, British Machine Vision Conference (BMVC) (2017)

[5] M. Bláha, M. Rothermel, M. R. Oswald, T. Sattler, A. Richard, J. D. Wegner, M. Pollefeys, K. Schindler, Semantically Informed Multiview Surface Refinement, International Conference on Computer Vision (ICCV) (2017)

[6] M. Bláha, C. Vogel, A. Richard, J. D. Wegner, T. Pock, K. Schindler, Large-Scale Semantic 3D Reconstruction: an Adaptive Multi-Resolution Model for Multi-Class Volumetric Labeling, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), oral presentation

[7] M. Bláha, C. Vogel, A. Richard, J. D. Wegner, T. Pock, K. Schindler, Towards Integrated 3D Reconstruction and Semantic Interpretation of Urban Scenes, Dreiländertagung, D-A-CH Photogrammetry Meeting (2016)


2 Preliminaries

This chapter is devoted to introducing some underlying concepts and principles in computer vision, optimisation and deep learning, which are assumed to be known in the remaining chapters. In Section 2.1 we present the core components of an image-based 3D reconstruction pipeline. We notably introduce and define the different surface representations that will be used throughout this thesis. Next, in Section 2.2, we focus on the mathematics of this modeling problem. Discretisation using finite elements and convex optimisation techniques are discussed. Finally, Section 2.3 provides some basic knowledge of deep learning and, more specifically, of convolutional neural networks (CNNs), which have had the wind in their sails for the last few years.

2.1 Image-based Modeling

Image-based Modeling, also often referred to as image-based 3D reconstruction or 3D photography, is the collection of techniques to compute partial or full 3D models from one or more images:

$$\mathcal{I} = \{ I_i \mid i = 1, \dots, N \} \qquad (2.1)$$

with $N$ the total number of images. Each image $I_i$ is acquired with a camera (e.g. a perspective projection) and corresponds to a 2D matrix, $I_i : \Omega_i \subset \mathbb{R}^2 \rightarrow \mathbb{R}^{n_c}$, with $n_c$ the number of channels. As described in Chapter 1, the focus of this thesis is dense 3D reconstruction. The generic pipeline is composed of three main stages, see Figure 2.1. First, a sparse reconstruction is obtained along with the camera calibrations using techniques of structure-from-motion. Then, a densification step computes dense depth measurements (possibly noisy), called dense stereo. Finally, to get a complete and consistent 3D model, these multiple measurements need to be fused into a consistent representation before putting on a texture. Since we consider the powerful formulation of semantic 3D reconstruction in Chapter 3, we also describe what semantic understanding is and what its benefits are over traditional, purely geometric approaches.



Figure 2.1: Overview of a generic dense 3D reconstruction pipeline.

2.1.1 3D Representations

To virtually represent the 3D world, there are multiple possible representations, but in the context of image-based modeling we will focus on three of them: volumetric models, point clouds and triangular meshes. They make fundamentally different assumptions about reality, which allows different scene characteristics to be embedded into the representation. Since all of them will be used throughout this thesis, this part gives a clear definition as well as more insights into their generation procedures. For a quick overview, Figure 2.4 summarises their key properties.

Volumetric models. The common approach consists in dividing the volume of interest, i.e. a bounding box enveloping the object or the scene to be reconstructed, into a regular grid of equally sized voxels; a voxel is simply a small cube that can be seen as the natural extension of a pixel to 3D space. A regular voxel grid of resolution $\mathit{dim} \times \mathit{dim} \times \mathit{dim}$ is thus defined as the set of voxels with nodes $v_{ijk} \in \mathbb{R}^3$, $i, j, k = 1, \dots, \mathit{dim}$.

This natural Euclidean representation allows a simple modeling of 3D shape, e.g. via voxel labeling. The voxels are simply labeled as free or occupied (or with more labels, see Part 2.1.5), as depicted in Figure 2.2.a. The final 3D model is then composed of the union of all occupied voxels, and its surface is defined as the implicit boundary between free and occupied space. The object's surface is usually represented as a triangular mesh, which can be extracted from the volumetric representation using the Marching Cubes algorithm [Lorensen and Cline, 1987] (see the next paragraph about meshes).

Aggregating data evidence (3D points or multiple depth maps) over voxels can be very simple, e.g. voxels which contain one or several 3D measurements are labeled as occupied, while empty ones are labeled as free. More sophisticated methods [Zach et al., 2007; Liu and Cooper, 2010; Häne et al., 2013; Savinov et al., 2015] propose to cast the viewing ray, i.e. the ray between a camera viewpoint and a 3D measurement, through the grid and to take visibility along this ray into account: if a 3D measurement is visible from this viewpoint, then all voxels before it along the ray are free space, while the voxels right after it are probably occupied.
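As a minimal sketch of the simple aggregation variant, assuming NumPy and scikit-image, the following marks voxels that contain at least one point as occupied and extracts the free/occupied boundary with Marching Cubes; the ray-casting visibility reasoning of the more sophisticated methods is omitted.

```python
# Minimal sketch: naive voxel occupancy from a point cloud, then surface extraction
# with Marching Cubes. No visibility reasoning or regularisation is applied here.
import numpy as np
from skimage.measure import marching_cubes

def occupancy_grid(points, dim=64):
    """points: (N,3) array; returns a dim^3 grid with 1 for voxels containing points."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    idx = ((points - lo) / (hi - lo + 1e-9) * (dim - 1)).astype(int)
    grid = np.zeros((dim, dim, dim), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0   # voxel containing >= 1 point -> occupied
    return grid

points = np.random.rand(10000, 3)                  # placeholder point cloud
grid = occupancy_grid(points)
# Extract the implicit boundary between free and occupied space as a triangle mesh.
verts, faces, normals, values = marching_cubes(grid, level=0.5)
```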

One of the main benefits of this Euclidean data structure is that all voxels are connected and have a predefined and simple neighbourhood structure, known from the beginning.


This representation is thus well suited to supporting regularisation during optimisation, e.g. to tackle noise in the input data or missing evidence. A brief history of volumetric reconstruction along with the most popular frameworks is set out in Part 2.1.4. Moreover, this representation can easily be extended to multiple labels (see Figure 2.2.b), i.e. the label occupied is further decomposed into object classes tailored to the application needs, e.g. in city modeling such classes could be buildings, streets, etc., offering the possibility to design class-specific shape priors (see Part 2.1.5).

Although the resolution of the final reconstruction can be controlled via the voxel size, this model rapidly becomes computationally expensive and has a very high memory footprint. More precisely, the number of voxels grows cubically with the resolution, which makes it memory-prohibitive to scale the resolution up. This is due to its intrinsic property of storing useless information at full resolution, e.g. free space and the inside of objects, whereas only the surface is of interest. To overcome this limitation, adaptive multi-resolution methods based on octrees were proposed, e.g. [Bláha et al., 2016]. The benefits are clearly visible in the schematic of Figure 2.2.c. Based on the principle of recursive decomposition, octrees save memory where a high resolution is not required, i.e. they employ big voxels for free space or inside solid components, whereas fine voxels are only used close to the surface to be reconstructed.
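A minimal sketch of such a recursive decomposition is given below, assuming numpy; the class and parameter names are illustrative, and the refinement criterion (presence of surface samples in a cell) is a deliberate simplification of the data terms used in practice.

import numpy as np

class OctreeNode:
    """Minimal adaptive octree: cells are refined only where surface samples lie."""
    def __init__(self, center, half_size, points, max_depth=6, min_points=1):
        self.center, self.half_size = center, half_size
        self.children = []
        inside = points[np.all(np.abs(points - center) <= half_size, axis=1)]
        # Subdivide only near the surface (i.e. where evidence exists) and while
        # the maximum depth has not been reached; coarse cells remain as leaves.
        if len(inside) >= min_points and max_depth > 0:
            for dx in (-0.5, 0.5):
                for dy in (-0.5, 0.5):
                    for dz in (-0.5, 0.5):
                        child_center = center + half_size * np.array([dx, dy, dz])
                        self.children.append(
                            OctreeNode(child_center, half_size / 2,
                                       inside, max_depth - 1, min_points))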

Figure 2.2: Different volumetric representations for the same scene: (left) a simple regular voxel grid that differentiates only occupied and free voxels, (middle) semantic understanding is added to the previous representation such that every occupied voxel is further assigned a specific class, e.g. ground, vegetation or building, (right) the voxel discretisation is adapted to the data using an octree.

So far, all the presented methods make use of a regular partitioning of the volume, with voxels of fixed or adaptive size along the coordinate axes. However, most of the time the data evidence is irregularly distributed and not especially aligned with the coordinate axes. This leads to an unnecessarily high number of elements and also to biased reconstructions. Irregular discretisations thus appear as the most suitable, for which voxels are commonly replaced by tetrahedra. [Vu et al., 2012] proposes a data-dependent discretisation based on the initial point cloud; based on visibility and photoconsistency, a binary state is then assigned to each tetrahedron. In the same vein, Chapter 3 of this thesis proposes a novel adaptive and irregular discretisation scheme based on finite elements (FEM) for semantic multi-label reconstruction.


Point clouds. A point cloud is a set of unstructured 3D points that samples the geometry of an observed object or scene:

P = { p | p ∈ R^3 }    (2.2)

It is actually the representation closest to the acquisition process of 3D data, whether active or passive (see Chapter 1). Despite the ease of capturing point clouds, processing them is a challenging task due to their lack of structure. In the generic pipeline of dense 3D reconstruction, point clouds only constitute an intermediate sparse representation (from structure-from-motion) before the subsequent densification stage.

However, the biggest advantage of not making prior assumptions about point connectivity is that editing operations, such as adding or removing points, as well as merging operations, become straightforward. Moreover, with a high enough resolution and point density, point clouds can accurately represent the features of any virtual 3D object. For example, point cloud 3D scanning has been used to create 3D representations of highly complex objects such as ancient archaeological marvels. In Chapter 4, this flexible handling is exploited for shape completion and combined with a powerful 2D representation better adapted to deep learning architectures. Similar to volumetric models, an additional step is required to extract the final reconstructed surface, e.g. Delaunay triangulation [Delaunay, 1934] or Poisson Surface Reconstruction [Kazhdan et al., 2006].
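The following toy example, assuming numpy and using randomly generated arrays as stand-ins for real scans, illustrates why this lack of connectivity makes merging, cropping and subsampling trivial array operations.

import numpy as np

# Two partial scans of the same object, e.g. from two viewpoints (already aligned).
scan_a = np.random.rand(5000, 3)
scan_b = np.random.rand(3000, 3)

# Merging is a simple concatenation: no connectivity needs to be repaired.
merged = np.vstack([scan_a, scan_b])

# Removing points (e.g. cropping to a region of interest) is a boolean mask.
roi_mask = (merged[:, 2] > 0.1) & (merged[:, 2] < 0.9)
cropped = merged[roi_mask]

# Uniform subsampling to control point density is equally straightforward.
subsampled = cropped[::4]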

Meshes. A mesh is a collection of vertices and faces that defines the shape of polyhedral surfaces; the faces usually consist of triangles and are described with a connectivity list over the vertices, as considered in this thesis:

M = < V, F > with V = {v i | v i ∈ R 3 } and F = {f i | f i ∈ V × V × V } (2.3) This 2D manifold embedded in 3D space is one of the most popular representations for 3D geometry, intensively used in computer graphics due to its GPU-friendly asset and to all rendering operations tools available (e.g. ray tracing, collision detection, etc.). Furthermore, today’s graphics hardware is optimised for fast processing of triangle meshes.

Unlike the two previous representations (volumetric models and point clouds), meshes allow for modeling the reconstruction problem directly with the final representation (an explicit representation) and do not rely on an additional conversion step to extract the final reconstructed surface. This alleviates possible discretisation artefacts during the extraction of the reconstructed surface. They are usually the most flexible way to accurately represent 3D geometry, i.e. as a continuous piecewise linear surface. Their intrinsic irregular structure makes them adaptive and scalable to large data, e.g. city modeling includes thousands of buildings that are mainly described with planar or cylindrical patches, thus the surface modeling requires only a small number of points along with few triangular patches. However, it still remains difficult to approximate curved surfaces with a series of triangles. Moreover, although this representation seems the most attractive, it has the big disadvantage of not explicitly handling arbitrary topology and can encounter some numerical difficulties, e.g. [Turk and Levoy, 1994].

Many mesh generators build a mesh of triangles by first creating all the nodes and then connecting the nodes to form triangles, i.e. meshing point clouds directly. A popular method is Delaunay triangulation, which maximises the smallest angle of the triangles, i.e. triangles with internal angles close to 60° (well-shaped triangles) are preferred over ones with small internal angles (thin triangles); the optimal configuration being equilateral. It builds upon the Delaunay condition as illustrated in Figure 2.3.a, i.e. the circumcircle of any triangle of the triangulation has an empty interior (it does not contain any other vertex). This leads to a better triangulation, as shown in Figure 2.3.b. Another important characteristic is that Delaunay triangulation connects points in a nearest-neighbour manner. The properties of Delaunay triangulations naturally extend to higher dimensions: the triangulation of a set of 3D points is composed of tetrahedra and is referred to as Delaunay tetrahedralisation. In order to keep the following explanations clearer, the 2D case will be considered.

[Lee and Schachter, 1980] describes two algorithms to build such a triangulation from a point cloud P: the first one uses a divide-and-conquer approach that runs in O(|P| log |P|) time (asymptotically optimal), the second one is simpler, with an iterative approach that runs in O(|P|^2) time in the worst case. The idea of the latter is to build the Delaunay triangulation by inserting one point after another, while always ensuring the Delaunay condition for the set of points inserted so far. To avoid special cases, the point cloud P is first augmented with three artificial points "far out" such that the new point set has a triangular convex hull. The algorithm starts from this big enclosing triangle and progressively introduces points inside it. For each newly introduced point s, it first finds which triangle t = {p, q, r} contains s, then splits t into the three triangles resulting from connecting s with the three vertices p, q, r. To restore the Delaunay property, a series of edge flips is performed.
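The empty-circumcircle test that drives these edge flips reduces to the sign of a determinant, and an incremental Delaunay triangulation is readily available in SciPy; the sketch below assumes numpy and scipy and is purely illustrative, not the implementation used in this thesis.

import numpy as np
from scipy.spatial import Delaunay

def in_circumcircle(p, q, r, s):
    """True if s lies strictly inside the circumcircle of triangle (p, q, r),
    i.e. the Delaunay condition is violated and the shared edge must be flipped.
    Assumes (p, q, r) is given in counter-clockwise order."""
    m = np.array([[p[0] - s[0], p[1] - s[1], (p[0]-s[0])**2 + (p[1]-s[1])**2],
                  [q[0] - s[0], q[1] - s[1], (q[0]-s[0])**2 + (q[1]-s[1])**2],
                  [r[0] - s[0], r[1] - s[1], (r[0]-s[0])**2 + (r[1]-s[1])**2]])
    return np.linalg.det(m) > 0

# Incremental construction: points are inserted one after another, mirroring
# the iterative algorithm of [Lee and Schachter, 1980].
pts = np.random.rand(100, 2)
tri = Delaunay(pts[:50], incremental=True)
tri.add_points(pts[50:])
triangles = tri.simplices   # connectivity list of the final triangulation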

Figure 2.3: Illustration of the Delaunay condition (left) that allows the construction of a "well-shaped" triangulation, compared to an arbitrary triangulation for which many triangles are very thin (right).

There are also other meshing techniques, operating on oriented point clouds though, which are now available in several libraries for 3D data processing, e.g. the Ball Pivoting algorithm [Bernardini et al., 1999] or Poisson Surface Reconstruction (PSR) [Kazhdan et al., 2006].
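As an example, PSR can be run on an oriented point cloud in a few lines with the Open3D library; the sketch below assumes Open3D is installed, uses a random array as a stand-in for a real scan, and the normal-estimation parameters are illustrative.

import numpy as np
import open3d as o3d

points = np.random.rand(10000, 3)          # placeholder for a real scan
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)

# PSR requires oriented points: estimate and consistently orient normals first.
pcd.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30))
pcd.orient_normals_consistent_tangent_plane(20)

# Poisson Surface Reconstruction [Kazhdan et al., 2006]; the octree depth
# controls the resolution of the reconstructed mesh.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)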

As said above, it is entirely possible to extract a mesh from volumetric models. The most famous technique for regular discretisations is called Marching Cubes [Lorensen and Cline, 1987]. The algorithm proceeds through each voxel: given the scalar field value, e.g. a labelling function, at the eight corner locations, it identifies the voxel configuration among the 15 possible configurations (up to symmetry) and determines the corresponding isosurface (one or several triangles). Irregular discretisations are more straightforward, since the surface is directly defined by the boundary triangles between different labels.
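A reference implementation of Marching Cubes is available, e.g., in scikit-image; the sketch below assumes that library and uses a synthetic signed-distance field of a sphere as the scalar volume.

import numpy as np
from skimage import measure

# Synthetic scalar field: signed distance to a sphere on a 64^3 grid.
dim = 64
x, y, z = np.mgrid[:dim, :dim, :dim]
volume = np.sqrt((x - dim/2)**2 + (y - dim/2)**2 + (z - dim/2)**2) - dim/4

# Extract the isosurface at level 0, i.e. the free/occupied boundary.
verts, faces, normals, values = measure.marching_cubes(volume, level=0.0)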


Meshes are also beneficial in terms of texture mapping, for which more details can be found in the following Part 2.1.6. While it is possible to colour volumetric models and point clouds, it is relatively hard to texture them, and mesh-based techniques are then preferred.

Figure 2.4: Comparison of the key properties of the three main 3D representations, namely volumetric models, point clouds and meshes.

2.1.2 Structure-from-Motion (SfM)

Starting from a set of unstructured images, Structure-from-Motion (SfM) aims to simultaneously recover the structure, i.e. 3D points, and the camera motion, i.e. locations and orientations, also called camera poses. Succinctly, it builds upon sparse feature detection and matching between the different input images to perform triangulation. The obtained 3D points (a sparse point cloud), along with the camera poses, are then refined in a final bundle adjustment step that globally minimises the reprojection error over all the images.

SfM is a broad research topic in itself that has been intensively studied in computer vision, but presenting its fundamentals in depth goes beyond the scope of this thesis. Instead, we refer to [Hartley and Zisserman, 2003], which covers in detail the theory of projective geometry on which SfM techniques are based, and to [Özyeşil et al., 2017] for a recent survey on the topic. This thesis resorts to an existing pipeline [Wu, 2011] to obtain the necessary camera poses. Cameras were also calibrated beforehand to determine their internal parameters, such as the focal length and the position of the image centers.
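For intuition, the core of a minimal two-view reconstruction (feature matching, relative pose from the essential matrix, triangulation) can be sketched with OpenCV as below; K denotes the calibrated intrinsic matrix, the images are assumed to be grayscale arrays, and the global bundle adjustment performed by full SfM pipelines such as [Wu, 2011] is omitted.

import cv2
import numpy as np

def two_view_sfm(img1, img2, K):
    """Minimal two-view structure-from-motion: features -> matches -> pose -> points."""
    orb = cv2.ORB_create(4000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Match sparse features between the two images.
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Relative camera motion from the essential matrix (calibrated case).
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

    # Triangulate the matched points into a sparse 3D point cloud.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    return R, t, (pts4d[:3] / pts4d[3]).T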
