Big Geospatial Data

Academic year: 2022

Prof. Dr. Martin Werner www.martinwerner.de

Big Geospatial Data

Me...

Studied math (algebraic topology) in Bonn

Doctoral dissertation in computer science (indoor navigation) in Munich

… and you?

Overview ...

Part I: Parallel Computing

- Parallel Programming
- Message Passing Interface
- Annotation-based Multiprocessing Using OpenMP
- GPU Computing Using NVIDIA CUDA
- MapReduce
- Apache Big Data Stack (Hadoop, Spark, …)

Part II: Selected Algorithms for Spatial Big Data

- Points, Images, Street Networks (aka Spatial Data Types)

Part III: Examples and Appendix

- Trajectory Clustering Using Traclus and OpenMP
- Word Counting in Various MapReduce Environments
- Trajectory Similarity Matrix Computations (aka ACM SIGSPATIAL GIS Cup 2017)
- Counting astronomical objects from Hubble Space Images
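To give a first impression of the word-counting example listed above, here is a minimal single-process sketch of the MapReduce idea in plain Python. It is not tied to Hadoop, Spark, or any other framework from Part I; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values of each key to get the final counts."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(documents):
    # A real framework would run the map and reduce calls in parallel on
    # many machines; this sketch runs them one after another in one process.
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    docs = ["big geospatial data", "big data stack"]
    print(word_count(docs))
```

The three phases are kept as separate functions on purpose: map and reduce only look at one document or one key at a time, which is what allows a framework to distribute them.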

Your Ideas? Just propose application areas or concrete

Lecture Documentation

I am doing the following to allow you to follow this lecture easily:

- Write a script (but, please, take your own notes during the lecture)
- Publish slides (of course)
- Provide sources via github

You can do the following to help with this lecture:

- translate sources to your favourite programming languages
- write corrections (from typos to errors) for the script
- add your own case-studies to the script in the Appendix

Github - Why?

Using Github, we can

- easily share code
- easily discuss proposed changes
- keep track of contributions and activities

Basic Usage:

Go to the repository and download all sources

Advanced Usage:

Edit something and create a pull request to notify me where something is wrong or not working in your environment.

However

I am still living in Munich with my family.

Therefore, I want to ask you:

Can we (partly) block this lecture into larger units?

This semester, the course consists of one hour of lecture and one hour of practice per week. This means that we should have 14 hours of lecture.

Teaching must be completed by July 15th.

I would be very happy if we could have 2 hours on Tuesday in the

Time Table

12.4. (today) Lecture (2h) Introduction
19.4. free
26.4. free
3.5. Lecture (4h) incl. 2h additional lecture
10.5. Exercise
17.5. free
24.5. Lecture (4h) incl. 2h additional lecture
31.5. free
7.6. Tutorial
14.6. free
21.6. Tutorial
28.6. Lecture (2h)

Programming Languages for Big Data

What is your favourite programming language?

A choice of languages for (Spatial) Big Data

- Python
- R
- MATLAB and Octave
- C++
- Java
- Scala

and some more specific languages depending on the actual context.

Python

Advantages

- Nice, modern scripting language
- Huge amount of software available
- C++ friendly (easy to extend towards high performance)

Drawbacks

- Can be difficult to read (what does that one bracket expression, which is kept because it once seemed to work, actually do?)
- Software quality (especially of packages) varies
- Easy to break: virtual environment stuff, versions, python2 vs. python3
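The "easy to break" point is often a question of which interpreter and which environment actually run a script. As a small sketch using only the standard library, the following prints these facts and stops early on Python 2. The helper names describe_interpreter and require_python3 are invented for this example, and the virtual environment check assumes an environment created with the venv module.

```python
import sys


def describe_interpreter():
    """Collect the facts that most often explain a 'works on my machine' bug."""
    return {
        "major": sys.version_info[0],
        "minor": sys.version_info[1],
        # Inside an environment created by venv, sys.prefix points to the
        # environment and sys.base_prefix to the base installation.
        "in_virtualenv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
        "executable": sys.executable,
    }


def require_python3():
    """Fail early with a readable message instead of a confusing error later."""
    if sys.version_info[0] < 3:
        raise RuntimeError(
            "This script needs Python 3, found %d.%d" % sys.version_info[:2]
        )


if __name__ == "__main__":
    require_python3()
    print(describe_interpreter())
```

Calling require_python3() at the top of a script turns a hard-to-read failure deep inside the code into a single clear message.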

R

Advantages

- Classical language, good documentation
- Uniform names for common actions (fit, model, predict, plot, ...)
- Extremely C++ friendly (easy to extend towards high performance)
- Very good plot defaults for scientific computing
- Nice IDE (RStudio)
- CRAN - Peer-Reviewed source code packages for almost everything in statistical computing

Drawbacks

- Not the easiest to start with
- Sometimes difficult to read due to complex statements

MATLAB / Octave

Advantages

- Matrix-centered multi-purpose programming
- Very good documentation, wide usage in the field
- Extensible
- High-quality toolboxes (however, expensive!) for MATLAB

Drawbacks

- Expensive
- Not open source
- Open-source version Octave is not fully equivalent

C++

Advantages

- High performance
- Extremely high-quality libraries (boost)
- Platform independence, even towards GPU and embedded systems
- Embeddable into Python, Java, R and MATLAB (almost anywhere)
- Full support for generic programming
- Very modern standard (C++17 is ready)

Drawbacks

- Compiler errors are difficult to read (especially when using generics)
- Some inconsistencies between compilers

Java

Advantages

- Good performance
- High-quality design and runtime
- Platform independence
- Easy to learn (very good error messages)
- Safe memory management

Drawbacks

- Unable to unlock some aspects of modern computers (GPUs, specific instructions)
- Overhead produced by memory management

Scala

Advantages

- A modern approach to functional programming
- Compatible with Java, running on top of JVM
- Platform independence

Drawbacks

- Unable to unlock some aspects of modern computers (GPUs, specific instructions)
- Overhead produced by memory management

Wrap-Up

- Python: A useful scripting language with a high adoption rate, but sometimes easy to break
- R: A fully functional data science environment that feels like a classical imperative scripting language
- MATLAB and Octave: If you need matrices and matrix algebra, consider MATLAB and Octave.
- C++: You need to scale up to unlimited performance while still using a high-quality, nice language? C++ is here for you.
- Java: You need to scale out? Java is the way to go. Not the fastest, not the most efficient, but easy to use and not so

Motivating Example: OpenCV and Python for Face Recognition

Python: Simple Face Detection

Prepare your system…

- sudo apt-get install python-opencv # for Debian / Ubuntu
- git clone https://github.com/shantnu/FaceDetect/

Run the example...

bgd:~$ python face_detect.py abba.png haarcascade_frontalface_default.xml

face_detect.py

import sys

import cv2

# Get user supplied values
imagePath = sys.argv[1]
cascPath = sys.argv[2]

# Create the haar cascade
faceCascade = cv2.CascadeClassifier(cascPath)

# Read the image and convert it to grayscale for the detector
image = cv2.imread(imagePath)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

face_detect.py

# Detect faces in the image
faces = faceCascade.detectMultiScale(
    gray,
    scaleFactor=1.1,
    minNeighbors=5,
    minSize=(30, 30),
    # OpenCV 2.x constant; OpenCV 3 and later call it cv2.CASCADE_SCALE_IMAGE
    flags=cv2.cv.CV_HAAR_SCALE_IMAGE
)

print("Found {0} faces!".format(len(faces)))

face_detect.py

# Draw a rectangle around the faces
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x+w, y+h), (0, 255, 0), 2)

# Show the result and wait for a key press
cv2.imshow("Faces found", image)
cv2.waitKey(0)

Face Detection in Python

The source code is

- easy to read
- easy to modify
- Complex algorithms made accessible for anyone
- Performance overhead can be ignored.

Motivating Example: Trajectory Clustering Using R

Movement and Tracking in Video Surveillance

Sports… (David Alaba)

Mapping and GIS

Personal Tracking (runtastic)

Biology (Orca Movement)

Trajectory Clustering in R

http://martinwerner.de/blog/traclus

Prepare your system

- Install a recent version of R (from CRAN)
- Install and compile libtrajcomp: https://github.com/mwernerds/trajcomp

Run the example...

R example: TRACLUS

This is a very typical R calling sequence using the keyword function() to define an inline function and setting a parameter

Wrapup

Trajectory Clustering in R

- easy to use
- functional sorting and grouping proved useful (ddply)
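For readers who do not know R: ddply (from the plyr package) follows a split-apply-combine pattern, that is, it splits a data frame by grouping variables, applies a function to each piece, and combines the results. The same idea can be sketched in plain Python as follows. This is only an illustration of the pattern, not the code used in the lecture; the field names tid, x, y and the helper names are hypothetical.

```python
from collections import defaultdict


def split_apply_combine(rows, key, apply):
    """Group rows by rows[key], apply a function per group, collect the results."""
    groups = defaultdict(list)
    for row in rows:  # split
        groups[row[key]].append(row)
    # apply the summary to each group and combine into one sorted mapping
    return {k: apply(g) for k, g in sorted(groups.items())}


def point_count(group):
    """Example summary: number of samples in one trajectory."""
    return len(group)


if __name__ == "__main__":
    # Hypothetical trajectory samples: trajectory id plus coordinates
    samples = [
        {"tid": 2, "x": 0.0, "y": 0.0},
        {"tid": 1, "x": 1.0, "y": 2.0},
        {"tid": 1, "x": 1.5, "y": 2.5},
    ]
    print(split_apply_combine(samples, "tid", point_count))
```

Any function that takes the list of rows of one trajectory can be passed as apply, for example one that computes the trajectory length instead of the sample count.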

Motivating Example: Working with Astronomical Images from C++

Prepare your system…

Find it in Stud-IP, not in our github

- Install libTIFF for reading very large images (many other tools will fail on this 980 MB file)
- Install libpng for exporting imagery
- Download the image: http://www.spacetelescope.org/images/heic1620a/

Run the example...

bgd:~$ make
g++ -std=c++11 -fopenmp -ltiff -lpng -o stars_omp stars_omp.cpp
bgd:~$ ./stars_omp ~/Downloads/heic1620a.tif
input file /home/martin/Downloads/heic1620a.tif
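The slides do not show how stars_omp counts objects. One common approach to such a task, given here purely as an assumption and in Python rather than C++ with OpenMP, is to threshold the pixel values and count connected groups of bright pixels. The function name count_bright_objects, the threshold, and the tiny test image are invented for this sketch.

```python
def count_bright_objects(image, threshold):
    """Count 4-connected groups of pixels whose value exceeds the threshold."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    objects = 0
    for row in range(height):
        for col in range(width):
            if image[row][col] <= threshold or seen[row][col]:
                continue
            objects += 1
            # Iterative flood fill marks every pixel of this object as seen.
            stack = [(row, col)]
            seen[row][col] = True
            while stack:
                r, c = stack.pop()
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and not seen[nr][nc] and image[nr][nc] > threshold):
                        seen[nr][nc] = True
                        stack.append((nr, nc))
    return objects


if __name__ == "__main__":
    # Hypothetical 4x4 grayscale image with three separate bright regions
    tiny = [
        [0, 9, 0, 0],
        [0, 9, 0, 8],
        [0, 0, 0, 0],
        [7, 0, 0, 0],
    ]
    print(count_bright_objects(tiny, threshold=5))
```

For an image of the size mentioned above, a pure Python loop like this would be far too slow, which is one reason to reach for C++ and OpenMP in the first place.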

A part

.. and another part

Thank you!
