Twitter – An Example


www.martinwerner.de


A tweet

Twitter in a nutshell:

- A tweet is a short message.
- A hashtag is a word starting with #. It is used to assign a topic to a tweet.
- A mention is a word starting with @ and is used to address a (public) message to a person or company.
- A follower is someone who has subscribed to updates from you.
- A like is when someone clicks the heart below the tweet.
- A retweet is when a (possibly commented) copy of the tweet is sent out.


The network

A tweet object contains all of this information (redundantly, as of the time of API access).

You get it as a JSON object from the API, in other words as a nested key-value data structure.
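To make "nested key-value data structure" concrete, here is a minimal Python sketch. The tweet below is a heavily shortened, hypothetical stand-in (the screen name and follower count are invented for illustration); real tweet objects carry many more keys.

```python
import json

# A heavily shortened, hypothetical tweet; real tweet objects carry many more keys.
raw = '''
{
  "id_str": "1062405858712272898",
  "lang": "en",
  "user": {"screen_name": "example_user", "followers_count": 42},
  "entities": {"hashtags": [{"text": "acm"}, {"text": "sigspatial"}, {"text": "gis"}]}
}
'''

tweet = json.loads(raw)  # JSON object -> nested Python dicts and lists

# Nested access walks the key-value structure level by level.
screen_name = tweet["user"]["screen_name"]
hashtags = [tag["text"] for tag in tweet["entities"]["hashtags"]]

print(screen_name)  # example_user
print(hashtags)     # ['acm', 'sigspatial', 'gis']
```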


Twitter Data Objects

Key                        Value
contributors               null
truncated                  true
text                       "The Shortest Paths Dataset used for #acm #sigspatial #gis cup has just been released. https://t.co/pzeEleBfu9 #gis… https://t.co/IF7z1WnUDk"
is_quote_status            false
in_reply_to_status_id      null
id                         1062405858712272900
favorite_count             3
source                     "<a href=\"http://twitter…>Twitter for Android</a>"
retweeted                  false
coordinates                null
entities                   {…}
in_reply_to_screen_na
in_reply_to_user_id        null
retweet_count              0
id_str                     "1062405858712272898"
favorited                  false
user                       {…}
geo                        null
in_reply_to_user_id_str    null
possibly_sensitive         false
lang                       "en"
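The table lists the same identifier twice: the numeric id ends in ...900, while id_str ends in ...898. A plausible explanation (an assumption, the slides do not state it) is that the viewer used for the screenshot parsed the number as a 64-bit float, which cannot hold all digits of a 64-bit integer. The sketch below reproduces the effect in Python.

```python
import json

id_str = "1062405858712272898"  # value of the id_str field in the table
exact = int(id_str)             # Python integers are arbitrary precision

# Many JSON consumers (JavaScript, for one) store every number as a 64-bit float.
as_double = float(exact)
roundtrip = int(as_double)

print(exact == roundtrip)  # False: the float cannot hold all digits
print(exact - roundtrip)   # the low-order part that got lost

# Python's own json module keeps integers exact, so both fields agree here.
parsed = json.loads('{"id": 1062405858712272898, "id_str": "1062405858712272898"}')
print(parsed["id"] == int(parsed["id_str"]))  # True
```

When in doubt, id_str is the safer field to use as a key.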


Entities


Twitter Data Remarks

- Each tweet has a unique 64-bit unsigned ID, given as an integer (field id) and as a string (field id_str).
- Each tweet has a timestamp created_at. Although I created this tweet in Germany (GMT+1), it is stored in UTC (GMT+0). All tweets share this time zone. This makes it very easy to relate tweets to each other on a global scale, but more difficult to relate a tweet to the local time of day.
- The language is estimated by Twitter.
- Some user account information is embedded in the tweet. This is highly redundant, but very useful for web performance: a tweet object is designed to be sufficient to render the tweet with JavaScript (e.g., to create the view shown above).
- Hashtags are isolated.
- A field truncated has been introduced for compatibility: when Twitter changed


Data Acquisition


Preparing for API Access

Twitter provides a nice and clean API. The first thing you will need is, well, a Twitter account.

- Then, as of July 2018, you must apply for a Twitter developer account and give some information on how you want to use the Twitter API.
- Then, you need to create an app, which provides you with credentials for the API.

As this process changes over time, look up the current steps on Twitter's web pages.


Keys and Tokens

Setting up the app gives you

- the Consumer Key (API Key)
- the associated Consumer Secret (API Secret)
- an Access Token
- an associated Access Token Secret

Each of those is an alphanumeric string.


Tip: Record in secret.env

Create a file secret.env similar to

#Access Token
TWITTER_KEY=274[...]M9b
#Access Token Secret
TWITTER_SECRET=WKS[...]1oI
#Consumer Key (API Key)
TWITTER_APP_KEY=8Co[...]Plt
#Consumer Secret (API Secret)
TWITTER_APP_SECRET=cEI[...]net
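The streaming code later reads these values from os.environ, so the file has to be loaded into the environment somehow. The slides do not say how; the following is one possible sketch with hypothetical helpers parse_env and load_env_file, handling only the simple KEY=VALUE and # comment lines used above.

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def load_env_file(path="secret.env"):
    """Read the file and copy its entries into os.environ."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env(handle.read())
    os.environ.update(values)
    return values


# Placeholder values only; real credentials never belong in source code.
example = "#Access Token\nTWITTER_KEY=abc\n\n#Access Token Secret\nTWITTER_SECRET=def\n"
print(parse_env(example))  # {'TWITTER_KEY': 'abc', 'TWITTER_SECRET': 'def'}
```

Calling load_env_file() once at the start of the script would make os.environ['TWITTER_KEY'] and the other three entries available.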


Streaming Twitter Data

Twitter provides two ways of accessing data:

Query
- Ask for a certain hashtag, location, or object and get back a certain result set.

Stream
- Create a filter (a specification of what you are interested in) and get one tweet after another.


Stream is better? Probably, but not for all…

Advantage of Streaming:

For spatial applications, I love hanging on the stream, because you get a temporal sample of the data which is not skewed towards temporal hotspots.

Downside of Streaming:

You need to operate a reliable system for getting the data (interruptions lead to missing time intervals in your sample)


Streaming in practice

You can rely on the tweepy library to access the Twitter API from within Python. It is simple and actively maintained. However, it is not ultimately stable…

You can also develop your own API client using the documentation offered by Twitter; this can (could) be very stable…


Streaming Framework

import os
import tweepy

# Written against tweepy 3.x; tweepy 4 removed StreamListener.
# Authenticate with the app credentials and the access token from secret.env.
auth = tweepy.OAuthHandler(os.environ['TWITTER_APP_KEY'], os.environ['TWITTER_APP_SECRET'])
auth.set_access_token(os.environ['TWITTER_KEY'], os.environ['TWITTER_SECRET'])
api = tweepy.API(auth)

stream_listener = StreamListener()
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
# Bounding box for the whole world: west, south, east, north.
stream.filter(locations=[-180.0, -90.0, 180.0, 90.0])

This attaches you to the stream. The library delivers data by calling certain methods on your StreamListener object, which you have to implement yourself.


Streaming Details – A StreamListener

import json
import tweepy

class StreamListener(tweepy.StreamListener):
    def __init__(self):
        # Append mode, so a restarted script keeps the tweets collected so far.
        self.outfile = open('tweets.json', 'a+')
        tweepy.StreamListener.__init__(self)

    def on_status(self, status):
        # Write each tweet as one JSON document per line.
        tweet = json.dumps(status._json)
        print(tweet, file=self.outfile)

    def on_error(self, status_code):
        # ... add proper error handling (like throwing an uncaught exception ;-) ...
        raise RuntimeError('Twitter API error: %s' % status_code)

This very simple listener (not production ready!) opens a single file tweets.json for appending data and writes each tweet into this file.


Wrapup

- We now have three components:
  - secret.env with all the API details
  - Main.py implementing the tweepy client and the StreamListener class
  - Hopefully a tweets.json to work with (downloaded from the API)


The real world

- Problem: inevitable faults
  - Library errors: we can't handle them (while running)
  - API errors: we can't handle them (while running)
  - Host errors: we can handle them partially (but do we catch all of them?)
  - Network errors: we could handle them easily, but why?
- Solution:
  - Fail fast: throw exceptions (don't catch them) all over the place and restart your script quickly (but make sure that you stay friendly under repeated failures, otherwise Twitter might ban you)
  - Rely on systemd, docker (with a restart policy), or your own "shell" to restart the script and to take appropriate actions and delays (e.g., exponentially growing delays in case of repeated errors)
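The "own shell" variant of this restart strategy could look roughly like the following sketch. The function names, the base delay of 5 seconds, the cap of 15 minutes, and the 60 second "healthy uptime" threshold are all assumptions chosen for illustration, not values from the slides or from Twitter.

```python
import subprocess
import time


def next_delay(failures, base=5.0, cap=900.0):
    """Seconds to wait before the next restart; doubles with every failure."""
    return min(cap, base * 2 ** (failures - 1))


def supervise(command, max_failures=8, min_uptime=60.0,
              run=subprocess.call, sleep=time.sleep):
    """Restart command whenever it exits; give up after repeated quick failures."""
    failures = 0
    while True:
        started = time.monotonic()
        run(command)  # blocks until the script dies
        if time.monotonic() - started >= min_uptime:
            failures = 0  # it ran for a while, so treat this as a fresh fault
        failures += 1
        if failures >= max_failures:
            return failures
        sleep(next_delay(failures))


print([next_delay(n) for n in range(1, 6)])  # [5.0, 10.0, 20.0, 40.0, 80.0]
```

A call such as supervise([sys.executable, "Main.py"]) (with import sys) would then keep the collector running. systemd or a docker restart policy achieve the same without custom code.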


Working with JSON


JSON

- JSON stands for JavaScript Object Notation.
- JSON has become one of the central data representations on the Internet. It is
  - extensible,
  - human-readable,
  - easy to write.
- It can be read by all major programming languages and has been around for a long time in the context of RESTful services.


Handling JSON

- The downside of JSON is the complexity of the things you can model with it (including tweets).
- In contrast to traditional SQL or XML data, tweets don't follow a specific schema.
- Working with tweets can be done from any good programming language.
- Writing programs for simple operations is over-complicated.
- JSON can be complicated.


JQ

Luckily, this problem has led to a very nice query language and a command line tool called JSON Query Processor (JQ).


JQ Basics

JQ can be used for querying and pretty-printing JSON collections (that is, files containing multiple JSON objects).

The most basic query matches everything and is expressed as ".":

- jq . tweets.json
- cat tweets.json | jq .
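For comparison, the identity query can be mimicked in Python. This sketch assumes the one-JSON-object-per-line layout that the StreamListener writes; the helper read_json_lines and the two sample tweets are hypothetical.

```python
import io
import json


def read_json_lines(handle):
    """Yield one parsed object per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for open('tweets.json'); two hypothetical, shortened tweets.
sample = io.StringIO('{"id_str": "1", "lang": "en"}\n{"id_str": "2", "lang": "de"}\n')

tweets = list(read_json_lines(sample))
for tweet in tweets:
    print(json.dumps(tweet, indent=2))  # pretty-print, similar to `jq .`
```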


Where are the colors


JQ Expressing values

icaml$ cat sample-tweet.json | jq true
true
icaml$ cat sample-tweet.json | jq false
false
icaml$ cat sample-tweet.json | jq 1.42
1.42
icaml$ cat sample-tweet.json | jq '"this is a string"'
"this is a string"


Objects and Arrays

- Basically, JSON has two higher-order data types:
- Objects

icaml$ cat sample-tweet.json | jq '{"key1":42,"key2":"a string"}'
{
  "key1": 42,
  "key2": "a string"
}

- Arrays

icaml$ cat sample-tweet.json | jq '[1,2,3,4]'
[
  1,
  2,
  3,
  4
]


Combined

icaml$ cat sample-tweet.json | jq '{"array":[1,2,4,8],"2d array":[[1,2],[3,4]],"nested objects":{"key":"value"}}'

{
  "array": [
    1,
    2,
    4,
    8
  ],
  "2d array": [
    [
      1,
      2
    ],
    [
      3,


Extracting Fields

- The dot operator (prepended to a field name) selects that field from an object

icaml$ cat tweets.json | jq '.id_str'

"1062406263932444672"

"1062405858712272898"

"1036898465270444032"

"1034516701235372032"

"1027811999529529344"

[...]


Chaining…

- You can chain this operator. The second in the chain is applied to the result of the first.
- .A.B is actually SELECT(B, SELECT(A,…))

icaml$ cat sample-tweet.json | jq .user.entities.url.urls
[
  {
    "url": "https://t.co/74ySSExk6l",
    "indices": [
      0,
      23
    ],
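The "nested select" reading of a chained dot expression can be spelled out in Python. The helpers select and select_path are hypothetical names, the tweet is a shortened stand-in with the same nesting as the query, and the sketch simplifies jq's behaviour: it returns None wherever a key is missing and also for non-object inputs, where jq would report an error.

```python
from functools import reduce


def select(obj, key):
    """One application of the dot operator: look up key, or None if absent."""
    return obj.get(key) if isinstance(obj, dict) else None


def select_path(obj, *keys):
    """.A.B.C as nested selects: select(select(select(obj, A), B), C)."""
    return reduce(select, keys, obj)


# Hypothetical, shortened tweet with the same nesting as in the query.
tweet = {"user": {"entities": {"url": {"urls": [{"url": "https://t.co/74ySSExk6l"}]}}}}

print(select_path(tweet, "user", "entities", "url", "urls"))
print(select_path(tweet, "user", "missing", "url"))  # None, like jq's null
```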


Arrays and Brackets

- Bracket expressions are used to access arrays

icaml$ echo "[[1,2],[3,4]]" | jq '.[0]'
[
  1,
  2
]
icaml$ echo "[[1,2],[3,4]]" | jq '.[0][0]'
1
icaml$ echo "[[1,2],[3,4]]" | jq '.[0][1]'
2
icaml$ echo "[[1,2],[3,4]]" | jq '.[1][0]'
3
icaml$ echo "[[1,2],[3,4]]" | jq '.[1][1]'
4


Arrays and Brackets

- Unspecific (empty) brackets loop over the elements

icaml$ echo "[[1,2],[3,4],[5,6]]" | jq '.[][0]'
1
3
5


Applying what we learnt (and more)


Extract Hashtags

- Let us extract hashtags from a tweet object:
- Loop over all hashtags with an unspecific bracket operation:

icaml$ cat sample-tweet.json | jq '.entities.hashtags[].text'

"acm"

"sigspatial"

"gis"


Now with multiple tweets

icaml$ cat tweets.json | jq '.entities.hashtags[].text'

"acm"

"sigspatial"

"gis"

"MyData2018"

"SpatialComputing"

"GISChat"

"DataScience"

"tutorial"

"Spark"

"AWS"

"Docker"

"spatial"

"analytics"

Problem: A set of tweets results in a concatenation of the sets of hashtags each tweet contains. This might not be what we wanted.

Solution: Create a sequence of objects instead of a sequence of strings!


Maintain the structure

icaml$ cat sample-tweet.json | jq '{"id":.id_str, "hashtag": .entities.hashtags[].text}'
{
  "id": "1062405858712272898",
  "hashtag": "acm"
}
{
  "id": "1062405858712272898",
  "hashtag": "sigspatial"
}
{
  "id": "1062405858712272898",
  "hashtag": "gis"
}
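The same idea, one output object per (tweet, hashtag) pair, can be written as a Python generator. This is only a sketch for comparison: hashtag_rows is a hypothetical helper, and the second tweet is invented to show that a tweet without hashtags contributes no rows.

```python
def hashtag_rows(tweet):
    """Yield one {"id", "hashtag"} object per hashtag of the tweet."""
    for tag in tweet["entities"]["hashtags"]:
        yield {"id": tweet["id_str"], "hashtag": tag["text"]}


# Shortened stand-ins; only the fields used here are present.
tweets = [
    {"id_str": "1062405858712272898",
     "entities": {"hashtags": [{"text": "acm"}, {"text": "sigspatial"}, {"text": "gis"}]}},
    {"id_str": "hypothetical-2", "entities": {"hashtags": []}},
]

rows = [row for tweet in tweets for row in hashtag_rows(tweet)]
for row in rows:
    print(row)
```

Because every row carries the tweet ID, the hashtags can later be grouped per tweet again.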


Calculating with JQ

- Of course, you can calculate with JQ (as with most query languages)

icaml$ echo "[]" | jq 1+2
3
icaml$ echo "[]" | jq '"hello " + "world!"'
"hello world!"
icaml$ echo "[]" | jq '[1,2]+[3]'
[
  1,
  2,
  3
]
icaml$ echo "[]" | jq '{"key":"value"}+{"key2":"value2"}'
{
  "key": "value",
  "key2": "value2"
}


Warning

- But it is not always what you expect:

icaml$ echo "[]" | jq '{"key":"value"}+{"key":"value for duplicate key"}'
{
  "key": "value for duplicate key"
}
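Python dictionaries behave the same way when merged, which may help to build the right expectation: the right-hand side wins on duplicate keys. The second half of this sketch shows one possible way (an illustration, not something the slides prescribe) to keep both values.

```python
left = {"key": "value"}
right = {"key": "value for duplicate key"}

merged = {**left, **right}  # later entries overwrite earlier ones
print(merged)               # {'key': 'value for duplicate key'}

# Keeping both values needs an explicit decision, for example collecting lists.
collected = {}
for source in (left, right):
    for key, value in source.items():
        collected.setdefault(key, []).append(value)
print(collected)            # {'key': ['value', 'value for duplicate key']}
```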


Brackets (Rounded ones)

- Sometimes, you need to scope operations into an explicit expression. This is done using round brackets (as in math)

icaml$ echo "[]" | jq '"x"+"y"*2'
"xyy"
icaml$ echo "[]" | jq '("x"+"y")*2'
"xyxy"


The , operator

- If you want to run several queries, you can create a sequence of results using the , operator:

icaml$ cat sample-tweet.json | jq '.id_str, .text'
"1062405858712272898"
"The Shortest Paths Dataset used for #acm #sigspatial #gis cup has just been released. https://t.co/pzeEleBfu9 #gis… https://t.co/IF7z1WnUDk"


Remark:

- Actually, the array construction we have seen, [1,2,3], is a combination of the [] operator, which creates an array from a sequence of results, the , operator, which creates that sequence, and the values 1, 2, and 3.


Piping

- Similar to chaining for the . operator, we can pipe expressions, meaning that the result of the left expression is made the input of the right expression.

icaml$ cat sample-tweet.json | jq '.user | .name'


JQ Functions

- Finally, JQ provides many functions you will want to have (basically all you can think of and more):
- cat sample-tweet.json | jq '. | keys'
- echo '[1,2,3,4]' | jq 'map(.+1)' results in [ 2, 3, 4, 5 ]
- echo '{"key":"value","key2":"value2"}' | jq 'map_values(.+"_")'
- echo '{"key":"value","key2":"value2"}' | jq 'to_entries'

See the JQ manual for more functions and their explanations.
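For readers who know Python better than jq, the four functions correspond roughly to the following expressions. The tweet dictionary is a shortened stand-in, and the correspondence is approximate (jq's keys returns the keys sorted, hence sorted() here).

```python
tweet = {"lang": "en", "id_str": "1062405858712272898"}  # shortened stand-in

keys = sorted(tweet)                                   # jq: keys
plus_one = [x + 1 for x in [1, 2, 3, 4]]               # jq: map(.+1)

obj = {"key": "value", "key2": "value2"}
suffixed = {k: v + "_" for k, v in obj.items()}        # jq: map_values(.+"_")
entries = [{"key": k, "value": v} for k, v in obj.items()]  # jq: to_entries

print(keys)      # ['id_str', 'lang']
print(plus_one)  # [2, 3, 4, 5]
print(suffixed)  # {'key': 'value_', 'key2': 'value2_'}
print(entries)
```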


Extracting Geo-Located Tweets


Using JQ, WKT, and CSV

- A tweet is precisely geolocated when the field coordinates is defined (not null).
- Query:
  - First, select all tweets that have coordinates
  - Then, extract the coordinates and, for example, followers_count into an array
  - Turn this array into a CSV row and write it

cat <file> | jq -r 'select(.coordinates != null) | [.coordinates.coordinates[0],.coordinates.coordinates[1],.user.followers_count] | @csv' > geo-follower.txt
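The same three steps (filter, extract, write CSV) can be sketched in Python with the standard csv module. The helper geo_rows and the two tweets are hypothetical; the point is stored as [longitude, latitude], as in GeoJSON.

```python
import csv
import io


def geo_rows(tweets):
    """Yield [longitude, latitude, followers_count] for precisely geolocated tweets."""
    for tweet in tweets:
        point = tweet.get("coordinates")
        if point is None:  # same test as select(.coordinates != null)
            continue
        lon, lat = point["coordinates"][:2]
        yield [lon, lat, tweet["user"]["followers_count"]]


# Hypothetical tweets; a GeoJSON point stores [longitude, latitude].
tweets = [
    {"coordinates": {"type": "Point", "coordinates": [11.57, 48.14]},
     "user": {"followers_count": 120}},
    {"coordinates": None, "user": {"followers_count": 7}},
]

out = io.StringIO()  # stand-in for open('geo-follower.txt', 'w', newline='')
csv.writer(out).writerows(geo_rows(tweets))
print(out.getvalue())
```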


QGIS…

Result for 200k tweets:


Chapter 4

Raster Data


Motivation

- The exercise asks us to label a set of polygons A by intersecting it with another set of labeled polygons B and assigning each polygon in A the label with which it shares the largest area. Solved with geometry, this is very expensive, for example:

For each polygon in A

- Find all polygons from B that intersect this polygon from A


A little inaccuracy saves a lot of calculation…

- A simpler solution to the problem is obtained by replacing the complex intersection of two polygons with a suitable set of point-in-polygon tests.
- For this, we need a set of points that can represent area with a chosen accuracy.
- One option is a regular grid: place points at fixed distances (e.g., every 1 m in X and Y direction), so that each point represents one square meter. Then one can first label this grid from dataset B (point-in-polygon test) and


Grids

- An n-dimensional regular grid is a set of points whose spacing along each coordinate is fixed. These distances are called grid spacings.
- A georeferenced grid is a grid with coordinates in a reference system (coordinate system).


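Labeling a Grid

Values assigned to a regular grid are naturally interpreted as a matrix with row index i and column index j:

$$
\begin{pmatrix}
a_{11} & \cdots & a_{1n} \\
\vdots & \ddots & \vdots \\
a_{m1} & \cdots & a_{mn}
\end{pmatrix}
$$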


Labeling a Grid

If several values are assigned to each grid point of a fixed grid, these are called bands.


Example: Wavelengths in optical images


On the exercise

1. Choose a grid spacing.
2. Compute the grid / raster.
3. Fill a raster layer with the classes from dataset B (e.g., int8).
4. Fill a raster layer with the IDs from dataset B (e.g., int64).
5. Reduce on ID and count.

Remark: This is easy to do with a single pass over the raster.
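Step 5 can be sketched as a single walk over two aligned raster layers. The function majority_labels, the nodata value 0, and the tiny example rasters are assumptions for illustration; plain nested lists stand in for the real raster layers.

```python
from collections import Counter, defaultdict


def majority_labels(id_layer, class_layer, nodata=0):
    """Walk both layers once and return {polygon id: most frequent class}."""
    counts = defaultdict(Counter)
    for id_row, class_row in zip(id_layer, class_layer):
        for polygon_id, label in zip(id_row, class_row):
            if polygon_id != nodata and label != nodata:
                counts[polygon_id][label] += 1
    return {pid: counter.most_common(1)[0][0] for pid, counter in counts.items()}


# Tiny hypothetical rasters on the same grid; 0 means "no polygon here".
id_layer = [
    [1, 1, 1, 2],
    [1, 1, 2, 2],
    [0, 2, 2, 2],
]
class_layer = [
    [7, 7, 9, 9],
    [7, 9, 9, 9],
    [7, 9, 9, 0],
]

print(majority_labels(id_layer, class_layer))  # {1: 7, 2: 9}
```

Each grid cell stands for a fixed area, so counting cells per (ID, class) pair approximates the shared area without any polygon intersection.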


Raster and Vector Maps

Vector maps

• List of geometric primitives (line, …)
+ arbitrarily precise
+ losslessly scalable
+ data size depends on complexity
- complex area and collision queries
- spatial organization
- numerical problems (artifacts)

Raster maps

• Information per area element / grid point
+ constant accuracy
+ area and collision queries are simple
+ systems can be optimized for image processing (MMX, SSE, CUDA etc.)
~ data size depends on area
- no lossless scalability
- fewer artifacts (anti-aliasing)
- very many resolutions needed for a "nice" display


Structure of a vector map

Structure of a raster map

Geometry in a raster map (horizontal, vertical)

Geometry in a raster map (diagonal)

Do these two lines intersect?

Pixels are filled or colored according to their area fraction.

For colors this is easy (because the human eye perceives the result correctly).

For rasters with discrete values, for example (our classes from before), this is sometimes somewhat difficult.

Solution: sub-pixeling / anti-aliasing


Raster representation in the real world

gdalinfo germany-cloudfree-simple.tiff
Driver: GTiff/GeoTIFF
Files: germany-cloudfree-simple.tiff
Size is 67235, 94080
Coordinate System is:
LOCAL_CS["WGS 84 / Pseudo-Mercator",
    GEOGCS["WGS 84",
        DATUM["unknown",
            SPHEROID["unretrievable - using WGS84",6378137,298.257223563]],
        PRIMEM["Greenwich",0],

Information on
- format
- number of pixels
- coordinate system (WKT)


Raster representation in the real world

Origin = (641594.910993173718452,7370990.488857678137720)
Pixel Size = (15.545613327413848,-15.545613327413848)
Metadata:
  AREA_OR_POINT=Area
Image Structure Metadata:
  INTERLEAVE=PIXEL
Corner Coordinates:
Upper Left  ( 641594.911, 7370990.489)
Lower Left  ( 641594.911, 5908459.187)
Upper Right ( 1686804.223, 7370990.489)
Lower Right ( 1686804.223, 5908459.187)
Center      ( 1164199.567, 6639724.838)
Band 1 Block=67235x1 Type=Byte, ColorInterp=Red
Band 2 Block=67235x1 Type=Byte, ColorInterp=Green

- origin in geo-coordinates
- pixel size
- area or point
- structure information
- geo-coordinates (computed)
- bands


Remarks

- Raster geometry in detail:

A GeoTIFF raster consists of a rectangular matrix of values per band. A geotransformation states how the world coordinates change with one step to the right (or one step down) in this matrix. This change is constant.

Element  Meaning
GT[0]    X coordinate in the CRS of the upper left corner (for pixel 0,0)
GT[1]    X offset in X direction for each pixel step to the right


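Computing world coordinates for an arbitrary pixel

- With the geotransformation, the following equations determine the world coordinates of a pixel with (integer) coordinates i, j, where j counts pixel steps to the right and i counts pixel steps down:

$$
X = GT[0] + j \cdot GT[1] + i \cdot GT[2]
$$

$$
Y = GT[3] + j \cdot GT[4] + i \cdot GT[5]
$$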
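As a plausibility check, the six-element geotransform can be applied to the gdalinfo output shown earlier. This sketch follows GDAL's usual element order; the function name pixel_to_world is hypothetical, and the rotation terms GT[2] and GT[4] are assumed to be 0 (north-up image), since gdalinfo only printed origin and pixel size.

```python
def pixel_to_world(gt, col, row):
    """Apply a six-element geotransform to pixel (col, row)."""
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y


# Origin and pixel size taken from the gdalinfo output; GT[2] and GT[4]
# are assumed to be 0 (no rotation).
gt = (641594.910993173718452, 15.545613327413848, 0.0,
      7370990.488857678137720, 0.0, -15.545613327413848)

width, height = 67235, 94080
print(pixel_to_world(gt, 0, 0))           # upper left corner
print(pixel_to_world(gt, width, height))  # lower right corner
```

The second result should agree with the "Lower Right" corner that gdalinfo reported, up to its rounding to three decimals.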


Simple raster operations

gdalinfo:
- Show raster information

gdalwarp:
- Combine partial rasters
- Change the grid / the projection (warping)

gdal_translate:
- Change the format
- Subsetting / resampling / interpolation

gdal_rasterize:
- Convert (OGR) geometry into a raster


Raster Algorithms

Rasterization

- Translating geometry objects into a raster
- Example: points (with radius)
  - Compute pixel coordinates in floating point
  - Compute the covered area fraction per pixel
  - Modify pixel (blending


Rasterization: Lines

- Simplest option: interpolate and rasterize points

$$
p(t) = p_0 + t \cdot (p_1 - p_0), \quad t \in [0, 1]
$$

- Problem: which step size? There is a trade-off between effort and the probability of actually hitting every pixel.
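The step-size trade-off shows up directly in a small experiment. The function rasterize_by_sampling below is a hypothetical sketch of the interpolation approach: it samples the segment at equally spaced parameter values and rounds each sample to a pixel.

```python
def rasterize_by_sampling(x0, y0, x1, y1, steps):
    """Sample p(t) = p0 + t * (p1 - p0) at steps+1 values of t and round to pixels."""
    pixels = []
    for k in range(steps + 1):
        t = k / steps
        pixel = (round(x0 + t * (x1 - x0)), round(y0 + t * (y1 - y0)))
        if pixel not in pixels:
            pixels.append(pixel)
    return pixels


coarse = rasterize_by_sampling(0, 0, 10, 3, steps=4)   # cheap, but leaves gaps
fine = rasterize_by_sampling(0, 0, 10, 3, steps=100)   # closes the gaps at 25x the work
print(coarse)
print(len(coarse), len(fine))
```

With 4 steps the line from (0, 0) to (10, 3) skips most columns; with 100 steps every column is hit, but most samples land on a pixel that was already set.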


Bresenham Algorithm

- Observation: after setting a pixel, the next pixel can only be a neighboring pixel.
- Restriction to one octant: lines that go down and to the right

Only lines in


Which Y step?

- General line equation

$$
\frac{y - y_0}{y_1 - y_0} = \frac{x - x_0}{x_1 - x_0}
$$

- For each integer step in X direction, one then computes

$$
y = \frac{y_1 - y_0}{x_1 - x_0} \cdot (x - x_0) + y_0
$$


Optimization

The Bresenham algorithm does not compute the true Y coordinate at this point. Instead, it tracks the integer Y coordinate and the error term (that is, the difference between the true Y value and the current integer Y value). As soon as this error becomes larger than 1, it is reduced by 1 and the Y coordinate is adjusted.

Explicitly, as pseudo-code:

error = 0, int y = y0, float deltaerr = abs((y1-y0)/(x1-x0))
for x from x0 to x1
    putpixel(x, y)
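The pseudo-code is cut off after putpixel. One way the error tracking can continue is sketched below for a single octant only (x increasing, y increasing by at most as much). Note the assumption: this sketch adjusts y once the error reaches 0.5, which selects the pixel row nearest to the true line; a threshold of 1, as in the description above, would always pick the row below the true value instead. The function name is hypothetical.

```python
def bresenham_first_octant(x0, y0, x1, y1):
    """Integer pixels for a line with x0 < x1 and 0 <= slope <= 1."""
    if not (x0 < x1 and 0 <= y1 - y0 <= x1 - x0):
        raise ValueError("only lines in the first octant are handled")
    pixels = []
    error = 0.0
    y = y0
    deltaerr = abs((y1 - y0) / (x1 - x0))  # error added per step in x
    for x in range(x0, x1 + 1):
        pixels.append((x, y))              # stands in for putpixel(x, y)
        error += deltaerr
        if error >= 0.5:                   # true line is closer to the next row
            y += 1
            error -= 1.0
    return pixels


print(bresenham_first_octant(0, 0, 5, 2))
# [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5, 2)]
```

The loop takes exactly one step in x per iteration and never evaluates the line equation again; only the running error decides when y moves.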


Example: Bresenham algorithm for lines (2)

- A case distinction that exploits symmetry is now used to map the basic algorithm onto all 8 octants.

Note: The Bresenham algorithm always has a fast direction and a slow direction. In the fast direction (X or Y) you always take a step; in the slow direction only every now and then.

Exercise: Implement Bresenham's algorithm for a line in Python. Use floating-point numbers in the range between 0 and 100 as input and, as the raster, a NumPy matrix of