Getting the Most Out of Wikidata

(1)



Semantic Technology Usage in Wikipedia’s Knowledge Graph

Stas Malyshev, Markus Krötzsch, Larry González, Julius Gonsior, Adrian Bielefeldt

Wikimedia Foundation, TU Dresden

Presentation of the paper published at the International Semantic Web Conference 2018

Download: https://iccl.inf.tu-dresden.de/web/Inproceedings3044/en

All slides CC-BY 3.0

(2)

(3)

Timeline of tropical cyclones

(4)

Number of tropical cyclones per year

(5)

(6)

The Wikidata Query Service

Wiki Website (www.wikidata.org): Relational Database (MySQL)

(7)

The Wikidata Query Service

Wiki Website (www.wikidata.org): Relational Database (MySQL)

Query Service (query.wikidata.org): Graph Database (BlazeGraph), Load balancing/caching

(8)

The Wikidata Query Service

Wiki Website (www.wikidata.org): Relational Database (MySQL)

Live Synch.: Linked data export, Change monitoring

Query Service (query.wikidata.org): Graph Database (BlazeGraph), Load balancing/caching

(9)

“Where are people born who travel to space?”

(Colour-coded by gender)

(10)

“Which 19th century paintings show the moon?”

(11)

“Which days of the week do disasters occur on?”

(12)

“The free knowledge base that anyone can edit”

Entities: 50M

Statements: 570M

Labels: 260M

Descriptions: 1.5B

Links to Wikis: 65M

(13)

“The free knowledge base that anyone can edit”

Editors: >230K

(14)

(15)

(16)

(17)

From Wikidata (rich graphs) to RDF (plain graphs)

[Erxleben et al., ISWC 2014]

Q80 label: “Tim Berners-Lee”@en, “Тим Бернерс-Ли”@ru, “提姆·柏納-李”@zh, “تيم بيرنرز لي”@ar, “טים ברנרס-לי”@he, . . .

Q4273323 label: “Queen Elizabeth Prize for Engineering”@en

(18)

From Wikidata (rich graphs) to RDF (plain graphs)

Q80 --p:P166--> wds:Q80-... --ps:P166--> Q4273323

Q80 and Q4273323: labels . . .

(19)

From Wikidata (rich graphs) to RDF (plain graphs)

Q80 --p:P166--> wds:Q80-... --ps:P166--> Q4273323

wds:Q80-... --pq:P585--> “2013”^^xsd:gYear

wds:Q80-... --pq:P1706--> Q214129, Q92743, Q3083180, Q62882

Q80 and Q4273323: labels . . .

(20)

From Wikidata (rich graphs) to RDF (plain graphs)

Q80 --wdt:P166--> Q4273323

Q80 --p:P166--> wds:Q80-... --ps:P166--> Q4273323

wds:Q80-... --pq:P585--> “2013”^^xsd:gYear

wds:Q80-... --pq:P1706--> Q214129, Q92743, Q3083180, Q62882

Q80 and Q4273323: labels . . .
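The statement encoding on this slide can be tried out on plain tuples. The following is a minimal Python sketch, not the Wikidata exporter itself. The triples are transcribed from the slide (the statement node id stays elided, as on the slide), while the helper names `qualifiers` and `direct_triples` are hypothetical. Deriving `wdt:` triples by joining `p:` and `ps:` edges is a simplification that ignores statement ranks.

```python
from collections import defaultdict

# Statement node id is elided on the slide, so it is kept as an opaque string.
STATEMENT = "wds:Q80-..."

# Triples transcribed from the slide (reified encoding of one P166 statement).
TRIPLES = [
    ("Q80", "p:P166", STATEMENT),
    (STATEMENT, "ps:P166", "Q4273323"),
    (STATEMENT, "pq:P585", '"2013"^^xsd:gYear'),
    (STATEMENT, "pq:P1706", "Q214129"),
    (STATEMENT, "pq:P1706", "Q92743"),
    (STATEMENT, "pq:P1706", "Q3083180"),
    (STATEMENT, "pq:P1706", "Q62882"),
]


def qualifiers(triples, statement):
    """Group the pq: values of one statement node by qualifier property."""
    result = defaultdict(list)
    for s, p, o in triples:
        if s == statement and p.startswith("pq:"):
            result[p].append(o)
    return dict(result)


def direct_triples(triples):
    """Derive simple wdt: triples by joining p:Pxxx and ps:Pxxx edges.

    Simplification for illustration: statement ranks are ignored.
    """
    # Map (statement node, property id) to the main value of the statement.
    values = {(s, p[3:]): o for s, p, o in triples if p.startswith("ps:")}
    derived = []
    for s, p, o in triples:
        if p.startswith("p:") and (o, p[2:]) in values:
            derived.append((s, "wdt:" + p[2:], values[(o, p[2:])]))
    return derived


if __name__ == "__main__":
    print(direct_triples(TRIPLES))
    print(qualifiers(TRIPLES, STATEMENT))
```

The reified form keeps the qualifiers attached to the statement node, while the derived `wdt:` edge gives queries a one-hop shortcut when qualifiers are not needed.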

(21)

Wikidata RDF Exports



Weekly full dumps
  • Currently 6.2 billion triples (42 GB Turtle, gzip-compressed)
  • At https://dumps.wikimedia.org/wikidatawiki/entities/

Linked Data Exports
  • Live data in many formats
  • E.g., http://www.wikidata.org/wiki/Special:EntityData/Q42.nt
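The linked data URLs follow the pattern of the example above. A small Python sketch of building such a URL is given here; the function name `entity_data_url` is hypothetical, and restricting ids to the `Q…`/`P…` shape is an assumption made only to catch typos in this illustration. Only the `nt` suffix appears on the slide; other suffixes would be passed the same way.

```python
BASE = "http://www.wikidata.org/wiki/Special:EntityData/"


def entity_data_url(entity_id: str, fmt: str = "nt") -> str:
    """Build a linked data export URL such as .../Special:EntityData/Q42.nt."""
    # Assumption for this sketch: ids look like Q42 (items) or P166 (properties).
    if entity_id[:1] not in ("Q", "P") or not entity_id[1:].isdigit():
        raise ValueError(f"unexpected entity id: {entity_id!r}")
    return f"{BASE}{entity_id}.{fmt}"


if __name__ == "__main__":
    print(entity_data_url("Q42"))
```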

(22)


Wikidata SPARQL Query Service

• Official query service since mid-2015
• User interface at https://query.wikidata.org/
• All the data (6.2B triples), live (latency < 60 s)
• No limits (well, almost):
  • 60 s timeout
  • No limit on result size (!)
  • No limit on parallel queries, but a CPU-time budget per client
• Extra SERVICEs in SPARQL (geo, Wikipedia API, labels, …)
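Programmatic access to the service could look roughly like the Python sketch below, which only builds the request URL and sends nothing. Several things here are assumptions not stated on the slide: the endpoint path `https://query.wikidata.org/sparql`, the `format=json` parameter, and the example query itself (awards of Q80 with English labels via the label SERVICE), which is illustrative and not taken from the talk. The helper names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed SPARQL endpoint of the query service (the slide names only the UI).
ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: awards received (P166) by Q80, with English labels.
QUERY = """
SELECT ?award ?awardLabel WHERE {
  wd:Q80 wdt:P166 ?award .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_request_url(query: str) -> str:
    """Encode a SPARQL query as a GET request URL asking for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query.strip(), "format": "json"})


def query_from_url(url: str) -> str:
    """Decode the SPARQL text back out of a request URL (round-trip check)."""
    return parse_qs(urlsplit(url).query)["query"][0]


if __name__ == "__main__":
    print(build_request_url(QUERY))
```

Given the 60 s server-side timeout mentioned above, a client would presumably set its own timeout slightly above that and identify itself with a descriptive user agent.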

(23)

A simple SPARQL query

(24)

A simple SPARQL query

(25)

An advanced SPARQL query

(26)

“It’s too complicated!”

(27)

“It’s not too complicated!”

• SPARQL is widely used
  • >100M requests per month (3.8M per day) in 2018
• It’s an API – most users are not in direct contact
• The community offers tutorials, workshops and support services

(28)

“It does not scale!”

(29)

“It does not scale!”

• Excellent availability and performance
  • 50% of queries answered in <40ms (95% in <440ms; 99% in <40s)
  • Less than 0.05% of queries time out
  • Service has never been down so far
• Affordable system setup:
  • Three commodity servers (+three for geo-redundancy)
  • Standard Linux load balancing + standard HTTP cache
  • All software/customisations free & open source
    – See https://github.com/wikimedia/wikidata-query-rdf

(30)

So what are those 100Ms of queries?

• We looked at 481,716,280 queries logged during 24 weeks
• Analysing SPARQL query activity is hard
  • Extreme influence of scripts and bots
  • Does not average out over time: each month looks rather different!

→ Classify major sources (bots) and isolate the “organic” part of the traffic
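To make the idea of separating the two kinds of traffic concrete, here is a deliberately crude Python sketch. It is not the classification used in the paper: the browser-like prefixes, the `known_bots` parameter, and the function names are all hypothetical, chosen only to show the shape of such a split based on user agent strings.

```python
from collections import Counter

# Hypothetical prefixes treated as "browser-like" in this illustration.
BROWSER_PREFIXES = ("Mozilla", "Opera")


def classify(user_agent: str, known_bots=frozenset()) -> str:
    """Label one request as 'robotic' or 'organic' from its user agent."""
    if user_agent in known_bots:
        return "robotic"  # explicitly identified major source
    if user_agent.startswith(BROWSER_PREFIXES):
        return "organic"  # looks like a browser, not a known bot
    return "robotic"      # scripts, libraries, and everything else


def traffic_shares(user_agents, known_bots=frozenset()):
    """Return the fraction of robotic and organic requests in a log."""
    counts = Counter(classify(ua, known_bots) for ua in user_agents)
    total = sum(counts.values())
    if total == 0:
        return {"robotic": 0.0, "organic": 0.0}
    return {label: counts[label] / total for label in ("robotic", "organic")}


if __name__ == "__main__":
    log = ["Mozilla/5.0", "python-requests/2.18", "python-requests/2.18", "curl/7.58"]
    print(traffic_shares(log))
```

A browser user agent alone does not prove human use, which is why the sketch lets known bots be listed explicitly and override the prefix check.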

(31)

Robotic and organic traffic

Robotic traffic
• Dominates (60% of queries by top-3 bots)
• Mostly data integration and data download
• More uniform, shorter

Organic traffic
• Much smaller volume (0.6% of all queries)
• Browsers, mobile apps, miscellaneous
• More diverse, longer
• Path queries are very important
• Reified statements in 4%–10% of queries
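For readers unfamiliar with path queries: a SPARQL property path such as `wdt:P279*` matches zero or more "subclass of" steps. The Python sketch below shows what such a pattern computes, using a breadth-first search over a toy graph. The class names and the function name `reachable` are made up for illustration, and this is not how the query service evaluates paths internally.

```python
from collections import deque

# Toy "subclass of" (P279) edges; the class names are invented for illustration.
SUBCLASS_OF = {
    "tabby cat": ["cat"],
    "cat": ["mammal"],
    "mammal": ["animal"],
}


def reachable(start, edges):
    """Nodes matched by `start P279* ?x`: zero or more steps along the edges."""
    seen = {start}  # zero steps: the start node itself matches
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for successor in edges.get(node, []):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


if __name__ == "__main__":
    print(sorted(reachable("cat", SUBCLASS_OF)))
```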

(32)

See for yourself!

• We have released complete, timestamped query logs
  • Anonymised to avoid user identification
  • With limited user agent information
  • Full dataset, no sample!
  • Currently 12 weeks in 2017 – more to come soon

https://kbs.inf.tu-dresden.de/WikidataSPARQL

(33)

Conclusions



Semantic web technology – it works!
• Interactive analytics and querying are affordable for dynamic knowledge graphs with >10^9 edges
• Usable for large, open communities without prior RDF/SPARQL experience
• We want more applications & more research!

(34)

Thanks

• Denny Vrandečić, Lydia Pintscher, and the whole Wikimedia Deutschland e.V. team in Berlin who made Wikidata possible
• Brad Bebee, Bryan Thompson, and all of the BlazeGraph team
• Anyone contributing to RDF and SPARQL libraries
• All who contributed to W3C standards used here, esp. SPARQL
• The Wikidata community
• TimBL ;-)

(35)

Films with future heads of government

(36)

Literature

• Stanislav Malyshev, Markus Krötzsch, Larry González, Julius Gonsior, Adrian Bielefeldt: “Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph”. In Denny Vrandečić et al., eds., Proceedings of the 17th International Semantic Web Conference (ISWC'18)
• Adrian Bielefeldt, Julius Gonsior, Markus Krötzsch: “Practical Linked Data Access via SPARQL: The Case of Wikidata”. Proceedings of the WWW2018 Workshop on Linked Data on the Web (LDOW-18), CEUR Workshop
• Fredo Erxleben, Michael Günther, Markus Krötzsch, Julian Mendez, Denny Vrandečić: “Introducing Wikidata to the Linked Data Web”. In Proceedings of the 13th International Semantic Web Conference (ISWC 2014)

(37)

SPARQL Feature Distribution (2017/2018)

(38)


Triples per query: organic (blue) / robotic (yellow)

(39)


Languages of labels in organic queries

(40)
