Getting the Most Out of Wikidata
Semantic Technology Usage in Wikipedia’s Knowledge Graph
Stas Malyshev, Markus Krötzsch, Larry González, Julius Gonsior, Adrian Bielefeldt
Wikimedia Foundation TU Dresden
Presentation of a paper published at the International Semantic Web Conference 2018
Download: https://iccl.inf.tu-dresden.de/web/Inproceedings3044/en
All slides CC-BY 3.0
[Figures: timeline of tropical cyclones; number of tropical cyclones per year]
The Wikidata Query Service
Wiki website (www.wikidata.org): relational database (MySQL)
Query service (query.wikidata.org): graph database (BlazeGraph), load balancing/caching
Live synchronisation from the wiki website to the query service: linked data export, change monitoring
“Where are people born who travel to space?”
(Colour-coded by gender)
“Which 19th-century paintings show the moon?”
“Which days of the week do disasters occur on?”
“The free knowledge base that anyone can edit”
Entities: 50M
Statements: 570M
Labels: 260M
Descriptions: 1.5B
Links to Wikis: 65M
Editors: >230K
From Wikidata (rich graphs) to RDF (plain graphs)
[Erxleben et al., ISWC 2014]
[Figure: RDF graph for one statement of Q80, with these edges]
Q80 wdt:P166 Q4273323
Q80 p:P166 wds:Q80-...
wds:Q80-... ps:P166 Q4273323
wds:Q80-... pq:P585 “2013”^^xsd:gYear
wds:Q80-... pq:P1706 Q214129, Q92743, Q3083180, Q62882
Q80 label “Tim Berners-Lee”@en, “Тим Бернерс-Ли”@ru, “提姆·柏納-李”@zh, “تيم بيرنرز لي”@ar, “טים ברנרס-לי”@he, . . .
Q4273323 label “Queen Elizabeth Prize for Engineering”@en
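The mapping in the figure can be sketched in a few lines of Python. This is only an illustration of the pattern shown on the slide (one direct wdt: triple plus a reified statement node carrying ps: and pq: edges), not the exporter's actual code. The `Statement` class, the `to_triples` function, and the statement ID `Q80-EXAMPLE` are hypothetical; the slide elides the real ID. Ranks, references, and labels are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """A simplified Wikidata statement (hypothetical helper type)."""
    subject: str          # e.g. "Q80"
    prop: str             # e.g. "P166"
    value: str            # RDF term, already prefixed or quoted
    statement_id: str     # placeholder; real IDs are elided on the slide
    qualifiers: list = field(default_factory=list)  # (property, RDF term) pairs


def to_triples(st: Statement) -> list:
    """Return the direct triple plus the reified statement triples."""
    node = f"wds:{st.statement_id}"
    triples = [
        # Direct ("plain") edge from subject to value.
        (f"wd:{st.subject}", f"wdt:{st.prop}", st.value),
        # Reified form: subject -> statement node -> value.
        (f"wd:{st.subject}", f"p:{st.prop}", node),
        (node, f"ps:{st.prop}", st.value),
    ]
    # Qualifiers hang off the statement node.
    for qprop, qvalue in st.qualifiers:
        triples.append((node, f"pq:{qprop}", qvalue))
    return triples


if __name__ == "__main__":
    award = Statement(
        subject="Q80",
        prop="P166",
        value="wd:Q4273323",
        statement_id="Q80-EXAMPLE",  # hypothetical ID
        qualifiers=[("P585", '"2013"^^xsd:gYear'), ("P1706", "wd:Q214129")],
    )
    for s, p, o in to_triples(award):
        print(s, p, o, ".")
```

The design point is that simple queries can use the wdt: edge alone, while queries that need qualifiers go through the statement node.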
Wikidata RDF Exports
Weekly full dumps
Currently 6.2 billion triples (42 GB as gzip-compressed Turtle)
At https://dumps.wikimedia.org/wikidatawiki/entities/
Linked Data Exports
Live data in many formats
E.g., http://www.wikidata.org/wiki/Special:EntityData/Q42.nt
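As a rough sketch of how the linked data export could be used from a script, the following builds the Special:EntityData URL from the slide and fetches it with the Python standard library. The function names, the User-Agent string, and the list of format suffixes other than `.nt` are assumptions for illustration; only the `.nt` URL pattern is taken from the slide.

```python
import re
import urllib.request

BASE_URL = "http://www.wikidata.org/wiki/Special:EntityData"

# Assumed set of format suffixes; only "nt" appears on the slide.
KNOWN_FORMATS = ("nt", "ttl", "rdf", "json")


def entity_data_url(entity_id: str, fmt: str = "nt") -> str:
    """Build the linked data export URL for an entity such as 'Q42'."""
    if not re.fullmatch(r"[QP][1-9]\d*", entity_id):
        raise ValueError(f"not an item or property ID: {entity_id!r}")
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    return f"{BASE_URL}/{entity_id}.{fmt}"


def fetch_entity(entity_id: str, fmt: str = "nt", timeout: float = 30.0) -> str:
    """Download the export as text (performs a network request)."""
    request = urllib.request.Request(
        entity_data_url(entity_id, fmt),
        # Hypothetical identification string; replace with your own.
        headers={"User-Agent": "example-slides-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(entity_data_url("Q42"))
```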
Wikidata SPARQL Query Service
Official query service since mid 2015
User interface at https://query.wikidata.org/
All the data (6.2B triples), live (latency<60s)
No limits (well, almost):
60sec timeout
No limit on result size (!)
No limit on parallel queries, but CPU-time budget per client
Extra SERVICEs in SPARQL (geo, Wikipedia API, labels, …)
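A minimal sketch of calling the query service as an API, assuming the standard SPARQL protocol endpoint at https://query.wikidata.org/sparql and JSON results. The example query, the function names, and the User-Agent string are illustrative assumptions and do not come from the slides; the query reuses Q80 and P166 from the RDF example and the label SERVICE mentioned above. The client-side timeout is set to match the 60-second server limit.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: awards (P166) received by Q80, with English labels.
EXAMPLE_QUERY = """
SELECT ?award ?awardLabel WHERE {
  wd:Q80 wdt:P166 ?award .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""


def build_request(query: str,
                  user_agent: str = "example-slides-sketch/0.1") -> urllib.request.Request:
    """Build a GET request that asks for SPARQL JSON results."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": user_agent,  # hypothetical identification string
        },
    )


def run_query(query: str, timeout: float = 60.0) -> list:
    """Send the query and return the result bindings (network request)."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as response:
        data = json.load(response)
    return data["results"]["bindings"]


if __name__ == "__main__":
    for row in run_query(EXAMPLE_QUERY):
        print(row["award"]["value"], row["awardLabel"]["value"])
```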
A simple SPARQL query
An advanced SPARQL query
“It’s too complicated!”
“It’s not too complicated!”
SPARQL is widely used
>100M requests per month (3.8M per day) in 2018
It’s an API – most users are not in direct contact
The community offers tutorials, workshops and support services
“It does not scale!”
Excellent availability and performance
50% of queries answered in <40ms (95% in <440ms; 99% in <40s)
Less than 0.05% of queries time out
Service has never been down so far
Affordable system setup:
Three commodity servers (+three for geo-redundancy)
Standard Linux load balancing + standard HTTP cache
All software/customisations free & open source
– See https://github.com/wikimedia/wikidata-query-rdf
So what are those 100Ms of queries?
We looked at 481,716,280 queries logged during 24 weeks
Analysing SPARQL query activity is hard
Extreme influence of scripts and bots
Does not average out over time: each month looks rather different!
→ Classify major sources (bots) and isolate “organic” part of the traffic
Robotic and organic traffic
Robotic traffic
Dominates (60% of queries by top-3 bots)
Mostly data integration and data download
More uniform, shorter
Organic traffic
Much smaller volume (0.6% of all queries)
Browsers, mobile apps, miscellaneous
More diverse, longer
Path queries are very important
Reified statements in 4%–10% of queries
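To make the two feature counts concrete, here is a crude sketch of how one might flag reified statements and property paths in logged query strings. It uses regular expressions rather than a SPARQL parser, so it is only an approximation and should not be read as the method used in the paper. The function names and the sample queries are hypothetical.

```python
import re

# p:, ps: or pq: followed by a property ID, not preceded by another
# name character (so that "wdt:P31" does not count).
REIFIED = re.compile(r"(?<![A-Za-z0-9_])(?:p|ps|pq):P\d+")

# A property immediately followed by a path operator (*, +, / or |).
# The "?" operator is ignored to avoid confusion with variables.
PATH = re.compile(r"(?:wdt|p|ps|pq):P\d+\s*[*+/|]")


def uses_reified_statements(query: str) -> bool:
    """True if the query text mentions a p:, ps: or pq: property."""
    return REIFIED.search(query) is not None


def uses_property_path(query: str) -> bool:
    """True if a property is directly followed by a path operator."""
    return PATH.search(query) is not None


def share(queries: list, predicate) -> float:
    """Fraction of queries for which the predicate holds."""
    if not queries:
        return 0.0
    return sum(1 for q in queries if predicate(q)) / len(queries)


if __name__ == "__main__":
    sample = [  # hypothetical log entries
        "SELECT ?a WHERE { wd:Q80 wdt:P166 ?a }",
        "SELECT ?t WHERE { wd:Q80 p:P166 ?s . ?s pq:P585 ?t }",
        "SELECT ?x WHERE { ?x wdt:P31/wdt:P279* wd:Q5 }",
        "SELECT ?x WHERE { ?x wdt:P31 wd:Q5 }",
    ]
    print("reified:", share(sample, uses_reified_statements))
    print("paths:", share(sample, uses_property_path))
```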
See for yourself!
We have released complete, timestamped query logs
Anonymised to avoid user identification
With limited user agent information
Full dataset, no sample!
Currently 12 weeks in 2017 – more to come soon
https://kbs.inf.tu-dresden.de/WikidataSPARQL
Conclusions
Semantic web technology – it works!
Interactive analytics and querying are affordable for dynamic knowledge graphs with >10^9 edges
Usable for large, open communities without prior RDF/SPARQL experience
We want more applications & more research!
Thanks
Denny Vrandecic, Lydia Pintscher, and the whole Wikimedia Deutschland e.V. team in Berlin who made Wikidata possible
Brad Bebee, Bryan Thompson, and all of the BlazeGraph team
Anyone contributing to RDF and SPARQL libraries
All who contributed to W3C standards used here, esp. SPARQL
The Wikidata community
TimBL ;-)
Films with future heads of government
Literature
Stanislav Malyshev, Markus Krötzsch, Larry González, Julius Gonsior, Adrian Bielefeldt: “Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph”. In Denny Vrandečić et al., eds., Proceedings of the 17th International Semantic Web Conference (ISWC'18)
Adrian Bielefeldt, Julius Gonsior, Markus Krötzsch: “Practical Linked Data Access via SPARQL: The Case of Wikidata”. Proceedings of the WWW2018 Workshop on Linked Data on the Web (LDOW-18), CEUR Workshop
Fredo Erxleben, Michael Günther, Markus Krötzsch, Julian Mendez, Denny Vrandečić: “Introducing Wikidata to the Linked Data Web”. In Proceedings of the 13th International Semantic Web Conference (ISWC 2014)
SPARQL Feature Distribution (2017/2018)
Triples per query: organic (blue)/robotic (yellow)
Languages of labels in organic queries