(1)

Markus Krötzsch: Wikidata Toolkit Kickoff

Getting the Most Out of Wikidata

Semantic Technology Usage in Wikipedia’s Knowledge Graph

Stas Malyshev, Markus Krötzsch, Larry González, Julius Gonsior, Adrian Bielefeldt

Wikimedia Foundation TU Dresden

Presentation of the paper published at the International Semantic Web Conference 2018

Download: https://iccl.inf.tu-dresden.de/web/Inproceedings3044/en

All slides CC-BY 3.0

(2)
(3)

Timeline of tropical cyclones

(4)

Number of tropical cyclones per year

(5)
(6)

The Wikidata Query Service

www.wikidata.org: Wiki Website on a Relational Database (MySQL)

(7)

The Wikidata Query Service

www.wikidata.org: Wiki Website on a Relational Database (MySQL)

query.wikidata.org: Query Service on a Graph Database (BlazeGraph), behind load balancing/caching

(8)

The Wikidata Query Service

www.wikidata.org: Wiki Website on a Relational Database (MySQL)

query.wikidata.org: Query Service on a Graph Database (BlazeGraph), behind load balancing/caching

Linked data export, change monitoring, and live synchronisation keep the two stores in step

(9)

“Where are people born who travel to space?”

(Colour-coded by gender)

(10)

“Which 19th century paintings show the moon?”

(11)

“Which days of the week do disasters occur on?”

(12)

“The free knowledge base that anyone can edit”

Entities: 50M

Statements: 570M

Labels: 260M

Descriptions: 1.5B

Links to Wikis: 65M

(13)

“The free knowledge base that anyone can edit”

Editors: >230K

(14)
(15)
(16)
(17)

From Wikidata (rich graphs) to RDF (plain graphs) [Erxleben et al., ISWC 2014]

[Diagram: item Q80 with multilingual labels — “Tim Berners-Lee”@en, “Тим Бернерс-Ли”@ru, “提姆·柏納-李”@zh, “تيم بيرنرز لي”@ar, “טים ברנרס-לי”@he, … — and item Q4273323 with label “Queen Elizabeth Prize for Engineering”@en]

(18)

From Wikidata (rich graphs) to RDF (plain graphs) [Erxleben et al., ISWC 2014]

[Diagram: item Q80 with multilingual labels — “Tim Berners-Lee”@en, “Тим Бернерс-Ли”@ru, “提姆·柏納-李”@zh, “تيم بيرنرز لي”@ar, “טים ברנרס-לי”@he, … — and item Q4273323 (“Queen Elizabeth Prize for Engineering”@en); a statement node wds:Q80-... connects the two items via p:P166 and ps:P166]

(19)

From Wikidata (rich graphs) to RDF (plain graphs) [Erxleben et al., ISWC 2014]

[Diagram: item Q80 with multilingual labels — “Tim Berners-Lee”@en, “Тим Бернерс-Ли”@ru, “提姆·柏納-李”@zh, “تيم بيرنرز لي”@ar, “טים ברנרס-לי”@he, … — and item Q4273323 (“Queen Elizabeth Prize for Engineering”@en); the statement node wds:Q80-... connects the two items via p:P166 and ps:P166 and carries the qualifiers pq:P585 “2013”^^xsd:gYear and pq:P1706 pointing to Q214129, Q92743, Q3083180, and Q62882]

(20)

From Wikidata (rich graphs) to RDF (plain graphs) [Erxleben et al., ISWC 2014]

[Diagram: item Q80 with multilingual labels — “Tim Berners-Lee”@en, “Тим Бернерс-Ли”@ru, “提姆·柏納-李”@zh, “تيم بيرنرز لي”@ar, “טים ברנרס-לי”@he, … — and item Q4273323 (“Queen Elizabeth Prize for Engineering”@en); a direct edge wdt:P166 connects Q80 to Q4273323, while the statement node wds:Q80-... connects the same items via p:P166 and ps:P166 and carries the qualifiers pq:P585 “2013”^^xsd:gYear and pq:P1706 pointing to Q214129, Q92743, Q3083180, and Q62882]
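The reification pattern in the diagram above can be sketched as plain triples. A minimal sketch in Python tuples, assuming the export scheme of Erxleben et al. (ISWC 2014); `wds:Q80-STMT` is a placeholder for the real statement-node IRI, which the slides abbreviate as `wds:Q80-...`:

```python
# Triples for one Wikidata statement about Q80 (Tim Berners-Lee).
# "wds:Q80-STMT" is a placeholder for the abbreviated statement-node IRI.
triples = [
    # direct ("truthy") edge for property P166 (award received)
    ("wd:Q80", "wdt:P166", "wd:Q4273323"),
    # full statement, reified through a statement node
    ("wd:Q80", "p:P166", "wds:Q80-STMT"),
    ("wds:Q80-STMT", "ps:P166", "wd:Q4273323"),
    # qualifiers on the statement: P585 (point in time), P1706 (together with)
    ("wds:Q80-STMT", "pq:P585", '"2013"^^xsd:gYear'),
    ("wds:Q80-STMT", "pq:P1706", "wd:Q214129"),
    ("wds:Q80-STMT", "pq:P1706", "wd:Q92743"),
    ("wds:Q80-STMT", "pq:P1706", "wd:Q3083180"),
    ("wds:Q80-STMT", "pq:P1706", "wd:Q62882"),
]

# Simple query patterns use the wdt: edges only; qualifier access
# goes through the statement node.
direct = [t for t in triples if t[1].startswith("wdt:")]
```

This split is why most queries stay simple (one `wdt:` edge per statement) while qualifier-aware queries remain possible through the `p:`/`ps:`/`pq:` triples.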

(21)

Wikidata RDF Exports

Weekly full dumps

Currently 6.2 billion triples (42 GB as gzip-compressed Turtle)

At https://dumps.wikimedia.org/wikidatawiki/entities/

Linked Data Exports

Live data in many formats

E.g., http://www.wikidata.org/wiki/Special:EntityData/Q42.nt
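The live linked-data export URLs follow a simple pattern. A minimal sketch (Python, stdlib only) of building them for a given entity ID and serialisation format; `entity_data_url` is a hypothetical helper name, not part of any Wikidata API:

```python
def entity_data_url(entity_id: str, fmt: str = "nt") -> str:
    """Build a Wikidata Special:EntityData URL for one entity.

    The file extension selects the serialisation; the slide's example,
    Q42.nt, returns N-Triples. Other extensions (e.g. ttl, json) are
    served analogously.
    """
    return f"http://www.wikidata.org/wiki/Special:EntityData/{entity_id}.{fmt}"

# Example from the slide: live N-Triples export for Q42
print(entity_data_url("Q42"))
# http://www.wikidata.org/wiki/Special:EntityData/Q42.nt
```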

(22)


Wikidata SPARQL Query Service

Official query service since mid 2015

User interface at https://query.wikidata.org/

All the data (6.2B triples), live (latency<60s)

No limits (well, almost):

60sec timeout

No limit on result size (!)

No limit on parallel queries, but CPU-time budget per client

Extra SERVICEs in SPARQL (geo, Wikipedia API, labels, …)

(23)

A simple SPARQL query

(24)

A simple SPARQL query
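The slide's query image is not preserved in this transcript. As a hedged illustration only, here is a typical simple query of this kind (all items that are instances of house cat, Q146) together with how a client would address the public endpoint; `sparql_get_url` is a hypothetical helper, and the query itself is an assumption, not the one shown on the slide:

```python
import urllib.parse

ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed example query: items with "instance of" (P31) house cat (Q146)
QUERY = "SELECT ?cat WHERE { ?cat wdt:P31 wd:Q146 . }"

def sparql_get_url(query: str) -> str:
    """Build a GET request URL for the endpoint, asking for JSON results."""
    return ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )

url = sparql_get_url(QUERY)
```

Sending this URL via plain HTTP GET is all a client needs, which is one reason the service is usable as a back-end API without SPARQL knowledge on the user's side.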

(25)

An advanced SPARQL query

(26)

“It’s too complicated!”

(27)

“It’s not too complicated!”

SPARQL is widely used

>100M requests per month (3.8M per day) in 2018

It’s an API – most users are not in direct contact

The community offers tutorials, workshops and support services

(28)

“It does not scale!”

(29)

“It does not scale!”

Excellent availability and performance

50% of queries answered in <40ms (95% in <440ms; 99% in <40s)

Less than 0.05% of queries time out

Service has never been down so far

Affordable system setup:

Three commodity servers (+three for geo-redundancy)

Standard Linux load balancing + standard HTTP cache

All software/customisations free & open source

See https://github.com/wikimedia/wikidata-query-rdf

(30)

So what are those 100Ms of queries?

We looked at 481,716,280 queries logged during 24 weeks

Analysing SPARQL query activity is hard

Extreme influence of scripts and bots

Does not average out over time, each month looks rather different!

Classify major sources (bots) and isolate the “organic” part of the traffic

(31)

Robotic and organic traffic

Robotic traffic

Dominates (60% of queries by top-3 bots)

Mostly data integration and data download

More uniform, shorter

Organic traffic

Much smaller volume (0.6% of all queries)

Browsers, mobile apps, miscellaneous

More diverse, longer

Path queries are very important

Reified statements in 4%–10% of queries

(32)

See for yourself!

We have released complete, timestamped query logs

Anonymised to avoid user identification

With limited user agent information

Full dataset, no sample!

Currently 12 weeks in 2017 – more to come soon

https://kbs.inf.tu-dresden.de/WikidataSPARQL

(33)

Conclusions

Semantic web technology – it works!

Interactive analytics and query is affordable for dynamic knowledge graphs with >10^9 edges

Usable for large, open communities without prior RDF/SPARQL experience

We want more applications & more research!

(34)

Thanks

Denny Vrandečić, Lydia Pintscher, and the whole Wikimedia Deutschland e.V. team in Berlin who made Wikidata possible

Brad Bebee, Bryan Thompson, and all of the BlazeGraph team

Anyone contributing to RDF and SPARQL libraries

All who contributed to W3C standards used here, esp. SPARQL

The Wikidata community

TimBL ;-)

(35)

Films with future heads of government

(36)

Literature

Stanislav Malyshev, Markus Krötzsch, Larry González, Julius Gonsior, Adrian Bielefeldt: “Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph”. In Denny Vrandečić et al., eds., Proceedings of the 17th International Semantic Web Conference (ISWC 2018)

Adrian Bielefeldt, Julius Gonsior, Markus Krötzsch: “Practical Linked Data Access via SPARQL: The Case of Wikidata”. In Proceedings of the WWW2018 Workshop on Linked Data on the Web (LDOW-18), CEUR Workshop Proceedings

Fredo Erxleben, Michael Günther, Markus Krötzsch, Julian Mendez, Denny Vrandečić: “Introducing Wikidata to the Linked Data Web”. In Proceedings of the 13th International Semantic Web Conference (ISWC 2014)

(37)

SPARQL Feature Distribution (2017/2018)

(38)


Triples per query: organic (blue) / robotic (yellow)

(39)


Languages of labels in organic queries

(40)


SPARQL feature co-occurrence
