
APPROACHES FOR ENRICHING AND IMPROVING TEXTUAL KNOWLEDGE BASES

Von der Fakultät für Elektrotechnik und Informatik der Gottfried Wilhelm Leibniz Universität Hannover

zur Erlangung des akademischen Grades

DOKTOR DER NATURWISSENSCHAFTEN Dr. rer. nat.

genehmigte Dissertation von

M. Sc. Besnik Fetahu

geboren am 21. September 1986, in Prishtina, Kosovo

Hannover, Deutschland, 2017


Korreferent: Prof. Dr. Maarten de Rijke
Korreferent: Prof. Dr.-Ing. Bodo Rosenhahn
Tag der Promotion: 08.11.2017


ABSTRACT

Verifiability is one of the core editing principles in Wikipedia, where editors are encouraged to provide citations for the added statements. Statements can be any arbitrary piece of text, ranging from a sentence up to a paragraph. However, in many cases, citations are either outdated, missing, or link to non-existing references (e.g. dead URLs, moved content, etc.). In 20% of the cases, such citations refer to news articles, which represent the second most cited source. Even in cases where citations are provided, there are no explicit indicators for the span of a citation for a given piece of text. In addition to issues related to the verifiability principle, many Wikipedia entity pages are incomplete, with relevant information that is already available in online news sources missing. Even for the already existing citations, there is often a delay between the news publication time and the reference time.

In this thesis, we address the aforementioned issues and propose automated approaches that enforce the verifiability principle in Wikipedia, and suggest relevant and missing news references for further enriching Wikipedia entity pages. To this end, we make the following contributions as part of this thesis:

Citation recommendation – we address the problem of finding and updating news citations for statements in Wikipedia entity pages. We propose a two-stage approach for this problem. First, we classify whether each statement requires a news citation or a citation from another category (e.g. web, book, journal, etc.). Second, for statements that require a news citation, we formalize three properties of what makes a good citation, namely: (i) the citation should entail the Wikipedia statement, (ii) the statement should be central to the citation, and (iii) the citation should be from an authoritative source. We combine standard information retrieval techniques, where we use the statement to query a news collection, and build classification models based on the three properties to determine the most appropriate citation.

Citation span – for the already existing citations in Wikipedia entity pages and the ones we recommend in our first problem, we propose an automated approach which determines the span of such citations. We approach this problem by classifying which textual fragments in a paragraph are covered or hold true given a citation. We propose a sequence classification approach where, for a paragraph and a citation, we determine the citation span at a fine-grained level.

News suggestion – to account for the ever evolving nature of Wikipedia entities, with relevant information published on a daily basis in news articles, we propose a two-stage supervised approach for this problem. First, we suggest news articles to Wikipedia entities (article-entity placement) relying on a rich set of features which take into account the salience and relative authority of entities, and the novelty of news articles to entity pages. Second, we determine the exact section in the entity page for the input article (article-section placement) guided by class-based section templates.

We perform extensive evaluation with real-world datasets, on news collections with more than 20 million news articles, and on the entire set of English Wikipedia entity pages. Our approaches perform with high accuracy on the three problems we address and show superior performance when compared to existing baselines and state-of-the-art approaches.

Keywords: citation recommendation, citation span, news suggestion, Wikipedia enrichment


ZUSAMMENFASSUNG

Nachweisbarkeit ist eines der zentralen Editierungs-Prinzipien in Wikipedia. Editoren werden dazu angehalten, ihre hinzugefügten Aussagen mittels Zitierungen zu belegen. Aussagen können dabei beliebige Textstücke sein, von Sätzen bis hin zu einem Absatz. In vielen Fällen sind Zitierungen jedoch veraltet, fehlen oder verweisen auf nicht existierende Referenzen (z.B. tote URLs oder verschobene Inhalte). In 20% der Fälle verweisen Zitierungen auf News-Artikel, die zweithäufigste Art der Zitierung in Wikipedia. Selbst in den Fällen, in denen Zitierungen vorhanden sind, fehlen Angaben über die Spanne des Textes, die durch die Zitierung abgedeckt wird. Unabhängig von Problemen in Zusammenhang mit dem Nachweisbarkeits-Prinzip sind viele Wikipedia-Artikel unvollständig, wobei oft relevante Information fehlt, die bereits in Online News-Quellen verfügbar ist. Auch für die bereits hinzugefügten Zitierungen existiert oft eine Verzögerung zwischen der Veröffentlichungszeit des News-Artikels und der Zeit, zu der der Artikel in Wikipedia referenziert wurde.

In dieser Arbeit beschäftigen wir uns mit den aufgezeigten Problemen und stellen automatisierte Ansätze vor, die das Nachweisbarkeits-Prinzip in Wikipedia durchsetzen und relevante und fehlende News-Referenzen vorschlagen mit dem Ziel, die Qualität von Wikipedia-Artikeln zu erhöhen. Diese Arbeit enthält die folgenden Beiträge:

Zitierungsempfehlungen – Wir beschäftigen uns mit dem Problem des Findens und Erneuerns von News-Zitierungen für Aussagen in Wikipedia-Artikeln. Für dieses Problem stellen wir einen zweiteiligen Ansatz vor. Zunächst bestimmen wir mittels Klassifizierung, ob eine Aussage eine News-Zitierung oder eine Zitierung aus einer anderen Kategorie (z.B. Web, Bücher, Zeitschriften) benötigt. Im zweiten Schritt bestimmen wir drei Eigenschaften, die eine gute Zitierung ausmachen: (i) die Zitierung sollte die Wikipedia-Aussage enthalten, (ii) die Aussage sollte eine zentrale Rolle im zitierten Artikel einnehmen und (iii) der zitierte Artikel sollte aus einer verlässlichen Quelle stammen. Wir kombinieren Standardtechniken des Information Retrieval und verwenden die gegebene Aussage, um eine passende News-Sammlung zusammenzustellen. Weiterhin entwickeln wir Klassifizierungsmodelle basierend auf den drei genannten Eigenschaften, um die passendste Zitierung zu ermitteln.

Zitierungsspanne – Aus den bereits existierenden Zitierungen in Wikipedia-Artikeln und den Zitierungen, die wir in unserem ersten Problem vorschlagen, entwickeln wir einen automatisierten Ansatz zur Bestimmung der Spanne einer Zitierung. Dazu klassifizieren wir, welche Textbausteine eines Absatzes durch eine Zitierung abgedeckt werden, bzw. als wahr angesehen werden können. Wir stellen einen Sequenzklassifizierungsansatz vor, der die Spanne einer Zitierung bezogen auf einen Absatz im Detail bestimmen kann.

News-Vorschläge – Unter Berücksichtigung der ständigen Veränderung von Wikipedia-Artikeln sowie der Tatsache, dass täglich neue relevante Informationen in Online News-Artikeln veröffentlicht werden, stellen wir einen zweiteiligen überwachten Ansatz vor. Zunächst schlagen wir News-Artikel für Wikipedia-Artikel vor (article-entity placement), basierend auf einer ergiebigen Menge an Features, die die relative Bedeutung eines Wikipedia-Artikels und die Neuheit von News-Artikeln berücksichtigen. Im zweiten Schritt bestimmen wir die Sektion des Wikipedia-Artikels, für die der News-Artikel vorgeschlagen werden soll (article-section placement).


Wir führen umfassende Evaluationen mit Real-World-Datensätzen durch, auf News-Sammlungen mit mehr als 20 Millionen News-Artikeln und der gesamten Menge englischer Wikipedia-Artikel. Unsere Ansätze erzielen hohe Genauigkeit und übertreffen die Leistung von existierenden Baselines und State-of-the-Art-Ansätzen.

Schlagwörter: Zitierungsempfehlungen, Zitierungsspanne, News-Vorschläge, Wikipedia-Anreicherung

ACKNOWLEDGEMENTS

I would like to acknowledge and thank everyone who has supported and helped me grow as a researcher and as a person throughout the course of my doctoral studies. First and foremost, I would like to thank my advisor Prof. Dr. techn. Wolfgang Nejdl. He provided the perfect environment and guidance throughout these years, helping me develop as a researcher and successfully conduct the work published in this thesis. He was and still remains a role model and inspiration for my future career and path.

Special thanks to Prof. Avishek Anand and Prof. Katja Markert for their close collaboration and the countless discussions, which helped me learn and develop as a researcher. I am also very grateful to Dr. Stefan Dietze for his guidance, for introducing me to many exciting topics and projects, and for providing helpful feedback and discussions on the many research ideas we developed together.

I thank all the colleagues at L3S, many of whom became close friends over the course of our PhD studies. With Ujwal, Ricardo, Kaweh, Patrick, Bernardo, and many others, we shared lots of moments, countless discussions, and sleepless nights chasing deadlines, and learned a lot from each other. Thank you.

I would like to thank a good old friend, Flakron, who was always there to celebrate and discuss all the successes and setbacks throughout this PhD, and who was a great help with his suggestions about all the new Web technologies.

I thank Mary, for her unconditional support and for being there for me during the most important part of my PhD. She was the safe haven and the escape from the hectic pace of countless experiments and late working hours that came along with the PhD. Thank you; I cherish your caring, help, and understanding through this phase of my life.

I would like to thank my family for being there for me, for their support, love, caring, and understanding. Over the course of my PhD they, and especially my mother, came to learn the different paper submission time zones and conference acronyms. This was all possible because of you, and I dedicate this to you all.


FOREWORD

Throughout the course of my Ph.D. studies, and as part of the work in this thesis, I have published and co-authored several conference and journal papers in the fields of Text Mining, Natural Language Processing, Semantic Web, Information Retrieval, and Human-Computer Interaction.

The core contributions of this thesis in the individual chapters are published in the following venues:

1. The contributions in Chapter 5, which deal with the problem of finding news citations for statements in Wikipedia articles, are published in:

• [FMNA16] Besnik Fetahu, Katja Markert, Wolfgang Nejdl, Avishek Anand: Finding News Citations for Wikipedia. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, pages 337–346.

2. The contributions in Chapter 6, which deal with the problem of determining the citation span for references in Wikipedia, are published in:

• Besnik Fetahu, Katja Markert, Avishek Anand: Fine Grained Citation Span for References in Wikipedia. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 7-11, 2017 (to appear).

3. The contributions in Chapter 7, which deal with the problem of automatically suggesting news articles to Wikipedia, are published in:

• [FMA15] Besnik Fetahu, Katja Markert, Avishek Anand: Automated News Suggestions for Populating Wikipedia Entity Pages. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM 2015, Melbourne, Australia, October 19 - 23, 2015, pages 323–332.

4. Finally, in Chapter 3 we present a study of the lag between Wikipedia and news, and in Chapter 8 we show an application use case of entity search in structured datasets, which are published in:


• [FAA15] Besnik Fetahu, Abhijit Anand, and Avishek Anand. How Much is Wikipedia Lagging Behind News? In Proceedings of the ACM Web Science Conference, WebSci 2015, Oxford, United Kingdom, June 28 - July 1, 2015, pages 28:1–28:9.

• [FGD15] Besnik Fetahu, Ujwal Gadiraju, and Stefan Dietze. Improving Entity Retrieval on Structured Data. In The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part I, pages 474–491.

The complete list of publications during my PhD is shown below:

1. Besnik Fetahu, Katja Markert, Avishek Anand: Fine Grained Citation Span for References in Wikipedia. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 7-11, 2017 (to appear).

2. Ran Yu, Ujwal Gadiraju, Besnik Fetahu, and Stefan Dietze: FuseM: Query-Centric Data Fusion on Structured Web Markup. In Proceedings of the 33rd IEEE International Conference on Data Engineering, ICDE, San Diego, CA, USA, April 19-22, 2017, pages 179–182.

3. Ujwal Gadiraju, Besnik Fetahu, Ricardo Kawase, Patrick Siehndel, and Stefan Dietze: Is the Crowd Smarter Than a 5th Grader? Using Worker Self-Assessments for Competence-based Pre-Selection. In ACM Transactions on Computer-Human Interaction, 2017 (to appear).

4. Besnik Fetahu, Katja Markert, Wolfgang Nejdl, Avishek Anand: Finding News Citations for Wikipedia. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, pages 337–346.

5. Ran Yu, Ujwal Gadiraju, Xiaofei Zhu, Besnik Fetahu, Stefan Dietze: Towards Entity Summarisation on Structured Web Markup. In Proceedings of the Semantic Web - ESWC 2016 Satellite Events, Heraklion, Crete, Greece, May 29 - June 2, 2016, Revised Selected Papers, pages 69–73.

6. Ran Yu, Besnik Fetahu, Ujwal Gadiraju, Stefan Dietze: A Survey on Challenges in Web Markup Data for Entity Retrieval. In Proceedings of the ISWC 2016 Posters & Demonstrations Track co-located with 15th International Semantic Web Conference (ISWC 2016), Kobe, Japan, October 19, 2016.


7. Davide Taibi, Giovanni Fulantelli, Stefan Dietze, Besnik Fetahu: Educational Linked Data on the Web - Exploring and Analysing the Scope and Coverage. In Open Data for Education - Linked, Shared, and Reusable Data for Teaching and Learning, pages 16–37.

8. Jakob Beetz, Ina Blümel, Stefan Dietze, Besnik Fetahu, Ujwal Gadiraju, Martin Hecher, Thomas Krijnen, Michelle Lindlar, Martin Tamke, Raoul Wessel, Ran Yu: In 3D Research Challenges in Cultural Heritage II - How to Manage Data and Knowledge Related to Interpretative Digital 3D Reconstructions of Cultural Heritage, pages 231–255.

9. Davide Taibi, Saniya Chawla, Stefan Dietze, Ivana Marenzi, Besnik Fetahu: Exploring TED talks as linked data for education. BJET 46(5): 1092-1096 (2015).

10. Besnik Fetahu, Katja Markert, Avishek Anand: Automated News Suggestions for Populating Wikipedia Entity Pages. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM 2015, Melbourne, VIC, Australia, October 19 - 23, 2015, pages 323–332.

11. Ujwal Gadiraju, Besnik Fetahu, Ricardo Kawase: Training Workers for Improving Performance in Crowdsourcing Microtasks. In Design for Teaching and Learning in a Networked World - 10th European Conference on Technology Enhanced Learning, EC-TEL 2015, Toledo, Spain, September 15-18, 2015, Proceedings, pages 100–114.

12. Ujwal Gadiraju, Patrick Siehndel, Besnik Fetahu, Ricardo Kawase: Breaking Bad: Understanding Behavior of Crowd Workers in Categorization Microtasks. In Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT 2015, Guzelyurt, TRNC, Cyprus, September 1-4, 2015, pages 33–38.

13. Besnik Fetahu, Ujwal Gadiraju, Stefan Dietze: Improving Entity Retrieval on Structured Data. In The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part I, pages 474–491.

14. Besnik Fetahu, Abhijit Anand, Avishek Anand: How much is Wikipedia Lagging Behind News? In Proceedings of the ACM Web Science Conference, WebSci 2015, Oxford, United Kingdom, June 28 - July 1, 2015, pages 28:1–28:9.

15. Ran Yu, Ujwal Gadiraju, Besnik Fetahu, Stefan Dietze: Adaptive Focused Crawling of Linked Data. In Web Information Systems Engineering - WISE 2015 - 16th International Conference, Miami, FL, USA, November 1-3, 2015, Proceedings, Part I, pages 554–569.

16. Davide Taibi, Giovanni Fulantelli, Stefan Dietze, Besnik Fetahu: Towards Analysing the Scope and Coverage of Educational Linked Data on the Web. In Proceedings of the 24th International Conference on World Wide Web Companion, WWW 2015, Florence, Italy, May 18-22, 2015 - Companion Volume, pages 705–710.

17. Besnik Fetahu, Stefan Dietze, Bernardo Pereira Nunes, Marco Antonio Casanova, Davide Taibi, Wolfgang Nejdl: A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles. In The Semantic Web: Trends and Challenges - 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25-29, 2014. Proceedings, pages 519–534.

18. Bernardo Pereira Nunes, Alexander Arturo Mera Caraballo, Ricardo Kawase, Besnik Fetahu, Marco A. Casanova, Gilda Helena Bernardino de Campos: A Topic Extraction Process for Online Forums. In IEEE 14th International Conference on Advanced Learning Technologies, ICALT 2014, Athens, Greece, July 7-10, 2014, pages 541–543.

19. Davide Taibi, Stefan Dietze, Besnik Fetahu, Giovanni Fulantelli: Exploring type-specific topic profiles of datasets: a demo for educational linked data. In Proceedings of the ISWC 2014 Posters & Demonstrations Track, a track within the 13th International Semantic Web Conference, ISWC 2014, Riva del Garda, Italy, October 21, 2014, pages 353–356.

20. Besnik Fetahu, Ujwal Gadiraju, Stefan Dietze: Crawl Me Maybe: Iterative Linked Dataset Preservation. In Proceedings of the ISWC 2014 Posters & Demonstrations Track, a track within the 13th International Semantic Web Conference, ISWC 2014, Riva del Garda, Italy, October 21, 2014, pages 433–436.

21. Bernardo Pereira Nunes, Ricardo Kawase, Besnik Fetahu, Marco A. Casanova, Gilda Helena Bernardino de Campos: Educational Forums at a Glance: Topic Extraction and Selection. In Web Information Systems Engineering - WISE 2014 - 15th International Conference, Thessaloniki, Greece, October 12-14, 2014, Proceedings, Part II, pages 351–364.

22. Besnik Fetahu, Stefan Dietze, Bernardo Pereira Nunes, Marco Antonio Casanova, Davide Taibi, Wolfgang Nejdl: What's all the data about?: creating structured profiles of linked data on the web. In 23rd International World Wide Web Conference, WWW '14, Seoul, Republic of Korea, April 7-11, 2014, Companion Volume, pages 261–262.


23. Bernardo Pereira Nunes, Alexander Arturo Mera Caraballo, Marco Antonio Casanova, Besnik Fetahu, Luiz André P. Paes Leme, Stefan Dietze: Complex Matching of RDF Datatype Properties. In Database and Expert Systems Applications - 24th International Conference, DEXA 2013, Prague, Czech Republic, August 26-29, 2013. Proceedings, Part I, pages 195–208.

24. Davide Taibi, Giovanni Fulantelli, Stefan Dietze, Besnik Fetahu: Evaluating Relevance of Educational Resources of Social and Semantic Web. In Scaling up Learning for Sustained Impact - 8th European Conference on Technology Enhanced Learning, EC-TEL 2013, Paphos, Cyprus, September 17-21, 2013. Proceedings, pages 637–638.

25. Bernardo Pereira Nunes, Stefan Dietze, Marco Antonio Casanova, Ricardo Kawase, Besnik Fetahu, Wolfgang Nejdl: Combining a Co-occurrence-Based and a Semantic Measure for Entity Linking. In The Semantic Web: Semantics and Big Data, 10th International Conference, ESWC 2013, Montpellier, France, May 26-30, 2013. Proceedings, pages 548–562.

26. Besnik Fetahu, Bernardo Pereira Nunes, Stefan Dietze: Summaries on the Fly: Query-Based Extraction of Structured Knowledge from Web Documents. In Web Engineering - 13th International Conference, ICWE 2013, Aalborg, Denmark, July 8-12, 2013. Proceedings, pages 249–264.

27. Bernardo Pereira Nunes, Ricardo Kawase, Besnik Fetahu, Stefan Dietze, Marco A. Casanova, Diana Maynard: Interlinking Documents based on Semantic Graphs. In 17th International Conference in Knowledge Based and Intelligent Information and Engineering Systems, KES 2013, Kitakyushu, Japan, 9-11 September 2013, pages 231–240.

28. Bernardo Pereira Nunes, Besnik Fetahu, Marco Antonio Casanova: Cite4Me: Semantic Retrieval and Analysis of Scientific Publications. In Proceedings of the LAK Data Challenge, Leuven, Belgium, April 9, 2013.

29. Bernardo Pereira Nunes, Besnik Fetahu, Stefan Dietze, Marco A. Casanova: Cite4Me: A Semantic Search and Retrieval Web Application for Scientific Publications. In Proceedings of the ISWC 2013 Posters & Demonstrations Track, Sydney, Australia, October 23, 2013, pages 25–28.

30. Besnik Fetahu, Stefan Dietze, Bernardo Pereira Nunes, Davide Taibi, Marco Antonio Casanova: Generating structured Profiles of Linked Data Graphs. In Proceedings of the ISWC 2013 Posters & Demonstrations Track, Sydney, Australia, October 23, 2013, pages 113–116.


31. Besnik Fetahu, Bernardo Pereira Nunes, Stefan Dietze: Towards focused knowledge extraction: query-based extraction of structured summaries. WWW (Companion Volume) 2013: 77-78.

32. Davide Taibi, Besnik Fetahu, Stefan Dietze: Towards integration of web data into a coherent educational data graph. In 22nd International World Wide Web Conference, WWW ’13, Rio de Janeiro, Brazil, May 13-17, 2013, Companion Volume, pages 419–424.

33. Besnik Fetahu, Ralf Schenkel: Retrieval evaluation on focused tasks. In The 35th International ACM SIGIR conference on research and development in Information Retrieval, SIGIR '12, Portland, OR, USA, August 12-16, 2012, pages 1135–1136.


Contents

Table of Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Scope of the Thesis
  1.3 Contributions of the Thesis
2 Foundations and Technical Background
  2.1 Knowledge Bases
    2.1.1 Resource Description Framework – RDF
    2.1.2 RDF Schema – RDFS
    2.1.3 Real-World Knowledge Bases
  2.2 Entity Linking and Disambiguation
  2.3 Information Retrieval
    2.3.1 Document and Query Representation
    2.3.2 Information Retrieval Models
  2.4 Clustering Approaches
    2.4.1 k–means Clustering
    2.4.2 X–means Clustering
    2.4.3 Minhashing and Locality Sensitive Hashing
    2.4.4 Spectral Clustering
  2.5 Supervised Learning and Feature Selection
    2.5.1 Supervised Learning
    2.5.2 Structured Prediction
    2.5.3 Feature Selection
3 News in Wikipedia
  3.1 Collection Alignment
    3.1.1 Preliminaries and Setup
  3.2 News Reference Density in Wikipedia
  3.3 Entity Lag
    3.3.1 Lag for Entity Types
  3.4 Event Lag
    3.4.1 Emerging Entities in Event Pages
  3.5 Conclusions and Implications
4 Related Work
  4.1 Wikipedia Editor Dynamics
  4.2 Wikipedia Page Generation
  4.3 Knowledge Base Acceleration
  4.4 Entity Salience and Filtering
  4.5 Citation Span
5 Finding News Citations for Wikipedia Entities
  5.1 Problem Definition and Approach Outline
    5.1.1 Terminology and Problem Definition
    5.1.2 Approach Overview
  5.2 Wikipedia Ground-Truth
    5.2.1 Ground-Truth: Wikipedia News Statements
    5.2.2 Wikipedia News Collection
  5.3 Statement Categorization
    5.3.1 Statement Language-Style
    5.3.2 Entity-Structure Based Features
    5.3.3 Learning Framework
  5.4 Citation Discovery
    5.4.1 Query Construction
    5.4.2 Textual Entailment Features
    5.4.3 Centrality Features
    5.4.4 News-Domain Authority Features
  5.5 Statement Categorization Evaluation
    5.5.1 Experimental Setup
    5.5.2 Results and Discussion
  5.6 Citation Discovery Evaluation
    5.6.1 Statement and News Collection
    5.6.2 Evaluation Strategies
    5.6.3 Experimental Setup
    5.6.4 Results and Discussion
  5.7 Pipeline Evaluation
  5.8 Conclusion
6 Fine Grained Citation Span for References in Wikipedia
  6.1 Problem Definition and Terminology
  6.2 Citation Span Approach
    6.2.1 Structural Features
    6.2.2 Citation Features
    6.2.3 Discourse Features
    6.2.4 Temporal Features
  6.3 Experimental Setup
    6.3.1 Dataset
    6.3.2 Ground-Truth
    6.3.3 Baselines
    6.3.4 Citation Span Approach Setup – CSPS
    6.3.5 Evaluation Metrics
  6.4 Results and Discussion
    6.4.1 Citation Span Robustness
    6.4.2 Citation Spans and Feature Analysis
  6.5 Citation Span Conclusion
7 Automated News Suggestion for Populating Wikipedia Entities
  7.1 Problem Definition and Approach Outline
    7.1.1 Terminology and Problem Definition
    7.1.2 Approach Overview
  7.2 News Article Suggestion
    7.2.1 Article–Entity Placement
    7.2.2 Article–Section Placement
  7.3 Datasets and Pre-Processing
    7.3.1 Evaluation Plan
    7.3.2 Datasets
    7.3.3 Data Pre-Processing
    7.3.4 Train and Testing Evaluation Setup
  7.4 Results and Discussion
    7.4.1 Article–Entity Placement
    7.4.2 Article-Section Placement
  7.5 Conclusion
8 Entity Search as a Use Case of Wikipedia
  8.1 Approach and Overview
    8.1.1 Preliminaries
    8.1.2 Motivation
    8.1.3 Approach Overview
  8.2 Data Pre-processing and Entity Clustering
    8.2.1 Entity Feature Vectors
    8.2.2 Entity Bucketing & Clustering
  8.3 Entity Search: Expansion and Ranking
  8.4 Experimental Setup
    8.4.1 Evaluation Data
    8.4.2 Baseline and State of the Art
    8.4.3 Ground Truth for Evaluation of Entity Retrieval
    8.4.4 Evaluation Metrics
  8.5 Evaluation and Discussion
    8.5.1 Cluster Accuracy Evaluation
    8.5.2 Entity Retrieval Evaluation
  8.6 Conclusions
9 Conclusion and Future Work
  9.1 Conclusion and Contributions
  9.2 Future Work Directions
Bibliography


List of Figures

1.1 Overview of the proposed approach for enrichment and improvement of textual knowledge bases. The approach shows the three main steps: (i) Citation Recommendation, (ii) Citation Span, and (iii) News Suggestions.
2.1 Mention-Entity graph for an example text snippet with ambiguous entity mentions [HYB+11].
2.2 Vector representation of a document collection. Each term is drawn from a vocabulary and it represents a dimension in the document representation.
2.3 An example of linearly separable instances. The support vectors are the ones which are at the margins and they define the maximum distance between the support vectors of the two classes.
2.4 Modeling of dependencies between instances in a sequence according to linear-chain CRFs and general CRFs.
3.1 Number of entities appearing in the corresponding years in Wikipedia, and those extracted from the entity linking process in the NYT corpus.
3.2 News Reference Density for the different entity types. The reference density of a given reference type is measured as the fraction of references of that type over all references for the entity page.
3.3 Reference density for the different entity types. The plots show the reference density for years 2009-2014, in order from left to right.
3.4 Entity mention counts in news articles before creation of the Wikipedia entity page. Mention counts of entities peak a year before it is created in Wikipedia.
3.5 Entity lag in months. The emergent entities are shown in red; they are determined by filtering all entities from the subset of NYT that appear in earlier years before 2001. The y-axis is normalized using the sum of entities having medium lag for the emerging and non-emerging entities, respectively.
3.6 Lag distribution of different types. The y-axis values are normalized by the sum of the overall entities falling into the different lag classes.
3.7 Event news reference lag (in years) in Wikipedia. Most of Wikipedia events fall into the low-lag class, showing high dynamics of reporting real news events in Wikipedia.
3.8 Emerging entity density in Wikipedia event pages.
5.1 Approach overview. In the first task, we classify statements into one of the citation categories, and in the second we find the appropriate news article for citation for a news statement.
5.2 Statement distribution by citation category.
5.3 News article length (in number of characters) distribution.
5.4 An example sub-tree from the YAGO type taxonomy. The edge in red represents the edge we need to delete since it is not depth-consistent.
5.5 Hit-rate of articles in NW up to rank 1000 (x-axis) for 1000 news statements, respectively QCA1Base queries.
5.6 Learning curve for SC measured for different sample sizes.
5.7 Feature ablation for features in SC for type preserver.
5.8 Evaluation interface in CrowdFlower shown to crowd-workers for assessing the relevance of news articles for a news statement s.
6.1 Sub-sentence level span for citation [1] in a citing paragraph in a Wikipedia article.
6.2 Linear chain CRF representing the sequence of text fragments in a paragraph. In the factors we encode the fitness to the given citation.
6.3 Entity distribution based on the number of news citations.
6.4 Average document length for the different span buckets for citation types web and news.
6.5 Erroneous span for the different citation span buckets. The y-axis presents the Δw whereas the x-axis shows the different approaches.
7.1 Comparing how cyclones are reported in Wikipedia entity pages.
7.2 News suggestion approach overview.
7.3 Number of entities with at least one news reference for different entity classes.
7.4 Precision-Recall curve for the article–entity placement task; Fe is shown in blue and the baseline B1 in red.
7.5 Feature analysis for the AEP placement task for t = 2009.
7.6 Article-Section performance averaged for all entity classes for Fs (using SVM) and S2.
7.7 Correctly suggested news articles for s ∈ Se(t) ∧ s ∉ Se(t−1).
8.1 (a) Number of explicit similarity statements in contrast to the frequency of object property statements overall, shown for all data graphs. (b) Query type affinity shows the query type and the corresponding entity types from the retrieved and relevant entities.
8.2 Overview of the entity retrieval approach.
8.3 (a) Worker agreement on cluster accuracy for spectral and x-means clustering. (b) Cluster accuracy for the spectral and x-means clustering approaches.
8.4 (a) P@k for the different entity retrieval approaches under comparison. (b) The relevant entity frequency based on their graded relevance (from 2-Slightly Relevant to 5-Highly Relevant) for the different methods.
8.5 NDCG@k for B1, S1 and SP, XM.
8.6 (a) The aggregated MAP for different query types and for the different retrieval approaches (note, we show the results for field body where the baseline performs best). (b) The various configurations for the number of expanded entities for SP and XM.


List of Tables

2.1 A characteristic matrix representing a set of documents (in the columns) and the terms drawn from the vocabulary of terms from all documents.
2.2 Minhash signatures after mapping the rows into the hashing buckets based on h1 and h2 and after replacing each hash column with the lowest hash bucket for each document for all its non-zero entries.
2.3 An example feature of word occurrence for classifying documents into winter and summer class.
3.1 Absolute entity lag distributions for all lag types. The numbers are aggregated over the years 2001-2006.
5.1 The cells show the number of statements that are changed from one category to another category after ground-truth curation.
5.2 Feature list for statement categorization.
5.3 Entailment feature set for the citation discovery task.
5.4 Sentence centrality feature set for the citation discovery task.
5.5 Results for the statement classification for entities of type yagoLegalActorGeo. Results are for the different sample ranges τ and shown for different levels of entity types in the YAGO type hierarchy.
5.6 Top-10 best performing entity types for the FC task. The E1+FP and E2 columns show the improvement for P over E1. The rightmost column shows the configuration (#f represents the number of top-k most important features, and % is the percentage of training data) for the FC models. The last row shows the micro-average precision across all FC models.
5.7 The results for the two baselines B1 and B2 for the top-10 best performing entity types for the FC task. The last row shows the micro-average precision across all FC models.
5.8 Relevant citation distribution for E2.
6.1 Varying degrees of citation span granularity in Wikipedia text.
6.2 Structural features for a sequence δi.
6.3 Citation span distribution based on the number of sub-sentences in the citing paragraph.
6.4 The percentage of citations in a span with sequence skips and sentence skips.
6.5 Evaluation results for the different citation span approaches.
6.6 Evaluation results for the citation span approaches for the different span cases. For the results of CSPS we compute the relative increase/decrease of F1 score compared to the best result (based on F1) from the competitors. We mark in bold the best results for the evaluation metrics, and indicate with ** and * the results which are highly significant (p < 0.001) and significant (p < 0.05) based on t-test statistics when compared to the best performing baselines (CS, IC, CSW, MRF) based on F1 score, respectively.
7.1 Article–Entity placement feature summary.
7.2 Feature types used in Fs for suggesting news articles into the entity sections. We compute the features for all s ∈ Sbc(t−1) as well as st−1.
7.3 News articles, entities and sections distribution across years.
7.4 Number of instances for train and test in the AEP and ASP tasks.
7.5 Article–Entity placement task performance.
7.6 Article-Section placement performance (SVM based Fs) for the different entity classes.
8.1 Performance of the different entity retrieval approaches. In all cases our approaches are significantly better in terms of P/R (p < 0.05 measured by t-test) compared to the baseline and state of the art. There is no significant difference between the SP and XM approaches.


1 Introduction

1.1 Motivation

The advent of the Internet and Web 2.0 platforms has led to major societal shifts. Nowadays, Web users have the means to consume and create content as part of various sharing platforms like social media, blogs, and other platforms. The increasing number of users on the Web [Sta] has shifted the focus of many organizations towards providing content online. This has led to many successful digitization projects like the Internet Archive [Arc], the Million Books Project [Pro], The New York Times [San08], etc.

Such digitization has societal benefits: information is more easily accessible, leading to a more informed society. For instance, in the US alone, nearly 40% of users consume their daily news through online news media platforms [Cen].

Apart from organizations that follow a certain discourse on providing information, like news media, there is the other spectrum of the Web, where information is a direct result of the interaction between users and Web applications. The engagement and the number of users directly correlate with the value of an application. Examples include social media platforms like Twitter and Facebook, where user engagement is of importance to the success of these applications [Rib14]. Such a phenomenon is described by Simon [Sim71], where user attention and engagement are some of the key available resources for organizations. This similarly applies to Web applications.

In this respect, one of the best known examples of such synergies between Web applications and users on the Web is Wikipedia [Enc]. It represents an open, collaborative effort of creating encyclopedic content by Web users. At its core are Wikipedia editors, who provide content according to established principles and editing policies [pol].

There are approximately 284 different language versions of Wikipedia [oW]. In the English Wikipedia alone there are roughly 5 million articles, and a total of 30 million registered editors across all localized Wikipedias. The dynamics of content creation in Wikipedia, and the organizational structure and collaboration between editors, have been subject to extensive research [KGC12, KPSM07, ZPL12, PHT09, WRTH15].



Wikipedia is one of the most visited websites overall1. To provide quality assurance, given the large number of editors and its openness in terms of added content, there are guidelines and policies [pol] and editor categorizations (e.g. admins or novice editors), and each revision of an article can be edited or deleted by other peer editors. Despite the fact that these policies are guidelines and are not enforced, studies [MOM+15] show that in specific domains Wikipedia achieves quality comparable to expert-curated encyclopedias like Britannica [Bri].

The value of Wikipedia has been widely acknowledged. It serves as the backbone for a wide range of applications. It is used to construct knowledge graphs like DBpedia [BLK+09] and YAGO [SKW07], which are included in major search engines like the Google Knowledge Graph or Apple's Siri system. Furthermore, it has been widely used in fields such as text categorization [WD08], entity disambiguation [HYB+11], etc. Therefore, apart from its direct visitors, its content is used and accessed implicitly through other sources that are built upon Wikipedia.

The core role and popularity of Wikipedia on the Web and the large variety of applications can be traced to two main factors. First, articles in Wikipedia are constantly evolving and new articles are added by its community of editors. This is mostly influenced by emerging information from the Web. For example, for an existing Wikipedia article like United States presidential election, 20162, there are news reports, blogs, and other sources reporting about this particular event.

In many cases such emerging information is directly reflected in the corresponding Wikipedia articles, thus keeping Wikipedia up to date. In some cases, real-world events are reflected in Wikipedia within a few minutes [KGC11]. Second, the Wikipedia policies [pol], especially the verifiability3 policy, recommend that Wikipedia contributors support their additions with references from authoritative external sources. In particular, this policy states that articles should be based on "reliable, third-party, published sources with a reputation for fact-checking and accuracy."4 This policy, on the one hand, guides contributors towards both neutrality and the importance of authoritative assessment and, on the other hand, allows Wikipedia core editors to identify unreliable articles more easily via a lack of such citations. Citations therefore play a crucial role in ensuring and upholding Wikipedia's reliability, leading to high quality and important information which is harnessed by Web users in general and the above-mentioned applications.

Despite the established policies and the speed with which editors provide content for Wikipedia articles, these articles vary heavily in terms of quality. First, articles vary in their popularity and, hence, their affinity to attract editors who provide content. For instance, 51% of Wikipedia articles are categorized as Stub or articles in need of expansion [Wik].

1 In 2015 it was in the top 10 most visited Internet sites according to the Alexa Internet ranking (www.alexa.com).

2 https://en.wikipedia.org/wiki/United_States_presidential_election,_2016

3 https://en.wikipedia.org/wiki/Wikipedia:Verifiability

4 https://en.wikipedia.org/wiki/Wikipedia:Identifying_reliable_sources


Second, given that new information regarding articles in Wikipedia constantly emerges, it is hard to keep all articles up to date. Apart from such information overload, the number of active editors5 at a given time point varies, and is far smaller than the total of 30 million registered editors [Wik]. Naturally, the number of active editors, other editor demographics, and editors' interests will impact the coverage of articles that are kept up to date. Furthermore, there is an inherent delay between the time a real-world event relevant to a Wikipedia article happens and the time it is reflected in Wikipedia [FAA15]. In addition, changes in a specific Wikipedia article may cause information on other articles to be inconsistent. Finally, for any provided citation in Wikipedia and the text it is cited in, it is not possible to determine the span of text it covers. This has implications for enforcing the verifiability policy, where situations may arise in which a paragraph in a Wikipedia article containing a citation may be only partly covered by a reference.

1.2 Scope of the Thesis

Motivated by the importance and wide use of Wikipedia as a resource for a wide range of tasks and its high popularity among Web users, we address three core issues which deal with consistency, keeping content up to date, and providing trustworthy information for Wikipedia: (i) finding news citations for Wikipedia statements, (ii) citation span determination, and (iii) enrichment of Wikipedia entity pages with novel and important news articles.

Before delving into the details of the problems we address, we clarify some notions we will use throughout this thesis. We will use Wikipedia article to refer to Wikipedia entity and event pages, whereas with Wikipedia entity page we refer only to entities. With Wikipedia statement we refer to a piece of text, ranging from a sentence up to a paragraph, that has or needs a citation.

(I) Despite the growing number of entity pages in Wikipedia, and of those that adhere to the Wikipedia editing policies, there is a large set of entity pages whose already existing citations are either outdated or not accessible.

Furthermore, as new information is constantly added by Wikipedia editors, it is of great importance to automate, or at the very least aid the editors in, finding the appropriate citations. Additionally, there are cases in which statements are explicitly marked with citation needed; the trust and truthfulness of such statements is in question. This may indeed be the case where the statement is simply not true; however, more often the citation is simply missing and can be found in sources like news collections.

5 The definition of an active editor in Wikipedia refers to registered users that have contributed to Wikipedia in the last 30 days (for any time point of measurement).


For this problem, where we are given a piece of text (we will define later the granularity of the textual fragments that we consider) from a Wikipedia article, we lay out two fundamental research questions.

RQ1.1. For a statement from a Wikipedia entity page, how can we determine the required type of citation (e.g. news, web, book etc.)?

The outcome of RQ1.1 is of great importance in finding appropriate citations for any given statement that adhere to the Wikipedia policies. There are many scenarios where specific citation categories are preferred. For instance, for entity pages in the medical domain, a more authoritative reference would be a citation coming from a medical journal. In other cases, the availability of specific sources may restrict the statements for which we can suggest a citation.

Next, after knowing the desired citation category of a statement, the problem is how to find such references to cite. This brings us to the second research question, which we postulate as follows.

RQ1.2. For a Wikipedia statement that requires a news citation, how can we find news citations which provide evidence for the statement under consideration?

Automating the process of providing citations as postulated in RQ1.2 has several advantages. First, it addresses the problem of long-tail entity pages, which suffer due to the lack of interest by Wikipedia editors. Second, because Wikipedia is in a constantly evolving state, providing citations in an automated manner will ease the process of editing and serve as a complementary mechanism for Wikipedia editors. Finally, through automation it is possible to enforce the Wikipedia policies in an objective manner without falling into issues that in many cases lead to edit wars and disputes in Wikipedia.

(II) It is evident from the problem in (I) that in Wikipedia, determining the granularity for which a citation is valid is somewhat ill-defined. The reason for this is that there are no explicit requirements and, furthermore, no means of specifying for what part of the text a citation is valid. There are several consequences as a result of this. For instance, if we take a paragraph from a Wikipedia article and a reference cited within that paragraph, we are not able to tell for which part of the paragraph the citation provides evidence. We formalize the research question addressing this issue as follows:

RQ2. For a paragraph which we extract from a Wikipedia article and a reference cited within the paragraph, how can we accurately determine the span of the citation?

Determining the span of a citation, that is, singling out at a fine-grained level what a citation covers in a paragraph extracted from a Wikipedia article, has important implications. By accurately knowing the span, it is possible to have a closed cycle where for uncovered parts in a paragraph we find citations as postulated in (I). Otherwise, for statements which do not have a citation from the appropriate source type, this may be an important signal about the truthfulness and validity of a statement.

(III) Finally, apart from recommending citations for already existing content in Wikipedia articles, and finding their corresponding span, a highly important issue remains with emerging or missing information from Web sources regarding a specific Wikipedia article. Specifically, for a given news collection, we deal with the problem of suggesting news articles to Wikipedia articles, which in turn can be processed by Wikipedia editors in order to add the information encoded within these sources. To this end we postulate the following two research questions.

RQ3.1. For a Wikipedia entity and a news collection, how can we find news articles in which the entity is a salient concept and at the same time the news article provides important and novel information for the entity?

After addressing the question RQ3.1, which we refer to as the article-entity placement task, we proceed and answer the second question, which deals with finding the appropriate section in the Wikipedia article.

RQ3.2. For a Wikipedia entity and a suggested news article, how can we find the appropriate section within the entity page, and in case such a section is missing, how can we automatically add an appropriate section to the Wikipedia entity page to suggest the news article for?

The second research question addresses an important issue on suggesting novel information to Wikipedia articles and to specific sections. Due to the fact that information regarding Wikipedia articles constantly evolves, such sections in many cases might be missing. Therefore, to fully address RQ3.2 one needs to be able to add missing sections for which the news is relevant. For example, the Wikipedia article Barack Obama before his US presidential election did not contain a section about US Presidency; hence, in this case, a new section should be suggested automatically, in order to suggest news articles at the appropriate section and at a fine-grained level.


1.3 Contributions of the Thesis

In this thesis, we answer the research questions formalized in the previous section. The contribution of this thesis is enriching and improving the quality of Wikipedia as one of the best known textual knowledge bases on the Web. Figure 1.1 shows an outline of our contributions and the proposed solutions for the three core problems listed in Section 1.2.

Figure 1.1. Overview of the proposed approach for enrichment and improvement of textual knowledge bases. The approach shows the three main steps: (i) Citation Recommendation, (ii) Citation Span, and (iii) News Suggestions.

(I) News Citation Recommendation for Wikipedia: In Chapter 5 we propose a novel approach for finding news citations in Wikipedia. We address the two research questions in problem (I).

• RQ1.1. First, for a Wikipedia article and a specific statement, we propose an approach which determines the type of resources that are appropriate for citation. Dependent on the statement at hand, there is a range of 12 citation types (e.g. news, web, journal, book, report, ...) which can be chosen from. This is a prerequisite for finding appropriate citations that enforce the Wikipedia editing policies, where authoritative sources are suggested. Which source type is considered more authoritative is dependent on the article and the statement; for example, if an article is about medicine, a source from a journal is preferred over a news article. To determine the citation category, we rely on language style and other structural attributes we extract from Wikipedia articles. This addresses question RQ1.1.

• RQ1.2. Second, we focus only on the cases where for any piece of text in Wikipedia the required and appropriate citation is of type news. The reason for focusing on this specific type of citation is motivated in Chapter 3. We automatically construct a query to find news articles from a news corpus. We determine which news article to suggest as evidence for the given statement based on textual entailment, news authority, and other centrality measures.

The contributions from this chapter are published in:

• [FMNA16] Besnik Fetahu, Katja Markert, Wolfgang Nejdl, Avishek Anand: Finding News Citations for Wikipedia. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, pages 337–346.

(II) Fine-grained Citation Span for References in Wikipedia: In Chapter 6 we propose an approach where, for a paragraph containing a web or news citation, we determine the coverage span of the citation. The work is of importance for automated approaches on enriching and expanding Wikipedia. In this way, we can determine to what extent an entity page adheres to the verifiability policy. Furthermore, we provide explicit markings of the statements in Wikipedia that are covered by a citation. Hence, tying this together with the contribution in Chapter 5, we close the cycle in which we find citations for uncovered statements until for all statements we can provide citations, in case they exist in a given news collection.

The contribution in this chapter has been published in:

• Besnik Fetahu, Katja Markert, Avishek Anand: Fine Grained Citation Span for References in Wikipedia. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 7-11, 2017 (to appear).

(III) Automated News Suggestion for Populating Wikipedia Pages: In Chapter 7 we propose a novel approach that accounts for the ever evolving nature of Wikipedia entity pages, respectively, the emerging information in news and Web sources in general. Furthermore, due to the varying popularity of Wikipedia articles, and respectively their affinity to attract editors to provide content for such articles, we can retrospectively suggest missing information for such articles. The proposed approach addresses the following research questions:

• RQ3.1. First, for a news article, we consider the dual problem of determining the salient entities in the news article and, consequently, whether the news article is of importance for the Wikipedia entity page at hand. In this way, we ensure that only important information is suggested where the entity is a salient concept. This addresses the research question in RQ3.1.

• RQ3.2. Second, considering the section structure of Wikipedia articles, it is important to determine precisely for which section a suggested news article is relevant. Wikipedia articles evolve as new information becomes available, and along with that the section structure changes. Therefore, in our proposed approach we account for these two attributes, by first determining the appropriate section for which we suggest the news article, and in case such a section is missing we suggest its addition to the section structure.

The contribution in this chapter has been published in:

• [FMA15] Besnik Fetahu, Katja Markert, Avishek Anand: Automated News Suggestions for Populating Wikipedia Entity Pages. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM 2015, Melbourne, Australia, October 19 - 23, 2015, pages 323–332.

Apart from our holistic approach to dealing with the enrichment and improvement of Wikipedia articles, we additionally make the following contributions, which provide the context for the work carried out in this thesis.

(A1) How much is Wikipedia lagging behind News? The importance of news in Wikipedia is acknowledged by its editing policies [pol], where authoritative and third-party sources like news articles are suggested for citation. To better understand the actual use of news as citations in Wikipedia, we analyze how news is reflected in Wikipedia, namely the amount of time between an entity being reported in news and its occurrence in Wikipedia, and finally how many of the citations across all citation categories are news citations.

The importance and interaction between news and Wikipedia is published in:

– [FAA15] Besnik Fetahu, Abhijit Anand, and Avishek Anand. How much is Wikipedia lagging behind news. In Proceedings of the ACM Web Science Conference, WebSci 2015, Oxford, United Kingdom, June 28 - July 1, 2015, pages 28:1–28:9.

(A2) Improving Entity Retrieval in Structured Data. Finally, we look into application use cases, where Wikipedia drives several major industry projects on constructing knowledge graphs by Google [DGH+14], Yahoo! [BMV11], and Microsoft [NMS+07], which in turn drive the functionalities behind entity search.


In this case, we address several of the shortcomings that mostly deal with the nature of such structured datasets (e.g. DBpedia, Freebase etc.) and propose a new query similarity model for entity search. The contribution of this work has been published in:

– [FGD15] Besnik Fetahu, Ujwal Gadiraju, and Stefan Dietze. Improving Entity Retrieval on Structured Data. In The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part I, pages 474–491.


2 Foundations and Technical Background

In this chapter, we introduce the technical background necessary to understand the work carried out in this thesis. We first introduce the notion of knowledge bases, then continue with entity linking techniques. Next, we provide a thorough analysis of information retrieval techniques. We then describe clustering techniques, and finally conclude with supervised learning and feature selection algorithms.

2.1 Knowledge Bases

With the term knowledge base (KB) we refer to RDF datasets published based on a set of linked data principles introduced by Berners-Lee et al. [BLHL+01]. Since then there has been a big push towards publishing data according to these principles. Nowadays, the number of datasets is in the range of thousands [LS]. KBs are commonly referred to as structured datasets, linked datasets or RDF datasets.

2.1.1 Resource Description Framework – RDF

The term knowledge base (KB) refers to RDF datasets published based on the linked data principles introduced by Berners-Lee et al. [BLHL+01], which can also be considered the inception of the field of the Semantic Web. Since then, there has been a big push towards publishing data following these principles. Nowadays, the number of datasets is in the range of thousands [LS]. These datasets are represented in RDF format and are interchangeably referred to as structured datasets, linked datasets or RDF datasets.

Resource Description Framework, or RDF, is a graph data model proposed by W3C as a standard for knowledge representation [W3Ca]. The RDF data model consists of a set of resources, predicates, and literals.

A resource refers to a real-world entity (e.g. person, organization etc.) or an abstract concept. We denote with R the set of resources in a KB. Literals, L, on the other hand, represent values like a string, date, number etc. Finally, predicates, P, represent a relation between two resources or between a resource and a literal.

We define a KB to be the projection between these building blocks of the RDF data model. Hence, a knowledge base can be represented as follows: K := R × P × (R ∪ L).

Alternatively, a KB can simply be seen as a set of triples of the form ⟨s, p, o⟩, where s ∈ R, p ∈ P, and the object o can represent a resource or a literal, thus o ∈ L ∪ R. Listing 2.1 shows a set of triples describing a resource, which here is s := db:University_of_Hannover.

Listing 2.1 RDF Resource Example for “University of Hannover”

db:University_of_Hannover rdf:type owl:Thing .
db:University_of_Hannover rdf:type dbo:Agent .
db:University_of_Hannover rdf:type dbo:EducationalInstitution .
db:University_of_Hannover rdf:type dbo:Organisation .
db:University_of_Hannover rdf:type dbo:University .
db:University_of_Hannover dbo:city db:Hannover .
db:University_of_Hannover dbo:country db:Germany .
db:University_of_Hannover dbp:budget "441.8 million"@en .
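To make the triple-based representation concrete, the following is a minimal sketch, not part of the approaches of this thesis, which loads the triples of Listing 2.1 with the Python library rdflib and iterates over them as ⟨s, p, o⟩ tuples. The full prefix URIs are assumptions, since the listing uses the prefixes db:, dbo: and dbp: without declaring them.

from rdflib import Graph, Namespace

DB = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")

data = """
@prefix db:  <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix dbp: <http://dbpedia.org/property/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

db:University_of_Hannover a owl:Thing, dbo:Agent, dbo:EducationalInstitution,
                            dbo:Organisation, dbo:University ;
    dbo:city    db:Hannover ;
    dbo:country db:Germany ;
    dbp:budget  "441.8 million"@en .
"""

g = Graph()
g.parse(data=data, format="turtle")  # the facts of Listing 2.1 in Turtle syntax

# Every triple <s, p, o> of the small knowledge base K.
for s, p, o in g:
    print(s, p, o)

# Objects of a single predicate, e.g. the city of the university.
print(list(g.objects(DB.University_of_Hannover, DBO.city)))

In the Turtle serialization above, the keyword a abbreviates rdf:type and the semicolon repeats the subject, so the graph contains exactly the triples of Listing 2.1.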

2.1.2 RDF Schema – RDFS

A crucial construct in publishing RDF datasets is the organization of resources into classes. This functionality is provided by RDFS [W3Cb]. RDFS is an extension of the basic constructs provided by the RDF data model. A class in RDFS is defined by the set of triples in Listing 2.2. A resource s is assigned to a class c by the triple ⟨s, rdf:type, c⟩, similar to Listing 2.1 where ⟨db:University_of_Hannover, rdf:type, owl:Thing⟩.

Listing 2.2 RDFS Class Definition Example

dbo:University rdf:type rdfs:Class .
dbo:University rdfs:subClassOf dbo:Organisation .
dbo:Organisation rdfs:subClassOf dbo:Agent .
dbo:Agent rdfs:subClassOf owl:Thing .

RDFS allows us to express hierarchy relations between different classes through rdfs:subClassOf. As such, since the resource db:University_of_Hannover is assigned to the class dbo:University, by traversing the class hierarchy we can infer that it is also of type owl:Thing.
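The class-hierarchy traversal described above can be sketched in a few lines. This is an illustrative example only, again assuming rdflib; the helper inferred_types is hypothetical and not part of any cited system. It starts from the asserted rdf:type triples of a resource and follows rdfs:subClassOf edges transitively.

from rdflib import Graph, Namespace, RDF, RDFS

DB = Namespace("http://dbpedia.org/resource/")

schema = """
@prefix db:   <http://dbpedia.org/resource/> .
@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

dbo:University   a rdfs:Class ;
                 rdfs:subClassOf dbo:Organisation .
dbo:Organisation rdfs:subClassOf dbo:Agent .
dbo:Agent        rdfs:subClassOf owl:Thing .

db:University_of_Hannover a dbo:University .
"""

g = Graph()
g.parse(data=schema, format="turtle")  # Listing 2.2 plus one rdf:type triple

def inferred_types(graph, resource):
    """All classes of a resource, obtained by walking rdfs:subClassOf upwards."""
    types = set()
    for cls in graph.objects(resource, RDF.type):
        # transitive_objects yields the class itself and all its super-classes
        types.update(graph.transitive_objects(cls, RDFS.subClassOf))
    return types

print(inferred_types(g, DB.University_of_Hannover))
# expected: dbo:University, dbo:Organisation, dbo:Agent, owl:Thing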



2.1.3 Real-World Knowledge Bases

The functionalities of RDF and RDFS, and similar Semantic Web initiatives on modeling and representing knowledge, have resulted in many efforts to represent data according to such standards and principles. Arguably, some of the most well-known examples include knowledge bases like DBpedia [BLK+09] and YAGO [SKW07].

Such KBs represent a subset of the information contained in Wikipedia articles, represented according to linked data principles [BLHL+01].

These KBs are particularly interesting for this thesis. They allow us to construct homogeneous groups of Wikipedia articles based on a type taxonomy, which in turn is derived from the category structure in Wikipedia. Such categorization of articles according to a type taxonomy is useful for the approach in Figure 5.1.

In many cases, articles of different types, e.g., articles about Places and Politicians, have completely different structures, and the problems we tackle behave differently for the different types. Therefore, treating such articles separately allows for accurate and reliable models.

2.2 Entity Linking and Disambiguation

In this section, we present an overview of state-of-the-art approaches to entity linking (EL) and named entity disambiguation (NED). The task here corresponds to canonicalizing surface forms or text phrases to entities in a given database of entities. In the majority of cases [HYB+11, MJGB11, FS12], Wikipedia is used as the target to which such surface forms are linked.

The core problem in this task is to resolve ambiguous mentions of entities in free text. Figure 2.1 highlights the problem of resolving entity mentions from a text snippet into entities in Wikipedia.

Figure 2.1. Mention-Entity graph for an example text snippet with ambiguous entity mentions [HYB+11].

We highlight two main differences between state-of-the-art approaches to NED and EL. In NED systems, the candidate mentions from a text snippet are limited to only those that resolve to a named entity of type {Person, Location, Organization, Time}. This is usually done through named entity recognition (NER) approaches like [FGM05]. Contrary to NED approaches, in EL there is no restriction on the mentions that can be resolved to entities in a target knowledge base. For instance, through EL systems one can resolve a mention referring to an event, e.g. the U.S. Presidential Elections 2016, to its corresponding Wikipedia article¹, which through NED approaches is not possible.

In this thesis, we opt for entity linking approaches due to their wider coverage in linking mentions from news and Web sources to Wikipedia articles. While the different entity linking and disambiguation approaches differ in their final result of disambiguated or linked entities, one common attribute of both applications is that they consider the coherence and contextual similarity between entity candidates in a textual snippet as part of their linking or disambiguation results.

NED. AIDA [HYB+11] is a state-of-the-art approach to named entity disambiguation. For a text snippet as shown in Figure 2.1, AIDA performs the following operations to accurately link ambiguous entity mentions to a KB. In a pre-processing step, mentions of named entities are extracted through NER, and for these AIDA generates a list of entity candidates from the target KB. Finally, the disambiguation is performed jointly by constructing a graph of mentions and entity candidates from the target KB. The goal is to find a dense sub-graph that fulfills the following properties: (i) high contextual similarity between mentions in the text and the candidate entities in the KB, and (ii) weighted edges amongst the candidate entities, which measure their coherence. The coherence in this case is a function of the number of incoming links two entities share in a KB. Common incoming links for any two entities e1 and e2 in a KB can be defined by the triples ⟨x, p, e1⟩ and ⟨x, p, e2⟩.

EL. The work by Milne and Witten [MW08] is one of the most notable works and serves as the basis for many EL approaches. It relies on anchor text from Wikipedia articles, which is used to learn models that are later applied on textual resources to link specific phrases to Wikipedia articles. A commonality between AIDA and this work is the use of coherence between the entities occurring in a textual snippet [WM08], as well as of contextual similarity. The coherence score is computed as in Equation 2.1.
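Equation 2.1 is not reproduced at this point; as an illustration of how such coherence scores are computed, the hedged sketch below implements the Wikipedia link-based relatedness measure of Witten and Milne [WM08], which is modeled after the Normalized Google Distance over the sets of incoming links of two entities and is converted here into a similarity in [0, 1]. The function name and the toy in-link sets are purely illustrative assumptions.

import math

def link_relatedness(inlinks_a, inlinks_b, num_entities):
    """Similarity of two entities based on their shared incoming links."""
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        return 0.0
    # NGD-style distance over in-link sets [WM08]; 0 means identical link sets.
    distance = (math.log(max(len(a), len(b))) - math.log(len(common))) / \
               (math.log(num_entities) - math.log(min(len(a), len(b))))
    return max(0.0, 1.0 - distance)

# Toy example: e1 and e2 share two incoming links <x, p, e1> and <x, p, e2>.
e1_in = {"Germany", "Lower_Saxony", "Hannover"}
e2_in = {"Germany", "Hannover", "Leibniz_Association"}
print(link_relatedness(e1_in, e2_in, num_entities=5_000_000))

The more incoming links two entities share, relative to the size of the knowledge base, the higher their coherence; this is the intuition behind the weighted edges in the mention-entity graph described above.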

TagMe [FS12] is a state-of-the-art approach to entity linking. The advantage of TagMe compared to other existing approaches is that it optimizes for short textual snippets as well. This provides an advantage considering that a significant proportion of news and web resources in general are not very lengthy. TagMe follows a similar scheme in performing the entity linking, however, with improvements on disambiguating surface

¹ https://en.wikipedia.org/wiki/United_States_presidential_election,_2016
