
Analyzing and Accessing Wikipedia as a Lexical Semantic Resource

Torsten Zesch, Iryna Gurevych, Max Mühlhäuser

Department of Telecooperation, Ubiquitous Knowledge Processing Group

Darmstadt University of Technology, Hochschulstraße 10, D-64289 Darmstadt, Germany

{zesch,gurevych,max} (at) tk.informatik.tu-darmstadt.de



In this paper, we analyze Wikipedia as an emerging lexical semantic resource that is growing exponentially. Recent research has shown that Wikipedia can be successfully employed for NLP tasks, e.g. question answering (Ahn et al., 2004), text classification (Gabrilovich and Markovitch, 2006) or named entity disambiguation (Bunescu and Pasca, 2006). We extend this work by focusing on the analysis of Wikipedia content, in particular its category structure, and present a highly efficient Java-based API to use Wikipedia in large-scale NLP.

First, we compare Wikipedia with conventional lexical semantic resources such as dictionaries, thesauri, semantic wordnets or paper-bound encyclopedias. We show that different parts of Wikipedia reflect different aspects of these resources. Wikipedia articles form a heavily linked encyclopedia, whereas the category system is a collaboratively constructed thesaurus used for collaborative tagging (Voss, 2006). Our analysis reveals that Wikipedia additionally contains a vast amount of knowledge about named entities, domain-specific terms or specific word senses that cannot be easily found in other freely available lexical semantic resources. We also show that the redirect system of Wikipedia can be used as a dictionary for synonyms, spelling variations and abbreviations.
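To illustrate the last point, the redirect table can be read as a mapping from surface forms to article titles. The following sketch is our own illustration, not part of the API described in this paper; the class name and the sample entries are invented, and a real system would load the mapping from the redirect table of a Wikipedia database dump:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: treating Wikipedia redirects as a dictionary for
// synonyms, spelling variants and abbreviations. Entries are invented
// for illustration.
public class RedirectLookup {
    private final Map<String, String> redirectToArticle = new HashMap<>();

    public void addRedirect(String redirect, String article) {
        redirectToArticle.put(redirect.toLowerCase(), article);
    }

    // Resolves a surface form to the title of the article it redirects
    // to, or null if the form is unknown.
    public String resolve(String term) {
        return redirectToArticle.get(term.toLowerCase());
    }

    public static void main(String[] args) {
        RedirectLookup lookup = new RedirectLookup();
        lookup.addRedirect("UK", "United Kingdom"); // abbreviation
        lookup.addRedirect("Colour", "Color");      // spelling variant
        System.out.println(lookup.resolve("uk"));   // United Kingdom
    }
}
```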

Next, we perform a detailed analysis of the category graph of Wikipedia employing graph-theoretic methods. Previous studies (Holloway et al., 2005; Buriol et al., 2006; Zlatic et al., 2006) have focused on the Wikipedia article graph. We analyze the category graph, as it can be regarded as an important lexical semantic resource in its own right. From this analysis, we draw some conclusions for adapting algorithms from semantic wordnets to the Wikipedia category graph.
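As a flavor of the kind of graph-theoretic measurement involved, the following sketch computes the average out-degree of a toy category graph. The class, the adjacency-list representation, and the category names are hypothetical illustrations, not the implementation used in the paper:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a basic graph-theoretic statistic on a toy category graph.
// Edges point from a subcategory to its parent category; the example
// categories are invented.
public class CategoryGraphStats {
    private final Map<String, List<String>> parents = new HashMap<>();

    public void addEdge(String child, String parent) {
        parents.computeIfAbsent(child, k -> new ArrayList<>()).add(parent);
        parents.computeIfAbsent(parent, k -> new ArrayList<>());
    }

    // Average out-degree, i.e. the mean number of parent categories per
    // node - one of the statistics relevant when comparing the category
    // graph to the hierarchy of a semantic wordnet.
    public double averageOutDegree() {
        int edges = 0;
        for (List<String> p : parents.values()) {
            edges += p.size();
        }
        return (double) edges / parents.size();
    }

    public static void main(String[] args) {
        CategoryGraphStats g = new CategoryGraphStats();
        g.addEdge("German linguists", "Linguists");
        g.addEdge("German linguists", "German scientists");
        g.addEdge("Linguists", "People");
        System.out.println(g.averageOutDegree()); // 3 edges over 4 nodes
    }
}
```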

If Wikipedia is to be used as a lexical semantic resource in large-scale NLP tasks, efficient programmatic access to the knowledge therein is required. We review existing access mechanisms (Riddle, 2006; Sigurbjörnsson et al., 2006; Wikimedia Foundation, 2006) and show that they are limited with respect to either their performance or the access functions provided.

Therefore, we introduce a general-purpose, high-performance Java-based API for Wikipedia that overcomes these limitations and is specifically designed to turn Wikipedia into a lexical semantic resource. We transform the Wikipedia database dump into an optimized representation that can be accessed more efficiently. Information nuggets, such as redirects, are explicitly stored there, whereas they are scattered in the original dump. As a case study, the API is used to perform the graph-theoretic analysis of Wikipedia mentioned above.

The first release of the Wikipedia API provides access to the full set of explicit information encoded in Wikipedia, including articles, links, categories and redirects. In particular, the category system is converted into a graph representation, on which the whole range of standard graph algorithms can be applied, e.g. finding the shortest path between two given categories. We plan to make the API freely available for research purposes. Our ongoing work investigates the use of knowledge in Wikipedia to compute semantic relatedness of words, as well as extensions of the API for mining the lexical semantic knowledge represented in Wikipedia articles.
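The shortest-path use case mentioned above amounts to a breadth-first search over the category graph. The sketch below is an illustration under our own assumptions (category links treated as undirected edges, invented category names), not the API's actual interface:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of shortest-path search on a toy category graph. Category
// links are treated as undirected edges, as is common when measuring
// taxonomic distance; the example edges are invented.
public class CategoryPath {
    private final Map<String, Set<String>> adj = new HashMap<>();

    public void addEdge(String a, String b) {
        adj.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        adj.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Breadth-first search; returns the number of edges on the shortest
    // path between two categories, or -1 if they are not connected.
    public int shortestPathLength(String from, String to) {
        if (from.equals(to)) return 0;
        Map<String, Integer> dist = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            for (String next : adj.getOrDefault(node, Set.of())) {
                if (!dist.containsKey(next)) {
                    dist.put(next, dist.get(node) + 1);
                    if (next.equals(to)) return dist.get(next);
                    queue.add(next);
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        CategoryPath g = new CategoryPath();
        g.addEdge("Zoology", "Biology");
        g.addEdge("Biology", "Science");
        g.addEdge("Physics", "Science");
        System.out.println(g.shortestPathLength("Zoology", "Physics")); // 3
    }
}
```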

References

David Ahn, Valentin Jijkoun, Gilad Mishne, Karin Müller, Maarten de Rijke, and Stefan Schlobach. 2004. Using Wikipedia at the TREC QA Track. In Proceedings of TREC 2004.

Razvan Bunescu and Marius Pasca. 2006. Using Encyclopedic Knowledge for Named Entity Disambiguation. In Proceedings of the 11th Conference of the EACL, pages 9-16, Trento, Italy.

Luciana Buriol, Carlos Castillo, Debora Donato, Stefano Leonardi, and Stefano Millozzi. 2006. Temporal Analysis of the Wikigraph. In Proceedings of Web Intelligence, Hong Kong.

Evgeniy Gabrilovich and Shaul Markovitch. 2006. Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge. In AAAI, pages 1301-1306, Boston, MA.

Todd Holloway, Miran Bozicevic, and Katy Börner. 2005. Analyzing and Visualizing the Semantic Coverage of Wikipedia and Its Authors. ArXiv Computer Science e-prints, cs/0512085.

Tyler Riddle. 2006. Parse::MediaWikiDump. URL http://search.cpan.org/~triddle/Parse-MediaWikiDump-0.40/.

Börkur Sigurbjörnsson, Jaap Kamps, and Maarten de Rijke. 2006. Focused Access to Wikipedia. In Proceedings of DIR 2006.


Jakob Voss. 2006. Collaborative thesaurus tagging the Wikipedia way. ArXiv Computer Science e-prints, cs/0604036.

Wikimedia Foundation. 2006. Wikipedia. URL http://en.wikipedia.org/wiki/Wikipedia:Searching.

Vinko Zlatic, Miran Bozicevic, Hrvoje Stefancic, and Mladen Domazet. 2006. Wikipedias: Collaborative web-based encyclopedias as complex networks. Physical Review E, 74:016115.
