A Hybrid Machine Learning Approach to Information Extraction

Academic year: 2022


Günter Neumann*

German Research Center for Artificial Intelligence (DFKI), LT-Lab, DFKI Saarbrücken, D-66123 Saarbrücken, Germany

Abstract. A central issue in Information Extraction (IE) research is the adaptation of an IE system to a new domain. Since IE systems are by definition domain-specific in order to achieve the desired efficiency and robustness, the mapping of linguistic structures to domain-specific structures is specified in an explicit and direct way. Because of the very idiosyncratic nature of these mappings, they are usually not reusable in other domains or for other text styles. Moreover, it has been shown that manual specification of such mapping rules is very expensive and that it is very hard to keep the mappings up to date. Hence, Machine Learning (ML) methods for the automatic acquisition of such mappings have recently been exploited and systematically evaluated. So far, a number of statistical and symbolic methods have been explored and evaluated, but mainly in non-hybrid environments. Therefore, an interesting question is whether the combination of a stochastic and a symbolic ML method can improve the performance of an IE system. As the basis for our statistical learner we have chosen the Maximum Entropy Modelling (MEM) framework. The symbolic learner is based on our work on data-driven extraction of lexicalized tree grammars. The core idea is to generate trees from the information obtained by applying a shallow parser to the annotated training data. These trees are further generalized by cutting off irrelevant subtrees. Both learning methods are applied independently of each other during the training phase. The application phase is realized as an iterative tag-insertion algorithm, where the tags are determined by the learned mappings. The envisaged hybrid learning behavior is achieved through a voting mechanism, which is applied in each iteration to the tagging results of all active mappings. We have systematically evaluated our approach following the MUC-7 guidelines on a small, manually annotated corpus of German newspaper articles about company turnover (75 documents with a total of 5878 tokens; 60 documents are used for training, 15 for testing). For the template element task, we obtained an F-measure of 85.18% using the hybrid approach, compared to 79.27% for MEM and 51.85% for the symbolic learner when running them in isolation. The overall result is competitive with related work described in the IE literature, mainly for English and using larger document sets than we had available for German, e.g., Chieu & Ng (2002).
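To make the iterative tag-insertion with voting more concrete, the following is a minimal Python sketch under stated assumptions. The abstract does not specify how votes are combined, so the confidence-weighted sum used here is an assumption, not the voting scheme of the actual system. All names (`vote`, `tag_iteratively`, `stat_learner`, `sym_learner`) and the toy rules and tag labels are hypothetical stand-ins for the learned MEM and tree-grammar mappings.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

# A proposal is (token_index, tag, confidence). Hypothetical representation.
Proposal = Tuple[int, str, float]
Learner = Callable[[List[str], Dict[int, str]], List[Proposal]]


def vote(proposals: List[Proposal]) -> Optional[Tuple[int, str]]:
    """Sum confidences per (position, tag) and return the best-scoring pair.

    Assumption: votes are combined by a confidence-weighted sum.
    """
    scores: Dict[Tuple[int, str], float] = defaultdict(float)
    for pos, tag, conf in proposals:
        scores[(pos, tag)] += conf
    if not scores:
        return None
    return max(scores, key=scores.get)


def tag_iteratively(tokens: List[str], learners: List[Learner],
                    max_iters: int = 100) -> Dict[int, str]:
    """Insert one tag per iteration until no learner proposes anything new."""
    tags: Dict[int, str] = {}
    for _ in range(max_iters):
        # Collect proposals from all learners for positions not yet tagged.
        proposals = [p for learner in learners
                     for p in learner(tokens, tags) if p[0] not in tags]
        winner = vote(proposals)
        if winner is None:
            break
        pos, tag = winner
        tags[pos] = tag
    return tags


def stat_learner(tokens: List[str], tags: Dict[int, str]) -> List[Proposal]:
    # Toy stand-in for the statistical learner: digits are amounts.
    return [(i, "AMOUNT", 0.6) for i, t in enumerate(tokens) if t.isdigit()]


def sym_learner(tokens: List[str], tags: Dict[int, str]) -> List[Proposal]:
    # Toy stand-in for the symbolic learner: the token before "GmbH" is an ORG.
    return [(i - 1, "ORG", 0.9) for i, t in enumerate(tokens)
            if t == "GmbH" and i > 0]


tokens = ["Muster", "GmbH", "Umsatz", "500"]
result = tag_iteratively(tokens, [stat_learner, sym_learner])
```

Each learner receives the tags inserted so far, so a mapping can in principle condition on earlier decisions; the toy learners above ignore that argument for brevity.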

References

CHIEU & NG (2002): A Maximum Entropy Approach to Information Extraction from Semi-Structured and Free Text. In Proceedings of AAAI-2002.

*Thanks to my student Volker Morbach for his great help during the implementation and evaluation phase of the project.
