Markus Becker and Anette Frank
Language Technology Lab
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
{mbecker,frank}@dfki.de
Abstract

We present a new approach to topological parsing of German which is corpus-based and built on a simple model of probabilistic CFG parsing. The topological field model of German provides a linguistically motivated, flat macro structure for complex sentences. Besides the practical aspect of developing a robust and accurate topological parser for hybrid shallow and deep NLP, we investigate to what extent topological structures can be handled by context-free probabilistic models. We discuss experiments with systematic variants of a topological treebank grammar, which yield competitive results.
1 Introduction

We present a new approach to topological parsing for German which is corpus-based and built on a simple model of probabilistic CFG parsing. Topological parsing is of special interest for shallow pre-processing of languages like German, which exhibit free word order and the so-called verb-second (V2) property. The topological field model (Höhle, 1983) is a theory-neutral model of clausal syntax that provides a linguistically well-motivated, flat macro structure for complex sentences. As opposed to chunk-based partial parsing, the topological model is compatible with deep syntactic analysis, and thus perfectly suited for integrated shallow and deep NLP, by guiding deep syntactic analysis by partial, topological bracketing (Crysmann et al., 2002), or for pre-structuring of complex sentences for chunk-based processing (Neumann et al., 2000), as a divide-and-conquer strategy.

[1] The ideas that led to this paper grew from discussions with Feiyu Xu and Jakub Piskorski. The work was in part supported by a BMBF grant to the DFKI project whiteboard (FKZ 01 IW 002). Special thanks go to Bernd Kiefer for providing us with a CFG parser and for his support in technical issues, and to Hubert Schlarb and Holger Neis for manual correction of our test data.
Previous approaches to topological parsing of German make use of hand-coded grammars (Wauschkuhn, 1996; Braun, 1999). In this paper we pursue a corpus-based, statistical approach, aiming at a robust parser with high accuracy. We make use of a treebank-induced probabilistic non-lexicalised CFG, following (Charniak, 1996). While this simple model is clearly outperformed by more refined stochastic models for full constituent-structure parsing,[2] our experiment is interesting in showing that for topological parsing a robust parser with high accuracy figures can be obtained with a standard stochastic model of non-lexicalised context-free treebank grammars.
Topological structures are partial or underspecified in that they do not encode internal structure and demarcation of subsentential constituents, i.e. NP, AP, PP or VP constituents. Topological base clauses are characterised by morphological and categorial properties. Still, the topological parsing task is not trivial, in that the boundaries and relative embedding of base clauses and the demarcation of fields in general are not deterministic, nor strictly lexically or semantically determined. Thus, the complexity of topological parsing lies somewhere between chunk parsing and full constituent-structure parsing. The interesting question we are exploring in our approach is whether this type of syntactic structure can be successfully dealt with using a non-lexicalised PCFG model.
The aim of this paper is three-fold. Besides the practical aspect of (i) developing a robust and accurate topological parser for integration with deep syntactic analysis or for cascaded shallow analysis systems, we (ii) investigate how well topological structures can be modeled by context-free probabilistic grammars, while (iii) trying to detect specific phenomena that require more sophisticated models.

[2] E.g. (Collins, 1997) and later work, see (Belz, 2001).
The paper is structured as follows. In Section 2 we present the field model for German and describe the creation of a topologically structured treebank, which we derive from the negra corpus (Brants et al., 1997). Section 3 discusses previous work. Section 4 describes our corpus-based stochastic approach to topological parsing. In Section 5 we introduce formal variants of our treebank grammar, which illustrate problematic aspects in topological stochastic parsing, and possible strategies to their solution. Section 6 presents the testing setup and evaluation results for different grammar variants. The results are analysed in detail in Section 7. Section 8 concludes.
2 A Topologial Corpus of German
German sentence structure is traditionally analysed in terms of its "field" or topological structure, which is determined by the position of the finite verb in left (LB) or right (RB) bracket position (1). In main clauses the finite verb typically occupies the second constituent position, following the so-called "Vorfeld" (VF) (V2 clauses). The Vorfeld can be missing in yes/no questions or embedded conditional clauses (V1 clauses), as well as in subordinate clauses with complementizer. In subordinate clauses the complementizer (or wh-/rel-phrase) demarcates the LB position, the finite verb is in RB position (VL clauses). Arguments and modifiers between LB and RB occupy the "middle field" (MF), extraposed material is found to the right of the right bracket, in the "Nachfeld" (NF).
(1)
     Vorfeld (VF)   Left Bracket (LB)   Middle Field (MF)   Right Bracket (RB)       Nachfeld (NF)
V2   topic/wh-phr.  finite verb         args/adjs           (verbal complex)         extraposed constituents
V1   -              finite verb         args/adjs           (verbal complex)         extraposed constituents
VL   -              compl/wh-phr./      args/adjs           (verbal complex)         extraposed constituents
                    rel-phr.                                + finite verb
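The field patterns in (1) can be summarised as a small decision procedure over bracket fillers. This is an illustrative sketch only; the function and value names are ours, not part of the model:

```python
from typing import Optional

def clause_type(has_vorfeld: bool, left_bracket: Optional[str]) -> str:
    """Classify a clause's topological type from the field model in (1).

    left_bracket: 'finite-verb', 'complementizer', 'wh-phrase',
                  'rel-phrase', or None.
    """
    if left_bracket == "finite-verb":
        # Finite verb in LB: V2 if a Vorfeld precedes it, else V1.
        return "V2" if has_vorfeld else "V1"
    if left_bracket in ("complementizer", "wh-phrase", "rel-phrase"):
        # Complementizer or wh-/rel-phrase fills LB: the finite verb
        # sits in the right bracket (verb-last).
        return "VL"
    return "unknown"

assert clause_type(True, "finite-verb") == "V2"     # declarative main clause
assert clause_type(False, "finite-verb") == "V1"    # yes/no question, conditional
assert clause_type(False, "complementizer") == "VL" # subordinate clause
```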
Syntactic theories of German build, in one way or the other, on this descriptive model of German sentence structure. It is thus straightforward to define mappings from topological to deep syntactic structures of almost any syntactic framework. Its compatibility with deep syntactic analysis makes topological syntactic structure an ideal candidate for interleaving of shallow and deep NLP (Crysmann et al., 2002).
For our corpus-based approach, no topologically annotated corpus of German was available. The negra treebank (Brants et al., 1997), a large annotated corpus of German newspaper text, follows an annotation scheme which combines structural and dependency annotations. However, the crucial topological clues, in particular the distinction between fronted or clause-final verb position, as well as the delimitation of pre-, middle- and post-fields, are not encoded.
To derive a topological "treebank grammar" from the negra corpus, we applied the treebank conversion method of (Frank, 2000). This method is built on a general tree description language, and allows the definition of fine-grained rules for structure conversion. Conversion rules specify partial structural constraints and actions for tree modifications, which are applied by removing or adding tree description predicates from the trees that satisfy the constraints.
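The conversion mechanism can be sketched roughly as follows. This is a deliberate simplification of the tree-description formalism of (Frank, 2000), not its actual rule language; the tree encoding and the example rule are our own illustration:

```python
# Sketch: a conversion rule pairs a structural constraint with an action,
# applied bottom-up to every subtree satisfying the constraint.
# Trees are (label, children) tuples; leaves are POS tags.

def leaves(tree):
    """Return the terminal (POS) sequence of a subtree."""
    label, children = tree
    out = []
    for c in children:
        out.extend(leaves(c) if isinstance(c, tuple) else [c])
    return out

def convert(tree, rules):
    """Apply each (constraint, action) rule to every matching subtree."""
    label, children = tree
    children = [convert(c, rules) if isinstance(c, tuple) else c
                for c in children]
    node = (label, children)
    for constraint, action in rules:
        if constraint(node):
            node = action(node)  # e.g. relabel, add or remove structure
    return node

# Hypothetical rule: a clause whose first terminal is a relative pronoun
# (PRELS) is indirect evidence for a relative clause; relabel it CL-REL.
rules = [(lambda n: n[0] == "S" and leaves(n)[:1] == ["PRELS"],
          lambda n: ("CL-REL", n[1]))]

tree = ("S", [("NP", ["PRELS"]), ("VP", ["PRF", "VVPP", "VVFIN"])])
assert convert(tree, rules)[0] == "CL-REL"
```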
We derived a topological corpus from the negra treebank, by defining linguistically informed conversion rules which exploit additional annotations in the corpus, i.e. indirect linguistic evidence, to assign topological clues. In a second step we induced topological structures by flattening irrelevant internal structure within topological fields and introducing topological category nodes DF, VF, MF, and NF as well as LB and RB for left and right brackets.[4] Basic clauses are marked with labels CL which expand to various patterns of DF, VF, LB, MF, RB, and NF nodes. Basic clauses can be embedded within phrasal fields VF, MF, NF. The resulting structures give (i) an internal structure of basic clauses in terms of fields which are internally flattened to POS sequences, and (ii) an overall hierarchical structure of clausal embedding, including coordination. (2) gives an example of a complex topological structure. It illustrates the use of parameterised category nodes, which distinguish various types of clauses: CL-V2, -V1, -INF, -REL, -WH, pre-fields: VF-TOPIC, -WH, -REL, left: LB-COMPL, -VFIN and right brackets: RB-VFIN, -VINF, -VPART, -PTK.

(2) [CL-V2 [VF-TOPIC [CL-WH [VF-WH Wie/PWAV] [MF BBC/NE] [RB-VFIN meldete/VVFIN]]] , [LB-VFIN wies/VVFIN] [MF Souza/NE die/ART Polizei/NN] [RB-PTK an/PTKVZ] [NF [CL-INF , [MF den/ART Häuptling/NN] [RB-VINF zu/PTKZU fassen/VVINF] [NF [CL-REL , [VF-REL der/PRELS] [MF sich/PRF versteckt/VVPP] [RB-VFIN hält/VVFIN]]]]]]

'As BBC reported, Souza ordered the police to catch the chieftain who keeps himself hidden.'

[4] DF marks a special "discourse field" preceding VF, as in Naja, er kommt halt später ('Well, he will come later').
The automatically derived topological corpus is used for extraction of a stochastic treebank grammar with reserved development and test sections. The test corpus was manually checked and corrected by two independent annotators. Manual correction of the test section yielded 93.0% labelled precision and 93.7% labelled recall of the automatic conversion procedure.
3 Topologial Parsing of German
While partial parsers for detection of clausal structure are now available in many varieties and for many languages,[5] this type of parsing approach has always been considered difficult for languages like German. (Wauschkuhn, 1996) was among the first to present a partial parser for German. In a first step, the coarse syntactic clause structure is detected, using indicators like verbs, conjunctions, punctuation, etc. A fine-grained analysis is carried out in the second step, by grouping the remaining fields into sequences of minimal "base" NPs or PPs. The analysis is still partial in that attachments of base NPs and PPs are not determined. The grammar is defined as a CFG with feature structures, where grammar rules are annotated with manually adjusted weights for parse ranking. Grammar rules, including the associated weights, are hand-coded. (Wauschkuhn, 1996) reports coverage of 85.7% for clausal analysis. No figures are given for precision or recall.
(Braun, 1999; Neumann et al., 2000) report an approach to topological parsing of German, based on cascaded finite state automata. In a first pass, possible verb groups are identified. A second pass identifies subordinate clause structures, using similar cues as (Wauschkuhn, 1996). (Braun, 1999) carried out an evaluation over 400 sentences and reports coverage of 94.3%, precision of 89.7% and recall of 84.75%.

[5] See for example (Ait-Mokhtar and Chanod, 1997; Gala-Pavia, 1999).
While these approaches are similar to our work in inducing topological structure from key linguistic indicators, they suffer from several problems. (i) Hand-coding of rules is laborious and likely to miss out rare or exceptional phenomena, including ungrammatical constructions. (ii) Ambiguities are either resolved by manually assigned weights, or simply by using a greedy strategy (Braun, 1999). (iii) These approaches heavily exploit prescriptive punctuation rules. This is problematic for performance-influenced deviations from standard punctuation or for less standardised text sorts, leading to a loss of either coverage or accuracy.
4 A Stochastic Topological Parser

In response to these problems we investigate a corpus-based, stochastic approach to topological parsing. It has been demonstrated[2] that stochastic parsing can achieve high figures of robustness and accuracy, while mostly restricted to purely constituent-based syntactic analysis.

For our task of topological parsing, we investigate the adequacy of the very simple, non-lexicalised model of (Charniak, 1996), if applied to rather flat, topological structures. Our working hypothesis was that the model should perform well, even if not lexicalised, since (i) there are fewer attachment decisions, due to the rather flat target structures. (ii) Topological structures as such, as well as attachment decisions for base clauses, are less dependent on lexical information than, e.g., attachment of PPs. Finally, (iii) a corpus-based stochastic grammar has a better chance of covering rare or exceptional constructions and performance-influenced input.
Following the method of (Charniak, 1996) we extract a context free grammar from the corpus described in Section 2. From this grammar we derive formal grammar variants (see Section 5). Rule probabilities are estimated using maximum likelihood. We employ a flexible and efficient CFG chart parser (Kiefer and Scherf, 1996), which we extended to manage rule probabilities. Currently, we let the parser compute the full search space. N-best parse trees are efficiently determined by applying the Viterbi algorithm over packed tree structures.
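The extraction and estimation step can be sketched as follows. This is a simplified illustration of treebank-grammar reading in the style of (Charniak, 1996); the tree encoding and toy data are our own, and the actual system additionally derives the grammar variants of Section 5:

```python
# Sketch: read off CFG rules from treebank trees and estimate rule
# probabilities by maximum likelihood. Trees are (label, children)
# tuples; leaves are POS tags (the parser input is a POS sequence).

from collections import Counter

def rules_of(tree):
    """Yield (lhs, rhs) for every local tree (one rule per node)."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield (label, rhs)
    for c in children:
        if isinstance(c, tuple):
            yield from rules_of(c)

def pcfg(treebank):
    """Maximum likelihood: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_count, lhs_count = Counter(), Counter()
    for tree in treebank:
        for lhs, rhs in rules_of(tree):
            rule_count[(lhs, rhs)] += 1
            lhs_count[lhs] += 1
    return {r: n / lhs_count[r[0]] for r, n in rule_count.items()}

toy = [("CL-V2", [("VF", ["ART", "NN"]), ("LB-VFIN", ["VVFIN"]), ("MF", ["ART", "NN"])]),
       ("CL-V2", [("VF", ["PPER"]), ("LB-VFIN", ["VVFIN"]), ("MF", ["ART", "NN"])])]
probs = pcfg(toy)
assert probs[("VF", ("ART", "NN"))] == 0.5   # one of two VF expansions
assert probs[("MF", ("ART", "NN"))] == 1.0   # the only MF expansion seen
```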
5 Variations of Topological Grammars

As part of our experimental setup we induce formal variants of the topological treebank grammar. The aim is to explore different strategies, or 'models', and how well they perform in terms of coverage and accuracy.[7] These grammar variants illustrate problematic aspects in topological stochastic parsing, and strategies to their solution. In particular, we discuss (a) parameterisation of field categories, (b) alternative approaches to punctuation, (c) the use of binary field structures to address sparseness problems, and (d) the effects of grammar pruning.
(a) Parameterised categories Our topological corpus defines maximally informative structures where topological categories are associated with more fine-grained syntactic labels. For instance, relative clauses, which dominate a finite right bracket daughter RB-VFIN, are marked CL-REL, as opposed to verb-second clauses CL-V2 with finite left bracket (LB-VFIN) (see (2)). A VF category that contains a relative pronoun will be marked VF-REL. Such fine-grained labels implicitly encode a larger syntactic context (cf. (Belz, 2001)): for example, a relative pronoun in VF-REL predicts (through co-occurrence data in the corpus) that it is dominated by a grandfather category CL-REL, which takes a right bracket daughter RB-VFIN, as opposed to a left bracket daughter.

We extract grammar variants with and without parameterised categories, to investigate to which extent a more fine-grained and implicitly contextualised label set can improve accuracy in a topological model of syntax.

[7] Henceforth we use accuracy as a measure for both precision and recall.
(b) Punctuation The maximal decoration of a tree contains punctuation marks like commas, quotes, colons, etc. While the correct attachment of punctuation marks is not part of our evaluation, the guiding intuition was that punctuation should help to identify clause boundaries. On the other hand, irregularities in punctuation setting cause noise in the data, increase grammar size, and could cause coverage problems. We compare the performance of grammar variants with and without punctuation.
(c) Binarisation Phrasal topological fields VF, MF, NF are underspecified for constituent boundaries of NPs, PPs, etc. The fields are radically flattened, directly expanding to sequences of POS categories. We expect a great variety of POS sequences as expansions of field categories, but at the same time reckon with considerable sparseness problems, due to unseen POS sequences.
To address this problem, we introduce (right-branching) binary field structures. The flat structure for the two constituents Souza die Polizei in (2) is transformed to the tree (3). Learning rules from binary subtrees effectively induces a unigram language model where the number of "cells" corresponds to the rather small number of POS categories. Again, we experiment with flat vs. binary grammar versions, to test their respective coverage and accuracy.
(3) [MF Souza/NE [MF die/ART [MF Polizei/NN]]]
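The right-branching transformation behind (3) can be sketched as follows (the encoding is ours; a flat field rule over a POS sequence becomes a chain of binary rules, one per POS tag):

```python
# Sketch: binarise a flat field expansion into a right-branching chain,
# e.g. MF -> NE ART NN  becomes  (MF NE (MF ART (MF NN))), as in (3).

def binarise(label, pos_seq):
    head, *rest = pos_seq
    if not rest:
        return (label, [head])           # base case: unary field rule
    return (label, [head, binarise(label, rest)])

assert binarise("MF", ["NE", "ART", "NN"]) == \
    ("MF", ["NE", ("MF", ["ART", ("MF", ["NN"])])])
```

Each binary rule MF -> POS MF now only conditions on a single POS tag, which is why the resulting model behaves like a unigram model over the small POS tagset.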
(d) Pruning Due to automatic transformation, the topological corpus contains some ill-formed structures. We test whether noise in the grammar can be reduced by pruning single occurrences of rules. We compare the performance of pruned and unpruned grammars.
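The pruning step can be sketched as follows (our own encoding; we assume maximum-likelihood probabilities are simply re-estimated over the surviving rule counts):

```python
# Sketch: drop every rule observed only once in training, then
# renormalise the maximum-likelihood estimates per left-hand side.

from collections import Counter

def prune_singletons(rule_count):
    """rule_count maps (lhs, rhs) -> frequency in the training corpus."""
    kept = {r: n for r, n in rule_count.items() if n > 1}
    lhs_total = Counter()
    for (lhs, _), n in kept.items():
        lhs_total[lhs] += n
    return {r: n / lhs_total[r[0]] for r, n in kept.items()}

counts = Counter({("MF", ("NE",)): 5,        # frequent, kept
                  ("MF", ("FM", "FM")): 1,   # singleton, pruned as noise
                  ("NF", ("CL-REL",)): 3})
pruned = prune_singletons(counts)
assert ("MF", ("FM", "FM")) not in pruned
assert pruned[("MF", ("NE",))] == 1.0
```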
6 Evaluation

Experimental setup The negra corpus was split into randomised sections for training (16476), development (1000) and testing (1058), plus further held-out data for later experiments. For training and development we used the automatically derived topological corpus, while the test data was manually corrected (Section 2).

To test the performance of the grammar independently from a tagger, the input to the parser consists of the manually disambiguated POS sequences of the test corpus.[9]
Evaluation measures For evaluation we employ the PARSEVAL measures of labeled recall and precision and crossing brackets, as well as complete match, i.e. full structure identity.[10] To accommodate for the differences between grammar versions, evaluation was conducted as follows. The evaluation measures in Tables 1 and 2 disregard punctuation and are based on simple node labels, i.e. category parameters are stripped. Finally, to allow clear comparison between binarised and flat grammar versions, binarised parse trees are compiled to flat trees before evaluation against flat target trees.[11]
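This evaluation normalisation can be sketched as follows (our own encoding: strip category parameters such as CL-REL to CL, and merge the same-label chains that binarisation introduced in field categories back into flat expansions):

```python
# Sketch: normalise a parse tree for evaluation by (i) stripping
# category parameters (CL-REL -> CL, RB-VFIN -> RB) and (ii) compiling
# right-branching binarised fields back to flat POS sequences.

FIELDS = {"DF", "VF", "LB", "MF", "RB", "NF"}

def normalise(tree):
    label, children = tree
    base = label.split("-")[0]            # drop the parameter suffix
    flat = []
    for c in children:
        if isinstance(c, tuple):
            c = normalise(c)
            # Merge a same-label field child produced by binarisation;
            # clausal embedding (CL under CL) is left intact.
            if c[0] == base and base in FIELDS:
                flat.extend(c[1])
                continue
        flat.append(c)
    return (base, flat)

binarised = ("MF", ["NE", ("MF", ["ART", ("MF", ["NN"])])])
assert normalise(binarised) == ("MF", ["NE", "ART", "NN"])
assert normalise(("CL-REL", [("VF-REL", ["PRELS"])])) == ("CL", [("VF", ["PRELS"])])
```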
Results We conducted systematic tests for all combinations of grammar variants: para (parameterised categories), bin (binarised), pnt (punctuation), prun (pruning single rule occurrences); see results in Table 1.

Tables 2 and 3 give more detailed evaluation figures for the best performing model (v1) para+.bin+.pnt+.prun+. Table 2 lists labeled recall and precision results for individual topological categories. Field categories VF...NF receive high figures above 90%, to the exception of NF, yet with lower overall proportion (quota).

Table 3 reports alternative evaluation figures, namely evaluation by disregarding category parameters (param-), or by evaluating on complex category labels (param+); and by taking or not punctuation into account (punct+/-).

Finally, Fig. 4 displays a learning curve for stepwise extension of the training corpus.
[9] 8 sentences were set apart due to wrong POS tags.
[10] We verified our results using the evaluation tool evalb by Satoshi Sekine: http://www.cs.nyu.edu/cs/projects/proteus/evalb/.
[11] Evaluating labeled recall and precision on binarised trees would yield unduly high figures, due to the high number of binary nodes.
7 Analysis of Results

Table 1 shows better performance of grammars v1-8 using parameterised categories, as opposed to the complementary versions v9-16. Parameterised grammars make use of a richer structure, which is mapped to coarser topological categories for evaluation.[12] The implicit contextualisation in category labels clearly improves parsing results. While the rule set grows, a relative loss of coverage is only visible for non-binarised versions v5-8 as opposed to v13-16.
Binarisation shows dramatic effects in coverage and accuracy. Binarised grammars are smaller than their flat counterparts, but far less constrained, allowing the derivation of virtually any POS sequence. Flat grammars suffer from lack of coverage, especially those using rich category labels and/or punctuation. We see dramatic differences of about 100% complete match improvement between v6/v2, v8/v4, v16/v12, and significant contrasts in LP/LR and CB measures. Thus, binarisation solves the sparseness problem for flat topological CFGs without jeopardising accuracy.
Using punctuation in parsing leads to improved accuracy measures, yet only in binarised grammars, where sparseness problems are circumvented. Flat grammars with punctuation show lower coverage than their counterparts; higher accuracy measures are probably due to lower coverage. Use of punctuation is similar to parameterisation of labels, in that grammar-internally it helps to discriminate fields, while for evaluation it is filtered from the parse trees.
Pruning of single rule occurrences leads to significant reduction in grammar size, in particular for non-binarised grammars. Here, pruning incurs significant loss in coverage. This is expected, since extremely flat rules are likely not to re-occur several times. For binarised grammars pruning yields rule sets of about 1/3, with almost unchanged 100% coverage. Our hypothesis was that pruning improves the quality of the grammar by eliminating noise imported by automatic treebank conversion. This is confirmed, in all binary grammars, by improved accuracy measures. Since in binary grammars generic field rules are binarised and frequently occurring, rule pruning is likely to eliminate noise.
[12] Thus, parameterisation corresponds to the notion of [...]
 #  grammar variant            size   cov.%   len  match%   len    LP%    LR%   0CB%   2CB%
 1  para+.bin+.pnt+.prun+
    a) len ≤40                  867   100.0  14.6   80.4   13.1   93.4   92.9   92.1   98.9
    b) all                      867    99.8  15.9   78.6   13.7   92.4   92.2   90.7   98.5
 2  para+.bin+.pnt+.prun-      2308    99.9  14.6   79.1   13.0   93.3   92.7   92.1   99.1
 3  para+.bin+.pnt-.prun+       679   100.0  14.6   80.8   13.1   92.8   91.7   89.1   98.0
 4  para+.bin+.pnt-.prun-      1917    99.9  14.6   79.6   13.0   92.2   91.5   89.0   97.9
 5  para+.bin-.pnt+.prun+      2962    57.5  10.3   49.7    5.7   63.2   79.9   59.3   87.6
 6  para+.bin-.pnt+.prun-     19536    88.4  13.6   37.5    6.5   54.0   73.1   48.0   78.8
 7  para+.bin-.pnt-.prun+      2839    67.2  11.6   45.8    6.0   59.8   76.5   52.7   83.3
 8  para+.bin-.pnt-.prun-     18365    92.5  13.9   38.9    6.8   55.2   73.6   47.5   78.6
 9  para-.bin+.pnt+.prun+       634   100.0  14.6   74.9   12.4   89.3   89.0   87.5   97.9
10  para-.bin+.pnt+.prun-      1827    99.9  14.6   72.7   12.3   88.3   88.2   86.7   97.7
11  para-.bin+.pnt-.prun+       489   100.0  14.6   71.6   11.9   86.0   84.5   80.6   95.7
12  para-.bin+.pnt-.prun-      1528    99.9  14.5   70.4   11.8   85.6   84.3   80.9   95.4
13  para-.bin-.pnt+.prun+      2756    76.4  12.8   37.4    5.6   53.4   71.7   46.6   80.1
14  para-.bin-.pnt+.prun-     18979    94.9  14.2   34.6    6.4   53.4   71.5   46.9   80.4
15  para-.bin-.pnt-.prun+      2675    80.4  13.3   36.9    5.8   53.2   71.1   45.7   80.5
16  para-.bin-.pnt-.prun-     17885    96.6  14.2   35.4    6.5   53.7   70.7   46.8   82.3

Table 1: Results for systematic grammar variations (trained on 16476 sents.; sentence length ≤40, except 1b)
                LP               LR
Category     in%    quota     in%    quota
CL          88.9     24.3    92.2     23.2
MF          93.2     23.8    93.1     23.7
LB          99.6     17.9    99.4     17.8
VF          96.1     16.3    91.8     16.9
RB          96.3     13.7    95.8     13.7
NF          82.6      3.6    73.4      4.1
S            4.8      0.3     5.3      0.3
DF          16.7      0.1     6.7      0.2
all         93.4    100.0    92.9    100.0
Table 2: Category-specific evaluation (v1, ≤40)[13]
     eval          perf. match     LP      LR
param   punct      in%     len    in%     in%
  -       -        80.4   13.1   93.4    92.9
  +       -        79.6   13.1   92.7    92.2
  -       +        78.5   12.8   92.1    91.6
  +       +        77.7   12.8   91.5    91.0

Table 3: Different evaluation schemes (v1, ≤40)
In sum, our best performing model (v1) makes use of a maximally discriminative symbolic grammar (parameterised categories, punctuation), resolves sparseness problems by rule binarisation, and can afford rule pruning to eliminate noise. Applied to full sentence lengths (v1b) we note a drop in performance, but insignificantly so for coverage, and only by 1% in LP and 0.7% in LR.

[13] S-categories were used for non-standard base clauses [...]
Table 3 details alternative evaluation measures. Evaluation on parameterised categories incurs a slight drop in accuracy, but in high ranges.[14] Evaluation of punctuation attachment, which is of little importance, yields a further drop.
The learning curve in Fig. 4 is surprising in that we obtain relatively high performance from rather small training corpora and grammar sizes (size grows almost linearly from 313 to 2308).[15] Saturation regarding coverage and accuracy is obtained around training size 6000.
Finally, we determined phenomena that call for stronger contextualisation or lexicalisation. A case in point are verb-second (V2) sentences with a fronted V2 clause in Vorfeld position (i.e. with VF-V2 categories), which allow an alternative analysis as coordinate clauses with shared subjects. This type of construction was frequently mis-analysed as a coordination structure since this structural ambiguity cannot be resolved on the basis of morphological or topological criteria. A promising strategy to enhance our model is (targeted) lexicalisation, as these constructions typically occur with a specific type of "reporting" verbs.

[14] These measures are relevant for integration of shallow and deep NLP (Crysmann et al., 2002), as parameterised categories provide highly discriminative information that can be used to guide deep syntactic processing.
[15] Note, however, that the curve pertains to a robust, binarised grammar. We chose v2 (prun-) in order not to unduly penalise small grammars. Lack of pruning could [...]

[Figure 4: Learning curve (version v2). x-axis: number of training sentences (0 to 16000); y-axis (76 to 100%): labeled precision, labeled recall, coverage, exact match.]
8 Conclusion and Future Work

We presented a topological parser for German, using a standard PCFG model trained on an annotated corpus. We have shown that for the task of topological parsing a non-lexicalised PCFG model yields competitive results. We investigated various grammar versions to illustrate problematic aspects in stochastic topological parsing. Category parameterisation (i.e. contextualisation) and punctuation were shown to increase accuracy. Binarisation results in high coverage figures. Pruning of single rule occurrences eliminates noise in the automatically constructed training corpus.
The complexity of topological parsing lies somewhere between the complexity of chunk parsing and full constituent structure parsing. Our results indicate that a standard PCFG model is appropriate for the chosen task, but could possibly be enhanced by lexicalisation.
In future work we will explore extension to a lexicalised model, and investigate cascaded stochastic parsing, by applying a specialised stochastic chunk parsing model to phrasal fields, to obtain full constituent structure parses. Further we will integrate the TnT tagger (Brants, 2000), evaluate the parser's robustness with respect to tagging errors, and extend the model to a free parsing architecture.
References

S. Ait-Mokhtar and J. Chanod. 1997. Incremental Finite-State Parsing. In Proceedings of ANLP-97.

A. Belz. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pp. 46-57.

T. Brants, W. Skut, and B. Krenn. 1997. Tagging Grammatical Functions. In Proceedings of EMNLP, Providence, RI, USA.

T. Brants. 1997. Internal and external tagsets in part-of-speech tagging. In Proceedings of Eurospeech, Rhodes, Greece.

T. Brants. 2000. TnT - A Statistical Part-of-Speech Tagger. In Proceedings of the ANLP-2000, Rhodes, Greece.

C. Braun. 1999. Flaches und robustes Parsen Deutscher Satzgefüge. Master's thesis, Saarland University.

E. Charniak. 1996. Tree-bank Grammars. In AAAI-96. Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 1031-1036. MIT Press.

M. Collins. 1997. Three generative models for statistical parsing. In Proceedings of the ACL-97, pp. 16-23.

B. Crysmann, A. Frank, B. Kiefer, S. Müller, G. Neumann, J. Piskorski, U. Schäfer, M. Siegel, H. Uszkoreit, F. Xu, M. Becker, and H.-U. Krieger. 2002. An Integrated Architecture for Deep and Shallow Processing. In Proceedings of ACL 2002, University of Pennsylvania, Philadelphia.

A. Frank. 2000. Automatic F-structure Annotation of Treebank Trees. In M. Butt and T.H. King (eds), Proceedings of the LFG00 Conference, CSLI Online Publications, Stanford, CA.

N. Gala-Pavia. 1999. Using the Incremental Finite-State Architecture to create a Spanish Shallow Parser. In Proceedings of XV Congres of SEPLN, Lleida, Spain.

T. Höhle. 1983. Topologische Felder. University of Cologne.

B. Kiefer and O. Scherf. 1996. Gimme more HQ parsers. The generic parser class of DISCO. Ms., DFKI, Saarbrücken, Germany.

G. Neumann, C. Braun, and J. Piskorski. 2000. A Divide-and-Conquer Strategy for Shallow Parsing of German Free Texts. In Proceedings of ANLP, pp. 239-246, Seattle, Washington.

O. Wauschkuhn. 1996. Ein Werkzeug zur partiellen syntaktischen Analyse deutscher Textkorpora. In D. Gibbon (ed), Proceedings of the Third KONVENS Conference, pp. 356-368, Berlin. Mouton de Gruyter.