Markus Becker and Anette Frank
Language Technology Lab
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
{mbecker,frank}@dfki.de
Abstract

We present a new approach to topological parsing of German which is corpus-based and built on a simple model of probabilistic CFG parsing. The topological field model of German provides a linguistically motivated, flat macro structure for complex sentences. Besides the practical aspect of developing a robust and accurate topological parser for hybrid shallow and deep NLP, we investigate to what extent topological structures can be handled by context-free probabilistic models. We discuss experiments with systematic variants of a topological treebank grammar, which yield competitive results.
1 Introduction

We present a new approach to topological parsing for German which is corpus-based and built on a simple model of probabilistic CFG parsing. Topological parsing is of special interest for shallow pre-processing of languages like German, which exhibit free word order and the so-called verb-second (V2) property. The topological field model (Höhle, 1983) is a theory-neutral model of clausal syntax that provides a linguistically well-motivated, flat macro structure for complex sentences. As opposed to chunk-based partial parsing, the topological model is compatible with deep syntactic analysis, and thus perfectly suited for integrated shallow and deep NLP, by guiding deep syntactic analysis by partial, topological bracketing (Crysmann et al., 2002), or for pre-structuring of complex sentences for chunk-based processing (Neumann et al., 2000), as a divide-and-conquer strategy.

[1] The ideas that led to this paper grew from discussions with Feiyu Xu and Jakub Piskorski. The work was in part supported by a BMBF grant to the DFKI project whiteboard (FKZ 01 IW 002). Special thanks go to Bernd Kiefer for providing us with a CFG parser and for his support in technical issues, and to Hubert Schlarb and Holger Neis for manual correction of our test data.
Previous approaches to topological parsing of German make use of hand-coded grammars (Wauschkuhn, 1996; Braun, 1999). In this paper we pursue a corpus-based, statistical approach, aiming at a robust parser with high accuracy. We make use of a treebank-induced probabilistic non-lexicalised CFG, following (Charniak, 1996). While this simple model is clearly outperformed by more refined stochastic models for full constituent-structure parsing,[2] our experiment is interesting in showing that for topological parsing a robust parser with high accuracy figures can be obtained with a standard stochastic model of non-lexicalised context-free treebank grammars.
Topological structures are partial or underspecified in that they do not encode internal structure and demarcation of subsentential constituents, i.e. NP, AP, PP or VP constituents. Topological base clauses are characterised by morphological and categorial properties. Still, the topological parsing task is not trivial, in that the boundaries and relative embedding of base clauses and the demarcation of fields in general are not deterministic, nor strictly lexically or semantically determined. Thus, the complexity of topological parsing lies somewhere between chunk parsing and full constituent-structure parsing. The interesting question we are exploring in our approach is whether this type of syntactic structure can be successfully dealt with using a non-lexicalised PCFG model.
The aim of this paper is three-fold. Besides the practical aspect of (i) developing a robust and accurate topological parser for integration with deep syntactic analysis or for cascaded shallow analysis systems, we (ii) investigate how well topological structures can be modeled by context-free probabilistic grammars, while (iii) trying to detect specific phenomena that require more sophisticated models.

[2] E.g. (Collins, 1997) and later work, see (Belz, 2001).
The paper is structured as follows. In Section 2 we present the field model for German and describe the creation of a topologically structured treebank, which we derive from the negra corpus (Brants et al., 1997). Section 3 discusses previous work. Section 4 describes our corpus-based stochastic approach to topological parsing. In Section 5 we introduce formal variants of our treebank grammar, which illustrate problematic aspects in topological stochastic parsing, and possible strategies to their solution. Section 6 presents the testing setup and evaluation results for different grammar variants. The results are analysed in detail in Section 7. Section 8 concludes.
2 A Topologial Corpus of German
German sentence structure is traditionally analysed in terms of its "field" or topological structure, which is determined by the position of the finite verb in left (LB) or right (RB) bracket position (1). In main clauses the finite verb typically occupies the second constituent position, following the so-called "Vorfeld" (VF) (V2 clauses). The Vorfeld can be missing in yes/no questions or embedded conditional clauses (V1 clauses), as well as in subordinate clauses with complementizer. In subordinate clauses the complementizer (or wh-/rel-phrase) demarcates the LB position, the finite verb is in RB position (VL clauses). Arguments and modifiers between LB and RB occupy the "middle field" (MF), extraposed material is found to the right of the right bracket, in the "Nachfeld" (NF).
(1)
     Vorfeld (VF)   Left Bracket (LB)   Middle Field (MF)   Right Bracket (RB)       Nachfeld (NF)
V2   topic/wh-phr.  finite verb         args/adjs           (verbal complex)         extraposed constituents
V1   -              finite verb         args/adjs           (verbal complex)         extraposed constituents
VL   -              compl/wh-phr./      args/adjs           (verbal complex)         extraposed constituents
                    rel-phr.                                + finite verb
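The field patterns in (1) can be summarised as a small decision procedure over bracket fillers. This is an illustrative sketch only; the function and value names are ours, not part of the model:

```python
from typing import Optional

def clause_type(has_vorfeld: bool, left_bracket: Optional[str]) -> str:
    """Classify a clause's topological type from the field model in (1).

    left_bracket: 'finite-verb', 'complementizer', 'wh-phrase',
                  'rel-phrase', or None.
    """
    if left_bracket == "finite-verb":
        # Finite verb in LB: V2 if a Vorfeld precedes it, else V1.
        return "V2" if has_vorfeld else "V1"
    if left_bracket in ("complementizer", "wh-phrase", "rel-phrase"):
        # Complementizer or wh-/rel-phrase fills LB: the finite verb
        # sits in the right bracket (verb-last).
        return "VL"
    return "unknown"

assert clause_type(True, "finite-verb") == "V2"     # declarative main clause
assert clause_type(False, "finite-verb") == "V1"    # yes/no question, conditional
assert clause_type(False, "complementizer") == "VL" # subordinate clause
```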
Syntactic theories of German build, in one way or the other, on this descriptive model of German sentence structure. It is thus straightforward to define mappings from topological to deep syntactic structures of almost any syntactic framework. Its compatibility with deep syntactic analysis makes topological syntactic structure an ideal candidate for interleaving of shallow and deep NLP (Crysmann et al., 2002).
For our corpus-based approach, no topologically annotated corpus of German was available. The negra treebank (Brants et al., 1997), a large annotated corpus of German newspaper text, follows an annotation scheme which combines structural and dependency annotations. However, the crucial topological clues, in particular the distinction between fronted or clause-final verb position, as well as the delimitation of pre-, middle- and post-fields, are not encoded.
To derive a topological "treebank grammar" from the negra corpus, we applied the treebank conversion method of (Frank, 2000). This method is built on a general tree description language, and allows the definition of fine-grained rules for structure conversion. Conversion rules specify partial structural constraints and actions for tree modifications, which are applied by removing or adding tree description predicates from the trees that satisfy the constraints.
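The conversion mechanism can be sketched roughly as follows. This is a deliberate simplification of the tree-description formalism of (Frank, 2000), not its actual rule language; the tree encoding and the example rule are our own illustration:

```python
# Sketch: a conversion rule pairs a structural constraint with an action,
# applied bottom-up to every subtree satisfying the constraint.
# Trees are (label, children) tuples; leaves are POS tags.

def leaves(tree):
    """Return the terminal (POS) sequence of a subtree."""
    label, children = tree
    out = []
    for c in children:
        out.extend(leaves(c) if isinstance(c, tuple) else [c])
    return out

def convert(tree, rules):
    """Apply each (constraint, action) rule to every matching subtree."""
    label, children = tree
    children = [convert(c, rules) if isinstance(c, tuple) else c
                for c in children]
    node = (label, children)
    for constraint, action in rules:
        if constraint(node):
            node = action(node)  # e.g. relabel, add or remove structure
    return node

# Hypothetical rule: a clause whose first terminal is a relative pronoun
# (PRELS) is indirect evidence for a relative clause; relabel it CL-REL.
rules = [(lambda n: n[0] == "S" and leaves(n)[:1] == ["PRELS"],
          lambda n: ("CL-REL", n[1]))]

tree = ("S", [("NP", ["PRELS"]), ("VP", ["PRF", "VVPP", "VVFIN"])])
assert convert(tree, rules)[0] == "CL-REL"
```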
We derived a topological corpus from the negra treebank, by defining linguistically informed conversion rules which exploit additional annotations in the corpus, i.e. indirect linguistic evidence, to assign topological clues. In a second step we induced topological structures by flattening irrelevant internal structure within topological fields and introducing topological category nodes DF, VF, MF, and NF as well as LB and RB for left and right brackets.[4] Basic clauses are marked with labels CL which expand to various patterns of DF, VF, LB, MF, RB, and NF nodes. Basic clauses can be embedded within phrasal fields VF, MF, NF. The resulting structures give (i) an internal structure of basic clauses in terms of fields which are internally flattened to POS sequences, and (ii) an overall hierarchical structure of clausal embedding, including coordination. (2) gives an example of a complex topological structure. It illustrates the use of parameterised category nodes, which distinguish various types of clauses: CL-V2, -V1, -INF, -REL, -WH, pre-fields: VF-TOPIC, -WH, -REL, left: LB-COMPL, -VFIN and right brackets: RB-VFIN, -VINF, -VPART, -PTK.

(2) [CL-V2 [VF-TOPIC [CL-WH [VF-WH Wie/PWAV] [MF BBC/NE] [RB-VFIN meldete/VVFIN]]] , [LB-VFIN wies/VVFIN] [MF Souza/NE die/ART Polizei/NN] [RB-PTK an/PTKVZ] [NF [CL-INF , [MF den/ART Häuptling/NN] [RB-VINF zu/PTKZU fassen/VVINF] [NF [CL-REL , [VF-REL der/PRELS] [MF sich/PRF versteckt/VVPP] [RB-VFIN hält/VVFIN]]]]]]

'As BBC reported, Souza ordered the police to catch the chieftain who keeps himself hidden.'

[4] DF marks a special "discourse field" preceding VF, as in Naja, er kommt halt später ('Well, he will come later').
The automatically derived topological corpus is used for extraction of a stochastic treebank grammar with reserved development and test sections. The test corpus was manually checked and corrected by two independent annotators. Manual correction of the test section yielded 93.0% labelled precision and 93.7% labelled recall of the automatic conversion procedure.
3 Topologial Parsing of German
While partial parsers for detection of clausal structure are now available in many varieties and for many languages,[5] this type of parsing approach has always been considered difficult for languages like German. (Wauschkuhn, 1996) was among the first to present a partial parser for German. In a first step, the coarse syntactic clause structure is detected, using indicators like verbs, conjunctions, punctuation, etc. A fine-grained analysis is carried out in the second step, by grouping the remaining fields into sequences of minimal "base" NPs or PPs. The analysis is still partial in that attachments of base NPs and PPs are not determined. The grammar is defined as a CFG with feature structures, where grammar rules are annotated with manually adjusted weights for parse ranking. Grammar rules, including the associated weights, are hand-coded. (Wauschkuhn, 1996) reports coverage of 85.7% for clausal analysis. No figures are given for precision or recall.
(Braun, 1999; Neumann et al., 2000) report an approach to topological parsing of German, based on cascaded finite state automata. In a first pass, possible verb groups are identified. A second pass identifies subordinate clause structures, using similar cues as (Wauschkuhn, 1996). (Braun, 1999) carried out an evaluation over 400 sentences and reports coverage of 94.3%, precision of 89.7% and recall of 84.75%.

[5] See for example (Ait-Mokhtar and Chanod, 1997; Gala-Pavia, 1999).
While these approaches are similar to our work in inducing topological structure from key linguistic indicators, they suffer from several problems. (i) Hand-coding of rules is laborious and likely to miss out rare or exceptional phenomena, including ungrammatical constructions. (ii) Ambiguities are either resolved by manually assigned weights, or simply by using a greedy strategy (Braun, 1999). (iii) These approaches heavily exploit prescriptive punctuation rules. This is problematic for performance-influenced deviations from standard punctuation or for less standardised text sorts, leading to a loss of either coverage or accuracy.
4 A Stochastic Topological Parser

In response to these problems we investigate a corpus-based, stochastic approach to topological parsing. It has been demonstrated[2] that stochastic parsing can achieve high figures of robustness and accuracy, while mostly restricted to purely constituent-based syntactic analysis.

For our task of topological parsing, we investigate the adequacy of the very simple, non-lexicalised model of (Charniak, 1996), if applied to rather flat, topological structures. Our working hypothesis was that the model should perform well, even if not lexicalised, since (i) there are fewer attachment decisions, due to the rather flat target structures. (ii) Topological structures as such, as well as attachment decisions for base clauses, are less dependent on lexical information than, e.g., attachment of PPs. Finally, (iii) a corpus-based stochastic grammar has a better chance of covering rare or exceptional constructions and performance-influenced input.
Following the method of (Charniak, 1996) we extract a context free grammar from the corpus described in Section 2. From this grammar we derive formal grammar variants (see Section 5). Rule probabilities are estimated using maximum likelihood. We employ a flexible and efficient CFG chart parser (Kiefer and Scherf, 1996), which we extended to manage rule probabilities. Currently, we let the parser compute the full search space. N-best parse trees are efficiently determined by applying the Viterbi algorithm over packed tree structures.
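The extraction and estimation step can be sketched as follows. This is a simplified illustration of treebank-grammar reading in the style of (Charniak, 1996); the tree encoding and toy data are our own, and the actual system additionally derives the grammar variants of Section 5:

```python
# Sketch: read off CFG rules from treebank trees and estimate rule
# probabilities by maximum likelihood. Trees are (label, children)
# tuples; leaves are POS tags (the parser input is a POS sequence).

from collections import Counter

def rules_of(tree):
    """Yield (lhs, rhs) for every local tree (one rule per node)."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield (label, rhs)
    for c in children:
        if isinstance(c, tuple):
            yield from rules_of(c)

def pcfg(treebank):
    """Maximum likelihood: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_count, lhs_count = Counter(), Counter()
    for tree in treebank:
        for lhs, rhs in rules_of(tree):
            rule_count[(lhs, rhs)] += 1
            lhs_count[lhs] += 1
    return {r: n / lhs_count[r[0]] for r, n in rule_count.items()}

toy = [("CL-V2", [("VF", ["ART", "NN"]), ("LB-VFIN", ["VVFIN"]), ("MF", ["ART", "NN"])]),
       ("CL-V2", [("VF", ["PPER"]), ("LB-VFIN", ["VVFIN"]), ("MF", ["ART", "NN"])])]
probs = pcfg(toy)
assert probs[("VF", ("ART", "NN"))] == 0.5   # one of two VF expansions
assert probs[("MF", ("ART", "NN"))] == 1.0   # the only MF expansion seen
```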
5 Variations of Topological Grammars

As part of our experimental setup we induce formal variants of the topological treebank grammar. The aim is to explore different strategies, or 'models', and how well they perform in terms of coverage and accuracy.[7] These grammar variants illustrate problematic aspects in topological stochastic parsing, and strategies to their solution. In particular, we discuss (a) parameterisation of field categories, (b) alternative approaches to punctuation, (c) the use of binary field structures to address sparseness problems, and (d) the effects of grammar pruning.
(a) Parameterised categories Our topological corpus defines maximally informative structures where topological categories are associated with more fine-grained syntactic labels. For instance, relative clauses, which dominate a finite right bracket daughter RB-VFIN, are marked CL-REL, as opposed to verb-second clauses CL-V2 with finite left bracket (LB-VFIN) (see (2)). A VF category that contains a relative pronoun will be marked VF-REL. Such fine-grained labels implicitly encode a larger syntactic context (cf. (Belz, 2001)): for example, a relative pronoun in VF-REL predicts (through co-occurrence data in the corpus) that it is dominated by a grandfather category CL-REL, which takes a right bracket daughter RB-VFIN, as opposed to a left bracket daughter.

We extract grammar variants with and without parameterised categories, to investigate to which extent a more fine-grained and implicitly contextualised label set can improve accuracy in a topological model of syntax.

[7] Henceforth we use accuracy as a measure for both precision and recall.
(b) Punctuation The maximal decoration of a tree contains punctuation marks like commas, quotes, colons, etc. While the correct attachment of punctuation marks is not part of our evaluation, the guiding intuition was that punctuation should help to identify clause boundaries. On the other hand, irregularities in punctuation setting cause noise in the data, increase grammar size, and could cause coverage problems. We compare the performance of grammar variants with and without punctuation.
(c) Binarisation Phrasal topological fields VF, MF, NF are underspecified for constituent boundaries of NPs, PPs, etc. The fields are radically flattened, directly expanding to sequences of POS categories. We expect a great variety of POS sequences as expansions of field categories, but at the same time reckon with considerable sparseness problems, due to unseen POS sequences.
To address this problem, we introduce (right-branching) binary field structures. The flat structure for the two constituents Souza die Polizei in (2) is transformed to the tree (3). Learning rules from binary subtrees effectively induces a unigram language model where the number of "cells" corresponds to the rather small number of POS categories. Again, we experiment with flat vs. binary grammar versions, to test their respective coverage and accuracy.
(3) [MF Souza/NE [MF die/ART [MF Polizei/NN]]]
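The right-branching transformation behind (3) can be sketched as follows (the encoding is ours; a flat field rule over a POS sequence becomes a chain of binary rules, one per POS tag):

```python
# Sketch: binarise a flat field expansion into a right-branching chain,
# e.g. MF -> NE ART NN  becomes  (MF NE (MF ART (MF NN))), as in (3).

def binarise(label, pos_seq):
    head, *rest = pos_seq
    if not rest:
        return (label, [head])           # base case: unary field rule
    return (label, [head, binarise(label, rest)])

assert binarise("MF", ["NE", "ART", "NN"]) == \
    ("MF", ["NE", ("MF", ["ART", ("MF", ["NN"])])])
```

Each binary rule MF -> POS MF now only conditions on a single POS tag, which is why the resulting model behaves like a unigram model over the small POS tagset.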
(d) Pruning Due to automatic transformation, the topological corpus contains some ill-formed structures. We test whether noise in the grammar can be reduced by pruning single occurrences of rules. We compare the performance of pruned and unpruned grammars.
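The pruning step can be sketched as follows (our own encoding; we assume maximum-likelihood probabilities are simply re-estimated over the surviving rule counts):

```python
# Sketch: drop every rule observed only once in training, then
# renormalise the maximum-likelihood estimates per left-hand side.

from collections import Counter

def prune_singletons(rule_count):
    """rule_count maps (lhs, rhs) -> frequency in the training corpus."""
    kept = {r: n for r, n in rule_count.items() if n > 1}
    lhs_total = Counter()
    for (lhs, _), n in kept.items():
        lhs_total[lhs] += n
    return {r: n / lhs_total[r[0]] for r, n in kept.items()}

counts = Counter({("MF", ("NE",)): 5,        # frequent, kept
                  ("MF", ("FM", "FM")): 1,   # singleton, pruned as noise
                  ("NF", ("CL-REL",)): 3})
pruned = prune_singletons(counts)
assert ("MF", ("FM", "FM")) not in pruned
assert pruned[("MF", ("NE",))] == 1.0
```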
6 Evaluation

Experimental setup The negra corpus was split into randomised sections for training (16476), development (1000) and testing (1058), plus further held-out data for later experiments. For training and development we used the automatically derived topological corpus, while the test data was manually corrected (Section 2).

To test the performance of the grammar independently from a tagger, the input to the parser consists of the manually disambiguated POS sequences of the test corpus.[9]
Evaluation measures For evaluation we employ the PARSEVAL measures of labeled recall and precision and crossing brackets, as well as complete match, i.e. full structure identity.[10] To accommodate for the differences between grammar versions, evaluation was conducted as follows. The evaluation measures in Tables 1 and 2 disregard punctuation and are based on simple node labels, i.e. category parameters are stripped. Finally, to allow clear comparison between binarised and flat grammar versions, binarised parse trees are compiled to flat trees before evaluation against flat target trees.[11]
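This evaluation normalisation can be sketched as follows (our own encoding: strip category parameters such as CL-REL to CL, and merge the same-label chains that binarisation introduced in field categories back into flat expansions):

```python
# Sketch: normalise a parse tree for evaluation by (i) stripping
# category parameters (CL-REL -> CL, RB-VFIN -> RB) and (ii) compiling
# right-branching binarised fields back to flat POS sequences.

FIELDS = {"DF", "VF", "LB", "MF", "RB", "NF"}

def normalise(tree):
    label, children = tree
    base = label.split("-")[0]            # drop the parameter suffix
    flat = []
    for c in children:
        if isinstance(c, tuple):
            c = normalise(c)
            # Merge a same-label field child produced by binarisation;
            # clausal embedding (CL under CL) is left intact.
            if c[0] == base and base in FIELDS:
                flat.extend(c[1])
                continue
        flat.append(c)
    return (base, flat)

binarised = ("MF", ["NE", ("MF", ["ART", ("MF", ["NN"])])])
assert normalise(binarised) == ("MF", ["NE", "ART", "NN"])
assert normalise(("CL-REL", [("VF-REL", ["PRELS"])])) == ("CL", [("VF", ["PRELS"])])
```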
Results We conducted systematic tests for all combinations of grammar variants: para (parameterised categories), bin (binarised), pnt (punctuation), prun (pruning single rule occurrences); see results in Table 1.

Tables 2 and 3 give more detailed evaluation figures for the best performing model (v1) para+.bin+.pnt+.prun+. Table 2 lists labeled recall and precision results for individual topological categories. Field categories VF...NF receive high figures above 90%, to the exception of NF, yet with lower overall proportion (quota).

Table 3 reports alternative evaluation figures, namely evaluation by disregarding category parameters (param-), or by evaluating on complex category labels (param+); and by taking or not punctuation into account (punct+/-).

Finally, Fig. 4 displays a learning curve for stepwise extension of the training corpus.
[9] 8 sentences were set apart due to wrong POS tags.
[10] We verified our results using the evaluation tool evalb by Satoshi Sekine: http://www.cs.nyu.edu/cs/projects/proteus/evalb/.
[11] Evaluating labeled recall and precision on binarised trees would yield unduly high figures, due to the high number of binary nodes.
7 Analysis of Results

Table 1 shows better performance of grammars v1-8 using parameterised categories, as opposed to the complementary versions v9-16. Parameterised grammars make use of a richer structure, which is mapped to coarser topological categories for evaluation.[12] The implicit contextualisation in category labels clearly improves parsing results. While the rule set grows, a relative loss of coverage is only visible for non-binarised versions v5-8 as opposed to v13-16.
Binarisation shows dramatic effects in coverage and accuracy. Binarised grammars are smaller than their flat counterparts, but far less constrained, allowing the derivation of virtually any POS sequence. Flat grammars suffer from lack of coverage, especially those using rich category labels and/or punctuation. We see dramatic differences of about 100% complete match improvement between v6/v2, v8/v4, v16/v12, and significant contrasts in LP/LR and CB measures. Thus, binarisation solves the sparseness problem for flat topological CFGs without jeopardising accuracy.
Using punctuation in parsing leads to improved accuracy measures, yet only in binarised grammars, where sparseness problems are circumvented. Flat grammars with punctuation show lower coverage than their counterparts; higher accuracy measures are probably due to lower coverage. Use of punctuation is similar to parameterisation of labels, in that grammar-internally it helps to discriminate fields, while for evaluation it is filtered from the parse trees.
Pruning of single rule occurrences leads to significant reduction in grammar size, in particular for non-binarised grammars. Here, pruning incurs significant loss in coverage. This is expected, since extremely flat rules are likely not to re-occur several times. For binarised grammars pruning yields rule sets of about 1/3, with almost unchanged 100% coverage. Our hypothesis was that pruning improves the quality of the grammar by eliminating noise imported by automatic treebank conversion. This is confirmed, in all binary grammars, by improved accuracy measures. Since in binary grammars generic field rules are binarised and frequently occurring, rule pruning is likely to eliminate noise.
[12] Thus, parameterisation corresponds to the notion of [...]
 #  grammar variant            size   cov.%   len  match%   len    LP%    LR%   0CB%   2CB%
 1  para+.bin+.pnt+.prun+
    a) len ≤40                  867   100.0  14.6   80.4   13.1   93.4   92.9   92.1   98.9
    b) all                      867    99.8  15.9   78.6   13.7   92.4   92.2   90.7   98.5
 2  para+.bin+.pnt+.prun-      2308    99.9  14.6   79.1   13.0   93.3   92.7   92.1   99.1
 3  para+.bin+.pnt-.prun+       679   100.0  14.6   80.8   13.1   92.8   91.7   89.1   98.0
 4  para+.bin+.pnt-.prun-      1917    99.9  14.6   79.6   13.0   92.2   91.5   89.0   97.9
 5  para+.bin-.pnt+.prun+      2962    57.5  10.3   49.7    5.7   63.2   79.9   59.3   87.6
 6  para+.bin-.pnt+.prun-     19536    88.4  13.6   37.5    6.5   54.0   73.1   48.0   78.8
 7  para+.bin-.pnt-.prun+      2839    67.2  11.6   45.8    6.0   59.8   76.5   52.7   83.3
 8  para+.bin-.pnt-.prun-     18365    92.5  13.9   38.9    6.8   55.2   73.6   47.5   78.6
 9  para-.bin+.pnt+.prun+       634   100.0  14.6   74.9   12.4   89.3   89.0   87.5   97.9
10  para-.bin+.pnt+.prun-      1827    99.9  14.6   72.7   12.3   88.3   88.2   86.7   97.7
11  para-.bin+.pnt-.prun+       489   100.0  14.6   71.6   11.9   86.0   84.5   80.6   95.7
12  para-.bin+.pnt-.prun-      1528    99.9  14.5   70.4   11.8   85.6   84.3   80.9   95.4
13  para-.bin-.pnt+.prun+      2756    76.4  12.8   37.4    5.6   53.4   71.7   46.6   80.1
14  para-.bin-.pnt+.prun-     18979    94.9  14.2   34.6    6.4   53.4   71.5   46.9   80.4
15  para-.bin-.pnt-.prun+      2675    80.4  13.3   36.9    5.8   53.2   71.1   45.7   80.5
16  para-.bin-.pnt-.prun-     17885    96.6  14.2   35.4    6.5   53.7   70.7   46.8   82.3

Table 1: Results for systematic grammar variations (trained on 16476 sents.; sentence length ≤40, except 1b)
                LP               LR
Category     in%    quota     in%    quota
CL          88.9     24.3    92.2     23.2
MF          93.2     23.8    93.1     23.7
LB          99.6     17.9    99.4     17.8
VF          96.1     16.3    91.8     16.9
RB          96.3     13.7    95.8     13.7
NF          82.6      3.6    73.4      4.1
S            4.8      0.3     5.3      0.3
DF          16.7      0.1     6.7      0.2
all         93.4    100.0    92.9    100.0
Table 2: Category-specific evaluation (v1, ≤40)[13]
     eval          perf. match     LP      LR
param   punct      in%     len    in%     in%
  -       -        80.4   13.1   93.4    92.9
  +       -        79.6   13.1   92.7    92.2
  -       +        78.5   12.8   92.1    91.6
  +       +        77.7   12.8   91.5    91.0

Table 3: Different evaluation schemes (v1, ≤40)
In sum, our best performing model (v1) makes use of a maximally discriminative symbolic grammar (parameterised categories, punctuation), resolves sparseness problems by rule binarisation, and can afford rule pruning to eliminate noise. Applied to full sentence lengths (v1b) we note a drop in performance, but insignificantly so for coverage, and only by 1% in LP and 0.7% in LR.

[13] S-categories were used for non-standard base clauses [...]
Table 3 details alternative evaluation measures. Evaluation on parameterised categories incurs a slight drop in accuracy, but in high ranges.[14] Evaluation of punctuation attachment, which is of little importance, yields a further drop.
The learning curve in Fig. 4 is surprising in that we obtain relatively high performance from rather small training corpora and grammar sizes (size grows almost linearly from 313 to 2308).[15] Saturation regarding coverage and accuracy is obtained around training size 6000.
Finally, we determined phenomena that call for stronger contextualisation or lexicalisation. A case in point are verb-second (V2) sentences with a fronted V2 clause in Vorfeld position (i.e. with VF-V2 categories), which allow an alternative analysis as coordinate clauses with shared subjects. This type of construction was frequently mis-analysed as a coordination structure since this structural ambiguity cannot be resolved on the basis of morphological or topological criteria. A promising strategy to enhance our model is (targeted) lexicalisation, as these constructions typically occur with a specific type of "reporting" verbs.

[14] These measures are relevant for integration of shallow and deep NLP (Crysmann et al., 2002), as parameterised categories provide highly discriminative information that can be used to guide deep syntactic processing.
[15] Note, however, that the curve pertains to a robust, binarised grammar. We chose v2 (prun-) in order not to unduly penalise small grammars. Lack of pruning could [...]

[Figure 4: Learning curve (version v2). x-axis: number of training sentences (0 to 16000); y-axis (76 to 100%): labeled precision, labeled recall, coverage, exact match.]
8 Conclusion and Future Work

We presented a topological parser for German, using a standard PCFG model trained on an annotated corpus. We have shown that for the task of topological parsing a non-lexicalised PCFG model yields competitive results. We investigated various grammar versions to illustrate problematic aspects in stochastic topological parsing. Category parameterisation (i.e. contextualisation) and punctuation were shown to increase accuracy. Binarisation results in high coverage figures. Pruning of single rule occurrences eliminates noise in the automatically constructed training corpus.
The complexity of topological parsing lies somewhere between the complexity of chunk parsing and full constituent structure parsing. Our results indicate that a standard PCFG model is appropriate for the chosen task, but could possibly be enhanced by lexicalisation.
In future work we will explore extension to a lexicalised model, and investigate cascaded stochastic parsing, by applying a specialised stochastic chunk parsing model to phrasal fields, to obtain full constituent structure parses. Further we will integrate the TnT tagger (Brants, 2000), evaluate the parser's robustness with respect to tagging errors, and extend the model to a free parsing architecture.
References

S. Ait-Mokhtar and J. Chanod. 1997. Incremental Finite-State Parsing. In Proceedings of ANLP-97.

A. Belz. 2001. Optimisation of corpus-derived probabilistic grammars. In Proceedings of Corpus Linguistics 2001, pp. 46-57.

T. Brants, W. Skut, and B. Krenn. 1997. Tagging Grammatical Functions. In Proceedings of EMNLP, Providence, RI, USA.

T. Brants. 1997. Internal and external tagsets in part-of-speech tagging. In Proceedings of Eurospeech, Rhodes, Greece.

T. Brants. 2000. TnT - A Statistical Part-of-Speech Tagger. In Proceedings of the ANLP-2000, Rhodes, Greece.

C. Braun. 1999. Flaches und robustes Parsen Deutscher Satzgefüge. Master's thesis, Saarland University.

E. Charniak. 1996. Tree-bank Grammars. In AAAI-96. Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 1031-1036. MIT Press.

M. Collins. 1997. Three generative models for statistical parsing. In Proceedings of the ACL-97, pp. 16-23.

B. Crysmann, A. Frank, B. Kiefer, S. Müller, G. Neumann, J. Piskorski, U. Schäfer, M. Siegel, H. Uszkoreit, F. Xu, M. Becker, and H.-U. Krieger. 2002. An Integrated Architecture for Deep and Shallow Processing. In Proceedings of ACL 2002, University of Pennsylvania, Philadelphia.

A. Frank. 2000. Automatic F-structure Annotation of Treebank Trees. In M. Butt and T.H. King (eds), Proceedings of the LFG00 Conference, CSLI Online Publications, Stanford, CA.

N. Gala-Pavia. 1999. Using the Incremental Finite-State Architecture to create a Spanish Shallow Parser. In Proceedings of XV Congres of SEPLN, Lleida, Spain.

T. Höhle. 1983. Topologische Felder. University of Cologne.

B. Kiefer and O. Scherf. 1996. Gimme more HQ parsers. The generic parser class of DISCO. Ms., DFKI, Saarbrücken, Germany.

G. Neumann, C. Braun, and J. Piskorski. 2000. A Divide-and-Conquer Strategy for Shallow Parsing of German Free Texts. In Proceedings of ANLP, pp. 239-246, Seattle, Washington.

O. Wauschkuhn. 1996. Ein Werkzeug zur partiellen syntaktischen Analyse deutscher Textkorpora. In D. Gibbon (ed), Proceedings of the Third KONVENS Conference, pp. 356-368, Berlin. Mouton de Gruyter.