Effectiveness of Document Summarisation for the Retrieval of
Multilingual WWW Documents
Joanne Capstick, Gregor Erbach and Hans Uszkoreit
German Research Center for Artificial Intelligence GmbH
Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany
{capstick,erbach,uszkoreit}@dfki.de
Abstract
Since, for the foreseeable future, retrieval will remain an
interactive task in which the user looks through lists of
potentially relevant documents, adequate support through various
types of information is very important. A psychological experiment
was conducted to examine the extent to which different types of
automatically generated summaries aid retrieval, and to
systematically evaluate user needs and behaviour in the area of
cross-language retrieval for the WWW.
Motivation
The research described here was carried out in the MULINEX project
(Erbach et al. 1997), the goal of which is to develop techniques for
the effective retrieval of multilingual documents from the WWW.
Special attention is given to presenting search results in such a
way that users are supported in selecting documents that are
relevant to their information needs. The presentation of search
results includes: the language of the document; an automatically
generated summary; a thematic classification; title; URL and
document size.
The purpose of the summaries is to increase retrieval effectiveness
(recall of relevant documents) and efficiency (precision in visiting
relevant documents without wasting time by looking at too many
irrelevant ones). In order to find out to what extent different
types of automatically generated summaries aid retrieval, and to
systematically evaluate user needs and behaviour in this area, we
conducted a psychological experiment. Subjects were given different
kinds of summaries and their retrieval performance on two tasks was
measured.
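The two notions used here, effectiveness as recall and efficiency as precision, can be illustrated with a minimal sketch. The function name, the use of sets of document identifiers and the toy data are assumptions made for illustration; they are not taken from the experiment.

```python
def recall_precision(visited, relevant):
    """Return (recall, precision) for one subject.

    visited  -- ids of the documents the subject chose to visit
    relevant -- ids of the documents judged relevant to the information need
    (Both are hypothetical inputs; the paper does not specify a data format.)
    """
    visited, relevant = set(visited), set(relevant)
    hits = visited & relevant  # relevant documents that were actually visited
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(visited) if visited else 0.0
    return recall, precision


# Toy data, not results from the experiment.
r, p = recall_precision(visited={1, 2, 3, 4}, relevant={2, 4, 6, 8, 10})
print(f"recall={r:.2f} precision={p:.2f}")
```

A subject who visits few irrelevant documents scores high on precision, while a subject who finds most of the relevant documents scores high on recall.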
Design of the Experiment
Since the purpose of the summary is given by its application
context, we decided to use a mock-up system. This enabled us to
simulate the search task in a real-
The subjects were given the task to submit a predetermined query to
a search engine, and to select documents which were relevant to a
given information need by looking through lists of potentially
relevant documents. The information needs were formulated as
follows:
1. What is good and bad for the heart?
2. What are the effects of ozone on human health?
A total of 84 subjects were tested; all of them were native speakers
of German and they were principally humanities and law students.
Six groups of 14 subjects were presented with different kinds of
result lists. Documents were ordered by relevance or by theme:
A1. relevance, first n characters of document text
A2. relevance, query-independent summaries
A3. relevance, query-specific summaries
B1. theme, first n characters of document text
B2. theme, query-independent summaries
B3. theme, query-specific summaries
In group B the secondary ordering criterion was relevance. In group
A the category of a document was stated, although this was not an
ordering criterion. The result lists presented to the subjects
contained 100 documents in three languages (English, German,
French). The result lists were retrieved with AltaVista before the
experiments in order to eliminate any variation that might arise
from changes to the document base during the experiments, and
problems with dead links, moved pages etc. The thematic categories
were assigned manually to the documents, and the summaries were
generated automatically. The length of the summaries was fixed at
200 characters.
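The baseline condition (first n characters of the document text, with n fixed at 200) can be sketched as follows. The whitespace normalisation and the word-boundary variant are assumptions added for illustration; the paper does not describe how the text was preprocessed.

```python
SUMMARY_LENGTH = 200  # fixed summary length reported in the paper


def first_n_characters(text, n=SUMMARY_LENGTH):
    """Baseline summary: the first n characters of the document text."""
    # Collapse runs of whitespace so that line breaks do not use up
    # characters (an assumed preprocessing step, not from the paper).
    normalised = " ".join(text.split())
    return normalised[:n]


def first_n_at_word_boundary(text, n=SUMMARY_LENGTH):
    """Hypothetical variant that never cuts inside a word."""
    normalised = " ".join(text.split())
    if len(normalised) <= n:
        return normalised
    cut = normalised[:n]
    if normalised[n] == " ":
        # The cut already falls exactly at the end of a word.
        return cut
    head, sep, _ = cut.rpartition(" ")
    # Fall back to the hard cut if the text contains no space at all.
    return head if sep else cut
```

The first function may end a summary in the middle of a word; the second trades a few characters of length for a clean ending.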
The query-independent summaries were generated by making use of the
structural markup of documents (headings) and by layout
specifications (boldface, font
ments in which stemmed query terms occurred.
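One way to read the surviving description of the query-specific summaries (selection of text in which stemmed query terms occur) is sketched below. This is not the MULINEX generation method: the sentence splitting, the crude suffix-stripping stemmer and the " ... " separator are all assumptions made for illustration.

```python
import re


def crude_stem(word):
    """Very rough suffix stripper; a stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(text):
    """Set of stems of all word tokens in the text."""
    return {crude_stem(w) for w in re.findall(r"\w+", text)}


def query_specific_summary(text, query, length=200):
    """Join the sentences that share a stem with the query, cut to `length`."""
    query_stems = stems(query)
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", " ".join(text.split()))
    selected = [s for s in sentences if query_stems & stems(s)]
    return " ... ".join(selected)[:length]


doc = "Ozone is a gas. The weather was nice. High ozone levels affect breathing."
print(query_specific_summary(doc, "ozone effects"))
```

A real system would at least remove stop words from the query first; with a natural-language query, words such as "the" would otherwise match almost every sentence.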
Subjects were asked to use summaries and thematic categories to
select documents relevant to the stated information need. In
addition, they could optionally access machine translations of the
summaries, which were generated by Systran.
Data Gathered
Questionnaire Each questionnaire contained demographic information
about the subject and feedback about aspects of the system being
examined, i.e. summaries, summary translations and thematic
categories.
Subjects were asked to make judgements about their language and
computing skills for use as control data. Subjects also provided
feedback about the appeal and usefulness of the user interface and
the information presented. To obtain this feedback, a semantic
differential designed for software evaluation was used (footnote 1).
Subjects were also asked to provide additional comments.
Logfiles The logfiles contained the following information for each
request to the web server: date and time; URI-encoded data and path
information, e.g. documents and document summaries visited.
In addition, each document in the document base was assigned a human
judgement of its relevance to the given information needs.
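Extracting the visited documents from such logfiles could look roughly like the sketch below. The paper does not give the log format, so the line layout (date, time, request URI) and the parameter names `subject` and `doc` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qs, unquote, urlsplit


def parse_log_line(line):
    """Split one log line into (timestamp, path, query parameters).

    Assumed layout: "<date> <time> <URI-encoded request>".
    """
    date, time, uri = line.split(maxsplit=2)
    parts = urlsplit(uri.strip())
    # parse_qs decodes the URI encoding and returns lists; keep the first value.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return f"{date} {time}", unquote(parts.path), params


def visited_documents(lines):
    """Map each subject id to the set of document ids it requested."""
    visits = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, _, params = parse_log_line(line)
        if "subject" in params and "doc" in params:
            visits[params["subject"]].add(params["doc"])
    return dict(visits)


log = [
    "1998-02-03 10:15:42 /summary?subject=A1-07&doc=42",
    "1998-02-03 10:16:01 /document?subject=A1-07&doc=17",
    "1998-02-03 10:16:30 /summary?subject=B2-01&doc=42",
]
print(visited_documents(log))
```

Combined with the human relevance judgements, the resulting sets are the kind of input needed to score each subject's performance.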
Evaluation of the Experiment
By analysing the logfiles we expected to be able to compare the
effectiveness of different summary types and gain insight into the
subjects' use of summary translations, thus giving us information
about subject performance and behaviour. In addition, the evaluation
of the semantic differentials from the questionnaire and the
subjects' comments told us about the subjects' assessment of the
system.
The quantitative evaluation of subject performance and the semantic
differentials did not provide statistically significant results.
However, the qualitative evaluation of the summaries and summary
translations proved fruitful. The following is an overview of the
results gained from the experiment. More details can be found in
(Unz et al. 1998).
Subjects' Comments
Summary Type The comments are ordered according to positive and
negative aspects, as well as areas for improvement. The most
prominent comments for each summary type are listed.
1 A semantic differential makes use of a graded rating scale which
enables a quantitative evaluation of subjects'
were seen as a time-saving factor, enabling greater efficiency.
Their compact representation and brevity were also positively
commented upon.
A1. & B1. helpful, informative, time-saver.
A2. & B2. helpful, informative, time-saver; compact representation.
A3. & B3. informative, enable quick overview; brevity.
The lack of quality and clarity of the summaries was criticised, and
subjects requested longer summaries (this was also suggested as an
improvement by a few subjects). In the case of summaries consisting
simply of the first 200 characters, their abrupt termination within
a word was criticised. A couple of users also commented on the fact
that these were not real summaries but rather the beginning of the
document.
The query-dependent summaries were criticised by a few users as
being disjointed (clearly the fault of our generation method) and as
being unrepresentative of the actual content of the page. The latter
could often be the case with query-specific summaries, since the
summary focuses on aspects of the document relevant to the query and
not on the document as a whole.
A1. & B1. uninformative; too short; abrupt termination; not real
summary.
A2. & B2. misleading; too short.
A3. & B3. uninformative; too short; lack of coherence;
unrepresentative of document content.
Users from the different groups, although principally from the
query-independent group, suggested listing the key words and terms
from the text in the summary.
A1. & B1. extend summary.
A2. & B2. more precise with key words.
A3. & B3. more informative.
Summary Translations These were used principally as an aid to
understanding due to poor foreign language skills, or simply out of
interest and curiosity. A few subjects used the translations to test
their own translation.
Reasons given for not using translations were lack of necessity,
sufficient knowledge of the non-native languages, and that enough
information was available in German. Some subjects discontinued the
use of summary translations out of frustration with the quality of
the translations. A few subjects found the use of summary
translations too time-consuming.
Semantic Differentials
According to the semantic differential, there was no
the query-independent summaries received a slightly higher rating
than the others. The summary type used also had no statistically
significant effect on the subjects' assessment of summary
translations.
The semantic differential used was designed for software evaluation
in general. There is no doubt a need to identify more appropriate
factors for the user rating of summaries in the retrieval context.
These could be obtained in part from summary typologies as outlined
by Hutchins (Hutchins 1993) or from intrinsic criteria such as
conciseness. The subjects' comments could prove a useful source of
criteria, e.g. time-saving vs. time-consuming, coherent vs.
disjointed.
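A semantic differential built from such bipolar criteria could be scored as in the following minimal sketch. The 1 to 7 scale, the item names and the ratings are illustrative assumptions; the paper does not state the scale used or report ratings per item.

```python
from statistics import mean


def differential_profile(responses):
    """Average each bipolar item over all subjects in a group.

    responses -- one dict per subject, mapping an item name such as
                 "coherent/disjointed" to a rating on a graded scale
                 (1..7 is assumed here; 1 = left pole, 7 = right pole).
    """
    items = sorted({item for response in responses for item in response})
    return {
        item: mean(r[item] for r in responses if item in r)
        for item in items
    }


# Made-up ratings from two hypothetical subjects.
group = [
    {"time-saving/time-consuming": 2, "coherent/disjointed": 5},
    {"time-saving/time-consuming": 4, "coherent/disjointed": 6},
]
print(differential_profile(group))
```

Profiles computed per group in this way can then be compared across the summary types.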
Analysis of the Logfiles
62 subjects visited a German summary translation (12 of them only
once). 21 subjects visited an English and 10 a French summary
translation.
Effect of the Summary Type No statistically significant difference
in subjects' performance (footnote 2) with respect to the summary
type used was found.
This may be due to a number of reasons, such as the quality of the
summaries and the generation methods used. Perhaps the summaries
were too short to be able to fulfil their purpose. The subjects were
perhaps not placed under enough time pressure. The fact that
subjects had to deal with summaries in different languages perhaps
reduced the significance of the summary type. If the problem was
lack of control, i.e. too many variable factors, then a tighter test
environment needs to be designed. This could be achieved by
simplifying the system and reducing variability, e.g. removing the
multilingual dimension and only using German documents. However, it
is important to ensure that the users' goals and the summary purpose
are preserved. It is clear that we need a more appropriate or
improved method for the quantitative evaluation of the summary
types.
Conclusion
In spite of criticisms regarding the quality and clarity of the
summaries, subjects still regarded them as a helpful and time-saving
factor for the task in hand. Subjects would, in general, prefer
longer summaries, although this may lead to screen real-estate
problems. Subjects also praised the availability of summary
translations, although quality was a problem here too. From the
comments, it seems necessary to reduce the time and effort needed to
use summary translations.
2 Performance was measured as the number of relevant
maries quantitatively; although the experiments give indications as
to which methods may work better in the future. There is a need to
set up a more controlled test environment which does not disconnect
the summary from its purpose or change the user goals. The comments
were rendered especially valuable by the fact that they were made
about a concrete situation just experienced by the subjects, and
were not disconnected from the situation, as is often the case with
questionnaires. The comments could also serve as a useful source of
ranking criteria for semantic differentials used in future summary
evaluation work.
A possible extension to this work may be the in-depth analysis of
summaries for cross-lingual information retrieval with respect to a
descriptive framework for summaries (e.g. Sparck Jones 1993), and
the identification of mappings between elements of the framework and
evaluative methods.
Acknowledgments
MULINEX is funded by the European Commission's Telematics
Applications Programme (Language Engineering Sector, LE-4203).
The experiments were designed, carried out and evaluated by DFKI and
the media psychology institute, MEFIS e.V.
References
Endres-Niggemeyer, B., Hobbs, J., and Sparck Jones, K. eds. 1993.
Workshop on Summarizing Text for Intelligent Communication.
Dagstuhl.
Erbach, G., Neumann, G., and Uszkoreit, H. 1997. MULINEX:
Multilingual Indexing, Navigation and Editing Extensions for the
World Wide Web. In Hull, D. and Oard, D. eds. Cross-Language Text
and Speech Retrieval - Papers from the 1997 AAAI Spring Symposium.
AAAI Press, Stanford.
Hutchins, J. 1993. Introduction to Text Summarization Workshop. In
Endres-Niggemeyer, B., Hobbs, J., and Sparck Jones, K. eds. Workshop
on Summarizing Text for Intelligent Communication. Dagstuhl.
Sparck Jones, K. 1993. Summarising: Analytic Framework, Key
Component, Experimental Method. In Endres-Niggemeyer, B., Hobbs, J.,
and Sparck Jones, K. eds. Workshop on Summarizing Text for
Intelligent Communication. Dagstuhl.
Unz, D., Capstick, J., Erbach, G., Heidinger, V., and Uszkoreit, H.
1998. Psychological Experiment on the Presentation of Search Results
for the Effective Retrieval of Multilingual Documents from the WWW.
Technical Report, DFKI, Saarbrücken.