Design and Evaluation of a Psychological Experiment on the Effectiveness of Document Summarisation for the Retrieval of Multilingual WWW Documents

Joanne Capstick, Gregor Erbach and Hans Uszkoreit

German Research Center for Artificial Intelligence GmbH

Stuhlsatzenhausweg 3

66123 Saarbrücken, Germany

{capstick,erbach,uszkoreit}@dfki.de

Abstract

Since, for the foreseeable future, retrieval will be an interactive task of the user looking through lists of potentially relevant documents, adequate support through various types of information is very important. A psychological experiment was conducted to examine the extent to which different types of automatically generated summaries aid retrieval, and to evaluate systematically user needs and behaviour in the area of cross-language retrieval for the WWW.

Motivation

The research described here was carried out in the MULINEX project (Erbach et al. 1997), the goal of which is to develop techniques for the effective retrieval of multilingual documents from the WWW. Special attention is given to presenting search results in such a way that users are supported in selecting documents that are relevant to their information needs. The presentation of search results includes: the language of the document; an automatically generated summary; a thematic classification; title; URL and document size. The purpose of the summaries is to increase retrieval effectiveness (recall of relevant documents) and efficiency (precision in visiting relevant documents without wasting time by looking at too many irrelevant ones). In order to find out to what extent different types of automatically generated summaries aid retrieval, and to evaluate user needs and behaviour in this area systematically, we conducted a psychological experiment. Subjects were given different kinds of summaries and their retrieval performance on two tasks was measured.
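The two quantities named above can be illustrated with a minimal sketch. It assumes that a subject's selected and visited documents and the human relevance judgements are available as collections of document identifiers; the function names and identifiers are hypothetical, and the source does not say that performance was computed in exactly this way.

```python
def recall(selected, relevant):
    """Fraction of the relevant documents that the subject selected."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # convention chosen for this sketch: no relevant documents
    return len(set(selected) & relevant) / len(relevant)


def precision(visited, relevant):
    """Fraction of the visited documents that were actually relevant."""
    visited = set(visited)
    if not visited:
        return 0.0  # convention chosen for this sketch: nothing was visited
    return len(visited & set(relevant)) / len(visited)


# Hypothetical data for one subject.
relevant_docs = ["d1", "d3"]
print(recall(["d1", "d2"], relevant_docs))
print(precision(["d1", "d2", "d4", "d5"], relevant_docs))
```

Under this reading, a summary type is more effective if it raises recall and more efficient if it raises precision, that is, if fewer irrelevant documents are opened.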

Design of the Experiment

Since the purpose of the summary is given by its application context, we decided to use a mock-up system. This enabled us to simulate the search task in a real-

The subjects were given the task of submitting a predetermined query to a search engine and selecting documents which were relevant to a given information need by looking through lists of potentially relevant documents. The information needs were formulated as follows:

1. What is good and bad for the heart?

2. What are the effects of ozone on human health?

A total of 84 subjects were tested; all of them were native speakers of German and they were principally humanities and law students. Six groups of 14 subjects were presented with different kinds of result lists. Documents were ordered by relevance or by theme:

A1. relevance, first n characters of document text
A2. relevance, query-independent summaries
A3. relevance, query-specific summaries
B1. theme, first n characters of document text
B2. theme, query-independent summaries
B3. theme, query-specific summaries

In group B the secondary ordering criterion was relevance. In group A the category of a document was stated, although this was not an ordering criterion. The result lists presented to the subjects contained 100 documents from three languages (English, German, French). The result lists were retrieved with AltaVista before the experiments in order to eliminate any variation that might arise from changes to the document base during the experiments, as well as problems with dead links, moved pages etc. The thematic categories were assigned manually to the documents, and the summaries were generated automatically. The length of the summaries was fixed at 200 characters.
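The two orderings can be sketched as follows. The record type, the field names, the theme labels and the relevance scores are hypothetical; the source states only that group A lists were ordered by relevance and group B lists by theme with relevance as the secondary criterion. Sorting the themes alphabetically is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str          # hypothetical identifier of the document
    theme: str        # manually assigned thematic category
    relevance: float  # relevance score from the search engine


def order_group_a(results):
    """Group A: order by descending relevance only."""
    return sorted(results, key=lambda r: -r.relevance)


def order_group_b(results):
    """Group B: order by theme, then by descending relevance within a theme."""
    return sorted(results, key=lambda r: (r.theme, -r.relevance))


results = [
    Result("u1", "health", 0.9),
    Result("u2", "environment", 0.4),
    Result("u3", "health", 0.5),
]
print([r.url for r in order_group_a(results)])
print([r.url for r in order_group_b(results)])
```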

The query-independent summaries were generated by making use of the structural markup of documents (headings) and by layout specifications (boldface, font

ments in which stemmed query terms occurred.

Subjects were asked to use summaries and thematic categories to select documents relevant to the stated information need. In addition, they could optionally access machine translations of the summaries, which were generated by Systran.

Data Gathered

Questionnaire. Each questionnaire contained demographic information about the subject and feedback about aspects of the system being examined, i.e. summaries, summary translations, thematic categories.

Subjects were asked to make judgements about their language and computing skills for use as control data. Subjects also provided feedback about the appeal and usefulness of the user interface and the information presented. To obtain this feedback a semantic differential designed for software evaluation was used.[1] Subjects were also asked to provide additional comments.

Logfiles. The logfiles contained the following information for each request to the web server: date and time; URI-encoded data and path information, e.g. documents and document summaries visited.
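As an illustration of how such logfiles might be analysed, the following sketch counts which subjects requested a summary translation in each language. The log line layout (a timestamp followed by a request URI), the path `/translation` and the parameter names `subject` and `lang` are all assumptions made for this example; the source does not give the actual log format.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qs


def translation_visitors(log_lines):
    """Map each translation language to the set of subjects who requested it."""
    visitors = defaultdict(set)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: "<timestamp> <request URI>"
        _timestamp, uri = line.split(maxsplit=1)
        parts = urlsplit(uri)
        if parts.path != "/translation":  # hypothetical path
            continue
        params = parse_qs(parts.query)    # decodes URI-encoded data
        subject = params.get("subject", [None])[0]
        lang = params.get("lang", [None])[0]
        if subject and lang:
            visitors[lang].add(subject)
    return dict(visitors)


log = [
    "1998-03-02T10:15:00 /translation?subject=s01&lang=de&doc=17",
    "1998-03-02T10:16:00 /summary?subject=s01&doc=17",
    "1998-03-02T10:17:00 /translation?subject=s02&lang=de&doc=4",
    "1998-03-02T10:18:00 /translation?subject=s01&lang=de&doc=9",
]
counts = {lang: len(subjects) for lang, subjects in translation_visitors(log).items()}
print(counts)
```

Collecting subjects in a set means that a subject who requests several translations in the same language is counted once for that language.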

In addition, each document in the document base was assigned a human judgement of its relevance to the given information needs.

Evaluation of the Experiment

By analysing the logfiles we expected to be able to compare the effectiveness of different summary types and gain insight into the subjects' use of summary translations, thus giving us information about subject performance and behaviour. In addition, the evaluation of the semantic differentials from the questionnaire and the subjects' comments told us about the subjects' assessment of the system.

The quantitative evaluation of subject performance and the semantic differentials did not provide statistically significant results. However, the qualitative evaluation of the summaries and summary translations proved fruitful. The following is an overview of the results gained from the experiment. More details can be found in (Unz et al. 1998).

Subjects' Comments

Summary Type. The comments are ordered according to positive and negative aspects, as well as areas for improvement. The most prominent comments for each summary type are listed.

[1] A semantic differential makes use of a graded rating scale which enables a quantitative evaluation of subjects'

were seen as a time-saving factor, enabling greater efficiency. Their compact representation and brevity were also positively commented upon.

A1. & B1. helpful, informative, time-saver.
A2. & B2. helpful, informative, time-saver; compact representation.
A3. & B3. informative, enable quick overview; briefness.

The lack of quality and clarity of the summaries was criticised and subjects requested longer summaries (this was also suggested as an improvement by a few subjects). In the case of summaries consisting simply of the first 200 characters, their abrupt termination within a word was criticised. A couple of users also commented on the fact that these were not real summaries but rather the beginning of the document.
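The criticism of abrupt termination can be made concrete with a small sketch. The first function reproduces the baseline described in the text, taking the first n characters of the document. The second is a hypothetical variant, not something used in the experiment, that backs up to the last complete word when the cut would otherwise fall inside a word.

```python
def first_n_chars(text, n=200):
    """Baseline: the first n characters, even if this cuts a word in half."""
    return text[:n]


def first_n_chars_word_safe(text, n=200):
    """Hypothetical variant: drop a trailing partial word after the cut."""
    if len(text) <= n:
        return text
    cut = text[:n]
    # The cut falls inside a word if the characters on both sides of it
    # are non-whitespace.
    if not text[n].isspace() and not cut[-1].isspace():
        head, sep, _partial = cut.rpartition(" ")
        if sep:  # keep the raw cut if there is no earlier space at all
            cut = head
    return cut.rstrip()


text = "Ozone affects human health"
print(repr(first_n_chars(text, 9)))
print(repr(first_n_chars_word_safe(text, 9)))
```

The variant yields summaries of at most n characters, so it trades a little length for readability; whether subjects would prefer it is not something the experiment tested.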

The query-dependent summaries were criticised by a few users as being disjointed (clearly the fault of our generation method) and being unrepresentative of the actual content of the page, which could often be the case with query-specific summaries, since the focus of the summary is aspects of the document relevant to the query and not the document as a whole.

A1. & B1. uninformative; too short; abrupt termination; not real summary.
A2. & B2. misleading; too short.
A3. & B3. uninformative; too short; lack of coherence; unrepresentative of document content.

Users from the different groups, although principally from the query-independent group, suggested listing the key words and terms from the text in the summary.

A1. & B1. extend summary.
A2. & B2. more precise with key words.
A3. & B3. more informative.

Summary Translations. These were used principally as an aid to understanding due to poor foreign language skills or simply out of interest and curiosity. A few subjects used the translations to test their own translation.

Reasons given for not using translations were lack of necessity and sufficient knowledge of the non-native languages, and also that enough information was available in German. Some subjects discontinued the use of summary translations out of frustration with the quality of the translations. A few subjects found the use of summary translations too time-consuming.

Semantic Differentials

According to the semantic dierential, there was no

the query-independent summaries received a slightly higher rating than the others. The summary type used also had no statistically significant effect on the subjects' assessment of summary translations.

The semantic differential used was designed for software evaluation in general. There is no doubt a need to identify more appropriate factors for the user-rating of summaries in the retrieval context. These could be obtained in part from summary typologies as outlined by Hutchins (Hutchins 1993) or intrinsic criteria such as conciseness. The subjects' comments could prove a useful source of criteria, e.g. time-saving vs. time-consuming, coherent vs. disjointed.

Analysis of the Logfiles

62 subjects visited a German summary translation (12 of them only once). 21 subjects visited an English and 10 a French summary translation.

Effect of the Summary Type. No statistically significant difference in subjects' performance[2] with respect to the summary type used was found.

This may be due to a number of reasons such as the quality of the summaries and the generation methods used. Perhaps the summaries were too short to be able to fulfil their purpose. The subjects were perhaps not placed under enough time pressure. The fact that subjects had to deal with summaries in different languages perhaps reduced the significance of the summary type. If the problem was lack of control, i.e. too many variable factors, then a tighter test environment needs to be designed. This could be achieved by simplifying the system and reducing variability, e.g. removing the multilingual dimension and only using German documents. However, it is important to ensure that the users' goals and the summary purpose are preserved. It is clear that we need a more appropriate or improved method for the quantitative evaluation of the summary types.
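The source does not state which statistical test was applied to the performance scores. One common choice for comparing several independent groups is a one-way analysis of variance; the sketch below computes its F statistic in plain Python. The scores are invented for illustration and are not data from the experiment.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the F statistic of a one-way ANOVA over lists of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more scores than groups")
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variance; F is undefined")
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Invented scores for three small groups.
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
```

With six groups of 14 subjects, as in this experiment, the statistic would have 5 and 78 degrees of freedom and would be compared against the corresponding critical value of the F distribution.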

Conclusion

In spite of criticisms regarding the quality and clarity of the summaries, subjects still regarded them as a helpful and time-saving factor for the task in hand. Subjects would, in general, prefer longer summaries, although this may lead to screen real-estate problems. Subjects also praised the availability of summary translations, although quality was a problem here too. From the comments, it seems necessary to reduce the time and effort needed to use summary translations.

[2] Performance was measured as the number of relevant

maries quantitatively; although the experiments give indications as to which methods may work better in the future. There is a need to set up a more controlled test environment which does not disconnect the summary from its purpose or change the user goals. The comments were rendered especially valuable by the fact that they were made about a concrete situation just experienced by the subjects, and were not disconnected from the situation, as is often the case with questionnaires. The comments could also serve as a useful source of ranking criteria for semantic differentials used in future summary evaluation work.

A possible extension to this work may be the in-depth analysis of summaries for cross-lingual information retrieval with respect to a descriptive framework for summaries (e.g. Sparck Jones 1993), and the identification of mappings between elements of the framework and evaluative methods.

Acknowledgments

MULINEX is funded by the European Commission's Telematics Applications Programme (Language Engineering Sector, LE-4203).

The experiments were designed, carried out and evaluated by DFKI and the media psychology institute, MEFIS e.V.

References

Endres-Niggemeyer, B., Hobbs, J., and Sparck Jones, K. eds. 1993. Workshop on Summarizing Text for Intelligent Communication. Dagstuhl.

Erbach, G., Neumann, G., and Uszkoreit, H. 1997. MULINEX: Multilingual Indexing, Navigation and Editing Extensions for the World Wide Web. In Hull, D. and Oard, D. eds. Cross-Language Text and Speech Retrieval - Papers from the 1997 AAAI Spring Symposium. AAAI Press, Stanford.

Hutchins, J. 1993. Introduction to Text Summarization Workshop. In Endres-Niggemeyer, B., Hobbs, J., and Sparck Jones, K. eds. Workshop on Summarizing Text for Intelligent Communication. Dagstuhl.

Sparck Jones, K. 1993. Summarising: Analytic Framework, Key Component, Experimental Method. In Endres-Niggemeyer, B., Hobbs, J., and Sparck Jones, K. eds. Workshop on Summarizing Text for Intelligent Communication. Dagstuhl.

Unz, D., Capstick, J., Erbach, G., Heidinger, V., and Uszkoreit, H. 1998. Psychological Experiment on the Presentation of Search Results for the Effective Retrieval of Multilingual Documents from the WWW. Technical Report, DFKI, Saarbrücken.