Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data
Jaideep Srivastava*†, Robert Cooley‡, Mukund Deshpande, Pang-Ning Tan
Department of Computer Science and Engineering, University of Minnesota
200 Union St SE, Minneapolis, MN 55455
{srivasta,cooley,deshpand,ptan}@cs.umn.edu
ABSTRACT
Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. This paper describes each of these phases in detail. Given its application potential, Web usage mining has seen a rapid increase in interest, from both the research and practice communities. This paper provides a detailed taxonomy of the work in this area, including research efforts as well as commercial offerings. An up-to-date survey of the existing work is also provided. Finally, a brief overview of the WebSIFT system as an example of a prototypical Web usage mining system is given.
Keywords: data mining, world wide web, web usage mining.
1. INTRODUCTION
The ease and speed with which business transactions can be carried out over the Web has been a key driving force in the rapid growth of electronic commerce. Specifically, e-commerce activity that involves the end user is undergoing a significant revolution. The ability to track users' browsing behavior down to individual mouse clicks has brought the vendor and end customer closer than ever before. It is now possible for a vendor to personalize his product message for individual customers at a massive scale, a phenomenon that is being referred to as mass customization.
The scenario described above is one of many possible applications of Web Usage mining, which is the process of applying data mining techniques to the discovery of usage patterns from Web data, targeted towards various applications. Data mining efforts associated with the Web, called Web mining, can be broadly divided into three classes, i.e. content mining, usage mining, and structure mining. Web Structure mining projects such as [34; 54] and Web Content mining projects such as [47; 21] are beyond the scope of this survey. An early taxonomy of Web mining is provided in [29], which also describes the architecture of the WebMiner system [42], one of the first systems for Web Usage mining. The proceedings of the recent WebKDD workshop [41], held in conjunction with the KDD-1999 conference, provide a sampling of some of the current research being performed in the area of Web Usage Analysis, including Web Usage mining.
* Can be contacted at jaideep@amazon.com
† Supported by NSF grant NSF/EIA-9818338
‡ Supported by NSF grant EHR-9554517
This paper provides an up-to-date survey of Web Usage mining, including both academic and industrial research efforts, as well as commercial offerings. Section 2 describes the various kinds of Web data that can be useful for Web Usage mining. Section 3 discusses the challenges involved in discovering usage patterns from Web data. The three phases are preprocessing, pattern discovery, and pattern analysis. Section 4 provides a detailed taxonomy and survey of the existing efforts in Web Usage mining, and Section 5 gives an overview of the WebSIFT system [31], as a prototypical example of a Web Usage mining system. Finally, Section 6 discusses privacy concerns and Section 7 concludes the paper.
2. WEB DATA
One of the key steps in Knowledge Discovery in Databases [33] is to create a suitable target data set for the data mining tasks. In Web Mining, data can be collected at the server-side, client-side, proxy servers, or obtained from an organization's database (which contains business data or consolidated Web data). Each type of data collection differs not only in terms of the location of the data source, but also the kinds of data available, the segment of population from which the data was collected, and its method of implementation.
There are many kinds of data that can be used in Web Mining. This paper classifies such data into the following types:
• Content: The real data in the Web pages, i.e. the data the Web page was designed to convey to the users. This usually consists of, but is not limited to, text and graphics.
• Structure: Data which describes the organization of the content. Intra-page structure information includes the arrangement of various HTML or XML tags within a given page. This can be represented as a tree structure, where the <html> tag becomes the root of the tree. The principal kind of inter-page structure information is hyper-links connecting one page to another.
• Usage: Data that describes the pattern of usage of Web pages, such as IP addresses, page references, and the date and time of accesses.
• User Profile: Data that provides demographic information about users of the Web site. This includes registration data and customer profile information.
2.1 Data Sources
The usage data collected at the different sources will represent the navigation patterns of different segments of the overall Web traffic, ranging from single-user, single-site browsing behavior to multi-user, multi-site access patterns.
2.1.1 Server Level Collection
A Web server log is an important source for performing Web Usage Mining because it explicitly records the browsing behavior of site visitors. The data recorded in server logs reflects the (possibly concurrent) access of a Web site by multiple users. These log files can be stored in various formats such as Common log or Extended log formats. An example of Extended log format is given in Figure 2 (Section 3). However, the site usage data recorded by server logs may not be entirely reliable due to the presence of various levels of caching within the Web environment. Cached page views are not recorded in a server log. In addition, any important information passed through the POST method will not be available in a server log. Packet sniffing technology is an alternative method to collecting usage data through server logs. Packet sniffers monitor network traffic coming to a Web server and extract usage data directly from TCP/IP packets. The Web server can also store other kinds of usage information such as cookies and query data in separate logs. Cookies are tokens generated by the Web server for individual client browsers in order to automatically track the site visitors. Tracking of individual users is not an easy task due to the stateless connection model of the HTTP protocol. Cookies rely on implicit user cooperation and thus have raised growing concerns regarding user privacy, which will be discussed in Section 6. Query data is also typically generated by online visitors while searching for pages relevant to their information needs. Besides usage data, the server side also provides content data, structure information and Web page meta-information (such as the size of a file and its last modified time).
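As a rough illustration of the kind of server log entry described above, the following sketch parses a single line into named fields. The field order (host, ident, user id, timestamp, request, status, size, quoted referrer, quoted agent), the regular expression, and the sample line are assumptions made for this example; real servers can be configured with different layouts.

```python
import re
from datetime import datetime

# Assumed layout: host ident userid [time] "request" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<userid>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the textual fields that are naturally typed values.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# Hypothetical log line using a documentation-range IP address.
sample = ('192.0.2.1 - - [25/Apr/1998:03:05:34 -0500] "GET /B.html HTTP/1.0" '
          '200 2050 "/A.html" "Mozilla/3.04 (Win95, I)"')
entry = parse_log_line(sample)
print(entry["uri"], entry["referrer"], entry["status"], entry["size"])
```

Note that nothing in such an entry reveals page views served from a cache or data sent through POST, which is the limitation discussed above.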
The Web server also relies on other utilities such as CGI scripts to handle data sent back from client browsers. Web servers implementing the CGI standard parse the URI¹ of the requested file to determine if it is an application program. The URI for CGI programs may contain additional parameter values to be passed to the CGI application. Once the CGI program has completed its execution, the Web server sends the output of the CGI application back to the browser.
¹ Uniform Resource Identifier (URI) is a more general definition that includes the commonly referred to Uniform Resource Locator (URL).
2.1.2 Client Level Collection
Client-side data collection can be implemented by using a remote agent (such as Javascripts or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities. The implementation of client-side data collection methods requires user cooperation, either in enabling the functionality of the Javascripts and Java applets, or to voluntarily use the modified browser. Client-side collection has an advantage over server-side collection because it ameliorates both the caching and session identification problems. However, Java applets perform no better than server logs in terms of determining the actual view time of a page. In fact, they may incur some additional overhead, especially when the Java applet is loaded for the first time. Javascripts, on the other hand, consume little interpretation time but cannot capture all user clicks (such as reload or back buttons). These methods will collect only single-user, single-site browsing behavior. A modified browser is much more versatile and will allow data collection about a single user over multiple Web sites. The most difficult part of using this method is convincing the users to use the browser for their daily browsing activities. This can be done by offering incentives to users who are willing to use the browser, similar to the incentive programs offered by companies such as NetZero [9] and AllAdvantage [2] that reward users for clicking on banner advertisements while surfing the Web.
2.1.3 Proxy Level Collection
A Web proxy acts as an intermediate level of caching between client browsers and Web servers. Proxy caching can be used to reduce the loading time of a Web page experienced by users as well as the network traffic load at the server and client sides [27]. The performance of proxy caches depends on their ability to predict future page requests correctly. Proxy traces may reveal the actual HTTP requests from multiple clients to multiple Web servers. This may serve as a data source for characterizing the browsing behavior of a group of anonymous users sharing a common proxy server.
2.2 Data Abstractions
The information provided by the data sources described above can all be used to construct/identify several data abstractions, notably users, server sessions, episodes, click-streams, and page views. In order to provide some consistency in the way these terms are defined, the W3C Web Characterization Activity (WCA) [14] has published a draft of Web term definitions relevant to analyzing Web usage. A user is defined as a single individual that is accessing files from one or more Web servers through a browser. While this definition seems trivial, in practice it is very difficult to uniquely and repeatedly identify users. A user may access the Web through different machines, or use more than one agent on a single machine. A page view consists of every file that contributes to the display on a user's browser at one time. Page views are usually associated with a single user action (such as a mouse-click) and can consist of several files such as frames, graphics, and scripts. When discussing and analyzing user behaviors, it is really the aggregate page view that is of importance. The user does not explicitly ask for "n" frames and "m" graphics to be loaded into his or her browser; the user requests a "Web page." All of the information to determine which files constitute a page view is accessible from the Web server. A click-stream is a sequential series of page view requests. Again, the data available from the server side does not always provide enough information to reconstruct the full click-stream for a site. Any page view accessed through a client or proxy-level cache will not be "visible" from the server side. A user session is the click-stream of page views for a single user across the entire Web. Typically, only the portion of each user session that is accessing a specific site can be used for analysis, since access information is not publicly available from the vast majority of Web servers. The set of page views in a user session for a particular Web site is referred to as a server session (also commonly referred to as a visit). A set of server sessions is the necessary input for any Web Usage analysis or data mining tool. The end of a server session is defined as the point when the user's browsing session at that site has ended. Again, this is a simple concept that is very difficult to track reliably. Any semantically meaningful subset of a user or server session is referred to as an episode by the W3C WCA.
3. WEB USAGE MINING
As shown in Figure 1, there are three main tasks for performing Web Usage Mining or Web Usage Analysis. This section presents an overview of the tasks for each step and discusses the challenges involved.
3.1 Preprocessing
Preprocessing consists of converting the usage, content, and structure information contained in the various available data sources into the data abstractions necessary for pattern discovery.
3.1.1 Usage Preprocessing
Usage preprocessing is arguably the most difficult task in the Web Usage Mining process due to the incompleteness of the available data. Unless a client side tracking mechanism is used, only the IP address, agent, and server side click-stream are available to identify users and server sessions. Some of the typically encountered problems are:
• Single IP address/Multiple Server Sessions - Internet service providers (ISPs) typically have a pool of proxy servers that users access the Web through. A single proxy server may have several users accessing a Web site, potentially over the same time period.
• Multiple IP address/Single Server Session - Some ISPs or privacy tools randomly assign each request from a user to one of several IP addresses. In this case, a single server session can have multiple IP addresses.
• Multiple IP address/Single User - A user that accesses the Web from different machines will have a different IP address from session to session. This makes tracking repeat visits from the same user difficult.
• Multiple Agent/Single User - Again, a user that uses more than one browser, even on the same machine, will appear as multiple users.
Assuming each user has now been identified (through cookies, logins, or IP/agent/path analysis), the click-stream for each user must be divided into sessions. Since page requests from other servers are not typically available, it is difficult to know when a user has left a Web site. A thirty minute timeout is often used as the default method of breaking a user's click-stream into sessions. The thirty minute timeout is based on the results of [23]. When a session ID is embedded in each URI, the definition of a session is set by the content server.
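The timeout heuristic can be sketched in a few lines. The function name `split_by_timeout`, the input shape (time-ordered `(timestamp, page)` pairs for one already-identified user), and the sample clicks are assumptions made for this illustration.

```python
from datetime import datetime, timedelta

def split_by_timeout(requests, timeout=timedelta(minutes=30)):
    """Split one user's time-ordered (timestamp, page) requests into sessions.

    A new session starts whenever the gap since the previous request
    exceeds the timeout.
    """
    sessions = []
    previous_time = None
    for timestamp, page in requests:
        if previous_time is None or timestamp - previous_time > timeout:
            sessions.append([])          # open a new session
        sessions[-1].append(page)
        previous_time = timestamp
    return sessions

# Hypothetical click-stream for a single user.
clicks = [
    (datetime(1998, 4, 25, 3, 4, 41), "A.html"),
    (datetime(1998, 4, 25, 3, 5, 34), "B.html"),
    (datetime(1998, 4, 25, 5, 5, 22), "A.html"),   # two hours later
    (datetime(1998, 4, 25, 5, 6, 3), "D.html"),
]
sessions = split_by_timeout(clicks)
print(sessions)
```

The thirty minute default is only a convention; the threshold is a parameter so that a site-specific value can be substituted.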
While the exact content served as a result of each user action is often available from the request field in the server logs, it is sometimes necessary to have access to the content server information as well. Since content servers can maintain state variables for each active session, the information necessary to determine exactly what content is served by a user request is not always available in the URI. The final problem encountered when preprocessing usage data is that of inferring cached page references. As discussed in Section 2.2, the only verifiable method of tracking cached page views is to monitor usage from the client side. The referrer field for each request can be used to detect some of the instances when cached pages have been viewed.
Figure 2 shows a sample log that illustrates several of the problems discussed above (the first column would not be present in an actual server log, and is for illustrative purposes only). IP address 123.456.78.9 is responsible for three server sessions, and IP addresses 209.456.78.2 and 209.456.78.3 are responsible for a fourth session. Using a combination of referrer and agent information, lines 1 through 11 can be divided into three sessions of A-B-F-O-G, L-R, and A-B-C-J. Path completion would add two page references to the first session A-B-F-O-F-B-G, and one reference to the third session A-B-A-C-J. Without using cookies, an embedded session ID, or a client-side data collection method, there is no method for determining that lines 12 and 13 are actually a single server session.
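The session split and path completion described for lines 1 through 11 of Figure 2 can be reproduced with a small heuristic. This is a simplified sketch written for this example, not the procedure of any particular system: requests are grouped by (IP, agent), each request joins the most recent session of that group that already contains its referrer, and backward moves are inferred whenever the referrer is not the previously requested page. The function names and the tuple layout are assumptions.

```python
from collections import defaultdict

# (ip, page, referrer, agent) for lines 1-11 of Figure 2; "-" means no referrer.
WIN = "Mozilla/3.04 (Win95, I)"
IRIX = "Mozilla/3.01 (X11, I, IRIX6.2, IP22)"
IP = "123.456.78.9"
LOG = [
    (IP, "A", "-", WIN), (IP, "B", "A", WIN), (IP, "L", "-", WIN),
    (IP, "F", "B", WIN), (IP, "A", "-", IRIX), (IP, "B", "A", IRIX),
    (IP, "R", "L", WIN), (IP, "C", "A", IRIX), (IP, "O", "F", WIN),
    (IP, "J", "C", IRIX), (IP, "G", "B", WIN),
]

def identify_sessions(log):
    """Attach each request to the latest session of the same (IP, agent)
    that already contains its referrer; otherwise start a new session."""
    sessions = []                     # each session: list of (page, referrer)
    by_client = defaultdict(list)     # (ip, agent) -> sessions of that client
    for ip, page, referrer, agent in log:
        candidates = by_client[(ip, agent)]
        target = next((s for s in reversed(candidates)
                       if any(p == referrer for p, _ in s)), None)
        if target is None:            # no referrer, or referrer never seen
            target = []
            candidates.append(target)
            sessions.append(target)
        target.append((page, referrer))
    return sessions

def complete_path(session):
    """Insert the backward steps implied when a referrer is not the previous page."""
    stack, path = [], []
    for page, referrer in session:
        if referrer in stack:
            while stack[-1] != referrer:   # user backed up through cached pages
                stack.pop()
                path.append(stack[-1])
        stack.append(page)
        path.append(page)
    return path

sessions = identify_sessions(LOG)
raw_paths = ["-".join(page for page, _ in s) for s in sessions]
completed_paths = ["-".join(complete_path(s)) for s in sessions]
print(raw_paths)
print(completed_paths)
```

On this sample the heuristic yields the three sessions and the completed paths given in the text. Lines 12 and 13 are omitted because, as noted above, referrer and agent data alone cannot tie two different IP addresses to one session.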
3.1.2 Content Preprocessing
Content preprocessing consists of converting the text, image, scripts, and other files such as multimedia into forms that are useful for the Web Usage Mining process. Often, this consists of performing content mining such as classification or clustering. While applying data mining to the content of Web sites is an interesting area of research in its own right, in the context of Web Usage Mining the content of a site can be used to filter the input to, or output from, the pattern discovery algorithms. For example, results of a classification algorithm could be used to limit the discovered patterns to those containing page views about a certain subject or class of products. In addition to classifying or clustering page views based on topics, page views can also be classified according to their intended use [50; 30]. Page views can be intended to convey information (through text, graphics, or other multimedia), gather information from the user, allow navigation (through a list of hypertext links), or some combination of uses. The intended use of a page view can also filter the sessions before or after pattern discovery.
[Figure 1 diagram: Site Files and Raw Logs -> Preprocessing -> Preprocessed Clickstream Data -> Pattern Discovery -> Rules, Patterns, and Statistics -> Pattern Analysis -> "Interesting" Rules, Patterns, and Statistics]
Figure 1: High Level Web Usage Mining Process

#   IP Address    Userid  Time                          Method/URL/Protocol    Status  Size  Referrer  Agent
1   123.456.78.9  -       [25/Apr/1998:03:04:41 -0500]  "GET A.html HTTP/1.0"  200     3290  -         Mozilla/3.04 (Win95, I)
2   123.456.78.9  -       [25/Apr/1998:03:05:34 -0500]  "GET B.html HTTP/1.0"  200     2050  A.html    Mozilla/3.04 (Win95, I)
3   123.456.78.9  -       [25/Apr/1998:03:05:39 -0500]  "GET L.html HTTP/1.0"  200     4130  -         Mozilla/3.04 (Win95, I)
4   123.456.78.9  -       [25/Apr/1998:03:06:02 -0500]  "GET F.html HTTP/1.0"  200     5096  B.html    Mozilla/3.04 (Win95, I)
5   123.456.78.9  -       [25/Apr/1998:03:06:58 -0500]  "GET A.html HTTP/1.0"  200     3290  -         Mozilla/3.01 (X11, I, IRIX6.2, IP22)
6   123.456.78.9  -       [25/Apr/1998:03:07:42 -0500]  "GET B.html HTTP/1.0"  200     2050  A.html    Mozilla/3.01 (X11, I, IRIX6.2, IP22)
7   123.456.78.9  -       [25/Apr/1998:03:07:55 -0500]  "GET R.html HTTP/1.0"  200     8140  L.html    Mozilla/3.04 (Win95, I)
8   123.456.78.9  -       [25/Apr/1998:03:09:50 -0500]  "GET C.html HTTP/1.0"  200     1820  A.html    Mozilla/3.01 (X11, I, IRIX6.2, IP22)
9   123.456.78.9  -       [25/Apr/1998:03:10:02 -0500]  "GET O.html HTTP/1.0"  200     2270  F.html    Mozilla/3.04 (Win95, I)
10  123.456.78.9  -       [25/Apr/1998:03:10:45 -0500]  "GET J.html HTTP/1.0"  200     9430  C.html    Mozilla/3.01 (X11, I, IRIX6.2, IP22)
11  123.456.78.9  -       [25/Apr/1998:03:12:23 -0500]  "GET G.html HTTP/1.0"  200     7220  B.html    Mozilla/3.04 (Win95, I)
12  209.456.78.2  -       [25/Apr/1998:05:05:22 -0500]  "GET A.html HTTP/1.0"  200     3290  -         Mozilla/3.04 (Win95, I)
13  209.456.78.3  -       [25/Apr/1998:05:06:03 -0500]  "GET D.html HTTP/1.0"  200     1680  A.html    Mozilla/3.04 (Win95, I)
Figure 2: Sample Web Server Log

In order to run content mining algorithms on page views, the information must first be converted into a quantifiable format. Some version of the vector space model [51] is typically used to accomplish this. Text files can be broken up into vectors of words. Keywords or text descriptions can be substituted for graphics or multimedia. The content of static page views can be easily preprocessed by parsing the HTML and reformatting the information or running additional algorithms as desired. Dynamic page views present more of a challenge. Content servers that employ personalization techniques and/or draw upon databases to construct the page views may be capable of forming more page views than can be practically preprocessed. A given set of server sessions may only access a fraction of the page views possible for a large dynamic site. Also, the content may be revised on a regular basis. The content of each page view to be preprocessed must be "assembled", either by an HTTP request from a crawler, or a combination of template, script, and database accesses. If only the portion of page views that are accessed are preprocessed, the output of any classification or clustering algorithms may be skewed.
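A minimal sketch of the vector space conversion for static text content follows. It uses raw term counts over a fixed vocabulary, which is only one of many possible weightings; the page texts, the tokenizer, and the function name `term_vector` are assumptions made for this illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

def term_vector(text, vocabulary):
    """Return the term counts of text over a fixed, ordered vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

# Hypothetical text extracted from two static page views.
pages = {
    "A.html": "Music albums and music players",
    "B.html": "Sporting equipment and camping gear",
}
vocabulary = sorted({word for text in pages.values() for word in tokenize(text)})
vectors = {name: term_vector(text, vocabulary) for name, text in pages.items()}
print(vocabulary)
print(vectors)
```

Once page views are in this form, standard classification or clustering algorithms can be applied to them.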
3.1.3 Structure Preprocessing
The structure of a site is created by the hypertext links between page views. The structure can be obtained and preprocessed in the same manner as the content of a site. Again, dynamic content (and therefore links) pose more problems than static page views. A different site structure may have to be constructed for each server session.
3.2 Pattern Discovery
Pattern discovery draws upon methods and algorithms developed from several fields such as statistics, data mining, machine learning and pattern recognition. However, it is not the intent of this paper to describe all the available algorithms and techniques derived from these fields. Interested readers should consult references such as [33; 24]. This section describes the kinds of mining activities that have been applied to the Web domain. Methods developed from other fields must take into consideration the different kinds of data abstractions and prior knowledge available for Web Mining. For example, in association rule discovery, the notion of a transaction for market-basket analysis does not take into consideration the order in which items are selected. However, in Web Usage Mining, a server session is an ordered sequence of pages requested by a user. Furthermore, due to the difficulty in identifying unique sessions, additional prior knowledge is required (such as imposing a default timeout period, as was pointed out in the previous section).
3.2.1 Statistical Analysis
Statistical techniques are the most common method to extract knowledge about visitors to a Web site. By analyzing the session file, one can perform different kinds of descriptive statistical analyses (frequency, mean, median, etc.) on variables such as page views, viewing time and length of a navigational path. Many Web traffic analysis tools produce a periodic report containing statistical information such as the most frequently accessed pages, average view time of a page or average length of a path through a site. This report may include limited low-level error analysis such as detecting unauthorized entry points or finding the most common invalid URI. Despite lacking in the depth of its analysis, this type of knowledge can be potentially useful for improving the system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions.
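The descriptive statistics mentioned above are straightforward to compute once sessions are available. The sketch below uses hypothetical sessions (lists of page views in request order) and Python's standard library only.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical server sessions, each a list of page views in request order.
sessions = [["A", "B", "F", "O", "G"], ["L", "R"], ["A", "B", "C", "J"]]

# Frequency of each page view across all sessions.
page_counts = Counter(page for session in sessions for page in session)
# Length of the navigational path in each session.
path_lengths = [len(session) for session in sessions]

print("most requested pages:", page_counts.most_common(2))
print("mean path length:", mean(path_lengths))
print("median path length:", median(path_lengths))
```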
3.2.2 Association Rules
Association rule generation can be used to relate pages that are most often referenced together in a single server session. In the context of Web Usage Mining, association rules refer to sets of pages that are accessed together with a support value exceeding some specified threshold. These pages may not be directly connected to one another via hyperlinks. For example, association rule discovery using the Apriori algorithm [18] (or one of its variants) may reveal a correlation between users who visited a page containing electronic products and those who access a page about sporting equipment. Aside from being applicable for business and marketing applications, the presence or absence of such rules can help Web designers to restructure their Web site. The association rules may also serve as a heuristic for prefetching documents in order to reduce user-perceived latency when loading a page from a remote site.
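To make the notions of support and confidence concrete, the sketch below counts page pairs by brute force. It is deliberately not the Apriori algorithm, which prunes candidate itemsets level by level and handles larger sets; the function name `pair_rules`, the threshold, and the sessions are assumptions made for this illustration.

```python
from collections import Counter
from itertools import combinations

def pair_rules(sessions, min_support=0.5):
    """Return {(x, y): (support, confidence)} for rules x -> y over page pairs."""
    n = len(sessions)
    page_sets = [set(session) for session in sessions]   # order is ignored
    single = Counter(page for pages in page_sets for page in pages)
    pairs = Counter(frozenset(combo)
                    for pages in page_sets
                    for combo in combinations(sorted(pages), 2))
    rules = {}
    for pair, count in pairs.items():
        support = count / n
        if support >= min_support:
            x, y = sorted(pair)
            rules[(x, y)] = (support, count / single[x])
            rules[(y, x)] = (support, count / single[y])
    return rules

# Hypothetical server sessions described by page category.
sessions = [
    ["electronics", "sports", "checkout"],
    ["electronics", "sports"],
    ["electronics", "books"],
    ["sports", "checkout"],
]
rules = pair_rules(sessions)
for (x, y), (support, confidence) in sorted(rules.items()):
    print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```

Here support is the fraction of sessions containing both pages, and confidence is the fraction of sessions containing the first page that also contain the second.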
3.2.3 Clustering
Clustering is a technique to group together a set of items having similar characteristics. In the Web Usage domain, there are two kinds of interesting clusters to be discovered: usage clusters and page clusters. Clustering of users tends to establish groups of users exhibiting similar browsing patterns. Such knowledge is especially useful for inferring user demographics in order to perform market segmentation in E-commerce applications or provide personalized Web content to the users. On the other hand, clustering of pages will discover groups of pages having related content. This information is useful for Internet search engines and Web assistance providers. In both applications, permanent or dynamic HTML pages can be created that suggest related hyperlinks to the user according to the user's query or past history of information needs.
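As one possible illustration of usage clustering, the sketch below groups sessions with a simple one-pass, threshold-based scheme (often called leader clustering) using Jaccard similarity between the sets of pages visited. Both the similarity measure and the clustering scheme are choices made for this example rather than methods prescribed by the survey, and the sessions are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of pages."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def leader_clustering(sessions, threshold=0.5):
    """Assign each session to the first cluster whose leader (its first
    member) is similar enough; otherwise start a new cluster."""
    clusters = []                      # each cluster: list of session indices
    for index, session in enumerate(sessions):
        for cluster in clusters:
            if jaccard(session, sessions[cluster[0]]) >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters

sessions = [["A", "B", "C"], ["A", "B"], ["X", "Y"], ["X", "Y", "Z"]]
clusters = leader_clustering(sessions)
print(clusters)
```

The result depends on the order of the sessions and on the threshold, which is typical of one-pass schemes; more robust algorithms trade that simplicity for additional passes over the data.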
3.2.4 Classification
Classification is the task of mapping a data item into one of several predefined classes [33]. In the Web domain, one is interested in developing a profile of users belonging to a particular class or category. This requires extraction and selection of features that best describe the properties of a given class or category. Classification can be done by using supervised inductive learning algorithms such as decision tree classifiers, naive Bayesian classifiers, k-nearest neighbor classifiers, Support Vector Machines, etc. For example, classification on server logs may lead to the discovery of interesting rules such as: 30% of users who placed an online order in /Product/Music are in the 18-25 age group and live on the West Coast.
3.2.5 Sequential Patterns
The technique of sequential pattern discovery attempts to find inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. By using this approach, Web marketers can predict future visit patterns which will be helpful in placing advertisements aimed at certain user groups. Other types of temporal analysis that can be performed on sequential patterns include trend analysis, change point detection, or similarity analysis.
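The simplest case of a sequential pattern, one item followed later by another, can be counted directly. The sketch below treats each list as one time-ordered sequence of page views and reports the ordered pairs whose support meets a threshold. It is a brute-force illustration limited to pairs, not a general sequential pattern mining algorithm; the function name and the data are assumptions.

```python
from collections import Counter

def sequential_pairs(sequences, min_support=0.5):
    """Return {(a, b): support} where a is followed (not necessarily
    immediately) by b in at least min_support of the sequences."""
    counts = Counter()
    for sequence in sequences:
        seen = set()                       # count each pair once per sequence
        for i, a in enumerate(sequence):
            for b in sequence[i + 1:]:
                if a != b:
                    seen.add((a, b))
        counts.update(seen)
    n = len(sequences)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

sequences = [
    ["home", "search", "cart"],
    ["home", "cart"],
    ["search", "home"],
]
patterns = sequential_pairs(sequences)
print(patterns)
```

Unlike the association rules of Section 3.2.2, the order of the two items matters here: "home" followed by "cart" is counted separately from "cart" followed by "home".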
3.2.6 Dependency Modeling
Dependency modeling is another useful pattern discovery task in Web Mining. The goal here is to develop a model capable of representing significant dependencies among the various variables in the Web domain. As an example, one may be interested to build a model representing the different stages a visitor undergoes while shopping in an online store based on the actions chosen (i.e. from a casual visitor to a serious potential buyer). There are several probabilistic learning techniques that can be employed to model the browsing behavior of users. Such techniques include Hidden Markov Models and Bayesian Belief Networks. Modeling of Web usage patterns will not only provide a theoretical framework for analyzing the behavior of users but is potentially useful for predicting future Web resource consumption. Such information may help develop strategies to increase the sales of products offered by the Web site or improve the navigational convenience of users.
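A very small probabilistic model of browsing behavior is a first-order Markov chain over page views, estimated from observed transitions. This is considerably simpler than the Hidden Markov Models and Bayesian Belief Networks named above (all states are observed directly) and is shown only to illustrate the idea of modeling dependencies; the function name and the sessions are assumptions.

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Estimate P(next page | current page) from consecutive page views."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for state, nexts in counts.items()
    }

sessions = [
    ["home", "product", "cart"],
    ["home", "product", "home"],
    ["home", "search"],
]
model = transition_probabilities(sessions)
print(model["home"])
print(model["product"])
```

Such a model can be used, for example, to estimate which page a visitor is most likely to request next.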
3.3 Pattern Analysis
Pattern analysis is the last step in the overall Web Usage mining process as described in Figure 1. The motivation behind pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern discovery phase. The exact analysis methodology is usually governed by the application for which Web mining is done. The most common form of pattern analysis consists of a knowledge query mechanism such as SQL. Another method is to load usage data into a data cube in order to perform OLAP operations. Visualization techniques, such as graphing patterns or assigning colors to different values, can often highlight overall patterns or trends in the data. Content and structure information can be used to filter out patterns containing pages of a certain usage type, content type, or pages that match a certain hyperlink structure.
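As a sketch of the knowledge query approach, discovered rules can be stored in a relational table and filtered with SQL. The table name `rules`, its columns, the thresholds, and the rows below are all hypothetical; the example uses an in-memory SQLite database from Python's standard library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rules (antecedent TEXT, consequent TEXT, "
    "support REAL, confidence REAL)"
)
# Hypothetical output of a pattern discovery step.
conn.executemany("INSERT INTO rules VALUES (?, ?, ?, ?)", [
    ("/products/music", "/checkout", 0.12, 0.40),
    ("/home", "/products/music", 0.30, 0.35),
    ("/products/sports", "/checkout", 0.02, 0.90),
])

# Keep only rules that are both frequent and reliable enough to be interesting.
rows = conn.execute(
    "SELECT antecedent, consequent FROM rules "
    "WHERE support >= ? AND confidence >= ? ORDER BY confidence DESC",
    (0.10, 0.38),
).fetchall()
print(rows)
conn.close()
```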
4. TAXONOMY AND PROJECT SURVEY
Since 1996 there have been several research projects and commercial products that have analyzed Web usage data for a number of different purposes. This section describes the dimensions and application areas that can be used to classify Web Usage Mining projects.
4.1 Taxonomy Dimensions
While many candidate dimensions can be used to classify Web Usage Mining projects, there are five major dimensions that apply to every project: the data sources used to gather input, the types of input data, the number of users represented in each data set, the number of Web sites represented in each data set, and the application area focused on by the project. Usage data can be gathered at the server level, proxy level, or client level, as discussed in Section 2.1. As shown in Figure 3, most projects make use of server side data. All projects analyze usage data and some also make use of content, structure, or profile data. The algorithms for a project can be designed to work on inputs representing one or many users and one or many Web sites. Single user projects are generally involved in the personalization application area. The projects that provide multi-site analysis use either client or proxy level input data in order to easily access usage data from more than one Web site. Most Web Usage Mining projects take single-site, multi-user, server-side usage data (Web server logs) as input.
4.2 Project Survey
As shown in Figures 3 and 4, usage patterns extracted from Web data have been applied to a wide range of applications. Projects such as [31; 55; 56; 58; 53] have focused on Web Usage Mining in general, without extensive tailoring of the process towards one of the various sub-categories. The WebSIFT project is discussed in more detail in the next section. Chen et al. [25] introduced the concept of maximal forward reference to characterize user episodes for the mining of traversal patterns. A maximal forward reference is the sequence of pages requested by a user up to the last page before backtracking occurs during a particular server session. The SpeedTracer project [56] from IBM Watson is built on the work originally reported in [25]. In addition to episode identification, SpeedTracer makes use of referrer and agent information in the preprocessing routines to identify users and server sessions in the absence of additional client side information. The Web Utilization Miner (WUM) system [55] provides a robust mining language in order to specify characteristics of discovered frequent paths that are interesting to the analyst. In their approach, individual navigation paths, called trails, are combined into an aggregated tree structure. Queries can be answered by mapping them into the intermediate nodes of the tree structure. Han et al. [58] have loaded Web server logs into a data cube structure in order to perform data mining as well as On-Line Analytical Processing (OLAP) activities such as roll-up and drill-down of the data. Their WebLogMiner system has been used to discover association rules, perform classification and time-series analysis (such as event sequence analysis, transition analysis and trend analysis). Shahabi et al. [53; 59] have one of the few Web Usage mining systems that relies on client side data collection. The client side agent sends back page request and time information to the server every time a page containing the Java applet (either a new page or a previously cached page) is loaded or destroyed.
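The definition of a maximal forward reference given above can be turned into a short procedure. The sketch below follows that definition for a single, path-completed session; it is an illustrative reading of the concept, not a reproduction of the algorithm published in [25], and the function name is an assumption.

```python
def maximal_forward_references(session):
    """Split a session (list of pages, including backward moves) into
    maximal forward references: each forward path is emitted at the
    moment the user starts backtracking, and once more at the end."""
    path, results, moving_forward = [], [], False
    for page in session:
        if page in path:                       # backward reference
            if moving_forward:
                results.append(list(path))     # forward path just ended
            del path[path.index(page) + 1:]    # back up to the revisited page
            moving_forward = False
        else:                                  # forward reference
            path.append(page)
            moving_forward = True
    if moving_forward:
        results.append(list(path))
    return results

# Path-completed sessions from the Figure 2 discussion in Section 3.1.1.
first = maximal_forward_references(["A", "B", "F", "O", "F", "B", "G"])
third = maximal_forward_references(["A", "B", "A", "C", "J"])
print(first)
print(third)
```

For the first session this yields the episodes A-B-F-O and A-B-G, which can then be mined for frequent traversal patterns.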
4.2.1 Personalization
Personalizing the Web experience for a user is the holy grail of many Web-based applications, e.g. individualized marketing for e-commerce [4]. Making dynamic recommendations to a Web user, based on her/his profile in addition to usage behavior, is very attractive to many applications, e.g. cross-sales and up-sales in e-commerce. Web usage mining is an excellent approach for achieving this goal, as illustrated in [43]. Existing recommendation systems, such as [8; 6], do not currently use data mining for recommendations, though there have been some recent proposals [16].
The WebWatcher [37], SiteHelper [45], Letizia [39], and clustering work by Mobasher et al. [43] and Yan et al. [57] have all concentrated on providing Web Site personalization based on usage information. Web server logs were used by Yan et al. [57] to discover clusters of users having similar access patterns. The system proposed in [57] consists of an offline module that will perform cluster analysis and an online module which is responsible for dynamic link generation of Web pages. Every site user will be assigned to a single cluster based on their current traversal pattern. The links that are presented to a given user are dynamically selected based on what pages other users assigned to the same cluster have visited. The SiteHelper project learns a user's preferences by looking at the page accesses for each user. A list of keywords from pages that a user has spent a significant amount of time viewing is compiled and presented to the user. Based on feedback about the keyword list, recommendations for other pages within the site are made. WebWatcher "follows" a user as he or she browses the Web and identifies links that are potentially interesting to the user. The WebWatcher starts with a short description of a user's interest. Each page request is routed through the WebWatcher proxy server in order to easily track the user session across multiple Web sites and mark any interesting links. WebWatcher learns based on the particular user's browsing plus the browsing of other users with similar interests. Letizia is a client side agent that searches the Web for pages similar to ones that the user has already viewed or bookmarked. The page recommendations in [43] are based on clusters of pages found from the server log for a site. The system recommends pages from clusters that most closely match the current session. Pages that have not been viewed and are not directly linked from the current page are recommended to the user. [44] attempts to cluster user sessions using a fuzzy clustering algorithm. [44] allows a page or user to be assigned to more than one cluster.

Columns: Project | Application Focus | Data Source (Server, Proxy, Client) | Data Type (Structure, Content, Usage, Profile) | User (Single, Multi) | Site (Single, Multi)
WebSIFT (CTS99) General x x x x x x
SpeedTracer (WYB98, CPY96) General x x x x
WUM (SF98) General x x x x x
Shahabi (SZAS97, ZASS97) General x x x x x
Site Helper (NW97) Personalization x x x x x
Letizia (Lie95) Personalization x x x x x
Web Watcher (JFM97) Personalization x x x x x x
Krishnapuram (NKJ99) Personalization x x x x
Analog (YJGD96) Personalization x x x x
Mobasher (MCS99) Personalization x x x x x
Tuzhilin (PT98) Business x x x x
SurfAid Business x x x x x
Buchner (BM98) Business x x x x x
WebTrends, Hitlist, Accrue, etc. Business x x x x
WebLogMiner (ZXH98) Business x x x x
PageGather, SCML (PE98, PE99) Site Modification x x x x x x
Manley (Man97) Characterization x x x x x
Arlitt (AW96) Characterization x x x x x
Pitkow (PIT97, PIT98) Characterization x x x x x x
Almeida (ABC96) Characterization x x x x
Rexford (CKR98) System Improve. x x x x x
Schechter (SKS98) System Improve. x x x x
Aggarwal (AY97) System Improve. x x x x
Figure 3: Web Usage Mining Research Projects and Products

[Figure 4 diagram: Web Usage Mining (general projects: WebSIFT, WUM, SpeedTracer, WebLogMiner, Shahabi) branches into five application areas. Personalization: Site Helper, Letizia, Web Watcher, Mobasher, Analog, Krishnapuram. System Improvement: Rexford, Schechter, Aggarwal. Site Modification: Adaptive Sites. Business Intelligence: SurfAid, Buchner, Tuzhilin. Usage Characterization: Pitkow, Arlitt, Manley, Almeida.]
Figure 4: Major Application Areas for Web Usage Mining
4.2.2 System Improvement
Performance and other service quality attributes are crucial to user satisfaction from services such as databases, networks, etc. Similar qualities are expected by the users of Web services. Web usage mining provides the key to understanding Web traffic behavior, which can in turn be used for developing policies for Web caching, network transmission [27], load balancing, or data distribution. Security is an acutely growing concern for Web-based services, especially as electronic commerce continues to grow at an exponential rate [32]. Web usage mining can also provide patterns which are useful for detecting intrusion, fraud, attempted break-ins, etc.
Almeida et al. [19] propose models for predicting the locality, both
temporal and spatial, amongst Web pages requested by a particular
user or a group of users accessing from the same proxy server. The
locality measure can then be used for deciding pre-fetching and
caching strategies for the proxy server. The increasing use of
dynamic content has reduced the benefits of caching at both the
client and server level. Schechter et al. [52] have developed
algorithms for creating path profiles from data contained in server
logs. These profiles are then used to pre-generate dynamic HTML pages
based on the current user profile in order to reduce latency due to
page generation. Using proxy information for pre-fetching pages has
also been studied by [27] and [17].
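The idea of using path profiles to anticipate the next request can be
sketched as follows. This is a simplified n-gram style illustration
and not the algorithm of [52]; the function names, the `max_len`
parameter, and the longest-match rule are assumptions made for the
example.

```python
from collections import Counter, defaultdict


def build_path_profiles(sessions, max_len=2):
    """Count which page follows each observed path of up to max_len pages."""
    profiles = defaultdict(Counter)
    for pages in sessions:
        for i in range(1, len(pages)):
            for n in range(1, max_len + 1):
                if i - n < 0:
                    break
                profiles[tuple(pages[i - n:i])][pages[i]] += 1
    return profiles


def predict_next(profiles, recent, max_len=2):
    """Predict the next page, preferring the longest matching path."""
    for n in range(min(max_len, len(recent)), 0, -1):
        followers = profiles.get(tuple(recent[-n:]))
        if followers:
            return followers.most_common(1)[0][0]
    return None  # no profile matches the recent path


sessions = [
    ["/home", "/cat", "/item1"],
    ["/home", "/cat", "/item1"],
    ["/search", "/cat", "/item2"],
]
profiles = build_path_profiles(sessions)
print(predict_next(profiles, ["/search", "/cat"]))
```

A server could pre-generate the predicted page while the user is
still reading the current one; how to act on the prediction is left
out of this sketch.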
4.2.3 Site Modification
The attractiveness of a Web site, in terms of both content and
structure, is crucial to many applications, e.g. a product catalog
for e-commerce. Web usage mining provides detailed feedback on user
behavior, providing the Web site designer information on which to
base redesign decisions.
While the results of any of the projects could lead to redesigning
the structure and content of a site, the adaptive Web site project
(SCML algorithm) [48; 49] focuses on automatically changing the
structure of a site based on usage patterns discovered from server
logs. Clustering of pages is used to determine which pages should be
directly linked.
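A minimal sketch of how usage data can point to missing links is
given below. It uses plain pairwise co-occurrence counting as a
simplified stand-in for page clustering and is not the algorithm of
[48; 49]; the function name, the `min_count` threshold, and the
`links` map are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def suggest_links(sessions, links, min_count=2):
    """Return unlinked page pairs visited together in >= min_count sessions."""
    together = Counter()
    for pages in sessions:
        # Count each unordered pair once per session.
        for a, b in combinations(sorted(set(pages)), 2):
            together[(a, b)] += 1
    suggestions = []
    for (a, b), count in together.items():
        linked = b in links.get(a, set()) or a in links.get(b, set())
        if count >= min_count and not linked:
            suggestions.append((a, b, count))
    # Most frequent pairs first, ties broken alphabetically.
    return sorted(suggestions, key=lambda s: (-s[2], s[0], s[1]))


sessions = [["/a", "/b", "/c"], ["/a", "/c"], ["/b", "/c"]]
links = {"/b": {"/c"}}
print(suggest_links(sessions, links))
```

Pages `/b` and `/c` also co-occur twice, but they are already linked,
so only the pair `/a`, `/c` is proposed.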
4.2.4 Business Intelligence
Information on how customers are using a Web site is critical for
marketers of e-tailing businesses. Buchner et al. [22] have presented
a knowledge discovery process in order to discover marketing
intelligence from Web data. They use usage data along with marketing
data for e-commerce applications. They identified four distinct steps
in the customer relationship life cycle that can be supported by
their knowledge discovery techniques: customer attraction, customer
retention, cross sales and customer departure. There are several
commercial products, such as SurfAid [11], Accrue [1], NetGenesis
[7], Aria [3], Hitlist [5], and WebTrends [13] that provide Web
traffic analysis mainly for the purpose of gathering business
intelligence. Accrue, NetGenesis, and Aria are designed to analyze
e-commerce events such as products bought and advertisement
click-through rates in addition to straightforward usage statistics.
Accrue provides a path analysis visualization tool and IBM's SurfAid
provides OLAP through a data cube and clustering of users in addition
to page view statistics. Padmanabhan et al. [46] use Web server logs
to generate beliefs about the access patterns of Web pages at a given
Web site. Algorithms for finding interesting rules based on the
unexpectedness of the rule were also developed.
4.2.5 Usage Characterization
While most projects that work on characterizing the usage, content,
and structure of the Web don't necessarily consider themselves to be
engaged in data mining, there is a large amount of overlap between
Web characterization research and Web Usage mining. Catledge et al.
[23] discuss the results of a study conducted at the Georgia
Institute of Technology, in which the Web browser Xmosaic was
modified to log client side activity. The results collected provide
detailed information about the user's interaction with the browser
interface as well as the navigational strategy used to browse a
particular site. The project also provides detailed statistics about
the occurrence of various client side events such as clicking the
back/forward buttons, saving a file, adding to bookmarks, etc. Pitkow
et al. [36] propose a model which can be used to predict the
probability distribution for various pages a user might visit on a
given site. This model works by assigning a value to all the pages on
a site based on various attributes of that page. The formulas and
threshold values used in the model are derived from an extensive
empirical study carried out on various browsing communities and their
browsing patterns. Arlitt et al. [20] discuss various performance
metrics for Web servers along with details about the relationship
between each of these metrics for different workloads. Manley [40]
develops a technique for generating a custom made benchmark for a
given site based on its current workload. This benchmark, which he
calls a self-configuring benchmark, can be used to perform
scalability and load balancing studies on a Web server. Chi et al.
[35] describe a system called WEEV (Web Ecology and Evolution
Visualization) which is a visualization tool to study the evolving
relationship of web usage, content and site topology with respect to
time.
5. WEBSIFT OVERVIEW
The WebSIFT system [31] is designed to perform Web Usage Mining from
server logs in the extended NCSA format (includes referrer and agent
fields). The preprocessing algorithms include identifying users,
server sessions, and inferring cached page references through the use
of the referrer field. The details of the algorithms used for these
steps are contained in [30]. In addition to creating a server session
file, the WebSIFT system performs content and structure
preprocessing, and provides the option to convert server sessions
into episodes. Each episode is either the subset of all content pages
in a server session, or all of the navigation pages up to and
including each content page. Several algorithms for identifying
episodes (referred to as transactions in the paper) are described and
evaluated in [28].
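The preprocessing steps above can be sketched as follows. This is an
illustrative outline and not the WebSIFT implementation: identifying a
user by the (host, agent) pair, the 30 minute session timeout, and all
function names are assumptions, and path completion from the referrer
field is omitted. The two episode functions follow one reading of the
two episode definitions given above.

```python
import re
from datetime import datetime, timedelta

# One log line with referrer and agent fields appended (combined format).
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it is malformed."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    return rec


def sessionize(records, timeout=timedelta(minutes=30)):
    """Group requests into server sessions per (host, agent) pair."""
    sessions = []
    open_session = {}  # user key -> that user's most recent session
    for rec in sorted(records, key=lambda r: r["time"]):
        key = (rec["host"], rec["agent"])
        current = open_session.get(key)
        if current is None or rec["time"] - current[-1]["time"] > timeout:
            current = []  # gap too long (or new user): start a session
            sessions.append(current)
            open_session[key] = current
        current.append(rec)
    return sessions


def content_episode(urls, content_pages):
    """Episode made of only the content pages of one server session."""
    return [u for u in urls if u in content_pages]


def navigation_episodes(urls, content_pages):
    """One episode per content page: the navigation pages before it, then it."""
    episodes, path = [], []
    for u in urls:
        path.append(u)
        if u in content_pages:
            episodes.append(path)
            path = []
    return episodes


lines = [
    '192.0.2.1 - - [10/Oct/1999:13:55:36 -0500] "GET /index.html HTTP/1.0" '
    '200 2326 "-" "Mozilla/4.08"',
    '192.0.2.1 - - [10/Oct/1999:13:58:00 -0500] "GET /a.html HTTP/1.0" '
    '200 512 "http://www.example.com/index.html" "Mozilla/4.08"',
    '192.0.2.1 - - [10/Oct/1999:15:00:00 -0500] "GET /b.html HTTP/1.0" '
    '200 100 "-" "Mozilla/4.08"',
    '192.0.2.1 - - [10/Oct/1999:13:56:00 -0500] "GET /index.html HTTP/1.0" '
    '200 2326 "-" "Lynx/2.8"',
]
sessions = sessionize([parse_line(l) for l in lines])
print([[r["url"] for r in s] for s in sessions])
```

Which pages count as content pages is an input here; in practice it
would come from a page classification step.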
The server session or episode files can be run through sequential
pattern analysis, association rule discovery, clustering, or general
statistics algorithms, as shown in Figure 5. The results of the
various knowledge discovery tools can be analyzed through a simple
knowledge query mechanism, a visualization tool (association rule map
with confidence and support weighted edges), or the information
filter (OLAP tools such as a data cube are possible as shown in
Figure 5, but are not currently implemented). The information filter
makes use of the preprocessed content and structure information to
automatically filter the results of the knowledge discovery
algorithms for patterns that are potentially interesting. For
example, usage clusters that contain page views from multiple content
clusters are potentially interesting, whereas usage clusters that
match content clusters may not be interesting. The details of the
method the information filter uses to combine and compare evidence
from the different data sources are contained in [31].
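The usage cluster example above can be sketched as a simple filter.
This illustrates only that one heuristic and not the evidence
combination method of [31]; the function name and the
`content_cluster_of` mapping are assumptions made for the example.

```python
def interesting_usage_clusters(usage_clusters, content_cluster_of):
    """Keep usage clusters whose pages span more than one content cluster."""
    kept = []
    for cluster in usage_clusters:
        spanned = {content_cluster_of[p] for p in cluster
                   if p in content_cluster_of}
        if len(spanned) > 1:
            kept.append((cluster, spanned))
    return kept


content_cluster_of = {
    "/news/a": "news",
    "/news/b": "news",
    "/shop/x": "shop",
}
usage_clusters = [{"/news/a", "/news/b"}, {"/news/a", "/shop/x"}]
result = interesting_usage_clusters(usage_clusters, content_cluster_of)
print(result)
```

The first usage cluster stays within a single content cluster and is
dropped; the second spans two and is kept.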
6. PRIVACY ISSUES
Privacy is a sensitive topic which has been attracting a lot of
attention recently due to the rapid growth of e-commerce. It is
further complicated by the global and self-regulatory nature of the
Web. The issue of privacy revolves around the fact that most users
want to maintain strict anonymity on the Web. They are extremely
averse to the idea that someone is monitoring the Web sites they
visit and the time they spend on those sites.
On the other hand, site administrators are interested in finding out
the demographics of users as well as the usage statistics of
different sections of their Web site. This information would allow
them to improve the design of the Web site and would ensure that the
content caters to the largest population of users visiting their
site. The site administrators also want the ability to identify a
user uniquely every time she visits the site, in order to personalize
the Web site and improve the browsing experience.
The main challenge is to come up with guidelines and rules such that
site administrators can perform various analyses on the usage data
without compromising the identity of an individual user. Furthermore,
there should be strict regulations to prevent the usage data from
being exchanged/sold to other sites. The users should be made aware
of the privacy policies followed by any given site, so that they can
make an informed decision about revealing their personal data. The
success of any such guidelines can only be guaranteed if they are
backed up by a legal framework.
The W3C has an ongoing initiative called Platform for Privacy
Preferences (P3P) [10; 38]. P3P provides a protocol which allows site
administrators to publish the privacy policies followed by a site in
a machine readable format. When the user visits the site for the
first time, the browser reads the privacy policies followed by the
site and then compares them with the security settings configured by
the user. If the policies are satisfactory the browser continues
requesting pages from the site; otherwise a negotiation protocol is
used to arrive at a setting which is acceptable to the user. Another
aim of P3P is to provide guidelines for independent organizations
which can ensure that sites comply with the policy statement they are
publishing [12].
The European Union has taken a lead in setting up a regulatory
framework for Internet Privacy and has issued a directive which sets
guidelines for processing and transfer of personal data [15].
Unfortunately, in the U.S. there is no unifying framework in place,
though the U.S. Federal Trade Commission (FTC), after a study of
commercial Web sites, has recommended that Congress develop
legislation to regulate the personal information being collected at
Web sites [26].
7. CONCLUSIONS
This paper has attempted to provide an up-to-date survey of the
rapidly growing area of Web Usage mining. With the growth of
Web-based applications, specifically electronic commerce, there is
significant interest in analyzing Web usage data to better understand
Web usage, and apply the knowledge to better serve users. This has
led to a number of commercial offerings for doing such analysis.
However, Web Usage mining raises some hard scientific questions that
must be answered before robust tools can be developed. This article
has aimed at describing such challenges, and the hope is that the
research community will take up the challenge of addressing them.
8. REFERENCES
[1] Accrue. http://www.accrue.com.
[2] Alladvantage. http://www.alladvantage.com.
[3] Andromedia aria. http://www.andromedia.com.
[4] Broadvision. http://www.broadvision.com.
[5] Hitlist commerce. http://www.marketwave.com.
[6] Likeminds. http://www.andromedia.com.
[7] Netgenesis. http://www.netgenesis.com.
[8] Netperceptions. http://www.netperceptions.com.
[9] Netzero. http://www.netzero.com.
[10] Platform for privacy project. http://www.w3.org/P3P/.
[11] Surfaid analytics. http://surfaid.dfw.ibm.com.
[12] Truste: Building a web you can believe in. http://www.truste.org/.
[13] Webtrends log analyzer. http://www.webtrends.com.
[14] World wide web committee web usage characterization activity. http://www.w3.org/WCA.
[15] European commission. The directive on the protection of individuals with regard to the processing of personal data and on the free movement of such data. http://www2.echo.lu/, 1998.
[Diagram with four phases. Input: Access Log, Referrer Log, Agent
Log, Site Files, Registration or Remote Agent Data. Preprocessing:
Data Cleaning, User Identification, Session Identification, Path
Completion, Episode Identification, Site Spider, Classification
Algorithm, producing the Server Session File, Episode File, Site
Topology, Site Content, and Page Classification. Pattern Discovery:
Standard Statistics Package, Association Rule Mining, Sequential
Pattern Mining, Clustering, producing Usage Statistics, Association
Rules, Sequential Patterns, Page Clusters, and User Clusters. Pattern
Analysis: Information Filter, Knowledge Query Mechanism,
OLAP/Visualization, producing "Interesting" Rules, Patterns, and
Statistics.]
Figure 5: Architecture for the WebSIFT System
the 5th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining (KDD99).
[17] Charu C Aggarwal and Philip S Yu. On disk caching of web objects in proxy servers. In CIKM 97, pages 238-245, Las Vegas, Nevada, 1997.
[18] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, pages 487-499, Santiago, Chile, 1994.
[19] Virgilio Almeida, Azer Bestavros, Mark Crovella, and Adriana de Oliveira. Characterizing reference locality in the www. Technical Report TR-96-11, Boston University, 1996.
[20] Martin F Arlitt and Carey L Williamson. Internet web servers: Workload characterization and performance implications. IEEE/ACM Transactions on Networking, 5(5):631-645, 1997.
[21] M. Balabanovic and Y. Shoham. Learning information retrieval agents: Experiments with automated web browsing. In On-line Working Notes of the AAAI Spring Symposium Series on Information Gathering from Distributed, Heterogeneous Environments, 1995.
[22] Alex Buchner and Maurice D Mulvenna. Discovering internet marketing intelligence through online analytical web usage mining. SIGMOD Record, 27(4):54-61, 1998.
[23] L. Catledge and J. Pitkow. Characterizing browsing behaviors on the world wide web. Computer Networks and ISDN Systems, 27(6), 1995.
[24] M.S. Chen, J. Han, and P.S. Yu. Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):866-883, 1996.
[25] M.S. Chen, J.S. Park, and P.S. Yu. Data mining for path traversal patterns in a web environment. In 16th International Conference on Distributed Computing Systems, pages 385-392, 1996.
[26] Roger Clarke. Internet privacy concerns confirm the case for intervention. 42(2):60-67, 1999.
[27] E. Cohen, B. Krishnamurthy, and J. Rexford. Improving end-to-end performance of the web using server volumes and proxy filters. In Proc. ACM SIGCOMM, pages 241-253, 1998.
[28] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Grouping web page references into transactions for mining world wide web browsing patterns. In Knowledge and Data Engineering Workshop, pages 2-9, Newport Beach, CA, 1997. IEEE.
[29] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Web mining: Information and pattern discovery on the world wide web. In International Conference on Tools with Artificial Intelligence, pages 558-567, Newport Beach, 1997. IEEE.
[30] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems, 1(1), 1999.
[31] Robert Cooley, Pang-Ning Tan, and Jaideep Srivastava. Discovery of interesting usage patterns from web data. Technical Report TR 99-022, University of Minnesota, 1999.
[32] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53-62, San Diego, CA, 1999. ACM.
[33] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery: An overview. In Proc. ACM KDD, 1994.
[34] David Gibson, Jon Kleinberg, and Prabhakar Raghavan. Inferring web communities from link topology. In Conference on Hypertext and Hypermedia. ACM, 1998.
[35] Chi E. H., Pitkow J., Mackinlay J., Pirolli P., Gossweiler, and Card S. K. Visualizing the evolution of web ecologies. In CHI '98, Los Angeles, California, 1998.
[36] Bernardo Huberman, Peter Pirolli, James Pitkow, and Rajan Kukose. Strong regularities in world wide web surfing. Technical report, Xerox PARC, 1998.
[37] T. Joachims, D. Freitag, and T. Mitchell. Webwatcher: A tour guide for the world wide web. In The 15th International Conference on Artificial Intelligence, Nagoya, Japan, 1997.
[38] Reagle Joseph and Cranor Lorrie Faith. The platform for privacy preferences. 42(2):48-55, 1999.
[39] H. Lieberman. Letizia: An agent that assists web browsing. In Proc. of the 1995 International Joint Conference on Artificial Intelligence, Montreal, Canada, 1995.
[40] Stephen Lee Manley. An Analysis of Issues Facing World Wide Web Servers. Undergraduate, Harvard, 1997.
[41] B. Masand and M. Spiliopoulou, editors. Workshop on Web Usage Analysis and User Profiling (WebKDD), 1999.
[42] B. Mobasher, N. Jain, E. Han, and J. Srivastava. Web mining: Pattern discovery from world wide web transactions. (TR 96-050), 1996.
[43] Bamshad Mobasher, Robert Cooley, and Jaideep Srivastava. Creating adaptive web sites through usage-based clustering of urls. In Knowledge and Data Engineering Workshop, 1999.
[44] Olfa Nasraoui, Raghu Krishnapuram, and Anupam Joshi. Mining web access logs using a fuzzy relational clustering algorithm based on a robust estimator. In Eighth International World Wide Web Conference, Toronto, Canada, 1999.
that helps incremental exploration of the world wide web. In 6th International World Wide Web Conference, Santa Clara, CA, 1997.
[46] Balaji Padmanabhan and Alexander Tuzhilin. A belief-driven method for discovering unexpected patterns. In Fourth International Conference on Knowledge Discovery and Data Mining, pages 94-100, New York, New York, 1998.
[47] M. Pazzani, L. Nguyen, and S. Mantik. Learning from hotlists and coldlists: Towards a www information filtering and seeking agent. In IEEE 1995 International Conference on Tools with Artificial Intelligence, 1995.
[48] Mike Perkowitz and Oren Etzioni. Adaptive web sites: Automatically synthesizing web pages. In Fifteenth National Conference on Artificial Intelligence, Madison, WI, 1998.
[49] Mike Perkowitz and Oren Etzioni. Adaptive web sites: Conceptual cluster mining. In Sixteenth International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 1999.
[50] Peter Pirolli, James Pitkow, and Ramana Rao. Silk from a sow's ear: Extracting usable structures from the web. In CHI-96, Vancouver, 1996.
[51] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
[52] S. Schechter, M. Krishnan, and M. D. Smith. Using path profiles to predict http requests. In 7th International World Wide Web Conference, Brisbane, Australia, 1998.
[53] Cyrus Shahabi, Amir M Zarkesh, Jafar Adibi, and Vishal Shah. Knowledge discovery from users web-page navigation. In Workshop on Research Issues in Data Engineering, Birmingham, England, 1997.
[54] E. Spertus. Parasite: Mining structural information on the web. Computer Networks and ISDN Systems: The International Journal of Computer and Telecommunication Networking, 29:1205-1215, 1997.
[55] Myra Spiliopoulou and Lukas C Faulstich. Wum: A web utilization miner. In EDBT Workshop WebDB98, Valencia, Spain, 1998. Springer Verlag.
[56] Kun-lung Wu, Philip S Yu, and Allen Ballman. Speedtracer: A web usage mining and analysis tool. IBM Systems Journal, 37(1), 1998.
[57] T. Yan, M. Jacobsen, H. Garcia-Molina, and U. Dayal. From user access patterns to dynamic hypertext linking. In Fifth International World Wide Web Conference, Paris, France, 1996.
[58] O. R. Zaiane, M. Xin, and J. Han. Discovering web access patterns and trends by applying olap and data mining technology on web logs. In Advances in Digital Libraries, pages 19-29, Santa Barbara, CA, 1998.
Vishal Shah. Analysis and design of server informative www-sites. In Sixth International Conference on Information and Knowledge Management, Las Vegas, Nevada, 1997.
About the Authors:
Jaideep Srivastava received the B.Tech. degree in computer science
from the Indian Institute of Technology, Kanpur, India, in 1983, and
the M.S. and Ph.D. degrees in computer science from the University of
California, Berkeley, in 1985 and 1988, respectively. Since 1988 he
has been on the faculty of the Computer Science Department,
University of Minnesota, Minneapolis, where he is currently an
Associate Professor. In 1983 he was a research engineer with Uptron
Digital Systems, Lucknow, India. He has published over 110 papers in
refereed journals and conferences in the areas of databases, parallel
processing, artificial intelligence, and multi-media. His current
research is in the areas of databases, distributed systems, and
multi-media computing. He has given a number of invited talks and
participated in panel discussions on these topics. Dr. Srivastava is
a senior member of the IEEE Computer Society and the ACM. His
professional activities have included being on various program
committees, and refereeing for journals, conferences, and the NSF.
Robert Cooley is currently pursuing a Ph.D. in computer science at
the University of Minnesota. He received an M.S. in computer science
from Minnesota in 1998. His research interests include Data Mining
and Information Retrieval.
Mukund Deshpande is a Ph.D. student in the Department of Computer
Science at the University of Minnesota. He received an M.E. in system
science & automation from Indian Institute of Science, Bangalore,
India in 1997.
Pang-Ning Tan is currently working towards his Ph.D. in Computer
Science at University of Minnesota. His primary research interest is
in Data Mining. He received an M.S. in Physics from University of
Minnesota in 1996.