Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data


Jaideep Srivastava†, Robert Cooley‡, Mukund Deshpande, Pang-Ning Tan

Department of Computer Science and Engineering, University of Minnesota

200 Union St SE, Minneapolis, MN 55455

{srivasta,cooley,deshpand,ptan}@cs.umn.edu

ABSTRACT

Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. This paper describes each of these phases in detail. Given its application potential, Web usage mining has seen a rapid increase in interest, from both the research and practice communities. This paper provides a detailed taxonomy of the work in this area, including research efforts as well as commercial offerings. An up-to-date survey of the existing work is also provided. Finally, a brief overview of the WebSIFT system as an example of a prototypical Web usage mining system is given.

Keywords: data mining, world wide web, web usage mining.

1. INTRODUCTION

The ease and speed with which business transactions can be carried out over the Web has been a key driving force in the rapid growth of electronic commerce. Specifically, e-commerce activity that involves the end user is undergoing a significant revolution. The ability to track users' browsing behavior down to individual mouse clicks has brought the vendor and end customer closer than ever before. It is now possible for a vendor to personalize his product message for individual customers at a massive scale, a phenomenon that is being referred to as mass customization.

The scenario described above is one of many possible applications of Web Usage mining, which is the process of applying data mining techniques to the discovery of usage patterns from Web data, targeted towards various applications. Data mining efforts associated with the Web, called Web mining, can be broadly divided into three classes, i.e. content mining, usage mining, and structure mining. Web Structure mining projects such as [34; 54] and Web Content mining projects such as [47; 21] are beyond the scope of this survey. An early taxonomy of Web mining is provided in [29], which also describes the architecture of the WebMiner system [42], one of the first systems for Web Usage mining. The proceedings of the recent WebKDD workshop [41], held in conjunction with the KDD-1999 conference, provide a sampling of some of the current research being performed in the area of Web Usage Analysis, including Web Usage mining.

Jaideep Srivastava can be contacted at jaideep@amazon.com

† Supported by NSF grant NSF/EIA-9818338

‡ Supported by NSF grant EHR-9554517

This paper provides an up-to-date survey of Web Usage mining, including both academic and industrial research efforts, as well as commercial offerings. Section 2 describes the various kinds of Web data that can be useful for Web Usage mining. Section 3 discusses the challenges involved in discovering usage patterns from Web data. The three phases of this process are preprocessing, pattern discovery, and pattern analysis. Section 4 provides a detailed taxonomy and survey of the existing efforts in Web Usage mining, and Section 5 gives an overview of the WebSIFT system [31], as a prototypical example of a Web Usage mining system. Finally, Section 6 discusses privacy concerns and Section 7 concludes the paper.

2. WEB DATA

One of the key steps in Knowledge Discovery in Databases [33] is to create a suitable target data set for the data mining tasks. In Web Mining, data can be collected at the server-side, client-side, proxy servers, or obtained from an organization's database (which contains business data or consolidated Web data). Each type of data collection differs not only in terms of the location of the data source, but also the kinds of data available, the segment of population from which the data was collected, and its method of implementation.

There are many kinds of data that can be used in Web Mining. This paper classifies such data into the following types:

Content: The real data in the Web pages, i.e. the data the Web page was designed to convey to the users. This usually consists of, but is not limited to, text and graphics.

Structure: Data which describes the organization of the content. Intra-page structure information includes the arrangement of various HTML or XML tags within a given page. This can be represented as a tree structure, where the <html> tag becomes the root of the tree. Inter-page structure information consists of the hyper-links connecting one page to another.

Usage: Data that describes the pattern of usage of Web pages, such as IP addresses, page references, and the date and time of accesses.

User Profile: Data that provides demographic information about users of the Web site. This includes registration data and customer profile information.

2.1 Data Sources

The usage data collected at the different sources will represent the navigation patterns of different segments of the overall Web traffic, ranging from single-user, single-site browsing behavior to multi-user, multi-site access patterns.

2.1.1 Server Level Collection

A Web server log is an important source for performing Web Usage Mining because it explicitly records the browsing behavior of site visitors. The data recorded in server logs reflects the (possibly concurrent) access of a Web site by multiple users. These log files can be stored in various formats such as Common log or Extended log formats. An example of Extended log format is given in Figure 2 (Section 3). However, the site usage data recorded by server logs may not be entirely reliable due to the presence of various levels of caching within the Web environment. Cached page views are not recorded in a server log. In addition, any important information passed through the POST method will not be available in a server log. Packet sniffing technology is an alternative method to collecting usage data through server logs. Packet sniffers monitor network traffic coming to a Web server and extract usage data directly from TCP/IP packets. The Web server can also store other kinds of usage information such as cookies and query data in separate logs. Cookies are tokens generated by the Web server for individual client browsers in order to automatically track the site visitors. Tracking of individual users is not an easy task due to the stateless connection model of the HTTP protocol. Cookies rely on implicit user cooperation and thus have raised growing concerns regarding user privacy, which will be discussed in Section 6. Query data is also typically generated by online visitors while searching for pages relevant to their information needs. Besides usage data, the server side also provides content data, structure information and Web page meta-information (such as the size of a file and its last modified time).

The Web server also relies on other utilities such as CGI scripts to handle data sent back from client browsers. Web servers implementing the CGI standard parse the URI¹ of the requested file to determine if it is an application program. The URI for CGI programs may contain additional parameter values to be passed to the CGI application. Once the CGI program has completed its execution, the Web server sends the output of the CGI application back to the browser.

¹ Uniform Resource Identifier (URI) is a more general definition that includes the commonly referred to Uniform Resource Locator (URL).

2.1.2 Client Level Collection

Client-side data collection can be implemented by using a remote agent (such as Javascripts or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities. The implementation of client-side data collection methods requires user cooperation, either in enabling the functionality of the Javascripts and Java applets, or in voluntarily using the modified browser. Client-side collection has an advantage over server-side collection because it ameliorates both the caching and session identification problems. However, Java applets perform no better than server logs in terms of determining the actual view time of a page. In fact, they may incur some additional overhead, especially when the Java applet is loaded for the first time. Javascripts, on the other hand, consume little interpretation time but cannot capture all user clicks (such as reload or back buttons). These methods will collect only single-user, single-site browsing behavior. A modified browser is much more versatile and will allow data collection about a single user over multiple Web sites. The most difficult part of using this method is convincing the users to use the browser for their daily browsing activities. This can be done by offering incentives to users who are willing to use the browser, similar to the incentive programs offered by companies such as NetZero [9] and AllAdvantage [2] that reward users for clicking on banner advertisements while surfing the Web.

2.1.3 Proxy Level Collection

A Web proxy acts as an intermediate level of caching between client browsers and Web servers. Proxy caching can be used to reduce the loading time of a Web page experienced by users as well as the network traffic load at the server and client sides [27]. The performance of proxy caches depends on their ability to predict future page requests correctly. Proxy traces may reveal the actual HTTP requests from multiple clients to multiple Web servers. This may serve as a data source for characterizing the browsing behavior of a group of anonymous users sharing a common proxy server.

2.2 Data Abstractions

The information provided by the data sources described above can all be used to construct/identify several data abstractions, notably users, server sessions, episodes, click-streams, and page views. In order to provide some consistency in the way these terms are defined, the W3C Web Characterization Activity (WCA) [14] has published a draft of Web term definitions relevant to analyzing Web usage. A user is defined as a single individual that is accessing files from one or more Web servers through a browser. While this definition seems trivial, in practice it is very difficult to uniquely and repeatedly identify users. A user may access the Web through different machines, or use more than one agent on a single machine. A page view consists of every file that contributes to the display on a user's browser at one time. Page views are usually associated with a single user action (such as a mouse-click) and can consist of several files such as frames, graphics, and scripts. When discussing and analyzing user behaviors, it is really the aggregate page view that is of importance. The user does not explicitly ask for "n" frames and "m" graphics to be loaded into his or her browser; the user requests a "Web page." All of the information to determine which files constitute a page view is [...] A click-stream is a sequential series of page view requests. Again, the data available from the server side does not always provide enough information to reconstruct the full click-stream for a site. Any page view accessed through a client or proxy-level cache will not be "visible" from the server side. A user session is the click-stream of page views for a single user across the entire Web. Typically, only the portion of each user session that is accessing a specific site can be used for analysis, since access information is not publicly available from the vast majority of Web servers. The set of page views in a user session for a particular Web site is referred to as a server session (also commonly referred to as a visit). A set of server sessions is the necessary input for any Web Usage analysis or data mining tool. The end of a server session is defined as the point when the user's browsing session at that site has ended. Again, this is a simple concept that is very difficult to track reliably. Any semantically meaningful subset of a user or server session is referred to as an episode by the W3C WCA.

3. WEB USAGE MINING

As shown in Figure 1, there are three main tasks for performing Web Usage Mining or Web Usage Analysis. This section presents an overview of the tasks for each step and discusses the challenges involved.

3.1 Preprocessing

Preprocessing consists of converting the usage, content, and structure information contained in the various available data sources into the data abstractions necessary for pattern discovery.

3.1.1 Usage Preprocessing

Usage preprocessing is arguably the most difficult task in the Web Usage Mining process due to the incompleteness of the available data. Unless a client side tracking mechanism is used, only the IP address, agent, and server side click-stream are available to identify users and server sessions. Some of the typically encountered problems are:

Single IP address/Multiple Server Sessions - Internet service providers (ISPs) typically have a pool of proxy servers that users access the Web through. A single proxy server may have several users accessing a Web site, potentially over the same time period.

Multiple IP address/Single Server Session - Some ISPs or privacy tools randomly assign each request from a user to one of several IP addresses. In this case, a single server session can have multiple IP addresses.

Multiple IP address/Single User - A user that accesses the Web from different machines will have a different IP address from session to session. This makes tracking repeat visits from the same user difficult.

Multiple Agent/Single User - Again, a user that uses more than one browser, even on the same machine, will appear as multiple users.

Assuming each user has now been identified (through cookies, logins, or IP/agent/path analysis), the click-stream for each user must be divided into sessions. Since page requests from other servers are not typically available, it is difficult to know when a user has left a Web site. A thirty minute timeout is often used as the default method of breaking a user's click-stream into sessions. The thirty minute timeout is based on the results of [23]. When a session ID is embedded in each URI, the definition of a session is set by the content server.
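The timeout heuristic can be illustrated with a minimal Python sketch. It assumes that the requests of one already-identified user are available as time-ordered (timestamp, page) pairs; the function name `split_by_timeout` and the sample timestamps are illustrative assumptions, not part of any surveyed system.

```python
from datetime import datetime, timedelta

def split_by_timeout(requests, timeout=timedelta(minutes=30)):
    """Split one user's time-ordered (timestamp, page) requests into sessions."""
    sessions = []
    last_time = None
    for timestamp, page in requests:
        # Start a new session on the first request or after a gap longer than the timeout.
        if last_time is None or timestamp - last_time > timeout:
            sessions.append([])
        sessions[-1].append(page)
        last_time = timestamp
    return sessions

# Hypothetical click-stream of a single user (timestamps are made up for illustration).
clicks = [
    (datetime(1998, 4, 25, 3, 4, 41), "A.html"),
    (datetime(1998, 4, 25, 3, 5, 34), "B.html"),
    (datetime(1998, 4, 25, 5, 5, 22), "A.html"),
    (datetime(1998, 4, 25, 5, 6, 3), "D.html"),
]
sessions = split_by_timeout(clicks)
print(sessions)
```

The two-hour gap between the second and third request exceeds the thirty minute default, so the sketch yields two sessions. The timeout is a parameter because the appropriate value may differ from site to site.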

While the exact content served as a result of each user action is often available from the request field in the server logs, it is sometimes necessary to have access to the content server information as well. Since content servers can maintain state variables for each active session, the information necessary to determine exactly what content is served by a user request is not always available in the URI. The final problem encountered when preprocessing usage data is that of inferring cached page references. As discussed in Section 2.2, the only verifiable method of tracking cached page views is to monitor usage from the client side. The referrer field for each request can be used to detect some of the instances when cached pages have been viewed.

Figure 2 shows a sample log that illustrates several of the problems discussed above (the first column would not be present in an actual server log, and is for illustrative purposes only). IP address 123.456.78.9 is responsible for three server sessions, and IP addresses 209.456.78.2 and 209.456.78.3 are responsible for a fourth session. Using a combination of referrer and agent information, lines 1 through 11 can be divided into three sessions of A-B-F-O-G, L-R, and A-B-C-J. Path completion would add two page references to the first session, giving A-B-F-O-F-B-G, and one reference to the third session, giving A-B-A-C-J. Without using cookies, an embedded session ID, or a client-side data collection method, there is no method for determining that lines 12 and 13 are actually a single server session.
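The referrer/agent heuristic and path completion described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure of any particular system: a request joins the most recent session with the same IP address and agent in which its referrer has already been seen, and otherwise opens a new session. The names `identify_sessions` and `complete_path` and the tuple encoding of the Figure 2 log are assumptions made for this example.

```python
WIN = "Mozilla/3.04 (Win95, I)"
IRIX = "Mozilla/3.01 (X11, I, IRIX6.2, IP22)"
IP1 = "123.456.78.9"

# (ip, agent, page, referrer) for lines 1-13 of Figure 2; None stands for "-".
LOG = [
    (IP1, WIN, "A", None), (IP1, WIN, "B", "A"), (IP1, WIN, "L", None),
    (IP1, WIN, "F", "B"), (IP1, IRIX, "A", None), (IP1, IRIX, "B", "A"),
    (IP1, WIN, "R", "L"), (IP1, IRIX, "C", "A"), (IP1, WIN, "O", "F"),
    (IP1, IRIX, "J", "C"), (IP1, WIN, "G", "B"),
    ("209.456.78.2", WIN, "A", None), ("209.456.78.3", WIN, "D", "A"),
]

def identify_sessions(log):
    """Group requests into sessions using IP address, agent, and referrer."""
    sessions = []  # each entry: {"key": (ip, agent), "hits": [(page, referrer), ...]}
    for ip, agent, page, referrer in log:
        target = None
        if referrer is not None:
            # Look for the latest session of this IP/agent that contains the referrer.
            for session in reversed(sessions):
                visited = {p for p, _ in session["hits"]}
                if session["key"] == (ip, agent) and referrer in visited:
                    target = session
                    break
        if target is None:
            target = {"key": (ip, agent), "hits": []}
            sessions.append(target)
        target["hits"].append((page, referrer))
    return [s["hits"] for s in sessions]

def complete_path(hits):
    """Insert the cached pages implied by 'back' navigation into a session."""
    stack, path = [], []
    for page, referrer in hits:
        if referrer in stack:
            # The user backed up to the referrer; add each page passed on the way.
            while stack[-1] != referrer:
                stack.pop()
                path.append(stack[-1])
        stack.append(page)
        path.append(page)
    return path

found = identify_sessions(LOG)
raw_paths = ["-".join(page for page, _ in hits) for hits in found]
completed_paths = ["-".join(complete_path(hits)) for hits in found]
print(raw_paths)
print(completed_paths)
```

On the Figure 2 data the sketch reproduces the three sessions for lines 1 through 11 and the completed paths given in the text, and it leaves lines 12 and 13 as two separate sessions because their IP addresses differ.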

3.1.2 Content Preprocessing

Content preprocessing consists of converting the text, image, scripts, and other files such as multimedia into forms that are useful for the Web Usage Mining process. Often, this consists of performing content mining such as classification or clustering. While applying data mining to the content of Web sites is an interesting area of research in its own right, in the context of Web Usage Mining the content of a site can be used to filter the input to, or output from, the pattern discovery algorithms. For example, results of a classification algorithm could be used to limit the discovered patterns to those containing page views about a certain subject or class of products. In addition to classifying or clustering page views based on topics, page views can also be classified according to their intended use [50; 30]. Page views can be intended to convey information (through text, graphics, or other multimedia), gather information from the user, allow navigation (through a list of hypertext links), or some combination of these uses. The intended use of a page view can also filter the sessions before or after pattern discovery.

In order to run content mining algorithms on page views, the information must first be converted into a quantifiable format. Some version of the vector space model [51] is typically used to accomplish this. Text files can be broken up into vectors of words. Keywords or text descriptions can be substituted for graphics or multimedia. The content of static page views can be easily preprocessed by parsing the HTML and reformatting the information or running additional [...]

[Diagram: Site Files and Raw Logs -> Preprocessing -> Preprocessed Clickstream Data -> Pattern Discovery -> Rules, Patterns, and Statistics -> Pattern Analysis -> "Interesting" Rules, Patterns, and Statistics]

Figure 1: High Level Web Usage Mining Process

#   IP Address    Userid  Time                          Method/URL/Protocol    Status  Size  Referrer  Agent
1   123.456.78.9  -       [25/Apr/1998:03:04:41 -0500]  "GET A.html HTTP/1.0"  200     3290  -         Mozilla/3.04 (Win95, I)
2   123.456.78.9  -       [25/Apr/1998:03:05:34 -0500]  "GET B.html HTTP/1.0"  200     2050  A.html    Mozilla/3.04 (Win95, I)
3   123.456.78.9  -       [25/Apr/1998:03:05:39 -0500]  "GET L.html HTTP/1.0"  200     4130  -         Mozilla/3.04 (Win95, I)
4   123.456.78.9  -       [25/Apr/1998:03:06:02 -0500]  "GET F.html HTTP/1.0"  200     5096  B.html    Mozilla/3.04 (Win95, I)
5   123.456.78.9  -       [25/Apr/1998:03:06:58 -0500]  "GET A.html HTTP/1.0"  200     3290  -         Mozilla/3.01 (X11, I, IRIX6.2, IP22)
6   123.456.78.9  -       [25/Apr/1998:03:07:42 -0500]  "GET B.html HTTP/1.0"  200     2050  A.html    Mozilla/3.01 (X11, I, IRIX6.2, IP22)
7   123.456.78.9  -       [25/Apr/1998:03:07:55 -0500]  "GET R.html HTTP/1.0"  200     8140  L.html    Mozilla/3.04 (Win95, I)
8   123.456.78.9  -       [25/Apr/1998:03:09:50 -0500]  "GET C.html HTTP/1.0"  200     1820  A.html    Mozilla/3.01 (X11, I, IRIX6.2, IP22)
9   123.456.78.9  -       [25/Apr/1998:03:10:02 -0500]  "GET O.html HTTP/1.0"  200     2270  F.html    Mozilla/3.04 (Win95, I)
10  123.456.78.9  -       [25/Apr/1998:03:10:45 -0500]  "GET J.html HTTP/1.0"  200     9430  C.html    Mozilla/3.01 (X11, I, IRIX6.2, IP22)
11  123.456.78.9  -       [25/Apr/1998:03:12:23 -0500]  "GET G.html HTTP/1.0"  200     7220  B.html    Mozilla/3.04 (Win95, I)
12  209.456.78.2  -       [25/Apr/1998:05:05:22 -0500]  "GET A.html HTTP/1.0"  200     3290  -         Mozilla/3.04 (Win95, I)
13  209.456.78.3  -       [25/Apr/1998:05:06:03 -0500]  "GET D.html HTTP/1.0"  200     1680  A.html    Mozilla/3.04 (Win95, I)

Figure 2: Sample Web Server Log
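As an illustration of how the fields of such a log can be extracted, the following Python sketch parses one entry laid out as in Figure 2 (without the illustrative first column). The regular expression is written for this sample layout only and is an assumption; real server logs vary, for example by quoting the referrer and agent fields. The names `LINE` and `parse_line` are hypothetical.

```python
import re
from datetime import datetime

# Pattern for one entry in the layout of Figure 2 (an assumption for this sketch).
LINE = re.compile(
    r'(?P<ip>\S+) (?P<userid>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) (?P<referrer>\S+) (?P<agent>.+)'
)

def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    if record["referrer"] == "-":
        record["referrer"] = None  # "-" marks a missing referrer
    return record

sample = ('123.456.78.9 - [25/Apr/1998:03:05:34 -0500] '
          '"GET B.html HTTP/1.0" 200 2050 A.html Mozilla/3.04 (Win95, I)')
record = parse_line(sample)
print(record["url"], record["referrer"], record["agent"])
```

The parsed IP address, agent, referrer, and timestamp fields are the inputs needed by the user and session identification heuristics discussed in Section 3.1.1.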

Dynamic page views present more of a challenge. Content servers that employ personalization techniques and/or draw upon databases to construct the page views may be capable of forming more page views than can be practically preprocessed. A given set of server sessions may only access a fraction of the page views possible for a large dynamic site. Also, the content may be revised on a regular basis. The content of each page view to be preprocessed must be "assembled", either by an HTTP request from a crawler, or by a combination of template, script, and database accesses. If only the portion of page views that are accessed is preprocessed, the output of any classification or clustering algorithms may be skewed.
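The vector space representation mentioned above can be sketched in Python as follows. This is a minimal term-frequency version with cosine similarity, chosen for brevity; it is one of many variants of the model and not necessarily the one used by any surveyed system. The page texts and the names `term_vector` and `cosine` are illustrative assumptions.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Represent the text of a page view as a bag of lower-cased word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical page texts after HTML tags have been stripped.
music = term_vector("Music CDs and music players")
audio = term_vector("Portable music players and headphones")
garden = term_vector("Garden tools for spring planting")

print(cosine(music, audio), cosine(music, garden))
```

Pages that share vocabulary receive a similarity above zero, while pages with no common terms receive zero. Such vectors can serve as the input to the classification or clustering algorithms discussed in this section.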

3.1.3 Structure Preprocessing

The structure of a site is created by the hypertext links between page views. The structure can be obtained and preprocessed in the same manner as the content of a site. Again, dynamic content (and therefore links) poses more problems than static page views. A different site structure may have to be constructed for each server session.

3.2 Pattern Discovery

Pattern discovery draws upon methods and algorithms developed from several fields such as statistics, data mining, machine learning and pattern recognition. However, it is not the intent of this paper to describe all the available algorithms and techniques derived from these fields. Interested readers should consult references such as [33; 24]. This section describes the kinds of mining activities that have been applied to the Web domain. Methods developed from other fields must take into consideration the different kinds of data abstractions and prior knowledge available for Web Mining. For example, in association rule discovery, the notion of a transaction for market-basket analysis does not take into consideration the order in which items are selected. However, in Web Usage Mining, a server session is an ordered sequence of pages requested by a user. Furthermore, due to the difficulty in identifying unique sessions, additional prior knowledge is required (such as imposing a default timeout period, as was pointed out in the previous section).

3.2.1 Statistical Analysis

Statistical techniques are the most common method to extract knowledge about visitors to a Web site. By analyzing the session file, one can perform different kinds of descriptive statistical analyses (frequency, mean, median, etc.) on variables such as page views, viewing time and length of a navigational path. Many Web traffic analysis tools produce a periodic report containing statistical information such as the most frequently accessed pages, average view time of a page or average length of a path through a site. This report may include limited low-level error analysis such as detecting unauthorized entry points or finding the most common invalid URI. Despite lacking in the depth of its analysis, this type of knowledge can be potentially useful for improving the system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions.
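A report of this kind can be approximated with a few lines of Python. The sketch below assumes that preprocessing has already produced (page, viewing time in seconds) pairs; the data values and the field names of `report` are made up for illustration.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical (page, viewing time in seconds) pairs taken from a session file.
page_views = [
    ("A.html", 50), ("B.html", 30), ("A.html", 40),
    ("F.html", 240), ("A.html", 60),
]

frequency = Counter(page for page, _ in page_views)
seconds = [view_time for _, view_time in page_views]

report = {
    "most_frequent_page": frequency.most_common(1)[0],  # (page, access count)
    "mean_view_time": mean(seconds),
    "median_view_time": median(seconds),
}
print(report)
```

The gap between the mean and the median in this toy data shows why both are usually reported: a single long view, which may simply be an abandoned browser window, pulls the mean well above the typical viewing time.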

3.2.2 Association Rules

Association rule generation can be used to relate pages that are most often referenced together in a single server session. In this context, association rules refer to sets of pages that are accessed together with a support value exceeding some specified threshold. These pages may not be directly connected to one another via hyperlinks. For example, association rule discovery using the Apriori algorithm [18] (or one of its variants) may reveal a correlation between users who visited a page containing electronic products and those who access a page about sporting equipment. Aside from being applicable for business and marketing applications, the presence or absence of such rules can help Web designers to restructure their Web site. The association rules may also serve as a heuristic for prefetching documents in order to reduce user-perceived latency when loading a page from a remote site.
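The notions of support and confidence can be made concrete with a small Python sketch. It enumerates page pairs by brute force rather than implementing the Apriori algorithm itself, and it treats each session as an unordered set of pages; the session data, the threshold, and the name `pair_rules` are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def pair_rules(sessions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent page pairs."""
    n = len(sessions)
    page_sets = [set(session) for session in sessions]
    single = Counter(page for pages in page_sets for page in pages)
    pairs = Counter(
        frozenset(combo) for pages in page_sets for combo in combinations(sorted(pages), 2)
    )
    rules = []
    for pair, count in pairs.items():
        support = count / n  # fraction of sessions containing both pages
        if support < min_support:
            continue
        a, b = sorted(pair)
        rules.append((a, b, support, count / single[a]))  # confidence of a -> b
        rules.append((b, a, support, count / single[b]))  # confidence of b -> a
    return rules

# Hypothetical server sessions (order is ignored for association rules).
sessions = [
    ["/electronics", "/sports", "/cart"],
    ["/electronics", "/sports"],
    ["/electronics", "/books"],
    ["/sports", "/electronics"],
]
rules = sorted(pair_rules(sessions))
for rule in rules:
    print(rule)
```

In this toy data, three of four sessions contain both pages, so the pair has support 0.75. Every session that visits the sporting page also visits the electronics page, giving that direction a confidence of 1.0.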

3.2.3 Clustering

Clustering is a technique to group together a set of items having similar characteristics. In the Web Usage domain, there are two kinds of interesting clusters to be discovered: usage clusters and page clusters. Clustering of users tends to establish groups of users exhibiting similar browsing patterns. Such knowledge is especially useful for inferring user demographics in order to perform market segmentation in E-commerce applications or provide personalized Web content to the users. On the other hand, clustering of pages will discover groups of pages having related content. This information is useful for Internet search engines and Web assistance providers. In both applications, permanent or dynamic HTML pages can be created that suggest related hyperlinks to the user according to the user's query or past history of information needs.

3.2.4 Classification

Classification is the task of mapping a data item into one of several predefined classes [33]. In the Web domain, one is interested in developing a profile of users belonging to a particular class or category. This requires extraction and selection of features that best describe the properties of a given class or category. Classification can be done by using supervised inductive learning algorithms such as decision tree classifiers, naive Bayesian classifiers, k-nearest neighbor classifiers, Support Vector Machines etc. For example, classification on server logs may lead to the discovery of interesting rules such as: 30% of users who placed an online order in /Product/Music are in the 18-25 age group and live on the West Coast.

3.2.5 Sequential Patterns

The technique of sequential pattern discovery attempts to find inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. By using this approach, Web marketers can predict future visit patterns which will be helpful in placing advertisements aimed at certain user groups. Other types of temporal analysis that can be performed on sequential patterns include trend analysis, change point detection, or similarity analysis.

3.2.6 Dependency Modeling

Dependency modeling is another useful pattern discovery task in Web Mining. The goal here is to develop a model capable of representing significant dependencies among the various variables in the Web domain. As an example, one may be interested in building a model representing the different stages a visitor undergoes while shopping in an online store based on the actions chosen (i.e. from a casual visitor to a serious potential buyer). There are several probabilistic learning techniques that can be employed to model the browsing behavior of users. Such techniques include Hidden Markov Models and Bayesian Belief Networks. Modeling of Web usage patterns will not only provide a theoretical framework for analyzing the behavior of users but is potentially useful for predicting future Web resource consumption. Such information may help develop strategies to increase the sales of products offered by the Web site or improve the navigational convenience of users.
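As a simple illustration of probabilistic modeling of browsing behavior, the Python sketch below estimates a first-order Markov chain over page transitions. This is a deliberately simpler relative of the Hidden Markov Models and Bayesian Belief Networks named above, swapped in for brevity. The sessions are the completed paths from the Figure 2 discussion, and the name `transition_probabilities` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Estimate P(next page | current page) from ordered sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return {
        page: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for page, followers in counts.items()
    }

# Completed paths from the Figure 2 example.
sessions = [
    ["A", "B", "F", "O", "F", "B", "G"],
    ["L", "R"],
    ["A", "B", "A", "C", "J"],
]
model = transition_probabilities(sessions)
print(model["A"])
```

In this data, page A is followed by B in two of three cases and by C in the third. A model of this kind could be used to rank candidate pages for prefetching or recommendation, subject to the usual caveat that a first-order chain ignores how the user reached the current page.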

3.3 Pattern Analysis

Pattern analysis is the last step in the overall Web Usage mining process as described in Figure 1. The motivation behind pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern discovery phase. The exact analysis methodology is usually governed by the application for which Web mining is done. The most common form of pattern analysis consists of a knowledge query mechanism such as SQL. Another method is to load usage data into a data cube in order to perform OLAP operations. Visualization techniques, such as graphing patterns or assigning colors to different values, can often highlight overall patterns or trends in the data. Content and structure information can be used to filter out patterns containing pages of a certain usage type, content type, or pages that match a certain hyperlink structure.
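A knowledge query mechanism of the kind mentioned above can be sketched with Python's built-in sqlite3 module. The table layout, the rule values, and the thresholds are assumptions made for this example and do not describe the schema of any surveyed system.

```python
import sqlite3

# In-memory database holding hypothetical discovered rules.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rules (antecedent TEXT, consequent TEXT, support REAL, confidence REAL)"
)
conn.executemany(
    "INSERT INTO rules VALUES (?, ?, ?, ?)",
    [
        ("/electronics", "/sports", 0.75, 0.75),
        ("/sports", "/electronics", 0.75, 1.0),
        ("/books", "/cart", 0.05, 0.9),
    ],
)

# Keep only rules that are both frequent and reliable.
interesting = conn.execute(
    "SELECT antecedent, consequent FROM rules "
    "WHERE support >= ? AND confidence >= ? ORDER BY confidence DESC",
    (0.1, 0.8),
).fetchall()
conn.close()
print(interesting)
```

Storing discovered patterns in a relational table lets the analyst express the notion of "interesting" declaratively and adjust it without rerunning the pattern discovery phase.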

4. TAXONOMY AND PROJECT SURVEY

Since 1996 there have been several research projects and commercial products that have analyzed Web usage data for a number of different purposes. This section describes the dimensions and application areas that can be used to classify Web Usage Mining projects.

4.1 Taxonomy Dimensions

While there are many candidate dimensions that can be used to classify Web Usage Mining projects, there are five major dimensions that apply to every project: the data sources used to gather input, the types of input data, the number of users represented in each data set, the number of Web sites represented in each data set, and the application area focused on by the project. Usage data can be gathered at the server level, proxy level, or client level, as discussed in Section 2.1. As shown in Figure 3, most projects make use of server side data. All projects analyze usage data and some also make use of content, structure, or profile data. The algorithms for a project can be designed to work on inputs representing one or many users and one or many Web sites. Single user projects are generally involved in the personalization application area. The projects that provide multi-site analysis use either client or proxy level input data in order to easily access usage data from more than one Web site. Most Web Usage Mining projects take single-site, multi-user, server-side usage data (Web server logs) as input.

4.2 Project Survey

As shown in Figures 3 and 4, usage patterns extracted from Web data have been applied to a wide range of applications. Projects such as [31; 55; 56; 58; 53] have focused on Web Usage Mining in general, without extensive tailoring of the process towards one of the various sub-categories. The WebSIFT project is discussed in more detail in the next section. Chen et al. [25] introduced the concept of maximal forward reference to characterize user episodes for the mining of traversal patterns. A maximal forward reference is the sequence of pages requested by a user up to the last page before backtracking occurs during a particular server session. The SpeedTracer project [56] from IBM Watson is built on the work originally reported in [25]. In addition to episode identification, SpeedTracer makes use of referrer and agent information in the preprocessing routines to identify users and server sessions in the absence of additional client side information. The Web Utilization Miner (WUM) system [55] provides a robust mining language in order to specify characteristics of discovered frequent paths that are interesting to the analyst. In their approach, individual navigation paths, called trails, are combined into an aggregated tree structure. Queries can be answered by mapping them into the intermediate nodes of the tree structure. Han et al. [58] have loaded Web server logs into a data cube structure in order to perform data mining as well as On-Line Analytical Processing (OLAP) activities such as roll-up and drill-down of the data. Their WebLogMiner system has been used to discover association rules, perform classification and time-series analysis (such as event sequence analysis, transition analysis and trend analysis). Shahabi et al. [53; 59] have one of the few Web Usage mining systems that rely on client side data collection. The client side agent sends back page request and time information to the server every time a page containing the Java applet (either a new page or a previously cached page) is loaded or destroyed.
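The definition of a maximal forward reference can be illustrated with a short Python sketch. It follows the definition given above and is not claimed to reproduce the exact algorithm of [25]. It assumes that a request for a page already on the current forward path is a backward move; the name `maximal_forward_references` and the second example path are hypothetical.

```python
def maximal_forward_references(path):
    """Split one traversal path into its maximal forward references."""
    references = []
    forward = []       # pages on the current forward path
    extending = False  # True while the most recent move was a forward move
    for page in path:
        if page in forward:
            # Backward move: the forward path just ended is maximal, so record it once.
            if extending:
                references.append(list(forward))
            del forward[forward.index(page) + 1:]
            extending = False
        else:
            forward.append(page)
            extending = True
    if extending:
        references.append(list(forward))
    return references

# First completed session from the Figure 2 example.
figure2 = maximal_forward_references(list("ABFOFBG"))
# A longer hypothetical traversal with several backtracking points.
longer = maximal_forward_references(list("ABCDCBEGHGWAOUOV"))
print(["".join(r) for r in figure2])
print(["".join(r) for r in longer])
```

For the completed path A-B-F-O-F-B-G, the sketch yields the two forward references A-B-F-O and A-B-G, each ending at the last page requested before the user backtracked or the session ended.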

4.2.1 Personalization

Personalizing the Web experience for a user is the holy grail of many Web-based applications, e.g. individualized marketing for e-commerce [4]. Making dynamic recommendations to a Web user, based on her/his profile in addition to usage behavior, is very attractive to many applications, e.g. cross-sales and up-sales in e-commerce. Web usage mining is an excellent approach for achieving this goal, as illustrated in [43]. Existing recommendation systems, such as [8; 6], do not currently use data mining for recommendations, though there have been some recent proposals [16].

The WebWatcher [37], SiteHelper [45], Letizia [39], and clustering work by Mobasher et al. [43] and Yan et al. [57] have all concentrated on providing Web Site personalization based on usage information. Web server logs were used by Yan et al. [57] to discover clusters of users having similar access patterns. The system proposed in [57] consists of an offline module that will perform cluster analysis and an online module which is responsible for dynamic link generation of Web pages. Every site user will be assigned to a single cluster based on their current traversal pattern. The links that are presented to a given user are dynamically selected based on what pages other users assigned to the same cluster have visited. The SiteHelper project learns a user's preferences by looking at the page accesses for each user. A list of keywords from pages that a user has spent a significant amount of time viewing is compiled and presented to the user. Based on feedback about the keyword list, recommendations for other pages within the site are made. WebWatcher "follows" a user as he or she browses [...]

Project | Application Focus | Data Source (Server, Proxy, Client) | Data Type (Structure, Content, Usage, Profile) | User (Single, Multi) | Site (Single, Multi)

WebSIFT (CTS99) General x x x x x x

SpeedTracer (WYB98,CPY96) General x x x x

WUM (SF98) General x x x x x

Shahabi (SZAS97,ZASS97) General x x x x x

Site Helper (NW97) Personalization x x x x x

Letizia (Lie95) Personalization x x x x x

Web Watcher (JFM97) Personalization x x x x x x

Krishnapuram(NKJ99) Personalization x x x x

Analog (YJGD96) Personalization x x x x

Mobasher (MCS99) Personalization x x x x x

Tuzhilin(PT98) Business x x x x

SurfAid Business x x x x x

Buchner(BM98) Business x x x x x

WebTrends,Hitlist,Accrue,etc. Business x x x x

WebLogMiner (ZXH98) Business x x x x

PageGather,SCML (PE98,PE99) Site Modification x x x x x x

Manley(Man97) Characterization x x x x x

Arlitt(AW96) Characterization x x x x x

Pitkow(PIT97,PIT98) Characterization x x x x x x

Almeida(ABC96) Characterization x x x x

Rexford(CKR98) System Improve. x x x x x

Schechter(SKS98) System Improve. x x x x

Aggarwal(AY97) System Improve. x x x x

Figure 3: Web Usage Mining Research Projects and Products

Web Usage Mining (general): WebSIFT, WUM, SpeedTracer, WebLogMiner, Shahabi
Personalization: Site Helper, Letizia, Web Watcher, Mobasher, Analog, Krishnapuram
System Improvement: Rexford, Schechter, Aggarwal
Site Modification: Adaptive Sites
Business Intelligence: SurfAid, Buchner, Tuzhilin
Usage Characterization: Pitkow, Arlitt, Manley, Almeida

Figure 4: Major Application Areas for Web Usage Mining

to the user. The WebWatcher starts with a short description of a user's interest. Each page request is routed through the WebWatcher proxy server in order to easily track the user session across multiple Web sites and mark any interesting links. WebWatcher learns based on the particular user's browsing plus the browsing of other users with similar interests. Letizia is a client side agent that searches the Web for pages similar to ones that the user has already viewed or bookmarked. The page recommendations in [43] are based on clusters of pages found from the server log for a site. The system recommends pages from clusters that most closely match the current session. Pages that have not been viewed and are not directly linked from the current page are recommended to the user. The approach in [44] clusters user sessions using a fuzzy clustering algorithm, which allows a page or user to be assigned to more than one cluster.
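The cluster-based recommendation step described for [43] can be illustrated with a minimal sketch. It is not the algorithm of [43]: the function name `recommend`, the overlap score (fraction of a cluster's pages already seen in the session), and the `links` mapping from a page to its directly linked pages are all assumptions made for illustration.

```python
def recommend(session, page_clusters, links, top_n=3):
    """Recommend pages from the cluster that best matches the current session.

    session       -- list of pages viewed so far, most recent last
    page_clusters -- list of sets of pages (assumed to come from server-log clustering)
    links         -- dict mapping a page to the set of pages it links to directly
    """
    if not session or not page_clusters:
        return []
    viewed = set(session)
    current = session[-1]

    def overlap(cluster):
        # Hypothetical match score: share of the cluster already visited.
        return len(viewed & cluster) / len(cluster) if cluster else 0.0

    best = max(page_clusters, key=overlap)
    if overlap(best) == 0:
        return []  # no cluster matches the session at all
    # Keep only pages not yet viewed and not directly linked from the current page.
    candidates = best - viewed - links.get(current, set())
    return sorted(candidates)[:top_n]


clusters = [{"/a", "/b", "/c", "/d"}, {"/x", "/y"}]
links = {"/b": {"/c"}}
print(recommend(["/a", "/b"], clusters, links))
```

A real system would likely weight pages within a cluster and combine several matching clusters; the single best cluster is used here only to keep the sketch short.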

4.2.2 System Improvement

Performance and other service quality attributes are crucial to user satisfaction with services such as databases, networks, etc. Similar qualities are expected by the users of Web services. Web usage mining provides the key to understanding Web traffic behavior, which can in turn be used for developing policies for Web caching, network transmission [27], load balancing, or data distribution. Security is an acutely growing concern for Web-based services, especially as electronic commerce continues to grow at an exponential rate [32]. Web usage mining can also provide patterns which are useful for detecting intrusion, fraud, attempted break-ins, etc.

Almeida et al. [19] propose models for predicting the locality, both temporal as well as spatial, amongst Web pages requested by a particular user or a group of users accessing from the same proxy server. The locality measure can then be used for deciding pre-fetching and caching strategies for the proxy server. The increasing use of dynamic content has reduced the benefits of caching at both the client and server level. Schechter et al. [52] have developed algorithms for creating path profiles from data contained in server logs. These profiles are then used to pre-generate dynamic HTML pages based on the current user profile in order to reduce latency due to page generation. Using proxy information for pre-fetching pages has also been studied by [27] and [17].
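The idea of using path profiles to anticipate the next request can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of [52]: sessions are assumed to be already extracted from the server log as ordered lists of pages, and the names `build_path_profiles`, `predict_next`, and the `max_len` bound are hypothetical.

```python
from collections import Counter, defaultdict


def build_path_profiles(sessions, max_len=3):
    """Count which page follows each observed path of up to max_len pages."""
    profiles = defaultdict(Counter)
    for session in sessions:
        for i in range(1, len(session)):
            for n in range(1, max_len + 1):
                if i - n < 0:
                    break
                prefix = tuple(session[i - n:i])
                profiles[prefix][session[i]] += 1
    return profiles


def predict_next(profiles, recent, max_len=3):
    """Predict the next page from the longest recorded path matching the recent requests."""
    for n in range(min(max_len, len(recent)), 0, -1):
        suffix = tuple(recent[-n:])
        if suffix in profiles:
            # Most frequent successor of the matched path.
            return profiles[suffix].most_common(1)[0][0]
    return None  # nothing in the profiles matches


sessions = [
    ["/home", "/cat", "/item1"],
    ["/home", "/cat", "/item1"],
    ["/search", "/cat", "/item2"],
]
profiles = build_path_profiles(sessions)
print(predict_next(profiles, ["/search", "/cat"]))
```

A server could use such a prediction to pre-generate the likely next dynamic page; preferring the longest matching path is one plausible design choice, since longer paths carry more context about the user's current navigation.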

4.2.3 Site Modification

The attractiveness of a Web site, in terms of both content and structure, is crucial to many applications, e.g. a product catalog for e-commerce. Web usage mining provides detailed feedback on user behavior, providing the Web site designer information on which to base redesign decisions.

While the results of any of the projects could lead to redesigning the structure and content of a site, the adaptive Web site project (SCML algorithm) [48; 49] focuses on automatically changing the structure of a site based on usage patterns discovered from server logs. Clustering of pages is used to determine which pages should be directly linked.
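As a rough illustration of how usage data can suggest new direct links, the sketch below counts how often two pages are visited in the same session and reports frequently co-occurring pairs that are not yet linked. This is a deliberately simple co-occurrence heuristic, not the PageGather or SCML algorithm of [48; 49]; the function name `suggest_links`, the `links` mapping, and the `min_support` threshold are assumptions.

```python
from collections import Counter
from itertools import combinations


def suggest_links(sessions, links, min_support=2):
    """Return unlinked page pairs that co-occur in at least min_support sessions."""
    pair_counts = Counter()
    for session in sessions:
        # Count each unordered pair once per session.
        for a, b in combinations(sorted(set(session)), 2):
            pair_counts[(a, b)] += 1

    suggestions = []
    for (a, b), count in pair_counts.items():
        linked = b in links.get(a, set()) or a in links.get(b, set())
        if count >= min_support and not linked:
            suggestions.append((a, b, count))
    # Most frequently co-visited pairs first, then alphabetical for stability.
    return sorted(suggestions, key=lambda s: (-s[2], s[0], s[1]))


sessions = [["/a", "/b", "/c"], ["/a", "/c"], ["/b", "/c"]]
links = {"/a": {"/b"}, "/b": {"/c"}}
print(suggest_links(sessions, links))
```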

4.2.4 Business Intelligence

Information on how customers are using a Web site is critical for marketers of e-tailing businesses. Buchner et al. [22] have presented a knowledge discovery process in order to discover marketing intelligence from Web data. They

usage data along with marketing data for e-commerce applications. They identified four distinct steps in the customer relationship life cycle that can be supported by their knowledge discovery techniques: customer attraction, customer retention, cross sales and customer departure. There are several commercial products, such as SurfAid [11], Accrue [1], NetGenesis [7], Aria [3], Hitlist [5], and WebTrends [13] that provide Web traffic analysis mainly for the purpose of gathering business intelligence. Accrue, NetGenesis, and Aria are designed to analyze e-commerce events such as products bought and advertisement click-through rates in addition to straightforward usage statistics. Accrue provides a path analysis visualization tool and IBM's SurfAid provides OLAP through a data cube and clustering of users in addition to page view statistics. Padmanabhan et al. [46] use Web server logs to generate beliefs about the access patterns of Web pages at a given Web site. Algorithms for finding interesting rules based on the unexpectedness of the rule were also developed.

4.2.5 Usage Characterization

While most projects that work on characterizing the usage, content, and structure of the Web don't necessarily consider themselves to be engaged in data mining, there is a large amount of overlap between Web characterization research and Web Usage mining. Catledge et al. [23] discuss the results of a study conducted at the Georgia Institute of Technology, in which the Web browser Xmosaic was modified to log client side activity. The results collected provide detailed information about the user's interaction with the browser interface as well as the navigational strategy used to browse a particular site. The project also provides detailed statistics about the occurrence of the various client side events such as clicking the back/forward buttons, saving a file, adding to bookmarks etc. Pitkow et al. [36] propose a model which can be used to predict the probability distribution for various pages a user might visit on a given site. This model works by assigning a value to all the pages on a site based on various attributes of that page. The formulas and threshold values used in the model are derived from an extensive empirical study carried out on various browsing communities and their browsing patterns. Arlitt et al. [20] discuss various performance metrics for Web servers along with details about the relationship between each of these metrics for different workloads. Manley [40] develops a technique for generating a custom made benchmark for a given site based on its current workload. This benchmark, which he calls a self-configuring benchmark, can be used to perform scalability and load balancing studies on a Web server. Chi et al. [35] describe a system called WEEV (Web Ecology and Evolution Visualization) which is a visualization tool to study the evolving relationship of web usage, content and site topology with respect to time.

5. WEBSIFT OVERVIEW

The WebSIFT system [31] is designed to perform Web Usage Mining from server logs in the extended NCSA format (includes referrer and agent fields). The preprocessing algorithms include identifying users, server sessions, and inferring cached page references through the use of the referrer field. The details of the algorithms used for these steps are contained in [30]. In addition to creating a server session

preprocessing, and provides the option to convert server sessions into episodes. Each episode is either the subset of all content pages in a server session, or all of the navigation pages up to and including each content page. Several algorithms for identifying episodes (referred to as transactions in the paper) are described and evaluated in [28].
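The two kinds of episode described above can be sketched as follows. This is a minimal illustration, not one of the algorithms evaluated in [28]: the predicate `is_content` (which decides whether a page is a content page) is hypothetical, and the second function assumes that the navigation pages belonging to an episode are those visited since the previous content page.

```python
def content_only_episode(session, is_content):
    """Episode made of the subset of content pages in a server session."""
    return [page for page in session if is_content(page)]


def navigation_content_episodes(session, is_content):
    """One episode per content page: the navigation pages leading to it, plus the page itself.

    Assumption: navigation pages are attributed to the next content page.
    Trailing navigation pages with no following content page are dropped.
    """
    episodes, current = [], []
    for page in session:
        current.append(page)
        if is_content(page):
            episodes.append(current)
            current = []
    return episodes


content_pages = {"/item1", "/item2"}          # hypothetical page classification
session = ["/home", "/cat", "/item1", "/cat", "/item2", "/home"]
print(content_only_episode(session, content_pages.__contains__))
print(navigation_content_episodes(session, content_pages.__contains__))
```

Content-only episodes suit association rule mining over the pages users actually read, while navigation-plus-content episodes preserve the paths taken to reach each content page.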

The server session or episode files can be run through sequential pattern analysis, association rule discovery, clustering, or general statistics algorithms, as shown in Figure 5. The results of the various knowledge discovery tools can be analyzed through a simple knowledge query mechanism, a visualization tool (association rule map with confidence and support weighted edges), or the information filter (OLAP tools such as a data cube are possible as shown in Figure 5, but are not currently implemented). The information filter makes use of the preprocessed content and structure information to automatically filter the results of the knowledge discovery algorithms for patterns that are potentially interesting. For example, usage clusters that contain page views from multiple content clusters are potentially interesting, whereas usage clusters that match content clusters may not be interesting. The details of the method the information filter uses to combine and compare evidence from the different data sources are contained in [31].
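The usage-cluster example above can be made concrete with a small sketch. It only implements the stated rule of thumb (a usage cluster spanning more than one content cluster is flagged), not the evidence-combination method of [31]; the function name `interesting_usage_clusters` and the `content_cluster_of` mapping from page to content-cluster label are assumptions.

```python
def interesting_usage_clusters(usage_clusters, content_cluster_of):
    """Flag usage clusters whose pages come from more than one content cluster.

    usage_clusters     -- list of sets of pages found by clustering usage data
    content_cluster_of -- dict mapping each page to its content cluster label
    """
    interesting = []
    for cluster in usage_clusters:
        spanned = {content_cluster_of[p] for p in cluster if p in content_cluster_of}
        if len(spanned) > 1:
            interesting.append((cluster, spanned))
    return interesting


content_cluster_of = {
    "/shoes": "apparel",
    "/socks": "apparel",
    "/laptop": "electronics",
}
usage_clusters = [{"/shoes", "/socks"}, {"/shoes", "/laptop"}]
print(interesting_usage_clusters(usage_clusters, content_cluster_of))
```

The first usage cluster simply mirrors an existing content cluster and is filtered out; the second links pages that the site's content does not relate, which is the kind of pattern an analyst may want to inspect.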

6. PRIVACY ISSUES

Privacy is a sensitive topic which has been attracting a lot of attention recently due to the rapid growth of e-commerce. It is further complicated by the global and self-regulatory nature of the Web. The issue of privacy revolves around the fact that most users want to maintain strict anonymity on the Web. They are extremely averse to the idea that someone is monitoring the Web sites they visit and the time they spend on those sites.

On the other hand, site administrators are interested in finding out the demographics of users as well as the usage statistics of different sections of their Web site. This information would allow them to improve the design of the Web site and would ensure that the content caters to the largest population of users visiting their site. The site administrators also want the ability to identify a user uniquely every time she visits the site, in order to personalize the Web site and improve the browsing experience.

The main challenge is to come up with guidelines and rules such that site administrators can perform various analyses on the usage data without compromising the identity of an individual user. Furthermore, there should be strict regulations to prevent the usage data from being exchanged/sold to other sites. The users should be made aware of the privacy policies followed by any given site, so that they can make an informed decision about revealing their personal data. The success of any such guidelines can only be guaranteed if they are backed up by a legal framework.

The W3C has an ongoing initiative called Platform for Privacy Preferences (P3P) [10; 38]. P3P provides a protocol which allows the site administrators to publish the privacy policies followed by a site in a machine readable format. When the user visits the site for the first time the browser reads the privacy policies followed by the site and then compares them with the security setting configured by the user. If the policies are satisfactory the browser continues request-

used to arrive at a setting which is acceptable to the user. Another aim of P3P is to provide guidelines for independent organizations which can ensure that sites comply with the policy statement they are publishing [12].

The European Union has taken a lead in setting up a regulatory framework for Internet Privacy and has issued a directive which sets guidelines for processing and transfer of personal data [15]. Unfortunately, in the U.S. there is no unifying framework in place, though the U.S. Federal Trade Commission (FTC), after a study of commercial Web sites, has recommended that Congress develop legislation to regulate the personal information being collected at Web sites [26].

7. CONCLUSIONS

This paper has attempted to provide an up-to-date survey of the rapidly growing area of Web Usage mining. With the growth of Web-based applications, specifically electronic commerce, there is significant interest in analyzing Web usage data to better understand Web usage, and apply the knowledge to better serve users. This has led to a number of commercial offerings for doing such analysis. However, Web Usage mining raises some hard scientific questions that must be answered before robust tools can be developed. This article has aimed at describing such challenges, and the hope is that the research community will take up the challenge of addressing them.

8. REFERENCES

[1] Accrue. http://www.accrue.com.

[2] Alladvantage. http://www.alladvantage.com.

[3] Andromedia aria. http://www.andromedia.com.

[4] Broadvision. http://www.broadvision.com.

[5] Hitlist commerce. http://www.marketwave.com.

[6] Likeminds. http://www.andromedia.com.

[7] Netgenesis. http://www.netgenesis.com.

[8] Netperceptions. http://www.netperceptions.com.

[9] Netzero. http://www.netzero.com.

[10] Platform for privacy project. http://www.w3.org/P3P/.

[11] Surfaid analytics. http://surfaid.dfw.ibm.com.

[12] Truste: Building a web you can believe in. http://www.truste.org/.

[13] Webtrends log analyzer. http://www.webtrends.com.

[14] World wide web committee web usage characterization activity. http://www.w3.org/WCA.

[15] European commission. The directive on the protection of individuals with regard to the processing of personal data and on the free movement of such data. http://www2.echo.lu/, 1998.

Figure 5: Architecture for the WebSIFT System (stages: Input, Preprocessing, Pattern Discovery, Pattern Analysis)

the 5th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining (KDD99).

[17] Charu C Aggarwal and Philip S Yu. On disk caching of web objects in proxy servers. In CIKM 97, pages 238-245, Las Vegas, Nevada, 1997.

[18] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, pages 487-499, Santiago, Chile, 1994.

[19] Virgilio Almeida, Azer Bestavros, Mark Crovella, and Adriana de Oliveira. Characterizing reference locality in the www. Technical Report TR-96-11, Boston University, 1996.

[20] Martin F Arlitt and Carey L Williamson. Internet web servers: Workload characterization and performance implications. IEEE/ACM Transactions on Networking, 5(5):631-645, 1997.

[21] M. Balabanovic and Y. Shoham. Learning information retrieval agents: Experiments with automated web browsing. In On-line Working Notes of the AAAI Spring Symposium Series on Information Gathering from Distributed, Heterogeneous Environments, 1995.

[22] Alex Buchner and Maurice D Mulvenna. Discovering internet marketing intelligence through online analytical web usage mining. SIGMOD Record, 27(4):54-61, 1998.

[23] L. Catledge and J. Pitkow. Characterizing browsing behaviors on the world wide web. Computer Networks and ISDN Systems, 27(6), 1995.

[24] M.S. Chen, J. Han, and P.S. Yu. Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):866-883, 1996.

[25] M.S. Chen, J.S. Park, and P.S. Yu. Data mining for path traversal patterns in a web environment. In 16th International Conference on Distributed Computing Systems, pages 385-392, 1996.

[26] Roger Clarke. Internet privacy concerns confirm the case for intervention. 42(2):60-67, 1999.

[27] E. Cohen, B. Krishnamurthy, and J. Rexford. Improving end-to-end performance of the web using server volumes and proxy filters. In Proc. ACM SIGCOMM, pages 241-253, 1998.

[28] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Grouping web page references into transactions for mining world wide web browsing patterns. In Knowledge and Data Engineering Workshop, pages 2-9, Newport Beach, CA, 1997. IEEE.

[29] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Web mining: Information and pattern discovery on the world wide web. In International Conference on Tools with Artificial Intelligence, pages 558-567, Newport Beach, 1997. IEEE.

vastava. Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems, 1(1), 1999.

[31] Robert Cooley, Pang-Ning Tan, and Jaideep Srivastava. Discovery of interesting usage patterns from web data. Technical Report TR99-022, University of Minnesota, 1999.

[32] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53-62, San Diego, CA, 1999. ACM.

[33] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery: An overview. In Proc. ACM KDD, 1994.

[34] David Gibson, Jon Kleinberg, and Prabhakar Raghavan. Inferring web communities from link topology. In Conference on Hypertext and Hypermedia. ACM, 1998.

[35] Chi E. H., Pitkow J., Mackinlay J., Pirolli P., Gossweiler, and Card S.K. Visualizing the evolution of web ecologies. In CHI '98, Los Angeles, California, 1998.

[36] Bernardo Huberman, Peter Pirolli, James Pitkow, and Rajan Lukose. Strong regularities in world wide web surfing. Technical report, Xerox PARC, 1998.

[37] T. Joachims, D. Freitag, and T. Mitchell. Webwatcher: A tour guide for the world wide web. In The 15th International Conference on Artificial Intelligence, Nagoya, Japan, 1997.

[38] Reagle Joseph and Cranor Lorrie Faith. The platform for privacy preferences. 42(2):48-55, 1999.

[39] H. Lieberman. Letizia: An agent that assists web browsing. In Proc. of the 1995 International Joint Conference on Artificial Intelligence, Montreal, Canada, 1995.

[40] Stephen Lee Manley. An Analysis of Issues Facing World Wide Web Servers. Undergraduate, Harvard, 1997.

[41] B. Masand and M. Spiliopoulou, editors. Workshop on Web Usage Analysis and User Profiling (WebKDD), 1999.

[42] B. Mobasher, N. Jain, E. Han, and J. Srivastava. Web mining: Pattern discovery from world wide web transactions. (TR96-050), 1996.

[43] Bamshad Mobasher, Robert Cooley, and Jaideep Srivastava. Creating adaptive web sites through usage-based clustering of urls. In Knowledge and Data Engineering Workshop, 1999.

[44] Olfa Nasraoui, Raghu Krishnapuram, and Anupam Joshi. Mining web access logs using a fuzzy relational clustering algorithm based on a robust estimator. In Eighth International World Wide Web Conference, Toronto, Canada, 1999.

that helps incremental exploration of the world wide web. In 6th International World Wide Web Conference, Santa Clara, CA, 1997.

[46] Balaji Padmanabhan and Alexander Tuzhilin. A belief-driven method for discovering unexpected patterns. In Fourth International Conference on Knowledge Discovery and Data Mining, pages 94-100, New York, New York, 1998.

[47] M. Pazzani, L. Nguyen, and S. Mantik. Learning from hotlists and coldlists: Towards a www information filtering and seeking agent. In IEEE 1995 International Conference on Tools with Artificial Intelligence, 1995.

[48] Mike Perkowitz and Oren Etzioni. Adaptive web sites: Automatically synthesizing web pages. In Fifteenth National Conference on Artificial Intelligence, Madison, WI, 1998.

[49] Mike Perkowitz and Oren Etzioni. Adaptive web sites: Conceptual cluster mining. In Sixteenth International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 1999.

[50] Peter Pirolli, James Pitkow, and Ramana Rao. Silk from a sow's ear: Extracting usable structures from the web. In CHI-96, Vancouver, 1996.

[51] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

[52] S. Schechter, M. Krishnan, and M. D. Smith. Using path profiles to predict http requests. In 7th International World Wide Web Conference, Brisbane, Australia, 1998.

[53] Cyrus Shahabi, Amir M Zarkesh, Jafar Adibi, and Vishal Shah. Knowledge discovery from users web-page navigation. In Workshop on Research Issues in Data Engineering, Birmingham, England, 1997.

[54] E. Spertus. Parasite: Mining structural information on the web. Computer Networks and ISDN Systems: The International Journal of Computer and Telecommunication Networking, 29:1205-1215, 1997.

[55] Myra Spiliopoulou and Lukas C Faulstich. Wum: A web utilization miner. In EDBT Workshop WebDB98, Valencia, Spain, 1998. Springer Verlag.

[56] Kun-lung Wu, Philip S Yu, and Allen Ballman. Speedtracer: A web usage mining and analysis tool. IBM Systems Journal, 37(1), 1998.

[57] T. Yan, M. Jacobsen, H. Garcia-Molina, and U. Dayal. From user access patterns to dynamic hypertext linking. In Fifth International World Wide Web Conference, Paris, France, 1996.

[58] O. R. Zaiane, M. Xin, and J. Han. Discovering web access patterns and trends by applying olap and data mining technology on web logs. In Advances in Digital Libraries, pages 19-29, Santa Barbara, CA, 1998.

Vishal Shah. Analysis and design of server informative www-sites. In Sixth International Conference on Information and Knowledge Management, Las Vegas, Nevada, 1997.

About the Authors:

Jaideep Srivastava received the B.Tech. degree in computer science from the Indian Institute of Technology, Kanpur, India, in 1983, and the M.S. and Ph.D. degrees in computer science from the University of California, Berkeley, in 1985 and 1988, respectively. Since 1988 he has been on the faculty of the Computer Science Department, University of Minnesota, Minneapolis, where he is currently an Associate Professor. In 1983 he was a research engineer with Uptron Digital Systems, Lucknow, India. He has published over 110 papers in refereed journals and conferences in the areas of databases, parallel processing, artificial intelligence, and multi-media. His current research is in the areas of databases, distributed systems, and multi-media computing. He has given a number of invited talks and participated in panel discussions on these topics. Dr. Srivastava is a senior member of the IEEE Computer Society and the ACM. His professional activities have included being on various program committees, and refereeing for journals, conferences, and the NSF.

Robert Cooley is currently pursuing a Ph.D. in computer science at the University of Minnesota. He received an M.S. in computer science from Minnesota in 1998. His research interests include Data Mining and Information Retrieval.

Mukund Deshpande is a Ph.D. student in the Department of Computer Science at the University of Minnesota. He received an M.E. in system science & automation from Indian Institute of Science, Bangalore, India in 1997.

Pang-Ning Tan is currently working towards his Ph.D. in Computer Science at University of Minnesota. His primary research interest is in Data Mining. He received an M.S. in Physics from University of Minnesota in 1996.