A Statistical Learning Model of Text Classification for Support Vector Machines
Thorsten Joachims
GMD Forschungszentrum IT, AIS.KD
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
Thorsten.Joachims@gmd.de
ABSTRACT
This paper develops a theoretical learning model of text classification for Support Vector Machines (SVMs). It connects the statistical properties of text-classification tasks with the generalization performance of an SVM in a quantitative way. Unlike conventional approaches to learning text classifiers, which rely primarily on empirical evidence, this model explains why and when SVMs perform well for text classification. In particular, it addresses the following questions: Why can support vector machines handle the large feature spaces in text classification effectively? How is this related to the statistical properties of text? What are sufficient conditions for applying SVMs to text-classification problems successfully?
1. INTRODUCTION
There are at least two ways to motivate why a particular learning method is suitable for a particular learning task. Since ultimately one is interested in the performance of the method, one way is through comparative studies. Previous work [11, 4] presents such studies showing that Support Vector Machines (SVMs) deliver state-of-the-art classification performance. However, success on benchmarks is a brittle justification for a learning algorithm and gives only limited insight. Therefore, this paper analyzes the suitability of SVMs for learning text classifiers from a theoretical perspective.
In particular, this paper presents an abstract model of text-classification tasks. This model is based on statistical properties of text-classification problems that are both observable and intuitive. Using this model, it is possible to prove what types of text-classification problems are efficiently learnable with SVMs. The central result is an upper bound connecting the expected generalization error of an SVM with the statistical properties of text-classification tasks.
This paper is structured as follows. After a short introduction to SVMs, it will identify the key properties of
text-classification tasks. They motivate the model formally defined in Section 4. In addition to verifying the assumptions of the model against real data, this section proves the learnability results. Section 5 further validates the model using experiments, before Section 6 analyzes the complexity of text-classification tasks and identifies sufficient conditions for good generalization performance.
2. SUPPORT VECTOR MACHINES
SVMs [18] were developed by V. Vapnik et al. based on the structural risk minimization principle from statistical learning theory. In their basic form, SVMs learn linear decision rules $h(\vec{x}) = \mathrm{sign}\{\vec{w} \cdot \vec{x} + b\}$ described by a weight vector $\vec{w}$ and a threshold $b$. The input is a sample of $n$ training examples $S_n = ((\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n))$, $\vec{x}_i \in \Re^N$, $y_i \in \{-1, +1\}$. For a linearly separable $S_n$, the SVM finds the hyperplane with maximum Euclidean distance to the closest training examples. This distance is called the margin $\delta$, as depicted in Figure 1. For non-separable training sets, the amount of training error is measured using slack variables $\xi_i$. Computing the hyperplane is equivalent to solving the following primal optimization problem [18].
Optimization Problem 1 (SVM (primal)).

minimize:   $V(\vec{w}, b, \vec{\xi}) = \frac{1}{2}\,\vec{w} \cdot \vec{w} + C \sum_{i=1}^{n} \xi_i$   (1)

subj. to:   $\forall_{i=1}^{n}: \; y_i[\vec{w} \cdot \vec{x}_i + b] \geq 1 - \xi_i$   (2)

            $\forall_{i=1}^{n}: \; \xi_i > 0$   (3)
The constraints (2) require that all training examples are classified correctly up to some slack $\xi_i$. If a training example lies on the "wrong" side of the hyperplane, the corresponding $\xi_i$ is greater than or equal to 1. Therefore, $\sum_{i=1}^{n} \xi_i$ is an upper bound on the number of training errors. The factor $C$ in (1) is a parameter that allows trading off training error vs. model complexity. Note that the margin of the resulting hyperplane is $\delta = 1/||\vec{w}||$.
Instead of solving OP1 directly, one can also consider the following dual program.
Optimization Problem 2 (SVM (dual)).

maximize:   $W(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j (\vec{x}_i \cdot \vec{x}_j)$   (4)

subj. to:   $\sum_{i=1}^{n} y_i \alpha_i = 0$   (5)

            $\forall i \in [1..n]: \; 0 \leq \alpha_i \leq C$   (6)
Figure 1: A binary classification problem (+ vs. -) in two dimensions. The hyperplane $h^*$ separates positive and negative training examples with maximum margin $\delta$. The examples closest to the hyperplane are called support vectors (marked with circles).
Duality implies that $W(\vec{\alpha}^*) = V(\vec{w}^*, b^*, \vec{\xi}^*)$ at the respective solutions of both programs, and that $W(\vec{\alpha}) \leq V(\vec{w}, b, \vec{\xi})$ for any feasible point. From the solution of the dual, the primal solution can be constructed as

$\vec{w} = \sum_{i=1}^{n} \alpha_i y_i \vec{x}_i$   and   $b = y_{usv} - \vec{w} \cdot \vec{x}_{usv}$   (7)

where $(\vec{x}_{usv}, y_{usv})$ is some training example with $0 < \alpha_{usv} < C$. For all but degenerate cases, such training examples exist and the hyperplane is called stable. One special family of hyperplanes considered in the following are called unbiased hyperplanes. Such hyperplanes are forced to pass through the origin, either by adding the constraint $b = 0$ in OP1, or equivalently by removing the constraint (5) in OP2. From a practical perspective for text classification, SVMs restricted to unbiased hyperplanes achieve a performance similar to general (i.e. biased) hyperplanes. For the experiments in this paper, SVM-Light [12] (http://www-ai.informatik.uni-dortmund.de/svmlight) is used for solving the dual optimization problem. More detailed introductions to SVMs can be found in [2, 18].
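To make the optimization problems above concrete, here is a minimal sketch that trains a soft-margin linear SVM on a toy data set with scikit-learn's LinearSVC and reads off the margin $\delta = 1/||\vec{w}||$ and the slacks $\xi_i$ of constraints (2). This is an illustration, not the paper's setup: the toy vectors, the solver, and the parameter choices are all assumptions of this example.

    import numpy as np
    from sklearn.svm import LinearSVC  # solves a soft-margin linear SVM akin to OP1

    # toy "document vectors" (rows) with labels +1 / -1
    X = np.array([[2., 0., 1.],
                  [1., 0., 2.],
                  [0., 2., 1.],
                  [0., 1., 2.]])
    y = np.array([+1, +1, -1, -1])

    svm = LinearSVC(C=50.0, loss="hinge")  # a large C approximates the hard margin
    svm.fit(X, y)

    w, b = svm.coef_.ravel(), svm.intercept_[0]
    delta = 1.0 / np.linalg.norm(w)               # margin of the learned hyperplane
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # slack variables of constraints (2)
    print(delta, xi.sum())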
3. PROPERTIES OF TEXT-CLASSIFICATION TASKS
To make useful statements about why a particular learning method should work well for text classification, it is necessary to identify key properties of text-classification tasks. Given a bag-of-words representation, the following properties hold:

High-Dimensional Feature Space. Independent of the particular choice of terms, text-classification problems involve high-dimensional feature spaces. If each word occurring in the training documents is used as a feature, text-classification problems with a few thousand training examples can lead to 30,000 and more attributes.

Sparse Document Vectors. While there is a large space of potential features, each document contains only a small number of distinct words. This implies that document vectors are very sparse.
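As a quick illustration of this sparsity, the following sketch builds bag-of-words count vectors for two tiny example documents; the scikit-learn pipeline and the sample sentences are assumptions of this illustration, not part of the paper's experiments.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat",
            "stocks rose after the merger was announced"]
    X = CountVectorizer().fit_transform(docs)  # scipy.sparse count matrix

    n_docs, n_features = X.shape
    print(n_features, X.nnz)                   # vocabulary size vs. stored entries
    print(X.nnz / float(n_docs * n_features))  # fraction of non-zero entries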
    Modulaire Industries said it acquired the design library and manufacturing rights of privately-owned Boise Homes for an undisclosed amount of cash. Boise Homes sold commercial and residential prefabricated structures, Modulaire said.

    USX, CONSOLIDATED NATURAL END TALKS
    USX Corp's Texas Oil and Gas Corp subsidiary and Consolidated Natural Gas Co have mutually agreed not to pursue further their talks on Consolidated's possible purchase of Apollo Gas Co from Texas Oil. No details were given.

    JUSTICE ASKS U.S. DISMISSAL OF TWA FILING
    The Justice Department told the Transportation Department it supported a request by USAir Group that the DOT dismiss an application by Trans World Airlines Inc for approval to take control of USAir. "Our rationale is that we reviewed the application for control filed by TWA with the DOT and ascertained that it did not contain sufficient information upon which to base a competitive review," James Weiss, an official in Justice's Antitrust Division, told Reuters.

    E.D. AND F. MAN TO BUY INTO HONG KONG FIRM
    The U.K.-based commodity house E.D. And F. Man Ltd and Singapore's Yeo Hiap Seng Ltd jointly announced that Man will buy a substantial stake in Yeo's 71.1 pct held unit, Yeo Hiap Seng Enterprises Ltd. Man will develop the locally listed soft drinks manufacturer into a securities and commodities brokerage arm and will rename the firm Man Pacific (Holdings) Ltd.

Figure 2: Four documents from the Reuters-21578 category "corporate acquisitions" that do not share any content words.
Heterogeneous Use of Terms. Consider the 4 documents shown in Figure 2. All documents are Reuters-21578 articles from the category "corporate acquisitions". Nevertheless, the overlap between their document vectors is very small. In this extreme case, the documents do not share any content words. The only words that occur in at least two documents are "it", "the", "and", "of", "for", "an", "a", "not", "that", and "in". All these words are stopwords and it is unlikely that they help discriminate between documents about corporate acquisitions and other documents. However, each document contains good keywords indicating a "corporate acquisition"; they are just different keywords in each document.
High Level of Redundancy. While there are generally many different features relevant to the classification task, often several such cues occur in one document. These cues are partly redundant. Table 1 [11] shows the results of an experiment on the Reuters "corporate acquisitions" category. All features (after stemming and stopword removal) are ranked according to their (binary) empirical mutual information (EMI) with the class label (cf. e.g. [14]).
Figure 3: Structure of the argument (steps 1, 2, and 3).
used features by EMI rank    PRBE
1-200                        89.6
201-500                      71.3
501-1000                     63.3
1001-2000                    58.0
2001-4000                    55.4
4001-9947                    47.5
random (no learning)         21.8

Table 1: Learning without using the "best" features.

Then a naive Bayes classifier is trained using only those features ranked 1-200, 201-500, 501-1000, 1001-2000, 2001-4000, and 4001-9947. The results in Table 1 show that even the features ranked lowest still contain considerable information and are somewhat relevant. A classifier using only those "worst" features has a precision/recall break-even point (PRBE) (e.g. [11]) much better than random.
Frequency Distribution of Words and Zipf's Law. The occurrence frequencies of words in natural language follow Zipf's law [19]. Zipf's law states that if one ranks words by their term frequency, the r-th most frequent word occurs roughly 1/r times the term frequency of the most frequent word. This implies that there is a small number of words that occur very frequently, while most words occur very infrequently.
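A small sketch of how one might check this rank-frequency shape on a corpus; the file name is hypothetical, and the 1/r comparison follows the simple form of Zipf's law stated above.

    from collections import Counter

    words = open("corpus.txt").read().lower().split()  # hypothetical corpus file
    freqs = sorted(Counter(words).values(), reverse=True)

    top = freqs[0]
    for r, tf in enumerate(freqs[:10], start=1):
        # under Zipf's law, tf should be roughly top / r
        print(r, tf, round(top / r, 1))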
4. A DISCRIMINATIVE MODEL OF TEXT CLASSIFICATION
The goal of this section is a statistical learning model of text-classification tasks. Using a three-step approach as illustrated in Figure 3, it provides the relationship between the properties of text-classification tasks identified above and the expected error rate of an SVM. The first step shows that a large margin combined with low training error is a sufficient condition for good generalization accuracy. The second step abstracts the properties of text-classification tasks into a model, which the third step connects to large-margin separation.
4.1 Step 1: Bounding the Expected Error Based on the Margin
The following bound [14, 18] shows that a large margin combined with low training error leads to high generalization accuracy. It uses results limiting the number of leave-one-out errors [10, 13]. The key quantities are the margin $\delta$ as defined in Section 2, the maximum Euclidean length $R$ of the document vectors $\vec{x}$, and the training loss $\sum \xi_i$.
Theorem 1 (Bound on Expected Error of SVM). The expected error rate $E(\mathrm{Err}^n(h_{SVM}))$ of an SVM based on $n$ training examples with $0 \leq ||\vec{x}_i|| \leq R$ for all points with non-zero probability and some constant $C$, is bounded by

$E(\mathrm{Err}^n(h_{SVM})) \;\leq\; \frac{E\left(\rho \frac{R^2}{\delta^2}\right) + C'\, E\left(\sum_{i=1}^{n+1} \xi_i\right)}{n+1}$

with $C' = CR^2$ if $C \leq 1/R^2$, and $C' = CR^2 + 1$ otherwise. For unbiased hyperplanes $\rho$ equals 1, and for general stable hyperplanes $\rho$ equals 2. The expectations on the right are over training sets of size $n+1$.
The proof can be found in [14]. Note that $R$ acts as a scaling constant for the margin $\delta$, as can easily be seen in Optimization Problem 1. For example, the squared margin $\delta^2$ can always be doubled by scaling the document vectors $\vec{x}$ to twice their length. The bound in Theorem 1 accounts for such scaling.
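As a worked illustration of Theorem 1, the sketch below evaluates the bound for entirely made-up values of $R^2$, $\delta^2$, and the training loss; the formula for $C'$ follows the reconstruction of the theorem given above, and none of the numbers come from the paper.

    def expected_error_bound(R2, delta2, sum_xi, n, C, rho=1.0):
        # rho = 1 for unbiased hyperplanes, 2 for general stable hyperplanes
        Cprime = C * R2 if C <= 1.0 / R2 else C * R2 + 1.0
        return (rho * R2 / delta2 + Cprime * sum_xi) / (n + 1.0)

    # separable case (sum_xi = 0) with hypothetical radius and margin
    print(expected_error_bound(R2=2000.0, delta2=4.5, sum_xi=0.0, n=3956, C=50.0))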
4.2 Step 2: TCat-Concepts as a Model of Text-Classification Tasks
Unfortunately, it is not possible to simply look at a new text-classification task and immediately have a good idea of whether it has a large margin. The margin property is observable only after training data becomes available and requires training the SVM. To overcome this problem, this second step lays the basis for connecting the large-margin property with more intuitive and more meaningful properties of text-classification tasks.
Consider the following stereotypical text-classification task. While this task is artificial and hypothetical, it will serve as a motivation for the model developed in this section. For this example task, the following describes how documents from the two classes differ in terms of the frequency with which certain types of words occur in them. Figure 4 graphically illustrates the corresponding "word-frequency histogram".
Stopwords: Independently of whether a document is from the positive or the negative class, each document contains 20 word occurrences from a set of 100 words (i.e. lexicon entries). These high-frequency words are typically considered stopwords. Note that this does not specify the individual word frequencies; i.e., it is open whether one word occurs 20 times, or 20 different words each occur once, or something in between.
Medium Frequency: There are 1,000 medium-frequency words in the lexicon. From a subset of 600 such entries, again each positive and negative document contains (any bag of) 5 occurrences. But there are also two groups of 200 entries each that occur primarily in positive or negative documents, respectively. In particular, from one group there are 4 occurrences in each positive document and only 1 in each negative document. Respectively, from the other group there are 4 occurrences in each negative document while there is only one in each positive document.

[Figure 4: A simple example of a TCat-concept, shown as a word-frequency histogram for positive and negative documents over the 11,100 words in the dictionary: 100 stopwords (irrelevant), 600 irrelevant medium-frequency words, 200 positive and 200 negative medium-frequency indicators, 3,000 positive and 3,000 negative low-frequency indicators, and 4,000 irrelevant low-frequency words; 50 words per document.]
Low Frequency: Similarly, for the remaining 10,000 entries in the low-frequency part of the lexicon, there is a subset of 4,000 entries of which there are 10 occurrences in both positive and negative documents. But there are two sets of 3,000 entries each that occur primarily in positive or negative documents, with a frequency of 9 versus 1.
To what extent does this example resemble the properties of text-classification tasks identified in Section 3?

High-Dimensional Input Space: There are 11,100 features, which is on the same order of magnitude as real text-classification tasks.

Sparse Document Vectors: Each document is only 50 words long, which means there are at least 11,050 zero entries in each document vector.

High Level of Redundancy: In each document there are 4 medium-frequency words and 9 low-frequency words that indicate the class of the document. Considering the document length of 50 words, this is a fairly high level of redundancy.

Heterogeneous Use of Terms: Both the positive and the negative documents each have a group of 200 medium-frequency words and a group of 3,000 low-frequency words. From each group there can be an arbitrary subset of 4 for the medium-frequency words and 9 for the low-frequency words in each document. Considering only the medium-frequency words, this implies that there can be 50 documents in the same class that do not share a single medium-frequency term from this group. This mimics the property of text-classification tasks identified in Section 3.

Zipf's Law: There is a small number of words (100 stopwords) that occur very frequently, a set of 1,000 words of medium frequency, and a large set of 10,000 low-frequency words. This does resemble Zipf's law.
To abstract from this particular example, the following definition introduces a parameterized model that can describe text-classification tasks more generally.
Definition 1 (Homogeneous TCat-Concepts). The TCat-concept

$\mathrm{TCat}([p_1\!:\!n_1\!:\!f_1], \ldots, [p_s\!:\!n_s\!:\!f_s])$   (8)

describes a binary classification task with $s$ disjoint sets of features (i.e. words). The $i$-th set includes $f_i$ features. Each positive example contains $p_i$ occurrences of features from the respective set, and each negative example contains $n_i$ occurrences. The same feature can occur multiple times in one document.
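To make Definition 1 concrete, the following sketch samples documents from a homogeneous TCat-concept. The uniform choice of words within each set is an assumption of this illustration; the definition itself only fixes the occurrence counts per set.

    import numpy as np

    def sample_document(sets, positive, rng):
        """sets: list of (p_i, n_i, f_i) triples; returns a term-count vector."""
        x = np.zeros(sum(f for _, _, f in sets))
        offset = 0
        for p, n, f in sets:
            occurrences = p if positive else n
            for w in rng.integers(0, f, size=occurrences):  # repetition allowed
                x[offset + w] += 1
            offset += f
        return x

    rng = np.random.default_rng(0)
    example = [(20, 20, 100), (4, 1, 200), (1, 4, 200), (5, 5, 600),
               (9, 1, 3000), (1, 9, 3000), (10, 10, 4000)]  # the Figure 4 concept
    doc = sample_document(example, positive=True, rng=rng)
    print(int(doc.sum()))  # 50 word occurrences, as in the example task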
This definition does not include noise (e.g. violations of the occurrence frequencies prescribed by the TCat-concept). However, the model can be extended to handle noise in a straightforward way [14]. Applying the definition to the example in Figure 4, it is easy to verify that the example can be described as a

TCat( [20:20:100];                            # high freq.
      [4:1:200]; [1:4:200]; [5:5:600];        # medium freq.
      [9:1:3000]; [1:9:3000]; [10:10:4000] )  # low freq.

concept. While this is an artificial example, is it possible to model real text-classification tasks as TCat-concepts?
Empirical Validation. Consider text-classification tasks from the Reuters-21578 (http://www.research.att.com/~lewis/reuters21578.html), the WebKB (http://www.cs.cmu.edu/afs/cs/project/theo-20/www/data), and the Ohsumed (ftp://medir.ohsu.edu/pub/ohsumed) collections. The following analysis shows how they can be modeled as TCat-concepts.
Let us start with the category "course" from the WebKB collection. First, we partition the feature space into disjoint sets of positive indicators, negative indicators, and irrelevant features. Using the simple strategy [14] of selecting features by their odds ratio, there are 98 high-frequency words that indicate positive documents (odds ratio greater than 2) and 52 high-frequency words indicating negative documents (odds ratio less than 0.5). An excerpt of these words is given in Figure 5. Similarly, there are 431 (341) medium-frequency words that indicate positive (negative) documents with an odds ratio greater than 5 (less than 0.2). In the low-frequency spectrum there are 5,045 positive indicators (odds ratio greater than 10) and 24,276 negative indicators (odds ratio less than 0.1). All other words in the vocabulary are assumed to carry no information.
[Figure 5: Indicative words for the WebKB category "course", partitioned by occurrence frequency. The figure lists excerpts of the positive indicators (98 high-frequency words such as "assignment", "exam", "syllabus", "textbook"; 431 medium-frequency words; 5,045 low-frequency words) and of the negative indicators (52 high-frequency words such as "faculty", "professor", "research"; 341 medium-frequency words; 24,276 low-frequency words).]

To abstract from the details of particular documents, it is useful to analyse what a typical document for this task looks like. In some sense, an "average" document captures what is typical. An average WebKB document is 277 words long. For positive examples of the category "course", on average 27.7% of the 277 occurrences come from the set of 98 high-frequency positive indicators, while these words account
for only 10.4% of the occurrences in an average negative document. Assessing the percentages analogously also for the other word groups, they can be directly translated into the following TCat-concept.

TCat_course( [77:29:98]; [4:21:52];       # high freq.
             [16:2:431]; [1:12:341];      # medium freq.
             [9:1:5045]; [1:21:24276];    # low freq.
             [169:191:8116] )             # rest
This shows that the text-classification task connected with the WebKB category "course" can be modeled as a TCat-concept, if one assumes that documents are of homogeneous length and composition. It can be shown that this assumption of homogeneity can be relaxed [14].
Similar TCat-concepts can also be found for other tasks. For the Reuters-21578 category "earn" the same procedure leads to the TCat-concept

TCat_earn( [33:2:65]; [32:65:152];      # high freq.
           [2:1:171]; [3:21:974];      # medium freq.
           [3:1:3455]; [1:10:17020];   # low freq.
           [78:52:5821] )              # rest

as an average-case model. The model for the Ohsumed category "pathology" is

TCat_pathology( [2:1:10]; [1:4:22];          # high freq.
                [2:1:92]; [1:2:94];          # medium freq.
                [5:1:4080]; [1:10:20922];    # low freq.
                [197:190:13459] )            # rest

Note that in particular the model for "pathology" is substantially different from the other two. This verifies that TCat-concepts can capture some properties of real text-classification tasks that have the potential to differentiate between tasks. The following studies their relevance for generalization performance.
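The estimation procedure just described can be sketched in a few lines. This is a simplified reading of it (a single odds-ratio cutoff per frequency band, dense count matrices), not the exact implementation from [14]; all function and parameter names are choices of this illustration.

    import numpy as np

    def estimate_tcat(X, y, bands, cutoffs):
        """X: docs-by-terms count matrix, y: labels in {+1, -1},
        bands: (lo, hi) document-frequency ranges, cutoffs: odds-ratio threshold per band."""
        pos, neg = X[y == +1], X[y == -1]
        df = (X > 0).mean(axis=0)                       # document frequency per term
        odds = ((pos > 0).mean(axis=0) + 1e-6) / ((neg > 0).mean(axis=0) + 1e-6)
        concept = []
        for (lo, hi), cut in zip(bands, cutoffs):
            in_band = (df >= lo) & (df < hi)
            for indicator in (odds > cut, odds < 1.0 / cut):
                sel = in_band & indicator
                if sel.sum() == 0:
                    continue
                p = pos[:, sel].sum(axis=1).mean()      # avg occurrences in positive docs
                n = neg[:, sel].sum(axis=1).mean()      # avg occurrences in negative docs
                concept.append((round(p), round(n), int(sel.sum())))
        return concept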
4.3 Step 3: Learnability of TCat-Concepts
This final step provides the connection between TCat-concepts and the bound for the generalization performance of an SVM. The first lemma shows that homogeneous TCat-concepts are generally separable with a certain margin. Using the fact that term frequencies obey Zipf's law, a second lemma shows that the Euclidean length of document vectors is small for text-classification tasks. These two results lead to the main learnability result for TCat-concepts.
Lemma 1 (Margin of Noise-Free TCat-Concepts). For $\mathrm{TCat}([p_1\!:\!n_1\!:\!f_1], \ldots, [p_s\!:\!n_s\!:\!f_s])$-concepts, there is always a hyperplane passing through the origin that has a margin $\delta$ bounded by

$\delta^2 \;\geq\; \frac{ac - b^2}{a + 2b + c}$   with   $a = \sum_{i=1}^{s} \frac{p_i^2}{f_i}, \quad b = \sum_{i=1}^{s} \frac{p_i n_i}{f_i}, \quad c = \sum_{i=1}^{s} \frac{n_i^2}{f_i}$   (9)
Proof. Define $\vec{p}^{\,T} = (p_1, \ldots, p_s)$ and $\vec{n}^{\,T} = (n_1, \ldots, n_s)$, as well as the diagonal matrix $F$ with $f_1, \ldots, f_s$ on the diagonal.
The margin of the maximum-margin hyperplane that separates a given training sample $(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)$ and that passes through the origin can be derived from the solution of the following optimization problem:

$W(\vec{w}^*) = \min \frac{1}{2} \vec{w}^T \vec{w}$   (10)
s.t. $y_1[\vec{x}_1^T \vec{w}] \geq 1, \; \ldots, \; y_n[\vec{x}_n^T \vec{w}] \geq 1$   (11)

The hyperplane corresponding to the solution vector $\vec{w}^*$ has a margin $\delta = (2W(\vec{w}^*))^{-0.5}$. By adding constraints to this optimization problem, it is possible to simplify its solution and get a lower bound on the margin. Let us add the additional constraint that within each group of $f_i$ features the weights are required to be identical. Then $\vec{w}^T \vec{w} = \vec{v}^T F \vec{v}$ for a vector $\vec{v}$ of dimensionality $s$. The constraints (11) can also be simplified. By definition, each example contains a certain number of features from each group. This means that all constraints for positive examples are equivalent to $\vec{p}^{\,T} \vec{v} \geq 1$ and, respectively, $-\vec{n}^T \vec{v} \geq 1$ for the negative examples. This leads to the following simplified optimization problem:

$W'(\vec{v}) = \min \frac{1}{2} \vec{v}^T F \vec{v}$   (12)
s.t. $\vec{p}^{\,T} \vec{v} \geq 1$   (13)
     $-\vec{n}^T \vec{v} \geq 1$   (14)

Let $\vec{v}^*$ be the solution. Since $W'(\vec{v}^*) \geq W(\vec{w}^*)$, it follows that $\delta \geq (2W'(\vec{v}^*))^{-0.5}$ is a lower bound for the margin. It remains to find an upper bound for $W'(\vec{v}^*)$ that can be computed in closed form. Introducing Lagrange multipliers, the solution $W'(\vec{v}^*)$ equals the value $L(\vec{v}^*, \alpha_+^*, \alpha_-^*)$ of

$L(\vec{v}, \alpha_+, \alpha_-) = \frac{1}{2} \vec{v}^T F \vec{v} - \alpha_+ (\vec{p}^{\,T} \vec{v} - 1) + \alpha_- (\vec{n}^T \vec{v} + 1)$   (15)

at its saddle point. $\alpha_+ \geq 0$ and $\alpha_- \geq 0$ are the Lagrange multipliers for the two constraints (13) and (14). Using the fact that

$\frac{dL(\vec{v}, \alpha_+, \alpha_-)}{d\vec{v}} = 0$   (16)

at the saddle point, one gets a closed-form solution for $\vec{v}$:

$\vec{v} = F^{-1}[\alpha_+ \vec{p} - \alpha_- \vec{n}]$   (17)

For ease of notation, one can equivalently write

$\vec{v} = F^{-1} X Y \vec{\alpha}$   (18)

with $X = (\vec{p}, \vec{n})$, $Y = \mathrm{diag}(1, -1)$, and $\vec{\alpha}^T = (\alpha_+, \alpha_-)$ appropriately defined. Substituting into the Lagrangian results in

$L(\vec{\alpha}) = \vec{1}^T \vec{\alpha} - \frac{1}{2} \vec{\alpha}^T Y X^T F^{-1} X Y \vec{\alpha}$   (19)

To find the saddle point, one has to maximize this function over $\vec{\alpha}^T = (\alpha_+, \alpha_-)$ subject to $\alpha_+ \geq 0$ and $\alpha_- \geq 0$. Since only a lower bound on the margin is needed, it is possible to drop the constraints $\alpha_+ \geq 0$ and $\alpha_- \geq 0$. Removing the constraints can only increase the objective function at the solution, so the unconstrained maximum $L'(\vec{\alpha}^*)$ is greater than or equal to $L(\vec{\alpha}^*)$. Setting the derivative of (19) to 0,

$\frac{dL'(\vec{\alpha})}{d\vec{\alpha}} = 0 \;\Leftrightarrow\; \vec{\alpha} = (Y X^T F^{-1} X Y)^{-1} \vec{1}$   (20)

and substituting into (19) yields the unconstrained maximum

$L'(\vec{\alpha}^*) = \frac{1}{2} \vec{1}^T (Y X^T F^{-1} X Y)^{-1} \vec{1}$   (21)

The special form of $Y X^T F^{-1} X Y$ makes it possible to compute its inverse in closed form:

$(Y X^T F^{-1} X Y)^{-1} = \begin{pmatrix} \vec{p}^{\,T} F^{-1} \vec{p} & -\vec{p}^{\,T} F^{-1} \vec{n} \\ -\vec{n}^T F^{-1} \vec{p} & \vec{n}^T F^{-1} \vec{n} \end{pmatrix}^{-1}$   (22)

$= \begin{pmatrix} a & -b \\ -b & c \end{pmatrix}^{-1}$   (23)

$= \frac{1}{ac - b^2} \begin{pmatrix} c & b \\ b & a \end{pmatrix}$   (24)

Substituting into (21) completes the proof.
The lemma shows that any set of documents, where each document is fully consistent with the specified TCat-concept, is always linearly separable with a certain minimum margin. Note that separability implies that the training loss $\sum \xi_i$ is zero. While this paper considers only the case of full consistency and zero noise, [14] shows how these assumptions can be relaxed.
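The bound (9) is a direct computation; the sketch below transcribes it and evaluates it for the example concept of Figure 4. The numeric value is simply what the formula yields for that artificial task.

    def margin_lower_bound(sets):
        # a, b, c as defined in Lemma 1
        a = sum(p * p / f for p, n, f in sets)
        b = sum(p * n / f for p, n, f in sets)
        c = sum(n * n / f for p, n, f in sets)
        return (a * c - b * b) / (a + 2 * b + c)  # lower bound on delta^2

    example = [(20, 20, 100), (4, 1, 200), (1, 4, 200), (5, 5, 600),
               (9, 1, 3000), (1, 9, 3000), (10, 10, 4000)]
    print(margin_lower_bound(example))  # roughly 0.033 for the Figure 4 concept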
It remains to bound the maximum Euclidean length $R$ of document vectors before it is possible to apply Theorem 1. Clearly, the document vector of a document with $l$ words cannot have a Euclidean length greater than $l$. Nevertheless, this bound is very loose for real document vectors. To bound the quantity $R$ more tightly, it is possible to make use of Zipf's law.
Assume that the term frequencies in every document follow the generalized Zipf's law [15]

$TF_r = \frac{c}{(r+k)^\phi}$   (25)

with typical parameter values $k \approx 5$, $\phi \approx 1.3$, and $c$ scaling with document length. This assumption about Zipf's law does not imply that a particular word occurs with a certain frequency in every document. It is much weaker; it merely implies that the $r$-th most frequent word in a document occurs with a particular frequency. In slight abuse of Zipf's law for short documents, the following lemma connects the length of the document vectors to Zipf's law. Intuitively, it states that many words in a document occur with low frequency, leading to document vectors of relatively short Euclidean length.
Lemma 2 (Length of Document Vectors). If the ranked term frequencies $TF_r$ in a document with $l$ terms have the form of the generalized Zipf's law

$TF_r = \frac{c}{(r+k)^\phi}$   (26)

based on their frequency rank $r$, then the Euclidean length of the document vector $\vec{x}$ of term frequencies is bounded by

$||\vec{x}|| \;\leq\; \sqrt{\sum_{r=1}^{d} \left(\frac{c}{(r+k)^\phi}\right)^2}$   with $d$ such that   $\sum_{r=1}^{d} \frac{c}{(r+k)^\phi} = l$
Proof. From the connection between the frequency rank of a term and its absolute frequency, it follows that the $r$-th most frequent term occurs

$TF_r = \frac{c}{(r+k)^\phi}$   (27)

times. The document vector $\vec{x}$ has $d$ non-zero entries, which are the values $TF_1, \ldots, TF_d$. Therefore, the squared Euclidean length of the document vector $\vec{x}$ is

$\vec{x}^T \vec{x} = \sum_{r=1}^{d} \left(\frac{c}{(r+k)^\phi}\right)^2$   (28)
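The bound of Lemma 2 is straightforward to evaluate numerically: accumulate Zipf-distributed term frequencies until they account for the document length $l$, then sum their squares. The parameter values below are hypothetical.

    def squared_length_bound(c, k, phi, l):
        # find d with sum_{r=1..d} c/(r+k)^phi = l, accumulating TF_r^2 along the way
        total, sq, r = 0.0, 0.0, 0
        while total < l:
            r += 1
            tf = c / (r + k) ** phi
            total += tf
            sq += tf * tf
        return sq  # upper bound on R^2 = ||x||^2

    print(squared_length_bound(c=60.0, k=5.0, phi=1.3, l=50.0))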
Combining Lemma 1 and Lemma 2 with Theorem 1 leads to the following main result.
Theorem 2 (Learnability of TCat-Concepts). For $\mathrm{TCat}([p_1\!:\!n_1\!:\!f_1], \ldots, [p_s\!:\!n_s\!:\!f_s])$-concepts and documents with $l$ terms distributed according to the generalized Zipf's law $TF_r = \frac{c}{(r+k)^\phi}$, the expected generalization error of an (unbiased) SVM after training on $n$ examples is bounded by

$E(\mathrm{Err}^n(h_{SVM})) \;\leq\; \frac{\rho R^2}{n+1} \cdot \frac{a + 2b + c}{ac - b^2}$

with $a = \sum_{i=1}^{s} \frac{p_i^2}{f_i}$, $b = \sum_{i=1}^{s} \frac{p_i n_i}{f_i}$, $c = \sum_{i=1}^{s} \frac{n_i^2}{f_i}$, and $R^2 = \sum_{r=1}^{d} \left(\frac{c}{(r+k)^\phi}\right)^2$,

unless $\forall_{i=1}^{s}: p_i = n_i$. $d$ is chosen so that $\sum_{r=1}^{d} \frac{c}{(r+k)^\phi} = l$. For unbiased SVMs $\rho$ equals 1, and for biased SVMs $\rho$ equals 2.
Proof. Using the fact that TCat-concepts are separable (and therefore stable) if for at least one $i$ the value of $p_i$ is different from $n_i$, the result from Theorem 1 reduces to

$E(\mathrm{Err}^n(h_{SVM})) \;\leq\; \frac{\rho}{n+1} E\left(\frac{R^2}{\delta^2}\right)$   (29)

since all $\xi_i$ are zero for a sufficiently large value of $C$. Lemma 1 gives a lower bound for $\delta^2$, which can be used to bound the expectation:

$E\left(\frac{R^2}{\delta^2}\right) \;\leq\; \frac{a + 2b + c}{ac - b^2}\, E(R^2)$   (30)

It remains for us to give an upper bound for $E(R^2)$. $R^2$ is the maximum squared Euclidean length of any feature vector in the training data. Since the term frequencies in each example follow the generalized Zipf's law $TF_r = \frac{c}{(r+k)^\phi}$, it is possible to use Lemma 2 to bound $R^2$ and therefore $E(R^2)$.
Empirical Validation. The TCat-model and the lemmata leading to the main result suggest that text-classification tasks are generally linearly separable (i.e. $\sum \xi_i = 0$), and that the normalized inverse margin $R^2/\delta^2$ is small. This prediction can be tested against real data.
Reuters        $R^2/\delta^2$   $\sum \xi_i$
earn                1143             0
acq                 1848             0
money-fx            1489            27
grain                585             0
crude                810             4
trade                869             9
interest            2082            33
ship                 458             0
wheat                405             2
corn                 378             0

WebKB          $R^2/\delta^2$   $\sum \xi_i$
course               519             0
faculty             1636             0
project              741             0
student             1588             0

Ohsumed        $R^2/\delta^2$   $\sum \xi_i$
Pathology          11614             0
Cardiovasc.         4387             0
Neoplasms           2868             0
Nervous Sys.        3303             0
Immunologic         2556             0

Table 2: Normalized inverse margin and training loss for the Reuters (27,658 features), the WebKB (38,359 features), and the Ohsumed data (38,679 features) for C = 50. As suggested by model-selection experiments, TFIDF weighting is used for Reuters and Ohsumed, while the representation for WebKB is binary. No stemming is performed, and stopword removal is used only on the Ohsumed data.
First, Table 2 indicates that all Ohsumed categories, all WebKB tasks, and most Reuters-21578 categories are linearly separable (i.e. $\sum \xi_i = 0$). This means that there is a hyperplane so that all positive examples are on one side of the hyperplane, while all negative examples are on the other. Inseparability on some Reuters categories is often due to dubious documents (consisting only of a headline) or obvious misclassifications by the human indexers.
Second, separability is possible with a large margin. Table 2 shows the size of the normalized inverse margin for the ten most frequent Reuters categories, the WebKB categories, and the five most frequent Ohsumed categories. Intuitively, $R^2/\delta^2$ can be treated as an "effective" number of parameters due to its link to the VC-dimension [18]. Compared to the dimensionality of the feature space, the normalized inverse margin is typically small.
These experimental findings, in connection with the theoretical results from above, validate that TCat-concepts do capture an important and widely present property of text-classification tasks.
5. COMPARING THE THEORETICAL MODEL WITH EXPERIMENTAL RESULTS
The previous sections formally describe how a large expected margin with low training error leads to a low expected prediction error. Furthermore, they indicate how the margin is related to the properties of TCat-concepts, and experimentally verify that real text-classification tasks can be modeled with TCat-concepts. This section verifies not only that the individual steps are well justified, but also that their conjunction produces meaningful results. To show this, this section compares the generalization performance as predicted by the model with the generalization performance found in experiments.
In Section 4.2, a TCat-concept for the WebKB category "course" was estimated. Furthermore, the parameters of Zipf's law for the full WebKB collection are $c = 470{,}000$, $k = 5$, and $\phi = 1.25$. Subject to the assumptions of Theorem 2, substituting the estimated values into the bound leads to the following characterization of the expected error:

$E(\mathrm{Err}^n(h_{SVM})) \;\leq\; \frac{0.2331 \cdot 1899.7}{n+1} \;\leq\; \frac{443}{n+1}$   (31)

Here $n$ denotes the number of training examples. Consequently, after training on 3,957 examples, the model predicts an expected generalization error of less than 11.2%.

                       $E(\mathrm{Err}^n(h_{SVM}))$   $\mathrm{Err}^n_{test}(h_{SVM})$
WebKB "course"                11.2%                          4.4%
Reuters "earn"                 1.5%                          1.3%
Ohsumed "pathology"           94.5%                         23.1%

Table 3: Comparing the expected error predicted by the model with the error rate on the test set for the WebKB category "course", the Reuters category "earn", and the Ohsumed category "pathology", with TF weighting and C = 1000. No stopword removal and no stemming are used.
An analogous procedure for the Reuters category "earn" leads to the bound

$E(\mathrm{Err}^n(h_{SVM})) \;\leq\; \frac{0.1802 \cdot 762.9}{n+1} \;\leq\; \frac{138}{n+1}$   (32)

so that the expected generalization error after 9,603 training examples is less than 1.5%. Similarly, the bound for the Ohsumed category "pathology" is

$E(\mathrm{Err}^n(h_{SVM})) \;\leq\; \frac{7.4123 \cdot 1275.8}{n+1} \;\leq\; \frac{9457}{n+1}$   (33)

leading to an expected generalization error of less than 94.5% after 10,000 training examples.
Table 3 compares the expected generalization error predicted by the estimated models with the generalization performance observed in experiments. While it is unreasonable to expect that the model precisely predicts the exact performance observed on the test set, Table 3 shows that the model captures which classification tasks are more difficult than others. In particular, it does correctly predict that "earn" is the easiest task, "course" is the second easiest task, and that "pathology" is the most difficult one. While the TCat model is probably not detailed enough to be suitable for performance estimation in most application settings (e.g. [13]), this gives some validation that TCat-concepts can formalize the key properties of text-classification tasks relevant for learnability with SVMs. More can be found in [14].
6. SENSITIVITY ANALYSIS: DIFFICULT AND EASY LEARNING TASKS
The previous section revealed that the bound on the expected generalization error can be large for some TCat-concepts, while it is small for others. Going through different scenarios, it is now possible to identify the key properties that make a text-classification task "easy" or "difficult" for an SVM to learn [14].

Occurrence Frequency: Given that the other parameters stay constant, the bound on the error rate decreases if the frequency of the discriminative features is increased.

Discriminative Power of Term Sets: The extent to which vocabulary differs between classes makes a difference for learnability. The value of the bound decreases if the difference in class-conditional word frequencies increases.

Level of Redundancy: The higher the redundancy, the lower the bound on the generalization error. This implies that it is desirable to have many clues in each document.

Similarly, the model can be used to analyse the effect of TFIDF weighting on the effectiveness of SVMs depending on the properties of the task [14].
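These effects can be probed numerically with the tcat_error_bound sketch from Section 4, comparing hypothetical variants of the Figure 4 concept; the variants below are made up to isolate one property at a time.

    variants = {
        "base":           [(20, 20, 100), (4, 1, 200), (1, 4, 200), (5, 5, 600),
                           (9, 1, 3000), (1, 9, 3000), (10, 10, 4000)],
        # sharper class-conditional difference in the medium-frequency range
        "sharper":        [(20, 20, 100), (5, 0, 200), (0, 5, 200), (4, 4, 600),
                           (9, 1, 3000), (1, 9, 3000), (10, 10, 4000)],
        # fewer class-indicative occurrences per document (less redundancy)
        "less_redundant": [(20, 20, 100), (2, 1, 200), (1, 2, 200), (5, 5, 600),
                           (5, 1, 3000), (1, 5, 3000), (16, 16, 4000)],
    }
    for name, sets in variants.items():
        print(name, tcat_error_bound(sets, c=60.0, k=5.0, phi=1.3, l=50.0, n=1000))
    # the bound drops for "sharper" and rises sharply for "less_redundant"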
7. LIMITATIONS OF THE MODEL AND OPEN QUESTIONS
Every model abstracts from reality in some sense, and it is important to point out the assumptions clearly.
First, each document is assumed to exactly follow the same generalized Zipf's law, neglecting variance and discretization inaccuracies that occur especially for short documents. In particular, this implies that all documents are of equal length.
Second, the model fixes the number of occurrences from each word set in the TCat-model. While the degree of violation of this assumption can be captured in terms of attribute noise, it might be useful and possible not to specify the exact number of occurrences per word set, but only upper and lower bounds. This could make the model more accurate. However, it comes at the cost of an increased number of parameters, making the model less understandable. While the formal analysis of noise in [14] demonstrates that the model does not break in the presence of noise, the bounds could be tightened. Along the same lines, parametric noise models could be incorporated to model the types of noise in text-classification problems.
Finally, the general approach taken in this paper is to model only upper bounds on the error rate. While these are important to derive sufficient conditions for the learnability of text-classification tasks, lower bounds may be of interest as well. They could answer the question of which text-classification tasks cannot be learned with SVMs.
8. RELATED WORK
While other learning algorithms can also be analyzed in terms of formal models, these models typically make assumptions that are unjustified for text.
The most popular such algorithm is naive Bayes. Naive Bayes is commonly justified using assumptions of conditional independence or linked dependence [3]. However, these assumptions are generally accepted to be false for text. While more complex dependence models can somewhat reduce the degree of violation [17], a principal problem with using generative models for text remains: finding a generative model for natural language appears much more difficult than solving a text-classification task. Therefore, this paper presented a discriminative model of text classification. It models the distribution of words only to the extent necessary to describe classification accuracy. This way it is possible to avoid false independence assumptions.
Another model used to describe the properties of text is the 2-Poisson model [1]. However, like the Bernoulli model, it is rejected by tests [8, 9]. Description-oriented approaches [7][6][5] provide powerful modeling tools and can avoid high-dimensional feature spaces, but require implicit assumptions in the way description vectors are generated.
While different in its motivation and its goal, the work of Papadimitriou et al. is most similar in spirit to the approach presented here [16]. They show that latent semantic indexing leads to a suitable low-dimensional representation, given assumptions about the distribution of words. These assumptions are similar in how they exploit the difference of word distributions. However, they do not show how their assumptions relate to the statistical properties of text, and they do not derive generalization-error bounds.
9. SUMMARY AND CONCLUSIONS
This paper develops the first model of learning text classifiers from examples that makes it possible to quantitatively connect the statistical properties of text with the generalization performance of the learner. The model is the result of taking a discriminative approach. Unlike conventional generative models, it does not involve independence assumptions. The discriminative model focuses on those properties of text-classification tasks that are sufficient for good generalization performance, avoiding much of the complexity of natural language.
Based on this discriminative model, the paper explains how SVMs can achieve good classification performance despite the high-dimensional feature spaces in text classification. The resulting bounds on the expected generalization error give a formal understanding of what kind of text-classification task can be solved with SVMs. This makes it possible to identify that, intuitively, high redundancy, high discriminative power of term sets, and discriminative features in the high-frequency range are sufficient conditions for good generalization. Finally, the model provides a formal basis for developing new algorithms that are most appropriate in specific scenarios.
10. REFERENCES
[1] A. Bookstein and D. R. Swanson. Probabilistic models for automated indexing. Journal of the American Society for Information Science, 25(5):312-318, 1974.
[2] C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
[3] W. Cooper. Some inconsistencies and misnomers in probabilistic information retrieval. In International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 57-61, 1991.
[4] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of ACM-CIKM98, November 1998.
[5] N. Fuhr, S. Hartmann, G. Lustig, M. Schwantner, K. Tzeras, and G. Knorz. AIR/X - a rule-based multistage indexing system for large subject fields. In RIAO, pages 606-623, 1991.
[6] N. Fuhr and G. Knorz. Retrieval test evaluation of a rule based automatic indexing (AIR/PHYS). In C. van Rijsbergen, editor, Research and Development in Information Retrieval: Proceedings of the Third Joint BCS and ACM Symposium, pages 391-408. Cambridge University Press, July 1984.
[7] N. Goevert, M. Lalmas, and N. Fuhr. A probabilistic description-oriented approach for categorising Web documents. In Proceedings of CIKM-99, 8th ACM International Conference on Information and Knowledge Management, pages 475-482, Kansas City, US, 1999. ACM Press, New York, US.
[8] S. P. Harter. A probabilistic approach to automated keyword indexing. Part I: On the distribution of specialty words in a technical literature. Journal of the American Society for Information Science, 26(4):197-206, 1975.
[9] S. P. Harter. A probabilistic approach to automated keyword indexing. Part II: An algorithm for probabilistic indexing. Journal of the American Society for Information Science, 26(5):280-289, 1975.
[10] T. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Conference on AI and Statistics, 1999.
[11] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning, pages 137-142, Berlin, 1998. Springer.
[12] T. Joachims. Making large-scale SVM learning practical. In B. Schoelkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11. MIT Press, Cambridge, MA, 1999.
[13] T. Joachims. Estimating the generalization performance of a SVM efficiently. In Proceedings of the International Conference on Machine Learning, San Francisco, 2000. Morgan Kaufmann.
[14] T. Joachims. The Maximum-Margin Approach to Learning Text Classifiers: Methods, Theory, and Algorithms. PhD thesis, Universitaet Dortmund, 2001. Kluwer, to appear.
[15] B. Mandelbrot. A note on a class of skew distribution functions: Analysis and critique of a paper by H. A. Simon. Information and Control, 2(1):90-99, Apr. 1959.
[16] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: A probabilistic analysis. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS '98), Seattle, Washington, June 1998, pages 159-168. ACM Press, New York, 1998.
[17] M. Sahami. Using Machine Learning to Improve Information Access. PhD thesis, Stanford University, 1998.
[18] V. Vapnik. Statistical Learning Theory. Wiley, Chichester, GB, 1998.
[19] G. K. Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley, Cambridge, MA, USA, 1949.