A Statistical Learning Model of Text Classification for Support Vector Machines
Thorsten Joachims
GMD Forschungszentrum IT, AIS.KD
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
Thorsten.Joachims@gmd.de
ABSTRACT
This paper develops a theoretical learning model of text classification for Support Vector Machines (SVMs). It connects the statistical properties of text-classification tasks with the generalization performance of an SVM in a quantitative way. Unlike conventional approaches to learning text classifiers, which rely primarily on empirical evidence, this model explains why and when SVMs perform well for text classification. In particular, it addresses the following questions: Why can support vector machines handle the large feature spaces in text classification effectively? How is this related to the statistical properties of text? What are sufficient conditions for applying SVMs to text-classification problems successfully?
1. INTRODUCTION
There are at least two ways to motivate why a particular learning method is suitable for a particular learning task. Since ultimately one is interested in the performance of the method, one way is through comparative studies. Previous work [11, 4] presents such studies showing that Support Vector Machines (SVMs) deliver state-of-the-art classification performance. However, success on benchmarks is a brittle justification for a learning algorithm and gives only limited insight. Therefore, this paper analyzes the suitability of SVMs for learning text classifiers from a theoretical perspective.
In particular, this paper presents an abstract model of text-classification tasks. This model is based on statistical properties of text-classification problems that are both observable and intuitive. Using this model, it is possible to prove what types of text-classification problems are efficiently learnable with SVMs. The central result is an upper bound connecting the expected generalization error of an SVM with the statistical properties of text-classification tasks.
This paper is structured as follows. After a short introduction to SVMs, it will identify the key properties of
text-classification tasks. They motivate the model formally defined in Section 4. In addition to verifying the assumptions of the model against real data, this section proves the learnability results. Section 5 further validates the model using experiments, before Section 6 analyzes the complexity of text-classification tasks and identifies sufficient conditions for good generalization performance.
2. SUPPORT VECTOR MACHINES
SVMs [18] were developed by V. Vapnik et al. based on the structural risk minimization principle from statistical learning theory. In their basic form, SVMs learn linear decision rules $h(\vec{x}) = \mathrm{sign}\{\vec{w} \cdot \vec{x} + b\}$ described by a weight vector $\vec{w}$ and a threshold $b$. The input is a sample of $n$ training examples $S_n = ((\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n))$, $\vec{x}_i \in \Re^N$, $y_i \in \{-1, +1\}$. For a linearly separable $S_n$, the SVM finds the hyperplane with maximum Euclidean distance to the closest training examples. This distance is called the margin $\delta$, as depicted in Figure 1. For non-separable training sets, the amount of training error is measured using slack variables $\xi_i$. Computing the hyperplane is equivalent to solving the following primal optimization problem [18].
Optimization Problem 1 (SVM (primal)).

minimize:   $V(\vec{w}, b, \vec{\xi}) = \frac{1}{2}\,\vec{w} \cdot \vec{w} + C \sum_{i=1}^{n} \xi_i$   (1)

subj. to:   $\forall_{i=1}^{n}: \; y_i[\vec{w} \cdot \vec{x}_i + b] \geq 1 - \xi_i$   (2)

            $\forall_{i=1}^{n}: \; \xi_i > 0$   (3)
The constraints (2) require that all training examples are classified correctly up to some slack $\xi_i$. If a training example lies on the "wrong" side of the hyperplane, the corresponding $\xi_i$ is greater than or equal to 1. Therefore, $\sum_{i=1}^{n} \xi_i$ is an upper bound on the number of training errors. The factor $C$ in (1) is a parameter that allows trading off training error vs. model complexity. Note that the margin of the resulting hyperplane is $\delta = 1/||\vec{w}||$.
Instead of solving OP1 directly, one can also consider the following dual program.
Optimization Problem 2 (SVM (dual)).

maximize:   $W(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j (\vec{x}_i \cdot \vec{x}_j)$   (4)

subj. to:   $\sum_{i=1}^{n} y_i \alpha_i = 0$   (5)

            $\forall i \in [1..n]: \; 0 \leq \alpha_i \leq C$   (6)
Figure 1: A binary classification problem (+ vs. -) in two dimensions. The hyperplane $h^*$ separates positive and negative training examples with maximum margin $\delta$. The examples closest to the hyperplane are called support vectors (marked with circles).
Duality implies that $W(\vec{\alpha}^*) = V(\vec{w}^*, b^*, \vec{\xi}^*)$ at the respective solutions of both programs, and that $W(\vec{\alpha}) \leq V(\vec{w}, b, \vec{\xi})$ for any feasible point. From the solution of the dual, the primal solution can be constructed as

$\vec{w} = \sum_{i=1}^{n} \alpha_i y_i \vec{x}_i$   and   $b = y_{usv} - \vec{w} \cdot \vec{x}_{usv}$   (7)

where $(\vec{x}_{usv}, y_{usv})$ is some training example with $0 < \alpha_{usv} < C$. For all but degenerate cases, such training examples exist and the hyperplane is called stable. One special family of hyperplanes considered in the following are called unbiased hyperplanes. Such hyperplanes are forced to pass through the origin, either by adding the constraint $b = 0$ in OP1, or equivalently by removing the constraint (5) in OP2. From a practical perspective for text classification, SVMs restricted to unbiased hyperplanes achieve a performance similar to general (i.e. biased) hyperplanes. For the experiments in this paper, SVM-Light [12] (http://www-ai.informatik.uni-dortmund.de/svmlight) is used for solving the dual optimization problem. More detailed introductions to SVMs can be found in [2, 18].
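To make the optimization problems above concrete, here is a minimal sketch that trains a soft-margin linear SVM on a toy data set with scikit-learn's LinearSVC and reads off the margin $\delta = 1/||\vec{w}||$ and the slacks $\xi_i$ of constraints (2). This is an illustration, not the paper's setup: the toy vectors, the solver, and the parameter choices are all assumptions of this example.

    import numpy as np
    from sklearn.svm import LinearSVC  # solves a soft-margin linear SVM akin to OP1

    # toy "document vectors" (rows) with labels +1 / -1
    X = np.array([[2., 0., 1.],
                  [1., 0., 2.],
                  [0., 2., 1.],
                  [0., 1., 2.]])
    y = np.array([+1, +1, -1, -1])

    svm = LinearSVC(C=50.0, loss="hinge")  # a large C approximates the hard margin
    svm.fit(X, y)

    w, b = svm.coef_.ravel(), svm.intercept_[0]
    delta = 1.0 / np.linalg.norm(w)               # margin of the learned hyperplane
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # slack variables of constraints (2)
    print(delta, xi.sum())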
3. PROPERTIES OF TEXT-CLASSIFICATION TASKS
To make useful statements about why a particular learning method should work well for text classification, it is necessary to identify key properties of text-classification tasks. Given a bag-of-words representation, the following properties hold:

High-Dimensional Feature Space. Independent of the particular choice of terms, text-classification problems involve high-dimensional feature spaces. If each word occurring in the training documents is used as a feature, text-classification problems with a few thousand training examples can lead to 30,000 and more attributes.

Sparse Document Vectors. While there is a large space of potential features, each document contains only a small number of distinct words. This implies that document vectors are very sparse.
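As a quick illustration of this sparsity, the following sketch builds bag-of-words count vectors for two tiny example documents; the scikit-learn pipeline and the sample sentences are assumptions of this illustration, not part of the paper's experiments.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat",
            "stocks rose after the merger was announced"]
    X = CountVectorizer().fit_transform(docs)  # scipy.sparse count matrix

    n_docs, n_features = X.shape
    print(n_features, X.nnz)                   # vocabulary size vs. stored entries
    print(X.nnz / float(n_docs * n_features))  # fraction of non-zero entries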
    Modulaire Industries said it acquired the design library and manufacturing rights of privately-owned Boise Homes for an undisclosed amount of cash. Boise Homes sold commercial and residential prefabricated structures, Modulaire said.

    USX, CONSOLIDATED NATURAL END TALKS
    USX Corp's Texas Oil and Gas Corp subsidiary and Consolidated Natural Gas Co have mutually agreed not to pursue further their talks on Consolidated's possible purchase of Apollo Gas Co from Texas Oil. No details were given.

    JUSTICE ASKS U.S. DISMISSAL OF TWA FILING
    The Justice Department told the Transportation Department it supported a request by USAir Group that the DOT dismiss an application by Trans World Airlines Inc for approval to take control of USAir. "Our rationale is that we reviewed the application for control filed by TWA with the DOT and ascertained that it did not contain sufficient information upon which to base a competitive review," James Weiss, an official in Justice's Antitrust Division, told Reuters.

    E.D. AND F. MAN TO BUY INTO HONG KONG FIRM
    The U.K.-based commodity house E.D. And F. Man Ltd and Singapore's Yeo Hiap Seng Ltd jointly announced that Man will buy a substantial stake in Yeo's 71.1 pct held unit, Yeo Hiap Seng Enterprises Ltd. Man will develop the locally listed soft drinks manufacturer into a securities and commodities brokerage arm and will rename the firm Man Pacific (Holdings) Ltd.

Figure 2: Four documents from the Reuters-21578 category "corporate acquisitions" that do not share any content words.
Heterogeneous Use of Terms. Consider the 4 documents shown in Figure 2. All documents are Reuters-21578 articles from the category "corporate acquisitions". Nevertheless, the overlap between their document vectors is very small. In this extreme case, the documents do not share any content words. The only words that occur in at least two documents are "it", "the", "and", "of", "for", "an", "a", "not", "that", and "in". All these words are stopwords and it is unlikely that they help discriminate between documents about corporate acquisitions and other documents. However, each document contains good keywords indicating a "corporate acquisition"; they are just different keywords in each document.
High Level of Redundancy. While there are generally many different features relevant to the classification task, often several such cues occur in one document. These cues are partly redundant. Table 1 [11] shows the results of an experiment on the Reuters "corporate acquisitions" category. All features (after stemming and stopword removal) are ranked according to their (binary) empirical mutual information (EMI) with the class label (cf. e.g. [14]).
Figure 3: Structure of the argument (steps 1, 2, and 3).
used features by EMI rank    PRBE
1-200                        89.6
201-500                      71.3
501-1000                     63.3
1001-2000                    58.0
2001-4000                    55.4
4001-9947                    47.5
random (no learning)         21.8

Table 1: Learning without using the "best" features.

Then a naive Bayes classifier is trained using only those features ranked 1-200, 201-500, 501-1000, 1001-2000, 2001-4000, and 4001-9947. The results in Table 1 show that even the features ranked lowest still contain considerable information and are somewhat relevant. A classifier using only those "worst" features has a precision/recall break-even point (PRBE) (e.g. [11]) much better than random.
Frequency Distribution of Words and Zipf's Law. The occurrence frequencies of words in natural language follow Zipf's law [19]. Zipf's law states that if one ranks words by their term frequency, the r-th most frequent word occurs roughly 1/r times the term frequency of the most frequent word. This implies that there is a small number of words that occur very frequently, while most words occur very infrequently.
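A small sketch of how one might check this rank-frequency shape on a corpus; the file name is hypothetical, and the 1/r comparison follows the simple form of Zipf's law stated above.

    from collections import Counter

    words = open("corpus.txt").read().lower().split()  # hypothetical corpus file
    freqs = sorted(Counter(words).values(), reverse=True)

    top = freqs[0]
    for r, tf in enumerate(freqs[:10], start=1):
        # under Zipf's law, tf should be roughly top / r
        print(r, tf, round(top / r, 1))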
4. A DISCRIMINATIVE MODEL OF TEXT CLASSIFICATION
The goal of this section is a statistical learning model of text-classification tasks. Using a three-step approach as illustrated in Figure 3, it provides the relationship between the properties of text-classification tasks identified above and the expected error rate of an SVM. The first step shows that a large margin combined with low training error is a sufficient condition for good generalization accuracy. The second step abstracts the properties of text-classification tasks into a model, which the third step connects to large-margin separation.
4.1 Step 1: Bounding the Expected Error Based on the Margin
The following bound [14, 18] shows that a large margin combined with low training error leads to high generalization accuracy. It uses results limiting the number of leave-one-out errors [10, 13]. The key quantities are the margin $\delta$ as defined in Section 2, the maximum Euclidean length $R$ of the document vectors $\vec{x}$, and the training loss $\sum \xi_i$.
Theorem 1 (Bound on Expected Error of SVM). The expected error rate $E(\mathrm{Err}^n(h_{SVM}))$ of an SVM based on $n$ training examples with $0 \leq ||\vec{x}_i|| \leq R$ for all points with non-zero probability and some constant $C$, is bounded by

$E(\mathrm{Err}^n(h_{SVM})) \;\leq\; \frac{E\left(\rho \frac{R^2}{\delta^2}\right) + C'\, E\left(\sum_{i=1}^{n+1} \xi_i\right)}{n+1}$

with $C' = CR^2$ if $C \leq 1/R^2$, and $C' = CR^2 + 1$ otherwise. For unbiased hyperplanes $\rho$ equals 1, and for general stable hyperplanes $\rho$ equals 2. The expectations on the right are over training sets of size $n+1$.
The proof can be found in [14]. Note that $R$ acts as a scaling constant for the margin $\delta$, as can easily be seen in Optimization Problem 1. For example, the squared margin $\delta^2$ can always be doubled by scaling the document vectors $\vec{x}$ to twice their length. The bound in Theorem 1 accounts for such scaling.
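As a worked illustration of Theorem 1, the sketch below evaluates the bound for entirely made-up values of $R^2$, $\delta^2$, and the training loss; the formula for $C'$ follows the reconstruction of the theorem given above, and none of the numbers come from the paper.

    def expected_error_bound(R2, delta2, sum_xi, n, C, rho=1.0):
        # rho = 1 for unbiased hyperplanes, 2 for general stable hyperplanes
        Cprime = C * R2 if C <= 1.0 / R2 else C * R2 + 1.0
        return (rho * R2 / delta2 + Cprime * sum_xi) / (n + 1.0)

    # separable case (sum_xi = 0) with hypothetical radius and margin
    print(expected_error_bound(R2=2000.0, delta2=4.5, sum_xi=0.0, n=3956, C=50.0))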
4.2 Step 2: TCat-Concepts as a Model of Text-Classification Tasks
Unfortunately, it is not possible to simply look at a new text-classification task and immediately have a good idea of whether it has a large margin. The margin property is observable only after training data becomes available and requires training the SVM. To overcome this problem, this second step lays the basis for connecting the large-margin property with more intuitive and more meaningful properties of text-classification tasks.
Consider the following stereotypical text-classification task. While this task is artificial and hypothetical, it will serve as a motivation for the model developed in this section. For this example task, the following describes how documents from the two classes differ in terms of the frequency with which certain types of words occur in them. Figure 4 graphically illustrates the corresponding "word-frequency histogram".
Stopwords: Independently of whether a document is from the positive or the negative class, each document contains 20 word occurrences from a set of 100 words (i.e. lexicon entries). These high-frequency words are typically considered stopwords. Note that this does not specify the individual word frequencies; i.e., it is open whether one word occurs 20 times, or 20 different words each occur once, or something in between.
Medium Frequency: There are 1,000 medium-frequency words in the lexicon. From a subset of 600 such entries, again each positive and negative document contains (any bag of) 5 occurrences. But there are also two groups of 200 entries each that occur primarily in positive or negative documents, respectively. In particular, from one group there are 4 occurrences in each positive document and only 1 in each negative document. Respectively, from the other group there are 4 occurrences in each negative document while there is only one in each positive document.

[Figure 4: A simple example of a TCat-concept, shown as a word-frequency histogram for positive and negative documents over the 11,100 words in the dictionary: 100 stopwords (irrelevant), 600 irrelevant medium-frequency words, 200 positive and 200 negative medium-frequency indicators, 3,000 positive and 3,000 negative low-frequency indicators, and 4,000 irrelevant low-frequency words; 50 words per document.]
Low Frequency: Similarly, for the remaining 10,000 entries in the low-frequency part of the lexicon, there is a subset of 4,000 entries of which there are 10 occurrences in both positive and negative documents. But there are two sets of 3,000 entries each that occur primarily in positive or negative documents, with a frequency of 9 versus 1.
To what extent does this example resemble the properties of text-classification tasks identified in Section 3?

High-Dimensional Input Space: There are 11,100 features, which is on the same order of magnitude as real text-classification tasks.

Sparse Document Vectors: Each document is only 50 words long, which means there are at least 11,050 zero entries in each document vector.

High Level of Redundancy: In each document there are 4 medium-frequency words and 9 low-frequency words that indicate the class of the document. Considering the document length of 50 words, this is a fairly high level of redundancy.

Heterogeneous Use of Terms: Both the positive and the negative documents each have a group of 200 medium-frequency words and a group of 3,000 low-frequency words. From each group there can be an arbitrary subset of 4 for the medium-frequency words and 9 for the low-frequency words in each document. Considering only the medium-frequency words, this implies that there can be 50 documents in the same class that do not share a single medium-frequency term from this group. This mimics the property of text-classification tasks identified in Section 3.

Zipf's Law: There is a small number of words (100 stopwords) that occur very frequently, a set of 1,000 words of medium frequency, and a large set of 10,000 low-frequency words. This does resemble Zipf's law.
To abstract from this particular example, the following definition introduces a parameterized model that can describe text-classification tasks more generally.
Definition 1 (Homogeneous TCat-Concepts). The TCat-concept

$\mathrm{TCat}([p_1\!:\!n_1\!:\!f_1], \ldots, [p_s\!:\!n_s\!:\!f_s])$   (8)

describes a binary classification task with $s$ disjoint sets of features (i.e. words). The $i$-th set includes $f_i$ features. Each positive example contains $p_i$ occurrences of features from the respective set, and each negative example contains $n_i$ occurrences. The same feature can occur multiple times in one document.
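To make Definition 1 concrete, the following sketch samples documents from a homogeneous TCat-concept. The uniform choice of words within each set is an assumption of this illustration; the definition itself only fixes the occurrence counts per set.

    import numpy as np

    def sample_document(sets, positive, rng):
        """sets: list of (p_i, n_i, f_i) triples; returns a term-count vector."""
        x = np.zeros(sum(f for _, _, f in sets))
        offset = 0
        for p, n, f in sets:
            occurrences = p if positive else n
            for w in rng.integers(0, f, size=occurrences):  # repetition allowed
                x[offset + w] += 1
            offset += f
        return x

    rng = np.random.default_rng(0)
    example = [(20, 20, 100), (4, 1, 200), (1, 4, 200), (5, 5, 600),
               (9, 1, 3000), (1, 9, 3000), (10, 10, 4000)]  # the Figure 4 concept
    doc = sample_document(example, positive=True, rng=rng)
    print(int(doc.sum()))  # 50 word occurrences, as in the example task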
This definition does not include noise (e.g. violations of the occurrence frequencies prescribed by the TCat-concept). However, the model can be extended to handle noise in a straightforward way [14]. Applying the definition to the example in Figure 4, it is easy to verify that the example can be described as a

TCat( [20:20:100];                            # high freq.
      [4:1:200]; [1:4:200]; [5:5:600];        # medium freq.
      [9:1:3000]; [1:9:3000]; [10:10:4000] )  # low freq.

concept. While this is an artificial example, is it possible to model real text-classification tasks as TCat-concepts?
Empirical Validation. Consider text-classification tasks from the Reuters-21578 (http://www.research.att.com/~lewis/reuters21578.html), the WebKB (http://www.cs.cmu.edu/afs/cs/project/theo-20/www/data), and the Ohsumed (ftp://medir.ohsu.edu/pub/ohsumed) collections. The following analysis shows how they can be modeled as TCat-concepts.
Let us start with the category "course" from the WebKB collection. First, we partition the feature space into disjoint sets of positive indicators, negative indicators, and irrelevant features. Using the simple strategy [14] of selecting features by their odds ratio, there are 98 high-frequency words that indicate positive documents (odds ratio greater than 2) and 52 high-frequency words indicating negative documents (odds ratio less than 0.5). An excerpt of these words is given in Figure 5. Similarly, there are 431 (341) medium-frequency words that indicate positive (negative) documents with an odds ratio greater than 5 (less than 0.2). In the low-frequency spectrum there are 5,045 positive indicators (odds ratio greater than 10) and 24,276 negative indicators (odds ratio less than 0.1). All other words in the vocabulary are assumed to carry no information.
[Figure 5: Indicative words for the WebKB category "course", partitioned by occurrence frequency. The figure lists excerpts of the positive indicators (98 high-frequency words such as "assignment", "exam", "syllabus", "textbook"; 431 medium-frequency words; 5,045 low-frequency words) and of the negative indicators (52 high-frequency words such as "faculty", "professor", "research"; 341 medium-frequency words; 24,276 low-frequency words).]

To abstract from the details of particular documents, it is useful to analyse what a typical document for this task looks like. In some sense, an "average" document captures what is typical. An average WebKB document is 277 words long. For positive examples of the category "course", on average 27.7% of the 277 occurrences come from the set of 98 high-frequency positive indicators, while these words account
for only 10.4% of the occurrences in an average negative document. Assessing the percentages analogously also for the other word groups, they can be directly translated into the following TCat-concept.

TCat_course( [77:29:98]; [4:21:52];       # high freq.
             [16:2:431]; [1:12:341];      # medium freq.
             [9:1:5045]; [1:21:24276];    # low freq.
             [169:191:8116] )             # rest
This shows that the text-classification task connected with the WebKB category "course" can be modeled as a TCat-concept, if one assumes that documents are of homogeneous length and composition. It can be shown that this assumption of homogeneity can be relaxed [14].
Similar TCat-concepts can also be found for other tasks. For the Reuters-21578 category "earn" the same procedure leads to the TCat-concept

TCat_earn( [33:2:65]; [32:65:152];      # high freq.
           [2:1:171]; [3:21:974];      # medium freq.
           [3:1:3455]; [1:10:17020];   # low freq.
           [78:52:5821] )              # rest

as an average-case model. The model for the Ohsumed category "pathology" is

TCat_pathology( [2:1:10]; [1:4:22];          # high freq.
                [2:1:92]; [1:2:94];          # medium freq.
                [5:1:4080]; [1:10:20922];    # low freq.
                [197:190:13459] )            # rest

Note that in particular the model for "pathology" is substantially different from the other two. This verifies that TCat-concepts can capture some properties of real text-classification tasks that have the potential to differentiate between tasks. The following studies their relevance for generalization performance.
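The estimation procedure just described can be sketched in a few lines. This is a simplified reading of it (a single odds-ratio cutoff per frequency band, dense count matrices), not the exact implementation from [14]; all function and parameter names are choices of this illustration.

    import numpy as np

    def estimate_tcat(X, y, bands, cutoffs):
        """X: docs-by-terms count matrix, y: labels in {+1, -1},
        bands: (lo, hi) document-frequency ranges, cutoffs: odds-ratio threshold per band."""
        pos, neg = X[y == +1], X[y == -1]
        df = (X > 0).mean(axis=0)                       # document frequency per term
        odds = ((pos > 0).mean(axis=0) + 1e-6) / ((neg > 0).mean(axis=0) + 1e-6)
        concept = []
        for (lo, hi), cut in zip(bands, cutoffs):
            in_band = (df >= lo) & (df < hi)
            for indicator in (odds > cut, odds < 1.0 / cut):
                sel = in_band & indicator
                if sel.sum() == 0:
                    continue
                p = pos[:, sel].sum(axis=1).mean()      # avg occurrences in positive docs
                n = neg[:, sel].sum(axis=1).mean()      # avg occurrences in negative docs
                concept.append((round(p), round(n), int(sel.sum())))
        return concept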
4.3 Step 3: Learnability of TCat-Concepts
This final step provides the connection between TCat-concepts and the bound for the generalization performance of an SVM. The first lemma shows that homogeneous TCat-concepts are generally separable with a certain margin. Using the fact that term frequencies obey Zipf's law, a second lemma shows that the Euclidean length of document vectors is small for text-classification tasks. These two results lead to the main learnability result for TCat-concepts.
Lemma 1 (Margin of Noise-Free TCat-Concepts). For $\mathrm{TCat}([p_1\!:\!n_1\!:\!f_1], \ldots, [p_s\!:\!n_s\!:\!f_s])$-concepts, there is always a hyperplane passing through the origin that has a margin $\delta$ bounded by

$\delta^2 \;\geq\; \frac{ac - b^2}{a + 2b + c}$   with   $a = \sum_{i=1}^{s} \frac{p_i^2}{f_i}, \quad b = \sum_{i=1}^{s} \frac{p_i n_i}{f_i}, \quad c = \sum_{i=1}^{s} \frac{n_i^2}{f_i}$   (9)
Proof. Define $\vec{p}^{\,T} = (p_1, \ldots, p_s)$ and $\vec{n}^{\,T} = (n_1, \ldots, n_s)$, as well as the diagonal matrix $F$ with $f_1, \ldots, f_s$ on the diagonal.
The margin of the maximum-margin hyperplane that separates a given training sample $(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)$ and that passes through the origin can be derived from the solution of the following optimization problem:

$W(\vec{w}^*) = \min \frac{1}{2} \vec{w}^T \vec{w}$   (10)
s.t. $y_1[\vec{x}_1^T \vec{w}] \geq 1, \; \ldots, \; y_n[\vec{x}_n^T \vec{w}] \geq 1$   (11)

The hyperplane corresponding to the solution vector $\vec{w}^*$ has a margin $\delta = (2W(\vec{w}^*))^{-0.5}$. By adding constraints to this optimization problem, it is possible to simplify its solution and get a lower bound on the margin. Let us add the additional constraint that within each group of $f_i$ features the weights are required to be identical. Then $\vec{w}^T \vec{w} = \vec{v}^T F \vec{v}$ for a vector $\vec{v}$ of dimensionality $s$. The constraints (11) can also be simplified. By definition, each example contains a certain number of features from each group. This means that all constraints for positive examples are equivalent to $\vec{p}^{\,T} \vec{v} \geq 1$ and, respectively, $-\vec{n}^T \vec{v} \geq 1$ for the negative examples. This leads to the following simplified optimization problem:

$W'(\vec{v}) = \min \frac{1}{2} \vec{v}^T F \vec{v}$   (12)
s.t. $\vec{p}^{\,T} \vec{v} \geq 1$   (13)
     $-\vec{n}^T \vec{v} \geq 1$   (14)

Let $\vec{v}^*$ be the solution. Since $W'(\vec{v}^*) \geq W(\vec{w}^*)$, it follows that $\delta \geq (2W'(\vec{v}^*))^{-0.5}$ is a lower bound for the margin. It remains to find an upper bound for $W'(\vec{v}^*)$ that can be computed in closed form. Introducing Lagrange multipliers, the solution $W'(\vec{v}^*)$ equals the value $L(\vec{v}^*, \alpha_+^*, \alpha_-^*)$ of

$L(\vec{v}, \alpha_+, \alpha_-) = \frac{1}{2} \vec{v}^T F \vec{v} - \alpha_+ (\vec{p}^{\,T} \vec{v} - 1) + \alpha_- (\vec{n}^T \vec{v} + 1)$   (15)

at its saddle point. $\alpha_+ \geq 0$ and $\alpha_- \geq 0$ are the Lagrange multipliers for the two constraints (13) and (14). Using the fact that

$\frac{dL(\vec{v}, \alpha_+, \alpha_-)}{d\vec{v}} = 0$   (16)

at the saddle point, one gets a closed-form solution for $\vec{v}$:

$\vec{v} = F^{-1}[\alpha_+ \vec{p} - \alpha_- \vec{n}]$   (17)

For ease of notation, one can equivalently write

$\vec{v} = F^{-1} X Y \vec{\alpha}$   (18)

with $X = (\vec{p}, \vec{n})$, $Y = \mathrm{diag}(1, -1)$, and $\vec{\alpha}^T = (\alpha_+, \alpha_-)$ appropriately defined. Substituting into the Lagrangian results in

$L(\vec{\alpha}) = \vec{1}^T \vec{\alpha} - \frac{1}{2} \vec{\alpha}^T Y X^T F^{-1} X Y \vec{\alpha}$   (19)

To find the saddle point, one has to maximize this function over $\vec{\alpha}^T = (\alpha_+, \alpha_-)$ subject to $\alpha_+ \geq 0$ and $\alpha_- \geq 0$. Since only a lower bound on the margin is needed, it is possible to drop the constraints $\alpha_+ \geq 0$ and $\alpha_- \geq 0$. Removing the constraints can only increase the objective function at the solution, so the unconstrained maximum $L'(\vec{\alpha}^*)$ is greater than or equal to $L(\vec{\alpha}^*)$. Setting the derivative of (19) to 0,

$\frac{dL'(\vec{\alpha})}{d\vec{\alpha}} = 0 \;\Leftrightarrow\; \vec{\alpha} = (Y X^T F^{-1} X Y)^{-1} \vec{1}$   (20)

and substituting into (19) yields the unconstrained maximum

$L'(\vec{\alpha}^*) = \frac{1}{2} \vec{1}^T (Y X^T F^{-1} X Y)^{-1} \vec{1}$   (21)

The special form of $Y X^T F^{-1} X Y$ makes it possible to compute its inverse in closed form:

$(Y X^T F^{-1} X Y)^{-1} = \begin{pmatrix} \vec{p}^{\,T} F^{-1} \vec{p} & -\vec{p}^{\,T} F^{-1} \vec{n} \\ -\vec{n}^T F^{-1} \vec{p} & \vec{n}^T F^{-1} \vec{n} \end{pmatrix}^{-1}$   (22)

$= \begin{pmatrix} a & -b \\ -b & c \end{pmatrix}^{-1}$   (23)

$= \frac{1}{ac - b^2} \begin{pmatrix} c & b \\ b & a \end{pmatrix}$   (24)

Substituting into (21) completes the proof.
The lemma shows that any set of documents, where each document is fully consistent with the specified TCat-concept, is always linearly separable with a certain minimum margin. Note that separability implies that the training loss $\sum \xi_i$ is zero. While this paper considers only the case of full consistency and zero noise, [14] shows how these assumptions can be relaxed.
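The bound (9) is a direct computation; the sketch below transcribes it and evaluates it for the example concept of Figure 4. The numeric value is simply what the formula yields for that artificial task.

    def margin_lower_bound(sets):
        # a, b, c as defined in Lemma 1
        a = sum(p * p / f for p, n, f in sets)
        b = sum(p * n / f for p, n, f in sets)
        c = sum(n * n / f for p, n, f in sets)
        return (a * c - b * b) / (a + 2 * b + c)  # lower bound on delta^2

    example = [(20, 20, 100), (4, 1, 200), (1, 4, 200), (5, 5, 600),
               (9, 1, 3000), (1, 9, 3000), (10, 10, 4000)]
    print(margin_lower_bound(example))  # roughly 0.033 for the Figure 4 concept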
It remains to bound the maximum Euclidean length $R$ of document vectors before it is possible to apply Theorem 1. Clearly, the document vector of a document with $l$ words cannot have a Euclidean length greater than $l$. Nevertheless, this bound is very loose for real document vectors. To bound the quantity $R$ more tightly, it is possible to make use of Zipf's law.
Assume that the term frequencies in every document follow the generalized Zipf's law [15]

$TF_r = \frac{c}{(r+k)^\phi}$   (25)

with typical parameter values $k \approx 5$, $\phi \approx 1.3$, and $c$ scaling with document length. This assumption about Zipf's law does not imply that a particular word occurs with a certain frequency in every document. It is much weaker; it merely implies that the $r$-th most frequent word in a document occurs with a particular frequency. In slight abuse of Zipf's law for short documents, the following lemma connects the length of the document vectors to Zipf's law. Intuitively, it states that many words in a document occur with low frequency, leading to document vectors of relatively short Euclidean length.
Lemma 2 (Length of Document Vectors). If the ranked term frequencies $TF_r$ in a document with $l$ terms have the form of the generalized Zipf's law

$TF_r = \frac{c}{(r+k)^\phi}$   (26)

based on their frequency rank $r$, then the Euclidean length of the document vector $\vec{x}$ of term frequencies is bounded by

$||\vec{x}|| \;\leq\; \sqrt{\sum_{r=1}^{d} \left(\frac{c}{(r+k)^\phi}\right)^2}$   with $d$ such that   $\sum_{r=1}^{d} \frac{c}{(r+k)^\phi} = l$
Proof. From the connection between the frequency rank of a term and its absolute frequency, it follows that the $r$-th most frequent term occurs

$TF_r = \frac{c}{(r+k)^\phi}$   (27)

times. The document vector $\vec{x}$ has $d$ non-zero entries, which are the values $TF_1, \ldots, TF_d$. Therefore, the squared Euclidean length of the document vector $\vec{x}$ is

$\vec{x}^T \vec{x} = \sum_{r=1}^{d} \left(\frac{c}{(r+k)^\phi}\right)^2$   (28)
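The bound of Lemma 2 is straightforward to evaluate numerically: accumulate Zipf-distributed term frequencies until they account for the document length $l$, then sum their squares. The parameter values below are hypothetical.

    def squared_length_bound(c, k, phi, l):
        # find d with sum_{r=1..d} c/(r+k)^phi = l, accumulating TF_r^2 along the way
        total, sq, r = 0.0, 0.0, 0
        while total < l:
            r += 1
            tf = c / (r + k) ** phi
            total += tf
            sq += tf * tf
        return sq  # upper bound on R^2 = ||x||^2

    print(squared_length_bound(c=60.0, k=5.0, phi=1.3, l=50.0))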
Combining Lemma 1 and Lemma 2 with Theorem 1 leads to the following main result.
Theorem 2 (Learnability of TCat-Concepts). For $\mathrm{TCat}([p_1\!:\!n_1\!:\!f_1], \ldots, [p_s\!:\!n_s\!:\!f_s])$-concepts and documents with $l$ terms distributed according to the generalized Zipf's law $TF_r = \frac{c}{(r+k)^\phi}$, the expected generalization error of an (unbiased) SVM after training on $n$ examples is bounded by

$E(\mathrm{Err}^n(h_{SVM})) \;\leq\; \frac{\rho R^2}{n+1} \cdot \frac{a + 2b + c}{ac - b^2}$

with $a = \sum_{i=1}^{s} \frac{p_i^2}{f_i}$, $b = \sum_{i=1}^{s} \frac{p_i n_i}{f_i}$, $c = \sum_{i=1}^{s} \frac{n_i^2}{f_i}$, and $R^2 = \sum_{r=1}^{d} \left(\frac{c}{(r+k)^\phi}\right)^2$,

unless $\forall_{i=1}^{s}: p_i = n_i$. $d$ is chosen so that $\sum_{r=1}^{d} \frac{c}{(r+k)^\phi} = l$. For unbiased SVMs $\rho$ equals 1, and for biased SVMs $\rho$ equals 2.
Proof. Using the fact that TCat-concepts are separable (and therefore stable) if for at least one $i$ the value of $p_i$ is different from $n_i$, the result from Theorem 1 reduces to

$E(\mathrm{Err}^n(h_{SVM})) \;\leq\; \frac{\rho}{n+1} E\left(\frac{R^2}{\delta^2}\right)$   (29)

since all $\xi_i$ are zero for a sufficiently large value of $C$. Lemma 1 gives a lower bound for $\delta^2$, which can be used to bound the expectation:

$E\left(\frac{R^2}{\delta^2}\right) \;\leq\; \frac{a + 2b + c}{ac - b^2}\, E(R^2)$   (30)

It remains for us to give an upper bound for $E(R^2)$. $R^2$ is the maximum squared Euclidean length of any feature vector in the training data. Since the term frequencies in each example follow the generalized Zipf's law $TF_r = \frac{c}{(r+k)^\phi}$, it is possible to use Lemma 2 to bound $R^2$ and therefore $E(R^2)$.
Empirical Validation. The TCat-model and the lemmata leading to the main result suggest that text-classification tasks are generally linearly separable (i.e. $\sum \xi_i = 0$), and that the normalized inverse margin $R^2/\delta^2$ is small. This prediction can be tested against real data.
Reuters        $R^2/\delta^2$   $\sum \xi_i$
earn                1143             0
acq                 1848             0
money-fx            1489            27
grain                585             0
crude                810             4
trade                869             9
interest            2082            33
ship                 458             0
wheat                405             2
corn                 378             0

WebKB          $R^2/\delta^2$   $\sum \xi_i$
course               519             0
faculty             1636             0
project              741             0
student             1588             0

Ohsumed        $R^2/\delta^2$   $\sum \xi_i$
Pathology          11614             0
Cardiovasc.         4387             0
Neoplasms           2868             0
Nervous Sys.        3303             0
Immunologic         2556             0

Table 2: Normalized inverse margin and training loss for the Reuters (27,658 features), the WebKB (38,359 features), and the Ohsumed data (38,679 features) for C = 50. As suggested by model-selection experiments, TFIDF weighting is used for Reuters and Ohsumed, while the representation for WebKB is binary. No stemming is performed, and stopword removal is used only on the Ohsumed data.
First, Table 2 indicates that all Ohsumed categories, all WebKB tasks, and most Reuters-21578 categories are linearly separable (i.e. $\sum \xi_i = 0$). This means that there is a hyperplane so that all positive examples are on one side of the hyperplane, while all negative examples are on the other. Inseparability on some Reuters categories is often due to dubious documents (consisting only of a headline) or obvious misclassifications by the human indexers.
Second, separability is possible with a large margin. Table 2 shows the size of the normalized inverse margin for the ten most frequent Reuters categories, the WebKB categories, and the five most frequent Ohsumed categories. Intuitively, $R^2/\delta^2$ can be treated as an "effective" number of parameters due to its link to the VC-dimension [18]. Compared to the dimensionality of the feature space, the normalized inverse margin is typically small.
These experimental findings, in connection with the theoretical results from above, validate that TCat-concepts do capture an important and widely present property of text-classification tasks.
5. COMPARING THE THEORETICAL MODEL WITH EXPERIMENTAL RESULTS
The previous sections formally describe how a large expected margin with low training error leads to a low expected prediction error. Furthermore, they indicate how the margin is related to the properties of TCat-concepts, and experimentally verify that real text-classification tasks can be modeled with TCat-concepts. This section verifies not only that the individual steps are well justified, but also that their conjunction produces meaningful results. To show this, this section compares the generalization performance as predicted by the model with the generalization performance found in experiments.
In Section 4.2, a TCat-concept for the WebKB category "course" was estimated. Furthermore, the parameters of Zipf's law for the full WebKB collection are $c = 470{,}000$, $k = 5$, and $\phi = 1.25$. Subject to the assumptions of Theorem 2, substituting the estimated values into the bound leads to the following characterization of the expected error:

$E(\mathrm{Err}^n(h_{SVM})) \;\leq\; \frac{0.2331 \cdot 1899.7}{n+1} \;\leq\; \frac{443}{n+1}$   (31)

Here $n$ denotes the number of training examples. Consequently, after training on 3,957 examples, the model predicts an expected generalization error of less than 11.2%.

                       $E(\mathrm{Err}^n(h_{SVM}))$   $\mathrm{Err}^n_{test}(h_{SVM})$
WebKB "course"                11.2%                          4.4%
Reuters "earn"                 1.5%                          1.3%
Ohsumed "pathology"           94.5%                         23.1%

Table 3: Comparing the expected error predicted by the model with the error rate on the test set for the WebKB category "course", the Reuters category "earn", and the Ohsumed category "pathology", with TF weighting and C = 1000. No stopword removal and no stemming are used.
An analogous procedure for the Reuters category "earn" leads to the bound

$E(\mathrm{Err}^n(h_{SVM})) \;\leq\; \frac{0.1802 \cdot 762.9}{n+1} \;\leq\; \frac{138}{n+1}$   (32)

so that the expected generalization error after 9,603 training examples is less than 1.5%. Similarly, the bound for the Ohsumed category "pathology" is

$E(\mathrm{Err}^n(h_{SVM})) \;\leq\; \frac{7.4123 \cdot 1275.8}{n+1} \;\leq\; \frac{9457}{n+1}$   (33)

leading to an expected generalization error of less than 94.5% after 10,000 training examples.
Table 3 compares the expected generalization error predicted by the estimated models with the generalization performance observed in experiments. While it is unreasonable to expect that the model precisely predicts the exact performance observed on the test set, Table 3 shows that the model captures which classification tasks are more difficult than others. In particular, it does correctly predict that "earn" is the easiest task, "course" is the second easiest task, and that "pathology" is the most difficult one. While the TCat model is probably not detailed enough to be suitable for performance estimation in most application settings (e.g. [13]), this gives some validation that TCat-concepts can formalize the key properties of text-classification tasks relevant for learnability with SVMs. More can be found in [14].
6. SENSITIVITY ANALYSIS: DIFFICULT AND EASY LEARNING TASKS
The previous section revealed that the bound on the expected generalization error can be large for some TCat-concepts, while it is small for others. Going through different scenarios, it is now possible to identify the key properties that make a text-classification task "easy" or "difficult" for an SVM to learn [14].

Occurrence Frequency: Given that the other parameters stay constant, the bound on the error rate decreases if the frequency of the discriminative features is increased.

Discriminative Power of Term Sets: The extent to which vocabulary differs between classes makes a difference for learnability. The value of the bound decreases if the difference in class-conditional word frequencies increases.

Level of Redundancy: The higher the redundancy, the lower the bound on the generalization error. This implies that it is desirable to have many clues in each document.

Similarly, the model can be used to analyse the effect of TFIDF weighting on the effectiveness of SVMs depending on the properties of the task [14].
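These effects can be probed numerically with the tcat_error_bound sketch from Section 4, comparing hypothetical variants of the Figure 4 concept; the variants below are made up to isolate one property at a time.

    variants = {
        "base":           [(20, 20, 100), (4, 1, 200), (1, 4, 200), (5, 5, 600),
                           (9, 1, 3000), (1, 9, 3000), (10, 10, 4000)],
        # sharper class-conditional difference in the medium-frequency range
        "sharper":        [(20, 20, 100), (5, 0, 200), (0, 5, 200), (4, 4, 600),
                           (9, 1, 3000), (1, 9, 3000), (10, 10, 4000)],
        # fewer class-indicative occurrences per document (less redundancy)
        "less_redundant": [(20, 20, 100), (2, 1, 200), (1, 2, 200), (5, 5, 600),
                           (5, 1, 3000), (1, 5, 3000), (16, 16, 4000)],
    }
    for name, sets in variants.items():
        print(name, tcat_error_bound(sets, c=60.0, k=5.0, phi=1.3, l=50.0, n=1000))
    # the bound drops for "sharper" and rises sharply for "less_redundant"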
7. LIMITATIONS OF THE MODEL AND OPEN QUESTIONS
Every model abstracts from reality in some sense, and it is important to point out the assumptions clearly.
First, each document is assumed to exactly follow the same generalized Zipf's law, neglecting variance and discretization inaccuracies that occur especially for short documents. In particular, this implies that all documents are of equal length.
Second, the model fixes the number of occurrences from each word set in the TCat-model. While the degree of violation of this assumption can be captured in terms of attribute noise, it might be useful and possible not to specify the exact number of occurrences per word set, but only upper and lower bounds. This could make the model more accurate. However, it comes at the cost of an increased number of parameters, making the model less understandable. While the formal analysis of noise in [14] demonstrates that the model does not break in the presence of noise, the bounds could be tightened. Along the same lines, parametric noise models could be incorporated to model the types of noise in text-classification problems.
Finally, the general approach taken in this paper is to model only upper bounds on the error rate. While these are important to derive sufficient conditions for the learnability of text-classification tasks, lower bounds may be of interest as well. They could answer the question of which text-classification tasks cannot be learned with SVMs.
8. RELATED WORK
While other learning algorithms can also be analyzed in terms of formal models, these models typically make assumptions that are unjustified for text.
The most popular such algorithm is naive Bayes. Naive Bayes is commonly justified using assumptions of conditional independence or linked dependence [3]. However, these assumptions are generally accepted to be false for text. While more complex dependence models can somewhat reduce the degree of violation [17], a principal problem with using generative models for text remains: finding a generative model for natural language appears much more difficult than solving a text-classification task. Therefore, this paper presented a discriminative model of text classification. It models the distribution of words only to the extent necessary to describe classification accuracy. This way it is possible to avoid false independence assumptions.
Another model used to describe the properties of text is the 2-Poisson model [1]. However, like the Bernoulli model, it is rejected by tests [8, 9]. Description-oriented approaches [7][6][5] provide powerful modeling tools and can avoid high-dimensional feature spaces, but require implicit assumptions in the way description vectors are generated.
While different in its motivation and its goal, the work of Papadimitriou et al. is most similar in spirit to the approach presented here [16]. They show that latent semantic indexing leads to a suitable low-dimensional representation, given assumptions about the distribution of words. These assumptions are similar in how they exploit the difference of word distributions. However, they do not show how their assumptions relate to the statistical properties of text, and they do not derive generalization-error bounds.
9. SUMMARY AND CONCLUSIONS
This paper develops the first model of learning text classifiers from examples that makes it possible to quantitatively connect the statistical properties of text with the generalization performance of the learner. The model is the result of taking a discriminative approach. Unlike conventional generative models, it does not involve independence assumptions. The discriminative model focuses on those properties of text-classification tasks that are sufficient for good generalization performance, avoiding much of the complexity of natural language.
Based on this discriminative model, the paper explains how SVMs can achieve good classification performance despite the high-dimensional feature spaces in text classification. The resulting bounds on the expected generalization error give a formal understanding of what kind of text-classification task can be solved with SVMs. This makes it possible to identify that, intuitively, high redundancy, high discriminative power of term sets, and discriminative features in the high-frequency range are sufficient conditions for good generalization. Finally, the model provides a formal basis for developing new algorithms that are most appropriate in specific scenarios.
10. REFERENCES
[1] A. Bookstein and D. R. Swanson. Probabilistic models for automated indexing. Journal of the American Society for Information Science, 25(5):312-318, 1974.
[2] C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
[3] W. Cooper. Some inconsistencies and misnomers in probabilistic information retrieval. In International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 57-61, 1991.
[4] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of ACM-CIKM98, November 1998.
[5] N. Fuhr, S. Hartmann, G. Lustig, M. Schwantner, K. Tzeras, and G. Knorz. AIR/X - a rule-based multistage indexing system for large subject fields. In RIAO, pages 606-623, 1991.
[6] N. Fuhr and G. Knorz. Retrieval test evaluation of a rule based automatic indexing (AIR/PHYS). In C. van Rijsbergen, editor, Research and Development in Information Retrieval: Proceedings of the Third Joint BCS and ACM Symposium, pages 391-408. Cambridge University Press, July 1984.
[7] N. Goevert, M. Lalmas, and N. Fuhr. A probabilistic description-oriented approach for categorising Web documents. In Proceedings of CIKM-99, 8th ACM International Conference on Information and Knowledge Management, pages 475-482, Kansas City, US, 1999. ACM Press, New York, US.
[8] S. P. Harter. A probabilistic approach to automated keyword indexing. Part I: On the distribution of specialty words in a technical literature. Journal of the American Society for Information Science, 26(4):197-206, 1975.
[9] S. P. Harter. A probabilistic approach to automated keyword indexing. Part II: An algorithm for probabilistic indexing. Journal of the American Society for Information Science, 26(5):280-289, 1975.
[10] T. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Conference on AI and Statistics, 1999.
[11] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning, pages 137-142, Berlin, 1998. Springer.
[12] T. Joachims. Making large-scale SVM learning practical. In B. Schoelkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11. MIT Press, Cambridge, MA, 1999.
[13] T. Joachims. Estimating the generalization performance of a SVM efficiently. In Proceedings of the International Conference on Machine Learning, San Francisco, 2000. Morgan Kaufmann.
[14] T. Joachims. The Maximum-Margin Approach to Learning Text Classifiers: Methods, Theory, and Algorithms. PhD thesis, Universitaet Dortmund, 2001. Kluwer, to appear.
[15] B. Mandelbrot. A note on a class of skew distribution functions: Analysis and critique of a paper by H. A. Simon. Information and Control, 2(1):90-99, Apr. 1959.
[16] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: A probabilistic analysis. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS '98), Seattle, Washington, June 1998, pages 159-168. ACM Press, New York, 1998.
[17] M. Sahami. Using Machine Learning to Improve Information Access. PhD thesis, Stanford University, 1998.
[18] V. Vapnik. Statistical Learning Theory. Wiley, Chichester, GB, 1998.
[19] G. K. Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley, Cambridge, MA, USA, 1949.