Visualization of anomaly detection using prediction sensitivity

Pavel Laskov¹, Konrad Rieck¹, Christin Schäfer¹ and Klaus-Robert Müller¹,²

¹ Fraunhofer-FIRST.IDA, Kekuléstr. 7, 12489 Berlin, Germany
² University of Potsdam, Am Neuen Palais 10, 14469 Potsdam, Germany

{laskov,rieck,christin,klaus}@first.fhg.de

Abstract: Visualization of learning-based intrusion detection methods is a challenging problem. In this paper we propose a novel method for visualization of anomaly detection and feature selection, based on prediction sensitivity. The method allows an expert to discover informative features for separation of normal and attack instances. Experiments performed on the KDD Cup dataset show that explanations provided by prediction sensitivity reveal the nature of attacks. Application of prediction sensitivity for feature selection yields a major improvement of detection accuracy.

1 Introduction

Transparency is an essential requirement for intrusion detection algorithms to be used in practice. It does not suffice that an algorithm tells – perhaps with a degree of uncertainty – if some attack (or a specific attack) is present; an algorithm must be able to provide credible evidence for its prediction.

While such evidence is easy to produce for rule-based detection methods, whose rules are understandable to an expert, such credibility cannot be claimed by many approaches using learning-based methods, such as Neural Networks or Support Vector Machines [GSS99, MJS02]. The situation is somewhat better for misuse detection methods, for which several feature selection techniques are available, e.g. [WMC+01, GE03]. The problem is much graver for anomaly detection methods, for which almost no practical feature selection techniques are known to date.

In this contribution we propose a technique that enables one to visualize predictions of the quarter-sphere SVM, an anomaly detection technique proposed in [LSK04, LSKM04].

The technique is based on the notion of prediction sensitivity, which measures the degree to which a prediction is affected by adding weight to a particular feature. Using this technique we were able to gain interesting information about the predictions made by the quarter-sphere SVM on the KDD Cup dataset. The information we obtained is comparable but not identical to rules inferred by RIPPER, a classical rule-based method [Coh95].


By averaging prediction sensitivity over several datasets one can select the features that are most important for anomaly detection. In our experiments on the KDD Cup dataset we have observed that reducing the set of features to the ones suggested by prediction sensitivity remarkably improves the accuracy of detection by the quarter-sphere SVM.

2 Approach: analysis of prediction sensitivity

The notion of prediction sensitivity expresses the degree to which a prediction is affected by adding weight to individual features. Mathematically this can be described by the Jacobian matrix of the prediction function with respect to the input features. The derivation of the expression for this Jacobian matrix – which depends on a particular anomaly detection method, in our case the quarter-sphere SVM – is rather technical; therefore, due to space limitations, only the main idea is presented in this section. The mathematical details will be the subject of a forthcoming publication.

Let X be a d×l data matrix containing d features collected over l observations. We assume that an anomaly detection algorithm assigns the anomaly score s(x_i) to every data point x_i ∈ X (a column in the data matrix). The l×d Jacobian matrix is defined as the partial derivatives of s with respect to the components of x_i:

    J_(ik) = ∂s(x_i) / ∂x_k,   1 ≤ i ≤ l,  1 ≤ k ≤ d.   (1)

For the sake of more intuitive visualization we will always consider the transposed Jacobian matrix J^T, whose dimensions are identical to those of the initial matrix X. Thus, each column of the (transposed) Jacobian matrix can be seen as the sensitivity of the prediction s(x_i) of the algorithm on the data point x_i with respect to the k-th feature of the data. The definition of s(x_i) for the quarter-sphere SVM used in this paper is given in Eq. (4) in Sec. 3.
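As a sketch, the Jacobian of Eq. (1) can be approximated numerically by finite differences for any score function; the helper below is illustrative only (it is not the closed-form derivation the paper defers to a forthcoming publication) and, for convenience, stores observations in rows rather than columns.

```python
import numpy as np

def prediction_sensitivity(score_fn, X, eps=1e-4):
    """Numerically estimate the l-by-d Jacobian J[i, k] = ds(x_i)/dx_k.

    score_fn : callable mapping an (l, d) data matrix to l anomaly scores.
    X        : (l, d) data matrix, observations in rows for convenience.
    """
    l, d = X.shape
    J = np.zeros((l, d))
    base = score_fn(X)
    for k in range(d):
        Xp = X.copy()
        Xp[:, k] += eps                       # add a small weight to feature k
        J[:, k] = (score_fn(Xp) - base) / eps
    return J
```

For a score function whose gradient is known analytically (e.g. the squared norm), the numerical estimate can be checked directly against the exact derivative.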

Further information can be gained by considering statistical properties of prediction sensitivity. To perform such analysis, randomly drawn data samples X_1, ..., X_N are collected, in which the percentage of attacks is fixed. Once the data samples are collected, one computes the mean and the standard deviation of the respective Jacobian matrices over the N samples. Based on this information, heuristic criteria can be defined (cf. Sec. 5) for selecting informative features for separating attacks and normal patterns.

3 Application: anomaly detection using quarter-sphere SVM

The quarter-sphere SVM [LSK04, LSKM04] is an anomaly detection technique based on the idea of fitting a sphere onto the center of mass of the data. Once the center of the sphere is fixed, the distance of points from the center defines the anomaly score. Choosing a threshold for the attack scores determines the radius of the sphere encompassing the normal data.


This geometric model can be extended for non-linear surfaces. We first apply some non-linear mapping Φ to the original features. Then, for each data point, the distance from the center of mass in the transformed space – which is our score function – is computed as:

    s(x_i) = || Φ(x_i) − (1/l) Σ_{j=1}^{l} Φ(x_j) ||.   (2)

It remains to be shown how the score function (2) can be obtained without explicitly computing the mapping Φ, since the latter can map the data into a high- or even infinite-dimensional space.

It is well known in the machine learning literature (e.g. [MMR+01, SS02]) that, under some technical assumptions, inner products between images of data points under a non-linear transformation can be computed by an appropriate kernel function:

    k(x_i, x_j) = Φ(x_i)^T Φ(x_j).

For many interesting transformations the kernel function is known in advance and is easy to compute. For example, for the space of radial-basis functions (RBF) the kernel function is computed as

    k(x_i, x_j) = e^(−||x_i − x_j||²).

To compute the score function s(x_i) using the kernel function, the following steps are needed:

1. Form the l×l kernel matrix K whose entries are the values of the kernel function k(x_i, x_j) for all pairs of data points i and j.

2. Compute the centered kernel matrix [SSM98, SMB+99]:

    K̃ = K − 1_l K − K 1_l + 1_l K 1_l,   (3)

where 1_l is an l×l matrix with all values equal to 1/l.

3. The score function is given by the entries on the main diagonal of the centered kernel matrix:

    s(x_i) = K̃_(ii).   (4)
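The three steps above can be sketched in a few lines; the RBF width parameter `gamma` is our own illustrative addition, and the score is read off the diagonal of the centered kernel matrix as in Eq. (4).

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Step 1: K[i, j] = exp(-gamma * ||x_i - x_j||^2) for rows x_i of X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def anomaly_scores(K):
    """Steps 2-3: center the kernel matrix, read scores off its diagonal."""
    l = K.shape[0]
    one = np.full((l, l), 1.0 / l)           # the matrix 1_l with entries 1/l
    K_c = K - one @ K - K @ one + one @ K @ one
    return np.diag(K_c)                      # s(x_i) = K~_(ii)
```

Since K̃ is itself a (centered) Gram matrix, its diagonal is non-negative, and points far from the center of mass receive the largest scores.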

4 Experimental setup

Before presenting the operation of our visualization technique, a few remarks need to be made on data preprocessing. In our experiments we use the KDD Cup 1999 dataset [Cup99], a standard dataset for the evaluation of data mining techniques. The set comprises a fixed set of connection-based features computed from the DARPA 1998 IDS evaluation [LCF+99] and contains 4898430 records, of which 3925650 are attacks. A list of all features is provided in [LS01, LSKM04]. An in-depth description of some features, e.g. the hot feature, is available in the Bro IDS documentation [Pax98, Pax04].


The distribution of attacks in the KDD Cup dataset is extremely unbalanced. Some attacks are represented with only a few examples, e.g. the phf and ftp_write attacks, whereas the smurf and neptune attacks cover millions of records. In general, the distribution of attacks is dominated by probes and denial-of-service attacks; the most interesting – and dangerous – attacks, such as compromises, are grossly under-represented.

In order to cope with the unbalanced attack distribution and to investigate the characteristic features of particular attacks, we construct separate datasets containing a fixed attack ratio of 5%. The desired ratio is achieved by combining two randomly drawn sub-samples. The first sub-sample is drawn from the attacks in question. If an attack is under-represented, i.e. there are too few samples to carry out random sampling, all attack examples are drawn. The second sub-sample is drawn randomly from normal data matching the services used in the chosen attack. The number of examples in both sub-samples is chosen so as to attain the desired attack ratio.

In order to analyze the statistical properties of prediction sensitivity, as indicated in Sec. 2, 10 datasets of 1000 data points are generated for each attack. If the number of available attacks in the data is smaller than 50 (required to have 5% of attacks in datasets of size 1000), we reduce the dataset size to L < 1000, sufficient to accommodate all available attacks, and increase the number of generated datasets by the factor of 1000/L.
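A minimal sketch of this sub-sampling scheme follows; the function name and the array-based interface are our own, for illustration, and `normals` is assumed to be pre-filtered to the services used by the attack in question.

```python
import numpy as np

def make_dataset(attacks, normals, size=1000, attack_ratio=0.05, rng=None):
    """Draw a mixed sample with a fixed attack ratio (cf. Sec. 4).

    If the attack is under-represented, all attack examples are taken and
    the dataset is shrunk so the ratio is preserved.
    """
    rng = rng if rng is not None else np.random.default_rng()
    n_att = int(round(size * attack_ratio))
    if len(attacks) < n_att:          # too few attacks: take all of them,
        n_att = len(attacks)          # reduce the dataset size accordingly
        size = int(round(n_att / attack_ratio))
    att = attacks[rng.choice(len(attacks), n_att, replace=False)]
    nor = normals[rng.choice(len(normals), size - n_att, replace=False)]
    return np.vstack([att, nor])
```

In the paper's setup, the number of generated datasets is then increased by a factor of 1000/L whenever the dataset size had to be reduced to L.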

After the sub-sampled datasets are generated, a data-dependent normalization [EAP+02] is computed, a quarter-sphere SVM is applied to each dataset, and the corresponding Jacobian matrices (cf. Eq. (1)) are calculated.

5 Interpretation of anomaly detection on the KDD Cup dataset

The proposed prediction sensitivity criterion can be visualized by plotting the Jacobian matrix. If multiple training sets are available, the mean and the standard deviation Jacobian matrices are plotted. The rows of the matrices correspond to features and the columns correspond to normal and attack instances.

An example of such visualization for the land attack is shown in Fig. 1. The following observations can be inferred from the prediction sensitivity matrices:

– Random sampling and averaging of prediction sensitivities emphasize the salient features of the data. As a result, instances corresponding to a particular attack are characterized by consistent regions in the mean Jacobian matrix, whereas the much more heterogeneous normal data exhibits random sensitivity.

– The consistency of prediction sensitivity for attack instances can be quantified by the standard deviation Jacobian matrix. Salient features exhibit low standard deviation. Thus one can suggest the following heuristic criterion for feature selection: for attack instances, features must have high values in the mean and low values in the standard deviation Jacobian matrix.
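The "high mean / low standard deviation" heuristic can be encoded, for example, as a ranking by the ratio of the two statistics over the attack columns; the exact scoring rule below is our own illustrative choice, as the paper states the criterion only qualitatively.

```python
import numpy as np

def select_features(jacobians, attack_mask, k=4):
    """Rank features by high mean / low standard deviation of prediction
    sensitivity over the attack instances (cf. Sec. 5).

    jacobians   : (N, l, d) stack of Jacobian matrices from N datasets.
    attack_mask : boolean (l,) mask marking the attack instances.
    """
    J = np.asarray(jacobians)[:, attack_mask, :]  # keep attack columns only
    mean = np.abs(J.mean(axis=(0, 1)))            # per-feature mean sensitivity
    std = J.std(axis=(0, 1)) + 1e-12              # avoid division by zero
    return np.argsort(-(mean / std))[:k]          # top-k by mean/std ratio
```

A feature with consistently large sensitivity across datasets dominates the ranking, while noisy features are suppressed by their large standard deviation.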



Figure 1: Visualization of prediction sensitivity. The mean and the standard deviation Jacobian matrices for the land attack exhibit different patterns for attack and normal data, as well as different impact of particular features on prediction. The grey-scale bars to the right of the figure illustrate the range of matrix values.

In order to illustrate feature selection based on the proposed criterion, we calculate the mean and the standard deviation of the mean Jacobian matrix for the attack instances only. These quantities computed for the land attack are shown in Fig. 2. One can see that the numerical characteristics of prediction sensitivity provide substantial information for identifying candidate features. According to the "high mean/low variance" criterion, most prominent for this example are the features 38, 39, 40, 45. Their names and brief descriptions are shown in Table 1. These features are indeed meaningful for the land attack. This attack is manifested in transmission of single TCP packets (with SYN set) that crash a server without eliciting an ACK reply; as a result, high SYN error rates are observed. The features 48, 50, 52, 53 may also be added as second-choice candidates.

Number  Name                Description
38      srv_count           Number of connections to service
39      serror_rate         SYN error rate
40      srv_serror_rate     SYN error rate for service
45      srv_diff_host_rate  SYN error rate for service on multiple hosts

Table 1: Feature subset selected using prediction sensitivity.



Figure 2: Mean and standard deviation of the mean Jacobian matrix for instances of the land attack. According to the "high mean/low variance" criterion, a subset of features and additional candidate features have been selected.

We have performed the interpretation and analysis of all 21 attacks present in the KDD Cup dataset. Due to space constraints we cannot present the detailed analysis here; so we restrict ourselves to 5 characteristic attacks which demonstrate the strengths as well as the limitations of the proposed visualization technique.

For each of the attack classes remote-to-local (R2L), user-to-root (U2R) and probe, one attack was arbitrarily selected. For the class of denial-of-service (DoS) attacks we decided to interpret two attacks which differ in activity. The following attacks were chosen:

– the phf (R2L) attack exploits a security flaw in the input handling of CGI scripts which allows the execution of local commands on a remote web server,

– the loadmodule (U2R) attack exploits an improper boundary check in the program loadmodule of the Solaris operating system and allows a local attacker to gain super-user privileges,

– the portsweep (probe) attack discovers active services on remote hosts by systematically requesting connections to multiple TCP ports,

– the pod (DoS) attack crashes or reboots remote systems by sending a single, over-sized IP datagram corrupting the host's packet reassembly,

– the smurf (DoS) attack uses misconfigured broadcast hosts to flood a victim host with spoofed ICMP datagrams.

In order to qualitatively compare the proposed feature selection method with alternative techniques, we applied the RIPPER classifier to our datasets, in a similar way as it was previously used in [LSM99, LS01] for feature analysis and generation of detection rules. Table 2 lists the selected features based on prediction sensitivity and the corresponding RIPPER rule sets for the five example attacks.


phf: Feature selection based on prediction sensitivity:

hot, num_access_files, duration

RIPPER ruleset:

phf :- root_shell>=1, src_bytes<=51.

loadmodule: Feature selection based on prediction sensitivity:

dst_host_same_srv_rate, dst_host_diff_srv_rate, dst_host_same_src_port_rate

RIPPER ruleset:

loadmodule :- dst_host_count<=6, src_bytes<=0, count>=2.

loadmodule :- dst_host_count<=6, num_file_creations>=1, duration<=103.

portsweep: Feature selection based on prediction sensitivity:

rerror_rate, srv_rerror_rate, dst_host_rerror_rate, dst_host_srv_rerror_rate

RIPPER ruleset:

portsweep :- dst_host_srv_rerror_rate>=1, dst_host_same_srv_rate<=0.01, dst_host_same_src_port_rate>=0.02.

portsweep :- src_bytes<=1, dst_host_same_srv_rate<=0.02, dst_host_same_src_port_rate>=0.03.

portsweep :- rerror_rate>=0.19, dst_host_same_srv_rate<=0.8, dst_host_same_src_port_rate>=0.08,

dst_host_count>=78, protocol_type=tcp.

portsweep :- src_bytes<=0, service=private.

portsweep :- src_bytes<=8, protocol_type=icmp.

portsweep :- src_bytes<=0, service=ftp_data, dst_bytes<=0.

portsweep :- duration>=42908.

portsweep :- dst_host_rerror_rate>=0.95, dst_host_diff_srv_rate>=0.47.

portsweep :- flag=OTH, service=smtp.

pod: Feature selection based on prediction sensitivity:

src_bytes, wrong_fragment

RIPPER ruleset:

pod :- src_bytes>=564.

smurf: Feature selection based on prediction sensitivity:

count, src_count, src_bytes

RIPPER ruleset:

normal :- src_bytes<=64.

Table 2: Feature selection based on prediction sensitivity and RIPPER rule sets for selected attacks


Two questions arise from Table 2: How are the selected features related to the nature of attacks, and why do features extracted by RIPPER and prediction sensitivity differ?

– For the phf attack the selected features indicate malicious activity accessing system files, e.g. /etc/passwd, and an anomalous connection duration. These features match the typical application pattern of the phf attack, in which system files are retrieved by a short HTTP GET request. The corresponding RIPPER rule set reveals the problem of overfitting. The rules match specific properties of the training sets, but do not identify the general properties of the attack in question.

– The loadmodule attack belongs to the class of U2R attacks and thus evidence of the attack is only present in content-based features. The selected features and the RIPPER rules mainly contain traffic-based features. Both methods fail to select the relevant features because no content-based features clearly reflect the presence of the loadmodule attack.

– For the portsweep probe the prediction sensitivity reveals features related to rejection errors, e.g. rerror_rate. A side effect of vanilla portscans, as in the case of portsweep, is a very high number of rejected connection requests, because only few services are present on most network hosts. The RIPPER rule set is too complex for realistic application. Furthermore, most rules involve the service and protocol features, which are not inherent properties of the portsweep attack.

– The selected features for the pod attack indicate an influence of the number of transmitted bytes and the presence of wrong fragments, which are very characteristic for the ping-of-death (pod) attack. The RIPPER rule is, however, too specific: there is no reason to believe that 564 bytes is a good threshold between normal data and the pod attack.

– The smurf attack is represented by traffic-based features, such as count and srv_count. The attack involves tremendous traffic from various spoofed sources. The selected feature set matches the smurf attack, but also contains generalization which applies to successor attacks, e.g. fraggle. The RIPPER rule exhibits similar overfitting as for the pod attack.

One can see that, provided relevant features are present in the data, both RIPPER and prediction sensitivity succeed in selecting an informative subset of features. However, the RIPPER classifier is prone to overfitting, and the inferred rules often lack the necessary generality, which undermines the main advantage of rule-based learning: understandable rules. Feature selection based on prediction sensitivity is more accurate and exhibits good generalization ability. Another difference between the two techniques is that prediction sensitivity determines a threshold for a combination of features rather than for single features.


6 Improvement of detection by feature selection

As was shown in the previous section, the prediction sensitivity criterion allows one to select an informative subset of features characterizing single attacks. Although we used labels for feature selection, the underlying concept behind the notion of prediction sensitivity is anomaly detection in unlabeled data. In this section we demonstrate that feature selection based on our criterion improves the accuracy of the quarter-sphere SVM, an unsupervised anomaly detection algorithm.

The experiments presented below were carried out under two scenarios. First we selected features for single attacks and applied a quarter-sphere SVM on the reduced feature sets. In the second experiment, the datasets – for feature selection as well as for anomaly detection – were composed of multiple attacks (ab)using the same service: FTP, HTTP and SMTP. The objective of both experiments is to investigate whether pre-selection of features improves detection accuracy compared to the full set of features. In all experiments, unseen data was used for the evaluation of feature selection in order to ensure that the selection generalizes beyond the particular datasets.

The impact of feature selection on the accuracy of anomaly detection by the quarter-sphere SVM is shown in Fig. 3. The evaluation criterion is the area under the ROC curve restricted to the low false-positive interval [0, 0.1] (AUC0.1). The area is multiplied by a factor of 10; this allows one to interpret AUC0.1 as a percentage of the maximum attainable area on the desired interval of interest.
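As a sketch, AUC0.1 can be computed by integrating the empirical ROC curve up to the false-positive cutoff and rescaling; the linear interpolation at the cutoff is our own simplification.

```python
import numpy as np

def auc_restricted(scores, labels, fp_max=0.1):
    """Area under the ROC curve on the false-positive interval [0, fp_max],
    rescaled by 1/fp_max so the maximum attainable value is 1."""
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(labels)[order].astype(float)
    n_pos = max(y.sum(), 1.0)
    n_neg = max(len(y) - y.sum(), 1.0)
    tp = np.concatenate([[0.0], np.cumsum(y) / n_pos])
    fp = np.concatenate([[0.0], np.cumsum(1.0 - y) / n_neg])
    tp_cut = np.interp(fp_max, fp, tp)       # ROC height at the cutoff
    keep = fp <= fp_max
    fps = np.concatenate([fp[keep], [fp_max]])
    tps = np.concatenate([tp[keep], [tp_cut]])
    # trapezoidal rule over the retained segment of the ROC curve
    area = np.sum((fps[1:] - fps[:-1]) * (tps[1:] + tps[:-1]) / 2.0)
    return float(area / fp_max)
```

A perfect ranking (all attacks scored above all normal records) attains 1.0, while a ranking that places all attacks last attains 0.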

It can be seen from Fig. 3 that reducing the features according to prediction sensitivity provides a major improvement of the AUC0.1 values. For no attack does the AUC0.1 decrease after the feature selection. These results are very promising, since detection accuracy at low false-positive rates is extremely important in IDS.

The full ROC curves for four attacks analyzed in Sec. 5 are shown in Fig. 4. The ROC curve for the pod attack was almost perfect before feature selection and thus is not shown in Fig. 4.

7 Discussion and conclusions

We have presented a new technique for visualization of anomaly detection based on prediction sensitivity. Its application enables an expert (a) to interpret the predictions made by anomaly detection and (b) to select informative features in order to improve detection accuracy.

Our experiments were conducted using the quarter-sphere SVM and the KDD Cup dataset. The features highlighted by prediction sensitivity reasonably reveal the nature of attacks present in this dataset and, furthermore, exhibit more generality than the rules suggested by RIPPER, a rule-based learning method.



Figure 3: Anomaly detection accuracy before and after feature selection. The left part shows experiments with single attack datasets. The right part corresponds to experiments with FTP, HTTP and SMTP datasets.

The feature selection experiments showed major improvements of the accuracy of anomaly detection on a specific subset of features chosen by prediction sensitivity, which confirms the explanatory power of prediction sensitivity.

How can the proposed technique be useful in practice? It is true that the experimental setup presented in Sec. 4 is not fully unsupervised. One cannot, as one would like to, simply feed the data into the algorithm and obtain the explanations of predictions and the set of informative features. On the other hand, the label information is needed anyway for testing of intrusion detection systems: nobody would venture to deploy an IDS without ever wondering if it works right. At this point, using our technique, one can utilize the available label information to look beyond the bare accuracy metrics and obtain insights into why the anomaly detection produces the results it is producing and what can be done to improve it. Although labels are used for feature selection, no explicit training is required, and in this sense the procedure remains unsupervised. Furthermore, the explanatory information provided by prediction sensitivity can be particularly useful as a first guidance for the development of signatures for unknown attacks.

8 Acknowledgements

The authors gratefully acknowledge the funding from Bundesministerium für Bildung und Forschung under the project MIND (FKZ 01-SC40A), and from Deutsche Forschungsgemeinschaft under the project MU 987/2-1. We would like to thank Sebastian Mika and Stefan Harmeling for fruitful discussions, and the anonymous reviewers whose suggestions helped to improve the quality of presentation.



Figure 4: Full ROC curves for the attacks phf, portsweep, smurf and loadmodule after feature selection.

References

[Coh95] W. Cohen. Fast Effective Rule Induction. In Proc. of the 12th International Conference on Machine Learning, 1995. http://wcohen.com/postscript/ml-95-ripper.ps.

[Cup99] KDD Cup, 1999. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.

[EAP+02] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. Applications of Data Mining in Computer Security, chapter A geometric framework for unsupervised anomaly detection: detecting intrusions in unlabeled data. Kluwer, 2002.

[GE03] I. Guyon and A. Elisseeff. An Introduction to Variable and Feature Selection. JMLR, 3:1157–1182, 2003.

[GSS99] A. K. Ghosh, A. Schwartzbard, and M. Schatz. Learning Program Behavior Profiles for Intrusion Detection. In Proc. of the 1st USENIX Workshop on Intrusion Detection and Network Monitoring, pages 51–62, Santa Clara, USA, April 1999. http://www.cigital.com/papers/download/usenix id99.pdf.

[LCF+99] R. Lippmann, R. K. Cunningham, D. J. Fried, K. R. Kendall, S. E. Webster, and M. A. Zissman. Results of the DARPA 1998 Offline Intrusion Detection Evaluation. In Proc. RAID 1999, 1999. http://www.ll.mit.edu/IST/ideval/pubs/1999/RAID 1999a.pdf.


[LS01] W. Lee and S. Stolfo. A Framework for Constructing Features and Models for Intrusion Detection Systems. ACM Transactions on Information and System Security, volume 3, pages 227–261, November 2001.

[LSK04] P. Laskov, C. Schäfer, and I. Kotenko. Intrusion detection in unlabeled data with quarter-sphere Support Vector Machines. In Proc. DIMVA, pages 71–82, 2004.

[LSKM04] P. Laskov, C. Schäfer, I. Kotenko, and K.-R. Müller. Intrusion detection in unlabeled data with quarter-sphere Support Vector Machines (Extended Version). Praxis der Informationsverarbeitung und Kommunikation, 27:228–236, 2004.

[LSM99] W. Lee, S. Stolfo, and K. Mok. A data mining framework for building intrusion detection models. In Proc. IEEE Symposium on Security and Privacy, pages 120–132, 1999.

[MJS02] S. Mukkamala, G. Janoski, and A. Sung. Intrusion Detection using Neural Networks and Support Vector Machines. In Proceedings of the IEEE International Joint Conference on Neural Networks, pages 1702–1707, May 2002.

[MMR+01] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An Introduction to Kernel-Based Learning Algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, 2001.

[Pax98] V. Paxson. Bro: a system for detecting network intruders in real-time. In Proc. USENIX Security Symposium, pages 31–51, 1998.

[Pax04] V. Paxson. The Bro 0.8 User Manual. Lawrence Berkeley National Laboratory and ICSI Center for Internet Research, 2004.

[SMB+99] B. Schölkopf, S. Mika, C.J.C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A.J. Smola. Input Space vs. Feature Space in Kernel-Based Methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, September 1999.

[SS02] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[SSM98] B. Schölkopf, A.J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.

[WMC+01] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature Selection for SVMs. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 668–674. MIT Press, 2001.
