
Visual Analysis of Network Traffic

Interactive Monitoring, Detection, and Interpretation of Security Threats

Dissertation zur Erlangung des akademischen Grades des Doktors der Naturwissenschaften an der Universität Konstanz

im Fachbereich Informatik und Informationswissenschaft

vorgelegt von

Florian Mansmann

Universität Konstanz

Juli 2008

Tag der mündlichen Prüfung: 13. Juni 2008

Referenten:
Prof. Dr. Daniel A. Keim, Universität Konstanz
Dr. Stephen C. North, AT&T Shannon Labs, USA
Prof. Dr. Marcel Waldvogel, Universität Konstanz

Konstanzer Online-Publikations-System (KOPS) URL: http://www.ub.uni-konstanz.de/kops/volltexte/2008/5958/

URN: http://nbn-resolving.de/urn:nbn:de:bsz:352-opus-59589


Acknowledgment

I want to thank all people and institutions that supported me during my dissertation:

• First of all, I want to thank my supervisor Prof. Dr. Daniel Keim for his continuous encouragement, excellent scientific guidance, and the opportunity to conduct joint research.

• I would like to thank my external collaboration partners Dr. Stephen North, Daniel Sheleheda, and Brian Rexroad at AT&T, who provided me with valuable insights into the problems of large-scale network monitoring and security, as well as Prof. Dr. Leland Wilkinson for advice on MDS and graph drawing methods.

• This work would not have been possible without gaining access to real data, which was provided by the computing center of the University of Konstanz. Therefore, I would like to express my gratitude to Prof. Dr. Marcel Waldvogel, Andreas Merkel, Barbara Löhle, and Stephan Pietzko, who shared anonymized data and provided valuable background knowledge about networking.

• I want to thank my colleagues Sabine Kuhr, Christian Panse, Mike Sips, Benjamin Bustos, Tobias Schreck, Jörn Schneidewind, Hartmut Ziegler, Daniela Oelke, Dominik Morent, and Peter Bak, who motivated, inspired, and helped me during this exciting time.

• I would like to thank my student assistants Andrada Tatu, Fabian Fischer, Lorenz Meier, and Halldór Janetzko, who implemented parts of the presented systems and whose creativity often shaped the research outcome.

• I am more than grateful to my dear wife Svetlana Mansmann for her loving support in both personal and scientific matters, as well as to my sister Veronika Mansmann, who proofread the thesis.

• Finally, I want to thank the anonymous reviewers of EuroVis ’05, CEAS ’05, IV ’06, Informatik ’05, IEEE TVCG ’06, IEEE VAST ’06, SCCG ’06, EDBT ’06, IEEE InfoVis ’06–’07, SADA ’07, BTW ’07, VMV ’07, and VizSec ’07, whose valuable comments helped to improve my research.

This work was funded by the German Research Foundation (DFG) under grant GK-1042, Explorative Analysis and Visualization of Large Information Spaces, Konstanz, by the federal state of Baden-Württemberg in the research project BW-FIT: Information at your Fingertips – Interactive Visualization for Gigapixel Displays, and by the Information and Software Systems Research Lab at AT&T.


Abstract

The Internet has become a dangerous place: malicious code spreads across personal computers around the world, creating botnets ready to attack the network infrastructure at any time.

Monitoring network traffic and keeping track of the vast number of security incidents or other anomalies in the network are challenging tasks. While monitoring and intrusion detection systems are widely used to collect operational data in real time, attempts to manually analyze their output at a fine-grained level are often tedious, require extensive human resources, or completely fail to provide the necessary insight due to the complexity and the volume of the underlying data.

This dissertation represents an effort to complement automatic monitoring and intrusion detection systems with visual exploration interfaces that empower human analysts to gain deeper insight into large, complex, and dynamically changing data sets. In this context, one key aspect of visual analysis is the refinement of existing visualization methods to improve their scalability with respect to a) data volume, b) visual limitations of computer screens, and c) human perception capacities. In addition, the development of innovative visualization metaphors for viewing network data is a further key aspect of this thesis.

In particular, this dissertation deals with scalable visualization techniques for detailed analysis of large network time series. By grouping time series according to their logical intervals in pixel visualizations and by coloring them for better discrimination, our methods enable accurate comparisons of temporal aspects in network security data sets.

In order to reveal the peculiarities of network traffic and distributed attacks with regard to the distribution of the participating hosts, a hierarchical map of the IP address space is proposed, which takes both geographical and topological aspects of the Internet into account. Since visual clutter becomes an issue when naively connecting the major communication partners on top of this map, hierarchical edge bundles are used for grouping traffic links based on the map’s hierarchy, thereby facilitating a more scalable analysis of communication partners.

Furthermore, the map is complemented by multivariate analysis techniques for visually studying the multidimensional nature of network traffic and security event data. In particular, the interplay of the implemented prototypes demonstrates the ability of the proposed visualization methods to provide an overview, to relate communication partners, to zoom into regions of interest, and to retrieve detailed information.

For an even more detailed analysis of hosts in the network, we introduce a graph-based approach to tracking behavioral changes of hosts and higher-level network entities. This information is particularly useful for detecting misbehaving computers within the local network infrastructure, which can otherwise substantially compromise the security of the network.

To complete the comprehensive view on network traffic, a Self-Organizing Map was used to demonstrate the usefulness of visualization methods for analyzing not only structured network protocol data, but also unstructured information, e.g., the textual content of email messages. By extracting features from the emails, the neural network algorithm clusters similar emails and is capable of distinguishing between spam and legitimate emails to a certain extent.

In the scope of this dissertation, the presented prototypes demonstrate the applicability of the proposed visualization methods in numerous case studies and reveal the vast potential of their usage in combination with automatic detection methods. We are therefore confident that visual analytics applications in the fields of network monitoring and security will quickly find their way from research into practice by combining human background knowledge and intelligence with the speed and accuracy of computers.


Zusammenfassung

Das Internet ist ein gefährlicher Ort geworden: Schadcode breitet sich auf PCs auf der ganzen Welt aus und schafft damit sogenannte Botnets, welche jederzeit bereit sind, die Netzwerkinfrastruktur anzugreifen. Netzwerkverkehr zu überwachen und den Überblick über die gewaltige Anzahl von sicherheitsrelevanten Vorfällen oder Anomalien im Netzwerk zu behalten sind schwierige Aufgaben. Während Monitoring- und Intrusion-Detection-Systeme weit verbreitet sind, um operationale Daten in Echtzeit zu erheben, sind Bemühungen, ihren Output auf detaillierter Ebene manuell zu analysieren, oftmals ermüdend, benötigen viel Personal oder schlagen aufgrund der Komplexität und des Volumens der zugrunde liegenden Daten vollständig fehl, die notwendigen Einsichten zu liefern.

Diese Dissertation stellt ein Bestreben dar, automatische Überwachungs- und Intrusion-Detection-Systeme durch visuelle Explorationsschnittstellen zu ergänzen, welche menschliche Analysten befähigen, tiefere Einsichten in riesige, komplexe und sich dynamisch verändernde Datensätze zu gewinnen. In diesem Zusammenhang ist ein Hauptanliegen von visueller Analyse, bestehende Visualisierungsmethoden zu verfeinern, um ihre Skalierbarkeit in Bezug auf a) die Datenmenge, b) visuelle Beschränkungen von Computerbildschirmen und c) die Aufnahmefähigkeit der menschlichen Wahrnehmung zu verbessern. Darüber hinaus ist die Entwicklung von innovativen Visualisierungsmetaphern ein weiteres Hauptanliegen dieser Doktorarbeit.

Insbesondere beschäftigt sich diese Dissertation mit skalierbaren Visualisierungstechniken für die detaillierte Analyse von riesigen Netzwerk-Zeitreihen. Indem Zeitreihen einerseits in Pixelvisualisierungen anhand ihrer logischen Intervalle gruppiert und andererseits zur verbesserten Abgrenzung eingefärbt werden, erlauben unsere Methoden genaue Vergleiche von temporären Aspekten in Netzwerk-Sicherheits-Datensätzen.

Um die Eigenheiten von Netzwerkverkehr und verteilten Attacken in Bezug auf die Verteilung der beteiligten Rechner aufzudecken, wird eine hierarchische Karte des IP-Adressraums vorgeschlagen, welche sowohl geographische als auch topologische Aspekte des Internets berücksichtigt. Da naives Verbinden der wichtigsten Kommunikationspartner auf der Karte zu störenden visuellen Artefakten führen würde, können Hierarchical Edge Bundles dazu verwendet werden, die Verkehrsverbindungen anhand der Hierarchie der Karte zu gruppieren, um dadurch eine skalierbarere Analyse der Kommunikationspartner zu ermöglichen.

Ferner wird die Karte durch eine multivariate Analysetechnik ergänzt, um auf visuelle Art und Weise die multidimensionale Natur des Netzwerkverkehrs und der Daten von sicherheitsrelevanten Vorfällen zu studieren. Insbesondere deckt die Interaktion der implementierten Prototypen die Eignung der vorgeschlagenen Visualisierungsmethoden auf, einen Überblick zu verschaffen, Kommunikationspartner zuzuordnen, in interessante Regionen hineinzuzoomen und detaillierte Informationen abzufragen.

Für eine noch detailliertere Analyse der Rechner im Netzwerk führen wir einen graphenbasierten Ansatz ein, um Veränderungen im Verhalten von Rechnern und abstrakteren Einheiten im Netzwerk zu beobachten. Diese Art von Information ist insbesondere nützlich, um Fehlverhalten der Rechner innerhalb der lokalen Netzwerkinfrastruktur aufzudecken, welches andernfalls die Sicherheit des Netzwerks beträchtlich gefährden kann.

Um die umfassende Sicht auf Netzwerkverkehr abzurunden, wurde eine Self-Organizing Map dazu verwendet, die Eignung der Visualisierungsmethoden zur Analyse nicht nur von strukturierten Daten der Netzwerkprotokolle, sondern auch von unstrukturierten Informationen, wie beispielsweise dem textuellen Inhalt von E-Mail-Nachrichten, zu demonstrieren.

Mittels der Extraktion charakteristischer Eigenschaften aus den E-Mails gruppiert der Neuronale-Netzwerk-Algorithmus ähnliche E-Mails und ist imstande, bis zu einem gewissen Grad zwischen Spam und legitimen E-Mails zu unterscheiden.

Im Rahmen dieser Dissertation demonstrieren die präsentierten Prototypen die breite Anwendbarkeit der vorgeschlagenen Visualisierungsmethoden in zahlreichen Fallstudien und legen ihr unerschöpfliches Potential dar, in Kombination mit automatischen Intrusion-Detection-Methoden verwendet zu werden. Deswegen sind wir zuversichtlich, dass Visual-Analytics-Anwendungen in den Bereichen Netzwerküberwachung und -sicherheit schnell ihren Weg aus der Forschung in die Praxis finden werden, indem sie menschliches Hintergrundwissen und Intelligenz mit der Geschwindigkeit und Genauigkeit von Computern kombinieren.


Parts of this thesis were published in:

[1] Daniel A. Keim, Florian Mansmann, Jörn Schneidewind, and Tobias Schreck. Monitoring Network Traffic with Radial Traffic Analyzer. In Proceedings of the IEEE Symposium on Visual Analytics Science and Technology, pages 123–128, 2006.

[2] Daniel A. Keim, Florian Mansmann, Jörn Schneidewind, Jim Thomas, and Hartmut Ziegler. Visual Data Mining: Theory, Techniques and Tools for Visual Analytics, chapter Visual Analytics: Scope and Challenges. Lecture Notes in Computer Science (LNCS). Springer, 2008.

[3] Daniel A. Keim, Florian Mansmann, Jörn Schneidewind, and Hartmut Ziegler. Challenges in visual data analysis. In Proceedings of IEEE Information Visualization (IV 2006). IEEE Press, 2006.

[4] Daniel A. Keim, Florian Mansmann, and Tobias Schreck. MailSOM – Visual Exploration of Electronic Mail Archives Using Self-Organizing Maps. In Conference on Email and Anti-Spam, 2005.

[5] Florian Mansmann, Fabian Fischer, Daniel A. Keim, and Stephen C. North. Visualizing large-scale IP traffic flows. In Proceedings of the 12th International Workshop on Vision, Modeling, and Visualization, pages 23–30, Saarbrücken, Germany, November 2007.

[6] Florian Mansmann, Daniel A. Keim, Stephen C. North, Brian Rexroad, and Daniel Sheleheda. Visual analysis of network traffic for resource planning, interactive monitoring, and interpretation of security threats. IEEE Transactions on Visualization and Computer Graphics, 13(6):1105–1112, 2007. Proceedings of the IEEE Conference on Information Visualization.

[7] Florian Mansmann, Lorenz Meier, and Daniel A. Keim. Visualization of host behavior for network security. In VizSec 2007 – Workshop on Visualization for Computer Security. Springer, 2008.

[8] Florian Mansmann and Svetlana Vinnik. Interactive Exploration of Data Traffic with Hierarchical Network Maps. IEEE Transactions on Visualization and Computer Graphics, 12(6):1440–1449, 2006.

[9] Svetlana Vinnik and Florian Mansmann. From analysis to interactive exploration: Building visual hierarchies from OLAP cubes. In Proceedings of the 10th International Conference on Extending Database Technology, pages 496–514, 2006.


Contents

Acknowledgment
List of Figures
1 Introduction
1.1 Monitoring network traffic
1.2 Intrusion detection and security threat prevention
1.3 Visual analysis for network security
1.4 Thesis outline and contribution
2 Networks, intrusion detection, and data management
2.1 Network fundamentals
2.2 Capturing network traffic
2.3 Intrusion detection
2.4 Building a data warehouse for network traffic and events
3 Foundations of information visualization for network security
3.1 Information visualization
3.2 Visual Analytics
3.3 Related work on visualization for network monitoring and security
4 Temporal analysis of network traffic
4.1 Related work on time series visualization
4.2 Extending the recursive pattern method for time series visualization
4.3 Comparing time series using the recursive pattern
4.4 Case study: temporal pattern analysis of network traffic
4.5 Summary
5 A hierarchical approach to visualizing IP network traffic
5.1 Related work on hierarchical visualization methods
5.2 The Hierarchical Network Map
5.3 Space-filling layouts for diverse data characteristics
5.4 Evaluation of data-driven layout adaptation
5.5 User-driven data exploration
5.6 Case studies: Analysis of traffic distributions in the IPv4 address space
5.7 Summary
6 An end-to-end view of IP network traffic
6.1 Related work on network visualizations based on node-link diagrams
6.2 Linking network traffic through hierarchical edge bundles
6.3 Case study: Visual analysis of traffic connections
6.4 Discussion
6.5 Summary
7 Multivariate analysis of network traffic
7.1 Related work on multivariate and radial information representations
7.2 Radial Traffic Analyzer
7.3 Temporal analysis with RTA
7.4 Integrating RTA into HNMap
7.5 Case study: Intrusion detection with RTA
7.6 Discussion
7.7 Summary
8 Visual analysis of network behavior
8.1 Related work on dimension reduction
8.2 Graph-based monitoring of network behavior
8.3 Integration of the behavior graph in HNMap
8.4 Automatic accentuation of highly variable traffic
8.5 Case studies: monitoring and threat analysis with behavior graph
8.6 Evaluation
8.7 Summary
9 Content-based visual analysis of network traffic
9.1 Related work on visual analysis of email communication
9.2 Self-organizing maps for content-based retrieval
9.3 Case study: SOMs for email classification
9.4 Summary
10 Thesis conclusions
10.1 Conclusions
10.2 Outlook
Bibliography
Index
Acronyms


List of Figures

2.1 The layers of the OSI, TCP/IP and hybrid reference models
2.2 Advertised IPv4 address count – daily average [78]
2.3 IP datagram
2.4 Concept of an SSH tunnel
2.5 Port scan using NMap
2.6 Concept of a demilitarized zone (DMZ)
2.7 A multilayered data warehousing system architecture
2.8 Example netflows cube with three dimensions and sample data
2.9 Modeling network traffic as an OLAP cube
2.10 Navigating in the hierarchical dimension IP address
3.1 Mapping values to color using different normalization schemes
3.2 The Scope of Visual Analytics
3.3 Computer network traffic visualization tool TNV [61]
4.1 Line charts of 5 time series of mail traffic over a time span of 1440 minutes
4.2 Recursive pattern example configuration: 30 days in each of the 12 months
4.3 Recursive pattern parametrization showing a weekly reappearing pattern
4.4 Multi-resolution recursive pattern with empty fields for normalizing irregularities in the time dimension
4.5 Enhancing the recursive pattern with spacing
4.6 Different coloring options for distinguishing between time series
4.7 Combination of two time series at different hierarchy levels
4.8 Recursive pattern in small multiples mode
4.9 Recursive pattern in parallel mode
4.10 Recursive pattern in mixed mode
4.11 Case study showing the number of SSH flows per minute over one week
4.12 Visualizing different characteristics of SSH traffic
5.1 Example of a hierarchical data visualization using the rectangle-packing algorithm
5.2 Density histogram: distribution of sessions over the IP address space
5.3 Multi-resolution approach: Hierarchical Network Map
5.4 Scaling effects in the HNMap demonstrated on some IP prefixes in Germany
5.5 HNMap on the powerwall
5.6 Border coloring scheme
5.7 Geographic HistoMap Layout
5.8 HistoMap 1D layout
5.9 Strip Treemap layout
5.10 Anonymized outgoing traffic connections from the university gateway
5.11 Average position change
5.12 Average side change
5.13 HNMap interactions
5.14 Recursive pattern pixel visualization showing individual hosts
5.15 Multiple map instances facilitate comparison of traffic of several time spans
5.16 Interface for configuring the animated display of a series of map instances
5.17 Visual exploration process for resource location planning
5.18 Monitoring traffic changes
5.19 Employing Radial Traffic Analyzer to find dependencies between dimensions
5.20 Rapid spread of botnet computers in China in August 2006
6.1 Comparison of different strategies for drawing adjacency relationships
6.2 The IP/AS hierarchy determines the control polygon for the B-spline
6.3 HNMap with edge bundles showing the 500 most important connections
6.4 Assessing major traffic connections through edge bundles
7.1 Scatter plot matrix
7.2 Design ratio of RTA
7.3 Continuous refinement of RTA by adding new dimension rings
7.4 RTA display with network traffic distribution at a local computer
7.5 Animation over time in RTA in time frame mode
7.6 Invocation of the RTA interface from within the HNMap display
7.7 Security alerts from Snort in RTA
8.1 Normalized traffic measurements of two hosts
8.2 Coordinate calculation of the host position
8.3 Host behavior graph of 33 prefixes over a timespan of 1 hour
8.4 Fine-tuning the graph layout through cohesion forces
8.5 Integration of the behavior graph view into HNMap
8.6 Automatic accentuation of highly variable ‘/24’-prefixes
8.7 Overview of network traffic between 12 and 18 hours
8.8 Nightly backups and Oracle DB traffic in the early morning
8.9 Investigating suspicious host behavior through accentuation
8.10 Evaluating 525 000 SNORT alerts recorded from Jan. 3 to Feb. 26, 2008
8.11 Splitting the analysis into internal and external alerts reveals different clusters
8.12 Analysis of 63 562 SNORT alerts recorded from January 21 to 27, 2008
8.13 Performance analysis of the layout algorithm
9.1 tf-idf feature extraction on a collection of 100 emails
9.2 The learning phase of the SOM
9.3 Self-Organizing Email Map


1 Introduction

“Computers are incredibly fast, accurate and stupid: humans are incredibly slow, inaccurate and brilliant; together they are powerful beyond imagination.”

Albert Einstein

Contents

1.1 Monitoring network traffic
1.2 Intrusion detection and security threat prevention
1.3 Visual analysis for network security
1.4 Thesis outline and contribution

It is a fact that digital communication and sharing of data have proven to be cheap, efficient, and effective. Over the years, this has turned the Internet into an indispensable resource in our everyday life: in the modern information society, not only private communication, but also education, administration, and business largely depend on the availability of the information infrastructure. To ensure the health of the network infrastructure, the following three aspects play critical roles:

1. Effective monitoring of the network to detect failures and react in time to overload situations.

2. Detection of intrusions and attacks that aim at stealing confidential information, misusing hijacked computers for malicious activities, and paralyzing business or services in the Internet.

3. Human capability to react to unforeseen threats to the network infrastructure.

Network monitoring is an essential task to keep the information infrastructure up and running. It is usually executed through a system that constantly monitors the hardware and software components crucial for the vitality of the network infrastructure and informs the network administrators in case of outages. Through so-called activity profiling, the monitoring system tries to distinguish between normal and abnormal usage and network behavior. In most cases, it is easy to handle failures that have previously occurred. However, recognizing abnormal network behavior is harder: misuse of the network often goes unnoticed, while many false alarms eventually lead to information overload for the involved system administrators.
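The core idea behind such activity profiling can be sketched in a few lines. The sliding-window size, the z-score threshold, and the synthetic traffic below are illustrative assumptions, not the method of any particular monitoring system:

```python
from statistics import mean, stdev

def detect_anomalies(counts, window=60, threshold=3.0):
    """Flag time steps whose value deviates strongly from the recent
    baseline (a simple z-score profile of 'normal' behavior)."""
    alerts = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(counts[i] - mu) / sigma > threshold:
            alerts.append(i)
    return alerts

# Synthetic per-minute flow counts: steady traffic with one sudden spike.
traffic = [100 + (i % 5) for i in range(120)]
traffic[90] = 500
print(detect_anomalies(traffic))  # → [90]
```

A real profile must additionally cope with diurnal and weekly traffic cycles, which is exactly where the false-alarm problem described above originates.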

Network security has become a constant race against time: so-called 0-day exploits, which are security vulnerabilities that are still unknown to the public, have become a valuable good in the hands of hackers. These exploits are used to develop malicious code, which infiltrates various computers in the Internet even before virus scanners and firewalls are capable of offering effective countermeasures. Often, this malicious code communicates with a botnet server and only waits to receive commands to execute code on the hijacked computer. If many of these infected computers are interlinked, they form a botnet and are a mighty weapon to harvest websites for email addresses, to send out spam messages, or to jointly conduct a denial-of-service attack against commercial or governmental webservers.

Today, signature-based and anomaly-based intrusion detection are considered the state-of-the-art of network security. However, fine-tuning parameters and analyzing the output of these intrusion detection methods can be complex, tedious, and even impossible when done manually. Furthermore, current malware trends suggest an increase in security incidents and a diversification of malware for the foreseeable future [62].

In general, it is noticeable that systems are becoming more and more sophisticated and, to a certain degree, make decisions on their own. As soon as unforeseen events occur, the system administrator or security expert has to step in to handle the situation. The network monitoring and security fields seem to have profited considerably from automatic detection methods in recent years. However, visual approaches still hold large potential to foster a better understanding of the complex information through visualization and interaction. In addition, visual analytics, which aims at bridging the gap between automatic and visual analysis methods, is a very promising research field for solving many of today’s information overload problems in network security.

In the remainder of this chapter, the need for network monitoring will be explained. We proceed by discussing how intrusion detection systems are used to prevent security threats. The need for visual analysis for network security is then motivated through its potential to bridge the gap between the human analyst and automatic analysis methods. The last section gives an outline of this thesis.

1.1 Monitoring network traffic

The computer network infrastructure forms the technical core of the Information Society. It transports increasing amounts of arbitrary kinds of information across arbitrary geographic distances. To date, the Internet is the most successful computer network.

The Internet has fostered the implementation of all kinds of productive information systems unimaginable at the time it was originally designed. While the wealth of applications that can be built on top of the Internet infrastructure is virtually unlimited, there are fundamental protocol elements which govern how information is transmitted between the nodes on the network. It is an interesting problem to devise tools for visual analysis of key network characteristics based on these well-defined protocol elements, thereby enhancing the network monitoring application domain. Network monitoring in general is concerned with the surveillance of important network performance metrics to a) supervise network functionality, b) detect and prevent potential problems, and c) develop effective countermeasures for networking anomalies and sabotage as they occur. One may distinguish between unintentional defects due to human failures or other malfunctions, referred to as flaws, and intentional misuses of the system, known as intrusions.


The main focus of most network monitoring systems is to collect operational data from countless network connections, switches, routers, and firewalls. These data need to be stored in a timely manner in central repositories, where network operations staff can conveniently access and use them to tackle failures within the network. However, one major drawback of these systems is that they only employ simple business charts to visualize their data, without taking into account the interlinked properties of the multi-dimensional network data. In the course of this dissertation, it will be demonstrated how information visualization techniques can be applied to gain insight into the complex nature of large network data sets.
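As a rough picture of what such operational data looks like once centralized, the sketch below rolls simplified, NetFlow-style records up into a per-port summary; the field names and sample flows are invented for illustration, and real flow records carry far more attributes (timestamps, packet counts, protocol flags, and so on):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Flow:
    """Heavily simplified, hypothetical flow record."""
    src: str
    dst: str
    dst_port: int
    bytes: int

def aggregate_by_port(flows):
    """Roll raw flow records up into per-port byte totals -- the kind of
    summary a central monitoring repository typically serves."""
    totals = defaultdict(int)
    for f in flows:
        totals[f.dst_port] += f.bytes
    return dict(totals)

flows = [Flow("10.0.0.1", "10.0.0.2", 80, 1200),
         Flow("10.0.0.3", "10.0.0.2", 80, 800),
         Flow("10.0.0.1", "10.0.0.4", 22, 300)]
print(aggregate_by_port(flows))  # → {80: 2000, 22: 300}
```

A summary like this is exactly what ends up in a simple business chart: the links between source, destination, port, and time are discarded, which is the drawback discussed above.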

1.2 Intrusion detection and security threat prevention

Since the Internet has de facto become the information medium of first resort, each host on the network is forced to face the danger of being continuously exposed to this hostile environment. What started as proof-of-concept implementations by a few experts for unveiling security vulnerabilities has become a sport among script kiddies and drawn the attention of criminals. Therefore, network security has turned into one of the central and most challenging issues in network communication for practitioners as well as for researchers. Security vulnerabilities in the system are exploited with the intention to infect computers with worms and viruses, to “hack” company networks and steal confidential information, to run criminal activities through the compromised network infrastructure, or to paralyze online services through denial-of-service attacks. Frequency and intensity of the attacks prohibit any laxity in monitoring the network behavior of the system.

One of the most famous infections to date was the SQL Slammer worm in January 2003. Due to a vulnerability in the Microsoft SQL Server, the worm was able to install itself on Microsoft servers and started to wildly scan the network in order to propagate itself. It was not the unavailability of the Microsoft SQL servers, but the traffic generated by these extensive scans that caused packet loss or completely saturated circuits in some instances. Several large Internet transit providers and end-user ISPs were completely shut down. As a result, Bank of America’s debit and credit card operations were impacted, denying customers the opportunity to make any transactions using their bank cards [172].

Economies of scale have made usage of the network infrastructure very efficient and extremely cheap. While this allowed the Internet to experience unprecedented growth, it brought about the pitfall that almost every Internet user is exposed to unwanted advertisement email messages, so-called spam. In the last few years, more and more of these relatively harmless spam messages have turned into phishing mails, aimed at stealing online banking and e-commerce codes and passwords from naive users.

Various automatic methods, such as virus scanners, spam filters, online surfing controls, firewalls, and intrusion detection systems, have emerged in response to the need to protect systems from harmful network traffic. However, as there will always be human and machine failures, no fully automated method can provide absolute protection.

Intrusion detection is the major preventive mechanism for timely recognition of malicious use of the system endangering its integrity and stability. There exist two different concepts to detect intrusions: a) anomaly-based intrusion detection systems (IDS), which offer a higher potential for discovering novel attacks, and b) signature-based IDS, which target already known attack patterns. Anomaly-based detection is carried out by defining the normal state and behavior of the system, with alerts sent out whenever that state is violated. It is a rather complicated task to define normal behavior precisely enough to minimize false alerts on the one hand and not let attacks evolve unnoticed on the other.
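The contrast between the two concepts is easy to see in a toy sketch of the signature-based side. The byte patterns below are invented for illustration and bear no relation to real rule sets such as Snort's:

```python
# Invented example signatures; real IDS rules are far more expressive.
SIGNATURES = {
    "example-worm-probe":  b"\x04\x01\x01\x01\x01",
    "example-ssh-scanner": b"SSH-2.0-libssh",
}

def match_signatures(payload: bytes):
    """Return the names of all known patterns found in a packet payload.
    This is the essence of signature-based detection: it recognizes only
    what is already in the rule set, whereas anomaly-based detection
    compares observed behavior against a model of normality."""
    return [name for name, pattern in SIGNATURES.items()
            if pattern in payload]

print(match_signatures(b"...SSH-2.0-libssh..."))  # → ['example-ssh-scanner']
```

A payload matching no entry yields an empty list, which illustrates why purely signature-based systems miss novel attacks until their rule set is updated.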

1.3 Visual analysis for network security

The roots of the field of exploratory data analysis date back to the eighties, when John Tukey articulated the important distinction between confirmatory and exploratory data analysis [171] out of the realization that the field of statistics was strongly driven by hypothesis testing at the time. Today, a lot of research deals with the increasing amount of digitally collected data in the hope that it contains valuable information that can eventually bring a competitive advantage to its owner. Visual data exploration, which can be seen as a hypothesis generation process, is especially valuable because a) it can deal with highly non-homogeneous and noisy data, and b) it is intuitive and requires no understanding of complex mathematical methods [90].

Visualization can thus provide a qualitative overview of the data, allowing data phenomena to be isolated for further quantitative analysis.

The emergence ofvisual analyticsresearch suggests that more and more visual methods will be closely linked with automatic analysis methods. The goal of visual analytics is to turn the information overload into the opportunity of the decade [162, 163]. Decision-makers should be enabled to examine this massive, multi-dimensional, multi-source, time-varying informa- tion stream to make effective decisions in time-critical situations. For informed decisions, it is indispensable to include humans in the data analysis process to combine flexibility, creativ- ity, and background knowledge with the enormous storage capacity and computational power of today’s computers. The specific advantage of visual analytics is that decision makers may fully focus their cognitive and perceptual capabilities on the analytical process, while allowing them to apply advanced computational capabilities to augment the discovery process.

Networked computers have become so ubiquitous and easy to access that they are also vulnerable [114]. While extensive efforts are made to build and maintain trustworthy systems, hackers often manage to circumvent the security mechanisms and thereby find a way to infiltrate systems, steal confidential information, compromise network computers, and in some cases even take over control of these systems. In practice, large networks consisting of hundreds of thousands of hosts are monitored by integrating logs from gateway routers, firewalls, and intrusion detection systems, using statistical and signature-based methods to detect changes, anomalies, and attacks. Due to economic and technical trends, networks have experienced rapid growth in the last decade, which has resulted in more legitimate as well as malicious traffic than ever before. As a consequence, the number of detected anomalies and security incidents has become too large to cope with manually, creating a pressing need for more sophisticated tools.

Our objective is to show how visual analysis can foster deep insight into the large data sets describing IP network activity. The difficult task of detecting various kinds of system vulnerabilities, for example, can be successfully addressed by applying visual analytics methods.


Whenever machine learning algorithms become insufficient for recognizing malicious patterns, advanced visualization and interaction techniques encourage expert users to explore the relevant data by taking advantage of human perception, intuition, and background knowledge. Through a feedback loop, the knowledge acquired in this process of human involvement can be used as input for advancing automatic detection mechanisms.

1.4 Thesis outline and contribution

The overall goal of this thesis is to show how visual analysis methods can contribute to the fields of network monitoring and security. In many cases, the large amount of data available from network monitoring processes renders many visualization techniques inapplicable. Therefore, a careful selection and extension of current visualization techniques is needed.

While the first three chapters motivate our work and introduce the necessary foundations of networking, intrusion detection, data modeling, information visualization, and visual analytics, Chapters 4 to 9 deal with efforts to appropriately represent and interact with the available information in order to gain valuable insight for timely reactions in case of failures and intrusions.

Chapter 2 details basic concepts in networking and intrusion detection that are necessary to comprehend the data sets which will be analyzed throughout the application chapters. Network protocols are discussed at an abstract level along with various tools for monitoring, intrusion detection, and threat prevention. Since in some cases one has to deal with extremely large data sets, the performance requirements of the database management system play an important role in our network research efforts. The underlying data model for storing network traffic and event data was inspired by the OLAP (online analytical processing) approach used in building data warehouses for efficiently managing huge data volumes and computing aggregates under high performance requirements.

In Chapter 3, the research fields of information visualization and visual analytics are discussed. Using Shneiderman's data type by task taxonomy, the visualization methods of this thesis are systematically put into context. Furthermore, we show how colors are mapped to values using different scaling functions and point to literature for further reading. Next, the relatively young field of visual analytics is defined and its potential for network monitoring and security is pointed out. Based on an extensive review of scientific publications in the field, an overview of visual analysis systems and prototypes for network monitoring and security is presented to the reader.

Starting from low-dimensional input data as in time series, the used input data increase in dimensionality as we proceed from Chapter 4 to Chapter 9. All these chapters follow the same methodology: after a short motivation, related visualization methods are reviewed, and the respective visualization approach is introduced, discussed, and evaluated where applicable. Finally, each method's applicability is demonstrated in at least one case study.

Chapter 4 describes the enhanced recursive pattern technique as an alternative to traditional line and bar charts for the comparison of several granular time series. In this visualization technique, each pixel represents, through its color attribute, the aggregated value of a time series at the finest displayed granularity level. Long time series are subdivided into groups of logical units, for example, several hours each consisting of 60 minutes. By allowing empty pixels in the recursive patterns, the technique can better cope with irregularities in time series, such as the irregular number of days or weeks in a month. In order to be able to compare several time series, three coloring schemes and three alternative arrangements are proposed. Finally, the applicability of the extended recursive pattern visualization technique is demonstrated on real data of large-scale SSH flows in our network.
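The core idea of mapping time-series values to recursively arranged pixels can be sketched as follows. This is a simplified illustration, not the thesis implementation: minutes fill an hour block from left to right, hour blocks are arranged a fixed number per row, and each day's rows stack below the previous day's.

```python
def pixel_position(day, hour, minute, hours_per_row=6):
    """Map one time-series value (identified by day/hour/minute) to a pixel.

    Illustrative sketch of the recursive-pattern layout idea: the minute
    selects the position inside an hour block, the hour selects the block,
    and the day selects the group of rows.
    """
    rows_per_day = 24 // hours_per_row
    x = (hour % hours_per_row) * 60 + minute   # hour block offset + minute
    y = day * rows_per_day + hour // hours_per_row
    return x, y

# Minute 30 of hour 7 lands in the second hour block of the second row:
print(pixel_position(day=0, hour=7, minute=30))   # (90, 1)
```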

In Chapter 5, we propose the Hierarchical Network Map (HNMap), which is a space-filling map of the IP address space for visualizing aggregated IP traffic. Within the map, the positions of network entities are defined through a containment-based hierarchy by rendering child nodes as rectangles within the bounds of their parent node in a space-filling way. While the upper continent and country levels require a space-filling geographic mapping method to preserve geographical neighborhood, node placement in the lower two levels depends on the IP addresses contained within the respective autonomous system or network prefix.

Since there exist two alternative layout methods and their combination for these lower two levels, we evaluate their applicability according to a) visibility, b) average rectangle aspect ratio, and c) layout preservation. Visual analysis of network traffic and events essentially involves exploration of the data. Therefore, various means of interaction are implemented within our prototype. Finally, three case studies involving resource location planning, traffic monitoring, and botnet spread propagation are conducted and show how the tool enables insightful analyses of large data sets.

In the scope of Chapter 6, the HNMap is extended through hierarchical edge bundles to convey source-destination relationships of the most important network traffic links. In contrast to straight connection lines, these bundles avoid visual clutter while at the same time grouping traffic with similar properties in the IP/AS hierarchy of the map. In order to communicate the intensity of a connection, we consider both the coloring and the width of the splines. The case study then assesses changes of the major traffic connections throughout a day of network traffic.

Chapter 7 describes the Radial Traffic Analyzer (RTA), a visualization tool for multivariate analysis of network traffic. In the visualization, network traffic is grouped according to joint attribute values in a hierarchical fashion: starting from the inside, each ring represents one dimension of the data set (e.g., source/destination IP or port). While the inner rings show high-level aggregates, the outer rings display more detailed information. By interactively rearranging the rings, the aggregation function of the data is changed. By animating the display, it is demonstrated that the RTA can be used for temporal analysis of network traffic. The case study then demonstrates how the tool is applied to the analysis of event data of an intrusion detection system.
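The effect of rearranging the rings corresponds to changing the order of the group-by dimensions over the same flow data. A minimal sketch of this aggregation step (the flow records below are made up for illustration):

```python
from collections import Counter

# Toy flow records: (source IP, destination port, packet count).
flows = [
    ("10.0.0.1", 80, 120),
    ("10.0.0.1", 443, 30),
    ("10.0.0.2", 80, 50),
    ("10.0.0.2", 22, 10),
]

def aggregate(flows, dims):
    """Aggregate packet counts along an ordered list of dimension indexes.

    Reordering `dims` corresponds to interactively rearranging the RTA
    rings: the same data, grouped in a different hierarchical order.
    """
    totals = Counter()
    for rec in flows:
        key = tuple(rec[d] for d in dims)
        totals[key] += rec[-1]
    return totals

print(aggregate(flows, [0, 1]))   # inner ring: source IP, outer ring: port
print(aggregate(flows, [1, 0]))   # ports first instead
```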

In Chapter 8, we propose a novel network traffic visualization metaphor for monitoring the behavior of network hosts. Each host is represented through a number of nodes in a graph, whose positions correspond to the traffic proportions of that particular host within a specific time interval. Subsequent nodes of the same host are then connected through straight lines to denote behavioral changes over time. In an attempt to reduce overdrawing of nodes with the same projected position, we then apply a force-directed graph layout to obtain compact traces for hosts with unchanged traffic proportions and extended traces for hosts with highly variable traffic proportions. Two case studies show how the tool can be used to gain insight into large data sets by analyzing the behavior of hosts in real network monitoring and security scenarios.

Chapter 9 details the analysis of content-based characteristics of network traffic using the well-known Self-Organizing Map (SOM) visualization technique. This neural network approach orders high-dimensional feature vectors on a map according to their distances. We create text descriptors by extracting the most popular terms from the subject and text fields of 9 400 e-mail messages in an archive and by applying the tf-idf information retrieval scheme. Within the case study, it is demonstrated that a SOM trained on these feature vectors can be used for classification tasks by distinguishing between spam and regular e-mails based on the position of an e-mail's feature vector on the map.
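The tf-idf weighting mentioned above can be sketched in a few lines. This uses one common tf-idf variant (raw term frequency times the logarithm of inverse document frequency); the exact weighting in Chapter 9 may differ, and the example terms are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    A term is weighted highly in a document when it occurs often there
    (tf) but rarely in the rest of the collection (idf).
    """
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [
    ["cheap", "pills", "cheap", "offer"],   # spam-like subject terms
    ["meeting", "agenda", "offer"],         # regular mail
    ["meeting", "minutes"],
]
w = tfidf(docs)
# "cheap" occurs only in the first document, so it gets a high weight there,
# while "offer" appears in two documents and is weighted lower.
```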

Chapter 10 concludes the dissertation by summarizing the contributions and giving an outlook on future work.


2 Networks, intrusion detection, and data management for network traffic and events

"Not everything that is counted counts, and not everything that counts can be counted."

Albert Einstein

Contents

2.1 Network fundamentals
  2.1.1 Network protocols
  2.1.2 The Internet Protocol
  2.1.3 Routing
  2.1.4 UDP and TCP
  2.1.5 Domain Name System
2.2 Capturing network traffic
  2.2.1 Network sniffers
  2.2.2 Encryption, tunneling, and anonymization
2.3 Intrusion detection
  2.3.1 Network and port scans
  2.3.2 Computer viruses, worms, and trojan programs
  2.3.3 Counter measures against intrusions and attacks
  2.3.4 Threat models
2.4 Building a data warehouse for network traffic and events
  2.4.1 Cube definitions
  2.4.2 OLAP Operations and Queries
  2.4.3 Summary tables
  2.4.4 Visual navigation in OLAP cubes

Computer networks have become an integral part of the IT infrastructure we use daily. It is therefore worth devoting some time to introducing networking concepts and terminology in order to foster an understanding of the data analysis challenges addressed in this dissertation.

In particular, methods for capturing network traffic, methods for intrusion detection, as well as data modeling issues are discussed.


2.1 Network fundamentals

Within this dissertation, networking concepts are only explained in a very brief fashion. For a more thorough discussion, Kurose et al. [101] and Tanenbaum [159] have written excellent books on computer networking.

In general, one can distinguish between network hardware and software. Today's network hardware has diversified into various technologies: cable, fiber links, wireless, and satellite communication work together seamlessly. This is due to the fact that flexible network communication protocols (the software part) introduce the necessary abstraction to facilitate communication among various machines running diverse operating systems and widespread applications. Many innovations of today's network communication were first proposed in Requests For Comments (RFCs). The RFC series was intended as an informal, fast distribution channel to share ideas with other network researchers. It was hosted at the Stanford Research Institute (SRI), one of the first nodes of the ARPANET, which was the predecessor of the Internet [109].

2.1.1 Network protocols

Computers in a network are called hosts. As mentioned above, different technologies exist to connect the hosts of a network. From a structural point of view, one often distinguishes between Local Area Networks (LANs) and Wide Area Networks (WANs). The main difference between them is the communication distance, not necessarily the size, resulting in the use of different communication hardware. Since long-distance links are more expensive than wiring a few local hosts, there is an obvious consolidation effect resulting in one strong link connecting two networks instead of several low-bandwidth links. However, due to availability concerns, many networks are connected through several links.

The Internet is a giant network which consists of countless interconnected networks. Many people confuse the term World Wide Web (WWW) with the term Internet. In fact, the World Wide Web can be seen as a huge collection of interlinked documents accessed via the Internet. Countless webservers provide access to interconnected dynamic and static websites for arbitrary hosts in the Internet. Therefore, the Internet is the name of the network, whereas WWW refers to a particular service running on top of this network infrastructure. E-mails, for example, are sent through the Internet and are another service besides the WWW.

There exist two well-known reference models for network protocols: TCP/IP (Transmission Control Protocol/Internet Protocol) and OSI (Open System Interconnection). Neither the OSI nor the TCP/IP model and their respective protocols are perfect. The OSI reference model was proposed at a time when many of the competing TCP/IP protocols were already in widespread use, and no vendor wanted to be the first one to implement and support the OSI protocols. The second reason OSI never caught on is that the seven proposed layers were modeled unnecessarily complexly: two of the layers are almost empty (session and presentation), whereas two others (data link and network) are overloaded. In addition, some functions such as addressing, flow control, and error control reappear in each layer. Third, the early implementations of OSI protocols were flawed and were therefore associated with bad quality, as opposed to TCP/IP, which was supported by a large user community. Finally, people thought of OSI as the creation of some European telecommunication ministries, the European Union, and later as the creation of the U.S. government, and thus preferred TCP/IP as a solution coming out of innovative research rather than bureaucracy [159].

[Figure: the layer stacks of the three models. OSI: application, presentation, session, transport, network, data link, physical. TCP/IP: application, transport, network, host-to-network. Hybrid (layers 5 to 1): application, transport, network, data link, physical.]

Figure 2.1: The layers of the OSI, TCP/IP and hybrid reference models

OSI protocols are rarely used nowadays; therefore, the focus is on the TCP/IP protocols, but a hybrid reference model combining TCP/IP and OSI is considered throughout this dissertation.

Networking protocols are organized in so-called layers. These layers can be implemented in software (highest flexibility), in hardware (high speed), or in a combination of the two. The hybrid model used here adopts the upper three layers of TCP/IP, but splits the host-to-network layer into the data link and physical layers, as illustrated in Figure 2.1. The five layers of the hybrid model are as follows:

1. The physical layer is concerned with transmitting raw bits over a communication channel and ensures that if one party sends a 1 bit, the other party actually receives it as a 1 and not as a 0.

2. The data link layer, sometimes also called link layer or network interface layer, is composed of the device driver in the operating system and the corresponding network interface card. These two components are responsible for handling the hardware details of communicating with the cable.

3. The network layer, or internet layer, handles the movement of packets in the network and routes them from one computer via several hops to their destination. In the TCP/IP protocol suite, this layer is provided by IP (Internet Protocol), ICMP (Internet Control Message Protocol), and IGMP (Internet Group Management Protocol).

4. The transport layer delivers the service of data flows for the application layer above it. The protocol suite includes two conceptually different protocols: TCP (Transmission Control Protocol) offers a reliable flow of data between two hosts for the application layer by acknowledging received packets and retransmitting erroneous packets. UDP (User Datagram Protocol), on the contrary, only sends packets of data from one host to the other, but no guarantee about the arrival of the packets at the other end is given. It is often used in real-time applications like voice, music, and video streaming, where the loss of some packets is acceptable to a certain degree.


Protocol   Layer   Name                                  Purpose / applications
FTP        AL      File Transfer Protocol                file transfer
HTTP       AL      HyperText Transfer Protocol           hypertext transfer
IMAP       AL      Internet Message Access Protocol      electronic mailbox with folders, etc.
POP3       AL      Post Office Protocol version 3        electronic mailbox
SMTP       AL      Simple Mail Transfer Protocol         e-mail transmission across the Internet
SNMP       AL      Simple Network Management Protocol    network management
SSH        AL      Secure Shell                          secure remote login (UNIX, LINUX)
TELNET     AL      TELetype NETwork                      remote login (UNIX, LINUX)
TCP        TL      Transmission Control Protocol         lossless transmission
UDP        TL      User Datagram Protocol                transmission of simple datagrams (packets might be lost); music, voice, video
ICMP       NL      Internet Control Message Protocol     error messages
IGMP       NL      Internet Group Management Protocol    manages IP multicast groups
IP         NL      Internet Protocol                     global addressing amongst computers

Table 2.1: Common protocols that build upon the TCP/IP reference model (NL = Network Layer, TL = Transport Layer, AL = Application Layer)

5. The communication details of diverse applications are handled in the application layer. Common application layer protocols are Telnet for remote login, the File Transfer Protocol (FTP) for file transfer, SMTP for electronic mail, SNMP for network management, etc.

An important contribution of the OSI model is the distinction between services, interfaces, and protocols. Each layer performs services for the layer above it. A layer's interface, in turn, tells the processes above how to access it. The protocols used inside a layer can thus be chosen independently, and exchanging them will not affect the protocols of other layers.

Since the upper-layer protocols build upon the services provided by the lower layers, intermediate routers do not necessarily need to understand the protocols of the application layer; it suffices that they communicate data using lower-level protocols, and only the respective source and destination computers need to be capable of interpreting the application protocols used.

A few commonly used protocols are listed in Table 2.1 to convey an intuition about what is done in which layer. Let us consider a short example: After a web page has been requested, the HTTP header (application layer) specifies the status code, modification date, size, content-type, and encoding of the document, among other technical details. The TCP protocol of the transport layer then subdivides the document into multiple segments and specifies the source port (HTTP = port 80) and the destination port on which the requesting host is already listening. This protocol guarantees reliable and in-order delivery of data from sender to receiver by sending requests and acknowledgments. In case of timeouts, TCP retransmits the lost segments and correctly reassembles them. The IP protocol (network layer) then provides global addressing and takes care of routing the packets from the source to the destination host. Often, this involves many routers, which each transfer the packet to the next hop; normally, this hop is closer to the destination with respect to the network topology. Finally, the communication between two machines in this chain of involved routers and hosts is controlled using the data link layer. This might be handled by Ethernet or other data link and physical layer standards.

Figure 2.2: Advertised IPv4 address count – daily average [78]

2.1.2 The Internet Protocol

As demonstrated in the example above, many upper-layer protocols depend upon the Internet Protocol with its global addressing and routing capabilities. Nowadays, version 4 is most commonly used. The Internet's growth has been documented in several studies [113, 126, 78] by means of estimating its network traffic, its users, and the advertised IP addresses. Figure 2.2 illustrates this growth, indicating that currently about 40 % of the approximately 4 billion IPv4 addresses are used. Predictions suggest that the IANA (Internet Assigned Numbers Authority) will run out of IP addresses in 2010. However, Network Address Translation (NAT) and IPv6 technology can compensate for the need for more IPv4 addresses. Both technologies are already in use and ready for broader deployment.

Figure 2.3 shows an IP datagram. The IP header normally consists of at least five 32-bit words (one for each of the first five rows in the figure). It specifies the IP version used (mostly 4), the IP header length (IHL), the type of service, the size of the datagram (header + data), the identification number, which in combination with the source address uniquely identifies a packet, several flags, the fragmentation offset (byte count from the original packet, set by routers which perform IP packet fragmentation), the time to live (the maximum number of hops over which the packet may still be routed), the protocol (e.g., 1 = ICMP, 6 = TCP, 17 = UDP), the header checksum, the source address, and the destination address. In some cases, additional options are used by specifying a number greater than five in the IHL field. The rest of the datagram consists of the actual data, the so-called payload.

[Figure: IP datagram layout, one 32-bit word per row: Version | IHL | Type of Service | Total Length; Identification | Flags | Fragment Offset; Time to Live | Protocol | Header Checksum; Source Address; Destination Address; Options (optional); Data]

Figure 2.3: IP datagram
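The header checksum field mentioned above is computed with the standard Internet checksum algorithm (RFC 1071): the one's-complement sum of the header's 16-bit words. A sketch, where the example header bytes are purely illustrative:

```python
def internet_checksum(header: bytes) -> int:
    """Internet checksum (RFC 1071): one's-complement sum of 16-bit words."""
    if len(header) % 2:                          # pad odd-length input
        header += b"\x00"
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

# A 20-byte example header with the checksum field (bytes 10-11) zeroed:
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(internet_checksum(header)))            # 0xb861

# Verification as a router would do it: recomputing the checksum over the
# header *including* the correct checksum value yields zero.
patched = header[:10] + b"\xb8\x61" + header[12:]
print(internet_checksum(patched))                # 0
```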

The IP addressing and routing scheme builds upon two components, namely IP addresses and network prefixes:

• An IP address is a 32-bit number (in IPv4) which uniquely identifies a host interface in the Internet. For example, 134.34.240.69 is the IP address of a webserver at the University of Konstanz in dot-decimal notation.

• A prefix is a range of IP addresses and corresponds to one or more networks [67]. For instance, the prefix 134.34.0.0/16 defines the 65 536 IP addresses assigned to the University of Konstanz, Germany. Each prefix consists of an IP address and a subnet prefix length, which specifies the number of leftmost bits that should be considered when matching an IP address to prefixes.
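Both definitions can be checked directly with Python's standard ipaddress module, using the addresses quoted above:

```python
import ipaddress

host = ipaddress.ip_address("134.34.240.69")   # the webserver from the text
net = ipaddress.ip_network("134.34.0.0/16")    # University of Konstanz prefix

# A /16 prefix covers 2**(32-16) = 65 536 addresses.
print(net.num_addresses)                       # 65536

# Membership test: does the host fall into the prefix?
print(host in net)                             # True

# The same check spelled out on the bit level: compare the 16 leftmost bits.
mask = 0xFFFF0000
print((int(host) & mask) == int(net.network_address))   # True
```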

2.1.3 Routing

When traffic is sent from a source to a destination host, several repeaters, hubs, bridges, switches, routers, or gateways might be involved. To clarify these terms, we need to reconsider the layers of the reference model, since these devices operate on different layers. Repeaters were designed to amplify the incoming signal and send it out again in order to extend the maximum cable length (Ethernet: ca. 500 m). Hubs work in a very similar way, but send out the incoming signal on all their other network links. These two devices operate on the physical layer, since they do not understand frames, packets, or headers.

Next, we consider switches and bridges, which both operate on the data link layer. Switches are used to connect several computers, similar to hubs, whereas bridges connect two or more networks. When a frame arrives, the software inside the switch extracts the destination address from the frame, looks it up in a table, and sends the frame out on the respective network link.

When a packet enters a router, the header and the trailer are stripped off and the routing software determines from the destination address in the header to which output line the packet should be forwarded. For an IPv4 packet, this address is a 32-bit number (IPv6: 128 bit) rather than the 48-bit hardware address (also called MAC address or Ethernet Hardware Address (EHA)). The term gateway is often used interchangeably with router; however, transport and application gateways operate one or two layers higher.

Since each network is independently managed, it is often referred to as an Autonomous System (AS). An AS is a connected group of one or more IP prefixes (networks) run by one or more network operators, and it has a single, clearly defined routing policy [67]. ASes are indexed by a 16-bit Autonomous System Number (ASN). Usually, an AS belongs to a local, regional, or global service provider, or to a large customer that subscribes to multiple IP service providers. The border gateway routers, which connect different ASes, base their routing decisions upon so-called routing tables. Such a table contains a list of IP prefixes, the next router, and the number of hops to the destination.

Prefixes underlie Classless Inter-Domain Routing (CIDR) [56], which was preceded by classful addressing. Classful addressing only allowed 128 class A networks (/8), each consisting of 16 777 214 addresses, 16 384 class B networks (/16) with 65 534 addresses each, and 2 097 152 class C networks (/24) of size 254. Note that the number of available addresses is always 2^N − 2, where N is the number of bits used and the −2 adjusts for the invalidity of the first and last addresses, because they are reserved for special use. Since many mid-size companies required more than 254 addresses, the fear arose that the class B networks would soon be depleted. CIDR introduced variable prefix lengths and thus offered more flexibility to vary network sizes for both internal and external routing decisions. Through its bitwise address assignment and aggregation strategy, routing tables are kept small and efficient. Contiguous ranges of IP addresses which are all forwarded to the identical next hop are aggregated in the routing tables. For example, traffic to the prefixes 134.34.52.0/24 and 134.34.53.0/24, which are both in AS 553, can be aggregated to 134.34.52.0/23.
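The aggregation step from this example can be reproduced with the standard ipaddress module:

```python
import ipaddress

# The two adjacent /24 prefixes from the text collapse into one /23 entry,
# shrinking the routing table without changing where traffic is forwarded.
routes = [
    ipaddress.ip_network("134.34.52.0/24"),
    ipaddress.ip_network("134.34.53.0/24"),
]
merged = list(ipaddress.collapse_addresses(routes))
print(merged)   # [IPv4Network('134.34.52.0/23')]
```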

Each time a packet arrives at an intermediate hop, it is forwarded to the router with the most specific prefix entry matching the destination IP address. This is done by checking whether the N initial bits are identical. Note that routing is usually more specific within an AS, whereas external routing is highly aggregated due to the fact that all traffic from a particular source to a destination AS needs to pass through the same border gateway router. Further details of exterior routing, such as policies, costs, and the announcement and withdrawal of prefixes, are dealt with by the Border Gateway Protocol (BGP) [112].
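Longest-prefix matching can be sketched as follows. The routing table and next-hop names below are made up for illustration, and real routers use specialized data structures such as tries rather than a linear scan:

```python
import ipaddress

# Toy routing table: prefix -> next hop (hypothetical names).
table = {
    ipaddress.ip_network("0.0.0.0/0"): "default-gw",
    ipaddress.ip_network("134.34.0.0/16"): "router-a",
    ipaddress.ip_network("134.34.52.0/23"): "router-b",
}

def next_hop(destination):
    """Forward to the most specific (longest) prefix matching the address."""
    addr = ipaddress.ip_address(destination)
    matching = [net for net in table if addr in net]
    best = max(matching, key=lambda net: net.prefixlen)
    return table[best]

print(next_hop("134.34.52.17"))    # router-b: the /23 beats the /16
print(next_hop("134.34.240.69"))   # router-a
print(next_hop("8.8.8.8"))         # default-gw
```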

2.1.4 UDP and TCP

As mentioned previously, UDP and TCP operate on the transport layer and provide end-to-end byte streams over an unreliable internetwork. The connectionless protocol UDP provides the service of sending IP datagrams with a short header for applications. This is done by adding source and destination port fields to the IP header, thus enabling the transport layer to determine which process on the destination machine is responsible for handling the packet.

The destination port specifies which process on the target machine handles the packet, whereas the source port details on which port the reply to the request should arrive. In the reply, the former source port is simply copied into the destination port field so that the requesting machine knows how to handle the answer.


In contrast to UDP, the connection-oriented protocol TCP provides a reliable service for sending byte streams over an unreliable internetwork. This is done by creating so-called sockets, which are nothing else but communication end points, and by binding ports local to the host to these sockets. TCP can then establish a connection between a socket on the source and a socket on the target machine.

The IANA [77] is responsible for the assignment of application port numbers for both TCP and UDP. Conceptually, there are three ranges of port numbers:

1. On many systems, well-known port numbers, ranging from 0 to 1023, can only be used by system (or root) processes or by programs executed by privileged users. They are assigned by the IANA in a standardization effort.

2. Registered port numbers ranging from 1024 to 49 151 can be used by ordinary user processes or programs executed by ordinary users. The IANA registers uses of these ports as a convenience to the community.

3. Dynamic and/or private port numbers, ranging from 49 152 to 65 535, can be used by any process and are not available for registration.
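The three ranges above can be captured in a small helper:

```python
def port_range(port):
    """Classify a 16-bit TCP/UDP port number into the three IANA ranges."""
    if not 0 <= port <= 65535:
        raise ValueError("port numbers are 16-bit values")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic/private"

print(port_range(22))      # well-known (SSH)
print(port_range(8080))    # registered
print(port_range(50000))   # dynamic/private
```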

In the analyses presented in this thesis, application port numbers are often used as an indication of which applications are using the network. Since the application port numbers can be extracted from the packet headers, this kind of analysis does not require looking at the packet content, which might otherwise raise additional privacy concerns. Although the used application ports are a rather good estimate for regular traffic, application port numbers can be used by processes other than the ones they were originally intended for. The peer-to-peer Internet telephony system Skype, for example, uses ports 80 and 443, which are registered for web traffic (HTTP) and secure web traffic (HTTPS). This is done in order to bypass application firewalls, which are an effort to protect the network infrastructure from malicious traffic by blocking unused ports. Naturally, this is only possible if the application ports on that particular machine have not already been bound to a webserver process.

2.1.5 Domain Name System

So far, we have discussed various protocols which all rely on some sort of network address (e.g., the MAC address or the IP address). Whereas machines can perfectly deal with this kind of address, humans find them hard to remember. For this reason, ASCII names were introduced to decouple machine names from machine addresses. However, the network itself only understands numerical addresses, so a mapping mechanism is required to convert the ASCII strings to network addresses. Since the previously used host files could not keep pace with the fast-growing Internet, the Domain Name System (DNS) was invented.

The Internet is conceptually divided into a set of roughly 250 top-level domains. Each of these domains is further partitioned into subdomains, which in turn can have subdomains of their own, and so on. This scheme forms a hierarchy. One distinguishes between so-called generic (e.g., com, edu, org) and country top-level domains (e.g., de, ch, us). Subdomains can then be registered at the responsible registrar. Each domain is named by the path upward from it to the (unnamed) root, with the components separated by periods (pronounced "dots"). For example, www.uni-konstanz.de specifies the subdomain www (a common convention for naming a webserver), which is a subdomain of uni-konstanz, which in turn is registered below the top-level domain de. This naming scheme usually follows organizational boundaries rather than the physical network.
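The path from the root down to a fully qualified name can be made explicit by splitting the name at the periods. The following hypothetical helper (names chosen for illustration) reconstructs that path for the example above:

```python
def hierarchy_path(domain: str) -> list:
    """Return the path from the (unnamed) root down to the given domain.

    Labels are separated by periods; the rightmost label is closest
    to the root of the DNS hierarchy. Illustrative sketch only.
    """
    labels = domain.rstrip(".").split(".")
    # Walk from the top-level domain down to the full name.
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]

print(hierarchy_path("www.uni-konstanz.de"))
# ['de', 'uni-konstanz.de', 'www.uni-konstanz.de']
```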

When a domain name is passed to the DNS system, the latter returns all resource records associated with that name. For simplicity, we restrict ourselves to address records, which might look like this:

Domain               TTL    Class  Type  Value
www.uni-konstanz.de  86400  IN     A     134.34.240.69

Resource records contain five values, namely the domain name, time to live (TTL), class, type, and value. In the record above, the TTL value of 86 400 (the number of seconds in one day) indicates that the record is rather stable, since highly volatile information is assigned a small value. IN specifies that the record contains Internet information, and A that it is an address record. The final field specifies the actual IP address 134.34.240.69, which is mapped to the domain www.uni-konstanz.de. Other resource records hold information such as the start of authority, responsible mail and name servers for a particular domain, pointers, canonical names, host descriptions, or other informative text. For more details refer to [159].
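The five fields described above can be pulled apart mechanically. The sketch below parses the textual address record shown in the table; the function name and dictionary layout are illustrative assumptions, not a real DNS API:

```python
# Illustrative parsing of a DNS address record line into its five
# fields (domain, TTL, class, type, value), following the text above.
def parse_address_record(line: str) -> dict:
    domain, ttl, rr_class, rr_type, value = line.split()
    return {
        "domain": domain,
        "ttl": int(ttl),     # time to live, in seconds
        "class": rr_class,   # IN = Internet information
        "type": rr_type,     # A = address record
        "value": value,      # the mapped IPv4 address
    }

record = parse_address_record("www.uni-konstanz.de 86400 IN A 134.34.240.69")
print(record["ttl"])    # 86400
print(record["value"])  # 134.34.240.69
```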

2.2 Capturing network traffic

To conduct data analysis of network traffic, details of this traffic need to be obtained from hosts, routers, firewalls, and intrusion detection systems. Collecting this data often turns out to be a practical challenge, since some network packets pass several routers and are thus recorded several times. Furthermore, the export interfaces of routers might return so-called netflows, i.e., detailed information about size, time, source, and destination of the transferred network traffic, in different formats. For a more detailed description of problems and solutions in measuring network traffic, we suggest the book "Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices" by George Varghese [178].

2.2.1 Network sniffers

When access to the export interface of a router is not given, there exists an alternative way of monitoring network traffic. The network card of almost any computer can be set into promiscuous mode, which instructs the network card to pass all traffic it receives to the CPU rather than just packets addressed to the host itself. In the next step, the packets are passed to programs extracting application-level data. Depending on the network infrastructure, packet sniffing can be very effective or not effective at all: hubs forward all traffic to each of their network interfaces (except for the one where it came in), whereas switches only forward incoming traffic to one network link as long as it is not a broadcast packet. Despite this fact, ARP Poison Routing (APR) can be used to fool switches by misleadingly announcing the MAC addresses of other hosts in the network.

Today, network administrators and hackers can choose from a wide variety of packet snif- fers. A few commonly used freeware tools are listed here:

• libpcap is a system-independent interface for user-level packet capturing, which runs on POSIX systems (Linux, BSD, and UNIX-like OSes).

• tcpdump is a command-line tool that prints the headers of packets on a network interface matching a boolean expression. It runs on POSIX systems and is built upon libpcap.

• WinPcap contains the Windows version of the libpcap API.

• JPcap is a Java wrapper for libpcap and WinPcap, which provides a Java interface for packet capturing.

• Wireshark (formerly Ethereal) is a free graphical packet capture and protocol analysis tool, which runs on POSIX systems, MS Windows, and Mac OS X. It uses libpcap or WinPcap depending on the OS.

• The OSU flow-tools are a set of tools for recording, filtering, printing, and analyzing flow logs derived from exports of Cisco NetFlow accounting records [57].

Since there has always been a need to monitor network traffic and debug network protocols, many more freeware and commercial products have emerged; an overview can be found in [75]. By simply instructing a router to duplicate outgoing packets and output them on an additional network interface, it is possible to sniff all network traffic using a machine with a promiscuous network interface connected to the router. This avoids having to deal with the various formats of the export interface, but requires a lot of bandwidth between the router and the capturing device. Alternatively, most routers are capable of exporting metadata using, for example, the Cisco NetFlow protocol.

Surprisingly, a lot of commonly used applications, such as POP3, FTP, IMAP, htaccess, and some webstores, do not encrypt transferred data, usernames, or passwords. For hackers, it therefore often suffices to have a network sniffer installed that listens to the communication between their victims and the servers they use. The sniffed Ethernet frames are simply passed to an application which searches their payload for passwords. Note that in high-load situations not all packets can be captured due to capacity limits of the machine on which the sniffer runs: routers are built and configured to pass large amounts of packets, whereas sniffers need to analyze the packets at a higher level, which requires more processing power.
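The first step such an application performs on a sniffed frame can be sketched as follows: split the 14-byte Ethernet header into destination MAC, source MAC, and EtherType, and hand the payload on for higher-level analysis. The frame bytes and function name below are made up for illustration:

```python
import struct

# Sketch of what a sniffer does with a captured Ethernet frame:
# unpack the 14-byte header and return the payload for further analysis.
def parse_ethernet_frame(frame: bytes):
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    mac = lambda b: ":".join(f"{octet:02x}" for octet in b)
    return mac(dst), mac(src), ethertype, frame[14:]

# A fabricated broadcast frame carrying an IPv4 packet (EtherType 0x0800).
frame = bytes.fromhex("ffffffffffff" "001122334455" "0800") + b"payload..."
dst, src, ethertype, payload = parse_ethernet_frame(frame)
print(dst)             # ff:ff:ff:ff:ff:ff (broadcast)
print(hex(ethertype))  # 0x800 -> an IPv4 packet follows
```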

2.2.2 Encryption, tunneling, and anonymization

Encryption can be used to prevent password theft and to guarantee the privacy of communication. More formally, the desirable properties of secure communication can be identified as follows:
