9 Content-based visual analysis of network traffic

“Exploration really is the essence of the human spirit, and to pause, to falter, to turn our back on the quest for knowledge, is to perish.”

Frank Borman

Contents

9.1 Related work on visual analysis of email communication
9.2 Self-organizing maps for content-based retrieval
    9.2.1 Use cases
    9.2.2 Feature Extraction
    9.2.3 SOM generation
9.3 Case study: SOMs for email classification
9.4 Summary
    9.4.1 Future work

While the previous chapters mainly dealt with the analysis of network traffic meta data, this chapter’s focus is on the analysis of the actual content of network communication.

For this purpose, we consider electronic mail, which has become one of the most widely used means of communication. While mailing volumes have shown high growth rates since the introduction of email as an Internet service and considerable work has been done to improve the efficiency of email management, the effectiveness of email management from a user perspective has not received a comparable amount of research attention. Typically, users are given few means to intelligently explore the wealth of accumulated information in their email archives.

In this chapter, we extend our framework with a visualization module based on Self-Organizing Maps (SOMs) [94] generated from a term occurrence email descriptor. We apply this module to an email archive and enhance the functionality of an email management system by offering powerful visual analysis features.

The rest of this chapter is structured as follows. In Section 9.1, related efforts on enhancing visual analysis of email communication are discussed. The concept of self-organizing maps is introduced in Section 9.2 by presenting use cases, demonstrating tf-idf feature extraction on emails, and giving an intuition about how SOMs are generated. Section 9.3 shows in a case study how the SOM can be used to explore emails classified according to spam and ordinary email. Finally, we sum up our contributions and possible directions for future work in the concluding section.


9.1 Related work on visual analysis of email communication

The main reason for improving the graphical user interface of email applications is that the success and popularity of email communication have led to high daily volumes of messages being sent and received, resulting in email overload situations where important messages get overlooked or “lost” in archives [186]. Although email usage has changed significantly over the years (many people now use it to manage their appointments and tasks, for file exchange, and as a personal archive), email clients have stayed very much the same [42]. Several research groups have therefore proposed novel features to make email clients more adequate for these tasks.

Mandic and Kerne, for example, recognized that email communication is actually “a diary we were never aware we were keeping” [118] and, therefore, regard personal email archives as a potential source of valuable insight into the structure and dynamics of one’s social network.

Acknowledging the fact that traditional email interfaces have undergone little evolution despite their intensive usage, they propose faMailiar, a visualization interface for revealing intimacy and rhythm of personal email communication to the user. The system combines user-defined categorization of contact intimacy with message intimacy, computed using the presence of intimate and anti-intimate syntagmata, in order to visualize communication patterns of email messages through glyphs in a calendar view.

The IBM Remail project also focused on chronological aspects of email communication with Thread Arcs, which visualize the reply patterns of communication threads in a compact way [92]. Apart from this key feature, the prototype was designed to integrate several sources, such as chat communication, news, calendar events, reminders, etc., into one communication platform. Through the scatter-and-gather feature, the inbox can be quickly reduced to the latest messages of ongoing communication threads. So-called collections can be used to order the contents of email communication according to user-defined preferences while not moving the messages out of the inbox. In particular, the pivoting interface allows the user to switch rapidly between different views of the inbox, collections, and the to-do list without losing context.

Microsoft also started an attempt to innovate email interfaces with the Social Network and Relationship Finder (short: SNARF) [131]. The prototype aggregates social meta data about email correspondents to aid email triage, which is the process of viewing unhandled email and deciding what to do with it. In the prototype, emails can be sorted according to several social metrics that capture the nature and the strength of the relationship between the user and each correspondent.

In addition to exploring relationships between users and groups of users, the Email Mining Toolkit (EMT) can be used to investigate chronological flows of emails for the detection of misuse, such as virus propagation and spam spread [110]. The tool comes with a clique panel for visualizing the relationships in a circular node-link diagram, which is extended through concentric rings to depict the time dimension.

Besides a node-link diagram for social network exploration called Social Network Fragment, the tool by Viégas et al. also offers a temporal visualization of email communication named PostHistory, which uses a calendar metaphor to visualize overall email volumes as well as highlighted communication from user-selected contacts [180].

To facilitate interactive management of large volumes of email, we investigated techniques for visualizing temporal and geo-related attributes of email archives in previous work [86].

These techniques were based on a recursive pattern pixel visualization for displaying temporal aspects and on a map distortion technique to visualize the distribution of emails according to their geographic origin.

In contrast to the methods reviewed above, which all visualize structured attributes from the content headers (e.g., sender, date, and in-reply-to fields) or other meta data (e.g., geography), this chapter will focus on the analysis of unstructured text content of email messages. Themail, for example, is a typographic visualization of an individual’s email content over time [181]. The tool displays the most frequently used terms as “yearly words” in a large font in the background and a more detailed selection of “monthly words” in a small font in the foreground, where those terms are chosen according to their frequency and distinctiveness using an adapted tf-idf feature extraction approach.

A recent approach to visualizing emails in a self-organizing map [133] is probably closest to our own research. The authors propose an externally growing SOM with the aim of providing an intuitive visual profile of considered mailing lists and a navigation tool where similar emails are located close to each other. While both their and our approach use tf-idf feature vectors, the focus of the former is to adapt the map to the distribution of the underlying data by a growing process in order to avoid the time-consuming retraining process. Our approach, in contrast, deals with the issue of how well additional classification attributes, such as the classification into spam and ordinary email, are preserved within the SOM.

9.2 Self-organizing maps for content-based retrieval

The Self-Organizing Map [94] is a neural network algorithm that is capable of projecting a distribution of high-dimensional input data onto a regular grid of map nodes in a low-dimensional (usually two-dimensional) output space. This projection is capable of (a) clustering the data, and (b) approximately preserving the input data topology. The algorithm is therefore especially useful for data visualization and exploration purposes. Attached to each node on the output SOM grid is a reference (codebook) vector. The SOM algorithm learns the reference vectors by iteratively adjusting them to the input data by means of a competitive learning process.
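The competitive learning process can be illustrated with a minimal sketch. It is not the implementation used in this chapter: the function names, the linearly decaying learning rate and neighbourhood radius, the Gaussian neighbourhood function, and the initialisation from random input vectors are all assumptions chosen for brevity.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(nodes, x):
    """Grid position of the node whose reference vector is closest to x."""
    return min(nodes, key=lambda pos: sq_dist(nodes[pos], x))


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rectangular SOM; returns {(row, col): reference vector}.

    Hypothetical helper: the decay schedules below are illustrative choices.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0
    # Initialise every reference (codebook) vector with a random input vector.
    nodes = {(r, c): list(rng.choice(data))
             for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # one shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays
            sigma = sigma0 * (1.0 - frac) + 1e-3    # neighbourhood shrinks
            br, bc = best_matching_unit(nodes, x)   # competition: find winner
            for (r, c), w in nodes.items():
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                # Cooperation: the winner and its grid neighbours move towards x.
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return nodes
```

Because every update moves a reference vector part of the way towards an input vector, the trained vectors stay within the range spanned by the input data, and nodes that are close on the grid end up with similar reference vectors.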

SOMs have previously been applied in various data analysis tasks. An example of the application on a large collection of text documents is the well-known WebSom project. Several visualization techniques supporting different SOM-based data analysis tasks exist [179]. The U-matrix, for example, visualizes the distribution of inter-node dissimilarity, supporting cluster analysis. Component planes are useful for visualizing the distribution of individual components in the reference vectors in order to support correlation analysis. If the input data points are mapped to their respectively best matching map nodes, histograms of map population, such as the distribution of object classes on the map, are possible.
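The three views mentioned above can be sketched as follows, assuming the map is stored as a dictionary from grid position to reference vector. This data layout and all function names are assumptions for illustration, and the U-matrix shown is a simplified variant that stores one averaged neighbour distance per node.

```python
import math
from collections import Counter, defaultdict


def distance(a, b):
    """Euclidean distance between two reference or input vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def u_matrix(nodes, rows, cols):
    """Mean distance of each node's reference vector to its 4-neighbours."""
    matrix = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            dists = [distance(nodes[(r, c)], nodes[nb])
                     for nb in neighbours if nb in nodes]
            matrix[r][c] = sum(dists) / len(dists) if dists else 0.0
    return matrix


def component_plane(nodes, rows, cols, k):
    """Value of component k of every reference vector, laid out on the grid."""
    return [[nodes[(r, c)][k] for c in range(cols)] for r in range(rows)]


def class_histogram(nodes, labelled_vectors):
    """Count, per node, the class labels of the inputs mapped to that node."""
    hist = defaultdict(Counter)
    for vector, label in labelled_vectors:
        best = min(nodes, key=lambda pos: distance(nodes[pos], vector))
        hist[best][label] += 1
    return hist
```

High U-matrix values mark borders between clusters, while low values indicate regions of similar reference vectors.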


9.2.1 Use cases

Conceptually, we identify several interesting use cases for SOM-based visualization support in an email client:

• Classification. Using either automatic or manual methods, the SOM can be partitioned into regions representing different types of email, e.g., spam and non-spam email, business and private mail, and so on. For incoming email, the best matching region can then be identified and the mail can be classified as belonging to the label of that region.

• Retrieval. The user can search for email messages by mapping a query to the SOM node that best matches the query, followed by exploring the emails mapped to the neighborhood of that node using a technique like U-matrix or a histogram-based visualization to guide the search.

• Organization. The user can employ the SOM generated from his/her email archive to learn about the overall structure of the emails contained in the archive. The user might then create a directory hierarchy for organizing emails reflecting the SOM structure information.
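The classification and retrieval use cases can be sketched in a few lines. This is a simplified illustration rather than the system described here: regions are approximated by labelling each node with the majority class of the emails mapped to it, the neighborhood is a square of grid radius `radius`, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_node(nodes, vector):
    """Grid position of the node whose reference vector is closest."""
    return min(nodes, key=lambda pos: sq_dist(nodes[pos], vector))


def label_nodes(nodes, labelled_vectors):
    """Assign each populated node the majority label of its mapped emails."""
    votes = defaultdict(Counter)
    for vector, label in labelled_vectors:
        votes[best_matching_node(nodes, vector)][label] += 1
    return {pos: counts.most_common(1)[0][0] for pos, counts in votes.items()}


def classify(nodes, node_labels, vector, default="unknown"):
    """Classify an incoming email by the label of its best matching node."""
    return node_labels.get(best_matching_node(nodes, vector), default)


def retrieve(nodes, archive, query_vector, radius=1):
    """Return ids of archived emails mapped near the query's best node."""
    qr, qc = best_matching_node(nodes, query_vector)
    hits = []
    for email_id, vector in archive:
        r, c = best_matching_node(nodes, vector)
        if max(abs(r - qr), abs(c - qc)) <= radius:
            hits.append(email_id)
    return hits
```

Here `nodes` maps grid positions to reference vectors, `labelled_vectors` is a list of (feature vector, label) pairs, and `archive` is a list of (email id, feature vector) pairs.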

9.2.2 Feature Extraction

To obtain feature vectors from email data, we employ a well-known scheme from information retrieval. The n most frequent terms from the subject fields of all emails in the archive are determined after having filtered the irrelevant terms using a stop-word list in order to avoid inclusion of non-discriminating terms in the description. Then, the tf-idf document indexing model [6] is applied, considering each email to be a document titled by its subject field. The model assigns to each document and each of the terms a weight indicating the relevance of the term in the given document with respect to the document collection. By concatenating the term weights for a given document we obtain a feature vector (descriptor) for that document.
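The term selection step can be sketched as follows. The tokenizer, the tiny stop-word list, and the function names are illustrative assumptions; the stop-word list actually used is not given in the text.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real list would be much longer.
STOP_WORDS = frozenset({"the", "a", "an", "and", "or", "of", "to", "in",
                        "for", "is", "re", "fw", "fwd"})


def tokenize(subject):
    """Lower-case a subject line and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", subject.lower())


def top_terms(subjects, n, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop-word terms over all subjects."""
    counts = Counter()
    for subject in subjects:
        counts.update(t for t in tokenize(subject) if t not in stop_words)
    return [term for term, _ in counts.most_common(n)]
```

The returned list fixes both the vocabulary and the order of the components in every feature vector.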

The tf-idf vectors can be calculated by counting the frequencies of the terms. Usually, the term frequency count is normalized to prevent a bias towards longer documents, which naturally contain terms with higher frequencies due to the overall document length. Therefore, the following formula is used to calculate the term frequency $tf_{i,j}$ of term $t_i$ in email $e_j$:

$$tf_{i,j} = \frac{n_{i,j}}{\max_k (n_{k,j})} \tag{9.1}$$

where $n_{i,j}$ is the number of occurrences of the considered term in email $e_j$ and the denominator is the maximum occurrence of any single term in $e_j$. Note that there exist variations of the normalized term frequency, for example, using the sum instead of the maximum function in the denominator. The inverse document frequency $idf_i$ measures the general importance of term $t_i$ with respect to the whole email collection $E$:

$$idf_i = \log \frac{|E|}{|\{e_j : t_i \in e_j\}|} \tag{9.2}$$


Thereby, terms that rarely occur in the whole document collection get high idf values since those terms are characteristic for that collection, whereas terms that occur in almost every document only result in small idf values. The tf-idf value of term $t_i$ in email $e_j$ is then calculated as the product of its term frequency and its inverse document frequency:

$$\text{tf-idf}_{i,j} = tf_{i,j} \times idf_i \tag{9.3}$$
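Equations 9.1 to 9.3 can be combined into a short sketch that turns tokenized emails into feature vectors. The function name and the input layout are assumptions. The base-10 logarithm is also an assumption, since Equation 9.2 does not fix the base, and terms that occur in no email are given weight 0 to avoid a division by zero.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Compute one tf-idf vector per document.

    docs: list of token lists (one per email);
    vocabulary: ordered list of terms defining the vector components.
    """
    n_docs = len(docs)
    vocab_set = set(vocabulary)
    # Document frequency: number of emails that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens) & vocab_set)
    # Equation 9.2 (log base 10 is an assumption of this sketch).
    idf = {t: math.log10(n_docs / df[t]) if df[t] else 0.0
           for t in vocabulary}
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        max_count = max(counts.values(), default=0)
        # Equation 9.1 (max-normalised tf) times idf gives Equation 9.3.
        vectors.append([(counts[t] / max_count) * idf[t] if max_count else 0.0
                        for t in vocabulary])
    return vectors
```

A term that appears in every email gets an idf of 0 and therefore contributes nothing to any vector, which matches the intuition that such terms do not discriminate between emails.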

The tf-idf vector of a document is composed of the respective tf-idf values of all terms. For the two sample emails in Figure 9.1, we first count the term frequencies of the terms “buy”, “new”, “car”, “today”, “internet”, and “toy”. From these counts and the maximum term count in each email (Equation 9.1), we can calculate the normalized term frequencies, such as $tf_{\text{buy},1} = 2/2 = 1$ and $tf_{\text{internet},2} = 1/1 = 1$. Next, the resulting inverse document frequencies are $idf_{\text{buy}} = \log(100/13)$ and $idf_{\text{internet}} = \log(100/7)$, given the collection size of 100 emails. Therefore, using the base-10 logarithm, the relevance of the term “buy” is 0.89 for $e_1$ and that of the term “internet” is 1.15 for $e_2$. Note that the term “internet” is more important for the second email than the term “buy” is for the first email, since the former term only occurs in 7 emails whereas the latter appears in 13 emails.
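The two values in this example can be re-derived with a few lines of arithmetic. The helper names are hypothetical; the base-10 logarithm is the one that yields the stated values.

```python
import math


def tf(count, max_count):
    """Max-normalised term frequency (Equation 9.1)."""
    return count / max_count


def idf(collection_size, doc_freq):
    """Inverse document frequency (Equation 9.2), base-10 logarithm."""
    return math.log10(collection_size / doc_freq)


# "buy" occurs 2 times in e_1 (maximum count 2) and in 13 of 100 emails.
relevance_buy = tf(2, 2) * idf(100, 13)
# "internet" occurs once in e_2 (maximum count 1) and in 7 of 100 emails.
relevance_internet = tf(1, 1) * idf(100, 7)

print(round(relevance_buy, 2), round(relevance_internet, 2))  # 0.89 1.15
```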

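[Figure 9.1: sample emails used in the tf-idf example (slide footer: Florian Mansmann, 21st July 2005, CEAS). Recoverable content of one email: From: mail@provider.com; Subject: flatrate!; body: “Get cheap internet. …”]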

