• Keine Ergebnisse gefunden

traffic and events

2.4 Building a data warehouse for network traffic and eventsevents

2.4.1 Cube definitions

For our data warehouse, we used the relational database technology of the open source database PostgreSQL [141] because it offers a robust, reliable, and efficient approach for storing and managing large volumes of data. While collaborating with network administrators of our uni-versity and staff of the network security team at AT&T, we were able to design four types of facts and populate them with data: the netflows cube for storing network traffic extracted from the university gateway, thewebserver cubeconsisting of log entries of the webserver of our working group, thebotnets cube containing data from a signature-base Botnet detection mechanism, and thesnort cube, which is defined through alerts generated by the IDS Snort

2.4. Building a data warehouse for network traffic and events 29

that was set up for test purposes. Table 2.2 describes these four cubes, their dimensions, and measures.

Measures Dimensions

Cube #bytes #flows #requests #IPs #alerts IPaddress Time Port Browser OS Alert

netflows P P P

2 1 2

webserver P P P

1 1 1 1

botnets P P

2 1

snort P P

2 1 2 1

Table 2.2: Cubes of the network security data warehouse with their associated measures and dimensions. Inside the table the aggregation functions for the applicable measures and the number of dimensions of a particular type for the applicable dimensions are denoted.

Transactional facts are periodically retrieved from network devices, the webserver log or IDS and stored in the respective fact table. For network traffic, for instance, an entry describes a single network packet consisting of source and target IP addresses and ports, the timestamp, and the size of the payload. When measuring network traffic using several sensors, special precaution has to be taken in order not to count traffic redundantly when it passes multiple hosts on its way to its destination. To reduce the volume of facts to be stored, all packets and connections referring to the same source, destination and time interval are aggregated into a single fact, with the number of sessions and their total size in bytes as its two measures.

A single day of network traffic measured on the main gateway of our mid-size university, for example, stores about 10 million connections, which are already aggregated on hourly intervals to approximately 2 million facts. An overview of the logical database design of the netflow cube is depicted in Figure 2.9. Using the snowflake schema, dimensional tables (e.g., IPAddress, Port) were normalized into subtables for each granularity level. The time dimension was left denormalized like in the star schema since data warehouse systems provide their own routines for handling temporal characteristics.

IP address dimension

IP address dimension is common to all of our cubes, some cubes even have two dimensions of this type, for example, the source and the destination IP addresses in the netflow cube. A balanced hierarchy is defined upon the IP address dimension using the following consolidation path: IP address → IP prefix → autonomous system → country → continent and we thus

30 Chapter 2. Networks, intrusion detection, and data management

netflows Timestamp SourceIP DestinationIP SourcePort

DestinationPort IPAddress IPAddress Prefix

Network Prefix StartIP EndIP ASN

Auton_System ASN

Name CountryID Country

CountryID Longitude Latitude Name ContinentID Continent

ContinentID Longitude Latitude Name Port

PortID Category Domain Description

Time Timestamp Millisecond Second Minute Hour Day Month Year PortCategory

Category

Description NumConn

NumBytes PortDomain

Domain Description

Dimensions Measures

Figure 2.9: Modeling network traffic as an OLAP cube. Each entry of the fact table netflows is linked to its dimensional values and stores the measures NumConn and NumBytes obtain the following hierarchy with the number of entries at each level:

continents (7) countries (190)

autonomous systems (23054) prefixes (197427)

The hosting country of an autonomous system is determined by looking up the geograph-ical positions of the IP addresses of all the networks it contains and choosing the prevailing country. For this, we rely on the GeoIP database of Maxmind, Ltd. [122] (approx. 99%

accuracy).

A global map from IP prefixes to ultimate AS names and numbers can be somewhat com-plicated to obtain. Though a local map can be extracted from any border gateway router, due to route aggregation, it is unlikely to list many terminal or leaf-level AS’s. Therefore, maps should be obtained from multiple vantage points in the Internet and aggregated, raising the problem of consistency and completeness, especially in the presence of intentional efforts to spoof AS identifiers. This problem has been studied and there are useful heuristic methods based on dynamic programming [120] and public prefix-to-ASN tables available. Unfortu-nately, these tables date back to 20041and we therefore reverted to the data we extracted from a single routing table in September 2006 [7].

Time dimension

Time is a very common and important dimension in almost all data analysis scenarios. In the data cubes presented here,timestampsare aggregated bymillisecond→second→minute→

1A possible side-effect could be that previously large AS’s contain less prefixes, which makes the AS level easier to render in the visualizations.

2.4. Building a data warehouse for network traffic and events 31

hour →day → month→ year. Database systems provide support to this unique dimension in form of functions for extracting and manipulating temporal properties of interest. How-ever, because these functions differ from product to product, we modeled time as an OLAP dimension to provide the full OLAP analysis capabilities.

Port dimension

Thenetflowsandsnort cubesboth share the port dimension. TCP and UDP application ports may be grouped intocategories(i.e., well-known, registered and dynamic ports as mentioned in Section 2.1.4) on the one hand, and into domains (e.g., web, email, database, etc.) on the other hand. These two consolidation paths allow calculation of two different upper level aggregates. Note furthermore that the netflows cube contains two port dimensions, namely SourcePortandDestinationPort.

Other dimensions

Naturally, each application domain has its own dimensions characterizing the facts. In the scenarios discussed in this thesis, the hierarchical dimensionalert (consolidation path: alert

→ alert category) as well as non-hierarchical dimensionsbrowser and operation system are contained in the snort and netflows cubes. Note that the dimensions presented here can by no means be considered complete or perfectly modeled since complicated analysis tasks might pose novel requirements (e.g., disk space efficiency, performance, finer granularities, etc.) to the model.

Measures

As mentioned before, measures are defined through a numerical attribute as well as an aggre-gation function or a set of aggreaggre-gation functions applicable to it. For all of the measures pre-sented here, the sum aggregation function was used. Evidently, other aggregation functions, such as average and bounds, or derived aggregation functions (e.g., alerts per IP address) can be easily defined. As shown in Table 2.2, the measures number of bytes, connections, re-quests, flows, IPs, andalerts were used in the four presented cubes of the network security data warehouse.