
9.4.1 Testing Environment

Hardware and Software

All experiments were run on 2 cores of an Intel Xeon 2.6 GHz with 4 GB of RAM under Ubuntu 8.04 in a virtual machine (VMware ESXi 3.5.0). Our code was compiled with gcc v4.3 using optimization level -O2.

Search Graph

The search graph was already used in the previous chapter. It was constructed from the schedule of the federal German railway company (Deutsche Bahn AG) for 2008. It encompasses all German long-distance and local trains. The key figures were presented in Table 8.1.

Test Set

We use 5,000 real customer queries taken from the requests to the internet portal of Deutsche Bahn AG as available on http://www.bahn.de. Each query has an interval of 3 hours. In Figure 9.4 all origin-destination pairs are drawn onto a map of Germany. On the left-hand side, source stations are red, while terminal stations are blue. Stations serving both as the source of one query and the terminal of another are violet.

In the figure on the right hand side, the pairs are ordered by the starting minute of their departure interval. We stacked them from 0:00 at the bottom to 23:59 at the top with a color gradient from green for 0:00 to blue for 23:59.

9.4.2 Measures and Test Procedures

9.4.2.1 Performance Measurement

Runtimes heavily depend on the machine used and on the optimization of the code. Rather than on runtime alone, we therefore concentrate on the number of created labels and the number of extractions from the priority queue, our significant operations. These are our key criteria for comparing the performance of different variants and parameterizations of our algorithm. As runtime scales similarly to the numbers of created and extracted labels, the significant operations are a good indicator of computational complexity and runtime. Therefore, whenever we talk about performance, we present the average numbers of created and extracted labels as well as the average runtimes.
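To illustrate what we mean by counting the significant operations, the following sketch shows how such counters could be attached to a priority-queue-based search loop. It is a minimal illustration under simplifying assumptions, not our implementation: the Label structure, the comparator, and the omitted relaxation step exist only for this example.

    // Sketch: count created labels and priority-queue extractions,
    // the two significant operations, alongside the wall-clock runtime.
    #include <chrono>
    #include <cstdint>
    #include <queue>
    #include <vector>

    struct Label {               // placeholder multi-criteria label
        int node;
        int travelTime;
        int interchanges;
    };

    struct Stats {
        std::uint64_t createdLabels = 0;   // labels constructed during the search
        std::uint64_t extractions   = 0;   // extractions from the priority queue
        double        runtimeMs     = 0;
    };

    Stats runQuery(/* graph, query, ... */) {
        Stats stats;
        auto start = std::chrono::steady_clock::now();

        auto cmp = [](const Label& a, const Label& b) { return a.travelTime > b.travelTime; };
        std::priority_queue<Label, std::vector<Label>, decltype(cmp)> pq(cmp);

        pq.push(Label{/*source*/ 0, 0, 0});
        ++stats.createdLabels;

        while (!pq.empty()) {
            Label current = pq.top();
            pq.pop();
            ++stats.extractions;          // one significant operation per extraction

            // Relaxation omitted: every new label built for an outgoing edge
            // would increment stats.createdLabels before being pushed.
            (void)current;
            break;                        // placeholder; the real search loops until done
        }

        stats.runtimeMs = std::chrono::duration<double, std::milli>(
                              std::chrono::steady_clock::now() - start).count();
        return stats;
    }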


Figure 9.4: Origin-destination pairs for our 5,000 test queries. Origins are red, destinations blue, stations that are both are violet (left). On the right the pairs are connected and stacked from 0:00 (green, bottom) to 23:59 (blue, top).

For runtimes we took the average over 10 runs to control for variations due to background processes, I/O, etc. The algorithm is deterministic, so the numbers of all significant operations do not vary between individual runs. For our reference version (see below), the runtime average of a single run was at most 1.77 ms higher and at most 1.79 ms lower than the average of 412.14 ms over all ten runs. This deviation of at most 0.43% in either direction is small enough to give us high confidence in our runtime statements.
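As a quick check of the reported deviation bound using the figures above:

\[
\frac{\max(1.77\,\mathrm{ms},\ 1.79\,\mathrm{ms})}{412.14\,\mathrm{ms}}
  = \frac{1.79}{412.14} \approx 0.0043 = 0.43\%.
\]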

9.4.2.2 Quality Measurement

Some of our speedup techniques are heuristics. Therefore, we need a measure for the quality of the computed sets of results. Instead of using some score-based approach for quality, we decided to measure the loss of quality relative to other test runs (e.g. heuristics turned on or off, or different parameterizations).

When comparing the quality of two versions, say base (B) and heuristic (H), we look at the set of calculated connections for each query individually. To counteract the effect of significantly different quality (e.g. a heuristic only determines connections dominated by connections of (B)) and of result sets of varying sizes (one result set is much larger but contains many worse connections), we first take the union of the results determined by (B) and (H). Afterwards, we apply filtering to the resulting set and remove duplicates.

As both versions may have computed results of different quality, this filtering may remove results from the union. In most cases the applied filtering rules are those relevant for the reference version, unless stated otherwise. The number of surviving connections for each version is the number of connections determined by that version that remain in the filtered union. Note that connections determined by both versions are counted as survivors for both. The quality loss of (H) for this query is then defined as the number of survivors of (B) minus the number of survivors of (H).

Think of an example for bi-criteria search with travel time and number of interchanges without relaxation. While (H) determined the results (100min, 3), (130min, 2), and (145min, 1), version (B) delivers the connections (100min, 3) and (125min, 1). The filtered union of the results only contains the connections (125min, 1) and (100min, 3). Thus, (B) has a quality loss of zero, since it found all optimal connections. Meanwhile (H) found only one of the optimal connections (in the union). Consequently, it has a quality loss of one. Note that the number of suboptimal connections is not important. If a version had determined 10 connections that are not optimal (in the union) for the example above, it would not lose 10 connections in quality. It would have a quality loss of two, as it did not find either of the two optimal connections.
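The following small sketch reproduces the survivor counting for this example. The Pareto-dominance filter and the data layout are simplified assumptions made for illustration; they are not our actual implementation.

    // Sketch: per-query quality loss for the bi-criteria example
    // (travel time in minutes, number of interchanges).
    #include <algorithm>
    #include <iostream>
    #include <utility>
    #include <vector>

    using Conn = std::pair<int, int>;   // (travel time, interchanges)

    // a dominates b if it is no worse in both criteria and differs in at least one
    bool dominates(const Conn& a, const Conn& b) {
        return a.first <= b.first && a.second <= b.second && a != b;
    }

    // keep only the Pareto-optimal connections of the union, without duplicates
    std::vector<Conn> filterUnion(std::vector<Conn> all) {
        std::sort(all.begin(), all.end());
        all.erase(std::unique(all.begin(), all.end()), all.end());
        std::vector<Conn> result;
        for (const Conn& c : all) {
            bool dominated = false;
            for (const Conn& d : all)
                if (dominates(d, c)) { dominated = true; break; }
            if (!dominated) result.push_back(c);
        }
        return result;
    }

    // survivors of a version = its connections remaining in the filtered union
    int survivors(const std::vector<Conn>& version, const std::vector<Conn>& filtered) {
        int count = 0;
        for (const Conn& c : version)
            if (std::find(filtered.begin(), filtered.end(), c) != filtered.end())
                ++count;
        return count;
    }

    int main() {
        std::vector<Conn> H = {{100, 3}, {130, 2}, {145, 1}};   // heuristic version
        std::vector<Conn> B = {{100, 3}, {125, 1}};             // base version

        std::vector<Conn> all = H;
        all.insert(all.end(), B.begin(), B.end());
        std::vector<Conn> filtered = filterUnion(all);           // {(100,3), (125,1)}

        // quality loss of (H): survivors of (B) minus survivors of (H) = 2 - 1 = 1
        int lossH = survivors(B, filtered) - survivors(H, filtered);
        std::cout << "quality loss of H = " << lossH << '\n';
    }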

Summing up the quality loss over all queries gives us the total quality loss in connections. Normalized by the number of connections determined for all queries, we get our first quality criterion, the quality loss in connections (Qconn) in percent.

Two versions with nearly identical quality loss in connections may have totally different distributions of the lost optima. One may have lost quality for a quarter of the queries, whereas the other may have lost quality for only every tenth query. To capture the affected queries we have a second quality criterion, the queries with worse quality (Qquery), i.e. the number of queries for which (H) determined fewer survivors than (B), normalized by the number of queries.
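Written out as formulas, the two criteria read as follows (the survivor counts surv_q are our own shorthand for the numbers defined above):

\[
Q_{\mathrm{conn}}(H) =
  \frac{\sum_{q} \bigl(\mathrm{surv}_q(B) - \mathrm{surv}_q(H)\bigr)}
       {\text{number of connections determined for all queries}},
\qquad
Q_{\mathrm{query}}(H) =
  \frac{\bigl|\{\, q : \mathrm{surv}_q(H) < \mathrm{surv}_q(B) \,\}\bigr|}
       {\text{number of queries}}.
\]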

Whenever we want to compare the quality of two or more variants, we select one base variant and determine the loss of quality in connections and queries relative to that variant. For ease of exposition we will talk about losing x% of the optimal connections or missing optimal connections for y% of the queries.

Assume we have 200 queries and found 800 connections. If variant (H) has a quality loss in connections Qconn(H) = 10% and Qquery(H) = 25% queries with worse quality, it lost 80 optimal connections, distributed over 50 of the 200 queries.
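As a quick check:

\[
0.10 \cdot 800 = 80 \text{ lost connections}, \qquad 0.25 \cdot 200 = 50 \text{ affected queries}.
\]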

9.4.2.3 Measurement of Speedups

Baseline Variant

A baseline implementation without speedup techniques requires between 15 and 20 million extractions from the priority queue and nearly 5 minutes of runtime per query. Our 10 reference runs over all 5,000 queries would take about half a year to complete. We therefore do not consider the improvement over this version a fair measure of the speedup.
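The half-year estimate follows directly from the per-query runtime:

\[
5\,\text{min} \cdot 5{,}000\,\text{queries} \cdot 10\,\text{runs}
  = 250{,}000\,\text{min} \approx 174\,\text{days}.
\]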

Reference Version

We want to measure the effect of each setting/heuristic within the whole setup. Hence, and because of the unfair measure mentioned above, we do not start from the baseline variant without speedup techniques and then add the techniques, either individually or one after another.

Instead we start with a fully optimized version without heuristics and only change the parameters currently under investigation. When testing lower bounds, for example, we will disable the lower bounds for each of the four criteria separately and look in detail at the effect of using different ways to obtain lower bounds on the number of interchanges (from different graphs and with different options for the algorithm that determines the lower bounds).

Our reference version uses the optimal setting among our parameters for lower bounds, goal-direction, and priority queue type, and no heuristics at all, thus delivering optimal quality at the best possible speed. Throughout this chapter, the reference version is marked with a star (⋆) in all tables in which it appears.
