
4.3 Calibration of the CPU Costs

4.4.2 Cache Miss Tests

Modern CPUs allow counting certain performance-critical events, such as cache misses, during the execution of other operations. The Performance Counter Library [BM] provides a uniform interface to start and stop event counting independently of the underlying system. As processors differ widely in the supported events and their meaning, the library attempts to identify a unified event set. However, first tests showed that, in contrast to other systems, the P4 processor counts only cache misses caused by read accesses. When executing write-only operations, the performance counter of the P4 returned a small constant number of cache misses, independent of the actual amount of processed data. We therefore separated read and write cache miss calculation and tried to validate at least the former with the following experiments.

Working with the performance counters, we observed another anomaly concerning the L1 measurements. Unlike the L2 cache, the P4 architecture divides the L1 into a data and an instruction cache unit; cache events are counted separately as well. The obtained number of misses in the L1 data cache, however, turned out to be extraordinarily high in some of the tests, which could not be explained by the amounts of data actually involved. We thus restricted our observations to the more stable L2 misses. The restriction seems admissible since L2 misses are by far more interesting with respect to data access costs.

Within 30 test runs, a stepwise increasing number of context nodes was chosen randomly. The experiments measured the occurring cache misses during single axis step executions from the selected context nodes. In order to avoid cache hits due to

After one initial run for allocation and loading of an array exceeding L_i but still fitting into L_{i+1}, the program determined the time elapsed during a constant number of n read or write accesses. For measuring the sequential latency, the array was traversed in uniform strides of length Z_i. Random accesses, on the other hand, were simulated by tricking the system: the array index of the next fetch was encoded in the currently accessed array cell, forcing the CPU to complete the evaluation of each fetch instruction before starting the next one. In the third test case, sequential writing to main memory, m array cells were filled successively with 4-byte integers, causing a write back of n = 4m/Z_i cache lines. Finally, in each case the obtained time was divided by the number of accesses n. For i > 1, the sequential miss latencies of all caches L_{<i} have to be subtracted.
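The random-access trick and the write-back accounting above can be sketched as follows. This is a minimal illustration, not the original test program: the pointer-chasing array stores in each cell the index of the next cell to fetch (a single random cycle), so every fetch depends on the previous one; the helper for the write test simply evaluates n = 4m/Z_i. All sizes and the seed are arbitrary example values.

```python
import random

def build_chase(size, seed=0):
    """Build a single-cycle permutation: cell i holds the index of the next
    cell to fetch, so each access depends on the result of the previous one."""
    rng = random.Random(seed)
    order = list(range(size))
    rng.shuffle(order)
    chase = [0] * size
    for a, b in zip(order, order[1:] + order[:1]):
        chase[a] = b
    return chase

def chase_accesses(chase, n):
    """Follow the chain for n dependent fetches, mimicking random reads."""
    idx = 0
    for _ in range(n):
        idx = chase[idx]
    return idx

def write_back_lines(m, line_bytes):
    """Sequentially writing m 4-byte integers dirties n = 4*m / Z_i lines."""
    return (4 * m) // line_bytes

chase = build_chase(1024)
_ = chase_accesses(chase, 5000)
print(write_back_lines(m=16384, line_bytes=64))  # 1024 lines written back
```

In a real measurement the chased array would of course live in native memory and be timed with hardware counters; the sketch only shows why the dependency chain defeats out-of-order execution of the fetches.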

[Plot omitted; y-axis: cache misses.]

Figure 4.5: Cache misses caused by the major axis steps on oid-BATs.

[Plot omitted; y-axis: cache misses.]

Figure 4.6: Cache misses caused by descendant/ancestor steps on void-BATs.

the data loaded in previous test runs, the caches were cleared before each new measurement by traversing a large, uninvolved BAT. Furthermore, to account for instruction cache misses, the modeled calculation was calibrated by adding the number of misses obtained from an initial operation call with an empty context set. Since these first tests aimed at validating the presented calculation models, the required values (Tables 4.1, 4.2, 4.3) were gathered as accurately as possible, ignoring whether they would be available at query optimization time.
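The calibration step above can be sketched as follows. The function names and the linear per-node model are hypothetical stand-ins, not the actual cost formulas of this chapter: the point is only that the misses counted for a call with an empty context set serve as a fixed baseline added on top of the modeled data misses.

```python
def modeled_data_misses(context_size, misses_per_node):
    """Hypothetical stand-in for the cache miss model of one axis step."""
    return context_size * misses_per_node

def calibrated_estimate(context_size, misses_per_node, empty_call_misses):
    """Add the baseline measured with an empty context set; it accounts for
    the instruction cache misses of the operator call itself."""
    return modeled_data_misses(context_size, misses_per_node) + empty_call_misses

# Example: a baseline of 120 misses measured once with an empty context set.
baseline = 120
print(calibrated_estimate(1000, 3, baseline))  # 3120
```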

The result figures show the modeled numbers of cache misses connected by lines, for better distinction from the plotted measurement points; the connecting lines do not imply that the model yields a continuous graph.

Starting with the oid-algorithms (Fig. 4.5), the descendant, ancestor, and preceding axes always require scanning the entire node set; the former two also include a traversal of all context nodes. The measurements as well as the model perfectly reflect this easily described access pattern. In the case of the void-versions (Fig. 4.6–4.8), the cache miss graphs show similar characteristics, but the total number of misses remains smaller due to the increased storage density of the void-BATs, especially for the algorithms running on the level table. Searching for skipping effects, we have to take a look at the first parts of the graphs, displaying the

[Plot omitted; x-axis: context set size, y-axis: cache misses; series: prec measured/model, foll measured/model.]

Figure 4.7: Cache misses caused by preceding/following steps on void-BATs.

[Plot omitted; x-axis: context set size, y-axis: cache misses; series: chl measured/model, par measured, ps measured, par/ps model, fs measured/model.]

Figure 4.8: Cache misses caused by the level-based axis steps.

results for small context sets. Unlike the oid-algorithms, the measurements show significantly fewer cache misses for small context sets, which is again matched well by the modeled values. Smaller cache miss numbers, however, increase the relative deviation of the estimates. The modeled misses remain a nearly constant amount below the actually measured ones, as clearly visible for the preceding axis (Fig. 4.7). The same error becomes apparent for all axes when running the experiments on smaller documents.

The second test restricted the choice of context nodes to those lying at the second level of the document tree. Notice that this limitation keeps the chosen context sets very small, whereas still a large part of the node set has to be scanned. Second-level nodes thus also increase the likelihood of observing skipping effects.

Although the size of the scanned node-set partition varies strongly from one test to another, the model still remains accurate for the oid-algorithms (Fig. 4.9).

In contrast, skipping in the case of the descendant axis causes the highest relative estimation error (Fig. 4.10). The cache miss models for the level-based algorithms capture at least the outline of the measurements (Fig. 4.11).

Summarizing the validation tests, the proposed models provide accurate estimates of the number of expected cache misses. Even in those cases where higher

Figure 4.9: Cache misses caused by steps on the major axes using the oid-algorithms, originating from second-level context nodes.

Figure 4.10: Cache misses caused by void-descendant/ancestor steps originating from second-level context nodes.

Figure 4.11: Cache misses caused by the level-based axis steps originating from second-level context nodes.

relative errors are observed, the absolute deviations remain quite small. The models are thus well suited as a basis for the further derivation of the cost models.