Evaluation - High-Performance Persistent Identification for Research Data Management

In this section, we provide an experimental evaluation of the two DSID approaches discussed in the previous section. Since these approaches do not require individual PIDs to be registered and updated, the evaluation is limited to the resolution of PIDs, which are composed by concatenating the DSID with an internal identifier, such that PID = DSID / INTERNAL IDENTIFIER. Note that accelerating the resolution of PIDs that are associated with individual PID records is addressed comprehensively in Chapter 6.
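The composition rule PID = DSID / INTERNAL IDENTIFIER can be illustrated with a minimal sketch. The function names, the use of the evaluation prefix 21.T11996 as the DSID, and the internal identifier "dataset-42" are assumptions made for illustration only; they are not taken from the Handle System software.

```python
def compose_pid(dsid: str, internal_id: str) -> str:
    """Concatenate a DSID and an internal identifier into a PID."""
    return f"{dsid}/{internal_id}"


def internal_identifier(pid: str, dsid: str) -> str:
    """Return the internal identifier of a PID that starts with the given DSID.

    Hypothetical helper: it only strips the known DSID and the separating
    slash, and rejects PIDs that do not belong to this DSID.
    """
    prefix = dsid + "/"
    if not pid.startswith(prefix) or len(pid) == len(prefix):
        raise ValueError(f"{pid!r} is not a PID under DSID {dsid!r}")
    return pid[len(prefix):]


# Hypothetical example values (assumption, not from the evaluation data).
pid = compose_pid("21.T11996", "dataset-42")
suffix = internal_identifier(pid, "21.T11996")
```

Passing the DSID explicitly avoids guessing where the DSID ends inside the PID string.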

Hence, the response times for administering individual PID records, depicted in Figure 3.3, remain unimproved. However, with our proposed approach these expensive administration operations no longer have to be applied to each individual PID record. Instead, the corresponding DSID record has to be transactionally updated only when the research data repository undergoes modifications that affect its data access interface. Note, however, that with option (A) the tracking of individual research datasets still requires transactional administration procedures.

The performance improvement for individual PID records is addressed in Chapter 4 and Chapter 5.

3.4.1 Evaluation Setup

The evaluation setup is depicted in Figure 3.11 and consists of the following components:

Testing GPR: Since the approach with the HS_RDS_URL-typed Handle Value requires a modified processing at the GPR, we placed a testing GPR at an Amazon EC2 c3.xlarge instance in Ireland. In Figure 3.11, this testing GPR is labeled as GPR-EU. This testing GPR was equipped with our extended core software package of the Handle System [102].

[Figure 3.10: GPR resolution algorithm: The greyed area represents our extension to realize the DSID resolution approach at the GPR. Again, the yellow box represents the core function of our extension. Flowchart nodes: arrived resolution request; get LHSInfo; LHSInfo has HS_RDS_URL Handle Value; buildResponseForRDSLocator(thePID, LHSInfo); compose locator; response(locator); resolveFromCache; is NULL; resolveGlobally; get HandleRecord from LHS; response(HandleRecord); send response.]

Productive GHR: The productive GHR was used in the usual way for storing the HS_RDS_URL-typed Handle Values into a prefix Handle Record. For this evaluation, we used the 0.NA/21.T11996 prefix Handle Record.

Testing LHS: The LHS, responsible for the 21.T11996 prefix, was installed at GWDG. It was composed of a primary and a mirror Handle server. This LHS was particularly interesting for the approach with the HS_NAMESPACE type.

Load-Generators: To produce an appropriate workload, we placed four load-generators at different geographical locations. Three of them were each hosted on an Amazon EC2 c3.xlarge instance: the first load-generator (LG-EU) was hosted in Frankfurt, the second in US-east (LG-US) and the third in Singapore (LG-AS). The remaining load-generator (LG-GWDG) was hosted at GWDG.
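The control flow of Figure 3.10, which the testing GPR executes for each request, can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of the extended Handle System package: the callables passed in are hypothetical stand-ins for the proxy's internals, LHSInfo is modelled as a plain dictionary, and the locator template (RDS URL followed by the internal identifier) is assumed for the example.

```python
from typing import Callable, Optional, Tuple


def build_response_for_rds_locator(pid: str, lhs_info: dict) -> str:
    """Compose a locator from the HS_RDS_URL value and the internal identifier.

    ASSUMPTION: the locator is the RDS URL plus "/" plus the part of the PID
    after the first slash. The real composition rule may differ.
    """
    _, _, internal_id = pid.partition("/")
    return lhs_info["HS_RDS_URL"].rstrip("/") + "/" + internal_id


def resolve_at_gpr(
    pid: str,
    get_lhs_info: Callable[[str], dict],
    resolve_from_cache: Callable[[str], Optional[dict]],
    resolve_globally: Callable[[str], dict],
) -> Tuple[str, object]:
    """Follow the branches of the flowchart for one resolution request."""
    lhs_info = get_lhs_info(pid)
    if lhs_info.get("HS_RDS_URL") is not None:
        # Extension branch: compose the locator directly at the GPR.
        return ("locator", build_response_for_rds_locator(pid, lhs_info))
    # Ordinary branch: try the cache first, then ask the responsible LHS.
    record = resolve_from_cache(pid)
    if record is None:
        record = resolve_globally(pid)
    return ("record", record)
```

In the extension branch no request to the LHS is issued, which is the property the measurements below examine.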

3.4.2 Measurements

The overall evaluation comprises two measurement procedures. The first measurement procedure consisted of two one-hour measurement runs. In both runs, the testing GPR was subjected to HTTP resolution requests from the three load-generators hosted on Amazon EC2 instances. In the first run, the Handle-PIDs were resolved by means of the HS_RDS_URL type approach (option (B)). In the second run, in contrast, they were resolved by means of the HS_NAMESPACE type approach (option (A)).

The second measurement procedure also consisted of two one-hour measurement runs. Its purpose is to reveal the difference in performance between resolving ordinary individual PIDs and the dynamic locator composition approach based on the HS_NAMESPACE type. In contrast to the first measurement procedure, only the GWDG load-generator issued HTTP resolution requests in this procedure, and it sent them directly to the LHS without involving the testing GPR.
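A measurement run of this kind can be sketched as a loop that times each resolution request and then summarizes the samples. The tooling actually used by the load-generators is not described here, so the function names and the idea of passing the resolver as a callable are assumptions for illustration.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def measure_resolution_times(
    resolve: Callable[[str], object], pids: Iterable[str]
) -> List[float]:
    """Time each call to resolve(pid) and return the durations in milliseconds."""
    samples_ms = []
    for pid in pids:
        start = time.perf_counter()
        resolve(pid)  # e.g. an HTTP request to the GPR or the LHS
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return samples_ms


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Return the mean and the sample standard deviation of the timings."""
    stdev = statistics.stdev(samples_ms) if len(samples_ms) > 1 else 0.0
    return {"mean_ms": statistics.mean(samples_ms), "stdev_ms": stdev}
```

The resolver passed as `resolve` would wrap the HTTP request to the component under test, so the same loop can serve both measurement procedures.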

Finally, the measurements of the first measuring procedure are depicted by Figure 3.12 and Figure 3.13. Figure 3.12 illustrates the average resolution times. In addition to that, Figure 3.13 reveals the relative improvements for the resolution times. The relative improvement is measured by the formula

$\frac{R_S(B) - R_S(A)}{R_S(B)}$,

where $R_S(A)$ and $R_S(B)$ denote the resolution times recorded with option (A) and option (B) respectively.

Furthermore, Figure 3.14 shows the per-request resolution times at the LHS for the ordinary approach and the option (A) approach (recall that option (B) does not involve the LHS).

3.4.3 Analysis

From Figure 3.12 we can see that the resolution time depends strongly on the network latency between the involved components. For the resolution by means of the option (A) approach, the resolution time is mainly composed of the latencies of three connections: first, between the load-generator and the testing GPR, secondly, between the testing GPR and the GHR, and thirdly, between the testing GPR and the LHS. In Figure 3.11, the communication latencies are depicted as directed graphs.

Hence, the resolution time via option (A) includes latencies 1 to 3. In contrast, since with option (B) the locator composition is performed directly at the GPR, the resolution time with the HS_RDS_URL type approach is composed of only latencies 1 and 2. Therefore, for each load-generator, the resolution time with option (B) is always shorter than with option (A). In addition, the resolution time with option (B) is more stable than with option (A), which can be seen from the deviation whiskers plotted on the bars in Figure 3.12.

With increasing geographical distance between the load-generator and the GPR, the overall resolution time becomes, as expected, longer. However, the improvement achieved with option (B) also becomes smaller, which can be clearly seen from Figure 3.13. For the EC2 load-generator located in Frankfurt, the relative improvement is about 43%, whereas for the EC2 load-generator located in Singapore it is only about 10%.
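Why the improvement shrinks with distance can be seen from a simplified latency model: option (B) saves latency 3, which does not depend on the location of the load-generator, while latency 1 grows with distance. The sketch below ignores processing time, and all latency values are hypothetical and not taken from the measurements. The computed share is an illustrative quantity and not necessarily the exact metric plotted in Figure 3.13.

```python
def resolution_time_a(l1: float, l2: float, l3: float) -> float:
    """Option (A): load-generator to GPR, GPR to GHR, GPR to LHS."""
    return l1 + l2 + l3


def resolution_time_b(l1: float, l2: float) -> float:
    """Option (B): the locator is composed at the GPR, so latency 3 is skipped."""
    return l1 + l2


def saved_share(l1: float, l2: float, l3: float) -> float:
    """Fraction of the option (A) time that is saved by skipping latency 3."""
    time_a = resolution_time_a(l1, l2, l3)
    return (time_a - resolution_time_b(l1, l2)) / time_a


# Hypothetical latencies in milliseconds (assumption, for illustration only).
near = saved_share(l1=5.0, l2=10.0, l3=20.0)    # load-generator close to the GPR
far = saved_share(l1=150.0, l2=10.0, l3=20.0)   # load-generator far from the GPR
```

With these example values the saved share is larger for the nearby load-generator than for the distant one, because the same saved latency 3 is divided by a larger total.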

Furthermore, Figure 3.14 shows that there is no significant difference between the resolution time with the HS_NAMESPACE type approach and the ordinary resolution approach. The average resolution time for resolving ordinary Handle-PIDs is 8.09 ms, whereas with the HS_NAMESPACE resolution approach it is 8.15 ms. The fluctuations in the resolution times of both approaches are also very similar. This means that the algorithm for composing the locators at the LHS does not introduce a significant overhead on the resolution time. It should be noted that in our setup, the MySQL databases attached to the Handle servers (primary and mirror) held about three million records.
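The size of this overhead follows directly from the two averages reported above. The short calculation below uses only those two values; the variable names are chosen for illustration.

```python
ordinary_ms = 8.09    # average resolution time, ordinary Handle-PIDs
namespace_ms = 8.15   # average resolution time, HS_NAMESPACE approach

# Absolute overhead of the locator composition at the LHS.
overhead_ms = namespace_ms - ordinary_ms

# Overhead relative to the ordinary resolution time.
relative_overhead = overhead_ms / ordinary_ms

print(f"overhead: {overhead_ms:.2f} ms ({relative_overhead:.2%})")
```

The difference is about 0.06 ms, which is less than one percent of the ordinary resolution time.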

Note that due to caching at the productive GPR, frequently resolved ordinary Handle-PIDs would have the same resolution times as with the option (B) approach. The difference is that with option (B), all DSID-prefixed Handle-PIDs (not only the cached ones) would benefit from the shortened delegation procedure.
