DNS and Handle System Comparison - High-Performance Persistent Identification for Research Data

As already indicated in the earlier chapters, PIDs are technically comparable to DNS domain names. The fundamental operations offered by both a PID system and the DNS are, first, registration and, second, resolution. In both systems, registration is carried out transactionally and is therefore an expensive operation.

A DNS registration agency usually receives a large number of registration requests from a large number of individual registrants, who are often still human users. In contrast, a PID registration agency often receives requests from a relatively small number of registrants, each of which, however, issues requests at a high frequency. Furthermore, these registrants are mostly machine agents of research data repositories. For an individual registrant with a fairly low frequency of registrations or updates, an expensive write operation is usually acceptable. For a research data repository with a high data ingest or modification frequency, however, it can quickly cause a serious performance degradation in the whole data management workflow. Therefore, a high-performance PID management protocol is required to manage such large amounts of research datasets efficiently.

Because of its versatility and sophisticated protocol, this thesis concentrates on the Handle System. In this chapter, we therefore focus on improving the Handle protocol in order to realize such a high-performance persistent identifier management protocol. Since the Handle System constitutes the core component of various other PID systems, such as the ePIC system, an improvement of the Handle protocol can potentially also improve these systems. Moreover, in Section 4.5.7 we have already indicated that the underlying Handle server and its communication pattern with the attached database could pose a bottleneck in the ePIC PID system.

Both systems, DNS and Handle, can be considered hierarchical distributed databases whose global operation is ensured by a specific communication protocol, the DNS protocol and the Handle protocol respectively. However, there are also significant differences between the two systems, which the comparison in the following subsections reveals.

5.2.1 Namespace

The DNS domain namespace is a hierarchical tree of labels, each of which corresponds to an individual domain. A single domain name is uniquely addressed by its label together with all of its parent labels, separated by dots (".").

The Handle System namespace consists of two parts, a prefix and a suffix, which are separated by the slash character ("/"). The prefix is comparable to a domain name since it consists of hierarchical labels separated by dots. The suffix, in turn, is a locally unique name assigned to an individual research dataset. In conjunction with the prefix, the locally unique suffix becomes globally unique.
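The prefix/suffix structure described above can be illustrated with a minimal Python sketch. The names `Handle`, `parse_handle`, and `prefix_labels` are hypothetical helpers introduced here, and the example identifier is made up. Splitting at the first slash is an assumption of this sketch, chosen so that a suffix may itself contain further slashes.

```python
from typing import List, NamedTuple


class Handle(NamedTuple):
    prefix: str  # hierarchical, dot-separated labels
    suffix: str  # locally unique name within the prefix


def parse_handle(handle: str) -> Handle:
    """Split a Handle into prefix and suffix at the first slash (assumption)."""
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a valid Handle: {handle!r}")
    return Handle(prefix, suffix)


def prefix_labels(handle: Handle) -> List[str]:
    """Return the dot-separated labels that make up the prefix."""
    return handle.prefix.split(".")


# Made-up identifier, used for illustration only.
example = parse_handle("21.T0001/datasets/0001")
```

Here `example.prefix` is `"21.T0001"` and `example.suffix` is `"datasets/0001"`; only the combination of both parts is globally unique.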

5.2.2 Architecture

In the DNS, each domain is under the stewardship of exactly one zone. A zone, in turn, manages and controls one or more domains. Each zone consists of one or more master nameservers and zero or more slave nameservers. There is a myriad of globally distributed zones, which are arranged in a hierarchy. The root of this hierarchy is the zone with the label ".". It holds the information about each responsible Top Level Domain (TLD) zone; these zones in turn form the second level of the hierarchy.

A resolution request is therefore delegated beginning from the root zone towards the responsible zone for an individual domain name.

The architecture of the Handle System has already been described in Section 3.3.1. In the Handle System, the counterpart of the DNS root zone is the Global Handle Registry (GHR), and the counterpart of a DNS nameserver is an individual Handle server of a Local Handle Service (LHS).

Likewise, a Handle-PID resolution request is delegated from the GHR towards the responsible LHS.
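The two-step delegation from the GHR to the responsible LHS can be sketched as a toy model. This is not the Handle protocol itself: the dictionaries `GLOBAL_REGISTRY` and `LOCAL_SERVICES`, the service names, the prefixes, and the URLs are all hypothetical and exist only to show the lookup order.

```python
from typing import Dict, List

# Hypothetical toy data: none of these prefixes, services, or URLs are real.
GLOBAL_REGISTRY: Dict[str, str] = {
    "21.T0001": "lhs-a",
    "21.T0002": "lhs-b",
}
LOCAL_SERVICES: Dict[str, Dict[str, List[str]]] = {
    "lhs-a": {"21.T0001/dataset-1": ["https://repo-a.example/dataset-1"]},
    "lhs-b": {"21.T0002/dataset-9": ["https://repo-b.example/dataset-9"]},
}


def resolve(handle: str) -> List[str]:
    """Resolve a Handle by delegation: the GHR first, then the responsible LHS."""
    prefix = handle.partition("/")[0]
    service = GLOBAL_REGISTRY[prefix]       # step 1: GHR names the responsible LHS
    return LOCAL_SERVICES[service][handle]  # step 2: that LHS returns the values
```

An unknown prefix or Handle simply raises `KeyError` in this sketch; a real resolver would return a protocol-level error instead.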

5.2.3 Data Model

In the DNS system, the data stored in the nameservers are denoted as Resource Records. In principle, an individual Resource Record can be reduced to a key-value pair that binds a domain name to its corresponding IP address.

Data in the Handle System is stored in Handle Records consisting of multiple Handle Values, which are comparable to the Resource Records of the DNS system. Examples of a Handle Record have already been provided in Figure 5.2 and Figure 5.3. In addition, Figure 5.4 provides a general representation of a Handle Record and a Handle Value.

The core difference between a Handle Record and a Resource Record is that Handle Records are not restricted to a fixed set of permissible data types, as is the case for Resource Records.

This is a major advantage of the Handle System, which enables it to be positioned in other areas, especially in the area of the Internet of Things (IoT).

It should be noted that Chapter 6, which is dedicated to the resolution problem, also covers the data model of both systems. However, the focus of the current chapter is on the composition of Handle Records and Handle Values, whereas Chapter 6 focuses on the transformation of Handle Records into Resource Records.

Figure 5.4: Generic Handle Record consisting of multiple Handle Values (HandleValue_0 to HandleValue_N), each with the fields type, index, data, ttl, timestamp, permission, and references.
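The structure shown in Figure 5.4 can be written down as a small Python sketch. The field names are taken from the figure; the field types, the default values, and the helper `values_of_type` are assumptions made for illustration and do not reflect the encoding used by the Handle protocol.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HandleValue:
    index: int          # position of the value within its record
    type: str           # free-form type name, not limited to a fixed set
    data: bytes         # payload of the value
    ttl: int = 0        # placeholder default (assumption)
    timestamp: int = 0  # placeholder default (assumption)
    permission: str = ""  # placeholder representation (assumption)
    references: List[str] = field(default_factory=list)


@dataclass
class HandleRecord:
    handle: str
    values: List[HandleValue] = field(default_factory=list)

    def values_of_type(self, type_name: str) -> List[HandleValue]:
        """Return all Handle Values of this record with the given type."""
        return [v for v in self.values if v.type == type_name]


# Made-up record with one conventional and one freely chosen value type.
record = HandleRecord(
    handle="21.T0001/dataset-1",
    values=[
        HandleValue(index=1, type="URL", data=b"https://repo.example/dataset-1"),
        HandleValue(index=2, type="CHECKSUM", data=b"d41d8cd9"),
    ],
)
```

Because `type` is an arbitrary string here, a record can carry values of any kind, which mirrors the absence of a fixed set of permissible data types in Handle Records.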

5.2.4 Protocol

Initially, the DNS protocol was designed for queries only. This is why Resource Records are stored in a zone file instead of a real database. However, with the increasing number of update requests, the protocol has been extended by an update operation [107].

In the reference implementation, an update is transactionally appended to a log file, which is then used for propagating the update to the slave nameservers. Upon the restart of the master nameserver, the new zone file is constructed with the updated information stored in the log file.

An update operation is always processed by the primary master nameserver.
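The update handling described above (append to a log, propagate from the log, rebuild the zone file on restart) can be sketched as a toy model. The class `ToyMaster` and its methods are hypothetical, and plain Python containers stand in for the zone file and the log file; this is not the code of any DNS reference implementation.

```python
from typing import Dict, List, Tuple


class ToyMaster:
    """Toy primary master nameserver with an append-only update log."""

    def __init__(self, zone_file: Dict[str, str]) -> None:
        self.zone_file = dict(zone_file)          # state as of the last restart
        self.journal: List[Tuple[str, str]] = []  # append-only update log

    def update(self, name: str, address: str) -> None:
        """Record an update by appending it to the log."""
        self.journal.append((name, address))

    def entries_for_slaves(self, since: int = 0) -> List[Tuple[str, str]]:
        """Return the log entries that slave nameservers still have to apply."""
        return self.journal[since:]

    def restart(self) -> None:
        """Rebuild the zone file from the logged updates and empty the log."""
        for name, address in self.journal:
            self.zone_file[name] = address
        self.journal.clear()
```

In this sketch, an update becomes part of the zone file only after `restart()`, while slaves can already pick it up from the log via `entries_for_slaves()`.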

The Handle protocol, in contrast, offers a much richer set of operations, with management operations forming an essential part of the protocol from the very beginning. In the Handle System, too, a management operation is always carried out as a transactional procedure.

In the reference implementation of the Handle protocol, a single management operation involves two transactions. First, the update is transactionally appended to an internal database, which is used to feed the replication system. This is followed by an access to the attached main database, which ensures that the updates are persistently written to disk.
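The cost of this pattern can be made concrete with a toy model based on Python's `sqlite3` module. Two in-memory databases stand in for the internal replication database and the attached main database; the table layouts and the functions `manage` and `ingest` are hypothetical and are not taken from the Handle server implementation.

```python
import sqlite3


def open_store(schema: str) -> sqlite3.Connection:
    """Create an in-memory database with a single table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    return conn


# Stand-ins for the internal replication database and the main database.
replication_log = open_store(
    "CREATE TABLE log (seq INTEGER PRIMARY KEY, handle TEXT, data TEXT)"
)
main_db = open_store("CREATE TABLE handles (handle TEXT PRIMARY KEY, data TEXT)")


def manage(handle: str, data: str) -> int:
    """Apply one management operation; return the number of committed transactions."""
    committed = 0
    with replication_log:  # transaction 1: append to the replication log
        replication_log.execute(
            "INSERT INTO log (handle, data) VALUES (?, ?)", (handle, data)
        )
    committed += 1
    with main_db:  # transaction 2: persist the update in the main database
        main_db.execute(
            "INSERT OR REPLACE INTO handles (handle, data) VALUES (?, ?)",
            (handle, data),
        )
    committed += 1
    return committed


def ingest(n: int) -> int:
    """Register n made-up Handles one by one and count all transactions."""
    return sum(
        manage(f"21.T0001/dataset-{i}", f"https://repo.example/{i}")
        for i in range(n)
    )
```

In this model, `ingest(1000)` commits 2000 transactions, which shows how the number of transactions grows linearly with the number of individually processed requests.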

The Handle protocol also defines its own authentication and authorization mechanism. Using the Handle protocol, an authenticated user can manage any Handle Record stored at any LHS in the world, provided that the user is also authorized. In contrast, the DNS system has no standardized common mechanism for authentication and authorization, which can be considered an indication that the DNS system was not conceived for data management. This also underlines that the Handle System, although it appears to be a DNS imitation, is a globally distributed data management system rather than a data lookup registry such as the DNS system.

5.2.5 Workload

The workload of the DNS system is mainly characterized by resolution operations. The registration procedure, on the other hand, is often still carried out manually and hence usually involves a relatively long delay until a domain name is updated and globally propagated. Moreover, as already mentioned, the large number of registration requests is often still issued by individual human users. Note that in DNS, too, there are machine agents issuing requests to DNS registration agencies. One example is virtualization (cloud) technology, in which DNS entries are registered automatically upon the instantiation of new virtual machines. However, for an individual user, the instantiation time of a virtual machine is usually not highly critical, since the actual focus is on the performance of the virtual machine itself rather than on the management of the instance.

In the Handle System, registration operations are increasingly issued by machine agents belonging to research data repositories. Usually, the requests of an individual research data repository can be grouped into batches that share a common information pattern corresponding to individual research datasets. However, each of these requests is usually processed in a dedicated, isolated transactional administration procedure. This often causes a serious performance degradation at the respective research data repository.

In addition, the communication between an individual research data repository and a PID system is usually characterized by a high request frequency. This stems from the fact that PID record administration has become an integral part of the research data management workflow.

After contrasting the Handle System with the DNS system, we provide in the following section an abstract description of our core idea of extending the Handle protocol with a bulk operation. The actual technical implementation of this idea is then addressed in the subsequent section.
