3.2 Performance in Persistent Identification

3.2.5 Data Access Interface Registration

The basic function of a regular naming authority is twofold: on the one hand, to enable the administration of PID records and, on the other hand, to respond to resolution requests by revealing the locators. A resolution operation essentially results in a lookup in its database for entries associated with the incoming PID string. For group (A) research data repositories, the most important entry within an individual PID record is the PID-to-locator binding. This entry in turn usually contains the following three information entities:

(a) The current Internet address of the research data repository.

(b) The current access request syntax used for retrieving datasets from that research data repository.

(c) The internal identifier of the research dataset stored in the research data repository.

The entities (a) and (b) compose the base part of the data access interface of a research data repository. This base part is usually included in all PID records associated with an individual research data repository, which means that the information entities (a) and (b) are usually stored redundantly in the database of a naming authority.

Therefore, for a group (A) research data repository that has been assigned a dedicated prefix, the DSID approach can be technically realized by associating the prefix with a composition rule based on the information entities (a) and (b). This requires only a single registration/update of the base part of the data access interface of the repository at a naming authority, which is then stored under the record of the DSID. A resolution request for an identifier prefixed by a DSID is then answered by dynamically extending the base part with the incoming information entity (c).
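The composition rule described above can be illustrated with a minimal sketch. It assumes that the base part is stored as a template string and that the suffix of the incoming PID is used directly as entity (c). The names `DsidRecord` and `compose_locator`, the prefix `99.00000`, and the address `repo.example.org` are hypothetical and serve illustration only; they are not the implementation presented in Section 3.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DsidRecord:
    """Hypothetical record registered once under a prefix (the DSID)."""
    prefix: str         # dedicated prefix of the repository
    address: str        # entity (a): current Internet address
    access_syntax: str  # entity (b): template with {address} and {suffix}


def compose_locator(record: DsidRecord, pid: str) -> str:
    """Extend the base part (a)+(b) with the incoming entity (c)."""
    prefix, sep, suffix = pid.partition("/")
    if sep != "/" or prefix != record.prefix or not suffix:
        raise ValueError(f"{pid!r} is not a PID under prefix {record.prefix!r}")
    return record.access_syntax.format(address=record.address, suffix=suffix)


# Illustrative values only (assumptions, not taken from a real registry).
record = DsidRecord(
    prefix="99.00000",
    address="repo.example.org",
    access_syntax="https://{address}/dataset/{suffix}",
)
locator = compose_locator(record, "99.00000/abc-123")
```

In this sketch, a change of the repository address or of the access request syntax would require updating only the single `DsidRecord`, not one record per dataset.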

An example of a group (A) repository is the ARCHE repository [6]. Its datasets are registered at a naming authority that is based on the ePIC PID system and therefore also on the Handle System. This naming authority is provided by the GWDG. Figure 3.6 shows an excerpt of the important entries (PID-to-locator bindings) from the database of this naming authority. In addition, Figure 3.7 depicts the complete entries of an individual PID. These entries form the Handle Record for one of the Handle-PIDs listed in Figure 3.6. The Handle Record is composed of three entries, of which the first is the most important, since it contains the URL (locator).

The second and third entries contain only internal information used by the ePIC PID system and the underlying Handle server for administrative purposes, which means that they do not hold any additional information specific to the identified dataset.

As can be seen from Figure 3.6 and from the first entry in Figure 3.7, the locators are composed of exactly the three entities: (a) id.acdh.oeaw.ac.at, which is the address of the repository; (b) https://{...}/uuid/, which specifies the current access request syntax; and (c) an internal identifier, which is opaque and hence appropriate to be used as a PID.

With the DSID concept, instead of registering each individual research dataset, only the information entities (a) and (b) have to be administered within the record of the prefix 21.11115.

[6] https://arche.acdh.oeaw.ac.at

PID                          URL
21.11115/0000-000B-C8D5-3    https://id.acdh.oeaw.ac.at/uuid/1f0bbeff-343d-5081-a887-90b632b2a05f
21.11115/0000-000B-C8D6-2    https://id.acdh.oeaw.ac.at/uuid/a80553c0-6be1-6a7f-a63b-16993060302d
21.11115/0000-000B-C8D7-1    https://id.acdh.oeaw.ac.at/uuid/84b8b78a-321a-4239-e091-3ee565a9737f
21.11115/0000-000B-C8D8-0    https://id.acdh.oeaw.ac.at/uuid/e1fc9dfe-b8c0-e9f3-9115-ae48421fb2ff
21.11115/0000-000B-C8D9-F    https://id.acdh.oeaw.ac.at/uuid/d62ca450-5693-6c45-c5b0-fe6890e366fc
21.11115/0000-000B-C8DA-E    https://id.acdh.oeaw.ac.at/uuid/cca1beff-5c9e-4223-970e-7faf7b5580a3
21.11115/0000-000B-C8DB-D    https://id.acdh.oeaw.ac.at/uuid/69bf3f5f-46d2-6296-dc98-a24d299ce9dc
21.11115/0000-000B-C8DC-C    https://id.acdh.oeaw.ac.at/uuid/c47f12a0-e05f-d2ea-a7f9-e4210e3f8df0
21.11115/0000-000B-C8DD-B    https://id.acdh.oeaw.ac.at/uuid/0e33f785-5ff5-55f2-f6f3-efea0453833d
21.11115/0000-000B-C8DE-A    https://id.acdh.oeaw.ac.at/uuid/72045ebc-b726-f10d-c3a4-4816abaa9d46
21.11115/0000-000B-C8DF-9    https://id.acdh.oeaw.ac.at/uuid/47841bc7-086d-ec40-978d-bb992e72a03b
21.11115/0000-000B-C8E0-6    https://id.acdh.oeaw.ac.at/uuid/37ff0e18-3197-7bcb-9d59-11aa6bd8a78c
21.11115/0000-000B-C8E1-5    https://id.acdh.oeaw.ac.at/uuid/830106dd-6494-7a4b-d455-432b6d05660e
21.11115/0000-000B-C8E2-4    https://id.acdh.oeaw.ac.at/uuid/53274ca7-51b8-cfc0-dd59-9f284254c5c7
21.11115/0000-000B-C8E3-3    https://id.acdh.oeaw.ac.at/uuid/b929ed98-a354-d82b-e733-d01ba5ae3b29
21.11115/0000-000B-C8E4-2    https://id.acdh.oeaw.ac.at/uuid/60e5d6aa-e48a-e298-c7eb-0f22e2a24745
21.11115/0000-000B-C8E5-1    https://id.acdh.oeaw.ac.at/uuid/f440a719-17c0-ed14-f091-d8c4331c09d7
21.11115/0000-000B-C8E6-0    https://id.acdh.oeaw.ac.at/uuid/07188f15-2948-73c0-8bc4-e3eadb91a072
21.11115/0000-000B-C8E7-F    https://id.acdh.oeaw.ac.at/uuid/bbe76c68-48ff-fe96-e638-ae2001e2de65
21.11115/0000-000B-C8E8-E    https://id.acdh.oeaw.ac.at/uuid/fd428ae8-0ed1-c285-a0bc-96d0bb759874
21.11115/0000-000B-C8E9-D    https://id.acdh.oeaw.ac.at/uuid/b7e43e5f-42a2-619d-d3de-5f04edd88900

Figure 3.6: Excerpt of database entries of a naming authority hosted by GWDG.

[Figure 3.7 content: Handle Record of the PID 21.11115/0000-000B-C8D7-1; legend: (a) address of repository, (b) access request syntax, (c) internal identifier]

Figure 3.7: Handle Record of a dataset stored in the ARCHE repository.

Often, research data repositories already contain huge numbers of research datasets that have not previously been registered at a PID system. For such repositories, which also belong to group (A), this concept assigns a PID to all contained research datasets by means of a single registration. The DSID approach can therefore lead to an enormous reduction of the maintenance overhead for PID administration.

Note that a concrete technical implementation of the DSID approach is provided only in the implementation section, Section 3.3. In contrast, the following subsections discuss limitations and side effects of this approach.

3.2.5.1 Tracking of Individual Research Datasets

On its own, the DSID approach cannot track the movement of individual research datasets.

Although the DSID approach specifically targets group (A) research data repositories, it is possible that specific datasets move to another research data repository. To retain the ability to track the movement of individual research datasets, a naming authority has to employ two different resolution functions. The first function is an ordinary lookup in its database for an entry matching the incoming PID. The second function applies the special locator composition function specified in the DSID record. To respond to a resolution request, the naming authority always starts with the ordinary database lookup. If no matching entry is found, the special composition function is applied, provided that the corresponding prefix record contains a locator composition rule. Otherwise, the resolution request is answered with an error message indicating that a matching entry could not be found in the database.
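The two-stage resolution described above can be sketched as follows. This is only an illustration under simplifying assumptions: the database of the naming authority is modelled as a plain dictionary, and each composition rule as a function from suffix to locator. The function name `resolve` and all PIDs and URLs are hypothetical.

```python
from typing import Callable, Dict


def resolve(
    pid: str,
    pid_records: Dict[str, str],
    composition_rules: Dict[str, Callable[[str], str]],
) -> str:
    """Resolve a PID: ordinary lookup first, composition as fallback."""
    # First function: ordinary database lookup for the incoming PID.
    locator = pid_records.get(pid)
    if locator is not None:
        return locator

    # Second function: apply the composition rule of the prefix, if any.
    prefix, _, suffix = pid.partition("/")
    rule = composition_rules.get(prefix)
    if rule is None or not suffix:
        raise LookupError(f"no matching entry found for {pid!r}")
    return rule(suffix)


# Illustrative data (assumptions): one dataset has moved to another
# repository and therefore has an individually registered record.
pid_records = {"99.00000/moved-1": "https://other.example.org/item/moved-1"}
composition_rules = {
    "99.00000": lambda suffix: f"https://repo.example.org/dataset/{suffix}",
}

moved = resolve("99.00000/moved-1", pid_records, composition_rules)
composed = resolve("99.00000/abc-123", pid_records, composition_rules)
```

Because the individual record is consulted first, a dataset that has moved keeps resolving to its new location, while all other identifiers under the prefix fall back to the composed locator.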

The overall resolution algorithm of a naming authority is depicted in Figure 3.8. The greyed area represents the special function block for resolving DSID-prefixed PIDs. This special function block is incorporated into the regular resolution algorithm between the red and blue dots. The black dashed line denotes the original resolution path.

Another issue is the verification of the existence of an individual PID record. Suppose that the DSID record of a prefix is associated with a locator composition function Handle Value. If the extended resolution path led directly from the red dot to the yellow dot (grey dashed line) without involving the green box, the resulting resolution algorithm would always return a positive answer, even when no individual PID record is stored in the database for the PID in question. Therefore, the green box represents a mechanism that makes it possible to distinguish between an individually registered locator and a dynamically generated locator.
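The role of the green box can be illustrated with a small sketch. It assumes that the "suppress composition" flag is supplied together with the resolution request and is checked before the prefix record is consulted; where the flag is actually stored and evaluated is not specified here, so this placement is an assumption. The names `Resolution` and `resolve`, and all PIDs and URLs, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Resolution:
    locator: str
    composed: bool  # True: dynamically generated; False: individually registered


def resolve(
    pid: str,
    pid_records: Dict[str, str],
    base_parts: Dict[str, str],
    suppress_composition: bool = False,
) -> Optional[Resolution]:
    """Return a Resolution, or None if no entry is found."""
    if pid in pid_records:
        return Resolution(pid_records[pid], composed=False)
    if suppress_composition:
        # Composition path disabled: only individual records count.
        return None
    prefix, _, suffix = pid.partition("/")
    base = base_parts.get(prefix)
    if base is None or not suffix:
        return None
    return Resolution(base + suffix, composed=True)


# Illustrative data (assumptions, not real registry content).
pid_records = {"99.00000/reg-1": "https://other.example.org/item/reg-1"}
base_parts = {"99.00000": "https://repo.example.org/dataset/"}

registered = resolve("99.00000/reg-1", pid_records, base_parts)
generated = resolve("99.00000/abc-123", pid_records, base_parts)
verified = resolve("99.00000/abc-123", pid_records, base_parts,
                   suppress_composition=True)
```

With the flag set, a request for an unregistered PID yields no result instead of a composed locator, so the existence of an individual PID record can be checked explicitly.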

For a dynamically generated locator, the existence of the corresponding dataset can ultimately be verified by the respective research data repository: the naming component (cf. Figure 3.4) of the research data repository, which generates the internal identifiers, usually also records additional information about the assigned identifiers. By means of this additional information, it is possible to confirm the validity of the suffixes appended to the corresponding DSID.

3.2.5.2 Metadata Associated with PIDs

Another shortcoming of the DSID approach is related to metadata. As already indicated in Section 3.2.3, PID systems usually also provide the possibility to retrieve individual metadata associated with the identified research dataset. The metadata is often included in the respective PID records in addition to the sole locators. The DSID approach does not comprise individual metadata. Therefore, this approach is not appropriate for group (B) research data repositories.

[Figure 3.8 flowchart; recoverable node labels: Arrived Resolution Request; getEntries(thePID); entryRecord is NULL; get PID Record; EXTRACT locator; getEntries(thePrefix); entryRecord is NULL; get DSID Record; suppress composition flag is SET; COMPOSE locator; response(locator); response(notFound); send response]

Figure 3.8: Naming authority resolution algorithm: The greyed area represents our extension to realize the DSID resolution approach. In this area, the yellow box represents the core function of our extension. The function represented by the green box is used to suppress the execution of the locator composition path.

The following subsection discusses a combined usage of DSIDs with regular PIDs for group (C) research data repositories.