CHAPTER 4

Publishing and Consuming Open Government Data

As essential parts of the data life cycle, publishing and consuming are vital to any open government data initiative. Without data and its re-use, an open data initiative is doomed to fail. In this chapter we therefore focus on these two processes, with the aim of identifying their specific characteristics within open government data initiatives.

The act of publishing data is the very basis of open government data initiatives. Government and public entities are sharing data on the Internet at an astonishing pace. Yet, there is a lack of agreed-upon standards for data publishing [38], and as discussed in detail in Section 3.5, many challenges must be overcome before the published data can be exploited to its full potential. While not all challenges are directly related to publishing, tackling publishing issues at the root could prevent subsequent issues related to data consumption. For example, if data is published in a machine-readable format with good metadata descriptions, then usability issues will most probably be avoided when it is consumed.

Publishing data makes it available for use by the public, in pursuit of the main aim of open government data initiatives, namely the use, re-use, and distribution of the published data. This is only achievable through the consumption of the data by stakeholders. Data can be consumed in a number of ways. The most direct is to obtain a copy of the actual published data, generally with the aim of using it for a specific use case. Certain portals might also provide exploration tools, where a data consumer can simply look through the published data. Other tools, such as analysis tools, enable a consumer to identify potential patterns in the published data. Analysis tools usually also provide visualisations, which help data consumers to view the data in a pictorial manner. An even more hands-on way of consuming the data is to create mashups, where different datasets are merged in order to create new knowledge from existing data.

4.1 Publishing Data

In this section we provide a classification of different data publishing approaches, and proceed to discuss guidelines and best practices for publishing data in any data publishing effort.

4.1.1 Data Publishing Approach Classification

Open government data publishing initiatives can be classified along two dimensions:

1. The technological approach followed by the data publisher in the actual act of publishing data, i.e. making the data available on the Web. Publishing initiatives are classified within this approach according to the technologies used for publishing the data. These include:

a) The format of the published data (proprietary, machine readable, descriptive);

b) The access method (RESTful APIs, custom APIs, search interfaces);

c) The use of Linked Open Data principles (HTTP, URIs, RDF); and

d) The level of linkage to different datasets (Linked Open Data Cloud).

As is evident, the above reflect most of the existing guidelines for publishing data, particularly the Five Star Scheme for Linked Open Data.

2. The organisational approach followed by the data provider, i.e. the manner in which the data is provided to the data consumers. This second dimension for open government data publishing initiatives focuses on the provision of data, rather than the actual act of publishing. The authors of [67] identify two different methods of providing Linked Open Data, the epitome of an open government initiative, each with its own advantages and disadvantages.

a) Direct Data Provision- Direct data provision involves a one-stop portal aggregating all processed and value-added data provided by a public entity. In this case, the data publisher is not necessarily the same as the data provider. Where the two are different entities, maintainability is limited unless an effective data synchronisation process is in place. For example, if the original data from the public entity changes over time, this change must be reflected in the data provided on the data portal, otherwise the data on the portal will be obsolete [38]. An advantage of direct data provision, however, is the consumers' direct access to data through a single entry point.

b) Indirect Data Provision- Data catalogues are a good example of indirect data provision, where the data cannot be directly accessed through the catalogue. Catalogues contain links (metadata) to the actual data provided by the public entity. To access data, a consumer has to search for the relevant data through the catalogue, then follow the provided links to the public entity that provides the actual data. In contrast to direct data provision, indirect data provision has the advantage that the data is up to date and unique, since the actual data is provided by the data producers, and the catalogue simply provides links to it. On the other hand, any processing or adding of value has to be performed by the data consumer, as the data catalogue cannot provide processed data.
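
The trade-off between the two provision methods can be illustrated with a minimal Python sketch. It is not taken from [67] or from any real portal: the names `CatalogueEntry`, `PortalCopy` and `is_stale`, the example URL and the dates are all assumptions made for illustration. A catalogue entry only links to the source, whereas a portal copy must be checked against the source to detect obsolescence.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CatalogueEntry:
    """Indirect provision: the catalogue stores metadata and a link only."""
    title: str
    source_url: str  # points at the public entity hosting the data (hypothetical)


@dataclass
class PortalCopy:
    """Direct provision: the portal hosts its own (processed) copy."""
    title: str
    rows: list
    copied_at: datetime  # when the portal last synchronised with the source


def is_stale(copy: PortalCopy, source_modified_at: datetime) -> bool:
    """A portal copy is obsolete if the source changed after the last sync."""
    return source_modified_at > copy.copied_at


# Hypothetical example data.
entry = CatalogueEntry("Air quality 2015",
                       "https://example.org/data/air-quality.csv")
copy = PortalCopy("Air quality 2015",
                  [["station", "pm10"], ["A", "21"]],
                  copied_at=datetime(2016, 1, 10))

# The source was updated after the portal synchronised, so the copy is stale.
stale = is_stale(copy, source_modified_at=datetime(2016, 3, 1))
print(stale)
```

The catalogue entry never goes stale in this sense, because it holds no data of its own. The price is that the consumer must fetch and process the data behind `source_url` themselves.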

4.1.2 Publishing Guidelines

In order to tackle the issues discussed in Section 3.5, and other publishing-related problems, a number of publications in the literature, such as [54,77,136], propose guidelines for publishing data on the Web. The basis of most of these guidelines is the Eight Open Government Data Principles:

1. Complete- All available public data that is not subject to privacy, security or privilege limitations is made available.

2. Primary- Data is made available as it is available at the source, and not aggregated or modified.

3. Timely- Data is made available to the public as soon as possible after the actual data is created, in order to preserve the value of the data.

4. Accessible- Data is made available to the widest possible range of consumers, with no limitations on its use.

5. Machine Processable- Data is published in a structured manner, to allow automated processing.

6. Non-Discriminatory- Data is available for all to use, without requiring any registration.

7. Non-Proprietary- Data is published in a format which is not controlled exclusively by a single entity.

8. Licence-Free- Other than allowing for reasonable privacy, security and privilege restrictions, data is not subject to any limitations on its use due to copyright, patent, trademark or trade secret regulations.

The above principles provide a roadmap for the data publisher and help result in good open government data with the best potential for being consumed by the stakeholders. Further to these principles, the Five Star Scheme for Linked Open Data, listed below, provides a more technical guide towards publishing Linked Open Data:

1. Available on the Web in any format but with an open licence (Open Data);

2. Available as machine-readable structured data (e.g. Microsoft Excel table instead of image scan of a table);

3. Available as machine-readable structured data in a non-proprietary format (e.g. CSV instead of Microsoft Excel);

4. All of the above as well as using open standards from W3C (RDF and SPARQL) to identify things;

5. All of the above as well as linking the published data to other existing data to provide context.
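
Because each star requires all of the previous ones, the scheme can be read as a cumulative rating. The following Python sketch shows one way to compute it. The `Dataset` fields and the `star_rating` function are hypothetical names introduced here for illustration, and the boolean flags are assumed to have been determined by some prior inspection of the dataset.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    # Each flag corresponds to one level of the scheme (hypothetical names).
    on_web_open_licence: bool    # star 1
    machine_readable: bool       # star 2
    non_proprietary: bool        # star 3
    uses_rdf_and_uris: bool      # star 4
    linked_to_other_data: bool   # star 5


def star_rating(ds: Dataset) -> int:
    """Count stars cumulatively: a level only counts if all lower levels hold."""
    levels = [
        ds.on_web_open_licence,
        ds.machine_readable,
        ds.non_proprietary,
        ds.uses_rdf_and_uris,
        ds.linked_to_other_data,
    ]
    stars = 0
    for satisfied in levels:
        if not satisfied:
            break
        stars += 1
    return stars


excel_sheet = Dataset(True, True, False, False, False)
csv_file = Dataset(True, True, True, False, False)
print(star_rating(excel_sheet), star_rating(csv_file))  # prints "2 3"
```

Note that a dataset without an open licence receives no stars in this sketch, whatever its format, since the first level is a precondition for all the others.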

In order to provide official guidelines, the W3C eGov Interest Group has also developed the following set of steps for publishing open government data¹, which emphasise standards and methodologies to encourage the publishing of government data, with the aim of enabling easier use by the public:

1. Identify- The use of permanent, patterned and/or discoverable URI/URLs enables processes and people to find and consume the data more easily.

2. Document- Documentation makes the data more understandable and less ambiguous, and also makes it easier to discover. Formats such as XML and RDF are, to a large extent, self-documenting.

3. Link- Linked data contains links to other data and documentation, providing context.

4. Preserve- The use of versioning of datasets enables data consumers to cite and link to present and past versions, where new and upgraded datasets can refer back to original datasets. Versioning also allows the documentation of changes between versions.

5. Expose interfaces- To make it easier for published data to be discovered and explored, published data should be both human-readable and machine-readable. Preferably, data should be published separate from the interface, and external parties should have direct access to raw data. This enables them to build their own interfaces if needed.

6. Create standard names/URIs for all government objects- The use of a unique identifier for each object is as important as having information about the object itself. This aids in discoverability, improves metadata, and ensures authenticity.

¹ http://www.w3.org/TR/gov-data/ (Date accessed: 2 August 2016)
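
The Identify and Preserve steps can be sketched together in a few lines of Python. This is an illustration under stated assumptions rather than a scheme prescribed by the W3C eGov Interest Group: the base URI `https://data.example.gov`, the URI pattern, and the function names `dataset_uri` and `version_record` are all hypothetical.

```python
from urllib.parse import quote

BASE = "https://data.example.gov"  # hypothetical base URI


def dataset_uri(agency: str, dataset: str, version: str = "") -> str:
    """Build a patterned URI of the form /{agency}/{dataset}[/{version}]."""
    parts = [quote(agency.lower()), quote(dataset.lower().replace(" ", "-"))]
    if version:
        parts.append(quote(version))
    return BASE + "/" + "/".join(parts)


def version_record(agency, dataset, version, previous=None):
    """Describe one version and link it back to the version it replaces."""
    record = {"uri": dataset_uri(agency, dataset, version), "version": version}
    if previous is not None:
        record["replaces"] = previous["uri"]
    return record


v1 = version_record("Transport", "Bus Stops", "2016-01")
v2 = version_record("Transport", "Bus Stops", "2016-08", previous=v1)
print(v2["uri"])       # https://data.example.gov/transport/bus-stops/2016-08
print(v2["replaces"])  # https://data.example.gov/transport/bus-stops/2016-01
```

Because the pattern is predictable, a consumer who knows one URI can guess the others, and because every version keeps its own URI, earlier versions remain citable after an update.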

Along with the above, the W3C eGov Interest Group also discusses the importance of choosing what data to publish, the right format to publish it in, and the restrictions on its use. Data which is to be shared with the public should be published in compliance with applicable laws and regulations, and only after addressing issues of security and privacy. Such data is usually already available in other formats, and may already have been shared with the public in other ways. The recommended way to publish this data is in its raw form, serialised as XML and RDF, to allow for easy manipulation. The use of established open standards is also recommended. Finally, the published data should be accompanied by clear documentation of any legal or regulatory restrictions on its use.
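
As a rough illustration of serialising raw tabular data as RDF, the Python sketch below converts CSV rows into N-Triples lines using only the standard library. It is a simplification and not a procedure described by the W3C eGov Interest Group: the URI prefix `BASE`, the `def/` path for properties, and the function names are assumptions, every value is emitted as a plain string literal, and column names and identifiers are assumed to be safe to embed in a URI without further encoding.

```python
import csv
import io

BASE = "https://data.example.gov/bus-stops/"  # hypothetical URI prefix


def escape_literal(value: str) -> str:
    """Escape characters that may not appear raw in an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(text: str, id_column: str) -> list:
    """Turn each CSV cell into one triple: <row URI> <column URI> "value" ."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        # Assumption: identifiers and column names are already URI-safe.
        subject = f"<{BASE}{row[id_column]}>"
        for column, value in row.items():
            if column == id_column:
                continue
            predicate = f"<{BASE}def/{column}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


raw = "id,name,town\n1,Main Street,Valletta\n"
for line in csv_to_ntriples(raw, "id"):
    print(line)
```

A production conversion would normally reuse existing vocabularies for the predicates and attach datatypes to the literals. The sketch only shows how the row identifier becomes the subject and each column becomes a property.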

Liu et al. [77] present recommendations for data publishing and analysis based on a survey of the sustainability-related datasets published by the Australian government, with the aim of identifying underlying opportunities and issues. While they do not entirely reflect the above-mentioned guidelines, the proposed recommendations complement their essential aspects. The authors address commonalities amongst data published by different public entities, the ideal formats for publishing data as Linked Data, its discoverability, and its re-usability.

Similarly, the authors of [119] identify common issues and challenges to the accessibility and re-usability aspects of public sector information. De Rosnay and Janssen point out that such obstacles can be of legal, institutional, technical or cognitive nature. They proceed by providing common solutions that can be implemented to overcome these issues.

In [136], Solar et al. propose a maturity model for open data, with the aim of assessing the commitment and capabilities of public agencies in pursuing the principles and practices of open data. The authors extend the discussed guidelines and principles by considering other aspects towards publishing data, including an Establishment and Legal Perspective, a Technological Perspective, and finally a Citizen and Entrepreneurial Perspective.

Another maturity model is defined in [80]. Here Lourenço and Serra aim to identify the essential contextual aspects which affect the way data is published by public entities on their portals. These aspects are then organised into an online transparency for accountability maturity model, which has the purpose of assessing the level of advancement of a governing region. In other words, researchers who need to assess an entity should start by analysing the context using the proposed maturity model, and then proceed to define the assessment model depending on the identified maturity level.

4.1.3 Publishing Tools and Standards

While a large number of government data portals enable data producers to publish their data, few tools aid data publishers in this task. Yet, efforts are currently being focused on providing portals and other open government data initiatives which allow stakeholders to publish (and consume) datasets without requiring background knowledge of the open data life cycle. An example of such efforts is the LinDA project². A contribution within this project enables a stakeholder to publish data in any format, which is then converted to RDF to enable easy linking with other open datasets.

In [55], Hofman and Rajagopal propose a technical framework for data sharing between data providers and consumers, based on an analysis of a number of data platforms. They aim to identify, from the relevant literature, the required functionality for data sharing, considering challenges such as different published formats, data ambiguity, and privacy issues.

² http://linda-project.eu/ (Date accessed: 2 August 2016)