The (environmental research) data lifecycle

Martin Schultz, JSC

Image credit: FZJ/Panknin: "7 Tage, 7 Bilder - keine Menschen, keine Worte - Tag 4/7" (7 days, 7 pictures - no people, no words - day 4/7)

#blackandwhitechallenge #sevendayssevenpictures

(2)

Hot off the press

08 Nov 2018: Copernicus journals sign up to COPDESS (http://www.copdess.org)

Motivation Data lifecycle Metadata Publish Data Summary

(3)

Science is changing!

Paul Magill at the first National Air Pollution Symposium in 1949 (Wikipedia)

SPARC General Assembly, Kyoto 2018

JUWELS: Germany's fastest supercomputer in September 2018 (12 PFlops)

CDC 6600 in 1964 (3 MFlops)

(4)

Data is changing!

Data available to Lewis Fry Richardson for first numerical weather prediction in 1921:

Upper air observations from sondes.

Squares marked 'P' provided atmospheric pressure values; those marked 'M' gave atmospheric momentum.

(source: http://www.abc.net.au/science/slab/forecast/story.htm)

Corrected reflectance from NASA Terra & Aqua on 21 Oct 2018: only one of ~90 satellite data products routinely assimilated into ECMWF's weather forecast system

(5)

Data volumes are growing exponentially!

ECMWF size of MARS archive

Genomic sequencing data

100 Pbytes in 2014

http://www.ifp.uni-stuttgart.de/publications/phowo15/110Wagner.pdf

STScI astronomical data

https://archive.stsci.edu/reports/BigDataSDTReport_Final.pdf

(6)

Data throughput keeps growing too!

CERN data output through ATLAS (1 CD ROM per second)

(7)

Paradigm change:

Those who understand data will excel in science

New job descriptions:

Data manager

Data scientist

(8)

Why should you care?

• You must publish your data when you publish scientific results

• Sharing data increases collaborations and advances knowledge

• Ever wanted to get access to other people's data?

• Try to re-analyse and understand old measurements!

DOAS measurements in SAPHIR (IEK-8)

John J. Kelley, USGS report, 1971

(9)

Motivation

Can anyone reproduce this plot 30 years from now?

(10)

Motivation

Can others put your data in context with other Earth system issues?

Mind map "Global Warming"

(11)

https://www.dataone.org/best-practices

The Data lifecycle

(12)

Plan data management early in your project

https://www.dataone.org/best-practices

Data lifecycle: Plan

• What kind of data will your instrument/model generate?

• How much data will your instrument/model generate?

• What kind of secondary data products will you generate?

• What kind of data will you want to/need to publish?

• Where will you store your data (and backups)?

• How will you process your data and where?

• How can you structure your data?

• What data do you need from others and how do you manage this?

Write this up so that others can understand it

→ You have a data management plan!

(13)

Software tools can help create a data management plan

https://dmptool.org/

Data lifecycle: Plan

(14)

Software tools can help create a data management plan

Trendy: make the DMP "machine actionable"

https://blog.dmptool.org/

Data lifecycle: Plan

(15)

This looks awful and cryptic?

Yes.

Machine actionable data management

(16)

But it provides structure and context

→ Software can be written to digest this information

→ Information can be shared, re-used, and discovered

Machine actionable data management

Elements of machine actionable descriptors:

• namespace

• structure

• unique id

• resolvable endpoint

• controlled vocabulary
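As a rough illustration of these descriptor elements, the sketch below builds a small record in Python. The namespace URL, the field names, and the vocabulary terms are all hypothetical and are not taken from any DMP standard; the point is only that software can check and digest such a record.

```python
import json
import uuid

# Hypothetical namespace and controlled vocabulary (illustration only).
NAMESPACE = "https://example.org/dmp/terms#"
DATA_TYPES = {"observation", "model_output", "derived_product"}


def make_descriptor(title, data_type, endpoint):
    """Return a descriptor with a namespace, a unique id, and a resolvable endpoint."""
    if data_type not in DATA_TYPES:
        raise ValueError(f"unknown data type: {data_type!r}")
    return {
        "namespace": NAMESPACE,     # where the terms used here are defined
        "id": str(uuid.uuid4()),    # unique id
        "title": title,
        "dataType": data_type,      # value from the controlled vocabulary
        "endpoint": endpoint,       # resolvable endpoint (URL)
    }


descriptor = make_descriptor(
    "Surface ozone time series", "observation", "https://example.org/data/o3"
)
print(json.dumps(descriptor, indent=2))
```

A free-text value such as "ascii" is rejected by the vocabulary check, which is what makes the record usable by software rather than only by people.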

(17)

Data lifecycle: Collect

• Define data format(s)

− "ASCII" is not a data format!

− Don't forget to think about date and time

• Define data quality flags and special values

− Missing value = NaN or -9999 or 9999?

• Make use of international standards

− e.g. use CF convention when producing netCDF files

• Make sure to have enough metadata in your data files

− Not everything must be stored in the data file (see UUIDs)

− Store enough information so that data are unambiguous

− Always think of a version number!

• Never modify or delete any raw data!

Establish a re-usable workflow

See also https://www.dataone.org/webinars/how-not-collect-data-organizing-data-long-term-use-and-re-use
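To make the points about date/time and special values concrete, here is a minimal sketch of a record parser. It assumes a two-column text format (ISO 8601 timestamp in UTC, then the value) and a missing-value code of -9999; both are illustrative choices, not a recommendation from the lecture, and `parse_record` is a hypothetical helper.

```python
import math
from datetime import datetime, timezone

MISSING_VALUE = -9999.0   # assumed sentinel; document whichever value you choose
FORMAT_VERSION = "1.0"    # hypothetical version number of this file format


def parse_record(line):
    """Parse 'timestamp,value' into (timezone-aware datetime, float or NaN)."""
    stamp, raw = line.strip().split(",")
    # Explicit UTC avoids guessing about local time or daylight saving later.
    time = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    value = float(raw)
    if value == MISSING_VALUE:
        value = math.nan   # keep the sentinel out of any later arithmetic
    return time, value


lines = ["2018-11-08T12:00:00Z,41.5", "2018-11-08T13:00:00Z,-9999"]
records = [parse_record(line) for line in lines]
```

Converting the sentinel to NaN at read time means a forgotten missing-value code cannot silently enter an average.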

(18)

• Develop a quality assurance and quality control plan

• Ensure dataset consistency

− Version control for data format, codes, etc.

− Double-check data

− Confirm match between data and their description in metadata

− Ensure datasets used are reproducible

• Analyse and document data quality

− Identify missing values and define missing value codes

− Identify outliers

− Identify values that are estimated

− Mark data with quality control flags

− Provide version information for use and discovery

− Communicate data quality

Modified from https://www.dataone.org/best-practices

Data lifecycle: Assure
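The steps "identify missing values", "identify outliers", and "mark data with quality control flags" could be sketched as follows. The flag codes and the outlier rule (distance from the median, measured in median absolute deviations) are assumptions made for this example; the lecture does not prescribe a particular test.

```python
import math
import statistics

# Hypothetical flag codes; a real project would document its own scheme.
FLAG_OK, FLAG_MISSING, FLAG_OUTLIER = 0, 1, 2


def flag_values(values, k=5.0):
    """Return one quality flag per value; the data themselves are not modified."""
    valid = [v for v in values if not math.isnan(v)]
    median = statistics.median(valid)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = statistics.median(abs(v - median) for v in valid)
    flags = []
    for v in values:
        if math.isnan(v):
            flags.append(FLAG_MISSING)
        elif mad > 0 and abs(v - median) > k * mad:
            flags.append(FLAG_OUTLIER)
        else:
            flags.append(FLAG_OK)
    return flags


ozone = [40.0, 41.0, 39.5, 40.5, math.nan, 400.0]
flags = flag_values(ozone)
```

Flags are stored next to the values instead of replacing them, so the raw data stay untouched and the decision can be revisited.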

(19)

Data lifecycle: Describe

Metadata

(20)

• Make sure your analyses are reproducible!

− Make backups of your data and scripts (use git for scripts)

− Ensure that you can always go back to raw data and reprocess

− Fully code your workflows

− Fully document your data (beyond hand-written lab books)

− Publish your data for others and future generations

− Use universal (not proprietary) formats

Data lifecycle: Preserve
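One possible way to check that a backup still matches the raw data is a checksum manifest. The sketch below uses only the Python standard library; the function names are hypothetical, and the lecture itself does not mention checksums.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file below `directory` (relative path) to its checksum."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(directory, manifest):
    """List files that are missing or whose checksum differs from the manifest."""
    current = build_manifest(directory)
    return sorted(name for name in manifest if current.get(name) != manifest[name])
```

Storing the manifest alongside the data (and under version control) gives a cheap test that raw files were neither modified nor lost.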

(21)

Workflow preservation = coding practices

Bad (why?)

import pandas as pd

data = pd.read_csv("old_data/sunshine_exp.csv")
mass = (data["kg"] - 0.176) * 46.006 / 28.967
print(mass.iloc[12:29])

(22)

Workflow preservation = coding practices

Better

import pandas as pd

filename = "old_data/sunshine_exp.csv"
mass_offset = 0.176      # net weight of scale
mw_no2 = 46.006          # molecular mass NO2
mw_air = 28.967          # molecular mass of dry air
irow0, irow1 = (12, 29)  # first row and end row (exclusive) to print

data = pd.read_csv(filename)
mass = (data["kg"] - mass_offset) * mw_no2 / mw_air
print(mass.iloc[irow0:irow1])
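Going one step further than named constants, the conversion could be wrapped in a small documented function so that it can be tested and re-used. This is only a sketch: `scaled_mass` is a hypothetical name, and plain lists are used instead of pandas to keep the example self-contained.

```python
MASS_OFFSET_KG = 0.176   # net weight of scale, value from the slide
MW_NO2 = 46.006          # molecular mass of NO2
MW_AIR = 28.967          # molecular mass of dry air


def scaled_mass(raw_kg, offset=MASS_OFFSET_KG):
    """Subtract the scale offset and apply the NO2/air molecular mass ratio."""
    return [(m - offset) * MW_NO2 / MW_AIR for m in raw_kg]


# A reading equal to the offset must map to zero.
result = scaled_mass([0.176, 0.176 + 28.967])
```

With the calculation isolated like this, a simple check on known inputs documents what the numbers mean and guards against accidental edits.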

(23)

Data lifecycle: Discover, Integrate, Analyze

That's the fun part

= not in this boring lecture

(24)

Metadata

(25)

What is metadata?

"A set of data that describes and gives information about other data" (Oxford English Dictionary)

• Needed to discover data (discovery metadata)

• Needed to understand data (descriptive metadata)

(26)

http://www.geoportal.org/

Mauna Loa

(27)

http://www.geoportal.org/

(28)

Discovery metadata

What is needed to find your data?

Mind map "Global Warming"

(29)

• Dataset topic

• Dataset name

• Whodunnit?

• Type of dataset

• Geographic location or region

• Temporal extent

• (spatial and temporal) resolution

• How to access the data?

• License

• Creation/modification dates

Use standard terminology whenever possible!

Discovery metadata
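The checklist above can be turned into a simple completeness test. In this sketch the field names are hypothetical (they do not follow a specific metadata standard), and the example record only contains values that appear in the TOAR example of this lecture; everything else is deliberately left out so that the check has something to report.

```python
# Hypothetical field names, one per item of the discovery metadata checklist.
REQUIRED_FIELDS = (
    "topic", "name", "creator", "dataset_type", "region",
    "temporal_extent", "resolution", "access_url", "license", "created",
)


def missing_fields(record):
    """Return the required discovery fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


record = {
    "name": "TOAR gridded monthly statistics of daytime_avg",
    "creator": "Martin G. Schultz",
    "access_url": "https://doi.org/10.1594/PANGAEA.876108",
    "created": "2017-02-10 09:16",
}
todo = missing_fields(record)
print("still to be described:", ", ".join(todo))
```

Running such a check before upload is one way to catch an incomplete description while the person who knows the answers is still available.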

(30)

// global attributes:

:name = "TOAR gridded monthly statistics of daytime_avg";

:Conventions = "CF-1.6";

:author = "Martin G. Schultz (m.schultz@fz-juelich.de)";

:author_affiliation = "Forschungszentrum Juelich GmbH, Wilhelm-Johnen-Str., Juelich, Germany";

:description = "The data included in this gridded data product originate from the surface ozone database of the Tropospheric Ozone Assessment Report […]

:citation = "Schultz MG et al. (2017). Tropospheric Ozone Assessment Report: Database and Metrics Data of Global Surface Ozone Observations, Elementa - Science of the Anthropocene, http://doi.org/10.1525/elementa.244.";

:dataset_doi = "https://doi.org/10.1594/PANGAEA.876108";

:creation_date = "2017-02-10 09:16";

Example 1: TOAR netCDF dataset – global attributes

(31)

Example 2: ECHAM HAMMOZ model results in B2SHARE

Record fields: title, authors, publication date, description, keywords/tags, license, contact

(32)

Example 2 as machine-to-machine record

https://b2share.fz-juelich.de/api/records/?q=HAMMOZ
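A machine could retrieve such a record roughly as follows. The endpoint and the `q` parameter are taken from the URL above; the function names are hypothetical, the server may not be reachable any more, and nothing is assumed here about the structure of the JSON that comes back.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_RECORDS = "https://b2share.fz-juelich.de/api/records/"  # endpoint from the slide


def search_url(query):
    """Build the search URL for a free-text query."""
    return API_RECORDS + "?" + urlencode({"q": query})


def search_records(query, timeout=30):
    """Fetch the search result and decode it as JSON (needs network access)."""
    with urlopen(search_url(query), timeout=timeout) as response:
        return json.load(response)


print(search_url("HAMMOZ"))
```

Calling `search_records("HAMMOZ")` would return the decoded JSON document, which a script can then inspect for the same fields a human sees on the landing page.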

(33)

Example 3: re3data

https://www.re3data.org/

(36)

Well, you saw some of this already

Boundaries are blurred

Feel free to describe whatever may be relevant!

Use standard terminology whenever possible!

Descriptive metadata

(37)

WIGOS Metadata for surface observations

Very specific, rich specification of theme-specific metadata

(38)

EBAS metadata templates (e.g. ACTRIS)

Converging towards WIGOS metadata standard

(39)

Publish data

https://thepublicationplan.com/2016/05/13/why-do-researchers-refuse-to-share-their-data/

(40)

Examples of modern data portals

(41)

Data publication

(42)

Data publication

Easier said than done

(43)

FIND

ACCESS

INTEROPERABLE

REUSE

Andreas Petzold's lecture

(44)

• Requires registration/access via institution

• Choose “EUDAT” as community (hope this will change at some point)

• Max. upload 10 Gbyte per collection and 2 Gbyte per file

Data publication with B2SHARE

Fearless testing with https://b2share-testing.fz-juelich.de

B2SHARE creates a PID which can be resolved by https://dx.doi.org/

(45)

Data publication with B2SHARE

(46)

Data publication with PANGAEA

https://www.pangaea.de

(47)

Data publication with PANGAEA

https://www.pangaea.de

(48)

Data publication with PANGAEA

(49)

Data publishing: Open Data Certificate

https://certificates.theodi.org/en

(50)

https://certificates.theodi.org/en

Data publishing: Open Data Certificate

(51)

https://certificates.theodi.org/en

Data publishing: Open Data Certificate

(52)

Professional data publishing = web services

(53)

Summary

(54)

Summary

• Data management should matter to you

• Data is the fuel for scientific discovery

• Use open standards

• Apply common sense (and standards) to metadata

• Publish your data

• Be FAIR
