The (environmental research) data lifecycle
Martin Schultz, JSC
Image credit: FZJ/Panknin: „7 days, 7 pictures - no people, no words - day 4/7“
#blackandwhitechallenge #sevendayssevenpictures
Hot off the press
08 Nov 2018: Copernicus journals sign up to http://www.copdess.org
Motivation Data lifecycle Metadata Publish Data Summary
Science is changing!
Paul Magill at the first National Air Pollution Symposium in 1949 (Wikipedia)
SPARC General Assembly, Kyoto 2018
JUWELS: Germany's fastest supercomputer in September 2018 (12 PFlops)
CDC 6600 in 1964 (3 MFlops)
Data is changing!
Data available to Lewis Fry Richardson for first numerical weather prediction in 1921:
Upper air observations from sondes.
Squares marked 'P' provided atmospheric pressure values; those marked 'M' gave atmospheric momentum.
(source: http://www.abc.net.au/science/slab/forecast/story.htm)
Corrected reflectance from NASA Terra & Aqua on 21 Oct 2018: only one of ~90 satellite data products routinely assimilated into ECMWF's weather forecast system
Data volumes are growing exponentially!
ECMWF size of MARS archive
Genomic sequencing data
100 Pbytes in 2014
http://www.ifp.uni-stuttgart.de/publications/phowo15/110Wagner.pdf
STScI astronomical data
https://archive.stsci.edu/reports/BigDataSDTReport_Final.pdf
Data throughput keeps growing too!
CERN data output through ATLAS (1 CD ROM per second)
Paradigm change:
Those who understand data will excel in science
New job descriptions:
• Data manager
• Data scientist
Why should you care?
• You must publish your data when you publish scientific results
• Sharing data increases collaborations and advances knowledge
• Ever wanted to get access to other people's data?
• Try to re-analyse and understand old measurements!
DOAS measurements in SAPHIR (IEK-8)
John J. Kelley, USGS report, 1971
Motivation
Can anyone reproduce this plot 30 years from now?
Motivation
Can others put your data in context with other Earth system issues?
Mind map „Global Warming“
https://www.dataone.org/best-practices
The Data lifecycle
Plan data management early in your project
https://www.dataone.org/best-practices
Data lifecycle: Plan
• What kind of data will your instrument/model generate?
• How much data will your instrument/model generate?
• What kind of secondary data products will you generate?
• What kind of data will you want to/need to publish?
• Where will you store your data (and backups)?
• How will you process your data and where?
• How can you structure your data?
• What data do you need from others and how do you manage this?
Write this up so that others can understand it
You have a data management plan!
Software tools can help you create a data management plan
https://dmptool.org/
Data lifecycle: Plan
Trendy: make the DMP „machine-actionable“
https://blog.dmptool.org/
Data lifecycle: Plan
This looks awful and cryptic?
Yes.
Machine actionable data management
But it provides structure and context
Software can be written to digest this information
Information can be shared, re-used, and discovered
Machine actionable data management
Elements of machine actionable descriptors:
• namespace
• structure
• unique id
• resolvable endpoint
• controlled vocabulary
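A minimal sketch of how the descriptor elements listed above might fit together in one record. The namespace URL, the vocabulary terms, the field names, and the endpoint are all hypothetical and chosen only for illustration; real DMP tools define their own schemas.

```python
import json
import uuid

# Hypothetical namespace and controlled vocabulary, for illustration only.
NAMESPACE = "https://example.org/dmp/terms#"
DATA_TYPES = {"observation", "model_output", "derived_product"}


def make_descriptor(title, data_type, endpoint):
    """Build a small machine-actionable dataset descriptor."""
    if data_type not in DATA_TYPES:
        raise ValueError(f"unknown data type: {data_type!r}")
    return {
        "namespace": NAMESPACE,              # where the terms are defined
        "id": f"urn:uuid:{uuid.uuid4()}",    # unique id
        "title": title,
        "type": data_type,                   # term from the controlled vocabulary
        "landing_page": endpoint,            # resolvable endpoint
    }


descriptor = make_descriptor(
    "Surface ozone time series",
    "observation",
    "https://example.org/datasets/ozone",
)
print(json.dumps(descriptor, indent=2))
```

Because the record has a fixed structure and rejects terms outside the vocabulary, software can validate and index it without a human reading it first.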
Data lifecycle: Collect
• Define data format(s)
− „ASCII“ is not a data format!
− Don't forget to think about date and time
• Define data quality flags and special values
− Missing value = NaN or -9999 or 9999?
• Make use of international standards
− e.g. use CF convention when producing netCDF files
• Make sure to have enough metadata in your data files
− Not everything must be stored in the data file (see UUIDs)
− Store enough information so that data are unambiguous
− Always think of a version number!
• Never modify or delete any raw data!
Establish a re-usable workflow
See also https://www.dataone.org/webinars/how-not-collect-data-organizing-data-long-term-use-and-re-use
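A minimal sketch of some of the points above: explicit ISO 8601 timestamps in UTC, one documented missing-value marker, and a version number stored with the data. The column name `ozone_ppb`, the header layout, and the version string are assumptions for illustration, not an established standard; for netCDF output the CF conventions would take this role.

```python
import csv
import io
from datetime import datetime, timezone

FORMAT_VERSION = "1.0"      # hypothetical version number of this file layout
MISSING = float("nan")      # the one documented missing-value marker


def write_records(records, stream):
    """Write (time, value) pairs as CSV with a small metadata header."""
    stream.write(f"# format_version: {FORMAT_VERSION}\n")
    stream.write("# time: ISO 8601, UTC\n")
    stream.write("# ozone_ppb: ozone mixing ratio in ppb, missing = nan\n")
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["time", "ozone_ppb"])
    for when, value in records:
        # Convert to UTC so the time zone is never left to guesswork.
        writer.writerow([when.astimezone(timezone.utc).isoformat(), value])


records = [
    (datetime(2018, 11, 8, 12, 0, tzinfo=timezone.utc), 41.5),
    (datetime(2018, 11, 8, 13, 0, tzinfo=timezone.utc), MISSING),
]
buffer = io.StringIO()
write_records(records, buffer)
print(buffer.getvalue())
```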
• Develop a quality assurance and quality control plan
• Ensure dataset consistency
− Version control for data format, codes, etc.
− Double-check data
− Confirm match between data and their description in metadata
− Ensure datasets used are reproducible
• Analyse and document data quality
− Identify missing values and define missing value codes
− Identify outliers
− Identify values that are estimated
− Mark data with quality control flags
− Provide version information for use and discovery
− Communicate data quality
Modified from https://www.dataone.org/best-practices
Data lifecycle: Assure
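A minimal sketch of marking data with quality control flags, as in the checklist above. The flag codes, the legacy missing-value codes, and the outlier rule (distance from the median in units of the median absolute deviation, with a threshold of 3.5) are assumptions for illustration; a real project would define and document its own.

```python
import math
import statistics

# Hypothetical flag codes; a real project would document these in its metadata.
FLAG_OK, FLAG_MISSING, FLAG_OUTLIER = 0, 1, 2
LEGACY_MISSING = {-9999.0, 9999.0}   # assumed legacy missing-value codes


def quality_flags(values, threshold=3.5):
    """Return one quality flag per value (needs at least one valid value)."""
    valid = [v for v in values if not math.isnan(v) and v not in LEGACY_MISSING]
    median = statistics.median(valid)
    mad = statistics.median(abs(v - median) for v in valid)
    flags = []
    for v in values:
        if math.isnan(v) or v in LEGACY_MISSING:
            flags.append(FLAG_MISSING)
        elif mad > 0 and abs(v - median) / mad > threshold:
            flags.append(FLAG_OUTLIER)
        else:
            flags.append(FLAG_OK)
    return flags


ozone = [41.5, 39.8, -9999.0, 40.2, 180.0, float("nan"), 42.1]
print(quality_flags(ozone))  # → [0, 0, 1, 0, 2, 1, 0]
```

The values themselves are left untouched; the flags travel alongside them, so the raw data stay intact and the quality assessment can be revised later.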
Data lifecycle: Describe
Metadata
• Make sure your analyses are reproducible!
− Make backups of your data and scripts (use git for scripts)
− Ensure that you can always go back to raw data and reprocess
− Fully code your workflows
− Fully document your data (beyond hand-written lab books)
− Publish your data for others and future generations
− Use universal (not proprietary) formats
Data lifecycle: Preserve
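One way to check that a raw data file or its backup is still unchanged is to record a checksum when the file is archived and compare it later. This is a sketch of that idea, not a method prescribed in this lecture; the file name and contents are hypothetical stand-ins.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration with a temporary stand-in for a raw data file.
with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp) / "raw_measurements.csv"   # hypothetical file name
    raw_file.write_text("time,kg\n2018-11-08T12:00:00Z,0.512\n")
    recorded = sha256_of(raw_file)                  # store this next to the backup
    unchanged = sha256_of(raw_file) == recorded
    raw_file.write_text("time,kg\n2018-11-08T12:00:00Z,0.999\n")
    modified = sha256_of(raw_file) != recorded

print(unchanged, modified)  # → True True
```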
Workflow preservation = coding practices
Bad (why?)
import pandas as pd
data = pd.read_csv("old_data/sunshine_exp.csv")
mass = (data["kg"] - 0.176) * 46.006 / 28.967
print(mass.iloc[12:29])
Better
import pandas as pd

filename = "old_data/sunshine_exp.csv"
mass_offset = 0.176  # net weight of scale
mw_no2 = 46.006      # molecular mass of NO2
mw_air = 28.967      # molecular mass of dry air
irow0, irow1 = (12, 29)  # first and last (exclusive) row to print

data = pd.read_csv(filename)
mass = (data["kg"] - mass_offset) * mw_no2 / mw_air
print(mass.iloc[irow0:irow1])
Data lifecycle: Discover, Integrate, Analyze
That's the fun part
= not in this boring lecture
Metadata
What is metadata?
A set of data that describes and gives information about other data
Oxford English Dictionary
• Needed to discover data (discovery metadata)
• Needed to understand data (descriptive metadata)
http://www.geoportal.org/
Mauna Loa
Discovery metadata
What is needed to find your data?
Mind map „Global Warming“
• Dataset topic
• Dataset name
• Whodunnit?
• Type of dataset
• Geographic location or region
• Temporal extent
• (spatial and temporal) resolution
• How to access the data?
• License
• Creation/modification dates
Use standard terminology whenever possible!
Discovery metadata
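A minimal sketch of checking a record against the discovery metadata items listed above before publishing. The field names and the example values are hypothetical; data portals define their own schemas and vocabularies.

```python
# Hypothetical field names, one per item in the list above.
REQUIRED_FIELDS = [
    "topic", "name", "creator", "dataset_type", "region",
    "temporal_extent", "resolution", "access_url", "license", "created",
]


def missing_fields(record):
    """Return the required discovery fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


record = {
    "topic": "surface ozone",
    "name": "Example monthly ozone statistics",
    "creator": "A. Researcher (hypothetical)",
    "dataset_type": "gridded observations",
    "region": "global",
    "temporal_extent": "1990-01/2014-12",
    "resolution": "5 degrees, monthly",
    "access_url": "https://example.org/datasets/ozone",
    "created": "2017-02-10",
}
print(missing_fields(record))  # → ['license']
```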
// global attributes:
:name = "TOAR gridded monthly statistics of daytime_avg";
:Conventions = "CF-1.6";
:author = "Martin G. Schultz (m.schultz@fz-juelich.de)";
:author_affiliation = "Forschungszentrum Juelich GmbH, Wilhelm-Johnen-Str., Juelich, Germany";
:description = "The data included in this gridded data product originate from the surface ozone database of the Tropospheric Ozone Assessment Report […]";
:citation = "Schultz MG et al. (2017). Tropospheric Ozone Assessment Report: Database and Metrics Data of Global Surface Ozone Observations, Elementa - Science of the Anthropocene, http://doi.org/10.1525/elementa.244.";
:dataset_doi = "https://doi.org/10.1594/PANGAEA.876108";
:creation_date = "2017-02-10 09:16";
Example 1: TOAR netCDF dataset – global attributes
Example 2: ECHAM HAMMOZ model results in B2SHARE
Fields shown: title, authors, publication date, description, keywords/tags, license, contact
Example 2 as machine-to-machine record
https://b2share.fz-juelich.de/api/records/?q=HAMMOZ
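A minimal sketch of how a program might work with such a machine-to-machine record. The search URL is built from the endpoint above; the layout of the JSON response (`hits` → `hits` → `metadata` → `titles`) is an assumption about the service and should be checked against a real response. The sample record is hypothetical, and no network request is made here.

```python
from urllib.parse import urlencode

API_ROOT = "https://b2share.fz-juelich.de/api/records/"


def search_url(query):
    """Build the search URL for a free-text query."""
    return f"{API_ROOT}?{urlencode({'q': query})}"


def record_titles(response):
    """Collect titles from a search response (assumed layout, see above)."""
    titles = []
    for hit in response.get("hits", {}).get("hits", []):
        for entry in hit.get("metadata", {}).get("titles", []):
            titles.append(entry.get("title"))
    return titles


# Hypothetical response, standing in for the parsed JSON of a real request.
sample = {
    "hits": {
        "hits": [
            {"metadata": {"titles": [{"title": "Hypothetical HAMMOZ run"}]}},
        ]
    }
}
print(search_url("HAMMOZ"))
print(record_titles(sample))
```

In a real script the response would come from fetching `search_url("HAMMOZ")` and parsing the body with `json.loads`.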
Example 3: re3data
https://www.re3data.org/
Well, you saw some of this already
Boundaries are blurred
Feel free to describe whatever may be relevant!
Use standard terminology whenever possible!
Descriptive metadata
WIGOS Metadata for surface observations
Very specific, rich specification of theme-specific metadata
EBAS metadata templates (e.g. ACTRIS)
Converging towards WIGOS metadata standard
Publish data
https://thepublicationplan.com/2016/05/13/why-do-researchers-refuse-to-share-their-data/
Examples of modern data portals
Data publication
Easier said than done
FIND
ACCESS
INTEROPERABLE
REUSE
See Andreas Petzold's lecture
• Requires registration/access via institution
• Choose “EUDAT” as community (hope this will change at some point)
• Max. upload 10 Gbyte per collection and 2 Gbyte per file
Data publication with B2SHARE
Fearless testing with https://b2share-testing.fz-juelich.de
B2SHARE creates a PID which can be resolved by https://dx.doi.org/
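A minimal sketch of turning a DOI into a resolvable link. The helper `doi_url` is hypothetical and written for this example; the DOI used is the TOAR dataset DOI quoted in Example 1.

```python
from urllib.parse import quote

KNOWN_PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/", "doi:",
)


def doi_url(doi, resolver="https://dx.doi.org/"):
    """Return a resolver URL for a DOI given with or without a prefix."""
    doi = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return resolver + quote(doi, safe="/")


print(doi_url("https://doi.org/10.1594/PANGAEA.876108"))
# → https://dx.doi.org/10.1594/PANGAEA.876108
```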
Data publication with PANGAEA
https://www.pangaea.de
Data publishing: Open Data Certificate
https://certificates.theodi.org/en
Professional data publishing = web services
Summary
• Data management should matter to you
• Data is the fuel for scientific discovery
• Use open standards
• Apply common sense (and standards) to metadata
• Publish your data
• Be FAIR