Mixing our Research Data Metaphors: Seeding the commons, capturing data & taming ‘wild’ research data

By Peter Sefton and Peter Bugeia, with input from the UWS eResearch community and beyond

About this post

During 2012 The University of Western Sydney (UWS) will be rolling out a Research Data Repository (RDR) which we outlined in a previous post. In this post we will dig deeper into the architecture and look at how a couple of the components interact, specifically: how does a lab-level data management application talk to the institution-level Research Data Repository when a researcher wants to archive a data set for reuse and citation? This work is a partnership with researchers and technicians at the Hawkesbury Institute for the Environment (HIE), our NSW eResearch partner Intersect, the UWS library and IT, and the UWS eResearch team.

Non-technical summary: The data capture application for environmental scientists at HIE will be aimed at obtaining and managing data for immediate use and re-use. This post describes the technical approach we will use to allow researchers to create a data set from one or more data sources, ask the system to keep it for the long term in the UWS Research Data Repository, and issue an identifier they can use to cite it in a research publication. Keeping data in the RDR means both adding data to the Research Data Storage (RDS) component and maintaining a record about the data in the Research Data Catalogue (RDC).

Technical summary (contains jargon which is explained below): The data-curation interface between the ANDS-funded Data Capture (DC21) and Seeding the Commons (SC20) projects at UWS has now been specified. Data sets identified by researchers as important in the DC21 application will be harvested by the institutional Research Data Repository using the OAI-PMH protocol with a RIF-CS payload. Data librarians will check and improve collection descriptions and, for those of significant re-use potential, publish them to Research Data Australia. On publication, the Research Data Repository application will move data from a pre-published to a published state. Pre-published data may be openly accessible for collaboration purposes but will not have DOI identifiers or guaranteed persistence.

Data capture and seeding the commons

We have two Australian National Data Service (ANDS) projects running at UWS at the moment.

  1. There’s a Data Capture project, which, amongst other capabilities, is designed to capture some of the ‘wild’ data, organizing it into collections that can be secured, referenced and re-used by others. This is the Climate Change and Energy Research Data Capture Project, known as DC21.

    Data might be considered ‘wild’ if there are questions about its long-term management (will we be able to find it ten years from now?), its short-term safety (is it backed up?), or its status (is it raw or cleansed?).

  2. There’s a Seeding the Commons project which, amongst other things, is aimed at establishing a catalogue application which publishes descriptions of collections of data available for re-use on a search site: Research Data Australia.

Here’s what the DC21 application is doing:

This project will develop the data architecture and associated software systems to automatically capture data and meta-data from three instruments. The motivation for the project is that on completion the systems developed will serve as a basis for including the additional instruments utilised by CCERF and other research groups at UWS.

And it has a close connection to the Seeding the Commons project SC20.

The project is closely aligned and is partly dependent on the UWS Seeding the Commons project (SC20). The meta-data collected in this project will be contributed to the UWS eResearch Metadata Store. SC20 will be developing RIF-CS and OAI-PMH compliance for the UWS eResearch Metadata Store to allow for it to be harvested into the ARDC.


  1. OAI-PMH is a web protocol allowing one service to pull data from another. It’s very similar to RSS and Atom used to keep track of updates on websites by software like Google Reader.

  2. RIF-CS is the data format used to publish catalogue descriptions of research data and associated entities like people and projects to Research Data Australia. RIF-CS is an ANDS-specific format which is not sufficient on its own to capture a full set of archival and management data about research data collections, but our initial analysis is that it will be sufficient to communicate between the data capture application and the centralised research data repository. 
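To make the exchange concrete, here is a heavily trimmed sketch of what a RIF-CS collection record looks like and how a harvester might pull a collection title out of it. The element names and namespace come from the RIF-CS schema, but this cut-down record is illustrative only – it is not an actual ReDBox or DC21 payload.

```python
import xml.etree.ElementTree as ET

# A heavily trimmed, illustrative RIF-CS collection record (not a real ReDBox payload).
RIFCS = """<registryObjects xmlns="http://ands.org.au/standards/rif-cs/registryObjects">
  <registryObject group="University of Western Sydney">
    <key>uws.edu.au/collection/example-001</key>
    <collection type="dataset">
      <name type="primary"><namePart>Whole tree chamber data, 2012</namePart></name>
    </collection>
  </registryObject>
</registryObjects>"""

NS = {"rif": "http://ands.org.au/standards/rif-cs/registryObjects"}

def collection_titles(xml_text):
    """Return the primary names of all collections in a RIF-CS document."""
    root = ET.fromstring(xml_text)
    return [part.text
            for name in root.findall(".//rif:collection/rif:name[@type='primary']", NS)
            for part in name.findall("rif:namePart", NS)]

print(collection_titles(RIFCS))  # ['Whole tree chamber data, 2012']
```

In the real pipeline the XML arrives wrapped in OAI-PMH ListRecords responses rather than as a bare string.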

From data capture – to data embalming, er, preservation and re-consumption

Luc Small of Intersect has written up the DC21 application.

While it’s called a ‘capture’ application, with connotations of Gerald Durrell style antics in the wilds, trapping temperature readings and soil moisture readings with tranquilizer darts, DC21 is really about data domestication. Sure we need to obtain data, but it’s not just about raw, untamed, data; technicians and researchers do things to the data. They clean it and analyse it, and make useful collections out of data from different sources.

The bit we’re interested in for this post is the point at which someone says “I’m ready to write this up” – at this point they will want to make sure their research is defensible, reproducible and, perhaps most importantly, citable. Before we go on to talk about this process, let’s look at some of the assumptions we’re making about the DC21 application.

Design Considerations

  • Data capture applications contain working data that might be reworked, cleaned or deleted before it is published or used as the basis for a publication or report.

  • Research projects are born, they run and they get completed. Research facilities are built and will eventually become obsolete. Data capture systems which service these projects and facilities are likely to suffer the same fate – they will not always have governance in place to ensure that they persist over long periods of time. (Yes, we know it’s in the requirements spec that every app is ‘sustainable’ but let’s be realistic).

  • The Research Data Repository (RDR) and its sub parts (the data storage system and the Research Data Catalogue, RDC) capture important institutional assets. To maintain these research data assets, the RDR will need to have governance in place to ensure its long term persistence.

  • The RDR will have RIF-CS-over-OAI-PMH and other interfaces that are needed for compliance and data discovery, meaning that data capture applications need not have these (but they can, of course).

  • A data set that is required for validation of research should have a persistent identifier expressed as an HTTP URI.  (Handles and DOIs can both be used to make URIs, with some benefits and attendant risks).

  • Publicly accessible data sets, as well as those that are expected to be cited even if not available as Open Access, should be managed through the RDR.

  • And an implementation detail: At UWS, the ReDBox Research Data Catalogue application will be the software that runs the Seeding the Commons and RDC projects.
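As a small illustration of the identifier point above: both Handles and DOIs become HTTP URIs simply by prefixing the public resolver hostname. The resolver hosts below are the standard public ones; the identifier values are made up for illustration.

```python
def doi_to_uri(doi):
    """Express a DOI as a resolvable HTTP URI via the public DOI resolver."""
    return "http://dx.doi.org/" + doi

def handle_to_uri(handle):
    """Express a Handle as a resolvable HTTP URI via the public Handle resolver."""
    return "http://hdl.handle.net/" + handle

# Hypothetical identifiers, for illustration only.
print(doi_to_uri("10.9999/uws.example.dataset"))    # http://dx.doi.org/10.9999/uws.example.dataset
print(handle_to_uri("9999.1/uws.example.dataset"))  # http://hdl.handle.net/9999.1/uws.example.dataset
```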

Rules of Engagement

Here are some rules of engagement, which are emerging as we get further into the design process for the Research Data Repository (RDR), data capture (DC21) and Research Data Catalogue applications (SC20). These rules are helping to ensure that the research data being captured is robust and well managed.  Data sets that are needed to validate research, and which researchers want to be citable:

  • Must be deposited in the Research Data Storage component (RDS) of the RDR or another persistent store that meets the same standards for data preservation. Note that much data will be in the RDS already; deposit is then a state-change rather than a move.

  • Must be described in the Research Data Catalogue (RDC) with a link to where the data resides. (Support will be available for this from the library).

  • Data capture applications must have a mechanism for a researcher to ask for a data set to be ‘curated’ so it is available for a defined period and correctly described, for example if they want to use it as the basis of a publication.
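The rules above imply a small life-cycle for each data set. The sketch below names the states as we have been describing them informally – the state names and transition table are our own invention, not part of any of the applications' actual schemas.

```python
# Hypothetical sketch of the data set states implied by the rules above:
# working data lives in the capture application, a researcher requests
# curation, and only data that crosses the curation boundary is published
# with a persistent identifier.
ALLOWED = {
    "working": {"curation-requested"},
    "curation-requested": {"pre-published"},
    "pre-published": {"published"},   # crossing the curation boundary
    "published": set(),               # published data persists; no further transitions
}

def transition(state, new_state):
    """Move a data set to new_state, refusing any transition the rules forbid."""
    if new_state not in ALLOWED[state]:
        raise ValueError("illegal transition: %s -> %s" % (state, new_state))
    return new_state

state = "working"
for step in ("curation-requested", "pre-published", "published"):
    state = transition(state, step)
print(state)  # published
```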

The current solution

Against the background of our medium-term plans for a UWS Research Data Repository, and the above design considerations, rules of engagement and requirements, the technical teams from the Data Capture project and the Seeding the Commons project spent the best part of a day working out a white-board sketch of the interfaces between the lab-level working data management application and the repository.

While this high level solution design assumes ReDBox, other metadata store applications could be slotted in instead – the interface is standards based (RIF-CS over OAI-PMH).

The whiteboard looked like this. Below, we’ll simplify that with a proper diagram made on a computer.

Figure 1 Interface between data capture application and the Research Data Repository (using OAI-PMH and the RIF-CS standard for metadata about research data)

There are two main interface points:

  1. Name authority lookup, which ensures that every bit of metadata entered into DC21 is of the highest possible quality, via:

    1. A linked-data approach using HTTP URIs (AKA URLs) as names for things, as per the Gospel According to Tim.

    2. A single source of truth via the Mint component of ReDBox for data like subject codes, people, organisations etc.

  2. The ‘curation boundary’ where DC21 hands over metadata to the Research Data Catalogue, and when that’s been curated by data librarians, data is pulled into the public-facing facet of the Research Data Store.

The first of these is already done in DC21 – as far as we know this is the first time a service other than ReDBox has been connected to an instance of the Mint as an authority. We will talk more about the importance of name authorities as ‘sources of institutional truth’ and the use of identifiers as our Research Data Repository project proceeds. For now, we will note that as far as possible every time someone fills out a form with something the institution already knows (a name of a person, a grant-code etc) then the data is looked up in the name authority, rather than relying on people typing strings, or local look-up tables. The UWS Research Data Catalogue is going to be ‘no strings attached’, as in text-strings. URIs all the way!
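A rough sketch of what ‘no strings attached’ means in practice: every free-text entry is resolved against the authority and stored as a URI. Here a plain dictionary stands in for the Mint’s HTTP lookup service, and the URIs are invented for illustration.

```python
# Offline stand-in for a Mint-style name-authority lookup. In the real system
# DC21 queries the Mint over HTTP; here a dict plays the authority, and the
# URIs are made up for illustration.
AUTHORITY = {
    "person": {
        "sefton, peter": "http://id.uws.example.org/person/0001",
    },
    "grant": {
        "dc21": "http://id.uws.example.org/grant/dc21",
    },
}

def lookup(kind, text):
    """Resolve user-typed text to a URI, or None -- never store the bare string."""
    return AUTHORITY.get(kind, {}).get(text.strip().lower())

print(lookup("person", "Sefton, Peter"))  # http://id.uws.example.org/person/0001
print(lookup("person", "Unknown Name"))   # None
```

The important property is that the stored value is always a URI (or nothing), never the raw typed string.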

The more important interface is the second, the main subject of this post, which handles deposit of data collections into the trusted Research Data Repository.

Based on all the design considerations and rules of engagement outlined above, the ‘curation boundary’ needs to be crossed when a researcher wants to keep an archival snapshot of a particular data set.

The story here is designed for data sets of moderate size, like those we’re getting from the Hawkesbury Institute for the Environment.

So, here’s the story:

  1. A researcher uses the DC21 application to find a number of data files from across two of the facilities at the institute, conducts some analysis and writes an article. (The system remembers every download from the data store).

    The researcher asks for the particular data set used for the article to be published/curated, either by uploading the data back into the system, or by selecting it from their search history.

    The DC21 application bundles the requested data with as much provenance and metadata as possible, such as including the raw data.

    The DC21 application sets a flag against that downloaded collection to mark it as ready for publication – meaning it will start appearing in the OAI-PMH feed. The DC21 application will also remember that the data behind the collection has been referenced in a collection. This is to ensure that the data is not subsequently deleted or modified without due consideration for the collection.

  2. The Research Data Catalogue, which is part of the Research Data Repository, picks up the new collection record from the OAI-PMH feed and puts it in the ‘ReDBox inbox’.

  3. The team of data librarians see the new data set in the inbox, add missing metadata for management and discovery purposes (maybe contacting the researcher for more information), and publish the data.

  4. The Data Catalogue application mints a new DOI for the data set, and causes the data to be copied into the public part of the research data store. (Yes, we have to work out some of the details about when IDs get minted in this process – this step might need to happen earlier.)

  5. Later, another researcher can discover the data by searching the web, via a discovery service like Research Data Australia, or via the Research Data Catalogue directly, and get a URL version of the DOI for the data set.

  6. When someone downloads the data using the DOI-URL, they’re redirected to the data in the Research Data Store.
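Step 1’s ‘ready for publication’ flag and step 2’s harvest can be sketched as a simple filter over the collection records: only flagged collections are exposed to the OAI-PMH harvester. The field names here are illustrative, not DC21’s actual schema.

```python
# Illustrative collection records; field names are not DC21's actual schema.
collections = [
    {"id": "coll-1", "ready_for_publication": True},
    {"id": "coll-2", "ready_for_publication": False},
]

def oai_pmh_feed(colls):
    """Return the identifiers of collections exposed to the harvester."""
    return [c["id"] for c in colls if c["ready_for_publication"]]

print(oai_pmh_feed(collections))  # ['coll-1']
```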

Figure 2 Step-by-step data curation and publishing process

Copyright Peter Sefton and Peter Bugeia, 2012. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>

Climate Change and Energy Research – Intersect eResearch Project Summary

[This document was prepared by Intersect for the DC21 project being run by the University of Western Sydney, funded by the Australian National Data Service. We’re posting it here as part of the story about how UWS is building a Research Data Repository.]

Dr Luc Small | 20 February 2012 | 4.1

Intersect is developing and deploying technology aimed at assisting research into climate change and energy by enhancing the management of environmental sensor data. This document describes the core functionality of the proposed project and is targeted at researchers and their support teams who may wish to join the project as collaborators.

Background and Context

There has been a significant rise in the number of sensors and sensor networks used in environmental research in recent years. This growth has brought with it the challenge of managing sensor infrastructure and the data produced by the increasing numbers of deployed sensors.

Three classes of instruments are targeted for this project, from which data and meta-data will be collected:

  • Eddy Flux Towers – Collect meteorological and flux data (e.g. surface-atmosphere exchanges of CO2, water vapour and energy).

  • Whole tree chambers – Collect meteorological data regarding the environmental impact on a tree wholly encapsulated within the chamber. 

  • Weather stations – Collect meteorological data.

While these instruments are the current focus of the project, the project aims to be sensor/infrastructure agnostic and therefore more generally applicable to sensor data management.

Problem Statement

The problem of insufficient sensor infrastructure and data management affects researchers, data technicians and infrastructure managers. The impacts include:

  • lost or misplaced sensor data.

  • inadequate recording of how and where data was collected.

  • inadequate recording of quality assurance, gap filling and other post-processing done to the data, and the assumptions made by the data technician during post-processing.

  • scientific conclusions based on less-than-ideally managed source data that is prone to error.

A successful solution would:

  • store data in a secure, backed-up, centralised location.

  • record rich metadata about how and where data was collected.

  • record rich metadata about the post-processing done to the data.

  • provide an intuitive means by which researchers can access data and be fully informed about its nature by consulting its associated metadata.

Project Deliverables

  1. Infrastructure management: The ability to keep track of sensor infrastructure (for example, flux towers and weather stations) and individual sensors, and changes to the sensors and/or the infrastructure.

  2. Raw sensor data acquisition: Manual and automated data acquisition from files generated by sensors and/or their data-loggers.

  3. Versioned data storage: Permanent retention of raw sensor data. When datasets are quality assured, gap filled, or transformed, new versions of the datasets are created, time-stamped, and related back to the original raw sensor datasets. Data is stored in a centralised fashion that can be easily backed up.

  4. Data sharing: Data can be downloaded by those within the research group. Data comes with detailed meta-data describing the sensor and infrastructure used to acquire the dataset and any transformations that may have been done to it. Meta-data can be made available to Research Data Australia to make the research data more readily discoverable by other scientists.

  5. Data upload: As noted above, new versions of a dataset can be uploaded to the system and linked to the original raw dataset. This allows the process of data transformation to be tracked and ensures that it is a non-destructive process because all datasets created are retained.
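Deliverable 3 can be sketched as a minimal version chain: each post-processed data set becomes a new, timestamped record linked back to its parent, and the raw original is never modified. The class and field names are our own invention, for illustration only.

```python
import time

class DatasetVersion:
    """Minimal sketch of versioned, non-destructive dataset storage."""
    def __init__(self, data, note, parent=None):
        self.data = list(data)   # stored copy; the raw original stays untouched
        self.note = note
        self.parent = parent     # link back to the version this derives from
        self.created = time.time()

raw = DatasetVersion([1.0, None, 3.0], "raw sensor data")
cleaned = DatasetVersion([1.0, 2.0, 3.0], "gap filled", parent=raw)

assert cleaned.parent is raw         # provenance is preserved
assert raw.data == [1.0, None, 3.0]  # raw data remains available, unmodified
```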

More Information

If you face a similar problem and would like to join as a collaborator, or you face a related problem and are interested in understanding more about this project and potentially reusing components of it, please contact your local eResearch Analyst or email era@intersect.org.au.

Key Facts

Funding Source/Amount:

ANDS – Data Capture Program – $200k

Lead Organisation/CIs:

University of Western Sydney

Hawkesbury Institute for the Environment

Prof. Ian Anderson


Timeframe:

Development commenced in December 2011 and the system will go live in the latter half of 2012.

Related Projects:

TERN/OzFlux: The present project is best regarded as supporting the precursor activities that enable the delivery of quality assured data to a facility such as OzFlux.

A Day in the Life…

A new sensor is installed on a flux tower. Data files are retrieved from the associated datalogger once a day and placed on networked storage. The infrastructure manager:

  • Adds the sensor to the catalogue of sensors associated with the flux tower.

  • Associates the sensor data files with the sensor record.

  • Provides detailed meta-data about the sensor, such as its make, model, position on flux tower, etc.

  • Removes the sensor record for the faulty sensor that this new model has replaced.

Over the days that follow, data starts flowing in from the new sensor. The data technician:

  • Downloads the raw sensor data that has been collected.

  • Gap fills part of the data where the sensor has recorded readings that lie outside the band of expected values.

  • Uploads the gap filled data along with an explanation of the post-processing applied to the data.

The system stores the gap filled data as a new version. This new version is automatically associated with the original raw sensor data. The raw data remains available, unmodified, for future reference. The researcher:

  • Explores the sensors available on the flux tower and selects the one she’s interested in.

  • Browses the data available for the sensor and takes note of the data technician’s comments about any post-processing steps that have been performed.

  • Selects the gap filled version of the data since it is most appropriate in this instance.

  • Downloads the gap filled sensor data and commences analysis.

  • Is aided in analysis and write-up by having the full details of the flux tower, sensor, and post-processing step at hand.

  • Finds anomalies in the gap filled data and isolates the post-processing as the cause by looking at the raw sensor data.

  • Having decided this is “the” dataset the researcher asks the system to archive a copy and mint a new DOI so it can be cited like an article, and retrieved from the UWS Research Data Repository.
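The technician’s gap-filling step in the scenario above might look something like this: readings outside an expected band are treated as gaps and filled from their nearest in-band neighbours. The band and the fill strategy are assumptions for illustration, not HIE’s actual quality-assurance rules, and this sketch only handles gaps flanked by good readings.

```python
def gap_fill(readings, low, high):
    """Replace out-of-band readings with the mean of their nearest good neighbours."""
    vals = [r if low <= r <= high else None for r in readings]
    for i, v in enumerate(vals):
        if v is None:
            # Assumes every gap is interior, i.e. flanked by in-band readings.
            prev = next(x for x in reversed(vals[:i]) if x is not None)
            nxt = next(x for x in vals[i + 1:] if x is not None)
            vals[i] = (prev + nxt) / 2.0
    return vals

# A spiked temperature reading (999.0) is filled from its neighbours.
print(gap_fill([20.0, 999.0, 21.0], low=-10, high=50))  # [20.0, 20.5, 21.0]
```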

Please refer to the diagram below for an indication of how other stakeholders will interact with this project.

[Figure: DC21 stakeholder interaction diagram – DC21 Diagram.pdf]

This document by Intersect Australia is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.