Mixing our Research Data Metaphors: Seeding the commons, capturing data & taming ‘wild’ research data
By Peter Sefton and Peter Bugeia, with input from the UWS eResearch community and beyond
About this post
During 2012 the University of Western Sydney (UWS) will be rolling out a Research Data Repository (RDR), which we outlined in a previous post. In this post we will dig deeper into the architecture and look at how a couple of the components interact. Specifically: how does a lab-level data management application talk to the institution-level Research Data Repository when a researcher wants to archive a data set for reuse and citation? This work is a partnership with researchers and technicians at the Hawkesbury Institute for the Environment (HIE), our NSW eResearch partner Intersect, the UWS library and IT, and the UWS eResearch team.
Non-technical summary: The data capture application for environmental scientists at HIE will be aimed at obtaining and managing data for immediate use and re-use. This post describes the technical approach we will use to allow researchers to create a data set from one or more data sources, ask the system to keep it for the long term in the UWS Research Data Repository, and issue an identifier they can use to cite it in a research publication. Keeping data in the RDR means both adding data to the Research Data Storage (RDS) component and maintaining a record about the data in the Research Data Catalogue (RDC).
Technical summary (contains jargon which is explained below): The data-curation interface between the ANDS-funded Data Capture (DC21) and Seeding the Commons (SC20) projects at UWS has now been specified. Data sets identified by researchers as important in the DC21 application will be harvested by the institutional Research Data Repository using the OAI-PMH protocol with a RIF-CS payload. Data librarians will check and improve collection descriptions and, for those of significant re-use potential, publish them to Research Data Australia. On publication, the Research Data Repository application will move data from a pre-published to a published state. Pre-published data may be openly accessible for collaboration purposes but will not have DOI identifiers or guaranteed persistence.
Data capture and seeding the commons
We have two Australian National Data Service (ANDS) projects running at UWS at the moment.
There’s a Data Capture project, which, amongst other capabilities, is designed to capture some of the ‘wild’ data, organizing it into collections that can be secured, referenced and re-used by others. This is known as DC21, AKA the Climate Change and Energy Research Data Capture Project.
Data might be considered ‘wild’ if there are questions about its long term management (will we be able to find it ten years from now?) or short term safety (is it backed up?), or if its status is not known (is it raw or cleansed?).
There’s a Seeding the Commons project which, amongst other things, is aimed at establishing a catalogue application which publishes descriptions of collections of data available for re-use on a search site: Research Data Australia.
Here’s what the DC21 application is doing:
This project will develop the data architecture and associated software systems to automatically capture data and meta-data from three instruments. The motivation for the project is that on completion the systems developed will serve as a basis for including the additional instruments utilised by CCERF and other research groups at UWS.
And it has a close connection to the Seeding the Commons project SC20.
The project is closely aligned and is partly dependent on the UWS Seeding the Commons project (SC20). The meta-data collected in this project will be contributed to the UWS eResearch Metadata Store. SC20 will be developing RIF-CS and OAI-PMH compliance for the UWS eResearch Metadata Store to allow for it to be harvested into the ARDC.
OAI-PMH is a web protocol allowing one service to pull data from another. It’s very similar to RSS and Atom used to keep track of updates on websites by software like Google Reader.
RIF-CS is the data format used to publish catalogue descriptions of research data and associated entities like people and projects to Research Data Australia. RIF-CS is an ANDS-specific format which is not sufficient on its own to capture a full set of archival and management data about research data collections, but our initial analysis is that it will be sufficient to communicate between the data capture application and the centralised research data repository.
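To make the harvesting idea concrete, here is a minimal sketch of what the pull side of an OAI-PMH conversation looks like. It is an illustration under assumptions, not the DC21 or ReDBox implementation: the base URL and record identifier are made up, and the `rif` metadata prefix is the one we understand ANDS harvesters to request, so treat it as an assumption too. The `ListRecords` verb and the `metadataPrefix` and `from` arguments come from the OAI-PMH specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def list_records_url(base_url, metadata_prefix="rif", from_date=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url is a hypothetical endpoint; "rif" is assumed to be the
    metadata prefix under which RIF-CS is offered.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Selective harvesting: only records changed since this date.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


def record_identifiers(response_xml):
    """Return the OAI identifier of every record in a ListRecords response."""
    root = ET.fromstring(response_xml)
    path = f".//{{{OAI_NS}}}record/{{{OAI_NS}}}header/{{{OAI_NS}}}identifier"
    return [element.text for element in root.findall(path)]


# A hand-written sample response; the identifier is invented.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:dc21.example.org:collection/1</identifier>
        <datestamp>2012-05-01T00:00:00Z</datestamp>
      </header>
      <metadata><!-- the RIF-CS description would sit here --></metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://dc21.example.org/oai", from_date="2012-05-01"))
    print(record_identifiers(SAMPLE_RESPONSE))
```

The harvester only needs to remember the date of its last visit; everything else is a plain HTTP GET, which is why the comparison with feed readers holds up.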
From data capture – to data embalming, er, preservation and re-consumption
Luc Small of Intersect has written up the DC21 application.
While it’s called a ‘capture’ application, with connotations of Gerald Durrell style antics in the wilds, trapping temperature readings and soil moisture readings with tranquilizer darts, DC21 is really about data domestication. Sure we need to obtain data, but it’s not just about raw, untamed, data; technicians and researchers do things to the data. They clean it and analyse it, and make useful collections out of data from different sources.
The bit we’re interested in for this post is the point at which someone says “I’m ready to write this up” – at this point they will want to make sure their research is defensible, reproducible and, perhaps most importantly, citable. Before we go on to talk about this process, let’s look at some of the assumptions we’re making about the application DC21.
Data capture applications contain working data that might be reworked, cleaned or deleted before it is published or used as the basis for a publication or report.
Research projects are born, they run and they get completed. Research facilities are built and will eventually become obsolete. Data capture systems which service these projects and facilities are likely to suffer the same fate – they will not always have governance in place to ensure that they persist over long periods of time. (Yes, we know it’s in the requirements spec that every app is ‘sustainable’ but let’s be realistic).
The Research Data Repository (RDR) and its sub parts (the data storage system and the Research Data Catalogue RDC) capture important institutional assets. To maintain these research data assets, the RDR will need to have governance in place to ensure its long term persistence.
The RDR will have RIF-CS-over-OAI-PMH and other interfaces that are needed for compliance and data discovery, meaning that data capture applications need not have these (but they can, of course).
A data set that is required for validation of research should have a persistent identifier expressed as an HTTP URI. (Handles and DOIs can both be used to make URIs, with some benefits and attendant risks).
Publicly accessible data sets, as well as those that are expected to be cited even if not available as Open Access
And an implementation detail: At UWS, the ReDBox Research Data Catalogue application will be the software that runs the Seeding the Commons and RDC projects.
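The persistent-identifier assumption above is easy to illustrate: both DOIs and Handles become HTTP URIs by prefixing them with a resolver address. The sketch below assumes the public resolvers at doi.org and hdl.handle.net; the identifiers in the usage lines are invented examples, not real data sets.

```python
from urllib.parse import quote

# Public resolver addresses (assumed here; an institution could run its own proxy).
DOI_RESOLVER = "https://doi.org/"
HANDLE_RESOLVER = "https://hdl.handle.net/"


def doi_to_uri(doi):
    """Express a DOI such as '10.1234/example.5678' as an HTTP URI."""
    # Percent-encode anything unusual, but keep the prefix/suffix slash.
    return DOI_RESOLVER + quote(doi, safe="/")


def handle_to_uri(handle):
    """Express a Handle such as '1234.5/678' as an HTTP URI."""
    return HANDLE_RESOLVER + quote(handle, safe="/")


if __name__ == "__main__":
    print(doi_to_uri("10.1234/example.5678"))
    print(handle_to_uri("1234.5/678"))
```

The attendant risk is visible in the code: the URI is only as persistent as the resolver and the record behind it, which is one reason governance matters.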
Rules of Engagement
Here are some rules of engagement, which are emerging as we get further into the design process for the Research Data Repository (RDR), data capture (DC21) and Research Data Catalogue applications (SC20). These rules are helping to ensure that the research data being captured is robust and well managed. Data sets that are needed to validate research, and which researchers want to be citable:
Must be deposited in the Research Data Storage component (RDS) of the RDR or another persistent store that meets the same standards for data preservation. Note that much data will be in the RDS already; deposit is then a state-change rather than a move.
Must be described in the Research Data Catalogue (RDC) with a link to where the data resides. (Support will be available for this from the library).
Data capture applications must have a mechanism for a researcher to ask for a data set to be ‘curated’ so it is available for a defined period and correctly described, for example if they want to use it as the basis of a publication.
The current solution
Against the background of our medium-term plans for a UWS Research Data Repository, and the above design considerations, rules of engagement and requirements, the technical teams from the Data Capture project and the Seeding the Commons project spent the best part of a day working out a white-board sketch of the interfaces between the lab-level working data management application and the repository.
While this high level solution design assumes ReDBox, other metadata store applications could be slotted in instead – the interface is standards based (RIF-CS over OAI-PMH).
The whiteboard looked like this. Below, we’ll simplify that with a proper diagram made on a computer.
Figure 1 Interface between data capture application and the Research Data Repository (using OAI-PMH and the RIF-CS standard for metadata about research data)
There are two main interface points:
Name authority lookup, which keeps every bit of metadata entered into DC21 as high in quality as possible, via:
A linked-data approach using HTTP URIs (AKA URLs) as names for things, as per the Gospel According to Tim.
A single source of truth via the Mint component of ReDBox for data like subject codes, people, organisations etc.
The ‘curation boundary’ where DC21 hands over metadata to the Research Data Catalogue, and when that’s been curated by data librarians, data is pulled into the public-facing facet of the Research Data Store.
The first of these is already done in DC21 – as far as we know this is the first time a service other than ReDBox has been connected to an instance of the Mint as an authority. We will talk more about the importance of name authorities as ‘sources of institutional truth’ and the use of identifiers as our Research Data Repository project proceeds. For now, we will note that as far as possible every time someone fills out a form with something the institution already knows (a name of a person, a grant-code etc) then the data is looked up in the name authority, rather than relying on people typing strings, or local look-up tables. The UWS Research Data Catalogue is going to be ‘no strings attached’, as in text-strings. URIs all the way!
The more important interface is the second, which is the main subject of this post: it handles deposit of data collections into the trusted Research Data Repository.
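A toy sketch of the ‘no strings attached’ idea follows. The dictionary is a hypothetical in-memory stand-in for a name authority; it is not the Mint’s API, and the URIs and names are invented. The point it shows is that the form searches by label but stores only the URI.

```python
# Hypothetical stand-in for a name authority: URI -> display label.
# A real authority such as the Mint is a separate service; its API is not shown.
AUTHORITY = {
    "http://example.org/people/1": "Dr Jane Citizen",
    "http://example.org/people/2": "Prof John Citizen",
    "http://example.org/grants/42": "Example Soil Moisture Grant",
}


def search_authority(text):
    """Return (uri, label) pairs whose label contains the typed text."""
    needle = text.lower()
    return [(uri, label) for uri, label in AUTHORITY.items()
            if needle in label.lower()]


def describe_collection(title, creator_uri):
    """Build a collection description that names its creator by URI only."""
    if creator_uri not in AUTHORITY:
        # Refuse free-text names: anything stored must come from the authority.
        raise ValueError(f"unknown authority URI: {creator_uri}")
    return {"title": title, "creator": creator_uri}


if __name__ == "__main__":
    matches = search_authority("jane")
    uri, _label = matches[0]
    print(describe_collection("Canopy temperature readings", uri))
```

Because the record carries a URI rather than a typed name, a later correction to the person’s name happens once, in the authority, and every description that points at it stays right.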
Based on all the design considerations and rules of engagement outlined above, the ‘curation boundary’ needs to be crossed when a researcher wants to keep an archival snapshot of a particular data set.
The story here is designed for data sets of moderate size, like those we’re getting from the Hawkesbury Institute for the Environment.
So, here’s the story:
A researcher uses the DC21 application to find a number of data files from across two of the facilities at the institute, conducts some analysis and writes an article. (The system remembers every download from the data store).
The researcher asks for the particular data set used for the article to be published/curated, either by uploading the data back into the system, or clicking on a search history.
The DC21 application bundles the requested data with as much provenance and metadata as possible, for example by including the raw data.
The DC21 application sets a flag against that downloaded collection to mark it as ready for publication – meaning it will start appearing in the OAI-PMH feed. The DC21 application will also remember that the data behind the collection has been referenced in a collection. This is to ensure that the data is not subsequently deleted or modified without due consideration for the collection.
The Research Data Catalogue, which is part of the Research Data Repository, picks up the new collection record from the OAI-PMH feed and puts it in the ‘ReDBox inbox’.
The team of data librarians sees the new data set in the inbox, adds missing metadata for management and discovery purposes, maybe contacting the researcher for more information, and publishes the data.
The Data Catalogue application mints a new DOI for the data set, and causes the data to be copied into the public part of the research data store. (Yes, we have to work out some of the details about when IDs get minted in this process – this step might need to happen earlier.)
Later, another researcher can discover the data via searching the web, a discovery service like Research Data Australia, or the Research Data Catalogue directly. They get a URL version of the DOI for the data set.
When someone downloads the data using the DOI-URL, they’re redirected to the data in the Research Data Store.
Figure 2 Step-by-step data curation and publishing process
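The story above can also be read as a small state machine. The sketch below is our own simplification, not code from DC21 or ReDBox: the state names, function names and the DOI in the usage lines are all hypothetical, and DOI minting is reduced to passing in a string, since when IDs get minted is still an open detail.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    WORKING = "working data"
    READY = "flagged for publication"
    IN_INBOX = "in curation inbox"
    PUBLISHED = "published"


@dataclass
class Collection:
    title: str
    files: list
    state: State = State.WORKING
    doi: Optional[str] = None


def flag_for_publication(collection):
    """Researcher asks for the data set to be curated."""
    if collection.state is not State.WORKING:
        raise ValueError("only working collections can be flagged")
    collection.state = State.READY


def oai_feed(collections):
    """Collections exposed for harvesting: everything that has been flagged."""
    return [c for c in collections if c.state is not State.WORKING]


def harvest(collections):
    """Catalogue pulls newly flagged collections into the librarians' inbox."""
    inbox = []
    for c in oai_feed(collections):
        if c.state is State.READY:
            c.state = State.IN_INBOX
            inbox.append(c)
    return inbox


def publish(collection, doi):
    """Librarian publishes; the DOI is supplied by a minting step not shown."""
    if collection.state is not State.IN_INBOX:
        raise ValueError("only curated inbox items can be published")
    collection.doi = doi
    collection.state = State.PUBLISHED


def is_protected(filename, collections):
    """Files behind a flagged collection should not be deleted casually."""
    return any(filename in c.files for c in oai_feed(collections))


if __name__ == "__main__":
    c = Collection("Soil moisture, two facilities", ["a.csv", "b.csv"])
    flag_for_publication(c)
    harvest([c])
    publish(c, "10.1234/example")
    print(c.state, c.doi, is_protected("a.csv", [c]))
```

Modelling deposit as a change of state rather than a copy matches the rule of engagement that much data is already in the Research Data Store when the researcher asks for it to be kept.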
Copyright Peter Sefton and Peter Bugeia, 2012. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>