By Peter Sefton and Toby O’Hara
This post looks at some of the ways that researchers and the staff who support them might interact with the new Research Data Repository (RDR) we are building at the University of Western Sydney (UWS). This is a document for discussion. As implementation manager, Toby has consulted with several stakeholders on how the RDR will be used, and we’d like feedback, preferably via the comments on this blog, or by email if that suits you better. This post will be of most interest to those involved in the RDR project at UWS, to those who are currently setting up or improving an institutional research repository, and to the potential ‘customers’ of a data repository.
And it might be of interest to the anonymous blogger the Library Loon, who doesn’t approve of most research data management planning. She points out:
Most data-management models (the one she is thinking of no exception) map extremely poorly to the research-project cycles and timelines that researchers are accustomed to. The milestones researchers think about—grant applications, awards, data capture, data analysis, interim-report writing, article authoring, renewal applications, and so forth—barely appear in data-management models.
What we’re going to talk about here is quite similar to what I think is happening at many Australian institutions. A lot of the thinking here follows the lead of Vicki Picasso’s team at Newcastle and their work on building an institutional research data catalogue which responds to institutional triggers, including grant applications and awards.
The goal of our RDR project is to provide the benefits we described before: backed-up, well-described, reusable data to foster collaboration and data citations, not to mention keeping funding bodies happy. All with the least possible negative impact on the research community. To accomplish this, the RDR project aims to fit in with the existing research lifecycle. Below, we outline several different scenarios where various participants interact with the RDR. If these scenarios still make sense after further consultation, we will use them to inform our data management planning at the university.
A word about the diagrams
The diagrams in this post all have slightly differing levels of detail – the idea is to illustrate each point once, rather than repeat things. A few things to keep in mind when reading the diagrams:
Research data and the methods for producing data come in all varieties. The diagrams are necessarily simple, so as not to exclude data types, or different ways of handling data.
While not all the diagrams include a data management plan, such plans will be a part of our new repository workflows; the intention is to start with human-readable plans and then progress to machine-driven planning, for example making it possible for research management systems to send reminders when data deposits are due.
The Research Data Repository has two components. The Research Data Store (RDS) is for storing the data itself, in groups, packages, notebooks, bundles, or “collections”. The Research Data Catalogue (RDC) is for storing descriptions of the data, and these descriptions tie back to the data that is deposited, either in the RDS or in a trusted store elsewhere.
The library is responsible for management of the Research Data Catalogue component of the RDR. In each case, there is a loop back from a data librarian to the Researcher in the event that the data description needs to be improved to maximise its ability to advertise the existence of data, promote re-use and assist in data management.
The university’s research management systems supply information about researchers, grants and publications, and trigger notifications to the RDR or to a designated librarian mailbox.
As mentioned elsewhere in this blog, there is an overarching objective to raise awareness of what data has been collected and may be available for reuse. One of the ways this can happen is by sending the data descriptions to the ANDS Research Data Australia service (the pink boxes, below).
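To make the split between catalogue and store a little more concrete, here is a minimal sketch of what a catalogue description might look like. It is purely illustrative: the class, the field names and the helper function are our own assumptions for discussion, not the schema of the RDR or of any ANDS service.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CatalogueRecord:
    """One description in the Research Data Catalogue (RDC).

    All field names here are hypothetical.
    """
    title: str
    description: str
    researchers: List[str]
    data_location: str             # ties the description back to the data itself
    held_in_rds: bool = True       # False when the data sits in a trusted store elsewhere
    grant_id: Optional[str] = None
    shareable: bool = False        # True only where sharing the description is permissible


def descriptions_to_share(records: List[CatalogueRecord]) -> List[CatalogueRecord]:
    """Select the descriptions that may be passed on to an external service."""
    return [record for record in records if record.shareable]
```

The point of the sketch is that the description and the data are separate things: a record can point at data in the RDS or somewhere else, and only descriptions marked as shareable would be passed on.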
The first use case, the library-led deposit, deals with:
historical research completed some time ago,
research that has recently completed, or
research that is ongoing, but has been flagged as a potential candidate for inclusion in the repository for strategic reasons.
Our first scenario is already happening. As a way of introducing the concept of a Research Data Catalogue to the university and getting experience in managing one, the library is leading a project known as “Seeding the Commons”, funded by the Australian National Data Service (ANDS). This project involves the Library team identifying places where data might reside, via discussions with research administration, searching for grants awarded and looking at UWS’s publications.
The Seeding the Commons program is a way of getting some experience with the business of describing research data and relating it to other entities such as people, organisational units, research projects and grants. Many of the processes we’re going through at this stage are not scalable to the whole enterprise over the long term; a librarian will not be this directly involved in every deposit of data in the future, but the library does figure in all the stories we’re telling in this post and the next about how the RDR will work. Nor will we make the mistake of thinking that ‘if we build it, they will come’ and expecting researchers to start spontaneously depositing data.
Data Capture deposits
A lot of research data is generated by machines such as sensors or instruments, so it can be captured, labelled and described close to the data source, in such a way that people other than the original researchers involved in collecting it know what it is and how it might be reused. ANDS have funded a number of projects in this area under their Data Capture stream. We have one of these projects at UWS, being built for the Hawkesbury Institute for the Environment. It will capture data coming from large forest-based experiments, house it in a working data store where researchers can interrogate it and work with it, then send it to the Research Data Repository for archiving. In this workflow, there will be some automated data processing: once a certain class of data is well described, it can flow from a facility automatically. For example, a month’s worth of weather data does not need a research librarian to describe it every month. For some other scenarios, particularly compiling a data set to support a publication (which is covered below), there will still be humans involved in selecting and describing data.
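The routing idea in this workflow can be sketched in a few lines. This is a toy illustration under our own assumptions: the set of auto-approved data classes, the function name and the two outcome labels are all hypothetical, and the real data capture system may make this decision quite differently.

```python
# Hypothetical: classes of data that are already well described
# and so can flow into the repository without manual description.
AUTO_APPROVED_CLASSES = {"weather"}


def route_deposit(data_class: str, supports_publication: bool = False) -> str:
    """Decide whether a deposit can be archived automatically.

    Data sets compiled to support a publication always go to a person,
    because someone still has to select and describe the data.
    """
    if supports_publication:
        return "librarian_review"
    if data_class in AUTO_APPROVED_CLASSES:
        return "auto_archive"
    return "librarian_review"
```

With these assumptions, a monthly weather deposit would be archived automatically, while a new kind of data, or anything assembled for a journal article, would be routed to a human.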
There is an additional level of detail in this diagram, which is not in the Library-led process above, which simply makes explicit the fact that the RDR encompasses the storage and the catalogue. The storage is where the data is deposited and the catalogue is where the descriptions of the data are kept, and (where permissible) shared with other systems.
We don’t need to point out that publishing is one of the key parts of the research life cycle. In our initial work on the UWS Research Data Repository, we have been approached by researchers wanting to deposit data somewhere accessible (anywhere!) because it is required by the journal to which they’re submitting. This will become more and more important as funders start to mandate open access data along with open access publications, and the scholarly process (not just the scholarly communications process) re-forms itself. Note that in this scenario a DOI (Digital Object Identifier) is created for a data set so it can be cited like a publication. This is not hugely important to researchers yet, but we’re betting that it will rapidly become so as systems to collect data citation metrics start coming on line, with those metrics almost certainly counting towards government reporting.
The Library Loon’s timely post about making sure that data management plans align with research practice matches our thinking on this. We’re working with researchers at the Hawkesbury Institute for the Environment on how data capture and repository systems need to be configured to support their publications, for example where they use R to fetch data, clean it, run models, then generate the figures for an article. More on that soon.
Grant-driven data deposit
Grants are key to the research lifecycle. At the University of Western Sydney, work has already been done on integrating data management into the research lifecycle, starting with applications for internal grants. We know that changing the research culture will take some time, but eventually thinking about eResearch requirements (not just data management, but computing and collaboration needs too) will become normal for all researchers, just as ethics forms are for many now.
We present two scenarios here: one when a grant starts and the researcher is prompted to finish and deposit a data management plan, and another when the grant finishes and there is a check to make sure the data management plan has been followed. In between, of course, there might be other research lifecycle events that trigger data deposits.
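As a rough sketch of how these two triggers might eventually be machine-driven, the following shows a research management system working out which reminders are due for a grant. Everything here is an assumption for the sake of discussion: the Grant fields, the function name and the wording of the reminders are hypothetical and do not describe an existing UWS system.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Grant:
    """Hypothetical view of a grant as held in a research management system."""
    grant_id: str
    start: date
    end: date
    plan_deposited: bool = False   # data management plan finished and deposited
    plan_checked: bool = False     # end-of-grant check has been carried out


def reminders_due(grant: Grant, today: date) -> List[str]:
    """Return the reminders that should be sent for this grant today."""
    due = []
    # Scenario one: the grant has started but no plan has been deposited.
    if today >= grant.start and not grant.plan_deposited:
        due.append(f"{grant.grant_id}: finish and deposit the data management plan")
    # Scenario two: the grant has finished and the plan has not been checked.
    if today >= grant.end and not grant.plan_checked:
        due.append(f"{grant.grant_id}: check that the data management plan was followed")
    return due
```

Other lifecycle events that trigger deposits could be added as further checks in the same function, which is one reason we would like plans to be machine-readable in the long run.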
Reporting-driven data deposit
In Australia, one of the key drivers, and a driver with a very heavy right foot, is the government. All universities have to report on publications via HERDC and research excellence via ERA. This means that there are processes in place for reporting that form a large part of the research lifecycle. This is another place to tie in data management processes for data that is of strategic significance. The next diagram shows a couple of scenarios that could be driven by the reporting cycle: either a significant publication, where it is important to make sure that data is kept for reproducibility and reuse, or reporting on research of global significance, where advertising data might lead to more such research.
We recognise that there’s room for improvement in each of the scenarios above, and not just because we’ve kept the sequences high level and skipped over some intricacies. We hope that this will drive some discussion and exploration of options. Perhaps your university is also implementing similar processes and technologies; if so, please feel free to use us as a sounding board. If there is sufficient interest in any of the scenarios above, we’re happy to organise a forum for discussion.
Copyright Peter Sefton and Toby O’Hara 2012. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>