Research Data Repository (RDR) progress report, May 2013

The RDR project at UWS started in 2010 with the purchase of some storage infrastructure, and was expanded in scope in 2012, based on this scoping document. Work began in earnest in June 2012 when project manager Toby O’Hara joined the team. We set out with these broad principles in mind:

The repository will consist of two main components:

  1. A scalable storage service linked to a combination of local and cloud-based high performance computing. Some data may also reside in other, trusted storage systems such as national infrastructure or discipline repositories with suitable governance in place.

  2. A catalogue of research data for internal use in management, and external use in dissemination and collaboration.

But the project is about much more than supplying storage and computing. It is about creating an organisational capability and culture of managing research data throughout the research lifecycle. We aim:

  • To enable research in all disciplines at UWS to take place efficiently and effectively on existing and new data sets.

  • To enable the validation of research through appropriate management of data inputs and outputs.

  • To enable re-use of data in new research, which will cite the creators of data sets at UWS.

  • To comply with funder requirements and codes of practice.

Those two main components are now established: both working storage (RDS) and archival storage (RDR) are commissioned and working on a small scale. (Note that terminology on this project has changed a bit – the RDR used to refer to all the components, but it became quite clumsy to talk about ‘the archival repository part of the broader Research Data Repository’.)

Figure – Super-simple view of the Research Data Repository with the two main kinds of storage – Working vs Archival

On top of that simple view, we can show how the RDR sits with other systems.

Figure RDR interaction with two other services. integration is a simple one-way approach while the HIEv data capture application interacts with both working and archival storage via the Catalogue

There are many, many ways that these services could be extended but we have identified three high priorities from consulting with UWS researchers, and talking to other eResearch teams, which we’ll talk about in more detail below:

  1. Adding support for distributed version control systems used by tech-savvy researchers to manage software code and documents.

  2. Adding more support for distributed file-systems like Dropbox, but with better support for data security, access control and the ability to add eResearch applications over the top of the storage.

  3. Dealing with the looming ‘feral file’ problem, where data storage tends to fill up, and there is a lack of options for researchers to hand over data to an archival store.

Dealing with source-code and document version-control systems

There are two widely used distributed version control systems: Git and Mercurial. Many researchers use these to manage program code and/or document sources for publications in text markup such as LaTeX and, increasingly, Markdown, via tools like knitr in the R environment. We are working to add support for these version-control repositories to our data repository, which should be fairly straightforward, as modern distributed code repositories support the key use-case by design. That is, they allow you to ‘push’ code changes to more than one repository, so a UWS member of a team that is already happily working with, say, BitBucket could push repository changes to a UWS archival repository for safe keeping, as well as to the team repository. Why would they want to do this? It’s not about short-term risk, but about having copies of data that are independent of service providers that might come and go in the medium to long term. And it’s about exactly the same use-cases for packaging data and depositing it in an archival repository as with any other data project: when projects end, articles are published, etc. More on this in a post soon.
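As a minimal sketch of the ‘push to more than one repository’ idea, the Python snippet below builds the Git commands a team member might run to register a second remote and push all branches to both. The remote name `uws-archive` and the URL are hypothetical placeholders, not a real UWS endpoint, and the existing team remote is assumed to be called `origin`. The commands are only constructed unless `run=True` is passed.

```python
import shlex
import subprocess
from typing import List


def build_commands(archive_name: str, archive_url: str,
                   team_remote: str = "origin") -> List[List[str]]:
    """Return git commands that add an archival remote and push to both remotes.

    All names and URLs are supplied by the caller; the ones used in the
    demo at the bottom are hypothetical.
    """
    return [
        # Register the archival repository as an additional remote.
        ["git", "remote", "add", archive_name, archive_url],
        # Push every local branch to the team's existing remote...
        ["git", "push", "--all", team_remote],
        # ...and the same branches to the archival remote.
        ["git", "push", "--all", archive_name],
    ]


def push_to_both(archive_name: str, archive_url: str,
                 team_remote: str = "origin", repo_dir: str = ".",
                 run: bool = False) -> List[List[str]]:
    """Build the commands and, only if run=True, execute them in repo_dir."""
    commands = build_commands(archive_name, archive_url, team_remote)
    if run:
        for cmd in commands:
            # check=True stops at the first failing git command.
            subprocess.run(cmd, cwd=repo_dir, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical archive URL, shown as a dry run only.
    for cmd in push_to_both("uws-archive",
                            "ssh://git@archive.example.org/team/project.git"):
        print(shlex.join(cmd))
```

Keeping the archive as a separate named remote, rather than rewriting the team remote, means the team's existing workflow is untouched and the archival copy stays independent of the hosting provider.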

Future file systems

Dropbox, the file sync-and-share product, is a clear winner in the distributed file-system stakes. It has a low-friction viral quality that has let it spread in ways that permeate and subvert our institutional networks and command-and-control structures. And it has an unparalleled ease of use [1]. But there are two major problems:

  1. There are some kinds of data for which one should NOT use Dropbox; the researcher has to decide whether they are meeting ethical standards, funder requirements and layers of institutional policy.

  2. And while Dropbox has an API (an interface against which third parties can write software applications), it is severely limited for doing the kind of ‘bridging’ work we want to do between the RDS working-data store and the RDR archival store.

So, the fact that Dropbox is so popular, and so good, makes it clear that even if we can’t match it completely, we should be thinking about how to provide a similar service so research teams can:

  • Store stuff on all their devices and have it automatically synchronise between them, with some limits on re-sharing.

  • Invite others that they identify as collaborators to see the files. (No, that does not mean getting them to fill in and sign a form to apply for a university account, the way I have heard it described at a big university not far from here. It means I send you an invitation by email, you log in using something that (a) suits you and (b) works, for example a Gmail account, and once I’m sure that you are you, the sharing starts. Yes, there are exceptions where we need higher levels of assurance, but for most collaborations too many barriers mean people will revert to Dropbox and smuggled USB drives.)

And, beyond what Dropbox can provide:

  • Store stuff in the right jurisdiction.

  • Allow eResearch tools, such as the one we cover next, to access data via full-service machine interfaces (APIs).

There is a promising new application in this space called Cloudstor+, run by AARNet. It gives Australian researchers 100GB of free storage, which can be expanded at low cost, and it runs on the open source OwnCloud platform.

But note that there are many kinds of data that should NOT be placed in sharing-syncing services for various privacy and other legal reasons.

Creating a bridge between working file-storage and the archive

We are now starting to hand out file-shares, which will, of course, fill up with files as researchers begin to take advantage of the storage space. But what will happen to those files when articles are published, projects and grants finish, or research staff leave the institution? There are good reasons in all these situations to make sure that data are catalogued and that files are transferred to the Archival Store.

But it would be naïve to think that just because there are good reasons for these things to happen, they will. That’s why we have been working out how to encourage researchers to deposit data at various points in the existing research lifecycle – see our previous post on data management use-cases, where we look at how and, more importantly, why people might be motivated to catalogue and deposit data.

Some data will come to the catalogue via applications like HIEv – the environmental data capture application. At the Hawkesbury Institute for the Environment (which is where the HIE in the name comes from) data is captured by technical research infrastructure and routed automatically to HIEv, where institute staff and collaborators can work with it. When they use a data set and publish an article or create a data set for re-use then they can trigger the process of having it sent to archival storage and cataloguing.

But for data that is not coming through a data capture application (uncatalogued, ‘wild’ or ‘feral’ data), we want to provide a way for research teams to:

  • Look at their file-share and see all their (file-based) stuff.

  • Select groups of things that belong together, by directory, by file-type, by a search query, or by picking them out manually.

  • Add metadata to contextualise and explain the files, to support future re-use, and to explain how the data support published findings.

  • Publish/archive the data by sending it to ReDBOX, the archival part of the overall Research Data Repository, where librarians will help optimise metadata and mind the data for the appropriate length of time.

Enter CrateIt (or Cr8it, that’s Crate-it), an application that lets a user pack, label and send data as just described. In this part of the RDR project Lloyd is writing an OwnCloud plugin which can be used to find, preview, describe, pack and send research data files from the working store to the Research Data Repository for archival storage (or, in the case of very large data sets, to send links to the files).
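To make the pack-and-label steps concrete, here is a rough Python sketch of selecting files by type and wrapping them in a described manifest with checksums. It is not how Cr8it itself is implemented: the function names, the manifest layout and the metadata keys are all assumptions made for illustration, and the ‘send’ step is left out entirely.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def select_files(root: Path, suffixes: Iterable[str]) -> List[Path]:
    """Pick files under `root` whose extension is in `suffixes`, e.g. [".csv"]."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )


def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, files: List[Path],
                   metadata: Dict[str, str]) -> Dict:
    """Combine descriptive metadata with a per-file listing (hypothetical layout)."""
    return {
        "metadata": metadata,
        "files": [
            {
                # Paths are stored relative to the share so the package is portable.
                "path": p.relative_to(root).as_posix(),
                "bytes": p.stat().st_size,
                "sha256": sha256_of(p),
            }
            for p in files
        ],
    }
```

A caller would pass the root of a file-share, the selected files and a small dictionary of descriptive metadata (for example a title and creator), then serialise the result with `json.dumps` before handing it to whatever deposit mechanism is in use.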

We have written previously about a prototype application that does a lot of this already but the OwnCloud version is promising because it is integrated with OwnCloud’s existing sharing and replication services so Cr8it can take advantage of its access control services.

What next?

Work is proceeding now on the three priorities mentioned above: integration with version control systems, file-sharing and synchronisation, and the Cr8it application for corralling files.

Beyond that, the future is less certain. The roadmap for eResearch at UWS, which is now more or less complete but yet to be approved by the eResearch Steering Committee, calls for a steady roll-out of:

  • More data capture applications at more sites, including research institutes and research groups.

  • Developing institute and school level data management plans following the lead of the Hawkesbury Institute for the Environment.

  • Further integrating data management services into the research lifecycle.

  • Improved integration with computing resources and collaboration tools.

  • Incremental improvements and upgrades to all existing services.

    [1] For a quirky take on this, consider Les Orchard’s musing on how it treats him like he treats his pets. This is an interesting way to think about service provision:

    consider these pointers for being nice to animals:

    • Give them a reason to come to you. Don’t chase after and grab.

    • If they want to leave, let them. Don’t hold on and squeeze tight.

    • If you are allowed to pick them up, hold them gently yet offer enough support to make them feel safe.

    • Pay attention to their reactions, learn what kind of attention they like. This gives them a reason to come back when you let them leave.

    Les lives with bunnies; I live with a dog. With dogs you need to show them very explicitly where they rank in the family pack (i.e. below the humans). That’s not a strategy I’d recommend IT or eResearch staff take with their local institute director!

Research Data Repository (RDR) progress report, May 2013 by Peter Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.