eResearch Tools day. Another look at CKAN
On Tuesday August 12th the UWS eResearch team had one of our irregular-but-should-be-regular tool-hacking days, looking at the CKAN data repository software again. There were three main aims:
Evaluate the software as a working-data repository for an engineering lab, and maybe an entire institute, similar to the way HIEv sits in the Hawkesbury Institute for the Environment.
Evaluate the software as a generic research data management solution for groups wanting to capture data into a repository as part of their research. Does CKAN fit with our principles for eResearch Development and Selection? Joss Winn wrote about CKAN as a research data store a couple of years ago, explaining why they chose it at Lincoln, and there was a workshop last year in London which produced some requirements documents etc.
Provide a learning opportunity for staff, giving them a chance to try new things and develop skills (such as using an API, picking up a bit of Python, etc.).
David demoed CKAN and showed:
Simple map-based visualization using a spreadsheet of capital cities he found on the Internet
Simple plotting of some mathematical data
And then what?
We then broke up into small groups (mainly of size one, if we’re honest), to investigate different aspects of CKAN.
- Katrina and Carmi: Looking at the abilities for uploading Excel files by ingesting some data.gov.au datasets. What can be done, what can’t? What happens with metadata?
- David: Looking into the upload of a HIEv package/cr8it crate into CKAN. Can we automagically get the metadata out and stash it in CKAN? Can we represent the package’s file structure in CKAN?
- Alf: Document this instance and preview infrastructure needs.
- PeterS: Previews for Markdown and other files; getting stuff out of files; events/queues; RDF and URIs.
- PeterB: TOA5 uploads from HIEv
- Lloyd: POST an ADELTA record into CKAN.
So how did we do?
Well, we got data moving around via a number of methods – spreadsheets went in via the web interface, documents went in over the API, documents came out over the API.
We learnt the differences between CKAN’s structured and unstructured data. "Structured" data is essentially tabular data: if you’re bringing it in via a CSV or a spreadsheet then it’s structured. What this means is that it gets stored as a relational table within CKAN and in principle this means you can access particular rows. Unstructured data is anything else, and you can access all of a blob or none of it.
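As a concrete illustration, row-level access to structured data goes through CKAN’s `datastore_search` API action. A minimal sketch of building that request – the base URL and resource id below are placeholders, not a real instance:

```python
from urllib.parse import urlencode

# Placeholder CKAN instance and resource id for illustration only.
CKAN_URL = "https://ckan.example.org"

def datastore_search_url(base_url, resource_id, limit=5, offset=0):
    """Build the URL for CKAN's datastore_search action, which returns
    selected rows of a structured (tabular) resource as JSON."""
    params = urlencode({"resource_id": resource_id,
                        "limit": limit, "offset": offset})
    return f"{base_url}/api/3/action/datastore_search?{params}"

url = datastore_search_url(CKAN_URL, "my-resource-id", limit=2)
# To run it against a live CKAN (e.g. with the `requests` package):
#   rows = requests.get(url).json()["result"]["records"]
```

Unstructured resources, by contrast, only offer a download URL for the whole blob.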
We found gists handy for passing code snippets and wee “how-to” texts between the team on Slack.
My CKAN day…
We had a reasonably successful day. I found the upload of a file resource through the CKAN API (from Python) worked a lot more easily with the extra documentation. We had some problems with the security key, in that the API wouldn’t run for me or Peter S when using our own keys, but it all worked when we used each other’s – reason: unknown. From a Python script we were able to open a specially formatted CSV file (TOA5 format from Campbell Scientific, which has two additional rows of metadata at the top), decode the first two rows, and turn the metadata into name/value pairs when we created the CKAN dataset. So this was done fairly flexibly. A lot of our HIE climate change data is formatted this way, which means we should be able to ingest records fairly readily as CSV.
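A minimal sketch of that TOA5 handling. The environment-line field names below follow Campbell Scientific’s published TOA5 layout, but treat the exact keys (and the sample data) as illustrative assumptions rather than what we actually ran:

```python
import csv
import io

# Column names for the TOA5 "environment" header line, per Campbell
# Scientific's documented layout (an assumption for this sketch).
TOA5_ENV_FIELDS = ["format", "station_name", "logger_model", "serial_no",
                   "os_version", "program_name", "program_signature",
                   "table_name"]

def read_toa5(text):
    """Split a TOA5 file into (metadata dict, column names, data rows)."""
    rows = list(csv.reader(io.StringIO(text)))
    metadata = dict(zip(TOA5_ENV_FIELDS, rows[0]))  # first header row
    columns = rows[1]                               # second header row
    return metadata, columns, rows[2:]

# Made-up sample in the TOA5 shape:
sample = ('"TOA5","HIE_Station1","CR3000","1234","CR3000.Std.21",'
          '"climate.cr3","5678","Table1"\n'
          '"TIMESTAMP","RECORD","AirTemp"\n'
          '"2014-08-12 09:00:00",0,18.4\n')
meta, cols, data = read_toa5(sample)
# meta["station_name"] == "HIE_Station1"; cols[2] == "AirTemp"
```

The `metadata` dict then maps straight onto name/value pairs on the CKAN dataset.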
I wrote some short instructions (in a gist) on how to start up our CKAN instance.
Unfortunately the rest of the time generated more heat than light, as I read up on CKAN’s web-based previewing feature, which uses Recline.js as well as Data Proxy, but it’s still a little unclear to me how it all ties together.
Peter B pointed out that extracting individual rows from datasets is possible if the dataset is kept in a database underneath CKAN rather than as a file "blob". So I did some reading and partial setup of the CKAN Data Storer Extension. The setup guide is aimed at someone with more Python experience than me, so I got trapped in "celery and pasta (paster) land" for most of the afternoon!
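For the record, once that extension is running, rows are pushed into the database via CKAN’s `datastore_create` action. A sketch of building its payload – the resource id, field types and sample row are illustrative:

```python
import json

def datastore_create_payload(resource_id, fields, records):
    """Build the JSON body for a POST to /api/3/action/datastore_create.
    `fields` is a list of {"id": ..., "type": ...} column definitions and
    `records` a list of row dicts keyed by column id."""
    return json.dumps({"resource_id": resource_id,
                       "fields": fields,
                       "records": records})

payload = datastore_create_payload(
    "my-resource-id",
    [{"id": "TIMESTAMP", "type": "timestamp"},
     {"id": "AirTemp", "type": "float"}],
    [{"TIMESTAMP": "2014-08-12 09:00:00", "AirTemp": 18.4}])
# POST this with an Authorization: <api-key> header, e.g. via `requests`:
#   requests.post(CKAN_URL + "/api/3/action/datastore_create",
#                 data=payload,
#                 headers={"Authorization": api_key,
#                          "Content-Type": "application/json"})
```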
Initial success in dusting down my long-dormant Python skills and getting data in and out via the API was followed by losing a lot of time trying to extract the RDFa from the HIEv package’s HTML. Neither manual crufting nor Python’s [rdfadict](https://pypi.python.org/pypi/rdfadict) could get it all out (in fact, the library got nothing. Nothing!). The lesson here is to be sure that we put metadata in a place and a form that lets us get it out programmatically.
Notwithstanding that, CKAN had a lot going for it in terms of upload and access, but it wasn’t immediately clear how it would handle complex metadata within its data model.
At Tools Day I learned, for the first time, to create a new dataset item and upload a data file to that item via the CKAN API using Python. That was the highlight for me. It was also interesting to see what is possible in terms of visualising data. I uploaded a few Excel spreadsheets and the graphing interface was very user-friendly. I would like to see it utilised for data visualisation on the Centre for the Development of Western Sydney’s website.
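The dataset-creation half of that workflow can be sketched with the standard library like this (the URL and API key are placeholders; attaching the file itself is a follow-up `resource_create` call, most easily done with the `requests` package):

```python
import json
from urllib import request

# Placeholder instance and key; CKAN shows your API key on your user page.
CKAN_URL = "https://ckan.example.org"
API_KEY = "my-api-key"

def package_create_request(name, title):
    """Build (but do not send) the POST request for CKAN's package_create
    action, which creates a new dataset ("package" in CKAN-speak)."""
    body = json.dumps({"name": name, "title": title}).encode("utf-8")
    return request.Request(
        CKAN_URL + "/api/3/action/package_create",
        data=body,
        headers={"Authorization": API_KEY,
                 "Content-Type": "application/json"})

req = package_create_request("tools-day-demo", "Tools Day demo")
# To actually create the dataset:  request.urlopen(req)
# Then POST to /api/3/action/resource_create with the dataset id and a
# files={"upload": ...} multipart body to attach the spreadsheet.
```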
This time posting actual data to CKAN seemed easier – I am assuming the documentation must have improved. I managed to put together something that could create new datasets and attach new files – a potential denial-of-service attack against CKAN, or a tool for testing its scalability. And at Peter B’s suggestion I worked on some very simple code to extract metadata and CSV from TOA5 files, as used by the Campbell Scientific data loggers residing at the Hawkesbury Institute for the Environment.
The $64,000 Question: is CKAN up to it?
In general, yes: CKAN seems to be a reasonable platform for data management that aligns well with our principles.
It has the basic features we need:
APIs for getting stuff in and out and searching
A discovery interface with faceted search
Previews for different file types
There are some limitations.
Despite what it says on the website and what Joss Winn reports, it’s not really ‘linked-data-ready’
It does have metadata, and that is extensible, but there’s no formal support for recognized ‘proper’ metadata schemas, just name/value pairs
There are some questions still to explore:
How well will it scale? We can probe this easily enough by pumping a lot of data into it
How robust and transactional is the data store? If we have different people or processes trying to act on the same objects at the same time will it cope or collapse?
Can we use more sophisticated metadata? We might look at things like the ability to add an RDF file that contains richer metadata than the built-in fields. How hard would this be? Could we allow richer forms for filling out, say, MODS metadata?
Ditto for using URIs. How easy would it be to add real linked-data support? Would a hack do? That is, instead of storing plain name/value pairs, allow some conventions like name (URI)/value (URI). Again, how easy is it to hack the user interface to support things like autocomplete using name authorities rather than collecting yet more strings?
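That hack could ride on CKAN’s existing “extras” mechanism with no code changes at all. A sketch of what such dataset extras might look like – the Dublin Core term and the ORCID below are illustrative placeholders, not a worked-out convention:

```python
# CKAN "extras" are a flat list of name/value string pairs on a dataset.
extras = [
    # an ordinary extra, as CKAN uses them today
    {"key": "spatial-coverage", "value": "Hawkesbury, NSW"},
    # the proposed convention: a URI as the name, a URI as the value
    {"key": "http://purl.org/dc/terms/creator",
     "value": "http://orcid.org/0000-0000-0000-0000"},
]
# This list would be sent as the "extras" field of a package_create or
# package_update call; CKAN stores the strings opaquely, so any URI
# semantics live entirely in our own conventions and tooling.
```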
We didn’t talk to each other as much as we should have. This was possibly due to the venue – our offices – which meant people went to their desks. Next time we’ll be in a more interactive venue.
David was held up by the design of the data packages from HIEv – we need to revise the data packaging so that it has metadata in easy-to-use JSON as well as metadata embedded in RDFa.
eResearch Tools day. Is CKAN aDORAble? by members of the UWS eResearch team is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.