Penguin Archive workflow design

One of our aims is to produce a workflow for publishing Linked Open Data that is simple for our collection curators to engage with and continue to use post-project. Whilst we can produce one-off Linked Data snapshots of their archives during the project, a real measure of success would be to see those processes embedded into the standard work of the collection maintainers. From their point of view, the steps for re-publishing the various collections as Linked Data should be as straightforward as possible: just uploading a raw data file, for example. We want to avoid (as much as possible) introducing requirements that involve too much manual intervention.

So let’s consider one possible workflow for the Penguin Archive in this post. (This workflow is based around the concept of a local Linked Data publishing platform, though we are also considering remote third-party options.)

  • The collection is held in an instance of Axiell CALM. Using its desktop client, it is a relatively straightforward process for an archivist to export collection data to EAD XML format.
  • The resulting files can be uploaded to our Linked Data Publishing Service (LDPS) via a browser-based tool. This tool will provide a simple management interface to govern the workflow, the first part of which is transformation into RDF. Building on work done in the LOCAH project, we use XSLT to produce RDF from the EAD XML (a sketch of this transformation step follows the list).
  • Using the LDPS again, the RDF is loaded into a triple store. Our intention here is to use Fuseki for this task. Using its support for SPARQL over HTTP, the LDPS will manage each collection as a separate Named Graph (a loading sketch also follows the list). Fuseki provides a SPARQL endpoint to these Graphs and also serves the Default Graph as their union.
  • In order to serve the Linked Data we intend to use Elda. This is an implementation of the Linked Data API and will answer Linked Data requests by translating them to backend SPARQL queries.
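
To make the transformation step a little more concrete, here is a minimal sketch of the kind of EAD-to-RDF conversion the LDPS would run behind its upload interface. It uses Python with lxml purely as an illustration; the stylesheet name (ead2rdf.xsl) and the file paths are placeholders rather than the actual LOCAH artefacts.

```python
# Minimal sketch: transform an exported EAD XML file to RDF/XML with an XSLT
# stylesheet, in the spirit of the LOCAH work. The stylesheet and file names
# below are illustrative placeholders, not the project's actual artefacts.
from lxml import etree

def ead_to_rdf(ead_path, xslt_path, rdf_path):
    """Apply an EAD-to-RDF XSLT stylesheet and write the resulting RDF/XML."""
    ead_doc = etree.parse(ead_path)
    transform = etree.XSLT(etree.parse(xslt_path))
    rdf_doc = transform(ead_doc)
    rdf_doc.write(rdf_path, pretty_print=True, xml_declaration=True,
                  encoding="UTF-8")

if __name__ == "__main__":
    ead_to_rdf("penguin-archive.xml", "ead2rdf.xsl", "penguin-archive.rdf")
```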
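
And a similar sketch of how the LDPS might push each collection's RDF into Fuseki as its own Named Graph using the SPARQL 1.1 Graph Store HTTP Protocol, then check it via the SPARQL endpoint. The host, dataset name, service paths and graph URI are assumptions for illustration; serving the Default Graph as the union of the Named Graphs is a Fuseki configuration option rather than something this code sets up.

```python
# Sketch: load a collection's RDF into its own Named Graph in Fuseki via the
# SPARQL 1.1 Graph Store HTTP Protocol, then query the SPARQL endpoint.
# The service URLs, dataset name and graph URI are illustrative assumptions.
import requests

FUSEKI = "http://localhost:3030/archives"        # hypothetical dataset name
GRAPH_URI = "http://example.org/graph/penguin"   # one graph per collection

def load_graph(rdf_path, graph_uri):
    """PUT the RDF/XML file into the named graph, replacing any previous copy."""
    with open(rdf_path, "rb") as f:
        resp = requests.put(
            f"{FUSEKI}/data",
            params={"graph": graph_uri},
            data=f,
            headers={"Content-Type": "application/rdf+xml"},
        )
    resp.raise_for_status()

def count_triples(graph_uri):
    """Ask the SPARQL endpoint how many triples the collection's graph holds."""
    query = f"SELECT (COUNT(*) AS ?n) WHERE {{ GRAPH <{graph_uri}> {{ ?s ?p ?o }} }}"
    resp = requests.get(
        f"{FUSEKI}/sparql",
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"][0]["n"]["value"]

if __name__ == "__main__":
    load_graph("penguin-archive.rdf", GRAPH_URI)
    print("triples loaded:", count_triples(GRAPH_URI))
```

In a real deployment the LDPS would drive these calls from its management interface rather than from the command line.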

Here is a graphical overview of this architecture:

In terms of automation it doesn’t look too bad. The archivist needs to manage the CALM exports and XML file uploads to the LDPS. Then they (or someone else) would use the LDPS to manage the RDF transforms and triple store contents. The interface to these processes should hopefully be reasonably straightforward (upload, transform, publish to triple store). Elda, once configured, shouldn’t need touching.

So what is missing? Well, linking to other data sets, for a start. In this area we’ll need the input of subject experts. When making links from our data sets to others, we’ll need a human to confirm the links that any automation (text parsing via DBpedia Spotlight?) has proposed. The design of this type of workflow needs to be carefully considered, with the results being cached for future use wherever possible. That will be the subject of a future post.
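
As a rough illustration of the automated half of that linking workflow, the sketch below sends a snippet of descriptive text to the public DBpedia Spotlight annotation service and prints the candidate DBpedia resources for a curator to accept or reject. The endpoint URL and parameters reflect the public Spotlight API and may change; the confidence threshold and the sample text are arbitrary assumptions.

```python
# Sketch: propose candidate DBpedia links for a fragment of archival description
# via the public DBpedia Spotlight service, for a human to confirm or reject.
# Endpoint and parameters are assumptions based on the public Spotlight API.
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def propose_links(text, confidence=0.5):
    """Return (surface form, DBpedia URI) pairs suggested by Spotlight."""
    resp = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])
    return [(r["@surfaceForm"], r["@URI"]) for r in resources]

if __name__ == "__main__":
    # Illustrative sample text only.
    sample = "Correspondence between Allen Lane and authors published by Penguin Books."
    for surface, uri in propose_links(sample):
        print(f"{surface!r} -> {uri}  (accept? y/n)")
```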
