Archive for the 'Penguin Archive' Category

Progress against workplan

Friday, May 25th, 2012

The Bricolage Advisory Group reviewed progress against the project workplan at its second meeting on 20 April 2012. We agreed to bring each of the two remaining AG meetings forward by a month – to May and July – to ensure continuing review and steer towards the end of the project in August.

The Linked Data: Hosting work package has started with options such as Talis and data.bris being considered. Timing of the data.bris project may preclude its use, although it may not be the best place to host data from Bricolage anyway. It could, however, be used to create URIs and expose data in an external view. Options to be considered more fully at the meeting in May.

The Linked Data: Metadata review/export work package is moving at a different pace with the Penguin Archive and the Geology Museum (see separate blog posts on export work). In Earth Sciences, student effort is being used to move data from spreadsheet format into an online (Drupal) database. Some of the issues arising are in the formats used, including free text, and the need to restrict terminology. There is a huge amount of currently unstructured data. Review and export piloting will carry on for another 2-3 weeks. For the Penguin Archive, work has focused on trying to add authority terms, but this has proved extremely labour-intensive, more so than anticipated. Good authority data is needed for good Linked Data, and this needs to be taken into account when initially cataloguing collections. Legacy data without authority data continues to pose problems.
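As a rough sketch of the kind of terminology restriction described above, the following splits spreadsheet rows into those using a controlled term and those needing manual review. The vocabulary and field names are invented for illustration; the real lists would come from the museum's agreed terminology.

```python
import csv
import io

# Hypothetical controlled vocabulary; the project's real list would be
# agreed with the Geology Museum curators.
CONTROLLED_TERMS = {"igneous", "sedimentary", "metamorphic"}

def validate_rows(csv_text, term_field):
    """Split rows into those whose term is in the controlled vocabulary
    and those with free text that must be normalised by hand."""
    ok, review = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[term_field].strip().lower() in CONTROLLED_TERMS:
            ok.append(row)
        else:
            review.append(row)
    return ok, review

sample = "specimen,rock_type\nBRSUG-1,Igneous\nBRSUG-2,greenish lump\n"
ok, review = validate_rows(sample, "rock_type")
```

A workflow like this lets the bulk of rows pass through automatically while queuing only the free-text outliers for a curator.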

The entity extraction process poses the question ‘can we identify things in textual descriptions, linking unlinked data?’ Some online services can analyse text, eg the DBpedia Spotlight entity detection and disambiguation services, which can be used to construct bespoke Named Entity Recognition solutions. The Women’s Library at London Met deals effectively with disambiguation of names. We are interested in parsing text through the process to see how useful and accurate it can be. Entity extraction is at the experimental end of our work, but Linked Data to an authority source and the processes around this are of interest.

Export implementation should be complete by end of May.

Identifiers and Linking work package: Geology have thus far been creating internal links within their data. Work to link this data to other datasets has not yet started. Work on the Penguin Archive to date has highlighted a problem around the stability of URIs; sustainability is an issue for the future. For example, a person identified in the Penguin Archive data could have a unique ID in CALM but that identifier could easily break if CALM’s internal ID scheme is changed by the vendor. Alternative ID schemes that rely on the person’s name or their biographical dates also pose similar problems if, say, the person changes their name or their commonly accepted biographical dates change. We need persistent IDs (eg DOIs) in combination with a resolver service to map from persistent IDs to appropriate internal current IDs (eg CALM IDs).
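The persistent-ID-plus-resolver idea can be sketched in a few lines. All identifiers below are invented for illustration; the point is that when the vendor's internal scheme changes, only the mapping moves and the public ID stays stable.

```python
# Minimal sketch of a persistent-ID resolver: stable public identifiers
# map to whatever the current internal (e.g. CALM) identifier happens
# to be. All IDs here are invented for illustration.

class Resolver:
    def __init__(self):
        self._table = {}  # persistent ID -> current internal ID

    def register(self, persistent_id, internal_id):
        self._table[persistent_id] = internal_id

    def resolve(self, persistent_id):
        return self._table.get(persistent_id)

    def remap(self, persistent_id, new_internal_id):
        # Called when the vendor changes its internal ID scheme:
        # the public identifier stays the same, only the mapping moves.
        self._table[persistent_id] = new_internal_id

resolver = Resolver()
resolver.register("bricolage:person/42", "CALM-000123")
resolver.remap("bricolage:person/42", "CALM2-XYZ-9")
print(resolver.resolve("bricolage:person/42"))  # CALM2-XYZ-9
```

In practice the resolver would sit behind an HTTP redirect service (as DOI resolution does), but the lookup table is the essential idea.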

On microdata, for Geology we’re looking to embed data coming out of the catalogue in the public site so that big search engines can find structured data. We’ll be looking at metadata as RDFa within the HTML of the public site.
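As a rough illustration of the RDFa idea, the snippet below generates an HTML fragment with metadata embedded as RDFa attributes. The vocabulary, class and layout are assumptions for the sketch, not the project's final choices.

```python
from html import escape

def rdfa_item(title, creator):
    """Wrap catalogue metadata in RDFa attributes so search engine
    crawlers can pick up structured data. The Dublin Core vocabulary
    and markup shape here are illustrative only."""
    return (
        '<div vocab="http://purl.org/dc/terms/" typeof="PhysicalResource">\n'
        f'  <span property="title">{escape(title)}</span>\n'
        f'  <span property="creator">{escape(creator)}</span>\n'
        '</div>'
    )

html_fragment = rdfa_item("Ammonite specimen", "Geology Museum")
print(html_fragment)
```

The same metadata could equally be emitted from the catalogue's page templates; the point is that the triples live inside the public HTML rather than only behind a SPARQL endpoint.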

Sustainability of the tools and workflow developed during the project is important. The key is in developing a set of processes and tools that are easy to use in terms of the export process and publication of Linked Data, so that archivists might routinely use them. Questions arise about what is most useful for the long-term, what is transferable.

The demonstrator work packages will begin at the end of May; evaluation and dissemination work packages will be discussed at the May Advisory Group meeting.

Penguin Archive workflow design

Thursday, May 17th, 2012

One of our aims is to produce a workflow for publishing Linked Open Data that is simple for our collection curators to engage with and continue to use post-project. Whilst we can produce one-off Linked Data snapshots of their archives during the project, a real measure of success would be to see those processes embedded into the standard work of the collection maintainers. From their point-of-view the steps for re-publishing the various collections as Linked Data should be as straightforward as possible. Just uploading a raw data file, for example. We want to avoid (as much as possible) introducing requirements that involve too much manual intervention.

So let’s consider one possible workflow for the Penguin Archive in this post. (This workflow is based around the concept of a local Linked Data publishing platform, though we are also considering remote third-party options.)

  • The collection is held in an instance of Axiell CALM. Using its desktop client it is a relatively straightforward process for an archivist to export collection data to EAD XML format.
  • The resulting files can be uploaded to our Linked Data Publishing Service (LDPS) via a browser-based tool. This tool will provide a simple management interface to govern the workflow, the first part of which is transformation into RDF. Building on work done in the LOCAH project we use XSLT to produce RDF from the EAD XML.
  • Using the LDPS again, the RDF is loaded into a triple store. Our intention here is to use Fuseki for this task. Using its support for SPARQL over HTTP the LDPS will manage each collection as a separate Named Graph. Fuseki provides a SPARQL endpoint to these Graphs and also serves the Default Graph as their union.
  • In order to serve the Linked Data we intend to use Elda. This is an implementation of the Linked Data API and will answer Linked Data requests by translating them to backend SPARQL queries.
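The triple-store loading step can be sketched against the SPARQL 1.1 Graph Store HTTP Protocol, which Fuseki implements: a PUT to the dataset's `/data` endpoint with a `graph` parameter replaces one named graph. The dataset and graph URIs below are placeholders, not the project's actual addresses.

```python
import urllib.parse
import urllib.request

def graph_put_request(dataset_url, graph_uri, turtle):
    """Build (but do not send) an HTTP PUT that replaces one named
    graph via the SPARQL 1.1 Graph Store Protocol. One graph per
    collection, as described above."""
    url = dataset_url + "/data?" + urllib.parse.urlencode({"graph": graph_uri})
    return urllib.request.Request(
        url,
        data=turtle.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/turtle"},
    )

req = graph_put_request(
    "http://localhost:3030/bricolage",           # assumed local Fuseki dataset
    "http://example.org/graph/penguin-archive",  # placeholder graph URI
    "<http://example.org/s> <http://example.org/p> <http://example.org/o> .",
)
print(req.get_method(), req.full_url)
```

Because each collection lives in its own named graph, re-publishing one collection is a single idempotent PUT, and Fuseki's union default graph still exposes everything together.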

Here is a graphical overview of this architecture:

In terms of automation it doesn’t look too bad. The archivist needs to manage the CALM exports and XML file uploads to the LDPS. Then they (or someone else) would use the LDPS to manage the RDF transforms and triple store contents. The interface to these processes should hopefully be reasonably straightforward (upload, transform, publish to triple store). Elda, once configured, shouldn’t need touching.

So what is missing? Well, linking to other data sets for a start. In this area we’ll need the input of subject experts. When making links from our data sets to others we’ll need a human to confirm the links any automation (text parsing via DBpedia Spotlight?) has proposed. The design of this type of workflow needs to be carefully considered, with the results being cached for future use wherever possible. The subject for a future post.
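That propose-then-confirm loop might be sketched as follows. The Spotlight URL shown is the public instance's annotate endpoint (worth checking before relying on it), and the reviewer function is a stand-in for a subject expert; decisions are cached so the same candidate link is not asked about twice.

```python
import urllib.parse

def spotlight_url(text, confidence=0.5):
    """URL for DBpedia Spotlight's annotate endpoint (public instance;
    verify its current location before use)."""
    qs = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return "https://api.dbpedia-spotlight.org/en/annotate?" + qs

def confirmed_links(proposals, reviewer):
    """Keep only machine-proposed links that a human reviewer accepts,
    caching decisions for reuse across re-runs of the export."""
    cache = {}
    accepted = []
    for surface_form, uri in proposals:
        if (surface_form, uri) not in cache:
            cache[(surface_form, uri)] = reviewer(surface_form, uri)
        if cache[(surface_form, uri)]:
            accepted.append((surface_form, uri))
    return accepted

url = spotlight_url("Hans Schmoller joined Penguin Books in 1949.")
proposals = [("Hans Schmoller", "http://dbpedia.org/resource/Hans_Schmoller")]
accepted = confirmed_links(proposals, lambda s, u: True)
print(accepted)
```

In a real workflow the reviewer callback would be a web form in the LDPS rather than a lambda, and the cache would persist between runs.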

First steps with the Penguin Archive data

Wednesday, March 21st, 2012

Over the last few weeks, Jasper and I have met with Anna and Hannah, the archivists managing the Penguin Archive held by the University Library Special Collections Department, and I’ve had a first stab at processing some sample EAD XML data for a small subset of the collections which make up the archive.

As in the case of the metadata for the Mass Observation Archive that I worked on in the SALDA project last year, the data is held and managed within a CALM data management system, and the EAD data is generated by an export process from the CALM database. In comparison with the case of the Archives Hub, where data is aggregated from diverse sources and systems, this offers the advantage that there is much less structural variation across the documents, as the XML markup is being generated by a single common process. A second benefit is that the data content has been subject to various normalisation/validation processes within the CALM system.

I’m taking an approach similar to that I applied in the SALDA project, taking as a starting point (though we may refine/amend this) the model, URI patterns and the XSLT transform used in the LOCAH and Linking Lives projects, overriding or discarding some of the elements that are specific to the Archives Hub context, and adding (so far relatively few) elements specific to the Bristol/Penguin context. (Aside: in the Linking Lives project, I’ve been doing some work on the transform recently, to fix bugs, to extend it slightly and generally try to make it a bit more “robust” in terms of the range of inputs it handles, so I felt using this version was probably the best starting point – I hope something will be available about that on the Linking Lives blog very soon.)

Also within Linking Lives, I’ve spent some time tidying up the conversion processing, wrapping it up in some shell scripts, driving it from URI lists and adding some capture of (very basic) process metadata. My scripting skills are limited and I’m sure it’s not as elegant and efficient as it could be, but I was pleased to find that I could repurpose that work and get things up and running for Bricolage with a minimum amount of tweaking, and Jasper and I will be looking at getting it onto a firmer footing over the next few weeks.

The Penguin Archive data differs from the MOA data in that it is made up of a large number of collections, exported as distinct EAD documents. However, as noted above, the export process ensures a good level of structural consistency across the set. I think there are some (relatively minor) variations in the “style” of cataloguing, and we probably need to examine a larger sample to make sure the process is coping with that, but so far, the results look pretty good.

Also in contrast to the MOA data, the Penguin data does have index terms applied – in the data I’ve seen so far, personal and corporate names following the National Council on Archives’ Rules for the Construction of Personal, Place and Corporate Names. We’ve also had access to a sample of “authority record data” exported from CALM: this gives us access to the name data in structured form, so by transforming this data alongside the EAD data we can add that structured data into the RDF output.

Currently the URI pattern for Persons (and “Conceptualisations of Persons”) makes use of a “slug” constructed from the “authority form” of the name, e.g. the EAD construct

  <persname rules="ncarules">Schmoller; Hans (1916-1985); typographer; designer</persname>

is transformed into RDF data like the following (I’ve omitted some triples for the sake of brevity as I really just want to show the URI structures):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix concept: <> .
@prefix person: <> .

  concept:schmollerhans1916-1985typographerdesigner
    a skos:Concept ;
    rdfs:label "Schmoller; Hans (1916-1985); typographer; designer"@en ;
    foaf:focus person:schmollerhans1916-1985typographerdesigner .

  person:schmollerhans1916-1985typographerdesigner
    a foaf:Person ;
    rdfs:label "Schmoller; Hans (1916-1985); typographer; designer"@en ;
    foaf:familyName "Schmoller" ;
    foaf:givenName "Hans" .

However, it’s been at the back of my mind that there is possibly some “brittleness” in the URI construction here: if there are changes to the name in the source data (e.g. the addition of a new “epithet” or “title”, or of a previously unknown date of death), when that data is reprocessed a different URI is generated. In principle, we could maintain both the old and new URIs, especially if such changes are rare, but it would be preferable to ensure from the outset that our URIs are based on data that does not change. Within the CALM system the authority records do use reference numbers for identification, so this raises the question of whether those reference numbers might be used as the basis for these URIs. But would those reference numbers offer more stability than the names? Are they stable across any internal reorganisation within CALM, or across upgrades between versions of CALM? Would they survive any future migration from CALM to some other data management system? These are questions we need to explore with Anna and Hannah before making any changes.
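To make that brittleness concrete, here is a rough reconstruction of the slug logic (an approximation for illustration, not the project's actual XSLT), showing how a single added epithet yields a different URI.

```python
import re

def slug(authority_name):
    """Approximate the slug pattern seen in the sample data: lower-case
    the authority form of the name and strip everything except letters,
    digits and hyphens. A reconstruction, not the real transform."""
    return re.sub(r"[^a-z0-9-]", "", authority_name.lower())

before = slug("Schmoller; Hans (1916-1985); typographer; designer")
after = slug("Schmoller; Hans (1916-1985); typographer; designer; OBE")
print(before)           # schmollerhans1916-1985typographerdesigner
print(before == after)  # False: one added epithet changes the URI
```

Any URI minted from the `before` slug silently detaches from the record once the catalogue entry gains the extra epithet, which is exactly why a stable reference number (or persistent ID with a resolver) looks attractive.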

Finally, a further similarity with the MOA/SALDA data is that there are sets of resources which don’t have explicit index terms in the EAD data but for which the names of other entities are present, typically embedded in titles – names of persons in some cases, names of publications in others – which might usefully be linked to things described in other datasets. So one of our next steps is to analyse these cases further and decide whether it is useful/effective to apply some specific/“local” processing to generate additional entities in the RDF output in those cases.