Penguin Archive software

November 13th, 2012

The open source software developed as part of the Bricolage project to manage the workflow of publishing EAD data as Linked Data is now available on GitHub: https://github.com/ilrt/bricolage

It comprises three components:

  • Bricol: the management interface
  • Fuseki: a deployment of Fuseki configured for use with Bricolage
  • Elda: a deployment of Elda configured for use with Bricolage

Each of these components is a Maven project with build instructions.

Technical Approach: Penguin Archive

November 6th, 2012

The workflow architecture we adopted to support the publishing of Linked Data from the Penguin Archive is illustrated here:

The workflow breaks down into these steps:

  1. Export from CALM. This is achieved via the CALM desktop client and managed by the archivist administrator. It produces EAD XML format files.
  2. Import into the Linked Data Publishing Service (LDPS). Again managed by the archivist: a browser-based interface allows them to upload one EAD file per collection.
  3. Convert to RDF. The LDPS produces an RDF version of a collection’s EAD document by running it through a conversion XSLT stylesheet. Both the EAD and RDF documents are simply files held on the LDPS server.
  4. Publish Linked Data. The collection’s RDF is pushed (using SPARQL Update over HTTP) to our triple store (Fuseki); a sketch of this kind of request follows this list.
  5. Link data. The LDPS also lists details of the people published in the previous step (via a SPARQL query run against the triple store). These can be used to query a third-party authority service (VIAF) with the aim of identifying a record for the individual there. If the archivist selects a record this information is asserted into our triple store.
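
Steps 4 and 5 are carried out by the LDPS, which is described in a later post. Purely to show the shape of a SPARQL Update request over HTTP, here is a minimal sketch in JavaScript with jQuery (as used in the project’s demonstrators); the endpoint URL, person URI, choice of owl:sameAs and VIAF identifier are all placeholders rather than the project’s actual configuration.

    // Everything below is illustrative: the real Fuseki dataset name, resource URIs and
    // predicates are configured in the LDPS and are not shown in this post.
    var FUSEKI_UPDATE = 'http://localhost:3030/bricolage/update';

    // Step 4 pushes a collection's converted RDF as one INSERT DATA block; step 5 asserts
    // individual links, e.g. tying a person in our data to a record chosen from VIAF.
    var update =
      'PREFIX owl: <http://www.w3.org/2002/07/owl#>\n' +
      'INSERT DATA {\n' +
      '  <http://example.org/id/person/example-person> owl:sameAs <http://viaf.org/viaf/0000000> .\n' +
      '}';

    jQuery.ajax({
      url: FUSEKI_UPDATE,
      type: 'POST',
      contentType: 'application/sparql-update',  // Fuseki accepts the raw update text as the request body
      data: update,
      success: function () { console.log('update accepted'); },
      error: function (xhr) { console.log('update failed: ' + xhr.status); }
    });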

At this point the data is available as Linked Data via the Linked Data API served from an instance of Elda. This data is therefore available in formats suitable for machine-processing, but as an example here is the human-friendly HTML view for the collection at http://tc-bricol.ilrt.bris.ac.uk/elda/api/penguin/id/archivalresource/gb-3-dm1187. (Note that this view is not intended as the primary interface to this material and as such not all the Elda functionality works at present.)

The software developed to support this process will be detailed in a subsequent post.

Penguin Archive demonstrator live

November 6th, 2012

Penguin Archive

The final version of the Penguin Archive demonstrator has been completed and embedded into the Special Collections’ Penguin Archive website.

Penguin demonstrator embedded in Special Collections

The demonstrator details have been documented in an earlier post, but in summary it combines a pre-populated chronology with live data pulled from the Archive’s newly-populated linked data store.

Lessons learned: summary

October 15th, 2012

At the final Advisory Group meeting towards the end of July 2012, the following points were made in relation to the evaluation of the quality of Linked Data produced and techniques used:

  • Quality depends not only on time-consuming human-crafted links to third party datasets, but also on the quality of those datasets (the project had identified some potential mistakes in VIAF)
  • Export processes such as those developed for the Penguin Archive use case were not really sustainable with the limited resources that archives usually have
  • Limitations of some parts of Drupal mean that, in the Geology use case, we may not be able to make as much of the Linked Data as we would like
  • Although we have made considerable efforts to make the user interface to the export and publication processes as smooth as possible, they are still not integrated enough to be adopted in normal working practice
  • There is a considerable learning curve in understanding Linked Data and what is needed to create and publish them, which requires intensive support and/or time to read around the subject
  • From a technical perspective, the project has highlighted how much impact on the data the production of Linked Data has – it’s not a simple conversion process
  • The extent to which Linked Data has the potential to ‘draw in’ new audiences for collections is more limited than envisaged, as collection level descriptions are already available in the Archives Hub, ranked highly in Google searches and accessible via Calm
  • It needs more resource, more space and more time!

In terms of methodology, the bringing together of different use cases and technical expertise had worked well, despite learning curves on all sides. The project had been beneficial in raising awareness of Linked Data issues in the Special Collections and Geology teams, and of archival and cataloguing practice in the technical team. Geology and Special Collections were also more aware of each other’s collections and potential for working together in the future.

Lessons learned: sustainability of workflow

October 15th, 2012

The two use cases present different scenarios for the sustainability of processes for publishing Linked Data. As noted in an earlier blogpost, for the Penguin Archive, the process has been unexpectedly time-consuming. The production of Linked Data requires enhancement of the collection metadata way beyond the requirements of currently accepted archival standards and, for the majority of archivists, presents a considerable technical learning curve. The key lesson for the archives community and for those promoting the creation of Linked Data is around the limitations that archivists’ core values and practice, and their very constrained time, impose.

The Penguin Archive had benefitted from a funded cataloguing post for two and a half years to transfer paper catalogue records to Calm; one of the archivists took up to a week to create new records for publication as Linked Data as part of the project.

For both collections, free-text catalogue entries were particularly problematic; if any structure is needed, time has to be spent on manually extracting specific text into new fields, or on re-categorisation. Although the project coincided with the development of a new Drupal-based online catalogue for the Geology Museum, so that publication of Linked Data is largely automatic through the additional Drupal module, the extent of data cleaning required as part of the process came as a surprise. The project allowed for effort to be expended on data cleaning; without this effort, any Linked Data published would have been of very limited use. Even where structure exists within collection data, there are still issues in linking it with third party datasets.

One recommendation may be for JISC to support a project to identify the amount of effort required for the creation and publication of Linked Data for different types of collections.

Overall, the message from both use cases is that the creation of Linked Data, links to authority files and to third party datasets should not be considered a quick and easy solution. Future upgrades of products such as Calm may well integrate creation and publication of Linked Data without there needing to be a separate process; without this, it is unlikely that data from collections like the Penguin Archive, where much human intervention is needed, will be published as Linked Data.

Lessons learned: linked data hosting

October 15th, 2012

The hosting review workpackage had originally identified Talis as the main option for external hosting. Talis wound up development of its external hosting platform during the Bricolage project’s lifetime, so this was no longer an option. The sustainability or otherwise of hosting and other platforms in relatively experimental areas of work is something to consider for future projects; having several options, in this case including internal hosting solutions, helps reduce the risk of not being able to deliver on project plans. The Geology data will in any case be hosted as an integrated part of the museum’s new enhanced online presence. The Penguin Archive data will remain as a snapshot, also hosted on internal servers.

Demonstrator previews

August 15th, 2012

One deliverable of the Bricolage project is demonstrators:

  • a browser-based mapping application for exploring the Geology collection via its geography
  • an interactive timeline displaying the chronology of selected resources within the Penguin Archive

The demonstrators were designed to show the potential for building data visualisations from the Linked Data produced by the project. As such, they both take a similar technical approach: a javascript application that (via a RESTful web service) retrieves json-formatted data from the Linked Data service. This data is then rendered as required by the web-based application.

Both these demonstrators will be publicly available in September, hosted on the Penguin Archive and Geology collection sites. For now here are some screenshots and a little more detail…

Penguin Archive Timeline

The Penguin Archive Timeline uses the freely available TimelineJS to present an interactive chronology of key events in the history of Penguin Books. Its construction involved the following steps:

  • The collection curators created a spreadsheet containing the key events, their dates and (optionally) an illustrative image. The event data also included the unique collection identifiers of any related collection held in the archive. These identifiers would later provide a route to the Linked Data.
  • The spreadsheet was then parsed into a JSON data structure understandable by the TimelineJS javascript application.
    {
    "startDate":"1863",
    "endDate":"",
    "headline":"Birth of Samuel Lane",
    "text":"<span class='lod' res='gb-3-dm2244;gb-3-dm1649;gb-3-dm1676'><img style='border:none;' class='throbber' src='img/throbber.gif'/></span>"
    }
  • Loading this data (and hosting the linked images) gave us the basic chronology, but without any sign of Linked Data so far! Now to use the supplied collection codes.
  • A small edit was made to TimelineJS to provide a callback upon event data load. This callback gives us the chance to query the Linked Data service. E.g.
    /elda/api/penguin/id/archivalresource/gb-3-dm2309.json
  • The JSON returned from the call is then parsed and used to populate the timeline (the embedded box in the image above); a sketch of this callback-and-populate step follows this list.
    {
    "format" : "linked-data-api",
    "version" : "0.2",
    "result" : {
      "_about" : "http://tc-bricol.ilrt.bris.ac.uk/elda/api/penguin/id/archivalresource/gb-3-dm2309.json",
      "dc_title" : "Pelican Books, Penguin Books, Penguin Handbooks, Penguin Specials, Pan Books, and other materials",
      "extent" : "4 records management boxes (359 books)",
      ...
      }
    }
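
To make the last two bullets concrete, here is a minimal sketch of what that callback might do, assuming jQuery (already used by the demonstrator). The callback name, the way TimelineJS invokes it and the ‘lod-box’ markup are assumptions; the Elda URL pattern, the span’s res attribute and the dc_title and extent fields are taken from the examples above.

    // Hypothetical hook: our small edit to TimelineJS calls this once the event data has loaded.
    function onTimelineDataLoaded() {
      // Each event's "text" field contains a placeholder span whose 'res' attribute holds
      // the related collection identifiers (see the event JSON above).
      jQuery('span.lod').each(function () {
        var span = jQuery(this);
        var ids = span.attr('res').split(';');

        jQuery.each(ids, function (i, id) {
          // Ask the Linked Data API (Elda) for the JSON view of this collection.
          jQuery.getJSON('/elda/api/penguin/id/archivalresource/' + id + '.json', function (data) {
            // Pull out fields such as dc_title and extent (see the response shown above)
            // and append them to the event, replacing the loading image.
            var result = data.result;
            span.find('img.throbber').remove();
            span.append('<div class="lod-box"><strong>' + result.dc_title + '</strong><br/>' +
                        (result.extent || '') + '</div>');
          });
        });
      });
    }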

So the demonstrator shows the possibility of augmenting a purely browser-based application with rich, structured data.

Javascript libraries used: jquery-1.7.2, TimelineJS.

This demonstrator will be made publicly available via the Penguin Archive site in September. All associated code will also shortly be available under an Open Source licence.

A map interface for the Geology Collection

The Geology demonstrator provides a map-based route into the museum’s collection. As with the Penguin demonstrator, it is a javascript-based browser app that is backed by the Linked Data created by the project. The steps involved in creating the demonstrator were as follows:

  • The existing catalogue only had textual place name information. As part of the project this data was reviewed and somewhat cleaned. Then, in order to be able to locate the resources on a map, the place names were passed through a geocoding service (an illustrative sketch appears after this list). The resulting coordinates were stored with the records. This was done as part of the one-time data migration into the new Drupal platform. Code was also put in place to automatically geocode records created or edited as part of the ongoing catalogue work.
  • Drupal has RDF support, and this was configured (to be blogged about elsewhere shortly) to produce RDF versions of resource records like this:
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/terms/"
    xmlns:sioc="http://rdfs.org/sioc/ns#"
    xmlns:ad="http://schemas.talis.com/2005/address/schema#"
    xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
    
    <rdf:Description rdf:about="http://geomuse-dev.ilrt.bris.ac.uk/id/47-1">
    <rdf:type rdf:resource="http://schema.org/CreativeWork"/>
    <rdf:type rdf:resource="http://rdfs.org/sioc/ns#Item"/>
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Document"/>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-08-08T11:58:16+01:00</dc:date>
    <dc:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-08-08T11:58:16+01:00</dc:created>
    <dc:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-08-09T15:21:33+01:00</dc:modified>
    <sioc:num_replies rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">0</sioc:num_replies>
    <dc:classification rdf:resource="http://geomuse-dev.ilrt.bris.ac.uk/age/phanerozoic"/>
    <dc:classification rdf:resource="http://geomuse-dev.ilrt.bris.ac.uk/age/mesozoic"/>
    <dc:classification rdf:resource="http://geomuse-dev.ilrt.bris.ac.uk/age/jurassic"/>
    <dc:classification rdf:resource="http://geomuse-dev.ilrt.bris.ac.uk/age/early-jurassic"/>
    <dc:classification rdf:resource="http://geomuse-dev.ilrt.bris.ac.uk/age/pliensbachian"/>
    <ad:regionName rdf:resource="http://geomuse-dev.ilrt.bris.ac.uk/place/radstock"/>
    <dc:creator rdf:resource="http://geomuse-dev.ilrt.bris.ac.uk/person/tutcher-jw-0"/>
    <geo:lat rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal">51.257415</geo:lat>
    <geo:long rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal">-2.504067</geo:long>
    </rdf:Description>
    </rdf:RDF>
  • The data we are interested in here are the classification and geo (latitude/longitude) properties. Queries for data to populate the map are parameterised with bounding box coordinates and (optionally) age classification.
  • Drupal RDF also includes a SPARQL endpoint, and here, for ease of use from the javascript browser application, we fronted it using a RESTful Java web application. Thus an ajax request from the browser might look like:
    http://host/bricol-geology/rest/specimens/51.73155108088844,-0.5085178417969018/50.85218333554836,-4.166965107421902/bajocian

    and this would be translated into a SPARQL query by the web application:

    SELECT ?id ?lat ?lng ?region
    WHERE {  GRAPH ?g
      { ?id <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?lat . 
        ?id <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?lng . 
        ?id <http://purl.org/dc/terms/classification> <http://geomuse-dev.ilrt.bris.ac.uk/age/bajocian> .
        OPTIONAL { ?id <http://schemas.talis.com/2005/address/schema#regionName> ?region  }
        FILTER ( ?lat < 51.69240914989516 && ?lat > 50.812284718809906 && ?lng > -4.183444599609402 && ?lng < -0.5249973339844018)
      }
    }
  • Once parsed by the web app, the json returned is along these lines:
    [{"uri":"http://geomuse-dev.ilrt.bris.ac.uk/id/53-1",
      "regionUri":"http://geomuse-dev.ilrt.bris.ac.uk/place/dundry",
      "long":-2.638459,
      "lat":51.39859},
     ...
    ]
  • These points are then displayed (using the Google Maps API and MarkerClustererPlus) on the map tool, as sketched below. Roll-over popups provide further information on points as well as routes for launching collection browsing.
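
Putting the last few bullets together, the sketch below shows the browser side: request points for the current map viewport and an age classification from the REST service, then plot and cluster them. The loadSpecimens function name and the clusterer defaults are assumptions; the request URL pattern and the uri, regionUri, lat and long fields come from the examples above.

    // Assumes an existing google.maps.Map in 'map' and that MarkerClustererPlus is loaded;
    // 'loadSpecimens' is a hypothetical name for the demonstrator's fetch-and-plot step.
    function loadSpecimens(map, age) {
      var bounds = map.getBounds();           // current viewport
      var ne = bounds.getNorthEast();
      var sw = bounds.getSouthWest();

      // RESTful Java web service fronting the Drupal SPARQL endpoint (URL pattern as above).
      var url = '/bricol-geology/rest/specimens/' +
                ne.lat() + ',' + ne.lng() + '/' +
                sw.lat() + ',' + sw.lng() + '/' + age;

      jQuery.getJSON(url, function (points) {
        var markers = jQuery.map(points, function (p) {
          return new google.maps.Marker({
            position: new google.maps.LatLng(p.lat, p['long']),   // 'long' as in the JSON above
            title: p.regionUri || p.uri
          });
        });
        // Cluster nearby markers so that dense areas of the map stay readable.
        new MarkerClusterer(map, markers);
      });
    }

A call such as loadSpecimens(map, 'bajocian') would reproduce the request shown above.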

Javascript libraries used: jquery-1.7.2, jQWidgets, google maps api v3.
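
One more illustrative sketch, relating to the first step in the list above: the geocoding service used during the migration is not named in this post, and in the project this step ran server-side as part of the Drupal migration, but a catalogue place name such as ‘Radstock’ (which appears as a regionName in the RDF example) could equally be geocoded with the Google Maps JavaScript API just listed. The snippet is a sketch only, not the project’s actual geocoding code.

    // Illustration only: geocode a catalogue place name with the Google Maps JavaScript API (v3).
    var geocoder = new google.maps.Geocoder();

    geocoder.geocode({ address: 'Radstock, Somerset, UK' }, function (results, status) {
      if (status === google.maps.GeocoderStatus.OK && results.length > 0) {
        var loc = results[0].geometry.location;
        // These coordinates would be stored with the specimen record
        // (compare the geo:lat / geo:long values in the RDF example above).
        console.log('lat: ' + loc.lat() + ', long: ' + loc.lng());
      } else {
        console.log('geocoding failed: ' + status);
      }
    });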

This demonstrator is still awaiting the final full Geology data set to be migrated and will be made publicly available via the Geology collection site in September. All associated code will also shortly be available under an Open Source licence.

Users and use cases: Penguin archive

July 26th, 2012

When we first looked at the data from the Penguin Archive, we wanted to index our collection level records with the names of the creating person or organization so that they could be linked to VIAF.  This seemed straightforward at first: we were working with a relatively small number of records (around 130); we use the Calm database; and the archivist who catalogued the collection had already created authority records for the most significant people and organizations in the archive.

However, the indexing turned out to be much more labour-intensive than it appeared.  I had not catalogued the collection, and I do not have the knowledge that our Penguin archivist had developed, so I needed to spend some time analysing the collections to work out an accurate provenance.  Although some of the creating people and organizations were available already, not all were, and it quickly became apparent that creating even the most basic authority records would use far too much of the time available in the whole project. At a very rough estimate, it took between twenty minutes and half an hour to find the information for a single authority record, and about a week to create the new records needed.

The time available for cataloguing in most archive services is extremely limited, and many important collections have little or no online presence beyond a collection level record.  Cataloguing frequently happens with short-term project funding, within tight timescales, and may be focused on providing a resource for outreach, for writing a company history, or for some other direct benefit to the organization that owns the collection.  This means that it can be difficult to justify spending time creating contextual information or doing a lot of indexing.  If there has to be a choice between creating a catalogue for a collection that is invisible on the Web and creating authority records relating to a collection that is already catalogued, many archivists would take the view that cataloguing is a better use of time.

This experience is a useful reminder that even when the right tools and standards are in place, projects should still plan for the research time which will be required to use them effectively.

Users and use cases – overview

June 28th, 2012

The Bricolage project will publish catalogue metadata as Linked Open Data for two of the University of Bristol’s most significant collections: the Penguin Archive and the Geology Museum (site in development). We will also encode resource microdata into the Geology Museum’s forthcoming online catalogue with the aim of improving collection visibility via major search engines and develop two simple demonstrators to illustrate the potential of data linking and reuse.

The project’s users are therefore archive and museum staff responsible for cataloguing and managing these important collections. The Linked Data production workflows need to be easy to use, so that they can be embedded in the collection teams’ routine and so that the export and publication processes remain sustainable beyond the end of the project, allowing Linked Data to continue to be produced for reuse. Separate blog posts describing the use cases for the Penguin Archive and for the Geology Museum indicate how the project affects our users and how they are being engaged and are reacting to the project.

Users and use cases: The Geology Museum

June 28th, 2012

The Geology Museum (site under development) is based in the University of Bristol’s School of Earth Sciences. It holds historically and scientifically important collections that are unique to the institution. The museum holds an estimated 100,000 museum specimens, many of which are unique and of international importance. Highlights include: an estimated 20,000 invertebrate fossils, including material with important historical associations; over 4,500 mineral specimens, including many display-quality items from mines that are now inaccessible; over 3,000 vertebrate fossils and casts; and the Fry collection of over 4,000 invertebrate and plant fossils from the UK. There is also an extensive teaching collection of 16,000 specimens. Over the past 15 years 41,420 digital records have been produced on the basis of historic museum registers, card index catalogues and specimen labels. The creation of digital metadata has focused on valuable specimens and collections of national or international importance. These records represent about two thirds of the entire collection. Each metadata record contains information in 30 categories, 18 of which will be published by this project.

The School of Earth Sciences is already undertaking work to enhance the online presence of the Geology Museum by improving the museum website and online access to the collections. Included in this work is the migration of the existing collection metadata into a Drupal backed system, which can be used to publish Linked Data automatically.

Initial work focused on moving data from the existing spreadsheet format into the Drupal database. Issues arose with the formats used, including free text and the need to restrict terminology. There is a huge amount of data but it is largely unstructured, so it requires manual effort to review and test. Unlike the Penguin Archive use case, the export and publication processes are largely automated by Drupal’s in-built modules for handling RDF, returning it in response to a Linked Data request. The aim is to embed data from the catalogue in the Geology Museum’s new public website using schema.org metadata in the HTML of the site, so that large search engines can find structured data.

The Collections & Practicals Manager in the School of Earth Sciences has suggested that a map demonstrator would be useful for the Geology Museum Linked Data. She is concerned, however, that much of the geo-location data about the collection is embedded as free text in description fields, which would make it difficult to plot the data on a map consistently, if at all. She has proposed using geodata for ‘type specimen’ data for the centre of the UK, although this also raises questions about the level of resolution at which these data could be plotted: for some, the catalogue may only include data about the nearest town or village rather than a precise geolocation related to OS references. Given the Museum’s relationships with local schools and geology enthusiast groups, one way of resolving this issue – and assisting the ‘clean-up’ of the data and giving information on use of the site overall – could be to invite these ‘end users’ to provide feedback and correct location data via the site. She has arranged a meeting with one such group in July which could provide a starting point for this. It will need to be made clear to any users beyond the Museum staff, however, that the demonstrators are not at ‘full service grade’.

The Collection Manager has engaged fully with the project, participating in Advisory Board meetings, 1-1 meetings with the development team and piloting and providing feedback on data migration to Drupal. The demonstrator will provide a concrete example of how Linked Data published via Drupal can be used but evaluation of the value of embedding microdata to facilitate search engine optimisation is unlikely to extend beyond the lifetime of the project.