Archive for June, 2012

Users and use cases – overview

Thursday, June 28th, 2012

The Bricolage project will publish catalogue metadata as Linked Open Data for two of the University of Bristol’s most significant collections: the Penguin Archive and the Geology Museum (site in development). We will also embed resource microdata in the Geology Museum’s forthcoming online catalogue, with the aim of improving collection visibility via major search engines, and develop two simple demonstrators to illustrate the potential of data linking and reuse.

The project’s users are therefore the archive and museum staff responsible for cataloguing and managing these important collections. The Linked Data production workflows need to be easy to use, so that they can be embedded in the collection teams’ routines and so that the export and publication processes remain sustainable beyond the end of the project lifetime, allowing Linked Data to continue to be produced for reuse. Separate blog posts describing the use cases for the Penguin Archive and the Geology Museum set out how the project affects our users and how they are being engaged and are reacting to the project.

Users and use cases: The Geology Museum

Thursday, June 28th, 2012

The Geology Museum (site under development) is based in the University of Bristol’s School of Earth Sciences. It holds historically and scientifically important collections that are unique to the institution: an estimated 100,000 specimens, many of which are unique and of international importance. Highlights include an estimated 20,000 invertebrate fossils, including material with important historical associations; over 4,500 mineral specimens, including many display-quality items from mines that are now inaccessible; over 3,000 vertebrate fossils and casts; and the Fry collection of over 4,000 invertebrate and plant fossils from the UK. There is also an extensive teaching collection of 16,000 specimens. Over the past 15 years, 41,420 digital records have been produced on the basis of historic museum registers, card index catalogues and specimen labels. The creation of digital metadata has focused on valuable specimens and collections of national or international importance; these records cover about two thirds of the entire collection. Each metadata record contains information in 30 categories, 18 of which will be published by this project.

The School of Earth Sciences is already working to enhance the online presence of the Geology Museum by improving the museum website and online access to the collections. This work includes migrating the existing collection metadata into a Drupal-backed system, which can publish Linked Data automatically.
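As a rough illustration of what ‘publish Linked Data automatically’ means in practice, the sketch below requests RDF for a catalogue record via HTTP content negotiation. The URL is hypothetical, and whether RDF/XML is actually served this way depends on which Drupal RDF modules are enabled, so both are assumptions here.

```python
import requests  # pip install requests

# Hypothetical node URL on the new Drupal-backed catalogue site.
SPECIMEN_URL = "http://example.org/geology/specimen/12345"

# Ask for an RDF serialisation of the record via the Accept header;
# a Linked Data-aware server returns RDF rather than the HTML page.
response = requests.get(SPECIMEN_URL,
                        headers={"Accept": "application/rdf+xml"})
response.raise_for_status()

print(response.headers.get("Content-Type"))
print(response.text[:500])  # first few hundred characters of the RDF
```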

Initial work focused on moving data from the existing spreadsheets into the Drupal database. Issues arise from the formats used, including free text that needs to be mapped to restricted terminology. There is a huge amount of data, but it is largely unstructured, so reviewing and testing it requires manual effort. Unlike the Penguin Archive use case, the export and publication processes are largely automated: Drupal’s built-in modules handle RDF and return it in response to a Linked Data request. The aim is to embed data from the catalogue in the Geology Museum’s new public website using schema.org metadata in the HTML of the site, so that the major search engines can find structured data.
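For illustration, here is a minimal sketch of the kind of schema.org markup the new site might carry, generated in Python; the schema.org type, the properties and the specimen fields are our assumptions, not the Museum’s actual mapping.

```python
import html

def specimen_microdata(name, description, location):
    """Render a hypothetical specimen record as a schema.org microdata
    fragment. Field names and the CreativeWork type are illustrative."""
    e = html.escape
    return (
        f'<div itemscope itemtype="http://schema.org/CreativeWork">\n'
        f'  <h2 itemprop="name">{e(name)}</h2>\n'
        f'  <p itemprop="description">{e(description)}</p>\n'
        f'  <span itemprop="contentLocation">{e(location)}</span>\n'
        f'</div>'
    )

print(specimen_microdata(
    "Ichthyosaurus communis (partial skeleton)",
    "Lower Jurassic marine reptile from the Lias Group.",
    "Lyme Regis, Dorset, UK",
))
```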

The Collections & Practicals Manager in the School of Earth Sciences has suggested that a map demonstrator would be useful for the Geology Museum Linked Data. She is concerned, however, that much of the geo-location data about the collection is embedded as free text in description fields, which would make it difficult to plot the data on a map consistently, if at all. She has proposed using geodata for ‘type specimen’ records from the centre of the UK, although this also raises questions about the resolution at which these data could be plotted: for some specimens, the catalogue may only record the nearest town or village rather than a precise geolocation tied to Ordnance Survey references. Given the Museum’s relationships with local schools and geology enthusiast groups, one way of resolving this issue – and assisting the ‘clean up’ of the data while gathering information on use of the site overall – could be to invite these ‘end users’ to provide feedback and correct location data via the site. She has arranged a meeting with one such group in July, which could provide a starting point for this. It will need to be made clear to any users beyond the Museum staff, however, that the demonstrators are not at ‘full service grade’.
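A hypothetical sketch of the semi-automatic clean-up this implies: match free-text locality descriptions against a small gazetteer and flag everything else for human (or ‘end user’) review. The gazetteer, coordinates and record texts below are invented for illustration.

```python
# Tiny invented gazetteer of place name -> (latitude, longitude).
GAZETTEER = {
    "lyme regis": (50.725, -2.936),
    "charmouth": (50.740, -2.903),
    "bath": (51.381, -2.359),
}

def locate(description):
    """Return (place, lat, lon) if a known place name appears in the
    free-text description, otherwise None (record needs review)."""
    text = description.lower()
    for place, (lat, lon) in GAZETTEER.items():
        if place in text:
            return place, lat, lon
    return None

records = [
    "Collected near Lyme Regis, west of the Cobb",
    "Locality unknown; possibly Somerset",
]
for desc in records:
    match = locate(desc)
    print(match if match else f"NEEDS REVIEW: {desc}")
```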

The Collections & Practicals Manager has engaged fully with the project, participating in Advisory Board meetings and one-to-one meetings with the development team, and piloting and providing feedback on data migration to Drupal. The demonstrator will provide a concrete example of how Linked Data published via Drupal can be used, but evaluation of the value of embedding microdata to facilitate search engine optimisation is unlikely to extend beyond the lifetime of the project.

Users and use cases: The Penguin Archive

Thursday, June 28th, 2012

The Penguin Archive, housed in the Special Collections of the University of Bristol Library, contains the archives of Penguin Books Limited from its foundation in 1935 through to the 1980s. Its wide variety of materials covers the company’s establishment and business life, social events, legal cases (particularly the Lady Chatterley’s Lover trial of 1960), exhibitions on the company’s history and the private lives of prominent figures in the company’s early history. The archive also includes a large collection of Penguin books from 1935 to date. The collection comprises 2,093 archive boxes of editorial files; 466 archive boxes, 24 records management boxes and 84 box files of other archival material; and approximately 30,000 book titles. The digital catalogue is held in the Special Collections CALM (Computer Aided Library Management) installation. Holdings there comprise 123 collection-level descriptions containing over 4,000 individual metadata records, plus detailed digital guides to areas of the archive.

JISC has already undertaken work on techniques for exporting Linked Data from CALM, and the current Step Change project will ensure that Linked Data support is embedded in a future release of CALM, albeit not within the Bricolage project’s lifetime. We will follow the approach developed by the LOCAH and SALDA projects: data will be exported as EAD/XML, transformed via XSLT (based on the stylesheet developed within LOCAH) into RDF/XML, and made available as Linked Data. A handful of collection-level Penguin Archive records are already lodged with the Archives Hub. Our project will augment these with a Linked Data set containing thousands of resource-level catalogue records, which will be linked to the Archives Hub identifiers as and when these become available.
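The core of that pipeline can be sketched in a few lines of Python with lxml; the file names, and the idea that the stylesheet is a local derivative of LOCAH’s, are placeholders.

```python
from lxml import etree  # pip install lxml

# Apply an XSLT stylesheet (e.g. one derived from LOCAH's) to an EAD
# export from CALM, producing RDF/XML. File names are placeholders.
ead = etree.parse("penguin_collection.ead.xml")
stylesheet = etree.parse("ead2rdf.xsl")
transform = etree.XSLT(stylesheet)

rdf = transform(ead)  # run the EAD -> RDF/XML transformation
with open("penguin_collection.rdf", "wb") as out:
    out.write(etree.tostring(rdf, pretty_print=True,
                             xml_declaration=True, encoding="UTF-8"))
```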

Initial work in the project focused on archivists adding authority terms to catalogue metadata, but this proved extremely labour-intensive – more so than anticipated. The process has shown that good authority data is needed for good Linked Data, and that this needs to be taken into account when collections are first catalogued – not an option for an existing catalogue like the Penguin Archive. Issues with the CALM export process and the stability of URIs have been reported in other project blog posts.

Early development of tools to automate the metadata review and export workflow as far as possible indicates the need to make it easy to keep the Linked Data up to date after project funding ends. A batch upload process could be used for initial publication. The archivists confirm that the catalogue is “quite fluid” and is often updated, so ease of use and maintenance of the Linked Data are important to our users. One option for increasing the automation of the publishing process could be to upload exports to a folder that is monitored for changes, as sketched below. This may also address concerns that users have already expressed, i.e. that “any non-trivial publishing process would not be used in practice after the project ends”. The project will aim to make the process as ‘light-touch’ as possible.
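A minimal sketch of such a watched-folder process, assuming a local Fuseki instance exposing the SPARQL 1.1 Graph Store protocol; the endpoint, graph URIs and polling interval are all assumptions.

```python
import time
from pathlib import Path
import requests  # pip install requests

# Hypothetical Graph Store endpoint of the project's Fuseki instance.
FUSEKI_DATA = "http://localhost:3030/penguin/data"
EXPORT_DIR = Path("exports")  # folder the archivists drop exports into
seen = set()

while True:
    for rdf_file in EXPORT_DIR.glob("*.rdf"):
        if rdf_file.name in seen:
            continue
        # Publish via the SPARQL 1.1 Graph Store HTTP Protocol, deriving
        # a (hypothetical) named-graph URI from the file name.
        graph = f"http://example.org/penguin/graph/{rdf_file.stem}"
        resp = requests.put(
            FUSEKI_DATA,
            params={"graph": graph},
            data=rdf_file.read_bytes(),
            headers={"Content-Type": "application/rdf+xml"},
        )
        resp.raise_for_status()
        seen.add(rdf_file.name)
        print(f"published {rdf_file.name} -> {graph}")
    time.sleep(30)  # poll the folder every 30 seconds
```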

The Archivist in the University’s Special Collections notes that the primary concern of archivists is to publish sufficient metadata to enable those interested in the materials to identify what exists, and to visit the Penguin Archive to use the materials for research, journalistic or other purposes. The archivists have considered what would make an appropriate demonstrator for Linked Data published through the project; they would like to focus on the ‘administrative history’ of the Archive, plotting collection-level records against a timeline of, for example, dates when key staff were appointed. Administrative history is a familiar archival concept, so the demonstrator would be of interest both to other archivists and potentially to end users of the catalogue/Linked Data. A visual representation of the timeline of events would need to be created manually; within the scope and timeframe of the project this will only be possible for one or two decades, with just some key events plotted across the whole timeline.
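As a toy illustration of the data behind such a timeline, the sketch below serialises a couple of ‘administrative history’ events as JSON that a timeline widget could consume; the event list and output shape are invented for illustration.

```python
import json

# Invented sample of 'administrative history' events; in practice the
# dates and collection links would come from the published Linked Data.
events = [
    {"date": "1935", "label": "Penguin Books founded by Allen Lane"},
    {"date": "1960", "label": "Lady Chatterley's Lover obscenity trial"},
]

# Serialise in a shape a timeline widget could consume.
print(json.dumps({"events": events}, indent=2))
```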

The Penguin Archive archivists have engaged fully with the project, participating in Advisory Board meetings and one-to-one meetings with the development team, and piloting and providing feedback on workflow processes.

Penguin Archive workflow progress

Friday, June 1st, 2012

Progress to date: we’ve combined the components described in our previous post and produced a simple UI, illustrated below. The interface allows an administrator to:

  • Upload EAD XML files exported from CALM
  • Transform to RDF (using XSLT)
  • Publish to the triple store (Fuseki)
  • ‘Unpublish’ and delete EAD/RDF files

The triple store is fronted by a Linked Data API (elda), so this process results in the publication of Linked Data (a web-friendly version is shown in the screenshot below).
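To give a feel for what ‘fronted by a Linked Data API’ means for a client, here is a hypothetical query against the API’s JSON serialisation; the endpoint path is an assumption, though the ‘result’/‘items’ shape follows the Linked Data API specification that elda implements.

```python
import requests  # pip install requests

# Hypothetical elda endpoint for collection-level records; the actual
# paths are defined by the project's elda configuration.
API_URL = "http://example.org/penguin-api/collections"

resp = requests.get(API_URL, headers={"Accept": "application/json"})
resp.raise_for_status()
data = resp.json()

# The Linked Data API wraps results in a 'result' object whose 'items'
# list holds one JSON object per resource.
for item in data.get("result", {}).get("items", []):
    print(item.get("_about"), item.get("label"))
```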

This seems a reasonable start, though there are some obvious next steps:

  • Bulk actions. There are over 100 collections within the archive. We’ll need at least some of the following: bulk upload, bulk RDF conversion and bulk publishing.
  • Automation. Again, thinking of the administrator who will end up looking after this, we should offer a non-interactive option from upload through to publishing.
  • Security. There is none as yet.
  • Link suggestions & validation. We need to include a process for producing lists of suggested links to third-party data, and for user validation of those links.