Users and use cases: The Penguin Archive

June 28th, 2012

The Penguin Archive, housed in the Special Collections of the University of Bristol Library, contains the archives of Penguin Books Limited from its foundation in 1935 through to the 1980s. Its wide variety of materials covers the company’s establishment and business life, social events, legal cases (particularly the Lady Chatterley’s Lover trial of 1960), exhibitions on the company’s history and the private lives of prominent figures in the early history of the company. The archive also includes a large collection of Penguin books from 1935 to date. The collection comprises 2093 archive boxes of editorial files, 466 archive boxes, 24 records management boxes and 84 box files of other archival material and approximately 30,000 book titles. The digital catalogue is held in the Special Collections CALM (Computer Aided Library Management) installation. Holdings there comprise: 123 collection level descriptions containing over 4000 individual metadata records, plus detailed digital guides to areas of the archive.

JISC has already undertaken work looking at techniques for exporting Linked Data from CALM and the current Step Change project will ensure that Linked Data support is embedded in a future release of CALM, albeit not within the Bricolage project’s lifetime. We will follow the approach developed by the LOCAH and SALDA projects: data will be exported as EAD/XML and transformed via XSLT into RDF/XML, starting from the XSLT stylesheet developed within LOCAH, and then made available as Linked Data. A handful of collection level Penguin Archive records are already lodged with the Archives Hub. Our project will augment this data with a Linked Data set containing thousands of resource-level catalogue records, which will be linked to the Archives Hub identifiers as and when these become available.

Initial work in the project focused on archivists trying to add authority terms to catalogue metadata but this proved extremely labour-intensive, more so than anticipated. The process has revealed how good authority data is needed for good Linked Data and that this needs to be taken into account when initially cataloguing collections – not an option for an existing catalogue like the Penguin Archive. Issues with the CALM export process and stability of URIs have been reported in other project blogposts.

Early development of tools to automate as far as possible the workflow of metadata review and export indicates the need to make it easy to keep the Linked Data up to date after project funding ends. A batch upload process could be used for initial publication. The archivists confirm that the catalogue is “quite fluid” and is often updated, so ease of use and maintenance of Linked Data are important to our users. One option for increasing the automation of the publishing process could be to upload exports to a folder which is monitored for changes. This may also address concerns that users have already expressed, i.e. that “any non-trivial publishing process would not be used in practice after the project ends”. The project will aim to make the process as ‘light-touch’ as possible.
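To make the monitored-folder idea concrete, here is a minimal sketch of how changes in an upload folder might be detected by polling. It is not the project's implementation: the function names and the choice of content hashing are assumptions made purely for illustration.

```python
import hashlib
from pathlib import Path


def snapshot(folder):
    """Map each .xml file name in `folder` to a SHA-256 digest of its content."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(folder).glob("*.xml"))
    }


def diff_snapshots(old, new):
    """Return (added_or_changed, removed) file names between two snapshots."""
    changed = sorted(name for name, digest in new.items() if old.get(name) != digest)
    removed = sorted(name for name in old if name not in new)
    return changed, removed
```

A scheduled job could take a snapshot every few minutes, re-transform and re-publish anything in the first list, and unpublish anything in the second, so the archivist's only task is dropping a fresh export into the folder.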

The Archivist in the University’s Special Collections notes that the primary concern of archivists is to publish sufficient metadata to enable those interested in the materials to identify what exists, and to visit the Penguin Archive to use them for research, journalistic or other purposes. The Archivists have considered what would make an appropriate demonstrator for Linked Data published through the project; they would like to focus on the ‘administrative history’ of the Archive, plotting collection level records against a timeline of, for example, dates when key staff were appointed. Administrative history is a familiar archival concept so the demonstrator would be of interest both to other archivists and potentially to end users of the catalogue/Linked Data. A visual representation of the timeline list of events would need to be created manually; within the scope and timeframe of the project this will only be possible for one or two decades, with just some key events plotted for the whole timeline.
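As a rough illustration of the data preparation behind such a timeline, the sketch below groups dated items by decade. The function name and the (year, label) tuple shape are assumptions; the two sample events are taken from dates mentioned in this post.

```python
from collections import defaultdict


def by_decade(items):
    """Group (year, label) pairs under the first year of their decade."""
    decades = defaultdict(list)
    for year, label in sorted(items):
        decades[year - year % 10].append((year, label))
    return dict(decades)


# Sample input using two dates mentioned in the post.
events = [
    (1960, "Lady Chatterley's Lover trial"),
    (1935, "Penguin Books founded"),
]
timeline = by_decade(events)
```

Collection level records with date ranges could be fed through the same grouping, which would make it easy to pick the one or two decades to plot in detail.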

The Penguin Archives archivists have engaged fully with the project, participating in Advisory Board meetings, 1-1 meetings with the development team and piloting and providing feedback on workflow processes.

Penguin Archive workflow progress

June 1st, 2012

Progress to date: we’ve combined the components described in our previous post and produced a simple UI, illustrated below. The interface allows an administrator to:

  • Upload EAD XML files exported from CALM
  • Transform to RDF (using XSLT)
  • Publish to the triple store (Fuseki)
  • ‘Unpublish’ and delete EAD/RDF files

The triple store is fronted by a Linked Data API (elda), so this process results in the publishing of Linked Data (a web-friendly version is shown in the screenshot below).

This seems a reasonable start, though there are some obvious next steps:

  • Bulk actions. There are over 100 collections within the archive. We’ll need at least some of bulk upload, bulk RDF conversion, bulk publishing.
  • Automation. Again, thinking of the administrator who ends up looking after this, we should have the option for non-interactive upload through to publishing.
  • Security. There is none as yet.
  • Link suggestions & validations. Need to include a process for producing lists of suggested links to third-party data, and for user validation of those links.
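The bulk and automation items above could be approached along these lines. This is only a sketch under stated assumptions: `run_batch` and the `transform` and `publish` callables are hypothetical stand-ins for the XSLT step and the triple store upload, not functions from our code.

```python
def run_batch(names, transform, publish):
    """Transform then publish each named file, without stopping on failures.

    Returns (states, errors): the last state reached per file, and an
    error message for each file that failed part-way through.
    """
    states, errors = {}, {}
    for name in names:
        states[name] = "uploaded"
        try:
            rdf = transform(name)       # e.g. EAD XML -> RDF
            states[name] = "transformed"
            publish(name, rdf)          # e.g. load into the triple store
            states[name] = "published"
        except Exception as exc:        # keep going; report at the end
            errors[name] = str(exc)
    return states, errors
```

Recording the last state reached for each file means a failed item can be retried on its own, which matters when there are over 100 collections to process.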

Bricolage: demonstrators

May 25th, 2012

At the Advisory Group meeting on 20 April, we discussed potential scope and focus of the two demonstrators that the project will develop. For the Geology Museum, we may want to focus on a demonstrator that ties in with promoting their work in schools, which could include a mapping feature. The Penguin Archive may want to consider a timeline demonstrator linked to a specific area of the Archive.

We looked at some examples to help refine thinking on demonstrators:

Examples of the use of a timeline:

Example of a geographical view:

The Advisory Group will finalise demonstrators to be developed at its meeting in May; the key will be in demonstrating how the use of Linked Data can enhance the collections, which may in turn encourage sustainability of the tools and processes used.

Progress against workplan

May 25th, 2012

The Bricolage Advisory Group reviewed progress against the project workplan at its second meeting on 20 April 2012. We agreed to bring forward by a month each the two remaining AG meetings – to May and July – to ensure continuing review and steer towards the end of the project in August.

The Linked Data: Hosting work package has started with options such as Talis and data.bris being considered. Timing of the data.bris project may preclude its use, although it may not be the best place to host data from Bricolage anyway. It could, however, be used to create URIs and expose data in an external view. Options to be considered more fully at the meeting in May.

The Linked Data: Metadata review/export work package is moving at a different pace with the Penguin Archive and the Geology Museum (see separate blog posts on export work). In Earth Sciences, student effort is being used to move data from spreadsheet format into an online (Drupal) database. Some of the issues arising are in formats used, including free text, and the need to restrict terminology. There is a huge amount of currently unstructured data. Review and export piloting will carry on for another 2-3 weeks. For the Penguin Archive, work has focused on trying to add authority terms but this has proved extremely labour-intensive, more so than anticipated. Good authority data is needed for good Linked Data, and this needs to be taken into account when initially cataloguing collections. Legacy data without authority data continue to pose problems.

The entity extraction process poses the question ‘can we identify things in textual descriptions, linking unlinked data?’ Some online services can analyse text, eg DBpedia Spotlight entity detection and disambiguation services for constructing bespoke Named Entity Recognition solutions. The Women’s Library at London Met deals effectively with disambiguation of names. We are interested in passing text through the process to see how useful and accurate it can be. Entity extraction is at the experimental end of our work, but Linked Data to an authority source and processes around this are of interest.

Export implementation should be complete by end of May.

Identifiers and Linking work package: Geology have thus far been creating internal links within their data. Work to link this data to other datasets has not yet started. Work on the Penguin Archive to date has highlighted a problem around the stability of URIs; sustainability is an issue for the future. For example, a person identified in the Penguin Archive data could have a unique ID in CALM but that identifier could easily break if CALM’s internal ID scheme is changed by the vendor. Alternative ID schemes that rely on the person’s name or their biographical dates also pose similar problems if, say, the person changes their name or their commonly accepted biographical dates change. We need persistent IDs (eg DOIs) in combination with a resolver service to map from persistent IDs to appropriate internal current IDs (eg CALM IDs).
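The resolver idea can be sketched in a few lines. Everything here is illustrative: the class, the identifier strings and the in-memory dictionary are assumptions, and a real service would need persistent storage and an HTTP front end.

```python
class Resolver:
    """Maps persistent public IDs to whatever internal ID is current."""

    def __init__(self):
        self._current = {}

    def register(self, persistent_id, internal_id):
        """Record the internal ID a persistent ID currently points at."""
        self._current[persistent_id] = internal_id

    def remap(self, old_internal_id, new_internal_id):
        """Repoint persistent IDs after an internal ID scheme change."""
        for pid, internal in self._current.items():
            if internal == old_internal_id:
                self._current[pid] = new_internal_id

    def resolve(self, persistent_id):
        """Return the current internal ID; raises KeyError if unknown."""
        return self._current[persistent_id]


# Hypothetical identifiers, for illustration only.
resolver = Resolver()
resolver.register("person/0001", "calm:AB12")
resolver.remap("calm:AB12", "calm:NEW-77")   # e.g. after a vendor change
```

The point of the design is that published URIs carry only the persistent ID, so a change on the CALM side requires updating the mapping table and nothing else.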

On microdata, for Geology we’re looking to embed data coming out of the catalogue in the public site so that big search engines can find structured data. We’ll be looking at metadata as RDFa within the HTML of the public site.

Sustainability of the tools and workflow developed during the project is important. The key is in developing a set of processes and tools that are easy to use in terms of the export process and publication of Linked Data, so that archivists might routinely use them. Questions arise about what is most useful for the long-term, what is transferable.

The demonstrator workpackages will begin at the end of May; evaluation and dissemination work packages will be discussed at the May Advisory Group meeting.

Penguin Archive workflow design

May 17th, 2012

One of our aims is to produce a workflow for publishing Linked Open Data that is simple for our collection curators to engage with and continue to use post-project. Whilst we can produce one-off Linked Data snapshots of their archives during the project, a real measure of success would be to see those processes embedded into the standard work of the collection maintainers. From their point-of-view the steps for re-publishing the various collections as Linked Data should be as straightforward as possible. Just uploading a raw data file, for example. We want to avoid (as much as possible) introducing requirements that involve too much manual intervention.

So let’s consider one possible workflow for the Penguin Archive in this post. (This workflow is based around the concept of a local Linked Data publishing platform, though we are also considering remote third-party options.)

  • The collection is held in an instance of Axiell CALM. Using its desktop client it is a relatively straightforward process for an archivist to export collection data to EAD XML format.
  • The resulting files can be uploaded to our Linked Data Publishing Service (LDPS) via a browser-based tool. This tool will provide a simple management interface to govern the workflow, the first part of which is transformation into RDF. Building on work done in the LOCAH project we use XSLT to produce RDF from the EAD XML.
  • Using the LDPS again, the RDF is loaded into a triple store. Our intention here is to use Fuseki for this task. Using its support for SPARQL over HTTP the LDPS will manage each collection as a separate Named Graph. Fuseki provides a SPARQL endpoint to these Graphs and also serves the Default Graph as their union.
  • In order to serve the Linked Data we intend to use Elda. This is an implementation of the Linked Data API and will answer Linked Data requests by translating them to backend SPARQL queries.
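As a sketch of the triple store step, the following builds the HTTP requests that would publish or unpublish one collection as a Named Graph, following the SPARQL 1.1 Graph Store HTTP Protocol (PUT replaces a graph, DELETE removes it). The service URL, the graph URI and the function name are assumptions for illustration; the actual endpoint path depends on how Fuseki is configured.

```python
from urllib.parse import urlencode
from urllib.request import Request


def graph_request(service_url, graph_uri, rdf_xml=None):
    """Build a PUT (publish) or DELETE (unpublish) request for one named graph.

    `rdf_xml` is the RDF/XML document as bytes; omit it to unpublish.
    """
    url = service_url + "?" + urlencode({"graph": graph_uri})
    if rdf_xml is None:
        return Request(url, method="DELETE")
    return Request(
        url,
        data=rdf_xml,
        method="PUT",
        headers={"Content-Type": "application/rdf+xml"},
    )


# Hypothetical endpoint and graph name.
req = graph_request(
    "http://localhost:3030/penguin/data",
    "http://example.org/graph/collection-1",
    b"<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'/>",
)
```

Passing the request to `urllib.request.urlopen` would send it. Using PUT rather than POST means re-publishing a collection replaces its graph wholesale, which suits a catalogue that is re-exported in full each time.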

Here is a graphical overview of this architecture:

In terms of automation it doesn’t look too bad. The archivist needs to manage the CALM exports and XML file uploads to the LDPS. Then they (or someone else) would use the LDPS to manage the RDF transforms and triple store contents. The interface to these processes should hopefully be reasonably straightforward (upload, transform, publish to triple store). Elda, once configured, shouldn’t need touching.

So what is missing? Well, linking to other data sets for a start. In this area we’ll need the input of subject experts. When making links from our data sets to others we’ll need a human to confirm the links any automation (text parsing via DBpedia Spotlight?) has proposed. The design of this type of workflow needs to be carefully considered, with the results being cached for future use wherever possible. The subject for a future post.
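One way the confirm-and-cache part might look is sketched below. It assumes suggestions arrive as (source URI, target URI) pairs from some automated step; the class name, the JSON file format and the key scheme are all hypothetical.

```python
import json
from pathlib import Path


class LinkReview:
    """Stores human accept/reject decisions on suggested links in a JSON file."""

    def __init__(self, cache_path):
        self.cache_path = Path(cache_path)
        self.decisions = (
            json.loads(self.cache_path.read_text())
            if self.cache_path.exists()
            else {}
        )

    @staticmethod
    def _key(source, target):
        # URIs contain no spaces, so a space is a safe separator.
        return source + " " + target

    def pending(self, suggestions):
        """Return only the (source, target) pairs nobody has reviewed yet."""
        return [s for s in suggestions if self._key(*s) not in self.decisions]

    def record(self, source, target, accepted):
        """Save one decision and write the cache back to disk."""
        self.decisions[self._key(source, target)] = bool(accepted)
        self.cache_path.write_text(json.dumps(self.decisions, indent=2))

    def accepted_links(self):
        """Return the confirmed (source, target) pairs, sorted."""
        return sorted(
            tuple(key.split(" ", 1))
            for key, ok in self.decisions.items()
            if ok
        )
```

Because rejected suggestions are cached as well as accepted ones, a subject expert is never asked the same question twice when the data is reprocessed.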

People and their URIs

May 15th, 2012

Over on the Linking Lives blog Pete Johnston has published an interesting discussion of some of the problems surrounding the generation of URIs from EAD data, especially with regards to people. These considerations are proving to be particularly pertinent to our work with the Penguin collection data.

First steps with the Penguin Archive data

March 21st, 2012

Over the last few weeks, Jasper and I have met with Anna and Hannah, the archivists managing the Penguin Archive held by the University Library Special Collections Department, and I’ve had a first stab at processing some sample EAD XML data for a small subset of the collections which make up the archive.

As in the case of the metadata for the Mass Observation Archive that I worked on in the SALDA project last year, the data is held and managed within a CALM data management system, and the EAD data is generated by an export process from the CALM database. In comparison with the case of the Archives Hub, where data is aggregated from diverse sources and systems, this offers the advantage that there is much less structural variation across the documents, as the XML markup is being generated by a single common process. A second benefit is that the data content has been subject to various normalisation/validation processes within the CALM system.

I’m taking an approach similar to that I applied in the SALDA project, taking as a starting point (though we may refine/amend this) the model, URI patterns and the XSLT transform used in the LOCAH and Linking Lives projects, overriding or discarding some of the elements that are specific to the Archives Hub context, and adding (so far relatively few) elements specific to the Bristol/Penguin context. (Aside: in the Linking Lives project, I’ve been doing some work on the transform recently, to fix bugs, to extend it slightly and generally try to make it a bit more “robust” in terms of the range of inputs it handles, so I felt using this version was probably the best starting point – I hope something will be available about that on the Linking Lives blog very soon.)

Also within Linking Lives, I’ve spent some time tidying up the conversion processing, wrapping it up in some shell scripts, driving it from URI lists and adding some capture of (very basic) process metadata. My scripting skills are limited and I’m sure it’s not as elegant and efficient as it could be, but I was pleased to find that I could repurpose that work and get things up and running for Bricolage with a minimum amount of tweaking, and Jasper and I will be looking at getting it onto a firmer footing over the next few weeks.

The Penguin Archive data differs from the MOA data in that it is made up of a large number of collections, exported as distinct EAD documents. However, as noted above, the export process ensures a good level of structural consistency across the set. I think there are some (relatively minor) variations in the “style” of cataloguing, and we probably need to examine a larger sample to make sure the process is coping with that, but so far, the results look pretty good.

Also in contrast to the MOA data, the Penguin data does have index terms applied – in the data I’ve seen so far, personal and corporate names following National Council of Archives’ Rules for the Construction of Personal, Place and Corporate Names. We’ve also had access to a sample of “authority record data” exported from CALM: this gives us access to the name data in structured form, so by transforming this data alongside the EAD data we can add that structured data into the RDF output.
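To show what “name data in structured form” buys us, here is a rough sketch that splits an index term into its parts. The layout it assumes (family name; given name with optional dates in brackets; then epithets) is inferred from the single example in this post, not from the full NCA Rules, and the function name is made up.

```python
import re


def parse_authority_name(name):
    """Split 'Family; Given (dates); epithet; ...' into labelled parts."""
    parts = [p.strip() for p in name.split(";")]
    given, dates = None, None
    if len(parts) > 1:
        # Given name, optionally followed by dates in round brackets.
        match = re.match(r"^(.*?)\s*(?:\((.*?)\))?$", parts[1])
        given = match.group(1) or None
        dates = match.group(2)
    return {
        "familyName": parts[0],
        "givenName": given,
        "dates": dates,
        "epithets": parts[2:],
    }


parsed = parse_authority_name("Schmoller; Hans (1916-1985); typographer; designer")
```

String parsing like this is fragile, which is exactly why having the authority record export, with the fields already separated, is preferable where it is available.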

Currently the URI pattern for Persons (and “Conceptualisations of Persons”) makes use of a “slug” constructed from the “authority form” of the name, e.g. the EAD construct

  <persname rules="ncarules">Schmoller; Hans (1916-1985); typographer; designer</persname>

is transformed into RDF data like the following (I’ve omitted some triples for the sake of brevity as I really just want to show the URI structures):

@prefix rdfs: <> .
@prefix foaf: <> .
@prefix skos: <> .
@prefix concept: <> .
@prefix person: <> .

  a skos:Concept ;
  rdfs:label "Schmoller; Hans (1916-1985); typographer; designer"@en ;
  foaf:focus person:schmollerhans1916-1985typographerdesigner .

  a foaf:Person ;
  rdfs:label "Schmoller; Hans (1916-1985); typographer; designer"@en ;
  foaf:familyName "Schmoller" ;
  foaf:givenName "Hans" .

However, it’s been at the back of my mind that there is possibly some “brittleness” in the URI construction here: if there are changes to the name in the source data (e.g. the addition of a new “epithet” or “title”, or of a previously unknown date of death), when that data is reprocessed a different URI is generated. In principle, we could maintain both the old and new URIs, especially if such changes are rare, but it would be preferable to ensure from the outset that our URIs are based on data that does not change. Within the CALM system the authority records do use reference numbers for identification, so this raises the question of whether those reference numbers might be used as the basis for these URIs. But would those reference numbers offer more stability than the names? Are they stable across any internal reorganisation within CALM, or across upgrades between versions of CALM? Would they survive any future migration from CALM to some other data management system? These are questions we need to explore with Anna and Hannah before making any changes.
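The brittleness is easy to demonstrate. The function below is a guess at the slug rule, inferred only from the example above (lowercase, then drop everything except letters, digits and hyphens); it is not the actual transform, and the added epithet in the second call is invented for the demonstration.

```python
import re


def name_slug(authority_form):
    """Lowercase the name and keep only a-z, 0-9 and hyphens."""
    return re.sub(r"[^a-z0-9-]", "", authority_form.lower())


original = name_slug("Schmoller; Hans (1916-1985); typographer; designer")
# A hypothetical later edit to the authority form adds one epithet...
revised = name_slug("Schmoller; Hans (1916-1985); typographer; designer; author")
# ...and the person's URI silently changes.
uri_changed = original != revised
```

A slug built from a stable reference number would not have this problem, provided the reference number itself survives upgrades and migrations.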

Finally, one similarity with the MOA/SALDA data is that there are sets of resources which don’t have explicit index terms in the EAD data but for which the names of other entities are present, typically embedded in titles – names of persons in some cases, names of publications in others – which might usefully be linked to things described in other datasets. So one of our next steps is to analyse these cases further and decide whether it is useful/effective to apply some specific/“local” processing to generate additional entities in the RDF output in those cases.

The Project Plan

March 8th, 2012

Aims, Objectives and Final Output(s) of the project

The Bricolage (University of Bristol Collections as Linked Open Data) project will work with two of its most significant collections to publish catalogue metadata as Linked Open Data.

  • The Penguin Archive, a comprehensive collection of the publisher’s papers and books.
  • The Geology Museum, a 100,000 specimen collection housing many unique and irreplaceable resources.

The project will re-apply the best practice processes and tools produced by relevant preceding projects to create persistent identifiers, identify and create links to authoritative datasets and vocabularies, and work with the two collections’ infrastructure platforms: CALM and Drupal. The Linked Data production workflows will be embedded in the collections’ teams to ensure future sustainability. The project will also produce two simple demonstrators to illustrate the potential of data linking and reuse, and will encode resource microdata into the Geology Museum’s forthcoming online catalogue with the aim of improving collection visibility via the major search engines.

The metadata will also be licensed for ease of reuse according to JISC guidelines.

The main outputs of this project:

  • Linked Data sets for the Geology Museum and the Penguin Archive, with reuse guidance.
  • Two demonstrators illustrating data reuse: a browser-based mapping application for exploring the Geology collection via its geography, and an interactive timeline displaying the chronology of selected resources within the Penguin Archive.
  • A case study report on the experiences of embedding microdata into the Geology Museum website.

Wider Benefits to Sector & Achievements for Host Institution

One of the main achievements for the project’s host institution will be the sustainable production of public open Linked Data for two of its largest collections. As well as increasing the profile, visibility and potential for reuse of the catalogues in question, the experience gained during the project will provide a solid grounding for the reapplication of the methods to other collections in future.

For the sector, the wider benefits of this work include the following.

  • The addition of two significant collections to the Linked Data ecosystem. The new datasets will be interlinked with existing public vocabularies and datasets, so aiding their ease of discovery and reuse.
  • Both CALM and Drupal are widely used within the HE sector. CALM is an established library tool and Drupal, though originally a content management system, is increasingly found in cataloguing environments. The project’s work with both platforms will provide useful learning for the community.
  • Other valuable lessons will also be shared with the community, in particular as regards working with subject experts to identify authoritative public schemas and datasets, embedding sustainable processes within collection teams, our experience of using microdata and of using the data to produce examples of reuse.

Risk Analysis and Success Plan

Action to Prevent/Manage Risk

The staff named below all have significant experience within their areas of expertise. IT Services and Bristol University in general offer a pool of staff with suitably equivalent skills in the event of any staff departures occurring in the project.

The need to manage a team spanning three departments has been considered when allocating the proportion of project management.

The project remit will be highly focused, and is building upon work already done in this area. In addition there is experience of Linked Data within the team gained from previous JISC projects. The project also has two hosting options.

Licensing issues that may limit the reuse of the data produced have been considered and are not deemed to be a barrier. Both collections have committed to use permissive licences. Any software produced will be available under an open source licence.

Stakeholder engagement

Engagement with stakeholders is important to the project and the workplan includes effort to support engagement activities. These will also be evaluated.

The main issue that would arise for the project if its outputs were to prove popular would be managing any excessive demands on the hosting resources. Simple downloads of the data set would not be problematic in this regard but interfaces that required server-side processing (e.g. SPARQL) could be. These questions will be considered when the project is conducting its review of the data hosting options.

Both collections within the project have committed to release their catalogue metadata as Linked Data for reuse under the ODC-PDDL or CC0 licence, as per the guidance given by the Open Bibliographic Data Guide. This commitment will ensure that the Linked Data produced will be open to reuse, and it also meets the requirement for involvement with the Talis Platform Connected Commons scheme.

Any source code produced will be the copyright of the University of Bristol. It will be made available under an open source licence for free and non-commercial use and will be available to the UK Higher Education and Further Education community in perpetuity.

Project Team Relationships and End User Engagement

The team and their roles:

  • Professor Mike Benton, Professor of Vertebrate Palaeontology. Advisory Group.
  • Claudia Hildebrandt, Collections and Practicals Manager. Geology catalogue expertise and Advisory Group.
  • Pete Johnston, Eduserv. Metadata Consultant.
  • Hannah Lowery, Archivist, Special Collections. Advisory Group.
  • Anna Riggs, Archivist. Penguin catalogue expertise.
  • Jasper Tredgold, Senior Technical Developer.
  • Geology Studentship. Working on the Geology catalogue.

The project team will devise a dissemination plan at an early stage. We will disseminate good practice and lessons learned through the project blog and JISC events. We will seek to collaborate with related JISC projects where possible and will make our reports freely available online. We have a strong record of collaboration and regularly disseminate project outputs to the HEI community.

The project will also engage with stakeholders via the existing relationships held by the Penguin Archive and the Geology Museum. These include researchers, teachers and other academics. The engagement will raise awareness of the project and seek to gain input into potential applications for data reuse.

Projected Timeline, Workplan & Overall Project Methodology

Work packages and activities, scheduled across months M1 to M7:

  • Governance and Engagement: Establish mailing lists, project blog and project wiki; Advisory group establish and meet; Detailed work plan (to be evaluated monthly).
  • Linked Data: Hosting review; Collection metadata review and preparation; Export process development; Identifiers and linking; Export implementation; Documentation for reuse.
  • Microdata: Schema review; Microdata markup creation; Embedding in Geology online catalogue.
  • Sustainability: Embed Linked Data maintenance processes.
  • Demonstrators: Produce two demonstrations of reuse.
  • Evaluation: Evaluation of the Linked Data produced and the techniques used. The project methodology will also be evaluated.
  • Final Reporting & Dissemination: Lessons learned, findings of value to the JISC community; Final release of Linked Data with documentation.

A few more details on selected workpackages follow.

Linked Data: Hosting review

The project has the commitment of both the Geology Museum and the Library Special Collections as regards the hosting of the Linked Data produced. In addition the team has experience of hosting Linked Data from previous projects. However, at an early stage we will also assess the suitability of using the Talis Platform Connected Commons scheme to host the project’s Linked Data outputs. This scheme supports the publishing and the reuse of Linked Data by removing, for qualifying data sets, the associated hosting costs.

Linked Data: Collection metadata review and preparation

One of the first tasks of the project will be to review the current collection metadata with particular regard to its structure. While labour-intensive changes are not in scope the team will seek to make edits that will ensure coherency and aid the subsequent transformation of the data to a format that supports reuse. Examples of this may be date, place name and person name formats.

The project will also assess the scope for the archivists to undertake some limited manual enrichment of the data. An example, related to the Penguin Archive in particular, might be to add event information. So metadata for a set of minutes of a committee meeting would be extended with data describing the meeting as an event associated with a time, place, people etc.

Linked Data: Export process development

The Penguin Archive is held in the Special Collections’ CALM installation. JISC has already undertaken work looking at techniques for exporting Linked Data from CALM, and this project will reference and build on that work, in particular the SALDA and LOCAH projects. It will also maintain links with the recently funded JISC Step Change project. This latter project will ensure Linked Data support is embedded in a future release of CALM. Although this release will not occur in the lifetime of Bricolage, by keeping up-to-date with their work and other developments relevant to the Discovery programme we will ensure our outputs are compatible with outputs from current infrastructure projects.

For the catalogue data held in CALM the project will follow the approach developed by LOCAH and SALDA. Data will be exported as EAD/XML, transformed via XSLT into Linked Data expressed in RDF/XML format. The starting point for the transformation will be the XSLT stylesheet developed within LOCAH and made available by the Archives Hub.
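To give a feel for what the transformation produces, the sketch below maps one collection-level EAD description to RDF/XML. It is a plain-Python stand-in using the standard library, not the LOCAH XSLT stylesheet. The sample record, the base URI and the choice of `rdfs:label` and `dcterms:identifier` are assumptions for illustration, and it assumes an EAD export without an XML namespace.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DCT = "http://purl.org/dc/terms/"


def ead_to_rdfxml(ead_xml, base_uri):
    """Turn the top-level <did> of an EAD document into a small RDF/XML graph."""
    for prefix, uri in (("rdf", RDF), ("rdfs", RDFS), ("dcterms", DCT)):
        ET.register_namespace(prefix, uri)
    did = ET.fromstring(ead_xml).find("./archdesc/did")
    unitid = did.findtext("unitid").strip()
    title = did.findtext("unittitle").strip()

    root = ET.Element(f"{{{RDF}}}RDF")
    # Illustrative URI pattern only: base URI plus lowercased reference code.
    desc = ET.SubElement(
        root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": base_uri + unitid.lower()}
    )
    ET.SubElement(desc, f"{{{RDFS}}}label").text = title
    ET.SubElement(desc, f"{{{DCT}}}identifier").text = unitid
    return ET.tostring(root, encoding="unicode")


# Made-up sample record, not real catalogue data.
sample = """<ead><archdesc level="fonds"><did>
  <unitid>DM0000</unitid><unittitle>Example collection</unittitle>
</did></archdesc></ead>"""
rdf_xml = ead_to_rdfxml(sample, "http://example.org/id/archivalresource/")
```

The real stylesheet handles far more of the EAD structure (nested components, index terms, dates), but the shape of the task is the same: pick out fields, mint a URI, emit triples.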

Linked Data: Identifiers and linking

Part of the project is to use metadata released by the project in conjunction with already existing open metadata. In pursuit of this goal the subject experts within the team will identify appropriate open datasets and vocabularies and lead the work to inter-link the Bristol datasets with them. Obvious examples include DBpedia and the LCSH (or FAST) and VIAF authority services. The Linked Data version of the British National Bibliography will be of particular interest to the Penguin Archive, and the CIDOC Conceptual Reference Model (CRM), and perhaps the BBC Wildlife Finder, will be for Geology. The project will also reuse the RDF vocabulary produced by the LOCAH project. We anticipate that the techniques developed by our subject experts for this process will provide interesting lessons for the community.

As noted in the Discovery programme’s draft high-level technical principles, resource discovery “relies on persistent global identifiers”. The project will follow best practice in this area and use carefully designed URIs, in consultation with other on-going institutional work in this area. These URIs will be created with interoperability and persistence in mind.

Within the Geology domain the project envisages linking geographical information about museum specimens with open access geographical databases (e.g. GeoNames) and GIS systems and interfaces. We believe that this will allow users to not only search the collection database but visualise geographical distributions of specimens and familiarise themselves with local and regional geology – a useful tool for scientists and schools.


The project will also work with the University’s online enhancement team (co-located with the project team) to embed microdata derived from the Geology metadata into their new museum website. This microdata work will seek to use and extend the schemas found at, and as a result, will provide structured data recognizable by the major search providers. This strategy aims to improve the discoverability of the museum’s collections, as described in the Discovery programme’s draft technical principles.


For the Geology data the demonstrator will be a browser-based mapping application, allowing a user to navigate the collection via the geographic locations of the resources. This will utilise the links made from the resource metadata to open access geographical databases and will provide an example of a new and versatile way to explore the museum’s collection.

For the Penguin Archive the project will produce an interactive timeline-based interface to aspects of the collection, in particular the resources associated with the Lady Chatterley’s Lover trial. This will provide a chronological view of the data not possible using traditional catalogue data and interfaces.


Total project cost: £81,557.

Of which £43,095 from JISC, £38,462 from University of Bristol.

The problem we are addressing and why

January 10th, 2012

Bricolage, or to give it its full name, the ‘University of Bristol Collections as Linked Open Data’ project, has recently been funded by JISC as part of the 16/11 grant funding: JISC Digital infrastructure programme.

The project will start 1st Feb 2012 and run through to 31st August 2012.

This project will publish catalogue metadata as Linked Open Data for two of its most significant collections: the Penguin Archive, a comprehensive collection of the publisher’s papers and books; and the Geology Museum, a 100,000 specimen collection housing many unique and irreplaceable resources.

The metadata will be licensed for ease of reuse according to JISC guidelines.

The project will re-apply the best practice processes and tools produced by relevant preceding projects to create persistent identifiers, identify and create links to authoritative datasets and vocabularies, and work with the two collections’ infrastructure platforms: CALM and Drupal. The Linked Data production workflows will be embedded in the collections’ teams to ensure future sustainability.

The project will also produce two simple demonstrators to illustrate the potential of data linking and reuse, and will encode resource microdata into the Geology Museum’s forthcoming online catalogue with the aim of improving collection visibility via the major search engines.