Archive for the 'Penguin Archive' Category

Penguin Archive software

Tuesday, November 13th, 2012

The open source software developed as part of the Bricolage project to manage the workflow of publishing EAD data as Linked Data is now available from GitHub:

It comprises three components:

  • Bricol: the management interface
  • Fuseki: a deployment of Fuseki configured for use with Bricolage
  • Elda: a deployment of Elda configured for use with Bricolage

Each of these components is a Maven project, with build instructions.

Technical Approach: Penguin Archive

Tuesday, November 6th, 2012

The workflow architecture we adopted to support the publishing of Linked Data from the Penguin Archive is illustrated here:

The workflow breaks down into these steps:

  1. Export from CALM. This is achieved via the CALM desktop client and managed by the archivist administrator. It produces EAD XML format files.
  2. Import into Linked Data Publishing Service. Again managed by the archivist, a browser-based interface allows them to upload an EAD file per collection.
  3. Convert to RDF. Done via the LDPS, produces an RDF version of a collection’s EAD document by running it against a conversion XSLT stylesheet. Both the EAD and RDF documents are simply files held on the LDPS server.
  4. Publish Linked Data. The collection’s RDF is pushed (using SPARQL Update over HTTP) to our triple store (Fuseki).
  5. Link data. The LDPS also lists the details of people published in the previous step (via a SPARQL query run against the triple store). These can be used to query a third-party authority service (VIAF) with the aim of identifying a record for the individual there. If the archivist selects a record, this information is asserted into our triple store.
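As an illustration of step 4, here is a minimal sketch of what a SPARQL Update request to a triple store could look like. It is not the LDPS code itself (which is Java); the endpoint URL, the dataset name, and the idea of keeping each collection in its own named graph are assumptions made for the example, and the triples are assumed to be already serialised as N-Triples. The request is only built, not sent.

```python
from urllib.request import Request

# Hypothetical endpoint: the host and dataset name used by the project
# are not given in this post.
UPDATE_ENDPOINT = "http://localhost:3030/penguin/update"


def build_publish_request(graph_uri: str, ntriples: str,
                          endpoint: str = UPDATE_ENDPOINT) -> Request:
    """Build (but do not send) a SPARQL Update request for one collection.

    Assumption: each collection lives in its own named graph, and
    `ntriples` holds that collection's RDF as N-Triples lines.
    """
    update = (
        f"INSERT DATA {{ GRAPH <{graph_uri}> {{\n"
        f"{ntriples.strip()}\n"
        f"}} }}"
    )
    return Request(
        endpoint,
        data=update.encode("utf-8"),
        # Media type defined by the SPARQL 1.1 Protocol for direct updates.
        headers={"Content-Type": "application/sparql-update"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_publish_request(
        "http://example.org/graph/collection-1",
        '<http://example.org/id/collection-1> '
        '<http://purl.org/dc/terms/title> "Example collection" .',
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the resulting object to urllib.request.urlopen would send the update. One graph per collection is a design choice for the sketch only: it would let a later 'unpublish' remove a collection without touching the others.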

At this point the data is available as Linked Data via the Linked Data API served from an instance of Elda. This data is therefore available in formats suitable for machine-processing, but as an example here is the human-friendly HTML view for the collection at (Note that this view is not intended as the primary interface to this material and as such not all the Elda functionality works at present.)

The software developed to support this process will be detailed in a subsequent post.

Penguin Archive demonstrator live

Tuesday, November 6th, 2012

The final version of the Penguin Archive demonstrator has been completed and embedded into the Special Collections’ Penguin Archive website.

Penguin demonstrator embedded in Special Collections

The demonstrator details have been documented in an earlier post, but in summary it combines a pre-populated chronology with live data pulled from the Archive’s newly-populated linked data store.

Lessons learned: summary

Monday, October 15th, 2012

At the final Advisory Group meeting towards the end of July 2012, the following points were made in relation to the evaluation of the quality of Linked Data produced and techniques used:

  • Quality depends not only on time-consuming human-crafted links to third party datasets, but also on the quality of those datasets (the project had identified some potential mistakes in VIAF)
  • Export processes such as those developed for the Penguin Archive use case were not really sustainable with the limited resources that archives usually have
  • Limitations of some parts of Drupal mean that, in the Geology use case, we may not be able to make as much of the Linked Data as we would like
  • Although we have made considerable efforts to make the user interface to the export and publication processes as smooth as possible, they are still not integrated enough to be adopted in normal working practice
  • There is a considerable learning curve in understanding Linked Data and what is needed to create and publish them, which requires intensive support and/or time to read around the subject
  • From a technical perspective, the project has highlighted how much impact on the data the production of Linked Data has – it’s not a simple conversion process
  • The extent to which Linked Data has the potential to ‘draw in’ new audiences for collections is more limited than envisaged, as collection level descriptions are already available in the Archives Hub, ranked highly in Google searches and accessible via Calm
  • It needs more resource, more space and more time!

In terms of methodology, the bringing together of different use cases and technical expertise had worked well, despite learning curves on all sides. The project had been beneficial in raising awareness of Linked Data issues in the Special Collections and Geology teams, and of archival and cataloguing practice in the technical team. Geology and Special Collections were also more aware of each other’s collections and potential for working together in the future.


Lessons learned: sustainability of workflow

Monday, October 15th, 2012

The two use cases present different scenarios for the sustainability of processes for publishing Linked Data. As noted in an earlier blogpost, for the Penguin Archive, the process has been unexpectedly time-consuming. The production of Linked Data requires enhancement of the collection metadata way beyond the requirements of currently accepted archival standards and, for the majority of archivists, presents a considerable technical learning curve. The key lesson for the archives community and for those promoting the creation of Linked Data is around the limitations that archivists’ core values and practice, and their very constrained time, impose.

The Penguin Archive had benefitted from a funded cataloguing post for two and a half years to transfer paper catalogue records to Calm; one of the archivists took up to a week to create new records for publication as Linked Data as part of the project.

For both collections, free-text catalogue entries were particularly problematic; if any structure is needed, time has to be spent on extracting specific text into new fields manually, or on re-categorisation. Although the project coincided with the development of a new Drupal-based online catalogue for the Geology Museum and publication of Linked Data is therefore largely automatic, the extent of data cleaning required as part of the process came as a surprise. The project allowed for effort to be expended on data cleaning; without this effort, any Linked Data published would have been of very limited use, even though publication is largely automatic through the additional Drupal module. Even where structure exists internally to collection data, there are still issues of linking these with third party datasets.

One recommendation may be for JISC to support a project to identify the amount of effort required for the creation and publication of Linked Data for different types of collections.

Overall, the message from both use cases is that the creation of Linked Data, links to authority files and to third party datasets should not be considered a quick and easy solution. Future upgrades of products such as Calm may well integrate creation and publication of Linked Data without there needing to be a separate process; without this, it is unlikely that data from collections like the Penguin Archive, where much human intervention is needed, will be published as Linked Data.

Users and use cases: Penguin archive

Thursday, July 26th, 2012

When we first looked at the data from the Penguin Archive, we wanted to index our collection level records with the names of the creating person or organization so that they could be linked to VIAF.  This seemed straightforward at first: we were working with a relatively small number of records (around 130); we use the Calm database; and the archivist who catalogued the collection had already created authority records for the most significant people and organizations in the archive.


However, the indexing turned out to be much more labour-intensive than it appeared.  I had not catalogued the collection, and I do not have the knowledge that our Penguin archivist had developed, so I needed to spend some time analysing the collections to work out an accurate provenance.  Although some of the creating people and organizations were available already, not all were, and it quickly became apparent that creating even the most basic authority records would use far too much of the time available in the whole project. At a very rough estimate, it took between twenty minutes and half an hour to find the information for a single authority record, and about a week to create the new records needed.


The time available for cataloguing in most archive services is extremely limited, and many important collections have little or no online presence beyond a collection level record.  Cataloguing frequently happens with short-term project funding, within tight timescales, and may be focused on providing a resource for outreach, for writing a company history, or for some other direct benefit to the organization that owns the collection.  This means that it can be difficult to justify spending time creating contextual information or doing a lot of indexing.  If there has to be a choice between creating a catalogue for a collection that is invisible on the Web and creating authority records relating to a collection that is already catalogued, many archivists would take the view that cataloguing is a better use of time.


This experience is a useful reminder that even when the right tools and standards are in place, projects should still plan for the research time which will be required to use them effectively.

Users and use cases – overview

Thursday, June 28th, 2012

The Bricolage project will publish catalogue metadata as Linked Open Data for two of the University of Bristol’s most significant collections: the Penguin Archive and the Geology Museum (site in development). We will also encode resource microdata into the Geology Museum’s forthcoming online catalogue with the aim of improving collection visibility via major search engines and develop two simple demonstrators to illustrate the potential of data linking and reuse.

The project’s users are therefore archive and museum staff responsible for cataloguing and managing these important collections. The Linked Data production workflows need to be easy to use to enable embedding in the collection teams’ routine and to maximise sustainability of export and publication processes beyond the end of the project lifetime so that Linked Data can continue to be produced for reuse. Separate blog posts describe the use cases for the Penguin Archive and for the Geology Museum, and indicate how the project affects our users and how they are being engaged and are reacting to the project.

Users and use cases: The Penguin Archive

Thursday, June 28th, 2012

The Penguin Archive, housed in the Special Collections of the University of Bristol Library, contains the archives of Penguin Books Limited from its foundation in 1935 through to the 1980s. Its wide variety of materials covers the company’s establishment and business life, social events, legal cases (particularly the Lady Chatterley’s Lover trial of 1960), exhibitions on the company’s history and the private lives of prominent figures in the early history of the company. The archive also includes a large collection of Penguin books from 1935 to date. The collection comprises 2093 archive boxes of editorial files, 466 archive boxes, 24 records management boxes and 84 box files of other archival material and approximately 30,000 book titles. The digital catalogue is held in the Special Collections CALM (Computer Aided Library Management) installation. Holdings there comprise: 123 collection level descriptions containing over 4000 individual metadata records, plus detailed digital guides to areas of the archive.

JISC has already undertaken work looking at techniques for exporting Linked Data from CALM and the current Step Change project will ensure that Linked Data support is embedded in a future release of CALM, albeit not within the Bricolage project’s lifetime. We will follow the approach developed by the LOCAH and SALDA projects: data will be exported as EAD/XML, transformed via XSLT (based on the stylesheet developed within LOCAH) into RDF/XML, and then made available as Linked Data. A handful of collection level Penguin Archive records are already lodged with the Archives Hub. Our project will augment this data with a Linked Data set containing thousands of resource-level catalogue records, which will be linked to the Archives Hub identifiers as and when these become available.
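To give a feel for what the EAD-to-RDF step involves, here is a deliberately tiny sketch. It is a stand-in, not the approach itself: it uses Python's ElementTree rather than XSLT, it emits N-Triples rather than RDF/XML, and it maps only two fields, whereas the LOCAH stylesheet covers far more and uses its own vocabulary. The URI pattern under example.org, the choice of Dublin Core terms, and the assumption that the EAD file has no XML namespace are all ours for the purposes of the example.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import quote

DCTERMS = "http://purl.org/dc/terms/"
# Hypothetical URI pattern, for illustration only.
BASE = "http://example.org/id/archivalresource/"


def _literal(text: str) -> str:
    """Escape a string as an N-Triples literal."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def _text(parent: ET.Element, tag: str) -> str:
    """Return the whitespace-normalised text of a child element, or ''."""
    element = parent.find(tag)
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def ead_to_ntriples(ead_xml: str) -> List[str]:
    """Map the top-level <did> of an EAD document to a few N-Triples lines."""
    root = ET.fromstring(ead_xml)
    did = root.find("./archdesc/did")
    if did is None:
        raise ValueError("no <archdesc>/<did> element found")
    unitid = _text(did, "unitid")
    if not unitid:
        raise ValueError("the description has no <unitid>")
    subject = f"<{BASE}{quote(unitid.lower(), safe='')}>"
    triples = [f"{subject} <{DCTERMS}identifier> {_literal(unitid)} ."]
    title = _text(did, "unittitle")
    if title:
        triples.append(f"{subject} <{DCTERMS}title> {_literal(title)} .")
    return triples


if __name__ == "__main__":
    sample = """<ead><archdesc level="collection"><did>
      <unitid>EX1</unitid>
      <unittitle>Example <emph>collection</emph> papers</unittitle>
    </did></archdesc></ead>"""
    print("\n".join(ead_to_ntriples(sample)))
```

Even in this toy form the sketch shows why stable identifiers matter: the subject URI is derived from the unitid, so any change to the reference code changes the URI.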

Initial work in the project focused on archivists trying to add authority terms to catalogue metadata but this proved extremely labour-intensive, more so than anticipated. The process has revealed how good authority data is needed for good Linked Data and that this needs to be taken into account when initially cataloguing collections – not an option for an existing catalogue like the Penguin Archive. Issues with the CALM export process and stability of URIs have been reported in other project blogposts.

Early development of tools to automate as far as possible the workflow of metadata review and export indicates the need to make it easy to keep the Linked Data up to date after project funding ends. A batch upload process could be used for initial publication. The archivists confirm that the catalogue is “quite fluid” and is often updated, so ease of use and maintenance of Linked Data are important to our users. One option for increasing the automation of the publishing process could be to upload exports to a folder that is monitored for changes. This may also address concerns that users have already expressed, i.e. that “any non-trivial publishing process would not be used in practice after the project ends”. The project will aim to make the process as ‘light-touch’ as possible.
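The monitored-folder idea could be sketched roughly as follows. This is an illustration of the option rather than anything the project has built: the function names are hypothetical, we assume exports are saved as .xml files in a single folder, and we compare content hashes between two scans rather than relying on file timestamps.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot(folder: Path) -> Dict[str, str]:
    """Map each .xml export in the folder to a SHA-256 hash of its contents."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(folder.glob("*.xml"))
        if path.is_file()
    }


def changed_exports(previous: Dict[str, str],
                    current: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two snapshots and report which exports need attention."""
    return {
        # Present now but not in the last scan: convert and publish.
        "new": sorted(n for n in current if n not in previous),
        # Present in both scans with different content: republish.
        "updated": sorted(n for n in current
                          if n in previous and current[n] != previous[n]),
        # Gone since the last scan: candidate for unpublishing.
        "removed": sorted(n for n in previous if n not in current),
    }
```

A scheduled job could call snapshot on each run, pass the previous and current results to changed_exports, and hand the 'new' and 'updated' files to the conversion and publishing steps without the archivist having to do anything beyond saving the export.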

The Archivist in the University’s Special Collections notes that the primary concern of archivists is to publish sufficient metadata to enable those interested in the materials to be able to identify what exists, and to visit the Penguin Archive to use them for research, journalistic or other purposes. The Archivists have considered what would make an appropriate demonstrator for Linked Data published through the project; they would like to focus on the ‘administrative history’ of the Archive, plotting collection level records against a timeline of, for example, dates when key staff were appointed. Administrative history is a familiar archival concept so the demonstrator would be of interest both to other archivists and potentially to end users of the catalogue/Linked Data. A visual representation of the timeline list of events would need to be created manually; within the scope and timeframe of the project this will only be possible for one or two decades, with just some key events plotted for the whole timeline.

The Penguin Archive archivists have engaged fully with the project, participating in Advisory Board meetings, one-to-one meetings with the development team and piloting and providing feedback on workflow processes.

Penguin Archive workflow progress

Friday, June 1st, 2012

Progress to date: we’ve combined the components described in our previous post and produced a simple UI, illustrated below. The interface allows an administrator to:

  • Upload EAD XML files exported from CALM
  • Transform to RDF (using XSLT)
  • Publish to the triple store (Fuseki)
  • ‘Unpublish’ and delete EAD/RDF files

The triple store is fronted by a Linked Data API (Elda), so this process results in the publishing of Linked Data (a web-friendly version is shown in the screenshot below).

This seems a reasonable start, though there are some obvious next steps:

  • Bulk actions. There are over 100 collections within the archive. We’ll need at least some of bulk upload, bulk RDF conversion, bulk publishing.
  • Automation. Again, thinking of the administrator who ends up looking after this, we should have the option for non-interactive upload through to publishing.
  • Security. There is none as yet.
  • Link suggestions & validations. Need to include a process for producing lists of suggested links to third-party data, and for user validation of those links.
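For the link suggestion step, one possible shape is sketched below. It is only an assumption about how suggestions might be ranked: it does not call any authority service, it takes a list of candidate (label, URI) pairs as already retrieved, and the string-similarity scoring from Python's difflib is our choice for the example. The names and example.org URIs are hypothetical. The archivist would still validate every suggested link.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def normalise(name: str) -> str:
    """Lower-case, drop punctuation and sort the parts of a name,
    so that 'Smith, Jane' and 'Jane Smith' compare as equal."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " "
                      for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def suggest_links(name: str,
                  candidates: List[Tuple[str, str]],
                  threshold: float = 0.8) -> List[Tuple[float, str, str]]:
    """Rank (label, uri) candidates by similarity to `name`, best first.

    Returns (score, label, uri) tuples scoring at or above `threshold`;
    nothing is asserted automatically.
    """
    target = normalise(name)
    scored = [
        (SequenceMatcher(None, target, normalise(label)).ratio(), label, uri)
        for label, uri in candidates
    ]
    return sorted((item for item in scored if item[0] >= threshold),
                  key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    candidates = [
        ("Smith, Jane", "http://example.org/authority/1"),
        ("Smythe, Joan", "http://example.org/authority/2"),
    ]
    for score, label, uri in suggest_links("Jane Smith", candidates, 0.5):
        print(f"{score:.2f}  {label}  {uri}")
```

Keeping the score alongside each suggestion means the interface could show the strongest matches first while leaving the final decision with the archivist.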

Bricolage: demonstrators

Friday, May 25th, 2012

At the Advisory Group meeting on 20 April, we discussed potential scope and focus of the two demonstrators that the project will develop. For the Geology Museum, we may want to focus on a demonstrator that links to promoting their work in schools, which could include a mapping feature. The Penguin Archive may want to consider a timeline demonstrator linked to a specific area of the Archive.

We looked at some examples to help refine thinking on demonstrators:

Examples of the use of a timeline:

Example of a geographical view:

The Advisory Group will finalise demonstrators to be developed at its meeting in May; the key will be in demonstrating how the use of Linked Data can enhance the collections, which may in turn encourage sustainability of the tools and processes used.