Progress against workplan

The Bricolage Advisory Group reviewed progress against the project workplan at its second meeting on 20 April 2012. We agreed to bring forward by a month each the two remaining AG meetings – to May and July – to ensure continuing review and steer towards the end of the project in August.

Work on the Linked Data: Hosting work package has started, with options such as Talis and data.bris being considered. The timing of the data.bris project may preclude its use, and it may not be the best place to host data from Bricolage in any case. It could, however, be used to create URIs and expose data in an external view. The options will be considered more fully at the meeting in May.

The Linked Data: Metadata review/export work package is progressing at different rates for the Penguin Archive and the Geology Museum (see separate blog posts on export work). In Earth Sciences, student effort is being used to move data from spreadsheet format into an online (Drupal) database. Some of the issues arising concern the formats used, including free text, and the need to restrict terminology. There is a huge amount of currently unstructured data. Review and export piloting will carry on for another 2-3 weeks. For the Penguin Archive, work has focused on trying to add authority terms, but this has proved extremely labour-intensive, more so than anticipated. Good authority data is needed for good Linked Data, and this needs to be taken into account when initially cataloguing collections. Legacy data without authority data continue to pose problems.

The entity extraction process poses the question ‘can we identify things in textual descriptions, linking unlinked data?’ Some online services can analyse text, eg DBpedia Spotlight, which offers entity detection and disambiguation services for constructing bespoke Named Entity Recognition solutions. The Women’s Library at London Met deals effectively with disambiguation of names. We are interested in running text through the process to see how useful and accurate it can be. Entity extraction is at the experimental end of our work, but linking data to an authority source, and the processes around this, are of interest.
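To make the idea concrete, here is a minimal sketch of linking text to an authority list by plain string matching. It is not how DBpedia Spotlight works and does no disambiguation; the authority labels, the example.org URIs and the function name `link_entities` are all assumptions made for illustration.

```python
import re

# Hypothetical authority list (label -> URI); the URIs are placeholders.
AUTHORITY = {
    "Allen Lane": "http://example.org/id/person/allen-lane",
    "Penguin Books": "http://example.org/id/org/penguin-books",
}


def link_entities(text, authority=AUTHORITY):
    """Return (label, uri, offset) for each authority label found in text."""
    matches = []
    for label, uri in authority.items():
        # Whole-word, case-insensitive match on the authority label.
        pattern = r"\b" + re.escape(label) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((label, uri, m.start()))
    # Report matches in the order they appear in the text.
    return sorted(matches, key=lambda item: item[2])


description = "Letter from Allen Lane to staff at Penguin Books."
for label, uri, offset in link_entities(description):
    print(offset, label, uri)
```

A lookup like this only finds names already in the authority list and cannot tell apart two people with the same name, which is where a disambiguation service would earn its keep.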

Export implementation should be complete by end of May.

Identifiers and Linking work package: Geology have thus far been creating internal links within their data. Work to link this data to other datasets has not yet started. Work on the Penguin Archive to date has highlighted a problem around the stability of URIs; sustainability is an issue for the future. For example, a person identified in the Penguin Archive data could have a unique ID in CALM but that identifier could easily break if CALM’s internal ID scheme is changed by the vendor. Alternative ID schemes that rely on the person’s name or their biographical dates also pose similar problems if, say, the person changes their name or their commonly accepted biographical dates change. We need persistent IDs (eg DOIs) in combination with a resolver service to map from persistent IDs to appropriate internal current IDs (eg CALM IDs).
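The resolver idea can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `Resolver`, its methods and the ID strings are hypothetical, and a real service would need persistent storage and an HTTP front end.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Resolver:
    """Maps a persistent ID to whichever internal ID is current."""

    _table: dict[str, str] = field(default_factory=dict)

    def register(self, persistent_id: str, internal_id: str) -> None:
        """Record a new persistent ID and the internal ID it points to."""
        self._table[persistent_id] = internal_id

    def repoint(self, persistent_id: str, new_internal_id: str) -> None:
        """Update the target when the internal scheme changes."""
        if persistent_id not in self._table:
            raise KeyError(f"unknown persistent ID: {persistent_id}")
        self._table[persistent_id] = new_internal_id

    def resolve(self, persistent_id: str) -> str:
        """Return the current internal ID for a persistent ID."""
        return self._table[persistent_id]


# Hypothetical IDs, for illustration only.
resolver = Resolver()
resolver.register("person/0001", "CALM-OLD-123")
resolver.repoint("person/0001", "CALM-NEW-456")  # vendor changes its scheme
print(resolver.resolve("person/0001"))
```

The published identifier never changes; only the table entry behind it does, so links held by other datasets keep working after an internal renumbering.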

On microdata, for Geology we’re looking to embed data coming out of the catalogue in the public site so that big search engines can find structured data. We’ll be looking at schema.org metadata as RDFa within the HTML of the public site.
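As a rough sketch of what this could look like, the function below renders one catalogue record as an HTML fragment carrying schema.org terms in RDFa Lite attributes. The record fields, the function name `specimen_to_rdfa` and the choice of the generic `Thing` type are assumptions; the real catalogue export and the most suitable schema.org type may well differ.

```python
from html import escape


def specimen_to_rdfa(record):
    """Render a catalogue record as HTML with schema.org RDFa attributes."""
    # Escape values so catalogue text cannot break the markup.
    name = escape(record["name"])
    description = escape(record["description"])
    url = escape(record["url"], quote=True)
    return (
        f'<div vocab="https://schema.org/" typeof="Thing" resource="{url}">\n'
        f'  <h2 property="name">{name}</h2>\n'
        f'  <p property="description">{description}</p>\n'
        f"</div>"
    )


# Hypothetical record, for illustration only.
print(specimen_to_rdfa({
    "name": "Ammonite",
    "description": "Fossil specimen from the teaching collection.",
    "url": "https://example.org/specimen/1",
}))
```

The `vocab`, `typeof`, `resource` and `property` attributes are what a search engine's parser reads; the visible page content stays the same.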

Sustainability of the tools and workflow developed during the project is important. The key is in developing a set of processes and tools for export and publication of Linked Data that are easy enough for archivists to use routinely. Questions arise about what is most useful for the long term and what is transferable.

The demonstrator work packages will begin at the end of May; evaluation and dissemination work packages will be discussed at the May Advisory Group meeting.
