Resource: Extracting Keywords from Crowdsourced Collections Project

Extracting Keywords from Crowdsourced Collections was a Digital Scholarship @ Oxford (DiSc) Research Development Grant-funded project based in the Faculty of English at the University of Oxford. Taking the Their Finest Hour Online Archive, a digital collection of 2,000+ records and 26,000+ files related to the Second World War, as its case study, the project set out to explore how Natural Language Processing (NLP) methods could be used to extract keywords from crowdsourced digital collection data.

Assigning appropriate keyword tags to digital collection records is a crucial step in supporting search and discovery, as well as adherence to FAIR data principles. Traditionally, this process has involved manually assigning keywords, often drawn from a pre-defined or inherited controlled vocabulary used within a particular institution. Manual keyword tagging can be resource-intensive, can misrepresent records or collections, and can perpetuate historic assumptions, biases and stereotypes associated with particular domains. While there have been efforts to democratise the creation of digital collections metadata, the primary assumption underpinning keyword tagging remains the same: individuals should select and then impose keywords on historical data.

This project sought to invert that assumption, and explore the extent to which NLP methods and tools can be used to allow collection records to generate their own keyword tags, and thus describe or ‘speak for’ themselves. This is particularly relevant to crowdsourced collections of personal histories, especially Second World War collections, at a time when representations of the past are being reshaped to serve political interests.
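To make the idea of records generating their own keywords concrete, the following is a minimal sketch using TF-IDF scoring, one common NLP approach to keyword extraction. This summary does not say which methods the project actually used, so TF-IDF is an illustrative assumption only. The function names, the short stopword list and the sample records are all hypothetical and are not taken from the Their Finest Hour Online Archive.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately short stopword list for illustration only.
STOPWORDS = {
    "the", "a", "an", "and", "of", "in", "to", "was", "as",
    "at", "on", "my", "he", "she", "his", "her", "during",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def extract_keywords(records, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each record."""
    tokenized = [tokenize(record) for record in records]
    n_docs = len(tokenized)
    # Document frequency: the number of records that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency times inverse document frequency.
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results


# Invented sample records, not real archive contributions.
records = [
    "My grandfather served as a radar operator in the RAF during the war.",
    "My grandmother worked in a munitions factory during the war.",
    "He was evacuated to Wales as a child during the war.",
]

keywords = extract_keywords(records)
for record, tags in zip(records, keywords):
    print(tags, "<-", record)
```

In this sketch the candidate keywords come from the wording of each record rather than from a controlled vocabulary chosen in advance. A term that appears in every record, such as "war" here, scores zero and is ranked last, while terms distinctive to a single record rise to the top.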
