Editor’s Choice: Fork, Merge and Crowd-Sourcing Data Curation

Over the past few weeks there has been a sudden increase in the amount of financial data on scholarly communications in the public domain. This was triggered in large part by the Wellcome Trust releasing data on the prices paid for Article Processing Charges by the institutions it funds. The release of this rather messy dataset was followed by a substantial effort to clean it up. This crowd-sourced data curation process has been described by Michelle Brook. Here I want to reflect on the tools that were available to us, and how they made some aspects of this collective data curation easy but other aspects quite hard.

The data started its life as a CSV file on Figshare. This is a very common starting point. I pulled that dataset and did some cleanup using OpenRefine, a tool I highly recommend as a starting point for any moderate to large dataset, particularly one that has been put together manually. I could use OpenRefine to quickly identify and correct variant publisher and journal name spellings, clean up some of the entries, and also find issues that looked like mistakes. It's a great tool for doing that initial cleanup, but it's a tool for a single user, so once I'd done that work I pushed my cleaned-up CSV file to GitHub so that others could work with it.
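For readers curious what that variant-spelling cleanup looks like under the hood, here is a minimal sketch of the key-collision ("fingerprint") clustering approach that OpenRefine uses to surface alternate spellings of the same name. The file name apcs.csv and the Publisher column header are stand-ins for illustration, not the actual layout of the Wellcome dataset.

```python
import csv
import re
from collections import defaultdict

def fingerprint(name: str) -> str:
    """Key-collision fingerprint, as in OpenRefine's clustering:
    lowercase, strip punctuation, then sort and deduplicate tokens."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(sorted(set(tokens)))

def cluster_variants(names):
    """Group spellings that share a fingerprint; only groups with
    more than one distinct spelling need a curator's attention."""
    clusters = defaultdict(set)
    for name in names:
        clusters[fingerprint(name)].add(name)
    return {key: spellings for key, spellings in clusters.items()
            if len(spellings) > 1}

if __name__ == "__main__":
    # "apcs.csv" and the "Publisher" header are hypothetical here.
    with open("apcs.csv", newline="", encoding="utf-8") as f:
        publishers = [row["Publisher"] for row in csv.DictReader(f)]
    for key, variants in cluster_variants(publishers).items():
        print(f"{key!r}: {sorted(variants)}")
```

Spellings that normalise to the same fingerprint (for example "Wiley-Blackwell" and "Wiley Blackwell") are flagged together, so a curator only has to choose a canonical form rather than hunt for variants by eye.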

Read the full post here.

This content was selected for Digital Humanities Now by Editor-in-Chief Benjamin Schneider based on nominations by Editors-at-Large: Anu Paul, Elizabeth Goins, Chiara Bernardi, James O'Sullivan, Beth Secrist, Amy Williams, Angela Galvan, Aisha Clarke, Sayema Rawof, Sarah Canfield Fuller, Andrew Hyde, Souvenise St. Louis, Kevin McQueeney, and Rebecca Nesvet.