Resource: Text Data from the Archive – Digital Humanities Now

During graduate school I visited my fair share of archives. Living on funds dispensed from the FAFSA gods in combination with whatever part-time job I had, I often found myself hard-pressed to pony up money for photocopies. Somewhere along the line I got smarter and started using a point-and-shoot camera to gather as much primary source mana as possible. Since then, smart folks have written great posts geared toward helping people engaged in similar work. Robin Camille Davis has a nice post on using an iPhone, a ruler, and a couple of stacks of books as an impromptu digitization station. Miriam Posner shows us how to batch process photos. The University of Illinois Libraries maintain a really useful guide on digital tools for archival research. Building on these foundations, I am going to describe how you can generate plain text data from images of archival documents. With plain text data in hand you’ll be well provisioned for engaging a number of different Digital Humanities methods. The focus here is on extracting plain text data from images of print-based archival content using optical character recognition (OCR). The result likely won’t be highly accurate, but I’d argue that “just good enough” is all you need to begin exploring your sources.
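To give a rough sense of what that image-to-text step can look like, here is a minimal sketch in Python. It makes assumptions this excerpt does not: that the open source Tesseract OCR engine is installed and on your PATH, that your archive photos are JPEGs in one folder, and that the text is in English. The folder names and the helper functions `ocr_command` and `ocr_folder` are hypothetical, chosen only for illustration, and the full post may well use different tools.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_command(image: Path, out_dir: Path, lang: str = "eng") -> list[str]:
    """Build the Tesseract call for one image.

    Tesseract takes an input image and an output *base* name, and
    appends ".txt" to that base itself.
    """
    return ["tesseract", str(image), str(out_dir / image.stem), "-l", lang]


def ocr_folder(image_dir: Path, out_dir: Path, lang: str = "eng") -> None:
    """Run OCR over every .jpg in image_dir, one .txt file per photo."""
    # Assumption: Tesseract is installed separately; fail early if not.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    for image in sorted(image_dir.glob("*.jpg")):
        subprocess.run(ocr_command(image, out_dir, lang), check=True)


if __name__ == "__main__":
    # Hypothetical folder names: photos in, plain text out.
    ocr_folder(Path("archive_photos"), Path("archive_text"))
```

Each photo ends up with a matching plain text file, errors and all, which is about where “just good enough” begins.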

Read the Full Post Here.