Editors’ Summary: This post from the Data-Sitters Club provides a helpful orientation to corpus building for newcomers to DH. It is part of their spinoff series, Data-Sitters Little TL;DR, which offers key ideas and takeaways for people interested in digital humanities. The post covers the legality of using text as data under fair use, the considerations involved in using a pre-built corpus, and best practices for OCR and metadata. When working with a pre-built corpus, the author emphasizes the need to consider not just what is included but also what is excluded: the gaps and silences found in archives also appear in online archives such as HathiTrust. This is a very useful pedagogical resource, and it references various other reports from the DSC.
Editors’ Choice: Forming Your Corpus