Editors' Choice: A Proposal for a Corpus Sharing Protocol

Digital humanists working in computational text analysis need a better way to share corpora. What follows is a rough sketch of a way to share texts that facilitates collaboration, provides for easy error correction, and adheres as much as possible to decentralized, open-source, and open-access models.

The problem of corpus availability is deep and pervasive. At last year’s Digital Humanities conference, I saw a fantastic presentation by the Stanford Literary Lab on the detection of poetic meter. Since they had analyzed a set of texts from the Literature Online (LION) database, I asked them how they were able to get permissions for that data set. They replied that they’d simply asked ProQuest for it. With this in mind, I contacted John Pegum at ProQuest, in the hopes of obtaining texts I might use for large-scale analyses. His reply was polite and thorough, but concluded that I very likely couldn’t afford to pay for this privilege. When I responded with a few other options—one of which was working directly on a server of their choosing, without making a copy of the text—I never received a reply.

Even some of the best text repositories, like the Oxford Text Archive, are designed in a way that makes them prohibitively difficult to use for macroanalysis. Many of the texts in the Oxford archive sit behind a permissions wall, seemingly by default. To gain access, a researcher must apply to Oxford and request each text individually. A scholar interested in analyzing hundreds of texts would therefore have to file hundreds of requests.

This problem was one of the major topics of discussion at Computer-Based Analysis of Drama, a workshop I attended last week in Munich. Many of the presenters used texts from the TextGrid repository, and it was suggested that this could be a platform for the sharing of corpora among researchers. Yet TextGrid is apparently losing funding soon, and might go offline. So how can we find a common platform for sharing texts, so that others might benefit from them?

Source: A Proposal for a Corpus Sharing Protocol