A vibrant discussion followed my March 15th post, “A Proposal for a Corpus Sharing Protocol.” Carrie Schroeder, Allen Riddel, and others on Twitter pointed out that, especially in non-English DH fields, many corpora are already on GitHub. These include texts from the Chinese Buddhist Electronic Text Association, the Open Greek and Latin Project at Leipzig, and papyri from the Integrating Digital Papyrology Project. The Text Creation Partnership released some 25,000 of its texts in January of this year and uploaded them to GitHub. One of the more interesting Git corpus projects I became aware of following this discussion is GITenberg. Led by Seth Woodworth, the project scrapes a text from Project Gutenberg, initializes a git repository for it, adds README and CONTRIBUTING files generated from the text’s metadata, and uploads the resulting repository to GitHub. They have gitified around 43,000 works this way. The project also converts Project Gutenberg vanilla plain text into AsciiDoc; a good example of this is the GITenberg edition of The Adventures of Huckleberry Finn. This is an amazingly ambitious project that holds the promise of wide-ranging applications for editing, versioning, and disseminating literature.
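To make the GITenberg-style workflow described above more concrete, here is a minimal Python sketch of the local steps: writing a README and a CONTRIBUTING file from a text's metadata, then initializing a git repository around them. This is not GITenberg's actual code. The function names, file names, metadata keys, and the committer identity are all assumptions made for illustration, and the scraping and GitHub upload steps are left out.

```python
import subprocess
from pathlib import Path


def render_readme(metadata):
    """Build README text from a metadata dict (the keys are hypothetical)."""
    title = metadata.get("title", "Untitled")
    lines = [f"# {title}", ""]
    for key in sorted(metadata):
        if key != "title":
            lines.append(f"- {key}: {metadata[key]}")
    return "\n".join(lines) + "\n"


def write_repo_files(dest, metadata, text):
    """Write the text plus generated README and CONTRIBUTING files into dest."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    (dest / "README.md").write_text(render_readme(metadata), encoding="utf-8")
    (dest / "CONTRIBUTING.md").write_text(
        "Corrections are welcome as pull requests.\n", encoding="utf-8"
    )
    (dest / "text.txt").write_text(text, encoding="utf-8")
    return sorted(p.name for p in dest.iterdir())


def git_init_and_commit(dest):
    """Turn dest into a git repository with one commit (requires the git CLI)."""
    subprocess.run(["git", "init"], cwd=dest, check=True)
    subprocess.run(["git", "add", "."], cwd=dest, check=True)
    subprocess.run(
        [
            "git",
            "-c", "user.name=Corpus Bot",          # placeholder identity
            "-c", "user.email=bot@example.org",    # placeholder identity
            "commit", "-m", "Initial import",
        ],
        cwd=dest,
        check=True,
    )
```

Calling `write_repo_files("huck-finn", {"title": "...", "author": "..."}, text)` followed by `git_init_and_commit("huck-finn")` would leave a one-commit repository that is ready to be pushed to a remote.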
One such application might lie with the 68,000 digital texts recently created by the British Library. James Baker, a digital curator at the British Library, left a comment on my original post, suggesting that the method I describe might be used to parse and post the Library’s texts. He sent me a few samples of the ALTO XML documents that the Stanford Literary Lab had used. I adapted some of the GITenberg code to read these texts, generate README files for them, and turn them into GitHub repositories. I’m provisionally calling this project Git-Lit.
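As a rough illustration of what reading an ALTO XML document involves, the sketch below pulls plain text out of ALTO markup with Python's standard library. It is not the Git-Lit code. It assumes the usual ALTO layout, in which each `TextLine` element contains `String` elements whose `CONTENT` attribute holds one word, and it ignores namespaces because they differ between ALTO versions. The sample document is made up for the example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://...}TextLine' down to 'TextLine'."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_string):
    """Return the words of an ALTO document, one output line per TextLine."""
    root = ET.fromstring(xml_string)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            # Each String child carries one word in its CONTENT attribute.
            words = [
                child.attrib.get("CONTENT", "")
                for child in elem
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(w for w in words if w))
    return "\n".join(lines)


# A tiny invented sample, not one of the British Library files.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="An"/><SP/><String CONTENT="example"/></TextLine>
    <TextLine><String CONTENT="second"/><SP/><String CONTENT="line"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_to_text(SAMPLE))
```

The extracted text could then be written into a repository alongside a README generated from the record's metadata.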
Read More: INTRODUCING GIT-LIT | JONATHAN REEVE