Editors’ Choice: An Introduction to the Textreuse Package

A number of problems in digital history/humanities require one to calculate the similarity of documents or to identify how one text borrows from another. To give one example, the Viral Texts project, by Ryan Cordell, David Smith, et al., has been very successful at identifying reprinted articles in American newspapers. Kellen Funk and I have been working on a text reuse problem in nineteenth-century legal history, where we seek to track how codes of civil procedure were borrowed and modified in jurisdictions across the United States.

As part of that project, I have recently released the textreuse package for R to CRAN. (Thanks to Noam Ross for giving this package a very thorough open peer review for rOpenSci, to which I’ve contributed the package.) This package is a general-purpose implementation of several algorithms for detecting text reuse, along with classes and functions for investigating a corpus of texts. Put most simply, full text goes in and measures of similarity come out.
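
To make "full text goes in and measures of similarity come out" concrete, here is a minimal sketch in Python. It is not the textreuse API, and this excerpt does not name the algorithms the package implements. The choice of Jaccard similarity over word trigrams is an assumption made for illustration, as are the helper names and the sample sentences.

```python
import re


def word_ngrams(text, n=3):
    """Lowercase the text, split it into words, and return the set of word n-grams."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(a, b):
    """Shared n-grams divided by all distinct n-grams; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical sample sentences, invented for this illustration only.
original = "Every action must be prosecuted in the name of the real party in interest"
borrowed = "Every action shall be prosecuted in the name of the real party in interest"
unrelated = "The clerk shall keep a record of all judgments entered by the court"

score_borrowed = jaccard_similarity(word_ngrams(original), word_ngrams(borrowed))
score_unrelated = jaccard_similarity(word_ngrams(original), word_ngrams(unrelated))

print("original vs. borrowed: ", score_borrowed)
print("original vs. unrelated:", score_unrelated)
```

In this sketch a one-word change still leaves most trigrams shared, so the borrowed sentence scores far higher than the unrelated one. That gap is the kind of signal a text reuse measure is meant to surface across a whole corpus.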


Read full post here.

This content was selected for Digital Humanities Now by Editor-in-Chief Amanda Regan based on nominations by Editors-at-Large: Antonio Jimenez-Munoz, Jan Lampaert, Shayda Schilleman, Rachelle Barlow, Rebecca Napolitano, Maribel Hidalgo-Urbaneja, Covadonga Lamar, John Matson, Amanda Fencl, Dene Grigar, Vanessa Stone, and Chelsea Gunn.