By Ryan Heuser and Long Le-Khac, Fred Gibbs and Dan Cohen | June 5, 2012
Editors’ Note: Two new publications using quantitative methods to study the literary and intellectual history of nineteenth-century Britain have been released: the first by Ryan Heuser and Long Le-Khac of the Stanford Literary Lab, and the second by Dan Cohen and Fred Gibbs. Excerpts and links to the original texts are included below.
A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method
by Ryan Heuser and Long Le-Khac, Stanford Literary Lab.
The nineteenth century in Britain saw tumultuous changes that reshaped the fabric of society and altered the course of modernization. It also saw the rise of the novel to the height of its cultural power as the most important literary form of the period. This paper reports on a long-term experiment in tracing such macroscopic changes in the novel during this crucial period. Specifically, we present findings on two interrelated transformations in novelistic language that reveal a systemic concretization in language and fundamental change in the social spaces of the novel. We show how these shifts have consequences for setting, characterization, and narration as well as implications for the responsiveness of the novel to the dramatic changes in British society.
This paper has a second strand as well. This project was simultaneously an experiment in developing quantitative and computational methods for tracing changes in literary language. We wanted to see how far quantifiable features such as word usage could be pushed toward the investigation of literary history. Could we leverage quantitative methods in ways that respect the nuance and complexity we value in the humanities? To this end, we present a second set of results, the techniques and methodological lessons gained in the course of designing and running this project.
A Conversation with Data: Prospecting Victorian Words and Ideas
by Fred Gibbs and Dan Cohen.
“Literature is an artificial universe,” author Kathryn Schulz recently declared in the New York Times Book Review, “and the written word, unlike the natural world, can’t be counted on to obey a set of laws” (Schulz). Schulz was criticizing the value of Franco Moretti’s “distant reading,” although her critique seemed more like a broadside against “culturomics,” the aggressively quantitative approach to studying culture (Michel et al.). Culturomics was coined with a nod to the data-intensive field of genomics, which studies complex biological systems using computational models rather than the more analog, descriptive models of a prior era. Schulz is far from alone in worrying about the reductionism that digital methods entail, and her negative view of the attempt to find meaningful patterns in the combined, processed text of millions of books likely predominates in the humanities.
Historians largely share this skepticism toward what many of them view as superficial approaches that focus on word units in the same way that bioinformatics focuses on DNA sequences. Many of our colleagues question the validity of text mining because they have generally found meaning in a much wider variety of cultural artifacts than just text, and, like most literary scholars, consider words themselves to be context-dependent and frequently ambiguous. Although occasionally intrigued by it, most historians have taken issue with Google’s Ngram Viewer, the search company’s tool for charting the frequency of n-grams, or word units, across its scanned books. Michael O’Malley, for example, laments that “Google ignores morphology: it ignores the meanings of words themselves when it searches…[The] Ngram Viewer reflects this disinterest in meaning. It disaggregates words, takes them entirely out of context and completely ignores their meaning…something that’s offensive to the practice of history, which depends on the meaning of words in historical context” (O’Malley).
Such heated rhetoric—probably inflamed in the humanities by the overwhelming and largely positive attention that culturomics has received in the scientific and popular press—unfortunately has forged in many scholars’ minds a cleft between our beloved, traditional close reading and untested, computer-enhanced distant reading. But what if we could move seamlessly between traditional and computational methods as demanded by our research interests and the evidence available to us?
In the course of several research projects exploring the use of text mining in history, we have come to the conclusion that it is both possible and profitable to move between these supposed methodological poles. Indeed, we have found that the most productive and thorough way to do research, given the recent availability of large archival corpora, is to have a conversation with the data in the same way that we have traditionally conversed with literature—by asking it questions, questioning what the data reflects back, and combining digital results with other evidence acquired through less technical means.
We provide here several brief examples of this combined approach, which uses both textual work and technical tools. Each example shows how the technology can help flesh out prior historiography as well as provide new perspectives that advance historical interpretation. In each experiment we have tried to move beyond the more simplistic methods made available by Google’s Ngram Viewer, which traces the frequency of words in print over time with little context, transparency, or opportunity for interaction.
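To make concrete the kind of measure the Ngram Viewer exposes, the following is a minimal sketch (in Python, not the authors’ own tooling) of how a per-year relative word-frequency series can be computed from a small collection of dated texts; the corpus and query word here are hypothetical placeholders.

```python
# Minimal sketch of an Ngram-Viewer-style frequency series: for each year,
# the share of all tokens that match a single query word. Illustrative only;
# the corpus and query word below are hypothetical.
from collections import Counter, defaultdict

def yearly_relative_frequency(corpus, target):
    """corpus: iterable of (year, text) pairs; target: a lowercase word.
    Returns {year: occurrences of target / total tokens that year}."""
    counts = defaultdict(Counter)
    for year, text in corpus:
        counts[year].update(text.lower().split())
    series = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        series[year] = counts[year][target] / total if total else 0.0
    return series

# Hypothetical two-document corpus: the relative frequency of "hard" per year.
docs = [
    (1800, "the heart was hard and the road home was long"),
    (1850, "hard times and hard labour in the hard new city"),
]
print(yearly_relative_frequency(docs, "hard"))
```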