I am currently teaching a graduate course (eng630: “Digital Humanities”: Emerging Tools and Debates in Literary Study) and, as much as possible, I’m trying to make clear the mechanics behind some of the text analysis in the works we’re reading. So this week, as I prepared to discuss Stephen Ramsay’s Reading Machines, I wanted to reproduce some of the analysis done there. The first chapter, for instance, offers a tf-idf reading of Woolf’s The Waves. Here is how Ramsay describes it:
It is possible—and indeed an easy matter—to use a computer to transform Woolf’s novel into lists of tokens in which each list represents the words spoken by the characters ordered from most distinctive to least distinctive term. Tf-idf, one of the classic formulas from the field of information retrieval, endeavours to generate lists of distinctive terms for each document in a corpus. We might therefore conceive of Woolf’s novel as a ‘corpus’ of separate documents (each speaker’s monologue representing a separate document), and use the formula to factor the presence of a word in a particular speaker’s vocabulary against the presence of the word in other speakers’ vocabularies. (11)
This post summarizes how I tried to do just that, and the different results I got. I’m not sure what accounts for the differences from Ramsay’s (and Sara Steger’s) results; I’ll try to show you what I mean below. In a future post I’ll use the same “method” on a different text (spoiler: it’s Ulysses).
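To make the idea in Ramsay's description concrete, here is a minimal Python sketch of tf-idf over a "corpus" of speaker documents. Everything in it is an assumption for illustration: the function name `tf_idf` is hypothetical, the three toy "monologues" are invented stand-ins (not Woolf's text), and the weighting used (raw term count times log(N/df)) is only one common variant, not necessarily the formula Ramsay used.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score terms per speaker. `docs` maps speaker name -> list of tokens.

    Assumed variant: score = raw count in the speaker's document
    multiplied by log(N / df), where N is the number of speakers and
    df is the number of speakers who use the term at least once.
    """
    n_docs = len(docs)

    # Document frequency: how many speakers use each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for speaker, tokens in docs.items():
        tf = Counter(tokens)
        scores[speaker] = {
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return scores

# Invented toy monologues, one per hypothetical speaker.
docs = {
    "A": "the sea the waves the shore".split(),
    "B": "the garden the leaves the waves".split(),
    "C": "the city the street the garden".split(),
}

scores = tf_idf(docs)

# Order each speaker's vocabulary from most to least distinctive,
# breaking ties alphabetically so the ordering is deterministic.
ranked = {
    speaker: sorted(s, key=lambda term: (-s[term], term))
    for speaker, s in scores.items()
}
```

In this toy case a word every speaker uses ("the") scores zero, while words only one speaker uses rise to the top of that speaker's list. Different choices at this step (normalizing counts by document length, smoothing the idf term, using a different logarithm base) change the scores and can change the rankings.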