Editors' Choice: Where to start with text mining

This post is less a coherent argument than an outline of discussion topics I’m proposing for a workshop at NASSR2012 (a conference of Romanticists). But I’m putting this on the blog since some of the links might be useful for a broader audience. Also, we won’t really cover all this material, so the blog post may give workshop participants a chance to explore things I only gestured at in person.

In the morning I’ll give a few examples of concrete literary results produced by text mining. I’ll start the afternoon workshop by opening two questions for discussion: first, what are the obstacles confronting a literary scholar who might want to experiment with quantitative methods? Second, how do those methods actually work, and what are their limits?

I’ll also invite participants to play around with a collection of 818 works between 1780 and 1859, using an R program I’ve provided for the occasion. Links for these materials are at the end of this post.

I. HOW DIFFICULT IS IT TO GET STARTED?
There are two kinds of obstacles: getting the data you need, and getting the digital skills you need.

1. Is it really necessary to have a large collection of texts?
This is up for debate. But I tend to think the answer is “yes.”

Not because bigger is better, or because “distant reading” is the new hotness. It’s still true that a single passage, perceptively interpreted, may tell us more than a thousand volumes.

But if you want to interpret a single passage, you fortunately already have a wrinkled protein sponge that will do a better job than any computer. Quantitative analysis starts to make things easier only when we start working on a scale where it’s impossible for a human reader to hold everything in memory. Your mileage may vary, but I’d say, more than ten books?

And actually, you need a larger collection than that, because quantitative analysis tends to require context before it becomes meaningful. It doesn’t mean much to say that the word “motion” is common in Wordsworth, for instance, until we know whether “motion” is more common in his works than in other nineteenth-century poets. So yes, text-mining can provide clues that lead to real insights about a single author or text. But it’s likely that you’ll need a collection of several hundred volumes, for comparison, before those clues become legible.

Words that are consistently more common in works by William Wordsworth than in other poets from 1780 to 1850. I’ve used Wordle’s graphics, but the words have been selected by a Mann-Whitney test, which measures overrepresentation relative to a context — not by Wordle’s own (context-free) method. See the R script at the end of this post.

This isn’t to deny that there are interesting things that can be done digitally with a single text: digital editing, building timelines and maps, and so on. I just doubt that quantitative analysis adds much value at that scale. (And to give credit where it’s due: Mark Olsen was saying all this back in the 90s — see References.)

2. So, where do I get all those texts?
That’s what I was asking myself 18 months ago. A lot of excitement about digital humanities is premised on the notion that we already have large collections of digitized sources waiting to be used. But it’s not true, because page images are not the same thing as clean, machine-readable text.

If you’re interested in twentieth-century secondary sources, the JSTOR Data for Research API can probably get you what you need. Primary sources are a harder problem. In our own (Romantic) era, optical character recognition (OCR) is unreliable. The ratio of words transcribed accurately ranges from around 80% to around 98%, depending on print quality and typographical quirks like the notorious “long s.” For a lot of text-mining purposes, 95% might be fine, if the errors were randomly distributed. But they’re not random: errors cluster in certain words and periods.

What you see in a page image.

The problem can be addressed in several different ways. There are a few collections (like ECCO-TCP and the Brown Women Writers Project) that transcribe text manually. That’s an ideal solution, but coverage of that kind is stronger in the eighteenth than the nineteenth century.

What you may get as OCR.

So Jordan Sellers and I have supplemented those collections by automatically correcting 19c OCR that we got from the Internet Archive. Our strategy involved statistically cautious, period-specific spellchecking, combined with enough reasoning about context to realize that “mortal fin” is probably “mortal sin,” even though “fin” is a correctly spelled word. It’s not a perfect solution, but in our period it works well enough for text-mining purposes. We have corrected about 2,000 volumes this way, and are happy to share our texts and metadata, as well as the spellchecker itself (once I get it packaged well enough to distribute). I can give you either a zip file containing the 19c texts themselves, or a tab-separated file containing docIDs, words, and word counts for the whole collection. In either scheme, the docIDs are keyed to this metadata file.

Of course, selecting titles for a collection like this raises intractable questions about representativeness. We tried to maximize diversity while also selecting volumes that seemed to have reached a significant audience. But other scholars may have other priorities. I don’t think it would be useful to seek a single right answer about representativeness; instead, I’d like to see multiple scholars building different kinds of collections, making them all public, and building on each other’s work. Then we would be able to test a hypothesis against multiple collections, and see whether the obvious caveats about representativeness actually make a difference in any given instance.

3. Is it necessary to learn how to program?
I’m not going to try to answer that question, because it’s complex and better addressed through discussion.

I will tell a brief story. I went into this gig thinking that I wouldn’t have to do my own programming, since there were already public toolsets for text-mining (Voyant, MONK, MALLET, TAPoR, SEASR) and for visualization (Gephi). I figured I would just use those.

But I rapidly learned otherwise. Tools like MONK and Voyant taught me what was possible, but they weren’t well adapted for managing a very large collection of texts, and didn’t permit me to make my own methodological innovations. When you start trying to do either of those things, you rapidly need “nonstandard parts,” which means that someone in the team has to be able to program.

That doesn’t have to be a daunting prospect, because the programming involved is of a relatively forgiving sort. It’s not easy, but it’s also not professional software development. So if you want to do it yourself, that’s a plausible aspiration. Alternately, if you want to collaborate with someone, you don’t necessarily need to find “a computer scientist.” A graduate student or fellow humanist who can program will do just fine.

If you do want to learn to program, I would recommend starting with either Python or R.Of the two languages, Python is certainly easier. It’s intuitive, and well-documented, and great for working with text. If you expect to use existing tools (like MALLET), and just need some “glue” to connect them to each other, Python is probably the way to go. R is a more specialized and less intuitive language. But it happens to be specialized in some ways that are useful for text mining. In particular, it has built-in statistical functions, and a built-in plotting/graphing capacity. I’ve used it for the sample exercise that accompanies this post. But if you’re learning to program for the first time, Python might be a better all-around choice, and you could in principle extend it to do everything R does.

…

Read Full Post Here