Editors’ Choice: Keeping the words in Topic Models

Following up on my previous topic modeling post, I want to talk about one thing humanists actually do with topic models once they build them, most of the time: chart the topics over time. Since I think that, although topic modeling can be very useful, there’s too little skepticism about the technique, I’m venturing to provide it (even with, I’m sure, a gross misunderstanding or two). More generally, the sort of mistakes temporal changes cause should call into question the complacency with which humanists tend to treat ‘topics’ in topic modeling as stable abstractions, and argue for much greater attention to the granular words that make up a topic model.

Humanists seem to want to do different things with topic models than computer scientists do. David Blei’s group at Princeton (David Mimno aside) most often seems to push LDA (I’m using topic modeling and LDA interchangeably again) as an advance in information retrieval: making large collections of text browsable by giving useful tags to the documents. When someone gives you 100,000 documents, you can ‘read’ the topic headings first, and then only read the articles in the topics that interest you.

Probably there are people using LDA for this sort of thing. I haven’t seen it much in practice, though: it just isn’t very interesting* to talk about. And while this power of LDA is great for some institutions, it’s not a huge selling point for the individual researcher: it’s a lot of effort for something that produces almost exactly the same outcome as iterative keyword searching. Basically, you figure out what you’re interested in and read the top documents in that field. If discovery is the goal, humanists would probably be better off trying to get more flexible search engines than more machinely learned ones.

*I spent a while spinning out a post in response to Trevor Owens’ post about the binary of “justification” and “discovery,” arguing that really only justification matters, but I couldn’t get it to cohere; obviously discovery matters in some way. That post of his is ironclad. So I’ll just say here that I think conversations which are purely about discovery methods are rare, and usually uninteresting; when scholars make public avowals of their discovery methodology, they frequently do it in part as evidence for the quality of their conclusions. Even if they say they aren’t. Anyhow.

So instead of building a browser for their topics, humanists like to take some or all of the topics and plot their relative occurrence over time. I could come up with at least a dozen examples: in DH, one of the highest-profile efforts of this kind is Rob Nelson’s Mining the Dispatch. On the front page is this plot, which assigns labels to two topics and uses them to show the rates of two different types of advertising.

[Figure: topic frequency plot from Rob Nelson’s Mining the Dispatch, showing rates of two types of advertising over time]
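(Not Nelson’s actual pipeline, just a minimal sketch of the kind of arithmetic behind a chart like this, assuming you already have a table of per-document topic proportions joined to a publication year; the file name, column names, topic numbers, and labels below are all placeholders.)

```python
# Minimal sketch of a "topic share over time" chart, not Nelson's pipeline.
# Assumes a tab-separated file (hypothetical name and layout) with one row
# per document: a "year" column plus one proportion column per topic, e.g.
# built by joining MALLET's --output-doc-topics output to your own metadata.
import pandas as pd
import matplotlib.pyplot as plt

doc_topics = pd.read_csv("doc_topics_with_years.tsv", sep="\t")

# Average each topic's proportion across all documents from the same year.
topic_cols = [c for c in doc_topics.columns if c.startswith("topic_")]
yearly = doc_topics.groupby("year")[topic_cols].mean()

# Plot a couple of topics of interest; the indices and labels are invented.
for col, label in [("topic_12", "advertising topic A"),
                   ("topic_45", "advertising topic B")]:
    plt.plot(yearly.index, yearly[col], label=label)

plt.xlabel("year")
plt.ylabel("mean topic proportion")
plt.legend()
plt.show()
```

The chart itself is just a groupby and a mean; everything interesting, and everything that can go wrong, is hidden inside those topic columns.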

There’s an obvious affinity between plotting topic frequencies and plotting word frequencies, something dear to my heart. The most widely used line charts of this sort are Google Ngrams charts. (The first time I myself read up on topic modeling was after seeing it referenced in the comments to Dan Cohen’s first post about Google Ngrams.) Bookworm is obviously similar to Ngrams: it’s designed to keep the Ngrams strategy of investigating trends through words, but also to foreground the individual texts that underlie the patterns: it makes it natural to investigate the history of books using words, as well as the history of words using books.
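For comparison, the word-counting version of such a chart is almost embarrassingly simple. Here is a toy sketch; the three-row DataFrame and the word ‘istanbul’ are illustrations only, and a real Ngrams or Bookworm pipeline obviously works at vastly larger scale:

```python
# Toy sketch of an Ngrams/Bookworm-style relative-frequency line for one word.
# The tiny DataFrame is an illustration only; a real corpus has millions of
# tokens per year.
import pandas as pd
import matplotlib.pyplot as plt

texts = pd.DataFrame({
    "year": [1928, 1930, 1932],
    "text": ["Constantinople sits on the Bosphorus",
             "Istanbul, not Constantinople",
             "Istanbul keeps growing"],
})

def relative_frequency(frame: pd.DataFrame, word: str) -> pd.Series:
    tokens = frame["text"].str.lower().str.findall(r"[a-z']+")
    hits = tokens.apply(lambda toks: toks.count(word))    # occurrences per text
    totals = tokens.str.len()                             # tokens per text
    by_year = pd.DataFrame({"hits": hits, "total": totals, "year": frame["year"]})
    grouped = by_year.groupby("year").sum()
    return grouped["hits"] / grouped["total"]              # share of all tokens

relative_frequency(texts, "istanbul").plot(marker="o")
plt.ylabel("relative frequency of 'istanbul'")
plt.show()
```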

Bookworm/Ngrams-type graphs and these topic-model graphs promote pretty much the same type of reflection, and share many of the same pitfalls. But one of the reasons I like the Ngrams-style approach better is that it wears its weaknesses on its sleeve. Weaknesses like these: vocabulary changes over time; individual words don’t necessarily capture the full breadth of something like “Western Marxism”; any word can have multiple meanings; and an individual word is much rarer than the concept it stands in for.

Topic modeling seems like an appealing way to fix just these problems, by producing statistical aggregates that map the history of ideas better than any word could. Instead of dividing texts into 200,000 (or so) words, it divides them into 200-or-so topics that should be nearly as easy to cognize, but that will be much more heavily populated; the topics should map onto concepts better than words; and they avoid the ambiguity of a word like “bank” (riverbank? Bank of England?) by splitting it into different bins based on context.
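To make that concrete, here is a toy sketch using the gensim library (not MALLET, and not anyone’s real corpus): with luck, ‘bank’ ends up weighted in different topics depending on whether it co-occurs with river words or money words. On four tiny documents the separation isn’t guaranteed; the sketch is only meant to show the shape of the workflow.

```python
# Toy LDA sketch with gensim (not MALLET); the corpus and settings are
# illustrative only, and with four tiny documents the split isn't guaranteed.
from gensim import corpora, models

docs = [
    "the river bank was muddy after the spring flood",
    "fish gathered near the bank of the river",
    "the bank raised the interest rate on the loan",
    "she deposited money at the bank downtown",
]
tokenized = [d.split() for d in docs]

dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# Two topics stands in for the 200-or-so a real humanities model might use.
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2,
                      passes=100, random_state=1)

# With luck, "bank" carries weight in both topics, pulled toward river words
# in one and money words in the other.
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=6))
```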

So that’s the upside. What’s the downside? First, as I said last time, the model can go wrong in ways that the standard diagnostics I see humanists applying won’t catch. (Dave Mimno points out that MALLET’s diagnostics package can catch things like this, which I believe; but I’m not clear that even the humanists using topic modeling are spending much time with these.) Each individual model thus takes some serious work to get one’s head around. Second, even if the model works, it’s no longer possible to judge the results without investment in the statistical techniques. If I use Ngrams to argue that Ataturk’s policies propelled the little city of Istanbul out of obscurity around 1930, anyone can explain why I’m an idiot. If I show a topic model I created, on the other hand, I’ll have a whole trove of explanations at hand for why it doesn’t have the problems you see.
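For the curious, here is a sketch of what looking at those diagnostics might involve, assuming MALLET is installed and on your PATH, that you have already built a hypothetical texts.mallet instance file, and that your MALLET version supports the XML diagnostics output; since the reported measurements vary by version, the sketch just prints whatever it finds.

```python
# Sketch: ask MALLET for its per-topic diagnostics, then skim them.
# Assumes `mallet` is on the PATH, that "texts.mallet" is an instance file
# you have already built (e.g. with `mallet import-dir`), and that your
# MALLET version supports --diagnostics-file. File names are placeholders.
import subprocess
import xml.etree.ElementTree as ET

subprocess.run([
    "mallet", "train-topics",
    "--input", "texts.mallet",
    "--num-topics", "100",
    "--diagnostics-file", "diagnostics.xml",
], check=True)

# Print whatever per-topic measurements this version reports (coherence,
# exclusivity, and so on); attribute names vary, so don't hard-code them.
root = ET.parse("diagnostics.xml").getroot()
for topic in root.findall("topic"):
    print(topic.attrib)
```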

Read Full Post Here

This content was selected for Digital Humanities Now by the Editor-in-Chief, based on nominations by Editors-at-Large.