Editors' Choice: Using pyLDAvis with Mallet

One useful library for viewing a topic model is LDAvis, an R package for creating interactive web visualizations of topic models, and its Python port, pyLDAvis. The library focuses on visualizing a topic model, using PCA to chart the relationships between topics and between topics and words in the model. It is also agnostic about which library you use to create the topic model, so long as you extract the necessary data in the correct formats.

While the Python version of the library works very smoothly with Gensim, which I have discussed before, there is little documentation on how to move from a topic model created with MALLET to data that the LDAvis library can process. For reasons that require their own blog post, I have shifted from using Gensim for my topic model to using MALLET (spoilers: better documentation of output formats, and more widespread use in the humanities, so better documentation and code examples generally). But I still wanted to use this library to visualize the full model as a way of generating an overall view of the relationships among the 250 topics it contains.

The documentation for both LDAvis and pyLDAvis relies primarily on code examples to demonstrate how to use the libraries. My primary sources were a Python example and two R examples, one focused on manipulating the model data and one on the full model-to-visualization process. The “details” documentation for the R library also proved key for troubleshooting when the outputs did not match my expectations. (Pro tip: word order matters.)

Looking at the examples, the data required for the visualization library are:

topic-term distributions (matrix, phi)
document-topic distributions (matrix, theta)
document lengths (number vector)
vocab (character vector)
term frequencies (number vector)
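
To make the five inputs concrete, here is a minimal sketch in plain Python. The toy values and the `check_inputs` helper are hypothetical, invented for illustration and not taken from either library; the helper only confirms that the shapes of the five pieces agree with one another.

```python
import math

# Toy model: 2 topics, 3 vocabulary terms, 2 documents (all values made up).
vocab = ["whale", "ship", "sea"]
term_frequency = [5, 3, 2]   # corpus-wide count of each term, in vocab order
doc_lengths = [6, 4]         # token count of each document

# phi: one row per topic, one column per vocabulary term.
phi = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
# theta: one row per document, one column per topic.
theta = [
    [0.9, 0.1],
    [0.2, 0.8],
]


def check_inputs(phi, theta, doc_lengths, vocab, term_frequency):
    """Return a list of problems found; an empty list means the shapes agree."""
    problems = []
    n_topics = len(phi)
    if any(len(row) != len(vocab) for row in phi):
        problems.append("phi needs one column per vocab term")
    if len(term_frequency) != len(vocab):
        problems.append("term_frequency needs one entry per vocab term")
    if any(len(row) != n_topics for row in theta):
        problems.append("theta needs one column per topic")
    if len(doc_lengths) != len(theta):
        problems.append("doc_lengths needs one entry per document")
    # Each row of phi and theta is a probability distribution.
    for name, matrix in (("phi", phi), ("theta", theta)):
        if not all(math.isclose(sum(row), 1.0, abs_tol=1e-6) for row in matrix):
            problems.append(f"rows of {name} should each sum to 1")
    return problems


problems = check_inputs(phi, theta, doc_lengths, vocab, term_frequency)
```

With pyLDAvis installed, data shaped like this would be passed along the lines of `pyLDAvis.prepare(phi, theta, doc_lengths, vocab, term_frequency)`. Note that a check like this catches mismatched sizes only; it cannot tell whether two vectors of the right length are in the same order.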

One challenge is that the order of the data needs to be managed: the term columns in phi, the topic-term matrix, must be in the same order as the vocab vector, which must be in the same order as the frequencies vector, and the document index of theta, the document-topic matrix, must be in the same order as the document lengths vector.
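
One way to keep these orders in sync, sketched here under assumptions, is to fix a single vocabulary order and a single document order, then derive every vector and matrix from them. The dictionaries below are hypothetical stand-ins for values parsed from a model's output files; the names and numbers are made up for illustration.

```python
# Hypothetical inputs that arrive in differing orders (all values made up).
# Topic-term weights keyed by word, one dict per topic.
topic_word_weights = [
    {"sea": 0.1, "whale": 0.7, "ship": 0.2},
    {"ship": 0.3, "sea": 0.6, "whale": 0.1},
]
word_counts = {"ship": 3, "whale": 5, "sea": 2}

# Fix one vocabulary order, then build the term-indexed data from it.
vocab = sorted(word_counts)
term_frequency = [word_counts[word] for word in vocab]
phi = [[weights[word] for word in vocab] for weights in topic_word_weights]

# Same idea for documents: fix one document order, then index by it.
doc_topics = {"doc_b": [0.2, 0.8], "doc_a": [0.9, 0.1]}
doc_token_counts = {"doc_a": 6, "doc_b": 4}
doc_ids = sorted(doc_topics)
theta = [doc_topics[doc_id] for doc_id in doc_ids]
doc_lengths = [doc_token_counts[doc_id] for doc_id in doc_ids]
```

Alphabetical sorting is an arbitrary choice here; any order works as long as the columns of phi, the vocab vector, and the frequencies vector share it, and the rows of theta and the document lengths share theirs.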

Read the full post here.