Editors’ Choice: How do we Model Stereotypes without Stereotyping (Again)?

In a previous post, we explored how language models and the idea of “perplexity” allow us to study stereotypes in movie character roles, with the characters’ dialogue as a basis. We examined a corpus of 750 Hollywood films, released between 1970 and 2014, and tried to model assumptions from the research that people of colour are more often criminalized or depicted in criminal roles than white actors.

In this post, we want to discuss how entropy, and information theory more broadly, can also be a useful approach to this kind of research. The quantity underlying entropy, surprisal, is a measure of how “surprising” an event is (i.e., how much “information” it carries), based on the probability of that event occurring: the less probable, the more surprising. In the previous post, we used a crime language model, built from crime TV shows, to approximate film character dialogue (not limited to any genre). A perplexity score, measuring how surprising the new dialogue was to the model, told us how different the dialogue was from the model.
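
To make the relationship between probability, surprisal, and perplexity concrete, here is a minimal sketch in Python. It is not the code used in the previous post; the function names are hypothetical, and it assumes that some language model has already assigned a probability to each token of the dialogue.

```python
import math

def surprisal(p: float) -> float:
    """Surprisal, in bits, of an event with probability p (0 < p <= 1)."""
    return -math.log2(p)

def perplexity(token_probs) -> float:
    """Perplexity of a sequence, given the probability a model
    assigned to each of its tokens: 2 ** (mean surprisal in bits)."""
    mean_surprisal = sum(surprisal(p) for p in token_probs) / len(token_probs)
    return 2 ** mean_surprisal

# Illustrative probabilities only, not values from the film corpus.
likely_dialogue = [0.5, 0.5, 0.25]
unlikely_dialogue = [0.125, 0.0625, 0.125]

print(surprisal(0.5), surprisal(0.125))
print(perplexity(likely_dialogue), perplexity(unlikely_dialogue))
```

The lower the probabilities the model assigns to the words it sees, the higher the mean surprisal, and so the higher the perplexity.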

In coming up with potential models to explore the feature of the “criminality” of a role, we discovered a huge flaw in this kind of research: creating a model for stereotyping presupposes an existing stereotype that you, the researcher, have to define. In an effort to call attention to pigeonholing and tokenism, your own biases, however subconscious, will undoubtedly come forward.

One method to circumvent this is to get rid of a particular (and potentially subjective) language model and search for more general linguistic variability between groups. Forget any model or any expectation of how these groups would sound, and ask, how similar do the groups sound to each other?

So, we tried a new approach. Sticking with information theory and the idea of surprisal, we turned to Kullback-Leibler divergence (KLD), or relative entropy. Instead of building a separate model to which the dialogue is compared, the dialogue of one group serves as the model to approximate the dialogue of another. KLD, then, is valuable because of the asymmetry it offers. Corpus A serves as the model for Corpus B, and we get a surprisal score telling us how well Corpus A predicts the words that we find in Corpus B. Then we go the other way, seeing how well Corpus B predicts the words we find in Corpus A. These two directions will not necessarily yield the same result, because one corpus could be far more varied than the other while still including all of the words that the other has.
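
As a rough illustration of that asymmetry, the sketch below computes KLD in both directions between two toy corpora. It rests on several assumptions that are ours, not the full post's: the corpora are invented, the distributions are simple word (unigram) frequencies, and add-one smoothing over a shared vocabulary is used so that no word has zero probability. The function names are hypothetical.

```python
import math
from collections import Counter

def word_distribution(tokens, vocab, alpha=1.0):
    """Unigram distribution over a shared vocabulary.
    Additive smoothing (alpha) is an assumption made here so that
    every word in vocab has a non-zero probability."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """D_KL(P || Q) in bits: the extra surprisal incurred when
    distribution Q is used as the model for data drawn from P."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)

# Toy corpora: B is more varied than A but contains all of A's words.
corpus_a = "the gun the money the gun".split()
corpus_b = "the gun the money the car the dog the house".split()

vocab = set(corpus_a) | set(corpus_b)
p_a = word_distribution(corpus_a, vocab)
p_b = word_distribution(corpus_b, vocab)

b_models_a = kl_divergence(p_a, p_b)  # how well B predicts the words in A
a_models_b = kl_divergence(p_b, p_a)  # how well A predicts the words in B

print(f"B as model for A: {b_models_a:.4f} bits")
print(f"A as model for B: {a_models_b:.4f} bits")
```

The two printed scores differ, which is the asymmetry described above; a divergence of zero would mean the two word distributions are identical.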

Read the full post here.