Editors’ Choice: Giallo: Using a vision language model to analyze Italian Giallo films

Italian Giallo films seem to be having a little resurgence recently. I was first exposed to them when the Criterion Channel app had a collection of them for Halloween a few years ago, and I now watch a few around this time of year, mixed in with other spooky season movies. They are terrible movies, but they are usually visually striking. It’s like the cinematographer or director is competing to make the most saturated, dramatic and elaborate scene possible. Coupled with the 1960s/70s aesthetic, they are really kind of fun to experience visually. Watching them, you notice recurring tropes: extreme zoom shots of the actors’ eyes, a shadow creeping up a staircase, or a POV shot of a knife-wielding gloved hand. I wondered: if you had a large corpus, could you automatically group these tropes, and maybe discover new ones or even see influences?

Image similarity algorithms are a pretty common way to find visual similarity in a corpus, but I wanted to group scenes by their content, not just by how they looked. I wanted two scenes of a shadow walking down a hallway to end up together even if they are visually very different. So I thought I could use a vision language model to describe each scene and then use that textual description to group and explore. I decided to try the Qwen2.5-VL-72B-Instruct model, sending it one frame from the movie every 5 seconds to build a textual representation of the film. I did this for about 70 Giallo films and used those texts for some experiments. This post looks at that process and the results.
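
To make the idea concrete, here is a minimal sketch of the two bookkeeping steps described above: picking one timestamp every 5 seconds, and comparing scenes by their text descriptions instead of their pixels. It is an illustration under assumptions, not the pipeline from the full post. The function names are hypothetical, the sample descriptions are made up for the example rather than real model output, and the bag-of-words cosine score is a simple stand-in for whatever text comparison was actually used. The call to the vision language model is left out entirely.

```python
import math
from collections import Counter


def sample_timestamps(duration_s: float, interval_s: float = 5.0) -> list[float]:
    """Return the timestamps (in seconds) at which to grab one frame."""
    count = int(duration_s // interval_s) + 1
    return [i * interval_s for i in range(count)]


def bag_of_words(text: str) -> Counter:
    """Lowercase word counts with surrounding punctuation stripped."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return Counter(w for w in words if w)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query: str, descriptions: list[str]) -> int:
    """Index of the description whose wording is closest to the query."""
    q = bag_of_words(query)
    scores = [cosine(q, bag_of_words(d)) for d in descriptions]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up descriptions standing in for model output, one per sampled frame.
descriptions = [
    "A shadow moves down a dark hallway",
    "Extreme close-up of a woman's eyes",
    "A gloved hand holds a knife",
]
timestamps = sample_timestamps(12)
match = most_similar("shadow in a hallway", descriptions)
print(timestamps, descriptions[match])
```

Because the comparison runs on the descriptions, two hallway-shadow scenes can match even when their colors and framing differ, which is the point of describing frames in text first. Plain word overlap misses synonyms, so a real experiment would likely want a stronger text representation.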

See full post.