Editors’ Choice: Characterizing the Google Books Corpus

It is tempting to treat frequency trends from the Google Books data sets as indicators of the “true” popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s.

Read full post here.

This content was selected for Digital Humanities Now by Editor-in-Chief Amanda Regan based on nominations by Editors-at-Large: Antonio Jimenez-Munoz, Jan Lampaert, María Cumplido, Rebecca Napolitano, Danae Tapia, Leigh Bonds, and Covadonga Lamar.