Resource: Understanding and Using Common Similarity Measures for Text Analysis

From the resource:

Statistical measures of similarity allow scholars to think computationally about how alike or different their objects of study may be, and these measures are the building blocks of many other clustering and classification techniques. In text analysis, the similarity of two texts can be assessed in its most basic form by representing each text as a series of word counts and calculating distance using those word counts as features. This tutorial will focus on measuring distance among texts by describing the advantages and disadvantages of three of the most common distance measures: city block or “Manhattan” distance, Euclidean distance, and cosine distance. In this lesson, you will learn when to use one measure over another and how to calculate these distances using the SciPy library in Python.

Read the full resource here.