Report: The ContentMine Scraping Stack: Literature-scale Content Mining with Community-maintained Collections of Declarative Scrapers

From the post:

Successfully mining scholarly literature at scale is inhibited by technical and political barriers that have been only partially addressed by publishers’ application programming interfaces (APIs). Many of those APIs have restrictions that inhibit data mining at scale, and while only some publishers actually provide APIs, almost all publishers make their content available on the web. Current web technologies should make it possible to harvest and mine the scholarly literature regardless of the source of publication, and without using specialised programmatic interfaces controlled by each publisher. Here we describe the tools developed to address this challenge as part of the ContentMine project.

Source: D-Lib: The ContentMine Scraping Stack: Literature-scale Content Mining with Community-maintained Collections of Declarative Scrapers