Editors’ Choice: Extracting A Large Corpus from the Internet Archive, A Case Study

Editors’ Summary: This article describes an AI-assisted workflow for developing a Python script that collects information at scale from the Internet Archive (IA) via IA’s API. IA is a large online repository of archived websites, print materials, audio recordings, newspapers, and other media. The author correctly identifies a need to share more information about how users can interact with IA at scale. The article acknowledges the use of AI in building its tools and shares the prompts used during the interaction with AI. Because IA holds a large number of the primary (and secondary) sources used in many scholars’ and students’ research, the workflow developed here will be a valuable reference for the DH community.

See full post.