Report: The Magnificent Seven – Looking Back on a Year of Exploring the Web Archives Datasets

From the report:

It has been just over a year since we kicked off a deep dive into the Library of Congress Web Archives on the Signal! Now at over 2 petabytes, the web archives are a complex aggregation of interrelated web objects that make up the internet as we know it (images, text, code, audio, video, etc.). In keeping with the Digital Strategy for the Library of Congress, we are working to “throw open the treasure chest” by making this digital content as broadly available as possible. However, without the proper tools to navigate this complex resource, users may think of the treasure chest as more of a Pandora’s box! Two broad goals directed our investigation: 1) to develop a better understanding of the individual media objects that comprise the web archives, and 2) to surface specific sets of individual resources from the web archives that will support users exploring research and creative uses of archived content. Let’s check in on how things have progressed.

Read the full report here.