This is a work-in-progress guide developed in response to research concerns about at-risk data sets hosted online, with significant contributions by Yale Library's Born Digital Archives Advisory Group.
If you have further questions, please feel free to reach out or schedule a consultation.
Web archiving, the process of preserving digital materials originally distributed on the Internet, has deep roots at Yale.
In one form or another, Yale Library has been archiving the web since 2004. As well as archiving Yale's own web resources, the Library contributes to collaborative collections of curated content such as the Ivy Plus Libraries Confederation (IPLC) Web Collecting Program.
Researchers who want to quickly archive web-based sources of research data they believe to be at risk are encouraged to use this guide to learn where to look for existing archives and how to contribute to archives themselves.
If you've identified a site, journal article, data set, or source of research data on the web that you are concerned may be at risk, you can follow a few steps to get started:
If you are more generally interested in participating in these efforts, we recommend you look at the list of existing web archiving initiatives to participate in.
Webpages are, by their nature, interconnected by links. As you look through an archive for a web resource, make sure all the content you need has been preserved: check for data downloads, summary pages, and other linked materials, not just the top-level site.
Some web archives will have more than one snapshot saved for a given resource, taken on different dates; check another if the first doesn't have what you're looking for. Not only does web content change, but the conditions of archiving may differ from capture to capture; for example, a server might have crashed partway through one crawl but not another!
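To compare snapshots programmatically, the Internet Archive's Wayback Machine exposes a CDX API that lists every capture of a URL. The sketch below builds a query URL and parses the documented JSON response shape (first row is the field names); the sample response here is invented purely to illustrate the format, not real capture data.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(page_url):
    """Build a CDX API query listing all snapshots of page_url as JSON."""
    params = urlencode({
        "url": page_url,
        "output": "json",
        # capture date, original URL, and HTTP status for each snapshot
        "fl": "timestamp,original,statuscode",
    })
    return f"{CDX_ENDPOINT}?{params}"

def parse_cdx(rows):
    """Turn CDX JSON output (first row = field names) into a list of dicts."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, rec)) for rec in records]

# Illustrative response shape only -- not real capture data.
sample = [
    ["timestamp", "original", "statuscode"],
    ["20200101000000", "http://example.com/", "200"],
    ["20210601000000", "http://example.com/", "200"],
]
snapshots = parse_cdx(sample)
```

Fetching `cdx_query_url(...)` with any HTTP client returns rows like the sample; comparing timestamps and status codes is a quick way to spot captures where the server failed partway through.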
If the resource you're looking for hasn't been archived, consider: other webpages that might hold copies of the resource and might themselves have been archived; and other archived parts of the site of interest that could point you to your resource, such as authorship information you can use to reach out directly.
crawl, web crawl, web crawler: a web crawler is a tool that starts with a URL, collects the page (or data about it), and then repeats the process for every link it finds on that page
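The core of that loop, finding the links on a fetched page, can be sketched with Python's standard library alone. The names below are our own, and a real crawler would also need politeness delays, deduplication, and robots.txt checks.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page's URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page they appear on.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    """Return absolute URLs for every link found in html."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A crawler would fetch each returned URL in turn and repeat the extraction until it hits a depth limit or URL budget.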
robots.txt: a robots.txt file is a standardized way for site owners to communicate how, and whether, they'd like their sites used by web crawlers and other automated means of accessing web content; find it at the base URL of a site with "/robots.txt" appended
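Python's standard library can read these files directly. The sketch below checks a robots.txt supplied as text; the rules shown are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

def crawl_allowed(robots_txt, user_agent, url):
    """Check whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules: block every crawler from /private/, allow the rest.
rules = """\
User-agent: *
Disallow: /private/
"""
```

In practice you would download a site's live robots.txt (base URL plus "/robots.txt") and consult it before each request your crawler makes.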
local: in this context, something that is on your computer (or other data storage), rather than accessed via the web