This is a work-in-progress guide developed in response to research concerns about at-risk data sets hosted online, with significant contributions by Yale Library's Born Digital Archives Advisory Group.
If you have further questions, please feel free to reach out or schedule a consultation.
Web archiving, the process of preserving digital materials originally distributed on the Internet, has deep roots at Yale.
In one form or another, Yale Library has been archiving the web since 2004. As well as archiving Yale's own web resources, the Library contributes to collaborative collections of curated content such as the Ivy Plus Libraries Confederation (IPLC) Web Collecting Program.
Researchers who want to quickly archive web-based sources of research data they believe to be at risk are encouraged to use this guide to learn where to look for existing archives and how to contribute to archives themselves.
If you've identified a site, journal article, data set, or source of research data on the web that you are concerned may be at risk, you can follow a few steps to get started:
If you are more generally interested in participating in these efforts, we recommend you look at the list of existing web archiving initiatives to participate in.
Webpages are, by their nature, interconnected by links. As you look through an archive for a web resource, make sure all the content you need has been preserved: check for data downloads, summary pages, and other linked materials, not just the top-level site.
Some web archives will have more than one snapshot saved for a given resource, taken on different dates; check another if the first doesn't have what you're looking for. Not only does web content change, but the conditions of archiving may differ from capture to capture; for example, a server might have crashed partway through one crawl but not another!
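To compare snapshots programmatically, the Internet Archive's Wayback Machine exposes a CDX API that lists every capture of a URL. The sketch below builds a query URL and parses the documented JSON response shape (first row is the field names); the sample response here is invented purely to illustrate the format, not real capture data.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(page_url):
    """Build a CDX API query listing all snapshots of page_url as JSON."""
    params = urlencode({
        "url": page_url,
        "output": "json",
        # capture date, original URL, and HTTP status for each snapshot
        "fl": "timestamp,original,statuscode",
    })
    return f"{CDX_ENDPOINT}?{params}"

def parse_cdx(rows):
    """Turn CDX JSON output (first row = field names) into a list of dicts."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, rec)) for rec in records]

# Illustrative response shape only -- not real capture data.
sample = [
    ["timestamp", "original", "statuscode"],
    ["20200101000000", "http://example.com/", "200"],
    ["20210601000000", "http://example.com/", "200"],
]
snapshots = parse_cdx(sample)
```

Fetching `cdx_query_url(...)` with any HTTP client returns rows like the sample; comparing timestamps and status codes is a quick way to spot captures where the server failed partway through.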
If the resource you're looking for hasn't been archived, consider: other webpages that might hold copies of the resource and might themselves have been archived; and other archived parts of the site of interest that could point you to your resource, such as authorship information you can use to reach out directly.
crawl, web crawl, web crawler: a web crawler is a tool that starts with a URL, collects the page (or data about it), and then repeats the process for every link it finds on that page
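The core of that loop, finding the links on a fetched page, can be sketched with Python's standard library alone. The names below are our own, and a real crawler would also need politeness delays, deduplication, and robots.txt checks.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page's URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page they appear on.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    """Return absolute URLs for every link found in html."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A crawler would fetch each returned URL in turn and repeat the extraction until it hits a depth limit or URL budget.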
robots.txt: a robots.txt file is a standardized way for site owners to communicate how, and whether, they'd like their sites used by web crawlers and other automated means of accessing web content; find it at the base URL of a site with "/robots.txt" appended
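Python's standard library can read these files directly. The sketch below checks a robots.txt supplied as text; the rules shown are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

def crawl_allowed(robots_txt, user_agent, url):
    """Check whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules: block every crawler from /private/, allow the rest.
rules = """\
User-agent: *
Disallow: /private/
"""
```

In practice you would download a site's live robots.txt (base URL plus "/robots.txt") and consult it before each request your crawler makes.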
local: in this context, something that is on your computer (or other data storage), rather than accessed via the web