Intellectual Property Rights and Web Archiving: Home
This guide is being developed by members of the Yale Web Archiving Working Group.
What is robots.txt?
Robots.txt is a file that site owners can include within their websites to specify directions for web crawlers/robots. These instructions may request that certain pages or all pages within the site not be crawled. For more details about this process, see "About / robots.txt."
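As a minimal sketch of how a crawler might interpret these directions, the example below uses Python's standard-library `urllib.robotparser` module. The robots.txt content, the crawler names (`some-crawler`, `example-archive-bot`), and the `example.org` URLs are hypothetical and chosen only for illustration; they do not come from any policy described in this guide.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler is asked to stay out of
# /private/, and one named crawler is asked not to crawl the site at all.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: example-archive-bot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A generic crawler may fetch public pages but not anything under /private/.
print(parser.can_fetch("some-crawler", "https://example.org/index.html"))
print(parser.can_fetch("some-crawler", "https://example.org/private/notes.html"))

# The named crawler is asked not to fetch any page on the site.
print(parser.can_fetch("example-archive-bot", "https://example.org/index.html"))
```

Note that robots.txt is a request rather than an enforcement mechanism: a check like `can_fetch` only reports what the site owner has asked for, and it is up to the crawler's operator to decide whether to honor it.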
Suggest a new FAQ
If you have questions about intellectual property rights related to web archiving not covered here, please contact any of the members of the Web Archiving Working Group (see membership list).
Web Archiving Consent Approaches
Welcome! This guide contains information regarding intellectual property rights related to web archiving. The resources on this page present formal policy statements and general guidelines offered by a range of institutions.
In general, consensus seems to be forming around the following resources and principles:
collecting institutions explicitly acknowledge that copyright remains with the content creators
opt-out instructions are provided
robots.txt may be ignored, provided the institution first notifies (or attempts to notify) content owners
The following institutions have adopted a policy of notifying content owners and providing opt-out instructions:
As of June 2019, Stanford’s policy is to notify content owners of the intent to archive and make accessible web content, with a 6-month embargo period following notification. The policy page provides a link for opting out. The policy also covers FERPA procedures regarding archiving of student course work.
Stanford’s policy links to the following resources, which served as guidelines for its creation: the NDSA Web Archiving Survey Reports from 2011-12 and 2013-14, the ARL Code of Best Practices for Fair Use, and the Section 108 Study Group Report.
Columbia provides several helpful resources for those conducting web archiving activities, as well as for content owners who may have questions about the process. The program's policies define the process for selection and harvesting and the guidelines for permissions and access. The institution attempts to notify all organizations and/or individuals whose websites are selected for archiving and includes contact information for take-down requests in its FAQs page. Columbia also provides an information page for how website owners can optimize their sites for preservation.
This site provides detailed information for website owners regarding crawler settings, notification, and access by researchers to crawled sites, as well as opt-out instructions for content owners. The FAQs also point to the Rights and Access statements provided for each collection page and item record. To collect as much data as possible from websites, the Library's policy is to notify site owners before crawling and to ignore robots.txt exclusions.
The Library also provides Supplementary Guidelines related to Web Archiving, which outline current practices related to web collection and the institution's collecting policies.
The policy specifies that for website content from "private individuals, organizations, or associations, every effort will be made to inform the content owners" of the harvesting and "to inform them of their right to opt out or suppress content."
Other institutional policies do not specify prior notification to content owners but provide a mechanism to report concerns related to the collection and distribution of archived web content:
The policy outlines the principles that inform the institution's collecting decisions for website and social media content, which fit within the broader context of the Collecting Policy for Duke University Archives. For social media sites in particular, the Archives collects in compliance with the terms of service of the specific social media platforms. The policy notes that the institution may restrict access to collections or anonymize contributions for privacy reasons. It also directs those with concerns related to the collection of web and social media content to contact the University Archives office and provides a link to contact information.
Hirtle gives an overview of general copyright concerns related to digital preservation and the principles of fair use. He also discusses the Internet Archive's preservation of harvested websites and several provisions that "bolster a possible fair use defense," including the ability for content owners to opt out, honoring robots.txt exclusions, and the possibility of content removal under certain circumstances. However, he does note that these provisions can result in reduced usefulness of a web archive, as certain sites or pages may be completely omitted from the archive.
This presentation provides a helpful overview of intellectual property issues related to web archiving. Underwood notes that the practice of notifying website owners prior to capture is "not mandatory but advisable" and can facilitate collaboration.
The report outlines several challenges related to web archiving, including legal/intellectual property concerns. Restrictions may vary according to the copyright laws of specific countries, the presence of a legal mandate or Creative Commons licenses, and the policies of specific institutions. The authors note, "Whichever the approach organizations take to web archiving, they need to consider data protection and citizens’ privacy rights."
The IIPC provides information and resources regarding web archiving, including reports, presentations, recommendations for tools, case studies, and sample policies. The site also includes a section with information about legal considerations for web archiving.
The NDSA has conducted four surveys of web archiving activities in the United States since 2011. The latest report (published in December 2018) noted that a majority (seventy percent) of the surveyed institutions did not notify content owners or obtain permission from them for web harvesting. Over ninety percent of the surveyed institutions responded that they had never received a request to cease harvesting or remove previously crawled content.
Published in 2008, the report includes a recommendation that a new exception be added to Section 108 of the United States Copyright Act to permit libraries and museums to capture and reproduce publicly available websites and other online content, on an opt-out basis for rights holders.
This case study of the web archiving effort started by the University of Victoria in 2013 includes an analysis of opt-in vs. opt-out approaches to web archiving, to support the University of Victoria's decision to implement an opt-out approach as "the only workable solution to web archiving."
Slania gives an overview of the National Museum of Women in the Arts (NMWA) project to create the Contemporary Women Artists on the Web Collection. She discusses several challenges related to web archiving, including copyright, technical, and selection issues. The NMWA adopted an opt-out approach for this project and honored robots.txt instructions.
Abrams discusses the creation of the Contemporary Composers Web Archive (CCWA), an Ivy Plus Libraries Consortium collection. Prior to including websites in the collection, the Ivy Plus Libraries notified the composers (or the relevant estate or institution) of the intent to harvest. Content owners could opt out of the archive at any time. If the opt-out request was made before capture, the site would not be harvested. If a request was made post-capture, the Ivy Plus Libraries would remove public access to prior captures.
The Library policy discusses social media preservation as a potential future activity for which strategies are being investigated by archivists and notes, "in the event these strategies lead to the preservation of such materials, the Bentley Historical Library will adhere to the Fair Use exceptions of the Copyright Act as well as its standard practices for protecting the intellectual property rights of donors and content owners."
This report addresses concerns and strategies for the archiving of social media. Social media presents several challenges for preservation, including the variety of platforms and differing terms of service, the volume of data, the need to harvest via API rather than more traditional means such as web crawlers, as well as legal and ethical considerations related to user-generated content.
Webber discusses several challenges related to harvesting social media content for the UK Web Archive. Several of the challenges were technical, such as issues caused by shortened URLs and failure to capture certain types of content such as advertisements. However, other challenges result from the terms of service of the individual media platforms, which may restrict access to crawlers or require logins for access to pages.