Web Archiving @ Yale: Intellectual Property Rights and Web Archiving

FAQs

What is robots.txt?

Robots.txt is a file that site owners can include within their websites to specify directions for web crawlers/robots. These instructions may request that certain pages or all pages within the site not be crawled. For more details about this process, see "About / robots.txt."
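As an illustration of how a crawler might read these directions, here is a minimal sketch using Python's standard-library robots.txt parser. The file contents, the crawler names (ExampleArchiveBot, SomeOtherBot), and the example.org URLs are hypothetical and not taken from any institution's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: all crawlers are asked to skip /private/,
# and one (made-up) archiving crawler is asked to skip the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleArchiveBot
Disallow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("SomeOtherBot", "https://example.org/index.html"),
        ("SomeOtherBot", "https://example.org/private/notes.html"),
        ("ExampleArchiveBot", "https://example.org/index.html"),
    ]
    for agent, url in checks:
        print(agent, url, may_crawl(ROBOTS_TXT, agent, url))
```

The file only expresses a request: nothing in it technically prevents a crawler from fetching a page, so whether to honor the directions is a policy decision for the collecting institution.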

Suggest a new FAQ

If you have questions about intellectual property rights related to web archiving not covered here, please contact any of the members of the Web Archiving Working Group (see membership list).

Web Archiving Consent Approaches

Welcome! This guide contains information regarding intellectual property rights related to web archiving. The resources on this page present formal policy statements and general guidelines offered by a range of institutions.

In general, consensus seems to be forming around the following resources and principles:

  • collecting institutions explicitly acknowledge that copyright remains with the content creators
  • opt-out instructions are provided
  • robots.txt exclusions may be ignored, provided the institution first attempts to notify content owners

The following institutions have adopted a policy of notifying content owners and providing opt-out instructions:

  • Stanford University Web Archiving Policy
    • As of June 2019, Stanford’s policy is to notify content owners of the intent to archive and make accessible web content, with a 6-month embargo period following notification. The policy page provides a link for opting out. The policy also covers FERPA procedures regarding archiving of student course work.
    • Stanford’s policy links to the following resources, which guided its creation: the NDSA Web Archiving Survey Reports from 2011-12 and 2013-14, the ARL Code of Best Practices for Fair Use, and the Section 108 Study Group Report.
  • Columbia University Libraries Web Resources Collection Program Policies
    • Columbia provides several helpful resources for those conducting web archiving activities, as well as for content owners who may have questions about the process. The program's policies define the process for selection and harvesting and the guidelines for permissions and access. The institution attempts to notify all organizations and/or individuals whose websites are selected for archiving and includes contact information for take-down requests in its FAQs page. Columbia also provides an information page for how website owners can optimize their sites for preservation.
  • Library of Congress Web Archiving Program FAQs
    • This site provides detailed information for website owners regarding crawler settings, notification, and access by researchers to crawled sites, as well as opt-out instructions for content owners. The FAQs also point to the Rights and Access statements provided for each collection page and item record. To collect as much data as possible from websites, the Library's policy is to notify site owners before crawling and to ignore robots.txt exclusions.
    • The Library also provides Supplementary Guidelines related to Web Archiving, which outline current practices related to web collection and the institution's collecting policies.
  • University of Michigan Bentley Historical Library Web Archives Collection Development Policy
    • The policy specifies that for website content from "private individuals, organizations, or associations, every effort will be made to inform the content owners" of the harvesting and "to inform them of their right to opt out or suppress content."

Other institutional policies do not specify prior notification to content owners but provide a mechanism to report concerns related to the collection and distribution of archived web content:

  • Duke University Archives Website and Social Media Collecting Policy
    • The policy outlines the principles that inform the institution's collecting decisions for website and social media content, which fit within the broader context of the Collecting Policy for Duke University Archives. For social media sites in particular, the Archives collects in compliance with the terms of service of the specific social media platforms. The policy notes that the institution may restrict access to collections or anonymize contributions for privacy reasons. It also directs those with concerns related to the collection of web and social media content to contact the University Archives office and provides a link to contact information.

Articles, Presentations and Blog Posts

  • Digital Preservation and Copyright by Peter Hirtle
    • Hirtle gives an overview of general copyright concerns related to digital preservation and the principles of fair use. He also discusses the Internet Archive's preservation of harvested websites and several provisions that "bolster a possible fair use defense," including the ability for content owners to opt out, honoring robots.txt exclusions, and the possibility of content removal under certain circumstances. However, he does note that these provisions can result in reduced usefulness of a web archive, as certain sites or pages may be completely omitted from the archive.
  • Digital Preservation Coalition Web Archiving Report
    • The report includes a section detailing legal challenges for web archives, and approaches used by institutions in the US and the UK.
  • Web Archiving for Music Librarians (Kent Underwood, Music Library Association, March 2016)
    • This presentation provides a helpful overview of intellectual property issues related to web archiving. Underwood notes that the practice of notifying website owners prior to capture is "not mandatory but advisable" and can facilitate collaboration.
  • An Overview of Web Archiving (article by Jinfang Niu in D-Lib Magazine)
    • The "Acquisitions" section of this article compares different approaches regarding rights. The scale of the web archiving effort may impact the decision to adopt an opt-in vs. an opt-out approach.
  • Archiving the Web: Working paper submitted to the CARL Committee on Research Dissemination (report by Canadian Association of Research Libraries)
    • The report outlines several challenges related to web archiving, including legal/intellectual property concerns. Restrictions may vary according to the copyright laws of specific countries, the presence of a legal mandate or Creative Commons licenses, and the policies of specific institutions. The authors note, "Whichever the approach organizations take to web archiving, they need to consider data protection and citizens’ privacy rights."
  • International Internet Preservation Consortium (IIPC)
    • The IIPC provides information and resources regarding web archiving, including reports, presentations, recommendations for tools, case studies, and sample policies. The site also includes a section with information about legal considerations for web archiving.
  • Legal Issues in Web Archiving (Abbie Grotke, Library of Congress Signal blog post)
    • This post provides an overview of legal issues discussed at an IIPC Legal Issues Roundtable event and several ways that institutions may address these challenges.
  • National Digital Stewardship Alliance (NDSA) 2017 Web Archiving Survey
    • The NDSA has conducted four surveys of web archiving activities in the United States since 2011. The latest report (published in December 2018) noted that a majority (seventy percent) of the surveyed institutions did not notify content owners or obtain permission from them for web harvesting. Over ninety percent of the surveyed institutions responded that they had never received a request to cease harvesting or remove previously crawled content.
  • Web Archiving Resources (blog post by Jessica Venlet)
    • This post compiles a variety of web archiving resources, including links to introductory materials, tools, presentations, papers, and institutional policies.

Guidelines and Whitepapers

The guidelines below are frequently cited by institutions as reference for their specific web harvesting and archiving policies:

Case Studies

This section highlights web archiving initiatives at various institutions:

  • Archiving the Web: A Case Study from the University of Victoria
    • This case study of the web archiving effort started by the University of Victoria in 2013 includes an analysis of opt-in vs. opt-out approaches to web archiving, to support the University of Victoria's decision to implement an opt-out approach as "the only workable solution to web archiving."
  • Heather Slania, "Online Art Ephemera: Web Archiving at the National Museum of Women in the Arts," Art Documentation: Journal of the Art Libraries Society of North America 32, no. 1 (Spring 2013): 112-126. http://www.journals.uchicago.edu/doi/full/10.1086/669993
    • Slania gives an overview of the National Museum of Women in the Arts (NMWA) project to create the Contemporary Women Artists on the Web Collection. She discusses several challenges related to web archiving, including copyright, technical and selection issues. The NMWA adopted an opt-out approach for this project and honored robots.txt instructions.
  • Building the Contemporary Composers Web Archive (blog post by Samantha Abrams)
    • Abrams discusses the creation of the Contemporary Composers Web Archive (CCWA), an Ivy Plus Libraries Consortium collection. Prior to including websites in the collection, the Ivy Plus Libraries notified the composers (or the relevant estate or institution) of the intent to harvest. Content owners could opt out of the archive at any time. If the opt-out request was made before capture, the site would not be harvested. If a request was made post-capture, the Ivy Plus Libraries would remove public access to prior captures.
  • Establishing and growing a multi-institutional web archiving collaboration for the Collaborative Architecture, Urbanism and Sustainability Web Archive (CAUSEWAY) (slides by Anna Perricci)
    • Perricci describes the collaborative process to create this web archive. Website owners were notified of intent to harvest, and a majority gave permission for their sites to be harvested.

Approaches to Social Media and Third-Party Content

The resources in this section address some of the unique challenges associated with harvesting and preservation of social media data:

  • Duke University Archives Website and Social Media Collecting Policy 
    • This policy describes compliance with the social media platforms' terms of service and notes that social media is collected primarily for its content rather than to preserve its look and feel.
  • University of Michigan Bentley Historical Library Web Archives: Collection Development Policy
    • The Library policy discusses social media preservation as a potential future activity for which strategies are being investigated by archivists and notes, "in the event these strategies lead to the preservation of such materials, the Bentley Historical Library will adhere to the Fair Use exceptions of the Copyright Act as well as its standard practices for protecting the intellectual property rights of donors and content owners."
  • Digital Preservation Coalition Report on Preserving Social Media (Sara Day Thomson, February 2016)
    • This report addresses concerns and strategies for the archiving of social media. Social media presents several challenges for preservation, including the variety of platforms and differing terms of service, the volume of data, the need to harvest via API rather than more traditional means such as web crawlers, as well as legal and ethical considerations related to user-generated content.
  • Challenges of Web Archiving Social Media (UK Web Archive blog post by Jason Webber)
    • Webber discusses several challenges related to harvesting social media content for the UK Web Archive. Several of the challenges were technical, such as issues caused by shortened URLs and failure to capture certain types of content such as advertisements. However, other challenges result from the terms of service of the individual media platforms, which may restrict access to crawlers or require logins for access to pages.