Resources for Digital Humanities: Digitization and OCR/HTR

Vocabulary, tools, advice, and library resources to start your Digital Humanities project or research.

Digitization

Digitization is, broadly speaking, the generation of a faithful digital representation of some object or piece of information. In digital humanities terms, this most often means the scanning of archival documents.

You might digitize materials to share them more broadly, to preserve ephemeral information, or to conduct further research using digital techniques. For example, you could use optical character recognition (OCR) to convert scans to machine-readable text and then apply text mining techniques to that text.

If you have physical materials you'd like to digitize, the DHLab offers a Digitization Cube with equipment for your use, including a large flatbed scanner, an overhead book scanner, and a microfilm scanner. After a consultation with DHLab staff, you can book time in the Cube.

Optical Character Recognition (OCR)

OCR is the automated conversion of images of printed text into machine-readable, searchable text.

You might, for example, OCR scans of printed books or typed letters so you can search or analyze their content.
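
As a rough illustration of what "search or analyze" can mean once OCR has produced plain text, here is a minimal Python sketch. It assumes the OCR output has already been exported as plain text; the sample text and the function names (tokenize, word_frequencies, find_lines) are hypothetical and are not part of Adobe Acrobat or ABBYY FineReader.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters as word tokens.

    Apostrophes and hyphens split words, which is a simplification.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent words as (word, count) pairs."""
    return Counter(tokenize(text)).most_common(top_n)


def find_lines(text, term):
    """Return (line_number, line) pairs for lines containing term, ignoring case."""
    term = term.lower()
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if term in line.lower()
    ]


# Hypothetical sample standing in for text exported from an OCR tool.
sample = (
    "The library opened in 1930.\n"
    "The Library holds maps and letters.\n"
    "Letters arrive daily."
)

if __name__ == "__main__":
    print(word_frequencies(sample, top_n=3))
    print(find_lines(sample, "letters"))
```

Real OCR output usually contains recognition errors, so counts and search results from a sketch like this are only as reliable as the underlying text.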

Two main software options for OCR are:

Adobe Acrobat
  • Works best for printed documents with very standard formatting (no tables) and no unusual characters
  • A good place to start if you're testing out OCR
  • Yale affiliates with active CAS logins can access Adobe Acrobat for free through the Adobe Creative Cloud via the Yale Software Library

ABBYY FineReader
  • Works best for printed documents that may include tables, other unusual formatting, or idiosyncratic characters or symbols
  • Users can add symbols or characters to be recognized
  • Yale affiliates with active CAS logins can use ABBYY FineReader on select computers in Marx Library, or by applying to use it remotely

Handwritten Text Recognition (HTR)

HTR is the automated conversion of images of handwritten text into machine-readable, searchable text. It is more specialized than OCR because handwriting varies far more than printed type.

One software option that may help with HTR:

Transkribus
  • Free to download, and free to use up to a certain number of pages of transcribed text
  • Allows the training of models for HTR on sets of transcribed handwritten text (at least 75 pages, in most cases)
  • There are some publicly available models already trained in a number of languages and scripts
  • Accuracy varies widely depending on how thoroughly the model was trained and how closely the training data resembles the manuscripts being transcribed
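
One common way to put a number on transcription accuracy is the character error rate (CER): the edit distance between a trusted manual transcription and the automatic output, divided by the length of the manual transcription. The Python sketch below is a generic illustration under that assumption. It is not necessarily how Transkribus itself reports accuracy, and the function names and sample strings are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn reference into hypothesis."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,         # delete a reference character
                    current[j - 1] + 1,      # insert a hypothesis character
                    previous[j - 1] + cost,  # substitute, or match at no cost
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: a manual transcription and an automatic one.
    manual = "Dear Sir, I write in haste"
    automatic = "Dear Sir, I wrote in hoste"
    print(f"CER: {character_error_rate(manual, automatic):.3f}")
```

Comparing CER on a few manually transcribed pages from your own collection gives a rough sense of whether a given model is a good match for your manuscripts.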

Contact Us

For help with any stage of a digital humanities project, with any of the methods described here, or with any other questions, feel free to reach out or book a consultation with the DHLab.