Yale University Library Research Guides: Resources for Digital Humanities: Digitization and OCR/HTR

Resources

Digitization Cube
After a consultation with DHLab staff, you can reserve time in the Digitization Cube to use our equipment for your digitization project.
Adobe Acrobat
Adobe Acrobat, which supports OCR of simply formatted printed documents, can be accessed for free by Yale affiliates with active CAS logins.
ABBYY FineReader
ABBYY FineReader, which supports OCR of printed documents with complicated formatting or unusual characters, can be used on select computers in Marx Library or accessed remotely by application.
Transkribus
Transkribus, which supports the use and training of HTR models, is free to download and free to use up to a certain number of pages.

Digitization

Digitization is, broadly speaking, the generation of a faithful digital representation of some object or piece of information. In digital humanities terms, this most often means the scanning of archival documents.

You might digitize materials to share them more broadly, to preserve ephemeral information, or to conduct further research using digital techniques. For example, you could use optical character recognition (OCR) to convert scans to machine-readable text and then apply text mining techniques to that text.

If you have physical materials you'd like to digitize, the DHLab offers a Digitization Cube with equipment for your use, including a large flatbed scanner, an overhead book scanner, and a microfilm scanner. After a consultation with DHLab staff, you can book time in the Cube.

Optical Character Recognition (OCR)

OCR, optical character recognition, is the automated conversion of images of printed text to machine-readable and -searchable text.

You might, for example, OCR scans of printed books or typed letters so you can search or analyze their content.

There are two main pieces of software you might use to perform OCR:

Adobe Acrobat

Works best for printed documents with very standard formatting (no tables) and no unusual characters
A good place to start if you're testing out OCR
Yale affiliates with active CAS logins can access Adobe Acrobat for free through the Adobe Creative Cloud via the Yale Software Library

ABBYY FineReader

Works best for printed documents that may include tables, other unusual formatting, or idiosyncratic characters or symbols
Users can add symbols or characters to be recognized
Yale affiliates with active CAS logins can use ABBYY FineReader on select computers in Marx Library, or by applying to use it remotely
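
As an illustration of what "search or analyze" can look like once OCR is done, here is a minimal Python sketch. It assumes the OCR output has already been saved as plain text; the sample string and the function names (tokenize, keyword_in_context, top_words) are hypothetical and are not part of Adobe Acrobat or ABBYY FineReader.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_in_context(text, keyword, width=30):
    """Return each occurrence of keyword with up to `width` characters of context on each side."""
    hits = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Flatten line breaks so each hit prints on one line.
        hits.append(text[start:end].replace("\n", " "))
    return hits


def top_words(text, n=5):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokenize(text)).most_common(n)


# Hypothetical OCR output, standing in for text exported from a scanned page.
sample = "The library opened in 1931.\nThe library holds rare maps and letters."

if __name__ == "__main__":
    for hit in keyword_in_context(sample, "library", width=10):
        print(hit)
    print(top_words(sample, 3))
```

A real project would likely read the text from exported files and filter out common words before counting. OCR errors also carry through to any search or count, so it is worth spot-checking results against the page images.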

Handwritten Text Recognition (HTR)

HTR, handwritten text recognition, is the automated conversion of images of handwritten text to machine-readable and -searchable text. It is a harder and more specialized task than OCR because handwriting varies so much from one writer to the next.

One piece of software that may help with HTR:

Transkribus

Free to download, and free to use up to a certain number of pages of transcribed text
Allows the training of models for HTR on sets of transcribed handwritten text (at least 75 pages, in most cases)
There are some publicly available models already trained in a number of languages and scripts
Accuracy varies widely with how thoroughly a model has been trained and how closely its training data resembles the manuscripts being transcribed
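
One common way to put a number on transcription accuracy is the character error rate (CER): the edit distance between a trusted manual transcription and the automatic one, divided by the length of the manual transcription. The Python sketch below is a generic illustration of that calculation. It does not use any Transkribus functionality, and the function names and sample strings are assumptions made for the example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn reference into hypothesis."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: a hand-checked line and a model's reading of it.
    manual = "Dear Sir,"
    automatic = "Dear Sin,"
    print(f"CER: {character_error_rate(manual, automatic):.3f}")
```

Comparing a few manually transcribed pages against a model's output in this way gives a rough sense of whether that model suits a particular set of manuscripts before committing to a larger batch.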

Contact Us

For help with any stage of a digital humanities project, with any of the methods described here, or with any other questions, feel free to reach out or book a consultation with the DHLab.
