Before you get started finding texts or data for text and data mining, be aware that programmatically accessing information isn’t necessarily permitted everywhere that it is possible. Think about questions of copyright, how you’re accessing data, and who might have a stake in that data before you embark on a project.
For example, consider whether the texts you want are under copyright, whether the site or database's terms of service permit automated access, and whether the people or communities represented in the data have a stake in how it is used.
At any point in your project—from planning to acquiring data to troubleshooting code—you can reach out to us for help. For initial planning, we especially recommend booking a consultation with the DHLab to help you strategize; for on-the-spot help, visit the StatLab's walk-in hours.
Yale already has license agreements with some database vendors whose data you may be able to use for text and data mining. Some vendors provide dedicated tools to make this easier; ProQuest, for example, offers TDM Studio. See our guide to vendor tools for more information.
You can find some of the texts and data available at Yale by searching the library catalogue.
When you’re running a search, you can narrow your results to data by selecting the format “Data Sets.” (You can even search by format and genre alone!) Then, under Subject (Genre), you can narrow further by kind of data, including, for example, statistical data, text corpora, image sets, or geospatial data.
There are also a few keywords for finding certain kinds of data. In Yale Library’s Quicksearch, the search term “yuldsetmediated” finds newspapers and magazines; “yuldsettxt” finds transcripts, recordings, and other linguistic data; and “yuldset” covers data more expansively, including text data but also geospatial, numeric, and image data.
Once you’ve found a data set, you may be able to access it directly by following a link. Make sure that you’ve identified any information about how you’re allowed to use the data, and, if applicable, how you should cite it once you’ve used it.
If there’s a specific dataset or set of texts available through a vendor that Yale doesn’t already offer, we may be able to secure access for you.
Web scraping is a way of collecting information directly from pages you find on the Internet. Web scraping is usually legal, but not always advisable. It is almost always going to be messier and more complicated than other routes, if other routes are available.
Web scraping is a good idea only when the information you need isn’t already available through a licensed database, an existing dataset, or a vendor agreement, and when the site permits automated access.
Do note that some publicly available websites may still take measures to prevent web scraping, including blocking your IP address.
If you're thinking of scraping the web, be sure to read the robots.txt file for the site you're interested in to see which parts of the site the creator allows crawlers to access, and any conditions they've set for how you can scrape their content. Many people find Beautiful Soup, a Python library, useful for parsing the HTML files you've retrieved from the web. Feel free to reach out to the DHLab for advice on how, and whether, to proceed with a web scraping project.
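As a small illustration, you can check a site's robots.txt rules programmatically with Python's standard-library robotparser before fetching any pages. The robots.txt content, site name, and paths below are invented for the example; in practice you would fetch the file from the site itself, and a library like Beautiful Soup would then help you parse the pages you're allowed to retrieve.

```python
from urllib import robotparser

# A hypothetical robots.txt, inlined here for illustration; in practice
# you would fetch it from https://<site>/robots.txt before scraping.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether a generic crawler ("*") may fetch a given path.
print(rp.can_fetch("*", "https://example.com/articles/page1.html"))  # True
print(rp.can_fetch("*", "https://example.com/private/notes.html"))   # False
```

If a path is disallowed for your user agent, take that as the site creator's answer, even if the page is technically reachable.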
You can also use text and data mining tools on data or texts that you’ve generated yourself.
If you’re generating your own dataset, consider early on what tools you’ll want to use. For example, if you ultimately want to use Gephi to create a network graph, you might start by generating lists of nodes and edges. If you’re not sure yet what tool you’ll use, keeping your data clear, consistent, and unambiguous will make it easier to process computationally later on.
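For instance, a minimal sketch of generating node and edge lists in the CSV layout Gephi's spreadsheet importer recognizes might look like this (the correspondence network, names, and letter counts are invented for illustration):

```python
import csv
import io

# Hypothetical correspondence network: letters exchanged between writers.
# The identifiers, labels, and counts are made up for this example.
nodes = [("w1", "Austen"), ("w2", "Cassandra"), ("w3", "Crosby & Co.")]
edges = [("w1", "w2", 96), ("w1", "w3", 2)]

# Gephi's spreadsheet importer recognizes the "Id"/"Label" columns for
# nodes and "Source"/"Target"/"Weight" for edges.
node_csv = io.StringIO()
writer = csv.writer(node_csv)
writer.writerow(["Id", "Label"])
writer.writerows(nodes)

edge_csv = io.StringIO()
writer = csv.writer(edge_csv)
writer.writerow(["Source", "Target", "Weight"])
writer.writerows(edges)

print(node_csv.getvalue())
print(edge_csv.getvalue())
```

Saved as two files (for example, nodes.csv and edges.csv), these could be imported directly into Gephi's Data Laboratory.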
If you have physical or digital copies of texts—or of datasets embedded in texts—and permission to use them this way, you may be able to digitize them yourself. The DHLab has a Digitization Cube that may be available by permission for scanning documents, books, microfiche, or other materials. Adobe Acrobat Pro (available as part of the Adobe Creative Cloud, provided free to Yale users) can perform OCR (optical character recognition) on basic printed texts. The library also offers access to ABBYY FineReader to support OCR of printed documents with more complicated formats (such as tables) than Adobe Acrobat handles well. You can use ABBYY FineReader on select computers in the Marx Library, or apply for access to LibraryApps to use it remotely.
HTR (handwritten text recognition) is becoming more viable with a tool called Transkribus. Its accuracy is likely to vary a great deal, but you can improve it by training the tool on your own data.