Resources for Text and Data Mining: Text and Data Mining Methods

Contact Us

At any point in your project—from planning to acquiring data to troubleshooting code—you can reach out to us for help. For initial planning, we especially recommend you book a consultation with the DHLab to help you strategize.

For on-the-spot help, visit the StatLab's walk-in hours or book a consultation with the DHLab.

Tasks and methods:

See initial statistics about a large set of texts or a large data set (say, a big CSV file) in order to better understand it as a whole

If you’re looking for quick impressions, there are some easy-to-use tools that can help you for both texts and datasets. These should be seen as ways of generating ideas that you might then follow up on in a more rigorous way rather than ways of generating firm answers.

If you just want to get a sense of the contents of a data file, OpenRefine is a free, downloadable tool with a simple browser-based interface that can help you identify the overall structure of your data, the kinds of values a given field might contain, and similar information.

For texts, Voyant Tools can be a great way to identify commonly occurring words and phrases, and even words and phrases that frequently appear together. You can input URLs or full texts, or upload text files. This might be a fast way to see what motifs appear together, for example.

For data sets, often generating a visualization is a quick way to see what you’re working with. There’s nothing wrong with using tools you’re already familiar with. Excel, which you have access to through Yale, offers conditional formatting and some simple visualizations (“charts”) that may help you see patterns in your data.

If you have network data—or data you can reformat as graph data—Gephi is a free desktop application that can visualize it relatively easily.

Identify places, people, and things in a set of data or texts (named entity recognition)

Named Entity Recognition is a Natural Language Processing (NLP) method for identifying named, unambiguous things—this might be a person, a place, an organization, a concept, or anything else where the word refers to one specific thing. So “Peter Salovey” would be a named entity—“university presidents” would not. Current Named Entity Recognition systems rely on machine learning.

Unfortunately, likely due to the training data—the sets of texts the systems were taught to identify entities based on—the results of Named Entity Recognition reflect demographic bias. For example, names more often associated with white men are recognized with greater accuracy than names more often associated with women and people of color.

You may decide that Named Entity Recognition is still helpful for your project. Although it will require some programming, there are tools to help you. Stanford CoreNLP (for Java) and spaCy (for Python) are both popular, and might be a good place to start.
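As a sketch of what this looks like in code, the snippet below uses spaCy. A real project would load a trained statistical pipeline (for example, en_core_web_sm, which requires a separate model download); to stay self-contained, this example instead uses spaCy's rule-based EntityRuler, which produces entities in the same format a trained model would. The patterns here are illustrative.

```python
# Sketch of extracting named entities with spaCy (assumes spaCy is installed).
# A real project would load a trained statistical pipeline, e.g.:
#   nlp = spacy.load("en_core_web_sm")   # requires a separate model download
# To keep this sketch self-contained, we use spaCy's rule-based EntityRuler,
# which yields entities through the same Doc.ents interface.
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Peter Salovey"},
    {"label": "ORG", "pattern": "Yale University"},
])

doc = nlp("Peter Salovey was president of Yale University.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)  # [('Peter Salovey', 'PERSON'), ('Yale University', 'ORG')]
```

With a trained pipeline in place of the ruler, the same `doc.ents` loop would return whatever people, places, and organizations the model recognizes, subject to the accuracy caveats above.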

Find recurring words or phrases in a text

If you’re trying to find frequently occurring words or phrases in a text, there are easy web-based ways to do this. WordCounter, from DataBasic, will generate a quick word cloud of words and phrases from a pasted text, a URL, or a file upload. If you want a little more information, or a little more control, Voyant Tools will also generate a word cloud as well as other data and charts if you paste in a text, type in a URL or set of URLs, or upload a file.
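Under the hood, these tools are doing a word-frequency count, which you can sketch yourself in a few lines of standard-library Python. The sample text and the tiny stopword list below are illustrative placeholders.

```python
# A minimal word-frequency count in standard-library Python, the kind of
# tally that WordCounter and Voyant Tools visualize. The sample text and
# the tiny stopword list are illustrative placeholders.
import re
from collections import Counter

text = """It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness"""

stopwords = {"it", "was", "the", "of"}          # toy stopword list
words = re.findall(r"[a-z']+", text.lower())    # crude tokenization
counts = Counter(w for w in words if w not in stopwords)

print(counts.most_common(3))  # [('times', 2), ('age', 2), ('best', 1)]
```

Real tools add better tokenization and much larger stopword lists, but the underlying count is this simple.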

Be aware that, as with any text or data mining process, these counts put you at a level of abstraction above your texts so you can see things you might not otherwise; they should be treated as prompts for closer investigation rather than as answers in themselves. You might also get unexpected results that “mean” something other than you expected, like finding that character names are the most-used words in an unedited theatrical script.

Identify relationships within data or texts (for example, words that often co-occur, characters who show up in the same scene, etc.)

There are a few ways you might go about identifying relationships within a data set or a group of texts.

  • If you’re looking for co-occurring words and phrases (ones that frequently appear near each other), Voyant Tools is an easy way to see some of these relationships. You can plug information directly into the web interface and get some quick answers.
  • If you’d like to do topic modeling—identifying clusters of words that tend to appear together across a set of documents—MALLET is a Java-based toolkit that can support doing so.
  • If you want to see relationships within a text more structurally—are there common sentence structures, for example, or frequently occurring word orders?—you might want to use NLTK, which requires some Python programming but is well documented online. It includes Natural Language Processing libraries that can help you process a text in a number of different ways.
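The simplest version of the first approach, counting words that appear near each other, can be sketched with a sliding window in plain Python. The sample sentence and the window size of three tokens are arbitrary choices for illustration.

```python
# Counting word co-occurrence within a sliding window, the simplest version
# of what tools like Voyant report. The sample text and the window size of
# three tokens are arbitrary choices for illustration.
import re
from collections import Counter

text = "the cat sat on the mat while the dog sat on the rug"
words = re.findall(r"[a-z]+", text.lower())

window = 3  # two words co-occur if they fall within 3 tokens of each other
pairs = Counter()
for i in range(len(words)):
    for j in range(i + 1, min(i + window, len(words))):
        if words[i] != words[j]:
            pairs[tuple(sorted((words[i], words[j])))] += 1

print(pairs.most_common(3))
```

Sorting each pair alphabetically before counting means “sat on” and “on … sat” accumulate in the same tally; varying the window size changes how loose a notion of “co-occurring” you are willing to accept.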

When you want to see relationships in data or texts, you might also be thinking of something called a network graph. A network graph is a kind of data visualization that shows a map of things (vertices or nodes) and their relationships to each other (edges or links). If you want to create a network graph, you will likely have to prepare your data—by hand or with a script—before you plug it into anything. This means generating two lists: one of what your things are (people, places, even concepts), and one of how they connect to each other. Tools like OpenRefine or even NLTK might help you derive these two lists from your raw data. Once they’re ready, Gephi is a great tool for visualizing them.
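Gephi can import those two lists directly as CSV files with “Id/Label” and “Source/Target/Weight” columns. Here is a minimal standard-library sketch of generating them; the character-and-scene data is made up for illustration.

```python
# Writing a node list and an edge list as CSV files that Gephi's spreadsheet
# importer understands ("Id,Label" and "Source,Target,Weight" columns).
# The character-and-scene data below is made up for illustration.
import csv
from itertools import combinations
from collections import Counter

# Hypothetical data: which characters appear in which scene.
scenes = {
    "1.1": ["Hamlet", "Horatio"],
    "1.2": ["Hamlet", "Gertrude", "Claudius"],
    "1.4": ["Hamlet", "Horatio"],
}

# Characters who share a scene share an edge; repeat meetings add weight.
edges = Counter()
for characters in scenes.values():
    for a, b in combinations(sorted(characters), 2):
        edges[(a, b)] += 1

with open("nodes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Label"])
    for name in sorted({c for chars in scenes.values() for c in chars}):
        writer.writerow([name, name])

with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(edges.items()):
        writer.writerow([a, b, weight])
```

Sorting each pair of names before counting keeps the edges undirected, so “Hamlet–Horatio” and “Horatio–Hamlet” collapse into one weighted edge.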

See some information about tone in a passage of text (sentiment analysis)

Sentiment analysis is a way of detecting tone or emotional content in texts. It can, for example, tell you how satisfied customers are based on large numbers of reviews—one of the main use cases the technology was developed for.

As with many other Natural Language Processing tasks, sentiment analysis relies on machine learning. This comes with some downsides—based on the training data the given sentiment analysis tool has been provided, it may make unexpected, inaccurate, or biased assumptions about the text you use it on. For example, the tone or emotion associated with words in formal academic writing or eighteenth-century literature is different from the tone or emotion associated with the same words in a customer review.

Additionally, out-of-the-box sentiment analysis tools often exhibit demographic bias based on their training data—assigning different scores to the same text with names traditionally associated with white men as opposed to names traditionally associated with women and people of color, for example.

There are ways of working around both of these flaws, including training a sentiment analysis tool on your own corpus. Most sentiment analysis projects are likely to require at least some programming background.
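To see the basic mechanics, and why domain vocabulary matters so much, here is a toy lexicon-based scorer in plain Python. Real tools (for example, NLTK’s VADER) use far larger lexicons or trained models; the five-word lexicon below is a made-up illustration.

```python
# A toy lexicon-based sentiment scorer in plain Python. Real tools (e.g.
# NLTK's VADER) use much larger lexicons or trained models; this made-up
# five-word lexicon only illustrates the basic mechanics.
import re

lexicon = {"great": 1, "love": 1, "fine": 1, "terrible": -1, "awful": -1}

def score(text):
    words = re.findall(r"[a-z]+", text.lower())
    return sum(lexicon.get(w, 0) for w in words)

print(score("I love this, it works great"))    # 2
print(score("Terrible battery, awful screen"))  # -2
```

Notice that this lexicon would score an eighteenth-century phrase like “an awful majesty” as negative, even though “awful” there means awe-inspiring—exactly the kind of domain mismatch described above, and the reason training on your own corpus can help.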

Figure out a project that matches your research interests but requires less effort to get texts and tools to work together

Some vendors offer tools to help you do data projects with large sets of texts or data drawn from their own content collections. These tools support a variety of tasks that may require no programming knowledge, a little programming knowledge, or a lot of programming knowledge.

  • Constellate, a tool created by ITHAKA (the people behind JSTOR), can be used for free by anyone up to a certain amount of data. Constellate supports no-programming data visualizations and allows more involved calculations and analyses with some programming.
  • Current Yale faculty, students, and staff with a netID can create their own ProQuest TDM Studio account with their Yale email address. ProQuest TDM Studio supports both no-programming data visualizations and more complex analyses with some programming.
  • Current Yale faculty, students, and staff with a netID can use their CAS login to access Gale's Digital Scholar Lab, which offers a no-programming way to generate data visualizations and run natural language processing and other analyses on Gale primary sources.
  • Current Yale faculty, students, and staff with a netID can use the Institutional login to access SketchEngine, which supports the analysis and linguistic processing of a large number of prepared language corpora and can also be used on user-generated corpora. No programming is required.