At any point in your project—from planning to acquiring data to troubleshooting code—you can reach out to us for help. For initial planning, we especially recommend you book a consultation with the DHLab to help you strategize.
For on-the-spot help, visit the StatLab's walk-in hours or book a consultation with the DHLab.
Tasks and methods:
If you’re looking for quick impressions, there are some easy-to-use tools that can help you with both texts and datasets. Treat these as ways of generating ideas you can then follow up on more rigorously, not as ways of generating firm answers.
If you just want to get a sense of the contents of a data file, OpenRefine is a free, downloadable tool with a simple browser-based interface that can help you identify the overall structure of your data, the kinds of values a given field might contain, and similar information.
For texts, Voyant Tools can be a great way to identify commonly occurring words and phrases, and even words and phrases that frequently appear together. You can input URLs or full texts, or upload text files. This might be a fast way to see what motifs appear together, for example.
For data sets, often generating a visualization is a quick way to see what you’re working with. There’s nothing wrong with using tools you’re already familiar with. Excel, which you have access to through Yale, offers conditional formatting and some simple visualizations (“charts”) that may help you see patterns in your data.
If you have network data—or data you can reformat as graph data—Gephi is a free desktop application that can visualize it relatively easily.
Named Entity Recognition is a Natural Language Processing (NLP) method for identifying named, unambiguous things—this might be a person, a place, an organization, a concept, or anything else where the word refers to one specific thing. So “Peter Salovey” would be a named entity—“university presidents” would not. Current Named Entity Recognition systems rely on machine learning.
Unfortunately, likely due to the training data—the sets of texts the systems were taught to identify entities based on—the results of Named Entity Recognition reflect demographic bias. For example, names more often associated with white men are recognized with greater accuracy than names more often associated with women and people of color.
You may decide that Named Entity Recognition is still helpful for your project. Although it will require some programming, there are tools to help you. Stanford CoreNLP (for Java) and spaCy (for Python) are both popular, and might be a good place to start.
If you’re trying to find frequently occurring words or phrases in a text, there are easy web-based ways to do this. WordCounter, from DataBasic, will generate a quick word cloud of words and phrases from a pasted text, a URL, or a file upload. If you want a little more information, or a little more control, Voyant Tools will also generate a word cloud as well as other data and charts if you paste in a text, type in a URL or set of URLs, or upload a file.
Be aware that, as with any text or data mining process, you’re stepping a level of abstraction away from your texts so you can see things you might not otherwise; the results shouldn’t be treated as answers in themselves. Similarly, you might get unexpected results that “mean” something other than you expected, like finding that character names are the most-used words in an unedited theatrical script.
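If you’d rather work in code, a few lines of Python can produce the same kind of counts the web tools give you. This is a toy sketch on a made-up sentence; a real project would add stopword removal and more careful tokenization (with NLTK, for example).

```python
# Toy word- and phrase-frequency count in plain Python.
import re
from collections import Counter

text = "To be, or not to be, that is the question."
words = re.findall(r"[a-z']+", text.lower())

word_counts = Counter(words)                 # single-word frequencies
bigram_counts = Counter(zip(words, words[1:]))  # two-word phrase frequencies

print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
```

The `most_common` lists are the raw material for a word cloud or a frequency chart, and they are exactly the kind of output you should read as a prompt for further questions rather than a conclusion.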
There are a few ways you might go about identifying relationships within a data set or a group of texts.
When you want to see relationships in data or texts, you might also be thinking of something called a network graph. A network graph is a kind of data visualization that maps things (vertices, or nodes) and their relationships to each other (edges, or links). If you want to create a network graph, you will likely have to work with your data, by hand or with a script, before you plug it into anything. This will involve generating two lists: one of what your things are (people, places, even concepts), and one of how they connect to each other. Tools like OpenRefine or even NLTK might help you get these two lists from your raw data. Once you’re set, Gephi is a great tool for visualizing this data.
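The two lists Gephi expects can be plain CSV files. Here is a sketch that writes a node list and an edge list in the column format Gephi’s spreadsheet importer recognizes; the character names and connections are invented placeholder data.

```python
# Sketch: write a node list and an edge list as CSV files for Gephi.
# The names and relationships below are placeholder data.
import csv

nodes = [("n1", "Hamlet"), ("n2", "Horatio"), ("n3", "Ophelia")]
edges = [("n1", "n2"), ("n1", "n3")]

with open("nodes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Label"])       # column names Gephi recognizes
    writer.writerows(nodes)

with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target"])  # each row is one connection
    writer.writerows(edges)
```

In Gephi, you would import `nodes.csv` and `edges.csv` through the Data Laboratory’s import spreadsheet feature and then experiment with layouts.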
Sentiment analysis is a way of detecting tone or emotional content in texts. It can, for example, tell you how satisfied customers are based on large numbers of reviews—one of the main use cases the technology was developed for.
As with many other Natural Language Processing tasks, sentiment analysis relies on machine learning. This comes with some downsides: depending on the training data a given sentiment analysis tool has been provided, it may make unexpected, inaccurate, or biased assumptions about the text you use it on. For example, the tone or emotion associated with words in formal academic writing or eighteenth-century literature is different from the tone or emotion associated with the same words in a customer review.
Additionally, out-of-the-box sentiment analysis tools often exhibit demographic bias based on their training data—assigning different scores to the same text with names traditionally associated with white men as opposed to names traditionally associated with women and people of color, for example.
There are ways of working around both of these flaws, including training a sentiment analysis tool on your own corpus. Most sentiment analysis projects are likely to require at least some programming background.
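To make the core idea concrete, here is a toy lexicon-based scorer in Python. The tiny lexicon is invented for illustration; real tools (VADER in NLTK, for example) use much larger weighted lexicons plus rules for negation and intensifiers, and the mismatch between a lexicon and your texts’ domain is exactly where the problems described above arise.

```python
# Toy lexicon-based sentiment scorer. The lexicon here is invented
# for illustration; real tools use far richer lexicons and models.
import re

LEXICON = {"great": 1, "love": 1, "helpful": 1,
           "terrible": -1, "awful": -1, "broken": -1}

def sentiment_score(text):
    """Sum lexicon values for each word; >0 positive, <0 negative."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(LEXICON.get(w, 0) for w in words)

print(sentiment_score("Great product, love it"))          # positive score
print(sentiment_score("Terrible and broken on arrival"))  # negative score
```

Notice that any word missing from the lexicon scores zero, so vocabulary the tool never saw during training simply disappears from the result, which is one reason training on your own corpus can help.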
Some vendors offer tools to help you run data projects on large sets of texts or data drawn from their own collections. These support a variety of tasks that may require no programming knowledge, a little programming knowledge, or a lot.