Yale University Library Research Guides: Resources for Text and Data Mining: Practice Text or Data Mining

Contact Us

At any point in your project—from planning to acquiring data to troubleshooting code—you can reach out to us for help. For initial planning, we especially recommend you book a consultation with the DHLab to help you strategize.

For on-the-spot help, visit the StatLab's walk-in hours or book a consultation with the DHLab.

Follow one (or more than one) of these guided example projects:

Simple Text Mining: a guided example of text mining that doesn't require programming
Complex Text Mining: a guided example of text mining that requires some use of basic Python
Simple Data Mining: a guided example of a basic graph visualization
Complex Data Mining: a guided example of a more complex graph visualization

Simple Text Mining

Take, for example, MIT’s digitized full text of King Henry IV, part 1. We’ll plug this into Voyant Tools to generate some quick insights into the text.

Start by pasting the link to the text (http://shakespeare.mit.edu/1henryiv/full.html) into the box on Voyant Tools. You’ll notice that you can also paste multiple URLs, upload files, or even copy over text.

Then click “Reveal.”

Notice what words come up most in the cloud on the left-hand side. (You can show more or fewer words by changing the slider labeled “Terms.”) The largest are words like king, prince, lord, Henry, Falstaff, and Hotspur. Many of the other largest terms are also names or titles.

Open the full text of the play, if you haven’t already. You might notice right away why names are so prominent in our representation: the structure of theatrical dialogue requires the name of the character speaking to be listed each time they speak. (And, furthermore, every time they enter or leave the room.)

This is a key concern in working with text and data mining: making sure that you’re including only the parts of your texts or data that you’re trying to study.

You could get just the spoken words of the play by going through in a text editor and removing everything you’re not interested in studying. However, Voyant Tools also offers options to do this work automatically.

Return to the first page of Voyant Tools (you can click the home button in the upper right, or reopen the link). Again enter the link to the play in the box available (http://shakespeare.mit.edu/1henryiv/full.html).

This time, before clicking “Reveal,” click the toggle switch in the upper right of the box. This will open an options panel. Click the arrow next to “HTML.” This will offer you text boxes in which to specify content, title, and other components of the play. If you click “see documentation,” it will tell you a bit more about how to specify these.

Depending on your knowledge of HTML and CSS, this might be fairly familiar for you, or entirely new. You can see that there are a number of ways to specify which parts of the HTML document you want to include as content.

In order to see how you might narrow the play down to just the dialogue and monologues by selecting HTML elements, begin by opening the play. Right click the page and select “View Page Source.”

You should see all the tags, attributes, and content that define this page. Looking at titles, subtitles, stage directions, and other paratext, you can see that each has some of its own structure in the HTML. For example, acts and scenes are in h3 tags, or third-level headers. What characterizes the dialogue and monologues, but no other elements of the text?

There are other links and anchors (a tags), and other block quotes, but only the spoken lines of the play are anchors within block quotes. Thus we can define the content with a descendant selector, which matches an element nested anywhere inside another: blockquote a.
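As a rough illustration of what selecting “anchors within block quotes” means, here is a minimal Python sketch that uses only the standard library's html.parser. The SAMPLE_HTML string is a made-up fragment that imitates the structure described above (it is not copied from the MIT page), and the class name SpokenLineParser is hypothetical. Voyant Tools does this selection for you; the sketch only shows the logic.

```python
from html.parser import HTMLParser

# Made-up fragment imitating the page structure described above.
SAMPLE_HTML = """
<h3>ACT I</h3>
<a name="speech1"><b>KING HENRY IV</b></a>
<blockquote>
<a name="1.1.1">So shaken as we are, so wan with care,</a><br>
<a name="1.1.2">Find we a time for frighted peace to pant,</a><br>
<p><i>Exeunt</i></p>
</blockquote>
"""


class SpokenLineParser(HTMLParser):
    """Collect text found in <a> tags nested inside <blockquote> tags."""

    def __init__(self):
        super().__init__()
        self.blockquote_depth = 0  # open <blockquote> tags around the current position
        self.anchor_depth = 0      # open <a> tags that sit inside a blockquote
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.blockquote_depth += 1
        elif tag == "a" and self.blockquote_depth > 0:
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.blockquote_depth > 0:
            self.blockquote_depth -= 1
        elif tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Keep text only while we are inside an anchor inside a blockquote.
        if self.anchor_depth > 0 and data.strip():
            self.lines.append(data.strip())


parser = SpokenLineParser()
parser.feed(SAMPLE_HTML)
spoken_lines = parser.lines
print(spoken_lines)
```

The speaker name (an anchor outside any block quote) and the stage direction (inside a block quote but not in an anchor) are both skipped, which is the filtering the selector is meant to achieve.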

Return to the Voyant Tools page. For “Content” under HTML, type “blockquote a” and then click “okay.” Then click “Reveal.”

Now the word cloud should look more like you probably imagined it: some names and titles are still somewhat prominent—characters address each other by name, or refer to each other, not infrequently. However, you can also see a lot of other frequently appearing words: shall, I’ll, time, hath, love, day. These might feel more meaningful, either as indicators of the kinds of conversations happening (many describe intention, with shall or I’ll!) or perhaps themes, which words like “time” or “day” might indicate.
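If you are curious how a word cloud's sizes could be derived, the sketch below counts word frequencies in Python. It is an assumption-laden simplification, not Voyant's actual method: the STOP_WORDS set is a tiny hand-picked list invented for this example, and the function name word_frequencies is hypothetical. The text is the opening of the play.

```python
import re
from collections import Counter

# Opening lines of Henry IV, part 1.
TEXT = """So shaken as we are, so wan with care,
Find we a time for frighted peace to pant,
And breathe short-winded accents of new broils
To be commenced in strands afar remote."""

# A tiny hand-picked stop-word list (an assumption made for this example).
STOP_WORDS = {"a", "and", "are", "as", "be", "for", "in", "of", "so", "to", "we", "with"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and count the words not in stop_words."""
    words = re.findall(r"[a-z'-]+", text.lower())
    return Counter(w for w in words if w not in stop_words)


raw_counts = word_frequencies(TEXT, stop_words=set())  # nothing filtered out
counts = word_frequencies(TEXT)                        # common words removed
print(raw_counts.most_common(3))
print(counts.most_common(5))
```

Without filtering, the most frequent words are small function words such as “so” and “we”; removing them leaves the content words, which is a second kind of narrowing on top of the HTML selection above.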

From this point, you can explore what else Voyant offers by clicking various options, selecting specific terms or phrases, and so on.

As you can see from this first example, there are quick and easy ways to use text mining tools to get some overview information on a text (or lots of texts!). However, as you discovered, it’s still important to look at the output you get with a critical eye: text mining is a tool, not a definitive answer.

Complex Text Mining

Now let’s try something a little more complicated. You’ve gotten a wider sense of words and phrases that crop up a lot in Henry IV, part 1, and now you want to save yourself some work mapping a phrase or two.

We’ll use NLTK, the Natural Language Toolkit, which facilitates the analysis of texts in a variety of ways, to look at the parts of speech of the words in a sentence from the play.

Begin by downloading Python 3. Follow the instructions here, including installing pip.

Once you’ve installed Python, install NLTK by running pip install --user -U nltk.

Then run pip install --user -U numpy.

Now we can begin trying out NLTK!

Start Python by typing “python” (on some systems, “python3”) into Terminal or Windows PowerShell.

Then type import nltk.

Next, type from nltk.tokenize import word_tokenize

Finally, type from nltk import pos_tag

Now we can play with our text!

Assign our text to a variable so you can use it more easily. Use straight quotation marks, since Python does not accept curly ones: exampleSentence = """So shaken as we are, so wan with care, Find we a time for frighted peace to pant, And breathe short-winded accents of new broils To be commenced in strands afar remote."""

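NLTK can break this up into words for us easily: try word_tokenize(exampleSentence). If NLTK raises a LookupError about a missing resource, the error message names the data package to fetch with nltk.download(...); run that command and try again.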

Then we might like to see parts of speech: try pos_tag(word_tokenize(exampleSentence))

You can see that for each word NLTK has provided an abbreviated label to describe what the word is doing: “So” is an adverb (RB), “shaken” is an adjective (JJ), and so on. These tags follow the Penn Treebank Project abbreviations.
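To make the abbreviations easier to read, you could look them up in a small table. The sketch below is a hypothetical helper, not part of NLTK: PENN_TAGS is a partial, hand-typed list of Penn Treebank tags, and describe is a name invented here. It takes (word, tag) pairs, which is the shape pos_tag returns, and uses only the two pairs mentioned above as sample input.

```python
# A few Penn Treebank tags and their meanings (a partial list).
PENN_TAGS = {
    "CC": "coordinating conjunction",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "PRP": "personal pronoun",
    "RB": "adverb",
    "TO": "to",
    "VB": "verb, base form",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
}


def describe(tagged_words):
    """Turn (word, tag) pairs into readable 'word: description' strings."""
    return [
        f"{word}: {PENN_TAGS.get(tag, 'unknown tag ' + tag)}"
        for word, tag in tagged_words
    ]


# The two pairs mentioned above, in the (word, tag) shape that pos_tag returns.
sample = [("So", "RB"), ("shaken", "JJ")]
for line in describe(sample):
    print(line)
```

In an interactive session you could pass the result of pos_tag(word_tokenize(exampleSentence)) to describe instead of the hand-typed sample.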

As you can see, there are a lot of powerful tools for analyzing texts available to you, especially if you’re willing to do a little bit of programming. NLTK is just one of the libraries you can use in your work. As with Voyant Tools, though, it’s useful to remember that these are all human-built tools with specific purposes in mind, which may have attendant shortcomings.

Simple Data Mining

Let’s try a network graph—a kind of data representation, if you aren’t familiar, that emphasizes the relationships between data points. It’s a visualization that can potentially reveal new things about how those relationships operate: you may see distinct groups, for example, or individual points with substantially more connections than others.

Gephi is a desktop application that makes visualizing network graphs easy, and can be downloaded for free here. Start by downloading it and following the instructions to install.

When you open Gephi, it will offer a menu of how to start: for now, choose “New Project.”

You’ll see a blank page with columns to either side: this is where you’ll work with the network graph visually. For now, at the top of the page click “Data Laboratory.”

If you’re building your data in Gephi itself—which is mostly only going to work for small data sets—or if you want to make small corrections, changes, additions, or subtractions to your data, this will be the place to do it.

So that you understand what data Gephi requires, let’s build a small network graph by hand first.

Suppose you have learned from a friend about a group of world-class poets who were secretly time travelers. Curious about their friendships, you took notes as your friend described their interactions. Here’s what you learned:

Sappho is friends with Maya Angelou and William Butler Yeats
William Blake is friends with Shel Silverstein and Pablo Neruda
Rabindranath Tagore is friends with Maya Angelou, Langston Hughes, and Shel Silverstein
Maya Angelou is friends with Langston Hughes, Pablo Neruda, and Jalal al-Din Rumi
Langston Hughes is friends with Pablo Neruda
Shel Silverstein is friends with Jalal al-Din Rumi and Li Bai
Pablo Neruda is friends with Jalal al-Din Rumi and Li Bai

In order to better see the relationships between these poets, we’ll visualize this in Gephi. As you may see in the data laboratory, Gephi sees data in terms of “nodes” (the people, in this case) and “edges” (for this graph, the friendships). We’ll start with the nodes.

Make a list of the people described above. One strategy might be to write down each person at first occurrence, checking to make sure they’re not already on the list—you could even write them down alphabetically to make checking faster. There are also faster or more automated ways to do this for longer lists.

Your list should look something like this:

Maya Angelou
Li Bai
William Blake
Langston Hughes
Pablo Neruda
Jalal al-Din Rumi
Sappho
Shel Silverstein
Rabindranath Tagore
William Butler Yeats
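For longer lists, one of the more automated ways could look like the Python sketch below. The FRIENDSHIPS dictionary simply restates the notes above; the variable names are choices made for this example. Sorting on the last word of each name is a shortcut that happens to reproduce the order of the list above, and it would not be a reliable surname sort in general.

```python
# Friendships as described above: each person maps to the friends listed for them.
FRIENDSHIPS = {
    "Sappho": ["Maya Angelou", "William Butler Yeats"],
    "William Blake": ["Shel Silverstein", "Pablo Neruda"],
    "Rabindranath Tagore": ["Maya Angelou", "Langston Hughes", "Shel Silverstein"],
    "Maya Angelou": ["Langston Hughes", "Pablo Neruda", "Jalal al-Din Rumi"],
    "Langston Hughes": ["Pablo Neruda"],
    "Shel Silverstein": ["Jalal al-Din Rumi", "Li Bai"],
    "Pablo Neruda": ["Jalal al-Din Rumi", "Li Bai"],
}

# A set keeps each name once, no matter how often it is mentioned.
names = set(FRIENDSHIPS)
for friends in FRIENDSHIPS.values():
    names.update(friends)

# Sort by the last word of each name to match the order of the list above.
nodes = sorted(names, key=lambda name: name.split()[-1])
for name in nodes:
    print(name)
```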

Now we’ll start entering these into Gephi. Click “Add node.” Type the label for the first node (that is, the person’s name) and click “Ok.” Do this for each of the ten poets.

Now click “Edges” on the toolbar of the data table. Here is where you can see the friendships you’ll be adding to the graph.

Click “Add edge.” You’ll want to leave “undirected” selected.

A directed edge is one where something is going FROM one node to the other (say, an email being sent, or a relationship like “is the parent of”). Undirected edges, meanwhile, are the same both directions—which we might hope these friendships are!

Now we’ll capture the first relationship described above. “Sappho is friends with Maya Angelou.” For the source node, find Sappho on your list. For the target node, find Maya Angelou. Because this edge is undirected, “Source” and “Target” are interchangeable here; for a directed edge, it would matter which was which.
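One way to picture the difference between the two kinds of edge is with Python's built-in data types. This is only an analogy chosen for this example, not how Gephi stores edges: an unordered frozenset stands in for an undirected edge, and an ordered tuple stands in for a directed one.

```python
# An undirected edge: the pair has no order, so a frozenset models it well.
undirected_a = frozenset(["Sappho", "Maya Angelou"])
undirected_b = frozenset(["Maya Angelou", "Sappho"])

# A directed edge: order matters, so a (source, target) tuple is a natural fit.
directed_a = ("Sappho", "Maya Angelou")
directed_b = ("Maya Angelou", "Sappho")

print(undirected_a == undirected_b)  # True: same friendship either way round
print(directed_a == directed_b)      # False: source and target are swapped
```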

If there were different kinds of edges—say, some friendships and some rivalries—you might want to write something under “Edge Kind” to label this. As it is, we can leave this blank.

Click “Ok.”

Repeat this process for each of the friendships described in the text above.

You should end up with 15 edges. Now go back to the overview, which you can select at the top of the screen.
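To double-check the count of 15, or to prepare an edge list outside Gephi, a short Python sketch like the following could help. The FRIENDSHIPS dictionary restates the notes above. The Source, Target, and Type column headers are the ones Gephi's spreadsheet import is understood to recognize, but treat that as an assumption and confirm it in the import dialog; the sketch also assumes a fresh project in which the names themselves serve as node Ids.

```python
import csv
import io

# Friendships as described above: each person maps to the friends listed for them.
FRIENDSHIPS = {
    "Sappho": ["Maya Angelou", "William Butler Yeats"],
    "William Blake": ["Shel Silverstein", "Pablo Neruda"],
    "Rabindranath Tagore": ["Maya Angelou", "Langston Hughes", "Shel Silverstein"],
    "Maya Angelou": ["Langston Hughes", "Pablo Neruda", "Jalal al-Din Rumi"],
    "Langston Hughes": ["Pablo Neruda"],
    "Shel Silverstein": ["Jalal al-Din Rumi", "Li Bai"],
    "Pablo Neruda": ["Jalal al-Din Rumi", "Li Bai"],
}

# One (source, target) pair per friendship mentioned in the notes.
edges = [
    (person, friend)
    for person, friends in FRIENDSHIPS.items()
    for friend in friends
]

# Write the edge list as CSV text; save it to a file if you want to import it.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Source", "Target", "Type"])
for source, target in edges:
    writer.writerow([source, target, "Undirected"])

edge_csv = buffer.getvalue()
print(len(edges))
print(edge_csv)
```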

It may not look especially clear just yet, but there are a few things we can do to help us make sense of the graph. As a first step, we’ll turn on labels for the nodes.

Find the bold black “T” at the bottom of the screen. (If you hover over it, or any of the other buttons around the viewer, it will provide an explanation of what it does.) Click it to turn the labels on.

You can also rearrange the nodes to make the relationships between them a little clearer. There are ways to do this automatically using the “layout” panel on the left-hand side, but for a data set this small it might be easier to arrange them manually.

One thing that can help in looking at a network graph is to try to move the nodes so few, if any, of the edges cross. If you click and drag a node, you can move it: do this until you can start to see how the friendships are working.

You can see that there are some overlapping groups of friends, where a subset of people are each friends with each of the others, and that no group or individual is cut off from the rest. You might want to know who has the most friends within this group. You could count, but Gephi also offers a way to visualize this easily.
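Both observations can also be checked in code. The sketch below is one possible approach, with variable names invented for this example: it looks for groups of three mutual friends and then walks the graph from one person to confirm that everyone can be reached. The EDGES list restates the fifteen friendships above.

```python
from itertools import combinations

# The fifteen friendships described above, as unordered pairs.
EDGES = [
    ("Sappho", "Maya Angelou"), ("Sappho", "William Butler Yeats"),
    ("William Blake", "Shel Silverstein"), ("William Blake", "Pablo Neruda"),
    ("Rabindranath Tagore", "Maya Angelou"),
    ("Rabindranath Tagore", "Langston Hughes"),
    ("Rabindranath Tagore", "Shel Silverstein"),
    ("Maya Angelou", "Langston Hughes"), ("Maya Angelou", "Pablo Neruda"),
    ("Maya Angelou", "Jalal al-Din Rumi"),
    ("Langston Hughes", "Pablo Neruda"),
    ("Shel Silverstein", "Jalal al-Din Rumi"), ("Shel Silverstein", "Li Bai"),
    ("Pablo Neruda", "Jalal al-Din Rumi"), ("Pablo Neruda", "Li Bai"),
]

# Build an adjacency map: each person -> set of friends (friendship goes both ways).
neighbors = {}
for a, b in EDGES:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

# A triangle is three people who are all friends with one another.
triangles = [
    trio
    for trio in combinations(sorted(neighbors), 3)
    if all(y in neighbors[x] for x, y in combinations(trio, 2))
]

# Walk outward from one person to check that everyone is reachable.
start = next(iter(neighbors))
seen, frontier = {start}, [start]
while frontier:
    person = frontier.pop()
    for friend in neighbors[person] - seen:
        seen.add(friend)
        frontier.append(friend)
is_connected = len(seen) == len(neighbors)

for trio in triangles:
    print(trio)
print(is_connected)
```

Maya Angelou appears in every group of three, which is why those groups overlap.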

On the left-hand side, in the panel labeled “Appearance,” select “Nodes” (if it’s not selected already), then select the icon of circles of a variety of sizes, and then choose “Ranking.”

Under the drop-down menu, choose “Degree.” Degree refers, in the context of graphs, to the number of edges a node has—in this case, the number of friends a person has.

For min size, let’s start with 10, and for max size, try 40. Click “Apply.” Immediately, you can see that Maya Angelou and Pablo Neruda have a lot more friends in this group than, say, William Blake.
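Degree is simple enough to compute by hand or in a few lines of Python. The sketch below counts each person's edges and then maps degree onto a node size between 10 and 40. The linear scaling is an assumption made for this example and may differ from the interpolation Gephi applies; the EDGES list restates the fifteen friendships above.

```python
from collections import Counter

# The fifteen friendships described above, as unordered pairs.
EDGES = [
    ("Sappho", "Maya Angelou"), ("Sappho", "William Butler Yeats"),
    ("William Blake", "Shel Silverstein"), ("William Blake", "Pablo Neruda"),
    ("Rabindranath Tagore", "Maya Angelou"),
    ("Rabindranath Tagore", "Langston Hughes"),
    ("Rabindranath Tagore", "Shel Silverstein"),
    ("Maya Angelou", "Langston Hughes"), ("Maya Angelou", "Pablo Neruda"),
    ("Maya Angelou", "Jalal al-Din Rumi"),
    ("Langston Hughes", "Pablo Neruda"),
    ("Shel Silverstein", "Jalal al-Din Rumi"), ("Shel Silverstein", "Li Bai"),
    ("Pablo Neruda", "Jalal al-Din Rumi"), ("Pablo Neruda", "Li Bai"),
]

# Each edge adds one to the degree of both of its endpoints.
degree = Counter()
for a, b in EDGES:
    degree[a] += 1
    degree[b] += 1

# Scale degree linearly onto the min/max sizes used above (an assumed formula).
MIN_SIZE, MAX_SIZE = 10, 40
low, high = min(degree.values()), max(degree.values())
size = {
    name: MIN_SIZE + (MAX_SIZE - MIN_SIZE) * (d - low) / (high - low)
    for name, d in degree.items()
}

for name, d in degree.most_common():
    print(name, d, size[name])
```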

We’ll stop here with this example, but you can keep playing with this graph on your own. What might it reveal about this data? I’ve invented this story, and these friendships, whole cloth—can you tell from this graph? What might start to indicate it? If these were real friendships, what questions might you ask seeing this graph that didn’t occur to you reading the text?

Complex Data Mining

Example Node Data
An example of a list of nodes for a network of carbon exchange in a specific ecosystem in St Marks, prepared for Gephi.
Example Edge Data
An example of a list of edges for a network of carbon exchange in a specific ecosystem in St Marks, prepared for Gephi.

Now let’s try working with a much larger dataset.

I’ve edited down a set of network data from a 1998 paper called “Assessment of spatial and temporal variability in ecosystem attributes of the St Marks National Wildlife Refuge, Apalachee Bay, Florida” from Estuarine, Coastal, and Shelf Science by D. Baird, J. Luczkovich, and R. R. Christian. This subset of the data in the paper, representing the flow of carbon between organisms in an ecosystem, comes from Pajek data.

You can download the data as two CSV files, found above: a list of nodes, and a list of edges.

If you still have the graph of poets open, go to “File” > “New Project.” Otherwise, open Gephi and select “New Project.”

Go to “File” > “Import spreadsheet…” and find and select the node list. Click “Open.”

The general CSV options for this file are correct to begin with. If you were opening a CSV that was separated by, say, tabs or semicolons instead of commas, this would be the place to clarify that for Gephi.
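The same idea applies if you read such a file in code: the program has to be told which separator to expect. Here is a minimal Python sketch using the standard csv module. The SEMICOLON_TEXT rows are invented placeholders, not the St Marks data, and the Id and Label headers are used here only because they are the node columns this guide imports.

```python
import csv
import io

# Invented rows separated by semicolons rather than commas.
SEMICOLON_TEXT = "Id;Label\n1;Example node A\n2;Example node B\n"

# Telling the reader about the separator is the code equivalent of the
# separator option in Gephi's import dialog.
reader = csv.DictReader(io.StringIO(SEMICOLON_TEXT), delimiter=";")
rows = list(reader)

for row in rows:
    print(row["Id"], row["Label"])
```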

Click “Next.”

For the import settings, you’ll want to make sure both “Id” and “Label” are checked so both are imported. Click “Finish.”

On the import report, select “Directed” for the Graph Type (there are no edges yet, but when there are, they’ll be directed, reflecting carbon flowing one way), and select “Append to existing workspace” so the nodes are added to the data table for the workspace you already have open. Then click “OK.”

You may want to turn on node labels again, so you can see which critter is which. (Click the bold black ‘T’ in the bottom toolbar.) If the labels are too big, you can use the right-hand slider by the font options to scale them down.

Don’t worry too much about organizing the nodes just yet—we’ll get to that after the next step.

Go to “File” > “Import spreadsheet…” again. This time, open the edge list.

You can leave the general CSV options as they are and click “Next,” though do note that this edge list has not just sources and targets but also weights. A weight is how strong or important an edge is; in this case, it is the amount of carbon flow.

On the next page, Gephi will let you tell it what kind of data the “Weight” column should be considered as. If you work with Excel or have done any programming, this may be familiar. Gephi automatically selects a double, that is, a double-precision floating-point number, which can hold decimal values. This works for the data we have, so you can leave it as it is and click “Finish.”
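The same choice of data type comes up if you read an edge list in code. In the Python sketch below, every CSV value starts out as text, and the weight is converted to a float (Python's float is a double-precision number). The EDGE_TEXT rows are invented placeholders: the column names follow the edge list described above, but the ids and weights are not the real St Marks data.

```python
import csv
import io

# Invented rows: placeholder ids and weights, not the real data.
EDGE_TEXT = "Source,Target,Weight\n1,2,0.75\n1,3,12.5\n2,3,3\n"

edges = []
for row in csv.DictReader(io.StringIO(EDGE_TEXT)):
    # Source and Target stay as text ids; Weight becomes a decimal number.
    edges.append((row["Source"], row["Target"], float(row["Weight"])))

total_flow = sum(weight for _, _, weight in edges)
print(edges)
print(total_flow)
```

Note that the whole-number weight 3 is read as 3.0, so whole and fractional weights can sit in the same column.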

Again, you’ll want to make sure “Directed” is selected under “Graph Type,” and that “Append to existing workspace” is selected so the edges will combine with the node labels you imported in the previous step. Click “OK.”

You can see how entangled this graph is. It’s a little hard to read, but we’ll do a few things over the next steps to make it clearer. For now, if you want to see how a particular organism receives and provides carbon in this network, you can hover over a specific node. This will grey out everything that doesn’t directly connect to it.
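What hovering shows can be thought of as a filter over the edge list. The Python sketch below is a hypothetical illustration: the EDGES rows and organism names are invented placeholders rather than values from the St Marks data, and flows_for is a name made up for this example. Each edge is a (source, target, weight) triple, with carbon flowing from source to target.

```python
# Invented directed, weighted edges: (source, target, weight).
EDGES = [
    ("plankton", "small fish", 4.0),
    ("small fish", "large fish", 1.5),
    ("plankton", "shrimp", 2.0),
    ("shrimp", "large fish", 0.5),
]


def flows_for(node, edges):
    """Return (incoming, outgoing) flows that touch the given node directly."""
    incoming = [(source, weight) for source, target, weight in edges if target == node]
    outgoing = [(target, weight) for source, target, weight in edges if source == node]
    return incoming, outgoing


received, provided = flows_for("small fish", EDGES)
print(received)
print(provided)
```

Everything not returned by flows_for corresponds to the part of the graph that is greyed out.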

Let’s try an automatic way to organize our data a little bit. On the left side, you’ll see a panel called “Layout.” In the dropdown list, select “Fruchterman Reingold.” This is an algorithm that will push all of the nodes apart, while drawing connected nodes together. Click “Run” to see what this does.

Once the motion has settled down, click “Stop.”

It should now be just a bit easier to see all the organisms, as they are more evenly distributed.
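For readers curious about what a layout like this does under the hood, here is a simplified Python sketch of the general Fruchterman-Reingold idea: every pair of nodes repels, every edge attracts its endpoints, and a shrinking “temperature” limits movement so the picture settles. It is a textbook-style approximation written for this guide, not Gephi's implementation; the function name, parameters, and cooling rate are all assumptions.

```python
import math
import random


def fruchterman_reingold(nodes, edges, iterations=100, size=1.0, seed=0):
    """Return {node: (x, y)} after a basic force-directed layout."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(0, size), rng.uniform(0, size)] for n in nodes}
    k = size / math.sqrt(len(nodes))  # ideal distance between nodes
    temperature = size / 10           # caps how far a node moves per step

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}

        # Repulsion: every pair of nodes pushes apart.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                force = k * k / dist
                disp[u][0] += dx / dist * force
                disp[u][1] += dy / dist * force
                disp[v][0] -= dx / dist * force
                disp[v][1] -= dy / dist * force

        # Attraction: each edge pulls its two endpoints together.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            force = dist * dist / k
            disp[u][0] -= dx / dist * force
            disp[u][1] -= dy / dist * force
            disp[v][0] += dx / dist * force
            disp[v][1] += dy / dist * force

        # Move each node, limited by the temperature, and keep it in the frame.
        for n in nodes:
            dx, dy = disp[n]
            length = max(math.hypot(dx, dy), 1e-9)
            step = min(length, temperature)
            pos[n][0] = min(size, max(0.0, pos[n][0] + dx / length * step))
            pos[n][1] = min(size, max(0.0, pos[n][1] + dy / length * step))

        temperature *= 0.95  # cool down so the layout settles

    return {n: tuple(p) for n, p in pos.items()}


# A small made-up chain of four nodes to try the function on.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
layout = fruchterman_reingold(nodes, edges)
for name, (x, y) in layout.items():
    print(name, round(x, 3), round(y, 3))
```

The fixed seed makes the result repeatable; in Gephi the starting positions differ from run to run, so two runs of the layout rarely look identical.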

However, it’s hard to see which direction the carbon is flowing: because the edges are weighted, some are very small, and their arrows are almost invisible. We could scale the edges up, but this would make the largest edges so large they would obscure other parts of the graph.

Instead, we’ll repurpose a feature of Gephi to make this graph easier to read.

Under the “Data Laboratory” tab, go to the nodes table. At the bottom, click “Add column.” Call this column “Partition.”

A partition would normally be a meaningful subset of the nodes—so perhaps some nodes are individual people and some nodes are concepts they relate to, for example. Those would be two different partitions. Here, though, we’re just using partitions to assign each node its own color.

To make each node a unique partition, click “Copy data to other column” at the bottom. Select “Id,” since this is a unique number for each node. Then select “Partition” as the column you’re copying to and click “Ok.”

Now go back to the overview.

In the upper-left panel (called “Appearance”), make sure “Nodes” is selected, and that the little paint palette is selected, and then select “Partition.” In the drop-down menu, choose “Partition”—that’s the column you’ve just created. If you scroll down the list, you can see after a number of colors the rest of the nodes are grey—that’s because the standard palettes Gephi offers assume a smaller number of partitions.

To generate a color palette of the right size, click “Palette…” in the lower right corner of the panel. Then click “Generate…” in the drop-down list. In the menu that comes up, you can see that there are 54 values—so make the number of colors (on the right) 54, too. You can choose a color scheme, or “default” will give you the broadest range, which will likely be easiest to see. Then click “Generate.”
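To see why some generated colors end up looking alike, it can help to sketch how a palette of a given size might be built. The Python example below spaces hues evenly around the color wheel using the standard colorsys module. This is an assumption about one simple approach, not a description of Gephi's palette generator, and the function name and default saturation and brightness are invented for this example.

```python
import colorsys


def generate_palette(count, saturation=0.65, value=0.9):
    """Return `count` hex colors with evenly spaced hues."""
    palette = []
    for i in range(count):
        # Hue runs from 0 up to (but not including) 1 around the color wheel.
        r, g, b = colorsys.hsv_to_rgb(i / count, saturation, value)
        palette.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return palette


palette = generate_palette(54)
print(len(palette))
print(palette[:5])
```

With 54 colors, neighboring hues are less than seven degrees apart on the wheel, so adjacent entries are technically different but can be hard to tell apart by eye.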

Some of the colors may be a bit similar, but this will at least differentiate our nodes a little. Click “Ok.” Then, at the bottom of the top left panel, click “Apply.”

Now when you hover over the nodes, you can see that each edge takes the color of one of the two nodes it connects: this indicates the source node. Whichever node an edge is colored after is the one providing carbon and therefore, in most cases, being eaten.

Gephi can be a powerful tool for seeing the relationships in network data. The CSV files of nodes and edges are hard to read, whereas the visual web Gephi creates can quickly show you how interconnected your data is.

As with any visualization tool, it’s only as good as the data you import: make sure you look with a critical eye at what you’re putting in the data tables, and at what the output suggests to you.
