
Resources for Text and Data Mining: Working with Text and Data Files

Working with Text and Data Files

You will likely need to do some work with your texts or data before you can plug them into the tools you're using for text and data mining. Tools like OpenRefine can help you reformat your data, while understanding the file format you're using can help you decide how to proceed. Sometimes there may be tools available online to help you convert your data—other times you might want to write a script.

Contact Us

At any point in your project—from planning to acquiring data to troubleshooting code—you can reach out to us for help. For initial planning, we especially recommend you book a consultation with the DHLab to help you strategize.

For on-the-spot help, visit the StatLab's walk-in hours or book a consultation with the DHLab.

Working with HTML or XML Documents

HTML and XML documents have the distinct advantage over plain text of being structured data—but that structure is only as useful as it’s been designed to be. A good first step is to read through the document you have and see if you can identify patterns to the data you’re looking for. (If you're accessing HTML on the Internet, you can right click on a page and select "View Page Source" to see this before you've done any further work.) Beautiful Soup is a popular Python library that can help in getting at the data in an XML or HTML file.
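As a minimal sketch of pulling data out of a structured document, the example below parses a small, made-up XML fragment. It uses Python's built-in xml.etree.ElementTree module in place of Beautiful Soup so that it runs without installing anything. The element names (places, place, label) and the sample values are hypothetical; substitute the patterns you find in your own document.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; real files will have their own element names.
SAMPLE_XML = """
<places>
  <place id="0192833"><label>New Haven</label></place>
  <place id="0192834"><label>Hartford</label></place>
</places>
"""


def extract_labels(xml_text):
    """Return a list of (id, label) pairs, one for every <place> element."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for place in root.iter("place"):
        # The id is an attribute; the label is the text of a child element.
        label = place.findtext("label", default="")
        pairs.append((place.get("id"), label))
    return pairs


labels = extract_labels(SAMPLE_XML)
print(labels)
```

Beautiful Soup follows a similar find-and-iterate pattern and is generally more forgiving of messy, real-world HTML, so it may be the better choice for pages pulled from the web.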

Working with CSV Files

CSV (comma- or character-separated values) files are a simple way to store tabular data. Programs like Excel or Gephi can use them easily, and they can also be opened, edited, or even generated in Python or R scripts. Columns are separated by a delimiter (usually a comma) and rows by line breaks. A table might look like:

ID      Name
059201  Roberta
938291  Frank

And in a CSV file, it would look like:

ID,Name
059201,Roberta
938291,Frank

 

CSVs can be fragile—if you miss a comma, or get values out of order, it can be hard to sort out—but they can also facilitate a lot of great data work.
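One common way a CSV breaks is a value that itself contains a comma. As a rough sketch, Python's built-in csv module quotes such values when writing, so the columns stay aligned. The name "Roberta, Jr." and the file name in the comment are invented for illustration.

```python
import csv
import io

# Rows to write; the second name is invented to show a value containing a comma.
rows = [
    ["ID", "Name"],
    ["059201", "Roberta, Jr."],
    ["938291", "Frank"],
]

# Write to an in-memory buffer so the example is self-contained.
# For a real file, something like open("people.csv", "w", newline="") would be used.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original values, comma included.
round_trip = list(csv.reader(io.StringIO(csv_text)))
```

Letting a library write the file, rather than joining strings with commas by hand, avoids most of the missed-comma problems described above.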

If you’re trying to create a CSV file, Excel or Google Sheets are both easy places to start. You can work with the resulting file in Python or R the same way you would work with any text file.
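For instance, here is a minimal sketch of reading the example table in Python with the built-in csv module, which handles the comma splitting for you. The text is held in a string so the example is self-contained; the file name in the closing comment is hypothetical.

```python
import csv
import io

CSV_TEXT = "ID,Name\n059201,Roberta\n938291,Frank\n"


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(CSV_TEXT)
for row in rows:
    # Values are read as strings, so the leading zero in 059201 is preserved.
    print(row["ID"], row["Name"])

# For a file on disk, something like this should work:
# with open("people.csv", newline="") as f:
#     rows = list(csv.DictReader(f))
```

Because every value comes back as a string, convert columns to numbers yourself only where that makes sense; an ID such as 059201 would lose its leading zero if treated as an integer.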

If you have a CSV file and you want to quickly get some information about it, OpenRefine is a free, downloadable tool with a simple interface that can help you sort through big data sets, CSV files included.

Working with JSON Files

JSON (JavaScript Object Notation) is a way of formatting data that you are likely to get from APIs. If you’re familiar with data structures from programming, it looks a lot like a dictionary. Here’s an example of a JSON object:

{"id": "0192833", "label": "New Haven", "coordinates": "41°18′36″N 72°55′25″W"}

If you want to get a quick sense of a JSON file, you can try opening it in Firefox, which will automatically format it in an easy-to-read way. If you're working with a very large file, OpenRefine is a free, downloadable tool with a simple interface you can use to get an overview of your data.

Both Python and R have packages you can use to make working with JSON easier. For Python, the package is built-in. The syntax for using it looks like:

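import json
# a JSON object, written as a string
JSONobject = '{"id": "0192833", "label": "New Haven"}'
# parse the JSON string into a Python dictionary
dictionary = json.loads(JSONobject)
# turn a Python dictionary back into a JSON string
JSONobject = json.dumps(dictionary)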

You can then work with that dictionary as you would any other.
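As a short sketch of what that can look like, the example below parses a shortened version of the New Haven object, reads a value, adds a key, and writes the result back out as JSON. The "state" field is invented for illustration and is not part of the original data.

```python
import json

# A shortened version of the example object, written as a JSON string.
json_text = '{"id": "0192833", "label": "New Haven"}'

place = json.loads(json_text)

# Look up values by key, as with any dictionary.
label = place["label"]

# Add a new key; "state" is an invented field for illustration.
place["state"] = "Connecticut"

# Serialize the updated dictionary back into a JSON string.
updated_json = json.dumps(place)
print(updated_json)
```

Note that the id stays a string after parsing because it was quoted in the JSON, which keeps its leading zero intact.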

 

For R, you’ll need to download a package. You can find several online—many people use jsonlite. The syntax for using it looks like:

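library(jsonlite)
# a JSON array of records, written as a string
JSONobject <- '[{"id": "059201", "name": "Roberta"}, {"id": "938291", "name": "Frank"}]'
# turn the JSON array into a Data Frame (a single JSON object becomes a named list instead)
dataframe <- fromJSON(JSONobject)
# turn a Data Frame into a JSON string
JSONobject <- toJSON(dataframe)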

Working with a Different File Format

If you’re looking to work with another data format, a web search for the name of the format together with the programming language you’re working in will often turn up helpful information. If you need help, the StatLab and the DHLab both offer consultations.