You will likely need to do some work with your texts or data before you can plug them into the tools you're using for text and data mining. Tools like OpenRefine can help you reformat your data, while understanding the file format you're using can help you decide how to proceed. Sometimes there may be tools available online to help you convert your data—other times you might want to write a script.
At any point in your project—from planning to acquiring data to troubleshooting code—you can reach out to us for help. For initial planning, we especially recommend you book a consultation with the DHLab to help you strategize.
For on-the-spot help, visit the StatLab's walk-in hours or book a consultation with the DHLab.
HTML and XML documents have a distinct advantage over plain text: they are structured data. That structure, though, is only as useful as it’s been designed to be. A good first step is to read through the document you have and see if you can identify patterns in the data you’re looking for. (If you're accessing HTML on the Internet, you can right-click on a page and select "View Page Source" to see this before you've done any further work.) Beautiful Soup is a popular Python library that can help you get at the data in an XML or HTML file.
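As a rough sketch of what "identifying patterns" can look like in code, the example below walks a small, made-up XML document and pulls the same fields out of each repeated element. It uses Python's standard-library xml.etree.ElementTree rather than Beautiful Soup so that it runs without installing anything; the document, tag names, and attribute names are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: the <letter>, <sender>, and <body> tags are invented for this sketch.
SAMPLE_XML = """
<letters>
  <letter year="1862">
    <sender>Roberta</sender>
    <body>The weather has turned cold.</body>
  </letter>
  <letter year="1863">
    <sender>Frank</sender>
    <body>We arrived safely on Tuesday.</body>
  </letter>
</letters>
"""


def extract_letters(xml_text):
    """Return one dict per <letter> element, relying on the repeated structure."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for letter in root.findall("letter"):
        records.append({
            "year": letter.get("year"),            # attribute on the element
            "sender": letter.findtext("sender"),   # text of a child element
            "body": letter.findtext("body"),
        })
    return records


letters = extract_letters(SAMPLE_XML)
for record in letters:
    print(record["year"], record["sender"], record["body"])
```

Real-world HTML is often not well-formed XML, which is one reason a more forgiving parser such as Beautiful Soup is popular for web pages.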
CSV files (comma- or character-separated value files) are a simple way to store tabular data. Programs like Excel or Gephi can use them easily, and they can also be opened, edited, or even generated in Python or R scripts. They separate columns with a delimiter (usually a comma) and rows with line breaks. A table might look like:
ID | Name |
059201 | Roberta |
938291 | Frank |
And in a CSV file, it would look like:
ID,Name
059201,Roberta
938291,Frank
CSVs can be fragile: if you miss a comma or get values out of order, the values on that row land in the wrong columns, and the problem can be hard to track down. Even so, they can facilitate a lot of great data work.
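One way to catch a missing comma early is to check that every row has as many fields as the header. The sketch below does that with Python's built-in csv module; the function name find_bad_rows is made up for this example, and the sample data is the small table above with a comma deliberately removed from the last row.

```python
import csv
import io

# The sample table, with the comma deliberately missing on the last row.
BROKEN_CSV = "ID,Name\n059201,Roberta\n938291 Frank\n"


def find_bad_rows(csv_text):
    """Return (line_number, row) pairs whose field count differs from the header's."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    bad_rows = []
    for row in reader:
        if len(row) != len(header):
            # reader.line_num is the number of source lines read so far
            bad_rows.append((reader.line_num, row))
    return bad_rows


problems = find_bad_rows(BROKEN_CSV)
for line_number, row in problems:
    print(f"Line {line_number} has {len(row)} field(s): {row}")
```

A check like this only finds rows with the wrong number of fields; values that are present but in the wrong order still need a human eye.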
If you’re trying to create a CSV file, Excel or Google Sheets are both easy places to start. You can work with the resulting file in Python or R the same way you would work with any text file.
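As a minimal sketch of reading and generating a CSV in Python, the example below uses the built-in csv module on the sample table. It works on an in-memory string so it runs as-is; with a real file you would pass an open file object instead (the csv documentation recommends opening it with newline="").

```python
import csv
import io

CSV_TEXT = "ID,Name\n059201,Roberta\n938291,Frank\n"

# Reading: DictReader treats the first row as column names.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
for row in rows:
    # Every value is read as a string, so the leading zero in the ID is kept.
    print(row["ID"], row["Name"])

# Writing: DictWriter turns dictionaries back into CSV lines.
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["ID", "Name"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
generated = output.getvalue()
print(generated)
```

Because the reader returns strings, an ID like 059201 keeps its leading zero unless you convert it to a number yourself.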
If you have a CSV file and you want to quickly get some information about it, OpenRefine is a free, downloadable tool with a simple interface that can help you sort through big data sets, CSV files included.
JSON (JavaScript Object Notation) is a way of formatting data that you are likely to get from APIs. If you’re familiar with data structures from programming, it looks a lot like a dictionary. Here’s an example of a JSON object:
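For illustration, the record below is made up (it reuses the names from the CSV table and does not come from any real API). It is held in a Python string so that the standard json module can confirm it is valid JSON and show the dictionary-like shape.

```python
import json

# Hypothetical JSON object; every key and value here is invented for illustration.
SAMPLE_JSON = """
{
  "id": "059201",
  "name": "Roberta",
  "active": true,
  "tags": ["letters", "1862"],
  "address": {"city": "Springfield", "state": "IL"}
}
"""

record = json.loads(SAMPLE_JSON)   # raises an error if the text is not valid JSON
print(type(record).__name__)       # the JSON object is parsed into a Python dict
print(sorted(record.keys()))
```

Keys are always strings in double quotes, while values can be strings, numbers, true/false, null, lists, or further nested objects.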
If you want to get a quick sense of a JSON file, you can try opening it in Firefox, which will automatically format it in an easy-to-read way. If you're working with a very large file, OpenRefine is a free, downloadable tool with a simple interface you can use to get an overview of your data.
Both Python and R have packages you can use to make working with JSON easier. For Python, the package (json) is built in: json.loads() turns a JSON string into a dictionary, and json.load() does the same for an open file.
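As a minimal sketch of Python's built-in json module, the example below parses a JSON string into a dictionary and pulls values out of it. The response text is invented for illustration, and "data.json" in the comment is a placeholder path.

```python
import json

# Hypothetical API response, invented for illustration.
response_text = (
    '{"count": 2, "people": ['
    '{"id": "059201", "name": "Roberta"}, '
    '{"id": "938291", "name": "Frank"}]}'
)

data = json.loads(response_text)   # JSON string -> Python dictionary
names = [person["name"] for person in data["people"]]
print(names)

# For a file on disk, json.load takes an open file object instead:
#     with open("data.json", encoding="utf-8") as f:   # placeholder path
#         data = json.load(f)

round_trip = json.dumps(data, indent=2)   # dictionary -> formatted JSON string
print(round_trip)
```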
You can then work with that dictionary as you would any other.
For R, you’ll need to install a package. You can find several online; many people use jsonlite, whose fromJSON() function reads a JSON string, file, or URL into an R object.