Resources for Text and Data Mining: Home

Welcome

This is a brief guide to the steps of a text or data mining project, with advice, suggestions, and library resources to help you plan each step. It is not meant to be a definitive or exhaustive explanation of all possible projects or techniques.

If you have questions, want help planning your project, or are looking for advice or information on a text or data mining topic not listed, please reach out to us.

Text and Data Mining: A Definition

Text and data mining describes a set of techniques or processes by which you can automatically take a large amount of information—either in a natural language such as English or in the form of statistical data—and structure and analyze it to identify patterns, relationships, and other new information about the texts or data.

If this sounds broad, that's because it is! Some techniques are more traditionally considered text or data mining than others (which ones can depend on the context you're working in), but don't worry too much about whether what you're doing counts.

Text and data mining projects do have some overlap in the basic steps you're likely to take, which include:

  1. Finding and retrieving a data set or set of texts: this could mean downloading a spreadsheet, scraping the web, or accessing an API
  2. Getting the data or texts ready to be used: this might involve, for example, removing extraneous information from texts or converting data into the format your tools expect
  3. Mining the texts or data: figuring out something new about them as a whole using a script or tool
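
To make these steps concrete, here is a minimal sketch in Python of what a very small text mining project might look like. It is only an illustration, not a recommended method: the sample texts are made up and stand in for whatever you would actually download or scrape, the function names `clean` and `mine` are hypothetical, and counting word frequencies is just one of many possible ways to "mine" a set of texts.

```python
import re
from collections import Counter

# Step 1: find and retrieve texts. These made-up strings stand in for
# documents you might download, scrape from the web, or fetch from an API.
raw_texts = [
    "<p>The whale surfaced near the ship.</p>",
    "<p>The ship followed the whale, and the whale dove.</p>",
]


def clean(text):
    """Step 2: remove extraneous markup, lowercase, and split into words."""
    no_tags = re.sub(r"<[^>]+>", " ", text)  # drop anything that looks like an HTML tag
    return re.findall(r"[a-z']+", no_tags.lower())


def mine(texts):
    """Step 3: count how often each word appears across all of the texts."""
    counts = Counter()
    for text in texts:
        counts.update(clean(text))
    return counts


word_counts = mine(raw_texts)
print(word_counts.most_common(3))
# → [('the', 5), ('whale', 3), ('ship', 2)]
```

In a real project each step is usually far more involved (for instance, you would likely filter out very common words such as "the"), but the overall shape of retrieve, prepare, and mine tends to stay the same.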

Contact Us

At any point in your project—from planning to acquiring data to troubleshooting code—you can reach out to us for help. For initial planning, we especially recommend you book a consultation with the DHLab to help you strategize.

For on-the-spot help, visit the StatLab's walk-in hours or book a consultation with the DHLab.