Linguistics: Data Sources

A guide to linguistics information for members of the Yale community.

Data Resources

Do you need data? We might have it!

First, there is a filter for Software & Datasets in QuickSearch Books+. Regardless of what you search for, you can navigate to that category under "Format" to see the data that we have. There's also a keyword, "Corpora (Linguistics)", that you can use in a Subject search. It will display a combination of books about corpora and actual corpora, so again, narrow down to datasets on the left. Searching for the term text corpus in All Fields will return 234 datasets. You can refine search terms and look at the subject classifications of specific records of interest (see the primer on how to do that on the (e)Books tab).

You can also use the keyword strings that Yale catalogers have created to make datasets more discoverable in the catalog. Searching on "all fields" in either the basic or advanced search will limit your results appropriately. Here are the keywords:

  • All datasets = yuldset
  • Mediated datasets (user must contact a staff person to access) = yuldsetmediated
  • Text datasets = yuldsettxt (If you do a Form/Genre search, write this instead: Text corpora)
  • Image datasets = yuldsetimg
  • Geospatial datasets = yuldsetgis (If you do a Form/Genre search, write this instead: Geospatial data)
  • Numeric datasets = yuldsetnum
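
If you keep your search notes in a script, the keyword list above can be stored as a small lookup table. The following is only a sketch: the dictionary labels and the function name are hypothetical, it does not contact the catalog, and it assumes that an All Fields search treats a topic word and a keyword typed together as a combined search.

```python
# Keyword strings copied from the list above. The dictionary keys are
# hypothetical labels chosen for this sketch, not catalog terms.
DATASET_KEYWORDS = {
    "all": "yuldset",
    "mediated": "yuldsetmediated",
    "text": "yuldsettxt",
    "image": "yuldsetimg",
    "geospatial": "yuldsetgis",
    "numeric": "yuldsetnum",
}


def build_all_fields_query(category, topic=""):
    """Return a string to paste into an All Fields search box.

    Assumption: the catalog combines the topic and the keyword when
    both appear in the same All Fields search.
    """
    keyword = DATASET_KEYWORDS[category]  # raises KeyError for unknown labels
    topic = topic.strip()
    return f"{topic} {keyword}" if topic else keyword


print(build_all_fields_query("text", "Spanish"))  # Spanish yuldsettxt
```

Paste the resulting string into either the basic or the advanced search with "all fields" selected.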

Second, a lot of data is publicly available. To find it, search the open web or consult catalogs that tell you where to find corpora on the open web.

Third, if we don't have something, please submit a purchase request via this form. We can investigate the data in more detail and get back to you about whether we can acquire it. Please note that, when it comes to licensed resources (which data typically is), it's best if you submit a request as early as possible after you identify what you need.

Where else can you go for corpora, though?

  • Do a scholarly search for what people in your research area are using. Corpora will be mentioned in both books/monographs and articles.
  • Use public internet resources.
    • Wikipedia maintains a list of text corpora in many languages. It is not a complete list, but it is useful. Wikipedia tends to be one of the more up-to-date resources when it comes to listing things like software and online information, although Wikipedia itself will skew towards publicly available resources rather than restricted datasets. It's perfectly OK to use Wikipedia for discovering things!
    • The Linguist List's Texts & Corpora page. There is some light browsing navigation by subject language, linguistic subfield, and language family.
  • The list below contains information about some existing resources; however, creating a complete list of all corpora would be overwhelming. Some of these are not strictly linguistics corpora; they're text corpora that can easily be adapted for use in a corpus linguistics context.
  • Sometimes, library databases will include text mining in the licenses. Talk to us if you see a library database of interest to you.

Accessing Linguistic Data Consortium Individually-Licensed Resources

The Linguistic Data Consortium (LDC) has many datasets that require individual user agreements. On request, we can help you get access to them. This is a list of the limited-access datasets available.

Each record includes a dataset identifier and a "Member" link.

Clicking on the "Member" link will show you what the license agreement looks like, and you can click on the dataset identifier (e.g., LDC2016S12, LDC2018T05) to view detailed information about each item.

Please email us to start the request process. Make sure you include the dataset identifier when you make your request.