Do you need data? We might have it!
First, QuickSearch Books+ has a filter for Software & Datasets: regardless of what you search for, you can navigate to that category under "Format" to see the data that we have. There's also a keyword, "Corpora (Linguistics)", that you can use in a Subject search. The results will be a combination of books about corpora and actual corpora, so again, narrow down to datasets on the left. An All Fields search for the term text corpus will return 234 datasets. You can refine your search terms and look at the subject classifications of specific records of interest (see the primer on how to do that on the (e)Books tab).
You can also use the keyword strings that Yale catalogers have created to make datasets more discoverable in the catalog. Searching on "all fields" in either the basic or advanced search will limit your results appropriately. Here are the keywords:
Second, a lot of data is publicly available. This means searching the open web or catalogs that tell you where to find corpora on the open web.
Third, if we don't have something, please submit a purchase request via this form. We can investigate the data in more detail and get back to you about whether we can acquire it. Please note that, when it comes to licensed resources (which data typically is), it's best if you submit a request as early as possible after you identify what you need.
Where else can you go for corpora, though?
The Linguistic Data Consortium (LDC) has many datasets that require individual user agreements. On request, we can help you get access to them. This is a list of the limited-access datasets available.
Each record will look like this:
Clicking on the "Member" link will show you what the license agreement looks like, and you can click on the dataset identifier (e.g., LDC2016S12, LDC2018T05) to view detailed information about each item.
Please email us to start the request process. Make sure you include the dataset identifier when you make your request.