Yale University Library Research Guides: Linguistics: Data Sources

Data Resources

Locating Corpus Data
Other Data and Statistics Resources

Do you need data? We might have it!

First, the library catalog has a filter for Software & Datasets. Regardless of what you search for, you can navigate to that category under "Format" to see the data that we have. There's also a keyword, "Corpora (Linguistics)", that you can use in a Subject search. The results will combine books about corpora with actual corpora, so narrow down to datasets using the filter on the left. Searching for the term "text" with the dataset filter applied will return several hundred datasets. You can refine your search terms and look at the subject classifications of specific records of interest (see the primer on how to do that on the (e)Books tab).

You can also use the keyword strings that Yale catalogers have created to make datasets more discoverable in the catalog. Searching on "all fields" in either the basic or advanced search will limit your results appropriately. Here are the keywords:

All datasets = yuldset
Mediated datasets (user must contact a staff person to access) = yuldsetmediated
Text datasets = yuldsettxt (If you do a Form/Genre search, write this instead: Text corpora)
Image datasets = yuldsetimg
Geospatial datasets = yuldsetgis (If you do a Form/Genre search, write this instead: Geospatial data)
Numeric datasets = yuldsetnum
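
If you keep a list of searches to run, the keywords above can be stored in a small lookup table. The following is a minimal Python sketch, not a library tool: the keyword strings and Form/Genre terms come from the list above, while the function name all_fields_query and the assumption that the catalog accepts a keyword followed by a space-separated topic term are illustrative only.

```python
# Keyword strings copied from the list above; everything else is illustrative.
DATASET_KEYWORDS = {
    "all": "yuldset",
    "mediated": "yuldsetmediated",
    "text": "yuldsettxt",
    "image": "yuldsetimg",
    "geospatial": "yuldsetgis",
    "numeric": "yuldsetnum",
}

# Form/Genre equivalents given in the list above (only two are listed).
FORM_GENRE_TERMS = {
    "text": "Text corpora",
    "geospatial": "Geospatial data",
}


def all_fields_query(dataset_type: str, topic: str = "") -> str:
    """Return a string to paste into an "all fields" search box.

    Assumption: the catalog combines space-separated terms in one search.
    """
    keyword = DATASET_KEYWORDS[dataset_type]
    return f"{keyword} {topic}".strip()


if __name__ == "__main__":
    print(all_fields_query("text", "Swedish"))
    print(all_fields_query("numeric"))
```

Paste the resulting string into the "all fields" box of the basic or advanced search. For a Form/Genre search, use the matching entry in FORM_GENRE_TERMS instead.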

Second, a lot of data is publicly available. This means searching the open web or catalogs that tell you where to find corpora on the open web.

Third, if we don't have something, please submit a purchase request via this form. We can investigate the data in more detail and get back to you about whether we can acquire it. Please note that, when it comes to licensed resources (which data typically is), it's best if you submit a request as early as possible after you identify what you need.

Where else can you go for corpora, though?

Do a scholarly search for what people in your research area are using. Corpora will be mentioned in both books/monographs and articles.
Use public internet resources.
- Wikipedia maintains a list of text corpora in many languages. It is not a complete list, but it is useful. Wikipedia tends to be one of the more up-to-date resources when it comes to listing things like software and online information, although Wikipedia itself will skew towards publicly available resources and unrestricted datasets. It's perfectly OK to use Wikipedia for discovering things!
- The Linguist List's Corpora Archive. You can browse by thread, subject, author, and date.
The list below contains information about some existing resources; however, creating a complete list of all corpora would be very overwhelming. Some of these are not strictly linguistics corpora; they're text corpora that can easily be adapted for use in a corpus linguistics context.
Sometimes, library databases will include text mining in the licenses. Talk to us if you see a library database of interest to you.

Aboriginal Studies Electronic Data Archive
Contains digital materials (1980s-2000s) about Australian Indigenous languages, such as dictionaries, grammars, and teaching materials.
Alex: Catalogue of Electronic Texts
Public domain and open access documents (mostly Western philosophy and English-language literature).
ALMA eBooks
African Language Materials Archive (ALMA). Both this page and the expanded web site (linked from the ebook list) have a variety of African-language materials. Bamanankan, Criol, Hausa, Wolof, and other languages are represented here.
ARTFL
A service that provides members with access to digitized French resources, with additional support for the "Dictionnaires d'autrefois" and other resources.
CELT: Corpus of Electronic Texts
A searchable online text base consisting of 19 million words across 1,629 documents from many time periods, with content ranging from literature and the arts to medicine.
Corpus Bambara de Référence
A reference corpus for the Bambara language, spoken in Mali. This corpus consists of a range of texts in the public and private spheres. In 2012, the corpus consisted of 1,100,000 words. (Note: the corpus documentation is in French.)
Corpus Cyrillo-Methodianum Helsingiense
A corpus of Old Church Slavonic (OCS) texts encoded in 7-bit ASCII.
Corpus Maninka de Référence
As of 2016, this corpus contained 3,105,879 words for N'Ko and 396,389 words for Maninka. (Note: The documentation for this resource is in French.)
Corpus of Contemporary American English
The Corpus of Contemporary American English (COCA) is a genre-balanced corpus of American English. The corpus contains more than one billion words of text (20 million words each year 1990-2019) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, and (since the March 2020 update) TV and movie subtitles, blogs, and other web pages.
ECI Multilingual Corpus
This is documentation about a CD-ROM from ELSNET that has been available since 1994. It contains some German, French, Dutch, English, and Spanish documents, primarily newspaper text and bulletins from the International Labor Organization.
First World War Poetry Digital Archive
Over 7,000 items of text, images, audio, and video available for research.
Gigaword Corpus
Icelandic corpus resource. All of the documentation is in Icelandic.
Goteborg Language Bank of Swedish
The Swedish Language Bank is a research unit that focuses on methodologies for handling the Swedish language and the development of linguistics resources/tools for Swedish. You can go here for the DReaM, ALZ-RJ, Culturomics, and Diabase tools. (Note: Some of the documentation is in Swedish.)
Humanities Text Initiative
This organization creates, delivers, and maintains a corpus of electronic texts.
International Corpus of English (ICE)
The Survey of English Usage gathers samples of naturally-occurring language to describe and analyze. ICE and ICE-GB (where "GB" means the British component) are available here. You can read more about the Survey of English Usage at this link.
Japanese Text Initiative
Japanese-language electronic texts. Note the conditions of use link on the main page.
Linguistic Data Consortium
The LDC is an open consortium of universities, libraries, corporations, and government research labs that was formed to improve data for language technology research and development. The catalog contains hundreds of holdings, with corpora that either stand on their own or were used for specific projects. It has a helpful search page.
Maninka
This site is in the Cyrillic alphabet. It contains information on the Maninka language and a corpus.
Online Books Page
A resource curated by a digital library planner/researcher at UPenn that facilitates access to freely-readable books online.
Oxford Text Archive
Electronic literary and linguistics resources for use in higher education, including research, teaching, and learning. It will soon be moving to a new web interface, so stay tuned.
Penn Corpora of Historical English
The Penn Parsed Corpora of Historical English includes Middle English, Early Modern English, and Modern British English texts and text samples.
Perseus Digital Library
A Tufts University-supported resource in Classics. You can download source files for many Latin and Greek texts to analyze them.
Project Libellus
Some public-domain texts in classical Latin and Greek that can be used in research.
Sketch Engine
"Sketch Engine is the ultimate tool to explore how language works. Its algorithms analyze authentic texts of billions of words (text corpora) to identify instantly what is typical in language and what is rare, unusual or emerging usage. It is also designed for text analysis or text mining applications. Sketch Engine is used by linguists, lexicographers, translators, students and teachers. It is a first choice solution for publishers, universities, translation agencies and national language institutes throughout the world. Sketch Engine contains 600 ready-to-use corpora in 90+ languages, each having a size of up to 60 billion words to provide a truly representative sample of language."--Publisher's website
TDM Studio
TDM Studio is a text mining and visualization interface for ProQuest products. We receive many full-text databases through ProQuest, including ones that cover the news. See the link for information on how to use this.
Text & Data Mining at Yale
This is a brief guide to the steps of a text or data mining project, with advice, suggestions, and library resources targeted towards planning your project through each step.
TranslateFX Chinese-English Parallel Corpora
These Chinese-English parallel corpora downloads were developed by TranslateFX researchers and linguists for public use. The corpora are made up of aligned sentence pairs from bilingual texts, covering the financial and legal domains in Hong Kong. The sources include government legislation and regulations, stock exchange announcements, financial offering documents, regulatory filings, regulatory guidelines, corporate constitutional documents, and others. All the texts are from the Hong Kong Stock Exchange, the Securities and Futures Commission of Hong Kong, and Hong Kong government websites.
Your Dictionary's Romance Languages Portal
A language dictionaries listing. Includes some dialects of various Romance languages.

CLICS³: Database of Cross-Linguistic Colexifications
CLICS³ is an online database of colexifications (polysemies or homophonies) in 3156 language varieties. This database builds on the original CLICS database to improve the data and make it more useful to the community. Among other research applications, CLICS³ is useful for studies on semantic change, patterns of conceptualization, and linguistic paleontology.
IDEA: International Dialects of English Archive
Primary-source recordings of English-language dialects and accents. They have over 1,500 samples from 120 countries and territories. Both native English speakers and those who speak English as a second language are included.
SIL International's Ethnologue
In February 2023, Ethnologue released its 26th edition of statistics of living languages of the world, including the number of speakers, places spoken, dialects, linguistic affiliations, and more.

Accessing Linguistic Data Consortium Individually-Licensed Resources

The Linguistic Data Consortium (LDC) has many data sets that require individual user agreements. On request, we can help you get access to them. This is a list of the limited-access data sets available.

Each record in the list includes a "Member" link and a dataset identifier.

Clicking on the "Member" link will show you what the license agreement looks like, and you can click on the dataset identifier (e.g., LDC2016S12, LDC2018T05) to view detailed information about each item.

Please email us to start the request process. Make sure you include the dataset identifier when you make your request.
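
If you are requesting several datasets, it can help to check the identifiers before you write to us. The sketch below is only an illustration: the pattern (the letters LDC, a four-digit year, one capital letter, and two digits) is an assumption inferred from the two example identifiers above, not an official LDC specification, and the helper names find_ldc_ids and draft_request are hypothetical.

```python
import re

# Assumption: identifiers look like "LDC" + 4-digit year + one capital
# letter + 2 digits, inferred only from the two examples in this guide.
LDC_ID_PATTERN = re.compile(r"\bLDC\d{4}[A-Z]\d{2}\b")


def find_ldc_ids(text: str) -> list[str]:
    """Return identifier-like strings in order of first appearance, without duplicates."""
    seen: list[str] = []
    for match in LDC_ID_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def draft_request(ids: list[str]) -> str:
    """Build the body of a request email that lists each dataset identifier."""
    if not ids:
        raise ValueError("Include at least one dataset identifier in the request.")
    lines = [
        "Hello,",
        "",
        "I would like to request access to the following LDC datasets:",
    ]
    lines += [f"- {dataset_id}" for dataset_id in ids]
    return "\n".join(lines)


if __name__ == "__main__":
    notes = "Need LDC2016S12 and LDC2018T05 (LDC2016S12 is the priority)."
    print(draft_request(find_ldc_ids(notes)))
```

An identifier that does not match the pattern is not necessarily wrong, so check it against the LDC catalog record before removing it from your request.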

Finding Data in QuickSearch

Data Management

The Open Handbook of Linguistic Data Management by Andrea L. Berez-Kroeker, Bradley McDonnell, Eve Koller, Lauren B. Collister
ISBN: 9780262366076

Publication Date: 2022

A guide to principles and methods for the management, archiving, sharing, and citing of linguistic research data, especially digital data. “Doing language science” depends on collecting, transcribing, annotating, analyzing, storing, and sharing linguistic research data. This volume offers a guide to linguistic data management, engaging with current trends toward the transformation of linguistics into a more data-driven and reproducible scientific endeavor. It offers both principles and methods, presenting the conceptual foundations of linguistic data management and a series of case studies, each of which demonstrates a concrete application of abstract principles in a current practice.
