Linguistic corpora

A linguistic corpus is a body of texts, usually selected through a method of sampling, that is meant to be representative of a certain form of a language, either in a certain time period or over time. Some corpora require paid access; others are freely available online. Here are some that may be useful:

single corpora
- The Bank of English corpus can be accessed through Wordbanks Online.
- International Corpus of English – despite its name, this is actually a set of corpora for various national varieties of English
- International Corpus of Learner English
- UMBC webBase corpus – paragraphs in English from a 2007 crawl of the Web
- Corpus-DB – a corpus of texts in the public domain in the US which have been organized with basic metadata, allowing you to create a subcorpus for studying historical use of language
- OSCAR – 166 language-specific subsets of the Common Crawl corpus.
lists of corpora
- english-corpora.org – various corpora available for free (though limited) use online. UNT faculty, staff, and students can download certain full corpora through the UNT Digital Library (see below)
- Linguistic Data Consortium's catalog of corpora. UNT faculty, staff, and students can download certain full corpora through the UNT Digital Library (see below). Students can apply for a data scholarship for free access to others, but note that the deadlines are in advance of the semester in which the data would be used.
- Linguistic Corpora – part of DH Toychest: Digital Humanities Resources for Project Building
- Corpus-based Linguistics Links
- Text & Corpora – a list compiled by Linguist List
- The Big Bad NLP Database – a list of datasets that can be used for natural-language processing, speech recognition, and more (info)
- Chinese-English Parallel Corpora – These corpora cover the financial and legal domains in Hong Kong.
- Wikipedia's List of text corpora
corpora available to the UNT community:
- raw data for certain corpora from english-corpora.org:
  - Corpus of Contemporary American English (COCA)
  - Global Web-based English (GloWbE)
  - News on the Web (NOW)
  - Corpus del Español
- data from the Linguistic Data Consortium
  - ETS Corpus of Non-Native Written English
  - TIPSTER Volume 1 – To access this corpus, register for an LDC account associated with the University of North Texas. Once your account is approved, sign as "the applicant" on page 2 of the special agreement for this corpus (see the "individual" agreement linked from the TIPSTER webpage) and submit it to ldc@ldc.upenn.edu. LDC staff will then give you access to the data through your user account.
  - American National Corpus (ANC) Second Release – To access this corpus, register for an LDC account associated with the University of North Texas. Once your account is approved, sign one or both of the special agreements for this corpus as "the licensee", depending on which data you plan to use (see the licenses linked from the ANC webpage), and submit them to ldc@ldc.upenn.edu. LDC staff will then give you access to the data through your user account.
Corpora listserv

Text mining

If you are merely interested in studying word frequency over time, perhaps taking morphological inflection into account, consider these sources of data:

HTRC Analytics allows you to study subsets of the HathiTrust corpus of digitized collections of research libraries.
Data for Research and the newer platform Constellate allow you to create subsets of JSTOR's corpus of scholarly literature for text mining.
If you are prepared to work with data directly (without using visualization tools), see the General Index, which contains n-grams from over 100 million journal articles.
Download the Television News Ngrams from the GDELT Project.
European Literary Text Collection (ELTeC) provides sets of novels published between 1840 and 1920 in their original languages.
RadioTalk contains transcriptions of talk radio in the US during a few months of 2018–2019.
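
If you download raw text from one of these sources, tracking a word's frequency over time while grouping its inflected forms can be sketched roughly as follows. This is only a minimal illustration: the sample documents, the `LEMMAS` table, and the function names are hypothetical and do not correspond to any of the tools or data formats listed above. A real project would likely use a proper lemmatizer rather than a hand-written table.

```python
import re
from collections import Counter

# Hypothetical sample data: (year, text) pairs standing in for a downloaded corpus.
SAMPLE_DOCS = [
    (1990, "She walks to work. He walked home."),
    (1990, "They walk together."),
    (2000, "Nobody walks anymore; everyone drives."),
]

# Hand-written inflection table (an assumption for illustration, not a real lemmatizer).
LEMMAS = {
    "walks": "walk", "walked": "walk", "walking": "walk",
    "drives": "drive", "drove": "drive", "driving": "drive",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lemma_frequency_by_year(docs, lemma, lemmas=LEMMAS):
    """Return {year: relative frequency of `lemma`}, counting all inflected forms."""
    hits = Counter()    # occurrences of the lemma per year
    totals = Counter()  # all tokens per year, used to normalize
    for year, text in docs:
        for token in tokenize(text):
            totals[year] += 1
            # Map an inflected form to its base form; unknown words map to themselves.
            if lemmas.get(token, token) == lemma:
                hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


if __name__ == "__main__":
    for year, freq in lemma_frequency_by_year(SAMPLE_DOCS, "walk").items():
        print(year, round(freq, 3))
```

Dividing by the total number of tokens per year matters because corpora rarely contain the same amount of text for every year, so raw counts alone can be misleading.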
