
Linguistics

A subject guide for linguistics students and researchers

Linguistic corpora

A linguistic corpus is a body of texts, usually selected through a method of sampling, that is meant to be representative of a certain form of a language, either in a certain time period or over time. While some corpora require paid access, others are freely available online. Here are some that may be useful:

Text mining

If you are merely interested in studying word frequency over time, perhaps taking morphological inflection into account, consider these sources of data:

  • HTRC Analytics allows you to study subsets of the HathiTrust corpus of digitized collections of research libraries.
  • Data for Research and the newer platform Constellate allow you to create subsets of JSTOR's corpus of scholarly literature for text mining.
  • If you are prepared to work with data directly (without using visualization tools), see the General Index, which contains n-grams of over 100 million journal articles.
  • Television News Ngrams are available for download from the GDELT Project.
  • The European Literary Text Collection (ELTeC) provides sets of novels published between 1840 and 1920 in their original languages.
  • RadioTalk contains transcriptions of talk radio in the US during a few months of 2018–2019.
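
If you download n-gram data from one of these sources, tracking word frequency over time while accounting for morphological inflection can be sketched roughly as follows. This is a minimal illustration in Python, not the format or API of any source listed above: the rows, the form-to-lemma lookup, and the function names are all hypothetical, and real datasets will need their own loading step and a proper lemmatizer.

```python
from collections import defaultdict

# Hypothetical (year, word form, count) rows; not taken from any real dataset.
ROWS = [
    (2018, "walk", 120),
    (2018, "walked", 80),
    (2018, "run", 50),
    (2019, "walks", 30),
    (2019, "walking", 70),
    (2019, "ran", 40),
]

# Hypothetical lookup from inflected form to lemma (a stand-in for a lemmatizer).
LEMMAS = {
    "walk": "walk", "walked": "walk", "walks": "walk", "walking": "walk",
    "run": "run", "ran": "run",
}


def lemma_counts_by_year(rows, lemmas):
    """Sum counts per year, merging inflected forms under their lemma."""
    totals = defaultdict(lambda: defaultdict(int))
    for year, form, count in rows:
        key = form.lower()
        # Forms missing from the lookup are counted under their own spelling.
        lemma = lemmas.get(key, key)
        totals[year][lemma] += count
    return {year: dict(counts) for year, counts in totals.items()}


def relative_frequency(counts_by_year, lemma):
    """Return the lemma's share of all counted tokens for each year."""
    result = {}
    for year, counts in sorted(counts_by_year.items()):
        total = sum(counts.values())
        result[year] = counts.get(lemma, 0) / total if total else 0.0
    return result


counts = lemma_counts_by_year(ROWS, LEMMAS)
walk_trend = relative_frequency(counts, "walk")
for year, share in walk_trend.items():
    print(year, round(share, 3))
```

Dividing by the yearly total matters because corpora rarely contain the same amount of text for every year, so raw counts alone can be misleading.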