Linguistics

A subject guide for linguistics students and researchers

Linguistic corpora

A linguistic corpus is a body of texts, usually selected through a method of sampling, that is meant to be representative of a certain form of a language, either in a certain time period or over time. Some corpora require paid access; others are freely available online. Here are some that may be useful:

If you need access to a corpus for your research that's not freely available, please contact the subject librarian for linguistics so that the UNT Libraries can investigate acquiring access for you.

Text mining

If you are merely interested in studying word frequency over time, perhaps taking morphological inflection into account, consider these sources of data:

  • HTRC Analytics allows you to study subsets of the HathiTrust corpus, which consists of the digitized collections of research libraries.
  • Data for Research and the newer platform Constellate allow you to create subsets of JSTOR's corpus of scholarly literature for text mining.
  • If you are prepared to work with data directly (without using visualization tools), see the General Index, which contains n-grams extracted from over 100 million journal articles.
  • Download the Television News Ngrams from the GDELT Project.
  • European Literary Text Collection (ELTeC) provides sets of novels published between 1840 and 1920 in their original languages.
  • RadioTalk contains transcriptions of talk radio in the US during a few months of 2018–2019.
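
If you work with n-gram data directly, a word-frequency-over-time study can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the three-column layout (year, unigram, count, separated by tabs), the sample data, and the function name relative_frequency_by_year are all hypothetical and do not describe the actual file format of any source listed above, so check each source's documentation before adapting it. Inflected forms are grouped by passing them together as a set.

```python
import csv
import io
from collections import defaultdict


def relative_frequency_by_year(tsv_text, forms):
    """Return {year: share of that year's tokens matching any form in `forms`}.

    `tsv_text` is assumed (hypothetically) to hold one unigram per line as
    year <TAB> unigram <TAB> count. `forms` is a set of lowercase word forms
    counted together, e.g. the inflected forms of a single lemma.
    """
    totals = defaultdict(int)  # all token counts per year
    hits = defaultdict(int)    # counts of the target forms per year
    for year, ngram, count in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        totals[int(year)] += int(count)
        if ngram.lower() in forms:
            hits[int(year)] += int(count)
    # Dividing by the yearly total makes years of different sizes comparable.
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Hypothetical sample data, not taken from any real corpus.
SAMPLE = (
    "1900\tcorpus\t1\n"
    "1900\tcorpora\t1\n"
    "1900\tlanguage\t8\n"
    "1950\tcorpus\t3\n"
    "1950\tcorpora\t2\n"
    "1950\tlanguage\t5\n"
)

frequencies = relative_frequency_by_year(SAMPLE, {"corpus", "corpora"})
for year, share in frequencies.items():
    print(year, round(share, 3))
```

Relative rather than raw counts are used here because the amount of text available usually differs from year to year, so raw counts alone can suggest a trend that only reflects the size of the collection.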
