A linguistic corpus is a body of texts, usually selected through a sampling method, that is meant to be representative of a certain form of a language, either in a certain time period or over time. While some corpora require paid access, others are freely available online. Here are some that may be useful:
- single corpora
- lists of corpora
- corpora available to the UNT community:
- raw data for certain corpora from english-corpora.org:
- data from the Linguistic Data Consortium
- ETS Corpus of Non-Native Written English
- TIPSTER Volume 1 – To access this corpus, register for an LDC account associated with the University of North Texas. Once your account is approved, sign as "the applicant" on page 2 of the special agreement for this corpus (see the "individual" agreement linked from the TIPSTER webpage) and submit it to ldc@ldc.upenn.edu. LDC staff will then give you access to the data through your user account.
- American National Corpus (ANC) Second Release – To access this corpus, register for an LDC account associated with the University of North Texas. Once your account is approved, sign one or both of the special agreements for this corpus as "the licensee," depending on which data you plan to use (see the licenses linked from the ANC webpage), and submit them to ldc@ldc.upenn.edu. LDC staff will then give you access to the data through your user account.
- Web 1T 5-gram, 10 European Languages Version 1 (coming soon in the UNT Digital Library, but in the meantime you can register for an LDC account associated with the University of North Texas, and once your account is approved, request access to this corpus from LDC staff)
- Message Understanding Conference (MUC) 7 (coming soon in the UNT Digital Library, but in the meantime you can register for an LDC account associated with the University of North Texas, and once your account is approved, request access to this corpus from LDC staff)
- Web 1T 5-gram Version 1 (coming soon in the UNT Digital Library, but in the meantime you can register for an LDC account associated with the University of North Texas, and once your account is approved, request access to this corpus from LDC staff)
- Chinese-English Translation Lexicon Version 3.0 (coming soon in the UNT Digital Library, but in the meantime you can register for an LDC account associated with the University of North Texas, and once your account is approved, request access to this corpus from LDC staff)
- OntoNotes Release 5.0 (coming soon in the UNT Digital Library, but in the meantime you can register for an LDC account associated with the University of North Texas, and once your account is approved, request access to this corpus from LDC staff)
- TREC Spanish (coming soon in the UNT Digital Library, but in the meantime you can register for an LDC account associated with the University of North Texas, and once your account is approved, request access to this corpus from LDC staff)
- TIMIT Acoustic-Phonetic Continuous Speech Corpus (coming soon in the UNT Digital Library, but in the meantime you can register for an LDC account associated with the University of North Texas, and once your account is approved, request access to this corpus from LDC staff)
- TimeBank 1.2 (coming soon in the UNT Digital Library, but in the meantime you can register for an LDC account associated with the University of North Texas, and once your account is approved, request access to this corpus from LDC staff)
- places to look for texts and corpora; sites with useful information
If your research requires a corpus that's not freely available, please contact the subject librarian for linguistics so that the UNT Libraries can investigate acquiring access for you.