Content Analysis

Introduction to Content Analysis

Gathering documents for analysis

Some systems, like most of those listed in the "Text mining" box below, let you select a set of documents and analyze them all through a web browser. If you want to work with your own set of documents, however, or use tools that are not available online, you will need to gather the documents yourself.

While documents that you are interested in studying may be available through Google Books, Google Magazines, or a database available through the UNT Libraries, these systems usually prevent you from downloading content in bulk for study using your own tools. If you need access to the content in a licensed database, contact your subject librarian to see if there is a way to arrange for special access for text mining.

Much of the content found in Google Books and Google Magazines is also available in HathiTrust. Since UNT is a partner institution, you can download PDFs of items that are in the public domain (no longer protected by copyright) in the United States, plus a few items for which the rightsholder has allowed this. In addition, it is possible to download public-domain works in bulk: see information on datasets. If you want to search across HathiTrust and other large collections for books for which the full text is freely available, you might try OpenTexts.world.

However, all HathiTrust content (even the content still protected by copyright) can be studied using HTRC Analytics. See that website for more information.

Text mining

If you are mainly interested in studying word frequency over time, perhaps taking morphological inflection into account, consider these sources of data (a short scripting sketch follows the list):

  • HTRC Analytics allows you to study subsets of the HathiTrust corpus of digitized collections of research libraries.
  • Data for Research and the newer platform Constellate allow you to create subsets of JSTOR's corpus of scholarly literature for text mining.
  • If you are prepared to work with data directly (without using visualization tools), see the General Index, which contains n-grams of over 100 million journal articles.
  • Download the Television News Ngrams from the GDELT Project.
  • European Literary Text Collection (ELTeC) provides sets of novels published between 1840 and 1920 in their original languages.
  • RadioTalk contains transcriptions of talk radio in the US during a few months of 2018–2019.
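If you end up working with raw text files rather than one of the hosted platforms above, the core computation behind a frequency-over-time chart is straightforward. Below is a minimal sketch in Python; the directory layout (one plain-text file per year), the tokenizer, and the sample word are hypothetical illustrations, not part of any of the services listed. Handling morphological inflection would additionally require lemmatizing the tokens.

```python
# Minimal sketch: relative frequency of a word over time, assuming a
# hypothetical layout of one plain-text file per year (corpus/1885.txt,
# corpus/1886.txt, ...). Adapt to however your own dataset is organized.
import re
from collections import Counter
from pathlib import Path

TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)?")  # crude tokenizer; no lemmatization

def yearly_frequency(corpus_dir: str, word: str) -> dict[int, float]:
    """Return the relative frequency of `word` in each year-named file."""
    results = {}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        year = int(path.stem)  # e.g. "1885" -> 1885
        tokens = TOKEN.findall(path.read_text(encoding="utf-8").lower())
        counts = Counter(tokens)
        results[year] = counts[word] / len(tokens) if tokens else 0.0
    return results

if __name__ == "__main__":
    for year, freq in yearly_frequency("corpus", "telegraph").items():
        print(year, f"{freq:.6f}")
```

The output pairs each year with a proportion rather than a raw count, so that years with more digitized text do not appear artificially prominent.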

Preparing documents for analysis

If the documents you are studying include scans from paper (whether in PDF or an image format), you will likely want to perform optical character recognition (OCR) on the documents first so that the full text can be searched. NVivo's "Working with PDFs in NVivo" explains how you can use Microsoft OneNote or Microsoft Office Document Imaging to do this. Other options include OmniPage and Adobe Acrobat.
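If you prefer a scriptable route instead of those desktop tools, the open-source Tesseract engine can be driven from Python via the pytesseract wrapper. The sketch below is illustrative only: the directory and file names are hypothetical, it assumes page images (a scanned PDF would first need its pages rendered to images, for example with a tool like pdf2image), and Tesseract itself must be installed separately.

```python
# Minimal sketch: OCR a directory of scanned page images with Tesseract,
# via pytesseract (pip install pytesseract pillow). Paths are hypothetical.
from pathlib import Path

import pytesseract
from PIL import Image

def ocr_pages(image_dir: str, out_file: str) -> None:
    """Run OCR on every PNG in image_dir and write the combined text."""
    texts = []
    for path in sorted(Path(image_dir).glob("*.png")):
        texts.append(pytesseract.image_to_string(Image.open(path)))
    Path(out_file).write_text("\n".join(texts), encoding="utf-8")

ocr_pages("scans", "scans.txt")
```

OCR quality depends heavily on scan resolution and cleanliness, so spot-check the output text against the page images before analyzing it.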

Building web or social-media archives (datasets of webpages or posts for study)

Twitter provides a special way for academic researchers to gain access to its archive of tweets.
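As a rough illustration of what that access looks like in practice, the sketch below queries the v2 full-archive search endpoint that the academic research track provided. Twitter's access terms and products have changed over time, so treat this as a sketch and check the current developer documentation; the bearer token and query are placeholders.

```python
# Minimal sketch: full-archive tweet search via the Twitter v2 API, as
# offered under the academic research track. Token and query are placeholders.
import requests

BEARER_TOKEN = "YOUR_ACADEMIC_BEARER_TOKEN"

resp = requests.get(
    "https://api.twitter.com/2/tweets/search/all",
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    params={"query": "#YesAllWomen lang:en", "max_results": 100},
    timeout=30,
)
resp.raise_for_status()
for tweet in resp.json().get("data", []):
    print(tweet["id"], tweet["text"])
```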

YouTube has also begun offering API access through its YouTube Researcher Program.

The UNT Libraries can crawl the Web to collect news stories, social media posts, or other webpages related to certain topics, gathering the data for study by researchers. (For example, see a “Yes All Women” Twitter Dataset.) To request creation of a Web dataset, or if you have any questions about web archiving activities at UNT Libraries, contact Mark Phillips.

Alternatively, you can scrape the web on your own using a tool such as Web Scraper.
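If you would rather script the collection than use a point-and-click extension like Web Scraper, a small Python program can do the same job. The sketch below uses the requests and BeautifulSoup libraries; the URL and CSS selector are placeholders for whatever site you are studying. Check the site's terms of service and robots.txt before crawling.

```python
# Minimal sketch: collect article links from one page with requests and
# BeautifulSoup (pip install requests beautifulsoup4). URL and selector
# are hypothetical; adapt them to your target site.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/news", timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

for link in soup.select("a.article-link"):  # hypothetical CSS selector
    print(link.get("href"), link.get_text(strip=True))
```

For anything beyond a single page, respect crawl delays and consider a purpose-built framework rather than looping this snippet over many URLs.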

Perhaps, though, you don't need to build your own: others have already created web and social-media archives and made them available.
