
Content Analysis

Introduction to Content Analysis

Gathering documents for analysis

Some systems, such as HTRC Analytics and the Digital Scholar Workbench, let you select a set of documents and analyze them entirely in a web browser. If you want to work with your own documents or use tools that are not available online, however, you may need to gather the documents yourself.

While documents that you are interested in studying may be available through Google Books, Google Magazines, or a database available through the UNT Libraries, these systems usually prevent you from downloading content in bulk for study using your own tools. If you need access to the content in a licensed database, contact your subject librarian to see if there is a way to arrange for special access for text mining.

Much of the content found in Google Books and Google Magazines is also available in HathiTrust. Since UNT is a partner institution, you can download PDFs of items that are in the public domain (no longer protected by copyright) in the United States, plus a few items for which the rightsholder has allowed this. In addition, it is possible to download public-domain works in bulk: see information on datasets. If you want to search across HathiTrust and other large collections for books for which the full text is freely available, you might try OpenTexts.world.

However, all HathiTrust content (even the content still protected by copyright) can be studied using HTRC Analytics. See that website for more information.

Preparing documents for analysis

If the documents you are studying include scans from paper (whether in PDF or an image format), you will likely want to perform optical character recognition (OCR) on them first so that the full text can be searched. NVivo's "Working with PDFs in NVivo" explains how you can use Microsoft OneNote or Microsoft Office Document Imaging to do this. Other options include OmniPage and Adobe Acrobat.
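Text that comes out of OCR often still carries the line breaks and end-of-line hyphenation of the printed page, which can get in the way of full-text searching. The sketch below is one possible cleanup step in Python, not part of any of the tools named above. The function name `clean_ocr_text` and the sample string are hypothetical, and the hyphen rule is an assumption that will also join genuinely hyphenated compounds that happen to break across lines.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy plain text produced by an OCR tool (hypothetical helper)."""
    # Rejoin words split by a hyphen at a line break: "recog-\nnition" -> "recognition".
    # Assumption: a hyphen at a line end is a typesetting break, not a real hyphen.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Turn single line breaks into spaces; blank lines stay as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


# Made-up sample standing in for the text layer of a scanned page.
sample = (
    "Optical character recog-\n"
    "nition turns page images\n"
    "into searchable text.\n"
    "\n"
    "A second paragraph."
)
cleaned = clean_ocr_text(sample)
print(cleaned)
```

Running the cleanup before loading documents into an analysis tool means a search for a word such as "recognition" will match even where the scan split it across two lines.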

Building web or social-media archives (datasets of webpages or posts for study)

The UNT Libraries can crawl the Web to collect news stories, social media posts, or other webpages related to certain topics, gathering the data for study by researchers. (For example, see a “Yes All Women” Twitter Dataset.) To request creation of a Web dataset, or if you have any questions about web archiving activities at UNT Libraries, contact Mark Phillips.

Alternatively, you can scrape the web on your own using a tool such as Web Scraper.
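If you prefer to script this step, the core of scraping is reducing a downloaded page to its readable text. The following is a minimal sketch using only Python's standard library, not a description of how Web Scraper works. The names `TextExtractor` and `html_to_text` and the sample page are hypothetical, and the page is an inline string so the example runs without network access; a real page's HTML could be substituted.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one text run per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Made-up page standing in for a downloaded news story.
page = (
    "<html><head><title>Story</title><style>p{color:red}</style></head>"
    "<body><h1>Headline</h1><p>First paragraph.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = html_to_text(page)
print(text)
```

The extracted text can then be saved as one plain-text file per page, which is the form most text-analysis tools expect.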

You may not need to build your own dataset, though. Some have already been created by others and made available.

