Some systems, like most of those listed in the "Text Mining" box below, let you select a set of documents and analyze them entirely in a web browser. If you want to work with documents of your own choosing or use tools not available online, however, you may need to gather the documents yourself.
While documents that you are interested in studying may be available through Google Books, Google Magazines, or a database available through the UNT Libraries, these systems usually prevent you from downloading content in bulk for study using your own tools. If you need access to the content in a licensed database, contact your subject librarian to see if there is a way to arrange for special access for text mining.
Much of the content found in Google Books and Google Magazines is also available in HathiTrust. Since UNT is a partner institution, you can download PDFs of items that are in the public domain (no longer protected by copyright) in the United States, plus a few items for which the rightsholder has allowed this. In addition, it is possible to download public-domain works in bulk: see information on datasets. If you want to search across HathiTrust and other large collections for books for which the full text is freely available, you might try OpenTexts.world.
However, all HathiTrust content (even the content still protected by copyright) can be studied using HTRC Analytics. See that website for more information.
If you are merely interested in studying word frequency over time, perhaps taking morphological inflection into account, consider these sources of data:
If the documents you are studying include scans from paper (whether in PDF or an image format), you will likely want to perform optical character recognition (OCR) on the documents first to allow for searching of the full text. NVivo's "Working with PDFs in NVivo" explains how you can use Microsoft OneNote or Microsoft Office Document Imaging to do this. Other options include OmniPage and Adobe Acrobat.
Twitter provides a special way to gain access to its archive of tweets for academic research.
YouTube has also begun offering API access through its YouTube Researcher Program.
The UNT Libraries can crawl the Web to collect news stories, social media posts, or other webpages related to certain topics, gathering the data for study by researchers. (For example, see a “Yes All Women” Twitter Dataset.) To request creation of a Web dataset, or if you have any questions about web archiving activities at UNT Libraries, contact Mark Phillips.
Alternatively, you can scrape the web on your own using a tool such as Web Scraper.
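If you would rather script this step than use a point-and-click tool, the core task is turning a saved webpage into plain text that a text-mining tool can read. The following is a minimal sketch using only Python's standard library. The names `TextExtractor` and `html_to_text` are hypothetical, chosen for illustration, and the approach (keep all visible text, drop script and style contents) is an assumption about what most projects need rather than a description of how Web Scraper works.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # greater than 0 while inside a skipped tag
        self._chunks = []      # pieces of visible text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace so each chunk is a clean phrase.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string as one line of plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><h1>Headline</h1><p>Story text.</p></body></html>"
    print(html_to_text(sample))
```

To apply this to live pages, you could download each page with `urllib.request.urlopen` and pass the decoded result to `html_to_text`. Before collecting pages in bulk, check the site's terms of use and pace your requests so that you do not overload the server.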
Perhaps, though, you don't need to build your own dataset. Some datasets have already been created by others and made available: