Research Guides: Text and Data Mining: Tools

Tools

Office of Information Technology Licensed Software

You can review OIT's licensed software on its Software Distribution page. Below are some selected text and data mining tools.

QDA Miner is a software package that aids in the qualitative analysis of text or images. Download it under Provalis Research in the OIT software list. At UTK, you can schedule a one-on-one tutorial by calling the OIT HelpDesk at 865-974-9900. Tutorials are also available on QDA Miner's tutorial website.

NVivo is a qualitative analysis software package that supports both qualitative and mixed methods research. For UTK, OIT offers workshops on NVivo each semester, and you can schedule a one-on-one tutorial any time by calling the OIT HelpDesk at 865-974-9900. You can also visit NVivo's Support page or watch tutorials from QSR International's YouTube channel.

MATLAB and Simulink are computational software environments used for a variety of tasks in engineering, science, mathematics, statistics, and finance. An optional Text Analytics Toolbox is available for textual data. The MATLAB Onramp course is available at no additional charge to registered users of the UT MATLAB site license. If you are not a registered user, log into the OIT Software Download Site, download the file under MathWorks, Inc., and follow the steps for creating a MathWorks account. Then log in to take the 2-hour MATLAB Onramp course.

Other Free Software

OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.
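To give a sense of what "cleaning messy data" can involve, here is a minimal Python sketch of key-based clustering, loosely modeled on the fingerprint idea that OpenRefine uses to group near-duplicate values. It is a simplified illustration and not OpenRefine's own code; the function names and the sample values are made up for this example.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster_by_fingerprint(values):
    """Group values that share a key; keep only groups with more than one variant."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


# Illustrative sample values (hypothetical, not taken from any real dataset).
names = [
    "University of Tennessee",
    "university of tennessee ",
    "Tennessee, University of",
    "Indiana University",
]
clusters = cluster_by_fingerprint(names)
for key, members in clusters.items():
    print(key, "->", members)
```

In OpenRefine itself, this kind of clustering is done through the graphical interface, so no programming is required.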

R Project is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and macOS. Several useful R packages are listed below.

Tidyverse for text mining and data science

rvest for web scraping

twitteR for Twitter data

Python is a programming language that lets you work more quickly and integrate your systems more effectively. See their FAQs for more information.

HathiTrust Research Center

The HathiTrust Research Center (HTRC) enables computational analysis of the HathiTrust corpus. It is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with HathiTrust, to help meet the technical challenges researchers face when dealing with massive amounts of digital text. It develops cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge.

Because the University of Tennessee, Knoxville is a member of HathiTrust, its affiliates can create accounts to use the HTRC Analytics site. Researchers from member organizations have full access to the site and tools, and can create an HTRC Data Capsule in which to analyze datasets drawn from the full HathiTrust corpus. Those within the membership community are also eligible to apply for special support for their research via the HTRC's Advanced Collaborative Support program.

For more information, you can:

Read a brief overview of the HTRC's Collections and Tools
Find tutorials and detailed documentation in the HTRC Documentation
Review the code that makes it all run on the HTRC GitHub
See more documentation on Getting Started and Help!

Tools and Services

The HTRC offers a suite of tools for computational text analysis. These tools cover a wide variety of functions ranging from simple statistical analysis of words to complex algorithms relating concepts and meaning.
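As an illustration of the "simple statistical analysis of words" end of that range, the following Python sketch counts word frequencies in a short text. It is not an HTRC tool or algorithm; the tokenizing rule and the sample sentence are assumptions chosen for brevity.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Lowercase the text, split it into simple word tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Made-up sample sentence for demonstration only.
sample = "The whale surfaced. The crew watched the whale."
counts = word_frequencies(sample)
top_words = counts.most_common(2)
print(top_words)
```

Real analyses usually add steps such as removing very common words or grouping word forms, which this sketch leaves out.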

HTRC Analytics

HTRC Analytics is the primary site for interacting with HTRC. It provides access to HTRC worksets and off-the-shelf algorithms to analyze them. It also contains a dashboard where researchers can create a secure computing environment, called a Data Capsule (see below). Several of the HTRC algorithms are based on the Software Environment for the Advancement of Scholarly Research (SEASR, pronounced “Caesar”), a legacy project developed with funding from the Andrew W. Mellon Foundation.

HathiTrust+Bookworm

The HathiTrust+Bookworm visualization tool allows researchers to graph word trends across the HathiTrust corpus and facet their search by bibliographic metadata.
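The idea behind a word-trend graph can be sketched in a few lines of Python: for each year, divide the number of times a word appears by the total number of words from that year. This is only a conceptual illustration with made-up records; it does not use or reproduce the HathiTrust+Bookworm tool, and the function name is hypothetical.

```python
import re
from collections import defaultdict


def yearly_relative_frequency(records, word):
    """Return {year: share of all tokens in that year that equal `word`}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    target = word.lower()
    for year, text in records:
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Toy records of (publication year, text); invented for this example.
toy_records = [
    (1850, "the railway reached the town"),
    (1850, "a new railway line opened"),
    (1900, "the telephone replaced the telegraph"),
]
trend = yearly_relative_frequency(toy_records, "railway")
print(trend)
```

Using a share of all words, instead of a raw count, keeps years with more published text from dominating the trend.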

Data Capsules

The HTRC Data Capsules secure compute environment allows researchers to create a virtual machine desktop “capsule” that can be used to run customized research methods and tools not supported by the pre-built algorithms. Researchers control their research process while in a capsule, and only derived data may be released when they are finished.

HTRC Technical Abilities

HTRC Tool	Technical skills	Rights status	Methods	Data format
*Web algorithms*	Low	Public domain	Off-the-shelf	Can’t see underlying data
*HT+Bookworm tool*	Low	All (13.7 million volumes)	Visualize trends	Can’t see underlying data
*Data Capsule environment*	Medium to high	Public domain	Your choice, including Voyant	Raw OCR
*Extracted Features dataset*	Medium to high	All (15.7 million volumes)	Any requiring bag-of-words	Words and word counts in structured file
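To show what working with "words and word counts in a structured file" might look like, here is a minimal Python sketch that combines per-page word counts into one bag-of-words for a whole volume. The page structure and the `token_counts` field name are simplified assumptions made for this example; the actual Extracted Features files have their own schema, described in the HTRC documentation.

```python
from collections import Counter


def volume_bag_of_words(pages):
    """Sum per-page word counts into a single volume-level count."""
    total = Counter()
    for page in pages:
        # `token_counts` is a hypothetical field name, not the real schema.
        total.update(page["token_counts"])
    return total


# Toy volume with two pages; values are invented for illustration.
toy_volume = [
    {"page": 1, "token_counts": {"whale": 2, "sea": 1}},
    {"page": 2, "token_counts": {"whale": 1, "ship": 3}},
]
bag = volume_bag_of_words(toy_volume)
print(bag)
```

Because a bag-of-words keeps counts but not word order, it suits methods such as frequency comparison or topic modeling, but not methods that need the running text.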

Need help?

Need help getting started on text and data mining? Contact our data services team at dataservices@utk.edu.
