
Text and Data Mining

Tools

Office of Information Technology Licensed Software

You can review OIT's licensed software on its Software Distribution page. Below are selected tools for text and data mining.

 

QDA Miner is a software package that aids in the qualitative analysis of text or images. Download it from the Provalis Research entry in the OIT software list. At UTK, you can schedule a one-on-one tutorial by calling the OIT HelpDesk at 865-974-9900. Tutorials are also available on QDA Miner's tutorial website.

NVivo is a qualitative analysis software package that supports both qualitative and mixed methods research. At UTK, OIT offers workshops on NVivo each semester, and you can schedule a one-on-one tutorial at any time by calling the OIT HelpDesk at 865-974-9900. You can also visit NVivo's Support page or watch tutorials on QSR International's YouTube channel.

MATLAB and Simulink are computational software environments used for a variety of tasks in engineering, science, mathematics, statistics, and finance. An optional Text Analytics Toolbox is available for working with textual data. The MATLAB Onramp course is available at no additional charge to registered users of the UT MATLAB site license. If you are not a registered user, log in to the OIT Software Download Site, download the file under MathWorks, Inc., and follow the steps for creating a MathWorks account. Then log in to take the 2-hour MATLAB Onramp course.

 

Other Free Software

 

OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.

 

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and macOS. Several R packages useful for text and data mining are listed below.

Tidyverse for text mining and data science

rvest for web scraping

twitteR for Twitter data


Python is a programming language that lets you work more quickly and integrate your systems more effectively. See the Python FAQs for more information.

HathiTrust Research Center

The HathiTrust Research Center (HTRC) enables computational analysis of the HathiTrust corpus. It is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with HathiTrust, to help meet the technical challenges researchers face when dealing with massive amounts of digital text. It develops cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge.

Because the University of Tennessee, Knoxville is a member of HathiTrust, its affiliates can create accounts to use the HTRC Analytics site. Researchers from member organizations have full access to the site and tools, and can also create an HTRC Data Capsule in which to analyze datasets drawn from the full HathiTrust corpus. Those within the membership community are also eligible to apply for special support for their research via the HTRC's Advanced Collaborative Support program.

 

For more information, you can:

 

Tools and Services

The HTRC offers a suite of tools for computational text analysis. These tools cover a wide variety of functions ranging from simple statistical analysis of words to complex algorithms relating concepts and meaning.

HTRC Analytics

HTRC Analytics is the primary site for interacting with HTRC. It provides access to HTRC worksets and off-the-shelf algorithms to analyze them. It also contains a dashboard where researchers can create a secure computing environment, called a Data Capsule (see below). Several of the HTRC algorithms are based on the Software Environment for the Advancement of Scholarly Research (SEASR, pronounced “Caesar”), a legacy project developed with funding from the Andrew W. Mellon Foundation.

HathiTrust+Bookworm

The HathiTrust+Bookworm visualization tool allows researchers to graph word trends across the HathiTrust corpus and facet their search by bibliographic metadata.
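
To make "word trends faceted by bibliographic metadata" concrete, here is a minimal Python sketch of the underlying idea: the relative frequency of a word per publication year, optionally restricted by one metadata field. It is not Bookworm's implementation or API. The `volumes` records, the language codes, and the `word_trend` function are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical records: (publication year, language, word counts for one volume).
volumes = [
    (1850, "eng", {"whale": 4, "ship": 6, "the": 90}),
    (1850, "eng", {"whale": 1, "the": 99}),
    (1900, "eng", {"whale": 2, "ship": 8, "the": 190}),
    (1900, "ger", {"wal": 3, "der": 97}),
]


def word_trend(volumes, word, language=None):
    """Return {year: relative frequency of `word`}.

    If `language` is given, only volumes with that language code are
    counted; this stands in for faceting by bibliographic metadata.
    """
    hits = defaultdict(int)    # occurrences of `word` per year
    totals = defaultdict(int)  # all word occurrences per year
    for year, lang, counts in volumes:
        if language is not None and lang != language:
            continue
        hits[year] += counts.get(word, 0)
        totals[year] += sum(counts.values())
    return {year: hits[year] / totals[year] for year in sorted(totals)}


trend = word_trend(volumes, "whale", language="eng")
for year, freq in trend.items():
    print(year, round(freq, 4))
```

Plotting the resulting year-to-frequency mapping as a line chart gives the kind of trend graph the tool displays, here at toy scale.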

Data Capsules

The HTRC Data Capsules secure compute environment allows researchers to create a virtual machine desktop “capsule” that can be used to run customized research methods and tools not supported by the pre-built algorithms. Researchers control their research process while in a capsule, and only derived data may be released when they are finished.

 

 

HTRC Technical Abilities

| HTRC Tool | Technical skills | Rights status | Methods | Data format |
|---|---|---|---|---|
| Web algorithms | Low | Public domain | Off-the-shelf | Can’t see underlying data |
| HT+Bookworm tool | Low | All (13.7 million volumes) | Visualize trends | Can’t see underlying data |
| Data Capsule environment | Medium to high | Public domain | Your choice, including Voyant | Raw OCR |
| Extracted Features dataset | Medium to high | All (15.7 million volumes) | Any requiring bag-of-words | Words and word counts in structured file |
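
The "bag-of-words" and "words and word counts" entries for the Extracted Features dataset can be illustrated with a short Python sketch. It assumes made-up page texts and a simple tokenizer, and it does not reproduce the actual Extracted Features file schema. The `bag_of_words` helper and the `pages` sample are hypothetical.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its count.

    Word order is discarded; only the counts remain, which is what
    "bag-of-words" means.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical page texts standing in for one volume.
pages = [
    "The whale surfaced. The crew watched the whale.",
    "The sea was calm, and the crew rested.",
]

# Per-page counts, then volume-level counts by summing the pages.
page_counts = [bag_of_words(page) for page in pages]
volume_counts = sum(page_counts, Counter())

print(volume_counts["the"], volume_counts["whale"], volume_counts["crew"])
```

Methods that need only word counts, such as frequency comparisons or topic modeling, can work from data in this form, whereas methods that depend on word order or full sentences need the raw text available in a Data Capsule.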