Text analysis for teaching

On April 13th, Ben Johnston, Senior Educational Technology Specialist at the McGraw Center for Teaching and Learning, gave a faculty workshop on “Teaching with Text Analysis.”

Ben, who describes himself as a proponent of “Modest DH” (digital humanities), argued for the adoption of intuitive Web-based tools that students can use to analyze large amounts of text. While there are sophisticated tools that can also do this sort of work, free online tools can be an easy way to introduce the power of text analysis in course assignments, and let students do substantive work examining textual content without the learning curve or cost of more complex software packages.

Ben began by introducing the fundamentals of basic text analysis:

  • Search and indexing tools
  • Frequency (word count and word comparison) tools
  • Collocation and classification tools (tools that place words in context within a text and/or identify parts of speech)

Text analysis tools excel at giving a large-scale view of a corpus containing more text than can be read, perhaps even in a lifetime. Ben drew an analogy between this sort of “distant reading” of large text collections and “space archaeology,” where archaeologists use satellite photographs to find likely dig sites. By taking the view of “distant reading,” rather than the “close reading” techniques taught in so many courses in the humanities, text analysis can, like satellite photos, indicate areas worthy of study and questioning. Quoting Paul Fyfe of North Carolina State University, Ben proposed the following way to think of using text analysis tools in student assignments:

In the humanities, we do not teach the answers. Rather, we teach students how to ask good questions.

With this attitude in mind, Ben described certain characteristics he expects of the modest tools he feels are most effective in the classroom. Those tools share these characteristics:

  • intuitive interface, easy to learn
  • online, familiar, and self-evident – no software to install or learn
  • collaborative
  • don’t require the professor to spend time teaching the tool
  • exhaust simple resources before introducing unnecessary complexity to the task
  • can be used as a proof of concept – quickly determining whether the tool is useful, and if not, moving on to a different tool
  • allow students to engage with primary texts and real data
  • perform analysis quickly, through iterative practice

The first tools examined in the workshop were “ngram” tools: the Google Ngram Viewer and the New York Times Chronicle. These tools plot the usage frequency of words over time using, respectively, the body of works scanned by Google (publications dating from 1800 to 2000) and back issues of The New York Times. Although some scholars question the validity of ngrams, owing to variations in spelling and typography (for instance, the long s of the 18th century being interpreted by OCR software as an “f”), the tools are simple, quick, and can excite scholarly interest. As an example, Ben did a search for the word “funk,” expecting the results to reflect the 20th-century musical style or mood, yet found heavy use of the word around 1800, owing to accounts of ships that had “sunk,” with the long s misread as an “f.” Despite these limitations, Ben explained, ngrams can be a simple and powerful way for students to begin questioning why the use of certain words or phrases varies over time. As another example, Ben searched for “The United States of America” followed by either the verb “are” or “is.” The use of the United States as a collective noun, taking a singular verb, rapidly surpassed the more common plural usage in the 1870s – immediately after the Civil War.
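The idea behind an ngram viewer is simple: for each time slice of a corpus, count how often a term appears relative to all tokens in that slice. A minimal sketch of that computation, using a tiny invented two-document corpus (the texts and years here are hypothetical, not drawn from the actual tools):

```python
# Tiny hypothetical corpus: (year, text) pairs standing in for a scanned archive.
corpus = [
    (1800, "the ship funk beneath the waves the ship was lost"),
    (1900, "jazz and funk music filled the hall with funk rhythms"),
]

def relative_frequency(word, text):
    """Occurrences of `word` per total tokens in `text` (a 1-gram frequency)."""
    tokens = text.lower().split()
    return tokens.count(word.lower()) / len(tokens)

# A plot-ready series: one relative frequency per year, as an ngram viewer charts.
series = {year: relative_frequency("funk", text) for year, text in corpus}
```

Real ngram viewers do exactly this at scale, over millions of volumes, which is why OCR quirks like the long s can skew early-period counts.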

The next tools examined were word-cloud generators, such as Wordle and TagCrowd, which show a visualization of the frequency of words in a text.

Word clouds and more are included in Voyant Tools, which not only shows word clouds but also maps usage over time (in the case of a corpus of texts) and shows collocations of certain words in context. Voyant allows the upload of .zip files that contain more than one work. Among the examples Ben used to demonstrate these tools were the collected Sherlock Holmes short stories of Arthur Conan Doyle and several works by 19th-century women novelists writing in the same period.
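Under the hood, a word cloud is just a frequency count with common function words filtered out. A minimal sketch of that step, on an invented sentence (the stopword list here is an illustrative assumption; real tools ship much longer lists):

```python
from collections import Counter
import re

text = "The hound barked. The hound ran across the moor, and the moor was silent."

# A few common function words to drop, as word-cloud tools typically do.
stopwords = {"the", "and", "was", "a", "of"}

tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(t for t in tokens if t not in stopwords)

top = counts.most_common(2)  # the words that would render largest in the cloud
```

The font size of each word in the rendered cloud is then scaled to its count, which is why students can read a cloud as a rough frequency chart.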

Juxta Commons is a tool that can compare two works by different authors for areas of similarity, and for unique phrases that characterize each. Such tools are often used to find common phrases that might indicate authorship in a work.
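One simple way such comparison works is to intersect the sets of short word sequences (n-grams) from two texts: sequences in both are candidate shared phrases; sequences in only one characterize that text. A minimal sketch under that assumption, on two invented snippets (this is an illustration of the general technique, not Juxta Commons’s actual algorithm):

```python
def ngrams(text, n=3):
    """The set of word n-grams in `text`, lowercased, punctuation stripped."""
    tokens = text.lower().replace(",", "").replace(".", "").split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

a = "It was the best of times, it was the worst of times."
b = "She said it was the best of days she had known."

shared = ngrams(a) & ngrams(b)  # trigrams common to both texts
unique_to_a = ngrams(a) - ngrams(b)  # phrases that characterize text a
```

Authorship-attribution work builds on the same idea, looking for characteristic phrases and function-word patterns that recur across an author’s texts.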

The workshop concluded with a quick look at some other, more powerful desktop tools, such as AntConc, a concordance tool that can build ngrams, find collocations, and run frequency counts on works selected by the user. AntConc differs from the other tools described in that it requires software to be installed on the user’s machine and has a steeper learning curve. However, students accustomed to the simpler online tools should have no problem transitioning to a program such as AntConc.
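The core of a concordance is the keyword-in-context (KWIC) display: every occurrence of a search term with a few words of surrounding context. A minimal sketch of that output on an invented sentence (a toy illustration of the technique, not AntConc’s implementation):

```python
def kwic(text, keyword, window=3):
    """Keyword-in-context hits: `window` tokens on either side of each match."""
    tokens = text.split()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

text = "the dog chased the cat and the cat climbed the old oak tree"
lines = kwic(text, "cat")
```

Reading down the aligned contexts is what lets students spot collocations – the words that habitually appear beside the keyword.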

Ben also showed ABC Books, a project from an English course on children’s literature taught by Professor William Gleason. The students uploaded, catalogued, and tagged children’s alphabet books in several languages, and used Voyant Tools to study the texts. The course, which has been offered for several years at Princeton, has now built a large corpus of information about these ABC books, held in the Cotsen Children’s Library at Princeton, and is a model for using text analysis in the classroom.

Here is a link to the slides from Ben’s presentation.

Following are links to tools demonstrated in the workshop:

Google Ngram Viewer

New York Times Chronicle

Voyant Tools

ABC Books

Juxta Commons

Some text in the public domain to try (note some tools allow the upload of multiple texts in a .zip format):

Download Data

Posted April 15, 2016