*Please note this course has been cancelled for 2016*
The application of computational tools to textual data is a growing area of inquiry in the humanities. Much of this work, however, relies on older techniques such as n-grams and bag-of-words models. Recent developments in computational linguistics, which attempt to mimic the complex process by which humans parse and interpret language, have so far failed to gain widespread use. A primary reason these methods have not enjoyed wider popularity is that many scholars have had little opportunity to learn them.
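To make the contrast concrete, the "older techniques" mentioned above can be illustrated in a few lines. The workshop itself is taught in R; the snippet below is only a language-agnostic concept sketch (written in Python, with a made-up example sentence) showing why bag-of-words models and n-grams are considered simple: they reduce a text to counts and adjacent pairs, discarding grammatical structure entirely.

```python
from collections import Counter

text = "the cat sat on the mat because the cat was tired"
tokens = text.split()  # naive whitespace tokenization

# Bag-of-words: word order is discarded; only counts remain.
bag_of_words = Counter(tokens)

# Bigrams (n-grams with n = 2): consecutive token pairs, a small
# step beyond single-word counts but still blind to syntax.
bigrams = list(zip(tokens, tokens[1:]))

print(bag_of_words["the"])  # -> 3
print(bigrams[0])           # -> ('the', 'cat')
```

Note that the bag-of-words representation cannot distinguish "the cat chased the dog" from "the dog chased the cat"; recovering that kind of relational information is exactly what the parsing techniques introduced in this workshop are for.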
This workshop introduces the basic components of modern natural language processing and illustrates how they can be used to extract latent information from a corpus of text. Techniques include tokenization, lemmatization, part-of-speech tagging, dependency parsing, and coreference resolution. Students in the course will learn these concepts by way of a tutorial approach: everyone will be expected to follow along on their own machines as we work through increasingly involved examples. The tutorials use the open-source statistical programming language R; however, no prior programming experience is assumed. Necessary components of the programming language are introduced throughout the workshop. Our objects of study will be: (1) a collection of short stories, (2) a set of several dozen novels, and (3) a corpus of historical newspaper articles. On the final day of the workshop, students will have a chance to explore their own data sources if they so choose.
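The techniques listed above share a common output shape: a table with one row per token, carrying annotations such as the lemma and part of speech. As a rough concept sketch (again in Python rather than the R used in the tutorials), the example below fakes the annotation step with a tiny hand-written lookup table; real pipelines use trained statistical models, not lookup tables, but the row-per-token structure of the result is the same.

```python
import re

# Toy lemma / part-of-speech lexicon, purely for illustration.
# Real NLP tools replace this with trained statistical models.
LEXICON = {
    "the":      ("the", "DET"),
    "cats":     ("cat", "NOUN"),
    "were":     ("be", "VERB"),
    "sleeping": ("sleep", "VERB"),
}

def annotate(text):
    """Tokenize a sentence and attach lemma and part-of-speech guesses,
    returning one record per token (the table shape most NLP tools emit)."""
    tokens = re.findall(r"\w+", text.lower())
    rows = []
    for i, tok in enumerate(tokens):
        lemma, pos = LEXICON.get(tok, (tok, "UNK"))
        rows.append({"id": i + 1, "token": tok, "lemma": lemma, "pos": pos})
    return rows

for row in annotate("The cats were sleeping"):
    print(row)
```

Once a corpus is in this tabular form, questions such as "which verbs occur most often in these novels?" become simple filtering and counting operations, which is why the token table is the workhorse data structure of the workshop's tutorials.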
The workshop aims to provide both a conceptual understanding of these techniques and the basic programming skills required to employ them in future research projects. Material is adapted from the instructor's recent textbook *Humanities Data in R: Exploring Networks, Geospatial Data, Images, and Text*.