Text Analysis Across Disciplines: Text Mining and Text Analysis

Term: 
Spring
Credits: 
2.0
Course Description: 

The goal of this course is to provide training in text mining simultaneously on two distinct levels. The material from both levels should be accessible and useful to social scientists as well as humanists, as strategies such as text mining, content analysis, sentiment analysis and entity extraction are becoming fundamental to research on large and diverse digital corpora.

The basic level is offered with no prerequisites, and is designed for students from the humanities or from qualitative social science backgrounds who are interested in learning the fundamentals of text analysis using computational methods. We will begin with typical source materials for qualitative research (archival records, interviews, print and online media, primary and secondary literature) and demonstrate how to collect and curate a full-text, machine-readable corpus, extract and standardize metadata, and then analyze and visualize the text. The last section of the course will look closely at how such techniques can be integrated with non-computational methods to create a balanced and nuanced analysis that is informed by ‘distant reading’ but does not sacrifice the complexities offered by close reading.

The more advanced level of the boot camp is designed for students who have some exposure to statistically informed methods and to query or programming languages (SQL, Python, R) and would like to apply these methods to the computational and statistical analysis of texts. Prerequisites will include Introduction to Statistics (or equivalent) and Introduction to R (or equivalent). We will also take students through the process of corpus creation, but at a much faster pace, as we will assume basic familiarity with scraping, OCR, and dataset curation. The more advanced level will offer more specific training in stylometry, entity extraction, and other features of natural language processing.

Topics covered:

  • corpus selection and cleaning
  • metadata collection
  • research question design
  • basic programming for textual analysis (R software environment and relevant packages: tm, stylo, ggplot2, topicmodels, klaR, etc.)
  • applied statistical evaluation of results
  • topic modeling
  • stylometry (authorship attribution and forensic authorship analysis)
  • frequency analysis and genre
  • classification, variable selection and discriminant function analysis
  • natural language processing, part of speech tagging, named entity recognition

Learning Outcomes: 

By the end of this course, students in the basic level will be able to:

  • create and clean a full-text corpus selection
  • extract relevant metadata
  • design research questions appropriate to textual analysis
  • use out-of-the-box tools for textual analysis (Voyant, stylo, and basic packages in the R software environment)

Students in the more advanced level will be able to:

  • create and clean a full-text corpus selection
  • extract relevant metadata
  • design research questions appropriate to textual analysis
  • use out-of-the-box tools for textual analysis (Voyant, stylo, and basic packages in the R software environment)
  • carry out stylometric analysis (authorship attribution and forensic authorship analysis)
  • carry out classification, variable selection and discriminant function analysis
  • work with the basic features of the Natural Language Processing Toolkit, including part-of-speech tagging and named entity recognition

Assessment: 

Attendance (25% of the final grade) This is an intensive course (4 meetings/week) which will only be effective if students commit to coming to every class, barring a major emergency or other unforeseen circumstances.

Practice sessions (25% of the final grade) We will assign online exercises and quizzes in each of the relevant methodologies to ensure that students deepen their comprehension from the receptive level to the pragmatic level. These exercises will not be graded, but simply marked for completion.

Final Project (50% of the final grade) In the last week of the boot camp, students will develop a small-scale text analysis project in their own area of interest. Students can work individually or in small collaborative teams, as long as each team member’s responsibilities and contributions are clearly identified.