Text Analysis Across Disciplines: Text Mining Boot Camp

Term: Spring
Credits: 2.0
Course Description: 

Schedule:

Tuesday, April 10-Friday, April 13

Monday, April 16-Friday, April 20

Two lectures/workshops each morning (9:30-10:50; 11:00-12:40) -- mandatory

Lab session in the afternoon (13:30-15:10) -- optional

Classes will be held in Nádor 13: Rooms 302 and 303*

*except Thursday, April 18: Rooms 301 and 302

The goal of this course is to provide training in text mining simultaneously on two distinct levels. The material from both levels should be accessible and useful to social scientists as well as humanists, as strategies such as text mining, content analysis, sentiment analysis and entity extraction are becoming fundamental to research on large and diverse digital corpora.

The basic level has no prerequisites and is designed for students from the humanities or from qualitative social science backgrounds who are interested in learning the fundamentals of text analysis using computational methods. We will begin with typical source materials for qualitative research (archival records, interviews, print and online media, primary and secondary literature) and demonstrate how to collect and curate a full-text, machine-readable corpus, extract and standardize metadata, and then analyze and visualize the text. The last section of the course will look closely at how such techniques can be integrated with non-computational methods to create a balanced and nuanced analysis, one that is informed by ‘distant reading’ but does not sacrifice the complexities offered by close reading.

The more advanced level of the boot camp is designed for students who have some exposure to statistically informed methods and to query or programming languages (SQL, Python, R) and would like to apply these methods to the computational and statistical analysis of texts. Prerequisites will include Introduction to Statistics (or equivalent) and Introduction to R (or equivalent). We will also take students through the process of corpus creation, but at a much faster pace, as we will assume basic familiarity with scraping, OCR, and dataset curation. The more advanced level will offer more specific training in stylometry, entity extraction, and other features of natural language processing.

Topics covered:

  • corpus selection and cleaning
  • metadata collection
  • research question design
  • basic programming for textual analysis (R software environment and relevant packages: tm, stylo, ggplot2, topicmodels, klaR, etc.)
  • applied statistical evaluation of results
  • topic modeling
  • stylometry (authorship attribution and forensic authorship analysis)
  • frequency analysis and genre
  • classification, variable selection and discriminant function analysis
  • natural language processing, part of speech tagging, named entity recognition

Learning Outcomes: 

By the end of this course, students in the basic level will be able to:

  • create and clean a full-text corpus selection
  • extract relevant metadata
  • design research questions appropriate to textual analysis
  • use out-of-the-box tools for textual analysis (Voyant, stylo, and basic packages in the R software environment)

Students in the more advanced level will be able to:

  • create and clean a full-text corpus selection
  • extract relevant metadata
  • design research questions appropriate to textual analysis
  • use out-of-the-box tools for textual analysis (Voyant, stylo, and basic packages in the R software environment)
  • carry out stylometric analysis (authorship attribution and forensic authorship analysis)
  • carry out classification, variable selection and discriminant function analysis
  • work with the basic features of the Natural Language Processing Toolkit, including part of speech tagging and named entity recognition

Assessment: 

Attendance (50% of the final grade) This is an intensive course (4 meetings/week) which will only be effective if students commit to coming to every class, barring a major emergency or other unforeseen circumstances.

Practice sessions (25% of the final grade) We will assign exercises in each of the relevant methodologies to ensure that students deepen their comprehension from the receptive level to the pragmatic level.

Final Evaluation (25% of the final grade) In the last week of the boot camp, students will be asked to evaluate which of the text mining methodologies seem most relevant to their future work, and to identify areas or skill sets they would like to learn more about. We will distribute a survey with such questions at the end of the course and ask students to answer thoughtfully, in order to help us identify possible future areas for workshops and methods courses.