LING 226 Assignment 1, 2023
Short Program and Written Reflection 1 (25% of total grade)
The goal of this assignment is to develop a program that reads in text data, performs some preprocessing on the data, and then compares the effects of different preprocessing on various text metrics. You should construct a program with these functions:
• a function to preprocess text data which can:
◦ remove punctuation
◦ remove stopwords
◦ lowercase all words
◦ remove words below/above a certain frequency
• a function (or functions) to calculate these text metrics:
◦ total number of words
◦ overall lexical diversity of the text
◦ average lexical diversity of text sentences
◦ top ten most frequent words
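As a rough illustration of how the functions listed above might fit together, here is a minimal sketch in plain Python. It is not the course notebook code: the function names, the parameter names, and the definition of lexical diversity as unique tokens divided by total tokens are all assumptions made for this example. The sketch expects text that has already been tokenised into a list of words; with NLTK, such lists could come from a corpus reader such as `nltk.corpus.brown.words()`, and a stopword list from `nltk.corpus.stopwords.words('english')`.

```python
import string
from collections import Counter


def preprocess(tokens, lowercase=False, remove_punct=False,
               stopwords=None, min_freq=None, max_freq=None):
    """Return a new token list with the selected steps applied.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    out = list(tokens)
    if lowercase:
        out = [t.lower() for t in out]
    if remove_punct:
        # Drop tokens made up entirely of punctuation (and empty tokens).
        out = [t for t in out
               if not all(ch in string.punctuation for ch in t)]
    if stopwords is not None:
        # Compare in lowercase so "The" is caught by a lowercase list.
        out = [t for t in out if t.lower() not in stopwords]
    if min_freq is not None or max_freq is not None:
        # Frequencies are counted after the earlier steps have run.
        counts = Counter(out)
        lo = min_freq if min_freq is not None else 0
        hi = max_freq if max_freq is not None else float("inf")
        out = [t for t in out if lo <= counts[t] <= hi]
    return out


def total_words(tokens):
    """Total number of tokens."""
    return len(tokens)


def lexical_diversity(tokens):
    """Unique tokens divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def average_sentence_diversity(sentences):
    """Mean lexical diversity over a list of non-empty token lists."""
    scores = [lexical_diversity(s) for s in sentences if s]
    return sum(scores) / len(scores) if scores else 0.0


def top_words(tokens, n=10):
    """The n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)
```

One design choice worth noting: the order of the steps matters. Here the frequency filter runs last, so its counts reflect lowercasing and stopword removal. Running it first would give different results, which is itself something you could examine in your experiment.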
The course notebooks have everything you need to create these functions. You can reuse any and all of the functions in the course notebooks to create your program. After creating these functions, you need to conduct an experiment. The goal of your experiment is to compare the effects of preprocessing on the different text metrics. To do so, you need to use data from at least two sources:
1. One of the built-in NLTK corpora resources (e.g., Brown, State of the Union)
2. Data from The Current (data from at least two questions)
Using this data, look for trends and consistent effects that preprocessing has on various text metrics. Also look to see if there are any texts more or less immune to the effects of preprocessing. After conducting your experiment, write a short report (500-600 words) reflecting on your results. You should detail the comparisons and analyses that you conducted, what results you found, and your interpretation of the results. Specifically, you should focus on what happens to these metrics under different preprocessing conditions, and focus on making conclusions about their implications for text analysis in general.
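One possible way to organise the comparisons described above is to run every text through every preprocessing condition and collect the metrics in a single table. The sketch below is only an illustration under stated assumptions: the condition names, the tiny `STOPWORDS` set, the `compare` helper, and the toy texts are all hypothetical, and lexical diversity is assumed to mean unique tokens divided by total tokens. In a real experiment the token lists would come from your NLTK corpus and from The Current data.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "of", "and"}


def is_punct(token):
    """True if the token consists only of punctuation characters."""
    return all(ch in string.punctuation for ch in token)


def lexical_diversity(tokens):
    """Unique tokens divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Each condition maps a token list to a new token list.
CONDITIONS = {
    "raw": lambda toks: list(toks),
    "lowercase": lambda toks: [t.lower() for t in toks],
    "no_punct": lambda toks: [t for t in toks if not is_punct(t)],
    "no_stopwords": lambda toks: [t for t in toks
                                  if t.lower() not in STOPWORDS],
}


def compare(texts):
    """Return {text_name: {condition: (n_tokens, lexical_diversity)}}."""
    results = {}
    for name, tokens in texts.items():
        results[name] = {}
        for cond, func in CONDITIONS.items():
            processed = func(tokens)
            results[name][cond] = (len(processed),
                                   lexical_diversity(processed))
    return results


if __name__ == "__main__":
    # Toy data standing in for real corpora.
    toy_texts = {
        "text_a": ["The", "the", "cat", "."],
        "text_b": ["A", "dog", "and", "a", "cat", "!"],
    }
    for name, by_cond in compare(toy_texts).items():
        for cond, (n, div) in by_cond.items():
            print(f"{name:8} {cond:13} words={n:3d} diversity={div:.2f}")
```

Laying the results out this way makes it easier to see which conditions move a metric consistently across texts and which texts barely change.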
You should submit your assignment as a .ipynb notebook file in Canvas by the due date. Your notebook should have a text cell at the start that includes your name, your student ID, and whether you are attempting to complete the challenge (see below). The notebook should include all of the code cells, plus your written report as text cells. You are free to mix code and text cells as you deem appropriate.
Marking Guidelines
A-level papers will run a number of comparisons and report the differences between text categories in a clear and descriptive manner. The written reflection will be equally descriptive and include insightful reflections and deductions on how preprocessing affects these text metrics. These reflections and deductions will be clearly connected to the data and results from the student’s analysis. All of the code cells will work properly. The paper includes a successful attempt at the challenge.
B-level papers will run the comparisons and note the differences between text categories. The written report will be partially descriptive but also include reflections and deductions on how preprocessing affects these text metrics. There are some connections made to the results of the student’s analysis. All of the code cells will work properly. The challenge is attempted with limited success.
C-level papers will run few comparisons and make note of important differences between text categories. The written reflection will be mostly descriptive. All of the code cells will work properly. No challenge is attempted.
D-level papers will run one or a few comparisons between texts, make note of some differences between the texts, and include a written reflection which is too short and too descriptive. Some of the code cells may not work properly.
A-level Challenge
A-level papers need to go above and beyond the rest. Students need a way to play to their strengths. The challenge provides that opportunity. Students can either flex their computer science skills, showcase their critical thinking abilities and/or domain knowledge outside of computer science, or some combination of both. In either case, you should be driven by a desire to have your assignment used as an exemplar for next year’s cohort of students.
I want to leverage my computer science skills:
Go nuts with your program, but in a way that stays within the confines of the assignment prompt. You might want to develop new text metrics or improve upon the ones used in the notebook. You might find a way to efficiently compare data from multiple sources, combining the results computationally as a way to more empirically demonstrate the effects of preprocessing on text. You would still write a report which meets the criteria of A.
I want to leverage my non-computer science knowledge:
Write a report which blows my mind in its ability to make connections between your results and the assignment prompt, but also goes a step further to consider other contexts and domains. You may want to draw from your domain knowledge in languages, linguistics, or other content areas to discuss what might happen in other languages or domains beyond the data used in this assignment. You might even go out and find some additional research or papers on the topic and integrate them into your assignment.