
STA 279: Data Analysis 1

The Goal

We have been learning a lot about working with text data! The goal of this assignment is to apply the text analysis techniques we have been learning to a real-world text data set.

What are we actually submitting?

You will submit two documents.

• Data Analysis Formal Report: This will be the main component of the DA, which will include all of the sections outlined on the next page. Your Formal Report must be submitted as a PDF or HTML file. There must be no code or unformatted output showing in this document. This includes warnings and other stray code output, so hide it all. Points will be taken off for this! You will be graded on spelling, grammar, formatting, and writing, as well as your stats. Make sure you use spell check.

• Code Book: This is the .Rmd file of your work, which will show all of your code. It must be clear from this file which code goes with each section of your report. Dr. Dalzell must be able to run your code and reproduce the results from your data analysis. The goal is that a person who reads your report and wants to replicate your results could access your code appendix and completely reproduce the results and figures in your formal report. One suggestion: it is easiest to simply work in Markdown to make your formal report. Annotate the code along the way, but use echo = FALSE on line 9 of your Markdown file to hide all the code. When you are done, knit the file to save your report. Then, submit both the .Rmd and PDF / HTML versions of the document!
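As a rough sketch of the echo = FALSE suggestion, the setup chunk near the top of a default R Markdown file could be changed along these lines. The warning and message options are an assumption added here, for the case where you also want to suppress warnings and package start-up messages in every chunk; check your knitted report to confirm nothing stray still shows.

```r
# Body of the setup chunk near the top of the .Rmd file.
# echo = FALSE hides the code in the knitted report.
# warning and message are an added assumption: they hide warnings and
# package start-up messages in every chunk.
knitr::opts_chunk$set(
  echo = FALSE,
  warning = FALSE,
  message = FALSE
)
```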

Can we work with a partner?

You are welcome to work with one other person!! Each person must fully contribute, and you must email Dr. Dalzell before 9 PM on Wednesday, February 12th confirming who you are working with. If you work with another person, their name must be included on your report submission for them to receive credit, and only one person should submit an assignment. If you do not report your partner’s name and you work with someone else, both parties will receive a 0.

Can I use AI?

You may not use AI tools of any kind on any part of this Data Analysis (this includes code and writing). Violating this in any way will result at minimum in a 0 on the assignment.

Section 1: Introduction

This first section can be fairly short. I just need you to tell me which data set you have chosen, and briefly explain why you chose this data set (there is no correct answer to this, I’m just curious!!!).

Section 2: Text Length

One of the first things we learned when working with text data is that length can be an important thing to look at when comparing different texts. Is one text longer than another, and why might that be?

For this section, you will be comparing text length between at least two groups in your data set. This could be comparing how many lines of text are spoken by different characters or how this changes across seasons, or how long essays tend to be if written by AI or a human, etc. Each of you has chosen a different data set, so what you choose will be slightly different.

Your task: State clearly what research question you are going to explore that relates to text length. Create an appropriate, well formatted, labelled graph to explore your research question, and discuss what the graph and results tell you about the answer to your research question.

Example: Comparing the number of words spoken by Lorelai, Rory, Emily, and Luke in Gilmore Girls in different seasons.
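A minimal sketch of one way to compare text length, assuming the dplyr, tidytext, and ggplot2 packages. The data frame lines_df and its columns speaker and text are hypothetical stand-ins, so your own data set and column names will differ.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Hypothetical toy data: one row per spoken line.
lines_df <- data.frame(
  speaker = c("A", "A", "B"),
  text = c("Coffee, please.", "I need more coffee right now.", "Fine."),
  stringsAsFactors = FALSE
)

# Split each line into one word per row, then count the words per speaker.
word_totals <- lines_df %>%
  unnest_tokens(word, text) %>%
  count(speaker, name = "n_words")

# Labelled bar chart of the total number of words per speaker.
length_plot <- ggplot(word_totals, aes(x = speaker, y = n_words)) +
  geom_col() +
  labs(
    title = "Total Words Spoken by Speaker",
    x = "Speaker",
    y = "Number of Words"
  )
```

Printing length_plot draws the chart. Counting words is only one possible measure of length; counting lines or characters per group would follow the same pattern.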

Section 3: Top Words

In the next section of the analysis, you will be looking at words themselves.

Your task: Create and state a research question that involves the most frequent words in text. Create an appropriate, well formatted, labelled graph to explore your research question, and discuss what the graph and results tell you about the answer to your research question.

Example: Looking at the top 10 words spoken by a specific character in Friends.
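A minimal sketch of finding the most frequent words, assuming the dplyr and tidytext packages. The data frame script_df and its columns are hypothetical, and removing stop words with the stop_words list from tidytext is one common choice, not a requirement.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: three lines spoken by one character.
script_df <- data.frame(
  speaker = "A",
  text = c("The coffee and the diner", "Coffee in the town", "More coffee"),
  stringsAsFactors = FALSE
)

# One word per row, drop common stop words, then keep the 10 most frequent.
top_words <- script_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 10)
```

Note that slice_max() keeps ties by default, so the result can contain more than 10 rows when several words share the same count.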

Section 4: TF-IDF

Your task: Create and state a research question that involves distinguishing 3 or more text documents by looking at the words with the highest TF-IDF values. Create an appropriate, well formatted, labelled graph to explore your research question, and discuss what the graph and results tell you about the answer to your research question. If you have chosen the AI data set, you only need to compare 2 documents, because your data set does not have 3.

Example: What are the top 10 words that most distinguish texts written by AI from texts written by humans?
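A minimal sketch of computing TF-IDF, assuming the dplyr and tidytext packages. The data frame docs_df and its columns doc and text are hypothetical. The bind_tf_idf() function computes term frequency as a word's count divided by the total words in its document, and inverse document frequency as the natural log of the number of documents divided by the number of documents containing the word.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: three short "documents".
docs_df <- data.frame(
  doc = c("doc1", "doc2", "doc3"),
  text = c("coffee coffee town", "town books", "town diner"),
  stringsAsFactors = FALSE
)

# Count each word within each document, then attach tf, idf, and tf_idf.
tf_idf_df <- docs_df %>%
  unnest_tokens(word, text) %>%
  count(doc, word) %>%
  bind_tf_idf(word, doc, n) %>%
  arrange(desc(tf_idf))
```

In this toy example, "town" appears in every document, so its TF-IDF is 0, while "coffee" appears only in doc1 and ranks highest. That is the sense in which TF-IDF picks out the words that distinguish one document from the others.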

Section 5: Sentiment Analysis

In this section, you will be comparing the sentiment in the lines of text.

Your task: Create and state a research question that involves analyzing the sentiment of text. Create an appropriate, well formatted, labelled graph to explore your research question, and discuss what the graph and results tell you about the answer to your research question.
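A minimal sketch of one way to compare sentiment between groups, assuming the dplyr and tidytext packages and the Bing lexicon that ships with tidytext. The data frame lines_df and its columns are hypothetical, and the Bing lexicon is just one option; use whichever lexicon fits your research question.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: one line of text per speaker.
lines_df <- data.frame(
  speaker = c("A", "B"),
  text = c("I love this happy town", "I hate this terrible sad day"),
  stringsAsFactors = FALSE
)

# Keep only the words found in the Bing lexicon, then count the
# positive and negative words for each speaker.
sentiment_counts <- lines_df %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(speaker, sentiment)
```

Words that are not in the lexicon are dropped by the join, so the counts describe only the words the lexicon recognizes.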