Statistical and Predictive Modeling for Analytics I

Final Project (total 30 points)

The final project is worth 30% of the grade.

The final project uses the hypothesis testing framework and least squares regression to answer a question in economics: “By how much will another year of schooling raise one’s income?” We will use data collected by interviewing twins about their education, income and background. The records come from genetically identical twins, which provides an excellent control for confounding variables. Your task is to use the statistical and predictive modeling techniques you have learned in class to test hypotheses about years of education and income. You will perform exploratory analysis, state hypotheses, and test those hypotheses using simple tests as well as simple and multiple linear regression. Based on your analysis, you will state your conclusions about the research question and estimate any impacts (for example, the effect on income of a one-year increase or decrease in education).

You will create a PowerPoint deck to report your findings and state your conclusions based on your results.

Data set and related information:

The dataset is available in the UCLA Statistics Course Datasets:

https://dataspace.princeton.edu/jspui/handle/88435/dsp012801pg35n

The dataset, labels and summary statistics are uploaded to the project content folder.

Read through the dataset information, variables information and relevant papers.

Note that you will need a method for handling missing data. Please refer to the “Some tips you will find useful” section for more information. Also refer to the Student’s Guide to R for methods of handling missing data.

Note that the dataset twins.dat contains a total of 183 records, and each record covers both twins. That is, each record contains the education level, hourly wage and demographics for twin 1 and for twin 2.

Here is a link to the main paper that uses this dataset and describes the approaches used:

http://www.uh.edu/~adkugler/Ashenfelter&Krueger.pdf

Read the paper to understand some methods used.

The following is a checklist of the contents for each slide.

Slide 1 [3 points]

Name of presenter

Description of the research question

A high level description of how you would use statistical and predictive modeling (what you have learned in class) to answer the research question.

Slides 2-4 [3 points]

Create some basic plots and graphs (histograms, boxplots, scatterplots) of the data

Also compute some statistics of the variables that you think are important

Plot some scatter plots showing the bivariate scatter of variables

Slides 5-6 [4 points]

Describe any abnormalities in the data (such as missing data)

Explain how you addressed these abnormalities and the resulting dataset
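One simple way to address missing values is complete-case (listwise) deletion: drop any twin pair in which a field is missing. The sketch below illustrates the idea in plain Python. The field names, the toy values and the missing-value markers (".", "NA", empty string) are all assumptions made for illustration; check how twins.dat actually encodes missing data before relying on anything like this.

```python
import math

# Toy records, NOT real values from twins.dat.
# "." and "NA" are assumed missing-value markers.
raw_rows = [
    {"wage_twin1": "11.5", "wage_twin2": "9.0", "educ_twin1": "16", "educ_twin2": "14"},
    {"wage_twin1": ".", "wage_twin2": "7.5", "educ_twin1": "12", "educ_twin2": "12"},
    {"wage_twin1": "20.0", "wage_twin2": "18.0", "educ_twin1": "18", "educ_twin2": "NA"},
    {"wage_twin1": "8.0", "wage_twin2": "8.5", "educ_twin1": "12", "educ_twin2": "13"},
]

MISSING_MARKERS = {"", ".", "NA"}  # assumption, adjust to the real file


def to_number(text):
    """Convert a raw text field to float, or NaN when it is a missing marker."""
    text = text.strip()
    return math.nan if text in MISSING_MARKERS else float(text)


def complete_cases(rows):
    """Keep only the twin pairs in which every field is present."""
    parsed = [{key: to_number(value) for key, value in row.items()} for row in rows]
    return [row for row in parsed if not any(math.isnan(v) for v in row.values())]


clean_rows = complete_cases(raw_rows)
print(f"kept {len(clean_rows)} of {len(raw_rows)} twin pairs")
```

Dropping the whole pair, rather than one twin, keeps the data usable for paired comparisons. Whatever method you choose, report how many records were removed and why.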

Slide 7 [4 points]

State the hypotheses related to the research question

Slides 8-9 [4 points]

Report the results of the t-tests used to test your hypotheses, and state whether each null hypothesis is rejected or not rejected

State what assumptions need to be satisfied for the t-test and whether they are satisfied
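For paired data such as twins, the t statistic is computed from the within-pair differences: the mean difference divided by its standard error, with n - 1 degrees of freedom. The following is a minimal sketch of that calculation using only the Python standard library. The wage values are made up for illustration and are not taken from twins.dat.

```python
import math
import statistics


def paired_t_statistic(first, second):
    """Return (t, degrees of freedom) for H0: mean(first - second) == 0."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Sample standard deviation (n - 1 denominator) of the differences.
    std_error = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / std_error, n - 1


# Toy hourly wages for five twin pairs (illustrative values only).
wage_twin1 = [11.5, 8.0, 15.0, 9.5, 12.0]
wage_twin2 = [9.0, 8.5, 13.0, 9.0, 10.5]

t_stat, dof = paired_t_statistic(wage_twin1, wage_twin2)
print(f"t = {t_stat:.3f} on {dof} degrees of freedom")
```

The p-value comes from comparing t with the t distribution on those degrees of freedom, which the standard library does not provide. In R, `t.test(x, y, paired = TRUE)` reports the statistic, the degrees of freedom and the p-value together.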

Slides 10-11 [4 points]

Perform a simple linear regression and report the results

Interpret the coefficients

Report on any hypothesis tests

State assumptions and whether they are satisfied
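The least squares line for one predictor has a closed form: the slope is the ratio of the cross-deviation sum to the squared-deviation sum of the predictor, and the intercept makes the line pass through the means. The sketch below shows this on made-up values. The variable names and the numbers are assumptions for illustration, not data from twins.dat.

```python
import statistics


def simple_ols(x, y):
    """Return (intercept, slope) of the least squares line y = a + b * x."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Toy data: years of education and hourly wage for five people.
educ_twin2 = [12, 12, 14, 16, 18]
wage_twin2 = [8.0, 9.0, 11.0, 13.0, 16.0]

intercept, slope = simple_ols(educ_twin2, wage_twin2)

# Residuals are the starting point for checking the regression assumptions
# (for example, plotting them against the fitted values).
residuals = [y - (intercept + slope * x) for x, y in zip(educ_twin2, wage_twin2)]
print(f"wage = {intercept:.3f} + {slope:.3f} * educ")
```

Here the slope is read as the estimated change in hourly wage associated with one additional year of education. In R, `lm(wage ~ educ)` fits the same line and also reports standard errors and t-tests for the coefficients.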

Slides 12-13 [4 points]

Perform a multiple linear regression and report the results

Interpret the coefficients

State assumptions and whether they are satisfied
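A multiple regression of log wage on education, age, age squared, male and white can be set up as a design matrix with one column per term plus an intercept. The sketch below solves the least squares problem with a modified Gram-Schmidt QR factorization in plain Python. Every value is synthetic: the response is generated from assumed coefficients with no noise, purely to show that the routine recovers them. It is not an analysis of twins.dat, and in practice a statistics package would be the usual choice.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def least_squares(design, y):
    """Least squares coefficients and residuals via modified Gram-Schmidt QR."""
    k = len(design[0])
    columns = [[row[j] for row in design] for j in range(k)]
    q = []                              # orthonormal directions found so far
    r = [[0.0] * k for _ in range(k)]   # upper-triangular factor
    z = [0.0] * k                       # projections of y onto each direction
    resid = list(y)
    for j in range(k):
        v = columns[j][:]
        for i in range(j):
            r[i][j] = dot(q[i], v)
            v = [vi - r[i][j] * qi for vi, qi in zip(v, q[i])]
        r[j][j] = math.sqrt(dot(v, v))
        q.append([vi / r[j][j] for vi in v])
        z[j] = dot(q[j], resid)
        resid = [ri - z[j] * qi for ri, qi in zip(resid, q[j])]
    beta = [0.0] * k
    for j in reversed(range(k)):        # back-substitution: R beta = z
        known = sum(r[j][c] * beta[c] for c in range(j + 1, k))
        beta[j] = (z[j] - known) / r[j][j]
    return beta, resid


def design_row(educ, age, male, white):
    """Intercept, education, age, age squared, male dummy, white dummy."""
    return [1.0, educ, age, age ** 2, male, white]


# Synthetic people: (education in years, age, male, white). Illustrative only.
people = [(12, 25, 1, 1), (16, 31, 0, 1), (14, 38, 1, 0), (18, 44, 0, 1),
          (12, 52, 1, 1), (13, 29, 0, 0), (16, 47, 1, 1), (15, 35, 0, 1)]
design = [design_row(*p) for p in people]

# Assumed coefficients used only to generate a noise-free response.
true_beta = [0.5, 0.08, 0.04, -0.0004, 0.2, -0.1]
log_wage = [dot(row, true_beta) for row in design]

beta_hat, residuals = least_squares(design, log_wage)
print([round(b, 4) for b in beta_hat])
```

With real data the response would be `math.log(wage)` for each person. Because the response is in logs, the education coefficient is read as an approximate proportional change: a coefficient b corresponds to roughly a 100 * b percent change in wage per additional year of education, holding the other variables fixed.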

Slides 14-15 [4 points]

State your conclusions about the research question based on any evidence from your analysis

Some tips you will find useful

1. You might find the following page useful as a starting point for handling missing data:

http://www.statmethods.net/input/missingdata.html

2. Converting a factor to a numeric variable: this Stack Overflow page has some tips on how to do so without losing information:

https://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-an-integer-numeric-without-a-loss-of-information

3. You will do two t-tests in support of the analysis for this study. Treating the data on the twins as paired data, state appropriate hypotheses and run one t-test on the difference in hourly wages and another on the difference in years of education.

4. You will fit a simple linear regression of hourly wage of twin 2 against self-reported education of twin 2.

5. You will fit a multiple linear regression of log wages against own education, age, age squared, male and white.

6. You can fit any other regressions that you deem necessary to answer the research question.

Final Term Project Rubric

Each group of slides is graded at one of four levels: Exemplary, Proficient, Incomplete, or Incorrect or Unacceptable. Points for each level are given in parentheses.

Slide 1
Exemplary (3): Name along with a clear description of the research question is given. A clear description of how statistical and predictive modeling can be used to answer the research question.
Proficient (2): Name along with a clear description of the research question is given. Mostly clear description of how statistical and predictive modeling can be used to answer the research question.
Incomplete (1): Name along with a clear description of the research question is given. An incomplete description of how statistical and predictive modeling can be used to answer the research question.
Incorrect or Unacceptable (0): Description of the research problem is incorrect or missing. Description of how statistical and predictive modeling can be used to answer the research question is missing or incorrect.

Slides 2-4
Exemplary (4): Histograms, boxplots and scatterplots are correct. Statistics computed are correct and meaningful. Scatterplots showing bivariate scatter are correct.
Proficient (3): Histograms, boxplots and scatterplots are correct. Statistics computed are mostly correct and meaningful. Scatterplots showing bivariate scatter are mostly correct.
Incomplete (2): Histograms, boxplots and scatterplots are mostly correct. Statistics computed are mostly correct and meaningful. Scatterplots showing bivariate scatter may be incorrect or incomplete.
Incorrect or Unacceptable (0-1): Some plots are correct and some statistics are correct. Others are mostly wrong.

Slides 5-6
Exemplary (4): Any abnormalities are clearly identified. A clear explanation of how the abnormalities were addressed is presented along with a description of the final resulting dataset.
Proficient (3): Any abnormalities are clearly identified. A mostly clear explanation of how the abnormalities were addressed is presented along with a description of the final resulting dataset.
Incomplete (2): Any abnormalities are clearly identified. A somewhat clear or incomplete explanation of how the abnormalities were addressed is presented along with a description of the final resulting dataset.
Incorrect or Unacceptable (0-1): Abnormalities identified are incorrect and the explanation is missing or incorrect.

Slide 7
Exemplary (4): Hypotheses are clearly stated and correct.
Proficient (3): Hypotheses are clearly stated and mostly correct.
Incomplete (2): Hypotheses are incomplete.
Incorrect or Unacceptable (0-1): Hypotheses are missing or incorrect.

Slides 8-9
Exemplary (4): Results of the t-test are reported correctly. Assumptions that need to be satisfied are clearly stated along with whether they were satisfied.
Proficient (3): Results of the t-test are reported correctly. Assumptions stated are correct and the explanation of whether they were satisfied is mostly correct.
Incomplete (2): Results of the t-test are reported correctly. Assumptions are incomplete and the explanation is also incomplete.
Incorrect or Unacceptable (0-1): Results of the t-test are incorrect.

Slides 10-11
Exemplary (4): Simple linear regression is performed correctly and reported correctly. Coefficients are interpreted correctly, hypothesis tests are reported accurately, and assumptions along with whether they were satisfied are stated.
Proficient (3): Simple linear regression is performed correctly and reported correctly. Coefficients are interpreted correctly and hypothesis tests are reported accurately. Assumptions are incomplete.
Incomplete (2): Simple linear regression is performed correctly and reported correctly. There are some issues with coefficient interpretation, hypothesis tests or assumptions.
Incorrect or Unacceptable (0-1): Simple linear regression is incorrect and subsequently all other answers are also incorrect.

Slides 12-13
Exemplary (4): Multiple linear regression is performed correctly and reported correctly. Coefficients are interpreted correctly, and assumptions along with whether they were satisfied are stated.
Proficient (3): Multiple linear regression is performed correctly and reported correctly. Coefficients are interpreted correctly. Assumptions are incomplete.
Incomplete (2): Multiple linear regression is performed correctly and reported correctly. There are some issues with coefficient interpretation or assumptions.
Incorrect or Unacceptable (0-1): Multiple linear regression is incorrect and subsequently all other answers are also incorrect.

Slides 14-15
Exemplary (4): Conclusions about the research question are clearly stated and correct. Evidence for the conclusions is presented clearly.
Proficient (3): Conclusions about the research question are clearly stated and correct. Evidence for the conclusions is mostly presented clearly.
Incomplete (2): Conclusions about the research question are clearly stated and correct. Evidence for the conclusions is incomplete.
Incorrect or Unacceptable (0-1): Conclusions are incorrect or poorly stated.
