
Statistical and Predictive Modeling for Analytics I

Final Project (total 30 points)

The final project is worth 30% of the grade.

The final project uses the hypothesis-testing framework and least-squares regression to answer a question in economics: “By how much will another year of schooling raise one’s income?” We will use data collected by interviewing twins about their education, income and background. The records come from genetically identical twins, which provides an excellent control for confounding variables. Your task is to use the statistical and predictive modeling techniques you have learned in class to test hypotheses about years of education and income. You will perform exploratory analysis, state hypotheses, and test those hypotheses using simple tests as well as simple and multiple linear regression. Based on your analysis, you will state your conclusions about the research question and estimate any effects (such as the effect of a one-unit increase or decrease in years of education).

You will create a PowerPoint deck to report your findings and state your conclusions based on your results.

Data set and related information:

The dataset is available in the UCLA Statistics Course Datasets:

https://dataspace.princeton.edu/jspui/handle/88435/dsp012801pg35n

The dataset, labels and summary statistics are uploaded to the project content folder.

Read through the dataset information, the variable descriptions and the relevant papers.

Note that you will need a method for handling missing data. Please refer to the “Some tips you will find useful” section for more information. Also refer to the Student’s Guide to R for methods of handling missing data.
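
As one possible starting point, the sketch below shows listwise deletion (dropping every record that has any missing field) in Python. The four records and the use of None as the missing-value marker are made up for illustration; check the dataset labels for how twins.dat actually codes missing entries. Other strategies, such as imputation, may suit your analysis better.

```python
# Hypothetical records: (educ_twin1, educ_twin2, wage_twin1, wage_twin2).
# None marks a missing value. This encoding is an assumption, not the
# documented format of twins.dat.
records = [
    (12.0, 12.0, 8.50, 9.00),
    (16.0, 14.0, None, 11.25),
    (13.0, 13.0, 7.75, 7.75),
    (18.0, None, 15.00, 14.10),
]


def drop_incomplete(rows):
    """Listwise deletion: keep only rows in which no field is missing."""
    return [row for row in rows if all(value is not None for value in row)]


complete = drop_incomplete(records)
print(f"kept {len(complete)} of {len(records)} records")
```

Because each record holds both twins, dropping a record removes the whole pair, which keeps the remaining data suitable for paired comparisons.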

Note that the dataset twins.dat contains a total of 183 records, and each record covers both twins of a pair. That is, each record contains the education level, hourly wage and demographics for twin 1 and twin 2.

Here is a link to the main paper that uses this dataset and describes the approaches used:

http://www.uh.edu/~adkugler/Ashenfelter&Krueger.pdf

Read the paper to understand some methods used.

The following is a checklist of the contents for each slide.

Slide 1 [3 points]

Name of presenter

Description of the research question

A high-level description of how you would use statistical and predictive modeling (what you have learned in class) to answer the research question.

Slides 2-4 [3 points]

Create some basic plots and graphs (histograms, boxplots, scatterplots) of the data

Also compute some statistics of the variables that you think are important

Plot some scatter plots showing the bivariate scatter of variables

Slides 5-6 [4 points]

Describe any abnormalities in the data (such as missing data)

Explain how you addressed these abnormalities and the resulting dataset

Slide 7 [4 points]

State the hypotheses related to the research question

Slides 8-9 [4 points]

Report the results of the t-tests used to evaluate your hypotheses, stating whether you reject or fail to reject each null hypothesis

State what assumptions need to be satisfied for the t-test and whether they are satisfied
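
For orientation, a paired t-test on the within-pair differences can be sketched as follows in Python. The wage values are made up for illustration and the function name paired_t is hypothetical. The sketch returns only the t statistic and degrees of freedom; the p-value needs the t distribution, which R's t.test with paired = TRUE reports directly.

```python
import math
import statistics


def paired_t(x, y):
    """Return (t statistic, degrees of freedom) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (divides by n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1


# Made-up hourly wages for five twin pairs (illustration only).
wage_twin1 = [10.0, 12.5, 9.0, 15.0, 11.0]
wage_twin2 = [9.0, 13.0, 8.0, 13.5, 10.0]

t_stat, df = paired_t(wage_twin1, wage_twin2)
print(f"t = {t_stat:.3f}, df = {df}")
```

The test works on the differences within each pair, so its normality assumption concerns those differences rather than the two wage variables separately.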

Slides 10-11 [4 points]

Perform a simple linear regression and report the results

Interpret the coefficients

Report on any hypothesis tests

State assumptions and whether they are satisfied
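
The quantities reported on these slides can be computed as in the following Python sketch of a simple least-squares fit. The education and wage values are made up and the function name fit_line is hypothetical. In R, lm together with summary gives the same estimates along with p-values.

```python
import math


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x, plus the slope's standard error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom (two estimated parameters).
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return intercept, slope, se_slope


# Made-up values for four individuals (illustration only).
educ_twin2 = [12.0, 14.0, 16.0, 18.0]
wage_twin2 = [8.0, 10.0, 13.0, 15.0]

intercept, slope, se_slope = fit_line(educ_twin2, wage_twin2)
print(f"wage = {intercept:.2f} + {slope:.2f} * educ")
print(f"t statistic for slope = 0: {slope / se_slope:.2f} on {len(educ_twin2) - 2} df")
```

Here the slope is the estimated change in hourly wage for one additional year of education, and the t statistic tests the null hypothesis that this slope is zero.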

Slides 12-13 [4 points]

Perform a multiple linear regression and report the results

Interpret the coefficients

State assumptions and whether they are satisfied
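
A multiple regression of log wage on education, age, age squared, male and white can be sketched in Python with the standard library alone by solving the normal equations. Everything below is illustrative: the eight rows are made up, and the helper names solve and ols are hypothetical. For the real analysis a statistics package (for example lm in R) is the more practical route, since it also reports standard errors and diagnostics.

```python
import math


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up rows: (hourly wage, education, age, male, white).
rows = [
    (8.0, 12, 25, 1, 1),
    (11.5, 14, 32, 0, 1),
    (15.0, 16, 41, 1, 0),
    (9.5, 12, 53, 0, 1),
    (20.0, 18, 38, 1, 1),
    (13.0, 15, 29, 0, 0),
    (17.5, 16, 47, 1, 1),
    (10.0, 13, 60, 0, 0),
]

# Design matrix with an intercept column; the response is log wage.
X = [[1.0, educ, age, age ** 2, male, white] for _, educ, age, male, white in rows]
y = [math.log(wage) for wage, *_ in rows]

names = ["intercept", "educ", "age", "age_sq", "male", "white"]
coefs = ols(X, y)
for name, coef in zip(names, coefs):
    print(f"{name:>9}: {coef: .4f}")
```

Because the response is log wage, the education coefficient is read as the approximate proportional change in wage for one additional year of education, holding the other variables fixed.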

Slides 14-15 [4 points]

State your conclusions about the research question based on any evidence from your analysis

Some tips you will find useful

1. You might find the following page useful as a starting point for handling missing data:

http://www.statmethods.net/input/missingdata.html

2. Converting a factor to a numeric variable: this Stack Overflow page has some tips on how to do the conversion:

https://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-an-integer-numeric-without-a-loss-of-information

3. You will do two t-tests in support of the analysis for this study. Treating the data on the twins as paired data, state appropriate hypotheses and use t-tests to analyze the difference in hourly wages as well as the difference in years of education.

4. You will fit a simple linear regression of hourly wage of twin 2 against self-reported education of twin 2.

5. You will fit a multiple linear regression of log wages against own education, age, age squared, male and white.

6. You can fit any other regressions that you deem necessary to answer the research question.

Final Term Project Rubric

Slide 1
Exemplary (3): Name along with a clear description of the research question is given. A clear description of how statistical and predictive modeling can be used to answer the research question.
Proficient (2): Name along with a clear description of the research question is given. Mostly clear description of how statistical and predictive modeling can be used to answer the research question.
Incomplete (1): Name along with a clear description of the research question is given. An incomplete description of how statistical and predictive modeling can be used to answer the research question.
Incorrect or Unacceptable (0): Description of research problem is incorrect or missing. Description of how statistical and predictive modeling can be used to answer the research question is missing or incorrect.

Slides 2-4
Exemplary (4): Histograms, boxplots and scatterplots are correct. Statistics computed are correct and meaningful. Scatterplots showing bivariate scatter are correct.
Proficient (3): Histograms, boxplots and scatterplots are correct. Statistics computed are mostly correct and meaningful. Scatterplots showing bivariate scatter are mostly correct.
Incomplete (2): Histograms, boxplots and scatterplots are mostly correct. Statistics computed are mostly correct and meaningful. Scatterplots showing bivariate scatter may be incorrect or incomplete.
Incorrect or Unacceptable (0-1): Some plots are correct and some statistics are correct. Others are mostly wrong.

Slides 5-6
Exemplary (4): Any abnormalities are clearly identified. Clear explanation of how the abnormalities were addressed is presented along with a description of the final resulting dataset.
Proficient (3): Any abnormalities are clearly identified. Mostly clear explanation of how the abnormalities were addressed is presented along with a description of the final resulting dataset.
Incomplete (2): Any abnormalities are clearly identified. Somewhat clear or incomplete explanation of how the abnormalities were addressed is presented along with a description of the final resulting dataset.
Incorrect or Unacceptable (0-1): Abnormalities identified are incorrect and explanation is missing or incorrect.

Slide 7
Exemplary (4): Hypotheses are clearly stated and correct.
Proficient (3): Hypotheses are clearly stated and mostly correct.
Incomplete (2): Hypotheses are incomplete.
Incorrect or Unacceptable (0-1): Hypotheses are missing or incorrect.

Slides 8-9
Exemplary (4): Results of the t-test are reported correctly. Assumptions that need to be satisfied are clearly stated along with whether they were satisfied.
Proficient (3): Results of the t-test are reported correctly. Assumptions stated are correct and an explanation of whether they were satisfied is mostly correct.
Incomplete (2): Results of the t-test are reported correctly. Assumptions are incomplete and the explanation is also incomplete.
Incorrect or Unacceptable (0-1): Results of the t-test are incorrect.

Slides 10-11
Exemplary (4): Simple linear regression is performed correctly and reported correctly. Coefficients are interpreted correctly, hypothesis tests are reported accurately, and assumptions along with whether they were satisfied are stated.
Proficient (3): Simple linear regression is performed correctly and reported correctly. Coefficients are interpreted correctly, hypothesis tests are reported accurately. Assumptions are incomplete.
Incomplete (2): Simple linear regression is performed correctly and reported correctly. There are some issues with coefficient interpretation, hypothesis tests or assumptions.
Incorrect or Unacceptable (0-1): Simple linear regression is incorrect and subsequently all other answers are also incorrect.

Slides 12-13
Exemplary (4): Multiple linear regression is performed correctly and reported correctly. Coefficients are interpreted correctly, and assumptions along with whether they were satisfied are stated.
Proficient (3): Multiple linear regression is performed correctly and reported correctly. Coefficients are interpreted correctly. Assumptions are incomplete.
Incomplete (2): Multiple linear regression is performed correctly and reported correctly. There are some issues with coefficient interpretation or assumptions.
Incorrect or Unacceptable (0-1): Multiple linear regression is incorrect and subsequently all other answers are also incorrect.

Slides 14-15
Exemplary (4): Conclusions about the research question are clearly stated and correct. Evidence for the conclusions is presented clearly.
Proficient (3): Conclusions about the research question are clearly stated and correct. Evidence for the conclusions is mostly presented clearly.
Incomplete (2): Conclusions about the research question are clearly stated and correct. Evidence for the conclusions is incomplete.
Incorrect or Unacceptable (0-1): Conclusions are incorrect or poorly stated.
