Data Analysis Assignment 2 (DAE 2)

Instructions:

• Your assignment should be typed and submitted as a PDF or Word file in CANVAS. Assignments not in these formats will be assigned a maximum grade of 50%.

• Any necessary output should be copied and pasted directly into a word processing document (not appended to the end in an appendix). Hint: you may need to reformat the output so that it is neat and professional. All output should be clearly labeled for reference in your written responses. Include only the output that is necessary to answer the questions. Do not include an output dump (that is, output with no explanation of, or reference to, what it means).

• You must use R for the analysis.

• FOR ALL ANALYSES, USE α = 0.10.

Problem Formulation: Human Resources Analytics

This project involves analyzing data for an anonymous IT company. The main question the company asks is: Why are our best and most experienced employees leaving us? Employee data was collected, with the main variables described below:

• Employee Name: Employee’s full name

• EmpID: Employee ID is unique to each employee

• PayRate: The person’s hourly pay rate. All salaries are converted to hourly pay rate

• Position: The text name/title of the position the person has

• State: The state that the person lives in

• Zip: The zip code for the employee

• DOB: Date of Birth for the employee

• Sex: Sex - M or F

• MaritalDesc: The marital status of the person (divorced, single, widowed, separated, etc)

• CitizenDesc: Label for whether the person is a Citizen or Eligible NonCitizen

• RaceDesc: Description/text of the race the person identifies with

• DateofHire: Date the person was hired

• DateofTermination: Date the person was terminated, only populated if, in fact, Termd = 1

• TermReason: A text reason/description for why the person was terminated

• EmploymentStatus: A description/category of the person’s employment status. Anyone currently working full time = Active

• Department: Name of the department that the person works in

• ManagerName: The name of the person’s immediate manager

• RecruitmentSource: The name of the recruitment source where the employee was recruited from

• PerformanceScore: Performance score text/category (Fully Meets, Needs Improvement, Exceeds, or PIP, which means under disciplinary action or deficient)

• EngagementSurvey: Results from the last engagement survey, managed by our external partner

• SpecialProjectsCount: The number of special projects that the employee worked on during the last 6 months

• EmpSatisfaction: A basic satisfaction score between 0 and 60, obtained via different metrics (the higher the score, the more satisfied the employee), as reported in the employee documents.

Your goal is to create an effective model to predict EmpSatisfaction.

Pre-processing Steps (10 points) - PUT IN THE APPENDIX

The data set contains variables and observations that need to be dropped before creating an effective multiple linear regression model.

Steps Needed:

1) Perform the analysis only on employees who are active, that is, on the observations for which the variable EmploymentStatus is equal to Active.

2) Drop all unnecessary variables. In this part, you should drop:

• Status variables: EmploymentStatus, DateofTermination, and TermReason, since we only keep “Active” employees.

• All date variables: DOB, DateofHire (we will learn later how to include them effectively)

• ID variables: EmpID

• Names: Employee_Name, ManagerName

• Zip code and state: State, Zip (we will learn later how to include them effectively)

TIP: To delete the unnecessary variables, you can use either subsetting, e.g., data = data[,-colnumber], or data$variablename = NULL.

To show that you completed these steps successfully, paste your code and the str() output of your dataset below.
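As a rough illustration of the two pre-processing steps, the sketch below filters to active employees and then drops the listed variables by name. The data frame `hr` and every value in it are made up for this example (the real data set has more columns and rows), and the object names `hr_active`, `drop_vars`, and `hr_model` are hypothetical.

```r
# Synthetic stand-in for the HR data; every value here is made up.
hr <- data.frame(
  Employee_Name     = c("Employee A", "Employee B", "Employee C", "Employee D"),
  EmpID             = 1:4,
  PayRate           = c(28.5, 55.0, 21.0, 43.0),
  State             = "MA",
  Zip               = c("01001", "01002", "01003", "01004"),
  DOB               = "1980-01-01",
  DateofHire        = "2015-01-01",
  DateofTermination = c(NA, "2018-06-30", NA, NA),
  TermReason        = c(NA, "hypothetical reason", NA, NA),
  EmploymentStatus  = c("Active", "Terminated", "Active", "Active"),
  ManagerName       = "Manager X",
  EmpSatisfaction   = c(42, 18, 35, 51),
  stringsAsFactors  = FALSE
)

# Step 1: keep only the rows where EmploymentStatus is "Active".
hr_active <- hr[hr$EmploymentStatus == "Active", ]

# Step 2: drop status, date, ID, name, and location variables by name.
drop_vars <- c("EmploymentStatus", "DateofTermination", "TermReason",
               "DOB", "DateofHire", "EmpID",
               "Employee_Name", "ManagerName", "State", "Zip")
hr_model <- hr_active[, !(names(hr_active) %in% drop_vars), drop = FALSE]

# Inspect the structure of the cleaned data.
str(hr_model)
```

Dropping by name rather than by column number keeps the code valid if the column order of the input file changes.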

Analysis Step 1 (10 Points) - PUT IN THE APPENDIX. INCLUDE CONCLUSIONS IN THE MEMO

The IT company does not want you to include the following variables in the model: Department, RecruitmentSource, Position, and RaceDesc. Instead, the company wants you to describe whether the information in these variables shows any statistical evidence of dissatisfaction.

Deliverable:

Include outputs of the code and regression(s) you create along with the answers to these questions:

• Are employees of specific departments more dissatisfied than others?

• Is there any indication that the recruitment source (RecruitmentSource) matters in how satisfied employees are?

• Is there any indication that employees in any particular position (Position) are more likely to be dissatisfied than others?

• Are employees of specific races more dissatisfied than others?
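One possible way to approach these questions is a separate regression of EmpSatisfaction on each categorical variable, using the overall F test to check whether satisfaction differs across its levels. The sketch below shows this for Department only, on synthetic data; the department labels, the scores, and the object names are made up, and the same pattern would apply to RecruitmentSource, Position, and RaceDesc.

```r
set.seed(1)

# Synthetic data: hypothetical department labels and made-up scores.
demo <- data.frame(
  Department      = factor(rep(c("Dept A", "Dept B", "Dept C"), each = 20)),
  EmpSatisfaction = round(runif(60, min = 0, max = 60))
)

alpha <- 0.10

# Regress satisfaction on the categorical variable alone.
fit_dept <- lm(EmpSatisfaction ~ Department, data = demo)

# Overall F test. H0: mean satisfaction is the same in every department.
f_table <- anova(fit_dept)
p_dept  <- f_table["Department", "Pr(>F)"]
print(f_table)

# TRUE would suggest that satisfaction differs across departments.
print(p_dept < alpha)

# Each coefficient compares one department with the reference level,
# which helps identify which groups are lower or higher.
print(summary(fit_dept)$coefficients)
```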

Analysis Step 2 (60 Points) - PUT IN THE APPENDIX. INCLUDE CONCLUSIONS IN THE MEMO

First-order Regression Model (30 points)

1. Fit the multiple regression model, regressing Y (EmpSatisfaction) on X1 (PayRate), X2 (Sex), X3 (CitizenDesc), X4 (PerformanceScore), X5 (EngagementSurvey), and X6 (SpecialProjectsCount).

a) Conduct a test of overall model significance. State H0 and Ha, the F-statistic, degrees of freedom, p-value, and conclusion. (10 pts)

b) Conduct tests of partial significance for each slope coefficient (X1, X2, X3, X4, X5, X6). For each, state H0 and Ha, the test statistic, degrees of freedom, p-value, and conclusion. (10 pts)

c) Drop any variable that is insignificant at α = 0.10 (one at a time, starting with the more complex insignificant variables and then proceeding by significance level). What is the final first-order least squares regression fit? (5 pts)

d) What percentage of the variation in EmpSatisfaction can be explained by using your first-order final model in c)? (5 pts)

e) Explain the effect that each of the predictors in the final model from c) has on the response. (5 pts)
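The quantities requested in parts a), b), and d) can all be read from a fitted lm object. The sketch below is one way to pull them out; the data are synthetic (all values and category labels are made up, so the results say nothing about the real company), and the object names are hypothetical.

```r
set.seed(2)
n <- 80

# Synthetic data with the six first-order predictors; all values are made up.
demo <- data.frame(
  PayRate              = runif(n, 15, 80),
  Sex                  = factor(rep(c("F", "M"), each = n / 2)),
  CitizenDesc          = factor(rep(c("US Citizen", "Eligible NonCitizen"),
                                    times = c(60, 20))),
  PerformanceScore     = factor(rep(c("Exceeds", "Fully Meets",
                                      "Needs Improvement", "PIP"),
                                    length.out = n)),
  EngagementSurvey     = runif(n, 1, 5),
  SpecialProjectsCount = rpois(n, 2)
)
demo$EmpSatisfaction <- 20 + 5 * demo$EngagementSurvey + rnorm(n, sd = 6)

fit1 <- lm(EmpSatisfaction ~ PayRate + Sex + CitizenDesc + PerformanceScore +
             EngagementSurvey + SpecialProjectsCount, data = demo)
s1 <- summary(fit1)

# a) Overall significance: F statistic, its degrees of freedom, and p-value.
f_stat    <- s1$fstatistic  # named vector: value, numdf, dendf
p_overall <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                       lower.tail = FALSE))
print(f_stat)
print(p_overall)

# b) Partial t tests: estimate, standard error, t value, and p-value per
#    coefficient; the t tests use the residual degrees of freedom.
print(s1$coefficients)
print(fit1$df.residual)

# For a factor with several levels, drop1() gives one partial F test per term.
print(drop1(fit1, test = "F"))

# d) Proportion of variation in the response explained by the model.
print(s1$r.squared)
```

A factor with k levels contributes k - 1 coefficients, so the model's numerator degrees of freedom count coefficients rather than variables.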

Higher-order Regression Model (30 points)

2. Fit the multiple regression model, regressing Y (EmpSatisfaction) on X1 (PayRate), X2 (Sex), X3 (CitizenDesc), X4 (PerformanceScore), X5 (EngagementSurvey), X6 (SpecialProjectsCount), X7 (quadratic of Special Projects Count: I(SpecialProjectsCount^2)), and X8 (interaction between Pay Rate and Special Projects Count: PayRate:SpecialProjectsCount).

a) Drop insignificant variables at α = 0.10 (one at a time, starting with the more complex insignificant variables and then proceeding by significance level) until you get a model in which all terms are significant. (10 pts)

b) State your final higher-order model. (5 pts)

c) Explain the effect of each predictor on the response (10 pts)

d) What percentage of the variation in EmpSatisfaction can be explained by using your higher-order final model in b)? (5 pts)
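The higher-order terms and a single elimination step can be sketched as follows. This uses synthetic data with only two of the predictors, and it reads the dropping rule as "among the higher-order terms, remove the one with the largest p-value if it is not significant"; that reading, the data, and the object names are assumptions made for illustration.

```r
set.seed(3)
n <- 100

# Synthetic data; all values are made up.
demo <- data.frame(
  PayRate              = runif(n, 15, 80),
  SpecialProjectsCount = rpois(n, 3)
)
demo$EmpSatisfaction <- 30 + 0.1 * demo$PayRate +
  2 * demo$SpecialProjectsCount + rnorm(n, sd = 5)

alpha <- 0.10

# I(x^2) adds a quadratic term; a:b adds only the interaction term.
fit_full <- lm(EmpSatisfaction ~ PayRate + SpecialProjectsCount +
                 I(SpecialProjectsCount^2) + PayRate:SpecialProjectsCount,
               data = demo)
coef_tab <- summary(fit_full)$coefficients

# Look at the more complex (higher-order) terms first.
higher <- c("I(SpecialProjectsCount^2)", "PayRate:SpecialProjectsCount")
p_high <- coef_tab[higher, "Pr(>|t|)"]
worst  <- names(which.max(p_high))

# One elimination step: refit without the least significant higher-order
# term, but only if it is not significant at alpha.
fit_next <- if (p_high[worst] >= alpha) {
  update(fit_full, as.formula(paste(". ~ . -", worst)))
} else {
  fit_full
}

# Re-check the coefficient table and repeat the step until all terms pass.
print(summary(fit_next)$coefficients)
print(summary(fit_next)$r.squared)
```

Refitting after each removal matters because the p-values of the remaining terms change once a term is dropped.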

Conclusions and Memo (20 points)

Currently, the anonymous company predicts the satisfaction level of an employee by using only their Performance Score. The company estimates that a reduction of 0.10 in the average standard error around the satisfaction level prediction will result in $100,000 in savings on expenses (training new employees, because of better detection of those likely to leave). Write a memo to Mr. I. Luv Reg, President of the anonymous company, detailing your best selected model (the first-order or higher-order model you found to be best), its associated error, and how this model might be better for the company than the method that they currently use.
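One way to quantify the comparison is to fit the company's current approach (PerformanceScore only) and the selected model, then compare their errors. The sketch below takes "average standard error" to mean the residual standard error of each fit and assumes savings scale linearly at $100,000 per 0.10 reduction; both are interpretations rather than statements from the assignment, and the data and object names are made up.

```r
set.seed(4)
n <- 90

# Synthetic data; all values are made up.
demo <- data.frame(
  PerformanceScore = factor(rep(c("Exceeds", "Fully Meets",
                                  "Needs Improvement"), each = n / 3)),
  EngagementSurvey = runif(n, 1, 5)
)
demo$EmpSatisfaction <- 15 + 6 * demo$EngagementSurvey +
  2 * as.integer(demo$PerformanceScore) + rnorm(n, sd = 5)

# Current method: PerformanceScore only.
fit_current <- lm(EmpSatisfaction ~ PerformanceScore, data = demo)

# Hypothetical selected model: adds one more predictor.
fit_candidate <- lm(EmpSatisfaction ~ PerformanceScore + EngagementSurvey,
                    data = demo)

# Residual standard error of each model.
se_current   <- summary(fit_current)$sigma
se_candidate <- summary(fit_candidate)$sigma
reduction    <- se_current - se_candidate

# Assumption: savings scale linearly, $100,000 per 0.10 reduction.
savings <- reduction / 0.10 * 100000

print(c(current = se_current, candidate = se_candidate,
        reduction = reduction, savings = savings))
```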

Mr. I. Luv Reg does not understand regression terminology, so you should explain your findings in plain English business terms. State whether the variables excluded from the model (Analysis Step 1) show statistical evidence of dissatisfaction, and describe the effects in your model that are important findings and can be explained directly in the language of the problem.
