Data Analysis Assignment 2 (DAE 2)
Instructions:
• Your assignment should be typed and submitted as a PDF or Word file in CANVAS. Assignments
not in these formats will be assigned a maximum grade of 50%.
• Any necessary output should be copied and pasted directly into a word processing document (not
appended to the end in an appendix). Hint: you may need to reformat the output so that it is neat
and professional. All output should be clearly labeled for reference in your written responses. Only
include output that is necessary to answer the questions. Do not include an output dump (meaning do
not include output with no explanation or reference to what it means).
• You must use R for the analysis.
• FOR ALL ANALYSES, USE α = 0.10.
Problem Formulation: Human Resources Analytics
This project involves analyzing data for an anonymous IT company. The main question the company asks is:
Why are our best and most experienced employees leaving us? Employee data was collected with the main
variables described below:
• Employee Name: Employee’s full name
• EmpID: Employee ID is unique to each employee
• PayRate: The person’s hourly pay rate. All salaries are converted to hourly pay rate
• Position: The text name/title of the position the person has
• State: The state that the person lives in
• Zip: The zip code for the employee
• DOB: Date of Birth for the employee
• Sex: Sex - M or F
• MaritalDesc: The marital status of the person (divorced, single, widowed, separated, etc.)
• CitizenDesc: Label for whether the person is a Citizen or Eligible NonCitizen
• RaceDesc: Description/text of the race the person identifies with
• DateofHire: Date the person was hired
• DateofTermination: Date the person was terminated, only populated if, in fact, Termd = 1
• TermReason: A text reason / description for why the person was terminated
• EmploymentStatus: A description/category of the person’s employment status. Anyone currently
working full time = Active
• Department: Name of the department that the person works in
• ManagerName: The name of the person’s immediate manager
• RecruitmentSource: The name of the recruitment source where the employee was recruited from
• PerformanceScore: Performance Score text/category (Fully Meets, Needs Improvement, Exceeds, or PIP,
which means under Disciplinary Action, Deficient)
• EngagementSurvey: Results from the last engagement survey, managed by our external partner
• SpecialProjectsCount: The number of special projects that the employee worked on during the last 6
months
• EmpSatisfaction: A basic satisfaction score obtained via different metrics between 0 and 60 (the higher
the more satisfied the employee), as reported on the employee documents.
Your goal is to create an effective model to predict EmpSatisfaction.
Pre-processing Steps (10 points) - PUT IN THE APPENDIX
The data set contains variables and observations that need to be dropped before creating an effective multiple
linear regression model.
Steps Needed:
1) You need to perform the analysis only on employees who are active. In other words, the observations
for which the variable EmploymentStatus is equal to Active.
2) Drop all unnecessary variables. In this part, you should drop
• Status Variables: EmploymentStatus, DateofTermination, and TermReason since we only keep “Active”
employees.
• All date variables: DOB, DateofHire (we will learn later how to effectively include them)
• ID variables: EmpID
• Names: Employee_Name, ManagerName
• Zip Code and State: State, Zip (we will learn later how to effectively include them)
TIP: You can use either subsetting, e.g., data = data[,-colnumber], to delete the unnecessary variables
or data$variablename = NULL.
To show that you completed these steps successfully, paste your code and the str() output of your dataset below.
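The two pre-processing steps above can be sketched as follows. This is only an illustration under stated assumptions: the data frame `hr` and every value in it are made up, and only a few of the real columns are mimicked. With the real data you would first read the file supplied with the assignment.

```r
# Toy stand-in for the HR data; every value here is invented for illustration.
hr <- data.frame(
  Employee_Name     = c("A, Ann", "B, Bob", "C, Cy", "D, Di"),
  EmpID             = 1:4,
  PayRate           = c(28.5, 55.0, 21.0, 43.0),
  State             = c("MA", "MA", "TX", "CT"),
  Zip               = c("02135", "01450", "78230", "06033"),
  DOB               = c("1985-01-02", "1979-05-11", "1990-09-30", "1988-03-14"),
  Sex               = c("F", "M", "M", "F"),
  DateofHire        = c("2012-04-02", "2010-07-05", "2015-01-05", "2014-09-29"),
  DateofTermination = c(NA, NA, "2016-06-16", NA),
  TermReason        = c(NA, NA, "unhappy", NA),
  EmploymentStatus  = c("Active", "Active", "Voluntarily Terminated", "Active"),
  ManagerName       = c("M One", "M Two", "M One", "M Two"),
  EmpSatisfaction   = c(40, 52, 18, 45),
  stringsAsFactors  = FALSE
)

# Step 1: keep only the active employees.
hr_active <- hr[hr$EmploymentStatus == "Active", ]

# Step 2: drop status, date, ID, name, and location variables by name.
drop_vars <- c("EmploymentStatus", "DateofTermination", "TermReason",
               "DOB", "DateofHire", "EmpID",
               "Employee_Name", "ManagerName", "State", "Zip")
hr_active <- hr_active[, !(names(hr_active) %in% drop_vars)]

# Show the structure of what is left.
str(hr_active)
```

Dropping by name rather than by column number is a design choice here: it keeps working if the column order of the file changes.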
Analysis Step 1 (10 Points) - PUT IN THE APPENDIX. INCLUDE CONCLUSIONS
IN THE MEMO
The IT company does not want you to include the following variables in the model: Department,
RecruitmentSource, Position, and RaceDesc. Instead, the company wants you to describe whether there is
any statistical evidence of dissatisfaction related to the information these variables contain.
Deliverable:
Include outputs of the code and regression(s) you create along with the answers to these questions:
• Are employees in specific departments more dissatisfied than others?
• Is there any indication that the Recruitment Source (RecruitmentSource) matters in how satisfied
employees are?
• Is there any indication that any particular position (Position) is more likely to be dissatisfied than
others?
• Are employees of specific races more dissatisfied than others?
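One possible way to approach these questions is a separate regression of EmpSatisfaction on each excluded variable, one at a time. The sketch below does this for Department only and uses simulated data, so the department names, group means, and the resulting numbers are assumptions, not findings about the company.

```r
set.seed(1)  # simulated data only, not the company's data
n <- 120
toy <- data.frame(
  Department = factor(sample(c("IT/IS", "Production", "Sales"), n, replace = TRUE))
)
# Hypothetical effect: the "Sales" group is given a lower mean satisfaction.
toy$EmpSatisfaction <- 40 - 8 * (toy$Department == "Sales") + rnorm(n, sd = 5)

fit_dept <- lm(EmpSatisfaction ~ Department, data = toy)

# Overall F-test: does mean satisfaction differ across departments at all?
anova(fit_dept)

# Per-level t-tests: each slope is a difference from the baseline level
# (the first factor level, here "IT/IS").
summary(fit_dept)$coefficients

# Decision at the assignment's significance level.
alpha <- 0.10
p_overall <- anova(fit_dept)[["Pr(>F)"]][1]
p_overall < alpha
```

The same pattern would be repeated with RecruitmentSource, Position, and RaceDesc in place of Department. A negative, significant coefficient points to a level that is less satisfied than the baseline level.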
Analysis Step 2 (60 Points) - PUT IN THE APPENDIX. INCLUDE CONCLUSIONS
IN THE MEMO
First-order Regression Model (30 points)
1. Fit the multiple regression model, regressing Y (EmpSatisfaction) on X1 (PayRate), X2 (Sex), X3
(CitizenDesc), X4 (PerformanceScore), X5 (EngagementSurvey) and X6 (SpecialProjectsCount).
a) Conduct a test of overall model significance. State H0 and Ha, the F-statistic, degrees of freedom,
p-value, and conclusion. (10 pts)
b) Conduct tests of partial significance for each slope coefficient (X1, X2, X3, X4, X5, X6). For each, state
H0 and Ha, the test statistic, degrees of freedom, p-value, and conclusion. (10 pts)
c) Dropping any insignificant variable at α = 0.10 (one by one, starting with the most complex insignificant
variables, and then by significance level), what is the final first-order least squares regression fit? (5
pts)
d) What percentage of the variation in EmpSatisfaction can be explained by using your first-order final
model in c)? (5 pts)
e) Explain the effect that each of the predictors in final model c) has on the response. (5 points)
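The quantities asked for in parts a), b), and d) can all be read from a fitted `lm` object. The sketch below is a minimal illustration on simulated data: the coefficients used to generate the response are invented, and CitizenDesc is left out of the toy data for brevity.

```r
set.seed(2)  # simulated data only
n <- 150
toy <- data.frame(
  PayRate              = runif(n, 15, 80),
  Sex                  = factor(sample(c("F", "M"), n, replace = TRUE)),
  PerformanceScore     = factor(sample(c("Exceeds", "Fully Meets", "Needs Improvement"),
                                       n, replace = TRUE)),
  EngagementSurvey     = runif(n, 1, 5),
  SpecialProjectsCount = rpois(n, 2)
)
# Hypothetical truth: only PayRate and EngagementSurvey matter.
toy$EmpSatisfaction <- 10 + 0.3 * toy$PayRate + 4 * toy$EngagementSurvey +
  rnorm(n, sd = 5)

fit1 <- lm(EmpSatisfaction ~ PayRate + Sex + PerformanceScore +
             EngagementSurvey + SpecialProjectsCount, data = toy)
s <- summary(fit1)

# a) Overall significance. H0: all slopes are zero.
f <- s$fstatistic  # named vector: value, numdf, dendf
p_overall <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# b) Partial tests. The coefficient table holds one t-test per coefficient;
#    drop1() gives one partial F-test per term, which is the relevant test
#    for a factor with more than two levels such as PerformanceScore.
s$coefficients
drop1(fit1, test = "F")

# d) Share of the variation in the response explained by the model.
s$r.squared
```

For a term with a single coefficient, the partial F-test from `drop1()` and the t-test in the coefficient table give the same p-value, so either can be reported.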
Higher-order Regression Model (30 points)
2. Fit the multiple regression model, regressing Y (EmpSatisfaction) on X1 (PayRate), X2 (Sex), X3
(CitizenDesc), X4 (PerformanceScore), X5 (EngagementSurvey), X6 (SpecialProjectsCount), X7
(Quadratic of Special Projects Count: I(SpecialProjectsCount^2)), X8 (Interaction Between Pay
Rate and Special Projects Count: PayRate:SpecialProjectsCount).
a) Start dropping insignificant variables at α = 0.10 (one by one, starting with the most complex
insignificant variables, and then by significance level) until you get a model with all significant terms.
(10 pts)
b) State your final higher-order model. (5 pts)
c) Explain the effect of each predictor on the response. (10 pts)
d) What percentage of the variation in EmpSatisfaction can be explained by using your higher-order
final model in b)? (5 pts)
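The elimination rule in part a) could be automated roughly as below. This is a sketch on simulated data, and the way it ranks "complexity" (any term whose label contains `:` or `^` counts as higher-order) is an assumption of the sketch, not something the assignment specifies. `drop1()` already refuses to offer a main effect while its interaction is still in the model, but it does not protect the linear term under the quadratic, so that case would need a manual check.

```r
set.seed(3)  # simulated data only
n <- 200
toy <- data.frame(
  PayRate              = runif(n, 15, 80),
  EngagementSurvey     = runif(n, 1, 5),
  SpecialProjectsCount = rpois(n, 3)
)
# Hypothetical truth: a curved effect of SpecialProjectsCount, no interaction.
toy$EmpSatisfaction <- 20 + 0.2 * toy$PayRate + 6 * toy$SpecialProjectsCount -
  0.8 * toy$SpecialProjectsCount^2 + rnorm(n, sd = 3)

fit2 <- lm(EmpSatisfaction ~ PayRate + EngagementSurvey + SpecialProjectsCount +
             I(SpecialProjectsCount^2) + PayRate:SpecialProjectsCount,
           data = toy)

alpha <- 0.10
repeat {
  d <- drop1(fit2, test = "F")[-1, , drop = FALSE]  # remove the "<none>" row
  p <- setNames(d[["Pr(>F)"]], rownames(d))
  p <- p[p > alpha]                     # insignificant candidates only
  if (length(p) == 0) break             # every droppable term is significant
  higher <- grepl(":|\\^", names(p))    # interaction or power terms
  if (any(higher)) p <- p[higher]       # higher-order terms go first
  worst <- names(which.max(p))          # then the largest p-value
  fit2 <- update(fit2, as.formula(paste(". ~ . -", worst)))
}

formula(fit2)             # final higher-order model
summary(fit2)$r.squared   # share of variation explained
```

Only one term is removed per pass and the model is refit each time, because p-values of the remaining terms change after every removal.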
Conclusions and Memo (20 points)
Currently, the anonymous company predicts the satisfaction level of an employee by only using their
Performance Score. The company estimates that a reduction of 0.10 in the average standard error around
the satisfaction level prediction will result in $100,000 savings in expenses (less spending on training new
employees, thanks to better detection of those likely to leave). Write a memo to Mr. I. Luv Reg, President of the anonymous
company detailing your best selected model (first-order or higher-order model you found to be best), its
associated error and how this model might be better for the company than the method that they currently
use.
Mr. I. Luv Reg does not understand regression terminology, so you should explain your findings in plain
English business terms. Include whether there is statistical evidence of dissatisfaction related to the
variables not included in the model (Analysis Step 1), and describe the effects in your model that are important
findings and have a direct explanation in the language of the problem.
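The comparison behind the memo can be sketched as below. Two assumptions are made and should be stated if you follow this route: the "standard error around the prediction" is read as the residual standard error of each model, and the savings are taken to scale proportionally at $100,000 per 0.10 of reduction. The data and the candidate model are simulated placeholders.

```r
set.seed(4)  # simulated data only
n <- 150
toy <- data.frame(
  PerformanceScore = factor(sample(c("Exceeds", "Fully Meets", "Needs Improvement"),
                                   n, replace = TRUE)),
  PayRate          = runif(n, 15, 80)
)
toy$EmpSatisfaction <- 25 + 5 * (toy$PerformanceScore == "Exceeds") +
  0.25 * toy$PayRate + rnorm(n, sd = 4)

# Current company method: PerformanceScore only.
baseline <- lm(EmpSatisfaction ~ PerformanceScore, data = toy)
# Hypothetical stand-in for your selected model.
candidate <- lm(EmpSatisfaction ~ PerformanceScore + PayRate, data = toy)

se_baseline  <- sigma(baseline)   # residual standard error
se_candidate <- sigma(candidate)
reduction    <- se_baseline - se_candidate

# Assumption: proportional scaling, $100,000 per 0.10 of reduction.
savings <- reduction / 0.10 * 100000
round(c(baseline = se_baseline, candidate = se_candidate, savings = savings), 2)
```

In the memo itself, only the plain-language result belongs in the body: how much closer the new predictions are on average and the estimated dollar savings.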
