Stat 462 - Individual Project 2
Due May 10, 2020
Regression Analysis
This project is to be completed individually. Submit a PDF only (the Rmd file is not needed, and you may use
another word-processing tool if you like).
We will use the Ames Housing dataset, which has 82 variables and 2930 observations (AmesHousing.txt).
The 82 features include 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables. You may restrict
yourself to the 20 continuous variables in this project.
Your goal is to predict the sale price (SalePrice). Exclude the Order, PID, and of course SalePrice variables
from your predictors. You may want to combine variables (e.g. summing square feet) and perform various
manipulations (e.g. transformations for nonlinearity) we have learned about.
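As a rough illustration of combining variables and transforming the response, here is a minimal sketch on a toy data frame with made-up values. The column names (Total.Bsmt.SF, X1st.Flr.SF, X2nd.Flr.SF) are assumptions about how read.delim() would name the Ames columns, and Total.SF and log.SalePrice are hypothetical names; check names() on your own data before relying on them.

```r
# Toy stand-in for a few Ames columns (made-up values).
# Column names are assumptions; verify them with names() on the real data.
ames_toy <- data.frame(
  Total.Bsmt.SF = c(1080, 882, 1329, 2110),
  X1st.Flr.SF   = c(1656, 896, 1329, 2110),
  X2nd.Flr.SF   = c(0, 0, 0, 0),
  SalePrice     = c(215000, 105000, 172000, 244000)
)

# Combine related square-footage columns into one predictor (hypothetical name).
ames_toy$Total.SF <- with(ames_toy, Total.Bsmt.SF + X1st.Flr.SF + X2nd.Flr.SF)

# Log-transform the response, a common choice for right-skewed prices.
ames_toy$log.SalePrice <- log(ames_toy$SalePrice)

print(ames_toy[, c("Total.SF", "log.SalePrice")])
```

Whether a log transformation is appropriate for your model should be judged from the residual plots, not assumed.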
You should try at least multiple linear regression (with and without transformations) and weighted least
squares (or generalized least squares), but you are welcome to explore more! You need to preprocess the
dataset. After preprocessing, use the following code to define your test set; the remaining observations form the
training set.
set.seed(2020)
testindices = sample(2930, round(2930/4)) ## train indices are the rest
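One way the split, an ordinary least squares fit, and a weighted least squares fit might fit together is sketched below. The data frame dat is simulated (with error spread growing in area) purely so the example runs on its own, and the weighting scheme shown (regressing absolute residuals on the predictor to estimate the error standard deviation) is only one common choice, not a required method.

```r
# Simulated stand-in for the preprocessed data; replace `dat` with your own.
set.seed(1)
n <- 2930
dat <- data.frame(area = runif(n, 500, 4000))
dat$price <- 20000 + 100 * dat$area + rnorm(n) * 10 * dat$area  # heteroscedastic noise

# Split as specified in the assignment.
set.seed(2020)
testindices <- sample(2930, round(2930/4))
train <- dat[-testindices, ]   # train indices are the rest
test  <- dat[testindices, ]

# Ordinary least squares on the training set.
ols <- lm(price ~ area, data = train)

# Weighted least squares: estimate the error SD as a linear function of
# the predictor, then weight each observation by 1 / (estimated SD)^2.
sd_fit <- lm(abs(resid(ols)) ~ area, data = train)
w <- 1 / fitted(sd_fit)^2
wls <- lm(price ~ area, data = train, weights = w)

print(rbind(OLS = coef(ols), WLS = coef(wls)))
```

With real data, check that the estimated standard deviations are all positive before inverting them.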
Write up your results in a professional report, like you would present to a client or internal customer for
your analysis. The report should be no more than 4 double-spaced pages long and submitted in
PDF format. You should put important tables/figures in the report and put additional tables/figures in
the appendix.
It should include an appropriate analysis of the performance of the models you consider, and the reasons
for your final choice of model(s). Include any other details from your analysis that you feel are worthy of
mention.
The report should have four sections (Introduction, Analysis, Results, Conclusion) and provide sufficient
details that anyone with a reasonable statistics background could understand exactly what you have done and
what you concluded. In the introduction part, you should present some background and motivation to analyze
the housing data. In the analysis part, you should outline the analysis and some necessary methodological
details. In the results part, you should use tables/figures to summarize your results and explain your findings.
In the conclusion part, you should connect your analysis and results to your motivation and discuss some
possible future work. Do not embed R code in the body of your report (if you are using rmarkdown, use
{r echo=FALSE} to suppress the printing of the R code), but instead attach the code in an appendix. The
appendix does not count towards the page limit.
10 points: fulfilling the project requirements. You may want to remove variables or observations with missing
values and combine variables (e.g. summing square feet).
5 points: the quality of your report (including: clarity of writing, organization, and layout; appropriate use of
Project Requirements
Requirement 1:
Preprocess the dataset (e.g., checking for missing values, combining highly correlated features). Perform some
exploratory data analysis such as summary statistics, boxplots, correlation plots, and so on.
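A minimal sketch of these preprocessing checks in base R follows. The data frame toy is simulated, its column names are placeholders, and the 0.9 correlation cutoff is an arbitrary illustrative threshold.

```r
# Simulated toy data with one missing value and two nearly redundant predictors.
set.seed(1)
toy <- data.frame(
  Lot.Area    = c(rnorm(49, 10000, 2000), NA),
  Gr.Liv.Area = rnorm(50, 1500, 400)
)
toy$X1st.Flr.SF <- 0.9 * toy$Gr.Liv.Area + rnorm(50, sd = 30)
toy$SalePrice   <- 50000 + 80 * toy$Gr.Liv.Area + rnorm(50, sd = 10000)

# Count missing values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Drop incomplete rows (one simple option; dropping whole variables is another).
toy_cc <- na.omit(toy)

# Summary statistics and correlation matrix.
print(summary(toy_cc))
cor_mat <- cor(toy_cc)

# Flag predictor pairs with |correlation| above 0.9 (arbitrary cutoff).
keep <- rownames(cor_mat) != "SalePrice"
pred_cor <- cor_mat[keep, keep]
high <- which(abs(pred_cor) > 0.9 & upper.tri(pred_cor), arr.ind = TRUE)
print(high)

# boxplot(toy_cc$SalePrice) and pairs(toy_cc) give quick visual checks.
```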
Requirement 2:
Fit the regression model on the training dataset. Perform the appropriate diagnostics for your regression
analysis (checking the regression assumptions, influential observations, outliers, collinearity). You may need to
remove some collinear variables and/or outliers.
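The diagnostics listed above could be sketched in base R as follows. The data are simulated with one planted outlier and one nearly collinear pair; the cutoffs (|standardized residual| > 3, Cook's distance > 4/n) are common rules of thumb rather than fixed standards, and the variance inflation factors are computed by hand so that no extra package is assumed.

```r
# Simulated data: x2 is nearly collinear with x1, and observation 1 is an outlier.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 3 * x3 + rnorm(n)
y[1] <- y[1] + 15
d <- data.frame(y, x1, x2, x3)

fit <- lm(y ~ x1 + x2 + x3, data = d)

# Outliers: large standardized residuals.
out_idx <- which(abs(rstandard(fit)) > 3)

# Influential observations: Cook's distance above the 4/n rule of thumb.
infl_idx <- which(cooks.distance(fit) > 4 / n)

# Collinearity: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on the other predictors.
preds <- c("x1", "x2", "x3")
vif <- sapply(preds, function(v) {
  others <- setdiff(preds, v)
  r2 <- summary(lm(reformulate(others, response = v), data = d))$r.squared
  1 / (1 - r2)
})
print(round(vif, 1))

# plot(fit) draws the residual, Q-Q, scale-location, and leverage plots
# used to check the regression assumptions.
```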
Requirement 3:
After removing the collinear variables and/or outliers, you may fit the following regression models: (i) the
full model, (ii) the submodel chosen by AIC, (iii) the submodel chosen by BIC, (iv) the submodel chosen
by Lasso, and (v) the submodel chosen by the Elastic Net. Summarize the fit of these models and compare
their coefficient estimates.
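A sketch of how these five fits might be obtained is given below, on simulated data. Backward stepwise search with step() is one possible way to select by AIC or BIC (k = 2 gives AIC, k = log(n) gives BIC). The Lasso and Elastic Net fits assume the glmnet package is installed and are skipped otherwise; the mixing value alpha = 0.5 for the Elastic Net is an arbitrary illustrative choice.

```r
# Simulated data: only x1 and x2 truly affect y.
set.seed(1)
n <- 300
X <- matrix(rnorm(n * 6), n, 6, dimnames = list(NULL, paste0("x", 1:6)))
y <- 2 + 3 * X[, 1] - 2 * X[, 2] + rnorm(n)
d <- data.frame(y = y, X)

# (i) Full model.
full <- lm(y ~ ., data = d)

# (ii), (iii) Backward stepwise selection by AIC and by BIC.
sub_aic <- step(full, direction = "backward", k = 2, trace = 0)
sub_bic <- step(full, direction = "backward", k = log(n), trace = 0)

print(coef(full))
print(coef(sub_aic))
print(coef(sub_bic))

# (iv), (v) Lasso (alpha = 1) and Elastic Net (alpha = 0.5, an arbitrary choice),
# with the penalty chosen by cross-validation. Requires the glmnet package.
if (requireNamespace("glmnet", quietly = TRUE)) {
  cv_lasso <- glmnet::cv.glmnet(X, y, alpha = 1)
  cv_enet  <- glmnet::cv.glmnet(X, y, alpha = 0.5)
  print(coef(cv_lasso, s = "lambda.min"))
  print(coef(cv_enet,  s = "lambda.min"))
}
```

Coefficients shrunk exactly to zero by the Lasso or Elastic Net correspond to variables dropped from the submodel, which makes a side-by-side coefficient table a natural way to compare the five fits.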
Requirement 4:
For each model, calculate the “mean prediction error” in the test dataset.
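The assignment does not define "mean prediction error" precisely; the sketch below assumes it means the mean squared prediction error on the test set, which is one common reading. The data and the two candidate models are simulated placeholders.

```r
# Simulated data and a train/test split.
set.seed(1)
n <- 400
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

set.seed(2020)
test_idx <- sample(n, round(n / 4))
train <- d[-test_idx, ]
test  <- d[test_idx, ]

# Candidate models, fit on the training set only.
models <- list(
  intercept_only = lm(y ~ 1, data = train),
  linear         = lm(y ~ x, data = train)
)

# Mean squared prediction error on the held-out test set (assumed definition).
mspe <- sapply(models, function(m) {
  mean((test$y - predict(m, newdata = test))^2)
})
print(mspe)
```

If a model was fit to a transformed response such as log(SalePrice), back-transform its predictions (for example with exp()) before computing the error, so that all models are compared on the same scale.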