Assignment 3
Your name (netid)
In this assignment, you are asked to predict the sales of a store based on the
following information. Note that the store is closed on a few days. The sales on closed
days are imputed by average sales.
Variable      Description
sales         Daily sales on the log scale (dependent variable)
closed        Whether the store is closed
oil_price     Oil price (the country relies heavily on oil exports)
promotions    The number of items under promotion in the store
comp_prom     The average number of items under promotion in competing stores
is_national   Whether the day is a national holiday
is_local      Whether the day is a local holiday
This dataset has 1,172 observations, of which the sales for the last 200 days are
missing. The goal of this assignment is to predict the sales on these 200 days. For
Q1-Q4, please hold out the last 180 days with available sales for validation. The
evaluation metric for this task is RMSE. However, to avoid overfitting, you may also
want to take a look at SBC (the Schwarz Bayesian criterion).
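As a rough illustration of the split and the two criteria mentioned above, the
following Python sketch holds out the last 180 of the 972 days with available sales
(1,172 minus 200), leaving 792 days for fitting. The function names are hypothetical,
and the SBC formula used here, n*ln(SSE/n) + k*ln(n), is one common definition; the
value SAS reports may be computed differently.

```python
import math

def split_holdout(values, n_holdout):
    """Split a chronologically ordered list into (fit, holdout) parts."""
    if not 0 < n_holdout < len(values):
        raise ValueError("n_holdout must be between 1 and len(values) - 1")
    return values[:-n_holdout], values[-n_holdout:]

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(sse / len(actual))

def sbc(actual, predicted, n_params):
    """One common form of the Schwarz Bayesian criterion (assumed here).

    Undefined for a perfect fit (SSE = 0), because log(0) is not defined.
    """
    n = len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return n * math.log(sse / n) + n_params * math.log(n)

# Placeholder values standing in for the 972 days with observed sales.
observed = list(range(972))
fit_part, holdout_part = split_holdout(observed, 180)
print(len(fit_part), len(holdout_part))
```

RMSE is computed on the holdout part only, whereas SBC penalizes the number of
estimated parameters, which is why the two can rank models differently.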
Please provide relevant screenshots while answering the questions below. Missing
important screenshots can have a negative impact on your grade.
1. Data Exploration (1.5 points)
Import (SAS menu File → Import Data) train.csv and save it as a SAS dataset.
SAS allows you to explore the series in multiple ways. Sometimes these views suggest
different conclusions about whether the series has a trend or seasonality. Now examine
the following three plots. (1.5 points)
a) Does the Series plot exhibit seasonality? You may want to zoom in to see better.
b) Do the Autocorrelation plots reveal potential seasonality? You may want to zoom out
to see more lags.
c) Does the Seasonal Root Test provide statistical evidence for seasonality?
Explain your answer to each question.
2. Linear Regression Models (5 points)
2.1 Fit a model in which all explanatory variables are used as ordinary regressors.
Examine the Parameter Estimates. Which variables have significant effects on sales at
the 5% significance level? (1 point)
2.2 Apply any type of transformation (e.g., log) that makes sense to you on at least two
explanatory variables. You may want to check the distribution of each variable first.
(2 points)
a) Explain why you want to make certain transformations on certain regressors. You are
encouraged to search online for this. If you do, please provide the links you found helpful.
b) Does the transformation improve the RMSE of the model on the holdout sample?
c) Does the transformation make any insignificant variable become significant?
Hint: to transform a variable, you need to add it as a dynamic regressor, even if
you do not specify any transfer function for it.
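Outside SAS, the effect of a log transformation on a skewed regressor can be checked
with a few lines of Python. The sketch below is only illustrative: the values are made
up (they are not taken from train.csv), the helper name is hypothetical, and log(1 + x)
is used on the assumption that a count-like variable may contain zeros.

```python
import math

def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up, right-skewed counts used purely for illustration.
raw = [0, 1, 1, 2, 3, 5, 8, 40]

# log1p(x) = log(1 + x), which stays defined when x is 0.
logged = [math.log1p(v) for v in raw]

print(round(skewness(raw), 3), round(skewness(logged), 3))
```

A smaller skewness after the transformation indicates a more symmetric distribution,
which is one common motivation for transforming a regressor.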
2.3 Based on your best model above, try applying some transfer functions on two
regressors. (2 points)
a) Explain in a few sentences why you choose certain types of transfer function for
certain regressors.
b) Do the transfer functions improve RMSE as planned?
It is fine if the transfer functions do not improve model performance. Please try at
least four different models in this step.
2.4 What is your best model in terms of RMSE so far? Duplicate the model and change
its name to “BEST2”. (0 point, just for your own record)
3. ARIMA Model + Explanatory Variables (5 points)
3.1 Duplicate your best model in Q2.4 and do the following, one at a time. (2.5 points)
i) Add Seasonal Dummies into the model
ii) Add both Linear Trend and Seasonal Dummies into your model
iii) Apply Seasonal Difference on sales
iv) Apply First and Seasonal Differences on sales
Name the models properly, so that they can be easily understood. For example
Please answer the following two questions.
a) Provide a screenshot of the “Statistics of Fit” for each model. Which model produces
the smallest RMSE on the holdout sample?
b) Summarize at least three findings from the comparison of the four models. Explain
these findings or discuss what you learned from them. (grading will be based on how
insightful your findings and discussions are)
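To make options iii) and iv) in Q3.1 concrete, the Python sketch below applies a
seasonal difference and then a first difference to a toy series. The seasonal period of
7 (a weekly cycle in daily data) is an assumption, not something stated in the
assignment, and the series itself is made up.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both terms exist."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

SEASON = 7  # assumed weekly cycle for daily data; adjust to what the data shows

# Made-up series: a linear trend plus a bump on every 7th day.
toy = [t + (5 if t % SEASON == 0 else 0) for t in range(21)]

seasonal_diff = difference(toy, SEASON)         # option iii): seasonal difference
first_and_seasonal = difference(seasonal_diff)  # option iv): first + seasonal

print(seasonal_diff)
print(first_and_seasonal)
```

Note that each difference shortens the series: the seasonal difference drops the first
SEASON observations, and the additional first difference drops one more.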
3.2 Examine the diagnostic plots on the residuals of your best model in Q3.1. Now fit a
Seasonal ARIMA model ARIMA(p,d,q)(P,1,Q)s based on your understanding of the data.
Please also add the regressors that you deem helpful. You only need to consider p<=2,
d<=1, q<=2, P+Q<=1. Different people may arrive at different models.
This is an iterative process: fit models → examine problems with residuals → fit new
models. Explain your thought process and provide the necessary screenshots (grading will
be partially based on the clarity of your explanation). Try at least 3-4 models. (2.5 points)
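The constraints in Q3.2 define a small search space, which the following Python sketch
enumerates. It only lists the admissible orders and does not fit anything; the function
name and its parameters are hypothetical.

```python
from itertools import product

def candidate_orders(max_p=2, max_d=1, max_q=2, max_seasonal_terms=1):
    """List ((p, d, q), (P, 1, Q)) pairs allowed by p<=2, d<=1, q<=2, P+Q<=1."""
    orders = []
    for p, d, q, P, Q in product(
        range(max_p + 1),
        range(max_d + 1),
        range(max_q + 1),
        range(max_seasonal_terms + 1),
        range(max_seasonal_terms + 1),
    ):
        # The seasonal differencing order is fixed at 1 by the assignment.
        if P + Q <= max_seasonal_terms:
            orders.append(((p, d, q), (P, 1, Q)))
    return orders

orders = candidate_orders()
print(len(orders))  # 3 * 2 * 3 * 3 = 54 combinations
```

With 54 admissible combinations, the residual diagnostics are what should narrow the
choice down to the handful of models actually fitted.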
3.3 What is your best model in terms of RMSE so far? Duplicate the model and change
its name to “BEST3”. (0 point, just for your own record)
4. Modeling Christmas and Events (4.5 points)
The sales increase substantially around Christmas. However, modeling the effect of
Christmas on sales can be more complicated than it appears.
4.1 There are two potential ways to deal with it: a) treating Christmas as an event; b)
modeling Christmas with a regressor (i.e., a dummy variable for Christmas). In principle,
which way is more helpful for prediction? Explain. There is no need to fit any model for
this question. (1 point)
4.2 Regardless of your answer in Q4.1, now model the effect of Christmas using a
dummy variable. The dummy variables can be defined based on different rationales. For
example, you may code the dummy variable to be one on Dec 25 and zero otherwise,
because Dec 25 is the official date for Christmas. Alternatively, you may code the
dummy to be one on Dec 23 as the sales peak on that day. You may also code the
dummy to be one on some other day. (2 points)
a) In the CSV file, generate three dummy variables for Christmas based on three different
days: Dec 23, Dec 25, and one day of your own choice (explain why you chose that
day). Name them Christmas23, Christmas25, and so forth. Import this new CSV file
as a SAS dataset.
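For part a), the dummy columns can also be generated programmatically rather than by
hand. The Python sketch below assumes each row carries a date in ISO format
(YYYY-MM-DD) under a column named "date"; both the column name and the format are
assumptions about train.csv, and the function name is hypothetical. Pass your own
third day through the days argument.

```python
from datetime import date

def add_christmas_dummies(rows, days=(23, 25), date_field="date"):
    """Return copies of rows with one 0/1 column per requested December day."""
    result = []
    for row in rows:
        current = date.fromisoformat(row[date_field])
        new_row = dict(row)
        for day in days:
            is_match = current.month == 12 and current.day == day
            new_row[f"Christmas{day}"] = int(is_match)
        result.append(new_row)
    return result

# Made-up rows standing in for records read with csv.DictReader.
sample = [
    {"date": "2016-11-23"},
    {"date": "2016-12-23"},
    {"date": "2016-12-25"},
]
for row in add_christmas_dummies(sample):
    print(row)
```

Rows produced this way can be written back with csv.DictWriter before importing the
file into SAS.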
b) Add each of the three dummies as an ordinary regressor separately into the best
model in Q3.3 and see which dummy leads to the smallest RMSE. Examine the effects
of the three dummies and explain why a particular one performs the best.
4.3 Examine the “Prediction Errors for Sales” plot (second button on the right) of your
best model so far and identify events on at least two dates. Try to model at least two
events with appropriate transfer functions. Explain the rationale behind these transfer
functions. Does accounting for events improve your model performance? (1.5 points)
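As background for Q4.3, events are often represented by simple indicator series before
any transfer function is applied. The Python sketch below builds three commonly used
shapes (point, step, and ramp); the function names are hypothetical, the event index is
made up, and the shapes SAS offers may be labeled or parameterized differently.

```python
def point_intervention(n, t0):
    """1 at the event index only: a one-off shock."""
    return [1 if t == t0 else 0 for t in range(n)]

def step_intervention(n, t0):
    """1 from the event index onward: a permanent level shift."""
    return [1 if t >= t0 else 0 for t in range(n)]

def ramp_intervention(n, t0):
    """0 up to the event index, then increasing by 1 per period: a slope change."""
    return [max(0, t - t0) for t in range(n)]

# Made-up example: 8 periods with an event at index 3.
print(point_intervention(8, 3))
print(step_intervention(8, 3))
print(ramp_intervention(8, 3))
```

Which shape fits depends on whether the prediction errors show a single spike, a
lasting shift in level, or a gradual change after the event date.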
5. Further tune your models for performance points (4 points)
You may fit any models using any tools for this question. We recommend saving the
models for this question as a separate project in the same catalog. You may submit
predictions from five models. Please save the predictions as SAS datasets and name
them pred1, pred2, and so forth.
a) If you use tools other than SAS to generate your final predictions, please simply
replace the missing values in the train.csv with your predicted values and then submit
the csv file.
b) You may want to re-train your best model on the whole sample.
c) To make sure your best model is robust, you may vary the holdout sample size, refit
the model, and then see whether it remains the best model.
d) It’s NOT recommended to select five models which are highly similar to each other,
as they may suffer from the same issue.
e) TSFS allows you to combine multiple models for prediction.
f) You may impute sales on closed days differently, for example with the average sales
over the past week.
g) Make sure your predictions have 1,172 observations.
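The alternative imputation mentioned in item f) can be sketched in Python as follows.
This is one possible reading of "average sales in the past week": each closed day is
replaced by the mean of the open days among the preceding seven days. The function name
and the toy values are hypothetical.

```python
def impute_closed_days(sales, closed, window=7):
    """Replace sales on closed days with the mean of open days in the prior window.

    A closed day with no open day in the window keeps its original value.
    """
    imputed = list(sales)
    for t, is_closed in enumerate(closed):
        if not is_closed:
            continue
        start = max(0, t - window)
        # Use only originally observed (open-day) values, not earlier imputations.
        recent = [sales[i] for i in range(start, t) if not closed[i]]
        if recent:
            imputed[t] = sum(recent) / len(recent)
    return imputed

# Made-up log-scale sales; the fourth day is closed.
toy_sales = [10.0, 12.0, 14.0, 99.0, 16.0]
toy_closed = [0, 0, 0, 1, 0]
print(impute_closed_days(toy_sales, toy_closed))
```

Because sales is on the log scale, averaging these values corresponds to a geometric
mean on the original scale, which may or may not be what you intend.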
Your performance points are proportional to the ranking of your best RMSE on
the test set, namely the last 200 observations (not the holdout sample).
Please submit the following for your assignment:
1) The saved catalog file.
2) Predictions from 5 different models of your own choice.
3) This document.