QBUS2810, S2 2022
Statistical Modelling for Business
Group Assignment
This group assignment will contribute 20% towards your final result in the
unit. The deadline is 11:59pm Friday 4th November, 2022. Submission is
via Canvas.
This assignment must be completed in your Canvas group. It is entirely the
students' responsibility to form and/or join a group in the People section of
the 2810 Canvas site. Groups consist of precisely 3 students only.
Maximum Length: There is no maximum page length for this assignment. If you
have something interesting and worthwhile to include, then please do so without
worrying about a page limit. However, irrelevant or overly long-winded material may
reduce your overall mark (as well as the marker’s enjoyment of life). As a guideline,
in previous runs of this class the typical report had between 20 and 25 pages,
excluding Python code.
Notes on Marking:
The assignment will initially be marked out of 64.
Up to an additional five (5) marks will be awarded based on the overall
presentation quality of your report. Thus, you will receive a total mark for this
assignment out of 69. You will lose some of these 5 presentation marks for poor,
inefficient, unclear and/or unprofessional presentation. You will be rewarded for
professional, efficient and clear presentation methods. I expect your final report
to be done in a professional editing package and to be submitted in PDF only.
HTML files of Jupyter notebooks are not suitable.
You must use Python for this assignment. You are being assessed on how
well you can use Python to complete the assignment tasks. NB: You can use
Excel for simple data manipulations and clean-up; but Python is better at these
tasks too! All plots and statistical output in the assignment must have been
produced in Python, though you can of course make nicer tables in a text editor
to include in your assignment. Please include an appendix in your assignment
that contains the Python code your group used to produce ALL outputs in your
assignment. A heavy penalty will apply if the Python code is not supplied (or
the code supplied does not run or work when the marker tries to run it).
Key requirements:
Pre-analysis instructions for data:
Please include the Python code from the Jupyter notebook file “grp assnt gendata.ipynb”
in your Jupyter notebook file to input and clean the data. Collect the student ID
numbers for the members of your group and then add these numbers together. Input the
result into the Python code where instructed. Run the subsequent code to generate
two datasets: “train” and “test”. Most analysis you do will only use the “train”
dataset. Any forecasting your group does will only use the “test” dataset. The purpose
of these commands is to ensure that each group receives different randomly selected
datasets for “train”ing and “test”ing purposes. Two other Python scripts are included
in case you need them: forward selection.py and backword selection.py
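To illustrate why the summed student ID numbers give each group its own reproducible split, here is a minimal sketch only. The provided notebook remains the code you must use. The ID values, the function name split_train_test and the 80/20 split proportion are all assumptions made for illustration and are not taken from the notebook.

```python
import numpy as np

# Hypothetical student ID numbers, for illustration only.
student_ids = [480000001, 490000002, 500000003]

# The sum of the IDs acts as the random seed, so each group gets a
# different split, and the same group always gets the same split.
seed = sum(student_ids)


def split_train_test(n_rows, seed, test_fraction=0.2):
    """Return (train_idx, test_idx) row positions for a seeded random split.

    test_fraction is an assumed value; the notebook may use another one.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)          # random ordering of row positions
    n_test = int(round(n_rows * test_fraction))
    test_idx = np.sort(shuffled[:n_test])       # first block becomes "test"
    train_idx = np.sort(shuffled[n_test:])      # remainder becomes "train"
    return train_idx, test_idx


train_idx, test_idx = split_train_test(1000, seed)
print(len(train_idx), len(test_idx))
```

Re-running the sketch with the same seed reproduces the same two sets of rows, which is what allows a marker to regenerate a group's datasets.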
Business problem:
The US Department of Energy Office of Energy Efficiency runs a website,
www.fueleconomy.gov, which is the official source for fuel economy information for
consumers and organisations in the US. The US government is interested in
understanding the drivers of fuel economy in a large range of vehicles for private
consumer, organisational and government use in the US. In particular, they are very
interested in the effect of a variable called engine displacement, which is the total
volume of all the cylinders in an engine, on fuel economy in vehicles. They wish to
build a model that can accurately predict the level of fuel economy for the cars in
their database, so they can improve their understanding and communicate this, and also
make better recommendations, on their website. Your group has been commissioned to
research and analyse the data provided and then report back to the Department of
Energy Office of Energy Efficiency, principally regarding the major goals they are
interested in.
Data and Description:
Please see the file Fueleconomy.pdf for information on the variables and data collected.
The data used here are from a wide range of cars manufactured in the years 1984-2023
and are available at https://www.fueleconomy.gov/feg/ws/index.shtml. The dataset at
this site is in the file “vehicles.csv”. The measure of fuel economy to be used is the
average miles per gallon (MPG) achieved over various tested journeys for each car,
labelled comb08 in the dataset.
Goals and primary questions:
There are four primary goals that the Department of Energy Office of Energy
Efficiency would like your group to focus on:
(a) Understand the relationship between fuel economy and, primarily, engine
displacement;
(b) Develop a causal model for fuel economy that includes engine displacement;
(c) Develop an optimal model for predicting fuel economy, as well as understand the
relationship between fuel economy and the optimal set of useful explanatory
variables;
(d) Understand how the useful predictor variables interact to help explain the
variations in fuel economy.
The focus is on vehicles that use a single fuel only, either petrol or diesel: cars
that are powered by electricity or gas (solely or as hybrids) are not to be considered
in your analysis. Only cars made in the years 1984-2022 should be included.
As in many real datasets, there are many extraneous variables here, including other
potential response variables, none of which are suitable to be included as explanatory
variables in any predictive or causal models for fuel economy. This includes several
variables to do with electric or gas or hybrid cars, and many others, all of which should
be ignored. These variables are removed by the code in “grp assnt gendata.ipynb”.
Tasks:
1. (6 marks) Conduct a suitable exploratory analysis on this dataset; specifically one
that is relevant to the goals of this study.
2. (6 marks) Analyse the relationship between fuel economy (MPG) and displacement
and test the significance of this relationship using a suitably chosen SLR. Include a
discussion of whether the assumptions of your analysis and test could hold for this data
and whether and how strongly the data actually fits the model.
3. (12 marks)
a. Discuss which variables in the dataset could be causing omitted variable bias in
your analysis in task 2, and justify clearly why you think that. (3 marks)
b. Include these omitted variables, together with displacement, in a standard MLR
model, without any transformations or interactions or nonlinear effects; then fit
the model and present and interpret the estimated model. (3 marks)
c. Assess the (partial) relationship between MPG and displacement, and include
a discussion of whether the assumptions could hold for this data and whether
and how well the data actually fits the model. (3 marks)
d. Also discuss the level and sources of multi-collinearity present and whether you
think this is problematic, or not, and why; and if so, problematic for what? (3
marks)
4. (6 marks) Conduct a variable and model selection exercise, including some potential
interaction effects and also considering some transformations/nonlinear effects for
the regressors and/or response variable. You must properly motivate and discuss all
your choices here.
5. (6 marks) Provide a summary of the comparison of the strength of model fits
over at least 5 different models/transformations/variable sets that you tried, all while
forcing displacement to stay in the model in some form. Discuss your findings.
6. (6 marks) Fully report a diagnostic analysis on the final “optimal” model, as well
as briefly discussing any collinearity issues it may have. Also, if there are any nonlinear
effects in this model, clearly discuss and illustrate their effects on MPG.
7. (6 marks) Discuss your results and conclusions regarding the overall goals of this
study, in light of the results from your overall analysis of the “train” dataset. Be
technical but clear here. Also, interpret the effect of displacement on fuel economy,
using (at least) the optimal model.
8. (6 marks) Using (at least) the 3 best model specifications considered so far (and any
others you think relevant), generate forecast predictions in the “test” dataset for MPG.
Present a summary table, and suitable plot(s), of the forecasts and their accuracy,
using the forecast measures RMSE, MAD and forecast R².
9. (5 marks) Re-discuss your results and conclusions regarding the overall goals of
this study, in light of these results and your overall analysis. Be technical but clear
here.
10. (5 marks) Write a final report, in as close to plain English as is practical and
possible, that discusses and summarises your analysis above and gives conclusions on
the overall goals of this study. Address the report to, and write it at a level appropriate
for, the Department of Energy Office of Energy Efficiency, who may not be that savvy in
business analytics. Include in your report a recommendation for what the Department
should spend money on in order to increase efficiency of road transport in general; plus
any suggestions for future studies they should do to better achieve the goals they have.
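
As an illustration of the forecast accuracy measures named in Task 8, the following is a minimal sketch and not a prescribed implementation. It assumes MAD means the mean absolute deviation of the forecast errors, and that forecast R² compares the squared forecast errors with the squared deviations of the “test” responses around their own mean. If the unit's lecture material defines these measures differently, those definitions take precedence. The function name forecast_accuracy and the numbers used are made up for illustration.

```python
import numpy as np


def forecast_accuracy(y_true, y_pred):
    """Return RMSE, MAD and forecast R^2 for a set of forecasts.

    Assumed definitions (check against the unit's own):
      RMSE = sqrt(mean(e^2)), MAD = mean(|e|), R^2 = 1 - SSE / SST,
    where e = y_true - y_pred and SST uses the mean of y_true.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    rmse = float(np.sqrt(np.mean(errors ** 2)))
    mad = float(np.mean(np.abs(errors)))
    sse = float(np.sum(errors ** 2))
    sst = float(np.sum((y_true - y_true.mean()) ** 2))
    return {"RMSE": rmse, "MAD": mad, "R2": 1.0 - sse / sst}


# Made-up MPG values and forecasts, for illustration only.
actual = [20.0, 25.0, 30.0, 35.0]
predicted = [22.0, 24.0, 29.0, 37.0]
print(forecast_accuracy(actual, predicted))
```

With these made-up numbers the forecast errors are -2, 1, 1 and -2, so MAD is 1.5 and forecast R² is 1 - 10/125 = 0.92. Applying the same function to each candidate model's “test” forecasts gives the rows of a summary table.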