BUSS6002 Assignment 1
Semester 2, 2022
Instructions
Due: at 23:59 on Friday, September 16, 2022 (end of week 7).
You must submit a Jupyter Notebook (.ipynb) file with the following filename format,
replacing STUDENTID with your own student ID: BUSS6002_A1_STUDENTID.ipynb.
There is a limit of 1000 words for your submission (excluding code, tables, and captions).
Do not include any more Python output than necessary and include only concise discussions.
Each task must be clearly labelled with the corresponding question (and sub-question) number so that the marker can spot your solution easily.
The submitted .ipynb file must be free of any errors, and the results must be reproducible.
All figures must be appropriately sized (by setting figsize) and have readable axis labels
and legends (where applicable).
Use plt.show() instead of plt.savefig('plot.png') to display each figure.
Libraries needed: numpy, pandas, matplotlib, statsmodels.
You may submit multiple times but only your last submission will be marked.
A late penalty applies if you submit your assignment late without an approved special consideration. See the Unit Outline for more details.
Rubric
This assignment is worth 20% of the unit’s marks. The assessment is designed to test your technical
ability and statistical knowledge in performing important basic tasks associated with an exploratory
data analysis (or EDA) of a real-world dataset.
Assessment Item     Goal                              Marks
Question 1          Overall summary of the dataset        7
Question 2          Univariate analysis                  14
Question 3          Multivariate analysis                18
Jupyter Notebook    Logical and clear presentation        1
Total                                                    40

Table 1: Assessment Items and Mark Allocation
Overview
Being able to accurately predict the sale prices of residential properties is crucial to many aspects
of the economy. Some companies base their entire business models on providing their clients
with predictions of property sale prices. As a data-scientist-in-training, you will analyse data on
residential home sales in Ames, a city in the state of Iowa in the United States. The dataset
contains the sale prices of all residential properties sold in Ames between 2006 and 2010, as well as many
numerical and categorical features (i.e., variables) associated with each dwelling. The following
downloadable files are available on Canvas.
File                          Description
AmesHousing.txt               Data file containing 2,930 observations and 82 variables
DataDocumentation.txt         Data dictionary containing a description of each variable
BUSS6002_A1_STUDENTID.ipynb   A Jupyter Notebook template for getting you started
AmesResidential.pdf           A map of Ames

Table 2: Files Provided
Question 1
Place the data file AmesHousing.txt in the same location (i.e., directory) as your Jupyter Notebook
file (.ipynb), and then read the data into a pandas DataFrame object using exactly the following
code.
import pandas as pd
data = pd.read_csv(
    'AmesHousing.txt',
    sep='\t',
    keep_default_na=False,
    na_values=[''])
Once the data file is successfully read in, complete the following tasks.
(a) (3 marks) Write some code to automatically print out the column names of the variables
with missing values, as well as the number of missing observations associated with each of
those variables. The output should be sorted by the number of missing observations from
most to least. Note that a missing value is represented by the special numpy constant nan;
the ‘NA’ value of a categorical variable (e.g., ‘Alley’) is not considered missing. Hint: you
may find the .isna() method of a DataFrame object useful.
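For illustration only, a minimal sketch of one possible approach (assuming the DataFrame data created with the code above) is:

# Count missing values per column, keep only the columns with at least one
# missing value, and sort from most to least missing.
missing_counts = data.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)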
(b) (1 mark) Briefly discuss your finding in part (a).
(c) (3 marks) Construct a DataFrame that contains the five-number-summaries of all the nu-
merical variables in the dataset, excluding the variable ‘Order’. Round each value of the
DataFrame to its nearest integer. The resulting DataFrame should have a shape of (k, 5)
(i.e., k rows and 5 columns), where k is the number of numerical variables in the dataset. The
rows of your DataFrame should be indexed by variable names, and the columns should be
named as: min, 25%, 50%, 75%, and max, respectively. Print out the constructed DataFrame.
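One possible approach, sketched for illustration only, builds the summary from the output of describe():

# Numerical columns only, excluding the variable 'Order'.
num_data = data.select_dtypes(include='number').drop(columns='Order')
# Keep the five-number-summary rows of describe(), transpose so that the
# variables index the rows, and round each value to the nearest integer.
five_num = num_data.describe().loc[['min', '25%', '50%', '75%', 'max']].T.round(0)
print(five_num)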
Question 2
(a) (4 marks) Graphically summarise the distributions of the variables ‘SalePrice’ and ‘Lot
Area’, one at a time, and briefly discuss the distributional characteristics of the two variables.
Your discussion should also connect the distributional characteristics to the domain-specific
context of these variables.
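For illustration, one way to summarise each distribution graphically (a sketch only, assuming the DataFrame data from Question 1; the bin count is an assumption you may vary) is a histogram per variable:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(data['SalePrice'], bins=50)
axes[0].set_xlabel('SalePrice')
axes[0].set_ylabel('Frequency')
axes[1].hist(data['Lot Area'], bins=50)
axes[1].set_xlabel('Lot Area')
axes[1].set_ylabel('Frequency')
plt.show()

Boxplots or density plots are alternative graphical summaries.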
(b) (2 marks) Create two new Python variables (of pandas type Series), called log_saleprice
and log_lotarea, that contain the log-transformed values of ‘SalePrice’ and ‘Lot Area’,
respectively. To be clear, we say that a is a log-transformed value of b if a = log(b), where
log(·) is the natural logarithm function, that is, b = exp(a) := e^a.
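A minimal sketch of the transformation, using numpy's natural logarithm:

import numpy as np

# np.log is the natural logarithm, so these are the log-transformed values.
log_saleprice = np.log(data['SalePrice'])
log_lotarea = np.log(data['Lot Area'])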
(c) (3 marks) Graphically summarise the distributions of the new variables log_saleprice and
log_lotarea (created in part (b)), and briefly state the observed differences in distributions
between the log-transformed and the original variables.
(d) (1 mark) Create another new variable (of pandas type Series), called log_saleprice_01,
that contains the standardised values of log_saleprice such that log_saleprice_01 has
zero mean and unit variance. To confirm, print out the mean and variance of the new
variable and round the output to 2 decimal places.
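One possible way to standardise (a sketch only; note that pandas uses the sample standard deviation and variance by default):

# Subtract the mean and divide by the standard deviation.
log_saleprice_01 = (log_saleprice - log_saleprice.mean()) / log_saleprice.std()
print(round(log_saleprice_01.mean(), 2), round(log_saleprice_01.var(), 2))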
(e) (2 marks) Create a Q-Q plot of the standardised variable log_saleprice_01 to check whether
the variable is normally distributed. Give your conclusion regarding normality based on the
Q-Q plot. Hint: you may find the qqplot function from the statsmodels library useful:
statsmodels.graphics.gofplots.qqplot. The documentation of this function can be accessed via the URL:
www.statsmodels.org/dev/generated/statsmodels.graphics.gofplots.qqplot.html.
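A sketch of one way to call the hinted qqplot function (also available as statsmodels.api.qqplot; line='45' adds a 45-degree reference line):

import matplotlib.pyplot as plt
import statsmodels.api as sm

# Q-Q plot of the standardised variable against the standard normal.
fig = sm.qqplot(log_saleprice_01, line='45')
plt.show()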
(f) (2 marks) Graphically summarise the distribution of the variable ‘Neighborhood’ and briefly
discuss what you observe based on the graphical summary constructed.
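For illustration, one way to summarise a categorical variable graphically (a sketch only) is a bar chart of its category counts:

import matplotlib.pyplot as plt

counts = data['Neighborhood'].value_counts()
fig, ax = plt.subplots(figsize=(12, 4))
counts.plot(kind='bar', ax=ax)
ax.set_xlabel('Neighborhood')
ax.set_ylabel('Number of sales')
plt.show()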
Question 3
(a) (3 marks) Print out the correlation coefficient between ‘SalePrice’ and each of the other numerical variables in the dataset, excluding the variable ‘Order’. The output should contain
both the variable names and their corresponding correlations. It should also be sorted by
the value of the correlation coefficient in descending order and rounded to 2 decimal places.
(b) (2 marks) Construct an appropriate plot that can help visualise the correlations in part (a).
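For parts (a) and (b), one possible approach (a sketch only, assuming the DataFrame data from Question 1) is to take the ‘SalePrice’ column of the correlation matrix of the numerical variables and display it as a bar chart:

import matplotlib.pyplot as plt

num_data = data.select_dtypes(include='number').drop(columns='Order')
# Correlation of every other numerical variable with 'SalePrice',
# sorted in descending order and rounded to 2 decimal places.
corrs = num_data.corr()['SalePrice'].drop('SalePrice')
corrs = corrs.sort_values(ascending=False).round(2)
print(corrs)

fig, ax = plt.subplots(figsize=(12, 4))
corrs.plot(kind='bar', ax=ax)
ax.set_ylabel('Correlation with SalePrice')
plt.show()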
(c) (1 mark) Briefly discuss the correlation coefficients in parts (a) and (b) in the context of
predicting ‘SalePrice’.
(d) (2 marks) Suppose that ‘Gr Liv Area’ is used to predict ‘SalePrice’. With the goal of
predicting ‘SalePrice’ in mind, construct an appropriate plot that can help visualise the
systematic relationship between these two variables.
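A scatter plot is one natural choice here; a minimal sketch:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(data['Gr Liv Area'], data['SalePrice'], s=10, alpha=0.5)
ax.set_xlabel('Gr Liv Area')
ax.set_ylabel('SalePrice')
plt.show()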
(e) (2 marks) Briefly discuss the relationship between ‘Gr Liv Area’ and ‘SalePrice’ based
on the plot you created in part (d).
(f) (2 marks) Print out all the unique categories of the variable ‘Lot Shape’ together with the
number of observations falling into each category. This is called a frequency table. Based on
the obtained frequency table, briefly discuss why it could be a good idea to combine some of
the categories in ‘Lot Shape’.
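One way to obtain the frequency table (a sketch only):

# Counts of each unique category of 'Lot Shape'.
print(data['Lot Shape'].value_counts())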
(g) (2 marks) Create a new variable (of pandas type Series), called lotshape_binary, by
combining the categories {‘IR1’, ‘IR2’, ‘IR3’} of ‘Lot Shape’ into a single category named
‘IR’, so that the new variable has two categories {‘Reg’, ‘IR’}. Print out the frequency
table for lotshape_binary to confirm.
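A minimal sketch of one way to combine the categories:

# Map the three irregular categories to a single 'IR' level.
lotshape_binary = data['Lot Shape'].replace({'IR1': 'IR', 'IR2': 'IR', 'IR3': 'IR'})
print(lotshape_binary.value_counts())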
(h) (4 marks) Create a single plot that allows one to visually examine the effect of the new
variable lotshape_binary on the relationship between ‘Gr Liv Area’ and ‘SalePrice’,
and briefly discuss what you observe from the plot. Hint: see the “Spending and salary by
gender” exercise in the Week 4 tutorial.
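For illustration, one possible approach (a sketch only, assuming lotshape_binary from part (g)) is a scatter plot of ‘SalePrice’ against ‘Gr Liv Area’ with a separate colour for each level of lotshape_binary:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
for level in ['Reg', 'IR']:
    # Select the rows belonging to the current lot-shape level.
    mask = lotshape_binary == level
    ax.scatter(data.loc[mask, 'Gr Liv Area'], data.loc[mask, 'SalePrice'],
               s=10, alpha=0.5, label=level)
ax.set_xlabel('Gr Liv Area')
ax.set_ylabel('SalePrice')
ax.legend(title='Lot Shape')
plt.show()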