CVEN9407-Transport Modelling
Project Brief
Introduction
This document describes the final project of CVEN9407. The project is an individual project and group
submissions are not accepted. Its purpose is to familiarise students with practical
econometric analysis and to guide them in drawing statistical inferences. The project is worth
50% of the final grade. Students are evaluated based on their submitted progress reports and their
final report. This brief covers the data, the recommended software, the process of data analysis and
model development, the format of the reports, and the submission dates.
Guidance and assistance
Students are advised to monitor their own progress on the project and to seek assistance if needed.
Students can gauge their performance from the feedback on their progress reports.
Students can use their workshop hours to discuss issues with the course demonstrator. If further
assistance is needed, students can ask for consultation with the course coordinator.
Software
A statistical software package is required to complete this project. R is the recommended
package for this study. R is a free statistical
package which can be downloaded from this website. To make R easier to use, it is recommended to
download RStudio as well. RStudio can be downloaded from this website.
A basic introduction to the software will be provided in the lectures, and sample code for completing
most of the workshop questions will be provided to students.
Note that using R is not mandatory, and students can work with other statistical software packages
if they wish.
Data
The dataset for this study is obtained from the Household, Income and Labour Dynamics in
Australia (HILDA) survey. HILDA is a longitudinal survey which started in 2001 and is planned to continue
until 2021 (for more information about this survey refer to this website). HILDA contains
socio-demographic information about respondents. Moreover, it contains respondents’ ratings of their
satisfaction in different domains. The main purpose of this study is to investigate the impact of
transport-related variables on life satisfaction. HILDA is a confidential dataset and students must
submit a request to DataVerse to obtain it. A separate document will be uploaded to Moodle to guide
you on how to get access to HILDA.
*** The very first step of this project is obtaining access to HILDA ***
After you obtain access to HILDA, the dataset for this project will be shared with you. Due to
confidentiality issues, all personal information has been removed from this dataset.
Every student is expected to focus on one aspect of life satisfaction in a specific year. To obtain
your personalised dataset for this project, filter the dataset that is shared with you based on the
“year” and “variable of interest” allocated to you in the following table.
You must keep only one of the 9 life satisfaction variables in your dataset, which will be
the dependent variable of your study. Note that life satisfaction variables should not be
considered as independent variables.
The variable of interest in this table is your dependent variable in this project; throughout the
project, the potential impact of the other explanatory variables on this variable will be investigated.
Row Student id Year Variable of interest
1 z5192376 2009 losathl
2 z5144622 2009 losateo
3 z3308951 2009 losatfs
4 z5138879 2009 losatsf
5 z5129920 2009 losatlc
6 z5175600 2009 losatyh
7 z5133619 2009 losatnl
8 z5142872 2009 losatft
9 z5001400 2009 losat
10 z5101208 2008 losathl
11 z5175981 2008 losateo
12 z5093489 2008 losatfs
13 z5140059 2008 losatsf
14 z5128066 2008 losatlc
15 z5142327 2008 losatyh
16 z5103509 2008 losatnl
17 z5203012 2008 losatft
18 z5136336 2008 losat
19 z5130037 2007 losathl
20 z5097809 2007 losateo
21 z5110147 2007 losatfs
22 z5102112 2007 losatsf
23 z5090555 2007 losatlc
24 z5140179 2007 losatyh
25 z5136189 2007 losatnl
26 z5148502 2007 losatft
27 z5166585 2007 losat
28 z5174616 2006 losathl
29 z5186077 2006 losateo
30 z5135901 2006 losatfs
31 z5173840 2006 losatsf
32 z5107253 2006 losatlc
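The filtering step described before the table can be sketched in R as follows. This is only an illustration on a toy data frame: the data frame name hilda, the column name year, and the values are assumptions, and the shared dataset may use different names. The nine satisfaction variable names are taken from the allocation table.

```r
# Toy stand-in for the shared dataset (values and the "year" column name are assumptions).
hilda <- data.frame(
  year    = c(2006, 2007, 2008, 2009, 2009),
  losat   = c(8, 7, 9, 6, 8),
  losathl = c(7, 6, 8, 5, 9),
  losatfs = c(5, 6, 7, 4, 6),
  lscom   = c(3, 0, 5, 2, 4),
  Female  = c(1, 0, 1, 0, 1)
)

# Allocation for a hypothetical student: year 2009, variable of interest losathl.
my_year <- 2009
my_var  <- "losathl"

# The nine life satisfaction variables listed in the allocation table.
sat_vars <- c("losat", "losathl", "losateo", "losatfs", "losatsf",
              "losatlc", "losatyh", "losatnl", "losatft")

# Keep the allocated year, and drop every satisfaction variable except the allocated one.
drop_vars <- setdiff(sat_vars, my_var)
my_data <- hilda[hilda$year == my_year, !(names(hilda) %in% drop_vars), drop = FALSE]

print(my_data)
```

Dropping the other eight satisfaction variables at this stage makes it harder to include them by accident as independent variables later.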
Variables definition
The definition of most of the variables is provided here. However, some of the fields in the
processed data set do not exist in the HILDA Data Dictionary. Below you can find the definition of
these variables.
The last 36 variables in this list are the land use variables for individuals’ places of residence. There are four
indexes available which describe the socio-demographic condition of zones. These indexes are
generated by the Australian Bureau of Statistics and are referred to as Socio-Economic Indexes for Areas
(SEIFA). SEIFA variables include:
The Index of Relative Socio-Economic Disadvantage (IRSD)
The Index of Relative Socio-Economic Advantage and Disadvantage (IRSAD)
The Index of Education and Occupation (IEO)
The Index of Economic Resources (IER).
For more information please visit this webpage.
Variable Definition
Female Binary variable indicating gender (female =1)
Married Binary variable indicating marital status (married =1)
ESL Binary variable indicating if English is the second language
Le_mar Binary variable indicating if the individual has experienced the life event of marriage last year
Le_sep Binary variable indicating if the individual has experienced the life event of separation last year
Le_job Binary variable indicating if the individual has experienced the life event of job change last year
Le_bth Binary variable indicating if the individual has experienced the life event of giving birth to a child last year
Le_prg Binary variable indicating if the individual has experienced the life event of becoming pregnant last year
Le_death Binary variable indicating if the individual has experienced the life event of death of spouse/child/close friend/relative last year
Le_fni Binary variable indicating if the individual has experienced major improvement in finances last year
Le_fnw Binary variable indicating if the individual has experienced worsening in finance last year
Le_frd Binary variable indicating if the individual has been fired or made redundant last year
Le_prm Binary variable indicating if the individual has been promoted last year
Le_rtr Binary variable indicating if the individual retired last year
Le_ins Binary variable indicating if the individual had a serious personal injury last year
Mltpljob Binary variable indicating if the individual is employed in multiple jobs
Manager Binary variable indicating if the job type is managerial
Professional Binary variable indicating if the job type is professional
Technician Binary variable indicating if the job type is technician
ServiceWorker Binary variable indicating if the job type is service work
Administrative Binary variable indicating if the job type is administrative
SalesWorker Binary variable indicating if the job type is sales worker
MachineryOperator Binary variable indicating if the job type is machinery operator
Labour Binary variable indicating if the job type is labour
FlxWork Binary variable indicating if the individual has flexible working hours
HmWork Binary variable indicating if the individual can work from home
PrtStudy Binary variable indicating if the individual is doing part time studies
FullStudy Binary variable indicating if the individual is doing full time studies
Postgrad Binary variable indicating education level (postgraduate =1)
Bachelor Binary variable indicating education level (Bachelor=1)
CoupleWo Binary variable indicating if family structure is couple without children
CoupleW Binary variable indicating if family structure is couple with children
LoneW Binary variable indicating if family structure is single parent
Single Binary variable indicating if family structure is single person
Renter Binary variable indicating if the individual is renting his/her living place
hhad10_1 Binary variable indicating if the 'IRSAD' index of the home zone is greater than 1
hhad10_2 Binary variable indicating if the 'IRSAD' index of the home zone is greater than 2
hhad10_3 Binary variable indicating if the 'IRSAD' index of the home zone is greater than 3
hhad10_4 Binary variable indicating if the 'IRSAD' index of the home zone is greater than 4
hhad10_5 Binary variable indicating if the 'IRSAD' index of the home zone is greater than 5
hhad10_6 Binary variable indicating if the 'IRSAD' index of the home zone is greater than 6
hhad10_7 Binary variable indicating if the 'IRSAD' index of the home zone is greater than 7
hhad10_8 Binary variable indicating if the 'IRSAD' index of the home zone is greater than 8
hhad10_9 Binary variable indicating if the 'IRSAD' index of the home zone is greater than 9
hhda10_1 Binary variable indicating if the 'IRSD' index of the home zone is greater than 1
hhda10_2 Binary variable indicating if the 'IRSD' index of the home zone is greater than 2
hhda10_3 Binary variable indicating if the 'IRSD' index of the home zone is greater than 3
hhda10_4 Binary variable indicating if the 'IRSD' index of the home zone is greater than 4
hhda10_5 Binary variable indicating if the 'IRSD' index of the home zone is greater than 5
hhda10_6 Binary variable indicating if the 'IRSD' index of the home zone is greater than 6
hhda10_7 Binary variable indicating if the 'IRSD' index of the home zone is greater than 7
hhda10_8 Binary variable indicating if the 'IRSD' index of the home zone is greater than 8
hhda10_9 Binary variable indicating if the 'IRSD' index of the home zone is greater than 9
hhec10_1 Binary variable indicating if the 'IER' index of the home zone is greater than 1
hhec10_2 Binary variable indicating if the 'IER' index of the home zone is greater than 2
hhec10_3 Binary variable indicating if the 'IER' index of the home zone is greater than 3
hhec10_4 Binary variable indicating if the 'IER' index of the home zone is greater than 4
hhec10_5 Binary variable indicating if the 'IER' index of the home zone is greater than 5
hhec10_6 Binary variable indicating if the 'IER' index of the home zone is greater than 6
hhec10_7 Binary variable indicating if the 'IER' index of the home zone is greater than 7
hhec10_8 Binary variable indicating if the 'IER' index of the home zone is greater than 8
hhec10_9 Binary variable indicating if the 'IER' index of the home zone is greater than 9
hhed10_1 Binary variable indicating if the 'IEO' index of the home zone is greater than 1
hhed10_2 Binary variable indicating if the 'IEO' index of the home zone is greater than 2
hhed10_3 Binary variable indicating if the 'IEO' index of the home zone is greater than 3
hhed10_4 Binary variable indicating if the 'IEO' index of the home zone is greater than 4
hhed10_5 Binary variable indicating if the 'IEO' index of the home zone is greater than 5
hhed10_6 Binary variable indicating if the 'IEO' index of the home zone is greater than 6
hhed10_7 Binary variable indicating if the 'IEO' index of the home zone is greater than 7
hhed10_8 Binary variable indicating if the 'IEO' index of the home zone is greater than 8
hhed10_9 Binary variable indicating if the 'IEO' index of the home zone is greater than 9
Analysis
1. Data analysis
1.1. The first step is to familiarise yourself with the data. For that purpose:
Check the definition of variables
Check for any missing values in the data
Calculate the mean and the standard deviations of continuous variables
Calculate the frequencies for discrete variables
If needed, plot the data to see the variations in variables
Check the range of variables and see if it makes sense to you
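The checks in step 1.1 can be sketched in base R as below. The data are simulated purely for illustration, and the choice of which variables to treat as continuous (lscom, mvcval) or discrete (Female) is an assumption to be adapted to your own dataset.

```r
# Simulated toy data (not HILDA); two missing values are injected on purpose.
set.seed(1)
dat <- data.frame(
  losat  = sample(0:10, 200, replace = TRUE),
  lscom  = round(rexp(200, rate = 0.2), 1),
  mvcval = round(rlnorm(200, meanlog = 9, sdlog = 1)),
  Female = rbinom(200, 1, 0.5)
)
dat$lscom[c(5, 17)] <- NA

# Missing values per variable.
missing_counts <- colSums(is.na(dat))

# Mean, standard deviation and range of the continuous variables.
cont_vars <- c("lscom", "mvcval")
cont_summary <- data.frame(
  mean = sapply(dat[cont_vars], mean, na.rm = TRUE),
  sd   = sapply(dat[cont_vars], sd,   na.rm = TRUE),
  min  = sapply(dat[cont_vars], min,  na.rm = TRUE),
  max  = sapply(dat[cont_vars], max,  na.rm = TRUE)
)

# Frequencies of a discrete variable.
female_freq <- table(dat$Female, useNA = "ifany")

print(missing_counts)
print(cont_summary)
print(female_freq)

# A histogram such as hist(dat$lscom) can then be used to inspect the variation visually.
```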
1.2. The relationship between variables
Calculate the correlation matrix for available variables
Highlight the strong correlations in the matrix
Justify your observations. Explain potential reasons behind strong correlations.
Are there cases where you expect to see strong correlations, but the data show
otherwise? Discuss these cases.
Is there any variable that you expect to have a non-linear relationship with the
dependent variable? If you are not sure, plot the dependent variable against it and see
if you can detect any pattern.
For the variables which you suspect of having a non-linear relationship, define new
independent variables with an appropriate transformation (logarithmic, exponential,
second or third power, etc.).
Include the new independent variables in the correlation matrix and discuss the
results.
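A minimal sketch of step 1.2 in base R is given below. The data are simulated, the two transformations are examples only, and the 0.7 cut-off used to flag a "strong" correlation is an assumption rather than a threshold prescribed by this brief.

```r
# Simulated toy data (not HILDA).
set.seed(2)
n <- 300
dat <- data.frame(
  lscom  = rexp(n, rate = 0.2),
  mvcval = rlnorm(n, meanlog = 9, sdlog = 1)
)
dat$losat <- 3 + 0.4 * log(dat$mvcval) - 0.05 * dat$lscom + rnorm(n)

# Example transformations for suspected non-linear relationships.
dat$log_mvcval <- log(dat$mvcval)
dat$lscom_sq   <- dat$lscom^2

# Correlation matrix, including the transformed variables.
cor_mat <- cor(dat, use = "pairwise.complete.obs")

# List the pairs whose absolute correlation exceeds the (assumed) 0.7 threshold.
strong <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[strong[, "row"]],
  var2 = colnames(cor_mat)[strong[, "col"]],
  r    = cor_mat[strong]
)

print(round(cor_mat, 2))
print(strong_pairs)

# plot(dat$mvcval, dat$losat) would show whether the relationship looks non-linear.
```

A variable and its own transformation will usually be strongly correlated with each other, so the two are normally not entered in a model together without a reason.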
2. Regression analysis
2.1. It is always recommended to divide the dataset into train and test sub-datasets. The train
dataset, containing 80 percent of the records, is used to estimate the parameters of the model
and the test dataset, containing the remaining 20 percent, is used to validate the model.
Use the sample() function in R to randomly divide the dataset into test and train datasets.
Even if you choose to use another statistical package for this project, this step should be
completed using R (this is because the marker will be using R to check your analysis).
To make the random split reproducible, set the seed number to your student ID.
This way, although you divide the data into test and train sub-datasets at random,
the process can be repeated exactly. The command in R to fix the seed number is
set.seed()
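A minimal sketch of the split is shown below. The student ID 1234567 is hypothetical (use the digits of your own zID), and the toy data frame stands in for your filtered dataset.

```r
# Toy stand-in for the filtered dataset.
dat <- data.frame(id = 1:1000, losat = rep(0:10, length.out = 1000))

# Hypothetical student ID; replace with the numeric part of your own zID.
student_id <- 1234567

n <- nrow(dat)

# Fix the seed immediately before sampling so that the split can be reproduced.
set.seed(student_id)
train_idx <- sample(n, size = round(0.8 * n))

train <- dat[train_idx, ]   # 80 percent of the records, used for estimation
test  <- dat[-train_idx, ]  # remaining 20 percent, used for validation

cat("train:", nrow(train), "test:", nrow(test), "\n")
```

Calling set.seed() with the same number and then repeating the same sample() call returns the same indices, which is what allows the split to be checked.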
2.2. Selecting the set of explanatory variables to be included in the model
The main purpose of this study is to examine the relationship between transport-related
variables and the level of satisfaction. The dependent variable is the level of
satisfaction and the rest of the variables form the set of independent variables.
The available transport-related variables in this study are:
o lscom: Travel time to/from paid work per week
o mvcval: Current worth of vehicles
For each variable, run a separate regression model with only that variable and discuss
the estimated coefficient.
For each of the 3 combinations of transport-related variables (lscom only, mvcval only, and both
together), run a regression model and select
the best model. The best model has the highest goodness-of-fit, while all the included
variables are statistically significant.
Use the forward stepwise method to add other independent variables to the model.
o Use the Bayesian Information Criterion (BIC) as the improvement criterion in the
stepwise method. In each step, add one variable to the model. This variable
should be statistically significant and improve BIC the most (a lower BIC indicates a better model).
o Continue the process until either all the variables are exhausted or none of the
remaining variables can improve BIC any further.
The model that you have developed so far results from a mechanical process in which
theory did not play a role. At this stage you should examine the model to see if it is consistent with
existing theories in the field. There are two issues to be taken into consideration. First,
exploring the theories on life satisfaction is outside the scope of this subject. So, as a
simplification, we rely only on our common sense (note that in a real project our
reference must be accepted theories). Second, from this point on, the process becomes
somewhat subjective. In previous steps, BIC and adjusted R-squared could help you with
selecting the best model and making modifications to it. However, from this point on,
you need to use your judgment to decide how much goodness-of-fit can be
compromised to include or exclude variables based on your expectations (or theories).
Different modellers have different judgments and different approaches to
implementing their opinions. So, get ready to develop your own modelling judgment.
o Justify the included variables and the signs of their coefficients. Are there any
variables that you cannot justify, or whose sign is counterintuitive?
o On the other hand, are there any of the remaining variables which you expected to
be included in your model?
o Improve your model by putting aside unreasonable variables and including new
variables from the leftovers that you expected to be included. Most likely, this
practice will reduce the model's goodness-of-fit. This is where you should decide
how much you are willing to compromise the goodness-of-fit to improve
justifiability.
o Note that in this study you want to investigate the relationship between
transport-related variables and the level of satisfaction. So transport-related variables
should have a higher priority for inclusion in the model.
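The forward stepwise procedure with BIC described in step 2.2 can be sketched as a simple loop in base R. This is an illustration on simulated data, not a prescribed implementation: the starting model (losat ~ lscom), the candidate list, and the 5% significance level are all assumptions.

```r
# Simulated toy data (not HILDA): Renter has a real effect, noise has none.
set.seed(3)
n <- 500
dat <- data.frame(
  lscom  = rexp(n, rate = 0.2),
  mvcval = rlnorm(n, meanlog = 9, sdlog = 1),
  Female = rbinom(n, 1, 0.5),
  Renter = rbinom(n, 1, 0.3),
  noise  = rnorm(n)
)
dat$losat <- 7 - 0.1 * dat$lscom - 0.8 * dat$Renter + rnorm(n)

# Start from the chosen transport model and try the remaining variables one at a time.
current    <- lm(losat ~ lscom, data = dat)
candidates <- c("mvcval", "Female", "Renter", "noise")
bic_path   <- BIC(current)   # BIC after each step, useful for a summary plot

repeat {
  best_bic <- BIC(current)
  best_fit <- NULL
  best_var <- NULL
  for (v in candidates) {
    fit <- update(current, as.formula(paste(". ~ . +", v)))
    # Coefficient rows are named after the variable for numeric and 0/1 variables.
    p <- summary(fit)$coefficients[v, "Pr(>|t|)"]
    # Accept only significant variables (assumed 5% level) that lower BIC the most.
    if (p < 0.05 && BIC(fit) < best_bic) {
      best_bic <- BIC(fit)
      best_fit <- fit
      best_var <- v
    }
  }
  if (is.null(best_fit)) break   # no remaining variable improves BIC
  current    <- best_fit
  candidates <- setdiff(candidates, best_var)
  bic_path   <- c(bic_path, best_bic)
}

print(formula(current))
print(bic_path)
```

The vector bic_path records the BIC at each step, so plot(bic_path, type = "b") gives a compact summary of the stepwise process.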
2.3. Testing the assumptions of the Classical Linear Regression Model (CLRM)
List all the assumptions behind the CLRM and the statistical tests that you prefer to use to
validate the assumptions.
Test your model to see if it satisfies all the assumptions.
If your model does not satisfy one or more of the assumptions, double check the set
of your independent variables. Sometimes excluding unnecessary variables solves the
issue.
If the problem still exists, use standard methods to rectify the problem.
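As an illustration of step 2.3, the sketch below checks three common assumptions using base R only: multicollinearity (variance inflation factors), homoskedasticity (the studentised Breusch-Pagan statistic, computed by hand as n times the R-squared of an auxiliary regression) and normality of residuals (Shapiro-Wilk). The choice of tests and the simulated model are assumptions; this is not a complete list of CLRM assumptions, and you may prefer other tests.

```r
# Simulated toy data and model (not HILDA).
set.seed(4)
n <- 400
dat <- data.frame(
  lscom  = rexp(n, rate = 0.2),
  mvcval = rlnorm(n, meanlog = 9, sdlog = 1),
  Renter = rbinom(n, 1, 0.3)
)
dat$losat <- 7 - 0.1 * dat$lscom - 0.8 * dat$Renter + rnorm(n)

fit <- lm(losat ~ lscom + log(mvcval) + Renter, data = dat)
res <- resid(fit)
X   <- model.matrix(fit)[, -1, drop = FALSE]   # regressors without the intercept

# Multicollinearity: VIF_j = 1 / (1 - R^2 of regressor j on the other regressors).
vif <- sapply(seq_len(ncol(X)), function(j) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
})
names(vif) <- colnames(X)

# Homoskedasticity: regress squared residuals on the regressors;
# n * R^2 is compared with a chi-squared distribution with k degrees of freedom.
aux     <- lm(res^2 ~ X)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = ncol(X), lower.tail = FALSE)

# Normality of residuals: Shapiro-Wilk test (valid for 3 to 5000 observations).
sw <- shapiro.test(res)

print(round(vif, 2))
cat("Breusch-Pagan p-value:", round(bp_p, 3), "\n")
cat("Shapiro-Wilk p-value:", round(sw$p.value, 3), "\n")
```

A small p-value in either test suggests that the corresponding assumption is violated, and large VIF values point to regressors that are close to linear combinations of the others.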
2.4. Validation.
To validate the accuracy of the model, simulate the dependent variable for the test
dataset and compare the results with the observed values. Discuss the model's
predictive ability.
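One possible way to carry out this validation in base R is sketched below, on simulated data. The error measures shown (RMSE, MAE and an out-of-sample R-squared) are common choices rather than requirements of this brief, and the seed 1234567 is a hypothetical student ID.

```r
# Simulated toy data (not HILDA).
set.seed(5)
n <- 500
dat <- data.frame(
  lscom  = rexp(n, rate = 0.2),
  Renter = rbinom(n, 1, 0.3)
)
dat$losat <- 7 - 0.1 * dat$lscom - 0.8 * dat$Renter + rnorm(n)

# 80/20 split with a hypothetical student ID as the seed.
set.seed(1234567)
train_idx <- sample(n, size = round(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Estimate on the train dataset only, then predict the test dataset.
fit  <- lm(losat ~ lscom + Renter, data = train)
pred <- predict(fit, newdata = test)

# Compare predictions with the observed values.
err     <- test$losat - pred
rmse    <- sqrt(mean(err^2))
mae     <- mean(abs(err))
r2_test <- 1 - sum(err^2) / sum((test$losat - mean(test$losat))^2)

cat("RMSE:", round(rmse, 3), " MAE:", round(mae, 3),
    " test R-squared:", round(r2_test, 3), "\n")

# plot(test$losat, pred) gives a visual check of predicted against observed values.
```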
2.5. Regarding your report, as you can see there is a long process behind developing a regression
model. However, you do not need to report all the work you have done. Think about what would
be interesting for readers to learn from your work and how to convey the
highlights of your study efficiently. For instance, you can provide a plot of the BIC variations in step 2.2
which summarises the stepwise process. Your report should include the final model, which
satisfies all the CLRM assumptions, and your justification for the coefficients and their signs.
3. Discrete choice analysis
3.1. Selecting the right model specification
The first step in developing a discrete choice model is deciding about the model
specification. The initial decision on model specification is mainly based on the
dependent variable. Note that this decision might change along the way.
3.2. Defining choices and setting up the utility functions
Discrete choice models, as the name implies, are developed to model the outcome of
selecting one option out of multiple available alternatives. The output of discrete
choice models is the probability of selecting each of the alternatives. In this study, we
extend the application of discrete choice models to the probability of belonging to a
category, rather than selecting a category. In fact, in our study people do not make a
decision about their level of satisfaction; rather, they feel that they belong to a certain category.
Although this does not resemble a choice setup, by modifying our definition of the utility
function we can still use discrete choice models in this context.
To simplify the model, aggregate the range of your dependent variable into three
categories: unsatisfied, moderate, and satisfied. The dependent variable varies from
0 to 10. Assume that values below 5 (0 to 4) indicate dissatisfaction, values above 7 (8 to 10)
indicate complete satisfaction, and the remaining values (5 to 7) are moderate. Based on this
assumption, define a new dependent
variable with three levels. Then calculate the “market share” of each of
the categories for the test dataset, the train dataset and overall.
Based on the nature of the available independent variables, discuss your alternative-specific
variables and generic variables, then derive a mathematical formulation for the
utility functions.
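The aggregation into three categories and the "market share" calculation can be sketched in base R as follows. The small vector of satisfaction scores is made up for illustration, and market_share is a hypothetical helper name.

```r
# Made-up satisfaction scores on the 0 to 10 scale.
losat <- c(0, 3, 4, 5, 6, 7, 8, 9, 10, 10)

# cut() uses right-closed intervals by default, so the breaks below give
# 0-4 = unsatisfied, 5-7 = moderate, 8-10 = satisfied.
sat_cat <- cut(losat,
               breaks = c(-Inf, 4, 7, Inf),
               labels = c("unsatisfied", "moderate", "satisfied"))

# Hypothetical helper: share of each category in a vector of categories.
market_share <- function(x) prop.table(table(x))

shares <- market_share(sat_cat)
print(shares)
```

The same helper can be applied to the category column of the train dataset, the test dataset and the full dataset to obtain the three sets of shares.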
3.3. Estimating the parameters of the model
According to the selected model specification and the defined utility functions, run a
discrete choice model with the same set of independent variables that you settled on
in your regression model.
Check the model’s goodness-of-fit, the statistical significance of the coefficients and
their interpretation.
Exclude insignificant variables from the model one by one. Each time you exclude
a variable, run the model again and check the significance of the remaining variables.
When you no longer have any insignificant variables, check if you can include any other
variables that you expected to have an impact on your dependent variable.
Similar to the regression modelling in this project, this process is also subjective and
there is no single correct solution. Remember to prioritise transport-related variables,
aim for higher goodness-of-fit, and keep an eye on the significance of the coefficients.
3.4. Examining the assumptions behind the selected model specification
At this stage you should verify that your model satisfies all the assumptions behind
your model specification. First, list all the assumptions that need to be tested and
provide a legitimate statistical test to validate the assumptions.
If your model does not satisfy one or more of the assumptions, double check the list of
independent variables. Sometimes excluding an unimportant variable fixes the issue.
If the problem still exists, use standard methods to rectify the problem.
3.5. Validation.
To validate the accuracy of the model, simulate the dependent variable for the test
dataset and compare the results with the observed values. Calculate the average share of
each category from the model and compare it with the observed shares. Discuss the
model's predictive ability.
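The share comparison in step 3.5 can be sketched as below. A multinomial logit fitted with multinom() from the nnet package (distributed with standard R installations) is used purely for illustration; it is not necessarily the specification you should select, and an ordered model may suit your dependent variable better. The data are simulated and the seed 1234567 is a hypothetical student ID.

```r
library(nnet)

# Simulated toy data (not HILDA), aggregated into the three satisfaction categories.
set.seed(6)
n <- 600
dat <- data.frame(
  lscom  = rexp(n, rate = 0.2),
  Renter = rbinom(n, 1, 0.3)
)
latent    <- 7 - 0.1 * dat$lscom - 0.8 * dat$Renter + rnorm(n, sd = 1.5)
dat$losat <- pmin(pmax(round(latent), 0), 10)
dat$sat_cat <- cut(dat$losat, breaks = c(-Inf, 4, 7, Inf),
                   labels = c("unsatisfied", "moderate", "satisfied"))

# 80/20 split with a hypothetical student ID as the seed.
set.seed(1234567)
train_idx <- sample(n, size = round(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Illustrative specification only: multinomial logit estimated on the train dataset.
fit <- multinom(sat_cat ~ lscom + Renter, data = train, trace = FALSE)

# Predicted probability of each category for every record in the test dataset.
probs <- predict(fit, newdata = test, type = "probs")

# Average predicted share per category against the observed share.
pred_shares <- colMeans(probs)
obs_shares  <- prop.table(table(test$sat_cat))
comparison  <- data.frame(
  category  = names(pred_shares),
  observed  = as.numeric(obs_shares[names(pred_shares)]),
  predicted = as.numeric(pred_shares)
)

# Share of test records whose most probable category matches the observed one.
pred_class <- colnames(probs)[max.col(probs, ties.method = "first")]
hit_rate   <- mean(pred_class == as.character(test$sat_cat))

print(comparison)
cat("Hit rate:", round(hit_rate, 3), "\n")
```

Aggregate shares can match well even when individual predictions are poor, so it is worth reporting both the share comparison and a record-level measure such as the hit rate.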
Deliverables
This project is an individual project and no group submission is accepted. Students are required to
submit two progress reports and one final report. All the reports should be typed and submitted to
Moodle as a PDF file. Late submission is accepted but 10% of the mark will be deducted for each day
of late submission.
The details of each report and their due dates are provided in the following table.
Report    Items to be covered    Details    Due date
Progress report 1    Data analysis    A report of at most three pages (excluding the cover page and reference page if necessary) presenting the progress made on the specified items    Fri 23 Mar, 16:00
Progress report 2    Regression analysis    A report of at most ten pages (excluding the cover page and reference page if necessary) presenting the progress made on the specified items    Fri 04 May, 16:00
Final report    All the required items according to the analysis section    A concise report on the project findings. The report should include introduction, data analysis, modelling practice, discussion and conclusion. The report should not exceed 30 pages (excluding the cover page, table of contents and reference page if necessary).    Fri 08 Jun, 16:00
Assessment
The project is worth 50% of the final grade. Students’ performance is evaluated based on their
submitted reports. Each of the two progress reports is worth 20% of the project’s total mark and the
final report is worth 60% of the project’s total mark. The reports will be assessed based on the
following criteria.
Report Assessment criteria Total credit
Progress report 1 (total credit: 20)
The structure of the report. Satisfying page limitation while addressing all the required items (2 points)
Providing standard descriptive statistics for dependent and independent variables (5 points)
Identifying potential issues with data (3 points)
Providing the correlation matrix (2 points)
Discussing correlation between variables, identifying correlated and uncorrelated variables and justifying (5 points)
Investigating transformed versions of variables (5 points)
Progress report 2 (total credit: 20)
The structure of the report. Satisfying page limitation while addressing all the required items (2 points)
Reporting the selected multi-variable regression model and discussing the findings (8 points)
Validating the assumptions of CLRM (7 points)
Validating the model against test dataset (7 points)
Final report
The structure of the report. Satisfying page limitation while addressing all the required items (3 points)
Extended discussion on the range and other descriptive statistics of explanatory variables (2 points)
Explaining potential variable transformation (2 points)
Providing correlation matrix and justifying poor and strong correlations (3 points)
Explaining the process of selecting the best regression model (3 points)
Validation of CLRM assumptions (2 points)
Discussing the findings in the regression analysis (3 points)
Validating the model (2 points)
Selecting a suitable discrete choice model specification and the utility functions (7 points)
Reporting the final model and discussing the estimated parameters (13 points)
Validating the assumptions behind the selected model specification (5 points)
Validating the model (5 points)
Conclusion of the study on the relationship between variables (10 points)