FIT5197 2019 S1 Assignment 2 (25 marks)
Wray Buntine
22 May 2019
Contents
1 Details
2 Hypothesis Testing (4 marks)
2.1 Generation (1 mark)
2.2 Confidence in sampling (1 mark)
2.3 Hypothesis testing (2 marks)
3 Logistic Regression (8 marks)
3.1 Loading the data (1 mark)
3.2 Run glm() on the data (1 mark)
3.3 Compare predictions (1 mark)
3.4 Stepwise regression with backwards selection (3 marks)
3.5 Discussion (2 marks)
4 Linear Regression (8 marks)
4.1 Run lm() on the data (1 mark)
4.2 Compare predictions (1 mark)
4.3 Single variable model (1 mark)
4.4 Stepwise regression with forwards selection (3 marks)
4.5 Analysis and Discussion (2 marks)
Submission due date: by 11:59pm on Wednesday 22 May 2019 (middle of Week 11)
1 Details
Marks
This assignment contains 3 questions. There are 25 marks in total for the assignment and it counts for 25%
of your overall grade for the unit. Of the 25 marks, 3 are awarded for code quality and 2 are awarded for
presentation of results, for instance use of plots. That leaves 20 marks for individual answers.
You must show all working, so we can see how you obtain your answer. Marks are assigned for working as
well as the correct answer.
Your solutions
Please put your name and student number on the first page of your solutions. Do not copy questions in
your solutions. Only include question numbers. If you use any external resources for developing your results,
please remember to provide the link to the source.
Special consideration and late submissions
Please contact unit CE Wray Buntine Wray.Buntine@monash.edu for special consideration as per Monash
policy. If an extension has been given, then submission after the due date is allowed with no penalty
incurred. If no extension has been given, assignments submitted after the due date will be penalised
5% per day, up to a maximum of 10 days late.
Submitting your assignment on Moodle
Please submit your assignment through Moodle by uploading a Word or PDF document as well as the R markdown
you used to generate your results.
• If you choose to knit a Word/PDF document directly from your R markdown file, you will need to type in
LaTeX equations for Questions 1, 2 and 5. Find more information about using LaTeX in R markdown
files here. You may also find the R markdown cheatsheet useful. WARNING: Full LaTeX is challenging.
• You can also work with Word and R markdown separately. In this case you will need to type your
answers in Word and also copy the R code (using the format: Courier New), results and figures into the
Word document.
We will mark your submission mostly using your Word/PDF document. However, you need to make sure
your R markdown file is executable in case we need to check your code.
Code quality marks
Your R code will be reviewed for conciseness, efficiency, explainability and quality. Inline documentation, for
instance, should demarcate key sections and explain less obvious operations, but shouldn’t be overly verbose.
Out of the 25 marks, 3 will be awarded for code quality.
• basic embedded comments to support understandability
• reasonable variable names
• good use of standard functions and libraries
• writing your own functions (e.g., to limit repeated code)
Presentation marks
Your presentation of results using R will be reviewed: how well do you use plots or other means of ordering
and conveying results? Out of the 25 marks, 2 will be awarded for presentation using R.
• creative use of plotting
• somewhat readable printing of results (i.e., tables should roughly appear as tables, not as 10 numbers
one after another)
• text descriptions embedded in results and good labels on plots
2 Hypothesis Testing (4 marks)
We have the same spike data as for Assignment 1.
spikes <- c(0.220136914, 1.252061356, 0.943525370, 0.907732787, 1.157388806, 0.342485956,
0.291760012, 0.556866189, 0.738992636, 0.690779640, 0.425849738, 0.876344116,
1.248761245, 0.697514552, 0.174445203, 1.376500202, 0.731507303, 0.483036515,
0.650835440, 1.106788259, 0.587840538, 0.978983532, 1.179754064, 0.941462421,
0.749840071, 0.005994156, 0.664525928, 0.816033621, 0.483828371, 0.524253461)
You are to test the hypothesis that the spike distribution is Weibull with shape=2.5 and scale=1.0.
You will create a hypothesis testing algorithm by sampling. So:
H0 : spikes ~ Weibull(shape=2.5, scale=1.0)
The test will consist of generating samples of size n = 30 and seeing how often their likelihood is less than the
likelihood of the sample spikes above. To avoid rejecting H0 at the 90% confidence level, the generated samples'
likelihood should be less than the spikes likelihood more than 10% of the time.
2.1 Generation (1 mark)
Write the sample generator generator() and the log likelihood function loglikelihood(datavec) in R.
The generator function should output a single sample of size 30. The log likelihood function should take a
single data vector like spikes and return its log likelihood under H0. These use the rweibull and dweibull
functions. We work with the log likelihood since the likelihood function may suffer from numeric underflow.
Thus to reject the null hypothesis, more than 90% of calls to loglikelihood(generator()) should return a
value greater than the data likelihood loglikelihood(spikes).
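As a concrete starting point, a minimal sketch of the two functions (using the rweibull and dweibull functions named above; the log = TRUE argument of dweibull avoids computing and then logging tiny densities):

```r
# Sketch of the two functions under H0: Weibull(shape=2.5, scale=1.0).
generator <- function(n = 30) {
  # one sample of size n from the hypothesised distribution
  rweibull(n, shape = 2.5, scale = 1.0)
}
loglikelihood <- function(datavec) {
  # log likelihood of the whole vector = sum of the log densities
  sum(dweibull(datavec, shape = 2.5, scale = 1.0, log = TRUE))
}
```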
2.2 Confidence in sampling (1 mark)
Recall from above that to reject the null hypothesis, more than 90% of calls to loglikelihood(generator())
should return a value greater than loglikelihood(spikes). This is therefore a Bernoulli sampling problem:
how many calls to loglikelihood(generator()) should we make to estimate what proportion are greater than
the data likelihood loglikelihood(spikes)?
Make sure you generate enough samples so that you can be confident of accept/reject. Let us use 95% as the
level for this, and note this 95% is for our confidence at the meta-level. The 90% is for our null hypothesis.
Come up with a reasonable justification using confidence intervals, and remember this is a Bernoulli sampling
problem. First, if you generated n samples, and k of them had a likelihood greater than your data likelihood,
then as long as n > 100, you can reasonably use the CLT to get 95% confidence bounds for the “true”
proportion of samples. If this excludes the 90% value, then you have enough samples. Otherwise, generate
more samples. Describe a heuristic algorithm so that when you finish you have the critical 90% level outside
the confidence bound.
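As an illustration of the kind of check this heuristic needs (a sketch; the function name and the doubling strategy below are ours, not part of the assignment):

```r
# Sketch: 95% CLT confidence interval for a Bernoulli proportion.
# n = calls to loglikelihood(generator()); k = how many returned a value
# greater than loglikelihood(spikes). Sampling can stop once the critical
# 90% level lies outside the interval.
enough_samples <- function(k, n, target = 0.9, z = 1.96) {
  phat <- k / n
  half <- z * sqrt(phat * (1 - phat) / n)   # CLT half-width
  (target < phat - half) || (target > phat + half)
}
```

A simple heuristic is then: generate a batch of samples, check enough_samples(), and double the batch size until it returns TRUE.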
2.3 Hypothesis testing (2 marks)
Use the two functions from Question 2.1 to create a hypothesis testing algorithm. No other knowledge of the
Weibull distribution, and no other statistics, are needed. Run the sampler for enough samples using the method in Question 2.2
to evaluate the hypothesis. Report your results.
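Putting the pieces together might look like this sketch (it assumes spikes and the two functions from Question 2.1; a fixed n_sims is shown for brevity, where your Question 2.2 heuristic would decide the number of samples):

```r
# Sketch of the overall test. Decision rule: reject H0 if more than 90%
# of simulated samples have a likelihood greater than the data's.
run_test <- function(datavec, n_sims = 1000) {
  ll_data <- loglikelihood(datavec)
  greater <- replicate(n_sims, loglikelihood(generator()) > ll_data)
  phat <- mean(greater)
  list(proportion = phat, reject = phat > 0.9)
}
# run_test(spikes)
```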
3 Logistic Regression (8 marks)
We’re going to explore the Indian Liver Patient dataset from Kaggle
• https://www.kaggle.com/uciml/indian-liver-patient-records
The first 400 cases are used as training data, and the remainder as test data. The dataset and features are
explained at the Kaggle site. In this question, we will explore the data with R’s glm().
3.1 Loading the data (1 mark)
First load the data into a data frame using:
il <- read.csv("indian_liver_patient_train.csv", header=TRUE, sep=",")
summary(il)
What issues can you see that could cause problems with using R on this data, for instance to use glm() to
predict the presence of liver disease given by the Dataset variable? How would you overcome them? Write a
short R script to do this.
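For instance, two issues one would expect from the Kaggle description are a 1/2-coded target and missing values. A sketch of a cleanup (the column names are taken from the Kaggle page, and the 0/1 target name LD matches the scaffold code used later in this question):

```r
# Sketch of a cleanup script; adjust to whatever summary(il) actually shows.
# 1. Dataset codes liver disease as 1/2, but glm(..., family=binomial)
#    wants a 0/1 (or factor) target; recode it as LD.
il$LD <- ifelse(il$Dataset == 1, 1, 0)
# 2. Albumin_and_Globulin_Ratio has missing values; the simplest fix is to
#    drop those rows (imputing a value is an alternative).
il <- il[!is.na(il$Albumin_and_Globulin_Ratio), ]
```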
3.2 Run glm() on the data (1 mark)
Try building a basic model with glm() with something like:
fit <- glm(target ~ ., family=binomial, data=il)
where target is the variable indicating liver disease. Print the summary of the model to get the R diagnostics.
Briefly explain the statistics in the summary, e.g. Z-value, standard error, etc. What does this imply about
the features for your model?
3.3 Compare predictions (1 mark)
Then get fitted probabilities for the model using:
pld <- predict(fit, type="response")
Compare the fitted probabilities with the truth values graphically. What do you see?
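One simple option (a sketch; it assumes the 0/1 target created in Question 3.1 is called LD, as in the scaffold code below) is boxplots of the fitted probability split by the true class:

```r
# Boxplots of fitted probability by true class: well-separated boxes mean
# the model is discriminating between the classes.
boxplot(pld ~ il$LD, xlab = "true class (LD)", ylab = "fitted probability",
        main = "Fitted probabilities by true class")
```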
3.4 Stepwise regression with backwards selection (3 marks)
Having built a model with all features, let us modify this by removing one feature at a time.
R provides all sorts of sophisticated routines for doing this like step(), but we will ignore these and write
our own.
You will have to identify a way of selecting which feature to remove, or whether to stop removing and
quit. For this purpose you could run an additional glm() model, or you could grab diagnostics using R
and test them somehow. The following are available, where fit is the returned structure from glm(), and
sum.fit=summary(fit):
• run names(fit), and see variables such as fit$deviance
• the full content of sum.fit can be obtained with print(sum.fit); a description of its data structure can
be obtained with str(sum.fit)
• try expressions such as coef(sum.fit)["Age", 2]
• or coef(sum.fit)["Total_Proteins", 3]
• attr(coef(sumfit),"dimnames")[[1]] gives the names of the features in the coefficient vectors.
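For example, column 4 of the coefficient table holds the p-value, one plausible score for choosing a feature to remove. A sketch on synthetic data (the frame d and its columns are stand-ins, not the assignment data):

```r
set.seed(1)                     # synthetic stand-in for il
d <- data.frame(y = rbinom(50, 1, 0.5), x = rnorm(50))
fit <- glm(y ~ x, family = binomial, data = d)
sum.fit <- summary(fit)
pval <- coef(sum.fit)["x", 4]   # column 4 = Pr(>|z|), the p-value
```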
To help you do this, code to run through all the features and test a formula is as follows:
# here is a DUMMY random function to select a feature to remove
# => you should write a better one!
selectVar <- function(fset, fit, sumfit) {
  maxscore = 0.7   # what is your initialisation?
  maxfeat = "STOP"
  # loop through coefficients
  for ( var in attr(coef(sumfit), "dimnames")[[1]] ) {
    if ( startsWith(var, "(Inter") ) {
      # but ignore the intercept
      next
    }
    # what score does this feature have?
    fval = runif(1, 0, 1)
    if ( fval > maxscore ) {
      # save best score so far
      maxscore = fval
      maxfeat = var
    }
  }
  if ( startsWith(maxfeat, "Gender") ) {
    # Gender is a factor, so its coefficient name is different
    maxfeat = "Gender"
  }
  # at this stage, if no feature had a score beating the initialisation,
  # then the chosen feature will be "STOP"
  print(paste("Removing ", maxfeat, " with value ", maxscore))
  return(maxfeat)
}
# this is the active predictor list
nil <- names(il)
# delete the target variables from it
nil <- nil[(!nil=="LD") & (!nil=="Dataset")]
print(paste("STARTING: ", paste(nil, collapse="+")))
# run with initial list
fit <- glm(paste("LD ~ ", paste(nil, collapse="+")), family=binomial, data=il)
sum.fit <- summary(fit)
for (loop in 1:10) {
  if ( length(nil) == 0 ) {
    # removed everything so quit
    break
  }
  # var is to be tested
  var <- selectVar(nil, fit, sum.fit)
  if ( var == "STOP" ) {
    break
  }
  # remove from list
  nil <- nil[!nil==var]
  print(paste("REMOVED: ", var))
  # report
  print(paste("RUNNING: ", paste(nil, collapse="+")))
  # now run with modified list
  fit <- glm(paste("LD ~ ", paste(nil, collapse="+")), family=binomial, data=il)
  sum.fit <- summary(fit)
  # do something, print some metric, but what?
  print(paste("GLM out: ", sum.fit$df.residual))
}
# report the final fit
summary(fit)
Your task is to rewrite the function selectVar and anything else needed to get the backward selection
working. You may want to print diagnostics at the end of the loop.
3.5 Discussion (2 marks)
Report on the initial and final fits produced by your algorithm. The initial fit used all predictors, but its
diagnostics reported that not all were useful. Compare these with your final model. How do the predictors
actually used compare with their statistics in the initial model?
Read in the test data “indian_liver_patient_test.csv” and test the resultant models on the test data as well,
and report the error rate of prediction. Has the backward selection worked? If not, how wrong was it, and
why?
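Error rate here can be taken as the fraction of test cases misclassified when the fitted probability is thresholded at 0.5. A sketch (the helper name is ours; it assumes the test frame has been given the same cleanup as the training frame, including the 0/1 LD column):

```r
# Classification error rate of a fitted glm on held-out data.
glm_error_rate <- function(fit, new_data, target = "LD") {
  p <- predict(fit, newdata = new_data, type = "response")
  pred <- ifelse(p > 0.5, 1, 0)        # threshold the probabilities
  mean(pred != new_data[[target]])     # fraction misclassified
}
# e.g. ilt <- read.csv("indian_liver_patient_test.csv", header=TRUE)
#      ... same cleanup as Question 3.1 ...
#      glm_error_rate(fit, ilt)
```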
4 Linear Regression (8 marks)
We will repeat the exercise with linear regression. So read in the training dataset afresh, and this time the
target to predict is Total_Proteins.
4.1 Run lm() on the data (1 mark)
Try building a basic model with lm() with something like:
fit <- lm(Total_Proteins ~ ., data=il)
Print the summary of the model to get the R diagnostics. Explain the statistics in the summary, e.g. standard
error, t value. Explain how the probability (a p-value) in the table is computed. What does this imply
about the features for your model? What does the value of multiple R-squared imply?
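On the p-value: for lm() it is the two-sided tail probability of t = estimate / standard error under a t distribution with the residual degrees of freedom. A sketch of that computation (the helper name is ours):

```r
# Reproduce the coefficient p-value reported by summary(lm(...)).
p_from_t <- function(t_value, df_residual) {
  2 * pt(-abs(t_value), df = df_residual)   # two-sided tail probability
}
# p_from_t(coef(sum.fit)["Age", 3], fit$df.residual) should match
# coef(sum.fit)["Age", 4]
```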
4.2 Compare predictions (1 mark)
Then get fitted values from the model using:
predict.Total_Proteins <- predict(fit, type="response")
Compare the fitted values with the true values graphically. What do you see?
4.3 Single variable model (1 mark)
Consider using just one predictor, so you have a model like Total_Proteins ~ Albumin or
Total_Proteins ~ Total_Bilirubin.
What is a way to pick the best single predictor for this purpose? Explain an approach (but you don’t have to
code it or run it).
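One approach, sketched below, is to fit each one-variable model and keep the predictor with the largest R-squared (for a single numeric predictor this equals the squared correlation with the target); the helper name is ours:

```r
# Score each candidate predictor by the R-squared of its one-variable model.
single_var_r2 <- function(data, target, predictors) {
  sapply(predictors, function(v) {
    f <- as.formula(paste(target, "~", v))
    summary(lm(f, data = data))$r.squared
  })
}
# names(which.max(single_var_r2(il, "Total_Proteins", candidates)))
```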
4.4 Stepwise regression with forwards selection (3 marks)
Starting with a model with no features, let us modify this by adding one feature at a time.
R provides all sorts of sophisticated routines for doing this, like step(), but we will ignore these and write
our own.
You will have to identify a way of selecting which feature to add, or whether to stop adding and quit. For this
purpose you could run an additional lm() model, or you could grab diagnostics using R and test them somehow.
The following are available, when fit is the returned structure from lm() and sum.fit=summary(fit):
• content of sum.fit can be obtained as before
• try expressions such as coef(sum.fit)["Age", 2] (or, 1, 3, 4?)
• or sum.fit$r.squared
To help you do this, code to run through all the features and test a formula is as follows:
# here is a DUMMY random function to select a feature to add
# => you should write a better one!
selectAddVar <- function(fset, fit, sum.fit) {
  # fset = the existing set of features to select from
  if ( runif(1,0,1) > 0.9 | length(fset) == 0 ) {
    # your decision of when to stop in here
    return("STOP")
  }
  # your method to select the best feature here
  d <- sample(length(fset), 1)   # a random index in 1..length(fset)
  return(fset[d])
}
# this is the active predictor list
nil <- names(il)
# delete the target variables from the set
nil <- nil[(!nil=="LD") & (!nil=="Total_Proteins")]
print(paste("STARTING: ", paste(nil, collapse="+")))
# start with the empty set of features for lm()
fset <- NULL
# now run with the initial (intercept-only) model; pasting "1" keeps the
# formula valid while fset is empty
fit <- lm(paste("Total_Proteins ~ ", paste(c("1", fset), collapse="+")), data=il)
sum.fit <- summary(fit)
for (loop in 1:10) {
  if ( length(nil) == 0 ) {
    # quit if none more to add
    break
  }
  # var is to be tested
  var <- selectAddVar(nil, fit, sum.fit)
  if ( var == "STOP" ) {
    # quit as told to stop
    break
  }
  # move from the candidate list to the active set
  nil <- nil[!nil==var]
  fset <- append(fset, var)
  print(paste("ADDED: ", var))
  # report
  print(paste("RUNNING: ", paste(fset, collapse="+")))
  # now run with modified list
  fit <- lm(paste("Total_Proteins ~ ", paste(fset, collapse="+")), data=il)
  sum.fit <- summary(fit)
  print(paste("LM Rsquared out: ", sum.fit$r.squared))
}
# report the final fit
summary(fit)
Your task is to rewrite the function selectAddVar and anything else needed to get the forward selection working.
4.5 Analysis and Discussion (2 marks)
Report on the initial and final fits produced by your algorithm, their diagnostics, as well as the final set of
variables selected. How do they compare?
Read in the test data “indian_liver_patient_test.csv” and test the resultant models on the test data as well,
and report the mean square error. Has the forward selection worked? If not, how wrong was it, and why?
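Mean square error on the test data can be computed as in this sketch (the helper name is ours; it assumes the test frame keeps the Total_Proteins column):

```r
# Mean square error of a fitted lm on held-out data.
lm_mse <- function(fit, new_data, target = "Total_Proteins") {
  pred <- predict(fit, newdata = new_data)
  mean((pred - new_data[[target]])^2)
}
# e.g. ilt <- read.csv("indian_liver_patient_test.csv", header=TRUE)
#      lm_mse(fit, ilt)
```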