FIT5197 2019 S1 Assignment 2 (25 marks)
Wray Buntine
22 May 2019
Contents
1 Details
2 Hypothesis Testing (4 marks)
2.1 Generation (1 mark)
2.2 Confidence in sampling (1 mark)
2.3 Hypothesis testing (2 marks)
3 Logistic Regression (8 marks)
3.1 Loading the data (1 mark)
3.2 Run glm() on the data (1 mark)
3.3 Compare predictions (1 mark)
3.4 Stepwise regression with backwards selection (3 marks)
3.5 Discussion (2 marks)
4 Linear Regression (8 marks)
4.1 Run lm() on the data (1 mark)
4.2 Compare predictions (1 mark)
4.3 Single variable model (1 mark)
4.4 Stepwise regression with forwards selection (3 marks)
4.5 Analysis and Discussion (2 marks)
Submission due date: by 11:59pm on Wednesday 22 May 2019 (middle of Week 11)
1 Details
Marks
This assignment contains 3 questions. There are 25 marks in total for the assignment and it counts for 25%
of your overall grade for the unit. Of the 25 marks, 3 are awarded for code quality and 2 are awarded for
presentation of results, for instance use of plots. That leaves 20 marks for individual answers.
You must show all working, so we can see how you obtain your answer. Marks are assigned for working as
well as the correct answer.
Your solutions
Please put your name and student number on the first page of your solutions. Do not copy questions in
your solutions. Only include question numbers. If you use any external resources for developing your results,
please remember to provide the link to the source.
Special consideration and late submissions
Please contact unit CE Wray Buntine Wray.Buntine@monash.edu for special consideration as per Monash
policy. If an extension has been given, then submission after the due date is allowed with no penalty
incurred. If no extension has been given, assignments submitted after the due date will be penalised
5% per day, up to a maximum of 10 days late.
Submitting your assignment on Moodle
Please submit your assignment through Moodle by uploading a Word or PDF document as well as the R markdown
you used to generate your results.
• If you choose to knit a Word/PDF document directly from your R markdown file, you will need to type in
LaTeX equations for Questions 1, 2 and 5. Find more information about using LaTeX in R markdown
files here. You may also find the R markdown cheatsheet useful. WARNING: Full LaTeX is challenging.
• You can also work with Word and R markdown separately. In this case you will need to type your
answers in Word and also copy the R code (using the format: Courier New), results and figures into the
Word document.
We will mark your submission mostly using your Word/PDF document. However, you need to make sure
your R markdown file is executable in case we need to check your code.
Code quality marks
Your R code will be reviewed for conciseness, efficiency, explainability and quality. Inline documentation, for
instance, should demarcate key sections and explain less obvious operations, but shouldn’t be overly verbose.
Out of the 25 marks, 3 will be awarded for code quality.
• basic embedded comments to support understandability
• reasonable variable names
• good use of standard functions and libraries
• writing your own functions (e.g., to limit repeated code)
Presentation marks
Your presentation of results using R will be reviewed: how well do you use plots or other means of ordering
and conveying results? Out of the 25 marks, 2 will be awarded for presentation using R.
• creative use of plotting
• somewhat readable printing of results (i.e., tables should roughly appear as tables, not as 10 numbers
one after another)
• text descriptions embedded in results and good labels on plots
2 Hypothesis Testing (4 marks)
We have the same spike data as for Assignment 1.
spikes <- c(0.220136914, 1.252061356, 0.943525370, 0.907732787, 1.157388806, 0.342485956,
0.291760012, 0.556866189, 0.738992636, 0.690779640, 0.425849738, 0.876344116,
1.248761245, 0.697514552, 0.174445203, 1.376500202, 0.731507303, 0.483036515,
0.650835440, 1.106788259, 0.587840538, 0.978983532, 1.179754064, 0.941462421,
0.749840071, 0.005994156, 0.664525928, 0.816033621, 0.483828371, 0.524253461)
You are to test the hypothesis that the spike distribution is Weibull with shape=2.5 and scale=1.0.
You will create a hypothesis testing algorithm by sampling. So:
H0 : spikes ~ Weibull(shape=2.5, scale=1.0)
The test will consist of generating samples of size n = 30 and seeing how often their likelihood is less than the
likelihood of the sample spikes above. To avoid rejecting H0 at the 90% confidence level, the generated samples'
likelihood should be less than the spikes likelihood more than 10% of the time.
2.1 Generation (1 mark)
Write the sample generator generator() and the log likelihood function loglikelihood(datavec) in R.
The generator function should output a single sample of size 30. The log likelihood function should take a
single data vector like spikes and return its log likelihood under H0. These use the rweibull and dweibull
functions. We work with the log likelihood since the likelihood function may suffer from numeric underflow.
Thus to reject the null hypothesis, more than 90% of calls to loglikelihood(generator()) should return a
value greater than the data likelihood loglikelihood(spikes).
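As a concrete starting point, a minimal sketch of the two functions (using the rweibull and dweibull functions named above; the log = TRUE argument of dweibull avoids computing and then logging tiny densities):

```r
# Sketch of the two functions under H0: Weibull(shape=2.5, scale=1.0).
generator <- function(n = 30) {
  # one sample of size n from the hypothesised distribution
  rweibull(n, shape = 2.5, scale = 1.0)
}
loglikelihood <- function(datavec) {
  # log likelihood of the whole vector = sum of the log densities
  sum(dweibull(datavec, shape = 2.5, scale = 1.0, log = TRUE))
}
```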
2.2 Confidence in sampling (1 mark)
Recall from above that to reject the null hypothesis, more than 90% of calls to loglikelihood(generator())
should return a value greater than loglikelihood(spikes). This is therefore a Bernoulli sampling problem:
how many calls to loglikelihood(generator()) should we make to estimate what proportion are greater than
the data likelihood loglikelihood(spikes)?
Make sure you generate enough samples so that you can be confident of accept/reject. Let us use 95% as the
level for this, and note this 95% is for our confidence at the meta-level. The 90% is for our null hypothesis.
Come up with a reasonable justification using confidence intervals, and remember this is a Bernoulli sampling
problem. First, if you generated n samples, and k of them had a likelihood greater than your data likelihood,
then as long as n > 100, you can reasonably use the CLT to get 95% confidence bounds for the “true”
proportion of samples. If this excludes the 90% value, then you have enough samples. Otherwise, generate
more samples. Describe a heuristic algorithm so that when you finish you have the critical 90% level outside
the confidence bound.
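As an illustration of the kind of check this heuristic needs (a sketch; the function name and the doubling strategy below are ours, not part of the assignment):

```r
# Sketch: 95% CLT confidence interval for a Bernoulli proportion.
# n = calls to loglikelihood(generator()); k = how many returned a value
# greater than loglikelihood(spikes). Sampling can stop once the critical
# 90% level lies outside the interval.
enough_samples <- function(k, n, target = 0.9, z = 1.96) {
  phat <- k / n
  half <- z * sqrt(phat * (1 - phat) / n)   # CLT half-width
  (target < phat - half) || (target > phat + half)
}
```

A simple heuristic is then: generate a batch of samples, check enough_samples(), and double the batch size until it returns TRUE.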
2.3 Hypothesis testing (2 marks)
Use the two functions from Question 2.1 to create a hypothesis testing algorithm. No other knowledge of the
Weibull distribution, and no other statistics, are needed. Run the sampler for enough samples using the method in Question 2.2
to evaluate the hypothesis. Report your results.
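Putting the pieces together might look like this sketch (it assumes spikes and the two functions from Question 2.1; a fixed n_sims is shown for brevity, where your Question 2.2 heuristic would decide the number of samples):

```r
# Sketch of the overall test. Decision rule: reject H0 if more than 90%
# of simulated samples have a likelihood greater than the data's.
run_test <- function(datavec, n_sims = 1000) {
  ll_data <- loglikelihood(datavec)
  greater <- replicate(n_sims, loglikelihood(generator()) > ll_data)
  phat <- mean(greater)
  list(proportion = phat, reject = phat > 0.9)
}
# run_test(spikes)
```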
3 Logistic Regression (8 marks)
We’re going to explore the Indian Liver Patient dataset from Kaggle
• https://www.kaggle.com/uciml/indian-liver-patient-records
The first 400 cases are used as training data, and the remainder as test data. The dataset and features are
explained at the Kaggle site. In this question, we will explore the data with R’s glm().
3.1 Loading the data (1 mark)
First load the data into a data frame using:
il <- read.csv("indian_liver_patient_train.csv", header=TRUE, sep=",")
summary(il)
What issues can you see that could cause problems with using R on this data, for instance to use glm() to
predict the presence of liver disease given by the Dataset variable? How would you overcome them? Write a
short R script to do this.
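For instance, two issues one would expect from the Kaggle description are a 1/2-coded target and missing values. A sketch of a cleanup (the column names are taken from the Kaggle page, and the 0/1 target name LD matches the scaffold code used later in this question):

```r
# Sketch of a cleanup script; adjust to whatever summary(il) actually shows.
# 1. Dataset codes liver disease as 1/2, but glm(..., family=binomial)
#    wants a 0/1 (or factor) target; recode it as LD.
il$LD <- ifelse(il$Dataset == 1, 1, 0)
# 2. Albumin_and_Globulin_Ratio has missing values; the simplest fix is to
#    drop those rows (imputing a value is an alternative).
il <- il[!is.na(il$Albumin_and_Globulin_Ratio), ]
```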
3.2 Run glm() on the data (1 mark)
Try building a basic model with glm() with something like:
fit <- glm(target ~ ., family=binomial, data=il)
where target is the variable indicating liver disease. Print the summary of the model to get the R diagnostics.
Briefly explain the statistics in the summary, e.g. Z-value, standard error, etc. What does this imply about
the features for your model?
3.3 Compare predictions (1 mark)
Then get fitted probabilities for the model using:
pld <- predict(fit, type="response")
Compare the fitted probabilities with the truth values graphically. What do you see?
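One simple option (a sketch; it assumes the 0/1 target created in Question 3.1 is called LD, as in the scaffold code below) is boxplots of the fitted probability split by the true class:

```r
# Boxplots of fitted probability by true class: well-separated boxes mean
# the model is discriminating between the classes.
boxplot(pld ~ il$LD, xlab = "true class (LD)", ylab = "fitted probability",
        main = "Fitted probabilities by true class")
```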
3.4 Stepwise regression with backwards selection (3 marks)
Having built a model with all features, let us modify this by removing one feature at a time.
R provides all sorts of sophisticated routines for doing this like step(), but we will ignore these and write
our own.
You will have to identify a way of selecting which feature to remove, or whether to stop removing and
quit. For this purpose you could run an additional glm() model, or you could grab diagnostics using R
and test them somehow. The following are available, where fit is the returned structure from glm(), and
sum.fit=summary(fit):
• run names(fit), and see variables such as fit$deviance
• the full content of sum.fit can be obtained with print(sum.fit); a description of its data structure can
be obtained with str(sum.fit)
• try expressions such as coef(sum.fit)["Age", 2]
• or coef(sum.fit)["Total_Proteins", 3]
• attr(coef(sumfit),"dimnames")[[1]] gives the names of the features in the coefficient vectors.
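For example, column 4 of the coefficient table holds the p-value, one plausible score for choosing a feature to remove. A sketch on synthetic data (the frame d and its columns are stand-ins, not the assignment data):

```r
set.seed(1)                     # synthetic stand-in for il
d <- data.frame(y = rbinom(50, 1, 0.5), x = rnorm(50))
fit <- glm(y ~ x, family = binomial, data = d)
sum.fit <- summary(fit)
pval <- coef(sum.fit)["x", 4]   # column 4 = Pr(>|z|), the p-value
```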
To help you do this, code to run through all the features and test a formula is as follows:
# here is a DUMMY random function to select a feature to remove
# => you should write a better one!
selectVar <- function(fset, fit, sumfit) {
  maxscore = 0.7   # what is your initialisation?
  maxfeat = "STOP"
  # loop through coefficients
  for ( var in attr(coef(sumfit), "dimnames")[[1]] ) {
    if ( startsWith(var, "(Inter") ) {
      # but ignore the intercept
      next
    }
    # what score does this feature have?
    fval = runif(1, 0, 1)
    if ( fval > maxscore ) {
      # save best score so far
      maxscore = fval
      maxfeat = var
    }
  }
  if ( startsWith(maxfeat, "Gender") ) {
    # Gender is a factor, so its coefficient name is different
    maxfeat = "Gender"
  }
  # at this stage, if no feature had a score beating the initialisation,
  # then the chosen feature will be "STOP"
  print(paste("Removing ", maxfeat, " with value ", maxscore))
  return(maxfeat)
}
# this is the active predictor list
nil <- names(il)
# delete the target variables from it
nil <- nil[(!nil=="LD") & (!nil=="Dataset")]
print(paste("STARTING: ", paste(nil, collapse="+")))
# run with initial list
fit <- glm(paste("LD ~ ", paste(nil, collapse="+")), family=binomial, data=il)
sum.fit <- summary(fit)
for (loop in 1:10) {
  if ( length(nil) == 0 ) {
    # removed everything so quit
    break
  }
  # var is to be tested
  var <- selectVar(nil, fit, sum.fit)
  if ( var == "STOP" ) {
    break
  }
  # remove from list
  nil <- nil[!nil==var]
  print(paste("REMOVED: ", var))
  # report
  print(paste("RUNNING: ", paste(nil, collapse="+")))
  # now run with modified list
  fit <- glm(paste("LD ~ ", paste(nil, collapse="+")), family=binomial, data=il)
  sum.fit <- summary(fit)
  # do something, print some metric, but what?
  print(paste("GLM out: ", sum.fit$df.residual))
}
# report the final fit
summary(fit)
Your task is to rewrite the function selectVar and anything else needed to get the backward selection
working. You may want to print diagnostics at the end of the loop.
3.5 Discussion (2 marks)
Report on the initial and final fits produced by your algorithm. The initial fit used all predictors, but its
diagnostics reported that not all were useful. Compare these with your final model. How do the predictors
actually used compare with their statistics in the initial model?
Read in the test data “indian_liver_patient_test.csv” and test the resultant models on the test data as well,
and report the error rate of prediction. Has the backward selection worked? If not, how wrong was it, and
why?
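Error rate here can be taken as the fraction of test cases misclassified when the fitted probability is thresholded at 0.5. A sketch (the helper name is ours; it assumes the test frame has been given the same cleanup as the training frame, including the 0/1 LD column):

```r
# Classification error rate of a fitted glm on held-out data.
glm_error_rate <- function(fit, new_data, target = "LD") {
  p <- predict(fit, newdata = new_data, type = "response")
  pred <- ifelse(p > 0.5, 1, 0)        # threshold the probabilities
  mean(pred != new_data[[target]])     # fraction misclassified
}
# e.g. ilt <- read.csv("indian_liver_patient_test.csv", header=TRUE)
#      ... same cleanup as Question 3.1 ...
#      glm_error_rate(fit, ilt)
```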
4 Linear Regression (8 marks)
We will repeat the exercise with linear regression. So read in the training dataset afresh, and this time the
target to predict is Total_Proteins.
4.1 Run lm() on the data (1 mark)
Try building a basic model with lm() with something like:
fit <- lm(Total_Proteins ~ ., data=il)
Print the summary of the model to get the R diagnostics. Explain the statistics in the summary, e.g. standard
error, t value. Explain how the probability (a p-value) in the table is computed. What does this imply
about the features for your model? What does the value of multiple R-squared imply?
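On the p-value: for lm() it is the two-sided tail probability of t = estimate / standard error under a t distribution with the residual degrees of freedom. A sketch of that computation (the helper name is ours):

```r
# Reproduce the coefficient p-value reported by summary(lm(...)).
p_from_t <- function(t_value, df_residual) {
  2 * pt(-abs(t_value), df = df_residual)   # two-sided tail probability
}
# p_from_t(coef(sum.fit)["Age", 3], fit$df.residual) should match
# coef(sum.fit)["Age", 4]
```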
4.2 Compare predictions (1 mark)
Then get fitted values from the model using:
predict.Total_Proteins <- predict(fit, type="response")
Compare the fitted values with the true values graphically. What do you see?
4.3 Single variable model (1 mark)
Consider using just one predictor, so you have a model like Total_Proteins ~ Albumin or
Total_Proteins ~ Total_Bilirubin.
What is a way to pick the best single predictor for this purpose? Explain an approach (but you don’t have to
code it or run it).
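One approach, sketched below, is to fit each one-variable model and keep the predictor with the largest R-squared (for a single numeric predictor this equals the squared correlation with the target); the helper name is ours:

```r
# Score each candidate predictor by the R-squared of its one-variable model.
single_var_r2 <- function(data, target, predictors) {
  sapply(predictors, function(v) {
    f <- as.formula(paste(target, "~", v))
    summary(lm(f, data = data))$r.squared
  })
}
# names(which.max(single_var_r2(il, "Total_Proteins", candidates)))
```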
4.4 Stepwise regression with forwards selection (3 marks)
Starting with a model with no features, let us modify this by adding one feature at a time.
R provides all sorts of sophisticated routines for doing this, like step(), but we will ignore these and write
our own.
You will have to identify a way of selecting which feature to add, or whether to stop adding and quit. For this
purpose you could run an additional lm() model, or you could grab diagnostics using R and test them somehow.
The following are available, when fit is the returned structure from lm() and sum.fit=summary(fit):
• content of sum.fit can be obtained as before
• try expressions such as coef(sum.fit)["Age", 2] (or, 1, 3, 4?)
• or sum.fit$r.squared
To help you do this, code to run through all the features and test a formula is as follows:
# here is a DUMMY random function to select a feature to add
# => you should write a better one!
selectAddVar <- function(fset, fit, sum.fit) {
  # fset = the existing set of features to select from
  if ( runif(1,0,1) > 0.9 | length(fset) == 0 ) {
    # your decision of when to stop in here
    return("STOP")
  }
  # your method to select the best feature here
  d <- sample(length(fset), 1)   # a random index in 1..length(fset)
  return(fset[d])
}
# this is the active predictor list
nil <- names(il)
# delete the target variables from the set
nil <- nil[(!nil=="LD") & (!nil=="Total_Proteins")]
print(paste("STARTING: ", paste(nil, collapse="+")))
# start with the empty set of features for lm()
fset <- NULL
# now run with the initial (intercept-only) model; pasting "1" keeps the
# formula valid while fset is empty
fit <- lm(paste("Total_Proteins ~ ", paste(c("1", fset), collapse="+")), data=il)
sum.fit <- summary(fit)
for (loop in 1:10) {
  if ( length(nil) == 0 ) {
    # quit if none more to add
    break
  }
  # var is to be tested
  var <- selectAddVar(nil, fit, sum.fit)
  if ( var == "STOP" ) {
    # quit as told to stop
    break
  }
  # move from the candidate list to the active set
  nil <- nil[!nil==var]
  fset <- append(fset, var)
  print(paste("ADDED: ", var))
  # report
  print(paste("RUNNING: ", paste(fset, collapse="+")))
  # now run with modified list
  fit <- lm(paste("Total_Proteins ~ ", paste(fset, collapse="+")), data=il)
  sum.fit <- summary(fit)
  print(paste("LM Rsquared out: ", sum.fit$r.squared))
}
# report the final fit
summary(fit)
Your task is to rewrite the function selectAddVar and anything else needed to get the forward selection working.
4.5 Analysis and Discussion (2 marks)
Report on the initial and final fits produced by your algorithm, their diagnostics, as well as the final set of
variables selected. How do they compare?
Read in the test data “indian_liver_patient_test.csv” and test the resultant models on the test data as well,
and report the mean square error. Has the forward selection worked? If not, how wrong was it, and why?
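Mean square error on the test data can be computed as in this sketch (the helper name is ours; it assumes the test frame keeps the Total_Proteins column):

```r
# Mean square error of a fitted lm on held-out data.
lm_mse <- function(fit, new_data, target = "Total_Proteins") {
  pred <- predict(fit, newdata = new_data)
  mean((pred - new_data[[target]])^2)
}
# e.g. ilt <- read.csv("indian_liver_patient_test.csv", header=TRUE)
#      lm_mse(fit, ilt)
```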