
ECOM151: Big Data Applications in Finance

Individual Assignment

Vimal Balasubramaniam

25 February 2020

Details

Grading: 20% of your final grade.

Deadline: 31 March, 2020. Time: 22:00hrs.

Submission mode: QMplus

Submission files: 1) An assignment results.R file, and 2) A “Prediction Report” of at most 5 pages with the interpretation of your results.

Please refrain from copying code from each other. Your code and output will give us sufficient information to detect such practices, which will be penalized.

What am I grading you on?

I am looking to test your ability to use the models learned during the first half of the course, both to apply them to real data and to synthesize insights into a report.

This assignment requires you to apply all the skills learned during your first half of the term,

and the models learned during the lectures. Your ability to estimate the model is only half of the

challenge. Your interpretation of the findings matters equally.


Can I ask assignment-related questions during office hours?

The questions that your TA and I will answer are only to clarify the meaning of a question. We

will not provide any clue on whether what you are doing is “right”, or “wrong”, or suggest ways to

improve your code. However, if you have doubts about estimating a model, we ask you to phrase your questions in terms of the tutorial datasets and models, so that we are able to provide guidance.

The dataset provided to you should not have many challenges to resolve before you estimate a

model. However, if you face any challenges while using the data on Rstudio.cloud, do reach out to

us. I strongly recommend that you work on your assignment on your own computer, and not on the cloud,

so that you have complete control over the file. Any excuses that the file was not “saved”, or went

“missing” on the cloud will not be entertained.

Submission files

You are expected to create an R file as part of your submission. The R file should be self-contained,

i.e., I should be able to run the code on my computer loading the same datafile without

any error.

In addition to all your code, you are expected to store your answers in objects named as

specifically instructed in each question. The name of each object is provided in blue next to each question. All “R objects” or columns in objects are also referred to in blue in the text below.

In addition to the R code, you are expected to create a PDF submission of no more than 5 pages,

that presents your evaluation of the models.

The Assignment

LendingClub is an American peer-to-peer lending company, currently the world’s largest platform that allows individuals to both invest and borrow. Borrowers can obtain unsecured personal loans from the platform, and this assignment is set up to assess your ability to predict defaulters using the predictors provided in the data.

The data is a random sample of loans issued on the platform between 2007 and 2015, including


the loan status, and payment information. The data also contains a number of predictors that have

been documented in the variables description file provided to you named “ECOM151-AssignmentVariableDescription.xlsx”.

For tractability, your assignment focuses only on a small set of variables

available for prediction.

You have been provided with an RData file named “ECOM151-Assignment-Data.rda” that contains three objects:

trainData: This is the dataset on which you will train all your models.

testData: This is the dataset on which you will evaluate your model’s fit.

varDescription: This is a copy of the variable description available in the Excel spreadsheet provided to you.

Question A (10 points)

This question expects you to estimate five different classes of models to identify the best model to

predict default on the LendingClub platform.

Set up the data

Load the RData file provided to you into your working environment.
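
As an illustration only, loading an .rda file could be sketched as follows. Because the assignment file is not available here, the sketch first saves three stand-in objects to a temporary file; the file name demo-data.rda and the column loan_amnt are assumptions made for the example. With the real file, a single load() call on its path would be enough.

```r
# Stand-in objects so the sketch runs anywhere; the real file is said to
# contain trainData, testData and varDescription already.
trainData <- data.frame(loan_amnt = c(5000, 12000))   # illustrative column
testData  <- data.frame(loan_amnt = c(8000))
varDescription <- data.frame(variable = "loan_amnt",
                             description = "Amount of the loan",
                             stringsAsFactors = FALSE)

rda_path <- file.path(tempdir(), "demo-data.rda")     # hypothetical file name
save(trainData, testData, varDescription, file = rda_path)
rm(trainData, testData, varDescription)

# load() restores the saved objects into the workspace and invisibly
# returns their names, which is a quick check of what arrived.
loaded <- load(rda_path)
print(loaded)
str(trainData)
```

Capturing the return value of load() is optional, but it makes it easy to confirm that all three expected objects are present before going further.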

Questions:

1. Create a new variable in trainData called “y” that takes the value 1 if loan_status is “Charged Off” and 0 otherwise.

2. All variables provided to you other than loan_status are referred to as “predictors”.

3. Find the 10 variables most positively correlated with y and store them as a4.

4. Find the 10 variables most negatively correlated with y and store them as a5.
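
The mechanics of the steps above can be sketched on simulated data. This is a minimal sketch, not a solution: the data frame toy and its columns int_rate, annual_inc and dti are assumptions made for illustration, and only numeric columns are correlated with y.

```r
set.seed(1)
n <- 200
# Simulated stand-in for the loan data (all values are made up)
toy <- data.frame(
  loan_status = sample(c("Fully Paid", "Charged Off"), n, replace = TRUE,
                       prob = c(0.8, 0.2)),
  int_rate    = runif(n, 5, 25),
  annual_inc  = rlnorm(n, meanlog = 11, sdlog = 0.5),
  dti         = runif(n, 0, 35),
  stringsAsFactors = FALSE
)

# 1 for charged-off loans, 0 for everything else
toy$y <- as.integer(toy$loan_status == "Charged Off")

# Correlate every numeric predictor with y
is_num <- vapply(toy, is.numeric, logical(1))
num_predictors <- setdiff(names(toy)[is_num], "y")
cors <- vapply(num_predictors, function(v) {
  cor(toy[[v]], toy$y, use = "pairwise.complete.obs")
}, numeric(1))

# Keep at most k variables on each side, strongest first
k <- 10
top_pos <- head(sort(cors[which(cors > 0)], decreasing = TRUE), k)
top_neg <- head(sort(cors[which(cors < 0)]), k)
print(top_pos)
print(top_neg)
```

With only three numeric predictors the two lists are short; on a wider data set the same head() calls would cap each at ten entries.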

Now we are ready to run the five models of interest. Spend time visualizing the data, and inspect it for potential reasons why a model may fail to estimate.


Pay particular attention to whether you would like to transform your variables (for example, a logarithmic transformation). This may also help with interpreting coefficients in your “Prediction Report”.

You may also want to consider converting some of the categorical variables (for example, emp_length) into a continuous variable.
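
Both transformations can be sketched in a few lines. The example assumes that employment length is stored as text such as "< 1 year" or "10+ years" and that income has a column like annual_inc; both are assumptions about the data, so check the actual values before reusing the pattern.

```r
# Assumed text coding of employment length (verify against the real data)
emp_length <- c("< 1 year", "1 year", "3 years", "10+ years", "n/a")

# Keep only the digits; entries without digits (such as "n/a") become NA
emp_years <- suppressWarnings(as.numeric(gsub("[^0-9]", "", emp_length)))
# Treat "< 1 year" as 0 rather than 1
emp_years[grepl("^<", emp_length)] <- 0
print(emp_years)

# Log transformation of a skewed, non-negative variable.
# log1p(x) = log(1 + x) stays defined when x is 0.
annual_inc <- c(0, 35000, 120000)   # made-up incomes
log_inc <- log1p(annual_inc)
print(log_inc)
```

Coding "10+ years" as 10 censors the top of the range; whether that is acceptable is a modelling choice worth mentioning in the report.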

6. LINEAR REGRESSION MODEL: Fit a linear regression model to the trainData, with y as the

outcome variable, with the predictors.

(a) What is the Mean squared error for the training data? Store this value in object named

m1.1.

(b) What is the Mean squared error for the testing data? Store this value in object named

m1.2.
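
A minimal sketch of the training and test mean squared error for a linear regression, using simulated data. The helper names make_data() and mse() are introduced here for illustration. Because y is binary, the fitted model is a linear probability model and its predictions can fall outside the range 0 to 1.

```r
set.seed(2)
# Simulate a binary outcome driven by two predictors (made-up process)
make_data <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  p  <- plogis(-1.5 + 0.8 * x1 - 0.5 * x2)
  data.frame(y = rbinom(n, 1, p), x1 = x1, x2 = x2)
}
train <- make_data(500)
test  <- make_data(300)

# Fit on the training data only
fit <- lm(y ~ ., data = train)

# Mean squared error: average squared gap between outcome and prediction
mse <- function(actual, predicted) mean((actual - predicted)^2)

mse_train <- mse(train$y, fitted(fit))
mse_test  <- mse(test$y, predict(fit, newdata = test))
print(c(train = mse_train, test = mse_test))
```

The test figure uses predictions for observations the model never saw, which is why it is the one to compare across model classes.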

7. BEST SUBSET MODEL: Fit a “Best subset selection” model to the trainData, with y as the

outcome variable, with the predictors.

Explore all approaches: “forward”, “backward”, and “unconstrained” best subset. Note that the model may take some time to execute in R, especially with all of the predictors.

(a) What is the Mean squared error for the “best” model of this class for the training data?

Store this value in object named m2.1.

(b) What is the Mean squared error for the “best” model of this class for the test data? Store

this value in object named m2.2.

(c) What are the variables in the “best” model of this class? Store this in object named

m2.3.
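
To make the idea of an exhaustive ("unconstrained") search concrete, here is a hand-rolled sketch in base R on simulated data with five predictors. It is not the tutorial's implementation: in practice a dedicated routine such as regsubsets() from the leaps package, with its method argument set to "exhaustive", "forward" or "backward", is the usual tool. Selecting the winner by BIC on the training data is also an assumption of this sketch.

```r
set.seed(3)
n <- 300
X <- as.data.frame(matrix(rnorm(n * 5), n, 5))
names(X) <- paste0("x", 1:5)
# Made-up outcome that depends on x1 and x3 only
X$y <- 0.3 + 0.4 * X$x1 - 0.3 * X$x3 + rnorm(n, sd = 0.5)
train <- X[1:200, ]
test  <- X[201:300, ]

predictors <- setdiff(names(train), "y")
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Enumerate every non-empty subset of the predictors (2^5 - 1 = 31)
subsets <- unlist(lapply(seq_along(predictors), function(k) {
  combn(predictors, k, simplify = FALSE)
}), recursive = FALSE)

# Fit one linear model per subset and record its fit statistics
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "y"), data = train)
  data.frame(
    vars      = paste(vars, collapse = " + "),
    size      = length(vars),
    bic       = BIC(fit),
    mse_train = mse(train$y, fitted(fit)),
    mse_test  = mse(test$y, predict(fit, newdata = test)),
    stringsAsFactors = FALSE
  )
}))

# Choose the subset with the lowest BIC, then read off its errors
best <- results[which.min(results$bic), ]
print(best)
```

Exhaustive enumeration doubles in cost with every added predictor, which is why forward and backward searches are worth exploring on wider data.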

8. RIDGE REGRESSION MODEL: Fit a ridge regression model to the trainData, with y as the

outcome variable, with the predictors.


Explore all values of lambda (10^10 to 10^-2), setting alpha = 0, as in the tutorials. Hint: glmnet() cannot handle “factor” or categorical variables. You will have to convert them into dummies to be used.

(a) What is the Mean squared error for the “best” model of this class for the training data?

Store this value in object named m3.1.

(b) What is the Mean squared error for the “best” model of this class for the test data? Store

this value in object named m3.2.

(c) What are the 10 most important variables in the “best” model of this class? Store this in

object named m3.3.
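
The dummy conversion and the lambda grid can be sketched as follows on simulated data. The data frame dat and its factor column grade are assumptions made for illustration. The glmnet calls run only if the package is installed, and choosing lambda by cross-validation with cv.glmnet() is one common option rather than the required method.

```r
set.seed(4)
n <- 300
dat <- data.frame(
  x1    = rnorm(n),
  grade = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
dat$y <- rbinom(n, 1, plogis(-1 + 0.7 * dat$x1 + 0.6 * (dat$grade == "C")))
train_idx <- 1:200

# model.matrix() expands factors into dummy columns. Building it on the
# combined data keeps the training and test columns aligned. The first
# column (the intercept) is dropped because glmnet adds its own.
X <- model.matrix(y ~ ., data = dat)[, -1]
x_train <- X[train_idx, ]
y_train <- dat$y[train_idx]
x_test  <- X[-train_idx, ]
y_test  <- dat$y[-train_idx]

# Grid of 100 lambda values from 10^10 down to 10^-2
grid <- 10^seq(10, -2, length.out = 100)
mse  <- function(actual, predicted) mean((actual - predicted)^2)

if (requireNamespace("glmnet", quietly = TRUE)) {
  library(glmnet)
  # alpha = 0 selects the ridge penalty
  cv_fit <- cv.glmnet(x_train, y_train, alpha = 0, lambda = grid)
  best_lambda <- cv_fit$lambda.min
  ridge_train_mse <- mse(y_train, predict(cv_fit, newx = x_train, s = best_lambda))
  ridge_test_mse  <- mse(y_test,  predict(cv_fit, newx = x_test,  s = best_lambda))
  print(c(lambda = best_lambda, train = ridge_train_mse, test = ridge_test_mse))
}
```

Here the factor grade becomes the two columns gradeB and gradeC, with level A absorbed as the baseline.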

9. LASSO: Fit a LASSO to the trainData, with y as the outcome variable, with the predictors.

Explore all values of lambda, setting the alpha parameter to 1 (the lasso penalty) in glmnet().

Hint: glmnet() cannot handle “factor” or categorical variables. You will have to convert them

into dummies to be used.

(a) What is the Mean squared error for the “best” model of this class for the training data?

Store this value in object named m4.1.

(b) What is the Mean squared error for the “best” model of this class for the test data? Store

this value in object named m4.2.

(c) What are the 10 most important variables in the “best” model of this class? Store this in

object named m4.3.
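
One possible reading of "most important" for a penalized regression is the size of the non-zero coefficients once the predictors share a common scale. The sketch below follows that reading on simulated data; the helper rank_coefs() is hypothetical, the ranking rule is an assumption rather than the course's definition, and the glmnet calls run only if the package is installed.

```r
set.seed(5)
n <- 400
p <- 15
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
# Made-up outcome that depends on x1 and x2 only
y <- rbinom(n, 1, plogis(-1 + 0.9 * X[, "x1"] - 0.7 * X[, "x2"]))

# Put predictors on a common scale so coefficient sizes are comparable
X_std <- scale(X)

# Hypothetical helper: drop zero coefficients, then keep the k largest
# in absolute value, strongest first
rank_coefs <- function(beta, k = 10) {
  beta <- beta[beta != 0]
  head(beta[order(abs(beta), decreasing = TRUE)], k)
}

if (requireNamespace("glmnet", quietly = TRUE)) {
  library(glmnet)
  # alpha = 1 selects the lasso penalty
  cv_fit <- cv.glmnet(X_std, y, alpha = 1)
  cf   <- as.matrix(coef(cv_fit, s = "lambda.min"))
  beta <- cf[-1, 1]              # drop the intercept, keep variable names
  top_vars <- rank_coefs(beta, k = 10)
  print(top_vars)
}
```

Because the lasso sets some coefficients exactly to zero, the list can contain fewer than ten variables.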

10. RANDOMFOREST: Fit a randomForest to the trainData, with y as the outcome variable, with

the predictors.

Explore and fit the best model of this class. Hint: randomForest() cannot handle categorical variables with more than 53 levels. You may have to convert such variables into dummies before using them.


(a) What is the Mean squared error for the “best” model of this class for the training data?

Store this value in object named m5.1.

(b) What is the Mean squared error for the “best” model of this class for the test data? Store

this value in object named m5.2.

(c) How important are the variables in predicting default? Store this value in object named

m5.3.
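
A minimal sketch of a random forest on simulated data, including the variable importance table. The settings (ntree = 200, default mtry) are assumptions for illustration and not tuned values, and the forest is fitted only if the randomForest package is installed.

```r
set.seed(6)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# Made-up outcome that depends on x1 and x2 only
dat$y <- rbinom(n, 1, plogis(-1 + 1.2 * dat$x1 - 0.8 * dat$x2))
train <- dat[1:200, ]
test  <- dat[201:300, ]
mse <- function(actual, predicted) mean((actual - predicted)^2)

if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  # y is numeric, so this is a regression forest. randomForest() warns
  # when the response has five or fewer unique values; the warning is
  # expected for a 0/1 outcome and is silenced here.
  rf_fit <- suppressWarnings(
    randomForest(y ~ ., data = train, ntree = 200, importance = TRUE)
  )
  # Predictions on the training rows are in-sample, not out-of-bag
  rf_train_mse <- mse(train$y, predict(rf_fit, newdata = train))
  rf_test_mse  <- mse(test$y,  predict(rf_fit, newdata = test))
  # For regression forests: columns %IncMSE and IncNodePurity
  rf_importance <- importance(rf_fit)
  print(rf_importance)
  print(c(train = rf_train_mse, test = rf_test_mse))
}
```

Exploring this model class usually means varying mtry and ntree and comparing the resulting test errors.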

11. Compare and contrast the predictive power of all approaches and identify the best model to

predict default in the LendingClub data. Store this model in object named finalModel.
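
The comparison step can be sketched with a small table of training and test errors. The three candidate models below are simple stand-ins fitted to simulated data, not the five model classes of the assignment, and picking the lowest test MSE is one reasonable selection rule among several.

```r
set.seed(7)
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-1 + dat$x1))   # made-up outcome
train <- dat[1:300, ]
test  <- dat[301:400, ]
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Stand-in candidates; in practice these would be the fitted models
# from each model class
candidates <- list(
  intercept_only = lm(y ~ 1,  data = train),
  linear_x1      = lm(y ~ x1, data = train),
  linear_all     = lm(y ~ .,  data = train)
)

# One row per model with its training and test MSE
comparison <- data.frame(
  model     = names(candidates),
  mse_train = vapply(candidates, function(m) mse(train$y, fitted(m)), numeric(1)),
  mse_test  = vapply(candidates, function(m) mse(test$y, predict(m, newdata = test)), numeric(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Sort by out-of-sample error and keep the winner
comparison <- comparison[order(comparison$mse_test), ]
print(comparison)
best_name  <- comparison$model[1]
finalModel <- candidates[[best_name]]
```

A table like this also feeds directly into a bar chart of test errors for the report.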

Question B (10 points)

You are required to synthesize all the work in Question A to submit a “Prediction Report” to

your manager on your ability to predict default for borrowers on the LendingClub platform. Utilize

all the information you have generated to write a report no longer than 5 pages and present your

best model to your manager. Take care to explain why it is the best model in terms of its out-of-sample predictive power, and visualize the model’s predictive power compared to the other models on hand.
