Actuarial Data and Analysis, S2 2018
Assignment
Due time: Monday 22 October 11.55 am (sharp)
1 Skills developed
This assignment provides you with an opportunity to apply techniques you have learnt in the course lectures
to a business task involving data. In addition, your skills in understanding/applying advanced research works
(including the text and any additional reference material you consider) will be developed via this assignment.
Communication of the results of your investigations and analysis is also an important skill developed.
2 Task
You are an analyst at a data analytics consulting firm. Your manager has tasked you with providing a
report to an American client called the Lending Club (https://www.lendingclub.com/). The Lending Club is
a real US peer-to-peer lending company headquartered in San Francisco, California.
The client is interested in developing a predictive model that helps them identify which of their current loans
are bad or risky loans which are likely to default.
Your main tasks will involve data cleaning, modelling (with associated documentation), as well as a report and
recommendations. In particular, out of 20000 current loans in the LCdata_Eval.csv evaluation dataset you
must identify 5000 loans which you believe are the most likely to default. Your client would also appreciate a
non-technical description of the characteristics of the bad loans that could assist in the development of a risk
mitigation strategy.
The client is familiar with the basics of statistical learning. Note that all your modelling results should be
included (mostly in the appendix).
3 Additional information and mark allocation
3.1 Technical modelling and Results (76 marks)
For the data you have (see section on data for details), develop predictive models using various methods such
as (but not limited to) logistic regression, k-nearest neighbours, logistic regression with lasso and ridge,
classification trees and their extensions, support vector machines and PCA. Provide the results and analysis
associated with each of these methods in the technical appendix; this should include discussion on the choice
of the tuning parameter(s). A very brief summary of each approach should also be included.
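As an illustration of what a discussion of tuning parameters might rest on, the sketch below picks the number of neighbours for a toy k-nearest-neighbours classifier by 5-fold cross-validation. It is written in Python for illustration only (the brief leaves the choice of software open). The synthetic one-dimensional data, the candidate values of k, and the helper names `knn_predict` and `cv_accuracy` are all assumptions made for this example, not part of the assignment data or course code. Plain accuracy is used for simplicity; with unbalanced classes another criterion may be more appropriate.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1-D feature)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def cv_accuracy(data, k, n_folds=5):
    """Mean held-out accuracy of k-NN over n_folds cross-validation folds."""
    # Interleaved folds: fold i holds observations i, i + n_folds, ...
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one currently held out.
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / n_folds


# Synthetic data (assumption): the label is 1 more often when x is large.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.random()
    data.append((x, 1 if rng.random() < x else 0))

# Compare a few odd candidate values of k and keep the best-scoring one.
results = {k: cv_accuracy(data, k) for k in (1, 5, 15, 45)}
best_k = max(results, key=results.get)
print(results, best_k)
```

The same loop structure applies to other tuning parameters (for example a lasso penalty or a tree depth): only the model-fitting step inside the fold loop changes.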
Note that you should also provide in the main report the results and (detailed) analysis using your selected
predictive model, along with justification of why the particular model was chosen.
Note that, as in any commercial situation, there are many alternative valid approaches that can be used, and
so you can choose how to perform the task as long as it is justifiable and justified; what is important is the
rationale for your chosen method and recommendations.
However, you may also wish to engage in extra research beyond these works – please feel free to do so.
Although the marks for each component of the assignment are capped, innovations will be encouraged and
will potentially offset issues if present. Note however that it is possible to attain full marks without significant
innovation.
Finally, you must also provide a csv file following the sample format in Moodle, indicating the 5000 loans (out
of the 20000 current evaluation loans) which you believe are the most likely to default. (See the submission
section for details.)
Important: Note that sufficient detail (e.g. programs, calculations) must be provided (e.g. in the appendix) for
the reviewer to follow what you do. Mark allocation for the assignment can be found in the rubrics attached.
3.2 Presentation Format (24 marks)
Communication of quantitative results in a concise and easy-to-read manner is a skill that is vital in practice.
As such, marks will be given for the presentation of your results. In order to maximize your marks for
presentation you may wish to consider issues such as: table size/readability, figure axis/formatting, ease
of reading, grammar/spelling, and report structure. You may also wish to consider the use of executive
summaries and appendixes, where appropriate. Provide sufficient details to the reader so that they can judge
what you are doing, using appendices for non-essential but useful results for the report as necessary.
Note that sufficient detail must be provided (in either the report body and/or appendices) so that the reviewer
can follow all the steps and derivations required in your work.
Note that a maximum page limit of 6 pages (excluding tables and graphs) is applicable to the main body of
the report (see footnote 1). You should also consider the rubric for the presentation component (on the course webpage).
There is no limit to the size of the appendix.
3.3 Data
This dataset is based on real data from the Lending Club available at Kaggle (https://www.kaggle.com/
wendykan/lending-club-loan-data). The training dataset (LCdata_Train.csv) contains data on 100000 loans.
Per loan, 21 attributes are available giving information about the terms of the loan, the borrower, etc. In the
csv file the first column named id is a unique loan identifier, while the second column named bad_loans is
the target variable of interest indicating whether a loan is bad (risky) or good (safe). A description of all the
variables can be found in the file LCDataDictionary.xlsx available on Moodle. Further description of some of
the variables is also available on the Lending Club website (https://www.lendingclub.com/). There are no
missing or noisy data.
One challenging characteristic of the data is that they are unbalanced: only 18% to 19% of the loans in the
training data set are bad loans. You may wish to have a look at the paper by He and Garcia (2009) available
in Moodle (or other sources) for strategies for dealing with unbalanced data in classification problems.
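One simple strategy from the unbalanced-data literature is random undersampling of the majority class before model fitting. The sketch below, in Python for illustration only, shows the idea on toy records. It assumes bad_loans is coded 1 for bad and 0 for good (the brief does not state the coding), and the helper name `undersample_majority` is hypothetical. It is one option among several, not a recommended approach.

```python
import random


def undersample_majority(rows, label_key="bad_loans", seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    # Assumption: the label is coded 1 (bad) / 0 (good).
    bad = [r for r in rows if r[label_key] == 1]
    good = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted((bad, good), key=len)
    # Keep a random subset of the majority class, matching the minority size.
    kept = rng.sample(majority, len(minority))
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced


# Toy data: 19 bad loans out of 100, mimicking the roughly 19% bad-loan rate.
toy = [{"id": i, "bad_loans": 1 if i < 19 else 0} for i in range(100)]
balanced = undersample_majority(toy)
print(len(balanced), sum(r["bad_loans"] for r in balanced))  # 38 19
```

Undersampling discards information from the good loans, so it is worth comparing against alternatives such as oversampling or cost-sensitive fitting on a validation set.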
The evaluation dataset (LCdata_Eval.csv) comprises 20000 current loans for which the client has asked
you to select 5000 loans which you believe are more likely to default (bad loans). This evaluation dataset
has the same format as the training dataset but doesn’t include the column bad_loans. I know the column
bad_loans for the dataset and I will release it after the due date of the assignment.
The training and evaluation data along with the data dictionary can be downloaded from the Moodle website.
3.4 Accuracy marks
The accuracy of your predictions on this evaluation data will have a (minor) impact on your mark. More
specifically, 10 of the 76 technical marks will be allocated to the accuracy of your predictions. The marks
you will get for the accuracy criterion will be given by:
\[
\text{Marks} =
\begin{cases}
\dfrac{5}{950} \times \text{No. of bad loans identified} & \text{if No. of bad loans identified} < 950, \\[6pt]
5 + \dfrac{5}{C - 950}\,(\text{No. of bad loans identified} - 950) & \text{if No. of bad loans identified} \geq 950,
\end{cases}
\]
where C is the maximum number of bad loans identified by any student in the class.
Footnote 1: Please kindly note that this is a maximum; you should feel free to use fewer pages if that is sufficient!
Note that 950 = 5000×0.19 is the average number of loans one would correctly identify in the evaluation
dataset if one were to pick the bad loans at random. Therefore, if your prediction is below the average
number of correct bad loans by random selection, your mark will scale from (0,0) to (no. of bad loans by
random selection, 5). If your prediction is above the average number of correct bad loans by random
selection, then your mark is scaled from (no. of bad loans by random selection, 5) to (max no. of bad loans
by a student, 10).
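The piecewise-linear marking rule above can be sketched as a small function, written in Python for illustration only. The function name `accuracy_marks` and the class maximum of 2000 used in the example are assumptions; the actual value of C is only known once all submissions are in.

```python
def accuracy_marks(n_identified: int, class_max: int, baseline: int = 950) -> float:
    """Accuracy marks (out of 10) under the piecewise-linear rule.

    n_identified: bad loans correctly identified among the 5000 submitted.
    class_max:    C, the best count achieved by any student in the class.
    baseline:     expected count under random selection, 5000 * 0.19 = 950.
    """
    if n_identified < baseline:
        # Scale linearly from (0, 0) to (baseline, 5).
        return 5 * n_identified / baseline
    if class_max <= baseline:
        raise ValueError("class_max must exceed baseline for the upper branch")
    # Scale linearly from (baseline, 5) to (class_max, 10).
    return 5 + 5 * (n_identified - baseline) / (class_max - baseline)


# Hypothetical class maximum of 2000, used purely for illustration.
for n in (475, 950, 1475, 2000):
    print(n, accuracy_marks(n, class_max=2000))
# 475 2.5
# 950 5.0
# 1475 7.5
# 2000 10.0
```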
3.5 Software
You may choose which software package to use; however, nearly every function you will be required to use for
this task is available in R. Note also that code enabling you to perform most of the modelling can be found
in the learning activities of the course. Note that simplifying assumptions must be clearly identified and
justified (if used).
3.6 Assignment submission procedure
3.6.1 Turnitin submission
Your assignment report must be uploaded as a single document and all parts must be in portrait format.
Until the due date has passed, you can resubmit your work; the previous version of your assignment
will be replaced by the new version.
Assignments must be submitted via the Turnitin submission box that is available on the course Moodle
website. Turnitin reports any similarities between assignments within your cohort, and also with regard to
other sources (such as the internet or all assignments submitted all around the world via Turnitin). More
information is available at: [click]. Please read this page, as we will assume that you are familiar with its
content.
Please also attach any programming code or sample spreadsheet output used in your analysis as
a separate file in the dedicated “code_sample” Moodle assignment box on the course webpage. These will be
referred to by the marker only if needed, and in particular the main assignment (with appendix) should be
self-contained.
In the dedicated “selected_bad_loans” Moodle assignment box, please also attach the csv file containing
the list of 5000 loans which you believe are the most likely to default. Your csv file must
contain 5000 id numbers identifying the loans you believe are more likely to be bad loans. See the file
“Bad_Loans_sample.csv” for a sample of what your file of potential bad loans should look like.
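A minimal sketch of producing such a list is given below, in Python for illustration only. It ranks loans by a predicted default score and writes the top ids as csv text. The single-column layout with an id header is an assumption; the actual layout is defined by the sample file on Moodle and should be checked against it. The helper names and toy scores are hypothetical.

```python
import csv
import io


def select_top_k(scores, k=5000):
    """Return the ids of the k loans with the highest predicted default score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [loan_id for loan_id, _ in ranked[:k]]


def write_ids(ids, handle, header="id"):
    """Write one id per row under a single header column (assumed layout)."""
    writer = csv.writer(handle)
    writer.writerow([header])
    writer.writerows([loan_id] for loan_id in ids)


# Toy scores for 8 loans; keep the riskiest quarter (2 of 8, like 5000 of 20000).
toy_scores = {101: 0.10, 102: 0.85, 103: 0.40, 104: 0.05,
              105: 0.92, 106: 0.33, 107: 0.27, 108: 0.61}
top = select_top_k(toy_scores, k=2)

# An in-memory buffer stands in for the output file in this example.
buffer = io.StringIO()
write_ids(top, buffer)
print(top)  # [105, 102]
```

When writing to a real file, opening it with `newline=""` is the usual way to avoid blank rows on some platforms.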
3.6.2 Late submission
Please note that it is School policy that late submission of assignments will incur a penalty.
The penalty is 25% of the mark the student would otherwise have obtained, for each full (or part) day of lateness
(e.g., 0 day 1 minute = 25% penalty, 2 days 21 hours = 75% penalty). Students who are late must submit
their assignment to the LIC via e-mail. The LIC will then upload documents to the relevant submission
boxes. The date and time of reception of the e-mail determines the submission time for the purposes of
calculating the penalty.
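The penalty arithmetic above can be sketched as follows, in Python for illustration only. The function name `late_penalty_fraction` is hypothetical, and capping the penalty at 100% is an assumption; the policy text does not state what happens beyond four days.

```python
import math
from datetime import timedelta


def late_penalty_fraction(lateness: timedelta, rate_per_day: float = 0.25) -> float:
    """Fraction of the mark lost: 25% for each full or part day of lateness."""
    if lateness <= timedelta(0):
        return 0.0
    # Any started day counts as a whole day.
    days_charged = math.ceil(lateness / timedelta(days=1))
    # Assumption: the penalty cannot exceed the whole mark.
    return min(1.0, rate_per_day * days_charged)


print(late_penalty_fraction(timedelta(minutes=1)))         # 0.25
print(late_penalty_fraction(timedelta(days=2, hours=21)))  # 0.75
```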
You need to check your document once it is submitted (check it on-screen). We will not mark assignments
that cannot be read on screen.
Students are reminded of the risk that technical issues may delay or even prevent their submission (such
as internet connection and/or computer breakdowns). Students should then consider either submitting
their assignment from the university computer rooms or allowing enough time (at least 24 hours is
recommended) between their submission and the due time. The Turnitin module will not let you
submit a late report. No paper copy will be accepted or graded.
3.6.3 Plagiarism awareness
Students are reminded that the work they submit must be their own. While we have no problem with
students working together on the assignment problems, the material students submit for assessment must be
their own.
Students should make sure they understand what plagiarism is—cases of plagiarism have a very high
probability of being discovered. For issues of collective work, having different persons marking the assignment
does not decrease this probability.
References
He, Haibo, and Edwardo A. Garcia. 2009. “Learning from imbalanced data.” IEEE Transactions on
Knowledge and Data Engineering 21 (9): 1263–84. doi:10.1109/TKDE.2008.239.