Guidelines for the Project
There is no single optimal solution for this project, but there are some key points and steps we should pay attention to, as well as standards for the misclassification rate. I summarize them as follows.
1. Clean the data:
Although there are about twelve hundred customers in this data set, a small percentage of the samples can be removed because they are not meaningful (e.g., a person who is too young to have a child, or a person with many missing responses). If we include these samples, they will negatively affect our results. In practice, a small percentage of meaningless samples is unavoidable, but we do not have to include them when we build the model.
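To illustrate, here is a minimal cleaning sketch in Python with pandas. The file name and column names ("credit_data.csv", "age", "n_children") are hypothetical placeholders; substitute the actual names from your data set and choose the thresholds according to what you consider meaningless.

```python
import pandas as pd

# Hypothetical file and column names; substitute the real ones.
df = pd.read_csv("credit_data.csv")

# Drop logically impossible records, e.g. someone too young to have a child.
df = df[~((df["age"] < 16) & (df["n_children"] > 0))]

# Drop records with too many missing ("no response") fields.
df = df[df.isna().sum(axis=1) <= 2]
```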
2. The non-numeric independent variables:
Among the fourteen independent variables, job status and residential status are non-numeric. The models only work with numeric values, so we need to assign numeric scores to these independent variables. You can decide the range of the scores, but common sense says that people with a more stable income should receive higher scores, and people who own their apartment should receive a higher score.
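One possible scoring scheme, continuing the pandas sketch above. The category labels and the scores themselves are hypothetical; you are free to choose a different range as long as the ordering reflects stability of income and home ownership.

```python
# Hypothetical category labels; the scores are one choice among many.
job_score = {"unemployed": 0, "part_time": 1, "self_employed": 2,
             "full_time": 3, "government": 4}
residence_score = {"with_parents": 0, "rents": 1, "owns": 2}

df["job_score"] = df["job_status"].map(job_score)
df["residence_score"] = df["residential_status"].map(residence_score)
df = df.drop(columns=["job_status", "residential_status"])
```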
3. Merge and delete some independent variables:
In this data set, we have many independent variables that contain very detailed information. But as some of you have discovered, more independent variables do not necessarily give better classification accuracy. If you merge some independent variables that are similar in nature (e.g., the various outgoings) into a new independent variable, you may get a better result, because the new variable can play a much stronger role in the model. At the same time, you can delete independent variables that are not very relevant or important. It is true that the more information we collect, the better; but when it comes to building the model, more independent variables do not necessarily lead to a better result.
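For example, merging the outgoing-related columns could look like the following sketch; the column names are again hypothetical placeholders for whatever detailed variables your data set contains.

```python
# Hypothetical names for the detailed outgoing columns.
outgoing_cols = ["mortgage_outgoing", "loan_outgoing", "hp_outgoing"]
df["total_outgoings"] = df[outgoing_cols].sum(axis=1)  # merged variable
df = df.drop(columns=outgoing_cols)                    # drop the originals
```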
4. Construct the training and testing data sets:
We need to divide the original data set into two parts: the training data set and the testing data set. We should pay special attention to the ratio of good to bad customers in the training data set. As I explained in the lecture, if the numbers of good and bad customers in the training data set are unbalanced, the model built on it will be biased, so the ideal ratio is 50:50. But in this data set, the number of good customers is much larger than the number of bad customers, and we also want to leave some bad customers in the testing data set. We can therefore allow some imbalance between good and bad customers, up to 60:40. I suggest you choose 200-250 bad customers and 200-300 good customers for the training data set; all of the remaining customers should go into the testing data set.
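A sketch of such a split, assuming a hypothetical 0/1 column "bad" that marks the bad customers, and using 225 bad and 250 good customers as one choice within the suggested ranges:

```python
# Sample each class separately to control the good:bad ratio in training.
bad = df[df["bad"] == 1]
good = df[df["bad"] == 0]

train = pd.concat([bad.sample(n=225, random_state=0),
                   good.sample(n=250, random_state=0)])
test = df.drop(train.index)  # all remaining customers form the testing set
```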
5. The objectives of the project and the standard for the accuracy rate:
The first objective is to minimize the training error rate and the testing error rate. If one of your three methods (we do not require all three) keeps both the training error rate and the testing error rate within 30%, you have done a good job. We use this standard because this is not an easy data set to work with. The data set is from a bank, and we know whether each customer is good or bad only because they are already customers of the bank. In other words, the bank originally classified all of them as good customers and gave them credit cards, but it turned out that around 30% of them are bad customers. Therefore, if one of your three methods can keep both the training error rate and the testing error rate within 30%, you have done a job comparable to the professionals at the bank, and we should not complain. If you get a 20% misclassification rate, you have outperformed the people who work at the bank. If two methods give you similar training and testing error rates, you should pick the one with the smaller Type II error rate; minimizing the Type II error rate is the second objective of the project.
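Both error rates can be computed with a small helper like the one below. It assumes labels coded as 1 = bad and 0 = good, and takes a Type II error to mean classifying a truly bad customer as good (the costly mistake for the bank).

```python
import numpy as np

def error_rates(y_true, y_pred):
    # Labels: 1 = bad, 0 = good. A Type II error here is a truly bad
    # customer classified as good.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    overall = np.mean(y_true != y_pred)        # misclassification rate
    type2 = np.mean(y_pred[y_true == 1] == 0)  # bad customers passed as good
    return overall, type2
```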
6. Linear Regression:
When you run the linear regression on all the independent variables, you will find that the p-values of some variables are very large. In that case, you should remove some variables and rerun the regression on the remaining ones. Our objective here is to get a meaningful linear model with a small misclassification rate. You might also find that for some customers the fitted values (y_i, the value of the dependent variable) fall outside the range [0, 1], which is a problem if you interpret this value as the probability of default. This raises the question of whether logistic regression can do a better job, since its fitted values always lie within [0, 1]. When logistic regression was created, people expected it to outperform linear regression in such cases. In practice, however, logistic regression only does a comparable job to linear regression, which is also what we observe in our case. Therefore, for the linear regression, we do not need to worry too much about fitted values falling outside [0, 1], because our objective is not to estimate the probability of default; it is to find a linear function that separates good and bad customers with a minimal misclassification rate. For this reason, you can choose the optimal cutoff point (above which a customer is classified as bad and below which a customer is classified as good) that minimizes the training error rate and the testing error rate.
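As a sketch, the cutoff search could look like this, assuming X_train and y_train come from the training set built in step 4, with the dependent variable coded 0 = good and 1 = bad:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit on the training set (y coded 0 = good, 1 = bad), then scan
# candidate cutoff points for the one with the lowest training error.
reg = LinearRegression().fit(X_train, y_train)
scores = reg.predict(X_train)

cutoffs = np.linspace(scores.min(), scores.max(), 101)
errors = [np.mean((scores > c).astype(int) != y_train) for c in cutoffs]
best_cut = cutoffs[int(np.argmin(errors))]  # classify as bad above this point
```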
7. Linear Programming:
The decision variables for this model are the weights w_i and the deviations a_i. Therefore, if the number of independent variables is p and the size of the training data set is n, the number of decision variables is n + p and the number of constraints is 2n. The upgraded software I suggested can easily solve a linear programming problem of this size. The predetermined cutoff c can be any value except 0, because when c is 0, w_i = 0 and a_i = 0 form a trivial solution to the model. If you can correctly set up and solve the linear programming model in this project, you will be able to solve any linear programming problem in the real world. This is what I want you to get from this project.
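The model can be set up with scipy, as sketched below. This assumes the common minimize-sum-of-deviations formulation, with the convention that good customers should score at most c and bad customers at least c; check it against the exact model from the lecture. Here X is the n-by-p training matrix and is_bad is a hypothetical boolean array marking the bad customers.

```python
import numpy as np
from scipy.optimize import linprog

n, p = X.shape              # X: training matrix; is_bad: boolean array
c_cut = 1.0                 # any nonzero predetermined c

signs = np.where(is_bad, -1.0, 1.0)      # flips the inequality for bad customers
A_ub = np.hstack([signs[:, None] * X,    # +x_i (good) or -x_i (bad) times w
                  -np.eye(n)])           # each row gets its own -a_i
b_ub = signs * c_cut                     # good: w.x_i <= c + a_i; bad: w.x_i >= c - a_i
obj = np.concatenate([np.zeros(p), np.ones(n)])   # minimize the sum of the a_i
bounds = [(None, None)] * p + [(0, None)] * n     # w free, a_i >= 0

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
w, a = res.x[:p], res.x[p:]
```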
8. Classification tree:
As the tree grows larger and larger, the training error rate will keep decreasing, but beyond some point the testing error rate will start to increase because the tree overfits the training data. Therefore, you need to use the input parameters to find a balance. The key to this method is to find a tree that keeps both the training error rate and the testing error rate within 30% (and of course, the smaller, the better).
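A minimal tuning sketch with scikit-learn, assuming the training and testing sets from step 4; max_depth and min_samples_leaf are just two of the input parameters you can vary:

```python
from sklearn.tree import DecisionTreeClassifier

# Grow trees of increasing depth and watch both error rates move.
for depth in range(2, 11):
    tree = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=20,
                                  random_state=0).fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)
    test_err = 1 - tree.score(X_test, y_test)
    print(depth, round(train_err, 3), round(test_err, 3))
```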