Introduction to Data Analysis
Final Project [30 points]
The final project is worth 30% of the grade.
The final project uses classification, a machine learning technique, to predict an outcome for a
bank marketing problem. A bank is planning a telemarketing campaign to increase the number of
term deposits it holds. The data records include the output target (whether the customer responded
positively: yes or no) and several candidate features. Your task is to use the data analysis techniques
you have learned in class to predict which customers are likely to respond positively to the campaign.
You will use any two techniques learned in class, compare the models produced by the two techniques,
and make a recommendation. Based on your recommendation, the bank will then use your chosen model
to score unseen data and target customers for the telemarketing campaign.
You will create a PowerPoint deck to report your findings and to recommend which model to choose,
along with its likely impact.
Data set and related information:
The dataset is available in the UCI machine learning repository:
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
Read through the dataset information and the attribute information.
Note that we will only use the more recent version of the dataset. That is, we will only use the
bank-additional.zip folder and its files.
Also note that bank-additional-full.csv contains the full dataset of 41,188 records, and
bank-additional.csv contains a 10% random sample of 4,119 records. It is recommended that you do
most of your work on bank-additional.csv (the 10% random sample) and move to the full dataset only
once your models are working and you are trying to improve accuracy or other aspects.
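As a starting point, the loading step might look like the minimal sketch below. It assumes pandas is available and relies on the fact that the UCI files are semicolon-delimited. The helper name load_bank_data and the in-memory sample rows are made up for illustration; in practice you would pass the path to bank-additional.csv instead.

```python
import io

import pandas as pd


def load_bank_data(source):
    """Read a Bank Marketing CSV (path or file-like object).

    The UCI files use ';' as the field separator, so it is set explicitly.
    """
    return pd.read_csv(source, sep=";")


# Hypothetical stand-in for bank-additional.csv: four made-up rows with a
# small subset of the real column names.
sample_csv = io.StringIO(
    'age;job;duration;y\n'
    '30;"blue-collar";487;"no"\n'
    '39;"services";346;"no"\n'
    '25;"services";227;"no"\n'
    '47;"admin.";58;"yes"\n'
)

df = load_bank_data(sample_csv)

# Share of each class in the target column; worth checking before modeling.
class_share = df["y"].value_counts(normalize=True)
print(df.shape)
print(class_share)
```

With the real file, the call would be load_bank_data("bank-additional.csv"), adjusted to wherever you unzipped the data.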
The following is a checklist of the contents for each slide.
Slide 1 [5 points]
• Name of presenter
• Description of the problem
• How you would apply data analytics to the problem
• What are the likely impacts of applying data analytics
Slide 2 [5 points]
• The methodology you will use in tackling the problem
Slides 3-8 [5 points]
• Create some basic plots and graphs (histograms, boxplots, scatterplots) of the data
• Also compute some statistics of the features that you think are important
• Plot some scatter plots showing the classes in different colors
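The exploration items above could be sketched roughly as follows, assuming pandas and matplotlib are installed. The small DataFrame is made-up data that borrows the column names age, duration, and y from the dataset; the choice of those two features and of the colors is arbitrary.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only; replace with the loaded dataset.
df = pd.DataFrame({
    "age": [25, 38, 47, 52, 33, 61, 29, 44],
    "duration": [120, 640, 95, 810, 230, 720, 60, 150],
    "y": ["no", "yes", "no", "yes", "no", "yes", "no", "no"],
})

# Summary statistics of selected features, split by class.
stats_by_class = df.groupby("y")[["age", "duration"]].agg(["mean", "median", "std"])
print(stats_by_class)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram of one numeric feature.
df["age"].plot.hist(bins=5, ax=axes[0], title="age")

# Boxplot of one feature grouped by the target class.
df.boxplot(column="duration", by="y", ax=axes[1])

# Scatter plot with one color per class.
colors = {"no": "tab:blue", "yes": "tab:orange"}
for label, group in df.groupby("y"):
    axes[2].scatter(group["age"], group["duration"], color=colors[label], label=label)
axes[2].set_xlabel("age")
axes[2].set_ylabel("duration")
axes[2].legend(title="y")

fig.tight_layout()
```

Calling fig.savefig with a file name of your choice writes the figure to disk so it can be pasted into a slide.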
Slides 9-11 [5 points]
• Describe your first choice for model building
• Justify your choice. How is it meaningful or relevant for the business problem at hand?
• Describe your model
• Report on performance metrics of your model
Slides 12-14 [5 points]
• Describe your second choice for model building
• Justify your choice. How is it meaningful or relevant for the business problem at hand?
• Describe your model
• Report on performance metrics of your model
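One possible way to build two models and report comparable metrics is sketched below with scikit-learn. Logistic regression and a decision tree are assumptions chosen only for illustration; substitute whichever two techniques were covered in class. The data is synthetic (make_classification with arbitrary class weights to make the classes imbalanced), not the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the encoded bank features and target.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Hold out a stratified test set so both models are scored on the same rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0,
)

# Two illustrative model choices (assumptions, not prescribed by the handout).
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Because positive responses are the minority class, accuracy alone can look high for a model that rarely predicts "yes", so precision and recall on the positive class are reported alongside it.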
Slide 15 [5 points]
• Make a recommendation on which model should be selected among your two models
• State your conclusion based on this data analytics exercise
• State the possible business outcomes
Some tips you will find useful
1. Converting categorical variables to numeric variables: This Stack Overflow page has some tips on
how to convert categorical variables to numeric variables:
https://stackoverflow.com/questions/32011359/convert-categorical-data-in-pandas-dataframe
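Two common pandas approaches are sketched below as an illustration: one-hot encoding with pd.get_dummies and integer codes via the category dtype. The rows are made up, with the column names marital, contact, and y borrowed from the dataset; which encoding suits a given model is a judgment call.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "age": [30, 41, 55],
    "marital": ["married", "single", "married"],
    "contact": ["cellular", "telephone", "cellular"],
    "y": ["no", "yes", "no"],
})

# Binary target: map the yes/no strings to 1/0.
target = df["y"].map({"no": 0, "yes": 1})

# Option 1: one-hot encode the categorical feature columns.
features = pd.get_dummies(
    df.drop(columns="y"), columns=["marital", "contact"], dtype=int
)

# Option 2: integer codes per category (categories are sorted alphabetically).
marital_codes = df["marital"].astype("category").cat.codes

print(features)
print(marital_codes.tolist())
```

Integer codes impose an ordering on the categories, which tree-based models tolerate better than linear models; one-hot columns avoid that implied ordering at the cost of more features.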
2. Assessing model performance (in addition to the metrics that we have seen in class): ROC curves
and AUC
This scikit-learn help page contains some hints on creating ROC curves and computing the area under
the curve:
http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#
You can learn more about the ROC curve and Area under the curve here:
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
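A minimal sketch of computing an ROC curve and AUC with scikit-learn follows. The labels and scores are toy values; with a fitted classifier, the scores would typically be the predicted probability of the positive class, for example model.predict_proba(X_test)[:, 1].

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy ground-truth labels and predicted scores (illustrative values only).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve, computed directly from labels and scores...
auc_direct = roc_auc_score(y_true, y_score)
# ...and equivalently from the curve points via the trapezoidal rule.
auc_from_curve = auc(fpr, tpr)

print(auc_direct)  # 0.75
```

Plotting tpr against fpr (for instance with matplotlib's plt.plot(fpr, tpr)) gives the ROC curve itself; drawing both models on the same axes makes the comparison easy to show on a slide.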
3. Plotting
You might find this page helpful in getting started with plotting:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html
You might also find these pandas pages helpful:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.boxplot.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.scatter.html
Final Term Project Rubric
Slides Exemplary
