讲解COM6012-Assignment 2辅导留学生Python设计

COM6012 Assignment 2 - Deadline: 4:00 PM, Wed 13 May 2020

Assignment Brief
How and what to submit
A. Create a .zip file containing the following:
1) AS2_report.pdf: A report in PDF containing answers to ALL questions. The report should be concise. You may include appendices/references for additional information but marking will focus on the main body of the report.
2) Code, script, and output files: All files used to generate the answers for individual questions above, except the data. These files should be named properly starting with the question number: e.g., your python code as Q2_xxx.py (one each question), your script for HPC as Q2_HPC.sh, and your output files on HPC such as Q2_output.txt or Q2_figB.jpg. If you develop your answers in Jupyter Notebook, you MUST have these files (code, script, output, images etc specified above in bold) prepared and submitted after you finalise your code and output. The results should be generated from the HPC, not your local machine.
B. Upload your .zip file to MOLE before the deadline above. Name your .zip file as USERNAME_STUDENTID_AS2.zip , where USERNAME is your username such as abc18de , and STUDENTID is your student ID such as 18xxxxxxx.
C. NO DATA UPLOAD: Please do not upload the data files used. We have a copy already. Instead, please use a relative file path in your code (data files under folder ‘Data’), as in the lab notebook so that we can run your code smoothly. D. Code and output. 1) Use PySpark as covered in the lecture and lab sessions to complete the tasks; 2) Submit your PySpark job to HPC with qsub Assessment Criteria (Scope: Session 6-9; Total marks: 20)
1. Being able to use pipelines, cross-validators and a different range of supervised learning methods for large datasets
2. Being able to analyse and put in place a suitable course of action to address a large scale data analytics challenge
Late submissions: We follow the Department's guidelines about late submissions, i.e., a deduction of 5% of the mark each working day the work is late after the deadline, submission will be marked one week after the deadline because we will release a solution by then. Please see this link.
Use of unfair means: " Any form of unfair means is treated as and action may be taken under the Discipline Regulations." (from the MSc Handbook). Please carefully read this link on what constitutes Unfair Means if not sure. Please, only use interactive HPC when you work with small data to test that your algorithms are working fine. If you use rse-com6012 in interactive HPC, the performance for the whole group of students will be better if you only use up to four cores and up to 15G per core. When you want to produce your results for the assignment and/or want to request access to more cores and more memory, PLEASE USE BATCH HPC. This will be mandatory. We will monitor the time your jobs are taking to run and will automatically “qdel” the job if it is taken much want to promote good code practices (e.g. memory usage) so, please, once more, make sure that what you run on HPC has already been tested enough for a smaller dataset. It is OK to attempt to produce results several times in HPC, but please, be mindful that extensive running jobs will affect the access of other users to the pool of resources.
Question 1. Searching for exotic particles in high-energy physics using classic supervised learning algorithms [15 marks]
In this question, you will explore the use of supervised classification algorithms to identify Higgs bosons from particle collisions, like the ones produced in the Large In particular, you will use the HIGGS dataset.
About the data: “The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. There is an interest in using deep learning methods to obviate the need for physicists to manually develop such features. Benchmark results using Bayesian Decision Trees from a standard physics package and 5-layer neural networks are presented in the original paper. The last 500,000 examples are used as a test set.”
You will apply Random Forests and Gradient boosting over a subset of the dataset and over the full dataset. As performance measures use classification accuracy and area under the curve .
1. Use pipelines and cross-validation to find the best configuration of parameters for each model.
a. For finding the best configuration of parameters, use 5% of the data chosen randomly from the whole set (2 marks).
b. Use a sensible grid for the parameters (for example, three options for each parameter) for each predictive model (3 marks).
c. Use the same splits of training and test data when comparing performances among the algorithms (1 mark).
Please, use the batch mode to work on this. Although the dataset is not as large, the batch mode allows queueing jobs and for the cluster to better allocate resources.
2. Working with the larger dataset. Once you have found the best parameter configurations for each algorithm in the smaller subset of the data, use the full dataset to compare the performance of the three algorithms in the cluster. Remember to use the batch mode to work on this.
a. Use the best parameters found for each model in the smaller dataset of the previous step, for the models used in this step (2 marks)
b. Once again, use the same splits of training and test data when comparing performances between the algorithms (1 mark)
c. Provide training times when using 10 CORES and 20 CORES (2 marks). Based on our own solution, with the proper setting, you need a maximum of 10 mins for running the exercise when using 10 cores (with the rse-com6012 queue).
3. Report the three most relevant features according to each method in step 2 (1 mark).
4. Discuss at least three observations (e.g., anything interesting), with one to three sentences for each observation (3 marks).
Do not try to upload the dataset to MOLE when returning your work. It is 2.6Gb. HINTS: 1) An old, but very powerful engineering principle says: divide are unable to analyse your datasets out of the box, you can always start with a smaller one, and build your way from it. 2) This dataset was used in the paper “Searching Particles in High-energy Physics with Deep Learning” by P. Baldi, P. Sadowski, and D. Whiteson, published in Nature Communications 5 (July 2, 2014). You can compare the results that you get against Table 1 of the paper.
Question 2. Senior Data Analyst at Intelligent Insurances Co. [15 marks] You are hired as a Senior Data Analyst at Intelligent Insurances Co. The company wants to develop a predictive model that uses vehicle characteristics to accurately predict insurance claim payments. Such a model will allow the company to assess the potential risk that a vehicle represents.
The company puts you in charge of coming up with a solution for this problem and provides you with a historic dataset of previous insurance claims. The claimed amount can be zero or greater than zero and it is given in US dollars. A more detailed description of the problem and the available historic dataset is here The website contains only need to work with the .csv file in train_set.zip. The uncompressed 1. Preprocessing
a. The dataset has several fields with missing data. Choose a method to deal with missing data (e.g. remove the rows with missing fields or use an imputation method) and justify your choice (1 mark).
b. convert categorical values to a suitable representation (1 mark).
c. the data is highly unbalanced: most of the records contain zero claims. When designing your predictive model, you need to account for this (2 marks).
2. Prediction using linear regression. You can see the problem as a regression problem where the variable to predict is continuous. Be careful about the preprocessing step above. The performance of the regression model will depend on the quality of your training data
a. Use linear regression in PySpark as the predictive model. Partition your data into training and test (percentages according to your choice) and report the mean absolute error and the mean squared error (2 marks).
b. Provide training times when using 10 CORES and 20 CORES. Remember to use the batch mode to work on this (1 mark). Based on our own solution, with the proper setting, you need a maximum of 5 mins for running the exercise when using 10 cores (with the rse-com6012 queue).
3. Prediction using a combination of two models. You can build a prediction model based on two separate models in tandem (one after the other). Once again, be careful about the preprocessing step above. For this step, use the same training and test data that you used in 2.a.
a. The first model will be a binary classifier (of your choice) that will tell whether the claim was zero or different from zero. The performance of the classifier will depend on the quality of your training data (2 marks).
b. For the second model, if the claim was different from zero, train a Gamma regressor (a GLM) to predict the value of the claim. Report the mean absolute error and the mean squared error (2 marks).
c. Provide training times when using 10 CORES and 20 CORES. Remember to use the batch mode to work on this (1 mark). Based on our own solution, with the proper setting, you need a maximum of 15 mins for running the exercise when using 10 cores (with the rse-com6012 queue).
4. Discuss at least three observations (e.g., anything interesting), with one to three sentences for each observation (3 marks).
HINT: An old, but very powerful engineering principle says: divide unable to analyse your datasets out of the box, you can always start with a smaller one, and build your way from it.

to obtain the output.

but NO late

a serious academic offence

more than expected. We

Hadron Collider.

and conquer. If you

for Exotic

several files. You
file is 2.66 Gb.

and conquer. If you are