COMP9417 Project: Multitask Machine Learning (MML)
March 4, 2024
Project Description
As a Data Scientist at Predictive Solutions Inc., you have become comfortable working with any type of dataset that comes your way. Your newest client, a medical researcher in an undisclosed branch at the local hospital, is interested in utilizing machine learning to understand data obtained from a recent clinical trial they conducted. In this particular dataset, there are n = 1000 observations and p = 111 features. To ensure privacy of patient data, the features have been anonymized (that is, the features are generically labelled X1, X2, . . .). The features are a mix of binary, categorical and continuous-valued data that contain information about each patient. In this problem, the outcome is multivariate, which means that there are multiple target variables to predict, as opposed to the usual case in which we have a single target variable. Each target is a specific medical condition. This sort of problem is known as Multitask Learning. The data will be released on March 25, 2024.
Description of the Data
The client has provided you with the following data sets in numpy.array format: X_train, X_test, Y_train. You will need to use best practices to come up with a model that generates predictions for X_test, which will be submitted for evaluation and will count towards your final grade. The X variable is comprised of tabular data of dimension 1000 × 111 (one column per feature). The Y variable is comprised of tabular data of dimension 1000 × 11, so that there are 11 binary targets (tasks) that need to be predicted. The loss function used for this problem is the average binary cross entropy loss, i.e. if $Y_{ij}$ denotes the j-th target for the i-th observation, and $\hat{Y}_{ij}$ is the corresponding prediction from your model, then the total loss is:

$$\mathcal{L} = \frac{1}{11n} \sum_{i=1}^{n} \sum_{j=1}^{11} \ell(Y_{ij}, \hat{Y}_{ij}),$$

where

$$\ell(y, \hat{y}) = -\left( y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right)$$

is the usual binary cross entropy loss.
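This loss is straightforward to compute directly in numpy; the sketch below uses a helper name of our own choosing (average_bce), with clipping to guard against log(0):

```python
import numpy as np

def average_bce(Y, Y_hat, eps=1e-12):
    """Mean binary cross entropy over all observations and all targets."""
    Y_hat = np.clip(Y_hat, eps, 1 - eps)  # guard against log(0)
    return float(-np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)))
```

Note that predicting 0.5 for every task gives a loss of log 2 ≈ 0.693, a useful baseline to beat.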
Important Aspects
The following problems should be considered and discussed in detail in your report:
. Data: Perform an extensive exploratory data analysis (EDA). This should include a pre-processing step in which the data is cleaned. You should pay particular attention to the following questions:
1. Which features are most likely to be predictive of each target variable?
2. What, if any, are the relationships between the target variables?
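Question 2 above can be probed quickly by examining the pairwise correlation and per-task prevalence of the 11 targets; the sketch below uses a random stand-in for the real Y_train:

```python
import numpy as np

rng = np.random.default_rng(0)
Y_train = rng.integers(0, 2, size=(1000, 11))  # stand-in for the provided Y_train

# Pairwise Pearson correlation between the 11 binary targets: strong
# off-diagonal entries suggest tasks that may be worth modelling jointly.
target_corr = np.corrcoef(Y_train, rowvar=False)

# Per-task prevalence: heavily imbalanced targets may need class weights.
prevalence = Y_train.mean(axis=0)
```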
. Research: Provide a summary of the multi-task learning literature. Be sure to explain rigorously some of the algorithms that are used. It is a good idea to pick one or two areas to explore further here. The report should be well written and well referenced.
. Modelling: The approach to modelling is open ended and you should think carefully about the types of models you wish to deploy. It is generally a bad idea to build a large number of generic models. Instead, you should think carefully about the models you want to use and how best to build them. Regardless of the models you choose, you need to:
1. Construct a model that performs well in terms of the loss function described earlier.
2. Compare your model to the naive approach to multi-task learning in which you would construct 11 separate models.
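The naive approach in point 2 can be sketched with scikit-learn's MultiOutputClassifier, which simply clones one estimator per task with no information shared between tasks (random stand-ins replace the real arrays here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 111))         # stand-in for the provided X_train
Y_train = rng.integers(0, 2, size=(200, 11))  # stand-in for the provided Y_train

# One independent logistic regression per task.
naive = MultiOutputClassifier(LogisticRegression(max_iter=1000))
naive.fit(X_train, Y_train)

# predict_proba returns one (n, 2) array per task; keep the positive-class column.
probs = np.stack([p[:, 1] for p in naive.predict_proba(X_train)], axis=1)
```

A genuinely multi-task model should beat this baseline on held-out average binary cross entropy to justify its extra complexity.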
. Discussion: provide a detailed discussion of the problem, your approach and your results. Explain whether your final approach was better than the naive one, and why you think that might be the case. Discuss what you could have improved on.
Overview of Guidelines
. The deadline to submit the report is 5pm April 22. The deadline to submit your predictions, your code (and the documentation), and a 2-min presentation is 5pm April 19 for both the Internal Challenge project (MML) and External Challenge (Berrijam project).
. Submission will be via the Moodle page
. You must complete this work in a group of 4-5, and this group must be declared on Moodle under Group Project Member Selection
. The project will contribute 30% of your final grade for the course.
. Recall the guidance regarding plagiarism in the course introduction: this applies to all aspects of this project as well, and if evidence of plagiarism is detected it may result in penalties ranging from loss of marks to suspension.
. Late submissions will incur a penalty of 5% per day from the maximum achievable grade. For example, if you achieve a grade of 80/100 but you submitted 3 days late, then your final grade will be 80 - 3 × 5 = 65. Submissions that are more than 5 days late will receive a mark of zero. The late penalty applies to all group members.
Project Proposal
Each group must submit their project choice and a 1 page proposal by Friday, March 15th, 5 PM. The plan should not exceed 1 page and should include the following:
1. Approach: Briefly describe your approach and the techniques you want to explore.
2. Owners and Collaborators: Nominate the team member who will work on each part of the project.
3. 4-week Plan: A list of weekly milestones leading to the final project deliverable.
Your actual project may deviate from the proposal. Changes from the original plan will NOT impact team scores. The goal of the proposal is to help teams self-organize.
Objectives
In this project, your group will use what you have learned in COMP9417 to construct a predictive model for the specific task described above, as well as write a detailed report outlining your exploration of the data and approach to modelling. The report is expected to be a maximum of 12 pages long (with a single column, 1.5 line spacing), and easy to read. The body of the report should contain the main parts of the presentation, and any supplementary material should be deferred to the appendix. For example, only include a plot if it is important to get your message across. The guidelines for the report are as follows:
1. Title Page: title of the project, name of the group and all group members (names and zIDs).
2. Introduction: a brief summary of the task, the main issues for the task and a short description of how you approached these issues.
3. Exploratory Data Analysis and Literature review: this is a crucial aspect of this project and should be done carefully given the lack of domain information. Some (potential) questions for consideration: are all features relevant? How can we represent the data graphically in a way that is informative? What is the distribution of the targets? What are the relationships between the features? What are the relationships between the targets? How has this sort of task been approached in the literature? etc.
4. Methodology: A detailed explanation and justification of methods developed, method selection, feature selection, hyper-parameter tuning, evaluation metrics, design choices, etc. State which method has been selected for the final test and its hyper-parameters.
5. Results: Include the results achieved by the different models implemented in your work, with a focus on the F1 score. Be sure to explain how each of the models was trained, and how you chose your final model.
6. Discussion: Compare different models, their features and their performance. What insights have you gained?
7. Conclusion: Give a brief summary of the project and your findings, and what could be improved on if you had more time.
8. References: a list of all literature that you have used in your project, if any. You are encouraged to go beyond the scope of the course content for this project.
You must follow this outline, and each section should be standalone. This means for example that you should not display results in your methodology section.
Project implementation
Each group must implement a model and generate predictions for the provided test set. You are free to select the types of models and features, and to tune the methods for best performance as you see fit, but your approach must be outlined in detail in the report. You may also make use of any machine learning algorithm, even if it has not been covered in the course, as long as you provide an explanation of the algorithm in the report and justify why it is appropriate for the task. You can use any open-source libraries for the project, as long as they are cited in your work. You can use all the provided features or a subset of features; however, you are expected to give a justification for your choice. You may run some exploratory analysis or some feature selection techniques to select your features. There is no restriction on how you choose your features as long as you are able to justify it. In your justification of selecting methods, parameters and features you may refer to published results of similar experiments.
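One defensible feature-selection step, for example, is ranking features by mutual information with each target separately; repeating this per task also reveals whether tasks share predictive features, which is itself an argument for a joint model. The sketch below uses random stand-ins for the real arrays:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 111))         # stand-in for the provided X_train
Y_train = rng.integers(0, 2, size=(200, 11))  # stand-in for the provided Y_train

# Mutual information of every feature with the first target; rerun with
# Y_train[:, j] for each task j to compare feature rankings across tasks.
mi = mutual_info_classif(X_train, Y_train[:, 0], random_state=0)
top10 = np.argsort(mi)[::-1][:10]  # indices of the 10 most informative features
```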
Code submission
Code files should be submitted as a separate .zip file along with the report, which must be in .pdf format. Penalties will apply if you do not submit a PDF file (do not put the PDF file in the zip).
Peer review
Individual contribution to the project will be assessed through a peer-review process which will be announced later, after the reports are submitted. This will be used to scale marks based on contribution. Anyone who does not complete the peer review by 5pm Friday of Week 11 (26 April) will be deemed to have not contributed to the assignment. Peer review is a confidential process and group members are not allowed to disclose their review to their peers.
Project help
Consult Python package online documentation for using methods, metrics and scores. There are many other resources on the Internet and in the literature related to classification. When using these resources, please keep in mind the guidance regarding plagiarism in the course introduction. General questions regarding the group project should be posted in the Group project forum on the course Moodle page.