Bio216 Artificial Intelligence for Life Science 2023-24SEM22 Coursework 2

Instructions

Deadline: 11.59 PM May 17, 2024

Submission File Format: A compressed zip file containing a report in PDF format (3000 words) and the code implementation as a Jupyter notebook file

Bio216 Artificial Intelligence for Life Science 2023-24SEM22

Coursework 2 (Individual Work)

Question: Regression - Heart disease prediction

Introduction

The World Health Organization has estimated that heart disease causes 12 million deaths worldwide every year. Half of the deaths in the United States and other developed countries are due to cardiovascular diseases. Early prognosis of cardiovascular disease can aid decisions on lifestyle changes in high-risk patients and in turn reduce complications. This research intends to pinpoint the most relevant risk factors of heart disease and to predict the overall risk using regression methods.

Source

The dataset (https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression?resource=download) is available on the Kaggle website and comes from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The classification goal is to predict whether the patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients’ information and includes over 4,000 records and 15 attributes.

Variables

Each attribute is a potential risk factor. The risk factors are demographic, behavioral, and medical.

Demographic:

• Sex: male or female (binary: the values 0 and 1 distinguish male and female individuals)

• Age: age of the patient (Continuous: although the recorded ages have been truncated to whole numbers, the concept of age is continuous)

Behavioral:

• Current Smoker: whether or not the patient is a current smoker (Nominal)

• Cigs Per Day: the average number of cigarettes the person smoked in one day (can be considered continuous, as one can smoke any number of cigarettes, even half a cigarette)

Medical (history):

• BP Meds: whether or not the patient was on blood pressure medication (Nominal)

• Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)

• Prevalent Hyp: whether or not the patient was hypertensive (Nominal)

• Diabetes: whether or not the patient had diabetes (Nominal)

Medical (current):

• Tot Chol: total cholesterol level (Continuous)

• Sys BP: systolic blood pressure (Continuous)

• Dia BP: diastolic blood pressure (Continuous)

• BMI: Body Mass Index (Continuous)

• Heart Rate: heart rate (Continuous: in medical research, variables such as heart rate, though in fact discrete, are treated as continuous because of the large number of possible values)

• Glucose: glucose level (Continuous)

Predicted variable (desired target)

• 10-year risk of coronary heart disease (CHD) (binary: “1” means “Yes”, “0” means “No”)

Study Objective and Outcome: You are expected to develop a regression model for the binary CHD outcome that reaches a stated accuracy (for example, above 80%). The model can then be used to assess the impact of each risk factor (variable) on CHD susceptibility.

Marking Criteria

Data Splitting (10%):

Reasonable data splitting into training and testing sets.
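One way to read "reasonable" here is a split that keeps the proportion of CHD-positive records similar in the training and testing sets. The following is a minimal, dependency-free sketch of such a stratified split. The function name `stratified_split`, the 80/20 ratio, and the labels are assumptions made for illustration; the labels are synthetic and are not drawn from the Framingham data. In a notebook, scikit-learn's `train_test_split` with its `stratify` argument serves the same purpose.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with a similar class ratio in both sets."""
    rng = random.Random(seed)
    # Group record indices by class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # shuffle within each class
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic, imbalanced labels for illustration only (not the real dataset).
labels = [1] * 150 + [0] * 850
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Fixing the seed makes the split reproducible, which helps when the notebook is re-run for marking.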

Model Training (20%):

Successfully training the model on the training set.

Utilizing the selected regression algorithm and optimizing its parameters to best fit the training data.

Choose an appropriate regression algorithm based on the nature of the problem and the characteristics of the dataset. Explain the rationale behind the selection, considering factors such as linearity, complexity, and interpretability of the model.
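Because the target is binary, logistic regression is one natural candidate: it is linear in the features and its coefficients are directly interpretable. This is only one possible choice, not one the brief prescribes. The sketch below fits it by plain batch gradient descent so that the mechanics are visible. The data are synthetic and standardised, and the two features, the learning rate, and the epoch count are all illustrative assumptions. In practice a library class such as scikit-learn's `LogisticRegression` would normally be used instead.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for one record: p(x) - y.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Synthetic training data (illustration only): two hypothetical standardised
# features, of which only the first influences the outcome.
rng = random.Random(0)
X_train = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
y_train = [1 if rng.random() < sigmoid(1.5 * x[0] - 0.5) else 0 for x in X_train]

w, b = fit_logistic(X_train, y_train)
print("weights:", w, "bias:", b)
```

On standardised features, a larger absolute weight indicates a stronger association with the predicted log-odds, which is what makes this model convenient for the later feature importance discussion.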

Model Evaluation (20%):

Evaluating the model’s performance on the testing set.

Using appropriate metrics such as mean squared error, R-squared, or other relevant metrics.

Visualizing the model’s performance through appropriate plots or graphs.
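For a binary outcome, accuracy alone can mislead when one class is rare, so it is worth reporting it alongside precision, recall, and a probability-based error. The sketch below computes these from predicted probabilities. The mean squared error between predicted probabilities and the 0/1 outcome is known as the Brier score. The helper names, the 0.5 threshold, and the toy labels are assumptions for illustration only; plotting is left out.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def evaluate(y_true, probs, threshold=0.5):
    """Threshold the probabilities and compute common binary metrics."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    n = len(y_true)
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Mean squared error of the predicted probabilities (Brier score).
        "brier": sum((p - t) ** 2 for t, p in zip(y_true, probs)) / n,
    }

# Toy labels (illustration only): 2 positives out of 10 records.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A "model" that assigns every record a low risk never predicts CHD.
always_no = [0.1] * 10
metrics = evaluate(y_true, always_no)
print(metrics)  # accuracy is 0.8 although recall is 0.0
```

The toy case shows why an accuracy target should be read together with recall: a model that never flags a patient can still reach 80% accuracy when positives are rare.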

Feature Importance Analysis (20%):

Analyzing the importance of each feature in predicting the outcome.

Using techniques like feature importance scores, coefficients, or permutation importance.

Providing clear explanations for the importance of key features.
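Permutation importance, one of the techniques named above, measures how much a model's score drops when the values of a single feature are shuffled, which breaks that feature's link to the outcome. The sketch below is a minimal version using accuracy as the score. The `predict` function is a hypothetical stand-in for a trained model and the data are synthetic, both assumed purely for illustration. scikit-learn offers a library implementation as `sklearn.inspection.permutation_importance`.

```python
import random

def accuracy(predict, X, y):
    return sum(1 for xi, yi in zip(X, y) if predict(xi) == yi) / len(y)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j while leaving all other columns untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Synthetic data (illustration only): the label depends on column 0 alone;
# column 1 is pure noise.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
y = [1 if row[0] > 0 else 0 for row in X]

def predict(row):
    # Hypothetical stand-in for a trained model; it uses only column 0.
    return 1 if row[0] > 0 else 0

imp = permutation_importance(predict, X, y)
print(imp)
```

Here the noise column receives an importance of exactly zero because the stand-in model ignores it, while shuffling the informative column removes most of the model's accuracy. Importance should be computed on held-out data so that it reflects generalisation rather than memorisation.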

Documentation and Reporting (30%):

Documenting the entire process in a well-organized Jupyter notebook.

Presenting the findings and results clearly in the report.

Submitting a compressed zip file containing the Jupyter notebook and a report in PDF format. The report is limited to 3000 words (excluding references, table of contents, appendix, and tables) and should have a logical structure (introduction, main body, conclusion), with subtitles dividing the main parts into sections as needed.

Using headings for sections

Typeset properly, e.g., in Google Docs, Word, or LaTeX (recommended); you could use an online LaTeX environment such as Overleaf and one of its Homework Templates

Academic writing style, no spelling or grammar mistakes.

Include Page numbers

Font (Times New Roman), Font size (10), Line spacing (Single)
