BENG0091 Stochastic Calculus & Uncertainty Analysis
Coursework 2
Please read the guidelines before starting the work.
Guidelines
- You need to provide all MATLAB, Python or equivalent code that you have developed as part 
of your submission to Turnitin. This is compulsory. Include clarifications/comments in your 
code whenever you feel appropriate. 
- You need to submit one version of the code that is executable. Unless the code runs locally 
and reproduces the results in your report, it will not receive full marks.
o One option is to have the code in the submitted document in a state where we can 
copy it off your submission and execute. A few tips to assist you in the process: 
Please note that line numbers left in the code often create problems with 
executability. Python code embedded in LaTeX can also create problems with 
executability. Please ensure that prior to submission, you can copy the code back 
from the document you plan to submit and execute it, just to double check. 
o If you do not want to worry about the Turnitin version being executable or not, you 
can additionally choose to use Datalore as suggested. A detailed video on how to 
use it is available in Moodle. Please note that submitting via Datalore is optional. 
- Your submission (excluding the space taken up by your code) should be no more than 15 
pages and contain no more than 15 figures. Clarity is expected in the text, in your figures, 
and in your code. A single figure/image cannot consist of 10 illegible plots; please use 
your judgement when preparing your report.
- Please make sure that you answer each section or question in its respective 
slot; e.g., a correct answer to section (a) provided as a response to section (b) will not be 
considered for marking.
- You need to develop your own code. You are not allowed to use pre-existing toolboxes to 
conduct stochastic simulations, for example. However, the use of standard Python packages 
such as pandas or NumPy is acceptable. Regarding random number generators (r.n.g.), you 
are only allowed to use the uniform r.n.g. available in the programming language you 
chose (MATLAB, Python etc.). Uniqueness of your scripts will be assessed and will contribute 
to your mark. 
- To achieve full marks in each question, your methodology needs to be correctly 
implemented and your code needs to be original (i.e. your own work). 
- You will be allowed to submit your work multiple times until the deadline. The Turnitin 
submission will be made available weeks before the deadline. Please note that it is your 
responsibility to ensure that the submission is made on time. Late submissions, SORAs and 
ECs will be handled by the Admin Team, not your tutors.
Department of Biochemical Engineering
Coursework 2 Brief:
Tuberculosis (TB) is one of the leading infectious disease killers in the world. According to the World
Health Organisation (WHO), a total of 1.5 million people died from TB in 2020. Worldwide, TB is the 
13th leading cause of death, and the second leading infectious killer after COVID-19. TB is present in 
all countries and age groups, but it is curable and preventable. Globally, close to one in two TB-affected 
households face costs higher than 20% of their household income. The world did not reach the 
milestone of 0% of TB patients and their households facing catastrophic costs as a result 
of TB disease by 2020 (https://www.who.int/news-room/fact-sheets/detail/tuberculosis). This 
indicates a clear relationship between the socioeconomic parameters and disease treatment and 
management. It has been shown that patients who have been diagnosed with and treated for TB are 
susceptible to future pulmonary complications, including other lung diseases and accelerated lung 
ageing (https://doi.org/10.1016/j.ijid.2020.02.032). These relationships demonstrate the complex 
landscape of treating the disease and its long-term post-treatment effects and the socioeconomic 
factors that play a role in having access to good treatment. Elucidation of the nature of these 
relationships can assist and advise worldwide disease treatment and prevention programmes and help 
save millions of lives. As a data scientist, you have been asked to look into this further. You are 
provided with a dataset containing demographic information collected from some key locations around 
the world and are asked to find which (if any) of these factors would be a predictor for the 
prevalence of residual susceptibility for future complications.
The dataset you have been provided with (TB_demographics.csv) has 3047 data rows, each row 
corresponding to data collected from one specific region in the world. Data were collected for 19 
different factors: column 1 gives the incidence of residual impacts due to TB per capita (per 
100,000 people), which you are asked to predict (𝑌) with your model (Column ID: 
TARGET_residualsRatePerCapita). The remaining 18 columns represent different types of 
demographic information collected from each of these regions, which are the input factors (𝑍𝑖) 
to your model. The details of these input factors, as named in the dataset, are given 
in Table 1:
Table 1: Dataset input factors summary
| Model ID | Column ID in dataset | Input factor |
|---|---|---|
| 𝑍1 | incidenceRatePerCapita | Mean per capita (100,000) TB diagnoses |
| 𝑍2 | popEst2015 | Population of region |
| 𝑍3 | MarriedPerCapita | Residents who are married (per capita) |
| 𝑍4 | NoHS18_24PerCapita | Residents aged 18-24 highest education attained: less than high school (per capita) |
| 𝑍5 | HS18_24PerCapita | Residents aged 18-24 highest education attained: high school diploma (per capita) |
| 𝑍6 | BachDeg18_24PerCapita | Residents aged 18-24 highest education attained: bachelor's degree (per capita) |
| 𝑍7 | HS25_OverPerCapita | Residents aged 25 and over highest education attained: high school diploma (per capita) |
| 𝑍8 | BachDeg25_OverPerCapita | Residents aged 25 and over highest education attained: bachelor's degree (per capita) |
| 𝑍9 | Employed16_OverPerCapita | Residents aged 16 and over employed (per capita) |
| 𝑍10 | Unemployed16_OverPerCapita | Residents ages 16 and over unemployed (per capita) |
| 𝑍11 | PrivateCoveragePerCapita | Residents with private health coverage (per capita) |
| 𝑍12 | EmpPrivCoveragePerCapita | Residents with employer-provided private health coverage (per capita) |
| 𝑍13 | PublicCoveragePerCapita | Residents with government-provided health coverage (per capita) |
| 𝑍14 | PublicCoverageAlonePerCapita | Residents with government-provided health coverage alone (per capita) |
| 𝑍15 | MarriedHouseholdsPerCapita | Married households (per capita) |
| 𝑍16 | avgResidualsPerCapitaPerYear | Average number of people suffering from residual effects from any disease |
| 𝑍17 | povertyPerCapita | Poverty score of regions given per capita |
| 𝑍18 | AvgHouseholdSizePerCapita | Average household size per capita |
You will build a multiple linear regression model, taking into account all the data you have been 
provided, using the following equation:
$$Y = b_0 + \sum_{i=1}^{18} b_i \cdot Z_i$$
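The brief does not say how the coefficients are to be estimated. One common choice is ordinary least squares, and the sketch below assumes it. It uses NumPy only and a small synthetic dataset with 3 factors instead of 18. The function names `fit_linear_model` and `predict` are illustrative and do not come from the brief.

```python
import numpy as np

def fit_linear_model(Z, y):
    """Ordinary least squares for Y = b0 + sum_i b_i * Z_i.

    Z: (n_rows, n_factors) array of input factors, y: (n_rows,) target.
    Returns b of length n_factors + 1, with b[0] the intercept b0.
    """
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the intercept is estimated with the slopes.
    design = np.column_stack([np.ones(len(Z)), Z])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    return b

def predict(b, Z):
    """Evaluate the model for one row (1-D) or many rows (2-D) of Z."""
    return b[0] + np.asarray(Z, dtype=float) @ b[1:]

# Synthetic, noise-free check (assumed data, not the coursework dataset).
rng = np.random.default_rng(0)
Z_demo = rng.random((200, 3))
b_true = np.array([2.0, 0.5, -1.0, 3.0])
y_demo = predict(b_true, Z_demo)
b_hat = fit_linear_model(Z_demo, y_demo)
```

Because the synthetic target is noise-free, `b_hat` should reproduce `b_true` up to rounding, which is a quick way to check the fitting code before it is applied to real data.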
All your input factors (𝑍𝑖) will be associated with a certain degree of uncertainty arising from the 
nature of the way the data was collected. These uncertainties are represented by random errors. 
Furthermore, you have been given the following information in Table 2 concerning the systematic
uncertainty associated with some of the regression coefficients. Unless listed in the table below, you 
can assume all other regression coefficients not to have any uncertainty associated with them. You 
know that there is no correlation between the parameter uncertainties for the following regression 
coefficients and the input factor uncertainties. 
Table 2: Summary of standard systematic errors for a subset of the regression coefficients
| Variable | Distribution of systematic errors | Systematic uncertainty (br), % value |
|---|---|---|
| b0 | Uniform | 0.5 |
| b2 | Normal | 2 |
| b7 | Normal | 1 |
| b9 | Triangular | 0.25 |
| b10 | Normal | 6 |
| b16 | Uniform | 4 |
| b17 | Uniform | 1.5 |
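Since only a uniform r.n.g. is permitted, the normal and triangular errors in Table 2 have to be built from uniform draws. The sketch below shows one way to do that, under two assumptions that the brief does not state: each "% value" is read as a relative standard uncertainty of the coefficient, and the triangular distribution is symmetric about zero. All function and variable names are illustrative.

```python
import math
import numpy as np

def sample_uniform_error(u, n, rng):
    """Zero-mean uniform error with standard deviation u (half-width sqrt(3)*u)."""
    a = math.sqrt(3.0) * u
    return a * (2.0 * rng.random(n) - 1.0)

def sample_normal_error(u, n, rng):
    """Zero-mean normal error with standard deviation u via the Box-Muller transform."""
    u1 = 1.0 - rng.random(n)  # lies in (0, 1], so log() stays finite
    u2 = rng.random(n)
    return u * np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * math.pi * u2)

def sample_triangular_error(u, n, rng):
    """Zero-mean symmetric triangular error with standard deviation u.

    The sum of two independent U(-a/2, a/2) draws is triangular on (-a, a)
    with variance a**2 / 6, hence a = sqrt(6) * u.
    """
    a = math.sqrt(6.0) * u
    return 0.5 * a * ((2.0 * rng.random(n) - 1.0) + (2.0 * rng.random(n) - 1.0))

SAMPLERS = {
    "uniform": sample_uniform_error,
    "normal": sample_normal_error,
    "triangular": sample_triangular_error,
}

# Table 2 as: coefficient index -> (distribution, percent value).
TABLE_2 = {
    0: ("uniform", 0.5), 2: ("normal", 2.0), 7: ("normal", 1.0),
    9: ("triangular", 0.25), 10: ("normal", 6.0),
    16: ("uniform", 4.0), 17: ("uniform", 1.5),
}

def perturb_coefficients(b, n, rng):
    """Return an (n, len(b)) array of coefficient draws; unlisted b_i stay fixed."""
    b = np.asarray(b, dtype=float)
    draws = np.tile(b, (n, 1))
    for idx, (dist, pct) in TABLE_2.items():
        u = abs(b[idx]) * pct / 100.0  # assumed: percent of the coefficient value
        draws[:, idx] += SAMPLERS[dist](u, n, rng)
    return draws

# Placeholder coefficients b0..b18 (made-up values for illustration only).
rng = np.random.default_rng(1)
b_demo = np.arange(1.0, 20.0)
coefficient_draws = perturb_coefficients(b_demo, 1000, rng)
```

If the percentages are instead meant as half-widths or expanded uncertainties, only the conversion to `u` (and the `sqrt(3)` and `sqrt(6)` factors) would need to change.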
Q1. You are asked to check the quality of your dataset by identifying and eliminating any outliers. 
For this purpose, you will investigate all your input variables and output variable (i.e., each of the 19 
columns) separately and determine the outliers, if any, in each column. [5 marks]
You are asked to follow a very strict approach: if any data row has at least one variable that is 
identified as an outlier, you will exclude that row from the analysis. [5 marks]
Explain your decisions stating all the underlying assumptions you have made and state the size of the 
new dataset you end up with, and the new population properties for each column of variables. [5 marks]
[Total marks available for Q1: 15]
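The brief leaves the outlier criterion open. As one possible illustration, the sketch below applies Tukey's fences (values beyond 1.5 times the interquartile range from the quartiles) to each column separately, then drops any row with at least one flagged value. The choice of criterion and the multiplier `k = 1.5` are assumptions, as are the function names, and the data are synthetic.

```python
import numpy as np

def iqr_outlier_mask(data, k=1.5):
    """Boolean array, True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    data: (n_rows, n_columns) array; quartiles are computed per column.
    """
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75], axis=0)
    iqr = q3 - q1
    return (data < q1 - k * iqr) | (data > q3 + k * iqr)

def drop_outlier_rows(data, k=1.5):
    """Strict rule: remove a row if any of its columns is flagged."""
    data = np.asarray(data, dtype=float)
    keep = ~iqr_outlier_mask(data, k).any(axis=1)
    return data[keep]

# Synthetic two-column example. The real file could be loaded with, e.g.,
# pandas.read_csv("TB_demographics.csv").to_numpy() before calling these helpers.
demo = np.array([
    [10.0, 1.0],
    [11.0, 2.0],
    [12.0, 3.0],
    [13.0, 4.0],
    [14.0, 5.0],
    [100.0, 6.0],  # first column holds an obvious outlier
])
clean = drop_outlier_rows(demo)
column_means = clean.mean(axis=0)
column_stds = clean.std(axis=0, ddof=1)  # sample standard deviation
```

Other criteria (for example a z-score threshold) would slot into `iqr_outlier_mask` without changing the row-removal step; whichever is used, the underlying distributional assumption should be stated.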
Q2. You are then asked to determine the uncertainty around the output variable, i.e., the predicted 
incidence of residual impacts due to TB per capita as given by the multiple linear regression model. 
For this purpose, you will use the Monte Carlo Method for uncertainty propagation to determine the 
expanded uncertainty using your dataset. [5 marks]
Make sure to demonstrate that your calculation of the expanded uncertainty has converged. Justify 
any assumptions you make in your analysis and discuss your results. If you have used the standard 
MCM, state the number of iterations that would be sufficient to achieve convergence. If you are 
using adaptive MCM, state your criterion for convergence and at how many iterations that has been 
reached. [5 marks]
In this case, does it suffice to report expanded uncertainty within a confidence interval? Do the 
results imply that the coverage interval needs to be calculated? If yes, report the probabilistically 
symmetric coverage interval. If no, justify your reasoning. [10 marks]
Correct implementation of the code and its originality: [15 marks]
Interpretation and discussion of your results, presentation of assumptions: [10 marks]
[Total marks available for Q2: 45 marks]
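The post-processing of Monte Carlo output can be sketched as follows: a summary step that returns the mean, the standard uncertainty and a probabilistically symmetric coverage interval taken from the sample percentiles, and a simple batch-wise stopping rule. The stopping rule (relative change in the standard deviation below a tolerance) is an illustrative assumption, not a prescribed adaptive procedure, and the toy model stands in for the regression model. All names are hypothetical.

```python
import numpy as np

def summarise(samples, coverage=0.95):
    """Mean, standard uncertainty and probabilistically symmetric coverage interval."""
    samples = np.asarray(samples, dtype=float)
    tail = 100.0 * (1.0 - coverage) / 2.0  # equal probability left in each tail
    low, high = np.percentile(samples, [tail, 100.0 - tail])
    return samples.mean(), samples.std(ddof=1), (low, high)

def run_until_stable(draw_batch, batch_size=10_000, tol=1e-2, max_batches=200):
    """Add batches until the relative change in the standard deviation is <= tol.

    draw_batch(n) must return n Monte Carlo evaluations of the model output.
    """
    samples = draw_batch(batch_size)
    previous = samples.std(ddof=1)
    for _ in range(max_batches - 1):
        samples = np.concatenate([samples, draw_batch(batch_size)])
        current = samples.std(ddof=1)
        if abs(current - previous) <= tol * current:
            break
        previous = current
    return samples

rng = np.random.default_rng(42)

def toy_model(n):
    """y = 1 + 2*z, where z = 5 plus a uniform error of half-width 0.5."""
    z = 5.0 + 0.5 * (2.0 * rng.random(n) - 1.0)
    return 1.0 + 2.0 * z

samples = run_until_stable(toy_model)
mean, std, (low, high) = summarise(samples)
expanded = 2.0 * std  # coverage factor k = 2, justified only for a near-normal output
```

In this toy case the output is uniform, so `mean ± expanded` extends beyond the range of the samples while the percentile interval does not. A mismatch of this kind between the two intervals is one indication that the coverage interval should be reported directly.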
Q3. The initial challenge you were tasked with was to identify which (if any) of the demographic factors 
have the largest impact on the incidence rate of residual disease impacts due to TB per capita. For 
this question assume all of the uncertainties around your input factors and regression coefficients 
stated in Table 2 to follow a uniform distribution. Perform a Sensitivity Analysis by applying the 
Elementary Effects Method on the multiple linear regression model, assuming an appropriate range 
of variation for each variable. Apply the Elementary Effects Method using the original sampling 
strategy proposed by Morris (refer to lecture notes) and justify/show the convergence of your 
results. [5 marks]
Based on your findings, which demographic factors play an important role in predicting the 
uncertainty around residual effects of TB manifesting at a later stage in treated individuals? [5 marks]
Correct implementation of code and its originality for the Sensitivity Analysis by applying the 
Elementary Effects Method: [15 marks]
Interpretation and discussion of results, stating assumptions: [15 marks]
[Total marks available for Q3: 40 marks]
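A minimal sketch of the Elementary Effects Method with Morris's original trajectory construction is given below, assuming every factor has been rescaled to the unit interval, an even number of levels `p`, and a step of `p / (2(p - 1))`. The test function, the number of trajectories `r` and all names are assumptions made for illustration. Mapping each factor from [0, 1] to its actual range is left to the model function.

```python
import numpy as np

def morris_trajectory(k, p, rng):
    """One (k+1, k) trajectory on a p-level grid; each step moves a single factor."""
    delta = p / (2.0 * (p - 1))
    B = np.tril(np.ones((k + 1, k)), -1)  # strictly lower-triangular ones
    J = np.ones((k + 1, k))
    D = np.diag(np.where(rng.random(k) < 0.5, -1.0, 1.0))  # random step directions
    # Base point drawn from the levels {0, 1/(p-1), ..., 1 - delta}.
    base = np.floor(rng.random(k) * (p // 2)) / (p - 1)
    traj = base + (delta / 2.0) * ((2.0 * B - J) @ D + J)
    perm = np.argsort(rng.random(k))  # random order in which factors move
    return traj[:, perm]

def elementary_effects(model, k, p=4, r=20, rng=None):
    """Return (mu, mu_star, sigma) of the elementary effects over r trajectories."""
    rng = np.random.default_rng() if rng is None else rng
    effects = np.empty((r, k))
    for t in range(r):
        traj = morris_trajectory(k, p, rng)
        y = np.array([model(x) for x in traj])
        for i in range(k):
            step = traj[i + 1] - traj[i]
            j = int(np.argmax(np.abs(step)))  # the factor moved in this step
            effects[t, j] = (y[i + 1] - y[i]) / step[j]
    mu = effects.mean(axis=0)
    mu_star = np.abs(effects).mean(axis=0)
    sigma = effects.std(axis=0, ddof=1)
    return mu, mu_star, sigma

def linear_test_model(x):
    """Assumed test function on the unit cube: 1 + 3*x0 - 2*x1 + 0*x2."""
    return 1.0 + 3.0 * x[0] - 2.0 * x[1] + 0.0 * x[2]

rng = np.random.default_rng(3)
traj_demo = morris_trajectory(3, 4, rng)
mu, mu_star, sigma = elementary_effects(linear_test_model, k=3, p=4, r=20, rng=rng)
```

For a purely linear function such as this one, every elementary effect of a factor equals its coefficient, so `sigma` is zero. Once uncertain coefficients are treated as additional factors, the products of coefficient and input make the model non-additive and `sigma` becomes informative. One way to check convergence is to repeat the analysis with increasing `r` and confirm that the ranking by `mu_star` stops changing.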