ENG3 Final Project
Due June 12, 2020, 5:00pm
Modeling the Covid-19 pandemic
The SIR (susceptible-infected-removed) epidemiological model consists of a
simplified mathematical model for the spread of infectious diseases. The idea
of the SIR model is simple: consider a population of N individuals. We then
separate the population into three groups: susceptible, infected and removed
(deceased). The sizes of these subpopulations at time t are denoted by S(t),
I(t), and R(t), respectively. Mathematically, this model can be written as a
system of three coupled, non-linear ordinary differential equations (ODEs)
dS
dt
= −βSI, (1)
dI
dt
= βSI − γI, (2)
dR
dt
= γI, (3)
where we assume that the disease transmission rate β = β(t) and the mortal-
ity rate γ = γ(t) are real, time-dependent functions. Given that the ODEs
are nonlinear and coupled, finding an analytical solution for Eqs. (1)-(3) is
impractical. Nonetheless, one can search for solutions numerically, using, for
example, ODE solvers such as the ones available in Matlab. We will use this
classic epidemiological model to model the spread of Covid-19 in Santa Bar-
bara County.
Part 1: Exploring the data (10 points)
Your first task consists of loading the data file covid_data.mat provided on
GauchoSpace and plotting its contents. The file consists of official data re-
ported by local authorities in the interval from March 15 to May 13 in Santa
1
Barbara County. It has 2 columns: the first contains the cumulative num-
ber of infected individuals in the population at a certain date, and
the second contains the cumulative number of deaths recorded on the
same date.
(a) Create a script named explore_data.m that loads the file covid_data.mat
provided.
(b) Still in your explore_data.m script, create a figure with two plots: i)
number of infected individuals vs time and ii) deaths vs time. Make sure that
you:
• Change the line thickness to 3 for each line.
• Use legends "Infected Individuals", "Deaths" to identify each curve.
• Name the x and y axes as "Time (days since March 15)" and "Number
of individuals", respectively. You can use legend(’location’,’best’)
to move your legend box to the top left corner, so it doesn’t cover your
plot.
• Give your plot the following title: "Covid-19 in Santa Barbara County".
• Turn the grid on using the command grid on.
• Increase the font size of legends, axis labels and title using the command
set(gca,’FontSize’,20).
At the end, your plot should look like Fig. (1). Upload explore_data.m
to GauchoSpace.
2
Figure 1: Plot of the number of individuals infected and deaths in Santa Barbara County
due to Covid-19.
Part 2: Estimating the parameters from the data (20
points)
Now that you have familiarized yourself with the data, let us come back to
the SIR model. Our goal is to use it to predict the growth of the number
of infected individuals and as well as the number of deaths in the near fu-
ture. But the problem is that we don’t know the values for the functions
β(t), γ(t) in Eqs. (1)-(3). Despite that, observing the data from places where
the Covid-19 epidemic got under control via quarantine measures can give us
some hints about these parameters. For example, it is reasonable to consider
that both the infection and the mortality rates decreased as more and more
people started taking prophylaxis measures, e.g., stay-at-home shelters, self-
quarantine, crowd control in public transportation systems and the usage of
masks to avoid the spread of the virus.
Based on this information, let us postulate that both the infection and
the mortality rate decreased exponentially after the quarantine measures were
implemented, i.e.,
γ(t) = 0.0247e−0.111t, (4)
β(t) = β0e
−0.0506t, (5)
where β0 is a constant to be determined. Your task will be to estimate β0
3
using the data you plotted in Part 1.
There exist many methods to estimate parameters from a given set of data
points. For simplicity, you are going to use a Euclidean norm approach to
minimize the error in the total number of cases C(t) = I(t) + R(t). The idea
is simple: solve Eqs. (1)-(3) for various β0 values within a range, and for each
of these values, compute the error given by the norm of the difference between
the model results C(t) and the data, i.e.,
err =
√∑Ndays
i=1 (I(i) +R(i)− covid_data(i, 1))2
Ndays
, (6)
where Ndays is the number of days covered by the data (i.e., the length of the
covid_data matrix).
(a) Write a function named sir_model.m that takes as input tspan (time
span) and X0 = [* ; * ; *] (column vector of initial conditions for S, I, R)
and returns the right-hand side of the ODE system given by Eqs. (1)-(3).
(b) Write a script named sir_main.m that solves the ODE system for various
values of β0 (use a for loop for that). Use the following range of values for
β0: beta0 = [0.1:0.001:0.5]./S0, where S0 is the initial susceptible popu-
lation. Compute the error using Eq. (6) for each value of β0 tested, and store
these values. For the initial conditions, consider that the susceptible popu-
lation of Santa Barbara County was 500,000, and assume that the infection
started with 1 person infected and zero deaths. Use a time span of 100 days.
Use the ode45() function to solve the ODE system.
(c) Still in your script sir_main.m, plot β0 vs err. Make sure that:
• Your plots look like Fig. 2.
• Give your plot the following title: "Error analysis (SIR model vs data)".
• Change the line thickness to 3 for each line.
• Name the x and y axis as "beta0" and "Error", respectively.
• Turn the grid on using the command grid on.
• Increase the font size of legends, axis labels and title using the command
set(gca,’FontSize’,20).
4
Figure 2: Evolution of the Euclidean norm of the error as a function of β0.
(d) Still in your script sir_main.m, plot i) the total number of cases C(t) =
I(t) + R(t) vs. time, and ii) the number of deaths R(t) vs. time, for the best
value of β0 found in part (c). In addition to that, overlap the curves for C and
R with the covid_data.mat data provided on GauchoSpace, so that you can
compare simulations with data. Make sure that:
• Your plots look like Fig. 3.
• Change the line thickness to 3 for each line.
• Instead of using the usual plot() command, use semilogy() to plot in
semi-log scale. This will make your two curves look more distinguishable.
• Give your plot the following title: "SIR model, beta0 = X", where X
must be replaced by the respective value of β0 found in (c).
• Use legends "Cumulative infected (SIR)", "Deaths (SIR)", "Cumulative
infected (data)", "Deaths (data)" in your plots.
• Name the x and y axis as "Time (days since March 15)" and "Number of
individuals", respectively. Again, you can use legend(’location’,’best’)
to move your legend box to the best available spot, so it doesn’t cover
your plots.
• Turn the grid on using the command grid on.
5
• Increase the font size of legends, axis labels and title using the command
set(gca,’FontSize’,20).
Note: The best β0 value obtained in (c) should be close to β0 ≈ 7.580 ×
10−7. If you have not been able to accomplish the optimization, please proceed
with β0 = 7.580× 10−7 for Part 3 of this project.
Figure 3: Semi-log plot of the solution of the SIR model with β0 = 7.580× 10−7.
Upload sir_model.m and sir_main.m to GauchoSpace.
6
Part 3: Stochastic simulation of SIR model (20 points)
Let’s take a look back at your plots from Part 2-(c). At this point you probably
noticed that we can find a β0 that fits the data reasonably well, but no matter
how much you change it, we will never find the perfect fit. One of the reasons
for that involves the fact that traditional deterministic (ODE) models like
the one given by Eqs. (1)-(3) cannot accurately predict the early stages of
the infection, since they rely on the hypothesis of a well-mixed population
and interactions of millions of individuals. In contrast, for small populations,
random events are often relevant, and therefore situations like jumps in the
number of cases are prone to occur; deterministic models, on the other hand,
disregard these events. To solve these problems, we use stochastic models.
Notice that the current number of infected individuals in Santa Barbara
County is relatively small compared to the total population. Thus, the deter-
ministic model given by Eqs. (1)-(3) might lead to incorrect predictions. Your
goal for this assignment is to solve a stochastic SIR model using a method
called the Gillespie algorithm.
The Gillespie algorithm belongs to a class of methods called dynamic Monte
Carlo methods. As you probably remember from homeworks 6 and 7, Monte
Carlo simulations allow us to estimate probabilities of events by testing mul-
tiple trials (also called realizations). Since each realization involves random
numbers, the results of Monte Carlo simulations are never exactly the same.
Similarly, each realization (simulation) via the Gillespie algorithm will gener-
ate unique curves for S(t), I(t), R(t).
The stochastic SIR model considers the following chains of events (also
called Markov chains)
S + I
β (t)−−→ 2 I, (7)
I
γ (t)−−→ R, (8)
where γ(t), β(t) are the same functions given by Eqs. (4)-(5). The chain given
by Eq. (7) describes the process of infection: each time a susceptible individual
meets an infected one, there’s a propensity (which is proportional to β(t)) that
dictates how often the susceptible individual becomes infected. Similarly, the
chain in Eq. (8) describes the process of infected individuals succumbing to
the virus and dying, with a propensity proportional to γ(t). The propensity
function αi gives the probability that an event in the chain i will occur in the
time interval (t, t+ τ):
αi = number of individuals interacting at chain ”i”× rate, (9)
7
which, for the chains given by Eq. (7)-(8), read
α1 = S(t)I(t)β(t), (10)
α2 = I(t)γ(t). (11)
For more details about the implementation of this algorithm, see p. 11 of
this final project assignment: Recipe for Stochastic SIR using Gillespie Algo-
rithm.
(a) Using the recipe provided on p. 11, write a function named ssir_model.m,
that takes as input tspan (time span) and X0 = [* ; * ; *] (column vector
of initial conditions for S, I, R) and returns one realization (i.e., column vec-
tors containing S(t), I(t), R(t), resulting from the algorithm provided in p. 9)
of the Gillespie algorithm, and a time vector tVec. Use the functions γ(t), β(t)
given by Eqs. (4)-(5), and use the value of β0 that you found best fit
the data in Part 2-(c), i.e., β0 = 7.580× 10−7 .
(b) Write a script named ssir_main.m to compute Nr = 5 realizations of the
stochastic SIR model by calling your function ssir_model 5 times (use a for
loop for that). For the initial conditions, assume that the susceptible popu-
lation of Santa Barbara County was 500,000, and assume that the infection
started with 1 person infected and zero deaths. Use a time span of 100 days.
(c) Still in your script ssir_main.m, plot i) the total number of cases C(t) =
I(t) +R(t) vs. time, and ii) the number of deceased individuals R(t) vs. time,
for each of the 5 realizations computed in Part 3-(b). In addition to that,
overlap the curves for C and R with the covid_data.mat data provided on
GauchoSpace, so that you can compare simulations with data. Make sure that:
• Your plots, legends and title look like Fig. 4. Note: since the method is
stochastic, your curves will be different than the ones shown in Fig. 4.
• Instead of using the usual plot(), use the command semilogy() to plot
in semi-log scale. This will make your two curves look more distinguish-
able.
• Give your plot the following title: "sSIR model".
• Use legends ’Cumulative infected (sSIR)’, ’Deaths (sSIR)’, ’Cumulative
infected (data)’, ’Deaths (data)’ in your plots. Note: Since all the real-
izations have the same legends, use ’HandleVisibility’,’off’ in your
semilogy() commands for realizations 2 to Nr, to avoid legend repetition.
8
• Name the x and y axis as "Time (days since March 15)" and "Number of
individuals", respectively. Again, you can use legend(’location’,’best’)
to move your legend box to the best available spot, so it doesn’t cover
your plots.
• Turn the grid on using the command grid on.
Figure 4: Semi-log plot of the solution of the stochastic SIR model for 5 realizations (each
line corresponds to a realization).
9
(d) In a similar manner as you did for your script ssir_main.m, create an-
other script, named ssir_hist.m, to compute Nr = 2000 realizations of the
stochastic SIR model. At the end of each realization i, store the last value of
the total number of cases, i.e., C(i) = I(end) = R(end). This value repre-
sents a prediction of the number of cases in Santa Barbara County after tspan
days (i.e., 100 days). With these values, create a histogram of the array C
using the command histogram(C). Your histogram should look like Fig. 5.
Based on your histogram, write a comment in your script ssir_hist.m indi-
cating, using your own words, what you think is reasonable upper bound for
the number of cases.
Hint: for an empirical distribution, you can compute the cumulative distri-
bution function using the command cdfplot(C). The plot gives an estimate
of the probability F that the number of cases C will be smaller than or equal
to a value X.
Upload ssir_model.m, ssir_main.m, ssir_hist.m to GauchoSpace.
Figure 5: Histogram of the total number of cases C = I+R at day 100, for 2000 realizations
of the stochastic SIR model.
10
Recipe for Stochastic SIR using the Gillespie Algorithm:
While t < tspan , do the following:
1. Generate two random numbers, r1, r2, uniformly distributed in the inter-
val (0, 1).
2. Compute the propensity function α for each chain. Since we have two
chains, the propensities are given by
α1 = S(t)I(t)β(t), (12)
α2 = I(t)γ(t), (13)
where γ(t), β(t) are the functions given by Eqs. (4)-(5). Use the value of
β0 that you found best fit the data in Part 2-(c).
3. Compute the total propensity α0 = α1 + α2.
4. Compute the time τ when the next event (i.e., either contamination or
death) will take place.
τ =
1
α0
ln
(
1
r1
)
(14)
5. Compute the number of individuals for each group S, I, R at time t + τ
as the following:
if r2 ∈ [0, α1/α0) :
S(t+ τ) = S(t)− 1,
I(t+ τ) = I(t) + 1,
R(t+ τ) = R(t),
(15)
else if r2 ∈ [α1/α0, 1) :
S(t+ τ) = S(t),
I(t+ τ) = I(t)− 1,
R(t+ τ) = R(t) + 1.
(16)
6. Make sure that the system always includes at least one infected person,
i.e., check if I(t+ τ) > 0. If I(t+ τ) becomes 0, include another infected
person at time t+ τ by setting S(t+ τ) = 1.
7. Save your new time as a point in the time vector: tVec(i) = t + tau
(so you can plot S(t), I(t), R(t) vs t later).
11