辅导Epid 633、讲解R编程语言、Data set辅导、讲解R辅导Python程序|辅导留学生Prolog

Lab 6: Parameter Estimation
Epid 633
Scenario & Setup
A flu-like illness has been spreading through Stardew Valley County. You have been asked to develop a predictive model for the disease, which the Stardew Valley health authorities will use to estimate several important parameters that can help in designing an intervention:

the infectious period of the disease
the reporting fraction (i.e. what fraction of cases are reported)
the transmission rate and/or basic reproduction number (R0)
In addition, they are hoping that your model will be able to successfully forecast the number of cases there will be in the upcoming weeks.

Data set

Here’s the data set that we’ll be working with–weekly data on the number of currently ill individuals:

times = c(0, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, 98)
data = c(97, 271, 860, 1995, 4419, 6549, 6321, 4763, 2571, 1385, 615, 302, 159, 72, 34)

1) SIR Model Setup
Start by setting up an SIR model, using the equations:
In this case, we’ll take S, I, and R to be the fraction of the population that’s susceptible, infectious, and recovered. The measurement equation we’ll be using is:

y=kNI

where N is the total population size and k is the reporting fraction. To break down this equation, note that I is the fraction of the population currently infected, so NI is the total number of actual people currently infected. Then y=kNI represents the total number of observed cases currently infected.

Code up the model above, noting that:

Stardew Valley County has a total population of N=200000. We will treat this as a fixed (i.e. known, not estimated) parameter, so you can just code this as a number (i.e. it’s not part of the parameter vector that you’ll use for estimation).

At t=0, approximiately 200 people were infected, and no one had recovered yet. This means that the initial conditions were that 99.9% of the population was susceptible, and 0.1% was infected (i.e. x0 = c(0.999,0.001,0) for S, I, and R respectively).

For starting parameters, we can take β=0.4. So far, the infectious period appears to be roughly 4 days, so we can start with a guess of γ=1/4=0.25. For the reporting fraction, let’s start by supposing we have perfect reporting (i.e. every case is reported), so that k=1.

Now, plot your model output (y) from 0 to 100 days, and plot the data on the same plot (use a line for your model and points/dots for the data). How well does the model match the data using our initial guesses for the parameters? Do you notice any issues with the model fit currently?
2) Setting up the likelihood
Next, we need to make a function that will evaluate the goodness-of-fit for the model—in this case, the negative log likelihood. We will pass this function to the optimizer, optim, which will use our function to find the best-fit parameter estimates (i.e. the parameter values that minimize our function).

Code up a likelihood function for the SIR model, using supposing that your data is taken from a Poisson distribution with rate parameter y (I’ve included the Poisson likelihood form in the template code below, but you can try deriving it yourself if you want! It’s just like the Gaussian likelihood we did on Monday, just use a Poisson distribution instead of a normal distribution). Here’s the overall structure that the likelihood function will take (note that this is just a template! You need to adapt this for your own code, depending on how you name things, etc.):

CalcNLL = function(params, ODEfunction, x0, times, data){
# The function will calculate the negative log likelihood---to do that, we need:
# - the parameters (this is what optim will fiddle with to find the best fit)
#
# - the ODE function for the SIR model you wrote in part 1). Optim will use
# to simulate the model so it can compare the model to the data
#
# - to simulate the model with SIRode, we also need the initial conditions (x0)
# and the time points (times)
#
# - lastly, we need to pass the data into NLL, so we can compare the model and
# data

# Since we don't want to include negative values for the parameters, start by taking the absolute value of whatever the current parameter values are (note this is a bit of a silly trick, ask me more if you're confused)
params = abs(params)

# Simulate model
xcurr = ode(x0, times, ODEfunction, params) # replace this with whatever ODE code you're using to run your model!

# Measurement equation
y = k*N*I # replace "k", "N", and "I" with whatever they are in your code, e.g. params[1], xcurr[,3], etc.

# Negative Log Likelihood (NLL)
NLL = sum(y) - sum(data*log(y)) # Poisson ML
# note this is a slightly shortened version--there's an additive constant term missing but it shouldn't alter the optimal estimates. Alternatively, can do:
# NLL = -sum(log(dpois(round(data),round(y)))) # the round is b/c Poisson is for (integer) count data

# return(NLL)
}

Evaluate your likelihood function for a couple of values—try your default parameters, and then try taking β=0.42 and β=0.38 (leaving the other parameters equal to the defaults). Based on this, if you were optim, which direction would you move β to improve the fit? Does this match with what you saw in your initial model fit in Part 1?

3) Parameter Estimation
Okay, time to run the optimizer! Estimate your model parameters from the data set provided. While you’re at it, also pull the final, best-fit negative log likelihood value. The function optim takes the following inputs:

starting parameters the first input for optim will be the starting values for the model parameters

cost function - in our case, this is our likelihood-calculator we made above

arguments for the cost function - all the extra arguments that the cost function takes (i.e. everything else the cost function needs, like data,times,your ODE function, etc.)

So the structure is: optim(starting parameters, cost function, arguments for cost function). Here’s a template for how to set up your optimizer:

### First, run optim ###
results = optim(initialparams, fn = CalcNLL, ODEfunction = SIRode, times = times, data = data, x0 = x0) # Ask me if you're confused about why we have 'times = times', 'data = data', etc.!

### Pull out the parameter estimates ###
paramests = results$par

### Pull out the final cost function value (-log likelihood) ###
negLL = results$value
Plot your model and data on the same plot (like you did in Part 1). How does your model fit look now? How did your parameters change from your initial guess? What are the values you estimated for each of your parameters (β, γ, and k)? What implications might your estimate of k have for intervention efforts?

4) Forecasting
Let’s consider the case where you are attempting to fit and forecast an ongoing epidemic (i.e. with incomplete data). Truncate your data to only include the first five data points, then re-fit the model parameters.

How do your parameter estimates change?

Simulate your model for the full time (0 to 100 days) and plot your model predictions along with the complete data set. How well did you predict the dynamics of the epidemic?

Now re-do the forecast, but suppose that we only have the first four data points of the epidemic—how do your estimates and predictions change?

5) Model Comparison and the Akaike Information Criterion (AIC)
Lastly, let’s compare the SIR model we’ve been using to a slightly more complex model, the SEIR model:
Here, E represents the fraction of the population who are infected but not yet transmissible (i.e. latently infected).

Estimate the parameters using this model and the complete data set, supposing that:

The initial conditions for S, E, I, and R are: x0 = c(0.9985,0.001,0.0015,0)

The initial parameter guesses are params = c('beta'=0.4,'alpha' = 1, 'gamma'=0.25, 'kappa'=1)

You can use the same measurement equation as before (y=kNI). Does the fit look better/different than what you saw with the SIR model in Part 3? How are the parameter estimates different with this model compared to the SIR model?

Calculate the AIC for both models and compare. Which model has a lower (better) AIC? How different are the two AICs? Recall that the AIC is given by:
2(−LL+nP)
where −LL is the negative log likelihood and nP is the number of parameters to be estimated (i.e. 3 for the SIR model and 4 for the SEIR model).

2π) Extra problems (just for fun—these are not required and don’t need to be turned in!)
Try out alternative cost functions, e.g. using dpois for the Poisson likelihood, or using least squares. Do you notice differences in fit? What about in how long it takes to generate the parameter estimates?

Try making a heat map of the likelihood as a function of β and γ (you could try out a fancier plotting package like ggplot2 or plotly for this). How does the shape of the likelihood change as you reduce the amount of data you have?