School of Mathematics
Bayesian Data Analysis, 2021/2022, Semester 2
Lecturer: Daniel Paulin
Assignment 1
IMPORTANT INFORMATION ABOUT THE ASSIGNMENT
In this paragraph, we summarize the essential information about this assignment. The format
and rules for this assignment are different from your other courses, so please pay attention.
1) Deadline: The deadline for submitting your solutions to this assignment is 7 March,
12:00 noon Edinburgh time.
2) Format: You will need to submit your work as 2 components: a PDF report, and your
R Markdown (.Rmd) notebook. There will be two separate submission systems on Learn:
Gradescope for the report in PDF format, and a Learn assignment for the code in Rmd
format. You are encouraged to write your solutions into this R Markdown notebook (code
in R chunks and explanations in Markdown chunks), and then select Knit/Knit to PDF in
RStudio to create a PDF report.
It suffices to upload this PDF to the Gradescope submission system, and your Rmd file to the Learn
assignment submission system. You will be required to tag every sub-question on Gradescope.
A video describing the submission process will be posted on Learn.
Some key points that are different from other courses:
a) Your report needs to contain a written explanation for each question that you solve, and some
numbers or plots showing your results. Solutions without a written explanation that clearly
demonstrates that you understand what you are doing will be marked as 0, irrespective of
whether the numerics are correct or not.
b) It must be possible to run your code for all questions using Run All in RStudio, and it must
reproduce all of the numerics and plots in your report (up to some small randomness due
to the stochasticity of Monte Carlo simulations). Parts of the report that contain material
that is not reproduced by the code will not be marked (i.e. the score will be 0), and the only
feedback in this case will be that the results are not reproducible from the code.
c) Multiple submissions are allowed BEFORE THE DEADLINE for both the
report and the code. However, multiple submissions are NOT ALLOWED AFTER THE
DEADLINE. YOU WILL NOT BE ABLE TO MAKE ANY CHANGES TO YOUR SUBMISSION
AFTER THE DEADLINE. Nevertheless, if you did not submit anything before the
deadline, then you can still submit your work after the deadline. Late penalties will apply
unless you have a valid extension. The timing of the late penalties will be determined by the
time you have submitted BOTH the report, and the code (i.e. whichever was submitted later
counts).
We illustrate these rules by some examples:
Alice has spent a lot of time and effort on her assignment for BDA. Unfortunately she has
accidentally introduced a typo in her code in the first question, and it did not run using Run
All in RStudio. - Alice will get 0 for the whole assignment, with the only feedback “Results
are not reproducible from the code”.
Bob has spent a lot of time and effort on his assignment for BDA. Unfortunately he forgot to
submit his code. - Bob will get no personal reminder to submit his code. Bob will get 0 for
the whole assignment, with the only feedback “Results are not reproducible from the code, as
the code was not submitted.”
Charles has spent a lot of time and effort on his assignment for BDA. He has submitted both
his code and report in the correct formats. However, he did not include any explanations in
the report. Charles will get 0 for the whole assignment, with the only feedback “Explanation
is missing.”
Denise has spent a lot of time and effort on her assignment for BDA. She has submitted
her report in the correct format, but thought that she could include her code as a link in the
report, and upload it online (such as GitHub or Dropbox). - Denise will get 0 for the whole
assignment, with the only feedback “Code was not uploaded on Learn.”
3) Group work: This is an INDIVIDUAL ASSIGNMENT, like a 2-week exam for the course.
Communication between students about the assignment questions is not permitted. Students
who submit work that has not been done individually will be reported for Academic Misconduct,
which can lead to serious consequences. Each problem will be marked by a single
instructor, so we will be able to spot students who copy.
4) Piazza: You are NOT ALLOWED to post questions about Assignment Problems visible to
Everyone on Piazza. You need to set the visibility of such questions to Instructors only,
by selecting Post to / Individual students/Instructors, typing in Instructors, and clicking on the
blue Instructors banner that appears below.
Students who post any information related to the solution of assignment problems visible to
their classmates will
a) have their access to Piazza revoked for the rest of the course without prior warning, and
b) be reported for Academic Misconduct.
Only questions regarding clarification of the statement of the problems will be answered by
the instructors. The instructors will not give you any information related to the solution of
the problems; such questions will simply be answered with “This is not about the statement of
the problem so we cannot answer your question.”
THE INSTRUCTORS ARE NOT GOING TO DEBUG YOUR CODE, AND YOU ARE
ASSESSED ON YOUR ABILITY TO RESOLVE ANY CODING OR TECHNICAL DIFFICULTIES
THAT YOU ENCOUNTER ON YOUR OWN.
5) Office hours: There will be two office hours per week (Mondays 16:00-17:00 and Wednesdays
16:00-17:00) during the 2 weeks for this assignment. The links are available on Learn /
Course Information. We will be happy to discuss the course/workshop materials. However,
we will only answer questions about the assignment that require clarifying the statement of
the problems, and will not give you any information about the solutions. Students who ask for
feedback on their assignment solutions during office hours will be removed from the meeting.
6) Late submissions and extensions: Students who have existing Learning Adjustments in
Euclid will be allowed to have the same adjustments applied to this course as well, but they
need to apply for this BEFORE THE DEADLINE on the website
https://www.ed.ac.uk/student-administration/extensions-special-circumstances
by clicking on “Access your learning adjustment”. This will be approved automatically.
For students without Learning Adjustments, if there is a justifiable reason (external circumstances)
for not being able to submit your assignment in time, then you can apply for an
extension BEFORE THE DEADLINE on the website
https://www.ed.ac.uk/student-administration/extensions-special-circumstances
by clicking on “Apply for an extension”. Such extensions are processed entirely by the central
ESC team. The course instructors have no role in this decision so you should not write to us
about such applications. You can contact our Student Learning Advisor, Maria Tovar Gallardo
(maria.tovar@ed.ac.uk) in case you need some advice regarding this.
Students who submit their work late will have late submission penalties applied by the ESC
team automatically (this means that even if you are 1 second late because your internet
connection was slow, the penalties will still apply). The penalties are 5% of the total mark
deducted for every day of delay started (i.e. one minute of delay counts as 1 day). The course
instructors do not have any role in setting these penalties, and we will not be able to change them.
The first picture is a rotifer (by Steve Gschmeissner), the second is a unicellular alga (by NEON ja, colored
by Richard Bartz).
Problem 1 - Rotifer and algae data
In this problem, we study an experimental dataset (Blasius et al. 2020,
https://doi.org/10.1038/s41586-019-1857-0) on the predator-prey relationship between two
microscopic organisms: rotifers (predator) and unicellular green algae (prey). These were studied
in a controlled environment (water tank) in a laboratory over 375 days. The dataset contains
daily observations of the concentrations of algae and rotifers. The unit of measurement in the
algae column is $10^6$ algae cells per ml of water, while in the rotifier column it is the number
of rotifers per ml of water.
We are going to apply a simple two dimensional state space model on this data using JAGS.
The first step is to load JAGS and the dataset.
# We load JAGS
library(rjags)
## Loading required package: coda
## Linked to JAGS 4.3.0
## Loaded modules: basemod,bugs
#You may need to set the working directory first before loading the dataset
#setwd("/Users/dpaulin/Dropbox/BDA_2021_22/Assignments/Assignment1")
rotifier_algae=read.csv("rotifier_algae.csv")
#The first 6 rows of the dataframe
print.data.frame(rotifier_algae[1:6,])
## day algae rotifier
## 1 1 1.50 NA
## 2 2 0.82 6.58
## 3 3 0.77 17.94
## 4 4 0.36 17.99
## 5 5 0.41 21.12
## 6 6 0.41 17.06
As we can see, some values in the dataset are missing (NA).
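We are going to model the true log concentrations $x_t$ by the state space model
$$x_t = A x_{t-1} + b + w_t, \quad w_t \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_R^2 & 0 \\ 0 & \sigma_A^2 \end{pmatrix}\right),$$
where $b$, $\sigma_R^2$, $\sigma_A^2$ and $A$ are model parameters, and $t$ denotes the time point. In particular, $t = 0$
corresponds to day 0, and $t = 1, 2, \ldots, 375$ correspond to days 1-375.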
Here $x_t$ is a two-dimensional vector. The first component denotes the logarithm of the rotifer
concentration measured in number of rotifers per ml of water, and the second component
denotes the logarithm of the algae concentration measured in $10^6$ algae per ml (these units
are the same as in the dataset). $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$ is a two-by-two matrix, and $b$ is a
two-dimensional vector.
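The observation process is described as
$$y_t = x_t + v_t, \quad v_t \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \eta_R^2 & 0 \\ 0 & \eta_A^2 \end{pmatrix}\right),$$
where $\eta_A^2$ and $\eta_R^2$ are additional model parameters.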
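To build intuition for this kind of model, the following minimal sketch simulates a toy two-dimensional linear Gaussian state space model in base R. It assumes a transition of the form $x_t = A x_{t-1} + b + w_t$ with independent Gaussian noise in each component, and every numerical value (the matrix `A`, the vector `b`, the noise standard deviations `sigma` and `eta`, the number of days) is made up for illustration. It is not a fit to the data.

```r
# Simulate a toy two-dimensional linear Gaussian state space model.
# All numerical values below are hypothetical and chosen for illustration only.
set.seed(1)
n_days <- 50
A      <- matrix(c( 0.9, 0.1,
                   -0.1, 0.9), nrow = 2, byrow = TRUE)  # hypothetical transition matrix
b      <- c(0.1, 0.0)   # hypothetical drift vector
sigma  <- c(0.2, 0.2)   # state noise std. dev. (rotifer, algae)
eta    <- c(0.1, 0.1)   # observation noise std. dev. (rotifer, algae)

x <- matrix(NA_real_, nrow = n_days + 1, ncol = 2)  # row 1 holds x_0
y <- matrix(NA_real_, nrow = n_days, ncol = 2)      # row t holds y_t
x[1, ] <- c(log(6), log(1.5))                       # start at the prior mean of x_0

for (t in 1:n_days) {
  # State update followed by a noisy observation of the new state
  x[t + 1, ] <- as.vector(A %*% x[t, ]) + b + rnorm(2, mean = 0, sd = sigma)
  y[t, ]     <- x[t + 1, ] + rnorm(2, mean = 0, sd = eta)
}
colnames(y) <- c("log_rotifer", "log_algae")
head(y)
```

The loop makes the two layers explicit: the latent state evolves on its own, and each observation is the current state plus independent measurement noise.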
a)[10 marks] Create a JAGS model that fits the above state space model to the rotifer-algae
dataset for the whole 375-day period.
Use 10000 burn-in steps and obtain 50000 samples from the model parameters $A$, $b$, $\sigma_R^2$, $\sigma_A^2$, $\eta_R^2$, $\eta_A^2$
(4+2+4=10 parameters in total).
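Use a Gaussian prior $N\left(\begin{pmatrix} \log(6) \\ \log(1.5) \end{pmatrix}, \begin{pmatrix} 4 & 0 \\ 0 & 4 \end{pmatrix}\right)$ for the initial state $x_0$, independent Gaussian
$N(0, 1)$ priors for each of the 4 elements of $A$, a Gaussian prior $N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\right)$ for $b$, and inverse
Gamma (0.1,0.1) priors for the variance parameters $\sigma_R^2$, $\sigma_A^2$, $\eta_R^2$, $\eta_A^2$.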
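A practical point when encoding such priors: JAGS parameterises `dnorm` by the precision rather than the variance, and one common way to obtain an inverse Gamma(a, b) prior on a variance is to place a Gamma(a, b) prior (shape, rate) on the corresponding precision. The sketch below checks this equivalence by simulation in base R. The JAGS fragment at the end is a hypothetical illustration with made-up node names (`tau.R`, `sigma2.R`), not a complete model.

```r
# If tau ~ Gamma(shape = a, rate = b), then sigma2 = 1 / tau ~ Inverse-Gamma(a, b).
set.seed(1)
a <- 0.1
b <- 0.1
tau    <- rgamma(1e5, shape = a, rate = b)  # draws of the precision
sigma2 <- 1 / tau                           # implied draws of the variance

# The p-quantile of Inverse-Gamma(a, b) is 1 / (the (1 - p)-quantile of Gamma(a, b)).
p <- c(0.25, 0.5, 0.75)
empirical   <- quantile(sigma2, probs = p, names = FALSE)
theoretical <- 1 / qgamma(1 - p, shape = a, rate = b)
print(rbind(empirical, theoretical))

# Hypothetical JAGS fragment expressing the same prior (stored as text, not run here):
jags_prior_fragment <- "
  tau.R ~ dgamma(0.1, 0.1)   # precision
  sigma2.R <- 1 / tau.R      # variance
"
```

With shape 0.1 the prior is extremely heavy-tailed, which is why the comparison uses quantiles rather than means.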
Explain how you handled the fact that some of the observations are missing (NA) in the
dataset.
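One possible approach, sketched below under stated assumptions, relies on the fact that JAGS treats NA entries of a stochastic data node as unobserved quantities and samples them together with the parameters. The snippet only shows the data preparation on a toy data frame that copies the six rows printed above. The names `toy`, `jags_data` and `n_days` are hypothetical, and other ways of handling the missing values are equally legitimate.

```r
# Toy data frame mimicking the first rows of rotifier_algae.csv
# (values copied from the printout above).
toy <- data.frame(day      = 1:6,
                  algae    = c(1.50, 0.82, 0.77, 0.36, 0.41, 0.41),
                  rotifier = c(NA, 6.58, 17.94, 17.99, 21.12, 17.06))

# Log-transform; the column order (rotifer first, algae second) follows the definition of x_t.
y <- log(as.matrix(toy[, c("rotifier", "algae")]))

# NAs are kept on purpose: JAGS treats NA entries of a stochastic data node
# as unobserved quantities and samples them along with the parameters.
jags_data <- list(y = y, n_days = nrow(y))   # hypothetical names for the data list

which(is.na(y), arr.ind = TRUE)              # positions of the missing observations
```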
Explanation: (Write your explanation here)
b)[10 marks]
Based on your MCMC samples, compute the Gelman-Rubin convergence diagnostics (Hint:
you need to run multiple chains in parallel for this by setting the n.chains parameter). Discuss
how well the chains have converged to the stationary distribution based on the results.
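To illustrate what the Gelman-Rubin diagnostic measures, here is a minimal base-R sketch of the basic potential scale reduction factor for a single scalar parameter. The function name `psrf_basic` and the simulated chains are hypothetical. The `gelman.diag` function in the coda package applies an additional correction, so its values can differ slightly from this simplified version.

```r
# Basic potential scale reduction factor (R-hat) for one scalar parameter.
# `chains` is an (iterations x chains) matrix of post-burn-in draws.
psrf_basic <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))      # mean within-chain variance
  B <- n * var(colMeans(chains))        # between-chain variance
  var_plus <- (n - 1) / n * W + B / n   # pooled variance estimate
  sqrt(var_plus / W)
}

set.seed(1)
mixed   <- matrix(rnorm(4000), ncol = 4)         # 4 chains sampling the same target
unmixed <- sweep(mixed, 2, c(0, 0, 3, 3), "+")   # 2 of the chains stuck elsewhere
psrf_basic(mixed)     # close to 1
psrf_basic(unmixed)   # clearly above 1
```

Values close to 1 indicate that the between-chain variation is small relative to the within-chain variation. Values well above 1 suggest that the chains have not yet explored the same distribution.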
Print out the summary of the fitted JAGS model. Do autocorrelation plots for the 4 components
of the model parameter A.
Compute and print out the effective sample sizes (ESS) for each of the model parameters
$A$, $b$, $\sigma_R^2$, $\sigma_A^2$, $\eta_R^2$, $\eta_A^2$.
If the ESS is below 1000 for any of these 10 parameters, increase the sample size/number of
chains until the ESS is above 1000 for all 10 parameters.
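The idea behind the effective sample size can be sketched in a few lines of base R: autocorrelated draws carry less information than independent ones, and the ESS discounts the nominal sample size accordingly. The function `ess_basic` below is a simplified, hypothetical estimator that truncates the autocorrelation sum at the first non-positive estimate. It is not the estimator used by `effectiveSize` in coda, so the numbers will not match exactly.

```r
# Simplified effective sample size: n / (1 + 2 * sum of positive-lag autocorrelations),
# truncating the sum at the first non-positive autocorrelation estimate.
ess_basic <- function(x, max_lag = 1000) {
  n   <- length(x)
  rho <- acf(x, lag.max = min(max_lag, n - 1), plot = FALSE)$acf[-1]  # drop lag 0
  first_nonpos <- which(rho <= 0)[1]
  if (!is.na(first_nonpos)) rho <- rho[seq_len(first_nonpos - 1)]
  n / (1 + 2 * sum(rho))
}

set.seed(1)
n <- 20000
iid_draws <- rnorm(n)                                       # independent draws
ar_draws  <- as.numeric(arima.sim(list(ar = 0.9), n = n))   # strongly autocorrelated draws
ess_basic(iid_draws)   # roughly n
ess_basic(ar_draws)    # far smaller than n
```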
Explanation: (Write your explanation here)
c)[10 marks]
We are going to perform posterior predictive checks to evaluate the fit of this model on the data
(using the priors stated in question a). First, create replicate observations from the posterior
predictive using JAGS. The number of replicate observations should be at least 1000.
Compute the minimum, maximum, and median for both log-concentrations (i.e. both for
rotifers and algae, 3 · 2 = 6 in total).
Plot the histograms for these quantities together with a line that shows the value of the function
considered on the actual dataset (see the R code for Lecture 2 for an example). Compute the
DIC score for the model (Hint: you can use the dic.samples function for this).
Discuss the results.
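The mechanics of a posterior predictive check for a single test statistic can be sketched as follows. Both `y_obs` and `y_rep` are simulated here purely for illustration. In the assignment the replicates would come from the fitted JAGS model, and the same steps would be repeated for each of the six quantities.

```r
# Posterior predictive check for one test statistic (here the minimum).
# `y_obs` is the observed series and `y_rep` holds one replicate dataset per row;
# both are simulated stand-ins, not output from a fitted model.
set.seed(1)
y_obs <- rnorm(100, mean = 1, sd = 0.5)
y_rep <- matrix(rnorm(1000 * 100, mean = 1, sd = 0.5), nrow = 1000)

stat_obs <- min(y_obs, na.rm = TRUE)   # statistic on the actual data
stat_rep <- apply(y_rep, 1, min)       # statistic on each replicate

hist(stat_rep, breaks = 30, main = "Posterior predictive check: minimum",
     xlab = "min of replicate data")
abline(v = stat_obs, col = "red", lwd = 2)   # value on the actual dataset

mean(stat_rep <= stat_obs)   # tail proportion as a rough numerical summary
```

If the vertical line falls far in the tail of the histogram, the model has difficulty reproducing that feature of the data.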
Explanation: (Write your explanation here)
d)[10 marks]
Discuss the meaning of the model parameters $A$, $b$, $\sigma_R^2$, $\sigma_A^2$, $\eta_R^2$, $\eta_A^2$. Find a website or paper
that contains information about rotifers and unicellular algae (Hint: you can use Google search
for this). Using your understanding of the meaning of model parameters and the biological
information about these organisms, construct more informative prior distributions for the
model parameters. State in your report the source of information and the rationale for your
choices of priors.
Re-implement the JAGS model with these new priors. Perform the same posterior predictive
checks as in part c) to evaluate the fit of this new model on the data.
Compute the DIC score for the model as well (Hint: lower DIC score indicates better fit on
the data).
Discuss whether your new priors have improved the model fit compared to the original prior
from a).
Explanation: (Write your explanation here)
e)[10 marks] Update the model with your informative prior in part d) to compute the posterior
distribution of the log concentrations ($x_t$) on days 376-395 (20 additional days).
Plot the evolution of the posterior mean of the log concentrations for rotifers and algae during
days 376-395 on a single plot, along with curves that correspond to the [2.5%, 97.5%] credible
interval of the log concentration ($x_t$) according to the posterior distribution on each day
[Hint: you need 2 + 2 · 2 = 6 curves in total, use different colours for the curves for rotifers
and algae].
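As a rough sketch of how such curves can be obtained from MCMC output, the snippet below computes the pointwise posterior mean and the 2.5% and 97.5% quantiles from a matrix of draws, for one of the two organisms. The matrix `draws` is a simulated stand-in (a random walk), not output from the model, and the variable names are hypothetical.

```r
# `draws` stands in for posterior samples of one log concentration on days 376-395:
# one row per MCMC draw, one column per day. Simulated here as a random walk.
set.seed(1)
days  <- 376:395
draws <- t(apply(matrix(rnorm(2000 * length(days), sd = 0.1), nrow = 2000), 1, cumsum))

post_mean <- colMeans(draws)                                      # posterior mean per day
post_ci   <- apply(draws, 2, quantile, probs = c(0.025, 0.975))   # 2 x 20 matrix

plot(days, post_mean, type = "l", ylim = range(post_ci),
     xlab = "day", ylab = "log concentration")
lines(days, post_ci[1, ], lty = 2)   # 2.5% quantile
lines(days, post_ci[2, ], lty = 2)   # 97.5% quantile
```

Repeating the three `lines`/`plot` calls with a second colour for the other organism gives the six curves.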
Finally, estimate the posterior probability that the concentration of algae (measured in $10^6$
algae/ml, as in the data) becomes smaller than 0.1 at any time during these 20 additional days
(days 376-395).
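An event of the form "below a threshold on at least one day" can be estimated directly from posterior draws, as in the following minimal sketch. The matrix `log_algae_draws` is simulated and hypothetical. In practice it would contain the monitored log concentrations for days 376-395 from the fitted model.

```r
# `log_algae_draws`: hypothetical posterior draws of the log algae concentration,
# one row per MCMC draw and one column per day in 376-395 (simulated here).
set.seed(1)
log_algae_draws <- matrix(rnorm(5000 * 20, mean = log(0.3), sd = 0.5), nrow = 5000)

# The concentration drops below 0.1 on some day exactly when the minimum over
# days of the log concentration is below log(0.1).
below_any_day <- apply(log_algae_draws, 1, min) < log(0.1)
mean(below_any_day)   # Monte Carlo estimate of the posterior probability
```

Working per draw (row) rather than per day matters here, because the days within one draw are dependent under the model.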
Explanation: (Write your explanation here)
Problem 2 - Horse racing data
In this problem, we are going to construct a predictive model for horse races. The dataset
(races.csv and runs.csv) contains the information about 1000 horse races in Hong Kong during
the years 1997-1998 (originally from https://www.kaggle.com/gdaley/hkracing). Races.csv
contains information about each race (such as distance, venue, track conditions, etc.), while
runs.csv contains information about each horse participating in each race (such as finish
time in the race). Detailed description of all columns in these files is available in the file
horse_racing_data_info.txt.
Our goal is to model the mean speed of each horse during the races based on covariates
available before the race begins.
We are going to use INLA to fit several different regression models to this dataset. First, we
load INLA and the datasets and display the first few rows.
library(INLA)
## Loading required package: Matrix
## Loading required package: foreach
## Loading required package: parallel
## Loading required package: sp
## This is INLA_21.11.22 built 2021-11-21 16:10:15 UTC.
## - See www.r-inla.org/contact-us for how to get help.
## - Save 80Mb of storage running ’inla.prune()’
#If it loaded correctly, you should see this in the output:
#Loading required package: Matrix
#Loading required package: foreach
#Loading required package: parallel
#Loading required package: sp
#This is INLA_21.11.22 built 2021-11-21 16:13:28 UTC.
# - See www.r-inla.org/contact-us for how to get help.
# - To enable PARDISO sparse library; see inla.pardiso()
#The following code does the full installation. You can try it if INLA has not been installed.
#First installing some of the dependencies
# install.packages("BiocManager")
# BiocManager::install("Rgraphviz")
#if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#BiocManager::install("graph")
#Installing INLA
# install.packages("INLA",repos=c(getOption("repos"),INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
#library(INLA)
runs <- read.csv(file = 'runs.csv')
head(runs)
## race_id horse_no horse_id result won lengths_behind horse_age horse_country
## 1 0 1 3917 10 0 8.00 3 AUS
## 2 0 2 2157 8 0 5.75 3 NZ
## 3 0 3 858 7 0 4.75 3 NZ
## 4 0 4 1853 9 0 6.25 3 SAF
## 5 0 5 2796 6 0 3.75 3 GB
## 6 0 6 3296 3 0 1.25 3 NZ
## horse_type horse_rating horse_gear declared_weight actual_weight draw
## 1 Gelding 60 -- 1020 133 7
## 2 Gelding 60 -- 980 133 12
## 3 Gelding 60 -- 1082 132 8
## 4 Gelding 60 -- 1118 127 13
## 5 Gelding 60 -- 972 131 14
## 6 Gelding 60 -- 1114 127 5
## position_sec1 position_sec2 position_sec3 position_sec4 position_sec5
## 1 6 4 6 10 NA
## 2 12 13 13 8 NA
## 3 3 2 2 7 NA
## 4 8 8 11 9 NA
## 5 13 12 12 6 NA
## 6 11 11 5 3 NA
## position_sec6 behind_sec1 behind_sec2 behind_sec3 behind_sec4 behind_sec5
## 1 NA 2.00 2.00 1.50 8.00 NA
## 2 NA 6.50 9.00 5.00 5.75 NA
## 3 NA 1.00 1.00 0.75 4.75 NA
## 4 NA 3.50 5.00 3.50 6.25 NA
## 5 NA 7.75 8.75 4.25 3.75 NA
## 6 NA 5.00 7.75 1.25 1.25 NA
## behind_sec6 time1 time2 time3 time4 time5 time6 finish_time win_odds
## 1 NA 13.85 21.59 23.86 24.62 NA NA 83.92 9.7
## 2 NA 14.57 21.99 23.30 23.70 NA NA 83.56 16.0
## 3 NA 13.69 21.59 23.90 24.22 NA NA 83.40 3.5
## 4 NA 14.09 21.83 23.70 24.00 NA NA 83.62 39.0
## 5 NA 14.77 21.75 23.22 23.50 NA NA 83.24 50.0
## 6 NA 14.33 22.03 22.90 23.57 NA NA 82.83 7.0
## place_odds trainer_id jockey_id
## 1 3.7 118 2
## 2 4.9 164 57
## 3 1.5 137 18
## 4 11.0 80 59
## 5 14.0 9 154
## 6 1.8 54 34
races<- read.csv(file = 'races.csv')
head(races)
## race_id date venue race_no config surface distance going
## 1 0 1997-06-02 ST 1 A 0 1400 GOOD TO FIRM
## 2 1 1997-06-02 ST 2 A 0 1200 GOOD TO FIRM
## 3 2 1997-06-02 ST 3 A 0 1400 GOOD TO FIRM
## 4 3 1997-06-02 ST 4 A 0 1200 GOOD TO FIRM
## 5 4 1997-06-02 ST 5 A 0 1600 GOOD TO FIRM
## 6 5 1997-06-02 ST 6 A 0 1200 GOOD TO FIRM
## horse_ratings prize race_class sec_time1 sec_time2 sec_time3 sec_time4
## 1 40-15 485000 5 13.53 21.59 23.94 23.58
## 2 40-15 485000 5 24.05 22.64 23.70 NA
## 3 60-40 625000 4 13.77 22.22 24.88 22.82
## 4 120-95 1750000 1 24.33 22.47 22.09 NA
## 5 60-40 625000 4 25.45 23.52 23.31 23.56
## 6 60-40 625000 4 23.47 22.48 23.25 NA
## sec_time5 sec_time6 sec_time7 time1 time2 time3 time4 time5 time6 time7
## 1 NA NA NA 13.53 35.12 59.06 82.64 NA NA NA
## 2 NA NA NA 24.05 46.69 70.39 NA NA NA NA
## 3 NA NA NA 13.77 35.99 60.87 83.69 NA NA NA
## 4 NA NA NA 24.33 46.80 68.89 NA NA NA NA
## 5 NA NA NA 25.45 48.97 72.28 95.84 NA NA NA
## 6 NA NA NA 23.47 45.95 69.20 NA NA NA NA
## place_combination1 place_combination2 place_combination3 place_combination4
## 1 8 11 6 NA
## 2 5 13 4 NA
## 3 11 1 13 NA
## 4 5 3 10 NA
## 5 2 10 1 NA
## 6 9 14 8 NA
## place_dividend1 place_dividend2 place_dividend3 place_dividend4
## 1 36.5 25.5 18.0 NA
## 2 12.5 47.0 33.5 NA
## 3 23.0 23.0 59.5 NA
## 4 14.0 24.5 16.0 NA
## 5 15.5 28.0 17.5 NA
## 6 16.5 408.0 70.0 NA
## win_combination1 win_dividend1 win_combination2 win_dividend2
## 1 8 121.0 NA NA
## 2 5 23.5 NA NA
## 3 11 70.0 NA NA
## 4 5 52.0 NA NA
## 5 2 36.5 NA NA
## 6 9 61.0 NA NA
a)[10 marks] Create a dataframe that includes the mean speed of each horse in each race and
the distance of the race in a column [Hint: you can do this by adding two extra columns to the
runs dataframe].
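A minimal sketch of the lookup involved, on toy stand-ins for the two files: the first two runs and the two race distances are copied from the printouts above, while the third run (horse 100, finish time 70.10) is made up. The sketch assumes that distance is in metres and finish_time in seconds, which should be checked against horse_racing_data_info.txt.

```r
# Toy stand-ins for runs.csv and races.csv; the third run is hypothetical.
runs_toy  <- data.frame(race_id     = c(0, 0, 1),
                        horse_id    = c(3917, 2157, 100),
                        finish_time = c(83.92, 83.56, 70.10))
races_toy <- data.frame(race_id = c(0, 1), distance = c(1400, 1200))

# Look up each run's race distance via race_id, then divide by the finish time.
runs_toy$distance   <- races_toy$distance[match(runs_toy$race_id, races_toy$race_id)]
runs_toy$mean_speed <- runs_toy$distance / runs_toy$finish_time
runs_toy
```

Using `match` keeps the row order of the runs table unchanged, which is convenient when further columns are added later.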
Fit a linear regression model (lm) with the mean speed as the response variable. The covariates
should be the horse id as a categorical variable, and the race distance, horse rating, and
horse age as standard (non-categorical) variables. Scale the non-categorical covariates before
fitting the model (i.e. center and divide by their standard deviation; you can use the scale
function in R for this).
Print out the summary of the lm model and discuss the quality of the fit.
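The scaling and factor conversion can be sketched as follows on simulated data. All values and the helper column names (`horse_id_f`, the `_s` suffix) are hypothetical, and the toy response is generated so that speed decreases with distance.

```r
# Simulated stand-in for the augmented runs data frame (all values are made up).
set.seed(1)
n <- 200
toy <- data.frame(horse_id     = sample(1:20, n, replace = TRUE),
                  distance     = sample(c(1000, 1200, 1400, 1650), n, replace = TRUE),
                  horse_rating = sample(40:100, n, replace = TRUE),
                  horse_age    = sample(3:8, n, replace = TRUE))
toy$mean_speed <- 16.5 - 0.3 * as.numeric(scale(toy$distance)) + rnorm(n, sd = 0.2)

toy$horse_id_f <- factor(toy$horse_id)                    # categorical covariate
for (v in c("distance", "horse_rating", "horse_age")) {   # centre and scale numeric covariates
  toy[[paste0(v, "_s")]] <- as.numeric(scale(toy[[v]]))
}

fit_lm <- lm(mean_speed ~ horse_id_f + distance_s + horse_rating_s + horse_age_s, data = toy)
summary(fit_lm)$r.squared
```

Wrapping `scale()` in `as.numeric()` drops the matrix attributes it returns, which avoids surprises when the scaled columns are later passed to other modelling functions.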
Explanation: (Write your explanation here)
b)[10 marks] Fit the same model in INLA (i.e. Bayesian linear regression with Gaussian likelihood,
mean speed is the response variable, and the same covariates used with scaling for
the non-categorical covariates). Set a Gamma (0.1,0.1) prior for the precision, and Gaussian
priors with mean zero and variance 1000000 for all of the regression coefficients (including the
intercept).
Print out the summary of the INLA model. Compute the posterior mean of the variance
parameter $\sigma^2$. Plot the posterior density for the variance parameter $\sigma^2$. Compute the negative
sum log CPO (NSLCPO) and DIC values for this model (smaller values indicate better fit).
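INLA reports the Gaussian noise in terms of the precision $\tau$ rather than the variance $\sigma^2 = 1/\tau$, and the posterior mean of $1/\tau$ is not the reciprocal of the posterior mean of $\tau$. The base-R sketch below illustrates this by integrating $1/\tau$ against a marginal density on a grid. The Gamma(50, 2) density is a made-up stand-in for a posterior marginal, and `trapz` is a hypothetical helper. With a fitted INLA object, helpers such as `inla.emarginal` and `inla.tmarginal` applied to the precision marginal serve the same purpose.

```r
# Stand-in for a posterior marginal of the precision tau: a Gamma(shape = 50, rate = 2)
# density evaluated on a grid of x values with matching density values.
shape <- 50
rate  <- 2
tau_grid <- seq(5, 60, length.out = 2001)
dens     <- dgamma(tau_grid, shape = shape, rate = rate)

# Trapezoidal rule for the integral of f over the grid x.
trapz <- function(x, f) sum(diff(x) * (head(f, -1) + tail(f, -1)) / 2)

mean_var_numeric <- trapz(tau_grid, dens / tau_grid) / trapz(tau_grid, dens)  # E[1 / tau]
mean_var_exact   <- rate / (shape - 1)   # exact E[1 / tau] for a Gamma(shape, rate)
naive            <- rate / shape         # 1 / E[tau], which is smaller
c(numeric = mean_var_numeric, exact = mean_var_exact, naive = naive)
```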
Compute the standard deviation of the mean residuals (i.e. the differences between the posterior
mean of the fitted values and the true response variable).
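Once the relevant quantities have been extracted from a fitted model, the two summaries are one-liners, as the sketch below shows on made-up vectors. It assumes an INLA fit in which CPO and DIC were requested via `control.compute = list(cpo = TRUE, dic = TRUE)`, so that the CPO values and the posterior means of the fitted values are available in the result object. Here `cpo`, `fitted_mean` and `y` are simulated stand-ins.

```r
# Made-up stand-ins for quantities extracted from a fitted model:
# `cpo` for the per-observation CPO values, `fitted_mean` for the posterior means
# of the fitted values, and `y` for the observed response.
set.seed(1)
y           <- rnorm(50, mean = 16.5, sd = 0.3)
fitted_mean <- y + rnorm(50, sd = 0.2)
cpo         <- dnorm(y, mean = fitted_mean, sd = 0.3)

nslcpo     <- -sum(log(cpo))    # negative sum of log CPO
resid_mean <- y - fitted_mean   # mean residuals
sd_resid   <- sd(resid_mean)    # their standard deviation
c(NSLCPO = nslcpo, sd_resid = sd_resid)
```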
Discuss the results.
Explanation: (Write your explanation here)
c)[10 marks] In this question, we are going to improve the model in b) by using more informative
priors and more columns from the dataset.
First, using some publicly available information from the internet (Hint: use Google search)
find out about the typical speed of race horses in Hong Kong, and use this information to
construct a prior for the intercept. Explain the rationale for your choice.
Second, look through all of the information in the datasets that is available before the race
(Hint: you need to read the description horse_racing_data_info.txt for information about
the columns. The position, behind, result, won, and time-related columns are not available before
the race). Discuss your rationale for including some of these in the dataset (make sure to scale
them if they are non-categorical).
Feel free to try creating additional covariates such as polynomial or interaction terms (Hint:
this can be done using I() in the formula), and you can also try to use a different likelihood
(such as Student-t distribution).
Fit your new model in INLA (i.e. Bayesian linear regression, mean speed is the response
variable, and scaling done for the non-categorical covariates).
Print out the summary of the INLA model. Compute the negative sum log CPO (NSLCPO)
and DIC values for this model (smaller values indicate better fit).
Compute the standard deviation of the mean residuals (i.e. the differences between the posterior
mean of the fitted values and the true response variable).
Discuss the results and compare your model to the model from b).
Please only include your best performing model in the report.
Explanation: (Write your explanation here)
d)[10 marks] We are going to perform model checks to evaluate the fit of the two models in parts
b) and c) on the data.
Compute the studentized residuals for the Bayesian regression models from parts b) and c).
Perform a simple Q-Q plot on the studentized residuals. Plot the studentized residuals versus
their index, and also plot the studentized residuals against the posterior mean of the fitted
value (see Lecture 2). Discuss the results.
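The following sketch shows one classical definition of studentized residuals and the three plots, using a toy `lm` fit so that it runs in base R. The definition used in Lecture 2 may differ (for example by averaging over posterior samples), so this is an illustration under that assumption rather than the required method. With an INLA fit, `fitted_mean` and `sigma_hat` would be replaced by the corresponding posterior summaries.

```r
# Toy regression standing in for the Bayesian fit (all values simulated).
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x)

X           <- model.matrix(fit)                    # design matrix
H           <- X %*% solve(crossprod(X)) %*% t(X)   # hat matrix
fitted_mean <- as.numeric(fitted(fit))
sigma_hat   <- summary(fit)$sigma

# Residual divided by its estimated standard deviation
stud_res <- (y - fitted_mean) / (sigma_hat * sqrt(1 - diag(H)))

qqnorm(stud_res); qqline(stud_res)                    # Q-Q plot
plot(seq_along(stud_res), stud_res, xlab = "index")   # residuals vs index
plot(fitted_mean, stud_res, xlab = "fitted value")    # residuals vs fitted values
```

For this toy `lm` fit the result coincides with `rstandard(fit)`, which is a convenient sanity check of the formula.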
Explanation: (Write your explanation here)
e)[10 marks] In this question, we are going to use the model you have constructed in part
c) to predict a new race, i.e. calculate the posterior probabilities of each participating horse
winning that race. First, we load the dataset containing information about the future race.
race_to_predict <- read.csv(file = 'race_to_predict.csv')
race_to_predict
## race_id date venue race_no config surface distance going horse_ratings
## 1 1000 1998-09-18 ST 2 B+2 0 1400 GOOD 40-15
## prize race_class sec_time1 sec_time2 sec_time3 sec_time4 sec_time5 sec_time6
## 1 485000 5 NA NA NA NA NA NA
## sec_time7 time1 time2 time3 time4 time5 time6 time7 place_combination1
## 1 NA NA NA NA NA NA NA NA 5
## place_combination2 place_combination3 place_combination4 place_dividend1
## 1 7 8 NA 27.5
## place_dividend2 place_dividend3 place_dividend4 win_combination1
## 1 43 57 NA 5
## win_dividend1 win_combination2 win_dividend2
## 1 86 NA NA
runs_to_predict <- read.csv(file = 'runs_to_predict.csv')
runs_to_predict
## race_id horse_no horse_id result won lengths_behind horse_age horse_country
## 1 1000 1 3940 NA NA NA 3 NZ
## 2 1000 2 474 NA NA NA 3 NZ
## 3 1000 3 3647 NA NA NA 3 NZ
## 4 1000 4 144 NA NA NA 3 AUS
## 5 1000 5 3712 NA NA NA 3 AUS
## 6 1000 6 3734 NA NA NA 3 AUS
## 7 1000 7 1988 NA NA NA 3 AUS
## 8 1000 8 3247 NA NA NA 3 AUS
## 9 1000 9 4320 NA NA NA 3 NZ
## 10 1000 10 1077 NA NA NA 3 NZ
## 11 1000 11 3916 NA NA NA 3 AUS
## 12 1000 12 768 NA NA NA 3 NZ
## 13 1000 13 3164 NA NA NA 3 SAF
## 14 1000 14 498 NA NA NA 3 AUS
## horse_type horse_rating horse_gear declared_weight actual_weight draw
## 1 Gelding 60 -- 1148 133 7
## 2 Gelding 60 -- 1039 122 4
## 3 Gelding 60 -- 1064 129 5
## 4 Gelding 60 -- 1086 131 2
## 5 Gelding 60 -- 1101 128 6
## 6 Gelding 60 -- 1137 130 8
## 7 Gelding 60 -- 1063 122 11
## 8 Gelding 60 -- 1092 126 10
## 9 Gelding 60 -- 1096 126 13
## 10 Gelding 60 -- 1034 123 9
## 11 Gelding 60 -- 1125 124 1
## 12 Gelding 60 -- 1191 123 3
## 13 Gelding 60 -- 1059 120 14
## 14 Gelding 60 -- 1027 112 12
## position_sec1 position_sec2 position_sec3 position_sec4 position_sec5
## 1 9 6 6 14 NA
## 2 4 4 4 4 NA
## 3 5 3 3 12 NA
## 4 6 8 7 5 NA
## 5 1 2 1 1 NA
## 6 10 9 8 6 NA
## 7 3 1 2 2 NA
## 8 2 5 5 3 NA
## 9 14 13 13 9 NA
## 10 7 7 9 13 NA
## 11 12 11 10 8 NA
## 12 13 14 14 10 NA
## 13 8 10 11 7 NA
## 14 11 12 12 11 NA
## position_sec6 behind_sec1 behind_sec2 behind_sec3 behind_sec4 behind_sec5
## 1 NA 2.75 2.75 3.00 11.00 NA
## 2 NA 1.00 1.75 2.00 2.25 NA
## 3 NA 1.25 1.25 1.50 7.50 NA
## 4 NA 2.00 4.25 3.75 2.75 NA
## 5 NA 0.50 0.15 0.50 1.50 NA
## 6 NA 2.75 4.25 3.75 3.50 NA
## 7 NA 0.75 0.15 0.50 1.50 NA
## 8 NA 0.50 2.50 2.50 2.00 NA
## 9 NA 5.25 6.75 6.75 3.75 NA
## 10 NA 2.25 3.50 4.25 9.50 NA
## 11 NA 3.75 5.50 5.25 3.75 NA
## 12 NA 5.25 7.50 7.25 6.75 NA
## 13 NA 2.75 5.00 5.25 3.50 NA
## 14 NA 3.75 6.25 6.50 7.25 NA
## behind_sec6 time1 time2 time3 time4 time5 time6 finish_time win_odds
## 1 NA NA NA NA NA NA NA NA 55.0
## 2 NA NA NA NA NA NA NA NA 4.6
## 3 NA NA NA NA NA NA NA NA 11.0
## 4 NA NA NA NA NA NA NA NA 3.8
## 5 NA NA NA NA NA NA NA NA 8.6
## 6 NA NA NA NA NA NA NA NA 5.9
## 7 NA NA NA NA NA NA NA NA 12.0
## 8 NA NA NA NA NA NA NA NA 20.0
## 9 NA NA NA NA NA NA NA NA 21.0
## 10 NA NA NA NA NA NA NA NA 57.0
## 11 NA NA NA NA NA NA NA NA 26.0
## 12 NA NA NA NA NA NA NA NA 18.0
## 13 NA NA NA NA NA NA NA NA 27.0
## 14 NA NA NA NA NA NA NA NA 62.0
## place_odds trainer_id jockey_id
## 1 17.0 38 138
## 2 1.7 47 31
## 3 2.8 54 34
## 4 1.5 138 57
## 5 2.7 75 131
## 6 2.2 29 18
## 7 4.3 7 63
## 8 5.7 69 151
## 9 5.3 109 145
## 10 17.0 117 38
## 11 6.2 128 125
## 12 5.4 97 49
## 13 8.0 55 91
## 14 17.0 63 155
Based on your model from part c), compute the posterior probabilities of each of these 14 horses
winning the race. [Hint: you will need to sample from the posterior predictive distribution.]
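Given posterior predictive draws of the mean speed for all 14 horses, the winning probabilities follow by counting how often each horse is fastest. The sketch below uses simulated draws with made-up predictive means, and the names `speed_draws`, `winner` and `win_prob` are hypothetical. With INLA, joint draws could for instance be obtained with `inla.posterior.sample` (which requires `config = TRUE` in `control.compute`), with observation noise added to the sampled linear predictors.

```r
# `speed_draws` stands in for posterior predictive draws of the mean speed:
# one row per draw, one column per horse (simulated here, not from a fitted model).
set.seed(1)
n_horses    <- 14
true_mean   <- seq(16.4, 16.9, length.out = n_horses)   # made-up predictive means
speed_draws <- sapply(true_mean, function(m) rnorm(10000, mean = m, sd = 0.25))

# All horses run the same distance, so the highest mean speed wins each simulated race.
winner   <- max.col(speed_draws, ties.method = "first")   # index of the winner per draw
win_prob <- tabulate(winner, nbins = n_horses) / nrow(speed_draws)
round(win_prob, 3)
```

Because exactly one horse wins in each simulated race, the estimated probabilities sum to one by construction.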
Explanation: (Write your explanation here)