Applications of Data Science and Statistical Modelling
Assignment 4
29/11/2019
The dataset SubstationRPD.RData contains the real power delivered (kW) in each 10-minute period of every
day during June and July, for 410 substations in the southwest of Wales, UK. The aim of this assignment is
to understand how power demand changes throughout the day, identify any weekly or monthly patterns that
are present, and use this information to fit a GAM that allows us to predict future demand. Note that in order
to fit a GAM you will need to have the mgcv package installed.
1. [3 marks] Produce summaries of the dataset SubstationRPD.RData and produce histograms showing
the distributions of real power delivered for the 410 substations. Comment on the distributions of
real power delivered, and on any variation in those distributions between substations. (For example, you
could choose a specific 10 minute interval, say the 10 minute window after midnight, and plot the
distribution of the power demand across the substations, or look at average daily demands, maximum
daily demands, and so on.)
2. [3 marks] For each substation, calculate the average demand for each 10 minute period (that is, you
should average over the days) and then plot these on the same plot, using a different colour for each
substation. Add a thick, black line showing the overall mean demand across all of the substations.
Comment on the variability in patterns between substations. Does the overall mean seem a reasonable
summary of all the data? (Hint: Since we are plotting 410 separate curves, you might want to suppress
the legend, which can be done using the ggplot option `theme(legend.position = "none")`.)
[Figure: Average Daily Demand (0 to 400) against Time (00:00 to 23:50), panel "All days"]
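The summaries and per-substation profiles asked for in Questions 1 and 2 could be sketched roughly as follows. The layout of SubstationRPD.RData is not described here, so the toy data frame, its column names (`Substation`, `Date`, `t1` to `t144`) and its values are all assumptions made for illustration. Base graphics are used instead of ggplot only to keep the sketch free of extra packages.

```r
# Toy stand-in for SubstationRPD.RData. The layout is an ASSUMPTION:
# one row per substation and day, columns t1..t144 for the 10-minute intervals.
set.seed(1)
n.sub <- 4
dates <- seq(as.Date("2012-06-01"), by = "day", length.out = 7)
int.cols <- paste0("t", 1:144)
vals <- matrix(rgamma(n.sub * length(dates) * 144, shape = 2, scale = 30),
               ncol = 144, dimnames = list(NULL, int.cols))
toy <- cbind(data.frame(Substation = rep(1:n.sub, each = length(dates)),
                        Date = rep(dates, n.sub)),
             as.data.frame(vals))

# Question 1 style summaries: first interval after midnight, and daily maxima
summary(toy$t1)
daily.max <- apply(as.matrix(toy[int.cols]), 1, max)
hist(daily.max, main = "Maximum daily demand",
     xlab = "Real power delivered (kW)")

# Question 2 style profiles: average over days within each substation
sub.means <- aggregate(toy[int.cols],
                       by = list(Substation = toy$Substation), FUN = mean)
profiles <- t(as.matrix(sub.means[int.cols]))  # 144 rows, one column per substation

# Mean of the substation profiles; this equals the overall mean here
# because every toy substation has the same number of days.
overall <- rowMeans(profiles)

matplot(1:144, profiles, type = "l", lty = 1, col = rainbow(n.sub),
        xlab = "10-minute interval", ylab = "Average daily demand (kW)")
lines(1:144, overall, lwd = 3, col = "black")
```

With the real data the same steps would apply to 410 substations, provided the columns are renamed to match.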
3. [3 marks] Split your plot in Question 2 into four separate plots representing: 1) All days, 2) Weekdays,
3) Saturdays and 4) Sundays. Are there any differences in patterns between days? (Hint: You might
find the `weekdays` function useful.)
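As a small illustration of the hint, the sketch below labels each date in June and July 2012 as a weekday, Saturday or Sunday. The variable names are hypothetical, and the date range assumes the data cover every day of both months. Because `weekdays` returns day names in the current locale, the Saturday and Sunday labels are derived from two known dates rather than hard-coded in English.

```r
# All dates in June and July 2012 (assumed to match the data's coverage)
dates <- seq(as.Date("2012-06-01"), as.Date("2012-07-31"), by = "day")
wd <- weekdays(dates)

# 2012-06-02 was a Saturday and 2012-06-03 a Sunday; using them as
# reference keeps the comparison independent of the session locale.
sat <- weekdays(as.Date("2012-06-02"))
sun <- weekdays(as.Date("2012-06-03"))

day.type <- ifelse(wd == sat, "Saturday",
                   ifelse(wd == sun, "Sunday", "Weekday"))
table(day.type)

# The labels can then be used to subset the data before averaging,
# for example the dates that belong to each panel:
dates.by.type <- split(dates, day.type)
```

Each element of `dates.by.type` could then select the rows used for one of the four plots.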
Now that we understand how the demand changes throughout the day, and have identified some seasonal
patterns, the next step is to fit a GAM to our data:
4. [2 marks] First, reformat the SubstationRPD.RData dataset so that each row holds the demand
averaged over all substations for one day. That is, each row corresponds to one day, and each column
contains the average demand (across all substations) for the corresponding 10 minute period.
5. [10 marks] Add a column with the day of the month, and another one with the month of the year. Note
that you can access these using the following R code:
as.numeric(substr(Date,9,10)) # day
as.numeric(substr(Date,6,7)) # month
Next, collapse the data so that the previously calculated mean power demands are in a single column, instead
of separate columns. By this point you should have a dataset similar to the following:
# A tibble: 6 x 6
# Groups: Date, weekdays [1]
  Date       weekdays minute.int  mean   day month
1 2012-06-01 Friday            1  56.7     1     6
2 2012-06-01 Friday            2  57.0     1     6
3 2012-06-01 Friday            3  56.6     1     6
4 2012-06-01 Friday            4  55.7     1     6
5 2012-06-01 Friday            5  55.5     1     6
6 2012-06-01 Friday            6  54.9     1     6
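One possible route from the substation-level data to a long table of this shape is sketched below in base R. The input layout (columns `Substation`, `Date`, `t1` to `t144`) and the toy values are assumptions, not the real structure of the file; tidyr's `pivot_longer` would be a reasonable alternative for the collapsing step.

```r
# Toy input in an ASSUMED layout: one row per substation and day,
# columns t1..t144 for the 10-minute intervals.
set.seed(3)
dates <- seq(as.Date("2012-06-01"), by = "day", length.out = 3)
int.cols <- paste0("t", 1:144)
vals <- matrix(runif(2 * 3 * 144, 20, 100), ncol = 144,
               dimnames = list(NULL, int.cols))
toy <- cbind(data.frame(Substation = rep(1:2, each = 3),
                        Date = rep(dates, 2)),
             as.data.frame(vals))

# One row per day: average each interval across substations
daily <- aggregate(toy[int.cols], by = list(Date = toy$Date), FUN = mean)

# Collapse to long format: one row per day and interval.
# t() puts intervals in rows, so as.vector() runs through day 1's
# 144 intervals first, then day 2's, and so on.
long <- data.frame(
  Date = rep(daily$Date, each = 144),
  minute.int = rep(1:144, times = nrow(daily)),
  mean = as.vector(t(as.matrix(daily[int.cols])))
)
long$weekdays <- weekdays(long$Date)
long$day <- as.numeric(substr(long$Date, 9, 10))
long$month <- as.numeric(substr(long$Date, 6, 7))
head(long)
```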
Fit and plot a GAM which accounts for the underlying seasonal pattern in demands. You should decide which
seasonal patterns are appropriate to include: daily (use the minute.int column in the above dataset), weekly
(use the day column in the above dataset) or monthly (use the month column in the above dataset). Comment
on the fit of the model. What are the (effective) degrees of freedom, and what does this tell us about the
complexity of the model that has been fit?
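For orientation, a GAM of this kind might be fitted with mgcv along the following lines. This is only a sketch on synthetic data: the simulated demand curve, the choice of a cyclic cubic basis (`bs = "cc"`) for the within-day term and the boundary knots are illustrative assumptions, not a recommended answer for the real dataset.

```r
library(mgcv)

# Synthetic long-format data (ASSUMED): a daily cycle, a mild trend
# over the day of the month, and Gaussian noise.
set.seed(5)
dat <- expand.grid(minute.int = 1:144, day = 1:14)
dat$mean <- 60 + 15 * sin(2 * pi * dat$minute.int / 144) +
  0.3 * dat$day + rnorm(nrow(dat), sd = 2)

# Cyclic smooth for the within-day pattern, so that the end of one day
# joins up with the start of the next; ordinary smooth for the day.
fit <- gam(mean ~ s(minute.int, bs = "cc") + s(day),
           data = dat,
           knots = list(minute.int = c(0.5, 144.5)))

summary(fit)              # edf per smooth term, adjusted R-squared
total.edf <- sum(fit$edf) # total effective degrees of freedom
plot(fit, pages = 1)      # estimated smooths with confidence bands
```

An effective degrees of freedom value close to 1 for a term indicates a nearly linear effect, while larger values indicate a more flexible fitted curve.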
6. [4 marks] Choose an appropriate model and use it to predict the demand for the 21st to the 28th of July.
Take the daily average demand, and produce a plot showing these mean predictions against time. You
can use the following code to create a new dataset for the prediction. Note that, depending on how you
named the columns of your dataset, you might have to modify the column names in the following code:
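# Prediction grid: all 144 ten-minute intervals for each day, 21-28 July
new.data <- data.frame(matrix(c(rep(1:144,8),rep(21:28,each=144),
rep(7,1152)),nrow=1152,ncol=3,byrow=FALSE))
# each=144 keeps every date aligned with its own block of 144 intervals
new.data$Date <- rep(seq(as.Date("2012-07-21"),
as.Date("2012-07-28"),"days"),each=144)
names(new.data) <- c("minute.int","day","month","Date")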
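The prediction step could then look something like the sketch below. The training data are synthetic and the model formula is an assumption chosen for illustration. The prediction grid is built so that each date is paired with all 144 intervals, and `predict` with `newdata` is the standard mgcv route for out-of-sample values.

```r
library(mgcv)

# Synthetic training data (ASSUMED): 1 June to 20 July 2012
set.seed(6)
train.dates <- seq(as.Date("2012-06-01"), as.Date("2012-07-20"), by = "day")
train <- data.frame(Date = rep(train.dates, each = 144),
                    minute.int = rep(1:144, times = length(train.dates)))
train$day <- as.numeric(substr(train$Date, 9, 10))
train$month <- as.numeric(substr(train$Date, 6, 7))
train$mean <- 60 + 15 * sin(2 * pi * train$minute.int / 144) +
  0.2 * train$day + rnorm(nrow(train), sd = 2)

fit <- gam(mean ~ s(minute.int, bs = "cc") + s(day), data = train,
           knots = list(minute.int = c(0.5, 144.5)))

# Prediction grid: every interval of every day from 21 to 28 July
pred.dates <- seq(as.Date("2012-07-21"), as.Date("2012-07-28"), by = "day")
new.data <- data.frame(minute.int = rep(1:144, times = 8),
                       day = rep(21:28, each = 144),
                       month = 7,
                       Date = rep(pred.dates, each = 144))

new.data$pred <- as.numeric(predict(fit, newdata = new.data))

# Daily average of the predictions, plotted against time
daily.pred <- aggregate(pred ~ Date, data = new.data, FUN = mean)
plot(daily.pred$Date, daily.pred$pred, type = "b",
     xlab = "Date", ylab = "Predicted mean demand (kW)")
```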
All the exercises should be solved using R. A pdf document with your answers, (commented)
R code and its outputs/plots should be submitted via ELE by Noon (12pm), 18th December.
Note that late submissions will be penalised.