SIT114 2020.T1: Task 6.4HD
Contents
1 To Do
2 Hint
3 Further Reading
4 Artefacts
5 Intended Learning Outcomes
1 To Do
Create a single RMarkdown report where you perform what follows.
1. Just as in task 6.3D, load the Wine Quality dataset.
wines <- read.csv("winequality-all.csv",
comment="#", stringsAsFactors=FALSE)
2. Then, add a new 0/1 column named quality again (quality equal to 1 if
and only if a wine is ranked 7 or higher).
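A minimal sketch of this step on a toy data frame. The name of the column holding the original ranks, here `response`, is an assumption; use whatever name the column has in your copy of the dataset.

```r
# Toy stand-in for the real data; `response` is an assumed column name.
wines_toy <- data.frame(alcohol=c(9.4, 12.8, 11.0, 10.5),
    response=c(5, 7, 8, 6))
# 1 if and only if the wine is ranked 7 or higher, 0 otherwise
wines_toy$quality <- as.numeric(wines_toy$response >= 7)
print(wines_toy$quality)
```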
3. Perform a random train-test split of size 60-40% – create the matrices
X_train and X_test and the corresponding label vectors Y_train and
Y_test that provide the information on the wines’ quality.
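One possible way to perform such a split, sketched on toy data. The seed and the toy matrix `X` and vector `Y` are placeholders for illustration only, not part of the task.

```r
set.seed(123)  # arbitrary seed, for reproducibility only
n <- 10
X <- matrix(rnorm(n*3), nrow=n)  # toy stand-in for the 11 features
Y <- rep(c(0, 1), length.out=n)  # toy stand-in for the 0/1 quality labels
# 60% of the row indices, chosen at random
train_idx <- sample(n, size=round(0.6*n))
X_train <- X[train_idx, , drop=FALSE]
Y_train <- Y[train_idx]
X_test <- X[-train_idx, , drop=FALSE]
Y_test <- Y[-train_idx]
```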
4. Your task is to determine the best (see below) parameter setting for the
K-nearest neighbour classification of the quality variable based on the 11
physicochemical features. Perform the so-called grid (exhaustive) search
over all the possible combinations of the following parameters:
a. K: 1, 3, 5, 7 or 9
b. preprocessing: none (raw input data), standardised variables or robustly
standardised variables
c. metric: L2 (Euclidean) or L1 (Manhattan)
In other words, there are 5*3*2=30 combinations of parameters in total,
and hence – 30 different scenarios to consider.
By robust standardisation we mean: from each column, subtract
its median and then divide by its median absolute deviation
(MAD, i.e., median(abs(x-median(x)))). This data preprocessing
scheme is less sensitive to outliers than the classic standardisation.
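The definition above can be sketched as follows. The function name `robust_standardise` and its `X_ref` argument are hypothetical, and the code assumes numeric matrices. Note that R's built-in mad() multiplies the result by 1.4826 by default, so the MAD is computed here directly from the formula in the text.

```r
# Robustly standardise the columns of X using the medians and MADs
# of the columns of X_ref (by default, X itself).
robust_standardise <- function(X, X_ref=X) {
    meds <- apply(X_ref, 2, median)
    # unscaled MAD, as defined in the text
    mads <- apply(X_ref, 2, function(x) median(abs(x-median(x))))
    # subtract the medians, then divide each column by its MAD
    sweep(sweep(X, 2, meds, "-"), 2, mads, "/")
}

X <- cbind(c(1, 2, 3, 4, 100), c(10, 20, 30, 40, 50))
Z <- robust_standardise(X)
print(Z)
```

In the first toy column, the outlier 100 does not affect the median (3) or the MAD (1). A column whose MAD equals 0 would lead to a division by zero; this sketch does not handle that case.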
Note that the L1 metric-based K-nearest neighbour method is
not implemented in the FNN package. You need to implement it
on your own (see Chapter 3 of LMLCR).
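For illustration only, here is a naive brute-force sketch of an L1-based K-nearest neighbour classifier. It is not the implementation from LMLCR. The function name `knn_l1` is hypothetical, the labels are assumed to be 0/1 numeric values, and K is assumed to be odd so that no ties occur in the vote.

```r
# Classify each row of X_test by majority vote amongst its K nearest
# training points with respect to the L1 (Manhattan) metric.
knn_l1 <- function(X_train, Y_train, X_test, K) {
    Y_pred <- numeric(nrow(X_test))
    for (i in seq_len(nrow(X_test))) {
        # L1 distances between the i-th test point and all training points
        d <- colSums(abs(t(X_train) - X_test[i, ]))
        nearest <- order(d)[1:K]
        # predict 1 if more than half of the K neighbours are labelled 1
        Y_pred[i] <- as.numeric(mean(Y_train[nearest]) > 0.5)
    }
    Y_pred
}

X_train <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
Y_train <- c(0, 0, 0, 1, 1, 1)
X_test <- rbind(c(0.2, 0.2), c(5.5, 5.5))
Y_pred <- knn_l1(X_train, Y_train, X_test, K=3)
print(Y_pred)
```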
By the best classifier we mean the one that maximises the F-measure
obtained by the so-called 5-fold cross-validation.
In Chapter 3 we discussed that it would not be fair to use the test set for
choosing the optimal parameters (we would be overfitting to the test
set). We know that one possible way to ensure a transparent evaluation
of a classifier is to perform a train-validate-test split and use the validation
set for parameter tuning.
Here we will use a different technique – one that estimates the methods’
“true” predictive performance more accurately, yet at the cost of significantly
increased run-time. Namely, in 5-fold cross-validation, we split the original
train set randomly into 5 disjoint parts: A, B, C, D, E (each with roughly
the same number of observations). Each combination of 4 chunks serves in
turn as the training set and the remaining chunk as the validation set, on
which we compute the F-measure:
train set     validation set   F-measure
B, C, D, E    A                F_A
A, C, D, E    B                F_B
A, B, D, E    C                F_C
A, B, C, E    D                F_D
A, B, C, D    E                F_E
Finally, we report the average F-measure, (F_A + F_B + F_C + F_D + F_E)/5.
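The cross-validation scheme described above could be sketched as a generic helper. The names `cross_validate`, `fit_predict` and `score` are hypothetical. The dummy classifier and the accuracy score in the usage example only keep the sketch runnable; in the task itself one would plug in a K-nearest neighbour classifier and the F-measure.

```r
# Average a validation score over nfolds folds.
# fit_predict(X_tr, Y_tr, X_val) must return predicted labels for X_val;
# score(Y_pred, Y_val) must return a single number.
cross_validate <- function(X, Y, fit_predict, score, nfolds=5) {
    n <- nrow(X)
    # random fold labels; fold sizes differ by at most 1
    fold <- sample(rep(seq_len(nfolds), length.out=n))
    scores <- numeric(nfolds)
    for (k in seq_len(nfolds)) {
        val <- (fold == k)  # the k-th chunk is the validation set
        Y_pred <- fit_predict(X[!val, , drop=FALSE], Y[!val],
            X[val, , drop=FALSE])
        scores[k] <- score(Y_pred, Y[val])
    }
    mean(scores)
}

set.seed(123)  # arbitrary seed
X <- matrix(rnorm(40), ncol=2)       # 20 toy observations
Y <- rep(c(0, 1), times=c(15, 5))    # 15 zeros, 5 ones
always_zero <- function(X_tr, Y_tr, X_val) rep(0, nrow(X_val))
accuracy <- function(Y_pred, Y_val) mean(Y_pred == Y_val)
cv_acc <- cross_validate(X, Y, always_zero, accuracy)
print(cv_acc)
```

As the 5 folds here have exactly 4 observations each, the averaged accuracy of the dummy classifier equals the overall proportion of zeros, 0.75, whatever the random fold assignment.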
5. Report the best scenario (out of 30) together with the corresponding
classifier’s accuracy, precision, recall and F-measure on the test set.
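The four measures can be computed from the counts of true/false positives and negatives, as in the sketch below. The function name `get_metrics` is hypothetical; both vectors are assumed to consist of 0s and 1s, with 1 denoting the positive class.

```r
get_metrics <- function(Y_pred, Y_true) {
    TP <- sum(Y_pred == 1 & Y_true == 1)  # true positives
    TN <- sum(Y_pred == 0 & Y_true == 0)  # true negatives
    FP <- sum(Y_pred == 1 & Y_true == 0)  # false positives
    FN <- sum(Y_pred == 0 & Y_true == 1)  # false negatives
    precision <- TP/(TP+FP)
    recall <- TP/(TP+FN)
    c(Accuracy=(TP+TN)/(TP+TN+FP+FN),
        Precision=precision,
        Recall=recall,
        # harmonic mean of precision and recall
        F=2*precision*recall/(precision+recall))
}

Y_true <- c(1, 1, 1, 0, 0, 0, 0, 0)
Y_pred <- c(1, 1, 0, 1, 0, 0, 0, 0)
m <- get_metrics(Y_pred, Y_true)
print(m)
```

Here TP=2, TN=4, FP=1 and FN=1, so the accuracy is 0.75 and the precision, recall and F-measure are all 2/3. If a classifier predicts no positives at all, the precision is 0/0 and R returns NaN; this sketch does not handle that case.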
Make sure that the report has a readable structure. Divide the document into
sections. Before each code chunk, explain what purpose it serves.
Side note: If you want a real challenge (this is definitely not obligatory),
you can add another level of complexity: select the best
combination of the input variables, e.g., amongst all the possible
pairs or triples of columns in the dataset.
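To get a feeling for the size of this optional search, all pairs and triples of columns can be enumerated with combn(). The column names below are placeholders and may differ from those in the actual dataset.

```r
# hypothetical subset of feature names, for illustration only
features <- c("alcohol", "pH", "sulphates", "density")
pairs <- combn(features, 2)    # one pair per column
triples <- combn(features, 3)  # one triple per column
print(ncol(pairs))
print(ncol(triples))
# number of pairs and triples amongst all 11 features
n_subsets <- choose(11, 2) + choose(11, 3)
print(n_subsets)
```

With 11 features there are 55 pairs and 165 triples, i.e., 220 subsets, each of which would have to be combined with the 30 parameter scenarios.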
2 Hint
A grid search can be implemented based on a triply-nested for loop:
Ks <- c(1, 3, 5, 7, 9)
Ps <- c("none", "standardised", "robstandardised")
Ms <- c("l2", "l1")
for (K in Ks) {
    for (preprocessing in Ps) {
        for (metric in Ms) {
            if (preprocessing == "standardised") {
                # ...
            } else if (preprocessing == "robstandardised") {
                # ...
            } else {
                # ...
            }
            if (metric == "l2") {
                # ...
            } else {
                # ...
            }
        }
    }
}
Alternatively, you can go through every row in the following matrix and process
each thus defined scenario:
expand.grid(Ks, Ps, Ms)
##    Var1            Var2 Var3
## 1     1            none   l2
## 2     3            none   l2
## 3     5            none   l2
## 4     7            none   l2
## 5     9            none   l2
## 6     1    standardised   l2
## 7     3    standardised   l2
## 8     5    standardised   l2
## 9     7    standardised   l2
## 10    9    standardised   l2
## 11    1 robstandardised   l2
## 12    3 robstandardised   l2
## 13    5 robstandardised   l2
## 14    7 robstandardised   l2
## 15    9 robstandardised   l2
## 16    1            none   l1
## 17    3            none   l1
## 18    5            none   l1
## 19    7            none   l1
## 20    9            none   l1
## 21    1    standardised   l1
## 22    3    standardised   l1
## 23    5    standardised   l1
## 24    7    standardised   l1
## 25    9    standardised   l1
## 26    1 robstandardised   l1
## 27    3 robstandardised   l1
## 28    5 robstandardised   l1
## 29    7 robstandardised   l1
## 30    9 robstandardised   l1
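Going through the rows of such a grid could look roughly as follows. The column names `K`, `preprocessing`, `metric` and `F` are assumptions made for readability, and the loop body is left as a placeholder. Passing stringsAsFactors=FALSE keeps the parameter names as plain character strings, so that comparisons such as metric == "l2" behave as expected.

```r
Ks <- c(1, 3, 5, 7, 9)
Ps <- c("none", "standardised", "robstandardised")
Ms <- c("l2", "l1")
scenarios <- expand.grid(K=Ks, preprocessing=Ps, metric=Ms,
    stringsAsFactors=FALSE)
scenarios$F <- NA_real_  # to be filled with cross-validated F-measures
for (i in seq_len(nrow(scenarios))) {
    K <- scenarios$K[i]
    preprocessing <- scenarios$preprocessing[i]
    metric <- scenarios$metric[i]
    # placeholder: compute the 5-fold cross-validated F-measure
    # for this scenario and store it in scenarios$F[i]
}
# once F is filled in, the best scenario is the row
# scenarios[which.max(scenarios$F), ]
```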
3 Further Reading
See Section 5.1 of the book by James G et al. 2017. An introduction to statistical
learning with applications in R. Springer-Verlag.
http://faculty.marshall.usc.edu/gareth-james/ISL
4 Artefacts
Submit two files via OnTrack:
1. the Rmd file (RMarkdown report),
2. the resulting PDF file that is generated by clicking Knit Document to PDF
in RStudio; if you are unable to generate the PDF file directly, convert the
report to HTML or Word, and manually export the resulting file to PDF.
5 Intended Learning Outcomes
ULO                                    Related
ULO1 (Methods)                         YES
ULO2 (Problems)                        YES
ULO3 (Implementation and Evaluation)   YES
ULO4 (Communication)                   YES
ULO5 (Impact)                          YES