
1. (Wine Data Set) These data are the results of a chemical analysis of wines grown in
the same region in Italy but derived from three different cultivars. The analysis determined
the quantities of 13 constituents (including Alcohol, Malic acid, Ash, Alcalinity of
ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins,
Color intensity, Hue, OD280/OD315 of diluted wines, and Proline) found in each of the
three types of wines. The sample size is 178. The dataset is available on the course site. The
main interest of this dataset is the multiclass classification of the three types of wines. Let ŷ
denote the predicted class of an observation.
(a) Use nominal logistic regression in Section 2.3 to carry out the multiclass classification. The R
function is multinom. In addition, summarize the confusion table for y and ŷ, use macro-averaged
metrics to evaluate recall, precision, and F-measure, and then assess the classification
performance.
(b) Use linear discriminant analysis and quadratic discriminant analysis to
obtain ŷ. In addition, summarize the confusion table for y and ŷ, use macro-averaged
metrics to evaluate recall, precision, and F-measure, and then assess the classification performance.
(c) Use the support vector machine method to obtain ŷ. In addition, summarize the confusion
table for y and ŷ, use macro-averaged metrics to evaluate recall, precision, and F-measure,
and then assess the classification performance.
(d) Summarize your findings in (a)-(c).
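The macro-averaged metrics requested in parts (a)-(c) can be sketched as follows. This is a minimal illustration in Python, not the course's reference solution: the function name macro_metrics and the toy confusion table are hypothetical, rows are assumed to hold the true class y and columns the predicted class ŷ, and macro F-measure is taken here as the unweighted mean of the per-class F-measures (the course notes may instead define it from the macro-averaged precision and recall).

```python
def macro_metrics(confusion):
    """Macro-averaged precision, recall and F-measure.

    Assumption: confusion[i][j] counts observations whose true class is i
    and whose predicted class is j.
    """
    k = len(confusion)
    precisions, recalls, f_measures = [], [], []
    for c in range(k):
        true_pos = confusion[c][c]
        predicted_c = sum(confusion[r][c] for r in range(k))  # column total
        actual_c = sum(confusion[c])                          # row total
        precision = true_pos / predicted_c if predicted_c else 0.0
        recall = true_pos / actual_c if actual_c else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_measures.append(f_measure)
    return {
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f_measure": sum(f_measures) / k,
    }


# Toy 3-class table with made-up counts (not results for the wine data).
toy_table = [
    [5, 1, 0],
    [1, 4, 1],
    [0, 0, 6],
]
toy_result = macro_metrics(toy_table)
```

Because every class receives equal weight, a small class that is classified poorly lowers the macro averages as much as a large one would, which is worth keeping in mind when comparing the methods in (d).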
2. (Simulation studies) Consider the following linear model:
y = X1β1 + X2β2 + X3β3 + X4β4 − 4√ρ X5β5 + ε,    (1)
where X = (X1, …, Xp) is a p-dimensional vector of covariates and each Xk is generated
from N(0, 1). Every pair of covariates other than X5 has correlation ρ, while X5 has correlation √ρ
with each of the other p − 1 variables. Suppose that the sample size is n = 200.
(a) Show that X5 is marginally independent of y.
(b) Now, consider p = 1500 and generate the artificial data based on model (1) for 1000
repetitions. Specifically, let βi = 1 for every i = 1, …, 5 and set ρ = 0.7. After that, use
the SIS and iterated SIS methods to do variable selection and estimate the parameters
associated with the selected covariates. Finally, summarize the estimators in the following
table:
Table 1: Simulation result for (b)
Method          ‖∆β‖1    ‖∆β‖2    #S    #FN
SIS
Iterated SIS
(c) Here we consider a scenario different from (b). Let p = 40 and X ∼ N(0, ΣX),
where entry (j, k) of ΣX is 0.5^|j−k| for j, k = 1, …, p. We generate the artificial data
based on (1) for 1000 repetitions with βi = 1 for every i = 1, …, 5. After that, use the
lasso, adaptive lasso, and elastic net (set α = 0.5) methods to estimate the parameters.
Finally, summarize the numerical results in the following table.
Table 2: Simulation result for (c)
Method                   ‖∆β‖1    ‖∆β‖2    #S    #FN
lasso
adaptive lasso
Elastic net (α = 0.5)
(d) Summarize your findings for parts (b) and (c), respectively.
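The two covariance structures used in parts (b) and (c) can be written down explicitly. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the dimension is shrunk to p = 8 for readability, all βi are set to 1 as in part (b), and the error term is assumed independent of X. It also evaluates the population covariance between each covariate and y as ΣX times the coefficient vector, which is a numerical sanity check related to part (a), not a proof.

```python
import math


def sis_example_cov(p, rho):
    """Covariance for part (b): correlation rho between covariates other
    than X5, and sqrt(rho) between X5 and every other covariate."""
    s = math.sqrt(rho)
    sigma = [[1.0 if j == k else rho for k in range(p)] for j in range(p)]
    for j in range(p):
        if j != 4:  # index 4 corresponds to X5
            sigma[4][j] = s
            sigma[j][4] = s
    return sigma


def ar1_cov(p, base=0.5):
    """Covariance for part (c): entry (j, k) equals base^|j - k|."""
    return [[base ** abs(j - k) for k in range(p)] for j in range(p)]


def cov_with_response(sigma, coef):
    """Cov(X_j, y) for y = X coef + noise, with noise independent of X."""
    return [sum(row[k] * coef[k] for k in range(len(coef))) for row in sigma]


rho = 0.7
p = 8  # small illustrative dimension; the assignment uses p = 1500
# Effective coefficients in model (1) when beta_1 = ... = beta_5 = 1.
coef = [1.0, 1.0, 1.0, 1.0, -4.0 * math.sqrt(rho)] + [0.0] * (p - 5)
cov_xy = cov_with_response(sis_example_cov(p, rho), coef)
sigma_c = ar1_cov(40)
```

Under these assumptions cov_xy[4] is zero up to floating-point rounding, and for jointly normal variables zero covariance is equivalent to independence. This is why marginal screening can miss X5 even though it appears in the model.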
Note: Let β̂ be the estimator. Then ∆β is defined as ∆β = β̂ − β, with the ith component
being β̂i − βi. Therefore, ‖∆β‖1 and ‖∆β‖2 are defined as
Hint: Regarding simulation studies with 1000 repetitions.
In Question 2, you are asked to use simulation studies with 1000 repetitions to estimate the
parameters. Specifically, based on the kth independently generated artificial dataset, you
can obtain the estimator, denoted by β̂(k). As a result, with 1000 repetitions.
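One way to fill in the rows of Tables 1 and 2 is to compute the four quantities for each repetition and then average them. The sketch below rests on assumptions that the text does not confirm: ‖∆β‖1 and ‖∆β‖2 are taken to be the usual ℓ1 and ℓ2 norms, #S is read as the number of nonzero estimated coefficients, #FN as the number of truly nonzero coefficients estimated as zero, and the summary is a plain average over repetitions. The function names and toy vectors are hypothetical.

```python
import math


def selection_metrics(beta_hat, beta_true):
    """Return (L1 error, L2 error, #S, #FN) for a single repetition."""
    diff = [bh - bt for bh, bt in zip(beta_hat, beta_true)]
    l1_error = sum(abs(d) for d in diff)
    l2_error = math.sqrt(sum(d * d for d in diff))
    # Assumed meaning of #S: number of selected (nonzero) coefficients.
    n_selected = sum(1 for bh in beta_hat if bh != 0)
    # Assumed meaning of #FN: truly active covariates that were dropped.
    n_false_neg = sum(
        1 for bh, bt in zip(beta_hat, beta_true) if bt != 0 and bh == 0
    )
    return l1_error, l2_error, n_selected, n_false_neg


def average_over_repetitions(estimates, beta_true):
    """Average the four metrics over a list of per-repetition estimates."""
    rows = [selection_metrics(b, beta_true) for b in estimates]
    m = len(rows)
    return tuple(sum(row[i] for row in rows) / m for i in range(4))


# Toy illustration with made-up estimates (two repetitions, p = 8).
beta_true = [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
estimates = [
    [0.9, 1.1, 1.0, 0.0, 1.2, 0.0, 0.3, 0.0],
    [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
]
summary = average_over_repetitions(estimates, beta_true)
```

In an actual run, estimates would hold the 1000 vectors β̂(k), each padded with zeros for covariates that the method did not select, so that every vector has length p.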