B365留学生讲解、辅导iris data、讲解R程序语言、辅导R讲解留学生Processing|辅导留学生 Statistics统计、回

B365 Homework 5
1. Consider the Fisher iris data, as plotted in the program “iris.r” studied in class.
(a) Reasoning from this plot, choose a variable and a split point for the variable so that the resulting two
regions have one that contains only Setosa flowers, while the ther mixes the other two classes.
(b) For the “mixed” region resulting from the first part, choose a variable and split point that separates the
Versicolor and Virginica flowers as well as possible.
(c) Sketch the resulting three regions over the scatterplot of the two relevant variables, clearly labeling each
region with the resulting class.
2. Consider the table for a 2-class tree classifier with classes {+, −} below giving the number of +’s and -’s
reaching each node.
The nodes are numbered so that 1 is the root, while the children of node k are 2k and 2k + 1. The “terminal”
column says if a node is terminal or not.
(a) Construct the tree as a graph (the usual depiction of a tree) labeling the nodes with numbers and giving
the classification each node and the number of errors it would produce if it were a terminal node.
(b) Compute R(T) where T is the tree given in the example and R(T) is our probabability of error for the
tree when tested on the training set.
(c) Explain why you do or do not believe this is an accurate representation of the tree’s performance on new
data.
(d) Compute the optimal penalized risk, R∗α for each node of T where α = .03. Give the corresponding optimal
tree Tα.
(e) How much do you need to increase α before a different Tα appears? Same question for decreasing α.
3. The following code shows two things. First we show how to create a matrix from the file “tree data.dat,”
available from Canvas, which stores the data above. In the resulting matrix, the columns contain the number
of +’s, the number of -’s arriving at each node, and the boolean variable describing the node as terminal or
not. The rows 10 through 17 are all zeros and are unused. This way X[i, ] gives the data associated with treenode i.
The factorial function, shown below, gives an example of a simple recursive function in R, which you would
call by, e.g. factorial(5).
X = matrix(scan("tree_data.dat"),byrow=T,ncol = 3)
factorial <- function(i) {
if (i == 1) { return(1); }
else return(i*factorial(i-1))}
(a) Write a recursive function in R that takes as input the number of a node and returns the optimal risk
associated with that node, with a split penalty of α = .03. When you run your function with input 1 (the
root node) it should return the optimal risk for the entire tree.
(b) Let Tα=.03 denote the associated optimal tree, as computed in the previous problem. Construct this tree,
explicitly giving the classifications associated with each terminal tree node.
(c) Explain clearly what problem has Tα=.03 as its optimal solution.
4. Consider the following table of cross validation on tree induction for a two-class classification problem,as
discussed in class:
Root node error: 1524/3100 = 0.49161
n= 3100
CP nsplit rel error xerror xstd
1 0.54757282 0 1.0000000 1.025890 0.0180141
2 0.11909385 1 0.4524272 0.462136 0.0151731
3 0.06601942 2 0.3333333 0.368932 0.0139601
4 0.05372168 3 0.2673139 0.293204 0.0127297
5 0.05242718 4 0.2135922 0.238835 0.0116698
6 0.03430421 5 0.1611650 0.186408 0.0104615
7 0.01326861 6 0.1268608 0.150809 0.0095013
8 0.01165049 8 0.1003236 0.127508 0.0087912
9 0.01035599 9 0.0886731 0.119741 0.0085368
10 0.00776699 10 0.0783172 0.100324 0.0078541
11 0.00550162 12 0.0627832 0.083495 0.0071968
12 0.00517799 14 0.0517799 0.077023 0.0069238
13 0.00453074 15 0.0466019 0.073786 0.0067825
14 0.00291262 17 0.0375405 0.066019 0.0064285
15 0.00258900 19 0.0317152 0.062783 0.0062741
16 0.00226537 20 0.0291262 0.057605 0.0060178
17 0.00194175 22 0.0245955 0.055663 0.0059185
18 0.00161812 24 0.0207120 0.054369 0.0058512
19 0.00129450 26 0.0174757 0.054369 0.0058512
20 0.00097087 30 0.0122977 0.050485 0.0056440
21 0.00064725 34 0.0084142 0.045955 0.0053910
22 0.00032362 42 0.0032362 0.048544 0.0055371
23 0.00000000 52 0.0000000 0.046602 0.0054279
(a) In the “rel error” column we get a value of 0. for the 23rd row. Explain what this number means.
(b) In terms of error rate, how well do you think the tree associated with line 23 will perform on different
data from the sample population.
(c) Consider the tree that makes no splits — i.e. the one that simply classifies according to the most likely
class. How well will this tree classify new data from the same population.
(d) Judging from the table, what appears to be your best choice of complexity parameter α? In what sense
is your α value best?
5. As a result of a recent exam, an instructor of a class believes that 70% of the students do yet not understand a
topic sufficiently well. The instructor wishes to implement a Naive Bayes classifier to estimate each student’s
probability of understanding. Students are asked a sequence of 7 true or false questions. The instructor assumes
that the responses to these questions are conditionally independent given the student’s state of knowledge —
understands or does not understand. Of course, understanding is not really a binary attribute in real life as
there are degrees of understanding and various aspects to understanding, though we regard it as binary here.
The following code fragment creates a 2x7 matrix, x, where x[i, j] is the probability that a student will answer
the jth question correctly when her state of knowledge is i. Here i = 1 corresponds to “not understanding” and
i = 2 corresponds to understanding. For ease of computation we compute the 2x7x2 array, z, where z[i, j, k]
gives the probability that a student will give answer k (k=1 means wrong and k=2 means right) to question j,
given her state of knowledge is i.
x = matrix(c(.7, .6, .5, .5, .5, .5, .7, .8, .7, .6, .7, .9, .8, .9 ),byrow=T,nrow=2);
z = array(0,c(2,7,2));
(a) Write R code to fill in the matrix z to be as described in the problem.
(b) Using your z matrix create an R function that receives a vector of 7 test answers which are either wrong
or right. For instance, if the answers are c(0,0,0,0,1,1,1), that would mean the student answered only the
last three questions correctly. The function should return the probability that the student has understood
the subject, using a Naive Bayes classifier.