
B365 Homework 5

1. Consider the Fisher iris data, as plotted in the program “iris.r” studied in class.

(a) Reasoning from this plot, choose a variable and a split point for that variable so that, of the two resulting regions, one contains only Setosa flowers while the other mixes the other two classes.

(b) For the “mixed” region resulting from part (a), choose a variable and split point that separates the Versicolor and Virginica flowers as well as possible.

(c) Sketch the resulting three regions over the scatterplot of the two relevant variables, clearly labeling each region with the resulting class.
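A candidate split can also be checked numerically by tabulating the species on each side of it. The sketch below uses R's built-in `iris` data frame. The variable (Petal.Length) and threshold (2.5) are assumptions chosen for illustration; the plot from “iris.r” may suggest a different variable or split point.

```r
# Built-in Fisher iris data; split_var and split_pt are illustrative assumptions.
data(iris)
split_var <- "Petal.Length"
split_pt  <- 2.5

# TRUE for flowers that fall on the low side of the split.
left <- iris[[split_var]] < split_pt

# Cross-tabulate side of the split against species.
counts <- table(side = ifelse(left, "left", "right"), species = iris$Species)
print(counts)
```

A row of `counts` with a single nonzero entry is a pure region; the same tabulation, restricted to the mixed region, can be used to compare candidate second splits.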

2. Consider the table below for a 2-class tree classifier with classes {+, −}, giving the number of +’s and −’s reaching each node. The nodes are numbered so that 1 is the root, while the children of node k are 2k and 2k + 1. The “terminal” column says whether a node is terminal or not.

(a) Construct the tree as a graph (the usual depiction of a tree), labeling the nodes with numbers and giving the classification of each node and the number of errors it would produce if it were a terminal node.

(b) Compute R(T), where T is the tree given in the example and R(T) is our probability of error for the tree when tested on the training set.

(c) Explain why you do or do not believe this is an accurate representation of the tree’s performance on new data.

(d) Compute the optimal penalized risk, R∗α, for each node of T, where α = .03. Give the corresponding optimal tree Tα.

(e) How much do you need to increase α before a different Tα appears? Same question for decreasing α.
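For reference, the optimal penalized risk can be written as a recursion over the node numbering. The form below is a sketch under two assumptions not stated in this problem: that α is charged once per split (as the phrase “split penalty” in problem 3 suggests), and that r(t) denotes the error contributed by node t if it were terminal, with N the number of training examples at the root. The convention used in class may differ, for example by penalizing terminal nodes instead.

```latex
r(t) = \frac{\min\{n_+(t),\, n_-(t)\}}{N},
\qquad
R^*_\alpha(t) =
\begin{cases}
r(t), & t \text{ terminal in } T,\\[4pt]
\min\bigl\{\, r(t),\; \alpha + R^*_\alpha(2t) + R^*_\alpha(2t+1) \,\bigr\}, & \text{otherwise.}
\end{cases}
```

Under this form, T_α keeps the split at node t exactly when the second term attains the minimum.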

3. The following code shows two things. First, we show how to create a matrix from the file “tree_data.dat,” available from Canvas, which stores the data above. In the resulting matrix, the columns contain the number of +’s and the number of −’s arriving at each node, and the boolean variable describing the node as terminal or not. Rows 10 through 17 are all zeros and are unused. This way X[i, ] gives the data associated with tree node i.

Second, the factorial function, shown below, gives an example of a simple recursive function in R, which you would call by, e.g., factorial(5).

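X = matrix(scan("tree_data.dat"), byrow = TRUE, ncol = 3)  # one row per node: #+, #-, terminal flag

factorial <- function(i) {
  if (i <= 1) {
    return(1)  # base case; i <= 1 also stops the recursion for factorial(0)
  }
  return(i * factorial(i - 1))  # recursive case: i! = i * (i - 1)!
}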

(a) Write a recursive function in R that takes as input the number of a node and returns the optimal risk associated with that node, with a split penalty of α = .03. When you run your function with input 1 (the root node) it should return the optimal risk for the entire tree.

(b) Let Tα=.03 denote the associated optimal tree, as computed in the previous problem. Construct this tree, explicitly giving the classifications associated with each terminal tree node.

(c) Explain clearly what problem has Tα=.03 as its optimal solution.
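The shape of such a recursion can be sketched on a small made-up tree. Everything below is hypothetical: the matrix `X_toy` is not the data in “tree_data.dat”, the names `node_risk` and `opt_risk` are illustrative, and the code assumes α is charged once per split.

```r
# Hypothetical 5-node tree (NOT the homework data): columns are #+, #-, terminal flag.
X_toy <- matrix(c(50, 50, 0,   # node 1 (root)
                  40, 10, 0,   # node 2
                  10, 40, 1,   # node 3
                  30,  2, 1,   # node 4
                  10,  8, 1),  # node 5
                byrow = TRUE, ncol = 3)
alpha <- 0.03
N <- sum(X_toy[1, 1:2])  # training examples reaching the root

# Error contributed by node i if it were terminal: minority count over N.
node_risk <- function(i) min(X_toy[i, 1:2]) / N

# Optimal penalized risk of the subtree rooted at node i,
# assuming alpha is charged once for each split that is kept.
opt_risk <- function(i) {
  if (X_toy[i, 3] == 1) {
    return(node_risk(i))  # terminal in the original tree: nothing to decide
  }
  split_risk <- alpha + opt_risk(2 * i) + opt_risk(2 * i + 1)
  return(min(node_risk(i), split_risk))  # keep the split only if it pays for itself
}

opt_risk(1)
```

In this toy tree the split at node 2 costs 0.03 + 0.02 + 0.08 = 0.13, which exceeds the 0.10 risk of stopping at node 2, so that split is pruned while the root split is kept.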

4. Consider the following table of cross validation on tree induction for a two-class classification problem, as discussed in class:

Root node error: 1524/3100 = 0.49161

n= 3100

           CP nsplit rel error   xerror      xstd
1  0.54757282      0 1.0000000 1.025890 0.0180141
2  0.11909385      1 0.4524272 0.462136 0.0151731
3  0.06601942      2 0.3333333 0.368932 0.0139601
4  0.05372168      3 0.2673139 0.293204 0.0127297
5  0.05242718      4 0.2135922 0.238835 0.0116698
6  0.03430421      5 0.1611650 0.186408 0.0104615
7  0.01326861      6 0.1268608 0.150809 0.0095013
8  0.01165049      8 0.1003236 0.127508 0.0087912
9  0.01035599      9 0.0886731 0.119741 0.0085368
10 0.00776699     10 0.0783172 0.100324 0.0078541
11 0.00550162     12 0.0627832 0.083495 0.0071968
12 0.00517799     14 0.0517799 0.077023 0.0069238
13 0.00453074     15 0.0466019 0.073786 0.0067825
14 0.00291262     17 0.0375405 0.066019 0.0064285
15 0.00258900     19 0.0317152 0.062783 0.0062741
16 0.00226537     20 0.0291262 0.057605 0.0060178
17 0.00194175     22 0.0245955 0.055663 0.0059185
18 0.00161812     24 0.0207120 0.054369 0.0058512
19 0.00129450     26 0.0174757 0.054369 0.0058512
20 0.00097087     30 0.0122977 0.050485 0.0056440
21 0.00064725     34 0.0084142 0.045955 0.0053910
22 0.00032362     42 0.0032362 0.048544 0.0055371
23 0.00000000     52 0.0000000 0.046602 0.0054279

(a) In the “rel error” column we get a value of 0 for the 23rd row. Explain what this number means.

(b) In terms of error rate, how well do you think the tree associated with line 23 will perform on different data from the same population?

(c) Consider the tree that makes no splits, i.e., the one that simply classifies according to the most likely class. How well will this tree classify new data from the same population?

(d) Judging from the table, what appears to be your best choice of complexity parameter α? In what sense is your α value best?
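When reading a table like this, it can help to convert the relative errors to absolute error rates and to locate the row with the smallest cross-validated error. The sketch below copies the nsplit, xerror, and xstd columns from the table. The variable names are illustrative, and the one-standard-error rule it applies is a commonly used heuristic assumed here, not necessarily the criterion intended in class.

```r
# Columns copied from the cross-validation table above (rows 1..23).
nsplit <- c(0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 14, 15, 17, 19, 20, 22, 24, 26, 30, 34, 42, 52)
xerror <- c(1.025890, 0.462136, 0.368932, 0.293204, 0.238835, 0.186408,
            0.150809, 0.127508, 0.119741, 0.100324, 0.083495, 0.077023,
            0.073786, 0.066019, 0.062783, 0.057605, 0.055663, 0.054369,
            0.054369, 0.050485, 0.045955, 0.048544, 0.046602)
xstd <- c(0.0180141, 0.0151731, 0.0139601, 0.0127297, 0.0116698, 0.0104615,
          0.0095013, 0.0087912, 0.0085368, 0.0078541, 0.0071968, 0.0069238,
          0.0067825, 0.0064285, 0.0062741, 0.0060178, 0.0059185, 0.0058512,
          0.0058512, 0.0056440, 0.0053910, 0.0055371, 0.0054279)
root_error <- 1524 / 3100  # "Root node error" line of the table

# Relative errors are scaled by the root node error; undo that scaling.
abs_xerror <- xerror * root_error

best <- which.min(xerror)                  # row with the smallest cross-validated error
threshold <- xerror[best] + xstd[best]     # one-standard-error band (assumed heuristic)
one_se <- min(which(xerror <= threshold))  # simplest tree whose xerror is inside the band

c(best = best, one_se = one_se)
```

The two rows generally differ: `best` minimizes the estimated error, while `one_se` trades a statistically negligible increase in error for a smaller tree.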

5. As a result of a recent exam, an instructor of a class believes that 70% of the students do not yet understand a topic sufficiently well. The instructor wishes to implement a Naive Bayes classifier to estimate each student’s probability of understanding. Students are asked a sequence of 7 true or false questions. The instructor assumes that the responses to these questions are conditionally independent given the student’s state of knowledge: understands or does not understand. Of course, understanding is not really a binary attribute in real life, as there are degrees of understanding and various aspects to understanding, though we regard it as binary here.

The following code fragment creates a 2x7 matrix, x, where x[i, j] is the probability that a student will answer the jth question correctly when her state of knowledge is i. Here i = 1 corresponds to “not understanding” and i = 2 corresponds to “understanding”. For ease of computation we compute the 2x7x2 array, z, where z[i, j, k] gives the probability that a student will give answer k (k = 1 means wrong and k = 2 means right) to question j, given her state of knowledge is i.

x = matrix(c(.7, .6, .5, .5, .5, .5, .7,
             .8, .7, .6, .7, .9, .8, .9), byrow = TRUE, nrow = 2)

z = array(0, c(2, 7, 2))

(a) Write R code to fill in the array z as described in the problem.

(b) Using your z array, create an R function that receives a vector of 7 test answers, each of which is either wrong or right. For instance, if the answers are c(0,0,0,0,1,1,1), that would mean the student answered only the last three questions correctly. The function should return the probability that the student has understood the subject, using a Naive Bayes classifier.
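One possible sketch of this computation follows. It repeats the definitions of x and z so that it runs on its own; the names `prior` and `prob_understands` are illustrative, and the prior (0.7, 0.3) is taken from the instructor's belief stated in the problem.

```r
x <- matrix(c(.7, .6, .5, .5, .5, .5, .7,
              .8, .7, .6, .7, .9, .8, .9), byrow = TRUE, nrow = 2)
z <- array(0, c(2, 7, 2))
z[, , 2] <- x        # k = 2: probability the answer is right
z[, , 1] <- 1 - x    # k = 1: probability the answer is wrong

# P(not understanding), P(understanding), from the problem statement.
prior <- c(0.7, 0.3)

# answers: vector of 7 values, 0 = wrong, 1 = right.
prob_understands <- function(answers) {
  k <- answers + 1                 # map 0/1 answers to the k index 1/2
  joint <- prior                   # start from the prior for both states
  for (j in 1:7) {
    joint <- joint * z[, j, k[j]]  # multiply in P(answer j | state) for both states
  }
  joint[2] / sum(joint)            # Bayes' rule: normalize over the two states
}

prob_understands(c(0, 0, 0, 0, 1, 1, 1))
```

The loop relies on the conditional-independence assumption: the likelihood of the whole answer vector is the product of the per-question probabilities within each state.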
