
PAPER CODE NO.: COMP527/337
EXAMINER: Dr. Angrosh Mandya (Tel. No. 0151 7954293)
DEPARTMENT: Computer Science
SECOND SEMESTER EXAMINATIONS 2019/20
Data Mining and Visualisation
TIME ALLOWED: Open-Book Exam
INSTRUCTIONS TO CANDIDATES 
• Answer all SIX questions. 
• The final exam for the course Data Mining and Visualisation is conducted as an open-book
exam. The question paper will be provided on 22nd May, 2020 at 10:00 a.m. You are
expected to complete the exam and return scanned copies of your answer scripts before 29th
May, 10:00 a.m. Answer scripts received after 29th May, 10:00 a.m. will be invalid.
• Completed answer scripts should be submitted using the link: 
https://sam.csc.liv.ac.uk/COMP/Submissions.pl 
• Answer scripts can be prepared in the following ways: 
1. You can write your answers by hand on plain paper, then scan or photograph the 
sheets, convert them to a single PDF and submit it at the link provided above. 
2. If you do not have access to a scanner or are unable to photograph a handwritten
solution, you can use a word processor (MS Word or LaTeX), convert the file into a
single PDF and submit it as above.
• Please note that the recommended time for completing the exam, when it was conducted as a
regular exam in previous years, was 2.5 hours. However, the open-book exam is available to
you from 22nd May, 2020, 10:00 a.m. to 29th May, 2020, 10:00 a.m. Answer scripts can be
submitted at any time during this period. The latest version received before 29th May, 2020,
10:00 a.m. will be considered for evaluation. All submissions received after the deadline will
be considered invalid.
• Since the exam is conducted as an open-book exam, students are expected to work on their 
own to prepare their answers. Any sort of plagiarism will be strictly penalised. 
• Calculators can be used for performing calculations. 
Question 1 Consider the two sentences 
S1 = great movie my family loved it 
S2 = worst movie my family hated it 
Let us consider the words my and it to be stop words and apply lemmatisation to the other
words. We represent S1 and S2 respectively by vectors x1, x2 ∈ R^6 in the vector space whose
six dimensions are defined respectively by great, movie, family, loved, worst and hated. Answer
the following questions.
A. Write the two vectors x1 and x2 in this vector space. (2 marks) 
B. Calculate the ℓ1 and ℓ2 norms of x1 and x2. (2 marks)
C. Calculate the Euclidean distance and Manhattan distance between x1 and x2. (2 marks)
D. Compute the cosine similarity between x1 and x2. (1 mark)
E. Write the set of unigrams, bigrams and trigrams that can be extracted from S1. (3 marks)
F. Write the set of bigrams that can be extracted from S2. (3 marks)
G. State one advantage of removing stop words in text mining tasks. (1 mark) 
H. State one disadvantage of removing stop words in text mining tasks. (1 mark) 
I. What is meant by part-of-speech in text mining? (2 marks) 
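
For illustration, the quantities in parts A to F can be checked with the short NumPy sketch below. The binary bag-of-words encoding, and the choice to keep stop words when extracting n-grams from the raw sentence, are assumptions and are not fixed by the paper.

import numpy as np

# Binary bag-of-words vectors over the dimensions
# (great, movie, family, loved, worst, hated), after removing
# the stop words "my" and "it".
x1 = np.array([1, 1, 1, 1, 0, 0], dtype=float)  # S1: great movie family loved
x2 = np.array([0, 1, 1, 0, 1, 1], dtype=float)  # S2: worst movie family hated

for name, x in (("x1", x1), ("x2", x2)):
    print(name, "l1 =", np.abs(x).sum(), "l2 =", np.linalg.norm(x))

print("Euclidean distance:", np.linalg.norm(x1 - x2))
print("Manhattan distance:", np.abs(x1 - x2).sum())
print("Cosine similarity:",
      x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))

# n-grams from the raw sentence S1 (stop words kept here -- an assumption):
tokens = "great movie my family loved it".split()
unigrams = tokens
bigrams = list(zip(tokens, tokens[1:]))
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
print(unigrams, bigrams, trigrams, sep="\n")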
Question 2 Consider the following two training datasets D1 = {(x_n, t_n)}_{n=1}^{4} and D2 = {(x_n, t_n)}_{n=1}^{4},
where x_n ∈ R^2.
In D1, x1 = (0, 0)⊤, x2 = (0, 1)⊤, x3 = (1, 0)⊤, and x4 = (1, 1)⊤, and the labels are t1 = t2 = t3 = −1
and t4 = 1.
In D2, x1 = (0, 0)⊤, x2 = (0, 1)⊤, x3 = (1, 0)⊤, and x4 = (1, 1)⊤, and the labels are t1 = t4 = −1 and
t2 = t3 = 1.
A. What is linear separability? Given D1 and D2, which of the two datasets is not linearly
separable, and why? (3 marks)
B. If the current weight vector and bias are respectively w^(k) and b^(k), write their updated values
w^(k+1) and b^(k+1) after misclassifying an instance x with label t. (2 marks)
C. Consider the training set D1. Initialising w^(0) = (0, 0)⊤ and b^(0) = 0, compute the final values
of the weight vector and bias when the Perceptron algorithm is applied to D1 in the order
x1, x2, x3 and x4, performing an update whenever ta ≤ 0, where a = w⊤x + b is the
activation. (4 marks)
D. Show that the Perceptron with weight vector w′ = (3, 2)⊤ and bias b′ = −4 also correctly
classifies the data points in the training dataset D1. (4 marks)
E. Show that the Perceptron with weight vector w′ = (3, 2)⊤ and bias b′ = −4, which correctly
classifies all four instances in D1 in part D, fails to correctly classify all data points in the
training dataset D2. (4 marks)
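
A minimal sketch of the computations in parts C to E. It assumes the standard Perceptron update w ← w + tx, b ← b + t with learning rate 1, which is the rule part B asks for; it is a sketch, not the official marking solution.

import numpy as np

# One pass of the Perceptron over D1 in the order x1..x4, using the
# misclassification condition t*a <= 0 with a = w.x + b.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t_D1 = np.array([-1, -1, -1, 1], dtype=float)

w, b = np.zeros(2), 0.0
for x_n, t_n in zip(X, t_D1):
    a = w @ x_n + b
    if t_n * a <= 0:                    # misclassified (or on the boundary)
        w, b = w + t_n * x_n, b + t_n
print("after one pass: w =", w, "b =", b)

# Parts D and E: check whether a fixed (w', b') separates a labelled set.
def classifies_all(w, b, X, t):
    return all(t_n * (w @ x_n + b) > 0 for x_n, t_n in zip(X, t))

t_D2 = np.array([-1, 1, 1, -1], dtype=float)   # labels of D2 (the XOR labelling)
print(classifies_all(np.array([3.0, 2.0]), -4.0, X, t_D1))  # True for D1
print(classifies_all(np.array([3.0, 2.0]), -4.0, X, t_D2))  # False for D2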
Question 3 Consider five data points in R^2 given by x1 = (1, 1)⊤, x2 = (2, 1)⊤, x3 = (2, 2)⊤,
x4 = (1, 2)⊤, and x5 = (0, 1)⊤. Answer the following questions about this dataset.
A. Let us assume that we clustered this dataset into two clusters S1 = {x1, x2} and S2 =
{x3, x4, x5}. Moreover, let us represent S1 and S2 by two arbitrary 2-dimensional vectors
µ1 and µ2, respectively. Write the Within-Cluster Sum of Squares (WCSS) objective f(S1, S2)
for this clustering. (2 marks)
B. Calculate the partial derivatives of f (S1,S2) with respect to µ1 and µ2 and find the critical 
points of f (S1,S2) by setting the derivatives to zero. (2 marks) 
C. We would like to apply the k-means clustering algorithm taught in the lectures to the above-
mentioned dataset to create two clusters. If the initial centroids for the two clusters were set to
µ1 = (2, 1)⊤ and µ2 = (2, 2)⊤, what would be the final set of clusters after the first iteration,
using Euclidean distance as the similarity measure between the data points? (3 marks)
D. Compute the values of the cluster centroids µ1 and µ2 following the assignment obtained in part
(C). (3 marks)
E. In a different setting, consider two clusters S1 = {B, B, R} and S2 = {R, B, R} consisting of
blue (B) and red (R) colour balls. Compute the purity for this clustering. (2 marks)
F. Compute the Rand index for the clustering described in part (E). (3 marks)
G. Compute the precision, recall and F-score for the clustering described in part (E). (2 marks)
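
The assignment step of part C, the centroid update of part D, and the purity, Rand index, precision and recall of parts E to G can be sanity-checked with a sketch like the following. It assumes the usual Lloyd's-algorithm reading of "first iteration" and the pair-counting definitions of the Rand index and of clustering precision/recall.

import numpy as np
from itertools import combinations

# Part C: one assignment step of k-means with the given initial centroids,
# followed by the part D centroid update.
X = np.array([[1, 1], [2, 1], [2, 2], [1, 2], [0, 1]], dtype=float)
mu = np.array([[2, 1], [2, 2]], dtype=float)

dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
assign = dists.argmin(axis=1)                  # nearest centroid per point
new_mu = np.array([X[assign == k].mean(axis=0) for k in range(2)])
print("assignments:", assign, "updated centroids:", new_mu)

# Parts E-G: purity and pair counts for S1 = {B, B, R}, S2 = {R, B, R}.
clusters = [["B", "B", "R"], ["R", "B", "R"]]
labels = [l for c in clusters for l in c]              # colour of each ball
cl = [i for i, c in enumerate(clusters) for _ in c]    # cluster of each ball
n = len(labels)

purity = sum(max(c.count(v) for v in set(c)) for c in clusters) / n
pairs = list(combinations(range(n), 2))
tp = sum(cl[i] == cl[j] and labels[i] == labels[j] for i, j in pairs)
tn = sum(cl[i] != cl[j] and labels[i] != labels[j] for i, j in pairs)
fp = sum(cl[i] == cl[j] and labels[i] != labels[j] for i, j in pairs)
fn = sum(cl[i] != cl[j] and labels[i] == labels[j] for i, j in pairs)
print("purity:", purity, "rand index:", (tp + tn) / len(pairs))
print("precision:", tp / (tp + fp), "recall:", tp / (tp + fn))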
Question 4 Consider the logistic regression model shown in Figure 1. Here a two-dimensional
input (represented by two features x1 and x2) is multiplied by a two-dimensional weight
vector w. A bias term b is also used in the network. Let z be the net score and y_n be the
activation computed in the network, resulting in an output o.
Figure 1: Logistic regression model 
Answer the following questions: 
A. Write the formula for computing net score z in the network. (1 mark) 
B. Assuming that the logistic model uses the sigmoid function to compute the activation y_n, write down
the formula for computing y_n. (1 mark)
C. Provide the reasons for using the sigmoid function in the logistic model. (1 mark)
D. Given a training dataset {(x_n, t_n)}_{n=1}^{N}, where t_n ∈ {0, 1} for n = 1, ..., N, write down the
steps involved in deriving the likelihood as a function of w for this dataset. (3 marks)
E. Define the negative log of the likelihood function obtained in part D to derive the cross-entropy
error function E(w). (2 marks)
F. Write down the gradient ∇E(w) by differentiating the E(w) obtained in part E with respect to w.
(3 marks)
G. Using a fixed learning rate η, write down the stochastic gradient descent weight update rule
using the gradient obtained in part F. (2 marks)
H. Modify the update rule in part G to add an ℓ2 regularisation on w with a regularisation coefficient
λ. (2 marks)
I. Discuss some of the possible solutions to avoid overfitting in a logistic regression model.
(2 marks)
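
A hedged sketch of the updates asked for in parts F to H. It uses the standard per-example cross-entropy gradient (y − t)x and an ℓ2 penalty of the form (λ/2)‖w‖², whose gradient contributes λw; the exact regularisation convention, and the choice to leave the bias unregularised, are assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, b, x, t, eta, lam=0.0):
    """One stochastic update for a single example (x, t), t in {0, 1}.

    The cross-entropy gradient per example is (y - t) * x for w and
    (y - t) for b; lam > 0 adds the l2 term of part H (gradient lam * w,
    with the bias conventionally left unregularised).
    """
    y = sigmoid(w @ x + b)
    w_new = w - eta * ((y - t) * x + lam * w)
    b_new = b - eta * (y - t)
    return w_new, b_new

# Hypothetical example: one update starting from w = 0, b = 0.
w, b = np.zeros(2), 0.0
w, b = sgd_step(w, b, np.array([1.0, 0.5]), t=1, eta=0.1, lam=0.01)
print(w, b)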
Question 5 Consider the sentence S1 given below. You are required to train a Continuous Bag
of Words (CBOW) Word2Vec model using S1.
S1 = i like data science 
A. Illustrate with a diagram the CBOW model with a hidden layer consisting of 2 neurons.
Indicate the input, hidden and output layers of the network by x, h and u, respectively. Use
W and W′ to indicate the parameters of the model from the input layer to the hidden layer and
from the hidden layer to the output layer, respectively. Write down the dimensions of x, h and u. (3 marks)
B. Let

W = [ 1.0 0.2 ]
    [ 0.5 2.4 ]
    [ 1.2 0.8 ]
    [ 2.0 0.6 ]

be a randomly initialised 4 × 2 weight matrix between the input and the hidden
layer. Following the vectorised form, compute the output h at the hidden layer obtained in the
CBOW model for the input context word “science”, using the one-hot vector representation of
“science”. (4 marks)
C. Justify the use of the softmax activation at the output layer in the CBOW model. (5 marks)
D. Let

W′ = [ 1.0 2.0 3.0 2.5 ]
     [ 1.5 0.5 1.0 0.5 ]

be a randomly initialised 2 × 4 weight matrix defined between the
hidden layer and the output layer. Write down the softmax function to compute the output u_i at
the output layer in the CBOW model. Also compute the output probabilities for each word in V
using the hidden output h obtained in part B. (5 marks)
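
The forward pass in parts B and D can be reproduced with the sketch below. The vocabulary ordering V = (i, like, data, science) for the one-hot encoding is an assumption; the paper does not specify an ordering.

import numpy as np

# Vocabulary ordering assumed: V = (i, like, data, science).
W = np.array([[1.0, 0.2],
              [0.5, 2.4],
              [1.2, 0.8],
              [2.0, 0.6]])                     # 4 x 2, input -> hidden
W_prime = np.array([[1.0, 2.0, 3.0, 2.5],
                    [1.5, 0.5, 1.0, 0.5]])     # 2 x 4, hidden -> output

x = np.array([0.0, 0.0, 0.0, 1.0])  # one-hot for "science" under the assumed order
h = W.T @ x                         # hidden output: the "science" row of W
u = W_prime.T @ h                   # one score u_i per vocabulary word
y = np.exp(u) / np.exp(u).sum()     # softmax over the scores
print("h =", h)
print("u =", u)
print("output probabilities =", y)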
Question 6 Consider that you are hired for a project that studies what adverse side effects are
reported on Twitter by patients taking a particular drug X. You are given a large collection
of tweets that mention the drug name X for this purpose. A subset of these tweets, annotated
with regard to the effect of the medicine as positive or negative by the Research and Development
(RD) wing of the company, is also provided to you. Given this scenario, answer the following
questions.
A. Describe three challenges that you are likely to encounter in this project. (5 marks) 
B. Assuming that you want to use an Artificial Neural Network (ANN) for training a supervised 
model for sentiment analysis using the annotated corpus, answer the following: 
B1. Describe the pre-processing steps that you would use to clean the data. (5 marks) 
B2. Provide details as to how you would proceed in implementing the ANN model.
(Please note that we don't want you to write or explain implementation code. We are
more interested in seeing your approach as to how you would proceed: for example,
your approach to pre-processing the Twitter data (a sketch of such a pipeline is given
after this question), converting the pre-processed text into feature vectors (using
pre-trained word embeddings), details of the different layers of the network, and the
activation and loss functions used in the model.) (5 marks)
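
By way of illustration for part B1, a hypothetical tweet-cleaning pipeline is sketched below. The function name, the regular expressions and the particular cleaning steps are illustrative assumptions only, not a prescribed solution.

import re

def preprocess_tweet(text):
    """Illustrative cleaning of a tweet: lower-case, strip URLs and
    @mentions, drop the '#' sign but keep the hashtag word, remove
    remaining punctuation, then tokenise on whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"@\w+", "", text)           # remove user mentions
    text = text.replace("#", "")               # '#sideeffects' -> 'sideeffects'
    text = re.sub(r"[^\w\s]", " ", text)       # drop remaining punctuation
    return text.split()

print(preprocess_tweet("Taking drug X, feeling dizzy @doc #sideeffects https://t.co/abc"))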