INT304: Pattern Recognition In Computer Vision S2 2021-2022
Assignment/Lab 1: Feature Selection
Lecturer: Xi Yang Unit: INT Dept. of Intelligent Science
Disclaimer:
1. Lab report deadlines are strict. The University late submission policy will be applied.
2. Collusion and plagiarism are absolutely forbidden (University policy will be applied).
3. The report is due 21 days from the date of running this lab (March 29th, 2022).
1.1 Objectives
• Master the concepts and knowledge of feature selection.
• Be familiar with text processing.
1.2 Introduction
Text categorization is the task of classifying a set of documents into categories from a set of predefined labels. Texts cannot be handled directly by our models, so the indexing procedure is the first step: during training and validation it maps a text d_j into a numeric representation. The standard tf-idf function is used to represent the text, with each unique word of the English vocabulary serving as one dimension of the dataset. The high dimensionality of the word space may be problematic for our classification methods; in this case, we choose a subset of the words with feature selection methods to reduce the dimensionality of the feature space.
1.3 Tasks
1.3.1 Data Preprocessing
• ( 10 marks ) Download the text dataset and read the documents: http://qwone.com/~jason/20Newsgroups/.
The training examples of the version 20news-bydate.tar.gz, in which duplicates have been removed, are used for our experiments.
• ( 10 marks ) Remove the stopwords (Stopword.txt), which are frequent words that carry no information.
Convert all words to their lower-case form and delete all non-alphabetic characters from the text.
Hint: use Python set, regex, and dict. A combined sketch of these preprocessing steps follows the stemming example below.
• ( 10 marks ) Perform word stemming, which means removing word suffixes.
– Install the library: nltk (Python).
– Usage: see the following code for how to use the Porter stemmer (https://www.nltk.org/howto/stem.html).
from nltk.stem.porter import *

stemmer = PorterStemmer()
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
           'plotted']
singles = [stemmer.stem(plural) for plural in plurals]
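
A minimal sketch combining the three preprocessing steps above (reading the documents, stopword removal, lowercasing with non-alphabetic cleanup, and Porter stemming) is given below. It assumes the extracted archive keeps the layout 20news-bydate-train/<newsgroup>/<file> and that Stopword.txt lists one stopword per line; both are assumptions about your local setup rather than part of the handout, so adapt the paths as needed.

import os
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def load_stopwords(path="Stopword.txt"):
    # Read the stopword list into a set for O(1) membership tests.
    with open(path, encoding="latin-1") as f:
        return {line.strip().lower() for line in f if line.strip()}

def preprocess(text, stopwords):
    # Lowercase, keep alphabetic runs only, drop stopwords, then stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stopwords]

def read_dataset(root="20news-bydate-train"):
    # Yield (category, token_list) pairs for every document under root.
    stopwords = load_stopwords()
    for category in sorted(os.listdir(root)):
        cat_dir = os.path.join(root, category)
        for name in sorted(os.listdir(cat_dir)):
            with open(os.path.join(cat_dir, name), encoding="latin-1") as f:
                yield category, preprocess(f.read(), stopwords)

The sketches in the later sections assume two lists collected from this generator: docs (one token list per document) and labels (the matching category names).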
1.3.2 Indexing
The documents are represented in the vector space model, where each document is represented as a vector of words. A collection of documents is represented by a document-by-word matrix A:

A = (a_{ik})    (1.1)

where a_{ik} is the weight of word k in document i.
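
As an illustration, the matrix can be built from the preprocessed token lists as sketched below, with a_{ik} first filled with raw counts (Section 1.3.4 replaces these counts with tf-idf weights). The names build_matrix and docs are illustrative, not prescribed by the handout.

import numpy as np

def build_matrix(docs):
    # Map each distinct word to a column index k.
    vocab = {}
    for tokens in docs:
        for t in tokens:
            vocab.setdefault(t, len(vocab))
    # Fill the document-by-word matrix: a_ik = count of word k in document i.
    A = np.zeros((len(docs), len(vocab)))
    for i, tokens in enumerate(docs):
        for t in tokens:
            A[i, vocab[t]] += 1
    return A, vocab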
1.3.3 Feature Selection
Feature selection tries to remove non-informative words from the documents in order to improve categorization effectiveness and to reduce computational complexity.
• ( 10 marks ) Remove the low-frequency words
The document frequency of a word is the number of documents in which the word occurs. You should compute the document frequency of each word in the training dataset and remove those words whose document frequency is below a predefined threshold (here, df < 5). A sketch combining this step with information gain ranking follows this list.
• ( 40 marks ) Choose features with information gain
Information gain measures the number of bits of information gained for category prediction by knowing whether a word is present in or absent from a document.
( 10 marks ) Let c_1, c_2, \ldots, c_K denote the set of possible categories. The information gain of a word w is defined to be

IG(w) = H(C) - H(C \mid w)
      = -\sum_{j=1}^{K} P(c_j) \log P(c_j)
        + P(w) \sum_{j=1}^{K} P(c_j \mid w) \log P(c_j \mid w)
        + P(\bar{w}) \sum_{j=1}^{K} P(c_j \mid \bar{w}) \log P(c_j \mid \bar{w})
– ( 5 marks ) P(c_j): the fraction of documents in the total collection that belong to class c_j
– ( 5 marks ) P(w): the fraction of documents in which the word w occurs
– ( 10 marks ) P(c_j \mid w): the fraction of documents containing word w that belong to class c_j
– ( 10 marks ) P(c_j \mid \bar{w}): the fraction of documents not containing word w that belong to class c_j
In the end, we choose the 1000 words with the maximum IG values by sorting all words.
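
A minimal sketch of both selection steps (document-frequency filtering followed by information gain ranking, implementing the formula above) could look as follows. It assumes docs is a list of token lists and labels the matching list of category names, as produced in Section 1.3.1; all function names are illustrative.

import math
from collections import Counter, defaultdict

def select_features(docs, labels, min_df=5, n_features=1000):
    N = len(docs)
    doc_sets = [set(d) for d in docs]  # IG only needs presence/absence

    # Document frequency: the number of documents each word occurs in.
    df = Counter()
    for s in doc_sets:
        df.update(s)
    words = [w for w, c in df.items() if c >= min_df]  # drop df < 5

    # Per-class document counts, and counts of documents per class containing w.
    class_count = Counter(labels)
    class_word = defaultdict(Counter)
    for s, c in zip(doc_sets, labels):
        class_word[c].update(s)

    # H(C), the class entropy, is constant across words.
    h_c = -sum((n / N) * math.log(n / N) for n in class_count.values())

    def info_gain(w):
        p_w = df[w] / N
        cond = 0.0
        for c, n_c in class_count.items():
            n_cw = class_word[c][w]        # docs of class c containing w
            if n_cw:
                p = n_cw / df[w]           # P(c_j | w)
                cond += p_w * p * math.log(p)
            n_cnw = n_c - n_cw             # docs of class c without w
            if n_cnw:
                p = n_cnw / (N - df[w])    # P(c_j | w_bar)
                cond += (1 - p_w) * p * math.log(p)
        return h_c + cond                  # IG(w) = H(C) - H(C|w)

    return sorted(words, key=info_gain, reverse=True)[:n_features]

Working on sets of tokens rather than raw lists keeps the document-frequency counts consistent with the presence/absence probabilities that the IG formula needs.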
1.3.4 ( 20 marks ) TFIDF Representation
The tf-idf representation assigns a weight to word k in document i in proportion to the number of occurrences of the word in the document, and in inverse proportion to the number of documents in the collection in which the word occurs at least once.
a_{ik} = f_{ik} \cdot \log(N / n_k)    (1.2)

• f_{ik}: the frequency of word k in document i
• N: the number of documents in the training dataset
• n_k: the number of documents in the training dataset in which word k occurs, called the document frequency
Taking into account the length of different documents, we normalize the representation of the document as

A_{ik} = \frac{a_{ik}}{\sqrt{\sum_{j=1}^{1000} a_{ij}^{2}}}    (1.3)
The training set can then be represented as a matrix A_{N \times 1000}. Once the features are chosen, the test set can be converted into another matrix B_{M \times 1000}, where M is the size of the test dataset.
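
A minimal sketch of Equations (1.2) and (1.3) is given below. It assumes docs and the 1000 selected features come from the earlier steps; df and N are exposed as parameters so that, when converting the test set into B, the document frequencies and N of the training set can be reused (our reading of the handout, stated here as an assumption).

import math
import numpy as np

def tfidf_matrix(docs, features, df=None, N=None):
    # Pass the training set's df and N when converting the test set.
    index = {w: k for k, w in enumerate(features)}
    if N is None:
        N = len(docs)
    if df is None:
        doc_sets = [set(d) for d in docs]
        df = {w: sum(w in s for s in doc_sets) for w in features}

    # f_ik: the frequency of word k in document i.
    A = np.zeros((len(docs), len(features)))
    for i, tokens in enumerate(docs):
        for t in tokens:
            k = index.get(t)
            if k is not None:
                A[i, k] += 1

    # Eq. (1.2): a_ik = f_ik * log(N / n_k).
    for w, k in index.items():
        A[:, k] *= math.log(N / df[w]) if df[w] else 0.0

    # Eq. (1.3): divide each row by its Euclidean norm.
    norms = np.sqrt((A ** 2).sum(axis=1, keepdims=True))
    norms[norms == 0] = 1.0  # leave all-zero documents unchanged
    return A / norms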
1.4 Lab Report
• Write a short report which should contain a concise description of your results and observations.
• Please insert a screenshot of the running result into your report for each step.
• Submit the report and the source code electronically via LearningMall.
• Python is strongly suggested.
• Writing the report in the LaTeX typesetting language is strongly suggested.
• The report in PDF format and the Python source code of your implementation should be zipped into a single file. The zip file is named as follows:
e.g. StudentID_LastName_FirstName_LabNumber.zip (123456789_Einstein_Albert_1.zip)
1.5 Hints
Please refer to this paper for more details: K. Aas and L. Eikvil, Text Categorisation: A Survey, 1999.
• Latex IDE: texstudio
• Python IDE: pycharm
• Use the Python set, dict, and list collections flexibly.
