INFS7410 ASSIGNMENT 3
Semester 2/2018

Marks: 10 marks (10%)
Assessment Date: Tutorial Session on 9 October 2018 (No later than 9 October)
Submission Due Date: 11.59PM, 12 October 2018 (No late submission is allowed)
What to Submit: Zipped source code with detailed comments
Where to Submit: Electronic submission via Blackboard

The goal of this project is to gain practical experience in using the vector space
model with tf.idf weighting and the cosine similarity measure for document retrieval.
You must work on this project individually. The standard academic honesty rules
apply.
Dataset: Cranfield
Assumptions: This assignment builds on Assignments 1 and 2. It assumes that the
corpus has been tokenized and converted to lowercase, that all SGML tags and
stopwords have been removed, and that the corpus is indexed by an inverted
index.
Task 1 – Building the vector space model representations for the corpus:
Write the necessary code to build the vector space model representations for all
the documents in the corpus. In this representation, each term is weighted by its
tf.idf score. Assume that only the 1000 most frequent words in the corpus are
used to construct the term dictionary. (2 marks)
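One possible shape for Task 1 is sketched below. It is only an illustration under stated assumptions: the handout does not fix a tf.idf variant, so the sketch assumes raw term frequency multiplied by log10(N/df), and it assumes the corpus is already available as a dictionary `docs` mapping each document id to its list of tokens. The function names are hypothetical.

```python
import math
from collections import Counter

def build_vocabulary(docs, size=1000):
    """Map the `size` most frequent corpus terms to vector positions."""
    counts = Counter()
    for tokens in docs.values():
        counts.update(tokens)
    # Descending frequency, then alphabetical order so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term: i for i, (term, _) in enumerate(ranked[:size])}

def compute_idf(docs, vocab):
    """Assumed variant: idf(t) = log10(N / df(t))."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(t for t in set(tokens) if t in vocab)
    return {term: math.log10(n_docs / df[term]) for term in vocab}

def tfidf_vector(tokens, vocab, idf):
    """Sparse vector {position: weight}, using raw term frequency as tf."""
    tf = Counter(t for t in tokens if t in vocab)
    return {vocab[t]: count * idf[t] for t, count in tf.items()}

def build_document_vectors(docs, size=1000):
    """Return (vocab, idf, vectors) for the whole corpus."""
    vocab = build_vocabulary(docs, size)
    idf = compute_idf(docs, vocab)
    vectors = {d: tfidf_vector(tokens, vocab, idf) for d, tokens in docs.items()}
    return vocab, idf, vectors
```

Storing each vector as a sparse dictionary avoids allocating 1000 entries per document when most weights are zero. Other weighting schemes (for example, logarithmic tf) fit the same structure.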
Task 2 – Using the vector space model representations to perform search:
Write the code to implement search. For each of the following queries, construct
its vector space model representation and return the top 10 documents, ranked by
their cosine similarity to the query vector. The ranking is obtained by comparing
the query vector with all the document vectors in the dataset.
(1) Query = “method” (0.5 mark)
(2) Query = “transfer equations” (1 mark)
(3) Query = “free problem case” (1 mark)
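The exhaustive search in Task 2 could be sketched as follows. This is a sketch under assumptions, not a reference solution: vectors are sparse dictionaries of the form {position: weight}, `vocab` maps terms to positions, `idf` maps terms to idf values, and the query is weighted with the same tf.idf scheme as the documents. Query words outside the term dictionary are simply ignored. All names are hypothetical.

```python
import math
from collections import Counter

def query_vector(query, vocab, idf):
    """Build a sparse tf.idf vector for a free-text query."""
    tf = Counter(t for t in query.lower().split() if t in vocab)
    return {vocab[t]: count * idf[t] for t, count in tf.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(i, 0.0) for i, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def search(query_vec, doc_vectors, k=10):
    """Score every document and return the top k as (doc_id, score) pairs."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vectors.items()]
    # Highest score first; ties broken by document id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:k]]
```

For example, `search(query_vector("transfer equations", vocab, idf), doc_vectors)` would return up to ten (document id, similarity) pairs. Document norms could also be precomputed once, since they do not depend on the query.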
Task 3 – Using the Inverted Index to speed up the search:
Write the code to speed up the search process in Task 2 by making use of the
inverted index. The idea is to first use the inverted index to select the documents
that contain the query words, then compare only the selected documents’ vectors
with the query vector and rank them by their cosine similarities.
(2 marks)
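The candidate-selection idea described in Task 3 might look like the sketch below. It assumes the inverted index from the earlier assignments can be viewed as a dictionary `inverted_index` mapping each term to the ids of the documents containing it; that shape, like the function names, is an assumption made for illustration.

```python
import math

def candidate_docs(query_terms, inverted_index):
    """Union of the postings of all query terms."""
    found = set()
    for term in query_terms:
        found.update(inverted_index.get(term, ()))
    return found

def cosine(u, v):
    """Cosine similarity of two sparse {position: weight} vectors."""
    dot = sum(w * v.get(i, 0.0) for i, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def indexed_search(query_terms, query_vec, doc_vectors, inverted_index, k=10):
    """Rank only the documents that contain at least one query term."""
    scored = [(cosine(query_vec, doc_vectors[d]), d)
              for d in candidate_docs(query_terms, inverted_index)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(d, s) for s, d in scored[:k]]
```

A document that shares no term with the query has a cosine similarity of zero, so skipping it does not change the order of the documents with non-zero scores. One difference from the exhaustive search is that fewer than 10 results are returned when fewer than 10 documents contain a query word.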
Code: Your implementation should be written in a general-purpose programming
language (e.g., C, Java, or Python) without using any external IR packages.
Your code should provide a simple console interface that offers the
following functions: (0.5 mark)
• Allow the user to enter the name of the corpus directory (assume that the corpus
directory is in the same directory as your executable code)
• Allow the user to enter the keywords of a search query
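The two console functions listed above could be wired together roughly as follows. This is a minimal sketch: `search_fn` is a hypothetical callback that takes the corpus path and the list of query terms and returns (document id, score) pairs, and the `read` and `write` parameters exist only so the loop can be exercised without a real terminal.

```python
import os

def resolve_corpus_dir(name, base_dir):
    """Locate the corpus directory next to the program; fail if it is missing."""
    path = os.path.join(base_dir, name.strip())
    if not os.path.isdir(path):
        raise FileNotFoundError(f"Corpus directory not found: {path}")
    return path

def parse_query(line):
    """Lowercase and split the query, mirroring the corpus preprocessing."""
    return line.lower().split()

def run_console(search_fn, base_dir, read=input, write=print):
    """Ask for the corpus directory once, then answer queries until a blank line."""
    corpus = resolve_corpus_dir(read("Corpus directory: "), base_dir)
    write(f"Using corpus at {corpus}")
    while True:
        terms = parse_query(read("Query (blank line to quit): "))
        if not terms:
            break
        for rank, (doc_id, score) in enumerate(search_fn(corpus, terms), start=1):
            write(f"{rank:2d}. {doc_id}  {score:.4f}")
```

In a script, `base_dir` would typically be `os.path.dirname(os.path.abspath(__file__))`, which matches the assumption that the corpus directory sits beside the executable code.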
Deliverables: Your submission must include the following components:
1) Program: (5 marks in total)
• Source code and a brief description of it
• Interface for input
2) Output: (2 marks in total)
• Report each query and its results (see Task 2)
3) Performance Bonus: (3 marks in total)
• Efficiency: Report the average query execution time for Task 2 and Task 3
separately, each over 10 executions of the same query.
• Retrieval Models: Implement two or more retrieval models, one of which is the
vector space model. Boolean retrieval does not count.
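For the efficiency report, the average execution time over 10 runs of the same query could be measured along these lines. The sketch assumes each search variant is wrapped in a one-argument callable `search_fn` (a hypothetical name) and uses Python's `time.perf_counter` as the clock.

```python
import time

def average_query_time(search_fn, query, runs=10):
    """Return the mean wall-clock time in seconds of `runs` executions."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        search_fn(query)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

Calling this once with the Task 2 search and once with the Task 3 search, on the same query, gives the two averages to report.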