COMP3009J – Information Retrieval
Programming Assignment
This assignment is worth 30% of the final grade for the module.
Due Date: Sunday 28th May 2023 at 23:55 (i.e. before beginning of Week 15)
Before you begin, download and extract the files “small_corpus.zip” and “large_corpus.zip”
from Brightspace. These contain several files that you will need to complete this assignment.
The README.md file in each describes the files contained in the archive and their format.
(The README.md is a Markdown file. Although you can open and read it as plain text, a
Markdown editor such as Remarkable (https://remarkableapp.github.io/ - Windows or Linux) or
MacDown (https://macdown.uranusjr.com/ - macOS) is recommended.)
Both corpora are in the same format, except for the relevance judgments. For the small
corpus, all documents not included in the relevance judgments have been judged non-relevant.
For the large corpus, documents not included in the relevance judgments have not been judged.
You should first try the following tasks using the small corpus. If you can do this successfully,
then you should move to the large corpus. The large corpus will require more efficient
programs.
Part 1: BM25 Model
For this assignment you are required to implement the BM25 model of Information Retrieval. You
must create a program (using Python) that can do the following (a rough sketch of one possible
pipeline appears after this list).
1. Extract the documents contained in the document collections provided. You must divide
the documents into terms in an appropriate way. The strategy must be documented in your
source code comments.
2. Perform stopword removal. A list of stopwords to use is contained in the stopwords.txt
file that is provided in the “files” directory.
3. Perform stemming. For this task, you may use the porter.py code in the “files”
directory.
4. The first time your program runs, it should create an appropriate index so that IR using the
BM25 method may be performed. Here, an index is any data structure that is suitable for
performing retrieval later.
This will require you to calculate the appropriate weights and do as much pre-calculation as
you can. This should be stored in an external file in some human-readable format. Do not use
database systems (e.g. MySQL, SQL Server, SQLite, etc.) for this.
5. On subsequent runs, your program should load the index from this file, rather than
processing the document collection again.
6. Run queries according to the BM25 model. This can be done in two ways:
- In “interactive” mode, a user can manually type in queries and see the first 15 results
in their command line, sorted in descending order of similarity score. The output should
have three columns: the rank, the document’s ID, and the similarity score. A sample run of
the program is contained later in this document. The user should continue to be prompted to
enter further queries until they type “QUIT”.
- In “automatic” mode, the standard queries should be read from the
“queries.txt” file (in the “files” directory). This file has a query on each line, beginning
with its query ID. The results should be printed into a file named “results.txt”, which
should include four columns: query ID, rank, document ID and similarity score.
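For orientation only, here is a rough sketch of how the steps above might fit together. It
assumes the standard Okapi BM25 weighting, score(D, Q) = sum over query terms t of
IDF(t) * f(t, D) * (k1 + 1) / (f(t, D) + k1 * (1 - b + b * len(D) / avgdl)), with the common
parameter values k1 = 1.2 and b = 0.75, one plain-text document per file in the “documents”
directory (using the filename as the document ID), and a PorterStemmer class with a stem()
method in the provided porter.py. None of these specifics are stated in this document, so adapt
them to the actual corpus format, the interface of porter.py, and the formula given in lectures.

import json
import math
import os
import re

from files import porter    # the provided stemmer; adjust the import if porter.py lives elsewhere

K1, B = 1.2, 0.75            # common BM25 parameters; check the values given in lectures

def load_stopwords():
    with open(os.path.join("files", "stopwords.txt")) as f:
        return set(f.read().split())

def tokenize(text, stopwords, stemmer):
    # One simple strategy: lower-case, split on non-alphanumeric characters,
    # drop stopwords, then stem. Document whatever strategy you actually use.
    terms = re.split(r"[^a-z0-9]+", text.lower())
    return [stemmer.stem(t) for t in terms if t and t not in stopwords]

def build_index():
    stopwords = load_stopwords()
    stemmer = porter.PorterStemmer()    # interface assumed; check the provided porter.py
    index = {}                          # term -> {doc_id: frequency of term in that document}
    doc_lens = {}                       # doc_id -> number of terms in that document
    for name in os.listdir("documents"):
        with open(os.path.join("documents", name)) as f:
            terms = tokenize(f.read(), stopwords, stemmer)
        doc_lens[name] = len(terms)
        for t in terms:
            index.setdefault(t, {})
            index[t][name] = index[t].get(name, 0) + 1
    # Store the index in a human-readable external file (JSON here) so that
    # later runs can load it instead of re-processing the collection.
    with open("index.json", "w") as f:
        json.dump({"index": index, "doc_lens": doc_lens}, f)
    return index, doc_lens

def bm25_scores(query_terms, index, doc_lens):
    n_docs = len(doc_lens)
    avg_len = sum(doc_lens.values()) / n_docs
    scores = {}
    for t in query_terms:
        postings = index.get(t, {})
        idf = math.log((n_docs - len(postings) + 0.5) / (len(postings) + 0.5) + 1)
        for doc_id, freq in postings.items():
            denom = freq + K1 * (1 - B + B * doc_lens[doc_id] / avg_len)
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * freq * (K1 + 1) / denom
    # Highest-scoring documents first; take the top 15 for interactive output.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

Building an inverted index (term -> postings) in this way means that answering a query only
touches the postings lists of the query terms, which is what makes retrieval fast enough for
the large corpus.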
It is ESSENTIAL that this can be run as a standalone program, without requiring an IDE such
as IDLE, PyCharm, etc.
You can assume that your program will be run in the same directory as the README.md file
(i.e. the current directory will have the “documents” and “files” directories in it). Do not
use absolute paths in your code.
Non-standard libraries (other than the Porter stemmer provided) may not be used.
Users should be able to select the appropriate mode by using command line arguments, e.g.:
- python search_small_corpus.py -m interactive
Or for automatic mode:
- python search_small_corpus.py -m automatic
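One possible way to read this flag is with the standard library's argparse module; the
following is only a sketch, and any argument handling that accepts the commands above is fine.

import argparse

parser = argparse.ArgumentParser(description="BM25 search")
parser.add_argument("-m", choices=["interactive", "automatic"], required=True,
                    help="run mode: interactive or automatic")
args = parser.parse_args()

if args.m == "interactive":
    pass   # prompt the user for queries until they type QUIT
else:
    pass   # read files/queries.txt and write results.txt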
Part 2: Evaluation
For this part, your program should process the results.txt file (that was created during
automatic mode above) to evaluate the effectiveness of the BM25 approach.
The user should be able to run the program using the following command:
- python evaluate_small_corpus.py
Based on the results.txt file, your program should calculate and print the following
evaluation metrics (based on the relevance judgments contained in the “qrels.txt” file
in the “files” directory):
- Precision
- Recall
- P@10
- R-precision
- MAP
- bpref
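As a non-authoritative illustration, the sketch below shows how these metrics might be
computed for a single query. It assumes you have already parsed results.txt into a ranked
list of document IDs (best first) and qrels.txt into sets of judged relevant and judged
non-relevant documents; the parsing, and the averaging of the per-query values over all
queries, are omitted, and all names are illustrative. Note that for the small corpus every
unjudged document is treated as non-relevant, whereas for the large corpus unjudged documents
remain unjudged, which matters in particular for bpref.

def precision(ranked, relevant):
    # Fraction of the retrieved documents that are relevant.
    return sum(1 for d in ranked if d in relevant) / len(ranked) if ranked else 0.0

def recall(ranked, relevant):
    # Fraction of the relevant documents that were retrieved.
    return sum(1 for d in ranked if d in relevant) / len(relevant) if relevant else 0.0

def precision_at(ranked, relevant, k=10):
    # P@10: precision over the top k results only.
    return sum(1 for d in ranked[:k] if d in relevant) / k

def r_precision(ranked, relevant):
    # Precision at rank R, where R is the total number of relevant documents.
    r = len(relevant)
    return sum(1 for d in ranked[:r] if d in relevant) / r if r else 0.0

def average_precision(ranked, relevant):
    # Mean of the precision values at the ranks where relevant documents appear.
    # MAP is the mean of this value over all queries.
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def bpref(ranked, relevant, nonrelevant):
    # bpref = (1/R) * sum over retrieved relevant documents of
    #   1 - (judged non-relevant documents ranked above it, capped at R) / min(R, N)
    r, n = len(relevant), len(nonrelevant)
    if r == 0:
        return 0.0
    nonrel_above, total = 0, 0.0
    for d in ranked:
        if d in relevant:
            if n:
                total += 1 - min(nonrel_above, r) / min(r, n)
            else:
                total += 1.0
        elif d in nonrelevant:
            nonrel_above += 1
    return total / r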
What you should submit
Submission of this assignment is through Brightspace. You should submit a single .zip archive
containing:
- Part 1: Python programs to run the queries.
o search_small_corpus.py
o search_large_corpus.py (if you have attempted the large corpus)
- Part 2: Python files to perform the evaluation.
o evaluate_small_corpus.py
o evaluate_large_corpus.py (if you have attempted the large corpus)
- A README.txt or README.md file that describes what your program can do (in
particular it should mention whether the program will work on both corpora or only
the small one).
Sample Run (Interactive)
$ python search_small_corpus.py -m interactive
Loading BM25 index from file, please wait.
Enter query: library information conference
Results for query [library information conference]
1 928 0.991997
2 1109 0.984280
3 1184 0.979530
4 309 0.969075
5 533 0.918940
6 710 0.912594
7 388 0.894091
8 1311 0.847748
9 960 0.845044
10 717 0.833753
11 77 0.829261
12 1129 0.821643
13 783 0.817639
14 1312 0.804034
15 423 0.795264
Enter query: QUIT
Note: In all of these examples, the results and similarity scores were generated at random for
illustration purposes, so they are not correct scores.
Sample Run (Evaluation)
$ python evaluate_small_corpus.py
Evaluation results:
Precision: 0.138
Recall: 0.412
R-precision: 0.345
P@10: 0.621
MAP: 0.253
bpref: 0.345
Grading Rubric
Below are the main criteria that will be applied for the major grades (A, B, C, etc.). Other aspects
will be taken into account to decide minor grades (i.e. the difference between B+, B, B-, etc.).
- Readability and organisation of code (including use of appropriate functions, variable
names, helpful comments, etc.).
- Quality of solution (including code efficiency, presence of minor bugs, avoiding absolute
paths, etc.).
Questions should be sent to david.lillis@ucd.ie or posted in the Brightspace forum.
Passing Grades
``D'' Grade
Good implementation of the primary aspects of Information Retrieval, using the small corpus. This
includes extracting the documents from the document collection, preprocessing (stemming and
stopword removal), indexing and retrieval. The solution may contain some implementation errors.
It is clear that the student has correctly understood the Information Retrieval process.
``C'' Grade
Good implementation of the primary aspects of Information Retrieval, using the small corpus. The
program can also save and load the index to/from an external file, as appropriate.
``B'' Grade
Correct implementation of all sections of the assignment using the small corpus (some minor
implementation errors will be tolerated). It is clear that the student has understood both the
information retrieval process and the evaluation process. Note: This means that evaluation is only
taken into account if your search program can successfully retrieve documents for evaluation.
``A'' Grade
Excellent implementation of all sections of the assignment, including choice of appropriate efficient
data structures and efficient programming. The efficiency of the programs will be measured using
the large corpus. In particular, a response to a query must be returned in a reasonable amount of
time, although efficiency is also important in indexing. Note: This means that working efficiently on
the large corpus is only taken into account if your code can successfully work with the small corpus.
Failing Grades
``ABS'' Grade
No submission received.
``NM'' Grade
No relevant work attempted.
``G'' Grade
Wholly unacceptable, little or no evidence of meaningful work attempted.
``F'' Grade
Some evidence of work attempted, but little (if any) functionality operates in the correct manner.
``E'' Grade
Clear evidence that work has been attempted on implementing retrieval using BM25, but there
are serious errors in implementation, or in understanding of the process.
Other notes
1. This is an individual assignment. All code submitted must be your own work. Submitting the work
of somebody else, or work generated by AI tools such as ChatGPT, is plagiarism, which is a serious
academic offence. Be familiar with the UCD Plagiarism Policy and the UCD School of Computer
Science Plagiarism Policy.
2. If you have questions about what is or is not plagiarism, ask!
Document Version History
v1.0: 2023-05-08, Initial Version.
v1.1: 2023-05-15, Updated requirements for the output format of automatic mode.