FIT3155 S2/2020: Assignment 1

(Due midnight 11:59pm on Fri 4 September 2020) [Weight: 10 = 3 + 4 + 3 marks.]

Your assignment will be marked on the performance/efficiency of your program. You must

write all the code yourself, and should not use any external library routines, except those

that are considered standard. The usual input/output and other unavoidable routines are

exempted.

Follow these procedures while submitting this assignment:

The assignment should be submitted online via moodle strictly as follows:

❼ All your scripts MUST contain your name and student ID.

❼ Use gzip or Winzip to bundle your work into an archive which uses your student ID

as the file name. (STRICTLY AVOID UPLOADING .rar ARCHIVES!)

– Your archive should extract to a directory which is your student ID.

– This directory should contain a subdirectory for each of the three questions, named

as: q1/, q2/ and q3/. – Your corresponding scripts and work should be tucked within those subdirectories.

❼ Submit your zipped file electronically via Moodle.

Academic integrity, plagiarism and collusion

Monash University is committed to upholding high standards of honesty and academic

integrity. As a Monash student your responsibilities include developing the knowledge and skills to avoid plagiarism and collusion. Read carefully the material available

at https://www.monash.edu/students/academic/policies/academic-integrity to understand

your responsibilities. As per FIT policy, all submissions will be scanned via MOSS.

Assignment Questions

1. Parameter matching: Imagine you are tasked with maintaining a large code base and

after examining the code you begin to suspect that it contains a lot of duplication. In

particular, it appears that the programming team were poorly coordinated and several

core functions have been copied or rewritten many times throughout the code. However,

the functions have not simply been copied line for line. Instead the variables have often

been renamed due to the differing styles of the individual programmers on the team.

You decide this duplication needs to be removed from the code in order to increase

maintainability, but first you must find it!

Let’s abstract the problem and implement a proof of concept application to detect duplication. Many of the lines in the duplicated functions will be identical across all copies,

so we’ll model these lines as tokens over an alphabet Π consisting of uppercase English

characters i.e. Π ={A, B, ..., Z}. To capture the fact that variable naming between the

copies is inconsistent we’ll model them as parameters over an alphabet Σ consisting of

lowercase English characters i.e. Σ = {a, b, ..., z}.

We can represent the functions in the code using p-strings, which are strings consisting

of characters from Π and Σ. One function is a duplicate of another if we can make them

identical by renaming its variables. Similarly, in our abstraction we are interested in

whether two p-strings S1 and S2 can be made identical by renaming the parameters in

one of them. Specifically, we wish to determine which substrings in a given text (the code

base given as a p-string) p-match a given pattern (a function represented as a p-string),

where S1 and S2 p-match if and only if:

❼ Every token in S2 is aligned with an identical token in S1. ❼ Every parameter y in S2 is aligned with a parameter z in S1 and it is possible to

make S2 identical to S1 by renaming1

every occurrence of y in S2 to z, such that no

two parameters in S2 are renamed to the same parameter in S1.

Your task is to implement an efficient algorithm that finds all the positions in the text

that p-match the pattern.

Strictly follow the following specification to address this question:

Program name: parameter matching.py

Arguments to your program: Two plain text files:

(a) an input file containing a p-string txt[1...n] (without any line breaks).

(b) another input file containing a p-string pat[1..m] (without any line breaks).

Command line usage of your script:

parameter matching.py

Do not hard-code the file names/input in your program. The pattern and text should be specified as arguments.

Penalties apply if you do.

1The parameters in S1 are independent of those in S2 so that if y in S2 is renamed to z this has no impact on

the copies of y in S1. Furthermore, no renaming needs to be done if two parameters are already identical, i.e.

if an x in S2 is aligned with an x in S1 these parameters can be made to p-match by leaving both parameters

unchanged. However, no other parameter y in S2 would be allowed to be renamed to x in this case.

Output file name: output parameter matching.txt

❼ Each position where pat p-matches the txt should appear in a separate line.

For example, when txt[1 . . . 16] = AaBcBaABCaBaAxBy, and pat[1 . . . 3] = aBc,

the output should be:

2. Binary Boyer-Moore: In week 2 we discussed the Boyer-Moore algorithm, which defines

the benchmark for exact pattern matching algorithms. Boyer-Moore performs extremely

well on natural languages, but its effectiveness is somewhat diminished when searching

binary files. Tables 1 and 2 show that when compared to naive pattern matching BoyerMoore’s advantage is much more pronounced for patterns and texts that consist of random

lower case English characters than when the text and pattern are random binary strings.

This should not be overly surprising, but we can improve Boyer-Moore’s performance

somewhat if we know that both the pattern and the text will be defined over a binary

alphabet. Your task is to implement an altered version of the Boyer-Moore algorithm

that is optimised to deal with this particular application. You should focus on what (if

any) alterations can be made to the preprocessing and search phases of the algorithm in

order to give better performance on a binary alphabet. You should assume that the text

is very large compared to the pattern and to avoid the complications that arise when

working with binary data2 we’ll take our alphabet to be ℵ = {0, 1}, i.e. the characters ‘0’

and ‘1’ with ASCII codes 48 and 49 respectively.

When optimising your algorithm your goal should be to minimise the number comparisons

it makes, without unreasonably sacrificing the space or time that the algorithm requires.

In particular, you should be able to improve on the number of comparisons made by the

general Boyer-Moore algorithm taught in lectures. One of texts and its corresponding

pattern used to gather the data for table 2 is available on Moodle.

Algorithm Number of comparisons number of Shifts

Boyer-Moore 122614 117791

Naive 1040264 999991

Table 1: The performance of the Boyer-Moore and the naive pattern matching algorithms

discussed in lectures on a random text and pattern of lower case English characters. The text

and the pattern contained 1000000 characters and 10 characters respectively.

Algorithm Number of comparisons number of Shifts

Boyer-Moore 642096 298816

Naive 1998276 999991

Table 2: The performance of the Boyer-Moore and the naive pattern matching algorithms discussed in lectures on random bitstrings. The text and the pattern contained 1000000 characters

and 10 characters respectively.

2While avoiding raw binary data will save us some headaches it also somewhat limits the optimisations we

can make, and you should ponder what further improvements we could make if we were working with raw binary

data.

Strictly follow the following specification to address this question:

Program name: binary boyermoore.py

Arguments to your program: Two plain text files:

(a) an input file containing txt[1...n] (without any line breaks), a string of characters

from ℵ.

(b) another input file containing pat[1..m] (without any line breaks), a string of

characters from ℵ.

Command line usage of your script:

binary boyermoore.py

Do not hard-code the file names/input in your program. The pattern and text should be specified as arguments.

Penalties apply if you do.

Output: The number of comparisons made by your algorithm should be output to the

terminal and you should write the result of your pattern matching to a file:

output binary boyermoore.txt

❼ Each position where pat matches the txt should appear in a separate line. For

example, when:

pat[1 . . . 3] = 010 and txt[1 . . . 25] = 0011010101111001001101100,

the output should be:

3. Modified Knuth-Morris-Pratt: It can be proven (see the week 2 lecture notes) that

for a pattern pat[1 . . . m] and a text txt[1 . . . n], the number of comparisons made by

the Knuth-Morris-Pratt (KMP) algorithm in its search phase is bounded by 2n. This is

because if a character txt[j] in the text is involved in a match, it is never examined again.

However, this is not true if txt[j] is involved in a mismatch and in this case txt[j] may be

involved in multiple comparisons with characters of the pattern. Fortunately the KMP

algorithm can be modified so that each character examined in the text is involved in only

single comparison with a character from the pattern, regardless of whether that comparison results in a match or a mismatch. As a consequence, the number of comparisons

made in the search phase of this modified version of KMP is bounded by n.

Your task is to modify the original KMP algorithm so that it adheres to this tighter bound

and to use it to perform exact pattern matching. You should first identify what aspect

of the original KMP shift rule prevents the number of comparisons made in the search

phase from being bounded by n. Then determine how to modify the shift rule in order to

achieve the desired bound on the number of comparisons. Hint: Modifying the shift rule

will force you to alter the preprocessing phase as well. Despite this, the preprocessing can

still be performed using the Z-algorithm and should only require a small increase in time

and space over the original preprocessing.

Strictly follow the following specification to address this question:

Program name: modified kmp.py

Arguments to your program: Two plain text files: