首页 > > 详细

FIT3155 S2/2020: Assignment 1

 FIT3155 S2/2020: Assignment 1

(Due midnight 11:59pm on Fri 4 September 2020) [Weight: 10 = 3 + 4 + 3 marks.]
Your assignment will be marked on the performance/efficiency of your program. You must
write all the code yourself, and should not use any external library routines, except those
that are considered standard. The usual input/output and other unavoidable routines are
exempted.
Follow these procedures while submitting this assignment:
The assignment should be submitted online via moodle strictly as follows:
❼ All your scripts MUST contain your name and student ID.
❼ Use gzip or Winzip to bundle your work into an archive which uses your student ID
as the file name. (STRICTLY AVOID UPLOADING .rar ARCHIVES!)
– Your archive should extract to a directory which is your student ID.
– This directory should contain a subdirectory for each of the three questions, named
as: q1/, q2/ and q3/. – Your corresponding scripts and work should be tucked within those subdirectories.
❼ Submit your zipped file electronically via Moodle.
Academic integrity, plagiarism and collusion
Monash University is committed to upholding high standards of honesty and academic
integrity. As a Monash student your responsibilities include developing the knowl￾edge and skills to avoid plagiarism and collusion. Read carefully the material available
at https://www.monash.edu/students/academic/policies/academic-integrity to understand
your responsibilities. As per FIT policy, all submissions will be scanned via MOSS.
1
Assignment Questions
1. Parameter matching: Imagine you are tasked with maintaining a large code base and
after examining the code you begin to suspect that it contains a lot of duplication. In
particular, it appears that the programming team were poorly coordinated and several
core functions have been copied or rewritten many times throughout the code. However,
the functions have not simply been copied line for line. Instead the variables have often
been renamed due to the differing styles of the individual programmers on the team.
You decide this duplication needs to be removed from the code in order to increase
maintainability, but first you must find it!
Let’s abstract the problem and implement a proof of concept application to detect dupli￾cation. Many of the lines in the duplicated functions will be identical across all copies,
so we’ll model these lines as tokens over an alphabet Π consisting of uppercase English
characters i.e. Π ={A, B, ..., Z}. To capture the fact that variable naming between the
copies is inconsistent we’ll model them as parameters over an alphabet Σ consisting of
lowercase English characters i.e. Σ = {a, b, ..., z}.
We can represent the functions in the code using p-strings, which are strings consisting
of characters from Π and Σ. One function is a duplicate of another if we can make them
identical by renaming its variables. Similarly, in our abstraction we are interested in
whether two p-strings S1 and S2 can be made identical by renaming the parameters in
one of them. Specifically, we wish to determine which substrings in a given text (the code
base given as a p-string) p-match a given pattern (a function represented as a p-string),
where S1 and S2 p-match if and only if:
❼ Every token in S2 is aligned with an identical token in S1. ❼ Every parameter y in S2 is aligned with a parameter z in S1 and it is possible to
make S2 identical to S1 by renaming1
every occurrence of y in S2 to z, such that no
two parameters in S2 are renamed to the same parameter in S1.
Your task is to implement an efficient algorithm that finds all the positions in the text
that p-match the pattern.
Strictly follow the following specification to address this question:
Program name: parameter matching.py
Arguments to your program: Two plain text files:
(a) an input file containing a p-string txt[1...n] (without any line breaks).
(b) another input file containing a p-string pat[1..m] (without any line breaks).
Command line usage of your script:
parameter matching.py
Do not hard-code the file names/input in your program. The pattern and text should be specified as arguments.
Penalties apply if you do.
1The parameters in S1 are independent of those in S2 so that if y in S2 is renamed to z this has no impact on
the copies of y in S1. Furthermore, no renaming needs to be done if two parameters are already identical, i.e.
if an x in S2 is aligned with an x in S1 these parameters can be made to p-match by leaving both parameters
unchanged. However, no other parameter y in S2 would be allowed to be renamed to x in this case.
2
Output file name: output parameter matching.txt
❼ Each position where pat p-matches the txt should appear in a separate line.
For example, when txt[1 . . . 16] = AaBcBaABCaBaAxBy, and pat[1 . . . 3] = aBc,
the output should be:
24
14
2. Binary Boyer-Moore: In week 2 we discussed the Boyer-Moore algorithm, which defines
the benchmark for exact pattern matching algorithms. Boyer-Moore performs extremely
well on natural languages, but its effectiveness is somewhat diminished when searching
binary files. Tables 1 and 2 show that when compared to naive pattern matching Boyer￾Moore’s advantage is much more pronounced for patterns and texts that consist of random
lower case English characters than when the text and pattern are random binary strings.
This should not be overly surprising, but we can improve Boyer-Moore’s performance
somewhat if we know that both the pattern and the text will be defined over a binary
alphabet. Your task is to implement an altered version of the Boyer-Moore algorithm
that is optimised to deal with this particular application. You should focus on what (if
any) alterations can be made to the preprocessing and search phases of the algorithm in
order to give better performance on a binary alphabet. You should assume that the text
is very large compared to the pattern and to avoid the complications that arise when
working with binary data2 we’ll take our alphabet to be ℵ = {0, 1}, i.e. the characters ‘0’
and ‘1’ with ASCII codes 48 and 49 respectively.
When optimising your algorithm your goal should be to minimise the number comparisons
it makes, without unreasonably sacrificing the space or time that the algorithm requires.
In particular, you should be able to improve on the number of comparisons made by the
general Boyer-Moore algorithm taught in lectures. One of texts and its corresponding
pattern used to gather the data for table 2 is available on Moodle.
Algorithm Number of comparisons number of Shifts
Boyer-Moore 122614 117791
Naive 1040264 999991
Table 1: The performance of the Boyer-Moore and the naive pattern matching algorithms
discussed in lectures on a random text and pattern of lower case English characters. The text
and the pattern contained 1000000 characters and 10 characters respectively.
Algorithm Number of comparisons number of Shifts
Boyer-Moore 642096 298816
Naive 1998276 999991
Table 2: The performance of the Boyer-Moore and the naive pattern matching algorithms dis￾cussed in lectures on random bitstrings. The text and the pattern contained 1000000 characters
and 10 characters respectively.
2While avoiding raw binary data will save us some headaches it also somewhat limits the optimisations we
can make, and you should ponder what further improvements we could make if we were working with raw binary
data.
3
Strictly follow the following specification to address this question:
Program name: binary boyermoore.py
Arguments to your program: Two plain text files:
(a) an input file containing txt[1...n] (without any line breaks), a string of characters
from ℵ.
(b) another input file containing pat[1..m] (without any line breaks), a string of
characters from ℵ.
Command line usage of your script:
binary boyermoore.py
Do not hard-code the file names/input in your program. The pattern and text should be specified as arguments.
Penalties apply if you do.
Output: The number of comparisons made by your algorithm should be output to the
terminal and you should write the result of your pattern matching to a file:
output binary boyermoore.txt
❼ Each position where pat matches the txt should appear in a separate line. For
example, when:
pat[1 . . . 3] = 010 and txt[1 . . . 25] = 0011010101111001001101100,
the output should be:
57
15
3. Modified Knuth-Morris-Pratt: It can be proven (see the week 2 lecture notes) that
for a pattern pat[1 . . . m] and a text txt[1 . . . n], the number of comparisons made by
the Knuth-Morris-Pratt (KMP) algorithm in its search phase is bounded by 2n. This is
because if a character txt[j] in the text is involved in a match, it is never examined again.
However, this is not true if txt[j] is involved in a mismatch and in this case txt[j] may be
involved in multiple comparisons with characters of the pattern. Fortunately the KMP
algorithm can be modified so that each character examined in the text is involved in only
single comparison with a character from the pattern, regardless of whether that com￾parison results in a match or a mismatch. As a consequence, the number of comparisons
made in the search phase of this modified version of KMP is bounded by n.
Your task is to modify the original KMP algorithm so that it adheres to this tighter bound
and to use it to perform exact pattern matching. You should first identify what aspect
of the original KMP shift rule prevents the number of comparisons made in the search
phase from being bounded by n. Then determine how to modify the shift rule in order to
achieve the desired bound on the number of comparisons. Hint: Modifying the shift rule
will force you to alter the preprocessing phase as well. Despite this, the preprocessing can
still be performed using the Z-algorithm and should only require a small increase in time
and space over the original preprocessing.
Strictly follow the following specification to address this question:
Program name: modified kmp.py
Arguments to your program: Two plain text files:
(a) an input file containing txt[1...n] (without any line breaks).
4
(b) another input file containing pat[1..m] (without any line breaks).
You may assume that both txt and pat consist of lower case English letters, i.e.
ASCII characters with values in the range [97 . . . 122].
Command line usage of your script:
modified kmp.py
Do not hard-code the file names/input in your program. The pattern and text should be specified as arguments.
Output file name: output kmp.txt
❼ Each position where pat matches the txt should appear in a separate line. For
example, when txt[1 . . . 12] = abcdabcdabcd, and pat[1 . . . 3] = abc the output
 
联系我们
  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-23:00
  • 微信:codinghelp2
热点文章
程序代写更多图片

联系我们 - QQ: 99515681 微信:codinghelp2
© 2014 www.7daixie.com
程序代写网!