2020 Semester 1
COMP20007 Design of Algorithms
School of Computing and Information Systems
The University of Melbourne
Assignment 2
Due: 7:00 PM AEST Thursday 4 June
Introduction
This assignment consists of 2 problems: 2 programming problems and 2 written problems. You
will submit your solutions for the C programming component via dimefox submit and the written
component via the LMS.
This assignment has a total of 20 marks and will contribute 20% to your final grade for this subject.
Programming Problems
Problem 1 (a)
3 Marks
You must create a C program which implements a string hash function, reads in a file containing N
strings and outputs the hash value for each string.
The strings will be between 1 and 256 characters long and will contain lowercase letters, uppercase
letters and digits.
Each character will be mapped to a 6 digit binary string according to the following function:
chr(a) = 0 chr(b) = 1 · · · chr(z) = 25
chr(A) = 26 chr(B) = 27 · · · chr(Z) = 51
chr(0) = 52 chr(1) = 53 · · · chr(9) = 61
For example A maps to the binary string 011010.
The hash value for a string is computed by concatenating the binary strings for each of the characters
which make up the string, and taking this mod M , for a given M .
For example if M = 17 the string "pub" would hash to 8. Since,
chr(p) = 15, chr(u) = 20, chr(b) = 1,
we have the concatenated binary string,
001111 010100 000001
This binary number represents the decimal number 62721. Taking 62721 mod 17 yields 8. We say
that h17("pub") = 8.
1
In general, if S = c0c1 · · · c`−1 is a string of ` characters then the hash value hM (S) is given by,
hM (S) =
`−1∑
i=0
chr(ci)× 64`−1−i mod M
We use 64 here because each binary string we’re concatenating has length 6 and 26 = 64.
Your program will be given input via stdin, where the first line of input contains the two integers N
and M and the next N lines contain each of the N strings to be hashed (be careful that your program
doesn’t include the newline character in the string).
For example the input file tests/p1a-in-1.txt contains:
4 17
pub
Dijkstra
AB
comp20007
Your program should output the hash values of each of these strings, one per line. E.g.,
$ ./a2 p1a < tests/p1a-in-1.txt
8
16
8
13
Hint: As you can see, these numbers are going to get very big very quickly since we can have strings
of up to 256 characters long. These numbers probably won’t fit into a C int. You should use Horner’s
Rule to deal with this problem.
Problem 1 (b)
3 Marks
In this problem you must implement a hash table with initial size M capable of storing strings using
the hash function hM described in Problem 1 (a).
Your program will again be given N strings which should be inserted (in the given order) into the
hash table.
Collisions should be handled using linear probing with a step size of K.
If an element cannot be inserted into the table then the size of the hash table must be doubled (i.e.,
Mnew = 2Mold) and the strings already in the hash table must be rehashed and inserted (in the order
in which they appear in the hash table). Finally the string which initially could not be inserted should
be hashed and inserted into the larger hash table.
Your program will receive input from stdin, where the first line will contain the three integers N,M
and K and the next N lines will contain the N strings to be inserted into the new larger hash table.
Any memory associated with the hash table before the size was increased should be correctly freed.
Consider as an example the following input:
5 6 3
sal
scu
ba
cam
wam
2
So we will have N = 5 strings to insert into a hash table which initially has M = 6 slots, and we will
use linear probing with step size K = 3.
The first string sal has a hash value of h6(sal) = 5 so it is inserted into index 5:
0 1 2 3 4 5
sal
We then insert scu with h6(scu) = 4 into index 4:
0 1 2 3 4 5
salscu
Now the hash value of ba is also h6(ba) = 4, so there is a collision. Since the step size is 3 we will try
to insert this string into index (h6(ba) +K) mod M = (4 + 3) mod 6 = 1:
0 1 2 3 4 5
salscuba
The string cam with h6(cam) = 2 is then inserted:
0 1 2 3 4 5
salscuba cam
We then try to insert wam with hash value h6(wam) = 4, however we get a collision at both viable
indices 4 and 1:
0 1 2 3 4 5
salscuba cam
At this step we fail to insert wam into the hash table. So we increase the size of the hash table from 6
to 12 and rehash and re-insert all the existing elements (we do this in order of position in the existing
hash table, i.e., ba, cam, scu, and then sal).
The new hash values are:
h12(ba) = 4 h12(cam) = 8 h12(scu) = 4 h12(sal) = 11 and h12(wam) = 4
So after reinserting all the existing elements the hash table looks like:
0 1 2 3 4 5 6 7 8 9 10 11
salba scu cam
3
Then after inserting wam again we get:
0 1 2 3 4 5 6 7 8 9 10 11
salba scu cam wam
The program must output the final state of the hash table with one line per slot. In this example the
output should be:
$ ./a2 p1b < tests/p1b-in-2.txt
0:
1:
2:
3:
4: ba
5:
6:
7: scu
8: cam
9:
10: wam
11: sal
Problem 2 (a)
3 Marks
For this problem you will create a character level trie. A trie is a tree where each node represents a
character in a word (or a "^" or "$" which represent the start and the end of a word respectively).
Each path from the root to a leaf in a trie corresponds to a word in the data set which the trie
represents. Each node has a corresponding frequency.
Consider the following trie as an example:
ˆ6
a5 b1
$2 a2 b 1 a 1
$2 $ 1 $ 1
This trie represents the set of strings "a", "aa", "ab" and "ba" with frequencies 2, 2, 1 and 1
respectively. The frequencies associated with leaf nodes (note that leaf nodes are always $ nodes)
reflect the frequencies of the words ending at those leaf nodes.
On the other hand, the frequencies of the intermediate nodes reflect the frequencies of the correspond-
ing prefix. For example the a node on the first layer indicates that there are 5 strings which start with
"a". What is important here is the path taken from the root to the node. If we are looking at the
first a node on the second layer the path from the root is ^, a, a and so the frequency of 2 indicates
that there are two strings which begin with "aa".
4
Your task in this problem is to read in N strings containing between 1 and 99 lowercase letters and
use them to build a character level trie. For example the input to create the trie above contains N = 6
on the first line followed by the 6 strings:
6
ab
a
aa
ba
aa
a
Your program must print the pre-order traversal of the trie after inserting each string. For example:
$ ./a2 p2a < tests/p2a-in-1.txt
^
a
$
a
$
b
$
b
a
$
Note that the children must be stored (and thus, traversed) in alphabetic order, where $ comes before
a, which comes before b etc.
Problem 2 (b)
2 Marks
In this problem you will construct a trie, as in Problem 2 (a), and use this trie to output all of the
prefixes of length exactly K, along with their corresponding frequencies.
The prefixes must be output in alphabetic order.
The input will have one line containing N and K followed by N strings (one per line). For example
tests/p2b-in-2.txt contains:
15 3
algebra
algorithmics
already
algebra
algae
albert
algorithm
algorithms
again
algorithms
artistic
albania
artemis
alg
algorithms
5
In this example your program should output the following:
$ ./a2 p2b < tests/p2b-in-2.txt
aga 1
alb 2
alg 9
alr 1
art 2
Problem 2 (c)
2 Marks
In this problem you will again construct a character level trie, as in Problem 2 (a), and use it the solve
to following problem.
You are provided with a word stub, which is the start of a word which is yet to be completed. Your
program should predict the probabilities of the possible word completions, based on the set of words
provided to your program.
For example if the word stub is "alg" then you should calculate the probability of each possible word
W with prefix "alg" according to the following formula:
Pr(W |stub = "alg") = Freq(W )
Freq(prefix = "alg")
The input to your program will contain N (the number of strings) on the first line, the word stub on
the second line, and then N lines containing the N strings (one per line).
For example, consider the input tests/p2c-in-2.txt:
15
alg
algebra
algorithmics
already
algebra
algae
albert
algorithm
algorithms
again
algorithms
artistic
albania
artemis
alg
algorithms
The word stub given in "alg". The possible word completions are "algorithms", "algebra",
"algorithm", "algorithmics", "algae" and "alg" with frequencies of 3, 2, 1, 1, 1 and 1 respectively.
Your program should output the probability (formatted to 2 decimal places) followed by the word
completion of up to 5 word completions (there may be fewer possibilities). Your program should output
the most probable word completions, in descending order of probability, breaking ties alphabetically
where a comes before aa.
The output in this example should be:
6
$ ./a2 p2c < tests/p2c-in-6.txt
0.33 algorithms
0.22 algebra
0.11 alg
0.11 algae
0.11 algorithm
Analysis Problems
Problem 3
3 Marks – Up to 1/2 page
In the lectures you saw a variety of sorting algorithms with different advantages and drawbacks. In
this problem you are a software consultant who is specialised in sorting. You have a set of case studies,
given by clients, each one with a range of requirements. The task is to select a good sorting algorithm
for each case and justify your choice, in a maximum of 2 sentences. Here is an example:
Example case: an airplane factory wants to improve the embedded system in its airplanes. The
system uses a range of sensors that constantly collect large amounts of data from wind currents
outside the airplane. This data needs to be sorted as a preprocessing step to ease their visualisation
by the pilot. The sorting algorithm should be completely in-place, as extra memory usage should be
minimised in an embedded system. The algorithm should also have good performance even in the
worst case, since it should be robust to hijacking. What algorithm would you use?
Solution: Heapsort, because it is in-place and has guaranteed Θ(n log n) worst case performance,
which makes it robust to adversarial attacks.
It’s important to keep in mind that this is a subjective question and there might be more than one
correct answer. Many real world scenarios do not have a single correct solution—this is why your
argument about your chosen algorithm is important.
(a) The Melbourne City Library has introduced a program where people can return books in local
“drop-off” boxes in their neighbourhoods. For each box, a staff member collects the returned
books, sorts them manually and deposits them in a chute at the library, where they are scanned
and their record is stored in an array. At the end of the day, the array will contain a sequence
of book records starting with all books from the first collected box, all books from the second
collected box and so on.
The array needs to be sorted as soon as the last box arrives, to inform the librarian before
the end of the day. Therefore, the main requirement is that the algorithm should be as fast as
possible in this scenario. However, the algorithm should also maintain the original order of any
duplicate books in case multiple copies of the same book are returned in the same day.
(b) A mobile phone manufacturer is planning to revolutionise the market by launching a phone with
a battery that lasts for a whole week. The developers of the phone firmware asked for you help
to decide on a sorting algorithm to be added to their code.
The sorting function would be general as it needs to sort arbitrary records and it does not need
to be fast as most use cases will only have a few hundreds of records maximum. However, the
developers are concerned about making it energy efficient: while they managed to keep record
comparisons at a very low energy cost (1 nanoWatt), record swapping is not as efficient (5
microWatts, 5000x times higher consumption).
(c) In astronomy, one method used to detect the presence of a new planet is to look for changes
of luminosity in stars. An observatory has a collection of electronic telescopes that constantly
collects luminosity values. Each telescope is focused on a small set of 50 stars. At every mi-
crosecond, it returns a luminosity value for each of its 50 stars: that value is an integer between
7
0 and 10. This set of values needs to be sorted and sent to a central server in the observatory.
As this whole procedure happens every microsecond, the sorting algorithm needs to be extremely
fast. Each telescope has a small amount of additional memory that can be used for that.
However, the algorithm cannot use more memory than it uses to store the luminosity values.
Problem 4
3 Marks – Up to 1/2 page
You are a hacker that recently got hired by a cybersecurity company. The team needs to be preemptive
against any adversarial attacks. Your role is to develop such attacks so the defense team can prepare
mechanisms to avoid them.
In this problem you were able to inject code into the algorithm for inserting an element into a Binary
Search Tree. Your goal is to use to force the BST to always degrade to a “stick” (to the
right), or more specifically, a linked list.
The algorithm to insert an element into the BST is as follows.
function BSTInsert(root, new):
if root.value < new.value then
if root.left = NULL then
root.left ← new
Stickify(new, root)
else
BSTInsert(root.left, new)
else
if root.value > new.value then
if root.right = NULL then
root.right ← new
Stickify(new, root)
else
BSTInsert(root.right, new)
else
return // The element new.value is already in the tree
Here root and new are node data structures with fields left and right (which are pointers to other node
data structures) and a field value.
Your task is to write the pseudocode for the algorithm Stickify function which takes the newly
inserted node along with its parent and performs any necessary rotations to ensure the tree remains
a “right stick”.
If the BST is already a “right stick”, then running the BSTInsert algorithm (with your Stickify
function) should always leave the tree as a “right stick”.
For example if we start with the tree on the left and insert 20 we will get the tree in the middle. After
Stickify is run (with pointers to nodes 20 and 50 respectively).
8
10
50
60
90
10
50
60
90
20
10
20
50
60
90
Your algorithm should rearrange the tree by performing rotations (you may use RotateLeft(node)
and RotateRight(node) without implementing these functions).
(i) Develop a set of rotations to ensure the BST is always a “stick”. You can safely assume that
1) the BST always starts as an empty structure, with elements inserted one-by-one and 2) the
linked list should use the right children of each node in the tree.
(ii) Write pseudocode to perform the rotations at every insertion. The code is assumed to be injected
in the blue line above. As before, it’s safe to assume the BST is empty in the beginning.
You may use examples and diagrams to explain your algorithm. Your whole solution must be at most
half a page in length.
Completing the Programming Problems
We have provided you with the skeleton code for this assignment. The provided files has the following
directory structure:
provided_files/Makefile
Makefile
hash.c
hash.h
main.c
text_analysis.c
text_analysis.h
tests/
p1a-in-1.txt
p1a-out-1.txt
...
p2c-in-4.txt
p2c-out-4.txt
You must not change the main.c file, however you are allowed to change any of the other files (without
changing the existing type declarations and function signatures).
You should implement your solution to Problem 1 in the hash.c and hash.h files, and you should
implement your solution to Problem 2 in the text_analysis.c and text_analysis.h files.
If you create new C modules then you must change the Makefile so that your program can be compiled
on dimefox using the make command.
To run the program you must first compile with make, and then run the a2 command with one of
these command line arguments: p1a, p1b, p2a, p2b, p2c.
9
For example:
$ make
$ ./a2 p1a < tests/p1a-in-1.txt
The pXX-in-X.txt files contain the input your program will be provided for each test case, and the
pXX-out-X.txt files contain the expected output. Your program must match the expected output
exactly (be careful with whitespace).
Note: in this assignment, unlike Assignment 1, there will be 4 provided test cases, and 2 hidden test
cases per subproblem.
Programming Problem Submission
You will submit your program via dimefox submit. Instructions for how to connect to the dimefox
server can be found on the LMS.
You should copy all files required to compile and run your code to dimefox, this includes the Makefile
and all .c and .h files.
It’s recommended that you test your program on dimefox before submitting to make sure that your
program compiles without warnings and runs as expected.
From the directory containing these files you should run submit and list all files required to com-
pile and run your program. For example, to submit only the provided files we would run:
$ submit comp20007 a2 Makefile main.c hash.c hash.h
... text_analysis.c text_analysis.h
Note that you can also list all .c and .h files in the current directory with *.c and *.h, so the following
command will be equivalent:
$ submit comp20007 a2 Makefile *.c *.h
For this to work correctly you should have your Assignment 2 code in its own subdirectory (i.e., the
only files in the directory should be the files you wish to submit).
You must then verify your submission, which will provide you with the outcome of each
of the tests, and the number of marks your submission will receive.
$ verify comp20007 a2 > a2-receipt.txt
To view the result of the verify command you must run:
$ less a2-receipt.txt
less is a tool which allows you to read files using the command line. You can press Q on your keyboard
to exit the less view.
You can submit as many times as you would like.
Any attempt to manipulate the submission system and/or hard-code solutions to pass the specific test
cases we have provided will result in a mark of 0 for the whole assignment.
10
Completing the Written Problems
You will submit your solutions to the written problems to the LMS.
For Problem 4, which asks for pseudocode, we expect you to provide the same level of detail as the
lectures and workshops do. Pseudocode which lacks sufficient detail or is too detailed (e.g., looks like
C code) will be subject to a mark deduction.
Your submission should be typed and not handwritten and submitted as a single .pdf file.
Pseudocode should be formatted similarly to lectures/workshops, or presented in a monospace font.
The page limit for each problem is given (half to one full page per written problem). Submissions
which go over the page limit, have too small a font size or are otherwise unreadable will be subject to
a mark deduction.
Make sure you confirm that your written submission has been submitted correctly.
Mark Allocation
The total number of marks in this assignment is 20. The maximum marks for each problem are:
Problem 1 6 marks (3 per part)
Problem 2 7 marks (3 for (a), 2 for (b) and (c))
Problem 3 3 marks (1 per part)
Problem 4 3 marks
There is one additional mark awarded for “structure and style” for the C programming component.
Of particular importance will be the structure of your C program in terms of how you separate your
types and functions, as well as your commenting and choice of variable names.
There will be 6 test cases for each programming sub-problem, of which 4 are provided to you and 2
are hidden test cases.
The marks awarded for each programming sub-problem will be calculated by:
Marks = Total Marks× Test Cases Passed÷ 6
Late Policy
A late penalty of 20% per day (24 hour period) will be applied to submissions made after the deadline.
The 20% applies to the number of total marks. This applies per component, i.e.,
Grade = max
{
Programming Grade− 0.2×Days Late× 14, 0
}
+ max
{
Written Submission Grade− 0.2×Days Late× 6, 0
}
.
For example, if you are 2 days late with the programming component but only 1 day late with the
analysis component your grade for the programming component will be reduced by 0.4×14 = 5.6 and
the grade for the analysis component will be reduced by 0.2× 6 = 1.2.
Academic Honesty
You may make use of code provided as part of this subject’s workshops or their solutions (with proper
attribution), however you may not use code sourced from the Internet or elsewhere. Using code from
the Internet is grounds for academic misconduct.
11
All work is to be done on an individual basis. All submissions will be subject to automated similarity
detection. Where academic misconduct is detected, all parties involved will be referred to the School
of Engineering for handling under the University Discipline procedures. Please see the Subject Guide
and the “Academic Integrity” section of the LMS for more information.