CSCI-570 Fall 2021
Analysis of Algorithms
Final Project
Due: Dec 8, 2021
I. Guidelines
The project is related to the implementation of the two different solutions provided in chapter 6
of the Kleinberg textbook for the Sequence Alignment problem. You can work on this project in
groups of no more than 3 people.
II. Project Description
Implement the basic Dynamic Programming solution to the Sequence Alignment problem. Run
the test set provided and show your results.
A. Algorithm Description:
Suppose we are given two strings X and Y, where X consists of the sequence of symbols
x1, x2 . . . xm and Y consists of the sequence of symbols y1, y2 . . . yn. Consider the sets {1,
2, . . . , m} and {1, 2, . . . , n} as representing the different positions in the strings X and Y,
and consider a matching of these sets; recall that a matching is a set of ordered pairs with
the property that each item occurs in at most one pair. We say that a matching M of these
two sets is an alignment if there are no “crossing” pairs: if (i, j), (i’ , j’ ) ∈ M and i < i’ ,
then j < j’ . Intuitively, an alignment gives a way of lining up the two strings, by telling
us which pairs of positions will be lined up with one another.
Our definition of similarity will be based on finding the optimal alignment between X
and Y, according to the following criteria. Suppose M is a given alignment between X
and Y:
1. First, there is a parameter δe > 0 that defines a gap penalty. For each position of X
or Y that is not matched in M — it is a gap — we incur a cost of δ.
2. Second, for each pair of letters p, q in our alphabet, there is a mismatch cost of αpq
for lining up p with q. Thus, for each (i, j) ∈ M, we pay the appropriate mismatch
cost αxiyj for lining up xi with yj. One generally assumes that αpp = 0 for each letter
p—there is no mismatch cost to line up a letter with another copy of
itself—although this will not be necessary in anything that follows.
3. The cost of M is the sum of its gap and mismatch costs, and we seek an alignment
of minimum cost.
B. Input string Generator:
The input to the program would be a text file, input.txt containing the following
information:
1. First base string
2. Next j lines would consist of indices corresponding the after which the
previous string to be added to the cumulative string
3. Second base string
4. Next k lines would consist of where the base string to be added to the
cumulative string
This information would help generate 2 strings from the original 2 base strings.
This file could be used as an input to your program and your program could use
the base strings and the rules to generate the actual strings. Also note that the
numbers j and k corresponds to the first and the second string respectively. Make
sure you validate the length of the first and the second string to be
2j*str1.length and 2k*str2.length. Please note that the base strings need
not have to be of equal length and similarly, j need not be equal to k.
Using the above numbers, the generated strings would be
ACACTGACTACTGACTGGTGACTACTGACTGG and
TATTATACGCTATTATACGCGACGCGGACGCG
Following is the step by step process on how the above strings are generated.
ACTG
ACTGACTG
ACTGACTACTGACTGG
ACACTGACTACTGACTGGTGACTACTGACTGG
TACG
TATACGCG
TATTATACGCGACGCG
TATTATACGCTATTATACGCGACGCGGACGCG
C. Values for Delta and Alphas
Values for α’s are as follows. δe is equal to 30.
A C G T
A 0 110 48 94
C 110 0 118 48
G 48 118 0 110
T 94 48 110 0
D. Programming/Scripting Languages:
Following are the list of languages which could be used:
1. C
2. C++
3. Java
4. Python
5. Python3
E. Bounds:
Bounds on the length of the base strings and the values of m and n, along with the
zip file will be released on 17 November 2021, i.e. 3 weeks before the due date.
III. Goals
Following are the goals to achieve for your project
A. Your program should take input:
1. 2 strings that need to be aligned, should be generated from the string
generator mentioned above.
2. Gap penalty (δe).
3. Mismatch penalty (αpq).
B. Your solution should output output.txt file containing the following information at
the respective lines:
1. The first 50 elements and the last 50 elements of the actual alignment.
2. The time it took to complete the entire solution.
3. Memory required.
C. Implement the memory efficient version of this solution and repeat the tests in
Part B.
D. Plot the results of Part A and Part B:
1. Single plot of CPU time vs problem size for the two solutions.
2. Single plot of Memory usage vs problem size for the two solutions.
IV. Notes and Hints
A. You will be provided a zipfile which has a folder containing some sample test
cases and corresponding output text files containing the correct answers. Please
note that we will be having another set of test cases (that are not released) which
will be used for testing your program/script.
B. You should provide us with a zipfile containing your programs/scripts along with
plots and a summary file.
C. The name of your program/script, folder and all the other meta-data files should
have the USC IDs (not email ids) of everyone in your group separated by
underscore, as follows:
1. Zip filename: 1234567890_1234567890_1234567890.zip, of the original
foldername 1234567890_1234567890_1234567890.
2. Program filename: 1234567890_1234567890_1234567890_basic.py and
1234567890_1234567890_1234567890_efficient.py (if using Python).
3. Plots: CPUPlot.png and MemoryPlot.png (you can use .jpg as well).
4. Summaries: Summary.txt.
D. Your programs/script should take input as input.txt file as an argument and output
an output.txt file.
E. Summarize your results and include any insights or observations drawn from your
results in Summary.txt
F. You also need to state each group members contribution to the project, e.g.coding,
testing, report preparation, etc. Do that in Summary.txt.
V. Grading
A. Basic Algorithm: 25 points.
B. If your program outputs a .txt file having all the 3 lines with correct answers: 15
points.
C. Memory efficient version: 30 points.
D. Plots, analysis of results, insights and observations: 30 points.