811312A Data Structures and Algorithms 2022 Assignment, Instructions
This document describes both general and specific requirements for the assignment of the course Data
Structures and Algorithms.
1. General Instructions
The assignment is an individual work. You are not allowed to copy code from others. NOTE!
All suspected misconduct will be investigated and brought to the attention of NJIT study office.
However, the course materials, for instance exercise solutions and example programs can be used
freely. When utilizing already made solutions, referencing practices need to be followed and
explanation of how it has been modified need to be included. The solution must be provided with the
standard features of C language. The use of extra code libraries is not allowed.
Everyone shall return one assignment implemented individually in C programming language
and one written report. Return the solution and the report in one zip file in the course's Moodle
workspace before the deadline of 23:59 on March 25, 2022. All late submissions will get reduced
grade. If not approvable, a returned solution can be revised one time, according to inspector's advice.
See more detailed requirements later.
The assignment submission shall contain the source code for your program, possible header
files, and your written report. See detailed instruction and requirements further down in this
document and see the checklist.
The assignment shall be returned at latest 23.59 on March 25, 2022.
2. Description of the Assignment
The task
In this assignment, you shall find the 100 most frequently occurring words from a large text
file. The program shall be implemented in C language. A continuous string of characters a..z and
A..Z, with possible apostrophes ’, is considered a word. Words with uppercase and lowercase
letters are considered equal. For example, in the text
Herman Melville’s book Moby Dick starts, as we all know, with the sentence ”Call me Ishmael”.
the words are: “herman”, “melville’s”, “book”, “moby”, “dick”, “starts”, “as”, “we”, “all”, “know”,
“with”, “the”, “sentence”, “call”, “me”, and “ishmael”
A word does not contain separators. For instance, “book”, “book,” (“book” with a comma after it),
and “book.” (“book” with a full stop after it) should all exclude the separator and be counted as
“book”.
A number in numeral form is not a word. For instance, “1” is not a word, while “one” is.
The name of the input text file shall be given as an input from the user. That is, the input file
should not be hardcoded into the source code of the program, but the program should ask the input
file from the user.
The program shall print the total number of words in the input file, the number of different
words in the input file, and the 100 most frequently occurring words and their frequencies.
The words are printed in descending order according to their frequencies. The program shall also
measure and print the processing time.
Because the input file can be very large, you need a suitable data structure to store words and
their frequencies. This can be, for instance, a hash table or a binary search tree. You can find the
most frequent words by sorting the structure with some fast algorithm. The challenge of the task
is that the input file your program will be tested with is very large (close to 3 million words).
Therefore, regular table structure is out of the question, and linked list data structure is too slow. In
addition, you will need to reserve enough memory.
There are examples of input files and corresponding outputs in Moodle. All the files are texts
written in English. You should use these to test your program. The biggest file, Bulk.txt, does not
have an output file but you can use it to test if your solution is fast. Bulk.txt can also contain words
of different languages, but uses the same English alphabet, so it should not be a problem. Output of
your program is not necessarily exactly the same with large input files. If they are not, think about
why and include it in your report.
The report
The report must be written in English, and it shall contain
1. Student’s name
2. Student’s University of Oulu ID number
3. Description of the solution and
4. Analysis of the solution program.
In the description of your solution, you shall explain what data structures and algorithms you have
chosen and why. You shall also explain the most important functions of your program.
You shall do analysis of the efficiency of your program. The size of the input is measured as the
number of words in the text file. Measure the running times of your program with given test
input files and make an estimate, how many words there can be in a file that you can process in 15
seconds. Based on this, estimate the maximum size (in megabytes) of files that your program can
process in the mentioned time, when we know that the average word length in English is 5.1 letters.
Analysis contains thus the measurements of program’s running times and the mentioned estimates
for number of words and file size in 15 seconds. The report shall naturally contain student’s
name and student’s University of Oulu ID number. If you wish, you can also give feedback of
the course and/or assignment in your report.
Remember to zip the files before returning.
3. Detailed requirements
The program
The program shall read words from text file input (.txt file extension).
The program shall ask for input file from user.
The program shall store the words it reads from a file into a suitable data structure.
The program shall count how many words there are in an input file.
The program shall count how many different words there are in an input file.
The program shall count the frequency of each occurring word.
The program shall measure the time spent for processing the input file.
The program shall print the total number of words in the input file.
The program shall print the number of different words found in the input file.
The program shall print the 100 most frequent words in descending order based on frequency and include
the frequency.
The program shall print the time it took for it to process the file and give the result (print-out does not
have to be included in the measurement of time)
The program shall execute the given task in reasonable time even with the larges test material (Bulk.txt)
- Over 10 seconds – unacceptable
- Under 10 seconds – acceptable
- Under 5 seconds – good
- Under 2 seconds – very good
The code
You shall in write in the C language.
You shall write clean and readable code.
- Decide on indentation and keep it consistent
- Comment code to explain functions and variables – comments shall be in English
- Use a consistent and meaningful name scheme for variables and functions
You shall implement all data structures and algorithms on your own – do not use ready library functions –
do not copy.
You shall give reference if you use available code found in study material and explain how you have used
it.
The code shall not be dependent on any IDE – it should compile and run from command line.
The program shall, before it terminates, free memory it has allocated.
The program reserves enough space for the input filename.
The report
You shall use the given template for the report.
The report shall include your name.
The report shall include your University of Oulu student ID number.
The report shall be written in English.
The report shall briefly present the chosen data structures and algorithms.
- Explain how you solved the problem.
The report shall briefly explain the most important functions of your program.
- Why these functions, what are they used for?
The report shall include measurements of running times for different example input files (at a minimum 3
different files)
The report shall include comparison of the output with given output files and your elaboration on why they
are different if they are different.
The report shall include the estimate of how big files it can process in 15 seconds (the analysis) and
whether this is realistic
Your report may include feedback regarding the course (teaching, material, methods, anything) if you wish
to include it. This will not affect your grade.
Submission
All files shall be included in one zip file.
You shall submit program source code whose file extension is .c and possible header files with extension .h
- No .ccp files.
- No executable files.
- No input files.
- No additional files.
You shall submit the report as a word file, an open office file, or a pdf file.
You shall use English in your file names, or at least the letters of the English alphabet.