
PIC 10A Section 2 - Homework #9 (due Wednesday, December 4, by 6 pm)

You should upload each .cpp (and .h file) separately and submit them to
CCLE before the due date/time! Your work will otherwise not be considered
for grading. Do not submit a zipped up folder or any other type of file besides
.h and .cpp.
Be sure you upload files with the precise names that you used in your editor
environment; otherwise there may be linker and other errors when your
homework is compiled on a different machine. Also be sure your code compiles
and runs on Visual Studio 2019.
VOCABULARY COMPARISON
This assignment is focused on building familiarity with streams and data structures.
In this assignment, you are to write a simple vocabulary comparison tool. You will
submit 3 files in the end: Vocabulary.h (providing declarations of constructors/member
functions/functions), Vocabulary.cpp (providing definitions), and Compare.cpp
(providing the main routine).
Here is what the program will do:
The user will be prompted to list as many files as they like (could be 2, 3, 4, 10,
etc.), all separated by spaces. These files will then be compared against each other in
all possible pairwise comparisons for their vocabulary. A similarity score will be
computed for each pair, and all of the scores will be printed to the console. In addition,
if N files were compared, a file named Results_Compare_N.txt (where N appears directly
as a number) will be generated to store output identical to what was printed on the
console.
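The input handling and the "all possible pairwise comparisons" step can be sketched as follows. This is only an illustration under assumptions: the helper names `split_names` and `all_pairs` are hypothetical and not part of the assignment, and the sketch assumes file names contain no spaces.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: split one line of user input into
// whitespace-separated file names using a string stream.
std::vector<std::string> split_names(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> names;
    std::string name;
    while (in >> name) {
        names.push_back(name);
    }
    return names;
}

// Hypothetical helper: list every index pair (i, j) with i < j,
// so n files produce n * (n - 1) / 2 comparisons and no file is
// compared with itself.
std::vector<std::pair<std::size_t, std::size_t>> all_pairs(std::size_t n) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            pairs.emplace_back(i, j);
        }
    }
    return pairs;
}
```

In a main routine, the line would typically come from `std::getline(std::cin, line)` before being passed to `split_names`.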
In this homework, you may not use std::to_string or std::stoi and the like.
Anywhere std::stringstreams could be used, you should use them as practice.
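A minimal sketch of how string streams can stand in for the disallowed conversion functions. The function names `results_filename` and `parse_int` are hypothetical, and the underscores in the output file name are an assumption about the intended spelling.

```cpp
#include <sstream>
#include <string>

// Build the results file name for n compared files with a string
// stream instead of std::to_string. The exact spelling of the name
// (underscores) is an assumption.
std::string results_filename(int n) {
    std::ostringstream out;
    out << "Results_Compare_" << n << ".txt";
    return out.str();
}

// Parse an integer from text with a string stream instead of
// std::stoi. Returns false if the text does not start with a
// valid integer.
bool parse_int(const std::string& text, int& value) {
    std::istringstream in(text);
    return static_cast<bool>(in >> value);
}
```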
A little more about the similarity score. Let A and B be two files. We define the
similarity score S as:
$$S = \frac{\text{number of words common to both } A \text{ and } B}{\sqrt{N_A N_B}}$$
where $N_A$ is the number of unique words appearing in file A and $N_B$ is the number of
unique words in file B after all capitalization has been removed. The number S is always
in the interval [0, 1] and the larger it is, the more similar two files are in their vocabulary.
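As a quick check of the formula, consider hypothetical counts (not taken from the sample files): file A has 4 unique words, file B has 9 unique words, and 3 words appear in both.

```latex
\[
S = \frac{3}{\sqrt{4 \cdot 9}} = \frac{3}{6} = 0.5
\]
% Why S never exceeds 1: the number of common words is at most
% \min(N_A, N_B), and \min(N_A, N_B) \le \sqrt{N_A N_B}.
```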
The desired format of the running program is below:
Enter all file names for comparison separated by spaces: [USER ENTERS ALL THE
FILE NAMES]
Comparison of [FILE NAME] and [OTHER FILE NAME]: [VALUE]
...
Results have been written to: Results_Compare_[NUMBER OF FILES].txt
Also see the screenshot.
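One way to keep the console and the results file identical is to build the report once in a string stream and then send that same text to both places. The sketch below assumes hypothetical helper names (`comparison_line`, `write_everywhere`) and uses default stream formatting for the score; the number of digits expected by the grader is not stated here and should be checked against the sample output.

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Format one result line in the layout shown above.
std::string comparison_line(const std::string& file_a,
                            const std::string& file_b, double score) {
    std::ostringstream out;
    out << "Comparison of " << file_a << " and " << file_b << ": " << score;
    return out.str();
}

// Send the same text to the console and to a file so the two
// outputs cannot diverge. Returns false if the file cannot be opened.
bool write_everywhere(const std::string& report, const std::string& path) {
    std::cout << report;
    std::ofstream file(path);
    if (!file) {
        return false;
    }
    file << report;
    return true;
}
```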
To manage the comparisons, you should write a class VocabWrapper and a function
similarity. The VocabWrapper class should:
• store the list of words in an appropriate structure and the name of the associated
file;
• have a constructor that accepts the name of a file, initializing the filename;
• have a get_filename function returning the filename;
• have a read_vocab function that reads in all of the words from the file of the given
name, turning all capital letters to lowercase!
• have a word_count function returning how many unique words there were in the
associated file; and
• have an overlap_count function, accepting another VocabWrapper object and
returning a count of how many words their two associated files have in common (after
capitalization has been removed).
The function similarity should compute the similarity between two inputs of type
VocabWrapper.
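The interface described above could look roughly like the sketch below. Several things here are assumptions rather than requirements: `std::set<std::string>` as the "appropriate structure", the underscore spellings of the member names, the choice to return 0 when either file has no words, and the single-file layout (the assignment itself asks for declarations in Vocabulary.h and definitions in Vocabulary.cpp).

```cpp
#include <algorithm>
#include <cctype>
#include <cmath>
#include <cstddef>
#include <fstream>
#include <set>
#include <string>

class VocabWrapper {
public:
    explicit VocabWrapper(const std::string& filename) : filename_(filename) {}

    const std::string& get_filename() const { return filename_; }

    // Read every whitespace-separated word, lowercase it, and store it.
    // std::set discards duplicates, so repeated words count once.
    void read_vocab() {
        std::ifstream in(filename_);
        std::string word;
        while (in >> word) {
            std::transform(word.begin(), word.end(), word.begin(),
                           [](unsigned char c) {
                               return static_cast<char>(std::tolower(c));
                           });
            words_.insert(word);
        }
    }

    // Number of unique (lowercased) words that were read.
    std::size_t word_count() const { return words_.size(); }

    // Count the words of this object that also appear in the other one.
    std::size_t overlap_count(const VocabWrapper& other) const {
        std::size_t common = 0;
        for (const std::string& w : words_) {
            if (other.words_.count(w) > 0) {
                ++common;
            }
        }
        return common;
    }

private:
    std::string filename_;
    std::set<std::string> words_;
};

// S = (common words) / sqrt(N_A * N_B).
// Assumption: an empty vocabulary yields 0 instead of dividing by zero.
double similarity(const VocabWrapper& a, const VocabWrapper& b) {
    if (a.word_count() == 0 || b.word_count() == 0) {
        return 0.0;
    }
    const double product = static_cast<double>(a.word_count()) *
                           static_cast<double>(b.word_count());
    return static_cast<double>(a.overlap_count(b)) / std::sqrt(product);
}
```

A set keeps the words sorted and unique, which makes both `word_count` and `overlap_count` short; a sorted `std::vector` with duplicates removed would be another reasonable choice.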
Note that the files will be given to you without any punctuation. But you must ensure
all words are represented without capital letters.
Word multiplicity is to be neglected in this homework. Whether a file has the
word “gravitation” appearing once or three-hundred times, as far as our simple similarity
score is concerned, it happened once! Some important details:
1. The user’s files will be in the same folder as the .cpp, .h, or .exe files, and the
results file must be saved to that same folder.
2. You may assume the files contain no punctuation marks.
You can test your code against the sample input files and output file provided (yes,
there are typos in the files). A sample output is provided in this document.
The texts provided are samples of 250 words sourced from the following links:
• https://www.gutenberg.org/files/30155/30155-0.txt - The Special and General Theory, by Albert Einstein
• http://www.gutenberg.org/cache/epub/60271/pg60271.txt - From Newton to Einstein, by Benjamin Harrow
• https://www.gutenberg.org/files/52521/52521-0.txt - Grimm’s Fairy Tales
Remarks: There are much better text comparison algorithms out there, but to avoid
going too deep into machine learning and algorithms, this one is sufficient. Two very
natural improvements would be to (i) remove “stop words”, i.e., words like “a”, “the”,
“and”, etc., that appear in almost all text and (ii) consider word frequency. Considering
word semantics and topic representations of the documents would be huge improvements.
If you’re wondering about the math behind the similarity score S: imagine encoding
all words in the English language in $\{0, 1\}^N$ where N is the number of possible words.
We treat each word as being orthogonal to every other distinct word. This means we
could imagine encoding “physics” as $(1, 0, 0, 0, \ldots)^T$ and “bear” as
$(0, 1, 0, 0, \ldots)^T$, etc. We
neglect the word multiplicity in each document and then compute the cosine similarity
score between two documents to find S.
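Spelled out, the step from cosine similarity to S runs as follows. The symbols $a$ and $b$ are introduced here only for illustration: they denote the 0/1 vectors of files A and B, with entry $i$ equal to 1 exactly when word $i$ occurs in the file.

```latex
\[
a \cdot b = \sum_{i=1}^{N} a_i b_i = \text{number of words common to both } A \text{ and } B
\]
\[
\lVert a \rVert^2 = \sum_{i=1}^{N} a_i^2 = \sum_{i=1}^{N} a_i = N_A,
\qquad
\lVert b \rVert^2 = N_B
\]
\[
\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
           = \frac{\text{number of words common to both } A \text{ and } B}{\sqrt{N_A N_B}} = S
\]
```

Because every entry is 0 or 1, squaring an entry leaves it unchanged, which is why the norms reduce to the unique-word counts. The Cauchy-Schwarz inequality then gives S ≤ 1, and non-negative entries give S ≥ 0.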