PIC 10A Section 2 - Homework #9

PIC 10A Section 2 - Homework #9 (due Wednesday, December 4, by 6 pm)

You should upload each .cpp and .h file separately and submit them to CCLE before the due date and time; otherwise your work will not be considered for grading. Do not submit a zipped folder or any type of file other than .h and .cpp.

Be sure to upload files with the exact names you used in your editor environment; otherwise there may be linker and other errors when your homework is compiled on a different machine. Also be sure your code compiles and runs in Visual Studio 2019.

VOCABULARY COMPARISON

This assignment is focused on building familiarity with streams and data structures.

In this assignment, you are to write a simple vocabulary comparison tool. You will submit 3 files in the end: Vocabulary.h (providing declarations of constructors, member functions, and functions), Vocabulary.cpp (providing definitions), and Compare.cpp (providing the main routine).

Here is what the program will do:

The user will be prompted to list as many files as they like (2, 3, 4, 10, etc.), all separated by spaces. Every possible pair of these files will then be compared for vocabulary, and a similarity will be computed for each pair. The similarities will all be printed to the console. In addition, if N files were compared, a file named Results_Compare_N.txt, with N written directly as a number, will be generated to store the same output that appeared on the console.
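As a rough sketch of the input handling described above, the line of file names can be split with a std::istringstream, and the pairwise comparisons can be enumerated with a nested loop. The helper names split_names and all_pairs are hypothetical, and the order in which pairs are visited is an assumption, since the assignment does not fix it. For N files this yields N(N-1)/2 comparisons.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Split one line of input into whitespace-separated file names.
std::vector<std::string> split_names(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> names;
    std::string name;
    while (in >> name) {
        names.push_back(name);
    }
    return names;
}

// List every unordered pair of indices (i, j) with i < j among n files.
std::vector<std::pair<std::size_t, std::size_t>> all_pairs(std::size_t n) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            pairs.push_back(std::make_pair(i, j));
        }
    }
    return pairs;
}
```

In a main routine, the line itself would typically be read with std::getline(std::cin, line) before being passed to split_names.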

In this homework, you may not use std::to_string, std::stoi, or the like. Wherever std::stringstreams could be used, you should use them for practice.
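One possible way to follow this rule is to route every number-to-text and text-to-number conversion through a string stream. The sketch below is illustrative only: int_to_text, text_to_int, and results_filename are hypothetical helper names, and the underscore-separated file name pattern is an assumption about the intended output name.

```cpp
#include <sstream>
#include <string>

// Convert an int to text with a stream instead of std::to_string.
std::string int_to_text(int value) {
    std::ostringstream out;
    out << value;
    return out.str();
}

// Convert text to an int with a stream instead of std::stoi.
// Returns false when the text does not start with a number.
bool text_to_int(const std::string& text, int& value) {
    std::istringstream in(text);
    return static_cast<bool>(in >> value);
}

// Build the results file name for n compared files.
// The underscore-separated pattern is an assumption about the intended name.
std::string results_filename(int n) {
    std::ostringstream out;
    out << "Results_Compare_" << n << ".txt";
    return out.str();
}
```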

A little more about the similarity score. Let A and B be two files. We define the similarity score S as:

$$S = \frac{\text{number of words common to both } A \text{ and } B}{\sqrt{N_A N_B}}$$

where $N_A$ is the number of unique words appearing in file A and $N_B$ is the number of unique words in file B, after all capitalization has been removed. The number S is always in the interval [0, 1], and the larger it is, the more similar the two files are in their vocabulary.
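As a worked example with made-up counts: if file A has 4 unique words, file B has 9, and 3 words are common to both, then S = 3 / sqrt(4 * 9) = 3 / 6 = 0.5. The same arithmetic can be sketched as a small function. The name similarity_from_counts is hypothetical, and returning 0 for an empty file is an assumption, since the assignment does not say how to treat that case.

```cpp
#include <cmath>

// Similarity score from raw counts: common / sqrt(n_a * n_b).
// Returning 0 when either file has no words is an assumption; the
// assignment does not say what to do with an empty file.
double similarity_from_counts(int common, int n_a, int n_b) {
    if (n_a == 0 || n_b == 0) {
        return 0.0;
    }
    return common / std::sqrt(static_cast<double>(n_a) * n_b);
}
```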

The desired format of the running program is below:

Enter all file names for comparison separated by spaces: [USER ENTERS ALL THE FILE NAMES]
Comparison of [FILE NAME] and [OTHER FILE NAME]: [VALUE]
...
Results have been written to: Results_Compare_[NUMBER OF FILES].txt

Also see the screenshot.
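Because the console and the results file must show identical text, one option is to format each line once and send it to both destinations. The following is a minimal sketch under that assumption. comparison_line and write_to_both are hypothetical names, and the default stream formatting of the score is also an assumption, since the required number of digits is only visible in the screenshot.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Format one result line in the layout shown above.
std::string comparison_line(const std::string& first,
                            const std::string& second,
                            double score) {
    std::ostringstream out;
    out << "Comparison of " << first << " and " << second << ": " << score;
    return out.str();
}

// Send the same text to two streams, e.g. std::cout and a std::ofstream.
void write_to_both(const std::string& text,
                   std::ostream& console,
                   std::ostream& file) {
    console << text << '\n';
    file << text << '\n';
}
```

Taking std::ostream references keeps the helper usable with std::cout, a std::ofstream, or a string stream when checking the output by hand.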

To manage the comparisons, you should write a class VocabWrapper and a function similarity. The VocabWrapper class should:

• store the list of words in an appropriate structure, along with the name of the associated file;

• have a constructor that accepts the name of a file, initializing the filename;

• have a get_filename function returning the filename;

• have a read_vocab function that reads in all of the words from the file of the given name, turning all capital letters to lowercase!

• have a word_count function returning how many unique words there were in the associated file; and

• have an overlap_count function, accepting another VocabWrapper object, returning a count of how many words their two associated files have in common (after capitalization has been removed).

The function similarity should compute the similarity between two inputs of type VocabWrapper.
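The requirements above could be met in several ways. The sketch below is one possibility, not the intended solution: it assumes std::set<std::string> as the "appropriate structure", uses hypothetical member names filename_ and words_, and keeps everything in one listing for brevity, whereas the assignment asks for declarations in Vocabulary.h and definitions in Vocabulary.cpp. Returning 0 for an empty file is also an assumption.

```cpp
#include <cctype>
#include <cmath>
#include <cstddef>
#include <fstream>
#include <set>
#include <string>

class VocabWrapper {
public:
    // Store the file name; the words are read later by read_vocab().
    explicit VocabWrapper(const std::string& filename) : filename_(filename) {}

    std::string get_filename() const { return filename_; }

    // Read every whitespace-separated word, lowercased, into the set.
    void read_vocab() {
        std::ifstream in(filename_.c_str());
        std::string word;
        while (in >> word) {
            for (char& c : word) {
                c = static_cast<char>(
                    std::tolower(static_cast<unsigned char>(c)));
            }
            words_.insert(word);  // a set ignores repeated words
        }
    }

    // Number of unique words in the file.
    std::size_t word_count() const { return words_.size(); }

    // Number of words that appear in both vocabularies.
    std::size_t overlap_count(const VocabWrapper& other) const {
        std::size_t common = 0;
        for (const std::string& word : words_) {
            if (other.words_.count(word) > 0) {
                ++common;
            }
        }
        return common;
    }

private:
    std::string filename_;         // hypothetical member name
    std::set<std::string> words_;  // hypothetical member name
};

// Similarity score: shared words / sqrt(N_A * N_B).
double similarity(const VocabWrapper& a, const VocabWrapper& b) {
    if (a.word_count() == 0 || b.word_count() == 0) {
        return 0.0;  // assumption: an empty file is similar to nothing
    }
    double product = static_cast<double>(a.word_count()) * b.word_count();
    return a.overlap_count(b) / std::sqrt(product);
}
```

A set is chosen here because it discards duplicates on insertion and supports fast membership checks, which makes both word_count and overlap_count short.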

Note that the files will be given to you without any punctuation. But you must ensure all words are represented without capital letters.
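Removing capital letters can be done one character at a time with std::tolower from <cctype>. This is a small sketch; to_lowercase is a hypothetical helper name.

```cpp
#include <cctype>
#include <string>

// Return a copy of the word with every capital letter lowered.
std::string to_lowercase(std::string word) {
    for (char& c : word) {
        // Cast to unsigned char first: passing a negative char value
        // to std::tolower is undefined behavior.
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return word;
}
```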

Word multiplicity is to be neglected in this homework. Whether a file has the word “gravitation” appearing once or three hundred times, as far as our simple similarity score is concerned, it happened once!

Some important details:

1. The user’s files will be in the same folder as the .cpp, .h, or .exe files, and the files must be saved to the same folder.

2. You may assume the files contain no punctuation marks.

You can test your code against the sample input files and output file provided (yes, there are typos in the files). A sample output is provided in this document.

The texts provided are samples of 250 words sourced from the following links:

• https://www.gutenberg.org/files/30155/30155-0.txt - The Special and General Theory, by Albert Einstein

• http://www.gutenberg.org/cache/epub/60271/pg60271.txt - From Newton to Einstein, by Benjamin Harrow

• https://www.gutenberg.org/files/52521/52521-0.txt - Grimm’s Fairy Tales

Remarks: there are much better text comparison algorithms out there, but to avoid going too deep into machine learning and algorithms, this one is sufficient. Two very natural improvements would be (i) to remove “stop words”, i.e., words like “a”, “the”, and “and” that appear in almost all text, and (ii) to consider word frequency. Considering word semantics and topic representations of the documents would be huge improvements.

If you’re wondering about the math behind the similarity score S: imagine encoding all words in the English language in $\{0, 1\}^N$, where N is the number of possible words. We treat each word as being orthogonal to every other distinct word. This means we could imagine encoding “physics” as $(1, 0, 0, 0, \ldots)^T$ and “bear” as $(0, 1, 0, 0, \ldots)^T$, etc. We neglect the word multiplicity in each document and then compute the cosine similarity score between two documents to find S.
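To spell out that last step, here is a short derivation sketch. The vector names $\mathbf{a}$ and $\mathbf{b}$ are introduced here for illustration: each is the sum of the encodings of the unique words in its document, so entry $i$ is 1 when word $i$ appears in the document and 0 otherwise.

$$
\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{N} a_i b_i = \text{number of words common to both } A \text{ and } B,
\qquad
\|\mathbf{a}\|^2 = \sum_{i=1}^{N} a_i^2 = \sum_{i=1}^{N} a_i = N_A,
\qquad
\|\mathbf{b}\|^2 = N_B
$$

$$
\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}
= \frac{\text{number of words common to both } A \text{ and } B}{\sqrt{N_A N_B}} = S
$$

Every entry is 0 or 1, so the dot product is nonnegative and S is at least 0. The Cauchy-Schwarz inequality bounds the dot product by the product of the norms, so S is at most 1.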
