讲解 ECE250 Project 4: Hashing调试R程序

ECE250 Project 4: Hashing

Do not proceed until you have a working solution for Project 3

This project will be a little different from the previous three. In this project, you will start with your Project 3 code and build off of it. This means that all of the commands from Project 3 must still work after you modify your classes for this project. We will be re-testing some of the Project 3 commands during testing of this project.

Hashing

Recall in Project 3 that you stored your countries in an array of size 512. Unless you were very proactive, you likely just stored the country data on a first-come, first-served basis. That is, you likely put the first country you encountered at index 0, the second one at index 1 and so on until you ran out of countries. This means that looking up a specific country would be an O(N) search, where N is the number of countries. Let’s do better. In this project, we are going to use double hashing to get this down to O(1) on average.

Double hashing requires us to use two hashing functions. The first, or primary hash function, is used to compute the initial hash. If, after hashing using the primary hash, we get a collision, we must use the secondary hash function.

The Primary Hash Function

The countries are represented via a three-letter country code, and all letters are capitals. For instance, Canada is CAN. We can treat this like a base-26 number by converting each letter’s position in the alphabet into a number, starting with A = 0 and ending at Z = 25. Say now that our country code is given by three characters C0C1C2 and associated digits D0D1D2 , in that order (so CAN would have C0 = C andD0 = 2, for instance). Then we compute the value:

W = D0 × 262 + D1 × 26 + D2

This gives us a number that we can use to compute an array index. Our primary hash function, then, is:

H1(W) = W%m

Where m is the array size. In our case, m = 512. This value is the array index where we would put this country, provided there are no collisions.

The Secondary Hash Function

Unfortunately, the above hash function may cause collisions, where two distinct country codes hash to the same value. For instance, Canada (CAN) and Andorra (AND) both hash to the value 341. To mitigate this, we use a second hash function if a collision is detected. This secondary hash function is computed as:

Since your array size is 512, which is a power of 2, this ensures that H2(W) is always odd, and therefore is always coprime to the size of the array.

The Overall Hash Function

Let i be the number of times you have computed a hash function for a given country code, starting at i = 0. Then the index of the array that you are looking at is given by:

INDX = (H1(W) + iH2(W)) % m

Deletion and Searching

When a country is deleted from your hash table we have a problem. Presume three countries were inserted and collided with their primary hash and their secondary hash. In this case, the first country to be inserted would be at the primary hash value, the second country to be inserted would be at the second hash value (with i = 1), and the third country to be inserted would be at the third hash value (i = 2). What happens when we delete the second country then search for the third?

If we simply mark the second country’s spot as “empty”, then our hash search will stop too early and the third country will never be found. Therefore, it is imperative that we mark the second country’s spot as “previously occupied”. This now changes two things:

1. When searching for the third country, we continue searching if we find either a currently

occupied or a previously occupied space. This allows us to avoid the problem of the empty space ending the search too early.

2. When inserting a new country (say we are re-inserting the second country after deleting it) we insert it in the first empty space or the first previously occupied space

Note: there are several other ways to deal with this problem. This is the one we require you to use in this lab so that the input/output behaviour works as expected. Therefore, this is not an optional specification.

Program Design and Documentation

In this project, the primary goal is to modify your existing classes to use hashing rather than linear lookup. As such, the design document should reflect the changes you made and how you made them. You may therefore choose to modify your Project 3 design document with the appropriate changes. This means you must still outline your class design, it just may be very similar to what you submitted in Project 3. That’s OK.

You must submit a brief design document for this project that outlines your class design. Guidelines for this document are on Learn.

Input/Output Requirements

The table below reflects only the new commands that your code must respond to. Your code must still respond to all commands from Lab 3. For commands from lab 3 in which the country name is given, you may use linear search. You may no longer assume that the LOAD command will be encountered first – all Project 4 commands should work as specified even if LOAD was not called. However, no Project 3 functions will be called unless LOAD has been called. You may assume that LOAD will be encountered at most once in any input file.

In ECE250, we will be testing your submissions via scripts written to run on Linux. To do this your solution must follow specific formatting requirements.

You must create a test program that contains your main program. This program must read commands from standard input and write to standard output.

The program must respond to the commands shown in Error! Reference source not found.. The outputs listed in the “Output” column must appear as shown. For instance, the first row has “success” as an output. This means that the string “success”, written in lowercase, is expected, and if more input cases exist the output will continue on the next line. The program ends (and destructors are called) when standard input is empty.

Table 1: Testing Commands

Command	Parameters	Description	Output
LOOKUP	Country_Code	Looks up the country code in your table, and outputs both the index in the array where the country was found as well as the number of hashing steps required to find it. Note, the number of hashing steps is always at least 1, since you always compute the primary hash.	index X searches Y when the country is found in the array failure when the country is not found in the array
REMOVE	Country_Code	Remove this country from the hash table, if it exists. If a binary tree has been made, also remove this country from the tree.	success If the given country appears in the hash table. If no binary tree exists but the country appears in the table, that is still a success. failure if the given country does not appear in the hash table
INSERT	Country_Code filename	Insert the country with the given code into the hash table if it is not already there. To do this, open the file given by filename	success If the country is not already in the hash table

and search it for the country with the given code, then insert all of that country’s time series as

appropriate. You may assume

that if this command is called the country exists in the file.

failure

If the country is in the hash table

EXIT

All input files end with EXIT. This command produces no output.

Runtime

Prove that your LOOKUP command has average-case runtime O(1) presuming that the number of collisions is O(1) and worst-case runtime O(N), where N is the number of countries.

Code Profiling

The act of observing how many instructions are run is known as “profiling” the code. We will be using code profiling tools to examine your program’s execution and compare it to your P3 implementation for an input file that heavily searches your array for countries. If your hashing implementation is not correct, we will not see a significant drop in the number of instructions that the P4 code runs compared to P3. Code profiling is included in your valgrind grade for this project. Valgrind expectations for this project are now 0 memory leaks, 0 memory errors, and significant reduction in number of instructions. For example, when I run my project 4 code and compare it to my project 3 code, I see a reduction of approximately 150 million instructions.

Submitting your Program

Once you have completed your solution and tested it comprehensively on your own computer or the project computers, produce your tar file according to the general project guidelines and submit to the dropbox on Learn.

联系我们