Assignment 4
Due Friday by 11:59pm
Points 70
Submitting a file upload
Available Oct 4 at 12am - Dec 24 at 11:59pm
Assignment 4 (70 Points)
Due Friday October 11 @ 11:59PM
In this assignment, we will improve the parallel solutions of PageRank from Assignment 3. You will
implement different task decomposition and mapping strategies, and observe their effect. You can use
the same serial versions of the pull-based and push-based PageRank from Assignment 3 to check for
the correctness of your results. You should use your parallel solutions "page_rank_pull_parallel.cpp"
and "page_rank_push_parallel_atomic.cpp" as the starting point of this assignment. You do not need to
download a separate tarball for this assignment, since everything you need was provided in Assignment 3.
You should have completed the Slurm Tutorial (https://canvas.sfu.ca/courses/84236/pages/slurm-tutorial)
which walks you through how to use our servers for your code testing. If you still have trouble with
Slurm, please post questions on the Canvas discussion board. It is strongly recommended to use Slurm
for this assignment instead of CSIL machines, since the graphs you need for checking your programs
are on the /scratch/ drive on the cluster's compute nodes.
General Instructions
1. You are provided with the serial versions and some other tools in the same tarball from Assignment
3, available here (https://canvas.sfu.ca/courses/84236/files/24448335?wrap=1). There is a much
larger file here (https://canvas.sfu.ca/courses/84236/files/24448340?wrap=1) that includes two input
graphs as well as the serial versions. To run a program (e.g., page_rank_pull_parallel.cpp), follow
the steps below:
Run make page_rank_pull_parallel . This creates a binary file called page_rank_pull_parallel .
Create a Slurm job to run the binary using the following command (a sample job script is shown after this list): ./page_rank_pull_parallel --nThreads 4 --nIterations 10 --inputFile absolute_path_of_input_graph --strategy 1
Detailed descriptions of the command-line arguments are provided below.
2. Please read general instructions from Assignment 3
(https://canvas.sfu.ca/courses/84236/assignments/1006850) since they also apply to this assignment.
3. You will be asked to print the time spent by different threads on specific code regions. The time spent
by any code region can be computed as follows:
timer t1;
t1.start();
/* ---- Code region whose time is to be measured --- */
double time_taken = t1.stop();
4. If you need to time a sub-section inside a loop, you can do that as follows:
double time_taken = 0.0;
timer t1;
while (true) {
    /* ---- Code region whose time should not be measured ---- */
    t1.start();
    /* ---- Code region whose time is to be measured ---- */
    time_taken += t1.stop();
    /* ---- Code region whose time should not be measured ---- */
}
std::cout << "Time spent on required code region : " << time_taken << "\n";
5. The programs operate on graph datasets. Sample input graphs are available
at /scratch/input_graphs/ on the compute nodes (note that they are present on the compute nodes only,
and hence you can access them only via Slurm).
$ ls /scratch/input_graphs/*.cs*
lj.csc            rmat.csc          roadNet-CA.csc      test_25M_50M.csc    web-Google.csc
lj.csr            rmat.csr          roadNet-CA.csr      test_25M_50M.csr    web-Google.csr
Please read Assignment 3's general instructions if you want to generate more test graphs.
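For reference, a minimal Slurm job script for the command from step 1 might look like the sketch below. The partition name, resource limits, and input path (shown without the .csc/.csr extension, as in Assignment 3) are assumptions, so substitute the values given in the Slurm Tutorial:

#!/bin/bash
#SBATCH --cpus-per-task=4   # one CPU per thread (--nThreads 4)
#SBATCH --time=10:00        # assumed time limit; adjust as needed
#SBATCH --partition=slow    # assumed partition name; check the Slurm Tutorial

srun ./page_rank_pull_parallel --nThreads 4 --nIterations 10 \
    --inputFile /scratch/input_graphs/roadNet-CA --strategy 1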
1. Add a "strategy" Command Line Parameter
For your parallel programs "page_rank_pull_parallel.cpp" and "page_rank_push_parallel_atomic.cpp"
from Assignment 3, you need to add a command-line parameter "strategy" that defines the
program's task decomposition and mapping strategy. If the parameter is set to 1, i.e., --strategy 1 ,
your parallel program should run exactly the same way as specified in Assignment 3. The "strategy"
values correspond to the strategies described in the following sections.
Strategy 1:
Vertex-based decomposition and static mapping, the same strategy as your Assignment 3 programs.
Vertices are statically distributed among threads such that each thread performs all computations on n/T
vertices.
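For reference, a minimal sketch of this static mapping, assuming n vertices, T threads, and a 0-indexed thread id tid (type names are illustrative; reuse the ones from your Assignment 3 code):

// Strategy 1: thread tid owns a contiguous block of roughly n/T vertices;
// the last thread also takes the n % T leftover vertices.
long vertices_per_thread = n / T;
long start_vertex = tid * vertices_per_thread;
long end_vertex = (tid == T - 1) ? n : start_vertex + vertices_per_thread;
for (long v = start_vertex; v < end_vertex; v++) {
    // process vertex v (pull: read in-neighbors; push: update out-neighbors)
}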
2. Edge-based Decomposition for PageRank
Strategy 2: To achieve better load balance among threads, vertices are distributed among threads such
that each thread performs computations on approximately m/T edges. Pull-based PageRank distributes vertices based
on their in-degrees, while Push-based PageRank distributes vertices based on their out-degrees. Both of
your parallel programs need to support this strategy by setting the strategy parameter to 2 in the
command line ( --strategy 2 ). For example, the pull-based algorithm's pseudo-code is:
Create T threads
Distribute work between threads so that each thread gets to compute on approximately m/T edges
for(i=0; i<iterations; i++) {
    for each thread in parallel {
        for each vertex 'v' allocated to the thread {
            for vertex 'u' in inNeighbor(v)
                next_page_rank[v] += (current_page_rank[u]/outdegree[u])
        }
    }
    for each thread in parallel {
        for each vertex 'v' allocated to the thread {
            compute the new_pagerank using the accumulated values in next_page_rank[v].
            current_page_rank[v] = new_pagerank
            Reset next_page_rank[v] to 0
        }
    }
}
As an example of how edge-based task decomposition works, suppose a graph has the following vertex degrees:

Vertex   In-degree   Out-degree
v1       8           1
v2       5           6
v3       8           3
v4       2           8
v5       3           5
v6       1           4
v7       5           6
v8       4           3
The graph has 36 edges. To distribute the work evenly among 4 threads, each thread needs to compute
approximately 36/4 = 9 edges. Each thread gets assigned vertices until the cumulative number of assigned
edges is >= (thread_id+1) * m/T. Examining the in-degrees, the pull-based algorithm has the following
vertex distribution:
Thread 0: v1, v2 (13 edges, 13 >= 1x9)
Thread 1: v3 (8 edges, 13+8 >= 2x9)
Thread 2: v4, v5, v6 (6 edges, 13+8+6 >= 3x9)
Thread 3: v7, v8 (9 edges)
For the push-based algorithm, we can use the following vertex distribution based on out-degrees:
Thread 0: v1, v2, v3 (10 edges, 10 >= 1x9)
Thread 1: v4 (8 edges, 10+8 >= 2x9)
Thread 2: v5, v6 (9 edges, 10+8+9 >= 3x9)
Thread 3: v7, v8 (9 edges)
Note that the pull-based algorithm in the above example should have a similar runtime to the vertex-based
decomposition, since the longest-running thread computes 13 edges in both cases. However, the
push-based algorithm has a shorter runtime, since its longest-running thread computes 10 edges, while the
vertex-based decomposition requires 11 edge computations for the longest thread.
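One way to compute this assignment is a single pass over the vertices before the threads are created, cutting off a thread's block whenever the cumulative edge count reaches the next multiple of m/T. The sketch below is for the pull-based program and assumes an Assignment 3 style graph object g; the accessor getInDegree is illustrative (use the out-degree for the push-based program):

#include <vector>

// start_vertex[tid] .. end_vertex[tid] is the contiguous block of vertices
// that thread tid will process; each block covers roughly m/T edges.
std::vector<long> start_vertex(T), end_vertex(T);
long edges_assigned = 0;
long v = 0;
for (long tid = 0; tid < T; tid++) {
    start_vertex[tid] = v;
    // Assign vertices until the cumulative edge count reaches (tid+1) * m/T.
    while (v < n && edges_assigned < ((tid + 1) * m) / T) {
        edges_assigned += g.vertices_[v].getInDegree();  // illustrative accessor
        v++;
    }
    end_vertex[tid] = v;
}

Running this on the example above reproduces the pull-based distribution: v1-v2, v3, v4-v6, v7-v8.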
3. Vertex-based Decomposition for PageRank with Dynamic
Mapping
Strategy 3: To achieve even better load balance among threads, vertices are mapped to threads
dynamically based on the time at which different threads finish their tasks. Instead of allocating
approximately equal numbers of vertices (or edges) to threads, we can dynamically allocate work to
each thread whenever it is free. In this strategy, each thread dynamically gets the next vertex to be
computed until all the vertices are processed. Both of your parallel programs need to support this
strategy by setting the strategy parameter to 3 in the command line ( --strategy 3 ). Below is the
pseudo-code showing dynamic task mapping for the push-based Page Rank algorithm with vertex-based
decomposition:
Create T threads
for each thread in parallel {
    for(i=0; i<iterations; i++) {
        while(true) {
            u = getNextVertexToBeProcessed();
            if(u == -1) break;
            edges_processed += outDegree(u)  // used in output validation
            for vertex v in outNeighbor(u)
                next_page_rank[v] += (current_page_rank[u]/outdegree[u])
        }
        barrier1
        while(true) {
            v = getNextVertexToBeProcessed();
            if(v == -1) break;
            vertices_processed += 1  // used in output validation
            compute the new_pagerank using the accumulated values in next_page_rank[v].
            current_page_rank[v] = new_pagerank
            Reset next_page_rank[v] to 0
        }
        barrier2
    }
}
You need to implement the above dynamic mapping strategy in your solutions for
both "page_rank_pull_parallel.cpp" and "page_rank_push_parallel_atomic.cpp".
4. Vertex-based Decomposition for PageRank with Coarse-Grained
Dynamic Mapping
Strategy 4: To reduce the time each thread spends in getNextVertexToBeProcessed() , we will vary
the task granularity so that each thread receives multiple vertices to process each time it
calls getNextVertexToBeProcessed() . Both of your parallel programs need to support this strategy by
setting the strategy parameter to 4 in the command line ( --strategy 4 ). You need to update the dynamic
load distribution logic as follows:
Each thread processes k vertices and then calls getNextVertexToBeProcessed() .
Here, k determines the granularity of the work done by each thread before requesting new work.
For example,
If k = 1 , the thread calls getNextVertexToBeProcessed() after processing each vertex (exactly the
same way as strategy 3)
If k = 1000 , the thread calls getNextVertexToBeProcessed() after processing 1000 vertices.
The getNextVertexToBeProcessed() function should return 0 , k , 2k , ... depending on the
granularity k .
k should be provided at run time using a command-line parameter, e.g., --granularity 100 .
Below is the pseudo-code showing the logic of our parallel solution:
k = 1000  // granularity
Create T threads
for each thread in parallel {
    for(i=0; i<iterations; i++) {
        while(true) {
            u = getNextVertexToBeProcessed()
            if(u == -1) break;
            for (j = 0; j < k; j++) {
                edges_processed += outDegree(u)  // used in output validation
                for vertex v in outNeighbor(u)
                    next_page_rank[v] += (current_page_rank[u]/outdegree[u])
                u++
                if(u >= n) break;  // n is the total number of vertices in the graph
            }
        }
        barrier1
        while(true) {
            v = getNextVertexToBeProcessed()
            if(v == -1) break;
            for (j = 0; j < k; j++) {
                vertices_processed += 1  // used in output validation
                compute the new_pagerank using the accumulated values in next_page_rank[v].
                current_page_rank[v] = new_pagerank
                Reset next_page_rank[v] to 0
                v++
                if(v >= n) break;  // n is the total number of vertices in the graph
            }
        }
        barrier2
    }
}
This strategy should be used with the command-line parameter --granularity to specify the granularity. You
need to add support for --granularity in both of your parallel programs "page_rank_pull_parallel.cpp"
and "page_rank_push_parallel_atomic.cpp".
Input and Output Formats:
1. Your program files should be named
page_rank_pull_parallel.cpp and page_rank_push_parallel_atomic.cpp and should support the
following command-line parameters:
--nThreads : The number of threads.
--nIterations : The number of iterations (similar to the serial code). Default is 10.
--inputFile : The absolute path to the input graph file (similar to the serial code).
--strategy : A value of either 1, 2, 3, or 4 corresponding to the strategies defined above. Any
other value should lead to program termination. The default strategy is 1.
--granularity : A positive integer to be used only in strategy 4; it is ignored for the
other strategies. You need to check that the granularity is always a positive integer: zero, negative,
and non-integer values lead to program termination. The default granularity is 1. (A validation sketch follows this list.)
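As referenced in the --granularity bullet above, the sketch below shows one way to validate the value, assuming the raw argument is available as a std::string (adapt it to whatever argument parsing your Assignment 3 code already uses):

#include <cstdlib>
#include <iostream>
#include <string>

// Accepts only strictly positive integers; anything else terminates.
long parseGranularity(const std::string &arg) {
    size_t pos = 0;
    long k = 0;
    try {
        k = std::stol(arg, &pos);
    } catch (...) {
        pos = 0;  // not a number at all
    }
    if (pos != arg.size() || k <= 0) {  // rejects "10.5", "-3", "0", "abc"
        std::cerr << "Invalid granularity: " << arg << "\n";
        std::exit(1);
    }
    return k;
}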
2. Your parallel solution must output the following information:
Total number of threads used.
Number of iterations.
Strategy used
Granularity used
For each thread:
Thread id (your threads should be numbered in [0, T) )
Number of vertices processed (across all iterations). This is the number of vertices processed
only in the second loop, across all iterations; refer to the pseudo-code above.
Number of edges processed (across all iterations). This is the number of edges processed
only in the first loop, across all iterations; refer to the pseudo-code above.
Cumulative time spent waiting at barrier1 (in seconds)
Cumulative time spent waiting at barrier2 (in seconds)
Cumulative time spent waiting at getNextVertexToBeProcessed() (in seconds).
Time taken by the thread (in seconds).
The sum of pageranks of all vertices.
The total time taken for the entire execution, including the time used in allocating vertices or
edges to threads.
3. The sample console output is below. Please note that you must use exactly this format for all
four strategies. If a value is not generated (e.g., getNextVertex_time in strategies 1 and 2), print 0.0 instead.
Using DOUBLE
Number of Threads : 4
Strategy : 4
Granularity : 1
Iterations : 20
Reading graph
Created graph
thread_id, num_vertices, num_edges, barrier1_time, barrier2_time, getNextVertex_time, total_time
0, 4576646, 26069763, 0.001790, 0.001063, 4.067461, 10.366047
1, 4625648, 23284082, 0.001947, 0.001063, 4.107634, 10.366017
2, 4617670, 26393229, 0.001654, 0.001049, 4.029203, 10.365975
3, 4508596, 26353706, 0.001668, 0.001059, 4.101155, 10.365950
Sum of page ranks : 618961.562500
Time taken (in seconds) : 10.366616
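To reproduce this format, the per-thread statistics can be printed with fixed six-decimal precision. A sketch, assuming the per-thread values were collected into arrays whose names are illustrative:

#include <iomanip>
#include <iostream>

std::cout << "thread_id, num_vertices, num_edges, barrier1_time, "
             "barrier2_time, getNextVertex_time, total_time\n";
std::cout << std::fixed << std::setprecision(6);
for (long tid = 0; tid < T; tid++) {
    std::cout << tid << ", " << num_vertices[tid] << ", " << num_edges[tid]
              << ", " << barrier1_time[tid] << ", " << barrier2_time[tid]
              << ", " << getNextVertex_time[tid] << ", "
              << thread_time[tid] << "\n";
}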
Testing
We are not providing a testing script for this assignment. However, you should strictly adhere to the
format above, including all spaces and commas. You can test your programs using command lines
similar to the Assignment 3 test cases, adding the strategy parameter. We will not disclose the test
cases used for grading before the assignment due date.
Assignment Report
In addition to your parallel code, you need to submit a report (in pdf format) that answers the following
two questions:
Q1. Run each of your two parallel programs with strategy 4 above, using 8 threads, on the test_25M_50M
data set with granularities of 10, 100, 1000, and 2000. Each of your parallel programs should run 3 times,
each run with 20 iterations, on the slow cluster using the default floating-point data type. [The total number
of runs is 2 (parallel versions) x 4 (granularities) x 3 (runs per version/granularity
combination) = 24 runs.]
Plot a bar graph with average execution time on the y-axis and granularity on the x-axis, where each
granularity has 2 bars, one for the average runtime of each of your parallel versions.
[Sample graph: grouped bar chart of average execution time per granularity, one bar per parallel version.]
Q2. Based on the data from your graph in Q1, which of the four granularities performs best for the pull
algorithm, and which performs best for the push_atomic algorithm?
Submission Guidelines
Make sure that your solutions folder has the following files and sub-folders. Let's say your solutions
folder is called my_assignment4_solutions . It should contain:
core/ -- The folder containing all core files. It is already available in the assignment package. Do
not modify it or remove any files.
Makefile -- Makefile for the project. This file should not be changed.
page_rank_pull_parallel.cpp
page_rank_push_parallel_atomic.cpp
report.pdf -- A pdf file that includes answers to the questions in the previous section.
To create the submission file, follow the steps below:
1. Enter your solutions folder and remove all object/temporary files.
$ cd my_assignment4_solutions/
$ make clean
$ rm -rf input_graphs        # or: mv input_graphs ../  (to avoid large tarballs)
2. Create the tar.gz file.
$ tar cvzf assignment4.tar.gz *
This creates a compressed tarball containing the contents of the folder.
Submit via Canvas by the deadline.