
Program Assignment #2
Due date: Nov. 16, 2021
Problem 1: Matrix-Matrix Multiplication
This first hands-on lab introduces a famous and widely used example application
from the parallel programming field: matrix-matrix multiplication. You will
complete key portions of a CUDA program that computes this widely applicable kernel.
In this lab you will learn:
‧ How to allocate and free memory on GPU.
‧ How to copy data from CPU to GPU.
‧ How to copy data from GPU to CPU.
‧ How to measure the execution times for memory access and computation respectively.
‧ How to invoke GPU kernels (see the sketch after this list).
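A minimal sketch of these steps, assuming square width x width float matrices, a
naive one-thread-per-element kernel, and CUDA events for timing. All names here
are illustrative, not the assignment's starter code:

```cuda
// Sketch for Problem 1: naive kernel plus host-side setup and timing.
// All names are illustrative; adapt them to the assignment's starter code.
#include <cuda_runtime.h>

// Naive kernel: one thread computes one element of P = M * N.
__global__ void matrixMul(float *P, const float *M, const float *N, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; ++k)
            sum += M[row * width + k] * N[k * width + col];
        P[row * width + col] = sum;
    }
}

void runMatrixMul(const float *h_M, const float *h_N, float *h_P, int width) {
    size_t bytes = (size_t)width * width * sizeof(float);
    float *d_M, *d_N, *d_P;

    // Allocate device memory for the inputs and the result.
    cudaMalloc(&d_M, bytes);
    cudaMalloc(&d_N, bytes);
    cudaMalloc(&d_P, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the host-to-device copies (contributes to GPU memory access time).
    cudaEventRecord(start);
    cudaMemcpy(d_M, h_M, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_N, h_N, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float copyMs = 0.0f;
    cudaEventElapsedTime(&copyMs, start, stop);

    // Set up kernel execution parameters and time the kernel by itself.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (width + block.y - 1) / block.y);
    cudaEventRecord(start);
    matrixMul<<<grid, block>>>(d_P, d_M, d_N, width);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float kernelMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, stop);

    // Copy the result back to the host, then free device memory.
    cudaMemcpy(h_P, d_P, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_M);
    cudaFree(d_N);
    cudaFree(d_P);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

Timing the copies and the kernel launch with separate event pairs is what lets
you report the memory access time and the computation time as distinct numbers
in the output below.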
Your output should look like this:
Input matrix file name:
Setup host side environment and launch kernel:
Allocate host memory for matrices M and N.
M:
N:
Allocate memory for the result on host side.
Initialize the input matrices.
Allocate device memory.
Copy host memory data to device.
Allocate device memory for results.
Setup kernel execution parameters.
# of threads in a block:
# of blocks in a grid :
Executing the kernel...
Copy result from device to host.
GPU memory access time:
GPU computation time :
GPU processing time :
Check results with those computed by CPU.
Computing reference solution.
CPU Processing time :
CPU checksum:
GPU checksum:
Record your runtimes with respect to different input matrix sizes as follows:
Matrix Size  | GPU Memory Access | GPU Computation | GPU Processing | Ratio of Computation Time
             | Time (ms)         | Time (ms)       | Time (ms)      | vs. matrix 128x128
-------------+-------------------+-----------------+----------------+--------------------------
8 x 8        |                   |                 |                |
128 x 128    |                   |                 |                | 1
512 x 512    |                   |                 |                |
3072 x 3072  |                   |                 |                |
4096 x 4096  |                   |                 |                |
What do you see from these numbers?
Problem 2: Matrix-Matrix Multiplication with Tiling and Shared Memory
This lab is an enhanced matrix-matrix multiplication that uses shared memory and
synchronization between the threads in a block. Shared memory on the device is
allocated to hold the sub-matrix (tile) data for the calculation, so threads
reuse data on chip instead of overtaxing the global-memory bandwidth as in the
previous matrix-matrix multiplication lab.
In this lab you will learn:
‧ How to apply tiling on matrix-matrix multiplication.
‧ How to use shared memory on the GPU.
‧ How to apply thread synchronization in a block (see the kernel sketch after this list).
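A minimal sketch of a tiled kernel, assuming the matrix width is a multiple of
TILE_WIDTH (the 8 x 8 case needs TILE_WIDTH = 8 or boundary checks). All names
are illustrative, not the assignment's starter code:

```cuda
// Sketch for Problem 2: tiled kernel using shared memory and __syncthreads().
// Assumes width is a multiple of TILE_WIDTH; otherwise add boundary checks.
#define TILE_WIDTH 16

__global__ void matrixMulTiled(float *P, const float *M, const float *N, int width) {
    // Each block stages one TILE_WIDTH x TILE_WIDTH tile of M and one of N.
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float sum = 0.0f;

    // March across the tiles of M and N that contribute to P[row][col].
    for (int t = 0; t < width / TILE_WIDTH; ++t) {
        // Each thread loads one element of each tile into shared memory.
        Ms[threadIdx.y][threadIdx.x] = M[row * width + t * TILE_WIDTH + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(t * TILE_WIDTH + threadIdx.y) * width + col];
        __syncthreads();  // wait until the whole tile is loaded

        // Accumulate the partial dot product from this tile.
        for (int k = 0; k < TILE_WIDTH; ++k)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();  // wait before the tile is overwritten
    }
    P[row * width + col] = sum;
}
```

Each element loaded into shared memory is reused TILE_WIDTH times by the threads
of the block, so global-memory traffic drops by roughly a factor of TILE_WIDTH
compared with the naive kernel.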
Your output should look like this:
Input matrix file name:
Setup host side environment and launch kernel:
Allocate host memory for matrices M and N.
M:
N:
Allocate memory for the result on host side.
Initialize the input matrices.
Allocate device memory.
Copy host memory data to device.
Allocate device memory for results.
Setup kernel execution parameters.
# of threads in a block:
# of blocks in a grid :
Executing the kernel...
Copy result from device to host.
GPU memory access time:
GPU computation time :
GPU processing time :
Check results with those computed by CPU.
Computing reference solution.
CPU Processing time :
CPU checksum:
GPU checksum:
Record your runtimes with respect to different input matrix sizes as follows:
Matrix Size  | GPU Memory Access | GPU Computation | GPU Processing | Ratio of Computation Time
             | Time (ms)         | Time (ms)       | Time (ms)      | vs. matrix 128x128
-------------+-------------------+-----------------+----------------+--------------------------
8 x 8        |                   |                 |                |
128 x 128    |                   |                 |                | 1
512 x 512    |                   |                 |                |
3072 x 3072  |                   |                 |                |
4096 x 4096  |                   |                 |                |
What do you see from these numbers? Have they improved significantly compared
with the previous matrix-matrix multiplication implementation?
Problem 3: Matrix-Matrix Multiplication with Tiling and Constant Memory
This lab is an enhanced matrix-matrix multiplication that uses constant memory
and synchronization between the threads in a block. Allocate constant memory
for the matrices M and N.
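A minimal sketch of the constant-memory mechanics, with a caveat worth stating:
CUDA constant memory totals 64 KB, so both matrices only fit whole at small
sizes; for the larger sizes in the table below you would stage tiles of M and N
into constant memory across repeated kernel launches. CWIDTH and the function
names here are illustrative assumptions, not the assignment's starter code:

```cuda
// Sketch for Problem 3: inputs held in constant memory (illustrative names).
// Caveat: constant memory is 64 KB total, so CWIDTH must stay small; larger
// matrices must be processed tile by tile with repeated cudaMemcpyToSymbol.
#include <cuda_runtime.h>

#define CWIDTH 64  // hypothetical size that fits: 2 * 64*64*4 bytes = 32 KB

__constant__ float cM[CWIDTH * CWIDTH];
__constant__ float cN[CWIDTH * CWIDTH];

__global__ void matrixMulConst(float *P, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; ++k)
            sum += cM[row * width + k] * cN[k * width + col];
        P[row * width + col] = sum;
    }
}

// Host side: the inputs go to constant memory via cudaMemcpyToSymbol instead
// of cudaMemcpy into global-memory buffers.
void copyInputsToConstant(const float *h_M, const float *h_N, int width) {
    size_t bytes = (size_t)width * width * sizeof(float);
    cudaMemcpyToSymbol(cM, h_M, bytes);
    cudaMemcpyToSymbol(cN, h_N, bytes);
}
```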
Record your runtimes with respect to different input matrix sizes as follows:
Matrix Size  | GPU Memory Access | GPU Computation | GPU Processing | Ratio of Computation Time
             | Time (ms)         | Time (ms)       | Time (ms)      | vs. matrix 128x128
-------------+-------------------+-----------------+----------------+--------------------------
8 x 8        |                   |                 |                |
128 x 128    |                   |                 |                | 1
512 x 512    |                   |                 |                |
3072 x 3072  |                   |                 |                |
4096 x 4096  |                   |                 |                |
What do you see from these numbers? Have they improved significantly compared
with the previous matrix-matrix multiplication implementation?
