DFT Parallelization in OpenMP
The Fast Fourier Transform (FFT) is the best-known algorithm for computing a
Fourier Transform (FT). The goal of this exercise is to develop a fast DFT
(Discrete Fourier Transform, a different algorithm for computing the FT) by
parallelizing the serial C code that is provided as DFT_seq.c.
The code computes a direct DFT followed by an inverse DFT and measures the total
time taken by the two computations. The computational core of the program is the
DFT subroutine, whose idft argument is 1 for the direct DFT and -1 for the
inverse DFT.
The code also prints the value of the first element of the DFT computation. This
is equal to the input size (N). You can use this number to quickly check the
correctness of your implementation.
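For orientation, the driver around this routine looks roughly like the minimal sketch below. It is not the literal contents of DFT_seq.c: the all-ones input, the *_chk array names, and the use of omp_get_wtime for timing are assumptions made purely for illustration (compile with -fopenmp and link with -lm).

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>   /* omp_get_wtime() for wall-clock timing */

int DFT(int idft, double *xr, double *xi, double *Xr_o, double *Xi_o, int N);  /* defined in DFT_seq.c (shown below) */

int main(void)
{
    int N = 10000;
    double *xr     = calloc(N, sizeof(double));
    double *xi     = calloc(N, sizeof(double));
    double *Xr_o   = calloc(N, sizeof(double));
    double *Xi_o   = calloc(N, sizeof(double));
    double *xr_chk = calloc(N, sizeof(double));
    double *xi_chk = calloc(N, sizeof(double));

    for (int n = 0; n < N; n++)
        xr[n] = 1.0;                              /* assumed input: all ones */

    double start = omp_get_wtime();
    DFT( 1, xr,   xi,   Xr_o,   Xi_o,   N);       /* direct DFT  */
    DFT(-1, Xr_o, Xi_o, xr_chk, xi_chk, N);       /* inverse DFT */
    double stop = omp_get_wtime();

    /* For an all-ones input, the first element of the forward DFT is the sum
       of the inputs, i.e. N; this is the quick correctness check. */
    printf("Xr_o[0] = %f (expected %d)\n", Xr_o[0], N);
    printf("DFT + IDFT time: %f s\n", stop - start);

    free(xr); free(xi); free(Xr_o); free(Xi_o); free(xr_chk); free(xi_chk);
    return 0;
}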
You will focus on parallelizing the DFT function:
int DFT(int idft, double *xr, double *xi, double *Xr_o, double *Xi_o, int N)
{
    // Region 1: main DFT function
    for (int k = 0; k < N; k++)
    {
        for (int n = 0; n < N; n++)
        {
            // Real part of X[k]
            Xr_o[k] += xr[n] * cos(n * k * PI2 / N) + idft * xi[n] * sin(n * k * PI2 / N);
            // Imaginary part of X[k]
            Xi_o[k] += -idft * xr[n] * sin(n * k * PI2 / N) + xi[n] * cos(n * k * PI2 / N);
        }
    }
    // Region 2: Normalizing IDFT
    if (idft == -1)
    {
        for (int n = 0; n < N; n++)
        {
            Xr_o[n] /= N;
            Xi_o[n] /= N;
        }
    }
    return 1;
}
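Before tackling the questions, note where the independence lies: every iteration of the outer k loop in Region 1 writes only its own Xr_o[k] and Xi_o[k], and every iteration of the Region 2 loop writes only its own element, so both loops can be annotated directly. The sketch below is one possible starting point, not necessarily the best-performing design, which the questions ask you to find and justify:

int DFT(int idft, double *xr, double *xi, double *Xr_o, double *Xi_o, int N)
{
    // Region 1: each thread handles a disjoint set of k values, so no two
    // threads ever write the same Xr_o[k] or Xi_o[k].
    #pragma omp parallel for
    for (int k = 0; k < N; k++)
    {
        for (int n = 0; n < N; n++)
        {
            Xr_o[k] += xr[n] * cos(n * k * PI2 / N) + idft * xi[n] * sin(n * k * PI2 / N);
            Xi_o[k] += -idft * xr[n] * sin(n * k * PI2 / N) + xi[n] * cos(n * k * PI2 / N);
        }
    }
    // Region 2: the normalization iterations are independent as well.
    if (idft == -1)
    {
        #pragma omp parallel for
        for (int n = 0; n < N; n++)
        {
            Xr_o[n] /= N;
            Xi_o[n] /= N;
        }
    }
    return 1;
}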
In this exercise, N = 10000. Please provide a comprehensive report answering the
following questions, and include figures that show your measured results.
Questions:
1. (1 point) Use OpenMP to parallelize the given sequential code in
DFT_seq.c. The goal is to achieve the best possible performance for the DFT
while keeping the result correct and avoiding false sharing. Note that the
scheduling policy and the choice of where to parallelize will affect the
performance of your final code. Your plot should show the execution time for
runs on 1, 2, 4, 8, and 16 threads, and each performance measurement should be
the average of 5 repeated runs. Note that you need to parallelize all the code
regions that are relevant for the performance improvement.
2. (1 point) Zoom into Region 1, the nested loops of the main DFT function.
Write the two OpenMP parallelization versions, one for the outer loop and one
for the inner loop, in the report and discuss which design choice is better for
performance and why (an illustrative sketch of both variants is given after the
questions). Then, using the better-performing design as the baseline, discuss
how the choice of scheduling policy (default, static, and dynamic) may impact
its final execution time. Plot the results in a figure showing runs on 1, 2, 4,
8, and 16 threads. Design your scheduling policy so that it avoids false
sharing. Note that the policies may not show significant performance
differences; you have to summarize your findings either way.
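For reference, the following minimal sketch shows the two Region 1 variants referred to in question 2, written as drop-in replacements for Region 1 inside the DFT function above. The schedule(static) clause on the outer-loop variant and the local accumulators are illustrative choices, not the required answer, and which variant performs better is for you to measure and explain.

/* Variant A: parallelize the outer loop. The parallel region is entered once,
   each thread computes whole output elements, and with a static schedule each
   thread writes a contiguous block of Xr_o/Xi_o, which limits false sharing to
   chunk boundaries at most. */
#pragma omp parallel for schedule(static)
for (int k = 0; k < N; k++)
{
    double sum_r = 0.0, sum_i = 0.0;   /* thread-private accumulators */
    for (int n = 0; n < N; n++)
    {
        sum_r +=  xr[n] * cos(n * k * PI2 / N) + idft * xi[n] * sin(n * k * PI2 / N);
        sum_i += -idft * xr[n] * sin(n * k * PI2 / N) + xi[n] * cos(n * k * PI2 / N);
    }
    Xr_o[k] += sum_r;
    Xi_o[k] += sum_i;
}

/* Variant B: parallelize the inner loop. A parallel region is entered once per
   value of k, and a reduction is required because all threads accumulate into
   the same pair of output elements. */
for (int k = 0; k < N; k++)
{
    double sum_r = 0.0, sum_i = 0.0;
    #pragma omp parallel for reduction(+ : sum_r, sum_i)
    for (int n = 0; n < N; n++)
    {
        sum_r +=  xr[n] * cos(n * k * PI2 / N) + idft * xi[n] * sin(n * k * PI2 / N);
        sum_i += -idft * xr[n] * sin(n * k * PI2 / N) + xi[n] * cos(n * k * PI2 / N);
    }
    Xr_o[k] += sum_r;
    Xi_o[k] += sum_i;
}

For the scheduling comparison, only the clause on the chosen baseline changes (for example schedule(static), schedule(static, 100), schedule(dynamic, 100), or no clause for the implementation default; the chunk size 100 is just an illustrative value). Since every k iteration performs the same amount of work, the policies may well end up close to each other, which is an acceptable finding as long as it is measured and explained.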