The FFT performance benchmark is designed to model a situation in which you repeatedly perform transforms of a particular size. (Any one-time initialization cost is therefore not included in the timing measurements.) This seems to be the most common kind of use for FFT code, especially in cases where performance is important.
Essentially, a given array is repeatedly FFTed and the elapsed time is measured. In order to avoid the possibility of a diverging process, we initialize the array to zero before the FFTs. (It is unlikely that any FFT's speed is input-dependent.)
For the complex transforms, we only time one of the FFTs, either the forward or the backward transform. We make the reasonable assumption that the forward and backward transforms take the same time to compute, and so it is not necessary to measure the performance of both (for most programs, the two cases use the same code).
For the real transforms, we benchmark the forward followed by the backward transforms. The reason for this is that the forward (real to complex) and backward (complex to real) transform implementations are often quite different, and typically both are needed.
perform one-time initializations
initialize data to zero
complex transform: fft data
real transform:    fft data then ifft data
num_iters = 1
do
    get start_time
    for iteration = 1 to num_iters do
        complex transform: fft data
        real transform:    fft data then ifft data
    get end_time
    t = end_time - start_time
    if (t < 1.0) then num_iters = num_iters * 2
while (t < 1.0)

You might wonder why we initialize the array and compute the FFT once before the timing starts. There are two reasons. First, some codes perform one-time initializations the first time you call them. Second, we don't want to measure the time taken to load instructions and data into the cache on the first call (although this is probably insignificant anyway). The number of iterations is doubled until the elapsed time t reaches at least one second, so that the granularity of the system timer does not distort the measurement. Note also that we initialize the data to zero before benchmarking. If we didn't do this, we would be computing a diverging process and would soon be timing floating-point exceptions. Initializing to zero shouldn't affect the results, since neither FFT algorithms nor floating-point hardware on modern CPUs run at a data-dependent speed.
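To make the pseudocode concrete, here is a minimal C sketch of the timing loop for a complex transform. The setup() and fft() routines are placeholders for whatever library is under test (they are not part of any particular API), and clock() is used only for illustration; a real benchmark would want a higher-resolution timer.

#include <time.h>

/* Placeholders for the library under test (hypothetical interface). */
extern void setup(int n);               /* one-time initialization     */
extern void fft(double *data, int n);   /* forward transform, in place */

/* CPU seconds elapsed so far (coarse; for illustration only). */
static double seconds(void)
{
    return (double) clock() / CLOCKS_PER_SEC;
}

/* Return the time for one FFT of size n, following the pseudocode above. */
double time_one_fft(double *data, int n)
{
    int i, num_iters = 1;
    double t;

    setup(n);                           /* one-time costs: not timed   */
    for (i = 0; i < 2 * n; ++i)         /* complex data: 2n doubles    */
        data[i] = 0.0;                  /* zero to avoid divergence    */
    fft(data, n);                       /* warm-up call: not timed     */

    do {
        double start = seconds();
        for (i = 0; i < num_iters; ++i)
            fft(data, n);
        t = seconds() - start;
        if (t < 1.0)
            num_iters *= 2;             /* too short to time reliably  */
    } while (t < 1.0);

    return t / num_iters;               /* seconds per transform       */
}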
complex transform "mflops"
= 5 N log2N / (time for one FFT in
µs)
real transform "mflops"
= 5 N log2N / (time for one FFT + one IFFT in µs)
Here, N is the size of the transform (total number of points in multi-dimensional FFTs). "mflops" is in quotes because it is not really the MFLOPS of the FFT, and is likely to be a source of confusion for some readers. Regardless, we believe that it is the best way to report performance, as we shall explain below.
The first number that one might think to report is the elapsed time "t" from above. Since each FFT might be running for a different number of iterations, however, you need to at least divide t by the number of iterations, yielding the time for one FFT. This is still unsatisfactory, because the time for one FFT inherently increases with transform size, making it impossible to compare results for different transform sizes, or even view them together in a single graph.
Since the number of instructions executed in an FFT is O(N log2N), it makes sense to divide the time for one FFT by N log2N; call this quantity t'. Now, t' is comparable even between different transform sizes and would seem a suitable number to report. There is a shortcoming, however, if you try to plot t' on a single graph for all FFTs and transform sizes. Since t' is smaller for fast FFTs and larger for slow FFTs, most of the graph is occupied by the slow FFTs, while the fast FFTs are huddled near the bottom where they are difficult to compare. This is unacceptable--it is the fast FFTs, after all, that you are most interested in.
Rather than t', one can report 1/t'. This will yield graphs where the fast FFTs are emphasized and are easy to compare, while the slow FFTs will be clustered around the bottom of the plot. However, by this point you have lost all intuition about the meaning of the magnitudes (as opposed to the relative values) of the numbers that you are reporting. 1/t' is also inconvenient to compare with numbers quoted in the literature.
Instead, we report 5/t', or 5 N log2N / (time for one FFT in µs). The reason for this is that the number of floating point operations in a radix-2 Cooley-Tukey FFT of size N is 5 N log2N (+ O(N)). If we assume that this is also an approximation for the number of operations in any FFT, then 5/t' is roughly equal to the MFLOPS (millions of floating-point operations per second) of the transforms. That is why we call it the "mflops," and it has the advantage that its absolute magnitude has some meaning. (It is also a standard way of reporting FFT performance in the literature.) Note that the relative values of the "mflops" are still the most important quantities, however. For real-complex transforms, the rationale is the same (in this case approximating the flops by half the flops of a radix-2 FFT).
For example, the 167 MHz UltraSPARC can perform two floating-point operations per cycle, and is thus capable in "theory" (as opposed to reality) of 334 MFLOPS. From the benchmark results, we see that we can achieve about 2/3 of that for small transforms, where floating-point computation dominates, and much less for larger transforms, which are dominated by memory access.
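As a concrete illustration (not part of the benchmark code itself), the reported quantity can be computed with a couple of lines of C, given the transform size and the measured time; for real transforms the time argument is the sum of the forward and backward times.

#include <math.h>

/* "mflops" as defined above: 5 N log2(N) divided by the time in
   microseconds.  For complex transforms, pass the time for one FFT;
   for real transforms, pass the time for one FFT plus one IFFT. */
double mflops(int n, double time_in_seconds)
{
    return 5.0 * n * log2((double) n) / (time_in_seconds * 1e6);
}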
Some people might propose that we report the actual MFLOPS of each program. Aside from the fact that this is difficult to compute exactly (you have to count how many floating-point operations are performed), it is also useless. The problem is that you cannot make meaningful comparisons of the actual MFLOPS of two programs. One program could have a higher MFLOPS than the other simply by performing lots of superfluous operations in a tight loop! You are interested in how long a program takes to compute an answer, not in how many multiplications and additions it takes to get there.
average of 2 |xi - new_xi| / (|xi| + |new_xi| + epsilon)

Here, epsilon is a small number to prevent us from dividing by zero. The original xi consist of pseudo-random numbers (generated by the rand() function).
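A sketch of this error measure in C might look as follows, assuming (in line with the discussion of inverse transforms below) that new_x holds the data after a forward transform followed by an inverse transform; the function name and the particular value of epsilon are illustrative choices only.

#include <math.h>

/* Average of 2 |x_i - new_x_i| / (|x_i| + |new_x_i| + epsilon)
   over all n elements of the two arrays. */
double average_relative_error(const double *x, const double *new_x, int n)
{
    const double epsilon = 1e-30;   /* small constant; exact value assumed */
    double sum = 0.0;
    int i;

    for (i = 0; i < n; ++i)
        sum += 2.0 * fabs(x[i] - new_x[i])
             / (fabs(x[i]) + fabs(new_x[i]) + epsilon);

    return sum / n;
}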
If an FFT implementation does not provide an inverse FFT, then we construct one using the identity ifft(x) = fft(x*)*, where "*" denotes complex conjugation. Since complex conjugation is an exact operation, this procedure does not introduce additional error, and we are measuring the accuracy of the FFT subroutine exclusively.
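For example, given a forward transform fft() that operates in place on interleaved (real, imaginary) pairs, an inverse transform can be synthesized as follows; the interface is hypothetical, and whatever normalization convention the forward transform uses carries over unchanged.

/* Assumed in-place forward transform on n complex points stored as
   interleaved (re, im) pairs. */
extern void fft(double *data, int n);

/* Inverse transform via the identity ifft(x) = fft(x*)*. */
void ifft(double *data, int n)
{
    int i;

    for (i = 0; i < n; ++i)            /* conjugate the input...  */
        data[2 * i + 1] = -data[2 * i + 1];

    fft(data, n);                      /* ...forward transform... */

    for (i = 0; i < n; ++i)            /* ...conjugate the output */
        data[2 * i + 1] = -data[2 * i + 1];
}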
First, there are all the powers of two up to the memory limit of your machine. These are by far the most common transform sizes in the real world as they are typically the most efficient to compute. They are also the easiest to code, and many FFT implementations only support transforms of sizes that are powers of two.
Second, there are (more-or-less) randomly selected numbers whose factors are powers of 2, 3, 5, and 7. These were chosen because the FFT is usually fastest for numbers with small prime factors, and so real applications usually try to limit themselves to such sizes even if they don't restrict themselves to powers of two.
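The notion of a size whose only factors are 2, 3, 5, and 7 is easy to test for; a small helper like the following (illustrative only, not part of the benchmark) could be used to filter candidate transform sizes.

/* Return 1 if n > 0 has no prime factors other than 2, 3, 5, and 7. */
int has_only_small_factors(int n)
{
    static const int primes[] = { 2, 3, 5, 7 };
    int i;

    if (n <= 0)
        return 0;
    for (i = 0; i < 4; ++i)
        while (n % primes[i] == 0)
            n /= primes[i];
    return n == 1;
}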