
Methodology

In this section we describe and motivate the methodology of our benchmark. We hope to convince you that our program provides a fair and meaningful measurement of both the performance and the accuracy of FFT software.

Performance Measurement

The FFT performance benchmark is designed to model a situation in which you repeatedly perform transforms of a particular size. (Any one-time initialization cost is therefore not included in the timing measurements.) This seems to be the most common kind of use for FFT code, especially in cases where performance is important.

Essentially, a given array is repeatedly FFTed and the elapsed time is measured. In order to avoid the possibility of a diverging process, we initialize the array to zero before the FFTs. (It is unlikely that any FFT's speed is input-dependent.)

Forward and Backward Transforms

For the complex transforms, we only time one of the FFTs, either the forward or the backward transform. We make the reasonable assumption that the forward and backward transforms take the same time to compute, and so it is not necessary to measure the performance of both (for most programs, the two cases use the same code).

For the real transforms, we benchmark the forward followed by the backward transforms. The reason for this is that the forward (real to complex) and backward (complex to real) transform implementations are often quite different, and typically both are needed.

Timing

It is necessary to time for a long enough period that the resolution of the clock is not an issue, but not for so long that the benchmark takes forever to run. Our solution is to repeatedly double the number of iterations used (starting with 1 iteration) until the elapsed time is at least 1 second. This is repeated for every FFT code and every transform size. So, in pseudo-code, the benchmark process (for one FFT and one transform size) is:
perform one-time initializations
initialize data to zero
complex transform: fft data
real transform: fft data then ifft data
num_iters = 1

do
     get start_time
     for iteration = 1 to num_iters do
          complex transform: fft data
          real transform: fft data then ifft data
     get end_time
     t = end_time - start_time
     if (t < 1.0) then
          num_iters = num_iters * 2
while t < 1.0
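In Python, the doubling strategy above might be sketched as follows (a sketch only: the transform argument is a stand-in for whatever FFT routine is being benchmarked, and for real transforms it would perform the FFT followed by the IFFT):

```python
import time

def time_one_transform(transform, data, min_time=1.0):
    """Return the time in seconds for one call of `transform`, using
    the doubling strategy described above: the iteration count starts
    at 1 and doubles until the total elapsed time is at least
    `min_time`, so that clock resolution is not an issue."""
    transform(data)  # warm-up: one-time setup and cache loading are excluded
    num_iters = 1
    while True:
        start = time.perf_counter()
        for _ in range(num_iters):
            transform(data)  # for real transforms: fft then ifft
        elapsed = time.perf_counter() - start
        if elapsed >= min_time:
            return elapsed / num_iters
        num_iters *= 2
```

The `min_time` threshold corresponds to the 1-second cutoff in the pseudocode above.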
You might wonder why we initialize the array and compute the FFT once before the timing starts. There are two reasons. First, some codes perform one-time initializations the first time you call them. Second, we don't want to measure the time taken to load instructions and data into the cache on the first call (although this is probably insignificant anyway). Note also that we initialize the data to zero before benchmarking. If we didn't do this, we would be computing a diverging process and would soon be timing floating-point exceptions. Initializing to zero shouldn't affect the results, since the speed of FFT implementations (and of floating-point arithmetic on modern CPUs) does not normally depend on the values of the data.

Performance Numbers Reported

Our benchmark reports the "mflops" of each FFT for every transform size. This is defined to be:

complex transform "mflops"
= 5 N log2N / (time for one FFT in µs)

real transform "mflops"
= 5 N log2N / (time for one FFT + one IFFT in µs)

Here, N is the size of the transform (total number of points in multi-dimensional FFTs). "mflops" is in quotes because it is not really the MFLOPS of the FFT, and is likely to be a source of confusion for some readers. Regardless, we believe that it is the best way to report performance, as we shall explain below.
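As an illustration, the reported figure could be computed from a measured time like this (assuming the time is measured in seconds, so the 1e6 factor converts it to microseconds):

```python
import math

def mflops(n, seconds):
    """Benchmark "mflops" for a size-n transform: 5 N log2(N) divided
    by the time in microseconds. For complex transforms, `seconds` is
    the time for one FFT; for real transforms, it is the time for one
    FFT plus one IFFT."""
    return 5.0 * n * math.log2(n) / (seconds * 1e6)
```

For example, a size-1024 complex transform that takes 100 µs is reported as 5 * 1024 * 10 / 100 = 512 "mflops".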

The first number that one might think to report is the elapsed time "t" from above. Since each FFT might be running for a different number of iterations, however, you need to at least divide t by the number of iterations, yielding the time for one FFT. This is still unsatisfactory, because the time for one FFT inherently increases with transform size, making it impossible to compare results for different transform sizes, or even view them together in a single graph.

Since the number of instructions executed in an FFT is O(N log2N), it makes sense to divide the time for one FFT by N log2N; call this quantity t'. Now, t' is comparable even between different transform sizes and would seem a suitable number to report. There is a shortcoming, however, if you try to plot t' on a single graph for all FFTs and transform sizes. Since t' is smaller for fast FFTs and larger for slow FFTs, most of the graph is occupied by the slow FFTs, while the fast FFTs are huddled near the bottom where they are difficult to compare. This is unacceptable--it is the fast FFTs, after all, that you are most interested in.

Rather than t', one can report 1/t'. This will yield graphs where the fast FFTs are emphasized and are easy to compare, while the slow FFTs will be clustered around the bottom of the plot. However, by this point you have lost all intuition about the meaning of the magnitudes (as opposed to the relative values) of the numbers that you are reporting. 1/t' is also inconvenient to compare with numbers quoted in the literature.

Instead, we report 5/t', or 5 N log2N / (time for one FFT in µs). The reason for this is that the number of floating point operations in a radix-2 Cooley-Tukey FFT of size N is 5 N log2N (+ O(N)). If we assume that this is also an approximation for the number of operations in any FFT, then 5/t' is roughly equal to the MFLOPS (millions of floating-point operations per second) of the transforms. That is why we call it the "mflops," and it has the advantage that its absolute magnitude has some meaning. (It is also a standard way of reporting FFT performance in the literature.) Note that the relative values of the "mflops" are still the most important quantities, however. For real-complex transforms, the rationale is the same (in this case approximating the flops by half the flops of a radix-2 FFT).
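The 5 N log2N figure can be derived by counting the operations in a textbook radix-2 decimation-in-time FFT, ignoring the O(N) savings from trivial twiddle factors such as 1 and i:

```python
import math

def radix2_flops(n):
    """Operation count of a textbook radix-2 Cooley-Tukey FFT of size n
    (a power of two). Each of the log2(n) stages performs n/2 complex
    multiplications (6 real flops each: 4 multiplies + 2 adds) and
    n complex additions (2 real flops each), so the total is
    log2(n) * (3n + 2n) = 5 n log2(n)."""
    stages = int(math.log2(n))
    return stages * (6 * (n // 2) + 2 * n)
```

For instance, radix2_flops(1024) gives 51200, i.e. 5 * 1024 * 10.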

For example, the 167 MHz UltraSPARC can perform two floating-point operations at a time, and is thus capable in "theory" (as opposed to reality) of 334 MFLOPS. From the benchmark results, we see that we can achieve about 2/3 of that for small transforms, where floating-point computations dominate, and much less for larger transforms that are dominated by memory access.

Some people might propose that we report the actual MFLOPS of each program. Aside from the fact that this is difficult to compute exactly (you have to count how many floating-point operations are performed), it is also useless. The problem is that you cannot make meaningful comparisons of the actual MFLOPS of two programs. One program could have a higher MFLOPS than the other simply by performing lots of superfluous operations in a tight loop! You are interested in how long a program takes to compute an answer, not in how many multiplications and additions it takes to get there.

Accuracy Measurement

Measuring the accuracy of an FFT is much simpler than measuring its performance. Essentially, all we do is perform the FFT and inverse FFT of some data, scale the result if necessary, and compare it to the original data. The difference is reported as a "mean fractional error." If x is the original data and new_x is the data after the transforms, then we define the mean fractional error as:

the average over i of 2 |x_i - new_x_i| / (|x_i| + |new_x_i| + epsilon)

Here, epsilon is a small number to prevent us from dividing by zero.

The original data x_i consist of pseudo-random numbers (generated by the rand() function).
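In code, the error measure could look like this (the exact value of epsilon here is an illustrative choice, not the benchmark's):

```python
def mean_fractional_error(x, new_x, epsilon=1e-16):
    """Mean fractional error between the original data x and the data
    new_x obtained after the FFT/IFFT round trip (and any scaling);
    epsilon guards against division by zero when both values vanish."""
    return sum(2.0 * abs(a - b) / (abs(a) + abs(b) + epsilon)
               for a, b in zip(x, new_x)) / len(x)
```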

If an FFT implementation does not provide an inverse FFT, then we construct one using the identity that ifft(x) = fft(x*)*, where "*" denotes complex conjugation. Since complex conjugation is an exact operation, this procedure does not introduce additional error and we are measuring the accuracy of the FFT subroutine exclusively.
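The identity is easy to check with a naive, unnormalized DFT in pure Python (a demonstration only, not the benchmark's code; sign=-1 gives the forward transform, sign=+1 the unnormalized inverse):

```python
import cmath

def dft(x, sign=-1):
    """Naive O(n^2) unnormalized DFT; sign=-1 is the forward transform,
    sign=+1 the (unnormalized) inverse."""
    n = len(x)
    return [sum(x[j] * cmath.exp(sign * 2j * cmath.pi * k * j / n)
                for j in range(n))
            for k in range(n)]

def idft_via_conjugation(x):
    """Inverse transform constructed from the forward transform alone,
    via the identity ifft(x) = fft(x*)*."""
    return [v.conjugate() for v in dft([v.conjugate() for v in x])]
```

Up to rounding, idft_via_conjugation(x) agrees with dft(x, sign=+1) for any input.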

Default Transform Sizes

Unless a particular transform size is specified, the benchmark is run for a hard-coded selection of "representative" sizes. These sizes fall into two groups.

First, there are all the powers of two up to the memory limit of your machine. These are by far the most common transform sizes in the real world as they are typically the most efficient to compute. They are also the easiest to code, and many FFT implementations only support transforms of sizes that are powers of two.

Second, there are (more-or-less) randomly selected numbers whose factors are powers of 2, 3, 5, and 7. These were chosen because the FFT is usually fastest for numbers with small prime factors, and so real applications usually try to limit themselves to such sizes even if they don't restrict themselves to powers of two.
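One way such sizes can be enumerated is sketched below (the benchmark itself uses a hard-coded list; the range bound here is arbitrary):

```python
def has_only_small_prime_factors(n, primes=(2, 3, 5, 7)):
    """True if n factors completely into powers of the given primes."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1

# Candidate transform sizes up to 100 whose only prime factors
# are 2, 3, 5, and 7:
sizes = [n for n in range(2, 101) if has_only_small_prime_factors(n)]
```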

A Few Words on Bias

As we are the authors of FFTW, you might be justifiably concerned that we could have tilted the benchmark in our favor. In fact, the basic methodology of the benchmark pre-dates FFTW, and we believe its fairness and neutrality should be evident from the discussion above. We have strictly avoided any "tweaking" of the measurements in such a way as to favor a particular FFT code (e.g. fudging data alignments, default transform sizes, etcetera). Feel free to look at the source code if you are worried, and don't hesitate to email us if you have any questions or concerns regarding our methods.