New P4 Optimizations

Optimization by Packing Data Structures

The plot below shows the performance of the improved staggered fermion code, su3_rmd_symzk1_asqtad, measured as sustained floating point operations per second during execution of the conjugate gradient routine. The lower red curve is the standard MILC 6 code. The green curve is a version of the same code, modified by Steve Gottlieb and Dick Foster, that packs the corresponding data structures from the sites of the lattice together. The blue curve adds SSE math routines (matrix-matrix and matrix-vector multiplies) to this data restructuring. Finally, the upper red curve adds hand-inserted prefetching of data structures on top of the packing and the SSE math routines.

The cache line size on a Pentium 4 is 64 bytes, and the fundamental data structures used in the MILC computations are 72 bytes (su3 matrix) and 24 bytes (su3 vector). When the matrices are traversed with a large stride (as in the normal MILC code), two cache lines must be loaded for each matrix, but in general only 72 of the 128 bytes loaded are useful. When the matrices are packed together, the extra 56 bytes loaded with one matrix belong to a neighboring matrix, which may be used before the line is evicted from the cache. The speedup on Pentium 4 processors is much more pronounced than on Athlons (see Dual Athlon Results).
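The difference between the two layouts can be sketched with Python's ctypes module. This is only an illustration, not the MILC source: the Site structure here is hypothetical, with a single link matrix followed by opaque padding that stands in for all the other per-site fields, and its total size is set to the 1656-byte stride quoted later on this page.

```python
import ctypes

class Su3Matrix(ctypes.Structure):
    # 3x3 complex single-precision matrix: 18 floats = 72 bytes
    _fields_ = [("e", ctypes.c_float * 18)]

SITE_BYTES = 1656  # per-site stride quoted later on this page

class Site(ctypes.Structure):
    # Hypothetical site layout: one matrix plus padding for the other fields
    _fields_ = [
        ("link", Su3Matrix),
        ("other_fields", ctypes.c_ubyte * (SITE_BYTES - ctypes.sizeof(Su3Matrix))),
    ]

def field_stride(array, get_field):
    """Byte distance between the same field in elements 0 and 1 of an array."""
    first = ctypes.addressof(get_field(array[0]))
    second = ctypes.addressof(get_field(array[1]))
    return second - first

N_SITES = 4
lattice = (Site * N_SITES)()             # standard layout: array of site structures
packed_links = (Su3Matrix * N_SITES)()   # packed layout: matrices stored back to back

strided = field_stride(lattice, lambda s: s.link)
packed = field_stride(packed_links, lambda m: m)
print("standard stride:", strided, "bytes; packed stride:", packed, "bytes")
```

In the packed array the byte immediately after one matrix is the first byte of the next, which is why the remainder of a second cache line is no longer wasted.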

Investigation Using Performance Counters

To investigate this performance enhancement, we have instrumented this MILC binary using our TRACE toolkit. With TRACE, we are able to monitor the various Pentium 4 performance counters during the execution of the code and during all other activities of the system (that is, we can record performance counter activity per process). We configured the counters to monitor the DRDY line of the processor, and to count all retired floating point instructions.

The DRDY line on a Pentium 4 is documented to make a transition whenever data is ready on the memory bus. Using the Streams benchmark, we have determined that this is not quite true. In reality, DRDY makes a transition for each set of four transfers (recall that the Pentium 4 has a 100 MHz front side bus, but that it performs four transfers per clock). Each DRDY corresponds to 32 bytes read or written (the data bus is 64 bits wide, or 8 bytes). A cache line on a Pentium 4 is 64 bytes long, or two DRDY transitions.
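The bus arithmetic above can be written out as a short sketch. The constant names are ours, chosen for illustration; the values are the ones quoted in the paragraph, and the peak figure is simply what those values imply for an ideal, fully occupied bus.

```python
FSB_CLOCK_HZ = 100_000_000   # 100 MHz front side bus
TRANSFERS_PER_CLOCK = 4      # four transfers per bus clock
BUS_WIDTH_BYTES = 8          # 64-bit data bus
CACHE_LINE_BYTES = 64
TRANSFERS_PER_DRDY = 4       # observed: one DRDY transition per four transfers

# Bytes moved per DRDY transition
bytes_per_drdy = TRANSFERS_PER_DRDY * BUS_WIDTH_BYTES

# DRDY transitions needed to move one cache line
drdy_per_cache_line = CACHE_LINE_BYTES // bytes_per_drdy

# Theoretical ceilings implied by the numbers above
peak_bytes_per_sec = FSB_CLOCK_HZ * TRANSFERS_PER_CLOCK * BUS_WIDTH_BYTES
peak_drdy_per_sec = peak_bytes_per_sec // bytes_per_drdy

print(bytes_per_drdy, drdy_per_cache_line, peak_bytes_per_sec, peak_drdy_per_sec)
```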

In the plot below, we show the number of DRDY transitions which occur during each execution of the conjugate gradient subroutine, as a function of the lattice size. The red crosses correspond to the standard MILC 6 code, and the green crosses correspond to the modified code. We clearly see that the modified code requires substantially less activity on the memory bus.

Next, we look at the number of floating point operations per DRDY. Again, the red crosses are the standard MILC code, and the green crosses show the code modified to reorder the data structures.

Finally, we look at memory bus utilization, which we calculate as the number of DRDY transitions multiplied by 32 bytes, divided by the elapsed time.
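The two derived quantities plotted here, floating point operations per DRDY and memory bus traffic per second, can be sketched as follows. The function names are ours, and the sample counter readings are made-up numbers chosen only to exercise the arithmetic; they are not measurements from these runs.

```python
BYTES_PER_DRDY = 32  # each DRDY transition corresponds to 32 bytes on the bus

def flops_per_drdy(flop_count, drdy_count):
    """Retired floating point operations per DRDY transition."""
    if drdy_count <= 0:
        raise ValueError("drdy_count must be positive")
    return flop_count / drdy_count

def bus_bytes_per_sec(drdy_count, seconds):
    """Memory bus traffic: DRDY transitions times 32 bytes, divided by time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return drdy_count * BYTES_PER_DRDY / seconds

# Hypothetical counter readings for one conjugate gradient call
sample_flops = 400_000_000
sample_drdy = 50_000_000
sample_seconds = 1.0

print(flops_per_drdy(sample_flops, sample_drdy))       # flops per DRDY
print(bus_bytes_per_sec(sample_drdy, sample_seconds))  # bytes per second
```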

The minimum transfer size from main memory to L2 cache is a cache line, or 64 bytes. Since su3 matrices are 72 bytes long (9 single-precision complex numbers), two cache lines must be fetched for each matrix. The extra 56 bytes loaded as part of the second cache line may not be useful in the standard MILC code, where the stride between corresponding fields in the lattice structure is, in this case, 1656 bytes. In the modified MILC code, the extra 56 bytes may be useful, as they correspond to part of the neighboring su3 matrix. Crudely, the efficiency of an su3 matrix load in the standard MILC code is 56% (72 useful bytes out of 128 loaded), assuming the extra bytes are not used, and the efficiency of a load in the modified MILC code approaches 100% for very large lattices. The plot of floating point operations per DRDY above shows approximately this ratio between the two codes.
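The crude efficiency estimate can be checked with a small address model. This is a sketch under simplifying assumptions that are ours: the first matrix starts on a 64-byte boundary, every cache line is fetched exactly once, and a fetched line stays in cache until all matrices sharing it have been used. The function names are illustrative only.

```python
CACHE_LINE = 64
MATRIX_BYTES = 72

def lines_touched(n_matrices, stride, base=0):
    """Count distinct cache lines covered by n matrices placed `stride` bytes apart."""
    lines = set()
    for i in range(n_matrices):
        start = base + i * stride
        end = start + MATRIX_BYTES - 1          # last byte of this matrix
        lines.update(range(start // CACHE_LINE, end // CACHE_LINE + 1))
    return len(lines)

def load_efficiency(n_matrices, stride, base=0):
    """Useful bytes divided by bytes fetched from memory."""
    useful = n_matrices * MATRIX_BYTES
    loaded = lines_touched(n_matrices, stride, base) * CACHE_LINE
    return useful / loaded

standard = load_efficiency(8, 1656)   # stride of the standard lattice structure
packed = load_efficiency(8, 72)       # matrices stored back to back
print(standard, packed, packed / standard)
```

With these assumptions, eight matrices at the 1656-byte stride touch 16 cache lines, while eight packed matrices touch 9, so the model gives 56.25% against 100%, a ratio of about 1.8. For a different starting alignment or matrix count the packed figure falls slightly below 100%, which is consistent with it only approaching 100% on large lattices.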