The cache line size on a Pentium 4 is 64 bytes, and the fundamental data structures used in the MILC computations are 72 bytes (su3 matrix) and 24 bytes (su3 vector). When the matrices are traversed with a large stride, as in the normal MILC code, two cache lines must be loaded for each matrix, yet in general only 72 of the 128 bytes loaded are useful. When the matrices are packed together, the extra 56 bytes loaded with one matrix belong to a neighboring matrix, which may be used before it expires from the cache. The speedup on Pentium 4 processors is much more pronounced than on Athlons (see Dual Athlon Results).
The DRDY line on a Pentium 4 is documented to make a transition whenever data is ready on the memory bus. Using the Streams benchmark, we have determined that this is not quite true. In reality, DRDY makes a transition for each set of four transfers (recall that the Pentium 4 has a 100 MHz front side bus, but that it performs four transfers per clock). The data bus is 64 bits (8 bytes) wide, so each DRDY transition corresponds to 32 bytes read or written. A cache line on a Pentium 4 is 64 bytes long, which corresponds to two DRDY transitions.
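The arithmetic in the paragraph above can be restated as a short Python sketch. The constant names are illustrative choices of ours; the values are taken from the text, and the peak figure is simply the product of those values, not a measured number.

```python
# Bus parameters as stated in the text (names are illustrative).
BUS_CLOCK_HZ = 100_000_000   # 100 MHz front side bus
TRANSFERS_PER_CLOCK = 4      # four transfers per bus clock
BUS_WIDTH_BYTES = 8          # 64-bit data bus
CACHE_LINE_BYTES = 64        # Pentium 4 cache line

# One DRDY transition covers one set of four transfers.
bytes_per_drdy = TRANSFERS_PER_CLOCK * BUS_WIDTH_BYTES

# Number of DRDY transitions needed to move one cache line.
drdy_per_cache_line = CACHE_LINE_BYTES // bytes_per_drdy

# Theoretical peak of the bus implied by these figures, in bytes per second.
peak_bytes_per_second = BUS_CLOCK_HZ * TRANSFERS_PER_CLOCK * BUS_WIDTH_BYTES

print(bytes_per_drdy, drdy_per_cache_line, peak_bytes_per_second)
```

Under these figures, the bus can move at most 3.2 GB/s, which gives a ceiling against which measured DRDY counts can be compared.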
In the plot below, we show the number of DRDY transitions which occur during each execution of the conjugate gradient subroutine, as a function of the lattice size. The red crosses correspond to the standard MILC 6 code, and the green crosses correspond to the modified code. We clearly see that the modified code requires substantially less activity on the memory bus.
Next, we look at the number of floating point operations per DRDY. Again, the red crosses are the standard MILC code, and the green crosses show the code modified to reorder the data structures.
Finally, we look at memory bandwidth utilization. We calculate memory bus utilization as DRDY transitions multiplied by 32 bytes, divided by time.
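As a minimal sketch of the two derived quantities used in these plots, the following Python functions compute floating point operations per DRDY and memory bus bandwidth from a DRDY count. The function names and the sample numbers are hypothetical and are not measurements from this study; only the 32 bytes per DRDY and the bus parameters come from the text.

```python
BYTES_PER_DRDY = 32                   # four 8-byte transfers per DRDY transition
PEAK_BYTES_PER_SECOND = 100e6 * 4 * 8  # 100 MHz bus, 4 transfers/clock, 8 bytes wide

def flops_per_drdy(flop_count, drdy_count):
    """Floating point operations performed per DRDY transition."""
    return flop_count / drdy_count

def bus_bandwidth(drdy_count, seconds):
    """Memory bus bandwidth in bytes per second: DRDY count * 32 bytes / time."""
    return drdy_count * BYTES_PER_DRDY / seconds

def bus_utilization(drdy_count, seconds):
    """Fraction of the theoretical peak bus bandwidth actually used."""
    return bus_bandwidth(drdy_count, seconds) / PEAK_BYTES_PER_SECOND

# Hypothetical sample values, chosen only to exercise the functions.
sample_drdy = 25_000_000
sample_seconds = 0.5
sample_flops = 100_000_000

print(flops_per_drdy(sample_flops, sample_drdy))
print(bus_bandwidth(sample_drdy, sample_seconds))
print(bus_utilization(sample_drdy, sample_seconds))
```

Comparing against the theoretical peak is an addition of ours; the text itself only defines the bandwidth as DRDY transitions multiplied by 32 bytes, divided by time.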
The minimum transfer size from main memory to L2 cache is a cache line, or 64 bytes. Since su3 matrices are 72 bytes long (9 single precision complex numbers), two cache lines must be fetched for each matrix. In the standard MILC code, the extra 56 bytes loaded as part of the second cache line may not be useful, because the stride between corresponding fields in the lattice structure is 1656 bytes in this case. In the modified MILC code, the extra 56 bytes may be useful, as they correspond to part of the neighboring su3 matrix. Crudely, the efficiency of an su3 matrix load in the standard MILC code is 56% (72 useful bytes out of 128 loaded), assuming the extra bytes are not used, while the efficiency of a load in the modified MILC code approaches 100% for very large lattices. The plot of floating point operations per DRDY above shows approximately this ratio between the performance of the two codes.
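The efficiency estimate above can be checked with a small Python sketch that counts which 64-byte cache lines are touched by a sequence of 72-byte matrices. The function names are ours, and the sketch assumes the first matrix starts on a cache line boundary and that every byte of every touched line is loaded exactly once; it models only the address arithmetic, not the real cache or prefetcher behavior.

```python
CACHE_LINE = 64      # bytes, Pentium 4 cache line size
SU3_MATRIX = 72      # bytes, 9 single precision complex numbers
SITE_STRIDE = 1656   # bytes between corresponding fields (standard MILC, per the text)

def lines_touched(offsets, size=SU3_MATRIX, line=CACHE_LINE):
    """Return the set of cache line indices covered by objects at the given byte offsets."""
    touched = set()
    for off in offsets:
        first = off // line
        last = (off + size - 1) // line
        touched.update(range(first, last + 1))
    return touched

def load_efficiency(offsets, size=SU3_MATRIX, line=CACHE_LINE):
    """Useful bytes divided by bytes loaded, counting each touched line once."""
    useful = size * len(offsets)
    loaded = line * len(lines_touched(offsets, size, line))
    return useful / loaded

n_matrices = 1024  # illustrative count, assumed for this sketch

# Standard layout: one matrix per site, sites 1656 bytes apart.
strided = [i * SITE_STRIDE for i in range(n_matrices)]
# Modified layout: matrices packed back to back.
packed = [i * SU3_MATRIX for i in range(n_matrices)]

print(load_efficiency(strided))
print(load_efficiency(packed))
```

With these assumptions, every strided matrix touches exactly two lines of its own, while the packed matrices share lines with their neighbors, which is the effect the paragraph above describes.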