The following graph compares the performance of the improved staggered MILC code on a 2.8 GHz Pentium 4 processor and on a 2.0 GHz G5 (PPC970) processor. The blue and red curves show the performance of the MILC code with the field major optimization on the P4 and G5, respectively. The green curve shows the performance of the MILC code with both the field major and SSE2 optimizations on the same P4.




Clearly, SSE2 optimizations are very effective on the Pentium 4 processor. On the G5 processor, we have implemented both Altivec and conventionally coded assembly language (i.e., using conventional floating point operations) versions of the kernels. The table below shows the in-cache performance of three key SU3 kernels for different flavors of the code. Shown is the number of cycles required to execute each kernel. "C" labels the unoptimized, original C-language MILC code. "Altivec", "Assembler", and "SSE" label the respective optimized kernels. The "MFlops/GHz" column shows the clock-independent rate for each of the optimized flavors, i.e., the kernel's floating point operation count ("FP Ops") divided by its cycle count, expressed in MFlops per GHz of clock speed; this number should be multiplied by the corresponding processor's clock speed to obtain the speed of the kernel on in-cache operands.

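G5 Altivec, G5 Assembler, and Pentium 4 Inlined SSE Timings

| Kernel | FP Ops | G5 Altivec: C (cycles) | G5 Altivec: Altivec (cycles) | G5 Altivec: MFlops/GHz | G5 Assembler: C (cycles) | G5 Assembler: Assembler (cycles) | G5 Assembler: MFlops/GHz | P4 SSE: C (cycles) | P4 SSE: SSE (cycles) | P4 SSE: MFlops/GHz |
|---|---|---|---|---|---|---|---|---|---|---|
| mult_su3_mat_vec_sum_4dir | 282 | 418 | 205 | 1376 | 418 | 147 | 1918 | 598 | 249 | 1133 |
| mult_adj_su3_mat_vec_4dir | 264 | 188 | 183 | 1443 | 188 | 140 | 1885 | 530 | 320 | 825 |
| mult_adj_su3_mat_vec_4vec | 264 | 178 | 184 | 1435 | 178 | 143 | 1846 | 534 | 228 | 1158 |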
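
The arithmetic behind the table can be sketched in a few lines of Python. This is an illustration only, not part of the MILC code: the function names are hypothetical, and the flop counts assume the usual convention of 6 floating point operations per complex multiply and 2 per complex add. Under that assumption, one 3x3 complex matrix times vector product costs 66 flops, which reproduces the 264 and 282 entries in the "FP Ops" column.

```python
# Illustrative sketch; names are hypothetical and not taken from MILC.
CMUL_FLOPS = 6  # assumed: complex multiply = 4 real multiplies + 2 real adds
CADD_FLOPS = 2  # assumed: complex add = 2 real adds


def su3_matvec_flops():
    """Flops for one 3x3 complex matrix times 3-component complex vector."""
    multiplies = 9 * CMUL_FLOPS  # 3 rows x 3 columns
    adds = 6 * CADD_FLOPS        # 2 complex adds per row, 3 rows
    return multiplies + adds


def four_matvec_flops(sum_results=False):
    """Flops for four matrix-vector products, optionally summed into one vector."""
    flops = 4 * su3_matvec_flops()
    if sum_results:
        # Summing 4 vectors takes 3 vector adds of 3 complex components each.
        flops += 3 * 3 * CADD_FLOPS
    return flops


def mflops_per_ghz(fp_ops, cycles):
    """Clock-independent rate: flops per cycle, scaled to MFlops per GHz."""
    return 1000.0 * fp_ops / cycles


def mflops(fp_ops, cycles, clock_ghz):
    """In-cache kernel speed in MFlops at a given clock speed in GHz."""
    return mflops_per_ghz(fp_ops, cycles) * clock_ghz


if __name__ == "__main__":
    ops = four_matvec_flops(sum_results=True)  # mult_su3_mat_vec_sum_4dir
    print("FP ops:", ops)
    print("G5 assembler, 2.0 GHz:", mflops(ops, 147, 2.0), "MFlops")
    print("P4 SSE, 2.8 GHz:", mflops(ops, 249, 2.8), "MFlops")
```

For mult_su3_mat_vec_sum_4dir, for example, this gives roughly 3.8 GFlops for the assembler kernel on the 2.0 GHz G5 and roughly 3.2 GFlops for the SSE kernel on the 2.8 GHz P4, both for in-cache operands only.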



The graph below shows the effects of Altivec and assembler optimizations on the performance of the improved staggered MILC code running on the G5 processor. All versions of the code used the field major optimization. The violet curve has no further optimizations and was obtained using code compiled with the beta IBM C compiler. The green curve shows the performance of gcc-compiled code with Altivec optimizations. The blue and red curves show the performance of the code with assembler optimizations compiled with the IBM and gcc compilers, respectively.




The graph below compares the performance of the improved staggered MILC code on a 2.8 GHz Pentium 4 processor, a 2.8 GHz Pentium 4E ("Prescott") processor, and a 2.0 GHz G5 (PPC970) processor. On each processor, the fastest version of the code was used. On the P4, field major and inline SSE optimizations were used. On the P4E, field major, inline SSE, and prefetch optimizations were used. On the G5, field major and assembler optimizations were used. The G5 results shown are our best in-cache results to date. The P4E results are our best to date for out-of-cache running.




The graph below shows the SMP performance of optimized improved staggered MILC code on a dual 2.0 GHz G5 (PPC970) system. Field major and assembler optimizations were used. The red curve shows the performance of a single copy of the code running on the machine. The green and blue curves show the performance of two single-node copies of the code running simultaneously, but uncooperatively. The violet curve shows the performance of two processes running cooperatively via MPI message passing.