The following graph compares the performance of the improved staggered MILC
code on a 2.8 GHz Pentium 4 processor and on a 2.0 GHz G5 (PPC970) processor.
The blue and red curves show the performance of MILC code with the field major
optimization on the P4 and G5, respectively. The green curve shows the
performance of the MILC code with both the field major and SSE2 optimizations
on the same P4.
Clearly, SSE2 optimizations are very effective on the Pentium 4 processor. On
the G5 processor, we have implemented kernels both in Altivec and in
conventionally coded assembly language (i.e., using conventional floating point
operations). The table below shows the in-cache performance of three key SU3
kernels for the different flavors of code. Shown is the number of cycles
required to execute each kernel. "C" labels the unoptimized, original
C-language MILC code. "Altivec", "Assembler", and "SSE" label the respective
optimized kernels. The "MFlops/GHz" column shows the clock-normalized
performance of each optimized flavor; multiply this number by the
corresponding processor's clock speed in GHz to obtain the speed of the kernel
on in-cache operands.

G5 Altivec, G5 Assembler, and Pentium 4 Inlined SSE Timings

| Kernel | FP Ops | G5 Altivec: C | G5 Altivec: Altivec | G5 Altivec: MFlops/GHz | G5 Assembler: C | G5 Assembler: Assembler | G5 Assembler: MFlops/GHz | P4 SSE: C | P4 SSE: SSE | P4 SSE: MFlops/GHz |
|---|---|---|---|---|---|---|---|---|---|---|
| mult_su3_mat_vec_sum_4dir | 282 | 418 | 205 | 1376 | 418 | 147 | 1918 | 598 | 249 | 1133 |
| mult_adj_su3_mat_vec_4dir | 264 | 188 | 183 | 1443 | 188 | 140 | 1885 | 530 | 320 | 825 |
| mult_adj_su3_mat_vec_4vec | 264 | 178 | 184 | 1435 | 178 | 143 | 1846 | 534 | 228 | 1158 |

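
As a rough illustration of how the "MFlops/GHz" figures relate to the cycle counts, the sketch below recomputes them in Python. It assumes that MFlops/GHz is simply 1000 times the FP operation count divided by the cycle count, which is an inference that agrees with the tabulated values to within rounding rather than a formula stated by the authors. The function names and the clock speeds attached to each column (2.0 GHz for the G5, 2.8 GHz for the P4, taken from the processors described in the text) are likewise assumptions made for this example.

```python
# FP operation counts and optimized-kernel cycle counts copied from the table.
FP_OPS = {
    "mult_su3_mat_vec_sum_4dir": 282,
    "mult_adj_su3_mat_vec_4dir": 264,
    "mult_adj_su3_mat_vec_4vec": 264,
}
CYCLES = {
    "G5 Altivec": {"mult_su3_mat_vec_sum_4dir": 205,
                   "mult_adj_su3_mat_vec_4dir": 183,
                   "mult_adj_su3_mat_vec_4vec": 184},
    "G5 Assembler": {"mult_su3_mat_vec_sum_4dir": 147,
                     "mult_adj_su3_mat_vec_4dir": 140,
                     "mult_adj_su3_mat_vec_4vec": 143},
    "P4 SSE": {"mult_su3_mat_vec_sum_4dir": 249,
               "mult_adj_su3_mat_vec_4dir": 320,
               "mult_adj_su3_mat_vec_4vec": 228},
}
# Assumed clock speeds in GHz, based on the processors named in the text.
CLOCK_GHZ = {"G5 Altivec": 2.0, "G5 Assembler": 2.0, "P4 SSE": 2.8}


def mflops_per_ghz(fp_ops, cycles):
    """Clock-normalized rate: flops per cycle, scaled to MFlops per GHz.

    One GHz is 1e9 cycles per second, so flops/cycle * 1e9 / 1e6
    gives 1000 * fp_ops / cycles.
    """
    return 1000.0 * fp_ops / cycles


def mflops(fp_ops, cycles, clock_ghz):
    """In-cache kernel speed in MFlops at a given clock speed."""
    return mflops_per_ghz(fp_ops, cycles) * clock_ghz


if __name__ == "__main__":
    for flavor, per_kernel in CYCLES.items():
        for kernel, cycles in per_kernel.items():
            rate = mflops_per_ghz(FP_OPS[kernel], cycles)
            speed = mflops(FP_OPS[kernel], cycles, CLOCK_GHZ[flavor])
            print(f"{flavor:13s} {kernel:27s} "
                  f"{rate:7.0f} MFlops/GHz  {speed:7.0f} MFlops")
```

Under these assumptions, a higher MFlops/GHz figure on the G5 does not by itself imply a faster kernel, because the P4 figures are multiplied by a higher clock speed.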
The graph below shows the effects of Altivec and assembler optimizations on
the performance of the improved staggered MILC code running on the G5
processor. All versions of the code used the field major optimization. The
violet curve has no further optimizations and was obtained using code compiled
with the beta IBM C compiler. The green curve shows the performance of
gcc-compiled code with Altivec optimizations. The blue and red curves show
the performance of the code with assembler optimizations compiled with the IBM
and gcc compilers, respectively.
The graph below compares the performance of improved staggered MILC code
on a 2.8 GHz Pentium 4 processor, a 2.8 GHz Pentium 4E ("Prescott") processor,
and a 2.0 GHz G5 (PPC970) processor. On each processor, the fastest version
of the code was used. On the P4, field major and inline SSE optimizations
were used. On the P4E, field major, inline SSE, and prefetch optimizations were
used. On the G5, field major and assembler optimizations were used. The G5
results shown are our best in-cache results to date. The P4E results are the
best to date for out-of-cache running.
The graph below shows the SMP performance of optimized improved staggered
MILC code on a dual 2.0 GHz G5 (PPC970) system. Field major and assembler
optimizations were used. The red curve shows the performance of a single copy
of the code running on the machine. The green and blue curves show the
performance of two single-node copies of the code running simultaneously but
independently of each other. The violet curve shows the performance of two
processes running cooperatively via MPI message passing.