Comparison of Application Performance on Dual Opteron and Xeon Systems

All results shown below were obtained using an MSI K8D MS-9131 motherboard (dual Opteron, 1.4 GHz) and a SuperMicro P4DPE-Q motherboard (dual Xeon, 2.0 GHz).

MILC Single CPU Performance

The plot below shows sustained floating point performance in the conjugate gradient routine of MILC V6 improved staggered code (su3_rmd_symzk1_asqtad) as executed on a 2.0 GHz dual Xeon system. The four curves are:

For comparison, the plot below shows the performance of the identical binaries on a dual 1.4 GHz Opteron system. Note that there is an additional line showing the performance of an older version of SSE optimizations.


Discussion

The large variability in the Opteron curves may be due in part to the NUMA design of the memory system, in which each processor has its own attached memory. Under Linux 2.4 kernels, processes tend to stay on a given processor. On the Opteron system, a process can start on the second processor but allocate memory attached to the first, so that its memory accesses are remote.

Note that the 1.4 GHz Opteron and the 2.0 GHz Xeon give similar performance at large lattice sizes. Because this code is memory bandwidth bound, the greater memory bandwidth of the Opteron (STREAMS "Copy" is approximately 1450 MB/sec vs 1240 MB/sec on the Xeon) compensates for its lower clock speed.
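As a rough check of the bandwidth argument, the two ratios involved can be worked out directly. The sketch below uses only the clock speeds and approximate STREAMS "Copy" figures quoted above. The variable names are illustrative, and treating the Copy rate as a proxy for the bandwidth seen by the MILC code is an assumption.

```python
# Figures quoted in the discussion above.
opteron_clock_ghz = 1.4
xeon_clock_ghz = 2.0
opteron_copy_mb_s = 1450.0  # approximate STREAMS "Copy" rate
xeon_copy_mb_s = 1240.0     # approximate STREAMS "Copy" rate

# Ratios of Opteron to Xeon for clock speed and memory bandwidth.
clock_ratio = opteron_clock_ghz / xeon_clock_ghz
bandwidth_ratio = opteron_copy_mb_s / xeon_copy_mb_s

print(f"Opteron/Xeon clock ratio:     {clock_ratio:.2f}")
print(f"Opteron/Xeon bandwidth ratio: {bandwidth_ratio:.2f}")
```

The Opteron runs at 70% of the Xeon's clock speed but has roughly 17% more Copy bandwidth, which is the trade-off described above.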

MILC SMP Performance

The plot below shows the floating point performance of the conjugate gradient routine of MILC V6 improved staggered code (su3_rmd_symzk1_asqtad) as executed on a 2.0 GHz dual Xeon system. The version executed is the fully optimized version ("temporaries + new inline SSE"). The four curves are:

For comparison, the plot below shows the performance of the identical binaries on a dual 1.4 GHz Opteron system. An MPI version was not available, so only three curves are shown.


Discussion

On the dual Xeon, SMP scaling for large lattices is approximately 65% for a pair of independent processes, and approximately 57% for a pair of cooperative processes.

On the dual Opteron, SMP scaling for large lattices is approximately 86% for a pair of independent processes, using the largest lattice size data point. Note that when two processes are running, the variability observed in single-process runs disappears. Further, one of the two processes has higher performance, consistent with using local memory, and the other has performance consistent with using remote memory over the HyperTransport link. We speculate that allocating local memory to the second process would give it performance similar to the first.

Another good demonstration of scaling comes from running two copies of the STREAMS benchmark. On the dual Xeon, a single copy gives:

Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:        1238.0041       0.0260       0.0258       0.0263
Scale:       1238.4840       0.0259       0.0258       0.0259
Add:         1496.6771       0.0321       0.0321       0.0322
Triad:       1494.1154       0.0322       0.0321       0.0323

and two copies run simultaneously give (one table per process):

Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         546.3542       0.0589       0.0586       0.0594
Scale:        547.4675       0.0586       0.0585       0.0588
Add:          628.0831       0.0768       0.0764       0.0772
Triad:        627.0083       0.0767       0.0766       0.0769

Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         541.3547       0.0593       0.0591       0.0599
Scale:        542.8418       0.0591       0.0589       0.0593
Add:          622.3339       0.0774       0.0771       0.0778
Triad:        689.9129       0.0762       0.0696       0.0772

Contrast these results with one and two STREAMS process runs on the Opteron:

(One Process)
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:        1450.1951       0.0221       0.0221       0.0224
Scale:       1433.1156       0.0223       0.0223       0.0224
Add:         1728.8004       0.0278       0.0278       0.0279
Triad:       1721.6083       0.0279       0.0279       0.0279

(Two Simultaneous Processes)
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:        1138.4273       0.0290       0.0281       0.0293
Scale:       1163.8065       0.0294       0.0275       0.0298
Add:         1371.2758       0.0371       0.0350       0.0376
Triad:       1305.8740       0.0374       0.0368       0.0377

Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:        1265.3694       0.0325       0.0253       0.0344
Scale:       1252.9895       0.0327       0.0255       0.0347
Add:         1369.9414       0.0441       0.0350       0.0465
Triad:       1393.2829       0.0427       0.0345       0.0462
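
One way to summarize the tables above is to compare the combined "Copy" rate of the two simultaneous processes with the rate of a single process. The sketch below does this with the Copy rates listed above. The `aggregate_scaling` helper and this definition of scaling are assumptions made for illustration, and are not necessarily the measure behind the SMP scaling percentages quoted for MILC in the Discussion.

```python
# STREAMS "Copy" rates in MB/s, taken from the tables above.
copy_rates = {
    "Xeon": {"one": 1238.0041, "two": [546.3542, 541.3547]},
    "Opteron": {"one": 1450.1951, "two": [1138.4273, 1265.3694]},
}


def aggregate_scaling(single, pair):
    """Combined rate of two simultaneous processes relative to one process alone."""
    return sum(pair) / single


scaling = {
    name: aggregate_scaling(rates["one"], rates["two"])
    for name, rates in copy_rates.items()
}

for name, value in scaling.items():
    print(f"{name}: two-process aggregate Copy rate = {value:.2f}x single-process rate")
```

By this measure, the two Xeon processes together achieve about 0.88 times the single-process Copy rate, while the Opteron pair reaches about 1.66 times the single-process rate.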

Don Holmgren
Last Modified: 7th Aug 2003