For comparison, the plot below shows the performance of the identical binaries
on a dual 1.4 GHz Opteron system. Note that there is an additional line
showing the performance of an older version of SSE optimizations.
An MPI version was not available, so only three curves are shown.
On the dual Opteron, SMP scaling for large lattices is approximately 86% for a pair of independent processes, based on the data point for the largest lattice size. Note that when two processes are running, the variability observed in single-process runs disappears. Further, one of the two processes has higher performance, consistent with using local memory, while the other has performance consistent with using remote memory over the HyperTransport link. We speculate that allocating local memory to the second process would give performance similar to that of the first.
Another good demonstration of scaling comes from running two copies of the STREAMS benchmark. On the dual Xeon, a single copy gives:
Function    Rate (MB/s)   RMS time   Min time   Max time
Copy:         1238.0041     0.0260     0.0258     0.0263
Scale:        1238.4840     0.0259     0.0258     0.0259
Add:          1496.6771     0.0321     0.0321     0.0322
Triad:        1494.1154     0.0322     0.0321     0.0323

and two copies run simultaneously give:
Function    Rate (MB/s)   RMS time   Min time   Max time
Copy:          546.3542     0.0589     0.0586     0.0594
Scale:         547.4675     0.0586     0.0585     0.0588
Add:           628.0831     0.0768     0.0764     0.0772
Triad:         627.0083     0.0767     0.0766     0.0769

Function    Rate (MB/s)   RMS time   Min time   Max time
Copy:          541.3547     0.0593     0.0591     0.0599
Scale:         542.8418     0.0591     0.0589     0.0593
Add:           622.3339     0.0774     0.0771     0.0778
Triad:         689.9129     0.0762     0.0696     0.0772
Contrast these results with one and two STREAMS process runs on the Opteron:
(One Process)

Function    Rate (MB/s)   RMS time   Min time   Max time
Copy:         1450.1951     0.0221     0.0221     0.0224
Scale:        1433.1156     0.0223     0.0223     0.0224
Add:          1728.8004     0.0278     0.0278     0.0279
Triad:        1721.6083     0.0279     0.0279     0.0279

(Two Simultaneous Processes)

Function    Rate (MB/s)   RMS time   Min time   Max time
Copy:         1138.4273     0.0290     0.0281     0.0293
Scale:        1163.8065     0.0294     0.0275     0.0298
Add:          1371.2758     0.0371     0.0350     0.0376
Triad:        1305.8740     0.0374     0.0368     0.0377

Function    Rate (MB/s)   RMS time   Min time   Max time
Copy:         1265.3694     0.0325     0.0253     0.0344
Scale:        1252.9895     0.0327     0.0255     0.0347
Add:          1369.9414     0.0441     0.0350     0.0465
Triad:        1393.2829     0.0427     0.0345     0.0462
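One way to put a number on the contrast is to divide the combined rate of the two simultaneous copies by the single-copy rate for each kernel. The short Python sketch below does this using the rates reported above. The names `SINGLE`, `PAIRED`, and `aggregate_ratio` are illustrative only and are not part of the STREAMS benchmark, and this ratio is just one possible scaling measure, not necessarily the one used elsewhere in this report.

```python
# Rates in MB/s, copied from the STREAMS output above.
# SINGLE, PAIRED and aggregate_ratio are hypothetical names for this sketch.
SINGLE = {
    "xeon": {"Copy": 1238.0041, "Scale": 1238.4840,
             "Add": 1496.6771, "Triad": 1494.1154},
    "opteron": {"Copy": 1450.1951, "Scale": 1433.1156,
                "Add": 1728.8004, "Triad": 1721.6083},
}
PAIRED = {
    "xeon": [
        {"Copy": 546.3542, "Scale": 547.4675,
         "Add": 628.0831, "Triad": 627.0083},
        {"Copy": 541.3547, "Scale": 542.8418,
         "Add": 622.3339, "Triad": 689.9129},
    ],
    "opteron": [
        {"Copy": 1138.4273, "Scale": 1163.8065,
         "Add": 1371.2758, "Triad": 1305.8740},
        {"Copy": 1265.3694, "Scale": 1252.9895,
         "Add": 1369.9414, "Triad": 1393.2829},
    ],
}


def aggregate_ratio(system, kernel):
    """Combined rate of the two simultaneous copies over the single-copy rate."""
    combined = sum(run[kernel] for run in PAIRED[system])
    return combined / SINGLE[system][kernel]


if __name__ == "__main__":
    for system in ("xeon", "opteron"):
        for kernel in ("Copy", "Scale", "Add", "Triad"):
            ratio = aggregate_ratio(system, kernel)
            print(f"{system:8s} {kernel:6s} {ratio:.2f}")
```

With these figures the Copy ratio comes out at roughly 0.88 on the dual Xeon and roughly 1.66 on the dual Opteron. In other words, two copies on the Xeon together move less data per second than one copy alone, while on the Opteron the aggregate rate rises well above the single-copy rate.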