Note: For additional benchmarks on other architectures, see this link.

Comparison of Application Performance on Dual G5 and Pentium 4 Systems

All results shown below were obtained using an Apple dual G5 PowerMac (2.0 GHz) and an Intel D875PBZ motherboard (800 MHz FSB Pentium 4, 2.8 GHz). The Apple was running OS X 10.2 ("Jaguar") with Darwin kernel 6.7.5, and the Pentium 4 system was running Red Hat 7.1 with a 2.4.21 kernel. All codes on the Apple were compiled with the IBM VAC (6.0) and XLF (8.1) beta compilers; all codes on the Pentium 4 system were compiled with gcc 2.95.3.

MILC Single CPU Performance

The plot below shows the sustained floating point performance of the conjugate gradient routine of the MILC V6 improved staggered code (su3_rmd_symzk1_asqtad), as a function of lattice size, executed on the 2.0 GHz dual G5 system. The two curves are:

For comparison, the plot below shows the performance of these codes on the 2.8 GHz Pentium 4 system.


At Fermilab, the production MILC codes in use all include additional SSE2 optimizations (see "Inline SSE MILC Math Routines"). The plot below shows the performance of the G5 and the Pentium 4 on code with the "Field Major" optimization, as well as the performance of code with both the SSE2 and "Field Major" optimizations on the Pentium 4 system.
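As a rough illustration of what the "Field Major" reordering means, the sketch below contrasts the standard layout, in which each lattice site carries all of its fields together, with a field-major layout, in which each field is stored contiguously across all sites, so that a solver sweep over a single field walks memory with unit stride. The type names are illustrative only, not the actual MILC declarations.

/* Illustrative sketch only -- not the actual MILC data structures. */
typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;   /* 3x3 complex matrix */
typedef struct { complex_f c[3];    } su3_vector;   /* 3x1 complex vector */

/* Standard (site major) layout: all fields of a site stored together. */
typedef struct {
    su3_matrix link[4];   /* gauge links in the four directions */
    su3_vector phi;       /* staggered fermion field */
} site;                   /* e.g. site lattice[VOLUME]; */

/* "Field Major" layout: each field stored contiguously over all sites. */
typedef struct {
    su3_matrix *link[4];  /* link[dir] is an array of VOLUME matrices */
    su3_vector *phi;      /* array of VOLUME vectors */
} field_major_lattice;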

Discussion

The curves clearly show the L2 cache boundary. In production runs, typical lattice sizes per node are of the order of 10 MBytes, so we are most interested in the large-memory portion of the curves, where performance is dominated by memory bandwidth. The SSE2 optimizations are very effective on the Pentium 4. We have not yet re-coded these matrix algebra routines (3x3, 3x2, and 3x1 complex matrices) in AltiVec, but we would anticipate improved performance on the G5 if we did. (Note: these codes are single precision, so AltiVec routines are possible. We also have double precision versions of the code for SSE2; these cannot be re-coded using AltiVec, since AltiVec does not support double precision arithmetic.)
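For reference, the scalar equivalent of the kind of kernel these SSE2 routines (and any future AltiVec routines) replace is a 3x3 complex matrix times 3-component complex vector multiply, roughly as sketched below. This is a simplified single precision sketch, not the actual MILC source.

/* Simplified sketch of an SU(3) matrix times vector multiply, the kind
 * of small complex matrix algebra kernel discussed above. */
typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;
typedef struct { complex_f c[3];    } su3_vector;

void mat_vec_mult(const su3_matrix *a, const su3_vector *b, su3_vector *c)
{
    int i, j;
    for (i = 0; i < 3; i++) {
        float re = 0.0f, im = 0.0f;
        for (j = 0; j < 3; j++) {
            /* complex multiply-accumulate: c[i] += a[i][j] * b[j] */
            re += a->e[i][j].real * b->c[j].real
                - a->e[i][j].imag * b->c[j].imag;
            im += a->e[i][j].real * b->c[j].imag
                + a->e[i][j].imag * b->c[j].real;
        }
        c->c[i].real = re;
        c->c[i].imag = im;
    }
}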

MILC SMP Performance

The plot below shows the floating point performance of the conjugate gradient routine of the MILC V6 improved staggered code (su3_rmd_symzk1_asqtad), as a function of lattice size, executed on the 2.0 GHz dual G5 system. The code executed is the fully optimized version ("temporaries + new inline SSE"). The four curves are:

Discussion

Scaling on the dual G5 on lattices above 10 MB is about 70% for cooperative MILC processes communicating via MPI, and about 80% for unrelated processes running simultaneously. "Scaling" is defined as aggregate performance divided by theoretical aggregate performance, where the latter is twice the performance of a single process. For comparison with dual Xeon and dual Opteron systems, see this link.
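As an illustrative example (round figures, not measurements): if a single MILC process sustains 1000 MFlop/s and two cooperating processes sustain an aggregate of 1400 MFlop/s, then

    scaling = aggregate / (2 x single) = 1400 / (2 x 1000) = 70%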


STREAMS Benchmark

Shown below are the results of McCalpin's STREAMS benchmark on the G5 and Pentium 4 systems. For the G5, results are shown for both the standard STREAMS benchmark and the MPI version running with one and two processes.
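The four STREAM kernels measure sustained memory bandwidth using simple vector loops over arrays much larger than cache. In outline they are as follows; the actual benchmark adds timing, best-of-N repetition, and result validation.

/* Outline of the four STREAM kernels (array size matches the runs below). */
#define N 8000000
static double a[N], b[N], c[N];

void stream_kernels(double scalar)
{
    long j;
    for (j = 0; j < N; j++) c[j] = a[j];                 /* Copy  */
    for (j = 0; j < N; j++) b[j] = scalar * c[j];        /* Scale */
    for (j = 0; j < N; j++) c[j] = a[j] + b[j];          /* Add   */
    for (j = 0; j < N; j++) a[j] = b[j] + scalar * c[j]; /* Triad */
}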

Pentium 4

[Compiled with "gcc -O3", gcc 2.95.3]
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 8000000, Offset = 0
Total memory required = 183.1 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity appears to be less than one microsecond.
Each test below will take on the order of 38768 microseconds.
   (= -2147483648 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:        2848.7459       0.0450       0.0449       0.0457
Scale:       2851.4134       0.0453       0.0449       0.0457
Add:         3470.4630       0.0556       0.0553       0.0557
Triad:       3456.0981       0.0557       0.0556       0.0561

G5

[Compiled with "cc -O5", IBM VAC 6.0 Beta]
bash-2.05a$ ./stream_d 
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 8000000, Offset = 0
Total memory required = 183.1 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 46828 microseconds.
   (= 46828 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:        2923.5762       0.0439       0.0438       0.0441
Scale:       2566.9303       0.0500       0.0499       0.0502
Add:         2304.2857       0.0834       0.0833       0.0836
Triad:       2339.1539       0.0822       0.0821       0.0823

[Compiled with "f77 -O5", IBM XLF 8.1 Beta]
bash-2.05a$ mpirun -np 1 stream_mpi
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Number of processors =  1
 Array size =    4000000
 Offset     =          0
 The total memory requirement is      91.6 MB (     91.6MB/task)
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------------
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:       2910.9423       .0220       .0220       .0220
Scale:      2577.4299       .0249       .0248       .0249
Add:        2385.3296       .0403       .0402       .0404
Triad:      2456.4980       .0391       .0391       .0392
 -----------------------------------------------
 Solution Validates!
 -----------------------------------------------

bash-2.05a$ mpirun -np 2 stream_mpi
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Number of processors =  2
 Array size =    4000000
 Offset     =          0
 The total memory requirement is     183.1 MB (     91.6MB/task)
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------------
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:       2780.1300       .0465       .0460       .0469
Scale:      2779.6478       .0464       .0460       .0469
Add:        2649.7750       .0728       .0725       .0736
Triad:      2777.8189       .0698       .0691       .0709
 -----------------------------------------------
 Solution Validates!
 -----------------------------------------------

Power Consumption

The 2.0 GHz dual G5 used here was configured with 512 MB of PC3200 memory in two DIMMs and a 150 GB hard drive. Under full CPU load (two large MILC processes, minimal disk activity), the power consumption was:
Current:         2.52 A   (standard 120 VAC service)
Power:           293  W
Apparent Power:  299  VA
Power Factor:    0.97

The measured power consumption of the 2.8 GHz Pentium 4 system (1 GB of memory in four PC3200 DIMMs and an 80 GB hard drive) was:
Current:         1.35 A
Power:           119  W
Apparent Power:  159  VA
Power Factor:    0.74
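The power factor reported above is the ratio of real power to apparent power drawn from the line:

    power factor = power (W) / apparent power (VA)

A lower power factor means the system draws proportionally more line current for the same real power consumed.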

Don Holmgren
Last Modified: 15th Oct 2003