USQCD Machine Performance



| Machine | Processor | Nodes | DWF per node (MFlops) | Clover per node (MFlops) | asqtad per node (MFlops) | 6n Equivalence (6n-node-hour) |
|---|---|---|---|---|---|---|
| pion | 3.2 GHz Single CPU Single Core Pentium 940 | 518 | 1728 | 1120 | 1594 | 0.683 |
| kaon | 2.0 GHz Dual CPU Dual Core Opteron | 600 | 4696 | 3180 | 3832 | 1.757 |
| jpsi | 2.1 GHz Dual CPU Quad Core Opteron | 856 | 10061 | 7423 | 9563 | 4.04 |
| 6n | 3.0 GHz Single CPU Dual Core Pentium | 256 | 2900 | 1408 | 1960 | 1.00 |
| 7n | 1.9 GHz Dual CPU Quad Core Opteron | 396 | 8800 | 5148 | 6300 | 3.1 |
| QCDOC | 400 MHz PPC Core (estimated) | 12288 | 336 | n/a | 360 | 0.122 |
| BlueGene/P | 850 MHz Quad Core PowerPC 850 | 4 cores/node, 1024 nodes/rack | 2560 | 2511 | 2680 | 1.08 |
| Cray XT4 | 2.1 GHz Quad Core Opteron | 4 cores/node, 7,832 nodes | 3060 | 3392 | 4084 | 2.22 |

The table above shows the measured performance of DWF, anisotropic clover, and asqtad inverters on the pion, kaon, jpsi, 6n, and 7n clusters, and on the ANL BG/P, the ORNL XT4 and the QCDOC. All performance numbers are single precision unless otherwise noted.

For pion, the asqtad numbers were taken on 64-node runs, 14^4 local lattice per node. The DWF numbers were taken with the Pochinsky SSE-inverter CG-timer on 128-node runs using Ls=16, averaging the performance of 32x8x8x8 and 32x8x8x12 local lattice runs together. The Clover numbers were taken with Chroma using 128-node runs of 24x24x24x128.

The DWF, Clover and asqtad performance figures for kaon, jpsi, 6n, and 7n used 128-process runs (on 32, 16, 64, and 16 nodes respectively), with 4, 8, 2, and 8 processes per node respectively, one process per core. DWF and Clover data were taken with Chroma. kaon and jpsi Clover runs used 6^3x64 local (per core) lattices, and DWF runs used 14x7x7x16 local (per core) lattices with Ls=16. Clover performance on 7n used 4^3x8 local volumes per process, DWF performance used global volume 24^3x64 (Ls=16), and asqtad performance used local volume 14^4. Clover performance on 6n is an average of 24^3x128 and 32^3x128 global volume runs using Chroma, DWF performance is an average of 32x8x8x8 and 32x8x8x12 (Ls=16) runs using the Pochinsky SSE-inverter CG-timer, and asqtad performance used local volume 14^4.
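The node counts quoted above follow from placing one process on each core. The short Python sketch below reproduces them as a consistency check. The CPU and core counts are read from the Processor column of the table; the names `CPUS_AND_CORES` and `nodes_needed` are illustrative and do not come from any USQCD tool.

```python
# (CPUs per node, cores per CPU), read from the Processor column of the table.
CPUS_AND_CORES = {
    "kaon": (2, 2),  # Dual CPU, Dual Core
    "jpsi": (2, 4),  # Dual CPU, Quad Core
    "6n": (1, 2),    # Single CPU, Dual Core
    "7n": (2, 4),    # Dual CPU, Quad Core
}

PROCESSES = 128  # every run in this paragraph used 128 processes


def nodes_needed(cluster, processes=PROCESSES):
    """Number of nodes required when one process runs on each core."""
    cpus, cores_per_cpu = CPUS_AND_CORES[cluster]
    processes_per_node = cpus * cores_per_cpu
    return processes // processes_per_node


node_counts = {name: nodes_needed(name) for name in CPUS_AND_CORES}
print(node_counts)
```

This gives 32, 16, 64, and 16 nodes for kaon, jpsi, 6n, and 7n, matching the figures in the text.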

The QCDOC DWF (double precision) and asqtad (single precision) estimates are based on the observed peak performance of the double precision conjugate gradient codes on early motherboards, scaled to 400 MHz. Clover performance data are not available.

The BG/P asqtad result is the average of the performance of 6^4 and 8^4 local volumes, and is single precision. The DWF result is double precision, using 4^4 (Ls=16) local volumes. The Clover result used 4096 cores.

The XT4 Clover performance figure is an average of 2048-core runs using 24^3x128 global volume measured with two different binaries (MPI, and MPI plus QMT), and 32^3x256 global volume using an MPI plus QMT binary. DWF performance was measured with 2048 cores, a 32^3x64 global volume (Ls=16), and the MDWF inverter. Performance for asqtad used 128-process runs with 14^4 local volumes.

The final column of the table gives the 6n-equivalence of each USQCD resource, that is, its per-node performance relative to a 6n node. For all machines except the Cray XT4, this is the ratio of the machine's average asqtad and DWF performance to the corresponding 6n average; for the XT4, the asqtad and Clover inverters are used instead. The QCDOC 6n-equivalence figure of 0.122 was assigned to be consistent with prior years' accounting, rather than computed from the estimated DWF and asqtad performance values.
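As a rough check on the final column, the sketch below recomputes the 6n-equivalence from the per-node figures in the table. It assumes the rating is the sum of a machine's two inverter figures divided by the same sum for 6n (equivalent to a ratio of averages); this reading is inferred from the table values and is not stated as a formula in the source. QCDOC is left out because its rating was assigned rather than computed. The names `PERF` and `equivalence` are illustrative.

```python
# Per-node performance in MFlops, copied from the table: (DWF, Clover, asqtad).
PERF = {
    "pion": (1728, 1120, 1594),
    "kaon": (4696, 3180, 3832),
    "jpsi": (10061, 7423, 9563),
    "6n": (2900, 1408, 1960),
    "7n": (8800, 5148, 6300),
    "BlueGene/P": (2560, 2511, 2680),
    "Cray XT4": (3060, 3392, 4084),
}


def equivalence(machine):
    """Assumed 6n-equivalence: ratio of averaged inverter performance to 6n."""
    dwf, clover, asqtad = PERF[machine]
    ref_dwf, ref_clover, ref_asqtad = PERF["6n"]
    if machine == "Cray XT4":
        # The XT4 rating uses asqtad and Clover instead of asqtad and DWF.
        return (asqtad + clover) / (ref_asqtad + ref_clover)
    return (asqtad + dwf) / (ref_asqtad + ref_dwf)


for name in PERF:
    print(f"{name}: {equivalence(name):.3f}")
```

Under this assumption the computed values agree with the published column to within about 0.003. The largest differences are for kaon and pion, where the published figures may have been rounded differently or derived from unrounded measurements.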