|
USQCD Machine Performance
The table above shows the measured performance of DWF, anisotropic clover, and asqtad inverters on the pion, kaon, jpsi, 6n, and 7n clusters, and on the ANL BG/P, the ORNL XT4, and the QCDOC. All performance numbers are single precision unless otherwise noted.
For pion, the asqtad numbers were taken on 64-node runs with a 14^4 local lattice per node. The DWF numbers were taken with the Pochinsky SSE-inverter CG-timer on 128-node runs using Ls=16, averaging the performance of 32x8x8x8 and 32x8x8x12 local lattice runs. The Clover numbers were taken with Chroma using 128-node runs of 24x24x24x128.
The DWF, Clover, and asqtad performance figures for kaon, jpsi, 6n, and 7n used 128-process runs (on 32, 16, 64, and 16 nodes respectively), with 4, 2, or 8 processes per node and one process per core. DWF and Clover data were taken with Chroma. On kaon and jpsi, Clover runs used 6^3x64 local (per core) lattices, and DWF runs used 14x7x7x16 local (per core) lattices with Ls=16.
On 7n, Clover performance used 4^3x8 local volumes per process, DWF performance used a global volume of 24^3x64 (Ls=16), and asqtad performance used a local volume of 14^4.
On 6n, Clover performance is an average of 24^3x128 and 32^3x128 global volume runs using Chroma, DWF performance is an average of 32x8x8x8 and 32x8x8x12 (Ls=16) runs using the Pochinsky SSE-inverter CG-timer, and asqtad performance used a local volume of 14^4.
The QCDOC DWF (double precision) and asqtad (single precision) estimates are based on the observed peak performance of the double precision conjugate gradient codes on early motherboards, scaled to 400 MHz. QCDOC Clover performance data are not available.
The BG/P asqtad result is the average of the performance of 6^4 and 8^4 local volumes, and is single precision. The BG/P DWF result is double precision, using 4^4 (Ls=16) local volumes. The BG/P Clover result used 4096 cores.
The XT4 Clover performance figure is an average of 2048-core runs using 24^3x128 global volume measured with two different binaries (MPI, and MPI plus QMT), and 32^3x256 global volume using an MPI plus QMT binary. DWF performance was measured with 2048 cores, a 32^3x64 global volume (Ls=16), and the MDWF inverter. Performance for asqtad used 128-process runs with 14^4 local volumes. The final column of the table gives the 6n-equivalence for each of the USQCD resources. All except the Cray XT4 use the ratio of the average performance of asqtad and DWF; the XT4 uses the ratio of the average performance of the asqtad and clover inverters. Also, the QCDOC 6n-equivalence figure of 0.122 has been assigned to be consistent with prior years' accounting, rather than using the estimated DWF and asqtad performance values. |
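
The 6n-equivalence rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say whether the figure is a ratio of averages or an average of ratios, nor how the performance is normalized, so the sketch assumes a ratio of averaged inverter performance relative to 6n. The function name `equivalence_6n` and all performance values are hypothetical placeholders, not numbers from the table; only the assigned QCDOC figure of 0.122 comes from the text.

```python
from statistics import mean

# Assigned by policy for consistency with prior years' accounting (from the text).
ASSIGNED_EQUIVALENCE = {"QCDOC": 0.122}


def equivalence_6n(name, machine, reference, kernels=("asqtad", "dwf")):
    """Return the 6n-equivalence of a machine.

    Assumption: the figure is the mean performance over `kernels` on the
    machine divided by the same mean on the 6n reference, with all
    performance values expressed in the same units.
    """
    if name in ASSIGNED_EQUIVALENCE:
        return ASSIGNED_EQUIVALENCE[name]
    machine_avg = mean(machine[k] for k in kernels)
    reference_avg = mean(reference[k] for k in kernels)
    return machine_avg / reference_avg


# Hypothetical placeholder figures, NOT values from the table.
ref_6n = {"asqtad": 100.0, "dwf": 200.0, "clover": 150.0}
cluster = {"asqtad": 150.0, "dwf": 300.0, "clover": 200.0}
xt4 = {"asqtad": 50.0, "clover": 100.0}

# Most resources: average of asqtad and DWF.
print(equivalence_6n("cluster", cluster, ref_6n))
# XT4: average of asqtad and clover instead.
print(equivalence_6n("XT4", xt4, ref_6n, kernels=("asqtad", "clover")))
# QCDOC: assigned value, measured estimates are ignored.
print(equivalence_6n("QCDOC", {}, ref_6n))
```

Passing the kernel pair as a parameter keeps the XT4 exception (asqtad and clover) and the QCDOC override visible as explicit special cases rather than hiding them in the data.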