USQCD Machine Performance



| Cluster | Processor | Nodes | DWF per node | Clover per node | asqtad per node |
|---|---|---|---|---|---|
| qcd | 2.8 GHz Single CPU Single Core P4E | 124 | 1400 MFlops | 1096 MFlops | 1017 MFlops |
| pion | 3.2 GHz Single CPU Single Core Pentium 940 | 518 | 1729 MFlops | 1120 MFlops | 1594 MFlops |
| kaon | 2.0 GHz Dual CPU Dual Core Opteron | 600 | 4696 MFlops | 3180 MFlops | 3832 MFlops |
| 7N | 1.9 GHz Dual CPU Quad Core Opteron | 396 | 8800 MFlops | 5148 MFlops | 6300 MFlops |
| 6N | 3.0 GHz Single CPU Dual Core Pentium | 256 | 2900 MFlops | 1408 MFlops | 1960 MFlops |
| 4G | 2.8 GHz Single CPU Single Core Xeon | 384 | 1582 MFlops | 636 MFlops | 1249 MFlops |
| BlueGene/P | 850 MHz Quad Core PowerPC 850 | 4 cores/node, 1024 nodes/rack | 2560 MFlops | | 2680 MFlops |
| Cray XT4 | 2.6 GHz Dual Core Opteron | ??? | 2660 MFlops | | 2340 MFlops |

The table above shows the measured performance of the DWF, anisotropic clover, and asqtad inverters on the qcd, pion, kaon, 7N, 6N, and 4G clusters, and on the ANL BG/P and the ORNL XT4.

For qcd and pion, the asqtad numbers were taken on 64-node runs with a 14^4 local lattice per node. The DWF numbers were taken on 64-node runs using Ls=16 and are the average of the 32x8x8x8 and 32x8x8x12 local lattice runs.

The DWF, Clover, and asqtad figures for kaon, 6N, and 7N come from 128-process runs (32, 64, and 16 nodes respectively), with 4, 2, and 8 processes per node respectively, one process per core. Clover performance on 7N used 128 processes with 4^3x8 local volumes per process.

The DWF and Clover runs for 4G used single panels (128-node jobs, 1 core per node) with mesh layouts of 1x4x4x8.

The BG/P and XT4 DWF measurements used local volumes per core of 4^4 (Ls=16) and 6x6x6x4, respectively. The BG/P asqtad result is the average over 6^4 and 8^4 local volumes and is single precision. The BG/P DWF result is double precision.
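The run geometries quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names (`process_count`, `sites`, `per_core`) are hypothetical, and the per-core figure assumes the per-node rate divides evenly across cores with one process per core, which the measurements themselves do not establish.

```python
from math import prod

# (nodes, processes per node) for the 128-process runs described in the text.
RUNS = {"kaon": (32, 4), "6N": (64, 2), "7N": (16, 8)}

# DWF per-node rates in MFlops, copied from the table.
DWF_PER_NODE = {"kaon": 4696, "6N": 2900, "7N": 8800}


def process_count(nodes, procs_per_node):
    """Total number of processes in a run."""
    return nodes * procs_per_node


def sites(local_dims):
    """Number of lattice sites in a local volume such as (4, 4, 4, 8)."""
    return prod(local_dims)


def per_core(mflops_per_node, cores_per_node):
    """Per-core rate, ASSUMING an even split of the per-node rate across cores."""
    return mflops_per_node / cores_per_node


if __name__ == "__main__":
    for name, (nodes, ppn) in RUNS.items():
        total = process_count(nodes, ppn)
        rate = per_core(DWF_PER_NODE[name], ppn)
        print(f"{name}: {nodes} nodes x {ppn} procs = {total} processes, "
              f"DWF ~{rate:.0f} MFlops per core")

    # The 4G mesh layout 1x4x4x8 should account for all 128 single-core nodes.
    print("4G mesh ranks:", prod((1, 4, 4, 8)))

    # Sites per process for the 7N Clover runs (4^3 x 8 local volume).
    print("7N Clover local sites:", sites((4, 4, 4, 8)))
```

All three cluster configurations multiply out to 128 processes, and the 4G mesh layout gives 128 ranks, matching the job sizes stated in the text.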