Dual Xeon (Foster) Results

Thanks to Dell, we were given the opportunity to benchmark the MILC version 6 code on a dual Xeon system, a Dell Precision 530. The processors ran at 1.4 GHz, and the system had 2 GB of physical memory.

Optimization by Packing Data Structures

The plot below shows the performance of the improved staggered fermion code, su3_rmd_symzk1_asqtad. The red curve is the standard MILC 6 code. The green curve is from a version of the same code that Steve Gottlieb and Dick Foster modified so that corresponding data structures from the different sites of the lattice are packed together in memory.
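As a rough illustration of what packing means here, the C sketch below contrasts a site-major layout, where every field of one lattice site sits in a single struct, with a packed array that holds the same field from consecutive sites back to back. The type and function names (complex_f, su3_matrix_f, su3_vector_f, site_t, pack_links) and the choice of fields are hypothetical simplifications for illustration. This is not the actual modification made to the MILC code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-precision types, sized to match the text:
   a 3x3 complex matrix is 72 bytes and a 3-component vector is 24 bytes. */
typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix_f;
typedef struct { complex_f c[3]; } su3_vector_f;

/* Site-major layout (simplified assumption): all fields of one site
   live together, so the same field on consecutive sites is
   sizeof(site_t) bytes apart. A real site would hold many more fields. */
typedef struct {
    su3_matrix_f link[4];   /* one matrix per lattice direction */
    su3_vector_f phi;
    su3_vector_f tmp;
} site_t;

/* Copy link[dir] from every site into one contiguous array, so that
   consecutive sites are only sizeof(su3_matrix_f) bytes apart. */
static void pack_links(const site_t *sites, size_t nsites, int dir,
                       su3_matrix_f *packed)
{
    assert(dir >= 0 && dir < 4);
    for (size_t i = 0; i < nsites; i++)
        packed[i] = sites[i].link[dir];
}
```

In this simplified struct the stride between the same link on neighboring sites is sizeof(site_t), which is 336 bytes, while the packed array has a stride of 72 bytes.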

The cache line size on a P4 is 64 bytes, and the fundamental data structures used in the MILC computations are 72 bytes (su3 matrix) and 24 bytes (su3 vector). When the matrices are traversed with a large stride, as in the normal MILC code, two cache lines must be loaded for each matrix, yet in general only 72 of the 128 bytes loaded are useful. When the matrices are packed together, the extra 56 bytes loaded with one matrix belong to another matrix, which may be used before it is evicted from the cache. The speedup on Pentium 4 processors is very pronounced (see New P4 Optimizations).
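The cache line arithmetic above can be checked with a small C sketch. It counts how many 64-byte lines a 72-byte matrix touches and compares a strided traversal with a packed one. The helper names (lines_touched, lines_strided, lines_packed) are hypothetical, the matrices are assumed to start on a cache line boundary, and the stride passed in is assumed to be large enough that no two matrices share a line.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* P4 cache line size in bytes, as stated in the text */

/* Hypothetical single-precision 3x3 complex matrix: 9 * 8 = 72 bytes. */
typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix_f;

/* Number of distinct cache lines covered by bytes [start, start + len). */
static size_t lines_touched(size_t start, size_t len)
{
    assert(len > 0);
    size_t first = start / CACHE_LINE;
    size_t last = (start + len - 1) / CACHE_LINE;
    return last - first + 1;
}

/* Lines loaded when n matrices sit `stride` bytes apart. This sums the
   per-matrix counts, so it is only meaningful when the stride is large
   enough that matrices never share a cache line. */
static size_t lines_strided(size_t n, size_t stride)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += lines_touched(i * stride, sizeof(su3_matrix_f));
    return total;
}

/* Lines loaded when n matrices are stored contiguously from offset 0. */
static size_t lines_packed(size_t n)
{
    return lines_touched(0, n * sizeof(su3_matrix_f));
}
```

For eight matrices and an assumed stride of 1024 bytes, lines_strided(8, 1024) gives 16 lines, or 1024 bytes loaded for 576 useful bytes, while lines_packed(8) gives 9 lines, all of which carry matrix data.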

SMP Performance

The plot below shows, in red, the performance of a single binary running alone, together with the performance of two copies of the same binary running simultaneously, one on each of the two processors. The binary used is from the standard MILC 6 distribution.

The plot below shows the performance of MILC 6 code modified to pack similar data structures together.

The plot below shows the scaling of the two versions of the MILC 6 code as a function of lattice size. In this case, two independent copies of the same code are run at the same time.

The plot below shows the scaling of the MILC 6 code with data layout modifications when the MPI version of that code is run cooperatively on the two processors. Single-process performance is shown for comparison. Results are presented for both standard MPI with shared memory (version 1.2.1) and VMI's version.