Dual Athlon MP (Palomino) Results

Thanks to Linux Networx, we were able to benchmark the MILC version 6 code on a dual Athlon MP motherboard. The processors ran at 1.2 GHz, and the system had 1 GB of physical memory.

Optimization by Packing Data Structures

The plot below shows the performance of the improved staggered fermion code, su3_rmd_symzk1_asqtad. The red curve is the standard MILC 6 code. The green curve is a version of the same code that Steve Gottlieb and Dick Foster modified to pack the corresponding data structures from the sites of the lattice together. The blue curve adds SSE math routines (matrix-matrix and matrix-vector multiplies) to this restructured code.
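The difference between the two layouts can be sketched in C. This is only an illustration, not the MILC source or the actual Gottlieb/Foster modification: the `site` and `packed_lattice` structs, their field names, the `NDIR` constant, and the helper functions are all assumptions made for the example. Only the 72-byte matrix and 24-byte vector sizes are taken from the text.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float real, imag; } complex_f;     /* 8 bytes  */
typedef struct { complex_f e[3][3]; } su3_matrix;   /* 72 bytes */
typedef struct { complex_f c[3]; } su3_vector;      /* 24 bytes */

#define NDIR 4  /* assumed: one link matrix per space-time direction */

/* Hypothetical site-major layout: every field of a site is stored together. */
typedef struct {
    su3_matrix link[NDIR];
    su3_vector phi;     /* illustrative extra fields */
    su3_vector resid;
} site;

/* Hypothetical packed layout: one contiguous array per field. */
typedef struct {
    su3_matrix *link;   /* nsites * NDIR matrices */
    su3_vector *phi;    /* nsites vectors */
    su3_vector *resid;  /* nsites vectors */
} packed_lattice;

/* Link matrix for direction dir at site i in the site-major layout. */
su3_matrix *link_site_major(site *lattice, size_t i, int dir) {
    assert(dir >= 0 && dir < NDIR);
    return &lattice[i].link[dir];
}

/* The same link matrix in the packed layout. */
su3_matrix *link_packed(packed_lattice *lat, size_t i, int dir) {
    assert(dir >= 0 && dir < NDIR);
    return &lat->link[i * NDIR + dir];
}

/* Byte distance between the same-direction link on consecutive sites. */
size_t site_major_stride(void) { return sizeof(site); }
size_t packed_stride(void)     { return NDIR * sizeof(su3_matrix); }
```

In the packed layout the link matrices of consecutive sites sit back to back, so a sweep over the lattice reads memory contiguously. In the site-major layout every step also skips over the other fields of the site; in this toy struct that is only two vectors, whereas the text describes the stride in the standard code as large.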

Palomino chips support SSE binaries, though AMD warns that performance will not necessarily match the speedup seen on Intel chips. That is, SSE instructions are supplied primarily for binary compatibility. Indeed, the SSE-optimized MILC code runs no faster. We have yet to implement the math routines in 3DNow!.

The cache line size on an Athlon is 64 bytes, and the fundamental data structures used in the MILC computations are 72 bytes (su3 matrix) and 24 bytes (su3 vector). When the matrices are traversed with a large stride (normal MILC code), two cache lines must be loaded for each matrix. In general, however, only 72 of the 128 bytes loaded are useful. When the matrices are packed together, the extra 56 bytes loaded with one matrix belong to another matrix, which may be used before it is evicted from the cache. The speedup on Pentium 4 processors is much more pronounced (see New P4 Optimizations).
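The cache-line arithmetic above can be checked with a few lines of C. This is a minimal sketch under stated assumptions: the function names are hypothetical, and each isolated matrix is assumed to begin on a cache-line boundary. The 64-byte line and 72-byte matrix sizes come from the text.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE   64  /* Athlon cache line size in bytes (from the text) */
#define MATRIX_BYTES 72  /* su3 matrix size in bytes (from the text) */

/* Number of cache lines touched by an object of `size` bytes at byte offset `addr`. */
size_t lines_touched(size_t addr, size_t size, size_t line) {
    assert(line > 0 && size > 0);
    size_t first = addr / line;
    size_t last  = (addr + size - 1) / line;
    return last - first + 1;
}

/* Bytes loaded but not used when one isolated object is read. */
size_t wasted_bytes(size_t addr, size_t size, size_t line) {
    return lines_touched(addr, size, line) * line - size;
}

/* Cache lines touched by n objects stored back to back, starting at offset 0. */
size_t lines_for_packed(size_t n, size_t size, size_t line) {
    assert(n > 0);
    return lines_touched(0, n * size, line);
}
```

With these sizes, `lines_touched(0, MATRIX_BYTES, CACHE_LINE)` gives 2 and `wasted_bytes(0, MATRIX_BYTES, CACHE_LINE)` gives 56, matching the figures above. Eight packed matrices occupy 576 bytes, which is exactly 9 cache lines, compared with 16 line loads if each matrix is fetched in isolation.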

SMP Performance

The plot below shows, in red, the performance of a single binary, as well as the performance of two copies of the same binary running simultaneously, one on each of the two processors. Scaling is quite good on this motherboard for this code.