Dual Athlon MP (Palomino) Results
Thanks to Linux Networx, we were given the opportunity to benchmark MILC
version 6 code on a dual Athlon MP motherboard. The processors ran at 1.2 GHz,
and the system had 1 GB of physical memory.
Optimization by Packing Data Structures
The plot below shows the performance of the improved staggered fermion code,
su3_rmd_symzk1_asqtad. The red curve is standard MILC 6 code. The green
curve is from a version of the same code which has been modified by Steve
Gottlieb and Dick Foster to pack the corresponding data structures from the
sites of the lattice together. The blue curve combines this data restructuring
with SSE math routines (matrix-matrix and matrix-vector multiplies).
Palomino chips support SSE binaries, though AMD warns that performance will
not necessarily match the speedup on Intel chips. That is, SSE instructions
are supplied primarily for binary compatibility. Indeed, in the MILC code the
SSE-optimized code runs no faster. We have yet to implement the math routines
in 3DNow!.
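
For reference, the kind of routine that the SSE versions replace can be
sketched in plain C. This is a minimal scalar illustration, not the MILC
source: the type names complex_f, su3_matrix_f and su3_vector_f and the
function su3_mat_vec_ref are hypothetical, and single-precision storage is
assumed (consistent with the 72-byte and 24-byte sizes quoted on this page).

```c
/* Hypothetical single-precision types, for illustration only. */
typedef struct { float real, imag; } complex_f;        /* 8 bytes  */
typedef struct { complex_f e[3][3]; } su3_matrix_f;    /* 72 bytes */
typedef struct { complex_f c[3]; } su3_vector_f;       /* 24 bytes */

/* Scalar reference: dst = m * src for a 3x3 complex matrix and a
   3-component complex vector. dst must not alias src. */
static void su3_mat_vec_ref(const su3_matrix_f *m,
                            const su3_vector_f *src,
                            su3_vector_f *dst)
{
    for (int i = 0; i < 3; i++) {
        float re = 0.0f, im = 0.0f;
        for (int j = 0; j < 3; j++) {
            /* (a + ib)(c + id) = (ac - bd) + i(ad + bc) */
            re += m->e[i][j].real * src->c[j].real
                - m->e[i][j].imag * src->c[j].imag;
            im += m->e[i][j].real * src->c[j].imag
                + m->e[i][j].imag * src->c[j].real;
        }
        dst->c[i].real = re;
        dst->c[i].imag = im;
    }
}
```

An SSE or 3DNow! version would compute the same result while processing
several floats per instruction.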
The cache line size on an Athlon is 64 bytes, and the fundamental data
structures used in the MILC computations are 72 bytes (su3 matrix) and 24
bytes (su3 vector). When the matrices are traversed with a large stride, as
in the standard MILC code, two cache lines must be loaded for each matrix, yet
in general only 72 of the 128 bytes loaded are useful. When the matrices are
packed together, the extra 56 bytes loaded with one matrix belong to a
neighboring matrix, which may be used before it is evicted from the cache. The
speedup on Pentium 4 processors is much more pronounced (see New P4
Optimizations).
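
The cache-line arithmetic above can be checked with a short C sketch. It is
an illustration under stated assumptions, not part of the MILC code: the type
and function names are hypothetical, the first matrix is assumed to start on
a cache-line boundary, and the stride is assumed to be at least the matrix
size.

```c
#include <stddef.h>

#define CACHE_LINE 64u  /* Athlon cache line size, from the text */

/* Hypothetical single-precision types, for illustration only. */
typedef struct { float real, imag; } complex_f;        /* 8 bytes  */
typedef struct { complex_f e[3][3]; } su3_matrix_f;    /* 72 bytes */

/* Number of cache lines overlapped by `size` bytes at byte `offset`. */
static size_t lines_touched(size_t offset, size_t size)
{
    size_t first = offset / CACHE_LINE;
    size_t last = (offset + size - 1) / CACHE_LINE;
    return last - first + 1;
}

/* Bytes fetched to read n matrices whose starts are `stride` bytes apart
   (stride >= sizeof(su3_matrix_f)), counting each cache line once. */
static size_t bytes_loaded(size_t n, size_t stride)
{
    size_t lines = 0;
    size_t next_unseen = 0;  /* lowest line index not yet counted */
    for (size_t i = 0; i < n; i++) {
        size_t first = (i * stride) / CACHE_LINE;
        size_t last = (i * stride + sizeof(su3_matrix_f) - 1) / CACHE_LINE;
        if (first < next_unseen)
            first = next_unseen;  /* line shared with the previous matrix */
        lines += last - first + 1;
        next_unseen = last + 1;
    }
    return lines * CACHE_LINE;
}
```

With a hypothetical stride of 1024 bytes, bytes_loaded(8, 1024) counts two
lines per matrix, or 1024 bytes fetched for 576 useful bytes. With the
matrices packed, bytes_loaded(8, sizeof(su3_matrix_f)) counts nine lines, or
576 bytes, all of them useful.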
SMP Performance
The plot below shows, in red, the performance of a single binary, as well as
the performance of two copies of the same binary running simultaneously, one
on each of the two processors. Scaling is quite good on this motherboard for
this code.