The cache line size on a P4 is 64 bytes, and the fundamental data structures used in the MILC computations are 72 bytes (su3 matrix) and 24 bytes (su3 vector). When the matrices are traversed with a large stride, as in the normal MILC code, two cache lines must be loaded for each matrix, yet in general only 72 of the 128 bytes loaded are useful. When the matrices are packed together, the extra 56 bytes loaded with one matrix belong to another matrix, which may be used before it is evicted from the cache. The speedup on Pentium 4 processors is very pronounced (see New P4 Optimizations).
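The cache-line arithmetic above can be checked with a small sketch. The type definitions below are hypothetical stand-ins chosen only to match the 72-byte and 24-byte sizes quoted in the text, not the actual MILC declarations. The helpers `lines_touched` and `bytes_loaded` are illustrative names, and the model assumes that matrices are visited in increasing address order and that each cache line is fetched once.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64u  /* P4 cache line size in bytes, as stated in the text */

/* Hypothetical single-precision types sized to match the text. */
typedef struct { float real, imag; } complex_f;    /*  8 bytes */
typedef struct { complex_f e[3][3]; } su3_matrix;  /* 72 bytes */
typedef struct { complex_f c[3]; } su3_vector;     /* 24 bytes */

static_assert(sizeof(su3_matrix) == 72, "su3_matrix should be 72 bytes");
static_assert(sizeof(su3_vector) == 24, "su3_vector should be 24 bytes");

/* Number of distinct cache lines covered by the byte range
   [offset, offset + size). */
static size_t lines_touched(size_t offset, size_t size)
{
    if (size == 0)
        return 0;
    return (offset + size - 1) / CACHE_LINE - offset / CACHE_LINE + 1;
}

/* Bytes fetched when visiting n matrices placed `stride` bytes apart,
   counting each cache line once. Assumes stride >= sizeof(su3_matrix),
   so the lines are visited in increasing order. */
static size_t bytes_loaded(size_t n, size_t stride)
{
    size_t total = 0;
    size_t last_line = (size_t)-1;  /* no line fetched yet */

    for (size_t i = 0; i < n; i++) {
        size_t first = (i * stride) / CACHE_LINE;
        size_t last = (i * stride + sizeof(su3_matrix) - 1) / CACHE_LINE;

        if (first == last_line)
            first++;  /* this line was already fetched for the previous matrix */
        if (first <= last)
            total += (last - first + 1) * CACHE_LINE;
        last_line = last;
    }
    return total;
}
```

In this model, calling `bytes_loaded(8, 1024)` (a hypothetical stride of 1024 bytes) counts 128 bytes per matrix, while `bytes_loaded(8, sizeof(su3_matrix))` counts only the 576 bytes that the eight packed matrices actually occupy.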
The plot below shows the performance of MILC 6 code modified to pack similar data structures together.
The plot below shows the scaling of the two types of MILC 6 code as a function of lattice size. In this case, two independent copies of the same code are run at the same time.
The plot below shows the scaling of the MILC 6 code with data layout modifications when the MPI version of that code is run cooperatively on the two processors. Single-process performance is shown. Results are presented for both standard MPI with shared memory (version 1.2.1) and VMI's version.