Optimizing MILC Math Routines with SSE

The Pentium III ("Coppermine") processors on the QCDHOME cluster support Intel's Streaming SIMD Extensions, or SSE. See Intel's SSE site for more information. Microprocessors with SSE capabilities have a set of eight 128-bit registers, named xmm0 - xmm7. The SSE instruction set allows the programmer to treat each of these registers as a collection of packed subfields, and various arithmetic instructions (add, subtract, multiply, shift, logical) operate across all of the subfields simultaneously. For single precision floating point, each xmm register holds four values, and the processor can execute the same floating point operation on the four values simultaneously. SSE2, supported on the Pentium 4 processor, adds 64-bit subfields and so supports SIMD operations on double precision operands.

The various SU3 matrix and vector operations performed by MILC are promising candidates for vectorization. We'll use the matrix-vector multiplication routine, mult_su3_mat_vec (see the m_matvec.c source module) as an example.

C version

Here's the core C code from the MILC library:

void mult_su3_mat_vec( su3_matrix *a, su3_vector *b, su3_vector *c ){
int i;
register float t,ar,ai,br,bi,cr,ci;

    for(i=0;i<3;i++){

        ar=a->e[i][0].real; ai=a->e[i][0].imag;
        br=b->c[0].real; bi=b->c[0].imag;
        cr=ar*br; t=ai*bi; cr -= t;
        ci=ar*bi; t=ai*br; ci += t;

        ar=a->e[i][1].real; ai=a->e[i][1].imag;
        br=b->c[1].real; bi=b->c[1].imag;
        t=ar*br; cr += t; t=ai*bi; cr -= t;
        t=ar*bi; ci += t; t=ai*br; ci += t;

        ar=a->e[i][2].real; ai=a->e[i][2].imag;
        br=b->c[2].real; bi=b->c[2].imag;
        t=ar*br; cr += t; t=ai*bi; cr -= t;
        t=ar*bi; ci += t; t=ai*br; ci += t;

        c->c[i].real=cr;
        c->c[i].imag=ci;
    }
}

The defined types are:

typedef struct {           /* standard complex number declaration for single- */
   float real;             /* precision complex numbers                       */
   float imag;
} complex;

typedef struct { complex c[3]; } su3_vector;

typedef struct { complex e[3][3]; } su3_matrix;

SSE Version

For writing SSE code, we've used the NASM assembler, available from http://nasm.2y.net. NASM has a very clean syntax, with operand ordering matching Intel's documentation.

The source code for the SSE version of mult_su3_mat_vec is available here. To understand the code, let's start with the prototype and a typical call:

         void sse_mult_su3_mat_vec(su3_matrix *, su3_vector *, su3_vector *);

         su3_matrix a;
         su3_vector b, c;
         sse_mult_su3_mat_vec( &a, &b, &c);

In the first lines of the SSE code, the addresses of the arguments are loaded into registers eax, ebx, and ecx after first saving the contents of these registers on the stack:

  global sse_mult_su3_mat_vec
        push            ebp
        mov             ebp,esp
        push            eax
        push            ebx
        push            ecx
        mov             eax,[ebp+8]                     ; su3_matrix *a
        mov             ebx,[ebp+12]                    ; su3_vector *b
        mov             ecx,[ebp+16]                    ; su3_vector *c
Note the declaration of the global entry point, which allows the object module from this SSE code to be linked to the caller's C code.

The strategy of the code is to load and save the b vector into xmm0-2, preshuffled so as to line up real and imaginary parts for multiplication with each row of the matrix.

        ;  bring in real and imaginary b vector
        movups          xmm0,[ebx]                      ; c1i,c1r,c0i,c0r
        movaps          xmm1,xmm0
        shufps          xmm1,xmm1,0xB1                  ; c1r,c1i,c0r,c0i
        movhps          xmm2,[ebx+16]                   ; c2i,c2r,x,x
        shufps          xmm2,xmm2,0xEB                  ; c2i,c2r,c2r,c2i

        ; xmm0:   c1i, c1r, c0i, c0r
        ; xmm1:   c1r, c1i, c0r, c0i
        ; xmm2:   c2i, c2r, c2r, c2i
The comments here label the real and imaginary parts of a generic set of complex operands, c0, c1, and c2. The movups and movaps operations move data from memory to the xmm registers, and between the registers. movaps can be used whenever the operands are 16-byte aligned, or when moving between registers. movups is slower, but does not require alignment. shufps is used to shuffle the 32-bit subfields of the SSE operands. Refer to the Intel Instruction Set Reference Manual for details on the bit field taken by shufps to specify the movement of subfields. In this case, from the comments we see that shufps was used twice to arrange the three complex pairs of the SU3 vector into three xmm registers, with both real-imaginary and imaginary-real orderings.

Next, the code brings in the first row of the matrix into xmm3-4, with shuffling used as shown by the comment:

        ; bring in first row of a matrix
        movups          xmm3,[eax]                      ; a1i,a1r,a0i,a0r
        movups          xmm4,[eax+8]                    ; a2i,a2r,a1i,a1r
        shufps          xmm4,xmm4,0xEE                  ; a2i,a2r,a2i,a2r

        ; xmm3:   a1i, a1r, a0i, a0r
        ; xmm4:   a2i, a2r, a2i, a2r

Three four-wide multiplies are performed with the mulps instruction to form the various real.real, imag.imag, real.imag, and imag.real products of the matrix-row and vector elements:

        movaps          xmm5,xmm3
        mulps           xmm3,xmm0
        mulps           xmm5,xmm1
        mulps           xmm4,xmm2

        ; xmm3:  c1i.i, c1r.r, c0i.i, c0r.r
        ; xmm4:  c2i.i, c2r.r, c2i.r, c2r.i
        ; xmm5:  c1i.r, c1r.i, c0i.r, c0r.i

Next, shuffles are used to prepare the multiplied pairs for vectorized summing:

        movaps          xmm6,xmm5
        shufps          xmm6,xmm3,0x4E
        shufps          xmm5,xmm3,0xE4

        ; xmm4:  c2i.i, c2r.r, c2i.r, c2r.i
        ; xmm5:  c1i.i, c1r.r, c0i.r, c0r.i
        ; xmm6:  c0i.i, c0r.r, c1i.r, c1r.i

Finally, a pair of four-wide adds forms the unsummed real and imaginary pieces of the result of multiplying the first row of the matrix by the vector:

        addps   xmm4,xmm5
        addps   xmm4,xmm6

        ; xmm4:  i-sum, r-sum, i.r-sum, r.i-sum  [0]
Next, analogous code multiplies the second row of the matrix by the vector, storing the unsummed real and imaginary pieces into xmm7 (xmm4 is also copied into xmm5). By use of shufps, the unsummed pieces in xmm4, xmm5, and xmm7 can be rearranged to optimize the required addition:

        shufps          xmm5,xmm7,0x22
        shufps          xmm4,xmm7,0x77
        ; xmm4:  i.r-sum[1], i-sum[1], i.r-sum[0], i-sum[0]
        ; xmm5:  r.i-sum[1], r-sum[1], r.i-sum[0], r-sum[0]

        xorps           xmm4,[negate]
        addps           xmm5,xmm4
        movups          [ecx],xmm5

Here, the negate bit mask is used to multiply the imag.imag pieces by -1 before summing with the real.real pieces:

          align           16
  negate: dd              0x80000000
          dd              0x00000000
          dd              0x80000000
          dd              0x00000000

The final result of multiplying the vector by the first two rows, a pair of complex numbers, is stored with a single movups operation.

What remains is the multiplication of the third row of the matrix with the vector. This is identical to the prior row by vector multiplies, with the exception that a movhps operation is used at the end to move the single complex number to the third position of the result SU3 vector (see the code).

At the end of the routine, the original values of the registers used to access the operands are reloaded from the stack, and a ret is used to return to the calling program:

        pop     ecx
        pop     ebx
        pop     eax
        mov     esp,ebp
        pop     ebp
        ret

Timing the Improvement

The test_mul.c code can be used to test the correctness of the SSE routine, and to time the speeds of the MILC C and SSE versions. It initializes the a and b operands with random numbers, then in a loop calls the MILC and SSE versions iter times, as specified by the user. The rdtscll macro from /usr/include/asm/msr.h is used to read the Pentium cycle counter. The output of test_mul.c shows the matrix and vector operands, and the results from the C and SSE codes, as well as the number of cycles used by each routine. Running in the real time queue (which requires running as root) is necessary for accurate timings.

A sample run, with one million iterations:

  qcdhome:~/version5/libraries$ su -c './test_mul 1000000'
   0.4+1.6i  1.7+0.9i  1.6-1.6i    -0.5-1.4i     4.2-1.4i     4.2-1.4i
   1.7+0.1i  0.3+0.2i -2.0-1.1i     2.0-2.0i     2.7-0.4i     2.7-0.4i
  -1.4+0.7i -2.0+0.1i -2.0-0.9i    -1.3-0.4i     0.2+7.7i     0.2+7.7i
  Time per iteration:
    MILC: 133
    SSE:  103

Implemented Routines and Their Timings

The table below shows the routines currently with SSE implementations, along with their timings, in cycles, on Pentium III, Pentium 4, and Athlon MP chips. The subroutines in red are used heavily in the D-slash routine. The timings include the overheads associated with subroutine calls and argument passing - approximately 16 cycles on a Pentium 4 when three arguments are passed by reference. Note: See the inline implementation for improved timings on many of these routines.

Pentium III (Coppermine), Pentium 4, and Athlon MP Timings (cycles per call)

                                Pentium III     Pentium 4      Athlon MP
  Routine                       MILC    SSE    MILC    SSE    MILC    SSE
  mult_su3_mat_vec               131    102     162     96     100    101
  mult_adj_su3_mat_vec           128     98     116    101      99    100
  mult_su3_mat_vec_sum_4dir      599    369     603    368     506    380
  mult_adj_su3_mat_vec_4dir      500    304     608    362     411    332
  mult_adj_su3_mat_hwvec         297    178     276    185     199    182
  mult_su3_mat_hwvec             297    184     276    176     199    189
  mult_su3_nn                    421    257     399    251     405    270
  mult_su3_na                    400    255     386    253     414    270
  scalar_mult_add_su3_matrix     142     64     152     74     139     57
  su3_projector                  222    107     234    102     258     99

To convert the results in this table to MFlop ratings, multiply the number of floating point operations by the CPU clock speed and divide by the listed cycle count. For example, an SU3 matrix-matrix multiply requires 108 multiplies and 90 adds. On a 1.4 GHz Pentium 4 processor, the MFlops for the mult_su3_nn routine are:

   (108 + 90 Flop) * (1400 Mcycles/sec) / 251 = 1104 MFlop/sec

Using GCC Instead of NASM

One disadvantage of using NASM for SSE routines is that the functions cannot be inlined. Recent versions of the GNU binutils (2.11.2) and gcc (2.95.3) packages allow SSE instructions to be used in inline assembly. For detailed examples, see Martin Luescher's code and SSE-HOWTO document. Look at a sample of his macros in sse.h to get the flavor of inlining assembler in GCC.

New: See the inline version discussion for information about a NASM-to-inline-gcc converter.

Kernel Requirements

The operating system needs to be aware of the SSE/SSE2 registers, since when switching processes these registers are part of a given program's context. Among the stable Linux kernels, only the 2.4.x series supports SSE/SSE2 by default. For earlier kernels, you'll have to apply a patch. For 2.2.18, we've used Andrea Arcangeli's PIII-7 patch, available from ftp.kernel.org.
Don Holmgren
Last Modified: 28th Jan 2002