- NEW! Inline SSE routines for GCC
- C Example - SU3 Matrix-Vector Multiplication
- SSE Version of SU3 Matrix-Vector Multiplication
- Timings of SSE and C Subroutines
- Using GCC Instead of NASM
- Kernel Requirements
- Catalog of MILC Math Routines
- Tar of SSE Additions to Version5

The Pentium III (

The various SU3 matrix and vector operations performed by MILC are promising
candidates for vectorization. We'll use the matrix-vector multiplication
routine, **mult_su3_mat_vec** (see the
m_matvec.c
source module) as an example.

void mult_su3_mat_vec( su3_matrix *a, su3_vector *b, su3_vector *c ){ int i,j,k; register float t,ar,ai,br,bi,cr,ci; for(i=0;i<3;i++){ ar=a->e[i][0].real; ai=a->e[i][0].imag; br=b->c[0].real; bi=b->c[0].imag; cr=ar*br; t=ai*bi; cr -= t; ci=ar*bi; t=ai*br; ci += t; ar=a->e[i][1].real; ai=a->e[i][1].imag; br=b->c[1].real; bi=b->c[1].imag; t=ar*br; cr += t; t=ai*bi; cr -= t; t=ar*bi; ci += t; t=ai*br; ci += t; ar=a->e[i][2].real; ai=a->e[i][2].imag; br=b->c[2].real; bi=b->c[2].imag; t=ar*br; cr += t; t=ai*bi; cr -= t; t=ar*bi; ci += t; t=ai*br; ci += t; c->c[i].real=cr; c->c[i].imag=ci; } }The defined types are:

typedef struct { /* standard complex number declaration for single- */ float real; /* precision complex numbers */ float imag; } complex; typedef struct { complex c[3]; } su3_vector; typedef struct { complex e[3][3]; } su3_matrix;

The source code for the SSE version of **mult_su3_mat_vec** is available
here. To understand the code, let's
start with the prototype and a typical call:

void sse_mult_su3_mat_vec(su3_matrix *, su3_vector *, su3_vector *); su3_matrix a; su3_vector b, c; sse_mult_su3_mat_vec( &a, &b, &c);

In the first lines of the SSE code, the addresses of the arguments are loaded
into registers *eax, ebx, * and *ecx* after first saving the
contents of these registers on the stack:

global sse_mult_su3_mat_vec sse_mult_su3_mat_vec: push ebp mov ebp,esp push eax push ebx push ecx mov eax,[ebp+8] ; su3_matrix *a mov ebx,[ebp+12] ; su3_vector *b mov ecx,[ebp+16] ; su3_vector *cNote the declaration of the global entry point, which allows the object module from this SSE code to be linked to the caller's C code.

The strategy of the code is to load and save the **b** vector into **xmm0-2**,
preshuffled so as to line up real and imaginary parts for multiplication with
each row of the matrix.

; bring in real and imaginary b vector movups xmm0,[ebx] ; c1i,c1r,c0i,c0r movaps xmm1,xmm0 shufps xmm1,xmm1,0xB1 ; c1r,c1i,c0r,c0i movhps xmm2,[ebx+16] ; c2i,c2r,x,x shufps xmm2,xmm2,0xEB ; c2i,c2r,c2r,c2i ; xmm0: c1i, c1r, c0i, c0r ; xmm1: c1r, c1i, c0r, c0i ; xmm2: c2i, c2r, c2r, c2iThe comments here label the real and imaginary parts of a generic set of complex operands,

`movups`

and `movaps`

operations move data from memory to the `movaps`

can be used whenever
the operands are 16-byte aligned, or when moving between registers.
`movups`

is slower, but does not require alignment.
`shufps`

is used to shuffle the 32-bit subfields of the SSE
operands. Refer to the Intel
Instruction Set Reference Manual
for details on the bit field taken by `shufps`

to specify the
movement of subfields. In this case, from the comments we see that
`shufps`

was used twice to arrange the three complex pairs of the
SU3 vector into three
Next, the code brings in the first row of the matrix into **xmm3-4**, with
shuffling used as shown by the comment:

; bring in first row of a matrix movups xmm3,[eax] ; c1i,c1r,c0i,c0r movups xmm4,[eax+8] ; c2i,c2r,c1i,c1r shufps xmm4,xmm4,0xEE ; c2i,c2r,c2i,c2r ; xmm3: a1i, a1r, a0i, a0r ; xmm4: a2i, a2r, a2i, a2r

Three four-wide multiplies are performed with the `mulps`

instruction to form the various real.real, imag.imag, real.imag, and imag.real
combinations of the two vectors:

movaps xmm5,xmm3 mulps xmm3,xmm0 mulps xmm5,xmm1 mulps xmm4,xmm2 ; xmm3: c1i.i, c1r.r, c0i.i, c0r.r ; xmm4: c2i.i, c2r.r, c2i.r, c2r.i ; xmm5: c1i.r, c1r.i, c0i.r, c0r.i

Next, shuffles are used to prepare the multiplied pairs for vectorized summing:

movaps xmm6,xmm5 shufps xmm6,xmm3,0x4E shufps xmm5,xmm3,0xE4 ; xmm4: c2i.i, c2r.r, c2i.r, c2r.i ; xmm5: c1i.i, c1r.r, c0i.r, c0r.i ; xmm6: c0i.i, c0r.r, c1i.r, c1r.i

Finally, a pair of four-wide adds forms the unsummed real and imaginary pieces of the result of multiplying the first row of the matrix by the vector:

addps xmm4,xmm5 addps xmm4,xmm6 ; xmm4: i-sum, r-sum, i.r-sum, r.i-sum [0]Next, analogous code multiplies the second row of the matrix by the vector, storing the unsummed real and imaginary pieces into

`shufps`

, the unsummed pieces in

shufps xmm5,xmm7,0x22 shufps xmm4,xmm7,0x77 ; xmm4: i.r-sum[1], i-sum[1], i.r-sum[0], i-sum[0] ; xmm5: r.i-sum[1], r-sum[1], r.i-sum[0], r-sum[0] xorps xmm4,[negate] addps xmm5,xmm4 movups [ecx],xmm5

Here, the `negate`

bit mask is used to multiply the imag.imag
pieces by -1
before summing with the real.real pieces:

align 16 negate: dd 0x80000000 dd 0x00000000 dd 0x80000000 dd 0x00000000The final result of multiplying the vector by the first two rows, a pair of complex numbers, is stored with a single

`movups`

operation.
What remains is the multiplication of the third row of the matrix with the
vector. This is identical to the prior row by vector multiplies, with the
exception that a `movhps`

operation is used at the end to move the
single complex number to the third position of the result SU3 vector (see the
code).

At the end of the routine, the original values of the registers used to access
the operands are reloaded from the stack, and a `ret`

is used to
return to the calling program:

pop ecx pop ebx pop eax mov esp,ebp pop ebp ret

`test_mul.c`

code can be used to test the
correctness of the SSE routine, and to time the speeds of the MILC C and SSE
versions. It initializes the `iter`

times, as specified by the user. The `rdtscll`

macro from
`test_mul.c`

shows the matrix and vector operands,
and the results from the C and SSE codes, as well as the number of cycles used
by each routine. Running in the real time queue (running as A sample run, with one million iterations:

qcdhome:~/version5/libraries$ su -c './test_mul 1000000' Password: 0.4+1.6i 1.7+0.9i 1.6-1.6i -0.5-1.4i 4.2-1.4i 4.2-1.4i 1.7+0.1i 0.3+0.2i -2.0-1.1i 2.0-2.0i 2.7-0.4i 2.7-0.4i -1.4+0.7i -2.0+0.1i -2.0-0.9i -1.3-0.4i 0.2+7.7i 0.2+7.7i Time per iteration: MILC: 133 SSE: 103

Pentium III | Pentium IV | Athlon MP | ||||
---|---|---|---|---|---|---|

MILC | SSE | MILC | SSE | MILC | SSE | |

mult_su3_mat_vec | 131 | 102 | 162 | 96 | 100 | 101 |

mult_adj_su3_mat_vec | 128 | 98 | 116 | 101 | 99 | 100 |

mult_su3_mat_vec_sum_4dir | 599 | 369 | 603 | 368 | 506 | 380 |

mult_adj_su3_mat_vec_4dir | 500 | 304 | 608 | 362 | 411 | 332 |

mult_adj_su3_mat_hwvec | 297 | 178 | 276 | 185 | 199 | 182 |

mult_su3_mat_hwvec | 297 | 184 | 276 | 176 | 199 | 189 |

mult_su3_nn | 421 | 257 | 399 | 251 | 405 | 270 |

mult_su3_na | 400 | 255 | 386 | 253 | 414 | 270 |

scalar_mult_add_su3_matrix | 142 | 64 | 152 | 74 | 139 | 57 |

su3_projector | 222 | 107 | 234 | 102 | 258 | 99 |

To convert the results in this tables to MFlop ratings, multiply the number of
operations by the CPU clock speed and divide by the listed cycle count. For
example, an SU3 matrix-matrix multiply requires 108 multiplies and 90 adds.
On a 1.4 GHz Pentium 4 processor, the MFlops for the `mult_su3_nn`

routine are:

(108 + 90 Flop) * (1400 Mcycles/sec) / 251 = 1104 MFlop/sec

New: See the inline version discussion for information about a NASM-to-inline-gcc converter.

Don Holmgren Last Modified: 28th Jan 2002