Infiniband Performance
In June, we remotely tested the performance of an Infiniband-connected cluster
owned by Topspin. The cluster had 16 dual 2.4 GHz Xeon nodes based on
Tyan 2722S2-533 motherboards with E7501 chipsets (533 MHz FSB).
Two-port Topspin 4x HCAs were used, connected to a Topspin 360
Infiniband switch. The graphs below compare this cluster with our current lattice
QCD cluster, which was built from dual 2.4 GHz Xeon nodes on SuperMicro P4DPE-Q
motherboards with E7500 chipsets (400 MHz FSB) and uses Myrinet 2000 as its
interconnect (M3F-PCI64B-2 interfaces, M3-E128 switch).
All of the graphs shown below are also links. Click on any graph to download
an encapsulated PostScript version.
Pallas Sendrecv
The Pallas Sendrecv
benchmark measures the aggregate bidirectional bandwidth
obtained using MPI_Sendrecv calls. We ran this test between two nodes on each
cluster.
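As a rough illustration of what "aggregate bidirectional bandwidth" means here, the following Python sketch converts a message size and a time per MPI_Sendrecv call into a summed rate. The function name, the factor-of-two counting (one message sent plus one received per call), and the use of 10^6 bytes per MByte are assumptions made for illustration; the benchmark's own reporting convention may differ.

```python
def aggregate_bandwidth_mbyte_per_sec(message_bytes, seconds_per_call,
                                      bytes_per_mbyte=1.0e6):
    """Summed send + receive rate for one rank doing a Sendrecv-style exchange.

    Assumption: each call moves one message out and one message in, so
    2 * message_bytes cross the interface per call.
    """
    if seconds_per_call <= 0:
        raise ValueError("seconds_per_call must be positive")
    return 2.0 * message_bytes / seconds_per_call / bytes_per_mbyte


# Hypothetical numbers, not measurements from either cluster:
# a 1,000,000-byte message exchanged in 2.5 ms per call.
rate = aggregate_bandwidth_mbyte_per_sec(1_000_000, 2.5e-3)
print(f"{rate:.0f} MByte/sec aggregate")
```

Under this counting, a summed figure is twice the rate seen in each direction when the traffic is symmetric.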
We note that the measured Infiniband bandwidth is close to the 800 MByte/sec
limit commonly observed on PCI-X buses for bidirectional traffic. Also, the
new Myrinet M3F-PCIXD-2 interfaces are reported to sustain
489 MByte/sec summed bidirectional bandwidth, limited by the 2.50+2.50 Myrinet
hardware link layer.
Netpipe
We used the MPI version of the Netpipe benchmark from Ames Lab
to measure the one-way bandwidth and latency of ping-ponged MPI send calls.
We ran this test between two nodes on each cluster.
On the bandwidth plot, we note again that the Infiniband result is close to
the PCI-X limit, and that newer Myrinet interfaces achieve very close to 250 MByte/sec.
Shown below is a Netpipe network signature graph, in which bandwidth
is plotted against transfer time for many different message sizes. The
intercept on the abscissa gives the zero-length message latency, and the
horizontal asymptote gives the saturation bandwidth.
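To make the reading of a signature graph concrete, the sketch below uses a simple two-parameter model, transfer time = latency + size / saturation bandwidth. This model is an assumption for illustration and is not Netpipe's own method; the latency and bandwidth values are hypothetical placeholders, and `signature_points` is a made-up helper name.

```python
def signature_points(latency_s, saturation_bytes_per_s, sizes_bytes):
    """Return (transfer_time, bandwidth) pairs for a linear cost model.

    Assumed model: time = latency + size / saturation bandwidth.
    """
    points = []
    for size in sizes_bytes:
        t = latency_s + size / saturation_bytes_per_s
        points.append((t, size / t))
    return points


# Hypothetical parameters chosen only to show the shape of the curve.
latency = 10e-6        # 10 microseconds
saturation = 200e6     # 200 MByte/sec
sizes = [0] + [2 ** k for k in range(0, 25, 4)]

for t, bw in signature_points(latency, saturation, sizes):
    print(f"time = {t * 1e6:10.2f} us   bandwidth = {bw / 1e6:8.2f} MByte/sec")
```

In this model the zero-length message lands on the time axis at exactly the latency, and the bandwidth of the largest messages approaches the saturation value from below, which is the behavior the graph is read for.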
We note from this graph that the latency for MPI messaging on Infiniband using
MVAPICH is 7 microseconds, and the corresponding Myrinet latency is 11
microseconds. Newer Myrinet interfaces are reported to have improved latencies.
MILC Scaling
Shown below are scaling curves for the MILC improved
staggered code. On each of the two clusters, we ran calculations with a constant
lattice size per node, at various lattice sizes and on varying numbers of nodes.
Each curve shows performance for a given per-node lattice size L^4, where L was
4, 6, 8, 10, 12, or 14. Each lattice size was run on 1, 2, 4, 8, and 16 nodes.
Use a wide browser window to see these graphs
side by side. Again, click on the graphs to download encapsulated
PostScript versions.
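The set of runs described above can be enumerated with a short Python sketch. It assumes only that the per-node lattice stays at L^4 sites while nodes are added, so the total problem size grows with the node count; how MILC actually lays the global lattice out across nodes is not described here, and the names used are hypothetical.

```python
LATTICE_SIZES = [4, 6, 8, 10, 12, 14]   # L, for a per-node lattice of L^4 sites
NODE_COUNTS = [1, 2, 4, 8, 16]


def run_matrix(lattice_sizes=LATTICE_SIZES, node_counts=NODE_COUNTS):
    """List (L, nodes, sites_per_node, total_sites) for each run."""
    runs = []
    for L in lattice_sizes:
        sites_per_node = L ** 4          # held fixed as nodes are added
        for nodes in node_counts:
            runs.append((L, nodes, sites_per_node, sites_per_node * nodes))
    return runs


for L, nodes, local, total in run_matrix():
    print(f"L={L:2d}  nodes={nodes:2d}  sites/node={local:6d}  total sites={total:7d}")
```

Each curve in the graphs corresponds to one value of L in this table, traced across the five node counts.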
The increased performance on the Infiniband cluster is due both to the
greater memory bandwidth available with the E7501 chipset and 533 MHz
front side bus processors, and to the higher bandwidth and lower latency of
Infiniband compared with Myrinet.

Don Holmgren
Last Modified: 8th Sep 2003