I've got a zx2000 (1.5 GHz/6 MB Madison processor, 2 GB PC2100 RAM; general system details at http://www.openpa.net/systems/hp_zx2000.html) that I use for testing and benchmarking. Obviously there's some difference in performance characteristics between this machine and a gazillion-processor Altix, but it's usually not too far off. If there's any code you want tested, feel free to email me (replace spambox with michael if you think your email will upset SpamAssassin). It's running Debian with ICC 10.1 20080801. It's also got GCC 4.1.2, but IME using GCC instead of ICC on IA64 results in somewhat reduced performance, to say the least.

For a couple of reference points, here are some numbers from an Opteron 1210 machine (1.8 GHz/2x1 MB dual core, 2 GB PC2-6400) with GCC 4.3.2 running Solaris 10 (32-bit mode), and a Core 2 Q6600 (2.4 GHz/2x4 MB quad core, 8 GB PC2-6400) with Visual Studio 2005 running Windows (64-bit mode). Note that these are as much a test of compilers as a test of architectures. I've spent a bit of effort tuning the compilers (and adjusting the code to help the compilers), but someone who really knows their stuff can probably get a bit more oomph out of them.

The first test is my Mersenne Twister based PRNG library, which uses the Ziggurat algorithm for the Gaussian and exponential distributions. On x86, it's accelerated using SSE (or MMX, if there's no SSE), though it falls back to tweaked C++ code (many-register 64-bit, many-register 32-bit, and register-constrained 32-bit for things like x86). It's mostly integer code, though things like the Gaussian tail distribution are transcendental-fp limited (sqrt, log). The test involves filling a buffer of 1000 samples, with the returned number being processor cycles per sample. I've included SSE and non-SSE numbers for the Opteron and Core 2 machines, and both 32-bit and 64-bit mode for the Core 2. The PRNG benchmark only uses a single thread, so it gains no speedup on multicore machines.

Test               Itanium Opteron       Core 2 64bit  Core 2 32bit
Variant                       SSE No SSE    SSE No SSE    SSE No SSE
Uniform u32          3.72    5.02  24.20   3.07   4.51   3.14   9.85
Uniform fp32        14.90    5.11  27.12   2.89   6.39   3.02  12.65
Uniform fp64        11.43   10.39  70.72   5.71  11.20   5.81  36.55
Gaussian tail fp64 113.58  225.82 324.47 197.30 188.54 108.73 254.24
Gaussian fp64       34.30   49.73  96.43  26.47  24.54  25.80  58.81
Exponential fp32    28.59   29.50  47.58  17.22  15.18  19.71  27.64
Exponential fp64    45.04   53.55 101.23  29.02  25.07  30.46  59.44

Cycle for cycle, the Itanium holds its ground more or less against the Opteron. Against the Core 2 in 64-bit mode with SSE, it gets thumped pretty badly in everything except the Gaussian tail distribution. This is mainly due to the sheer amount of integer grunt that's available on the Core 2 when you fully use SSE2. Even the Opteron, executing pretty much exactly the same code as the Core 2, can't keep up, since its SSE integer units are only half as wide. I'd expect a quad-core (Barcelona) Opteron would have a much better showing here.

If you take away the advantage of hand-optimised SSE but stay in 64-bit mode, the Itanium is a bit more competitive, though it still only manages to beat the Core 2 in one distribution (uniform u32). Unfortunately, I can't run the Opteron box in 64-bit mode, so no results there.

Finally, if you kick the x86 machines back into 32-bit mode and take away the tuned SSE code, the shortage of registers takes the legs out from under the compilers. The Itanium takes the crown in nearly all the tests, and GCC on the Opteron simply implodes.


The second test is a Monte Carlo raytracer that tracks the path of ions through a gas-filled solenoid, simulating interactions (scattering and charge exchange) between the ion and the gas. At the core it's a 4th/5th-order adaptive Runge-Kutta-Fehlberg integrator that does bilinear sampling of the magnetic field, and it uses the above PRNG library to sample the scattering and charge-exchange events. It's primarily fp limited, since the working set is very small. It is only minimally SSE-accelerated, since both GCC and ICC make a complete mess of the autovectorization and I haven't had time to go in and do it all by hand; the main RKF calculations are not vectorized. It can also do GPGPU acceleration using DX10, but I'm leaving that out here. Time is in seconds for 1000 ions (with gas) or 200000 ions (without gas), rounded to the nearest hundredth, and since it's Monte Carlo it obviously scales linearly with the number of cores.

Test        Itanium Opteron      Core 2 64bit Core 2 32bit
Variant               SSE No SSE  SSE No SSE   SSE No SSE
With gas      23.50 11.52  12.32 3.25   3.40  5.77   4.67
Without gas   20.41 10.03  10.51 3.48   3.35  5.13   5.69

Obviously, with the most cores and the most raw clock speed, the Core 2 completely dominates. Scaling to a 1 GHz single core (i.e. multiplying by the number of cores and the clock speed in GHz) gives a bit more of an idea of efficiency:

Test        Itanium Opteron      Core 2 64bit Core 2 32bit
Variant               SSE No SSE   SSE No SSE   SSE No SSE
With gas      35.25 41.47  44.35 31.20  32.64 55.39  44.83
Without gas   30.62 36.11  37.84 33.41  32.16 49.25  54.62

When MADDs are the bulk of the work (no gas, so basic RKF integration) the Itanium comes out on top by a hair. The Core 2, or more likely the MSVC compiler, really struggles in 32-bit mode, though it can come close to the Itanium in 64-bit mode. SSE has only a minimal (in fact, negative for the Core 2 in 64-bit mode) effect here, since most of the code doesn't use it. When the gas interactions are added, the Itanium drops back behind the 64-bit Core 2 but still comes in ahead of the 32-bit Opteron. It would have been interesting to see how the Opteron did in 64-bit mode.


So for these two (admittedly rather limited in scope) tests, the Itanium is relatively competitive on a clock-for-clock basis with a Core 2 in 64-bit mode in floating-point-dominated tests where the integration kernel hasn't been vectorized using SSE. Once the workload becomes a bit more branchy and integer-heavy, it drops behind slightly. In situations that have had a lot of SSE tuning, though, such as the PRNG code, the Core 2 really dominates.

Of course, clock for clock doesn't help all that much when the top-end Core 2 is running about twice as fast as the top-end Itanium, and is much cheaper. And this is the basic problem for the Itanium - the top speed bin of the Itanium has only gone up 166 MHz (11%) since June 2003, and core IPC hasn't gone up much either. All that's changed from a performance point of view is that there's more cache, a faster bus, and more cores per socket. This obviously has some benefit to more memory-hungry software, but you have to wonder how well a Nehalem would do if you gave it a similar amount of cache.


The main thing I've seen going for the Itanium in HPC is SGI's NUMALink. A colleague of mine is developing some quantum mechanics simulation stuff, and scaling on the ANU Altix is great. Scaling on a Woodcrest Xeon cluster using Infiniband ... poor to the point of almost not worth going outside a single node. Hopefully, with the Nehalem and Tukwila sharing the same socket we might be able to get NUMALinked Nehalems, which would really throw a curveball into the HPC interconnect market.


Cheers,
Michael
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
