I've got a zx2000 (1.5 GHz/6 MB Madison processor, 2 GB PC2100 RAM; general system details at http://www.openpa.net/systems/hp_zx2000.html) that I use for testing and benchmarking. Obviously there's some difference in performance characteristics between this machine and a gazillion-processor Altix, but it's usually not too far off. If there's any code you want tested, feel free to email me (replace spambox with michael if you think your email will upset SpamAssassin). It's running Debian with ICC 10.1 20080801. It's also got GCC 4.1.2, but IME using GCC instead of ICC on IA64 results in somewhat reduced performance, to say the least.

For a couple of reference points, here are some numbers from an Opteron 1210 machine (1.8 GHz/2x1 MB dual core, 2 GB PC2-6400) with GCC 4.3.2 running Solaris 10 (32-bit mode), and a Core 2 Q6600 (2.4 GHz/2x4 MB quad core, 8 GB PC2-6400) with Visual Studio 2005 running Windows (64-bit mode). Note that these are as much a test of compilers as a test of architectures. I've spent a bit of effort tuning the compilers (and adjusting the code to help the compilers), but someone who really knows their stuff can probably get a bit more oomph out of them.

The first test is my Mersenne Twister based PRNG library, which uses the Ziggurat algorithm for the Gaussian and exponential distributions. On x86, it's accelerated using SSE (or MMX, if there's no SSE), though it falls back to tweaked C++ code (many-register 64-bit, many-register 32-bit, and register-constrained 32-bit for things like x86). It's mostly integer code, though things like the Gaussian tail distribution are transcendental-fp limited (sqrt, log). The test involves filling a buffer of 1000 samples, with the returned number being processor cycles per sample. I've included SSE and non-SSE numbers for the Opteron and Core 2 machines, and both 32-bit and 64-bit mode for the Core 2. The PRNG benchmark only uses a single thread, so it gains no speedup on multicore machines.

Test               Itanium Opteron       Core 2 64bit  Core 2 32bit
Variant                       SSE No SSE    SSE No SSE    SSE No SSE
Uniform u32          3.72    5.02  24.20   3.07   4.51   3.14   9.85
Uniform fp32        14.90    5.11  27.12   2.89   6.39   3.02  12.65
Uniform fp64        11.43   10.39  70.72   5.71  11.20   5.81  36.55
Gaussian tail fp64 113.58  225.82 324.47 197.30 188.54 108.73 254.24
Gaussian fp64       34.30   49.73  96.43  26.47  24.54  25.80  58.81
Exponential fp32    28.59   29.50  47.58  17.22  15.18  19.71  27.64
Exponential fp64    45.04   53.55 101.23  29.02  25.07  30.46  59.44

Cycle for cycle, the Itanium holds its ground more or less against the Opteron. Against the Core 2 in 64-bit mode with SSE, it gets thumped pretty badly in everything except the Gaussian tail distribution. This is mainly due to the sheer amount of integer grunt that's available on the Core 2 when you fully use SSE2. Even the Opteron, executing pretty much exactly the same code as the Core 2, can't keep up, since its SSE integer units are only half as wide. I'd expect a quad-core (Barcelona) Opteron would have a much better showing here.

If you take away the advantage of hand-optimised SSE but stay in 64-bit mode, the Itanium is a bit more competitive, though it still only manages to beat the Core 2 in one distribution (uniform u32). Unfortunately, I can't run the Opteron box in 64-bit mode, so no results there.

Finally, if you kick the x86 machines back into 32-bit mode and take away the tuned SSE code, the shortage of registers takes the legs out from under the compilers. The Itanium takes the crown in nearly all the tests, and GCC on the Opteron simply implodes.


The second test is a Monte Carlo raytracer that tracks the path of ions through a gas-filled solenoid, simulating interactions (scattering and charge exchange) between the ion and the gas. At the core it's a 4th/5th-order adaptive Runge-Kutta-Fehlberg integrator that does bilinear sampling of the magnetic field, and it uses the above PRNG library to sample the scattering and charge-exchange events. It's primarily fp limited, since the working set is very small. It is only minimally SSE-accelerated, since both GCC and ICC make a complete mess of the autovectorization and I haven't had time to go in and do it all by hand; the main RKF calculations are not vectorized. It can also do GPGPU acceleration using DX10, but I'm leaving that out here. Time is in seconds for 1000 ions (with gas) or 200000 ions (without gas), rounded to the nearest hundredth, and since it's Monte Carlo it obviously scales linearly with the number of cores.

Test        Itanium Opteron      Core 2 64bit Core 2 32bit
Variant               SSE No SSE  SSE No SSE   SSE No SSE
With gas      23.50 11.52  12.32 3.25   3.40  5.77   4.67
Without gas   20.41 10.03  10.51 3.48   3.35  5.13   5.69

Obviously, with the most cores and the most raw clock speed, the Core 2 completely dominates. Scaling to a 1 GHz single core (i.e. multiplying by the number of cores and the clock speed in GHz) gives a bit more of an idea of efficiency:

Test        Itanium Opteron      Core 2 64bit Core 2 32bit
Variant               SSE No SSE   SSE No SSE   SSE No SSE
With gas      35.25 41.47  44.35 31.20  32.64 55.39  44.83
Without gas   30.62 36.11  37.84 33.41  32.16 49.25  54.62

When MADDs are the bulk of the work (no gas, so basic RKF integration) the Itanium comes out on top by a hair. The Core 2, or more likely the MSVC compiler, really struggles in 32-bit mode, though it can come close to the Itanium in 64-bit mode. SSE has only a minimal (in fact, negative for the Core 2 in 64-bit mode) effect here, since most of the code doesn't use it. When the gas interactions are added, the Itanium drops back behind the 64-bit Core 2 but still comes in ahead of the 32-bit Opteron. It would have been interesting to see how the Opteron did in 64-bit mode.


So for these two (admittedly rather limited in scope) tests, the Itanium is relatively competitive on a clock-for-clock basis with a Core 2 in 64-bit mode in floating-point-dominated tests where the integration kernel hasn't been vectorized using SSE. Once the workload becomes a bit more branchy and integer-heavy, it drops behind slightly. In situations that have had a lot of SSE tuning, though, such as the PRNG code, the Core 2 really dominates.

Of course, clock for clock doesn't help all that much when the top-end Core 2 is running about twice as fast as the top-end Itanium, and is much cheaper. And this is the basic problem for the Itanium - the top speed bin of the Itanium has only gone up 166 MHz (11%) since June 2003, and core IPC hasn't gone up much either. All that's changed from a performance point of view is that there's more cache, a faster bus, and more cores per socket. This obviously has some benefit to more memory-hungry software, but you have to wonder how well a Nehalem would do if you gave it a similar amount of cache.


The main thing I've seen going for the Itanium in HPC is SGI's NUMALink. A colleague of mine is developing some quantum mechanics simulation stuff, and scaling on the ANU Altix is great. Scaling on a Woodcrest Xeon cluster using Infiniband ... poor to the point of almost not worth going outside a single node. Hopefully, with the Nehalem and Tukwila sharing the same socket we might be able to get NUMALinked Nehalems, which would really throw a curveball into the HPC interconnect market.


Cheers,
Michael
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
