On Feb 9, 2009, at 11:45 PM, Mark Hahn wrote:

I have been working on Itanium machines for a year now and I actually found the hw pretty elegant, and the dev software stack on top of it (compiler, profiler etc) pretty handy.


aren't all the same tools available on x86_64? or were you referring to, eg,
something SGI-specific?

but now with Tukwila switching to QuickPath, how do you guys think
Itanium will perform in comparison to Xeons and Opterons?

this change would be interesting if it meant that the next-gen numalink box could take nehalems rather than ia64. I can't really understand why Intel has stuck with ia64 this long - perhaps the economy will provide
the fig-leaf necessary to dump it.


Of course there are a few points to address:

a) SMT works a lot better on in-order cores than HT does on out-of-order cores.

On out-of-order cores, prime-number FFT (DWT actually) type software usually seems to profit about 5% from it. Branchy and memory-dependent codes seem, optimistically, able to get around a 10-20% scaling improvement.

The big difference is that on in-order cores (see Power6) you can basically hide the branch misprediction penalties, which is not possible on x64.

This was of course one of the many promises of in-order cores.

b) Intel always ran behind in process technology with Itanium and never managed to clock it very high. IBM didn't make this mistake with Power6: they clocked it to 5 GHz. You can take it for granted that Itanium always lagged in clock speed. You can also ask: "Heh, why am I paying $7500 for a 1.5 GHz Itanium2 processor (at its release, and for quite some period of time, it was priced like that)? Why is it clocked that much lower, and why is it process technologies behind the rest?"

You can say things like: "it needs more verification than cheapo x86 CPUs." However, that excuse is only valid for 6 months and no longer than that. After 6 months it HAS to be on the same process technology as the x86 CPUs. If x86 Xeon/Opteron type CPUs get produced at 65 nm, you can't get away with 90 nm, let alone 130 nm parts. Tukwila is 65 nm or so? All processors that still release in 2009/2010 should be at least 45 nm, as that's the standard now. There is no way a 4-core 65 nm processor can compete with a 45 nm x86 HPC CPU that has 8 cores. One would have a good excuse when speaking about shared-memory machines, but realize that latency plays a crucial role there.

The SGI Altix 3000 series is basically 280 ns shared memory (random 8-byte reads from a big buffer) up to 4 sockets.

So you can compare it with 4-socket machines very well.
Above 4 sockets SGI still has shared memory, but latency instantly worsens to 700 ns, and goes up to 5000-7000 nanoseconds at 62 sockets.

700 ns is also the latency the 'glued together' 8-socket machines of those days had, with 8 P4 Xeon MPs. The reality is simply that the majority of parallel software doesn't work well with such latencies/bandwidth.
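
To see why, here is a back-of-envelope sketch (just arithmetic on the latency figures quoted above) of what a fully latency-bound access pattern, such as pointer chasing where every 8-byte load depends on the previous one, can achieve at those latencies:

```python
# Back-of-envelope: throughput of a fully latency-bound pointer chase,
# one dependent 8-byte load at a time, at the latencies quoted above.
# Purely illustrative arithmetic, not a benchmark.
for ns in (280, 700, 7000):
    loads_per_sec = 1e9 / ns              # one dependent load per 'ns' nanoseconds
    mb_per_sec = loads_per_sec * 8 / 1e6  # 8 bytes per load
    print(f"{ns:5d} ns -> {loads_per_sec / 1e6:5.2f} M loads/s, {mb_per_sec:6.1f} MB/s")
```

Even at 280 ns a dependent chain stays under 30 MB/s, and at 700 ns under 12 MB/s, which is exactly why most parallel software struggles once it crosses the 4-socket boundary.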

c) A 4-core Tukwila at 2.4 GHz or so? For integer codes it is of course a factor 4 or so slower than the upcoming 8-core Xeon MP at 3.4 GHz or so (just guessing).

d) The Itanium2 core was simply a bad performer thanks to a very silly design: no instructions stored in the L2 cache, yet it was equipped with an ultra-tiny L1 instruction cache of 32 KB. This while instruction size is huge: it uses bundles of 2 x 3 instructions, so 6 in total. Additionally, it is those instruction bundles that were supposed to replace branch mispredicts in a clever manner. That means you really need to execute a LOT of code, and therefore have a big need for a HUGE L1i cache. Opteron and Core2/i7 totally own Itanium2 there on many software programs.
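
A rough capacity comparison makes the point. IA-64 packs 3 instructions into a 128-bit (16-byte) bundle; the ~4-byte average x86 instruction length used below is an assumption for illustration only:

```python
# Back-of-envelope: how many instructions fit in a 32 KB L1 instruction cache.
# IA-64: 3 instructions per 128-bit (16-byte) bundle (from the architecture).
# x86: an average instruction length of about 4 bytes is ASSUMED here
# purely for illustration; real averages vary by code.
L1I_BYTES = 32 * 1024
ia64_insns = L1I_BYTES // 16 * 3   # 6144 instructions
x86_insns = L1I_BYTES // 4         # 8192 instructions
print(ia64_insns, x86_insns)
```

So even at the same 32 KB, the fat encoding holds noticeably fewer instructions, and the design leaned on executing lots of code from that small cache.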

e) Itanium could only be bought from companies that really overcharged for the hardware. Now we can forget about the power such parts eat, as power is really cheap for big companies that consume a lot: up to a factor 20 less than what you pay at home (and if your government building is paying a 'normal' price, now is the time to start negotiating).

f) Itanium2 totally focused on floating point, yet that is about the least interesting thing for such hardware to do; there have always been cheaper floating-point solutions than Itanium2. Let's forget about the disaster called Itanium-1. A wrong focus: integer speed is simply what matters for such expensive CPUs. This is why IBM could sell Power6.

g) Itanium2 had just 2 integer units, while the total bundle is 6 instructions wide. Power6 makes the same mistake if you ask me, yet it can at least make up for some of it with its 5 GHz frequency. For many software programs the x86 CPUs totally finish off both processors there in a skilled manner.

h) I ran a few things on a brand-new 1.3 GHz Itanium2 at the time and was amazed that a 2.2 GHz Opteron (already on the market for months) was a factor 3 faster on some simplistic programs like random number generators.

It appeared that the instruction set of Itanium2 was simply missing a lot of simple instructions. Rotate, for example. Now excuse me, cryptographically speaking I'm no big hero, as frankly I know nothing about cryptography. Yet some instructions get used EVERYWHERE. We'll excuse it for not having a division instruction.
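
For reference, this is the kind of instruction in question: without a hardware rotate, a 64-bit rotate-left has to be synthesized from two shifts and an OR. A minimal sketch in Python, masking to 64 bits to mimic a register:

```python
MASK64 = (1 << 64) - 1  # mimic a 64-bit register

def rotl64(x, r):
    """64-bit rotate left emulated with two shifts and an OR."""
    r &= 63
    return ((x << r) | (x >> ((64 - r) & 63))) & MASK64

# The wrapped-around top byte reappears at the bottom:
assert rotl64(0x0123456789ABCDEF, 8) == 0x23456789ABCDEF01
assert rotl64(1 << 63, 1) == 1
```

In hot loops (hashes, PRNGs, crypto rounds) paying three instructions instead of one for this adds up quickly.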

Now what most scientists do not realize is that the FFT, something most of them use even if they don't know the theory behind it (who am I to claim so), gives nonstop rounding errors in floating point. In short, integer FFTs are really important to prove something correct. In fact those would run faster if the hardware offered more support. So just put in 2 integer multiplication units, and soon all FFTs get rewritten to integer and run faster and with LESS error. In fact with no error at all, so no need for error backtracking. The worst case that can happen at any time in floating point simply isn't there with integers.
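
A small self-contained demonstration of the rounding issue, using a textbook complex FFT to convolve two integer sequences (a sketch, not production FFT code): the floating-point results come back only *nearly* integral and must be rounded, while a direct integer convolution is exact by construction.

```python
import cmath

def fft(a, invert=False):
    """Textbook recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def convolve_float(a, b):
    """Convolution via floating-point FFT: results carry rounding error."""
    n = 1
    while n < len(a) + len(b):
        n *= 2
    fa = fft([complex(x) for x in a] + [0j] * (n - len(a)))
    fb = fft([complex(x) for x in b] + [0j] * (n - len(b)))
    res = fft([x * y for x, y in zip(fa, fb)], invert=True)
    return [v.real / n for v in res][: len(a) + len(b) - 1]

def convolve_exact(a, b):
    """Plain integer convolution: exact, never any rounding error."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

a = [12345, 67890, 13579, 24680]
b = [98765, 43210, 86420, 97531]
exact = convolve_exact(a, b)
approx = convolve_float(a, b)
# The float results have to be rounded back to recover the true integers:
assert [round(v) for v in approx] == exact
```

The float path works here only because the coefficients are small enough that the accumulated error stays below 0.5; an integer transform (e.g. a number-theoretic transform) has no such ceiling, which is the point being made.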

Yet Itanium2 does not have much support for 64-bit integers. It doesn't have an instruction to multiply a * b and get the 128-bit precise answer back.
To do that you have to simulate it with floating-point instructions.

That sucks, let's use polite wording here.

Of course when buying a machine, no one *gets the idea* to test this.
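
What the missing instruction has to be replaced with looks like this: a 64x64 -> 128-bit multiply decomposed into four 32x32 -> 64-bit partial products. Sketched here in Python with explicit masks to mimic 64-bit registers (Python integers are unbounded, so the result can be checked against exact arithmetic):

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def mul64x64_128(a, b):
    """Unsigned 64x64 -> 128-bit multiply from four 32x32 -> 64-bit
    partial products. Masks mimic 64-bit registers."""
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    p0 = a_lo * b_lo
    p1 = a_lo * b_hi
    p2 = a_hi * b_lo
    p3 = a_hi * b_hi
    # Carry-propagate the middle column into the low and high halves.
    mid = (p0 >> 32) + (p1 & MASK32) + (p2 & MASK32)
    lo = ((p0 & MASK32) | (mid << 32)) & MASK64
    hi = (p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32)) & MASK64
    return hi, lo

a = b = MASK64                      # worst case: (2**64 - 1)**2
hi, lo = mul64x64_128(a, b)
assert (hi << 64) | lo == a * b     # matches exact arithmetic
```

Four multiplies plus adds and shifts where a single widening-multiply instruction would do: that is the overhead being complained about.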

A more general comment on the above:

You can DEFINITELY blame most HPC guys for focusing too much on floating point. I get the impression they don't realize that a double-precision float packs FEWER bits than a 64-bit integer for an FFT, and that with integers there is no error.
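
The bit-packing point is easy to verify: an IEEE-754 double has a 53-bit significand, so above 2**53 it can no longer represent every integer, while a 64-bit integer keeps all 64 bits exactly.

```python
# An IEEE-754 double has a 53-bit significand: beyond 2**53 consecutive
# integers stop being representable, while 64-bit integers stay exact.
big = 2 ** 53
assert float(big) + 1.0 == float(big)   # 2**53 + 1 rounds back to 2**53
assert big + 1 != big                   # integer arithmetic keeps the bit
```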

These HPC guys are really good at creating their own problems. Itanium is a clear demonstration of how much HPC guys mess up, and of how easily these nerds are manipulated by some clever marketing guys.

So I guess Intel is doing what you can expect from a hardware company. For years it tries to get away with the cheapest possible CPU design ever for HPC, and grabs money out of the market with it, until the last professor who still believes the 1990s and early-21st-century propaganda has retired from the hardware commission and gets replaced by a ruthless commissioner who gives the contract in an open bid without the stupid nerd demands (demands that in advance hand the contract to just 1 company, which can then of course ask any price).

Everything relevant has already been running on x86 CPUs for years now; even ISPs and telecommunications do by now. That will only increase. With respect to HPC, Itanium was never faster than x86 processors: that is the problem. In floating point, for the price of 1 Itanium quad-socket node you could of course fill a sports hall with x64s, and in integer the x86 CPUs were just a lot faster. Intel fooled everyone there with good marketing.

i) See what is happening now. There is quite some rumour around Tukwila; in the meantime, of course, the 8-core Xeon MP is going to kick butt and get good sales, AND THEREFORE OF COURSE INTEL'S PRIORITY EVERYWHERE.

This is something you can explain to a nerd forever, and he still will not understand it. Nor will the majority of this list.

If intel has a bunch of cpu's:
    i7's / Nehalems versus Itanium versus Larrabee

Guess what has the LOWEST priority then?

It is obvious that Itanium never got priority internally at Intel. It is questionable whether something like a Larrabee version for HPC is a good idea to give serious attention, knowing that the Xeon platform will get the bigger priority in the end anyway. There is so much more money at stake in the Xeon platform.

Want support for your Itanium CPU? Be glad if outsourced first-line support helps you at all. You ask X questions and at every question the answer is:
   "Yes, yes, yes, yes, yes, yes, yes."
But never the answer you want to have.

That is very bad for HPC guys who are used to service and have big technological demands.

j) For years HPC used 64-bit processors while x86 was 32-bit. That has changed, however. Though 32-bit code still runs a lot faster and should run faster, the tiny processors are 64-bit now as well. So that already loses a lot of the market to the always-cheaper x64 processors.

k) Everything is massively parallel nowadays, so for a lot of different software it doesn't matter whether you use 1 big fast processor or 10 cheap low-power processors that together are just as fast; this under the condition that your network somehow solves the bandwidth problem, and provided that it is cheaper to produce those 10 tiny processors than 1 big fast processor. In short, for really a lot of software there is now always the option of using tiny processors in a big cluster, as long as that software scales somehow.

l) The biggest problem of Itanium by far is, will be, and was: you don't let an F1 driver toy with just 2 bicycles; he needs an F1 car to drive.

Any special HPC processor that cannot compete on price with an x86 processor faces the hard condition that it has to be a lot faster.

Vincent

(why am I down on ia64? mainly the sense of unfulfilled promise: the ISA was supposed to provide some real advantage, and afaict never has. the VLIW-ish ISA was intended to avoid the clock-scaling problems created by CISC decode and OOO, no? but ia64 seems to have distinguished itself only by relatively large caches and by offering cache-coherency hooks to SGI. have other people had the experience of ia64 doing OK on code with regular/unrollable/prefetchable data patterns, but poorly otherwise?)

regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
