On Feb 9, 2009, at 11:45 PM, Mark Hahn wrote:

I have been working on Itanium machines for a year now and I actually found the hw pretty elegant, and the dev software stack on top of it (compiler, profiler etc) pretty handy.


aren't all the same tools available on x86_64? or were you referring to, eg,
something SGI-specific?

but now with Tukwila switching to QuickPath, how do you guys think
Itanium will perform in comparison to Xeons and Opterons?

this change would be interesting if it meant that the next-gen numalink box could take nehalems rather than ia64. I can't really understand why Intel has stuck with ia64 this long - perhaps the economy will provide
the fig-leaf necessary to dump it.


Of course there are a few points to address:

a) SMT works a lot better on in-order cores than HT does on out-of-order cores.

On out-of-order cores, prime-number FFT (DWT actually) type software usually seems to profit about 5% from it. Branchy and memory-dependent codes seem, optimistically, able to get around a 10-20% scaling improvement.

The big difference is that on in-order cores (see Power6) you can basically hide the branch misprediction penalties, which is not possible on x64.

This was of course one of the many promises of in-order cores.

b) Intel always ran behind in process technology with Itanium and never managed to clock it very high. IBM didn't make this mistake with Power6: they clocked it to 5 GHz. You can take it for granted that Itanium always lagged in clock speed. You can also ask: "Heh, why am I paying $7500 for a 1.5 GHz Itanium2 processor (at its release, and for quite some period of time, it was priced like that)? Why is it clocked that much lower, and why is it process technologies behind the rest?"

You can say things like: "it needs more verification than cheapo x86 CPUs." However, that excuse is only valid for 6 months and no longer than that. After 6 months it HAS to be on the same process technology as the x86 CPUs. If x86 Xeon/Opteron type CPUs get produced at 65 nm, you can't get away with 90 nm, let alone 130 nm parts. Tukwila is 65 nm or so? All processors that still release in 2009/2010 should be at least 45 nm, as that's the standard now. There is no way a 4-core 65 nm processor can compete with a 45 nm x86 HPC CPU that has 8 cores. One would have a good excuse when speaking about shared-memory machines, but realize that latency plays a crucial role there.

The SGI Altix 3000 series is basically 280 ns shared memory (random 8-byte reads from a big buffer) up to 4 sockets.

So you can compare it with 4-socket machines very well.
Above 4 sockets SGI still has shared memory, but latency instantly worsens to 700 ns, and goes up to 5000-7000 nanoseconds at 62 sockets.

700 ns is also the latency the 'glued together' 8-socket machines of those days had, with 8 P4 Xeon MPs. The reality is simply that the majority of parallel software doesn't work well with such latencies/bandwidth.
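
To see why, here is a back-of-envelope sketch (just arithmetic on the latency figures quoted above) of what a fully latency-bound access pattern, such as pointer chasing where every 8-byte load depends on the previous one, can achieve at those latencies:

```python
# Back-of-envelope: throughput of a fully latency-bound pointer chase,
# one dependent 8-byte load at a time, at the latencies quoted above.
# Purely illustrative arithmetic, not a benchmark.
for ns in (280, 700, 7000):
    loads_per_sec = 1e9 / ns              # one dependent load per 'ns' nanoseconds
    mb_per_sec = loads_per_sec * 8 / 1e6  # 8 bytes per load
    print(f"{ns:5d} ns -> {loads_per_sec / 1e6:5.2f} M loads/s, {mb_per_sec:6.1f} MB/s")
```

Even at 280 ns a dependent chain stays under 30 MB/s, and at 700 ns under 12 MB/s, which is exactly why most parallel software struggles once it crosses the 4-socket boundary.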

c) A 4-core Tukwila at 2.4 GHz or so? For integer codes it is of course a factor 4 or so slower than the upcoming 8-core Xeon MP at 3.4 GHz or so (just guessing).

d) The Itanium2 core was simply a bad performer thanks to a very silly design: no instructions stored in the L2 cache, yet it was equipped with an ultra-tiny L1 instruction cache of 32 KB. This while instruction size is huge: it uses bundles of 2 x 3 instructions, so 6 in total. Additionally, it is those instruction bundles that were supposed to replace branch mispredicts in a clever manner. That means you really need to execute a LOT of code, and therefore have a big need for a HUGE L1i cache. Opteron and Core2/i7 totally own Itanium2 there on many software programs.
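
A rough capacity comparison makes the point. IA-64 packs 3 instructions into a 128-bit (16-byte) bundle; the ~4-byte average x86 instruction length used below is an assumption for illustration only:

```python
# Back-of-envelope: how many instructions fit in a 32 KB L1 instruction cache.
# IA-64: 3 instructions per 128-bit (16-byte) bundle (from the architecture).
# x86: an average instruction length of about 4 bytes is ASSUMED here
# purely for illustration; real averages vary by code.
L1I_BYTES = 32 * 1024
ia64_insns = L1I_BYTES // 16 * 3   # 6144 instructions
x86_insns = L1I_BYTES // 4         # 8192 instructions
print(ia64_insns, x86_insns)
```

So even at the same 32 KB, the fat encoding holds noticeably fewer instructions, and the design leaned on executing lots of code from that small cache.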

e) Itanium could only be bought from companies that really overcharged for the hardware. Now we can forget about the power such parts eat, as power is really cheap for big companies that consume a lot: up to a factor 20 less than what you pay at home (and if your government building is paying a 'normal' price, now is the time to start negotiating).

f) Itanium2 totally focused on floating point, yet that is about the least interesting thing for such hardware to do; there have always been cheaper floating-point solutions than Itanium2. Let's forget about the disaster called Itanium-1. A wrong focus: integer speed is simply what matters for such expensive CPUs. This is why IBM could sell Power6.

g) Itanium2 had just 2 integer units, while the total bundle is 6 instructions wide. Power6 makes the same mistake if you ask me, yet it can at least make up for some of it with its 5 GHz frequency. For many software programs the x86 CPUs totally finish off both processors there in a skilled manner.

h) I ran a few things on a brand-new 1.3 GHz Itanium2 at the time and was amazed that a 2.2 GHz Opteron (already on the market for months) was a factor 3 faster on some simplistic programs like random number generators.

It appeared that the instruction set of Itanium2 was simply missing a lot of simple instructions. Rotate, for example. Now excuse me, cryptographically speaking I'm no big hero, as frankly I know nothing about cryptography. Yet some instructions get used EVERYWHERE. We'll excuse it for not having a division instruction.
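
For reference, this is the kind of instruction in question: without a hardware rotate, a 64-bit rotate-left has to be synthesized from two shifts and an OR. A minimal sketch in Python, masking to 64 bits to mimic a register:

```python
MASK64 = (1 << 64) - 1  # mimic a 64-bit register

def rotl64(x, r):
    """64-bit rotate left emulated with two shifts and an OR."""
    r &= 63
    return ((x << r) | (x >> ((64 - r) & 63))) & MASK64

# The wrapped-around top byte reappears at the bottom:
assert rotl64(0x0123456789ABCDEF, 8) == 0x23456789ABCDEF01
assert rotl64(1 << 63, 1) == 1
```

In hot loops (hashes, PRNGs, crypto rounds) paying three instructions instead of one for this adds up quickly.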

Now what most scientists do not realize is that the FFT, something most of them use even if they don't know the theory behind it (who am I to claim so), gives nonstop rounding errors in floating point. In short, integer FFTs are really important to prove something correct. In fact those would run faster if the hardware offered more support. So just put in 2 integer multiplication units, and soon all FFTs get rewritten to integer and run faster and with LESS error. In fact with no error at all, so no need for error backtracking. The worst case that can happen at any time in floating point simply isn't there with integers.
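
A small self-contained demonstration of the rounding issue, using a textbook complex FFT to convolve two integer sequences (a sketch, not production FFT code): the floating-point results come back only *nearly* integral and must be rounded, while a direct integer convolution is exact by construction.

```python
import cmath

def fft(a, invert=False):
    """Textbook recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def convolve_float(a, b):
    """Convolution via floating-point FFT: results carry rounding error."""
    n = 1
    while n < len(a) + len(b):
        n *= 2
    fa = fft([complex(x) for x in a] + [0j] * (n - len(a)))
    fb = fft([complex(x) for x in b] + [0j] * (n - len(b)))
    res = fft([x * y for x, y in zip(fa, fb)], invert=True)
    return [v.real / n for v in res][: len(a) + len(b) - 1]

def convolve_exact(a, b):
    """Plain integer convolution: exact, never any rounding error."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

a = [12345, 67890, 13579, 24680]
b = [98765, 43210, 86420, 97531]
exact = convolve_exact(a, b)
approx = convolve_float(a, b)
# The float results have to be rounded back to recover the true integers:
assert [round(v) for v in approx] == exact
```

The float path works here only because the coefficients are small enough that the accumulated error stays below 0.5; an integer transform (e.g. a number-theoretic transform) has no such ceiling, which is the point being made.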

Yet Itanium2 does not have much support for 64-bit integers. It doesn't have an instruction to multiply a * b and get the 128-bit precise answer back.
To do that you have to simulate it with floating-point instructions.

That sucks, let's use polite wording here.

Of course when buying a machine, no one *gets the idea* to test this.
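
What the missing instruction has to be replaced with looks like this: a 64x64 -> 128-bit multiply decomposed into four 32x32 -> 64-bit partial products. Sketched here in Python with explicit masks to mimic 64-bit registers (Python integers are unbounded, so the result can be checked against exact arithmetic):

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def mul64x64_128(a, b):
    """Unsigned 64x64 -> 128-bit multiply from four 32x32 -> 64-bit
    partial products. Masks mimic 64-bit registers."""
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    p0 = a_lo * b_lo
    p1 = a_lo * b_hi
    p2 = a_hi * b_lo
    p3 = a_hi * b_hi
    # Carry-propagate the middle column into the low and high halves.
    mid = (p0 >> 32) + (p1 & MASK32) + (p2 & MASK32)
    lo = ((p0 & MASK32) | (mid << 32)) & MASK64
    hi = (p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32)) & MASK64
    return hi, lo

a = b = MASK64                      # worst case: (2**64 - 1)**2
hi, lo = mul64x64_128(a, b)
assert (hi << 64) | lo == a * b     # matches exact arithmetic
```

Four multiplies plus adds and shifts where a single widening-multiply instruction would do: that is the overhead being complained about.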

A more general comment on the above:

You can DEFINITELY blame most HPC guys for focusing too much on floating point. I get the impression they don't realize that a double-precision float packs FEWER bits than a 64-bit integer for an FFT, and that with integers there is no error.
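
The bit-packing point is easy to verify: an IEEE-754 double has a 53-bit significand, so above 2**53 it can no longer represent every integer, while a 64-bit integer keeps all 64 bits exactly.

```python
# An IEEE-754 double has a 53-bit significand: beyond 2**53 consecutive
# integers stop being representable, while 64-bit integers stay exact.
big = 2 ** 53
assert float(big) + 1.0 == float(big)   # 2**53 + 1 rounds back to 2**53
assert big + 1 != big                   # integer arithmetic keeps the bit
```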

These HPC guys are really good at creating their own problems. Itanium is a clear demonstration of how much HPC guys mess up, and of how easily these nerds are manipulated by some clever marketing guys.

So I guess Intel is doing what you can expect from a hardware company. For years it tries to get away with the cheapest possible CPU design ever for HPC, and grabs money out of the market with it, until the last professor who still believes the 1990s and early-21st-century propaganda has retired from the hardware commission and gets replaced by a ruthless commissioner who gives the contract in an open bid without the stupid nerd demands (demands that in advance hand the contract to just 1 company, which can then of course ask any price).

Everything relevant has already been running on x86 CPUs for years now; even ISPs and telecommunications do by now. That will only increase. With respect to HPC, Itanium was never faster than x86 processors: that is the problem. In floating point, for the price of 1 Itanium quad-socket node you could of course fill a sports hall with x64s, and in integer the x86 CPUs were just a lot faster. Intel fooled everyone there with good marketing.

i) See what is happening now. There is quite some rumour around Tukwila; in the meantime, of course, the 8-core Xeon MP is going to kick butt and get good sales, AND THEREFORE OF COURSE INTEL'S PRIORITY EVERYWHERE.

This is something you can explain to a nerd forever, and he still will not understand it. Nor will the majority of this list.

If intel has a bunch of cpu's:
    i7's / Nehalems versus Itanium versus Larrabee

Guess what has the LOWEST priority then?

It is obvious that Itanium never got priority internally at Intel. It is questionable whether something like a Larrabee version for HPC is a good idea to give serious attention, knowing that the Xeon platform will get the bigger priority in the end anyway. There is so much more money at stake in the Xeon platform.

Want support for your Itanium CPU? Be glad if outsourced first-line support helps you at all. You ask X questions and at every question the answer is:
   "Yes, yes, yes, yes, yes, yes, yes."
But never the answer you want to have.

That is very bad for HPC guys who are used to service and have big technological demands.

j) For years HPC used 64-bit processors while x86 was 32-bit. That has changed, however. Though 32-bit code still runs a lot faster and should run faster, the tiny processors are 64-bit now as well. So that already loses a lot of the market to the always-cheaper x64 processors.

k) Everything is massively parallel nowadays, so for a lot of different software it doesn't matter whether you use 1 big fast processor or 10 cheap low-power processors that together are just as fast; this under the condition that your network somehow solves the bandwidth problem, and provided that it is cheaper to produce those 10 tiny processors than 1 big fast processor. In short, for really a lot of software there is now always the option of using tiny processors in a big cluster, as long as that software scales somehow.

l) The biggest problem of Itanium by far is, will be, and was: you don't let an F1 driver toy with just 2 bicycles; he needs an F1 car to drive.

Any special HPC processor that cannot compete on price with an x86 processor faces the hard condition that it has to be a lot faster.

Vincent

(why am I down on ia64? mainly the sense of unfulfilled promise: the ISA was supposed to provide some real advantage, and afaict never has. the VLIW-ish ISA was intended to avoid the clock-scaling problems created by CISC decode and OOO, no? but ia64 seems to have distinguished itself only by relatively large caches and by offering cache-coherency hooks to SGI. have other people had the experience of ia64 doing OK on code with regular/unrollable/prefetchable data patterns, but poorly otherwise?)

regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
