Re: [Beowulf] X5500

Vincent Diepeveen Fri, 03 Apr 2009 14:16:58 -0700

Hi Ellis,

First of all, most big clusters are really expensive forms of computing.
Hardware becomes outdated really rapidly.


Most companies sell products; they hardly do research.

Research is exclusively government domain, at least in most European states. If a company is carrying out research, it usually gets paid to a large degree by subsidy.

Recently, for example, I was at a treasury department organisation: www.senternovum.nl

They've got big budgets (considering how tiny the nation is over here) to subsidize new initiatives. That for the largest part goes to BIG companies.

So the majority of clusters are real generic clusters where obviously memory is extremely important.

Now I'm sure you'd like to hear how a few companies that DO need big crunching power are doing it, but I'm not sure they'd all like to see it posted here.

Sometimes the realization that you can put big calculation power into action for a specific area, where before nearly no crunching power was used, is already too much information for competitors.

The other part of the clusters is of course big military terrain. They search for the holy grail. I'm of course in the area of public holy grail search in games (computer chess), though the difference between the mathematical holy grail searchers and game tree search is not that much, seen from an absolute viewpoint.

When searching for a holy grail, you can of course tolerate more errors, as it is about finding that lucky shot, or approaching it with all kinds of errors.

The reason why I can tolerate errors in RAM a little bit more than others is because I already store a CRC in the hashtables of Diep.

You might call that paranoia, but that CRC checking REALLY is important.

In the worst case the CRC error might also come from different cores writing to the same RAM, and of course it is TOO EXPENSIVE to have a lock. You can save that cost by storing the CRC as a simple XOR; that's way faster and really catches a single bitflip easily.

If I had 2 bitflips at a 32-bit interval at the same time, now *that* would be nasty, as XOR doesn't detect that.
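
To make the XOR idea concrete, here is a minimal sketch in Python. It assumes an entry is a handful of 32-bit words and that the check is simply the XOR of those words stored next to them. The real layout of Diep's hashtable entries is not given in this post, so the names `xor_fold32`, `store`, `probe` and `flip_bit` are hypothetical and only illustrate the principle.

```python
MASK32 = 0xFFFFFFFF  # entries are treated as lists of 32-bit words (assumption)


def xor_fold32(words):
    """XOR all 32-bit words of an entry together into one 32-bit check word."""
    check = 0
    for w in words:
        check ^= w & MASK32
    return check


def store(table, index, words):
    """Write the entry plus its XOR check word; no lock is taken."""
    table[index] = list(words) + [xor_fold32(words)]


def probe(table, index):
    """Return the entry's words, or None if the slot is empty or fails the check."""
    slot = table[index]
    if slot is None:
        return None
    *words, check = slot
    return words if xor_fold32(words) == check else None


def flip_bit(table, index, bit):
    """Simulate a RAM bitflip: bit 0 is the lowest bit of word 0, bit 32 of word 1."""
    word, offset = divmod(bit, 32)
    table[index][word] ^= 1 << offset


table = [None] * 4
store(table, 0, [0xDEADBEEF, 0x12345678, 0x0BADF00D])
ok = probe(table, 0)          # intact entry passes the check
flip_bit(table, 0, 5)
one_flip = probe(table, 0)    # a single flipped bit makes the check fail
flip_bit(table, 0, 5 + 32)
two_flips = probe(table, 0)   # a second flip 32 bits further on cancels out
```

A single flip changes exactly one bit of the folded XOR, so the entry is rejected and the search just treats it as a miss. Two flips at the same bit position of two different words, which is what "32 bits interval" means in this layout, cancel each other in the XOR, so the corrupted entry passes the check. The same check also rejects an entry that two cores half-overwrote at the same moment, which is why no lock is needed.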

So basically, for what usually runs within L1 and L2 I definitely can't tolerate errors; it would crash the application in many cases. Within the RAM that stores hashtables, you can to some extent recover from errors.

For the holy grail searchers, there are 2 different areas of course.

One area is the real number crunching types where everything is embarrassingly parallel. The RAM requirements are usually not too big, so everything runs exclusively within L1/L2. These guys really are power hungry and most of them won't have what you would even be able to call a cluster. It's just some sort of specialized monster machine with specially programmed or specially designed hardware.

If you calculate the effective number of gflops per dollar that these guys get with today's GPUs, that's of course really cheap compared to the classical definition of a cluster.


Yet again, all these classical clusters simply have an ECC requirement.

Do you know 1 researcher who reruns an application, and also gets a grant to do the research a second time, in order to check whether some calculation mistake of the hardware messed things up?

Not at all. There are good examples of some round-off error produced by old clusters that gave some difference from existing quantum mechanics, which was then explained by adding a new theory to it.

Of course 30 years later this was refuted by someone who FINALLY did do a recalculation in a correct manner, didn't get that round-off error, and concluded that the result he actually saw was for a change a CORRECT result, and that the error the others had in the quantum mechanics theory was caused by amateurism of a whole generation of researchers, most of them seen by society as really clever.

You can easily fool even the best researcher with hardware, simply because that's not necessarily his expertise, nor his expectation that it makes a mistake.

If you're gonna calculate on hundreds of cores, you sure get some bitflips in RAM.


ECC is a requirement then.

They don't have 15 years of time like I had for my chess software to build in a CRC check myself, as of course the majority of users' machines don't have ECC memory.

Vincent

On Apr 3, 2009, at 6:05 AM, Ellis Wilson wrote:


Vincent Diepeveen wrote:

Bill,

the ONLY price that matters is that of ECC ram when posting in a cluster
group.


If there is 1 commission that EVER puts a signature underneath a
production cluster without ECC ram using x86 processors (gpu's is yet
another new thing that is interesting to discuss), then please inform
me, as they qualify for a full and thorough investigation by a range of
shrinks and psychologists, on how group behaviour could lead to such a
total unqualified and naive and total wrong decision; resulting of
course in the direct firing of the entire commission and decommissioning
them to the north part of Norway where they can count the number of
iceblocks they see afloat, this for the rest of their life until
retirement age.

So in short I can completely ignore your posting.

ECC is a requirement, not a luxury.


Though entertainingly put, it would be an error to say "ECC is a
requirement" for everyone in a "cluster group".  I can think of more
than just a few purposes for clusters that truly do not require the
accuracy guaranteed by ECC RAM.

Actually as far as errors of the grossest nature go, the only truly bad
one to make on this list is to take something that is true for one
sector of clustering and apply it to the whole. Now that's just dumping
oil on the torches.

Ellis


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
