Hi Ellis,

First of all, most big clusters are a really expensive form of computing.
Hardware becomes outdated really rapidly.

Most companies sell products; they hardly do any research.

Research is almost exclusively the government's domain, at least in most European states. If a company does carry out research, it is usually paid for to a large degree by subsidy.

For example, I was recently at an organisation of the treasury department: www.senternovum.nl

They've got big budgets (considering how tiny the nation is over here) to subsidize
new initiatives. For the largest part that money goes to BIG companies.

So the majority of clusters are truly generic clusters, where memory is obviously extremely
important.

Now I'm sure you'd like to hear how a few companies that DO need big crunching power
are doing it, but I'm not sure they would all like to see it posted here.

Sometimes the mere realization that you can put big calculation power into action in a specific area, where before nearly no crunching power was used, is already too much information for competitors.

The other part of the clusters is of course big military terrain. They search for the holy grail. I'm of course in the area of public holy grail search in games (computer chess), though the difference between the mathematical holy grail searchers and game tree search is not that big, seen from an absolute viewpoint.

When searching for a holy grail, you can of course tolerate more errors, as it is about finding
that lucky shot, or getting close to it despite all kinds of errors.

The reason why I can tolerate errors in RAM a little bit more than others
is that I already store a CRC in the hashtables of Diep.

You might call that paranoia, but that CRC checking REALLY is important.

In the worst case the CRC error might also come from different cores writing to the same RAM at once, and of course it is TOO EXPENSIVE to have a lock. You can avoid the lock by computing the check with a simple XOR; that's way faster and catches a single bitflip easily.
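
A minimal sketch of that idea, written in Python for readability. This is not Diep's actual code: the entry layout (one 64-bit key, one 64-bit data word, one 32-bit check) and the names fold32, store and probe are assumptions made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def fold32(*words64):
    """XOR all 32-bit halves of the given 64-bit words into one 32-bit check."""
    check = 0
    for w in words64:
        check ^= (w & MASK32) ^ ((w >> 32) & MASK32)
    return check

def store(table, key, data):
    """Write key, data and their XOR check into the slot, without any lock."""
    table[key % len(table)] = [key, data, fold32(key, data)]

def probe(table, key):
    """Return the stored data, or None if the slot is empty, foreign or corrupt."""
    entry = table[key % len(table)]
    if entry is None:
        return None
    stored_key, data, check = entry
    if stored_key != key or fold32(stored_key, data) != check:
        return None  # treat a damaged or half-written entry as a plain miss
    return data

table = [None] * 1024
key = 0x1234_5678_9ABC_DEF0
store(table, key, 0x0042_0007)
assert probe(table, key) == 0x0042_0007

# Simulate a single bitflip in the stored data word (bit 13).
table[key % len(table)][1] ^= 1 << 13
assert probe(table, key) is None  # the check no longer matches: entry is ignored
```

A mismatch is simply handled as a hash miss, so the search recomputes the position instead of trusting the entry. The sketch is single-threaded; in a real engine the same comparison would also reject most entries where two cores each wrote half of the slot.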

If I had 2 bitflips at a 32-bit interval at the same time, now *that* would be nasty,
as XOR doesn't detect that.
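
To make that blind spot concrete, here is a small Python sketch. It assumes the check is a plain XOR of the 32-bit halves of a 64-bit word (my reading of "32 bits interval"); the name fold32 and the test value are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def fold32(word64):
    """XOR the low and high 32-bit halves of a 64-bit word."""
    return (word64 & MASK32) ^ ((word64 >> 32) & MASK32)

original = 0x0123_4567_89AB_CDEF
check = fold32(original)

one_flip = original ^ (1 << 5)               # a single flipped bit
two_flips = original ^ (1 << 5) ^ (1 << 37)  # same bit position, 32 bits apart

assert fold32(one_flip) != check   # a single flip changes the check: detected
assert fold32(two_flips) == check  # the two flips cancel in the XOR: undetected
```

Any pair of flips that lands on the same bit position of two folded 32-bit words cancels this way; a real CRC would catch it, at a higher cost per probe.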

So basically, for what usually runs within L1 and L2 I definitely can't tolerate errors; it would crash the application in many cases. Within the RAM that stores the hashtables, you can to some
extent recover from errors.

For the holy grail searchers, there are 2 different areas of course.

One area is the real number crunching type, where everything is embarrassingly parallel. There are usually no big RAM requirements, so everything runs exclusively within L1/L2. These guys really are power hungry, and most of them won't have anything you would even be able to call a cluster. It's just some sort of specialized monster machine with specially programmed
or specially designed hardware.

If you effectively calculate the number of gflops per dollar that these guys get with today's GPUs, that's of course really cheap compared to the classical definition of a cluster.

Then again, all these classical clusters simply have an ECC requirement.

Do you know 1 researcher who reruns an application, and also gets funding granted to do the research a second time, in order to check whether some calculation mistake of the hardware messed things up?

Not at all. There are good examples of a round-off error produced by old clusters that gave a deviation from existing quantum mechanics, which was then explained by adding a new theory to it.

Of course 30 years later that got refuted by someone who FINALLY did do the recalculation in a correct manner, didn't get that round-off error, and concluded that the result he saw was, for a change, a CORRECT result, and that the error the others had put into quantum mechanics was caused by the amateurism of a whole generation
of researchers, most of them seen by society as really clever.

You can easily fool even the best researcher with hardware, simply because hardware is not necessarily his expertise, nor does he
expect it to make a mistake.

If you're gonna calculate on hundreds of cores, you sure get some bitflips in RAM.

ECC is a requirement then.
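
The scaling argument can be sketched in a few lines of Python. Everything here is an assumption for illustration: flips are modelled as independent events (a Poisson model), and the rate of 1e-6 flips per GB per hour is a made-up placeholder, not a measured figure. Plug in a rate you trust for your own hardware.

```python
import math

def expected_flips(flips_per_gb_hour, total_gb, hours):
    """Expected number of bitflips, assuming a constant rate per GB per hour."""
    return flips_per_gb_hour * total_gb * hours

def prob_at_least_one(flips_per_gb_hour, total_gb, hours):
    """Chance of one or more flips, assuming independent (Poisson) events."""
    return 1.0 - math.exp(-expected_flips(flips_per_gb_hour, total_gb, hours))

PLACEHOLDER_RATE = 1e-6   # hypothetical flips per GB per hour, NOT a measurement
MONTH_HOURS = 24 * 30

single_box = prob_at_least_one(PLACEHOLDER_RATE, 4, MONTH_HOURS)    # one 4 GB machine
cluster = prob_at_least_one(PLACEHOLDER_RATE, 800, MONTH_HOURS)     # 200 such machines

# Whatever the real rate is, the cluster is far more exposed than the single box.
assert cluster > single_box
```

The absolute numbers mean nothing with a placeholder rate; the point is only that the exposure grows with the total amount of RAM and with the run time.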

They don't have 15 years of time, like I had for my chess software, to build in a CRC check themselves; I had to,
as of course the majority of users' machines don't have ECC memory.

Vincent

On Apr 3, 2009, at 6:05 AM, Ellis Wilson wrote:


Vincent Diepeveen wrote:
Bill,

the ONLY price that matters is that of ECC ram when posting in a cluster
group.


If there is 1 commission that EVER puts a signature underneath a
production cluster
without ECC ram using x86 processors (gpu's is yet another new thing
that is interesting
to discuss), then please inform me, as they qualify for a full and
thorough investigation
by a range of shrinks and psychologists, on how group behaviour could
lead to such a
total unqualified and naive and total wrong decision; resulting of
course in the direct
firing of the entire commission and decommissioning them to north part of
Norway where they can count the number of iceblocks they see afloat,
this for the rest of
their life until retirement age.

So in short i can completely ignore your posting.

ECC is a requirement, not a luxury.

Though entertainingly put, it would be an error to say "ECC is a
requirement" for everyone in a "cluster group".  I can think of more
than just a few purposes for clusters that truly do not require the
accuracy guaranteed by ECC RAM.

Actually as far as errors of the grossest nature go, the only truly bad
one to make on this list is to take something that is true for one
sector of clustering and apply it to the whole. Now that's just dumping
oil on the torches.

Ellis









_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
