Re: [Beowulf] ECC Memory and Job Failures

2009-04-24 Thread John Hearns
2009/4/24 Jason Clinton : > > > At Advanced Clustering, we use this reporting facility in our Breakin > software--we run BLAS-optimized linpack from a RAM filesystem and watch for > EDAC messages. > Good stuff! Clustering - you're doing it right. (Pardon my lolspeak).
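
[Editor's sketch] The message above mentions watching for EDAC messages during a burn-in run but does not say how Breakin does it. As one hedged illustration, the Linux EDAC driver exposes per-memory-controller corrected (`ce_count`) and uncorrected (`ue_count`) error counters under `/sys/devices/system/edac/mc/`. The Python below polls those counters. The function names `read_edac_counts` and `new_errors` are hypothetical, and polling sysfs is an assumption, not necessarily the method Breakin uses.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} for each EDAC memory controller.

    Returns an empty dict when the directory is missing (EDAC driver not
    loaded, or not a Linux host).
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, name in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter unreadable on this controller
        counts[mc.name] = entry
    return counts


def new_errors(before, after):
    """Return controllers whose counters increased between two snapshots."""
    changed = {}
    for mc, now in after.items():
        then = before.get(mc, {})
        delta = {
            key: now[key] - (then.get(key) or 0)
            for key in ("ce", "ue")
            if now.get(key) is not None
        }
        if any(value > 0 for value in delta.values()):
            changed[mc] = delta
    return changed


if __name__ == "__main__":
    # Take one snapshot; a burn-in harness would snapshot before and after a run.
    print(read_edac_counts())
```

Taking one snapshot before a linpack run and one after, then passing both to `new_errors`, would flag nodes whose memory logged errors under load.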

Re: [Beowulf] Re: ECC Memory and Job Failures (Huw Lynes)

2009-04-24 Thread Pfenniger Daniel
Prentice Bisbal wrote: Gerry Creager wrote: David Mathog wrote: Huw Lynes wrote: http://blog.revolution-computing.com/2009/04/blame-it-on-cosmic-rays.html Apparently someone ran a large cluster job with both ECC and non-ECC RAM. They consistently got the wrong answer when forgoing ECC.

Re: [Beowulf] Re: ECC Memory and Job Failures (Huw Lynes)

2009-04-24 Thread Geoff Jacobs
John Hearns wrote: > 2009/4/24 Prentice Bisbal : >> Last time this issue came up, he included links to several papers on >> this topic published by Boeing. As you go up in the atmosphere, the >> [prevalence|probability|concentration] of cosmic rays goes up >> significantly. Boeing has done a lot of

Re: [Beowulf] 1 multicore machine cluster

2009-04-24 Thread Geoff Jacobs
Залетнев Дмитрий wrote: > >> is it possible to have a single multicored machine as a cluster? >> >> -- >> Jonathan Aquilina > > I have a 2-core machine with a lot of memory and GLAN NIC PC, yes? > and two PS3 with 8 cores each with 256 MB of RAM and slow GLAN network, > thanks to PS3's "hyper

Re: [Beowulf] Re: ECC Memory and Job Failures

2009-04-24 Thread David Mathog
John Hearns wrote: > I don't have time to Google at the moment - I'm making obeisances at > the altar of a supercomputer - but wasn't there a large dose for every > Concorde flight? Did the flight crew have to wear radiation badges, or > am I going dotty in my old age? > http://www.britishairwa

Re: [Beowulf] ECC Memory and Job Failures

2009-04-24 Thread Derek R.
Huw, I've seen similar cases. A not-to-be-named company that I worked at decided to cut corners (and save cash) by purchasing a non-ECC cluster to expand their processing systems. Needless to say, jobs failed or returned incorrect results. All you need to do is multiply utilization (> 90%)

Re: [Beowulf] 1 multicore machine cluster

2009-04-24 Thread Залетнев Дмитрий
> is it possible to have a single multicored machine as a cluster? > > -- > Jonathan Aquilina I have a 2-core machine with a lot of memory and GLAN NIC and two PS3 with 8 cores each with 256 MB of RAM and slow GLAN network, thanks to PS3's "hypervisor", AIX-based native PS3 system, on which

Re: [Beowulf] ECC Memory and Job Failures

2009-04-24 Thread Jason Clinton
On Fri, Apr 24, 2009 at 12:49 AM, John Hearns wrote: > 2009/4/23 Nifty Tom Mitchell : > > On Thu, Apr 23, 2009 at 04:45:08PM +0100, Huw Lynes wrote: > > > > IMO Running on a large cluster without multiple bit detection and a > minimum of one bit > > correction ECC is silly. > > > > Further running

Re: [Beowulf] ECC Memory and Job Failures

2009-04-24 Thread John Hearns
2009/4/24 Joshua Baker-LePain : > On Fri, 24 Apr 2009 at 6:49am, John Hearns wrote > >> Plus couple that with command-line utilities to flash BMC, CMC and >> BIOSes and you've got a winner. > > But can you twiddle BIOS settings from within the OS? I do not think so. One day. Come the revolution, c

Re: [Beowulf] ECC Memory and Job Failures

2009-04-24 Thread Joshua Baker-LePain
On Fri, 24 Apr 2009 at 6:49am, John Hearns wrote Plus couple that with command-line utilities to flash BMC, CMC and BIOSes and you've got a winner. But can you twiddle BIOS settings from within the OS? -- Joshua "don't cross the streams" Baker-LePain QB3 Shared Cluster Sysadmin UCSF

Re: [Beowulf] 1 multicore machine cluster

2009-04-24 Thread Mark Hahn
> im impressed with the different views everyone has. i dont know how I still don't quite understand what differences you're referring to. > many of you would agree with me a multicore processor lets say a quad is 4 nodes in one. could one say it like that? no. a multicore processor

Re: [Beowulf] Re: ECC Memory and Job Failures (Huw Lynes)

2009-04-24 Thread John Hearns
2009/4/24 Prentice Bisbal : > > Last time this issue came up, he included links to several papers on > this topic published by Boeing. As you go up in the atmosphere, the > [prevalence|probability|concentration] of cosmic rays goes up > significantly. Boeing has done a lot of research on this topic

Re: [Beowulf] 1 multicore machine cluster

2009-04-24 Thread Glen Beane
On 4/24/09 9:16 AM, "Prentice Bisbal" wrote: Glen Beane wrote: > > > > On 4/24/09 3:03 AM, "Jonathan Aquilina" wrote: > > im impressed with the different views everyone has. i dont know how > many of you would agree with me a multicore processor lets say a > quad is 4 nodes in one

Re: [Beowulf] 1 multicore machine cluster

2009-04-24 Thread Galen Arnold
For SMP machines, there's also the NUMA view of a node, which may make sense if you're willing to tweak the batch system to the cpuset level:
[arno...@co-login ~]$ numactl --hardware
available: 16 nodes (0-15)
node 0 size: 7875 MB
node 0 free: 7268 MB
node 1 size: 7888 MB
node 1 free: 7315 MB
nod
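
[Editor's sketch] To act on the NUMA view described above, for instance when deciding which cpuset to hand a batch job, the `numactl --hardware` output has to become data first. The Python below is a minimal sketch that parses only the `node N size:` and `node N free:` lines visible in the quoted output. The function name `parse_numactl_hardware` and the `SAMPLE` string are illustrative assumptions, and any other lines the tool prints are simply ignored.

```python
import re

# Matches lines such as "node 0 size: 7875 MB" or "node 1 free: 7315 MB".
NODE_LINE = re.compile(r"^node (\d+) (size|free): (\d+) MB$")


def parse_numactl_hardware(text):
    """Return {node_id: {"size": MB, "free": MB}} from numactl --hardware text."""
    nodes = {}
    for line in text.splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            node = int(match.group(1))
            nodes.setdefault(node, {})[match.group(2)] = int(match.group(3))
    return nodes


# Sample taken from the lines quoted in the message (output is truncated there).
SAMPLE = """available: 16 nodes (0-15)
node 0 size: 7875 MB
node 0 free: 7268 MB
node 1 size: 7888 MB
node 1 free: 7315 MB
"""

if __name__ == "__main__":
    for node, info in sorted(parse_numactl_hardware(SAMPLE).items()):
        print(node, info)
```

On a live system the text could come from running `numactl --hardware` through `subprocess.run(..., capture_output=True, text=True)`; the sketch keeps parsing separate so it can be checked without the tool installed.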

Re: [Beowulf] 1 multicore machine cluster

2009-04-24 Thread Prentice Bisbal
Glen Beane wrote: > > > > On 4/24/09 3:03 AM, "Jonathan Aquilina" wrote: > > im impressed with the different views everyone has. i dont know how > many of you would agree with me a multicore processor lets say a > quad is 4 nodes in one. could one say it like that? > > I would not

Re: [Beowulf] Re: ECC Memory and Job Failures (Huw Lynes)

2009-04-24 Thread Prentice Bisbal
Gerry Creager wrote: > David Mathog wrote: >> Huw Lynes wrote: >> >>> http://blog.revolution-computing.com/2009/04/blame-it-on-cosmic-rays.html >>> >>> >>> Apparently someone ran a large cluster job with both ECC and non-ECC >>> RAM. They consistently got the wrong answer when forgoing ECC. >> >

Re: [Beowulf] Re: ECC Memory and Job Failures (Huw Lynes)

2009-04-24 Thread Robert G. Brown
On Fri, 24 Apr 2009, John Hearns wrote: 2009/4/24 Greg Lindahl : On Thu, Apr 23, 2009 at 10:20:02PM -0400, Robert G. Brown wrote: On clusters of ~300 nodes over burnin times of weeks, I was easily able to see the difference between sea level, Boulder, and Albuquerque, with 2000's-era memory.

Re: [Beowulf] 1 multicore machine cluster

2009-04-24 Thread Glen Beane
On 4/24/09 8:07 AM, "Jonathan Aquilina" wrote: in a way aren't multicore processors taking for example 4 single core machines and merging them into one with 1 four-core processor? I think the term "node" is a loaded term in HPC. This is what comes to mind when I hear node, and I'm sure a lot

Re: [Beowulf] 1 multicore machine cluster

2009-04-24 Thread Jonathan Aquilina
in a way aren't multicore processors taking for example 4 single core machines and merging them into one with 1 four-core processor? On 4/24/09, Glen Beane wrote: > > > > On 4/24/09 3:03 AM, "Jonathan Aquilina" wrote: > > im impressed with the different views everyone has. i dont know how many of

Re: [Beowulf] 1 multicore machine cluster

2009-04-24 Thread Glen Beane
On 4/24/09 3:03 AM, "Jonathan Aquilina" wrote: im impressed with the different views everyone has. i dont know how many of you would agree with me a multicore processor lets say a quad is 4 nodes in one. could one say it like that? I would not. To me a node is a physical thing. One or mor

Re: [Beowulf] Re: ECC Memory and Job Failures (Huw Lynes)

2009-04-24 Thread Robert G. Brown
On Fri, 24 Apr 2009, John Hearns wrote: 2009/4/24 Robert G. Brown : I don't think memory is all that unstable, especially down where I live. In Denver, maybe.  I think you need a lot of RAM, for a long time, to see a lot of radiation induced errors, or a source of high energy particles. I t

Re: [Beowulf] Re: ECC Memory and Job Failures (Huw Lynes)

2009-04-24 Thread John Hearns
2009/4/24 Greg Lindahl : > On Thu, Apr 23, 2009 at 10:20:02PM -0400, Robert G. Brown wrote: > > > On clusters of ~300 nodes over burnin times of weeks, I was easily > able to see the difference between sea level, Boulder, and > Albuquerque, with 2000's-era memory. H. All we need is IEEE-1588 p

Re: [Beowulf] 1 multicore machine cluster

2009-04-24 Thread Jonathan Aquilina
im impressed with the different views everyone has. i dont know how many of you would agree with me a multicore processor lets say a quad is 4 nodes in one. could one say it like that? On Thu, Apr 23, 2009 at 9:06 PM, Nifty Tom Mitchell wrote: > On Tue, Apr 21, 2009 at 09:46:07PM +0200, Jonathan