Re: [Beowulf] Node Drop-Off

2006-12-18 Thread Eric W. Biederman
"Vincent Diepeveen" <[EMAIL PROTECTED]> writes: > I wouldn't rule out that linux kernel simply has bugs there. The testing of > those kernels is total amateuristic. No, the testing is totally open. Just because you can't see how a process works doesn't make it better. Eric

Re: [Beowulf] Node Drop-Off

2006-12-05 Thread Joshua Baker-LePain
On Tue, 5 Dec 2006 at 7:03am, Joshua Baker-LePain wrote Sure, there are many less than good Tier 1s out there, so caveat ^^ *sigh* That should read 'Tier 2s', of course. That'll teach me to post before coffee. beowulfer. But you can some who, IMH

Re: [Beowulf] Node Drop-Off

2006-12-05 Thread Joshua Baker-LePain
On Mon, 4 Dec 2006 at 7:01pm, Robert G. Brown wrote This is really the basic difference between tier 1 and tier 2. You can save short term money with the latter, but have to do things like just plain throw out hardware -- after sweating over it for a long time, nagging your tier 2 vendor, getti

Re: [Beowulf] Node Drop-Off

2006-12-04 Thread Mark Hahn
were doing the running. Any toplevel AMD exec would do anything to crank up quality control rather than have to endure the sight of an old AMD and our system vendor (HP) did a good job replacing >1500 opteron 252's in my cluster this year. this was the result of a "test escape", which did ev

Re: [Beowulf] Node Drop-Off

2006-12-04 Thread Robert G. Brown
On Mon, 4 Dec 2006, Jim Lux wrote: At 10:59 AM 12/4/2006, Robert G. Brown wrote: On Mon, 4 Dec 2006, Jim Lux wrote: Processors are a high dollar item for something quite compact, they're sort of commodity (at least as far as the end user is concerned), so they're ripe for all the fiddles tha

Re: [Beowulf] Node Drop-Off

2006-12-04 Thread Jim Lux
At 10:59 AM 12/4/2006, Robert G. Brown wrote: On Mon, 4 Dec 2006, Jim Lux wrote: Processors are a high dollar item for something quite compact, they're sort of commodity (at least as far as the end user is concerned), so they're ripe for all the fiddles that have been used on such items for m

Re: [Beowulf] Node Drop-Off

2006-12-04 Thread Robert G. Brown
On Mon, 4 Dec 2006, Jim Lux wrote: Processors are a high dollar item for something quite compact, they're sort of commodity (at least as far as the end user is concerned), so they're ripe for all the fiddles that have been used on such items for millennia. Hey, didn't Archimedes get famous for

Re: [Beowulf] Node Drop-Off

2006-12-04 Thread Jim Lux
At 07:15 AM 12/4/2006, Tim Moore wrote: Update to node drop-off: The AMD engineer with whom I talked was amazed that such CPUs made it beyond quality control. He also suggested that the vendor may have inadvertently mixed returned (previously determined to be flawed) processors with the new

Re: [Beowulf] Node Drop-Off

2006-12-04 Thread Vincent Diepeveen
- Original Message - From: "Tim Moore" <[EMAIL PROTECTED]> Sent: Monday, December 04, 2006 4:15 PM Subject: Re: [Beowulf] Node Drop-Off Update to node drop-off: I wrote a few weeks ago to ask about node drop-off. A quick note...I had a cluster run for 3 years without failure and

RE: [Beowulf] Node Drop-Off

2006-12-04 Thread Tony Ladd
Tony -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Tim Moore Sent: Monday, December 04, 2006 10:16 AM To: beowulf@beowulf.org Subject: Re: [Beowulf] Node Drop-Off Update to node drop-off: I wrote a few weeks ago to ask about node drop-off. A quick note...I

Re: [Beowulf] Node Drop-Off

2006-12-04 Thread Tim Moore
Update to node drop-off: I wrote a few weeks ago to ask about node drop-off. A quick note...I had a cluster run for 3 years without failure and I upgraded the Opteron 240 CPUs to 250s. The upgrade required a BIOS upgrade and, while I was at it, I upgraded the OS and security. Some readers prov

Re: [Beowulf] Node Drop-Off

2006-11-13 Thread Gerald Davies
On 11/12/06, Tim Moore <[EMAIL PROTECTED]> wrote: Hello All - I have a compute node that has started dropping off. When I say drop off, I mean the node (while running a job) will lose all connectivity and the machine does not respond. I have viewed the logs and can find no reason for the node

Re: [Beowulf] Node Drop-Off

2006-11-13 Thread Chris Samuel
On Sunday 12 November 2006 16:13, Tim Moore wrote: > Has anyone ever seen such behavior? Others have mentioned attaching consoles, etc., but it's also worth trawling through any logs in /var/log to see if anything is showing up there too. Check dmesg whilst the node is under load, if you'

Re: [Beowulf] Node Drop-Off

2006-11-13 Thread John Hearns
Mark Hahn wrote: we (and the vendor) regard this as grounds for repair (usually the power supply). I back up what Mark says. a) attach a console to the machine, either a serial line or a monitor/keyboard b) run memtest on it, followed by CPUburn or some other compute-intensive task for a da

Re: [Beowulf] Node Drop-Off

2006-11-12 Thread Mark Hahn
I have a compute node that has started dropping off. When I say drop off, I mean the node (while running a job) will lose all connectivity and the machine does not respond. I have viewed the logs and can find no reason for the node to cease functioning. if you connect a console to such a nod

[Beowulf] Node Drop-Off

2006-11-12 Thread Tim Moore
Hello All - I have a compute node that has started dropping off. When I say drop off, I mean the node (while running a job) will lose all connectivity and the machine does not respond. I have viewed the logs and can find no reason for the node to cease functioning. Let me state that this b