Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Carsten Aulbert
Mark Hahn wrote:
> there's a semi-recent kernel feature which allows the kernel to avoid
> user-space by putting console traffic onto the net directly
> see Documentation/networking/netconsole.txt
Now that looks very interesting. Thanks for the pointer! Cheers, Carsten
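For reference, netconsole is configured entirely through a module parameter, in the format described in Documentation/networking/netconsole.txt. A minimal sketch of the idea follows; all addresses, ports, interface names and the log path are illustrative placeholders, not values from the thread:

    # On the node: send kernel console output to a collector over UDP
    # (local port@ip/device, then target port@ip/mac -- all illustrative).
    modprobe netconsole netconsole=6665@10.0.0.42/eth0,6666@10.0.0.1/00:11:22:33:44:55

    # On the collector: any UDP listener will do, e.g. traditional netcat:
    netcat -u -l -p 6666 >> /var/log/netconsole/node42.log

Because the kernel writes these packets itself, this keeps working through many failures where user-space syslog is already dead; it cannot help once the NIC or the interrupt path itself is gone.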

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Carsten Aulbert
Robert G. Brown wrote:
> "putting a cheap monitor on a suspect or crashed node"
One monitor for more than 1300 1U servers is not practical :)
> Or even after a crash. If the primary graphics card is being used as a
> console, the frame buffer will probably retain the last kernel oops
> written to

Re: [Beowulf] Re: Monitoring crashing machines

2008-09-09 Thread Carsten Aulbert
Hi all,
Lawrence Stewart wrote:
> [...]
> A month or two later, the department calls in to inquire "Where's the
> numbers report?" After some confusion back and forth, it seems that the
> department had been dutifully filing the abend dumps in a row of file
> cabinets, and wanted to know why th

Re: [Beowulf] Re: Monitoring crashing machines

2008-09-09 Thread Lawrence Stewart
On Sep 9, 2008, at 7:41 PM, Robert G. Brown wrote:
> On Tue, 9 Sep 2008, David Mathog wrote:
>> ...word. In the old days some of those crash events spewed garbage to
>> the printer, and that resulted in a ream of nonsense on the floor, and
>> more often than not, the paper mashed into an accordion behind a
>> pinfeed jam.

Re: [Beowulf] Re: Monitoring crashing machines

2008-09-09 Thread Robert G. Brown
On Tue, 9 Sep 2008, David Mathog wrote:
> ...word. In the old days some of those crash events spewed garbage to
> the printer, and that resulted in a ream of nonsense on the floor, and
> more often than not, the paper mashed into an accordion behind a
> pinfeed jam.
Nobody said it was EASY back then, right?

[Beowulf] Re: Monitoring crashing machines

2008-09-09 Thread David Mathog
"Robert G. Brown" <[EMAIL PROTECTED]> wrote: > > One last method (from back in the dark ages): > > "putting a tty-output printer on as a console printer" Better yet, set up the serial port as a console, then attach another machine via a serial line, and just have the 2nd machine log everythin

Re: [Beowulf] Re: GPU boards and cluster servers.

2008-09-09 Thread Mike Davis
> could be we don't know how to ask; I'm not aware of HP actually offering
> such a kit. or how much we'd be willing to pay.
> it is an interesting question: not just how much does downtime cost you,
> but what are the kinds of failures you see and expect? our clusters have
> been remarkably robust,

Re: [Beowulf] Re: GPU boards and cluster servers.

2008-09-09 Thread Greg Lindahl
On Tue, Sep 09, 2008 at 06:41:01PM -0400, Mark Hahn wrote:
>> You don't have your own spares kit? For big clusters like yours, it
>> doesn't cost much.
> could be we don't know how to ask; I'm not aware of HP actually offering
> such a kit. or how much we'd be willing to pay.
Well, I always b

Re: [Beowulf] Re: GPU boards and cluster servers.

2008-09-09 Thread Mark Hahn
>> I _do_ wish it was a bit more common to have onsite spares. not sure
>> why vendors (HP at least) don't like to do this. maybe just that it
>> might get kicked around or otherwise abused...
> You don't have your own spares kit? For big clusters like yours, it
> doesn't cost much.
could be we don't know how to ask; I'm not aware of HP actually offering
such a kit. or how much we'd be willing to pay.

Re: [Beowulf] Re: GPU boards and cluster servers.

2008-09-09 Thread Robert G. Brown
On Tue, 9 Sep 2008, Mark Hahn wrote:
> for small sites or individuals, it makes a lot of sense (for the vendor)
> to try to filter out some of the randomness of support calls before
> committing a person. of course, a good CRM system would help this -
> perhaps that's why RGB gets satisfaction from Dell

[Beowulf] SLAs was Re: GPU boards and cluster servers.

2008-09-09 Thread Lux, James P
>> Again, I'm not picking on Dell specifically. I've seen this behavior
>> with other large vendors. My point is that "on-site support" usually
>> isn't always, so don't believe the hype.
> I think highly of HP service and HP hardware in general. we always spec
> onsite/NBD support. at first, we spe

Re: [Beowulf] Re: GPU boards and cluster servers.

2008-09-09 Thread Greg Lindahl
On Tue, Sep 09, 2008 at 05:46:50PM -0400, Mark Hahn wrote:
> I _do_ wish it was a bit more common to have onsite spares. not sure
> why vendors (HP at least) don't like to do this. maybe just that it
> might get kicked around or otherwise abused...
You don't have your own spares kit? For big clusters like yours, it
doesn't cost much.

Re: [Beowulf] Re: GPU boards and cluster servers.

2008-09-09 Thread Mark Hahn
> Again, I'm not picking on Dell specifically. I've seen this behavior
> with other large vendors. My point is that "on-site support" usually
> isn't always, so don't believe the hype.
I think highly of HP service and HP hardware in general. we always spec
onsite/NBD support. at first, we spent a l

Re: [Beowulf] Re: GPU boards and cluster servers.

2008-09-09 Thread Prentice Bisbal
Robert G. Brown wrote:
> On Mon, 8 Sep 2008, Greg Lindahl wrote:
>> On Mon, Sep 08, 2008 at 02:58:36PM -0400, Prentice Bisbal wrote:
>>> I think these trends have more to do with the cheap cost of Dell
>>> Hardware and Dell's sales force and marketing to upper management
>>> than they do with any technical advantages Dell has over

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Robert G. Brown
On Tue, 9 Sep 2008, Carsten Aulbert wrote:
> We did get a few messages, albeit not from the kernel when an error
> happened. I'll have another look today, maybe I did something wrong.
If your kernel is out and out crashing, you might not get anything at
all. In that case, let me add: "putting a cheap monitor on a suspect or
crashed node"

Re: [Beowulf] Re: GPU boards and cluster servers.

2008-09-09 Thread Greg Lindahl
On Tue, Sep 09, 2008 at 02:12:02PM -0400, Robert G. Brown wrote:
> If I buy e.g. a Dell laptop (as I have for six or seven years now) I pay
> a single, easily budgeted price and if it breaks (as it has six or seven
> times now over the years -- I USE my laptop, run hard and put up wet), a
> nice m

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Robert G. Brown
On Tue, 9 Sep 2008, Carsten Aulbert wrote:
> My question now: is there a cute little way to gather all the console
> outputs of more than 1000 nodes? The nodes don't have physical serial
> cables attached to them - nor do we want to use many concentrators to
> achieve this - but the off-the-shelf Supermicro box

Re: [Beowulf] Re: GPU boards and cluster servers.

2008-09-09 Thread Robert G. Brown
On Mon, 8 Sep 2008, Greg Lindahl wrote:
> On Mon, Sep 08, 2008 at 02:58:36PM -0400, Prentice Bisbal wrote:
>> I think these trends have more to do with the cheap cost of Dell
>> Hardware and Dell's sales force and marketing to upper management than
>> they do with any technical advantages Dell has over

Re: [Beowulf] Re: Re: GPU boards and cluster servers.

2008-09-09 Thread Prentice Bisbal
Jeff Johnson wrote:
>> A Xeon is a Xeon is a Xeon.
> This is a very true statement.
> Unfortunately for many, the commonality ends where the processor and
> socket meet. There is a great deal of deviation in motherboard designs.
> Some are much better than others and it is not always ba

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Mark Hahn
> We did get a few messages, albeit not from the kernel when an error
> happened. I'll have another look today, maybe I did something wrong.
there's a semi-recent kernel feature which allows the kernel to avoid
user-space by putting console traffic onto the net directly
see Documentation/networking/netconsole.txt

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Loic Tortay
Carsten Aulbert wrote:
> [server console management for many servers with conserver]
We use conserver to get serial console access to almost all our machines.
Below is the forwarded answer to your messages from my coworker who's in
charge of this. The tools he created for interfacing IPMI and conserver
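Those tools themselves aren't in the thread, but a minimal conserver.cf sketch of the general approach (hostnames, user name and the environment-variable password are assumptions) drives ipmitool's serial-over-LAN session as an "exec" console and logs it centrally:

    # /etc/conserver/conserver.cf -- one logged console per node
    default * {
        master localhost;
        logfile /var/consoles/&;   # '&' expands to the console name
        rw *;
    }
    console node42 {
        type exec;
        exec /usr/bin/ipmitool -I lanplus -H node42-ipmi -U admin -E sol activate;
    }

The attraction of conserver here is that it multiplexes: operators attach and detach with the console client while the daemon keeps logging continuously, so nothing is lost between interactive sessions.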

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Lawrence Stewart
Carsten Aulbert wrote:
> Hi all,
> I would tend to guess this problem is fairly common and many solutions
> are already in place, so I would like to enquire about your solutions
> to the problem:
> In our large cluster we have certain nodes going down with I/O hard
> disk errors. We have some

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Perry E. Metzger
Carsten Aulbert <[EMAIL PROTECTED]> writes:
> For the time being we are experimenting with using "script" in many
> "screen" environments, which should be able to monitor ipmitool's SoL
> output, but somehow that strikes me as inefficient as well.
First, you should probably never want script+screen
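One lighter-weight alternative, as a sketch (the BMC hostname, user and log path are illustrative), is to skip script and screen entirely and redirect a plain SoL session into a per-node log:

    # Read the IPMI password from the environment (-E) and append the
    # node's serial-over-LAN output to a file, detached from any tty:
    export IPMI_PASSWORD=secret
    ipmitool -I lanplus -H node42-ipmi -U admin -E sol activate \
        < /dev/null >> /var/log/sol/node42.log 2>&1 &

One such background process per node, started from the monitoring host, replaces the whole script-inside-screen arrangement.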

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Carsten Aulbert
Hi,
Geoff Galitz wrote:
> You can also configure any standard (distribution shipped) syslog to log
> remotely to your head node or even a separate logging master. Anything
> that gets reported to the syslog facility can be reported/archived in
> this manner, you just need to dig into the document

RE: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Geoff Galitz
> Does this capture (almost) everything that happens to a machine? We
> have not yet looked into syslog-ng, but a look into your config files
> would be very nice.
You can also configure any standard (distribution shipped) syslog to log
remotely to your head node or even a separate logging master.
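With classic sysklogd this is a one-line change per node (the head node hostname is illustrative):

    # /etc/syslog.conf on each node -- forward every facility/priority:
    *.*        @headnode

    # On the head node, syslogd must be started with remote reception
    # enabled, typically:
    #     syslogd -r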

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Carsten Aulbert
Hi, thanks for the reply.
Reuti wrote:
> I set up syslog-ng on the nodes to log to the headnode. There each node
> will have a distinct file, e.g. "/var/log/nodes/node42.messages". If
> you are interested, I could post my configuration files for headnode
> and clients.
Does this capture (almost) eve
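Reuti's actual files aren't in the thread, but a minimal syslog-ng sketch of what he describes (transport, port and the source drivers are assumptions) has the headnode splitting incoming messages into one file per host:

    # Node side (syslog-ng.conf): ship all local logs to the headnode
    source s_local { unix-stream("/dev/log"); internal(); file("/proc/kmsg"); };
    destination d_head { udp("headnode" port(514)); };
    log { source(s_local); destination(d_head); };

    # Headnode side: one file per node, e.g. /var/log/nodes/node42.messages
    source s_net { udp(ip(0.0.0.0) port(514)); };
    destination d_nodes { file("/var/log/nodes/$HOST.messages"); };
    log { source(s_net); destination(d_nodes); };

The $HOST macro is what yields the per-node files; note that, like any userspace logger, this captures only what a still-running kernel manages to hand to /dev/log or /proc/kmsg before a crash.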

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Reuti
Hi,
On 09.09.2008 at 09:53, Carsten Aulbert wrote:
> Hi all,
> I would tend to guess this problem is fairly common and many solutions
> are already in place, so I would like to enquire about your solutions
> to the problem:
> In our large cluster we have certain nodes going down with I/O hard disk

[Beowulf] Monitoring crashing machines

2008-09-09 Thread Carsten Aulbert
Hi all,
I would tend to guess this problem is fairly common and many solutions
are already in place, so I would like to enquire about your solutions to
the problem:
In our large cluster we have certain nodes going down with I/O hard disk
errors. We have some suspicion about the causes but would