Re: [Beowulf] Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it?

2009-10-23 Thread Rahul Nabar
On Thu, Oct 22, 2009 at 11:31 PM, Don Holmgren wrote: > > > You need to modify /etc/inittab to add a getty for your serial > line (to give you a login prompt). > > If in the BIOS you redirected COMn, the corresponding Linux serial > port is ttyS(n-1).  Add this line to inittab to run agetty on

Re: [Beowulf] Mature open source hierarchical storage management

2009-10-23 Thread Greg Lindahl
On Fri, Oct 23, 2009 at 01:56:17PM -0700, Jon Forrest wrote: > They were all very fragile, but I think this was mostly due to > one prevailing problem. This is that, at the time, the OSs didn't > have hooks in the places necessary for an HSM system to do the > right thing. The hooks thing you're

Re: [Beowulf] Build a Beowulf Cluster

2009-10-23 Thread Gus Correa
Hi Tony, check these: http://www.rocksclusters.org/wordpress/ http://www.rocksclusters.org/roll-documentation/base/5.2/ http://www.clustermonkey.net/ http://www.phy.duke.edu/~rgb/Beowulf/beowulf.php http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php If you search the list archives you will fi

Re: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

2009-10-23 Thread Greg Lindahl
On Fri, Oct 23, 2009 at 09:01:28AM -0700, ed in 92626 wrote: > You could also do something at the system level to prevent it. If the system > boots and the previous_uptime is less than one hour shut down the system. > The WD timer will not wake it up. You have 2 power failures 15 minutes apart. Y

Re: [Beowulf] Mature open source hierarchical storage management

2009-10-23 Thread Jon Forrest
Carl Thomas wrote: Hi all, We are currently in the midst of planning a major refresh of our existing HPC cluster. It is expected that our storage will consist of a combination of fast fibre channel and SATA based disk and we would like to implement a system whereby user files are automaticall

Re: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

2009-10-23 Thread Gerry Creager
Greg Lindahl wrote: On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote: 2. Some errors are hardware precipitated. Aging, out-of-warranty aging, hardware can sometimes need such a reboot compromise for one-off random errors. Maybe all the "nice" clusters out there never have this issue

Re: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

2009-10-23 Thread Rahul Nabar
On Fri, Oct 23, 2009 at 1:23 PM, Greg Lindahl wrote: > On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote: > >> 2. Some errors are hardware precipitated. Aging, out-of-warranty >> aging, hardware can sometimes need such a reboot compromise for >> one-off random errors. >> >> Maybe all the

Re: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

2009-10-23 Thread Kevin Abbey
I tried this on a Supermicro board and a Sun box. On both systems the system would reboot randomly so I turned it off. This is a serious problem of false positives. In a cluster, you may need to notify the scheduler in some way when a node reboots. Can someone elaborate on this? Specifically

Re: [Beowulf] BIOS & monitor power saving

2009-10-23 Thread ed in 92626
I believe ANALOG is just your monitor telling you it's receiving analog input, as opposed to digital. Power saving mode is telling you the monitor is in PS mode and if you tell the BIOS to sleep the monitor is set up to comply with the request. Ed On Fri, Oct 23, 2009 at 6:59 AM, wrote: > Hi

Re: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

2009-10-23 Thread ed in 92626
On Thu, Oct 22, 2009 at 5:56 PM, Rahul Nabar wrote: > I wanted to get some opinions about if watchdog timers are a good idea > or not. I came across watchdogs again when reading through my IPMI > manual. In principle it sounds neat: If the system hangs then get it > to reboot after, say, 5 minute

[Beowulf] Mature open source hierarchical storage management

2009-10-23 Thread Carl Thomas
Hi all, We are currently in the midst of planning a major refresh of our existing HPC cluster. It is expected that our storage will consist of a combination of fast fibre channel and SATA based disk and we would like to implement a system whereby user files are automatically migrated to and from sl

Re: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

2009-10-23 Thread akshar bhosale
Hi Rahul, the same thing happens on our side. The node gets rebooted due to ASR and it doesn't crash. Can you suggest any remedy? On Fri, Oct 23, 2009 at 6:26 AM, Rahul Nabar wrote: > I wanted to get some opinions about if watchdog timers are a good idea > or not. I came across watchdogs again when reading t

[Beowulf] eth-mlx4-0/15

2009-10-23 Thread Robert Kubrick
I noticed my machine has 16 drivers in the /proc/interrupts table marked as eth-mlx4-0 to 15, in addition to the usual mlx-async and mlx-core drivers. The server runs Linux Suse RT, has an InfiniBand interface, OFED 1.1 drivers, and 16 Xeon MP cores, so I'm assuming all these eth-mlx4 driv

[Beowulf] Build a Beowulf Cluster

2009-10-23 Thread Tony Miranda
Hi everyone, could anyone help me by explaining how to build a Beowulf cluster? A web site, a list of parameters, anything up to date. Because I only found really old posts on the internet. Thanks a lot. Tony Miranda.

Re: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

2009-10-23 Thread Greg Lindahl
On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote: > 2. Some errors are hardware precipitated. Aging, out-of-warranty > aging, hardware can sometimes need such a reboot compromise for > one-off random errors. > > Maybe all the "nice" clusters out there never have this issue but for > me

Re: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

2009-10-23 Thread Rahul Nabar
On Fri, Oct 23, 2009 at 12:35 PM, Mark Hahn wrote: > >> My philosophy though would be to leave a machine down till the cause of >> the crash is established. > > absolutely.  this is not an obvious principle to some people, though: > it depends on whether your model of failures involves luck or cau

RE: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

2009-10-23 Thread Mark Hahn
You could imagine jobs which checkpoint often, and automatically restart themselves from a checkpoint if a machine fails like this. I find that apps (custom or commercial) normally need some help to restart. (some need to be pointed at the checkpoint to start with, others need to be told it's a

Re: [Beowulf] Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it?

2009-10-23 Thread Prentice Bisbal
> The relevant log snippet is: > > checking for EVP_aes_128_cbc in -lcrypto... no > checking for MD5_Init in -lcrypto... no > checking for MD2_Init in -lcrypto... no > ** The lanplus interface requires an SSL library with EVP_aes_128_cbc defined. > > Any clues what I am doing wrong? I used: "./c

Re: [Beowulf] Nehalem / PCIe I/O write latency?

2009-10-23 Thread Vincent Diepeveen
On Oct 22, 2009, at 7:17 PM, Patrick Geoffray wrote: Hey Larry, Larry Stewart wrote: Does anyone know, or know where to find out, how long it takes to do a store to a device register on a Nehalem system with a PCI Express device? Are you asking for latency or throughput? For latency, it

[Beowulf] BIOS & monitor power saving

2009-10-23 Thread tomislav . maric
___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

RE: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

2009-10-23 Thread Hearns, John
Of course, one might say, a well configured HPC compute-node shouldn't be getting to a hung point anyway; but in practice I see a few nodes every month that can be resurrected by a simple reboot. Admittedly these nodes are quite senile. I think that this is an interesting concept - and don't wan