On Thu, Oct 22, 2009 at 11:31 PM, Don Holmgren wrote:
>
>
> You need to modify /etc/inittab to add a getty for your serial
> line (to give you a login prompt).
>
> If in the BIOS you redirected COMn, the corresponding Linux serial
> port is ttyS(n-1). Add this line to inittab to run agetty on that port:
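A typical entry (a sketch; the runlevels, baud rate, and terminal
type here are assumptions, match them to your BIOS redirection
settings):

  S0:2345:respawn:/sbin/agetty ttyS0 115200 vt100

After editing /etc/inittab, run "telinit q" so init rereads it.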
On Fri, Oct 23, 2009 at 01:56:17PM -0700, Jon Forrest wrote:
> They were all very fragile, but I think this was mostly due to
> one prevailing problem: at the time, the OSs didn't have hooks in
> the places an HSM system needed in order to do the right thing.
The hooks thing you're
Hi Tony
Check these:
http://www.rocksclusters.org/wordpress/
http://www.rocksclusters.org/roll-documentation/base/5.2/
http://www.clustermonkey.net/
http://www.phy.duke.edu/~rgb/Beowulf/beowulf.php
http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php
If you search the list archives you will find plenty of earlier threads on this.
On Fri, Oct 23, 2009 at 09:01:28AM -0700, ed in 92626 wrote:
> You could also do something at the system level to prevent it. If the system
> boots and the previous_uptime is less than one hour, shut down the system.
> The WD timer will not wake it up.
You have 2 power failures 15 minutes apart. Your node then shuts itself down and stays down.
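For what it's worth, the quoted idea might look like this at boot (a
sketch, assuming a cron job copies /proc/uptime into a state file
every minute; the file path is hypothetical):

  #!/usr/bin/env python
  # Boot-time guard: if the previous run lasted under an hour, assume
  # a crash/power-cycle loop and power off instead of rebooting again.
  import os, sys

  STATE = "/var/run/last_uptime"   # hypothetical file, updated by cron
  LIMIT = 3600                     # one hour, in seconds

  try:
      previous = float(open(STATE).read().split()[0])
  except (IOError, ValueError, IndexError):
      previous = LIMIT             # no history recorded: allow the boot

  if previous < LIMIT:
      os.system("/sbin/poweroff")
      sys.exit(1)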
Greg Lindahl wrote:
On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote:
2. Some errors are hardware-precipitated. Aging, out-of-warranty
hardware can sometimes need such a reboot as a compromise for
one-off random errors.
Maybe all the "nice" clusters out there never have this issue
I tried this on a Supermicro board and a Sun box. On both systems
the machine would reboot randomly, so I turned it off; the false
positives are a serious problem. In a cluster you may also need to
notify the scheduler in some way when a node reboots. Can someone
elaborate on this? Specifically
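One common pattern (a sketch, assuming a TORQUE/PBS setup where
"pbsnodes -o" marks a node offline; Slurm and others have their own
equivalents) is a boot script that takes the node out of service
until an admin clears it:

  #!/usr/bin/env python
  # Run from an init script at boot: mark this node offline in the
  # scheduler so jobs stop landing on it until someone inspects it.
  import socket, subprocess

  node = socket.gethostname()
  subprocess.call(["pbsnodes", "-o", node,
                   "-N", "unexpected reboot, needs inspection"])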
I believe ANALOG is just your monitor telling you it's receiving analog
input, as opposed to digital.
Power saving mode is telling you the monitor is in PS mode, and if
you tell the BIOS to sleep the monitor, it is set up to comply with
the request.
Ed
On Fri, Oct 23, 2009 at 6:59 AM, wrote:
> Hi
On Thu, Oct 22, 2009 at 5:56 PM, Rahul Nabar wrote:
> I wanted to get some opinions about if watchdog timers are a good idea
> or not. I came across watchdogs again when reading through my IPMI
> manual. In principle it sounds neat: If the system hangs then get it
> to reboot after, say, 5 minutes.
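For background: on Linux the IPMI watchdog is usually exposed through
the generic /dev/watchdog interface (the ipmi_watchdog module). The
protocol, sketched minimally below, is that once the device is open,
the timer reboots the machine unless something writes to it before
the timeout expires; a real daemon such as watchdogd also runs health
checks before each write. The 10-second interval is an assumption.

  #!/usr/bin/env python
  # Minimal "petting" loop for the kernel watchdog interface.
  import time

  wd = open("/dev/watchdog", "w")
  try:
      while True:
          wd.write("\0")   # any write resets the hardware timer
          wd.flush()
          time.sleep(10)   # must be shorter than the watchdog timeout
  finally:
      wd.write("V")        # "magic close": disarm on exit where supported
      wd.close()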
Hi all,
We are currently in the midst of planning a major refresh of our existing
HPC cluster.
It is expected that our storage will consist of a combination of fast fibre
channel and SATA based disk and we would like to implement a system whereby
user files are automatically migrated to and from slower storage.
Hi Rahul,
The same thing happens on our side: nodes get rebooted by ASR even
though they haven't crashed. Can you suggest any remedy?
On Fri, Oct 23, 2009 at 6:26 AM, Rahul Nabar wrote:
> I wanted to get some opinions about if watchdog timers are a good idea
> or not. I came across watchdogs again when reading t
I noticed my machine has 16 drivers in the /proc/interrupts table
marked as eth-mlx4-0 to 15, in addition to the usual mlx-async and
mlx-core drivers.
The server runs SUSE Linux RT, has an InfiniBand interface, OFED 1.1
drivers, and 16 Xeon MP cores, so I'm assuming all these eth-mlx4
driv
Hi everyone,
Could anyone help me by explaining how to build a Beowulf cluster?
A web site, a list of parameters, anything up to date. I have only
found posts on the internet that are really old.
Thanks a lot.
Tony Miranda.
On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote:
> 2. Some errors are hardware-precipitated. Aging, out-of-warranty
> hardware can sometimes need such a reboot as a compromise for
> one-off random errors.
>
> Maybe all the "nice" clusters out there never have this issue but for
> me
On Fri, Oct 23, 2009 at 12:35 PM, Mark Hahn wrote:
>
>> My philosophy though would be to leave a machine down till the cause of
>> the crash is established.
>
> absolutely. this is not an obvious principle to some people, though:
> it depends on whether your model of failures involves luck or causality.
You could imagine jobs which checkpoint often, and automatically
restart themselves from a checkpoint if a machine fails like this.
I find that apps (custom or commercial) normally need some help to
restart. (Some need to be pointed at the checkpoint to start with,
others need to be told it's a
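That help can live in a small job wrapper. A sketch (the checkpoint
layout and the --restart flag are assumptions standing in for
whatever your application actually uses):

  #!/usr/bin/env python
  # Start the app from its newest checkpoint if any, else from scratch.
  import glob, os, subprocess, sys

  ckpts = sorted(glob.glob("checkpoints/step_*.ckpt"),
                 key=os.path.getmtime)
  cmd = ["./my_app"]                   # hypothetical solver binary
  if ckpts:
      cmd += ["--restart", ckpts[-1]]  # hypothetical restart flag
  sys.exit(subprocess.call(cmd))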
> The relevant log snippet is:
>
> checking for EVP_aes_128_cbc in -lcrypto... no
> checking for MD5_Init in -lcrypto... no
> checking for MD2_Init in -lcrypto... no
> ** The lanplus interface requires an SSL library with EVP_aes_128_cbc defined.
>
> Any clues what I am doing wrong? I used: "./c
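Those three "no" results mean configure could not link against
libcrypto at all, which almost always means the OpenSSL development
headers are missing rather than anything in your flags. The usual fix
(package names vary by distro) is:

  yum install openssl-devel   # Debian/Ubuntu: apt-get install libssl-dev

then rerun configure.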
On Oct 22, 2009, at 7:17 PM, Patrick Geoffray wrote:
Hey Larry,
Larry Stewart wrote:
Does anyone know, or know where to find out, how long it takes to
do a store to a device register on a Nehalem system with a
PCI Express device?
Are you asking for latency or throughput? For latency, it
Of course, one might say, a well-configured HPC compute node
shouldn't be getting into a hung state anyway; but in practice I see
a few nodes every month that can be resurrected by a simple reboot.
Admittedly these nodes are quite senile.
I think that this is an interesting concept - and don't wan