Re: [Beowulf] cold spare storage?

2017-08-17 Thread Bill Broadley via Beowulf
On 08/17/2017 11:10 AM, Alex Chekholko wrote: The Google paper from a few years ago showed essentially no correlations between the things you ask about and failure rates. So... do whatever is most convenient for you. Backblaze also has a pretty large data set, granted not as big as Google…

Re: [Beowulf] cold spare storage?

2017-08-17 Thread Benson Muite
On 08/17/2017 09:54 PM, mathog wrote: On 17-Aug-2017 11:10, Alex Chekholko wrote: The Google paper from a few years ago showed essentially no correlations between the things you ask about and failure rates. So... do whatever is most convenient for you. This one? http://research.google.com/archive/disk_failures.pdf …

Re: [Beowulf] cold spare storage?

2017-08-17 Thread mathog
On 17-Aug-2017 11:10, Alex Chekholko wrote: The Google paper from a few years ago showed essentially no correlations between the things you ask about and failure rates. So... do whatever is most convenient for you. This one? http://research.google.com/archive/disk_failures.pdf They didn't…

Re: [Beowulf] Poor bandwith from one compute node

2017-08-17 Thread Gus Correa
On 08/17/2017 12:35 PM, Joe Landman wrote: On 08/17/2017 12:00 PM, Faraz Hussain wrote: I noticed an MPI job was taking 5X longer to run whenever it got the compute node lusytp104. So I ran qperf and found the bandwidth between it and any other nodes was ~100MB/sec. This is much lower than…

Re: [Beowulf] Poor bandwith from one compute node

2017-08-17 Thread Joe Landman
On 08/17/2017 02:02 PM, Scott Atchley wrote: I would agree that the bandwidth points at 1 GigE in this case. For IB/OPA cards running slower than expected, I would recommend ensuring that they are using the correct number of PCIe lanes. Turns out, there is a really nice open source tool that…

Re: [Beowulf] cold spare storage?

2017-08-17 Thread Alex Chekholko
The Google paper from a few years ago showed essentially no correlations between the things you ask about and failure rates. So... do whatever is most convenient for you. On Thu, Aug 17, 2017 at 10:44 AM mathog wrote: (Originally posted here: https://stackoverflow.com/questions/45719853/enterprise-spare-drives-better-on-shelf-or-spun-down-in-enclosure …

Re: [Beowulf] Poor bandwith from one compute node

2017-08-17 Thread Scott Atchley
I would agree that the bandwidth points at 1 GigE in this case. For IB/OPA cards running slower than expected, I would recommend ensuring that they are using the correct number of PCIe lanes. On Thu, Aug 17, 2017 at 12:35 PM, Joe Landman wrote: On 08/17/2017 12:00 PM, Faraz Hussain wrote: …
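
A quick way to check the negotiated PCIe link width Scott mentions, assuming the node has an IB/OPA adapter visible to lspci; the bus address below is a placeholder and the vendor strings are only examples:

    # Find the adapter's PCI address (adjust the vendor string for the actual card):
    lspci | grep -i -e mellanox -e omni-path

    # Compare the negotiated link (LnkSta) against what the card supports (LnkCap).
    # An x8- or x16-capable HCA that trained at x4 or x1 will deliver reduced bandwidth:
    sudo lspci -vv -s 81:00.0 | grep -E 'LnkCap:|LnkSta:'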

[Beowulf] cold spare storage?

2017-08-17 Thread mathog
(Originally posted here: https://stackoverflow.com/questions/45719853/enterprise-spare-drives-better-on-shelf-or-spun-down-in-enclosure but nobody has answered.) Hi all, Some Dell servers I recently started managing have spare disks in their array enclosures. megacli showed the spares as: F…
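
The spare state being described can be listed with the same megacli utility; a minimal sketch, assuming the LSI/Avago MegaCLI package (the binary may be installed as megacli, MegaCli or MegaCli64 depending on the distribution):

    # List all physical drives on all adapters and show each drive's state; hot spares
    # typically report a "Firmware state" such as "Hotspare, Spun down" or "Hotspare, Spun Up":
    megacli -PDList -aALL | grep -E 'Slot Number|Firmware state'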

Re: [Beowulf] Poor bandwith from one compute node

2017-08-17 Thread Joe Landman
On 08/17/2017 12:00 PM, Faraz Hussain wrote: I noticed an MPI job was taking 5X longer to run whenever it got the compute node lusytp104. So I ran qperf and found the bandwidth between it and any other nodes was ~100MB/sec. This is much lower than ~1GB/sec between all the other nodes. Any tips…
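
A quick check of whether the slow node's traffic is going over a 1 GigE link rather than the fast interconnect; a sketch assuming ethtool, ibstat and getent are available on the node, and that eth0 is the Ethernet interface name (adjust to the actual interface):

    # Confirm the Ethernet link speed; "Speed: 1000Mb/s" would explain ~100MB/sec transfers:
    ethtool eth0 | grep -i speed

    # If the node has an InfiniBand/OPA port, confirm it is Active and at the expected rate:
    ibstat | grep -E 'State:|Rate:'

    # Check which address the hostname resolves to; MPI may pick the GigE interface if the
    # hosts/DNS entry for this node points at the Ethernet address:
    getent hosts lusytp104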

Re: [Beowulf] Poor bandwith from one compute node

2017-08-17 Thread John Hearns via Beowulf
Faraz, I really suggest you examine the Intel Cluster Checker. I guess that you cannot take down a production cluster to run an entire Cluster Checker run; however, these are the types of faults which ICC is designed to find. You can define a small set of compute nodes to run on, including this node…
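
A sketch of the kind of targeted run described here, assuming Intel Cluster Checker is installed and its environment has been sourced; the extra node names are placeholders, and the exact clck options should be checked against the installed version's documentation:

    # Nodefile with just the suspect node plus a couple of known-good nodes for comparison:
    cat > nodefile <<EOF
    lusytp104
    goodnode01
    goodnode02
    EOF

    # Run the checker against only that subset of nodes:
    clck -f nodefile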

[Beowulf] Poor bandwith from one compute node

2017-08-17 Thread Faraz Hussain
I noticed an MPI job was taking 5X longer to run whenever it got the compute node lusytp104. So I ran qperf and found the bandwidth between it and any other nodes was ~100MB/sec. This is much lower than ~1GB/sec between all the other nodes. Any tips on how to debug further? I haven't tried…
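
For reference, a minimal qperf measurement matching the one described above, assuming qperf is installed on both ends and using the node name from the post (the healthy node name is a placeholder):

    # On the suspect node, start the qperf server (it listens on its default port):
    qperf

    # From another node, measure TCP bandwidth and latency to the suspect node, then
    # repeat against a known-good node for comparison:
    qperf lusytp104 tcp_bw tcp_lat
    qperf goodnode01 tcp_bw tcp_lat

    # On an InfiniBand/OPA fabric, the RDMA RC tests show whether traffic actually uses the
    # fast interconnect rather than falling back to Ethernet:
    qperf lusytp104 rc_bw rc_lat

A TCP result around 100 MB/sec is roughly GigE line rate, which is consistent with the 1 GigE fallback suggested elsewhere in the thread.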