On 08/17/2017 11:10 AM, Alex Chekholko wrote:
> The Google paper from a few years ago showed essentially no correlations
> between the things you ask about and failure rates. So... do whatever is
> most convenient for you.
Backblaze also has a pretty large data set, granted not as big as Google's.
On 08/17/2017 09:54 PM, mathog wrote:
On 17-Aug-2017 11:10, Alex Chekholko wrote:
The Google paper from a few years ago showed essentially no correlations
between the things you ask about and failure rates. So... do whatever is
most convenient for you.
This one?
http://research.google.com/archive/disk_failures.pdf
They didn't
On 08/17/2017 12:35 PM, Joe Landman wrote:
On 08/17/2017 12:00 PM, Faraz Hussain wrote:
I noticed an mpi job was taking 5X longer to run whenever it got the
compute node lusytp104. So I ran qperf and found the bandwidth
between it and any other nodes was ~100MB/sec. This is much lower than
~1GB/sec between all the other nodes.
On 08/17/2017 02:02 PM, Scott Atchley wrote:
I would agree that the bandwidth points at 1 GigE in this case.
For IB/OPA cards running slower than expected, I would recommend
ensuring that they are using the correct number of PCIe lanes.
Turns out, there is a really nice open source tool that
The Google paper from a few years ago showed essentially no correlations
between the things you ask about and failure rates. So... do whatever is
most convenient for you.
On Thu, Aug 17, 2017 at 10:44 AM mathog wrote:
> (Originally posted here:
> https://stackoverflow.com/questions/45719853/enterprise-spare-drives-better-on-shelf-or-spun-down-in-enclosure
> but nobody has answered.)
I would agree that the bandwidth points at 1 GigE in this case.
For IB/OPA cards running slower than expected, I would recommend ensuring
that they are using the correct number of PCIe lanes.
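Not from the thread, just a minimal sketch of one way to check that on a node: it reads the standard PCIe link attributes from sysfs and compares the negotiated link against the card's maximum. The PCI address below is a placeholder; look up the HCA's real address with lspci first.

#!/usr/bin/env python3
"""Sketch: compare an HCA's negotiated PCIe link against its maximum.

Assumes Linux sysfs; PCI_ADDR is a placeholder -- find the card's real
address first (e.g. via lspci).
"""
from pathlib import Path

PCI_ADDR = "0000:3b:00.0"  # placeholder; substitute the HCA's real address

def read_attr(name: str) -> str:
    return (Path("/sys/bus/pci/devices") / PCI_ADDR / name).read_text().strip()

if __name__ == "__main__":
    cur_w, max_w = read_attr("current_link_width"), read_attr("max_link_width")
    cur_s, max_s = read_attr("current_link_speed"), read_attr("max_link_speed")
    print(f"PCIe width: x{cur_w} (max x{max_w})")
    print(f"PCIe speed: {cur_s} (max {max_s})")
    if cur_w != max_w or cur_s != max_s:
        print("WARNING: card is not running at full PCIe width/speed")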
On Thu, Aug 17, 2017 at 12:35 PM, Joe Landman wrote:
> On 08/17/2017 12:00 PM, Faraz Hussain wrote:
(Originally posted here:
https://stackoverflow.com/questions/45719853/enterprise-spare-drives-better-on-shelf-or-spun-down-in-enclosure
but nobody has answered.)
Hi all,
Some Dell servers I recently started managing have spare disks in their
array enclosures. megacli showed the spares as:
F
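Not part of the original post; a minimal sketch of how one might pull each drive's slot and firmware-state line back out of MegaCli. The MegaCli64 binary name and install path are assumptions and vary between systems.

#!/usr/bin/env python3
"""Sketch: list each physical drive's slot and firmware state via MegaCli.

Assumes the LSI CLI is on PATH as `MegaCli64`; it is often installed as
/opt/MegaRAID/MegaCli/MegaCli64 instead.
"""
import subprocess

def drive_states() -> list[tuple[str, str]]:
    out = subprocess.run(["MegaCli64", "-PDList", "-aALL"],
                         capture_output=True, text=True, check=True).stdout
    slots, states = [], []
    for line in out.splitlines():
        if line.startswith("Slot Number:"):
            slots.append(line.split(":", 1)[1].strip())
        elif line.startswith("Firmware state:"):
            states.append(line.split(":", 1)[1].strip())
    return list(zip(slots, states))

if __name__ == "__main__":
    # Spares typically report something like "Hotspare, Spun down" here.
    for slot, state in drive_states():
        print(f"slot {slot}: {state}")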
On 08/17/2017 12:00 PM, Faraz Hussain wrote:
I noticed an mpi job was taking 5X longer to run whenever it got the
compute node lusytp104. So I ran qperf and found the bandwidth
between it and any other nodes was ~100MB/sec. This is much lower than
~1GB/sec between all the other nodes. Any tips on how to debug further?
Faraz,
I really suggest you examine the Intel Cluster Checker.
I guess that you cannot take down a production cluster to run an entire
Cluster Checker run; however, these are the types of faults which ICC is
designed to find. You can define a small set of compute nodes to run on,
including this node.
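This is not Intel Cluster Checker itself, just a minimal hand-rolled sketch in the same spirit: run from a known-good node against a small node list that includes the suspect node, assuming qperf is already running in server mode on each target. The hostnames and the threshold below are placeholders.

#!/usr/bin/env python3
"""Sketch: flag nodes whose TCP bandwidth from this host looks like 1 GigE.

Assumes `qperf` is running (server mode, no arguments) on every node in
NODES and that this script runs on a known-good node. Hostnames and the
threshold are placeholders.
"""
import re
import subprocess

NODES = ["lusytp101", "lusytp102", "lusytp103", "lusytp104"]  # placeholders
THRESHOLD_MB_S = 500  # well above 1 GigE (~117 MB/sec), well below IB/OPA rates

def tcp_bw(host: str) -> float:
    """Return qperf tcp_bw to `host` in MB/sec (qperf may report GB/sec)."""
    out = subprocess.run(["qperf", host, "tcp_bw"],
                         capture_output=True, text=True, check=True).stdout
    m = re.search(r"bw\s*=\s*([\d.]+)\s*(GB|MB)/sec", out)
    if not m:
        raise RuntimeError(f"could not parse qperf output for {host}:\n{out}")
    value = float(m.group(1))
    return value * 1000 if m.group(2) == "GB" else value

if __name__ == "__main__":
    for node in NODES:
        bw = tcp_bw(node)
        note = "  <-- suspect, looks like 1 GigE" if bw < THRESHOLD_MB_S else ""
        print(f"{node}: {bw:.0f} MB/sec{note}")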
I noticed an mpi job was taking 5X longer to run whenever it got the
compute node lusytp104. So I ran qperf and found the bandwidth
between it and any other nodes was ~100MB/sec. This is much lower than
~1GB/sec between all the other nodes. Any tips on how to debug
further? I haven't tried
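Not part of Faraz's mail; since ~100 MB/sec is roughly 1 GigE wire speed, a cheap first check on the node (assuming Linux sysfs) is to see which interfaces are up and at what negotiated speed, before digging into the MPI or fabric layer.

#!/usr/bin/env python3
"""Sketch: print operstate and negotiated speed for each network interface.

Assumes Linux sysfs. If the fabric interface (e.g. ib0) is down, or MPI
traffic is falling back to a 1 Gb ethernet port, it should show up here.
"""
from pathlib import Path

def iface_info(iface: Path) -> tuple[str, str]:
    state = (iface / "operstate").read_text().strip()
    try:
        speed = (iface / "speed").read_text().strip() + " Mb/s"
    except OSError:
        speed = "n/a"  # down links and some virtual devices report no speed
    return state, speed

if __name__ == "__main__":
    for iface in sorted(Path("/sys/class/net").iterdir()):
        state, speed = iface_info(iface)
        print(f"{iface.name:<12} {state:<8} {speed}")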