On 6/12/2018 10:14 PM, Joe Obernberger wrote:
> Thank you Shawn. It looks like it is being applied. This could be
> some sort of chain reaction where:
> A drive or server fails. HDFS starts to re-replicate blocks, which
> causes network congestion. Solr 7 can't talk, so it initiates a
> replication process, which causes more network congestion... which
> causes more replicas to replicate, and which eventually causes HBase
> (we run HBase+Solr on the same machines) to also be unable to talk.
> That is my running hypothesis, anyway!
I was also thinking that there was a possibility that a lot of
replications were happening at once. At 75 megabytes per second each,
it would only take a few of them to saturate a 2-gigabit link, even
if the load sharing between the bonded gigabit links is perfect (and
depending on the type of bonding in use, it might not be).
75 MB per second is 600 megabits per second of payload, closer to 700
megabits per second on the wire with protocol overhead, so if three of
those are happening at the same time and the disks can actually keep
up, it would be enough to fill a 2Gb/s link.
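The arithmetic above can be checked with a quick back-of-envelope script. The 10% protocol-overhead factor is an assumption, not a figure from the thread:

```python
# Back-of-envelope check of the link-saturation math discussed above.
# Assumptions: 1 MB = 10^6 bytes, ~10% TCP/IP framing overhead on the wire.
replication_rate_mb_s = 75                       # per-replica copy rate, MB/s
wire_mbit_s = replication_rate_mb_s * 8 * 1.1    # payload bits -> wire bits
link_mbit_s = 2000                               # bonded 2 x 1 GbE link

concurrent = link_mbit_s / wire_mbit_s
print(f"one replication uses ~{wire_mbit_s:.0f} Mb/s on the wire")
print(f"~{concurrent:.1f} concurrent replications saturate a {link_mbit_s} Mb/s link")
```

This agrees with the estimate in the message: roughly three simultaneous 75 MB/s copies are enough to fill a 2 Gb/s bond.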
> We've made a change to limit how much bandwidth HDFS can use. One
> issue that we have seen is that the replicas fail to replicate, then
> retry, over and over. I believe they are getting a timeout error; is
> that timeout adjustable?
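The thread doesn't say which HDFS knob was changed. For reference, these are the standard throttles; note that the balancer bandwidth setting only limits balancer/mover traffic, while re-replication after a node failure is paced by the NameNode's replication-stream limits:

```shell
# Cap balancer/mover traffic per DataNode at runtime (bytes/sec; 100 MB/s here).
# This does NOT throttle block re-replication after a failure.
hdfs dfsadmin -setBalancerBandwidth 104857600

# Re-replication pacing is controlled in hdfs-site.xml instead, e.g.:
#   dfs.namenode.replication.max-streams                    (default 2)
#   dfs.namenode.replication.max-streams-hard-limit         (default 4)
#   dfs.namenode.replication.work.multiplier.per.iteration  (default 2)
# Lowering these slows how aggressively HDFS re-replicates lost blocks.
```

Which of these (if any) was actually changed on the cluster in question is not stated in the thread.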
To have any idea whether it's adjustable, I would need to know exactly
what timeout is being exceeded. Can you share the full error for
anything you're seeing?
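Without the exact error it isn't certain which timeout applies, but two places such timeouts are commonly adjusted are sketched below. The values shown are illustrative, not recommendations from the thread:

```
<!-- SolrCloud: inter-node update/recovery timeouts, in solr.xml -->
<solrcloud>
  <int name="distribUpdateConnTimeout">60000</int>   <!-- ms -->
  <int name="distribUpdateSoTimeout">600000</int>    <!-- ms -->
</solrcloud>

<!-- Legacy master/slave replication: slave-side timeouts, in solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master:8983/solr/core</str>
    <str name="httpConnTimeout">5000</str>     <!-- ms -->
    <str name="httpReadTimeout">10000</str>    <!-- ms -->
  </lst>
</requestHandler>
```

Whether either of these is the timeout being hit depends on the full error message Shawn asks for above.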
Thanks,
Shawn