Hi All - we have a system with 45 physical boxes running Solr 6.6.1 with the index stored in HDFS.  The current index size is about 31 TB.  With 3x replication that takes up 93 TB of disk.  Our main collection is split across 100 shards with 3 replicas each.  The issue we're running into comes up when restarting the Solr 6 cluster.  The shards go into recovery and saturate nearly all of their network interfaces.  If we start too many of the nodes at once, the shards get stuck in a recover, fail, and retry loop and never come up.  The errors are related to HDFS not responding fast enough, along with warnings from the DFSClient.  If we stop a node while this is happening, the stop script forces a kill (after a 180-second timeout), and on the next restart we're left with lock files (write.lock) inside HDFS.

The process at this point is to start one node, find the lock files, wait for it to come up completely (hours), stop it, delete the write.lock files, and restart.  This second restart is usually faster, but it can still take 20-60 minutes.
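For context, the cleanup step amounts to walking the Solr index directories in HDFS and removing the leftover write.lock files.  A minimal sketch of that using the Hadoop FileSystem API is below (the namenode URI and /solr home path are placeholders, not our real values, and this is only safe to run while the Solr nodes are stopped):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class DeleteStaleLocks {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode URI and Solr HDFS home -- substitute your own.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Recursively list everything under the Solr home and delete any
        // leftover write.lock files from the forced shutdown.
        RemoteIterator<LocatedFileStatus> files = fs.listFiles(new Path("/solr"), true);
        while (files.hasNext()) {
            Path p = files.next().getPath();
            if ("write.lock".equals(p.getName())) {
                System.out.println("Deleting " + p);
                fs.delete(p, false);  // non-recursive delete of the lock file itself
            }
        }
        fs.close();
    }
}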

The smaller indexes recover much faster (less than 5 minutes).  Should we not have used so many replicas with HDFS?  Is there a better way we should have built the Solr 6 cluster?

Thank you for any insight!

-Joe
