On 7/24/2016 8:04 PM, forest_soup wrote:
> We have a 5 node solrcloud. When a solr node's disk had issue and
> Raid5 downgraded, a recovery on the node was triggered. But there's a
> hanging happens. The node disappears in the live_nodes list.
In my opinion, RAID5 (and RAID6) are bad ways to handle storage. Cost per usable gigabyte is their only real advantage, and the performance problems are not worth that advantage. If you care more about capacity than performance, then it might be OK.

Under normal circumstances (no failed disk), if you're writing to the array at all, all I/O (both read and write) is slow. RAID5 can have awesome read performance, but *only* if the array is healthy and there is no writing happening at the same time.

If you lose a disk, the parity reads required to reconstruct the missing data cause REALLY bad performance. When you replace the failed disk and the array is rebuilding, performance is even worse. The additional load is often enough to cause a second disk to fail, which for RAID5 means the entire array is lost.

These I/O performance issues cause really big problems for Solr and ZooKeeper. It's no surprise to me that a degraded RAID5 array has issues like you describe.

Thanks,
Shawn
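To make the parity-read cost concrete, below is a minimal sketch of single-parity XOR reconstruction on a degraded RAID5 stripe. The stripe width, block size, and class name are illustrative assumptions only; the point is that one logical read aimed at the failed disk becomes a read of every surviving disk in the stripe.

// Minimal sketch: why reads on a degraded RAID5 array are expensive.
// Assumptions (illustrative only): 4 data blocks + 1 parity block per
// stripe, single-parity XOR, one disk failed. To serve a read that lands
// on the failed disk, the controller must read every surviving block in
// the stripe and XOR them together.
public class DegradedRaid5Read {

    // XOR all surviving blocks in a stripe to rebuild the missing block.
    static byte[] reconstruct(byte[][] survivingBlocks) {
        int blockSize = survivingBlocks[0].length;
        byte[] rebuilt = new byte[blockSize];
        for (byte[] block : survivingBlocks) {
            for (int i = 0; i < blockSize; i++) {
                rebuilt[i] ^= block[i];
            }
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        int blockSize = 8;

        // Stripe of 4 data blocks; parity is the XOR of the data blocks.
        byte[][] data = new byte[4][blockSize];
        for (int d = 0; d < 4; d++)
            for (int i = 0; i < blockSize; i++)
                data[d][i] = (byte) (d * 16 + i);

        byte[] parity = new byte[blockSize];
        for (byte[] block : data)
            for (int i = 0; i < blockSize; i++)
                parity[i] ^= block[i];

        // Suppose the disk holding data[2] has failed. Reading that single
        // block now requires reading the 3 surviving data blocks plus the
        // parity block -- 4 physical reads for 1 logical read.
        byte[][] surviving = { data[0], data[1], data[3], parity };
        byte[] rebuilt = reconstruct(surviving);

        System.out.println("rebuilt == original: "
                + java.util.Arrays.equals(rebuilt, data[2]));
    }
}

During a rebuild the same multiplication applies to every stripe on the replacement disk, which is why the degraded and rebuilding states are so much slower than a healthy array.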