We are running into an issue when doing distributed queries on Solr 4.10.4. We do not use SolrCloud but instead keep track of shards that need to be searched based on date ranges.
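To make the setup concrete, here is a minimal sketch of how the `shards` parameter might be assembled from date ranges. The host names, core names, date ranges, and helper names are hypothetical placeholders, not our real configuration; only the comma-separated `shards` request parameter is standard Solr.

```python
from datetime import date
from urllib.parse import urlencode

# Hypothetical shard map: each entry covers the half-open range [start, end).
SHARDS = [
    (date(2012, 1, 1), date(2013, 1, 1), "solr1.example.com:8983/solr/core2012"),
    (date(2013, 1, 1), date(2014, 1, 1), "solr2.example.com:8983/solr/core2013"),
    (date(2014, 1, 1), date(2015, 1, 1), "solr3.example.com:8983/solr/core2014"),
    (date(2015, 1, 1), date(2016, 1, 1), "solr4.example.com:8983/solr/core2015"),
]


def shards_for_range(start, end):
    """Return the shard addresses whose date range overlaps [start, end)."""
    return [addr for s, e, addr in SHARDS if s < end and start < e]


def build_query_url(aggregator, q, start=None, end=None):
    """Build a select URL; the shards list is added only for distributed queries."""
    params = {"q": q, "wt": "json"}
    if start is not None and end is not None:
        shards = shards_for_range(start, end)
        if shards:
            # Solr expects a comma-separated list of host:port/path entries.
            params["shards"] = ",".join(shards)
    return "http://%s/select?%s" % (aggregator, urlencode(params))
```

Calling `build_query_url` without a date range yields the non-distributed form of the query, which is the variant that keeps responding quickly in our case.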
We have been running distributed queries without incident for several years, but we only recently upgraded from 4.8.1 to 4.10.4. The query is relatively simple and involves 4 shards, including the aggregator itself.

For a while the server acting as the aggregator handles the distributed requests fine, but after an unpredictable amount of usage (in the range of 2-4 hours) it starts hanging on all distributed queries, while still serving non-distributed versions of the same query (no shards list included) quickly (9 ms). CPU, heap, and system memory usage do not seem unusual compared to other servers.

I had initially suspected that distributed searches combined with faceting might be part of the issue, since I had seen some long-running threads that seemed to spend a long time in the FastLRUCache when getting facets for a single field. However, in the latest case of blocked queries, I am not seeing that.

We have two slaves that replicate from a master, and we saw the issue recur after a period of client usage, which rules out a hardware issue.

Does anyone have any suggestions for potential avenues of attack for getting to the bottom of this? Or are there any known issues that could be implicated?

- Ronald S. Wood