Thanks, I’ll try that. Is the Thread Dump view in the Solr Admin panel not reliable for diagnosing thread hangs?
On a different note, I am considering introducing a dedicated aggregator to avoid using a shard for both search and aggregation, in case there is an issue there.

Ronald S. Wood | Senior Software Developer
857-991-7681 (mobile)

Smarsh
100 Franklin St. Suite 903 | Boston, MA 02210
1-866-SMARSH-1 | 971-998-9967 (fax)
www.smarsh.com <http://www.smarsh.com/>

Immediate customer support:
Call 1-866-762-7741 (x2) or visit www.smarsh.com/support <http://www.smarsh.com/support>

On 7/2/15, 3:56 PM, "Ryan, Michael F. (LNG-DAY)" <michael.r...@lexisnexis.com> wrote:

>Try running jstack on the aggregator - that will show you where the threads
>are hanging.
>
>-Michael
>
>-----Original Message-----
>From: Ronald Wood [mailto:rw...@smarsh.com]
>Sent: Thursday, July 02, 2015 3:37 PM
>To: solr-user@lucene.apache.org
>Subject: Distributed queries hang in a non-SolrCloud environment, Solr 4.10.4
>
>
>We are running into an issue when doing distributed queries on Solr 4.10.4. We
>do not use SolrCloud but instead keep track of shards that need to be searched
>based on date ranges.
>
>We have been running distributed queries without incident for several years
>now, but we only recently upgraded to 4.10.4 from 4.8.1.
>
>The query is relatively simple and involves 4 shards, including the aggregator
>itself.
>
>For a while the server that is acting as the aggregator for the distributed
>query handles the requests fine, but after an indefinite amount of usage (in
>the range of 2-4 hours) it starts hanging on all distributed queries while
>serving non-distributed versions (no shards list is included) of the same
>query quickly (9 ms).
>
>CPU, Heap and System Memory Usage do not seem unusual compared to other
>servers.
>
>I had initially suspected that distributed searches combined with faceting might
>be part of the issue, since I had seen some long-running threads that seemed
>to spend a long time in the FastLRUCache when getting facets for a single
>field. However, in the latest case of blocked queries, I am not seeing that.
>
>We have two slaves that replicate from a master, and we saw the issue
>recur after a while of client usage, ruling out a hardware issue.
>
>Does anyone have any suggestions for potential avenues of attack for getting
>to the bottom of this? Or are there any known issues that could be implicated
>in this?
>
>- Ronald S. Wood
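
One way to pin down when the aggregator starts hanging is to time the same query with and without a shards list, as described in the original message. The following is only a rough sketch under assumptions: the helper names, host names, core paths and timeout are all hypothetical and not taken from the actual setup. The only Solr-specific pieces it relies on are the standard `/select` handler and the `shards` request parameter.

```python
import time
import urllib.parse
import urllib.request


def build_select_url(base_url, query, shards=None, rows=10):
    """Build a /select URL; passing `shards` makes the request distributed."""
    params = {"q": query, "rows": str(rows), "wt": "json"}
    if shards:
        # Solr takes a comma-separated list of host:port/path entries.
        params["shards"] = ",".join(shards)
    return base_url.rstrip("/") + "/select?" + urllib.parse.urlencode(params)


def timed_fetch(url, timeout_s=10.0):
    """Fetch the URL and return (status, elapsed_seconds).

    On any network failure (including a client-side timeout) the status
    is a string starting with "error:" instead of an HTTP status code.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()
            status = str(resp.status)
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        status = "error: %s" % exc
    return status, time.monotonic() - start


# Hypothetical usage against an assumed aggregator and shard list:
#   base = "http://aggregator:8983/solr/core1"
#   shards = ["aggregator:8983/solr/core1", "shard2:8983/solr/core1"]
#   print(timed_fetch(build_select_url(base, "*:*")))
#   print(timed_fetch(build_select_url(base, "*:*", shards=shards)))
```

Running both variants on a schedule and logging the results would show roughly when the distributed form begins timing out while the plain form stays fast, which is a good moment to capture a jstack dump on the aggregator.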