What do you mean by JVM level? Run Solr on different ports on the same machine? If you have a 32-core box, would you run 2, 3, or 4 JVMs?
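To make the question concrete, here is a minimal sketch of what "sharding at the JVM level" might look like, assuming it means several independent Solr processes on one host, each on its own port with a proportionally smaller heap. The 64 GB heap budget and the even split are purely illustrative assumptions, and `plan_jvms` is a hypothetical helper, not anything from Solr. The base port 8983 is just Solr's default.

```python
from dataclasses import dataclass


@dataclass
class JvmPlan:
    port: int        # HTTP port this Solr JVM would listen on
    heap_gb: float   # heap given to this JVM
    cores: float     # nominal share of the host's cores


def plan_jvms(total_cores, heap_budget_gb, jvm_count, base_port=8983):
    """Split a host's cores and heap budget evenly across jvm_count JVMs.

    Hypothetical helper for illustration only: the even split and the
    consecutive port numbering are assumptions, not Solr behaviour.
    """
    if jvm_count < 1:
        raise ValueError("jvm_count must be at least 1")
    return [
        JvmPlan(
            port=base_port + i,
            heap_gb=heap_budget_gb / jvm_count,
            cores=total_cores / jvm_count,
        )
        for i in range(jvm_count)
    ]


if __name__ == "__main__":
    # The 32-core box from the question; the 64 GB heap budget is assumed.
    for n in (2, 3, 4):
        print(f"{n} JVMs:")
        for plan in plan_jvms(32, 64, n):
            print(f"  port {plan.port}: {plan.heap_gb:.1f} GB heap, "
                  f"{plan.cores:.1f} cores")
```

The only point of the sketch is that more JVMs on the same box means a smaller heap per JVM, which is presumably the appeal given the remark below that long pauses get harder to avoid as heap size goes up.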

On Sun, Dec 4, 2016 at 8:46 PM, Jeff Wartes <jwar...@whitepages.com> wrote:

>
> Here’s an earlier post where I mentioned some GC investigation tools:
> https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201604.mbox/%3c8f8fa32d-ec0e-4352-86f7-4b2d8a906...@whitepages.com%3E
>
> In my experience, there are many aspects of the Solr/Lucene memory
> allocation model that scale with things other than documents returned.
> (such as cardinality, or simply index size) A single query on a large index
> might consume dozens of megabytes of heap to complete. But that heap should
> also be released quickly after the query finishes.
> The key characteristic of a memory leak is that the software is allocating
> memory that it cannot reclaim. If it’s a leak, you ought to be able to
> reproduce it at any query rate - have you tried this? A run with, say, half
> the rate, over twice the duration?
>
> I’m inclined to agree with others here, that although you’ve correctly
> attributed the cause to GC, it’s probably less an indication of a leak, and
> more an indication of simply allocating memory faster than it can be
> reclaimed, combined with the long pauses that are increasingly unavoidable
> as heap size goes up.
> Note that in the case of a CMS allocation failure, the fallback full-GC is
> *single threaded*, which means it’ll usually take considerably longer than
> a normal GC - even for a comparable amount of garbage.
>
> In addition to GC tuning, you can address these by sharding more, both at
> the core and jvm level.
>
>
> On 12/4/16, 3:46 PM, "Shawn Heisey" <apa...@elyograg.org> wrote:
>
> On 12/3/2016 9:46 PM, S G wrote:
> > The symptom we see is that the java clients querying Solr see response
> > times in 10s of seconds (not milliseconds).
> <snip>
> > Some numbers for the Solr Cloud:
> >
> > *Overall infrastructure:*
> > - Only one collection
> > - 16 VMs used
> > - 8 shards (1 leader and 1 replica per shard - each core on separate VM)
> >
> > *Overview from one core:*
> > - Num Docs:193,623,388
> > - Max Doc:230,577,696
> > - Heap Memory Usage:231,217,880
> > - Deleted Docs:36,954,308
> > - Version:2,357,757
> > - Segment Count:37
>
> The heap memory usage number isn't useful. It doesn't cover all the
> memory used.
>
> > *Stats from QueryHandler/select*
> > - requests:78,557
> > - errors:358
> > - timeouts:0
> > - totalTime:1,639,975.27
> > - avgRequestsPerSecond:2.62
> > - 5minRateReqsPerSecond:1.39
> > - 15minRateReqsPerSecond:1.64
> > - avgTimePerRequest:20.87
> > - medianRequestTime:0.70
> > - 75thPcRequestTime:1.11
> > - 95thPcRequestTime:191.76
>
> These times are in *milliseconds*, not seconds .. and these are even
> better numbers than you showed before. Where are you seeing 10 plus
> second query times? Solr is not showing numbers like that.
>
> If your VM host has 16 VMs on it and each one has a total memory size of
> 92GB, then if that machine doesn't have 1.5 terabytes of memory, you're
> oversubscribed, and this is going to lead to terrible performance... but
> the numbers you've shown here do not show terrible performance.
>
> > Plus, on every server, we are seeing lots of exceptions.
> > For example:
> >
> > Between 8:06:55 PM and 8:21:36 PM, exceptions are:
> >
> > 1) Request says it is coming from leader, but we are the leader:
> > update.distrib=FROMLEADER&distrib.from=HOSTB_ca_1_1456430020/&wt=javabin&version=2
> >
> > 2) org.apache.solr.common.SolrException: Request says it is coming from
> > leader, but we are the leader
> >
> > 3) org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 4) null:org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 5) org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 6) null:org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 7) org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: No live SolrServers
> > available to handle this request. Zombie server list:
> > [HOSTA_ca_1_1456429897]
> >
> > 8) null:org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: No live SolrServers
> > available to handle this request.
> > Zombie server list:
> > [HOSTA_ca_1_1456429897]
> >
> > 9) org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 10) null:org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 11) org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 12) null:org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
>
> These errors sound like timeouts, possibly caused by long GC pauses ...
> but as already mentioned, the query handler statistics do not indicate
> long query times. If a long GC were to happen during a query, then the
> query time would be long as well.
>
> The core information above doesn't include the size of the index on
> disk. That number would be useful for telling you whether there's
> enough memory.
>
> As I said at the beginning of the thread, I haven't seen anything here
> to indicate a memory leak, and others are using version 4.10 without any
> problems. If there were a memory leak in a released version of Solr,
> many people would have run into problems with it.
>
> Thanks,
> Shawn
>

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076