OK, thanks for wrapping this up!

On Mon, Aug 31, 2015 at 10:08 AM, Rallavagu <rallav...@gmail.com> wrote:
> Erick,
>
> Apologies for not following up on the status of the indexing
> (replication) issues, as I originally started this thread. After
> implementing CloudSolrServer instead of ConcurrentUpdateSolrServer,
> things were much better. I simply wanted to follow up to understand
> the memory behavior better, though we tuned both heap and physical
> memory a while ago.
>
> Thanks
>
> On 8/24/15 9:09 AM, Erick Erickson wrote:
>>
>> bq: As a follow up, the default is set to "NRTCachingDirectoryFactory"
>> for DirectoryFactory but not MMapDirectory. It is mentioned that
>> NRTCachingDirectoryFactory "caches small files in memory for better
>> NRT performance".
>>
>> NRTCachingDirectoryFactory also uses MMapDirectory under the covers,
>> in addition to "caching small files in memory....", so you really
>> can't separate the two.
>>
>> I didn't mention this explicitly, but your original problem should
>> _not_ be happening in a well-tuned system. Why your nodes go into a
>> down state needs to be understood. The connection timeout is the only
>> clue so far, and the usual reason here is that very long GC pauses
>> are happening. If this keeps happening, you might try turning on GC
>> reporting options.
>>
>> Best,
>> Erick
>>
>> On Mon, Aug 24, 2015 at 2:47 AM, Rallavagu <rallav...@gmail.com> wrote:
>>>
>>> As a follow up, the default is set to "NRTCachingDirectoryFactory"
>>> for DirectoryFactory but not MMapDirectory. It is mentioned that
>>> NRTCachingDirectoryFactory "caches small files in memory for better
>>> NRT performance".
>>>
>>> Wondering whether this would also consume physical memory to the
>>> same extent as the MMap directory. Thoughts?
>>>
>>> On 8/18/15 9:29 AM, Erick Erickson wrote:
>>>>
>>>> Couple of things:
>>>>
>>>> 1> Here's an excellent backgrounder for MMapDirectory, which is
>>>> what makes it appear that Solr is consuming all the physical
>>>> memory:
>>>>
>>>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>>
>>>> 2> It's possible that your transaction log was huge. Perhaps not
>>>> likely, but possible. If Solr terminates abnormally (kill -9 is a
>>>> prime way to do this), then upon restart the transaction log is
>>>> replayed. This log is rolled over upon every hard commit
>>>> (openSearcher true or false doesn't matter). So, in the scenario
>>>> where you are indexing a whole lot of stuff without committing, it
>>>> can take a very long time to replay the log. Not only that, but as
>>>> you replay the log, any incoming updates are written to the end of
>>>> the tlog. That said, nothing in your e-mails indicates this could
>>>> be a problem, and it's frankly not consistent with the errors you
>>>> _do_ report, but I thought I'd mention it. See:
>>>>
>>>> https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>>>
>>>> You can avoid the possibility of this by configuring your
>>>> autoCommit interval to be relatively short (say 60 seconds) with
>>>> openSearcher=false.
>>>>
>>>> 3> ConcurrentUpdateSolrServer isn't the best thing for bulk loading
>>>> SolrCloud; CloudSolrServer (renamed CloudSolrClient in 5.x) is
>>>> better. CUSS sends all the docs to some node, and from there that
>>>> node figures out which shard each doc belongs on and forwards the
>>>> doc (actually in batches) to the appropriate leader. So doing what
>>>> you're doing creates a lot of cross chatter amongst nodes.
>>>> CloudSolrServer/Client figures that out on the client side and only
>>>> sends each leader packets consisting of the docs that belong on
>>>> that shard. You can get nearly linear throughput with increasing
>>>> numbers of shards this way.
>>>>
>>>> Best,
>>>> Erick
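
To make 2> above concrete, the autoCommit settings would look something
like this in solrconfig.xml (60 seconds, expressed in milliseconds):

    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>

And a rough sketch of the CloudSolrClient approach from 3>, assuming
SolrJ 5.x; the ZooKeeper addresses, collection name, field names, and
batch size are placeholders for illustration, not recommendations:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkLoader {
      public static void main(String[] args) throws Exception {
        // Point the client at the ZK ensemble, not at any one Solr node.
        try (CloudSolrClient client =
                 new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
          client.setDefaultCollection("collection1");

          List<SolrInputDocument> batch = new ArrayList<>();
          for (int i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title_txt", "document " + i);
            batch.add(doc);

            // Send reasonably large batches; the client splits each
            // batch by shard and sends each piece straight to that
            // shard's leader.
            if (batch.size() >= 1000) {
              client.add(batch);
              batch.clear();
            }
          }
          if (!batch.isEmpty()) {
            client.add(batch);
          }

          // Let autoCommit handle durability during the run; a single
          // explicit commit at the end makes everything searchable.
          client.commit();
        }
      }
    }

The client-side routing is the whole point here: CloudSolrClient reads
the cluster state from ZooKeeper, so no Solr node has to forward docs
to the other leaders.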
>>>> On Tue, Aug 18, 2015 at 9:03 AM, Rallavagu <rallav...@gmail.com> wrote:
>>>>>
>>>>> Thanks Shawn.
>>>>>
>>>>> All participating cloud nodes are running Tomcat and, as you
>>>>> suggested, I will review the number of threads and increase them
>>>>> as needed.
>>>>>
>>>>> Essentially, what I noticed was that two of the four nodes caught
>>>>> up with "bulk" updates instantly, while the other two nodes took
>>>>> almost 3 hours to get completely in sync with the "leader". I
>>>>> "tickled" the other nodes by sending an update, thinking that it
>>>>> would initiate the replication, but I am not sure whether that is
>>>>> what caused the other two nodes to eventually catch up.
>>>>>
>>>>> On a similar note, I was using "ConcurrentUpdateSolrServer"
>>>>> pointing directly at the leader to bulk load the Solr cloud. I
>>>>> have configured the chunk size and thread count for it. Is this
>>>>> the right practice to bulk load SolrCloud?
>>>>>
>>>>> Also, the maximum number of connections per host parameter for
>>>>> "HttpShardHandler" is in solrconfig.xml, I suppose?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On 8/18/15 8:28 AM, Shawn Heisey wrote:
>>>>>>
>>>>>> On 8/18/2015 8:18 AM, Rallavagu wrote:
>>>>>>>
>>>>>>> Thanks for the response. Does this cache behavior influence the
>>>>>>> delay in catching up with the cloud? How can we explain
>>>>>>> SolrCloud replication, and what are the options to monitor and
>>>>>>> take proactive action (such as initializing, pausing, etc.) if
>>>>>>> needed?
>>>>>>
>>>>>> I don't know enough about your setup to speculate.
>>>>>>
>>>>>> I did notice this exception in a previous reply:
>>>>>>
>>>>>> org.apache.http.conn.ConnectionPoolTimeoutException: Timeout
>>>>>> waiting for connection from pool
>>>>>>
>>>>>> I can think of two things that would cause this.
>>>>>>
>>>>>> One cause is that your servlet container is limiting the number
>>>>>> of available threads. A typical jetty or tomcat default for
>>>>>> maxThreads is 200, which can easily be exceeded by a small Solr
>>>>>> install, especially if it's SolrCloud. The jetty included with
>>>>>> Solr sets maxThreads to 10000, which is effectively unlimited
>>>>>> except for extremely large installs. If you are providing your
>>>>>> own container, this will almost certainly need to be raised.
>>>>>>
>>>>>> The other cause is that your install is extremely busy and you
>>>>>> have run out of available HttpClient connections. The solution in
>>>>>> this case is to increase the maximum number of connections per
>>>>>> host in the HttpShardHandler config, which defaults to 20.
>>>>>>
>>>>>> https://wiki.apache.org/solr/SolrConfigXml#Configuration_of_Shard_Handlers_for_Distributed_searches
>>>>>>
>>>>>> There might be other causes for that exception, but I think those
>>>>>> are the most common. Depending on how things are set up, you
>>>>>> might have problems with both.
>>>>>>
>>>>>> Thanks,
>>>>>> Shawn
>>>>>
>>>
>
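
To illustrate Shawn's second point: the per-host connection limit is
part of the shard handler configuration inside a request handler in
solrconfig.xml, per the wiki page he linked. A sketch along those
lines; the value of 100 is illustrative, not a recommendation:

    <requestHandler name="/select" class="solr.SearchHandler">
      <shardHandlerFactory class="HttpShardHandlerFactory">
        <!-- defaults to 20; raise it if the pool is being exhausted -->
        <int name="maxConnectionsPerHost">100</int>
      </shardHandlerFactory>
    </requestHandler>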