I may have a bit of good news. The ulimit for open files was set to 4096. I just chose an arbitrarily high limit (100000) and it seems to be working better now. I still have more testing to do, but the initial results are hopeful.
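In case anyone else hits this, one way to double-check that the new limit is what the Solr JVM actually sees (and not just what your shell reports) is to ask the JVM itself. This is only a minimal sketch, assuming a Unix-like JVM that exposes the com.sun.management extensions; nothing here is Solr-specific, and it needs to run under the same user/session that launches Solr:

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    public class FdLimitCheck {
        public static void main(String[] args) {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            // The file-descriptor counters are only exposed on Unix-like JVMs.
            if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
                com.sun.management.UnixOperatingSystemMXBean unix =
                        (com.sun.management.UnixOperatingSystemMXBean) os;
                System.out.println("open file descriptors: " + unix.getOpenFileDescriptorCount());
                System.out.println("max file descriptors:  " + unix.getMaxFileDescriptorCount());
            } else {
                System.out.println("File descriptor counts are not exposed on this platform.");
            }
        }
    }

If the reported max is still 4096, the new ulimit is probably not being applied to the service that starts Solr.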
On Wed, Aug 14, 2013 at 4:22 PM, Kevin Osborn <kevin.osb...@cbsi.com> wrote:

> Actually, I thought it worked last night, but that may have just been a fluke. Today, it is not working.
>
> This is what I have done.
>
> I have turned off autoCommit and softAutoCommit. My updates are not sending any softCommit messages.
>
> I am sending over data in chunks of 500 records.
>
> At the end of each complete upload, I am doing an explicit commit.
>
> So, I send over the first upload (536 records). This works fine in 2 chunks. After the commit, the records are searchable as well.
>
> The second catalog is much larger (300k records). It starts uploading about 5 minutes later. Usually, it hangs on the very first chunk. If I kill the server during the hang, it does then work.
>
> As a variation, I set autoCommit maxDocs to 10000, with openSearcher set to false. During that test, it hung right after the second autocommit.
>
> On both servers, I see a long-waiting commitScheduler thread in the thread dump:
>
>   - sun.misc.Unsafe.park(Native Method)
>   - java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   - java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1079)
>   - java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
>   - java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
>   - java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
>   - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   - java.lang.Thread.run(Thread.java:722)
>
> On the second server, I see quite a few other long-waiting threads as well.
>
> -Kevin
>
>
> On Wed, Aug 14, 2013 at 9:51 AM, Kevin Osborn <kevin.osb...@cbsi.com> wrote:
>
>> Thanks so much for your help and for the explanations. Eventually, we will be doing several batches in parallel, but at least now I know where to look and can do some testing on various scenarios.
>>
>> Since we may be doing a lot of heavy uploading (while still doing a lot of queries), having an autoCommit interval shorter than the softAutoCommit interval does sound interesting, and I will test it out. And then just disable softCommit on my batch uploads.
>>
>> Either way, I at least know where to focus my efforts.
>>
>> -Kevin
>>
>>
>> On Wed, Aug 14, 2013 at 6:27 AM, Jason Hellman <jhell...@innoventsolutions.com> wrote:
>>
>>> Kevin,
>>>
>>> I wouldn't have considered using softCommits at all based on what I understand from your use case. You appear to be loading in large batches, and softCommits are better aligned to NRT search, where there is a steady stream of smaller updates that need to be available immediately.
>>>
>>> As Erick pointed out, soft commits are all about avoiding constant reopening of the index searcher…where by constant we mean every few seconds. Provided you can wait until your batch is completed, and that frequency is roughly a minute or more, you likely will find an old-fashioned hard commit (with openSearcher="true") will work just fine (YMMV).
>>>
>>> Jason
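For anyone following along, a minimal SolrJ sketch of the upload pattern described above: chunked adds with no per-chunk commits, then one explicit hard commit once the whole batch is in. The ZooKeeper address, collection name, field names, and chunk size are placeholders, and the uploads in this thread actually go through the CSV handler rather than document adds; this only shows the commit pattern.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ChunkedUpload {
        public static void main(String[] args) throws Exception {
            // Placeholder ZooKeeper ensemble and collection name.
            CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181");
            server.setDefaultCollection("collection3");

            List<SolrInputDocument> chunk = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 300000; i++) {              // stand-in for reading the source data
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "product-" + i);         // placeholder fields
                doc.addField("name_s", "product " + i);
                chunk.add(doc);

                if (chunk.size() == 500) {                  // send in chunks; no commit per chunk
                    server.add(chunk);
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                server.add(chunk);
            }

            // One explicit hard commit at the end of the batch:
            // waitFlush=true, waitSearcher=true, softCommit=false.
            server.commit(true, true, false);
            server.shutdown();
        }
    }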
>>>
>>> On Aug 14, 2013, at 4:51 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>> > Right, SOLR-5081 is possible but somewhat unlikely, given that you actually don't have very many nodes in your cluster.
>>> >
>>> > Soft commits aren't relevant to the tlog, but here's the thing: your tlogs may get replayed when you restart Solr. If they're large, this may take a long time. When you said you restarted Solr after killing it, you might have triggered this.
>>> >
>>> > The way to keep tlogs small is to hard commit more frequently (you should look at their size before worrying about it, though!). If you set openSearcher=false, this is pretty inexpensive; all it really does is close the current segment files, open new ones, and start a new tlog file. It does _not_ invalidate caches, do autowarming, or any of that expensive stuff.
>>> >
>>> > Your soft commit does _not_ improve performance! It is just "less expensive" than a hard commit with openSearcher=true. It _does_ invalidate caches, fire off autowarming, etc. So it does "improve performance" over doing hard commits with openSearcher=true at the same frequency, but it still isn't free. It's still good to have the soft commit interval as long as you can tolerate.
>>> >
>>> > It's perfectly reasonable to have a hard commit interval that's much shorter than your soft commit interval. As Yonik explained once, "soft commits are about visibility, but hard commits are about durability".
>>> >
>>> > Best,
>>> > Erick
>>> >
>>> >
>>> > On Wed, Aug 14, 2013 at 2:20 AM, Kevin Osborn <kevin.osb...@cbsi.com> wrote:
>>> >
>>> >> Interesting, that did work. Do you or anyone else have any ideas about what I should look at? While soft commit is not a requirement in my project, my understanding is that it should help performance. On the same index, I will be doing both a large number of queries as well as updates.
>>> >>
>>> >> If I have to disable autoCommit, should I increase the chunk size?
>>> >>
>>> >> Of course, I will have to run a larger-scale test tomorrow, but I saw this problem fairly consistently in my smaller test.
>>> >>
>>> >> In a previous experiment, I applied the SOLR-4816 patch that someone indicated might help. I also reduced the CSV upload chunk size to 500. It seemed like things got a little better, but it still eventually hung.
>>> >>
>>> >> I also see SOLR-5081, but I don't know if that is my issue or not. At least in my test, the index writes are not parallel as in the ticket.
>>> >>
>>> >> -Kevin
>>> >>
>>> >>
>>> >> On Tue, Aug 13, 2013 at 8:40 PM, Jason Hellman <jhell...@innoventsolutions.com> wrote:
>>> >>
>>> >>> While I don't have a past history of this issue to use as a reference, if I were in your shoes I would consider trying your updates with softCommit disabled. My suspicion is that you're experiencing some issue with the transaction logging and how it's managed when your hard commit occurs.
>>> >>>
>>> >>> If you can give that a try and let us know how it fares, we might have some further input to share.
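To make the openSearcher=false point above concrete: an explicit hard commit issued this way closes the current segment files and starts a new tlog without reopening a searcher, so no cache invalidation or autowarming. A rough SolrJ sketch, assuming the openSearcher request parameter is honored by the /update handler in this version (worth verifying against your Solr release); the ZooKeeper address and collection name are placeholders:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.UpdateRequest;

    public class CheapHardCommit {
        // Hard commit that rolls the tlog but does not open a new searcher.
        static void hardCommitNoNewSearcher(SolrServer server) throws Exception {
            UpdateRequest req = new UpdateRequest();
            // waitFlush=true, waitSearcher=false; this is a hard commit, not a soft one.
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, false);
            req.setParam("openSearcher", "false"); // same knob as <openSearcher>false</openSearcher> in autoCommit
            req.process(server);
        }

        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("zkhost1:2181"); // placeholder ZK address
            server.setDefaultCollection("collection3");                   // placeholder collection
            hardCommitNoNewSearcher(server);
            server.shutdown();
        }
    }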
>>> >>>
>>> >>> On Aug 13, 2013, at 11:54 AM, Kevin Osborn <kevin.osb...@cbsi.com> wrote:
>>> >>>
>>> >>>> I am using Solr Cloud 4.4. It is pretty much a base configuration. We have 2 servers and 3 collections. Collection1 has 1 shard, and Collection2 and Collection3 both have 2 shards. Both servers are identical.
>>> >>>>
>>> >>>> So, here is my process. I do a lot of queries on Collection1 and Collection2. I then do a bunch of inserts into Collection3. I am doing CSV uploads. I am also doing custom shard routing; all the products in a single upload will have the same shard key. All Solr interaction is through SolrJ with full ZooKeeper awareness. My uploads also use soft commits.
>>> >>>>
>>> >>>> I tried this on a record set of 936 products. Everything worked fine. I then sent over a record set of 300k products. The upload into Collection3 is chunked; I tried chunk sizes of both 1,000 and 200,000 with similar results. The first upload to Solr would just hang. There would simply be no response from Solr. A few of the products from this request would make it into the index, but not many.
>>> >>>>
>>> >>>> In this state, queries continued to work, but deletes did not.
>>> >>>>
>>> >>>> My only solution was to kill each Solr process.
>>> >>>>
>>> >>>> As an experiment, I did the large catalog first. First, I reset everything. With a chunk size of 1,000, about 110,000 out of 300,000 records made it into Solr before the process hung. Again, queries worked, but deletes did not, and I had to kill Solr. It hung after about 30 seconds. Timing-wise, this is at about the second autocommit cycle, given the default autocommit of 15 seconds. I am not sure if this is related or not.
>>> >>>>
>>> >>>> As an additional experiment, I ran the entire test with just a single node in the cluster. This time, everything ran fine.
>>> >>>>
>>> >>>> Does anyone have any ideas? Everything is pretty much default. These servers are Azure VMs, although I have seen similar behavior running two Solr instances on a single internal server as well.
>>> >>>>
>>> >>>> I had also noticed similar behavior before with Solr 4.3. It definitely has something to do with the clustering, but I am not sure what. And I don't see any error messages (or really anything else) in the Solr logs.
>>> >>>>
>>> >>>> Thanks.
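On the custom shard routing mentioned in the original message: with the default compositeId router in Solr 4.x, giving every product in an upload the same shard key usually means prefixing the uniqueKey value with that key and a "!". A small illustrative sketch; the field names, the "catalog42" value, and the addresses are made up, not taken from this setup:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class RoutedAdd {
        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("zkhost1:2181"); // placeholder ZK address
            server.setDefaultCollection("collection3");                   // placeholder collection

            // With the compositeId router, "shardKey!docId" co-locates all
            // documents that share the same prefix on the same shard.
            String shardKey = "catalog42"; // e.g. one key per uploaded catalog (illustrative value)

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", shardKey + "!" + "sku-0001"); // routing prefix on the uniqueKey
            doc.addField("name_s", "example product");       // placeholder field
            server.add(doc);

            server.commit(); // plain hard commit, just for the demo
            server.shutdown();
        }
    }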
--
*KEVIN OSBORN*
LEAD SOFTWARE ENGINEER
CNET Content Solutions
OFFICE 949.399.8714
CELL 949.310.4677  SKYPE osbornk
5 Park Plaza, Suite 600, Irvine, CA 92614