Actually, I thought it worked last night, but that may have just been a fluke. Today, it is not working.
This is what I have done. I have turned off autoCommit and autoSoftCommit,
and my update requests are not sending any softCommit parameter. I am
sending the data over in chunks of 500 records, and at the end of each
complete upload I do an explicit commit. The first upload (536 records)
goes over in two chunks and works fine; after the commit, the records are
searchable as well. The second catalog is much larger (300k records) and
starts uploading about 5 minutes later. It usually hangs on the very first
chunk. If I kill the server during the hang, the upload does then work.

As a variation, I set autoCommit with maxDocs=10000 and openSearcher=false.
During that test, it hung right after the second autocommit. On both
servers, the thread dump shows a long-waiting commitScheduler thread:

  - sun.misc.Unsafe.park(Native Method)
  - java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
  - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
  - java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1079)
  - java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
  - java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
  - java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
  - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  - java.lang.Thread.run(Thread.java:722)

On the second server, I see quite a few other long-waiting threads as well.
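
For reference, the autoCommit variant I tested would look roughly like
this in solrconfig.xml (a simplified sketch of my setup; autoSoftCommit
is omitted entirely):

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- Hard commit every 10,000 docs but keep the current searcher:
           this just closes segments and rolls the tlog. -->
      <autoCommit>
        <maxDocs>10000</maxDocs>
        <openSearcher>false</openSearcher>
      </autoCommit>
      <!-- No autoSoftCommit configured for this test. -->
    </updateHandler>

With that in place, nothing should become visible until my explicit
commit at the end of the upload.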
-Kevin

On Wed, Aug 14, 2013 at 9:51 AM, Kevin Osborn <kevin.osb...@cbsi.com> wrote:

> Thanks so much for your help and for the explanations. Eventually, we
> will be doing several batches in parallel. But at least now I know where
> to look and can do some testing on various scenarios.
>
> Since we may be doing a lot of heavy uploading (while still doing a lot
> of queries), having an autoCommit interval shorter than the
> autoSoftCommit interval does sound interesting, and I will test it out.
> And then just disable softCommit on my batch uploads.
>
> Either way, I at least know where to focus my efforts.
>
> -Kevin
>
>
> On Wed, Aug 14, 2013 at 6:27 AM, Jason Hellman <
> jhell...@innoventsolutions.com> wrote:
>
>> Kevin,
>>
>> I wouldn't have considered using softCommits at all based on what I
>> understand of your use case. You appear to be loading in large batches,
>> and softCommits are better aligned to NRT search, where there is a
>> steady stream of smaller updates that need to be available immediately.
>>
>> As Erick pointed out, soft commits are all about avoiding constant
>> reopening of the index searcher, where by "constant" we mean every few
>> seconds. Provided you can wait until your batch is completed, and that
>> frequency is roughly a minute or more, you will likely find that an
>> old-fashioned hard commit (with openSearcher="true") works just fine
>> (YMMV).
>>
>> Jason
>>
>>
>> On Aug 14, 2013, at 4:51 AM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>> > Right, SOLR-5081 is possible but somewhat unlikely, given that you
>> > don't have very many nodes in your cluster.
>> >
>> > Soft commits aren't relevant to the tlog, but here's the thing: your
>> > tlogs may get replayed when you restart Solr. If they're large, this
>> > may take a long time. When you said you restarted Solr after killing
>> > it, you might have triggered this.
>> >
>> > The way to keep tlogs small is to hard commit more frequently (you
>> > should look at their size before worrying about it, though!). If you
>> > set openSearcher=false, this is pretty inexpensive: all it really
>> > does is close the current segment files, open new ones, and start a
>> > new tlog file. It does _not_ invalidate caches, do autowarming, or
>> > any of that expensive stuff.
>> >
>> > Your soft commit does _not_ improve performance! It is just "less
>> > expensive" than a hard commit with openSearcher=true. It _does_
>> > invalidate caches, fire off autowarming, etc. So it does "improve
>> > performance" over doing hard commits with openSearcher=true at the
>> > same frequency, but it still isn't free. It's still good to have the
>> > soft commit interval as long as you can tolerate.
>> >
>> > It's perfectly reasonable to have a hard commit interval that's much
>> > shorter than your soft commit interval. As Yonik explained once,
>> > "soft commits are about visibility but hard commits are about
>> > durability".
>> >
>> > Best,
>> > Erick
>> >
>> >
>> > On Wed, Aug 14, 2013 at 2:20 AM, Kevin Osborn <kevin.osb...@cbsi.com> wrote:
>> >
>> >> Interesting, that did work. Do you or anyone else have any ideas on
>> >> what I should look at? While soft commit is not a requirement in my
>> >> project, my understanding is that it should help performance. On
>> >> the same index, I will be doing both a large number of queries and
>> >> a large number of updates.
>> >>
>> >> If I have to disable autoCommit, should I increase the chunk size?
>> >>
>> >> Of course, I will have to run a larger-scale test tomorrow, but I
>> >> saw this problem fairly consistently in my smaller test.
>> >>
>> >> In a previous experiment, I applied the SOLR-4816 patch that someone
>> >> indicated might help, and I also reduced the CSV upload chunk size
>> >> to 500. It seemed like things got a little better, but it still
>> >> eventually hung.
>> >>
>> >> I also see SOLR-5081, but I don't know whether that is my issue. At
>> >> least in my test, the index writes are not parallel as in the
>> >> ticket.
>> >>
>> >> -Kevin
>> >>
>> >>
>> >> On Tue, Aug 13, 2013 at 8:40 PM, Jason Hellman <
>> >> jhell...@innoventsolutions.com> wrote:
>> >>
>> >>> While I don't have any past history with this issue to use as a
>> >>> reference, if I were in your shoes I would consider trying your
>> >>> updates with softCommit disabled. My suspicion is that you're
>> >>> experiencing some issue with the transaction logging and how it's
>> >>> managed when your hard commit occurs.
>> >>>
>> >>> If you can give that a try and let us know how it fares, we might
>> >>> have some further input to share.
>> >>>
>> >>>
>> >>> On Aug 13, 2013, at 11:54 AM, Kevin Osborn <kevin.osb...@cbsi.com>
>> >>> wrote:
>> >>>
>> >>>> I am using SolrCloud 4.4 with pretty much a base configuration.
>> >>>> We have 2 servers and 3 collections: Collection1 has 1 shard, and
>> >>>> Collection2 and Collection3 both have 2 shards. Both servers are
>> >>>> identical.
>> >>>>
>> >>>> So, here is my process: I do a lot of queries on Collection1 and
>> >>>> Collection2, and I then do a bunch of inserts into Collection3 as
>> >>>> CSV uploads. I am also doing custom shard routing; all the
>> >>>> products in a single upload have the same shard key. All Solr
>> >>>> interaction is through SolrJ with full Zookeeper awareness. My
>> >>>> uploads are also using soft commits.
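>> >>>>
>> >>>> In SolrJ terms, a chunk upload looks roughly like this (a
>> >>>> stripped-down sketch rather than my actual code; the ZooKeeper
>> >>>> hosts and file name are made up, and the compositeId-style
>> >>>> "shardKey!id" prefix stands in for the custom shard routing):
>> >>>>
>> >>>>   import java.io.File;
>> >>>>   import org.apache.solr.client.solrj.impl.CloudSolrServer;
>> >>>>   import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
>> >>>>
>> >>>>   public class ChunkUpload {
>> >>>>       public static void main(String[] args) throws Exception {
>> >>>>           // ZooKeeper-aware client: cluster state comes from ZK.
>> >>>>           CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181");
>> >>>>           solr.setDefaultCollection("collection3");
>> >>>>
>> >>>>           // One CSV chunk. Ids look like "catalogA!12345", so every
>> >>>>           // record in this upload routes by the "catalogA" shard key.
>> >>>>           ContentStreamUpdateRequest req =
>> >>>>                   new ContentStreamUpdateRequest("/update");
>> >>>>           req.addFile(new File("chunk-0001.csv"), "text/csv");
>> >>>>           solr.request(req);
>> >>>>
>> >>>>           // After the last chunk of a catalog: soft commit only.
>> >>>>           // commit(waitFlush, waitSearcher, softCommit)
>> >>>>           solr.commit(true, true, true);
>> >>>>           solr.shutdown();
>> >>>>       }
>> >>>>   }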
>> >>>>
>> >>>> I tried this on a record set of 936 products, and everything
>> >>>> worked fine. I then sent over a record set of 300k products. The
>> >>>> upload into Collection3 is chunked; I tried chunk sizes of both
>> >>>> 1,000 and 200,000 with similar results. The first upload to Solr
>> >>>> would just hang, with simply no response from Solr. A few of the
>> >>>> products from this request would make it into the index, but not
>> >>>> many.
>> >>>>
>> >>>> In this state, queries continued to work, but deletes did not.
>> >>>>
>> >>>> My only solution was to kill each Solr process.
>> >>>>
>> >>>> As an experiment, I reset everything and did the large catalog
>> >>>> first. With a chunk size of 1,000, about 110,000 out of 300,000
>> >>>> records made it into Solr before the process hung. Again, queries
>> >>>> worked, but deletes did not, and I had to kill Solr. It hung after
>> >>>> about 30 seconds, which timing-wise is at about the second
>> >>>> autocommit cycle, given the default autocommit of 15 seconds. I
>> >>>> am not sure whether this is related or not.
>> >>>>
>> >>>> As an additional experiment, I ran the entire test with just a
>> >>>> single node in the cluster. This time, everything ran fine.
>> >>>>
>> >>>> Does anyone have any ideas? Everything is pretty much default.
>> >>>> These servers are Azure VMs, although I have seen similar
>> >>>> behavior running two Solr instances on a single internal server
>> >>>> as well.
>> >>>>
>> >>>> I had also noticed similar behavior before with Solr 4.3. It
>> >>>> definitely has something to do with the clustering, but I am not
>> >>>> sure what. And I don't see any error messages (or really anything
>> >>>> else) in the Solr logs.
>> >>>>
>> >>>> Thanks.
>> >>>>
>> >>>> --
>> >>>> *KEVIN OSBORN*
>> >>>> LEAD SOFTWARE ENGINEER
>> >>>> CNET Content Solutions
>> >>>> OFFICE 949.399.8714
>> >>>> CELL 949.310.4677  SKYPE osbornk
>> >>>> 5 Park Plaza, Suite 600, Irvine, CA 92614
>> >>>
>> >>>
>> >>
>> >>
>> >> --
>> >> *KEVIN OSBORN*
>> >> LEAD SOFTWARE ENGINEER
>> >> CNET Content Solutions
>> >> OFFICE 949.399.8714
>> >> CELL 949.310.4677  SKYPE osbornk
>> >> 5 Park Plaza, Suite 600, Irvine, CA 92614
>>
>>
>
>
> --
> *KEVIN OSBORN*
> LEAD SOFTWARE ENGINEER
> CNET Content Solutions
> OFFICE 949.399.8714
> CELL 949.310.4677  SKYPE osbornk
> 5 Park Plaza, Suite 600, Irvine, CA 92614
>

--
*KEVIN OSBORN*
LEAD SOFTWARE ENGINEER
CNET Content Solutions
OFFICE 949.399.8714
CELL 949.310.4677  SKYPE osbornk
5 Park Plaza, Suite 600, Irvine, CA 92614