Actually, I thought it worked last night, but that may have just been a fluke. Today, it is not working.
This is what I have done. I have turned off autoCommit and autoSoftCommit,
and my update requests are not sending any softCommit parameter. I am
sending the data over in chunks of 500 records, and at the end of each
complete upload I do an explicit commit. The first upload (536 records)
goes over in two chunks and works fine; after the commit, the records are
searchable as well. The second catalog is much larger (300k records) and
starts uploading about 5 minutes later. It usually hangs on the very first
chunk. If I kill the server during the hang, the upload does then work.

As a variation, I set autoCommit with maxDocs=10000 and openSearcher=false.
During that test, it hung right after the second autocommit. On both
servers, the thread dump shows a long-waiting commitScheduler thread:

  - sun.misc.Unsafe.park(Native Method)
  - java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
  - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
  - java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1079)
  - java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
  - java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
  - java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
  - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  - java.lang.Thread.run(Thread.java:722)

On the second server, I see quite a few other long-waiting threads as well.
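
For reference, the autoCommit variant I tested would look roughly like
this in solrconfig.xml (a simplified sketch of my setup; autoSoftCommit
is omitted entirely):

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- Hard commit every 10,000 docs but keep the current searcher:
           this just closes segments and rolls the tlog. -->
      <autoCommit>
        <maxDocs>10000</maxDocs>
        <openSearcher>false</openSearcher>
      </autoCommit>
      <!-- No autoSoftCommit configured for this test. -->
    </updateHandler>

With that in place, nothing should become visible until my explicit
commit at the end of the upload.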
-Kevin

On Wed, Aug 14, 2013 at 9:51 AM, Kevin Osborn <kevin.osb...@cbsi.com> wrote:

> Thanks so much for your help and for the explanations. Eventually, we
> will be doing several batches in parallel. But at least now I know where
> to look and can do some testing on various scenarios.
>
> Since we may be doing a lot of heavy uploading (while still doing a lot
> of queries), having an autoCommit interval shorter than the
> autoSoftCommit interval does sound interesting, and I will test it out.
> And then just disable softCommit on my batch uploads.
>
> Either way, I at least know where to focus my efforts.
>
> -Kevin
>
>
> On Wed, Aug 14, 2013 at 6:27 AM, Jason Hellman <
> jhell...@innoventsolutions.com> wrote:
>
>> Kevin,
>>
>> I wouldn't have considered using softCommits at all based on what I
>> understand of your use case. You appear to be loading in large batches,
>> and softCommits are better aligned to NRT search, where there is a
>> steady stream of smaller updates that need to be available immediately.
>>
>> As Erick pointed out, soft commits are all about avoiding constant
>> reopening of the index searcher, where by "constant" we mean every few
>> seconds. Provided you can wait until your batch is completed, and that
>> frequency is roughly a minute or more, you will likely find that an
>> old-fashioned hard commit (with openSearcher="true") works just fine
>> (YMMV).
>>
>> Jason
>>
>>
>> On Aug 14, 2013, at 4:51 AM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>> > Right, SOLR-5081 is possible but somewhat unlikely, given that you
>> > don't have very many nodes in your cluster.
>> >
>> > Soft commits aren't relevant to the tlog, but here's the thing: your
>> > tlogs may get replayed when you restart Solr. If they're large, this
>> > may take a long time. When you said you restarted Solr after killing
>> > it, you might have triggered this.
>> >
>> > The way to keep tlogs small is to hard commit more frequently (you
>> > should look at their size before worrying about it, though!). If you
>> > set openSearcher=false, this is pretty inexpensive: all it really
>> > does is close the current segment files, open new ones, and start a
>> > new tlog file. It does _not_ invalidate caches, do autowarming, or
>> > any of that expensive stuff.
>> >
>> > Your soft commit does _not_ improve performance! It is just "less
>> > expensive" than a hard commit with openSearcher=true. It _does_
>> > invalidate caches, fire off autowarming, etc. So it does "improve
>> > performance" over doing hard commits with openSearcher=true at the
>> > same frequency, but it still isn't free. It's still good to have the
>> > soft commit interval as long as you can tolerate.
>> >
>> > It's perfectly reasonable to have a hard commit interval that's much
>> > shorter than your soft commit interval. As Yonik explained once,
>> > "soft commits are about visibility but hard commits are about
>> > durability".
>> >
>> > Best,
>> > Erick
>> >
>> >
>> > On Wed, Aug 14, 2013 at 2:20 AM, Kevin Osborn <kevin.osb...@cbsi.com> wrote:
>> >
>> >> Interesting, that did work. Do you or anyone else have any ideas on
>> >> what I should look at? While soft commit is not a requirement in my
>> >> project, my understanding is that it should help performance. On
>> >> the same index, I will be doing both a large number of queries and
>> >> a large number of updates.
>> >>
>> >> If I have to disable autoCommit, should I increase the chunk size?
>> >>
>> >> Of course, I will have to run a larger-scale test tomorrow, but I
>> >> saw this problem fairly consistently in my smaller test.
>> >>
>> >> In a previous experiment, I applied the SOLR-4816 patch that someone
>> >> indicated might help, and I also reduced the CSV upload chunk size
>> >> to 500. It seemed like things got a little better, but it still
>> >> eventually hung.
>> >>
>> >> I also see SOLR-5081, but I don't know whether that is my issue. At
>> >> least in my test, the index writes are not parallel as in the
>> >> ticket.
>> >>
>> >> -Kevin
>> >>
>> >>
>> >> On Tue, Aug 13, 2013 at 8:40 PM, Jason Hellman <
>> >> jhell...@innoventsolutions.com> wrote:
>> >>
>> >>> While I don't have any past history with this issue to use as a
>> >>> reference, if I were in your shoes I would consider trying your
>> >>> updates with softCommit disabled. My suspicion is that you're
>> >>> experiencing some issue with the transaction logging and how it's
>> >>> managed when your hard commit occurs.
>> >>>
>> >>> If you can give that a try and let us know how it fares, we might
>> >>> have some further input to share.
>> >>>
>> >>>
>> >>> On Aug 13, 2013, at 11:54 AM, Kevin Osborn <kevin.osb...@cbsi.com>
>> >>> wrote:
>> >>>
>> >>>> I am using SolrCloud 4.4 with pretty much a base configuration.
>> >>>> We have 2 servers and 3 collections: Collection1 has 1 shard, and
>> >>>> Collection2 and Collection3 both have 2 shards. Both servers are
>> >>>> identical.
>> >>>>
>> >>>> So, here is my process: I do a lot of queries on Collection1 and
>> >>>> Collection2, and I then do a bunch of inserts into Collection3 as
>> >>>> CSV uploads. I am also doing custom shard routing; all the
>> >>>> products in a single upload have the same shard key. All Solr
>> >>>> interaction is through SolrJ with full Zookeeper awareness. My
>> >>>> uploads are also using soft commits.
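>> >>>>
>> >>>> In SolrJ terms, a chunk upload looks roughly like this (a
>> >>>> stripped-down sketch rather than my actual code; the ZooKeeper
>> >>>> hosts and file name are made up, and the compositeId-style
>> >>>> "shardKey!id" prefix stands in for the custom shard routing):
>> >>>>
>> >>>>   import java.io.File;
>> >>>>   import org.apache.solr.client.solrj.impl.CloudSolrServer;
>> >>>>   import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
>> >>>>
>> >>>>   public class ChunkUpload {
>> >>>>       public static void main(String[] args) throws Exception {
>> >>>>           // ZooKeeper-aware client: cluster state comes from ZK.
>> >>>>           CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181");
>> >>>>           solr.setDefaultCollection("collection3");
>> >>>>
>> >>>>           // One CSV chunk. Ids look like "catalogA!12345", so every
>> >>>>           // record in this upload routes by the "catalogA" shard key.
>> >>>>           ContentStreamUpdateRequest req =
>> >>>>                   new ContentStreamUpdateRequest("/update");
>> >>>>           req.addFile(new File("chunk-0001.csv"), "text/csv");
>> >>>>           solr.request(req);
>> >>>>
>> >>>>           // After the last chunk of a catalog: soft commit only.
>> >>>>           // commit(waitFlush, waitSearcher, softCommit)
>> >>>>           solr.commit(true, true, true);
>> >>>>           solr.shutdown();
>> >>>>       }
>> >>>>   }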
>> >>>>
>> >>>> I tried this on a record set of 936 products, and everything
>> >>>> worked fine. I then sent over a record set of 300k products. The
>> >>>> upload into Collection3 is chunked; I tried chunk sizes of both
>> >>>> 1,000 and 200,000 with similar results. The first upload to Solr
>> >>>> would just hang, with simply no response from Solr. A few of the
>> >>>> products from this request would make it into the index, but not
>> >>>> many.
>> >>>>
>> >>>> In this state, queries continued to work, but deletes did not.
>> >>>>
>> >>>> My only solution was to kill each Solr process.
>> >>>>
>> >>>> As an experiment, I reset everything and did the large catalog
>> >>>> first. With a chunk size of 1,000, about 110,000 out of 300,000
>> >>>> records made it into Solr before the process hung. Again, queries
>> >>>> worked, but deletes did not, and I had to kill Solr. It hung after
>> >>>> about 30 seconds, which timing-wise is at about the second
>> >>>> autocommit cycle, given the default autocommit of 15 seconds. I
>> >>>> am not sure whether this is related or not.
>> >>>>
>> >>>> As an additional experiment, I ran the entire test with just a
>> >>>> single node in the cluster. This time, everything ran fine.
>> >>>>
>> >>>> Does anyone have any ideas? Everything is pretty much default.
>> >>>> These servers are Azure VMs, although I have seen similar
>> >>>> behavior running two Solr instances on a single internal server
>> >>>> as well.
>> >>>>
>> >>>> I had also noticed similar behavior before with Solr 4.3. It
>> >>>> definitely has something to do with the clustering, but I am not
>> >>>> sure what. And I don't see any error messages (or really anything
>> >>>> else) in the Solr logs.
>> >>>>
>> >>>> Thanks.
>> >>>>
>> >>>> --
>> >>>> *KEVIN OSBORN*
>> >>>> LEAD SOFTWARE ENGINEER
>> >>>> CNET Content Solutions
>> >>>> OFFICE 949.399.8714
>> >>>> CELL 949.310.4677  SKYPE osbornk
>> >>>> 5 Park Plaza, Suite 600, Irvine, CA 92614
>> >>>
>> >>>
>> >>
>> >>
>> >> --
>> >> *KEVIN OSBORN*
>> >> LEAD SOFTWARE ENGINEER
>> >> CNET Content Solutions
>> >> OFFICE 949.399.8714
>> >> CELL 949.310.4677  SKYPE osbornk
>> >> 5 Park Plaza, Suite 600, Irvine, CA 92614
>>
>>
>
>
> --
> *KEVIN OSBORN*
> LEAD SOFTWARE ENGINEER
> CNET Content Solutions
> OFFICE 949.399.8714
> CELL 949.310.4677  SKYPE osbornk
> 5 Park Plaza, Suite 600, Irvine, CA 92614
>

--
*KEVIN OSBORN*
LEAD SOFTWARE ENGINEER
CNET Content Solutions
OFFICE 949.399.8714
CELL 949.310.4677  SKYPE osbornk
5 Park Plaza, Suite 600, Irvine, CA 92614