I may have a bit of good news. The ulimit for open files was set to 4096. I just chose an arbitrarily high limit (100000) and it seems to be working better now. I still have more testing to do, but the initial results are hopeful.
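In case anyone else hits this, one way to double-check that the new limit is what the Solr JVM actually sees (and not just what your shell reports) is to ask the JVM itself. This is only a minimal sketch, assuming a Unix-like JVM that exposes the com.sun.management extensions; nothing here is Solr-specific, and it needs to run under the same user/session that launches Solr:

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    public class FdLimitCheck {
        public static void main(String[] args) {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            // The file-descriptor counters are only exposed on Unix-like JVMs.
            if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
                com.sun.management.UnixOperatingSystemMXBean unix =
                        (com.sun.management.UnixOperatingSystemMXBean) os;
                System.out.println("open file descriptors: " + unix.getOpenFileDescriptorCount());
                System.out.println("max file descriptors:  " + unix.getMaxFileDescriptorCount());
            } else {
                System.out.println("File descriptor counts are not exposed on this platform.");
            }
        }
    }

If the reported max is still 4096, the new ulimit is probably not being applied to the service that starts Solr.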
On Wed, Aug 14, 2013 at 4:22 PM, Kevin Osborn <kevin.osb...@cbsi.com> wrote:

> Actually, I thought it worked last night, but that may have just been a fluke. Today, it is not working.
>
> This is what I have done.
>
> I have turned off autoCommit and softAutoCommit. My updates are not sending any softCommit messages.
>
> I am sending over data in chunks of 500 records.
>
> At the end of each complete upload, I am doing an explicit commit.
>
> So, I send over the first upload (536 records). This works fine in 2 chunks. After the commit, the records are searchable as well.
>
> The second catalog is much larger (300k records). It starts uploading about 5 minutes later. Usually, it hangs on the very first chunk. If I kill the server during the hang, it does then work.
>
> As a variation, I set autoCommit maxDocs to 10000, with openSearcher set to false. During that test, it hung right after the second autocommit.
>
> On both servers, I see a long-waiting commitScheduler thread in the thread dump:
>
>   - sun.misc.Unsafe.park(Native Method)
>   - java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   - java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1079)
>   - java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
>   - java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
>   - java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
>   - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   - java.lang.Thread.run(Thread.java:722)
>
> On the second server, I see quite a few other long-waiting threads as well.
>
> -Kevin
>
>
> On Wed, Aug 14, 2013 at 9:51 AM, Kevin Osborn <kevin.osb...@cbsi.com> wrote:
>
>> Thanks so much for your help and for the explanations. Eventually, we will be doing several batches in parallel, but at least now I know where to look and can do some testing on various scenarios.
>>
>> Since we may be doing a lot of heavy uploading (while still doing a lot of queries), having an autoCommit interval shorter than the softAutoCommit interval does sound interesting, and I will test it out. And then just disable softCommit on my batch uploads.
>>
>> Either way, I at least know where to focus my efforts.
>>
>> -Kevin
>>
>>
>> On Wed, Aug 14, 2013 at 6:27 AM, Jason Hellman <jhell...@innoventsolutions.com> wrote:
>>
>>> Kevin,
>>>
>>> I wouldn't have considered using softCommits at all based on what I understand from your use case. You appear to be loading in large batches, and softCommits are better aligned to NRT search, where there is a steady stream of smaller updates that need to be available immediately.
>>>
>>> As Erick pointed out, soft commits are all about avoiding constant reopening of the index searcher…where by constant we mean every few seconds. Provided you can wait until your batch is completed, and that frequency is roughly a minute or more, you likely will find an old-fashioned hard commit (with openSearcher="true") will work just fine (YMMV).
>>>
>>> Jason
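For anyone following along, a minimal SolrJ sketch of the upload pattern described above: chunked adds with no per-chunk commits, then one explicit hard commit once the whole batch is in. The ZooKeeper address, collection name, field names, and chunk size are placeholders, and the uploads in this thread actually go through the CSV handler rather than document adds; this only shows the commit pattern.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ChunkedUpload {
        public static void main(String[] args) throws Exception {
            // Placeholder ZooKeeper ensemble and collection name.
            CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181");
            server.setDefaultCollection("collection3");

            List<SolrInputDocument> chunk = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 300000; i++) {              // stand-in for reading the source data
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "product-" + i);         // placeholder fields
                doc.addField("name_s", "product " + i);
                chunk.add(doc);

                if (chunk.size() == 500) {                  // send in chunks; no commit per chunk
                    server.add(chunk);
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                server.add(chunk);
            }

            // One explicit hard commit at the end of the batch:
            // waitFlush=true, waitSearcher=true, softCommit=false.
            server.commit(true, true, false);
            server.shutdown();
        }
    }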
>>>
>>> On Aug 14, 2013, at 4:51 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>> > Right, SOLR-5081 is possible but somewhat unlikely, given that you actually don't have very many nodes in your cluster.
>>> >
>>> > Soft commits aren't relevant to the tlog, but here's the thing: your tlogs may get replayed when you restart Solr. If they're large, this may take a long time. When you said you restarted Solr after killing it, you might have triggered this.
>>> >
>>> > The way to keep tlogs small is to hard commit more frequently (you should look at their size before worrying about it, though!). If you set openSearcher=false, this is pretty inexpensive; all it really does is close the current segment files, open new ones, and start a new tlog file. It does _not_ invalidate caches, do autowarming, or any of that expensive stuff.
>>> >
>>> > Your soft commit does _not_ improve performance! It is just "less expensive" than a hard commit with openSearcher=true. It _does_ invalidate caches, fire off autowarming, etc. So it does "improve performance" over doing hard commits with openSearcher=true at the same frequency, but it still isn't free. It's still good to have the soft commit interval as long as you can tolerate.
>>> >
>>> > It's perfectly reasonable to have a hard commit interval that's much shorter than your soft commit interval. As Yonik explained once, "soft commits are about visibility, but hard commits are about durability".
>>> >
>>> > Best,
>>> > Erick
>>> >
>>> >
>>> > On Wed, Aug 14, 2013 at 2:20 AM, Kevin Osborn <kevin.osb...@cbsi.com> wrote:
>>> >
>>> >> Interesting, that did work. Do you or anyone else have any ideas about what I should look at? While soft commit is not a requirement in my project, my understanding is that it should help performance. On the same index, I will be doing both a large number of queries as well as updates.
>>> >>
>>> >> If I have to disable autoCommit, should I increase the chunk size?
>>> >>
>>> >> Of course, I will have to run a larger-scale test tomorrow, but I saw this problem fairly consistently in my smaller test.
>>> >>
>>> >> In a previous experiment, I applied the SOLR-4816 patch that someone indicated might help. I also reduced the CSV upload chunk size to 500. It seemed like things got a little better, but it still eventually hung.
>>> >>
>>> >> I also see SOLR-5081, but I don't know if that is my issue or not. At least in my test, the index writes are not parallel as in the ticket.
>>> >>
>>> >> -Kevin
>>> >>
>>> >>
>>> >> On Tue, Aug 13, 2013 at 8:40 PM, Jason Hellman <jhell...@innoventsolutions.com> wrote:
>>> >>
>>> >>> While I don't have a past history of this issue to use as a reference, if I were in your shoes I would consider trying your updates with softCommit disabled. My suspicion is that you're experiencing some issue with the transaction logging and how it's managed when your hard commit occurs.
>>> >>>
>>> >>> If you can give that a try and let us know how it fares, we might have some further input to share.
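To make the openSearcher=false point above concrete: an explicit hard commit issued this way closes the current segment files and starts a new tlog without reopening a searcher, so no cache invalidation or autowarming. A rough SolrJ sketch, assuming the openSearcher request parameter is honored by the /update handler in this version (worth verifying against your Solr release); the ZooKeeper address and collection name are placeholders:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.UpdateRequest;

    public class CheapHardCommit {
        // Hard commit that rolls the tlog but does not open a new searcher.
        static void hardCommitNoNewSearcher(SolrServer server) throws Exception {
            UpdateRequest req = new UpdateRequest();
            // waitFlush=true, waitSearcher=false; this is a hard commit, not a soft one.
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, false);
            req.setParam("openSearcher", "false"); // same knob as <openSearcher>false</openSearcher> in autoCommit
            req.process(server);
        }

        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("zkhost1:2181"); // placeholder ZK address
            server.setDefaultCollection("collection3");                   // placeholder collection
            hardCommitNoNewSearcher(server);
            server.shutdown();
        }
    }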
>>> >>>
>>> >>> On Aug 13, 2013, at 11:54 AM, Kevin Osborn <kevin.osb...@cbsi.com> wrote:
>>> >>>
>>> >>>> I am using Solr Cloud 4.4. It is pretty much a base configuration. We have 2 servers and 3 collections. Collection1 has 1 shard, and Collection2 and Collection3 both have 2 shards. Both servers are identical.
>>> >>>>
>>> >>>> So, here is my process. I do a lot of queries on Collection1 and Collection2. I then do a bunch of inserts into Collection3. I am doing CSV uploads. I am also doing custom shard routing; all the products in a single upload will have the same shard key. All Solr interaction is through SolrJ with full ZooKeeper awareness. My uploads also use soft commits.
>>> >>>>
>>> >>>> I tried this on a record set of 936 products. Everything worked fine. I then sent over a record set of 300k products. The upload into Collection3 is chunked; I tried chunk sizes of both 1,000 and 200,000 with similar results. The first upload to Solr would just hang. There would simply be no response from Solr. A few of the products from this request would make it into the index, but not many.
>>> >>>>
>>> >>>> In this state, queries continued to work, but deletes did not.
>>> >>>>
>>> >>>> My only solution was to kill each Solr process.
>>> >>>>
>>> >>>> As an experiment, I did the large catalog first. First, I reset everything. With a chunk size of 1,000, about 110,000 out of 300,000 records made it into Solr before the process hung. Again, queries worked, but deletes did not, and I had to kill Solr. It hung after about 30 seconds. Timing-wise, this is at about the second autocommit cycle, given the default autocommit of 15 seconds. I am not sure if this is related or not.
>>> >>>>
>>> >>>> As an additional experiment, I ran the entire test with just a single node in the cluster. This time, everything ran fine.
>>> >>>>
>>> >>>> Does anyone have any ideas? Everything is pretty much default. These servers are Azure VMs, although I have seen similar behavior running two Solr instances on a single internal server as well.
>>> >>>>
>>> >>>> I had also noticed similar behavior before with Solr 4.3. It definitely has something to do with the clustering, but I am not sure what. And I don't see any error messages (or really anything else) in the Solr logs.
>>> >>>>
>>> >>>> Thanks.
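On the custom shard routing mentioned in the original message: with the default compositeId router in Solr 4.x, giving every product in an upload the same shard key usually means prefixing the uniqueKey value with that key and a "!". A small illustrative sketch; the field names, the "catalog42" value, and the addresses are made up, not taken from this setup:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class RoutedAdd {
        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("zkhost1:2181"); // placeholder ZK address
            server.setDefaultCollection("collection3");                   // placeholder collection

            // With the compositeId router, "shardKey!docId" co-locates all
            // documents that share the same prefix on the same shard.
            String shardKey = "catalog42"; // e.g. one key per uploaded catalog (illustrative value)

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", shardKey + "!" + "sku-0001"); // routing prefix on the uniqueKey
            doc.addField("name_s", "example product");       // placeholder field
            server.add(doc);

            server.commit(); // plain hard commit, just for the demo
            server.shutdown();
        }
    }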
--
*KEVIN OSBORN*
LEAD SOFTWARE ENGINEER
CNET Content Solutions
OFFICE 949.399.8714
CELL 949.310.4677  SKYPE osbornk
5 Park Plaza, Suite 600, Irvine, CA 92614