Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

Erick Erickson Mon, 06 Oct 2014 21:23:37 -0700

Note, I'm not guaranteeing that it'll actually cure the problem, just
that enough has changed since 4.7 that it'd be a good place to start.


Issues like this have been reported off and on, but they're often pesky
race conditions or something else that takes a long time to track down.
Perhaps you're just lucky ;)...

Erick

On Mon, Oct 6, 2014 at 8:04 PM, S.L <simpleliving...@gmail.com> wrote:
> Erick,
>
> Thanks for the suggestion. I am not sure I would be able to capture
> what went wrong, so upgrading to 4.10 seems easier even though it means
> a day's work of effort :). I will go ahead and upgrade and let you know,
> although I am surprised that this issue never got reported for 4.7 up until
> now.
>
> Thanks again for your help!
>
>
>
> On Mon, Oct 6, 2014 at 10:52 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> I think there were some holes that would allow replicas and leaders to
>> be out of synch that have been patched up in the last 3 releases.
>>
>> There shouldn't be anything you need to do to keep these in synch, so
>> if you can capture what happened when things got out of synch we'll
>> fix it. But a lot has changed in the last several months, so the first
>> thing I'd do if possible is to upgrade to 4.10.1.
>>
>>
>> Best,
>> Erick
>>
>> On Mon, Oct 6, 2014 at 2:41 PM, S.L <simpleliving...@gmail.com> wrote:
>> > Hi Erick,
>> >
>> > Before I tried your suggestion of issuing a commit=true update, I
>> > realized that for each shard there was at least one node whose index
>> > directory was named like index.<timestamp>.
>> >
>> > I went ahead and deleted the index directory and restarted that core, and
>> > now the index directory has synced with the other node and is properly
>> > named 'index' without any timestamp attached to it. This now gives me
>> > consistent results for distrib=true using a load balancer. Also,
>> > distrib=false returns the expected results for a given shard.
>> >
>> > The underlying issue appears to be that in every shard the leader and
>> > the replica (follower) were out of sync.
>> >
>> > How can I prevent this from happening again?
>> >
>> > Thanks for your help!
>> >
>> > Sent from my HTC
>> >
>> > ----- Reply message -----
>> > From: "Erick Erickson" <erickerick...@gmail.com>
>> > To: <solr-user@lucene.apache.org>
>> > Subject: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
>> > Date: Fri, Oct 3, 2014 12:56 AM
>> >
>> > Hmmmm. Assuming that you aren't re-indexing the doc you're searching
>> > for...
>> >
>> > Try issuing http://blah blah:8983/solr/collection/update?commit=true.
>> > That'll force all the docs to be searchable. Does <1> still hold for
>> > the document in question? Because this is exactly backwards of what
>> > I'd expect. I'd expect, if anything, the replica (I'm trying to call
>> > it the "follower" when a distinction needs to be made since the leader
>> > is a "replica" too....) would be out of sync. This is still a Bad
>> > Thing, but the leader gets first crack at indexing things.
>> >
>> > bq: only the replica of the shard that has this key returns the result
>> > , and the leader does not ,
>> >
>> > Just to be sure we're talking about the same thing. When you say
>> > "leader", you mean the shard leader, right? The filled-in circle on
>> > the graph view from the admin/cloud page.
>> >
>> > And let's see your soft and hard commit settings please.
>> >
>> > Best,
>> > Erick
>> >
>> > On Thu, Oct 2, 2014 at 9:48 PM, S.L <simpleliving...@gmail.com> wrote:
>> >> Erick,
>> >>
>> >> 0> The load balancer is out of the picture.
>> >>
>> >> 1> When I query with *distrib=false*, I get consistent results as expected
>> >> for those shards that don't have the key, i.e. I don't get results back for
>> >> those shards. However, I just realized that with *distrib=false* present
>> >> in the query for the shard that is supposed to contain the key, only the
>> >> replica of the shard that has this key returns the result, and the leader
>> >> does not. It looks like the replica and the leader do not have the same
>> >> data, and the replica seems to contain the key in the query for that shard.
>> >>
>> >> 2> By indexing I mean this collection is being populated by a web crawler.
>> >>
>> >> So it looks like 1> above points to the leader and replica being out of
>> >> sync for at least one shard.
>> >>
>> >>
>> >>
>> >> On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson <erickerick...@gmail.com>
>> >> wrote:
>> >>
>> >>> bq: Also, the collection is being actively indexed as I query this, could
>> >>> that be an issue too?
>> >>>
>> >>> Not if the documents you're searching aren't being added as you search
>> >>> (and all your autocommit intervals have expired).
>> >>>
>> >>> I would turn off indexing for testing, it's just one more variable
>> >>> that can get in the way of understanding this.
>> >>>
>> >>> Do note that if the problem were endemic to Solr, there would probably
>> >>> be a _lot_ more noise out there.
>> >>>
>> >>> So to recap:
>> >>> 0> we can take the load balancer out of the picture altogether.
>> >>>
>> >>> 1> when you query each shard individually with &distrib=false, every
>> >>> replica in a particular shard returns the same count.
>> >>>
>> >>> 2> when you query without &distrib=false you get varying counts.
>> >>>
>> >>> This is very strange and not at all expected. Let's try it again
>> >>> without indexing going on....
>> >>>
>> >>> And what do you mean by "indexing" anyway? How are documents being fed
>> >>> to your system?
>> >>>
>> >>> Best,
>> >>> Erick@PuzzledAsWell
>> >>>
>> >>> On Thu, Oct 2, 2014 at 7:32 PM, S.L <simpleliving...@gmail.com> wrote:
>> >>> > Erick,
>> >>> >
>> >>> > I would like to add that the interesting behavior, i.e. point #2 that I
>> >>> > mentioned in my earlier reply, happens in all the shards. If this were
>> >>> > a distributed search issue, it should not have manifested itself in the
>> >>> > shard that contains the key that I am searching for. It looks like the
>> >>> > search is just failing as a whole intermittently.
>> >>> >
>> >>> > Also, the collection is being actively indexed as I query this, could
>> >>> > that be an issue too?
>> >>> >
>> >>> > Thanks.
>> >>> >
>> >>> > On Thu, Oct 2, 2014 at 10:24 PM, S.L <simpleliving...@gmail.com> wrote:
>> >>> >
>> >>> >> Erick,
>> >>> >>
>> >>> >> Thanks for your reply, I tried your suggestions.
>> >>> >>
>> >>> >> 1. When not using the load balancer, if *I have distrib=false* I get
>> >>> >> consistent results across the replicas.
>> >>> >>
>> >>> >> 2. However, here's the interesting part: while not using the load
>> >>> >> balancer, if I *don't have distrib=false*, then when I query a
>> >>> >> particular node I get the same behaviour as if I were using a load
>> >>> >> balancer, meaning the distributed search from a node works
>> >>> >> intermittently. Does this give any clue?
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> On Thu, Oct 2, 2014 at 7:47 PM, Erick Erickson <erickerick...@gmail.com>
>> >>> >> wrote:
>> >>> >>
>> >>> >>> Hmmm, nothing quite makes sense here....
>> >>> >>>
>> >>> >>> Here are some experiments:
>> >>> >>> 1> avoid the load balancer and issue queries like
>> >>> >>> http://solr_server:8983/solr/collection/select?q=whatever&distrib=false
>> >>> >>>
>> >>> >>> The &distrib=false bit will keep SolrCloud from trying to send
>> >>> >>> the queries anywhere; they'll be served only from the node you address
>> >>> >>> them to.
>> >>> >>> That'll help check whether the nodes are consistent. You should be
>> >>> >>> getting back the same results from each replica in a shard (i.e. 2 of
>> >>> >>> your 6 machines).
>> >>> >>>
>> >>> >>> Next, try your failing query the same way.
>> >>> >>>
>> >>> >>> Next, try your failing query from a browser, pointing it at
>> >>> >>> successive nodes.
>> >>> >>>
>> >>> >>> Where is the first place problems show up?
>> >>> >>>
>> >>> >>> My _guess_ is that your load balancer isn't quite doing what you
>> >>> >>> think, or your cluster isn't set up the way you think it is, but
>> >>> >>> those are guesses.
>> >>> >>>
>> >>> >>> Best,
>> >>> >>> Erick
>> >>> >>>
>> >>> >>> On Thu, Oct 2, 2014 at 2:51 PM, S.L <simpleliving...@gmail.com> wrote:
>> >>> >>> > Hi All,
>> >>> >>> >
>> >>> >>> > I am trying to query a 6-node Solr 4.7 cluster with 3 shards and a
>> >>> >>> > replication factor of 2.
>> >>> >>> >
>> >>> >>> > I have fronted these 6 Solr nodes with a load balancer. What I notice
>> >>> >>> > is that every time I do a search of the form
>> >>> >>> > q=*:*&fq=(id:9e78c064-919f-4ef3-b236-dc66351b4acf) it gives me a
>> >>> >>> > result only once in every 3 tries, telling me that the load balancer is
>> >>> >>> > distributing the requests between the 3 shards and SolrCloud only
>> >>> >>> > returns a result if the request goes to the core that has that id.
>> >>> >>> >
>> >>> >>> > However, if I do a simple search like q=*:*, I consistently get the
>> >>> >>> > right aggregated results back of all the documents across all the
>> >>> >>> > shards for every request from the load balancer. Can someone please
>> >>> >>> > let me know what this is symptomatic of?
>> >>> >>> >
>> >>> >>> > Somehow SolrCloud seems to be doing search query distribution and
>> >>> >>> > aggregation for queries of type *:* only.
>> >>> >>> >
>> >>> >>> > Thanks.
>> >>> >>>
>> >>> >>
>> >>> >>
>> >>>
>>
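
The per-replica check suggested in the thread (query every node directly with
distrib=false and compare what each replica in a shard returns) can be sketched
roughly as below. This is only an illustration, not something posted in the
thread: the function names, the host names in the usage note, and the
shard-to-node mapping are hypothetical, and it assumes each node hosts a single
core of the collection and that the standard /select handler with wt=json is
available.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def replica_query_url(base_url, collection, query, filter_query=None):
    """Build a /select URL that is answered only by the addressed node."""
    params = [("q", query)]
    if filter_query:
        params.append(("fq", filter_query))
    # distrib=false stops SolrCloud from fanning the query out to other shards;
    # rows=0 because only the hit count is compared here.
    params += [("distrib", "false"), ("rows", "0"), ("wt", "json")]
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


def num_found(url, timeout=10):
    """Fetch a Solr JSON response and return its hit count."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["response"]["numFound"]


def out_of_sync_shards(counts_by_shard):
    """Return the shards whose replicas reported different hit counts.

    counts_by_shard maps shard name -> {node name: numFound}.
    """
    return sorted(
        shard
        for shard, counts in counts_by_shard.items()
        if len(set(counts.values())) > 1
    )
```

To use it, one would call num_found(replica_query_url(...)) once per node
(for example a hypothetical "http://node1:8983"), group the counts by shard,
and pass the result to out_of_sync_shards. Any shard it returns has a leader
and follower that disagree, which is the symptom described above. Indexing
should be paused and a commit issued first, otherwise differing counts may
just reflect documents that are not yet searchable.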
