Note, I'm not guaranteeing that it'll actually cure the problem, just that enough has changed since 4.7 that it'd be a good place to start.
Things have been reported off and on, but they're often pesky race conditions or something else that takes a long time to track down; perhaps you're just lucky ;)...

Erick

On Mon, Oct 6, 2014 at 8:04 PM, S.L <simpleliving...@gmail.com> wrote:
> Erick,
>
> Thanks for the suggestion. I am not sure I would be able to capture what went wrong, so upgrading to 4.10 seems easier, even though it means a day's worth of effort :). I will go ahead and upgrade and let you know, although I am surprised that this issue was never reported for 4.7 until now.
>
> Thanks again for your help!
>
> On Mon, Oct 6, 2014 at 10:52 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> I think there were some holes that would allow replicas and leaders to be out of sync that have been patched up in the last 3 releases.
>>
>> There shouldn't be anything you need to do to keep these in sync, so if you can capture what happened when things got out of sync we'll fix it. But a lot has changed in the last several months, so the first thing I'd do if possible is to upgrade to 4.10.1.
>>
>> Best,
>> Erick
>>
>> On Mon, Oct 6, 2014 at 2:41 PM, S.L <simpleliving...@gmail.com> wrote:
>> > Hi Erick,
>> >
>> > Before I tried your suggestion of issuing a commit=true update, I realized that for each shard there was at least one node that had its index directory named like index.<timestamp>.
>> >
>> > I went ahead and deleted that index directory and restarted that core, and now the index directory has synced with the other node and is properly named 'index' without any timestamp attached to it. This is now giving me consistent results for distrib=true using a load balancer. Also, distrib=false returns the expected results for a given shard.
>> >
>> > The underlying issue appears to be that in every shard the leader and the replica (follower) were out of sync.
>> >
>> > How can I prevent this from happening again?
>> >
>> > Thanks for your help!
>> >
>> > Sent from my HTC
>> >
>> > ----- Reply message -----
>> > From: "Erick Erickson" <erickerick...@gmail.com>
>> > To: <solr-user@lucene.apache.org>
>> > Subject: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
>> > Date: Fri, Oct 3, 2014 12:56 AM
>> >
>> > Hmmmm. Assuming that you aren't re-indexing the doc you're searching for...
>> >
>> > Try issuing http://blah blah:8983/solr/collection/update?commit=true. That'll force all the docs to be searchable. Does <1> still hold for the document in question? Because this is exactly backwards of what I'd expect. I'd expect, if anything, the replica (I'm trying to call it the "follower" when a distinction needs to be made since the leader is a "replica" too....) would be out of sync. This is still a Bad Thing, but the leader gets first crack at indexing things.
>> >
>> > bq: only the replica of the shard that has this key returns the result, and the leader does not,
>> >
>> > Just to be sure we're talking about the same thing. When you say "leader", you mean the shard leader, right? The filled-in circle on the graph view from the admin/cloud page.
>> >
>> > And let's see your soft and hard commit settings please.
>> >
>> > Best,
>> > Erick
>> >
>> > On Thu, Oct 2, 2014 at 9:48 PM, S.L <simpleliving...@gmail.com> wrote:
>> >> Erick,
>> >>
>> >> 0> Load balancer is out of the picture.
>> >>
>> >> 1> When I query with *distrib=false*, I get consistent results as expected for those shards that don't have the key, i.e. I don't get results back for those shards. However, I just realized that with *distrib=false* in the query for the shard that is supposed to contain the key, only the replica of that shard returns the result and the leader does not. It looks like the replica and the leader do not have the same data, and the replica seems to contain the key in the query for that shard.
>> >>
>> >> 2> By indexing I mean this collection is being populated by a web crawler.
>> >>
>> >> So it looks like 1> above is pointing to the leader and replica being out of sync for at least one shard.
>> >>
>> >> On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >>
>> >>> bq: Also, the collection is being actively indexed as I query this, could that be an issue too?
>> >>>
>> >>> Not if the documents you're searching aren't being added as you search (and all your autocommit intervals have expired).
>> >>>
>> >>> I would turn off indexing for testing, it's just one more variable that can get in the way of understanding this.
>> >>>
>> >>> Do note that if the problem were endemic to Solr, there would probably be a _lot_ more noise out there.
>> >>>
>> >>> So to recap:
>> >>> 0> we can take the load balancer out of the picture altogether.
>> >>>
>> >>> 1> when you query each shard individually with &distrib=false, every replica in a particular shard returns the same count.
>> >>>
>> >>> 2> when you query without &distrib=false you get varying counts.
>> >>>
>> >>> This is very strange and not at all expected. Let's try it again without indexing going on....
>> >>>
>> >>> And what do you mean by "indexing" anyway? How are documents being fed to your system?
>> >>>
>> >>> Best,
>> >>> Erick@PuzzledAsWell
>> >>>
>> >>> On Thu, Oct 2, 2014 at 7:32 PM, S.L <simpleliving...@gmail.com> wrote:
>> >>> > Erick,
>> >>> >
>> >>> > I would like to add that the interesting behavior, i.e. point #2 that I mentioned in my earlier reply, happens in all the shards. If this were a distributed search issue, it should not have manifested itself in the shard that contains the key that I am searching for. It looks like the search is just failing as a whole intermittently.
>> >>> >
>> >>> > Also, the collection is being actively indexed as I query this, could that be an issue too?
>> >>> >
>> >>> > Thanks.
>> >>> >
>> >>> > On Thu, Oct 2, 2014 at 10:24 PM, S.L <simpleliving...@gmail.com> wrote:
>> >>> >
>> >>> >> Erick,
>> >>> >>
>> >>> >> Thanks for your reply, I tried your suggestions.
>> >>> >>
>> >>> >> 1. When not using the load balancer, if *I have distrib=false* I get consistent results across the replicas.
>> >>> >>
>> >>> >> 2. However, here's the interesting part: while not using the load balancer, if I *don't have distrib=false*, then when I query a particular node I get the same behaviour as if I were using a load balancer, meaning the distributed search from a node works intermittently. Does this give any clue?
>> >>> >>
>> >>> >> On Thu, Oct 2, 2014 at 7:47 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >>> >>
>> >>> >>> Hmmm, nothing quite makes sense here....
>> >>> >>>
>> >>> >>> Here are some experiments:
>> >>> >>> 1> avoid the load balancer and issue queries like
>> >>> >>> http://solr_server:8983/solr/collection/select?q=whatever&distrib=false
>> >>> >>>
>> >>> >>> the &distrib=false bit will keep SolrCloud from trying to send the queries anywhere, they'll be served only from the node you address them to. That'll help check whether the nodes are consistent. You should be getting back the same results from each replica in a shard (i.e. 2 of your 6 machines).
>> >>> >>>
>> >>> >>> Next, try your failing query the same way.
>> >>> >>>
>> >>> >>> Next, try your failing query from a browser, pointing it at successive nodes.
>> >>> >>>
>> >>> >>> Where is the first place problems show up?
>> >>> >>>
>> >>> >>> My _guess_ is that your load balancer isn't quite doing what you think, or your cluster isn't set up the way you think it is, but those are guesses.
>> >>> >>>
>> >>> >>> Best,
>> >>> >>> Erick
>> >>> >>>
>> >>> >>> On Thu, Oct 2, 2014 at 2:51 PM, S.L <simpleliving...@gmail.com> wrote:
>> >>> >>> > Hi All,
>> >>> >>> >
>> >>> >>> > I am trying to query a 6 node Solr 4.7 cluster with 3 shards and a replication factor of 2.
>> >>> >>> >
>> >>> >>> > I have fronted these 6 Solr nodes using a load balancer. What I notice is that every time I do a search of the form q=*:*&fq=(id:9e78c064-919f-4ef3-b236-dc66351b4acf) it gives me a result only once in every 3 tries, telling me that the load balancer is distributing the requests between the 3 shards and SolrCloud only returns a result if the request goes to the core that has that id.
>> >>> >>> >
>> >>> >>> > However, if I do a simple search like q=*:*, I consistently get the right aggregated results back, covering all the documents across all the shards, for every request from the load balancer. Can someone please let me know what this is symptomatic of?
>> >>> >>> >
>> >>> >>> > Somehow SolrCloud seems to be doing search query distribution and aggregation for queries of type *:* only.
>> >>> >>> >
>> >>> >>> > Thanks.
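
For anyone retracing the checks in this thread, here is a rough sketch of the per-replica comparison (query each core with distrib=false, then see whether the replicas of a shard agree). It is only an illustration under assumptions: the function names replica_query_url and out_of_sync_shards are made up for this note, the host and collection names are placeholders, and the hit counts are expected to be collected separately with indexing paused and commits done, as suggested above.

```python
from urllib.parse import urlencode


def replica_query_url(core_url, query, filter_query=None):
    """Build a /select URL that is answered only by the addressed core."""
    params = [("q", query)]
    if filter_query:
        params.append(("fq", filter_query))
    # distrib=false keeps SolrCloud from fanning the query out to other shards;
    # rows=0 asks only for the hit count (numFound), not the documents.
    params += [("distrib", "false"), ("rows", "0"), ("wt", "json")]
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def out_of_sync_shards(counts_by_shard):
    """Return the shards whose replicas report different hit counts.

    counts_by_shard maps shard name -> {replica name: numFound}.
    """
    return sorted(
        shard
        for shard, counts in counts_by_shard.items()
        if len(set(counts.values())) > 1
    )
```

Fetching each URL and reading response.numFound from the JSON reply gives the numbers to feed into out_of_sync_shards. Any shard it returns has a leader and follower that disagree, which is the symptom described earlier in the thread.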