I thought that decision would come back to bite us somehow. At the time, we didn't have enough space available to do a fresh reindex alongside the old collection, so the only course of action available was to index over the old one, and the vast majority of its use worked as expected.
We're planning on upgrading to version 7 at some point in the near future and will have enough space to do a full, clean reindex at that time.

bq: This can propagate through all following segment merges IIUC.

It is exceedingly unfortunate that reindexing only the data on that shard probably won't end up fixing the problem.

Out of curiosity, are there any good write-ups or documentation on how two (or more) Lucene segments are merged, or is it just worth looking at the source code to figure that out?

Thanks,
Chris

On Wed, Oct 11, 2017 at 6:55 PM Erick Erickson <erickerick...@gmail.com> wrote:

> bq: ...but the collection wasn't emptied first....
>
> This is what I'd suspect is the problem. Here's the issue: Segments
> aren't merged identically on all replicas. So at some point you had
> this field indexed without docValues, changed that and re-indexed. But
> the segment merging could "read" the first segment it's going to merge
> and think it knows about docValues for that field, when in fact that
> segment had the old (non-DV) definition.
>
> This would not necessarily be the same on all replicas even on the _same_
> shard.
>
> This can propagate through all following segment merges IIUC.
>
> So my bet is that if you index into a new collection, everything will
> be fine. You can also just delete everything first, but I usually
> prefer a new collection so I'm absolutely and positively sure that the
> above can't happen.
>
> Best,
> Erick
>
> On Wed, Oct 11, 2017 at 12:51 PM, Chris Ulicny <culicny@iq.media> wrote:
>
> > Hi,
> >
> > We've run into a strange issue with our deployment of solrcloud 6.3.0.
> > Essentially, a standard facet query on a string field usually comes back
> > empty when it shouldn't. However, every now and again the query actually
> > returns the correct values. This is only affecting a single shard in our
> > setup.
> >
> > The behavior pattern generally looks like the query works properly when it
> > hasn't been run recently, and then returns nothing after the query seems to
> > have been cached (< 50ms QTime). Wait a while and you get the correct
> > result followed by blanks. It doesn't matter which replica of the shard is
> > queried; the results are the same.
> >
> > The general query in question looks like
> > /select?q=*:*&facet=true&facet.field=market&rows=0&fq=<some filters>
> >
> > The field is defined in the schema as <field name="market" type="string"
> > docValues="true"/>
> >
> > There are numerous other fields defined similarly, and they do not exhibit
> > the same behavior when used as the facet.field value. They consistently
> > return the right results on the shard in question.
> >
> > If we add facet.method=enum to the query, we get the correct results every
> > time (though slower). So our assumption is that something is sporadically
> > working when the fc method is chosen by default.
> >
> > A few other notes about the collection. This collection is not freshly
> > indexed, but has not had any particularly bad failures beyond follower
> > replicas going down due to PKIAuthentication timeouts (has been fixed). It
> > has also had a full reindex after a schema change added docValues to some
> > fields (including the one above), but the collection wasn't emptied first.
> > We are using the composite router to co-locate documents.
> >
> > Currently, our plan is just to reindex all of the documents on the affected
> > shard to see if that fixes the problem. Any ideas on what might be
> > happening or ways to troubleshoot this are appreciated.
> >
> > Thanks,
> > Chris
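
For anyone troubleshooting the same symptom, the comparison described in the original message (default facet method versus facet.method=enum on the same field) could be scripted roughly as follows. This is only a sketch under assumptions: the base URL, the collection name, and the helper names build_facet_url, parse_facet_counts and diff_facets are hypothetical and do not come from the thread. It also assumes the default JSON response layout, where each facet field is a flat list of alternating values and counts.

```python
from urllib.parse import urlencode


def build_facet_url(base_url, collection, field, method=None, filters=()):
    """Build a /select facet query like the one in the original message.

    base_url and collection are placeholders (assumptions, not from the thread).
    """
    params = [
        ("q", "*:*"),
        ("rows", "0"),
        ("facet", "true"),
        ("facet.field", field),
        ("wt", "json"),
    ]
    if method is not None:
        # e.g. "enum" to bypass the default method chosen by Solr
        params.append(("facet.method", method))
    for fq in filters:
        params.append(("fq", fq))
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def parse_facet_counts(response, field):
    """Turn the flat [value, count, value, count, ...] list into a dict."""
    flat = (
        response.get("facet_counts", {})
        .get("facet_fields", {})
        .get(field, [])
    )
    return dict(zip(flat[0::2], flat[1::2]))


def diff_facets(default_counts, enum_counts):
    """Return {value: (default_count, enum_count)} for every mismatch."""
    values = set(default_counts) | set(enum_counts)
    return {
        v: (default_counts.get(v, 0), enum_counts.get(v, 0))
        for v in values
        if default_counts.get(v, 0) != enum_counts.get(v, 0)
    }
```

Fetching both URLs against each replica of the affected shard and passing the parsed results to diff_facets should show which values go missing when the default method returns blanks. An empty diff means the two methods agree for that request.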