We tested the query on all replicas for the given shard, and they all have
the same issue. So deleting and adding another replica won't fix the
problem since the leader is exhibiting the behavior as well. I believe the
second replica was moved (new one added, old one deleted) between nodes and
so was just a copy of the leader's index after the problematic merge
happened.
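
For anyone following along, the per-replica checks look roughly like the
following (distrib=false per your earlier suggestion; the host and core
names are placeholders in the pattern from your example, and the fq is the
same elided filter set as before):

http://solr_host:8983/solr/collection1_shard1_replica1/select?q=*:*&facet=true&facet.field=market&rows=0&fq=<some filters>&distrib=false
http://solr_host:8983/solr/collection1_shard1_replica2/select?q=*:*&facet=true&facet.field=market&rows=0&fq=<some filters>&distrib=false

Each replica shows the same intermittent empty facet results.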

bq: Anything that didn't merge old segments, just threw them
away when empty (which was my idea) would possibly require as much
disk space as the index currently occupied, so doesn't help your
disk-constrained situation.

Something like this was originally what I thought might fix the issue: if
we reindex the data for the affected shard, it would delete all of the docs
in the old segments, which could then just be dropped instead of merged.
But as you mentioned, you'd expect the problems to persist through
subsequent merges. So I've got two questions:

1) If the problem persists through merges, does it affect only the segments
that were actually merged, so that when Solr goes looking for the values in
those segments it comes up empty, rather than every segment being affected
by a single merge it wasn't part of?

2) Is it expected that any large tainted segments will eventually merge
with clean segments once enough docs are deleted from the large ones,
resulting in even more tainted segments? (I've included a quick
segment-inspection sketch below that we could use to check what each
segment thinks about the field.)
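
One thing we could try, to answer these ourselves, is to copy one replica's
index off to the side and walk its segments to see what each one recorded
for the field. Something like the sketch below should do it, I think --
it's untested, based on my reading of the Lucene 6.x javadocs, and the
class name, index path, and field name are just specific to our setup:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.FSDirectory;

public class CheckDocValuesPerSegment {
  public static void main(String[] args) throws Exception {
    // args[0] is the path to a *copy* of a core's data/index directory
    try (DirectoryReader reader =
             DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
      // One leaf per segment; print what each segment recorded for "market"
      for (LeafReaderContext ctx : reader.leaves()) {
        FieldInfo fi = ctx.reader().getFieldInfos().fieldInfo("market");
        System.out.println(ctx.reader() + " -> "
            + (fi == null ? "field not present"
                          : "docValues=" + fi.getDocValuesType()));
      }
    }
  }
}

If some segments report docValues=NONE (or no field at all) while others
report SORTED, I'd assume that lines up with the merge explanation.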

Also, we aren't as disk-constrained as we were previously. Reindexing a
subset of docs is possible, but a full, clean reindex of the collection
isn't.
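
In the meantime, adding facet.method=enum to the affected queries, as
mentioned earlier in the thread, does give us correct counts consistently
(just more slowly), so we have something to lean on while we sort out the
reindex, e.g.:

/select?q=*:*&facet=true&facet.field=market&facet.method=enum&rows=0&fq=<some filters>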

Thanks,
Chris


On Thu, Oct 12, 2017 at 11:13 AM Erick Erickson <erickerick...@gmail.com>
wrote:

> Never mind. Anything that didn't merge old segments, just threw them
> away when empty (which was my idea) would possibly require as much
> disk space as the index currently occupied, so doesn't help your
> disk-constrained situation.
>
> Best,
> Erick
>
> On Thu, Oct 12, 2017 at 8:06 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
> > If it's _only_ on a particular replica, here's what you could do:
> > Just DELETEREPLICA on it, then ADDREPLICA to bring it back. You can
> > define the "node" parameter on ADDREPLICA to get it back on the same
> > node. Then the normal replication process would pull the entire index
> > down from the leader.
> >
> > My bet, though, is that this wouldn't really fix things. While it fixes
> > the particular case you've noticed I'd guess others would pop up. You can
> > see what replicas return what by firing individual queries at the
> > particular replica in question with &distrib=false, something like
> >
> > solr_server:port/solr/collection1_shard1_replica1/query?distrib=false&blah
> > blah blah blah
> >
> >
> > bq: It is exceedingly unfortunate that reindexing the data on that shard
> > only probably won't end up fixing the problem
> >
> > Well, we've been working on the DWIM (Do What I Mean) feature for years,
> > but progress has stalled.
> >
> > How would that work? You have two segments with vastly different
> > characteristics for a field. You could change the type, the
> > multiValued-ness, the analysis chain, there's no end to the things that
> > could go wrong. Fixing them actually _is_ impossible given how Lucene is
> > structured.
> >
> > Hmmmm, you've now given me a brainstorm I'll suggest on the JIRA
> > system after I talk to the dev list....
> >
> > Consider indexed=true stored=false. After stemming, "running" can be
> > indexed as "run". At merge time you have no way of knowing that
> > "running" was the original term so you simply couldn't fix it on merge,
> > not to mention that the performance penalty would be...er...
> > severe.
> >
> > Best,
> > Erick
> >
> > On Thu, Oct 12, 2017 at 5:53 AM, Chris Ulicny <culicny@iq.media> wrote:
> >> I thought that decision would come back to bite us somehow. At the time,
> >> we didn't have enough space available to do a fresh reindex alongside the
> >> old collection, so the only course of action available was to index over
> >> the old one, and the vast majority of its use worked as expected.
> >>
> >> We're planning on upgrading to version 7 at some point in the near
> >> future and will have enough space to do a full, clean reindex at that
> >> time.
> >>
> >> bq: This can propagate through all following segment merges IIUC.
> >>
> >> It is exceedingly unfortunate that reindexing the data on that shard
> >> only probably won't end up fixing the problem.
> >>
> >> Out of curiosity, are there any good write-ups or documentation on how
> >> two (or more) lucene segments are merged, or is it just worth looking at
> >> the source code to figure that out?
> >>
> >> Thanks,
> >> Chris
> >>
> >> On Wed, Oct 11, 2017 at 6:55 PM Erick Erickson <erickerick...@gmail.com>
> >> wrote:
> >>
> >>> bq: ...but the collection wasn't emptied first....
> >>>
> >>> This is what I'd suspect is the problem. Here's the issue: Segments
> >>> aren't merged identically on all replicas. So at some point you had
> >>> this field indexed without docValues, changed that and re-indexed. But
> >>> the segment merging could "read" the first segment it's going to merge
> >>> and think it knows about docValues for that field, when in fact that
> >>> segment had the old (non-DV) definition.
> >>>
> >>> This would not necessarily be the same on all replicas even on the
> >>> _same_ shard.
> >>>
> >>> This can propagate through all following segment merges IIUC.
> >>>
> >>> So my bet is that if you index into a new collection, everything will
> >>> be fine. You can also just delete everything first, but I usually
> >>> prefer a new collection so I'm absolutely and positively sure that the
> >>> above can't happen.
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Wed, Oct 11, 2017 at 12:51 PM, Chris Ulicny <culicny@iq.media>
> >>> wrote:
> >>> > Hi,
> >>> >
> >>> > We've run into a strange issue with our deployment of solrcloud
> >>> > 6.3.0. Essentially, a standard facet query on a string field usually
> >>> > comes back empty when it shouldn't. However, every now and again the
> >>> > query actually returns the correct values. This is only affecting a
> >>> > single shard in our setup.
> >>> >
> >>> > The behavior pattern generally looks like the query works properly
> >>> > when it hasn't been run recently, and then returns nothing after the
> >>> > query seems to have been cached (< 50ms QTime). Wait a while and you
> >>> > get the correct result followed by blanks. It doesn't matter which
> >>> > replica of the shard is queried; the results are the same.
> >>> >
> >>> > The general query in question looks like
> >>> > /select?q=*:*&facet=true&facet.field=market&rows=0&fq=<some filters>
> >>> >
> >>> > The field is defined in the schema as <field name="market"
> >>> > type="string" docValues="true"/>
> >>> >
> >>> > There are numerous other fields defined similarly, and they do not
> >>> > exhibit the same behavior when used as the facet.field value. They
> >>> > consistently return the right results on the shard in question.
> >>> >
> >>> > If we add facet.method=enum to the query, we get the correct results
> >>> > every time (though slower). So our assumption is that something is
> >>> > sporadically working when the fc method is chosen by default.
> >>> >
> >>> > A few other notes about the collection. This collection is not
> >>> > freshly indexed, but has not had any particularly bad failures beyond
> >>> > follower replicas going down due to PKIAuthentication timeouts (has
> >>> > been fixed). It has also had a full reindex after a schema change
> >>> > added docValues to some fields (including the one above), but the
> >>> > collection wasn't emptied first. We are using the composite router to
> >>> > co-locate documents.
> >>> >
> >>> > Currently, our plan is just to reindex all of the documents on the
> >>> > affected shard to see if that fixes the problem. Any ideas on what
> >>> > might be
> >>> > happening or ways to troubleshoot this are appreciated.
> >>> >
> >>> > Thanks,
> >>> > Chris
> >>>
>
