Hi Alex,
Thank you for your investigation. I will be able to take a deeper look at
it by Thursday or Friday. This fix is potentially very important for my
client, Fullstory, so thank you for your work on this on their behalf too.
Regards,
Ishan

On Mon, 10 Apr, 2023, 23:54 Alex Deparvu, <stilla...@apache.org> wrote:

> Hi,
>
> Gentle reminder for anyone interested to take a look at the PR. I have
> added a lot of detail to this failure (and a fix proposal) and would
> appreciate some feedback, or at least let me know if you are planning on
> taking a look at some point in the near future.
> This would help both with validating some of the findings so far and
> building some more confidence to keep digging into the 'delete' scenario
> failures.
> PR https://github.com/apache/solr/pull/1504
>
> best
> alex
>
>
>
> On Mon, Apr 3, 2023 at 9:49 AM Alex Deparvu <stilla...@apache.org> wrote:
>
> > An update on the failing test.
> >
> > The failure comes from documents having null version field from the add
> > operations running in parallel with the split shard operation.
> >
> > Digging into this a bit more (and running the test a very large number of
> > times) I think I identified 3 problematic cases that can cause failures.
> > will list them in order of importance.
> >
> > First, documents persisted with null version. I think this is by far the
> > most important one to fix,and my PR is focused on closing this gap.
> > The race condition seems to be when the new shard slice will receive
> > pending updates from the old shard leader and persist them 'as is', but
> the
> > trouble is some of these have zero version. so the docs will get
> persisted
> > without version, breaking the test's consistency checks. My tentative fix
> > is to allow the new slice to act as a master and attach a version.
> > I also added a version integrity check that will not allow
> leader-received
> > documents to be persisted without a version. I am considering moving this
> > behind some sort of system property/flag to disable on demand if it turns
> > out this is too strict.
> > As far as testing is concerned: there was no need for more tests, the
> > current one reproduces this issue easily (unfortunately it did not apply
> > stricter assertions regarding errors on the parallel operations), so I
> made
> > it a tiny bit stricter, in the hope that the test fails early and with a
> > clearer message.
> > I am looking for some thoughts and feedback on the PR itself but also on
> > understanding the impact. This smells like a data corruption situation,
> but
> > being reduced to the split shard operation, so maybe the window is not
> very
> > big. please correct me if I am overstating the problem.
> >
> > Second. Another race condition happens on the deletes as well. but due to
> > the first exception it is invisible. this one is milder, in the sense
> where
> > deletes will be rejected, so less data corruption opportunity. here as
> well
> > there is some opportunity to close this gap. I tried to do it as a part
> of
> > this PR but I hit some weirdness, so instead of sitting on this more, I
> > would prefer to have it reviewed and approach deletes in a subsequent PR.
> >
> > Third, I keep seeing yet another exception "Request says it is coming
> from
> > parent shard leader but we are in active state", which seems to be more
> > buffered requests on the old leader being sent over. I did not get any
> > chance to look at this, so cannot comment on the impact yet.
> >
> >
> > If anyone has interest (or some knowledge) in this area, please take a
> > look at the PR and let me know your thoughts.
> > PR https://github.com/apache/solr/pull/1504
> >
> > best
> > alex
> >
> >
> >
> >
> >
> > On Thu, Mar 30, 2023 at 8:06 AM Kevin Risden <kris...@apache.org> wrote:
> >
> >> I don't have any helpful pointers as to why this might be happening -
> but
> >> I
> >> do want to say thanks for digging in and finding out the cause as well
> as
> >> finding some related old jiras. Its helpful to even just make small
> steps
> >> forward.
> >>
> >> Kevin Risden
> >>
> >>
> >> On Wed, Mar 29, 2023 at 1:21 PM Alex Deparvu <stilla...@apache.org>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > Following David's recent email about the failing tests (and obviously
> >> not
> >> > knowing any better) I started to look at the ShardSplitTest failures.
> >> >
> >> > The easy part was turning the current NPEs into an actual assertion.
> The
> >> > hard part is that the documents seem to sometimes be missing the
> >> _version_
> >> > field which causes the test to fail.
> >> > If anyone has more pointers into where the cause might be, please let
> me
> >> > know I would be happy to dig further down this rabbit hole.
> >> >
> >> > This is what I have so far https://github.com/apache/solr/pull/1504
> >> >
> >> > best,
> >> > alex
> >> >
> >>
> >
>

Reply via email to