Thanks for the explanation, Erick, that makes sense!

Matt

On 21/09/2018 15:50, Erick Erickson wrote:
The disk corruption is, of course, a red flag and likely the root cause.

As for how it replicated, let's assume a two-replica shard (leader +
follower). If the follower ever went into full recovery, it would use
old-style replication to copy down the entire index, corruption and
all, from the leader. The follower can go into "full recovery" for a
number of reasons, from being shut down for a while as indexing
continued on the leader to communications burps.
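
To make the mechanism concrete, here is a toy model in Python, not Solr's
actual replication code. It treats full recovery as nothing more than a
byte-for-byte copy of the leader's index files, so whatever damage the
leader's files contain arrives intact on the follower. The directory
layout, the file name `_0.cfs`, and the helper names are all
hypothetical and exist only for this sketch.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digests(index_dir):
    """Map each file name in index_dir to the SHA-256 of its bytes."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(index_dir).iterdir())
        if p.is_file()
    }


def full_recovery(leader_dir, follower_dir):
    """Toy stand-in for full recovery: discard the follower's files and
    copy every leader file byte for byte."""
    follower = Path(follower_dir)
    shutil.rmtree(follower, ignore_errors=True)
    shutil.copytree(leader_dir, follower)


def demo():
    """Return (identical_before, identical_after) for leader vs follower."""
    with tempfile.TemporaryDirectory() as tmp:
        leader = Path(tmp) / "leader"
        follower = Path(tmp) / "follower"
        leader.mkdir()
        follower.mkdir()
        # Hypothetical segment file; a real Lucene index has many more files.
        (leader / "_0.cfs").write_bytes(b"good segment data")
        (follower / "_0.cfs").write_bytes(b"good segment data")
        # Simulate damage on the leader, e.g. a truncated write on a full disk.
        (leader / "_0.cfs").write_bytes(b"good segm")
        before = file_digests(leader) == file_digests(follower)
        full_recovery(leader, follower)
        after = file_digests(leader) == file_digests(follower)
        return before, after
```

In this sketch the two copies differ while only the leader is damaged, and
match again after the simulated recovery, which is consistent with
CheckIndex reporting the same bad segment on both copies.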

There's been a lot of work put into reducing the number of full
recoveries, but much of that only came to fruition in recent Solr
releases, especially starting with Solr 7.3 (SOLR-11702).

Best,
Erick

On Fri, Sep 21, 2018 at 7:17 AM Matt Pearce <m...@flax.co.uk> wrote:

Hi,

We've just been working with a client who had a corruption issue with
their SolrCloud install. They're running Solr 5.3.1, with a collection
spread across 12 shards. Each shard has a single replica.

They were seeing "Index Corruption" errors when running certain queries.
We investigated, and narrowed it down to a single shard. Using the
Lucene CheckIndex utility, we tested both the primary and replica copies
of the data, and found the same issue with both - the first segment,
containing the majority of the documents, was reporting corruption. They
were able to restore from a backup, but it would be good to get some
idea what could have caused the problem in SolrCloud. One of the
machines ran out of disk space last week during indexing, which we guess
could have been the starting point for the corrupted data files.

Our question is: why would the corruption have spread to the replica as
well? Could a corrupted document be replicated and cause the replica
index to break as well?

Thanks,

Matt

--
Matt Pearce
Flax - Open Source Enterprise Search
www.flax.co.uk
