The disk corruption is, of course, a red flag and likely the root cause.

As for how it replicated, let's assume a two-replica shard (leader +
follower). If the follower ever went into full recovery, it would use
old-style replication to copy down the entire index from the leader,
corrupted segments and all. A follower can go into "full recovery" for
a number of reasons, from being shut down for a while as indexing
continues on the leader to transient communication glitches.
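
To make the mechanism concrete, here is a toy model in Python. It is not
Solr code: the file names, the full_recovery() helper, and the CRC check
are all made up for illustration. The only point is that a file-level
copy carries damaged bytes along unchanged, so an integrity check fails
in the same place on both copies.

```python
import zlib


def checksum(data: bytes) -> int:
    """CRC32 of a file's bytes, standing in for a real integrity check."""
    return zlib.crc32(data)


def full_recovery(leader_files: dict) -> dict:
    """Hypothetical full recovery: copy every file byte for byte.

    Nothing is re-indexed or validated, so damage travels with the copy.
    """
    return {name: bytes(data) for name, data in leader_files.items()}


def corrupted(files: dict, expected: dict) -> list:
    """Names of files whose current checksum differs from the recorded one."""
    return sorted(n for n, d in files.items() if checksum(d) != expected[n])


# Hypothetical segment files, with checksums recorded when they were written.
leader = {"_0.cfs": b"segment zero data", "_1.cfs": b"segment one data"}
expected = {name: checksum(data) for name, data in leader.items()}

# Simulate on-disk damage on the leader, e.g. after the disk filled up.
leader["_0.cfs"] = b"segment zero da\x00\x00"

# The follower falls too far behind and pulls the whole index.
follower = full_recovery(leader)

print("leader:  ", corrupted(leader, expected))
print("follower:", corrupted(follower, expected))
```

In this sketch both lists name the same damaged file, which matches
what CheckIndex reported on the primary and the replica.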

A lot of work has gone into reducing the number of full recoveries, but
much of that only came to fruition in recent Solr releases, especially
starting with Solr 7.3 (SOLR-11702).

Best,
Erick

On Fri, Sep 21, 2018 at 7:17 AM Matt Pearce <m...@flax.co.uk> wrote:
>
> Hi,
>
> We've just been working with a client who had a corruption issue with
> their SolrCloud install. They're running Solr 5.3.1, with a collection
> spread across 12 shards. Each shard has a single replica.
>
> They were seeing "Index Corruption" errors when running certain queries.
> We investigated, and narrowed it down to a single shard. Using the
> Lucene CheckIndex utility, we tested both the primary and replica copies
> of the data, and found the same issue with both - the first segment,
> containing the majority of the documents, was reporting corruption. They
> were able to restore from a backup, but it would be good to get some
> idea what could have caused the problem in SolrCloud. One of the
> machines ran out of disk space last week during indexing, which we guess
> could have been the starting point for the corrupted data files.
>
> Our question is: why would the corruption have spread to the replica as
> well? Could a corrupted document be replicated and cause the replica
> index to break as well?
>
> Thanks,
>
> Matt
>
> --
> Matt Pearce
> Flax - Open Source Enterprise Search
> www.flax.co.uk
