The disk corruption is, of course, a red flag and likely the root cause. As for how it replicated, let's assume a 2-replica shard (leader + follower). If the follower ever went into full recovery, it would use old-style replication to copy down the entire index, corruption and all, from the leader. The follower can go into full recovery for a number of reasons, from being shut down for a while as indexing continued on the leader to communications burps.
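To make that concrete, here's a toy Python sketch of the idea. It is not Solr's replication code: the file names, `full_recovery` and `corrupted_files` are all made up for illustration, and it only models the assumption that a full recovery copies the leader's files byte for byte.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a hex digest used to detect changed file contents."""
    return hashlib.sha256(data).hexdigest()


def full_recovery(leader_index: dict) -> dict:
    """Toy stand-in for a full recovery: copy every file byte for byte."""
    return {name: bytes(data) for name, data in leader_index.items()}


def corrupted_files(index: dict, recorded: dict) -> list:
    """List files whose current checksum differs from the recorded one."""
    return sorted(n for n, d in index.items() if checksum(d) != recorded[n])


# Hypothetical segment files and the checksums recorded when they were written.
leader = {"_0.cfs": b"segment zero docs", "_1.cfs": b"segment one docs"}
recorded = {name: checksum(data) for name, data in leader.items()}

# Simulate on-disk damage to the leader's first segment.
leader["_0.cfs"] = b"segment zero d\x00\x00s"

# The follower falls too far behind and copies the whole index from the leader.
follower = full_recovery(leader)

print(corrupted_files(leader, recorded))    # ['_0.cfs']
print(corrupted_files(follower, recorded))  # ['_0.cfs']
```

Both copies report the same damaged file, which matches what CheckIndex showed on the primary and the replica.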

There's been a lot of work put into reducing the number of full recoveries, but much of that only came to fruition in recent Solr releases, especially starting with Solr 7.3 (SOLR-11702).

Best,
Erick

On Fri, Sep 21, 2018 at 7:17 AM Matt Pearce <m...@flax.co.uk> wrote:
>
> Hi,
>
> We've just been working with a client who had a corruption issue with
> their SolrCloud install. They're running Solr 5.3.1, with a collection
> spread across 12 shards. Each shard has a single replica.
>
> They were seeing "Index Corruption" errors when running certain queries.
> We investigated, and narrowed it down to a single shard. Using the
> Lucene CheckIndex utility, we tested both the primary and replica copies
> of the data, and found the same issue with both - the first segment,
> containing the majority of the documents, was reporting corruption. They
> were able to restore from a backup, but it would be good to get some
> idea what could have caused the problem in SolrCloud. One of the
> machines ran out of disk space last week during indexing, which we guess
> could have been the starting point for the corrupted data files.
>
> Our question is: why would the corruption have spread to the replica as
> well? Could a corrupted document be replicated and cause the replica
> index to break as well?
>
> Thanks,
>
> Matt
>
> --
> Matt Pearce
> Flax - Open Source Enterprise Search
> www.flax.co.uk