If you google "replication can cause index corruption", there are two JIRA issues that are the most likely cause of corruption in a SolrCloud environment.
- Mark

> On Mar 5, 2015, at 2:20 PM, Garth Grimm <garthgr...@averyranchconsulting.com> wrote:
>
> For updates, the document will always get routed to the leader of the
> appropriate shard, no matter what server first receives the request.
>
> -----Original Message-----
> From: Martin de Vries [mailto:mar...@downnotifier.com]
> Sent: Thursday, March 05, 2015 4:14 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solrcloud Index corruption
>
> Hi Erick,
>
> Thank you for your detailed reply.
>
> You say that in our case some docs didn't make it to the node, but that's
> not really true: the docs can be found on the corrupted nodes when I search
> on ID, and the docs are complete. The problem is that the docs do not appear
> when I filter on certain fields (even though the fields are in the doc and
> have the right value when I search on ID). So something seems to be corrupt
> in the filter index. We will try CheckIndex; hopefully it is able to
> identify the problematic cores.
>
> I understand there is no "master" in SolrCloud. In our case we use haproxy
> as a load balancer for every request, so when indexing, every document is
> sent to a different Solr server, one immediately after the other. Maybe
> SolrCloud is not able to handle that correctly?
>
> Thanks,
>
> Martin
>
>
> Erick Erickson schreef op 05.03.2015 19:00:
>
>> Wait up. There's no "master" index in SolrCloud. Raw documents are
>> forwarded to each replica, indexed and put in the local tlog. If a
>> replica falls too far out of sync (say you take it offline), then the
>> entire index _can_ be replicated from the leader and, if the leader's
>> index was incomplete, that might propagate the error.
>>
>> The practical consequence of this is that if _any_ replica has a
>> complete index, you can recover. Before going there, though, the
>> brute-force approach is to just re-index everything from scratch.
>> That's likely easier, especially on indexes this size.
>>
>> Here's what I'd do.
>>
>> Assuming you have the Collections API calls ADDREPLICA and
>> DELETEREPLICA available, then:
>> 0> Identify the complete replicas. If you're lucky you have at least
>> one for each shard.
>> 1> Copy one good index from each shard somewhere, just to have a backup.
>> 2> DELETEREPLICA on all the incomplete replicas.
>> 2.5> I might shut down all the nodes at this point and check that all
>> the cores I'd deleted were gone. If any remnants exist, 'rm -rf
>> deleted_core_dir'.
>> 3> ADDREPLICA to get back the ones removed in step 2. This should copy
>> the entire index from the leader for each replica. As you do this, the
>> leadership will change, and after you've deleted all the incomplete
>> replicas, one of the complete ones will be the leader and you should
>> be OK.
>>
>> If you don't want to/can't use the Collections API, then:
>> 0> Identify the complete replicas. If you're lucky you have at least
>> one for each shard.
>> 1> Shut 'em all down.
>> 2> Copy the good index somewhere, just to have a backup.
>> 3> 'rm -rf data' for all the incomplete cores.
>> 4> Bring up the good cores.
>> 5> Bring up the cores that you deleted the data dirs from.
>>
>> What step 5 should do is replicate the entire index from the leader.
>> When you restart the good cores (step 4 above), they'll _become_ the
>> leader.
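For reference, DELETEREPLICA and ADDREPLICA are plain HTTP calls to the Collections API, so they can be issued with curl. A rough sketch only, assuming Solr is listening on localhost:8983 and using placeholder collection, shard, and replica names (the core_node name to delete can be read from the Cloud tab or clusterstate.json):

  curl "http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard1&replica=core_node2"
  curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1"

ADDREPLICA also accepts an optional node parameter (e.g. &node=10.0.0.5:8983_solr) to control where the new replica is created; if it is omitted, Solr picks a node itself.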
>> bq: Is it possible to make Solrcloud invulnerable to network problems?
>> I'm a little surprised that this is happening. It sounds like the
>> network problems were such that some nodes weren't out of touch long
>> enough for Zookeeper to sense that they were down and put them into
>> recovery. Not sure there's any way to secure against that.
>>
>> bq: Is it possible to see if a core is corrupt?
>> There's "CheckIndex"; here's at least one link:
>> http://java.dzone.com/news/lucene-and-solrs-checkindex
>> What you're describing, though, is that docs just didn't make it to
>> the node, _not_ that the index has unexpected bits, bad disk sectors
>> and the like, so CheckIndex can't detect that. How would it know what
>> _should_ have been in the index?
>>
>> bq: I noticed a difference in the "Gen" column on Overview -
>> Replication. Does this mean there is something wrong?
>> You cannot infer anything from this. In particular, the merging will
>> be significantly different between a single full re-index and the
>> state of segment merges in an incrementally built index.
>>
>> The admin UI screen is rooted in the pre-cloud days; the Master/Slave
>> thing is entirely misleading. In SolrCloud, since all the raw data is
>> forwarded to all replicas, and any auto commits that happen may very
>> well be slightly out of sync, the index size, number of segments,
>> generations, and all that are pretty safely ignored.
>>
>> Best,
>> Erick
>>
>> On Thu, Mar 5, 2015 at 6:50 AM, Martin de Vries
>> <mar...@downnotifier.com> wrote:
>>
>>> Hi Andrew,
>>>
>>> Even our master index is corrupt, so I'm afraid this won't help in
>>> our case.
>>>
>>> Martin
>>>
>>> Andrew Butkus schreef op 05.03.2015 16:45:
>>>
>>>> Force a fetchindex on slave from master command:
>>>> http://slave_host:port/solr/replication?command=fetchindex - from
>>>> http://wiki.apache.org/solr/SolrReplication [1]
>>>>
>>>> The above command will download the whole index from master to
>>>> slave. There are configuration options in Solr to make this problem
>>>> happen less often (allowing it to recover from newly added documents
>>>> and only send the changes, with a wider gap), but I can't remember
>>>> what those were.
>
> Links:
> ------
> [1] http://wiki.apache.org/solr/SolrReplication
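For anyone wanting to try CheckIndex, it is a command-line tool that ships in the Lucene core jar. A minimal sketch of an invocation, with the jar version and index path as placeholders to adjust for your install (stop the core and take a backup first; the optional -fix flag drops segments it cannot read, losing the documents in them):

  java -cp lucene-core-4.10.3.jar org.apache.lucene.index.CheckIndex /var/solr/mycore/data/index

Likewise, the fetchindex call mentioned above can be issued per core; with a placeholder host and core name it would look something like:

  curl "http://slave_host:8983/solr/mycore/replication?command=fetchindex"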