Istvan Farkas created SOLR-15003:
------------------------------------

             Summary: SolrCloud Snapshot metadata inconsistent after core 
replication
                 Key: SOLR-15003
                 URL: https://issues.apache.org/jira/browse/SOLR-15003
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SolrCloud
    Affects Versions: 7.4.1
            Reporter: Istvan Farkas
         Attachments: snap2-before.txt, snap2-failed.png, state.json


After a replica does a full recovery, the old index directory is deleted, 
however the snapshot metadata is not updated in Zookeeper. This means that the 
affected core will have snapshots which point to a non-existing index directory.

Steps to reproduce (I used Solr 7.4.0 to test this but it likely affects newer 
versions too):

(1) Create any collection in SolrCloud with more than 1 replica per shard. 
The state.json of the testcollection2 I used: 
 [^state.json] 

(2) Start adding documents to the collection. After some documents are added, 
creating a snapshot. In this example I called it snap2.

{code}
INFO (qtp577405636-12140)-c:testcollection2----o.a.s.s.HttpSolrCall: [admin] 
webapp=null path=/admin/collections 
params={name=testcollection2&action=CREATESNAPSHOT&collection=testcollection2&commitName=snap2&wt=javabin&version=2}
 status=0 QTime=280 
{code}

The snapshot is created successfully for both cores, the metadata in ZooKeeper 
looks like this:
 [^snap2-before.txt] 

For core_node4 the index directory is 
/solr/testcollection2/core_node4/data/index. 


(3) Shut down one of the Solr servers.  Here I use the server hosting 
core_node4.

(4) Continue adding documents to the collection (add at least 100 documents to 
ensure that the replica which is shut down will go to a full replication 
recovery on the next start)

Start the server:

{code}
INFO 
(recoveryExecutor-4-thread-1-processing-n:snapshot-test-2.example.com:8985_solr 
x:testcollection2_shard1_replica_n2 s:shard1 c:testcollection2 r:core_nod
e4)-c:testcollection2-s:shard1-r:core_node4-x:testcollection2_shard1_replica_n2-o.a.s.u.PeerSyncWithLeader:
 PeerSync: core=testcollection2_shard1_replica_n2 
url=https://snapshot-test-2.example.com:8985/solr  Received 99 versions from 
https://snapshot-test-3.example.com:8985/solr/testcollection2_shard1_replica_n1/
INFO 
(recoveryExecutor-4-thread-1-processing-n:snapshot-test-2.example.com:8985_solr 
x:testcollection2_shard1_replica_n2 s:shard1 c:testcollection2 r:core_nod
e4)-c:testcollection2-s:shard1-r:core_node4-x:testcollection2_shard1_replica_n2-o.a.s.u.PeerSync:
 PeerSync: core=testcollection2_shard1_replica_n2 
url=https://snapshot-test-2.example.com:8985/solr  Our versions are too old. 
ourHighThreshold=1683104801277083650 otherLowThreshold=1683139494002294784 
ourHighest=1683104801293860865 otherHighest=1683139494085132289
INFO 
(recoveryExecutor-4-thread-1-processing-n:snapshot-test-2.example.com:8985_solr 
x:testcollection2_shard1_replica_n2 s:shard1 c:testcollection2 r:core_nod
e4)-c:testcollection2-s:shard1-r:core_node4-x:testcollection2_shard1_replica_n2-o.a.s.u.PeerSyncWithLeader:
 PeerSync: core=testcollection2_shard1_replica_n2 
url=https://snapshot-test-2.example.com:8985/solr DONE. sync failed
INFO 
(recoveryExecutor-4-thread-1-processing-n:snapshot-test-2.example.com:8985_solr 
x:testcollection2_shard1_replica_n2 s:shard1 c:testcollection2 r:core_nod
e4)-c:testcollection2-s:shard1-r:core_node4-x:testcollection2_shard1_replica_n2-o.a.s.c.RecoveryStrategy:
 PeerSync Recovery was not successful - trying replication.
INFO 
(recoveryExecutor-4-thread-1-processing-n:snapshot-test-2.example.com:8985_solr 
x:testcollection2_shard1_replica_n2 s:shard1 c:testcollection2 r:core_nod
e4)-c:testcollection2-s:shard1-r:core_node4-x:testcollection2_shard1_replica_n2-o.a.s.c.RecoveryStrategy:
 Starting Replication Recovery.
{code}

(5) After the replication is finished, the index.properties points to the new 
directory index.20201112075340480

{code}
hdfs dfs -cat  /solr/testcollection2/core_node4/data/index.properties
#index.properties
#Thu Nov 12 07:58:52 GMT+00:00 2020
index=index.20201112075340480
{code}

And the original index directory for core_node4 has been deleted:

{code}
hdfs dfs -du -h  /solr/testcollection2/core_node4/data
0      0       /solr/testcollection2/core_node4/data/index
4.5 G  8.9 G   /solr/testcollection2/core_node4/data/index.20201112075340480
84     168     /solr/testcollection2/core_node4/data/index.properties
215    430     /solr/testcollection2/core_node4/data/replication.properties
401    802     /solr/testcollection2/core_node4/data/snapshot_metadata
9.2 M  18.5 M  /solr/testcollection2/core_node4/data/tlog
{code}

The snapshot metadata in Zookeeper is exactly the same as in step (2), so the 
snap2 still points to the index directory 
/solr/testcollection2/core_node4/data/index which is empty by this time.

(6) Try deleting the snapshot snap2. 

In Zookeeper, the snapshot metadata will change: 
- the snapshot status goes to failed
- the snapshot metadata for the core_node3 (the legit core) is deleted
- the snapshot metadata for the core_node4 (which is in the stale status) will 
remain there

 !snap2-failed.png! 

The solr server log will contain the following error message:

{code}
INFO 
(qtp577405636-1029)----x:testcollection2_shard1_replica_n2-o.a.s.s.HttpSolrCall:
 [admin] webapp=null path=/admin/cores 
params={core=testcollection2_shard1_replica_n2&qt=/admin/cores&name=shard1&action=DELETESNAPSHOT&comm
itName=snap2&wt=javabin&version=2} status=500 QTime=60
ERROR 
(qtp577405636-1029)----x:testcollection2_shard1_replica_n2-o.a.s.s.HttpSolrCall:
 null:org.apache.solr.common.SolrException: Error handling 'DELETESNAPSHOT' 
action
        at 
org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:365)
        at 
org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:396)
        at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:180)
        at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
        at 
org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:799)
        at 
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:780)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:504)
        at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:376)
        at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:322)
        at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
        at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)
        at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
        at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
        at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
        at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:513)
        at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
        at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
        at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
        at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
        at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at org.eclipse.jetty.server.Server.handle(Server.java:539)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
        at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
        at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
        at 
org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:259)
        at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
        at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
        at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
        at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
        at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
        at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
        at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file 
found in 
LockValidatingDirectoryWrapper(NRTCachingDirectory(BlockDirectory(HdfsDirectory@hdfs://snapshot-test-1.example.com:8020/solr/testcollection2/core_node4/data/index
 lockFactory=org.apache.lucene.store.NoLockFactory@b06d116); maxCacheMB=192.0 
maxMergeSizeMB=16.0)): files: []
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:833)
        at 
org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:110)
        at 
org.apache.solr.core.snapshots.SolrSnapshotManager.deleteSnapshotIndexFiles(SolrSnapshotManager.java:287)
        at 
org.apache.solr.core.snapshots.SolrSnapshotManager.deleteSnapshotIndexFiles(SolrSnapshotManager.java:225)
        at org.apache.solr.core.SolrCore.deleteNamedSnapshot(SolrCore.java:542)
        at 
org.apache.solr.handler.admin.DeleteSnapshotOp.execute(DeleteSnapshotOp.java:42)
        at 
org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:360)
        ... 38 more
{code}

(7) After trying the DELETESNAPSHOT command again, it succeeds without issues, 
the metadata for snap2 is entirely removed from ZK.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to