Re: PeerSync Recovery fails, starting Replication Recovery

Anca Kopetz Fri, 20 Dec 2013 02:05:42 -0800

Hi,

We used to have many "Client session timeout" messages in solr logs.


INFO org.apache.zookeeper.ClientCnxn:run:1083 - Client session timed out, have 
not heard from server in 18461ms for sessionid 0x242047fc6d77804, closing 
socket connection and attempting reconnect

Then we set the zkClientTimeout to 30 seconds. Therefore there are no more 
messages of this type in the logs.

But we get some other messages :

Leader logs (solr-39):

2013-12-18 09:45:26,052 [http-8083-74] ERROR 
org.apache.solr.update.SolrCmdDistributor:log:119  - shard update error 
StdNode: 
http://solr-40/searchsolrnodees/es_blue/:org.apache.solr.client.solrj.SolrServerException:
 IOException occured when talking to server at: 
http://solr-40/searchsolrnodees/es_blue
...
Caused by: java.net.SocketException: Connection reset

2013-12-18 09:45:26,060 [http-8083-74] INFO  
org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  - Creating new http 
client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2013-12-18 09:45:26,140 [http-8083-49] INFO  
org.apache.solr.handler.admin.CoreAdminHandler:handleWaitForStateAction:819  - 
Going to wait for coreNodeName: solr-40_searchsolrnodees_es_blue, state: 
recovering, checkLive: true, onlyIfLeader: true

Replica logs (solr-40) :

2013-12-18 09:45:26,083 [http-8083-65] INFO  
org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705  
- It has been requested that we recover
2013-12-18 09:45:26,091 [http-8083-65] INFO  
org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  - [admin] webapp=null 
path=/admin/cores 
params={action=REQUESTRECOVERY&core=es_blue&wt=javabin&version=2} status=0 
QTime=8
...
2013-12-18 09:45:29,190 [RecoveryThread] WARN  
org.apache.solr.update.PeerSync:sync:232  - PeerSync: core=es_blue 
url=http://dc2-s6-prod-solr-40.prod.dc2.kelkoo.net:8083/searchsolrnodees too 
many updates received since start - startingUpdates no longer overlaps with our 
currentUpdates
2013-12-18 09:45:29,191 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:doRecovery:394  - PeerSync Recovery was 
not successful - trying replication. core=es_blue
2013-12-18 09:45:29,191 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:doRecovery:397  - Starting Replication 
Recovery. core=es_blue

Therefore, if I understand it right, the leader does not manage to foward the documents 
to the replica due to a "Connection reset" problem, and it asks the replica to 
recover. The replica tries to recover, it fails, and it starts a replication recovery.

Why does the leader not retry to forward the documents to the replica when it 
gets an IOException in SolrCmdDistributor ?

Solr version is 4.5.1

For the Garbage Collector, we use the settings defined here 
http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

Thank you,
Anca

On 12/19/2013 04:50 PM, Mark Miller wrote:

Sounds like you need to raise your ZooKeeper connection timeout.

Also, make sure you are using a concurrent garbage collector as a side note - 
stop the world pauses should be avoided. Just good advice :)

- Mark

On Dec 18, 2013, at 5:48 AM, Anca Kopetz 
<anca.kop...@kelkoo.com><mailto:anca.kop...@kelkoo.com> wrote:



Hi,

In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to 
time into recovering state, and it takes more than 10 minutes to finish to 
recover.

In logs, we see that "PeerSync Recovery" fails with the message :

PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates 
received since start - startingUpdates no longer overlaps with our 
currentUpdates

Then "Replication Recovery" starts.

Is there something we can do to avoid the failure of "Peer Recovery" so that 
the recovery process is more rapid (less than 10 minutes) ?

The full trace log is here :

2013-12-05 13:51:53,740 [http-8080-46] INFO  
org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705  
- It has been requested that we recover
2013-12-05 13:51:53,740 [http-8080-112] INFO  
org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705  
- It has been requested that we recover
2013-12-05 13:51:53,740 [http-8080-112] INFO  
org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  - [admin] webapp=null 
path=/admin/cores 
params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0 
QTime=0
2013-12-05 13:51:53,740 [Thread-1544] INFO  
org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
state=recovering
2013-12-05 13:51:53,741 [http-8080-46] INFO  
org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  - [admin] webapp=null 
path=/admin/cores 
params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0 
QTime=1
2013-12-05 13:51:53,740 [Thread-1543] INFO  
org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
state=recovering
2013-12-05 13:51:53,743 [Thread-1544] INFO  
org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on 
descriptor - reading it from system property
2013-12-05 13:51:53,746 [Thread-1543] INFO  
org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on 
descriptor - reading it from system property
2013-12-05 13:51:53,755 [Thread-1543] WARN  
org.apache.solr.cloud.RecoveryStrategy:close:105  - Stopping recovery for 
zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green
2013-12-05 13:51:53,756 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery process.  
core=fr_green recoveringAfterStartup=false
2013-12-05 13:51:53,762 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery 
process. core=fr_green
2013-12-05 13:51:53,762 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery process.  
core=fr_green recoveringAfterStartup=false
2013-12-05 13:51:53,765 [RecoveryThread] INFO  
org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
state=recovering
2013-12-05 13:51:53,765 [RecoveryThread] INFO  
org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on 
descriptor - reading it from system property
2013-12-05 13:51:53,767 [RecoveryThread] INFO  
org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  - Creating new http 
client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2013-12-05 13:51:54,777 [main-EventThread] INFO  
org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
2013-12-05 13:51:56,804 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:doRecovery:356  - Attempting to PeerSync 
from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green - 
recoveringAfterStartup=false
2013-12-05 13:51:56,806 [RecoveryThread] WARN  
org.apache.solr.update.PeerSync:sync:232  - PeerSync: core=fr_green 
url=http://solr-08/searchsolrnodefr too many updates received since start - 
startingUpdates no longer overlaps with our currentUpdates
2013-12-05 13:51:56,806 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:doRecovery:394  - PeerSync Recovery was 
not successful - trying replication. core=fr_green
2013-12-05 13:51:56,806 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:doRecovery:397  - Starting Replication 
Recovery. core=fr_green
2013-12-05 13:51:56,806 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:doRecovery:399  - Begin buffering 
updates. core=fr_green
2013-12-05 13:51:56,806 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:replicate:127  - Attempting to replicate 
from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green
2013-12-05 13:51:56,806 [RecoveryThread] INFO  
org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  - Creating new http 
client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2013-12-05 13:52:01,203 [RecoveryThread] INFO  
org.apache.solr.handler.SnapPuller:<init>:211  -  No value set for 
'pollInterval'. Timer Task not started.
2013-12-05 13:52:01,209 [RecoveryThread] INFO  
org.apache.solr.handler.SnapPuller:fetchLatestIndex:369  - Master's generation: 
4382
2013-12-05 13:52:01,209 [RecoveryThread] INFO  
org.apache.solr.handler.SnapPuller:fetchLatestIndex:370  - Slave's generation: 
4382
2013-12-05 13:52:01,209 [RecoveryThread] INFO  
org.apache.solr.handler.SnapPuller:fetchLatestIndex:371  - Starting replication 
process
2013-12-05 13:52:01,213 [RecoveryThread] INFO  
org.apache.solr.handler.SnapPuller:fetchLatestIndex:376  - Number of files in 
latest index in master: 158
2013-12-05 13:52:01,213 [RecoveryThread] INFO  
org.apache.solr.core.CachingDirectoryFactory:get:360  - return new directory 
for /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213
2013-12-05 13:52:01,213 [RecoveryThread] INFO  
org.apache.solr.handler.SnapPuller:fetchLatestIndex:402  - Starting download to 
org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213<mailto:org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213>
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@697d36b7 fullCopy=true
2013-12-05 13:52:21,970 [main-EventThread] INFO  
org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
2013-12-05 13:55:15,039 [main-EventThread] INFO  
org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
2013-12-05 13:56:27,160 [RecoveryThread] INFO  
org.apache.solr.handler.SnapPuller:fetchLatestIndex:406  - Total time taken for 
download : 265 secs
2013-12-05 13:56:27,540 [RecoveryThread] INFO  
org.apache.solr.handler.SnapPuller:modifyIndexProps:886  - New index installed. 
Updating index properties... index=index.20131205135201213
2013-12-05 13:56:27,546 [RecoveryThread] INFO  
org.apache.solr.handler.SnapPuller:fetchLatestIndex:461  - removing old index 
directory 
org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724<mailto:org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724>
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@53949838
2013-12-05 13:56:27,765 [RecoveryThread] INFO  
org.apache.solr.search.SolrIndexSearcher:<init>:197  - Opening 
Searcher@30a544f3 main
2013-12-05 13:58:52,337 [RecoveryThread] INFO  
org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:235  - looking to close 
/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724[CachedDir<<refCount=0;path=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724;done=true>>]
2013-12-05 13:58:52,337 [RecoveryThread] INFO  
org.apache.solr.core.CachingDirectoryFactory:close:304  - Closing directory: 
/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724
2013-12-05 13:58:52,337 [RecoveryThread] INFO  
org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:279  - Removing 
directory before core close: 
/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724
2013-12-05 13:58:54,172 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:replay:506  - Replaying buffered 
documents. core=fr_green
2013-12-05 13:58:54,172 [recoveryExecutor-7-thread-2] WARN  
org.apache.solr.update.UpdateLog:doReplay:1240  - Starting log replay 
tlog{file=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/tlog/tlog.0000000000000004091
 refcount=2} active=true starting pos=1103833347
2013-12-05 13:59:16,660 [recoveryExecutor-7-thread-2] INFO  
org.apache.solr.search.SolrIndexSearcher:<init>:197  - Opening 
Searcher@5e48fb12 main
2013-12-05 14:01:11,908 [main-EventThread] INFO  
org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
2013-12-05 14:01:22,846 [recoveryExecutor-7-thread-2] WARN  
org.apache.solr.update.UpdateLog:run:1230  - Log replay finished. 
recoveryInfo=RecoveryInfo{adds=39345 deletes=0 deleteByQuery=0 errors=0 
positionOfStart=1103833347}
2013-12-05 14:01:22,846 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:doRecovery:410  - Replication Recovery 
was successful - registering as Active. core=fr_green
2013-12-05 14:01:22,846 [RecoveryThread] INFO  
org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
state=active
2013-12-05 14:01:22,850 [RecoveryThread] INFO  
org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on 
descriptor - reading it from system property
2013-12-05 14:01:22,860 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery 
process. core=fr_green

Best regards,
Anca Kopetz
--
<generic-logo-no-strap.gif>
Anca Kopetz Software engineer

E anca.kop...@kelkoo.com<mailto:anca.kop...@kelkoo.com>  Y!Messenger kelkooancak
T +33 (0)4 56 09 07 55
A 4/6 Rue des Meridiens 38130 Echirolles
<x-bottom-left.gif>   <x-bottom-middle.gif>
Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.






________________________________
Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.

Re: PeerSync Recovery fails, starting Replication Recovery

Reply via email to