Hi, We used to have many "Client session timeout" messages in solr logs.
INFO org.apache.zookeeper.ClientCnxn:run:1083 - Client session timed out, have not heard from server in 18461ms for sessionid 0x242047fc6d77804, closing socket connection and attempting reconnect Then we set the zkClientTimeout to 30 seconds. Therefore there are no more messages of this type in the logs. But we get some other messages : Leader logs (solr-39): 2013-12-18 09:45:26,052 [http-8083-74] ERROR org.apache.solr.update.SolrCmdDistributor:log:119 - shard update error StdNode: http://solr-40/searchsolrnodees/es_blue/:org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://solr-40/searchsolrnodees/es_blue ... Caused by: java.net.SocketException: Connection reset 2013-12-18 09:45:26,060 [http-8083-74] INFO org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false 2013-12-18 09:45:26,140 [http-8083-49] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleWaitForStateAction:819 - Going to wait for coreNodeName: solr-40_searchsolrnodees_es_blue, state: recovering, checkLive: true, onlyIfLeader: true Replica logs (solr-40) : 2013-12-18 09:45:26,083 [http-8083-65] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-18 09:45:26,091 [http-8083-65] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERY&core=es_blue&wt=javabin&version=2} status=0 QTime=8 ... 2013-12-18 09:45:29,190 [RecoveryThread] WARN org.apache.solr.update.PeerSync:sync:232 - PeerSync: core=es_blue url=http://dc2-s6-prod-solr-40.prod.dc2.kelkoo.net:8083/searchsolrnodees too many updates received since start - startingUpdates no longer overlaps with our currentUpdates 2013-12-18 09:45:29,191 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:394 - PeerSync Recovery was not successful - trying replication. core=es_blue 2013-12-18 09:45:29,191 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:397 - Starting Replication Recovery. core=es_blue Therefore, if I understand it right, the leader does not manage to foward the documents to the replica due to a "Connection reset" problem, and it asks the replica to recover. The replica tries to recover, it fails, and it starts a replication recovery. Why does the leader not retry to forward the documents to the replica when it gets an IOException in SolrCmdDistributor ? Solr version is 4.5.1 For the Garbage Collector, we use the settings defined here http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning Thank you, Anca On 12/19/2013 04:50 PM, Mark Miller wrote: Sounds like you need to raise your ZooKeeper connection timeout. Also, make sure you are using a concurrent garbage collector as a side note - stop the world pauses should be avoided. Just good advice :) - Mark On Dec 18, 2013, at 5:48 AM, Anca Kopetz <anca.kop...@kelkoo.com><mailto:anca.kop...@kelkoo.com> wrote: Hi, In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to time into recovering state, and it takes more than 10 minutes to finish to recover. In logs, we see that "PeerSync Recovery" fails with the message : PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates Then "Replication Recovery" starts. Is there something we can do to avoid the failure of "Peer Recovery" so that the recovery process is more rapid (less than 10 minutes) ? The full trace log is here : 2013-12-05 13:51:53,740 [http-8080-46] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0 QTime=0 2013-12-05 13:51:53,740 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,741 [http-8080-46] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0 QTime=1 2013-12-05 13:51:53,740 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,743 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,746 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,755 [Thread-1543] WARN org.apache.solr.cloud.RecoveryStrategy:close:105 - Stopping recovery for zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green 2013-12-05 13:51:53,756 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false 2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:495 - Finished recovery process. core=fr_green 2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false 2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,767 [RecoveryThread] INFO org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false 2013-12-05 13:51:54,777 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 18) 2013-12-05 13:51:56,804 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:356 - Attempting to PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green - recoveringAfterStartup=false 2013-12-05 13:51:56,806 [RecoveryThread] WARN org.apache.solr.update.PeerSync:sync:232 - PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:394 - PeerSync Recovery was not successful - trying replication. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:397 - Starting Replication Recovery. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:399 - Begin buffering updates. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:replicate:127 - Attempting to replicate from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false 2013-12-05 13:52:01,203 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller:<init>:211 - No value set for 'pollInterval'. Timer Task not started. 2013-12-05 13:52:01,209 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller:fetchLatestIndex:369 - Master's generation: 4382 2013-12-05 13:52:01,209 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller:fetchLatestIndex:370 - Slave's generation: 4382 2013-12-05 13:52:01,209 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller:fetchLatestIndex:371 - Starting replication process 2013-12-05 13:52:01,213 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller:fetchLatestIndex:376 - Number of files in latest index in master: 158 2013-12-05 13:52:01,213 [RecoveryThread] INFO org.apache.solr.core.CachingDirectoryFactory:get:360 - return new directory for /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213 2013-12-05 13:52:01,213 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller:fetchLatestIndex:402 - Starting download to org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213<mailto:org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213> lockFactory=org.apache.lucene.store.NativeFSLockFactory@697d36b7 fullCopy=true 2013-12-05 13:52:21,970 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 18) 2013-12-05 13:55:15,039 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 18) 2013-12-05 13:56:27,160 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller:fetchLatestIndex:406 - Total time taken for download : 265 secs 2013-12-05 13:56:27,540 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller:modifyIndexProps:886 - New index installed. Updating index properties... index=index.20131205135201213 2013-12-05 13:56:27,546 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller:fetchLatestIndex:461 - removing old index directory org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724<mailto:org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724> lockFactory=org.apache.lucene.store.NativeFSLockFactory@53949838 2013-12-05 13:56:27,765 [RecoveryThread] INFO org.apache.solr.search.SolrIndexSearcher:<init>:197 - Opening Searcher@30a544f3 main 2013-12-05 13:58:52,337 [RecoveryThread] INFO org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:235 - looking to close /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724[CachedDir<<refCount=0;path=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724;done=true>>] 2013-12-05 13:58:52,337 [RecoveryThread] INFO org.apache.solr.core.CachingDirectoryFactory:close:304 - Closing directory: /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724 2013-12-05 13:58:52,337 [RecoveryThread] INFO org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:279 - Removing directory before core close: /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724 2013-12-05 13:58:54,172 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:replay:506 - Replaying buffered documents. core=fr_green 2013-12-05 13:58:54,172 [recoveryExecutor-7-thread-2] WARN org.apache.solr.update.UpdateLog:doReplay:1240 - Starting log replay tlog{file=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/tlog/tlog.0000000000000004091 refcount=2} active=true starting pos=1103833347 2013-12-05 13:59:16,660 [recoveryExecutor-7-thread-2] INFO org.apache.solr.search.SolrIndexSearcher:<init>:197 - Opening Searcher@5e48fb12 main 2013-12-05 14:01:11,908 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 18) 2013-12-05 14:01:22,846 [recoveryExecutor-7-thread-2] WARN org.apache.solr.update.UpdateLog:run:1230 - Log replay finished. recoveryInfo=RecoveryInfo{adds=39345 deletes=0 deleteByQuery=0 errors=0 positionOfStart=1103833347} 2013-12-05 14:01:22,846 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:410 - Replication Recovery was successful - registering as Active. core=fr_green 2013-12-05 14:01:22,846 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=active 2013-12-05 14:01:22,850 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 14:01:22,860 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:495 - Finished recovery process. core=fr_green Best regards, Anca Kopetz -- <generic-logo-no-strap.gif> Anca Kopetz Software engineer E anca.kop...@kelkoo.com<mailto:anca.kop...@kelkoo.com> Y!Messenger kelkooancak T +33 (0)4 56 09 07 55 A 4/6 Rue des Meridiens 38130 Echirolles <x-bottom-left.gif> <x-bottom-middle.gif> Kelkoo SAS Société par Actions Simplifiée Au capital de € 4.168.964,30 Siège social : 8, rue du Sentier 75002 Paris 425 093 069 RCS Paris Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire et d'en avertir l'expéditeur. ________________________________ Kelkoo SAS Société par Actions Simplifiée Au capital de € 4.168.964,30 Siège social : 8, rue du Sentier 75002 Paris 425 093 069 RCS Paris Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire et d'en avertir l'expéditeur.