Had a scenario on a dev system here that has me confused. We have a simple SolrCloud (dev) system running 4.3: 4 shards on 2 machines (2 instances per machine), 2 external ZKs, and no replicas (or 1 replica depending on your definition; we only have 1 instance of each shard!)
Yes, we have no backups, and we only have 2 ZKs, which is bad, but it's a dev system, so not mission critical. What I saw last night was that various shards disconnected from ZK (still trying to work out why that happened in the first place), and some reconnected while others didn't. The ones that failed eventually logged this error:

2013-05-23 14:27:38,876 ERROR [main-EventThread] o.a.s.c.c.DefaultConnectionStrategy [SolrException.java:119] Reconnect to ZooKeeper failed: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper xxx1:11600,xxx2:11600 within 30000 ms
2013-05-23 14:27:38,877 INFO [main-EventThread] o.a.s.c.c.DefaultConnectionStrategy [DefaultConnectionStrategy.java:51] Reconnect to ZooKeeper failed
2013-05-23 14:27:38,877 INFO [main-EventThread] o.a.s.c.c.ConnectionManager [ConnectionManager.java:130] Connected:false
2013-05-23 14:27:38,877 INFO [main-EventThread] o.a.z.ClientCnxn [ClientCnxn.java:509] EventThread shut down

So my question is: why don't they keep retrying? Yes, I could increase the timeout, but that feels like the wrong fix. If a core has failed to connect to ZK, shouldn't it keep trying to re-enter the cloud? Why does it "give up"? From that point onwards, those cores just return errors on every update:

2013-05-23 14:30:39,605 ERROR [qtp21465667-1439] o.a.s.c.SolrCore [SolrException.java:108] org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are disabled.
        at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:999)

Now I understand the reason for the errors, but I'm surprised it didn't try to fix itself. I eventually bounced the core and it reconnected, but why does it need a manual fix?
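To illustrate the behavior I expected: rather than making a single timed reconnect attempt and shutting down the event thread, the client would keep retrying with a growing delay until ZK comes back. This is a hypothetical sketch of that idea, not Solr's actual DefaultConnectionStrategy code; the class and method names are mine:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical retry-with-backoff loop; NOT Solr's real reconnect logic.
public class ReconnectLoop {

    /**
     * Repeatedly invokes connectAttempt until it succeeds or maxAttempts
     * is exhausted, doubling the pause between attempts up to maxDelayMs.
     * Returns true once a connection attempt succeeds.
     */
    public static boolean reconnectWithBackoff(BooleanSupplier connectAttempt,
                                               int maxAttempts,
                                               long initialDelayMs,
                                               long maxDelayMs) throws InterruptedException {
        long delay = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (connectAttempt.getAsBoolean()) {
                return true; // reconnected; the core could re-enter the cloud here
            }
            TimeUnit.MILLISECONDS.sleep(delay);
            delay = Math.min(delay * 2, maxDelayMs); // exponential backoff, capped
        }
        return false; // exhausted attempts; only now would "give up" make sense
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a ZK ensemble that becomes reachable on the third attempt.
        int[] calls = {0};
        boolean ok = reconnectWithBackoff(() -> ++calls[0] >= 3, 10, 1, 8);
        System.out.println("reconnected=" + ok + " after " + calls[0] + " attempts");
    }
}
```

With a loop like this, a transient ZK outage would heal itself instead of leaving the core stuck with "Updates are disabled" until someone bounces it.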
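For reference, the 30000 ms in the log is the ZK client timeout, which in the Solr 4.x legacy solr.xml format is the zkClientTimeout attribute on the cores element. A minimal fragment (paths and values illustrative, taken from a stock 4.x layout, not our actual config):

```xml
<solr persistent="true">
  <!-- zkClientTimeout controls how long a core waits on ZooKeeper;
       raising it only delays the failure, it does not add retries. -->
  <cores adminPath="/admin/cores"
         hostPort="${jetty.port:8983}"
         zkClientTimeout="${zkClientTimeout:30000}">
  </cores>
</solr>
```

Which is why tweaking it feels like the wrong fix: the problem isn't how long one attempt waits, it's that there is only one attempt.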