Still no luck starting Solr with a 40s zkClientTimeout. I'm not seeing any expired sessions...
There must be a way to start Solr with many collections. It runs fine... until a restart is required.

On 3 March 2015 at 03:33, Shawn Heisey <apa...@elyograg.org> wrote:

> On 3/2/2015 12:54 AM, Damien Kamerman wrote:
> > I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
> > collections from scratch and then attempted to stop/start the cloud.
> >
> > node1:
> > WARN - 2015-03-02 18:09:02.371; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> > WARN - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes published as DOWN in our cluster state.
> > WARN - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DDDDDD-3219 after 30 seconds; our state says http://host:8002/solr/DDDDDD-3219_shard1_replica1/, but ZooKeeper says http://host:8000/solr/DDDDDD-3219_shard1_replica2/
> >
> > node2:
> > WARN - 2015-03-02 18:09:01.871; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> > WARN - 2015-03-02 18:17:04.458; org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered, but Solr cannot talk to ZK
> > stop/start
> > WARN - 2015-03-02 18:53:12.725; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> > WARN - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DDDDDD-3581 after 30 seconds; our state says http://host:8001/solr/DDDDDD-3581_shard1_replica2/, but ZooKeeper says http://host:8002/solr/DDDDDD-3581_shard1_replica1/
> >
> > node3:
> > WARN - 2015-03-02 18:09:03.022; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> > WARN - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes published as DOWN in our cluster state.
> > WARN - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DDDDDD-2707 after 30 seconds; our state says http://host:8002/solr/DDDDDD-2707_shard1_replica2/, but ZooKeeper says http://host:8000/solr/DDDDDD-2707_shard1_replica1/
>
> I'm sorry to hear that 5.0 didn't fix the problem. I really hoped that
> it would.
>
> There is one other thing I'd like to try before you file a bug:
> increasing zkClientTimeout to 40 seconds, to see whether it changes
> the point at which it fails (or allows it to succeed). With the
> default tickTime (2 seconds), the maximum time you can set
> zkClientTimeout to is 40 seconds ... which in normal circumstances is a
> VERY long time. In your situation, at least with the code in its
> current state, 30 seconds (I'm pretty sure this is the default in 5.0)
> may simply not be enough.
>
> https://cwiki.apache.org/confluence/display/solr/Parameter+Reference#ParameterReference-SolrCloudInstanceZooKeeperParameters
>
> I think filing a bug, even if 40 seconds allows this to succeed, is a
> good idea ... but you might want to wait for some of the cloud experts
> to look at your logs to see if they have anything to add.
>
> Thanks,
> Shawn

--
Damien Kamerman
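
For reference, the 40-second ceiling mentioned in the quoted message follows from how a ZooKeeper server bounds session timeouts: by default the minimum is 2 x tickTime and the maximum is 20 x tickTime, unless minSessionTimeout or maxSessionTimeout are set explicitly in zoo.cfg. Below is a minimal sketch of that arithmetic, assuming those defaults. The function name and its parameters are made up for illustration and are not part of Solr or ZooKeeper.

```python
from typing import Optional


def negotiated_session_timeout_ms(
    requested_ms: int,
    tick_time_ms: int = 2000,
    min_session_timeout_ms: Optional[int] = None,
    max_session_timeout_ms: Optional[int] = None,
) -> int:
    """Estimate the session timeout a ZooKeeper server would grant.

    Hypothetical helper: it only mirrors the documented default bounds
    (min = 2 * tickTime, max = 20 * tickTime) and does not talk to a server.
    """
    # Fall back to the tickTime-derived defaults when no override is given.
    lower = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    upper = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    # Clamp the client's request into the [lower, upper] range.
    return max(lower, min(requested_ms, upper))


if __name__ == "__main__":
    # zkClientTimeout values as they might be requested by Solr, in milliseconds.
    for requested in (30000, 40000, 60000):
        print(requested, "->", negotiated_session_timeout_ms(requested))
```

Under these assumptions, asking for more than 40 seconds has no effect unless tickTime or maxSessionTimeout is also raised on the ZooKeeper side.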