I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
collections from scratch and then attempted to stop/start the cloud.
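For reference, the collections were created through the Collections API, assuming the same 1 shard x 2 replica layout as in the earlier setup; roughly the loop sketched below (the node address and config name are placeholders, not the exact values used):

  # Rough sketch: create N single-shard collections with two replicas each
  # through the Collections API. Base URL and configName are placeholders.
  import urllib.request

  SOLR = "http://host:8000/solr"   # any live node (placeholder)
  CONFIG = "myconf"                # configset already uploaded to ZK (placeholder)

  for i in range(4000):
      name = "DDDDDD-%d" % i
      url = (SOLR + "/admin/collections?action=CREATE"
             "&name=" + name +
             "&numShards=1&replicationFactor=2"
             "&collection.configName=" + CONFIG)
      urllib.request.urlopen(url).read()   # raises on a non-2xx response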

node1:
WARN  - 2015-03-02 18:09:02.371;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController; Timed
out waiting to see all nodes published as DOWN in our cluster state.
WARN  - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController; Still
seeing conflicting information about the leader of shard shard1 for
collection DDDDDD-3219 after 30 seconds; our state says
http://host:8002/solr/DDDDDD-3219_shard1_replica1/, but ZooKeeper says
http://host:8000/solr/DDDDDD-3219_shard1_replica2/

node2:
WARN  - 2015-03-02 18:09:01.871;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:17:04.458;
org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered,
but Solr cannot talk to ZK
(stop/start again here)
WARN  - 2015-03-02 18:53:12.725;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController; Still
seeing conflicting information about the leader of shard shard1 for
collection DDDDDD-3581 after 30 seconds; our state says
http://host:8001/solr/DDDDDD-3581_shard1_replica2/, but ZooKeeper says
http://host:8002/solr/DDDDDD-3581_shard1_replica1/

node3:
WARN  - 2015-03-02 18:09:03.022;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController; Timed
out waiting to see all nodes published as DOWN in our cluster state.
WARN  - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController; Still
seeing conflicting information about the leader of shard shard1 for
collection DDDDDD-2707 after 30 seconds; our state says
http://host:8002/solr/DDDDDD-2707_shard1_replica2/, but ZooKeeper says
http://host:8000/solr/DDDDDD-2707_shard1_replica1/

On 27 February 2015 at 17:48, Shawn Heisey <apa...@elyograg.org> wrote:

> On 2/26/2015 11:14 PM, Damien Kamerman wrote:
> > I've run into an issue with starting my solr cloud with many collections.
> > My setup is:
> > 3 nodes (solr 4.10.3 ; 64GB RAM each ; jdk1.8.0_25) running on a single
> > server (256GB RAM).
> > 5,000 collections (1 x shard ; 2 x replica) = 10,000 cores
> > 1 x Zookeeper 3.4.6
> > Java arg -Djute.maxbuffer=67108864 added to solr and ZK.
> >
> > Then I stop all nodes, then start all nodes. All replicas are in the down
> > state, some have no leader. At times I have seen some (12 or so) leaders
> in
> > the active state. In the solr logs I see lots of:
> >
> > org.apache.solr.cloud.ZkController; Still seeing conflicting information
> > about the leader of shard shard1 for collection DDDDDD-4351 after 30
> > seconds; our state says
> http://ftea1:8001/solr/DDDDDD-4351_shard1_replica1/,
> > but ZooKeeper says http://ftea1:8000/solr/DDDDDD-4351_shard1_replica2/
>
> <snip>
>
> > I've tried staggering the starts (1min) but it does not help.
> > I've reproduced with zero documents.
> > Restarts are OK up to around 3,000 cores.
> > Should this work?
>
> This is going to push SolrCloud beyond its limits.  Is this just an
> exercise to see how far you can push Solr, or are you looking at setting
> up a production install with several thousand collections?
>
> In Solr 4.x, the clusterstate is one giant JSON structure containing the
> state of the entire cloud.  With 5000 collections, the entire thing
> would need to be downloaded and uploaded at least 5000 times during the
> course of a successful full system startup ... and I think with
> replicationFactor set to 2, that might actually be 10000 times. The
> best-case scenario is that it would take a VERY long time, the
> worst-case scenario is that concurrency problems would lead to a
> deadlock.  A deadlock might be what is happening here.
>
> In Solr 5.x, the clusterstate is broken up so there's a separate state
> structure for each collection.  This setup allows for faster and safer
> multi-threading and far less data transfer.  Assuming I understand the
> implications correctly, there might not be any need to increase
> jute.maxbuffer with 5.x ... although I have to assume that I might be
> wrong about that.
>
> I would very much recommend that you set your scenario up from scratch
> in Solr 5.0.0, to see if the new clusterstate format can eliminate the
> problem you're seeing.  If it doesn't, then we can pursue it as a likely
> bug in the 5.x branch and you can file an issue in Jira.
>
> Thanks,
> Shawn
>
>


-- 
Damien Kamerman
