"Production" typically implies "high availability" and in a distributed
system the goal is that overall cluster integrity and performance should
not be compromised just because a few "worker" nodes go down. Solr nodes do
a lot of complex operations and are quite prone to running into "issues"
that compromise their integrity and require that they be taken down,
restarted, etc. In fact, taking down a "bunch" of Solr "worker" nodes should
not be a big deal (unless they are all of the nodes/replicas for a single
shard/slice), while taking down a "bunch" of ZooKeeper nodes could be
catastrophic for the integrity of the ZooKeeper ensemble. (OTOH, if every
Solr node is also a ZooKeeper node, a "bunch" of Solr nodes would generally
be fewer than half of the ensemble, so a quorum would still be standing and
maybe that is not an absolute issue per se.) ZooKeeper nodes are
categorically distinct in terms of their importance to maintaining the
integrity and availability of the overall cluster. They are special in that
sense, and special because they maintain the integrity of the cluster's
configuration information. Even for large clusters their number will be
relatively "few" compared to the "many" "worker" nodes (replicas), so
ZooKeeper nodes need to be "protected" from the vagaries that can disrupt
Solr nodes and take them down, not the least of which is incoming traffic.
I'm not sure what the implications would be if you had a large cluster and,
because ZooKeeper was embedded, you had a large number of ZooKeeper nodes.
Any of the inter-ZooKeeper operations would take longer and could be
compromised by even a single busy/overloaded/dead Solr node. OTOH, the
ZooKeeper ensemble design is supposed to be able to handle a fair number of
missing ZooKeeper nodes (anything short of losing its majority).
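
To put rough numbers on how many missing nodes an ensemble can absorb:
ZooKeeper stays available as long as a strict majority of the ensemble is
up. Here is a minimal sketch of that arithmetic. The function names are
mine, purely for illustration, and are not part of any ZooKeeper or Solr
API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble with this many nodes."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one node")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can be down while a quorum is still up."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Illustrative ensemble sizes only; not a sizing recommendation.
    for n in (1, 3, 4, 5, 7):
        print(f"{n} nodes: quorum={quorum_size(n)}, "
              f"can lose {tolerated_failures(n)}")
```

Note that an even-sized ensemble tolerates no more failures than the odd
size just below it (4 nodes can lose 1, same as 3), which is why ensembles
are usually sized at 3, 5 or 7.
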
OTOH, if high availability is not a requirement for a production cluster
(use case?), then external ZooKeeper nodes are certainly an annoyance.
Maybe you could think of embedded ZooKeeper like every employee having their
manager sitting right next to them all the time. How could that be anything
but a bad idea, both for maximizing worker output and for letting managers
focus on their own "work"?
-- Jack Krupansky
-----Original Message-----
From: Nick Chase
Sent: Sunday, November 11, 2012 7:12 AM
To: solr-user@lucene.apache.org
Subject: Internal Vs. External ZooKeeper
OK, I can't find a definitive answer on this. The wiki says not to use
the embedded ZooKeeper servers for production. But my question is: why
not? Basically, what are the reasons and circumstances that make you
better off using an external ZooKeeper ensemble?
Thanks...
---- Nick