So - my IT guy makes the case that we don't really need ZooKeeper /
SolrCloud...

He may be right - we're serving static data (changes to the collection
occur only 2 or 3 times a year and are minor).

We could probably run 3 or 4 Solr nodes in non-Cloud mode -- each
configured the same way, behind a load balancer -- and do fine.

I've got a Kafka server set up with the Solr docs as topics.  It takes
about 10 minutes to reload a "blank" Solr server from the Kafka topic...
If I target 3-4 Solr servers from my microservice instead of one, it
shouldn't take much longer than 10 minutes to concurrently reload all 3 or 4
Solr servers from scratch...
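
For what it's worth, here's a rough sketch of what I mean by targeting
several servers at once.  It's not my actual microservice: reload_all, the
send callable and the URLs are all made up for illustration, and send stands
in for whatever client call really indexes a batch of docs on one node.

```python
from concurrent.futures import ThreadPoolExecutor


def reload_all(targets, docs, send):
    """Send the same batch of docs to every Solr target concurrently.

    targets: list of Solr base URLs (hypothetical values in the example).
    docs:    iterable of documents, e.g. as read from the Kafka topic.
    send:    callable(target, docs) that indexes docs on one node
             (an assumption: plug in the real client call here).

    Returns a dict mapping each target to None on success, or to the
    exception that send raised for that target.
    """
    batch = list(docs)  # materialise once so every node gets the same docs

    def _one(target):
        try:
            send(target, batch)
            return target, None
        except Exception as exc:  # keep going so one bad node doesn't stop the rest
            return target, exc

    # One worker per target; max(1, ...) avoids an error on an empty list.
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        return dict(pool.map(_one, targets))
```

The point of returning the per-node result is that, without ZooKeeper,
nothing else tells me a node missed the reload, so the microservice would
have to notice and retry that node itself.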

I'm biased in terms of using the most recent functionality, but I'm aware
that bias is not necessarily based on facts and want to do my due
diligence...

Aside from the obvious benefits of spreading work across nodes (which may
not be a big deal in our application, and which my IT guy proposes is more
transparently handled with a load balancer he understands), are there any
other considerations that would drive a choice for SolrCloud (ZooKeeper
etc.)?



On Mon, Apr 18, 2016 at 9:26 AM, Tom Evans <tevans...@googlemail.com> wrote:

> On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff
> <j...@johnbickerstaff.com> wrote:
> > Thanks all - very helpful.
> >
> > @Shawn - your reply implies that even if I'm hitting the URL for a single
> > endpoint via HTTP - the "balancing" will still occur across the Solr Cloud
> > (I understand the caveat about that single endpoint being a potential point
> > of failure).  I just want to verify that I'm interpreting your response
> > correctly...
> >
> > (I have been asked to provide IT with a comprehensive list of options prior
> > to a design discussion - which is why I'm trying to get clear about the
> > various options)
> >
> > In a nutshell, I think I understand the following:
> >
> > a. Even if hitting a single URL, the Solr Cloud will "balance" across all
> > available nodes for searching
> >           Caveat: That single URL represents a potential single point of
> > failure and this should be taken into account
> >
> > b. SolrJ's CloudSolrClient API provides the ability to distribute load --
> > based on Zookeeper's "knowledge" of all available Solr instances.
> >           Note: This is more robust than "a" due to the fact that it
> > eliminates the "single point of failure"
> >
> > c.  Use of a load balancer hitting all known Solr instances will be fine -
> > although the search requests may not run on the Solr instance the load
> > balancer targeted - due to "a" above.
> >
> > Corrections or refinements welcomed...
>
> With option a), although queries will be distributed across the
> cluster, all queries will be going through that single node. Not only
> is that a single point of failure, but you risk saturating the
> inter-node network traffic, possibly resulting in lower QPS and higher
> latency on your queries.
>
> With option b), as well as SolrJ, recent versions of pysolr have a
> ZK-aware SolrCloud client that behaves in a similar way.
>
> With option c), you can use the preferLocalShards parameter so that
> shards that are local to the queried node are used in preference to
> distributed shards. Depending on your shard/cluster topology, this can
> increase performance if you are returning large amounts of data - many or
> large fields or many documents.
>
> Cheers
>
> Tom
>
