On 18/04/2016 18:22, John Bickerstaff wrote:
So - my IT guy makes the case that we don't really need Zookeeper / Solr
Cloud...

He may be right - we're serving static data (changes to the collection
occur only 2 or 3 times a year and are minor)

We probably could have 3 or 4 Solr nodes running in non-Cloud mode -- each
configured the same way, behind a load balancer and do fine.

I've got a Kafka server set up with the solr docs as topics.  It takes
about 10 minutes to reload a "blank" Solr Server from the Kafka topic...
If I target 3-4 Solr servers from my microservice instead of one, it
wouldn't take much longer than 10 minutes to concurrently reload all 3 or 4
Solr servers from scratch...
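
A minimal sketch of that concurrent reload, assuming the documents have
already been read off the Kafka topic into a list. The names reload_all and
post_docs and the update URLs are hypothetical, not taken from the thread;
post_docs assumes Solr's JSON update handler at /update.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def post_docs(update_url, docs):
    """POST a list of documents as JSON to one Solr update handler."""
    body = json.dumps(docs).encode("utf-8")
    req = urllib.request.Request(
        update_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def reload_all(update_urls, docs, send=post_docs):
    """Send the same documents to every node in parallel.

    Returns a dict mapping each update URL to whatever `send` returned.
    """
    docs = list(docs)  # materialise once so every node gets the full set
    with ThreadPoolExecutor(max_workers=max(1, len(update_urls))) as pool:
        futures = {url: pool.submit(send, url, docs) for url in update_urls}
        return {url: future.result() for url, future in futures.items()}
```

Because each node is loaded on its own thread, the total time should stay
close to the time for the slowest single node rather than the sum of all of
them, provided the machine doing the sending is not the bottleneck.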

This is something we've been discussing as a concept - to offload all the scaling stuff to Kafka (which is very good at that sort of thing) and simply hang Solr instances onto a Kafka topic. We've not taken it any further than a concept at this point but interesting to hear about others doing so!

Charlie


I'm biased in terms of using the most recent functionality, but I'm aware
that bias is not necessarily based on facts and want to do my due
diligence...

Aside from the obvious benefits of spreading work across nodes (which may
not be a big deal in our application and which my IT guy proposes is more
transparently handled with a load balancer he understands) are there any
other considerations that would drive a choice for Solr Cloud (zookeeper
etc)?



On Mon, Apr 18, 2016 at 9:26 AM, Tom Evans <tevans...@googlemail.com> wrote:

On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff
<j...@johnbickerstaff.com> wrote:
Thanks all - very helpful.

@Shawn - your reply implies that even if I'm hitting the URL for a single
endpoint via HTTP - the "balancing" will still occur across the Solr Cloud
(I understand the caveat about that single endpoint being a potential point
of failure).  I just want to verify that I'm interpreting your response
correctly...

(I have been asked to provide IT with a comprehensive list of options prior
to a design discussion - which is why I'm trying to get clear about the
various options)

In a nutshell, I think I understand the following:

a. Even if hitting a single URL, the Solr Cloud will "balance" across all
available nodes for searching
   Caveat: That single URL represents a potential single point of failure
   and this should be taken into account

b. SolrJ's CloudSolrClient API provides the ability to distribute load --
based on Zookeeper's "knowledge" of all available Solr instances.
   Note: This is more robust than "a" due to the fact that it eliminates
   the "single point of failure"

c. Use of a load balancer hitting all known Solr instances will be fine -
although the search requests may not run on the Solr instance the load
balancer targeted - due to "a" above.

Corrections or refinements welcomed...

With option a), although queries will be distributed across the
cluster, all queries will be going through that single node. Not only
is that a single point of failure, but you risk saturating the
inter-node network, possibly resulting in lower QPS and higher
latency on your queries.

With option b), as well as SolrJ, recent versions of pysolr have a
ZK-aware SolrCloud client that behaves in a similar way.
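
A rough sketch of what that looks like with pysolr, assuming the pysolr
package (with its kazoo dependency) is installed and a ZooKeeper ensemble is
reachable. The function name, host string and collection name are
placeholders, not values from this thread.

```python
def connect_solrcloud(zk_hosts, collection):
    """Return a ZooKeeper-aware pysolr client for one collection.

    zk_hosts is a comma-separated host:port list, for example
    "zk1:2181,zk2:2181,zk3:2181" (placeholder hosts).
    """
    # Imported here so the module loads even where pysolr is not installed.
    import pysolr

    zookeeper = pysolr.ZooKeeper(zk_hosts)
    return pysolr.SolrCloud(zookeeper, collection)


# Usage, with placeholder values:
#   solr = connect_solrcloud("zk1:2181,zk2:2181,zk3:2181", "collection1")
#   results = solr.search("title:kafka")
```

The client learns which Solr nodes are live from ZooKeeper, so the
application does not need a fixed Solr URL.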

With option c), you can use the preferLocalShards parameter so that
shards local to the queried node are used in preference to remote
shards. Depending on your shard/cluster topology, this can increase
performance if you are returning large amounts of data - many or large
fields or many documents.
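
As an illustration, preferLocalShards is passed as an ordinary request
parameter on the query. The helper below is only a sketch: the function
name, node URL and collection name are made up for the example.

```python
from urllib.parse import urlencode


def build_select_url(node_url, collection, query, prefer_local=True):
    """Build a /select URL for one Solr node, optionally preferring local shards."""
    params = {"q": query, "wt": "json"}
    if prefer_local:
        # Ask the receiving node to use its own replicas where it has them.
        params["preferLocalShards"] = "true"
    return f"{node_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


# Usage, with placeholder values:
#   build_select_url("http://solr1:8983/solr", "docs", "title:kafka")
```

A load balancer would forward such a request to whichever node it picks,
and that node then answers from its local replicas where possible.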

Cheers

Tom




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
