On 9/29/2017 6:34 AM, John Blythe wrote:
complete noob as to solrcloud here. almost-non-noob on solr in general.

we're experiencing growing pains in our data and i'm thinking through
moving to solrcloud as a result. i'm hoping to find out if it seems like
a good strategy or if we need to get other areas of interest handled
first before introducing new complexities.

SolrCloud's main advantages are automation, centralization, and the elimination of single points of failure. Indexing with multiple replicas also works very differently in cloud than in master/slave: cloud replicas each index the documents independently, while slaves copy finished index segments from the master.  That difference can cut both ways.  It is advantageous in *most* situations, but master/slave might have an edge in *some* situations.

For most *new* production setups requiring high availability, I would recommend SolrCloud in almost every case. Master/slave is a system that works, but the master is a single point of failure.  If the master dies, manual reconfiguration of all the machines is usually required to define a new master.  If you're willing to do some tricks with DNS, it might be possible to avoid manual Solr reconfiguration, but it is not seamless like SolrCloud, which is a true cluster with no masters and no slaves.
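
To make the client side of that concrete, here is a minimal SolrJ sketch of querying a cloud, assuming SolrJ 6.x or later; the ZooKeeper hostnames and the "primary" collection name are made up for illustration:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CloudQueryExample {
      public static void main(String[] args) throws Exception {
        // Connect through the ZooKeeper ensemble, not through any one
        // Solr host.
        CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181,zk2:2181,zk3:2181")
            .build();
        client.setDefaultCollection("primary");
        // The client reads cluster state from ZooKeeper and balances
        // requests across whichever replicas are live right now, so
        // there is no single hostname to reconfigure when a node dies.
        QueryResponse rsp = client.query(new SolrQuery("*:*"));
        System.out.println("hits: " + rsp.getResults().getNumFound());
        client.close();
      }
    }

Because the client discovers servers from ZooKeeper, losing a node requires no client-side changes at all.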

I do not use SolrCloud in most of my setups.  This is only because when those setups were designed, SolrCloud was a development dream, something that was being worked on in a development branch.  SolrCloud did not arrive in a released version until 4.0.0-ALPHA.  If I were designing a setup from scratch now, I would definitely build it with SolrCloud.

here's a rundown of things:
- we are on a 30g ram aws instance
- we have ~30g tucked away in the ../solr/server/ dir
- our largest core is 6.8g w/ ~25 segments at any given time. this is also
the core that our business directly runs off of, users interact with, etc.
- 5g is for a logs type of dataset that analytics can be built off of to
help inform the primary core above
- 3g are taken up by 3 different third party sources that we use solr to
warehouse and have available for query for the sake of linking items in our
primary core to these cores for data enrichment
- several others take up < 1g each
- and then we have dev- and demo- flavors for some of these

we had been operating on a 16gb machine till a few weeks ago (actually
bumped it while at lucene revolution bc i hadn't noticed how much our
needs had outgrown the cache size till the week before!). the load when
doing an import or running our heavier operations is much better now,
and the machine doesn't buckle under the weight of those operations like
it had been doing.

we have no master/slave replication. all of our data is 'replicated' by
the fact that it exists in mysql. if solr were to go down it'd be a nice
big fire, but one we could recover from within a couple hours by simply
reimporting.

If your business model can tolerate a two hour outage, I am envious.  That is not something that most businesses can tolerate.  Also, many setups cannot do a full rebuild in two hours.  Some kind of replication is required for a fault tolerant installation.
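
If you do move to cloud, redundancy becomes just a parameter at collection creation time. Here is a minimal SolrJ sketch, with made-up ZooKeeper hosts and a hypothetical config name "myconfig" already uploaded to ZooKeeper:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateReplicatedCollection {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
          // One shard, two replicas: if either node dies, the other
          // keeps serving queries and taking updates, with no reimport
          // from mysql needed.
          CollectionAdminRequest
              .createCollection("primary", "myconfig", 1, 2)
              .process(client);
        }
      }
    }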

i'd like to have a more sophisticated setup in place for fault tolerance
than that, of course. i'd also like to see our heavy, many-query
operations be speedier and better able to handle multi-threaded runs at
once w/ ease.

is this a matter of getting still more ram on the machine? cpus for faster
processing? splitting up the read/write operations between master/slave?
going full steam into a solrcloud configuration?

one more note. per discussion at the conference i'm combing through our
configs to make sure we trim any fat we can. also wanting to get
optimization scheduled more regularly to help out w/ segment counts and
heap garbage. not sure how far those two alone will get us, though.

The desire to scale an index, either in size or query load, is not by itself a reason to switch to SolrCloud.  Scaling is generally easier to manage with cloud, because you just fire up another server, and it is immediately part of the cloud, ready for whatever collection changes or additions you might need, most of which can be done with requests via the HTTP API.  Although performance can improve with SolrCloud, it is not usually a *significant* improvement, assuming that the distribution of data and the number/configuration of servers are similar between master/slave and SolrCloud.
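
As an illustration, once a new node has joined the cluster, putting it to work can be a single Collections API call. This sketch uses hypothetical collection and shard names and is the SolrJ form of /admin/collections?action=ADDREPLICA:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class AddReplicaExample {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
          // Ask Solr to place another replica of shard1 somewhere in
          // the cluster; Solr picks a live node unless one is named.
          CollectionAdminRequest
              .addReplicaToShard("primary", "shard1")
              .process(client);
        }
      }
    }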

If you rearrange the data or upgrade/add server hardware *with* the switch to SolrCloud, then any significant performance improvement is probably not attributable to SolrCloud, but to the other changes.

If all your homegrown tools are designed around non-cloud setups, you might find it very painful to switch.  Some things require different HTTP APIs, and the APIs that you might already use could have different responses or require slightly different information in the request.
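
As a small example of what that means for tooling (host, core, and collection names are hypothetical): a standalone client is pinned to one core URL on one host, while a cloud client addresses a collection through ZooKeeper:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class ClientStyles {
      public static void main(String[] args) throws Exception {
        // Non-cloud: one specific core on one specific host.
        HttpSolrClient standalone = new HttpSolrClient.Builder(
            "http://solr1:8983/solr/primary").build();

        // SolrCloud: a collection, located through ZooKeeper.  Any
        // URL-per-core assumptions in homegrown tools no longer hold.
        CloudSolrClient cloud = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181,zk2:2181,zk3:2181").build();
        cloud.setDefaultCollection("primary");

        standalone.close();
        cloud.close();
      }
    }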

RAM is the resource with the most impact on Solr performance.  CPU is certainly important, but increasing the available RAM will usually give the biggest boost.  If there is sufficient RAM, disk speed will have very little effect on performance.  Disk speed only becomes a major factor when you do not have enough memory to effectively cache the index.
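
To put rough numbers on that with your setup: you have somewhere around 15-20GB of index data across your cores (6.8 + 5 + 3, plus the smaller cores and the dev/demo copies).  If the Java heap on your 30GB instance is set to, say, 8GB (a number I am guessing at purely for illustration), roughly 22GB is left for the OS disk cache, which can hold essentially all of that index.  On the old 16GB machine the same heap would have left only about 8GB of cache, which is consistent with the improvement you saw after the upgrade.  The flip side is that as the indexes grow past what the cache can hold, queries start hitting the disk and performance falls off.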

Thanks,
Shawn
