Hi Joel,

You're sort of describing the classic replication scenario, which you can
get started on by reading this: http://wiki.apache.org/solr/SolrReplication

I believe this is also covered in the Solr Reference Guide.
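
If it helps, here's a minimal sketch of bootstrapping a fresh node by
pulling the index from a live one via the ReplicationHandler's fetchindex
command. The host and core names here are made up, and it assumes
/replication is enabled in solrconfig.xml on both nodes:

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class FetchIndex {
        public static void main(String[] args) throws Exception {
            // Hypothetical hosts: solr-new-1 pulls the index from solr-live-1.
            String masterUrl = URLEncoder.encode(
                    "http://solr-live-1:8983/solr/mycore/replication", "UTF-8");
            URL url = new URL("http://solr-new-1:8983/solr/mycore/replication"
                    + "?command=fetchindex&masterUrl=" + masterUrl);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // fetchindex kicks off the copy and returns right away;
            // poll ?command=details to watch progress before going live.
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }

Once ?command=details reports the copy is finished, you can add the node to
the load balancer.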

Generally speaking, the issues you mention apply whenever you run Solr at
scale, no matter how you replicate. Proper GC tuning is a must, and it can
seriously diminish the impact of GC pauses.
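
For what it's worth, a fairly common CMS-based starting point for a heap in
your range looks something like this; treat the exact numbers as assumptions
to benchmark against your own index and query mix, not a recommendation:

    java -Xms4g -Xmx4g \
         -XX:+UseConcMarkSweepGC \
         -XX:+UseParNewGC \
         -XX:CMSInitiatingOccupancyFraction=75 \
         -XX:+UseCMSInitiatingOccupancyOnly \
         -jar start.jar

Pinning -Xms to -Xmx also avoids pauses from heap resizing.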

Etsy has done some interesting work on making an API resilient to
garbage-collecting nodes. Take a look at this:
http://www.lucenerevolution.org/sites/default/files/Living%20with%20Garbage.pdf
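
The gist of their approach is that clients treat a slow answer the same as a
dead node. A tiny sketch of that idea against Solr's stock /admin/ping
handler; the host, core name, and the 500 ms budget are all assumptions to
tune for your setup:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class PingCheck {
        // Treats "no answer within 500 ms" the same as "down", so the
        // caller can route around a node stuck in a long GC pause.
        static boolean isHealthy(String host) {
            try {
                URL url = new URL("http://" + host
                        + ":8983/solr/mycore/admin/ping");
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setConnectTimeout(500);  // ms
                conn.setReadTimeout(500);     // ms
                return conn.getResponseCode() == 200;
            } catch (Exception e) {
                return false;
            }
        }

        public static void main(String[] args) {
            System.out.println(isHealthy("solr-live-1")
                    ? "keep in rotation" : "ban from rotation");
        }
    }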


Michael Della Bitta
Applications Developer
o: +1 646 532 3062

appinions inc.
“The Science of Influence Marketing”
18 East 41st Street
New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Fri, Aug 1, 2014 at 10:48 AM, Joel Cohen <jco...@grubhub.com> wrote:

> Hi,
>
> We're in the development phase of a new application and the current dev
> team mindset leans towards running Solr (4.9) in AWS without ZooKeeper. The
> theory is that we can add nodes quickly to our load balancer
> programmatically and get a dump of the indexes from another node and copy
> them over to the new one. A RESTful API would let other applications talk
> to Solr without each of them needing to use SolrJ.
> Data ingestion happens nightly in bulk via ActiveMQ, which each server
> subscribes to in order to pull its own copy of the indexes. Incremental
> updates are very few during the day, but we would have some mechanism for
> getting a new server to 'catch up' to the live servers before making it
> active in the load balancer.
>
> The only thing so far that I see as a hurdle here is the data set size vs.
> heap size. If the index grows too large, then we have to increase the heap
> size, which could lead to longer GC times. Servers could pop in and out of
> the load balancer if they are unavailable for too long when a major GC
> happens.
>
> Current stats:
> 11 GB of data (and growing)
> 4 GB Java heap
> 4 CPU, 16 GB RAM nodes (maybe more needed?)
>
> All thoughts are welcomed.
>
> Thanks.
> --
> *Joel Cohen*
> Devops Engineer
>
> *GrubHub Inc.*
> *jco...@grubhub.com <jco...@grubhub.com>*
> 646-527-7771
> 1065 Avenue of the Americas
> 15th Floor
> New York, NY 10018
>
> grubhub.com | *fb <http://www.facebook.com/grubhub>* | *tw
> <http://www.twitter.com/grubhub>*
> seamless.com | *fb <http://www.facebook.com/seamless>* | *tw
> <http://www.twitter.com/seamless>*
>
