Thanks for the input. How long will the 'old style' of replication be supported? Is it slated to go away in Solr 5? I don't want to be stuck on an old version because we designed our application the wrong way.
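For anyone else following along, what we'd be committing to is roughly the
master/slave setup described on that wiki page, i.e. something like this in
each core's solrconfig.xml (the hostname, core name, poll interval and
confFiles below are placeholders, not our actual values):

    <!-- on the master -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">startup</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

    <!-- on each slave -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master-host:8983/solr/corename/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>

If that handler is what's going away, I'd rather know now, before we build
provisioning and catch-up tooling around it.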
On Mon, Aug 4, 2014 at 10:22 AM, Michael Della Bitta <michael.della.bi...@appinions.com> wrote:

> Hi Joel,
>
> You're sort of describing the classic replication scenario, which you can
> get started on by reading this:
> http://wiki.apache.org/solr/SolrReplication
>
> Although I believe this is handled in the reference guide, too.
>
> Generally speaking, the sorts of issues you mention are general issues
> that you have to deal with when using Solr at scale, no matter how you
> replicate. Proper GC tuning is a must. You can seriously diminish the
> impact of GC with some tuning.
>
> Etsy has done some interesting things regarding implementing an API
> that's resilient to garbage collecting nodes. Take a look at this:
>
> http://www.lucenerevolution.org/sites/default/files/Living%20with%20Garbage.pdf
>
>
> Michael Della Bitta
> Applications Developer
> o: +1 646 532 3062
>
> appinions inc.
> "The Science of Influence Marketing"
>
> 18 East 41st Street
> New York, NY 10017
>
> t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions
> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
> w: appinions.com <http://www.appinions.com/>
>
>
> On Fri, Aug 1, 2014 at 10:48 AM, Joel Cohen <jco...@grubhub.com> wrote:
>
> > Hi,
> >
> > We're in the development phase of a new application and the current dev
> > team mindset leans towards running Solr (4.9) in AWS without Zookeeper.
> > The theory is that we can add nodes quickly to our load balancer
> > programmatically and get a dump of the indexes from another node and
> > copy them over to the new one. A RESTful API would handle other
> > applications talking to Solr without the need for each of them to have
> > to use SolrJ. Data ingestion happens nightly in bulk by way of ActiveMQ
> > which each server subscribes to and pulls its own copy of the indexes.
> > Incremental updates are very few during the day, but we would have some
> > mechanism of getting a new server to 'catch up' to the live servers
> > before making it active in the load balancer.
> >
> > The only thing so far that I see as a hurdle here is the data set size
> > vs. heap size. If the index grows too large, then we have to increase
> > the heap size, which could lead to longer GC times. Servers could pop
> > in and out of the load balancer if they are unavailable for too long
> > when a major GC happens.
> >
> > Current stats:
> > 11 Gb of data (and growing)
> > 4 Gb java heap
> > 4 CPU, 16 Gb RAM nodes (maybe more needed?)
> >
> > All thoughts are welcomed.
> >
> > Thanks.
> > --
> > *Joel Cohen*
> > Devops Engineer
> >
> > *GrubHub Inc.*
> > *jco...@grubhub.com <jco...@grubhub.com>*
> > 646-527-7771
> > 1065 Avenue of the Americas
> > 15th Floor
> > New York, NY 10018
> >
> > grubhub.com | *fb <http://www.facebook.com/grubhub>* | *tw <http://www.twitter.com/grubhub>*
> > seamless.com | *fb <http://www.facebook.com/seamless>* | *tw <http://www.twitter.com/seamless>*

--
*Joel Cohen*
Senior Devops Engineer

*GrubHub Inc.*
*jco...@grubhub.com <jco...@grubhub.com>*
646-527-7771
1065 Avenue of the Americas
15th Floor
New York, NY 10018

grubhub.com | *fb <http://www.facebook.com/grubhub>* | *tw <http://www.twitter.com/grubhub>*
seamless.com | *fb <http://www.facebook.com/seamless>* | *tw <http://www.twitter.com/seamless>*
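P.S. For the archive, on the GC tuning point above: the kind of flags we've
been experimenting with on the 4 Gb heaps look something like the command
line below (assuming the stock Jetty start.jar from the Solr 4.x example
directory). This is only a sketch of CMS-style tuning on a Java 7 era JVM,
not a recommendation; the occupancy fraction and log path are placeholder
values we'd still have to validate under our own query load.

    java -Xms4g -Xmx4g \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
         -XX:+ParallelRefProcEnabled \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
         -Xloggc:/var/log/solr/gc.log \
         -jar start.jar

Also, as I understand it, with the default mmap-based directory the index
itself is served mostly from the OS page cache rather than the Java heap, so
the heap doesn't necessarily have to grow in step with the 11 Gb of data;
leaving free RAM on the box for the page cache matters as much as raising
-Xmx.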