Hi,

We're in the development phase of a new application, and the dev team currently leans towards running Solr (4.9) in AWS without ZooKeeper. The theory is that we can programmatically add nodes to our load balancer quickly by taking a dump of the indexes from an existing node and copying it over to the new one. A RESTful API would let other applications talk to Solr without each of them having to use SolrJ. Data ingestion happens nightly in bulk by way of ActiveMQ: each server subscribes to it and pulls its own copy of the indexes. Incremental updates are very few during the day, but we would have some mechanism for getting a new server to 'catch up' to the live servers before making it active in the load balancer.
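A minimal sketch of what such a 'catch up' gate could look like, in Python. Everything here is illustrative rather than a settled design: the function names, the tolerance, and the retry settings are hypothetical, and comparing document counts is only a rough proxy (an update that replaces an existing document leaves the count unchanged). The one real Solr piece is the match-all query with rows=0, whose JSON response carries the count in response.numFound.

```python
import json
import time
import urllib.request


def num_found(core_url, timeout=5.0):
    """Return the document count of one Solr core via a rows=0 match-all query.

    core_url is assumed to look like http://host:8983/solr/collection1.
    """
    url = core_url.rstrip("/") + "/select?q=*:*&rows=0&wt=json"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["response"]["numFound"]


def is_caught_up(new_count, live_counts, tolerance=0):
    """True when the new node is within `tolerance` docs of the fullest live node."""
    if not live_counts:
        return False  # nothing to compare against, so do not activate blindly
    return max(live_counts) - new_count <= tolerance


def wait_until_caught_up(count_new, count_live, tolerance=0,
                         attempts=30, delay=10.0, sleep=time.sleep):
    """Poll until the new node catches up; return False if it never does.

    count_new and count_live are callables so the gate can be exercised
    without a running Solr instance.
    """
    for attempt in range(attempts):
        if is_caught_up(count_new(), count_live(), tolerance):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

In practice count_new and count_live would wrap num_found for the new node and for each live node, and the node would only be registered with the load balancer once wait_until_caught_up returns True.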
The only hurdle I see so far is data set size vs. heap size. If the index grows too large, we have to increase the heap size, which could lead to longer GC pauses. Servers could then drop in and out of the load balancer if a major GC makes them unavailable for too long.

Current stats:
11 GB of data (and growing)
4 GB Java heap
4 CPU, 16 GB RAM nodes (maybe more needed?)

All thoughts are welcome.

Thanks.

--
Joel Cohen
Devops Engineer
GrubHub Inc.
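For what it's worth, here is a back-of-the-envelope check of the stats above, as a rough sketch only. It assumes the index is served mostly from the OS page cache through memory-mapped files rather than from the Java heap, and it reserves 1 GB for the OS and other processes; both are assumptions, and the function name is hypothetical.

```python
def page_cache_headroom_gb(ram_gb, heap_gb, index_gb, os_reserve_gb=1.0):
    """GB left over after the JVM heap, an OS reserve, and a fully cached index.

    A negative result means the whole index cannot stay in the page cache.
    """
    return ram_gb - heap_gb - os_reserve_gb - index_gb


# Current node: 16 GB RAM, 4 GB heap, 11 GB index.
print(page_cache_headroom_gb(ram_gb=16, heap_gb=4, index_gb=11))  # 0.0

# Same node if the heap were doubled to 8 GB.
print(page_cache_headroom_gb(ram_gb=16, heap_gb=8, index_gb=11))  # -4.0
```

Under these assumptions the current nodes have essentially no headroom left, and raising the heap on the same node size would take memory away from caching the index.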