On 8/21/2013 6:23 PM, dmarini wrote:
Shawn, Thanks for your reply. All of these suggestions look like good ideas and I will follow up. We are running Solr via Jetty on Windows, with all of our zookeepers on the same boxes as the clouds. The reason for this is that we're on EC2 servers, so it gets ultra expensive to have a 6 box setup just to keep the zookeepers on separate boxes from the Solr instances.
You can have zookeeper on the same host as Solr; that's no problem. You should drop to just three total zookeepers, one per node, and use the chroot method to keep the clouds separate. You can probably run zookeeper with a max heap of 256MB, and it would likely never need more than 512MB. It doesn't use much memory at all.
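To make the chroot idea concrete, here is a minimal sketch of how the zkHost string for each cloud could be built. The host names and chroot names are made up for illustration; the one detail that matters is that the chroot path is appended once, after the last host:port pair, not after each one.

```python
def build_zk_host(hosts, chroot):
    """Join ZooKeeper host:port pairs and append a single chroot path."""
    if not chroot.startswith("/"):
        chroot = "/" + chroot
    return ",".join(hosts) + chroot

# Hypothetical host names and chroot names, for illustration only.
ZK_HOSTS = [
    "zk1.example.com:2181",
    "zk2.example.com:2181",
    "zk3.example.com:2181",
]

# One shared three-node ensemble, one chroot per cloud.
cloud_a = build_zk_host(ZK_HOSTS, "/cloud_a")
cloud_b = build_zk_host(ZK_HOSTS, "cloud_b")

print(cloud_a)
print(cloud_b)
```

Each Solr instance would then be started with its own cloud's string as the zkHost value (for example via -DzkHost=...), so all of the clouds share the same ensemble without seeing each other's data.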
Each of our Windows boxes has 8GB of RAM, with roughly 35 - 40% of it still seemingly free. Is there a tool or some way we can identify for certain if we're running into memory issues? I like your zookeeper idea and I didn't know that this was feasible. I will get a test bed set up that way soon. As for indexes, each cloud has multiple collections, but the largest entire cloud (multiple indexes) is about 200MB. Each collection is between 50 and 100MB, and I don't see them getting much bigger than that per index (but I do see more indexes being added to the clouds).
With indexes that small, I would run each Jetty/Solr with a max heap of 1GB. With three of them per server, that will mean that Solr is using 3GB of RAM, leaving 5GB for the OS disk cache. You could probably bump that to 1.5 or 2GB and still be OK.
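The arithmetic behind those numbers, as a rough sketch. It only counts the max heap sizes; a JVM uses somewhat more than its max heap and the OS needs some memory of its own, so treat the result as an upper bound on what is left for the disk cache. The function name and the optional zookeeper parameter are just for illustration.

```python
def disk_cache_gb(total_ram_gb, solr_instances, solr_heap_gb, zk_heap_gb=0.0):
    """RAM left over once the configured Java max heaps are subtracted."""
    return total_ram_gb - solr_instances * solr_heap_gb - zk_heap_gb

# 8GB box, three Jetty/Solr instances, at the heap sizes discussed above.
for heap in (1.0, 1.5, 2.0):
    left = disk_cache_gb(8, 3, heap)
    print(f"{heap} GB heap per Solr instance -> {left} GB left over")

# Same box at 1 GB per instance, also counting a 256MB zookeeper heap.
print(disk_cache_gb(8, 3, 1.0, zk_heap_gb=0.25))
```

Even at 2GB per instance the leftover is several times the roughly 200MB of index data you describe, which is why the larger heaps should still be OK.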
Is there a definitive advantage to running Solr on a Linux box over Windows? I need to be able to justify the time and effort it will take to get up to speed on an unfamiliar OS if we're going to go that route, but if there's a good enough reason I don't see why not.
Linux manages memory better than Windows, and ext4 is a much better filesystem than NTFS. If you are familiar with Windows, there's nothing wrong with continuing to use it, except for the fact that you have to give Microsoft a few hundred bucks per machine for a server OS when you take it into production. You can run Linux for free.
--Would it be helpful to have the zookeeper ensemble on a different disk drive than the clouds?
--Can the chattiness of all of the replication and zookeeper communication for multiple clouds/collections cause any of these issues? (We do have some collections that are in constant flux with 1 - 5 requests each second, which we gather up and send to Solr in batches of 250 documents or a 10 second flush.)
It never hurts to have things separated so they are on different disks, but SolrCloud will put hardly any load on zookeeper, so I don't think it matters much. It is Solr itself that will take that load.
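For reference, the batching you describe (250 documents or a 10 second flush, whichever comes first) can be sketched roughly like this. It is only an illustration of the pattern, not your actual code: the class name is made up, and send stands in for whatever function posts a list of documents to Solr.

```python
import time

class BatchBuffer:
    """Collect documents and flush at max_docs or after max_age seconds."""

    def __init__(self, send, max_docs=250, max_age=10.0, clock=time.monotonic):
        self.send = send          # hypothetical callable taking a list of docs
        self.max_docs = max_docs
        self.max_age = max_age
        self.clock = clock
        self.docs = []
        self.first_added = None   # arrival time of the oldest buffered doc

    def add(self, doc):
        if not self.docs:
            self.first_added = self.clock()
        self.docs.append(doc)
        self.flush_if_due()

    def flush_if_due(self):
        """Flush when the buffer is full or its oldest document is too old."""
        if not self.docs:
            return
        too_old = self.clock() - self.first_added >= self.max_age
        if len(self.docs) >= self.max_docs or too_old:
            self.flush()

    def flush(self):
        if self.docs:
            batch, self.docs = self.docs, []
            self.first_added = None
            self.send(batch)
```

Note that flush_if_due() has to be called from a timer as well as from add(); otherwise a partly filled buffer sits there until the next document arrives.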
Thanks, Shawn