I'm guessing the slaves you restarted were running low on RAM, and possibly engaged in out of control GC.
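
If GC is the culprit, the JVM startup flags discussed below may help. As a rough sketch (the 4 GB heap and the stock Jetty start.jar from the Solr example directory are placeholder assumptions, not recommendations), the startup line would look something like:

    java -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -jar start.jar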

I have had good luck using the JVM option "-XX:+UseConcMarkSweepGC", which seems to result in GC happening in another thread and not interfering with the servicing of requests. If that's what's going on, it may also indicate that you should give your Solr more RAM for your current cache settings. On the other hand, apparently out-of-control GC can sometimes also indicate that your JVM has too MUCH RAM. I don't know the answers; there doesn't seem to be any simple way to figure out how much RAM your Solr needs, or whether it's having problems as a result of not enough (or too much!).

For a server process like Solr, I also think it makes sense to set -Xms to the same value as -Xmx. I don't think there's much to be gained by leaving a window, and by setting them the same the JVM won't spend any time growing its heap.

________________________________________
From: Mason Hale [masonh...@gmail.com]
Sent: Wednesday, October 27, 2010 6:33 PM
To: solr-user@lucene.apache.org
Subject: Re: Inconsistent slave performance after optimize

Hi Lance --

Thanks for the reply.

> Did you restart all of these slave servers? That would help.

We discovered independently that restarting the slave nodes resulted in dramatically improved performance (e.g. from 2.0 sec average response to 0.25 sec average). Can you please explain why this is the case? I would expect a process restart to invalidate caches and thus trigger additional cache-warming overhead, slowing things down, not speeding things up.

> What garbage collection options do you use?

We've not tweaked the garbage collection settings. We're using -Xms512M -Xmx5000M on the command line.

> Which release of Solr?

version 1.4.0

> How many Searchers are there in admin/stats.jsp?

I'm looking much later, and after a restart -- but I currently see 2 searchers listed. I admit I'm not sure what I'm looking for on this page.

thanks,
Mason

On Wed, Oct 27, 2010 at 2:25 AM, Lance Norskog <goks...@gmail.com> wrote:
> Did you restart all of these slave servers? That would help.
> What garbage collection options do you use?
> Which release of Solr?
> How many Searchers are there in admin/stats.jsp?
> Searchers hold open all kinds of memory. They are supposed to cycle out.
>
> These are standard questions, but what you are seeing is definitely not normal.
>
> Separately, if you want a regular optimization regime, there is a new option called 'maxSegments' to the optimize command. If you have solrconfig mergeFactor set to 10, then optimizing with 'maxSegments=8' will roll up the very smallest segments. This allows you to have gradual optimizations (and replication overhead) instead of big ones.
>
> Mason Hale wrote:
>
>> Hello esteemed Solr community --
>>
>> I'm observing some inconsistent performance on our slave servers after recently optimizing our master server.
>>
>> Our configuration is as follows:
>>
>>  - all servers are hosted at Amazon EC2, running Ubuntu 8.04
>>  - 1 master with heavy insert/update traffic, about 125K new documents per day (m1.large, ~8GB RAM)
>>    - autocommit every 1 minute
>>  - 3 slaves (m2.xlarge instance size, ~16GB RAM)
>>    - replicate every 5 minutes
>>    - we have configured autowarming queries for these machines
>>    - autowarmCount = 0
>>  - total index size is ~7M documents
>>
>> We were seeing increasing, but gradual, performance degradation across all nodes. So we decided to try optimizing our index to improve performance.
>>
>> In preparation for the optimize we disabled replication polling on all slaves. We also turned off all workers that were writing to the index. Then we ran optimize on the master.
>>
>> The optimize took 45-60 minutes to complete, and the total size went from 68GB down to 23GB.
>>
>> We then enabled replication on each slave, one at a time.
>>
>> The first slave we re-enabled took about 15 minutes to copy the new files. Once the files were copied, the performance of the slave plummeted. Average response time went from 0.75 sec to 45 seconds. Over the past 18 hours the average response time has gradually gone down to around 1.2 seconds.
>>
>> Before re-enabling replication on the second slave, we first removed it from our load-balanced pool of available search servers. This server's average query performance also degraded quickly, and then (unlike the first slave we replicated) did not improve. It stayed at around 30 secs per query. On the theory that this was a cache-warming issue, we added this server back to the pool in hopes that additional traffic would warm the cache. But what we saw was a quick spike of much worse performance (50 sec/query on average) followed by a slow, gradual decline in average response times. As of now (10 hours after the initial replication) this server is still reporting an average response time of ~2 seconds. This is much worse than before the optimize and is a counter-intuitive result. We expected an index 1/3 the size to be faster, not slower.
>>
>> On the theory that the index files needed to be loaded into the file system cache, I used the 'dd' command to copy the contents of the data/index directory to /dev/null, but that did not result in any noticeable performance improvement.
>>
>> At this point, things were not going as expected. We did not expect the replication after an optimize to result in such horrid performance. So we decided to let the last slave continue to serve stale results while we waited 4 hours for the other two slaves to approach some acceptable performance level.
>>
>> After the 4-hour break, we removed the 3rd and last slave server from our load-balancing pool, then re-enabled replication. This time we saw a tiny blip: the average performance went up to 1 second briefly, then went back to the (normal for us) 0.25 to 0.5 second range. We then added this server back to the load-balancing pool and observed no degradation in performance.
>>
>> While we were happy to avoid a repeat of the poor performance we saw on the previous slaves, we are at a loss to explain why this slave did not also have such poor performance.
>>
>> At this point we're scratching our heads trying to understand:
>>   (a) Why the performance of the first two slaves was so terrible after the optimize. We think it's cache-warming related, but we're not sure. 10 hours seems like a long time to wait for the cache to warm up.
>>   (b) Why the performance of the third slave was barely impacted. It should have hit the same cold-cache issues as the other servers, if that is indeed the root cause.
>>   (c) Why the performance of the first 2 slaves is still much worse after the optimize than it was before the optimize, whereas the performance of the 3rd slave is pretty much unchanged. We expected the optimize to *improve* performance.
>>
>> All 3 slave servers are identically configured, and the procedure for re-enabling replication was identical for the 2nd and 3rd slaves, with the exception of a 4-hour wait period.
>>
>> We have confirmed that the 3rd slave did replicate; the number of documents and total index size match the master and the other slave servers.
>>
>> I'm writing to fish for an explanation or ideas that might explain this inconsistent performance. Obviously, we'd like to be able to reproduce the performance of the 3rd slave, and avoid the poor performance of the first two slaves, the next time we decide it's time to optimize our index.
>>
>> thanks in advance,
>>
>> Mason
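
P.S. Regarding the 'maxSegments' option Lance mentions above: in Solr 1.4 it can be passed as an attribute on the <optimize> message to the update handler. Assuming a default single-core install on localhost:8983 (the host, port, and path here are just placeholders), the request would look something like:

    curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<optimize maxSegments="8"/>'

This merges the index down to at most 8 segments rather than forcing a full single-segment optimize, which should keep each replication transfer smaller.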