Hello esteemed Solr community -- I'm observing some inconsistent performance on our slave servers after recently optimizing the index on our master server.
Our configuration is as follows:

- all servers are hosted at Amazon EC2, running Ubuntu 8.04
- 1 master with heavy insert/update traffic, about 125K new documents per day (m1.large, ~8GB RAM)
- autocommit every 1 minute
- 3 slaves (m2.xlarge instances, ~16GB RAM)
- slaves replicate every 5 minutes
- we have configured autowarming queries for these machines
- autowarmCount = 0
- total index size is ~7M documents

(A rough sketch of the relevant solrconfig.xml pieces is further down, just before my questions.)

We had been seeing gradual but steadily increasing performance degradation across all nodes, so we decided to try optimizing our index to improve performance. In preparation for the optimize we disabled replication polling on all slaves and shut off all workers that were writing to the index, then ran the optimize on the master (the rough commands we used are in the P.S. at the bottom). The optimize took 45-60 minutes to complete, and the total index size went from 68GB down to 23GB.

We then re-enabled replication on each slave, one at a time. The first slave we re-enabled took about 15 minutes to copy the new files. Once the files were copied, that slave's performance plummeted: average response time went from 0.75 seconds to 45 seconds. Over the past 18 hours the average response time has gradually come down to around 1.2 seconds.

Before re-enabling replication on the second slave, we first removed it from our load-balanced pool of available search servers. This server's average query performance also degraded quickly, and then (unlike the first slave we replicated) did not improve; it stayed at around 30 seconds per query. On the theory that this is a cache-warming issue, we added the server back to the pool in hopes that additional traffic would warm the cache. What we actually saw was a quick spike of much worse performance (50 seconds per query on average) followed by a slow, gradual decline in average response times. As of now (10 hours after the initial replication) this server is still reporting an average response time of ~2 seconds. This is much worse than before the optimize and is a counter-intuitive result: we expected an index 1/3 the size to be faster, not slower. On the theory that the index files needed to be loaded into the file system cache, I used the 'dd' command to copy the contents of the data/index directory to /dev/null, but that did not produce any noticeable improvement.

At this point, things were not going as expected. We did not anticipate that replication after an optimize would result in such horrid performance, so we decided to let the last slave continue serving stale results while we waited 4 hours for the other two slaves to approach an acceptable performance level.

After the 4-hour break, we removed the 3rd (and last) slave from our load-balancing pool, then re-enabled replication. This time we saw only a tiny blip: average response time went up to 1 second briefly, then returned to our normal 0.25 to 0.5 second range. We then added this server back to the load-balancing pool and observed no degradation in performance. While we were happy to avoid a repeat of the poor performance we saw on the previous slaves, we are at a loss to explain why this slave did not suffer the same way.
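For reference, here is roughly what the relevant pieces of our solrconfig.xml look like. This is a sketch from memory rather than a verbatim copy, using the element names of the stock Solr 1.4 ReplicationHandler; hostnames, cache sizes, and the warming query are placeholders:

    <!-- master solrconfig.xml (sketch) -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxTime>60000</maxTime>   <!-- autocommit every 1 minute -->
      </autoCommit>
    </updateHandler>

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">optimize</str>
      </lst>
    </requestHandler>

    <!-- slave solrconfig.xml (sketch) -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master:8983/solr/replication</str>  <!-- placeholder hostname -->
        <str name="pollInterval">00:05:00</str>   <!-- replicate every 5 minutes -->
      </lst>
    </requestHandler>

    <!-- caches: no autowarming from the old searcher... -->
    <filterCache class="solr.LRUCache" size="..." initialSize="..." autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache" size="..." initialSize="..." autowarmCount="0"/>

    <!-- ...but we do fire a fixed set of warming queries when a new searcher opens -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">(one of our common queries)</str></lst>
      </arr>
    </listener>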
At this point we're scratching our heads trying to understand:

(a) Why the performance of the first two slaves was so terrible after the optimize. We think it's cache-warming related, but we're not sure. More than 10 hours seems like a long time to wait for the caches to warm up.

(b) Why the performance of the third slave was barely impacted. It should have hit the same cold-cache issues as the other servers, if that is indeed the root cause.

(c) Why the performance of the first 2 slaves is still much worse after the optimize than it was before, whereas the performance of the 3rd slave is essentially unchanged. We expected the optimize to *improve* performance.

All 3 slave servers are identically configured, and the procedure for re-enabling replication was identical for the 2nd and 3rd slaves, with the exception of the 4-hour wait. We have confirmed that the 3rd slave did replicate: its document count and total index size match the master and the other slave servers.

I'm writing to fish for an explanation or ideas that might account for this inconsistent performance. Obviously, we'd like to be able to reproduce the behavior of the 3rd slave, and avoid the poor performance of the first two, the next time we decide it's time to optimize our index.

thanks in advance,
Mason
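P.S. In case it helps, this is roughly the sequence of commands we ran. Hostnames and paths are placeholders, and the Solr URLs use the standard ReplicationHandler HTTP commands:

    # 1. Disable replication polling on each slave
    curl 'http://slave1:8983/solr/replication?command=disablepoll'

    # 2. Stop the indexing workers, then optimize on the master
    curl 'http://master:8983/solr/update?optimize=true'

    # 3. Later, re-enable polling on one slave at a time
    curl 'http://slave1:8983/solr/replication?command=enablepoll'

    # 4. On the slow slave: attempt to pre-load the index files into the OS file cache
    cd /path/to/solr/data/index
    for f in *; do dd if="$f" of=/dev/null bs=1M; done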