Hello esteemed Solr community -- I'm observing some inconsistent performance on our slave servers after recently optimizing the index on our master server.
Our configuration is as follows:

- all servers are hosted at Amazon EC2, running Ubuntu 8.04
- 1 master with heavy insert/update traffic, about 125K new documents per day (m1.large, ~8GB RAM)
- autocommit every 1 minute
- 3 slaves (m2.xlarge instances, ~16GB RAM)
- slaves replicate every 5 minutes
- we have configured autowarming queries for these machines
- autowarmCount = 0
- total index size is ~7M documents

(A rough sketch of the relevant solrconfig.xml pieces is further down, just before my questions.)

We had been seeing gradual but steadily increasing performance degradation across all nodes, so we decided to try optimizing our index to improve performance. In preparation for the optimize we disabled replication polling on all slaves and shut off all workers that were writing to the index, then ran the optimize on the master (the rough commands we used are in the P.S. at the bottom). The optimize took 45-60 minutes to complete, and the total index size went from 68GB down to 23GB.

We then re-enabled replication on each slave, one at a time. The first slave we re-enabled took about 15 minutes to copy the new files. Once the files were copied, that slave's performance plummeted: average response time went from 0.75 seconds to 45 seconds. Over the past 18 hours the average response time has gradually come down to around 1.2 seconds.

Before re-enabling replication on the second slave, we first removed it from our load-balanced pool of available search servers. This server's average query performance also degraded quickly, and then (unlike the first slave we replicated) did not improve; it stayed at around 30 seconds per query. On the theory that this is a cache-warming issue, we added the server back to the pool in hopes that additional traffic would warm the cache. What we actually saw was a quick spike of much worse performance (50 seconds per query on average) followed by a slow, gradual decline in average response times. As of now (10 hours after the initial replication) this server is still reporting an average response time of ~2 seconds. This is much worse than before the optimize and is a counter-intuitive result: we expected an index 1/3 the size to be faster, not slower. On the theory that the index files needed to be loaded into the file system cache, I used the 'dd' command to copy the contents of the data/index directory to /dev/null, but that did not produce any noticeable improvement.

At this point, things were not going as expected. We did not anticipate that replication after an optimize would result in such horrid performance, so we decided to let the last slave continue serving stale results while we waited 4 hours for the other two slaves to approach an acceptable performance level.

After the 4-hour break, we removed the 3rd (and last) slave from our load-balancing pool, then re-enabled replication. This time we saw only a tiny blip: average response time went up to 1 second briefly, then returned to our normal 0.25 to 0.5 second range. We then added this server back to the load-balancing pool and observed no degradation in performance. While we were happy to avoid a repeat of the poor performance we saw on the previous slaves, we are at a loss to explain why this slave did not suffer the same way.
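For reference, here is roughly what the relevant pieces of our solrconfig.xml look like. This is a sketch from memory rather than a verbatim copy, using the element names of the stock Solr 1.4 ReplicationHandler; hostnames, cache sizes, and the warming query are placeholders:

    <!-- master solrconfig.xml (sketch) -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxTime>60000</maxTime>   <!-- autocommit every 1 minute -->
      </autoCommit>
    </updateHandler>

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">optimize</str>
      </lst>
    </requestHandler>

    <!-- slave solrconfig.xml (sketch) -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master:8983/solr/replication</str>  <!-- placeholder hostname -->
        <str name="pollInterval">00:05:00</str>   <!-- replicate every 5 minutes -->
      </lst>
    </requestHandler>

    <!-- caches: no autowarming from the old searcher... -->
    <filterCache class="solr.LRUCache" size="..." initialSize="..." autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache" size="..." initialSize="..." autowarmCount="0"/>

    <!-- ...but we do fire a fixed set of warming queries when a new searcher opens -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">(one of our common queries)</str></lst>
      </arr>
    </listener>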
At this point we're scratching our heads trying to understand:

(a) Why the performance of the first two slaves was so terrible after the optimize. We think it's cache-warming related, but we're not sure. More than 10 hours seems like a long time to wait for the caches to warm up.

(b) Why the performance of the third slave was barely impacted. It should have hit the same cold-cache issues as the other servers, if that is indeed the root cause.

(c) Why the performance of the first 2 slaves is still much worse after the optimize than it was before, whereas the performance of the 3rd slave is essentially unchanged. We expected the optimize to *improve* performance.

All 3 slave servers are identically configured, and the procedure for re-enabling replication was identical for the 2nd and 3rd slaves, with the exception of the 4-hour wait. We have confirmed that the 3rd slave did replicate: its document count and total index size match the master and the other slave servers.

I'm writing to fish for an explanation or ideas that might account for this inconsistent performance. Obviously, we'd like to be able to reproduce the behavior of the 3rd slave, and avoid the poor performance of the first two, the next time we decide it's time to optimize our index.

thanks in advance,
Mason
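P.S. In case it helps, this is roughly the sequence of commands we ran. Hostnames and paths are placeholders, and the Solr URLs use the standard ReplicationHandler HTTP commands:

    # 1. Disable replication polling on each slave
    curl 'http://slave1:8983/solr/replication?command=disablepoll'

    # 2. Stop the indexing workers, then optimize on the master
    curl 'http://master:8983/solr/update?optimize=true'

    # 3. Later, re-enable polling on one slave at a time
    curl 'http://slave1:8983/solr/replication?command=enablepoll'

    # 4. On the slow slave: attempt to pre-load the index files into the OS file cache
    cd /path/to/solr/data/index
    for f in *; do dd if="$f" of=/dev/null bs=1M; done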