Hello esteemed Solr community --
I'm observing some inconsistent performance on our slave servers after recently optimizing our master server.
Our configuration is as follows:
- all servers are hosted at Amazon EC2, running Ubuntu 8.04
- 1 master with heavy insert/update traffic, about 125K new documents per day (m1.large, ~8GB RAM)
- autocommit every 1 minute
- 3 slaves (m2.xlarge instances, ~16GB RAM)
- replicate every 5 minutes
- we have configured autowarming queries for these machines
- autowarmCount = 0
- total index size is ~7M documents
We were seeing gradual but steadily increasing performance degradation across all nodes. So we decided to try optimizing our index to improve performance.
In preparation for the optimize we disabled replication polling on all slaves. We also turned off all workers that were writing to the index. Then we ran optimize on the master. The optimize took 45-60 minutes to complete, and the total index size went from 68GB down to 23GB.
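For concreteness, the disable/optimize steps were along these lines (host names and ports are placeholders, and this assumes the Solr 1.4 Java replication handler registered at /replication rather than the older rsync scripts):

    # disable replication polling on each slave before touching the master
    curl 'http://slave1:8983/solr/replication?command=disablepoll'
    curl 'http://slave2:8983/solr/replication?command=disablepoll'
    curl 'http://slave3:8983/solr/replication?command=disablepoll'

    # with all index writers stopped, optimize the master
    curl 'http://master:8983/solr/update?optimize=true'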
We then re-enabled replication on each slave, one at a time. The first slave took about 15 minutes to copy the new files. Once the files were copied, the performance of that slave plummeted: average response time went from 0.75 seconds to 45 seconds.
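For reference, the re-enable step on each slave was just the poll toggle, and the replication handler's details command is one way to watch the file copy progress (host name is a placeholder):

    # re-enable polling on a single slave and watch the index fetch
    curl 'http://slave1:8983/solr/replication?command=enablepoll'
    curl 'http://slave1:8983/solr/replication?command=details'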
Over the past 18 hours the average response time has gradually come back down, and is now around 1.2 seconds.
Before re-enabling replication on the second slave, we first removed it from our load-balanced pool of available search servers.
This server's average query performance also degraded quickly, and then (unlike the first slave we replicated) did not improve. It stayed at around 30 seconds per query. On the theory that this was a cache-warming issue, we added this server back to the pool in hopes that additional traffic would warm the cache. But what we saw was a quick spike of much worse performance (50 seconds per query on average) followed by a slow, gradual decline in average response times. As of now (10 hours after the initial replication) this server is still reporting an average response time of ~2 seconds.
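As a concrete sketch of what we mean by warming the cache with traffic: replaying a handful of representative queries against a freshly replicated slave before it takes production traffic would look roughly like this (the query strings are made-up examples, not our real traffic, and the host name is a placeholder):

    # hypothetical warm-up: replay a few representative queries so the new
    # searcher and OS page cache are primed before the slave takes traffic
    for q in 'foo' 'bar' 'category:books'; do
      curl -s "http://slave2:8983/solr/select?q=$q&rows=10" > /dev/null
    done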
Two seconds is still much worse than before the optimize, and it is a counter-intuitive result: we expected an index one-third the size to be faster, not slower.
On the theory that the index files needed to be loaded into the file system cache, I used the 'dd' command to copy the contents of the data/index directory to /dev/null, but that did not result in any noticeable performance improvement.
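Roughly the following, with the index path being a placeholder for wherever the core's data directory actually lives:

    # read every index file once to pull it into the OS page cache
    # (SOLR_HOME is a placeholder, not our actual path)
    for f in "$SOLR_HOME"/data/index/*; do
      dd if="$f" of=/dev/null bs=1M
    done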
At this point, things were not going as expected. We did not expect the replication after an optimize to result in such horrid performance. So we decided to let the last slave continue to serve stale results while we waited 4 hours for the other two slaves to approach some acceptable performance level.
After the 4-hour break, we removed the 3rd and last slave server from our load-balancing pool, then re-enabled replication. This time we saw only a tiny blip: the average response time went up to 1 second briefly, then went back to the (normal for us) 0.25 to 0.5 second range. We then added this server back to the load-balancing pool and observed no degradation in performance.
While we were happy to avoid a repeat of the poor performance we saw on the previous slaves, we are at a loss to explain why this slave did not also have such poor performance.
At this point we're scratching our heads trying to understand:
(a) Why the performance of the first two slaves was so terrible after the optimize. We think it's cache-warming related, but we're not sure; 10 hours seems like a long time to wait for a cache to warm up.
(b) Why the performance of the third slave was barely impacted. It should have hit the same cold-cache issues as the other servers, if that is indeed the root cause.
(c) Why the performance of the first two slaves is still much worse after the optimize than it was before, whereas the performance of the third slave is pretty much unchanged. We expected the optimize to *improve* performance.
All 3 slave servers are identically configured, and the procedure for re-enabling replication was identical for the 2nd and 3rd slaves, with the exception of the 4-hour wait period.
We have confirmed that the 3rd slave did replicate: the number of documents and total index size match the master and the other slave servers.
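The check was along these lines (host names are placeholders, and this assumes the stock /select and /replication handlers):

    # compare document counts and index versions across master and slaves
    for h in master slave1 slave2 slave3; do
      echo "== $h =="
      curl -s "http://$h:8983/solr/select?q=*:*&rows=0" | grep -o 'numFound="[0-9]*"'
      curl -s "http://$h:8983/solr/replication?command=indexversion"; echo
    done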
I'm writing to fish for ideas that might explain this inconsistent performance. Obviously, we'd like to be able to reproduce the performance of the 3rd slave, and avoid the poor performance of the first two slaves, the next time we decide it's time to optimize our index.
thanks in advance,
Mason