Hello esteemed Solr community --
I'm observing some inconsistent performance on our slave servers after recently optimizing our master server.
Our configuration is as follows:
- all servers are hosted at Amazon EC2, running Ubuntu 8.04
- 1 master with heavy insert/update traffic, about 125K new documents per day (m1.large, ~8GB RAM)
- autocommit every 1 minute
- 3 slaves (m2.xlarge instances, ~16GB RAM)
- replicate every 5 minutes
- we have configured autowarming queries for these machines
- autowarmCount = 0
- total index size is ~7M documents
We were seeing gradual but steadily increasing performance degradation across all nodes. So we decided to try optimizing our index to improve performance.
In preparation for the optimize we disabled replication polling on all slaves. We also turned off all workers that were writing to the index. Then we ran optimize on the master. The optimize took 45-60 minutes to complete, and the total index size went from 68GB down to 23GB.
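For concreteness, the disable/optimize steps were along these lines (host names and ports are placeholders, and this assumes the Solr 1.4 Java replication handler registered at /replication rather than the older rsync scripts):

    # disable replication polling on each slave before touching the master
    curl 'http://slave1:8983/solr/replication?command=disablepoll'
    curl 'http://slave2:8983/solr/replication?command=disablepoll'
    curl 'http://slave3:8983/solr/replication?command=disablepoll'

    # with all index writers stopped, optimize the master
    curl 'http://master:8983/solr/update?optimize=true'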
We then re-enabled replication on each slave, one at a time. The first slave took about 15 minutes to copy the new files. Once the files were copied, the performance of that slave plummeted: average response time went from 0.75 seconds to 45 seconds.
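For reference, the re-enable step on each slave was just the poll toggle, and the replication handler's details command is one way to watch the file copy progress (host name is a placeholder):

    # re-enable polling on a single slave and watch the index fetch
    curl 'http://slave1:8983/solr/replication?command=enablepoll'
    curl 'http://slave1:8983/solr/replication?command=details'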
Over the past 18 hours the average response time has gradually come back down, and is now around 1.2 seconds.
Before re-enabling replication on the second slave, we first removed it from our load-balanced pool of available search servers.
This server's average query performance also degraded quickly, and then (unlike the first slave we replicated) did not improve. It stayed at around 30 seconds per query. On the theory that this was a cache-warming issue, we added this server back to the pool in hopes that additional traffic would warm the cache. But what we saw was a quick spike of much worse performance (50 seconds per query on average) followed by a slow, gradual decline in average response times. As of now (10 hours after the initial replication) this server is still reporting an average response time of ~2 seconds.
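As a concrete sketch of what we mean by warming the cache with traffic: replaying a handful of representative queries against a freshly replicated slave before it takes production traffic would look roughly like this (the query strings are made-up examples, not our real traffic, and the host name is a placeholder):

    # hypothetical warm-up: replay a few representative queries so the new
    # searcher and OS page cache are primed before the slave takes traffic
    for q in 'foo' 'bar' 'category:books'; do
      curl -s "http://slave2:8983/solr/select?q=$q&rows=10" > /dev/null
    done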
Two seconds is still much worse than before the optimize, and it is a counter-intuitive result: we expected an index one-third the size to be faster, not slower.
On the theory that the index files needed to be loaded into the file system cache, I used the 'dd' command to copy the contents of the data/index directory to /dev/null, but that did not result in any noticeable performance improvement.
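Roughly the following, with the index path being a placeholder for wherever the core's data directory actually lives:

    # read every index file once to pull it into the OS page cache
    # (SOLR_HOME is a placeholder, not our actual path)
    for f in "$SOLR_HOME"/data/index/*; do
      dd if="$f" of=/dev/null bs=1M
    done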
At this point, things were not going as expected. We did not expect the replication after an optimize to result in such horrid performance. So we decided to let the last slave continue to serve stale results while we waited 4 hours for the other two slaves to approach some acceptable performance level.
After the 4-hour break, we removed the 3rd and last slave server from our load-balancing pool, then re-enabled replication. This time we saw only a tiny blip: the average response time went up to 1 second briefly, then went back to the (normal for us) 0.25 to 0.5 second range. We then added this server back to the load-balancing pool and observed no degradation in performance.
While we were happy to avoid a repeat of the poor performance we saw on the previous slaves, we are at a loss to explain why this slave did not also have such poor performance.
At this point we're scratching our heads trying to understand:
(a) Why the performance of the first two slaves was so terrible after the optimize. We think it's cache-warming related, but we're not sure; 10 hours seems like a long time to wait for a cache to warm up.
(b) Why the performance of the third slave was barely impacted. It should have hit the same cold-cache issues as the other servers, if that is indeed the root cause.
(c) Why the performance of the first two slaves is still much worse after the optimize than it was before, whereas the performance of the third slave is pretty much unchanged. We expected the optimize to *improve* performance.
All 3 slave servers are identically configured, and the procedure for re-enabling replication was identical for the 2nd and 3rd slaves, with the exception of the 4-hour wait period.
We have confirmed that the 3rd slave did replicate: the number of documents and total index size match the master and the other slave servers.
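The check was along these lines (host names are placeholders, and this assumes the stock /select and /replication handlers):

    # compare document counts and index versions across master and slaves
    for h in master slave1 slave2 slave3; do
      echo "== $h =="
      curl -s "http://$h:8983/solr/select?q=*:*&rows=0" | grep -o 'numFound="[0-9]*"'
      curl -s "http://$h:8983/solr/replication?command=indexversion"; echo
    done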
I'm writing to fish for ideas that might explain this inconsistent performance. Obviously, we'd like to be able to reproduce the performance of the 3rd slave, and avoid the poor performance of the first two slaves, the next time we decide it's time to optimize our index.
thanks in advance,
Mason