I'm working on performance tuning / testing a solr 4 cluster - the slaves
are handling all the queries and they are currently replicating from the
master every minute.

Right now for my load test, I'm playing back random queries form a set of
about 5M queries harvested from our existing production servers. Whenever
the replication ends and the new searcher gets used there is a jump in
QTimes.

The QTimes jump briefly but ONLY after replication. I tried adding in some
warming queries to create facets etc - and I analyzed the queries and they
all seem relatively similar (use a large number of facets etc) - but even
when I have converted those exact queries over to the format that is in the
logs, it doesn't seem to eliminate the "post-repl bump" of qtimes jumping
from single digits to 500 - 800ms.

Here's an example of what i'm talking about.

-- replication ends (message about deleting temporary index in the logs)
date  |  qtime
21:50:45.159 | 0 |
21:50:45.170 | 1 |
21:50:45.174 | 0 |
21:50:45.182 | 2 |
21:50:45.185 | 78 |
21:50:45.194 | 1 |
21:50:45.201 | 69 |
21:50:45.206 | 0 |
21:50:45.211 | 0 |
21:50:45.219 | 0 |
21:50:45.286 | 1 |
21:50:45.288 | 0 |
21:50:45.301 | 30 |
21:50:45.317 | 15 |
21:50:45.327 | 2 |
21:50:45.327 | 1 |
21:50:45.334 | 0 |
21:50:45.337 | 1 |
21:50:45.345 | 1 |
21:50:45.347 | 2 |
21:50:45.392 | 19 |
21:50:45.415 | 47 |
21:50:45.428 | 1 |
21:50:45.438 | 2 |
21:50:45.453 | 1 |
21:50:45.468 | 3 |
21:50:45.507 | 4 |
21:50:45.551 | 1 |
21:50:45.617 | 92 |
21:50:45.617 | 251 |
21:50:45.619 | 457 |
21:50:45.632 | 1 |
21:50:45.731 | 500 |
21:50:45.731 | 437 |
21:50:45.731 | 526 |
21:50:45.731 | 514 |
21:50:45.731 | 354 |
21:50:45.731 | 531 |
21:50:45.731 | 525 |
21:50:45.731 | 525 |
21:50:45.732 | 502 |
21:50:45.732 | 452 |
21:50:45.732 | 278 |
21:50:45.732 | 527 |
21:50:45.732 | 270 |
21:50:45.732 | 576 |
21:50:45.733 | 221 |
21:50:45.733 | 225 |
21:50:45.734 | 265 |
21:50:45.735 | 370 |
21:50:45.737 | 551 |
21:50:45.737 | 517 |
21:50:45.738 | 440 |
21:50:45.738 | 477 |
21:50:45.738 | 43 |
21:50:45.738 | 299 |
21:50:45.739 | 541 |
21:50:45.825 | 1 |
21:50:45.838 | 5 |
21:50:45.848 | 0 |
21:50:45.852 | 0 |
21:50:45.859 | 19 |
21:50:45.875 | 6 |
21:50:45.876 | 7 |
21:50:45.881 | 0 |
21:50:45.883 | 14 |
21:50:45.886 | 1 |
21:50:45.890 | 4 |
21:50:45.891 | 3 |
21:50:45.894 | 1 |
21:50:45.902 | 1 |
21:50:45.906 | 4 |
21:50:45.908 | 3 |
21:50:45.918 | 18 |
21:50:45.921 | 10 |

The DocumentCache in our case is not very useful because of our 1 minute
replication pattern. I have it sized to 1024 elements now. When I've tried
to increase the size, it caused our GC pause times to skyrocket. (Currently
it's tuned so it has a 250ms GC pause roughly every 16 seconds - and I've
verified that the above QTime bump is not due to GC activity)

Is there something I can do to help with these? (short of increasing the
replication time to a longer interval to mitigate the impact of these
bumps)?

I guess the real question I have is "Why do queries get faster after a
second or so after replication? How can I try to get it to do that as part
of the newSearcher warming??" - Like I said, I've copied some of those slow
queries and put them into the newSearcher warming section to see if "well,
maybe running through a few dozen of these searches that's what makes it
get faster" - but that hasn't helped.

Our index file is stored on disk but the OS basically has it all cached in
RAM (I tested moving it to tmpfs but saw no improvement in speed - so I
went back to putting it on disk) - and the CPU is not anywhere near taxed
(it's got 24 cores on this machine)

So far the performance of SOLR has been stellar - and once we finish tuning
this we'll write up how we've tuned it / how we're using it to share out
more widely with anyone who cares - but the one perplexing thing is this
bump in bad times after replication.

Any thoughts would be appreciated.

Ryan

Reply via email to