I'm working on performance tuning / testing a solr 4 cluster - the slaves are handling all the queries and they are currently replicating from the master every minute.
Right now for my load test, I'm playing back random queries form a set of about 5M queries harvested from our existing production servers. Whenever the replication ends and the new searcher gets used there is a jump in QTimes. The QTimes jump briefly but ONLY after replication. I tried adding in some warming queries to create facets etc - and I analyzed the queries and they all seem relatively similar (use a large number of facets etc) - but even when I have converted those exact queries over to the format that is in the logs, it doesn't seem to eliminate the "post-repl bump" of qtimes jumping from single digits to 500 - 800ms. Here's an example of what i'm talking about. -- replication ends (message about deleting temporary index in the logs) date | qtime 21:50:45.159 | 0 | 21:50:45.170 | 1 | 21:50:45.174 | 0 | 21:50:45.182 | 2 | 21:50:45.185 | 78 | 21:50:45.194 | 1 | 21:50:45.201 | 69 | 21:50:45.206 | 0 | 21:50:45.211 | 0 | 21:50:45.219 | 0 | 21:50:45.286 | 1 | 21:50:45.288 | 0 | 21:50:45.301 | 30 | 21:50:45.317 | 15 | 21:50:45.327 | 2 | 21:50:45.327 | 1 | 21:50:45.334 | 0 | 21:50:45.337 | 1 | 21:50:45.345 | 1 | 21:50:45.347 | 2 | 21:50:45.392 | 19 | 21:50:45.415 | 47 | 21:50:45.428 | 1 | 21:50:45.438 | 2 | 21:50:45.453 | 1 | 21:50:45.468 | 3 | 21:50:45.507 | 4 | 21:50:45.551 | 1 | 21:50:45.617 | 92 | 21:50:45.617 | 251 | 21:50:45.619 | 457 | 21:50:45.632 | 1 | 21:50:45.731 | 500 | 21:50:45.731 | 437 | 21:50:45.731 | 526 | 21:50:45.731 | 514 | 21:50:45.731 | 354 | 21:50:45.731 | 531 | 21:50:45.731 | 525 | 21:50:45.731 | 525 | 21:50:45.732 | 502 | 21:50:45.732 | 452 | 21:50:45.732 | 278 | 21:50:45.732 | 527 | 21:50:45.732 | 270 | 21:50:45.732 | 576 | 21:50:45.733 | 221 | 21:50:45.733 | 225 | 21:50:45.734 | 265 | 21:50:45.735 | 370 | 21:50:45.737 | 551 | 21:50:45.737 | 517 | 21:50:45.738 | 440 | 21:50:45.738 | 477 | 21:50:45.738 | 43 | 21:50:45.738 | 299 | 21:50:45.739 | 541 | 21:50:45.825 | 1 | 21:50:45.838 | 5 | 21:50:45.848 | 0 | 21:50:45.852 | 0 | 21:50:45.859 | 19 | 21:50:45.875 | 6 | 21:50:45.876 | 7 | 21:50:45.881 | 0 | 21:50:45.883 | 14 | 21:50:45.886 | 1 | 21:50:45.890 | 4 | 21:50:45.891 | 3 | 21:50:45.894 | 1 | 21:50:45.902 | 1 | 21:50:45.906 | 4 | 21:50:45.908 | 3 | 21:50:45.918 | 18 | 21:50:45.921 | 10 | The DocumentCache in our case is not very useful because of our 1 minute replication pattern. I have it sized to 1024 elements now. When I've tried to increase the size, it caused our GC pause times to skyrocket. (Currently it's tuned so it has a 250ms GC pause roughly every 16 seconds - and I've verified that the above QTime bump is not due to GC activity) Is there something I can do to help with these? (short of increasing the replication time to a longer interval to mitigate the impact of these bumps)? I guess the real question I have is "Why do queries get faster after a second or so after replication? How can I try to get it to do that as part of the newSearcher warming??" - Like I said, I've copied some of those slow queries and put them into the newSearcher warming section to see if "well, maybe running through a few dozen of these searches that's what makes it get faster" - but that hasn't helped. Our index file is stored on disk but the OS basically has it all cached in RAM (I tested moving it to tmpfs but saw no improvement in speed - so I went back to putting it on disk) - and the CPU is not anywhere near taxed (it's got 24 cores on this machine) So far the performance of SOLR has been stellar - and once we finish tuning this we'll write up how we've tuned it / how we're using it to share out more widely with anyone who cares - but the one perplexing thing is this bump in bad times after replication. Any thoughts would be appreciated. Ryan