1.4 is ancient, but you know that already :)... Anyway, what are your autocommit settings? That vintage of Solr blocks indexing while committing, and a commit may include rewriting the entire index, so part of your regular slowdown is likely segment merging happening along with the commit.

The 14-hour cycle is a bit weird, though. One thing I'd be curious about: when that happens, look at your index on disk and see whether you've merged down to just a few (or even one) segments. One possible explanation is that roughly that often, the merge that happens rewrites the entire index, and that takes 500 seconds. If that's true, you should see a few massive segments in your index right after it happens.
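For reference, the autocommit knobs in 1.4 live under DirectUpdateHandler2 in solrconfig.xml; with your 60s / 1000 docs settings the block would look roughly like this (maxTime is in milliseconds):

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>1000</maxDocs>   <!-- commit after this many pending docs -->
        <maxTime>60000</maxTime>  <!-- ...or after this many milliseconds -->
      </autoCommit>
    </updateHandler>

And the quickest way to check the segment question is just to look at the index directory right after one of the 500-second commits (the path here is a placeholder for wherever your dataDir points):

    ls -lh /path/to/solr/data/index
    # a handful of very large .cfs/.fdt files right after the long commit
    # would suggest everything has been merged down to a few big segments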
I'm assuming your autocommit settings aren't, like, 14 hours... Does anything issue an optimize command? That will also block updates until it has rewritten the entire index.

I don't know of a good stop-gap, though. Even a master/slave setup would still have this problem on the master. You might be able to do something with stopping the indexing process, issuing a manual optimize (see the sketch at the very bottom of this mail, below the quoted message), and then starting the indexing up again. About all that would do, though, is make the slowdowns predictable. Not much help, I know. Here's a writeup, though:
http://blog.trifork.com/2011/04/01/gimme-all-resources-you-have-i-can-use-them/

Best,
Erick

On Tue, Nov 5, 2013 at 2:15 AM, Stephen Delano <stevi...@gmail.com> wrote:
> Hi all,
>
> I wanted to share the issues we're having with Solr 1.4 to get some ideas
> of things we can do in the short term that will buy us enough time to
> validate Solr 4 before upgrading and not have 1.4 burn to the ground before
> we get there.
>
> We've been running Solr 1.4 in production for over 3 years now, but are
> really starting to hit some performance bottlenecks that are beginning to
> affect our users. Here are the details of our setup:
>
> We're running 2 4-CPU Solr servers. The data is on a 4-disk RAID 10 array
> and we're using block-level replication via DRBD over GigE to write to the
> standby node. Only one server is serving traffic at a time.
>
> Some tuning information:
> - Merge Factor: 25
> - Auto Commit: 60s / 1000 docs
>
> What we're seeing:
> In roughly 14 hour cycles, the CPU usage climbs from 100% to between 200
> and 250%. At the end of the cycle, we get one long commit of roughly 500
> seconds, blocking all writes. Around the same time queries begin to get
> very slow, often causing timeouts from connecting clients. This behavior is
> cyclical, and is getting progressively worse.
>
> What is this, and what can we do about it?
>
> I've attached relevant graphs. Apologies in advance for the obscenely large
> image sizes.
>
> Cheers,
> Stephen
>
> client-requests-2.png <https://docs.google.com/file/d/0B7_6ZI9PZjjUN1lhd1hfSE9Jc2M/edit?usp=drive_web>
> cpu-usage.png <https://docs.google.com/file/d/0B7_6ZI9PZjjUSHpsY1B2T01iVGM/edit?usp=drive_web>
> disk-ios-2.png <https://docs.google.com/file/d/0B7_6ZI9PZjjUNEpkMGRkR3dhYVk/edit?usp=drive_web>
> mem-usage-2.png <https://docs.google.com/file/d/0B7_6ZI9PZjjUWnFVZlU3aUxYNXc/edit?usp=drive_web>
> tcp-connections-2.png <https://docs.google.com/file/d/0B7_6ZI9PZjjUYmdvMmpDSlVvQUE/edit?usp=drive_web>
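P.S. The manual optimize mentioned above is just an HTTP call to the update handler. A rough sketch, assuming a single-core setup on the default port (adjust host/port/path to your installation):

    curl 'http://localhost:8983/solr/update' \
         -H 'Content-Type: text/xml' \
         --data-binary '<optimize waitFlush="true" waitSearcher="true"/>'

Issue that only after your indexing process has been stopped, and restart indexing once the call returns; with waitFlush/waitSearcher set to true it blocks until the rewrite is done.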