Hi all,

I wanted to share the issues we're having with Solr 1.4 and get some ideas for short-term fixes that will buy us enough time to validate Solr 4 before upgrading, without 1.4 burning to the ground before we get there.

We've been running Solr 1.4 in production for over 3 years now, but are really starting to hit some performance bottlenecks that are beginning to affect our users.

Here are the details of our setup: We're running two 4-CPU Solr servers. The data is on a 4-disk RAID 10 array and we're using block-level replication via DRBD over GigE to write to the standby node. Only one server is serving traffic at a time.

Some tuning information:

- Merge Factor: 25
- Auto Commit: 60s / 1000 docs

What we're seeing:

In roughly 14-hour cycles, the CPU usage climbs from 100% to between 200 and 250%. At the end of the cycle, we get one long commit of roughly 500 seconds, blocking all writes. Around the same time queries begin to get very slow, often causing timeouts from connecting clients. This behavior is cyclical, and is getting progressively worse.

What is this, and what can we do about it?

I've attached relevant graphs. Apologies in advance for the obscenely large image sizes.

Cheers,
Stephen

client-requests-2.png<https://docs.google.com/file/d/0B7_6ZI9PZjjUN1lhd1hfSE9Jc2M/edit?usp=drive_web>
cpu-usage.png<https://docs.google.com/file/d/0B7_6ZI9PZjjUSHpsY1B2T01iVGM/edit?usp=drive_web>
disk-ios-2.png<https://docs.google.com/file/d/0B7_6ZI9PZjjUNEpkMGRkR3dhYVk/edit?usp=drive_web>
mem-usage-2.png<https://docs.google.com/file/d/0B7_6ZI9PZjjUWnFVZlU3aUxYNXc/edit?usp=drive_web>
tcp-connections-2.png<https://docs.google.com/file/d/0B7_6ZI9PZjjUYmdvMmpDSlVvQUE/edit?usp=drive_web>
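
P.S. For anyone who wants to map the tuning values above onto config, they would look roughly like the fragment below in solrconfig.xml. This is only a sketch assuming the stock Solr 1.4 layout (mergeFactor under indexDefaults, autoCommit under the DirectUpdateHandler2 update handler). It is not copied from the production file, and everything other than the three values listed above is an assumption.

```xml
<config>
  <indexDefaults>
    <!-- Merge Factor: 25 (value from the post; placement assumed) -->
    <mergeFactor>25</mergeFactor>
  </indexDefaults>

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- Auto Commit: 60s / 1000 docs; whichever limit is hit first triggers a commit -->
    <autoCommit>
      <maxDocs>1000</maxDocs>
      <!-- maxTime is in milliseconds, so 60s = 60000 -->
      <maxTime>60000</maxTime>
    </autoCommit>
  </updateHandler>
</config>
```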