Hi James,

I'm looking for some configuration guidance to help improve performance of my application, which tends to do a lot more indexing than searching.

At present, it needs to index around two documents / sec - a document being the stripped content of a webpage. However, performance was so poor that I've had to disable indexing of the webpage content as an emergency measure. In addition, some search queries take an inordinate length of time - regularly over 60 seconds.

In general, updating an index immediately from a continuous stream of new content and returning fast search results work against each other. The searcher's various caches get flushed each time new content is committed, to avoid serving stale results, and that can easily kill your performance.

This issue was one of the more interesting topics discussed during the Lucene BoF meeting at ApacheCon. You're not alone in wanting to have it both ways, but it's clear this is A Hard Problem.

If you can relax the need for immediate updates to the index, and accept some lag between receiving new content and having it show up in search results, then I'd suggest splitting the two processes. Have a backend system that collects the updates, and then apply them to the search index at some slower interval.
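
As a rough sketch of that split, something like the following could sit between your crawler and the index. Everything here is made up for illustration: BatchingIndexer, send_batch and commit are hypothetical names, not part of Solr or any client library, and the thresholds are guesses. The idea is just to buffer documents and push them as one batch with a single commit. At two documents / sec, a five-minute interval means roughly 600 documents per commit instead of 30 per 15-second commit.

```python
import time
from typing import Callable, Dict, List, Optional

Doc = Dict[str, str]


class BatchingIndexer:
    """Buffers documents, then sends them as one batch followed by one commit.

    send_batch and commit are hypothetical callbacks supplied by the caller;
    they stand in for whatever actually talks to the search index.
    """

    def __init__(
        self,
        send_batch: Callable[[List[Doc]], None],
        commit: Callable[[], None],
        max_docs: int = 1000,          # assumed threshold, tune for your load
        max_age_secs: float = 300.0,   # assumed lag you are willing to accept
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send_batch = send_batch
        self._commit = commit
        self._max_docs = max_docs
        self._max_age_secs = max_age_secs
        self._clock = clock
        self._pending: List[Doc] = []
        self._oldest: Optional[float] = None  # arrival time of oldest buffered doc

    def add(self, doc: Doc) -> None:
        """Buffer a document; flush only if a threshold has been reached."""
        if not self._pending:
            self._oldest = self._clock()
        self._pending.append(doc)
        self.flush_if_due()

    def flush_if_due(self) -> bool:
        """Flush if the buffer is full or its oldest document is too old."""
        if not self._pending or self._oldest is None:
            return False
        too_many = len(self._pending) >= self._max_docs
        too_old = self._clock() - self._oldest >= self._max_age_secs
        if too_many or too_old:
            self.flush()
            return True
        return False

    def flush(self) -> None:
        """Send everything buffered so far, then commit once."""
        if not self._pending:
            return
        self._send_batch(list(self._pending))
        self._commit()
        # Clear the buffer only after both calls succeeded, so a failed
        # send leaves the documents queued for the next attempt.
        self._pending = []
        self._oldest = None
```

You'd call add() for each incoming document and call flush_if_due() from a timer, so a quiet period doesn't leave documents sitting in the buffer indefinitely. The trade-off is that searches won't see new content until the next flush.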

-- Ken


This is running on a medium-sized EC2 instance (2 x 2GHz Opterons and 8GB RAM), and there's not much else going on on the box. In total, there are about 1.5 million documents in the index.

I'm using a fairly standard configuration - the things I've tried changing so far have been parameters like maxMergeDocs, mergeFactor and the autoCommit options. I'm only using the StandardRequestHandler, no faceting. I have a scheduled task causing a database commit every 15 seconds.

Obviously, every workload varies, but could anyone comment on whether this sort of hardware should, with proper configuration, be able to manage this sort of workload?

I can't see signs of Solr being IO-bound, CPU-bound or memory-bound, although my scheduled commit operation, or perhaps GC, does spike up the CPU utilisation at intervals.

Any help appreciated!
James


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
