On 11-Feb-08, at 11:38 PM, James Brady wrote:

Hello,
I'm looking for some configuration guidance to help improve performance of my application, which tends to do a lot more indexing than searching.

At present, it needs to index around two documents / sec - a document being the stripped content of a webpage. However, performance was so poor that I've had to disable indexing of the webpage content as an emergency measure. In addition, some search queries take an inordinate length of time - regularly over 60 seconds.

This is running on a medium sized EC2 instance (2 x 2GHz Opterons and 8GB RAM), and there's not too much else going on on the box. In total, there are about 1.5m documents in the index.

I'm using a fairly standard configuration - the things I've tried changing so far have been parameters like maxMergeDocs, mergeFactor and the autoCommit options. I'm only using the StandardRequestHandler, no faceting. I have a scheduled task causing a database commit every 15 seconds.

By "database commit" do you mean "Solr commit"? If so, that is far too frequent when you are sorting on big fields.

I use Solr to serve queries for ~10m docs on a medium size EC2 instance. This is an optimized configuration where highlighting is broken off into a separate index, and load balanced into two subindices of 5m docs apiece. I do a good deal of faceting but no sorting. The only reason that this is possible is that the index is only updated every few days.

On another box we have an index of several hundred thousand documents which is updated relatively frequently (autocommit time: 20s). These are merged with the more static index to create an illusion of real-time index updates.
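
As a rough illustration of one way to read the setup above, here is a minimal Python sketch that merges hits at query time from the large static index and the small, frequently updated one. Everything in it is an assumption made for illustration: the function name merge_hits, the (doc id, score) tuple shape, and the rule that a document found in the fresh index supersedes its static copy. It does not call Solr. Also note that relevance scores from two separate Lucene indices are not strictly comparable, because term statistics differ per index, so sorting by raw score is only an approximation.

```python
import heapq
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (document id, relevance score); hypothetical shape


def merge_hits(static_hits: List[Hit], fresh_hits: List[Hit], rows: int = 10) -> List[Hit]:
    """Combine hits from the large static index and the small fresh index."""
    # Start from the static index results, keyed by document id.
    best: Dict[str, float] = dict(static_hits)
    # Assumption: a document present in the fresh index replaces the static copy.
    best.update(dict(fresh_hits))
    # Return the top `rows` hits by score, highest first.
    return heapq.nlargest(rows, best.items(), key=lambda hit: hit[1])
```

With this rule, a document that was re-indexed a few seconds ago shows up with its fresh score even though the static index has not been rebuilt for days.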

When Lucene supports efficient, reopen()able fieldcache updates, this situation might improve, but the above architecture would still probably be better. Note that the second index can be on the same machine.

-Mike
