Hi - thanks to everyone for their responses.
A couple of extra pieces of data that should help me optimise:
documents are very rarely updated once in the index, and I can throw
away index data older than 7 days.
So, based on advice from Mike and Walter, it seems my best option
will be to have seven separate indices. Six indices will never change
and will hold data from the six previous days; one index will change
and will hold data from the current day. Deletions and updates will be
handled by effectively storing a revocation list in the mutable index.
In this way, I will only need to perform Solr commits (yes, I did
mean Solr commits rather than database commits below - my apologies)
on the current day's index, and closing and opening new searchers for
these commits shouldn't be as painful as it is currently.
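As a minimal sketch of that merge step (assuming, hypothetically, that each per-day index returns hits as id/score dicts and that the revocation list is a set of deleted or updated doc ids read from the mutable index; all names here are illustrative, not Solr API):

```python
def merge_results(per_index_hits, revoked_ids):
    # per_index_hits: list of hit lists, one per daily index
    # revoked_ids: doc ids recorded as deleted/updated in the mutable index
    revoked = set(revoked_ids)
    merged = [hit
              for hits in per_index_hits
              for hit in hits
              if hit["id"] not in revoked]
    # Re-rank the combined hits by score, as a single searcher would
    merged.sort(key=lambda h: h["score"], reverse=True)
    return merged
```

The point is only that the immutable indices never need a commit; revocations are applied at query time.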
To do this, I need to work out how to do the following:
- parallel multi search through Solr
- move to a new index on a scheduled basis (probably commit and
optimise the index at this point)
- ideally, properly warm new searchers in the background to further
improve search performance on the changing index
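For the scheduled rotation, one way to sketch it is date-stamped cores plus a comma-separated core list in the style of Solr's distributed-search "shards" request parameter (an assumption here; the naming scheme and the shards syntax are hypothetical and depend on your Solr version):

```python
from datetime import date, timedelta

def index_name(day):
    # One core per day, e.g. "docs-20080212" (naming is hypothetical)
    return "docs-" + day.strftime("%Y%m%d")

def active_indices(today, days=7):
    # The current mutable index plus the six frozen previous days
    return [index_name(today - timedelta(days=n)) for n in range(days)]

def shards_param(host, today):
    # Comma-separated core list, in the style of a distributed-search
    # 'shards' parameter (an assumption; verify against your Solr docs)
    return ",".join("%s/solr/%s" % (host, name)
                    for name in active_indices(today))
```

Rotation then amounts to committing and optimising yesterday's core at midnight and starting a fresh core for the new day.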
Does that sound like a reasonable strategy in general, and has anyone
got advice on the specific points I raise above?
Thanks,
James
On 12 Feb 2008, at 11:45, Mike Klaas wrote:
On 11-Feb-08, at 11:38 PM, James Brady wrote:
Hello,
I'm looking for some configuration guidance to help improve
performance of my application, which tends to do a lot more
indexing than searching.
At present, it needs to index around two documents / sec - a
document being the stripped content of a webpage. However,
performance was so poor that I've had to disable indexing of the
webpage content as an emergency measure. In addition, some search
queries take an inordinate length of time - regularly over 60
seconds.
This is running on a medium sized EC2 instance (2 x 2GHz Opterons
and 8GB RAM), and there's not too much else going on on the box.
In total, there are about 1.5m documents in the index.
I'm using a fairly standard configuration - the things I've tried
changing so far have been parameters like maxMergeDocs,
mergeFactor and the autoCommit options. I'm only using the
StandardRequestHandler, no faceting. I have a scheduled task
causing a database commit every 15 seconds.
By "database commit" do you mean "solr commit"? If so, that is far
too frequent if you are sorting on big fields.
I use Solr to serve queries for ~10m docs on a medium size EC2
instance. This is an optimized configuration where highlighting is
broken off into a separate index, and load balanced into two
subindices of 5m docs a piece. I do a good deal of faceting but no
sorting. The only reason that this is possible is that the index
is only updated every few days.
On another box we have a several hundred thousand document index
which is updated relatively frequently (autocommit time: 20s).
These are merged with the static-er index to create an illusion of
real-time index updates.
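A minimal sketch of that overlay, assuming (hypothetically) hits arrive as id/score dicts from each index: a doc present in the small, frequently committed index supersedes its stale copy in the large static one.

```python
def merged_view(static_hits, fresh_hits):
    # A doc id present in the fresh (frequently committed) index
    # supersedes its stale copy in the large static index
    fresh_ids = {h["id"] for h in fresh_hits}
    combined = fresh_hits + [h for h in static_hits
                             if h["id"] not in fresh_ids]
    combined.sort(key=lambda h: h["score"], reverse=True)
    return combined
```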
When Lucene supports efficient, reopen()able fieldcache updates,
this situation might improve, but the above architecture would
still probably be better. Note that the second index can be on the
same machine.
-Mike