Thanks, Mike, for your quick responses - they were very informative and
useful.
I have one final question, if you don't mind: is it possible for a
single Solr instance to switch between multiple indexes? For example,
can Solr search one index on one server partition and then use another
index located on another drive, without requiring a restart? This
differs slightly from the distributed search examples I've read in the
documentation, where another server runs Solr with the distributed
index.
Thanks,
Willie
From: Mike Klaas <[EMAIL PROTECTED]>
Date: 01/07/2008 05:44 PM
Reply-To: solr-user@lucene.apache.org
To: solr-user@lucene.apache.org
Subject: Re: Solr Capabilities/Limitations
On 1-Jul-08, at 8:37 AM, Willie Wong wrote:
I need to be able to search through terabytes of existing data.
Documents may vary in size from 20 KB to 10 MB. At some point I'll
also need to feed in approximately 1-5 million new documents a day.
This depends greatly on what kind of searching you want to do, and
what are the desired response times. I'm using Solr to full-text
search about 10 TB of data at the moment. Response times are around
1s, including dynamic snippet generation. The queries themselves are
relatively complicated by Lucene standards, including a custom
word-proximity boosting query and link-analysis factors.
Of course, this is distributed over dozens of machines, and is a
mostly static index. There are about 10 million docs per server.
Has anyone used Solr to conduct searches over terabytes of data? If
so, are there any configuration parameters I should pay particular
attention to, such as JVM size, mergeFactor, etc.?
JVM size will depend mostly on your sorting/faceting requirements.
Just remember to leave gobs of memory for the OS disk cache. Memory
is key to serving large indices (consequently, things won't be fast
until a decent amount of warming up is done). mergeFactor? You
should only be searching optimized indices of this size, so it isn't
terribly relevant. The daily new docs should probably be added in
their own index, which is then searched in parallel with the existing
indices.
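As a rough illustration of searching a separate daily index in parallel with the existing ones, here is a minimal Python sketch. It is not Solr code: the in-memory lists, the term-count score, and the names `search_index` and `search_all` are all assumptions made up for the example. The only point is the shape of the approach: query every index concurrently, then merge the per-index hits into one ranked list.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for one index: a list of (doc_id, text) pairs.
# A real deployment would send the query to a search server instead.
def search_index(index, term, limit):
    """Return up to `limit` (score, doc_id) hits, best first.

    The score is simply how often `term` occurs in the text (an assumption).
    """
    hits = []
    for doc_id, text in index:
        score = text.lower().split().count(term.lower())
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(limit, hits)

def search_all(indices, term, limit=10):
    """Query every index in parallel and merge the per-index results."""
    with ThreadPoolExecutor(max_workers=len(indices)) as pool:
        partials = list(pool.map(lambda ix: search_index(ix, term, limit), indices))
    merged = [hit for part in partials for hit in part]
    return heapq.nlargest(limit, merged)

# The large, mostly static index and the small index of the day's new docs.
static_index = [("old-1", "solr scales with solr shards"), ("old-2", "lucene index")]
daily_index = [("new-1", "fresh solr document")]

print(search_all([static_index, daily_index], "solr"))
```

Each index only has to return its own top `limit` hits, since nothing ranked lower within one index can reach the merged top `limit`.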
Is there a limit to the number of shards Solr is capable of? I don't
think there's any way I can do this without some sort of distributed
search.
Not really, though you will want to move to a 2-level hierarchy
eventually. I can't speak for the distributed search implementation
in trunk (we built our own before this was available), but it should
be exactly what you need.
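One possible reading of a 2-level hierarchy is that a top-level aggregator queries a handful of mid-level aggregators, each of which merges results from its own group of shards. The Python sketch below illustrates that reading only. The `Leaf` and `Aggregator` classes and the precomputed scores are hypothetical, and this is neither the custom system described above nor the implementation in Solr trunk.

```python
import heapq

def merge_top(result_lists, limit):
    """Merge several (score, doc_id) lists into one best-first top-`limit` list."""
    return heapq.nlargest(limit, (hit for hits in result_lists for hit in hits))

class Leaf:
    """Hypothetical shard holding precomputed (score, doc_id) hits for one query."""
    def __init__(self, hits):
        self.hits = hits

    def search(self, limit):
        return heapq.nlargest(limit, self.hits)

class Aggregator:
    """Fans a query out to its children (leaves or other aggregators) and merges."""
    def __init__(self, children):
        self.children = children

    def search(self, limit):
        return merge_top([child.search(limit) for child in self.children], limit)

# Level 2: each group aggregator merges its own shards.
group_a = Aggregator([Leaf([(0.9, "a1"), (0.2, "a2")]), Leaf([(0.5, "b1")])])
group_b = Aggregator([Leaf([(0.7, "c1")]), Leaf([(0.8, "d1"), (0.1, "d2")])])
# Level 1: the root only talks to the two group aggregators.
root = Aggregator([group_a, group_b])

print(root.search(limit=3))
```

The root merges two result lists instead of four, so the fan-out per node stays small as the number of shards grows.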
I've read that Solr indexes can go into the millions, if not billions,
of documents. However, at what point does the index size become
impractical? I know this is a bit open-ended, but does Solr have a
limit to the number of documents that can be in a single index?
Depends on query composition and document size. But for web docs,
about 10m seems practical.
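To put the 10m figure next to the 1-5 million new documents a day mentioned earlier in the thread, here is a back-of-the-envelope calculation in Python. It assumes the 10m-per-index guideline applies unchanged to these documents, which the thread does not confirm, and the function names are made up for the example.

```python
import math

DOCS_PER_INDEX = 10_000_000  # the "about 10m" guideline; assumed to apply here

def days_to_fill(daily_docs, docs_per_index=DOCS_PER_INDEX):
    """Days of ingest before one index reaches the guideline size."""
    return docs_per_index / daily_docs

def indexes_needed(total_docs, docs_per_index=DOCS_PER_INDEX):
    """Number of indexes needed to hold `total_docs` at the guideline size."""
    return math.ceil(total_docs / docs_per_index)

for daily in (1_000_000, 5_000_000):
    per_year = daily * 365
    print(daily, days_to_fill(daily), indexes_needed(per_year))
```

At 1 million documents a day, an index reaches the guideline in 10 days and a year of data needs 37 indexes. At 5 million a day, that becomes 2 days and 183 indexes.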
Has anyone looked into either of the search engines below, and are
there any others that would be better suited, such as Fast or
Autonomy?
http://mg4j.dsi.unimi.it/
http://www.egothor.org/performance.shtml
I haven't, but it should be possible to build a system based on those
engines. For a system this size, the distributed architecture will be
more important than the underlying index engine (though it sure helps
to use an engine as optimized as Lucene).
-Mike