Re: Solr Capabilities/Limitations

Mike Klaas Tue, 01 Jul 2008 14:45:24 -0700

On 1-Jul-08, at 8:37 AM, Willie Wong wrote:

I need to be able to search through terabytes of existing data. Documents
may vary in size from 20 KB to 10 MB. Also, at some point I’ll
need to feed in approximately 1-5 million new documents
a day.

This depends greatly on what kind of searching you want to do, and what the desired response times are. I'm using Solr to full-text search about 10 TB of data at the moment. Response times are around 1s, including dynamic snippet generation. The queries themselves are relatively complicated by Lucene standards, including a custom word-proximity boosting query and link-analysis factors.

Of course, this is distributed over dozens of machines, and is a mostly static index. There are about 10 million docs per server.

Has anyone used Solr to conduct searches over terabytes of data? If so, are there any configuration parameters I should pay particular attention
to, such as JVM size, mergeFactor, etc.?

JVM size will depend mostly on your sorting/faceting requirements. Just remember to leave gobs of memory for the OS disk cache. Memory is key to serving large indices (consequently, things won't be fast until a decent amount of warming up is done). mergeFactor? You should only be searching optimized indices of this size, so it isn't terribly relevant. The daily new docs should probably be added in their own index, which is then searched in parallel with the existing indices.
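As a rough illustration of searching a separate daily index in parallel with the static indices: Solr's distributed search accepts a `shards` request parameter listing the indices to query. The sketch below only builds such a request URL. The host names and the helper `build_sharded_query_url` are hypothetical, and it is not the custom distributed layer described in this reply.

```python
from urllib.parse import urlencode


def build_sharded_query_url(coordinator, shards, query, rows=10):
    """Build a Solr select URL that fans a query out over several shards.

    coordinator and shards are "host:port/path" strings (hypothetical
    names here); the shards are joined into Solr's comma-separated
    `shards` request parameter.
    """
    params = {"q": query, "rows": rows, "shards": ",".join(shards)}
    # Keep "/", ":" and "," unescaped so the shard list stays readable.
    return f"http://{coordinator}/select?{urlencode(params, safe='/:,')}"


if __name__ == "__main__":
    # One mostly static index plus one index holding the daily additions.
    print(build_sharded_query_url(
        "search1:8983/solr",
        ["static1:8983/solr", "daily1:8983/solr"],
        "solr limits",
    ))
```

The node that receives the request queries every listed shard and merges the results, so the daily index can be rebuilt or swapped without touching the optimized static ones.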

Is there a limit to the number of shards Solr is capable of?  I don’t
think there’s any way I can do this without some sort of distributed
search.

Not really, though you will want to move to a 2-level hierarchy eventually. I can't speak for the distributed search implementation in trunk (we built our own before this was available), but it should be exactly what you need.
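To make the 2-level hierarchy concrete, here is a minimal sketch of the merge step only, assuming each shard returns its hits as `(score, doc_id)` pairs already sorted by descending score. The function names and the in-memory result lists are hypothetical stand-ins for real network calls, and this is neither Solr's implementation nor the custom one mentioned above.

```python
import heapq
from itertools import islice


def merge_top_k(result_lists, k):
    """Merge hit lists (each sorted by descending score) and keep the top k."""
    merged = heapq.merge(*result_lists, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))


def two_level_search(groups, k):
    """groups: one entry per mid-tier aggregator, each a list of shard results."""
    # Level 1: every mid-tier aggregator merges only its own shards.
    partials = [merge_top_k(shard_results, k) for shard_results in groups]
    # Level 2: the top-level aggregator merges the mid-tier top-k lists.
    return merge_top_k(partials, k)


if __name__ == "__main__":
    groups = [
        [[(0.9, "a"), (0.4, "b")], [(0.8, "c")]],
        [[(0.95, "d"), (0.1, "e")], [(0.5, "f")]],
    ]
    print(two_level_search(groups, 3))
```

With two levels, the top node merges one list per aggregator rather than one per shard, which keeps its fan-out bounded as the shard count grows.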

I’ve read that Solr indexes can go into the millions if not billions of documents… however, at what point does the index size become impractical? I know this is a bit open-ended, but I guess what I'm asking is: does Solr have a limit to the
number of documents that can be in a single index?

Depends on query composition and document size. But for web docs, about 10 million per index seems practical.
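A back-of-the-envelope sizing sketch based on that figure. The 10 million constant comes from this thread; the function names and the one-billion-document corpus in the usage example are hypothetical, and real capacity depends on query composition and document size as noted above.

```python
import math

DOCS_PER_INDEX = 10_000_000  # rough per-index figure quoted in this thread


def indices_needed(total_docs, docs_per_index=DOCS_PER_INDEX):
    """Number of indices required to hold total_docs documents."""
    return math.ceil(total_docs / docs_per_index)


def days_to_fill_index(docs_per_day, docs_per_index=DOCS_PER_INDEX):
    """Days of ingest before a fresh index reaches the per-index figure."""
    return docs_per_index / docs_per_day


if __name__ == "__main__":
    print(indices_needed(1_000_000_000))   # hypothetical corpus size
    print(days_to_fill_index(1_000_000))   # low end of the stated ingest rate
    print(days_to_fill_index(5_000_000))   # high end of the stated ingest rate
```

At the stated 1-5 million new documents a day, a fresh index would reach the 10 million mark in roughly 2 to 10 days, which gives a sense of how often new indices would need to be rolled over or merged.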

Has anyone looked into any of the search engines below, and are there other search engines that would be better suited, such as Fast or Autonomy?
http://mg4j.dsi.unimi.it/
http://www.egothor.org/performance.shtml

I haven't, but it should be possible to build a system based on those engines. For a system this size, the distributed architecture will be more important than the underlying index engine (though it sure helps to use an engine as optimized as Lucene).


-Mike
