Tom,

Yes, we (Biz360) have indexed 3 billion documents and upwards... If indexing (or rather re-indexing) is the issue, we used SOLR-1301 with Hadoop to re-index efficiently (i.e., in a timely manner). For querying we're currently using the out-of-the-box Solr distributed shards query mechanism, which is hard (read: near impossible) to customize. I've been writing SOLR-1724, which deploys cores out of HDFS. SOLR-1724 works in conjunction with Solr Cloud, which should allow for more efficient failover. Katta has a nice model for replicating cores across multiple servers for redundancy; the catch is that it could feasibly require twice as many servers for 2x replication.
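
For reference, here's a minimal SolrJ sketch of what that stock distributed query looks like (the host names and the ocr_text field are placeholders, not our actual setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedQueryExample {
  public static void main(String[] args) throws Exception {
    // Any shard can act as the aggregator for a distributed request.
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://shard1.example.com:8983/solr");

    SolrQuery query = new SolrQuery("ocr_text:whaling");
    // The stock mechanism: every shard is listed explicitly on each request,
    // which is part of why it's hard to customize (no shard discovery,
    // no built-in failover).
    query.set("shards",
        "shard1.example.com:8983/solr,"
      + "shard2.example.com:8983/solr,"
      + "shard3.example.com:8983/solr");
    query.setRows(10);

    QueryResponse rsp = server.query(query);
    System.out.println("hits: " + rsp.getResults().getNumFound());
  }
}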
If you have more questions, feel free to ping me or whatever. Cheers,

Jason

On Fri, Apr 2, 2010 at 8:57 AM, Burton-West, Tom <tburt...@umich.edu> wrote:
> We are currently indexing 5 million books in Solr, scaling up over the next
> few years to 20 million. However we are using the entire book as a Solr
> document. We are evaluating the possibility of indexing individual pages as
> there are some use cases where users want the most relevant pages regardless
> of what book they occur in. However, we estimate that we are talking about
> somewhere between 1 and 6 billion pages and have concerns over whether Solr
> will scale to this level.
>
> Does anyone have experience using Solr with 1-6 billion Solr documents?
>
> The lucene file format document
> (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations) mentions
> a limit of about 2 billion document ids. I assume this is the lucene
> internal document id and would therefore be a per index/per shard limit. Is
> this correct?
>
> Tom Burton-West.
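
On the 2 billion figure: that's Lucene's internal document id, which is a Java int, so as far as I know the limit applies per index (i.e., per shard), not across a distributed collection. A quick back-of-the-envelope using the upper end of Tom's estimate:

public class ShardMath {
  public static void main(String[] args) {
    // Lucene's internal doc ids are ints, so each index (shard) tops out
    // around Integer.MAX_VALUE (~2.1 billion) documents.
    long perShardCap = Integer.MAX_VALUE;   // 2,147,483,647
    long totalPages = 6000000000L;          // upper end of 1-6 billion pages

    long minShards = (totalPages + perShardCap - 1) / perShardCap;
    // Prints 3 -- the floor imposed by the id limit alone; in practice
    // you'd shard far more aggressively than that for query performance.
    System.out.println("minimum shards: " + minShards);
  }
}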