On 2/4/2015 2:54 PM, Arumugam, Suresh wrote:
>
> Hi All,
>
>  
>
> We are trying to load 14+ billion documents into Solr, but the load is
> failing.
>
>  
>
> Solr version: *4.8.0*
>
> Analyzer used: *ClassicTokenizer for index as well as query.*
>
>  
>
> Can someone help me in getting into the core of this issue?
>
>  
>
> For the 14+ billion document load, we are loading batches of 2 billion
> using the dataimport handler with a single thread.
>
>  
>
>                 First batch completed successfully & added 2 Billion
> documents
>
>                 Second batch: the dataimport shows a successful
> completion, but the number of documents is still 2 billion, with the
> following exception in the logs.
>

<snip>

> Caused by: java.lang.IllegalArgumentException: Too many documents,
> composite IndexReaders cannot exceed 2147483647

Solr is an application based on Lucene.  Lucene has exactly one hard
limitation -- a single index cannot contain more than 2147483647 (Java's
Integer.MAX_VALUE) documents.  There are some ideas being kicked around
for removing this limitation, but it is not normally seen as a major
stumbling block.  You're likely to hit performance bottlenecks with
indexes much smaller than 2 billion documents.

The document count includes deleted documents that have not yet been
merged away.  For a variety of reasons, we recommend not storing more
than about 100 million documents in any single index, although going up
to about 1 billion is feasible, if you have enough memory.
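
To put those numbers together -- this is just back-of-the-envelope
arithmetic in Java, not a sizing recommendation tuned to your data --
the 14 billion figure, the hard cap, and the 100 million guideline work
out like this:

    public class ShardMath {
        public static void main(String[] args) {
            long totalDocs = 14_000_000_000L;     // document count from this thread
            long hardCap   = Integer.MAX_VALUE;   // 2147483647, Lucene's per-index limit
            // ceiling division: shards needed at the hard cap vs. the 100M guideline
            long minShards = (totalDocs + hardCap - 1) / hardCap;           // 7
            long recShards = (totalDocs + 100_000_000 - 1) / 100_000_000;   // 140
            System.out.println(minShards + " shards at the absolute minimum, roughly "
                    + recShards + " at the 100 million per shard guideline");
        }
    }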

Solr, especially if you use SolrCloud, offers the ability to shard your
index so that it is served from many smaller indexes on many hosts.
If you're going to have billions of documents, you have no choice but to
shard your index.  In order to get good performance out of an index that
large, you'll need the memory and processing power of multiple physical
machines working together.

https://wiki.apache.org/solr/DistributedSearch
https://cwiki.apache.org/confluence/display/solr/SolrCloud
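
If it helps, here is a rough SolrJ sketch of what indexing into a
sharded SolrCloud collection looks like from the client side.  The
ZooKeeper addresses and the collection name "mycollection" are
placeholders, and the collection would need to be created beforehand
through the Collections API with an appropriate numShards value:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudIndexSketch {
        public static void main(String[] args) throws Exception {
            // Connect through ZooKeeper; the client routes each document
            // to the shard that owns its unique key.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("mycollection");  // placeholder collection name

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            server.add(doc);     // hashed on the uniqueKey to pick the target shard
            server.commit();
            server.shutdown();
        }
    }

Each shard then stays well below the per-index limit, and queries are
distributed across all of the shards automatically.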

You will need a lot of hardware, especially memory, to handle a 14
billion document index with any kind of speed.

Thanks,
Shawn
