I guess my $0.02 is that you'd have to have strong evidence that extending Lucene to 64-bit doc IDs is even useful. Or more generally, useful enough to pay the penalty. All the structures that allocate maxDoc-sized ID arrays would suddenly require twice the memory, for instance, plus there's all the coding effort that could be spent doing other things.
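Just to put rough numbers on that penalty (back-of-the-envelope only; the class and names below are illustrative, not Lucene internals): every structure that holds one value per document roughly doubles if doc IDs go from int to long.

// Heap cost of one maxDoc-sized array today (int doc IDs) versus a
// hypothetical 64-bit doc ID world. Names are illustrative only.
public class DocIdWidthCost {
    public static void main(String[] args) {
        long maxDoc = Integer.MAX_VALUE;          // ~2.147 billion, the current ceiling
        long intArrayBytes  = maxDoc * 4L;        // e.g. an ordinal or docID map as int[]
        long longArrayBytes = maxDoc * 8L;        // the same structure if IDs became long[]
        System.out.printf("int[maxDoc]  ~= %.1f GB%n", intArrayBytes / 1e9);
        System.out.printf("long[maxDoc] ~= %.1f GB%n", longArrayBytes / 1e9);
        // Every maxDoc-sized structure roughly doubles; multiply by the number of
        // such structures live for sorting, faceting and caching to see the total bill.
    }
}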
My challenge for Suresh is to stop indexing at, say, 1 billion and try to search. Add in some faceting and sorting. Get past the GC pauses that will result from allocating enough memory to handle that many documents and... discover that you can't reasonably search that many documents in the first place. Or if you can, it's a niche case anyway. Indexing is not a memory-intensive operation, and I've seen people happily index far more documents than they can search, then hit a wall when they _do_ try.

FWIW,
Erick

On Tue, Feb 10, 2015 at 8:58 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 2/4/2015 3:31 PM, Arumugam, Suresh wrote:
>> We are trying to do a POC for searching our log files with a single-node
>> Solr (396 GB RAM with 14 TB of disk). Since the server is powerful, we added
>> 2 billion records successfully, and search is working fine without much issue.
>>
>> Due to the restriction on the maximum number of documents in a Lucene index,
>> we were not able to load any further.
>>
>> Is there a way to increase that limit from 2 billion to 4 or 5 billion
>> in Lucene?
>
> I thought I already sent this, but it has been sitting in my drafts
> folder for several days.
>
> That Lucene restriction cannot be changed at this time; it is the result of
> using a 32-bit value for the Lucene document identifier. The amount of
> program code that would be affected by a switch to a 64-bit value is
> HUGE, and the ripple effect would be highly unpredictable. Developers
> who use the Lucene API expect long-term stability ... that change has
> the potential for a lot of volatility. Even if we figure out how to
> make the change, I wouldn't expect it anytime soon. It won't be in the
> 5.0 release, and I don't think anyone is brave enough to attempt it
> for the 6.0 release either.
>
>> If Lucene supports 2 billion per index, will it be the same issue
>> with SolrCloud also?
>
> SolrCloud lets you shard your index, so there are no limits other than
> available system resources and the number of servers. There are users
> who have indexes as big as the one you are planning (and some even
> larger) who use Solr successfully.
>
>> If the recommended size for an index is 100 million, do we need 20
>> indexes to support 2 billion documents? Is my understanding right?
>
> The memory structures required within Java are much smaller and can be
> manipulated more efficiently if the index has 100 million documents than
> if it has 1 or 2 billion documents. Within the hard Lucene limit, you
> can make your indexes as big as you like ... but real-world experience
> has told us that 100 million documents on each server is a good balance
> between resource requirements and performance. If you don't care how
> many seconds your index takes to respond to a query, or you can afford
> enormous amounts of memory and a commercial JVM with low-pause
> characteristics, you can push the limits with your shard size.
>
> I have compiled some performance information for "normal" sized indexes
> with millions of documents. On the billions scale, some of this info is
> not very helpful:
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Thanks,
> Shawn
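To make Shawn's 100-million-per-shard rule of thumb concrete, here is the back-of-the-envelope shard math (the inputs below are assumptions to adjust for your own data, not measurements):

// Rough shard-count math for the sizing Shawn describes. All inputs are
// assumptions; nothing here is measured.
public class ShardMath {
    public static void main(String[] args) {
        long targetDocs       = 2_000_000_000L;  // documents you plan to index
        long docsPerShard     = 100_000_000L;    // the 100M-per-shard rule of thumb
        double growthHeadroom = 1.5;             // leave room so shards don't fill up

        long shards = (long) Math.ceil(targetDocs * growthHeadroom / docsPerShard);
        System.out.println("Suggested shard count: " + shards);  // 30 with these inputs
        // At capacity, 2 billion / 100 million = 20 shards; the headroom factor is
        // why a production collection is usually created somewhat larger than that.
    }
}

In SolrCloud that translates to the numShards you choose when creating the collection; adding shards later is possible via shard splitting, but it's easier to plan ahead.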