Does https://issues.apache.org/jira/browse/SOLR-2112 help?

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Fri, Jul 5, 2013 at 5:57 PM, Valery Giner <valgi...@research.att.com> wrote:
> As a simple example: just write a query result into a file for processing
> by external programs (the programs are out of our control, and the result
> could contain millions of docs).
>
> Thanks,
> Val
>
> On 07/05/2013 04:41 PM, Walter Underwood wrote:
>>
>> What are you doing that start=500000 is normal?  --wunder
>>
>> On Jul 5, 2013, at 1:28 PM, Valery Giner wrote:
>>
>>> Eric,
>>>
>>> We did not have any RAM problems, but the following official
>>> limitation makes sharding too painful for us to use:
>>>
>>> "Makes it more inefficient to use a high "start" parameter. For example,
>>> if you request start=500000&rows=25 on an index with 500,000+ docs per
>>> shard, this will currently result in 500,000 records getting sent over the
>>> network from the shard to the coordinating Solr instance. If you had a
>>> single-shard index, in contrast, only 25 records would ever get sent over
>>> the network. (Granted, setting start this high is not something many people
>>> need to do.) "  http://wiki.apache.org/solr/DistributedSearch
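The arithmetic behind that wiki quote can be sketched as follows (a rough model, assuming the coordinating node must collect the top start+rows candidates from every remote shard before it can return the requested page):

```python
def records_over_network(start, rows, num_shards):
    # In a distributed query, each shard ships its top (start + rows)
    # candidate records to the coordinating node, which merges them and
    # keeps only `rows` of them for the final response.
    return num_shards * (start + rows)

# The wiki example: start=500000&rows=25 against one remote shard.
print(records_over_network(500_000, 25, 1))   # 500025 candidates shipped
```

With more shards the cost multiplies: four shards would ship roughly 2 million candidate records over the network just to answer a 25-row page.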
>>>
>>> Reading millions of documents as the result of a query is a "normal" use
>>> case for us, not a "design defect".   Subdividing the "large" indexes into
>>> smaller ones seems too ugly a way to scale up.  This turns Solr
>>> from a perfect solution for us into something unacceptable for such cases.
>>>
>>> I wonder whether anyone else has similar use cases/problems with
>>> sharding.
>>>
>>> Thanks,
>>> Val
>>>
>>> On 05/03/2013 12:10 PM, Erick Erickson wrote:
>>>>
>>>> My off-the-cuff thought is that there are significant costs to trying
>>>> to do this that would be paid by 99.999% of the setups out there. Also,
>>>> you'll usually run into other issues (RAM, etc.) long before you come
>>>> anywhere close to 2^31 docs.
>>>>
>>>> Lucene/Solr often allocates int[maxDoc] for various operations. When
>>>> maxDoc approaches 2^31, memory goes through the roof. Now
>>>> consider allocating longs instead...
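Back-of-the-envelope, that cost looks like this (a rough sketch, assuming 4-byte ints and 8-byte longs, and ignoring JVM object headers):

```python
# A single array indexed by internal doc id, sized near the 2^31 limit.
max_doc = 2**31 - 1                    # largest possible Java array length

int_array_gib = max_doc * 4 / 2**30    # int[maxDoc]
long_array_gib = max_doc * 8 / 2**30   # long[maxDoc]

print(f"int[maxDoc]:  ~{int_array_gib:.0f} GiB")   # ~8 GiB
print(f"long[maxDoc]: ~{long_array_gib:.0f} GiB")  # ~16 GiB
```

And that is for just one such array; several may be live at once during a single request.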
>>>>
>>>> which is a long way of saying that I don't really think anyone's going
>>>> to be working on this any time soon, especially when SolrCloud removes
>>>> a LOT of the pain/complexity (from a user perspective anyway) from
>>>> going to a sharded setup...
>>>>
>>>> FWIW,
>>>> Erick
>>>>
>>>> On Thu, May 2, 2013 at 1:17 PM, Valery Giner <valgi...@research.att.com>
>>>> wrote:
>>>>>
>>>>> Otis,
>>>>>
>>>>> The documents themselves are relatively small: tens of fields, only a
>>>>> few of them up to a hundred bytes.
>>>>> Linux servers with relatively large RAM (256).
>>>>> Minutes on the searches are fine for our purposes, and adding a few
>>>>> tens of millions of records in tens of minutes is also fine.
>>>>> We had to do some simple tricks to keep indexing up to speed, but
>>>>> nothing too fancy.
>>>>> Moving to sharding adds a layer of complexity which we don't really
>>>>> need, given the above... and adding complexity may result in lower
>>>>> reliability :)
>>>>>
>>>>> Thanks,
>>>>> Val
>>>>>
>>>>>
>>>>> On 05/02/2013 03:41 PM, Otis Gospodnetic wrote:
>>>>>>
>>>>>> Val,
>>>>>>
>>>>>> Haven't seen this mentioned in a while...
>>>>>>
>>>>>> I'm curious...what sort of index, queries, hardware, and latency
>>>>>> requirements do you have?
>>>>>>
>>>>>> Otis
>>>>>> Solr & ElasticSearch Support
>>>>>> http://sematext.com/
>>>>>> On May 1, 2013 4:36 PM, "Valery Giner" <valgi...@research.att.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Dear Solr Developers,
>>>>>>>
>>>>>>> I've been unable to find an answer to the question in the subject
>>>>>>> line of this e-mail, except for a vague one.
>>>>>>>
>>>>>>> We need to be able to index 2bln+ documents.   We were doing
>>>>>>> well without sharding until the number of docs hit the limit (2bln+).
>>>>>>> The performance was satisfactory for queries, updates, and indexing
>>>>>>> of new documents.
>>>>>>>
>>>>>>> That is, except for the need to get around the int32 limit, we don't
>>>>>>> really have a need to set up distributed Solr.
>>>>>>>
>>>>>>> I wonder whether someone on the Solr team could tell us when, and in
>>>>>>> what version of Solr, we could expect the limit to be removed.
>>>>>>>
>>>>>>> I hope this question may be of interest to someone else :)
>>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> Val
>>>>>>>
>>>>>>>
>> --
>> Walter Underwood
>> wun...@wunderwood.org
>>
>>
>>
>>
>
