Eric,

We did not have any RAM problems, but the following officially documented limitation alone makes sharding too painful for us to use:

"Makes it more inefficient to use a high "start" parameter. For example, if you request start=500000&rows=25 on an index with 500,000+ docs per shard, this will currently result in 500,000 records getting sent over the network from the shard to the coordinating Solr instance. If you had a single-shard index, in contrast, only 25 records would ever get sent over the network. (Granted, setting start this high is not something many people need to do.) " http://wiki.apache.org/solr/DistributedSearch
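A back-of-envelope sketch (not actual Solr code; the shard count and page parameters are just illustrative) of why that limitation bites: to serve start=S, rows=R, the coordinating node has to collect the top S+R candidates from every shard, merge them, and throw away all but R.

```python
# Sketch of the distributed deep-paging cost described in the wiki text.
def docs_shipped_sharded(start, rows, num_shards):
    # each shard must return its own top (start + rows) candidates
    # so the coordinator can merge-sort them globally
    per_shard = start + rows
    return per_shard * num_shards

def docs_shipped_single(start, rows):
    # a single-shard index only ships the requested page
    return rows

# start=500000, rows=25 over a hypothetical 4-shard index vs. unsharded
print(docs_shipped_sharded(500_000, 25, 4))  # 2000100 records over the network
print(docs_shipped_single(500_000, 25))      # 25 records
```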

Reading millions of documents as the result of a query is a "normal" use case for us, not a "design defect". Subdividing the "large" indexes into smaller ones seems too ugly a way to scale up. This turns Solr from a perfect solution for us into something unacceptable for such cases.

I wonder whether anyone else has similar use cases/problems with sharding.

Thanks,
Val

On 05/03/2013 12:10 PM, Erick Erickson wrote:
My off the cuff thought is that there are significant costs trying to
do this that would be paid by 99.999% of setups out there. Also,
usually you'll run into other issues (RAM etc) long before you come
anywhere close to 2^31 docs.

Lucene/Solr often allocates int[maxDoc] for various operations. When
maxDoc approaches 2^31, well, memory goes through the roof. Now
consider allocating longs instead...
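A rough illustration of that memory argument (simple arithmetic, not Solr internals): near the int32 ceiling, a single array indexed by internal doc id is already about 8 GB, and widening doc ids to 64 bits would double that for every such array.

```python
# Size of per-document arrays as maxDoc approaches the int32 limit.
MAX_DOC = 2**31 - 1  # current per-index document ceiling

int_array_bytes = 4 * MAX_DOC   # int[maxDoc], 4 bytes per entry
long_array_bytes = 8 * MAX_DOC  # long[maxDoc], 8 bytes per entry

print(int_array_bytes / 2**30)   # ~8 GiB for one int[] near the limit
print(long_array_bytes / 2**30)  # ~16 GiB if doc ids became longs
```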

which is a long way of saying that I don't really think anyone's going
to be working on this any time soon, especially when SolrCloud removes
a LOT of the pain/complexity (from a user perspective anyway) of
going to a sharded setup...

FWIW,
Erick

On Thu, May 2, 2013 at 1:17 PM, Valery Giner <valgi...@research.att.com> wrote:
Otis,

The documents themselves are relatively small, tens of fields, only a few of
them could be up to a hundred bytes.
Linux servers with relatively large RAM (256),
Minutes on the searches are fine for our purposes, and adding a few tens of
millions of records in tens of minutes is also fine.
We had to do some simple tricks for keeping indexing up to speed but nothing
too fancy.
Moving to sharding adds a layer of complexity which, given the above, we
don't really need, ... and adding complexity may result in lower
reliability :)

Thanks,
Val


On 05/02/2013 03:41 PM, Otis Gospodnetic wrote:
Val,

Haven't seen this mentioned in a while...

I'm curious...what sort of index, queries, hardware, and latency
requirements do you have?

Otis
Solr & ElasticSearch Support
http://sematext.com/
On May 1, 2013 4:36 PM, "Valery Giner" <valgi...@research.att.com> wrote:

Dear Solr Developers,

I've been unable to find an answer to the question in the subject line of
this e-mail, except for a vague one.

We need to be able to index over 2 billion documents. We were doing well
without sharding until the number of docs hit the limit (2bln+). The
performance was satisfactory for queries, updates, and indexing of new
documents.

That is, except for the need to work around the int32 limit, we don't
really have a need to set up distributed Solr.

I wonder whether someone on the Solr team could tell us when, and in what
version of Solr, we could expect the limit to be removed.

I hope this question may be of interest to someone else :)

--
Thanks,
Val


