Have you considered presupposing SolrCloud and using the SPLITSHARD Collections API command? Even after that's done, at least one of the resulting sub-shards would (probably) need to be physically moved to another machine, but that too could be scripted.
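Something like the following would kick off the split. It's just a rough, untested sketch against the Collections API; the host, collection, and shard names are placeholders:

# Rough sketch: trigger a shard split via the SolrCloud Collections API.
# The host, collection and shard names below are placeholders, not real values.
import json
import urllib.parse
import urllib.request

SOLR_URL = "http://localhost:8983/solr"  # assumed SolrCloud node

def split_shard(collection, shard):
    """Ask SolrCloud to split one logical shard into two sub-shards."""
    params = urllib.parse.urlencode({
        "action": "SPLITSHARD",
        "collection": collection,
        "shard": shard,
        "wt": "json",
    })
    url = "%s/admin/collections?%s" % (SOLR_URL, params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(split_shard("mycollection", "shard1"))

Bear in mind the split can take a while on a big shard (the Collections API has an async parameter for exactly that), and if memory serves the two sub-shards initially land on the same node as the parent, which is why the physical move is a separate step.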
May not be desirable, but I thought I'd mention it.

Best,
Erick

On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge <peter.stu...@gmail.com> wrote:
> Yes, totally agree. We run 500m+ docs in a (non-cloud) Solr4, and it even
> performs reasonably well on commodity hardware with lots of faceting and
> concurrent indexing! Ok, you need a lot of RAM to keep faceting happy, but
> it works.
>
> ++1 for the automagic shard creator. We've been looking into doing this
> sort of thing internally - i.e. when a shard reaches a certain size/num
> docs, it creates 'sub-shards' to which new commits are sent and queries to
> the 'parent' shard are included. The concept works, as long as you don't
> try any non-dist stuff - it's one reason why all our fields are always
> single valued. There are also other implications like cleanup, deletes and
> security to take into account, to name a few.
> A cool side-effect of sub-sharding (for lack of a snappy term) is that the
> parent shard then stops suffering from auto-warming latency due to commits
> (we do a fair amount of committing). In theory, you could carry on
> sub-sharding until your hardware starts gasping for air.
>
>
> On Sun, Jan 4, 2015 at 1:44 PM, Bram Van Dam <bram.van...@intix.eu> wrote:
>
>> On 01/04/2015 02:22 AM, Jack Krupansky wrote:
>>
>>> The reality doesn't seem to
>>> be there today. 50 to 100 million documents, yes, but beyond that takes
>>> some kind of "heroic" effort, whether a much beefier box, very careful and
>>> limited data modeling or limiting of query capabilities or tolerance of
>>> higher latency, expert tuning, etc.
>>>
>>
>> I disagree. On the scale, at least. Up until 500M Solr performs "well"
>> (read: well enough considering the scale) in a single shard on a single box
>> of commodity hardware. Without any tuning or heroic efforts. Sure, some
>> queries aren't as snappy as you'd like, and sure, indexing and querying at
>> the same time will be somewhat unpleasant, but it will work, and it will
>> work well enough.
>>
>> Will it work for thousands of concurrent users? Of course not. Anyone who
>> is after that sort of thing won't find themselves in this scenario -- they
>> will throw hardware at the problem.
>>
>> There is something to be said for making sharding less painful. It would
>> be nice if, for instance, Solr would automagically create a new shard once
>> some magic number was reached (2B at the latest, I guess). But then that'll
>> break some query features ... :-(
>>
>> The reason we're using single large instances (sometimes on beefy
>> hardware) is that SolrCloud is a pain. Not just from an administrative
>> point of view (though that seems to be getting better, kudos for that!),
>> but mostly because some queries cannot be executed with distributed=true.
>> Our users, at least, prefer a slow query over an impossible query.
>>
>> Actually, this 2B limit is a good thing. It'll help me convince
>> $management to donate some of our time to Solr :-)
>>
>> - Bram
>>
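As an aside, the "automagic" sub-sharding idea above could be approximated today with a small external watchdog rather than waiting for Solr to grow the feature: poll each shard's doc count and fire a SPLITSHARD when it crosses a threshold. A rough, untested sketch, assuming SolrCloud and the same placeholder host/collection/shard names as before, with the threshold kept well under Lucene's ~2.1 billion docs-per-index ceiling:

# Rough sketch of an external "automagic" shard splitter: poll the doc count
# of each watched shard and ask SolrCloud to split it once a threshold is
# crossed. All names and values are placeholders; error handling and the
# physical move of the resulting sub-shards are left out.
import json
import time
import urllib.parse
import urllib.request

SOLR_URL = "http://localhost:8983/solr"   # assumed SolrCloud node
COLLECTION = "mycollection"               # hypothetical collection
SHARDS = ["shard1", "shard2"]             # logical shards to watch
MAX_DOCS_PER_SHARD = 500_000_000          # well under Lucene's ~2.1B cap

def solr_get(path, **params):
    """Issue a GET against Solr and return the parsed JSON response."""
    params["wt"] = "json"
    url = "%s/%s?%s" % (SOLR_URL, path, urllib.parse.urlencode(params))
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def shard_doc_count(shard):
    """Doc count of one logical shard, via the 'shards' request parameter."""
    # note: numFound ignores deleted docs; the hard Lucene cap is on maxDoc
    rsp = solr_get("%s/select" % COLLECTION, q="*:*", rows=0, shards=shard)
    return rsp["response"]["numFound"]

def watch():
    while True:
        for shard in SHARDS:
            count = shard_doc_count(shard)
            if count > MAX_DOCS_PER_SHARD:
                print("splitting %s (%d docs)" % (shard, count))
                solr_get("admin/collections", action="SPLITSHARD",
                         collection=COLLECTION, shard=shard)
        time.sleep(3600)   # check hourly

if __name__ == "__main__":
    watch()

A real version would refresh the shard list (e.g. from CLUSTERSTATUS) after each split, since the parent goes inactive and the sub-shards get names like shard1_0 and shard1_1, and it would still have to move sub-shards off the original node and deal with the cleanup, delete, and security wrinkles Peter mentions.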