Re: Index very large number of documents from large number of clients

Erick Erickson Sat, 15 Aug 2015 21:04:24 -0700

Piling on here. At the scale you're talking about, I suspect you'll not only
have a bunch of servers, you'll really have a bunch of completely separate
"Solr Clouds", complete with their own Zookeepers etc. Partly for
administrative reasons, partly for stability, etc.


Not sure that'll be true, mind you, but a "divide and conquer" approach
seems in order.

And to be clear, the multiple clusters are NOT because of 3 billion docs.
I've certainly seen that number of docs fit on 10 shards when the records
are as small as yours are. OTOH, I've seen it take 30 or 60 shards, but
that's usually for complex documents. As Shawn says, prototyping is the
only way to be sure.
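
For a rough sense of the scale, here is the back-of-envelope arithmetic as a
small Python sketch. The client count, docs per client, and average doc size
come from Troy's message, and the shard counts are the ones mentioned in this
reply. The function names are made up for illustration. Raw document bytes are
not the same as on-disk index size, so treat this only as a starting point for
a prototype, not as a sizing answer.

```python
# Back-of-envelope sizing using the figures quoted in this thread.
# Raw document bytes are NOT the on-disk index size; only a prototype
# with real data can tell you that.
CLIENTS = 6_000
DOCS_PER_CLIENT = 500_000
AVG_DOC_BYTES = 400


def total_docs(clients=CLIENTS, docs_per_client=DOCS_PER_CLIENT):
    """Total number of documents across all clients."""
    return clients * docs_per_client


def raw_terabytes(docs, avg_doc_bytes=AVG_DOC_BYTES):
    """Raw (unindexed) data volume in decimal terabytes."""
    return docs * avg_doc_bytes / 1e12


def docs_per_shard(docs, shards):
    """Documents per shard, rounded up, assuming an even spread."""
    return -(-docs // shards)


if __name__ == "__main__":
    docs = total_docs()
    print(f"total docs: {docs:,}")
    print(f"raw data:   {raw_terabytes(docs):.1f} TB")
    for shards in (10, 30, 60):
        print(f"{shards:>2} shards: {docs_per_shard(docs, shards):,} docs per shard")
```

The per-shard figures assume documents are spread evenly, which is a
simplification.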

The multiple clusters come in if you choose to have 6,000 _collections_;
at that point you'll need some kind of division.

Now, if you can create a smaller number of collections and have, say,
a collection ID field on each doc, you can simply add a filter query such as
fq=collectionID:<id> to each query and that'll show you only the docs with
that ID. This could be significantly simpler than maintaining 6,000
collections.
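
A minimal sketch of what that filter looks like from a client, in Python with
only the standard library. The field name collectionID is the one suggested
above; the base URL, the collection name "shared", and the ID value
"client-42" are placeholders I made up, and build_select_url is just a helper
name for this example. It only builds the URL for Solr's /select handler and
does not send the request.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, client_id, user_query="*:*"):
    """Build a /select URL that restricts results to one client's docs.

    The restriction goes in fq (a filter query) rather than in q, so it
    does not affect scoring and can be cached separately by Solr.
    """
    # Quote the value so IDs containing spaces or colons stay one term;
    # escape backslashes and double quotes inside the quoted value.
    escaped = str(client_id).replace("\\", "\\\\").replace('"', '\\"')
    params = [
        ("q", user_query),
        ("fq", f'collectionID:"{escaped}"'),
        ("wt", "json"),
    ]
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host, collection name, and client ID.
    print(build_select_url("http://localhost:8983/solr", "shared", "client-42"))
```

Every query for a given client would go through something like this, so the
client never sees docs carrying another ID.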

Best,
Erick

On Sat, Aug 15, 2015 at 8:40 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 8/15/2015 2:03 PM, Troy Edwards wrote:
>> I am using SolrCloud
>>
>> My initial requirements are:
>>
>> 1) There are about 6000 clients
>> 2) The number of documents from each client are about 500000 (average
>> document size is about 400 bytes)
>> 3) I have to wipe off the index/collection every night and create new
>>
>> Any thoughts/ideas/suggestions on:
>>
>> 1) How to index such large number of documents i.e. do I use an http client
>> to send documents or is data import handler right or should I try uploading
>> CSV files?
>
> This is general info only.
>
> 6000 clients, each with half a million docs?  That's 3 billion docs.
> There are some users who have more, but this is squarely in the realm of
> a HUGE install.
>
>> 2) How many collections should I use?
>>
>> 3) How many shards / replicas per collection should I use?
>
> Any answer we came up with for those two questions would involve quite a
> few assumptions, any one of which could be wrong.  The only way to
> really find out what you need is to set up a prototype system and test
> it with real data, real indexing requests, and real queries.  Record the
> results of the tests, change the configuration, rebuild the index(es),
> and run the tests again.
>
> The number one rule when it comes to Solr performance: Install enough
> memory so that all the index data on the server will fit in the
> available OS disk cache RAM.  You're going to have a lot of index data.
>
> https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> https://wiki.apache.org/solr/SolrPerformanceProblems
>
> When the number of collections reaches the low hundreds, SolrCloud
> stability begins to suffer because of how much interaction with
> Zookeeper is required for very small cluster changes.  When there are
> thousands of collections, any little problem turns into a nightmare.
> Adding more machines doesn't help this particular problem.  Some ideas
> are being discussed to make this better, but users won't see the results
> of that effort until version 5.4 or 5.5, possibly later.
>
>> 4) Do I need multiple Solr servers?
>
> You would need multiple servers for any hope of redundancy, but the
> answer to the question I think you're trying to ask here is yes.
> Definitely.  Possibly a LOT of them.
>
> Thanks,
> Shawn
>
