Erick,

You make some very valid points. Let me clear a few things up, though. We are not trying to put 7B docs into one single shard, because we are using collections, created daily, which spread the index across the 32 shards that make up the cloud/collection. Last I counted, we are putting about 160M docs per shard per collection. You are correct about the memory issues. In fact, we cannot do any complicated searches or faceting without Solr returning memory errors. Fortunately, basic searching still works fine. This limit on search capability is acceptable in our case, though not ideal, if it means the project succeeds and comes in under budget.
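For what it's worth, here is a rough back-of-the-envelope sketch of that shard arithmetic. All figures come from this thread (32 shards per daily collection, ~160M docs per shard, Lucene's ~2.1B per-shard cap that Erick mentions below); the script itself is purely illustrative.

```python
# Back-of-the-envelope shard arithmetic for the layout described above.
# All figures are the ones quoted in this thread; nothing here is measured.

LUCENE_HARD_LIMIT = 2_147_483_647    # Lucene's per-shard document ceiling (2^31 - 1)

shards_per_collection = 32           # one new collection is created per day
docs_per_shard = 160_000_000         # ~160M docs per shard per collection

docs_per_collection = shards_per_collection * docs_per_shard
headroom = LUCENE_HARD_LIMIT / docs_per_shard

print(f"docs per daily collection : {docs_per_collection:,}")    # ~5.1 billion
print(f"headroom vs. Lucene limit : {headroom:.1f}x per shard")  # ~13x below the 2B cap
```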
Thanks,
Scott

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, August 15, 2014 7:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud performance degradation with billions of documents

Toke:

You make valid points. You're completely right that my reflexes are for sub-second responses, so I tend to think of lots and lots of memory as a requirement. I agree that the percentage of the index that has to be in memory varies widely depending on the problem space; I've seen a large variance across projects. And I know you've done some _very_ interesting things tuning-wise!

I guess my main issue is that, from everything I've seen so far, this project is doomed. You simply cannot put 7B documents in a single shard, period. Lucene has a 2B hard limit. Wilburn is making assumptions here that are simply wrong. Or my math is off; that's been known to happen too.

For instance, Wilburn is talking about only using 6G of memory. Even at 2B docs/shard, I'd be surprised to see it function at all. Don't try sorting on a timestamp, for instance. I've never seen 2B docs fit on a shard and be OK performance-wise. Or, for that matter, perform at all. If there are situations like that, I'd _love_ to know the details...

For any chance of success, Wilburn has to go back and do some reassessment, IMO. There's no magic knob to turn to overcome the fundamental limitations that are going to come creeping out of the woodwork. Indexing throughput is the least of his problems. I further suspect (but don't know for sure) that the first time realistic queries start hitting the system it'll OOM.

All that said, I don't know for sure of course.

On Thu, Aug 14, 2014 at 11:57 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> Erick Erickson [erickerick...@gmail.com] wrote:
>> Solr requires holding large parts of the index in memory.
>> For the entire corpus. At once.
>
> That requirement is under the assumption that one must have the lowest
> possible latency at each individual box. You might as well argue for the
> fastest possible memory or the fastest possible CPU being a requirement. The
> advice is good in some contexts and a waste of money in others.
>
> I not-so-humbly point to
> http://sbdevel.wordpress.com/2014/08/13/whale-hunting-with-solr/ where we
> (for simple searches) handily achieve our goal of sub-second response times
> for a 10TB index with just 1.4% of the index cached in RAM. Had our goal been
> sub-50ms, it would be another matter, but it is not. Just as Wilburn's
> problem is not to minimize latency for each individual box, but to achieve a
> certain throughput for indexing while performing searches.
>
> Wilburn's hardware is currently able to keep up, although barely, with 300B
> documents. He needs to handle 900B. Tripling (or quadrupling) the number of
> machines should do the trick. Increasing the amount of RAM on each current
> machine might also work (qua the well-known effect of RAM with Lucene/Solr).
> Using local SSDs, if he is not doing so already, might also work (qua the
> article above).
>
> - Toke Eskildsen
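As an aside, Toke's last paragraph boils down to simple scale-out arithmetic. The current machine count is not stated anywhere in the thread, so it is left as a parameter below; the sketch is purely illustrative.

```python
# Illustrative sketch of Toke's scaling argument: if the current cluster barely
# keeps up with 300B documents, a 900B target suggests roughly 3-4x the machines.
# The actual machine count is not given in this thread, so it is a parameter here.

current_docs = 300e9    # documents the current hardware barely handles
target_docs = 900e9     # documents that need to be handled

scale_factor = target_docs / current_docs   # = 3.0

def machines_needed(current_machines: int, safety_margin: float = 1.0) -> int:
    """Linear scale-out estimate, with an optional margin since 'barely keeping
    up' leaves no headroom (hence Toke's 'tripling or quadrupling')."""
    return int(round(current_machines * scale_factor * safety_margin))

# Example with a hypothetical 20-machine cluster:
print(machines_needed(20))                       # 60 machines at a straight 3x
print(machines_needed(20, safety_margin=1.33))   # ~80 machines with ~33% headroom
```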