Erick,

You make some very valid points. Let me clear a few things up, though. We are not trying to put 7B docs into one single shard, because we are using collections, created daily, which spread the index across the 32 shards that make up the cloud/collection. Last I counted, we are putting about 160M docs per shard per collection. You are correct about the memory issues. In fact, we cannot do any complicated searches or faceting without Solr returning memory errors. Fortunately, basic searching still works fine. This limit on search capability is acceptable in our case, though not ideal, if it means the project succeeds and comes in under budget.
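For what it's worth, here is a rough back-of-the-envelope sketch of that shard arithmetic. All figures come from this thread (32 shards per daily collection, ~160M docs per shard, Lucene's ~2.1B per-shard cap that Erick mentions below); the script itself is purely illustrative.

```python
# Back-of-the-envelope shard arithmetic for the layout described above.
# All figures are the ones quoted in this thread; nothing here is measured.

LUCENE_HARD_LIMIT = 2_147_483_647    # Lucene's per-shard document ceiling (2^31 - 1)

shards_per_collection = 32           # one new collection is created per day
docs_per_shard = 160_000_000         # ~160M docs per shard per collection

docs_per_collection = shards_per_collection * docs_per_shard
headroom = LUCENE_HARD_LIMIT / docs_per_shard

print(f"docs per daily collection : {docs_per_collection:,}")    # ~5.1 billion
print(f"headroom vs. Lucene limit : {headroom:.1f}x per shard")  # ~13x below the 2B cap
```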
Thanks,
Scott

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, August 15, 2014 7:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud performance degradation with billions of documents

Toke:

You make valid points. You're completely right that my reflexes are for sub-second responses, so I tend to think of lots and lots of memory as a requirement. I agree that the percentage of the index that has to be in memory varies widely depending on the problem space; I've seen a large variance across projects. And I know you've done some _very_ interesting things tuning-wise!

I guess my main issue is that, from everything I've seen so far, this project is doomed. You simply cannot put 7B documents in a single shard, period. Lucene has a 2B hard limit. Wilburn is making assumptions here that are simply wrong. Or my math is off; that's been known to happen too.

For instance, Wilburn is talking about only using 6G of memory. Even at 2B docs/shard, I'd be surprised to see it function at all. Don't try sorting on a timestamp, for instance. I've never seen 2B docs fit on a shard and be OK performance-wise. Or, for that matter, perform at all. If there are situations like that, I'd _love_ to know the details...

For any chance of success, Wilburn has to go back and do some reassessment, IMO. There's no magic knob to turn to overcome the fundamental limitations that are going to come creeping out of the woodwork. Indexing throughput is the least of his problems. I further suspect (but don't know for sure) that the first time realistic queries start hitting the system it'll OOM.

All that said, I don't know for sure of course.

On Thu, Aug 14, 2014 at 11:57 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> Erick Erickson [erickerick...@gmail.com] wrote:
>> Solr requires holding large parts of the index in memory.
>> For the entire corpus. At once.
>
> That requirement is under the assumption that one must have the lowest
> possible latency at each individual box. You might as well argue for the
> fastest possible memory or the fastest possible CPU being a requirement. The
> advice is good in some contexts and a waste of money in others.
>
> I not-so-humbly point to
> http://sbdevel.wordpress.com/2014/08/13/whale-hunting-with-solr/ where we
> (for simple searches) handily achieve our goal of sub-second response times
> for a 10TB index with just 1.4% of the index cached in RAM. Had our goal been
> sub-50ms, it would be another matter, but it is not. Just as Wilburn's
> problem is not to minimize latency for each individual box, but to achieve a
> certain throughput for indexing while performing searches.
>
> Wilburn's hardware is currently able to keep up, although barely, with 300B
> documents. He needs to handle 900B. Tripling (or quadrupling) the number of
> machines should do the trick. Increasing the amount of RAM on each current
> machine might also work (qua the well-known effect of RAM with Lucene/Solr).
> Using local SSDs, if he is not doing so already, might also work (qua the
> article above).
>
> - Toke Eskildsen
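As an aside, Toke's last paragraph boils down to simple scale-out arithmetic. The current machine count is not stated anywhere in the thread, so it is left as a parameter below; the sketch is purely illustrative.

```python
# Illustrative sketch of Toke's scaling argument: if the current cluster barely
# keeps up with 300B documents, a 900B target suggests roughly 3-4x the machines.
# The actual machine count is not given in this thread, so it is a parameter here.

current_docs = 300e9    # documents the current hardware barely handles
target_docs = 900e9     # documents that need to be handled

scale_factor = target_docs / current_docs   # = 3.0

def machines_needed(current_machines: int, safety_margin: float = 1.0) -> int:
    """Linear scale-out estimate, with an optional margin since 'barely keeping
    up' leaves no headroom (hence Toke's 'tripling or quadrupling')."""
    return int(round(current_machines * scale_factor * safety_margin))

# Example with a hypothetical 20-machine cluster:
print(machines_needed(20))                       # 60 machines at a straight 3x
print(machines_needed(20, safety_margin=1.33))   # ~80 machines with ~33% headroom
```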