Toke:

You make valid points. You're completely right that my reflexes are tuned
for sub-second responses, so I tend to think of lots and lots of memory
as a requirement. I agree that the percentage of the index that has to
be in memory varies widely depending on the problem space; I've seen a
large variance across projects. And I know you've done some _very_
interesting things tuning-wise!

I guess my main issue is that from everything I've seen so far, this
project is doomed. You simply cannot put 7B documents in a single shard,
period: Lucene has a hard limit of roughly 2B documents per index.
Wilburn is making assumptions here that are simply wrong. Or my math is
off; that's been known to happen too.
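
To make the arithmetic concrete, here is a back-of-the-envelope sketch
(my numbers, not Wilburn's): it takes Lucene's int-docID ceiling of
roughly 2.1B documents per index and the 7B total from this thread, and
works out the minimum shard count needed just to stay under the hard
limit.

    // Minimum shard count for 7B docs, given Lucene's per-index ceiling.
    // Documents are addressed by int docIDs, so ~2.147B is a hard wall,
    // not a tunable setting. The 7B total is the figure from this thread.
    public class ShardMath {
        public static void main(String[] args) {
            long perShardCeiling = Integer.MAX_VALUE;      // ~2.147 billion docs
            long totalDocs = 7_000_000_000L;               // total corpus size
            long minShards = (totalDocs + perShardCeiling - 1) / perShardCeiling;
            System.out.println("Minimum shards: " + minShards);  // prints 4
        }
    }

So even before any performance consideration, that corpus has to be
split across at least four shards.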

For instance, Wilburn is talking about using only 6G of memory. Even at
2B docs/shard, I'd be surprised to see it function at all; don't even
try sorting on a timestamp. I've never seen 2B docs fit on a shard and
perform acceptably, or, for that matter, perform at all. If there are
setups like that out there, I'd _love_ to know the details...
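
To put rough numbers on the sorting point, here is an illustrative
sketch. It assumes a long timestamp field without docValues, so Solr has
to un-invert the field into an on-heap cache, roughly one 8-byte value
per document; the figures are estimates, not measurements.

    // Rough heap cost of sorting 2B docs on a long timestamp field that
    // has no docValues: about one 8-byte long per document on the heap.
    public class SortMemoryEstimate {
        public static void main(String[] args) {
            long docsPerShard = 2_000_000_000L;
            long bytesPerValue = Long.BYTES;               // 8 bytes per doc
            double heapGb = docsPerShard * (double) bytesPerValue / (1L << 30);
            System.out.printf("~%.1f GB of heap for the sort field alone%n", heapGb);
        }
    }

That comes out to around 15 GB for that one field, against a 6G heap.
With docValues the sort data lives off-heap instead, but it still
competes for the same scarce page cache.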

For any chance of success, Wilburn has to go back and do some
reassessment IMO. There's no magic knob to turn to overcome the
fundamental limitations that are going to come creeping out of the
woodwork. Indexing throughput is the least of his problems. I further
suspect (but don't know for sure) that the first time realistic
queries start hitting the system it'll OOM.

All that said, I don't know for sure of course.

On Thu, Aug 14, 2014 at 11:57 AM, Toke Eskildsen <t...@statsbiblioteket.dk> 
wrote:
> Erick Erickson [erickerick...@gmail.com] wrote:
>> Solr requires holding large parts of the index in memory.
>> For the entire corpus. At once.
>
> That requirement is under the assumption that one must have the lowest 
> possible latency at each individual box. You might as well argue for the 
> fastest possible memory or the fastest possible CPU being a requirement. The 
> advice is good in some contexts and a waste of money in others. The 
>
> I not-so-humbly point to 
> http://sbdevel.wordpress.com/2014/08/13/whale-hunting-with-solr/ where we 
> (for simple searches) handily achieve our goal of sub-second response times 
> for a 10TB index with just 1.4% of the index cached in RAM. Had our goal been 
> sub-50ms, it would be another matter, but it is not. Just as Wilburn's 
> problem is not to minimize latency for each individual box, but to achieve a 
> certain throughput for indexing, while performing searches.
>
> Wilburn's hardware is currently able to keep up, although barely, with 300B 
> documents. He needs to handle 900B. Tripling (or quadrupling) the number of 
> machines should do the trick. Increasing the amount of RAM on each current 
> machine might also work (qua the well known effect of RAM with Lucene/Solr). 
> Using local SSDs, if he is not doing so already, might also work (qua the 
> article above).
>
> - Toke Eskildsen
