bq: I am interested in knowing, when you have multiple
collections, as in this case (60), and you query just one collection,

Yes... and no. You're correct about OS memory swapping in
and out, but the JVM heap is a different matter. There'll
be some low-level caches filled up. Each collection may have
filterCache entries. Or sort entries. Or... There are lots of
Java memory-resident structures that are _not_ swapped
out. Furthermore, each collection may have 1-n warming
queries fired when it's loaded. And the top-level caches
configured in solrconfig.xml may have autowarm counts. And....
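
To make that concrete, here's a rough sketch of the kinds of
solrconfig.xml entries I'm talking about. The cache sizes, the
"timestamp" field and the warming query are made-up illustrations,
not recommendations:

  <!-- solrconfig.xml (per collection), inside the <query> section.
       Every loaded core gets its own filterCache on the JVM heap;
       autowarmCount re-populates entries each time a new searcher opens. -->
  <filterCache class="solr.FastLRUCache"
               size="512" initialSize="512" autowarmCount="128"/>

  <!-- Warming queries fired whenever a new searcher is opened; these fill
       caches and load sort fields ("timestamp" is a hypothetical field). -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">*:*</str><str name="sort">timestamp desc</str></lst>
    </arr>
  </listener>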

The long and short of it is that each collection will consume
a varying amount of memory when it's loaded. Memory that
MUST live in the JVM heap. Other memory will be paged in and
out; see Uwe's excellent blog here:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
So as you add more and more collections, you'll hit this
kind of problem.
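
The paged part is the memory-mapped index files themselves. Just to
illustrate where that lives in the config (a sketch only; on a 64-bit
JVM the stock directoryFactory already uses MMapDirectory under the
covers, so this isn't something you normally need to change):

  <!-- solrconfig.xml: the index files are mmap'ed, so the OS pages them
       in and out of free RAM *outside* the JVM heap. -->
  <directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>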

Now, there is the "LotsOfCores" code, see:
http://wiki.apache.org/solr/LotsOfCores
WARNING: This is NOT supported (yet) for SolrCloud.
That code loads and unloads cores as necessary, based on
configuration parameters (sketched below) that determine
1> whether a core can be unloaded/loaded
2> how many "transient" cores can be in memory at
     once
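
Roughly, the knobs look like the sketch below. This is the old-style
solr.xml form and the core name is invented; with core discovery the
transient/loadOnStartup flags go in each core's core.properties
instead, so check the wiki for the exact syntax for your version:

  <!-- solr.xml (old style), sketch only. transientCacheSize caps how many
       transient cores stay loaded at once; beyond that the least-recently-used
       transient core is unloaded to make room for the next one. -->
  <solr persistent="true">
    <cores adminPath="/admin/cores" transientCacheSize="20">
      <core name="day_2014_08_16" instanceDir="day_2014_08_16"
            transient="true" loadOnStartup="false"/>
    </cores>
  </solr>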

Best,
Erick


On Sat, Aug 16, 2014 at 6:43 PM, shushuai zhu <ss...@yahoo.com.invalid> wrote:
> Erick,
>
> ---------------------------
> I fear the problem will be this: you won't even be able to do basic searches 
> as the number of shards on a particular machine increases. To test, fire off a 
> simple search for each of your 60 days. I expect it'll blow you out of the 
> water. This assumes that all your shards are hosted in the same JVM on each 
> of your 32 machines. But that's totally a guess.
> ---------------------------
>
> In this case, assuming there are 60 collections and only one collection is 
> queried at a time, should the memory requirements be those for that 
> collection only? My understanding is that when a new collection is queried, the 
> indexes (cores) of the old collection are swapped out of the OS cache and 
> the indexes of the new collection are brought in, but the memory requirements 
> should be roughly the same as long as the two collections have similar sizes.
>
> I am interested in knowing, when you have multiple collections, as in this case 
> (60), and you query just one collection, should the other collections matter from 
> a performance perspective? Since different collections contain different cores, 
> if querying one collection involves cores from other collections, is that a bug?
>
> Thanks.
>
> Shushuai
>
>
>  From: Erick Erickson <erickerick...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Friday, August 15, 2014 7:30 PM
> Subject: Re: Solr cloud performance degradation with billions of documents
>
>
> Toke:
>
> bq: I would have agreed with you fully an hour ago.....
>
> Well, I now disagree with myself too :).... I don't mind
> talking to myself. I don't even mind arguing with myself. I
> really _do_ mind losing the arguments I have with
> myself though.
>
> Scott:
>
> OK, that has a much better chance of working, I obviously
> misunderstood. So you'll have 60 different collections and each
> collection will have one shard on each machine.
>
> When the time comes to roll some of the collections off the
> end due to age, "collection aliasing" may be helpful. I still think
> you're significantly undersized, but you know your problem
> space better than I do.
>
> I fear the problem will be this: you won't even be able to do
> basic searches as the number of shards on a particular
> machine increases. To test, fire off a simple search for each of
> your 60 days. I expect it'll blow you out of the water. This
> assumes that all your shards are hosted in the same JVM
> on each of your 32 machines. But that's totally a guess.
>
> Keep us posted!
>
>
> On Fri, Aug 15, 2014 at 2:40 PM, Toke Eskildsen <t...@statsbiblioteket.dk> 
> wrote:
>> Erick Erickson [erickerick...@gmail.com] wrote:
>>> I guess that my main issue is that from everything I've seen so far,
>>> this project is doomed. You simply cannot put 7B documents in a single
>>> shard, period. Lucene has a 2B hard limit.
>>
>> I would have agreed with you fully an hour ago and actually planned to ask 
>> Wilburn to check if he had corrupted his indexes. However, his latest post 
>> suggests that the scenario is more about having a larger number of more 
>> reasonably sized shards in play than building gigantic shards.
>>
>>> For instance, Wilburn is talking about only using 6G of memory. Even
>>> at 2B docs/shard, I'd be surprised to see it function at all. Don't
>>> try sorting on a timestamp for instance.
>>
>> I haven't understood Wilburn's setup completely, as it seems to me that he 
>> will quickly run out of memory for starting new shards. But if we are 
>> looking at shards of 30GB and 160M documents, 6GB sounds a lot better.
>>
>> Regards,
>> Toke Eskildsen
