Isn't this just a question of configuration and my hardware? The better my
hardware, the more cores I can keep in memory. Let's say I can keep 10,000
cores in memory; when I reach the limit, I mark the least recently used core
as "dormant" and "wake up" the new one I need.

So I can scale further by reusing memory, threads, etc., paying the price of
"waking up" the cores.

Isn't waking up just a matter of reopening the reader and searcher? How big
is the Solr overhead?

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Monday, September 1, 2014 14:50
To: solr-user@lucene.apache.org
Subject: Re: AW: Scaling to large Number of Collections

And I would add another suggested requirement - "dormant collections":
collections which may once have been active but have not seen any recent
activity, and which can hence be "suspended" or "swapped out" until activity
resumes and they are "reactivated" or "reloaded". That inactivity threshold
might be something like an hour, but it should be configurable both globally
and per-collection. The alternative is an application server which maintains
that activity state and starts up and shuts down discrete Solr server
instances for each tenant's collection(s).
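
The bookkeeping for that could be as simple as this sketch (plain Java,
purely illustrative; the suspend step is a placeholder for whatever
unload/shutdown mechanism is used):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Record activity per collection, sweep periodically, and suspend
    // anything idle past the configured threshold.
    public class DormancyMonitor {
        private final Map<String, Long> lastActivity =
                new ConcurrentHashMap<String, Long>();
        private final long idleThresholdMillis;

        public DormancyMonitor(long idleThresholdMillis) {
            this.idleThresholdMillis = idleThresholdMillis;
            Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
                new Runnable() {
                    public void run() { sweep(); }
                }, 1, 1, TimeUnit.MINUTES);
        }

        public void touch(String collection) {  // call on every query/update
            lastActivity.put(collection, System.currentTimeMillis());
        }

        private void sweep() {
            long cutoff = System.currentTimeMillis() - idleThresholdMillis;
            for (Map.Entry<String, Long> e : lastActivity.entrySet()) {
                if (e.getValue() < cutoff) suspend(e.getKey());
            }
        }

        private void suspend(String collection) {
            // placeholder: e.g. unload the core, keeping the index on disk,
            // to be reloaded when activity resumes
        }
    }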

This raises the question: How many of your collections need to be 
simultaneously active? Say, in a one-hour period, how many of them will be 
updating and serving queries, and what query load per-collection and total 
query load do you need to design for?

-- Jack Krupansky
-----Original Message-----
From: Christoph Schmidt
Sent: Monday, September 1, 2014 3:50 AM
To: solr-user@lucene.apache.org
Subject: AW: Scaling to large Number of Collections

Yes, this would help us in our scenario.

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Sunday, August 31, 2014 18:10
To: solr-user@lucene.apache.org
Subject: Re: Scaling to large Number of Collections

We should also consider "lightly-sharded" collections. IOW, even if a cluster
has dozens or a hundred nodes or more, the goal may not be to shard every
collection across all of them. That is fine for the really large collections,
but we should also support collections that need only a few shards, or even
just a single shard, and focus the attention on a large number of collections
rather than on heavily-sharded collections.
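
For example, a tenant collection with a single shard is already expressible
through the Collections API - a plain-Java sketch (host and names are
illustrative):

    import java.io.InputStream;
    import java.net.URL;

    // Create a "lightly-sharded" collection: one shard, two replicas.
    public class CreateTenantCollection {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:8983/solr/admin/collections"
                    + "?action=CREATE&name=tenant_0042"
                    + "&numShards=1"                 // a single shard is plenty
                    + "&replicationFactor=2"         // redundancy, not width
                    + "&collection.configName=shared_conf");
            InputStream in = url.openStream();       // fire the request
            in.close();
        }
    }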

-- Jack Krupansky

-----Original Message-----
From: Erick Erickson
Sent: Sunday, August 31, 2014 12:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Scaling to large Number of Collections

What is your access pattern? By that I mean: do all the cores need to be
searchable at the same time, or is it reasonable for them to be loaded on
demand? The latter imposes a penalty: the first time a collection is
accessed, there is a delay while the core loads. I suppose I'm asking "how
many customers are using the system simultaneously?". One way around the
delay is to fire a dummy query behind the scenes when a user logs on, but
before she actually executes a search.
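
E.g., something like this at login time (SolrJ sketch using the 4.x-era API;
the URL and core name are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    // Fire-and-forget warm-up query so the user's core is loaded (and its
    // searcher warmed) before the first real search arrives.
    public class CoreWarmer {
        public static void warm(final String coreUrl) {
            new Thread(new Runnable() {
                public void run() {
                    HttpSolrServer server = new HttpSolrServer(coreUrl);
                    try {
                        SolrQuery q = new SolrQuery("*:*");
                        q.setRows(0);     // we only want the loading side effect
                        server.query(q);
                    } catch (Exception e) {
                        // warming is best-effort; ignore failures
                    } finally {
                        server.shutdown();
                    }
                }
            }).start();
        }
    }

    // usage: CoreWarmer.warm("http://localhost:8983/solr/customer_0042");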

Why I'm asking:

See this page: http://wiki.apache.org/solr/LotsOfCores. It was intended for
the multi-tenancy case, in which you can count on only a subset of users
being logged on at any one time.
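
Concretely, the mechanism on that page is driven by per-core flags plus a
cache cap: cores marked transient=true / loadOnStartup=false are opened on
demand and evicted LRU-style once the configured transientCacheSize is
exceeded. A plain-Java sketch of creating such a core via the CoreAdmin API
(host and names are illustrative):

    import java.io.InputStream;
    import java.net.URL;

    // Create a core that is discovered at startup but only opened on
    // demand, and which may be evicted when the transient cache fills.
    public class CreateTransientCore {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:8983/solr/admin/cores"
                    + "?action=CREATE&name=customer_0042"
                    + "&instanceDir=customer_0042"
                    + "&transient=true"          // evictable under pressure
                    + "&loadOnStartup=false");   // no searcher opened at boot
            InputStream in = url.openStream();
            in.close();
        }
    }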

WARNING! LotsOfCores is NOT supported in SolrCloud at this point! There has
been some talk of extending support to SolrCloud, but no action, as it's one
of those cases with lots of implications, particularly around ZooKeeper
knowing the state of all the cores, cores going into recovery in a cascading
fashion, etc. It's not at all clear that it _can_ be extended to SolrCloud
without doing great violence to the code.

With the LotsOfCores approach (and assuming somebody volunteers to code it
up), the number of cores hosted on a particular node can be many thousands.
The real constraint is how many of them have to be up and running
simultaneously. Beyond that, the limits come from two places:
1> the time it takes to recursively walk your SOLR_HOME directory and
discover the cores (I see about 1,000 cores/second discovered on my laptop,
admittedly with an SSD, and no optimization has been done to this process);
2> having to keep a table of all the cores and their information (home
directory and the like) in memory. Practically, I don't think this is a
problem: I haven't actually measured, but each entry is almost certainly
less than 1K and probably closer to 0.5K, so even 100,000 cores would only
need on the order of 50-100 MB.

But it really does bring us back to the question of whether all these cores
are necessary or not. The "usual" technique for handling this with the
LotsOfCores option is to combine the records into a smaller number of cores.
Without knowing your requirements in detail, that might mean something like a
customers core and a products core where, say, each product has a field with
tokens indicating which users have access (or vice versa), possibly combined
with pseudo-joins. In one view, this is an ACL problem, which has several
solutions, each with drawbacks of course.
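
The filter might look something like this (field, core, and user names are
all hypothetical):

    import org.apache.solr.client.solrj.SolrQuery;

    // Token-ACL idea: each product document carries an "allowed_users"
    // field listing who may see it; the query simply filters on it.
    public class AclQueries {
        public static SolrQuery productsVisibleTo(String userId) {
            SolrQuery q = new SolrQuery("category:books");
            q.addFilterQuery("allowed_users:" + userId);
            // or the pseudo-join variant, with entitlements kept in a
            // separate "users" core:
            // q.addFilterQuery("{!join fromIndex=users from=product_id to=id}user_id:" + userId);
            return q;
        }
    }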

Or just de-normalize your data entirely and have a core per customer with
_all_ the products indexed into it.

Like I said, I don't know enough details to have a clue whether the data would 
explode unacceptably.

Anyway, enough on a Sunday morning!

Best,
Erick


On Sun, Aug 31, 2014 at 8:18 AM, Shawn Heisey <s...@elyograg.org> wrote:

> On 8/31/2014 8:58 AM, Joseph Obernberger wrote:
> > Could you add another field(s) to your application and use that
> > instead of creating collections/cores?  When you execute a search,
> > instead of picking a core, just search a single large core but add
> > in a field which contains some core ID.
>
> This is a nice idea.  Have one big collection in your cloud and use an 
> additional field in your queries to filter down to a specific user's data.
>
> It'd be really nice to write a custom search component that ensures 
> there is a filter query for that specific field, and if it's not 
> present, change the search results to include a document that informs 
> the caller that they're not doing it right.
>
> http://www.portal2sounds.com/1780
>
> (That URL probably won't work correctly on mobile browsers)
>
> Thanks,
> Shawn
>
>
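
Shawn's enforcement idea might look roughly like this - a sketch against the
SearchComponent API (the tenant_id field is a placeholder; this version
rejects the query outright instead of injecting an explanatory document, and
the exact set of abstract methods varies by Solr version):

    import java.io.IOException;
    import org.apache.solr.common.SolrException;
    import org.apache.solr.common.params.CommonParams;
    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;

    // Reject any query that lacks a filter on the tenant field.
    public class RequireTenantFilter extends SearchComponent {

        @Override
        public void prepare(ResponseBuilder rb) throws IOException {
            String[] fqs = rb.req.getParams().getParams(CommonParams.FQ);
            if (fqs != null) {
                for (String fq : fqs) {
                    if (fq.contains("tenant_id:")) {
                        return;  // tenant filter present; let the query run
                    }
                }
            }
            throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                    "Queries must include an fq on tenant_id");
        }

        @Override
        public void process(ResponseBuilder rb) throws IOException {
            // nothing to do; enforcement happens in prepare()
        }

        @Override
        public String getDescription() {
            return "Rejects queries missing a tenant_id filter";
        }

        @Override
        public String getSource() {
            return null;  // abstract in 4.x-era SearchComponent
        }
    }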
