Spike in SOLR Process and Frequent GC

2016-08-27 Thread Thiru M
Dear Folks,

We are using Solr 5.4.0 - "stand-alone" mode in our production boxes hosted
in Red Hat Enterprise Linux (RHEL) OS.

Each box has a number of different cores. I have attached a screenshot
with the Solr core & system details.

1. Earlier, indexing was performed every 30 minutes on both production
servers.

2. On the linux-a server, 30 (stand-alone) cores were created on the same day
and content was indexed into them.

3. We then spotted unusual GC activity every 2 to 7 seconds on the linux-a
server, and the Solr process spiked.

4. We then stopped indexing on the linux-a server for a week and monitored
both the Solr process and GC (no indexing was performed during this time).

5. We ensured from our end that no one uses the system at night, but both
the Solr process and GC were at their peak, even during non-business hours.

6. We restarted the Solr instance on the linux-a server. GC started again after
the Solr instance was brought up.

7. On the linux-b server there is no spike in the Solr process and no issues
with GC.

We couldn't figure out why the Solr process is so high and GC is running so
frequently on the linux-a server.

Has anyone encountered a similar issue?
Can anyone suggest ways to pinpoint the issue?

Regards,
Thirukumaran M


Re: Status Collection Down

2016-08-27 Thread Erick Erickson
Please review:

http://wiki.apache.org/solr/UsingMailingLists

There isn't nearly enough information here to even begin to help. And have
you looked at the Solr logs for the replicas that are down to try to
diagnose the underlying issue?

Best,
Erick

On Fri, Aug 26, 2016 at 6:45 PM, Hardika Catur S <
hardika.sa...@solusi247.com.invalid> wrote:

> Hi,
>
> I am seeing problems with collection status in Apache Solr:
> when Solr restarts, some collections' status changes to "down". It happened
> on servers 00 and 01.
> How can I bring those collections back up?
>
>
>
> Please help me to find a solution.
>
> Thanks,
> Hardika CS.
>


Re: Solr for Multi Tenant architecture

2016-08-27 Thread Erick Erickson
There's no one right answer here. I've also seen a hybrid approach
where there are multiple collections, each of which has some
number of tenants resident. Eventually you need to think of some
kind of partitioning; my rough number of documents for a single core
is 50M (NOTE: I've seen between 10M and 300M docs fit in a core).
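
If tenants do share a collection, the Composite-ID routing mentioned in the
question below is the usual way to keep each tenant's documents together.
A rough sketch (the tenant prefix and the tenant_id field are made up for
illustration):

  id = tenantA!doc-12345

  /select?q=*:*&fq=tenant_id:tenantA&_route_=tenantA!

The fq restricts results to one tenant, and _route_ sends the request only
to the shard(s) holding that tenant's route key.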

All that said, you may also be interested in the "transient cores"
option; see:
https://cwiki.apache.org/confluence/display/solr/Defining+core.properties
and the transient and transientCacheSize settings (the latter in solr.xml). Note
that this is stand-alone only, so you can't move that concept to
SolrCloud if you eventually go there.
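
For reference, a rough sketch of those settings (the cache size and core
name below are placeholders, not recommendations):

  In solr.xml:

  <solr>
    <!-- how many transient cores may stay loaded at one time -->
    <int name="transientCacheSize">20</int>
  </solr>

  In each swappable core's core.properties:

  name=tenant_core_01
  transient=true
  loadOnStartup=false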

Best,
Erick

On Fri, Aug 26, 2016 at 12:13 PM, Chamil Jeewantha  wrote:
> Dear Solr Members,
>
> We are using SolrCloud as the search provider of a multi-tenant, cloud-based
> application. We have one schema for all the tenants. The indexes will have
> a large number (millions) of documents.
>
> As of our research, we have two options,
>
>- One large collection for all the tenants and use Composite-ID routing
>- Collection per tenant
>
> The mail below says:
>
>
> https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201403.mbox/%3c5324cd4b.2020...@protulae.com%3E
>
> SolrCloud is *more scalable in terms of index size*. Plus you get
> redundancy which can't be underestimated in a hosted solution.
>
>
> AND
>
> The issue is management. 1000s of cores/collections require a level of
> automation. On the other hand, having a single core/collection means if
> you make one change to the schema or solrconfig, it affects everyone.
>
>
> Based on the above facts, we think one large collection will be the way to
> go.
>
> Questions:
>
>    1. Is that the right way to go?
>    2. Will it be a hassle when we need to do reindexing?
>    3. What is the chance of the entire collection crashing? (In that case all
>    tenants would be affected and reindexing would be painful.)
>
> Thank you in advance for your kind opinion.
>
> Best Regards,
> Chamil
>
> --
> http://kavimalla.blgospot.com
> http://kdchamil.blogspot.com


Re: How to update from Solr Cloud 5.4.1 to 5.5.1

2016-08-27 Thread Shawn Heisey
On 8/26/2016 10:22 AM, D'agostino Victor wrote:
> Do you know in which version the index format changed, and whether I should
> update to a higher version?

In version 6.0, and again in the just-released 6.2, one aspect of the
index format has been updated.  Version 6.1 didn't have any format
changes from 6.0.  You won't see the new version reflected in any of the
filenames in the index directory.

Whether or not to upgrade depends on what features you need, and whether
you need fixes included in the new version.  Not all of the fixed bugs
in 6.x are applicable to 5.x -- some are fixes for problems introduced
during 6.x development.

> And about ZooKeeper: is 3.4.8 fine, or should I update it too?

That's the newest stable version of ZooKeeper.  There are alpha releases
of version 3.5.

Solr includes ZooKeeper 3.4.6.  A 3.4.8 server will work, but no
guarantees can be made about the 3.5 alpha versions.

Thanks,
Shawn



Re: Solr for Multi Tenant architecture

2016-08-27 Thread Shawn Heisey
On 8/26/2016 1:13 PM, Chamil Jeewantha wrote:
> We are using SolrCloud as the search provider of a multi-tenant, cloud-based
> application. We have one schema for all the tenants. The indexes will have
> a large number (millions) of documents.
>
> As of our research, we have two options,
>
>- One large collection for all the tenants and use Composite-ID routing
>- Collection per tenant

I would tend to agree that you should use SolrCloud.  And to avoid
potential problems, each tenant should have their own collection or
collections.
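
As a rough sketch (the collection name, configset name, and shard/replica
counts below are made up), creating a per-tenant collection against a shared
configset is one Collections API call:

  http://host:8983/solr/admin/collections?action=CREATE&name=tenant_a&numShards=2&replicationFactor=2&collection.configName=shared_config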

You probably also need to put a smart load balancer in front of Solr
that can restrict access to URL paths containing the collection names to
the source addresses for each tenant.  The tenants should have no access
to the admin UI, because it's not possible to keep people using the
admin UI from seeing collections that aren't theirs.  Developing that
kind of security could be possible, but won't be easy at all.

If access to the admin UI is something that your customers demand, then
I think you'll need to have an entire cloud per tenant -- which probably
means you're going to want to delve into virtualization, possibly using
one of the lightweight implementations like Docker.  Note that if you
take this path, you're going to need a LOT of RAM -- much more than you
might imagine.

Thanks,
Shawn



Re: Spike in SOLR Process and Frequent GC

2016-08-27 Thread Shawn Heisey
On 8/27/2016 9:08 AM, Thiru M wrote:
> We are using Solr 5.4.0 - "stand-alone" mode in our production boxes
> hosted in Red Hat Enterprise Linux (RHEL) OS.
>
> Each box has a number of different cores. I have attached a screenshot
> with the Solr core & system details.
>
> 1. Earlier, indexing was performed every 30 minutes on both production
> servers.
>
> 2. On the linux-a server, 30 (stand-alone) cores were created on the same
> day and content was indexed into them.
>
> 3. We then spotted unusual GC activity every 2 to 7 seconds on the
> linux-a server, and the Solr process spiked.
>
> 4. We then stopped indexing on the linux-a server for a week and
> monitored both the Solr process and GC (no indexing was performed during
> this time).
>
> 5. We ensured from our end that no one uses the system at night, but
> both the Solr process and GC were at their peak, even during non-business
> hours.
>
> 6. We restarted the Solr instance on the linux-a server. GC started
> again after the Solr instance was brought up.
>
> 7. On the linux-b server there is no spike in the Solr process and no
> issues with GC.
>

Indexing creates a LOT of garbage.  Queries also create garbage, but not
nearly as fast as indexing.  Solr has some background processes, and
these will create garbage too.  Java uses a garbage collection memory
model, so this is completely normal for java applications.

What precisely were you measuring during the night with no activity, and
what precise methods were you using to measure it?  What part of the
information you obtained represents a problem in your mind?
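
One low-impact way to capture hard numbers for that (this assumes the stock
bin/solr start scripts and standard Oracle/OpenJDK GC-logging flags; adjust
for your setup) is to make sure GC logging is enabled in solr.in.sh and then
look at the pause times recorded in the resulting solr_gc.log:

  GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime"

Frequent collections by themselves are not necessarily a problem; long
pauses are what actually hurt.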

We'll also need some details about these servers:

* Total index size of all Solr cores on the server.
* Total amount of memory installed in the server.
* Total number of documents contained in all Solr cores.
* How many Solr instances per server?  What is the max heap size of each
instance?

Attachments rarely make it to the list.  The screenshot you mentioned
did not make it.  You'll need to put it somewhere on the Internet and
provide a URL.  Sharing sites like Dropbox or Imgur are good choices for
image data.

Since I don't have a clear idea of what the exact issue is here, I don't
have any immediate suggestions, aside from possibly increasing your max
heap size ... but depending on the answers to the questions above, that
might make things worse.
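
For reference, if the answers do point toward a larger heap, the stock
start scripts read it from solr.in.sh -- the value below is only a
placeholder, not a recommendation:

  SOLR_HEAP="8g"
  # or, for separate min/max values:
  # SOLR_JAVA_MEM="-Xms8g -Xmx8g"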

Thanks,
Shawn



Re: High load, frequent updates, low latency requirement use case

2016-08-27 Thread Shawn Heisey
On 8/25/2016 8:51 PM, Brent P wrote:

Replies inline.  Hopefully they'll be easily visible.

> It will be writing documents at a rate of approximately 500 docs/second,
> and running search queries at about the same rate.

500 queries per second is a LOT.  You're probably going to need a lot of
replicas to handle the load.

> The documents are fairly small, with about 10 fields, most of which range
> in size from a simple int to a string that holds a UUID. There's a date
> field, and then three text fields that typically hold in the range of 350
> to 500 chars.
> Documents should be available for searching within 30 seconds of being
> added.
> We need an average search latency of 50 ms or faster.

Bottom line here -- there is no easy answer.

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

> We've been using DataStax Enterprise with decent results, but trying to
> determine if we can get more out of the latest version of Solr Cloud, as we
> originally chose DSE ~4 years ago *I believe* because its Cassandra-backed
> Solr provided redundancy/high availability features that weren't currently
> available with straight Solr (not even sure if Solr Cloud was available
> then).

SolrCloud became officially available in a stable release in Solr 4.0.0,
October 2012.  It was available in 4.0.0-ALPHA and 4.0.0-BETA some
months before that.  I'm reasonably certain that Solr had cloud before
DSE did.  I don't know which Solr release DSE is incorporating, but it's
likely a 4.x version.

A great many bugs were found and fixed over the next couple of years as
subsequent 4.x releases became available.  Solr 5.x and 6.x continued
the stability evolution.  Version 6.2 is the newest release, just
announced yesterday.

> We have 24 fairly beefy servers (96 CPU cores, 256 GB RAM, SSDs) for the
> task, and I'm trying to figure out the best way to distribute the documents
> into collections, cores, and shards.
>
> If I can categorize a document into one of 8 "types", should I create 8
> collections? Is that going to provide better performance than putting them
> all into one collection and then using a filter query with the type field
> when doing a search?
>
> What are the options/things to consider when deciding on the number of
> shards for each collection? As far as I know, I don't choose the number of
> Solr cores, that is just determined base on the replication factor (and
> shard count?).

The only answer I can give you, as mentioned by the blog post above, is
"It Depends."  How many documents are going to be in the index?  Can you
project how large the index directory would be if you indexed all those
documents into one collection with one shard?  With some rough numbers,
I can make a recommendation -- but you need to understand that it would
only be a recommendation, one that could turn out to be wrong once you
make it to production.

> Some of the settings I'm using in my solrconfig that seem important:
> <lockType>${solr.lock.type:native}</lockType>
> <autoCommit>
>   <maxTime>${solr.autoCommit.maxTime:3}</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
> <autoSoftCommit>
>   <maxTime>${solr.autoSoftCommit.maxTime:1000}</maxTime>
> </autoSoftCommit>
> <useColdSearcher>true</useColdSearcher>
> <maxWarmingSearchers>8</maxWarmingSearchers>

Right away I can tell you that one-second latency, which you have
configured in autoSoftCommit, is likely to cause BIG problems for you. 
That should be increased to the largest value you can stand to have --
several minutes would be ideal.  Since you said you need 30-second
latency, setting it to 25 seconds might be the way to go.

Repeating something Emir said:  You should set maxWarmingSearchers back
to 2.  If you increased this because of messages you saw in the log, you
should know that increasing maxWarmingSearchers is likely to make things
worse, not better.  The underlying problem -- commits taking too long --
is what you should address.
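
Expressed as solrconfig.xml settings, those two suggestions would look
roughly like this (the 25000 ms value is just the 25-second idea above):

  <autoSoftCommit>
    <!-- soft commits open a new searcher; keep them as infrequent as the
         30-second visibility requirement allows -->
    <maxTime>${solr.autoSoftCommit.maxTime:25000}</maxTime>
  </autoSoftCommit>

  <!-- back to the default; raising this masks slow commits instead of fixing them -->
  <maxWarmingSearchers>2</maxWarmingSearchers>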

> I've got the updateLog/transaction log enabled, as I think I read it's
> required for Solr Cloud.
>
> Are there any settings I should look at that affect performance
> significantly, especially outside of the solrconfig.xml for each collection
> (like jetty configs, logging properties, etc)?

Hard to say without a ton more information -- mostly about the index
size, which I already asked about.

> How much impact do the <lib> directives in the solrconfig have on
> performance? Do they only take effect if I have something configured that
> requires them, and therefore if I'm missing one that I need, I'd get an
> error if it's not defined?

Loading a jar will take a small amount of memory, but if the feature
added by that jar is never used, it is likely to have little visible impact.

My recommendation is to remove all the <lib> directives, and remove
config in your schema and solrconfig.xml that references features you
don't need.  If you find that you DO require a feature that needs one or
more jars, place them into $SOLR_HOME/lib.  They will be loaded once,
and become available to all cores, without any need to configure <lib>
directives.
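
For illustration, the kind of line to delete is the stock extraction example
that ships in the default solrconfig.xml (your paths may differ):

  <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />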

Thanks,
Shawn



Re: Default stop word list

2016-08-27 Thread Shawn Heisey
On 8/26/2016 7:13 AM, Steven White wrote:
> But what about the current "default" list that comes with Solr?  How was
> that list, for all supported languages, determined?

That list of stopwords was created from years of history with Lucene,
taking the expertise of many people and the wisdom of the Internet into
account.

> What I fear is this: when someone puts Solr into production, no one makes a
> change to that list. So if the list is not "valid", this will impact
> search; but if the list is valid, how was it determined -- just by the
> Solr / Lucene development team, or with input from linguistic experts?

The list of stopwords that come with Solr is a *starting point*.  The
person who sets Solr up should review the list and adjust it to their
needs ... or possibly remove the stopword filter entirely.
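
For anyone looking for it, the filter in question appears in the schema's
analysis chains, roughly like this (the exact file name and field type vary
by config):

  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />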

I personally think that stopword removal is more of a problem than a
solution.  In the long forgotten days of history, when computers had far
less processing power, storage, and memory than they do now ... removing
stopwords was a significant performance advantage, because it made the
indexes smaller.

With typical modern server configurations and small to medium sized
indexes, the performance benefit is minimal, and the removal can
sometimes cause significant disadvantages.

The classic example query related to stopwords (in English) is trying to
search for "to be or not to be" -- a phrase made up of words that almost
always appear in a stopword list, causing big problems.  A more relevant
example is searching an entertainment database for "the who".  That
search returns mostly irrelevant results when stopwords are removed. 
Imagine searching a music database for "the the" and not finding
anything at all relating to this band:

https://en.wikipedia.org/wiki/The_The

Thanks,
Shawn



Re: Default stop word list

2016-08-27 Thread Shawn Heisey
On 8/27/2016 12:39 PM, Shawn Heisey wrote:
> I personally think that stopword removal is more of a problem than a
> solution.

There actually is one thing that a stopword filter can do that has little
to do with the purpose it was designed for.  You can make it impossible
to search for certain words.

Imagine that your original data contains the word "frisbee" but for some
reason you do not want anybody to be able to locate results using that
word.  You can create a stopword list containing just "frisbee" and any
other variations that you want to limit like "frisbees", then place it
as a filter on the index side of your analysis.  With this in place,
searching for those terms will retrieve zero results.
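
A minimal sketch of that setup (the file name, field type name, and terms
here are only examples):

  blocked-words.txt -- one term per line:
    frisbee
    frisbees

  In the schema, apply the filter on the index side only:

  <fieldType name="text_blocked" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- terms listed in blocked-words.txt never reach the index -->
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="blocked-words.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>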

Thanks,
Shawn



Re: Solr for Multi Tenant architecture

2016-08-27 Thread John Bickerstaff
In my own work, the risk to the business if every single client cannot
access search is so great that we would never consider putting everything in
one place.  You should certainly ask that question of the business stakeholders
before you decide.

For that reason, I might recommend that each of the multiple collections
suggested above by Erick could also be on a separate SolrCloud (or single
Solr instance) so that no single failure can ever take down every tenant's
ability to search -- only those on that particular SolrCloud...

On Sat, Aug 27, 2016 at 10:36 AM, Erick Erickson 
wrote:

> There's no one right answer here. I've also seen a hybrid approach
> where there are multiple collections each of which has some
> number of tenants resident. Eventually, you need to think of some
> kind of partitioning, my rough number of documents for a single core
> is 50M (NOTE: I've seen between 10M and 300M docs fit in a core).
>
> All that said, you may also be interested in the "transient cores"
> option, see: https://cwiki.apache.org/confluence/display/solr/
> Defining+core.properties
> and the transient and transientCacheSize (this latter in solr.xml). Note
> that this is stand-alone only so you can't move that concept to
> SolrCloud if you eventually go there.
>
> Best,
> Erick
>
> On Fri, Aug 26, 2016 at 12:13 PM, Chamil Jeewantha 
> wrote:
> > Dear Solr Members,
> >
> > We are using SolrCloud as the search provider of a multi-tenant cloud
> based
> > application. We have one schema for all the tenants. The indexes will
> have
> > large number(millions) of documents.
> >
> > As of our research, we have two options,
> >
> >- One large collection for all the tenants and use Composite-ID
> routing
> >- Collection per tenant
> >
> > The below mail says,
> >
> >
> > https://mail-archives.apache.org/mod_mbox/lucene-solr-user/
> 201403.mbox/%3c5324cd4b.2020...@protulae.com%3E
> >
> > SolrCloud is *more scalable in terms of index size*. Plus you get
> > redundancy which can't be underestimated in a hosted solution.
> >
> >
> > AND
> >
> > The issue is management. 1000s of cores/collections require a level of
> > automation. On the other hand, having a single core/collection means if
> > you make one change to the schema or solrconfig, it affects everyone.
> >
> >
> > Based on the above facts we think One large collection will be the way to
> > go.
> >
> > Questions:
> >
> >1. Is that the right way to go?
> >2. Will it be a hassle when we need to do reindexing?
> >3. What is the chance of entire collection crash? (in that case all
> >tenants will be affected and reindexing will be painful.
> >
> > Thank you in advance for your kind opinion.
> >
> > Best Regards,
> > Chamil
> >
> > --
> > http://kavimalla.blgospot.com
> > http://kdchamil.blogspot.com
>


Re: Solr for Multi Tenant architecture

2016-08-27 Thread Chamil Jeewantha
Thank you everyone for your great support.

I will update you with our final approach.

Best regards,
Chamil

On Aug 28, 2016 01:34, "John Bickerstaff"  wrote:

> In my own work, the risk to the business if every single client cannot
> access search is so great, we would never consider putting everything in
> one.  You should certainly ask that question of the business stakeholders
> before you decide.
>
> For that reason, I might recommend that each of the multiple collections
> suggested above by Erick could also be on a separate SolrCloud (or single
> Solr instance) so that no single failure can ever take down every tenant's
> ability to search -- only those on that particular SolrCloud...
>
> On Sat, Aug 27, 2016 at 10:36 AM, Erick Erickson 
> wrote:
>
> > There's no one right answer here. I've also seen a hybrid approach
> > where there are multiple collections each of which has some
> > number of tenants resident. Eventually, you need to think of some
> > kind of partitioning, my rough number of documents for a single core
> > is 50M (NOTE: I've seen between 10M and 300M docs fit in a core).
> >
> > All that said, you may also be interested in the "transient cores"
> > option, see: https://cwiki.apache.org/confluence/display/solr/
> > Defining+core.properties
> > and the transient and transientCacheSize (this latter in solr.xml). Note
> > that this is stand-alone only so you can't move that concept to
> > SolrCloud if you eventually go there.
> >
> > Best,
> > Erick
> >
> > On Fri, Aug 26, 2016 at 12:13 PM, Chamil Jeewantha 
> > wrote:
> > > Dear Solr Members,
> > >
> > > We are using SolrCloud as the search provider of a multi-tenant cloud
> > based
> > > application. We have one schema for all the tenants. The indexes will
> > have
> > > large number(millions) of documents.
> > >
> > > As of our research, we have two options,
> > >
> > >- One large collection for all the tenants and use Composite-ID
> > routing
> > >- Collection per tenant
> > >
> > > The below mail says,
> > >
> > >
> > > https://mail-archives.apache.org/mod_mbox/lucene-solr-user/
> > 201403.mbox/%3c5324cd4b.2020...@protulae.com%3E
> > >
> > > SolrCloud is *more scalable in terms of index size*. Plus you get
> > > redundancy which can't be underestimated in a hosted solution.
> > >
> > >
> > > AND
> > >
> > > The issue is management. 1000s of cores/collections require a level of
> > > automation. On the other hand, having a single core/collection means if
> > > you make one change to the schema or solrconfig, it affects everyone.
> > >
> > >
> > > Based on the above facts we think One large collection will be the way
> to
> > > go.
> > >
> > > Questions:
> > >
> > >1. Is that the right way to go?
> > >2. Will it be a hassle when we need to do reindexing?
> > >3. What is the chance of entire collection crash? (in that case all
> > >tenants will be affected and reindexing will be painful.
> > >
> > > Thank you in advance for your kind opinion.
> > >
> > > Best Regards,
> > > Chamil
> > >
> > > --
> > > http://kavimalla.blgospot.com
> > > http://kdchamil.blogspot.com
> >
>