Solrcloud 6.6 becomes nuts

2020-05-17 Thread Dominique Bejean
Hi,

I have a six-node SolrCloud cluster whose six nodes all suddenly failed
with OOM at the same time.
This can happen even when the cluster is not under heavy load and there
is no indexing.

I do not see any reason for this to happen. Here is a description of the
issue. Thank you for your suggestions and advice.


One or two hours before the nodes stop with OOM, we see this scenario on
all six nodes within the same five-minute time frame:
* slightly more frequent young GCs: from one per second (duration < 0.05 s)
to one every two or three seconds (duration < 0.15 s)
* full GCs start occurring every 5 s with 0 bytes reclaimed
* young GCs start reclaiming fewer bytes
* long full GCs start reclaiming bytes, but less and less each time
* then no more young GCs
Here are the GC graphs: https://www.eolya.fr/solr_issue_gc.png


Just before the problem occurs:
* there is no increase in requests per second
* no updates/commits/merges
* CPU usage and load are low
* disk I/O is low
After the problem starts, requests become longer and longer, but there is
still no increase in CPU usage or disk I/O.


During the last occurrence, we dumped the threads on one node just before
the OOM, but unfortunately more than one hour after the problem started.
85% of the threads (more than 3000) were BLOCKED and related to log4j;
Solr was either trying to log a slow query or to log a problem in a
request handler:
at org.apache.solr.common.SolrException.log(SolrException.java:148)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:204)

This high count of BLOCKED threads is more a consequence than a cause. We
will dump threads each minute until the next occurrence (a sketch of such
a periodic dump follows).
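
A minimal sketch of what such a periodic dump could look like, assuming
remote JMX access to the Solr JVM (the JMX URL and port below are
placeholders, not our actual setup or script):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch only: connect to the Solr JVM over JMX and write one thread dump
// per minute to a timestamped file, including each thread's state (e.g. BLOCKED).
public class PeriodicThreadDump {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:18983/jmxrmi"); // placeholder JMX port
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            while (true) {
                StringBuilder dump = new StringBuilder();
                for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                    dump.append(info.getThreadName())
                        .append(" state=").append(info.getThreadState()).append('\n');
                    for (StackTraceElement frame : info.getStackTrace()) {
                        dump.append("    at ").append(frame).append('\n');
                    }
                    dump.append('\n');
                }
                Files.write(Paths.get("threaddump-" + System.currentTimeMillis() + ".txt"),
                    dump.toString().getBytes(StandardCharsets.UTF_8));
                Thread.sleep(60_000L); // one dump per minute
            }
        }
    }
}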


About the Solr environment:
* Solr 6.6
* Oracle Java 1.8.0_112 (25.112-b15)

* 1 collection with 10 million small documents
* 3 shards x 2 replicas
* 3.5 million docs per core
* 90 GB index size per core

* Servers with 6 processors and 90 GB of RAM each
* Swappiness set to 1, nearly no swap used
* 4 GB heap, with usage between 25% and 60% before young GCs, and one full
GC (3 seconds) every 15 to 30 minutes when all is fine

* Default JVM settings with the CMS GC
* JMX enabled
* Peak average requests per second on one core: 170, but during the last
issue the average requests per second was only 30!
* Average time per request: < 30 ms

About updates:
* Very few adds/updates in general
* Some deleteByQuery calls (nearly 2000 per day), but none just before the
problem occurs
* autocommit maxTime: 15000 ms

About queries:
* Queries are standard queries or suggesters
* Queries generate facets, but there are no fields with a very high number
of unique values
* No grouping
* Heavy use of function queries for relevance computation


Thank you.

Dominique


RE: Filtering large amount of values

2020-05-17 Thread Rudenko, Artur
Hi Mikhail,

Thank you for the help; with your suggestion we actually managed to improve
the results.

We now get and store the docValues in this method instead of inside the
collect() method:

@Override
protected void doSetNextReader(LeafReaderContext context) throws IOException {
    super.doSetNextReader(context);
    sortedDocValues = DocValues.getSorted(context.reader(),
        FileFilterPostQuery.this.metaField);
}
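
For reference, here is a self-contained sketch of this per-segment pattern,
going one step further and pre-resolving the filter values to segment-local
ordinals so that collect() only checks a bit set. Class and field names are
illustrative only (this is not our actual code), and we have not benchmarked
this variant:

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.FixedBitSet;
import org.apache.solr.search.DelegatingCollector;

// Illustrative sketch: resolve the allowed values to ordinals once per
// segment, so the per-document work in collect() is a single bit-set lookup.
class OrdinalFilterCollector extends DelegatingCollector {
    private final String metaField;            // filtered field (docValues enabled)
    private final Set<BytesRef> allowedValues; // the large set of filter values
    private SortedDocValues sortedDocValues;
    private FixedBitSet allowedOrds;           // ordinals allowed in the current segment

    OrdinalFilterCollector(String metaField, Set<BytesRef> allowedValues) {
        this.metaField = metaField;
        this.allowedValues = allowedValues;
    }

    @Override
    protected void doSetNextReader(LeafReaderContext context) throws IOException {
        super.doSetNextReader(context);
        sortedDocValues = DocValues.getSorted(context.reader(), metaField);
        allowedOrds = new FixedBitSet(sortedDocValues.getValueCount());
        for (BytesRef value : allowedValues) {
            int ord = sortedDocValues.lookupTerm(value); // term-dictionary lookup, < 0 if absent
            if (ord >= 0) {
                allowedOrds.set(ord);
            }
        }
    }

    @Override
    public void collect(int doc) throws IOException {
        if (sortedDocValues.advanceExact(doc) && allowedOrds.get(sortedDocValues.ordValue())) {
            super.collect(doc);
        }
    }
}

The per-segment setup cost grows with the number of filter values, so whether
the ordinal pre-resolution actually pays off likely depends on how many
documents each segment collects.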

We see a big improvement. Is this the most efficient way?
Since it's a post filter, we have to return "false" from the getCache()
method. Is there a way to implement it with caching?

Thanks,
Artur Rudenko

-Original Message-
From: Mikhail Khludnev 
Sent: Thursday, May 14, 2020 2:57 PM
To: solr-user 
Subject: Re: Filtering large amount of values

Hi, Artur.

Please don't tell me that you obtain docValues for every doc? That is deadly
slow; see https://issues.apache.org/jira/browse/LUCENE-9328 for a related problem.
Make sure you obtain them once per segment, when the leaf reader is injected.
Recently some new method(s) were added for {!terms}; I wonder if any of them
might solve the problem.

On Thu, May 14, 2020 at 2:36 PM Rudenko, Artur 
wrote:

> Hi,
> We have a requirement to implement a boolean filter with up to 500k
> values.
>
> We took the approach of a post filter.
>
> Our environment has 7 servers with 128 GB RAM and 64 CPUs each. We
> have 20-40m very large documents. Each Solr instance has 64 shards
> with 2 replicas, and the JVM memory (Xms and Xmx) is set to 31 GB.
>
> We are seeing that using a single post filter with 1000 values on 20m
> documents takes about 4.5 seconds.
>
> Logic in our collect() method:
> numericDocValues =
>     reader.getNumericDocValues(FileFilterPostQuery.this.metaField);
>
> if (numericDocValues != null &&
>         numericDocValues.advanceExact(docNumber)) {
>     longVal = numericDocValues.longValue();
> } else {
>     return;
> }
> }
>
> if (numericValuesSet.contains(longVal)) {
>     super.collect(docNumber);
> }
>
>
> Is it the best we can get?
>
>
> Thanks,
> Artur Rudenko


--
Sincerely yours
Mikhail Khludnev




Re: Filtering large amount of values

2020-05-17 Thread Mikhail Khludnev
On Sun, May 17, 2020 at 4:57 PM Rudenko, Artur 
wrote:

> Hi Mikhail,
>
> Thank you for the help; with your suggestion we actually managed to improve
> the results.
>
> We now get and store the docValues in this method instead of inside the
> collect() method:
>
> @Override
> protected void doSetNextReader(LeafReaderContext context) throws
>         IOException {
>     super.doSetNextReader(context);
>     sortedDocValues = DocValues.getSorted(context.reader(),
>         FileFilterPostQuery.this.metaField);
> }
>
> We see a big improvement. Is this the most efficient way?
>
Who knows...

> Since it's a post filter, we have to return "false" from the getCache()
> method. Is there a way to implement it with caching?
>
If getCache() returns true, this query will be used as a standalone query,
ignoring the filterCollector. In that case the retrieved docs will be cached.


> Thanks,
> Artur Rudenko
>


-- 
Sincerely yours
Mikhail Khludnev


Re: Solrcloud 6.6 becomes nuts

2020-05-17 Thread Mikhail Khludnev
Hello, Dominique.
What did it log? Which exception?
Do you have a chance to review a heap dump? What consumed the whole heap?

On Sun, May 17, 2020 at 11:05 AM Dominique Bejean 
wrote:


-- 
Sincerely yours
Mikhail Khludnev


Re: Rule-Based Auth - update not working

2020-05-17 Thread Jason Gerlowski
Hi Isabelle,

Two things to keep in mind with Solr's Rule-Based Authorization.

1. Each request is controlled by the first permission that matches
the request.
2. With the permissions you have present, Solr will check them in
descending list order.  (This isn't always true - collection-specific
and path-specific permissions are given precedence, so you don't need
to consider that.)

As you can imagine given the rules above, permission order is very
important.  In your case the "all" rule will match pretty much all
requests, which explains why the "indexing" user can't actually index.
Generally speaking, it's best to put the most specific rules first,
with the broader ones coming later (see the example below).
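
For example - based on the permissions you posted, untested, and with the
"index" values shown only for readability - reordering so that the catch-all
"all" rule comes last might look like this:

"permissions": [
  { "name": "admin-luke", "collection": "*", "path": "/admin/luke", "role": "luke", "index": 1 },
  { "name": "read", "role": "searching", "index": 2 },
  { "name": "update", "role": "indexing", "index": 3 },
  { "name": "all", "role": "admin", "index": 4 }
]

With that order, an update request from a user with the "indexing" role is
matched by the "update" permission before the "all" rule is ever consulted.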

For more information, see the "Permission Ordering and Resolution"
section in the page you linked to in your request.

Good luck, hope that helps.

Jason

On Tue, May 12, 2020 at 12:34 PM Isabelle Giguere
 wrote:
>
> Hi;
>
> I'm using Solr 8.5.0.
>
> I'm having trouble setting up some permissions using the rule-based 
> authorization plugin: 
> https://lucene.apache.org/solr/guide/8_5/rule-based-authorization-plugin.html
>
> I have 3 users: "admin", "search", and "indexer".
>
> I have set permissions and user roles:
> "permissions": [  {  "name": "all", "role": "admin", "index": 1  },
>   { "name": "admin-luke", "collection": "*", "role": "luke", "index": 2, 
> "path": "/admin/luke"  },
>   { "name": "read", "role": "searching", "index": 3  },
>   {  "name": "update", "role": "indexing", "index": 4 }],
> "user-role": {  "admin": "admin",
>   "search": ["searching","luke"],
>   "indexer": "indexing"   }  }
> Attached: full output of GET /admin/authorization
>
> So why can't user "indexer" add anything in a collection?  I always get HTTP 
> 403 Forbidden.
> Using Postman, I click the checkbox to show the password, so I'm sure I typed 
> the right one.
>
> Note that user "search" can't use the /select handler either, as should be 
> the case with permission to "read".   This user can, however, use the Luke 
> handler, as the custom permission allows.
>
> User "admin" can use any API.  So at least the predefined permission "all" 
> does work.
>
> Note that the collections were created before enabling authentication and 
> authorization.  Could that be the cause of the permission issues?
>
> Thanks;
>
> Isabelle Giguère
> Computational Linguist & Java Developer
> Linguiste informaticienne & développeur java
>
>


Re: Rule-Based Auth - update not working

2020-05-17 Thread Jason Gerlowski
One slight correction: I missed that you actually do have a
path/collection-specific permission in your list there.  So Solr will
check the permissions in descending list order for most requests - the
exception being /luke requests, where the /luke permission filters to
the top and is checked first.

We should really change this resolution order to something more intuitive.

Jason

On Sun, May 17, 2020 at 2:52 PM Jason Gerlowski  wrote:


Re: Solrcloud 6.6 becomes nuts

2020-05-17 Thread Shawn Heisey

On 5/17/2020 2:05 AM, Dominique Bejean wrote:

One or two hours before the nodes stop with OOM, we see this scenario on
all six nodes within the same five-minute time frame:
* slightly more frequent young GCs: from one per second (duration < 0.05 s)
to one every two or three seconds (duration < 0.15 s)
* full GCs start occurring every 5 s with 0 bytes reclaimed
* young GCs start reclaiming fewer bytes
* long full GCs start reclaiming bytes, but less and less each time
* then no more young GCs
Here are the GC graphs: https://www.eolya.fr/solr_issue_gc.png


Do you have the OutOfMemoryError in the Solr log?  From the graph 
you provided, it does look likely that the OOME was due to heap memory; 
I'd just like to be sure by seeing the logged exception.


Between 15:00 and 15:30, something happened which suddenly required 
additional heap memory.  Do you have any idea what that was?  If you can 
zoom in on the graph, you could get a more accurate time for this.  I am 
looking specifically at the "heap usage before GC" graph.  The "heap 
usage after GC" graph that gceasy makes, which has not been included 
here, is potentially more useful.


I found that I most frequently ran into memory problems when I executed 
a data mining query -- doing facets or grouping on a high cardinality 
field, for example.  Those kinds of queries required a LOT of extra memory.


If the servers have any memory left, you might need to increase the max 
heap beyond where it currently sits.  To handle your indexes and 
queries, Solr may simply require more memory than you have allowed.


Thanks,
Shawn


Re: Solrcloud 6.6 becomes nuts

2020-05-17 Thread Dominique Bejean
Mikhail,


Thank you for your response.


--- For the logs

On the non-leader replicas, there are no errors in the log, only WARNs due
to slow queries.

On the leader replicas, there are these errors:

* Twice per minute during the whole day before the problem starts, and also
after the problem starts:
RequestHandlerBase org.apache.solr.common.SolrException: Collection: xx
not found
where xx is the alias name pointing to the collection

* Just after the problem starts:
2020-05-13 15:24:41.450 ERROR (qtp1682092198-315202) [c:xx_2 s:shard3
r:core_node1 x:xx_2_shard3_replica0] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request:[
http://XX127:8983/solr/xx_2_shard1_replica1,
http://XX132:8983/solr/xx_2_shard2_replica0]
2020-05-13 15:24:41.451 ERROR (qtp1682092198-315202) [c:xx_2 s:shard3
r:core_node1 x:xx_2_shard3_replica0] o.a.s.s.HttpSolrCall
null:org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request:[
http://XX127:8983/solr/xx_2_shard1_replica1,
http://XX132:8983/solr/xx_2_shard2_replica0]

2020-05-13 15:25:49.642 ERROR (qtp1682092198-315193) [c:xx_2 s:shard3
r:core_node1 x:xx_2_shard3_replica0] o.a.s.s.HttpSolrCall
null:java.io.IOException: java.util.concurrent.TimeoutException: Idle
timeout expired: 51815/5 ms

and later, until the JVM hangs:
2020-05-13 15:58:54.397 ERROR (qtp1682092198-316314) [c:xx_2 s:shard3
r:core_node1 x:xx_2_shard3_replica0] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException: no servers hosting shard:
xx_2_shard2

No OOM errors in the Solr logs, just the OOM killer script's log:
Running OOM killer script for process 4488 for Solr on port 8983
Killed process 4488


--- For the heap dump

I have a dump for one shard leader, taken just before the OOM script killed
the JVM but more than one hour after the problem started. I will take a look.

Regards.

Dominique

On Sun, May 17, 2020 at 20:22, Mikhail Khludnev  wrote:


Re: Solrcloud 6.6 becomes nuts

2020-05-17 Thread Dominique Bejean
Hi Shawn,

There is no OOM error in the logs. I gave more details in my response to Mikhail.

The problem starts with full GCs near 15:20, but young GC behavior changed a
little starting at 15:10.
Here is the heap usage before and after GC during this period:
https://www.eolya.fr/solr_issue_heap_before_after.png

There is no grouping, but there is faceting.
The collection contains 10,000,000 documents.

Two fields contain 60,000 and 750,000 unique values respectively.

These two fields were used in queries for faceting 1 to 10 times per hour
before the problem starts.
They are used a lot during the 20 minutes in which the problem starts:
* 50 times for the field with 750,000 unique values
* 250 times for the field with 60,000 unique values

The hit counts for these queries are mainly under 10, and a couple of times
between 100 and 1000.
Once the hit count was 2000, for the field with 60,000 unique values.

On the other hand, these queries are very long.

We will investigate this!

I was not expecting that queries faceting on fields with a high number of
unique values but with low hit counts could be the origin of this problem.


Regards,

Dominique

On Sun, May 17, 2020 at 21:45, Shawn Heisey  wrote:



Re: Solrcloud 6.6 becomes nuts

2020-05-17 Thread Shawn Heisey

On 5/17/2020 4:18 PM, Dominique Bejean wrote:

I was not expecting that queries faceting on fields with a high number of
unique values but with low hit counts could be the origin of this problem.


Performance for most things does not depend on numFound (hit count) or 
the rows parameter.  The number of terms in the field and the total 
number of documents in the index matter a lot more.


If you do facets or grouping on a field with 750K unique terms, it's 
going to be very slow and require a LOT of memory.  I would not be 
surprised to see it require more than 4GB.  These features are designed 
to work best with fields that have a relatively small number of possible 
values.


Thanks,
Shawn