RE: Possible performance bug - JSON facet - numBuckets:true

2020-03-15 Thread Rudenko, Artur
Update:

I started working on a fix for this issue and found that the "numBuckets"
result in the original implementation is not accurate:

Query using my fix, with limit:-1:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":31,
    "params":{
      "q":"*:*",
      "json.facet":"{\"Chart_01_Bins\":{type:terms, field:date, mincount:1, limit:-1, numBuckets:true, missing:false, refine:true }}",
      "rows":"0"}},
  "response":{"numFound":170500,"start":0,"maxScore":1.0,"docs":[]},
  "facets":{
    "count":170500,
    "Chart_01_Bins":{
      "numBuckets":2660,
      "buckets":[
        {"val":"2019-01-16T15:17:03Z","count":749},
        {"val":"2019-01-23T21:46:44Z","count":742},
        {"val":"2019-01-04T11:06:22Z","count":603},
        {"val":"2019-01-08T01:08:58Z","count":484},
        ...
        {"val":"2019-01-26T06:30:33Z","count":3}]}}}


Query with a high limit that should include all buckets, based on the current
Solr implementation:
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":29,
    "params":{
      "q":"*:*",
      "json.facet":"{\"Chart_01_Bins\":{type:terms, field:date, mincount:1, limit:5000, numBuckets:true, missing:false, refine:true }}",
      "rows":"0"}},
  "response":{"numFound":170500,"start":0,"maxScore":1.0,"docs":[]},
  "facets":{
    "count":170500,
    "Chart_01_Bins":{
      "numBuckets":2671,
      "buckets":[
        {"val":"2019-01-16T15:17:03Z","count":749},
        {"val":"2019-01-23T21:46:44Z","count":742},
        {"val":"2019-01-04T11:06:22Z","count":603},
        {"val":"2019-01-08T01:08:58Z","count":484},
        ...
        {"val":"2019-01-26T06:30:33Z","count":3}]}}}

There are 2660 buckets (which is the result of my fix), while the original Solr
implementation claims there are 2671 buckets (11 more).
The results of both queries were compared with a diff tool; apart from QTime,
the different limit value, and the numBuckets value, everything was identical
(I decided not to paste the full buckets response here, but both queries
returned the same 2660 buckets, not 2671).
I also could not find anything in the docs saying that "numBuckets" is an
estimate. For low-cardinality fields, the result was accurate.

Is this the expected behavior?
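[Editorial note: a toy sketch, not Solr's actual code, of why a distributed numBuckets can differ from the true count. Each shard ships only its top-N buckets, so a merge that works from per-shard distinct counts can double-count terms that appear on several shards; the shard names and term counts below are invented for illustration.]

```python
from collections import Counter

# Hypothetical per-shard term counts for the faceted field.
shard1 = Counter({"a": 5, "b": 4, "c": 1})
shard2 = Counter({"b": 3, "c": 2, "d": 1})

# Exact answer: the size of the union of distinct terms across shards.
exact = len(set(shard1) | set(shard2))

# A naive merge that sums each shard's own distinct-bucket count
# overcounts terms "b" and "c", which occur on both shards.
naive_sum = len(shard1) + len(shard2)

print(exact, naive_sum)  # exact is 4, the naive sum is 6
```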


Artur Rudenko

-Original Message-
From: Mikhail Khludnev 
Sent: Tuesday, March 10, 2020 8:46 AM
To: solr-user 
Subject: Re: Possible performance bug - JSON facet - numBuckets:true

Hello, Artur.

Thanks for your interest.
Perhaps we can amend the docs to mention this effect. In the long term it can
be optimized by adding a proper condition. Patches for both are welcome.

On Wed, Feb 12, 2020 at 10:48 PM Rudenko, Artur 
wrote:

> Hello everyone,
> I am currently investigating a performance issue in our environment
> and it looks like we found a performance bug.
> Our environment:
> 20M large PARENT documents and 800M nested small CHILD documents.
> The system inserts about 400K PARENT documents and 16M CHILD documents
> per day. (We have currently stopped inserting calls while investigating
> the performance issue.) This is a SolrCloud 8.3 environment with 7 servers
> (64 vCPU, 128 GB RAM each, 24 GB allocated to Solr) with a single
> collection (32 shards and replication factor 2).
>
> The below query runs in about 14-16 seconds (we have to use limit:-1
> due to a business case - cardinality is 1K values).
>
> fq=channel:345133
> &fq=content_type:PARENT
> &fq=Meta_is_organizationIds:(344996998 344594999 34501 total
> of int 562 values)
> &q=*:*
> &json.facet={
> "Chart_01_Bins":{
> type:terms,
> field:groupIds,
> mincount:1,
> limit:-1,
> numBuckets:true,
> missing:false,
> refine:true,
> facet:{
>
> min_score_avg:"avg(min_score)",
>
> max_score_avg:"avg(max_score)",
>
> avg_score_avg:"avg(avg_score)"
> }
> },
> "Chart_01_FIELD_NOT_EXISTS":{
> type:query,
> q:"-groupIds:[* TO *]",
> facet:{
>
> min_score_avg:"avg(min_score)",
>
> max_score_avg:"avg(max_score)",
>
> avg_score_avg:"avg(avg_score)"
> }
> }
> }
> &rows=0
>
> Also, when the facet is simplified, it takes about 4-6 seconds
>
> fq=channel:345133
> &fq=content_type:PARENT
> &fq=Meta_is_organizationIds:(344996998 344594999 34501 total
> of int 562 values)
> &q=*:*
> &

Copying data

2020-03-15 Thread Jayadevan Maymala
Hi all,

I have a 3 node Solr cluster in production (with zoo keeper). In dev, I
have one node Solr instance, no zoo keeper. Which is the best way to copy
over the production solr data to dev?
Operating system is CentOS 7.7, Solr Version 7.3
Collection size is in the 40-50 GB range.

Regards,
Jayadevan


Re: Query is taking a time in Solr 6.1.0

2020-03-15 Thread vishal patel
How can I tokenize differently?

Sent from Outlook

From: Erik Hatcher 
Sent: Friday, March 13, 2020 6:20 PM
To: solr-user@lucene.apache.org 
Subject: Re: Query is taking a time in Solr 6.1.0

Looks like you have two, maybe three, wildcard/prefix clauses in there.
Consider tokenizing differently so you can optimize the queries to not need
wildcards - that's my first observation and suggestion.
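[Editorial note: one common way to make prefix searches match without wildcards is an n-gram analyzed copy of the field. The sketch below is illustrative, not from this thread: the field-type name and gram sizes are assumptions, and a truly infix pattern like *n205* would need solr.NGramFilterFactory rather than edge n-grams.]

```xml
<!-- Index-time edge n-grams so a query token like "n20" matches
     "n205..." directly, with no wildcard needed at query time. -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-gramming at query time: the raw query token is matched
         against the grams produced at index time. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The trade-off is a larger index in exchange for cheaper queries, since wildcard clauses expand to many terms at query time.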

Erik

> On Mar 13, 2020, at 05:56, vishal patel  wrote:
>
> Some query is taking time in Solr 6.1.0.
>
> 2020-03-12 11:05:36.752 INFO  (qtp1239731077-2513155) [c:documents s:shard1 
> r:core_node1 x:documents] o.a.s.c.S.Request [documents]  webapp=/solr 
> path=/select 
> params={df=summary&distrib=false&fl=id&shards.purpose=4&start=0&fsv=true&sort=doc_ref+asc,id+desc&fq=&shard.url=s3.test.com:8983/solr/documents|s3r1.test.com:8983/solr/documents&rows=250&version=2&q=(doc_ref:((*n205*)+))+AND+(title:((*Distribution\+Board\+Schedule*)+))+AND+project_id:(2104616)+AND+is_active:true+AND+((isLatest:(true)+AND+isFolderActive:true+AND+isXref:false+AND+-document_type_id:(3+7)+AND+((is_public:true+OR+distribution_list:7249777+OR+folderadmin_list:7249777+OR+author_user_id:7249777)+AND+(((allowedUsers:(7249777)+OR+allowedRoles:(6368666)+OR+combinationUsers:(7249777))+AND+-blockedUsers:(7249777))+OR+(defaultAccess:(true)+AND+-blockedUsers:(7249777)+AND+-blockedRoles:(6368666)+OR+(isLatestRevPrivate:(true)+AND+allowedUsersForPvtRev:(7249777)+AND+-folderadmin_list:(7249777)))&shards.tolerant=true&NOW=1584011129462&isShard=true&wt=javabin}
>  hits=0 status=0 QTime=7276.
>
> Is there any way to reduce the query execution time (7276 ms)?