[ 
https://issues.apache.org/jira/browse/SOLR-14518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118727#comment-17118727
 ] 

Joel Bernstein edited comment on SOLR-14518 at 5/28/20, 2:22 PM:
-----------------------------------------------------------------

[~mkhl], as I dug deeper into the *unique* implementation I found two things:

1) When you co-locate group records on the same shard, unique produces accurate 
counts.

2) The number of unique term values looked up and sent to be merged is capped 
at 100 per bucket. So the hit for the merging logic is not as large as I 
anticipated. 

So *unique*, as is, produces correct counts and is decently optimized when group 
records are co-located.
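
To make the co-location point concrete, here is a toy sketch in plain Python. It is not Solr's merge code, and the shard contents and function names are made up for illustration. It only shows the set arithmetic: when every product_group_id lives on exactly one shard, the per-shard unique counts add up to the global unique count, and when a group is split across shards the sum overcounts.

```python
def summed_unique(shards):
    """Sum of the per-shard distinct counts (no cross-shard merging)."""
    return sum(len(set(values)) for values in shards)


def exact_unique(shards):
    """Distinct count over the union of all shard values."""
    return len(set().union(*shards))


# Hypothetical data: each group id appears on exactly one shard.
co_located = [["g1", "g1", "g2"], ["g3", "g3"], ["g4"]]
# Hypothetical data: group g1 is split across two shards.
scattered = [["g1", "g2"], ["g1", "g3"], ["g4"]]

print(summed_unique(co_located), exact_unique(co_located))  # 4 4
print(summed_unique(scattered), exact_unique(scattered))    # 5 4
```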

Given that, I think I'll close out this ticket.

I did want to bring up something I found during testing. I queried a sharded 
e-commerce index in two ways to produce a multi-select faceting experience:

*Approach 1:*

*collapse* on *product_group_id*, exclude the collapse filter in the facet 
domain, and then compute unique(product_group_id).
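
A request along these lines might look roughly like the sketch below, written as a Python dict for the JSON Request API. The query text, the "color" facet field, and the "collapse" tag name are assumptions for illustration, not the exact request used in testing, and the tagging of the collapse filter should be verified against your Solr version.

```python
import json

# Hypothetical names: query text, "color" facet field, and the "collapse" tag.
approach_1 = {
    "query": "shirt",
    # Collapse to one document per product group; the tag lets facets exclude it.
    "filter": ["{!collapse tag=collapse field=product_group_id}"],
    "facet": {
        "colors": {
            "type": "terms",
            "field": "color",
            # Facet over the uncollapsed documents.
            "domain": {"excludeTags": "collapse"},
            # Count distinct product groups per bucket.
            "facet": {"groups": "unique(product_group_id)"},
        }
    },
}

print(json.dumps(approach_1, indent=2))
```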

*Approach 2:*

A parent block join over the same blocks that Approach 1 collapses on, switch 
to the child domain in the facets, and then compute uniqueBlock(_root_).
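
The block join variant could be sketched the same way. Again, this is only an illustration: the "doc_type:group" parent filter, the child query, and the "color" field are hypothetical, and a real multi-select setup would likely need extra filters in the facet domain to restrict the children.

```python
import json

# Hypothetical names: "doc_type:group" marks parent docs, "color" is a child field.
approach_2 = {
    # Return parent (group) documents whose children match the child query.
    "query": '{!parent which="doc_type:group"}name:shirt',
    "facet": {
        "colors": {
            "type": "terms",
            "field": "color",
            # Switch the facet domain from the matched parents to their children.
            "domain": {"blockChildren": "doc_type:group"},
            # Count distinct blocks (one per parent) in each bucket.
            "facet": {"groups": "uniqueBlock(_root_)"},
        }
    },
}

print(json.dumps(approach_2, indent=2))
```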

These approaches produce basically the same result set, which makes sense.

But what surprised me was that in a sharded environment Approach 1 was just as 
fast as Approach 2 under load.

I would have expected the block join approach to be faster under load because 
of the data locality advantages of the block join. I'm wondering if it's worth 
investigating why it's not faster.


> Add support for partitioned unique agg to JSON facets
> -----------------------------------------------------
>
>                 Key: SOLR-14518
>                 URL: https://issues.apache.org/jira/browse/SOLR-14518
>             Project: Solr
>          Issue Type: New Feature
>          Components: Facet Module
>            Reporter: Joel Bernstein
>            Priority: Major
>
> There are scenarios where documents are partitioned across shards based on 
> the same field that the *unique* agg is applied to with JSON facets. In this 
> scenario, exact unique counts can be calculated by simply sending the 
> bucket-level unique counts to the aggregator, where they can be summed. The 
> suggested syntax is to add a boolean flag to the unique aggregation function: 
> *unique*(partitioned_field, true).
> The *true* value turns on the "partitioned" unique logic. The default is 
> false.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
