[ https://issues.apache.org/jira/browse/SOLR-14518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118727#comment-17118727 ]
Joel Bernstein edited comment on SOLR-14518 at 5/28/20, 2:39 PM:
-----------------------------------------------------------------

[~mkhl],

As I dug deeper into the *unique* implementation I found two things:

1) When you co-locate group records on the same shard, *unique* produces accurate counts.
2) The number of unique term values looked up and sent to be merged is capped at 100 per bucket, so the cost of the merging logic is not as large as I anticipated.

So *unique* as is produces correct counts and is decently optimized when group records are co-located. I think I'll close out this ticket.

I wanted to bring up something I found during testing. I tested querying a sharded e-commerce index two ways to produce a multi-select facet e-commerce experience:

*Approach 1:* *collapse* on *product_group_id*, exclude the collapse in the facet domain, and then unique(product_group_id).

*Approach 2:* parent block join with the same blocks used for the Approach 1 collapse, change to the child domain in facets, and then uniqueBlock(_root_).

These approaches produce basically the same result set, which makes sense. But what surprised me was that in a sharded environment Approach 1 was just as fast as Approach 2 under load. I would have expected the block join approach to be faster under load because of the data locality advantages of the block join. I'm wondering if it's worth investigating why it's not faster.

> Add support for partitioned unique agg to JSON facets
> -----------------------------------------------------
>
>                 Key: SOLR-14518
>                 URL: https://issues.apache.org/jira/browse/SOLR-14518
>             Project: Solr
>          Issue Type: New Feature
>          Components: Facet Module
>            Reporter: Joel Bernstein
>            Priority: Major
>
> There are scenarios where documents are partitioned across shards based on
> the same field that the *unique* agg is applied to with JSON facets. In this
> scenario exact unique counts can be calculated by simply sending the bucket
> level unique counts to the aggregator where they can be summed. Suggested
> syntax is to add a boolean flag to the unique aggregation function:
> *unique*(partitioned_field, true).
> The *true* value turns on the "partitioned" unique logic. The default is
> false.
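
The partitioned case described in the issue can be illustrated with a small simulation. This is only a sketch in plain Python, not Solr code: the function names, the `color` bucket field, and the sample documents are all hypothetical, and the 100-values-per-bucket cap mentioned in the comment is not modeled. It compares unioning per-shard value sets at the aggregator with simply summing per-shard bucket counts.

```python
from collections import defaultdict


def shard_unique_sets(shard_docs, bucket_field, unique_field):
    """Per-shard step: collect the distinct unique_field values in each bucket."""
    buckets = defaultdict(set)
    for doc in shard_docs:
        buckets[doc[bucket_field]].add(doc[unique_field])
    return buckets


def merged_unique_counts(shards, bucket_field, unique_field):
    """Exact in general: ship value sets to the aggregator and union them."""
    merged = defaultdict(set)
    for shard_docs in shards:
        per_shard = shard_unique_sets(shard_docs, bucket_field, unique_field)
        for bucket, values in per_shard.items():
            merged[bucket] |= values
    return {bucket: len(values) for bucket, values in merged.items()}


def summed_unique_counts(shards, bucket_field, unique_field):
    """Partitioned shortcut: ship only per-bucket counts and sum them."""
    totals = defaultdict(int)
    for shard_docs in shards:
        per_shard = shard_unique_sets(shard_docs, bucket_field, unique_field)
        for bucket, values in per_shard.items():
            totals[bucket] += len(values)
    return dict(totals)


# Hypothetical data: every product_group_id lives on exactly one shard.
partitioned = [
    [
        {"product_group_id": "g1", "color": "red"},
        {"product_group_id": "g1", "color": "blue"},
        {"product_group_id": "g2", "color": "red"},
    ],
    [
        {"product_group_id": "g3", "color": "red"},
        {"product_group_id": "g4", "color": "blue"},
    ],
]

# Hypothetical data: group g1 is spread across both shards.
scattered = [
    [
        {"product_group_id": "g1", "color": "red"},
        {"product_group_id": "g2", "color": "red"},
    ],
    [
        {"product_group_id": "g1", "color": "red"},
        {"product_group_id": "g3", "color": "red"},
    ],
]

if __name__ == "__main__":
    for name, shards in (("partitioned", partitioned), ("scattered", scattered)):
        print(name,
              "merged:", merged_unique_counts(shards, "color", "product_group_id"),
              "summed:", summed_unique_counts(shards, "color", "product_group_id"))
```

With the partitioned data both strategies agree, because no group value can appear on two shards. With the scattered data the summed count for `red` is one too high, since g1 is counted once per shard, which is why summing is only safe when the shards are partitioned on the field being counted.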