[jira] [Updated] (SOLR-15008) Avoid building OrdinalMap for each facet

Radu Gheorghe (Jira) Thu, 19 Nov 2020 03:06:36 -0800


     [ 
https://issues.apache.org/jira/browse/SOLR-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Radu Gheorghe updated SOLR-15008:
---------------------------------
    Description: 
I'm running into the following scenario:
 * [JSON] faceting on a high cardinality field
 * few matching documents => few unique values

Yet the query almost always takes a long time. Here's an example that takes 
almost 4s for ~300 matching documents and about as many unique values (edited 
a bit):

 
{code:java}
    "QTime":3869,
    "params":{
      "json":"{\"query\": \"*:*\",
      \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", \"unique_id:49866\"],
      \"facet\": {\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
      "rows":"0"}},
  "response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
  },
  "facets":{
    "count":333,
    "keywords":{
      "buckets":[{
          "val":"value1",
          "count":124},
  ...
{code}
I did some [profiling with our Sematext 
Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it 
points to OrdinalMap building (see attached screenshot). If I read the code 
right, an OrdinalMap is built for each facet. That's expensive because there 
are many unique values in the shard (previously, there were more, smaller 
shards, which kept latency lower, but that approach doesn't scale for this 
particular use-case).
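
As a rough illustration of why the cost tracks the shard's unique values rather 
than the number of matching documents, here is a toy sketch of a 
segment-to-global ordinal mapping. It is not Lucene's OrdinalMap implementation: 
the names GlobalOrdinalsSketch and buildSegmentToGlobal are made up, and each 
segment is modeled as a plain sorted list of its unique terms.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Toy model only (assumption): a segment is a sorted list of unique terms.
final class GlobalOrdinalsSketch {

    /**
     * Returns map[segment][segmentOrdinal] = globalOrdinal, where the global
     * ordinal is the term's position in the merged, sorted term dictionary.
     */
    static int[][] buildSegmentToGlobal(List<List<String>> segmentTerms) {
        // Merge step: touches every unique term of every segment,
        // no matter how many documents a query later matches.
        TreeSet<String> merged = new TreeSet<>();
        for (List<String> terms : segmentTerms) {
            merged.addAll(terms);
        }
        List<String> globalTerms = new ArrayList<>(merged);

        int[][] map = new int[segmentTerms.size()][];
        for (int seg = 0; seg < segmentTerms.size(); seg++) {
            List<String> terms = segmentTerms.get(seg);
            map[seg] = new int[terms.size()];
            for (int ord = 0; ord < terms.size(); ord++) {
                // Every segment term is present in globalTerms, so the
                // binary search always returns a non-negative index.
                map[seg][ord] = Collections.binarySearch(globalTerms, terms.get(ord));
            }
        }
        return map;
    }
}
```

With two segments holding [a, c] and [b, c, d], the mapping comes out as [0, 2] 
and [1, 2, 3]. The work is proportional to the total number of terms across 
segments, which is why a high cardinality field is costly even for ~300 hits.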

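If I'm right up to this point, I see a couple of potential improvements, 
[inspired by 
Elasticsearch|https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint]: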
 # Keep the OrdinalMap cached until the next softCommit, so that only the first 
query takes the penalty
 # Allow faceting on actual values (a Map) rather than ordinals, for situations 
like the one above where we have few matching documents. We could potentially 
auto-detect this scenario (e.g. by configuring a threshold) and use a Map when 
there are few documents
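
The two ideas above could look roughly like the following. This is a sketch 
under stated assumptions, not Solr code: FacetIdeasSketch, ordinalMapFor, 
countByValue and useValueMap are hypothetical names, the "searcher generation" 
is assumed to be a counter that changes whenever a new searcher is opened, and 
an int[][] stands in for the real OrdinalMap.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names exist in Solr or Lucene.
final class FacetIdeasSketch {

    private final Map<String, int[][]> ordinalMapCache = new ConcurrentHashMap<>();

    // Counts how often the expensive build actually ran (illustration only).
    final AtomicInteger builds = new AtomicInteger();

    /**
     * Idea 1: build the map once per (field, searcher generation), so only
     * the first query after a commit pays the penalty. Entries for old
     * generations are never evicted here; a real cache would need that.
     */
    int[][] ordinalMapFor(String field, long searcherGeneration,
                          Supplier<int[][]> expensiveBuild) {
        String key = field + "@" + searcherGeneration;
        return ordinalMapCache.computeIfAbsent(key, k -> {
            builds.incrementAndGet();
            return expensiveBuild.get();
        });
    }

    /**
     * Idea 2: count actual values in a Map, skipping ordinals entirely.
     * Assumes each inner list holds the distinct values of one matching doc.
     */
    static Map<String, Integer> countByValue(List<List<String>> matchingDocValues) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> values : matchingDocValues) {
            for (String value : values) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Auto-detection: use the Map when few documents match (threshold is configurable). */
    static boolean useValueMap(int matchingDocs, int threshold) {
        return matchingDocs <= threshold;
    }
}
```

In the example above, 333 matching documents would fall under any reasonable 
threshold, so the Map path would be taken and no OrdinalMap would be needed at 
all. The cache helps the opposite case, where many documents match and ordinals 
remain the better choice.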

I'm curious what you think:
 * Would a PR/patch be welcome for either of the two ideas above?
 * Do you see better options? Am I missing something?

 

  was:
I'm running into the following scenario:
 * [JSON] faceting on a high cardinality field
 * few matching documents => few unique values

Yet the query almost always takes a long time. Here's an example that takes 
almost 4s for ~300 matching documents and about as many unique values (edited 
a bit):

 
{code:java}
    "QTime":3869,
    "params":{
      "json":"{\"query\": \"*:*\",
      \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", \"unique_id:49866\"],
      \"facet\": {\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
      "rows":"0"}},
  "response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
  },
  "facets":{
    "count":333,
    "keywords":{
      "buckets":[{
          "val":"value1",
          "count":124},
  ...
{code}
I did some [profiling with our Sematext 
Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it 
points to OrdinalMap building. If I read the code right, an OrdinalMap is 
built for each facet. That's expensive because there are many unique values 
in the shard (previously, there were more, smaller shards, which kept latency 
lower, but that approach doesn't scale for this particular use-case).

If I'm right up to this point, I see a couple of potential improvements, 
[inspired by 
Elasticsearch|https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint]:
 # Keep the OrdinalMap cached until the next softCommit, so that only the first 
query takes the penalty
 # Allow faceting on actual values (a Map) rather than ordinals, for situations 
like the one above where we have few matching documents. We could potentially 
auto-detect this scenario (e.g. by configuring a threshold) and use a Map when 
there are few documents

I'm curious what you think:
 * Would a PR/patch be welcome for either of the two ideas above?
 * Do you see better options? Am I missing something?

 


> Avoid building OrdinalMap for each facet
> ----------------------------------------
>
>                 Key: SOLR-15008
>                 URL: https://issues.apache.org/jira/browse/SOLR-15008
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Facet Module
>    Affects Versions: 8.7
>            Reporter: Radu Gheorghe
>            Priority: Major
>              Labels: performance
>         Attachments: Screenshot 2020-11-19 at 12.01.55.png



