[
https://issues.apache.org/jira/browse/SOLR-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236320#comment-17236320
]
Michael Gibney edited comment on SOLR-15008 at 11/20/20, 5:25 PM:
------------------------------------------------------------------
Excellent, thanks for reporting back!
{quote}This smells like an over-optimization so I guess we can close this issue?
{quote}
Yes, I'd be inclined to agree (pending confirmation that a warming query fixes
the issue).
For closure/completeness, fwiw it would definitely be possible to choose a
hypothetical "facet-by-value" approach dynamically (e.g., based on some
function of field cardinality and domain selectivity ... perhaps configurable
to account for expected term distribution for a given field?) – all the
requisite information should indeed be available to dynamically swap out the
faceting implementation in
[FacetField.createFacetProcessor(...)|https://github.com/apache/lucene-solr/blob/9c066f60f1804c26db8be226429a0be046c5a4db/solr/core/src/java/org/apache/solr/search/facet/FacetField.java#L74-L148].
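To make the "function of field cardinality and domain selectivity" idea concrete, here is a minimal sketch of what such a selection heuristic might look like. Everything in it is an assumption for illustration: the class FacetStrategyChooser, the Strategy enum, the perDocCost tuning knob, and the example numbers are hypothetical and are not part of Solr or of FacetField.createFacetProcessor(...). The cost model (per-document value lookups versus work proportional to the number of unique terms) is likewise a rough guess, not a measured one.

```java
/** Hypothetical sketch: none of these names exist in Solr. */
final class FacetStrategyChooser {

    /** The two candidate implementations discussed in this issue. */
    enum Strategy { BY_ORDINAL, BY_VALUE }

    /**
     * Chooses a faceting strategy from simple statistics.
     *
     * @param domainSize       number of documents matching the query and filters
     * @param fieldCardinality number of unique terms in the field on this shard
     * @param perDocCost       assumed relative cost of looking up one document's
     *                         values, compared with handling one unique term when
     *                         building an ordinal map; would need tuning per field
     */
    static Strategy choose(long domainSize, long fieldCardinality, double perDocCost) {
        if (domainSize < 0 || fieldCardinality < 0 || perDocCost <= 0) {
            throw new IllegalArgumentException("sizes must be >= 0 and perDocCost > 0");
        }
        // Assumed cost model: by-value work grows with the number of matching
        // documents, by-ordinal work grows with the number of unique terms.
        double byValueCost = domainSize * perDocCost;
        return byValueCost < fieldCardinality ? Strategy.BY_VALUE : Strategy.BY_ORDINAL;
    }

    public static void main(String[] args) {
        // Few matching docs against a (made-up) cardinality of 10 million terms.
        System.out.println(choose(333, 10_000_000L, 100.0));       // BY_VALUE
        // A domain covering a large share of the index.
        System.out.println(choose(5_000_000L, 10_000_000L, 100.0)); // BY_ORDINAL
    }
}
```

A per-field perDocCost is one way the "configurable to account for expected term distribution" part could be expressed; a real implementation would likely need more inputs than these three.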
> Avoid building OrdinalMap for each facet
> ----------------------------------------
>
> Key: SOLR-15008
> URL: https://issues.apache.org/jira/browse/SOLR-15008
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: Facet Module
> Affects Versions: 8.7
> Reporter: Radu Gheorghe
> Priority: Major
> Labels: performance
> Attachments: Screenshot 2020-11-19 at 12.01.55.png, writes_commits.png
>
>
> I'm running into the following scenario:
> * [JSON] faceting on a high cardinality field
> * few matching documents => few unique values
> Yet the query almost always takes a long time. Here's an example taking
> almost 4s for ~300 documents and unique values (edited a bit):
>
> {code:java}
> "QTime":3869,
> "params":{
> "json":"{\"query\": \"*:*\",
> \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\",
> \"unique_id:49866\"],
> \"facet\":
> {\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
> "rows":"0"}},
>
> "response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
> },
> "facets":{
> "count":333,
> "keywords":{
> "buckets":[{
> "val":"value1",
> "count":124},
> ...
> {code}
> I did some [profiling with our Sematext
> Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it
> points me to OrdinalMap building (see attached screenshot). If I read the
> code right, an OrdinalMap is built for each facet, which is expensive since
> there are many unique values in the shard (previously, there were more,
> smaller shards, which gave better latency, but that approach doesn't scale
> for this particular use-case).
> If I'm right up to this point, I see a couple of potential improvements,
> [inspired by
> Elasticsearch|#search-aggregations-bucket-terms-aggregation-execution-hint]:
> # *Keep the OrdinalMap cached until the next softCommit*, so that only the
> first query takes the penalty
> # *Allow faceting on actual values (a Map) rather than ordinals*, for
> situations like the one above where we have few matching documents. We could
> potentially auto-detect this scenario (e.g. by configuring a threshold) and
> use a Map when there are few documents
> I'm curious about what you're thinking:
> * Would a PR/patch be welcome for either of the two ideas above?
> * Do you see better options? Am I missing something?
>
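For reference, the second idea in the quoted description (counting actual values in a Map when few documents match) can be sketched roughly as follows. This is an illustration only, not Solr code: the class ValueFacetCounter and the method topBuckets are hypothetical names, and the input is assumed to be the already-deduplicated field values of each matching document, which a real implementation would read from doc values. The limit and mincount parameters mirror the ones in the example request.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: counts facet values directly, without ordinals. */
final class ValueFacetCounter {

    /**
     * @param matchingDocValues field values of each matching document
     *                          (assumed deduplicated per document)
     * @param limit             maximum number of buckets to return
     * @param mincount          minimum count a value needs to be returned
     * @return buckets sorted by count descending, ties broken by value
     */
    static List<Map.Entry<String, Integer>> topBuckets(
            List<List<String>> matchingDocValues, int limit, int mincount) {
        // One map entry per distinct value seen in the matching documents only,
        // so memory and time depend on the domain, not on field cardinality.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> docValues : matchingDocValues) {
            for (String value : docValues) {
                counts.merge(value, 1, Integer::sum);
            }
        }

        // Apply mincount, copying entries so the result is detached from the map.
        List<Map.Entry<String, Integer>> buckets = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= mincount) {
                buckets.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            }
        }

        // Highest count first; equal counts are ordered by value for stable output.
        buckets.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        // Apply limit after sorting.
        return new ArrayList<>(buckets.subList(0, Math.min(Math.max(limit, 0), buckets.size())));
    }
}
```

With around 300 matching documents the map stays small regardless of how many unique values the shard holds, which is the property the description is after. Whether this wins in practice depends on the cost of fetching per-document values, which this sketch does not model.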