[
https://issues.apache.org/jira/browse/SOLR-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236291#comment-17236291
]
Radu Gheorghe commented on SOLR-15008:
--------------------------------------
Ahhh, I feel so stupid: I did try with distrib=false before, saw the problem
on and off (as I do with distributed searches), concluded that it didn't
matter and moved on. But obviously it does. The profile ran on a single host,
but if you look at it, it shows only 2s out of 2m of CPU time. I thought that
was because of the profiler's sampling, but it's more likely because that one
host wasn't doing much for most of the time.
I tried 2) right now and it confirms your hypothesis. If I run the query in a
loop, it's slow the first time, then again about once a minute (it actually
skips some minutes; I assume there were no updates in those).
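A minimal sketch of the loop test described above, in case it helps anyone reproduce it. It assumes the query is sent through Solr's JSON Request API with distrib=false; the helper names (`run_solr_query`, `time_queries`, `slow_iterations`) and the pause between iterations are illustrative choices, not something taken from the issue.

```python
import json
import time
import urllib.request
from typing import Callable, List


def run_solr_query(base_url: str, json_body: dict) -> int:
    """POST a JSON request to a single core (distrib=false) and return its QTime in ms."""
    req = urllib.request.Request(
        base_url + "/query?distrib=false",
        data=json.dumps(json_body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["responseHeader"]["QTime"]


def time_queries(
    run_query: Callable[[], object],
    iterations: int,
    pause_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Call run_query repeatedly and return the wall-clock latency of each call in seconds."""
    latencies = []
    for i in range(iterations):
        start = clock()
        run_query()
        latencies.append(clock() - start)
        if i < iterations - 1:
            sleep(pause_s)  # space the calls out so a once-a-minute pattern is visible
    return latencies


def slow_iterations(latencies: List[float], threshold_s: float) -> List[int]:
    """Return the indexes of the iterations that took at least threshold_s seconds."""
    return [i for i, t in enumerate(latencies) if t >= threshold_s]
```

For example, `time_queries(lambda: run_solr_query("http://localhost:8983/solr/mycore", body), iterations=180)` would cover roughly three minutes, where the host, the core name `mycore` and `body` are placeholders. If the slow iterations line up with commits, that is consistent with the structure being rebuilt once per new searcher.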
I will try the warmup query and see if this is enough for now. It should be: I
see an average of 1 QPS per core, so one extra query per minute per core is
load we wouldn't feel. Thanks so much!
> If warming queries do indeed help here, the only argument I can see for
> pursuing facet-by-value would be if you expect to facet on a field
> _exclusively_ for low-cardinality domains, _and_ the field is sufficiently
> high-cardinality that either CPU of building {{OrdinalMap}} in a warming
> query, or memory of keeping it hanging around on the heap, is deemed
> prohibitively expensive
I agree. And it would be a narrow use-case, unless we can detect at facet time
(can we?) how many documents match (which would also hint at the cardinality of
the domain) and skip the OrdinalMap in that case (unless it's already cached?).
This smells like over-optimization, so I guess we can close this issue? I'll
know for sure after I test with the warmup.
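To make the facet-time detection idea concrete, here is a rough sketch of what such a decision and a by-value count could look like. This is not Solr code: `choose_strategy`, `facet_by_value` and the threshold are hypothetical, and the tie-break on equal counts (by value) is an assumption of the sketch.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def choose_strategy(domain_size: int, threshold: int, ordinal_map_cached: bool) -> str:
    """Pick how to count: reuse ordinals if the map is already cached,
    otherwise count by value only when few documents match."""
    if ordinal_map_cached:
        return "ordinals"
    return "values" if domain_size <= threshold else "ordinals"


def facet_by_value(
    matching_doc_values: Iterable[Sequence[str]],
    limit: int = 20,
    mincount: int = 1,
) -> List[Tuple[str, int]]:
    """Count terms with a plain map over the matching documents only.

    Each element of matching_doc_values holds the field values of one document;
    a value is counted once per document even if it is repeated there.
    """
    counts: Counter = Counter()
    for values in matching_doc_values:
        counts.update(set(values))
    # Sort by count descending, then by value, and apply mincount and limit.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(v, c) for v, c in ordered if c >= mincount][:limit]
```

With the numbers from the issue, `choose_strategy(333, 1000, False)` would pick the by-value path for a hypothetical threshold of 1000 documents, while any cached map would win regardless of domain size.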
> Avoid building OrdinalMap for each facet
> ----------------------------------------
>
> Key: SOLR-15008
> URL: https://issues.apache.org/jira/browse/SOLR-15008
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Components: Facet Module
> Affects Versions: 8.7
> Reporter: Radu Gheorghe
> Priority: Major
> Labels: performance
> Attachments: Screenshot 2020-11-19 at 12.01.55.png, writes_commits.png
>
>
> I'm running into the following scenario:
> * [JSON] faceting on a high cardinality field
> * few matching documents => few unique values
> Yet the query almost always takes a long time. Here's an example taking
> almost 4s for ~300 documents and unique values (edited a bit):
>
> {code:java}
> "QTime":3869,
> "params":{
> "json":"{\"query\": \"*:*\",
> \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\",
> \"unique_id:49866\"],
> \"facet\":
> {\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
> "rows":"0"}},
>
> "response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
> },
> "facets":{
> "count":333,
> "keywords":{
> "buckets":[{
> "val":"value1",
> "count":124},
> ...
> {code}
> I did some [profiling with our Sematext
> Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it
> points me to OrdinalMap building (see attached screenshot). If I read the
> code right, an OrdinalMap is built with every facet. And it's expensive since
> there are many unique values in the shard (previously, there were more, smaller
> shards, making latency better, but this approach doesn't scale for this
> particular use-case).
> If I'm right up to this point, I see a couple of potential improvements,
> [inspired by
> Elasticsearch|#search-aggregations-bucket-terms-aggregation-execution-hint]:
> # *Keep the OrdinalMap cached until the next softCommit*, so that only the
> first query takes the penalty
> # *Allow faceting on actual values (a Map) rather than ordinals*, for
> situations like the one above where we have few matching documents. We could
> potentially auto-detect this scenario (e.g. by configuring a threshold) and
> use a Map when there are few documents
> I'm curious about what you're thinking:
> * would a PR/patch be welcome for either of the two ideas above?
> * do you see better options? am I missing something?
>