[ https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379220#comment-17379220 ]
Michael Gibney commented on LUCENE-10023:
-----------------------------------------

Thanks for the Elasticsearch perspective, Adrien! (and I agree that you've identified the two relevant use cases).

I'm wondering whether you mean "tag clouds of analyzed tokens" to refer to naive [terms aggregation|https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-bucket-terms-aggregation.html] (simple count) on text fields? As noted in the docs, [terms aggregation|https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-bucket-terms-aggregation.html] on text fields remains dependent on enabling fielddata (uninverting). Alternatively of course, I gather that the single-token "tag cloud of analyzed tokens" case is supported via {{normalize()}} (formerly {{MultiTermAwareComponent}}) -- though this notably doesn't support synonym-style expansion of otherwise single-token analysis.

wrt the "significant terms analysis" case: I infer direct support for the term vectors approach in the [significant text|https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-bucket-significanttext-aggregation.html] aggregation (as distinct from the [significant terms|https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-bucket-significantterms-aggregation.html] aggregation)? The Elasticsearch docs for "significant terms" aggregation state that [DocValues are not supported as sources of term data|https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-bucket-significantterms-aggregation.html#_significant_terms_must_be_indexed_values]. Is this still the case?

FWIW, from the Solr perspective, "relatedness" (analogous to Elasticsearch "significant terms") _is_ now [computed over DocValues|https://github.com/apache/lucene-solr/commit/40e2122b5a5b89f446e51692ef0d72e48c7b71e5], and could be viewed at a lower level as a composite/special case of naive "terms" aggregations over different domains (foreground, background; an approach that yields [particular performance benefits|https://issues.apache.org/jira/browse/SOLR-13132?focusedCommentId=17153994#comment-17153994] for high-cardinality -- e.g., full-text -- fields).

I view this as being a significant point to raise, because the value of naive "terms" aggregation over full text is dubious, relative to the value of doc-frequency-normalized "significant terms" aggregation. If from the Elasticsearch perspective full-domain "significant terms" aggregation isn't supported/recommended, there'd be no reason to prefer DocValues as opposed to term vectors. This is not the case from the Solr perspective.

> Multi-token post-analysis DocValues
> -----------------------------------
>
> Key: LUCENE-10023
> URL: https://issues.apache.org/jira/browse/LUCENE-10023
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael Gibney
> Priority: Major
> Time Spent: 50m
> Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but
> there are cases where it would be desirable to have post-analysis DocValues
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms
> aggregation.
> I understand that this could be viewed as "trappy" for the naive
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the
> trappiness onto Lucene-external workarounds for systems/users that want to
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency
> guarantees that present opportunities for future optimizations (e.g., shared
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues
> directly to {{IndexingChain}}. The initial proposal involves extending the
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to
> existing {{IndexableFieldType.docValuesType()}}).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
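
For illustration only, here is a toy sketch of the point made in the comment above: "significant terms"-style scoring can be viewed as a composite of two naive "terms" aggregations, one over a foreground domain and one over a background domain. It uses plain Java collections in place of Lucene; the class {{TokenTermsSketch}}, its methods, and the sample data are all hypothetical, and the score is a simple difference of document shares, not Solr's actual relatedness formula or anything from the proposed patch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TokenTermsSketch {

    // Naive "terms" aggregation: for each token, count the documents in the
    // given domain whose post-analysis token set contains it. The per-doc
    // token sets stand in for multi-token post-analysis DocValues.
    public static Map<String, Integer> termCounts(List<Set<String>> docTokens,
                                                  List<Integer> domain) {
        Map<String, Integer> counts = new HashMap<>();
        for (int docId : domain) {
            for (String token : docTokens.get(docId)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Illustrative score only (an assumption, NOT Solr's relatedness formula):
    // share of foreground docs containing the token minus the share of
    // background docs containing it.
    public static double score(String token,
                               Map<String, Integer> fg, int fgSize,
                               Map<String, Integer> bg, int bgSize) {
        double fgShare = fg.getOrDefault(token, 0) / (double) fgSize;
        double bgShare = bg.getOrDefault(token, 0) / (double) bgSize;
        return fgShare - bgShare;
    }

    public static void main(String[] args) {
        // Hypothetical post-analysis tokens for four documents.
        List<Set<String>> docTokens = List.of(
                Set.of("the", "white", "whale"),
                Set.of("the", "whale", "ship"),
                Set.of("the", "ship", "sails"),
                Set.of("the", "harbor"));

        List<Integer> background = List.of(0, 1, 2, 3); // all docs
        List<Integer> foreground = List.of(0, 1);       // e.g. docs matching a query

        // Two naive terms aggregations over different domains.
        Map<String, Integer> fg = termCounts(docTokens, foreground);
        Map<String, Integer> bg = termCounts(docTokens, background);

        for (String token : List.of("the", "whale")) {
            System.out.println(token
                    + " fgCount=" + fg.get(token)
                    + " score=" + score(token, fg, foreground.size(), bg, background.size()));
        }
    }
}
```

In this toy data, "the" ties for the highest naive foreground count but scores 0.0 because it is just as common in the background, while "whale" scores 0.5. That is the sense in which a plain count over full text is of dubious value relative to a doc-frequency-normalized aggregation.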