[
https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379220#comment-17379220
]
Michael Gibney commented on LUCENE-10023:
-----------------------------------------
Thanks for the Elasticsearch perspective, Adrien! (and I agree that you've
identified the two relevant use cases).
I'm wondering whether you mean "tag clouds of analyzed tokens" to refer to
naive [terms
aggregation|https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-bucket-terms-aggregation.html]
(simple count) on text fields? As noted in the docs, [terms
aggregation|https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-bucket-terms-aggregation.html]
on text fields remains dependent on enabling fielddata (uninverting).
Alternatively of course, I gather that the single-token "tag cloud of analyzed
tokens" case is supported via {{normalize()}} (formerly
{{MultiTermAwareComponent}}) -- though this notably doesn't support
synonym-style expansion of otherwise single-token analysis.
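
To make the limitation concrete, here is a minimal plain-Java sketch. It does not use Lucene; the class name, the helper methods, and the synonym table are all hypothetical stand-ins. It only illustrates that normalization maps one input value to exactly one output value, whereas synonym-style expansion can map one input value to several.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class NormalizeVsExpandSketch {

    // Hypothetical synonym table, for illustration only.
    static final Map<String, List<String>> SYNONYMS =
        Map.of("nyc", List.of("nyc", "new york city"));

    // Stand-in for single-token normalization: one value in, one value out.
    static String normalize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    // Stand-in for single-token analysis with synonym expansion:
    // one value in, possibly several values out.
    static List<String> normalizeAndExpand(String raw) {
        String token = normalize(raw);
        return SYNONYMS.getOrDefault(token, List.of(token));
    }

    public static void main(String[] args) {
        System.out.println(normalize(" NYC "));
        System.out.println(normalizeAndExpand("NYC"));
        System.out.println(normalizeAndExpand("Boston"));
    }
}
```

A single-valued representation can hold the result of {{normalize}} here, but not the two values produced for "NYC" by the expanding variant.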
wrt the "significant terms analysis" case: I infer direct support for the term
vectors approach in the [significant
text|https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-bucket-significanttext-aggregation.html]
aggregation (as distinct from the [significant
terms|https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-bucket-significantterms-aggregation.html]
aggregation)?
The Elasticsearch docs for "significant terms" aggregation state that
[DocValues are not supported as sources of term
data|https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-bucket-significantterms-aggregation.html#_significant_terms_must_be_indexed_values].
Is this still the case? FWIW, from the Solr perspective, "relatedness"
(analogous to Elasticsearch "significant terms") _is_ now [computed over
DocValues|https://github.com/apache/lucene-solr/commit/40e2122b5a5b89f446e51692ef0d72e48c7b71e5],
and could be viewed at a lower level as a composite/special case of naive
"terms" aggregations over different domains (foreground, background; an
approach that yields [particular performance
benefits|https://issues.apache.org/jira/browse/SOLR-13132?focusedCommentId=17153994#comment-17153994]
for high-cardinality -- e.g., full-text -- fields). I view this as being a
significant point to raise, because the value of naive "terms" aggregation over
full text is dubious, relative to the value of doc-frequency-normalized
"significant terms" aggregation. If from the Elasticsearch perspective
full-domain "significant terms" aggregation isn't supported/recommended,
there'd be no reason to prefer DocValues as opposed to term vectors. This is
not the case from the Solr perspective.
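
As a rough illustration of "a composite of naive terms aggregations over different domains", here is a minimal plain-Java sketch. It is not Solr's "relatedness" function and not Elasticsearch's significance heuristic: the class name, the in-memory document model (each document is a set of terms), and the score (foreground document-frequency rate minus background rate) are assumptions chosen only to show the shape of the computation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ForegroundBackgroundSketch {

    // Naive "terms" aggregation: for each term, the number of documents
    // in the given domain that contain it.
    static Map<String, Integer> termCounts(List<Set<String>> domain) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> docTerms : domain) {
            for (String term : docTerms) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Illustrative score only: the term's document-frequency rate in the
    // foreground minus its rate in the background. Both domains must be
    // non-empty.
    static double score(String term, List<Set<String>> fg, List<Set<String>> bg) {
        double fgRate = termCounts(fg).getOrDefault(term, 0) / (double) fg.size();
        double bgRate = termCounts(bg).getOrDefault(term, 0) / (double) bg.size();
        return fgRate - bgRate;
    }

    public static void main(String[] args) {
        // Background: the whole (toy) index. Foreground: docs matching some query.
        List<Set<String>> bg = List.of(
            Set.of("the", "whale"),
            Set.of("the", "sea"),
            Set.of("the", "ship"),
            Set.of("the", "whale", "harpoon"));
        List<Set<String>> fg = List.of(
            Set.of("the", "whale"),
            Set.of("the", "whale", "harpoon"));
        for (String term : List.of("the", "whale", "harpoon")) {
            System.out.println(term + " " + score(term, fg, bg));
        }
    }
}
```

In this toy data a term that occurs in every document ("the") tops a simple count but scores zero once the background rate is subtracted, which is the reason a plain count over full text is of limited value.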
> Multi-token post-analysis DocValues
> -----------------------------------
>
> Key: LUCENE-10023
> URL: https://issues.apache.org/jira/browse/LUCENE-10023
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael Gibney
> Priority: Major
> Time Spent: 50m
> Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but
> there are cases where it would be desirable to have post-analysis DocValues
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms
> aggregation. I understand that this could be viewed as "trappy" for the naive
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the
> trappiness onto Lucene-external workarounds for systems/users that want to
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency
> guarantees that present opportunities for future optimizations (e.g., shared
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues
> directly to {{IndexingChain}}. The initial proposal involves extending the
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to
> existing {{IndexableFieldType.docValuesType()}}).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)