[jira] [Commented] (LUCENE-10023) Multi-token post-analysis DocValues

Michael Gibney (Jira) Mon, 12 Jul 2021 08:18:06 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379220#comment-17379220
 ]


Michael Gibney commented on LUCENE-10023:
-----------------------------------------

Thanks for the Elasticsearch perspective, Adrien! (and I agree that you've 
identified the two relevant use cases).

I'm wondering whether you mean "tag clouds of analyzed tokens" to refer to 
naive [terms 
aggregation|https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-bucket-terms-aggregation.html]
 (simple count) on text fields? As noted in the docs, [terms 
aggregation|https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-bucket-terms-aggregation.html]
 on text fields remains dependent on enabling fielddata (uninverting). 
Alternatively of course, I gather that the single-token "tag cloud of analyzed 
tokens" case is supported via {{normalize()}} (formerly 
{{MultiTermAwareComponent}}) -- though this notably doesn't support 
synonym-style expansion of otherwise single-token analysis.

Regarding the "significant terms analysis" case: am I right to infer that the 
term vectors approach is directly supported by the [significant 
text|https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-bucket-significanttext-aggregation.html]
 aggregation (as distinct from the [significant 
terms|https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-bucket-significantterms-aggregation.html]
 aggregation)?

The Elasticsearch docs for "significant terms" aggregation state that 
[DocValues are not supported as sources of term 
data|https://www.elastic.co/guide/en/elasticsearch/reference/7.13/search-aggregations-bucket-significantterms-aggregation.html#_significant_terms_must_be_indexed_values].
 Is this still the case? FWIW, from the Solr perspective, "relatedness" 
(analogous to Elasticsearch "significant terms") _is_ now [computed over 
DocValues|https://github.com/apache/lucene-solr/commit/40e2122b5a5b89f446e51692ef0d72e48c7b71e5],
 and could be viewed at a lower level as a composite/special case of naive 
"terms" aggregations over different domains (foreground, background; an 
approach that yields [particular performance 
benefits|https://issues.apache.org/jira/browse/SOLR-13132?focusedCommentId=17153994#comment-17153994]
 for high-cardinality -- e.g., full-text -- fields). I view this as being a 
significant point to raise, because the value of naive "terms" aggregation over 
full text is dubious, relative to the value of doc-frequency-normalized 
"significant terms" aggregation. If from the Elasticsearch perspective 
full-domain "significant terms" aggregation isn't supported/recommended, 
there'd be no reason to prefer DocValues as opposed to term vectors. This is 
not the case from the Solr perspective.
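
To make the "composite of naive terms aggregations over different domains" 
idea concrete, here is a minimal, hypothetical sketch in plain Python. It does 
not use Lucene, Solr, or Elasticsearch APIs, and the score (foreground rate 
minus background rate) is an illustrative stand-in, not Solr's actual 
"relatedness" formula. The names {{terms_aggregation}} and 
{{significant_terms}} and the toy documents are assumptions made up for this 
example.

```python
from collections import Counter


def terms_aggregation(docs, doc_ids):
    """Naive terms aggregation: per-term document counts over a domain."""
    counts = Counter()
    for doc_id in doc_ids:
        # Count each term at most once per document (doc frequency).
        counts.update(set(docs[doc_id]))
    return counts


def significant_terms(docs, foreground_ids):
    """Combine two naive terms aggregations (foreground, background).

    The score below is an illustrative assumption, NOT Solr's relatedness
    formula: it is the fraction of foreground docs containing the term
    minus the fraction of all docs containing the term.
    """
    background_ids = list(docs)  # background domain: every document
    fg = terms_aggregation(docs, foreground_ids)
    bg = terms_aggregation(docs, background_ids)
    scores = {
        term: fg_count / len(foreground_ids) - bg[term] / len(background_ids)
        for term, fg_count in fg.items()
    }
    # Highest score first; break ties alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy post-analysis token streams, keyed by doc id (made-up data).
docs = {
    0: ["the", "whale", "sea"],
    1: ["the", "whale", "ship"],
    2: ["the", "city", "street"],
    3: ["the", "city", "market"],
}

naive = terms_aggregation(docs, [0, 1])
ranked = significant_terms(docs, [0, 1])
print(naive.most_common())
print(ranked)
```

In this toy data the naive foreground counts cannot separate "the" from 
"whale", whereas the normalized score ranks "whale" first and "the" last, 
which is the distinction I'm trying to draw above.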

> Multi-token post-analysis DocValues
> -----------------------------------
>
>                 Key: LUCENE-10023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10023
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael Gibney
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by 
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but 
> there are cases where it would be desirable to have post-analysis DocValues 
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms 
> aggregation. I understand that this could be viewed as "trappy" for the naive 
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people 
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the 
> trappiness onto Lucene-external workarounds for systems/users that want to 
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency 
> guarantees that present opportunities for future optimizations (e.g., shared 
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues 
> directly to {{IndexingChain}}. The initial proposal involves extending the 
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to 
> existing {{IndexableFieldType.docValuesType()}}).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
