[jira] [Commented] (LUCENE-10023) Multi-token post-analysis DocValues

Adrien Grand (Jira) Mon, 12 Jul 2021 23:53:07 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379658#comment-17379658
 ]


Adrien Grand commented on LUCENE-10023:
---------------------------------------

bq. I'm wondering whether you mean "tag clouds of analyzed tokens" to refer to 
naive terms aggregation (simple count) on text fields?

Yes. 

bq. wrt the "significant terms analysis" case: I infer direct support for the 
term vectors approach in the significant text aggregation (as distinct from the 
significant terms aggregation)?

Yes indeed.

bq.  DocValues are not supported as sources of term data. Is this still the 
case?

I need to check, but I think this sentence is a bit misleading: it means 
that we don't support this aggregation on fields that enable _only_ doc values, 
i.e. we require the field to be indexed to have access to term frequencies.

bq. from the Elasticsearch perspective full-domain "significant terms" 
aggregation isn't supported/recommended

Indeed, it's no longer recommended for text fields: we point our users 
to significant_text instead and keep significant_terms only for keyword 
fields like tags.

> Multi-token post-analysis DocValues
> -----------------------------------
>
>                 Key: LUCENE-10023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10023
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael Gibney
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by 
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but 
> there are cases where it would be desirable to have post-analysis DocValues 
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms 
> aggregation. I understand that this could be viewed as "trappy" for the naive 
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people 
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the 
> trappiness onto Lucene-external workarounds for systems/users that want to 
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency 
> guarantees that present opportunities for future optimizations (e.g., shared 
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues 
> directly to {{IndexingChain}}. The initial proposal involves extending the 
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to 
> existing {{IndexableFieldType.docValuesType()}}).
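
To make the faceting use case in the description above concrete, here is a minimal sketch of what multi-token post-analysis doc values would conceptually provide: a per-document set of analyzed tokens that can be scanned column-wise to count documents per token. It deliberately uses only the Java standard library and does not call any Lucene API. The class name, the helper methods and the toy lowercase/split "analyzer" are all hypothetical stand-ins, and {{tokenDocValuesType()}} is only a proposal in this issue, not an existing method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch only: not Lucene code, all names are hypothetical.
public class TokenDocValuesSketch {

    // Stand-in for an analysis chain: lowercase, split on non-alphanumerics.
    // A real Analyzer could also stem, filter stop words, and so on.
    static SortedSet<String> analyze(String text) {
        SortedSet<String> tokens = new TreeSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // "Index" each document as a sorted, de-duplicated set of its analyzed
    // tokens, loosely mimicking a multi-valued (sorted-set style) column.
    static List<SortedSet<String>> index(List<String> docs) {
        List<SortedSet<String>> column = new ArrayList<>();
        for (String doc : docs) {
            column.add(analyze(doc));
        }
        return column;
    }

    // Naive terms aggregation: number of documents containing each token.
    static SortedMap<String, Integer> facetCounts(List<SortedSet<String>> column) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (SortedSet<String> docTokens : column) {
            for (String token : docTokens) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<SortedSet<String>> column =
            index(List.of("Call me Ishmael", "Call the whale", "The White Whale"));
        // prints {call=2, ishmael=1, me=1, the=2, whale=2, white=1}
        System.out.println(facetCounts(column));
    }
}
```

Because the counts are computed from post-analysis tokens, "The" and "the" land in the same bucket, which is the behavior a plain (pre-analysis) doc values field on the raw string would not give.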



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
