[
https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379129#comment-17379129
]
Adrien Grand commented on LUCENE-10023:
---------------------------------------
bq. if there's interest in leveraging this from "other-than-Solr" (e.g.,
Elasticsearch, etc. ...
We considered supporting something like that a few years ago, then changed our
minds. There were really only two use-cases: building tag clouds of analyzed
tokens (a much less frequent need than doing the same over non-analyzed strings,
e.g. tags) and performing significant-terms analysis. It didn't feel right to
increase the API surface area of text fields with a new option to index doc
values for a tiny minority of users, so we decided against adding this option.
We have since moved to term vectors for the few use-cases that need something
like this, with a recommendation to run such analysis only on top hits.
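
To illustrate the top-hits recommendation, here is a minimal sketch in plain
Java. It does not use Lucene: each map is a hypothetical stand-in for one
document's term vector (term -> in-document frequency), the list is assumed to
be ordered by search rank, and the class and method names are made up for this
example. It counts, for each term, how many of the first topN hits contain it.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene or Elasticsearch.
final class TopHitsTermCounts {
    private TopHitsTermCounts() {}

    // rankedTermVectors: one map per hit, best hit first (stand-in for term vectors).
    // Returns term -> number of the first topN hits whose term vector contains the term.
    static Map<String, Integer> docFrequencies(
            List<Map<String, Integer>> rankedTermVectors, int topN) {
        Map<String, Integer> counts = new TreeMap<>();
        // Only the top hits are visited, so cost scales with topN, not with index size.
        int limit = Math.max(0, Math.min(topN, rankedTermVectors.size()));
        for (int i = 0; i < limit; i++) {
            for (String term : rankedTermVectors.get(i).keySet()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Bounding the work by topN is the point of the recommendation: the per-document
cost of reading analyzed tokens is paid only for the hits that matter, rather
than for every matching document.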
> Multi-token post-analysis DocValues
> -----------------------------------
>
> Key: LUCENE-10023
> URL: https://issues.apache.org/jira/browse/LUCENE-10023
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael Gibney
> Priority: Major
> Time Spent: 50m
> Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but
> there are cases where it would be desirable to have post-analysis DocValues
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms
> aggregation. I understand that this could be viewed as "trappy" for the naive
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the
> trappiness onto Lucene-external workarounds for systems/users that want to
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency
> guarantees that present opportunities for future optimizations (e.g., shared
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues
> directly to {{IndexingChain}}. The initial proposal involves extending the
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to
> existing {{IndexableFieldType.docValuesType()}}).
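
As a rough illustration of the API shape proposed in the quoted description,
the sketch below models a field type with both accessors. Everything here is a
hypothetical stand-in written in plain Java: DocValuesKind, FieldTypeSketch and
TokenDocValuesSketch are not Lucene classes, the NONE default and the toy
"lowercase and split on whitespace" analysis are assumptions of this example,
and the actual patch may look quite different.

```java
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for Lucene's DocValuesType; only the values needed here.
enum DocValuesKind { NONE, SORTED_SET }

// Hypothetical stand-in for IndexableFieldType, reduced to the two accessors discussed.
interface FieldTypeSketch {
    // Mirrors the existing docValuesType(): doc values over the raw field value.
    DocValuesKind docValuesType();

    // Mirrors the proposed tokenDocValuesType(): doc values over post-analysis tokens.
    // Defaulting to NONE keeps the feature opt-in (an assumption of this sketch).
    default DocValuesKind tokenDocValuesType() {
        return DocValuesKind.NONE;
    }
}

final class TokenDocValuesSketch {
    private TokenDocValuesSketch() {}

    // Returns the distinct post-analysis tokens that would be recorded for one document,
    // or an empty set when the field type has not opted in.
    static SortedSet<String> tokenDocValues(FieldTypeSketch type, String text) {
        SortedSet<String> values = new TreeSet<>();
        if (type.tokenDocValuesType() != DocValuesKind.SORTED_SET) {
            return values;
        }
        // Toy analysis chain: lowercase, then split on whitespace.
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                values.add(token);
            }
        }
        return values;
    }
}
```

Requiring an explicit non-NONE value corresponds to the "explicit user
configuration" point in the list above: a field that does not ask for token
doc values never pays for them.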