[
https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383402#comment-17383402
]
Michael Gibney commented on LUCENE-10023:
-----------------------------------------
Thanks for the clarification, Adrien. I understand, and I was pointing out the
availability of a scalable pure-docValues "significant terms" implementation
because in the _absence_ of such an implementation I agree it would be
difficult to argue that this change is "core-enough". It seems there's
consensus that this change is _not_ core-enough even in the context of scalable
full-domain pure-docValues "significant terms", so I'll just accept whatever
performance hit is associated with double-consuming the token stream and
buffering tokens.
To wrap things up on this issue, from my perspective: I'm now inclined to
propose this double-TokenStream consumption/buffering in Solr (as opposed to in
Lucene sandbox), given that there doesn't seem to be interest at the moment on
the Elasticsearch side, and also because there's probably value in putting the
performance hit higher up, closer to the application code (for better
transparency).
Some final thoughts. The main use cases supported by this would be:
# faceting/terms aggregation over analyzed (potentially multi-token, e.g.,
synonym- or cross-reference-expanded) "tags". Aggregation over analyzed
_full-text_ would also incidentally be supported; the main benefit there would
be to avoid the need to "special-case" full-text terms aggregation from the
user's perspective, but the full-text use case would be of limited practical
utility, compared to ....
# full-domain "relatedness"/"significant_terms" aggregations over full-text
fields (text corpus analytics, etc.). Note: the viability of this use case
depends on an implementation that does not require "pivoting" to per-term
"inverted" index lookups.
The main benefits of an approach integrated in IndexingChain (as opposed to a
custom FieldType that double-consumes the TokenStream and buffers tokens as
"standard" docValues for indexing) are:
# index-time performance (avoid double-consuming TokenStream and extra token
buffering)
# (potential): Lucene-internal (guaranteed) consistency between indexed terms
and docValues terms, with potential optimizations such as shared terms
dictionary, and stronger guarantees about the appropriateness of access
patterns that rely on consistency/compatibility between indexed terms and
docValues (e.g., "refinement", etc.).
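As a rough illustration of the trade-off above, here is a toy model of the
application-level workaround. It is a sketch under stated assumptions, not
Lucene or Solr code: plain JDK collections stand in for indexed terms and
docValues, and the class name, the analyze() method, and the analysisPasses
counter are all hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: JDK collections stand in for Lucene postings and docValues.
class TokenBufferSketch {
    // term -> ids of documents containing it (stand-in for indexed terms)
    final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    // doc id -> sorted, de-duplicated terms (stand-in for SORTED_SET docValues)
    final Map<Integer, SortedSet<String>> docValues = new TreeMap<>();
    // number of times text was run through analysis
    int analysisPasses = 0;

    // Hypothetical analyzer: lowercase, split on runs of non-alphanumerics.
    List<String> analyze(String text) {
        analysisPasses++;
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Application-level workaround: analyze once to buffer docValues terms,
    // then hand the raw text to the "indexer", which analyzes it again.
    void addDocument(int docId, String text) {
        docValues.put(docId, new TreeSet<>(analyze(text))); // pass 1, buffered
        indexText(docId, text);                              // pass 2
    }

    // Stand-in for the indexing side, which consumes its own token stream.
    void indexText(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }
}
```

In this model every document costs two analysis passes plus a per-document
token buffer, and nothing but convention keeps the two term sets consistent.
An approach integrated in IndexingChain would aim to derive both from a
single pass.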
Thanks again Robert and Adrien for the feedback!
> Multi-token post-analysis DocValues
> -----------------------------------
>
> Key: LUCENE-10023
> URL: https://issues.apache.org/jira/browse/LUCENE-10023
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael Gibney
> Priority: Major
> Time Spent: 50m
> Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but
> there are cases where it would be desirable to have post-analysis DocValues
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms
> aggregation. I understand that this could be viewed as "trappy" for the naive
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the
> trappiness onto Lucene-external workarounds for systems/users that want to
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency
> guarantees that present opportunities for future optimizations (e.g., shared
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues
> directly to {{IndexingChain}}. The initial proposal involves extending the
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to
> existing {{IndexableFieldType.docValuesType()}}).