[jira] [Commented] (LUCENE-10023) Multi-token post-analysis DocValues

Michael Gibney (Jira) Mon, 19 Jul 2021 08:23:05 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383402#comment-17383402
 ]


Michael Gibney commented on LUCENE-10023:
-----------------------------------------

Thanks for the clarification, Adrien. I understand, and I was pointing out the 
availability of a scalable pure-docValues "significant terms" implementation 
because in the _absence_ of such an implementation I agree it would be 
difficult to argue that this change is "core-enough". It seems there's 
consensus that this change is _not_ core-enough even in the context of scalable 
full-domain pure-docValues "significant terms", so I'll just accept whatever 
performance hit is associated with double-consuming the token stream and 
buffering tokens.

To wrap things up on this issue, from my perspective: I'm now inclined to 
propose this double-TokenStream consumption/buffering in Solr (as opposed to in 
Lucene sandbox), given that there doesn't seem to be interest at the moment on 
the Elasticsearch side, and also because there's probably value in putting the 
performance hit higher up, closer to the application code (for better 
transparency).

Some final thoughts: the main use cases supported by this would be:
# faceting/terms aggregation over analyzed (potentially multi-token, e.g., 
synonym- or cross-reference-expanded) "tags". Aggregation over analyzed 
_full-text_ would incidentally be supported as well; the main benefit for the 
full-text case would be to avoid the need to "special-case" full-text terms 
aggregation from the user's perspective, but the full-text use case would be 
of limited practical utility, compared to ....
# full-domain "relatedness"/"significant_terms" aggregations over full-text 
fields (text corpus analytics, etc.). Note: the viability of this use case 
depends on an implementation that does not require "pivoting" to per-term 
"inverted" index lookups.
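
To make the two use cases above concrete, here is a minimal sketch in plain Java. It does not use any Lucene API: the per-document term sets merely stand in for multi-valued docValues, the class and method names are hypothetical, and the "relatedness" score (foreground share minus background share) is a deliberately simple placeholder, not the formula used by Solr or Elasticsearch. The point is only that both aggregations can be computed by iterating per-document values, with no per-term lookups against an inverted index.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: per-document term sets stand in for multi-token docValues.
final class DocValuesTermsSketch {

    // Terms aggregation: count, for each term, how many of the given docs contain it.
    static Map<String, Integer> facetCounts(List<Set<String>> docTerms,
                                            Collection<Integer> docIds) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int docId : docIds) {
            for (String term : docTerms.get(docId)) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Placeholder relatedness score: share of foreground docs containing the term
    // minus share of all docs containing it. Both counts come from docValues-style
    // iteration; nothing is looked up per term in an inverted index.
    static Map<String, Double> relatedness(List<Set<String>> docTerms,
                                           Set<Integer> foreground) {
        List<Integer> allDocs = new ArrayList<>();
        for (int i = 0; i < docTerms.size(); i++) {
            allDocs.add(i);
        }
        Map<String, Integer> background = facetCounts(docTerms, allDocs);
        Map<String, Integer> fg = facetCounts(docTerms, foreground);
        Map<String, Double> scores = new TreeMap<>();
        for (Map.Entry<String, Integer> e : fg.entrySet()) {
            double fgShare = e.getValue() / (double) foreground.size();
            double bgShare = background.get(e.getKey()) / (double) docTerms.size();
            scores.put(e.getKey(), fgShare - bgShare);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
            Set.of("whale", "sea"),
            Set.of("whale", "ship"),
            Set.of("sea", "ship"),
            Set.of("ship"));
        System.out.println(facetCounts(docs, List.of(0, 1, 2, 3)));
        System.out.println(relatedness(docs, Set.of(0, 1)));
    }
}
```

In this toy model the cost of the relatedness pass grows with the number of term occurrences in the scanned documents rather than with the number of distinct candidate terms, which is the property the second use case depends on.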

The main benefits of an approach integrated in IndexingChain (as opposed to a 
custom FieldType that double-consumes the TokenStream and buffers tokens as 
"standard" docValues for indexing) are:
# index-time performance (avoid double-consuming TokenStream and extra token 
buffering)
# (potentially) Lucene-internal, guaranteed consistency between indexed terms 
and docValues terms, with potential optimizations such as a shared terms 
dictionary, and stronger guarantees about the appropriateness of access 
patterns that rely on consistency/compatibility between indexed terms and 
docValues (e.g., "refinement", etc.).
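
The trade-off between the two approaches can be illustrated with a small, self-contained Java model. Everything here is hypothetical and Lucene-free: `analyze` is a toy stand-in for an analysis chain, `postings` stands in for the inverted index, and `docValues` for per-document term values. The sketch only shows that the workaround runs analysis twice per field value, whereas an integrated approach would feed both structures from a single pass and so keep them consistent by construction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of the two indexing strategies; not Lucene code.
final class BufferedTokenIndexingSketch {
    final Map<String, SortedSet<Integer>> postings = new TreeMap<>(); // term -> docIds
    final List<SortedSet<String>> docValues = new ArrayList<>();      // docId -> terms
    int analyzeCalls = 0;                                             // analysis passes run

    // Toy analysis chain: lowercase, then split on anything that is not a letter or digit.
    List<String> analyze(String text) {
        analyzeCalls++;
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    private void addPostings(List<String> tokens, int docId) {
        for (String term : tokens) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Workaround: the field type analyzes once to buffer tokens as docValues,
    // then the indexer analyzes the same text again for the inverted index.
    int addDocumentWorkaround(String text) {
        int docId = docValues.size();
        docValues.add(new TreeSet<>(analyze(text))); // first consumption, buffered
        addPostings(analyze(text), docId);           // second consumption
        return docId;
    }

    // Integrated: one analysis pass feeds both structures.
    int addDocumentIntegrated(String text) {
        int docId = docValues.size();
        List<String> tokens = analyze(text);
        docValues.add(new TreeSet<>(tokens));
        addPostings(tokens, docId);
        return docId;
    }

    public static void main(String[] args) {
        BufferedTokenIndexingSketch idx = new BufferedTokenIndexingSketch();
        idx.addDocumentWorkaround("Moby-Dick; or, The Whale");
        idx.addDocumentIntegrated("The whale ship");
        System.out.println("analysis passes: " + idx.analyzeCalls);
        System.out.println("postings: " + idx.postings);
        System.out.println("docValues: " + idx.docValues);
    }
}
```

In the workaround, consistency between the two structures depends on both passes using the same analysis configuration; in the integrated variant it follows from sharing one token list.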

Thanks again Robert and Adrien for the feedback!

> Multi-token post-analysis DocValues
> -----------------------------------
>
>                 Key: LUCENE-10023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10023
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael Gibney
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by 
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but 
> there are cases where it would be desirable to have post-analysis DocValues 
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms 
> aggregation. I understand that this could be viewed as "trappy" for the naive 
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people 
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the 
> trappiness onto Lucene-external workarounds for systems/users that want to 
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency 
> guarantees that present opportunities for future optimizations (e.g., shared 
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues 
> directly to {{IndexingChain}}. The initial proposal involves extending the 
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to 
> existing {{IndexableFieldType.docValuesType()}}).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

