[jira] [Commented] (LUCENE-10023) Multi-token post-analysis DocValues

Michael Gibney (Jira) Tue, 13 Jul 2021 11:00:13 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380092#comment-17380092
 ]


Michael Gibney commented on LUCENE-10023:
-----------------------------------------

{quote}this sentence is a bit misleading and means that we don't support this 
aggregation on fields that enable only doc values, ie. we require the field to 
be indexed to have access to term frequencies.
{quote}

Ah, ok! For "significant_terms" it looks like the "subset" (foreground set) 
count is calculated via docValues API, but the field must be indexed in order 
to calculate "superset" (background set) count, via one of:
# accessing static doc freq (for terms with no backgroundFilter), or
# calculating the intersection of backgroundFilter with each candidate bucket 
value (either via FilterableTermsEnum or BooleanQuery).
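
To make the split above concrete, here is a rough toy model of the two counting paths. It is only a sketch using plain JDK collections, not Lucene or Elasticsearch code: the class name, method names, and sample data are all made up for illustration. The "doc values" side is modelled as a doc -> terms list and the "indexed" side as a term -> docs map.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical names, plain JDK collections, no Lucene classes.
final class SignificantTermsSketch {

    // Foreground ("subset") counts: iterate the matching docs and read each
    // doc's values, as one would from a doc -> terms (doc-values-like) structure.
    static Map<String, Integer> foregroundCounts(List<Set<String>> docValues,
                                                 List<Integer> matchingDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc : matchingDocs) {
            for (String term : docValues.get(doc)) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Background ("superset") count, case 1: no background filter, so the
    // static doc freq from the term -> docs (inverted) structure is enough.
    static int backgroundDocFreq(Map<String, Set<Integer>> postings, String term) {
        Set<Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    // Background count, case 2: intersect the term's postings with the
    // background filter, once per candidate bucket value.
    static int backgroundFiltered(Map<String, Set<Integer>> postings, String term,
                                  Set<Integer> backgroundFilter) {
        int count = 0;
        for (int doc : postings.getOrDefault(term, Set.of())) {
            if (backgroundFilter.contains(doc)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Four docs; the same data viewed doc -> terms and term -> docs.
        List<Set<String>> docValues = List.of(
                Set.of("whale", "ship"), Set.of("whale", "sea"),
                Set.of("ship"), Set.of("whale"));
        Map<String, Set<Integer>> postings = Map.of(
                "whale", Set.of(0, 1, 3), "ship", Set.of(0, 2), "sea", Set.of(1));

        Map<String, Integer> fg = foregroundCounts(docValues, List.of(0, 1));
        System.out.println("foreground whale = " + fg.get("whale"));
        System.out.println("background whale = " + backgroundDocFreq(postings, "whale"));
        System.out.println("filtered whale   = "
                + backgroundFiltered(postings, "whale", Set.of(1, 2, 3)));
    }
}
```

In this model only the two background methods touch the term -> docs map, which is the part that requires the field to be indexed.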

In any case, iiuc this approach is problematic for "full text" mainly because 
"full text" fields tend to be high-cardinality. Put another way: 
"significant_terms" over a hypothetical "full text" field with post-analysis 
DocValues enabled would perform about the same as over a DocValues-enabled 
keyword field of equivalent cardinality (or perhaps _slightly_ worse, 
due to higher mean per-term docFreq). This is not a revolutionary observation 
... but it's relevant because an entirely DocValues-driven method of 
calculating "relatedness"/"significant_terms" (as is the case now in Solr) 
should scale well enough wrt field cardinality that full-domain 
"significant_terms" would become viable over "full text" fields. In this 
context, there is a practical reason to prefer multi-token post-analysis 
DocValues for "full text" fields, as opposed to a restricted-domain, 
term-vectors-based approach.

I'm mainly mentioning this because I agree that in the _absence_ of a purely 
DocValues-driven approach to calculating "relatedness"/"significant_terms", the 
practical argument in favor of multi-token post-analysis DocValues for 
"significant_terms" over full text would indeed be weak; so it's worth noting 
that such a purely DocValues-driven approach has in fact been implemented.

> Multi-token post-analysis DocValues
> -----------------------------------
>
>                 Key: LUCENE-10023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10023
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael Gibney
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by 
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but 
> there are cases where it would be desirable to have post-analysis DocValues 
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms 
> aggregation. I understand that this could be viewed as "trappy" for the naive 
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people 
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the 
> trappiness onto Lucene-external workarounds for systems/users that want to 
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency 
> guarantees that present opportunities for future optimizations (e.g., shared 
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues 
> directly to {{IndexingChain}}. The initial proposal involves extending the 
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to 
> existing {{IndexableFieldType.docValuesType()}}).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
