[jira] [Commented] (LUCENE-10023) Multi-token post-analysis DocValues

Michael Gibney (Jira) Fri, 09 Jul 2021 08:02:05 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378110#comment-17378110
 ]


Michael Gibney commented on LUCENE-10023:
-----------------------------------------

{quote}This is the wrong way to think about it. With a library like this, we 
need to think about it the other way.
{quote}

I'm on board with that. I should have phrased the "essential question" 
differently -- I didn't mean to imply that the burden of proof was on the 
"against" side -- it's of course the opposite.

I did my best to explain the arguments in favor of inclusion, and don't have 
anything to add on that side. I do think it's likely more efficient than users 
doing this themselves ("more efficient by how much" is an open, relevant 
question, and indeed the benefit would be more pronounced for the cases to 
which you're least sympathetic).

It's a fair point about long-term complexity/scope creep (particularly in such 
a central component). Every change carries that risk, but on the spectrum of 
risk I honestly think this particular change comes in relatively low: It hooks 
in cleanly (reading values from TermToBytesRefAttribute at exactly the same 
point as indexing does, writing directly to docValuesWriter); and once the 
docValues are written, they're "just docValues" -- there are no additional 
lower-level restrictions or assumptions to make/violate.

{quote}I imagine users hitting problems that simple limits would [not] solve
{quote}
Yes, there are a number of nuanced limits that could be useful (and that 
could/should be configured via TokenFilters); but the simple limit proposed 
directly in IndexingChain is not meant as a convenience. It is strictly a 
failsafe to prevent users from shooting themselves in the foot. A simple limit 
is sufficient for this purpose, and with an open mind, I honestly don't 
anticipate a strong temptation to add more nuanced limits directly to 
IndexingChain.
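
To illustrate the kind of failsafe meant here, a minimal standalone sketch 
follows. It is not code from the PR and does not use Lucene classes: the class 
name, the {{maxValuesPerDoc}} parameter, and the use of plain strings in place 
of analyzed term bytes are all assumptions made for illustration.

```java
import java.util.TreeSet;

public class TokenDocValuesLimitSketch {

    /**
     * Hypothetical failsafe: collects the distinct post-analysis tokens of one
     * field in one document and fails fast once the number of distinct values
     * exceeds maxValuesPerDoc. Plain strings stand in for term bytes.
     */
    public static TreeSet<String> collectTokenValues(
            String field, Iterable<String> tokens, int maxValuesPerDoc) {
        TreeSet<String> values = new TreeSet<>();
        for (String token : tokens) {
            // add() returns false for a repeated token, so repeats never count
            // against the limit.
            if (values.add(token) && values.size() > maxValuesPerDoc) {
                throw new IllegalArgumentException(
                        "field \"" + field + "\" produced more than "
                                + maxValuesPerDoc + " distinct token values");
            }
        }
        return values;
    }
}
```

With a limit of 2, the tokens "b", "a", "b" yield two values, while "a", "b", 
"c" is rejected as soon as the third distinct value is seen. Anything more 
nuanced than this would belong in a TokenFilter, as noted above.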

From my perspective, the complexity/negative consequences I could anticipate 
from this PR are: at the index level, "docValues are docValues" -- no 
distinction between pre- and post-analysis. So if checking consistency between 
FieldType and index on-disk, you'd have to check against 
FieldType.docValuesType() _and_ FieldType.tokenDocValuesType(), and perhaps 
check that these two are mutually exclusive for a given FieldType. The other 
potential issue would be if there's a chance that calling 
{{indexDocValue(...)}} from _within_ the {{IndexingChain.invert(int, 
IndexableField, boolean)}} would for some reason be disallowed.
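
The consistency check described above could look roughly like the following 
standalone sketch. It is an illustration only, not code from the PR: the 
{{DvType}} enum is a local stand-in for Lucene's {{DocValuesType}}, and the 
method names {{effectiveType}} and {{matchesIndex}} are hypothetical.

```java
public class TokenDocValuesConsistencySketch {

    /** Local stand-in for Lucene's DocValuesType, so the sketch needs no Lucene jar. */
    public enum DvType { NONE, NUMERIC, BINARY, SORTED, SORTED_NUMERIC, SORTED_SET }

    /**
     * Enforces that pre-analysis and post-analysis docValues are mutually
     * exclusive for one field type, and returns whichever one is configured
     * (NONE if neither is).
     */
    public static DvType effectiveType(DvType docValuesType, DvType tokenDocValuesType) {
        if (docValuesType != DvType.NONE && tokenDocValuesType != DvType.NONE) {
            throw new IllegalArgumentException(
                    "docValuesType and tokenDocValuesType are mutually exclusive");
        }
        return docValuesType != DvType.NONE ? docValuesType : tokenDocValuesType;
    }

    /**
     * On disk, "docValues are docValues": the index records a single type, so
     * the field type is consistent if its effective type equals that type.
     */
    public static boolean matchesIndex(
            DvType docValuesType, DvType tokenDocValuesType, DvType onDiskType) {
        return effectiveType(docValuesType, tokenDocValuesType) == onDiskType;
    }
}
```

Under these assumptions, a field configured with only a token docValues type 
of SORTED_SET is consistent with an on-disk SORTED_SET field, and configuring 
both types on the same field is rejected up front.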

I'm not seeking to needlessly prolong this discussion, just continuing to think 
through these issues. Robert, I recognize I'm not going to end up convincing 
you ultimately, but I sincerely appreciate the feedback/consideration.

> Multi-token post-analysis DocValues
> -----------------------------------
>
>                 Key: LUCENE-10023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10023
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael Gibney
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by 
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but 
> there are cases where it would be desirable to have post-analysis DocValues 
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms 
> aggregation. I understand that this could be viewed as "trappy" for the naive 
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people 
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the 
> trappiness onto Lucene-external workarounds for systems/users that want to 
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency 
> guarantees that present opportunities for future optimizations (e.g., shared 
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues 
> directly to {{IndexingChain}}. The initial proposal involves extending the 
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to 
> existing {{IndexableFieldType.docValuesType()}}).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
