[jira] [Commented] (LUCENE-10023) Multi-token post-analysis DocValues

Robert Muir (Jira) Thu, 08 Jul 2021 20:42:08 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377769#comment-17377769
 ]


Robert Muir commented on LUCENE-10023:
--------------------------------------

{quote}
I guess the essence of the question is whether the likelihood of accidental 
misuse (and severity of associated consequences) warrant the exclusion of a 
fairly superficial change that would add efficient first-class support for a 
number of legitimate use cases. (This in addition to weighing the complexity of 
the change, which to me does not seem inordinately great).
{quote}

This is the wrong way to think about it. With a library like this, we need to 
think about it the other way. 

Why should this be in IndexWriter? It isn't any more efficient than users doing 
it themselves, and it is a 1% case. These 1% cases kill us: they all stack up 
and make it ultimately impossible to optimize the 99% case. 

We should make the common cases easy and the rare/corner cases possible, and 
that is it.

Please don't look at your PR and assume the code will look the same five years 
from now; it won't. Even the simplest changes double and triple in size after 
refactorings, corner cases, etc.

For example, in this case I imagine users hitting problems that simple limits 
would solve. Suddenly we have more complexity as various limits are added to 
IndexWriter for it, all of which could be avoided if the change were just kept 
out of IndexWriter in the first place. That way, users doing expert shit 
themselves could add their own logic (e.g. only put the top-N most common terms 
per doc in the DocValues field, limit it to N terms, both, or other more 
complicated things).
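
A minimal sketch of the kind of user-side logic described above, assuming the 
application has already run its own analysis and holds one document's tokens as 
a list of strings. The class name {{TopTerms}} and method {{select}} are 
hypothetical and are not part of Lucene; in a real application the tokens would 
typically come from {{Analyzer.tokenStream(...)}} and each selected term would 
be added to the document as a {{SortedSetDocValuesField}}.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Lucene class: picks which post-analysis terms
// an application would store as doc values for a single document.
final class TopTerms {

    // Returns at most n distinct tokens, most frequent first.
    // Ties are broken alphabetically so the result is deterministic.
    static List<String> select(List<String> tokens, int n) {
        // Count occurrences of each token within this one document.
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }

        // Order distinct terms by descending frequency, then by term text.
        List<String> terms = new ArrayList<>(counts.keySet());
        terms.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });

        // Apply the application-chosen limit (the "limit to N terms" part).
        int limit = Math.min(Math.max(n, 0), terms.size());
        return new ArrayList<>(terms.subList(0, limit));
    }
}
```

Because the limit and the ranking live in application code, they can be changed 
(or replaced with something more complicated) without adding any configuration 
to IndexWriter.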

Personally, I don't agree with this change, as it only buys us future 
maintenance burden. I don't see it being more efficient than the user doing it 
themselves.

> Multi-token post-analysis DocValues
> -----------------------------------
>
>                 Key: LUCENE-10023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10023
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael Gibney
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by 
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but 
> there are cases where it would be desirable to have post-analysis DocValues 
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms 
> aggregation. I understand that this could be viewed as "trappy" for the naive 
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people 
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the 
> trappiness onto Lucene-external workarounds for systems/users that want to 
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency 
> guarantees that present opportunities for future optimizations (e.g., shared 
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues 
> directly to {{IndexingChain}}. The initial proposal involves extending the 
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to 
> existing {{IndexableFieldType.docValuesType()}}).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

