[jira] [Commented] (LUCENE-10023) Multi-token post-analysis DocValues

Michael Gibney (Jira) Thu, 08 Jul 2021 18:39:07 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377681#comment-17377681
 ]


Michael Gibney commented on LUCENE-10023:
-----------------------------------------

{quote}I'm not seeing this looking at your patch. I see duplicate inversion: 
e.g. tokenstream consumed twice.
{quote}

The [usual 
case|https://github.com/apache/lucene/blob/75339d672eff6ea2d7e67e1877b801208740f49d/lucene/core/src/java/org/apache/lucene/index/IndexingChain.java#L1316-L1320]
 is within the [main "while 
loop"|https://github.com/apache/lucene/blob/75339d672eff6ea2d7e67e1877b801208740f49d/lucene/core/src/java/org/apache/lucene/index/IndexingChain.java#L1232-L1343]
 of {{IndexingChain.invert(int, IndexableField, boolean)}}. The 
[IndexingChain.tokenizedDocValuesOnly(int, 
IndexableField)|https://github.com/apache/lucene/blob/75339d672eff6ea2d7e67e1877b801208740f49d/lucene/core/src/java/org/apache/lucene/index/IndexingChain.java#L1146-L1186]
 method is a mutually exclusive simplification of the usual case (largely 
redundant, as called out in the javadocs for that method). Unless I'm missing 
something, the code indeed should only consume the TokenStream once for any 
given field.

{quote}it isn't clear the current PR works in all cases (e.g. indexed=false, 
but Field passed with a Tokenstream)
{quote}

There's only one place in IndexingChain where the {{stream.incrementToken()}} 
is called in a loop -- i.e., where the TokenStream is consumed; since that's 
exactly where the PR hooks in (via {{TermToBytesRefAttribute}}) to generate 
post-analysis DocValues, I'd be surprised if it were missing edge cases. To be 
sure, it wouldn't be the first time I've been surprised :-) ... but it'd be an 
easy enough thing check if/when making tests more robust.

{quote}I don't view this as an advantage. In fact it seems to only matter for 
exactly the trappy cases that I am concerned about. If docs are so large, and 
you dont have enough ram to index a single doc into SortedSetDocValues itself, 
well, that says everything :-)
{quote}

Fair point :-); and accordingly, given that the "trappy" use case isn't close 
to my heart, I don't intend to argue this point to the hilt. I do however 
wonder in principle about the "text corpus analytics" use case ... 
_intentional_ "misuse" being otherwise known as simply: "use". (FWIW, 
equivalent functionality is available in Solr, but only via "uninverting" -- a 
rather extreme baseline for evaluating "trappiness"!).

To my mind, a lot hinges on whether this PR really _does_ miss edge cases. In 
its current state (for the moment assuming it _properly_ handles edge cases), 
the PR feels pretty clean and straightforward to me. The change is quite small 
-- both in terms of "lines of code" and also, I'd argue, it is _kind of_ 
"sugar" in the sense that it doesn't actually introduce fundamental changes -- 
it simply adds first-class support (and some efficiency gains) for a particular 
use case.

I guess the essence of the question is whether the likelihood of accidental 
misuse (and severity of associated consequences) warrant the exclusion of a 
fairly superficial change that would add efficient first-class support for a 
number of legitimate use cases. (This in addition to weighing the complexity of 
the change, which to me does not seem inordinately great).

> Multi-token post-analysis DocValues
> -----------------------------------
>
>                 Key: LUCENE-10023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10023
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael Gibney
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by 
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but 
> there are cases where it would be desirable to have post-analysis DocValues 
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms 
> aggregation. I understand that this could be viewed as "trappy" for the naive 
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people 
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the 
> trappiness onto Lucene-external workarounds for systems/users that want to 
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency 
> guarantees that present opportunities for future optimizations (e.g., shared 
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues 
> directly to {{IndexingChain}}. The initial proposal involves extending the 
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to 
> existing {{IndexableFieldType.docValuesType()}}).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-10023) Multi-token post-analysis DocValues

Reply via email to