[
https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378269#comment-17378269
]
Michael Gibney commented on LUCENE-10023:
-----------------------------------------
That would make sense. Really the main difference in a sandbox-based impl would
be performance (the TokenStream would have to be consumed a second time, and the
token BytesRefs buffered). Having a concrete sandbox impl available would give a
baseline for evaluating any performance difference, and would also address the
desire to add this functionality in a place where it is factored out and
accessible to Elasticsearch, Solr, etc.
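For concreteness, a sandbox-style helper might look roughly like the sketch
below; the {{addTokenDocValues}} name and the "_dv" suffix are placeholders,
not a proposed API. It re-runs the analysis chain over the field value and
buffers each token's BytesRef into a per-token SortedSetDocValues entry:
{code:java}
// Sketch only: a sandbox-style helper that adds post-analysis, per-token
// SortedSetDocValues alongside the normal indexed field. The helper name and
// the "_dv" suffix are arbitrary placeholders, not a proposed API.
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.util.BytesRef;

public final class TokenDocValuesUtil {

  private TokenDocValuesUtil() {}

  /** Adds the indexed field plus one SortedSetDocValuesField per post-analysis token. */
  public static void addTokenDocValues(Document doc, String name, String value, Analyzer analyzer)
      throws IOException {
    // normal inverted-index representation; IndexingChain consumes the analyzer once here
    doc.add(new TextField(name, value, Field.Store.NO));
    // second consumption of the analysis chain, with a per-token BytesRef copy:
    // this is the extra work a sandbox impl pays relative to doing it in IndexingChain
    try (TokenStream ts = analyzer.tokenStream(name, value)) {
      TermToBytesRefAttribute termAtt = ts.getAttribute(TermToBytesRefAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        // SortedSetDocValues sorts and dedupes the per-document values at flush time
        doc.add(new SortedSetDocValuesField(name + "_dv", BytesRef.deepCopyOf(termAtt.getBytesRef())));
      }
      ts.end();
    }
  }
}
{code}
Whether that double analysis actually matters in practice is exactly the kind
of thing a concrete sandbox impl would let us measure.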
My only hesitation about the sandbox approach: if there's not even a remote
thought of ever evaluating the performance gain that closer integration in
IndexingChain could bring, or of entertaining the legitimacy of the "text
corpus analytics"/many-token use case (with the trappiness somehow mitigated),
then the sandbox change would be _exclusively_ sugar.
This is neither here nor there, and not an argument against the sandbox
approach, but tbh it likely wouldn't have occurred to me to file a Lucene issue
for this if the change were _strictly_ about sugar, with no performance aspect.
That said, I think it still might be worth pursuing a sandbox-based approach,
particularly if:
# there's _any_ potential for revisiting closer integration in IndexingChain,
for performance reasons (see the rough API sketch after this list), or
# there's interest in leveraging this from "other-than-Solr" (e.g.,
Elasticsearch, etc. ... I'm approaching this from the Solr side, so I could just
as well implement it there, in a custom Solr FieldType, I think).
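To make the "closer integration" option in point 1 concrete: the description
below proposes extending {{IndexableFieldType}} with a {{tokenDocValuesType()}}
accessor alongside the existing {{docValuesType()}}. A rough sketch of that API
shape follows; nothing here exists in Lucene today, and the real change would
presumably add the method to {{IndexableFieldType}} itself rather than
introduce a separate interface.
{code:java}
// Sketch only: the API extension floated in the issue description. Neither this
// interface nor tokenDocValuesType() exists in Lucene; it is shown here just to
// make the IndexingChain-integrated alternative concrete.
import org.apache.lucene.index.DocValuesType;
import org.apache.lucene.index.IndexableFieldType;

public interface TokenDocValuesCapableFieldType extends IndexableFieldType {

  /**
   * Companion to {@link #docValuesType()}: if not {@link DocValuesType#NONE} (e.g.
   * {@link DocValuesType#SORTED_SET}), IndexingChain would record a DocValues entry
   * for every post-analysis token, rather than a single value for the whole field.
   */
  DocValuesType tokenDocValuesType();
}
{code}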
I probably won't move immediately to implementing a sandbox-based approach; I'd
be interested to hear from anyone inclined to weigh in on whether they'd find
such an approach useful.
> Multi-token post-analysis DocValues
> -----------------------------------
>
> Key: LUCENE-10023
> URL: https://issues.apache.org/jira/browse/LUCENE-10023
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael Gibney
> Priority: Major
> Time Spent: 50m
> Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but
> there are cases where it would be desirable to have post-analysis DocValues
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms
> aggregation. I understand that this could be viewed as "trappy" for the naive
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the
> trappiness onto Lucene-external workarounds for systems/users that want to
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency
> guarantees that present opportunities for future optimizations (e.g., shared
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues
> directly to {{IndexingChain}}. The initial proposal involves extending the
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to
> existing {{IndexableFieldType.docValuesType()}}).