[ https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378110#comment-17378110 ]
Michael Gibney commented on LUCENE-10023:
-----------------------------------------

{quote}This is the wrong way to think about it. With a library like this, we need to think about it the other way.
{quote}
I'm on board with that. I should have phrased the "essential question" differently -- I didn't mean to imply that the burden of proof was on the "against" side -- it's of course the opposite. I did my best to explain the arguments in favor of inclusion, and don't have anything to add on that side. I do think it's likely more efficient than users doing this themselves ("more efficient by how much" is an open, relevant question, and indeed the benefit would be more pronounced for the cases to which you're least sympathetic).

It's a fair point about long-term complexity/scope creep (particularly in such a central component). Every change carries that risk, but on the spectrum of risk I honestly think this particular change comes in relatively low: It hooks in cleanly (reading values from TermToBytesRefAttribute at exactly the same point as indexing does, writing directly to docValuesWriter); and once the docValues are written, they're "just docValues" -- there are no additional lower-level restrictions or assumptions to make/violate.

{quote}I imagine users hitting problems that simple limits would [not] solve
{quote}
Yes, there are a number of nuanced limits that could be useful (and that could/should be configured via TokenFilters); but the idea to add a simple limit directly in IndexingChain isn't as a convenience, but strictly as a failsafe to prevent users shooting themselves in the foot. A simple limit is sufficient for this purpose, and with an open mind, I honestly don't anticipate a strong temptation to add more nuanced limits directly to IndexingChain.

From my perspective, the complexity/negative consequences I could anticipate from this PR are: at the index level, "docValues are docValues" -- no distinction between pre- and post-analysis. So if checking consistency between FieldType and index on-disk, you'd have to check against FieldType.docValuesType() _and_ FieldType.tokenDocValuesType(), and perhaps check that these two are mutually exclusive for a given FieldType. The other potential issue would be if there's a chance that calling {{indexDocValue(...)}} from _within_ the {{IndexingChain.invert(int, IndexableField, boolean)}} would for some reason be disallowed.

I'm not seeking to needlessly prolong this discussion, just continuing to think through these issues. Robert, I recognize I'm not going to end up convincing you ultimately, but I sincerely appreciate the feedback/consideration.

> Multi-token post-analysis DocValues
> -----------------------------------
>
>                 Key: LUCENE-10023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10023
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael Gibney
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but
> there are cases where it would be desirable to have post-analysis DocValues
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms
> aggregation. I understand that this could be viewed as "trappy" for the naive
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the
> trappiness onto Lucene-external workarounds for systems/users that want to
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency
> guarantees that present opportunities for future optimizations (e.g., shared
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues
> directly to {{IndexingChain}}. The initial proposal involves extending the
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to
> existing {{IndexableFieldType.docValuesType()}}).
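
For illustration only, the consistency check mentioned in the comment (comparing the on-disk DocValues type against both {{docValuesType()}} and the proposed {{tokenDocValuesType()}}, and requiring the two to be mutually exclusive) might be sketched roughly as below. This is a standalone sketch, not code from the PR: {{DvType}} is a stand-in for Lucene's {{DocValuesType}} so that it compiles without Lucene, and the class and method names ({{TokenDocValuesCheck}}, {{effectiveType}}, {{matchesIndex}}) are hypothetical.

```java
import java.util.Objects;

final class TokenDocValuesCheck {
    /** Stand-in for Lucene's DocValuesType so the sketch compiles without Lucene (assumption). */
    enum DvType { NONE, SORTED, SORTED_SET }

    /**
     * Resolves the single DocValues type a field may write.
     * Rejects a field type that configures both pre-analysis and post-analysis DocValues.
     */
    static DvType effectiveType(DvType docValuesType, DvType tokenDocValuesType) {
        Objects.requireNonNull(docValuesType, "docValuesType");
        Objects.requireNonNull(tokenDocValuesType, "tokenDocValuesType");
        boolean pre = docValuesType != DvType.NONE;
        boolean post = tokenDocValuesType != DvType.NONE;
        if (pre && post) {
            throw new IllegalArgumentException(
                "docValuesType and tokenDocValuesType are mutually exclusive");
        }
        return post ? tokenDocValuesType : docValuesType;
    }

    /** True when the type recorded in the index matches whichever of the two is configured. */
    static boolean matchesIndex(DvType docValuesType, DvType tokenDocValuesType, DvType onDisk) {
        return effectiveType(docValuesType, tokenDocValuesType) == onDisk;
    }

    public static void main(String[] args) {
        // Post-analysis SORTED_SET configured, SORTED_SET on disk: consistent.
        System.out.println(matchesIndex(DvType.NONE, DvType.SORTED_SET, DvType.SORTED_SET)); // prints true
        // Pre-analysis SORTED configured, SORTED_SET on disk: inconsistent.
        System.out.println(matchesIndex(DvType.SORTED, DvType.NONE, DvType.SORTED_SET)); // prints false
    }
}
```

Since the index itself makes no distinction between pre- and post-analysis values, the sketch collapses the two settings into one effective type before comparing against what is on disk.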