[ https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377769#comment-17377769 ]
Robert Muir commented on LUCENE-10023:
--------------------------------------

{quote}
I guess the essence of the question is whether the likelihood of accidental misuse (and severity of associated consequences) warrant the exclusion of a fairly superficial change that would add efficient first-class support for a number of legitimate use cases. (This in addition to weighing the complexity of the change, which to me does not seem inordinately great).
{quote}

This is the wrong way to think about it. With a library like this, we need to think about it the other way. Why should this be in IndexWriter? It isn't any more efficient than users doing it themselves, and it is a 1% case. These 1% cases kill us: they all stack up and make it ultimately impossible to optimize the 99% case. We should make the common cases easy and the rare/corner cases possible, and that is it.

Please, don't look at your PR and think that the code will look the same 5 years from now; it won't. The simplest changes always double and triple in size after refactorings, corner cases, etc.

For example, in this case I imagine users hitting problems that simple limits would solve. Suddenly we have more complexity as various limits are added to IndexWriter for it, all of which could be avoided if the change was just kept out of IndexWriter in the first place. That way, users themselves doing expert shit could add their own logic (e.g. only put the top-N most common terms per doc in the thing, limit to N terms, both, or other more complicated things).

Personally, I don't agree with this change, as it only buys us future maintenance burden. I don't see it being more efficient than the user doing it themselves.
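To make the "users add their own logic" point concrete, here is a minimal sketch in plain Java of the kind of per-document limit mentioned above: keep only the N most common terms of a document. The class and method names are hypothetical and are not part of Lucene. It assumes the caller has already run the field text through their own analysis and collected the resulting tokens as strings, and that the selected terms would then be added to the document as doc values by the application (for example one {{SortedSetDocValuesField}} per term); that step is left out so the sketch has no dependencies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical user-side helper; nothing here is a Lucene API.
final class TopTermsSelector {
    private TopTermsSelector() {}

    /**
     * Returns at most maxTerms distinct tokens, most frequent first.
     * Ties are broken alphabetically so the result is deterministic.
     * The input is assumed to be the already-analyzed tokens of one document.
     */
    static List<String> topTerms(List<String> tokens, int maxTerms) {
        if (maxTerms < 0) {
            throw new IllegalArgumentException("maxTerms must be >= 0");
        }
        // Count how often each token occurs in this document.
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        // Order by descending count, then by term.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // Apply the cap; these are the terms an application might store as doc values.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < maxTerms; i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

Because this policy lives in application code, it can be changed (a different cap, a different ranking, or both) without any new option on IndexWriter.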

> Multi-token post-analysis DocValues
> -----------------------------------
>
>                 Key: LUCENE-10023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10023
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael Gibney
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by
> {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but
> there are cases where it would be desirable to have post-analysis DocValues
> based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms
> aggregation. I understand that this could be viewed as "trappy" for the naive
> "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people
> shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the
> trappiness onto Lucene-external workarounds for systems/users that want to
> support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency
> guarantees that present opportunities for future optimizations (e.g., shared
> Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues
> directly to {{IndexingChain}}. The initial proposal involves extending the
> API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to
> existing {{IndexableFieldType.docValuesType()}}).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)