rmuir commented on PR #13757: URL: https://github.com/apache/lucene/pull/13757#issuecomment-2344662997
> This looks good to me. @rmuir For my understanding, is there ever a good reason to set discountOverlaps to false?

The `discountOverlaps` option dates from the time when TF/IDF was the only scoring model and index statistics were limited to `docFreq()` and `maxDoc()`. It is "easy" to understand: should a document's length be punished by synonyms?

But at the same time, it is tricky to measure how well it is working, since the Lucene user can easily inject a lot of artificial "synonym-like things" in nearly infinite ways (e.g. word-delimiter filters and the like) through the analysis chain. So what would even be a fair measure?

Most of the modern scorers do something like BM25's `dl/avgdl`, which makes this option harder to reason about. For example, `discountOverlaps` still works in the BM25 case: a document doesn't get punished relative to other documents simply because it happened to have more synonyms or word-delimiter splits. But all documents get "skewed" in their scoring: this option only "discounts" the per-document `dl` and does not affect the `avgdl` for the field (which is computed solely from term dictionary statistics).

Since this is the same skew for all documents, if you have tons of injected synonyms and are worried about it, you might want to try decreasing BM25's `b` parameter to adjust?

Sorry, I haven't played with this much, these are just general thoughts.
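To make the `dl/avgdl` point concrete, here is a toy numeric sketch. It is not Lucene code: the document lengths are made up, the helper names are hypothetical, and it assumes only what is described above, namely that `avgdl` counts every indexed token (overlaps included) while `discountOverlaps` removes overlap tokens from the per-document `dl`.

```python
def bm25_length_norm(dl, avgdl, b=0.75):
    """Standard BM25 length normalization term: 1 - b + b * dl / avgdl."""
    return 1.0 - b + b * dl / avgdl

# Hypothetical field contents per document:
# (tokens at new positions, overlap tokens such as injected synonyms)
docs = [(100, 0), (100, 50), (200, 100)]

# Assumption: avgdl comes from field-level token totals, so overlaps count.
avgdl = sum(real + overlaps for real, overlaps in docs) / len(docs)

def doc_length(real, overlaps, discount_overlaps):
    """Per-document dl, with or without overlap tokens."""
    return real if discount_overlaps else real + overlaps

def ratios(discount_overlaps):
    """dl / avgdl for every document in the toy field."""
    return [doc_length(r, o, discount_overlaps) / avgdl for r, o in docs]

for discount in (False, True):
    print("discountOverlaps =", discount, [round(x, 3) for x in ratios(discount)])
```

With discounting on, the first two documents (same real length, different synonym counts) get the same `dl/avgdl`, so neither is punished for its synonyms. The average ratio across the field, however, drops below 1, which is the skew mentioned above. Passing a smaller `b` to `bm25_length_norm` pulls the normalization term back toward 1, so the skew has less effect on the score.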