Re: [PR] similarities: provide default computeNorm implementation; remove remaining discountOverlaps setters; [lucene]

via GitHub Wed, 11 Sep 2024 13:41:29 -0700


rmuir commented on PR #13757:
URL: https://github.com/apache/lucene/pull/13757#issuecomment-2344662997


   > This looks good to me. @rmuir For my understanding, is there ever a good 
reason to set discountOverlaps to false?
   
   The `discountOverlaps` option dates from the time when TF/IDF was the only 
scoring model and index statistics were limited to `docFreq()` and `maxDoc()`.
   
   It is "easy" to understand: should a document's length be penalized for 
synonyms? At the same time, it is tricky to measure how well the option works, 
because a Lucene user can easily inject many artificial "synonym-like" tokens 
through the analysis chain in nearly infinite ways (e.g. with word-delimiter 
filters). So what would even be a fair measure?
   
   Most modern scorers do something like BM25's `dl/avgdl`, which makes this 
option harder to reason about. For example, `discountOverlaps` still works in 
the BM25 case: a document doesn't get punished relative to other documents 
simply because it happens to have more synonyms or word-delimiter tokens. But 
all documents get "skewed" in their scoring: this option only "discounts" the 
per-document `dl` and does not affect the `avgdl` for the field (which is 
computed solely from term dictionary statistics). Since the skew is the same 
for all documents, if you have tons of injected synonyms and are worried about 
it, you might try decreasing BM25's `b` parameter to compensate. Sorry, I 
haven't played with this much; these are just general thoughts.
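   
   To make the skew concrete, here is a rough sketch of the arithmetic. It is 
plain Java, not Lucene's `BM25Similarity`: the class and method names are 
hypothetical, the document sizes are made up, and it ignores the lossy 
encoding Lucene applies to norms. It assumes `avgdl` is the field's total 
token count divided by its document count, with overlaps included.

```java
// Sketch only: plain arithmetic, not Lucene code. All names here are hypothetical.
final class DiscountOverlapsSketch {

    // Per-document length used for the norm. Overlap tokens (position
    // increment 0, e.g. injected synonyms) are dropped when discounting.
    static int docLength(int totalTokens, int overlapTokens, boolean discountOverlaps) {
        return discountOverlaps ? totalTokens - overlapTokens : totalTokens;
    }

    // Average length from field-level statistics. Every token is counted,
    // overlaps included, regardless of the discountOverlaps setting.
    static double avgDocLength(long sumTotalTermFreq, int docCount) {
        return (double) sumTotalTermFreq / docCount;
    }

    // BM25 length-normalization factor: 1 - b + b * dl / avgdl.
    static double lengthNorm(double b, double dl, double avgdl) {
        return 1.0 - b + b * dl / avgdl;
    }

    public static void main(String[] args) {
        // Doc A: 100 tokens, no overlaps. Doc B: 150 tokens, 50 of them overlaps.
        double avgdl = avgDocLength(100 + 150, 2);
        for (double b : new double[] {0.75, 0.5}) {
            double normA = lengthNorm(b, docLength(100, 0, true), avgdl);
            double normB = lengthNorm(b, docLength(150, 50, true), avgdl);
            double normBRaw = lengthNorm(b, docLength(150, 50, false), avgdl);
            System.out.printf("b=%.2f A=%.3f B(discounted)=%.3f B(raw)=%.3f%n",
                    b, normA, normB, normBRaw);
        }
    }
}
```

   With discounting, A and B get the same factor, so B is not punished 
relative to A. Both factors sit below 1, though, because `avgdl` still counts 
the overlaps. A smaller `b` pulls the factor back toward 1.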
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


