Yilun Cui created LUCENE-9635:
---------------------------------
Summary: BM25FQuery - MultiNormsLeafSimScorer needs to mask long
value for long documents
Key: LUCENE-9635
URL: https://issues.apache.org/jira/browse/LUCENE-9635
Project: Lucene - Core
Issue Type: Bug
Components: modules/sandbox
Affects Versions: 8.6
Reporter: Yilun Cui
While experimenting with BM25FQuery on long documents, I discovered a bug:
the encoded norm's long value is not masked during scoring. For long
documents (or long fields) this may cause
ArrayIndexOutOfBoundsExceptions.
The line where I suspect the bug is exposed is here:
https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/MultiNormsLeafSimScorer.java#L131
Here is a similar use in BM25Similarity, with the masking applied:
https://github.com/apache/lucene-solr/blob/c413656b627160d49eb9e9f1f84ec4945db80f0e/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L233
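To illustrate the difference, here is a minimal standalone sketch. It does not use Lucene; the class name, the LENGTH_TABLE contents, and the helper methods are hypothetical stand-ins. It assumes the encoded norm is a single byte read back as a sign-extended long, so that encoded values of 128 and above arrive as negative numbers.

```java
public class NormMaskSketch {
    // Hypothetical stand-in for a 256-entry decode table indexed by the
    // encoded norm byte. The values are placeholders, not real lengths.
    static final float[] LENGTH_TABLE = new float[256];
    static {
        for (int i = 0; i < LENGTH_TABLE.length; i++) {
            LENGTH_TABLE[i] = i;
        }
    }

    // Index computed without masking: a negative byte stays negative.
    static int unmaskedIndex(long encodedNorm) {
        return (byte) encodedNorm;
    }

    // Index computed with masking: the byte is treated as unsigned (0..255).
    static int maskedIndex(long encodedNorm) {
        return ((byte) encodedNorm) & 0xFF;
    }

    // Returns true if the unmasked lookup throws for this encoded norm.
    static boolean unmaskedLookupThrows(long encodedNorm) {
        try {
            float ignored = LENGTH_TABLE[unmaskedIndex(encodedNorm)];
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        long shortField = 42L;   // high bit clear: both lookups work
        long longField = -128L;  // assumed sign-extended form of byte 0x80

        System.out.println(unmaskedLookupThrows(shortField));
        System.out.println(unmaskedLookupThrows(longField));
        System.out.println(maskedIndex(longField));
    }
}
```

For small encoded values the two index computations agree, which would be consistent with the failure only showing up once a field is long enough for the encoded byte to have its high bit set.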
My experimentation shows that to expose this bug, there must be a match for a
token in more than one field (which is what BM25FQuery is for). In addition,
one of the fields must be >= 32792 tokens long.
I've provided tests in the pull request to demonstrate this.