[ 
https://issues.apache.org/jira/browse/LUCENE-9635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yilun Cui updated LUCENE-9635:
------------------------------
    Description: 
While experimenting with BM25FQuery on long documents, I found that the 
encoded norm's long value is not masked before it is used during scoring. For 
long documents (or long fields) this may cause an 
ArrayIndexOutOfBoundsException.

I suspect the bug is exposed on this line:
https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/MultiNormsLeafSimScorer.java#L131

For comparison, BM25Similarity does a similar lookup but applies the mask:
https://github.com/apache/lucene-solr/blob/c413656b627160d49eb9e9f1f84ec4945db80f0e/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L233
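
To illustrate why the mask matters, here is a minimal, self-contained sketch. 
It is not the Lucene code: the class name, the LENGTH_TABLE contents, and the 
helper methods are hypothetical, and it assumes the encoded norm is a single 
byte used to index a 256-entry table. Casting to byte without masking gives a 
negative index once the encoded value reaches 128.

```java
public class NormIndexSketch {
    // Hypothetical 256-entry lookup table, one slot per possible byte value.
    static final float[] LENGTH_TABLE = new float[256];
    static {
        for (int i = 0; i < LENGTH_TABLE.length; i++) {
            LENGTH_TABLE[i] = i; // placeholder values for illustration only
        }
    }

    // Unmasked: the byte cast sign-extends, so encoded values >= 128 go negative.
    static int unmaskedIndex(long encodedNorm) {
        return (byte) encodedNorm;
    }

    // Masked: "& 0xFF" reinterprets the byte as unsigned, giving 0..255.
    static int maskedIndex(long encodedNorm) {
        return ((byte) encodedNorm) & 0xFF;
    }

    // Safe table lookup that uses the masked index.
    static float lookup(long encodedNorm) {
        return LENGTH_TABLE[maskedIndex(encodedNorm)];
    }

    public static void main(String[] args) {
        long encodedNorm = 128L; // first value whose byte cast is negative
        System.out.println("unmasked index: " + unmaskedIndex(encodedNorm));
        System.out.println("masked index:   " + maskedIndex(encodedNorm));
        try {
            float ignored = LENGTH_TABLE[unmaskedIndex(encodedNorm)];
            System.out.println("unexpected: lookup succeeded with " + ignored);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unmasked lookup failed: " + e.getMessage());
        }
        System.out.println("masked lookup:  " + lookup(encodedNorm));
    }
}
```

Encoded values below 128 give the same index with or without the mask, which 
would be consistent with the failure only showing up on very long fields.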

My experiments show that two conditions must hold to trigger the bug: a token 
must match in more than one field (the case BM25FQuery is designed for), and 
one of the fields must be at least 32792 tokens long.

I've provided tests in the pull request to demonstrate this.

  was:
Through some experimentation with with the BM25FQuery on long documents, I've 
discovered that there is a bug that doesn't mask the encoded norm's long value 
during scoring. For long documents (or long fields) this may cause 
ArrayIndexOutOfBoundsExceptions.

The line where I suspect the bug is being exposed is here
https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/MultiNormsLeafSimScorer.java#L131

Here is a similar use in BM25Similarity with the masking
https://github.com/apache/lucene-solr/blob/c413656b627160d49eb9e9f1f84ec4945db80f0e/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L233

My experimentation shows that to expose this bug, there must be a match for a 
token in more than one field (which is what BM25FQuery is for). In addition one 
of the fields must be >= 32792 tokens long.

I've provided tests in the pull request to demonstrate this.


> BM25FQuery - MultiNormsLeafSimScorer needs to mask long value for long 
> documents
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-9635
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9635
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/sandbox
>    Affects Versions: 8.6
>            Reporter: Yilun Cui
>            Priority: Minor
>              Labels: pull-request-available
>
> While experimenting with BM25FQuery on long documents, I found that the 
> encoded norm's long value is not masked before it is used during scoring. 
> For long documents (or long fields) this may cause an 
> ArrayIndexOutOfBoundsException.
> I suspect the bug is exposed on this line:
> https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/MultiNormsLeafSimScorer.java#L131
> For comparison, BM25Similarity does a similar lookup but applies the mask:
> https://github.com/apache/lucene-solr/blob/c413656b627160d49eb9e9f1f84ec4945db80f0e/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L233
> My experiments show that two conditions must hold to trigger the bug: a 
> token must match in more than one field (the case BM25FQuery is designed 
> for), and one of the fields must be at least 32792 tokens long.
> I've provided tests in the pull request to demonstrate this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

