[ https://issues.apache.org/jira/browse/LUCENE-9635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yilun Cui updated LUCENE-9635:
------------------------------
    Description:
Through some experimentation with the BM25FQuery on long documents, I've discovered that there is a bug that doesn't mask the encoded norm's long value during scoring. For long documents (or long fields) this may cause ArrayIndexOutOfBoundsExceptions.

The line where I suspect the bug is being exposed is here
https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/MultiNormsLeafSimScorer.java#L131

Here is a similar use in BM25Similarity with the masking
https://github.com/apache/lucene-solr/blob/c413656b627160d49eb9e9f1f84ec4945db80f0e/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L233

My experimentation shows that to expose this bug, there must be a match for a token in more than one field (which is what BM25FQuery is for). In addition one of the fields must be >= 32792 tokens long.

I've provided tests in the pull request to demonstrate this.

  was:
Through some experimentation with with the BM25FQuery on long documents, I've discovered that there is a bug that doesn't mask the encoded norm's long value during scoring. For long documents (or long fields) this may cause ArrayIndexOutOfBoundsExceptions.

The line where I suspect the bug is being exposed is here
https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/MultiNormsLeafSimScorer.java#L131

Here is a similar use in BM25Similarity with the masking
https://github.com/apache/lucene-solr/blob/c413656b627160d49eb9e9f1f84ec4945db80f0e/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L233

My experimentation shows that to expose this bug, there must be a match for a token in more than one field (which is what BM25FQuery is for). In addition one of the fields must be >= 32792 tokens long.

I've provided tests in the pull request to demonstrate this.

> BM25FQuery - MultiNormsLeafSimScorer needs to mask long value for long documents
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-9635
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9635
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/sandbox
>    Affects Versions: 8.6
>            Reporter: Yilun Cui
>            Priority: Minor
>              Labels: pull-request-available
>
> Through some experimentation with the BM25FQuery on long documents, I've
> discovered that there is a bug that doesn't mask the encoded norm's long
> value during scoring. For long documents (or long fields) this may cause
> ArrayIndexOutOfBoundsExceptions.
> The line where I suspect the bug is being exposed is here
> https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/MultiNormsLeafSimScorer.java#L131
> Here is a similar use in BM25Similarity with the masking
> https://github.com/apache/lucene-solr/blob/c413656b627160d49eb9e9f1f84ec4945db80f0e/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L233
> My experimentation shows that to expose this bug, there must be a match for a
> token in more than one field (which is what BM25FQuery is for). In addition
> one of the fields must be >= 32792 tokens long.
> I've provided tests in the pull request to demonstrate this.
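
The masking problem described in the report can be illustrated with a small stand-alone sketch. It is not the Lucene code and does not reproduce Lucene's norm encoding. It assumes only that the encoded norm is a single byte carried in a long, and that it is used to index a 256-entry table. The class NormMaskDemo, its methods, the placeholder table contents, and the sample values 40 and -56 are all hypothetical and exist only for this illustration.

```java
// Hypothetical stand-alone demo; none of these names come from Lucene.
final class NormMaskDemo {
    // Stand-in for a 256-entry decode table indexed by the encoded norm byte.
    static final float[] LENGTH_TABLE = new float[256];

    static {
        for (int i = 0; i < LENGTH_TABLE.length; i++) {
            LENGTH_TABLE[i] = i; // placeholder values, not real decoded lengths
        }
    }

    // Without masking: the cast to byte yields a value in -128..127,
    // so any norm whose high bit is set becomes a negative index.
    static int unmaskedIndex(long encodedNorm) {
        return (byte) encodedNorm;
    }

    // With masking: "& 0xFF" reinterprets the byte as unsigned, 0..255.
    static int maskedIndex(long encodedNorm) {
        return ((byte) encodedNorm) & 0xFF;
    }

    // Returns true if the unmasked table lookup throws for this norm.
    static boolean unmaskedLookupThrows(long encodedNorm) {
        try {
            float unused = LENGTH_TABLE[unmaskedIndex(encodedNorm)];
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        long smallNorm = 40L;  // assumed norm with the high bit clear
        long largeNorm = -56L; // assumed norm with the high bit set (byte 0xC8)

        System.out.println("small, unmasked throws: " + unmaskedLookupThrows(smallNorm));
        System.out.println("large, unmasked throws: " + unmaskedLookupThrows(largeNorm));
        System.out.println("large, masked index:    " + maskedIndex(largeNorm));
    }
}
```

In this sketch the unmasked lookup only fails once the encoded byte has its high bit set, which is consistent with the report's observation that the exception appears only for sufficiently long fields. The exact 32792-token threshold comes from the report and is not derived here.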