fragosoluana opened a new issue, #15433: URL: https://github.com/apache/lucene/issues/15433
### Description

During the expansion of phrase queries for highlighting, the same phrase can appear twice. When a query includes overlapping phrases, the expansion process may generate duplicate phrases: one with the original (possibly high) user-defined boost, and another with a [boost of 1](https://github.com/apache/lucene/blob/main/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldQuery.java#L255-L257). As a result, the final boost assigned to the phrase may be incorrect, because it is determined by whichever duplicate is processed last during the [creation of the QueryPhraseMap](https://github.com/apache/lucene/blob/main/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldQuery.java#L411-L416) in the [markTerminal method](https://github.com/apache/lucene/blob/main/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldQuery.java#L432).

For example, the original query `["a b c": 100, "a b": 20, "b c": 50]` assigns a boost of 100 to "a b c", but during expansion a duplicate "a b c" ("a b" + "b c") is generated with boost 1, which can ultimately override the intended boost in the QueryPhraseMap from 100 to 1.
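The override can be reproduced outside Lucene with a plain map. The sketch below is only an illustration under stated assumptions: `BoostOverrideSketch`, `lastWins`, and `keepHighest` are hypothetical names, the map stands in for the per-phrase boost stored in the QueryPhraseMap, and keeping the highest boost is one possible order-independent policy, not necessarily how the issue should be fixed.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for the phrase -> boost bookkeeping; this is not Lucene code.
final class BoostOverrideSketch {

  // Mirrors the reported behavior: every occurrence of a phrase overwrites
  // the boost stored so far, so the last duplicate processed wins.
  static Map<String, Float> lastWins(List<Map.Entry<String, Float>> expanded) {
    Map<String, Float> boosts = new LinkedHashMap<>();
    for (Map.Entry<String, Float> phrase : expanded) {
      boosts.put(phrase.getKey(), phrase.getValue());
    }
    return boosts;
  }

  // Assumed alternative policy: keep the larger boost when a phrase repeats,
  // which makes the result independent of processing order.
  static Map<String, Float> keepHighest(List<Map.Entry<String, Float>> expanded) {
    Map<String, Float> boosts = new LinkedHashMap<>();
    for (Map.Entry<String, Float> phrase : expanded) {
      boosts.merge(phrase.getKey(), phrase.getValue(), Float::max);
    }
    return boosts;
  }

  public static void main(String[] args) {
    // Expanded phrases in the order used by the example in this issue.
    List<Map.Entry<String, Float>> expanded = List.of(
        Map.entry("a b c", 100f),
        Map.entry("a b", 20f),
        Map.entry("a b c", 1f), // duplicate produced from "a b" + "b c"
        Map.entry("b c", 50f));

    System.out.println(lastWins(expanded));    // "a b c" ends up with boost 1.0
    System.out.println(keepHighest(expanded)); // "a b c" keeps boost 100.0
  }
}
```

With the duplicate processed after the original, `lastWins` loses the user-defined boost, while `keepHighest` gives the same result whichever duplicate comes first.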
A unit test for the example query above that currently fails but should pass:

```
public void testQueryPhraseMapDuplicate() throws IOException {
  // Build the query ["a b c": 100, "a b": 20, "b c": 50].
  BooleanQuery.Builder query = new BooleanQuery.Builder();
  Query bq = toPhraseQuery(analyze("a b c", F, analyzerB), F);
  bq = new BoostQuery(bq, 100);
  query.add(bq, Occur.SHOULD);
  bq = toPhraseQuery(analyze("a b", F, analyzerB), F);
  bq = new BoostQuery(bq, 20);
  query.add(bq, Occur.SHOULD);
  bq = toPhraseQuery(analyze("b c", F, analyzerB), F);
  bq = new BoostQuery(bq, 50);
  query.add(bq, Occur.SHOULD);
  bq = query.build();

  FieldQuery fq = new FieldQuery(bq, true, true);
  Set<Query> flatQueries = new LinkedHashSet<>();
  fq.flatten(bq, searcher, flatQueries, 1f);
  assertCollectionQueries(
      fq.expand(flatQueries),
      pqF(100, "a", "b", "c"),
      pqF(20, "a", "b"),
      // "a b c": 1 -> expanded "a b" + "b c"
      new BoostQuery(pqF(1f, "a", "b", "c"), 1f),
      pqF(50, "b", "c"));

  // Walk the phrase map and check the boost stored at each node.
  Map<String, QueryPhraseMap> map = fq.rootMaps;
  QueryPhraseMap qpm = map.get("f").subMap.get("a");
  assertEquals(0, qpm.boost, 0.0);
  QueryPhraseMap qpm1 = qpm.subMap.get("b");
  assertEquals(20, qpm1.boost, 0.0);
  QueryPhraseMap qpm2 = qpm1.subMap.get("c");
  // fails here because qpm2.boost is 1
  assertEquals(100, qpm2.boost, 0.0);
  QueryPhraseMap qpm3 = map.get("f").subMap.get("b");
  assertEquals(0, qpm3.boost, 0.0);
  QueryPhraseMap qpm4 = qpm3.subMap.get("c");
  assertEquals(50, qpm4.boost, 0.0);
}
```

### Version and environment details

- Lucene version: 10.3.2
- Component: lucene-highlighter
