tomjmul opened a new issue, #15333:
URL: https://github.com/apache/lucene/issues/15333
### Description
# Highlighter.getBestFragments() merges zero-scored fragments with scored fragments, polluting highlight results
## Description
The `Highlighter.getBestFragments()` method merges contiguous fragments
regardless of score, causing zero-scored (non-matching) fragments to be merged
with scored fragments and returned as highlights. This results in large blocks
of irrelevant text appearing in highlight results simply because they're
adjacent to actual matches.
## Environment
- Lucene version: 10.2.2 (also present in 10.3.1)
- Component: lucene-highlighter
## Steps to Reproduce
1. Create a document with a single term match ("credit") surrounded by
substantial text
2. Configure `SimpleSpanFragmenter` with `fragmentSize=100`
3. Call `highlighter.getBestFragments()` with `maxFragments=3`
4. Observe that 14 fragments are created, but only 1 has score > 0
5. The FragmentQueue selects the top 3 fragments (fragment 0 with score 1.0,
fragments 1 and 2 with score 0.0)
6. `mergeContiguousFragments()` merges all three into a single ~300 char
result
## Actual Behaviour
With `maxFragments=3`, returns a single merged fragment of ~300 characters:
```
@meta name "Process Payment" @meta description "Process a payment for an
order using <em>credit</em> card" @meta tags ["payments", "create", "checkout"]
@meta collection "Payment Processing API" /* * Payment processing endpoint with
PCI compliance * NOTE: All card data must be tokenised before
```
The result includes ~250 characters of zero-scored content merged with the
~50 characters containing the actual match.
## Expected Behaviour
Highlights should only include fragments containing actual matches (score >
0). Zero-scored fragments should either:
1. Not be selected by the FragmentQueue, or
2. Not be merged with scored fragments, or
3. Be filtered out before being returned
Expected result:
```
@meta description "Process a payment for an order using <em>credit</em> card"
```
## Root Cause Analysis
In `getBestTextFragments()`:
1. The fragmenter creates 14 fragments across the document
2. Only 1 fragment contains the search term and has score 1.0
3. The remaining 13 fragments have score 0.0
4. `FragmentQueue(maxNumFragments)` keeps the top N fragments by score
5. Since the 13 zero-scored fragments tie, the queue fills its remaining N-1
slots with the first zero-scored fragments encountered (fragments 1, 2, etc.)
6. `mergeContiguousFragments()` merges any adjacent fragments regardless of
score
7. The merged fragment inherits the highest score (1.0), so it passes the
`score > 0` filter
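
The steps above can be modelled in a few lines of plain Java. This is a simplified sketch, not Lucene's code: `MergeModel`, `Frag`, `topN`, `mergeContiguous`, and `run` are hypothetical names, and the tie-break (the earlier fragment wins on equal score) is an assumption chosen to match the observed selection of fragments 1 and 2.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Simplified, hypothetical model of the select-then-merge flow; not Lucene's actual code. */
final class MergeModel {

  /** A run of consecutive fragment numbers [first, last] carrying one score. */
  static final class Frag {
    final int first;
    final int last;
    final float score;

    Frag(int first, int last, float score) {
      this.first = first;
      this.last = last;
      this.score = score;
    }
  }

  /** Keeps the n highest-scoring fragments; on a tie the earlier fragment wins (assumed). */
  static List<Frag> topN(List<Frag> all, int n) {
    List<Frag> sorted = new ArrayList<>(all);
    sorted.sort(
        Comparator.comparingDouble((Frag f) -> f.score)
            .reversed()
            .thenComparingInt(f -> f.first));
    return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
  }

  /** Merges selected fragments that are adjacent in the text; the merged score is the max. */
  static List<Frag> mergeContiguous(List<Frag> selected) {
    List<Frag> byPosition = new ArrayList<>(selected);
    byPosition.sort(Comparator.comparingInt((Frag f) -> f.first));
    List<Frag> merged = new ArrayList<>();
    for (Frag f : byPosition) {
      int lastIdx = merged.size() - 1;
      if (lastIdx >= 0 && merged.get(lastIdx).last + 1 == f.first) {
        Frag prev = merged.get(lastIdx);
        // Score is ignored when deciding to merge; only adjacency matters.
        merged.set(lastIdx, new Frag(prev.first, f.last, Math.max(prev.score, f.score)));
      } else {
        merged.add(f);
      }
    }
    return merged;
  }

  /** Builds 14 fragments where only fragment 0 scores, then selects and merges. */
  static List<Frag> run(int maxFragments) {
    List<Frag> all = new ArrayList<>();
    for (int i = 0; i < 14; i++) {
      all.add(new Frag(i, i, i == 0 ? 1.0f : 0.0f));
    }
    return mergeContiguous(topN(all, maxFragments));
  }

  public static void main(String[] args) {
    for (Frag f : run(3)) {
      System.out.println("fragments " + f.first + ".." + f.last + " score=" + f.score);
    }
  }
}
```

Under these assumptions `run(3)` yields a single range covering fragments 0 to 2 with score 1.0, while `run(1)` yields only fragment 0, which mirrors how the result grows with `maxFragments`.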
## Problematic Code
In `Highlighter.getBestTextFragments()`, the merge happens unconditionally:
```java
if (mergeContiguousFragments) {
  // Merges adjacent fragments without checking their scores
  mergeContiguousFragments(frag);
}
```
## Impact
- `maxFragments` parameter behaves counterintuitively - changing it from 2
to 3 changes the result size from ~200 to ~300 chars
- Users get large blocks of irrelevant text in their highlights
- No way to control this behaviour through configuration
## Workaround
Call `getBestTextFragments()` directly with `mergeContiguousFragments=false`
and manually filter:
```java
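import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.highlight.TextFragment;

// Disable merging so that each fragment keeps its own score.
TextFragment[] fragments = highlighter.getBestTextFragments(
    tokenStream, text, false, maxFragments);
List<String> results = new ArrayList<>();
for (TextFragment frag : fragments) {
  // Keep only fragments that actually contain a match.
  if (frag != null && frag.getScore() > 0) {
    results.add(frag.toString());
  }
}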
```
## Suggested Fix
**Option 1:** Only merge fragments where both fragments have score >
threshold (e.g., 0.1)
**Option 2:** Add a configuration parameter to control merge behaviour:
```java
// Proposed setter; it does not exist in the current Highlighter API
highlighter.setMergeScoreThreshold(0.1);
```
**Option 3:** Filter zero-scored fragments from the FragmentQueue before
merging
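
As a rough illustration of Option 3 (with a threshold in the spirit of Option 1), the following self-contained sketch drops fragments at or below a score threshold before merging, so only scored neighbours can be joined. It is not a patch against Lucene: `ThresholdMergeSketch` and `mergeAboveThreshold` are hypothetical names, and fragments are reduced to a fragment number plus a score, assumed to be supplied in ascending fragment order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of filtering by score before merging; not Lucene's API. */
final class ThresholdMergeSketch {

  /**
   * Returns merged [firstFragNum, lastFragNum] ranges.
   * fragNums must be ascending; scores[i] is the score of fragNums[i].
   */
  static List<int[]> mergeAboveThreshold(int[] fragNums, float[] scores, float threshold) {
    List<int[]> ranges = new ArrayList<>();
    for (int i = 0; i < fragNums.length; i++) {
      if (scores[i] <= threshold) {
        continue; // drop non-matching fragments before any merging happens
      }
      int[] last = ranges.isEmpty() ? null : ranges.get(ranges.size() - 1);
      if (last != null && last[1] + 1 == fragNums[i]) {
        last[1] = fragNums[i]; // extend only across two scored, adjacent fragments
      } else {
        ranges.add(new int[] {fragNums[i], fragNums[i]});
      }
    }
    return ranges;
  }

  public static void main(String[] args) {
    // Selected fragments 0, 1, 2 where only fragment 0 contains a match.
    List<int[]> ranges =
        mergeAboveThreshold(new int[] {0, 1, 2}, new float[] {1.0f, 0.0f, 0.0f}, 0.0f);
    for (int[] r : ranges) {
      System.out.println("fragments " + r[0] + ".." + r[1]);
    }
  }
}
```

With the selection from the reproduction (fragments 0, 1, 2 scoring 1.0, 0.0, 0.0) and a threshold of 0, the sketch returns only the range for fragment 0; two adjacent scored fragments would still be merged.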
## Questions
Is the current behaviour intentional? If so, would you consider adding
configuration to control which fragments are eligible for merging based on
their scores?
### Version and environment details
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]