tomjmul opened a new issue, #15333:
URL: https://github.com/apache/lucene/issues/15333

   ### Description
   
   # Highlighter.getBestFragments() merges zero-scored fragments with scored 
fragments, polluting highlight results
   
   ## Description
   
   The `Highlighter.getBestFragments()` method merges contiguous fragments 
regardless of score, causing zero-scored (non-matching) fragments to be merged 
with scored fragments and returned as highlights. This results in large blocks 
of irrelevant text appearing in highlight results simply because they're 
adjacent to actual matches.
   
   ## Environment
   - Lucene version: 10.2.2 (also present in 10.3.1)
   - Component: lucene-highlighter
   
   ## Steps to Reproduce
   
   1. Create a document with a single term match ("credit") surrounded by 
substantial text
   2. Configure `SimpleSpanFragmenter` with `fragmentSize=100`
   3. Call `highlighter.getBestFragments()` with `maxFragments=3`
   4. Observe that 14 fragments are created, but only 1 has score > 0
   5. The FragmentQueue selects the top 3 fragments (fragment 0 with score 1.0, 
fragments 1 and 2 with score 0.0)
   6. `mergeContiguousFragments()` merges all three into a single ~300 char 
result
   
   ## Actual Behaviour
   
   With `maxFragments=3`, returns a single merged fragment of ~300 characters:
   
   ```
   @meta name "Process Payment" @meta description "Process a payment for an 
order using <em>credit</em> card" @meta tags ["payments", "create", "checkout"] 
@meta collection "Payment Processing API" /* * Payment processing endpoint with 
PCI compliance * NOTE: All card data must be tokenised before
   ```
   
   The result includes ~250 characters of zero-scored content merged with the 
~50 characters containing the actual match.
   
   ## Expected Behaviour
   
   Highlights should only include fragments containing actual matches (score > 
0). Zero-scored fragments should either:
   1. Not be selected by the FragmentQueue, or
   2. Not be merged with scored fragments, or
   3. Be filtered out before being returned
   
   Expected result:
   
   ```
   @meta description "Process a payment for an order using <em>credit</em> card"
   ```
   
   ## Root Cause Analysis
   
   In `getBestTextFragments()`:
   
   1. The fragmenter creates 14 fragments across the document
   2. Only 1 fragment contains the search term and has score 1.0
   3. The remaining 13 fragments have score 0.0
   4. `FragmentQueue(maxNumFragments)` keeps the top N fragments by score
   5. Since 13 fragments have identical zero scores, ties are broken by 
fragment order, so the queue keeps the first N-1 zero-scored fragments 
encountered (fragments 1, 2, etc.)
   6. `mergeContiguousFragments()` merges any adjacent fragments regardless of 
score
   7. The merged fragment inherits the highest score (1.0), so it passes the 
`score > 0` filter
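   
   The sequence above can be reproduced without Lucene using a small 
standalone model. This is only a sketch: `MergeModel`, `Frag`, and the method 
names are hypothetical stand-ins rather than Lucene classes, fragments are 
assumed to be exactly 100 characters, and equal scores are assumed to be 
ordered by fragment number.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;

   class MergeModel {
       // Simplified stand-in for a fragment; NOT Lucene's TextFragment.
       record Frag(int num, int start, int end, float score) {}

       // Keep the n best fragments: higher score first, then earlier fragment
       // (the tie-break is an assumption of this model).
       static List<Frag> selectTop(List<Frag> all, int n) {
           List<Frag> sorted = new ArrayList<>(all);
           sorted.sort((a, b) -> {
               int byScore = Float.compare(b.score(), a.score());
               return byScore != 0 ? byScore : Integer.compare(a.num(), b.num());
           });
           return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
       }

       // Merge fragments that touch, ignoring their scores; the merged
       // fragment keeps the higher score of the pair.
       static List<Frag> mergeContiguous(List<Frag> picked) {
           List<Frag> byPos = new ArrayList<>(picked);
           byPos.sort((a, b) -> Integer.compare(a.start(), b.start()));
           List<Frag> out = new ArrayList<>();
           for (Frag f : byPos) {
               if (!out.isEmpty() && out.get(out.size() - 1).end() == f.start()) {
                   Frag prev = out.remove(out.size() - 1);
                   out.add(new Frag(prev.num(), prev.start(), f.end(),
                           Math.max(prev.score(), f.score())));
               } else {
                   out.add(f);
               }
           }
           return out;
       }

       // Select, merge, then apply the final score > 0 filter.
       static List<Frag> best(List<Frag> all, int maxFragments) {
           List<Frag> merged = mergeContiguous(selectTop(all, maxFragments));
           merged.removeIf(f -> f.score() <= 0f);
           return merged;
       }

       // 14 fragments of 100 chars each; only fragment 0 contains the match.
       static List<Frag> document() {
           List<Frag> all = new ArrayList<>();
           for (int i = 0; i < 14; i++) {
               all.add(new Frag(i, i * 100, (i + 1) * 100, i == 0 ? 1.0f : 0.0f));
           }
           return all;
       }

       public static void main(String[] args) {
           for (int n = 1; n <= 3; n++) {
               Frag f = best(document(), n).get(0);
               System.out.println("maxFragments=" + n + " -> "
                       + (f.end() - f.start()) + " chars");
           }
       }
   }
   ```
   
   In this model the single returned fragment grows to 100, 200, and 300 
characters for `maxFragments` of 1, 2, and 3, because the zero-scored 
neighbours are absorbed before the `score > 0` filter runs.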
   
   ## Problematic Code
   
   In `Highlighter.getBestTextFragments()`, the merge is gated only by the 
`mergeContiguousFragments` flag and never checks fragment scores:
   
   ```java
   if (mergeContiguousFragments) {
       mergeContiguousFragments(frag);
   }
   ```
   
   ## Impact
   
   - The `maxFragments` parameter behaves counterintuitively: raising it from 
2 to 3 grows the single returned highlight from ~200 to ~300 chars
   - Users get large blocks of irrelevant text in their highlights
   - No way to control this behaviour through configuration
   
   ## Workaround
   
   Call `getBestTextFragments()` directly with `mergeContiguousFragments=false` 
and manually filter:
   
   ```java
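   import java.util.ArrayList;
   import java.util.List;
   import org.apache.lucene.search.highlight.TextFragment;
   
   // Third argument is mergeContiguousFragments: false keeps fragments separate.
   TextFragment[] fragments = highlighter.getBestTextFragments(
       tokenStream, text, false, maxFragments);
   
   // Zero-scored fragments can still be returned here, so drop them manually.
   List<String> results = new ArrayList<>();
   for (TextFragment frag : fragments) {
       if (frag != null && frag.getScore() > 0) {
           results.add(frag.toString());
       }
   }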
   ```
   
   ## Suggested Fix
   
   **Option 1:** Only merge fragments where both fragments have score > 
threshold (e.g., 0.1)
   
   **Option 2:** Add a configuration parameter to control merge behaviour:
   
   ```java
   // Proposed setter; it does not exist in Highlighter today.
   highlighter.setMergeScoreThreshold(0.1f);
   ```
   
   **Option 3:** Filter zero-scored fragments from the FragmentQueue before 
merging
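   
   As a rough illustration of Option 1, the sketch below merges two 
neighbouring fragments only when both reach a threshold. It is a standalone 
model, not a patch: `ScoreAwareMerge`, `Frag`, and `merge` are hypothetical 
names, and `threshold` stands in for the proposed setting from Option 2.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;

   class ScoreAwareMerge {
       // Simplified stand-in for a fragment; NOT Lucene's TextFragment.
       record Frag(int start, int end, float score) {}

       // Merge touching fragments only when BOTH reach the threshold.
       static List<Frag> merge(List<Frag> picked, float threshold) {
           List<Frag> byPos = new ArrayList<>(picked);
           byPos.sort((a, b) -> Integer.compare(a.start(), b.start()));
           List<Frag> out = new ArrayList<>();
           for (Frag f : byPos) {
               Frag prev = out.isEmpty() ? null : out.get(out.size() - 1);
               boolean contiguous = prev != null && prev.end() == f.start();
               boolean bothScored = prev != null
                       && prev.score() >= threshold && f.score() >= threshold;
               if (contiguous && bothScored) {
                   // Replace the previous fragment with the merged span.
                   out.set(out.size() - 1, new Frag(prev.start(), f.end(),
                           Math.max(prev.score(), f.score())));
               } else {
                   out.add(f);
               }
           }
           // Final filter: drop fragments that have no score at all.
           out.removeIf(f -> f.score() <= 0f);
           return out;
       }

       public static void main(String[] args) {
           // The three fragments selected in this report: one match, two empty.
           List<Frag> picked = List.of(
                   new Frag(0, 100, 1.0f),
                   new Frag(100, 200, 0.0f),
                   new Frag(200, 300, 0.0f));
           System.out.println(merge(picked, 0.1f));
       }
   }
   ```
   
   With the three fragments from this report (scores 1.0, 0.0, 0.0) only the 
first 100-character fragment survives, while two adjacent fragments that both 
meet the threshold are still merged into one.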
   
   ## Questions
   
   Is the current behaviour intentional? If so, would you consider adding 
configuration to control which fragments are eligible for merging based on 
their scores?
   
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

