Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361529545
 
 

 ##########
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
 ##########
 @@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int 
docId, String content)
           break;
         }
         // advance breakIterator
-        passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 
1), 0));
-        passage.setEndOffset(Math.min(this.breakIterator.following(start), 
contentLength));
+        passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 
1), lastPassageEnd));
+        lastPassageEnd = Math.min(this.breakIterator.following(start), 
contentLength);
 
 Review comment:
   Changing the `following(start)` to `following(end - 1)` would have been a 
great idea before this changeset. However these two calls on the 
`LengthGoalBreakIterator` are very specifically crafted. The `preceding()` 
prepares state data that `following()` will use. Currently the workflow is:
   1. The call to `preceding(start + 1)` will identify the `startIdx` and 
`endIdx` around the match by the underlying BI. This `startIdx` can be earlier 
than the start of the match. Similarly this `endIdx` can be later than the end 
of the match. (this is the usual case for SENTENCE BI) Both are needed at this 
point to know how long the match is. Then `fragsize-(endIdx-startIdx)` is the 
remaining size to give as context around the first fragment. Then 
`context_size*fragalign` is the length `preceding()` will try to prepend to the 
fragment. Finally it will return the final start position.
   2. The call to `following(start)` ignores the argument because the end of 
the match's fragment (`endIdx`) was already found in `preceding()`. 
`following()` will try to append the missing length also prepared by 
`preceding()`. Then return the final end position.
   
   This is what I was referring to in my comment when I submitted the first 
version of the patch. Ideally there was a single method on the 
`LengthGoalBreakIterator` that would receive the start **and** end positions of 
the match to work with. That way the processing would be more optimal. The 
underlying BI would not need to iterate over the content from start to end when 
searching for the `endIdx`. It could start from `end - 1` of the match.
   
   I have tried to change up this part of the code more, but a whole bunch of 
tests broke. I think there is some other BreakIterator subclass that gets 
passed around here. So this part can't be rewritten to cater to this specific 
behavior without greater changes.
   
   Finally, `max(preceding(start + 1) ,lastPassageEnd)` is needed because the 
`preceding()` call can go backwards and create overlaps with the previous 
passage.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to