Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361529545
########## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java ########## @@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int docId, String content) break; } // advance breakIterator - passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), 0)); - passage.setEndOffset(Math.min(this.breakIterator.following(start), contentLength)); + passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), lastPassageEnd)); + lastPassageEnd = Math.min(this.breakIterator.following(start), contentLength); Review comment: Changing the `following(start)` to `following(end - 1)` would have been a great idea before this changeset. However these two calls on the `LengthGoalBreakIterator` are very specifically crafted. The `preceding()` prepares state data that `following()` will use. Currently the workflow is: 1. The call to `preceding(start + 1)` will identify the `startIdx` and `endIdx` around the match by the underlying BI. This `startIdx` can be earlier than the start of the match. Similarly this `endIdx` can be later than the end of the match. (this is the usual case for SENTENCE BI) Both are needed at this point to know how long the match is. Then `fragsize-(endIdx-startIdx)` is the remaining size to give as context around the first fragment. Then `context_size*fragalign` is the length `preceding()` will try to prepend to the fragment. Finally it will return the final start position. 2. The call to `following(start)` ignores the argument because the end of the match's fragment (`endIdx`) was already found in `preceding()`. `following()` will try to append the missing length also prepared by `preceding()`. Then return the final end position. This is what I was referring to in my comment when I submitted the first version of the patch. Ideally there was a single method on the `LengthGoalBreakIterator` that would receive the start **and** end positions of the match to work with. That way the processing would be more optimal. The underlying BI would not need to iterate over the content from start to end when searching for the `endIdx`. It could start from `end - 1` of the match. I have tried to change up this part of the code more, but a whole bunch of tests broke. I think there is some other BreakIterator subclass that gets passed around here. So this part can't be rewritten to cater to this specific behavior without greater changes. Finally, `max(preceding(start + 1) ,lastPassageEnd)` is needed because the `preceding()` call can go backwards and create overlaps with the previous passage. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org