Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361529545
##########
File path:
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
##########
@@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int
docId, String content)
break;
}
// advance breakIterator
- passage.setStartOffset(Math.max(this.breakIterator.preceding(start +
1), 0));
- passage.setEndOffset(Math.min(this.breakIterator.following(start),
contentLength));
+ passage.setStartOffset(Math.max(this.breakIterator.preceding(start +
1), lastPassageEnd));
+ lastPassageEnd = Math.min(this.breakIterator.following(start),
contentLength);
Review comment:
Changing the `following(start)` to `following(end - 1)` would have been a
great idea before this changeset. However these two calls on the
`LengthGoalBreakIterator` are very specifically crafted. The `preceding()`
prepares state data that `following()` will use. Currently the workflow is:
1. The call to `preceding(start + 1)` will identify the `startIdx` and
`endIdx` around the match by the underlying BI. This `startIdx` can be earlier
than the start of the match. Similarly this `endIdx` can be later than the end
of the match. (this is the usual case for SENTENCE BI) Both are needed at this
point to know how long the match is. Then `fragsize-(endIdx-startIdx)` is the
remaining size to give as context around the first fragment. Then
`context_size*fragalign` is the length `preceding()` will try to prepend to the
fragment. Finally it will return the final start position.
2. The call to `following(start)` ignores the argument because the end of
the match's fragment (`endIdx`) was already found in `preceding()`.
`following()` will try to append the missing length also prepared by
`preceding()`. Then return the final end position.
This is what I was referring to in my comment when I submitted the first
version of the patch. Ideally there was a single method on the
`LengthGoalBreakIterator` that would receive the start **and** end positions of
the match to work with. That way the processing would be more optimal. The
underlying BI would not need to iterate over the content from start to end when
searching for the `endIdx`. It could start from `end - 1` of the match.
I have tried to change up this part of the code more, but a whole bunch of
tests broke. I think there is some other BreakIterator subclass that gets
passed around here. So this part can't be rewritten to cater to this specific
behavior without greater changes.
Finally, `max(preceding(start + 1) ,lastPassageEnd)` is needed because the
`preceding()` call can go backwards and create overlaps with the previous
passage.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]