dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361701592
##########
File path:
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
##########
@@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx
is a known break
// called at start of new Passage given first word start offset
@Override
public int preceding(int offset) {
- return baseIter.preceding(offset); // no change needed
+ final int fragmentStart = Math.max(baseIter.preceding(offset), 0); //
convert DONE to 0
+ fragmentEndFromPreceding = baseIter.following(fragmentStart);
+ if (fragmentEndFromPreceding == DONE) {
+ fragmentEndFromPreceding = baseIter.last();
+ }
+ final int centerLength = fragmentEndFromPreceding - fragmentStart;
+ final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) *
fragmentAlignment);
Review comment:
Thanks for the detailed example.
> As you can see the fragsize is not about the final size of the snippet,
but about the text around the match.
I think it's a bit strong to claim fragsize is _not_ about the final size of
the snippet. There is a linear relationship. Decreasing this reduces snippet
sizes and increasing it will increase snippet sizes. It's just rather indirect
because we are prioritizing the alignment. And perhaps we overshoot too much.
Also; couldn't we subtract out the match length itself if it's unduly
influencing the final snippet size?
Lets pretend we're at a more perfect future and not concerning ourselves
with backwards compatibility. Realistically, do you imagine a user would want
to use fragalign of anything other than 0.5? I doubt it; I think we should
assume users want this. Advanced users can choose _not_ to use this
BreakIterator -- there are others plus they can write their own (which I've
done). Also, rather than assuming users need 50% of fragsize (lengthGoal) on
both sides, maybe assume 60% _of that_ (I'm looking at default slop in
LuceneRegexFragmenter which is similar), so thus ~0.3 (or perhaps 1/3rd)
fragsize minimum left & right. I think this will lead to snippets that are
both closer to the desired fragsize on average and lead to more balanced
passages than we have today. LMK what you think.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]