dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361701592
########## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java ########## @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break // called at start of new Passage given first word start offset @Override public int preceding(int offset) { - return baseIter.preceding(offset); // no change needed + final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0 + fragmentEndFromPreceding = baseIter.following(fragmentStart); + if (fragmentEndFromPreceding == DONE) { + fragmentEndFromPreceding = baseIter.last(); + } + final int centerLength = fragmentEndFromPreceding - fragmentStart; + final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment); Review comment: Thanks for the detailed example. > As you can see the fragsize is not about the final size of the snippet, but about the text around the match. I think it's a bit strong to claim fragsize is _not_ about the final size of the snippet. There is a linear relationship. Decreasing this reduces snippet sizes and increasing it will increase snippet sizes. It's just rather indirect because we are prioritizing the alignment. And perhaps we overshoot too much. Also; couldn't we subtract out the match length itself if it's unduly influencing the final snippet size? Lets pretend we're at a more perfect future and not concerning ourselves with backwards compatibility. Realistically, do you imagine a user would want to use fragalign of anything other than 0.5? I doubt it; I think we should assume users want this. Advanced users can choose _not_ to use this BreakIterator -- there are others plus they can write their own (which I've done). Also, rather than assuming users need 50% of fragsize (lengthGoal) on both sides, maybe assume 60% _of that_ (I'm looking at default slop in LuceneRegexFragmenter which is similar), so thus ~0.3 (or perhaps 1/3rd) fragsize minimum left & right. I think this will lead to snippets that are both closer to the desired fragsize on average and lead to more balanced passages than we have today. LMK what you think. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org