[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

GitBox Fri, 27 Dec 2019 08:39:16 -0800

dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361701592


 ##########
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx 
is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // 
convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * 
fragmentAlignment);
 
 Review comment:
   Thanks for the detailed example.
   
   > As you can see the fragsize is not about the final size of the snippet, 
but about the text around the match.
   
   I think it's a bit strong to claim fragsize is _not_ about the final size of 
the snippet.  There is a linear relationship.  Decreasing this reduces snippet 
sizes and increasing it will increase snippet sizes.  It's just rather indirect 
because we are prioritizing the alignment.  And perhaps we overshoot too much.
   
   Also; couldn't we subtract out the match length itself if it's unduly 
influencing the final snippet size?
   
   Lets pretend we're at a more perfect future and not concerning ourselves 
with backwards compatibility.  Realistically, do you imagine a user would want 
to use fragalign of anything other than 0.5?  I doubt it; I think we should 
assume users want this.  Advanced users can choose _not_ to use this 
BreakIterator -- there are others plus they can write their own (which I've 
done).  Also, rather than assuming users need 50% of fragsize (lengthGoal) on 
both sides, maybe assume 60% _of that_ (I'm looking at default slop in 
LuceneRegexFragmenter which is similar), so thus ~0.3 (or perhaps 1/3rd) 
fragsize minimum left & right.  I think this will lead to snippets that are 
both closer to the desired fragsize on average and lead to more balanced 
passages than we have today.  LMK what you think.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

Reply via email to