Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361734439
##########
File path:
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
##########
@@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx
is a known break
// called at start of new Passage given first word start offset
@Override
public int preceding(int offset) {
- return baseIter.preceding(offset); // no change needed
+ final int fragmentStart = Math.max(baseIter.preceding(offset), 0); //
convert DONE to 0
+ fragmentEndFromPreceding = baseIter.following(fragmentStart);
+ if (fragmentEndFromPreceding == DONE) {
+ fragmentEndFromPreceding = baseIter.last();
+ }
+ final int centerLength = fragmentEndFromPreceding - fragmentStart;
+ final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) *
fragmentAlignment);
Review comment:
> Also; couldn't we subtract out the match length itself if it's unduly
influencing the final snippet size?
My original patch was so convoluted because that cannot be done easily in
the current implementation with the BreakIterator base class. The call to
`preceding()` on the LGBI receives the `matchStart` and returns the final start
of the snippet. Since this method has no way to know the other end of the match
it cannot adjust to the length of the match. Using private state the
`following()` could know it, but that would only be able to adjust on the
right hand side and not the left.
Something like `public int[] getPassageRange(int matchStart, int matchEnd) {
... }` would allow a better implementation to be written. But that requires
changing `protected final BreakIterator breakIterator;` inside
`FieldHighlighter` to a custom BI base that has such method.
> Realistically, do you imagine a user would want to use fragalign of
anything other than 0.5?
I'd say the margin values of `fragalign` are of little use, but having the
ability to set it in the range of`0.35-0.65` could be worthwhile.
I'm not sure if a `slop` parameter similar to RegexFragmenter's would be too
beneficial.
- The `RegexFragmenter` processes the whole stored text (until
`hl.regex.maxAnalyzedChars`) and creates a list of breaks called `hotSpots`.
Interestingly it also uses the tokenizer's offsets from the index (I think),
because even if a break would be in the middle of a word it will not break it
in half. Unless I'm completely missing something the regex seems to only give a
strong recommendation to where a break should go.
- Implementing an acceptable range (`slop`) would require to build some list
of `hotSpots` or at least to look for them when extracting the fragment. For
this we could only use `preceding()` and `following()` on the wrapped BI, which
we wanted to avoid if possible for performance reasons.
- For WORD BI the current solution is pretty accurate and is as optimal as
it gets. (single call of `preceding()` and `following()` on the underlying BI)
- For SENTENCE BI the slop would not help enough to be worth the trouble I
think. Let's say I have to extend the snippet on the right side with 100 chars
of a slop 0.4 (valid range: 60-140):
- If the next sentence is 300 chars long there's nothing better to do what
we already do.
- If the next sentences have the lengths of [30, 50, 200, 40] the better
choice would be to have the first two sentences appended with a sum length of
80. (that's only 20 off from the goal) For this the LGBI can be created with
`closestTo` mode instead of `minLength` mode. It would cost an extra call of
`preceding()` and `following()` each but would produce the better result
instead of appending 3 of the following sentences with 280 total length.
So we already have a `slop` of sorts... it's just a boolean that cannot be
turned on right now as it is missing a parameter. I've mentioned [this
todo](https://github.com/apache/lucene-solr/blob/e06ad4cfb587e1bb69a80c0b531df0e52c2da89b/solr/core/src/java/org/apache/solr/highlight/UnifiedSolrHighlighter.java#L330)
about it before.
After looking though all of this, my suggestion is to:
- keep `hl.fragalign` and have it be `0.5` by default
- add `hl.closestTo` with default `false`
- a better parameter name would be welcome
- we could consider making it automatically turned on if the BI is SENTENCE
- do not worry about keeping a backward-compatible behavior with this change
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]