[ 
https://issues.apache.org/jira/browse/SOLR-11516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995738#comment-16995738
 ] 

Nándor Mátravölgyi commented on SOLR-11516:
-------------------------------------------

As previously stated, the UnifiedHighlighter always returns full sentences 
(with hl.bs.type=SENTENCE), effectively ignoring the fragsize parameter. 
Changing the breakiterator type to WORD makes fragsize work as expected, but 
the matches are not "centered" in the snippets, which makes their context much 
less apparent in some cases.

Trimming on the client side is a relatively poor solution, in my opinion. 
Let's say I receive a highlight that is several times longer than I want, but 
the matches are very unevenly distributed because of a long sentence: 
OXOOOOOOOOOXXX (imagine X represents the matches and O the text around them). 
To be truly correct, the client has to parse the highlight for the pre and post 
tags and strip out the middle of the text. In a more primitive solution the 
highlight would be bluntly truncated, and the valuable matches at the end would 
be lost to the client. This work is redundant and wasteful if Solr could do it 
itself, as the other highlighters do.
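
To make the truncation problem concrete, here is a minimal sketch of the 
"primitive" client-side approach. The class and method names are made up for 
illustration, and the sample highlight is invented; it only assumes the default 
<em> pre tag.

```java
final class ClientTrimSketch {
    /** Blunt client-side trimming: keep only the first maxChars characters. */
    static String truncate(String highlight, int maxChars) {
        return highlight.length() <= maxChars ? highlight : highlight.substring(0, maxChars);
    }

    /** Counts how many highlighted matches (occurrences of the pre tag) a snippet contains. */
    static int countMatches(String highlight, String preTag) {
        int count = 0;
        for (int i = highlight.indexOf(preTag); i >= 0;
                 i = highlight.indexOf(preTag, i + preTag.length())) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // One match early in a long sentence, three matches clustered at the end.
        String full = "aaa <em>foo</em> bbb ccc ddd eee fff ggg "
                + "<em>foo</em> <em>bar</em> <em>baz</em>";
        String cut = truncate(full, 30);
        System.out.println(countMatches(full, "<em>") + " matches before truncation");
        System.out.println(countMatches(cut, "<em>") + " matches after truncation");
    }
}
```

With this sample, only the first of the four matches survives a cut at 30 
characters. A blunt cut can also land in the middle of a tag, which the client 
would then have to repair as well.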

I really wanted to have the fragsize be around what I specified even with much 
longer sentences, so I've spent some time analyzing the code and designing 
possible changes.

The UnifiedHighlighter chains the breakiterator instance selected by the 
hl.bs.type parameter with a LengthGoalBreakIterator (unless fragsize <= 1 or 
type == WHOLE). It is actually the LengthGoalBreakIterator that decides which 
parts of the text should be in the snippet around the actual match.

Currently this class always starts the snippet from the first break before the 
match indicated by the wrapped iterator, and may only extend the snippet beyond 
the match until fragsize is reached.
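
The behavior described above can be approximated with the following sketch. It 
is not the Lucene code: it uses a plain java.text.BreakIterator word instance 
in place of the wrapped iterator, and the class name, method name and sample 
text are assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class MinLengthSnippetSketch {
    /** Returns the snippet for a match at [matchStart, matchEnd) with the given length goal. */
    static String snippet(String text, int matchStart, int matchEnd, int fragsize) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        // Start: the last break at or before the match start; nothing earlier is considered.
        int start = Math.max(words.preceding(matchStart + 1), 0);
        // End: the first break at or after start + fragsize, and never before the match end.
        int target = Math.max(start + fragsize, matchEnd);
        int end = target >= text.length() ? text.length() : words.following(target - 1);
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "one two three four five six seven eight nine ten";
        // "seven" occupies offsets 28..33 in the sample text.
        System.out.println(snippet(text, 28, 33, 15));
    }
}
```

Because a WORD iterator reports a break right at the match start, the snippet 
always begins with the match itself, and the whole length goal is spent on the 
text after it.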

There is a "closestTo" mode implemented in it, but it always starts the 
snippet the same way as the mode currently in use, and it is not selectable 
because that would require an additional parameter that does not exist yet. 
([view on 
github|https://github.com/apache/lucene-solr/blob/e5df183a42967c0eb79b5c2c65cd3ab618318f23/solr/core/src/java/org/apache/solr/highlight/UnifiedSolrHighlighter.java#L330])

So far I can see two ways to improve this:
 # Improve the LengthGoalBreakIterator to have a "centerAround" mode. This has 
the benefit of working with all other hl.bs.types, even though it would mostly 
be meaningful for SEPARATOR and WORD. In SENTENCE mode a large enough fragsize 
could include a preceding sentence in the snippet as well. To use this mode a 
new parameter has to be created, maybe something like "hl.bs.snippetAlignment", 
which could have the values "min" (current behavior), "closest" (currently 
unreachable) and "center" (the proposed behavior).
 # Make a new hl.bs.type, maybe AROUND_MATCH, and create a different 
breakiterator wrapper to be used instead of the LengthGoalBreakIterator. This 
would wrap a WORD breakiterator, thus producing results similar to the other 
highlighters.
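
As a rough illustration of what a "center" alignment could do, here is a 
sketch that spends half of the remaining length budget on context before the 
match. This is only one possible interpretation of the proposal, not existing 
Lucene or Solr code; the class and method names are hypothetical, and it again 
uses a plain java.text.BreakIterator word instance.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class CenteredSnippetSketch {
    /** Returns a snippet of roughly fragsize characters centered on [matchStart, matchEnd). */
    static String snippet(String text, int matchStart, int matchEnd, int fragsize) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        // Half of the length budget left over after the match goes to the left context.
        int leftBudget = Math.max(fragsize - (matchEnd - matchStart), 0) / 2;
        int leftTarget = Math.max(matchStart - leftBudget, 0);
        // Start: the first break at or after leftTarget; the match start is the fallback.
        int start = matchStart;
        for (int b = words.first(); b != BreakIterator.DONE && b <= matchStart; b = words.next()) {
            if (b >= leftTarget) {
                start = b;
                break;
            }
        }
        // End: the first break at or after start + fragsize, and never before the match end.
        int target = Math.max(start + fragsize, matchEnd);
        int end = text.length();
        for (int b = words.first(); b != BreakIterator.DONE; b = words.next()) {
            if (b >= target) {
                end = b;
                break;
            }
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "one two three four five six seven eight nine ten";
        // "seven" occupies offsets 28..33 in the sample text.
        System.out.println(snippet(text, 28, 33, 23));
    }
}
```

For the sample text and a fragsize of 23, the match "seven" ends up with two 
words of context on each side. A match at the very beginning of the text 
simply gets all of its context on the right.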

One question is whether the passage (ultimately snippet) extraction algorithm 
in FieldHighlighter needs to change. Currently, because no breakiterator looks 
before the match for a passage start position, it is guaranteed that the 
passages will not overlap. This would no longer be the case after the changes, 
so it may also need some work. (Interestingly, the FastVector highlighter can 
produce slight overlaps if the matches are dense enough, while the original 
highlighter will not.)
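
One simple way to keep passages disjoint would be to clamp each passage start 
to the end of the previous passage. The sketch below shows only that idea; the 
class name and the {start, end} array representation are hypothetical. It 
assumes passages arrive sorted by start offset and that a match falling inside 
the previous passage is added to that passage instead of opening a new one, so 
clamping never cuts into a match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PassageOverlapSketch {
    /**
     * Takes passages as {start, end} pairs (end exclusive), sorted by start,
     * and returns copies whose starts are moved forward so that none overlap.
     */
    static List<int[]> clampOverlaps(List<int[]> passages) {
        List<int[]> result = new ArrayList<>();
        int previousEnd = 0;
        for (int[] p : passages) {
            int start = Math.max(p[0], previousEnd);
            if (start < p[1]) { // drop passages fully covered by the previous one
                result.add(new int[] {start, p[1]});
                previousEnd = p[1];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two centered passages whose left context reaches into each other.
        List<int[]> passages = Arrays.asList(new int[] {19, 44}, new int[] {39, 48});
        for (int[] p : clampOverlaps(passages)) {
            System.out.println(p[0] + ".." + p[1]);
        }
    }
}
```

The trade-off is that a passage following a dense group of matches loses some 
or all of its left context, so whether this is acceptable is open for 
discussion.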

I'm pretty sure either can be done with minimal overhead since all data is 
already available. The algorithms just need to make different decisions about 
where to slice the strings. I'm willing to work on this, so please share your 
ideas.

> Unified highlighter with word separator never gives context to the left
> -----------------------------------------------------------------------
>
>                 Key: SOLR-11516
>                 URL: https://issues.apache.org/jira/browse/SOLR-11516
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 6.4, 7.1
>            Reporter: Tim Retout
>            Priority: Major
>
> When using the unified highlighter with hl.bs.type=WORD, I am not able to get 
> context to the left of the matches returned; only words to the right of each 
> match are shown.  I see this behaviour on both Solr 6.4 and Solr 7.1.
> Without context to the left of a match, the highlighted snippets are much 
> less useful for understanding where the match appears in a document.
> As an example, using the techproducts data with Solr 7.1, given a search for 
> "apple", highlighting the "features" field:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.bs.type=WORD&hl.fragsize=30&hl.method=unified
> I see this snippet:
> "<em>Apple</em> Lossless, H.264 video"
> Note that "Apple" is anchored to the left.  Compare with the original 
> highlighter:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.fragsize=30
> And the match has context either side:
> ", Audible, <em>Apple</em> Lossless, H.264 video"
> (To complicate this, in general I am not sure that the unified highlighter is 
> respecting the hl.fragsize parameter, although [SOLR-9935] suggests support 
> was added.  I included the hl.fragsize param in the unified URL too, but it's 
> making no difference unless set to 0.)



