[ https://issues.apache.org/jira/browse/LUCENE-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nándor Mátravölgyi updated LUCENE-9091:
---------------------------------------
       Attachment: LUCENE-9091.patch
    Lucene Fields: New,Patch Available  (was: New)
           Status: Open  (was: Open)

This patch fixes the escaping implementation used by the UnifiedHighlighter so that it mirrors how the other highlighters escape. The tests that would otherwise fail because of this change have been adjusted. A unit test was also added that exercises the DefaultPassageFormatter class on its own.

> UnifiedHighlighter HTML escaping should only escape essentials
> --------------------------------------------------------------
>
>                 Key: LUCENE-9091
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9091
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Nándor Mátravölgyi
>            Priority: Minor
>         Attachments: LUCENE-9091.patch
>
>
> The unified highlighter does not use the
> *org.apache.lucene.search.highlight.SimpleHTMLEncoder* through
> *org.apache.solr.highlight.HtmlEncoder*. Instead, it has the HTML escaping
> feature re-implemented and embedded in the
> *org.apache.lucene.search.uhighlight.DefaultPassageFormatter*.
> The HTML escaping done by the unified highlighter escapes characters that do
> not need it. This makes the result payload more than 50% larger with no
> benefit.
> Here is a highlight snippet using the original highlighter:
> {noformat}
> A <em>filter</em> that stems words using a Snowball-generated stemmer.
> Available stemmers & x are listed in org.tartarus.snowball.ext. Note:
> This <em>filter</em> is aware of the KeywordAttribute.
> {noformat}
> Here is the same highlight snippet using the unified highlighter:
> {noformat}
> A <em>filter</em> that stems words using a Snowball-generated stemmer. Available stemmers & x are listed in org.tartarus.snowball.ext. Note: This <em>filter</em> is aware of the KeywordAttribute.
> {noformat}
> Maybe I'm missing the reason why this is done the way it is. If this
> behaviour is desired for some use case, it should be a separate encoder, and
> the HTML encoder should only escape the necessary characters.
> This affects all versions of Lucene-Solr since the addition of the
> UnifiedHighlighter. Here are the lines where the escaping is implemented
> differently:
> * [Escaping by the unified
> highlighter|https://github.com/apache/lucene-solr/blob/2387bb9d60ae44eeeb4fbcb2f2877f46be5303a0/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java#L132]
> * [Escaping by the other
> highlighters|https://github.com/apache/lucene-solr/blob/2387bb9d60ae44eeeb4fbcb2f2877f46be5303a0/lucene/highlighter/src/java/org/apache/lucene/search/highlight/SimpleHTMLEncoder.java#L69]
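
To make the size argument in the description concrete, here is a minimal, self-contained sketch that contrasts the two escaping strategies. It is not the Lucene code: the class name EscapeSketch, both method names, and the exact character sets are assumptions made for illustration. The linked DefaultPassageFormatter and SimpleHTMLEncoder sources remain the reference for what each highlighter really escapes.

```java
// Hypothetical sketch, not taken from Lucene: compares "essentials only"
// escaping with an aggressive scheme that also entity-encodes other
// non-alphanumeric characters.
class EscapeSketch {

    /** Escapes only characters with special meaning in HTML (assumed set). */
    static String escapeEssentials(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            switch (ch) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#x27;"); break;
                default: out.append(ch);
            }
        }
        return out.toString();
    }

    /**
     * Escapes the same essentials, and additionally writes every other
     * character below 0xFF that is not an ASCII letter or digit as a
     * numeric entity. Spaces and punctuation therefore grow to 5 or 6 chars.
     */
    static String escapeAggressively(String text) {
        StringBuilder out = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            switch (ch) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#x27;"); break;
                default:
                    boolean asciiAlnum = (ch >= '0' && ch <= '9')
                            || (ch >= 'A' && ch <= 'Z')
                            || (ch >= 'a' && ch <= 'z');
                    if (asciiAlnum || ch >= 0xFF) {
                        out.append(ch);
                    } else {
                        out.append("&#").append((int) ch).append(';');
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "Available stemmers & x are listed in org.tartarus.snowball.ext.";
        String minimal = escapeEssentials(sample);
        String aggressive = escapeAggressively(sample);
        System.out.println(minimal);
        System.out.println(aggressive);
        // Print the lengths so the payload growth can be compared directly.
        System.out.println(sample.length() + " -> " + minimal.length()
                + " (essentials) vs " + aggressive.length() + " (aggressive)");
    }
}
```

Running main prints both encodings and their lengths. Under these assumptions the growth comes almost entirely from spaces and dots, which need no escaping in HTML text content.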