[jira] [Comment Edited] (LUCENE-9426) UnifiedHighlighter does not handle SpanNotQuery correctly.

Christoph Goller (Jira) Tue, 14 Jul 2020 04:03:35 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157281#comment-17157281
 ]


Christoph Goller edited comment on LUCENE-9426 at 7/14/20, 11:02 AM:
---------------------------------------------------------------------

Analysis:

 

With PostingsOffsetStrategy highlighting for SpanNotQuery works correctly.

 

With MemoryIndexOffsetStrategy, UnifiedHighlighter creates an in-memory index 
of the document to be highlighted. However, it does not use the token stream 
produced by the indexAnalyzer directly. Instead it applies a 
FilteringTokenFilter that throws away all tokens that do not occur in the 
query. I guess this is done for efficiency reasons. The filter is based on an 
automaton that is built by MultiTermHighlighting. MultiTermHighlighting is 
based on the Visitor concept and ignores all subqueries that have 
BooleanClause.Occur.MUST_NOT. While this may be correct for a Boolean NOT 
query, it is not correct for a SpanNotQuery. In the example from the issue 
description, the token excluded by the SpanNotQuery (content:thousand) must be 
present in the in-memory index. Otherwise the query logic is corrupted.
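
To make the effect concrete, here is a minimal plain-Java simulation of the 
example query. It does not call Lucene. The class SpanNotFilterDemo and its 
methods filter and matches are hypothetical names, and the matching logic is a 
simplified model of spanNot(spanNear([100, dollars], 1, true), thousand, 0, 0) 
that assumes dropped tokens leave a position gap behind.

```java
import java.util.Set;

class SpanNotFilterDemo {

    // Simplified stand-in for the token filter: tokens not in 'keep' are
    // dropped, but their positions are preserved (the slot becomes null).
    static String[] filter(String[] tokens, Set<String> keep) {
        String[] out = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            out[i] = keep.contains(tokens[i]) ? tokens[i] : null;
        }
        return out;
    }

    // Simplified model of spanNot(spanNear([first, second], slop, true), excluded, 0, 0):
    // true if some ordered first..second span within 'slop' contains no 'excluded' token.
    static boolean matches(String[] tokens, String first, String second,
                           int slop, String excluded) {
        for (int i = 0; i < tokens.length; i++) {
            if (!first.equals(tokens[i])) continue;
            for (int j = i + 1; j < tokens.length && j - i - 1 <= slop; j++) {
                if (!second.equals(tokens[j])) continue;
                boolean overlapsExcluded = false;
                for (int k = i; k <= j; k++) {
                    if (excluded.equals(tokens[k])) overlapsExcluded = true;
                }
                if (!overlapsExcluded) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] doc = "we need 100 thousand dollars to buy the house".split(" ");

        // Full token stream: "thousand" lies inside the span, so there is no match.
        System.out.println(matches(doc, "100", "dollars", 1, "thousand")); // false

        // Filter built without the excluded term: the exclusion can no longer fire.
        String[] filtered = filter(doc, Set.of("100", "dollars"));
        System.out.println(matches(filtered, "100", "dollars", 1, "thousand")); // true

        // Filter that also keeps the excluded term: the result agrees with search again.
        String[] fixed = filter(doc, Set.of("100", "dollars", "thousand"));
        System.out.println(matches(fixed, "100", "dollars", 1, "thousand")); // false
    }
}
```

The second case corresponds to the wrong highlight reported in the issue: once 
"thousand" is filtered out, nothing is left in the index to veto the span.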

 

As a fix I recommend adding all tokens from the query, even if they have 
BooleanClause.Occur.MUST_NOT. The index still remains small, but the query 
logic will be correct.
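
The difference between the two term-collection policies can be sketched in 
plain Java as follows. This is a simplified model, not Lucene's visitor API: 
the class MustNotTermCollection, its Node type and the includeMustNot flag are 
hypothetical, and the tree only mimics the shape of the example query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class MustNotTermCollection {

    enum Occur { MUST, MUST_NOT }

    // Hypothetical, simplified query node: a leaf carries a term,
    // an inner node carries children.
    static final class Node {
        final Occur occur;
        final String term;
        final List<Node> children = new ArrayList<>();

        Node(Occur occur, String term) {
            this.occur = occur;
            this.term = term;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Collects the terms the token filter would keep.
    static Set<String> collect(Node root, boolean includeMustNot) {
        Set<String> terms = new TreeSet<>();
        collectInto(root, includeMustNot, terms);
        return terms;
    }

    private static void collectInto(Node node, boolean includeMustNot, Set<String> out) {
        // Policy described in the analysis: skip the whole MUST_NOT subtree.
        if (node.occur == Occur.MUST_NOT && !includeMustNot) return;
        if (node.term != null) out.add(node.term);
        for (Node child : node.children) collectInto(child, includeMustNot, out);
    }

    // Shape of spanNot(spanNear([100, dollars]), thousand).
    static Node exampleQuery() {
        Node near = new Node(Occur.MUST, null)
                .add(new Node(Occur.MUST, "100"))
                .add(new Node(Occur.MUST, "dollars"));
        return new Node(Occur.MUST, null)
                .add(near)
                .add(new Node(Occur.MUST_NOT, "thousand"));
    }

    public static void main(String[] args) {
        System.out.println(collect(exampleQuery(), false)); // [100, dollars]
        System.out.println(collect(exampleQuery(), true));  // [100, dollars, thousand]
    }
}
```

With the flag enabled, the kept term set grows only by the excluded terms of 
the query, which is why the filtered index stays small.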

 

I attach a unit test that demonstrates the problem.



> UnifiedHighlighter does not handle SpanNotQuery correctly.
> ----------------------------------------------------------
>
>                 Key: LUCENE-9426
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9426
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 8.5.1
>         Environment: I tested with 8.5.1, but other versions are probably 
> also affected.
>            Reporter: Christoph Goller
>            Priority: Major
>              Labels: easyfix
>
> If UnifiedHighlighter uses MemoryIndexOffsetStrategy, it does not treat 
> SpanNotQuery correctly.
> Since UnifiedHighlighter uses actual search in order to determine which 
> locations to highlight, it should be consistent with search and only 
> highlight locations in a document that really match the query. However, it 
> does not for SpanNotQuery.
> For the query spanNot(spanNear([content:100, content:dollars], 1, true), 
> content:thousand, 0, 0)
> it produces
> A <b>100</b> fucking <b>dollars</b> wasn't enough to fix it. ... We need 
> <b>100</b> thousand <b>dollars</b> to buy the house



