[ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180098#comment-17180098
 ] 

Roman commented on LUCENE-8776:
-------------------------------

Michael, I appreciate your thinking about the issue - it's always pleasure to 
meet somebody who changes his/her mind based on evidence. As far as it seems to 
me, the backward going offsets are indeed the 'only' problem – but 
unfortunately of some 'fundamental' category. It doesn't square nicely with the 
desire to efficiently store offsets using deltas.

I can reassure you that the problem doesn't appear on (only) first or lasts 
positions. The unittests linked fro my post have specific examples for that 
(even searching for phrases embedded/surrounded by other tokens)

The swapping - unless i'm missing something important - will not work (for 
multi-token situation). Because the next token is going to be checked against 
offsets of the last emitted token. So wherever the multi-token gets emitted (as 
first, last, or even in the middle) - it will trip the wire for the surrounding 
tokens (or these already emitted tokens will trip the wire for the multi-word 
token)

 

funnily enough, swapping 0th position could trigger another issue (but only for 
the very first token in the stream): 
[https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L897]

 

that the 

> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>         Attachments: LUCENE-8776-proof-of-concept.patch
>
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, as 'light' adjacent to 
> 'emitting' and 'light' at a distance of two words from 'diode' need to match 
> this word. So, the order of words after splitting are: Organic, light, 
> emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'glows', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to