[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose

Michael McCandless (Jira) Fri, 14 Aug 2020 14:01:16 -0700


    [ https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178058#comment-17178058 ]


Michael McCandless commented on LUCENE-8776:
--------------------------------------------

[~dsmiley] I am pretty sure your example would actually work today.

You would just need your tokenizer to interleave the tokens.

Remember that there is no requirement that all tokens in one sub-path be 
enumerated before all tokens in another sub-path.  So your tokenizer would emit 
something like this: abc, ab, cd, def, ef.  This permutation should also work: 
ab, abc, cd, def, ef.
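
To make the ordering argument concrete, here is a minimal sketch in plain Java with no Lucene dependency. It assumes the input text is "abcdef", with one sub-path producing abc, def and the other producing ab, cd, ef; the original example is not quoted in this comment, so those offsets are a guess. {{Tok}} and {{startOffsetsNeverGoBackwards}} are hypothetical names that only model the "offsets must not go backwards" rule, they are not Lucene APIs.

```java
import java.util.Arrays;
import java.util.List;

public class OffsetOrderSketch {

    /** Hypothetical stand-in for a token: term text plus character offsets. */
    static final class Tok {
        final String term;
        final int startOffset;
        final int endOffset;

        Tok(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Assumed offsets over the hypothetical input text "abcdef".
    static final Tok ABC = new Tok("abc", 0, 3);
    static final Tok AB = new Tok("ab", 0, 2);
    static final Tok CD = new Tok("cd", 2, 4);
    static final Tok DEF = new Tok("def", 3, 6);
    static final Tok EF = new Tok("ef", 4, 6);

    /** Returns true if no token starts before the token emitted just ahead of it. */
    static boolean startOffsetsNeverGoBackwards(List<Tok> emitOrder) {
        int lastStart = 0;
        for (Tok t : emitOrder) {
            if (t.startOffset < lastStart) {
                return false;
            }
            lastStart = t.startOffset;
        }
        return true;
    }

    public static void main(String[] args) {
        // The two interleaved orders mentioned above.
        List<Tok> interleaved = Arrays.asList(ABC, AB, CD, DEF, EF);
        List<Tok> permuted = Arrays.asList(AB, ABC, CD, DEF, EF);
        // Emitting one whole sub-path before the other.
        List<Tok> pathByPath = Arrays.asList(ABC, DEF, AB, CD, EF);

        System.out.println(startOffsetsNeverGoBackwards(interleaved)); // true
        System.out.println(startOffsetsNeverGoBackwards(permuted));    // true
        System.out.println(startOffsetsNeverGoBackwards(pathByPath));  // false
    }
}
```

Under these assumed offsets, both interleaved orders keep start offsets non-decreasing, while enumerating one sub-path completely before the other does not.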

Of course, since Lucene does not record position length in the index, in order 
to get something close to reasonable for positional queries, you would need to 
use {{FlattenGraphFilter}} or something equivalent to squash the tokens from 
the two side paths on top of each other.

I think there is an issue where someone made a prototype that indexed position 
length into payloads and then did the right thing at search time.  There was 
also a talk about it at Buzzwords last summer, I think.  Does anyone know which 
issue this is?

If you ask git to enumerate all commits, sorted by commit timestamp, how does 
it serialize its DAG?  I think it is always possible for git to do this 
(correctly sorted), regardless of how complex the commit DAG turns out.  The 
same should then work for Lucene token streams, as long as the token stream is 
free to assign positions, similarly to how git is free to assign commit 
hashes.  In this analogy, Lucene's offsets are like git's timestamps.  I think?
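
One way to picture that serialization is a rough sketch in plain Java, not how git or Lucene actually implements it. It assumes the same hypothetical "abcdef" example, with sub-paths abc, def and ab, cd, ef, and assumes each sub-path is already ordered by start offset. A stable sort on start offset then interleaves the paths without reordering tokens within a path. {{Tok}} and {{interleave}} are illustrative names only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class InterleaveSketch {

    /** Hypothetical stand-in for a token: term text plus its start offset. */
    static final class Tok {
        final String term;
        final int startOffset;

        Tok(String term, int startOffset) {
            this.term = term;
            this.startOffset = startOffset;
        }
    }

    /**
     * Serializes several sub-paths into one emit order, sorted by start offset.
     * List.sort is stable, so tokens with equal start offsets keep the order
     * in which they were added (earlier sub-paths first).
     */
    static List<String> interleave(List<List<Tok>> subPaths) {
        List<Tok> all = new ArrayList<>();
        for (List<Tok> path : subPaths) {
            all.addAll(path);
        }
        all.sort(Comparator.comparingInt((Tok t) -> t.startOffset));

        List<String> terms = new ArrayList<>();
        for (Tok t : all) {
            terms.add(t.term);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Assumed offsets over the hypothetical input text "abcdef".
        List<Tok> pathOne = Arrays.asList(new Tok("abc", 0), new Tok("def", 3));
        List<Tok> pathTwo = Arrays.asList(new Tok("ab", 0), new Tok("cd", 2), new Tok("ef", 4));

        System.out.println(interleave(Arrays.asList(pathOne, pathTwo)));
        // prints [abc, ab, cd, def, ef]
    }
}
```

Passing the sub-paths in the opposite order would put ab ahead of abc, which corresponds to the other permutation mentioned earlier in this comment.
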
{quote}How would this work in indexing/ at query time? Can you write up a test 
case for the above, Mike?
{quote}
Actually, [~jim.ferenczi] is the expert on how our queryparsers respect the 
full token graph and produce correct queries!

A test case should not be hard to make – create a query parser with an analyzer 
that returns a {{CannedTokenStream}} producing the full token graph, then parse 
any text and see what query it produces?

> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>         Attachments: LUCENE-8776-proof-of-concept.patch
>
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, as 'light' adjacent to 
> 'emitting' and 'light' at a distance of two words from 'diode' need to match 
> this word. So, the order of words after splitting are: Organic, light, 
> emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'glows', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  



