[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose

Roman (Jira) Fri, 14 Aug 2020 12:39:47 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178030#comment-17178030
 ]


Roman commented on LUCENE-8776:
-------------------------------

I'll dig for more concrete examples, but since you pointed out 
https://issues.apache.org/jira/browse/LUCENE-4312 (position length is not 
indexed), this becomes a bit of an academic exercise. If position length is 
not indexed, there is no hope of reconstructing the token stream; the 
highlighter has no information about which tokens were part of the 
multi-token synonym. Position length could substitute for offsets, but my 
last example also shows where it fails: it has no access to the stop words 
that were removed.

 

So when relying on position length, one cannot know the span of the tokens 
that made up the resulting multi-token synonym (in other words: position 
length cannot substitute for offsets, even though it is proposed as a 
solution to the same problem).
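
A minimal sketch of that failure mode, under my own assumptions: the class, 
the method, and the "the who played" example are hypothetical and are not 
Lucene APIs or data from the unit test. It tries to recover a synonym's 
character span from its position and position length by looking up the 
offsets of the surviving single-word tokens.

```java
/** Hypothetical illustration only; none of these names are Lucene APIs. */
class PositionLengthSketch {

    /**
     * Tries to recover the character span of a multi-token synonym that starts at
     * position and covers positionLength positions, using only the offsets of the
     * tokens that survived analysis. Returns {start, end}, or null if no surviving
     * token falls inside the covered positions.
     */
    static int[] spanFromPositions(int[] positions, int[] startOffsets, int[] endOffsets,
                                   int position, int positionLength) {
        int start = Integer.MAX_VALUE;
        int end = Integer.MIN_VALUE;
        for (int i = 0; i < positions.length; i++) {
            if (positions[i] >= position && positions[i] < position + positionLength) {
                start = Math.min(start, startOffsets[i]);
                end = Math.max(end, endOffsets[i]);
            }
        }
        return start == Integer.MAX_VALUE ? null : new int[] {start, end};
    }

    public static void main(String[] args) {
        // Assumed text: "the who played". The stop word "the" (position 0,
        // offsets 0-3) was removed, so only "who" and "played" remain.
        int[] positions = {1, 2};
        int[] starts = {4, 8};
        int[] ends = {7, 14};

        // A synonym for "the who" would start at position 0 with length 2.
        int[] span = spanFromPositions(positions, starts, ends, 0, 2);

        // The true span is 0-7, but only "who" can be found.
        System.out.println(span[0] + "-" + span[1]); // prints 4-7
    }
}
```

The recovered span starts at "who" because nothing in the index remembers 
where the removed stop word began; stored offsets on the synonym token itself 
would not have this problem.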

 

Check this place in the indexing chain: the previous (invertState) start and 
end offsets play a role in the checks. Multi-token synonyms always span 
several words, so to pass this check their offsets must be trimmed (or the 
offsets of the words that made them up enlarged, but that would also be weird).

[https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L917]
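
To make the check concrete, here is a minimal standalone sketch. It 
paraphrases the kind of comparison done at that point (each start offset is 
compared with the previous token's start offset); it is not the Lucene 
source, and the class and method names are made up. The offsets are computed 
by hand from the example sentence in the issue description, and the token 
order is my assumption based on the table there.

```java
/** Simplified paraphrase of an "offsets must not go backwards" check. */
class OffsetCheckSketch {

    /**
     * Returns the index of the first token whose start offset is smaller than the
     * previous token's start offset (or whose end offset precedes its start
     * offset), or -1 if every token passes.
     */
    static int firstRejectedToken(int[] startOffsets, int[] endOffsets) {
        int lastStartOffset = 0;
        for (int i = 0; i < startOffsets.length; i++) {
            if (startOffsets[i] < lastStartOffset || endOffsets[i] < startOffsets[i]) {
                return i;
            }
            lastStartOffset = startOffsets[i];
        }
        return -1;
    }

    public static void main(String[] args) {
        // "Organic light-emitting-diode glows", tokens in assumed stream order:
        // organic, light, light-emitting-diode, emitting, diode,
        // light-emitting-diode (second copy), glows
        int[] starts = {0, 8, 8, 14, 23, 8, 29};
        int[] ends = {7, 13, 28, 22, 28, 28, 34};
        System.out.println(firstRejectedToken(starts, ends)); // prints 5

        // Trimming the second compound token to the offsets of "diode" passes.
        int[] trimmedStarts = {0, 8, 8, 14, 23, 23, 29};
        System.out.println(firstRejectedToken(trimmedStarts, ends)); // prints -1
    }
}
```

The first copy of the compound token passes because its start offset equals 
that of 'light'; only the second copy, emitted after 'diode', is rejected 
unless its offsets are trimmed.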

 

Finally, in my experience query-time synonym expansion is not enough; we have 
to do both. Yes, it is very subtle, and I have tried hard not to do both (but 
I can be proven wrong, if somebody tries harder :))

 

Here is the analyzer chain: 
[https://github.com/romanchyla/montysolr/blob/master/contrib/examples/adsabs/server/solr/collection1/conf/schema.xml#L406]

 

And the accompanying unit test: 
[https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L62]

 

On that line you'll see the reasoning for why we do both query-time and 
index-time expansion.

 

We also do query optimization, which some might find crazy, to avoid score 
inflation (maybe you have noticed that when query expansion emits multiple 
synonyms, the score is by default a product of the many synonyms; you could 
call that score inflation). Because we can't rely fully on index-time 
expansion, we also expand at query time and pick from the synonyms the one 
that is more frequent or less frequent (based on the user query). This 
optimization is only possible because we have indexed them all; I don't think 
you could do that at query time only.

[https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L293]

 
[https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L377]
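
The selection step can be sketched roughly as follows. This is an 
illustration under my own assumptions, not the code behind the tests linked 
above: the class, the method, and the document frequencies are invented, and 
the real implementation would read frequencies from the index.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of picking one synonym instead of scoring all of them. */
class SynonymPickSketch {

    /**
     * Returns the synonym with the highest document frequency when preferFrequent
     * is true, otherwise the one with the lowest. Returns null for an empty map.
     * On ties the first entry in iteration order wins.
     */
    static String pick(Map<String, Integer> docFreqBySynonym, boolean preferFrequent) {
        String best = null;
        int bestFreq = 0;
        for (Map.Entry<String, Integer> e : docFreqBySynonym.entrySet()) {
            int freq = e.getValue();
            boolean better = best == null
                || (preferFrequent ? freq > bestFreq : freq < bestFreq);
            if (better) {
                best = e.getKey();
                bestFreq = freq;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up frequencies; both forms are assumed to be indexed.
        Map<String, Integer> docFreq = new LinkedHashMap<>();
        docFreq.put("hst", 120);
        docFreq.put("hubble space telescope", 45);

        System.out.println(pick(docFreq, true));  // prints hst
        System.out.println(pick(docFreq, false)); // prints hubble space telescope
    }
}
```

Only the chosen synonym would then go into the query, so the score is no 
longer inflated by every emitted variant.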

 

But I seem to be digressing; these things are only important if we were 
inclined to believe that query-time synonym expansion can solve it for all 
cases.

> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>         Attachments: LUCENE-8776-proof-of-concept.patch
>
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, as 'light' adjacent to 
> 'emitting' and 'light' at a distance of two words from 'diode' need to match 
> this word. So, the order of words after splitting is: Organic, light, 
> emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'diode', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
