[
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178058#comment-17178058
]
Michael McCandless commented on LUCENE-8776:
--------------------------------------------
[~dsmiley] I am pretty sure your example would actually work today.
You would just need your tokenizer to interleave the tokens.
Remember that there is no requirement that all tokens in one sub-path be
enumerated before all tokens in another sub-path. So your tokenizer would emit
something like this: abc, ab, cd, def, ef. This permutation should also work:
ab, abc, cd, def, ef.
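
A minimal sketch of why the interleaved order is acceptable, assuming the input text is "abcdef" split along the two paths abc/def and ab/cd/ef. The offsets are an assumption made for illustration, not taken from the issue, and {{Tok}} and {{offsetsNeverGoBackwards}} are hypothetical helpers rather than Lucene API. It only checks that start offsets never decrease from one emitted token to the next:

```java
import java.util.Arrays;
import java.util.List;

class InterleavedOffsets {
    /** Hypothetical stand-in for a token: term text plus character offsets. */
    static final class Tok {
        final String term;
        final int startOffset;
        final int endOffset;

        Tok(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Returns true if no token starts before the token emitted just before it. */
    static boolean offsetsNeverGoBackwards(List<Tok> tokens) {
        int lastStart = 0;
        for (Tok t : tokens) {
            if (t.startOffset < lastStart) {
                return false;
            }
            lastStart = t.startOffset;
        }
        return true;
    }

    // Assumed offsets for the text "abcdef".
    static final Tok ABC = new Tok("abc", 0, 3);
    static final Tok AB = new Tok("ab", 0, 2);
    static final Tok CD = new Tok("cd", 2, 4);
    static final Tok DEF = new Tok("def", 3, 6);
    static final Tok EF = new Tok("ef", 4, 6);

    /** The two interleaved orders mentioned above. */
    static List<Tok> interleaved() {
        return Arrays.asList(ABC, AB, CD, DEF, EF);
    }

    static List<Tok> interleavedAlt() {
        return Arrays.asList(AB, ABC, CD, DEF, EF);
    }

    /** One whole sub-path followed by the other. */
    static List<Tok> pathByPath() {
        return Arrays.asList(ABC, DEF, AB, CD, EF);
    }

    public static void main(String[] args) {
        System.out.println(offsetsNeverGoBackwards(interleaved()));    // true
        System.out.println(offsetsNeverGoBackwards(interleavedAlt())); // true
        System.out.println(offsetsNeverGoBackwards(pathByPath()));     // false
    }
}
```

Emitting one sub-path in full before the other is what trips the check, because "ab" starts at 0 after "def" started at 3.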
Of course, since Lucene does not record position length in the index, in order
to get something close to reasonable for positional queries, you would need to
use {{FlattenGraphFilter}} or something equivalent to squash the tokens from
the two side paths on top of each other.
I think there is an issue where someone made a prototype that indexed position
length into payloads and then did the right thing at search time. I believe
there was also a talk about it at Buzzwords last summer. Does anyone know which
issue that was?
If you ask git to enumerate all commits, sorted by commit timestamp, how does
it serialize its DAG? I think git can always do this (correctly sorted),
regardless of how complex the commit DAG turns out, and the same approach
should then also work for Lucene token streams, as long as the token stream
is free to assign positions, similarly to how git is free to assign commit
hashes. And Lucene's offsets are like git's timestamps. I think?
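
The analogy can be sketched roughly as follows, under two loud assumptions: the text is "abcdef" with the offsets shown, and every distinct character offset corresponds to exactly one graph node (which need not hold for arbitrary token graphs). {{Arc}} and {{serialize}} are hypothetical names, not Lucene API. The sketch sorts the arcs by start offset and derives position increment and position length from the node numbering:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class AssignPositions {
    /** Hypothetical graph arc: a term spanning [start, end) in the text. */
    static final class Arc {
        final String term;
        final int start;
        final int end;

        Arc(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Serializes the arcs in start-offset order and assigns posInc/posLen. */
    static List<String> serialize(List<Arc> arcs) {
        // Assumption: one node per distinct offset, numbered in offset order.
        TreeSet<Integer> offsets = new TreeSet<>();
        for (Arc a : arcs) {
            offsets.add(a.start);
            offsets.add(a.end);
        }
        Map<Integer, Integer> node = new HashMap<>();
        for (int off : offsets) {
            node.put(off, node.size());
        }

        // List.sort is stable, so arcs with equal start keep their input order.
        List<Arc> sorted = new ArrayList<>(arcs);
        sorted.sort(Comparator.comparingInt((Arc a) -> a.start));

        List<String> out = new ArrayList<>();
        int prevFrom = -1; // so the first token gets posInc=1
        for (Arc a : sorted) {
            int from = node.get(a.start);
            int to = node.get(a.end);
            out.add(a.term + " posInc=" + (from - prevFrom) + " posLen=" + (to - from));
            prevFrom = from;
        }
        return out;
    }

    /** The two sub-paths, listed path by path (abc/def, then ab/cd/ef). */
    static List<Arc> example() {
        return Arrays.asList(
            new Arc("abc", 0, 3), new Arc("def", 3, 6),
            new Arc("ab", 0, 2), new Arc("cd", 2, 4), new Arc("ef", 4, 6));
    }

    public static void main(String[] args) {
        for (String line : serialize(example())) {
            System.out.println(line);
        }
    }
}
```

For the assumed input this emits abc, ab, cd, def, ef, the first interleaving mentioned above, with position increments 1, 0, 1, 1, 1.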
{quote}How would this work in indexing/ at query time? Can you write up a test
case for the above, Mike?
{quote}
Actually, [~jim.ferenczi] is the expert on how our queryparsers respect the
full token graph and produce correct queries!
A test case should not be hard to make – create a query parser with an analyzer
that returns a {{CannedTokenStream}} producing the full token graph, then parse
any text and see what query it produces?
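
As a rough aid for judging the result of such a test, the sketch below enumerates the term sequences encoded by a serialized token graph. It does not use Lucene and makes no claim about what the query parser actually outputs. {{Tok}} and {{paths}} are hypothetical names, and the position increments and lengths are assumed values for the "abcdef" example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GraphPaths {
    /** Hypothetical token: term plus position increment and position length. */
    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;

        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Returns every start-to-end term sequence in the serialized graph. */
    static List<String> paths(List<Tok> tokens) {
        int n = tokens.size();
        int[] from = new int[n];
        int[] to = new int[n];
        int pos = -1;
        int end = 0;
        for (int i = 0; i < n; i++) {
            pos += tokens.get(i).posInc;          // node the token leaves
            from[i] = pos;
            to[i] = pos + tokens.get(i).posLen;   // node the token arrives at
            end = Math.max(end, to[i]);
        }
        List<String> out = new ArrayList<>();
        walk(tokens, from, to, 0, end, "", out);
        return out;
    }

    private static void walk(List<Tok> tokens, int[] from, int[] to,
                             int node, int end, String prefix, List<String> out) {
        if (node == end) {
            out.add(prefix.trim());
            return;
        }
        for (int i = 0; i < tokens.size(); i++) {
            if (from[i] == node) {
                walk(tokens, from, to, to[i], end, prefix + " " + tokens.get(i).term, out);
            }
        }
    }

    /** Assumed serialization of the interleaved stream abc, ab, cd, def, ef. */
    static List<Tok> example() {
        return Arrays.asList(
            new Tok("abc", 1, 2), new Tok("ab", 0, 1), new Tok("cd", 1, 2),
            new Tok("def", 1, 2), new Tok("ef", 1, 1));
    }

    public static void main(String[] args) {
        System.out.println(paths(example())); // [abc def, ab cd ef]
    }
}
```

If the parsed query covered exactly these two sequences, that would suggest the graph was respected; a real test would still need to inspect the query Lucene produces.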
> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
> Key: LUCENE-8776
> URL: https://issues.apache.org/jira/browse/LUCENE-8776
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 7.6
> Reporter: Ram Venkat
> Priority: Major
> Attachments: LUCENE-8776-proof-of-concept.patch
>
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run
> span queries and highlight them properly.
> During index time, light-emitting-diode is split into three words, which
> allows me to search for 'light', 'emitting' and 'diode' individually. The
> three words occupy adjacent positions in the index, as 'light' adjacent to
> 'emitting' and 'light' at a distance of two words from 'diode' need to match
> this word. So, the order of words after splitting is: Organic, light,
> emitting, diode, glows.
> But, I also want to search for 'organic' being adjacent to
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'.
> The way I solved this was to also generate 'light-emitting-diode' at two
> positions: (a) in the same position as 'light' and (b) in the same position
> as 'diode', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets
> are obviously the same. This works beautifully in Lucene 5.x in both
> searching and highlighting with span queries.
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is
> being thrown without any comments on why this check is needed. As I explained
> above, startOffset going backwards is perfectly valid, to deal with word
> splitting and span operations on these specialized use cases. On the other
> hand, it is not clear what value is added by this check and which highlighter
> code is affected by offsets going backwards. This same check is done at
> BaseTokenStreamTestCase:245.
> I see others talk about how this check found bugs in WordDelimiter etc. but
> it also prevents legitimate use cases. Can this check be removed?