[ https://issues.apache.org/jira/browse/LUCENE-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347615#comment-17347615 ]
Geoffrey Lawson commented on LUCENE-9963:
-----------------------------------------

I see three issues that need resolution.

1) When there is a hole at the beginning of an alternate path, the long path doesn't have a node set up to end on after flattening. There already has to be some hole recovery during the alternate path, so we should be able to address the recovered output node correctly so the long path can find it when it flattens.

2) The last node in an alternate path is what triggers the long path to give up its pointer to the input from the output. If it's not there, tokens that start from the long path's output node in the input will try to start at its output node in the output. This can result in out-of-order tokens and errors. When the token after both paths gets added, I think it should start at the frontier. If it doesn't, it should release the edge that brought it to the current node. This one seems the trickiest to fix.

3) Similar to issue 2, but instead of another token coming in to trigger the hole resolution, the token stream ends. The output graph is mostly correct, but while releasing tokens the filter will expect tokens that don't exist and error. We can identify these as holes and not output any tokens.

I've got a change that addresses these problems. I'm not thrilled with the fix for issue 2, and I want to add more unit tests to verify it's working as intended. I'll post a separate PR for the fix so we can get these tests in first.
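To make the three cases concrete, here is a minimal, self-contained sketch that locates the hole in each of the token graphs used by the unit tests quoted in the issue description. It is not Lucene code and not the proposed fix: `HoleFinder` and its methods are hypothetical names, and each token is reduced to a `{posInc, posLen}` pair. It assumes the usual convention that a token leaves the node at its position (the running sum of position increments, starting from -1) and arrives at the node `position + posLen`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper (not part of Lucene): reports holes in a token graph. */
final class HoleFinder {

  /** Nodes, other than node 0, that some token leaves but no token arrives at. */
  static List<Integer> nodesWithoutIncoming(int[][] tokens) {
    TreeSet<Integer> from = new TreeSet<>();
    TreeSet<Integer> to = new TreeSet<>();
    collect(tokens, from, to);
    List<Integer> result = new ArrayList<>();
    for (int node : from) {
      if (node != 0 && !to.contains(node)) {
        result.add(node);
      }
    }
    return result;
  }

  /** Nodes, other than the final node, that some token arrives at but no token leaves. */
  static List<Integer> nodesWithoutOutgoing(int[][] tokens) {
    TreeSet<Integer> from = new TreeSet<>();
    TreeSet<Integer> to = new TreeSet<>();
    collect(tokens, from, to);
    List<Integer> result = new ArrayList<>();
    int last = to.isEmpty() ? 0 : to.last();
    for (int node : to) {
      if (node != last && !from.contains(node)) {
        result.add(node);
      }
    }
    return result;
  }

  /** Each token is {posInc, posLen}; records the node it leaves and the node it arrives at. */
  private static void collect(int[][] tokens, TreeSet<Integer> from, TreeSet<Integer> to) {
    int pos = -1; // assumed convention: positions start at -1 before the first increment
    for (int[] token : tokens) {
      pos += token[0];
      from.add(pos);
      to.add(pos + token[1]);
    }
  }

  public static void main(String[] args) {
    int[][] firstStepHole = {{1, 3}, {1, 1}, {1, 1}};            // abc, b, c
    int[][] lastStepHole = {{1, 3}, {0, 1}, {1, 1}, {2, 1}};     // abc, a, b, d
    int[][] lastStepHoleNoEndToken = {{1, 3}, {0, 1}, {1, 1}};   // abc, a, b
    System.out.println(nodesWithoutIncoming(firstStepHole));
    System.out.println(nodesWithoutOutgoing(lastStepHole));
    System.out.println(nodesWithoutOutgoing(lastStepHoleNoEndToken));
  }
}
```

Under these assumptions the sketch reports node 1 as having no incoming token for the first graph (issue 1), and node 2 as having no outgoing token for the second and third graphs (issues 2 and 3). The only difference between the last two is whether a later token ("d") arrives after the alternate path or the stream simply ends.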
> Flatten graph filter has errors when there are holes at beginning or end of alternate paths
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-9963
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9963
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 8.8
>            Reporter: Geoffrey Lawson
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> If asserts are enabled, having gaps at the beginning or end of an alternate path can result in assertion errors, e.g.:
>
> {code:java}
> java.lang.AssertionError: 2
>     at org.apache.lucene.analysis.core.FlattenGraphFilter.releaseBufferedToken(FlattenGraphFilter.java:195)
> {code}
>
> Or
>
> {code:java}
> java.lang.AssertionError
>     at org.apache.lucene.analysis.core.FlattenGraphFilter.releaseBufferedToken(FlattenGraphFilter.java:191)
> {code}
>
> If asserts are not enabled, the same conditions will result in either IndexOutOfBounds exceptions or dropped tokens.
>
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: Index -2 out of bounds for length 8
>     at org.apache.lucene.util.RollingBuffer.get(RollingBuffer.java:109)
>     at org.apache.lucene.analysis.core.FlattenGraphFilter.incrementToken(FlattenGraphFilter.java:325)
> {code}
>
> These issues can be recreated with the following unit tests:
>
> {code:java}
> public void testAltPathFirstStepHole() throws IOException {
>   TokenStream in = new CannedTokenStream(0, 3, new Token[]{
>     token("abc", 1, 3, 0, 3),
>     token("b", 1, 1, 1, 2),
>     token("c", 1, 1, 2, 3)
>   });
>   TokenStream out = new FlattenGraphFilter(in);
>   assertTokenStreamContents(out,
>     new String[]{"abc", "b", "c"},
>     new int[] {0, 1, 2},
>     new int[] {3, 2, 3},
>     new int[] {1, 1, 1},
>     new int[] {3, 1, 1}, //token 0 may need to be len 1 after flattening
>     3);
> }{code}
>
> {code:java}
> public void testAltPathLastStepHole() throws IOException {
>   TokenStream in = new CannedTokenStream(0, 4, new Token[]{
>     token("abc", 1, 3, 0, 3),
>     token("a", 0, 1, 0, 1),
>     token("b", 1, 1, 1, 2),
>     token("d", 2, 1, 3, 4)
>   });
>   TokenStream out = new FlattenGraphFilter(in);
>   assertTokenStreamContents(out,
>     new String[]{"abc", "a", "b", "d"},
>     new int[] {0, 0, 1, 3},
>     new int[] {1, 1, 2, 4},
>     new int[] {1, 0, 1, 2},
>     new int[] {3, 1, 1, 1},
>     4);
> }{code}
>
> {code:java}
> public void testAltPathLastStepHoleWithoutEndToken() throws IOException {
>   TokenStream in = new CannedTokenStream(0, 2, new Token[]{
>     token("abc", 1, 3, 0, 3),
>     token("a", 0, 1, 0, 1),
>     token("b", 1, 1, 1, 2)
>   });
>   TokenStream out = new FlattenGraphFilter(in);
>   assertTokenStreamContents(out,
>     new String[]{"abc", "a", "b"},
>     new int[] {0, 0, 1},
>     new int[] {1, 1, 2},
>     new int[] {1, 0, 1},
>     new int[] {1, 1, 1},
>     2);
> }{code}
>
> I believe LUCENE-8723 is a related issue, as it looks like the last token in an alternate path is being deleted.