[ https://issues.apache.org/jira/browse/LUCENE-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe Schindler updated LUCENE-10359:
-----------------------------------
    Component/s: modules/analysis

> KoreanTokenizer: TestRandomChains fails with incorrect offsets
> --------------------------------------------------------------
>
>                 Key: LUCENE-10359
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10359
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Uwe Schindler
>            Priority: Major
>
> It looks like KoreanTokenizer is causing this (NORI), but Kuromoji may be affected in the same way:
> {noformat}
> org.apache.lucene.analysis.tests.TestRandomChains > test suite's output saved to C:\Users\Uwe Schindler\Projects\lucene\lucene\lucene\analysis\integration.tests\build\test-results\test\outputs\OUTPUT-org.apache.lucene.analysis.tests.TestRandomChains.txt, copied below:
>   2> stage 0: e<[2-3] +1> ek<[4-6] +1> oy<[8-10] +1> 1<[11-12] +1> zzkuxp<[13-19] +1>
>   2> stage 1: e<[2-3] +1> ek<[4-6] +1> oy<[8-10] +1> 1<[11-12] +1> zzkuxp<[13-19] +1>
>   2> stage 2: e<[2-3] +1> e ek<[2-6] +0> ek<[4-6] +1> ek oy<[4-10] +0> oy<[8-10] +1> oy 1<[8-12] +0> 1<[11-12] +1> 1 zzkuxp<[11-19] +0>
>   2> stage 3: e<[2-3] +1> e ek<[2-6] +0> ek<[4-6] +1> ek oy<[4-10] +0> oy<[8-10] +1> oy 1<[8-12] +0> 1<[11-12] +1> 1 zzkuxp<[11-19] +0>
>   2> last stage: e<[2-3] +1> e ek<[2-6] +0> ek<[4-6] +1> ek oy<[4-10] +0> oy<[8-10] +1> oy 1<[8-12] +0> 1 zzkuxp<[11-19] +0>
>   2> TEST FAIL: useCharFilter=false text='?.e|ek|]oy{1 zzkuxp ZyzzV ycuqjnv axtpppvk \u233b\u23c8\u2314\u232e\u236e\u238d\u235e x d \"</p>'
>   2> Exception from random analyzer:
>   2> charfilters=
>   2>   org.apache.lucene.analysis.pattern.PatternReplaceCharFilter(a, ifywufhi, java.io.StringReader@48586999)
>   2>   org.apache.lucene.analysis.charfilter.MappingCharFilter(org.apache.lucene.analysis.charfilter.NormalizeCharMap@65036838, org.apache.lucene.analysis.pattern.PatternReplaceCharFilter@11d4ba35)
>   2> tokenizer=
>   2>   org.apache.lucene.analysis.ko.KoreanTokenizer()
>   2> filters=
>   2>   org.apache.lucene.analysis.en.KStemFilter(ValidatingTokenFilter@595d7938 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=null,leftPOS=null,rightPOS=null,morphemes=null,reading=null,keyword=false)
>   2>   org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@13d08b48 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=null,leftPOS=null,rightPOS=null,morphemes=null,reading=null,keyword=false, u)
>   2>   org.apache.lucene.analysis.util.ElisionFilter(ValidatingTokenFilter@6396b917 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=null,leftPOS=null,rightPOS=null,morphemes=null,reading=null,keyword=false, [fh, hiiwwxyyd, fcpodqor, qogvhmywr, l, icad])
>   2>   Conditional:org.apache.lucene.analysis.ko.KoreanNumberFilter(OneTimeWrapper@5f0558f6 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=null,leftPOS=null,rightPOS=null,morphemes=null,reading=null,keyword=false)
>    > java.lang.IllegalStateException: last stage: inconsistent startOffset at pos=2: 8 vs 11; token=1 zzkuxp
>    >     at __randomizedtesting.SeedInfo.seed([E4552C7844FC2DA3:8E0E93691DB20D50]:0)
>    >     at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:138)
>    >     at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:1130)
>    >     at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:1028)
>    >     at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:922)
>    >     at org.apache.lucene.analysis.tests@10.0.0-SNAPSHOT/org.apache.lucene.analysis.tests.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:943)
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
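
The reported exception can be replayed by hand from the token dump in the log. The sketch below is not the ValidatingTokenFilter code; it is a minimal stand-alone illustration that assumes only the rule named in the exception message: every token landing on the same position must report the same startOffset. The class name OffsetConsistencySketch, the Tok holder, and the helper methods are hypothetical and exist only for this example. The token data is copied from the "stage 3" and "last stage" lines of the log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; it does not use or reproduce Lucene classes.
class OffsetConsistencySketch {

    /** One entry of the dump, written there as term<[start-end] +posInc>. */
    static final class Tok {
        final String term;
        final int start;
        final int end;
        final int posInc;

        Tok(String term, int start, int end, int posInc) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
        }
    }

    /**
     * Returns a message for the first token whose startOffset differs from an
     * earlier token at the same position, or null if all tokens agree.
     */
    static String firstInconsistency(List<Tok> tokens) {
        Map<Integer, Integer> startByPos = new HashMap<>();
        int pos = -1; // the first token with posInc=1 lands on position 0
        for (Tok t : tokens) {
            pos += t.posInc;
            Integer seen = startByPos.putIfAbsent(pos, t.start);
            if (seen != null && seen != t.start) {
                return "inconsistent startOffset at pos=" + pos + ": "
                        + seen + " vs " + t.start + "; token=" + t.term;
            }
        }
        return null;
    }

    /** Tokens from the "stage 3" line of the log. */
    static List<Tok> stage3() {
        List<Tok> toks = new ArrayList<>();
        toks.add(new Tok("e", 2, 3, 1));
        toks.add(new Tok("e ek", 2, 6, 0));
        toks.add(new Tok("ek", 4, 6, 1));
        toks.add(new Tok("ek oy", 4, 10, 0));
        toks.add(new Tok("oy", 8, 10, 1));
        toks.add(new Tok("oy 1", 8, 12, 0));
        toks.add(new Tok("1", 11, 12, 1));
        toks.add(new Tok("1 zzkuxp", 11, 19, 0));
        return toks;
    }

    /** Tokens from the "last stage" line: same as stage 3 but without the "1" token. */
    static List<Tok> lastStage() {
        List<Tok> toks = stage3();
        toks.remove(6); // the 1<[11-12] +1> entry is absent from the last stage
        return toks;
    }

    public static void main(String[] args) {
        System.out.println("stage 3:    " + firstInconsistency(stage3()));
        System.out.println("last stage: " + firstInconsistency(lastStage()));
    }
}
```

With the "1" token present (stage 3), the shingle "1 zzkuxp" shares a position with "1" and both start at offset 11. In the last stage the "1" token is missing, so "1 zzkuxp" with a position increment of 0 falls onto the position of "oy" and "oy 1", which start at offset 8. That is the 8 vs 11 mismatch at pos=2 in the exception. The sketch only shows where the mismatch arises in the dump; it does not establish which component in the chain is at fault.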