[ https://issues.apache.org/jira/browse/LUCENE-10363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe Schindler updated LUCENE-10363:
-----------------------------------
    Labels: random-chains  (was: )

> JapaneseCompletionFilter messes up offsets
> ------------------------------------------
>
>                 Key: LUCENE-10363
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10363
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Priority: Major
>              Labels: random-chains
>
> It is a tokenfilter, tries to change offsets, so of course TestRandomChains
> finds bugs in it:
> {noformat}
> NOTE: reproduce with: gradlew test --tests
> TestRandomChains.testRandomChainsWithLargeStrings
> -Dtests.seed=E233A5FAC016E02 -Dtests.nightly=true -Dtests.slow=true
> -Dtests.locale=en-TV -Dtests.timezone=Asia/Saigon -Dtests.asserts=true
> -Dtests.file.encoding=UTF-8
> {noformat}
> {noformat}
> org.apache.lucene.analysis.tests.TestRandomChains > test suite's output saved
> to
> /home/rmuir/workspace/lucene/lucene/analysis/integration.tests/build/test-results/test_54/outputs/OUTPUT-org.apache.lucene.analysis.tests.TestRandomChains.txt,
> copied below:
> 2> stage 0: lk<[1-3] +1> p<[6-7] +1> ngtoixtmldzsjz<[10-24] +1> uoq<[25-28]
> +1> HANGUL<[28-28] +1> o<[29-30] +1> HANGUL<[31-31] +1> VulliPHsZzn<[32-43]
> +1>
> 2> stage 1: lk<[1-3] +1> 850000<[1-3] +0> p<[6-7] +1> 700000<[6-7] +0>
> ngtoixtmldzsjz<[10-24] +1> 653543<[10-24] +0> uoq<[25-28] +1> 050000<[25-28]
> +0> HANGUL<[28-28] +1> 565800<[28-28] +0> o<[29-30] +1> 000000<[29-30] +0>
> HANGUL<[31-31] +1> 565800<[31-31] +0> VulliPHsZzn<[32-43] +1> 787460<[32-43]
> +0>
> 2> stage 2: ngtoixtmldzsjz 653543<[10-24] +0> 653543<[10-24] +1> 653543
> uoq<[10-28] +0> uoq<[25-28] +1> uoq 050000<[25-28] +0> 050000<[25-28] +1>
> 050000 HANGUL<[25-28] +0> HANGUL<[28-28] +1> HANGUL 565800<[28-28] +0>
> 565800<[28-28] +1> 565800 o<[28-30] +0> o<[29-30] +1> o 000000<[29-30] +0>
> 000000<[29-30] +1> 000000 HANGUL<[29-31] +0> HANGUL<[31-31] +1> HANGUL
> 565800<[31-31] +0> 565800<[31-31] +1> 565800 VulliPHsZzn<[31-43] +0>
> VulliPHsZzn<[32-43] +1>
> 2> last stage: ngtoixtmldzsjz<[10-24] +1> ngtoixtmldzsjz 653543<[10-24] +0>
> 653543<[10-24] +1> 653543 uoq<[10-28] +0> uoq<[25-28] +1> uoq 050000<[25-28]
> +1> 050000<[25-28] +1> 050000 HANGUL<[25-28] +1> HANGUL<[28-28] +1> HANGUL
> 565800<[28-28] +0> 565800<[28-28] +1> 565800 o<[28-30] +0> o<[29-30] +1> o
> 000000<[29-30] +0> 000000<[29-30] +1> 000000 HANGUL<[29-31] +0>
> HANGUL<[31-31] +1> HANGUL 565800<[31-31] +1> 565800<[31-31] +1> 565800
> VulliPHsZzn<[31-43] +0>
> 2> TEST FAIL: useCharFilter=true text='[lk[-.p|) ngtoixtmldzsjz uoqao
> aVulliPHsZzn wxsk'
> 2> Exception from random analyzer:
> 2> charfilters=
> 2> org.apache.lucene.analysis.pattern.PatternReplaceCharFilter(a,
> <HANGUL>, java.io.StringReader@5b3b54eb)
> 2> tokenizer=
> 2>
> org.apache.lucene.analysis.classic.ClassicTokenizer(org.apache.lucene.util.AttributeFactory$1@e29311e9)
> 2> filters=
> 2>
> org.apache.lucene.analysis.phonetic.DaitchMokotoffSoundexFilter(ValidatingTokenFilter@32a6de77
>
> term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,
> true)
> 2>
> org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@3d044414
>
> term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,
> q)
> 2>
> Conditional:org.apache.lucene.analysis.ja.JapaneseCompletionFilter(OneTimeWrapper@435207ec
>
> term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,reading=null,reading
> (en)=null,pronunciation=null,pronunciation (en)=null, INDEX)
>
> java.lang.IllegalStateException: last stage: inconsistent endOffset
> at pos=19: 31 vs 43; token=565800 VulliPHsZzn
> {noformat}
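
The exception in the log ("inconsistent endOffset at pos=19: 31 vs 43") says that two tokens ending at the same position reported different end offsets. A minimal, self-contained Java sketch of that kind of consistency check follows. It is not Lucene's actual validation code: the class name OffsetConsistencyCheck, the Token holder, and the sample tokens are hypothetical and exist only to illustrate the rule. It assumes positions are tracked by summing position increments and that a token ends at its start position plus its position length.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this class is not part of Lucene.
public class OffsetConsistencyCheck {

  /** Minimal stand-in for a token: term, offsets, position increment, position length. */
  static final class Token {
    final String term;
    final int startOffset;
    final int endOffset;
    final int posInc;
    final int posLen;

    Token(String term, int startOffset, int endOffset, int posInc, int posLen) {
      this.term = term;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
      this.posInc = posInc;
      this.posLen = posLen;
    }
  }

  /**
   * Returns a description of the first inconsistency, or null if there is none.
   * Rule being sketched: all tokens starting at one position must share a start
   * offset, and all tokens ending at one position must share an end offset.
   */
  static String firstInconsistency(List<Token> tokens) {
    Map<Integer, Integer> startOffsetAtPos = new HashMap<>();
    Map<Integer, Integer> endOffsetAtPos = new HashMap<>();
    int pos = -1;
    for (Token t : tokens) {
      pos += t.posInc;             // position where this token starts
      int startPos = pos;
      int endPos = pos + t.posLen; // position where this token ends

      Integer oldStart = startOffsetAtPos.putIfAbsent(startPos, t.startOffset);
      if (oldStart != null && oldStart != t.startOffset) {
        return "inconsistent startOffset at pos=" + startPos + ": "
            + oldStart + " vs " + t.startOffset + "; token=" + t.term;
      }
      Integer oldEnd = endOffsetAtPos.putIfAbsent(endPos, t.endOffset);
      if (oldEnd != null && oldEnd != t.endOffset) {
        return "inconsistent endOffset at pos=" + endPos + ": "
            + oldEnd + " vs " + t.endOffset + "; token=" + t.term;
      }
    }
    return null;
  }

  /** Made-up stream for the text "a b": unigrams plus one shingle, offsets agree. */
  static List<Token> consistentExample() {
    return Arrays.asList(
        new Token("a", 0, 1, 1, 1),
        new Token("a b", 0, 3, 0, 2),
        new Token("b", 2, 3, 1, 1));
  }

  /** Same stream, but the last token reports an end offset of 5 instead of 3. */
  static List<Token> brokenExample() {
    return Arrays.asList(
        new Token("a", 0, 1, 1, 1),
        new Token("a b", 0, 3, 0, 2),
        new Token("b", 2, 5, 1, 1));
  }

  public static void main(String[] args) {
    System.out.println(firstInconsistency(consistentExample()));
    // prints: inconsistent endOffset at pos=2: 3 vs 5; token=b
    System.out.println(firstInconsistency(brokenExample()));
  }
}
```

In the broken stream, the shingle "a b" and the unigram "b" both end at position 2 but disagree on the end offset. That is the same shape of disagreement as the 31 vs 43 reported for the token "565800 VulliPHsZzn" in the log.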